On Thu, Jan 03, 2008 at 01:07:15PM -0800, Don Armstrong wrote:
> There are already mirrors which allow indexing, and you can use the
> BTS's own search engine which is far superior to google [...]

Uh, you're kidding, right? As a trivial example, the BTS's own search
engine won't turn up hits outside the BTS...

On Fri, Jan 04, 2008 at 08:49:08AM +0100, Raphael Hertzog wrote:
> Most of the content is generated dynamically nowadays and this file has
> been put in place because web crawlers have been known to severely hit the
> machine hosting the BTS...

AFAIK it was put in place when we first went dynamic, when bugs.d.o was
on master and horribly overloaded (so much so that updating the static
pages was taking over half a day).

Ultimately it hasn't been removed because the CGIs provide too many
similar URLs that shouldn't all be indexed; it's definitely a bug that
we don't provide some URLs that can be indexed.

Hacking around that in robots.txt seems tricky, as the original standard
only lets you reliably specify Disallow: prefixes. Google additionally
supports "*" wildcards, "$" to anchor against the end of the URL, and
Allow: fields, and at least "*" seems somewhat common, so something like
this could work:

        Disallow: /*/       # exclude everything but the shortcuts
        Allow: /cgi-bin/bugreport.cgi?bug=
        Allow: /cgi-bin/pkgreport.cgi?pkg=*;dist=unstable$

That doesn't prevent bug=1234;reverse=yes and such, but I can't see a good
way of doing that.
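
For anyone curious how those rules interact, here's a rough, untested
Python sketch of the matching Google describes (the longest-match rule
and "Allow wins ties" behaviour are my reading of their documentation,
not something verified against the BTS; the helper names are just for
illustration):

    import re

    # Sketch of the wildcard matching Google documents: "*" matches any
    # run of characters, a trailing "$" anchors at the end of the URL,
    # and the longest matching rule wins (Allow beats Disallow on a tie).
    rules = [
        ("Disallow", "/*/"),
        ("Allow", "/cgi-bin/bugreport.cgi?bug="),
        ("Allow", "/cgi-bin/pkgreport.cgi?pkg=*;dist=unstable$"),
    ]

    def to_regex(pattern):
        anchored = pattern.endswith("$")
        body = pattern[:-1] if anchored else pattern
        rx = "".join(".*" if c == "*" else re.escape(c) for c in body)
        return re.compile(rx + ("$" if anchored else ""))

    def allowed(url):
        hits = [(len(p), kind == "Allow")
                for kind, p in rules if to_regex(p).match(url)]
        return max(hits)[1] if hits else True   # no match => crawlable

    print(allowed("/cgi-bin/bugreport.cgi?bug=1234"))               # True
    print(allowed("/cgi-bin/bugreport.cgi?bug=1234;reverse=yes"))   # True, alas
    print(allowed("/cgi-bin/pkgreport.cgi?pkg=dpkg;dist=unstable")) # True
    print(allowed("/cgi-bin/pkgreport.cgi?pkg=dpkg;dist=stable"))   # False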

I've set that up on rietz for Googlebot; we'll see if it works OK. I
don't think it's possible to make "Disallow: /*/" the default for
all User-Agents, since "*" is an extension, but extending it to MSN and
Yahoo should be fine.
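
For reference, the layout I'm aiming for looks roughly like the
following (the msnbot and Slurp user-agent strings and the catch-all
section are from memory rather than copied off rietz, so treat it as a
sketch):

        User-agent: Googlebot
        User-agent: msnbot
        User-agent: Slurp
        Disallow: /*/
        Allow: /cgi-bin/bugreport.cgi?bug=
        Allow: /cgi-bin/pkgreport.cgi?pkg=*;dist=unstable$

        # crawlers that only understand plain prefixes keep the old
        # blanket rule
        User-agent: *
        Disallow: /

Crawlers are meant to obey only the most specific User-agent group that
matches them, so the wildcard rules shouldn't confuse bots that only
implement the original syntax.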

Getting smarturl.cgi properly done is still probably the real solution.

Cheers,
aj
