On Tue, 2011-01-18 at 22:08 -0500, Chad Bailey wrote:
> I'd disagree that blocking IP's is easier...

Well, when the goal at the time was to stop the immediate load, blocking
the offending IPs would have been much quicker, and thus easier. The
robots.txt file is a different, broader solution.

> as there is no feasable
> way to know all IP addresses of current and future robots that crawl
> the web, especially with regard to Google not being the only one out
> there.

That is correct, but I was not being bothered by any other crawlers,
only Google's. As for Google's crawler IP addresses changing, yes, that
would be a problem, and I would have to build up a list of IPs for
Google's crawlers. I don't think that would be unreasonable to do. I
could start by grepping the logs to get an idea of past IPs used, then
look at the surrounding IP blocks and scan for others, since Google
publishes PTR records for the IPs that identify it as a crawler ;)
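As a rough sketch of what I have in mind, something like the Python
below could do it. The log path and log format are hypothetical (it
assumes an Apache combined-format log, where the client IP leads each
line); the PTR-then-forward-confirm check is the verification method
Google documents for identifying genuine Googlebot traffic:

    import re
    import socket

    LOG = "/var/log/apache2/access.log"  # hypothetical log path

    def is_google_crawler(ip):
        # Reverse-resolve the IP, check the PTR name, then
        # forward-confirm that the name resolves back to the same IP.
        try:
            name = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not name.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            return socket.gethostbyname(name) == ip
        except socket.gaierror:
            return False

    # Collect candidate IPs from log lines claiming to be Googlebot.
    seen = set()
    with open(LOG) as log:
        for line in log:
            if "Googlebot" not in line:
                continue
            match = re.match(r"(\d{1,3}(?:\.\d{1,3}){3})", line)
            if match:
                seen.add(match.group(1))

    for ip in sorted(seen):
        print(ip, "google" if is_google_crawler(ip) else "impostor")

The verified IPs would then seed the block list, and the same check
could be rerun periodically to catch new addresses.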

Keep in mind I only wanted to stop the unwanted load from Google's
crawlers. The robots.txt file now affects every crawler and every
search engine, which is really not ideal, for a variety of reasons.

At some point I and/or the group will need to revisit the robots.txt
file and dial it in so we can expose some content to search engines
again. It just sucks that, since Google's crawler is not wiki friendly,
other crawlers must suffer and pay the price as well.
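When we do revisit it, something roughly like the following might be a
starting point. This is a hypothetical sketch, assuming a
MediaWiki-style URL layout where normal page views use /wiki/ pretty
URLs and the expensive action URLs (edit, history, diff) all go through
/index.php:

    User-agent: *
    Disallow: /index.php

That would keep crawlers out of the load-generating action URLs while
leaving ordinary page views indexable.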

The IP-blocking approach would have been specific to Google. We would
not have had to worry about allowing certain pages/areas of the wiki
while denying others, etc. All of this is because of Google, yet I
can't easily come up with a Google-specific solution using the
robots.txt file.
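For completeness, the firewall side would have been trivial once the
IPs were known. A hypothetical example, assuming iptables and using
66.249.64.0/19, one netblock Googlebot has historically crawled from
(the actual list would need to be built from our logs, as described
above):

    iptables -A INPUT -s 66.249.64.0/19 -j DROP

One rule per netblock, and nothing else on the wiki would have needed
to change.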

Not to mention it took several hours for Google's crawlers to discover
the robots.txt file, though I did find out it had been looked for
previously, about 8 hours before I created the one that exists now. I
still do not see the current setup as simple, elegant, or friendly to
other crawlers/search engines. It is more of a band-aid, a temporary
fix, than any sort of real solution.


-- 
William L. Thomson Jr.
Systems Administrator
Jacksonville Linux Users Group


---------------------------------------------------------------------
Archive      http://marc.info/?l=jaxlug-list&r=1&w=2
RSS Feed     http://www.mail-archive.com/[email protected]/maillist.xml
Unsubscribe  [email protected]
