On Mon, 2011-01-17 at 07:17 -0500, William L. Thomson Jr. wrote:
> On Mon, 2011-01-17 at 03:34 -0500, Tom Allen wrote:
> > Don't block Google IPs. Just use robots.txt for the Google bot; it
> > will respect it. You could even use that to block some pages if you
> > want.
> 
> The problem comes down to time. Blocking IPs is quick and easy. Per
> the provided examples, creating a robots.txt file for the wiki is not
> so easy. It also really seems like one should ship with MediaWiki,
> since this must be a common problem. Not to mention Google's crawler
> should have been adapted for wikis by now, surely for MediaWiki, which
> is used all over the place.
> 
> Kinda crazy that each site has to go out and do the work on its own.
> If it were a small or standard robots.txt file like the ones I am
> accustomed to working with, no problem. Instead I have to dig through
> logs, monitor what pages and queries Google is sending over, then come
> up with robots.txt rules to prevent that, with plenty of trial and
> error across all the various ways to hit the wiki. Seems like a
> massive waste of time IMHO.
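
On the digging-in-logs point above, for what it is worth: pulling
Googlebot's requests out of the access log is close to a one-liner.
A sketch, assuming Apache's combined log format (the request path is
field 7) and a guessed log location:

    # Most-requested paths/queries from Googlebot, top 20.
    grep -F 'Googlebot' /var/log/apache2/access.log \
      | awk '{print $7}' \
      | sort | uniq -c | sort -rn | head -20

That shows which pages and query strings it hammers the most, which is
exactly what the Disallow rules need to cover.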

Went with this for now; we will see if it stops Google anytime soon.
From there we can work on creating a better robots.txt file for the
wiki. Contributions are welcome, as this is not something I care to
spend much, if any, time on, thanks! :)

http://www.jaxlug.org/robots.txt
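
For comparison, the MediaWiki manual's own fix boils down to something
like this. A sketch, assuming the common short-URL layout where
articles live under /wiki/ and every edit/history/diff action goes
through /w/index.php; a wiki without short URLs needs different paths:

    # Keep all crawlers out of the script directory; plain /wiki/
    # page views stay indexable.
    User-agent: *
    Disallow: /w/

One Disallow on /w/ covers every index.php action URL at once, instead
of enumerating the query strings by hand.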

Though blocking IPs is not totally out of the question either:
http://www.mediawiki.org/wiki/Manual:Robots.txt#Problems
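
If it comes to that, matching the User-Agent at the web server is
probably saner than chasing Google's IP ranges, which change. A sketch
for Apache 2.2 with mod_setenvif; the directory path here is made up,
adjust it to the actual wiki root:

    # Tag requests identifying as Googlebot, then refuse them.
    SetEnvIfNoCase User-Agent "Googlebot" block_bot
    <Directory "/var/www/wiki">
        Order Allow,Deny
        Allow from all
        Deny from env=block_bot
    </Directory>

This only stops crawlers that identify themselves honestly, but it
survives Google adding new crawler IPs.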

-- 
William L. Thomson Jr.
Systems Administrator
Jacksonville Linux Users Group

