----- Original Message ----

From: lee carroll <lee.a.carr...@googlemail.com>
To: solr-user@lucene.apache.org
Sent: Mon, January 10, 2011 6:48:12 AM
Subject: Re: How to let crawlers in, but prevent their damage?

Sorry not an answer but a +1 vote for finding out best practice for this.

Related to this are DoS attacks. We have rewrite rules in between the proxy
server and Solr which attempt to filter out undesirable stuff, but would it
be better to have a query app doing this?

Any standard rewrite rules which drop invalid or potentially malicious
queries would be very nice :-)
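For illustration, the sort of thing I mean (a minimal sketch, assuming Apache
httpd with mod_rewrite sitting in front of /solr; the handler path and the
start/rows cut-offs below are just made-up examples, not a standard):

  # Minimal sketch: Apache httpd + mod_rewrite in front of /solr.
  # Handler path and numeric limits are examples only.
  RewriteEngine On

  # Only expose the public select handler; refuse everything else under /solr
  RewriteCond %{REQUEST_URI} !^/solr/collection1/select
  RewriteRule ^/solr/ - [F]

  # Refuse absurdly deep paging (start of 10000 or more)
  RewriteCond %{QUERY_STRING} (^|&)start=[0-9]{5,}
  RewriteRule ^/solr/ - [F]

  # Refuse huge page sizes (rows of 1000 or more)
  RewriteCond %{QUERY_STRING} (^|&)rows=[0-9]{4,}
  RewriteRule ^/solr/ - [F]

Anything a crawler mangles badly enough to miss the select handler, or that
asks for silly paging, gets a 403 before it ever reaches Solr.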

What exactly are malicious queries (besides scraping)? What's the problem with
invalid queries? Unless someone is doing a custom crawl/scraping of your site,
how are they going to issue queries that aren't already on the site as URLs?

On 10 January 2011 13:41, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

> Hi,
>
> How do people with public search services deal with bots/crawlers?
> And I don't mean to ask how one bans them (robots.txt), slows them down
> (Crawl-delay in robots.txt), or prevents them from digging too deep in
> search results...
>
> What I mean is that when you have publicly exposed search that bots crawl,
> they issue all kinds of crazy "queries" that result in errors, add noise to
> Solr caches, increase Solr cache evictions, etc.
>
> Are there some known recipes for dealing with them, minimizing their
> negative side-effects, while still letting them crawl you?
>
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
