Hi Ken, thanks! :)

The problem with this approach is that it exposes very limited content to 
bots/web search engines.

Take http://search-lucene.com/ for example.  People enter all kinds of queries 
in web search engines and end up on that site.  People who visit the site 
directly don't necessarily search for those same things.  Plus, new search 
terms lead people to search-lucene.com every day, so keeping up with them would 
mean constantly generating more and more of those static pages.  Basically, the 
tail is super long.  On top of that, new content is constantly being generated, 
so one would also have to constantly add and update those static pages.

I have a feeling there is no good solution for this: on the one hand, people 
don't like the negative side effects of bots; on the other hand, people want as 
much of their sites as possible indexed by the big guys.  The only 
half-solution that comes to mind involves looking at who's actually crawling 
you and who's bringing you visitors, then blocking the bots with a bad ratio of 
those two - bots that crawl a lot but don't bring a lot of value.  Something 
like the sketch below.
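
For what it's worth, here is a very rough sketch of that idea in Python - just 
an illustration, not something I've run in production.  It assumes a 
combined-format access log; the bot/engine mapping, the "access.log" file name, 
and the cutoff are all made up:

#!/usr/bin/env python
# Rough sketch: flag bots whose crawl volume far outweighs the visitor
# traffic their search engine sends back.  Assumes a combined-format
# access log; the bot/engine mapping and the cutoff are made up.

import re
from collections import defaultdict

# Hypothetical mapping: crawler User-Agent substring -> referrer domain
# of the search engine that crawler feeds.
BOT_TO_ENGINE = {
    "Googlebot": "google.",
    "bingbot": "bing.",
    "YandexBot": "yandex.",
}

MAX_CRAWLS_PER_REFERRAL = 50  # arbitrary cutoff; tune against your own logs

crawls = defaultdict(int)     # requests made by each bot
referrals = defaultdict(int)  # visits that bot's engine sent us

# Combined log format ends with: "referer" "user-agent"
line_re = re.compile(r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$')

for line in open("access.log"):
    m = line_re.search(line)
    if not m:
        continue
    referer, agent = m.group("referer"), m.group("agent")
    for bot, engine in BOT_TO_ENGINE.items():
        if bot in agent:
            crawls[bot] += 1
        elif engine in referer:
            referrals[bot] += 1

for bot, count in crawls.items():
    ratio = count / float(referrals[bot] or 1)
    if ratio > MAX_CRAWLS_PER_REFERRAL:
        print("%s: %d crawls, %d referrals - candidate for blocking"
              % (bot, count, referrals[bot]))

From there the offenders could get a Crawl-delay or Disallow in robots.txt, or 
be blocked at the load balancer, while the well-behaved bots keep crawling.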

Any other ideas?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Ken Krugler <kkrugler_li...@transpac.com>
> To: solr-user@lucene.apache.org
> Sent: Mon, January 10, 2011 9:43:49 AM
> Subject: Re: How to let crawlers in, but prevent their damage?
> 
> Hi Otis,
> 
> From what I learned at Krugle, the approach that worked for us was:
> 
> 1. Block all bots on the search page.
> 
> 2. Expose the target content via statically linked pages that are separately 
> generated from the same backing store, and optimized for target search terms 
> (extracted from your own search logs).
> 
> -- Ken
> 
> On Jan 10, 2011, at 5:41am, Otis Gospodnetic wrote:
> 
> > Hi,
> > 
> > How do people with public search services deal with bots/crawlers?
> > And I don't mean to ask how one bans them (robots.txt) or slow them down 
> > (Delay stuff in robots.txt) or prevent them from digging too deep in search 
> > results...
> > 
> > What I mean is that when you have publicly exposed search that bots crawl, 
> > they issue all kinds of crazy "queries" that result in errors, that add 
> > noise to Solr caches, increase Solr cache evictions, etc. etc.
> > 
> > Are there some known recipes for dealing with them, minimizing their 
> > negative side-effects, while still letting them crawl you?
> > 
> > Thanks,
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> > 
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
> 
