On Jan 10, 2011, at 7:02am, Otis Gospodnetic wrote:
Hi Ken, thanks. :)
The problem with this approach is that it exposes very limited content to bots/web search engines.
Take http://search-lucene.com/ for example. People enter all kinds of queries in web search engines and end up on that site. People who visit the site directly don't necessarily search for those same things. Plus, new terms are entered to get to search-lucene.com every day, so keeping up with that would mean constantly generating more and more of those static pages. Basically, the tail is super long.
To clarify - the issue of using actual user search traffic is one of SEO, not what content you expose. If, for example, people commonly do a search for "java <something>", then that's a hint that the URL to the static content, and the page title, should have the language as part of it.
So you shouldn't be generating static pages based on search traffic, though you might want to decide what content to "favor" (see below) based on popularity.
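For instance, a quick sketch of shaping static URLs and titles from popular query terms (entirely hypothetical - the slugify() helper and the log file name are made up for illustration):

    # Hypothetical sketch: turn popular search-log queries into static
    # page URLs/titles, surfacing the leading term (e.g. the language)
    # in both, so "java memory leak" -> /java/memory-leak.html.
    import re
    from collections import Counter

    def slugify(text):
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    def page_for_query(query):
        first, _, rest = query.partition(" ")
        url = "/%s/%s.html" % (slugify(first), slugify(rest) or "index")
        title = "%s - %s" % (first.capitalize(), rest or "overview")
        return url, title

    # Rank raw queries from your own search logs by frequency first.
    queries = Counter(line.strip() for line in open("search-queries.log")
                      if line.strip())
    for query, count in queries.most_common(20):
        print(count, page_for_query(query))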
On top of that, new content is constantly being generated, so one would have to also constantly both add and update those static pages.
Yes, but that's why you need to automate that content generation, and do it on a regular (e.g. weekly) basis.
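E.g. with a weekly cron job (illustrative only - the script path and name are hypothetical):

    # Illustrative crontab entry: regenerate the static pages every
    # Monday at 03:00 from the backing store.
    0 3 * * 1  /usr/bin/python /opt/site/generate_static_pages.py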
The big challenges we ran into were:
1. Dealing with badly behaved bots that would hammer the site.
We wound up putting this content on a separate system, so it wouldn't impact users on the main system. We also generated a regular report by user agent & IP address, so that we could block by robots.txt and by IP when necessary (a minimal report sketch follows after point 2).
2. Figuring out how to structure the static content so that it didn't look like spam to Google/Yahoo/Bing.
You don't want too many links per page, or too much depth, but that constrains how many pages you can reasonably expose. We had project scores based on code, activity, and usage - so we used those to rank the content and expose the "good stuff" early, at low depth. You could do the same based on popularity, from search logs (see the second sketch below).
Anyway, there's a lot to this topic, but it doesn't feel very Solr-specific. So apologies for reducing the signal-to-noise ratio with talk about SEO :)
-- Ken
I have a feeling there is not a good solution for this, because on the one hand people don't like the negative bot side-effects, while on the other hand they want as much of their sites indexed by the big guys as possible. The only half-solution that comes to mind involves looking at who's actually crawling you and who's bringing you visitors, then blocking those with a bad ratio of the two - bots that crawl a lot but don't bring much value.
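A rough sketch of that ratio check (the bot list, log handling, and threshold are all just assumptions):

    # Sketch: compare pages crawled by each bot vs. visitors its search
    # engine refers back, then flag bots with a bad crawl/value ratio.
    from collections import Counter

    BOTS = {"googlebot": "google.", "bingbot": "bing.", "slurp": "yahoo."}
    crawl_hits = Counter()   # bot -> pages it crawled
    referrals = Counter()    # bot -> visitors its engine sent us

    with open("access.log") as log:
        for line in log:
            lower = line.lower()
            for bot, engine in BOTS.items():
                if bot in lower:          # crude user-agent match
                    crawl_hits[bot] += 1
                elif engine in lower:     # crude Referer match
                    referrals[bot] += 1

    for bot in BOTS:
        ratio = float(crawl_hits[bot]) / max(referrals[bot], 1)
        verdict = "consider blocking" if ratio > 100.0 else "ok"
        print(bot, crawl_hits[bot], referrals[bot], round(ratio, 1), verdict)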
Any other ideas?
Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
----- Original Message ----
From: Ken Krugler <kkrugler_li...@transpac.com>
To: solr-user@lucene.apache.org
Sent: Mon, January 10, 2011 9:43:49 AM
Subject: Re: How to let crawlers in, but prevent their damage?
Hi Otis,
From what I learned at Krugle, the approach that worked for us was:
1. Block all bots on the search page (see the robots.txt sketch below).
2. Expose the target content via statically linked pages that are separately generated from the same backing store, and optimized for target search terms (extracted from your own search logs).
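For (1), that can be as simple as a robots.txt along these lines (the /search path is illustrative):

    # Illustrative robots.txt: keep all crawlers off the live search
    # endpoint while leaving the static pages crawlable.
    User-agent: *
    Disallow: /search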
-- Ken
On Jan 10, 2011, at 5:41am, Otis Gospodnetic wrote:
Hi,
How do people with public search services deal with bots/crawlers? And I don't mean to ask how one bans them (robots.txt) or slows them down (the Crawl-Delay stuff in robots.txt) or prevents them from digging too deep in search results...
What I mean is that when you have publicly exposed search that bots crawl, they issue all kinds of crazy "queries" that result in errors, add noise to the Solr caches, increase Solr cache evictions, etc. etc.
Are there some known recipes for dealing with them, minimizing their negative side-effects, while still letting them crawl you?
Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g