Hi all.  First off, I'm using Nutch 0.7.2.

 I've been playing with nutch for a couple weeks now, and have some
questions relating to indexing blog sites.

 Many blog platforms post a changes.xml file on some schedule (
blogger.com/changes10.xml is updated every 10 minutes) that lists the blogs
updated in the last interval.  Others have an Atom stream... either way, the
URLs you need to index are included, and there are -always- new URLs to
crawl.  Since I know which ones were updated, I don't want Nutch
automatically recrawling them when some time period expires (like "crawl
again in 30 days").
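
 In case it's useful, here's a rough sketch of the extraction step in
shell.  The <weblog url="..."/> attribute layout is my assumption about the
weblogs.com-style format -- verify against a real changes10.xml before
relying on it:

```shell
# Sketch: pull blog URLs out of a weblogs.com-style changes file.
# The sample file below is made up; the real changes10.xml may differ.
cat > changes10.xml <<'EOF'
<?xml version="1.0"?>
<weblogUpdates version="1">
  <weblog name="Example Blog" url="http://example.blogspot.com/" when="120"/>
  <weblog name="Another Blog" url="http://another.blogspot.com/" when="300"/>
</weblogUpdates>
EOF

# Grab each url="..." attribute and strip the wrapper, one URL per line.
grep -o 'url="[^"]*"' changes10.xml | sed 's/url="//;s/"//' > updated-urls.txt
cat updated-urls.txt
```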

 Nutch seems to be designed to be given a few seed URLs, which it injects
into its DB, crawls, extracts new links from, and then crawls those too...
previously crawled sites are recrawled automatically once the time since
the last crawl hits some predefined number (30 days by default).  ie:
perfectly normal search engine behavior.

 For blogs... I want it to crawl the injected URLs, and none of the links
on the page.  I did this (I think!) by setting db.max.outlinks.per.page to
zero.  I want it to ONLY crawl the newly injected URLs (I did this by
setting urlfilter.prefix.file to the name of my file that has the list of
updated blog URLs).
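
 For concreteness, here's roughly what those two overrides look like in my
nutch-site.xml (the filter filename is just an example):

```xml
<!-- Inside the <nutch-conf> root of nutch-site.xml (Nutch 0.7 layout). -->
<property>
  <name>db.max.outlinks.per.page</name>
  <!-- don't queue any outlinks from fetched pages -->
  <value>0</value>
</property>
<property>
  <name>urlfilter.prefix.file</name>
  <!-- example filename: the plain-text list of updated blog URLs -->
  <value>updated-urls.txt</value>
</property>
```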

 I'm not sure this setup will ensure that, when 30 days rolls around, nutch
doesn't start automatically throwing old URLs into newly generated segments
for a recrawl.

 For this test, I have this cycle: download changes10.xml, process it with
xsltproc into a plain-text list of URLs, inject that into the db (making
sure urlfilter.prefix.file is set to the file with this list of URLs), then
generate a new segment, fetch, and index.
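
 Spelled out as a script, the cycle looks roughly like this.  The paths,
the changes-to-urls.xsl stylesheet name, and the exact Nutch 0.7 argument
syntax are approximations from memory, not copy-paste exact -- this version
just prints the commands rather than running them:

```shell
#!/bin/sh
# Dry-run sketch of the 10-minute cycle; echoes the commands instead of
# executing them.  NUTCH_HOME, db/segment paths, and argument syntax are
# assumptions -- check 'bin/nutch' usage on your own install.
NUTCH=${NUTCH_HOME:-/opt/nutch}/bin/nutch
SEGMENT="segments/<newest>"   # the directory 'generate' just created

CYCLE="curl -s http://blogger.com/changes10.xml -o changes10.xml
xsltproc changes-to-urls.xsl changes10.xml > updated-urls.txt
$NUTCH inject db -urlfile updated-urls.txt
$NUTCH generate db segments
$NUTCH fetch $SEGMENT
$NUTCH index $SEGMENT"

echo "$CYCLE"
```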

 This results in a new segment every 10 minutes.  Every 30 minutes I run
'merge' to merge the segment indexes into crawl/index.

 Now first... anyone see any problems with this setup?

 Second... I end up with a perpetually growing list of segments, meaning
the 'merge' run takes longer and longer each time.  How do I fix this?

 Third...  just in general... I've had to goof with nutch's config enough
to make this work that it makes me want to ask whether using nutch for this
purpose is really the right path.  I know Technorati just uses lucene
directly for a similar purpose.  Should that be the path I take (HTMLParser
to fetch and extract text, plus a lucene setup with incremental indexes)?

Thanks for any help anyone can provide.

Chris



--
Chris Newton,
CTO Radian6, www.radian6.com
Phone: 506-452-9039
