Hi.
I am building (yet another) crawler, parsing the crawled HTML files and
indexing them with Lucene. Then it struck me: Stupido! Why am I not using
Nutch instead?
My use case is something like this:
100-1000 domains with an average depth of, I think, 3 to 5. If I miss some
pages it is not the end of the world, so a tradeoff between depth and crawl
speed is acceptable.
All URLs must be crawled at least once a day, driven by a cron job.
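Something like this crontab entry is what I have in mind (the script name,
paths, and schedule are just illustrative, not an existing Nutch tool):

```shell
# Hypothetical crontab line: run a recrawl wrapper every night at 02:00.
# /opt/nutch/bin/recrawl.sh and the paths are assumptions for this sketch.
0 2 * * * /opt/nutch/bin/recrawl.sh /data/crawl >> /var/log/nutch-recrawl.log 2>&1
```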
I would like to have one Lucene directory which is optimized after each
reindexing, rather than one directory per crawl, so I need to create
something like the recrawl script published on the wiki.
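Roughly, I picture folding each crawl's index into a single master
directory and optimizing it, along these lines (Lucene 2.x API; all paths
are made up for illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeCrawlIndex {
    public static void main(String[] args) throws Exception {
        // Open the existing master index (false = do not recreate it).
        IndexWriter writer = new IndexWriter("/data/index/master",
                new StandardAnalyzer(), false);

        // Fold the latest crawl's index into the master directory.
        Directory[] newIndexes = new Directory[] {
            FSDirectory.getDirectory("/data/crawl/latest/index")
        };
        writer.addIndexes(newIndexes);

        // One optimized directory after each reindexing, as desired.
        writer.optimize();
        writer.close();
    }
}
```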
I would prefer to search the content myself by creating an IndexSearcher.
This is because I already index a whole lot of RSS feeds, so I would like to
do a "multi-index" search, which I think will be hard to do without rolling
it yourself.
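For the multi-index part I was thinking of Lucene's MultiSearcher over the
Nutch index and my RSS index, something like this sketch (Lucene 2.x API;
the index paths and the "content" field name are assumptions):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class MultiIndexSearch {
    public static void main(String[] args) throws Exception {
        // One searcher per index: the crawl index and the RSS-feed index.
        // Both paths are made up for this example.
        Searchable[] searchables = new Searchable[] {
            new IndexSearcher("/data/index/crawl"),
            new IndexSearcher("/data/index/rss")
        };
        MultiSearcher searcher = new MultiSearcher(searchables);

        // Parse a query against the assumed "content" field and search
        // both indexes as if they were one.
        Query query = new QueryParser("content", new StandardAnalyzer())
                .parse("nutch");
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hits");

        searcher.close();
    }
}
```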
I noticed the WAR file, but I would prefer to create the templates myself.
Does anyone have a good pattern for this?
Kindly
//Marcus Herou
--
Marcus Herou Solution Architect & Core Java developer Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general