Hi Marcus,
> Hi.
>
> I am building (yet another) crawler, parsing and indexing the crawled HTML
> files with Lucene. Then I came to think about it. Stupido! Why aren't you
> using Nutch instead?
>
> My use case is something like this.
>
> 100-1000 domains with an average depth of 3 to 5, I think. If I miss some
> pages it is not the end of the world, so a tradeoff between depth and crawl
> speed is acceptable.
> All URLs must be crawled at least once a day, driven by cron.
>
> I would like to have one Lucene dir which is optimized after each
> reindexing, not one dir per crawl, so I need to create something like the
> recrawl script published on the Wiki.
>   
Not sure I understand: since you have to re-crawl the whole content daily 
anyway, why not simply throw away the old index once you have successfully 
created the new one?
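One simple way to do that without disturbing live searches is to build the new index in a separate directory and then rename it over the old one. A minimal sketch in plain Java (the directory names are assumptions; adapt to your crontabbed setup):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.stream.Stream;

public class IndexSwap {

    /**
     * Replace the live index directory with a freshly built one.
     * The rename is atomic on a POSIX filesystem, so searchers never
     * see a half-written index; the ".old" directory can be removed
     * once all open IndexSearchers have been reopened.
     */
    public static void swap(Path newIndex, Path liveIndex) throws IOException {
        Path old = liveIndex.resolveSibling(liveIndex.getFileName() + ".old");
        deleteRecursively(old); // clear leftovers from a previous run
        if (Files.exists(liveIndex)) {
            Files.move(liveIndex, old, StandardCopyOption.ATOMIC_MOVE);
        }
        Files.move(newIndex, liveIndex, StandardCopyOption.ATOMIC_MOVE);
    }

    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) return;
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try { Files.delete(p); } catch (IOException e) { throw new RuntimeException(e); }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo with temp directories standing in for the real index paths.
        Path base = Files.createTempDirectory("swapdemo");
        Path live = base.resolve("index");
        Path fresh = base.resolve("index.new");
        Files.createDirectories(fresh);
        Files.write(fresh.resolve("segments"), "new".getBytes());
        swap(fresh, live);
        System.out.println(Files.exists(live.resolve("segments"))); // prints "true"
    }
}
```

Run the indexing job into `index.new` from cron, then call the swap as the last step; yesterday's index is never touched until the new one is complete.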
> I would prefer to search the content myself by creating an IndexSearcher;
> this is because I already index a whole lot of RSS feeds, so I would like
> to do a "MultiIndex" search, and I think that will be hard to do without
> doing it yourself.
>   
Or you could index the feeds with Nutch, too. There's a plugin for RSS...
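Either way, searching the crawl index and the feed index together is straightforward at the Lucene level. A minimal sketch using the Lucene 2.x-era MultiSearcher API (the index paths and field name are assumptions, not anything from your setup):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;

public class MultiIndexSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical paths: the Nutch crawl index and a separate RSS index.
        IndexSearcher crawl = new IndexSearcher("crawl/index");
        IndexSearcher feeds = new IndexSearcher("rss/index");

        // One searcher over both indexes; hits are merged across them.
        MultiSearcher all = new MultiSearcher(new Searchable[] { crawl, feeds });
        Hits hits = all.search(new TermQuery(new Term("content", "nutch")));
        System.out.println(hits.length() + " hits");
        all.close();
    }
}
```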
> I noticed the WAR file but I would prefer to create the templates myself.
>   
Actually, the WAR is just a starter; you will have to implement your own 
layout in the JSPs anyway.

HTH,
Renaud

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
