hi Marcus,

> Hi.
>
> I am building (yet another) crawler, parsing and indexing the crawled
> HTML files with Lucene. Then I came to think about it. Stupido! Why
> aren't you using Nutch instead!
>
> My use case is something like this.
>
> 100-1000 domains with an average depth of 3 to 5, I think. If I miss
> some pages it is not the end of the world, so a tradeoff between depth
> and crawl speed is acceptable.
> All URLs must be crawled at least once a day, via cron.
>
> I would like to have one Lucene dir which is optimized after each
> reindexing, not one dir per crawl, so I need to create something like
> the recrawl script which is published on the Wiki.

Not sure I understand: why don't you just throw away the old index once
you have successfully created the new one (since you have to re-crawl
the whole content daily)?

> I would prefer to search the content myself by creating an
> IndexSearcher, because I already index a whole lot of RSS feeds and
> would like to do a "MultiIndex" search; I think that will be hard to
> do without doing it yourself.

Or you could index the feeds with Nutch, too. There's a plugin for RSS.

> I noticed the WAR file but I would prefer to create the templates
> myself.

Actually, the WAR is just a starter; you will have to implement your
layout in the JSPs anyway.
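As for scheduling the daily re-crawl, a minimal crontab entry could look like the sketch below. The script path, crawl directory, and log file are hypothetical placeholders, assuming the Wiki recrawl script has been saved locally as an executable shell script:

```shell
# Hypothetical example: run a recrawl script (e.g. the one from the
# Nutch Wiki, saved locally) every night at 02:00. Adjust the paths to
# wherever Nutch and your crawl directory actually live.
0 2 * * * /usr/local/nutch/bin/recrawl.sh /usr/local/nutch/crawl >> /var/log/nutch-recrawl.log 2>&1
```

Redirecting stdout and stderr to a log file makes it easier to see afterwards whether the nightly run succeeded.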
HTH,
Renaud

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general