Hi All,

I've been using Solr and Lucene for some time. I started with Solr and then moved to Lucene because of the extra flexibility/openness Lucene gives, but I like both. For my requirement I want to crawl webpages and add them to a Lucene index. So far I've been doing the crawling manually and adding the pages to the Lucene index through the Lucene APIs. The webpages contain a mix of roughly 5% English and the rest non-English [Indian] content. To handle stemming/stop-word removal for the English part, I wrote a small custom analyzer for use in Lucene, and that's working fairly well (a rough sketch is below).

Now I was thinking of doing the crawling part using Nutch. Does this sound OK? I went through the Nutch wiki and found that it supports a bunch of file types [like HTML/XML, PDF, ODF, PPT, MS Word etc.], but for me HTML is good enough. The wiki also says that Nutch builds distributed indexes using Hadoop [I've used Hadoop a bit], which uses the map-reduce architecture. But my requirement doesn't call for all of that. Distributed indexing is not required, so essentially I don't need the Hadoop/map-reduce stuff.

So let me summarize what I want:
#. Crawl the webpages and have Nutch hand the content over to me; I don't want it to post the content to Lucene directly by itself. Essentially I want to step in between crawling and indexing, because I have to run my custom analyzer before the contents are indexed by Lucene.
#. For me HTML parsing is good enough [no need for PDF/ODF/MS Word etc.].
#. No need for Hadoop/map-reduce.
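To make the first point concrete, here is a rough sketch along the lines of what I'm doing today. It's written against the Lucene 2.4-style API (constructor signatures differ across versions); the class name MixedContentAnalyzer, the index path, the field names and the Porter stemmer are only placeholders/illustrations, and my real analyzer does a bit more, but the shape is the same:

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

// Custom analyzer: English tokens get lower-casing, stop-word removal and
// (here, Porter) stemming; the non-English [Indian] tokens pass through these
// filters mostly untouched since they don't match the English stop list or
// the stemmer's rules.
public class MixedContentAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
        stream = new PorterStemFilter(stream);
        return stream;
    }

    // This is the step I want to keep control of: whatever the crawler hands
    // me (URL + extracted HTML text) gets indexed through the analyzer above.
    public static void indexPage(String url, String pageText) throws IOException {
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/path/to/index"),   // placeholder path
                new MixedContentAnalyzer(),
                false,                                        // append to existing index
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("content", pageText, Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();
    }
}

So what I really need from Nutch is just the (url, pageText) pair for each crawled page, so that I can call something like indexPage() on it myself instead of letting Nutch/Hadoop build the index.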
I'd like the users of Nutch to let me know their views. The other option is to look for a Java open-source crawler that can do the job, but I haven't found one, and I'd rather use something really good and well tested like Nutch. Let me know your opinions.

Thanks,
KK.
