Hi All,
I've been using Solr and Lucene for some time. I started with Solr and then
moved to Lucene because of its greater flexibility/openness, but I like both.
My requirement is to crawl webpages and add them to a Lucene index. So far
I've been doing the crawling manually and adding the pages to the Lucene
index through the Lucene APIs. The webpages contain a mix of roughly 5%
English and the rest non-English [Indian-language] content. To handle
stemming/stop-word removal for the English part, I wrote a small custom
analyzer for use in Lucene, and that's working fairly well.
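Roughly, the analyzer does something like the sketch below (written against
a Lucene 3.0-style API; the exact filter chain is only an illustration of
the idea, not my actual code):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class MixedContentAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // StandardTokenizer handles both the English and the non-English terms
    TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
    stream = new LowerCaseFilter(stream);
    // English stop words only; Indian-language tokens pass through unchanged
    stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    // Porter stemming only really affects the English terms
    stream = new PorterStemFilter(stream);
    return stream;
  }
}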
crawling part using Nutch. Does this sound OK. I went through the nutch wiki
page and found that it supports a bunch of file types[like html/xml, pdf,
odf, ppt, ms word etc ] but for me html is good enough. Also the wiki says
that it builds distributed indexes using Hadoop[I've used Hadoop a bit] that
uses teh map-reduce architecture. But for my requirement I dont need that
much of things. Distributed inexing is not required, so essentially I dont
need hadoop/map-reduce stuffs. So let me summarize things I want
#. Crawl the webpages and have Nutch hand the content over to me; I don't
want it to post that content to Lucene directly by itself. Essentially I
want to step in between crawling and indexing, since I have to apply my
custom analyzer before the content is indexed by Lucene [see the sketch
after this list].
#. For me HTML parsing is good enough [no need for PDF/ODF/MS Word, etc.].
#. No need for Hadoop/map-reduce.
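To make #1 concrete, what I have in mind is reading the parsed text back out
of the Nutch segments myself and feeding it to my own IndexWriter, roughly
like this. This assumes the Nutch 1.x segment layout (parse_text stored as
Hadoop map files) and Lucene 3.0-style APIs; the paths and field names are
just placeholders, and MixedContentAnalyzer is the analyzer sketched above.

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentIndexer {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // Index with the custom analyzer instead of letting Nutch post to Lucene
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("my-index")),   // placeholder index path
        new MixedContentAnalyzer(),
        true,
        IndexWriter.MaxFieldLength.UNLIMITED);

    // parse_text/<part>/data is a SequenceFile of (url, ParseText) pairs
    Path data = new Path(
        "crawl/segments/20090101000000/parse_text/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    ParseText parseText = new ParseText();
    while (reader.next(url, parseText)) {
      Document doc = new Document();
      doc.add(new Field("url", url.toString(),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("content", parseText.getText(),
                        Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
    reader.close();
    writer.close();
  }
}

Is this a reasonable way to hook in, or is there a cleaner extension point
in Nutch for it?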

I'd like Nutch users to let me know their views. The other option is to look
for a Java open-source crawler that can do the job, but I haven't found any,
and I'm more interested in using something really good and well tested like
Nutch. Let me know your opinions.

Thanks,
KK.
