Thanks Raymond. So, as per your mail, I should use Nutch for both crawling and indexing, right? And for pre-processing the input before indexing, I have to write a plugin that does the job of the custom analyzer I was using earlier with Lucene, right? How easy/difficult is writing this analyzer as a plugin for Nutch? I must say I'm an average Java programmer. Let me know your views.
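Just so the Lucene side is concrete, the kind of custom analyzer I mean is roughly along these lines (a simplified sketch against the Lucene 2.4 API; the class name and the exact filter chain are placeholders rather than my real code):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch of an analyzer for pages mixing English and Indian-language text.
public class MixedContentAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // StandardTokenizer breaks both Latin and non-Latin scripts into tokens.
        TokenStream stream = new StandardTokenizer(reader);
        stream = new StandardFilter(stream);
        stream = new LowerCaseFilter(stream);
        // English stop words are removed; non-English tokens simply never match the list.
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
        // Porter stemming only really affects the Latin-script (English) tokens.
        return new PorterStemFilter(stream);
    }
}

Would the right approach be to re-package a chain like this as a Nutch indexing plugin, or can Nutch hand me the parsed text so that the Lucene-side indexing stays as it is?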
Thanks,
KK

On Sat, Jun 6, 2009 at 2:03 PM, Raymond Balmès <[email protected]> wrote:

> I had started with your approach initially, i.e. building my indexing on
> Lucene only... but eventually dropped it completely. I'm working out of
> Nutch with a custom indexing plug-in only, and I'm quite happy with it. The
> only downside I found is that the Nutch search bean does not offer as much
> functionality as I needed, so I had to build my own plug-in there too.
> Later I decided to build a scoring plug-in as well to focus the crawl: it
> works great.
>
> I don't really see the need for Hadoop either right now, but I like the
> idea that it will be there if I need it, because my crawls might become
> quite big/long.
>
> I'm not sure if I will move to Solr indexing in the future; I'm avoiding it
> at the moment to minimize complexity.
>
> -Raymond-
>
>
> 2009/6/6 KK <[email protected]>
>
> > Hi All,
> > I've been using Solr and Lucene for some time. I started with Solr and
> > then moved to Lucene because of the greater flexibility/openness of
> > Lucene, but I like both. For my requirement I want to crawl webpages and
> > add them to a Lucene index. So far I've been doing the crawling manually
> > and adding the pages to the Lucene index through the Lucene APIs. The
> > webpages contain a mix of, say, 5% English and the rest non-English
> > [Indian] content. To handle stemming/stop-word removal for the English
> > part, I wrote a small custom analyzer for use in Lucene, and that's
> > working fairly well. Now I'm thinking of doing the crawling part using
> > Nutch. Does this sound OK? I went through the Nutch wiki and found that
> > it supports a bunch of file types [like html/xml, pdf, odf, ppt, ms word
> > etc.], but for me html is good enough. The wiki also says that it builds
> > distributed indexes using Hadoop [I've used Hadoop a bit], which uses the
> > map-reduce architecture. But I don't need that much for my requirement.
> > Distributed indexing is not required, so essentially I don't need the
> > Hadoop/map-reduce stuff. So let me summarize what I want:
> > #. Crawl the webpages; I want Nutch to hand the content over to me, not
> > to post it directly to Lucene by itself. Essentially I want to intervene
> > between crawling and indexing, as I have to use my custom analyzer before
> > the contents are indexed by Lucene.
> > #. For me html parsing is good enough [no need for pdf/odf/msword etc.].
> > #. No need for Hadoop/map-reduce.
> >
> > I'd like the users of Nutch to let me know their views. The other option
> > is to look for Java open-source crawlers that can do the job, but I can't
> > find any, and I'm more interested in using something really good and well
> > tested like Nutch. Let me know your opinions.
> >
> > Thanks,
> > KK.
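P.S. In case it helps to see what I meant by crawling manually and indexing through the Lucene APIs, the current flow is roughly the following (again a simplified sketch; the index path, field names and the fetchAndStripHtml() helper are hypothetical placeholders, not my real code):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ManualIndexer {
    public static void main(String[] args) throws Exception {
        // The custom analyzer sketched earlier in this mail.
        Analyzer analyzer = new MixedContentAnalyzer();
        Directory dir = FSDirectory.getDirectory("/path/to/index");   // placeholder path
        IndexWriter writer =
            new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

        // Today the page text comes from my own hand-rolled fetch/cleanup step;
        // this is the part I would like Nutch to take over.
        String url = "http://example.com/some-page";                  // placeholder
        String pageText = fetchAndStripHtml(url);                     // hypothetical helper

        Document doc = new Document();
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("content", pageText, Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        writer.optimize();
        writer.close();
    }

    // Hypothetical stand-in for the manual crawl/cleanup step.
    private static String fetchAndStripHtml(String url) {
        return "...";
    }
}

The fetchAndStripHtml() step is exactly the part I'd like Nutch to handle, while the analyzer and IndexWriter part stays unchanged.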
