If you've done it for Lucene, it shouldn't be that difficult in Nutch; I'm no Java guru either. Look at the wiki, it's rather well explained. If you add fields, which I did, you need both an indexing and a query plug-in, but it is quite straightforward. My only real problem was finding out that field names have to be lowercase, because of the search bean.
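For what it's worth, here is a minimal sketch of the indexing side (assuming the Nutch 1.0 IndexingFilter extension point; exact method signatures vary a bit between releases, and the field name and metadata key below are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Hypothetical filter that copies a parse metadata value into an extra
// index field. Note the lowercase field name -- the search bean needs it.
public class SiteCategoryIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // "x-site-category" is a made-up metadata key; use whatever your parser sets.
    String category = parse.getData().getMeta("x-site-category");
    if (category != null) {
      doc.add("sitecategory", category);   // lowercase field name
    }
    return doc;
  }

  // Some releases also declare addIndexBackendOptions(Configuration) on the
  // interface; a no-op implementation like this keeps them happy.
  public void addIndexBackendOptions(Configuration conf) {
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

The query side is a second plug-in on the QueryFilter extension point (extending RawFieldQueryFilter is usually enough for a raw field), and both have to be declared in the plug-in's plugin.xml and added to plugin.includes in nutch-site.xml, as the wiki describes.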
-Ray-

2009/6/6 KK <[email protected]>

> Thanks Raymond.
> So, as per your mail, I should use Nutch for both crawling and indexing,
> right? And for pre-processing the input before indexing I have to write
> some plugin that will do the job of the custom analyzer I was using
> earlier with Lucene, right? How easy/difficult is writing this analyzer
> as a plugin for Nutch? I must say that I'm an average Java programmer.
> Let me know your views.
>
> Thanks,
> KK
>
> On Sat, Jun 6, 2009 at 2:03 PM, Raymond Balmès <[email protected]> wrote:
>
> > I had started with your approach initially, i.e. building my indexing
> > on Lucene only... but eventually dropped it completely.
> > I work only out of Nutch with a custom indexing plug-in now and I'm
> > quite happy with it. The only downside I found is that the Nutch search
> > bean does not offer as much functionality as I needed, so I had to
> > build my own plug-in there too. Later I decided to build a scoring
> > plug-in as well, to focus the crawl: works great.
> >
> > I don't really see the need for Hadoop right now either, but I like the
> > idea that it will be there if I need it, because my crawls might become
> > quite big/long.
> >
> > Not sure if I will move to Solr indexing in the future; I'm avoiding it
> > at the moment to minimize complexity.
> >
> > -Raymond-
> >
> > 2009/6/6 KK <[email protected]>
> >
> > > Hi All,
> > > I've been using Solr and Lucene for some time. I started with Solr,
> > > then moved to Lucene because of its greater flexibility/openness, but
> > > I like both. For my requirement I want to crawl webpages and add them
> > > to a Lucene index. So far I've been doing the crawling manually and
> > > adding the pages to the Lucene index through the Lucene APIs. The
> > > webpages contain a mix of roughly 5% English and the rest non-English
> > > [Indian] content. To handle stemming/stop-word removal for the English
> > > part, I wrote a small custom analyzer for use in Lucene, and that's
> > > working fairly well. Now I'm thinking of doing the crawling part with
> > > Nutch. Does this sound OK? I went through the Nutch wiki page and
> > > found that it supports a bunch of file types [like html/xml, pdf, odf,
> > > ppt, ms word etc.], but for me html is good enough. The wiki also says
> > > that it builds distributed indexes using Hadoop [I've used Hadoop a
> > > bit], which uses the map-reduce architecture. But my requirement
> > > doesn't need all of that. Distributed indexing is not required, so
> > > essentially I don't need the Hadoop/map-reduce stuff. So let me
> > > summarize what I want:
> > > #. Crawl the webpages; I want Nutch to hand me the content, not post
> > > it to Lucene directly by itself. Essentially I want to step in between
> > > crawling and indexing, as I have to use a custom analyzer before the
> > > content is indexed by Lucene.
> > > #. For me html parsing is good enough [no need for pdf/odf/msword
> > > etc.].
> > > #. No need for Hadoop/map-reduce.
> > >
> > > I'd like the users of Nutch to let me know their views. The other
> > > option is to look for a Java open-source crawler that can do the job,
> > > but I don't find any, and I'm more interested in using something
> > > really good/well-tested like Nutch. Let me know your opinions.
> > >
> > > Thanks,
> > > KK.
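
PS - On KK's point above about having Nutch hand over the content instead of indexing it itself: one way is to run the crawl only up through parsing, then read the parsed text back out of the segments and push it through your own analyzer into Lucene. A rough sketch, assuming a Nutch 1.0-era segment layout (parse_text is a Hadoop MapFile of Text -> ParseText) and Lucene 2.4-era APIs; the paths are arguments you supply, and StandardAnalyzer is just a stand-in for the custom mixed English/Indic analyzer:

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.nutch.parse.ParseText;

public class SegmentToLucene {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);

    // args[0]: e.g. crawl/segments/<segment>/parse_text/part-00000
    // args[1]: directory for the Lucene index
    // StandardAnalyzer is a stand-in; plug in your own analyzer here.
    IndexWriter writer = new IndexWriter(new File(args[1]),
        new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

    MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
    Text url = new Text();
    ParseText text = new ParseText();
    while (reader.next(url, text)) {
      Document doc = new Document();
      doc.add(new Field("url", url.toString(),
          Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("content", text.getText(),
          Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
    reader.close();

    writer.optimize();
    writer.close();
  }
}

That keeps Nutch purely as the crawler/parser and leaves all analysis and indexing decisions on the Lucene side, which sounds like what KK wants.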
