If you've done it for Lucene, it shouldn't be that difficult in Nutch; I'm
no Java guru either.
Look at the wiki, it's rather well explained.
If you add fields, which I did, you need both an indexing and a query
plug-in, but it is quite straightforward. My only real problem was finding
out that field names have to be lowercase, otherwise the search bean won't
find them.
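
Roughly, the pair of plug-ins for a new field looks like the sketch below.
Take it as a rough guide only: it follows the wiki's plug-in example for the
0.9/1.0-era API (newer releases pass a NutchDocument instead of a Lucene
Document, so check the IndexingFilter interface of your version), the
"myfield" name and the class names are just placeholders, and both classes
still have to be declared as extensions in the plug-in's plugin.xml.

// MyFieldIndexingFilter.java -- indexing side: adds a custom field.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class MyFieldIndexingFilter implements IndexingFilter {
  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Hypothetical source of the value: a parse metadata entry.
    String value = parse.getData().getMeta("myfield");
    if (value != null) {
      // Field name must be lowercase or the search bean won't match it.
      doc.add(new Field("myfield", value, Field.Store.YES,
          Field.Index.TOKENIZED));
    }
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

// MyFieldQueryFilter.java -- query side: makes myfield:foo clauses hit
// the new field.
import org.apache.nutch.searcher.FieldQueryFilter;

public class MyFieldQueryFilter extends FieldQueryFilter {
  public MyFieldQueryFilter() {
    super("myfield", 1.0f);  // same lowercase field name, default boost
  }
}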

-Ray-

2009/6/6 KK <[email protected]>

> Thanks Raymond.
> So, as per your mail, I should use nutch for both crawling and indexing,
> right? And for pre-processing the input before indexing, I have to write
> some plugin that will do the job of the custom analyzer I was using earlier
> with lucene, right? How easy/difficult is it to write this analyzer as a
> plugin for nutch? I must say that I'm an average Java programmer. Let me
> know your views.
>
> Thanks,
> KK
>
> On Sat, Jun 6, 2009 at 2:03 PM, Raymond Balmès <[email protected]> wrote:
>
> > I had started with your approach initially, i.e. building my indexing on
> > lucene only... but eventually dropped it completely.
> > I'm now working purely out of nutch with a custom indexing plug-in and
> > I'm quite happy with it. The only downside I found is that the Nutch
> > search bean does not offer as much functionality as I needed, so I had to
> > build my own plug-in there too. Later I decided to build a scoring
> > plug-in as well to focus the crawl: works great.
> >
> > I don't really see the need for hadoop either right now, but I like the
> > idea that it will be there if I need it, because my crawls might become
> > quite big/long.
> >
> > Not sure if I will move to solr indexing in the future; I'm avoiding it
> > for the moment to minimize complexity.
> >
> > -Raymond-
> >
> >
> >
> > 2009/6/6 KK <[email protected]>
> >
> > > Hi All,
> > > I've been using Solr and Lucene for some time. I started with Solr and
> > > then moved to lucene because of its greater flexibility/openness, but I
> > > like both. For my requirement I want to crawl webpages and add them to
> > > a lucene index. So far I've been doing the crawling manually and adding
> > > the pages to the lucene index through the lucene APIs. The webpages
> > > have content which is a mix of, say, 5% english and the rest
> > > non-english [indian] content. To handle stemming/stop-word removal for
> > > the english part, I wrote a small custom analyzer for use in lucene and
> > > that's working fairly well. Now I was thinking of doing the crawling
> > > part using Nutch. Does this sound OK? I went through the nutch wiki
> > > page and found that it supports a bunch of file types [like html/xml,
> > > pdf, odf, ppt, ms word etc.], but for me html is good enough. The wiki
> > > also says that it builds distributed indexes using Hadoop [I've used
> > > Hadoop a bit], which uses the map-reduce architecture. But my
> > > requirement doesn't call for that much. Distributed indexing is not
> > > required, so essentially I don't need the hadoop/map-reduce stuff. So
> > > let me summarize what I want:
> > > #. Crawl the webpages. I want nutch to hand the content over to me; I
> > > don't want it to post that content to lucene directly by itself.
> > > Essentially I want to step in between crawling and indexing, as I have
> > > to run my custom analyzer before the contents are indexed by lucene
> > > (see the sketch after this list).
> > > #. For me html parsing is good enough [no need for pdf/odf/msword
> > > etc.].
> > > #. No need for hadoop/map-reduce.
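> > >
> > > To illustrate what I mean by stepping in between, something like the
> > > rough, untested sketch below would suit me: read the fetched pages back
> > > out of a nutch segment and push them through my own analyzer into a
> > > lucene IndexWriter. The segment path and the part-00000 file name are
> > > just what a small local crawl produces, and the classes/paths may well
> > > differ between nutch versions, so please correct me if there is a
> > > better hook.
> > >
> > > // Rough sketch: iterate over the raw fetched content of one segment.
> > > import java.io.IOException;
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.fs.FileSystem;
> > > import org.apache.hadoop.fs.Path;
> > > import org.apache.hadoop.io.SequenceFile;
> > > import org.apache.hadoop.io.Text;
> > > import org.apache.nutch.protocol.Content;
> > > import org.apache.nutch.util.NutchConfiguration;
> > >
> > > public class SegmentContentDump {
> > >   public static void main(String[] args) throws IOException {
> > >     Configuration conf = NutchConfiguration.create();
> > >     FileSystem fs = FileSystem.get(conf);
> > >     // e.g. crawl/segments/20090606123456 (hypothetical segment dir)
> > >     Path segment = new Path(args[0]);
> > >     // Fetched pages are stored under <segment>/content; a local crawl
> > >     // writes a single part-00000.
> > >     Path data = new Path(segment, Content.DIR_NAME + "/part-00000/data");
> > >     SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
> > >     Text url = new Text();
> > >     Content content = new Content();
> > >     while (reader.next(url, content)) {
> > >       // Assuming UTF-8 pages; real code would honour the encoding.
> > >       String html = new String(content.getContent(), "UTF-8");
> > >       // Here I would strip the tags and feed the text through my
> > >       // custom analyzer into a lucene IndexWriter.
> > >       System.out.println(url + " : " + html.length() + " chars");
> > >     }
> > >     reader.close();
> > >   }
> > > }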
> > >
> > > I'd like the users of nutch to let me know their views. The other
> > > option is to look for a Java open-source crawler that can do the job,
> > > but I haven't found one, and I'm more interested in using something
> > > really good and well tested like nutch. Let me know your opinions.
> > >
> > > Thanks,
> > > KK.
> > >
> >
>
