If you've done it for Lucene, it shouldn't be that difficult in Nutch; I'm no Java guru either. Look at the wiki, it's rather well explained. If you add fields, which I did, you need both an indexing and a query plug-in, but it is quite straightforward. My only real problem was finding out that field names have to be lowercase, because of the search bean.
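For what it's worth, here is a minimal sketch of the indexing side (assuming the Nutch 1.0 IndexingFilter extension point; exact method signatures vary a bit between releases, and the field name and metadata key below are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Hypothetical filter that copies a parse metadata value into an extra
// index field. Note the lowercase field name -- the search bean needs it.
public class SiteCategoryIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // "x-site-category" is a made-up metadata key; use whatever your parser sets.
    String category = parse.getData().getMeta("x-site-category");
    if (category != null) {
      doc.add("sitecategory", category);   // lowercase field name
    }
    return doc;
  }

  // Some releases also declare addIndexBackendOptions(Configuration) on the
  // interface; a no-op implementation like this keeps them happy.
  public void addIndexBackendOptions(Configuration conf) {
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

The query side is a second plug-in on the QueryFilter extension point (extending RawFieldQueryFilter is usually enough for a raw field), and both have to be declared in the plug-in's plugin.xml and added to plugin.includes in nutch-site.xml, as the wiki describes.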
-Ray-

2009/6/6 KK <[email protected]>

> Thanks Raymond.
> So, as per your mail, I should use Nutch for both crawling and indexing,
> right? And for pre-processing the input before indexing I have to write
> some plugin that will do the job of the custom analyzer I was using
> earlier with Lucene, right? How easy/difficult is writing this analyzer
> as a plugin for Nutch? I must say that I'm an average Java programmer.
> Let me know your views.
>
> Thanks,
> KK
>
> On Sat, Jun 6, 2009 at 2:03 PM, Raymond Balmès <[email protected]> wrote:
>
> > I had started with your approach initially, i.e. building my indexing
> > on Lucene only... but eventually dropped it completely.
> > I work only out of Nutch with a custom indexing plug-in now and I'm
> > quite happy with it. The only downside I found is that the Nutch search
> > bean does not offer as much functionality as I needed, so I had to
> > build my own plug-in there too. Later I decided to build a scoring
> > plug-in as well, to focus the crawl: works great.
> >
> > I don't really see the need for Hadoop right now either, but I like the
> > idea that it will be there if I need it, because my crawls might become
> > quite big/long.
> >
> > Not sure if I will move to Solr indexing in the future; I'm avoiding it
> > at the moment to minimize complexity.
> >
> > -Raymond-
> >
> > 2009/6/6 KK <[email protected]>
> >
> > > Hi All,
> > > I've been using Solr and Lucene for some time. I started with Solr,
> > > then moved to Lucene because of its greater flexibility/openness, but
> > > I like both. For my requirement I want to crawl webpages and add them
> > > to a Lucene index. So far I've been doing the crawling manually and
> > > adding the pages to the Lucene index through the Lucene APIs. The
> > > webpages contain a mix of roughly 5% English and the rest non-English
> > > [Indian] content. To handle stemming/stop-word removal for the English
> > > part, I wrote a small custom analyzer for use in Lucene, and that's
> > > working fairly well. Now I'm thinking of doing the crawling part with
> > > Nutch. Does this sound OK? I went through the Nutch wiki page and
> > > found that it supports a bunch of file types [like html/xml, pdf, odf,
> > > ppt, ms word etc.], but for me html is good enough. The wiki also says
> > > that it builds distributed indexes using Hadoop [I've used Hadoop a
> > > bit], which uses the map-reduce architecture. But my requirement
> > > doesn't need all of that. Distributed indexing is not required, so
> > > essentially I don't need the Hadoop/map-reduce stuff. So let me
> > > summarize what I want:
> > > #. Crawl the webpages; I want Nutch to hand me the content, not post
> > > it to Lucene directly by itself. Essentially I want to step in between
> > > crawling and indexing, as I have to use a custom analyzer before the
> > > content is indexed by Lucene.
> > > #. For me html parsing is good enough [no need for pdf/odf/msword
> > > etc.].
> > > #. No need for Hadoop/map-reduce.
> > >
> > > I'd like the users of Nutch to let me know their views. The other
> > > option is to look for a Java open-source crawler that can do the job,
> > > but I don't find any, and I'm more interested in using something
> > > really good/well-tested like Nutch. Let me know your opinions.
> > >
> > > Thanks,
> > > KK.
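
PS - On KK's point above about having Nutch hand over the content instead of indexing it itself: one way is to run the crawl only up through parsing, then read the parsed text back out of the segments and push it through your own analyzer into Lucene. A rough sketch, assuming a Nutch 1.0-era segment layout (parse_text is a Hadoop MapFile of Text -> ParseText) and Lucene 2.4-era APIs; the paths are arguments you supply, and StandardAnalyzer is just a stand-in for the custom mixed English/Indic analyzer:

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.nutch.parse.ParseText;

public class SegmentToLucene {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);

    // args[0]: e.g. crawl/segments/<segment>/parse_text/part-00000
    // args[1]: directory for the Lucene index
    // StandardAnalyzer is a stand-in; plug in your own analyzer here.
    IndexWriter writer = new IndexWriter(new File(args[1]),
        new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

    MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
    Text url = new Text();
    ParseText text = new ParseText();
    while (reader.next(url, text)) {
      Document doc = new Document();
      doc.add(new Field("url", url.toString(),
          Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("content", text.getText(),
          Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
    reader.close();

    writer.optimize();
    writer.close();
  }
}

That keeps Nutch purely as the crawler/parser and leaves all analysis and indexing decisions on the Lucene side, which sounds like what KK wants.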
