Thanks, Raymond.
So, as per your mail, I should use Nutch for both crawling and indexing,
right? And to pre-process the input before indexing, I have to write a
plugin that does the job of the custom analyzer I was using earlier with
Lucene, right? How easy or difficult is writing this analyzer as a plugin
for Nutch? I should mention that I'm an average Java programmer. Let me
know your views.
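
Just so we are talking about the same thing, the analyzer I have on the
Lucene side is roughly along these lines (a sketch against the Lucene
2.4-style API; I've left out the part that skips the non-English tokens,
and the class name is just a placeholder):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// English stop words are removed and tokens are stemmed; non-English
// tokens mostly pass through the Porter stemmer unchanged.
public class MyCustomAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
        stream = new PorterStemFilter(stream);
        return stream;
    }
}

My question is basically where a class like this would hook into Nutch.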

Thanks,
KK

On Sat, Jun 6, 2009 at 2:03 PM, Raymond Balmès <[email protected]> wrote:

> I had started with your approach initially, i.e. building my indexing on
> Lucene only... but eventually dropped it completely.
> I now work only out of Nutch with a custom indexing plug-in and I'm quite
> happy with it. The only downside I found is that the Nutch search bean
> does not offer as much functionality as I needed, so I had to build my
> own plug-in there too. Later I decided to build a scoring plug-in as well
> to focus the crawl: it works great.
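>
> To give you an idea of the size of the job: an indexing plug-in is
> essentially one class implementing Nutch's IndexingFilter extension point
> plus a plugin.xml that registers it. Roughly like this in the 0.9-era API
> (the interface changed a little in 1.0, so check it in your version; the
> class and field names below are only an illustration):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.Text;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.nutch.crawl.CrawlDatum;
> import org.apache.nutch.crawl.Inlinks;
> import org.apache.nutch.indexer.IndexingException;
> import org.apache.nutch.indexer.IndexingFilter;
> import org.apache.nutch.parse.Parse;
>
> public class MyIndexingFilter implements IndexingFilter {
>     private Configuration conf;
>
>     // Called once per page just before it is indexed; add or rewrite
>     // whatever fields you need here.
>     public Document filter(Document doc, Parse parse, Text url,
>                            CrawlDatum datum, Inlinks inlinks)
>             throws IndexingException {
>         doc.add(new Field("site", url.toString(), Field.Store.YES,
>                           Field.Index.UN_TOKENIZED));
>         return doc;
>     }
>
>     public void setConf(Configuration conf) { this.conf = conf; }
>     public Configuration getConf() { return conf; }
> }
>
> It is not a lot of code once the plugin.xml descriptor is in place.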
>
> I don't really see the need for Hadoop either right now, but I like the
> idea that it will be there if I need it, because my crawls might become
> quite big/long.
>
> Not sure if I will move to Solr indexing in the future; I'm avoiding it
> for the moment to minimize complexity.
>
> -Raymond-
>
>
>
> 2009/6/6 KK <[email protected]>
>
> > Hi All,
> > I've been using Solr and Lucene for some time. I started with Solr and
> > then moved to Lucene for the extra flexibility/openness of Lucene, but
> > I like both. As per my requirement, I want to crawl webpages and add
> > them to a Lucene index. So far I've been doing the crawling manually
> > and adding the pages to the Lucene index through the Lucene APIs. The
> > webpages have content which is a mix of, say, 5% English and the rest
> > non-English [Indian] content. To handle stemming/stop-word removal for
> > the English part, I wrote a small custom analyzer for use in Lucene,
> > and that's working fairly well.
> >
> > Now I was thinking of doing the crawling part using Nutch. Does this
> > sound OK? I went through the Nutch wiki page and found that it supports
> > a bunch of file types [like html/xml, pdf, odf, ppt, ms word etc.], but
> > for me html is good enough. The wiki also says that it builds
> > distributed indexes using Hadoop [I've used Hadoop a bit], which uses
> > the map-reduce architecture. But my requirement doesn't call for that
> > much: distributed indexing is not required, so essentially I don't need
> > the Hadoop/map-reduce stuff. So let me summarize what I want:
> > #. Crawl the webpages; I want Nutch to hand me over the content, not to
> > post it to Lucene directly by itself. Essentially I want to step in
> > between crawling and indexing, as I have to apply my custom analyzer
> > before the content is indexed by Lucene (rough sketch below).
> > #. For me, html parsing is good enough [no need for pdf/odf/msword
> > etc.].
> > #. No need for Hadoop/map-reduce.
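> >
> > To be concrete about the first point, I mean glue along these lines,
> > where my custom analyzer sits between whatever the crawler hands me and
> > the Lucene index (Lucene 2.4-style API; class and field names are just
> > placeholders):
> >
> > import org.apache.lucene.analysis.Analyzer;
> > import org.apache.lucene.document.Document;
> > import org.apache.lucene.document.Field;
> > import org.apache.lucene.index.IndexWriter;
> > import org.apache.lucene.store.FSDirectory;
> >
> > public class PageIndexer {
> >     private final IndexWriter writer;
> >
> >     public PageIndexer(String indexDir, Analyzer customAnalyzer) throws Exception {
> >         // The writer is built with my analyzer, not the crawler's default.
> >         writer = new IndexWriter(FSDirectory.getDirectory(indexDir),
> >                 customAnalyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
> >     }
> >
> >     // Called with the url and extracted text of each fetched page.
> >     public void addPage(String url, String content) throws Exception {
> >         Document doc = new Document();
> >         doc.add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
> >         doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
> >         writer.addDocument(doc);
> >     }
> >
> >     public void close() throws Exception {
> >         writer.optimize();
> >         writer.close();
> >     }
> > }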
> >
> > I'd like the users of Nutch to let me know their views. The other
> > option is to look for Java open-source crawlers that can do the job,
> > but I don't find any, and I'm more interested in using something really
> > good and well tested like Nutch. Let me know your opinions.
> >
> > Thanks,
> > KK.
> >
>
