Re: lucene with hadoop but without nutch, looking for documentation

Ted Dunning Wed, 18 Jul 2007 09:29:46 -0700

Nutch is intended to handle large collections.  The simplest way to get hold
of a large collection is to crawl the web.

But Nutch is not just a web search engine.  It also provides distributed
creation of indexes and distributed search, which is what motivated my
comment about it being the networked version of Lucene.

So, while I agree with your statement that Nutch was "especially designed to
deal with web documents", I would strongly disagree that this is a
limitation.  For one thing, if you actually have gobs of documents, you
probably will have to store them in a networked form somehow.  That
networked form is probably pretty easy to make accessible via HTTP and that
makes a web-oriented search engine like Nutch just what you need.

Another way to say this is that if you need a general purpose
networked/distributed search engine and you have a web-oriented distributed
search engine, you can either adapt the search engine to not be
web-oriented, or you can adapt your collection to be web-oriented.

On 7/18/07 8:32 AM, "Samuel LEMOINE" <[EMAIL PROTECTED]> wrote:

> You quote Nutch as being "the networked version of Lucene", but from
> what I've seen it's more precise than that, especially designed to deal
> with web documents... am I wrong in assuming this?
