Nutch is intended to handle large collections. The simplest way to get hold of a large collection is to search the web.
But Nutch is not just a web search engine. It also provides distributed index creation and distributed search, which is why I described it as the networked version of Lucene.

So, while I agree with your statement that Nutch was "especially designed to deal with web documents", I would strongly disagree that this is a limitation. For one thing, if you actually have gobs of documents, you will probably have to store them in a networked form somehow. That networked form is probably pretty easy to make accessible via HTTP, and that makes a web-oriented search engine like Nutch just what you need.

Another way to say this is that if you need a general-purpose networked/distributed search engine and you have a web-oriented distributed search engine, you can either adapt the search engine to not be web-oriented, or you can adapt your collection to be web-oriented.

On 7/18/07 8:32 AM, "Samuel LEMOINE" <[EMAIL PROTECTED]> wrote:

> You quote Nutch as being "the networked version of Lucene", but from
> what I've seen it's more precise than that, especially designed to deal
> with web documents... am I wrong assuming this ?