Re: Creating Lucene index in Hadoop

2009-03-17 Thread Ning Li
Lucene on a local disk benefits significantly from the local filesystem's RAM cache (aka the kernel's buffer cache).  HDFS has no such local RAM cache outside of the stream's buffer.  The cache would need to be no larger than the kernel's buffer cache to get an equivalent hit ratio.  And if
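
To make the point about hit ratio concrete, here is a minimal sketch of a RAM block cache sitting in front of a slow reader, with hit and miss counters. It is an illustration only: `BlockCache` and `slow_read` are hypothetical names, and nothing here is an HDFS or Lucene API.

```python
from collections import OrderedDict

class BlockCache:
    """Hypothetical LRU cache of blocks in front of a slow (e.g. remote) reader."""

    def __init__(self, read_block, capacity):
        self.read_block = read_block   # function: block_id -> bytes
        self.capacity = capacity       # maximum number of cached blocks
        self.blocks = OrderedDict()    # block_id -> bytes, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, block_id):
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)   # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        data = self.read_block(block_id)
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the least recently used block
        return data

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

def slow_read(block_id):
    """Stand-in for a read that would otherwise go to disk or over the network."""
    return b"block-%d" % block_id

cache = BlockCache(slow_read, capacity=2)
for block_id in [0, 1, 0, 1, 2, 0]:
    cache.get(block_id)
```

With a working set larger than `capacity`, blocks are evicted and re-read, so the hit ratio drops; that is the trade-off the message alludes to when comparing against the kernel's buffer cache.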

Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ning Li
in HDFS compares with a ParallelMultiReader (or whatever it's called) over RPC on a local filesystem? I'm missing why you would ever want the Lucene index in HDFS for reading. Ian Ning Li ning.li...@gmail.com writes: I should have pointed out that Nutch index build and contrib/index targets

Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ning Li
ratio... Cheers, Ning On Mon, Mar 16, 2009 at 5:36 PM, Doug Cutting cutt...@apache.org wrote: Ning Li wrote: With http://issues.apache.org/jira/browse/HADOOP-4801, however, it may become feasible to search on HDFS directly. I don't think HADOOP-4801 is required.  It would help, certainly

Re: Creating Lucene index in Hadoop

2009-03-13 Thread Ning Li
Or you can check out the index contrib. The difference between the two is: - In Nutch's indexing map/reduce job, indexes are built in the reduce phase. Afterwards, they are merged into a smaller number of shards if necessary. The last time I checked, the merge process does not use map/reduce. -
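
A minimal sketch of the two-stage flow described for Nutch in the message above: each reduce task builds one index shard, and the shards are then merged into a smaller number of shards in a separate step. Plain Python dictionaries stand in for Lucene indexes; the function names (`partition`, `build_shard`, `merge_shards`) and the modulo partitioning are assumptions for illustration, not Nutch or Hadoop APIs.

```python
from collections import defaultdict

def partition(docs, num_reducers):
    """Map-phase stand-in: route each (doc_id, text) pair to a reducer by doc_id."""
    buckets = [[] for _ in range(num_reducers)]
    for doc_id, text in docs:
        buckets[doc_id % num_reducers].append((doc_id, text))
    return buckets

def build_shard(bucket):
    """Reduce-phase stand-in: build one inverted index (term -> sorted doc ids)."""
    index = defaultdict(set)
    for doc_id, text in bucket:
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def merge_shards(shards, target_count):
    """Separate post-processing step: merge many shards into target_count shards."""
    merged = [defaultdict(set) for _ in range(target_count)]
    for i, shard in enumerate(shards):
        for term, ids in shard.items():
            merged[i % target_count][term].update(ids)
    return [{term: sorted(ids) for term, ids in m.items()} for m in merged]

docs = [(0, "hadoop builds indexes"), (1, "lucene indexes text"),
        (2, "hadoop runs jobs"), (3, "lucene searches text")]
shards = [build_shard(bucket) for bucket in partition(docs, 4)]  # one shard per reducer
final = merge_shards(shards, 2)                                   # fewer, larger shards
```

Here the merge runs as an ordinary function after all shards exist, mirroring the observation that the merge process does not itself use map/reduce.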

Re: Distributed Lucene - from hadoop contrib

2008-08-18 Thread Ning Li
On 8/12/08, Deepika Khera [EMAIL PROTECTED] wrote: I was imagining the 2 concepts of i) using hadoop.contrib.index to index documents ii) providing search in a distributed fashion, to be all in one box. Ideally, yes. However, while it's good to use map/reduce when batch-building an index, there

Re: Distributed Lucene - from hadoop contrib

2008-08-08 Thread Ning Li
1) Katta and Distributed Lucene are different projects though, right? Both being based on kind of the same paradigm (Distributed Index)? The design of Katta and that of Distributed Lucene are quite different last time I checked. I pointed out the Katta project because you can find the code for

Re: Distributed Lucene - from hadoop contrib

2008-08-07 Thread Ning Li
http://wiki.apache.org/hadoop/DistributedLucene and hadoop.contrib.index are two different things. For information on hadoop.contrib.index, see the README file in the package. I believe you can find code for http://wiki.apache.org/hadoop/DistributedLucene at http://katta.wiki.sourceforge.net/.

Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Ning Li
You can build Lucene indexes using Hadoop Map/Reduce. See the index contrib package in the trunk. Or is it still not something you are looking for? Regards, Ning On 4/4/08, Aayush Garg [EMAIL PROTECTED] wrote: No, currently my requirement is to solve this problem by apache hadoop. I am trying

Re: Nutch and Distributed Lucene

2008-04-01 Thread Ning Li
Hi, Nutch builds Lucene indexes. But Nutch is much more than that. It is web search application software that crawls the web, inverts links and builds indexes. Each step is one or more Map/Reduce jobs. You can find more information at http://lucene.apache.org/nutch/ The Map/Reduce job to build

A contrib package to build/update a Lucene index

2008-03-10 Thread Ning Li
Hi, Is there any interest in a contrib package to build/update a Lucene index? I should have asked the question before creating the JIRA issue and attaching the patch. In any case, more details can be found at https://issues.apache.org/jira/browse/HADOOP-2951 Regards, Ning

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-07 Thread Ning Li
We welcome your input. Discussions are mainly on [EMAIL PROTECTED] now (a thread with the same title). On 2/7/08, Dennis Kubes [EMAIL PROTECTED] wrote: This is actually something we were planning on building into Nutch. Dennis

Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
that an application has more control over where to store the primary and replicas of an HDFS block. This feature may be useful for other HDFS applications (e.g., HBase). We would like to collaborate with other people who are interested in adding this feature to HDFS. Regards, Ning Li

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
On 2/6/08, Ted Dunning [EMAIL PROTECTED] wrote: Our best work-around is to simply take a shard out of service during delivery of an updated index. This is obviously not a good solution. How many shard servers are serving each shard? If it's more than one, you can have the rest of the shard
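
One possible reading of the question above, sketched under the assumption that each shard is served by more than one server: update the servers one at a time, so the remaining servers keep answering queries while one receives the new index. `ShardServer` and `rolling_update` are hypothetical names for illustration, not part of Hadoop, Lucene, or any distributed index project.

```python
class ShardServer:
    """Hypothetical server holding one replica of a shard's index."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.in_service = True

def serving(servers):
    """Servers currently able to answer queries for the shard."""
    return [s for s in servers if s.in_service]

def rolling_update(servers, new_version):
    """Update replicas one at a time; return the fewest servers in service at any point."""
    if len(servers) < 2:
        raise ValueError("need more than one server per shard to stay available")
    min_serving = len(servers)
    for server in servers:
        server.in_service = False      # take only this replica out of service
        min_serving = min(min_serving, len(serving(servers)))
        server.version = new_version   # deliver the updated index
        server.in_service = True       # put it back before touching the next one
    return min_serving

replicas = [ShardServer("a", 1), ShardServer("b", 1), ShardServer("c", 1)]
lowest = rolling_update(replicas, 2)
```

With three replicas, at least two stay in service throughout the update, at the cost of briefly serving a mix of old and new index versions.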