Ning Li wrote:
With
http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
become feasible to search on HDFS directly.

I don't think HADOOP-4801 is required. It would help, certainly, but it's so fraught with security and other issues that I doubt it will be committed anytime soon.

What would probably help HDFS random access performance for Lucene significantly would be:

1. A cache of connections to datanodes, so that each seek() does not require an open(). If we move HDFS data transfer to be RPC-based (see, e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will come for free, since RPC already caches connections. We hope to do this for Hadoop 1.0, so that we use a single transport for all Hadoop's core operations, to simplify security.

2. A local cache of read-only HDFS data, equivalent to the kernel's buffer cache. This might be implemented as a Lucene Directory that keeps an LRU cache of buffers from a wrapped filesystem, perhaps a subclass of RAMDirectory.
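To illustrate point 2, here is a minimal sketch of the LRU buffer-cache idea, kept independent of the Lucene Directory and Hadoop FileSystem APIs (the class and method names below are illustrative, not real Lucene or Hadoop interfaces). A real implementation would sit inside a Directory wrapper and fill blocks from the underlying HDFS stream on a miss:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch of an LRU cache of read-only file blocks, analogous to the
// kernel's buffer cache. Keys are "file:blockIndex"; values are the
// cached byte buffers.
public class BlockCache {
    private final int maxBlocks;
    private final Map<String, byte[]> cache;

    public BlockCache(int maxBlocks) {
        this.maxBlocks = maxBlocks;
        // accessOrder=true makes iteration order least-recently-used
        // first, so removeEldestEntry evicts the LRU block.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > BlockCache.this.maxBlocks;
            }
        };
    }

    // Returns the cached block, or loads it via the supplied loader on
    // a miss (in practice, a read from the wrapped filesystem).
    public byte[] get(String file, long blockIndex, Supplier<byte[]> loader) {
        String key = file + ":" + blockIndex;
        byte[] block = cache.get(key);
        if (block == null) {
            block = loader.get();
            cache.put(key, block);
        }
        return block;
    }

    public int size() {
        return cache.size();
    }
}
```

With a cache like this in front of HDFS reads, repeated seeks into hot index regions (term dictionaries, skip lists) would be served from local memory rather than going back to a datanode each time.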

With these, performance would still be slower than a local drive, but perhaps not so dramatically.

Doug
