Ning Li wrote:
With
http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
become feasible to search on HDFS directly.

I don't think HADOOP-4801 is required. It would help, certainly, but it's so fraught with security and other issues that I doubt it will be committed anytime soon.

What would probably help HDFS random access performance for Lucene significantly would be:

1. A cache of connections to datanodes, so that each seek() does not require an open(). If we move HDFS data transfer to be RPC-based (see, e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will come for free, since RPC already caches connections. We hope to do this for Hadoop 1.0, so that we use a single transport for all Hadoop's core operations, to simplify security.

2. A local cache of read-only HDFS data, equivalent to the kernel's buffer cache. This might be implemented as a Lucene Directory that keeps an LRU cache of buffers from a wrapped filesystem, perhaps a subclass of RAMDirectory.
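To illustrate point 2, here is a minimal sketch of the LRU buffer-cache idea, kept independent of the Lucene Directory and Hadoop FileSystem APIs (the class and method names below are illustrative, not real Lucene or Hadoop interfaces). A real implementation would sit inside a Directory wrapper and fill blocks from the underlying HDFS stream on a miss:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch of an LRU cache of read-only file blocks, analogous to the
// kernel's buffer cache. Keys are "file:blockIndex"; values are the
// cached byte buffers.
public class BlockCache {
    private final int maxBlocks;
    private final Map<String, byte[]> cache;

    public BlockCache(int maxBlocks) {
        this.maxBlocks = maxBlocks;
        // accessOrder=true makes iteration order least-recently-used
        // first, so removeEldestEntry evicts the LRU block.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > BlockCache.this.maxBlocks;
            }
        };
    }

    // Returns the cached block, or loads it via the supplied loader on
    // a miss (in practice, a read from the wrapped filesystem).
    public byte[] get(String file, long blockIndex, Supplier<byte[]> loader) {
        String key = file + ":" + blockIndex;
        byte[] block = cache.get(key);
        if (block == null) {
            block = loader.get();
            cache.put(key, block);
        }
        return block;
    }

    public int size() {
        return cache.size();
    }
}
```

With a cache like this in front of HDFS reads, repeated seeks into hot index regions (term dictionaries, skip lists) would be served from local memory rather than going back to a datanode each time.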

With these, performance would still be slower than a local drive, but perhaps not so dramatically.

Doug
