I think this is a great question ("what's the best way to really scale
up Lucene?").  I don't have alot of experience in that area so I'll
defer to others (and I'm eager to learn myself!).

I think understanding Solr's overall approach (whose design I believe
came out of the thread you've referenced) is also a good step here.
Even if you can't re-use the hard links trick, you might be able to
reuse its snapshotting & index distribution protocol.
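
As I understand it, Solr does that snapshotting with shell scripts
that hard-link the current index files into a snapshot directory.
Purely as an illustration (not Solr's actual code, and the paths are
made up), a Java take on the same hard-link trick might look like
this; note the Windows caveat discussed below, that links to open
files can't be renamed or deleted there:

    import java.io.IOException;
    import java.nio.file.*;

    public class HardLinkSnapshot {

        // "Snapshot" an index directory by hard-linking every file into a
        // new directory.  The links share the underlying data, so this is
        // cheap, and the indexer can keep writing new files without
        // disturbing the snapshot.
        public static Path snapshot(Path indexDir, Path snapshotRoot) throws IOException {
            Path snapDir = snapshotRoot.resolve("snapshot." + System.currentTimeMillis());
            Files.createDirectories(snapDir);
            try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
                for (Path file : files) {
                    if (Files.isRegularFile(file)) {
                        // createLink(link, existing): new name, same underlying data
                        Files.createLink(snapDir.resolve(file.getFileName()), file);
                    }
                }
            }
            return snapDir;
        }
    }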

However, I have been working on some "bottom-up" improvements to
Lucene (getting native OS locking working and [separate but related]
"lock-less commits") that I think could be related to some of the
issues you're seeing with HDFS -- see below:

> > The cronjob/link solution, which is quite clean, doesn't work well
> > in a Windows environment. While it's my favorite, no dice... Rats.
>
> There may be hope yet for that on Windows.  Hard links work on
> Windows, but the only problem is that you can't rename/delete any
> links when the file is open. Michael McCandless is working on a
> patch that would eliminate all renames (and deletes can be handled
> by deferring them).

Right, with "lock-less commits" patch we never rename a file and also
never re-use a file name (ie, making Lucene's use of the filesystem
"write once").

> 1) Indexing and Searching Directly from HDFS
>
> Indexing to HDFS is possible with a patch if we don't use CFS. While
> not ideal performance-wise, it's reliable, takes care of data
> redundancy and component failure, and means that I can have cheap
> small drives instead of a large expensive NAS. It's also quite simple
> to implement (see Nutch's indexer.FsDirectory for the Directory
> implementation).

This is very interesting!  I don't know enough about HDFS (yet!).  On
very quick read, I like that it's a "write once" filesystem because
it's a good match to lock-less commits.
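
For anyone else following along: as I understand it, Nutch's
indexer.FsDirectory implements Lucene's Directory by delegating to
Hadoop's FileSystem API.  Just to show the "write once" match I mean
(paths and data below are made up, and this is only a sketch of the
underlying FileSystem calls, not a Directory implementation), each
Lucene file becomes a single HDFS file that is created once and then
only ever read or deleted, never modified in place:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteOnceDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            Path file = new Path("/index/part-0/_0.frq");  // hypothetical index file

            // Write once: the stream only appends while open; after close(),
            // the file's contents are fixed.
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeBytes("...index data...");
            }

            // Read many: every searcher opens the same immutable file.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[8];
                in.readFully(0, buf, 0, buf.length);
            }

            // Deleting is fine; modifying in place is not -- which is why a
            // "write once" Lucene (lock-less commits) is such a good match.
            fs.delete(file, false);
        }
    }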

> So I would have several indexes (eg, 16) and the same number of
> indexers, and a searcher for each index (possibly in the same
> process) that searches each one directly from HDFS. One problem I'm
> having is an occasional FileNotFoundException (probably locking
> related).
>
> It comes out of the Searcher when I try to do a search while things
> are being indexed. I'd be interested to know what exactly is
> happening when this exception is thrown; maybe I can design around
> it (do synchronization at the appropriate times, or similar).

That exception looks disturbingly similar to the ones Lucene hits on
NFS.  See here for gory details:

    http://issues.apache.org/jira/browse/LUCENE-673

The summary of that [long] issue is that these exceptions seem to be
caused by cache staleness of Lucene's "segments" file (because of how
the NFS client does caching, even with an NFS V4 client/server), and
not in fact by locking (as had previously been assumed).  The good
news is that the lock-less commits changes resolve this, at least in
my testing so far (ie, they make it possible to share a single index
over NFS).
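
Until the real fix is in, one blunt way to "design around it" (a
retry, rather than the synchronization you mention) is to re-open the
searcher whenever the stale view of "segments" points at a file the
writer has already removed.  A rough sketch, using the
IndexSearcher(String) constructor; the retry count and back-off are
arbitrary:

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;

    public class RetryingSearcherOpen {

        // Crude workaround, not a fix: re-try opening the searcher if it
        // trips over a missing file while the index is being updated
        // underneath it.
        public static IndexSearcher open(String indexPath, int maxRetries)
                throws IOException, InterruptedException {
            for (int i = 0; i < maxRetries; i++) {
                try {
                    return new IndexSearcher(indexPath);
                } catch (FileNotFoundException e) {
                    // the "segments" we read referenced a file that is now
                    // gone; back off briefly and re-read the index from scratch
                    Thread.sleep(250);
                }
            }
            // final attempt: let any remaining exception propagate to the caller
            return new IndexSearcher(indexPath);
        }
    }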

I wonder if a similar cause is at work in HDFS?  HDFS is "write once",
but the current Lucene isn't (and won't be until we can get lock-less
commits in).  For example, it re-uses the "segments" file.

I think that even if lock-less commits ("write once") enable sharing a
single copy of the index over remote filesystems like HDFS, NFS or
SMB/CIFS, it would still be a big open question whether that's
performant enough, versus replicating copies to local filesystems that
are presumably quite a bit faster at IO (at the expense of the local
storage consumed).

Mike
