I think this is a great question ("what's the best way to really scale up Lucene?"). I don't have a lot of experience in that area so I'll defer to others (and I'm eager to learn myself!).
I think understanding Solr's overall approach (whose design I believe came out of the thread you've referenced) is also a good step here. Even if you can't re-use the hard links trick, you might be able to reuse its snapshotting & index distribution protocol.

However, I have been working on some "bottom-up" improvements to Lucene (getting native OS locking working and [separate but related] "lock-less commits") that I think could be related to some of the issues you're seeing with HDFS -- see below:

> > The cronjob/link solution, which is quite clean, doesn't work well
> > in a Windows environment. While it's my favorite, no dice... Rats.
>
> There may be hope yet for that on Windows. Hard links work on
> Windows, but the only problem is that you can't rename/delete any
> links when the file is open. Michael McCandless is working on a
> patch that would eliminate all renames (and deletes can be handled
> by deferring them).

Right, with the "lock-less commits" patch we never rename a file and also never re-use a file name (i.e., making Lucene's use of the filesystem "write once").

> 1) Indexing and Searching Directly from HDFS
>
> Indexing to HDFS is possible with a patch if we don't use CFS. While
> not ideal performance-wise, it's reliable, takes care of data
> redundancy and component failure, and means that I can have cheap
> small drives instead of a large expensive NAS. It's also quite
> simple to implement (see Nutch's indexer.FsDirectory for the
> Directory implementation).

This is very interesting! I don't know enough about HDFS (yet!). On a very quick read, I like that it's a "write once" filesystem, because that's a good match for lock-less commits.

> So I would have several indexes (i.e. 16) and the same number of
> indexers, and a searcher for each index (possibly in the same
> process) that searches each one directly from HDFS. One problem I'm
> having is an occasional FileNotFoundException. (Probably locking
> related.)
>
> It comes out of the Searcher when I try to do a search while things
> are being indexed. I'd be interested to know what exactly is
> happening when this exception is thrown; maybe I can design around
> it. (Do synchronization at the appropriate times or similar.)

That exception looks disturbingly similar to the ones Lucene hits on NFS. See here for the gory details:

    http://issues.apache.org/jira/browse/LUCENE-673

The summary of that [long] issue is that these exceptions seem to be due to cache staleness of Lucene's "segments" file (caused by how the NFS client does caching, even with an NFS V4 client/server) and not in fact due to locking (as had previously been assumed/expected). The good news is that the lock-less commits fix resolves this, at least in my testing so far (i.e., it makes it possible to share a single index over NFS).

I wonder if a similar cause is at work in HDFS? HDFS is "write once" but the current Lucene isn't (not until we can get lock-less commits in). For example, it re-uses the "segments" file.

I think even if lock-less commits ("write once") enable sharing of a single copy of an index over remote filesystems like HDFS, NFS or SMB/CIFS, whether that's performant enough (vs. replicating copies to local filesystems, which are presumably quite a bit faster at IO, at the expense of the local storage consumed) would still be a big open question.

Mike
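
P.S. To make the "write once" idea a bit more concrete, here is a toy sketch of the naming scheme (mine, for illustration only -- this is not the actual lock-less commits patch, and the class name and details are made up): each commit writes a brand new segments_N file instead of renaming or overwriting "segments", and a reader simply opens the highest N it can find.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;

    // Toy illustration of a "write once" commit scheme: nothing is ever
    // renamed or overwritten; each commit gets a brand new generation file.
    public class WriteOnceCommits {

        // Find the highest commit generation currently in the index directory.
        static long currentGeneration(File indexDir) {
            long max = 0;
            String[] names = indexDir.list();
            if (names != null) {
                for (int i = 0; i < names.length; i++) {
                    if (names[i].startsWith("segments_")) {
                        try {
                            long gen = Long.parseLong(names[i].substring("segments_".length()));
                            if (gen > max) max = gen;
                        } catch (NumberFormatException ignore) {
                            // not one of our commit files
                        }
                    }
                }
            }
            return max;
        }

        // Write the next commit as a brand new file: never rename, never reuse a name.
        static File writeNextCommit(File indexDir, byte[] commitData) throws IOException {
            File commitFile = new File(indexDir, "segments_" + (currentGeneration(indexDir) + 1));
            FileOutputStream out = new FileOutputStream(commitFile);
            try {
                out.write(commitData);
            } finally {
                out.close();
            }
            return commitFile;
        }
    }

A reader never sees a half-updated "segments" file: it sees either the old segments_N or the new segments_(N+1), and obsolete files can be deleted lazily once no reader needs them.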
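
P.P.S. Purely as a stop-gap until that lands, one way to "design around" the FileNotFoundException you're seeing might be to treat a failed (re)open as transient: keep serving queries from the last searcher that opened successfully and retry later. This is an untested sketch and the class name is just something I made up (it's not in Lucene or Nutch); it uses the IndexSearcher(String path) constructor, but the same idea applies if you open an IndexSearcher over Nutch's FsDirectory instead.

    import java.io.FileNotFoundException;
    import java.io.IOException;

    import org.apache.lucene.search.IndexSearcher;

    // Stop-gap sketch: treat FileNotFoundException during (re)open as
    // transient, keep serving queries from the last good searcher, and
    // try again on the next refresh.
    public class RetryingSearcherHolder {

        private final String indexPath;  // path to the index
        private IndexSearcher current;   // last searcher that opened successfully

        public RetryingSearcherHolder(String indexPath) {
            this.indexPath = indexPath;
        }

        // Try to open a fresh searcher; on failure, keep the previous one.
        // May return null if the very first open fails.
        public synchronized IndexSearcher refresh() throws IOException {
            try {
                IndexSearcher fresh = new IndexSearcher(indexPath);
                if (current != null) {
                    current.close();  // in real code, wait for in-flight searches first
                }
                current = fresh;
            } catch (FileNotFoundException e) {
                // The commit we tried to open referenced a file that is gone
                // (or the cached "segments" file is stale); keep the old
                // searcher and retry on the next refresh.
            }
            return current;
        }
    }

Of course this only hides the symptom; the root cause (the re-used "segments" file) is what lock-less commits is meant to fix.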