[moved from private to lucy-dev in case others are interested]

On Thu, Mar 12, 2009 at 5:13 PM, Marvin Humphrey <[email protected]> wrote:
> On Fri, Mar 06, 2009 at 10:17:55AM -0700, Nathan Kurz wrote:
>> There was an article recently that might be relevant to your desire
>> for real time updates of KinoSearch databases:
>> http://news.ycombinator.com/item?id=497039
>
> It looks like that system can handle a much greater change rate than KS.
> KS has a slow best-case write speed, but I'm not worried about that.  The
> problem me and the Lucene folks are trying to address under the topic
> heading of "real-time indexing" is *worst-case* write speed: most of the
> time you're fine, but every once in a while you trigger a large merge and
> you wait a loooooong time.  That problem has a different solution.
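A toy model makes that worst case concrete.  This is a Python sketch, not
KS's or Lucene's actual merge policy: the fanout of 10 and the "cost equals
documents rewritten" accounting are invented for illustration.  Each flush
writes a tiny segment; whenever ten equal-sized segments pile up they merge
into one, so most flushes are cheap but the occasional flush cascades and
rewrites nearly the whole index:

```python
def simulate(flushes, fanout=10):
    """Return the worst single-flush merge cost in a toy tiered scheme."""
    segments = []          # sizes of live segments
    worst = 0
    for _ in range(flushes):
        segments.append(1)  # each flush writes one size-1 segment
        cost = 0
        # Cascade: merge any tier that reaches `fanout` equal-sized segments.
        merged = True
        while merged:
            merged = False
            for size in sorted(set(segments)):
                if segments.count(size) >= fanout:
                    for _ in range(fanout):
                        segments.remove(size)
                    segments.append(size * fanout)
                    cost += size * fanout   # cost ~ documents rewritten
                    merged = True
                    break
        worst = max(worst, cost)
    return worst

# Flushes 1-9 cost nothing; flush 10 merges 10 docs; flush 1000 cascades
# through three tiers and rewrites 10 + 100 + 1000 = 1110 docs in one go.
```

The point of the toy is just that average and worst-case flush cost diverge
as the index grows, which is the gap "real-time indexing" has to close.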
What is the actual case that would trigger such a problem?  My instinct is
that while there is no way to avoid the long merge, there are schemes where
only the update is slow, and the readers can continue at more or less full
speed.

> Except on NFS, KS doesn't have much in the way of lock contention issues
> because index files are never modified.  But regardless of its
> applicability to KS, this "sneaky lock" trick is pretty nice:
> ...
> We couldn't do that because we can't reach out across the system to active
> IndexReaders in other processes, but still: nice.

I realize this, but I'm wondering if a locking approach might be
preferable.  Would the equivalent of row-level locking allow you to modify
the index files in real time, instead of adding addenda and tombstones?
I'm not necessarily suggesting that the default index format should do
this, but it might be worth considering whether a proposed API would
support such a real-time format.

> I'm reminded of this presentation on a lock-free hash table:
>
> http://video.google.com/videoplay?docid=2139967204534450862
>
> http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf

Thanks, I took a glance, and it does seem interesting.  In addition to its
relevance to shared memory, I've been looking at CUDA programming on GPUs,
and I'm interested in lock-free data structures such as this one.

> This was well put:
>
>     RAM can be viewed as a 16 Gig L4 cache, and disk as a multi-Terabyte
>     L5.  Just as one currently writes code that doesn't distinguish
>     between a fetch from L1 and a fetch from main memory, mmap() allows
>     extending this syntax all the way to a fetch from disk.

Thanks.  While I certainly think that mmap() can have great performance
advantages, it's the simplification it provides that really appeals to me.
Instead of fighting with the OS, use it!

Nathan Kurz
[email protected]
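P.S.  A minimal illustration of the "disk as L5" point, in Python rather
than the C that KS uses; the file and its contents here are made up.  Once
a file is mapped, reads are ordinary slice syntax, and the OS pages bytes
in on demand whether they sit in the page cache or still on disk:

```python
import mmap
import os
import tempfile

# Write a small scratch file to map.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello from the L5 cache")

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        assert m[0:5] == b"hello"       # reads look like slicing bytes
        assert m.rfind(b"cache") == 18  # search with no explicit read() call

os.remove(path)
```

The same slicing code would work on a multi-gigabyte index file without
the program ever issuing an explicit read, which is the simplification
being praised above.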
