Kevin, Sorry for the delay in replying. I think your idea for an external field storage mechanism is excellent. I'd love to see it, and if I can, will be willing to help make that happen.
Regards, Terry ----- Original Message ----- From: Kevin A. Burton To: Lucene Users List Sent: Sunday, November 07, 2004 4:47 PM Subject: Lucene external field storage contribution. About 3 months ago I developed a external storage engine which ties into lucene. I'd like to discuss making a contribution so that this is integrated into a future version of Lucene. I'm going to paste my original PROPOSAL in this email. There wasn't a ton of feedback first time around but I figure squeaky wheel gets the grease... >> >> >> I created this proposal because we need this fixed at work. I want to >> go ahead and work on a vertical fix for our version of lucene and then >> submit this back to Jakarta. >> There seems to be a lot of interest here and I wanted to get feedback >> from the list before moving forward ... >> >> Should I put this in the wiki?! >> >> Kevin >> >> ** OVERVIEW ** >> >> Currently Lucene supports 'stored fields; where the content of these >> fields are >> kept within the lucene index for use in the future. >> >> While acceptable for small indexes, larger amounts of stored fields >> prevent: >> >> - Fast index merges since the full content has to be continually merged. >> >> - Storing the indexes in memory (since a LOT of memory would be >> required and >> this is cost prohibitive) >> >> - Fast queries since block caching can't be used on the index data. >> >> For example in our current setup our index size is 20G. Nearly 90% of >> this is >> content. If we could store the content outside of Lucene our merges and >> searches would be MUCH faster. If we could store the index in MEMORY >> this could >> be orders of magnitude faster. >> >> ** PROPOSAL ** >> >> Provide an external field storage mechanism which supports legacy indexes >> without modification. Content is stored in a "content segment". The only >> changes would be a field with 3(or 4 if checksum enabled) values. >> >> - CS_SEGMENT >> >> Logical ID of the content segment. This is an integer value. >> There is >> a global Lucene property named CS_ROOT which stores all the >> content. >> The segments are just flat files with pointers. Segments are >> broken >> into logical pieces by time and size. Usually 100M of content >> would be >> in one segment. >> >> - CS_OFFSET >> >> The byte offset of the field. >> >> - CS_LENGTH >> >> The length of the field. >> >> - CS_CHECKSUM >> >> Optional checksum to verify that the content is correct when >> fetched >> from the index. >> >> - The field value here would be exactly 'N:O:L' where N is the segment >> number, >> O is the offset, and L is the length. O and L are 64bit values. N >> is a 32 >> bit value (though 64bit wouldn't really hurt). >> >> This mechanism allows for the external storage of any named field. >> >> CS_OFFSET, and CS_LENGTH allow use with RandomAccessFile and new NIO >> code for >> efficient content lookup. (Though filehandle caching should probably >> be used). >> >> Since content is broken into logical 100M segments the underlying >> filesystem can >> orgnize the file into contiguous blocks for efficient non-fragmented >> lookup. >> >> File manipulation is easy and indexes can be merged by simply >> concatenating the >> second file to the end of the first. (Though the segment, offset, and >> length >> need to be updated). (FIXME: I think I need to think about this more >> since I >> will have < 100M per syncs) >> >> Supporting full unicode is important. Full java.lang.String storage >> is used >> with String.getBytes() so we should be able to avoid unicode issues. >> If Java >> has a correct java.lang.String representation it's possible easily add >> unicode >> support just by serializing the byte representation. (Note that the >> JDK says >> that the DEFAULT system char encoding is used so if this is ever >> changed it >> might break the index) >> >> While Linux and modern versions of Windows (not sure about OSX) >> support 64bit >> filesystems the 4G storage boundary of 32bit filesystems (ext2 is an >> example) >> are an issue. Using smaller indexes can prevent this but eventually >> segment >> lookup in the filesystem will be slow. This will only happen within >> terabyte >> storage systems so hopefully the developer has migrated to another >> (modern) >> filesystem such as XFS. >> >> ** FEATURES ** >> >> - Must be able to replicate indexes easily to other hosts. >> >> - Adding content to the index must be CHEAP >> >> - Deletes need to be cheap (these are cheap for older content. Just >> discard >> older indexes) >> >> - Filesystem needs to be able to optimize storage >> >> - Must support UNICODE and binary content (images, blobs, byte arrays, >> serialized objects, etc) >> >> - Filesystem metadata operations should be fast. Since content is >> kept in >> LARGE indexes this is easy to avoid. >> >> - Migration to the new system from legacy indexes should be fast and >> painless for future developers >> > > -- Use Rojo (RSS/Atom aggregator). Visit http://rojo.com. Ask me for an invite! Also see irc.freenode.net #rojo if you want to chat. Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html If you're interested in RSS, Weblogs, Social Networking, etc... then you should work for Rojo! If you recommend someone and we hire them you'll get a free iPod! Kevin A. Burton, Location - San Francisco, CA AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412