On Sat, Apr 11, 2009 at 11:16:28AM -0400, Michael McCandless wrote:

> > But then, let's consider discrete vs. compound with regards to transparency.
> >
> > When we're talking about discrete segment files, we're only talking about
> > binary data -- because the metadata is all in segmeta.json.  Those binary
> > files are hard to examine without a tool anyway -- hexdumping is hard core.
> > :)
> >
> > So, transparency-wise, perhaps not so much is gained by going discrete.
>
> You can list their size, and see their presence or not.

Well, in the current KS format, there are two "real" files which make up the
compound system:

  * cf.dat -- binary data.
  * cfmeta.json -- list of file names mapped to offset and length.

So, opening the cfmeta.json file is analogous to a directory listing, though
slightly less information-rich and intuitive.

> > FWIW... I've already implemented a ByteBufDocReader proof-of-concept class
> > in pure Perl; instead of serializing all fields marked as "stored", it
> > stores one fixed-width byte array per document -- so doc storage is
> > essentially a flatfile.  I'm also pretty close to finishing a ZlibDocReader
> > that uses Zlib compression.  (The "compressed" field spec flag has been
> > removed.)
>
> How can doc storage be fixed width?  (text fields have different
> length).

It's not real doc storage.  The Stored() attribute is ignored by this
implementation; only one fixed length byte array gets written for each doc.

The main use case for this is when documents are stored externally -- perhaps
in a database, or potentially, on separate doc servers.  For large search
clusters, dedicated doc/highlight servers are a good idea, and this is a start
in that direction.

> So you removed "compressed" from FieldSpec and instead the user swaps
> out the DocReader component?  I wonder how compression compares if you
> did column-stride body text vs row stride body text plus all other
> fields.

Let a thousand flowers bloom -- make doc storage pluggable, and let people
experiment.

Marvin Humphrey
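
For anyone following the thread, the cf.dat / cfmeta.json arrangement described above can be sketched roughly as follows. This is only an illustration in Python, not the KinoSearch code: the JSON key names ("offset", "length"), the function names, and the virtual file names used in the usage note are all assumptions made for the example.

```python
import json
import os
import tempfile


def write_compound(dirpath, files):
    """Pack a {name: bytes} mapping into cf.dat and describe it in cfmeta.json.

    The key names "offset" and "length" are assumed for this sketch.
    """
    meta = {}
    with open(os.path.join(dirpath, "cf.dat"), "wb") as dat:
        for name, data in files.items():
            # Record where this virtual file starts and how long it is.
            meta[name] = {"offset": dat.tell(), "length": len(data)}
            dat.write(data)
    with open(os.path.join(dirpath, "cfmeta.json"), "w") as fh:
        json.dump(meta, fh, indent=2)


def list_compound(dirpath):
    """Return {name: length}, the rough equivalent of a directory listing."""
    with open(os.path.join(dirpath, "cfmeta.json")) as fh:
        meta = json.load(fh)
    return {name: entry["length"] for name, entry in meta.items()}


def read_virtual_file(dirpath, name):
    """Read one virtual file out of cf.dat using its recorded offset and length."""
    with open(os.path.join(dirpath, "cfmeta.json")) as fh:
        entry = json.load(fh)[name]
    with open(os.path.join(dirpath, "cf.dat"), "rb") as dat:
        dat.seek(entry["offset"])
        return dat.read(entry["length"])
```

With a scratch directory from tempfile.mkdtemp(), calling write_compound(d, {"a.dat": b"abc", "b.dat": b"hello"}) and then list_compound(d) gives the name-to-size view, which is about as much as a directory listing of discrete files would show; read_virtual_file(d, "b.dat") seeks into cf.dat and returns just that slice.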
