Re: Flexible indexing design

Marvin Humphrey Tue, 15 Apr 2008 12:04:30 -0700


On Apr 13, 2008, at 2:35 AM, Michael McCandless wrote:

> I think the major difference is locality?  In a compound file, you
> have to seek "far away" to reach the prx & skip data (if they are
> separate).

There's another item worth mentioning, something that Doug, Grant and
I discussed when this flexible indexing talk started way back when.
When you unify frq/prx data into a single file, phrase queries and the
like benefit from improved locality, but simple term queries are
impeded because needless positional data must be plowed through.

We dismissed that cost with the assertion that you could specify a
match-only field for simple queries if that was important to you, but
IME that doesn't seem to be very practical. It's hard for the
internals to know that they should prefer one field over another based
on the type of query, and hard to manually override everywhere.
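
To make the trade-off concrete, here is a toy sketch in Python. It is
not the Lucene or KS file format: the flat integer lists and every
function name are assumptions made up for illustration. It counts how
many values a simple term query has to pass over in a unified stream
versus a separate frq stream.

```python
def encode_unified(postings):
    """Flatten [(doc_id, [positions])] into one integer stream:
    doc_id, freq, then freq positions, for each document."""
    stream = []
    for doc_id, positions in postings:
        stream.append(doc_id)
        stream.append(len(positions))
        stream.extend(positions)
    return stream


def encode_split(postings):
    """Write the same postings as two streams: frq holds (doc_id, freq)
    pairs, prx holds nothing but positions."""
    frq, prx = [], []
    for doc_id, positions in postings:
        frq.extend([doc_id, len(positions)])
        prx.extend(positions)
    return frq, prx


def term_query_unified(stream):
    """Return (doc_ids, values_passed_over) for a match-only query.

    With variable-length integers each position would have to be
    decoded just to find the next record, so positions are counted
    as read even though the query never uses them."""
    docs, values_read, i = [], 0, 0
    while i < len(stream):
        doc_id, freq = stream[i], stream[i + 1]
        docs.append(doc_id)
        i += 2 + freq
        values_read += 2 + freq
    return docs, values_read


def term_query_split(frq):
    """Return (doc_ids, values_passed_over); prx is never touched."""
    return frq[0::2], len(frq)


postings = [(1, [3, 9, 27]), (4, [5]), (7, [2, 8])]
unified = encode_unified(postings)
frq, prx = encode_split(postings)
print(term_query_unified(unified))  # ([1, 4, 7], 12)
print(term_query_split(frq))        # ([1, 4, 7], 6)
```

The gap widens with term frequency: every extra position in a document
is one more value the match-only query plows through in the unified
layout and ignores entirely in the split one.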

> This is like "column stride" vs "row stride" serialization
> of a matrix.
>
> Relatively soon, though, we will all be on SSDs, so maybe this
> locality argument becomes far less important ;)

Yes, I've thought about that. It defeats the phrase-query locality
argument for unified postings files and recommends breaking things up
logically by type of data into frq/prx/payload/whatever.

Would it be possible to design a Posting plugin class that reads from
multiple files? I'm pretty sure the answer is yes. It messes up the
single-stream readRecord() method I've been detailing and might force
Posting to maintain state. But if Postings are scarce TermBuffer-style
objects where values mutate, rather than populous Term-style objects
where you need a new instance for each set of values, then it doesn't
matter if they're large.

If that could be done, I think it would be possible to retrofit the
Posting/PostingList concept into Lucene without a file format change.
FWIW.
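
A rough sketch of what such a multi-file Posting might look like, in
Python for brevity. The class name, the read_record() signature, and
the fixed-width 4-byte encoding are all assumptions for illustration;
none of this is existing KS or Lucene API. One instance is reused and
mutated in place, and the state it has to carry is the count of
positions it has not yet skipped in the prx stream.

```python
import io
import struct


def write_ints(values):
    """Pack integers as fixed 4-byte big-endian values. A simplification:
    real postings files use variable-length ints and deltas."""
    return io.BytesIO(b"".join(struct.pack(">I", v) for v in values))


def read_int(stream):
    return struct.unpack(">I", stream.read(4))[0]


class MultiFilePosting:
    """Hypothetical Posting that assembles one record from two streams."""

    def __init__(self, frq_stream, prx_stream):
        self.frq_stream = frq_stream
        self.prx_stream = prx_stream
        self.doc_id = -1
        self.freq = 0
        self.positions = []
        # State forced on us by having two streams: positions belonging
        # to earlier documents that were never consumed.
        self._unread_positions = 0

    def read_record(self, want_positions=True):
        """Overwrite this object's fields with the next document's data."""
        self.doc_id = read_int(self.frq_stream)
        self.freq = read_int(self.frq_stream)
        if not want_positions:
            # Leave prx alone; just remember how far behind it now is.
            self._unread_positions += self.freq
            self.positions = []
            return
        if self._unread_positions:
            # Catch prx up before reading this document's positions.
            self.prx_stream.seek(4 * self._unread_positions, io.SEEK_CUR)
            self._unread_positions = 0
        self.positions = [read_int(self.prx_stream) for _ in range(self.freq)]


frq = write_ints([1, 3, 4, 1, 7, 2])   # (doc_id, freq) pairs
prx = write_ints([3, 9, 27, 5, 2, 8])  # positions only
posting = MultiFilePosting(frq, prx)
posting.read_record()
print(posting.doc_id, posting.positions)  # 1 [3, 9, 27]
posting.read_record(want_positions=False)
print(posting.doc_id, posting.positions)  # 4 []
posting.read_record()
print(posting.doc_id, posting.positions)  # 7 [2, 8]
```

A pure term query would call read_record(want_positions=False) every
time and never touch the prx stream at all.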

> Does KS allow non-compound format?


No, it doesn't.

> I would think running out of file descriptors is a common problem
> otherwise.

The default per-process limit for file descriptors on OS X is 256.
Under the Lucene non-compound file format, you're guaranteed to run
out of file descriptors eventually under normal usage. If KS allowed
a non-compound format, you'd also be guaranteed to run out of file
descriptors, just sooner. Since not failing at all is the only
acceptable outcome, there's not much practical difference.

I think there's more to be gained from tweaking out the VFS than in
accommodating a non-compound format. Saddling users with file
descriptor constraint worries and having to invoke ulimit all the time
sucks.
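
Back-of-the-envelope arithmetic, as a Python sketch. The segment
count, the files-per-segment figure, and the number of open readers
are made-up illustrative values, not measurements of any real index.
Only the 256 limit comes from the paragraph above.

```python
def descriptors_needed(segments, files_per_segment, open_readers=1):
    """Descriptors held open if every reader opens every file of every
    segment. Ignores stdin/stdout, sockets, and anything else the
    process has open."""
    return segments * files_per_segment * open_readers


def fits(segments, files_per_segment, open_readers=1, limit=256):
    return descriptors_needed(segments, files_per_segment, open_readers) <= limit


# Assumed: 10 segments, 8 files per segment in a non-compound layout.
print(descriptors_needed(10, 8, open_readers=1), fits(10, 8, 1))  # 80 True
print(descriptors_needed(10, 8, open_readers=4), fits(10, 8, 4))  # 320 False

# Assumed: a compound layout costs one descriptor per segment.
print(descriptors_needed(10, 1, open_readers=4), fits(10, 1, 4))  # 40 True
```

On a Unix system the real soft limit can be read with
resource.getrlimit(resource.RLIMIT_NOFILE) instead of hard-coding 256.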

>> My conclusion was that it was better to exploit the benefits of
>> bounded, single-purpose streams and simple file formats whenever
>> possible.
>>
>> There's also a middle way, where each *format* gets its own file.
>> Then you wind up with fewer files, but you have to track field
>> number state.
>>
>> The nice thing is that packet-scoped plugins can be compatible with
>> ALL of these configurations:
>
> Right.  This way users can pick & choose how to put things in the
> index (with "healthy" defaults, of course).

Well, IMO, we don't want the users to have to care about the container
classes.

Under the TermDocs/TermPositions model, every time you add new data,
you need to subclass the containers. Under the PostingList model, you
don't -- Posting plugs in.

For KS at least, the primary goal is to make Posting public and as
easy to subclass as possible -- because a public Posting plugin class
seems to me to be the easiest way to add custom flexible indexing
features like text payloads, or arbitrary integer values used by
custom function queries, or other schemes not yet considered.
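
To illustrate the plug-in idea, here is a minimal Python sketch. The
classes share names with the concepts above, but the method names,
constructor arguments, and record encoding are hypothetical and not
the actual KS interface. The point is only that adding a text payload
means subclassing Posting, while the PostingList container stays
untouched.

```python
import io
import struct


class Posting:
    """Hypothetical base plugin: knows how to decode one record."""

    def __init__(self):
        self.doc_id = -1

    def read_record(self, stream):
        (self.doc_id,) = struct.unpack(">I", stream.read(4))


class PayloadPosting(Posting):
    """Adds a length-prefixed UTF-8 text payload to each record."""

    def __init__(self):
        super().__init__()
        self.payload = ""

    def read_record(self, stream):
        super().read_record(stream)
        (length,) = struct.unpack(">H", stream.read(2))
        self.payload = stream.read(length).decode("utf-8")


class PostingList:
    """Generic container: walks the records and delegates all decoding
    to whichever Posting object was plugged in."""

    def __init__(self, stream, doc_count, posting):
        self.stream = stream
        self.remaining = doc_count
        self.posting = posting

    def next(self):
        if self.remaining == 0:
            return False
        self.posting.read_record(self.stream)
        self.remaining -= 1
        return True


def encode(doc_id, text):
    """Write one record in the made-up PayloadPosting layout."""
    data = text.encode("utf-8")
    return struct.pack(">IH", doc_id, len(data)) + data


stream = io.BytesIO(encode(2, "noun") + encode(5, "verb"))
plist = PostingList(stream, 2, PayloadPosting())
while plist.next():
    print(plist.posting.doc_id, plist.posting.payload)
```

Under the TermDocs/TermPositions model the equivalent change would
have meant a new container subclass for the payload-bearing data.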


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

