On Mon, Oct 19, 2009 at 11:25 AM, John Crenshaw
<johncrens...@priacta.com> wrote:
> On the other hand, we could add a tree inside each segment to index the
> doclist. The term would be looked up as normal at a cost of O(log nT).
> After that though, if the docid is known, it could be looked up at an
> additional cost of only O(log nD). The total cost O(log nT) + O(log nD)
> is only marginally worse than O(log nTD) (and only because nTD is the
> count of a natural join, rather than a true product of the counts.)

The current code deals with variably-sized doclists by aggregating
small ones into a leaf block, while large ones get their own leaf
block, which can require overflow pages (at the SQLite level).  One of
my ideas for future development was to augment the index btree to
store keys as term+docid.  Large doclists could then be spread across
multiple leaves and more efficiently accessed.  There would be an
extra byte per key in the interior nodes, plus there would be
generally more info above the leaf nodes, but I think it would be
pretty reasonable.
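
To make that concrete, here is a rough sketch of what a term+docid key
encoder could look like; the function name and the fixed 8-byte docid
encoding are invented for illustration and are not how the fts3 code
actually stores things:

  #include <string.h>

  /* Hypothetical sketch: encode a key as the term bytes followed by the
  ** docid in big-endian order, so that keys for the same term sort in
  ** docid order under byte-wise comparison.
  */
  static int encodeTermDocidKey(
    const char *zTerm, int nTerm,   /* Term text and its length */
    long long iDocid,               /* Docid to append to the key */
    char *pOut                      /* Output buffer, >= nTerm+8 bytes */
  ){
    int i;
    memcpy(pOut, zTerm, nTerm);
    /* Big-endian docid so byte-wise comparison sorts numerically. */
    for(i=0; i<8; i++){
      pOut[nTerm+i] = (char)((iDocid >> (56 - 8*i)) & 0xff);
    }
    return nTerm + 8;               /* Total key length in bytes */
  }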

I once had an off-list discussion with someone about more generally
modeling the index as a sorted set of <term, docid, posinfo> tuples,
which can then be delta-encoded into leaf nodes and built up
more-or-less as things currently work.  That would allow leaf nodes to
be uniformly-sized.  It was a big change, though, in an area I wasn't
enthusiastic about revisiting at the time.
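
As a rough sketch of the delta-encoding side of that (the varint helper
and function names here are made up for illustration, using the usual
7-bits-per-byte varint format):

  /* Hypothetical sketch: delta-encode a sorted run of docids for one
  ** term into a leaf buffer.  Each docid is stored as a varint delta
  ** from the previous one, which is what keeps doclists compact.
  */
  static int putVarint(char *p, long long v){
    int n = 0;
    do{
      p[n++] = (char)((v & 0x7f) | (v > 0x7f ? 0x80 : 0));
      v >>= 7;
    }while( v );
    return n;
  }

  static int encodeDocidDeltas(
    const long long *aDocid, int nDocid,  /* Sorted docids to encode */
    char *pOut                            /* Output buffer, assumed large enough */
  ){
    int i, n = 0;
    long long iPrev = 0;
    for(i=0; i<nDocid; i++){
      n += putVarint(&pOut[n], aDocid[i] - iPrev);  /* Store only the delta */
      iPrev = aDocid[i];
    }
    return n;  /* Bytes written */
  }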

> This still doesn't offer direct access to the doc lists by docid (you
> still need terms), but that problem should be easy to solve once the term +
> docid case is handled separately, because only the docid needs to be
> indexed at that point.

I think that this approach may not offer as much performance increase
as one might like.  Theoretically, you may read 1/10 as many pages for
a particular query, but when you consider seeks and streaming, the
gains may not be measurable when compared to a more stupid
implementation (read the existing doclist and trim it based on
inputs).  It really depends on things like fragmentation.

Which leads me to the next point.  There's no gain to doing any of
this if you cannot phrase the question in terms of existing SQLite
structures.  I'm not entirely certain that the current code has any
facility for running MATCH against a set of rowids.  I think you get
one at a time.  That's a big hit for any implementation, but if things
come in sorted order it may be something that can be masked by caching
some info in the cursor.

You could try writing a MATCH function where the required doclists are
faulted in on demand, and each row trims the inputs to just that docid
before executing the query.  If things come in sorted order you can
just open doclist-readers and scan once, so that part would drop from
N^2 to N.  I think doing this would be a reasonable starter project,
because it could be done with no storage-format changes, but it would
generate a familiarity with lots of bits of the code.  And I think
there's half a chance it would perform reasonably well for many cases.
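
As a sketch of the scan-once part, assuming rows arrive in ascending
docid order (the reader struct and helper below are hypothetical, not
an existing fts3 API):

  /* Hypothetical reader over one decoded, docid-sorted doclist. */
  typedef struct DoclistReader {
    const long long *aDocid;   /* Sorted docids in the doclist */
    int nDocid;                /* Number of docids */
    int iNext;                 /* Next position to examine */
  } DoclistReader;

  /* Advance to the first docid >= iDocid and report whether iDocid
  ** itself is present.  Because rows are assumed to arrive in ascending
  ** docid order the reader never rewinds, so matching N rows costs one
  ** pass over the doclist instead of N separate searches.
  */
  static int doclistReaderSeek(DoclistReader *p, long long iDocid){
    while( p->iNext<p->nDocid && p->aDocid[p->iNext]<iDocid ){
      p->iNext++;
    }
    return p->iNext<p->nDocid && p->aDocid[p->iNext]==iDocid;
  }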

-scott