On Mon, Oct 19, 2009 at 11:25 AM, John Crenshaw <johncrens...@priacta.com> wrote:

> On the other hand, we could add a tree inside each segment to index
> the doclist. The term would be looked up as normal at a cost of
> O(log nT). After that though, if the docid is known, it could be
> looked up at an additional cost of only O(log nD). The total cost
> O(log nT) + O(log nD) is only marginally worse than O(log nTD) (and
> only because nTD is the count of a natural join, rather than a true
> product of the counts.)
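(For concreteness, a minimal sketch of the two-level lookup being
proposed above, assuming the per-doclist docid index is just a sorted
array; the structure and function names here are hypothetical, not
actual fts code:)

    #include <sqlite3.h>

    /* Hypothetical doclist carrying a sorted docid index.  After the
    ** term is found in the segment's term btree (O(log nT)), a given
    ** docid can be located by binary search (O(log nD)). */
    typedef struct Doclist Doclist;
    struct Doclist {
      sqlite3_int64 *aDocid;    /* Sorted docids in this doclist */
      int nDocid;               /* Number of entries in aDocid */
    };

    /* Return the index of docid within pList, or -1 if not present. */
    static int doclistFindDocid(const Doclist *pList, sqlite3_int64 docid){
      int lo = 0, hi = pList->nDocid-1;
      while( lo<=hi ){
        int mid = lo + (hi-lo)/2;
        if( pList->aDocid[mid]==docid ) return mid;
        if( pList->aDocid[mid]<docid ){
          lo = mid+1;
        }else{
          hi = mid-1;
        }
      }
      return -1;
    }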
The current code deals with variably-sized doclists by aggregating
small ones into a leaf block, while large ones get their own leaf
block, which can require overflow pages (at the SQLite level). One of
my ideas for future development was to augment the index btree to
store keys as term+docid. Large doclists could then be spread across
multiple leaves and accessed more efficiently. There would be an extra
byte per key in the interior nodes, plus generally more info above the
leaf nodes, but I think it would be pretty reasonable.

I once had an off-list discussion with someone about more generally
modeling the index as a sorted set of <term, docid, posinfo> tuples,
which could then be delta-encoded into leaf nodes and built up
more-or-less as things currently work (a sketch of that encoding is at
the end of this message). That would allow leaf nodes to be uniformly
sized. It was a big change, though, in an area I wasn't enthusiastic
about revisiting at the time.

> This still doesn't offer direct access to the doc lists by docid (you
> still need terms) but that problem should be easy to solve once the
> term + docid case is handled separately, because only the docid needs
> to be indexed at that point.

I think this approach may not offer as much of a performance increase
as one might like. Theoretically you may read 1/10 as many pages for a
particular query, but once you account for seeks and streaming, the
gains may not be measurable compared to a more stupid implementation
(read the existing doclist and trim it based on inputs). It really
depends on things like fragmentation.

Which leads me to the next point. There's no gain to doing any of this
if you cannot phrase the question in terms of existing SQLite
structures. I'm not entirely certain that the current code has any
facility for running MATCH against a set of rowids; I think you get
one at a time. That's a big hit for any implementation, but if things
come in sorted order it may be something that can be masked by caching
some info in the cursor.

You could try writing a MATCH function where the necessary doclists
are faulted in on demand, and each row trims the inputs to just that
docid before executing the query. If things come in sorted order you
can just open doclist readers and scan once, so that part would drop
from N^2 to N (see the second sketch below).

I think doing this would be a reasonable starter project: it could be
done with no storage-format changes, and it would build familiarity
with lots of bits of the code. And I think there's half a chance it
would perform reasonably well for many cases.
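To illustrate the delta-encoding mentioned above: a minimal,
hypothetical sketch of writing a sorted docid list as varint-encoded
deltas. The real fts encoding also interleaves position lists, which
this omits, and putVarint/encodeDoclist are invented names:

    #include <sqlite3.h>

    /* Append v to pOut as a varint: 7 bits per byte, low-order bits
    ** first, continuation bit set on all but the final byte.  Returns
    ** the number of bytes written.  Assumes v>=0. */
    static int putVarint(unsigned char *pOut, sqlite3_int64 v){
      int n = 0;
      do {
        pOut[n++] = (unsigned char)((v & 0x7f) | 0x80);
        v >>= 7;
      } while( v!=0 );
      pOut[n-1] &= 0x7f;   /* Clear continuation bit on the last byte */
      return n;
    }

    /* Delta-encode the sorted list aDocid[0..nDocid-1] into aOut.
    ** Each docid is stored as the varint difference from its
    ** predecessor, so dense doclists cost roughly one byte per docid.
    ** Returns the number of bytes written. */
    static int encodeDoclist(const sqlite3_int64 *aDocid, int nDocid,
                             unsigned char *aOut){
      sqlite3_int64 prev = 0;
      int i, nOut = 0;
      for(i=0; i<nDocid; i++){
        nOut += putVarint(&aOut[nOut], aDocid[i] - prev);
        prev = aDocid[i];
      }
      return nOut;
    }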
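And a rough sketch of the single-scan idea from the last paragraph: if
the candidate rowids arrive in sorted order, each doclist reader only
ever moves forward, so matching a whole batch costs one pass instead
of one scan per row. The reader interface here is invented for
illustration, not the actual cursor machinery:

    #include <sqlite3.h>

    /* Hypothetical forward-only doclist reader. */
    typedef struct DoclistReader DoclistReader;
    struct DoclistReader {
      const sqlite3_int64 *aDocid;  /* Decoded, sorted docids */
      int nDocid;                   /* Number of docids */
      int iNext;                    /* Next position to examine */
    };

    /* Advance p past all docids less than iRow and report whether
    ** iRow itself is present.  Because candidate rowids arrive in
    ** sorted order, iNext never moves backward, so checking N
    ** candidates is O(N + nDocid) total rather than O(N * nDocid). */
    static int doclistReaderContains(DoclistReader *p, sqlite3_int64 iRow){
      while( p->iNext<p->nDocid && p->aDocid[p->iNext]<iRow ){
        p->iNext++;
      }
      return p->iNext<p->nDocid && p->aDocid[p->iNext]==iRow;
    }

-scott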