[ https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15307202#comment-15307202 ]
Lvton. Smith commented on LUCENE-6825: -------------------------------------- bq. It should give sizable performance gains (*smaller index, faster searching*) over what we have today, and even over what auto-prefix with efficient numeric terms would do. *smaller index* yes!! *faster searching??* I cannot understand that :( "Auto-prefix" uses less terms, although larger index. But, "bkd", I think, is equal to using "every number term between the range". Why faster? > Add multidimensional byte[] indexing support to Lucene > ------------------------------------------------------ > > Key: LUCENE-6825 > URL: https://issues.apache.org/jira/browse/LUCENE-6825 > Project: Lucene - Core > Issue Type: New Feature > Reporter: Michael McCandless > Assignee: Michael McCandless > Fix For: 6.0 > > Attachments: LUCENE-6825.patch, LUCENE-6825.patch > > > I think we should graduate the low-level block KD-tree data structure > from sandbox into Lucene's core? > This can be used for very fast 1D range filtering for numerics, > removing the 8 byte (long/double) limit we have today, so e.g. we > could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc. > It can also be used for > 1D use cases, like 2D (lat/lon) and 3D > (x/y/z with geo3d) geo shape intersection searches. > The idea here is to add a new part of the Codec API (DimensionalFormat > maybe?) that can do low-level N-dim point indexing and at runtime > exposes only an "intersect" method. > It should give sizable performance gains (smaller index, faster > searching) over what we have today, and even over what auto-prefix > with efficient numeric terms would do. > There are many steps here ... and I think adding this is analogous to > how we added FSTs, where we first added low level data structure > support and then gradually cutover the places that benefit from an > FST. > So for the first step, I'd like to just add the low-level block > KD-tree impl into oal.util.bkd, but make a couple improvements over > what we have now in sandbox: > * Use byte[] as the value not int (@rjernst's good idea!) > * Generalize it to arbitrary dimensions vs. specialized/forked 1D, > 2D, 3D cases we have now > This is already hard enough :) After that we can build the > DimensionalFormat on top, then cutover existing specialized block > KD-trees. We also need to fix OfflineSorter to use Directory API so > we don't fill up /tmp when building a block KD-tree. > A block KD-tree is at heart an inverted data structure, like postings, > but is also similar to auto-prefix in that it "picks" proper > N-dimensional "terms" (leaf blocks) to index based on how the specific > data being indexed is distributed. I think this is a big part of why > it's so fast, i.e. in contrast to today where we statically slice up > the space into the same terms regardless of the data (trie shifting, > morton codes, geohash, hilbert curves, etc.) > I'm marking this as trunk only for now... as we iterate we can see if > it could maybe go back to 5.x... -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org