[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

Nicholas Knize (JIRA) Tue, 06 Oct 2015 15:41:04 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945944#comment-14945944
 ]


Nicholas Knize commented on LUCENE-6825:
----------------------------------------

+1000 for graduating the data structure to core.

{{// Sort all docs once by lat, once by lon:}}  
* I'm assuming lat, lon specific naming will be refactored to more generalized 
naming?
* In an XTree implementation (similar to BKD but with more rigorous split 
criteria) I ran into bias issues when sorting by incremental dimensions (simple 
for loop like this). This is often why sort is done by a reduced dimensional 
encoding value (e.g., Hilbert, Morton). This is particularly important as the 
tree grows (which I'm guessing happens when BKD segments merge?). Maybe another 
new issue to investigate simple interleave packing and sorting on the packed 
value?

{{// Find which dim has the largest span so we can split on it:}}
* Maybe refactor this into a {{split}} method?  It would give an opportunity to 
override for investigating improvements based on other split criteria (e.g., 
squareness, area) 



> Add multidimensional byte[] indexing support to Lucene
> ------------------------------------------------------
>
>                 Key: LUCENE-6825
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6825
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: Trunk
>
>         Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
>     2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

Reply via email to