[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

Michael McCandless (JIRA) Mon, 12 Oct 2015 14:53:27 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14953813#comment-14953813
 ]


Michael McCandless commented on LUCENE-6825:
--------------------------------------------

bq. have you considered a new module for this

Well, I think a codec format is a natural way to expose this service, since 
(like postings, doc values, etc.), it's a low-level utility that can be used 
for diverse use cases (2D and 3D spatial, numeric range filtering, binary range 
filtering so we can support IPv6, BigInteger, BigDecimal, etc.).  For it to be 
exposed as a part of the codec means it needs to be in core...


> Add multidimensional byte[] indexing support to Lucene
> ------------------------------------------------------
>
>                 Key: LUCENE-6825
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6825
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: Trunk
>
>         Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
>     2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

Reply via email to