[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

Michael McCandless (JIRA) Wed, 07 Oct 2015 01:26:48 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946494#comment-14946494
 ]


Michael McCandless commented on LUCENE-6825:
--------------------------------------------

Thanks [~rjernst]!

bq. I see a number of debugging s.o.p, 

I'll nuke all of these once things seem to be working ... also, those 
{{bytesToInt}} in the reader/writer are temporary just for debugging the test 
case that's encoding ints into the byte[].

{quote}
Is this nocommit still relevant?

// nocommit can we get this back?
//state.docs.grow(count);
{quote}

Unfortunately, yes.  One source of efficiency for the 1D,2D,3D specialized BKDs 
now is they can "notify" the consumer of the incoming docIDs that a chunk is 
coming ... but I didn't expose any means to do this in the intersect API.  I 
could add it but held back for now since that dirties the read-time API.  
Though for the 1D case, this was a big speedup, because in that case I can give 
a pretty accurate estimate up front of how large the result set will be ... 
I'll mull, probably leave as a TODO for now.

bq. Can we avoid the bitflip in bytesToInt by using a long accumulator and 
casting to int?

Hmm I don't think that's the same thing?  We flip that big so that in binary 
space negative ints sort before positive ints?  But remember this is only a 
test issue at this point (for this issue) ... in future issues we will need 
clever "encode/decode foobar as byte[]" ...

bq. Can we limit the num dims to 3 for now?

How about limit to 15?  I think 3 is a bit too low (and 255 is a way too 
high!), but there are maybe interesting things to do w/ more than 3 dims, e.g. 
index household income along with x, y, z points or something.  So this would 
still give us 4 free bits in case we want them later...

bq. In `BKDWriter.finish()`, can the try/catch around building be simplified?

I'll try your suggestion.  I think if I change destroy to close instead, I can 
just tap into IOUtils goodness...

bq. In `BKDWriter.build()`, there is a nocommit about destroying perdim 
writers, but I think that is handled by the caller in `finish()` 

It is confusing!

Each level of the recursion may have written its own files, so I think we need 
the toplevel destroy (removes the fully sorted by each dim file we created up 
front) and the destroy as we recurse (removes the partitioned subset of each 
dim).

> Add multidimensional byte[] indexing support to Lucene
> ------------------------------------------------------
>
>                 Key: LUCENE-6825
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6825
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: Trunk
>
>         Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
>     2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

Reply via email to