[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945626#comment-14945626
 ] 

Ryan Ernst commented on LUCENE-6825:
------------------------------------

This looks good. I think this is a great plan: splitting this out into its own 
codec format and doing it in pieces. This patch is a good start. 

A few comments on the patch:
* I see a number of debugging System.out.println calls; these should be commented out or removed?
* Is this nocommit still relevant?
{code}
// nocommit can we get this back?
//state.docs.grow(count);
{code}
* There is a nocommit on the {{bytesToInt}} method. I think as a follow-up we 
should investigate packing values, but is the nocommit still needed? Also, it 
seems to exist in both reader and writer? Could it be in one place, but still 
package-private, perhaps in {{oal.util.bkd.Util}}?
* Can we avoid the bitflip in {{bytesToInt}} by using a long accumulator and 
casting to int? We could assert that no upper bits are set before the cast. 
(See the {{bytesToInt}} sketch after this list.)
* Can we limit the number of dims to 3 for now? I see a check for < 255 to fit 
in a byte, but it might be nice later to use those extra bits for some other 
information (essentially, let's reserve the bits for now instead of allowing a 
silly number of dimensions).
* In {{BKDWriter.finish()}}, can the try/catch around building be simplified? I 
think you could remove the {{success}} marker, do the regular cleanup after 
build, change the {{finally}} to a {{catch}}, and then add any failures from 
destroying the per-dim writers as suppressed exceptions on the original? 
(See the {{finish()}} sketch after this list.)
* In {{BKDWriter.build()}}, there is a nocommit about destroying the per-dim 
writers, but I think that is handled by the caller in {{finish()}} as mentioned 
in my previous comment? I also see some destroy calls below that... is there 
double destroying going on, or is this more complicated than it looks?
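
For the {{bytesToInt}} point, here is a minimal sketch of the long-accumulator 
idea (purely illustrative; the method name and the big-endian 4-byte layout are 
assumptions, not necessarily what the patch does):

{code:java}
// Illustrative sketch only: decode a big-endian 4-byte value via a long
// accumulator, asserting the upper 32 bits are clear before the narrowing cast.
static int bytesToInt(byte[] bytes, int offset) {
  long v = 0;
  for (int i = 0; i < 4; i++) {
    // Mask each byte to avoid sign extension while accumulating:
    v = (v << 8) | (bytes[offset + i] & 0xFFL);
  }
  assert (v & 0xFFFFFFFF00000000L) == 0 : "upper bits must not be set";
  return (int) v;
}
{code}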
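
And for the {{BKDWriter.finish()}} point, a minimal, self-contained sketch of 
the shape I had in mind (hypothetical names, not the actual patch code):

{code:java}
// Sketch only: no `success` flag; regular cleanup on the success path, and on
// failure destroy the per-dim writers, attaching secondary failures as
// suppressed exceptions on the original.
import java.io.IOException;
import java.util.List;

class FinishSketch {
  interface PerDimWriter {
    void destroy() throws IOException;
  }

  long finish(List<PerDimWriter> perDimWriters) throws IOException {
    try {
      long indexFP = build();          // may throw
      destroyAll(perDimWriters);       // regular cleanup after a successful build
      return indexFP;
    } catch (IOException e) {
      for (PerDimWriter w : perDimWriters) {
        try {
          w.destroy();
        } catch (Throwable t) {
          e.addSuppressed(t);          // keep the build failure as the primary exception
        }
      }
      throw e;
    }
  }

  long build() throws IOException {
    return 0L;                         // stand-in for the real work
  }

  void destroyAll(List<PerDimWriter> writers) throws IOException {
    for (PerDimWriter w : writers) {
      w.destroy();
    }
  }
}
{code}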


> Add multidimensional byte[] indexing support to Lucene
> ------------------------------------------------------
>
>                 Key: LUCENE-6825
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6825
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: Trunk
>
>         Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
>     2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...
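
To make the "intersect" idea above a bit more concrete, here is a purely 
speculative sketch of what a visitor-style intersect API could look like; none 
of these names come from the patch, they are only illustrative:

{code:java}
// Purely speculative illustration of an "intersect-only" runtime API;
// all names here are hypothetical and not taken from the patch.
interface DimensionalReaderSketch {

  /** How a leaf cell's bounding box relates to the query shape. */
  enum Relation { CELL_INSIDE_QUERY, CELL_CROSSES_QUERY, CELL_OUTSIDE_QUERY }

  /** Callback the query supplies; the reader pushes matching docs/values to it. */
  interface IntersectVisitor {
    /** Lets the reader skip or bulk-accept whole subtrees / leaf blocks. */
    Relation compare(byte[] minPackedValue, byte[] maxPackedValue);

    /** Called for each hit inside a crossing leaf block. */
    void visit(int docID, byte[] packedValue) throws java.io.IOException;
  }

  /** The only runtime entry point: walk the tree for one field. */
  void intersect(String field, IntersectVisitor visitor) throws java.io.IOException;
}
{code}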



