[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

Nicholas Knize (JIRA) Thu, 16 Apr 2015 07:44:42 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498096#comment-14498096
 ]


Nicholas Knize commented on LUCENE-6422:
----------------------------------------

Awesome work putting together the benchmark (even a rough one).  Can you attach 
the spatial.alg file you used so I can verify? (I'm assuming it's a variant on 
one you've posted before.) I'm getting very different numbers with "production" 
data sets (e.g., high-res peninsulas, islands, global political boundaries, 
planet OSM data; in other words, more exotic shapes than circles larger than a 
few hundred km).

A few questions... and other observations.

bq. chosen arbitrarily; with 27 it choked on memory given 2GB heap

What choked specifically? I'm using PackedQuad with depth between 26 and 29 
and a 1GB heap, with the shapes I described above.

bq. disabling leafy branch pruning to compare apples to apples

Out of curiosity, why is this option enabled by default if it uses transient 
storage that doubles memory consumption? Seems backwards to me. 

bq. I was skeptical there would be index size savings and the benchmark shows 
there aren't any.

IMHO I would avoid these kinds of absolute statements (especially given the 
highly variable nature of spatial use-cases). In this situation your numbers do 
not surprise me, given that you disabled the leafyBranchPrune option (I'm still 
confused about why it's there) and used a single shape type with variable size.  
There is an outstanding issue in the existing patch (I'll see if I can push 
out a fix today): the TermsEnum is returning terms with a BytesRef whose 
byte[] is double the size it should be (e.g., 16 bytes instead of 8, all 
padded with zeros). I suspect it's some improper configuration in the 
reader.  So for every high-res cell, the term will be 16 bytes (still 
better than, say, 27).  

I think we can do better on simulated test data in the test framework. I love 
the randomization, and the few "real" data sets that are there are great. 
They do not provide the coverage necessary, though, to simulate some real-world 
scenarios. That's okay; to steal Mike's quote, "progress not perfection". 
I'll definitely work to provide some more real-world tests so we have better 
coverage and benchmarking options using "real world" data. It's a good way to 
recommend one indexing structure over another. (This one's just the beginning: 
there are more indexing structures in trial mode, and even more improvements 
for the packed version.)

Let's keep this going... Since this patch is non-destructive I don't see a 
reason it can't be committed as another option and I can submit enhancement 
patches to this feature.  That would be up to the community. 

> Add StreamingQuadPrefixTree
> ---------------------------
>
>                 Key: LUCENE-6422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6422
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/spatial
>    Affects Versions: 5.x
>            Reporter: Nicholas Knize
>         Attachments: LUCENE-6422.patch, 
> LUCENE-6422_with_SPT_factory_and_benchmark.patch
>
>
> To conform to Lucene's inverted index, SpatialStrategies use strings to 
> represent QuadCells and GeoHash cells, yielding 1 byte per QuadCell and 5 
> bits per GeoHash cell, respectively.  To create the terms representing a 
> Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
> ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
> invert(), which creates a second lexicographically sorted array of Terms. This 
> doubles the memory consumption when indexing a shape.
> This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
> accomplish the following:
> 1.  Create a packed 8-byte representation for a QuadCell
> 2.  Build the packed cells 'on demand' when incrementToken is called
> Improvements over this approach include the generation of the packed cells 
> using an AutoPrefixAutomaton
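
For readers following along, the two numbered points in the description can be 
illustrated with a minimal sketch in plain Java. This is not the encoding used 
by the attached patch: the class name PackedQuadSketch, the bit layout (2 bits 
per level, left-aligned, with the level stored in the low 6 bits), and the 
helper names are all assumptions made for illustration. It also covers only the 
simplest case, a single point, where the cells to emit are just the ancestors 
of one leaf cell.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical sketch only; NOT the layout used by the LUCENE-6422 patch. */
final class PackedQuadSketch {
  // Assumed layout: 29 levels * 2 bits = 58 path bits (bits 63..6),
  // and the low 6 bits store the level (0..29).
  static final int MAX_LEVELS = 29;
  static final long LEVEL_MASK = 0x3FL;

  /** Packs quadrant numbers (0..3, root first) into one 8-byte long. */
  static long pack(int[] quadrants) {
    if (quadrants.length > MAX_LEVELS) {
      throw new IllegalArgumentException("at most " + MAX_LEVELS + " levels");
    }
    long packed = 0L;
    for (int i = 0; i < quadrants.length; i++) {
      int q = quadrants[i];
      if (q < 0 || q > 3) {
        throw new IllegalArgumentException("quadrant must be 0..3: " + q);
      }
      // Level i occupies bits (63 - 2i) and (62 - 2i).
      packed |= ((long) q) << (62 - 2 * i);
    }
    return packed | quadrants.length;
  }

  /** Number of levels encoded in the packed cell. */
  static int level(long cell) {
    return (int) (cell & LEVEL_MASK);
  }

  /** Quadrant (0..3) at zero-based level index i. */
  static int quadrantAt(long cell, int i) {
    return (int) ((cell >>> (62 - 2 * i)) & 3L);
  }

  /**
   * Lazily yields the packed ancestors of a cell, shallowest first, ending
   * with the cell itself. Each value is computed when next() is called, so
   * no intermediate list of cells is ever held in memory.
   */
  static Iterator<Long> prefixes(final long cell) {
    final int depth = level(cell);
    return new Iterator<Long>() {
      private int next = 1;

      @Override
      public boolean hasNext() {
        return next <= depth;
      }

      @Override
      public Long next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        // Keep the top 2 * next path bits, clear the rest, store the level.
        long mask = -1L << (64 - 2 * next);
        return (cell & mask) | next++;
      }
    };
  }
}
```

With this assumed layout, Long.compareUnsigned orders a parent before any of 
its descendants, which is roughly the property the string-based terms get from 
lexicographic sorting. A real streaming strategy would also have to test each 
cell against the shape while descending the tree, which this sketch leaves out.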



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
