[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498567#comment-14498567
 ] 

David Smiley commented on LUCENE-6422:
--------------------------------------

I welcome any suggestions you have pertaining naming (or again on anything).  
If you can suggest a better name to what "leafy branch pruning" does, then at a 
minimum it could be expressed in the javadocs.  Not long ago it had another 
name but it was even worse ;-).  Naming is hard.  Likewise... if 
SpatialPrefixTree might have a better literature/industry based name then I'd 
love to know what that is.  When I named it as-such I looked around but didn't 
seem to find anything that was perfect.  It's a type of Trie, a spatial trie, 
and "PrefixTree" is a synonym for Trie, so... there you go.  Maybe I didn't 
look hard enough.  I'm sorry if I _seemed_ touchy on the terminology; I merely 
want to ensure we mutually understood what we were and weren't talking about -- 
that's all.  So when you proposed that what StreamingQuadPrefixTree did was 
leafy branch pruning, and as the coiner of the term I can see it didn't meet my 
definition; clarification is in order.

bq. I want to make sure there is no confusion here (especially for anyone else 
willing to participate and the sensitivity surrounding my use of the 
pruneLeafyBranches term). SPT's in the context of lucene-spatial's definition, 
no? (since SPT is not a common name for a geo data structure its a derivative)

Yes, in the context of lucene-spatial's definition.  

bq. There are use-cases for traversing beyond the within state in GeoSpatial 
data structures. So someone might come along and contribute the "unexpected". 
Just want to be prepared for that future discussion.

Interesting; I'm curious what that might be.

bq. A lot of it is quite big so I'll have to add to the build.xml to pull a 
compressed version from somewhere.

Yes, as you may have observed, this is how all existing test data is handled in 
the benchmark module:  fetched & decompressed on-demand.  I added the Geonames 
point data.  And I added facilities to SpatialDocMaker to automatically turn 
the points into circles & rects of random size.

I'm looking forward to seeing PackedQuadPrefixTree kick ass in being compressed 
allowing distErrPct=0 -- I'm still puzzled but I look forward to being 
pleasantly astonished. I do understand that it uses all 8 bits (instead of just 
2 as the legacy impl) of the first number of bytes for the quadrant info, and 
that should lead to shorter terms / prefixes for higher-precision data.  But I 
don't see that there would be any change in _the number_ of terms, which is, as 
I see, the scalability challenge.  I have an idea on solving this floating in 
my head but maybe I needn't ponder it longer if PackedQuadPrefixTree handles it 
somehow.  It's not obvious to me but where in the code of PackedQuadCell are 
the 5 "depth bits" encoded & decoded?

bq. In essence, the rough benchmark you performed doesn't benchmark default 
Lucene spatial behavior. Is that not a problem?

It is a problem -- it's not ideal, that is.  Preferably it would stay enabled 
but I think you indicated it's not supported by PackedQuadTree?  I didn't look 
closer.

RE Geo3d:  As an example for anything with me; it's an outlier, not an example 
that proves the rule.  If I am blind to facts I can't  see then show me.  
Hopefully you noticed that Karl and I are working together and it's come real 
far now (lots of discussion on ReviewBoard & bugs I helped Karl find via 
randomized testing).  Your patch here has nothing in common with it -- 
PackedQuadTree _obviously_ belongs here, and it should be quite evident that 
I'm here helping by providing feedback and running a benchmark. And yes, being 
critical.  Anyway... lets get back to work.

The only thing about this patch that is a blocker (-1) for me is 
StreamingPrefixTreeStrategy.  Not most of the code in it, but it's existence as 
a SpatialStrategy subclass instead of an SPT subclass.  I don't mind that it 
optimizes something that I don't think needed optimization, though I suspect if 
you noticed TreeCellIterator originally you wouldn't have bothered.

I have other by-line feedback I'd like to give (and would _prefer_ to do so via 
ReviewBoard/GitHub) but we needn't steamroll this in  under the mantra of 
"progress not perfection".  

> Add StreamingQuadPrefixTree
> ---------------------------
>
>                 Key: LUCENE-6422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6422
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/spatial
>    Affects Versions: 5.x
>            Reporter: Nicholas Knize
>         Attachments: LUCENE-6422.patch, 
> LUCENE-6422_with_SPT_factory_and_benchmark.patch
>
>
> To conform to Lucene's inverted index, SpatialStrategies use strings to 
> represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
> bits per GeoHash cell, respectively.  To create the terms representing a 
> Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
> ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
> invert() which creates a second lexicographically sorted array of Terms. This 
> doubles the memory consumption when indexing a shape.
> This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
> accomplish the following:
> 1.  Create a packed 8byte representation for a QuadCell
> 2.  Build the Packed cells 'on demand' when incrementToken is called
> Improvements over this approach include the generation of the packed cells 
> using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to