[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496135#comment-14496135
 ] 

David Smiley commented on LUCENE-6422:
--------------------------------------

*Awesome work Nick!*  It's so nice to see meaty spatial contributions like this 
(Geo3d is another example).

RE "Streaming" (transient memory use while indexing):  I appreciate that the 
out-of-the-box configuration of RPT with either LegacyPrefixTree (be it quad or 
geohash) will use a lot of memory for indexing.  But since... I don't know how 
long now, this only occurs if the "leafy branch pruning" optimization is 
enabled on RPT.  That algorithm, which lives in RecursivePrefixTreeStrategy, 
unfortunately buffers all the cells.  It's somewhat simple; it could be improved 
to not buffer all cells, but it would still need to buffer some.  Recently I did 
some benchmarking and found that leafy branch pruning yielded lots of index size 
savings, particularly with the quad tree.  I'd love to chat with you about the 
subject of "leaves" on the SPT and an idea I have on doing better.  Anyway, I 
suggest you run another memory benchmark with leafy branch pruning disabled, 
using the PackedQuadPrefixTree but not the StreamingQuad...Strategy.  With 
pruning disabled, the underlying BytesRefIteratorTokenStream will consume an 
Iterator<Cell> that is a direct instance of TreeCellIterator, and then you get 
the "streaming" effect.  The existing TreeCellIterator is quite similar to the 
Streaming...PrefixTreeIterator here.  If I'm right that the streaming strategy 
then brings no appreciable memory savings, that part of the patch can be 
removed as it's redundant.
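
To make the buffering-versus-streaming distinction concrete, here is a minimal, 
self-contained sketch.  It is NOT Lucene code: StreamingCellsSketch, 
LazyQuadIterator, and the "A".."D" cell tokens are hypothetical names made up 
for illustration, and the sketch enumerates every cell of a full quad tree 
rather than only the cells intersecting a shape.  It only shows why a consumer 
that pulls from a lazy depth-first iterator holds a handful of pending cells, 
while a consumer that collects into a list first holds all of them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical illustration only; not Lucene's TreeCellIterator. */
public class StreamingCellsSketch {

    /** Lazily walks a full quad tree depth-first (parent before children). */
    static final class LazyQuadIterator implements Iterator<String> {
        private static final char[] QUADS = {'A', 'B', 'C', 'D'};
        private final int maxLevel;
        // Holds only the cells that are known but not yet emitted.
        private final Deque<String> stack = new ArrayDeque<>();

        LazyQuadIterator(int maxLevel) {
            if (maxLevel < 1) {
                throw new IllegalArgumentException("maxLevel must be >= 1");
            }
            this.maxLevel = maxLevel;
            // Push in reverse so that 'A' is emitted first.
            for (int i = QUADS.length - 1; i >= 0; i--) {
                stack.push(String.valueOf(QUADS[i]));
            }
        }

        @Override
        public boolean hasNext() {
            return !stack.isEmpty();
        }

        @Override
        public String next() {
            if (stack.isEmpty()) {
                throw new NoSuchElementException();
            }
            String cell = stack.pop();
            // Children are created only when their parent is consumed.
            if (cell.length() < maxLevel) {
                for (int i = QUADS.length - 1; i >= 0; i--) {
                    stack.push(cell + QUADS[i]);
                }
            }
            return cell;
        }

        /** Number of cells currently held in memory by the iterator. */
        int pendingCells() {
            return stack.size();
        }
    }

    /** The buffering alternative: materialize every cell before handing them on. */
    static List<String> buffered(int maxLevel) {
        List<String> all = new ArrayList<>();
        new LazyQuadIterator(maxLevel).forEachRemaining(all::add);
        return all;
    }
}
```

With maxLevel = 2 the lazy iterator emits 20 cells in sorted order while never 
holding more than 7 pending cells, whereas buffered(2) holds all 20 at once.  
The gap widens quickly with depth, since the pending count grows linearly with 
the level and the total grows exponentially.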

I really like the new PackedQuadPrefixTree.java.  (IMO that's what this JIRA 
issue is mostly about.)  Can you consider _not_ subclassing Legacy*?  I'd like 
to leave the legacy trees as-is and have new SPTs not inherit from them.  Can 
you base your next patch off of trunk?  And can you *either* post on 
reviewboard.apache.org or use a GitHub fork & branch so I can provide 
line-by-line feedback?

> Add StreamingQuadPrefixTree
> ---------------------------
>
>                 Key: LUCENE-6422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6422
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/spatial
>    Affects Versions: 5.x
>            Reporter: Nicholas Knize
>         Attachments: LUCENE-6422.patch
>
>
> To conform to Lucene's inverted index, SpatialStrategies use strings to 
> represent QuadCells and GeoHash cells, yielding 1 byte per QuadCell and 5 
> bits per GeoHash cell, respectively.  To create the terms representing a 
> Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
> ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
> invert(), which creates a second lexicographically sorted array of Terms. This 
> doubles the memory consumption when indexing a shape.
> This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
> accomplish the following:
> 1.  Create a packed 8-byte representation for a QuadCell
> 2.  Build the packed cells 'on demand' when incrementToken is called
> Improvements over this approach include the generation of the packed cells 
> using an AutoPrefixAutomaton
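
As a rough illustration of item 1 in the quoted description, a quad cell's 
path can be packed into a single 8-byte long.  The bit layout below is an 
assumption chosen for this sketch (2 bits per level from the top of the word, 
then a leaf flag, then 5 bits for the level); it is not taken from the patch, 
and PackedQuadCellSketch and its methods are hypothetical names.

```java
/** Hypothetical bit layout for illustration; not the layout used by the patch. */
public class PackedQuadCellSketch {
    /** 29 levels * 2 bits = 58 path bits, leaving 6 bits for leaf flag + level. */
    static final int MAX_LEVELS = 29;
    private static final int LEVEL_BITS = 5;                        // bits 4..0
    private static final long LEVEL_MASK = (1L << LEVEL_BITS) - 1;
    private static final long LEAF_BIT = 1L << LEVEL_BITS;          // bit 5

    /**
     * Packs a cell path into one long.
     * quadrants[i] is the quadrant (0..3) chosen at level i + 1.
     */
    static long pack(int[] quadrants, boolean leaf) {
        if (quadrants.length > MAX_LEVELS) {
            throw new IllegalArgumentException("too many levels: " + quadrants.length);
        }
        long packed = 0L;
        for (int i = 0; i < quadrants.length; i++) {
            int q = quadrants[i];
            if (q < 0 || q > 3) {
                throw new IllegalArgumentException("quadrant out of range: " + q);
            }
            // Level 1 occupies bits 63..62, level 2 bits 61..60, and so on.
            packed |= ((long) q) << (62 - 2 * i);
        }
        if (leaf) {
            packed |= LEAF_BIT;
        }
        return packed | quadrants.length;
    }

    /** Depth of the cell (number of quadrant choices stored). */
    static int level(long packed) {
        return (int) (packed & LEVEL_MASK);
    }

    static boolean isLeaf(long packed) {
        return (packed & LEAF_BIT) != 0;
    }

    /** Quadrant (0..3) chosen at the given 1-based level. */
    static int quadrantAt(long packed, int level) {
        return (int) ((packed >>> (64 - 2 * level)) & 3L);
    }
}
```

Under this assumed layout a cell costs a fixed 8 bytes regardless of depth, 
compared with one byte per level for the string form described above, so the 
packed form is smaller only for cells deeper than 8 levels.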



