[
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498567#comment-14498567
]
David Smiley commented on LUCENE-6422:
--------------------------------------
I welcome any suggestions you have pertaining naming (or again on anything).
If you can suggest a better name to what "leafy branch pruning" does, then at a
minimum it could be expressed in the javadocs. Not long ago it had another
name but it was even worse ;-). Naming is hard. Likewise... if
SpatialPrefixTree might have a better literature/industry based name then I'd
love to know what that is. When I named it as-such I looked around but didn't
seem to find anything that was perfect. It's a type of Trie, a spatial trie,
and "PrefixTree" is a synonym for Trie, so... there you go. Maybe I didn't
look hard enough. I'm sorry if I _seemed_ touchy on the terminology; I merely
want to ensure we mutually understood what we were and weren't talking about --
that's all. So when you proposed that what StreamingQuadPrefixTree did was
leafy branch pruning, and as the coiner of the term I can see it didn't meet my
definition; clarification is in order.
bq. I want to make sure there is no confusion here (especially for anyone else
willing to participate and the sensitivity surrounding my use of the
pruneLeafyBranches term). SPT's in the context of lucene-spatial's definition,
no? (since SPT is not a common name for a geo data structure its a derivative)
Yes, in the context of lucene-spatial's definition.
bq. There are use-cases for traversing beyond the within state in GeoSpatial
data structures. So someone might come along and contribute the "unexpected".
Just want to be prepared for that future discussion.
Interesting; I'm curious what that might be.
bq. A lot of it is quite big so I'll have to add to the build.xml to pull a
compressed version from somewhere.
Yes, as you may have observed, this is how all existing test data is handled in
the benchmark module: fetched & decompressed on-demand. I added the Geonames
point data. And I added facilities to SpatialDocMaker to automatically turn
the points into circles & rects of random size.
I'm looking forward to seeing PackedQuadPrefixTree kick ass in being compressed
allowing distErrPct=0 -- I'm still puzzled but I look forward to being
pleasantly astonished. I do understand that it uses all 8 bits (instead of just
2 as the legacy impl) of the first number of bytes for the quadrant info, and
that should lead to shorter terms / prefixes for higher-precision data. But I
don't see that there would be any change in _the number_ of terms, which is, as
I see, the scalability challenge. I have an idea on solving this floating in
my head but maybe I needn't ponder it longer if PackedQuadPrefixTree handles it
somehow. It's not obvious to me but where in the code of PackedQuadCell are
the 5 "depth bits" encoded & decoded?
bq. In essence, the rough benchmark you performed doesn't benchmark default
Lucene spatial behavior. Is that not a problem?
It is a problem -- it's not ideal, that is. Preferably it would stay enabled
but I think you indicated it's not supported by PackedQuadTree? I didn't look
closer.
RE Geo3d: As an example for anything with me; it's an outlier, not an example
that proves the rule. If I am blind to facts I can't see then show me.
Hopefully you noticed that Karl and I are working together and it's come real
far now (lots of discussion on ReviewBoard & bugs I helped Karl find via
randomized testing). Your patch here has nothing in common with it --
PackedQuadTree _obviously_ belongs here, and it should be quite evident that
I'm here helping by providing feedback and running a benchmark. And yes, being
critical. Anyway... lets get back to work.
The only thing about this patch that is a blocker (-1) for me is
StreamingPrefixTreeStrategy. Not most of the code in it, but it's existence as
a SpatialStrategy subclass instead of an SPT subclass. I don't mind that it
optimizes something that I don't think needed optimization, though I suspect if
you noticed TreeCellIterator originally you wouldn't have bothered.
I have other by-line feedback I'd like to give (and would _prefer_ to do so via
ReviewBoard/GitHub) but we needn't steamroll this in under the mantra of
"progress not perfection".
> Add StreamingQuadPrefixTree
> ---------------------------
>
> Key: LUCENE-6422
> URL: https://issues.apache.org/jira/browse/LUCENE-6422
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/spatial
> Affects Versions: 5.x
> Reporter: Nicholas Knize
> Attachments: LUCENE-6422.patch,
> LUCENE-6422_with_SPT_factory_and_benchmark.patch
>
>
> To conform to Lucene's inverted index, SpatialStrategies use strings to
> represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5
> bits per GeoHash cell, respectively. To create the terms representing a
> Shape, the BytesRefIteratorTokenStream first builds all of the terms into an
> ArrayList of Cells in memory, then passes the ArrayList.Iterator back to
> invert() which creates a second lexicographically sorted array of Terms. This
> doubles the memory consumption when indexing a shape.
> This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to
> accomplish the following:
> 1. Create a packed 8byte representation for a QuadCell
> 2. Build the Packed cells 'on demand' when incrementToken is called
> Improvements over this approach include the generation of the packed cells
> using an AutoPrefixAutomaton
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]