[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

Nicholas Knize (JIRA) Fri, 17 Apr 2015 11:06:24 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500327#comment-14500327
 ]


Nicholas Knize commented on LUCENE-6422:
----------------------------------------

bq.  If you can suggest a better name to what "leafy branch pruning" does, then 
at a minimum it could be expressed in the javadocs

++.  Just 'prune' is probably more clear since its universally used all over 
data structures.  We can add a javadoc comment that describes it in further 
detail if necessary.

bq. if SpatialPrefixTree might have a better literature/industry based name 
then I'd love to know what that is.

There are trie based spatial trees all over (kd-Trie, kd-b-trie, buddy tree) 
industry and the literature. The one you call QuadPrefixTree was originally 
introduced in 1991 called the QuadTrie.  (reference: Gaston H. Gonnet and 
Ricardo Baeza-Yates, Handbook of Algorithms and Data Structures -- in Pascal 
and C, 2nd edition, Addison-Wesley, 1991.)  Dr. Hanan Samet from UMD has a 
great section on MX and PR QuadTrees (same as QuadPrefixTree and a name someone 
mentioned to you in another issue).  He provides a nice discussion on the 
differences between MX, PR and their point based counterparts (compared by the 
decomposition methods).  There's certainly nothing wrong with an implementation 
specific name. If you are asking for suggestions then I offer: SpatialTrie, 
GeoHashTrie, QuadTrie as being shorter, more to the point, and probably more 
relate-able to other spatial SMEs (whom I'm hoping would be willing to get more 
involved). 

bq. It's not obvious to me but where in the code of PackedQuadCell are the 5 
"depth bits" encoded & decoded?

PackedQuadCell.getLevel() decodes, and its encoded in PackedQuadCell.nextTerm()

bq. Preferably it would stay enabled but I think you indicated it's not 
supported by PackedQuadTree? I didn't look closer.

That's correct.

bq. I also wonder whether we need a new, lighter weight spatial module 
(spatial2? spatia_light?), or maybe spatial_sandbox, where the barrier is lower?

+++ I think this is a great idea for experimental/research features we don't 
want cluttering up the spatial module.

bq. RE abstractions: I respect your opinion although I don't agree that there 
are too many here.

IMHO, this is a slippery slope.  There are so many diverse spatial data 
structures we should be taking a look at for improving spatial search in higher 
order dimensions (4D space-time for just a start).  That's a personal interest 
area for me, in how the most powerful high dimension structures (that already 
exist) can fit within the design and implementation of lucene-core (a green 
field to explore).  Something like this does require a sophisticated 
abstraction framework and this particular one has a bit of a learning curve. I 
think that can work itself out over time with a bit of refactoring (which it 
sounds like all are open to?).  In the meantime it does set the bar rather high 
for new contributors. This is another +1 for a spatial sandbox for experimental 
research (heck make it a separate repo). 

bq. Sigh; these conversations are stressfull 

They're very verbose, but maybe that's the kick in the pants needed to help the 
spatial module really take off. That is, after all, the common goal of the 
community?




> Add StreamingQuadPrefixTree
> ---------------------------
>
>                 Key: LUCENE-6422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6422
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/spatial
>    Affects Versions: 5.x
>            Reporter: Nicholas Knize
>         Attachments: LUCENE-6422.patch, LUCENE-6422.patch, 
> LUCENE-6422_with_SPT_factory_and_benchmark.patch
>
>
> To conform to Lucene's inverted index, SpatialStrategies use strings to 
> represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
> bits per GeoHash cell, respectively.  To create the terms representing a 
> Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
> ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
> invert() which creates a second lexicographically sorted array of Terms. This 
> doubles the memory consumption when indexing a shape.
> This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
> accomplish the following:
> 1.  Create a packed 8byte representation for a QuadCell
> 2.  Build the Packed cells 'on demand' when incrementToken is called
> Improvements over this approach include the generation of the packed cells 
> using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

Reply via email to