[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498096#comment-14498096 ]
Nicholas Knize commented on LUCENE-6422:
----------------------------------------

Awesome job putting together the benchmark (even a rough one). Can you attach the spatial.alg file you used so I can verify? (I'm assuming it's a variant of one you've posted before.) I'm getting very different numbers with "production" data sets (e.g., high-res peninsulas, islands, global political boundaries, planet OSM data - that is, shapes more exotic than circles larger than a few hundred km). A few questions and other observations...

bq. chosen arbitrarily; with 27 it choked on memory given 2GB heap

What choked specifically? I'm using PackedQuad with depth between 26 and 29 and a 1GB heap, using the shapes I described above.

bq. disabling leafy branch pruning to compare apples to apples

Out of curiosity, why is this option enabled by default if it uses transient storage that doubles memory consumption? That seems backwards to me.

bq. I was skeptical there would be index size savings and the benchmark shows there aren't any.

IMHO I would avoid these kinds of absolute statements (especially given the highly variable nature of spatial use cases). In this situation your numbers do not surprise me, since you disabled that leafyBranchPrune option (which still confuses me as to why it's there) and used a single shape type of variable size.

There is an outstanding issue in the existing patch (I'll see if I can't push out a fix today): the TermsEnum is returning terms whose BytesRef holds a bytes[] twice the size it should be (e.g., 16 bytes instead of 8, padded with zeros). I suspect it's some improper configuration in the reader. So for every high-res cell the term will be 16 bytes (still better than, say, 27).

I think we can do better on simulated test data in the test framework. I love the randomization, and the few "real" data sets that are there are great, but they do not provide the coverage necessary to simulate some real-world scenarios. That's okay; to steal Mike's quote, "progress not perfection". I'll definitely work to provide some more real-world tests so we have better coverage and benchmarking options using "real world" data. It's a good way to recommend one indexing structure over another (this one's just the beginning; there are more indexing structures in trial mode, and even more improvements for the packed version).

Let's keep this going... Since this patch is non-destructive I don't see a reason it can't be committed as another option, and I can submit enhancement patches to this feature. That would be up to the community.

> Add StreamingQuadPrefixTree
> ---------------------------
>
>                 Key: LUCENE-6422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6422
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/spatial
>    Affects Versions: 5.x
>            Reporter: Nicholas Knize
>         Attachments: LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch
>
>
> To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells, yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList's Iterator back to invert(), which creates a second, lexicographically sorted array of terms. This doubles the memory consumption when indexing a shape.
> This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following:
> 1. Create a packed 8-byte representation of a QuadCell
> 2. Build the packed cells 'on demand' when incrementToken is called
> Improvements over this approach include generating the packed cells using an AutoPrefixAutomaton.
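
For context on the packed 8-byte QuadCell representation described above (and on the 16-bytes-instead-of-8 BytesRef padding issue noted in the comment), here is a minimal, hypothetical sketch of one way a quad cell of up to 29 levels could be packed into a single long and encoded as an exactly 8-byte term. The class and method names are illustrative assumptions, not code from the attached patch:

{code:java}
import org.apache.lucene.util.BytesRef;

/**
 * Hypothetical sketch of an 8-byte packed quad cell. Two bits per level
 * (the quadrant chosen at each split) are left-aligned in a long, and the
 * depth is kept in the low 6 bits so cells of different depths remain
 * distinguishable. Names are illustrative only.
 */
public final class PackedCellSketch {

  /** Pack one quadrant (0..3) per level plus the depth into a single long. */
  public static long pack(int[] quadrantsPerLevel) {
    long term = 0L;
    for (int quadrant : quadrantsPerLevel) {
      term = (term << 2) | (quadrant & 0x3);          // 2 bits per level
    }
    // Left-align the path so a parent cell is a bit-prefix of its children;
    // with a maximum depth of 29 (58 bits), the low 6 bits stay free for the depth.
    term <<= (64 - 2 * quadrantsPerLevel.length);
    return term | quadrantsPerLevel.length;
  }

  /** Encode the packed long as exactly 8 big-endian bytes so terms sort by prefix. */
  public static BytesRef toBytesRef(long packed) {
    byte[] bytes = new byte[8];                        // 8 bytes, not 16
    for (int i = 7; i >= 0; i--) {
      bytes[i] = (byte) (packed & 0xFF);
      packed >>>= 8;
    }
    return new BytesRef(bytes);
  }
}
{code}

A fixed 8-byte term of this form compares favorably with the string-based quad cells, which cost 1 byte per level (e.g., 27 bytes at depth 27).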
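
Point 2 of the description, building the packed cells 'on demand' when incrementToken is called, could look roughly like the sketch below, which avoids materializing an ArrayList of cells before indexing. It assumes a lazily produced Iterator<BytesRef> of packed cells and the BytesTermAttribute from Lucene's tokenattributes package; it is not the strategy implemented in the patch:

{code:java}
import java.util.Iterator;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.BytesTermAttribute;
import org.apache.lucene.util.BytesRef;

/**
 * Sketch of a token stream that emits one packed cell per incrementToken()
 * call instead of materializing every cell up front. The Iterator<BytesRef>
 * source (e.g., a lazy prefix-tree walk) is an assumption of this sketch.
 */
final class StreamingCellTokenStream extends TokenStream {
  private final BytesTermAttribute termAtt = addAttribute(BytesTermAttribute.class);
  private final Iterator<BytesRef> cells;

  StreamingCellTokenStream(Iterator<BytesRef> cells) {
    this.cells = cells;
  }

  @Override
  public boolean incrementToken() {
    if (!cells.hasNext()) {
      return false;                        // no more cells; nothing was buffered ahead of time
    }
    clearAttributes();
    termAtt.setBytesRef(cells.next());     // hand the next packed cell straight to the indexer
    return true;
  }
}
{code}

The point of the streaming approach is that only the current cell needs to be held in memory, rather than the full cell list plus the second, sorted copy created by invert().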