[
https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702280#comment-15702280
]
Michael McCandless commented on LUCENE-7563:
--------------------------------------------
bq. It seems we are always delta coding with the split value of the parent
level, but for the multi-dimensional case, I think it would be better to
delta-code with the last split value that was on the same dimension?
Hmm, I think I am already doing that? Note that the
{{splitValuesStack}} in {{BKDReader.PackedIndexTree}} holds the last
split value for every dimension, and when I read the suffix bytes in,
I copy them into the packed values at the offset of the current split
dimension:
{noformat}
in.readBytes(splitValuesStack[level], splitDim*bytesPerDim+prefix,
suffix);
{noformat}
I think?
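To make the argument concrete, here is a minimal sketch of the idea, not the code from the patch. It assumes each level starts from a copy of its parent's packed split values (all dimensions) and then overwrites only the suffix bytes of the current split dimension. The class name, the {{descend}} helper and the fixed 2-dimension, 4-bytes-per-dimension layout are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch: NOT the actual BKDReader.PackedIndexTree code.
class SplitValueStackSketch {
    // Assumed for illustration: 4 bytes per dimension.
    static final int BYTES_PER_DIM = 4;

    /**
     * Builds the packed split values for a child level: copy the parent's
     * packed values (which carry the last split value of every dimension),
     * then overwrite only the suffix bytes of the current split dimension.
     */
    static byte[] descend(byte[] parentPacked, int splitDim, int prefix, byte[] suffix) {
        byte[] child = Arrays.copyOf(parentPacked, parentPacked.length);
        System.arraycopy(suffix, 0, child, splitDim * BYTES_PER_DIM + prefix, suffix.length);
        return child;
    }

    public static void main(String[] args) {
        // Root split on dim 0 with value 00 00 10 00; dim 1 has no split yet.
        byte[] root = {0, 0, 0x10, 0, 0, 0, 0, 0};
        // Level 1 splits on dim 1: 3 shared prefix bytes, 1 suffix byte.
        byte[] level1 = descend(root, 1, 3, new byte[] {0x20});
        // Level 2 splits on dim 0 again: the prefix is shared with the root's
        // dim 0 value, even though the direct parent split on dim 1.
        byte[] level2 = descend(level1, 0, 3, new byte[] {0x40});
        System.out.println(Arrays.toString(level2)); // [0, 0, 16, 64, 0, 0, 0, 32]
    }
}
```

In this sketch the dim 0 value at level 2 is delta-coded against the last dim 0 split value (from the root), not against the parent's dim 1 split value.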
I'll test on the OpenStreetMaps geo benchmark to measure the impact
... I'll also run the 2B tests to make sure nothing broke.
bq. For instance we use whole bytes to store the split dimension or the prefix
length while they only need 3 and 4 bits? In the multi-dimensional case we
could store both on a single byte.
Oooh that's a great idea! It saves 1 byte per inner node. I think we
need 5 bits for the prefix, since it can range 0 .. 16 inclusive, and
3 bits for the {{splitDim}}, since it's 0 .. 7 inclusive. That is
exactly 8 bits, so both fit in a single byte.
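One possible way to pack the two fields is sketched below. The bit layout (prefix in the high 5 bits, {{splitDim}} in the low 3 bits) and all names here are assumptions for illustration, not the encoding the patch actually uses.

```java
// Hypothetical sketch of packing splitDim (0..7) and prefix (0..16) into one byte.
class NodeHeaderPacking {
    /** Packs prefix into the high 5 bits and splitDim into the low 3 bits. */
    static byte pack(int splitDim, int prefix) {
        if (splitDim < 0 || splitDim > 7) {
            throw new IllegalArgumentException("splitDim must be in 0..7, got " + splitDim);
        }
        if (prefix < 0 || prefix > 16) {
            throw new IllegalArgumentException("prefix must be in 0..16, got " + prefix);
        }
        return (byte) ((prefix << 3) | splitDim);
    }

    /** Extracts the low 3 bits. */
    static int splitDim(byte packed) {
        return packed & 0x07;
    }

    /** Extracts the high 5 bits; the mask undoes sign extension of the byte. */
    static int prefix(byte packed) {
        return (packed & 0xFF) >>> 3;
    }

    public static void main(String[] args) {
        byte packed = pack(7, 16); // largest values for both fields
        System.out.println(splitDim(packed) + " " + prefix(packed)); // 7 16
    }
}
```

The {{& 0xFF}} mask matters because Java bytes are signed: with a prefix of 16 the top bit is set, and without the mask the shift would operate on a negative int.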
bq. It doesn't need to be done in the same patch, but it would also be nice for
SimpleText to not use the legacy format of the index. I'm not sure how to
proceed however.
Yeah I'm not sure what to do here either ... but it felt wrong to just
pass these packed bytes to the simple text format ... that packed form
is even further from "simple" than the two arrays we have now.
> BKD index should compress unused leading bytes
> ----------------------------------------------
>
> Key: LUCENE-7563
> URL: https://issues.apache.org/jira/browse/LUCENE-7563
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Fix For: master (7.0), 6.4
>
> Attachments: LUCENE-7563.patch, LUCENE-7563.patch
>
>
> Today the BKD (points) in-heap index always uses {{dimensionNumBytes}} per
> dimension, but if e.g. you are indexing {{LongPoint}} yet only use the bottom
> two bytes in a given segment, we shouldn't store all those leading 0s in the
> index.