[ 
https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702891#comment-15702891
 ] 

Adrien Grand commented on LUCENE-7563:
--------------------------------------

bq. Hmm I think I am already doing that?

You are right, I had not read the code correctly.

bq. Oooh that's a great idea! Saves 1 byte per inner node. We need 5 bits for 
the prefix I think since it can range 0 .. 16 inclusive, and 3 bits for the 
splitDim since it's 0 .. 7 inclusive.

I have been thinking about it more and I think we can make it more general. The 
first two bytes that differ are likely to be close in value, so if we call their 
difference {{firstByteDelta}}, we could pack {{firstByteDelta}}, {{splitDim}} 
and {{prefix}} into a single vint (e.g. {{(firstByteDelta * (1 + bytesPerDim) + 
prefix) * numDims + splitDim}}) that would sometimes take only one byte (quite 
often when {{numDims}} and {{bytesPerDim}} are small, rarely when they are 
large).
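
To make the packing concrete, here is a minimal sketch of how such a code could be built and taken apart again. It is only an illustration, not the patch: the class {{SplitCodePacking}} and its method names are hypothetical, and it assumes {{firstByteDelta}} is non-negative, {{prefix}} is in 0 .. {{bytesPerDim}} and {{splitDim}} is in 0 .. {{numDims}} - 1.

```java
/** Hypothetical helper, not part of Lucene: packs three per-node values into one int. */
final class SplitCodePacking {

  /**
   * Mixed-radix encoding of (firstByteDelta, prefix, splitDim).
   * Assumes firstByteDelta >= 0, 0 <= prefix <= bytesPerDim, 0 <= splitDim < numDims.
   */
  static int encode(int firstByteDelta, int prefix, int splitDim, int bytesPerDim, int numDims) {
    return (firstByteDelta * (1 + bytesPerDim) + prefix) * numDims + splitDim;
  }

  /** Inverse of encode; returns {firstByteDelta, prefix, splitDim}. */
  static int[] decode(int code, int bytesPerDim, int numDims) {
    int splitDim = code % numDims;          // lowest "digit", radix numDims
    code /= numDims;
    int prefix = code % (1 + bytesPerDim);  // next "digit", radix 1 + bytesPerDim
    int firstByteDelta = code / (1 + bytesPerDim);
    return new int[] {firstByteDelta, prefix, splitDim};
  }

  /** Number of bytes a non-negative int takes as a vint (7 payload bits per byte). */
  static int vIntLength(int value) {
    int length = 1;
    while ((value & ~0x7F) != 0) {
      value >>>= 7;
      length++;
    }
    return length;
  }
}
```

With {{numDims=2}} and {{bytesPerDim=4}}, the values {{firstByteDelta=3}}, {{prefix=2}}, {{splitDim=1}} encode to 35, which fits in a one-byte vint. With {{numDims=8}} and {{bytesPerDim=16}} the same values encode to 425, which needs two bytes.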

bq. but it felt wrong to just pass these packed bytes to the simple text format 
...

Agreed. Maybe we should duplicate the current BKDReader/BKDWriter into a new 
impl that would be specific to SimpleText and would not need all those 
optimizations, so that both impls can evolve separately.

> BKD index should compress unused leading bytes
> ----------------------------------------------
>
>                 Key: LUCENE-7563
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7563
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>             Fix For: master (7.0), 6.4
>
>         Attachments: LUCENE-7563.patch, LUCENE-7563.patch
>
>
> Today the BKD (points) in-heap index always uses {{dimensionNumBytes}} per 
> dimension, but if e.g. you are indexing {{LongPoint}} yet only use the bottom 
> two bytes in a given segment, we shouldn't store all those leading 0s in the 
> index.



