[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-7563:
---------------------------------
    Attachment: LUCENE-7563-prefixlen-unary.patch

The change looks good and the drop is quite spectacular: http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#searcher_heap :-) I think there is just a redundant arraycopy in {{clone()}}?

For the record, I played with another idea that leverages the fact that the prefix lengths on two consecutive levels are likely to be close to each other, and that the most common values for the deltas are 0, then 1, then -1. So we might be able to save more by encoding the delta between consecutive prefix lengths using unary coding on top of zig-zag encoding, which would allow encoding 0 on 1 bit, 1 on 2 bits, 2 on 3 bits, etc. However, it only saved 1% memory on IndexOSM and less than 1% on IndexTaxis. I'm attaching it here in case someone wants to have a look, but I don't think the gains are worth the complexity.

> BKD index should compress unused leading bytes
> ----------------------------------------------
>
>                 Key: LUCENE-7563
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7563
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>             Fix For: master (7.0), 6.4
>
>         Attachments: LUCENE-7563-prefixlen-unary.patch, LUCENE-7563.patch, LUCENE-7563.patch, LUCENE-7563.patch, LUCENE-7563.patch
>
>
> Today the BKD (points) in-heap index always uses {{dimensionNumBytes}} per dimension, but if e.g. you are indexing {{LongPoint}} yet only use the bottom two bytes in a given segment, we shouldn't store all those leading 0s in the index.
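
For illustration, the delta scheme described in the comment above (unary coding on top of zig-zag encoding of the difference between consecutive prefix lengths) could look roughly like the Java sketch below. This is not the code from LUCENE-7563-prefixlen-unary.patch: the class and method names are hypothetical, the bits are kept in a String of '0'/'1' characters for readability rather than packed into bytes, the first length is encoded as a delta from 0, and the standard zig-zag mapping is assumed (0, -1, 1, -2, 2 map to 0, 1, 2, 3, 4), so a zig-zag value n costs n + 1 bits.

```java
/** Hypothetical sketch of zig-zag + unary coding of prefix-length deltas. */
final class PrefixLenDeltaCodec {

  /** Standard zig-zag mapping: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ... */
  static int zigZag(int v) {
    return (v << 1) ^ (v >> 31);
  }

  /** Inverse of {@link #zigZag(int)}. */
  static int unZigZag(int z) {
    return (z >>> 1) ^ -(z & 1);
  }

  /**
   * Encodes each prefix length as the delta from the previous one
   * (the first one as a delta from 0). Each zig-zag value n is written
   * in unary as n '1' bits followed by a terminating '0' bit.
   */
  static String encode(int[] prefixLens) {
    StringBuilder bits = new StringBuilder();
    int prev = 0;
    for (int len : prefixLens) {
      int z = zigZag(len - prev);
      for (int i = 0; i < z; i++) {
        bits.append('1');
      }
      bits.append('0');
      prev = len;
    }
    return bits.toString();
  }

  /** Decodes {@code count} prefix lengths from a bit string produced by encode. */
  static int[] decode(String bits, int count) {
    int[] out = new int[count];
    int pos = 0;
    int prev = 0;
    for (int i = 0; i < count; i++) {
      int z = 0;
      while (bits.charAt(pos++) == '1') {
        z++; // count the '1' bits up to the terminating '0'
      }
      prev += unZigZag(z);
      out[i] = prev;
    }
    return out;
  }
}
```

With this layout a run of identical prefix lengths costs one bit per level, while any large jump costs a number of bits proportional to its size, which is consistent with the scheme only paying off when deltas are concentrated around 0.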