[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-7563:
---------------------------------------
    Attachment: LUCENE-7563.patch

New patch; I think it's ready.

This breaks out a private BKD implementation for {{SimpleText}} which is a nice cleanup for the core BKD implementation, e.g. {{BKDReader}} is now final; its strange protected constructor is gone; protected methods are now private.

This patch also implements [~jpountz]'s last compression idea, to often use only 1 byte to encode prefix, splitDim and first-byte-delta of the suffix instead of the 2 bytes required in the previous iterations. This gives another ~4-5% further compression improvement:

* sparse-sorted -> 2.37 MB
* sparse -> 2.07 MB
* dense -> 2.00 MB

And the OpenStreetMaps geo benchmark:

* geo3d -> 1.75 MB
* LatLonPoint -> 1.72 MB

I'm running the 2B BKD and Points tests now ... if those pass, I plan to push to master first and let this bake a bit before backporting.

> BKD index should compress unused leading bytes
> ----------------------------------------------
>
>                 Key: LUCENE-7563
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7563
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>             Fix For: master (7.0), 6.4
>
>         Attachments: LUCENE-7563.patch, LUCENE-7563.patch, LUCENE-7563.patch,
> LUCENE-7563.patch
>
>
> Today the BKD (points) in-heap index always uses {{dimensionNumBytes}} per
> dimension, but if e.g. you are indexing {{LongPoint}} yet only use the bottom
> two bytes in a given segment, we shouldn't store all those leading 0s in the
> index.
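
For readers following along, here is a rough sketch of how prefix, splitDim and first-byte-delta could share a single variable-length int that usually fits in 1 byte. It assumes a mixed-radix packing. The class name {{PackedSplitCode}}, its method names and the exact field order are hypothetical, for illustration only, and may not match what the attached patch does.

```java
// Illustrative sketch only: names and layout are assumptions, not the patch's code.
final class PackedSplitCode {
    private PackedSplitCode() {}

    /** Packs the three fields into one non-negative int using mixed-radix arithmetic. */
    public static int pack(int firstByteDelta, int prefix, int splitDim,
                           int bytesPerDim, int numDims) {
        // prefix ranges over 0..bytesPerDim inclusive, so its radix is bytesPerDim + 1;
        // splitDim ranges over 0..numDims-1, so its radix is numDims.
        return (firstByteDelta * (bytesPerDim + 1) + prefix) * numDims + splitDim;
    }

    /** Reverses pack(); returns {firstByteDelta, prefix, splitDim}. */
    public static int[] unpack(int code, int bytesPerDim, int numDims) {
        int splitDim = code % numDims;
        code /= numDims;
        int prefix = code % (bytesPerDim + 1);
        int firstByteDelta = code / (bytesPerDim + 1);
        return new int[] {firstByteDelta, prefix, splitDim};
    }

    /** Bytes needed by a variable-length int that carries 7 payload bits per byte. */
    public static int vIntLength(int code) {
        int length = 1;
        while ((code & ~0x7F) != 0) {
            code >>>= 7;
            length++;
        }
        return length;
    }
}
```

Under these assumptions, with 2 dimensions of 4 bytes each, {{pack(5, 3, 1, 4, 2)}} yields 57, which fits in a single byte; once the first-byte delta grows past 12 for the same prefix and splitDim, the code exceeds 127 and spills into a second byte.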