[jira] [Commented] (LUCENE-7371) BKDReader could compress values better
[ https://issues.apache.org/jira/browse/LUCENE-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15382455#comment-15382455 ]

Michael McCandless commented on LUCENE-7371:
--------------------------------------------

Oh sorry, I upgraded the Linux kernel from 4.4 -> 4.6.4 on 7/17! I'll add an annotation.

> BKDReader could compress values better
> --------------------------------------
>
> Key: LUCENE-7371
> URL: https://issues.apache.org/jira/browse/LUCENE-7371
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: master (7.0), 6.2
>
> Attachments: LUCENE-7371.patch, LUCENE-7371.patch, LUCENE-7371.patch
>
>
> For compressing values, BKDReader only relies on shared prefixes in a block.
> We could probably easily do better. For instance there are only 256 possible
> values for the first byte of the dimension that the values are sorted by, yet
> we use a block size of 1024. So by using something simple like run-length
> compression we could save 6 bits per value on average.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7371) BKDReader could compress values better
[ https://issues.apache.org/jira/browse/LUCENE-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15382452#comment-15382452 ]

Robert Muir commented on LUCENE-7371:
-------------------------------------

I think [~mikemccand] may have upgraded his operating system.
[jira] [Commented] (LUCENE-7371) BKDReader could compress values better
[ https://issues.apache.org/jira/browse/LUCENE-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15382404#comment-15382404 ]

Adrien Grand commented on LUCENE-7371:
--------------------------------------

The benchmarks are reporting interesting changes. Some seem to perform slightly faster now, like IntNRQ (http://people.apache.org/~mikemccand/lucenebench/IntNRQ.html) or the geo3d distance filter (http://people.apache.org/~mikemccand/geobench.html#search-distance), but others seem to perform a bit slower, like the 10-gon filter (http://people.apache.org/~mikemccand/geobench.html#search-poly_10) or the 10 nearest points (http://people.apache.org/~mikemccand/geobench.html#search-nearest_10). I think the fact that it is not consistently slower or faster is due to the distribution of points in the blocks that need to be read (the more unique leading bytes, the more expensive the read).

Given that the slowdown is not general to all benchmarks and that the size reduction is significant, I don't think this should be reverted, but let me know if you think otherwise.

(For the record, many benchmarks look slower on July 17th, but I don't think this is related to this change; for instance, even phrases got slower: http://people.apache.org/~mikemccand/lucenebench/Phrase.html)
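The remark above that a read gets more expensive with more unique leading bytes can be illustrated with a minimal Java sketch of run-length coding over one byte column of a sorted block. This is not the code from LUCENE-7371.patch: the class name `LeadingByteRunLength`, the (byte, run length) pair layout, and the cap of 255 values per run are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration only; not the Lucene implementation.
final class LeadingByteRunLength {

    /** Encodes the column as (byte, runLength) pairs; runs are capped at 255 so a length fits in one byte. */
    static byte[] encode(byte[] column) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < column.length) {
            byte b = column[i];
            int run = 1;
            while (i + run < column.length && run < 255 && column[i + run] == b) {
                run++;
            }
            out.write(b);   // the repeated byte
            out.write(run); // how many times it repeats (1..255)
            i += run;
        }
        return out.toByteArray();
    }

    /** Decodes count values; the loop runs once per run, so more distinct bytes means more iterations. */
    static byte[] decode(byte[] encoded, int count) {
        byte[] out = new byte[count];
        int pos = 0;
        for (int i = 0; i < encoded.length; i += 2) {
            int run = encoded[i + 1] & 0xFF;
            Arrays.fill(out, pos, pos + run, encoded[i]);
            pos += run;
        }
        return out;
    }

    public static void main(String[] args) {
        // A 1024-value block whose leading byte takes only four distinct values.
        byte[] column = new byte[1024];
        for (int i = 0; i < column.length; i++) {
            column[i] = (byte) (i / 256);
        }
        byte[] encoded = encode(column);
        System.out.println("raw bytes: " + column.length + ", encoded bytes: " + encoded.length);
    }
}
```

In this sketch the decoder does one loop iteration per run, so a block with few distinct leading bytes decodes in a handful of bulk fills, while a block with many distinct leading bytes needs many short ones.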
[jira] [Commented] (LUCENE-7371) BKDReader could compress values better
[ https://issues.apache.org/jira/browse/LUCENE-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373139#comment-15373139 ]

ASF subversion and git services commented on LUCENE-7371:
---------------------------------------------------------

Commit 1a6df249f91ca9f4dab792c48f5965f3388f1776 in lucene-solr's branch refs/heads/branch_6x from [~jpountz]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1a6df24 ]

LUCENE-7371: Fix CHANGES entry.
[jira] [Commented] (LUCENE-7371) BKDReader could compress values better
[ https://issues.apache.org/jira/browse/LUCENE-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373140#comment-15373140 ]

ASF subversion and git services commented on LUCENE-7371:
---------------------------------------------------------

Commit b54d46722b36f107edd59a8d843b93f5727a9058 in lucene-solr's branch refs/heads/master from [~jpountz]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b54d467 ]

LUCENE-7371: Fix CHANGES entry.
[jira] [Commented] (LUCENE-7371) BKDReader could compress values better
[ https://issues.apache.org/jira/browse/LUCENE-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373137#comment-15373137 ]

ASF subversion and git services commented on LUCENE-7371:
---------------------------------------------------------

Commit 1f446872aa9346c22643d0fb753ec42942b5a4d2 in lucene-solr's branch refs/heads/branch_6x from [~jpountz]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1f44687 ]

LUCENE-7371: Better compression of values in Lucene60PointsFormat.
[jira] [Commented] (LUCENE-7371) BKDReader could compress values better
[ https://issues.apache.org/jira/browse/LUCENE-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373129#comment-15373129 ]

ASF subversion and git services commented on LUCENE-7371:
---------------------------------------------------------

Commit 866398bea67607bcd54331a48736e6bdb94a703d in lucene-solr's branch refs/heads/master from [~jpountz]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=866398b ]

LUCENE-7371: Better compression of values in Lucene60PointsFormat.
[jira] [Commented] (LUCENE-7371) BKDReader could compress values better
[ https://issues.apache.org/jira/browse/LUCENE-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15372981#comment-15372981 ]

Michael McCandless commented on LUCENE-7371:
--------------------------------------------

+1, great!
[jira] [Commented] (LUCENE-7371) BKDReader could compress values better
[ https://issues.apache.org/jira/browse/LUCENE-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15372803#comment-15372803 ]

Michael McCandless commented on LUCENE-7371:
--------------------------------------------

This is a nice optimization! Patch looks good!

The {{BKDWriter}} change to pick which dimension to apply the run-length coding to is best effort, right? Because you could have a dim with fewer unique leading suffix bytes but a larger delta between first and last values? But it would take quite a bit more work at indexing time to figure it out ... maybe add a comment explaining this tradeoff?

It seems likely the "min delta" approach should work well in practice, but have you tried with the slow-but-correct approach to verify?

Also, I noticed {{TestBackwardsCompatibility}} seems not to test points! I'll go fix that ...
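The tradeoff raised above can be made concrete with a rough Java sketch of what a "slow-but-correct" check might look like: for every dimension, count the distinct bytes at the first position after that dimension's shared prefix, then pick the dimension with the fewest. This is only one reading of the comment, not code from the patch. The class name `RunLengthDimensionChooser`, the packed `byte[][]` layout (each value holding `numDims * bytesPerDim` bytes), and the `commonPrefixLengths` parameter are hypothetical.

```java
import java.util.BitSet;

// Hypothetical illustration only; not the BKDWriter implementation.
final class RunLengthDimensionChooser {

    /** Counts how many distinct byte values occur at the given offset across the block. */
    static int distinctBytesAt(byte[][] values, int offset) {
        BitSet seen = new BitSet(256);
        for (byte[] value : values) {
            seen.set(value[offset] & 0xFF);
        }
        return seen.cardinality();
    }

    /**
     * Returns the dimension whose first non-shared byte has the fewest distinct values,
     * or -1 if every dimension is fully covered by its common prefix.
     */
    static int chooseDimension(byte[][] values, int numDims, int bytesPerDim, int[] commonPrefixLengths) {
        int best = -1;
        int bestCount = Integer.MAX_VALUE;
        for (int dim = 0; dim < numDims; dim++) {
            if (commonPrefixLengths[dim] == bytesPerDim) {
                continue; // all values are identical in this dimension, no suffix byte to inspect
            }
            int offset = dim * bytesPerDim + commonPrefixLengths[dim];
            int count = distinctBytesAt(values, offset); // one full pass over the block per dimension
            if (count < bestCount) {
                bestCount = count;
                best = dim;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two dimensions, two bytes each; the first byte of each dimension is shared.
        byte[][] block = {
            {0, 1, 5, 0},
            {0, 2, 5, 0},
            {0, 3, 5, 1},
            {0, 4, 5, 1},
        };
        int dim = chooseDimension(block, 2, 2, new int[] {1, 1});
        System.out.println("dimension with fewest distinct suffix bytes: " + dim);
    }
}
```

The exact count needs a pass over every value in the block for each dimension, which is the extra indexing-time work the comment alludes to; a heuristic that compares only the first and last values of each dimension avoids that pass at the cost of sometimes picking a worse dimension.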