[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage
[ https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113384#comment-17113384 ]

Viral Gandhi commented on LUCENE-9211:
--------------------------------------

This improvement had a negative impact on our internal benchmarking when we tried to upgrade to Lucene 8.5.1. I have created an issue regarding that - https://issues.apache.org/jira/browse/LUCENE-9378.

> Adding compression to BinaryDocValues storage
> ---------------------------------------------
>
>                 Key: LUCENE-9211
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9211
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Mark Harwood
>            Assignee: Mark Harwood
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 8.5
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> While SortedSetDocValues can be used today to store identical values in a
> compact form this is not effective for data with many unique values.
> The proposal is that BinaryDocValues should be stored in LZ4 compressed
> blocks which can dramatically reduce disk storage costs in many cases. The
> proposal is blocks of a number of documents are stored as a single compressed
> blob along with metadata that records offsets where the original document
> values can be found in the uncompressed content.
> There's a trade-off here between efficient compression (more docs-per-block =
> better compression) and fast retrieval times (fewer docs-per-block = faster
> read access for single values). A fixed block size of 32 docs seems like it
> would be a reasonable compromise for most scenarios.
> A PR is up for review here [https://github.com/apache/lucene-solr/pull/1234]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
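The block layout described in the issue above (groups of documents stored as one compressed blob, plus offsets into the uncompressed content) can be sketched roughly as follows. This is an illustration only and not the code from PR 1234: it is written in Python rather than Java, it uses zlib from the standard library as a stand-in for LZ4, and the names write_blocks, read_value, stored_size and the sample values are all hypothetical.

```python
import zlib

DOCS_PER_BLOCK = 32  # block size proposed in the issue


def write_blocks(values, docs_per_block=DOCS_PER_BLOCK):
    """Pack per-document byte values into compressed blocks.

    Returns a list of (compressed_blob, offsets) pairs. offsets[i] is the
    start of the i-th value inside the uncompressed block, and offsets[-1]
    is the total uncompressed length of the block.
    """
    blocks = []
    for start in range(0, len(values), docs_per_block):
        chunk = values[start:start + docs_per_block]
        offsets = [0]
        for value in chunk:
            offsets.append(offsets[-1] + len(value))
        # Assumption: zlib stands in for LZ4, which the issue proposes.
        blocks.append((zlib.compress(b"".join(chunk)), offsets))
    return blocks


def read_value(blocks, doc_id, docs_per_block=DOCS_PER_BLOCK):
    """Fetch one document's value by decompressing only its own block."""
    blob, offsets = blocks[doc_id // docs_per_block]
    raw = zlib.decompress(blob)
    index = doc_id % docs_per_block
    return raw[offsets[index]:offsets[index + 1]]


def stored_size(blocks):
    """Total compressed bytes; the offset metadata is not counted."""
    return sum(len(blob) for blob, _ in blocks)


# Hypothetical data: many unique but similar values, one per document.
values = [("user-%05d@example.org" % i).encode("ascii") for i in range(1000)]
blocks = write_blocks(values)

print(read_value(blocks, 517))
# Compare compressed size for 4 docs per block against 32 docs per block.
print(stored_size(write_blocks(values, 4)), stored_size(blocks))
```

Reading a single value costs one block decompression, so a larger docs_per_block tends to shrink the stored size while making each single-value read more expensive, which is the trade-off the issue describes.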
[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage
[ https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039132#comment-17039132 ]

ASF subversion and git services commented on LUCENE-9211:
---------------------------------------------------------

Commit bcdc21a0013709095abee1c588b3271a14949ea2 in lucene-solr's branch refs/heads/branch_8x from markharwood
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bcdc21a ]

LUCENE-9211 Add compression for Binary doc value fields (#1234)

Stores groups of 32 binary doc values in LZ4-compressed blocks.

(cherry picked from commit f549ee353530fcd48390a314aff9ec1723b47346)
[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage
[ https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039104#comment-17039104 ]

ASF subversion and git services commented on LUCENE-9211:
---------------------------------------------------------

Commit ce2959fe4cb1d1e77df04464c46004bf7846f6b5 in lucene-solr's branch refs/heads/master from markharwood
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ce2959f ]

LUCENE-9211 Add compression for Binary doc value fields (#1234)

Stores groups of 32 binary doc values in LZ4-compressed blocks.
[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage
[ https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036806#comment-17036806 ]

Adrien Grand commented on LUCENE-9211:
--------------------------------------

I had a quick look at Juan's commit; there are things I like and things I have questions about. Since this PR is ready, or almost ready, I'd suggest merging this one first.

[~juan.duran] I saw that your commit tried to modify the current Lucene80DocValuesFormat. I'm a bit nervous about it because it makes it hard to spot any potential subtle difference in the on-disk format that would cause bugs, so I'd suggest creating a new Lucene85DocValuesFormat instead, even if it has the same ideas or even the same on-disk format as the current Lucene80DocValuesFormat?
[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage
[ https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036139#comment-17036139 ]

Mark Harwood commented on LUCENE-9211:
--------------------------------------

{quote}the link did not work.{quote}
Sorry, formatting must have mangled my URL - this is the full link FWIW [https://github.com/apache/lucene-solr/blob/master/lucene/benchmark/conf/spatial.alg#L31]

Thanks for testing and good to know your tests showed little difference in performance.

What's your view on how best to proceed from here? Wait for Juan's PR to land before doing any more?
[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage
[ https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035624#comment-17035624 ]

David Smiley commented on LUCENE-9211:
--------------------------------------

Thanks so much for running the benchmarks [~mharwood]! You say you modified "this line", but the link did not work. If you merely changed the default spatial.alg to use composite, then it's only indexing point data, which is not realistic for this spatial strategy. Instead, LUCENE-5579 has a spatial.alg file that converts those points to random circles, and it'll be more interesting. I just did a diff on that spatial.alg with the default one and they are pretty similar overall.
[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage
[ https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035224#comment-17035224 ]

juan camilo rodriguez duran commented on LUCENE-9211:
-----------------------------------------------------

[~mharwood] the main idea of my PR is just to make the code cleaner and more extensible; it is not supposed to introduce any regression or improvement to the current format. (Spoiler alert: I'm working on an extension to improve lookups by BytesRef for sorted and sorted set doc values.)
[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage
[ https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034557#comment-17034557 ]

Mark Harwood commented on LUCENE-9211:
--------------------------------------

Thanks Juan and David for your comments.

I ran the spatial.alg test and modified [this line|https://github.com/apache/lucene-solr/blob/master/lucene/benchmark/conf/spatial.alg#L31] to use the "composite" strategy in order to exercise the Binary DV storage.

I did four runs of master and PR 1234 and there wasn't a clear pattern of changes in speed.

||Master read recs/s||PR 1234 read recs/s||
|875.66|884.96|
|869.94|841.75|
|823.38|853.97|
|842.11|878.73|

||Master write docs/s||PR 1234 write docs/s||
|7,688.46|8,163.20|
|8,223.35|7,882.39|
|7,381.71|7,930.78|
|8,385.32|7,925|
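As a rough reading aid for the benchmark figures in Mark Harwood's comment above, the four runs per branch can be averaged and their ranges compared. This is only a sketch: it is in Python, the names runs and ranges_overlap are hypothetical, the numbers are copied from the comment, and four runs is too small a sample to support any statistical claim.

```python
from statistics import mean

# Throughput figures copied from the comment (four runs per branch).
runs = {
    "read recs/s": {
        "master": [875.66, 869.94, 823.38, 842.11],
        "pr_1234": [884.96, 841.75, 853.97, 878.73],
    },
    "write docs/s": {
        "master": [7688.46, 8223.35, 7381.71, 8385.32],
        "pr_1234": [8163.20, 7882.39, 7930.78, 7925.0],
    },
}


def ranges_overlap(a, b):
    """True when the [min, max] intervals of two sample lists intersect."""
    return min(a) <= max(b) and min(b) <= max(a)


for metric, by_branch in runs.items():
    master, pr = by_branch["master"], by_branch["pr_1234"]
    # Relative change of the PR mean against the master mean, in percent.
    change = (mean(pr) - mean(master)) / mean(master) * 100
    print(f"{metric}: master {mean(master):.2f}, PR {mean(pr):.2f}, "
          f"change {change:+.2f}%, ranges overlap: {ranges_overlap(master, pr)}")
```

For both metrics the per-branch ranges overlap and the difference in means is small relative to the run-to-run spread, which is consistent with the comment's reading that there was no clear pattern.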
[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage
[ https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034350#comment-17034350 ]

juan camilo rodriguez duran commented on LUCENE-9211:
-----------------------------------------------------

[~mharwood] here you will find a draft for the PR I'm preparing [https://github.com/juanka588/lucene-solr/commit/b7c8d14d53190753ea789c3fb3d299d3374c3677]
[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage
[ https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033125#comment-17033125 ]

David Smiley commented on LUCENE-9211:
--------------------------------------

This seems cool for some use-cases but I worry about the overhead for others. I think I have a benchmark module ".alg" file for SerializedDVStrategy in spatial-extras. I should try it out on your PR.

I wish it was easier for us to let users toggle the choice of DocValuesFormat only for one type but not for others. DocValuesFormat is really a format of formats, which is inflexible. [~juan.duran], a colleague of mine, has been diving into this topic lately and I hope he shares it here (new issue of course).