[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303494#comment-17303494 ]

Bruno Roustant commented on LUCENE-9663:

Ok, I backported to 8.x branch, and I updated CHANGES.txt in main to move to 8.9.0 section.

> Adding compression to terms dict from SortedSet/Sorted DocValues
>
> Key: LUCENE-9663
> URL: https://issues.apache.org/jira/browse/LUCENE-9663
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Reporter: Jaison.Bi
> Priority: Trivial
> Fix For: 8.9
> Time Spent: 11h 10m
> Remaining Estimate: 0h
>
> Elasticsearch keyword field uses SortedSet DocValues. In our applications, “keyword” is the most frequently used field type.
> LUCENE-7081 has done prefix-compression for docvalues terms dict. We can do better by replacing prefix-compression with LZ4. In one of our application, the dvd files were ~41% smaller with this change(from 1.95 GB to 1.15 GB).
> I've done simple tests based on the real application data, comparing the write/merge time cost, and the on-disk *.dvd file size(after merge into 1 segment).
> || ||Before||After||
> |Write time cost(ms)|591972|618200|
> |Merge time cost(ms)|270661|294663|
> |*.dvd file size(GB)|1.95|1.15|
> This feature is only for the high-cardinality fields.
> I'm doing the benchmark test based on luceneutil. Will attach the report and patch after the test.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
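For reference, the relative cost behind the Before/After table in the issue description can be worked out directly. The snippet below is only a sketch of that arithmetic; the helper name `pct_change` and the dictionary layout are illustrative and not part of Lucene or luceneutil.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures copied from the Before/After table in the issue description.
measurements = {
    "Write time cost (ms)": (591972, 618200),
    "Merge time cost (ms)": (270661, 294663),
    "*.dvd file size (GB)": (1.95, 1.15),
}

changes = {name: pct_change(b, a) for name, (b, a) in measurements.items()}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

On these numbers, writing is about 4.4% slower and merging about 8.9% slower, in exchange for *.dvd files that are about 41% smaller.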
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303492#comment-17303492 ]

ASF subversion and git services commented on LUCENE-9663:

Commit d6a554138d2fcde7065e85bc1770207b6eca5736 in lucene's branch refs/heads/main from Bruno Roustant
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d6a5541 ]

LUCENE-9663: Move to 8.9.0 section in CHANGES.txt.
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303487#comment-17303487 ]

ASF subversion and git services commented on LUCENE-9663:

Commit b61b19c746a35adeb7c5befccfb3bed2e46e91cc in lucene-solr's branch refs/heads/branch_8x from jaison
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b61b19c ]

LUCENE-9663: Add compression to terms dict from SortedSet/Sorted DocValues.
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302487#comment-17302487 ]

Adrien Grand commented on LUCENE-9663:

+1 to backport
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302448#comment-17302448 ]

Michael McCandless commented on LUCENE-9663:

Oh, why not backport this to 8.x? It is not API changing, right? Just smaller indices, slightly slower ord lookup?
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284851#comment-17284851 ]

ASF subversion and git services commented on LUCENE-9663:

Commit 5856c0f176c27b9ea683c63439960dd41e3e45f2 in lucene-solr's branch refs/heads/master from jaison
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5856c0f ]

LUCENE-9663: Add compression to terms dict from SortedSet/Sorted DocValues. Closes #2302
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281779#comment-17281779 ]

Bruno Roustant commented on LUCENE-9663:

I'm ready to merge. I think it could go to 8.9 branch but I'd like to have confirmation. This change adds compression to Lucene80DocValuesFormat if the Mode.BEST_COMPRESSION is used and is backward compatible. [~jpountz] any suggestion? Thanks
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280390#comment-17280390 ]

Jaison.Bi commented on LUCENE-9663:

Ok...Will create a new issue..Thanks [~broustant]
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278966#comment-17278966 ]

Bruno Roustant commented on LUCENE-9663:

The latest PR looks good. I'm going to merge it in a couple of days if there is no objection. [~Jaison] you may want to open another Jira issue if you want to propose more configuration for the compression (and you can link it to this issue).
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268982#comment-17268982 ]

Jaison.Bi commented on LUCENE-9663:

{quote}In future tests you could ask Lucene to disable compound file format.{quote}
OK :)

{quote}But, building the {{OrdinalMap}} got quite a bit slower in some cases, if I'm reading the above table correctly? E.g. ~1.2 seconds to ~2.1 seconds for field {{extend}}? But other fields were less heavily impacted.{quote}
Correct. The average value size of field "extend" is larger than that of the other fields, and a larger value size means more decompression overhead.

{quote}This is likely an OK tradeoff – we pay that slower price once per refresh, but gain a substantially smaller index for text heavy / high cardinality SSDV fields.{quote}
So this feature is currently only enabled in BEST_COMPRESSION mode. Thanks [~mikemccand]
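The point about value size can be illustrated with a toy model: when terms are stored in compressed blocks, reading a single term by ordinal means decompressing the whole block that contains it, so larger values mean more bytes to decompress per lookup. The sketch below makes several assumptions that are not taken from the patch: zlib from the Python standard library stands in for LZ4, the block size of 64 terms and the length-prefixed layout are invented for illustration, and the sample terms are made up. It is not Lucene's on-disk format.

```python
import zlib

BLOCK_SIZE = 64  # terms per block; an illustrative value, not Lucene's setting


def encode_blocks(sorted_terms):
    """Compress sorted terms in fixed-size blocks (zlib stands in for LZ4)."""
    blocks = []
    for start in range(0, len(sorted_terms), BLOCK_SIZE):
        chunk = sorted_terms[start:start + BLOCK_SIZE]
        # Length-prefix each term so the block can be split after decompression.
        raw = b"".join(len(t).to_bytes(2, "big") + t for t in chunk)
        blocks.append(zlib.compress(raw))
    return blocks


def lookup_ord(blocks, ord_):
    """Return (term, bytes_decompressed) for the term with ordinal `ord_`."""
    # The whole block must be decompressed even though one term is wanted.
    raw = zlib.decompress(blocks[ord_ // BLOCK_SIZE])
    pos = 0
    for _ in range(ord_ % BLOCK_SIZE):
        pos += 2 + int.from_bytes(raw[pos:pos + 2], "big")
    length = int.from_bytes(raw[pos:pos + 2], "big")
    return raw[pos + 2:pos + 2 + length], len(raw)


# Made-up sample data: one field with short values, one with long values.
short_terms = sorted(b"host-%04d" % i for i in range(256))
long_terms = sorted(b"https://example.com/some/long/path/%04d?x=y" % i for i in range(256))

short_blocks = encode_blocks(short_terms)
long_blocks = encode_blocks(long_terms)

term, short_cost = lookup_ord(short_blocks, 100)
_, long_cost = lookup_ord(long_blocks, 100)
print(term, short_cost, long_cost)
```

In this model the number of bytes decompressed per lookup grows with the average term length, which matches the direction of the effect described for field "extend", though not necessarily its magnitude.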
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267890#comment-17267890 ]

Michael McCandless commented on LUCENE-9663:

{quote}(I did not count the dvd file size since compound files exist){quote}
In future tests you could ask Lucene to disable compound file format.

Wow, 6.23 GB -> 5.38 GB is an impressive compression gain!

But, building the {{OrdinalMap}} got quite a bit slower in some cases, if I'm reading the above table correctly? E.g. ~1.2 seconds to ~2.1 seconds for field {{extend}}? But other fields were less heavily impacted.

This is likely an OK tradeoff – we pay that slower price once per refresh, but gain a substantially smaller index for text heavy / high cardinality SSDV fields.
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267595#comment-17267595 ]

Jaison.Bi commented on LUCENE-9663:

[~mikemccand] [~jpountz] [~sokolov] Please help to review the pull request, thanks :)
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266978#comment-17266978 ]

Jaison.Bi commented on LUCENE-9663:

Should I change Lucene80DocValuesFormat to Lucene90DocValuesFormat and move into package "lucene90"?
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266976#comment-17266976 ]

Jaison.Bi commented on LUCENE-9663:

Thanks for the comment, [~mikemccand]

I added a benchmark to compare the cost of building the OrdinalMap, still using the data mentioned in the previous comment. Each index contains 4 segments.

Index directory size:
||Before||After||
|6.23 GB|5.38 GB|
(I did not count the dvd file size since compound files exist.)

See the results below:
||Benchmark||Mode||Cnt||Score||Error||Units||
|BuildOrdinalMapBenchmark.buildOrdinalMap_extend_After|avgt|15|2120.204|± 111.956|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_extend_Before|avgt|15|1217.172|± 57.555|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_host_After|avgt|15|4.775|± 0.260|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_host_Before|avgt|15|4.667|± 0.154|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_obj_After|avgt|15|670.785|± 52.170|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_obj_Before|avgt|15|557.300|± 80.592|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_reqid_After|avgt|15|876.092|± 112.798|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_reqid_Before|avgt|15|515.775|± 61.233|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_uploadtime_After|avgt|15|167.986|± 5.600|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_uploadtime_Before|avgt|15|162.752|± 1.934|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_url_After|avgt|15|667.657|± 18.655|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_url_Before|avgt|15|524.013|± 27.244|ms/op|
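To make the per-field impact in the benchmark table above easier to compare, the Before/After scores can be turned into slowdown ratios. This is only a sketch of that calculation: it uses the mean scores as reported and ignores the error column, and the variable names are illustrative.

```python
# Mean ms/op per field as (before, after), copied from the
# BuildOrdinalMapBenchmark results in the comment above.
results = {
    "extend": (1217.172, 2120.204),
    "host": (4.667, 4.775),
    "obj": (557.300, 670.785),
    "reqid": (515.775, 876.092),
    "uploadtime": (162.752, 167.986),
    "url": (524.013, 667.657),
}

# Ratio > 1 means building the OrdinalMap got slower with compression.
slowdown = {field: after / before for field, (before, after) in results.items()}

for field, ratio in sorted(slowdown.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{field}: {ratio:.2f}x")
```

Fields "extend" and "reqid" show the largest ratios (roughly 1.7x), while "host" and "uploadtime" are close to 1.0x; for "host" the difference is smaller than the reported error margins.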
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266862#comment-17266862 ]

Michael McCandless commented on LUCENE-9663:

{quote}Also +1 to test how slower building an OrdinalMap gets with this change.{quote}
+1 too – this is done on every refresh, typically.
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266861#comment-17266861 ]

Michael McCandless commented on LUCENE-9663:

Whoa, it is impressive the {{*SSDVFacets}} tasks were not impacted by this compression! Those tasks heavily use the {{SortedSetDocValues}} terms dictionary at the end of each query, to resolve ordinals back to labels.
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266814#comment-17266814 ]

Jaison.Bi commented on LUCENE-9663:

This feature shares the same configuration introduced by LUCENE-9378, so it's not enabled by default currently.
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266811#comment-17266811 ]

Jaison.Bi commented on LUCENE-9663:

The luceneutil benchmark (source: wikimedium10m) does not show an obvious QPS reduction after this change:

||Task||QPS baseline||StdDev||QPS my_modified_version||StdDev||Pct diff||p-value||
|Fuzzy2|69.46|(13.6%)|64.58|(17.9%)|-7.0% ( -33% - 28%)|0.197|
|OrHighMed|53.91|(3.5%)|52.69|(4.0%)|-2.3% ( -9% - 5%)|0.078|
|OrHighHigh|26.47|(3.7%)|26.00|(2.8%)|-1.8% ( -7% - 4%)|0.112|
|Fuzzy1|77.94|(11.0%)|76.62|(11.2%)|-1.7% ( -21% - 23%)|0.656|
|Prefix3|91.45|(3.0%)|90.82|(4.3%)|-0.7% ( -7% - 6%)|0.588|
|LowTerm|1411.91|(5.7%)|1402.69|(5.0%)|-0.7% ( -10% - 10%)|0.722|
|MedPhrase|168.82|(3.7%)|168.37|(3.8%)|-0.3% ( -7% - 7%)|0.832|
|OrHighLow|544.05|(7.3%)|543.19|(8.3%)|-0.2% ( -14% - 16%)|0.953|
|LowSloppyPhrase|19.33|(2.6%)|19.37|(3.5%)|0.2% ( -5% - 6%)|0.858|
|HighSpanNear|3.20|(2.5%)|3.21|(4.2%)|0.2% ( -6% - 7%)|0.871|
|Wildcard|129.32|(6.0%)|129.58|(3.9%)|0.2% ( -9% - 10%)|0.910|
|PKLookup|202.45|(3.2%)|202.85|(3.3%)|0.2% ( -6% - 6%)|0.859|
|BrowseDayOfYearSSDVFacets|14.23|(2.2%)|14.26|(2.4%)|0.3% ( -4% - 4%)|0.749|
|MedSpanNear|184.22|(3.4%)|184.76|(4.6%)|0.3% ( -7% - 8%)|0.832|
|HighIntervalsOrdered|13.82|(2.0%)|13.89|(2.8%)|0.5% ( -4% - 5%)|0.519|
|HighTermTitleBDVSort|93.32|(12.6%)|93.87|(12.0%)|0.6% ( -21% - 28%)|0.889|
|HighTermDayOfYearSort|74.63|(10.7%)|75.08|(12.3%)|0.6% ( -20% - 26%)|0.878|
|MedSloppyPhrase|129.10|(2.5%)|129.89|(4.2%)|0.6% ( -5% - 7%)|0.611|
|HighPhrase|19.91|(3.1%)|20.03|(2.8%)|0.6% ( -5% - 6%)|0.552|
|HighSloppyPhrase|21.03|(2.1%)|21.16|(3.5%)|0.6% ( -4% - 6%)|0.524|
|Respell|52.62|(4.2%)|52.97|(2.6%)|0.7% ( -5% - 7%)|0.588|
|TermDTSort|240.48|(13.1%)|242.13|(12.7%)|0.7% ( -22% - 30%)|0.876|
|IntNRQ|113.26|(3.3%)|114.07|(3.3%)|0.7% ( -5% - 7%)|0.527|
|AndHighHigh|53.15|(3.8%)|53.55|(3.7%)|0.8% ( -6% - 8%)|0.553|
|LowSpanNear|22.72|(2.5%)|22.92|(2.8%)|0.8% ( -4% - 6%)|0.349|
|MedTerm|1383.09|(3.9%)|1399.20|(5.4%)|1.2% ( -7% - 10%)|0.474|
|BrowseDayOfYearTaxoFacets|3.09|(5.2%)|3.14|(4.5%)|1.4% ( -7% - 11%)|0.401|
|HighTermMonthSort|92.89|(16.9%)|94.23|(17.4%)|1.4% ( -28% - 42%)|0.807|
|AndHighMed|278.15|(4.2%)|282.18|(4.7%)|1.4% ( -7% - 10%)|0.345|
|BrowseDateTaxoFacets|3.09|(5.2%)|3.14|(4.4%)|1.6% ( -7% - 11%)|0.330|
|BrowseMonthTaxoFacets|3.39|(6.0%)|3.44|(5.1%)|1.6% ( -8% - 13%)|0.398|
|BrowseMonthSSDVFacets|15.74|(6.4%)|16.00|(3.3%)|1.7% ( -7% - 12%)|0.337|
|LowPhrase|319.40|(3.5%)|324.87|(5.1%)|1.7% ( -6% - 10%)|0.252|
|AndHighLow|730.59|(4.6%)|744.60|(4.8%)|1.9% ( -7% - 11%)|0.238|
|OrNotHighLow|660.02|(5.7%)|673.32|(3.9%)|2.0% ( -7% - 12%)|0.231|
|HighTerm|1289.67|(4.6%)|1316.15|(4.9%)|2.1% ( -7% - 12%)|0.210|
|OrHighNotMed|691.04|(7.0%)|711.12|(5.6%)|2.9% ( -9% - 16%)|0.182|
|OrHighNotHigh|610.79|(8.2%)|631.12|(5.8%)|3.3% ( -9% - 18%)|0.171|
|OrNotHighMed|637.03|(6.9%)|658.85|(7.0%)|3.4% ( -9% - 18%)|0.152|
|OrNotHighHigh|599.42|(5.9%)|620.44|(5.3%)|3.5% ( -7% - 15%)|0.070|
|OrHighNotLow|861.26|(6.1%)|912.07|(7.8%)|5.9% ( -7% - 21%)|0.014|

I also wrote a separate benchmark for faceting:
* 25 SortedSet fields per document.
* The cardinality and average value length of selected fields are listed below:

||Field Name||Cardinality||Avg Value Length||
|reqid|3772370|69|
|extend|3758007|343|
|url|3623677|61|
|obj|3599083|57|
|uploadtime|1064012|136|
|host|2418|12|

This benchmark focuses on the latency of building SortedSetDocValuesReaderState (which reads the terms dict) and of getting the top 10 children. See the results below:

||Benchmark||Mode||Cnt||Score||Error||Units||
|SortedSetFacetBenchmark.testBuildReaderState_After|avgt|15|32772.487|± 926.056|ms/op|
|SortedSetFacetBenchmark.testBuildReaderState_Before|avgt|15|19462.099|± 906.832|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_extend_After|avgt|15|1575.330|± 22.725|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_extend_Before|avgt|15|1559.596|± 18.216|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_host_After|avgt|15|1599.762|± 81.167|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_host_Before|avgt|15|1573.225|± 25.173|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_obj_After|avgt|15|1578.812|± 19.121|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_obj_Before|avgt|15|1578.499|± 16.796|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_reqid_After|avgt|15|1575.300|± 13.651|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_reqid_Before|avgt|15|1562.115|± 27.098|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_uploadtime_After|avgt|15|1560.106|± 18.756|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_uploadtime_Before|avgt|15|1556.131|± 14.161|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_url_After|avgt|15|1568.535|± 23.545|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_url_Before|avgt|15|1554.675|± 23.721|ms/op|

So the operations read
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265621#comment-17265621 ]

Jaison.Bi commented on LUCENE-9663:

Theoretically, prefix + LZ4 should be better. Because the terms are sorted, neighbouring terms often share a prefix. LZ4, on the other hand, cannot compress the beginning of a block (there is no earlier data in the block to reference when looking for a duplicate string).
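To make the prefix argument above concrete, here is a small sketch of prefix compression (front coding) over sorted terms. It only illustrates the general technique; the function names and the `(shared_length, suffix)` representation are assumptions and do not reflect how Lucene encodes its blocks.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    limit = min(len(a), len(b))
    i = 0
    while i < limit and a[i] == b[i]:
        i += 1
    return i


def prefix_encode(sorted_terms):
    """Encode each term as (length of prefix shared with the previous term, remaining suffix)."""
    encoded = []
    previous = ""
    for term in sorted_terms:
        shared = common_prefix_len(previous, term)
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded


def prefix_decode(encoded):
    """Rebuild the original terms from (shared_length, suffix) pairs."""
    terms = []
    previous = ""
    for shared, suffix in encoded:
        term = previous[:shared] + suffix
        terms.append(term)
        previous = term
    return terms


terms = ["/img/a.png", "/img/b.png", "/index.html"]
encoded = prefix_encode(terms)
# encoded == [(0, "/img/a.png"), (5, "b.png"), (2, "ndex.html")]
```

The first term has nothing to share with and is stored in full, which mirrors the point about the start of a block. The repeated `.png` suffix is not captured by prefix coding at all; that kind of repetition is what a general-purpose compressor such as LZ4 could still pick up.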
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264719#comment-17264719 ]

Jaison.Bi commented on LUCENE-9663:

Thanks, Adrien Grand.
{quote}My intuition is that it would actually be better to do LZ4 in addition to prefix compression, like we do for the terms dictionary of the inverted index
{quote}
I have compared prefix + lz4 against lz4 only, and also varied the number of docs per block to see the difference. See the results below:

||compression type||docs per block||*.dvd file size||write time cost||merge time cost||
|prefix + lz4|256|1.04GB|648456ms|375966ms|
|lz4-only|256|1.08GB|639489ms|350477ms|
|lz4-only|64|1.15GB|625797ms|298093ms|
|lz4-only|128|1.1GB|618034ms|320740ms|
|lz4-only|512|1.07GB|639892ms|458737ms|

It seems that prefix compression + lz4 does not bring a significant improvement. I think this is because lz4 already handles the "common prefix" well :-)
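As a rough aid for reading the table in the comment above, the following sketch computes the relative differences between the two 256-docs-per-block rows. The numbers are copied from that table; the helper name `pct_change` is only illustrative.

```python
def pct_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# (dvd size in GB, write time in ms, merge time in ms), both at 256 docs per block
prefix_lz4 = (1.04, 648456, 375966)
lz4_only = (1.08, 639489, 350477)

size_delta = round(pct_change(prefix_lz4[0], lz4_only[0]), 1)   # file size
write_delta = round(pct_change(prefix_lz4[1], lz4_only[1]), 1)  # write time
merge_delta = round(pct_change(prefix_lz4[2], lz4_only[2]), 1)  # merge time

print(f"size {size_delta}%, write {write_delta}%, merge {merge_delta}%")
```

With these inputs the file is about 3.7% smaller with prefix + lz4, while write time is about 1.4% higher and merge time about 7.3% higher.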
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264171#comment-17264171 ]

Adrien Grand commented on LUCENE-9663:

+1 to add lightweight compression to doc-value terms dictionaries. I've seen users store things like unique URLs in sorted doc-value fields where compressing suffixes would have helped.

I agree with Jaison that the query impact should be negligible since faceting typically bottlenecks on reading ordinals, not terms dictionaries, though we should double check. :) Also +1 to test how much slower building an OrdinalMap gets with this change.

bq. replacing prefix-compression with LZ4

My intuition is that it would actually be better to do LZ4 in addition to prefix compression, like we do for the terms dictionary of the inverted index.
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263831#comment-17263831 ]

Jaison.Bi commented on LUCENE-9663:

Thanks for the comment, [~sokolov]
{quote}if you are running luceneutil tests, could you please also report QPS changes?
{quote}
Sure, I will.
{quote}I'm not clear what the usage of this {{keywords}} field is exactly - is it used for aggregations?
{quote}
Yes, the "keyword" field is mostly used for aggregations.
{quote}It would be good to run a faceting test; luceneutil doesn't really have any tests of high-cardinality SSDV aggregations; I think day-of-year is the closest it gets. Maybe you could add one? It's important to test the impact on the query side.
{quote}
OK, I will learn how to change luceneutil. Meanwhile, I can run another benchmark using *esrally* as a supplement, since it has some aggregation tests. Would that be all right?

Actually, aggregations use *global ordinal data* instead of the terms dict, so terms dict compression will affect the performance of building the global ordinal data. Anyway, I will test the impact on the query side.
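The point about global ordinals in the comment above can be sketched as a merge of per-segment sorted term lists. This is a simplified illustration, not Lucene's OrdinalMap implementation; the function name and the in-memory list representation are assumptions.

```python
import heapq


def build_global_ordinals(segment_terms):
    """Merge sorted, de-duplicated per-segment term lists into global ordinals.

    Returns (global_terms, seg_to_global), where
    seg_to_global[segment][local_ordinal] is the global ordinal of that term.
    """
    seg_to_global = [[0] * len(terms) for terms in segment_terms]
    # Tag every term with its segment and local ordinal so the merge can map it back.
    streams = [
        [(term, seg, local) for local, term in enumerate(terms)]
        for seg, terms in enumerate(segment_terms)
    ]
    global_terms = []
    for term, seg, local in heapq.merge(*streams):
        # A term seen in several segments receives a single global ordinal.
        if not global_terms or global_terms[-1] != term:
            global_terms.append(term)
        seg_to_global[seg][local] = len(global_terms) - 1
    return global_terms, seg_to_global


segments = [["apple", "cherry"], ["banana", "cherry", "date"]]
global_terms, seg_to_global = build_global_ordinals(segments)
```

Note that the merge reads every term of every segment once. With a compressed terms dictionary, each of those reads would involve decompression, which is why building the global ordinal data is the step where the compression cost is most likely to show up.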
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263442#comment-17263442 ]

Michael Sokolov commented on LUCENE-9663:

Interesting - if you are running luceneutil tests, could you please also report QPS changes?