[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-03-17 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303494#comment-17303494
 ] 

Bruno Roustant commented on LUCENE-9663:


OK, I backported this to the 8.x branch and updated CHANGES.txt in main to 
move the entry to the 8.9.0 section.

> Adding compression to terms dict from SortedSet/Sorted DocValues
> 
>
> Key: LUCENE-9663
> URL: https://issues.apache.org/jira/browse/LUCENE-9663
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Reporter: Jaison.Bi
> Priority: Trivial
> Fix For: 8.9
>
> Time Spent: 11h 10m
> Remaining Estimate: 0h
>
> Elasticsearch keyword fields use SortedSet DocValues. In our applications, 
> “keyword” is the most frequently used field type.
> LUCENE-7081 added prefix compression for the docvalues terms dict. We can do 
> better by replacing prefix compression with LZ4. In one of our applications, 
> the dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
> I ran simple tests on real application data, comparing the write and merge 
> time cost and the on-disk *.dvd file size (after merging into 1 segment).
> || ||Before||After||
> |Write time cost (ms)|591972|618200|
> |Merge time cost (ms)|270661|294663|
> |*.dvd file size (GB)|1.95|1.15|
> This feature is only for high-cardinality fields.
> I'm running a benchmark based on luceneutil and will attach the report and 
> patch after the test.
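For readers unfamiliar with the prefix compression mentioned in the description, here is a minimal sketch of the general idea (often called front coding): each term in a sorted list is stored as the number of leading bytes it shares with the previous term plus the remaining suffix. The function names and sample terms are hypothetical, and this is not a description of Lucene's actual on-disk layout.

```python
def shared_prefix_len(a, b):
    """Return how many leading bytes a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_terms):
    """Encode sorted byte strings as (shared_prefix_len, suffix) pairs."""
    encoded = []
    previous = b""
    for term in sorted_terms:
        shared = shared_prefix_len(previous, term)
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded


def front_decode(encoded):
    """Rebuild the original terms from (shared_prefix_len, suffix) pairs."""
    terms = []
    previous = b""
    for shared, suffix in encoded:
        term = previous[:shared] + suffix
        terms.append(term)
        previous = term
    return terms


# Hypothetical sample data: sorted terms with long common prefixes.
terms = sorted(f"https://example.com/api/v1/items/{i:04d}".encode() for i in range(8))
encoded = front_encode(terms)
raw_bytes = sum(len(t) for t in terms)
suffix_bytes = sum(len(suffix) for _, suffix in encoded)
print(f"raw term bytes: {raw_bytes}, suffix bytes after front coding: {suffix_bytes}")
```

Front coding only removes redundancy at the start of neighbouring terms. A general-purpose compressor such as LZ4 can also exploit repetition in the middle or at the end of terms, which is presumably where the extra savings reported above come from.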



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-03-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303492#comment-17303492
 ] 

ASF subversion and git services commented on LUCENE-9663:
-

Commit d6a554138d2fcde7065e85bc1770207b6eca5736 in lucene's branch 
refs/heads/main from Bruno Roustant
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d6a5541 ]

LUCENE-9663: Move to 8.9.0 section in CHANGES.txt.





[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-03-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303487#comment-17303487
 ] 

ASF subversion and git services commented on LUCENE-9663:
-

Commit b61b19c746a35adeb7c5befccfb3bed2e46e91cc in lucene-solr's branch 
refs/heads/branch_8x from jaison
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b61b19c ]

LUCENE-9663: Add compression to terms dict from SortedSet/Sorted DocValues.





[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-03-16 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302487#comment-17302487
 ] 

Adrien Grand commented on LUCENE-9663:
--

+1 to backport




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-03-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302448#comment-17302448
 ] 

Michael McCandless commented on LUCENE-9663:


Oh, why not backport this to 8.x?  It does not change any API, right?  It just 
gives smaller indices at the cost of slightly slower ord lookup?
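One way to picture the "slightly slower ord lookup" trade-off is the sketch below. It groups sorted terms into fixed-size blocks, compresses each block, and resolves an ordinal by decompressing the block that contains it. Everything here is an assumption made for illustration: `zlib` stands in for LZ4 (which is not in the Python standard library), the block size of 4 and the newline delimiter are arbitrary, and the function names are hypothetical. It is not Lucene's actual format.

```python
import zlib

BLOCK_SIZE = 4  # terms per block; an arbitrary value chosen for this sketch


def build_blocks(sorted_terms, block_size=BLOCK_SIZE):
    """Compress consecutive groups of terms into independent blocks."""
    blocks = []
    for start in range(0, len(sorted_terms), block_size):
        chunk = sorted_terms[start:start + block_size]
        # Terms must not contain b"\n", which is used as the delimiter here.
        blocks.append(zlib.compress(b"\n".join(chunk)))
    return blocks


def lookup_ord(blocks, ord_, block_size=BLOCK_SIZE):
    """Resolve an ordinal to its term by decompressing one whole block."""
    block_index, offset = divmod(ord_, block_size)
    chunk = zlib.decompress(blocks[block_index]).split(b"\n")
    return chunk[offset]


# Hypothetical sample data.
terms = sorted(f"host-{i:03d}.example.com".encode() for i in range(10))
blocks = build_blocks(terms)
print(lookup_ord(blocks, 6))
```

The cost of a single lookup is one block decompression, so larger blocks or larger values mean more work per lookup in exchange for a better compression ratio.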




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-02-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284851#comment-17284851
 ] 

ASF subversion and git services commented on LUCENE-9663:
-

Commit 5856c0f176c27b9ea683c63439960dd41e3e45f2 in lucene-solr's branch 
refs/heads/master from jaison
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5856c0f ]

LUCENE-9663: Add compression to terms dict from SortedSet/Sorted DocValues.

Closes #2302





[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-02-09 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281779#comment-17281779
 ] 

Bruno Roustant commented on LUCENE-9663:


I'm ready to merge. I think this could go to the 8.9 branch, but I'd like 
confirmation first.

This change adds compression to Lucene80DocValuesFormat when 
Mode.BEST_COMPRESSION is used, and it is backward compatible.

[~jpountz] any suggestions? Thanks.




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-02-06 Thread Jaison.Bi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280390#comment-17280390
 ] 

Jaison.Bi commented on LUCENE-9663:
---

OK, I will create a new issue. Thanks, [~broustant].




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-02-04 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278966#comment-17278966
 ] 

Bruno Roustant commented on LUCENE-9663:


The latest PR looks good. I'm going to merge it in a couple of days if there is 
no objection.

[~Jaison] you may want to open another Jira issue if you want to propose more 
configuration for the compression (and you can link it to this issue).




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-01-20 Thread Jaison.Bi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268982#comment-17268982
 ] 

Jaison.Bi commented on LUCENE-9663:
---

{quote}In future tests you could ask Lucene to disable compound file format.
{quote}
ok:)
{quote}But, building the {{OrdinalMap}} got quite a bit slower in some cases, 
if I'm reading the above table correctly?  E.g. ~1.2 seconds to ~2.1 seconds 
for field {{extend}}? But other fields were less heavily impacted. 
{quote}
Correct. The average value size of the field "extend" is larger than that of 
the other fields, and larger values mean more decompression overhead.
{quote}This is likely an OK tradeoff – we pay that slower price once per 
refresh, but gain a substantially smaller index for text heavy / high 
cardinality SSDV fields.
{quote}
So this feature is currently enabled only in BEST_COMPRESSION mode.

Thanks [~mikemccand]

 




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-01-19 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267890#comment-17267890
 ] 

Michael McCandless commented on LUCENE-9663:


{quote}(I did not count the dvd file size since compound files exist)
{quote}
In future tests you could ask Lucene to disable the compound file format.

Wow, 6.23 GB -> 5.38 GB is an impressive compression gain!

But, building the {{OrdinalMap}} got quite a bit slower in some cases, if I'm 
reading the above table correctly?  E.g. ~1.2 seconds to ~2.1 seconds for field 
{{extend}}?  But other fields were less heavily impacted.  This is likely an OK 
tradeoff – we pay that slower price once per refresh, but gain a substantially 
smaller index for text heavy / high cardinality SSDV fields.




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-01-18 Thread Jaison.Bi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267595#comment-17267595
 ] 

Jaison.Bi commented on LUCENE-9663:
---

[~mikemccand] [~jpountz] [~sokolov]

Please help to review the pull request, thanks :)




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-01-17 Thread Jaison.Bi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266978#comment-17266978
 ] 

Jaison.Bi commented on LUCENE-9663:
---

Should I change Lucene80DocValuesFormat to Lucene90DocValuesFormat and move 
it into the "lucene90" package?




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-01-17 Thread Jaison.Bi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266976#comment-17266976
 ] 

Jaison.Bi commented on LUCENE-9663:
---

Thanks for the comment, [~mikemccand]

I added a benchmark to compare the cost of building an OrdinalMap before and 
after the change. It uses the same data mentioned in my previous comment. Each 
index contains 4 segments.

Index directory size:
||Before||After||
|6.23 GB|5.38 GB|

(I did not count the dvd file size since compound files exist.)

Results:
||Benchmark||Mode||Cnt||Score||Error||Units||
|BuildOrdinalMapBenchmark.buildOrdinalMap_extend_After|avgt|15|2120.204|± 111.956|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_extend_Before|avgt|15|1217.172|± 57.555|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_host_After|avgt|15|4.775|± 0.260|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_host_Before|avgt|15|4.667|± 0.154|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_obj_After|avgt|15|670.785|± 52.170|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_obj_Before|avgt|15|557.300|± 80.592|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_reqid_After|avgt|15|876.092|± 112.798|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_reqid_Before|avgt|15|515.775|± 61.233|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_uploadtime_After|avgt|15|167.986|± 5.600|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_uploadtime_Before|avgt|15|162.752|± 1.934|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_url_After|avgt|15|667.657|± 18.655|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_url_Before|avgt|15|524.013|± 27.244|ms/op|
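To make the table easier to compare field by field, the short script below derives the After/Before ratio for each field and the relative reduction in index directory size. The numbers are copied from the tables above; the variable names are only illustrative, and the error columns are ignored in this simple ratio.

```python
# (before_ms, after_ms) average scores copied from the benchmark table above.
build_ordinal_map_ms = {
    "extend": (1217.172, 2120.204),
    "host": (4.667, 4.775),
    "obj": (557.300, 670.785),
    "reqid": (515.775, 876.092),
    "uploadtime": (162.752, 167.986),
    "url": (524.013, 667.657),
}

# Ratio > 1 means building the OrdinalMap got slower after the change.
slowdown = {
    field: after / before
    for field, (before, after) in build_ordinal_map_ms.items()
}

# Index directory size went from 6.23 GB to 5.38 GB.
index_size_reduction = (6.23 - 5.38) / 6.23

for field, ratio in sorted(slowdown.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{field:<12}{ratio:.2f}x")
print(f"index size reduction: {index_size_reduction:.1%}")
```

On these numbers, "extend" and "reqid" slow down by roughly 1.7x, "url" and "obj" by roughly 1.2x to 1.3x, and "host" and "uploadtime" stay within about 3% of the baseline, while the index directory shrinks by about 13.6%.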

 




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-01-17 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266862#comment-17266862
 ] 

Michael McCandless commented on LUCENE-9663:


{quote}Also +1 to test how much slower building an OrdinalMap gets with this change.
{quote}
+1 too – this is done on every refresh, typically.
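As background on why this cost matters, an ordinal map conceptually translates each segment's local term ordinals into ordinals over the merged, de-duplicated term list of the whole index. The sketch below shows that idea with a k-way merge over sorted per-segment term lists. It is a simplified illustration under stated assumptions, not Lucene's `OrdinalMap` implementation: the function names and sample segments are hypothetical, and the real class uses compact data structures.

```python
import heapq


def tagged(seg, terms):
    """Yield (term, segment index, local ordinal) for one segment."""
    for local, term in enumerate(terms):
        yield term, seg, local


def build_ordinal_map(segments):
    """Map each segment's local ordinals to global ordinals.

    segments is a list of sorted lists of unique terms. The result is
    (global_terms, seg_to_global), where seg_to_global[s][local_ord]
    is the global ordinal of that term.
    """
    seg_to_global = [[0] * len(terms) for terms in segments]
    global_terms = []
    streams = [tagged(seg, terms) for seg, terms in enumerate(segments)]
    # The merge visits every term of every segment exactly once.
    for term, seg, local in heapq.merge(*streams):
        if not global_terms or global_terms[-1] != term:
            global_terms.append(term)
        seg_to_global[seg][local] = len(global_terms) - 1
    return global_terms, seg_to_global


# Hypothetical sample segments.
segments = [["apple", "cherry"], ["banana", "cherry", "date"]]
global_terms, seg_to_global = build_ordinal_map(segments)
print(global_terms)
print(seg_to_global)
```

Because the merge reads every term in every segment, any extra per-term decoding cost in the terms dictionary is paid across the whole dictionary each time the map is rebuilt.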




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-01-17 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266861#comment-17266861
 ] 

Michael McCandless commented on LUCENE-9663:


Whoa, it is impressive the {{*SSDVFacets}} tasks were not impacted by this 
compression!  Those tasks heavily use the {{SortedSetDocValues}} terms 
dictionary at the end of each query, to resolve ordinals back to labels.




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-01-17 Thread Jaison.Bi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266814#comment-17266814
 ] 

Jaison.Bi commented on LUCENE-9663:
---

This feature shares the same configuration introduced by LUCENE-9378, so it's 
not enabled by default currently.




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-01-17 Thread Jaison.Bi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266811#comment-17266811
 ] 

Jaison.Bi commented on LUCENE-9663:
---

The benchmark results from luceneutil (source: wikimedium10m) do not show an 
obvious QPS reduction after this change:
||Task||QPS baseline||StdDev||QPS my_modified_version||StdDev||Pct diff||p-value||
|Fuzzy2|69.46|(13.6%)|64.58|(17.9%)|-7.0% ( -33% - 28%)|0.197|
|OrHighMed|53.91|(3.5%)|52.69|(4.0%)|-2.3% ( -9% - 5%)|0.078|
|OrHighHigh|26.47|(3.7%)|26.00|(2.8%)|-1.8% ( -7% - 4%)|0.112|
|Fuzzy1|77.94|(11.0%)|76.62|(11.2%)|-1.7% ( -21% - 23%)|0.656|
|Prefix3|91.45|(3.0%)|90.82|(4.3%)|-0.7% ( -7% - 6%)|0.588|
|LowTerm|1411.91|(5.7%)|1402.69|(5.0%)|-0.7% ( -10% - 10%)|0.722|
|MedPhrase|168.82|(3.7%)|168.37|(3.8%)|-0.3% ( -7% - 7%)|0.832|
|OrHighLow|544.05|(7.3%)|543.19|(8.3%)|-0.2% ( -14% - 16%)|0.953|
|LowSloppyPhrase|19.33|(2.6%)|19.37|(3.5%)|0.2% ( -5% - 6%)|0.858|
|HighSpanNear|3.20|(2.5%)|3.21|(4.2%)|0.2% ( -6% - 7%)|0.871|
|Wildcard|129.32|(6.0%)|129.58|(3.9%)|0.2% ( -9% - 10%)|0.910|
|PKLookup|202.45|(3.2%)|202.85|(3.3%)|0.2% ( -6% - 6%)|0.859|
|BrowseDayOfYearSSDVFacets|14.23|(2.2%)|14.26|(2.4%)|0.3% ( -4% - 4%)|0.749|
|MedSpanNear|184.22|(3.4%)|184.76|(4.6%)|0.3% ( -7% - 8%)|0.832|
|HighIntervalsOrdered|13.82|(2.0%)|13.89|(2.8%)|0.5% ( -4% - 5%)|0.519|
|HighTermTitleBDVSort|93.32|(12.6%)|93.87|(12.0%)|0.6% ( -21% - 28%)|0.889|
|HighTermDayOfYearSort|74.63|(10.7%)|75.08|(12.3%)|0.6% ( -20% - 26%)|0.878|
|MedSloppyPhrase|129.10|(2.5%)|129.89|(4.2%)|0.6% ( -5% - 7%)|0.611|
|HighPhrase|19.91|(3.1%)|20.03|(2.8%)|0.6% ( -5% - 6%)|0.552|
|HighSloppyPhrase|21.03|(2.1%)|21.16|(3.5%)|0.6% ( -4% - 6%)|0.524|
|Respell|52.62|(4.2%)|52.97|(2.6%)|0.7% ( -5% - 7%)|0.588|
|TermDTSort|240.48|(13.1%)|242.13|(12.7%)|0.7% ( -22% - 30%)|0.876|
|IntNRQ|113.26|(3.3%)|114.07|(3.3%)|0.7% ( -5% - 7%)|0.527|
|AndHighHigh|53.15|(3.8%)|53.55|(3.7%)|0.8% ( -6% - 8%)|0.553|
|LowSpanNear|22.72|(2.5%)|22.92|(2.8%)|0.8% ( -4% - 6%)|0.349|
|MedTerm|1383.09|(3.9%)|1399.20|(5.4%)|1.2% ( -7% - 10%)|0.474|
|BrowseDayOfYearTaxoFacets|3.09|(5.2%)|3.14|(4.5%)|1.4% ( -7% - 11%)|0.401|
|HighTermMonthSort|92.89|(16.9%)|94.23|(17.4%)|1.4% ( -28% - 42%)|0.807|
|AndHighMed|278.15|(4.2%)|282.18|(4.7%)|1.4% ( -7% - 10%)|0.345|
|BrowseDateTaxoFacets|3.09|(5.2%)|3.14|(4.4%)|1.6% ( -7% - 11%)|0.330|
|BrowseMonthTaxoFacets|3.39|(6.0%)|3.44|(5.1%)|1.6% ( -8% - 13%)|0.398|
|BrowseMonthSSDVFacets|15.74|(6.4%)|16.00|(3.3%)|1.7% ( -7% - 12%)|0.337|
|LowPhrase|319.40|(3.5%)|324.87|(5.1%)|1.7% ( -6% - 10%)|0.252|
|AndHighLow|730.59|(4.6%)|744.60|(4.8%)|1.9% ( -7% - 11%)|0.238|
|OrNotHighLow|660.02|(5.7%)|673.32|(3.9%)|2.0% ( -7% - 12%)|0.231|
|HighTerm|1289.67|(4.6%)|1316.15|(4.9%)|2.1% ( -7% - 12%)|0.210|
|OrHighNotMed|691.04|(7.0%)|711.12|(5.6%)|2.9% ( -9% - 16%)|0.182|
|OrHighNotHigh|610.79|(8.2%)|631.12|(5.8%)|3.3% ( -9% - 18%)|0.171|
|OrNotHighMed|637.03|(6.9%)|658.85|(7.0%)|3.4% ( -9% - 18%)|0.152|
|OrNotHighHigh|599.42|(5.9%)|620.44|(5.3%)|3.5% ( -7% - 15%)|0.070|
|OrHighNotLow|861.26|(6.1%)|912.07|(7.8%)|5.9% ( -7% - 21%)|0.014|

 

I also wrote a separate benchmark for faceting:
 * 25 SortedSet fields per document.
 * The cardinality and average value length of some of the fields are listed below:

||Field Name||Cardinality||Avg Value Length||
|reqid|3772370|69|
|extend|3758007|343|
|url|3623677|61|
|obj|3599083|57|
|uploadtime|1064012|136|
|host|2418|12|

This benchmark focuses on the latency of building a 
SortedSetDocValuesReaderState (which reads the terms dict) and of getting the 
top 10 children. See the results below:
||Benchmark||Mode||Cnt||Score||Error||Units||
|SortedSetFacetBenchmark.testBuildReaderState_After|avgt|15|32772.487|± 926.056|ms/op|
|SortedSetFacetBenchmark.testBuildReaderState_Before|avgt|15|19462.099|± 906.832|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_extend_After|avgt|15|1575.330|± 22.725|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_extend_Before|avgt|15|1559.596|± 18.216|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_host_After|avgt|15|1599.762|± 81.167|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_host_Before|avgt|15|1573.225|± 25.173|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_obj_After|avgt|15|1578.812|± 19.121|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_obj_Before|avgt|15|1578.499|± 16.796|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_reqid_After|avgt|15|1575.300|± 13.651|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_reqid_Before|avgt|15|1562.115|± 27.098|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_uploadtime_After|avgt|15|1560.106|± 18.756|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_uploadtime_Before|avgt|15|1556.131|± 14.161|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_url_After|avgt|15|1568.535|± 23.545|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_url_Before|avgt|15|1554.675|± 23.721|ms/op|
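
For scale, the relative change implied by two rows of the table can be worked 
out as below. This is only a small arithmetic sketch: the class and method 
names are made up for illustration, and the inputs are copied from the Score 
column above.

```java
/** Hypothetical helper for reading the benchmark table; not part of the patch. */
final class ReaderStateSlowdown {

  /** Relative change from before to after, in percent (positive means slower). */
  static double slowdownPercent(double beforeMs, double afterMs) {
    return (afterMs - beforeMs) / beforeMs * 100.0;
  }

  public static void main(String[] args) {
    // testBuildReaderState_Before vs testBuildReaderState_After
    System.out.printf("build reader state: %.1f%% slower%n",
        slowdownPercent(19462.099, 32772.487));
    // testGetTop10Results_extend_Before vs testGetTop10Results_extend_After
    System.out.printf("top 10 (extend):    %.1f%% slower%n",
        slowdownPercent(1559.596, 1575.330));
  }
}
```

On these numbers, building the reader state takes roughly 68% longer, while 
the top-10 difference for the extend field is about 1% and falls within the 
reported error margins.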

 So the operations read 

[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-01-14 Thread Jaison.Bi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265621#comment-17265621
 ] 

Jaison.Bi commented on LUCENE-9663:
---

Theoretically, prefix + LZ4 should be better. Since the terms are sorted, 
neighboring terms share common prefixes. And LZ4 cannot compress the beginning 
of a block (there is no earlier data to reference for a duplicate string).
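
To make the prefix argument concrete, here is a minimal sketch of prefix 
coding over a sorted block of terms. It is an illustration only, not Lucene's 
on-disk format: the class, the Entry type, and the sample URLs are all 
hypothetical. Each term is stored as the length of the prefix it shares with 
the previous term plus the remaining suffix, so the first term of a block is 
always stored in full.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of prefix coding; not Lucene's actual format. */
final class PrefixCodingSketch {

  /** One encoded term: chars shared with the previous term, then the rest. */
  static final class Entry {
    final int prefixLength;
    final String suffix;

    Entry(int prefixLength, String suffix) {
      this.prefixLength = prefixLength;
      this.suffix = suffix;
    }
  }

  /** Encodes sorted terms relative to their predecessor in the block. */
  static List<Entry> encode(List<String> sortedTerms) {
    List<Entry> out = new ArrayList<>();
    String previous = "";
    for (String term : sortedTerms) {
      int max = Math.min(previous.length(), term.length());
      int shared = 0;
      while (shared < max && previous.charAt(shared) == term.charAt(shared)) {
        shared++;
      }
      out.add(new Entry(shared, term.substring(shared)));
      previous = term;
    }
    return out;
  }

  /** Rebuilds the original terms from the encoded entries. */
  static List<String> decode(List<Entry> entries) {
    List<String> out = new ArrayList<>();
    String previous = "";
    for (Entry e : entries) {
      String term = previous.substring(0, e.prefixLength) + e.suffix;
      out.add(term);
      previous = term;
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> terms = Arrays.asList(
        "http://a.example/x", "http://a.example/y", "http://b.example/x");
    for (Entry e : encode(terms)) {
      System.out.println(e.prefixLength + " + \"" + e.suffix + "\"");
    }
  }
}
```

In this sketch the first entry has a shared length of 0, which mirrors the 
point above that the start of a block has nothing earlier to refer to.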




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-01-14 Thread Jaison.Bi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264719#comment-17264719
 ] 

Jaison.Bi commented on LUCENE-9663:
---

Thanks, Adrien Grand.
{quote}My intuition is that it would actually be better to do LZ4 in addition 
to prefix compression, like we do for the terms dictionary of the inverted index
{quote}
I have compared the results of prefix + lz4 against lz4 only, and also tried 
changing the number of docs per block to see the difference. See the results below: 
||compression type||docs per block||*.dvd file size||write time cost||merge time cost||
|prefix + lz4|256|1.04GB|648456ms|375966ms|
|lz4-only|256|1.08GB|639489ms|350477ms|
|lz4-only|64|1.15GB|625797ms|298093ms|
|lz4-only|128|1.1GB|618034ms|320740ms|
|lz4-only|512|1.07GB|639892ms|458737ms|

It seems prefix compression + lz4 does not make a significant improvement. I 
think that is because the "common prefix" is already handled well by lz4 :-) 
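
A rough way to see why a general-purpose compressor already absorbs shared 
prefixes is sketched below. Note the assumptions: the JDK has no built-in LZ4, 
so java.util.zip.Deflater (DEFLATE) is used here as a stand-in, and the block 
of terms is made-up data, so the numbers it prints say nothing about the real 
*.dvd files. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/** Hypothetical sketch; DEFLATE stands in for LZ4, which the JDK lacks. */
final class SharedPrefixCompressionSketch {

  /** Returns the DEFLATE-compressed size of the input, in bytes. */
  static int compressedSize(byte[] input) {
    Deflater deflater = new Deflater();
    deflater.setInput(input);
    deflater.finish();
    byte[] buffer = new byte[1024];
    int total = 0;
    while (!deflater.finished()) {
      total += deflater.deflate(buffer);
    }
    deflater.end();
    return total;
  }

  /** Builds one block of sorted, made-up terms that share a long prefix. */
  static byte[] buildBlock(int terms) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < terms; i++) {
      sb.append(String.format("https://example.org/objects/%08d", i)).append('\n');
    }
    return sb.toString().getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] block = buildBlock(256);
    System.out.println("raw bytes:        " + block.length);
    System.out.println("compressed bytes: " + compressedSize(block));
  }
}
```

Because every term after the first repeats bytes the compressor has already 
seen, the repeated prefix is encoded as back-references, which is the same 
redundancy that explicit prefix coding removes.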




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-01-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264171#comment-17264171
 ] 

Adrien Grand commented on LUCENE-9663:
--

+1 to add lightweight compression to doc-value terms dictionaries. I've seen 
users store things like unique URLs in sorted doc-value fields where 
compressing suffixes would have helped.

I agree with Jaison that the query impact should be negligible since faceting 
typically bottlenecks on reading ordinals, not terms dictionaries, though we 
should double check. :) Also +1 to test how much slower building an OrdinalMap 
gets with this change.

bq. replacing prefix-compression with LZ4

My intuition is that it would actually be better to do LZ4 in addition to 
prefix compression, like we do for the terms dictionary of the inverted index.




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-01-12 Thread Jaison.Bi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263831#comment-17263831
 ] 

Jaison.Bi commented on LUCENE-9663:
---

Thanks for the comment, [~sokolov]
{quote}if you are running luceneutil tests, could you please also report QPS 
changes?
{quote}
Sure, I will.
{quote}I'm not clear what the usage of this {{keywords}} field is exactly - is 
it used for aggregations?
{quote}
Yes, the "keyword" field is used mostly for aggregations. 
{quote}It would be good to run a faceting test; luceneutil doesn't really have 
any tests of high-cardinality SSDV aggregations; I think day-of-year is the 
closest it gets. Maybe you could add one? It's important to test the impact on 
the query side.
{quote}
OK, I will learn how to change luceneutil. Meanwhile, I can run another 
benchmark using *esrally* as a supplement, since it has some aggregation tests. 
Would that be all right?

Actually, aggregations use *global ordinal data* rather than the terms dict, 
so terms dict compression will affect the performance of building the global 
ordinal data. Anyway, I will test the impact on the query side.




[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-01-12 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263442#comment-17263442
 ] 

Michael Sokolov commented on LUCENE-9663:
-

Interesting - if you are running luceneutil tests, could you please also report 
QPS changes?
