[ 
https://issues.apache.org/jira/browse/LUCENE-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17115885#comment-17115885
 ] 

Adrien Grand commented on LUCENE-9378:
--------------------------------------

bq. Another factor that probably plays a role here is how compressible the data 
is

I looked a bit more into the data we use for benchmarking. wikibigall already 
makes titles quite compressible, since linefile docs are sorted by title in 
the input file. wikimedium makes them far more compressible, since it splits 
articles into 1kB chunks that all share the same title, creating many duplicate 
values for adjacent doc IDs. This makes wikimedium titles a best-case scenario 
for compression (and thus a worst-case scenario for search speed), and I'd 
expect performance numbers to be significantly different between wikimediumall 
and wikibigall, and again between wikibigall and a shuffled copy of wikibigall 
that would no longer be sorted by title.

Are wikimedium titles representative of the data that you are indexing into 
binary doc values at Amazon, i.e. are adjacent doc IDs likely to get the exact 
same value? If that's the case, then we could probably add ad-hoc compression 
for this case that would be faster at read time than LZ4, and we could 
make the decision automatically at index time instead of requiring users to 
configure a flag.
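
To make the idea concrete, here is a rough sketch of what I mean by ad-hoc 
compression for runs of identical adjacent values. This is not Lucene code: the 
class name, the method names and the threshold parameter are all made up for 
illustration, and a real implementation would write to the doc values files 
rather than keep lists in memory. It stores each run of equal values once, 
remembers the first doc ID of each run, and resolves a doc ID with a binary 
search instead of decompressing a block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch (not a Lucene API): dedup runs of equal adjacent values. */
final class AdjacentDedupSketch {

  /** One stored value per run, plus the first doc ID covered by each run. */
  static final class Encoded {
    final List<byte[]> runValues = new ArrayList<>();
    final List<Integer> runStarts = new ArrayList<>();
    int docCount;

    /** Returns the value of docId by locating the last run starting at or before it. */
    byte[] get(int docId) {
      if (docId < 0 || docId >= docCount) {
        throw new IndexOutOfBoundsException("docId=" + docId);
      }
      int lo = 0;
      int hi = runStarts.size() - 1;
      while (lo < hi) {
        int mid = (lo + hi + 1) >>> 1;
        if (runStarts.get(mid) <= docId) {
          lo = mid;
        } else {
          hi = mid - 1;
        }
      }
      return runValues.get(lo);
    }
  }

  /** Fraction of adjacent doc ID pairs that carry exactly the same bytes. */
  static double adjacentDuplicateRatio(List<byte[]> values) {
    if (values.size() < 2) {
      return 0.0;
    }
    int duplicates = 0;
    for (int i = 1; i < values.size(); i++) {
      if (Arrays.equals(values.get(i - 1), values.get(i))) {
        duplicates++;
      }
    }
    return (double) duplicates / (values.size() - 1);
  }

  /** Index-time decision; the threshold is an assumption that would need benchmarking. */
  static boolean shouldDedup(List<byte[]> values, double threshold) {
    return adjacentDuplicateRatio(values) >= threshold;
  }

  /** Collapses runs of equal adjacent values; assumes every doc has a non-null value. */
  static Encoded encode(List<byte[]> values) {
    Encoded encoded = new Encoded();
    encoded.docCount = values.size();
    byte[] previous = null;
    for (int doc = 0; doc < values.size(); doc++) {
      byte[] value = values.get(doc);
      if (previous == null || !Arrays.equals(previous, value)) {
        encoded.runValues.add(value);
        encoded.runStarts.add(doc);
        previous = value;
      }
    }
    return encoded;
  }
}
```

With wikimedium-like input, where every 1kB chunk of an article repeats the 
title, adjacentDuplicateRatio would be close to 1 and shouldDedup would pick 
this encoding; on a shuffled copy it would be close to 0 and we would fall back 
to whatever the default is. Where exactly to put the threshold is something 
only benchmarks can tell.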

> Configurable compression for BinaryDocValues
> --------------------------------------------
>
>                 Key: LUCENE-9378
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9378
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Viral Gandhi
>            Priority: Minor
>
> Lucene 8.5.1 includes a change to always [compress 
> BinaryDocValues|https://issues.apache.org/jira/browse/LUCENE-9211]. This 
> caused a ~30% reduction in our red-line QPS (throughput). 
> We think users should be given some way to opt in to this compression 
> feature instead of it always being enabled, which can have a substantial 
> query-time cost, as we saw during our upgrade. [~mikemccand] suggested one 
> possible approach: introducing a *mode* in Lucene80DocValuesFormat 
> (COMPRESSED and UNCOMPRESSED) and allowing users to create a custom Codec 
> that subclasses the default Codec and picks the format they want.
> The idea is similar to Lucene50StoredFieldsFormat, which has two modes, 
> Mode.BEST_SPEED and Mode.BEST_COMPRESSION.
> Here is a related issue for adding a benchmark covering BINARY doc values 
> query-time performance - [https://github.com/mikemccand/luceneutil/issues/61]


