[ https://issues.apache.org/jira/browse/LUCENE-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17115885#comment-17115885 ]
Adrien Grand commented on LUCENE-9378:
--------------------------------------

bq. Another factor that probably plays a role here is how compressible the data is

I looked a bit more into the data we use for benchmarking. wikibigall already makes titles quite compressible, since linefile docs are sorted by title in the input file. wikimedium makes them far more compressible, since it splits articles into 1kB chunks that all share the same title, creating many duplicate values for adjacent doc IDs. This makes wikimedium titles a best-case scenario for compression (and thus a worst-case scenario for search speed). I'd expect performance numbers to differ significantly between wikimediumall and wikibigall, and again between wikibigall and a shuffled copy of wikibigall that is no longer sorted by title.

Are wikimedium titles representative of the data that you are indexing into binary doc values at Amazon, i.e. are adjacent doc IDs likely to get the exact same value? If so, we could probably add ad-hoc compression for this case that would have a better runtime than LZ4, and we could make the decision automatically at index time instead of requiring users to configure a flag.

> Configurable compression for BinaryDocValues
> --------------------------------------------
>
>                 Key: LUCENE-9378
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9378
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Viral Gandhi
>            Priority: Minor
>
> Lucene 8.5.1 includes a change to always [compress
> BinaryDocValues|https://issues.apache.org/jira/browse/LUCENE-9211]. This
> caused (~30%) reduction in our red-line QPS (throughput).
> We think users should be given some way to opt-in for this compression
> feature instead of always being enabled which can have a substantial query
> time cost as we saw during our upgrade.
> [~mikemccand] suggested one possible
> approach by introducing a *mode* in Lucene80DocValuesFormat (COMPRESSED and
> UNCOMPRESSED) and allowing users to create a custom Codec subclassing the
> default Codec and pick the format they want.
> The idea is similar to Lucene50StoredFieldsFormat, which has two modes,
> Mode.BEST_SPEED and Mode.BEST_COMPRESSION.
> Here's a related issue for adding a benchmark covering BINARY doc values
> query-time performance - [https://github.com/mikemccand/luceneutil/issues/61]
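
To make the adjacent-duplicate case from the comment concrete, here is a rough sketch of what "ad-hoc compression" with an index-time decision could look like. It is an illustration only and not Lucene code: the class AdjacentRunSketch, its methods, and the 0.5 threshold are all hypothetical, and the technique shown is plain run-length deduplication of values that repeat on adjacent doc IDs, with a binary search over run starts at read time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch (not Lucene code): stores one value per run of identical adjacent docs. */
final class AdjacentRunSketch {
    private final int[] runStarts;    // first doc ID of each run, ascending
    private final byte[][] runValues; // the single value stored for each run
    private final int docCount;

    private AdjacentRunSketch(int[] runStarts, byte[][] runValues, int docCount) {
        this.runStarts = runStarts;
        this.runValues = runValues;
        this.docCount = docCount;
    }

    /** Fraction of docs (after the first) whose value equals the previous doc's value. */
    static double adjacentDuplicateRatio(List<byte[]> valuesByDoc) {
        if (valuesByDoc.size() < 2) {
            return 0.0;
        }
        int duplicates = 0;
        for (int doc = 1; doc < valuesByDoc.size(); doc++) {
            if (Arrays.equals(valuesByDoc.get(doc - 1), valuesByDoc.get(doc))) {
                duplicates++;
            }
        }
        return (double) duplicates / (valuesByDoc.size() - 1);
    }

    /** Index-time decision; the 0.5 threshold is an arbitrary assumption, not a tuned value. */
    static boolean shouldRunEncode(List<byte[]> valuesByDoc) {
        return adjacentDuplicateRatio(valuesByDoc) >= 0.5;
    }

    /** Collapses each run of identical adjacent values into one (startDoc, value) entry. */
    static AdjacentRunSketch encode(List<byte[]> valuesByDoc) {
        List<Integer> starts = new ArrayList<>();
        List<byte[]> values = new ArrayList<>();
        byte[] previous = null;
        for (int doc = 0; doc < valuesByDoc.size(); doc++) {
            byte[] current = valuesByDoc.get(doc);
            if (doc == 0 || !Arrays.equals(previous, current)) {
                starts.add(doc);
                values.add(current);
            }
            previous = current;
        }
        int[] startArray = starts.stream().mapToInt(Integer::intValue).toArray();
        return new AdjacentRunSketch(startArray, values.toArray(new byte[0][]), valuesByDoc.size());
    }

    /** Looks up the value for a doc ID without decompressing anything. */
    byte[] get(int docId) {
        if (docId < 0 || docId >= docCount) {
            throw new IndexOutOfBoundsException("docId " + docId);
        }
        int idx = Arrays.binarySearch(runStarts, docId);
        if (idx < 0) {
            idx = -idx - 2; // insertion point minus one is the run containing docId
        }
        return runValues[idx];
    }

    int runCount() {
        return runStarts.length;
    }
}
```

A writer could call shouldRunEncode on a field's values at flush time and fall back to the existing encoding when the ratio is low, which is the "decide at index time instead of a flag" idea. Whether this would actually beat LZ4 at query time would still need to be measured on a benchmark.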