[
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Mamontov updated LUCENE-8689:
----------------------------------
Attachment: results2.png
> Boolean DocValues Codec Implementation
> --------------------------------------
>
> Key: LUCENE-8689
> URL: https://issues.apache.org/jira/browse/LUCENE-8689
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Reporter: Ivan Mamontov
> Priority: Minor
> Labels: patch, performance
> Attachments: LUCENE-8689.patch, SynteticDocValuesBench70.java,
> results.png, results2.png
>
>
> To avoid issues where some products become available/unavailable at some
> point in time after being out-of-stock, e-commerce search system designers
> need to embed up-to-date information about inventory availability right into
> the search engines. Key requirement is to be able to accurately filter out
> unavailable products and use availability as one of ranking signals. However,
> keeping availability data up-to-date is a non-trivial task. Straightforward
> implementation based on a partial updates of Lucene documents causes Solr
> cache trashing with negatively affected query performance and resource
> utilization.
> As an alternative solution we can use DocValues and build-in in-place
> updates where field values can be independently updated without touching
> inverted index, and while filtering by DocValues is a bit slower, overall
> performance gain is better. However existing long based docValues are not
> sufficiently optimized for carrying boolean inventory availability data:
> * All DocValues queries are internally rewritten into
> org.apache.lucene.search.DocValuesNumbersQuery which is based on direct
> iteration over all column values and typically much slower than using
> TermsQuery.
> * On every commit/merge codec has to iterate over DocValues a couple times
> in order to choose ths best compression algorithm suitable for given data. As
> a result for 4K fields and 3M max doc merge takes more than 10 minutes
> This issue is intended to solve these limitations via special bitwise doc
> values format that uses internal representation of
> org.apache.lucene.util.FixedBitSet in order to store indexed values and load
> them at search time as a simple long array without additional decoding. There
> are several reasons for this:
> * At index time encoding is super fast without superfluous iterations over
> all values to choose ths best compression algorithm suitable for given
> data.</li>
> * At query time decoding is also simple and fast, no GC pressure and extra
> steps
> * Internal representation allows to perform random access in constant time
> Limitations are:
> * Does not support non boolean fields
> * Boolean fields must be represented as long values 1 for true and 0 for
> false
> * Current implementation does not support advanced bit set formats like
> org.apache.lucene.util.SparseFixedBitSet or
> org.apache.lucene.util.RoaringDocIdSet
> In order to evaluate performance gain I've wrote a simple JMH based benchmark
> [^SynteticDocValuesBench70.java] which allows to estimate a relative cost
> of DF filters. This benchmark creates 2 000 000 documents with 5 boolean
> columns with different density, where 10, 35, 50, 60 and 90 is an amount of
> documents with value 1. Each method tries to enumerate over all values in
> synthetic store field in all available ways:
> * baseline – in almost all cases Solr uses FixedBitSet in filter cache to
> keep store availability. This test just iterates over all bits.
> * docValuesRaw – iterates over all values of DV column, the same code is
> used in "post filtering", sorting and faceting.
> * docValuesNumbersQuery – iterates over all values produced by query/filter
> store:1, actually there is the only query implementation for DV based fields
> - DocValuesNumbersQuery. This means that Lucene rewrites all term, range and
> filter queries for non indexed filed into this fallback implementation.
> * docValuesBooleanQuery – optimized variant of DocValuesNumbersQuery, which
> support only two values – 0/1
> !results.png!
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]