[ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
----------------------------------
    Attachment: SynteticDocValuesBench70.java

> Boolean DocValues Codec Implementation
> --------------------------------------
>
>                 Key: LUCENE-8689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8689
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Ivan Mamontov
>            Priority: Minor
>              Labels: patch, performance
>         Attachments: LUCENE-8689.patch, SynteticDocValuesBench70.java, 
> results.png
>
>
> To avoid issues where some products become available/unavailable at some 
> point in time after being out-of-stock, e-commerce search system designers 
> need to embed up-to-date information about inventory availability right into 
> the search engines. Key requirement is to be able to accurately filter out 
> unavailable products and use availability as one of ranking signals. However, 
> keeping availability data up-to-date is a non-trivial task. Straightforward 
> implementation based on a partial updates of Lucene documents causes Solr 
> cache trashing with negatively affected query performance and resource 
> utilization.
>  As an alternative solution we can use DocValues and build-in in-place 
> updates where field values can be independently updated without touching 
> inverted index, and while filtering by DocValues is a bit slower, overall 
> performance gain is better. However existing long based docValues are not 
> sufficiently optimized for carrying boolean inventory availability data:
>  * All DocValues queries are internally rewritten into 
> org.apache.lucene.search.DocValuesNumbersQuery which is based on direct 
> iteration over all column values and typically much slower than using 
> TermsQuery.
>  * On every commit/merge codec has to iterate over DocValues a couple times 
> in order to choose ths best compression algorithm suitable for given data. As 
> a result for 4K fields and 3M max doc merge takes more than 10 minutes
> This issue is intended to solve these limitations via special bitwise doc 
> values format that uses internal representation of 
> org.apache.lucene.util.FixedBitSet in order to store indexed values and load 
> them at search time as a simple long array without additional decoding. There 
> are several reasons for this:
>  * At index time encoding is super fast without superfluous iterations over 
> all values to choose ths best compression algorithm suitable for given 
> data.</li>
>  * At query time decoding is also simple and fast, no GC pressure and extra 
> steps
>  * Internal representation allows to perform random access in constant time
> Limitations are:
>  * Does not support non boolean fields
>  * Boolean fields must be represented as long values 1 for true and 0 for 
> false
>  * Current implementation does not support advanced bit set formats like 
> org.apache.lucene.util.SparseFixedBitSet or 
> org.apache.lucene.util.RoaringDocIdSet
> In order to evaluate performance gain I've wrote a simple benchmark(JMH 
> based) which allows to estimate a relative cost of DF filters. This benchmark 
> creates 2 000 000 documents with 5 boolean columns with different density, 
> where 10, 35, 50, 60 and 90 is an amount of documents with value 1. Each 
> method tries to enumerate over all values in synthetic store field in all 
> available ways:
>  * baseline – in almost all cases Solr uses FixedBitSet in filter cache to 
> keep store availability. This test just iterates over all bits.
>  * docValuesRaw – iterates over all values of DV column, the same code is 
> used in "post filtering", sorting and faceting.
>  * docValuesNumbersQuery – iterates over all values produced by query/filter 
> store:1, actually there is the only query implementation for DV based fields 
> - DocValuesNumbersQuery. This means that Lucene rewrites all term, range and 
> filter queries for non indexed filed into this fallback implementation.
>  * docValuesBooleanQuery – optimized variant of DocValuesNumbersQuery, which 
> support only two values – 0/1
> !results.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to