[jira] [Commented] (LUCENE-8689) Boolean DocValues Codec Implementation

Dmitry Popov (JIRA) Sun, 19 May 2019 04:20:27 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843396#comment-16843396
 ]


Dmitry Popov commented on LUCENE-8689:
--------------------------------------

Updates benchmark: Boolean codec vs Lucene80 (with its different dense and 
sparse compressions). For updates there is no such difference:

!boolean_vs_dense_vs_sparse_updates.png|width=1093,height=582!

> Boolean DocValues Codec Implementation
> --------------------------------------
>
>                 Key: LUCENE-8689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8689
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Ivan Mamontov
>            Priority: Minor
>              Labels: patch, performance
>         Attachments: LUCENE-8689.patch, LUCENE-8689.patch, 
> SynteticDocValuesBench70.java, SynteticDocValuesBench80.java, 
> benchmark_dense.png, boolean_vs_dense_vs_sparse_indexing.png, 
> boolean_vs_dense_vs_sparse_updates.png, results2.png
>
>
> To avoid issues where some products become available/unavailable at some 
> point in time after being out-of-stock, e-commerce search system designers 
> need to embed up-to-date information about inventory availability right into 
> the search engines. The key requirement is to be able to accurately filter 
> out unavailable products and use availability as one of the ranking signals. 
> However, keeping availability data up-to-date is a non-trivial task. A 
> straightforward implementation based on partial updates of Lucene documents 
> causes Solr cache thrashing, which negatively affects query performance and 
> resource utilization.
>  As an alternative solution we can use DocValues and built-in in-place 
> updates, where field values can be independently updated without touching 
> the inverted index. While filtering by DocValues is a bit slower, the overall 
> performance gain is better. However, existing long-based DocValues are not 
> sufficiently optimized for carrying boolean inventory availability data:
>  * All DocValues queries are internally rewritten into 
> org.apache.lucene.search.DocValuesNumbersQuery, which is based on direct 
> iteration over all column values and is typically much slower than using 
> TermsQuery.
>  * On every commit/merge the codec has to iterate over DocValues a couple of 
> times in order to choose the best compression algorithm for the given data. 
> As a result, for 4K fields and 3M max doc, a merge takes more than 10 minutes.
> This issue is intended to solve these limitations via a special bitwise doc 
> values format that uses the internal representation of 
> org.apache.lucene.util.FixedBitSet to store indexed values and load them at 
> search time as a simple long array without additional decoding. There are 
> several reasons for this:
>  * At index time, encoding is very fast, without superfluous iterations over 
> all values to choose the best compression algorithm for the given data.
>  * At query time, decoding is also simple and fast, with no GC pressure or 
> extra steps.
>  * The internal representation allows random access in constant time.
> Limitations are:
>  * Does not support non-boolean fields
>  * Boolean fields must be represented as long values: 1 for true and 0 for 
> false
>  * The current implementation does not support advanced bit set formats like 
> org.apache.lucene.util.SparseFixedBitSet or 
> org.apache.lucene.util.RoaringDocIdSet
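The bitwise layout described above can be illustrated with a minimal sketch. It is not the code from the attached patch: the class and method names are hypothetical, and it only assumes the same word/bit addressing that FixedBitSet uses (word index `doc >> 6`, bit mask `1L << doc`). It packs 0/1 long values into a long array, rejects non-boolean values as the limitations require, and reads a single document's value in constant time.

```java
/** Hypothetical sketch of a one-bit-per-document layout; not the patch's code. */
final class BooleanBitsSketch {

    /** Packs long values (each must be 0 or 1) into 64-bit words in a single pass. */
    static long[] encode(long[] values) {
        long[] words = new long[(values.length + 63) >>> 6];
        for (int doc = 0; doc < values.length; doc++) {
            long v = values[doc];
            if (v != 0L && v != 1L) {
                throw new IllegalArgumentException(
                    "doc " + doc + " has non-boolean value " + v);
            }
            if (v == 1L) {
                // Java uses only the low 6 bits of the shift count for long shifts,
                // so (1L << doc) selects bit (doc % 64) of the word.
                words[doc >> 6] |= 1L << doc;
            }
        }
        return words;
    }

    /** Constant-time random access: one array read, one shift, one mask. */
    static boolean get(long[] words, int doc) {
        return (words[doc >> 6] & (1L << doc)) != 0;
    }

    public static void main(String[] args) {
        long[] values = new long[130];
        values[0] = 1;
        values[64] = 1;
        values[129] = 1;
        long[] words = encode(values);
        System.out.println(words.length + " words; doc 64 = " + get(words, 64)
            + ", doc 65 = " + get(words, 65));
    }
}
```

Because the words are written in one pass and read back unchanged, no decoding step and no per-segment choice of compression is involved in this sketch.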
> In order to evaluate the performance gain I've written a simple JMH-based 
> benchmark [^SynteticDocValuesBench70.java] which estimates the relative cost 
> of DF filters. This benchmark creates 2 000 000 documents with 5 boolean 
> columns of different density, where 10, 35, 50, 60 and 90 are the percentages 
> of documents with value 1. Each method enumerates all values in a synthetic 
> store field in one of the available ways:
>  * baseline – in almost all cases Solr uses FixedBitSet in the filter cache 
> to keep store availability. This test just iterates over all bits.
>  * docValuesRaw – iterates over all values of the DV column; the same code is 
> used in "post filtering", sorting and faceting.
>  * docValuesNumbersQuery – iterates over all values produced by the 
> query/filter store:1. DocValuesNumbersQuery is actually the only query 
> implementation for DV-based fields. This means that Lucene rewrites all term, 
> range and filter queries for non-indexed fields into this fallback 
> implementation.
>  * docValuesBooleanQuery – an optimized variant of DocValuesNumbersQuery which 
> supports only two values – 0/1
> !results2.png!
> Query latency is similar to FixedBitSet, with a negligible overhead of 1-2 ms. 
> DocValuesNumbersQuery is 6-7 times slower than the boolean query. The raw doc 
> values iterator is also not as fast, as it performs on-the-fly decoding.
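As a rough illustration of why iterating over a bit set is cheap, here is a minimal sketch of enumerating the matching documents directly from the long words. The class and method names are hypothetical and this is not the query from the patch. The sentinel value is chosen to match Lucene's DocIdSetIterator.NO_MORE_DOCS, and document ids are assumed to be non-negative.

```java
/** Hypothetical sketch of iterating matching docs over raw 64-bit words. */
final class BitIterationSketch {

    /** Sentinel for "no further match"; same value as Lucene's NO_MORE_DOCS. */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Returns the first doc >= from whose bit is set, or NO_MORE_DOCS. */
    static int nextSetDoc(long[] words, int from) {
        int i = from >> 6;
        if (i >= words.length) {
            return NO_MORE_DOCS;
        }
        // Drop the bits below 'from' inside the current word
        // (the shift count is taken modulo 64 for long values).
        long word = words[i] >>> from;
        if (word != 0) {
            return from + Long.numberOfTrailingZeros(word);
        }
        // Skip empty words 64 documents at a time.
        while (++i < words.length) {
            word = words[i];
            if (word != 0) {
                return (i << 6) + Long.numberOfTrailingZeros(word);
            }
        }
        return NO_MORE_DOCS;
    }

    /** Counts matches by walking from one set bit to the next. */
    static int countMatches(long[] words) {
        int count = 0;
        for (int doc = nextSetDoc(words, 0); doc != NO_MORE_DOCS;
             doc = nextSetDoc(words, doc + 1)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long[] words = {0b1001L, 0L, 1L << 63}; // docs 0, 3 and 191 are set
        System.out.println("next after 1: " + nextSetDoc(words, 1));
        System.out.println("matches: " + countMatches(words));
    }
}
```

Each step touches only whole words and uses a single trailing-zero count per match, with no per-value decoding.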
> The attached patch contains two parts:
>  * bitwise codec and all required structures and producers/consumers
>  * boolean query which removes TwoPhaseIterator, AllBits approximation and 
> missing docs lookup
> The docValues codec tests are green, except for the non-long value cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

