[ https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843396#comment-16843396 ]
Dmitry Popov commented on LUCENE-8689: -------------------------------------- Updates: Boolean Codec vs Lucene80 (with different Dense and Sparse compressions). There is no such difference: !boolean_vs_dense_vs_sparse_updates.png|width=1093,height=582! > Boolean DocValues Codec Implementation > -------------------------------------- > > Key: LUCENE-8689 > URL: https://issues.apache.org/jira/browse/LUCENE-8689 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs > Reporter: Ivan Mamontov > Priority: Minor > Labels: patch, performance > Attachments: LUCENE-8689.patch, LUCENE-8689.patch, > SynteticDocValuesBench70.java, SynteticDocValuesBench80.java, > benchmark_dense.png, boolean_vs_dense_vs_sparse_indexing.png, > boolean_vs_dense_vs_sparse_updates.png, results2.png > > > To avoid issues where some products become available/unavailable at some > point in time after being out-of-stock, e-commerce search system designers > need to embed up-to-date information about inventory availability right into > the search engines. Key requirement is to be able to accurately filter out > unavailable products and use availability as one of ranking signals. However, > keeping availability data up-to-date is a non-trivial task. Straightforward > implementation based on a partial updates of Lucene documents causes Solr > cache trashing with negatively affected query performance and resource > utilization. > As an alternative solution we can use DocValues and build-in in-place > updates where field values can be independently updated without touching > inverted index, and while filtering by DocValues is a bit slower, overall > performance gain is better. However existing long based docValues are not > sufficiently optimized for carrying boolean inventory availability data: > * All DocValues queries are internally rewritten into > org.apache.lucene.search.DocValuesNumbersQuery which is based on direct > iteration over all column values and typically much slower than using > TermsQuery. > * On every commit/merge codec has to iterate over DocValues a couple times > in order to choose the best compression algorithm suitable for given data. As > a result for 4K fields and 3M max doc merge takes more than 10 minutes > This issue is intended to solve these limitations via special bitwise doc > values format that uses internal representation of > org.apache.lucene.util.FixedBitSet in order to store indexed values and load > them at search time as a simple long array without additional decoding. There > are several reasons for this: > * At index time encoding is super fast without superfluous iterations over > all values to choose the best compression algorithm suitable for given data. > * At query time decoding is also simple and fast, no GC pressure and extra > steps > * Internal representation allows to perform random access in constant time > Limitations are: > * Does not support non boolean fields > * Boolean fields must be represented as long values 1 for true and 0 for > false > * Current implementation does not support advanced bit set formats like > org.apache.lucene.util.SparseFixedBitSet or > org.apache.lucene.util.RoaringDocIdSet > In order to evaluate performance gain I've wrote a simple JMH based benchmark > [^SynteticDocValuesBench70.java] which allows to estimate a relative cost of > DF filters. This benchmark creates 2 000 000 documents with 5 boolean columns > with different density, where 10, 35, 50, 60 and 90 is an amount of documents > with value 1. Each method tries to enumerate over all values in synthetic > store field in all available ways: > * baseline – in almost all cases Solr uses FixedBitSet in filter cache to > keep store availability. This test just iterates over all bits. > * docValuesRaw – iterates over all values of DV column, the same code is > used in "post filtering", sorting and faceting. > * docValuesNumbersQuery – iterates over all values produced by query/filter > store:1, actually there is the only query implementation for DV based fields > - DocValuesNumbersQuery. This means that Lucene rewrites all term, range and > filter queries for non indexed filed into this fallback implementation. > * docValuesBooleanQuery – optimized variant of DocValuesNumbersQuery, which > support only two values – 0/1 > !results2.png! > Query latency is similar to FixedBitSet with negligible overhead 1-2 ms. > DocValuesNumbersQuery 6-7 times slower compared to boolean query. Raw doc > values iterator is also not so fast as it performs on-the-fly decoding. > Attached patch contains two parts: > * bitwise codec and all required structures and producers/consumers > * boolean query which removes TwoPhaseIterator, AllBits approximation and > missing docs lookup > * docValues codec test green except non long values cases -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org