[ https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Mamontov updated LUCENE-8689:
----------------------------------
    Attachment: SynteticDocValuesBench70.java

> Boolean DocValues Codec Implementation
> --------------------------------------
>
>                 Key: LUCENE-8689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8689
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Ivan Mamontov
>            Priority: Minor
>              Labels: patch, performance
>         Attachments: LUCENE-8689.patch, SynteticDocValuesBench70.java,
> results.png
>
>
> To avoid issues where some products become available/unavailable at some
> point in time after being out-of-stock, e-commerce search system designers
> need to embed up-to-date information about inventory availability right into
> the search engine. The key requirement is to accurately filter out
> unavailable products and to use availability as one of the ranking signals.
> However, keeping availability data up-to-date is a non-trivial task. A
> straightforward implementation based on partial updates of Lucene documents
> causes Solr cache thrashing, which hurts query performance and resource
> utilization.
> As an alternative, we can use DocValues and built-in in-place updates, where
> field values can be updated independently without touching the inverted
> index. While filtering by DocValues is a bit slower, the overall performance
> gain is better. However, the existing long-based DocValues are not
> sufficiently optimized for carrying boolean inventory availability data:
> * All DocValues queries are internally rewritten into
>   org.apache.lucene.search.DocValuesNumbersQuery, which is based on direct
>   iteration over all column values and is typically much slower than using
>   TermsQuery.
> * On every commit/merge the codec has to iterate over DocValues a couple of
>   times in order to choose the best compression algorithm for the given data.
>   As a result, for 4K fields and 3M max doc, a merge takes more than 10
>   minutes.
> This issue is intended to solve these limitations via a special bitwise doc
> values format that uses the internal representation of
> org.apache.lucene.util.FixedBitSet to store indexed values and load them at
> search time as a simple long array without additional decoding. There are
> several reasons for this:
> * At index time, encoding is very fast, with no superfluous iterations over
>   all values to choose the best compression algorithm for the given data.
> * At query time, decoding is also simple and fast, with no GC pressure and no
>   extra steps.
> * The internal representation allows random access in constant time.
> Limitations are:
> * Non-boolean fields are not supported.
> * Boolean fields must be represented as long values: 1 for true and 0 for
>   false.
> * The current implementation does not support advanced bit set formats like
>   org.apache.lucene.util.SparseFixedBitSet or
>   org.apache.lucene.util.RoaringDocIdSet.
> To evaluate the performance gain, I wrote a simple JMH-based benchmark that
> estimates the relative cost of DF filters. This benchmark creates 2 000 000
> documents with 5 boolean columns of different density, where 10, 35, 50, 60
> and 90 is the amount of documents with value 1. Each method enumerates all
> values in the synthetic store field in one of the available ways:
> * baseline – in almost all cases Solr uses FixedBitSet in the filter cache to
>   keep store availability. This test just iterates over all bits.
> * docValuesRaw – iterates over all values of the DV column; the same code is
>   used in "post filtering", sorting and faceting.
> * docValuesNumbersQuery – iterates over all values produced by the
>   query/filter store:1. There is actually only one query implementation for
>   DV-based fields: DocValuesNumbersQuery.
>   This means that Lucene rewrites all term, range and filter queries for
>   non-indexed fields into this fallback implementation.
> * docValuesBooleanQuery – an optimized variant of DocValuesNumbersQuery that
>   supports only two values: 0 and 1.
> !results.png!
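
For readers without the patch at hand, the bitwise layout described above can be sketched in plain Java. This is only an illustration, not the code from LUCENE-8689.patch: the class BooleanColumnSketch and its methods are hypothetical names, and the only thing borrowed from FixedBitSet is its word layout, where document i lives in bit (i & 63) of word (i >> 6). It shows why encoding needs a single pass, why random access is constant time, and how matching documents can be enumerated without decoding.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a bit-packed boolean column; not the LUCENE-8689 patch. */
public class BooleanColumnSketch {

    /** Single pass over the values: doc i is stored in bit (i & 63) of word (i >> 6). */
    static long[] encode(long[] values) {
        long[] words = new long[(values.length + 63) >>> 6];
        for (int doc = 0; doc < values.length; doc++) {
            if (values[doc] != 0L && values[doc] != 1L) {
                throw new IllegalArgumentException("only 0 and 1 are supported, got " + values[doc]);
            }
            if (values[doc] == 1L) {
                // For a long operand, Java uses only the low 6 bits of the shift count.
                words[doc >> 6] |= 1L << doc;
            }
        }
        return words;
    }

    /** Constant-time random access: one array read, one shift, one mask. */
    static boolean get(long[] words, int doc) {
        return (words[doc >> 6] & (1L << doc)) != 0;
    }

    /** Enumerates documents with value 1, word by word, skipping zero bits. */
    static List<Integer> matchingDocs(long[] words) {
        List<Integer> docs = new ArrayList<>();
        for (int w = 0; w < words.length; w++) {
            long word = words[w];
            while (word != 0) {
                docs.add((w << 6) + Long.numberOfTrailingZeros(word));
                word &= word - 1; // clear the lowest set bit
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        long[] values = new long[130];
        values[0] = 1;
        values[63] = 1;
        values[64] = 1;
        values[129] = 1;
        long[] words = encode(values);
        System.out.println(words.length + " words, matches: " + matchingDocs(words));
    }
}
```

In this sketch a column of 2 000 000 documents would occupy 31 250 longs (about 250 KB), and the encoded array can be used as-is at search time, which is the property the issue description relies on.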