[ https://issues.apache.org/jira/browse/LUCENE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Underwood updated LUCENE-8374:
----------------------------------
    Attachment: single_vehicle_logs.txt

> Reduce reads for sparse DocValues
> ---------------------------------
>
>                 Key: LUCENE-8374
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8374
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 7.5, master (8.0)
>            Reporter: Toke Eskildsen
>            Priority: Major
>              Labels: performance
>             Fix For: 7.6
>
>         Attachments: LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374.patch,
> LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374_branch_7_3.patch,
> LUCENE-8374_branch_7_3.patch.20181005, LUCENE-8374_branch_7_4.patch,
> LUCENE-8374_branch_7_5.patch, entire_index_logs.txt,
> image-2018-10-24-07-30-06-663.png, image-2018-10-24-07-30-56-962.png,
> single_vehicle_logs.txt,
> start-2018-10-24-1_snapshot___Users_tim_Snapshots__-_YourKit_Java_Profiler_2017_02-b75_-_64-bit.png,
> start-2018-10-24_snapshot___Users_tim_Snapshots__-_YourKit_Java_Profiler_2017_02-b75_-_64-bit.png
>
>
> The {{Lucene70DocValuesProducer}} has the internal classes
> {{SparseNumericDocValues}} and {{BaseSortedSetDocValues}} (sparse code path),
> which in turn use {{IndexedDISI}} to handle the docID -> value-ordinal
> lookup. The value-ordinal is the index of the docID, assuming an abstract,
> tightly packed, monotonically increasing list of docIDs: if the docIDs with
> corresponding values are {{[0, 4, 1432]}}, their value-ordinals will be
> {{[0, 1, 2]}}.
>
> h2. Outer blocks
>
> The lookup structure of {{IndexedDISI}} consists of blocks of 2^16 docIDs
> (65536), where each block can be either {{ALL}}, {{DENSE}} (2^12 to 2^16
> values) or {{SPARSE}} (< 2^12 values, ~6%). Consequently blocks vary quite a
> lot in size and in ordinal-resolving strategy.
>
> When a sparse numeric DocValue is needed, the code first locates the block
> containing the wanted docID. It does so by iterating the blocks one by one
> until it reaches the right one, where each iteration requires a lookup in
> the underlying {{IndexSlice}}. For a common memory-mapped index, this
> translates to either a cached request or a read operation. If a segment has
> 6M documents, the worst case is 91 lookups. In our web archive, our segments
> have ~300M values: a worst case of 4577 lookups!
>
> One obvious solution is to use a lookup table for blocks: a long[]-array
> with an entry for each block. For 6M documents, that is < 1KB and would
> allow for direct jumping (a single lookup) in all instances. Unfortunately
> this lookup table cannot be generated upfront when the writing of values is
> purely streaming. It can be appended to the end of the stream before it is
> closed, but without knowing the position of the lookup table, the reader
> cannot seek to it.
>
> One strategy for creating such a lookup table would be to generate it during
> reads and cache it for subsequent lookups (a sketch of this idea follows
> below). This does not fit directly into how {{IndexedDISI}} currently works
> (it is created anew for each invocation), but could probably be added with a
> little work. An advantage of this approach is that it does not change the
> underlying format and thus could be used with existing indexes.
>
> h2. The lookup structure inside each block
>
> If {{ALL}} of the 2^16 values are defined, the structure is empty and the
> ordinal is simply the requested docID with some modulo and multiply math.
> Nothing to improve there.
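>
> A minimal sketch of how such a query-time jump cache could look. The names
> ({{BlockJumpCache}}, {{offer}}, {{floorBlock}}) are illustrative only, not
> the current {{IndexedDISI}} API; the cache would live in the long-lived
> DocValues producer and be shared across the short-lived {{IndexedDISI}}
> instances, hence the atomic array:
>
> {code:java}
> import java.util.Arrays;
> import java.util.concurrent.atomic.AtomicLongArray;
>
> public final class BlockJumpCache {
>   private static final long UNKNOWN = -1;
>   // One slot per 2^16-docID block, holding the block's file offset.
>   private final AtomicLongArray blockOffsets;
>
>   public BlockJumpCache(int maxDoc) {
>     // For a 6M-doc segment this is 92 slots, i.e. < 1 KB.
>     long[] init = new long[(maxDoc >>> 16) + 1];
>     Arrays.fill(init, UNKNOWN);
>     blockOffsets = new AtomicLongArray(init);
>   }
>
>   /** Called when a forward scan has located a block, to remember its offset. */
>   public void offer(int blockIndex, long fileOffset) {
>     blockOffsets.set(blockIndex, fileOffset);
>   }
>
>   /** Highest block index <= targetBlock with a cached offset, or -1 if none. */
>   public int floorBlock(int targetBlock) {
>     for (int block = targetBlock; block >= 0; block--) {
>       if (blockOffsets.get(block) != UNKNOWN) {
>         return block;
>       }
>     }
>     return -1; // nothing cached yet: caller scans from the first block
>   }
>
>   public long offsetOf(int blockIndex) {
>     return blockOffsets.get(blockIndex);
>   }
> }
> {code}
>
> With something like this in place, a lookup could seek directly to
> {{floorBlock(target)}} and scan forward from there instead of from block 0,
> so repeated queries against the same segment converge towards a single read
> per block jump.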
> If the block is {{DENSE}} (2^12 to 2^16 values are defined), a bitmap is
> used and the number of set bits up to the wanted index (the docID relative
> to the block's origin) is counted. That bitmap is a long[1024], meaning that
> the worst case is to look up and count all set bits for 1024 longs!
>
> One known solution to this is to use a
> [rank structure|https://en.wikipedia.org/wiki/Succinct_data_structure]. I
> [implemented it|https://github.com/tokee/lucene-solr/blob/solr5894/solr/core/src/java/org/apache/solr/search/sparse/count/plane/RankCache.java]
> for a related project, and with that, the rank overhead for a {{DENSE}}
> block would be long[32] and would ensure a maximum of 9 lookups (a sketch is
> included at the end of this description). It is not trivial to build the
> rank structure, and caching it (assuming all blocks are dense) for 6M
> documents would require 22 KB (3.17% overhead). It would be far better to
> generate the rank structure at index time and store it immediately before
> the bitset (this is possible with streaming, as each block is fully resolved
> before flushing), but of course that would require a change to the codec.
>
> If the block is {{SPARSE}} (< 2^12 values, ~6%), the docIDs are simply
> stored as a list. As a comment in the code suggests, a binary search through
> these would be faster, although that would mean seeking backwards. If that
> is not acceptable, I don't have any immediate idea for avoiding the full
> iteration.
>
> I propose implementing query-time caching of both block jumps and
> inner-block lookups for {{DENSE}} (using rank) as a first improvement, and
> an index-time {{DENSE}} rank structure as a future improvement. As
> query-time caching is likely to be too costly for rapidly changing indexes,
> it should probably be opt-in in solrconfig.xml.
>
> h2. Some real-world observations
>
> This analysis was triggered by massive (10x) slowdown problems with both
> simple querying and large exports from our web archive index after upgrading
> from Solr 4.10 to 7.3.1. The query matching itself takes ½-2 seconds, but
> returning the top-10 documents takes 5-20 seconds (~50 non-stored DocValues
> fields), up from ½-2 seconds in total with Solr 4.10 (which used more of a
> mix of stored vs. DocValues fields, so the numbers might not be directly
> comparable).
>
> Measuring with VisualVM points to {{NIOFSIndexInput.readInternal}} as *the*
> hotspot. We ran some tests with simple queries on a single segment of
> 307,171,504 documents with different single-value DocValues fields in
> {{fl}} and got:
>
> ||Field||Type||Docs with value||Docs with value %||Speed in docs/sec||
> |url|String|307,171,504|100%|12,500|
> |content_type_ext|String|224,375,378|73%|360|
> |author|String|1,506,365|0.5%|1,100|
> |crawl_date|DatePoint|307,171,498|~100%|90|
> |content_text_length|IntPoint|285,800,212|93%|410|
> |content_length|IntPoint|307,016,816|99.9%|100|
> |crawl_year|IntPoint|307,171,498|~100%|14,500|
> |last_modified|DatePoint|6,835,065|2.2%|570|
> |source_file_offset|LongPoint|307,171,504|100%|28,000|
>
> Note how url and source_file_offset are both very fast and both have a value
> for _all_ documents. Contrary to this, content_type_ext is very slow and
> crawl_date is extremely slow; as they both have values for _nearly_ all
> documents, I presume they are using {{IndexedDISI#DENSE}}. last_modified is
> also quite slow and presumably uses {{IndexedDISI#SPARSE}}.
>
> The only mystery is crawl_year, which is also present in _nearly_ all
> documents but is very fast. I have no explanation for that one yet.
>
> I hope to take a stab at this around August 2018, but no promises.
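>
> To make the {{DENSE}} rank idea concrete, here is a sketch of the structure
> sized above: one 16-bit counter per 8 longs (512 bits) of the block's
> long[1024] bitmap, packed 4 counters to a long, i.e. long[32] of overhead
> per block. The exact packing is illustrative, not a final codec format:
>
> {code:java}
> public final class DenseRank {
>
>   /** Counter for group g = number of set bits before long 8*g of the bitmap. */
>   public static long[] buildRank(long[] bitmap) { // bitmap.length == 1024
>     long[] rankCache = new long[32];
>     int setBits = 0;
>     for (int group = 0; group < 128; group++) {
>       // Store the running total as this group's 16-bit counter.
>       rankCache[group >>> 2] |= ((long) setBits) << ((group & 3) << 4);
>       for (int word = group << 3; word < (group << 3) + 8; word++) {
>         setBits += Long.bitCount(bitmap[word]);
>       }
>     }
>     return rankCache;
>   }
>
>   /** Set bits below index (the in-block ordinal): 1 rank read + <= 8 bitmap reads. */
>   public static int rank(long[] bitmap, long[] rankCache, int index) {
>     int group = index >>> 9; // which 512-bit group the index falls in
>     int ordinal =
>         (int) ((rankCache[group >>> 2] >>> ((group & 3) << 4)) & 0xFFFF);
>     for (int word = group << 3; word < (index >>> 6); word++) {
>       ordinal += Long.bitCount(bitmap[word]); // full words before the target
>     }
>     // Bits below the index inside the target word itself.
>     ordinal += Long.bitCount(bitmap[index >>> 6] & ((1L << (index & 63)) - 1));
>     return ordinal;
>   }
> }
> {code}
>
> The worst case is then 1 rank read plus 8 bitmap reads = 9 word reads,
> instead of counting set bits in up to 1024 longs, matching the
> long[32]/9-lookup numbers above.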