[ https://issues.apache.org/jira/browse/LUCENE-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022946#comment-17022946 ]
ASF subversion and git services commented on LUCENE-4702:
---------------------------------------------------------

Commit b283b8df628dc9bbdbbb994b4a3653b7eecd7fd9 in lucene-solr's branch refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b283b8d ]

LUCENE-4702: Terms dictionary compression. (#1126)

Compress blocks of suffixes in order to make the terms dictionary more
space-efficient. Two compression algorithms are used depending on which one is
more space-efficient:

 - LowercaseAsciiCompression, which applies when all bytes are in the
   `[0x1F,0x3F)` or `[0x5F,0x7F)` ranges, which notably include all digits,
   lowercase ASCII characters, '.', '-' and '_', and encodes 4 chars on 3 bytes.
   It is very often applicable on analyzed content and decompresses very quickly
   thanks to auto-vectorization support in the JVM.

 - LZ4, when the compression ratio is less than 0.75.

I was a bit unhappy with the complexity of the high-compression LZ4 option, so
I simplified it in order to only keep the logic that detects duplicate strings.
The logic about what to do in case overlapping matches are found, which was
responsible for most of the complexity while only yielding tiny benefits, has
been removed.

> Terms dictionary compression
> ----------------------------
>
>                 Key: LUCENE-4702
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4702
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Trivial
>         Attachments: LUCENE-4702.patch, LUCENE-4702.patch
>
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> I've done a quick test with the block tree terms dictionary by replacing a
> call to IndexOutput.writeBytes to write suffix bytes with a call to
> LZ4.compressHC to test the performance hit. Interestingly, search performance
> was very good (see comparison table below) and the tim files were 14% smaller
> (from 150432 bytes overall to 129516).
> {noformat}
> Task               QPS baseline  StdDev  QPS compressed  StdDev   Pct diff
> Fuzzy1                   111.50  (2.0%)           78.78  (1.5%)     -29.4% ( -32% - -26%)
> Fuzzy2                    36.99  (2.7%)           28.59  (1.5%)     -22.7% ( -26% - -18%)
> Respell                  122.86  (2.1%)          103.89  (1.7%)     -15.4% ( -18% - -11%)
> Wildcard                 100.58  (4.3%)           94.42  (3.2%)      -6.1% ( -13% -   1%)
> Prefix3                  124.90  (5.7%)          122.67  (4.7%)      -1.8% ( -11% -   9%)
> OrHighLow                169.87  (6.8%)          167.77  (8.0%)      -1.2% ( -15% -  14%)
> LowTerm                 1949.85  (4.5%)         1929.02  (3.4%)      -1.1% (  -8% -   7%)
> AndHighLow              2011.95  (3.5%)         1991.85  (3.3%)      -1.0% (  -7% -   5%)
> OrHighHigh               155.63  (6.7%)          154.12  (7.9%)      -1.0% ( -14% -  14%)
> AndHighHigh              341.82  (1.2%)          339.49  (1.7%)      -0.7% (  -3% -   2%)
> OrHighMed                217.55  (6.3%)          216.16  (7.1%)      -0.6% ( -13% -  13%)
> IntNRQ                    53.10 (10.9%)           52.90  (8.6%)      -0.4% ( -17% -  21%)
> MedTerm                  998.11  (3.8%)          994.82  (5.6%)      -0.3% (  -9% -   9%)
> MedSpanNear               60.50  (3.7%)           60.36  (4.8%)      -0.2% (  -8% -   8%)
> HighSpanNear              19.74  (4.5%)           19.72  (5.1%)      -0.1% (  -9% -   9%)
> LowSpanNear              101.93  (3.2%)          101.82  (4.4%)      -0.1% (  -7% -   7%)
> AndHighMed               366.18  (1.7%)          366.93  (1.7%)       0.2% (  -3% -   3%)
> PKLookup                 237.28  (4.0%)          237.96  (4.2%)       0.3% (  -7% -   8%)
> MedPhrase                173.17  (4.7%)          174.69  (4.7%)       0.9% (  -8% -  10%)
> LowSloppyPhrase          180.91  (2.6%)          182.79  (2.7%)       1.0% (  -4% -   6%)
> LowPhrase                374.64  (5.5%)          379.11  (5.8%)       1.2% (  -9% -  13%)
> HighTerm                 253.14  (7.9%)          256.97 (11.4%)       1.5% ( -16% -  22%)
> HighPhrase                19.52 (10.6%)           19.83 (11.0%)       1.6% ( -18% -  25%)
> MedSloppyPhrase          141.90  (2.6%)          144.11  (2.5%)       1.6% (  -3% -   6%)
> HighSloppyPhrase          25.26  (4.8%)           25.97  (5.0%)       2.8% (  -6% -  13%)
> {noformat}
> Only queries which are very terms-dictionary-intensive got a performance hit
> (Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, behaved
> (surprisingly) well.
> Do you think this is something worth exploring?
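
For readers wondering how "4 chars on 3 bytes" can work: the two byte ranges named in the commit message, `[0x1F,0x3F)` and `[0x5F,0x7F)`, hold 32 values each, so every eligible byte fits in a 6-bit code and four codes fill 24 bits. The sketch below only illustrates that idea. The class name, method names and bit layout are hypothetical and chosen for readability; this is not the code of Lucene's LowercaseAsciiCompression, whose on-disk layout may well differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative 6-bit packing only; NOT Lucene's LowercaseAsciiCompression. */
final class SixBitPackingSketch {

  /** True if b lies in [0x1F,0x3F) or [0x5F,0x7F), the ranges named in the commit message. */
  static boolean isCompressible(int b) {
    return (b >= 0x1F && b < 0x3F) || (b >= 0x5F && b < 0x7F);
  }

  /** Maps a compressible byte to a code in [0, 64): 32 codes per range. */
  static int toCode(int b) {
    return b < 0x3F ? b - 0x1F : b - 0x5F + 32;
  }

  /** Inverse of toCode. */
  static int fromCode(int code) {
    return code < 32 ? code + 0x1F : code - 32 + 0x5F;
  }

  /** Packs 4 input bytes into 3 output bytes; returns null if any byte is out of range. */
  static byte[] pack(byte[] in) {
    byte[] out = new byte[(in.length + 3) / 4 * 3];
    for (int i = 0, o = 0; i < in.length; i += 4, o += 3) {
      int bits = 0;
      for (int j = 0; j < 4; j++) {
        int code = 0; // zero padding for a short final group
        if (i + j < in.length) {
          int b = in[i + j] & 0xFF;
          if (!isCompressible(b)) {
            return null; // caller would fall back to another encoding
          }
          code = toCode(b);
        }
        bits = (bits << 6) | code;
      }
      out[o] = (byte) (bits >>> 16);
      out[o + 1] = (byte) (bits >>> 8);
      out[o + 2] = (byte) bits;
    }
    return out;
  }

  /** Reverses pack; the original length must be stored separately by the caller. */
  static byte[] unpack(byte[] packed, int originalLength) {
    byte[] out = new byte[originalLength];
    for (int i = 0, p = 0; i < originalLength; i += 4, p += 3) {
      int bits = ((packed[p] & 0xFF) << 16) | ((packed[p + 1] & 0xFF) << 8) | (packed[p + 2] & 0xFF);
      for (int j = 0; j < 4 && i + j < originalLength; j++) {
        out[i + j] = (byte) fromCode((bits >>> (18 - 6 * j)) & 0x3F);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] term = "lucene_4702".getBytes(StandardCharsets.US_ASCII);
    byte[] packed = pack(term);
    System.out.println(term.length + " bytes -> " + packed.length + " bytes");
    System.out.println(Arrays.equals(term, unpack(packed, term.length)));
  }
}
```

The inner loops are fixed-width shifts and masks with no data-dependent branching on the decode side, which is the kind of code a JIT compiler can often vectorize. Whether this particular sketch gets vectorized has not been measured.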