[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Labels: gsoc2013  (was: gsoc2014)

> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>   Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 4.7
>
>         Attachments: LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, df-ttf-estimate.txt,
> example.png
>
>
> FST based TermDictionary has been a great improvement yet it still uses a
> delta codec file for scanning to terms. Some environments have enough memory
> available to keep the entire FST based term dict in memory. We should add a
> TermDictionary implementation that encodes all needed information for each
> term into the FST (custom fst.Output) and builds a FST from the entire term
> not just the delta.

--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-3069.patch

Patch from the last commit, and a summary:

Previously, both of our term dictionaries were block-based:

* The BlockTerms dict breaks the term list into several blocks, as a linear structure with skip points.
* The BlockTreeTerms dict uses a trie-like structure to decide how terms are assigned to blocks, and uses an FST index to speed up seeking.

However, neither of these term dictionaries holds all the term data in memory. In the worst case a lookup needs at least two seeks: one in the in-memory index, another in the file on disk. And we already have many complicated optimizations for this...

If a term dictionary is memory resident by design, the data structure becomes simpler (after all, we don't need to maintain extra file pointers for a second seek, and we don't have to choose a heuristic for how terms are clustered). This is why the two FST-based implementations are introduced.

Another big change in the code: since our term dictionaries were both block-based, the previous API was also limited. The postings writer collected term metadata, and the term dictionary told the postings writer which range of terms it should flush to a block. However, the encoding of term data should be decided by the term dictionary, since the postings writer doesn't always know how terms are structured in the term dictionary... The previous API had some tricky code for this, e.g. PulsingPostingsWriter had to use a term's ordinal within its block to decide how to write metadata, which is unnecessary.

To make the API between term dict and postings list more 'pluggable' and 'general', I refactored PostingsReaderBase/PostingsWriterBase. For example, the postings writer should provide some information to the term dictionary, such as how many metadata values are strictly monotonic, so that the term dictionary can optimize delta-encoding itself. And since the term dictionary now fully decides how metadata is written, it can use intblock-based metadata encoding.

Now the two term dictionary implementations can easily be plugged into the current postings formats, like:

* FST41 = FSTTermdict + Lucene41PostingsBaseFormat
* FSTOrd41 = FSTOrdTermdict + Lucene41PostingsBaseFormat
* FSTOrdPulsing41 = FSTOrdTermdict + PulsingPostingsWrapper + Lucene41PostingsFormat

About performance: as shown before, the two term dicts improve primary key lookup, but still have overhead on wildcard queries (both term dicts only have prefix information, and a term dictionary cannot work well with this...). I'll try to hack on this later.
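The point above about strictly monotonic metadata values can be illustrated with a small sketch. It is not the code from the patch: the class `MonotonicDeltaSketch`, its methods, and the sample numbers are all hypothetical. It only assumes that the postings writer reports how many leading metadata columns never decrease from term to term, so the term dictionary can store those columns as deltas and leave the rest untouched.

```java
import java.util.Arrays;

// Hypothetical sketch: the term dictionary delta-encodes only the columns
// that the postings writer declared monotonic (here: the first monotonicCount).
class MonotonicDeltaSketch {

    /** Stores the first monotonicCount values of each row as a delta from the previous row. */
    public static long[][] encode(long[][] rows, int monotonicCount) {
        long[][] out = new long[rows.length][];
        long[] prev = new long[monotonicCount]; // all zero, so row 0 is stored as-is
        for (int i = 0; i < rows.length; i++) {
            out[i] = rows[i].clone();
            for (int j = 0; j < monotonicCount; j++) {
                out[i][j] = rows[i][j] - prev[j];
                prev[j] = rows[i][j];
            }
        }
        return out;
    }

    /** Reverses encode() by accumulating the deltas column by column. */
    public static long[][] decode(long[][] deltas, int monotonicCount) {
        long[][] out = new long[deltas.length][];
        long[] prev = new long[monotonicCount];
        for (int i = 0; i < deltas.length; i++) {
            out[i] = deltas[i].clone();
            for (int j = 0; j < monotonicCount; j++) {
                prev[j] += deltas[i][j];
                out[i][j] = prev[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up per-term metadata: two monotonic file pointers, one arbitrary value.
        long[][] meta = { {100, 7, 3}, {260, 19, 1}, {900, 40, 8} };
        long[][] enc = encode(meta, 2);
        System.out.println(Arrays.deepToString(enc));
        System.out.println(Arrays.deepEquals(meta, decode(enc, 2)));
    }
}
```

The smaller deltas are what make a later variable-length or block-based integer encoding pay off; the third column is left alone because nothing was promised about its ordering.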
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-3069.patch

The uploaded patch should show all the changes against trunk: I added two different implementations of term dict, and refactored the PostingsBaseFormat to plug in non-block-based term dicts.

I'm still working on the javadocs. Maybe we should also rename that 'temp' package, e.g. to 'fstterms'?
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-3069.patch

Patch to show the impersonation hack for the Pulsing format.

We cannot perfectly impersonate the old Pulsing format yet: the old format divided the metadata block into inlined bytes and wrapped bytes, so when the term dict reader reads the length of the metadata block, it is actually the length of the 'inlined block'... and the 'wrapped block' won't be loaded for the wrapped PF.

However, introducing a new method in PostingsReaderBase doesn't seem to be a good way to handle this...
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-3069.patch

Patch showing how the current codecs (Block/BlockTree + Lucene4X/Pulsing/Mock*) change with our API refactoring.

TestBackwardsCompatibility still fails; I'll work on the impersonation later.
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-3069.patch

Patch: update the BlockTerms dict so that it follows the refactored API.
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-3069.patch

Patch with a backward compatibility fix for Lucene41PBF (TempPostingsReader is actually a fork of Lucene41PostingsReader).
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-3069.patch

Uploaded patch. It is optimized for wildcard queries, and I did a quick test on 1M wiki data:

{noformat}
    Task    QPS base  StdDev    QPS comp  StdDev       Pct diff
PKLookup      314.63  (1.5%)      314.64  (1.2%)     0.0% (  -2% -    2%)
  Fuzzy1       91.32  (3.7%)       92.50  (1.6%)     1.3% (  -3% -    6%)
 Respell      104.54  (3.9%)      106.97  (1.6%)     2.3% (  -2% -    8%)
  Fuzzy2       38.22  (4.1%)       39.16  (1.2%)     2.5% (  -2% -    8%)
Wildcard      109.56  (3.1%)      273.42  (5.0%)   149.6% ( 137% -  162%)
{noformat}

and TempFSTOrd vs. Lucene41, on 1M data:

{noformat}
    Task    QPS base  StdDev    QPS comp  StdDev       Pct diff
 Respell      134.85  (3.7%)      106.30  (0.6%)   -21.2% ( -24% -  -17%)
  Fuzzy2       47.78  (4.1%)       39.03  (0.9%)   -18.3% ( -22% -  -13%)
  Fuzzy1      112.02  (3.0%)       91.95  (0.6%)   -17.9% ( -20% -  -14%)
Wildcard      326.68  (3.5%)      273.41  (1.9%)   -16.3% ( -20% -  -11%)
PKLookup      194.61  (1.8%)      314.24  (0.7%)    61.5% (  57% -   65%)
{noformat}

But I'm not happy with it :(. The hack here is to consume another big block to store the last byte of each term. So for the wildcard query ab*c, we have external information that tells us the ord of the nearest term matching *c. Knowing the ord, we can use an approach similar to getByOutput to jump to the next target term.

Previously, we had to walk the FST to the stop node to find out whether the last byte is 'c', so this optimization accounts for a big chunk of the gain.

However, I don't really like this patch :(: we have to increase the index size (521M => 530M), and the code becomes messy, since we always have to foresee the next arc on the current stack.
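The last-byte hack described above can be sketched in isolation. This is only an illustration under assumptions, not the patch: `LastByteSkipSketch` and its methods are hypothetical, terms are assumed to be non-empty ASCII strings, and a sorted array indexed by ord stands in for the FST and its getByOutput-style jump.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a side array holding the last byte of every term, indexed by ord.
// For a query like ab*c it lets us skip terms that cannot end in 'c' without walking them.
class LastByteSkipSketch {

    /** Builds the extra block: one byte per term (assumes non-empty ASCII terms). */
    public static byte[] lastBytes(String[] sortedTerms) {
        byte[] last = new byte[sortedTerms.length];
        for (int ord = 0; ord < sortedTerms.length; ord++) {
            String term = sortedTerms[ord];
            last[ord] = (byte) term.charAt(term.length() - 1);
        }
        return last;
    }

    /** Returns the smallest ord >= fromOrd whose term ends with wanted, or -1 if none. */
    public static int nextOrdEndingWith(byte[] last, int fromOrd, byte wanted) {
        for (int ord = fromOrd; ord < last.length; ord++) {
            if (last[ord] == wanted) {
                return ord;
            }
        }
        return -1;
    }

    /** Matches prefix*suffix by jumping from candidate ord to candidate ord. */
    public static List<String> match(String[] sortedTerms, String prefix, char suffix) {
        byte[] last = lastBytes(sortedTerms);
        List<String> hits = new ArrayList<>();
        int ord = nextOrdEndingWith(last, 0, (byte) suffix);
        while (ord >= 0) {
            String term = sortedTerms[ord]; // stand-in for a lookup of the term by ord
            if (term.length() > prefix.length() && term.startsWith(prefix)) {
                hits.add(term);
            }
            ord = nextOrdEndingWith(last, ord + 1, (byte) suffix);
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] terms = { "abc", "abd", "abxc", "abz", "b" };
        System.out.println(match(terms, "ab", 'c'));
    }
}
```

The cost mentioned in the message shows up here as the extra `byte[]`: one byte per term, kept alongside the dictionary.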
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-3069.patch

Patch: revive IntersectTermsEnum in TempFSTOrd.

Mike, since we already have an intersect() impl, maybe we can still keep this? By the way, it is easy to migrate from TempFST to TempFSTOrd.
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-3069.patch
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment:     (was: LUCENE-5152.patch)
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-5152.patch

The previous design put much stress on decoding of Outputs. This becomes a disaster for wildcard queries: e.g. for f*nd, we usually have to walk to the last character in the FST, only to find that it is not 'd' and the automaton doesn't accept it. In this case, TempFST actually iterates over all the results of f*, decoding all the metadata for them...

So I'm trying another approach; the main idea is to load metadata & stats as lazily as possible. Here I use the FST as a term index, and leave everything else in a single term block. The term index FST holds the mapping from each term to its ord, and in the term block we can maintain a skip list for finding the related metadata & stats.

It is a little similar to BTTR now, and someday we can control how much data to keep memory resident (e.g. keep stats in memory but metadata on disk; however, this should be another issue).

Another good part is that it naturally supports seek by ord (ah, actually I don't understand where that is used).

Tests pass, and intersect is not implemented yet. Perf on 1M wiki data, between non-intersect TempFST and TempFSTOrd:

{noformat}
    Task    QPS base  StdDev    QPS comp  StdDev       Pct diff
PKLookup      373.80  (0.0%)      320.30  (0.0%)   -14.3% ( -14% -  -14%)
  Fuzzy1       43.82  (0.0%)       47.10  (0.0%)     7.5% (   7% -    7%)
 Prefix3      399.62  (0.0%)      433.95  (0.0%)     8.6% (   8% -    8%)
  Fuzzy2       14.26  (0.0%)       15.95  (0.0%)    11.9% (  11% -   11%)
 Respell       40.69  (0.0%)       46.29  (0.0%)    13.8% (  13% -   13%)
Wildcard       83.44  (0.0%)       96.54  (0.0%)    15.7% (  15% -   15%)
{noformat}

The perf hit on PKLookup is expected, since I haven't optimized the skip list.

I'll update intersect() later, and later we'll cut over to PagedBytes & PackedLongBuffer.
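The term index plus term block layout described above can be sketched as follows. Everything here is an assumption made for illustration: `OrdTermBlockSketch` is a hypothetical class, a `TreeMap` stands in for the term index FST, one delta-encoded `long` per term stands in for the metadata & stats, and a flat array of skip points stands in for the skip list. The point is only that metadata is decoded when a term is actually requested, and only from the nearest skip point.

```java
import java.util.TreeMap;

// Hypothetical sketch: term -> ord index, with metadata decoded lazily by ord.
class OrdTermBlockSketch {
    private final TreeMap<String, Integer> termToOrd = new TreeMap<>(); // stand-in for the FST
    private final long[] deltas;   // the term block: one delta-encoded value per ord
    private final long[] skipBase; // absolute value just before each skip interval starts
    private final int interval;
    int decodeSteps = 0;           // how many deltas were decoded so far

    public OrdTermBlockSketch(String[] sortedTerms, long[] filePointers, int interval) {
        this.interval = interval;
        this.deltas = new long[filePointers.length];
        this.skipBase = new long[(filePointers.length + interval - 1) / interval];
        for (int ord = 0; ord < filePointers.length; ord++) {
            termToOrd.put(sortedTerms[ord], ord);
            long prev = ord == 0 ? 0 : filePointers[ord - 1];
            deltas[ord] = filePointers[ord] - prev;
            if (ord % interval == 0) {
                skipBase[ord / interval] = prev; // record a skip point
            }
        }
    }

    /** Seek by ord: start at the nearest skip point and decode at most interval deltas. */
    public long metadataForOrd(int ord) {
        int block = ord / interval;
        long value = skipBase[block];
        for (int i = block * interval; i <= ord; i++) {
            value += deltas[i];
            decodeSteps++;
        }
        return value;
    }

    /** Seek by term: resolve the ord first, then decode; null if the term is absent. */
    public Long lookup(String term) {
        Integer ord = termToOrd.get(term);
        return ord == null ? null : metadataForOrd(ord);
    }

    public static void main(String[] args) {
        OrdTermBlockSketch dict = new OrdTermBlockSketch(
            new String[] { "a", "b", "c", "d", "e" },
            new long[] { 10, 25, 40, 80, 95 }, 2);
        System.out.println(dict.lookup("d") + " after " + dict.decodeSteps + " decode steps");
    }
}
```

A rejected wildcard candidate costs nothing in this layout, because walking the index never touches `deltas`; the price is the extra hop on every successful lookup, which is consistent with the PKLookup regression reported in the message.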
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-3069.patch
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-3069.patch

Upload patch: implemented IntersectEnum.next() & seekCeil(). Lots of nocommits, but it passes all tests.

The main idea is to run a DFS on the FST, and backtrack as early as possible (i.e. as soon as we see that a label is rejected by the automaton).

For this version, there is one explicit perf overhead: I use a real stack here, which can be replaced by a Frame[] to reuse objects.

There are several aspects I didn't dig into deeply:

* Currently, CompiledAutomaton provides a commonSuffixRef, but how can we make use of it in the FST?
* The DFS is somewhat of a 'goto' version, i.e. we could make the code cleaner with a single while-loop similar to a BFS search. However, since the FST doesn't always tell us how many arcs leave the current arc, we have a problem dealing with this...
* When the FST is large enough, the next() operation will take much time doing the linear arc read; maybe we should make use of CompiledAutomaton.sortedTransition[] when there are many leaving arcs.
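The DFS-with-early-backtracking idea above can be sketched on a toy structure. None of this is the patch's code: `IntersectDfsSketch`, `Node`, and `Frame` are hypothetical, a plain trie stands in for the FST, and a fixed-length pattern where `?` matches any single character stands in for the compiled automaton. It does use an explicit stack of frames, as the message describes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: intersect a trie of terms with a tiny automaton,
// refusing to descend into any arc whose label the automaton rejects.
class IntersectDfsSketch {

    static final class Node {
        final TreeMap<Character, Node> arcs = new TreeMap<>();
        boolean isTerm;
    }

    static final class Frame {
        final Node node;
        final int state;      // automaton state = number of pattern chars consumed
        final String prefix;  // labels on the path from the root
        Frame(Node node, int state, String prefix) {
            this.node = node;
            this.state = state;
            this.prefix = prefix;
        }
    }

    static int arcsFollowed; // accepted arcs in the last intersect() call

    public static void add(Node root, String term) {
        Node n = root;
        for (char c : term.toCharArray()) {
            n = n.arcs.computeIfAbsent(c, k -> new Node());
        }
        n.isTerm = true;
    }

    /** One automaton transition: the next state, or -1 if the label is rejected. */
    static int step(String pattern, int state, char label) {
        if (state >= pattern.length()) {
            return -1;
        }
        char p = pattern.charAt(state);
        return (p == '?' || p == label) ? state + 1 : -1;
    }

    public static List<String> intersect(Node root, String pattern) {
        arcsFollowed = 0;
        List<String> hits = new ArrayList<>();
        Deque<Frame> stack = new ArrayDeque<>();
        stack.push(new Frame(root, 0, ""));
        while (!stack.isEmpty()) {
            Frame f = stack.pop();
            if (f.node.isTerm && f.state == pattern.length()) {
                hits.add(f.prefix); // term end reached in an accepting state
            }
            // Push in descending label order so the smallest label is popped first.
            for (Map.Entry<Character, Node> e : f.node.arcs.descendingMap().entrySet()) {
                int next = step(pattern, f.state, e.getKey());
                if (next < 0) {
                    continue; // rejected label: backtrack without visiting the subtree
                }
                arcsFollowed++;
                stack.push(new Frame(e.getValue(), next, f.prefix + e.getKey()));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Node root = new Node();
        for (String t : new String[] { "cat", "cot", "cut", "car", "cart", "dog" }) {
            add(root, t);
        }
        System.out.println(intersect(root, "c?t") + ", arcs followed: " + arcsFollowed);
    }
}
```

In this toy version each `Frame` is a fresh object, which is the overhead the message calls out; a preallocated `Frame[]` with a depth counter would avoid the allocations.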
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-3069.patch

Patch: revert hashCode()
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-3069.patch

Patch according to previous comments. We still somewhat need hashCode() to exist, because NodeHash checks whether the frozen node has the same hash code as the uncompiled node (NodeHash:128).
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
--
Attachment: df-ttf-estimate.txt

Uploaded detailed data for wikimediumall.

Oh, sorry, there was an error when I calculated the index size for the df==0 trick: it should be 105MB instead of 70MB. But the real test is still beyond the estimate (weird...).

The df==0 trick gains similar compression. Index sizes are below:

{noformat}
v0:                   13195304
v1 = v0 + flag byte:  12847172
v2 = v1 + steal bit:  12770700
v3 = v1 + zero df:    12780884
{noformat}

Another thing that surprised me: with the same code/conf, luceneutil creates indexes of different sizes? I tested the df==0 trick several times on wikimedium1m, and the index size varies from 514M to 522M... Does multi-threading affect this much?

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index, core/search
> Affects Versions: 4.0-ALPHA
> Reporter: Simon Willnauer
> Assignee: Han Jiang
> Labels: gsoc2013
> Fix For: 4.4
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch
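As a rough illustration of the "steal bit" variant listed in the comment above, here is a minimal sketch. It assumes that the stolen bit is the low bit of docFreq and that it signals totalTermFreq == docFreq (an earlier comment on this issue reports that 85.8% of terms in the body field of wikimedium1m have df == ttf). The class name TermStatsCodec and the hand-rolled variable-length integer helpers are hypothetical and self-contained; they are not the Lucene API and not necessarily what the patch does.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical df/ttf codec: the low bit of the first value says "ttf == df". */
class TermStatsCodec {

    // Variable-length encoding: 7 payload bits per byte, high bit = "more follows".
    static void writeVLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // pos is a one-element array used as a mutable read cursor.
    static long readVLong(byte[] bytes, int[] pos) {
        long v = 0;
        int shift = 0;
        while (true) {
            byte b = bytes[pos[0]++];
            v |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return v;
            }
            shift += 7;
        }
    }

    static byte[] encode(int docFreq, long totalTermFreq) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (totalTermFreq == docFreq) {
            // Common long-tail case: one value, stolen bit set, no ttf written.
            writeVLong(out, ((long) docFreq << 1) | 1L);
        } else {
            // Otherwise write df with the bit clear, then the (positive) delta.
            writeVLong(out, (long) docFreq << 1);
            writeVLong(out, totalTermFreq - docFreq);
        }
        return out.toByteArray();
    }

    /** Returns {docFreq, totalTermFreq}. */
    static long[] decode(byte[] bytes) {
        int[] pos = {0};
        long code = readVLong(bytes, pos);
        long df = code >>> 1;
        long ttf = (code & 1L) != 0 ? df : df + readVLong(bytes, pos);
        return new long[] {df, ttf};
    }

    public static void main(String[] args) {
        long[] stats = decode(encode(3, 10));
        System.out.println(stats[0] + " " + stats[1]);
    }
}
```

With this layout a term whose df equals its ttf costs no extra bytes for ttf, at the price of one bit of range in the docFreq value. How this compares with the flag-byte and zero-df variants in practice is what the sizes above measure.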
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
--
Attachment: example.png
            LUCENE-3069.patch

Uploaded patch; it is the main part of the changes I committed to branch3069.

The picture shows the current impl of outputs (it is fetched from one field in wikimedium5k).

* long[] (sortable metadata)
* byte[] (unsortable, generic metadata)
* df, ttf (term stats)

A single byte flag is used to indicate whether/which fields the current outputs maintains; for PBF with short byte[], this should be enough. Also, for long-tail terms, the totalTermFreq can safely be inlined into docFreq (for the body field in wikimedium1m, 85.8% of terms have df == ttf).

Since TermsEnum is totally based on FSTEnum, the performance of the term dict should be similar to MemoryPF. However, for PK tasks, we have to pull docsEnum from MMap, so this hurts.

Following is the performance comparison:

{noformat}
pure TempFST vs. Lucene41 + Memory(on idField), on wikimediumall

            Task    QPS base      StdDev    QPS comp      StdDev    Pct diff
         Respell       48.13      (4.4%)       15.38      (1.0%)    -68.0% ( -70% - -65%)
          Fuzzy2       51.30      (5.3%)       17.47      (1.3%)    -65.9% ( -68% - -62%)
          Fuzzy1       52.24      (4.0%)       18.50      (1.2%)    -64.6% ( -67% - -61%)
        Wildcard        9.31      (1.7%)        6.16      (2.2%)    -33.8% ( -37% - -30%)
         Prefix3       23.25      (1.8%)       19.00      (2.2%)    -18.3% ( -21% - -14%)
        PKLookup      244.92      (3.6%)      225.42      (2.3%)     -8.0% ( -13% -  -2%)
         LowTerm      295.88      (5.5%)      293.27      (4.8%)     -0.9% ( -10% -   9%)
      HighPhrase       13.62      (6.5%)       13.54      (7.4%)     -0.6% ( -13% -  14%)
         MedTerm       99.51      (7.8%)       99.19      (7.7%)     -0.3% ( -14% -  16%)
       MedPhrase      154.63      (9.4%)      154.38     (10.1%)     -0.2% ( -17% -  21%)
        HighTerm       28.25     (10.7%)       28.25     (10.0%)     -0.0% ( -18% -  23%)
      OrHighHigh       16.83     (13.3%)       16.86     (13.1%)      0.2% ( -23% -  30%)
HighSloppyPhrase        9.02      (4.4%)        9.03      (4.5%)      0.2% (  -8% -   9%)
       LowPhrase        6.26      (3.4%)        6.27      (4.1%)      0.2% (  -7% -   8%)
       OrHighMed       13.73     (13.2%)       13.77     (12.8%)      0.3% ( -22% -  30%)
       OrHighLow       25.65     (13.2%)       25.73     (13.0%)      0.3% ( -22% -  30%)
 MedSloppyPhrase        6.63      (2.7%)        6.66      (2.7%)      0.5% (  -4% -   6%)
      AndHighMed       42.77      (1.8%)       43.13      (1.5%)      0.8% (  -2% -   4%)
 LowSloppyPhrase       32.68      (3.0%)       32.96      (2.8%)      0.8% (  -4% -   6%)
     AndHighHigh       22.90      (1.2%)       23.18      (0.7%)      1.2% (   0% -   3%)
     LowSpanNear       29.30      (2.0%)       29.83      (2.2%)      1.8% (  -2% -   6%)
     MedSpanNear        8.39      (2.7%)        8.56      (2.9%)      2.0% (  -3% -   7%)
          IntNRQ        3.12      (1.9%)        3.18      (6.7%)      2.1% (  -6% -  10%)
      AndHighLow      507.01      (2.4%)      522.10      (2.8%)      3.0% (  -2% -   8%)
    HighSpanNear        5.43      (1.8%)        5.60      (2.6%)      3.1% (  -1% -   7%)
{noformat}

{noformat}
pure TempFST vs. pure Lucene41, on wikimediumall

            Task    QPS base      StdDev    QPS comp      StdDev    Pct diff
         Respell       49.24      (2.7%)       15.51      (1.0%)    -68.5% ( -70% - -66%)
          Fuzzy2       52.01      (4.8%)       17.61      (1.4%)    -66.1% ( -68% - -63%)
          Fuzzy1       53.00      (4.0%)       18.62      (1.3%)    -64.9% ( -67% - -62%)
        Wildcard        9.37      (1.3%)        6.15      (2.1%)    -34.4% ( -37% - -31%)
         Prefix3       23.36      (0.8%)       18.96      (2.1%)    -18.8% ( -21% - -16%)
       MedPhrase      155.86      (9.8%)      152.34      (9.7%)     -2.3% ( -19% -  19%)
       LowPhrase        6.33      (3.7%)        6.23      (4.0%)     -1.6% (  -8% -   6%)
      HighPhrase       13.68      (7.2%)       13.49      (6.8%)     -1.4% ( -14% -  13%)
       OrHighMed       13.78     (13.0%)       13.68     (12.7%)     -0.8% ( -23% -  28%)
HighSloppyPhrase        9.14      (5.2%)        9.07      (3.7%)     -0.7% (  -9% -   8%)
      OrHighHigh       16.87     (13.3%)       16.76     (12.9%)     -0.6% ( -23% -  29%)
       OrHighLow       25.71     (13.1%)       25.58     (12.8%)     -0.5% ( -23% -  29%)
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-3069:
--
Fix Version/s: (was: 4.3) 4.4

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index, core/search
> Affects Versions: 4.0-ALPHA
> Reporter: Simon Willnauer
> Assignee: Han Jiang
> Labels: gsoc2013
> Fix For: 4.4
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe updated LUCENE-3069:
---
Fix Version/s: (was: 4.1) 4.2

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index, core/search
> Affects Versions: 4.0-ALPHA
> Reporter: Simon Willnauer
> Assignee: Simon Willnauer
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.2
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3069:
---
Labels: gsoc2012 lucene-gsoc-12 (was: )

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/index, core/search
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Assignee: Simon Willnauer
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.0
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-3069:

Summary: Lucene should have an entirely memory resident term dictionary (was: Lucene should be able to have a entirely memory resident term dictionary)

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index, Search
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Assignee: Simon Willnauer
> Fix For: 4.0