[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937095#comment-13937095 ] Han Jiang commented on LUCENE-3069: Had to reopen it because jira doesn't permit label change :)

> Lucene should have an entirely memory resident term dictionary
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index, core/search
> Affects Versions: 4.0-ALPHA
> Reporter: Simon Willnauer
> Assignee: Han Jiang
> Labels: gsoc2013
> Fix For: 4.7
>
> Attachments: LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, df-ttf-estimate.txt,
> example.png
>
> FST based TermDictionary has been a great improvement yet it still uses a
> delta codec file for scanning to terms. Some environments have enough memory
> available to keep the entire FST based term dict in memory. We should add a
> TermDictionary implementation that encodes all needed information for each
> term into the FST (custom fst.Output) and builds a FST from the entire term
> not just the delta.

--
This message was sent by Atlassian JIRA (v6.2#6252)
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937065#comment-13937065 ] Michael McCandless commented on LUCENE-3069: Woops, thanks for fixing the gsoc label Han!
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886478#comment-13886478 ] ASF subversion and git services commented on LUCENE-3069: Commit 1562771 from [~mikemccand] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1562771 ] LUCENE-3069: also exclude MockRandom from this test case
> Labels: gsoc2014
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886256#comment-13886256 ] Han Jiang commented on LUCENE-3069: Thanks Mike!
> Labels: gsoc2014
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885502#comment-13885502 ] ASF subversion and git services commented on LUCENE-3069: Commit 1562506 from [~mikemccand] in branch 'dev/trunk' [ https://svn.apache.org/r1562506 ] LUCENE-3069: merge the back-compat indices from 4.x
> Labels: gsoc2014
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885476#comment-13885476 ] ASF subversion and git services commented on LUCENE-3069: Commit 1562498 from [~mikemccand] in branch 'dev/trunk' [ https://svn.apache.org/r1562498 ] LUCENE-3069: move CHANGES entries under 4.7
> Labels: gsoc2014
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885474#comment-13885474 ] ASF subversion and git services commented on LUCENE-3069: Commit 1562497 from [~mikemccand] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1562497 ] LUCENE-3069: port fully RAM-resident terms FST dictionary implementations to 4.x
> Labels: gsoc2014
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885387#comment-13885387 ] Michael McCandless commented on LUCENE-3069: I'd like to commit this to 4.7 as well ... I'll backport & commit soon.
> Labels: gsoc2014
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879423#comment-13879423 ] Han Jiang commented on LUCENE-3069: Thanks for catching this Mike! I wasn't quick to get that username :p
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878932#comment-13878932 ] Mark Miller commented on LUCENE-3069: Ah, sorry. I'm usually in Comments view and I took "Note that all the commit messages at the end of this issue " as referring to the ASF subversion and git services commit tags. Given past experience, I don't trust the fisheye or the like integrations. We might wake up one day and they will just be gone along with their history...
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878917#comment-13878917 ] Michael McCandless commented on LUCENE-3069: Thanks Mark. Yes, you can see FishEye's comments under Source and also the All tab. "Our" (svnpubsub) commit messages are correct here (they say "Han Jiang"), but the FishEye comments are incorrect (they say "Han Lee").
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878765#comment-13878765 ] Mark Miller commented on LUCENE-3069: I think it's actually two different things - our commit messages are generated by a script that subscribes to svnpubsub, and it does some look ups to figure out the right user name. The fisheye stuff is what you see if you look at the source tab I think. So it might be easier to fix, since I think it's in our control (INFRA's anyway). https://svn.apache.org/repos/infra/infrastructure/trunk/projects/svngit2jira/
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878687#comment-13878687 ] Michael McCandless commented on LUCENE-3069: Note that all the commit messages at the end of this issue (generated by Jira's FishEye plugin I think) incorrectly state that "Han Lee" committed changes here. This is due to an issue in FishEye with username collision ... Han Jiang's (who really committed here) apache username is "han", but in Jira that user name belongs to Han Lee, which leads to this mis-labeling. Here's the INFRA issue: https://issues.apache.org/jira/browse/INFRA-3243 but it's currently WONTFIX unfortunately ...
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790156#comment-13790156 ] ASF subversion and git services commented on LUCENE-3069: Commit 1530520 from [~billy] in branch 'dev/trunk' [ https://svn.apache.org/r1530520 ] LUCENE-3069: add CHANGES, move new postingsformats to oal.codecs
> Fix For: 4.6
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13782141#comment-13782141 ] Simon Willnauer commented on LUCENE-3069: nice one! I am happy that this one made it in 2.5 years after opening! Great work Han!!
> Fix For: 5.0, 4.5
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13761960#comment-13761960 ] ASF subversion and git services commented on LUCENE-3069: Commit 1521173 from [~billy] in branch 'dev/trunk' [ https://svn.apache.org/r1521173 ] LUCENE-3069: Lucene should have an entirely memory resident term dictionary
> Fix For: 5.0, 4.5
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760304#comment-13760304 ] ASF subversion and git services commented on LUCENE-3069: Commit 1520618 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1520618 ] LUCENE-3069: reuse customized TermState in PBF
> Fix For: 5.0, 4.5
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760328#comment-13760328 ] Michael McCandless commented on LUCENE-3069: Thanks Han. I think we can just leave the .smy as is for now, and keep passing "boolean absolute" down. We can later improve these ... I think we should first land this on trunk and let jenkins chew on it for a while ... and if all seems good, then back port.
> Fix For: 5.0, 4.5
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760325#comment-13760325 ] Han Jiang commented on LUCENE-3069: --- I think this is ready to commit to trunk now, and I'll wait for a day or two before committing it. :)
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760259#comment-13760259 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1520592 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1520592 ] LUCENE-3069: remove impersonate codes, fix typo
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760160#comment-13760160 ] Han Jiang commented on LUCENE-3069: ---
Mike, thanks for the review!

bq. In general, couldn't the writer re-use the reader's TermState?

I'm afraid this would make the code somewhat longer. I'll make a patch to see.

{quote} Have you run "first do no harm" perf tests? Ie, compare current trunk w/ default Codec to branch w/ default Codec? Just to make sure there are no surprises... {quote}

Yes, no surprises yet.

bq. Why does Lucene41PostingsWriter have "impersonation" code?

Yeah, this should be removed.

{quote} I forget: why does the postings reader/writer need to handle delta coding again (take an absolute boolean argument)? Was it because of pulsing or sep? It's fine for now (progress not perfection) ... but not clean, since "delta coding" is really an encoding detail so in theory the terms dict should "own" that ... {quote}

Ah, yes, because of pulsing. PulsingPostingsBase is more than a PostingsBaseFormat: it acts somewhat like a term dict. It needs to understand how terms are structured in one block (term No.1 uses an absolute value, term No.x uses a delta value) and then decide how to restructure the inlined and wrapped block (No.1 still uses an absolute value, but the first non-pulsed term needs absolute encoding as well). Without the 'absolute' argument, the real term dictionary would do the delta encoding itself, PulsingPostingsBase would be confused, and every wrapped PostingsBase would have to encode its metadata values without delta format.

{quote} The new .smy file for Pulsing is sort of strange ... but necessary since it always uses 0 longs, so we have to store this somewhere ... you could put it into FieldInfo attributes instead? {quote}

Yeah, it is another hairy thing... The reason is that we don't have a 'PostingsTrailer' for PostingsBaseFormat. Pulsing will not know the longs size for each field until all the fields are consumed, and it should not write those longsSize values to termsOut in close(), since the term dictionary uses the DirTrailer hack there. (Maybe every term dictionary should close postingsWriter first, then write the field summary and close itself? I'm not sure, though.)

bq. Should we backport this to 4.x?

Yeah, OK!
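The absolute-versus-delta rule described in this reply can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Lucene's actual postings API: the class `DeltaEncodingSketch`, the method `encodeBlock`, and the convention that a `null` entry stands for a pulsed (inlined) term are all hypothetical. It shows one reading of the comment, namely that within a block the first non-pulsed term is written with absolute values and later non-pulsed terms are written as deltas against the previous non-pulsed term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DeltaEncodingSketch {
    /**
     * Encodes the per-term metadata longs of one block.
     * A null entry is a hypothetical marker for a pulsed (inlined) term,
     * which hands no longs to the wrapped postings format.
     */
    static List<long[]> encodeBlock(List<long[]> termLongs) {
        List<long[]> out = new ArrayList<>();
        long[] last = null; // longs of the previous non-pulsed term, if any
        for (long[] cur : termLongs) {
            if (cur == null) {
                out.add(null); // pulsed term: nothing to delta-encode
                continue;
            }
            // The first non-pulsed term of the block has no baseline,
            // so it must be written with absolute values.
            boolean absolute = (last == null);
            long[] enc = new long[cur.length];
            for (int i = 0; i < cur.length; i++) {
                enc[i] = absolute ? cur[i] : cur[i] - last[i];
            }
            out.add(enc);
            last = cur;
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> block = Arrays.asList(
            null, new long[] {100, 40}, new long[] {130, 40}, null, new long[] {175, 52});
        for (long[] enc : encodeBlock(block)) {
            System.out.println(Arrays.toString(enc));
        }
    }
}
```

Note that term No.1 in this example is pulsed, so the second term is the one that needs absolute values. A term dictionary that computed deltas on its own, without knowing which terms were pulsed, could not make that distinction.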
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759448#comment-13759448 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1520422 from [~mikemccand] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1520422 ] LUCENE-3069: small javadoc fixes
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759449#comment-13759449 ] Michael McCandless commented on LUCENE-3069:
Patch looks great. It's nice how postings writers no longer need their own redundant PendingTerm instances to track the term's metadata / blocking; they just use their existing TermState class instead. And how postings readers don't have to deal w/ blocking either.

In general, couldn't the writer re-use the reader's TermState? E.g. Lucene40PostingsWriter could just use Lucene40PostingsReader's StandardTermState, rather than make its own? (And same for Lucene41PostingsWriter/Reader).

Have you run "first do no harm" perf tests? Ie, compare current trunk w/ default Codec to branch w/ default Codec? Just to make sure there are no surprises...

Why does Lucene41PostingsWriter have "impersonation" code? Was that just for debugging during dev? Can we remove it (it should always write the current format)? The reader needs it of course ... but it shouldn't be commented as "impersonation" but as back-compat?

In the javadocs for encodeTerm, don't we require that the long[] are always monotonic? It's not "optional"? Also, "monotonical" should be "monotonic" there.

Maybe we should add a "reset" method to each PF's TermState, so instead of doing newTermState() when absolute, we can .reset(), and likewise in the reader.

I forget: why does the postings reader/writer need to handle delta coding again (take an absolute boolean argument)? Was it because of pulsing or sep? It's fine for now (progress not perfection) ... but not clean, since "delta coding" is really an encoding detail so in theory the terms dict should "own" that ...

"monotonical" appears several times but I think it should instead be "monotonic".

The new .smy file for Pulsing is sort of strange ... but necessary since it always uses 0 longs, so we have to store this somewhere ... you could put it into FieldInfo attributes instead?

It's nice how small the FST terms dicts are! Much simpler than the hairy BlockTree code...

Should we backport this to 4.x? In theory this should not be so hard ... 3.x indices already have their own PF impls, and the change is back-compatible to current 4.x indices ...
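The "reset" suggestion in this review can be illustrated with a small sketch. Everything here is an assumption made for illustration: `SketchTermState`, its two file-pointer fields, and `decodeTerm` are hypothetical stand-ins, not the classes in the patch. The point is only that an absolute term can reuse the existing state object by resetting its baseline instead of allocating a fresh one.

```java
final class TermStateResetSketch {
    /** Hypothetical stand-in for a postings format's per-term state. */
    static final class SketchTermState {
        long docStartFP;
        long posStartFP;

        /** Returns to the "no previous term" baseline so the next longs are read as absolute. */
        void reset() {
            docStartFP = 0;
            posStartFP = 0;
        }
    }

    /**
     * Applies decoded longs to the state. When absolute is true the state
     * is reset and reused rather than replaced by a newly allocated object.
     */
    static void decodeTerm(long[] longs, SketchTermState state, boolean absolute) {
        if (absolute) {
            state.reset();
        }
        state.docStartFP += longs[0];
        state.posStartFP += longs[1];
    }

    public static void main(String[] args) {
        SketchTermState state = new SketchTermState();
        decodeTerm(new long[] {100, 40}, state, true);  // first term of a block
        decodeTerm(new long[] {30, 0}, state, false);   // delta against the previous term
        System.out.println(state.docStartFP + " " + state.posStartFP);
    }
}
```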
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757814#comment-13757814 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1520034 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1520034 ] LUCENE-3069: move TermDict impls to package 'memory', nuke all 'Temp' symbols
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757771#comment-13757771 ] Han Jiang commented on LUCENE-3069: --- Yes, with slight changes, it can support seek by ord. (With FST.getByOutput).
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757741#comment-13757741 ] David Smiley commented on LUCENE-3069: -- I like FSTOrd as well. Presumably this one also exposes it via TermsEnum.ord()?
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757676#comment-13757676 ] Han Jiang commented on LUCENE-3069: ---
OK! These two term dicts are both FST-based:

* FST term dict directly uses FST to map term to its metadata & stats (FST)
* FSTOrd term dict uses FST to map term to its ordinal number (FST), and the ordinal is then used to seek metadata from another big chunk.

I prefer the second impl since it puts much less stress on FST. I have updated the detailed format explanation in the last commit.

Hmm, I'll create another patch for this...
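The difference between the two layouts described above can be sketched without any Lucene classes. This is only an illustration under loud assumptions: a `TreeMap` and a sorted `String[]` stand in for the FST, and all names (`TermDictLayoutsSketch`, `lookupDirect`, `ordOf`, `lookupByOrd`) are hypothetical. The first layout stores the metadata as the map's value; the second maps a term only to its ordinal and keeps the metadata in a separate array indexed by that ordinal.

```java
import java.util.Arrays;
import java.util.TreeMap;

final class TermDictLayoutsSketch {
    /** Layout 1: the map value (standing in for the FST output) is the metadata itself. */
    static long[] lookupDirect(TreeMap<String, long[]> dict, String term) {
        return dict.get(term);
    }

    /** Layout 2, step 1: map a term to its ordinal, or -1 if the term is absent. */
    static int ordOf(String[] sortedTerms, String term) {
        int idx = Arrays.binarySearch(sortedTerms, term);
        return idx >= 0 ? idx : -1;
    }

    /** Layout 2, step 2: use the ordinal to seek into a separate metadata chunk. */
    static long[] lookupByOrd(long[][] metadataByOrd, int ord) {
        return metadataByOrd[ord];
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "banana", "cherry"};
        long[][] metadata = {{10, 1}, {25, 3}, {60, 2}};

        TreeMap<String, long[]> direct = new TreeMap<>();
        for (int i = 0; i < terms.length; i++) {
            direct.put(terms[i], metadata[i]);
        }

        int ord = ordOf(terms, "banana");
        System.out.println(Arrays.toString(lookupDirect(direct, "banana")));
        System.out.println(ord + " -> " + Arrays.toString(lookupByOrd(metadata, ord)));
    }
}
```

In the second layout the term-to-ordinal structure carries only a small number per term, and going from an ordinal back to a term (`terms[ord]` in this sketch) is what makes seek-by-ord possible.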
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757668#comment-13757668 ] Michael McCandless commented on LUCENE-3069: Thanks for uploading the diffs against trunk, Han; I'll review this. Can you explain the two new terms dict impls? And maybe write up a brief summary of all the changes (to help others understand the patch)? Maybe we can put the new "all in memory" terms dict impls under oal.codecs.memory? FSTTerms* seems like a good name? (Just because in the future maybe we have other impls of "all in memory" terms dicts)...
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757419#comment-13757419 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1519909 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1519909 ] LUCENE-3069: javadocs
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756320#comment-13756320 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1519542 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1519542 ] LUCENE-3069: update javadocs, fix impersonator bug
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754768#comment-13754768 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1518989 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1518989 ] LUCENE-3069: merge trunk changes
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752528#comment-13752528 ] Michael McCandless commented on LUCENE-3069:
PostingsReaderBase.pulsed is quite crazy ... really the terms dict should not need this information, ideally.

Pulsing has no back-compat guarantees, so it's fine to only support writing the "new" format and being able to read it. Ie, if this change is only for impersonation then we shouldn't need to do it, I think?

Also, this is spooky:

{code} int start = (int)in.getFilePointer(); {code}

Isn't that unsafe in general? Ie it could overflow int...
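The overflow concern raised about the `(int)` cast is easy to reproduce with plain Java. This is a standalone sketch, not code from the patch: `FilePointerCastSketch` and its two helper methods are hypothetical names, and a plain `long` stands in for the value a file-pointer accessor would return. It contrasts the silent narrowing cast with `Math.toIntExact`, which throws instead of wrapping around.

```java
final class FilePointerCastSketch {
    /** Mirrors the cast in question: silently wraps for offsets of 2 GiB or more. */
    static int uncheckedStart(long filePointer) {
        return (int) filePointer;
    }

    /** Fails fast: throws ArithmeticException if the offset does not fit in an int. */
    static int checkedStart(long filePointer) {
        return Math.toIntExact(filePointer);
    }

    public static void main(String[] args) {
        long small = 1024L;
        long large = (1L << 31) + 5; // just past Integer.MAX_VALUE

        System.out.println(uncheckedStart(small)); // fits, value preserved
        System.out.println(uncheckedStart(large)); // wraps to a negative number
        try {
            checkedStart(large);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}
```

Whether a checked conversion or keeping the offset as a `long` is the better fix depends on how the value is used afterwards, which this thread does not show.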
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13751191#comment-13751191 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1517792 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1517792 ] LUCENE-3069: add version check for impersonation
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748582#comment-13748582 ] Han Jiang commented on LUCENE-3069: ---
bq. Patch looks great on quick look! I'll look more when I'm back online...
OK! I've committed it so that we can see later changes.
bq. One thing: I think e.g. BlockTreeTermsReader needs some back-compat code, so it won't try to read longsSize on old indices?
Yes, both Block* term dictionaries will get a new VERSION constant to mark the change, and if the codec header shows an earlier version, they will not read the longsSize VInt.
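The version-gated read discussed in this comment could look roughly like the sketch below. It is only an illustration: the class, the constant names and the plain `int` header layout are assumptions made for this example (the real readers use a codec header and a VInt), and defaulting `longsSize` to 0 for old indices is likewise an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch, not the actual Block* term dictionary code.
class VersionGatedRead {
    // Hypothetical version numbers; the thread only says a new VERSION marks the change.
    static final int VERSION_START = 0;
    static final int VERSION_META_ARRAY = 1;

    // Writes a toy field summary: the version, then longsSize only for new-format files.
    static byte[] write(int version, int longsSize) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(version);
            if (version >= VERSION_META_ARRAY) {
                out.writeInt(longsSize);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads longsSize only when the header says the file was written with it.
    static int read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int version = in.readInt();
            if (version < VERSION_START || version > VERSION_META_ARRAY) {
                throw new IllegalStateException("unsupported version: " + version);
            }
            // Old indices never stored the value, so fall back to 0 (an assumption).
            return version >= VERSION_META_ARRAY ? in.readInt() : 0;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(write(VERSION_META_ARRAY, 3)));
        System.out.println(read(write(VERSION_START, 3)));
    }
}
```

The point of the gate is that the reader never consumes bytes an old writer did not produce, so everything after the header stays aligned on old indices.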
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748579#comment-13748579 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1516860 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1516860 ] LUCENE-3069: merge 'temp' codes back
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748556#comment-13748556 ] Michael McCandless commented on LUCENE-3069: Patch looks great on quick look! I'll look more when I'm back online... One thing: I think e.g. BlockTreeTermsReader needs some back-compat code, so it won't try to read longsSize on old indices?
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748388#comment-13748388 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1516742 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1516742 ] LUCENE-3069: API refactoring on Lucene40RW
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748193#comment-13748193 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1516677 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1516677 ] LUCENE-3069: API refactoring on MockRandom, revert supress codec in compatibility test
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747180#comment-13747180 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1516365 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1516365 ] LUCENE-3069: API refactoring on Sep/IntBlock PF
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743897#comment-13743897 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1515469 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1515469 ] LUCENE-3069: API refactoring on Pulsing PF
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740948#comment-13740948 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1514253 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1514253 ] LUCENE-3069: API refactoring on BlockTerms dict
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738105#comment-13738105 ] Han Jiang commented on LUCENE-3069: ---
Hi, we currently have a problem migrating the code to trunk:
The API refactoring on PostingsReader/WriterBase now splits term metadata into two parts: a monotonic long[] and a generic byte[]. The former is visible to the term dictionary so that it can apply better d-gap encoding. So we need a 'longsSize' in the field summary to tell the reader the fixed length of this monotonic long[].
However, this API change breaks backward compatibility: the old 4.x indices don't store this value, and for some codecs such as Lucene40, whose writer side is already deprecated, the tests won't pass.
It seems we could put all the metadata in the generic byte[] and let each PBF do its own buffering (as we did in the old API with nextTerm()), but then we would have to add that logic to every PBF.
So... can we solve this problem more elegantly?
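As a rough illustration of why the term dictionary wants to see the monotonic long[] part, the sketch below delta-encodes (d-gap) a fixed-length long[] per term against the previous term's values. The class and method names are hypothetical, and the sample values merely stand in for things like file pointers; this is not the PostingsWriterBase API.

```java
import java.util.Arrays;

// Hypothetical sketch of d-gap encoding over per-term long[] metadata.
class DeltaGapSketch {
    // Replaces each term's longs with the difference to the previous term's longs.
    static long[][] toDeltas(long[][] perTerm, int longsSize) {
        long[] prev = new long[longsSize]; // implicit all-zero predecessor
        long[][] deltas = new long[perTerm.length][];
        for (int t = 0; t < perTerm.length; t++) {
            if (perTerm[t].length != longsSize) {
                throw new IllegalArgumentException("term " + t + " does not have longsSize entries");
            }
            deltas[t] = new long[longsSize];
            for (int i = 0; i < longsSize; i++) {
                long gap = perTerm[t][i] - prev[i];
                if (gap < 0) {
                    throw new IllegalArgumentException("values must be monotonic across terms");
                }
                deltas[t][i] = gap;
            }
            prev = perTerm[t];
        }
        return deltas;
    }

    // Rebuilds the absolute values by accumulating the gaps.
    static long[][] fromDeltas(long[][] deltas, int longsSize) {
        long[] acc = new long[longsSize];
        long[][] perTerm = new long[deltas.length][];
        for (int t = 0; t < deltas.length; t++) {
            for (int i = 0; i < longsSize; i++) {
                acc[i] += deltas[t][i];
            }
            perTerm[t] = acc.clone();
        }
        return perTerm;
    }

    public static void main(String[] args) {
        long[][] perTerm = { {100, 40}, {180, 40}, {260, 95} };
        System.out.println(Arrays.deepToString(toDeltas(perTerm, 2)));
    }
}
```

The reader has to know `longsSize` up front to split the stream back into per-term arrays, which is why the value ends up in the field summary.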
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737782#comment-13737782 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1513336 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1513336 ] LUCENE-3069: merge trunk changes over
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725288#comment-13725288 ] Han Jiang commented on LUCENE-3069: ---
bq. Maybe try testing on a different wildcard query, e.g. something like a*b* (that does not have a commonSuffix)?
I replaced every ab*c in the tasks file with ab*c*, but the performance hit is still heavy:
33M wikidata, Lucene41 vs. TempFSTOrd
{noformat}
Wildcard    7.40 (1.9%)    4.63 (1.2%)    -37.5% ( -39% - -34%)
{noformat}
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725258#comment-13725258 ] Michael McCandless commented on LUCENE-3069:
bq. Mike, since we already have an intersect() impl, maybe we can still keep this?
+1
It's odd that WildcardQuery is so angry; I wonder if it's because we can't use the commonSuffix opto. Maybe try testing on a different wildcard query, e.g. something like a*b* (that does not have a commonSuffix)?
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724957#comment-13724957 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1508744 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1508744 ] LUCENE-3069: introduce intersect() to TempFSTOrd
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724955#comment-13724955 ] Han Jiang commented on LUCENE-3069: ---
Performance results after the last patch (intersect) is applied.
On wiki 33M data, between TempFST (with intersect) and TempFSTOrd (with intersect):
{noformat}
Task        QPS base  StdDev    QPS comp  StdDev    Pct diff
PKLookup     232.47   (1.0%)     205.28   (2.0%)    -11.7% ( -14% -  -8%)
Prefix3       26.93   (1.2%)      28.40   (1.4%)      5.5% (   2% -   8%)
Wildcard       6.75   (2.1%)       7.37   (1.5%)      9.2% (   5% -  13%)
Fuzzy1        29.86   (1.8%)      51.87   (3.7%)     73.7% (  67% -  80%)
Fuzzy2        30.82   (1.6%)      53.82   (2.7%)     74.7% (  69% -  80%)
Respell       27.30   (1.2%)      49.55   (2.6%)     81.5% (  76% -  86%)
{noformat}
So decoding the outputs is really the main cost.
Now we should start comparing it with trunk (base=Lucene41, comp=TempFSTOrd). Hmm, I must have done something wrong on the wildcard query here.
{noformat}
Task        QPS base  StdDev    QPS comp  StdDev    Pct diff
Wildcard      19.21   (2.1%)       7.30   (0.3%)    -62.0% ( -63% - -60%)
Prefix3       33.69   (1.2%)      28.18   (0.9%)    -16.4% ( -18% - -14%)
Fuzzy1        61.59   (2.1%)      52.36   (0.8%)    -15.0% ( -17% - -12%)
Fuzzy2        60.94   (1.0%)      54.15   (1.3%)    -11.1% ( -13% -  -8%)
Respell       54.21   (2.8%)      49.54   (1.2%)     -8.6% ( -12% -  -4%)
PKLookup     148.40   (1.0%)     208.07   (3.6%)     40.2% (  35% -  45%)
{noformat}
I'll commit the current version so we can iterate on it.
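For readers checking the "Pct diff" column, the sketch below assumes it is the relative change of the comp QPS against the base QPS. That reading is an assumption; the benchmark tool's exact computation is not shown in this thread, although the PKLookup rows above are consistent with it. The class and method names are made up for the example.

```java
import java.util.Locale;

// Hypothetical helper; assumes Pct diff = 100 * (comp - base) / base.
class PctDiffSketch {
    static double pctDiff(double baseQps, double compQps) {
        return 100.0 * (compQps - baseQps) / baseQps;
    }

    public static void main(String[] args) {
        // PKLookup, TempFST vs. TempFSTOrd (first table).
        System.out.println(String.format(Locale.ROOT, "%.1f%%", pctDiff(232.47, 205.28))); // prints -11.7%
        // PKLookup, Lucene41 vs. TempFSTOrd (second table).
        System.out.println(String.format(Locale.ROOT, "%.1f%%", pctDiff(148.40, 208.07))); // prints 40.2%
    }
}
```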
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724736#comment-13724736 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1508705 from [~billy] in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1508705 ] LUCENE-3069: add TempFSTOrd, with FST index + specialized block
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724450#comment-13724450 ] David Smiley commented on LUCENE-3069: -- Nice work! The spatial prefix trees will have even more awesome performance with all terms in RAM. It'd be nice if I could configure the docFreq to be memory resident but, as Mike said, adding options like that can be explored later.
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724253#comment-13724253 ] Michael McCandless commented on LUCENE-3069:
Wow, those are nice perf results, without implementing intersect! Intersect really is an optional operation, so we could stop here/now and button everything up :)
I like this approach: you moved all the metadata (docFreq, totalTermFreq, long[] and byte[] from the PostingsFormatBase) into blocks, and then when we really need a term's metadata we go to its block and scan for it (like block tree).
I wonder if we could use MonotonicAppendingLongBuffer instead of long[] for the in-memory skip data? Right now it's, I think, 48 bytes per block (block = 128 terms), so I guess that's fairly small (.375 bytes per term).
{quote}
It is a little similar to BTTR now, and we can someday control how much data to keep memory resident (e.g. keep stats in memory but metadata on disk, however this should be another issue).
{quote}
That's a nice (future) plus; this way the app can keep "only" the terms+ords in RAM, and leave all term metadata on disk. But this is definitely optional for the project and we should separately explore it ...
{quote}
Another good part is, it naturally supports seek by ord. (ah, actually I don't understand where it is used).
{quote}
This is also a nice side-effect!
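The "go to its block and scan" lookup and the .375 bytes-per-term estimate can be sketched as below. The block size (128 terms) and the 48 bytes of skip data per block come from the comment above; the assumption that blocks are fixed-size and aligned on term ordinals, and all names, are illustrative only and may not match the TempFSTOrd layout.

```java
// Hypothetical sketch of ord-based block addressing.
class BlockOrdSketch {
    static final int BLOCK_SIZE = 128;          // terms per block, as stated in the comment
    static final int SKIP_BYTES_PER_BLOCK = 48; // in-memory skip data per block, as estimated in the comment

    // Index of the block holding the term with this ordinal (assumes ord-aligned blocks).
    static long blockOf(long ord) {
        return ord / BLOCK_SIZE;
    }

    // Number of terms to scan past inside that block before reaching the target.
    static int offsetInBlock(long ord) {
        return (int) (ord % BLOCK_SIZE);
    }

    // Amortized skip-data cost per term.
    static double skipBytesPerTerm() {
        return (double) SKIP_BYTES_PER_BLOCK / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long ord = 1000;
        System.out.println("block " + blockOf(ord) + ", offset " + offsetInBlock(ord)); // prints block 7, offset 104
        System.out.println(skipBytesPerTerm()); // prints 0.375
    }
}
```

With fixed-size blocks, seeking by ord needs no extra index: the block number and the in-block offset both fall out of integer division.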
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718526#comment-13718526 ]

ASF subversion and git services commented on LUCENE-3069:
---------------------------------------------------------

Commit 1506612 from [~billy] in branch 'dev/branches/lucene3069'
[ https://svn.apache.org/r1506612 ]

LUCENE-3069: accumulate metadata lazily
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718114#comment-13718114 ]

ASF subversion and git services commented on LUCENE-3069:
---------------------------------------------------------

Commit 1506439 from [~billy] in branch 'dev/branches/lucene3069'
[ https://svn.apache.org/r1506439 ]

LUCENE-3069: stack reuses objects during DFS
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717926#comment-13717926 ]

ASF subversion and git services commented on LUCENE-3069:
---------------------------------------------------------

Commit 1506389 from [~billy] in branch 'dev/branches/lucene3069'
[ https://svn.apache.org/r1506389 ]

LUCENE-3069: no need to reseek FSTReader, update nocommits
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717912#comment-13717912 ]

ASF subversion and git services commented on LUCENE-3069:
---------------------------------------------------------

Commit 1506385 from [~billy] in branch 'dev/branches/lucene3069'
[ https://svn.apache.org/r1506385 ]

LUCENE-3069: support intersect operations
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717911#comment-13717911 ]

Han Jiang commented on LUCENE-3069:
-----------------------------------

bq. You should not need to .getPosition / .setPosition on the fstReader:

Oh, yes! I'll fix.

bq. I think we can't really make use of it, which is fine (it's an optional optimization).

OK, actually I was quite curious why we don't make use of commonPrefixRef in CompiledAutomaton. Maybe we can determinize the input Automaton first, then get commonPrefixRef via SpecialOperation? Is it too slow, or is the prefix not always long enough to be worth considering?

bq. But this can only be done if that FST node's arcs are array'd right?

Yes, array arcs only, and we might need methods like advance(label) to do the search, and here gossip search might work better than traditional binary search.

{quote}
Separately, supporting ord w/ FST terms dict should in theory be not so hard; you'd need to use getByOutput to seek by ord. Maybe (later, eventually) we can make this a write-time option. We should open a separate issue ...
{quote}

Ah, yes, but it seems that getByOutput doesn't rewind/reuse the previous state? We always have to start from the first arc during every seek. However, I'm not sure in what kinds of use cases we need the ord information.

I'll commit the current version first, so we can iterate.
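As an illustration of the advance(label) idea, here is a small Python sketch over a sorted array of arc labels. It assumes that "gossip search" refers to galloping (exponential) search, which is a guess; the function name `advance` and the plain list of integers standing in for an FST node's array'd arcs are hypothetical and are not the Lucene FST API.

```python
from bisect import bisect_left


def advance(labels, start, target):
    """Return the first index >= start whose label is >= target.

    labels must be sorted ascending. Returns len(labels) if no such arc exists.
    """
    n = len(labels)
    if start >= n:
        return n
    if labels[start] >= target:
        return start

    # Gallop: probe start+1, start+2, start+4, ... until we reach a label
    # that is >= target or run off the end of the array.
    step = 1
    lo = start           # invariant: labels[lo] < target
    hi = start + step
    while hi < n and labels[hi] < target:
        lo = hi
        step *= 2
        hi = start + step

    # The answer now lies in (lo, min(hi, n)]; finish with a binary search.
    hi = min(hi, n)
    return bisect_left(labels, target, lo + 1, hi)


labels = [1, 3, 5, 7, 9, 11, 13, 40, 41, 90]
print(advance(labels, 0, 7), advance(labels, 3, 41), advance(labels, 2, 100))
```

Galloping costs O(log d) comparisons, where d is the distance from the current arc to the target, so it tends to beat a full binary search when successive targets are close together, as they often are when walking an automaton and an FST in lockstep.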
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716577#comment-13716577 ]

Michael McCandless commented on LUCENE-3069:
--------------------------------------------

Patch looks great! Wonderful how you were able to share some code in BaseTermsEnum...

It looks like you impl'd seekCeil in general for the IntersectEnum? Wild :)

You should not need to .getPosition / .setPosition on the fstReader: the FST APIs do this under-the-hood.

bq. currently, CompiledAutomaton provides a commonSuffixRef, but how can we make use of it in FST?

I think we can't really make use of it, which is fine (it's an optional optimization).

{quote}
when FST is large enough, the next() operation will take much time doing the linear arc read, maybe we should make use of CompiledAutomaton.sortedTransition[] when leaving arcs are heavy.
{quote}

Interesting ... you mean e.g. if the Automaton is very restrictive compared to the FST, then we can do a binary search. But this can only be done if that FST node's arcs are array'd right?

Separately, supporting ord w/ FST terms dict should in theory be not so hard; you'd need to use getByOutput to seek by ord. Maybe (later, eventually) we can make this a write-time option. We should open a separate issue ...
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709928#comment-13709928 ]

ASF subversion and git services commented on LUCENE-3069:
---------------------------------------------------------

Commit 1503797 from [~billy] in branch 'dev/branches/lucene3069'
[ https://svn.apache.org/r1503797 ]

LUCENE-3069: merge trunk changes over
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709886#comment-13709886 ]

ASF subversion and git services commented on LUCENE-3069:
---------------------------------------------------------

Commit 1503781 from [~billy] in branch 'dev/branches/lucene3069'
[ https://svn.apache.org/r1503781 ]

LUCENE-3069: remove some nocommits, update hashCode() & equal()
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709867#comment-13709867 ]

Michael McCandless commented on LUCENE-3069:
--------------------------------------------

bq. However, seekExact(BytesRef, TermState) simply 'copy' the value of termState to enum, which doesn't actually operate 'seek' on dictionary.

This is normal / by design. It's so that the case of seekExact(TermState) followed by .docs or .docsAndPositions is fast. We only need to re-load the metadata if the caller then tries to do .next()

{quote}
bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but not sure: like, can the callee always make sure that, when 'term()' is called, it will always return a valid term? The code in MemoryPF just returns 'pair.output' regardless of whether pair==null, is it safe?
{quote}

We can't guarantee that, but I think we can just check if pair == null and return null from term()?

{quote}
By the way, for real data, when two outputs are not 'NO_OUTPUT', even if they contain the same metadata + stats, it seems to be very seldom that their arcs can be identical on FST (increases less than 1MB for wikimedium1m if equals always return false for non-singleton argument). Therefore... yes, hashCode() isn't necessary here.
{quote}

Hmm, but it seems like we should implement it? Ie we do get a smaller FST when implementing it?
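The lazy-seek behaviour described above can be sketched with a toy enum in Python. Everything here is hypothetical (the class `ToyTermsEnum`, its sorted-list dictionary, and the method names are stand-ins, not Lucene's TermsEnum): the point is only that seeking by state copies the state and defers the real dictionary seek until next() is called, and that term() returns nothing when no pair is held.

```python
from bisect import bisect_left


class ToyTermsEnum:
    """Toy terms enum over a dict of term -> metadata (illustration only)."""

    def __init__(self, meta_by_term):
        self.meta_by_term = meta_by_term
        self.sorted_terms = sorted(meta_by_term)
        self.pair = None           # (term, metadata), or None when unpositioned
        self.index = -1            # position in sorted_terms, valid once seeked
        self.seek_pending = False  # True after a by-state seek not yet resolved

    def term(self):
        # Guard against an unpositioned enum instead of dereferencing pair.
        return None if self.pair is None else self.pair[0]

    def seek_exact_state(self, term, state):
        # Cheap path: just copy the caller's state, no dictionary lookup.
        self.pair = (term, state)
        self.seek_pending = True

    def next(self):
        if self.seek_pending:
            # Only now do the real seek, so iteration can continue from here.
            self.seek_pending = False
            i = bisect_left(self.sorted_terms, self.pair[0])
            # A by-state seek must target an existing term (i.e. FOUND).
            assert i < len(self.sorted_terms) and self.sorted_terms[i] == self.pair[0]
            self.index = i
        self.index += 1
        if self.index >= len(self.sorted_terms):
            self.pair = None
            return None
        term = self.sorted_terms[self.index]
        self.pair = (term, self.meta_by_term[term])
        return term


e = ToyTermsEnum({"a": 1, "b": 2, "c": 3})
e.seek_exact_state("b", 2)
print(e.term(), e.next(), e.next(), e.term())
```

A caller that only wants postings for the seeked term never pays for the dictionary lookup; the cost appears only if it goes on to iterate.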
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708486#comment-13708486 ]

Han Jiang commented on LUCENE-3069:
-----------------------------------

bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to change the enum pointer based on the input term string, and fetch related metadata from the term dict. However, seekExact(BytesRef, TermState) simply 'copies' the value of termState to the enum, which doesn't actually perform a 'seek' on the dictionary.

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but not sure: like, can the callee always make sure that, when 'term()' is called, it will always return a valid term? The code in MemoryPF just returns 'pair.output' regardless of whether pair==null, is it safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! Thanks, nice catch!
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708465#comment-13708465 ]

Michael McCandless commented on LUCENE-3069:
--------------------------------------------

The new code on the branch looks great! I can't wait to see perf results after we implement .intersect()..

Some small stuff in TempFSTTermsReader.java:

* In next(), when we handle seekPending=true, I think we should assert that the seekCeil returned SeekStatus.FOUND? Ie, it's not possible to seekExact(TermState) to a term that doesn't exist.
* useCache is an ancient option from back when we had a terms dict cache; we long ago removed it ... I think we should remove useCache parameter too?
* It's silly that fstEnum.seekCeil doesn't return a status, ie that we must re-compare the term we got to differentiate FOUND vs NOT_FOUND ... so we lose some perf here. But this is just a future TODO ...
* "nocommit: this method doesn't act as 'seekExact' right?" -- not sure why this is here; seekExact is working as it should I think.
* Maybe instead of term and meta members, we could just hold the current pair?

In TempTermOutputs.java:

* longsSize, hasPos can be final? (Same with TempMetaData's fields)
* TempMetaData.hashCode() doesn't mix in docFreq/tTF?
* It doesn't impl equals (must it really impl hashCode?)
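On the hashCode()/equals question, the sketch below shows in Python why both methods need to cover every field, including docFreq and totalTermFreq, if equal outputs are to be recognized as duplicates. The class `TermMetaData` and its field names are hypothetical stand-ins for TempMetaData, and the set used at the end is only a rough analogy for hash-based deduplication during FST construction.

```python
class TermMetaData:
    """Hypothetical per-term output: postings longs/bytes plus term stats."""

    def __init__(self, longs, payload, doc_freq, total_term_freq):
        self.longs = tuple(longs)
        self.payload = bytes(payload)
        self.doc_freq = doc_freq
        self.total_term_freq = total_term_freq

    def _key(self):
        # Every field takes part, so two outputs that differ only in their
        # stats are never treated as the same value.
        return (self.longs, self.payload, self.doc_freq, self.total_term_freq)

    def __eq__(self, other):
        if not isinstance(other, TermMetaData):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Must agree with __eq__: equal objects hash equally.
        return hash(self._key())


a = TermMetaData([10, 20], b"", 1, 1)
b = TermMetaData([10, 20], b"", 1, 1)
c = TermMetaData([10, 20], b"", 1, 3)   # same longs, different totalTermFreq

# a and b collapse into one entry; c stays separate.
print(len({a, b, c}))
```

If the hash ignored the stats, `a` and `c` would land in the same bucket and rely on equals alone to tell them apart, which is correct but slower; if equals were missing, `a` and `b` would never be merged.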
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708428#comment-13708428 ]

Michael McCandless commented on LUCENE-3069:
--------------------------------------------

{quote}
Another thing that surprised me is, with the same code/conf, luceneutil creates different sizes of index? I tested that df==0 trick several times on wikimedium1m, the index size varies from 514M~522M... Will multi-threading affect much here?
{quote}

Using threads means the docs are assigned to different segments each time you run ... it's interesting this can cause such variance in the index size though. It is known that e.g. sorting docs by web site (if you are indexing content from different sites) can give good compression; maybe that's the effect we're seeing here?
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708065#comment-13708065 ]

ASF subversion and git services commented on LUCENE-3069:
---------------------------------------------------------

Commit 1502991 from [~billy] in branch 'dev/branches/lucene3069'
[ https://svn.apache.org/r1502991 ]

LUCENE-3069: remove redundant info for fields without payload
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707709#comment-13707709 ]

Han Jiang commented on LUCENE-3069:
-----------------------------------

I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for "body" field:

||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|25| 0 | 0| 0|
|26| 0 | 0| 0|
|27| 0 | 0| 0|
|28| 0 | 0| 0|
|29| 0 | 0| 0|
|30| 0 | 0| 0|
|31| 0 | 0| 0|
|32| 0 | 0| 0|

So we have 66.4% docFreq with df==1, and 78.5% with df==ttf. Considering the different bit sizes, for df+ttf encoding, in total it saves 57.3MB from 148.7MB, using the following estimation:

{noformat}
old_size = col[2] * vIntByteSize(rownumber) + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * vIntByteSize(rownumber)
{noformat}

By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader. When we know that df==ttf, we can always be sure that the in-doc frq==1. So for example, when the bit width ranges from 2 to 8 (inclusive), since df is not large enough to create ForBlocks, we have to VInt encode each in-doc freq. For this 'body' field, I think we can reduce the index size by about 67.5MB (here I only consider the vInt block, since a 1-bit ForBlock is usually small).

For all the fields in wikimediumall, we can save 60.8MB from 245.2MB (for df+ttf only). While the vInt frq block we can omit from PBF is about 95.8MB, I suppose.

I'll test this later.
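The estimate can be re-run with a short Python script. This is a sketch under one assumption that the comment does not spell out: vIntByteSize(bits) is taken to be ceil(bits / 7), since a vInt carries 7 payload bits per byte. The row data is copied from the table (rows 25 to 32 are all zero and are omitted), and the names `ROWS`, `vint_byte_size`, `old_size` and `new_size` are chosen here for illustration.

```python
# (bit width, #(df==ttf), #df, #ttf) for the "body" field, from the table.
ROWS = [
    (1, 43532656, 48860170, 43532656),
    (2, 10328824, 13979539, 16200377),
    (3, 2682453, 5032450, 6532755),
    (4, 836109, 2471794, 3134437),
    (5, 262696, 1324704, 1718862),
    (6, 86487, 755797, 990563),
    (7, 29276, 442974, 571996),
    (8, 11257, 263874, 339382),
    (9, 4627, 161402, 205662),
    (10, 2060, 102198, 128034),
    (11, 979, 63955, 79531),
    (12, 386, 39377, 48805),
    (13, 170, 24321, 30113),
    (14, 65, 14686, 18437),
    (15, 10, 9055, 10918),
    (16, 2, 5229, 6821),
    (17, 0, 2669, 3595),
    (18, 0, 1312, 1897),
    (19, 0, 696, 914),
    (20, 0, 209, 509),
    (21, 0, 44, 148),
    (22, 0, 4, 38),
    (23, 0, 0, 8),
    (24, 0, 0, 1),
]


def vint_byte_size(bits):
    """Bytes needed by a vInt holding a value of the given bit width (assumed ceil(bits/7))."""
    return max(1, -(-bits // 7))


# Old layout: one vInt for df and one vInt for ttf, for every term.
old_size = sum(df * vint_byte_size(b) + ttf * vint_byte_size(b)
               for b, eq, df, ttf in ROWS)

# New layout: df grows by one flag bit, and ttf is omitted when df == ttf.
new_size = sum(df * vint_byte_size(b + 1) + (ttf - eq) * vint_byte_size(b)
               for b, eq, df, ttf in ROWS)

total_terms = sum(df for _, _, df, _ in ROWS)
print(f"old  = {old_size / 1e6:.1f} MB")
print(f"new  = {new_size / 1e6:.1f} MB")
print(f"save = {(old_size - new_size) / 1e6:.1f} MB")
print(f"df==1:   {ROWS[0][2] / total_terms:.1%}")
print(f"df==ttf: {sum(eq for _, eq, _, _ in ROWS) / total_terms:.1%}")
```

With MB taken as 10^6 bytes, this should land on the figures quoted in the comment (about 148.7MB before, about 57.3MB saved, 66.4% and 78.5%), which suggests the ceil(bits / 7) reading of vIntByteSize is the intended one.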
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707649#comment-13707649 ] Han Jiang commented on LUCENE-3069: ---

bq. Cool idea! I wonder how many of those are df == ttf == 1?

I didn't try a very precise estimation, but the percentage will be large. For the index of wikimedium1m, the largest segment has a 'body' field with:

{noformat}
bitwidth/7   df==ttf   df
1            1324400 / 1542987
2                110 / 18951
3                  0 / 175
4                  0 / 0
5                  0 / 0
{noformat}

That is where the 85.8% comes from. 'bitwidth/7' means 'ceil(bitwidth of df / 7)', since we're using VInt encoding. So, for this field, we can save (1324400+110*2) bytes by stealing one bit.

bq. Maybe we could try writing a vInt of 0 for docFreq to indicate that both docFreq and totalTermFreq are 1?

Yes, that may help! I'll try to test the percentage. But we should still note that df is a small part of the term dict data.
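The bit-stealing idea in the comment above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the LUCENE-3069 patch: the class `BitStealingStats` and its methods are hypothetical names, totalTermFreq is narrowed to an int for brevity, and docFreq is assumed to be below 2^30 so that the shift does not overflow. The low bit of the first VInt records whether totalTermFreq equals docFreq; when it does, the second VInt is omitted.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration only; not Lucene's term dictionary code.
class BitStealingStats {

    // Write a non-negative int as a VInt: 7 payload bits per byte,
    // high bit set on every byte except the last one.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Read one VInt starting at pos[0] and advance pos[0] past it.
    static int readVInt(byte[] bytes, int[] pos) {
        int value = 0;
        int shift = 0;
        int b;
        do {
            b = bytes[pos[0]++] & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    // Plain layout: docFreq, then (totalTermFreq - docFreq), each as a VInt.
    static byte[] encodePlain(int docFreq, int totalTermFreq) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, docFreq);
        writeVInt(out, totalTermFreq - docFreq);
        return out.toByteArray();
    }

    // Bit-stealing layout: the low bit of the first VInt is 1 when
    // totalTermFreq == docFreq, in which case no second VInt is written.
    static byte[] encodeStolenBit(int docFreq, int totalTermFreq) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (docFreq == totalTermFreq) {
            writeVInt(out, (docFreq << 1) | 1);
        } else {
            writeVInt(out, docFreq << 1);
            writeVInt(out, totalTermFreq - docFreq);
        }
        return out.toByteArray();
    }

    // Inverse of encodeStolenBit; returns {docFreq, totalTermFreq}.
    static int[] decodeStolenBit(byte[] bytes) {
        int[] pos = {0};
        int code = readVInt(bytes, pos);
        int docFreq = code >>> 1;
        int totalTermFreq = (code & 1) != 0 ? docFreq : docFreq + readVInt(bytes, pos);
        return new int[] {docFreq, totalTermFreq};
    }
}
```

For a term with df == ttf == 1, `encodePlain` produces two bytes and `encodeStolenBit` produces one. The trade-off in this sketch is that the stolen bit leaves only six payload bits in the first byte, so a docFreq of 64 or more needs a second VInt byte earlier than it would in the plain layout.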
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707634#comment-13707634 ] Robert Muir commented on LUCENE-3069: -

{quote}
Also, for long-tail terms, the totalTermFreq can safely be inlined into docFreq (for body field in wikimedium1m, 85.8% terms have df == ttf).
{quote}

Cool idea! I wonder how many of those are df == ttf == 1? We would currently waste a byte in this case (because we write a vInt for docFreq of 1, and then a vInt of totalTermFreq - docFreq of 0). Maybe we could try writing a vInt of 0 for docFreq to indicate that both docFreq and totalTermFreq are 1?
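The suggestion of writing a vInt of 0 for docFreq can be sketched like this. It is only an illustration of the idea, not code from Lucene: `ZeroDocFreqSentinel` and its methods are hypothetical names. The sketch relies on the assumption that a term present in the dictionary always has a docFreq of at least 1, which leaves the value 0 free to act as a marker for df == ttf == 1.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of the "docFreq written as 0 means df == ttf == 1" idea.
class ZeroDocFreqSentinel {

    // Write a non-negative long in variable-length form, 7 payload bits per byte.
    static void writeVLong(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Read one variable-length long starting at pos[0] and advance pos[0].
    static long readVLong(byte[] bytes, int[] pos) {
        long value = 0;
        int shift = 0;
        int b;
        do {
            b = bytes[pos[0]++] & 0xFF;
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    static byte[] encode(int docFreq, long totalTermFreq) {
        if (docFreq < 1 || totalTermFreq < docFreq) {
            throw new IllegalArgumentException("expected 1 <= docFreq <= totalTermFreq");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (docFreq == 1 && totalTermFreq == 1) {
            // Sentinel: a single zero byte stands for df == ttf == 1.
            writeVLong(out, 0);
        } else {
            writeVLong(out, docFreq);
            writeVLong(out, totalTermFreq - docFreq);
        }
        return out.toByteArray();
    }

    // Returns {docFreq, totalTermFreq}.
    static long[] decode(byte[] bytes) {
        int[] pos = {0};
        long docFreq = readVLong(bytes, pos);
        if (docFreq == 0) {
            return new long[] {1, 1};
        }
        long delta = readVLong(bytes, pos);
        return new long[] {docFreq, docFreq + delta};
    }
}
```

In this sketch only the df == ttf == 1 case shrinks from two bytes to one; every other term keeps the two-value layout, so no bit is taken away from docFreq.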
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13705604#comment-13705604 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1502152 from [~billy] [ https://svn.apache.org/r1502152 ] LUCENE-3069: steal bit to encode TTF
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704677#comment-13704677 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1501811 from [~billy] [ https://svn.apache.org/r1501811 ] LUCENE-3069: use more compact outputs i/o
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13702086#comment-13702086 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1500814 from [~billy] [ https://svn.apache.org/r1500814 ] LUCENE-3069: reader part, update logic in outputs
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700059#comment-13700059 ] ASF subversion and git services commented on LUCENE-3069: - Commit 1499744 from [~billy] [ https://svn.apache.org/r1499744 ] LUCENE-3069: writer part
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13684676#comment-13684676 ] Commit Tag Bot commented on LUCENE-3069: [lucene3069 commit] han http://svn.apache.org/viewvc?view=revision&revision=1493517 LUCENE-3069: setField now expose per-field info to term dict
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13684674#comment-13684674 ] Commit Tag Bot commented on LUCENE-3069: [lucene3069 commit] mikemccand http://svn.apache.org/viewvc?view=revision&revision=1493516 LUCENE-3069: add nocommit/TODO
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13684616#comment-13684616 ] Commit Tag Bot commented on LUCENE-3069: [lucene3069 commit] mikemccand http://svn.apache.org/viewvc?view=revision&revision=1493493 LUCENE-3069: merge trunk changes over
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672167#comment-13672167 ] Han Jiang commented on LUCENE-3069: --- The detailed ideas/wild thoughts will be put here: https://gist.github.com/sleepsort/5642021
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642971#comment-13642971 ] Han Jiang commented on LUCENE-3069: --- This is my initial proposal for this project: https://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2013/billybob/32001 I'm looking forward to your feedback. :)
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624919#comment-13624919 ] Han Jiang commented on LUCENE-3069: --- This project is quite interesting! Since we already have an entirely memory resident PF, the targets of this project seem to be as below:
1. implement a simplified version of BlockTreeTerms*;
2. change the API of the current PostingsBaseFormat, so that a non-block-based term dict can be plugged into it (ideally, MemoryPF should work with this).
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617982#comment-13617982 ] David Smiley commented on LUCENE-3069: -- I'd love to see this come to pass. I've been thinking about what goes on a layer beneath TermsEnum (i.e. how it is implemented) as I work on spatial stuff. Geohash prefixes are a natural fit for FSTs; they should compress ridiculously well. There is an approach to building a heatmap (spatial grid faceting) that I'm thinking of that would do 2500 seek() calls for a 50x50 grid; I'd like those seeks to be as fast as possible. I have another approach in mind requiring a slightly different encoding, but it would do 2500 next() calls, which should be faster. Nonetheless, it's a lot; ideally the terms dict would be entirely memory resident.