[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419071#comment-13419071 ]

Robert Muir commented on LUCENE-4227:

Would it really be that much slower if it was slightly more reasonable, e.g. storing freqs in packed ints (with huper-duper fast options) instead of wasting so much on them?

DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM

Key: LUCENE-4227
URL: https://issues.apache.org/jira/browse/LUCENE-4227
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Attachments: LUCENE-4227.patch, LUCENE-4227.patch

This postings format just wraps Lucene40 (on disk) but then at search time it loads (up front) all terms and postings into RAM. You'd use this if you have insane amounts of RAM and want the fastest possible search performance.

The postings are not compressed: docIds, positions are stored as straight int[]s.

The terms are stored as a skip list (array of byte[]), but I packed all terms together into a single long byte[]: I had started as actual separate byte[] per term but the added pointer deref and loss of locality was a lot (~2X) slower for terms-dict intensive queries like FuzzyQuery.

Low frequency postings (docFreq <= 32 by default) store all docs, pos and offsets into a single int[]. High frequency postings store docs as int[], freqs as int[], and positions as int[][] parallel arrays. For skipping I just do a growing binary search.

I also made specialized DirectTermScorer and DirectExactPhraseScorer for the high freq case that just pull the int[] and iterate themselves.

All tests pass.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
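
The issue description above mentions skipping via "a growing binary search" over the uncompressed int[] of doc IDs but gives no details. The following is only a minimal sketch of one common reading of that phrase (exponential, or galloping, search followed by a binary search inside the bracketed window). The class and method names are hypothetical and are not taken from the patch.

```java
// Hypothetical sketch, not the code from LUCENE-4227.patch.
final class GrowingBinarySearchSketch {

    /**
     * Returns the index of the first entry in docs[from..] that is >= target,
     * or docs.length if there is none. docs must be sorted ascending.
     */
    static int advance(int[] docs, int from, int target) {
        int n = docs.length;
        if (from >= n) {
            return n;
        }
        if (docs[from] >= target) {
            return from;
        }
        // Invariant from here on: docs[lo] < target.
        int lo = from;
        int step = 1;
        int hi = (int) Math.min((long) lo + step, n);
        // Growing phase: double the step until we overshoot the target or the array.
        while (hi < n && docs[hi] < target) {
            lo = hi;
            step <<= 1;
            hi = (int) Math.min((long) lo + step, n);
        }
        // Now either hi == n or docs[hi] >= target; binary search inside (lo, hi].
        while (hi - lo > 1) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        return hi;
    }

    public static void main(String[] args) {
        int[] docs = {2, 5, 9, 14, 20, 33, 47};
        System.out.println(advance(docs, 0, 10));
    }
}
```

The appeal of this shape for postings is that short skips cost only a few comparisons near the current position, while long skips still cost a logarithmic number of steps.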
[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419119#comment-13419119 ]

Michael McCandless commented on LUCENE-4227:

{quote} Would it really be that much slower if it was slightly more reasonable, e.g. storing freqs in packed ints (with huper-duper fast options) instead of wasting so much on them? {quote}

Probably not that much slower? I think that's a good idea! But I think we can explore this after committing? There are other things we can try too (eg collapse skip list into shared int[]: I think this one may give a perf gain, collapse positions, etc.).
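
To make the packed-ints suggestion discussed above concrete, here is a rough sketch of fixed-width bit packing for a freqs array. It deliberately does not use Lucene's own packed-ints utilities; the class name and layout are assumptions made purely for illustration, and the real trade-off (decode cost versus RAM saved) would need to be measured.

```java
// Hypothetical sketch: fixed-width bit packing of term frequencies into a long[].
// This is NOT Lucene's PackedInts API, just an illustration of the idea.
final class PackedFreqsSketch {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedFreqsSketch(int[] freqs) {
        int max = 1;
        for (int f : freqs) {
            if (f < 0) {
                throw new IllegalArgumentException("freqs must be non-negative");
            }
            max = Math.max(max, f);
        }
        // Smallest width that can hold the largest value (1..31 bits).
        bitsPerValue = 32 - Integer.numberOfLeadingZeros(max);
        mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) freqs.length * bitsPerValue;
        blocks = new long[(int) ((totalBits + 63) >>> 6)];
        for (int i = 0; i < freqs.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            blocks[block] |= ((long) freqs[i]) << shift;
            int bitsInFirst = 64 - shift;
            if (bitsInFirst < bitsPerValue) {
                // Value straddles two longs: spill the high bits into the next block.
                blocks[block + 1] |= ((long) freqs[i]) >>> bitsInFirst;
            }
        }
    }

    int bitsPerValue() {
        return bitsPerValue;
    }

    int get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        int bitsInFirst = 64 - shift;
        if (bitsInFirst < bitsPerValue) {
            value |= blocks[block + 1] << bitsInFirst;
        }
        return (int) (value & mask);
    }
}
```

With a maximum freq of 100, for example, each value takes 7 bits instead of the 32 used by a plain int[], at the cost of a shift and mask (and occasionally a second array read) per lookup.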
[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419126#comment-13419126 ]

Robert Muir commented on LUCENE-4227:

Yeah, I don't think we need to solve it before committing. I do think maybe this class needs some more warnings; to me it seems it will use crazy amounts of RAM. I also am not sure I like the name Direct... is it crazy to suggest Instantiated?
[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419129#comment-13419129 ]

Michael McCandless commented on LUCENE-4227:

bq. I do think maybe this class needs some more warnings, to me it seems it will use crazy amounts of RAM.

I'll add some scary warnings :)

bq. I also am not sure I like the name Direct... is it crazy to suggest Instantiated?

It is very much like the old instantiated (though I think its terms dict is faster than instantiated's)... but I didn't really like the name Instantiated... I had picked Direct because it directly represents the postings ... but maybe we can find a better name. I will update MIGRATE.txt to explain how Direct (or whatever we name it) is the closest match if you were previously using Instantiated...
[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419131#comment-13419131 ]

Robert Muir commented on LUCENE-4227:

{quote} It is very much like the old instantiated (though I think its terms dict is faster than instantiated's)... but I didn't really like the name Instantiated... I had picked Direct because it directly represents the postings ... but maybe we can find a better name. {quote}

OK, I think what would be better is a better synonym for Uncompressed. I realized Direct is consistent with packedints or whatever... but I don't think it should be using this name either; it's not intuitive.
[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419173#comment-13419173 ]

Robert Muir commented on LUCENE-4227:

I don't have a better name either. Let's just commit it with this one and think about it later!

Attachments: LUCENE-4227.patch, LUCENE-4227.patch, LUCENE-4227.patch
[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415613#comment-13415613 ]

Michael McCandless commented on LUCENE-4227:

I ran perf tests on a 2M Wikipedia index (requires 8 GB heap: need more RAM to go higher!). Results without the specialized scorers (baseline is trunk w/ MMapDir):

{noformat}
Task            QPS base  StdDev base  QPS direct  StdDev direct  Pct diff
PKLookup          259.28        11.94      227.96           5.85  -18% - -5%
Fuzzy1            160.21         5.11      183.91           1.48  10% - 19%
TermGroup1M        18.33         0.21       21.60           0.11  15% - 19%
SpanNear            5.79         0.16        6.86           0.31  10% - 27%
TermBGroup1M       18.46         0.24       22.16           0.11  17% - 22%
TermBGroup1M1P     22.47         0.65       28.04           0.67  18% - 31%
SloppyPhrase        3.51         0.13        4.60           0.05  24% - 37%
IntNRQ             53.75         4.68       71.22           4.21  14% - 53%
OrHighHigh         18.85         0.42       26.89           2.16  28% - 57%
OrHighMed          37.93         0.91       54.57           5.71  25% - 62%
Respell           167.73         5.37      242.93           1.78  39% - 50%
Wildcard           46.64         1.74       69.98           3.43  37% - 63%
Prefix3           109.51         3.45      165.77           6.42  41% - 62%
Fuzzy2             56.48         2.37       88.25           0.91  48% - 64%
AndHighHigh        24.59         0.74       41.82           0.72  62% - 78%
Phrase             12.57         0.20       21.89           0.71  65% - 82%
Term               39.05         1.74       69.00           3.68  60% - 94%
AndHighMed        126.87         2.48      261.73           4.19  99% - 113%
{noformat}

Nice speedups!
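
The comment does not say how the "Pct diff" range is derived. The figures above do appear consistent with taking the worst and best case at one standard deviation on each side and truncating to whole percent; the sketch below encodes that inferred formula. Both the formula and the names are assumptions, not a description of the actual benchmark tool.

```java
// Hypothetical sketch: an inferred reconstruction of the "Pct diff" column.
final class PctDiffSketch {

    /** Returns {worstPct, bestPct}, truncated toward zero. */
    static int[] pctDiffRange(double qpsBase, double sdBase, double qpsDirect, double sdDirect) {
        // Worst case: candidate one stddev slower, baseline one stddev faster.
        double worst = (qpsDirect - sdDirect) / (qpsBase + sdBase) - 1.0;
        // Best case: candidate one stddev faster, baseline one stddev slower.
        double best = (qpsDirect + sdDirect) / (qpsBase - sdBase) - 1.0;
        return new int[] {(int) (100.0 * worst), (int) (100.0 * best)};
    }

    public static void main(String[] args) {
        int[] r = pctDiffRange(259.28, 11.94, 227.96, 5.85); // PKLookup row
        System.out.println(r[0] + "% - " + r[1] + "%");
    }
}
```

Read this way, a range that straddles 0% means the difference is within noise for that task.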
Same run, but using trunk w/ RAMDirectory as the baseline:

{noformat}
Task            QPS base  StdDev base  QPS direct  StdDev direct  Pct diff
PKLookup          248.50         4.73      222.03           4.43  -14% - -7%
Fuzzy1            159.41         3.65      185.32           3.15  11% - 21%
SpanNear            5.74         0.08        6.75           0.17  13% - 22%
TermGroup1M        17.78         0.42       21.03           0.68  11% - 25%
TermBGroup1M       19.32         0.58       23.08           1.02  10% - 28%
IntNRQ             46.82         0.49       56.12           1.28  15% - 23%
TermBGroup1M1P     23.27         0.46       30.14           0.91  23% - 36%
Respell           163.36         3.42      221.10           2.48  31% - 39%
OrHighMed          30.62         1.94       42.94           5.70  14% - 69%
OrHighHigh         17.98         0.99       25.69           3.35  17% - 70%
Prefix3           114.41         0.67      164.19           2.22  40% - 46%
Wildcard           47.58         0.36       70.47           1.20  44% - 51%
Fuzzy2             53.92         1.37       83.54           2.66  46% - 64%
SloppyPhrase        5.07         0.23        8.12           0.74  39% - 82%
AndHighHigh        24.73         0.75       40.51           0.42  57% - 70%
Phrase             14.02         0.07       23.42           0.30  64% - 69%
Term               39.96         2.13       67.39           4.09  50% - 88%
AndHighMed        132.66         3.24      274.07           1.64  100% - 113%
{noformat}

Still good speedups over the obvious "hold the index in RAM" option.

Then, just testing the specialized scorers (baseline = DirectPF without specialized scorers):

{noformat}
Task            QPS base  StdDev base  QPS direct  StdDev direct  Pct diff
IntNRQ             74.86         3.42       71.72           0.27  -8% - 0%
Wildcard           62.88         2.34       60.52           0.49  -7% - 0%
Prefix3           102.46         3.98       98.92           0.85  -7% - 1%
AndHighHigh        51.41         1.96       50.26           1.10  -7% - 3%
AndHighMed        238.18         5.17      234.14           2.83  -4% - 1%
Fuzzy1            179.64         1.73      177.96           3.27  -3% - 1%
SloppyPhrase        8.97         0.37        8.93           0.48  -9% - 9%
Respell           223.76         1.16      222.79           2.68  -2% - 1%
Fuzzy2             79.62         1.38       79.31           0.90  -3% - 2%
SpanNear            6.83         0.25        6.89           0.31  -7% - 9%
PKLookup          220.25         1.46      225.17           2.56  0% - 4%
OrHighMed          50.70         4.27       53.20           3.95