[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM

2012-07-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419071#comment-13419071
 ] 

Robert Muir commented on LUCENE-4227:
-

Would it really be that much slower if it was slightly more reasonable, e.g.
storing freqs in packed ints (with huper-duper fast options) instead of
wasting so much on them?
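
To make the suggestion concrete, here is a minimal sketch of fixed-width bit packing for term frequencies. It is a standalone illustration only: the class `PackedFreqs` and its methods are hypothetical and are not Lucene's packed ints API, and the width is simply the number of bits needed by the largest value.

```java
/** Hypothetical fixed-width bit-packed store for non-negative ints (not Lucene's API). */
final class PackedFreqs {
    private final long[] blocks;     // values stored back to back, bitsPerValue bits each
    private final int bitsPerValue;  // width needed by the largest value (1..31)
    private final long mask;

    PackedFreqs(int[] freqs) {
        int max = 1;
        for (int f : freqs) {
            if (f < 0) throw new IllegalArgumentException("negative freq: " + f);
            max = Math.max(max, f);
        }
        bitsPerValue = 32 - Integer.numberOfLeadingZeros(max);
        mask = (1L << bitsPerValue) - 1;
        blocks = new long[(int) (((long) freqs.length * bitsPerValue + 63) / 64)];
        for (int i = 0; i < freqs.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            blocks[block] |= ((long) freqs[i]) << shift;
            if (shift + bitsPerValue > 64) {
                // The value straddles two longs: spill the high bits into the next one.
                blocks[block + 1] |= ((long) freqs[i]) >>> (64 - shift);
            }
        }
    }

    int get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) {
            value |= blocks[block + 1] << (64 - shift);
        }
        return (int) (value & mask);
    }

    /** Bytes used by the packed data itself, ignoring object headers. */
    long ramBytesUsed() {
        return 8L * blocks.length;
    }
}
```

With seven freqs whose maximum is 1000, each value needs 10 bits, so the packed data takes 16 bytes where a plain int[] would take 28. The cost is the extra shift and mask on every read, which is the slowdown the question is asking about.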


 DirectPostingsFormat, storing postings as simple int[] in memory, if you have 
 tons of RAM
 -

 Key: LUCENE-4227
 URL: https://issues.apache.org/jira/browse/LUCENE-4227
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Attachments: LUCENE-4227.patch, LUCENE-4227.patch


 This postings format just wraps Lucene40 (on disk) but then at search
 time it loads (up front) all terms and postings into RAM.
 You'd use this if you have insane amounts of RAM and want the fastest
 possible search performance.  The postings are not compressed: docIds,
 positions are stored as straight int[]s.
 The terms are stored as a skip list (array of byte[]), but I packed
 all terms together into a single long byte[]: I had started as actual
 separate byte[] per term but the added pointer deref and loss of
 locality was a lot (~2X) slower for terms-dict intensive queries like
 FuzzyQuery.
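
As a rough illustration of why one packed byte[] helps locality, the sketch below stores all term bytes contiguously with an int[] of start offsets. It swaps in a plain binary search for the lookup rather than the skip list mentioned above, and every name in it (`PackedTerms`, `lookup`) is hypothetical rather than taken from the patch.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical packed terms dictionary: all term bytes in one byte[], plus start offsets. */
final class PackedTerms {
    private final byte[] termBytes;   // concatenated UTF-8 bytes of all terms, in sorted order
    private final int[] termOffsets;  // term i occupies termOffsets[i] .. termOffsets[i + 1]

    /** The caller must pass terms already sorted by their UTF-8 bytes (unsigned order). */
    PackedTerms(String[] sortedTerms) {
        termOffsets = new int[sortedTerms.length + 1];
        byte[][] encoded = new byte[sortedTerms.length][];
        int total = 0;
        for (int i = 0; i < sortedTerms.length; i++) {
            encoded[i] = sortedTerms[i].getBytes(StandardCharsets.UTF_8);
            termOffsets[i] = total;
            total += encoded[i].length;
        }
        termOffsets[sortedTerms.length] = total;
        termBytes = new byte[total];
        for (int i = 0; i < encoded.length; i++) {
            System.arraycopy(encoded[i], 0, termBytes, termOffsets[i], encoded[i].length);
        }
    }

    /** Returns the ordinal of the term, or -1 if it is not present. */
    int lookup(String term) {
        byte[] target = term.getBytes(StandardCharsets.UTF_8);
        int lo = 0;
        int hi = termOffsets.length - 2;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compare(mid, target);
            if (cmp < 0) {
                lo = mid + 1;
            } else if (cmp > 0) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    /** Compares term number ord against target, byte by byte, as unsigned values. */
    private int compare(int ord, byte[] target) {
        int start = termOffsets[ord];
        int len = termOffsets[ord + 1] - start;
        int n = Math.min(len, target.length);
        for (int i = 0; i < n; i++) {
            int diff = (termBytes[start + i] & 0xFF) - (target[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return len - target.length;
    }
}
```

Every comparison reads from the same two arrays, so there is no per-term object to dereference.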
 Low frequency postings (docFreq <= 32 by default) store all docs, pos
 and offsets into a single int[].  High frequency postings store docs
 as int[], freqs as int[], and positions as int[][] parallel arrays.
 For skipping I just do a growing binary search.
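
One plausible reading of "growing binary search" is an exponential (galloping) search: probe at doubling distances from the current position until the target is bracketed, then binary search inside that bracket. The sketch below assumes that reading; `GallopingAdvance.advance` is a hypothetical helper, not code from the patch.

```java
/** Hypothetical galloping search over a sorted int[] of docIDs. */
final class GallopingAdvance {
    /**
     * Returns the smallest index i >= from with docs[i] >= target,
     * or docs.length if there is no such index.
     */
    static int advance(int[] docs, int from, int target) {
        if (from >= docs.length) {
            return docs.length;
        }
        if (docs[from] >= target) {
            return from;
        }
        // Invariant from here on: docs[lo] < target.
        int lo = from;
        int step = 1;
        // Growing phase: double the step while the probe is still below the target.
        while (step < docs.length - lo && docs[lo + step] < target) {
            lo += step;
            step <<= 1;
        }
        // Either docs[hi] >= target, or hi == docs.length (ran off the end).
        int hi = step < docs.length - lo ? lo + step : docs.length;
        // Binary search phase inside (lo, hi].
        while (hi - lo > 1) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        return hi;
    }
}
```

Short skips cost only a few probes near the current position, while long skips still finish in logarithmic time.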
 I also made specialized DirectTermScorer and DirectExactPhraseScorer
 for the high freq case that just pull the int[] and iterate
 themselves.
 All tests pass.
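
The two in-memory layouts described above can be sketched as follows. The exact field order inside the low-frequency array is an assumption (docID, then freq, then that doc's positions), offsets are left out for brevity, and all names are hypothetical.

```java
/** Hypothetical illustration of the low-frequency and high-frequency layouts. */
final class PostingsLayouts {
    /** Low-frequency layout (assumed order): docID, freq, then freq positions, per doc. */
    static int[] packLowFreq(int[] docs, int[][] positions) {
        int size = 0;
        for (int[] p : positions) {
            size += 2 + p.length;
        }
        int[] packed = new int[size];
        int upto = 0;
        for (int i = 0; i < docs.length; i++) {
            packed[upto++] = docs[i];
            packed[upto++] = positions[i].length; // freq
            for (int pos : positions[i]) {
                packed[upto++] = pos;
            }
        }
        return packed;
    }

    /** Sums term frequencies by walking the single packed array. */
    static long totalFreqLowFreq(int[] packed) {
        long total = 0;
        int upto = 0;
        while (upto < packed.length) {
            int freq = packed[upto + 1];
            total += freq;
            upto += 2 + freq; // skip docID, freq and this doc's positions
        }
        return total;
    }

    /** High-frequency layout: parallel arrays that a scorer can loop over directly. */
    static long totalFreqHighFreq(int[] docs, int[] freqs) {
        long total = 0;
        for (int i = 0; i < docs.length; i++) {
            total += freqs[i];
        }
        return total;
    }
}
```

The single array keeps everything for a rare term in one place, while the parallel arrays let a specialized scorer touch only docs and freqs without stepping over positions.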

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM

2012-07-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419119#comment-13419119
 ] 

Michael McCandless commented on LUCENE-4227:


{quote}
Would it really be that much slower if it was slightly more reasonable, e.g.
storing freqs in packed ints (with huper-duper fast options) instead of
wasting so much on them?
{quote}

Probably not that much slower?  I think that's a good idea!

But I think we can explore this after committing?  There are other things we 
can try too (eg collapse skip list into shared int[]: I think this one may give 
a perf gain, collapse positions, etc.).





[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM

2012-07-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419126#comment-13419126
 ] 

Robert Muir commented on LUCENE-4227:
-

Yeah, I don't think we need to solve it before committing.

I do think maybe this class needs some more warnings; to me it seems it will
use crazy amounts of RAM.
I also am not sure I like the name Direct... is it crazy to suggest
Instantiated?




[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM

2012-07-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419129#comment-13419129
 ] 

Michael McCandless commented on LUCENE-4227:


bq. I do think maybe this class needs some more warnings, to me it seems it 
will use crazy amounts of RAM.

I'll add some scary warnings :)

bq. I also am not sure I like the name Direct... is it crazy to suggest 
Instantiated?

It is very much like the old Instantiated (though I think its terms dict is
faster than Instantiated's)... but I didn't really like the name
Instantiated... I had picked Direct because it directly represents the
postings... but maybe we can find a better name.

I will update MIGRATE.txt to explain how Direct (or whatever we name it) is 
the closest match if you were previously using Instantiated...






[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM

2012-07-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419131#comment-13419131
 ] 

Robert Muir commented on LUCENE-4227:
-

{quote}
It is very much like the old instantiated (though I think its terms dict is
faster than instantiated's)... but I didn't really like the name
Instantiated... I had picked Direct because it directly represents the
postings ... but maybe we can find a better name.
{quote}

OK, I think what would be better is a better synonym for Uncompressed. I
realize Direct is consistent with packed ints or whatever... but I don't
think it should be using this name either; it's not intuitive.




[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM

2012-07-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419173#comment-13419173
 ] 

Robert Muir commented on LUCENE-4227:
-

I don't have a better name either. Let's just commit it with this one and
think about it later!

 DirectPostingsFormat, storing postings as simple int[] in memory, if you have 
 tons of RAM
 -

 Key: LUCENE-4227
 URL: https://issues.apache.org/jira/browse/LUCENE-4227
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Attachments: LUCENE-4227.patch, LUCENE-4227.patch, LUCENE-4227.patch





[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM

2012-07-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415613#comment-13415613
 ] 

Michael McCandless commented on LUCENE-4227:


I ran perf tests on a 2M Wikipedia index (requires 8 GB heap: need
more RAM to go higher!).

Results without the specialized scorers (baseline is trunk w/ MMapDir):

{noformat}
Task              QPS base  StdDev base  QPS direct  StdDev direct     Pct diff
PKLookup            259.28        11.94      227.96           5.85  -18% -  -5%
Fuzzy1              160.21         5.11      183.91           1.48   10% -  19%
TermGroup1M          18.33         0.21       21.60           0.11   15% -  19%
SpanNear              5.79         0.16        6.86           0.31   10% -  27%
TermBGroup1M         18.46         0.24       22.16           0.11   17% -  22%
TermBGroup1M1P       22.47         0.65       28.04           0.67   18% -  31%
SloppyPhrase          3.51         0.13        4.60           0.05   24% -  37%
IntNRQ               53.75         4.68       71.22           4.21   14% -  53%
OrHighHigh           18.85         0.42       26.89           2.16   28% -  57%
OrHighMed            37.93         0.91       54.57           5.71   25% -  62%
Respell             167.73         5.37      242.93           1.78   39% -  50%
Wildcard             46.64         1.74       69.98           3.43   37% -  63%
Prefix3             109.51         3.45      165.77           6.42   41% -  62%
Fuzzy2               56.48         2.37       88.25           0.91   48% -  64%
AndHighHigh          24.59         0.74       41.82           0.72   62% -  78%
Phrase               12.57         0.20       21.89           0.71   65% -  82%
Term                 39.05         1.74       69.00           3.68   60% -  94%
AndHighMed          126.87         2.48      261.73           4.19   99% - 113%
{noformat}

Nice speedups!
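
The Pct diff ranges above are consistent with comparing the two QPS means one standard deviation apart in each direction and truncating to a whole percent. That reading is inferred from the numbers in the table, not taken from the benchmark tool itself, and the class and method names below are hypothetical.

```java
/** Hypothetical reconstruction of how a Pct diff range could be derived. */
final class PctDiffRange {
    /**
     * Returns {worst, best}: the percentage change of the comparison QPS over the
     * baseline QPS, taking each mean one standard deviation in the unfavorable
     * and then the favorable direction, truncated toward zero.
     */
    static int[] range(double qpsBase, double sdBase, double qpsCmp, double sdCmp) {
        double worst = (qpsCmp - sdCmp) / (qpsBase + sdBase) - 1.0;
        double best = (qpsCmp + sdCmp) / (qpsBase - sdBase) - 1.0;
        return new int[] { (int) (100.0 * worst), (int) (100.0 * best) };
    }
}
```

For the PKLookup row, `range(259.28, 11.94, 227.96, 5.85)` gives -18 and -5, and for AndHighMed, `range(126.87, 2.48, 261.73, 4.19)` gives 99 and 113, matching the table.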

Same run, but using trunk w/ RAMDirectory as the baseline:

{noformat}
Task              QPS base  StdDev base  QPS direct  StdDev direct     Pct diff
PKLookup            248.50         4.73      222.03           4.43  -14% -  -7%
Fuzzy1              159.41         3.65      185.32           3.15   11% -  21%
SpanNear              5.74         0.08        6.75           0.17   13% -  22%
TermGroup1M          17.78         0.42       21.03           0.68   11% -  25%
TermBGroup1M         19.32         0.58       23.08           1.02   10% -  28%
IntNRQ               46.82         0.49       56.12           1.28   15% -  23%
TermBGroup1M1P       23.27         0.46       30.14           0.91   23% -  36%
Respell             163.36         3.42      221.10           2.48   31% -  39%
OrHighMed            30.62         1.94       42.94           5.70   14% -  69%
OrHighHigh           17.98         0.99       25.69           3.35   17% -  70%
Prefix3             114.41         0.67      164.19           2.22   40% -  46%
Wildcard             47.58         0.36       70.47           1.20   44% -  51%
Fuzzy2               53.92         1.37       83.54           2.66   46% -  64%
SloppyPhrase          5.07         0.23        8.12           0.74   39% -  82%
AndHighHigh          24.73         0.75       40.51           0.42   57% -  70%
Phrase               14.02         0.07       23.42           0.30   64% -  69%
Term                 39.96         2.13       67.39           4.09   50% -  88%
AndHighMed          132.66         3.24      274.07           1.64  100% - 113%
{noformat}

Still good speedups over the obvious "hold the index in RAM" option.

Then, just testing the specialized scorers (baseline = DirectPF without
specialized scorers):

{noformat}
Task              QPS base  StdDev base  QPS direct  StdDev direct     Pct diff
IntNRQ               74.86         3.42       71.72           0.27   -8% -   0%
Wildcard             62.88         2.34       60.52           0.49   -7% -   0%
Prefix3             102.46         3.98       98.92           0.85   -7% -   1%
AndHighHigh          51.41         1.96       50.26           1.10   -7% -   3%
AndHighMed          238.18         5.17      234.14           2.83   -4% -   1%
Fuzzy1              179.64         1.73      177.96           3.27   -3% -   1%
SloppyPhrase          8.97         0.37        8.93           0.48   -9% -   9%
Respell             223.76         1.16      222.79           2.68   -2% -   1%
Fuzzy2               79.62         1.38       79.31           0.90   -3% -   2%
SpanNear              6.83         0.25        6.89           0.31   -7% -   9%
PKLookup            220.25         1.46      225.17           2.56    0% -   4%
OrHighMed            50.70         4.27       53.20           3.95