[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-16 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995436#comment-12995436
 ] 

hao yan commented on LUCENE-2903:
-

Thank you both! Thanks for testing my codec so quickly, Michael!

RE: One question: it looks like this PFOR impl can only handle up to 28
bit wide ints? Which means... could it fail on some cases?
Though I suppose you would never see too many of these immense ints in
one block, and so they'd always be encoded as exceptions and so it's
actually safe...?

Hao: This won't fail. In my PFOR impl, I first call checkBigNumbers() to see if 
there is any number >= 2^28. If there is, I force encoding of the lower 4 
bits using the 128 4-bit slots. Thus, all exceptions left to Simple16 are 
< 2^28, which can definitely be handled. So there are no failure cases! :)
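
The reasoning above can be sketched in a few lines of Java. This is only an illustration, not the code from the patch: the class name, hasBigNumbers() (a stand-in for checkBigNumbers()), adjustBits() and exceptionHighBits() are hypothetical names, and values are assumed to be treated as unsigned 32-bit ints.

```java
final class BigNumberCheck {
    /** Largest value that fits in 28 bits, the limit discussed above for Simple16. */
    static final int SIMPLE16_MAX = (1 << 28) - 1;

    /** Hypothetical stand-in for checkBigNumbers(): true if any value is >= 2^28. */
    static boolean hasBigNumbers(int[] block) {
        for (int v : block) {
            // Unsigned shift, so values with the top bit set are also detected.
            if ((v >>> 28) != 0) {
                return true;
            }
        }
        return false;
    }

    /** Forces the per-slot bit width to at least 4 when the block has a big number. */
    static int adjustBits(int[] block, int chosenBits) {
        return hasBigNumbers(block) ? Math.max(chosenBits, 4) : chosenBits;
    }

    /** High part of an exception left for Simple16 once the low bits sit in the slot. */
    static int exceptionHighBits(int value, int bits) {
        return value >>> bits;
    }
}
```

With at least 4 low bits stored in the slot, the remaining high part of any 32-bit value is at most 2^28 - 1, which is why no exception can overflow Simple16.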

BTW, my PFOR impl produces a smaller index than VInt and the other PFOR impls. 
Thus, if the use case is real-time search, which requires loading the index 
from disk into memory frequently, my PFOR impl may save even more. 


  





 Improvement of PForDelta Codec
 --

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan
 Attachments: LUCENE-2903.patch, LUCENE-2903.patch, for_pfor.patch


 There are 3 versions of PForDelta implementations in the Bulk Branch: 
 FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.
 FrameOfRef is a very basic one, essentially a binary encoding 
 (it may result in a huge index).
 PatchedFrameOfRef is the implementation based on the original version of 
 PForDelta in the literature.
 PatchedFrameOfRef2 is my previous implementation, which is improved this 
 time. (The codec name is changed to NewPForDelta.)
 In particular, the changes are:
 1. I fixed the bug in my previous version (in Lucene-1410.patch), where the 
 old PForDelta does not support very large exceptions (since 
 Simple16 does not support very large numbers). This has been fixed in 
 the new LCPForDelta.
 2. I changed the PForDeltaFixedIntBlockCodec. It is now faster than the other 
 two PForDelta implementations in the bulk branch (FrameOfRef and 
 PatchedFrameOfRef). The codec's name is NewPForDelta, as you can see in the 
 CodecProvider and PForDeltaFixedIntBlockCodec.
 3. The performance test results are:
 1) My NewPForDelta codec is faster than FrameOfRef and PatchedFrameOfRef 
 for almost all kinds of queries, and slightly slower than BulkVInt.
 2) My NewPForDelta codec results in the smallest index size among all 4 
 methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
 3) All performance test results were obtained by running with -server 
 instead of -client.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12993003#comment-12993003
 ] 

Michael McCandless commented on LUCENE-2903:


I don't see a new patch here?  E.g., PForDeltaFixedIntBlockWithReadIntCodec.java 
seems to be missing (and others)?

Also, it's best if you can run the perf tests without -debug, since they then 
run on a more realistic index.  The small (100K docs) debug index 
over-emphasizes the setup cost for each query, vs the actual time to enumerate 
the docs.




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-09 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992687#comment-12992687
 ] 

hao yan commented on LUCENE-2903:
-

Hi, Robert and Michael

In order to test whether ByteBuffer/IntBuffer works better than int[]/byte[] 
conversion, I have now separated them into 3 different codecs. All of them use 
the same PForDelta implementation, but they access IndexInput/IndexOutput 
differently, as follows.

1. PatchedFrameOfRef3 - uses in.readBytes() and converts int[]/byte[] 
manually. The corresponding Java code is PForDeltaFixedIntBlockCodec.java.

2. PatchedFrameOfRef4 - uses in.readBytes() and converts int[]/byte[] 
via ByteBuffer/IntBuffer. The corresponding Java code is 
PForDeltaFixedIntBlockWithByteBufferCodec.java.

3. PatchedFrameOfRef5 - uses in.readInt() in a loop, so it needs no 
conversion. The corresponding Java code is 
PForDeltaFixedIntBlockWithReadIntCodec.java.
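
For readers following along, the three read paths listed above can be sketched roughly as below. This is not the code from the patch: the class and method names are hypothetical, java.io.DataInputStream stands in for Lucene's IndexInput so the sketch compiles on its own, and big-endian byte order is assumed throughout.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

final class IntDecodeVariants {
    /** Variant 1: manual big-endian byte[] to int[] conversion. */
    static int[] manual(byte[] bytes) {
        int[] out = new int[bytes.length / 4];
        for (int i = 0, j = 0; i < out.length; i++, j += 4) {
            out[i] = ((bytes[j] & 0xFF) << 24)
                   | ((bytes[j + 1] & 0xFF) << 16)
                   | ((bytes[j + 2] & 0xFF) << 8)
                   |  (bytes[j + 3] & 0xFF);
        }
        return out;
    }

    /** Variant 2: view the bytes as an IntBuffer and bulk-copy into an int[]. */
    static int[] viaIntBuffer(byte[] bytes) {
        int[] out = new int[bytes.length / 4];
        // ByteBuffer.wrap() defaults to big-endian order.
        ByteBuffer.wrap(bytes).asIntBuffer().get(out);
        return out;
    }

    /** Variant 3: read one int at a time, no intermediate conversion step. */
    static int[] viaReadInt(byte[] bytes) {
        int[] out = new int[bytes.length / 4];
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            for (int i = 0; i < out.length; i++) {
                out[i] = in.readInt();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

All three variants produce the same int[] for the same input; they differ only in how the bytes are turned into ints, which is what the three codecs are meant to isolate.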

I tested them against BulkVInt on MacOS. The detailed results are attached. 
Here are the conclusions:

1) Yes, Michael and Robert, you guys are right! ByteBuffer/IntBuffer is faster 
than my manual conversion between byte[] and int[]. I guess the reason I 
thought it was worse is that I did not separate the codecs before, so the test 
results were not stable due to the JVM/JIT. 

2) PatchedFrameOfRef4 is still slower than BulkVInt for many kinds of 
queries. However, it seems to do better for fuzzy queries and 
wildcard queries.

3) Of course, PatchedFrameOfRef3, 4, and 5 are all better than 
PatchedFrameOfRef and FrameOfRef for almost all queries.

4) The new patch has just been uploaded; please check it out. 

The following are the experimental results for the 0.1M data.

(1) bulkVInt VS patchedFrameOfRef4 (withByteBuffer, in.readBytes(..))

||Query||QPS bulkVInt||QPS patchedFrameOfRef4-withByteBuffer||Pct diff||
|united states|389.26|361.79|-7.1%|
|united states~3|234.52|228.99|-2.4%|
|+nebraska +states|1138.95|992.06|-12.9%|
|+united +states|670.69|603.86|-10.0%|
|doctimesecnum:[1 TO 6]|415.28|447.83|7.8%|
|doctitle:.*[Uu]nited.*|496.03|522.47|5.3%|
|spanFirst(unit, 5)|1176.47|1086.96|-7.6%|
|spanNear([unit, state], 10, true)|502.26|423.73|-15.6%|
|states|1612.90|1453.49|-9.9%|
|u*d|167.95|171.17|1.9%|
|un*d|260.69|275.33|5.6%|
|uni*|602.41|577.37|-4.2%|
|unit*|1016.26|1041.67|2.5%|
|united states|617.28|549.45|-11.0%|
|united~0.6|12.22|12.93|5.9%|
|united~0.75|53.88|56.78|5.4%|
|unit~0.5|12.58|13.19|4.9%|
|unit~0.7|52.41|54.93|4.8%|

(2) bulkVInt VS patchedFrameOfRef3 (with my own int[]/byte[] conversion, 
still in.readBytes(..))

||Query||QPS bulkVInt||QPS patchedFrameOfRef3||Pct diff||
|united states|388.50|363.24|-6.5%|
|united states~3|234.80|223.56|-4.8%|
|+nebraska +states|1138.95|1016.26|-10.8%|
|+united +states|671.14|607.90|-9.4%|
|doctimesecnum:[1 TO 6]|418.24|441.89|5.7%|
|doctitle:.*[Uu]nited.*|489.00|522.74|6.9%|
|spanFirst(unit, 5)|1246.88|1127.40|-9.6%|
|spanNear([unit, state], 10, true)|514.14|473.71|-7.9%|
|states|1612.90|1488.10|-7.7%|
|u*d|170.77|167.31|-2.0%|
|un*d|261.37|264.48|1.2%|
|uni*|609.38|602.41|-1.1%|
|unit*|1028.81|1052.63|2.3%|
|united states|614.25|564.33|-8.1%|
|united~0.6|12.05|12.11|0.5%|
|united~0.75|53.16|54.97|3.4%|
|unit~0.5|12.43|12.50|0.6%|
|unit~0.7|52.81|53.23|0.8%|


(3) bulkVInt VS patchedFrameOfRef5 (with in.readInt() in a loop, no 
conversion)

||Query||QPS bulkVInt||QPS patchedFrameOfRef5-withReadInt||Pct diff||
|united states|391.24|366.70|-6.3%|
|united states~3|235.40|235.07|-0.1%|
|+nebraska +states|1137.66|1072.96|-5.7%|
|+united +states|673.40|642.26|-4.6%|
|doctimesecnum:[1 TO 6]|414.25|407.66|-1.6%|
|doctitle:.*[Uu]nited.*|492.61|538.21|9.3%|
|spanFirst(unit, 5)|1253.13|1175.09|-6.2%|
|spanNear([unit, state], 10, true)|511.25|483.56|-5.4%|
|states|1642.04|1490.31|-9.2%|
|u*d|166.78|160.28|-3.9%|
|un*d|261.64|255.36|-2.4%|
|uni*|609.38|593.47|-2.6%|
|unit*|1026.69|

[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992757#comment-12992757
 ] 

Robert Muir commented on LUCENE-2903:
-

Hello, 

I don't see the new files you referred to in the patch.
Maybe the new files were not added to svn with 'svn add' before making the 
patch?





[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-09 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992809#comment-12992809
 ] 

hao yan commented on LUCENE-2903:
-

Just uploaded. Sorry. 




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-08 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992237#comment-12992237
 ] 

hao yan commented on LUCENE-2903:
-

I tried moving the memory allocation out of readBlock() into BlockReader's 
constructor. It improves performance a little. I also tried using 
ByteBuffer/IntBuffer to replace my manual conversion between byte[] and int[]. 
It makes things worse.
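
As a rough illustration of the first change (allocation moved from readBlock() into the constructor), a minimal sketch could look like the following. The class is hypothetical and is not the BlockReader from the patch; System.arraycopy stands in for in.readBytes(), and big-endian ints are assumed.

```java
final class ReusingBlockReader {
    // Both buffers are allocated once here instead of on every readBlock() call.
    private final byte[] scratch;
    private final int[] block;

    ReusingBlockReader(int blockSize) {
        this.scratch = new byte[blockSize * 4];
        this.block = new int[blockSize];
    }

    /** Decodes one block starting at offset and returns the shared int[] buffer. */
    int[] readBlock(byte[] src, int offset) {
        // Stand-in for in.readBytes(scratch, 0, scratch.length).
        System.arraycopy(src, offset, scratch, 0, scratch.length);
        for (int i = 0, j = 0; i < block.length; i++, j += 4) {
            block[i] = ((scratch[j] & 0xFF) << 24)
                     | ((scratch[j + 1] & 0xFF) << 16)
                     | ((scratch[j + 2] & 0xFF) << 8)
                     |  (scratch[j + 3] & 0xFF);
        }
        return block;
    }
}
```

The trade-off is that the returned array is overwritten by the next readBlock() call, so a caller that needs to keep the values has to copy them first.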

The following are my results for the 0.1M data:
(1) BulkVInt vs patchedFrameOfRef3
||Query||QPS bulkVInt||QPS patchedFrameOfRef3||Pct diff||
|united states|393.55|362.84|-7.8%|
|united states~3|243.84|236.80|-2.9%|
|+nebraska +states|1140.25|998.00|-12.5%|
|+united +states|687.76|633.31|-7.9%|
|doctimesecnum:[1 TO 6]|413.56|427.53|3.4%|
|doctitle:.*[Uu]nited.*|510.46|534.47|4.7%|
|spanFirst(unit, 5)|1240.69|1108.65|-10.6%|
|spanNear([unit, state], 10, true)|511.77|463.18|-9.5%|
|states|1626.02|1483.68|-8.8%|
|u*d|164.23|162.79|-0.9%|
|un*d|257.53|252.97|-1.8%|
|uni*|607.53|591.02|-2.7%|
|unit*|1024.59|1043.84|1.9%|
|united states|627.35|578.70|-7.8%|
|united~0.6|11.51|11.36|-1.3%|
|united~0.75|52.58|53.57|1.9%|
|unit~0.5|12.08|11.93|-1.2%|
|unit~0.7|50.98|51.30|0.6%|

(2) FrameOfRef VS PatchedFrameOfRef3
||Query||QPS patchedFrameOfRef||QPS patchedFrameOfRef3||Pct diff||
|united states|314.76|362.71|15.2%|
|united states~3|227.53|237.08|4.2%|
|+nebraska +states|1075.27|1025.64|-4.6%|
|+united +states|646.41|626.57|-3.1%|
|doctimesecnum:[1 TO 6]|412.88|429.37|4.0%|
|doctitle:.*[Uu]nited.*|481.70|528.82|9.8%|
|spanFirst(unit, 5)|1060.45|1118.57|5.5%|
|spanNear([unit, state], 10, true)|409.33|467.73|14.3%|
|states|1353.18|1479.29|9.3%|
|u*d|158.91|165.98|4.4%|
|un*d|237.36|256.41|8.0%|
|uni*|560.22|593.12|5.9%|
|unit*|946.97|1043.84|10.2%|
|united states|431.22|583.09|35.2%|
|united~0.6|10.91|11.37|4.2%|
|united~0.75|50.30|53.30|5.9%|
|unit~0.5|11.54|11.94|3.5%|
|unit~0.7|47.38|50.38|6.3%|


(3) PatchedFrameOfRef VS PatchedFrameOfRef3

||Query||QPS FrameOfRef||QPS patchedFrameOfRef3||Pct diff||
|united states|326.26|360.49|10.5%|
|united states~3|226.50|234.69|3.6%|
|+nebraska +states|1077.59|1021.45|-5.2%|
|+united +states|648.51|630.52|-2.8%|
|doctimesecnum:[1 TO 6]|324.46|428.45|32.0%|
|doctitle:.*[Uu]nited.*|485.44|527.70|8.7%|
|spanFirst(unit, 5)|1007.05|.11|10.3%|
|spanNear([unit, state], 10, true)|446.03|465.55|4.4%|
|states|1449.28|1459.85|0.7%|
|u*d|158.43|161.79|2.1%|
|un*d|246.37|256.28|4.0%|
|uni*|548.85|594.88|8.4%|
|unit*|920.81|1042.75|13.2%|
|united states|450.65|576.37|27.9%|
|united~0.6|11.07|11.26|1.7%|
|united~0.75|50.70|52.60|3.8%|
|unit~0.5|11.64|11.76|1.0%|
|unit~0.7|49.04|50.70|3.4%|





[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-06 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991220#comment-12991220
 ] 

hao yan commented on LUCENE-2903:
-

Hi, Michael

Did you try FrameOfRef and PatchedFrameOfRef? 




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-06 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991222#comment-12991222
 ] 

hao yan commented on LUCENE-2903:
-

And using IntBuffer.put/get certainly complicates the PForDelta algorithm a lot.




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-06 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991221#comment-12991221
 ] 

hao yan commented on LUCENE-2903:
-

Hi, Paul

I tested ByteBuffer/IntBuffer; it is not faster than converting int[]/byte[] 
manually. 




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990967#comment-12990967
 ] 

Michael McCandless commented on LUCENE-2903:


bq.  Do you have any other data sets?

I'll try using europarl.  I have a line file w/ one paragraph per line = 5.6M 
docs, 3.2GB = ~620 UTF8 bytes per doc (smaller than the line file we use for 
Wikipedia, which targets ~1024 bytes per line).




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990988#comment-12990988
 ] 

Michael McCandless commented on LUCENE-2903:


Results on Europarl (each paragraph is a doc):

||Query||QPS bulkvint||QPS pfor3||Pct diff||
|doctimesecnum:[1 TO 6]|16.73|12.91|{color:red}-22.8%{color}|
|spanFirst(unit, 5)|5214.47|4143.21|{color:red}-20.5%{color}|
|spanNear([unit, state], 10, true)|869.71|719.62|{color:red}-17.3%{color}|
|united states|320.66|266.50|{color:red}-16.9%{color}|
|united states~3|212.07|187.75|{color:red}-11.5%{color}|
|u*d|41.09|36.90|{color:red}-10.2%{color}|
|unit~0.7|94.11|85.34|{color:red}-9.3%{color}|
|un*d|68.38|62.09|{color:red}-9.2%{color}|
|+united +states|440.68|406.08|{color:red}-7.8%{color}|
|united states|272.68|255.73|{color:red}-6.2%{color}|
|states|552.36|532.76|{color:red}-3.5%{color}|
|unit~0.5|18.86|18.67|{color:red}-1.0%{color}|
|uni*|47.96|47.65|{color:red}-0.6%{color}|
|united~0.6|23.82|23.69|{color:red}-0.5%{color}|
|unit*|435.99|437.09|{color:green}0.3%{color}|
|doctitle:.*[Uu]nited.*|24.16|24.31|{color:green}0.6%{color}|
|+nebraska +states|35010.33|36809.36|{color:green}5.1%{color}|
|united~0.75|172.36|195.18|{color:green}13.2%{color}|





[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991004#comment-12991004
 ] 

Michael McCandless commented on LUCENE-2903:


So, the QPS in that run are absurdly high for most queries... I think we need 
different queries to test against Europarl.




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-03 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990186#comment-12990186
 ] 

Paul Elschot commented on LUCENE-2903:
--

When the IntBuffer is produced by ByteBuffer.asIntBuffer() and that ByteBuffer 
is produced from a byte[], this IntBuffer can be used to compress data into 
the byte[] on an int-by-int basis.
After that, this byte[] can be written directly to an IndexOutput.

What is it that cannot be avoided?
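
To make the suggestion above concrete, here is a minimal sketch using only java.nio. The class and method names (IntViewSketch, packThroughView) are made up for illustration and are not taken from the patch. It assumes the default big-endian order of ByteBuffer.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.util.Arrays;

final class IntViewSketch {
    /** Puts ints into a byte[] through an IntBuffer view of that array. */
    static byte[] packThroughView(int[] values) {
        byte[] backing = new byte[values.length * 4];
        // wrap() does not copy: the buffer reads and writes `backing` itself.
        IntBuffer view = ByteBuffer.wrap(backing).asIntBuffer();
        for (int v : values) {
            // Each put stores 4 bytes in `backing` (big-endian by default).
            view.put(v);
        }
        return backing;
    }

    public static void main(String[] args) {
        byte[] bytes = packThroughView(new int[] {1, 0x01020304});
        System.out.println(Arrays.toString(bytes)); // [0, 0, 0, 1, 1, 2, 3, 4]
    }
}
```

The returned array is the same one the view wrote into, so it could be handed to a bulk byte write without a separate int[] to byte[] pass. Whether this is actually faster than a manual conversion is the open question in this thread and would need measuring.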




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-03 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990214#comment-12990214
 ] 

hao yan commented on LUCENE-2903:
-

I think essentially the above step also needs to do an int-byte-int conversion. 
So there is no reason it would cost less than doing the conversion manually.




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-03 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990260#comment-12990260
 ] 

Paul Elschot commented on LUCENE-2903:
--

The conversion is done in place; no data is being copied. Have a look here:
http://download.oracle.com/javase/1.5.0/docs/api/java/nio/ByteBuffer.html#views





[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990365#comment-12990365
 ] 

Michael McCandless commented on LUCENE-2903:


Hmm the last patch still seems to have disabled the allOnes opto in BulkVInt?  
(I turned it back on in my checkout).

Also it seems to have an added dep to Kamikaze, but it's not in fact needed?

I ran a perf test on 10M Wikipedia docs, BulkVInt vs PatchedFrameOfRef3, on 
Linux:

||Query||QPS bulkvint||QPS pfor3||Pct diff||
|+nebraska +states|104.70|66.87|{color:red}-36.1%{color}|
|united states|12.19|9.05|{color:red}-25.7%{color}|
|united states|16.56|13.46|{color:red}-18.7%{color}|
|spanNear([unit, state], 10, true)|43.37|35.84|{color:red}-17.4%{color}|
|+united +states|19.51|16.27|{color:red}-16.6%{color}|
|united~0.6|8.52|7.64|{color:red}-10.4%{color}|
|united~0.75|13.05|11.74|{color:red}-10.1%{color}|
|states|47.99|43.40|{color:red}-9.6%{color}|
|u*d|9.64|8.89|{color:red}-7.8%{color}|
|spanFirst(unit, 5)|157.25|145.20|{color:red}-7.7%{color}|
|united states~3|5.80|5.37|{color:red}-7.4%{color}|
|unit~0.5|17.32|16.12|{color:red}-6.9%{color}|
|doctimesecnum:[1 TO 6]|12.40|11.68|{color:red}-5.8%{color}|
|un*d|19.11|18.41|{color:red}-3.7%{color}|
|unit*|33.44|32.44|{color:red}-3.0%{color}|
|unit~0.7|27.27|26.90|{color:red}-1.4%{color}|
|doctitle:.*[Uu]nited.*|5.34|5.30|{color:red}-0.8%{color}|
|uni*|18.89|18.81|{color:red}-0.4%{color}|

Somehow it's substantially slower... but I haven't tested how the other PFor 
impl we have compares.  Not sure what's up...
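
For readers checking the table, the Pct diff column appears to be the relative change of the pfor3 QPS against the bulkvint QPS. That reading is an assumption, but it reproduces the rows shown. A small sketch with a hypothetical helper name:

```java
final class PctDiffSketch {
    /** Relative change of the candidate QPS versus the baseline QPS, in percent. */
    static double pctDiff(double baselineQps, double candidateQps) {
        return (candidateQps - baselineQps) / baselineQps * 100.0;
    }

    public static void main(String[] args) {
        // First row of the table: +nebraska +states, bulkvint 104.70 vs pfor3 66.87.
        System.out.printf("%.1f%%%n", pctDiff(104.70, 66.87));
    }
}
```

A negative value means the candidate codec answered fewer queries per second than the baseline.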




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-03 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990480#comment-12990480
 ] 

hao yan commented on LUCENE-2903:
-

Yes. Other PFOR impls (FrameOfRef and PatchedFrameOfRef) are even slower (as 
long as you set -server when you run them). I am also wondering why. Actually, I 
think the Wikipedia data is kind of biased. Do you have any other data sets? 




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-02 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989692#comment-12989692
 ] 

Paul Elschot commented on LUCENE-2903:
--

Just one nitpick about the codec name containing 'New'.
This will be out of date rather soon, so it may be better to simply use an 
incremental number.




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-02 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989754#comment-12989754
 ] 

hao yan commented on LUCENE-2903:
-

Hi, Paul. Thanks for the suggestions. I just uploaded a new patch which renames 
the codec to PatchedFrameOfRef3. 

I actually have a question to ask. The BulkVInt codec writes the compressed 
byte stream as a chunk of bytes. However, in the PForDelta-related codecs the 
compressed results are ints, so I have to either write single ints in a loop, 
or first convert the int array to a byte array and then call out.writeBytes(). 
Do you know any smarter way to write an int array to an IndexOutput? 

Another thing I tried is to make PForDelta itself produce byte-wise compressed 
results. However, in my experiments this slows PForDelta down 
significantly. Also, I do not think the NIO buffer used in FrameOfRef and 
PatchedFrameOfRef helps, since essentially it is the same as first 
converting the int array to a byte array and then calling writeBytes().

Do you have any good suggestions? Thanks! 
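
For reference, the "convert the int array to a byte array first" option described above might look roughly like the sketch below. The names (IntArrayToBytesSketch, toBytes, fromBytes) are hypothetical and not from the patch, and the high-byte-first order is an assumption of this sketch. The resulting byte[] is what would then be passed to a single out.writeBytes() call.

```java
final class IntArrayToBytesSketch {
    /** Converts the first `count` ints to bytes, high byte first. */
    static byte[] toBytes(int[] values, int count) {
        byte[] out = new byte[count * 4];
        for (int i = 0, j = 0; i < count; i++) {
            int v = values[i];
            out[j++] = (byte) (v >>> 24);
            out[j++] = (byte) (v >>> 16);
            out[j++] = (byte) (v >>> 8);
            out[j++] = (byte) v;
        }
        return out;
    }

    /** Inverse of toBytes: rebuilds ints from groups of 4 bytes. */
    static int[] fromBytes(byte[] bytes) {
        int[] out = new int[bytes.length / 4];
        for (int i = 0, j = 0; i < out.length; i++) {
            out[i] = ((bytes[j++] & 0xFF) << 24)
                   | ((bytes[j++] & 0xFF) << 16)
                   | ((bytes[j++] & 0xFF) << 8)
                   |  (bytes[j++] & 0xFF);
        }
        return out;
    }
}
```

This trades one extra pass over the block for a single bulk write instead of one write call per int.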




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-02 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989856#comment-12989856
 ] 

Paul Elschot commented on LUCENE-2903:
--

One way to get an underlying byte array from an IntBuffer is by using 
ByteBuffer.asIntBuffer() to allocate the IntBuffer via a ByteBuffer from the 
byte array. Would that be possible here?
I remember using this for testing the original (P)FOR implementation with 
Lucene's IndexInput/IndexOutput. I did not look at any code to answer this, 
though, so please holler if this is a dead end.





[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-02 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989872#comment-12989872
 ] 

hao yan commented on LUCENE-2903:
-

Yes, using ByteBuffer.asIntBuffer() is the same as converting an int/byte array 
to a byte/int array. I think the underlying implementation of 
ByteBuffer.asIntBuffer() cannot avoid it. I also tried ByteBuffer/IntBuffer, 
though; the result is worse, which makes sense since it may incur extra costs.

Where to holler? :) 




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989480#comment-12989480
 ] 

Robert Muir commented on LUCENE-2903:
-

Just curious: why does the patch remove BulkVint's optimization for blocks of 
all 1's (it writes a 0-byte header only in this case) ?

{noformat}
+  allOnes = false;
   if (allOnes) {
 // the most common int pattern (all 1's)
 // write a special header (numBytes=0) for this case.
{noformat}

This is an important optimization I think: besides the fact it's the most common 
bitpattern[1], it's efficient: a single compare-to-zero for the entire block of 
128 ints, and it takes care of several worst-cases for vint: blocks of all 1 
docdeltas (something more commonly seen in structured data, but still the most 
common pattern in unstructured, stopwordish things), and all 1 freqs (e.g. you 
should have omitTF'ed). Depending on block size this significantly reduces the 
.doc/.freq files for vint, and still helps in the pure unstructured case (I 
measured this with luceneutil).

[1] http://portal.acm.org/citation.cfm?id=1712668

Furthermore, I was thinking that along the lines of this allOnes trick, we 
could evaluate an alternative to the Sep file layout: instead at least we 
should consider interleaving .doc and .freq (block of doc deltas, block of 
freqs).
With this interleaved layout, something only interested in doc deltas can just 
read the freq byte header and skip these bytes to bypass all the freqs... 
omitTF is then implemented automatically for a lot of cases (though this 
wouldn't be equivalent to lucene's manually-set omitTFAP today, as positions 
would still exist). If you did manually set omitTF, we could arguably just 
write this same 0 byte header for freq blocks, which means all 1 freqs, and not 
have so much specialization and different codepaths.
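
As a rough illustration of the allOnes header trick discussed above, here is a self-contained sketch. It is not the BulkVInt code from the branch: the class and method names are hypothetical, and the VInt-style encoding (7 bits per byte, low bits first) is an assumption of the sketch. A header of 0 bytes means "every value in the block is 1", so the decoder needs only one compare for the whole block, and a reader that wants to skip a block can read the header and jump that many bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

final class AllOnesBlockSketch {
    /** Writes a non-negative int, 7 bits per byte, low bits first. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVInt(ByteBuffer in) {
        byte b = in.get();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Header is the payload length in bytes; 0 means the block is all 1's. */
    static byte[] encodeBlock(int[] block) {
        boolean allOnes = true;
        for (int v : block) {
            if (v != 1) {
                allOnes = false;
                break;
            }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (allOnes) {
            writeVInt(out, 0); // special header, no payload follows
            return out.toByteArray();
        }
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        for (int v : block) {
            writeVInt(payload, v);
        }
        writeVInt(out, payload.size());
        out.write(payload.toByteArray(), 0, payload.size());
        return out.toByteArray();
    }

    static int[] decodeBlock(ByteBuffer in, int blockSize) {
        int[] block = new int[blockSize];
        int numBytes = readVInt(in);
        if (numBytes == 0) { // one compare covers the whole block
            Arrays.fill(block, 1);
            return block;
        }
        for (int i = 0; i < blockSize; i++) {
            block[i] = readVInt(in);
        }
        return block;
    }
}
```

With this layout a block of 128 ones costs a single byte on disk, which is the case the patch had accidentally switched off.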





[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-01 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989532#comment-12989532
 ] 

hao yan commented on LUCENE-2903:
-

Hi, Robert

Sorry. That was a mistake. I commented out that one just for debugging to see 
if that affects the performance. I should have changed it back. I will attach a 
new patch. 

Thanks for pointing that out. 
