[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995436#comment-12995436 ]

hao yan commented on LUCENE-2903:
---------------------------------

Thank you both! And thanks for testing my codec so quickly, Michael!

RE: One question: it looks like this PFOR impl can only handle up to 28 bit wide ints? Which means... could it fail on some cases? Though I suppose you would never see too many of these immense ints in one block, and so they'd always be encoded as exceptions and so it's actually safe...?

Hao: This won't fail. My PFOR impl first calls checkBigNumbers() to see whether any number is >= 2^28. If there is one, I force encoding of the lower 4 bits in the 128 4-bit slots. All exceptions left to Simple16 are then < 2^28, which it can definitely handle. So there are no failure cases! :)

BTW, my PFOR impl produces a smaller index than VInt and the other PFOR impls. So if the use case is real-time search, which requires loading the index from disk into memory frequently, my PFOR impl may save even more.

Improvement of PForDelta Codec
------------------------------

Key: LUCENE-2903
URL: https://issues.apache.org/jira/browse/LUCENE-2903
Project: Lucene - Java
Issue Type: Improvement
Reporter: hao yan
Attachments: LUCENE-2903.patch, LUCENE-2903.patch, for_pfor.patch

There are 3 versions of PForDelta implementations in the Bulk Branch: FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.
FrameOfRef is a very basic one that is essentially a binary encoding (it may result in a huge index size).
PatchedFrameOfRef is the implementation based on the original version of PForDelta in the literature.
PatchedFrameOfRef2 is my previous implementation, which is improved this time. (The codec name is changed to NewPForDelta.)

In particular, the changes are:

1. I fixed the bug in my previous version (in Lucene-1410.patch), where the old PForDelta did not support very large exceptions (since Simple16 does not support very large numbers). This has been fixed in the new LCPForDelta.
2. I changed the PForDeltaFixedIntBlockCodec. It is now faster than the other two PForDelta implementations in the bulk branch (FrameOfRef and PatchedFrameOfRef). The codec's name is NewPForDelta, as you can see in the CodecProvider and PForDeltaFixedIntBlockCodec.
3. The performance test results are:
   1) My NewPForDelta codec is faster than FrameOfRef and PatchedFrameOfRef for almost all kinds of queries, and slightly slower than BulkVInt.
   2) My NewPForDelta codec produces the smallest index among all 4 methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
   3) All performance test results were obtained by running with -server instead of -client.

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
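The big-number handling described in the comment above can be illustrated with a small sketch. It is not the code from the patch: the class and method names are hypothetical, and it assumes that "forcing the lower 4 bits into the 4-bit slots" means storing the low 4 bits of each value in its slot and leaving the remaining high bits as the exception handed to Simple16. Under that assumption, the high part of any 32-bit value needs at most 28 bits.

```java
// Hypothetical sketch; class and method names are not taken from the patch.
final class BigNumberSplit {
    // Assumed from the comment: values of 2^28 or more are too wide for Simple16.
    static final int SIMPLE16_BITS = 28;
    // Assumed from the comment: at least 4 low bits go into the fixed-width slots.
    static final int SLOT_BITS = 4;

    /** Returns true if any value, read as unsigned, is 2^28 or larger. */
    static boolean hasBigNumbers(int[] block) {
        for (int v : block) {
            if ((v >>> SIMPLE16_BITS) != 0) {
                return true;
            }
        }
        return false;
    }

    /** Low bits that would be stored in the value's fixed-width slot. */
    static int lowBits(int v) {
        return v & ((1 << SLOT_BITS) - 1);
    }

    /** Remaining high bits, always below 2^28 because 32 - 4 = 28. */
    static int highBits(int v) {
        return v >>> SLOT_BITS;
    }

    /** Rebuilds the original value from its two parts. */
    static int join(int low, int high) {
        return (high << SLOT_BITS) | low;
    }

    public static void main(String[] args) {
        int big = 0xFFFFFFFF; // widest possible 32-bit pattern
        int low = lowBits(big);
        int high = highBits(big);
        System.out.println(hasBigNumbers(new int[] {7, big}));
        System.out.println(high < (1 << SIMPLE16_BITS));
        System.out.println(join(low, high) == big);
    }
}
```

Running main prints true three times: the block is flagged as containing a big number, the high part fits below 2^28, and the value is recovered exactly.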
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993003#comment-12993003 ]

Michael McCandless commented on LUCENE-2903:
--------------------------------------------

I don't see a new patch here? E.g., PForDeltaFixedIntBlockWithReadIntCodec.java seems to be missing (and others)?

Also, it's best if you can run the perf tests without -debug, since they then run on a more realistic index. The small (100K docs) debug index over-emphasizes the setup cost for each query vs. the actual time to enum the docs.

Attachments: LUCENE-2903.patch, LUCENE_2903.patch, LUCENE_2903.patch
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992687#comment-12992687 ]

hao yan commented on LUCENE-2903:
---------------------------------

Hi, Robert and Michael

To test whether ByteBuffer/IntBuffer works better than manual int[] <-> byte[] conversion, I have now separated them into 3 different codecs. All of them use the same PForDelta implementation; they differ only in how they use IndexInput/IndexOutput:

1. PatchedFrameOfRef3 - uses in.readBytes() and converts int[] <-> byte[] manually. Its Java code is PForDeltaFixedIntBlockCodec.java
2. PatchedFrameOfRef4 - uses in.readBytes() and converts int[] <-> byte[] via ByteBuffer/IntBuffer. Its Java code is PForDeltaFixedIntBlockWithByteBufferCodec.java
3. PatchedFrameOfRef5 - uses in.readInt() in a loop, so it needs no conversion. Its Java code is PForDeltaFixedIntBlockWithReadIntCodec.java

I tested them against BulkVInt on MacOS. The detailed results are attached. Here are the conclusions:

1) Yes, Michael and Robert, you guys are right! ByteBuffer/IntBuffer is faster than my manual conversion between byte[] and int[]. I guess I thought it was worse because I had not separated the codecs before, so the test results were not stable due to JVM/JIT effects.
2) PatchedFrameOfRef4 is still worse than BulkVInt for many kinds of queries. However, it seems to do better for fuzzy queries and wildcard queries.
3) Of course, PatchedFrameOfRef3, 4 and 5 are all better than PatchedFrameOfRef and FrameOfRef for almost all queries.
4) The new patch has just been uploaded; please check it out.

The following are the experimental results for 0.1M data.

(1) bulkVInt VS patchedFrameOfRef4 (withByteBuffer, in.readBytes(..))

||Query||QPS bulkVInt||QPS patchedFrameOfRef4-withByteBuffer||Pct diff||
|united states|389.26|361.79|-7.1%|
|united states~3|234.52|228.99|-2.4%|
|+nebraska +states|1138.95|992.06|-12.9%|
|+united +states|670.69|603.86|-10.0%|
|doctimesecnum:[1 TO 6]|415.28|447.83|7.8%|
|doctitle:.*[Uu]nited.*|496.03|522.47|5.3%|
|spanFirst(unit, 5)|1176.47|1086.96|-7.6%|
|spanNear([unit, state], 10, true)|502.26|423.73|-15.6%|
|states|1612.90|1453.49|-9.9%|
|u*d|167.95|171.17|1.9%|
|un*d|260.69|275.33|5.6%|
|uni*|602.41|577.37|-4.2%|
|unit*|1016.26|1041.67|2.5%|
|united states|617.28|549.45|-11.0%|
|united~0.6|12.22|12.93|5.9%|
|united~0.75|53.88|56.78|5.4%|
|unit~0.5|12.58|13.19|4.9%|
|unit~0.7|52.41|54.93|4.8%|

(2) bulkVInt VS patchedFrameOfRef3 (with my own int[] <-> byte[] conversion, still in.readBytes(..))

||Query||QPS bulkVInt||QPS patchedFrameOfRef3||Pct diff||
|united states|388.50|363.24|-6.5%|
|united states~3|234.80|223.56|-4.8%|
|+nebraska +states|1138.95|1016.26|-10.8%|
|+united +states|671.14|607.90|-9.4%|
|doctimesecnum:[1 TO 6]|418.24|441.89|5.7%|
|doctitle:.*[Uu]nited.*|489.00|522.74|6.9%|
|spanFirst(unit, 5)|1246.88|1127.40|-9.6%|
|spanNear([unit, state], 10, true)|514.14|473.71|-7.9%|
|states|1612.90|1488.10|-7.7%|
|u*d|170.77|167.31|-2.0%|
|un*d|261.37|264.48|1.2%|
|uni*|609.38|602.41|-1.1%|
|unit*|1028.81|1052.63|2.3%|
|united states|614.25|564.33|-8.1%|
|united~0.6|12.05|12.11|0.5%|
|united~0.75|53.16|54.97|3.4%|
|unit~0.5|12.43|12.50|0.6%|
|unit~0.7|52.81|53.23|0.8%|

(3) bulkVInt VS patchedFrameOfRef5 (withReadInt, in.readInt() in a loop, no conversion)

||Query||QPS bulkVInt||QPS patchedFrameOfRef5-withReadInt||Pct diff||
|united states|391.24|366.70|-6.3%|
|united states~3|235.40|235.07|-0.1%|
|+nebraska +states|1137.66|1072.96|-5.7%|
|+united +states|673.40|642.26|-4.6%|
|doctimesecnum:[1 TO 6]|414.25|407.66|-1.6%|
|doctitle:.*[Uu]nited.*|492.61|538.21|9.3%|
|spanFirst(unit, 5)|1253.13|1175.09|-6.2%|
|spanNear([unit, state], 10, true)|511.25|483.56|-5.4%|
|states|1642.04|1490.31|-9.2%|
|u*d|166.78|160.28|-3.9%|
|un*d|261.64|255.36|-2.4%|
|uni*|609.38|593.47|-2.6%|
|unit*|1026.69
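The three read paths compared in the comment above (manual conversion, ByteBuffer/IntBuffer, and readInt() in a loop) can be sketched side by side. This is only an illustration under stated assumptions, not the code from the patch: the class and method names are hypothetical, a big-endian byte layout is assumed, and java.io.DataInputStream stands in for Lucene's IndexInput so that the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch; not the code from the patch.
final class IntDecodeVariants {
    /** Variant 1: manual big-endian byte[] to int[] conversion. */
    static void manual(byte[] src, int[] dst, int count) {
        for (int i = 0, j = 0; i < count; i++, j += 4) {
            dst[i] = ((src[j] & 0xFF) << 24)
                   | ((src[j + 1] & 0xFF) << 16)
                   | ((src[j + 2] & 0xFF) << 8)
                   | (src[j + 3] & 0xFF);
        }
    }

    /** Variant 2: bulk conversion through an IntBuffer view (big-endian by default). */
    static void viaIntBuffer(byte[] src, int[] dst, int count) {
        ByteBuffer.wrap(src).asIntBuffer().get(dst, 0, count);
    }

    /** Variant 3: one readInt() call per value; DataInputStream stands in for IndexInput. */
    static void readIntLoop(byte[] src, int[] dst, int count) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(src))) {
            for (int i = 0; i < count; i++) {
                dst[i] = in.readInt();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] src = {0, 0, 0, 1, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFE};
        int[] a = new int[2];
        int[] b = new int[2];
        int[] c = new int[2];
        manual(src, a, 2);
        viaIntBuffer(src, b, 2);
        readIntLoop(src, c, 2);
        // All three variants decode the same values.
        System.out.println(Arrays.equals(a, b) && Arrays.equals(b, c));
    }
}
```

The sketch only shows that the three variants are functionally equivalent; which one is fastest depends on the JVM and on the IndexInput implementation, which is what the benchmark tables above measure.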
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992757#comment-12992757 ]

Robert Muir commented on LUCENE-2903:
-------------------------------------

Hello, I don't see the new files you referred to in the patch.

Maybe the new files were not added to svn with 'svn add' before making the patch?

Attachments: LUCENE-2903.patch, LUCENE_2903.patch, LUCENE_2903.patch
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992809#comment-12992809 ]

hao yan commented on LUCENE-2903:
---------------------------------

Just uploaded. Sorry.

Attachments: LUCENE-2903.patch, LUCENE_2903.patch, LUCENE_2903.patch
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992237#comment-12992237 ]

hao yan commented on LUCENE-2903:
---------------------------------

I tried moving memory allocation out of readBlock() and into BlockReader's constructor. It improves the performance a little.

I also tried using ByteBuffer/IntBuffer to replace my manual conversion between byte[] and int[]. It makes things worse.

The following are my results for 0.1M data:

(1) BulkVInt vs patchedFrameOfRef3

||Query||QPS bulkVInt||QPS patchedFrameOfRef3||Pct diff||
|united states|393.55|362.84|-7.8%|
|united states~3|243.84|236.80|-2.9%|
|+nebraska +states|1140.25|998.00|-12.5%|
|+united +states|687.76|633.31|-7.9%|
|doctimesecnum:[1 TO 6]|413.56|427.53|3.4%|
|doctitle:.*[Uu]nited.*|510.46|534.47|4.7%|
|spanFirst(unit, 5)|1240.69|1108.65|-10.6%|
|spanNear([unit, state], 10, true)|511.77|463.18|-9.5%|
|states|1626.02|1483.68|-8.8%|
|u*d|164.23|162.79|-0.9%|
|un*d|257.53|252.97|-1.8%|
|uni*|607.53|591.02|-2.7%|
|unit*|1024.59|1043.84|1.9%|
|united states|627.35|578.70|-7.8%|
|united~0.6|11.51|11.36|-1.3%|
|united~0.75|52.58|53.57|1.9%|
|unit~0.5|12.08|11.93|-1.2%|
|unit~0.7|50.98|51.30|0.6%|

(2) FrameOfRef VS PatchedFrameOfRef3

||Query||QPS patchedFrameOfRef||QPS patchedFrameOfRef3||Pct diff||
|united states|314.76|362.71|15.2%|
|united states~3|227.53|237.08|4.2%|
|+nebraska +states|1075.27|1025.64|-4.6%|
|+united +states|646.41|626.57|-3.1%|
|doctimesecnum:[1 TO 6]|412.88|429.37|4.0%|
|doctitle:.*[Uu]nited.*|481.70|528.82|9.8%|
|spanFirst(unit, 5)|1060.45|1118.57|5.5%|
|spanNear([unit, state], 10, true)|409.33|467.73|14.3%|
|states|1353.18|1479.29|9.3%|
|u*d|158.91|165.98|4.4%|
|un*d|237.36|256.41|8.0%|
|uni*|560.22|593.12|5.9%|
|unit*|946.97|1043.84|10.2%|
|united states|431.22|583.09|35.2%|
|united~0.6|10.91|11.37|4.2%|
|united~0.75|50.30|53.30|5.9%|
|unit~0.5|11.54|11.94|3.5%|
|unit~0.7|47.38|50.38|6.3%|

(3) PatchedFrameOfRef VS PatchedFrameOfRef3

||Query||QPS FrameOfRef||QPS patchedFrameOfRef3||Pct diff||
|united states|326.26|360.49|10.5%|
|united states~3|226.50|234.69|3.6%|
|+nebraska +states|1077.59|1021.45|-5.2%|
|+united +states|648.51|630.52|-2.8%|
|doctimesecnum:[1 TO 6]|324.46|428.45|32.0%|
|doctitle:.*[Uu]nited.*|485.44|527.70|8.7%|
|spanFirst(unit, 5)|1007.05|.11|10.3%|
|spanNear([unit, state], 10, true)|446.03|465.55|4.4%|
|states|1449.28|1459.85|0.7%|
|u*d|158.43|161.79|2.1%|
|un*d|246.37|256.28|4.0%|
|uni*|548.85|594.88|8.4%|
|unit*|920.81|1042.75|13.2%|
|united states|450.65|576.37|27.9%|
|united~0.6|11.07|11.26|1.7%|
|united~0.75|50.70|52.60|3.8%|
|unit~0.5|11.64|11.76|1.0%|
|unit~0.7|49.04|50.70|3.4%|

Attachments: LUCENE_2903.patch, LUCENE_2903.patch
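The first change mentioned in the comment above, moving allocation out of readBlock() and into the constructor, might look roughly like the sketch below. The class is hypothetical and much simpler than the BlockReader in the patch: System.arraycopy stands in for in.readBytes(), and the block is assumed to hold plain big-endian ints rather than PForDelta-compressed data.

```java
// Hypothetical sketch; the real BlockReader in the patch may look different.
final class ReusableBlockReader {
    private final byte[] byteScratch; // allocated once, reused for every block
    private final int[] decoded;      // allocated once, reused for every block

    ReusableBlockReader(int blockSize) {
        byteScratch = new byte[blockSize * 4];
        decoded = new int[blockSize];
    }

    /** Decodes one block of big-endian ints from source without allocating. */
    int[] readBlock(byte[] source, int offset) {
        // Stand-in for in.readBytes(byteScratch, 0, byteScratch.length).
        System.arraycopy(source, offset, byteScratch, 0, byteScratch.length);
        for (int i = 0, j = 0; i < decoded.length; i++, j += 4) {
            decoded[i] = ((byteScratch[j] & 0xFF) << 24)
                       | ((byteScratch[j + 1] & 0xFF) << 16)
                       | ((byteScratch[j + 2] & 0xFF) << 8)
                       | (byteScratch[j + 3] & 0xFF);
        }
        return decoded; // same array instance on every call
    }
}
```

The trade-off of this design is that callers must consume or copy the returned array before the next readBlock() call, since its contents are overwritten.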
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991220#comment-12991220 ]

hao yan commented on LUCENE-2903:
---------------------------------

Hi, Michael

Did you try FrameOfRef and PatchedFrameOfRef?

Attachments: LUCENE_2903.patch, LUCENE_2903.patch
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991222#comment-12991222 ]

hao yan commented on LUCENE-2903:
---------------------------------

And using IntBuffer.put/get sure complicates the PForDelta algorithm a lot.

Attachments: LUCENE_2903.patch, LUCENE_2903.patch
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991221#comment-12991221 ]

hao yan commented on LUCENE-2903:
---------------------------------

Hi, Paul

I tested ByteBuffer/IntBuffer; it is not faster than converting int[] <-> byte[] manually.

Attachments: LUCENE_2903.patch, LUCENE_2903.patch
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990967#comment-12990967 ]

Michael McCandless commented on LUCENE-2903:
--------------------------------------------

bq. Do you have any other data sets?

I'll try using europarl. I have a line file w/ one paragraph per line = 5.6M docs, 3.2GB = ~620 UTF8 bytes per doc (smaller than the line file we use for Wikipedia, which targets ~1024 bytes per line).

Attachments: LUCENE_2903.patch, LUCENE_2903.patch
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990988#comment-12990988 ]

Michael McCandless commented on LUCENE-2903:
--------------------------------------------

Results on Europarl (each paragraph is a doc):

||Query||QPS bulkvint||QPS pfor3||Pct diff||
|doctimesecnum:[1 TO 6]|16.73|12.91|{color:red}-22.8%{color}|
|spanFirst(unit, 5)|5214.47|4143.21|{color:red}-20.5%{color}|
|spanNear([unit, state], 10, true)|869.71|719.62|{color:red}-17.3%{color}|
|united states|320.66|266.50|{color:red}-16.9%{color}|
|united states~3|212.07|187.75|{color:red}-11.5%{color}|
|u*d|41.09|36.90|{color:red}-10.2%{color}|
|unit~0.7|94.11|85.34|{color:red}-9.3%{color}|
|un*d|68.38|62.09|{color:red}-9.2%{color}|
|+united +states|440.68|406.08|{color:red}-7.8%{color}|
|united states|272.68|255.73|{color:red}-6.2%{color}|
|states|552.36|532.76|{color:red}-3.5%{color}|
|unit~0.5|18.86|18.67|{color:red}-1.0%{color}|
|uni*|47.96|47.65|{color:red}-0.6%{color}|
|united~0.6|23.82|23.69|{color:red}-0.5%{color}|
|unit*|435.99|437.09|{color:green}0.3%{color}|
|doctitle:.*[Uu]nited.*|24.16|24.31|{color:green}0.6%{color}|
|+nebraska +states|35010.33|36809.36|{color:green}5.1%{color}|
|united~0.75|172.36|195.18|{color:green}13.2%{color}|

Attachments: LUCENE_2903.patch, LUCENE_2903.patch
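For reading the table above, the "Pct diff" column appears to be the relative change of the second QPS column against the first. That is an assumption inferred from the numbers rather than something stated in the thread, and the helper below is hypothetical, but it reproduces the published rows to within rounding.

```java
import java.util.Locale;

// Hypothetical helper; assumes Pct diff = (candidate - baseline) / baseline * 100.
final class PctDiff {
    static double pctDiff(double baselineQps, double candidateQps) {
        return (candidateQps - baselineQps) / baselineQps * 100.0;
    }

    public static void main(String[] args) {
        // Rows taken from the Europarl table: doctimesecnum:[1 TO 6] and united~0.75.
        System.out.println(String.format(Locale.ROOT, "%.1f%%", pctDiff(16.73, 12.91)));
        System.out.println(String.format(Locale.ROOT, "%.1f%%", pctDiff(172.36, 195.18)));
    }
}
```

A negative value means the pfor3 codec served fewer queries per second than bulkvint for that query; a positive value means it served more.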
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991004#comment-12991004 ]

Michael McCandless commented on LUCENE-2903:
--------------------------------------------

So, the QPS in that run are absurdly high for most queries... I think we need different queries to test against Europarl.

Attachments: LUCENE_2903.patch, LUCENE_2903.patch
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990186#comment-12990186 ]
Paul Elschot commented on LUCENE-2903:

When the IntBuffer is produced by ByteBuffer.asIntBuffer() and that ByteBuffer is produced from a byte[], this IntBuffer can be used to compress data into, on an int-by-int basis. After that, this byte[] can be written directly to an IndexOutput. What is it that cannot be avoided?
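A minimal sketch of the approach described in this comment, using only the JDK. The class and method names (IntViewSketch, packThroughView) are hypothetical and are not taken from any of the patches; the final hand-off to an IndexOutput is only indicated in a comment.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.util.Arrays;

// Hypothetical illustration, not code from the LUCENE-2903 patches.
final class IntViewSketch {

    /** Writes ints into a byte[] through an IntBuffer view of that same array. */
    static byte[] packThroughView(int[] values) {
        byte[] backing = new byte[values.length * 4];
        // wrap() shares the array; asIntBuffer() creates an int view over the same bytes.
        IntBuffer view = ByteBuffer.wrap(backing).asIntBuffer();
        for (int i = 0; i < values.length; i++) {
            view.put(i, values[i]); // each int lands directly in 'backing'
        }
        // 'backing' could now be passed to a bulk byte write,
        // for example something like out.writeBytes(backing, 0, backing.length).
        return backing;
    }

    public static void main(String[] args) {
        // ByteBuffer defaults to big-endian, so this prints [0, 0, 0, 1, 1, 2, 3, 4]
        System.out.println(Arrays.toString(packThroughView(new int[] {1, 0x01020304})));
    }
}
```

Whether this is faster than a hand-written conversion loop is a separate question that only a benchmark can answer; the sketch only shows that no second array is needed.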
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990214#comment-12990214 ]
hao yan commented on LUCENE-2903:

I think the above step essentially also needs to do the int-byte-int conversion. So, there is no reason it would save more than when I do it manually.
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990260#comment-12990260 ]
Paul Elschot commented on LUCENE-2903:

The conversion is done in place, no data is being copied. Have a look here:
http://download.oracle.com/javase/1.5.0/docs/api/java/nio/ByteBuffer.html#views
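The "in place" claim can be checked with a few lines of plain JDK code. The following is a small sketch with hypothetical names (SharedViewSketch and its two methods); it only demonstrates that the byte[] and the IntBuffer view share the same storage, and says nothing about relative speed.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

// Hypothetical illustration of view buffers sharing storage with the wrapped array.
final class SharedViewSketch {

    /** Changes the byte[] after the view exists, then reads through the view. */
    static int readAfterByteWrite() {
        byte[] backing = new byte[4];
        IntBuffer view = ByteBuffer.wrap(backing).asIntBuffer();
        backing[3] = 42;      // lowest-order byte in the default big-endian order
        return view.get(0);   // the view observes the change: 42
    }

    /** Writes through the view, then reads the byte[] directly. */
    static byte lastByteAfterIntWrite() {
        byte[] backing = new byte[4];
        IntBuffer view = ByteBuffer.wrap(backing).asIntBuffer();
        view.put(0, 7);
        return backing[3];    // the array observes the write: 7
    }

    public static void main(String[] args) {
        System.out.println(readAfterByteWrite());     // prints 42
        System.out.println(lastByteAfterIntWrite());  // prints 7
    }
}
```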
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990365#comment-12990365 ]
Michael McCandless commented on LUCENE-2903:

Hmm, the last patch still seems to have disabled the allOnes opto in BulkVInt? (I turned it back on in my checkout.) Also it seems to have an added dep to Kamikaze, but it's not in fact needed?

I ran a perf test on 10M Wikipedia docs, of BulkVInt vs PatchedFrameOfRef3, on Linux:

||Query||QPS bulkvint||QPS pfor3||Pct diff||
|+nebraska +states|104.70|66.87|-36.1%|
|united states|12.19|9.05|-25.7%|
|united states|16.56|13.46|-18.7%|
|spanNear([unit, state], 10, true)|43.37|35.84|-17.4%|
|+united +states|19.51|16.27|-16.6%|
|united~0.6|8.52|7.64|-10.4%|
|united~0.75|13.05|11.74|-10.1%|
|states|47.99|43.40|-9.6%|
|u*d|9.64|8.89|-7.8%|
|spanFirst(unit, 5)|157.25|145.20|-7.7%|
|united states~3|5.80|5.37|-7.4%|
|unit~0.5|17.32|16.12|-6.9%|
|doctimesecnum:[1 TO 6]|12.40|11.68|-5.8%|
|un*d|19.11|18.41|-3.7%|
|unit*|33.44|32.44|-3.0%|
|unit~0.7|27.27|26.90|-1.4%|
|doctitle:.*[Uu]nited.*|5.34|5.30|-0.8%|
|uni*|18.89|18.81|-0.4%|

Somehow it's substantially slower... but I haven't tested how the other PFor impl we have compares. Not sure what's up...
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990480#comment-12990480 ]
hao yan commented on LUCENE-2903:

Yes. The other PFOR impls (FrameOfRef and PatchedFrameOfRef) are even slower (as long as you set -server when you run them). I am also wondering why. Actually, I think the Wikipedia data is kind of biased. Do you have any other data sets?
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989692#comment-12989692 ]
Paul Elschot commented on LUCENE-2903:

Just one nitpick about the codec name containing 'New'. This will be out of date rather soon, so it may be better to simply use an incremental number.
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989754#comment-12989754 ]
hao yan commented on LUCENE-2903:

Hi, Paul. Thanks for the suggestions. I just uploaded a new patch which renames the codec to PatchedFrameOfRef3.

I actually have a question to ask. The BulkVInt codec writes the compressed byte stream as a chunk of bytes. However, in the pfordelta-related codecs the compressed results are ints, so I have to either write single ints in a loop, or first convert the int array to a byte array and then call out.writeBytes(). Do you know any smarter way to write an int array to an IndexOutput?

Another thing I tried is to make PForDelta itself produce byte-wise compressed results. However, in my experiments this slows down pfordelta significantly.

Also, I do not think the NIO buffer used in FrameOfRef and PatchedFrameOfRef helps, since essentially it is like first converting the int array to a byte array and then calling writeBytes(). Do you have any good suggestions? Thanks!
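The two options named in the question can be sketched as follows. This is an illustration only: java.io.DataOutputStream stands in for Lucene's IndexOutput (an assumption made so that the sketch runs without Lucene), and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration; DataOutputStream is a stand-in, not Lucene's IndexOutput.
final class IntArrayWriteSketch {

    /** Option 1: one writeInt() call per value. */
    static byte[] writeOneByOne(int[] values) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            for (int v : values) {
                out.writeInt(v); // DataOutputStream writes the high byte first
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Option 2: convert the int[] to a byte[] by hand, then issue one bulk write. */
    static byte[] convertThenWrite(int[] values) {
        byte[] buf = new byte[values.length * 4];
        for (int i = 0, j = 0; i < values.length; i++) {
            int v = values[i];
            buf[j++] = (byte) (v >>> 24);
            buf[j++] = (byte) (v >>> 16);
            buf[j++] = (byte) (v >>> 8);
            buf[j++] = (byte) v;
        }
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.write(buf, 0, buf.length); // a single call for the whole block
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }
}
```

Both methods produce the same bytes; they differ only in how many calls reach the output and where the int-to-byte conversion happens, which is the cost being discussed in this thread.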
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989856#comment-12989856 ]
Paul Elschot commented on LUCENE-2903:

One way to get an underlying byte array from an IntBuffer is by using ByteBuffer.asIntBuffer() to allocate the IntBuffer via a ByteBuffer from the byte array. Would that be possible here? I remember using this for testing the original (P)FOR implementation with Lucene's IndexInput/IndexOutput.

I did not look at any code to answer this though, so please holler if this is a dead end.
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989872#comment-12989872 ]
hao yan commented on LUCENE-2903:

Yes, using ByteBuffer.asIntBuffer() is the same as converting an int/byte array to a byte/int array. I think the underlying implementation of ByteBuffer.asIntBuffer() cannot avoid that. I also tried ByteBuffer/IntBuffer, though; the result is worse, which makes sense since it may incur extra costs.

Where to holler? :)
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989480#comment-12989480 ]
Robert Muir commented on LUCENE-2903:

Just curious: why does the patch remove BulkVInt's optimization for blocks of all 1's (it writes only a 0-byte header in this case)?

{noformat}
+      allOnes = false;
       if (allOnes) {
         // the most common int pattern (all 1's)
         // write a special header (numBytes=0) for this case.
{noformat}

This is an important optimization, I think: besides the fact that it's the most common bit pattern [1], it's efficient: a single compare-to-zero for the entire block of 128 ints, and it takes care of several worst cases for vint: blocks of all-1 doc deltas (something more commonly seen in structured data, but still the most common pattern in unstructured, stopwordish things), and all-1 freqs (e.g. where you should have omitTF'ed). Depending on block size this significantly reduces the .doc/.freq files for vint, and it still helps in the pure unstructured case (I measured this with luceneutil).

[1] http://portal.acm.org/citation.cfm?id=1712668

Furthermore, I was thinking that along the lines of this allOnes trick, we could evaluate an alternative to the Sep file layout: at least we should consider interleaving .doc and .freq (block of doc deltas, block of freqs). With this interleaved layout, something only interested in doc deltas can just read the freq byte header and skip these bytes to bypass all the freqs... omitTF is then implemented automatically for a lot of cases (though this wouldn't be equivalent to Lucene's manually-set omitTFAP today, as positions would still exist).

If you did manually set omitTF, we could arguably just write this same 0-byte header for freq blocks, which means all-1 freqs, and not have so much specialization and so many different codepaths.
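A minimal sketch of the all-1's header trick described in this comment, written from the description alone and not copied from the BulkVInt codec in the bulk branch. The class name, method names, and exact byte layout (a vint numBytes header followed by vint-encoded values) are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration of a numBytes=0 header meaning "every value is 1".
final class AllOnesBlockSketch {

    /** Encodes a non-empty block as: vint numBytes header, then numBytes of vint payload. */
    static byte[] encodeBlock(int[] block) {
        boolean allOnes = true;
        for (int v : block) {
            if (v != 1) { allOnes = false; break; }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (allOnes) {
            out.write(0); // special header: no payload follows
            return out.toByteArray();
        }
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        for (int v : block) writeVInt(payload, v);
        byte[] bytes = payload.toByteArray();
        writeVInt(out, bytes.length); // never 0 here: each value needs at least one byte
        out.write(bytes, 0, bytes.length);
        return out.toByteArray();
    }

    /** Decodes one block; a zero header is the only check needed for the all-1's case. */
    static int[] decodeBlock(byte[] in, int blockSize) {
        int[] pos = {0};
        int numBytes = readVInt(in, pos);
        int[] block = new int[blockSize];
        if (numBytes == 0) {
            Arrays.fill(block, 1);
            return block;
        }
        for (int i = 0; i < blockSize; i++) block[i] = readVInt(in, pos);
        return block;
    }

    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80); // low 7 bits plus continuation flag
            v >>>= 7;
        }
        out.write(v);
    }

    static int readVInt(byte[] in, int[] pos) {
        int b = in[pos[0]++];
        int v = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in[pos[0]++];
            v |= (b & 0x7F) << shift;
        }
        return v;
    }
}
```

In this sketch a block of 128 ones costs a single byte instead of 129, and a reader that does not need the values of a block can skip ahead by numBytes after reading the header, which is the property the interleaved .doc/.freq idea above relies on.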
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989532#comment-12989532 ]
hao yan commented on LUCENE-2903:

Hi, Robert. Sorry, that was a mistake. I commented that out just for debugging, to see if it affected the performance. I should have changed it back. I will attach a new patch. Thanks for pointing that out.