[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-24 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599534#comment-14599534
 ] 

Benedict edited comment on CASSANDRA-9499 at 6/24/15 3:05 PM:
--

Obviously this is out-of-scope, but the only real problem a stream solves is 
ensuring there is always enough room. That and being easily passed to JDK 
tools, and supporting functionality like vint encoding. However we could 
certainly explore a "lightweight stream" API that exposes the BB and just has 
ensureRemaining() exposed (plus WritableByteChannel methods for calls that may 
be too large for a buffer). vint coding can be done via static method calls. 
*if* it's workable, I would be strongly in favour of this, as right now the 
method invocation costs for writing/reading streams are really significant. It 
isn't a small undertaking, though. But nor is it that huge. 

It is definitely something we should explore, to see how viable it is. In a 
follow-up ticket, of course :)


was (Author: benedict):
Obviously this is out-of-scope, but the only real problem a stream solves is 
ensuring there is always enough room. That and being easily passed to JDK 
tools, and supporting functionality like vint encoding. However we could 
certainly explore a "lightweight stream" API that exposes the BB and just has 
ensureRemaining() exposed. vint coding can be done via static method calls. 
*if* it's workable, I would be strongly in favour of this, as right now the 
method invocation costs for writing/reading streams are really significant. It 
isn't a small undertaking, though. But nor is it that huge. 

It is definitely something we should explore, to see how viable it is. In a 
follow-up ticket, of course :)

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-23 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598108#comment-14598108
 ] 

Ariel Weisberg edited comment on CASSANDRA-9499 at 6/23/15 6:13 PM:


If you're busy let me know and I can do it.


was (Author: aweisberg):
If your busy let me know and I can do it.

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-23 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597846#comment-14597846
 ] 

Ariel Weisberg edited comment on CASSANDRA-9499 at 6/23/15 3:48 PM:


That doesn't make the tempBuffer in DataOutputPlus any safer. If a 
DataOutputPlus ends up wrapping another DataOutputPlus it could go poorly.

Also are we doing this as two or one commits?


was (Author: aweisberg):
That doesn't make the tempBuffer in DataOutputPlus any safer. If a 
DataOutputPlus ends up wrapping another DataOutputPlus it could go poorly.

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-23 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597435#comment-14597435
 ] 

Benedict edited comment on CASSANDRA-9499 at 6/23/15 10:11 AM:
---

[~snazy]: While reviewing this, I noticed that OHC provides a 
DataInput/DataOutput, which results in pretty inefficient serialization between 
the two (especially since we wrap, and do vint encoding). Could we update its 
interface to provide a ByteBuffer instead? This would also permit us to remove 
two boiler-plate forwarding classes, which would be nice.


was (Author: benedict):
[~snazy]: While reviewing this, I noticed that OHC provides a 
DataInput/DataOutput, which results in pretty inefficient serialization between 
the two (especially since we wrap, and do vint encoding). Could we update its 
interface to provide a ByteBuffer instead?

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-23 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597432#comment-14597432
 ] 

Benedict edited comment on CASSANDRA-9499 at 6/23/15 10:10 AM:
---

[~aweisberg]: i've pushed a small further tweak 
[here|https://github.com/belliottsmith/cassandra/tree/C-9499-madness] to 
simplify the readMinimum logic, and reduce the number of operations (and size 
of code) on the normal path.

I'm also ambivalent about not using the VInt methods in the switch for 
SerializingCache, but it's not super-important (since obviously the inner 
serialization will not until 8099, but 8099 is unlikely to touch this 
particular place)

edit: also, noticed we didn't move the {{tempBuffer}} encoding approach to the 
{{VIntCoding.writeUnsignedVInt}} version. Seems like an easy win?

Otherwise, LGTM. If you can squash and otherwise prepare for commit, we can get 
it into mainline.




was (Author: benedict):
[~aweisberg]: i've pushed a small further tweak 
[here|https://github.com/belliottsmith/cassandra/tree/C-9499-madness] to 
simplify the readMinimum logic, and reduce the number of operations (and size 
of code) on the normal path.

I'm also ambivalent about not using the VInt methods in the switch for 
SerializingCache, but it's not super-important (since obviously the inner 
serialization will not until 8099, but 8099 is unlikely to touch this 
particular place)

Otherwise, LGTM. If you can squash and otherwise prepare for commit, we can get 
it into mainline.

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-19 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592307#comment-14592307
 ] 

Ariel Weisberg edited comment on CASSANDRA-9499 at 6/19/15 2:46 PM:


I have the proposed encoding implemented and have the beginnings of an 
efficient implementation for BufferedDataOutputStreamPlus and NIOInputStream. 
There are probably still branches that could be removed as well as bit fiddling 
that could be more efficient and clearer. I am going to try and clean it up, 
but I could use feedback.

I couldn't come up with an efficient implementation for computeUnsignedVIntSize 
(when you haven't encoded it yet).
 
I also created the branch 
https://github.com/aweisberg/cassandra/commits/C-9499-madness where I removed 
Encoded*Stream and changed the serializers to use DataInputPlus which extends 
DataInput to add the varint decoding methods.

I am using 1s for the extension bits. I am also emitting the bytes in little 
endian order although it seems like I would need to do that at least for the 
first byte. I could emit the rest of the bytes in big endian order for the 
getLong(). Right now I have to reverse them because the ByteBuffer is big 
endian. l am wagering it is faster than swapping the settings.


was (Author: aweisberg):
I have the proposed encoding implemented and have the beginnings of an 
efficient implementation for BufferedDataOutputStreamPlus and NIOInputStream. 
There are probably still branches that could be removed as well as bit fiddling 
that could be more efficient and clearer. I am going to try and clean it up, 
but I could use feedback.

I couldn't come up with an efficient implementation for computeUnsignedVIntSize 
(when you haven't encoded it yet).
 
I also created the branch 
https://github.com/aweisberg/cassandra/commits/C-9499-madness where I removed 
Encoded*Stream and changed the serializers to use DataInputPlus which extends 
DataInput to add the varint decoding methods. I haven't rebased that on top of 
C-9499 yet.

I am using 1s for the extension bits. I am also emitting the bytes in little 
endian order although it seems like I would need to do that at least for the 
first byte. I could emit the rest of the bytes in big endian order for the 
getLong(). Right now I have to reverse them because the ByteBuffer is big 
endian. l am wagering it is faster than swapping the settings.

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-18 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592307#comment-14592307
 ] 

Ariel Weisberg edited comment on CASSANDRA-9499 at 6/18/15 7:40 PM:


I have the proposed encoding implemented and have the beginnings of an 
efficient implementation for BufferedDataOutputStreamPlus and NIOInputStream. 
There are probably still branches that could be removed as well as bit fiddling 
that could be more efficient and clearer. I am going to try and clean it up, 
but I could use feedback.

I couldn't come up with an efficient implementation for computeUnsignedVIntSize 
(when you haven't encoded it yet).
 
I also created the branch 
https://github.com/aweisberg/cassandra/commits/C-9499-madness where I removed 
Encoded*Stream and changed the serializers to use DataInputPlus which extends 
DataInput to add the varint decoding methods. I haven't rebased that on top of 
C-9499 yet.

I am using 1s for the extension bits. I am also emitting the bytes in little 
endian order although it seems like I would need to do that at least for the 
first byte. I could emit the rest of the bytes in big endian order for the 
getLong(). Right now I have to reverse them because the ByteBuffer is big 
endian. l am wagering it is faster than swapping the settings.


was (Author: aweisberg):
I have the proposed encoding implemented and have the beginnings of an 
efficient implementation for BufferedDataOutputStreamPlus and NIOInputStream. 
There is probably still branches or that could be removed as well as making the 
bit fiddling more efficient and clearer. I am going to try and clean it up, but 
I could use feedback.

I couldn't come up with an efficient implementation for computeUnsignedVIntSize 
(when you haven't encoded it yet).
 
I also created the branch 
https://github.com/aweisberg/cassandra/commits/C-9499-madness where I removed 
Encoded*Stream and changed the serializers to use DataInputPlus which extends 
DataInput to add the varint decoding methods. I haven't rebased that on top of 
C-9499 yet.

I am using 1s for the extension bits. I am also emitting the bytes in little 
endian order although it seems like I would need to do that at least for the 
first byte. I could emit the rest of the bytes in big endian order for the 
getLong(). Right now I have to reverse them because the ByteBuffer is big 
endian. l am wagering it is faster than swapping the settings.

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-17 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589470#comment-14589470
 ] 

Benedict edited comment on CASSANDRA-9499 at 6/17/15 8:47 AM:
--

I'm confused as to why we need 10 bytes? Pretty much by definition a 
continuation bit encoding needs 9 bytes to represent 8 bytes. Let's not use 
Google's implementation. It looks pretty rubbish. 

The reason they use 10 bytes is they cannot be bothered to realise the last 
byte does not need a continuation bit. They also use a *terrible* 
implementation for deciding how long it needs to be.

Here's  a simple change which makes it 9 bytes, and easily optimised: the 
continuation bits are all shifted to the first byte, which effectively encodes 
the length in run-length encoding, i.e. the number of contiguous top order bits 
that are set to 1. i.e. {{length = Integer.numberOfLeadingZeros(firstByte ^ 
0x)}}

This way we read the first byte, and if there are any more to read, we read a 
long, and quickly truncate.


was (Author: benedict):
I'm confused as to why we need 10 bytes? Pretty much by definition a 
continuation bit encoding needs 9 bytes to represent 8 bytes. Let's not use 
Google's implementation. It looks pretty rubbish. 

The reason they use 10 bytes is they cannot be bothered to realise the last 
byte does not need a continuation bit. They also use a *terrible* 
implementation for deciding how long it needs to be.

Here's  a simple change which makes it 9 bytes, and easily optimised: the 
continuation bits are all shifted to the first byte, which effectively encodes 
the length in run-length encoding, i.e. the number of contiguous top order bits 
that are set to 1. i.e. {{length = Integer.numberOfLeadingZeros(firstByte ^ 
(byte) -1)}}

This way we read the first byte, and if there are any more to read, we read a 
long, and quickly truncate.

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-16 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588727#comment-14588727
 ] 

Ariel Weisberg edited comment on CASSANDRA-9499 at 6/16/15 9:25 PM:


I don't think the method we currently use is analogous to what protocol buffers 
is doing.

https://developers.google.com/protocol-buffers/docs/encoding#varints

The first description is a method of using the last bit of each byte as a 
continuation bit. We are doing a length prefix and not extension.

https://github.com/google/protobuf/blob/master/java/src/main/java/com/google/protobuf/CodedOutputStream.java#L1213

The encoding on the wire is unsigned and an additional transform (zigzag) is 
applied for signed numbers.

This sounds like it is closer to what everyone is looking for. Want a bias 
towards positive numbers? We could do unsigned with a small amount of 
additional code.


was (Author: aweisberg):
I don't think the method we currently use is analogous to what protocol buffers 
is doing.

https://developers.google.com/protocol-buffers/docs/encoding#varints

The first description is of a method is just using the last bit of each byte as 
a continuation bit. We are doing a length prefix and not extension.

https://github.com/google/protobuf/blob/master/java/src/main/java/com/google/protobuf/CodedOutputStream.java#L1213

The encoding on the wire is unsigned and an additional transform (zigzag) is 
applied for signed numbers.

This sounds like it is closer to what everyone is looking for. Want a bias 
towards positive numbers? We could do unsigned with a small amount of 
additional code.

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588343#comment-14588343
 ] 

Benedict edited comment on CASSANDRA-9499 at 6/16/15 4:41 PM:
--

When I said little endian, it was a typo. Typically we use big endian, so we 
want any method to assume this. I hadn't noticed that they chose to implement 
it was, effectively, little endian encoding. Which is another reason to switch: 
it's confusing to swap between different types of encoding, and breaks 
assumptions. However, if we wanted to implement this optimisation for the 
current encoding scheme, we would just modify if to a shift left instead of an 
&, followed by a Long.reverseBytes() (nb: admittedly without thinking about it 
too rigorously). If we're going to implement a different encoding scheme, 
though, it's probably better to just do that...

That said, it is a shame we didn't standardise on little endian, given our main 
target platform is little endian. I know Java defaults to big endian, but the 
target platform is more important. That's a much scarier change than modifying 
this, though, so let's not go there. Probably ever. 


was (Author: benedict):
When I said little endian, it was a typo. Typically we use big endian, so we 
want any method to assume this. I hadn't noticed that they chose to implement 
it was, effectively, little endian encoding. Which is another reason to switch: 
it's confusing to swap between different types of encoding, and breaks 
assumptions. However, if we wanted to implement this optimisation for the 
current encoding scheme, we would just modify if to a shift left instead of an 
&, followed by a Long.reverseBytes() (nb: admittedly without thinking about it 
too rigorously)

That said, it is a shame we didn't standardise on little endian, given our main 
target platform is little endian. I know Java defaults to big endian, but the 
target platform is more important. That's a much scarier change than modifying 
this, though, so let's not go there. Probably ever. 

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-16 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588326#comment-14588326
 ] 

Ariel Weisberg edited comment on CASSANDRA-9499 at 6/16/15 4:29 PM:


I am getting the wrong answer for Integer.MIN_VALUE trying to mask out the long 
that has the 8-byte value. I get -2357198848.

I replaced 
{noformat}
long i = 0;
for (int idx = 0; idx < len; idx++)
{
byte b = buf.get();
i = i << 8;
i = i | (b & 0xFF);
}
{noformat}

with

{noformat}
long i = buf.getLong(buf.position());
i &= (-1L >>> (64 - (len * 8)));
{noformat}

I guess this comes down to the endianess issue. Right now NIODataInputStream 
doesn't handle little endian anything. The buffer is private and there is no 
constructor for providing one nor a method for changing the order of the 
underlying buffer. Is this a dead end, or pick a different implementation 
depending on byte order?




was (Author: aweisberg):
I am getting the wrong answer for Integer.MIN_VALUE trying to mask out the long 
that has the 8-byte value. I get -2357198848.

I replaced 
{noformat}
long i = 0;
for (int idx = 0; idx < len; idx++)
{
byte b = buf.get();
i = i << 8;
i = i | (b & 0xFF);
}
{noformat}

with

{noformat}
long i = buf.getLong(buf.position());
i &= (-1L >>> (64 - (len * 8)));
{noformat}

I guess this comes down to the endianess issue. Right now NIODataInputStream 
doesn't handle little endian anything. The buffer is private and there is no 
constructor for providing one nor a method for changing the order of the 
underlying buffer. Is this a dead end, pick a different implementation 
depending on byte order?



> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-16 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588212#comment-14588212
 ] 

Ariel Weisberg edited comment on CASSANDRA-9499 at 6/16/15 3:29 PM:


Another question, how do we know all the ByteBuffers are little endian? Is that 
the case right now? I that the DataOutput/Input are big endian?

Just saw your response. If we aren't going to shoot for it then endian won't be 
an obstacle.


was (Author: aweisberg):
Another question, how do we know all the ByteBuffers are little endian? Is that 
the case right now? I that the DataOutput/Input are big endian?

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587967#comment-14587967
 ] 

Benedict edited comment on CASSANDRA-9499 at 6/16/15 12:42 PM:
---

It's worth noting, of course, that the existing scheme (including a 
positive-only variant supporting 250 values) has a very narrow useful band 
around that unlikeliness threshold (just 250..256 for positive-only!), after 
which it jumps up to 3 bytes, instead of 2. So the poorly behaved users are 
penalised more strongly. I kind of expect we'll have either everyone behaving 
well (so 1 byte in either scheme), or people behaving quite badly, and jumping 
up to 3 bytes. But either way, some boundary usecases win, some lose. It's 
always the way.


was (Author: benedict):
It's worth noting, of course, that the existing scheme has a very narrow useful 
band around that unlikeliness threshold, after which it jumps up to 3 bytes, 
instead of 2. So the poorly behaved users are penalised more strongly. I kind 
of expect we'll have either everyone behaving well (so 1 byte in either 
scheme), or people behaving quite badly, and jumping up to 3 bytes. But either 
way, some boundary usecases win, some lose. It's always the way.

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587908#comment-14587908
 ] 

Benedict edited comment on CASSANDRA-9499 at 6/16/15 12:02 PM:
---

bq. I assumed that knowing a given integer can't be negative could lead to more 
efficient encoding

Well, we could have two different encodings, but with the current scheme this 
would mostly help values in the range of [128..250). I'm not sure if that's 
worth confusing everything for.

However if we change the encoding, we can bias towards positive encodings, 
since they're more common. I'm somewhat inclined to use a hybrid extending bits 
scheme. A starting suggestion:

* first byte: 2bits of length; followed by, if any of the first bits are set, 1 
sign bit; followed by, if all length bits are set, 2 more bits of length; if 
all of those length bits are set, one more length bit; the remainder (2-6 bits) 
are value bits\*
* all remaining bytes contain value bits only

This would lead to the following encoding sizes
||value range||suggested scheme size||existing scheme size||
|0..63|1|1|
|64..8K|2|mostly 3, 64..128=1, 128..256=2|
|-8192..0|2|mostly 3, -112..0=1, -256..-112=1|
|8K..2M|3|mostly 4|

etc.

So, we basically lose out a small amount for values in the range 64..128, and 
-256..-1. Everything else we gain. If we wanted to further bias towards 
positive encoding, we could require that at least one sign bit is present for 
the signbit to be present, so that negative numbers cannot be encoded in less 
than 3 bytes.

\* unless all the length bits are set, in which case we just immediately bump 
to 8 bytes and call getLong()


was (Author: benedict):
bq. I assumed that knowing a given integer can't be negative could lead to more 
efficient encoding

Well, we could have two different encodings, but with the current scheme this 
would mostly help values in the range of [128..250). I'm not sure if that's 
worth confusing everything for.

However if we change the encoding, we can bias towards positive encodings, 
since they're more common. I'm somewhat inclined to use a hybrid extending bits 
scheme. A starting suggestion:

* first byte: 2bits of length; followed by, if any of the first bits are set, 1 
sign bit; followed by, if all length bits are set, 2 more bits of length; the 
remainder (3-6 bits) are value bits
* all remaining bytes contain value bits only

This would lead to the following encoding sizes
||value range||suggested scheme size||existing scheme size||
|0..63|1|1|
|64..8K|2|mostly 3, 64..128=1, 128..256=2|
|-8192..0|2|mostly 3, -112..0=1, -256..-112=1|
|8K..2M|3|mostly 4|

etc.

So, we basically lose out a small amount for values in the range 64..128, and 
-256..-1. Everything else we gain. If we wanted to further bias towards 
positive encoding, we could require that at least one sign bit is present for 
the signbit to be present, so that negative numbers cannot be encoded in less 
than 3 bytes.

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587771#comment-14587771
 ] 

Benedict edited comment on CASSANDRA-9499 at 6/16/15 11:14 AM:
---

There would be no overconsumption: we would _read_ more bytes that we needed, 
but we would not move the buffer position until after. This method would work 
for both the writer and the reader, and I don't see where the problems you 
raise occur. The only ugliness would be, in the reader, _potentially_ extending 
limit() (in the rare case we don't have any real bytes present) and resetting 
it when done, but this would be a rare branch to take (although it could be 
done unconditionally, since it's cheap), i.e. set {{limit(position + 8)}} 
before; and set {{limit(previousLimit)}} after.

For the writer there is no such ugliness, so we should at least implement this 
approach there: we build the long value we want, put it without moving the 
position, and then shift the position forwards by only the populated length.


was (Author: benedict):
There would be no overconsumption: we would _read_ more bytes that we needed, 
but we would not move the buffer position until after. This method would work 
for both the writer and the reader, and I don't see where the problems you 
raise occur. The only ugliness would be, in the reader, _potentially_ extending 
limit() (in the rare case we don't have any real bytes present) and resetting 
it when done, but this would be a rare branch to take (although it could be 
done unconditionally, since it's cheap), i.e. set {{limit(position + 8)}} 
before; and set {{limit(previousLimit)}} after.

For the writer there is no such ugliness, so we should at least implement this 
approach there: we build the long value we want, put it without moving the 
position, and then shift the position forwards by only the populated length.

However, on the reader, since there _would_ be some ugliness, that may also 
hinder the performance benefit (it would depend on the value being large, 
whereas we probably expect it to be small, if we're using writevInt): we _know_ 
the value is at least two bytes long. So perhaps we should just call getShort() 
before entering the loop, in the expectation/hope that we often won't require 
the loop itself. We could even unroll the 6 iterations of the loop into a 
switch statement with fall-through, since this should cap the branches at 1: 
but it's probably not important to do so, given the method costs likely still 
dominate. 

We could, optionally, also have a fast route for when length == 8, since I 
would expect that typically we willl vint encode either very small values, or 
full long values.

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587843#comment-14587843
 ] 

Benedict edited comment on CASSANDRA-9499 at 6/16/15 10:52 AM:
---

bq. which means we still can change the encoding as it pleases us

Perhaps we should then. That way the size indicator can be determined as (-120 
- value), with the negative values giving [-6..0] and positive [0..6]. Then the 
{{vIntDecoding.vIntIsNegative}} method can vanish, and we can just use the 
signbit of the size, and can truncate the signbit for iteration.

bq. Maybe it could also make sense to have an "unsigned" variant for when we 
know the value written is positive.

Could you give an example of where this would be helpful? I'm guessing we'll be 
using this in places where we have either an int or long, in which case it will 
be decoded into something that is signed, and I can't see us using a long 
instead of an int, just for the extra bit (we may as well just use the whole 
long, if we're vint encoding it)?

bq. And as a nit, I think it would be convenient in practice to have methods 
for int 

Agreed. We can simply provide default methods in the DataOutputStreamPlus and 
(a concrete method in) {{AbstractDataInput}}, that just wrap the casts/calls to 
the long variants. If there are no other implementations, it's quite likely 
they will get inlined (although I don't have time to hunt through hotspot to 
confirm if inlining decisions occur before or after the vtable resolution of 
default methods, I hope they did it in a way that supported it)


was (Author: benedict):
bq. which means we still can change the encoding as it pleases us

Perhaps we should then. That way the size indicator can be determined as (-120 
- value), with the negative values giving [-6..0] and positive [0..6]. Then the 
{{vIntDecoding.vIntIsNegative}} method can vanish, and we can just use the 
signbit of the size, and can truncate the signbit for iteration.

bq. Maybe it could also make sense to have an "unsigned" variant for when we 
know the value written is positive.

Could you give an example of where this would be helpful? I'm guessing we'll be 
using this in places where we have either an int or long, in which case it will 
be decoded into something that is signed, and I can't see us using a long 
instead of an int, just for the extra bit (we may as well just use the whole 
long, if we're vint encoding it)?

bq. And as a nit, I think it would be convenient in practice to have methods 
for int 

Agreed. We can simply provide default methods in the DataOutputStreamPlus and 
DataInputPlus, that just wrap the casts/calls to the long variants. If there 
are no other implementations, it's quite likely they will get inlined  
(although I don't have time to hunt through hotspot to confirm if inlining 
decisions occur before or after the vtable resolution of default methods, I 
hope they did it in a way that supported it)

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587785#comment-14587785
 ] 

Benedict edited comment on CASSANDRA-9499 at 6/16/15 10:37 AM:
---

One final nit: I'm not super keen on the assert in 
{{vIntDecoding.vIntIsNegative}} -it's possible the compiler will eliminate the 
unnecessary expressions- I've misread it twice now already.

{{assert (value < -120 || (value >= -112 && value < 0)) == value < -120}}

it seems we're really just asserting {{!(value >= -112 && value < 0)}}, i.e. 
that we've dealt with the existing negative numbers. Really we should be just 
asserting {{value < -112}}, since we should have dealt with all of the simple 
values already, and should just be left with the size encodings, no?


was (Author: benedict):
One final nit: I'm not super keen on the assert in 
{{vIntDecoding.vIntIsNegative}}: it's possible the compiler will eliminate the 
unnecessary expressions, but

{{assert (value < -120 || (value >= -112 && value < 0)) == value < -120}}

will fail in exactly the same situations as

{{assert value < -120 || value >= 0}}

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587785#comment-14587785
 ] 

Benedict edited comment on CASSANDRA-9499 at 6/16/15 10:27 AM:
---

One final nit: I'm not super keen on the assert in 
{{vIntDecoding.vIntIsNegative}}: it's possible the compiler will eliminate the 
unnecessary expressions, but

{{assert (value < -120 || (value >= -112 && value < 0)) == value < -120}}

will fail in exactly the same situations as

{{assert value < -120 || value >= 0}}


was (Author: benedict):
One final nit: I'm not super keen on the assert in 
{{vIntDecoding.vIntIsNegative}}: it's possible the compiler will eliminate the 
unnecessary expressions, but

{{assert (value < -120 || (value >= -112 && value < 0)) == value < -120}}

will fail in exactly the same situations as

{{assert value < -120}}

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587785#comment-14587785
 ] 

Benedict edited comment on CASSANDRA-9499 at 6/16/15 10:26 AM:
---

One final nit: I'm not super keen on the assert in 
{{vIntDecoding.vIntIsNegative}}: it's possible the compiler will eliminate the 
unnecessary expressions, but

{{assert (value < -120 || (value >= -112 && value < 0)) == value < -120}}

will fail in exactly the same situations as

{{assert value < -120}}


was (Author: benedict):
One final nit: I'm not super keen on the assert in 
{{vIntDecoding.vIntIsNegative}}: it's possible the compiler will eliminate the 
unnecessary expressions, but

{{assert (value < -120 || (value >= -112 && value < 0)) == value < -120 }}

will fail in exactly the same situations as

{{ assert value < -120 }}

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9499) Introduce writeVInt method to DataOutputStreamPlus

2015-06-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587084#comment-14587084
 ] 

Benedict edited comment on CASSANDRA-9499 at 6/16/15 12:32 AM:
---

Thanks.

I know Cap'n Proto and our existing code uses a loop, but why not just use {{8 
- Integer.numberOfLeadingZeros\(v\)/8}} ?

The cyclic dependency between {{EncodedDIS}} and AbstractDIS is a bit confusing 
to me. I'd rather we simply marked {{EncodedDIS @Deprecated}}, and moved all of 
the implementation details somewhere that's acyclic.

The read method we can make quite a bit more efficient, with a special version 
of {{prepareReadPrimitive}}: we want the result to be that {{length}} bytes are 
in the buffer for consumptions,but that we are also 8 bytes or more before the 
end of the buffer. This way we can just call 
{{buffer.getLong(buffer.position())}}, then advance its position by {{length}} 
and truncate the long with {{ & (-1L >>> (64 - (length * 8)) }} (where 
{{length}} here excludes the initial size byte)

It occurs to me there's nothing stopping us using this approach for writing as 
well, simply calling putLong(), with an optional writeByte() to fill in the 
most-significant byte if it was non-empty.


was (Author: benedict):
Thanks.

I know Cap'n Proto and our existing code uses a loop, but why not just use {{8 
- Integer.numberOfLeadingZeros\(v\)/8}} ?

The cyclic dependency between {{EncodedDIS}} and AbstractDIS is a bit confusing 
to me. I'd rather we simply marked {{EncodedDIS @Deprecated}}, and moved all of 
the implementation details somewhere that's acyclic.

The read method we can make quite a bit more efficient, with a special version 
of {{prepareReadPrimitive}}: we want the result to be that {{length}} bytes are 
in the buffer for consumptions,but that we are also 8 bytes or more before the 
end of the buffer. This way we can just call 
{{buffer.getLong(buffer.position())}}, then advance its position by {{length}} 
and truncate the long with {{ & (-1L >>> (64 - (length * 8)) }} (where 
{{length}} here excludes the initial size byte)

> Introduce writeVInt method to DataOutputStreamPlus
> --
>
> Key: CASSANDRA-9499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9499
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> CASSANDRA-8099 really could do with a writeVInt method, for both fixing 
> CASSANDRA-9498 but also efficiently encoding timestamp/deletion deltas. It 
> should be possible to make an especially efficient implementation against 
> BufferedDataOutputStreamPlus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)