[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-04-01 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390400#comment-14390400
 ] 

Robert Stupp commented on CASSANDRA-8670:
-

Your assumption is correct: 
{{jdk/src/solaris/native/java/net/SocketOutputStream.c}} is the Linux source.
What I don't understand is why 
{{Java_java_net_SocketOutputStream_socketWrite0}} does not just use the given 
{{byte[]}} but instead copies it to a stack/heap buffer.

 Large columns + NIO memory pooling causes excessive direct memory usage
 ---

 Key: CASSANDRA-8670
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8670
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
 Fix For: 3.0

 Attachments: OutputStreamBench.java, largecolumn_test.py


 If you provide a large byte array to NIO and ask it to populate the byte 
 array from a socket, it will allocate a thread-local byte buffer that is the 
 size of the requested read, no matter how large that is. Old IO wraps new IO 
 for sockets (but not files), so old IO is affected as well.
 Even if you are using Buffered{Input | Output}Stream you can end up passing a 
 large byte array to NIO: the byte-array read method will pass the array to 
 NIO directly if it is larger than the internal buffer.
 Passing large cells between nodes as part of intra-cluster messaging can 
 cause the NIO pooled buffers to quickly reach a high watermark and stay 
 there. This ends up costing 2x the largest cell size, because there is one 
 buffer for input and one for output (they are on different threads). This is 
 further multiplied by the number of nodes in the cluster minus 1, since each 
 node has a dedicated thread pair with separate thread locals.
 Anecdotally it appears that the cost is doubled beyond that, although it 
 isn't clear why. Possibly the control connections, or possibly there is some 
 way in which multiple 
 We need a workload in CI that tests the advertised limits of cells on a 
 cluster. It would be reasonable to ratchet down the max direct memory for the 
 test to trigger failures if a memory pooling issue is introduced. I don't 
 think we need to test concurrently pulling in a lot of them, but it should at 
 least work serially.
 The obvious fix for this issue would be to read in smaller chunks when 
 dealing with large values. I think "small" should still be relatively large 
 (4 megabytes) so that code reading from a disk can amortize the cost of a 
 seek. It can be hard to tell what the underlying source being read from will 
 be in some of the contexts where we might choose to implement switching to 
 reading chunks.
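
 A minimal sketch of the chunked-read idea described above (the names and the 
 4 MB chunk size are illustrative, not taken from the patch):

{code:java}
import java.io.IOException;
import java.io.InputStream;

public final class ChunkedReader
{
    // Large enough to amortize a disk seek, small enough to bound the size of
    // any thread-local direct buffer NIO allocates per read.
    private static final int CHUNK_SIZE = 4 * 1024 * 1024;

    public static void readFully(InputStream in, byte[] target) throws IOException
    {
        int offset = 0;
        while (offset < target.length)
        {
            int toRead = Math.min(CHUNK_SIZE, target.length - offset);
            int read = in.read(target, offset, toRead);
            if (read < 0)
                throw new IOException("EOF after " + offset + " of " + target.length + " bytes");
            offset += read;
        }
    }
}
{code}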





[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-04-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390405#comment-14390405
 ] 

Benedict commented on CASSANDRA-8670:
-

bq. What I don't understand is why 
Java_java_net_SocketOutputStream_socketWrite0 does not just use the given 
byte[] but instead copies it to a stack/heap buffer.

Time to safepoint. Using the array directly would require pausing GC until the 
socket write had completed, which could be lengthy (since the write is 
blocking, it's unbounded, in fact).
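
For illustration, the pattern at issue looks roughly like the following 
(hypothetical names, not the actual JDK source): the blocking write copies the 
heap array into a temporary direct buffer sized to the whole request, which is 
what lets GC move the array freely while the thread is blocked in native code, 
but is also what inflates per-thread direct memory for large cells.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

final class NaiveHeapWrite
{
    // Illustration only: copy the whole heap array into a direct buffer sized
    // to the request, then hand that to the channel. The copy means GC never
    // has to keep the array pinned across the blocking native write, but the
    // direct buffer grows with the size of the write (i.e. the cell size).
    static void write(WritableByteChannel channel, byte[] src) throws IOException
    {
        ByteBuffer direct = ByteBuffer.allocateDirect(src.length);
        direct.put(src);
        direct.flip();
        while (direct.hasRemaining())
            channel.write(direct);
    }
}
{code}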



[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-04-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390561#comment-14390561
 ] 

Benedict commented on CASSANDRA-8670:
-

bq. In most cases we opt to use the channel when available. ... Where we could 
really do better is when using compression

You're right. I thought we had more paths left over, but it really is just 
compression. That's a much cleaner state of affairs!

It looks like compression will typically stay under the stack buffer size, so 
we don't have such a problem there, but it would still be nice to move to a 
DirectByteBuffer compressed output stream. It should be quite viable since 
there are now compression methods that work over these directly, but that 
should probably be a follow-up ticket. Do you want to file it, or shall I?

bq. BufferedDataOutputStreamPlus.close() will clean the buffer even if it might 
not own the buffer, such as when it is provided to the constructor. It also 
doesn't check if the channel is null.

We don't call close in situations where this is a problem.

bq. When used from the commit log without a channel it will throw an NPE if it 
exceeds the capacity of the buffer when it goes to flush

We presize the buffer correctly.

Still, on both these counts we could offer better safety if we wanted, without 
much cost. If we used a DataOutputBuffer it would solve these problems, and we 
could assert that the final buffer is == the provided buffer...

bq. It almost seems like there is a case for a version just for wrapping 
fixed-size ByteBuffers.

Possibly. I'm on the fence, since it's extra code-cache and class-hierarchy 
pollution with limited positive impact: we would need to duplicate the whole 
of BufferedDataOutputStreamPlus. You could benchmark to see if there's an 
appreciable difference? If the difference is small, I would prefer to avoid the 
pollution.



[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-04-01 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390570#comment-14390570
 ] 

Ariel Weisberg commented on CASSANDRA-8670:
---

CASSANDRA-8887 would be part of it. Maybe an umbrella ticket, and do that first 
to figure out how to hook the compression libraries up. Can you file?



[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-04-01 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390541#comment-14390541
 ] 

Ariel Weisberg commented on CASSANDRA-8670:
---

In what scenarios do we end up actually writing to the native methods right 
now? In most cases we opt to use the channel when available. The only time we 
should be using WrappedDataOutputStreamPlus is when we aren't actually writing 
to the channel of a file or socket.

Where we could really do better is when using compression: we could have the 
compression wrap a direct buffer which wraps the channel, and avoid relying on 
the built-in mechanisms for getting data off heap.

I had some other refrigerator moments overnight as well.

BufferedDataOutputStreamPlus.close() will clean the buffer even if it might not 
own the buffer, such as when it is provided to the constructor. It also doesn't 
check if the channel is null.

When used from the commit log without a channel, it will throw an NPE if it 
exceeds the capacity of the buffer when it goes to flush. I suppose one runtime 
exception is as good as another. There is also the extra bounds check in 
ensureRemaining(), which always seemed a little useless to me, and the fact 
that it will not drop through and do efficient copies for direct byte buffers. 
It almost seems like there is a case for a version just for wrapping fixed-size 
ByteBuffers.
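
A hedged sketch of the defensive close() behaviour being discussed here (not 
the committed code): only clean a buffer the stream allocated itself, and 
tolerate a null channel.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

abstract class SafeBufferedOutputSketch
{
    protected final WritableByteChannel channel;  // may be null, e.g. commit log usage
    protected ByteBuffer buffer;
    private final boolean ownsBuffer;             // false when the buffer was passed in

    protected SafeBufferedOutputSketch(WritableByteChannel channel, ByteBuffer buffer, boolean ownsBuffer)
    {
        this.channel = channel;
        this.buffer = buffer;
        this.ownsBuffer = ownsBuffer;
    }

    protected abstract void doFlush() throws IOException;

    protected abstract void clean(ByteBuffer buffer);

    public void close() throws IOException
    {
        doFlush();
        if (channel != null)
            channel.close();
        if (ownsBuffer && buffer != null && buffer.isDirect())
            clean(buffer);  // only free memory we allocated ourselves
        buffer = null;      // fail fast on use-after-close
    }
}
{code}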



[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-04-01 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390595#comment-14390595
 ] 

Ariel Weisberg commented on CASSANDRA-8670:
---

On more thought, I don't think having close clean the buffer is great if we 
allow people to supply one. It could lead to a double free or a use-after-free 
if someone makes a mistake.

I think we should at least have FileUtil.clean null out the pointer to the 
buffer before/after cleaning (whichever is possible) so we get as immediate a 
failure as possible. Duplicates or slices of the buffer won't pick up that the 
pointer was nulled, but it's better than nothing. We should also assert that 
the buffer wasn't provided in the constructor, and throw an exception if 
someone did that and then called close.
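
A sketch of the clean-and-null idea (illustrative only; the DirectBuffer/cleaner 
cast shown is the pre-Java-9 way to explicitly free a direct buffer, not 
necessarily what the helper above does):

{code:java}
import java.nio.ByteBuffer;

final class CleanAndNull
{
    private ByteBuffer buffer;  // direct buffer owned by this object

    // Free the direct memory, then drop the reference so any later use fails
    // immediately with an NPE instead of touching freed memory. (Duplicates or
    // slices held elsewhere are still dangerous, as noted above.)
    void close()
    {
        ByteBuffer buf = buffer;
        buffer = null;  // null out the pointer around cleaning
        if (buf != null && buf.isDirect())
            ((sun.nio.ch.DirectBuffer) buf).cleaner().clean();
    }
}
{code}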



[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-04-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390598#comment-14390598
 ] 

Benedict commented on CASSANDRA-8670:
-

bq. Also, on trying to understand code cache pollution: how much do you really 
know about how many instructions are emitted when the JVM can inline and 
duplicate stuff out the yin yang?

Well, there are a lot of heuristics to apply, which are admittedly limited and 
imperfect. But in general: HotSpot won't inline megamorphic call sites, and 
even bimorphic call sites are unlikely to be (I think probably never) inlined, 
only given a static-dispatch fast path. These heuristics are enough to guide 
decisions around this sufficiently well, in my experience. If there are 
multiple implementations viable at any moment, then the call site will not be 
inlined (at most its location will be), and even if it _is_ inlined, this 
doesn't necessarily pollute the code cache, since inlined methods are small (as 
are most methods) and adjacent occupancy of the cache is essentially free, 
unless the overall method size exceeds the cache line boundary.

In general, my view is that the simplest heuristic is: if the benefit is small 
and it increases the number of active call sites, then let's not. This works 
from both a code-management and a cache-pollution perspective.
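
For readers unfamiliar with the terminology, a rough illustration with 
hypothetical types (not from the patch): the JIT will happily inline a call 
site that only ever sees one or two concrete implementations, but once a third 
shows up it falls back to plain virtual dispatch.

{code:java}
interface Sink { void write(int b); }

final class ArraySink implements Sink { public void write(int b) { /* ... */ } }
final class ChannelSink implements Sink { public void write(int b) { /* ... */ } }
final class CountingSink implements Sink { public void write(int b) { /* ... */ } }

final class CallSites
{
    // If every caller passes ArraySink, this call site is monomorphic and can be inlined.
    // With ArraySink and ChannelSink it is bimorphic: a guarded fast path, little or no inlining.
    // Add CountingSink into the mix and it becomes megamorphic: plain virtual dispatch.
    static void writeAll(Sink sink, byte[] bytes)
    {
        for (byte b : bytes)
            sink.write(b);
    }
}
{code}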



[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-04-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390630#comment-14390630
 ] 

Benedict commented on CASSANDRA-8670:
-

bq. We should also assert that the buffer wasn't provided in the constructor 
and throw an exception if someone did that and then called close.

If we forbid this entirely, and only expose it via DataOutputBuffer, then the 
close method is harmless (as it's a no-op).
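
Roughly along these lines (a sketch, not the actual class): because the buffer 
can only be allocated internally, close() has nothing external to free.

{code:java}
import java.nio.ByteBuffer;

class DataOutputBufferSketch
{
    protected ByteBuffer buffer = ByteBuffer.allocate(128);  // always self-owned, on heap

    // Grow instead of flushing when the buffer runs out of room.
    void ensureRemaining(int bytes)
    {
        if (buffer.remaining() < bytes)
        {
            ByteBuffer grown = ByteBuffer.allocate(Math.max(buffer.capacity() * 2, buffer.position() + bytes));
            buffer.flip();
            grown.put(buffer);
            buffer = grown;
        }
    }

    public void close() { /* no-op: we own the heap buffer, GC reclaims it */ }
}
{code}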



[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-31 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14388757#comment-14388757
 ] 

Ariel Weisberg commented on CASSANDRA-8670:
---

Unit tests pass and I am +1 on your changes.



[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-31 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14388515#comment-14388515
 ] 

Benedict commented on CASSANDRA-8670:
-

I've pushed one more round of changes 
[here|https://github.com/belliottsmith/cassandra/tree/8670-2], after your 
follow-up round (which I mention for posterity). I've made the following 
changes; let me know your thoughts on them:

* Merged writeUTF into one method, with a fast path _only_ for ASCII 
characters, since this is likely to benefit most from unrolling, and the 
instruction-cache pollution effect is small (see the sketch after this list). 
The two separate but nearly identical and very large methods look almost 
certain to be worse, due to icache misses, than a single branch that is mostly 
predicted correctly, especially when we had multiple branches inside the loop, 
which were each more likely to be mispredicted. As a follow-up commit, in case 
you're worried by this, I've introduced a branchless version of sizeOfChar 
(which we may be able to optimise further), but I haven't performed any 
benchmarks to measure the difference in effect.
* Reverted the new hollowBuffer approach for array-backed buffers - I couldn't 
see a reason for not just directly invoking the write(byte[]) methods?
* Based SafeMemoryWriter on DataOutputBuffer.
* Shared UBDOSP.utfBytes and DOSP.WBC.buf in the same ThreadLocal.
* Preferred bb.hasArray() to bb.isDirect(), since it is a concrete method, so 
it can be inlined.
* Moved writeUTFLegacy into the test case, since it's only used for tests now.
* Fixed formatting in UnbufferedDataOutputStreamPlus (this seemed a good 
opportunity to standardise it).
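
For reference, the shape of the merged method (a simplified sketch, not the 
committed code): one scan with an ASCII-only fast loop that falls back to the 
general modified-UTF-8 encoding on the first non-ASCII character. It assumes 
the two-byte length prefix required by DataOutput.writeUTF has already been 
written.

{code:java}
import java.io.DataOutput;
import java.io.IOException;

final class WriteUtfSketch
{
    static void writeBody(String s, DataOutput out) throws IOException
    {
        int i = 0;
        // Fast path: ASCII (and non-NUL, since modified UTF-8 encodes NUL in two bytes)
        for (; i < s.length(); i++)
        {
            char c = s.charAt(i);
            if (c == 0 || c > 0x7F)
                break;
            out.write(c);
        }
        // Slow path: general modified UTF-8 for the remainder
        for (; i < s.length(); i++)
        {
            char c = s.charAt(i);
            if (c != 0 && c <= 0x7F)
            {
                out.write(c);
            }
            else if (c <= 0x7FF)
            {
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            }
            else
            {
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
    }
}
{code}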



[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-30 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386852#comment-14386852
 ] 

Benedict commented on CASSANDRA-8670:
-

I've just pushed another update, which undoes part of the SequentialWriter 
changes, since all writes (especially for compression) should go through the SW 
itself. It also makes one further minor change: it stops using 
Channels.newChannel() everywhere, since that introduces an unnecessary extra 
layer of byte shuffling when DataOutputPlus implements a compatible method that 
can be called directly.
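
Schematically, the two alternatives look like this (a sketch; the 
BufferAwareOutput interface stands in for the DataOutputPlus method mentioned 
above and is not the real API):

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

final class TwoWaysToWriteABuffer
{
    // Indirect: wrap the stream in a channel; every write shuffles bytes
    // through the adapter (and, for heap-backed streams, an extra array copy).
    static void viaChannelAdapter(OutputStream out, ByteBuffer buf) throws IOException
    {
        WritableByteChannel channel = Channels.newChannel(out);
        while (buf.hasRemaining())
            channel.write(buf);
    }

    // Direct: if the output type already accepts a ByteBuffer, call that
    // method and skip the adapter entirely.
    interface BufferAwareOutput { void write(ByteBuffer buf) throws IOException; }

    static void direct(BufferAwareOutput out, ByteBuffer buf) throws IOException
    {
        out.write(buf);
    }
}
{code}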



[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386007#comment-14386007
 ] 

Benedict commented on CASSANDRA-8670:
-

I've pushed some suggestions for further refactoring 
[here|https://github.com/belliottsmith/cassandra/tree/8670-suggestions]. I've 
only looked at the overall class hierarchy; I haven't focused yet on reviewing 
the method implementation changes.

Mostly these changes flatten the class hierarchy; it has gotten deep enough 
that I don't think there's a good reason to maintain the distinction between 
DataStreamOutputPlus and DataStreamOutputPlusAndChannel, especially since we 
often just mock up a Channel based off the OutputStream. I've also flattened 
NIODataOutputStream and DataOutputStreamByteBufferPlus into 
BufferedDataOutputStreamPlus, since we only write to the buffer if we don't 
exceed its size. At the same time, since we are now refactoring this whole 
hierarchy, I made DataOutputBuffer extend BufferedDataOutputStreamPlus, with it 
just ensuring the buffer grows as necessary, and have removed 
FastByteArrayOutputStream since we no longer need it.

I've also stopped SequentialWriter implementing WritableByteChannel, and now 
pass in its internal Channel, since that's the only way the operations will 
benefit. As a follow-up ticket, we should probably move SequentialWriter to 
using BufferedDataOutputStreamPlus directly, so that it can benefit from faster 
encoding of primitives.

Let me know what you think of the changes to the hierarchy, and once we've 
ironed that out we can move on to the home stretch and confirm the code 
changes. One other thing we could consider is dropping the "Plus" from 
everything except the interface, since it seems superfluous and it's all 
fairly verbose.
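
A skeleton of the flattened hierarchy being discussed, for orientation only 
(the relationships are approximate and the "Sketch" names are placeholders, 
not the real classes):

{code:java}
import java.io.DataOutput;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

// DataOutput plus buffer-aware writes.
interface DataOutputPlusSketch extends DataOutput
{
    void write(ByteBuffer buffer) throws IOException;
}

// Light merge of OutputStream and the interface, with no wrapped stream of its own.
abstract class DataOutputStreamPlusSketch extends OutputStream implements DataOutputPlusSketch {}

// Buffered implementation that writes through a channel, flushing when the buffer fills.
abstract class BufferedDataOutputStreamPlusSketch extends DataOutputStreamPlusSketch {}

// Same behaviour except the buffer grows instead of flushing,
// replacing FastByteArrayOutputStream.
abstract class DataOutputBufferSketch extends BufferedDataOutputStreamPlusSketch {}
{code}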




[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-27 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384556#comment-14384556
 ] 

Ariel Weisberg commented on CASSANDRA-8670:
---

I think I covered what we talked about. I followed quite a few threads and this 
is where they led me. I don't feel like I made a dent in terms of having less 
code in wide use.

AbstractDataOutputStreamAndChannelPlus (formerly AbstractDataOutput) is still 
pretty firmly entrenched.




[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-26 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381722#comment-14381722
 ] 

Benedict commented on CASSANDRA-8670:
-

bq. What tool are you using to review?

I like to navigate in IntelliJ, and on the command line, so having a clean run 
of commits helps a lot.

After a bit of consideration, I think there's a good justification for 
introducing a whole new class if we intend to fully replace 
DataStreamOutputAndChannel, largely because the two write paths are not at all 
clear and appear to be different (the old versions of the write paths being 
hard to actually pin down in the VM source). So having a solid handle on how it 
behaves, and ensuring fewer code paths are executed, seems a good thing. As 
such, I think this patch should replace DSOaC entirely and remove it from the 
codebase. I also think this is a good opportunity to share its code with 
DataOutputByteBuffer, and in doing so hopefully make that faster, potentially 
improving performance of CL append (it doesn't need to extend 
AbstractDataOutput, and would share most of its implementation with 
NIODataOutputStream if it did not).

A few comments in NIODataInputStream:

* readNext() should assert it is never shuffling more than 7 bytes; in fact 
ideally this would be done by readMinimum() to make it clearer
* readNext() should IMO never shuffle unless it's at the end of its capacity; 
if it hasRemaining() and limit() != capacity() it should read on from its 
current limit (readMinimum can ensure there is room to fully meet its 
requirements)
* readUnsignedShort() could simply be: {{return readShort() & 0xFFFF;}}
* available() should return the bytes in the buffer at least
* ensureMinimum() isn't clearly named, since it is more intrinsically linked to 
primitive reads than the name suggests, consuming the bytes and throwing EOF if 
it cannot read. Something like preparePrimitiveRead() (no fixed idea myself, I 
just think it does more than "ensureMinimum" implies)

A few comments in NIODataOutputStreamPlus:
* close() should flush
* close() should clean the buffer
* why the use of hollowBuffer? For clarity in case of restoring the cursor 
position during exceptions? Would be helpful to clarify with a comment. It 
seems like perhaps this should only be used for the first branch, though, since 
the second should have no risk of throwing an exception, so we can safely 
restore the position. It seems like it might be best to make hollowBuffer 
default to null, and instantiate it only if it is larger than our buffer size, 
otherwise first flushing our internal buffer if we haven't got enough room. 
This way we should rarely need the hollowBuffer.
* We should either extend our AbstractDataOutput, or make our writeUTF method 
public static, so we can share it

Finally, it would be nice if we didn't need to stash the OutputStream version 
separately. Perhaps we can reorganise the class hierarchy so that 
DataOutputStreamPlus doesn't wrap an internal OutputStream but is just a light 
abstract-class merge of the types OutputStream and DataOutputPlus. We can 
introduce a WrappedDataOutputStreamPlus in its place, and AbstractDataOutput 
could extend our new DataOutputStreamPlus instead of the other way around (with 
Wrapped... extending _it_). Then we can just stash a DataOutputStreamPlus in 
all cases. Sound reasonable?
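
Sketches of a couple of the NIODataInputStream suggestions above (illustrative 
only; readNext() is assumed to refill the buffer from the channel and return a 
negative value on EOF):

{code:java}
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;

abstract class NioInputSketch
{
    protected ByteBuffer buffer;  // holds the bytes already read from the channel

    protected abstract int readNext() throws IOException;

    // readUnsignedShort expressed in terms of readShort, as suggested above.
    public int readUnsignedShort() throws IOException
    {
        return readShort() & 0xFFFF;
    }

    public short readShort() throws IOException
    {
        prepareToRead(2);
        return buffer.getShort();
    }

    // available() should report at least what is already buffered.
    public int available()
    {
        return buffer.remaining();
    }

    // The ensureMinimum / preparePrimitiveRead idea: make enough bytes
    // available for a primitive read, or fail with EOF.
    protected void prepareToRead(int bytes) throws IOException
    {
        while (buffer.remaining() < bytes)
        {
            if (readNext() < 0)
                throw new EOFException();
        }
    }
}
{code}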


[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-26 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14382524#comment-14382524
 ] 

Ariel Weisberg commented on CASSANDRA-8670:
---

NIODataInputStream
bq. readNext() should assert it is never shuffling more than 7 bytes; in fact 
ideally this would be done by readMinimum() to make it clearer
By "assert", do you mean an assert that compiles out, or a precondition?

bq. readNext() should IMO never shuffle unless it's at the end of its capacity; 
if it hasRemaining() and limit() != capacity() it should read on from its 
current limit (readMinimum can ensure there is room to fully meet its 
requirements)
I guess I don't get when this optimization would help; I could see it hurting. 
You could stream through the buffer without returning to the beginning on a 
regular basis and end up issuing smaller-than-desired reads.

NIODataOutputStreamPlus
bq. available() should return the bytes in the buffer at least
I duplicated the JDK behavior for NIO: DataInputStream for a socket returns 0; 
for a file it returns the bytes remaining to be read from the file. I think 
that makes sense for the API when you don't have a real answer.

bq. why the use of hollowBuffer? For clarity in case of restoring the cursor 
position during exceptions? Would be helpful to clarify with a comment. It 
seems like perhaps this should only be used for the first branch, though, since 
the second should have no risk of throwing an exception, so we can safely 
restore the position. It seems like it might be best to make hollowBuffer 
default to null, and instantiate it only if it is larger than our buffer size, 
otherwise first flushing our internal buffer if we haven't got enough room. 
This way we should rarely need the hollowBuffer.
The contract of the API requires that the incoming buffer not be modified. For 
thread-safety reasons I don't modify the original buffer's position and then 
reset it in a finally block.

I am not sure what you mean by the hollow buffer being larger than our buffer. 
It's hollow, so it has no size. We also use it to copy things into our buffer 
while preserving the original position.
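
For comparison, the duplicate()-based way to achieve the same guarantee looks 
roughly like this (a sketch, not the patch code): the cheap duplicate shares 
the underlying bytes but carries its own position and limit, so the caller's 
buffer is never mutated.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

final class WriteWithoutMutating
{
    // Write the remaining bytes of 'src' without touching its position or limit.
    static void write(WritableByteChannel channel, ByteBuffer src) throws IOException
    {
        ByteBuffer view = src.duplicate();  // independent cursor over the same bytes
        while (view.hasRemaining())
            channel.write(view);
    }
}
{code}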

The rest is reasonable.







[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-26 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14382732#comment-14382732
 ] 

Benedict commented on CASSANDRA-8670:
-

bq. to not have to worry about it ever

I'm not sure you can ever not worry about it, since the default behaviour is 
that methods _do_ modify the position of a BB, so wherever that's a risk you 
have to ensure each thread has its own version via duplicate(). But if that's 
your rationale, just add a comment to explain it and I'm cool with it.

bq. For file IO

This should always be fully populated, but like I said: comments to explain 
(and assertions) will solve everything :)



[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-26 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14382646#comment-14382646
 ] 

Benedict commented on CASSANDRA-8670:
-

bq. By assert you mean an assert that compiles out or a precondition?

I don't mind, really. It's just for clarity

bq. I guess I don't get when this optimization would help; I could see it 
hurting. You could stream through the buffer without returning to the beginning 
on a regular basis and end up issuing smaller-than-desired reads.

It was more a suggestion for clarity - at least, in my opinion. Assuming we 
typically fill the buffer, it isn't really a problem, and if we don't we 
usually have room to fill after it (although if we were to try to fill an 
almost-full buffer it would be a problem; but so is repeatedly shuffling a 
buffer that is regularly very underfilled). But perhaps some more comments 
explaining the behaviour of (and reasoning behind) each branch are a better 
solution, along with assertions to make clear this is not a costly or common 
operation.

bq. DataInputStream for a socket returns 0

DataInputStream isn't buffered. BufferedInputStream's behaviour, and the API 
spec in InputStream#available, suggest we should return the number of bytes we 
have buffered.
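
To make the intent concrete, a minimal sketch of an available() that reports 
the buffered count (a made-up wrapper, with the refill logic omitted):

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

public class BufferedAvailableSketch extends InputStream
{
    private final ByteBuffer buffer;   // bytes already pulled in from the source
    private final InputStream source;

    public BufferedAvailableSketch(ByteBuffer buffer, InputStream source)
    {
        this.buffer = buffer;
        this.source = source;
    }

    @Override
    public int read() throws IOException
    {
        // Simplified: drain the buffer, then fall through to the source (no refill here).
        return buffer.hasRemaining() ? buffer.get() & 0xFF : source.read();
    }

    @Override
    public int available() throws IOException
    {
        // At minimum the bytes we already hold, plus whatever the source reports.
        return buffer.remaining() + source.available();
    }
}
{code}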

bq. We also use it to copy things into our buffer while preserving the original 
position.

In this scenario it is likely cheaper to simply restore the position once done, 
and this approach also means we can typically avoid ever allocating a hollow 
buffer. There is also no (typical) risk of exception, so no reason to use the 
hollow buffer, since we can guarantee we will be able to restore its position.
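
As a concrete illustration of that save-and-restore alternative (hypothetical 
names, not the patch's code):

{code:java}
import java.nio.ByteBuffer;

public class RestorePositionSketch
{
    // Copy 'src' into 'dst' without leaving src's position moved: save it up
    // front and restore it afterwards, rather than allocating a duplicate view.
    public static void copyPreservingPosition(ByteBuffer src, ByteBuffer dst)
    {
        int position = src.position();
        try
        {
            dst.put(src);            // relative bulk put advances src's position
        }
        finally
        {
            src.position(position);  // caller observes no side effect
        }
    }
}
{code}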

bq. I am not sure what you mean by hollow buffer larger than our buffer

I meant the parameter-provided buffer.

There are also some formatting issues I forgot to mention (braces on the wrong 
line, and lots of extra linebreaks between methods)


[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-26 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14382709#comment-14382709
 ] 

Ariel Weisberg commented on CASSANDRA-8670:
---

bq. In this scenario it is likely cheaper to simply restore the position once 
done, and this approach also means we can likely typically avoid ever 
allocating a hollow buffer. There is also no (typical) risk of exception, so no 
reason to use the hollow buffer, since we can guarantee we will be able to 
restore its position.

I've been bitten by very hard-to-find threading bugs caused by multiple threads 
concurrently reading from the same ByteBuffer using relative methods. It's a 
small amount of code to never have to worry about it. It's common to send a 
message referencing an immutable object graph across multiple connections, and 
it ends up not being so immutable because serialization for some object uses a 
ByteBuffer without duplicating it first.

I don't think allocating an extra object per stream should enter into the 
decision making. It's tiny in the big picture.

bq. but so is repeatedly shuffling a buffer that is regularly very underfilled

True, but you only have to shuffle up to 7 bytes, and if the buffer is empty the 
cost of filling it is going to dominate: JNI calls, multiple allocations, a 
context switch to the kernel.

In the case of socket IO we are hoping to buffer the entire message, potentially 
all available ones, so the common case is that the buffer will be empty at the 
end and it doesn't matter much.

For file IO doing a sequential read, the common case will be that there are a 
handful of bytes left because we are reading a multi-byte value. If we don't 
shuffle in that case, we will do all that work to read a few bytes and then go 
back to fill the rest of the buffer a second time.
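
A minimal sketch of that shuffle-then-fill step, assuming the buffer is in read 
mode with only the leftover bytes remaining (names are hypothetical):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

public class CompactAndRefillSketch
{
    // Move the (up to a few) unread bytes to the front of the buffer and fill
    // the rest in one large read, rather than issuing a tiny read now and a
    // second fill immediately afterwards.
    public static void refill(ReadableByteChannel channel, ByteBuffer buffer) throws IOException
    {
        buffer.compact();      // leftover bytes shuffled to position 0, ready for writing
        channel.read(buffer);  // one large read to fill the remainder (EOF handling omitted)
        buffer.flip();         // back to read mode: position 0, limit = bytes available
    }
}
{code}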

I'll update available() to return the buffered data.


[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-25 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379647#comment-14379647
 ] 

Benedict commented on CASSANDRA-8670:
-

bq. Do you want to see it commit by commit or should I squash it?

I don't mind, so long as it's easy to squash myself (the difficulty is that it 
looks to be interleaved with commits from elsewhere right now)


[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-25 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379897#comment-14379897
 ] 

Ariel Weisberg commented on CASSANDRA-8670:
---

What tool are you using to review? GitHub does a good job of handling merge 
commits: it doesn't show them as part of the diff, and it doesn't show them as 
individual commits.

You can also convert any GitHub comparison to a single diff by adding .diff to 
the URL. That's what I used to create a [new 
branch|https://github.com/apache/cassandra/compare/trunk...aweisberg:C-8670-2?expand=1]


[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-24 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378807#comment-14378807
 ] 

Benedict commented on CASSANDRA-8670:
-

Could we get the patch rebased so that all of the commits are adjacent, and not 
interspersed with merges in from 2.1/trunk? It makes it difficult to follow 
exactly what's been changed (I've been guilty of this approach in the past, but 
I think it helps clean review to always ensure every commit for a patch occurs 
at the end of the git log)

It's also worth discussing a potentially simpler approach to this: couldn't we 
wrap DataInputStream, and proxy read(byte[]) to a loop over read(byte[], int, 
int)? For DataOutputPlus we can just change the behaviour of our 
DataOutputStreamPlus for byte[], which would fall through to 
DataOutputStreamAndChannel which could use the code you have for 
write(ByteBuffer) only to duplicate the behaviour here. We could (and probably 
should, when compression is disabled) use that in OTC to remove the indirection 
when filling a ByteBuffer. I'm not saying for sure this is better, but since it 
is much simpler it seems we should refute this approach before attempting 
something more involved?
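
For reference, one minimal shape the wrapping idea could take (the class name 
and the 4MB bound are assumptions, not from the patch); callers that need the 
whole array, e.g. readFully, already loop and so work unchanged:

{code:java}
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkedReadSketch extends FilterInputStream
{
    private static final int CHUNK = 1 << 22;   // 4MB: still large enough to amortise a seek

    public ChunkedReadSketch(InputStream in)
    {
        super(in);
    }

    // FilterInputStream.read(byte[]) dispatches here, so a huge array from a
    // caller is never handed to the underlying NIO layer in one piece, and NIO
    // never grows its thread-local native buffer to the size of the whole value.
    @Override
    public int read(byte[] b, int off, int len) throws IOException
    {
        // Returning fewer bytes than requested is allowed by the InputStream contract.
        return super.read(b, off, Math.min(len, CHUNK));
    }
}
{code}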


[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-24 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378930#comment-14378930
 ] 

Benedict commented on CASSANDRA-8670:
-

bq. I am in favor of doing whatever doesn't have me benchmarking.

Let's not get bogged down in microbenchmarking this stuff. IMO a little 
analysis of the options is sufficient, since either is likely an improvement. I 
don't have a preconceived answer to the question I raised, I just think it's 
worth assessing both options in contrast to each other. It's a bit late for me 
to assess that myself this evening, so I'll aim to collect my thoughts on that 
tomorrow. Feel free to fill in yours if you have time.


[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-24 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378902#comment-14378902
 ] 

Ariel Weisberg commented on CASSANDRA-8670:
---

Do you want to see it commit by commit or should I squash it?

bq. It's also worth discussing a potentially simpler approach to this: couldn't 
we wrap DataInputStream, and proxy read(byte[]) to a loop over read(byte[], 
int, int)? For DataOutputPlus we can just change the behaviour of our 
DataOutputStreamPlus for byte[], which would fall through to 
DataOutputStreamAndChannel which could use the code you have for 
write(ByteBuffer) only to duplicate the behaviour here. We could (and probably 
should, when compression is disabled) use that in OTC to remove the indirection 
when filling a ByteBuffer. I'm not saying for sure this is better, but since it 
is much simpler it seems we should refute this approach before attempting 
something more involved?

I was generally trying to improve the situation by reading/writing to direct 
buffers and, in the read case, not adding another wrapper class and layer of 
indirection. I have no idea what kind of code the JVM ends up generating for 
streams that wrap other streams.

DataOutputStreamAndChannel doesn't work with BufferedOutputStream: it doesn't 
flush the output stream before writing to the channel. We could always change 
that, though.

I am in favor of doing whatever doesn't have me benchmarking.


[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-17 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366114#comment-14366114
 ] 

Ariel Weisberg commented on CASSANDRA-8670:
---

I can confirm that compression determines whether you see an issue with memory 
used by intracluster messaging.

The memory used by Netty shouldn't scale with cluster size the same way. 

Still figuring out how to test for the issue when Netty is dominating direct 
ByteBuffer usage. I don't see a way to get Netty to report on its memory usage, 
nor a way to get the allocations done by NIO tracked.

Monkey patching, where art thou?


[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-16 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364086#comment-14364086
 ] 

Ariel Weisberg commented on CASSANDRA-8670:
---

I have a dtest that can reproduce the issue. I added the current amount of 
in-flight direct ByteBuffer memory (including buffers that haven't been GCed 
yet) to the JMX gcstats output. This is the value from java.nio.Bits.
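
For reference, the same accounting can also be read through the standard JDK 
buffer-pool MXBeans (just an illustration of where the number comes from, not 
necessarily what the dtest does):

{code:java}
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectMemorySketch
{
    // Reads the JDK's own accounting of NIO direct buffers (the counters that
    // java.nio.Bits maintains) without reflecting into internal classes.
    public static long directBytesInUse()
    {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class))
        {
            if ("direct".equals(pool.getName()))
                return pool.getMemoryUsed();
        }
        return -1;   // no "direct" pool bean found
    }
}
{code}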

When I took a heap dump the issue was with Netty pooling memory. Netty had 
pooled 600 megabytes of memory after I serially (and single-threaded) wrote and 
read 5 rows with a single 35-megabyte column each. Each pooled chunk of memory 
was 16 megabytes. I don't know yet where Netty steady-states.

This doesn't match what I recall from the original user report, where the 
memory was being pooled as part of intracluster networking. There may be 
another factor, like the setting for intracluster compression, that influences 
it. It may even be that enabling intracluster compression is a workaround.


[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-12 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14358866#comment-14358866
 ] 

Ariel Weisberg commented on CASSANDRA-8670:
---

I am getting to this now; it should be fixed in 3.0. Once I have it fixed for 
3.0 we can decide about backporting to 2.1.


[jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage

2015-03-07 Thread Evin Callahan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351742#comment-14351742
 ] 

Evin Callahan commented on CASSANDRA-8670:
--

What's the path forward on this?
