[ https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382709#comment-14382709 ]

Ariel Weisberg commented on CASSANDRA-8670:
-------------------------------------------

bq. In this scenario it is likely cheaper to simply restore the position once 
done, and this approach also means we can likely typically avoid ever 
allocating a hollow buffer. There is also no (typical) risk of exception, so no 
reason to use the hollow buffer, since we can guarantee we will be able to 
restore its position.
I've been bitten by very hard-to-find threading bugs caused by multiple 
threads concurrently reading from the same ByteBuffer via relative methods. 
Duplicating is a small amount of code in exchange for never having to worry 
about that. It's common to send a message referencing an "immutable" object 
graph across multiple connections, and it ends up not being so immutable 
because serialization for some object uses a ByteBuffer without duplicating 
it first.
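To illustrate the hazard (a sketch only; the names here are made up and not code from the patch): two connections serializing the same value through a shared ByteBuffer race on its position unless each caller duplicates first.

{code:java}
import java.nio.ByteBuffer;

final class DuplicateBeforeRead
{
    // Unsafe: relative get() advances the shared buffer's position, so two
    // threads serializing the same value race on that position.
    static void writeUnsafe(ByteBuffer shared, ByteBuffer out)
    {
        while (shared.hasRemaining())
            out.put(shared.get());
    }

    // Safe: duplicate() copies only the position/limit bookkeeping, not the
    // bytes, so each caller reads through its own view and the shared
    // buffer's position never moves.
    static void writeSafe(ByteBuffer shared, ByteBuffer out)
    {
        ByteBuffer view = shared.duplicate();
        while (view.hasRemaining())
            out.put(view.get());
    }
}
{code}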

I don't think allocating an extra object per stream should enter into the 
decision making. It's tiny in the big picture.

bq. but so is repeatedly shuffling a buffer that is regularly very underfilled
True, but you only have to shuffle up to 7 bytes, and if the buffer is empty 
the cost of filling it is going to dominate: JNI calls, multiple allocations, 
and a context switch into the kernel.

In the case of socket IO we are hoping to buffer an entire message, and 
potentially all available ones, so the common case is that the buffer is 
empty at the end and the shuffle doesn't matter much.

For file IO doing a sequential read, the common case is that a handful of 
bytes are left over because we stopped partway through a multi-byte value. If 
we don't shuffle in that case, we do all that work to read a few bytes and 
then go back and fill the rest of the buffer in a second pass.
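For concreteness, roughly the shuffle-then-refill being discussed (just a sketch against a generic channel, not the actual stream implementation):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

final class RefillSketch
{
    // buffer is assumed to be in "read mode": position..limit = unread bytes.
    static int refill(ReadableByteChannel channel, ByteBuffer buffer) throws IOException
    {
        buffer.compact();                 // the shuffle: move the few leftover bytes to the front
        int read = channel.read(buffer);  // the expensive part: syscall, JNI, possible allocation
        buffer.flip();                    // back to read mode, old and new bytes contiguous
        return read;
    }
}
{code}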

I'll update available() to return the buffered data.
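Something along these lines (a hypothetical wrapper, only to show the shape of the change):

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

abstract class BufferedChannelStream extends InputStream
{
    protected ByteBuffer buffer; // unread bytes live between position and limit

    @Override
    public int available() throws IOException
    {
        return buffer.remaining(); // buffered data that can be read without blocking
    }
}
{code}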

> Large columns + NIO memory pooling causes excessive direct memory usage
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-8670
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8670
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 3.0
>
>         Attachments: largecolumn_test.py
>
>
> If you provide a large byte array to NIO and ask it to populate that array 
> from a socket, it will allocate a thread-local byte buffer as large as the 
> requested read, no matter how big that is. Old IO wraps new IO for sockets 
> (but not files), so old IO is affected as well.
> Even if you are using Buffered{Input | Output}Stream you can end up passing a 
> large byte array to NIO. The byte array read method will pass the array to 
> NIO directly if it is larger than the internal buffer.
> Passing large cells between nodes as part of intra-cluster messaging can 
> cause the NIO pooled buffers to quickly reach a high watermark and stay 
> there. This ends up costing 2x the largest cell size, because there is one 
> buffer for input and one for output (they are handled by different threads). 
> This is further multiplied by the number of nodes in the cluster minus one, 
> since each node has a dedicated thread pair with separate thread locals.
> Anecdotally it appears that the cost is doubled beyond that although it isn't 
> clear why. Possibly the control connections or possibly there is some way in 
> which multiple 
> Need a workload in CI that tests the advertised limits of cells on a cluster. 
> It would be reasonable to ratchet down the max direct memory for the test to 
> trigger failures if a memory pooling issue is introduced. I don't think we 
> need to test concurrently pulling in a lot of them, but it should at least 
> work serially.
> The obvious fix for this issue would be to read in smaller chunks when 
> dealing with large values. I think "small" should still be relatively large 
> (4 megabytes) so that code reading from disk can amortize the cost of a 
> seek. In some of the contexts where we might choose to switch to reading in 
> chunks, it can be hard to tell what the underlying source of the read is 
> going to be.
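A rough sketch of the chunked-read fix described above (the 4 MB figure comes from the description; the readFully helper is illustrative, not the final implementation):

{code:java}
import java.io.IOException;
import java.io.InputStream;

final class ChunkedReads
{
    static final int CHUNK_SIZE = 4 << 20; // "relatively large": 4 megabytes

    // Fill dest without ever handing the underlying stream (and hence NIO)
    // more than CHUNK_SIZE bytes at a time, so the pooled thread-local
    // direct buffer is capped at CHUNK_SIZE instead of the full value size.
    static void readFully(InputStream in, byte[] dest) throws IOException
    {
        int offset = 0;
        while (offset < dest.length)
        {
            int wanted = Math.min(CHUNK_SIZE, dest.length - offset);
            int read = in.read(dest, offset, wanted);
            if (read < 0)
                throw new IOException("unexpected end of stream at " + offset + " bytes");
            offset += read;
        }
    }
}
{code}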



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
