[ https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381722#comment-14381722 ]

Benedict commented on CASSANDRA-8670:
-------------------------------------

bq. What tool are you using to review?

I like to navigate in IntelliJ, and on the command line, so having a clean run 
of commits helps a lot.

After a bit of consideration, I think there's a good justification for 
introducing a whole new class if we intend to fully replace 
DataOutputStreamAndChannel, largely because its two write paths are not at all 
clear and appear to behave differently (the old versions of the write paths 
are hard to actually pin down in the JVM source). So having a solid handle on 
how it behaves, and ensuring fewer code paths are exercised, seems a good 
thing. As such, I think this patch should replace DataOutputStreamAndChannel 
entirely and remove it from the codebase. I also think this is a good 
opportunity to share its code with DataOutputByteBuffer, and in doing so 
hopefully make that faster, potentially improving performance of CL append (it 
doesn't need to extend AbstractDataOutput, and would share most of its 
implementation with NIODataOutputStreamPlus if it did not).

A few comments in NIODataInputStream:

* readNext() should assert that it never shuffles more than 7 bytes; ideally 
this assertion would live in readMinimum() to make the invariant clearer
* readNext() should IMO never shuffle unless the buffer is at the end of its 
capacity; if it hasRemaining() and limit() != capacity(), it should read on 
from its current limit (readMinimum() can ensure there is room to fully meet 
its requirements); see the sketch after this list
* readUnsignedShort() could simply be {{return readShort() & 0xFFFF;}}
* available() should return at least the number of bytes remaining in the 
buffer
* ensureMinimum() isn't clearly named, since it is more intrinsically tied to 
primitive reads than the name suggests, consuming the bytes and throwing 
EOFException if it cannot read them. Something like preparePrimitiveRead() (no 
fixed idea myself, just think it needs to convey more than "ensure minimum")
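
To make those two shuffle rules concrete, here is roughly what I have in mind 
for readMinimum() (a sketch only: the method and field names follow the 
discussion above, but the body is my illustration, not the actual patch):

{code:java}
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

abstract class RefillSketch
{
    protected ByteBuffer buf;              // read mode: position..limit = unread
    protected ReadableByteChannel channel; // source of bytes

    // Ensure at least 'required' readable bytes. All primitive reads funnel
    // through here, so 'required' is never more than 8 and a refill never
    // has to shuffle more than 7 unread bytes.
    protected void readMinimum(int required) throws IOException
    {
        assert required <= 8 && buf.remaining() < required;

        int start; // index of the first unread byte once we start filling
        if (buf.limit() == buf.capacity())
        {
            // No room left at the end: shuffle the <= 7 unread bytes to
            // the front. compact() leaves buf in write mode.
            buf.compact();
            start = 0;
        }
        else
        {
            // Room remains past limit(): leave the unread bytes in place
            // and append new data from the current limit onwards.
            start = buf.position();
            buf.position(buf.limit());
            buf.limit(buf.capacity());
        }

        while (buf.position() - start < required)
            if (channel.read(buf) < 0)
                throw new EOFException();

        // Back to read mode, with at least 'required' bytes available.
        buf.limit(buf.position());
        buf.position(start);
    }
}
{code}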

A few comments in NIODataOutputStreamPlus:
* close() should flush
* close() should clean the buffer
* why the use of hollowBuffer? For clarity when restoring the cursor position 
during exceptions? It would be helpful to clarify with a comment. It seems 
like perhaps it should only be used for the first branch, though, since the 
second has no risk of throwing an exception, so we can safely restore the 
position ourselves. It might be best to make hollowBuffer default to null and 
instantiate it only when the incoming buffer is larger than our own buffer, 
otherwise first flushing our internal buffer if we haven't got enough room. 
This way we should rarely need the hollowBuffer; see the sketch after this 
list
* We should either extend our AbstractDataOutput, or make our writeUTF method 
public static, so we can share it
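
For the close() and write(ByteBuffer) points, something along these lines is 
what I am picturing (again a sketch, not the patch: 'buffer' and 'channel' are 
assumed fields, FileUtils.clean is the existing helper for freeing a direct 
buffer's memory, and hollowBuffer is approximated with a plain duplicate() for 
simplicity):

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import org.apache.cassandra.io.util.FileUtils;

abstract class FlushCloseSketch extends OutputStream
{
    protected ByteBuffer buffer;           // internal staging buffer (direct)
    protected WritableByteChannel channel; // destination

    @Override
    public void flush() throws IOException
    {
        buffer.flip();
        while (buffer.hasRemaining())
            channel.write(buffer);
        buffer.clear();
    }

    @Override
    public void close() throws IOException
    {
        flush();                  // push any buffered bytes to the channel
        channel.close();
        FileUtils.clean(buffer);  // eagerly free the direct buffer's memory
        buffer = null;            // fail fast on use-after-close
    }

    public void write(ByteBuffer toWrite) throws IOException
    {
        if (toWrite.remaining() > buffer.capacity())
        {
            // Could never be staged: flush what we have, then hand a
            // duplicate to the channel so the caller's cursor survives an
            // exception mid-write.
            flush();
            ByteBuffer dup = toWrite.duplicate();
            while (dup.hasRemaining())
                channel.write(dup);
        }
        else
        {
            // Small enough to stage: make room, then copy via a duplicate
            // so the source's position again stays untouched.
            if (toWrite.remaining() > buffer.remaining())
                flush();
            buffer.put(toWrite.duplicate());
        }
    }
}
{code}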

Finally, it would be nice if we didn't need to stash the OutputStream version 
separately. Perhaps we can reorganise the class hierarchy so that 
DataOutputStreamPlus doesn't wrap an internal OutputStream, but is instead a 
light abstract merge of the types OutputStream and DataOutputPlus. We can 
introduce a WrappedDataOutputStreamPlus in its place, and AbstractDataOutput 
could extend the new DataOutputStreamPlus instead of the other way around 
(with Wrapped... extending _it_). Then we can just stash a DataOutputStreamPlus 
in all cases (sketch below). Sound reasonable?
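
Something like this is the shape I mean (purely a sketch of the hierarchy: 
bodies elided, and the existing DataOutputPlus and AbstractDataOutput types 
are assumed):

{code:java}
// A light abstract merge of OutputStream and DataOutputPlus, with no
// wrapped stream inside it.
public abstract class DataOutputStreamPlus extends OutputStream implements DataOutputPlus
{
    // abstract; concrete subclasses provide the write implementations
}

// AbstractDataOutput extends the new abstract type instead of the other
// way around, so every implementation *is* a DataOutputStreamPlus and can
// be stashed as one.
public abstract class AbstractDataOutput extends DataOutputStreamPlus
{
    // shared writeUTF etc. live here, as today
}

// Introduced in place of the current wrapping DataOutputStreamPlus.
public class WrappedDataOutputStreamPlus extends AbstractDataOutput
{
    protected final OutputStream out;

    public WrappedDataOutputStreamPlus(OutputStream out)
    {
        this.out = out;
    }

    @Override
    public void write(int b) throws IOException
    {
        out.write(b);
    }

    // ... remaining methods forward to 'out'
}
{code}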

> Large columns + NIO memory pooling causes excessive direct memory usage
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-8670
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8670
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 3.0
>
>         Attachments: largecolumn_test.py
>
>
> If you provide a large byte array to NIO and ask it to populate that array 
> from a socket, it will allocate a thread-local direct byte buffer the size 
> of the requested read, no matter how large it is. Old IO wraps new IO for 
> sockets (but not files), so old IO is affected as well.
> Even if you are using Buffered{Input | Output}Stream you can end up passing 
> a large byte array to NIO: the byte array read method will pass the array 
> to NIO directly if it is larger than the internal buffer.
> Passing large cells between nodes as part of intra-cluster messaging can 
> cause the NIO pooled buffers to quickly reach a high watermark and stay 
> there. This ends up costing 2x the largest cell size, because input and 
> output run on different threads and each has its own buffer. This is 
> further multiplied by (number of nodes in the cluster - 1), since each node 
> has a dedicated thread pair with separate thread locals.
> Anecdotally it appears that the cost is doubled beyond that, although it 
> isn't clear why. Possibly the control connections, or possibly there is 
> some way in which multiple
> Need a workload in CI that tests the advertised limits of cells on a cluster. 
> It would be reasonable to ratchet down the max direct memory for the test to 
> trigger failures if a memory pooling issue is introduced. I don't think we 
> need to test concurrently pulling in a lot of them, but it should at least 
> work serially.
> The obvious fix for this issue would be to read in smaller chunks when 
> dealing with large values. I think "small" should still be relatively large 
> (4 megabytes) so that code reading from a disk can amortize the cost of a 
> seek. In some of the contexts where we might choose to switch to reading in 
> chunks, it can be hard to tell what the underlying source being read from 
> is going to be.
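
To illustrate the chunking fix the description proposes (a sketch only: the 4 
megabyte figure comes from the description, while the class and constant names 
here are hypothetical):

{code:java}
import java.io.DataInput;
import java.io.IOException;

public final class ChunkedReads
{
    private static final int CHUNK = 4 << 20; // 4MB, per the description

    // Capping each read at CHUNK bytes means NIO's per-thread temporary
    // direct buffer never grows beyond CHUNK, however large the value is.
    public static void readFully(DataInput in, byte[] target) throws IOException
    {
        for (int offset = 0; offset < target.length; offset += CHUNK)
        {
            int length = Math.min(CHUNK, target.length - offset);
            in.readFully(target, offset, length);
        }
    }
}
{code}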



