[GitHub] spark pull request: Added a FastByteArrayOutputStream that exposes...

srowen Sat, 12 Apr 2014 04:14:23 -0700

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/397#issuecomment-40277866
  
    So I think I agree with the overall direction here, but want to make a few 
comments to clarify why. Apologies if I'm stating the obvious.
    
    The management of the byte array here is not what's solving the immediate 
problem. Where this stuff is used in the code, the calling code wants to write 
N bytes in pieces, and then get a `byte[]` of exactly N bytes. Except in the 
lucky case that N exactly matched the buffer size, this entails an allocation 
and copy. Previously this was hidden in a `trim()` method. The `compact()` 
method would not work directly since it can result in an underlying array of 
more than N bytes.
    
    What's needed is an abstraction that can take a `byte[]` that's possibly 
too large, and a limit N, and broker access in a way that makes it act like it 
is only of length N, without a copy.
    
    Although this is not `ByteBuffer`'s primarily role in life, it can kind of 
play that role. You can wrap bytes 0 to N of a `byte[]` without a copy. And it 
has methods to read through the elements of the buffer. Note that its `array()` 
method itself does not copy, but, also provides access to the whole 
maybe-too-large array underneath.
    
    For that reason, and because calling code in one case already wants a 
`ByteBuffer`, this feels like a good solution. However, other callers need to 
change to use `ByteBuffer` if they care about the allocation. So there's more 
to this than just a drop-in replacement.
    
    However I'd also say that this doesn't need reimplementing 
`ByteArrayOutputStream` so entirely. Just subclass it and expose a 
`toByteBuffer()` method that `wrap()`s the internal `byte[]` with an 
appropriate limit.
    
    I understand the idea of this code is to implement a compaction mechanism, 
but that's a separate issue really. (When does this help? I understand it can 
free up some heap, at the cost of a new second allocation and copy, and could 
be helpful if this object were sticking around a long time. But it's not, 
right?)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Added a FastByteArrayOutputStream that exposes...

Reply via email to