Vincent created SPARK-25034:
-------------------------------

             Summary: possible triple memory consumption in fetchBlockSync()
                 Key: SPARK-25034
                 URL: https://issues.apache.org/jira/browse/SPARK-25034
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.3.0, 2.2.2, 2.4.0
            Reporter: Vincent


Hello

In the code of _fetchBlockSync()_ in _BlockTransferService_, we have:
 
{code:java}
val ret = ByteBuffer.allocate(data.size.toInt)
ret.put(data.nioByteBuffer())
ret.flip()
result.success(new NioManagedBuffer(ret)) 
{code}

In some cases, the _data_ variable is a _NettyManagedBuffer_, whose underlying 
Netty representation is a _CompositeByteBuffer_.

Stepping through the code above in this configuration, assuming the variable 
_data_ holds N bytes:
1) we allocate a full N-byte buffer for _ret_
2) calling _data.nioByteBuffer()_ on a _CompositeByteBuffer_ triggers a full 
merge of all the component buffers, which allocates *another* full N-byte buffer
3) we then copy all N bytes from the merged buffer into _ret_
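For reference, steps 1 and 3 can be reproduced with plain _java.nio_ buffers (a standalone sketch, not Spark code; the _merged_ buffer here stands in for the result of the composite merge in step 2):

{code:scala}
import java.nio.ByteBuffer

// Stand-in for data.nioByteBuffer() after the composite merge (step 2):
// a fully materialized N-byte buffer.
val merged = ByteBuffer.wrap(Array[Byte](1, 2, 3, 4))

// Step 1: a second full-size N-byte allocation.
val ret = ByteBuffer.allocate(merged.remaining())

// Step 3: copy all N bytes from the merged buffer into ret,
// then flip so ret is readable from position 0.
ret.put(merged)
ret.flip()

// After the copy, the same N bytes live in both buffers.
assert(ret.remaining() == 4)
assert(ret.get(0) == 1)
{code}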

This means that at some point the N bytes of data exist in three places in 
memory.
Is this really necessary?
It is unclear to me why we need to process the data at all, given that we 
receive a _ManagedBuffer_ and want to return a _ManagedBuffer_.
Is there something I'm missing here? It seems this whole operation could be 
done with zero copies.
The only upside of the current code is that the returned buffer has all the 
composite buffer's components merged into one contiguous array, but it is not 
clear whether this is intended. Even then, it could be done with a peak memory 
of 2N rather than 3N.
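To illustrate the 2N variant with plain _java.nio_ buffers (again a standalone sketch, not a tested patch against Spark): once the merged buffer exists, it can be exposed directly instead of being copied into a fresh allocation, e.g. by wrapping it in the returned _NioManagedBuffer_. _duplicate()_ below shows that a new readable view needs no extra storage:

{code:scala}
import java.nio.ByteBuffer

// Stand-in for data.nioByteBuffer() after the composite merge (the 2nd N).
val merged = ByteBuffer.wrap(Array[Byte](10, 20, 30))

// Instead of allocate + put + flip (a 3rd N-byte buffer), hand out a view
// of the merged buffer: duplicate() shares the same backing storage, so
// no additional allocation or byte copy takes place.
val ret = merged.duplicate()

assert(ret.remaining() == 3)
assert(ret.get(0) == 10)
{code}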

Cheers!
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
