Vincent created SPARK-25034:
-------------------------------

             Summary: possible triple memory consumption in fetchBlockSync()
                 Key: SPARK-25034
                 URL: https://issues.apache.org/jira/browse/SPARK-25034
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.3.0, 2.2.2, 2.4.0
            Reporter: Vincent
Hello,

In the code of _fetchBlockSync_() in _blockTransferService_, we have:
{code:java}
val ret = ByteBuffer.allocate(data.size.toInt)
ret.put(data.nioByteBuffer())
ret.flip()
result.success(new NioManagedBuffer(ret))
{code}
In some cases, the _data_ variable is a _NettyManagedBuffer_ whose underlying Netty representation is a _CompositeByteBuf_. Going through the code above in this configuration, assuming that _data_ holds N bytes:

1) we allocate a full buffer of N bytes for _ret_
2) calling _data.nioByteBuffer()_ on a _CompositeByteBuf_ triggers a full merge of all the component buffers, which allocates *another* full buffer of N bytes
3) we copy the merged data into _ret_

This means that at some point the N bytes of data exist in three places in memory at once. Is this really necessary? It is unclear to me why we need to process the data at all, given that we receive a _ManagedBuffer_ and want to return a _ManagedBuffer_. Is there something I'm missing here? It seems this whole operation could be done with zero copies. The only upside of the current code is that the resulting buffer has all the composite buffer's arrays merged into one, but it is not clear whether this is intended. In any case, that merge could be done with a peak memory of 2N rather than 3N.

Cheers!
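To make the copy chain concrete, here is a minimal, self-contained sketch using plain java.nio. The composite buffer is simulated as a list of heap chunks; _merge_, _currentPattern_ and _suggestedPattern_ are hypothetical names for illustration only, not Spark or Netty APIs. The point is that the allocate-put-flip pattern allocates a second N-byte buffer on top of the one the merge already produced, whereas returning the merged buffer directly caps the peak at 2N (chunks + merged copy):

```java
import java.nio.ByteBuffer;
import java.util.List;

public class CopyDemo {
    // Simulates CompositeByteBuf.nioByteBuffer(): merging the chunks
    // allocates one full N-byte buffer (one copy on top of the original chunks).
    static ByteBuffer merge(List<ByteBuffer> chunks) {
        int n = chunks.stream().mapToInt(ByteBuffer::remaining).sum();
        ByteBuffer merged = ByteBuffer.allocate(n);          // allocation #1 (N bytes)
        for (ByteBuffer c : chunks) merged.put(c.duplicate()); // duplicate() so chunks stay readable
        merged.flip();
        return merged;
    }

    // Current fetchBlockSync pattern: a second N-byte allocation plus a second
    // copy, so chunks + merged + ret coexist -> peak memory 3N.
    static ByteBuffer currentPattern(List<ByteBuffer> chunks, int n) {
        ByteBuffer ret = ByteBuffer.allocate(n);             // allocation #2 (N bytes)
        ret.put(merge(chunks));                              // second full copy
        ret.flip();
        return ret;
    }

    // Suggested pattern: hand back the merged buffer as-is -> peak memory 2N.
    static ByteBuffer suggestedPattern(List<ByteBuffer> chunks) {
        return merge(chunks);                                // no extra allocate/put
    }

    public static void main(String[] args) {
        List<ByteBuffer> chunks = List.of(
                ByteBuffer.wrap("hello ".getBytes()),
                ByteBuffer.wrap("world".getBytes()));
        // Both patterns yield the same bytes; only peak memory differs.
        System.out.println(new String(currentPattern(chunks, 11).array()));
        System.out.println(new String(suggestedPattern(chunks).array()));
    }
}
```

In the real code the 2N variant would amount to wrapping the buffer returned by _data.nioByteBuffer()_ in the _NioManagedBuffer_ directly, instead of copying it into a freshly allocated _ret_.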