[ https://issues.apache.org/jira/browse/SPARK-22062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201305#comment-16201305 ]

Saisai Shao commented on SPARK-22062:
-------------------------------------

Yes, there is a potential OOM problem here, but it is hard to decide whether this kind 
of temporarily allocated {{ByteBuffer}} should be accounted against storage memory or 
execution memory. Furthermore, how should remote fetching behave when memory is not 
sufficient: should we fail the task, or can we stream the remote fetches?

What I can think of is to leverage the current shuffle implementation to spill large 
blocks to local disk during fetching, so that tasks read the data from local temporary 
files instead; that would avoid the OOM.
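A rough sketch of that idea, as a hypothetical helper (not the existing shuffle code 
path), assuming only that the fetched {{ManagedBuffer}} can be read as a stream:

{code:scala}
import java.io.{File, FileOutputStream, InputStream, OutputStream}

import org.apache.spark.network.buffer.ManagedBuffer

// Hypothetical helper: copy a fetched block to a local temp file so the task can
// read it back from disk instead of keeping the whole block on the heap.
def spillRemoteBlockToDisk(data: ManagedBuffer, tmpDir: File): File = {
  val tmpFile = File.createTempFile("remote-block-", ".tmp", tmpDir)
  val in: InputStream = data.createInputStream()
  val out: OutputStream = new FileOutputStream(tmpFile)
  try {
    val buf = new Array[Byte](64 * 1024)
    var n = in.read(buf)
    while (n != -1) {
      out.write(buf, 0, n)
      n = in.read(buf)
    }
  } finally {
    in.close()
    out.close()
  }
  tmpFile
}
{code}

The task would then read the block back from the returned temporary file and delete it 
once finished, rather than holding a block-sized {{ByteBuffer}} on the heap.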

> BlockManager does not account for memory consumed by remote fetches
> -------------------------------------------------------------------
>
>                 Key: SPARK-22062
>                 URL: https://issues.apache.org/jira/browse/SPARK-22062
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 2.2.0
>            Reporter: Sergei Lebedev
>            Priority: Minor
>
> We use Spark exclusively with {{StorageLevel.DiskOnly}} as our workloads are 
> very sensitive to memory usage. Recently, we've spotted that the jobs 
> sometimes OOM, leaving lots of {{byte[]}} arrays on the heap. Upon further 
> investigation, we've found that the arrays come from 
> {{BlockManager.getRemoteBytes}}, which 
> [calls|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L638]
>  {{BlockTransferService.fetchBlockSync}}, which in turn would 
> [allocate|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/network/BlockTransferService.scala#L99]
>  an on-heap {{ByteBuffer}} of the same size as the block (e.g. a full 
> partition), if the block was successfully retrieved over the network.
> This memory is not accounted towards Spark storage/execution memory and could 
> potentially lead to OOM if {{BlockManager}} fetches too many partitions in 
> parallel. I wonder if this is intentional behaviour, or in fact a bug?
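For reference, the allocation described above in {{BlockTransferService.fetchBlockSync}} 
boils down to roughly the following (a simplified sketch, not the exact upstream code):

{code:scala}
import java.nio.ByteBuffer

import org.apache.spark.network.buffer.{ManagedBuffer, NioManagedBuffer}

// Simplified sketch of the success path described above: the whole fetched block
// is copied into a block-sized on-heap ByteBuffer, and this allocation is not
// tracked by Spark's storage or execution memory accounting.
def copyBlockToHeap(data: ManagedBuffer): ManagedBuffer = {
  val ret = ByteBuffer.allocate(data.size.toInt) // on-heap, same size as the block
  ret.put(data.nioByteBuffer())
  ret.flip()
  new NioManagedBuffer(ret)
}
{code}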


