[ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900741#comment-15900741
 ] 

jin xing edited comment on SPARK-19659 at 3/8/17 5:47 AM:
----------------------------------------------------------

[~irashid] [~rxin]
I uploaded SPARK-19659-design-v2.pdf, please take a look.

Yes, only outliers should be tracked so that MapStatus stays compact. It is a 
good idea to track only the sizes that are more than 2x the average; "2x" is a 
default value and will be configurable via a parameter.
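To make the idea concrete, here is a minimal sketch of that compaction logic in Python (illustrative only; `compress_map_status` and `estimated_size` are hypothetical names, not Spark's actual MapStatus implementation):

```python
def compress_map_status(sizes, factor=2.0):
    """Keep only the average size plus the blocks larger than factor * average.

    `factor` models the proposed configurable "2x" threshold.
    """
    avg = sum(sizes) / len(sizes)
    # Outliers keep their exact size; everything else collapses to the average.
    outliers = {i: s for i, s in enumerate(sizes) if s > factor * avg}
    return avg, outliers

def estimated_size(avg, outliers, block_id):
    # Non-outlier blocks are approximated by the average.
    return outliers.get(block_id, avg)
```

For example, with sizes [100, 100, 100, 1000] only block 3 exceeds twice the average (650), so it alone is stored exactly while the others are reported as 325.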

We need to collect metrics for stability and performance:

- For stability:
 1) Blocks whose size is between the average and 2x the average are 
underestimated, so there is a risk that those blocks cannot fit in memory. In 
this approach, the driver should calculate the sum of the underestimated sizes. 
 2) Show the distribution of block sizes in MapStatus; the distribution will 
fall into the buckets (0~100K, 100K~1M, 1~10M, 10~100M, 100M~1G, 1G~10G), which 
will help a lot with debugging (e.g. finding skew situations).
- For performance:
 1) Peak memory (off-heap or on-heap) used for fetching blocks should be tracked.
 2) Fetching blocks to disk will cause performance degradation. The executor 
should record the size of blocks shuffled to disk and the time cost. 
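As a sketch of how the two stability metrics above could be computed (illustrative Python, not Spark code; the bucket edges follow the ranges listed above):

```python
def size_distribution(sizes):
    # Bucket edges matching the proposed ranges: 0~100K, 100K~1M, 1~10M,
    # 10~100M, 100M~1G, 1G~10G.
    edges = [100e3, 1e6, 10e6, 100e6, 1e9, 10e9]
    labels = ["0~100K", "100K~1M", "1~10M", "10~100M", "100M~1G", "1G~10G"]
    counts = {label: 0 for label in labels}
    for s in sizes:
        for edge, label in zip(edges, labels):
            if s < edge:
                counts[label] += 1
                break
    return counts

def underestimated_total(sizes, factor=2.0):
    # Blocks between the average and factor * average are reported as the
    # average, so their true size is underestimated; sum those true sizes.
    avg = sum(sizes) / len(sizes)
    return sum(s for s in sizes if avg < s <= factor * avg)
```

For example, sizes [100, 200, 300, 400] have average 250, so the 300 and 400 blocks fall between the average and 2x the average and the underestimated total is 700.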

Yes, Imran, I will definitely break the proposal into smaller pieces (JIRAs and 
PRs). The metrics part should be done first, before the other parts proposed here.

What's more, why not make 2000 a configurable parameter, so that users can 
choose to track accurate block sizes up to a level that suits them?
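The switch could be sketched like this (hypothetical helper, not Spark's actual code; `threshold` models the hard-coded 2000 being made configurable):

```python
def choose_map_status(num_partitions, threshold=2000):
    # Below the threshold, keep exact per-block sizes; above it, fall back
    # to the compressed form that only stores the average plus outliers.
    return "exact" if num_partitions <= threshold else "compressed"
```

A user who can afford the extra driver memory could then raise the threshold, e.g. `choose_map_status(5000, threshold=10000)` would still keep exact sizes.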







> Fetch big blocks to disk when shuffle-read
> ------------------------------------------
>
>                 Key: SPARK-19659
>                 URL: https://issues.apache.org/jira/browse/SPARK-19659
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.1.0
>            Reporter: jin xing
>         Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
>
>
> Currently the whole block is fetched into memory (off-heap by default) during 
> shuffle-read. A block is identified by (shuffleId, mapId, reduceId), so it can 
> be large in skew situations. If OOM happens during shuffle read, the job will 
> be killed and users will be told to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more 
> memory can resolve the OOM, but this approach is not well suited to 
> production environments, especially data warehouses.
> When using Spark SQL as the data engine in a warehouse, users hope for a unified 
> parameter (e.g. memory) with less wasted resource (resource that is allocated but 
> not used).
> Skew situations are not always easy to predict; when they happen, it makes sense 
> to fetch remote blocks to disk for shuffle-read, rather than 
> kill the job because of OOM. This approach was mentioned during the discussion 
> in SPARK-3019, by [~sandyr] and [~mridulm80].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
