[ https://issues.apache.org/jira/browse/SPARK-9377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647502#comment-14647502 ]
Sean Owen commented on SPARK-9377:
----------------------------------

[~jem.tucker] do you want to open a PR that implements these?

> Shuffle tuning should discuss task size optimisation
> ----------------------------------------------------
>
>                 Key: SPARK-9377
>                 URL: https://issues.apache.org/jira/browse/SPARK-9377
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation, Shuffle
>            Reporter: Jem Tucker
>            Priority: Minor
>
> The recent issue SPARK-9310 highlighted the negative effects of overly high
> parallelism caused by per-task overhead. Although large task counts are
> unavoidable with high volumes of data, more detail in the documentation
> would be very beneficial to newcomers optimising the performance of their
> applications.
> Areas to discuss could be:
> - What are the overheads of a Spark task?
> -- Does this overhead change with task size etc.?
> - How to dynamically calculate a suitable parallelism for a Spark job
> - Examples of designing code to minimise shuffles
> -- How to minimise the data volumes when shuffles are required
> - Differences between sort-based and hash-based shuffles
> -- Benefits and weaknesses of each

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
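For the "dynamically calculate a suitable parallelism" point, a minimal sketch of the kind of heuristic the docs could show. The function name, the size-based rule, and the 128 MB target are illustrative assumptions for this comment, not documented Spark guidance:

```python
import math

def suggested_partitions(input_bytes, total_cores, target_bytes=128 * 1024 * 1024):
    """Illustrative heuristic, not official Spark advice: aim for partitions
    of roughly target_bytes each, but never fewer partitions than the cores
    available to the job. The 128 MB default mirrors a common HDFS block
    size; it is an assumption chosen for this sketch."""
    by_size = math.ceil(input_bytes / target_bytes)  # enough partitions to hit the size target
    return max(by_size, total_cores)                 # keep every core busy on small inputs
```

For example, a 10 GB input on 32 cores would get 80 partitions (size-driven), while a 1 MB input would still get 32 (core-driven), avoiding both oversized tasks and idle executors.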
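For the "minimise the data volumes when shuffles are required" point, a plain-Python sketch (no Spark dependency; all names are hypothetical) of why map-side combining, as in reduceByKey, ships fewer records across the shuffle than groupByKey, which shuffles every record:

```python
def shuffled_without_combine(partitions):
    # groupByKey-style: every (key, value) record crosses the shuffle boundary.
    return sum(len(p) for p in partitions)

def shuffled_with_combine(partitions):
    # reduceByKey-style: map-side combining collapses each partition to at
    # most one record per key before anything is shuffled.
    return sum(len({k for k, _ in p}) for p in partitions)
```

With two partitions holding three records each but only two distinct keys apiece, the combine-free path shuffles all six records while the combining path shuffles four; the gap widens as the number of records per key grows.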