[ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876347#comment-14876347
 ] 

Imran Rashid commented on SPARK-9850:
-------------------------------------

Just to continue brainstorming on what to do with large data -- I realize that 
my earlier suggestion of sending an uncompressed (or at least not 
{{HighlyCompressed}}) map status back to the driver may not work, since part of 
the point is to avoid OOM on the driver, not just to reduce communication from 
the driver back to the executors.

But here are two other ideas:
1. Create a variant of {{HighlyCompressedMapStatus}} which stores the exact 
sizes of all blocks that are more than some factor above the median, let's say 
5x. This would let you deal with really extreme skew without increasing the 
size too much.
2. Since you only need the summed size of the map output per reduce partition, 
you could first perform a tree-reduce of those sizes on the executors before 
sending them back to the driver. That avoids guessing an arbitrary cutoff 
factor, but it is also much more complicated.
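Idea 1 could be sketched roughly as below -- plain Scala with no Spark dependencies. All names here ({{SkewAwareCompression}}, {{compress}}, the {{factor}} parameter) are illustrative, not Spark's actual {{MapStatus}} API; the point is just that exact sizes are kept only for outlier blocks, so the extra memory cost is proportional to the skew rather than to the number of reduce partitions:

```scala
// Hypothetical sketch of idea 1: keep exact sizes only for blocks whose
// size exceeds `factor` times the median; report the average of the
// remaining blocks for everything else.
object SkewAwareCompression {
  final case class Compressed(avgSize: Long, hugeBlocks: Map[Int, Long])

  def compress(sizes: Array[Long], factor: Long = 5): Compressed = {
    val sorted = sizes.sorted
    val median = sorted(sorted.length / 2)
    val threshold = median * factor
    // Exact sizes for outlier blocks only.
    val huge = sizes.zipWithIndex.collect {
      case (s, i) if s > threshold => i -> s
    }.toMap
    // Everything else falls back to the average of the non-outliers.
    val rest = sizes.zipWithIndex.filterNot { case (_, i) => huge.contains(i) }
    val avg = if (rest.isEmpty) 0L else rest.map(_._1).sum / rest.length
    Compressed(avg, huge)
  }

  def sizeOf(c: Compressed, reduceId: Int): Long =
    c.hugeBlocks.getOrElse(reduceId, c.avgSize)
}
```

With sizes {{Array(10, 12, 11, 9, 500)}}, the median is 11, so only block 4 (size 500) crosses the 5x threshold and is stored exactly, while blocks 0-3 report the average (10) -- the same trade-off the current {{HighlyCompressedMapStatus}} makes, but without flattening the one genuinely huge block.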

> Adaptive execution in Spark
> ---------------------------
>
>                 Key: SPARK-9850
>                 URL: https://issues.apache.org/jira/browse/SPARK-9850
>             Project: Spark
>          Issue Type: Epic
>          Components: Spark Core, SQL
>            Reporter: Matei Zaharia
>            Assignee: Yin Huai
>         Attachments: AdaptiveExecutionInSpark.pdf
>
>
> Query planning is one of the main factors in high performance, but the 
> current Spark engine requires the execution DAG for a job to be set in 
> advance. Even with cost-based optimization, it is hard to know the behavior 
> of data and user-defined functions well enough to always get great execution 
> plans. This JIRA proposes to add adaptive query execution, so that the engine 
> can change the plan for each query as it sees what data earlier stages 
> produced.
> We propose adding this to Spark SQL / DataFrames first, using a new API in 
> the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
> the functionality could be extended to other libraries or the RDD API, but 
> that is more difficult than adding it in SQL.
> I've attached a design doc by Yin Huai and myself explaining how it would 
> work in more detail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
