[ https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237056#comment-15237056 ]
Justin Uang commented on SPARK-9850:
------------------------------------

I like this idea a lot. One thing we encounter in our use cases is that we end up accidentally joining on a field that is 50% nulls, or on a string that represents null, like "N/A". It then becomes quite cumbersome to constantly have a Spark expert dig in and find out why there is one task that will never finish. Would it be possible to add a threshold such that if a join key ever gets too big, the job just fails with an error message?

> Adaptive execution in Spark
> ---------------------------
>
>                 Key: SPARK-9850
>                 URL: https://issues.apache.org/jira/browse/SPARK-9850
>             Project: Spark
>          Issue Type: Epic
>          Components: Spark Core, SQL
>            Reporter: Matei Zaharia
>            Assignee: Yin Huai
>         Attachments: AdaptiveExecutionInSpark.pdf
>
>
> Query planning is one of the main factors in high performance, but the
> current Spark engine requires the execution DAG for a job to be set in
> advance. Even with cost-based optimization, it is hard to know the behavior
> of data and user-defined functions well enough to always get great execution
> plans. This JIRA proposes to add adaptive query execution, so that the engine
> can change the plan for each query as it sees what data earlier stages
> produced.
> We propose adding this to Spark SQL / DataFrames first, using a new API in
> the Spark engine that lets libraries run DAGs adaptively. In future JIRAs,
> the functionality could be extended to other libraries or the RDD API, but
> that is more difficult than adding it in SQL.
> I've attached a design doc by Yin Huai and myself explaining how it would
> work in more detail.
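As a rough illustration of the threshold Justin asks about in his comment (not part of the adaptive-execution proposal itself), here is a minimal sketch of how such a guard could be approximated at the application level today: count rows per join-key value and fail fast if any value is too hot. The object name, the "id" column, and the threshold value are all hypothetical, and the extra pass over each side costs one additional Spark job per check.

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, lit}

object SkewedJoinGuard {
  // Hypothetical threshold: fail if any single join-key value accounts for
  // more than this many rows on one side of the join.
  val maxRowsPerKey: Long = 10000000L

  def checkJoinKey(df: DataFrame, keyCol: String): Unit = {
    // Count rows per key value; nulls and sentinel strings like "N/A" are
    // grouped like any other value, so the skewed keys show up here.
    val hotKeys = df.groupBy(col(keyCol))
      .agg(count(lit(1)).as("cnt"))
      .filter(col("cnt") > maxRowsPerKey)
      .collect()

    if (hotKeys.nonEmpty) {
      val worst = hotKeys.map(r => s"${r.get(0)} -> ${r.getLong(1)} rows").mkString(", ")
      // Fail fast with a readable message instead of leaving one straggler task.
      throw new IllegalStateException(
        s"Join key '$keyCol' is heavily skewed: $worst (limit: $maxRowsPerKey rows per key)")
    }
  }
}

// Usage sketch: guard both sides before running the join.
// SkewedJoinGuard.checkJoinKey(left, "id")
// SkewedJoinGuard.checkJoinKey(right, "id")
// val joined = left.join(right, Seq("id"))
{code}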