[ https://issues.apache.org/jira/browse/HIVE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao updated HIVE-7958:
-----------------------
    Description: 
A SparkWork instance currently may contain disjoint work graphs. For 
instance, union_remove_1.q may generate a plan like this:
{code}
Reduce2 <- Map 1
Reduce4 <- Map 3
{code}
The SparkPlan instance generated from this work graph contains two result RDDs. 
When such a plan is executed, we call .foreach() on the two RDDs sequentially, 
which results in two Spark jobs, one after the other.
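
Below is a minimal, self-contained sketch of what this looks like at the Spark 
API level. The class and variable names are assumptions for illustration, not 
Hive's actual SparkPlan code; the point is that each .foreach() is a separate 
action, so the two disjoint result RDDs produce two back-to-back Spark jobs.
{code}
// Sketch only (assumed names): a plan whose two result RDDs come from
// disjoint subgraphs and are executed one after the other.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SequentialForeachSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("disjoint-plan-sketch").setMaster("local[*]"));

    // Stand-ins for the two result RDDs of the disjoint subgraphs
    // (Map 1 -> Reduce2 and Map 3 -> Reduce4 in the plan above).
    JavaRDD<Integer> result1 = sc.parallelize(Arrays.asList(1, 2, 3)).map(x -> x * 2);
    JavaRDD<Integer> result2 = sc.parallelize(Arrays.asList(4, 5, 6)).map(x -> x + 1);

    // Each foreach() is a separate Spark action, so this triggers two jobs:
    // the second job does not start until the first one has finished.
    result1.foreach(x -> { });
    result2.foreach(x -> { });

    sc.stop();
  }
}
{code}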

While this works functionally, the performance will not be great as the Spark 
jobs are run sequentially rather than concurrently.

Another side effect of this is that the corresponding SparkPlan instance is 
over-complicated.

There are two potential approaches:

1. Let SparkCompiler generate only works that can each be executed in ONE Spark 
job. In the above example, two Spark tasks should be generated.

2. Let SparkPlanGenerate generate multiple Spark plans and then have SparkClient 
execute them concurrently (a minimal sketch of this idea follows this list).
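
As an illustration of approach #2 only (the class and variable names below are 
hypothetical, not an existing Hive or SparkClient API), submitting each plan's 
result RDD from its own thread would let the Spark scheduler run the two jobs 
concurrently inside one application:
{code}
// Hypothetical sketch of approach #2: one plan per disjoint subgraph, each
// submitted from its own thread so the resulting Spark jobs overlap in time.
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ConcurrentPlansSketch {
  public static void main(String[] args) throws InterruptedException {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("concurrent-plans-sketch").setMaster("local[*]"));

    // One result RDD per (hypothetical) SparkPlan.
    JavaRDD<Integer> plan1Result = sc.parallelize(Arrays.asList(1, 2, 3)).map(x -> x * 2);
    JavaRDD<Integer> plan2Result = sc.parallelize(Arrays.asList(4, 5, 6)).map(x -> x + 1);

    // Submitting each action from a separate thread lets Spark schedule the
    // two jobs concurrently instead of back to back.
    ExecutorService pool = Executors.newFixedThreadPool(2);
    pool.submit(() -> plan1Result.foreach(x -> { }));
    pool.submit(() -> plan2Result.foreach(x -> { }));
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.MINUTES);

    sc.stop();
  }
}
{code}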

Approach #1 seems more reasonable and fits our architecture more naturally. Also, 
Hive's task execution framework already takes care of task concurrency.
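
To make approach #1 a bit more concrete, here is a rough sketch of how a 
disjoint work graph could be split into connected components, each of which 
would then back its own SparkTask. The class and method names (WorkGraphSplitter, 
findComponents) are purely illustrative, not existing Hive code:
{code}
// Illustrative only: split a disjoint work graph into connected components so
// each component can be compiled into its own SparkTask (one Spark job each).
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class WorkGraphSplitter {

  // adjacency: work node -> neighboring nodes (both parents and children)
  public static List<Set<String>> findComponents(Map<String, Set<String>> adjacency) {
    List<Set<String>> components = new ArrayList<>();
    Set<String> visited = new HashSet<>();
    for (String start : adjacency.keySet()) {
      if (!visited.add(start)) {
        continue; // already placed in an earlier component
      }
      // Breadth-first search collects everything reachable from 'start'.
      Set<String> component = new HashSet<>();
      Deque<String> queue = new ArrayDeque<>();
      queue.add(start);
      while (!queue.isEmpty()) {
        String current = queue.poll();
        component.add(current);
        for (String neighbor : adjacency.getOrDefault(current, Collections.emptySet())) {
          if (visited.add(neighbor)) {
            queue.add(neighbor);
          }
        }
      }
      components.add(component);
    }
    return components;
  }

  public static void main(String[] args) {
    // The union_remove_1.q example above: two disconnected subgraphs.
    Map<String, Set<String>> graph = new HashMap<>();
    graph.put("Map 1", new HashSet<>(Arrays.asList("Reduce2")));
    graph.put("Reduce2", new HashSet<>(Arrays.asList("Map 1")));
    graph.put("Map 3", new HashSet<>(Arrays.asList("Reduce4")));
    graph.put("Reduce4", new HashSet<>(Arrays.asList("Map 3")));

    // Expect two components -> two SparkTasks, each runnable as ONE Spark job.
    System.out.println(findComponents(graph));
  }
}
{code}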

  was:
A SparkWork instance currently may contain disjoint work graphs. For 
instance, union_remove_1.q may generate a plan like this:
{code}
Reduce2 -> Map 1
Reduce4 <- Map 3
{code}
The SparkPlan instance generated from this work graph contains two result RDDs. 
When such a plan is executed, we call .foreach() on the two RDDs sequentially, 
which results in two Spark jobs, one after the other.

While this works functionally, the performance will not be great as the Spark 
jobs are run sequentially rather than concurrently.

Another side effect of this is that the corresponding SparkPlan instance is 
over-complicated.

There are two potential approaches:

1. Let SparkCompiler generate only works that can each be executed in ONE Spark 
job. In the above example, two Spark tasks should be generated.

2. Let SparkPlanGenerate generate multiple Spark plans and then have SparkClient 
execute them concurrently.

Approach #1 seems more reasonable and fits our architecture more naturally. Also, 
Hive's task execution framework already takes care of task concurrency.


> SparkWork generated by SparkCompiler may require multiple Spark jobs to run
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-7958
>                 URL: https://issues.apache.org/jira/browse/HIVE-7958
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Priority: Critical
>              Labels: Spark-M1
>
> A SparkWork instance currently may contain disjoint work graphs. For 
> instance, union_remove_1.q may generate a plan like this:
> {code}
> Reduce2 <- Map 1
> Reduce4 <- Map 3
> {code}
> The SparkPlan instance generated from this work graph contains two result 
> RDDs. When such a plan is executed, we call .foreach() on the two RDDs 
> sequentially, which results in two Spark jobs, one after the other.
> While this works functionally, the performance will not be great as the Spark 
> jobs are run sequentially rather than concurrently.
> Another side effect of this is that the corresponding SparkPlan instance is 
> over-complicated.
> There are two potential approaches:
> 1. Let SparkCompiler generate only works that can each be executed in ONE Spark 
> job. In the above example, two Spark tasks should be generated.
> 2. Let SparkPlanGenerate generate multiple Spark plans and then have SparkClient 
> execute them concurrently.
> Approach #1 seems more reasonable and fits our architecture more naturally. 
> Also, Hive's task execution framework already takes care of task concurrency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
