[jira] [Commented] (HIVE-8118) SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized with multiple result collectors[Spark Branch]

Chengxiang Li (JIRA) Tue, 16 Sep 2014 01:52:31 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135151#comment-14135151
 ]


Chengxiang Li commented on HIVE-8118:
-------------------------------------

Actually, we can generate a Spark graph with one map RDD followed by multiple 
reduce RDDs. This should not be related to SparkMapRecordHandler and 
SparkReduceRecordHandler: we could wrap each reduce-side child operator in 
a separate HiveReduceFunction at the SparkCompiler level. 
For a map RDD that is followed by two reduce RDDs and then connected to a 
union RDD, Spark would compute the map RDD twice unless the map RDD is cached. 
If the two reduce RDDs share the same shuffle dependency (which means they 
have the same map output partitions), the job could theoretically be optimized 
to compute the map RDD only once, but I think this should be a Spark 
framework-level optimization. If the two reduce RDDs don't share the same 
shuffle dependency, the map RDD would be computed twice anyway. 
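
To make the recomputation point concrete, here is a minimal sketch in plain Python. It is a toy model of lazy evaluation, not Spark code: ToyRDD, run_job and the sample rows are hypothetical names and data made up for illustration. It only shows that two children reading the same uncached parent trigger the parent twice, while a cached parent runs once.

```python
# Toy model of lazy lineage recomputation. This is NOT the Spark API;
# ToyRDD, run_job and the sample rows are hypothetical.

class ToyRDD:
    """A lazily evaluated dataset that recomputes on every collect() unless cached."""

    def __init__(self, compute, cached=False):
        self._compute = compute  # zero-argument function returning a list of rows
        self._cached = cached
        self._data = None

    def collect(self):
        if self._cached:
            if self._data is None:
                self._data = self._compute()
            return self._data
        return self._compute()


def run_job(cache_map_output):
    """Return (rows of the union, number of times the map stage ran)."""
    runs = {"map": 0}

    def map_stage():
        runs["map"] += 1
        return [("a", 1), ("b", 2), ("a", 3)]

    map_rdd = ToyRDD(map_stage, cached=cache_map_output)

    # Two "reduce" children with different logic, both reading the same parent.
    def sum_by_name():
        totals = {}
        for name, value in map_rdd.collect():
            totals[name] = totals.get(name, 0) + value
        return sorted(totals.items())

    def order_by_name():
        return sorted(map_rdd.collect())

    reduce_a = ToyRDD(sum_by_name)
    reduce_b = ToyRDD(order_by_name)
    union = ToyRDD(lambda: reduce_a.collect() + reduce_b.collect())
    return union.collect(), runs["map"]
```

Calling run_job(False) runs the map stage once per reduce child, while run_job(True) runs it once in total and returns the same rows.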
For the multi-insert case, if we wrap all FileSinkOperators into one RDD, the 
parent of the FileSinkOperators would forward rows to each FileSinkOperator, 
so the data source for the inserts would be generated only once. 
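
The forwarding idea for multi-insert can be sketched in plain Python as well. This is only an illustration under assumptions: ToySink and ToyForwardingOperator are hypothetical stand-ins and do not mirror Hive's actual Operator or FileSinkOperator classes.

```python
# Toy model of one parent operator forwarding each row to several sinks.
# ToySink and ToyForwardingOperator are hypothetical, not Hive classes.

class ToySink:
    """Stands in for a file sink: it just remembers the rows it receives."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def process(self, row):
        self.rows.append(row)


class ToyForwardingOperator:
    """Forwards every incoming row to all of its children."""

    def __init__(self, children):
        self.children = list(children)

    def process(self, row):
        for child in self.children:
            child.process(row)


def run_multi_insert(source_rows):
    """Feed the source once; return (rows read, rows in sink A, rows in sink B)."""
    sink_a = ToySink("table_a")
    sink_b = ToySink("table_b")
    parent = ToyForwardingOperator([sink_a, sink_b])

    reads = 0
    for row in source_rows:
        reads += 1  # each source row is read exactly once
        parent.process(row)
    return reads, sink_a.rows, sink_b.rows
```

Both sinks end up with every row although the source is iterated only once, which is why a single collector per handler is enough in this arrangement.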
So I think we do not really need multiple result collectors for 
SparkMapRecordHandler and SparkReduceRecordHandler.

> SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized 
> with multiple result collectors[Spark Branch]
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8118
>                 URL: https://issues.apache.org/jira/browse/HIVE-8118
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Venki Korukanti
>              Labels: Spark-M1
>
> In the current implementation, both SparkMapRecordHandler and 
> SparkReduceRecordHandler take only one result collector, which means that 
> the corresponding map or reduce task can have only one child. It's very 
> common in multi-insert queries for a map/reduce task to have more than one 
> child. A query like the following has two map tasks as parents:
> {code}
> select name, sum(value) from dec group by name union all select name, value 
> from dec order by name
> {code}
> It's possible that in the future an optimization may be implemented so that 
> a map work is followed by two reduce works and then connected to a union work.
> Thus, we should take this as a general case. Tez is currently providing a 
> collector for each child operator in the map-side or reduce side operator 
> tree. We can take Tez as a reference.
> Likely this is a big change and subtasks are possible. 
> With this, we can have a simpler and cleaner multi-insert implementation. 
> This is also the problem observed in HIVE-7731.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
