[
https://issues.apache.org/jira/browse/HIVE-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124358#comment-14124358
]
Na Yang commented on HIVE-7870:
-------------------------------
Removing those duplicated FileSinks is hard because, at the time the FileSinks
are added to the FileSink set, it is not yet known which FileSink will
eventually be used by the Spark work. If we remove the wrong FileSink from the
set, we cannot create the proper linked FileSinks for the target FileSink,
which causes wrong results for the merge and move work when
hive.merge.sparkfiles is turned ON.
For example, in the following query, three duplicate FileSinks FS1, FS2, and
FS3 will be added to the FileSink set (numbered in the order they are added).
FS2 and FS3 will be used for the subqueries of the outer union. In addition,
FS2 and FS3 have different directories when hive.merge.sparkfiles=true.
insert overwrite table outputTbl1
SELECT * FROM (
  select key, 1 as values from inputTbl1
  union all
  select * FROM (
    SELECT key, count(1) as values from inputTbl1 group by key
    UNION ALL
    SELECT key, 2 as values from inputTbl1
  ) a
) b;
However, in the following query, three duplicate FileSinks FS1, FS2, and FS3
will likewise be added to the FileSink set, but here FS1 will be used for the
subqueries of the union, and FS1, FS2, and FS3 all share the same directory
when hive.merge.sparkfiles=true.
insert overwrite table outputTbl1
SELECT * FROM (
  select key, 1 as values from inputTbl1
  union all
  select * FROM (
    SELECT key, 3 as values from inputTbl1
    UNION ALL
    SELECT key, 2 as values from inputTbl1
  ) a
) b;
When the FileSinks are added to the FileSink set, the final plan has not been
generated yet, so there is no way to know which FileSink should not be added to
the set. After the final plan is generated, it is also hard to detect the
duplicate FileSinks and remove the right one.
Therefore, duplicate FileSinks remain in the FileSink set. The potential
problem they cause is generating multiple merge and move works when
hive.merge.sparkfiles=true. The patch resolves this by linking the duplicate
FileSinks together and using a HashMap to ensure that each directory is
processed only once, so only one merge and move work is generated per directory
no matter how many duplicate FileSinks exist.
> Insert overwrite table query does not generate correct task plan [Spark
> Branch]
> -------------------------------------------------------------------------------
>
> Key: HIVE-7870
> URL: https://issues.apache.org/jira/browse/HIVE-7870
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Na Yang
> Assignee: Na Yang
> Labels: Spark-M1
> Attachments: HIVE-7870.1-spark.patch, HIVE-7870.2-spark.patch,
> HIVE-7870.3-spark.patch, HIVE-7870.4-spark.patch, HIVE-7870.5-spark.patch
>
>
> Insert overwrite table query does not generate correct task plan when
> hive.optimize.union.remove and hive.merge.sparkfiles properties are ON.
> {noformat}
> set hive.optimize.union.remove=true
> set hive.merge.sparkfiles=true
> insert overwrite table outputTbl1
> SELECT * FROM
> (
> select key, 1 as values from inputTbl1
> union all
> select * FROM (
> SELECT key, count(1) as values from inputTbl1 group by key
> UNION ALL
> SELECT key, 2 as values from inputTbl1
> ) a
> )b;
> select * from outputTbl1 order by key, values;
> {noformat}
> query result
> {noformat}
> 1 1
> 1 2
> 2 1
> 2 2
> 3 1
> 3 2
> 7 1
> 7 2
> 8 2
> 8 2
> 8 2
> {noformat}
> expected result:
> {noformat}
> 1 1
> 1 1
> 1 2
> 2 1
> 2 1
> 2 2
> 3 1
> 3 1
> 3 2
> 7 1
> 7 1
> 7 2
> 8 1
> 8 1
> 8 2
> 8 2
> 8 2
> {noformat}
> The move work is not working properly and some data are missing during the move.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)