Re: MERGE performances issue

Nicolas Paris Wed, 09 May 2018 12:46:00 -0700

2018-05-07 23:26 GMT+02:00 Gopal Vijayaraghavan <gop...@apache.org>:

> > Then I am wondering if the merge statement is impracticable because
> > of bad use of myself or because this feature is just not mature enough.
>
> Since you haven't mentioned a Hive version here, I'm going to assume
> you're some variant of Hive 1.x & that has some fundamental physical
> planning issues which makes an UPDATE + INSERT faster than an UPSERT.
>


True. I was using hive 1.2.1. Then I tested HIVE 2.10. The point is I am
quite unclear on if HIVE 2.X is equivalent to
HIVE LLAP or not. My concern with HIVE LLAP is I cannot use it combined
with Kerberos security since the LLAP daemon
is hosted by HIVE, and apparently cannot do "doAs" to impersonate other
users.

If there is a way to use HIVE 2.X without LLAP and benefit from all the
feature unless in memory computation, that would be
a good point to me.



> This is because an UPDATE uses an inner join which is rotated around so
> that the smaller table can always be the hash table side, while UPSERT
> requires a LEFT OUTER where the join scales poorly when the big table side
> is the target table for merge (which is your case).
>
> I recommend you run "explain <query>" and see the physical plan for the
> merge you're running (90% sure you have a shuffle join without
> vectorization).
>

Here are the explain:

HIVE1
Vertex dependency in root stage
Map 1 <- Union 2 (CONTAINS)
Map 7 <- Union 2 (CONTAINS)
Map 8 <- Union 2 (CONTAINS)
Reducer 3 <- Map 9 (SIMPLE_EDGE), Union 2 (SIMPLE_EDGE)
Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
Reducer 5 <- Reducer 3 (SIMPLE_EDGE)
Reducer 6 <- Reducer 3 (SIMPLE_EDGE)

HIVE2
 Vertex dependency in root stage
 Map 1 <- Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS)
 Map 6 <- Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS)
 Map 7 <- Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS)
 Reducer 3 <- Union 2 (SIMPLE_EDGE)
 Reducer 4 <- Union 2 (SIMPLE_EDGE)
 Reducer 5 <- Union 2 (SIMPLE_EDGE)


Does this confirm your thought?



> https://issues.apache.org/jira/browse/HIVE-19305
>
> This basically forms a 1:1 bridge between PySpark and Hive-ACID (or well,
> any other hive table).
>
>
Thanks for all those detail. A guess that would be helpful for other
developers to have a clear
documentation on how to deal with the transactional metastore, ACID
specific folder and so on.
As an example, this github issue show more information would be helfull for
other projects


https://github.com/prestodb/presto/issues/1970



Thanks again for all your details,

Regards

Re: MERGE performances issue

Reply via email to