2018-05-07 23:26 GMT+02:00 Gopal Vijayaraghavan <gop...@apache.org>: > > Then I am wondering if the merge statement is impracticable because > > of bad use of myself or because this feature is just not mature enough. > > Since you haven't mentioned a Hive version here, I'm going to assume > you're some variant of Hive 1.x & that has some fundamental physical > planning issues which makes an UPDATE + INSERT faster than an UPSERT. >
True. I was using hive 1.2.1. Then I tested HIVE 2.10. The point is I am quite unclear on if HIVE 2.X is equivalent to HIVE LLAP or not. My concern with HIVE LLAP is I cannot use it combined with Kerberos security since the LLAP daemon is hosted by HIVE, and apparently cannot do "doAs" to impersonate other users. If there is a way to use HIVE 2.X without LLAP and benefit from all the feature unless in memory computation, that would be a good point to me. > This is because an UPDATE uses an inner join which is rotated around so > that the smaller table can always be the hash table side, while UPSERT > requires a LEFT OUTER where the join scales poorly when the big table side > is the target table for merge (which is your case). > > I recommend you run "explain <query>" and see the physical plan for the > merge you're running (90% sure you have a shuffle join without > vectorization). > Here are the explain: HIVE1 Vertex dependency in root stage Map 1 <- Union 2 (CONTAINS) Map 7 <- Union 2 (CONTAINS) Map 8 <- Union 2 (CONTAINS) Reducer 3 <- Map 9 (SIMPLE_EDGE), Union 2 (SIMPLE_EDGE) Reducer 4 <- Reducer 3 (SIMPLE_EDGE) Reducer 5 <- Reducer 3 (SIMPLE_EDGE) Reducer 6 <- Reducer 3 (SIMPLE_EDGE) HIVE2 Vertex dependency in root stage Map 1 <- Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS) Map 6 <- Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS) Map 7 <- Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS) Reducer 3 <- Union 2 (SIMPLE_EDGE) Reducer 4 <- Union 2 (SIMPLE_EDGE) Reducer 5 <- Union 2 (SIMPLE_EDGE) Does this confirm your thought? > https://issues.apache.org/jira/browse/HIVE-19305 > > This basically forms a 1:1 bridge between PySpark and Hive-ACID (or well, > any other hive table). > > Thanks for all those detail. A guess that would be helpful for other developers to have a clear documentation on how to deal with the transactional metastore, ACID specific folder and so on. As an example, this github issue show more information would be helfull for other projects https://github.com/prestodb/presto/issues/1970 Thanks again for all your details, Regards