> Then I am wondering if the merge statement is impracticable because
> of bad use of myself or because this feature is just not mature enough.
Since you haven't mentioned a Hive version here, I'm going to assume you're on some variant of Hive 1.x, which has some fundamental physical-planning issues that make an UPDATE + INSERT faster than an UPSERT. An UPDATE uses an inner join, which can be rotated around so that the smaller table is always the hash-table side, while an UPSERT requires a LEFT OUTER join that scales poorly when the big-table side is the target table for the merge (which is your case). I recommend you run "explain <query>" and look at the physical plan for the merge you're running (90% sure you have a shuffle join without vectorization).

However, tackling that directly is a bit hard, because ACID v1 did not allow for easy predicate push-down when the table had pending UPDATEs (i.e. you couldn't skip rows in deltas when reading them). So the first major thing that went into Hive 3.x was the new ACID v2 implementation, which allows predicates to be applied directly without that penalty (Spark might also find it easier to read ACID v2, but what's broken there right now is lock-handling rather than the format readers).

https://issues.apache.org/jira/browse/HIVE-11320

The next big thing that went in was the new Tez semi-join reduction, which kicks in for MERGE and turns the 1.5B-row join into a bloom-filter scan first (this would be similar to the UPDATE). You will find this much more useful if you're mirroring a data set that has auto-increment or natural-order keys rather than randomized keys (a UUID would be bad, but an auto-increment customer_id would be good).

https://issues.apache.org/jira/browse/HIVE-15269

However, if your source table is an external table and it does not have column stats, there's no way today to inject this semi-join via a query hint, which is probably what's needed for this to work if your ingest is from CSV tables.
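To make the UPDATE + INSERT vs. UPSERT comparison concrete, here's a rough sketch (table and column names are hypothetical, not from your schema):

```sql
-- Update half as its own MERGE: with only a WHEN MATCHED branch this
-- plans as an inner join, which the optimizer can rotate so the
-- smaller (staging) side becomes the hash table.
MERGE INTO target t
USING staging s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET amount = s.amount;

-- Insert half as an anti-join against the target.
INSERT INTO TABLE target
SELECT s.customer_id, s.amount
FROM staging s
LEFT JOIN target t ON s.customer_id = t.customer_id
WHERE t.customer_id IS NULL;

-- Compare against the single upsert MERGE, whose LEFT OUTER join
-- scales poorly when the big table is the merge target; run EXPLAIN
-- on both and look for a shuffle join without vectorization.
EXPLAIN
MERGE INTO target t
USING staging s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET amount = s.amount
WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.amount);
```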
https://issues.apache.org/jira/browse/HIVE-19115

That's probably a trivial problem to fix, but at that point the semi-join + merge looks like this:

http://people.apache.org/~gopalv/semijoin-merge.png

So you might be ahead of the curve here, but the feedback from people using Hive 1 + MERGE is going into the Hive 3 + MERGE fixes.

> Second bad point: Right now spark is not able to read an ACID table
> without Major compaction. Meaning, the table needs to be rebuild
> from scratch behind the scene.

If your goal is to do ML workloads in Spark, this might be interesting to keep track of:

https://issues.apache.org/jira/browse/HIVE-19305

This basically forms a 1:1 bridge between PySpark and Hive-ACID (or, well, any other Hive table).

Cheers,
Gopal
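P.S. Until the Spark side lands, the workaround is the major compaction you mentioned, which you can trigger by hand rather than waiting for the compactor to pick the table up (table name hypothetical):

```sql
-- Request a major compaction so the deltas are folded into new base
-- files, which Spark can then read without ACID delta handling.
ALTER TABLE target COMPACT 'major';

-- Watch the compaction queue to see when it finishes.
SHOW COMPACTIONS;
```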
