> Then I am wondering if the merge statement is impractical because
> of my own misuse or because this feature is just not mature enough.

Since you haven't mentioned a Hive version here, I'm going to assume you're on
some variant of Hive 1.x, which has some fundamental physical planning issues
that make an UPDATE + INSERT faster than an UPSERT.

This is because an UPDATE uses an inner join, which can be rotated so that the
smaller table is always on the hash-table side, while an UPSERT requires a
LEFT OUTER join, which scales poorly when the big-table side is the target
table of the merge (which is your case).
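
To make the shapes concrete, here's a minimal sketch using hypothetical
tables "target" and "staging" keyed on "id" (not your schema) -- the
WHEN NOT MATCHED branch is what forces the outer join:

    MERGE INTO target AS t
    USING staging AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET val = s.val
    WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
    -- the NOT MATCHED branch needs the staging rows with no match in
    -- target, so the planner has to keep an outer join here; the big
    -- target side can't simply be rotated onto the hash-table side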

I recommend you run "explain <query>" and look at the physical plan for the
merge you're running (I'm 90% sure you have a shuffle join without
vectorization).
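
For instance, with the same hypothetical tables as above:

    EXPLAIN
    MERGE INTO target AS t USING staging AS s ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET val = s.val
    WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
    -- good sign: a "Map Join Operator" with "Execution mode: vectorized"
    -- on the vertex; bad sign: the join operator sitting in a Reducer
    -- vertex (the shuffle join) with no vectorized mode

(The operator names above are from Hive-on-Tez explain output; plain
MapReduce plans label things slightly differently.)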

However, tackling that directly is a bit hard, because ACIDv1 did not allow
for easy predicate push-down when the table had pending UPDATEs (i.e. you
couldn't skip rows in the deltas when reading them).

So the 1st major thing that went into Hive 3.x was the new ACIDv2
implementation, which allows predicates to be applied directly for full read
performance (Spark might find it easier to read ACIDv2, but what's broken
there right now is the lock-handling rather than the format readers).

https://issues.apache.org/jira/browse/HIVE-11320

The next big thing that went in was the new Tez semi-join reduction, which
kicks in for MERGE and turns the 1.5B-row join into a bloom-filter scan first
(similar to what the UPDATE gets). You will find this much more useful if
you're mirroring a data-set with auto-incrementing or naturally ordered keys
rather than randomized keys (a UUID would be bad, but an auto-incremented
customer_id would be good).

https://issues.apache.org/jira/browse/HIVE-15269
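
If you want to check whether this is even switched on for your build, these
are the knobs that control it -- property names are from HIVE-15269-era
Hive 3.x, and the values below are only illustrative, not a recommendation:

    -- Tez dynamic semi-join reduction (default true in Hive 3.x)
    SET hive.tez.dynamic.semijoin.reduction=true;
    -- bloom-filter sizing, driven by the optimizer's row-count estimates
    SET hive.tez.min.bloom.filter.entries=1000000;
    SET hive.tez.max.bloom.filter.entries=100000000;
    SET hive.tez.bloom.filter.factor=1.0;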

However, if your source table is an external table and it does not have
column stats, there's no way today to inject this semi-join via a query hint,
which is probably what's needed for this to work if your ingest is from CSV
tables.

https://issues.apache.org/jira/browse/HIVE-19115
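
Until a hint exists, the usual workaround is to give the optimizer column
stats on the staging side, which may let the semi-join reduction kick in on
its own -- a sketch, again with the hypothetical staging table:

    -- compute column stats (NDV estimates etc.) so the optimizer can
    -- cost & size the semi-join's bloom filter; works on external tables
    ANALYZE TABLE staging COMPUTE STATISTICS FOR COLUMNS;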

That's probably a trivial problem to fix, and once it is, the semi-join +
merge plan looks like this.

http://people.apache.org/~gopalv/semijoin-merge.png

So you might be ahead of the curve here, but the feedback from people using 
Hive1+Merge is going into Hive3+Merge fixes.

> Second bad point: Right now Spark is not able to read an ACID table
> without a major compaction. Meaning, the table needs to be rebuilt
> from scratch behind the scenes.

If your goal is to do ML workloads in Spark, this might be interesting to keep
track of.

https://issues.apache.org/jira/browse/HIVE-19305

This basically forms a 1:1 bridge between PySpark and Hive-ACID (or, well, any
other Hive table).

Cheers,
Gopal

