Re: MERGE performances issue
Hive-MR3 could be a solution for you. It supports everything that you mention in the previous post. I have written a blog article discussing the pros and cons of Hive-MR3 with respect to Hive-LLAP.

https://mr3.postech.ac.kr/blog/2018/05/19/comparison-hivemr3-llap/

--- Sungwoo

On Thu, May 10, 2018 at 4:44 AM, Nicolas Paris wrote:

> True. I was using hive 1.2.1. Then I tested HIVE 2.10. The point is I am
> quite unclear on whether HIVE 2.X is equivalent to HIVE LLAP or not. My
> concern with HIVE LLAP is I cannot use it combined with Kerberos security,
> since the LLAP daemon is hosted by HIVE and apparently cannot do "doAs"
> to impersonate other users.
>
> If there is a way to use HIVE 2.X without LLAP and benefit from all the
> features except in-memory computation, that would be a good point for me.
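For reference, the impersonation behaviour in question is controlled by a HiveServer2 property; a minimal hive-site.xml sketch (a hedged illustration, to be checked against your distribution's documentation):

```xml
<!-- hive-site.xml: run queries as the submitting end-user ("doAs") -->
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>
```

With LLAP the daemons run as the hive service user, so this property is typically set to false and fine-grained authorization is delegated to a tool such as Ranger instead.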
Re: MERGE performances issue
2018-05-07 23:26 GMT+02:00 Gopal Vijayaraghavan :

> > Then I am wondering if the merge statement is impracticable because
> > of bad use of myself or because this feature is just not mature enough.
>
> Since you haven't mentioned a Hive version here, I'm going to assume
> you're some variant of Hive 1.x & that has some fundamental physical
> planning issues which makes an UPDATE + INSERT faster than an UPSERT.

True. I was using hive 1.2.1. Then I tested HIVE 2.10. The point is I am quite unclear on whether HIVE 2.X is equivalent to HIVE LLAP or not. My concern with HIVE LLAP is I cannot use it combined with Kerberos security, since the LLAP daemon is hosted by HIVE and apparently cannot do "doAs" to impersonate other users.

If there is a way to use HIVE 2.X without LLAP and benefit from all the features except in-memory computation, that would be a good point for me.

> This is because an UPDATE uses an inner join which is rotated around so
> that the smaller table can always be the hash table side, while UPSERT
> requires a LEFT OUTER where the join scales poorly when the big table side
> is the target table for merge (which is your case).
>
> I recommend you run "explain " and see the physical plan for the
> merge you're running (90% sure you have a shuffle join without
> vectorization).

Here are the explain plans:

HIVE1

Vertex dependency in root stage
Map 1 <- Union 2 (CONTAINS)
Map 7 <- Union 2 (CONTAINS)
Map 8 <- Union 2 (CONTAINS)
Reducer 3 <- Map 9 (SIMPLE_EDGE), Union 2 (SIMPLE_EDGE)
Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
Reducer 5 <- Reducer 3 (SIMPLE_EDGE)
Reducer 6 <- Reducer 3 (SIMPLE_EDGE)

HIVE2

Vertex dependency in root stage
Map 1 <- Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS)
Map 6 <- Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS)
Map 7 <- Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS)
Reducer 3 <- Union 2 (SIMPLE_EDGE)
Reducer 4 <- Union 2 (SIMPLE_EDGE)
Reducer 5 <- Union 2 (SIMPLE_EDGE)

Does this confirm your thought?
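For readers who want to reproduce plans like these, EXPLAIN can be prefixed to the MERGE itself; a sketch with placeholder table and column names (not the actual schema from this thread):

```sql
-- Print the physical plan without executing the statement
EXPLAIN
MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET val = s.val
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
```

In such plans, BROADCAST_EDGE indicates a map-side (broadcast hash) join, while SIMPLE_EDGE indicates a shuffle.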
> https://issues.apache.org/jira/browse/HIVE-19305
>
> This basically forms a 1:1 bridge between PySpark and Hive-ACID (or well,
> any other hive table).

Thanks for all those details. I guess it would be helpful for other developers to have clear documentation on how to deal with the transactional metastore, the ACID-specific folder layout and so on. As an example, this GitHub issue shows that more information would be helpful for other projects:

https://github.com/prestodb/presto/issues/1970

Thanks again for all your details,
Regards
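For anyone hitting the Spark-reading limitation discussed in this thread, a major compaction can be requested manually; a sketch (the table name is a placeholder):

```sql
-- Request a major compaction of an ACID table (runs asynchronously
-- via the metastore's compactor threads)
ALTER TABLE target COMPACT 'major';

-- Inspect queued, running and finished compactions
SHOW COMPACTIONS;
```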
Re: MERGE performances issue
> Then I am wondering if the merge statement is impracticable because
> of bad use of myself or because this feature is just not mature enough.

Since you haven't mentioned a Hive version here, I'm going to assume you're some variant of Hive 1.x & that has some fundamental physical planning issues which makes an UPDATE + INSERT faster than an UPSERT.

This is because an UPDATE uses an inner join which is rotated around so that the smaller table can always be the hash table side, while UPSERT requires a LEFT OUTER where the join scales poorly when the big table side is the target table for merge (which is your case).

I recommend you run "explain " and see the physical plan for the merge you're running (90% sure you have a shuffle join without vectorization).

However tackling that directly is a bit hard, because ACIDv1 did not allow for easy predicate push-down when the table had pending UPDATEs (i.e. you couldn't skip rows in deltas when reading them). So the 1st major thing that went into Hive 3.x was the new ACIDv2 implementation, which allows for the predicates to be applied directly at full performance (Spark might find it easier to read ACIDv2, but that is more broken with lock-handling rather than the format readers right now).

https://issues.apache.org/jira/browse/HIVE-11320

The next big thing that went in was the new Tez semi-join reduction which kicks in for MERGE, which turns the 1.5B join into a bloom filter scan first (this would be similar to the UPDATE). You will see this as much more useful if you're mirroring a data-set which has auto-incremental or natural order keys rather than randomized keys (a UUID would be bad, but a customer_id autoinc would be good).

https://issues.apache.org/jira/browse/HIVE-15269

However, if your source table is an external table & it does not have column stats, there's no way today to inject this semi-join via a query hint, which is probably what's needed for this to work if your ingest is from CSV tables.
https://issues.apache.org/jira/browse/HIVE-19115

That's probably a trivial problem, but at that point what happens is that the semi-join + merge looks like this:

http://people.apache.org/~gopalv/semijoin-merge.png

So you might be ahead of the curve here, but the feedback from people using Hive1+Merge is going into Hive3+Merge fixes.

> Second bad point: Right now spark is not able to read an ACID table
> without Major compaction. Meaning, the table needs to be rebuild
> from scratch behind the scene.

If your goal is to do ML workloads in Spark, this might be interesting to keep track of:

https://issues.apache.org/jira/browse/HIVE-19305

This basically forms a 1:1 bridge between PySpark and Hive-ACID (or well, any other hive table).

Cheers,
Gopal
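The semi-join reduction described above is driven by session-level settings and by column statistics on the source side; a hedged sketch of the relevant knobs (property names per the Hive 3.x configuration reference; the values and the table name are illustrative, not tuned):

```sql
-- Enable Tez dynamic semi-join reduction (bloom-filter based)
SET hive.tez.dynamic.semijoin.reduction=true;
-- Bounds on the bloom filter size; illustrative values only
SET hive.tez.min.bloom.filter.entries=1000000;
SET hive.tez.max.bloom.filter.entries=100000000;

-- Column stats on the source table give the optimizer the
-- information it needs to decide whether the semi-join pays off
ANALYZE TABLE source_ext COMPUTE STATISTICS FOR COLUMNS;
```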
Re: MERGE performances issue
>> Has anyone any positive feedback on the hive MERGE statement ?

FYI:

https://issues.apache.org/jira/browse/HIVE-19286
https://issues.apache.org/jira/browse/HIVE-19295

On Mon, May 7, 2018 at 12:35 AM, Nicolas Paris wrote:

> Hi,
>
> Has anyone any positive feedback on the hive MERGE statement ? There
> is some information in [1] and [2].
>
> From my experience, merging a source table of 300M rows and 100 columns
> into a target of 1.5B is 100 times slower than doing an UPDATE and an
> INSERT. It is also slower than a third approach consisting in building
> the new table from scratch, and renaming it to replace the old one.
>
> Second bad point: right now Spark is not able to read an ACID table
> without major compaction, meaning the table needs to be rebuilt from
> scratch behind the scenes.
>
> Then I am wondering if the merge statement is impracticable because
> of my own misuse or because this feature is just not mature enough.
>
> [1]: https://thisdataguy.com/2018/01/29/why-is-my-hive-merge-statement-slow/
> [2]: https://fr.hortonworks.com/blog/apache-hive-moving-beyond-analytics-offload-with-sql-merge/

-- Oleksiy
MERGE performances issue
Hi,

Has anyone any positive feedback on the hive MERGE statement ? There is some information in [1] and [2].

From my experience, merging a source table of 300M rows and 100 columns into a target of 1.5B is 100 times slower than doing an UPDATE and an INSERT. It is also slower than a third approach consisting in building the new table from scratch, and renaming it to replace the old one.

Second bad point: right now Spark is not able to read an ACID table without major compaction, meaning the table needs to be rebuilt from scratch behind the scenes.

Then I am wondering if the merge statement is impracticable because of my own misuse or because this feature is just not mature enough.

[1]: https://thisdataguy.com/2018/01/29/why-is-my-hive-merge-statement-slow/
[2]: https://fr.hortonworks.com/blog/apache-hive-moving-beyond-analytics-offload-with-sql-merge/
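For context, the kind of statement being benchmarked here looks roughly like the following sketch (table and column names are placeholders, not the actual schema; the target must be a transactional, i.e. ACID, table):

```sql
-- Placeholder schema: target(id, val), source(id, val, op)
MERGE INTO target t
USING source s
ON t.id = s.id
-- When two WHEN MATCHED clauses are given, the first must carry
-- an extra AND condition
WHEN MATCHED AND s.op = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET val = s.val
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
```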