Re: MERGE performances issue

2018-05-09 Thread Nicolas Paris
2018-05-07 23:26 GMT+02:00 Gopal Vijayaraghavan :

> > So I am wondering whether the MERGE statement is impractical because I am
> > misusing it or because the feature is just not mature enough.
>
> Since you haven't mentioned a Hive version here, I'm going to assume
> you're on some variant of Hive 1.x, which has some fundamental physical
> planning issues that make an UPDATE + INSERT faster than an UPSERT.
>

True. I was using Hive 1.2.1. Then I tested Hive 2.10. The point is that I am
quite unclear on whether Hive 2.x is equivalent to Hive LLAP or not. My concern
with Hive LLAP is that I cannot use it combined with Kerberos security, since
the LLAP daemon is hosted by Hive and apparently cannot do "doAs" to
impersonate other users.

If there is a way to use Hive 2.x without LLAP and still benefit from all the
features except in-memory computation, that would be a good option for me.



> This is because an UPDATE uses an inner join which is rotated around so
> that the smaller table can always be the hash table side, while UPSERT
> requires a LEFT OUTER where the join scales poorly when the big table side
> is the target table for merge (which is your case).
>
> I recommend you run "explain " and see the physical plan for the
> merge you're running (90% sure you have a shuffle join without
> vectorization).
>

​Here are the explain:

HIVE1
Vertex dependency in root stage
Map 1 <- Union 2 (CONTAINS)
Map 7 <- Union 2 (CONTAINS)
Map 8 <- Union 2 (CONTAINS)
Reducer 3 <- Map 9 (SIMPLE_EDGE), Union 2 (SIMPLE_EDGE)
Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
Reducer 5 <- Reducer 3 (SIMPLE_EDGE)
Reducer 6 <- Reducer 3 (SIMPLE_EDGE)

HIVE2
 Vertex dependency in root stage
 Map 1 <- Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS)
 Map 6 <- Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS)
 Map 7 <- Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS)
 Reducer 3 <- Union 2 (SIMPLE_EDGE)
 Reducer 4 <- Union 2 (SIMPLE_EDGE)
 Reducer 5 <- Union 2 (SIMPLE_EDGE)
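
As a rough aid for comparing the two plans above, the edge types can be tallied mechanically. The sketch below is only an illustration and not something from the thread: the helper names `edge_types` and `shuffle_join_vertices` and the regular expression are assumptions, and it only understands the "Vertex dependency" lines pasted here. On Tez, a BROADCAST_EDGE typically indicates a map (hash) join, while a vertex fed by two or more SIMPLE_EDGE inputs typically indicates a shuffle join.

```python
import re
from collections import Counter

# Vertex dependency lines copied from the two plans in this message.
HIVE1 = """\
Map 1 <- Union 2 (CONTAINS)
Map 7 <- Union 2 (CONTAINS)
Map 8 <- Union 2 (CONTAINS)
Reducer 3 <- Map 9 (SIMPLE_EDGE), Union 2 (SIMPLE_EDGE)
Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
Reducer 5 <- Reducer 3 (SIMPLE_EDGE)
Reducer 6 <- Reducer 3 (SIMPLE_EDGE)
"""

HIVE2 = """\
Map 1 <- Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS)
Map 6 <- Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS)
Map 7 <- Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS)
Reducer 3 <- Union 2 (SIMPLE_EDGE)
Reducer 4 <- Union 2 (SIMPLE_EDGE)
Reducer 5 <- Union 2 (SIMPLE_EDGE)
"""

# Matches the edge type in parentheses, e.g. "(SIMPLE_EDGE)".
EDGE = re.compile(r"\(([A-Z_]+)\)")


def edge_types(plan):
    """Count how often each edge type appears in a vertex dependency list."""
    counts = Counter()
    for line in plan.splitlines():
        if "<-" in line:
            counts.update(EDGE.findall(line))
    return counts


def shuffle_join_vertices(plan):
    """Return vertices that receive two or more SIMPLE_EDGE inputs.

    Heuristic (an assumption of this sketch): such a vertex is where a
    shuffle join is likely to happen.
    """
    vertices = []
    for line in plan.splitlines():
        if "<-" not in line:
            continue
        vertex, inputs = line.split("<-", 1)
        if EDGE.findall(inputs).count("SIMPLE_EDGE") >= 2:
            vertices.append(vertex.strip())
    return vertices


for name, plan in (("HIVE1", HIVE1), ("HIVE2", HIVE2)):
    counts = edge_types(plan)
    print(name, "broadcast edges:", counts["BROADCAST_EDGE"],
          "likely shuffle-join vertices:", shuffle_join_vertices(plan))
```

Read this way, the Hive 1 plan has no broadcast edges and joins at Reducer 3 over two shuffle inputs, whereas the Hive 2 plan broadcasts Map 8 into each map vertex.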


Does this confirm your thought?



> https://issues.apache.org/jira/browse/HIVE-19305
>
> This basically forms a 1:1 bridge between PySpark and Hive-ACID (or well,
> any other hive table).
>
>
Thanks for all those details. I guess it would be helpful for other
developers to have clear documentation on how to deal with the transactional
metastore, the ACID-specific folders and so on.
As an example, this GitHub issue shows that more information would be helpful
for other projects:

https://github.com/prestodb/presto/issues/1970



Thanks again for all your details,

Regards


Partition Pruning using UDF

2018-05-09 Thread Alberto Ramón
Hello

We have a UDF, 'FindPartition', that selects the correct partition to read:
SELECT * FROM TB WHERE partitionCol = FindPartition();

How can I avoid a full scan of all partitions?


(Set MyPartition=FindPartition();  // Is not valid in Hive)
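
A workaround that is sometimes used for this kind of problem (it is not confirmed anywhere in this thread) is to resolve the partition value on the client side first and then submit the query with that value as a literal, so that the planner sees a constant predicate on the partition column. The sketch below only builds the query text. `find_partition` is a hypothetical stand-in for whatever logic FindPartition() implements, the value it returns is made up, and the quoting helper is a minimal assumption rather than a complete escaping routine.

```python
def find_partition():
    # Hypothetical stand-in for the logic inside the FindPartition() UDF.
    # The returned value is invented purely for illustration.
    return "2018-05-09"


def quote_literal(value):
    # Minimal string-literal quoting (assumption): escape backslashes and
    # single quotes, then wrap the value in single quotes.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def build_query(table, partition_col, value):
    # The partition predicate is now a plain constant instead of a UDF call.
    return f"SELECT * FROM {table} WHERE {partition_col} = {quote_literal(value)}"


query = build_query("TB", "partitionCol", find_partition())
print(query)  # SELECT * FROM TB WHERE partitionCol = '2018-05-09'
```

The resulting string would then be submitted through whichever client is already in use. Whether this fits depends on whether the partition lookup can run outside Hive.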


Re: May 2018 Hive User Group Meeting

2018-05-09 Thread Luis Figueroa
Hey everyone,

Was the meeting recorded by any chance?

Luis

On May 8, 2018, at 5:31 PM, Sahil Takiar wrote:

Hey Everyone,

Almost time for the meetup! The live stream can be viewed on this link: 
https://live.lifesizecloud.com/extension/2000992219?token=067078ac-a8df-45bc-b84c-4b371ecbc719==en=Hive%20User%20Group%20Meetup

The stream won't be live until the meetup starts.

For those attending in person, there will be guest wifi:

Login: HiveMeetup
Password: ClouderaHive

On Mon, May 7, 2018 at 12:48 PM, Sahil Takiar wrote:
Hey Everyone,

The meetup is only a day away! Here is a link to all the abstracts we have
compiled thus far. Several of you have asked about event streaming and
recordings. The meetup will be both streamed live and recorded. We will post
the links on this thread and on the meetup link tomorrow closer to the start
of the meetup.

The meetup will be at Cloudera HQ - 395 Page Mill Rd. If you have any trouble 
getting into the building, feel free to post on the meetup link.

Meetup Link: https://www.meetup.com/Hive-User-Group-Meeting/events/249641278/

On Wed, May 2, 2018 at 7:48 AM, Sahil Takiar wrote:
Hey Everyone,

The agenda for the meetup has been set and I'm excited to say we have lots of 
interesting talks scheduled! Below is the final agenda; the full list of abstracts 
will be sent out soon. If you are planning to attend, please RSVP on the meetup 
link so we can get an accurate headcount of attendees 
(https://www.meetup.com/Hive-User-Group-Meeting/events/249641278/).

6:30 - 7:00 PM Networking and Refreshments
7:00 PM - 8:20 PM Lightning Talks (10 min each) - 8 talks total

  *   What's new in Hive 3.0.0 - Ashutosh Chauhan
  *   Hive-on-Spark at Uber: Efficiency & Scale - Xuefu Zhang
  *   Hive-on-S3 Performance: Past, Present, and Future - Sahil Takiar
  *   Dali: Data Access Layer at LinkedIn - Adwait Tumbde
  *   Parquet Vectorization in Hive - Vihang Karajgaonkar
  *   ORC Column Level Encryption - Owen O’Malley
  *   Running Hive at Scale @ Lyft - Sharanya Santhanam, Rohit Menon
  *   Materialized Views in Hive - Jesus Camacho Rodriguez

8:30 PM - 9:00 PM Hive Metastore Panel

  *   Moderator: Vihang Karajgaonkar
  *   Participants:
      *   Daniel Dai - Hive Metastore Caching
      *   Alan Gates - Hive Metastore Separation
      *   Rituparna Agrawal - Customer Use Cases & Pain Points of (Big) Metadata

The Metastore panel will consist of a short presentation by each panelist 
followed by a Q&A session driven by the moderator.

On Tue, Apr 24, 2018 at 2:53 PM, Sahil Takiar wrote:
We still have a few slots open for lightning talks, so if anyone is interested 
in giving a presentation, don't hesitate to reach out!

If you are planning to attend the meetup, please RSVP on the Meetup link 
(https://www.meetup.com/Hive-User-Group-Meeting/events/249641278/) so that we 
can get an accurate headcount for food.

Thanks!

--Sahil

On Wed, Apr 11, 2018 at 5:08 PM, Sahil Takiar wrote:
Hi all,

I'm happy to announce that the Hive community is organizing a Hive user group 
meeting in the Bay Area next month. The details can be found at 
https://www.meetup.com/Hive-User-Group-Meeting/events/249641278/

The format of this meetup will be slightly different from previous ones. There 
will be one hour dedicated to lightning talks, followed by a group discussion 
on the future of the Hive Metastore.

We are inviting talk proposals from Hive users as well as developers at this 
time. Please contact either myself 
(takiar.sa...@gmail.com), Vihang Karajgaonkar 
(vih...@cloudera.com), or Peter Vary 
(pv...@cloudera.com) with proposals. We currently 
have 5 openings.

Please let me know if you have any questions or suggestions.

Thanks,
Sahil



--
Sahil Takiar
Software Engineer
takiar.sa...@gmail.com | (510) 673-0309





Unsubscribe

2018-05-09 Thread Dheena Dhayalan