Re: Hive Hash in Spark

2019-05-07 Thread Bruce Robbins
Mildly off-topic:

From a *correctness* perspective only, it seems Spark can read bucketed
Hive tables just fine. I am ignoring the fact that Spark doesn't take
advantage of the bucketing.

Is that a fair assessment? Or is it more complicated than that?

Also, Spark has code to prevent an application from accidentally writing to
a bucketed Hive table (except it has a hole
<https://issues.apache.org/jira/browse/SPARK-27498>). Except for that hole,
the write case is covered.

Spark apps reading bucketed Hive tables seem to be common, so I hope it
works (as it seems to).
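
To make the write concern concrete, here is a rough Python sketch (not Spark or Hive source code; every function name in it is made up for illustration). It assumes, as I understand it, that classic Hive bucketing places a row with (hash & Integer.MAX_VALUE) % numBuckets using Hive's own hash, while Spark's native bucketing uses a non-negative modulus of a Murmur3 hash with seed 42. Only int and string keys are covered.

```python
# Illustrative sketch only: names are hypothetical, not Spark or Hive APIs.
MASK32 = 0xFFFFFFFF


def to_int32(x):
    """Wrap a Python int to a signed 32-bit value, like Java int arithmetic."""
    x &= MASK32
    return x - 0x100000000 if x >= 0x80000000 else x


def hive_hash_int(v):
    # Assumption: the Hive-style hash of an int column is the value itself.
    return to_int32(v)


def hive_hash_string(s):
    # Assumption: h = 31 * h + byte over the UTF-8 bytes, bytes taken as signed.
    h = 0
    for b in s.encode("utf-8"):
        signed = b - 256 if b >= 128 else b
        h = to_int32(31 * h + signed)
    return h


def hive_bucket(hash_value, num_buckets):
    # Hive-style: clear the sign bit, then take the modulus.
    return (hash_value & 0x7FFFFFFF) % num_buckets


def rotl32(x, r):
    """Rotate a 32-bit unsigned value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_hash_int(value, seed=42):
    # Murmur3 x86_32 over a single 4-byte int.
    k1 = ((value & MASK32) * 0xCC9E2D51) & MASK32
    k1 = rotl32(k1, 15)
    k1 = (k1 * 0x1B873593) & MASK32
    h1 = (seed & MASK32) ^ k1
    h1 = rotl32(h1, 13)
    h1 = (h1 * 5 + 0xE6546B64) & MASK32
    h1 ^= 4  # input length in bytes
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & MASK32
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & MASK32
    h1 ^= h1 >> 16
    return to_int32(h1)


def spark_bucket(hash_value, num_buckets):
    # Non-negative modulus; Python's % is already non-negative here.
    return hash_value % num_buckets


if __name__ == "__main__":
    n = 8
    keys = range(100)
    mismatches = sum(
        hive_bucket(hive_hash_int(k), n) != spark_bucket(murmur3_hash_int(k), n)
        for k in keys
    )
    print(f"{mismatches} of {len(keys)} keys land in different buckets")
```

If the two schemes disagree on the bucket for a key, a file written with one layout cannot be trusted by a reader expecting the other, which is why writing needs a guard. Reading without relying on the bucketing only treats the files as ordinary data.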


On Thu, Mar 7, 2019 at 12:58 PM  wrote:

> Thanks Ryan and Reynold for the information!
>
>
>
> Cheers,
>
> Tyson
>
>
>
> *From:* Ryan Blue 
> *Sent:* Wednesday, March 6, 2019 3:47 PM
> *To:* Reynold Xin 
> *Cc:* tcon...@gmail.com; Spark Dev List 
> *Subject:* Re: Hive Hash in Spark
>
>
>
> I think this was needed to add support for bucketed Hive tables. Like
> Tyson noted, if the other side of a join can be bucketed the same way, then
> Spark can use a bucketed join. I have long-term plans to support this in
> the DataSourceV2 API, but I don't think we are very close to implementing
> it yet.
>
>
>
> rb
>
>
>
> On Wed, Mar 6, 2019 at 1:57 PM Reynold Xin  wrote:
>
> I think they might be used in bucketing? Not 100% sure.
>
>
>
>
>
> On Wed, Mar 06, 2019 at 1:40 PM,  wrote:
>
> Hi,
>
>
>
> I noticed the existence of a Hive Hash partitioning implementation in
> Spark, but also noticed that it’s not being used, and that the Spark hash
> partitioning function is presently hardcoded to Murmur3. My question is
> whether Hive Hash is dead code or are there future plans to support reading
> and understanding data that has been partitioned using Hive Hash? By
> understanding, I mean that I’m able to avoid a full shuffle join on Table A
> (partitioned by Hive Hash) when joining with a Table B that I can shuffle
> via Hive Hash to Table A.
>
>
>
> Thank you,
>
> Tyson
>
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>


RE: Hive Hash in Spark

2019-03-07 Thread tcondie
Thanks Ryan and Reynold for the information!

 

Cheers,

Tyson

 

From: Ryan Blue  
Sent: Wednesday, March 6, 2019 3:47 PM
To: Reynold Xin 
Cc: tcon...@gmail.com; Spark Dev List 
Subject: Re: Hive Hash in Spark

 

I think this was needed to add support for bucketed Hive tables. Like Tyson 
noted, if the other side of a join can be bucketed the same way, then Spark can 
use a bucketed join. I have long-term plans to support this in the DataSourceV2 
API, but I don't think we are very close to implementing it yet.

 

rb

 

On Wed, Mar 6, 2019 at 1:57 PM Reynold Xin <r...@databricks.com> wrote:

  
 

I think they might be used in bucketing? Not 100% sure.

 

 

On Wed, Mar 06, 2019 at 1:40 PM, <tcon...@gmail.com> wrote:

Hi,

 

I noticed the existence of a Hive Hash partitioning implementation in Spark, 
but also noticed that it’s not being used, and that the Spark hash partitioning 
function is presently hardcoded to Murmur3. My question is whether Hive Hash is 
dead code or are there future plans to support reading and understanding data 
that has been partitioned using Hive Hash? By understanding, I mean that I’m 
able to avoid a full shuffle join on Table A (partitioned by Hive Hash) when 
joining with a Table B that I can shuffle via Hive Hash to Table A. 

 

Thank you,

Tyson

 




 

-- 

Ryan Blue

Software Engineer

Netflix



Re: Hive Hash in Spark

2019-03-06 Thread Ryan Blue
I think this was needed to add support for bucketed Hive tables. Like Tyson
noted, if the other side of a join can be bucketed the same way, then Spark
can use a bucketed join. I have long-term plans to support this in the
DataSourceV2 API, but I don't think we are very close to implementing it
yet.

rb
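
A small sketch of the bucketed-join idea, in plain Python rather than Spark (all names are hypothetical, and the hash is a stand-in for whatever function both tables share). It assumes both sides were bucketed with the same function and the same bucket count, so equal keys always land in the same bucket number and each pair of buckets can be joined on its own.

```python
# Illustrative sketch only: not Spark code; names are hypothetical.
from collections import defaultdict


def bucket_of(key, num_buckets):
    # Stand-in hash for int keys; any function works if both sides share it.
    return (key & 0x7FFFFFFF) % num_buckets


def bucketize(rows, num_buckets):
    """Group (key, value) rows by bucket id, as a bucketed table would store them."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def bucketed_join(left_buckets, right_buckets, num_buckets):
    """Inner join done bucket by bucket; no row ever moves to another bucket."""
    out = []
    for b in range(num_buckets):
        # Index the right side of this bucket by key.
        right_index = defaultdict(list)
        for key, value in right_buckets.get(b, []):
            right_index[key].append(value)
        # Probe with the left side of the same bucket only.
        for key, left_value in left_buckets.get(b, []):
            for right_value in right_index.get(key, []):
                out.append((key, left_value, right_value))
    return out


if __name__ == "__main__":
    n = 4
    table_a = bucketize([(1, "a1"), (2, "a2"), (5, "a5")], n)
    table_b = bucketize([(1, "b1"), (5, "b5"), (7, "b7")], n)
    print(sorted(bucketed_join(table_a, table_b, n)))
```

If only Table A is already bucketed, Table B would have to be redistributed with the same function and bucket count first. That is the one-sided shuffle Tyson describes, as opposed to shuffling both tables.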

On Wed, Mar 6, 2019 at 1:57 PM Reynold Xin  wrote:

> I think they might be used in bucketing? Not 100% sure.
>
>
> On Wed, Mar 06, 2019 at 1:40 PM,  wrote:
>
>> Hi,
>>
>>
>>
>> I noticed the existence of a Hive Hash partitioning implementation in
>> Spark, but also noticed that it’s not being used, and that the Spark hash
>> partitioning function is presently hardcoded to Murmur3. My question is
>> whether Hive Hash is dead code or are there future plans to support reading
>> and understanding data that has been partitioned using Hive Hash? By
>> understanding, I mean that I’m able to avoid a full shuffle join on Table A
>> (partitioned by Hive Hash) when joining with a Table B that I can shuffle
>> via Hive Hash to Table A.
>>
>>
>>
>> Thank you,
>>
>> Tyson
>>
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Hive Hash in Spark

2019-03-06 Thread Reynold Xin
I think they might be used in bucketing? Not 100% sure.

On Wed, Mar 06, 2019 at 1:40 PM, < tcon...@gmail.com > wrote:

> Hi,
>
> I noticed the existence of a Hive Hash partitioning implementation in
> Spark, but also noticed that it’s not being used, and that the Spark hash
> partitioning function is presently hardcoded to Murmur3. My question is
> whether Hive Hash is dead code or are there future plans to support
> reading and understanding data that has been partitioned using Hive Hash?
> By understanding, I mean that I’m able to avoid a full shuffle join on
> Table A (partitioned by Hive Hash) when joining with a Table B that I can
> shuffle via Hive Hash to Table A.
>
> Thank you,
>
> Tyson