Hi all,
I'm relatively new to Spark, and I'm trying to understand how to optimize a
sort merge join over Parquet data.
My work consists of computing stats on purchases for a retail company.
For example, I have to calculate the mean purchase over a period, for a
segment of products and a segment of clients.
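A rough sketch of that kind of query is below. All paths, table schemas, and column names (purchases, product_segment, amount, etc.) are assumptions for illustration, not details from this thread, and the sketch uses the modern SparkSession API rather than the SQLContext API current when this thread was written:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

object MeanPurchase {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mean-purchase")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical schemas: purchases(client_id, product_id, date, amount),
    // product_segment(product_id), client_segment(client_id).
    val purchases      = spark.read.parquet("/data/purchases")
    val productSegment = spark.read.parquet("/data/product_segment")
    val clientSegment  = spark.read.parquet("/data/client_segment")

    // Restrict to the period, then keep only purchases belonging to both
    // segments. Each equi-join here typically plans as a sort-merge join
    // (an Exchange plus a Sort on the join key on each side) unless one
    // side is small enough to be broadcast.
    val meanPurchase = purchases
      .filter(col("date").between("2015-01-01", "2015-06-30"))
      .join(productSegment, "product_id")
      .join(clientSegment, "client_id")
      .agg(avg(col("amount")).as("mean_purchase"))

    meanPurchase.show()
  }
}
```

Calling `meanPurchase.explain()` instead of `show()` is a cheap way to check which join strategy the planner actually picked.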
>> Taking the file system based data source as "UnknownPartitioning" will
>> be a simple and SAFE way for JOIN, as it's hard to guarantee the records
>> from different data sets with the identical join keys will be loaded by
>> the same node/task, since lots of factors need to be considered, like
>> task pool size, cluster size, source format, storage, data locality, etc.
>
> I'll agree it's worth optimizing for performance concerns, and actually
> in Hive [...] repartitioning on "key".
Not as far as I can tell. @Michael @YinHuai @Reynold, any comments on this
optimization?
From: Jonathan Coveney [mailto:jcove...@gmail.com]
Sent: Tuesday, November 3, 2015 4:17 AM
To: Alex Nastetsky
Cc: Cheng, Hao; user
Subject: Re: Sort Merge Join
Additionally, I'm curious if there are any
Hi,
I'm trying to understand SortMergeJoin (SPARK-2213).
1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For
example, in the code below, the two datasets have different numbers of
partitions, but it still does a SortMergeJoin after a "hashpartitioning".
CODE:
val
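The code snippet above is truncated in this archive. A minimal sketch of the kind of experiment being described (two datasets with deliberately different partition counts, joined and then inspected with `explain()`) might look like the following; the dataset contents and names are illustrative assumptions, not the original code:

```scala
import org.apache.spark.sql.SparkSession

object JoinPlanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("smj-demo")
      .master("local[*]")
      .getOrCreate()

    // Two datasets with different partition counts (8 vs 3).
    val left  = spark.range(0, 1000000).repartition(8).withColumnRenamed("id", "key")
    val right = spark.range(0, 1000000).repartition(3).withColumnRenamed("id", "key")

    // When sort-merge join is preferred, the plan shows a SortMergeJoin fed
    // by Sort and Exchange(hashpartitioning(key, ...)) on each side: both
    // inputs are re-shuffled to the same partitioning before the merge,
    // regardless of their original partition counts.
    val joined = left.join(right, "key")
    joined.explain()
  }
}
```

That re-shuffle is exactly the point of the "UnknownPartitioning" remark quoted earlier: since the planner cannot prove the Parquet inputs are already co-partitioned on the join key, it inserts the exchange to be safe.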
From: Alex Nastetsky [mailto:alex.nastet...@vervemobile.com]
Sent: Monday, November 2, 2015 11:29 AM
To: user
Subject: Sort Merge Join