Optimize sort merge join

2018-01-27 Thread Antoine Bonnin
Hi all, I'm relatively new to spark and something is bothering me for optimizing sort merge join from parquet. My work consists to get stats on purchases for a retail company. For example, i have to calculate the mean purchase over a period, for a segment of prodcuts and a segment of client

Re: Sort Merge Join

2015-11-02 Thread Alex Nastetsky
with the identical join keys will be loaded by the > same node/task , since lots of factors need to be considered, like task > pool size, cluster size, source format, storage, data locality etc.,. > > I’ll agree it’s worth to optimize it for performance concerns, and > actually in Hive

Re: Sort Merge Join

2015-11-02 Thread Jonathan Coveney
epartitioning on “key”. >> >> >> >> Taking the file system based data source as “UnknownPartitioning”, will >> be a simple and SAFE way for JOIN, as it’s hard to guarantee the records >> from different data sets with the identical join keys will be loaded by the >&

RE: Sort Merge Join

2015-11-02 Thread Cheng, Hao
No as far as I can tell, @Michael @YinHuai @Reynold , any comments on this optimization? From: Jonathan Coveney [mailto:jcove...@gmail.com] Sent: Tuesday, November 3, 2015 4:17 AM To: Alex Nastetsky Cc: Cheng, Hao; user Subject: Re: Sort Merge Join Additionally, I'm curious if there are any

Sort Merge Join

2015-11-01 Thread Alex Nastetsky
Hi, I'm trying to understand SortMergeJoin (SPARK-2213). 1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For example, in the code below, the two datasets have different number of partitions, but it still does a SortMerge join after a "hashpartitioning". CODE: val

RE: Sort Merge Join

2015-11-01 Thread Cheng, Hao
From: Alex Nastetsky [mailto:alex.nastet...@vervemobile.com] Sent: Monday, November 2, 2015 11:29 AM To: user Subject: Sort Merge Join Hi, I'm trying to understand SortMergeJoin (SPARK-2213). 1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For example, in the code below, the two dat