Re: Sort Merge Join

2015-11-02 Thread Alex Nastetsky
Thanks for the response. Treating the file-system-based data source as “UnknownPartitioning” is a simple and SAFE approach for JOIN, as it’s hard to guarantee that records from different data sets with identical join keys will be loaded by the same node/task, since lots of factors need to be

Re: Sort Merge Join

2015-11-02 Thread Jonathan Coveney
Additionally, I'm curious if there are any JIRAs around making DataFrames support ordering better? There are a lot of operations that can be optimized if you know that you have a total ordering on your data... Are there any plans, or at least JIRAs, around having the Catalyst optimizer handle this

RE: Sort Merge Join

2015-11-02 Thread Cheng, Hao
Not as far as I can tell. @Michael @YinHuai @Reynold, any comments on this optimization? From: Jonathan Coveney [mailto:jcove...@gmail.com] Sent: Tuesday, November 3, 2015 4:17 AM To: Alex Nastetsky Cc: Cheng, Hao; user Subject: Re: Sort Merge Join Additionally, I'm curious if there are any

RE: Sort Merge Join

2015-11-01 Thread Cheng, Hao
1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For example, in the code below, the two datasets have different numbers of partitions, but it still does a SortMerge join after a "hashpartitioning". [Hao:] A distributed JOIN operation (either HashBased or SortBased Join)
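
For readers following the thread, here is a minimal sketch in plain Python (not Spark code, and not Spark's actual implementation) of why both sides are hash-partitioned on the join key before a sort-merge join: once both inputs use the same partitioning function and partition count, rows with identical keys are guaranteed to land in the same partition, so each partition pair can be sorted and merged independently. The function names `hash_partition`, `sort_merge_join`, and `distributed_join` are hypothetical and chosen only for illustration.

```python
def hash_partition(rows, num_partitions):
    """Route each (key, value) row to the partition chosen by hash(key)."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) rows by sorting, then merging."""
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-side rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    out.append((lk, left[i][1], rv))
                i += 1
            j = j_end
    return out


def distributed_join(left, right, num_partitions=4):
    """Simulate the shuffle: co-partition both sides, then join pairwise."""
    lparts = hash_partition(left, num_partitions)
    rparts = hash_partition(right, num_partitions)
    result = []
    for lp, rp in zip(lparts, rparts):
        result.extend(sort_merge_join(lp, rp))
    return result


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (2, "c"), (5, "e")]
    right = [(2, "x"), (3, "y"), (5, "z"), (2, "w")]
    print(sorted(distributed_join(left, right, num_partitions=3)))
```

If the two inputs arrived with different partition counts, or with a partitioning the engine cannot verify, the `zip` over partitions above would pair up unrelated key ranges, which is the situation the re-partitioning step avoids.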