Thanks for the response.
Treating a file-system-based data source as “UnknownPartitioning” is a
simple and SAFE approach for JOIN, as it’s hard to guarantee that records from
different data sets with identical join keys will be loaded by the same
node/task, since lots of factors need to be
Additionally, I'm curious if there are any JIRAs around making DataFrames
support ordering better? There are a lot of operations that can be
optimized if you know that you have a total ordering on your data...are
there any plans, or at least JIRAs, around having the Catalyst optimizer
handle this
Not as far as I can tell. @Michael, @YinHuai, @Reynold, any comments on this
optimization?
From: Jonathan Coveney [mailto:jcove...@gmail.com]
Sent: Tuesday, November 3, 2015 4:17 AM
To: Alex Nastetsky
Cc: Cheng, Hao; user
Subject: Re: Sort Merge Join
Additionally, I'm curious if there are any
1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For
example, in the code below, the two datasets have different numbers of
partitions, but it still does a SortMerge join after a "hashpartitioning".
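
To make the situation in this question concrete, here is a minimal sketch in plain Python (not Spark code, and not how Spark implements it internally). It assumes both inputs are re-shuffled by a hash of the join key into the same number of partitions, after which each pair of matching partitions is sorted and merged. The function names and the sample data are hypothetical, chosen only for illustration.

```python
def hash_partition(rows, num_partitions):
    """Shuffle step: send each (key, value) row to partition hash(key) % n."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) rows by sorting, then merging."""
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-side rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    out.append((lk, left[i][1], rv))
                i += 1
            j = j_end
    return out


def distributed_join(left_parts, right_parts, num_shuffle_partitions):
    """Re-partition both sides by key hash, then join partition by partition."""
    left_rows = [r for p in left_parts for r in p]
    right_rows = [r for p in right_parts for r in p]
    left_shuffled = hash_partition(left_rows, num_shuffle_partitions)
    right_shuffled = hash_partition(right_rows, num_shuffle_partitions)
    result = []
    for lp, rp in zip(left_shuffled, right_shuffled):
        result.extend(sort_merge_join(lp, rp))
    return result


# Hypothetical inputs: the left side starts with 3 partitions, the right with 2.
left_parts = [[(1, "a"), (4, "b")], [(2, "c")], [(3, "d"), (1, "e")]]
right_parts = [[(1, "x"), (3, "y")], [(2, "z"), (5, "w")]]

joined = distributed_join(left_parts, right_parts, num_shuffle_partitions=4)
print(sorted(joined))
```

The point of the sketch is that the original partition counts (3 and 2 here) stop mattering once both sides have been hash-partitioned on the join key to a common count: rows with equal keys end up in the same partition index on both sides, so a per-partition sort-merge is enough.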
[Hao:] A distributed JOIN operation (either HashBased or SortBased Join)