The focus of this release was to get the API out there, and there are a lot of
low-hanging performance optimizations left. That said, there is likely always
going to be some cost to materializing objects.
Another note: any time you're comparing performance, it's useful to include the
output of explain so we can compare the query plans.
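For example, printing the plan is a one-liner on any DataFrame. This is a minimal sketch against the Spark 1.6 API; the data and column names (`agreement`, `values`, mirroring the plan below) are assumptions for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(
  new SparkConf().setAppName("explain-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical data, shaped like the aggregation discussed in this thread
val df = sc.parallelize(Seq(("a1", 1.0), ("a1", 3.0), ("a2", 2.0)))
  .toDF("agreement", "values")

// explain(true) prints the logical, optimized, and physical plans;
// explain() prints only the physical plan
df.groupBy("agreement").max("values").explain(true)
```

Attaching that output alongside the timings makes it easy to see whether two variants actually execute the same physical plan.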
Hi,
Including the query plan:
DataFrame:
== Physical Plan ==
SortBasedAggregate(key=[agreement#23],
functions=[(MaxVectorAggFunction(values#3),mode=Final,isDistinct=false)],
output=[agreement#23,maxvalues#27])
+- ConvertToSafe
+- Sort [agreement#23 ASC], false, 0
+- TungstenExchange
Hi,
I have done some performance tests, repeating the execution with
different numbers of executors and different memory settings on a
YARN-clustered Spark (version 1.6.0; the cluster contains 6 large nodes).
I found Dataset joinWith and cogroup to be 3 to 5 times slower than a
broadcast join on DataFrames, how to
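For reference, here is a minimal sketch of the two approaches being compared, against the Spark 1.6 API. The DataFrame names, case classes, and the `agreement` join key are assumptions for illustration, not the original test code:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.broadcast

val sc = new SparkContext(
  new SparkConf().setAppName("join-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical large fact side and small dimension side
val largeDf = sc.parallelize(1 to 1000000)
  .map(i => (s"a${i % 100}", i.toDouble)).toDF("agreement", "values")
val smallDf = sc.parallelize(0 until 100)
  .map(i => (s"a$i", i)).toDF("agreement", "code")

// DataFrame broadcast join: the small side is shipped to every executor,
// so the large side is neither shuffled nor deserialized into objects
val broadcastJoined = largeDf.join(broadcast(smallDf), "agreement")

// Dataset joinWith: both sides go through a shuffle/sort and are
// materialized as JVM objects, which accounts for much of the slowdown
case class Fact(agreement: String, values: Double)
case class Dim(agreement: String, code: Int)
val pairs = largeDf.as[Fact].joinWith(
  smallDf.as[Dim], largeDf("agreement") === smallDf("agreement"))
```

Comparing `explain()` output for the two would show the `BroadcastHashJoin` on the DataFrame side versus the shuffle-based plan behind `joinWith`.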