Re: How to make Dataset api as fast as DataFrame

2016-01-13 Thread Michael Armbrust
The focus of this release was to get the API out there, and there's a lot of low-hanging performance optimizations. That said, there is likely always going to be some cost of materializing objects. Another note: anytime you're comparing performance, it's useful to include the output of explain so we

Re: How to make Dataset api as fast as DataFrame

2016-01-13 Thread Arkadiusz Bicz
Hi,

Including query plan:

DataFrame:
== Physical Plan ==
SortBasedAggregate(key=[agreement#23], functions=[(MaxVectorAggFunction(values#3),mode=Final,isDistinct=false)], output=[agreement#23,maxvalues#27])
+- ConvertToSafe
   +- Sort [agreement#23 ASC], false, 0
      +- TungstenExchange

How to make Dataset api as fast as DataFrame

2016-01-13 Thread Arkadiusz Bicz
Hi, I have done some performance tests by repeating execution with different numbers of executors and amounts of memory for YARN-clustered Spark (version 1.6.0; the cluster contains 6 large nodes). I found Dataset joinWith or cogroup to be 3 to 5 times slower than a broadcast join in DataFrame. How to