Hi,

Good point. I have just measured performance with "spark.sql.inMemoryColumnarStorage.compressed=false". It improved performance compared to the default, but it is still slower than the RDD version in my environment.
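For reference, the setting above can also be changed per session instead of at launch; a minimal config sketch, assuming a running spark-shell with a SparkSession bound to `spark`:

```scala
// Disable compression for the in-memory columnar cache (config fragment).
// Assumes an existing SparkSession named `spark`, as in spark-shell.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")

// Re-run the cached count after changing the setting:
spark.range(Int.MaxValue).cache().count()
```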
It seems to be consistent with the PR https://github.com/apache/spark/pull/11956. This PR shows that there is room for performance improvement for float/double values that are not compressed.

Kazuaki Ishizaki

From: linguin....@gmail.com
To: Maciej Bryński <mac...@brynski.pl>
Cc: Spark dev list <dev@spark.apache.org>
Date: 2016/08/28 11:30
Subject: Re: Cache'ing performance

Hi,

How does the performance difference change when turning off compression? It is enabled by default.

// maropu

Sent from iPhone

On 2016/08/28 at 10:13, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:

Hi,

I think that this is a performance issue in both the DataFrame and Dataset caches; it is not due only to Encoders. The DataFrame version "spark.range(Int.MaxValue).toDF.cache().count()" is also slow.

While a cache for a DataFrame or Dataset is stored in a columnar format with some compressed data representation, we have found that there is room to improve performance, and we have already created pull requests to address it. These pull requests are under review:
https://github.com/apache/spark/pull/11956
https://github.com/apache/spark/pull/14091

We would appreciate your feedback on these pull requests.

Best Regards,
Kazuaki Ishizaki

From: Maciej Bryński <mac...@brynski.pl>
To: Spark dev list <dev@spark.apache.org>
Date: 2016/08/28 05:40
Subject: Cache'ing performance

Hi,

I did some benchmarking of the cache function today.

RDD:
sc.parallelize(0 until Int.MaxValue).cache().count()

Datasets:
spark.range(Int.MaxValue).cache().count()

For me, Datasets were 2 times slower.

Results (3 nodes, 20 cores and 48 GB RAM each):
RDD - 6 s
Datasets - 13.5 s

Is that the expected behavior for Datasets and Encoders?

Regards,
--
Maciek Bryński
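For anyone reproducing the comparison in the thread above, a small wall-clock timing helper can make the RDD-vs-Dataset numbers easier to collect consistently. This is a hypothetical sketch, not part of the original thread; the commented usage lines assume a live SparkContext (`sc`) and SparkSession (`spark`):

```scala
// Minimal wall-clock timing helper for one-off benchmarks (hypothetical sketch).
// Runs the block once, prints the elapsed time, and returns the block's
// result together with the elapsed time in seconds.
def time[T](label: String)(block: => T): (T, Double) = {
  val start = System.nanoTime()
  val result = block                      // forces the action, e.g. count()
  val secs = (System.nanoTime() - start) / 1e9
  println(f"$label%-10s $secs%.2f s")
  (result, secs)
}

// Usage with the benchmarks from the thread (requires a Spark cluster):
// time("RDD")     { sc.parallelize(0 until Int.MaxValue).cache().count() }
// time("Dataset") { spark.range(Int.MaxValue).cache().count() }
```

Note that cache() is lazy in both APIs, so the first count() pays the caching cost; timing a second count() instead would measure reads from the already-built cache.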