Hi,
Good point. I have just measured performance with
"spark.sql.inMemoryColumnarStorage.compressed=false".
It improved performance over the default setting. However, it is still
slower than the RDD version in my environment.
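
For reference, the measurement can be reproduced in spark-shell along these
lines (a sketch; the elapsed-time helper below is my own, not a Spark API):

// Disable in-memory columnar compression before caching.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")

// Simple wall-clock timer, for illustration only.
def elapsed[T](body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"elapsed: ${(System.nanoTime() - start) / 1e9} s")
  result
}

elapsed { spark.range(Int.MaxValue).cache().count() }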

This seems consistent with the PR
https://github.com/apache/spark/pull/11956, which shows that there is room
for performance improvement for float/double values that are not compressed.

Kazuaki Ishizaki



From:   linguin....@gmail.com
To:     Maciej Bryński <mac...@brynski.pl>
Cc:     Spark dev list <dev@spark.apache.org>
Date:   2016/08/28 11:30
Subject:        Re: Cache'ing performance



Hi,

How does the performance difference change when turning off compression?
It is enabled by default.
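
For example, it can be turned off at runtime or at submit time (a sketch;
the runtime form assumes a Spark 2.0 SparkSession):

spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")
// or: spark-shell --conf spark.sql.inMemoryColumnarStorage.compressed=false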

// maropu

Sent from my iPhone

On 2016/08/28 10:13, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:

Hi,
I think that it is a performance issue in both the DataFrame and the
Dataset cache; it is not due only to Encoders. The DataFrame version
"spark.range(Int.MaxValue).toDF.cache().count()" is also slow.

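To make the comparison concrete, the three variants look like this (a
sketch; spark is a Spark 2.0 SparkSession and sc its SparkContext):

// RDD cache
sc.parallelize(0 until Int.MaxValue).cache().count()

// DataFrame cache (also slow)
spark.range(Int.MaxValue).toDF.cache().count()

// Dataset cache (also slow)
spark.range(Int.MaxValue).cache().count()
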
While a cache for a DataFrame or Dataset is stored in a columnar format
with a compressed data representation, we have found that there is room to
improve performance. We have already created pull requests to address this,
and they are under review:
https://github.com/apache/spark/pull/11956
https://github.com/apache/spark/pull/14091
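
As a quick way to see that a cached Dataset goes through the in-memory
columnar path, one can inspect the physical plan (a sketch; the exact
operator names vary across Spark versions):

val df = spark.range(1000).toDF
df.cache().count()  // materialize the cache
df.explain()        // plan shows InMemoryTableScan over an InMemoryRelation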

We would appreciate your feedback on these pull requests.

Best Regards,
Kazuaki Ishizaki



From:        Maciej Bryński <mac...@brynski.pl>
To:        Spark dev list <dev@spark.apache.org>
Date:        2016/08/28 05:40
Subject:        Cache'ing performance



Hi,
I did some benchmarking of the cache function today.

RDD
sc.parallelize(0 until Int.MaxValue).cache().count()

Datasets
spark.range(Int.MaxValue).cache().count()

For me, the Datasets version was 2 times slower.

Results (3 nodes, 20 cores and 48 GB RAM each):
RDD - 6 s
Datasets - 13.5 s
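
One detail when reproducing this: the first count() materializes the cache,
so repeated runs should unpersist in between; otherwise they measure reads
from the already-built cache (a sketch using the standard unpersist APIs):

val rdd = sc.parallelize(0 until Int.MaxValue)
rdd.cache().count()   // builds the RDD cache
rdd.unpersist()       // drop it before the next run

val ds = spark.range(Int.MaxValue)
ds.cache().count()    // builds the columnar cache
ds.unpersist()        // drop it before the next run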

Is that expected behavior for Datasets and Encoders?

Regards,
-- 
Maciek Bryński


