Hi,
I'm using Spark 2.2 and have a big batch job using DataFrames (with
built-in, basic types). It references the same intermediate DataFrame
multiple times, so I wanted to try cache() on it and see if it helps,
both in memory footprint and in performance.
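
Roughly, the pattern looks like this (a simplified sketch; paths and column
names are made up, and `spark` is the usual SparkSession):

  import org.apache.spark.sql.functions._

  // intermediate DataFrame that several downstream aggregations reuse
  val intermediate = spark.read.parquet("/data/events")
    .filter(col("status") === "active")

  intermediate.cache()   // the call I'm asking about

  val byCountry = intermediate.groupBy("country").count()
  val byProduct = intermediate.groupBy("product").agg(sum("amount"))
  byCountry.write.parquet("/out/by_country")
  byProduct.write.parquet("/out/by_product")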

Now, the Spark 2.2 tuning page (
http://spark.apache.org/docs/latest/tuning.html) clearly says:
1. The default Spark serialization is Java serialization.
2. It is recommended to switch to Kryo serialization.
3. "Since Spark 2.0.0, we internally use Kryo serializer when shuffling
RDDs with simple types, arrays of simple types, or string type".

Now, I remember that around the 2.0 launch there was discussion of a third
serialization format (Encoders?) that is much more performant and compact,
but it is not referenced in the tuning guide and its Scala doc is not very
clear to me. Specifically, Databricks shared some graphs showing how much
better it is than Kryo and Java serialization - see the Encoders section here:
https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html
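
From what I understood from that post, Encoders come into play through the
Dataset API, something along these lines (my own sketch, so I may be off;
the case class and its fields are made up):

  import org.apache.spark.sql.{Dataset, Encoders}
  import spark.implicits._

  // case class describing one row of the intermediate result
  case class Event(key: String, country: String, amount: Double)

  // a Dataset[Event] carries an ExpressionEncoder for Event under the hood
  val events: Dataset[Event] = intermediate.as[Event]

  // print the encoder's schema, just to see what it maps to
  println(Encoders.product[Event].schema.treeString)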

So, is that relevant to cache()? If so, how can I enable it - and does it
apply to MEMORY_AND_DISK or only to MEMORY_AND_DISK_SER?
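
In other words, which of these two variants (if either) would pick up that
format - trying them one at a time, of course:

  import org.apache.spark.storage.StorageLevel

  // deserialized objects in memory, spill to disk
  intermediate.persist(StorageLevel.MEMORY_AND_DISK)

  // serialized in memory - but serialized with what? Java, Kryo, or Encoders?
  intermediate.persist(StorageLevel.MEMORY_AND_DISK_SER)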

I tried to play with some other variations, like enabling Kryo per the
tuning guide instructions, but didn't see any impact on the cached
DataFrame size (still the same tens of GBs in the UI). Any tips around that?
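
Concretely, this is roughly what I set when trying Kryo (taken from the
tuning guide; the app name is just a placeholder):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("big-batch-job")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()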

Thanks.

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
