Re: Question about Serialization in Storage Level

Imran Rashid Wed, 27 May 2015 08:35:43 -0700

Hi Zhipeng,

yes, your understanding is correct.  the "SER" portion just refers to how
its stored in memory.  On disk, the data always has to be serialized.



On Fri, May 22, 2015 at 10:40 PM, Jiang, Zhipeng <zhipeng.ji...@intel.com>
wrote:

>  Hi Todd, Howard,
>
>
>
> Thanks for your reply, I might not present my question clearly.
>
>
>
> What I mean is, if I call *rdd.persist(StorageLevel.MEMORY_AND_DISK)*,
> the BlockManager will cache the rdd to MemoryStore. RDD will be migrated to
> DiskStore when it cannot fit in memory. I think this migration does require
> data serialization and compression (if spark.rdd.compress is set to be
> true). So the data in Disk is serialized, even if I didn’t choose a
> serialized storage level, am I right?
>
>
>
> Thanks,
>
> Zhipeng
>
>
>
>
>
> *From:* Todd Nist [mailto:tsind...@gmail.com]
> *Sent:* Thursday, May 21, 2015 8:49 PM
> *To:* Jiang, Zhipeng
> *Cc:* user@spark.apache.org
> *Subject:* Re: Question about Serialization in Storage Level
>
>
>
> From the docs,
> https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
> :
>
>
>
> *Storage Level*
>
> *Meaning*
>
> MEMORY_ONLY
>
> Store RDD as deserialized Java objects in the JVM. If the RDD does not fit
> in memory, some partitions will not be cached and will be recomputed on the
> fly each time they're needed. This is the default level.
>
> MEMORY_AND_DISK
>
> Store RDD as *deserialized* Java objects in the JVM. If the RDD does not
> fit in memory, store the partitions that don't fit on disk, and read them
> from there when they're needed.
>
> MEMORY_ONLY_SER
>
> Store RDD as *serialized* Java objects (one byte array per partition).
> This is generally more space-efficient than deserialized objects,
> especially when using a fast serializer
> <https://spark.apache.org/docs/latest/tuning.html>, but more
> CPU-intensive to read.
>
> MEMORY_AND_DISK_SER
>
> Similar to *MEMORY_ONLY_SER*, but spill partitions that don't fit in
> memory to disk instead of recomputing them on the fly each time they're
> needed.
>
>
>
> On Thu, May 21, 2015 at 3:52 AM, Jiang, Zhipeng <zhipeng.ji...@intel.com>
> wrote:
>
>  Hi there,
>
>
>
> This question may seem to be kind of naïve, but what’s the difference
> between *MEMORY_AND_DISK* and *MEMORY_AND_DISK_SER*?
>
>
>
> If I call *rdd.persist(StorageLevel.MEMORY_AND_DISK)*, the BlockManager
> won’t serialize the *rdd*?
>
>
>
> Thanks,
>
> Zhipeng
>
>
>

Re: Question about Serialization in Storage Level

Reply via email to