Re: Are RDD's ever persisted to disk?

kant kodali Tue, 23 Aug 2016 13:29:27 -0700

@srikanth, are you sure? The whole point of RDDs is to store the
transformations, not the data, as the Spark paper points out, though I lack the
practical experience to confirm this. When I looked at the Spark source code
(specifically the checkpoint code) a while ago, it was clearly storing some JVM
bytecode to disk, which I assumed were the transformations.






On Tue, Aug 23, 2016 1:11 PM, srikanth.je...@gmail.com wrote:
An RDD contains data, not JVM bytecode, i.e. data that has been read from the
source and had the transformations applied to it. This is the ideal case for
persisting RDDs. As Nirav mentioned, this data is serialized before it is
persisted to disk.
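
To make the "data, not bytecode" point concrete, here is a minimal sketch in
plain Python. It is a toy illustration, not Spark code: the record list, the
file name and the use of pickle are all assumptions made for the example
(Spark itself uses Java serialization or Kryo). It shows only that what lands
on disk is the already-computed records.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for data read from a source (not a Spark API).
source_records = [1, 2, 3, 4]

# Apply a transformation to get the records a persisted partition would hold.
transformed = [x * 10 for x in source_records]

# "Persist to disk": serialize the computed records to bytes and write them out.
# The file name is made up for this example.
path = os.path.join(tempfile.mkdtemp(), "partition_0.bin")
with open(path, "wb") as f:
    f.write(pickle.dumps(transformed))

# Reading the file back yields the data itself; no transformation is re-run.
with open(path, "rb") as f:
    restored = pickle.loads(f.read())
```

After this runs, restored holds the transformed records, even though the file
contains no trace of the code that produced them.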



Thanks,
Sreekanth Jella



From: kant kodali
Sent: Tuesday, August 23, 2016 3:59 PM
To: Nirav
Cc: RK Aduri ; srikanth.je...@gmail.com ; user@spark.apache.org
Subject: Re: Are RDD's ever persisted to disk?



Storing an RDD to disk is nothing but storing JVM bytecode to disk (in the case
of Java or Scala). Am I correct?







On Tue, Aug 23, 2016 12:55 PM, Nirav nira...@gmail.com wrote:

You can either store it in serialized form (a byte array) or just save it in a
string format like TSV or CSV. There are different RDD save APIs for that.
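
A rough sketch of the two options in plain Python, assuming a couple of made-up
records; pickle and the csv module stand in for whatever serializer and text
writer are actually used. In Spark itself the corresponding RDD calls would be
along the lines of saveAsObjectFile and saveAsTextFile.

```python
import csv
import io
import pickle

# Hypothetical records; in Spark these would be the elements of an RDD.
records = [("alice", 3), ("bob", 5)]

# Option 1: serialized form, an opaque byte array.
as_bytes = pickle.dumps(records)

# Option 2: string format, one tab-separated line per record.
buf = io.StringIO()
csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(records)
as_tsv = buf.getvalue()

# The byte array round-trips with types intact; the TSV comes back as strings.
from_bytes = pickle.loads(as_bytes)
from_tsv = [tuple(row) for row in csv.reader(io.StringIO(as_tsv), delimiter="\t")]
```

The trade-off is the usual one: the serialized form preserves types but is not
human-readable, while the text form is readable by other tools but has to be
parsed again on load.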



On Aug 23, 2016, at 12:26 PM, kant kodali < kanth...@gmail.com > wrote:

OK, now I understand that an RDD can be stored to disk. My last question on
this topic is this:



Storing an RDD to disk is nothing but storing JVM bytecode to disk (in the case
of Java or Scala). Am I correct?







On Tue, Aug 23, 2016 12:19 PM, RK Aduri rkad...@collectivei.com wrote:

On another note, if you have a streaming app, you checkpoint the RDDs so that
they can be accessed in case of a failure. And yes, RDDs are persisted to disk.
You can open Spark’s UI and see them listed under the Storage tab.



If RDDs are persisted in memory, you avoid disk I/O, so lookups are cheap. Lost
RDDs are reconstructed based on the lineage graph (the DAG, which is visible in
the Spark UI).



On Aug 23, 2016, at 12:10 PM, < srikanth.je...@gmail.com > wrote:



RAM and virtual memory are finite, so the data size needs to be considered
before calling persist. Please see the documentation below on which storage
level to choose.




http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose



Thanks,
Sreekanth Jella



From: kant kodali
Sent: Tuesday, August 23, 2016 2:42 PM
To: srikanth.je...@gmail.com
Cc: user@spark.apache.org
Subject: Re: Are RDD's ever persisted to disk?



So when do we ever need to persist an RDD to disk, given that we don't need to
worry about RAM, since virtual memory will just push pages to disk when memory
becomes scarce?







On Tue, Aug 23, 2016 11:23 AM, srikanth.je...@gmail.com wrote:

Hi Kant Kodali,



Depending on the storage level passed to the persist() method, the RDD is
either cached in memory or persisted to disk. In case of failure, Spark
reconstructs the RDD on a different executor based on the DAG. That is how
failures are handled. By default, Spark Core does not replicate RDDs, as they
can be reconstructed from the source (say HDFS, Hive or S3), but not from
memory (which is already lost).
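
The recovery idea can be sketched with a toy model in plain Python. ToyRDD and
everything in it are hypothetical names invented for this illustration, not
Spark's implementation. One simplification: this persist() computes eagerly,
whereas Spark's persist() is lazy.

```python
class ToyRDD:
    """Toy model of lineage-based recovery; not Spark's implementation."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source    # durable input (stands in for HDFS, Hive or S3)
        self.parent = parent    # upstream ToyRDD in the lineage graph
        self.fn = fn            # transformation applied to each parent record
        self.cached = None      # in-memory copy, lost if the executor dies

    def map(self, fn):
        # Record the transformation only; nothing is computed here.
        return ToyRDD(parent=self, fn=fn)

    def persist(self):
        # Eager for brevity; Spark's persist() is lazy.
        self.cached = self.compute()
        return self

    def compute(self):
        if self.cached is not None:
            return self.cached
        if self.parent is None:
            return list(self.source)
        return [self.fn(x) for x in self.parent.compute()]


base = ToyRDD(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2).persist()
before_failure = doubled.compute()

# Simulate losing the executor: the in-memory copy is gone.
doubled.cached = None

# The result is rebuilt by replaying the lineage from the durable source.
after_failure = doubled.compute()
```

Both before_failure and after_failure hold the same records; the second one is
obtained purely by re-running the recorded transformation on the source.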



Thanks,
Sreekanth Jella



From: kant kodali
Sent: Tuesday, August 23, 2016 2:12 PM
To: user@spark.apache.org
Subject: Are RDD's ever persisted to disk?



I am new to Spark and I keep hearing that RDDs can be persisted to memory or
disk after each checkpoint. I wonder why RDDs are persisted in memory. In case
of node failure, how would you access memory to reconstruct the RDD? Persisting
to disk makes sense because it's like persisting to a network file system (in
the case of HDFS), where each block has multiple copies across nodes, so if a
node goes down, RDDs can still be reconstructed by reading the required block
from other nodes and recomputing it. But my biggest question is: are RDDs ever
persisted to disk?




