@RK yeah, I'm thinking this may be a better question for the @dev group. But from the files I pointed out, the code and the comments in those files incline me to think it is actually storing byte code.
On Tue, Aug 23, 2016 4:37 PM, RK Aduri rkad...@collectivei.com wrote:
Can you come up with your complete analysis? A snapshot of what you think the code is doing. Maybe that would help us understand what exactly you are trying to convey.

On Aug 23, 2016, at 4:21 PM, kant kodali < kanth...@gmail.com > wrote:
[GitHub link: apache/spark]

On Tue, Aug 23, 2016 4:17 PM, kant kodali kanth...@gmail.com wrote:
@RK you may want to look more deeply if you are curious. The code starts here [GitHub link: apache/spark] and goes here, where it is trying to save the Python code object (which is byte code) [GitHub link: apache/spark].

On Tue, Aug 23, 2016 2:39 PM, RK Aduri rkad...@collectivei.com wrote:
I just had a glance. AFAIK, that has nothing to do with RDDs. It's a pickler used to serialize and deserialize the Python code.

On Aug 23, 2016, at 2:23 PM, kant kodali < kanth...@gmail.com > wrote:
@Sean well this makes sense, but I wonder what the following source code is doing? [GitHub link: apache/spark] This code looks like it is trying to store some byte code somewhere (whether in memory or on disk). But why even go down this path of creating code objects so they can be executed later, when all we are trying to do is "persist the result of computing the RDD"?

On Tue, Aug 23, 2016 1:42 PM, Sean Owen so...@cloudera.com wrote:
We're probably mixing up some semantics here. An RDD is indeed, really, just some bookkeeping that records how a certain result is computed. It is not the data itself. However, we often talk about "persisting an RDD", which means "persisting the result of computing the RDD", in which case that persisted representation can be used instead of recomputing it. The result of computing an RDD is really some objects in memory. It's possible to persist the RDD in memory by just storing these objects in memory as cached partitions.
This involves no serialization. Data can be persisted to disk, but that involves serializing objects to bytes (not byte code). It's also possible to store a serialized representation in memory, because it may be more compact. This is not the same as saving/writing an RDD to persistent storage as text or JSON or whatever.

On Tue, Aug 23, 2016 at 9:28 PM, kant kodali < kanth...@gmail.com > wrote:
> @srikanth are you sure? The whole point of RDDs is to store the transformations,
> not the data, as the Spark paper points out, but I do lack the practical
> experience to confirm this. When I looked at the Spark source code
> (specifically the checkpoint code) a while ago, it was clearly storing
> some JVM byte code to disk, which I thought were the transformations.
>
> On Tue, Aug 23, 2016 1:11 PM, srikanth.je...@gmail.com wrote:
>>
>> An RDD contains data but not JVM byte code, i.e. data which is read from
>> a source and to which transformations have been applied. This is the ideal
>> case to persist RDDs. As Nirav mentioned, this data will be serialized
>> before persisting to disk.
>>
>> Thanks,
>> Sreekanth Jella
>>
>> From: kant kodali
>> Sent: Tuesday, August 23, 2016 3:59 PM
>> To: Nirav >
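The distinction Sean draws (serializing objects to bytes vs. storing byte code) can be illustrated with a small stdlib-only Python sketch. This is not Spark's actual serializer; it is a hedged illustration using `pickle` and `marshal`, and the function name `square` is made up for the example. It shows that pickling data produces plain bytes, that pickling an ordinary module-level function stores only a reference (its name), not its compiled byte code, and that shipping byte code requires explicitly serializing the code object itself, which is roughly what PySpark's cloudpickle-based serializer does for closures sent to Python workers.

```python
import marshal
import pickle
import types

# Serializing DATA: pickle turns ordinary objects into plain bytes.
# This is the kind of serialization persist-to-disk does: values, not code.
data = [("a", 1), ("b", 2)]
blob = pickle.dumps(data)
assert pickle.loads(blob) == data  # round-trips the values

# Serializing CODE: pickling a module-level function stores only a
# reference (module + qualified name), not its compiled byte code.
def square(x):  # hypothetical example function
    return x * x

fn_blob = pickle.dumps(square)
assert b"square" in fn_blob                     # the name is in the payload
assert square.__code__.co_code not in fn_blob   # the byte code is not

# To actually ship byte code, you must serialize the code object itself,
# e.g. with marshal, and rebuild a function from it on the other side.
code_blob = marshal.dumps(square.__code__)
restored = types.FunctionType(marshal.loads(code_blob), globals())
assert restored(7) == 49
```

In Spark terms, this is the difference between what `rdd.persist(...)` writes (the computed partition data, serialized as bytes) and what the Python pickler in the linked code handles (the code objects describing the computation itself).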