There could be clues in the different RDD subclasses: rdd1 is a ParallelCollectionRDD, while rdd3 is a SubtractedRDD.
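
Here's a minimal sketch of how one might confirm that from spark-shell
(assumes Spark 1.6.0 as in the report below; sc.getRDDStorageInfo reads the
same block-manager numbers the web UI's Storage tab displays):

// Reproduce the two cached RDDs from the report below
val rdd1 = sc.parallelize(Array((1, 1), (2, 2), (3, 3))).cache()
rdd1.first
val rdd2 = sc.parallelize(Array[(Int, Int)]())
val rdd3 = rdd1.subtractByKey(rdd2).cache()
rdd3.first

// The concrete subclasses differ, so the cached partitions may hold
// differently shaped object graphs
println(rdd1.getClass.getSimpleName) // ParallelCollectionRDD
println(rdd3.getClass.getSimpleName) // SubtractedRDD

// Cached size per RDD as tracked by the driver
sc.getRDDStorageInfo.foreach(info =>
  println(s"RDD ${info.id}: ${info.memSize} bytes in memory"))
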
On Thu, Feb 18, 2016 at 1:37 PM, DaPsul <dap...@gmx.de> wrote:
> (copy from
> http://stackoverflow.com/questions/35467128/spark-subtractbykey-increases-rdd-cached-memory-size)
>
> I've found some very strange behavior for RDDs (Spark 1.6.0 with Scala
> 2.11):
>
> When I use subtractByKey on an RDD, the resulting RDD should be of equal
> or smaller size. What I get instead is an RDD that takes even more space
> in memory:
>
> // Initialize first RDD
> val rdd1 = sc.parallelize(Array((1,1),(2,2),(3,3))).cache()
>
> // dummy action to cache it => size according to the web UI: 184 bytes
> rdd1.first
>
> // Initialize the RDD to subtract (an empty RDD should result in no
> // change to rdd1)
> val rdd2 = sc.parallelize(Array[(Int,Int)]())
>
> // perform the subtraction
> val rdd3 = rdd1.subtractByKey(rdd2).cache()
>
> // dummy action to cache rdd3 => size according to the web UI: 208 bytes
> rdd3.first
>
> I first noticed this strange behavior with an RDD of ~200k rows and
> 1.3 GB in size that grew to more than 2 GB after the subtraction.
>
> Edit: Tried the example above with more values (10k) => same behavior.
> The size increases by a factor of ~1.6. reduceByKey also seems to have a
> similar effect.
>
> When I create an RDD via
>
> sc.parallelize(rdd3.collect())
>
> the size is the same as for rdd3, so the increased size carries over even
> when the data is extracted from the RDD.
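
As a follow-up probe: to see whether the extra space is in the records
themselves or in how each RDD caches its partitions, one could estimate the
size of the collected arrays directly. A rough sketch (SizeEstimator is a
Spark developer API, so its numbers are approximations, not exact block
sizes):

import org.apache.spark.util.SizeEstimator

// Both collect() calls yield an Array[(Int, Int)] with the same tuples;
// if these estimates match while the cached sizes differ, the overhead
// lives in the cached representation rather than in the data itself
println(SizeEstimator.estimate(rdd1.collect()))
println(SizeEstimator.estimate(rdd3.collect()))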