Spark 2.2.0 GC Overhead Limit Exceeded and OOM errors in the executors

2017-10-27 Thread Supun Nakandala
Hi all, I am trying to run an image analytics workload using Spark. The images are read in JPEG format and then converted to raw format in map functions, which causes the size of the partitions to grow by about an order of magnitude. In addition to this, I am caching some of the data because my
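To put a rough number on how much a partition can grow when JPEG images are decoded to raw pixels inside a map function, here is a minimal back-of-the-envelope sketch. The image dimensions, channel count, and compressed file size are illustrative assumptions, not figures from the original message, and the helper names are hypothetical.

```python
# Rough estimate of in-memory growth when a JPEG is decoded to raw pixels.
# All concrete numbers below are assumptions chosen for illustration only.

def raw_size_bytes(width, height, channels, bytes_per_channel=1):
    """Size of an uncompressed image buffer: one value per pixel per channel."""
    return width * height * channels * bytes_per_channel


def growth_factor(jpeg_bytes, width, height, channels):
    """How many times larger the decoded buffer is than the compressed file."""
    return raw_size_bytes(width, height, channels) / jpeg_bytes


# Assumed example: a 1920x1080 RGB image stored as a 512,000-byte JPEG.
assumed_jpeg_bytes = 512_000
decoded = raw_size_bytes(1920, 1080, 3)
factor = growth_factor(assumed_jpeg_bytes, 1920, 1080, 3)

print(f"decoded size: {decoded} bytes, growth factor: {factor:.2f}x")
```

Multiplying the estimated decoded size by the number of images per partition gives a rough lower bound on the executor memory a decoded (and possibly cached) partition needs, before any JVM object overhead.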

Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Supun Nakandala
0-13 14:50 GMT-07:00 Vadim Semenov :
>>
>>> When you do `Dataset.rdd` you actually create a new job
>>>
>>> here you can see what it does internally:
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/

Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Supun Nakandala
he the new RDD.
>
> On Fri, Oct 13, 2017 at 3:35 PM, Supun Nakandala <supun.nakand...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have been experimenting with cache/persist/unpersist methods with
>> respect to both Dataframes and RDD APIs. However, I am experie

Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Supun Nakandala
Hi all, I have been experimenting with cache/persist/unpersist methods with respect to both the DataFrame and RDD APIs. However, I am experiencing different behavior with the DataFrame API compared to the RDD API, such as DataFrames not getting cached when count() is called. Is there a difference between how thes