Michael, collect() will bring the results down to the driver JVM, whereas the RDD or DataFrame stays cached on the executors (if it is cached at all). So, as Dean said, the driver JVM needs to have enough memory to hold the results of collect().
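To illustrate the point about driver memory, here is a minimal Java sketch of collect() versus the incremental alternatives in the Spark DataFrame/Dataset API. It assumes a SparkSession `spark` and a DataFrame `df` already exist; nothing here is from the thread itself.

```java
import java.util.Iterator;
import java.util.List;
import org.apache.spark.sql.Row;

// collectAsList() materializes the ENTIRE result in the driver JVM's heap,
// so the driver needs enough memory for the whole result set:
List<Row> all = df.collectAsList();

// Safer when only a sample is needed: pull just the first N rows.
List<Row> sample = df.takeAsList(100);

// Or stream the result through the driver one partition at a time,
// so only a fraction of the data is resident in the driver at once:
Iterator<Row> it = df.toLocalIterator();
while (it.hasNext()) {
    Row row = it.next();
    // process row...
}
```

toLocalIterator() still moves all the data through the driver eventually, but it avoids holding the full result in the heap at one time, which sidesteps the OOM risk Dean and Michael describe.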
Thanks,
Silvio

From: Michael Segel <msegel_had...@hotmail.com>
Date: Tuesday, December 22, 2015 at 4:26 PM
To: Dean Wampler <deanwamp...@gmail.com>
Cc: Gaurav Agarwal <gaurav130...@gmail.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Spark data frame

Dean,
The RDD is in memory, and then the collect() results in a collection, where both are alive at the same time. (Again, not sure how Tungsten plays into this…) So his collection can't be larger than 1/2 of the memory allocated to the heap. (Unless you have allocated swap…, right?)

On Dec 22, 2015, at 12:11 PM, Dean Wampler <deanwamp...@gmail.com> wrote:

You can call the collect() method to return a collection, but be careful. If your data is too big to fit in the driver's memory, it will crash.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com/>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Tue, Dec 22, 2015 at 1:09 PM, Gaurav Agarwal <gaurav130...@gmail.com> wrote:

We are able to retrieve a data frame by filtering the RDD object. I need to convert that data frame into Java POJOs. Any idea how to do that?