Michael,

collect() will bring the results down to the driver JVM, whereas the RDD or
DataFrame, if cached, is cached on the executors. So, as Dean said, the
driver JVM needs to have enough memory to store the results of collect().
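
To make that concrete, a rough sketch (assuming the Spark 1.x Java API and an
already-built DataFrame named df, a made-up name here):

    df.cache();                 // cached partitions are materialized on the executors
    Row[] rows = df.collect();  // the entire result is copied into the driver JVM heap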

Thanks,
Silvio

From: Michael Segel <msegel_had...@hotmail.com>
Date: Tuesday, December 22, 2015 at 4:26 PM
To: Dean Wampler <deanwamp...@gmail.com>
Cc: Gaurav Agarwal <gaurav130...@gmail.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Spark data frame

Dean,

So you would have the RDD in memory and then the collect() resulting in a
collection, where both are alive at the same time.
(Again, not sure how Tungsten plays into this… )

So his collection can’t be larger than 1/2 of the memory allocated to the heap.

(Unless you have allocated swap…, right?)
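
If it helps with sizing, a quick way to see what the driver heap actually is
(plain Java, nothing Spark-specific assumed):

    // maxMemory() is roughly the -Xmx value the driver JVM was started with
    long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
    System.out.println("Driver max heap: " + maxHeapMb + " MB");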

On Dec 22, 2015, at 12:11 PM, Dean Wampler <deanwamp...@gmail.com> wrote:

You can call the collect() method to return a collection, but be careful. If 
your data is too big to fit in the driver's memory, it will crash.
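
Two safer patterns, sketched assuming a Spark 1.4+ DataFrame named df and an
output path (both hypothetical):

    Row[] sample = df.take(100);            // bounded: at most 100 rows reach the driver
    df.write().parquet("/some/hdfs/path");  // or keep it distributed and write it out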

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly) <http://shop.oreilly.com/product/0636920033073.do>
Typesafe <http://typesafe.com/>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Tue, Dec 22, 2015 at 1:09 PM, Gaurav Agarwal <gaurav130...@gmail.com> wrote:

We are able to retrieve a DataFrame by filtering the RDD. I need to convert
that DataFrame into Java POJOs. Any idea how to do that?
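
For the archives, one way to do that with the Spark 1.x Java API. This is only
a sketch: the Person bean and the "name"/"age" columns are invented for
illustration, and Java 8 lambdas are assumed.

    import java.io.Serializable;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.Row;

    // hypothetical POJO matching two columns of the DataFrame;
    // Serializable so Spark can ship instances between JVMs
    public class Person implements Serializable {
        private String name;
        private int age;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    // df is the filtered DataFrame from the question;
    // the map runs on the executors, one Person per Row
    JavaRDD<Person> people = df.javaRDD().map(row -> {
        Person p = new Person();
        p.setName(row.getString(row.fieldIndex("name")));
        p.setAge(row.getInt(row.fieldIndex("age")));
        return p;
    });

Going the other way, sqlContext.createDataFrame(people, Person.class) will turn
the POJOs back into a DataFrame if the bean properties line up.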

