Re: Estimate size of Dataframe programatically

2015-08-10 Thread Srikanth
SizeEstimator.estimate(df) will not give the size of the DataFrame, right? I think it will give the size of the df object itself. With an RDD, I sample() and collect() and sum the size of each row. If I do the same with a DataFrame, the result will no longer be its size when represented in columnar format. I'd also like to know how

Re: Estimate size of Dataframe programatically

2015-08-10 Thread Ted Yu
From a quick glance at SparkStrategies.scala, when statistics.sizeInBytes of the LogicalPlan is <= autoBroadcastJoinThreshold, the plan's output is used as the 'build' relation in a broadcast join. FYI
On Mon, Aug 10, 2015 at 8:04 AM, Srikanth srikanth...@gmail.com wrote:
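For readers who want to inspect the same statistic by hand, here is a rough sketch, not taken from the thread. It assumes a Spark 1.x-era API in which the optimized logical plan exposes `statistics.sizeInBytes` (the accessor is named differently in later releases), and it assumes the threshold is read from the `spark.sql.autoBroadcastJoinThreshold` setting with a 10 MB fallback. The helper names `planSizeInBytes` and `wouldBroadcast` are hypothetical.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: the planner's size estimate for a DataFrame.
// Assumes the Spark 1.x accessor `statistics` on the optimized logical plan.
def planSizeInBytes(df: DataFrame): BigInt =
  df.queryExecution.optimizedPlan.statistics.sizeInBytes

// Hypothetical helper: mirrors the comparison described above.
// The fallback "10485760" (10 MB) is an assumption about the default.
def wouldBroadcast(df: DataFrame): Boolean = {
  val threshold = df.sqlContext
    .getConf("spark.sql.autoBroadcastJoinThreshold", "10485760")
    .toLong
  planSizeInBytes(df) <= threshold
}
```

The value is the planner's estimate rather than a measurement, so for some sources it may be a coarse default and not reflect the actual data size.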

Estimate size of Dataframe programatically

2015-08-07 Thread Srikanth
Hello, Is there a way to estimate the approximate size of a DataFrame? I know we can cache it and look at the size in the UI, but I'm trying to do this programmatically. With an RDD, I can take a sample, sum up the row sizes using SizeEstimator, and then extrapolate to the entire RDD. That gives me the approximate size of the RDD.
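The RDD approach described above might look roughly like the following. This is a minimal sketch, not the poster's actual code: the function name `estimateRddSizeBytes`, the default sampling fraction, and the seed are all assumptions chosen for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator

// Hypothetical helper: sample the RDD, measure each sampled row with
// SizeEstimator, and scale the average row size by the total row count.
def estimateRddSizeBytes[T <: AnyRef](
    rdd: RDD[T],
    fraction: Double = 0.01, // assumed sampling fraction
    seed: Long = 42L         // assumed fixed seed, for repeatable estimates
): Long = {
  val sample = rdd
    .sample(withReplacement = false, fraction = fraction, seed = seed)
    .collect()
  if (sample.isEmpty) {
    0L // sample too small to extrapolate from; try a larger fraction
  } else {
    val sampledBytes = sample.map(row => SizeEstimator.estimate(row)).sum
    val avgRowBytes = sampledBytes.toDouble / sample.length
    (avgRowBytes * rdd.count()).toLong
  }
}
```

The result approximates the in-memory size of the deserialized JVM objects. It runs two jobs over the RDD (one for the sample, one for the count), so caching the RDD first may be worthwhile.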