From a quick glance at SparkStrategies.scala: when statistics.sizeInBytes of a LogicalPlan is <= spark.sql.autoBroadcastJoinThreshold, that plan's output is used as the 'build' relation in a broadcast join.
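If you want to see the estimate the planner actually compares against the threshold, something like the following should work. This is an untested sketch: queryExecution and the plan's statistics are developer/internal API, so the exact field names may vary between releases.

// Untested sketch: inspect the size estimate the planner uses.
// Assumes an existing SQLContext `sqlContext` and DataFrame `df`.
val sizeInBytes = df.queryExecution.analyzed.statistics.sizeInBytes

// The default threshold is 10 MB (10485760 bytes).
val threshold =
  sqlContext.getConf("spark.sql.autoBroadcastJoinThreshold", "10485760").toLong

val wouldBroadcast = threshold > 0 && sizeInBytes <= threshold

A rough sketch of the sampling-based estimate Srikanth asked about is at the bottom of this mail, after the quoted thread.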
FYI

On Mon, Aug 10, 2015 at 8:04 AM, Srikanth <srikanth...@gmail.com> wrote:

> SizeEstimator.estimate(df) will not give the size of the dataframe, right?
> I think it will give the size of the df object.
>
> With an RDD, I sample() and collect(), then sum the size of each row. If I
> do the same with a dataframe, it will no longer be the size when
> represented in columnar format.
>
> I'd also like to know how spark.sql.autoBroadcastJoinThreshold estimates
> the size of a dataframe. Is it going to broadcast when the columnar
> storage size is less than 10 MB?
>
> Srikanth
>
>
> On Fri, Aug 7, 2015 at 2:51 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Have you tried calling SizeEstimator.estimate() on a DataFrame?
>>
>> I did the following in the REPL:
>>
>> scala> SizeEstimator.estimate(df)
>> res1: Long = 17769680
>>
>> FYI
>>
>> On Fri, Aug 7, 2015 at 6:48 AM, Srikanth <srikanth...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Is there a way to estimate the approximate size of a dataframe? I know
>>> we can cache it and look at the size in the UI, but I'm trying to do
>>> this programmatically. With an RDD, I can sample, sum up sizes using
>>> SizeEstimator, and extrapolate to the entire RDD. That gives me the
>>> approximate size of the RDD. With dataframes, it's tricky due to
>>> columnar storage. How do we do it?
>>>
>>> On a related note, I see the size of the RDD object to be ~60MB. Is
>>> that the footprint of the RDD in the driver JVM?
>>>
>>> scala> val temp = sc.parallelize(Array(1,2,3,4,5,6))
>>> scala> SizeEstimator.estimate(temp)
>>> res13: Long = 69507320
>>>
>>> Srikanth
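P.S. On estimating DataFrame size programmatically: one rough approach (my own sketch, untested; estimateDataFrameSize is a hypothetical helper) is to collect a small sample of rows, measure it with SizeEstimator, and extrapolate by the row count. This measures deserialized Row objects, not the compressed columnar cache, so treat the result as an upper-bound heuristic.

import org.apache.spark.sql.DataFrame
import org.apache.spark.util.SizeEstimator

// Untested sketch: extrapolate DataFrame size from a row sample.
// Measures deserialized Row objects, not the columnar cache footprint.
// SizeEstimator's visibility has varied across Spark versions.
def estimateDataFrameSize(df: DataFrame, fraction: Double = 0.01): Long = {
  val rows = df.sample(withReplacement = false, fraction).collect()
  if (rows.isEmpty) 0L
  else (SizeEstimator.estimate(rows).toDouble / rows.length * df.count()).toLong
}

// Example: estimateDataFrameSize(df, 0.05)

Note that df.count() triggers a full pass over the data, so this is not free.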