Hello,
I have set spark.sql.autoBroadcastJoinThreshold to a very high value, 20 GB.
I am joining a table that I am sure is below this threshold, yet Spark
chooses a SortMergeJoin. If I add a broadcast hint, Spark does a broadcast
join and the job finishes much faster. However, when I run this in production
against some large tables, I run into errors. Is there a way to see the
actual size of the table being broadcast? I wrote that table to disk and it
took only 32 MB in Parquet. I also tried caching it in Zeppelin and running
table.count(), but nothing shows up on the Storage tab of the Spark History
Server. org.apache.spark.util.SizeEstimator doesn't seem to give accurate
numbers for this table either. Is there any way to figure out the size of
this table when it is broadcast?
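
For reference, a minimal sketch of the setup described above. The DataFrames `large` and `small`, the column `key`, and the local session are hypothetical stand-ins, not the real tables or cluster; the sketch only shows where the threshold and the hint go and how to print the physical plan to see which join Spark picked.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration; in Zeppelin `spark` already exists.
    val spark = SparkSession.builder()
      .appName("broadcast-join-check")
      .master("local[*]")
      .getOrCreate()

    // The threshold is given in bytes; 20 GB as in the post.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20L * 1024 * 1024 * 1024)

    // Hypothetical stand-in tables sharing a join column named "key".
    val large = spark.range(0, 1000000).withColumnRenamed("id", "key")
    val small = spark.range(0, 1000).withColumnRenamed("id", "key")

    // Join without a hint: the planner decides based on its own size estimate.
    large.join(small, "key").explain()

    // Join with an explicit broadcast hint on the smaller side.
    large.join(broadcast(small), "key").explain()

    spark.stop()
  }
}
```

In the printed plans, the join node (for example SortMergeJoin or BroadcastHashJoin) shows which strategy was selected for each query.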


