Hello,

I have set spark.sql.autoBroadcastJoinThreshold to a very high value of 20 GB. I am joining a table that I am sure is below this threshold, but Spark still does a SortMergeJoin. If I add a broadcast hint, Spark does a broadcast join and the job finishes much faster. However, when this runs in production against some large tables, I run into errors.

Is there a way to see the actual size of the table being broadcast? I wrote the table being broadcast to disk and it took only 32 MB in Parquet. I tried caching the table in Zeppelin and running a table.count(), but nothing shows up on the Storage tab of the Spark History Server. org.apache.spark.util.SizeEstimator doesn't seem to give accurate numbers for this table either.

Is there any way to figure out the size of this table as Spark sees it when deciding whether to broadcast?
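For what it's worth, here is a rough sketch of how one might inspect the size estimate Spark itself uses, rather than the on-disk Parquet size. This assumes Spark 2.3+ (where LogicalPlan.stats is a parameterless method; in 2.2 it was stats(conf)), and "my_table" is just a placeholder for the table being broadcast:

```scala
// Sketch for a spark-shell or Zeppelin paragraph; assumes an active
// SparkSession named `spark`. "my_table" is a hypothetical table name.
import org.apache.spark.util.SizeEstimator

val df = spark.table("my_table")

// This is the estimate the planner compares against
// spark.sql.autoBroadcastJoinThreshold when choosing a broadcast join:
println(df.queryExecution.optimizedPlan.stats.sizeInBytes)

// In-memory estimate of the collected rows. Deserialized row objects are
// often far larger than the compressed, encoded Parquet footprint, which
// may explain a 32 MB file failing to broadcast under a generous threshold:
println(SizeEstimator.estimate(df.collect()))
```

If the planner's sizeInBytes looks stale or wildly off, refreshing table statistics (e.g. ANALYZE TABLE my_table COMPUTE STATISTICS) may bring the estimate back in line, though I'm not certain that applies to every table source.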
-- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/