Hello!

In my use case we are using PySpark 3.1, and a few of our PySpark scripts
run better with higher driver memory.

As far as I know, the default value of
"spark.sql.autoBroadcastJoinThreshold" is 10 MB. In a few cases the default
driver configuration was throwing OOM errors, so it was necessary to add
more driver memory. (Base configuration: driver with 1 core and 2 GB
memory; 2 executors, each with 2 cores and 6 GB memory.)
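For context, this is roughly how the jobs are submitted (the script name is
a placeholder):

    spark-submit \
      --driver-cores 1 \
      --driver-memory 2g \
      --num-executors 2 \
      --executor-cores 2 \
      --executor-memory 6g \
      my_job.py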

I have two questions (a small sketch of what I am testing follows below):
- I want to set an upper limit on the broadcast DataFrame size, i.e., I do
not want the broadcast size to exceed 10 MB. Does this setting
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)
guarantee that?
- Instead of the above setting, do the following settings
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", 10485760)
provide any benefit?
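
To make the two options concrete, here is roughly what I am testing (the
paths, app name, and join key are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("broadcast-test").getOrCreate()

    # Option 1: cap the static broadcast threshold at 10 MB (value in bytes)
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)

    # Option 2: disable the static threshold and set the AQE-specific one
    # spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
    # spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", 10485760)

    small_df = spark.read.parquet("small_table")  # placeholder path
    large_df = spark.read.parquet("large_table")  # placeholder path
    joined = large_df.join(small_df, "id")        # placeholder join key
    joined.explain()  # check whether a BroadcastHashJoin is chosen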

Any insight will be helpful!

With regards,
Soumyadeep.
