Arnaud Nauwynck created SPARK-48575:
---------------------------------------
Summary: spark.history.fs.update.interval calling too many
directory pollings when spark log dir contains many sparkEvent apps
Key: SPARK-48575
URL: https://issues.apache.org/jira/browse/SPARK-48575
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.4.3, 3.5.1, 3.3.3, 4.0.0
Reporter: Arnaud Nauwynck
When the sparkLog dir contains a lot of Spark eventLog sub-dirs
(for example 1000),
running a supposedly "idle" Spark History Server causes about 720,000
directory listing calls each hour (over 17 million per day).
Example: with ~1000 apps, every 10 seconds (the default of
"spark.history.fs.update.interval") the Spark History Server performs
- 1x VirtualFileSystem.listStatus(path) with path = sparkLog dir
- then 2x for each appSubDirPath (corresponding to one Spark app's eventLogs)
=> 2 x 1000 x VirtualFileSystem.listStatus(appSubDirPath)
On a cloud provider (for example Azure), this costs a lot per month,
because "List FileSystem" calls are billed at ~$0.065 per 10000 ops for Tier
"Hot" or $0.0228 for "Premium" (cf
https://azure.microsoft.com/en-us/pricing/details/storage/data-lake/ )
Let's do the multiplications:
30 (days per month) * 86400 (sec per day) / 10 (interval seconds) = 259 200
updates per month
... * 2001 (listing ops per update) = ~518 million listing calls per month
... * 0.0228 / 10000 = ~1182 USD/month !!!!
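The same multiplication can be written as a small Python sketch, which makes it easy to plug in other app counts or intervals. The function name and its parameters are illustrative only; the default price is the "Premium" figure quoted above, and the model assumes every update performs exactly 1 top-level listing plus a fixed number of listings per app sub-dir.

```python
SECONDS_PER_DAY = 86_400


def monthly_listing_cost(num_apps, interval_s=10, listings_per_app=2,
                         price_per_10k_ops=0.0228, days=30):
    """Return (updates, listing ops, cost in USD) for one month of polling."""
    # Number of polling rounds in the month.
    updates = days * SECONDS_PER_DAY // interval_s
    # One listing of the top-level log dir, plus N listings per app sub-dir.
    ops_per_update = 1 + listings_per_app * num_apps
    total_ops = updates * ops_per_update
    cost = total_ops * price_per_10k_ops / 10_000
    return updates, total_ops, cost


updates, total_ops, cost = monthly_listing_cost(num_apps=1000)
print(updates, total_ops, round(cost, 2))
```

Under the same assumptions, calling it with listings_per_app=1 (a single listing per sub-dir) gives roughly half the monthly cost.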
Admittedly, the retention conf "spark.history.fs.cleaner.maxAge" (default = 7d)
for Spark eventLogs is too long for workflows that run many short Spark apps,
and it would be possible to reduce it.
It is extremely important to reduce these recurrent costs.
Here are several wishes:
1/ fix the "bug" in Spark History that calls
VirtualFileSystem.listStatus(appSubDirPath) twice.
Indeed, it is easily possible to perform only 1 listing per sub-dir
(cf attached patch, changing ~5 lines of code).
This would divide the cost by 2.
2/ in addition to conf "spark.history.fs.cleaner.maxAge", add another conf
param "spark.history.fs.cleaner.maxCount" to limit the number of Spark apps.
This could default to ~50.
This would additionally divide the cost by x10 (in case you have 1000 apps).
3/ change the code in SparkHistory to check for updates lazily, only on demand
when someone clicks in the Spark History web UI. For example, if the last
cached update is more recent than "spark.history.fs.update.interval" then no
update is needed, else an update is immediately performed and cached before
returning the response.
4/ change the code in SparkHistory to avoid doing a listing on each app sub-dir.
It is possible to perform a single listing on the "sparkLog" top-level dir to
discover new apps.
Then, for the app sub-dirs, most of the apps are already finished and already
compacted by SparkHistory itself. This info is already stored in the Spark
History KVStore db.
Almost all the sub-dir listings can therefore be completely avoided.
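Wishes 3/ and 4/ can be sketched together in a few lines of Python. This is only an illustration of the idea, not Spark History code: CountingFs and LazyAppListing are hypothetical names, the file system is an in-memory stand-in that counts listing calls, the "completed" set stands in for state kept in the history db, and treating a file ending in ".inprogress" as the marker of a still-running app is an assumption made for the demo.

```python
import time


class CountingFs:
    """In-memory stand-in for the log file system; counts listing calls."""

    def __init__(self, tree):
        self.tree = tree  # {app sub-dir name: [file names]}
        self.list_calls = 0

    def list_root(self):
        self.list_calls += 1
        return sorted(self.tree)

    def list_app_dir(self, app):
        self.list_calls += 1
        return list(self.tree[app])


class LazyAppListing:
    """Refreshes only on demand, and only when the cache is stale."""

    def __init__(self, fs, update_interval_s=10.0, clock=time.monotonic):
        self.fs = fs
        self.update_interval_s = update_interval_s
        self.clock = clock
        self.last_update = None
        self.apps = {}          # app -> files seen at the last listing
        self.completed = set()  # stands in for state kept in the history db

    def get_apps(self):
        now = self.clock()
        stale = (self.last_update is None
                 or now - self.last_update >= self.update_interval_s)
        if stale:
            self._refresh()
            self.last_update = now
        return self.apps

    def _refresh(self):
        present = self.fs.list_root()  # single top-level listing
        for app in present:
            if app in self.completed:
                continue  # finished app: no sub-dir listing needed
            files = self.fs.list_app_dir(app)
            self.apps[app] = files
            # Assumption: an ".inprogress" file marks a still-running app.
            if not any(f.endswith(".inprogress") for f in files):
                self.completed.add(app)
        # Drop apps whose sub-dir has disappeared (e.g. removed by the cleaner).
        for app in set(self.apps) - set(present):
            del self.apps[app]
            self.completed.discard(app)


now = [0.0]  # fake clock so the demo is deterministic
fs = CountingFs({
    "app-1": ["events_1", "appstatus"],
    "app-2": ["events_1", "appstatus.inprogress"],
})
listing = LazyAppListing(fs, update_interval_s=10.0, clock=lambda: now[0])

calls_before_any_request = fs.list_calls   # no background polling
listing.get_apps()
calls_after_first_request = fs.list_calls  # 1 root + 2 sub-dirs
now[0] = 5.0
listing.get_apps()
calls_within_interval = fs.list_calls      # cache still fresh, no new listing
now[0] = 20.0
listing.get_apps()
calls_after_interval = fs.list_calls       # 1 root + 1 still-running app
```

With this approach an idle server issues no listing calls at all, and a refresh costs 1 listing plus one per app that is still running, instead of 1 + 2 x (number of apps).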