Arnaud Nauwynck created SPARK-48575: ---------------------------------------
Summary: spark.history.fs.update.interval causes too many directory listings when the Spark log dir contains many event-log apps
Key: SPARK-48575
URL: https://issues.apache.org/jira/browse/SPARK-48575
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.4.3, 3.5.1, 3.3.3, 4.0.0
Reporter: Arnaud Nauwynck

When the sparkLog dir contains a lot of Spark event-log sub-dirs (for example 1000), a supposedly idle Spark History Server issues millions of directory listing calls every hour.

Example: with ~1000 apps, every 10 seconds (the default of "spark.history.fs.update.interval") the Spark History Server performs
- 1x FileSystem.listStatus(path) with path = the sparkLog dir
- then, for each appSubDirPath (corresponding to one Spark app's event logs), 2x listStatus => 2 x 1000 x FileSystem.listStatus(appSubDirPath)

On a cloud provider (for example Azure), this costs a lot per month, because "List FileSystem" calls are billed at ~$0.065 per 10,000 ops for the "Hot" tier, or $0.0228 for "Premium" (cf https://azure.microsoft.com/en-us/pricing/details/storage/data-lake/ ).

Let's do the multiplications:
30 (days per month) * 86400 (seconds per day) / 10 (interval seconds) = 259,200 update cycles per month
... * 2001 (listing ops per update) = ~519 million listing calls per month
... * 0.0228 / 10,000 = ~1182 USD/month!

Admittedly, the retention conf "spark.history.fs.cleaner.maxAge" (default = 7d) for Spark event logs is too long for workflows that run many short Spark apps, and it could be reduced.

It is extremely important to reduce these recurring costs. Here are several wishes:

1/ Fix the "bug" in the Spark History Server that calls FileSystem.listStatus(appSubDirPath) twice. It is easily possible to perform only one listing per sub-dir (cf. the attached patch, changing ~5 lines of code; see the first sketch below). This would divide the cost by 2.

2/ In addition to the conf "spark.history.fs.cleaner.maxAge", add another conf param "spark.history.fs.cleaner.maxCount" to limit the number of retained Spark apps. It could default to ~50. This would additionally divide the cost by ~10 (in case you have 1000 apps); see the second sketch below.

3/ Change the Spark History Server code to check for updates lazily, only on demand when someone clicks in the Spark History web UI. For example, if the last cached update is more recent than "spark.history.fs.update.interval", no update is needed; otherwise the update is performed immediately and cached before returning the response. See the third sketch below.

4/ Change the Spark History Server code to avoid doing a listing on each app sub-dir. It is possible to perform a single listing on the top-level "sparkLog" dir to discover new apps. Most app sub-dirs belong to applications that are already finished and already recompacted by the Spark History Server itself, and this info is already stored in the history server's KVStore db. Almost all of the sub-dir listings can therefore be completely avoided. See the fourth sketch below.
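
For wish 1, a minimal sketch of a single-listing scan, assuming the rolling event-log naming conventions ("events_" data files and an "appstatus_" marker file); the object and method names are illustrative, not the actual FsHistoryProvider code:

{code:scala}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical helper: scan one application event-log sub-directory with a
// single FileSystem.listStatus() call, and reuse the result for both the
// "event log files" lookup and the "appstatus marker" lookup, instead of
// listing the same directory twice per update cycle.
object SingleListingScan {
  def scanAppDir(fs: FileSystem, appDir: Path): (Seq[FileStatus], Option[FileStatus]) = {
    // One remote listing call per application sub-directory.
    val entries: Array[FileStatus] = fs.listStatus(appDir)
    // Assumed naming convention of rolling event logs: "events_*" data files.
    val eventLogFiles = entries.filter(s => s.isFile && s.getPath.getName.startsWith("events_"))
    // Assumed naming convention of the completion marker: "appstatus_*".
    val appStatusFile = entries.find(_.getPath.getName.startsWith("appstatus_"))
    (eventLogFiles.toSeq, appStatusFile)
  }
}
{code}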
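
For wish 2, a hedged sketch of how such a limit could be declared with Spark's internal ConfigBuilder (which is package-private, so this would live inside Spark's own source tree); the surrounding object name is hypothetical and the default of 50 is the value proposed above:

{code:scala}
import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical config entry for the proposed cap on the number of retained
// applications, alongside the existing spark.history.fs.cleaner.maxAge.
object ProposedHistoryConf {
  val CLEANER_MAX_COUNT = ConfigBuilder("spark.history.fs.cleaner.maxCount")
    .doc("Maximum number of applications whose event logs are retained; " +
      "the cleaner removes the oldest applications beyond this count.")
    .intConf
    .createWithDefault(50)
}
{code}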
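
For wish 3, a minimal sketch of the lazy, on-demand refresh, assuming it wraps the provider's existing scan pass (doScan below is a stand-in for that pass, not Spark's actual API):

{code:scala}
import java.util.concurrent.atomic.AtomicLong

// Hypothetical lazy refresh: instead of a background thread polling every
// spark.history.fs.update.interval, re-scan only when a web UI request
// arrives and the cached scan is older than the configured interval.
class LazyHistoryRefresher(updateIntervalMs: Long, doScan: () => Unit) {
  private val lastScanTimeMs = new AtomicLong(0L)

  /** Called from the web UI request path before serving the application list. */
  def refreshIfStale(): Unit = {
    val now = System.currentTimeMillis()
    val last = lastScanTimeMs.get()
    if (now - last >= updateIntervalMs && lastScanTimeMs.compareAndSet(last, now)) {
      // Only the request that wins the compare-and-set pays the scan cost;
      // concurrent requests keep serving the cached listing.
      doScan()
    }
  }
}
{code}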
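
For wish 4, a hedged sketch of the pruning idea: a single listing of the top-level directory, descending into a sub-directory only when the application is not already recorded as finished/compacted locally (isKnownFinished is a stand-in for a lookup in the history server's KVStore, not Spark's actual API):

{code:scala}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical pruned scan: one listStatus() on the top-level event-log
// directory discovers new or updated applications; sub-directories of
// applications already known to be finished (per the local KVStore) are
// skipped entirely, avoiding almost all per-app listing calls.
object PrunedHistoryScan {
  def appsToRescan(
      fs: FileSystem,
      logDir: Path,
      isKnownFinished: String => Boolean): Seq[Path] = {
    // Single listing call on the top-level directory per update cycle.
    val topLevel: Array[FileStatus] = fs.listStatus(logDir)
    topLevel
      .filter(_.isDirectory)
      .map(_.getPath)
      .filterNot(p => isKnownFinished(p.getName)) // skip already-finished apps
      .toSeq
  }
}
{code}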