Arnaud Nauwynck created SPARK-48575: ---------------------------------------
Summary: spark.history.fs.update.interval causes too many directory listings when the Spark log dir contains many event-log apps
Key: SPARK-48575
URL: https://issues.apache.org/jira/browse/SPARK-48575
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.4.3, 3.5.1, 3.3.3, 4.0.0
Reporter: Arnaud Nauwynck

When the sparkLog dir contains a lot of Spark event-log sub-dirs (for example 1000), a supposedly idle Spark History Server issues millions of directory listing calls every hour.

Example: with ~1000 apps, every 10 seconds (the default of "spark.history.fs.update.interval") the Spark History Server performs
- 1x FileSystem.listStatus(path) with path = the sparkLog dir
- then, for each appSubDirPath (corresponding to one Spark app's event logs), 2x listStatus => 2 x 1000 x FileSystem.listStatus(appSubDirPath)

On a cloud provider (for example Azure), this costs a lot per month, because "List FileSystem" calls are billed at ~$0.065 per 10,000 ops for the "Hot" tier, or $0.0228 for "Premium" (cf https://azure.microsoft.com/en-us/pricing/details/storage/data-lake/ ).

Let's do the multiplications:
30 (days per month) * 86400 (seconds per day) / 10 (interval seconds) = 259,200 update cycles per month
... * 2001 (listing ops per update) = ~519 million listing calls per month
... * 0.0228 / 10,000 = ~1182 USD/month!

Admittedly, the retention conf "spark.history.fs.cleaner.maxAge" (default = 7d) for Spark event logs is too long for workflows that run many short Spark apps, and it could be reduced.

It is extremely important to reduce these recurring costs. Here are several wishes:

1/ Fix the "bug" in the Spark History Server that calls FileSystem.listStatus(appSubDirPath) twice. It is easily possible to perform only one listing per sub-dir (cf. the attached patch, changing ~5 lines of code; see the first sketch below). This would divide the cost by 2.

2/ In addition to the conf "spark.history.fs.cleaner.maxAge", add another conf param "spark.history.fs.cleaner.maxCount" to limit the number of retained Spark apps. It could default to ~50. This would additionally divide the cost by ~10 (in case you have 1000 apps); see the second sketch below.

3/ Change the Spark History Server code to check for updates lazily, only on demand when someone clicks in the Spark History web UI. For example, if the last cached update is more recent than "spark.history.fs.update.interval", no update is needed; otherwise the update is performed immediately and cached before returning the response. See the third sketch below.

4/ Change the Spark History Server code to avoid doing a listing on each app sub-dir. It is possible to perform a single listing on the top-level "sparkLog" dir to discover new apps. Most app sub-dirs belong to applications that are already finished and already recompacted by the Spark History Server itself, and this info is already stored in the history server's KVStore db. Almost all of the sub-dir listings can therefore be completely avoided. See the fourth sketch below.
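
For wish 1, a minimal sketch of a single-listing scan, assuming the rolling event-log naming conventions ("events_" data files and an "appstatus_" marker file); the object and method names are illustrative, not the actual FsHistoryProvider code:

{code:scala}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical helper: scan one application event-log sub-directory with a
// single FileSystem.listStatus() call, and reuse the result for both the
// "event log files" lookup and the "appstatus marker" lookup, instead of
// listing the same directory twice per update cycle.
object SingleListingScan {
  def scanAppDir(fs: FileSystem, appDir: Path): (Seq[FileStatus], Option[FileStatus]) = {
    // One remote listing call per application sub-directory.
    val entries: Array[FileStatus] = fs.listStatus(appDir)
    // Assumed naming convention of rolling event logs: "events_*" data files.
    val eventLogFiles = entries.filter(s => s.isFile && s.getPath.getName.startsWith("events_"))
    // Assumed naming convention of the completion marker: "appstatus_*".
    val appStatusFile = entries.find(_.getPath.getName.startsWith("appstatus_"))
    (eventLogFiles.toSeq, appStatusFile)
  }
}
{code}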
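
For wish 2, a hedged sketch of how such a limit could be declared with Spark's internal ConfigBuilder (which is package-private, so this would live inside Spark's own source tree); the surrounding object name is hypothetical and the default of 50 is the value proposed above:

{code:scala}
import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical config entry for the proposed cap on the number of retained
// applications, alongside the existing spark.history.fs.cleaner.maxAge.
object ProposedHistoryConf {
  val CLEANER_MAX_COUNT = ConfigBuilder("spark.history.fs.cleaner.maxCount")
    .doc("Maximum number of applications whose event logs are retained; " +
      "the cleaner removes the oldest applications beyond this count.")
    .intConf
    .createWithDefault(50)
}
{code}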
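
For wish 3, a minimal sketch of the lazy, on-demand refresh, assuming it wraps the provider's existing scan pass (doScan below is a stand-in for that pass, not Spark's actual API):

{code:scala}
import java.util.concurrent.atomic.AtomicLong

// Hypothetical lazy refresh: instead of a background thread polling every
// spark.history.fs.update.interval, re-scan only when a web UI request
// arrives and the cached scan is older than the configured interval.
class LazyHistoryRefresher(updateIntervalMs: Long, doScan: () => Unit) {
  private val lastScanTimeMs = new AtomicLong(0L)

  /** Called from the web UI request path before serving the application list. */
  def refreshIfStale(): Unit = {
    val now = System.currentTimeMillis()
    val last = lastScanTimeMs.get()
    if (now - last >= updateIntervalMs && lastScanTimeMs.compareAndSet(last, now)) {
      // Only the request that wins the compare-and-set pays the scan cost;
      // concurrent requests keep serving the cached listing.
      doScan()
    }
  }
}
{code}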
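
For wish 4, a hedged sketch of the pruning idea: a single listing of the top-level directory, descending into a sub-directory only when the application is not already recorded as finished/compacted locally (isKnownFinished is a stand-in for a lookup in the history server's KVStore, not Spark's actual API):

{code:scala}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical pruned scan: one listStatus() on the top-level event-log
// directory discovers new or updated applications; sub-directories of
// applications already known to be finished (per the local KVStore) are
// skipped entirely, avoiding almost all per-app listing calls.
object PrunedHistoryScan {
  def appsToRescan(
      fs: FileSystem,
      logDir: Path,
      isKnownFinished: String => Boolean): Seq[Path] = {
    // Single listing call on the top-level directory per update cycle.
    val topLevel: Array[FileStatus] = fs.listStatus(logDir)
    topLevel
      .filter(_.isDirectory)
      .map(_.getPath)
      .filterNot(p => isKnownFinished(p.getName)) // skip already-finished apps
      .toSeq
  }
}
{code}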