[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...
Github user jianjianjiao commented on the issue: https://github.com/apache/spark/pull/22444

@squito Yes, you are correct. I was trying to make applications that run during the scan get picked up more quickly. It turns out that SPARK-6951 already does a great job of achieving this.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...
Github user jianjianjiao commented on the issue: https://github.com/apache/spark/pull/22444

@vanzin Thank you very much for your suggestions. Loading event logs is now much faster: loading 17K event logs, some of them larger than 10G, went from more than 2.5 hours to 19 minutes.

1. I enabled SHS V2 to cache things on disk. We are using Windows, where there is a small "posix.permissions not supported in windows" issue; I created a new PR for it here: https://github.com/apache/spark/pull/22520. Could you please take a look? This change doesn't speed up loading very much, but it improves other parts.
2. I tried 2.4, and also tried applying SPARK-6951 to 2.3. This is the critical part for improving the speed.

I will close this PR, as it is no longer needed. Thanks again.
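To put the reported numbers in perspective, here is a rough back-of-the-envelope calculation. It uses only the figures quoted in the comment above, which are approximate ("more than 2.5 hours", "17K"), so the result is a lower bound on the speedup rather than a measurement.

```python
# Figures quoted in the comment; both are approximate.
logs = 17_000            # "17K event logs"
before_s = 2.5 * 3600    # "more than 2.5 hours", so a lower bound, in seconds
after_s = 19 * 60        # "19 minutes", in seconds

speedup = before_s / after_s
print(f"speedup: at least {speedup:.1f}x")
print(f"average per log: {before_s / logs:.2f} s before, {after_s / logs:.3f} s after")
```

Averages hide the spread here, since some of the logs are reported to be larger than 10G.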
[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/22444

> so any server restart results in hours of downtime, just from scanning.

Well, that's why 2.3 supports caching things on disk. Also, 2.4 has SPARK-6951, which should make this a lot faster even without disk caching. @jianjianjiao have you tried out 2.4?
[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...
Github user squito commented on the issue: https://github.com/apache/spark/pull/22444

> history server startup needs to go through all these logs before being usable, so any server restart results in hours of downtime, just from scanning.

I don't think this is true. The first scan may take a long time, but I think the SHS is usable even during that time. As soon as a scan makes it through some file, that file is added to the listing. But if I understand correctly, the advantage here is that as more applications are run during that 2.5 hour scan, you will pick those up more quickly.

> 1. would it make sense for the initial scans to go for the most recent logs first, because that 2.5 hour time to scan all files is still there.
> 2. would you want the UI and rest api to indicate that the scan was still in progress, and not to worry if the listing was incomplete?

I think both of these already happen. @jianjianjiao again, it's been a while since I've looked at this code -- does that sound correct?
[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/22444

I see the reasoning here:

* @jianjianjiao has a very large cluster with many thousands of history files of past (successful) jobs.
* history server startup needs to go through all these logs before being usable, so any server restart results in hours of downtime, just from scanning.
* this patch breaks things up to be incremental.

I don't have any opinions on the patch itself; I've not looked at that code for so long that my reviews are probably dangerous.

Two thoughts:

1. Would it make sense for the initial scans to go for the most recent logs first, because that 2.5 hour time to scan all files is still there?
2. Would you want the UI and REST API to indicate that the scan was still in progress, and not to worry if the listing was incomplete?
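The two suggestions above can be illustrated with a minimal sketch. This is not the Spark History Server's code: `Listing`, `newest_first`, `scan`, and the `parse` callback are hypothetical names made up for the example, and it assumes event logs are plain files in a single local directory.

```python
import os
import tempfile
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Iterator, List


@dataclass
class Listing:
    """Hypothetical application listing; not a Spark class."""
    entries: List[str] = field(default_factory=list)
    scan_in_progress: bool = False


def newest_first(log_dir: Path) -> List[Path]:
    """Return the files in log_dir ordered by modification time, newest first."""
    files = [p for p in log_dir.iterdir() if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


def scan(log_dir: Path, listing: Listing,
         parse: Callable[[Path], str]) -> Iterator[str]:
    """Parse logs newest first, publishing each entry as soon as it is ready."""
    listing.scan_in_progress = True
    try:
        for path in newest_first(log_dir):
            entry = parse(path)            # stand-in for replaying an event log
            listing.entries.append(entry)  # visible before the scan finishes
            yield entry
    finally:
        listing.scan_in_progress = False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Create three fake logs with modification times one hour apart.
        for age, name in enumerate(["app-new", "app-mid", "app-old"]):
            log = root / name
            log.write_text("{}")
            stamp = time.time() - age * 3600
            os.utime(log, (stamp, stamp))

        listing = Listing()
        for entry in scan(root, listing, parse=lambda p: p.name):
            print(entry, "| scan in progress:", listing.scan_in_progress)
        print("done | scan in progress:", listing.scan_in_progress)
```

A UI or REST endpoint could read `scan_in_progress` alongside `entries` to tell users that the listing may still be incomplete.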
[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...
Github user jianjianjiao commented on the issue: https://github.com/apache/spark/pull/22444

Adding @vanzin, @steveloughran, and @squito, who have made changes to the related code.