[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...

2018-09-21 Thread jianjianjiao
Github user jianjianjiao commented on the issue:

https://github.com/apache/spark/pull/22444
  
@squito  Yes, you are correct. I was trying to make applications that run 
during the scan get picked up more quickly. It turns out that SPARK-6951 has 
done a great job of achieving this.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...

2018-09-21 Thread jianjianjiao
Github user jianjianjiao commented on the issue:

https://github.com/apache/spark/pull/22444
  
@vanzin   Thanks a lot for your suggestions. Loading event logs has become 
much faster: from more than 2.5 hours down to 19 minutes for 17K event logs, 
some of which are larger than 10 GB.

1. To enable SHS V2 to cache things on disk: we are using Windows, where there 
is a small "posix.permissions not supported in windows" issue, so I created a 
new PR at https://github.com/apache/spark/pull/22520 , could you please take a 
look?  This change doesn't speed up loading very much, but it improves other 
parts.

2. I tried 2.4, and also tried applying SPARK-6951 to 2.3. This is the 
critical part in improving the speed.
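For reference, the SHS V2 on-disk cache discussed in item 1 is enabled through the history server's configuration. A minimal sketch is below; the property names are from Spark 2.3+'s monitoring documentation, while the paths and the size cap are example values, not taken from this thread:

```properties
# spark-defaults.conf for the Spark History Server (Spark 2.3+)

# Directory of event logs to scan (example path)
spark.history.fs.logDirectory     hdfs:///spark-history

# Enable the local on-disk KVStore so parsed application listings
# survive restarts instead of being rebuilt from the event logs
spark.history.store.path          /var/spark/history-cache

# Optional: cap the disk usage of that cache (default is 10g)
spark.history.store.maxDiskUsage  10g
```

With `spark.history.store.path` unset, the listing data is kept only in memory and a restart triggers a full rescan of the log directory.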

I will close this PR, as it is no longer needed.  Thanks again.



---




[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...

2018-09-18 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/22444
  
> so any server restart results in hours of downtime, just from scanning.

Well, that's why 2.3 supports caching things on disk. Also, 2.4 has 
SPARK-6951 which should make this a lot faster even without disk caching. 
@jianjianjiao have you tried out 2.4?


---




[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...

2018-09-18 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/22444
  
> history server startup needs to go through all these logs before being 
usable, so any server restart results in hours of downtime, just from scanning.

I don't think this is true. The first scan may take a long time, but I 
think the SHS is usable even during that time.  As soon as a scan makes it 
through a file, that file is added to the listing.

But if I understand correctly, the advantage here is that as more 
applications are run during that 2.5 hour scan, you will pick those up more 
quickly.

> 1. would it make sense for the initial scans to go for the most recent 
logs first, because that 2.5 hour time to scan all files is still there.
> 2. would you want the UI and rest api to indicate that the scan was still 
in progress, and not to worry if the listing was incomplete?

I think both of these already happen.

@jianjianjiao again, it's been a while since I've looked at this code -- does 
that sound correct?


---




[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...

2018-09-18 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/22444
  
I see the reasoning here

* @jianjianjiao has a very large cluster with many thousands of history 
files of past (successful) jobs.
* history server startup needs to go through all these logs before being 
usable, so any server restart results in hours of downtime, just from scanning.
* this patch breaks things up to be incremental.

I don't have any opinions on the patch itself; I've not looked at that code 
for so long my reviews are probably dangerous.

Two thoughts:

1. would it make sense for the initial scans to go for the most recent logs 
first, because that 2.5 hour time to scan all files is still there. 
2. would you want the UI and rest api to indicate that the scan was still 
in progress, and not to worry if the listing was incomplete?


---




[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...

2018-09-17 Thread jianjianjiao
Github user jianjianjiao commented on the issue:

https://github.com/apache/spark/pull/22444
  
Adding @vanzin, @steveloughran, and @squito, who made changes to the related code.


---
