[GitHub] liupc commented on issue #23647: [SPARK-26712] Support multi directories for executor shuffle info recovery in yarn shuffle service

2019-01-29 Thread GitBox
URL: https://github.com/apache/spark/pull/23647#issuecomment-458817205
 
 
   @vanzin 
   > I'm hoping that the executor registration data can be somehow re-created
   
   My change does save the recovery data to a better directory (as explained in the note above) if a disk error happens, so Spark can recover from it.
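   Roughly, the idea is to fall back to the next configured directory when the current one goes bad. A minimal sketch of that selection logic (the helper names here, e.g. `isHealthy` and `RECOVERY_FILE_NAME`, are illustrative, not this PR's actual API):

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

public class RecoveryDirSelector {
  // Illustrative name for the recovery DB file; not necessarily what the PR uses.
  private static final String RECOVERY_FILE_NAME = "registeredExecutors.ldb";

  /** Returns a recovery DB path on the first healthy directory, or null if none is usable. */
  public static File selectRecoveryFile(List<String> localDirs) {
    for (String dir : localDirs) {
      File candidate = new File(dir, RECOVERY_FILE_NAME);
      if (isHealthy(candidate.getParentFile())) {
        return candidate;
      }
      // Disk looks bad: skip it and try the next configured directory.
    }
    return null;
  }

  /** Crude health probe: the dir must exist (or be creatable) and accept writes. */
  private static boolean isHealthy(File dir) {
    try {
      if (!dir.exists() && !dir.mkdirs()) {
        return false;
      }
      File probe = File.createTempFile("probe", ".tmp", dir);
      return probe.delete();
    } catch (IOException e) {
      return false;
    }
  }
}
```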
   
   > The shuffle service will need at least the app secret to allow the 
executors to connect. I'm wondering if after a restart, YARN actually calls the 
initializeApplication callback which would allow that data to be re-created. 
That's the bare minimum;
   
   This secret recovery is done by `YarnShuffleService` itself, so maybe we should also change the secret-recovery-related code.
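   For context, the secret originally reaches the shuffle service through YARN's `initializeApplication` aux-service callback, so if YARN replays that callback after an NM restart, the secret could be re-created there. A simplified sketch of that path (surrounding class, auth checks, and error handling omitted; treat this as an illustration, not the actual `YarnShuffleService` code):

```java
import java.nio.ByteBuffer;
import org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
import org.apache.spark.network.sasl.ShuffleSecretManager;

public class SecretRecoverySketch {
  private final ShuffleSecretManager secretManager = new ShuffleSecretManager();

  // Called by the NodeManager when an application using this aux service starts.
  public void initializeApplication(ApplicationInitializationContext context) {
    String appId = context.getApplicationId().toString();
    // YARN hands the app's shuffle secret to the aux service as opaque bytes.
    ByteBuffer shuffleSecret = context.getApplicationDataForService();
    // Re-register the secret in memory; the real service also persists it to
    // the recovery DB so that it can survive an NM restart.
    secretManager.registerApp(appId, shuffleSecret);
  }
}
```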
   
   
   





[GitHub] liupc commented on issue #23647: [SPARK-26712] Support multi directories for executor shuffle info recovery in yarn shuffle service

2019-01-29 Thread GitBox
URL: https://github.com/apache/spark/pull/23647#issuecomment-458802178
 
 
   @squito 
   
   > that still doesn't completely handle the problem, as any existing shuffle data written to the bad disks is gone
   
   Yes, the existing shuffle data written by those finished `ShuffleMapTask`s would be gone, but Spark's retry mechanism (I mean the stage retry) should fix things, shouldn't it?
   





[GitHub] liupc commented on issue #23647: [SPARK-26712] Support multi directories for executor shuffle info recovery in yarn shuffle service

2019-01-28 Thread GitBox
URL: https://github.com/apache/spark/pull/23647#issuecomment-458420729
 
 
   @vanzin 
   
   > It feels to me like enabling the option in SPARK-16505 is the right thing. 
If your recovery dir is bad, then the NM shouldn't be running until that is 
fixed. But that also assumes that the failure is detected during shuffle 
service initialization, and not later.
   
   Yes, I think we should enable this option by default, maybe in another PR.
   
   > If implementing multi-disk supports, I'm also not sure even how you'd do 
it. Opening the DB may or may not work, depending on how bad the disk is. So if 
the first time it does not work, and you write the recovery db to some other 
directory, but then the NM crashes (e.g. because of the bad disk) and the next 
time, opening the DB actually works in the first try, you'll end up reading 
stale data before you realize you're reading from the bad disk. I see you have 
checks for the last mod time, but even that can cause troubles in a scenario 
where the failure may or may not happen depending on when you look...
   
   This PR just periodically checks for bad disks and saves the current executor info held in memory to the new, good directory. The data is up to date as long as we handle the synchronization well. There is indeed a case that makes recovery fail (an NM crash and the disk breaking at the same time), but that should be really, really rare.
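   The periodic check could look something like the minimal sketch below; `moveDbTo` is a hypothetical helper standing in for re-writing the in-memory executor registrations into a fresh DB on the fallback directory:

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RecoveryDbWatchdog {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private volatile File currentDbDir;

  public RecoveryDbWatchdog(File initialDbDir) {
    this.currentDbDir = initialDbDir;
  }

  /** Probes the recovery dir every intervalSec; on failure, switches to the fallback. */
  public void start(long intervalSec, File fallbackDir) {
    scheduler.scheduleWithFixedDelay(() -> {
      if (!currentDbDir.canWrite()) {
        synchronized (this) {
          // Re-create the DB from in-memory state on a good disk. The in-memory
          // map of registered executors is the source of truth here, so the new
          // DB stays up to date as long as this block is properly synchronized
          // with the executor registration path.
          moveDbTo(fallbackDir);
          currentDbDir = fallbackDir;
        }
      }
    }, intervalSec, intervalSec, TimeUnit.SECONDS);
  }

  private void moveDbTo(File newDir) {
    // Hypothetical: iterate the in-memory executor-info map and write each
    // entry into a freshly opened LevelDB under newDir.
  }
}
```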
   
   > I tend to think that if your recovery disk is bad, that should be treated 
as a catastrophic failure, and trying to work around that is kinda pointless. 
What you could do is try to keep running in spite of the bad disk, e.g. by only 
keeping data in memory. You'd only see problems when the NM is restarted (you'd 
lose existing state), but at that point Spark's retry mechanism should fix 
things.
   
   I understand what you are saying, but the major problem is that when this happens, a long-running application cannot recover from the resource waste, or from the occasional job failures, except by restarting. If the current implementation could solve this problem, I would agree with your opinion; but to my understanding, it cannot.
   With my PR, however, the application can keep running as usual, and once the SREs fix the disk, everything comes back.





[GitHub] liupc commented on issue #23647: [SPARK-26712] Support multi directories for executor shuffle info recovery in yarn shuffle service

2019-01-28 Thread GitBox
URL: https://github.com/apache/spark/pull/23647#issuecomment-458386913
 
 
   @squito 
   
   >  If you have a bad disk, you're definitely losing some shuffle data. 
Furthermore, any other shuffleMapStages would need to know to not write their 
output to the bad disk also. 
   
   That blacklisting is introduced in another PR, https://github.com/apache/spark/pull/23614; it will solve the shuffle write issues.





[GitHub] liupc commented on issue #23647: [SPARK-26712] Support multi directories for executor shuffle info recovery in yarn shuffle service

2019-01-27 Thread GitBox
URL: https://github.com/apache/spark/pull/23647#issuecomment-457995661
 
 
   @vanzin @HyukjinKwon 
   We once ran into a similar problem on Spark 2.0.1, before https://github.com/apache/spark/pull/14162 was introduced. A broken disk on the recovery path caused the NM to start without `YarnShuffleService`, so executors scheduled on that node were unable to register with `YarnShuffleService`, which finally caused the application to fail.
   Even though we now have https://github.com/apache/spark/pull/14162 and the application-level blacklist, I think this PR still makes sense for long-running applications (for instance, Spark ThriftServer or Spark Streaming applications).
   For these types of applications this case might not be uncommon, since they run for a long time. And even if we suppose Spark would recover with the application-level blacklist enabled, it would still waste resources, since shuffles would always fail on that node, not to mention that there is a chance the node is never blacklisted, which would cause job failures.
   I hope this explanation is convincing.





[GitHub] liupc commented on issue #23647: [SPARK-26712] Support multi directories for executor shuffle info recovery in yarn shuffle service

2019-01-25 Thread GitBox
URL: https://github.com/apache/spark/pull/23647#issuecomment-457541376
 
 
   @srowen @vanzin @squito @HyukjinKwon + potential reviewers, could anybody give some suggestions?

