Hi Sungwoo,

Thank you for your message. I'm glad to hear that you've successfully implemented stage rerun in Hive-MR3! It's great that the integration with Celeborn was smooth. If you have any further questions, feel free to reach out.

Thanks,
Jiashu Xiong

Sungwoo Park <[email protected]> wrote on Wed, Oct 16, 2024 at 22:37:

> Hi Jiashu,
>
> Thank you for your reply. Based on what you advised, we have fully
> implemented stage rerun in Hive-MR3, where a read failure triggers the
> re-execution of the whole parent stage. There were not many changes to the
> code for interfacing with Celeborn (which makes me think again that the
> Celeborn API is quite elegant).
>
> Thanks a lot,
>
> --- Sungwoo Park
>
> On Tue, 8 Oct 2024, rexxiong wrote:
>
> > Hi Sungwoo,
> >
> > Glad to see that the integration of MR3 with Celeborn works well and that
> > the upgrade from version 0.3 to 0.5 went smoothly.
> > Celeborn now supports Spark stage reruns, which are already in production.
> > As for your questions:
> >
> > 1. Suppose that a reducer fails to read the output of a certain mapper. In
> > such a case, should we re-execute all the mappers in the previous stage?
> > Or, is it okay to re-execute only the mapper whose output is lost?
> > In our previous implementation, MR3-Celeborn does not fully support task
> > rerun (similar to stage rerun) because Celeborn does not return the
> > identity of mapper tasks whose output has been lost.
> >
> > In my opinion, MR3 should use a new shuffle ID to re-execute all mappers
> > and generate fresh outputs for that reducer. This is necessary because the
> > Celeborn Worker merges all mapper outputs by partition ID into the same
> > files. If the reducer is unable to read these files, the data is lost, so
> > the mappers that wrote to those files (probably all of them) should be
> > re-executed. Rerunning only the mappers associated with the lost data is
> > not easy in Celeborn; it is possible, but it may require a lot of changes.
> >
> > 2. When a reducer tries to read the output of mappers, when is it okay to
> > use the application shuffle ID?
> >
> > If MR3 does not support stage rerun, as before, MR3 can still use the
> > application shuffle ID. Additionally, for Spark applications, we can
> > disable stage rerun support by setting
> > celeborn.client.spark.fetch.throwsFetchFailure to false. This lets Spark
> > continue using the application shuffle ID.
> >
> > 3. Along the same line of question 2, should we always get Celeborn
> > shuffle IDs when trying to read the output of mappers? Considering the
> > fact that the current code of MR3-Celeborn works fine, it seems like this
> > is not always necessary.
> >
> > If MR3 does not require support for stage reruns, we can continue to use
> > the application shuffle ID as before. However, if support for stage reruns
> > is needed in MR3, we will need to regenerate the shuffle ID for each stage
> > attempt. This will allow us to distinguish between the map outputs for the
> > reducers across different attempts of the same stage.
> >
> > Thanks,
> > Jiashu Xiong
> >
> > Sungwoo Park <[email protected]> wrote on Mon, Oct 7, 2024 at 14:20:
> >
> >     (I left this message in the Celeborn Slack channel, but perhaps this
> >     mailing list is the right place.)
> >
> >     Hello,
> >
> >     Previously we implemented an extension of MR3 (an execution engine) to
> >     support Celeborn 0.3.1. For a short introduction, please see:
> >     https://mr3docs.datamonad.com/docs/mr3/features/celeborn/
> >
> >     Now we are upgrading Celeborn to 0.5.1 and working on supporting stage
> >     rerun, much like Spark-Celeborn.
> >
> >     To my (pleasant) surprise, upgrading Celeborn from 0.3.1 to 0.5.1 was
> >     quite smooth. After recompiling with Celeborn 0.5.1, MR3-Celeborn just
> >     worked fine. I was surprised because the current code does not obtain
> >     Celeborn shuffle IDs at all (because there was no notion of Celeborn
> >     shuffle IDs back in 0.3.1) and we use only application shuffle IDs,
> >     which are generated by MR3 (similarly to Spark shuffle IDs).
> >
> >     I have a few questions.
> >
> >     1. Suppose that a reducer fails to read the output of a certain mapper.
> >     In such a case, should we re-execute all the mappers in the previous
> >     stage? Or, is it okay to re-execute only the mapper whose output is
> >     lost? In our previous implementation, MR3-Celeborn does not fully
> >     support task rerun (similar to stage rerun) because Celeborn does not
> >     return the identity of mapper tasks whose output has been lost.
> >
> >     2. When a reducer tries to read the output of mappers, when is it okay
> >     to use the application shuffle ID?
> >
> >     3. Along the same line of question 2, should we always get Celeborn
> >     shuffle IDs when trying to read the output of mappers? Considering the
> >     fact that the current code of MR3-Celeborn works fine, it seems like
> >     this is not always necessary.
> >
> >     Thank you.
> >
> >     --- Sungwoo Park
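
For readers of this thread, the scheme discussed above (a fresh shuffle ID for each stage attempt, and a read failure triggering re-execution of the whole parent stage) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Celeborn's or MR3's actual code: ShuffleIdAllocator, FetchFailure, run_with_stage_rerun, and the run_mappers/run_reducers callbacks are hypothetical names made up for this example.

```python
from dataclasses import dataclass, field
from itertools import count


class FetchFailure(Exception):
    """Hypothetical error raised when a reducer cannot read mapper output."""


@dataclass
class ShuffleIdAllocator:
    """Hypothetical helper: one unique shuffle ID per (app shuffle ID, stage attempt)."""
    _next_id: count = field(default_factory=count)
    _ids: dict = field(default_factory=dict)

    def id_for_write(self, app_shuffle_id: int, stage_attempt: int) -> int:
        # Each stage attempt gets its own ID, so outputs of different
        # attempts of the same stage can be told apart.
        key = (app_shuffle_id, stage_attempt)
        if key not in self._ids:
            self._ids[key] = next(self._next_id)
        return self._ids[key]

    def id_for_read(self, app_shuffle_id: int) -> int:
        # Reducers read the output of the most recent attempt.
        attempts = [a for (s, a) in self._ids if s == app_shuffle_id]
        if not attempts:
            raise KeyError(f"no attempt registered for shuffle {app_shuffle_id}")
        return self._ids[(app_shuffle_id, max(attempts))]


def run_with_stage_rerun(allocator, app_shuffle_id, run_mappers, run_reducers,
                         max_attempts=3):
    """Re-execute the whole parent stage under a new shuffle ID on a read failure."""
    for attempt in range(max_attempts):
        write_id = allocator.id_for_write(app_shuffle_id, attempt)
        run_mappers(write_id)  # all mappers rerun, not only the one whose output is lost
        try:
            return run_reducers(allocator.id_for_read(app_shuffle_id))
        except FetchFailure:
            continue  # data for this attempt is considered lost; try a fresh attempt
    raise RuntimeError(f"shuffle {app_shuffle_id} failed after {max_attempts} attempts")
```

Without stage rerun, the allocator is unnecessary and the application shuffle ID can be used directly for both writing and reading, as described in the answers to questions 2 and 3.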
