[ https://issues.apache.org/jira/browse/SPARK-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryan Williams updated SPARK-10551: ---------------------------------- Description: Doing forensics on some failed Spark applications and seeing nonsensical things in the event logs, e.g.: {code} $ grep -n '"Task ID":12083' application_1439224376754_5702 24578:{"Event":"SparkListenerTaskStart","Stage ID":6,"Stage Attempt ID":0,"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch Time":1440703704768,"Executor ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}} 28918:{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"232"},"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch Time":1440703704768,"Executor ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1440703707747,"Failed":true,"Accumulables":[]}} 29062:{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"Success"},"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch Time":1440703704768,"Executor ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1440703707747,"Failed":true,"Accumulables":[]},"Task Metrics":{"Host Name":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Executor Deserialize Time":181,"Executor Run Time":1585,"Result Size":8760,"JVM GC Time":0,"Result Serialization Time":0,"Memory Bytes Spilled":0,"Disk Bytes Spilled":0,"Shuffle Write Metrics":{"Shuffle Bytes Written":454121,"Shuffle Write Time":43293396,"Shuffle Records Written":2549},"Input Metrics":{"Data Read Method":"Memory","Bytes Read":810520,"Records Read":2549}}} {code} Task ID 12083 has a TaskStart event, a TaskEnd event indicating that the task failed due to {{ExecutorLostFailure}}, and then a TaskEnd event saying that the task succeeded. The history server is not showing me this file in the "complete" or "incomplete" sections, though it has this line in its stdout (and no apparent exceptions later), which I thought meant that it parsed the file correctly: {code} 15/09/10 17:57:56 INFO FsHistoryProvider: Replaying log path: hdfs://demeter-nn1.demeter.hpc.mssm.edu/spark/tmp/logs/willir31/application_1439224376754_5702 {code} [~arahuja] ran this application originally and says that the live web UI was showing inconsistent/nonsensical data. was: Doing forensics on some failed Spark applications and seeing nonsensical things in the event logs, e.g.: {code} $ grep -n '"Task ID":12083' application_1439224376754_5702 24578:{"Event":"SparkListenerTaskStart","Stage ID":6,"Stage Attempt ID":0,"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch Time":1440703704768,"Executor ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}} 28918:{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"232"},"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch Time":1440703704768,"Executor ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1440703707747,"Failed":true,"Accumulables":[]}} 29062:{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"Success"},"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch Time":1440703704768,"Executor ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1440703707747,"Failed":true,"Accumulables":[]},"Task Metrics":{"Host Name":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Executor Deserialize Time":181,"Executor Run Time":1585,"Result Size":8760,"JVM GC Time":0,"Result Serialization Time":0,"Memory Bytes Spilled":0,"Disk Bytes Spilled":0,"Shuffle Write Metrics":{"Shuffle Bytes Written":454121,"Shuffle Write Time":43293396,"Shuffle Records Written":2549},"Input Metrics":{"Data Read Method":"Memory","Bytes Read":810520,"Records Read":2549}}} {code} Task ID 12083 has a TaskStart event, a TaskEnd event indicating that the task failed due to {{ExecutorLostFailure}}, and then a TaskEnd event saying that the task succeeded. The history server is not showing me this file in the "complete" or "incomplete" sections, though it has this line in its stdout (and no apparent exceptions later), which I thought meant that it parsed the file correctly: {code} 15/09/10 17:57:56 INFO FsHistoryProvider: Replaying log path: hdfs://demeter-nn1.demeter.hpc.mssm.edu/spark/tmp/logs/willir31/application_1439224376754_5702 {code} [~arahuja] ran this application originally and says that the live web UI was showing inconsistent/nonsensical data. > Successful task-end event after task failed due to executor loss > ---------------------------------------------------------------- > > Key: SPARK-10551 > URL: https://issues.apache.org/jira/browse/SPARK-10551 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.4.1 > Reporter: Ryan Williams > > Doing forensics on some failed Spark applications and seeing nonsensical > things in the event logs, e.g.: > {code} > $ grep -n '"Task ID":12083' application_1439224376754_5702 > 24578:{"Event":"SparkListenerTaskStart","Stage ID":6,"Stage Attempt > ID":0,"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch > Time":1440703704768,"Executor > ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting > Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}} > 28918:{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task > Type":"ShuffleMapTask","Task End > Reason":{"Reason":"ExecutorLostFailure","Executor ID":"232"},"Task > Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch > Time":1440703704768,"Executor > ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting > Result Time":0,"Finish Time":1440703707747,"Failed":true,"Accumulables":[]}} > 29062:{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task > Type":"ShuffleMapTask","Task End Reason":{"Reason":"Success"},"Task > Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch > Time":1440703704768,"Executor > ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting > Result Time":0,"Finish > Time":1440703707747,"Failed":true,"Accumulables":[]},"Task Metrics":{"Host > Name":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Executor Deserialize > Time":181,"Executor Run Time":1585,"Result Size":8760,"JVM GC Time":0,"Result > Serialization Time":0,"Memory Bytes Spilled":0,"Disk Bytes > Spilled":0,"Shuffle Write Metrics":{"Shuffle Bytes Written":454121,"Shuffle > Write Time":43293396,"Shuffle Records Written":2549},"Input Metrics":{"Data > Read Method":"Memory","Bytes Read":810520,"Records Read":2549}}} > {code} > Task ID 12083 has a TaskStart event, a TaskEnd event indicating that the task > failed due to {{ExecutorLostFailure}}, and then a TaskEnd event saying that > the task succeeded. > The history server is not showing me this file in the "complete" or > "incomplete" sections, though it has this line in its stdout (and no apparent > exceptions later), which I thought meant that it parsed the file correctly: > {code} > 15/09/10 17:57:56 INFO FsHistoryProvider: Replaying log path: > hdfs://demeter-nn1.demeter.hpc.mssm.edu/spark/tmp/logs/willir31/application_1439224376754_5702 > {code} > [~arahuja] ran this application originally and says that the live web UI was > showing inconsistent/nonsensical data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org