[jira] [Comment Edited] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir
[ https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377297#comment-14377297 ] Hitesh Shah edited comment on TEZ-1909 at 3/24/15 5:07 AM: --- Comments: {code} LOG.warn(Other recovery files will be skipped due to error in the previous recovery file); {code} - please add the file name to this log line For TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED, maybe change to TEZ_TEST_... and likewise change property value. No scope defined? It seems like the patch for this jira has been merged with fixes for a different jira? Can these be separated out? was (Author: hitesh): Comments: {code} LOG.warn(Other recovery files will be skipped due to error in the previous recovery file); {code} - please add the file name to this line as well as its length For TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED, maybe change to TEZ_TEST_... and likewise change property value. No scope defined? It seems like the patch for this jira has been merged with fixes for a different jira? Can these be separated out? Remove need to copy over all events from attempt 1 to attempt 2 dir --- Key: TEZ-1909 URL: https://issues.apache.org/jira/browse/TEZ-1909 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jeff Zhang Attachments: TEZ-1909-1.patch, TEZ-1909-2.patch, TEZ-1909-3.patch Use of file versions should prevent the need for copying over data into a second attempt dir. Care needs to be taken to handle last corrupt record handling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir
[ https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366825#comment-14366825 ] Jeff Zhang edited comment on TEZ-1909 at 3/20/15 9:14 AM: -- Attach the new patch to address the review comment. [~hitesh] Please help review Apart from the issues in the review comments, I also found there's one issue about RecoveryService. For the scenario of draining the events before RecoverySerivce is stopped, previously I take the event queue's size equal to zero as an indication of events are all consumed, but it is not true. Because even if the event queue is empty, the event may still been processing. I fix this bug in the new patch just like AsyncDispatcher did. bq. the if (skipAllOtherEvents) { check is probably also needed at the top of the loop to prevent new files from being opened and read ( in addition to short-circuiting the read of all events in the given file ). Maybe just log a message that other files were present and skipped Fix it. also add unit test in TestRecoveryParser bq. any reason why this is needed in the DAGAppMaster SetString getDagIDs() ? Only for unit test. But in the new patch, I remove it and initialize the Set in the setup method. bq. also, we should add a test for adding corrupt data to the summary stream and ensuring that its processing fails Done. bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used anywhere apart from being set to true in one of the tests. Fix it. bq. please replace import com.sun.tools.javac.util.List; with java.lang.List Fix it bq. testCorruptedLastRecord should also verify that the dag submitted event was seen. Done. verify DAGAppMaster.createDAG is invoked. was (Author: zjffdu): Attach the new patch to address the review comment. Apart from the issues in the review comments, I also found there's one issue about RecoveryService. For the scenario of draining the events before RecoverySerivce is stopped, previously I take the event queue's size equal to zero as an indication of events are all consumed, but it is not true. Because even if the event queue is empty, the event may still been processing. I fix this bug in the new patch just like AsyncDispatcher did. bq. the if (skipAllOtherEvents) { check is probably also needed at the top of the loop to prevent new files from being opened and read ( in addition to short-circuiting the read of all events in the given file ). Maybe just log a message that other files were present and skipped Fix it. also add unit test in TestRecoveryParser bq. any reason why this is needed in the DAGAppMaster SetString getDagIDs() ? Only for unit test. But in the new patch, I remove it and initialize the Set in the setup method. bq. also, we should add a test for adding corrupt data to the summary stream and ensuring that its processing fails Done. bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used anywhere apart from being set to true in one of the tests. Fix it. bq. please replace import com.sun.tools.javac.util.List; with java.lang.List Fix it bq. testCorruptedLastRecord should also verify that the dag submitted event was seen. Done. verify DAGAppMaster.createDAG is invoked. Remove need to copy over all events from attempt 1 to attempt 2 dir --- Key: TEZ-1909 URL: https://issues.apache.org/jira/browse/TEZ-1909 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jeff Zhang Attachments: TEZ-1909-1.patch, TEZ-1909-2.patch, TEZ-1909-3.patch Use of file versions should prevent the need for copying over data into a second attempt dir. Care needs to be taken to handle last corrupt record handling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir
[ https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366825#comment-14366825 ] Jeff Zhang edited comment on TEZ-1909 at 3/18/15 8:37 AM: -- Attach the new patch to address the review comment. Apart from the issues in the review comments, I also found there's one issue about RecoveryService. For the scenario of draining the events before RecoverySerivce is stopped, previously I take the event queue's size equal to zero as an indication of events are all consumed, but it is not true. Because even if the event queue is empty, the event may still been processing. I fix this bug in the new patch just like AsyncDispatcher did. bq. the if (skipAllOtherEvents) { check is probably also needed at the top of the loop to prevent new files from being opened and read ( in addition to short-circuiting the read of all events in the given file ). Maybe just log a message that other files were present and skipped Fix it. also add unit test in TestRecoveryParser bq. any reason why this is needed in the DAGAppMaster SetString getDagIDs() ? Only for unit test. But in the new patch, I remove it and initialize the Set in the setup method. bq. also, we should add a test for adding corrupt data to the summary stream and ensuring that its processing fails Done. bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used anywhere apart from being set to true in one of the tests. Fix it. bq. please replace import com.sun.tools.javac.util.List; with java.lang.List Fix it bq. testCorruptedLastRecord should also verify that the dag submitted event was seen. Done. verify DAGAppMaster.createDAG is invoked. was (Author: zjffdu): Attach the new patch to address the review comment. Apart from the issues in the review comments, I also found there's one issue about RecoveryService. For the scenario of draining the events before RecoverySerivce is stopped, previously I take the event queue's size eqaul to zero as an indication of events are all consumed, but it is not true. Because even if the event queue is empty, the event may still being processing. I fix this bug in the new patch just like AsyncDispatcher did. bq. the if (skipAllOtherEvents) { check is probably also needed at the top of the loop to prevent new files from being opened and read ( in addition to short-circuiting the read of all events in the given file ). Maybe just log a message that other files were present and skipped Fix it. also add unit test in TestRecoveryParser bq. any reason why this is needed in the DAGAppMaster SetString getDagIDs() ? Only for unit test. But in the new patch, I remove it and initialize the Set in the setup method. bq. also, we should add a test for adding corrupt data to the summary stream and ensuring that its processing fails Done. bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used anywhere apart from being set to true in one of the tests. Fix it. bq. please replace import com.sun.tools.javac.util.List; with java.lang.List Fix it bq. testCorruptedLastRecord should also verify that the dag submitted event was seen. Done. verify DAGAppMaster.createDAG is invoked. Remove need to copy over all events from attempt 1 to attempt 2 dir --- Key: TEZ-1909 URL: https://issues.apache.org/jira/browse/TEZ-1909 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jeff Zhang Attachments: TEZ-1909-1.patch, TEZ-1909-2.patch, TEZ-1909-3.patch Use of file versions should prevent the need for copying over data into a second attempt dir. Care needs to be taken to handle last corrupt record handling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir
[ https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357592#comment-14357592 ] Hitesh Shah edited comment on TEZ-1909 at 3/11/15 9:02 PM: --- Minor comments: - there can be cases where data is partially written hence there might be an error when reading the last record. Maybe we should add a simulated test for this by writing invalid data to the end of an intermediate summary and dag file and seeing whether the code handles it correctly? - skipAllOtherEvents should probably be a flag across all files for a given dag. At the moment, it is considered only for a single dag file and reset. - log line LOG.info(isSpeculationEnabled: + isSpeculationEnabled); was removed - not sure why. {code} for (int attemptNum=1; attemptNum=3; ++attemptNum) { ListHistoryEvent historyEvents = new ArrayListHistoryEvent(); for (int i=1 ;i=attemptNum;++i) { Path currentAttemptRecoveryDataDir = TezCommonUtils.getAttemptRecoveryPath(recoveryDataDir,i); Path recoveryFilePath = new Path(currentAttemptRecoveryDataDir, appId.toString().replace(application, dag) + _1 + TezConstants.DAG_RECOVERY_RECOVER_FILE_SUFFIX); historyEvents.addAll(RecoveryParser.parseDAGRecoveryFile( fs.open(recoveryFilePath))); } {code} The above code needs a bit of cleanup in TestDAGRecovery - not sure why we need 2 loops for the 3 attempts' recovery data. was (Author: hitesh): Minor comments: - there can be cases where data is partially written hence there might be an error when reading the last record. Maybe we should add a simulated test for this by writing invalid data to the end of an intermediate summary and dag file and seeing whether the code handles it correctly? - skipAllOtherEvents should probably be a flag across all files for a given dag. At the moment, it is considered only for a single dag file and reset. - log line LOG.info(isSpeculationEnabled: + isSpeculationEnabled); was removed - not sure why. {code} for (int attemptNum=1; attemptNum=3; ++attemptNum) { ListHistoryEvent historyEvents = new ArrayListHistoryEvent(); for (int i=1 ;i=attemptNum;++i) { Path currentAttemptRecoveryDataDir = TezCommonUtils.getAttemptRecoveryPath(recoveryDataDir,i); Path recoveryFilePath = new Path(currentAttemptRecoveryDataDir, appId.toString().replace(application, dag) + _1 + TezConstants.DAG_RECOVERY_RECOVER_FILE_SUFFIX); historyEvents.addAll(RecoveryParser.parseDAGRecoveryFile( fs.open(recoveryFilePath))); } {code} The above code needs a bit of cleanup - not sure why we need 2 loops for the 3 attempts' recovery data. Remove need to copy over all events from attempt 1 to attempt 2 dir --- Key: TEZ-1909 URL: https://issues.apache.org/jira/browse/TEZ-1909 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jeff Zhang Attachments: TEZ-1909-1.patch Use of file versions should prevent the need for copying over data into a second attempt dir. Care needs to be taken to handle last corrupt record handling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir
[ https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14262006#comment-14262006 ] Hitesh Shah edited comment on TEZ-1909 at 12/31/14 7:23 AM: Actually both. Today, we end up copying over data from the previous attempt into the current attempt's directory. ( the attempt specific directly already exists hence covers part 1 of your comment ). It might be better to just have a chain of partial files to reduce the copy overhead. was (Author: hitesh): Actually both. Today, we end up copying over data from the previous attempt into the current attempt's directory. It might be better to just have a chain of partial files to reduce the copy overhead. Remove need to copy over all events from attempt 1 to attempt 2 dir --- Key: TEZ-1909 URL: https://issues.apache.org/jira/browse/TEZ-1909 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jeff Zhang Use of file versions should prevent the need for copying over data into a second attempt dir. Care needs to be taken to handle last corrupt record handling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)