[jira] [Commented] (MAPREDUCE-4798) TestJobHistoryServer fails some times with 'java.lang.AssertionError: Address already in use'
[ https://issues.apache.org/jira/browse/MAPREDUCE-4798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509435#comment-13509435 ]

sam liu commented on MAPREDUCE-4798:
------------------------------------

Thanks Eric! Pls let me know if there is any further action

> TestJobHistoryServer fails some times with 'java.lang.AssertionError: Address already in use'
> ---------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4798
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4798
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver, test
>    Affects Versions: 1.0.3
>         Environment: Red Hat Ent Server 6.2
>            Reporter: sam liu
>            Priority: Minor
>              Labels: patch
>         Attachments: MAPREDUCE-4798_branch-1.patch, MAPREDUCE-4798.patch
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> UT Failure in IHC 1.0.3: org.apache.hadoop.mapred.TestJobHistoryServer. This UT fails sometimes.
> The error message is:
> 'Testcase: testHistoryServerStandalone took 5.376 sec
>   Caused an ERROR
> Address already in use
> java.lang.AssertionError: Address already in use
>   at org.apache.hadoop.mapred.TestJobHistoryServer.testHistoryServerStandalone(TestJobHistoryServer.java:113)'

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
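The report above does not say how the port clash arises or what the attached patches change. Purely as an illustration of one common way tests avoid "Address already in use", the sketch below asks the operating system for an unused port by binding to port 0. The class and method names are hypothetical and are not taken from the patches.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

/** Hypothetical helper; not part of TestJobHistoryServer or the attached patches. */
class FreePortSketch {

  /** Binds to port 0 so the OS picks a currently unused port, then releases it. */
  static int findFreePort() {
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("free port: " + findFreePort());
  }
}
```

A port found this way can still be taken by another process between the probe and the real bind, so this narrows the window rather than closing it. Letting the server under test bind to port 0 itself, where that is possible, avoids the gap entirely.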
[jira] [Commented] (MAPREDUCE-4835) AM job metrics can double-count a job if it errors after entering a completion state
[ https://issues.apache.org/jira/browse/MAPREDUCE-4835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509385#comment-13509385 ]

Xuan Gong commented on MAPREDUCE-4835:
--------------------------------------

The proposed approach, "Have JobImpl.finished ignore incrementing any metrics if the job is already in a terminal state (SUCCEEDED/FAILED/KILLED) to avoid double-counting a job," may not work. By the time finished is called, the current state has already changed, so it is very difficult to check whether the previous state was a terminal state.

For example, suppose InternalErrorTransition runs and changes the state from SUCCEEDED to ERROR. The code in InternalErrorTransition is:

    public void transition(JobImpl job, JobEvent event) {
      //TODO Is this JH event required.
      job.setFinishTime();
      JobUnsuccessfulCompletionEvent failedEvent =
          new JobUnsuccessfulCompletionEvent(job.oldJobId,
              job.finishTime, 0, 0,
              JobStateInternal.ERROR.toString());
      job.eventHandler.handle(new JobHistoryEvent(job.jobId, failedEvent)); // <-- this line actually changes the state
      job.finished(JobStateInternal.ERROR); // <-- this line increments the failure count a second time
    }

So what we can do is add JobStateInternal previousState = job.getInternalState() before job.eventHandler.handle(new JobHistoryEvent(job.jobId, failedEvent)), and check previousState to decide whether the count should be incremented. For example, to skip the increment when the job moves from a terminal state to the ERROR state, InternalErrorTransition could become:

    public void transition(JobImpl job, JobEvent event) {
      //TODO Is this JH event required.
      job.setFinishTime();
      JobUnsuccessfulCompletionEvent failedEvent =
          new JobUnsuccessfulCompletionEvent(job.oldJobId,
              job.finishTime, 0, 0,
              JobStateInternal.ERROR.toString());
      JobStateInternal previousState = job.getInternalState();
      job.eventHandler.handle(new JobHistoryEvent(job.jobId, failedEvent));
      // Only count the job if the previous state was neither terminal nor
      // ERROR. In those states the job has already been counted, and we do
      // not want to count it again.
      if (previousState != JobStateInternal.SUCCEEDED
          && previousState != JobStateInternal.KILLED
          && previousState != JobStateInternal.FAILED
          && previousState != JobStateInternal.ERROR) {
        job.finished(JobStateInternal.ERROR);
      }
    }

> AM job metrics can double-count a job if it errors after entering a completion state
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4835
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4835
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.0.3-alpha, 0.23.6
>            Reporter: Jason Lowe
>            Priority: Minor
>
> If JobImpl enters the SUCCEEDED, FAILED, or KILLED state but then encounters
> an invalid state transition, it could double-count the job since jobs that
> encounter an error are considered failed jobs. Therefore the job could be
> counted initially as a successful, failed, or killed job, respectively, then
> counted again as a failed job due to the internal error afterwards.
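The previous-state check proposed in this comment can be tried out in isolation. The sketch below is a minimal stand-alone model, not the real JobImpl: the enum, the counter, and the method name onInternalError are hypothetical stand-ins. It treats the check as a set membership test, so the increment is skipped when the previous state is any one of the already-counted states.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical stand-in for the JobImpl metrics logic; illustration only. */
class MetricsGuardSketch {
  enum JobStateInternal { RUNNING, SUCCEEDED, FAILED, KILLED, ERROR }

  /** States in which the job has already been counted once. */
  static final Set<JobStateInternal> ALREADY_COUNTED =
      EnumSet.of(JobStateInternal.SUCCEEDED, JobStateInternal.FAILED,
                 JobStateInternal.KILLED, JobStateInternal.ERROR);

  /** Stand-in for the real failed-jobs metric. */
  int failedJobs = 0;

  /** Counts a failure only if the job was not already in a counted state. */
  void onInternalError(JobStateInternal previousState) {
    if (!ALREADY_COUNTED.contains(previousState)) {
      failedJobs++;
    }
  }

  public static void main(String[] args) {
    MetricsGuardSketch metrics = new MetricsGuardSketch();
    metrics.onInternalError(JobStateInternal.RUNNING);   // counted
    metrics.onInternalError(JobStateInternal.SUCCEEDED); // skipped
    System.out.println(metrics.failedJobs);              // prints 1
  }
}
```

Written as a chain of comparisons, the same guard needs every `!=` test to hold at once, because a state always differs from at least three of the four values.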
[jira] [Commented] (MAPREDUCE-4842) Shuffle race can hang reducer
[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509066#comment-13509066 ]

Jason Lowe commented on MAPREDUCE-4842:
---------------------------------------

Here's the sequence of events that I believe led to the hang during shuffle. See {{MergeManager}} for context of variable references.

# Fetchers started fetching data
# Enough data finishes transferring to reach the {{commitMemory}} threshold and an in-memory merge starts
# While the merge takes place some of the output data is freed before the merge completes, lowering {{commitMemory}} and {{usedMemory}} which allows more data to be fetched
# Eventually we try to fetch too much data because {{usedMemory}} exceeds {{memoryLimit}} and further fetchers are told to WAIT
# All of the outstanding fetches complete and call {{closeInMemoryFile}}, but we don't start a merge because the previous merge is still marked in progress
# Merge completes, allowing a new merge to be started on the next {{closeInMemoryFile}} call
# With no outstanding fetches and no new fetches allowed, we never call {{closeInMemoryFile}} again and never start the next merge
# With no merge in progress and therefore nothing to wait upon, fetcher threads proceed to pummel the {{MergeManager}} asking for merge data reservations that are never given, and the reducer log grows rather rapidly

> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Jason Lowe
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.
> It looked similar to the problem described in MAPREDUCE-3721, where the
> fetchers were all being told to WAIT by the MergeManager but no merge was
> taking place.
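The eight steps in the comment on this issue can be replayed in a small single-threaded model. This is only a sketch under stated assumptions: the field and method names follow the variables mentioned in the comment, but the class itself, the sizes, and the method freeMergedData are made up for illustration, and the real {{MergeManager}} is multi-threaded and more involved. For simplicity the model frees all of the merge input in step 3, where the comment says only some of it.

```java
/** Simplified, single-threaded model of the accounting; not the real MergeManager. */
class ShuffleStallSketch {
  final long memoryLimit;
  final long mergeThreshold;
  long usedMemory = 0;
  long commitMemory = 0;
  boolean mergeInProgress = false;
  int mergesStarted = 0;

  ShuffleStallSketch(long memoryLimit, long mergeThreshold) {
    this.memoryLimit = memoryLimit;
    this.mergeThreshold = mergeThreshold;
  }

  /** Returns true if the fetch may proceed, false if the fetcher must WAIT. */
  boolean reserve(long size) {
    if (usedMemory > memoryLimit) {
      return false;
    }
    usedMemory += size;
    return true;
  }

  /** Called when a fetch completes; the only place where a merge is started. */
  void closeInMemoryFile(long size) {
    commitMemory += size;
    if (!mergeInProgress && commitMemory >= mergeThreshold) {
      mergeInProgress = true;
      mergesStarted++;
    }
  }

  /** The running merge releases input it has already consumed. */
  void freeMergedData(long size) {
    usedMemory -= size;
    commitMemory -= size;
  }

  void mergeFinished() {
    mergeInProgress = false;
  }

  /** Replays the steps with made-up sizes and returns the final state. */
  static ShuffleStallSketch replay() {
    ShuffleStallSketch m = new ShuffleStallSketch(100, 60);
    m.reserve(40);
    m.reserve(40);            // step 1: two fetches in flight
    m.closeInMemoryFile(40);
    m.closeInMemoryFile(40);  // step 2: threshold reached, merge 1 starts
    m.freeMergedData(80);     // step 3: memory freed before merge 1 is done
    m.reserve(60);
    m.reserve(60);            // usedMemory is now above memoryLimit
    m.reserve(60);            // step 4: refused, fetcher told to WAIT
    m.closeInMemoryFile(60);
    m.closeInMemoryFile(60);  // step 5: no merge, merge 1 still in progress
    m.mergeFinished();        // step 6
    return m;                 // steps 7 and 8: nothing calls closeInMemoryFile
  }

  public static void main(String[] args) {
    ShuffleStallSketch m = replay();
    System.out.println("canFetch=" + m.reserve(10)
        + " mergeInProgress=" + m.mergeInProgress
        + " mergesStarted=" + m.mergesStarted);
  }
}
```

In the final state no merge is running, committed data is well above the threshold, and every reservation is refused. Nothing in the model re-checks the merge condition when a merge finishes, which is the gap the comment points at.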
[jira] [Created] (MAPREDUCE-4842) Shuffle race can hang reducer
Jason Lowe created MAPREDUCE-4842:
-------------------------------------

             Summary: Shuffle race can hang reducer
                 Key: MAPREDUCE-4842
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
    Affects Versions: 0.23.5, 2.0.3-alpha
            Reporter: Jason Lowe

Saw an instance where the shuffle caused multiple reducers in a job to hang. It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.
[jira] [Updated] (MAPREDUCE-4824) Provide a mechanism for jobs to indicate they should not be recovered on restart
[ https://issues.apache.org/jira/browse/MAPREDUCE-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated MAPREDUCE-4824:
---------------------------------

    Attachment: MAPREDUCE-4824.patch

> I'm concerned that this might blow up different schedulers in different ways.

I don't think that's a problem since the code change only affects job submission, which kicks in before scheduling code is run.

> Maybe we need to do an 'if' check during recovery and not throw an IOException?

I had another look at this and came up with a new patch. Does it look better?

The Hadoop 2 change sounds like the right approach. At first I thought we didn't need the property in Hadoop 2, due to MAPREDUCE-2702, but actually it would allow users to mark a job as non-recoverable on a per-instance basis. It would build on YARN-128.

> Provide a mechanism for jobs to indicate they should not be recovered on restart
> --------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4824
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv1
>    Affects Versions: 1.1.0
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4824.patch, MAPREDUCE-4824.patch, MAPREDUCE-4824.patch, MAPREDUCE-4824.patch
>
> Some jobs (like Sqoop or HBase jobs) are not idempotent, so should not be
> recovered on jobtracker restart. MAPREDUCE-2702 solves this problem for MR2,
> however the approach there is not applicable for MR1, since even if we only
> use the job-level part of the patch and add a isRecoverySupported method to
> OutputCommitter, there is no way to use that information from the JT (which
> initiates recovery), since the JT does not instantiate OutputCommitters - and
> it shouldn't since they are user-level code. (In MR2 it's OK since the MR AM
> calls the method.)
> Instead, we can add a MR configuration property to say that a job is not
> recoverable, and the JT could safely read this from the job conf.
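The idea in the issue description, a per-job configuration property that the JT reads from the job conf without instantiating any user code, can be sketched as follows. The property name example.job.recoverable and the class are hypothetical; the issue text does not name the real key, and java.util.Properties stands in for the job conf here.

```java
import java.util.Properties;

/** Hypothetical sketch; the key name is not taken from the patch. */
class RecoverableFlagSketch {
  // Assumed property name, for illustration only.
  static final String RECOVERABLE_KEY = "example.job.recoverable";

  /** A job is treated as recoverable unless its conf explicitly says otherwise. */
  static boolean shouldRecover(Properties jobConf) {
    return Boolean.parseBoolean(jobConf.getProperty(RECOVERABLE_KEY, "true"));
  }

  public static void main(String[] args) {
    Properties sqoopLikeJob = new Properties();
    sqoopLikeJob.setProperty(RECOVERABLE_KEY, "false");

    System.out.println(shouldRecover(new Properties())); // prints true
    System.out.println(shouldRecover(sqoopLikeJob));     // prints false
  }
}
```

Defaulting to "recoverable" keeps existing jobs unchanged, and only jobs that opt out are skipped on restart. Whether the actual patch uses that default is not stated in the thread.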
[jira] [Updated] (MAPREDUCE-2584) Check for serializers early, and give out more information regarding missing serializers
[ https://issues.apache.org/jira/browse/MAPREDUCE-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harsh J updated MAPREDUCE-2584:
-------------------------------

    Status: Open  (was: Patch Available)

> Check for serializers early, and give out more information regarding missing serializers
> ----------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2584
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2584
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.2
>            Reporter: Harsh J
>            Assignee: Harsh J
>              Labels: serializers, tasks
>         Attachments: MAPREDUCE-2584.r2.diff, MAPREDUCE-2584.r3.diff, MAPREDUCE-2584.r4.diff, MAPREDUCE-2584.r5.diff, MAPREDUCE-2584.r6.diff, MAPREDUCE-2584.r7.diff, MAPREDUCE-2584.r7.diff, MAPREDUCE-2584.r8.diff, MAPREDUCE-2584.r9.diff
>
> As discussed on HADOOP-7328, MapReduce can handle serializers in a much
> better way in case of bad configuration, improper imports (Some odd Text
> class instead of the Writable Text set as key), etc..
> This issue covers the MapReduce parts of the improvements (made to IFile,
> MapOutputBuffer, etc. and possible early-check of serializer availability
> pre-submit) that provide more information than just an NPE as is the current
> case.
[jira] [Commented] (MAPREDUCE-2584) Check for serializers early, and give out more information regarding missing serializers
[ https://issues.apache.org/jira/browse/MAPREDUCE-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508585#comment-13508585 ]

Harsh J commented on MAPREDUCE-2584:
------------------------------------

The problem comes from the fact that when we look up IO_SERIALIZATIONS_KEY, we don't pass a default set of classes. We could add that in, but it may get harder to maintain if we extend the list.

In any case, I'll upload another patch.

{code}
java.io.IOException: Couldn't find a serializer for the Map-Output Key class: 'class org.apache.hadoop.io.LongWritable'. If custom serialization is being used, ensure that the 'io.serializations' property is appropriately configured for the Job.
	at org.apache.hadoop.mapreduce.JobSubmitter.checkSerializerSpecs(JobSubmitter.java:462)
	at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:424)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:338)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1218)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1215)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1437)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1215)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:617)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:612)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1437)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:612)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:890)
	at org.apache.hadoop.conf.TestNoDefaultsJobConf.testNoDefaults(TestNoDefaultsJobConf.java:82)
{code}

> Check for serializers early, and give out more information regarding missing serializers
> ----------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2584
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2584
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.2
>            Reporter: Harsh J
>            Assignee: Harsh J
>              Labels: serializers, tasks
>         Attachments: MAPREDUCE-2584.r2.diff, MAPREDUCE-2584.r3.diff, MAPREDUCE-2584.r4.diff, MAPREDUCE-2584.r5.diff, MAPREDUCE-2584.r6.diff, MAPREDUCE-2584.r7.diff, MAPREDUCE-2584.r7.diff, MAPREDUCE-2584.r8.diff, MAPREDUCE-2584.r9.diff
>
> As discussed on HADOOP-7328, MapReduce can handle serializers in a much
> better way in case of bad configuration, improper imports (Some odd Text
> class instead of the Writable Text set as key), etc..
> This issue covers the MapReduce parts of the improvements (made to IFile,
> MapOutputBuffer, etc. and possible early-check of serializer availability
> pre-submit) that provide more information than just an NPE as is the current
> case.
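The lookup problem described in this comment, reading the serializations key with no fallback when the configuration has no defaults loaded, can be illustrated with a plain map standing in for the job configuration. This is a sketch, not the JobSubmitter code: the class, the method name, and the one-element default list are assumptions made for the example, and only the 'io.serializations' key comes from the error message above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a config lookup that falls back to a default class list. */
class SerializationLookupSketch {
  static final String IO_SERIALIZATIONS_KEY = "io.serializations";

  // Assumed fallback for illustration; the list a real patch would use may differ.
  static final List<String> ASSUMED_DEFAULTS =
      Arrays.asList("org.apache.hadoop.io.serializer.WritableSerialization");

  /** Returns the configured serialization classes, or the fallback if none are set. */
  static List<String> getSerializations(Map<String, String> conf) {
    String value = conf.get(IO_SERIALIZATIONS_KEY);
    if (value == null || value.trim().isEmpty()) {
      return ASSUMED_DEFAULTS;
    }
    return Arrays.asList(value.trim().split("\\s*,\\s*"));
  }

  public static void main(String[] args) {
    Map<String, String> noDefaultsConf = new HashMap<>();
    // Without a fallback this lookup would yield nothing and the check would fail.
    System.out.println(getSerializations(noDefaultsConf));
  }
}
```

The maintenance concern raised in the comment shows up directly here: the fallback list is a second copy of information that normally lives in the default configuration, so it has to be kept in sync by hand whenever that list grows.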