[jira] [Commented] (MAPREDUCE-6948) TestJobImpl.testUnusableNodeTransition failed

2017-12-15 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293374#comment-16293374
 ] 

Jim Brennan commented on MAPREDUCE-6948:


I have been unable to reproduce this problem in trunk or in branch-2.8. I 
have been able to reproduce it in branch-2.7, but only by adding a sleep to 
exacerbate the race condition.

Analysis:
The key point in the failure case is here:
{{2017-08-30 10:12:22,000 INFO  [Thread-49] impl.JobImpl 
(JobImpl.java:transition(1953)) - Num completed Tasks: 1
2017-08-30 10:12:22,029 INFO  [Thread-49] impl.JobImpl 
(JobImpl.java:transition(1953)) - Num completed Tasks: 2
2017-08-30 10:12:22,032 INFO  [Thread-49] impl.JobImpl 
(JobImpl.java:actOnUnusableNode(1354)) - TaskAttempt killed because it ran on 
unusable node Mock for NodeId, hashCode: 1280187896. 
AttemptId:attempt_123456789_0001_m_00_0
2017-08-30 10:12:22,032 INFO  [Thread-49] impl.JobImpl 
(JobImpl.java:transition(1953)) - Num completed Tasks: 3
}}
At this point Num completed tasks should be 2.  Since it is 3, we start moving 
to the COMMITTED state too early and trip the failure.
In the successful case, the log looks like this:
{{2017-12-15 16:16:54,253 INFO  [Thread-0] impl.JobImpl 
(JobImpl.java:transition(1979)) - Num completed Tasks: 1
2017-12-15 16:16:54,258 INFO  [Thread-0] impl.JobImpl 
(JobImpl.java:transition(1979)) - Num completed Tasks: 2
2017-12-15 16:16:54,260 INFO  [Thread-0] impl.JobImpl 
(JobImpl.java:actOnUnusableNode(1359)) - TaskAttempt killed because it ran on 
unusable node Mock for NodeId, hashCode: 131679889. 
AttemptId:attempt_123456789_0001_m_00_0
2017-12-15 16:16:54,261 INFO  [Thread-0] impl.JobImpl 
(JobImpl.java:transition(1979)) - Num completed Tasks: 2
2017-12-15 16:16:54,262 INFO  [Thread-0] impl.JobImpl 
(JobImpl.java:checkReadyForCompletionWhenAllReducersDone(2103)) - Killing map 
task task_123456789_0001_m_00
2017-12-15 16:16:54,263 INFO  [Thread-0] impl.JobImpl 
(JobImpl.java:checkReadyForCompletionWhenAllReducersDone(2103)) - Killing map 
task task_123456789_0001_m_01
2017-12-15 16:16:54,263 INFO  [Thread-0] impl.JobImpl 
(JobImpl.java:transition(1979)) - Num completed Tasks: 3}}

The second "Num completed Tasks: 2" line corresponds to when we mark the 
Reducer task as SUCCEEDED.  At this point, the count of succeeded map tasks 
should be 1, because it was just decremented due to the unusable node.  The 
total is then incremented to 2 for the reducer before printing.

The difference between branch-2.7, which fails, and trunk/branch-2.8 is the fix 
in MAPREDUCE-6675, which switched it to use a DrainDispatcher and added a 
dispatcher.await() call before we complete the reducer.
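
To illustrate why draining the dispatcher before completing the reducer 
matters, here is a minimal, self-contained sketch. It does not use Hadoop's 
DrainDispatcher: the DrainSketch class, its dispatch() and await() methods, and 
the single counter are hypothetical stand-ins, and the real JobImpl state 
machine is considerably more involved.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy draining dispatcher (an assumption for illustration, not Hadoop code). */
public class DrainSketch {
    private final Object lock = new Object();
    private final Queue<Runnable> queue = new ArrayDeque<>();
    private int pending = 0; // events queued or running; guarded by lock

    public DrainSketch() {
        Thread worker = new Thread(this::runLoop, "toy-dispatcher");
        worker.setDaemon(true); // do not keep the JVM alive
        worker.start();
    }

    private void runLoop() {
        try {
            while (true) {
                Runnable event;
                synchronized (lock) {
                    while (queue.isEmpty()) {
                        lock.wait();
                    }
                    event = queue.remove();
                }
                try {
                    event.run();
                } finally {
                    synchronized (lock) {
                        pending--;
                        lock.notifyAll(); // wake any thread blocked in await()
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Queue an event for asynchronous handling. */
    public void dispatch(Runnable event) {
        synchronized (lock) {
            queue.add(event);
            pending++;
            lock.notifyAll();
        }
    }

    /** Block until every dispatched event has been fully handled. */
    public void await() {
        synchronized (lock) {
            try {
                while (pending > 0) {
                    lock.wait();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while draining", e);
            }
        }
    }

    /** Mirrors the passing log: 2 completed, one map killed, reducer succeeds. */
    public static int completedAfterReducer() {
        DrainSketch dispatcher = new DrainSketch();
        AtomicInteger completed = new AtomicInteger(2);  // two maps succeeded
        dispatcher.dispatch(completed::decrementAndGet); // attempt killed on unusable node
        dispatcher.await();                              // drain before the reducer completes
        return completed.incrementAndGet();              // reducer marked SUCCEEDED
    }

    public static void main(String[] args) {
        System.out.println("Num completed Tasks: " + completedAfterReducer());
    }
}
```

Without the await() call in this toy, the reducer increment could run before 
the queued decrement, and the caller would observe 3 instead of 2, which is 
the shape of the failing log above.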

Another possible factor is YARN-5436, which fixed a very similar race in 
DrainDispatcher.  That one is present in trunk, but not in branch-2.8.  So it 
may account for intermittent failures in branch-2.8, but I was not able to 
reproduce it.

So as far as I can tell, this appears to be fixed already.

[~haibo.chen], can you provide any insight?  Any chance this failure was seen 
on branch-2.8 or branch-2.7?



> TestJobImpl.testUnusableNodeTransition failed
> -
>
> Key: MAPREDUCE-6948
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6948
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha4
>Reporter: Haibo Chen
>Assignee: Jim Brennan
>  Labels: unit-test
>
> *Error Message*
> expected: but was:
> *Stacktrace*
> java.lang.AssertionError: expected: but was:
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TestJobImpl.assertJobState(TestJobImpl.java:1041)
>   at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TestJobImpl.testUnusableNodeTransition(TestJobImpl.java:615)
> *Standard out*
> {code}
> 2017-08-30 10:12:21,928 INFO  [Thread-49] event.AsyncDispatcher 
> (AsyncDispatcher.java:register(209)) - Registering class 
> org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventType for class 
> org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler
> 2017-08-30 10:12:21,939 INFO  [Thread-49] event.AsyncDispatcher 
> (AsyncDispatcher.java:register(209)) - Registering class 
> org.apache.hadoop.mapreduce.v2.app.job.event.JobEventType for class 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TestJobImpl$StubbedJob
> 2017-08-30 10:12:21,940 INFO  [Thread-49] event.AsyncDispatcher 
> (AsyncDispatcher.java:register(209)) - Registering class 
> org.apache.hadoop.mapreduce.v2.app.job.event.TaskEventType for class 
> org.apache.hadoop.yarn.event.EventHandler$$EnhancerByMockitoWithCGLIB$$79f96ebf
> 2017-08-30 

[jira] [Updated] (MAPREDUCE-6988) Let JHS support different file systems for intermediate_done and done

2017-12-15 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated MAPREDUCE-6988:
---
Fix Version/s: (was: 2.7.5)

> Let JHS support different file systems for intermediate_done and done
> -
>
> Key: MAPREDUCE-6988
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6988
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobhistoryserver
>Affects Versions: 2.7.4
>Reporter: Johan Gustavsson
>Priority: Minor
> Attachments: MAPREDUCE-6988.000.patch, MAPREDUCE-6988.001.patch, 
> MAPREDUCE-6988.002.patch, MAPREDUCE-6988.003.patch
>
>
> Currently the JHS uses FileContext to move files from the intermediate_done 
> folder to the done folder. Because FileContext is limited to a single file 
> system, it is harder to use S3 as storage for jhist files. By moving this to 
> the FileSystem interface we can use HDFS for intermediate storage and S3 for 
> long-term storage, reducing the number of PUTs to S3 and removing the need 
> for all M/R containers to carry an S3 SDK.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Updated] (MAPREDUCE-6950) Error Launching job : java.io.IOException: Unknown Job job_xxx_xxx

2017-12-15 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated MAPREDUCE-6950:
---
Fix Version/s: (was: 2.7.5)

> Error Launching job : java.io.IOException: Unknown Job job_xxx_xxx
> --
>
> Key: MAPREDUCE-6950
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6950
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am
>Affects Versions: 2.7.1
>Reporter: zhengchenyu
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Some jobs report an error like this:
> {code}
> hadoop.mapreduce.Job.monitorAndPrintJob(Job.java 1367) [main] :  map 100% 
> reduce 100%
> [2017-08-31T20:27:12.591+08:00] [INFO] 
> hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java 277) 
> [main] : Application state is completed. FinalApplicationStatus=SUCCEEDED. 
> Redirecting to job history server
> [2017-08-31T20:27:12.821+08:00] [INFO] 
> hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java 277) 
> [main] : Application state is completed. FinalApplicationStatus=SUCCEEDED. 
> Redirecting to job history server
> [2017-08-31T20:27:13.039+08:00] [INFO] 
> hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java 277) 
> [main] : Application state is completed. FinalApplicationStatus=SUCCEEDED. 
> Redirecting to job history server
> [2017-08-31T20:27:13.256+08:00] [ERROR] 
> hadoop.streaming.StreamJob.submitAndMonitorJob(StreamJob.java 1034) [main] : 
> Error Launching job : java.io.IOException: Unknown Job job_xxx_xxx
> {code}
> I found the AM container log, shown below. It shows that the error happened 
> in the write pipeline, possibly because of a DataNode error. I also found 
> other causes that close the JobHistoryEventHandler. When that happens, the MR 
> AM can't write the job history information, so the client can't tell whether 
> the application has finished.
> {code}
> 2017-08-31 20:27:10,813 INFO [Thread-1968] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, 
> writing event MAP_ATTEMPT_STARTED
> 2017-08-31 20:27:10,814 ERROR [Thread-1968] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing 
> History Event: 
> org.apache.hadoop.mapreduce.jobhistory.TaskAttemptStartedEvent@2055ea0a
> java.io.EOFException: Premature EOF: no length prefix available
> at 
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2292)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1317)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1237)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
> 2017-08-31 20:27:10,814 INFO [Thread-1968] 
> org.apache.hadoop.service.AbstractService: Service JobHistoryEventHandler 
> failed in state STOPPED; cause: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.EOFException: 
> Premature EOF: no length prefix available
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.EOFException: 
> Premature EOF: no length prefix available
> at 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:580)
> at 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:374)
>  
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
> at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
> at 
> org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
> at 
> org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
> {code}
> This problem is serious, especially for Hive: the job has to be rerun for no 
> good reason. I think we need to retry the operation of writing the history 
> event.
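
The retry proposed above could look roughly like the following sketch. It is 
not taken from JobHistoryEventHandler or from any patch on this issue: the 
HistoryWriteRetrySketch class, the IoAction interface, the writeWithRetry() 
method and the linear backoff are all hypothetical, and a real change would 
have to decide which exceptions are worth retrying.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical retry wrapper for a single history-event write. */
public class HistoryWriteRetrySketch {

    /** One write attempt that may fail with an IOException (assumed shape). */
    @FunctionalInterface
    public interface IoAction {
        void run() throws IOException;
    }

    /**
     * Runs the write up to maxAttempts times and returns the attempt number
     * that succeeded. Throws UncheckedIOException once all attempts fail.
     */
    public static int writeWithRetry(IoAction write, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                write.run();
                return attempt;
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
            if (attempt < maxAttempts) {
                sleepQuietly(backoffMillis * attempt); // simple linear backoff
            }
        }
        throw new UncheckedIOException(
            "history event write failed after " + maxAttempts + " attempts", last);
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }

    public static void main(String[] args) {
        int[] failuresLeft = {2}; // simulate two pipeline failures, then success
        int attempts = writeWithRetry(() -> {
            if (failuresLeft[0]-- > 0) {
                throw new EOFException("Premature EOF: no length prefix available");
            }
        }, 5, 10L);
        System.out.println("write succeeded on attempt " + attempts);
    }
}
```

In this simulation the write fails twice and succeeds on the third attempt. 
Whether a retry is safe during serviceStop(), where the failure in the log 
above occurs, is a separate question this sketch does not answer.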






[jira] [Commented] (MAPREDUCE-6362) History Plugin should be updated

2017-12-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293194#comment-16293194
 ] 

Hadoop QA commented on MAPREDUCE-6362:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  4s{color} 
| {color:red} MAPREDUCE-6362 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | MAPREDUCE-6362 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12731975/MAPREDUCE-6362.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7256/console |
| Powered by | Apache Yetus 0.7.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> History Plugin should be updated
> 
>
> Key: MAPREDUCE-6362
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6362
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: MAPREDUCE-6362.patch
>
>
> As applications complete, the RM tracks their IDs in a completed list. This 
> list is routinely truncated to limit the total number of applications 
> remembered by the RM.
> When a user clicks the History link for a job, the browser is redirected to 
> the application's tracking link obtained from the stored application 
> instance. But when the application has been purged from the RM, an error is 
> displayed.
> In very busy clusters the rate at which applications complete can cause 
> applications to be purged from the RM's internal list within hours, which 
> breaks the proxy URLs users have saved for their jobs.
> We would like the RM to provide valid tracking links that persist so that 
> users are not frustrated by broken links.
> With the current plugin in place, redirection for MapReduce jobs works, but 
> we need to add the functionality for Tez jobs.


