[ https://issues.apache.org/jira/browse/MAPREDUCE-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133426#comment-13133426 ]
Vinod Kumar Vavilapalli commented on MAPREDUCE-2708:
----------------------------------------------------
I have hit what looks like a blocker here.
The jobhistory file of a small job that finishes normally (6 maps, each
sleeping for 1 min) is about 60KB:
bq. -rw-rw---- 3 nobody rm 60207 2011-10-22 21:20 /job-history-root/history/done/2011/10/22/000000/job_1319280146725_0003-1319298340296-nobody-Sleep+job-1319298659124-6-1-SUCCEEDED.jhist
Now, if I kill the AM after a couple of tasks have finished, the NN reports the
file length as zero bytes:
bq. -rw-r--r-- 3 nobody supergroup 0 2011-10-22 21:15 /user/nobody/staging1234/nobody/.staging/job_1319280146725_0003_1.jhist
Reading this file fails, both when the next-generation AM tries to read it for
recovery and when I try to read it manually with the dfs command:
{quote}
11/10/22 21:30:31 DEBUG ipc.Client: closing ipc connection to /127.0.0.1:50020: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:535)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1152)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:499)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:583)
at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:205)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1195)
at org.apache.hadoop.ipc.Client.call(Client.java:1065)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:244)
at $Proxy10.getReplicaVisibleLength(Unknown Source)
at org.apache.hadoop.hdfs.protocolR23Compatible.ClientDatanodeProtocolTranslatorR23.getReplicaVisibleLength(ClientDatanodeProtocolTranslatorR23.java:121)
at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:163)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:140)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:111)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:569)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:235)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:585)
at org.apache.hadoop.fs.shell.Display$Cat.getInputStream(Display.java:93)
at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:81)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:300)
at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:272)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:255)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:239)
at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:185)
at org.apache.hadoop.fs.shell.Command.run(Command.java:149)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:254)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:296)
....
11/10/22 21:30:31 ERROR ipc.RPC: Tried to call RPC.stopProxy on an object that is not a proxy.
java.lang.IllegalArgumentException: not a proxy instance
at java.lang.reflect.Proxy.getInvocationHandler(Proxy.java:637)
at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:479)
at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:183)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:140)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:111)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:569)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:235)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:585)
at org.apache.hadoop.fs.shell.Display$Cat.getInputStream(Display.java:93)
at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:81)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:300)
at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:272)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:255)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:239)
at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:185)
at org.apache.hadoop.fs.shell.Command.run(Command.java:149)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:254)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:296)
11/10/22 21:30:31 ERROR ipc.RPC: Could not get invocation handler null for proxy class class org.apache.hadoop.hdfs.protocolR23Compatible.ClientDatanodeProtocolTranslatorR23, or invocation handler is not closeable.
cat: Cannot obtain block length for LocatedBlock[BP-995821427-127.0.0.1-1318832709756:blk_-7812123742502704244_1249; getBlockSize()=0; corrupt=false; offset=0; locs=[127.0.0.1:999]
{quote}
So it looks like we are stuck whenever the job-history file fits within a
single block and that block isn't complete yet. I could try a small block size,
say 25-30K, for the jobhistory file, but is that acceptable on real clusters?
Sharad?
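
To make the small-block idea concrete, here is a rough sketch of the arithmetic, under the assumption that only the bytes in completed blocks can be read back without asking a datanode for the length of the in-progress block. The class and method names are hypothetical, the 64MB figure is an assumed block size (the comment above does not state the configured one), and 60207 is the size of the finished file from the listing above, so it is an upper bound on what a killed AM would have written.

```java
public class VisibleLength {

    /**
     * Bytes that sit in fully written blocks. The trailing, still-open
     * block is excluded (assumption: its length is not known without a
     * datanode call).
     */
    static long completedBytes(long bytesWritten, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        return (bytesWritten / blockSize) * blockSize;
    }

    public static void main(String[] args) {
        long written = 60207L;               // finished .jhist size from the listing
        long largeBlock = 64L * 1024 * 1024; // assumed block size, not from the source
        long smallBlock = 25600L;            // 25K, the lower end of the proposal

        // With a large block the whole history file is one open block.
        System.out.println(completedBytes(written, largeBlock));
        // With a 25K block, two blocks are already complete.
        System.out.println(completedBytes(written, smallBlock));
    }
}
```

Under these assumptions, a small block size would only bound the amount of history stuck in the open block to less than one block; the tail would still be unreadable in the failure shown above.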
> [MR-279] Design and implement MR Application Master recovery
> ------------------------------------------------------------
>
> Key: MAPREDUCE-2708
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2708
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.0
> Reporter: Sharad Agarwal
> Assignee: Sharad Agarwal
> Priority: Blocker
> Fix For: 0.23.0
>
> Attachments: MAPREDUCE-2708-20111021.1.txt,
> MAPREDUCE-2708-20111021.txt, MAPREDUCE-2708-20111022.txt, mr2708_v1.patch,
> mr2708_v2.patch
>
>
> Design recovery of MR AM from crashes/node failures. The running job should
> recover from the state it left off.