[ https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204799#comment-13204799 ]

Roman Shaposhnik commented on MAPREDUCE-3728:
---------------------------------------------

Here's a more direct way to reproduce the problem.

{noformat}
# su - yarn
yarn$ mkdir -p /tmp/TEST/{logs,locs} /tmp/TEST/locs/usercache
yarn$ cp /tmp/cont1.tokens /tmp/TEST/cont1.tokens
yarn$ container-executor rvs 0 app1 /tmp/TEST/cont1.tokens /tmp/TEST/locs \
    /tmp/TEST/logs /usr/java/jdk1.6.0_26/jre/bin/java \
    -classpath /usr/lib/hadoop/lib/\*:/usr/lib/hadoop/\*:/etc/hadoop/conf/nm-config/log4j.properties:/etc/hadoop/conf \
    -Djava.library.path=/usr/lib/hadoop/lib/native \
    org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer \
    rvs app1 cont1 0.0.0.0 4344 /tmp/TEST/locs

main : command provided 0
main : user is rvs
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to 
override final parameter: mapreduce.cluster.local.dir;  Ignoring.
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to 
override final parameter: mapreduce.cluster.local.dir;  Ignoring.
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to 
override final parameter: mapreduce.cluster.local.dir;  Ignoring.
12/02/09 11:54:40 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to 
override final parameter: mapreduce.cluster.local.dir;  Ignoring.
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to 
override final parameter: mapreduce.cluster.local.dir;  Ignoring.
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to 
override final parameter: mapreduce.cluster.local.dir;  Ignoring.
=========== Using localizerTokenSecurityInfo12/02/09 11:54:41 INFO ipc.Client: 
Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 0 time(s).
12/02/09 11:54:42 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 1 time(s).
12/02/09 11:54:43 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 2 time(s).
12/02/09 11:54:44 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 3 time(s).
12/02/09 11:54:45 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 4 time(s).
12/02/09 11:54:46 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 5 time(s).
12/02/09 11:54:47 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 6 time(s).
12/02/09 11:54:48 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 7 time(s).
12/02/09 11:54:49 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 8 time(s).
12/02/09 11:54:50 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 9 time(s).
java.lang.reflect.UndeclaredThrowableException
        at 
org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:221)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:345)
Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: 
Call From c0506.hal.cloudera.com/172.29.81.158 to 0.0.0.0:4344 failed on 
connection exception: java.net.ConnectException: Connection refused; For more 
details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at 
org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:148)
        at $Proxy6.heartbeat(Unknown Source)
        at 
org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:54)
        ... 3 more
Caused by: java.net.ConnectException: Call From 
c0506.hal.cloudera.com/172.29.81.158 to 0.0.0.0:4344 failed on connection 
exception: java.net.ConnectException: Connection refused; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:686)
        at org.apache.hadoop.ipc.Client.call(Client.java:1141)
        at org.apache.hadoop.ipc.Client.call(Client.java:1100)
        at 
org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:145)
        ... 5 more
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:488)
        at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:469)
        at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:563)
        at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:211)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1247)
        at org.apache.hadoop.ipc.Client.call(Client.java:1117)
        ... 7 more
{noformat}

As you can see, the localization process got to the point where it was
trying to fetch the files for localization, which means it had completed
all of its filesystem manipulations. The next step would have been
launching a container under the user id 'rvs'. So let's see where that
container would have put its intermediate results:

{noformat}
  yarn$ ls -ld /tmp/TEST/locs/usercache/rvs/appcache/app1/output/
  drwxr-xr-x 2 rvs yarn 4096 Feb  9 11:54 /tmp/TEST/locs/usercache/rvs/appcache/app1/output/
{noformat}

Quite naturally, given that the sticky bit is no longer present on the
output dir AND that user rvs has rvs as its default group, the resulting
files are completely out of reach for anything running under the yarn
account.
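
To watch the bit disappear in isolation, here is a small standalone sketch
(my own illustration, not a patch) that imitates the initDirs() call quoted
below: create a setgid parent, let FileContext.mkdir create a child the same
way the localizer does, and compare the two modes with ls. The class name and
the /tmp location are arbitrary, and it assumes a Hadoop client on the
classpath:

{noformat}
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class SetgidMkdirDemo {

  // run a shell command and echo whatever it prints
  private static void sh(String cmd) throws Exception {
    ProcessBuilder pb = new ProcessBuilder("sh", "-c", cmd);
    pb.redirectErrorStream(true);
    Process p = pb.start();
    BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
    for (String line = r.readLine(); line != null; line = r.readLine()) {
      System.out.println(line);
    }
    p.waitFor();
  }

  public static void main(String[] args) throws Exception {
    String parent = "/tmp/setgid-demo";   // arbitrary scratch location
    sh("rm -rf " + parent + " && mkdir -p " + parent + " && chmod 2775 " + parent);

    // same call shape as ContainerLocalizer.initDirs(): null permission, no parent creation
    FileContext lfs = FileContext.getLocalFSFileContext();
    lfs.mkdir(new Path(parent, "output"), null, false);

    // if the behaviour described here is right, output/ will be missing the
    // setgid ('s') bit that the parent directory carries
    sh("ls -ld " + parent + " " + parent + "/output");
  }
}
{noformat}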
                
> ShuffleHandler can't access results when configured in a secure mode
> --------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3728
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3728
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Roman Shaposhnik
>            Priority: Critical
>             Fix For: 0.23.1
>
>
> While running the simplest of jobs (Pi) on MR2 in a fully secure 
> configuration I have noticed that the job was failing on the reduce side with 
> the following messages littering the nodemanager logs:
> {noformat}
> 2012-01-19 08:35:32,544 ERROR org.apache.hadoop.mapred.ShuffleHandler: 
> Shuffle error
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
> usercache/rvs/appcache/application_1326928483038_0001/output/attempt_1326928483038_0001_m_000003_0/file.out.index
>  in any of the configured local directories
> {noformat}
> While digging further I found out that the permissions on the files/dirs were 
> prohibiting the nodemanager (running under the user yarn) from accessing these files:
> {noformat}
> $ ls -l 
> /data/3/yarn/usercache/testuser/appcache/application_1327102703969_0001/output/attempt_1327102703969_0001_m_000001_0
> -rw-r----- 1 testuser testuser 28 Jan 20 15:41 file.out
> -rw-r----- 1 testuser testuser 32 Jan 20 15:41 file.out.index
> {noformat}
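> A quick way to confirm this from the yarn account is a trivial standalone 
> check like the following (an illustrative sketch, not anything the NodeManager 
> itself runs; the class name is arbitrary and the path argument is one of the 
> files listed above):
> {noformat}
> import java.io.File;
>
> public class CanYarnRead {
>   public static void main(String[] args) {
>     // e.g. the file.out.index path from the ls listing above
>     File f = new File(args[0]);
>     // with owner and group both 'testuser' and mode rw-r-----, the yarn
>     // user is neither the owner nor in the group, so the read is denied
>     System.out.println(f + ": exists=" + f.exists()
>         + " readable=" + f.canRead());
>   }
> }
> {noformat}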
> Digging even further revealed that the group-sticky bit that was faithfully 
> put on all the subdirectories between testuser and 
> application_1327102703969_0001 was gone from output and 
> attempt_1327102703969_0001_m_000001_0. 
> Looking into how these subdirectories are created 
> (org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.initDirs())
> {noformat}
>       // $x/usercache/$user/appcache/$appId/filecache
>       Path appFileCacheDir = new Path(appBase, FILECACHE);
>       appsFileCacheDirs[i] = appFileCacheDir.toString();
>       lfs.mkdir(appFileCacheDir, null, false);
>       // $x/usercache/$user/appcache/$appId/output
>       lfs.mkdir(new Path(appBase, OUTPUTDIR), null, false);
> {noformat}
> reveals that lfs.mkdir ends up manipulating permissions and thus clears the 
> sticky bit from output and filecache.
> At this point I'm at a loss about how this is supposed to work. My 
> understanding was that the whole sequence of events here was predicated on 
> the sticky bit being set, so that daemons running under the user yarn 
> (default group yarn) can have access to the resulting files and 
> subdirectories down at output and below. Please let me know if I'm missing 
> something or whether this is just a bug that needs to be fixed.
> On a related note, when the shuffle side of the Pi job failed, the job 
> itself didn't. It went into an endless loop and only exited when it had 
> exhausted all the local storage for the log files (at which point the 
> nodemanager died and thus the job ended). Perhaps this is an even more 
> serious side effect of this issue that needs to be investigated separately.
