Hi John, FWIW, setting the log level of org.apache.hadoop.security.UserGroupInformation to ERROR seemed to prevent the fatal NameNode slowdown we ran into. Although I still saw "no such user" Shell$ExitCodeException messages in the logs, these only occurred every few minutes or so. Thus, it seems like this is a reasonable work around until the underlying problem is fixed. I suggest that you file a JIRA ticket, though, as nobody seems to be rushing in here to tell us what we're doing wrong.
Thanks, - Chris On Feb 18, 2014, at 5:54 PM, Chris Schneider wrote: > Hi John, > > My AWS Elastic MapReduce NameNode is also filling its log file with messages > like the following: > > 2014-02-18 23:56:52,344 WARN org.apache.hadoop.security.UserGroupInformation > (IPC Server handler 78 on 9000): No groups available for user > job_201402182309_0073 > 2014-02-18 23:56:52,351 WARN org.apache.hadoop.security.UserGroupInformation > (IPC Server handler 48 on 9000): No groups available for user > job_201402182309_0073 > 2014-02-18 23:56:52,356 WARN org.apache.hadoop.security.UserGroupInformation > (IPC Server handler 38 on 9000): No groups available for user > job_201402182309_0073 > > I ran into this same issue in March 2013 and got past it by using an > m1.xlarge master node (instead of my usual m1.large) when (like right now) I > double my slave count (to 32 cc2.8xlarge instances) to re-import a lot of my > input data. Using that m1.xlarge didn't prevent the NameNode from logging > messages like this, but the beefier instance seemed to weather the load these > messages represented better. > > Unfortunately, even my m1.xlarge master node now seems overwhelmed. The > cluster starts off fine, efficiently mowing through the jobs in my job flow > step for a few hours, but it eventually gets into a mode where the copy phase > of the reduce jobs appear to make no progress at all. At that point, the > NameNode seems to be spending all of its time writing messages like the ones > above. > > The issue doesn't seem to be related to the NameNode JVM size (I tried > increasing it to 4GB before I realized it never used more than ~400MB), nor > dfs.namenode.handler.count (which I increased from 64 to 96). > > We're currently trying to work around the problem by hacking log4j.properties > to set the logging level for org.apache.hadoop.security.UserGroupInformation > to ERROR. We might have to do so for the entire package, as I've also seen > the following in the NameNode logs: > > 2014-02-19 01:01:24,184 WARN > org.apache.hadoop.security.ShellBasedUnixGroupsMapping (IPC Server handler 84 > on 9000): got exception trying to get groups for user job_201402182309_0226 > org.apache.hadoop.util.Shell$ExitCodeException: id: job_201402182309_0226: No > such user > > at org.apache.hadoop.util.Shell.runCommand(Shell.java:255) > at org.apache.hadoop.util.Shell.run(Shell.java:182) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:461) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:444) > at > org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:78) > at > org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:53) > at org.apache.hadoop.security.Groups.getGroups(Groups.java:79) > at > org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1037) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.<init>(FSPermissionChecker.java:50) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5218) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkTraverse(FSNamesystem.java:5201) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:2030) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.getFileInfo(NameNode.java:850) > at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387) > > I would also be very interested in hearing Jakob Homan and Deveraj Das > respond to your analysis of the changes made for MAPREDUCE-1457. > > Please post again with any further information you're able to glean about > this problem. > > Thanks, > > - Chris > > On Jan 8, 2014, at 1:26 PM, Jian Fang wrote: > >> Looked a bit deeper and seems this code was introduced by the following JIRA. >> >> https://issues.apache.org/jira/browse/MAPREDUCE-1457 >> >> There is another related JIRA, i.e., >> https://issues.apache.org/jira/browse/MAPREDUCE-4329. >> >> Perhaps, the warning message is a side effect of JIRA MAPREDUCE-1457 when >> the cluster is running in non-secured mode. There should be some code path >> that caused the job id was treated as user name in task tracker or job >> tracker. Then the job id was passed to HDFS name node. This is definitely a >> big problem since the heavy warning logs alone degraded the system >> performance in a relatively big cluster. >> >> This behavior is very easy to reproduce by simply running terasort on a >> cluster. >> >> Any suggestion to fix this problem? >> >> >> >> >> On Wed, Jan 8, 2014 at 11:18 AM, Jian Fang <jian.fang.subscr...@gmail.com> >> wrote: >> Thanks Vinod for your quick response. It is running in non-secure mode. >> >> I still don't get what is the purpose to use job id in UGI. Could you please >> explain a bit more? >> >> Thanks, >> >> John >> >> >> On Wed, Jan 8, 2014 at 10:11 AM, Vinod Kumar Vavilapalli >> <vino...@hortonworks.com> wrote: >> It just seems like lazy code. You can see that, later, there is this: >> >> {code} >> >> for(Token<?> token : >> UserGroupInformation.getCurrentUser().getTokens()) { >> childUGI.addToken(token); >> } >> >> {code} >> >> So eventually the JobToken is getting added to the UGI which runs task-code. >> >> > WARN org.apache.hadoop.security.UserGroupInformation (IPC Server handler >> > 63 on 9000): No groups available for user job_201401071758_0002 >> >> This seems to be a problem. When the task tries to reach the NameNode, it >> should do so as the user, not the job-id. It is not just logging, I'd be >> surprised if jobs pass. Do you have permissions enabled on HDFS? >> >> Oh, or is this in non-secure mode (i.e. without kerberos)? >> >> +Vinod >> >> >> On Jan 7, 2014, at 5:14 PM, Jian Fang <jian.fang.subscr...@gmail.com> wrote: >> >> > Hi, >> > >> > I looked at Hadoop 1.X source code and found some logic that I could not >> > understand. >> > >> > In the org.apache.hadoop.mapred.Child class, there were two UGIs defined >> > as follows. >> > >> > UserGroupInformation current = UserGroupInformation.getCurrentUser(); >> > current.addToken(jt); >> > >> > UserGroupInformation taskOwner >> > = >> > UserGroupInformation.createRemoteUser(firstTaskid.getJobID().toString()); >> > taskOwner.addToken(jt); >> > >> > But it is the taskOwner that is actually passed as a UGI to task tracker >> > and then to HDFS. The first one was not referenced any where. >> > >> > final TaskUmbilicalProtocol umbilical = >> > taskOwner.doAs(new >> > PrivilegedExceptionAction<TaskUmbilicalProtocol>() { >> > @Override >> > public TaskUmbilicalProtocol run() throws Exception { >> > return >> > (TaskUmbilicalProtocol)RPC.getProxy(TaskUmbilicalProtocol.class, >> > TaskUmbilicalProtocol.versionID, >> > address, >> > defaultConf); >> > } >> > }); >> > >> > What puzzled me is that the job id is actually passed in as the user name >> > to task tracker. On the Name node side, when it tries to map the >> > non-existing user name, i.e., task id, to a group, it always returns empty >> > array. As a result, we always see annoying warning messages such as >> > >> > WARN org.apache.hadoop.security.UserGroupInformation (IPC Server handler >> > 63 on 9000): No groups available for user job_201401071758_0002 >> > >> > Sometimes, the warning messages were thrown so fast, hundreds or even >> > thousands per second for a big cluster, the system performance was >> > degraded dramatically. >> > >> > Could someone please explain why this logic was designed in this way? Any >> > benefit to use non-existing user for the group mapping? Or is this a bug? >> > >> > Thanks in advance, >> > >> > John >> >> >> -- >> CONFIDENTIALITY NOTICE >> NOTICE: This message is intended for the use of the individual or entity to >> which it is addressed and may contain information that is confidential, >> privileged and exempt from disclosure under applicable law. If the reader >> of this message is not the intended recipient, you are hereby notified that >> any printing, copying, dissemination, distribution, disclosure or >> forwarding of this communication is strictly prohibited. If you have >> received this communication in error, please contact the sender immediately >> and delete it from your system. Thank You. ----------------------------------------- Chris Schneider http://www.scaleunlimited.com custom big data solutions -----------------------------------------