Hi John,

FWIW, setting the log level of org.apache.hadoop.security.UserGroupInformation 
to ERROR seemed to prevent the fatal NameNode slowdown we ran into. Although I 
still saw "no such user" Shell$ExitCodeException messages in the logs, these 
only occurred every few minutes or so. Thus, it seems like this is a reasonable
workaround until the underlying problem is fixed. I suggest that you file a
JIRA ticket, though, as nobody seems to be rushing in here to tell us what 
we're doing wrong.
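
In case it helps, here's the gist of the log4j.properties change we made (logger
name taken from the messages below; the file's location varies by Hadoop
distribution):

```properties
# Silence the per-request "No groups available for user job_..." WARNs
# that flood the NameNode log. This is a workaround only; the failed
# group lookups still happen underneath.
log4j.logger.org.apache.hadoop.security.UserGroupInformation=ERROR

# If the Shell$ExitCodeException WARNs are also a problem, widen the
# override to the whole package:
# log4j.logger.org.apache.hadoop.security=ERROR
```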

Thanks,

- Chris

On Feb 18, 2014, at 5:54 PM, Chris Schneider wrote:

> Hi John,
> 
> My AWS Elastic MapReduce NameNode is also filling its log file with messages 
> like the following:
> 
> 2014-02-18 23:56:52,344 WARN org.apache.hadoop.security.UserGroupInformation 
> (IPC Server handler 78 on 9000): No groups available for user 
> job_201402182309_0073
> 2014-02-18 23:56:52,351 WARN org.apache.hadoop.security.UserGroupInformation 
> (IPC Server handler 48 on 9000): No groups available for user 
> job_201402182309_0073
> 2014-02-18 23:56:52,356 WARN org.apache.hadoop.security.UserGroupInformation 
> (IPC Server handler 38 on 9000): No groups available for user 
> job_201402182309_0073
> 
> I ran into this same issue in March 2013 and got past it by using an 
> m1.xlarge master node (instead of my usual m1.large) when (like right now) I 
> double my slave count (to 32 cc2.8xlarge instances) to re-import a lot of my 
> input data. Using that m1.xlarge didn't prevent the NameNode from logging 
> messages like this, but the beefier instance seemed to weather the load these 
> messages represented better.
> 
> Unfortunately, even my m1.xlarge master node now seems overwhelmed. The 
> cluster starts off fine, efficiently mowing through the jobs in my job flow 
> step for a few hours, but it eventually gets into a mode where the copy phase 
> of the reduce jobs appears to make no progress at all. At that point, the 
> NameNode seems to be spending all of its time writing messages like the ones 
> above.
> 
> The issue doesn't seem to be related to the NameNode JVM size (I tried 
> increasing it to 4GB before I realized it never used more than ~400MB), nor 
> dfs.namenode.handler.count (which I increased from 64 to 96).
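> 
> For reference, that handler-count change lives in hdfs-site.xml (Hadoop 1.x
> property name; the NameNode heap is typically set via HADOOP_NAMENODE_OPTS
> in hadoop-env.sh):
> 
> ```xml
> <!-- NameNode RPC handler threads; raising 64 -> 96 didn't help here. -->
> <property>
>   <name>dfs.namenode.handler.count</name>
>   <value>96</value>
> </property>
> ```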
> 
> We're currently trying to work around the problem by hacking log4j.properties 
> to set the logging level for org.apache.hadoop.security.UserGroupInformation 
> to ERROR. We might have to do so for the entire package, as I've also seen 
> the following in the NameNode logs:
> 
> 2014-02-19 01:01:24,184 WARN 
> org.apache.hadoop.security.ShellBasedUnixGroupsMapping (IPC Server handler 84 
> on 9000): got exception trying to get groups for user job_201402182309_0226
> org.apache.hadoop.util.Shell$ExitCodeException: id: job_201402182309_0226: No 
> such user
> 
>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
>       at org.apache.hadoop.util.Shell.run(Shell.java:182)
>       at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
>       at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
>       at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
>       at 
> org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:78)
>       at 
> org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:53)
>       at org.apache.hadoop.security.Groups.getGroups(Groups.java:79)
>       at 
> org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1037)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.<init>(FSPermissionChecker.java:50)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5218)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkTraverse(FSNamesystem.java:5201)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:2030)
>       at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.getFileInfo(NameNode.java:850)
>       at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
> 
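> For what it's worth, the failing call in that trace boils down to shelling out
> to the Unix `id` command (roughly `id -Gn <user>`), which exits non-zero when
> the account doesn't exist. A minimal sketch of that behavior (hypothetical
> class name; assumes a Unix-like system with `id` on the PATH):
> 
> ```java
> import java.io.IOException;
> 
> // Sketch of what ShellBasedUnixGroupsMapping effectively does: run
> // "id -Gn <user>" and treat a non-zero exit status as "no such user",
> // which surfaces as Shell$ExitCodeException in the NameNode log.
> public class GroupLookupSketch {
> 
>     // Returns the exit status of "id -Gn <user>": 0 if the account
>     // exists, non-zero (with "no such user" on stderr) otherwise.
>     static int lookupGroupsExitCode(String user)
>             throws IOException, InterruptedException {
>         Process p = new ProcessBuilder("id", "-Gn", user).start();
>         return p.waitFor();
>     }
> 
>     public static void main(String[] args) throws Exception {
>         // A job id is not a Unix account, so this lookup fails every
>         // time the NameNode checks permissions on its behalf.
>         System.out.println("exit: "
>                 + lookupGroupsExitCode("job_201402182309_0226"));
>     }
> }
> ```
> 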
> I would also be very interested in hearing Jakob Homan and Devaraj Das 
> respond to your analysis of the changes made for MAPREDUCE-1457.
> 
> Please post again with any further information you're able to glean about 
> this problem.
> 
> Thanks,
> 
> - Chris
> 
> On Jan 8, 2014, at 1:26 PM, Jian Fang wrote:
> 
>> I looked a bit deeper, and it seems this code was introduced by the following JIRA.
>> 
>> https://issues.apache.org/jira/browse/MAPREDUCE-1457
>> 
>> There is another related JIRA, i.e., 
>> https://issues.apache.org/jira/browse/MAPREDUCE-4329.
>> 
>> Perhaps the warning message is a side effect of MAPREDUCE-1457 when the 
>> cluster is running in non-secured mode. There must be some code path that 
>> causes the job id to be treated as the user name in the task tracker or job 
>> tracker; the job id is then passed on to the HDFS name node. This is a 
>> serious problem, since the heavy warning logging alone degrades system 
>> performance on a relatively big cluster.
>> 
>> This behavior is very easy to reproduce by simply running terasort on a 
>> cluster. 
>> 
>> Any suggestion to fix this problem?
>> 
>> 
>> 
>> 
>> On Wed, Jan 8, 2014 at 11:18 AM, Jian Fang <jian.fang.subscr...@gmail.com> 
>> wrote:
>> Thanks Vinod for your quick response. It is running in non-secure mode.
>> 
>> I still don't understand the purpose of using the job id in the UGI. Could 
>> you please explain a bit more?
>> 
>> Thanks,
>> 
>> John
>> 
>> 
>> On Wed, Jan 8, 2014 at 10:11 AM, Vinod Kumar Vavilapalli 
>> <vino...@hortonworks.com> wrote:
>> It just seems like lazy code. You can see that, later, there is this:
>> 
>> {code}
>> 
>>         for(Token<?> token : 
>> UserGroupInformation.getCurrentUser().getTokens()) {
>>           childUGI.addToken(token);
>>         }
>> 
>> {code}
>> 
>> So eventually the JobToken is getting added to the UGI which runs task-code.
>> 
>> >  WARN org.apache.hadoop.security.UserGroupInformation (IPC Server handler 
>> > 63 on 9000): No groups available for user job_201401071758_0002
>> 
>> This seems to be a problem. When the task tries to reach the NameNode, it 
>> should do so as the user, not the job-id. It is not just a logging problem; I'd be 
>> surprised if jobs pass. Do you have permissions enabled on HDFS?
>> 
>> Oh, or is this in non-secure mode (i.e. without kerberos)?
>> 
>> +Vinod
>> 
>> 
>> On Jan 7, 2014, at 5:14 PM, Jian Fang <jian.fang.subscr...@gmail.com> wrote:
>> 
>> > Hi,
>> >
>> > I looked at Hadoop 1.X source code and found some logic that I could not 
>> > understand.
>> >
>> > In the org.apache.hadoop.mapred.Child class, there were two UGIs defined 
>> > as follows.
>> >
>> >     UserGroupInformation current = UserGroupInformation.getCurrentUser();
>> >     current.addToken(jt);
>> >
>> >     UserGroupInformation taskOwner
>> >      = 
>> > UserGroupInformation.createRemoteUser(firstTaskid.getJobID().toString());
>> >     taskOwner.addToken(jt);
>> >
>> > But it is the taskOwner that is actually passed as a UGI to task tracker 
>> > and then to HDFS. The first one was not referenced anywhere.
>> >
>> >     final TaskUmbilicalProtocol umbilical =
>> >       taskOwner.doAs(new 
>> > PrivilegedExceptionAction<TaskUmbilicalProtocol>() {
>> >         @Override
>> >         public TaskUmbilicalProtocol run() throws Exception {
>> >           return 
>> > (TaskUmbilicalProtocol)RPC.getProxy(TaskUmbilicalProtocol.class,
>> >               TaskUmbilicalProtocol.versionID,
>> >               address,
>> >               defaultConf);
>> >         }
>> >     });
>> >
>> > What puzzled me is that the job id is actually passed in as the user name 
>> > to the task tracker. On the NameNode side, when it tries to map the 
>> > non-existent user name, i.e., the job id, to a group, it always returns an 
>> > empty array. As a result, we always see annoying warning messages such as
>> >
>> >  WARN org.apache.hadoop.security.UserGroupInformation (IPC Server handler 
>> > 63 on 9000): No groups available for user job_201401071758_0002
>> >
>> > Sometimes the warning messages were thrown so fast, hundreds or even 
>> > thousands per second on a big cluster, that system performance 
>> > degraded dramatically.
>> >
>> > Could someone please explain why this logic was designed this way? Is there 
>> > any benefit to using a non-existent user for the group mapping? Or is this a bug?
>> >
>> > Thanks in advance,
>> >
>> > John
>> 
>> 

-----------------------------------------
Chris Schneider
http://www.scaleunlimited.com
custom big data solutions
-----------------------------------------
