Andrew Or created MAPREDUCE-6116:
------------------------------------

             Summary: Start container with auxiliary service data race condition
                 Key: MAPREDUCE-6116
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6116
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 2.4.0
         Environment: HDP 2.1 on SLES 11
            Reporter: Andrew Or

This shares the same symptoms as MAPREDUCE-2947, which is supposedly fixed. The stack trace I ran into is very similar:

{code}
Exception in thread "ContainerLauncher #1" java.lang.Error: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException
	at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
	at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:224)
	at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:93)
	at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:63)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	... 2 more
Caused by: java.lang.IllegalArgumentException
	at java.nio.Buffer.position(Buffer.java:236)
	at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:147)
	at java.nio.ByteBuffer.get(ByteBuffer.java:694)
	at com.google.protobuf.ByteString.copyFrom(ByteString.java:217)
	at com.google.protobuf.ByteString.copyFrom(ByteString.java:229)
	at org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils.convertToProtoFormat(ProtoUtils.java:196)
	at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.convertToProtoFormat(ContainerLaunchContextPBImpl.java:101)
	at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl$2$1.next(ContainerLaunchContextPBImpl.java:312)
	at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl$2$1.next(ContainerLaunchContextPBImpl.java:300)
	at com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
	at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
	at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$Builder.addAllServiceData(YarnProtos.java:32918)
	at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.addServiceDataToProto(ContainerLaunchContextPBImpl.java:323)
	at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.mergeLocalToBuilder(ContainerLaunchContextPBImpl.java:112)
	at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.mergeLocalToProto(ContainerLaunchContextPBImpl.java:128)
	at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.getProto(ContainerLaunchContextPBImpl.java:70)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.convertToProtoFormat(StartContainerRequestPBImpl.java:156)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.mergeLocalToBuilder(StartContainerRequestPBImpl.java:85)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.mergeLocalToProto(StartContainerRequestPBImpl.java:95)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.getProto(StartContainerRequestPBImpl.java:57)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.convertToProtoFormat(StartContainersRequestPBImpl.java:137)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.addLocalRequestsToProto(StartContainersRequestPBImpl.java:97)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.mergeLocalToBuilder(StartContainersRequestPBImpl.java:79)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.mergeLocalToProto(StartContainersRequestPBImpl.java:72)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.getProto(StartContainersRequestPBImpl.java:48)
	at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:93)
	at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:201)
	... 5 more
{code}

In my application I was calling `ContainerLaunchContext#setServiceData` with my custom shuffle secret. The exception happens frequently but not on every run, which leads me to conjecture that it is a race condition. After seeing MAPREDUCE-2947, I manually synchronized all of my calls to `NMClient#startContainer`, and I have not run into this issue since. I suspect that there is still a race condition in the AuxiliaryService code even after MAPREDUCE-2947.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
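
For reference, the workaround described above can be sketched roughly as follows. This is a minimal illustration using only the JDK, not the actual application code: `SerializedLauncher`, `startSerialized`, and `freshBuffer` are hypothetical names, and the `Supplier` passed in stands in for the real `NMClient#startContainer` call. The second helper rests on an assumption drawn only from the stack trace, which fails inside `ByteBuffer.get` under `ByteString.copyFrom`: if several launcher threads serialize the same `ByteBuffer` instance, they would share its position. Giving each launch context its own buffer over the same bytes would avoid that, but this has not been confirmed as the root cause.

```java
import java.nio.ByteBuffer;
import java.util.function.Supplier;

// Hypothetical helper illustrating the workaround; not part of Hadoop or Spark.
final class SerializedLauncher {
    // One lock shared by every launcher thread in the process.
    private static final Object START_LOCK = new Object();

    private SerializedLauncher() {}

    // Runs the start action (a stand-in for NMClient#startContainer) while
    // holding the shared lock, so only one request is serialized at a time.
    static <T> T startSerialized(Supplier<T> startAction) {
        synchronized (START_LOCK) {
            return startAction.get();
        }
    }

    // Wraps the secret bytes in a new ByteBuffer. Each returned buffer has
    // its own position and limit, so reads on one do not move another.
    // The backing array is shared and must not be modified afterwards.
    static ByteBuffer freshBuffer(byte[] secret) {
        return ByteBuffer.wrap(secret);
    }
}
```

Each launcher thread would then wrap its start call in `SerializedLauncher.startSerialized(...)`, and could pass `SerializedLauncher.freshBuffer(secretBytes)` as the service data value instead of reusing one buffer across containers. The lock trades launch concurrency for safety; the per-container buffer keeps launches parallel but only helps if the shared buffer is in fact the source of the race.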