Chris Drome created HIVE-14344:
----------------------------------
Summary: Intermittent failures caused by leaking delegation tokens
Key: HIVE-14344
URL: https://issues.apache.org/jira/browse/HIVE-14344
Project: Hive
Issue Type: Bug
Components: Tez
Affects Versions: 2.1.0, 1.2.1
Reporter: Chris Drome
Assignee: Chris Drome
We have experienced random job failures caused by leaking delegation tokens.
The Tez child task will fail because it is attempting to read from the
delegation tokens directory of a different (related) task.
Failure results in the following type of stack trace:
{noformat}
2016-07-21 16:57:18,061 [FATAL] [TezChild] |tez.ReduceRecordSource|:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
processing row (tag=0)
at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:249)
at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:362)
at
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738)
at
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.RuntimeException: java.io.IOException: Exception reading
file:/grid/4/tmp/yarn-local/usercache/.../appcache/application_1468602386465_489814/container_e02_1468602386465_489814_01_000001/container_tokens
at
org.apache.hadoop.hive.ql.exec.persistence.RowContainer.first(RowContainer.java:237)
at
org.apache.hadoop.hive.ql.exec.persistence.RowContainer.first(RowContainer.java:74)
at
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:650)
at
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:756)
at
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinObject(CommonMergeJoinOperator.java:316)
at
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:279)
at
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:272)
at
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:258)
at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
... 17 more
Caused by: java.lang.RuntimeException: java.io.IOException: Exception reading
file:/grid/4/tmp/yarn-local/usercache/.../appcache/application_1468602386465_489814/container_e02_1468602386465_489814_01_000001/container_tokens
at
org.apache.hadoop.mapreduce.security.TokenCache.mergeBinaryTokens(TokenCache.java:141)
at
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:119)
at
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
at
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at
org.apache.hadoop.hive.ql.exec.persistence.RowContainer.first(RowContainer.java:222)
... 25 more
Caused by: java.io.IOException: Exception reading
file:/grid/4/tmp/yarn-local/usercache/.../appcache/application_1468602386465_489814/container_e02_1468602386465_489814_01_000001/container_tokens
at
org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:175)
at
org.apache.hadoop.mapreduce.security.TokenCache.mergeBinaryTokens(TokenCache.java:136)
... 32 more
Caused by: java.io.FileNotFoundException: File
file:/grid/4/tmp/yarn-local/usercache/.../appcache/application_1468602386465_489814/container_e02_1468602386465_489814_01_000001/container_tokens
does not exist
at
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
at
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
at
org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:170)
... 33 more
{noformat}
The application that failed was {{application_1468602386465_489844}} while
complaining about
{{appcache/application_1468602386465_489814/container_e02_1468602386465_489814_01_000001/container_tokens}}.
This seems to only manifest via HiveAction through Oozie.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)