[jira] [Commented] (HIVE-17908) LLAP External client not correctly handling killTask for pending requests

Hive QA (JIRA) Sat, 04 Nov 2017 07:31:19 -0700

    [ https://issues.apache.org/jira/browse/HIVE-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239039#comment-16239039 ]


Hive QA commented on HIVE-17908:
--------------------------------



Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12895972/HIVE-17908.6.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 19 failed/errored test(s), 11354 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata] (batchId=62)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[unionDistinct_1] (batchId=146)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] (batchId=156)
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testCliDriver[ct_noperm_loc] (batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] (batchId=111)
org.apache.hadoop.hive.cli.control.TestDanglingQOuts.checkDanglingQOut (batchId=206)
org.apache.hadoop.hive.ql.exec.tez.TestWorkloadManager.testAmPoolInteractions (batchId=281)
org.apache.hadoop.hive.ql.exec.tez.TestWorkloadManager.testApplyPlanQpChanges (batchId=281)
org.apache.hadoop.hive.ql.exec.tez.TestWorkloadManager.testApplyPlanUserMapping (batchId=281)
org.apache.hadoop.hive.ql.exec.tez.TestWorkloadManager.testAsyncSessionInitFailures (batchId=281)
org.apache.hadoop.hive.ql.exec.tez.TestWorkloadManager.testClusterFractions (batchId=281)
org.apache.hadoop.hive.ql.exec.tez.TestWorkloadManager.testDestroyAndReturn (batchId=281)
org.apache.hadoop.hive.ql.exec.tez.TestWorkloadManager.testQueueing (batchId=281)
org.apache.hadoop.hive.ql.exec.tez.TestWorkloadManager.testReopen (batchId=281)
org.apache.hadoop.hive.ql.exec.tez.TestWorkloadManager.testReuse (batchId=281)
org.apache.hadoop.hive.ql.exec.tez.TestWorkloadManager.testReuseWithDifferentPool (batchId=281)
org.apache.hadoop.hive.ql.exec.tez.TestWorkloadManager.testReuseWithQueueing (batchId=281)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints (batchId=223)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/7636/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/7636/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-7636/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 19 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12895972 - PreCommit-HIVE-Build

> LLAP External client not correctly handling killTask for pending requests
> -------------------------------------------------------------------------
>
>                 Key: HIVE-17908
>                 URL: https://issues.apache.org/jira/browse/HIVE-17908
>             Project: Hive
>          Issue Type: Bug
>          Components: llap
>            Reporter: Jason Dere
>            Assignee: Jason Dere
>            Priority: Major
>         Attachments: HIVE-17908.1.patch, HIVE-17908.2.patch, 
> HIVE-17908.3.patch, HIVE-17908.4.patch, HIVE-17908.5.patch, HIVE-17908.6.patch
>
>
> Hitting "Timed out waiting for heartbeat for task ID" errors with the LLAP
> external client.
> HIVE-17393 fixed some of these errors; however, the error also occurs because
> the client does not correctly handle the killTask notification when the
> request has been accepted but is still waiting for the first task heartbeat.
> In this situation the client should retry the request, similar to what the
> LLAP AM does. The current logic ignores the killTask in this situation, which
> results in a heartbeat timeout: LLAP sends no heartbeats because of the
> killTask notification.
> {noformat}
> 17/08/09 05:36:02 WARN TaskSetManager: Lost task 10.0 in stage 4.0 (TID 14, cn114-10.l42scl.hortonworks.com, executor 5): java.io.IOException: Received reader event error: Timed out waiting for heartbeat for task ID attempt_7739111832518812959_0005_0_00_000010_0
>         at org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:178)
>         at org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:50)
>         at org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:121)
>         at org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:68)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:266)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:211)
>         at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at org.apache.spark.scheduler.Task.run(Task.scala:99)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: LlapTaskUmbilicalExternalClient(attempt_7739111832518812959_0005_0_00_000010_0): Error while attempting to read chunk length
>         at org.apache.hadoop.hive.llap.io.ChunkedInputStream.read(ChunkedInputStream.java:82)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>         at org.apache.hadoop.hive.llap.LlapBaseRecordReader.hasInput(LlapBaseRecordReader.java:267)
>         at org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:142)
>         ... 22 more
> Caused by: java.net.SocketException: Socket closed
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> {noformat}
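
The behaviour the issue description asks for (retry on killTask while the request is accepted but has not yet sent its first heartbeat) can be illustrated with a small state machine. This is only a sketch under stated assumptions, not the code in the attached patches: the class name, method names, the retry limit, and the choice to fail a request that is killed after it has started running are all hypothetical and do not come from the Hive code base.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names exist in Hive.
class PendingRequestSketch {
    // PENDING: accepted by the daemon, first heartbeat not yet seen.
    // RUNNING: at least one heartbeat received.
    // FAILED:  reported to the caller as an error.
    enum State { PENDING, RUNNING, FAILED }

    private State state = State.PENDING;
    private int submitCount = 1;          // the initial submission counts as 1
    private final int maxSubmits;         // assumed retry limit, not from the source
    final List<String> log = new ArrayList<>();

    PendingRequestSketch(int maxSubmits) {
        this.maxSubmits = maxSubmits;
    }

    // First heartbeat moves the request from PENDING to RUNNING.
    void onHeartbeat() {
        if (state == State.PENDING) {
            state = State.RUNNING;
            log.add("running");
        }
    }

    // killTask while still PENDING: resubmit instead of ignoring the
    // notification and waiting for a heartbeat that will never arrive.
    // killTask in any other live state: fail the request (an assumption).
    void onKillTask() {
        if (state == State.PENDING && submitCount < maxSubmits) {
            submitCount++;
            log.add("resubmit");
        } else if (state != State.FAILED) {
            state = State.FAILED;
            log.add("failed");
        }
    }

    State state() {
        return state;
    }

    int submitCount() {
        return submitCount;
    }
}
```

In this sketch a killTask received before the first heartbeat leaves the request in PENDING with an incremented submit count, which models the retry; ignoring the notification instead would leave the client waiting until the heartbeat timeout fires, as in the log above.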



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
