[ https://issues.apache.org/jira/browse/HIVE-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16238100#comment-16238100 ]

Jason Dere commented on HIVE-17908:
-----------------------------------

Tests look ok. [~sershe] [~aplusplus], can you review?

> LLAP External client not correctly handling killTask for pending requests
> -------------------------------------------------------------------------
>
>                 Key: HIVE-17908
>                 URL: https://issues.apache.org/jira/browse/HIVE-17908
>             Project: Hive
>          Issue Type: Bug
>          Components: llap
>            Reporter: Jason Dere
>            Assignee: Jason Dere
>            Priority: Major
>         Attachments: HIVE-17908.1.patch, HIVE-17908.2.patch, HIVE-17908.3.patch, HIVE-17908.4.patch, HIVE-17908.5.patch
>
>
> Hitting "Timed out waiting for heartbeat for task ID" errors with the LLAP 
> external client.
> HIVE-17393 fixed some of these errors, however it is also occurring because 
> the client is not correctly handling the killTask notification when the 
> request is accepted but still waiting for the first task heartbeat. In this 
> situation the client should retry the request, similar to what the LLAP AM 
> does. Current logic is ignoring the killTask in this situation, which results 
> in a heartbeat timeout - no heartbeats are sent by LLAP because of the 
> killTask notification.
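> For illustration, here is a minimal sketch of the intended handling; the class name, state machine, and helper methods below are hypothetical stand-ins, not code from the patch:
> {code:java}
> // Hypothetical sketch: react to a killTask notification based on how far
> // the request has progressed. The PENDING_HEARTBEAT case is the bug
> // described above; the other branches are shown only for context.
> public class KillTaskRetrySketch {
>   enum State { PENDING_HEARTBEAT, RUNNING, RETRYING, DONE }
>
>   private final Object lock = new Object();
>   private State state = State.PENDING_HEARTBEAT;
>
>   public void onKillTask(String taskAttemptId) {
>     synchronized (lock) {
>       switch (state) {
>         case PENDING_HEARTBEAT:
>           // LLAP sends no heartbeats after a killTask, so continuing to
>           // wait would end in "Timed out waiting for heartbeat for task ID".
>           // Retry the request instead, as the LLAP AM does.
>           state = State.RETRYING;
>           retryRequest(taskAttemptId);
>           break;
>         case RUNNING:
>           // The task had already started; surface the kill as an error.
>           reportError(taskAttemptId, "Task killed while running");
>           break;
>         default:
>           // Already retrying or finished; nothing to do.
>           break;
>       }
>     }
>   }
>
>   private void retryRequest(String taskAttemptId) { /* resubmit to LLAP */ }
>
>   private void reportError(String taskAttemptId, String msg) { /* notify reader */ }
> }
> {code}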
> {noformat}
> 17/08/09 05:36:02 WARN TaskSetManager: Lost task 10.0 in stage 4.0 (TID 14, cn114-10.l42scl.hortonworks.com, executor 5): java.io.IOException: Received reader event error: Timed out waiting for heartbeat for task ID attempt_7739111832518812959_0005_0_00_000010_0
>         at org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:178)
>         at org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:50)
>         at org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:121)
>         at org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:68)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:266)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:211)
>         at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at org.apache.spark.scheduler.Task.run(Task.scala:99)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: LlapTaskUmbilicalExternalClient(attempt_7739111832518812959_0005_0_00_000010_0): Error while attempting to read chunk length
>         at org.apache.hadoop.hive.llap.io.ChunkedInputStream.read(ChunkedInputStream.java:82)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>         at org.apache.hadoop.hive.llap.LlapBaseRecordReader.hasInput(LlapBaseRecordReader.java:267)
>         at org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:142)
>         ... 22 more
> Caused by: java.net.SocketException: Socket closed
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> {noformat}


