[jira] [Commented] (HIVE-15529) LLAP: TaskSchedulerService can get stuck when scheduleTask returns DELAYED_RESOURCES

2017-01-03 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15796633#comment-15796633
 ] 

Sergey Shelukhin commented on HIVE-15529:
-

+1

> LLAP: TaskSchedulerService can get stuck when scheduleTask returns 
> DELAYED_RESOURCES
> 
>
> Key: HIVE-15529
> URL: https://issues.apache.org/jira/browse/HIVE-15529
> Project: Hive
> Issue Type: Bug
> Components: llap
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Priority: Critical
> Attachments: HIVE-15529.1.patch
>
>
> An easier way to simulate the issue:
> 1. Start hive cli with "--hiveconf hive.execution.mode=llap"
> 2. Run a SQL script file (e.g. a SQL script containing tpc-ds queries)
> 3. In the middle of the run, press "ctrl+C", which interrupts the current 
> job. This should not exit the hive cli yet.
> 4. After some time, launch the same SQL script in the same cli. This gets 
> stuck indefinitely (waiting for computing the splits).
> Even when the cli is quit, the AM runs forever until explicitly killed. 
> The issue seems to be around {{LlapTaskSchedulerService::schedulePendingTasks}} 
> dealing with the loop when it encounters {{DELAYED_RESOURCES}} on task 
> scheduling. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15529) LLAP: TaskSchedulerService can get stuck when scheduleTask returns DELAYED_RESOURCES

2017-01-03 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15796620#comment-15796620
 ] 

Rajesh Balamohan commented on HIVE-15529:
-


Yes, the issue is with "PathChildrenCache.getCurrentData(name)". The cache 
internally stores entries as "path, childData", but the code currently passes 
"getNodeIdentity()" as the key, so the lookup always returns null. Hence it 
was not able to re-enable the disabled node.


{noformat}
For example, PathChildrenCache has the following keys and values

Keys: [/user-rbalamohan/llap0/workers/slot-00, 
/user-rbalamohan/llap0/workers/worker-000416]
Values: [ChildData{path='/user-rbalamohan/llap0/workers/slot-00', 
stat=101148,101148,1483486257760,1483486257760,0,0,0,96572092669136166,36,0,101148...

But NodeEnablerCallable ends up looking up the entry by nodeIdentity, hence 
the issue. Here is the log snippet:

2017-01-03 18:33:16,343 [INFO] [LlapSchedulerNodeEnabler] 
|tezplugins.LlapTaskSchedulerService|: Attempting to re-enable node: 
{machine-105:40396, id=f5c2afe6-79bb-4636-8357-6b9158bef4d2, stc=24}
..
2017-01-03 18:33:16,356 [INFO] [LlapSchedulerNodeEnabler] 
|tezplugins.LlapTaskSchedulerService|: Not re-enabling node: 
{machine-105:40396, id=f5c2afe6-79bb-4636-8357-6b9158bef4d2, stc=24}, since it 
is not present in the RegistryActiveNodeList
{noformat}

Patch fixes the issue by checking for nodeIdentity.
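
To make the key mismatch concrete, here is a minimal sketch. It does not use the Curator API: a plain map stands in for the path-keyed cache, and the class name, the helper containsIdentity, and the pairing of this worker path with this node id are assumptions made for illustration only. It shows that a lookup by identity against path keys returns null, while comparing the identity against the stored values finds the node.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy stand-in for a path-keyed child cache. This is NOT the Curator API. */
class PathKeyedCacheSketch {
    // Entries are keyed by the full znode path, as in the key list quoted above.
    private final Map<String, String> childrenByPath = new LinkedHashMap<>();

    void put(String fullPath, String nodeIdentity) {
        childrenByPath.put(fullPath, nodeIdentity);
    }

    /** Keyed lookup: only a full path can match an entry. */
    String getCurrentData(String key) {
        return childrenByPath.get(key);
    }

    /** Hypothetical alternative: scan the stored values for the identity. */
    boolean containsIdentity(String nodeIdentity) {
        return childrenByPath.containsValue(nodeIdentity);
    }

    public static void main(String[] args) {
        PathKeyedCacheSketch cache = new PathKeyedCacheSketch();
        // Assumed pairing of path and id, for illustration only.
        String path = "/user-rbalamohan/llap0/workers/worker-000416";
        String identity = "f5c2afe6-79bb-4636-8357-6b9158bef4d2";
        cache.put(path, identity);

        // The identity is not a path key, so the keyed lookup misses.
        System.out.println(cache.getCurrentData(identity));   // prints "null"
        // Comparing against the stored values finds the node.
        System.out.println(cache.containsIdentity(identity)); // prints "true"
    }
}
```

In the sketch a miss on the keyed lookup is indistinguishable from the node being absent, which matches the "not present in the RegistryActiveNodeList" message in the log above.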



[jira] [Commented] (HIVE-15529) LLAP: TaskSchedulerService can get stuck when scheduleTask returns DELAYED_RESOURCES

2017-01-03 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15795984#comment-15795984
 ] 

Sergey Shelukhin commented on HIVE-15529:
-

How does this patch fix the issue described? Is the problem in the 
getCurrentData call?



[jira] [Commented] (HIVE-15529) LLAP: TaskSchedulerService can get stuck when scheduleTask returns DELAYED_RESOURCES

2017-01-03 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15794981#comment-15794981
 ] 

Hive QA commented on HIVE-15529:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12845369/HIVE-15529.1.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 9 failed/errored test(s), 10898 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233)
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=110)

[tez_joins_explain.q,transform2.q,groupby5.q,cbo_semijoin.q,bucketmapjoin13.q,union_remove_6_subq.q,groupby2_map_multi_distinct.q,load_dyn_part9.q,multi_insert_gby2.q,vectorization_11.q,groupby_position.q,avro_compression_enabled_native.q,smb_mapjoin_8.q,join21.q,auto_join16.q]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] (batchId=61)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] (batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_coalesce] (batchId=75)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=134)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a] (batchId=135)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=93)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_5] (batchId=92)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2763/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2763/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2763/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 9 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12845369 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-15529) LLAP: TaskSchedulerService can get stuck when scheduleTask returns DELAYED_RESOURCES

2017-01-02 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15794069#comment-15794069
 ] 

Rajesh Balamohan commented on HIVE-15529:
-

[~pxiong] - Yes, on task failure the node is put into a disabled state. I will 
debug this further.



[jira] [Commented] (HIVE-15529) LLAP: TaskSchedulerService can get stuck when scheduleTask returns DELAYED_RESOURCES

2017-01-02 Thread Pengcheng Xiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793297#comment-15793297
 ] 

Pengcheng Xiong commented on HIVE-15529:


[~rajesh.balamohan], could this be related to HIVE-15467?
