[jira] [Commented] (TEZ-3980) ShuffleRunner: the wake loop needs to check for shutdown

2018-08-27 Thread Gunther Hagleitner (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594400#comment-16594400
 ] 

Gunther Hagleitner commented on TEZ-3980:
-----------------------------------------

+1

> ShuffleRunner: the wake loop needs to check for shutdown
> --------------------------------------------------------
>
> Key: TEZ-3980
> URL: https://issues.apache.org/jira/browse/TEZ-3980
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Gopal V
>Assignee: Gopal V
>Priority: Major
> Attachments: TEZ-3980.1.patch
>
>
> In the ShuffleRunner threads, there's a loop which does not terminate if the 
> task threads get killed.
> {code}
>   while ((runningFetchers.size() >= numFetchers || 
> pendingHosts.isEmpty())
>   && numCompletedInputs.get() < numInputs) {
> inputContext.notifyProgress();
> boolean ret = wakeLoop.await(1000, TimeUnit.MILLISECONDS);
>   }
> {code}
> The wakeLoop signal does not break the task out of this loop, and the loop is 
> missing a shutdown check.
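
A minimal sketch of the missing check, assuming the surrounding ShuffleRunner
exposes a shutdown flag (the isShutdown AtomicBoolean below is an illustrative
assumption, not code from the attached patch):

{code}
// Sketch: same wait loop, but a wake-up during shutdown now exits the loop.
// isShutdown is assumed to be an AtomicBoolean set by the shutdown path.
while ((runningFetchers.size() >= numFetchers || pendingHosts.isEmpty())
    && numCompletedInputs.get() < numInputs) {
  inputContext.notifyProgress();
  wakeLoop.await(1000, TimeUnit.MILLISECONDS);
  if (isShutdown.get()) {
    break; // without this, a signal on wakeLoop just re-tests the guard forever
  }
}
{code}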



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3960) Better error handling in proto history logger and add doAs support.

2018-06-29 Thread Gunther Hagleitner (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16527928#comment-16527928
 ] 

Gunther Hagleitner commented on TEZ-3960:
-----------------------------------------

LGTM +1

> Better error handling in proto history logger and add doAs support.
> --------------------------------------------------------------------
>
> Key: TEZ-3960
> URL: https://issues.apache.org/jira/browse/TEZ-3960
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
>Priority: Major
> Fix For: 0.10.0
>
> Attachments: TEZ-3960.01.patch, TEZ-3960.02.patch
>
>
> DagManifestScanner gets stuck on a day's logs if there are errors in them. 
> Fix it using a fixed number of retries.
> The scanner should also be able to use doAs, to ensure it can read the files 
> when run as a proxy admin user.
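
A rough sketch of the two behaviors the description asks for - bounded retries
plus a doAs wrapper - using the standard Hadoop UserGroupInformation API; the
class shape, helper method, and retry count are illustrative assumptions, not
the actual patch:

{code}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

public class ManifestScanSketch {
  private static final int MAX_RETRIES = 3; // assumed fixed retry budget

  // Run the scan as the DAG's owner via a proxy user, retrying a fixed
  // number of times instead of getting stuck on a bad day's logs.
  void scanWithRetries(String dagUser) throws Exception {
    UserGroupInformation ugi = UserGroupInformation.createProxyUser(
        dagUser, UserGroupInformation.getLoginUser());
    for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
      try {
        ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
          scanOneDay(); // hypothetical helper standing in for the manifest scan
          return null;
        });
        return; // success: stop retrying
      } catch (Exception e) {
        if (attempt == MAX_RETRIES) {
          throw e; // retry budget exhausted: surface the error and move on
        }
      }
    }
  }

  void scanOneDay() { /* placeholder for the real scan */ }
}
{code}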



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3915) Create protobuf based history event logger.

2018-04-19 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444345#comment-16444345
 ] 

Gunther Hagleitner commented on TEZ-3915:
-----------------------------------------

LGTM +1

> Create protobuf based history event logger.
> -------------------------------------------
>
> Key: TEZ-3915
> URL: https://issues.apache.org/jira/browse/TEZ-3915
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
>Priority: Major
> Fix For: 0.9.next
>
> Attachments: TEZ-3915.01.patch, TEZ-3915.02.patch, TEZ-3915.03.patch, 
> TEZ-3915.04.patch, TEZ-3915.05.patch, TEZ-3915.06.patch, TEZ-3915.07.patch
>
>
> A protobuf-based history event logger that logs directly into HDFS. Also 
> implement a reader API to get the events back from the files.
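
The general pattern such a logger follows - length-delimited protobuf records
written to an HDFS stream, read back until end of file - might look like this
(the HistoryEventProto message, its eventType field, and the path are
assumptions for illustration):

{code}
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path logFile = new Path("/tez/history/dag_1.pb");

// Writer: one length-prefixed record per event.
try (FSDataOutputStream out = fs.create(logFile)) {
  HistoryEventProto event =
      HistoryEventProto.newBuilder().setEventType("DAG_SUBMITTED").build();
  event.writeDelimitedTo(out);
}

// Reader: parseDelimitedFrom returns null at end of stream.
try (FSDataInputStream in = fs.open(logFile)) {
  HistoryEventProto event;
  while ((event = HistoryEventProto.parseDelimitedFrom(in)) != null) {
    System.out.println(event.getEventType());
  }
}
{code}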



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3880) do not count rejected tasks as killed in vertex progress

2018-01-10 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320472#comment-16320472
 ] 

Gunther Hagleitner commented on TEZ-3880:
-----------------------------------------

Looks good to me now: +1

> do not count rejected tasks as killed in vertex progress
> ---------------------------------------------------------
>
> Key: TEZ-3880
> URL: https://issues.apache.org/jira/browse/TEZ-3880
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3880.01.patch, TEZ-3880.02.patch, TEZ-3880.patch
>
>
> Tasks rejected from LLAP because the cluster is full are shown as killed 
> tasks in the command-line query UI (CLI and beeline). This shouldn't really 
> happen; a killed task in the container case means something else, and this 
> scenario doesn't exist there because the AM doesn't continuously try to queue 
> tasks. We could change the LLAP queue to use a sort of pull model (which would 
> also allow for better duplicate scheduling), but for now we should fix the UI.
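
A sketch of the UI-side distinction being asked for; the termination-cause
constant is an assumption (the actual enum value used by the patch isn't
quoted here):

{code}
// Illustrative only: report rejected attempts separately instead of
// folding them into the "killed" count shown by CLI/beeline.
int killed = 0;
int rejected = 0;
for (TaskAttempt attempt : attempts) {
  if (attempt.getState() == TaskAttemptState.KILLED) {
    if (attempt.getTerminationCause() == TaskAttemptTerminationCause.SERVICE_BUSY) {
      rejected++; // cluster-full rejection: not a real kill
    } else {
      killed++;
    }
  }
}
{code}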



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3880) do not count rejected tasks as killed in vertex progress

2018-01-05 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314059#comment-16314059
 ] 

Gunther Hagleitner commented on TEZ-3880:
-----------------------------------------

There's a comment in TaskAttemptTerminationCause that references LLAP; I think 
that shouldn't be committed. I also don't know why this patch is calling into 
question whether INTERRUPTED_BY_SYSTEM is used or not. Can you add a test for 
the new behavior?

> do not count rejected tasks as killed in vertex progress
> ---------------------------------------------------------
>
> Key: TEZ-3880
> URL: https://issues.apache.org/jira/browse/TEZ-3880
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3880.patch
>
>
> Tasks rejected from LLAP because the cluster is full are shown as killed 
> tasks in the command-line query UI (CLI and beeline). This shouldn't really 
> happen; a killed task in the container case means something else, and this 
> scenario doesn't exist there because the AM doesn't continuously try to queue 
> tasks. We could change the LLAP queue to use a sort of pull model (which would 
> also allow for better duplicate scheduling), but for now we should fix the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (TEZ-3405) Support ability for AM to kill itself if there is no client heartbeating to it

2016-08-09 Thread Gunther Hagleitner (JIRA)
Gunther Hagleitner created TEZ-3405:
-----------------------------------

 Summary: Support ability for AM to kill itself if there is no 
client heartbeating to it
 Key: TEZ-3405
 URL: https://issues.apache.org/jira/browse/TEZ-3405
 Project: Apache Tez
  Issue Type: Bug
Reporter: Gunther Hagleitner
Priority: Critical


HiveServer2 optionally maintains a pool of AMs in either Tez or LLAP mode. This 
is done to amortize the cost of launching a Tez session.

We also try in a shutdown hook to kill all these AMs when HS2 goes down. 
However, there are cases where HS2 doesn't get the chance to kill these AMs 
before it goes away. As a result these zombie AMs hang around until the timeout 
kicks in.

The trouble with the timeout is that we have to set it fairly high. Otherwise 
the benefit of having pre-launched AMs obviously goes away (in a lightly loaded 
cluster).

So, if people kill/restart HS2, they often run into situations where the 
cluster/queue doesn't have any more capacity for AMs. They either have to 
manually kill the zombies or wait.

The request is therefore for Tez to maintain a heartbeat to the client. If the 
client goes away the AM should exit. That way we can keep the AMs alive for a 
long time regardless of activity and at the same time don't have to worry about 
them if HS2 goes down.
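
A minimal sketch of the requested watchdog, assuming a hypothetical
lastClientHeartbeat timestamp (an AtomicLong) that the client RPC path updates
on every ping:

{code}
// Hypothetical AM-side watchdog: exit if the client stops heartbeating.
ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
final long timeoutMs = 60_000L; // illustrative timeout
watchdog.scheduleAtFixedRate(() -> {
  long idleMs = System.currentTimeMillis() - lastClientHeartbeat.get();
  if (idleMs > timeoutMs) {
    LOG.info("No client heartbeat for " + idleMs + " ms, shutting down AM");
    shutdownAM(); // hypothetical hook into the AM's normal shutdown path
  }
}, timeoutMs, timeoutMs, TimeUnit.MILLISECONDS);
{code}

With this in place the pool timeout could stay high (or go away entirely)
without leaving zombie AMs behind when HS2 dies.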



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3353) Tez should adjust processor memory on demand

2016-07-18 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15383269#comment-15383269
 ] 

Gunther Hagleitner commented on TEZ-3353:
-----------------------------------------

cc [~hitesh], [~sseth]

> Tez should adjust processor memory on demand
> --------------------------------------------
>
> Key: TEZ-3353
> URL: https://issues.apache.org/jira/browse/TEZ-3353
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Wei Zheng
>
> Hive requests some amount of memory for its map join, which sometimes is not 
> allocated as much as expected. Tez should make an adjustment and satisfy that 
> request as is.
> This is related to HIVE-13934.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3353) Tez should adjust processor memory on demand

2016-07-18 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15383268#comment-15383268
 ] 

Gunther Hagleitner commented on TEZ-3353:
-----------------------------------------

The request is to get a proper API. Setting 
"tez.task.scale.memory.reserve-fraction" based on a computation that has to 
involve various internal Tez variables (the fraction of Xmx to container size, 
the fraction reserved by default for the processor, etc.) is just bogus. Hive 
should just be able to tell Tez that it needs 500 MB, and Tez can lay out the 
memory as it sees fit with this in mind.
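
To make the complaint concrete, here is the kind of backwards computation Hive
is forced into today (the numbers are illustrative, not from the issue):

{code}
// To guarantee the processor ~500 MB, Hive has to invert Tez's internal
// memory layout math instead of just asking for 500 MB.
long containerMb = 4096;                      // YARN container size
double xmxFraction = 0.8;                     // assumed container-to-Xmx fraction
long xmxMb = (long) (containerMb * xmxFraction);
long neededByProcessorMb = 500;               // what the map join actually wants
double reserveFraction = (double) neededByProcessorMb / xmxMb; // ~0.15
conf.setDouble("tez.task.scale.memory.reserve-fraction", reserveFraction);
{code}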

> Tez should adjust processor memory on demand
> --------------------------------------------
>
> Key: TEZ-3353
> URL: https://issues.apache.org/jira/browse/TEZ-3353
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Wei Zheng
>
> Hive requests some amount of memory for its map join, which sometimes is not 
> allocated as much as expected. Tez should make an adjustment and satisfy that 
> request as is.
> This is related to HIVE-13934.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-1962) Running out of threads in tez local mode

2015-01-14 Thread Gunther Hagleitner (JIRA)
Gunther Hagleitner created TEZ-1962:
-----------------------------------

 Summary: Running out of threads in tez local mode
 Key: TEZ-1962
 URL: https://issues.apache.org/jira/browse/TEZ-1962
 Project: Apache Tez
  Issue Type: Bug
Reporter: Gunther Hagleitner
Priority: Critical


I've been trying to port the Hive unit tests to Tez local mode. However, local 
mode seems to leak threads, which causes tests to crash after a while (OOM). See 
the attached stack trace - there are a lot of TezChild threads still hanging 
around.

([~sseth] as discussed offline)
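
A quick diagnostic for this kind of leak (not part of any patch): count live
TezChild threads between DAGs and watch whether the number keeps growing.

{code}
// Count threads whose name starts with "TezChild" after a local-mode run.
int tezChildThreads = 0;
for (Thread t : Thread.getAllStackTraces().keySet()) {
  if (t.getName().startsWith("TezChild")) {
    tezChildThreads++;
  }
}
System.out.println("Live TezChild threads: " + tezChildThreads);
// A count that grows across successive DAGs indicates the leak described above.
{code}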




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-529) Hive communicates state from RecordReader to Processor via JobConf

2014-12-21 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255333#comment-14255333
 ] 

Gunther Hagleitner commented on TEZ-529:


No, there's a workaround in place that doesn't rely on map.input.file afaik. 
This can be closed I think - [~sseth]?

 Hive communicates state from RecordReader to Processor via JobConf
 -------------------------------------------------------------------

 Key: TEZ-529
 URL: https://issues.apache.org/jira/browse/TEZ-529
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Gunther Hagleitner
Assignee: Siddharth Seth

 Hive currently switches between operator pipelines + partition descriptors 
 via map.input.file.
 In the CombineFileInputFormat case Hive relies on the fact that 
 CombineFileRecordReader sets this field every time a new file is processed. 
 The field is then read in the processor to set up the correct processing 
 pipeline.
 After the Tez refactor, the RecordReader and TezProcessor use different JobConf 
 instances. Because of that, Hive fails since map.input.file isn't set and 
 updated in the processor's conf.
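
The coupling described above looks roughly like this; both sides only work if
they see the same JobConf instance (a sketch with illustrative helper names):

{code}
// RecordReader side: CombineFileRecordReader updates the conf as each file starts.
jobConf.set("map.input.file", currentSplitPath.toString());

// Processor side: Hive reads the same key to pick the operator pipeline.
String inputFile = jobConf.get("map.input.file");
MapWork work = partitionDescFor(inputFile); // hypothetical lookup
// After the Tez refactor the reader and TezProcessor hold different JobConf
// instances, so the processor's get() no longer observes the reader's set().
{code}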



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1805) Dag gets stuck (hive tests, mini cluster)

2014-11-28 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228141#comment-14228141
 ] 

Gunther Hagleitner commented on TEZ-1805:
-----------------------------------------

Found the issue - definitely Hive bug. Thanks!

 Dag gets stuck (hive tests, mini cluster)
 -----------------------------------------

 Key: TEZ-1805
 URL: https://issues.apache.org/jira/browse/TEZ-1805
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.6.0, 0.5.2
Reporter: Gunther Hagleitner
 Attachments: dag_1417137410462_0002_4.pdf, stuck-dag-logs.tar.gz


 There's a test in the hive suite that gets stuck and I'm not sure what's 
 causing it.
 Repro:
 (In hive tree: https://github.com/apache/hive)
 mvn clean install -DskipTests -Phadoop-2 && cd itests && mvn clean install 
 -DskipTests -Phadoop-2
 then:
 mvn test -Dtest=TestMiniTezCliDriver -Phadoop-2 -Dqfile=lvj_mapjoin.q
 I'll attach logs and stack traces. It seems application 
 application_1417137410462_0002 got stuck in that run. The only exception I saw is:
 {noformat}
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
 /Users/ghagleitner/Projects/hive-trunk2/itests/qtest/target/tmp/scratchdir/ghagleitner/_tez_session_dir/dc4fca20-4a39-4452-975a-467bda4947ca/.tez/application_1417137410462_0001/recovery/1/summary 
 (inode 16430): File does not exist. Holder 
 DFSClient_NONMAPREDUCE_1900574341_1 does not have any open files.
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3170)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3140)
   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:665)
   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:499)
   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:394)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 {noformat}

[jira] [Updated] (TEZ-1805) Dag gets stuck (hive tests, mini cluster)

2014-11-27 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated TEZ-1805:

Attachment: stuck-dag-logs.tar.gz

 Dag gets stuck (hive tests, mini cluster)
 -----------------------------------------

 Key: TEZ-1805
 URL: https://issues.apache.org/jira/browse/TEZ-1805
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.6.0, 0.5.2
Reporter: Gunther Hagleitner
 Attachments: stuck-dag-logs.tar.gz


 There's a test in the hive suite that gets stuck and I'm not sure what's 
 causing it.
 Repro:
 (In hive tree: https://github.com/apache/hive)
 mvn clean install -DskipTests -Phadoop-2 && cd itests && mvn clean install 
 -DskipTests -Phadoop-2
 then:
 mvn test -Dtest=TestMiniTezCliDriver -Phadoop-2 -Dqfile=lvj_mapjoin.q
 I'll attach logs and stack traces. It seems application 
 application_1417137410462_0002 got stuck in that run. The only exception I saw is:
 {noformat}
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
 /Users/ghagleitner/Projects/hive-trunk2/itests/qtest/target/tmp/scratchdir/ghagleitner/_tez_session_dir/dc4fca20-4a39-4452-975a-467bda4947ca/.tez/application_1417137410462_0001/recovery/1/summary 
 (inode 16430): File does not exist. Holder 
 DFSClient_NONMAPREDUCE_1900574341_1 does not have any open files.
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3170)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3140)
   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:665)
   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:499)
   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:394)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 {noformat}

[jira] [Commented] (TEZ-1805) Dag gets stuck (hive tests, mini cluster)

2014-11-27 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228116#comment-14228116
 ] 

Gunther Hagleitner commented on TEZ-1805:
-----------------------------------------

Thanks guys - that really helps. I know what to look for now. As [~bikassaha] 
says - it'd be helpful to have this error out.

 Dag gets stuck (hive tests, mini cluster)
 -----------------------------------------

 Key: TEZ-1805
 URL: https://issues.apache.org/jira/browse/TEZ-1805
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.6.0, 0.5.2
Reporter: Gunther Hagleitner
 Attachments: dag_1417137410462_0002_4.pdf, stuck-dag-logs.tar.gz


 There's a test in the hive suite that gets stuck and I'm not sure what's 
 causing it.
 Repro:
 (In hive tree: https://github.com/apache/hive)
 mvn clean install -DskipTests -Phadoop-2 && cd itests && mvn clean install 
 -DskipTests -Phadoop-2
 then:
 mvn test -Dtest=TestMiniTezCliDriver -Phadoop-2 -Dqfile=lvj_mapjoin.q
 I'll attach logs and stack traces. It seems application 
 application_1417137410462_0002 got stuck in that run. The only exception I saw is:
 {noformat}
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
 /Users/ghagleitner/Projects/hive-trunk2/itests/qtest/target/tmp/scratchdir/ghagleitner/_tez_session_dir/dc4fca20-4a39-4452-975a-467bda4947ca/.tez/application_1417137410462_0001/recovery/1/summary 
 (inode 16430): File does not exist. Holder 
 DFSClient_NONMAPREDUCE_1900574341_1 does not have any open files.
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3170)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3140)
   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:665)
   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:499)
   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:394)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 {noformat}

[jira] [Commented] (TEZ-1635) Dag gets stuck intermittently

2014-10-06 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160664#comment-14160664
 ] 

Gunther Hagleitner commented on TEZ-1635:
-----------------------------------------

[~vikram.dixit] can you answer [~rajesh.balamohan]'s question?

 Dag gets stuck intermittently
 -----------------------------

 Key: TEZ-1635
 URL: https://issues.apache.org/jira/browse/TEZ-1635
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Vikram Dixit K
Priority: Blocker
 Attachments: Screen Shot 2014-10-05 at 9.46.31 AM.png, 
 syslog_dag_1412109415326_0002_10.gz, tez_smb_1_hung_job.log, 
 tez_smb_1_successful_job.log


 Attaching logs for the dag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1635) Dag gets stuck intermittently

2014-10-04 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159002#comment-14159002
 ] 

Gunther Hagleitner commented on TEZ-1635:
-----------------------------------------

tez_smb_1.q is consistently failing for me (latest trunk).

 Dag gets stuck intermittently
 -----------------------------

 Key: TEZ-1635
 URL: https://issues.apache.org/jira/browse/TEZ-1635
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Vikram Dixit K
Priority: Blocker
 Attachments: syslog_dag_1412109415326_0002_10.gz


 Attaching logs for the dag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1635) Dag gets stuck intermittently

2014-10-03 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158601#comment-14158601
 ] 

Gunther Hagleitner commented on TEZ-1635:
-----------------------------------------

{noformat}
"TezChild" daemon prio=5 tid=7fc9ad1a6000 nid=0x112c82000 waiting on condition [112c8]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <7f3b53b68> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
    at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
    at org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83)
    at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:104)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:161)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:163)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:142)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:394)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:172)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:167)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:695)
{noformat}

 Dag gets stuck intermittently
 -----------------------------

 Key: TEZ-1635
 URL: https://issues.apache.org/jira/browse/TEZ-1635
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Vikram Dixit K
Priority: Blocker
 Attachments: syslog_dag_1412109415326_0002_10.gz


 Attaching logs for the dag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1447) Handle parallelism updates and versioning w/ custom InputInitializerEvents

2014-08-27 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111986#comment-14111986
 ] 

Gunther Hagleitner commented on TEZ-1447:
-----------------------------------------

I'm with Sid. The feature was delivered in 0.5.0. This is fixing it.

 Handle parallelism updates and versioning w/ custom InputInitializerEvents
 ---------------------------------------------------------------------------

 Key: TEZ-1447
 URL: https://issues.apache.org/jira/browse/TEZ-1447
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Gunther Hagleitner
Assignee: Siddharth Seth
Priority: Blocker

 I'm trying to do dynamic partition pruning through input initializer events 
 in Hive. That means that the initializer of a table scan vertex has to 
 receive events from all tasks in another vertex (which contain the pruning 
 info) before generating tasks to run.
 The problems with the current API I ran into:
 getNumTasks: I'm currently using a busy loop to wait for the number of tasks 
 for a vertex to be decided (-1 -> x). There's no way around it, because it's 
 the only way to find out what number of events to expect (0 is a valid number 
 of tasks - so I can't wait for the first to complete).
 With auto-reducer parallelism I have to employ another busy loop, because I 
 might be initially expecting 10 events, which later gets knocked down to 5. 
 Since there's no event associated with this, I have to periodically check 
 whether I have enough events.
 Versioning: Events have a version number, but I don't know which task they 
 are coming from. Thus I can't de-dup events.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (TEZ-1486) TezUncheckedException when using dynamic partition pruning

2014-08-22 Thread Gunther Hagleitner (JIRA)
Gunther Hagleitner created TEZ-1486:
-----------------------------------

 Summary: TezUncheckedException when using dynamic partition pruning
 Key: TEZ-1486
 URL: https://issues.apache.org/jira/browse/TEZ-1486
 Project: Apache Tez
  Issue Type: Bug
Reporter: Gunther Hagleitner
Assignee: Siddharth Seth


I'm working on using the AM event mechanism to dynamically prune partitions at 
DAG runtime for certain queries. The query is:

select count(*) from srcpart join srcpart_double_hour on (srcpart.hr*2 = 
srcpart_double_hour.hr) where srcpart_double_hour.hour = 11;

This will result in two vertices connected through a broadcast edge. The first 
vertex prepares two things: the list of partition keys (hr) that is sent to the 
AM for dynamic pruning, and the records to be used in the hash join.

The second vertex will block until all events are received (in the initializer); 
then it will load and process the hash join.

It's possible for queries like this to result in zero splits on the second 
vertex (i.e., no matching rows for the join).

The exception I get when this is run is:

{noformat}
org.apache.tez.dag.api.TezUncheckedException: Event must be routed. sourceVertex=vertex_1408686217936_0003_3_00 srcIndex = 0 destAttemptId=vertex_1408686217936_0003_3_01 edgeManager=org.apache.tez.dag.app.dag.impl.BroadcastEdgeManager Event type=DATA_MOVEMENT_EVENT
  at org.apache.tez.dag.app.dag.impl.Edge.sendTezEventToDestinationTasks(Edge.java:371)
  at org.apache.tez.dag.app.dag.impl.VertexImpl$RouteEventTransition.transition(VertexImpl.java:3372)
  at org.apache.tez.dag.app.dag.impl.VertexImpl.scheduleTasks(VertexImpl.java:1088)
  at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerPluginContextImpl.scheduleVertexTasks(VertexManager.java:111)
  at org.apache.tez.dag.app.dag.impl.ImmediateStartVertexManager.onVertexStarted(ImmediateStartVertexManager.java:49)
  at org.apache.tez.dag.app.dag.impl.VertexManager.onVertexStarted(VertexManager.java:244)
  at org.apache.tez.dag.app.dag.impl.VertexImpl.startVertex(VertexImpl.java:2923)
  at org.apache.tez.dag.app.dag.impl.VertexImpl.access$5900(VertexImpl.java:169)
  at org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:2914)
  at org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:2906)
  at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
  at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
  at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
  at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
  at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1355)
  at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:168)
  at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1650)
  at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1636)
  at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
  at java.lang.Thread.run(Thread.java:695)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (TEZ-1447) Handle parallelism updates and versioning w/ custom InputInitializerEvents

2014-08-18 Thread Gunther Hagleitner (JIRA)
Gunther Hagleitner created TEZ-1447:
-----------------------------------

 Summary: Handle parallelism updates and versioning w/ custom 
InputInitializerEvents
 Key: TEZ-1447
 URL: https://issues.apache.org/jira/browse/TEZ-1447
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Gunther Hagleitner
Priority: Blocker
 Fix For: 0.5.0


I'm trying to do dynamic partition pruning through input initializer events in 
Hive. That means that the initializer of a table scan vertex has to receive 
events from all tasks in another vertex (which contain the pruning info) before 
generating tasks to run.

The problems with the current API I ran into:

getNumTasks: I'm currently using a busy loop to wait for the number of tasks for 
a vertex to be decided (-1 -> x). There's no way around it, because it's the only 
way to find out what number of events to expect (0 is a valid number of tasks - 
so I can't wait for the first to complete).

With auto-reducer parallelism I have to employ another busy loop, because I 
might be initially expecting 10 events, which later gets knocked down to 5. 
Since there's no event associated with this, I have to periodically check 
whether I have enough events.

Versioning: Events have a version number, but I don't know which task they are 
coming from. Thus I can't de-dup events.
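
The busy loop described above, in InputInitializer terms (a sketch of the
workaround being criticized, with the polling interval as an assumption):

{code}
// Poll until the source vertex's parallelism is decided (-1 means "not yet").
// There is no event to wait on, hence the sleep loop.
int numTasks = -1;
while (numTasks < 0) {
  numTasks = context.getVertexNumTasks(sourceVertexName);
  if (numTasks < 0) {
    Thread.sleep(100); // throws InterruptedException; handling omitted in sketch
  }
}
// Even after this, auto-reducer parallelism can later lower numTasks,
// forcing a second poll to decide when "enough" events have arrived.
{code}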



--
This message was sent by Atlassian JIRA
(v6.2#6252)