[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-31 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388153#comment-14388153
 ] 

Cyrille Chépélov commented on TEZ-2224:
---

Hello [~jeffzhang], branch-0.6 now fails to build; apparently due to this patch

{preformat}
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on 
project tez-dag: Compilation failure: Compilation failure:
[ERROR] 
/home/cchepelov/workspace/3rd-party/tez/tez-dag/src/main/java/org/apache/tez/dag/history/recovery/RecoveryService.java:[39,30]
 cannot find symbol
[ERROR] symbol:   class ConfigurationScope
[ERROR] location: package org.apache.tez.dag.api
[ERROR] 
/home/cchepelov/workspace/3rd-party/tez/tez-dag/src/main/java/org/apache/tez/dag/history/recovery/RecoveryService.java:[40,30]
 cannot find symbol
[ERROR] symbol:   class Scope
[ERROR] location: package org.apache.tez.dag.api
[ERROR] -> [Help 1]

$ find . -name "*.java" |xargs grep ConfigurationScope
./tez-dag/src/main/java/org/apache/tez/dag/history/recovery/RecoveryService.java:import
 org.apache.tez.dag.api.ConfigurationScope;
$ 
{preformat}

> EventQueue empty doesn't mean events are consumed in RecoveryService
> 
>
> Key: TEZ-2224
> URL: https://issues.apache.org/jira/browse/TEZ-2224
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 0.7.0, 0.5.4, 0.6.1
>
> Attachments: TEZ-2224-1.patch, TEZ-2224-2.patch
>
>
> If the event queue is empty, an event may still be in the middle of processing. 
> This should be fixed the way AsyncDispatcher does it.
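
The problem in the description can be illustrated with a small sketch. It is not the TEZ-2224 patch and not the real AsyncDispatcher or RecoveryService code: DrainAwareQueue and all of its methods are hypothetical names. It assumes one way of getting the requested effect, namely counting events that have been enqueued but not yet fully handled, so that "drained" means "handled" rather than "dequeued".

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical sketch: an event queue that knows when events are handled, not just dequeued. */
final class DrainAwareQueue<E> {
  private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
  private final Object lock = new Object();
  private int pending = 0; // enqueued but not yet fully handled; guarded by lock
  private final Thread worker;

  DrainAwareQueue(Consumer<E> handler) {
    worker = new Thread(() -> {
      try {
        while (true) {
          E event = queue.take(); // the queue may be empty from here on...
          try {
            handler.accept(event); // ...while this event is still in flight
          } catch (RuntimeException e) {
            // A real service would log or escalate; the sketch just moves on.
          } finally {
            markHandled();
          }
        }
      } catch (InterruptedException e) {
        // stop() was called; exit the loop.
      }
    });
    worker.setDaemon(true);
    worker.start();
  }

  void enqueue(E event) {
    synchronized (lock) {
      pending++; // count before the worker can possibly see the event
    }
    queue.add(event);
  }

  private void markHandled() {
    synchronized (lock) {
      pending--;
      if (pending == 0) {
        lock.notifyAll();
      }
    }
  }

  /** Returns true once every enqueued event has been handled, false if interrupted. */
  boolean awaitDrained() {
    synchronized (lock) {
      while (pending > 0) {
        try {
          lock.wait();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return false;
        }
      }
      return true;
    }
  }

  void stop() {
    worker.interrupt();
  }
}
```

A caller that previously checked whether the queue was empty would call awaitDrained() instead, which cannot return true while the handler is still working on the last event.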



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-31 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388153#comment-14388153
 ] 

Cyrille Chépélov edited comment on TEZ-2224 at 3/31/15 7:07 AM:


Hello [~jeffzhang], branch-0.6 now fails to build; apparently due to this patch

{noformat}
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on 
project tez-dag: Compilation failure: Compilation failure:
[ERROR] 
/home/cchepelov/workspace/3rd-party/tez/tez-dag/src/main/java/org/apache/tez/dag/history/recovery/RecoveryService.java:[39,30]
 cannot find symbol
[ERROR] symbol:   class ConfigurationScope
[ERROR] location: package org.apache.tez.dag.api
[ERROR] 
/home/cchepelov/workspace/3rd-party/tez/tez-dag/src/main/java/org/apache/tez/dag/history/recovery/RecoveryService.java:[40,30]
 cannot find symbol
[ERROR] symbol:   class Scope
[ERROR] location: package org.apache.tez.dag.api
[ERROR] -> [Help 1]

$ find . -name "*.java" |xargs grep ConfigurationScope
./tez-dag/src/main/java/org/apache/tez/dag/history/recovery/RecoveryService.java:import
 org.apache.tez.dag.api.ConfigurationScope;
$ 
{noformat}

update: branch-0.6 builds & passes fine after reverting 
627d508305ed2bbeff9e9c5a4ec1e083a66a554c




[jira] [Comment Edited] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-31 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388153#comment-14388153
 ] 

Cyrille Chépélov edited comment on TEZ-2224 at 3/31/15 7:06 AM:


Hello [~jeffzhang], branch-0.6 now fails to build; apparently due to this patch

{noformat}
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on 
project tez-dag: Compilation failure: Compilation failure:
[ERROR] 
/home/cchepelov/workspace/3rd-party/tez/tez-dag/src/main/java/org/apache/tez/dag/history/recovery/RecoveryService.java:[39,30]
 cannot find symbol
[ERROR] symbol:   class ConfigurationScope
[ERROR] location: package org.apache.tez.dag.api
[ERROR] 
/home/cchepelov/workspace/3rd-party/tez/tez-dag/src/main/java/org/apache/tez/dag/history/recovery/RecoveryService.java:[40,30]
 cannot find symbol
[ERROR] symbol:   class Scope
[ERROR] location: package org.apache.tez.dag.api
[ERROR] -> [Help 1]

$ find . -name "*.java" |xargs grep ConfigurationScope
./tez-dag/src/main/java/org/apache/tez/dag/history/recovery/RecoveryService.java:import
 org.apache.tez.dag.api.ConfigurationScope;
$ 
{noformat}




[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-31 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388156#comment-14388156
 ] 

Jeff Zhang commented on TEZ-2224:
-

Thanks [~cchepelov] I am looking at it.



[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-31 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388157#comment-14388157
 ] 

Jeff Zhang commented on TEZ-2224:
-

Thanks [~cchepelov] I am looking at it.



[jira] [Commented] (TEZ-2237) Complex DAG freezes and fails (was BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers)

2015-03-31 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388170#comment-14388170
 ] 

Cyrille Chépélov commented on TEZ-2237:
---

Indeed, application_142732418_1444.yarn-logs.red.txt was done using 
straight 0.6.0. A later log file (application_142732418_1467.red.txt.gz), 
impractically big, was made with branch-0.6 (as of 
66ca9655a4412e1c1db1d37e882a407706dbe3ad), which seems to include TEZ-1929. It 
seemed to freeze when I uploaded the log yesterday, and I had to free up the 
cluster, so I killed it in the end. 

It seems I killed application_142732418_1467 too early yesterday. My 
updated plan is:
# run again using TEZ branch-0.6 as of 974588e180ab53ea3e7243f2dea29a5d8ef2416d 
("TEZ-2240"), cascading-3.0.0-wip-92
# run again (if still failing) using ("tez.am.dag.scheduler.class" -> 
"org.apache.tez.dag.app.dag.impl.DAGSchedulerNaturalOrderControlled") in the 
scalding.Job#config override method.
# report
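
The scheduler override in step 2 amounts to adding a single key/value pair to the job configuration. The sketch below shows that merge in plain Java. SchedulerOverride and withControlledScheduler are hypothetical names, and a plain string map is used as a stand-in for whatever configuration object the scalding.Job#config override actually returns; only the key and the class name come from the comment above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: adds the DAG scheduler override to an existing config map. */
final class SchedulerOverride {
  static final String SCHEDULER_KEY = "tez.am.dag.scheduler.class";
  static final String NATURAL_ORDER_CONTROLLED =
      "org.apache.tez.dag.app.dag.impl.DAGSchedulerNaturalOrderControlled";

  /** Returns a copy of base with the scheduler class overridden; base is not modified. */
  static Map<String, String> withControlledScheduler(Map<String, String> base) {
    Map<String, String> merged = new HashMap<>(base);
    merged.put(SCHEDULER_KEY, NATURAL_ORDER_CONTROLLED);
    return merged;
  }
}
```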




> Complex DAG freezes and fails (was BufferTooSmallException raised in 
> UnorderedPartitionedKVWriter then DAG lingers)
> ---
>
> Key: TEZ-2237
> URL: https://issues.apache.org/jira/browse/TEZ-2237
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.6.0
> Environment: Debian Linux "jessie"
> OpenJDK Runtime Environment (build 1.8.0_40-internal-b27)
> OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)
> 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system 
> disk + 4*1 or 2 TiB HDD for HDFS & local  (on-prem, dedicated hardware)
> Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220 
> to run Cascading 3.0.0-wip-90 with TEZ 0.6.0
>Reporter: Cyrille Chépélov
> Attachments: all_stacks.lst, alloc_mem.png, alloc_vcores.png, 
> application_142732418_1444.yarn-logs.red.txt.gz, 
> appmastersyslog_dag_1427282048097_0215_1.red.txt.gz, 
> appmastersyslog_dag_1427282048097_0237_1.red.txt.gz, 
> gc_count_MRAppMaster.png, mem_free.png, ordered-grouped-kv-input-traces.diff, 
> start_containers.png, stop_containers.png, 
> syslog_attempt_1427282048097_0215_1_21_14_0.red.txt.gz, 
> syslog_attempt_1427282048097_0237_1_70_28_0.red.txt.gz, yarn_rm_flips.png
>
>
> On a specific DAG with many vertices (actually part of a larger meta-DAG), 
> after about an hour of processing, several BufferTooSmallExceptions are raised 
> in UnorderedPartitionedKVWriter (about one every two or three spills).
> Once these exceptions are raised, the DAG remains "active" indefinitely, 
> tying up memory and CPU resources as far as YARN is concerned, while little 
> if any actual processing takes place. 
> Two separate issues seem to be at hand:
>   1. BufferTooSmallExceptions are raised even though the keys and values are 
> never bigger than 24 and 1024 bytes respectively, which should fit even in the 
> small buffers actually allocated (around a couple of megabytes were allotted, 
> whereas 100 MiB were requested).
>   2. Once BufferTooSmallExceptions (BTSE) are raised, the DAG fails to stop 
> (stop requests appear to be sent 7 hours after the exceptions are raised, but 
> 9 hours after these stop requests the DAG was still lingering, with all 
> containers present and tying up memory and CPU allocations).
> The BTSEs prevent the Cascade from completing, which in turn prevents 
> validating the results against traditional MR1-based results. Because the DAG 
> never terminates, the cluster queue stays unavailable.





[jira] [Comment Edited] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-31 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388153#comment-14388153
 ] 

Cyrille Chépélov edited comment on TEZ-2224 at 3/31/15 7:23 AM:


Hello [~zjffdu], branch-0.6 now fails to build; apparently due to this patch

{noformat}
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on 
project tez-dag: Compilation failure: Compilation failure:
[ERROR] 
/home/cchepelov/workspace/3rd-party/tez/tez-dag/src/main/java/org/apache/tez/dag/history/recovery/RecoveryService.java:[39,30]
 cannot find symbol
[ERROR] symbol:   class ConfigurationScope
[ERROR] location: package org.apache.tez.dag.api
[ERROR] 
/home/cchepelov/workspace/3rd-party/tez/tez-dag/src/main/java/org/apache/tez/dag/history/recovery/RecoveryService.java:[40,30]
 cannot find symbol
[ERROR] symbol:   class Scope
[ERROR] location: package org.apache.tez.dag.api
[ERROR] -> [Help 1]

$ find . -name "*.java" |xargs grep ConfigurationScope
./tez-dag/src/main/java/org/apache/tez/dag/history/recovery/RecoveryService.java:import
 org.apache.tez.dag.api.ConfigurationScope;
$ 
{noformat}

update: branch-0.6 builds & passes fine after reverting 
627d508305ed2bbeff9e9c5a4ec1e083a66a554c




[jira] [Updated] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-31 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-2224:

Attachment: TEZ-2224-2-addendum.patch

Attached an addendum patch to fix the compilation issue in branch-0.5 & branch-0.6.

> EventQueue empty doesn't mean events are consumed in RecoveryService
> 
>
> Key: TEZ-2224
> URL: https://issues.apache.org/jira/browse/TEZ-2224
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 0.7.0, 0.5.4, 0.6.1
>
> Attachments: TEZ-2224-1.patch, TEZ-2224-2-addendum.patch, 
> TEZ-2224-2.patch
>
>
> If the event queue is empty, an event may still be in the middle of processing. 
> This should be fixed the way AsyncDispatcher does it.





[jira] [Comment Edited] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-31 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388186#comment-14388186
 ] 

Jeff Zhang edited comment on TEZ-2224 at 3/31/15 7:44 AM:
--

Attached an addendum patch to fix the compilation issue in branch-0.5 & branch-0.6. 
[~cchepelov] It should work now. 


was (Author: zjffdu):
Attached an addendum patch to fix the compilation issue in branch-0.5 & branch-0.6



[jira] [Created] (TEZ-2256) Avoid use of BufferTooSmallException to signal end of buffer in UnorderedPartitionedKVWriter

2015-03-31 Thread JIRA
Cyrille Chépélov created TEZ-2256:
-

 Summary: Avoid use of BufferTooSmallException to signal end of 
buffer in UnorderedPartitionedKVWriter
 Key: TEZ-2256
 URL: https://issues.apache.org/jira/browse/TEZ-2256
 Project: Apache Tez
  Issue Type: Improvement
Affects Versions: 0.6.0, 0.7.0
Reporter: Cyrille Chépélov
Priority: Minor


UnorderedPartitionedKVWriter delegates serialization to the application, 
passing it a private ByteArrayOutputStream. In case the buffer is exhausted, 
ByteArrayOutputStream signals that with a private BufferTooSmallException, 
which can be seen but not dealt with by the application. As [~cwensel] pointed 
out, when the application is in fact a complex framework, there is no way to 
distinguish this exception from a real failure, which compels logging the full 
stack even for reasonable events such as "buffer complete".

Suggested approach: set a "complete" flag in ByteArrayOutputStream that 
disables any further output, and replace BufferTooSmallException (BTSE) 
handling by checking that flag. 
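
A minimal sketch of the flag-based approach, under stated assumptions: BoundedBufferStream is a hypothetical stand-in, not the private ByteArrayOutputStream inside UnorderedPartitionedKVWriter, and isComplete, size and rollbackTo are names invented for this illustration. A write that does not fit sets the flag and is dropped instead of raising an exception, so the application-side serializer never sees a BTSE.

```java
import java.io.OutputStream;

/** Hypothetical fixed-capacity stream that flags exhaustion instead of throwing. */
final class BoundedBufferStream extends OutputStream {
  private final byte[] buf;
  private int count = 0;
  private boolean complete = false; // set once a write does not fit

  BoundedBufferStream(int capacity) {
    buf = new byte[capacity];
  }

  @Override
  public void write(int b) {
    if (complete) {
      return; // further output is disabled
    }
    if (count == buf.length) {
      complete = true;
      return;
    }
    buf[count++] = (byte) b;
  }

  @Override
  public void write(byte[] src, int off, int len) {
    if (complete) {
      return;
    }
    if (len > buf.length - count) {
      complete = true; // reject the whole write; no partial record bytes are kept
      return;
    }
    System.arraycopy(src, off, buf, count, len);
    count += len;
  }

  boolean isComplete() {
    return complete;
  }

  int size() {
    return count;
  }

  /** Discards everything written after mark and re-enables output. */
  void rollbackTo(int mark) {
    count = mark;
    complete = false;
  }
}
```

Under this design the writer would note size() before handing the stream to the serializer, check isComplete() afterwards, and on true call rollbackTo(mark), spill, and retry the record, with no exception crossing the framework boundary.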

[~sseth] suggested checking out SortedOutput as well, as the mechanisms there 
should be similar.

I'll give this a go this week.





[jira] [Commented] (TEZ-2237) Complex DAG freezes and fails (was BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers)

2015-03-31 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388189#comment-14388189
 ] 

Cyrille Chépélov commented on TEZ-2237:
---

OK, that'll be TEZ-2256 :)



[jira] [Reopened] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-31 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang reopened TEZ-2224:
-



[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-31 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388252#comment-14388252
 ] 

Jeff Zhang commented on TEZ-2224:
-

The build fails; checking https://builds.apache.org/job/Tez-Build/964/console



[jira] [Commented] (TEZ-2149) Optimizations for the timed version of DAGClient.getStatus

2015-03-31 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388295#comment-14388295
 ] 

Jeff Zhang commented on TEZ-2149:
-

The build fails: https://builds.apache.org/job/Tez-Build/963/console
This may be due to TEZ-2224 or this ticket.

> Optimizations for the timed version of DAGClient.getStatus
> --
>
> Key: TEZ-2149
> URL: https://issues.apache.org/jira/browse/TEZ-2149
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Fix For: 0.7.0
>
> Attachments: TEZ-2149.1.txt, TEZ-2149.2.txt
>
>
> From 
> https://issues.apache.org/jira/browse/TEZ-1967?focusedCommentId=14325037&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325037
> - The sleep within the AM can be improved via monitors.
> - INITED state is returned when communicating with the AM, SUBMITTED state is 
> returned when communicating with the RM. That could be used to optimize the 
> flow.
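
The first point (replacing a sleep with a monitor) can be sketched as below. This is an assumption-laden illustration, not the TEZ-2149 patch: StatusMonitor, setState and awaitChange are hypothetical names, and the state is modelled as a plain string. The waiter returns as soon as the state changes instead of sleeping for the full timeout.

```java
/** Hypothetical sketch: timed wait on a monitor instead of a fixed sleep. */
final class StatusMonitor {
  private final Object lock = new Object();
  private String state = "RUNNING"; // guarded by lock

  void setState(String newState) {
    synchronized (lock) {
      state = newState;
      lock.notifyAll(); // wake every waiter so it can re-check the state
    }
  }

  /** Waits up to timeoutMs for the state to differ from known; returns the current state. */
  String awaitChange(String known, long timeoutMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    synchronized (lock) {
      while (state.equals(known)) {
        long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
        if (remainingMs <= 0) {
          break; // timed out; also avoids wait(0), which would block forever
        }
        try {
          lock.wait(remainingMs);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
      return state;
    }
  }
}
```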





[jira] [Updated] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread

2015-03-31 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-714:
---
Attachment: TEZ-714-6.patch

> OutputCommitters should not run in the main AM dispatcher thread
> 
>
> Key: TEZ-714
> URL: https://issues.apache.org/jira/browse/TEZ-714
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Jeff Zhang
>Priority: Critical
> Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, 
> TEZ-714-3.patch, TEZ-714-4.patch, TEZ-714-5.patch, TEZ-714-6.patch, 
> Vertex_2.pdf
>
>
> Follow up jira from TEZ-41.
> 1) If there are multiple OutputCommitters on a Vertex, they can be run in 
> parallel.
> 2) Running an OutputCommitter in the main thread blocks all other event 
> handling, w.r.t the DAG, and causes the event queue to back up.
> 3) This should also cover shared commits that happen in the DAG.
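
Points 1 and 2 can be illustrated with a short sketch. It is not the TEZ-714 patch: ParallelCommit and commitAll are hypothetical names, and each OutputCommitter is modelled as a plain Runnable. The commits run on a separate pool, in parallel, and the caller gets a future it can turn into a completion event rather than blocking the dispatcher thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

/** Hypothetical sketch: run commit actions off the dispatcher thread, in parallel. */
final class ParallelCommit {
  /**
   * Submits every committer to the pool and returns immediately.
   * The future completes with true if all commits succeeded, false if any threw.
   */
  static CompletableFuture<Boolean> commitAll(List<Runnable> committers, ExecutorService pool) {
    List<CompletableFuture<Void>> futures = new ArrayList<>();
    for (Runnable committer : committers) {
      futures.add(CompletableFuture.runAsync(committer, pool));
    }
    return CompletableFuture
        .allOf(futures.toArray(new CompletableFuture[0]))
        .handle((ignored, error) -> error == null);
  }
}
```

In a dispatcher-based design the returned future would typically be given a callback that enqueues a "commit completed" or "commit failed" event, so the main thread keeps draining its queue while the commits run.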





[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread

2015-03-31 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388448#comment-14388448
 ] 

Jeff Zhang commented on TEZ-714:


[~bikassaha] Minor update on the patch. 
* Please ignore any issue in InternalErrorTransition; I created TEZ-2250 for that.
* Please check my last comment above for the other review comment



[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread

2015-03-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388449#comment-14388449
 ] 

Hadoop QA commented on TEZ-714:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12708415/TEZ-714-6.patch
  against master revision 3bfe003.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/370//console

This message is automatically generated.



Failed: TEZ-714 PreCommit Build #370

2015-03-31 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-714
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/370/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 20 lines...]
[PreCommit-TEZ-Build] $ /bin/bash /tmp/hudson7378719374853631643.sh
Running in Jenkins mode


==
==
Testing patch for TEZ-714.
==
==


HEAD is now at 3bfe003 TEZ-2224. Fix compilation failure (zjffdu)
error: pathspec 'master' did not match any file(s) known to git.
From https://git-wip-us.apache.org/repos/asf/tez
 * branch            HEAD       -> FETCH_HEAD
Current branch HEAD is up to date.
TEZ-714 patch is being downloaded at Tue Mar 31 12:20:47 UTC 2015 from
http://issues.apache.org/jira/secure/attachment/12708415/TEZ-714-6.patch
The patch does not appear to apply with p0 to p2
PATCH APPLICATION FAILED




{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12708415/TEZ-714-6.patch
  against master revision 3bfe003.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/370//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
bb4945f433094ef9e01fe6635e59c8753ed8c60b logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
No tests ran.

[jira] [Updated] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread

2015-03-31 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-714:
---
Attachment: TEZ-714-7.patch

Rebase patch.

> OutputCommitters should not run in the main AM dispatcher thread
> 
>
> Key: TEZ-714
> URL: https://issues.apache.org/jira/browse/TEZ-714
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Jeff Zhang
>Priority: Critical
> Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, 
> TEZ-714-3.patch, TEZ-714-4.patch, TEZ-714-5.patch, TEZ-714-6.patch, 
> TEZ-714-7.patch, Vertex_2.pdf
>
>
> Follow up jira from TEZ-41.
> 1) If there are multiple OutputCommitters on a Vertex, they can be run in 
> parallel.
> 2) Running an OutputCommitter in the main thread blocks all other event 
> handling, w.r.t the DAG, and causes the event queue to back up.
> 3) This should also cover shared commits that happen in the DAG.





Success: TEZ-714 PreCommit Build #371

2015-03-31 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-714
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/371/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2773 lines...]
[INFO] Final Memory: 71M/982M
[INFO] 




{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12708419/TEZ-714-7.patch
  against master revision 3bfe003.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/371//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/371//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
07367388537559996959a6d67ade9c66754a664e logged out


==
==
Finished build.
==
==


Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #368
Archived 44 artifacts
Archive block size is 32768
Received 4 blocks and 2636440 bytes
Compression is 4.7%
Took 1.1 sec
Description set: TEZ-714
Recording test results
Email was triggered for: Success
Sending email for trigger: Success



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread

2015-03-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388528#comment-14388528
 ] 

Hadoop QA commented on TEZ-714:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12708419/TEZ-714-7.patch
  against master revision 3bfe003.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/371//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/371//console

This message is automatically generated.

> OutputCommitters should not run in the main AM dispatcher thread
> 
>
> Key: TEZ-714
> URL: https://issues.apache.org/jira/browse/TEZ-714
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Jeff Zhang
>Priority: Critical
> Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, 
> TEZ-714-3.patch, TEZ-714-4.patch, TEZ-714-5.patch, TEZ-714-6.patch, 
> TEZ-714-7.patch, Vertex_2.pdf
>
>
> Follow up jira from TEZ-41.
> 1) If there are multiple OutputCommitters on a Vertex, they can be run in 
> parallel.
> 2) Running an OutputCommitter in the main thread blocks all other event 
> handling, w.r.t the DAG, and causes the event queue to back up.
> 3) This should also cover shared commits that happen in the DAG.





[jira] [Updated] (TEZ-2226) Disable writing history to timeline if domain creation fails.

2015-03-31 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated TEZ-2226:
--
Attachment: TEZ-2226.3.patch

> Disable writing history to timeline if domain creation fails.
> -
>
> Key: TEZ-2226
> URL: https://issues.apache.org/jira/browse/TEZ-2226
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>Assignee: Chang Li
>Priority: Blocker
> Attachments: TEZ-2226.2.patch, TEZ-2226.3.patch, TEZ-2226.patch, 
> TEZ-2226.wip.2.patch, TEZ-2226.wip.patch
>
>






[jira] [Commented] (TEZ-2226) Disable writing history to timeline if domain creation fails.

2015-03-31 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388705#comment-14388705
 ] 

Chang Li commented on TEZ-2226:
---

[~hitesh] Thanks for suggesting the proper changes. I have changed my fix. Now 
a putDomain failure will not throw an exception but will only set timelineClient to 
null. Then createSessionDomain or createDAGDomain will check timelineClient for 
null and return null for aclConfig. Within 
createApplicationSubmissionContext, aclConfig will be checked, and the timeline 
service will be turned off if aclConfig is null. Does this work?
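
For illustration, the flow described above could look roughly like this. It is a minimal sketch under assumptions, not the code in TEZ-2226.3.patch: `DomainClient`, `timelineEnabled` and the config key are hypothetical, and only the names timelineClient, putDomain, createSessionDomain and aclConfig come from the comment.

```java
import java.util.HashMap;
import java.util.Map;

final class TimelineFallbackSketch {
  /** Hypothetical stand-in for a timeline client; not the YARN TimelineClient API. */
  interface DomainClient {
    void putDomain(String domainId) throws Exception;
  }

  private DomainClient timelineClient;

  TimelineFallbackSketch(DomainClient client) {
    this.timelineClient = client;
  }

  /** Returns the ACL config for the domain, or null if the domain could not be created. */
  Map<String, String> createSessionDomain(String domainId) {
    if (timelineClient == null) {
      return null; // an earlier putDomain failure already disabled the client
    }
    try {
      timelineClient.putDomain(domainId);
    } catch (Exception e) {
      timelineClient = null; // remember the failure instead of propagating it
      return null;
    }
    Map<String, String> aclConfig = new HashMap<>();
    aclConfig.put("session.domain.id", domainId); // hypothetical key, for illustration
    return aclConfig;
  }

  /** The submission path would keep timeline history on only when an ACL config exists. */
  static boolean timelineEnabled(Map<String, String> aclConfig) {
    return aclConfig != null;
  }

  public static void main(String[] args) {
    TimelineFallbackSketch failing = new TimelineFallbackSketch(id -> {
      throw new IllegalStateException("domain creation failed");
    });
    System.out.println("timeline enabled: " + timelineEnabled(failing.createSessionDomain("d1")));
  }
}
```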

> Disable writing history to timeline if domain creation fails.
> -
>
> Key: TEZ-2226
> URL: https://issues.apache.org/jira/browse/TEZ-2226
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>Assignee: Chang Li
>Priority: Blocker
> Attachments: TEZ-2226.2.patch, TEZ-2226.3.patch, TEZ-2226.patch, 
> TEZ-2226.wip.2.patch, TEZ-2226.wip.patch
>
>






[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388708#comment-14388708
 ] 

Hitesh Shah commented on TEZ-2224:
--

[~zjffdu] It is better to either revert the change or address the fix in a new 
jira. 

> EventQueue empty doesn't mean events are consumed in RecoveryService
> 
>
> Key: TEZ-2224
> URL: https://issues.apache.org/jira/browse/TEZ-2224
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 0.7.0, 0.5.4, 0.6.1
>
> Attachments: TEZ-2224-1.patch, TEZ-2224-2-addendum.patch, 
> TEZ-2224-2.patch
>
>
> If the event queue is empty, an event may still be being processed. This should be 
> fixed the same way as in AsyncDispatcher.
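
To make the problem concrete: once the consumer thread has taken the last event, the queue is empty while that event is still being handled. A minimal sketch of a "drained" flag follows. It is loosely modeled on the idea in Hadoop's AsyncDispatcher and is not the TEZ-2224 patch; all names here are hypothetical. The flag is only set once the consumer has returned from the handler and found the queue empty.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

final class DrainAwareQueueSketch {
  private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
  private final Object drainLock = new Object();
  private boolean drained = true; // guarded by drainLock

  DrainAwareQueueSketch(Consumer<String> handler) {
    Thread worker = new Thread(() -> {
      try {
        while (true) {
          synchronized (drainLock) {
            // Reached only after the previous event has been fully handled.
            drained = eventQueue.isEmpty();
            if (drained) {
              drainLock.notifyAll();
            }
          }
          handler.accept(eventQueue.take());
        }
      } catch (InterruptedException e) {
        // Interrupted: stop consuming.
      }
    });
    worker.setDaemon(true);
    worker.start();
  }

  void enqueue(String event) {
    synchronized (drainLock) {
      drained = false; // cleared together with the add, so the flag never lags the queue
      eventQueue.add(event);
    }
  }

  /** Blocks until every enqueued event has been handled, not merely dequeued. */
  void awaitDrained() throws InterruptedException {
    synchronized (drainLock) {
      while (!drained) {
        drainLock.wait();
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicInteger handled = new AtomicInteger();
    DrainAwareQueueSketch queue = new DrainAwareQueueSketch(e -> handled.incrementAndGet());
    queue.enqueue("a");
    queue.enqueue("b");
    queue.awaitDrained();
    System.out.println("handled after drain: " + handled.get());
  }
}
```

Checking `eventQueue.isEmpty()` from the waiting thread instead would return as soon as the last event is dequeued, which is the situation the issue title describes.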





[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388709#comment-14388709
 ] 

Hitesh Shah commented on TEZ-2224:
--

Also the build failed as AMRecovery timed out. 

> EventQueue empty doesn't mean events are consumed in RecoveryService
> 
>
> Key: TEZ-2224
> URL: https://issues.apache.org/jira/browse/TEZ-2224
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 0.7.0, 0.5.4, 0.6.1
>
> Attachments: TEZ-2224-1.patch, TEZ-2224-2-addendum.patch, 
> TEZ-2224-2.patch
>
>
> If the event queue is empty, an event may still be being processed. This should be 
> fixed the same way as in AsyncDispatcher.





[jira] [Updated] (TEZ-2256) Avoid use of BufferTooSmallException to signal end of buffer in UnorderedPartitionedKVWriter

2015-03-31 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2256:
-
Assignee: Cyrille Chépélov

> Avoid use of BufferTooSmallException to signal end of buffer in 
> UnorderedPartitionedKVWriter
> 
>
> Key: TEZ-2256
> URL: https://issues.apache.org/jira/browse/TEZ-2256
> Project: Apache Tez
>  Issue Type: Improvement
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Cyrille Chépélov
>Assignee: Cyrille Chépélov
>Priority: Minor
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> UnorderedPartitionedKVWriter delegates serialization to the application, 
> passing it a private ByteArrayOutputStream. In case the buffer is exhausted, 
> ByteArrayOutputStream signals that with a private BufferTooSmallException, 
> which can be seen but not dealt with by the application. As [~cwensel] 
> pointed out, when the application is in fact a complex framework, there is no 
> way to distinguish this exception from a real failure, which compels logging 
> the full stack even for reasonable events such as "buffer complete".
> Suggested approach: set a "complete" flag in ByteArrayOutputStream that 
> disables any further output, and replace BufferTooSmallException (BTSE) 
> handling with a check of that flag. 
> [~sseth] suggested checking out SortedOutput as well, as the mechanisms there 
> should be similar.
> I'll give this a go this week.
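
As a rough illustration of the suggested approach, here is a minimal sketch of a bounded buffer that sets a "complete" flag instead of throwing. It is not the Tez ByteArrayOutputStream, and `BoundedBufferSketch`, `tryWrite` and `rewindTo` are hypothetical names. The writer checks the flag after serialization and rolls back the partial record.

```java
import java.io.OutputStream;

final class BoundedBufferSketch extends OutputStream {
  private final byte[] buf;
  private int count;
  private boolean complete; // set once a write does not fit; later writes are ignored

  BoundedBufferSketch(int capacity) {
    this.buf = new byte[capacity];
  }

  @Override
  public void write(int b) {
    if (complete) {
      return;
    }
    if (count == buf.length) {
      complete = true;
      return;
    }
    buf[count++] = (byte) b;
  }

  @Override
  public void write(byte[] src, int off, int len) {
    if (complete) {
      return;
    }
    if (len > buf.length - count) {
      complete = true;
      return;
    }
    System.arraycopy(src, off, buf, count, len);
    count += len;
  }

  boolean isComplete() {
    return complete;
  }

  int size() {
    return count;
  }

  /** Discards everything written after position pos and clears the flag. */
  void rewindTo(int pos) {
    count = pos;
    complete = false;
  }

  /** Writes one record; returns false (buffer unchanged) if it did not fit. */
  static boolean tryWrite(BoundedBufferSketch out, byte[] record) {
    int start = out.size();
    out.write(record, 0, record.length); // stands in for application-side serialization
    if (out.isComplete()) {
      out.rewindTo(start); // drop the partial record; the caller would spill and retry
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    BoundedBufferSketch out = new BoundedBufferSketch(8);
    System.out.println("first record fits: " + tryWrite(out, new byte[5]));
    System.out.println("second record fits: " + tryWrite(out, new byte[5]));
  }
}
```

Since no exception crosses the application's serializer, a framework wrapped around it has nothing to log or misreport when the buffer simply fills up.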





[jira] [Commented] (TEZ-2256) Avoid use of BufferTooSmallException to signal end of buffer in UnorderedPartitionedKVWriter

2015-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388713#comment-14388713
 ] 

Hitesh Shah commented on TEZ-2256:
--

[~cchepelov] You are now in the contributors list so you should be able to 
assign jiras to yourself.

> Avoid use of BufferTooSmallException to signal end of buffer in 
> UnorderedPartitionedKVWriter
> 
>
> Key: TEZ-2256
> URL: https://issues.apache.org/jira/browse/TEZ-2256
> Project: Apache Tez
>  Issue Type: Improvement
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Cyrille Chépélov
>Assignee: Cyrille Chépélov
>Priority: Minor
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> UnorderedPartitionedKVWriter delegates serialization to the application, 
> passing it a private ByteArrayOutputStream. In case the buffer is exhausted, 
> ByteArrayOutputStream signals that with a private BufferTooSmallException, 
> which can be seen but not dealt with by the application. As [~cwensel] 
> pointed out, when the application is in fact a complex framework, there is no 
> way to distinguish this exception from a real failure, which compels logging 
> the full stack even for reasonable events such as "buffer complete".
> Suggested approach: set a "complete" flag in ByteArrayOutputStream that 
> disables any further output, and replace BufferTooSmallException (BTSE) 
> handling with a check of that flag. 
> [~sseth] suggested checking out SortedOutput as well, as the mechanisms there 
> should be similar.
> I'll give this a go this week.





Failed: TEZ-2226 PreCommit Build #372

2015-03-31 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2226
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/372/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2192 lines...]

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12708442/TEZ-2226.3.patch
  against master revision 3bfe003.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
   
org.apache.tez.runtime.library.common.shuffle.impl.TestShuffleInputEventHandlerImpl
  org.apache.tez.runtime.library.common.shuffle.TestFetcher

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/372//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/372//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
53e3dd8bd982aae4a9ea5a94df36a9958ca84a58 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #371
Archived 44 artifacts
Archive block size is 32768
Received 2 blocks and 2634702 bytes
Compression is 2.4%
Took 0.82 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
3 tests failed.
REGRESSION:  
org.apache.tez.runtime.library.common.shuffle.TestFetcher.testSetupLocalDiskFetch

Error Message:
test timed out after 3000 milliseconds

Stack Trace:
java.lang.Exception: test timed out after 3000 milliseconds
at java.net.PlainDatagramSocketImpl.receive0(Native Method)
at 
java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:145)
at java.net.DatagramSocket.receive(DatagramSocket.java:786)
at com.sun.jndi.dns.DnsClient.doUdpQuery(DnsClient.java:416)
at com.sun.jndi.dns.DnsClient.query(DnsClient.java:210)
at com.sun.jndi.dns.Resolver.query(Resolver.java:81)
at com.sun.jndi.dns.DnsContext.c_getAttributes(DnsContext.java:430)
at 
com.sun.jndi.toolkit.ctx.ComponentDirContext.p_getAttributes(ComponentDirContext.java:231)
at 
com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.getAttributes(PartialCompositeDirContext.java:139)
at 
com.sun.jndi.toolkit.url.GenericURLDirContext.getAttributes(GenericURLDirContext.java:103)
at 
sun.security.krb5.KrbServiceLocator.getKerberosService(KrbServiceLocator.java:87)
at sun.security.krb5.Config.checkRealm(Config.java:1295)
at sun.security.krb5.Config.getRealmFromDNS(Config.java:1268)
at sun.security.krb5.Config.getDefaultRealm(Config.java:1162)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:84)
at 
org.apache.hadoop.security.authentication.util.KerberosName.<init>(KerberosName.java:86)
at 
org.apache.hadoop.security.UserGroupInformation.init

[jira] [Commented] (TEZ-2226) Disable writing history to timeline if domain creation fails.

2015-03-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388724#comment-14388724
 ] 

Hadoop QA commented on TEZ-2226:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12708442/TEZ-2226.3.patch
  against master revision 3bfe003.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
   
org.apache.tez.runtime.library.common.shuffle.impl.TestShuffleInputEventHandlerImpl
  org.apache.tez.runtime.library.common.shuffle.TestFetcher

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/372//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/372//console

This message is automatically generated.

> Disable writing history to timeline if domain creation fails.
> -
>
> Key: TEZ-2226
> URL: https://issues.apache.org/jira/browse/TEZ-2226
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>Assignee: Chang Li
>Priority: Blocker
> Attachments: TEZ-2226.2.patch, TEZ-2226.3.patch, TEZ-2226.patch, 
> TEZ-2226.wip.2.patch, TEZ-2226.wip.patch
>
>






[jira] [Commented] (TEZ-2226) Disable writing history to timeline if domain creation fails.

2015-03-31 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388782#comment-14388782
 ] 

Chang Li commented on TEZ-2226:
---

These two tests pass on my machine

> Disable writing history to timeline if domain creation fails.
> -
>
> Key: TEZ-2226
> URL: https://issues.apache.org/jira/browse/TEZ-2226
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>Assignee: Chang Li
>Priority: Blocker
> Attachments: TEZ-2226.2.patch, TEZ-2226.3.patch, TEZ-2226.patch, 
> TEZ-2226.wip.2.patch, TEZ-2226.wip.patch
>
>






[jira] [Updated] (TEZ-2192) Relocalization does not check for source

2015-03-31 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2192:
-
Attachment: TEZ-2192.1.patch

> Relocalization does not check for source
> 
>
> Key: TEZ-2192
> URL: https://issues.apache.org/jira/browse/TEZ-2192
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.6.0, 0.5.2
>Reporter: Rohini Palaniswamy
>Assignee: Hitesh Shah
>Priority: Blocker
> Attachments: TEZ-2192.1.patch
>
>
> PIG-4443 spills the input splits to disk if the serialized split size is greater 
> than some threshold. It runs into problems with relocalization when more than one 
> vertex has a job.split file. If a job.split file is already present when a container 
> is reused, that file is reused, causing the wrong data to be read.
> We need either a way to turn off relocalization or a check of the source+timestamp 
> that redownloads the file during relocalization. 
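
For illustration, the second option (checking source+timestamp) could be sketched as below. This is a minimal sketch under assumptions, not TEZ-2192.1.patch: `ResourceKey` and `needsDownload` are hypothetical names, and the example URIs are made up. A file already present in the container is reused only when both its source and its timestamp match the request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class RelocalizationCheckSketch {
  /** Identity of a localized file: where it came from and the source's modification time. */
  static final class ResourceKey {
    final String sourceUri;
    final long timestamp;

    ResourceKey(String sourceUri, long timestamp) {
      this.sourceUri = sourceUri;
      this.timestamp = timestamp;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof ResourceKey)) {
        return false;
      }
      ResourceKey other = (ResourceKey) o;
      return timestamp == other.timestamp && sourceUri.equals(other.sourceUri);
    }

    @Override
    public int hashCode() {
      return Objects.hash(sourceUri, timestamp);
    }
  }

  // Local file name -> identity of the copy currently in the container.
  private final Map<String, ResourceKey> localized = new HashMap<>();

  /** Returns true if the file must be downloaded (again); a name match alone is not enough. */
  boolean needsDownload(String fileName, String sourceUri, long timestamp) {
    ResourceKey wanted = new ResourceKey(sourceUri, timestamp);
    if (wanted.equals(localized.get(fileName))) {
      return false; // same source and timestamp: safe to reuse
    }
    localized.put(fileName, wanted); // record the copy that is about to be fetched
    return true;
  }

  public static void main(String[] args) {
    RelocalizationCheckSketch container = new RelocalizationCheckSketch();
    System.out.println(container.needsDownload("job.split", "hdfs://nn/vertex1/job.split", 100L));
    System.out.println(container.needsDownload("job.split", "hdfs://nn/vertex2/job.split", 100L));
  }
}
```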





[jira] [Commented] (TEZ-2192) Relocalization does not check for source

2015-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389042#comment-14389042
 ] 

Hitesh Shah commented on TEZ-2192:
--

[~sseth] [~bikassaha] [~zjffdu] Review please.

Still need to do some additional manual testing with a job to trigger 
re-localizations and conflicting resources. 

> Relocalization does not check for source
> 
>
> Key: TEZ-2192
> URL: https://issues.apache.org/jira/browse/TEZ-2192
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.6.0, 0.5.2
>Reporter: Rohini Palaniswamy
>Assignee: Hitesh Shah
>Priority: Blocker
> Attachments: TEZ-2192.1.patch
>
>
> PIG-4443 spills the input splits to disk if the serialized split size is greater 
> than some threshold. It runs into problems with relocalization when more than one 
> vertex has a job.split file. If a job.split file is already present when a container 
> is reused, that file is reused, causing the wrong data to be read.
> We need either a way to turn off relocalization or a check of the source+timestamp 
> that redownloads the file during relocalization. 





[jira] [Commented] (TEZ-2231) Create project by-laws

2015-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389070#comment-14389070
 ] 

Hitesh Shah commented on TEZ-2231:
--

bq. 3 day vote - drop the weekend clause ? That can differ.
Could you explain? Maybe we should switch to a longer vote to avoid differences 
in that case?

Addressed the other 2 comments in patch 3. 

( Change of patch nullifies the +1's from [~rohini] and [~bikassaha] ).


> Create project by-laws
> --
>
> Key: TEZ-2231
> URL: https://issues.apache.org/jira/browse/TEZ-2231
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
> Attachments: by-laws.2.patch, by-laws.3.patch, by-laws.patch
>
>
> Define the Project by-laws.





[jira] [Updated] (TEZ-2231) Create project by-laws

2015-03-31 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2231:
-
Attachment: by-laws.3.patch

> Create project by-laws
> --
>
> Key: TEZ-2231
> URL: https://issues.apache.org/jira/browse/TEZ-2231
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
> Attachments: by-laws.2.patch, by-laws.3.patch, by-laws.patch
>
>
> Define the Project by-laws.





Success: TEZ-2192 PreCommit Build #373

2015-03-31 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2192
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/373/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2758 lines...]
[INFO] Final Memory: 71M/967M
[INFO] 




{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12708484/TEZ-2192.1.patch
  against master revision 3bfe003.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/373//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/373//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
adced8ca8c1f43aab6d5133225b359d397140cc5 logged out


==
==
Finished build.
==
==


Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #371
Archived 69 artifacts
Archive block size is 32768
Received 0 blocks and 7218226 bytes
Compression is 0.0%
Took 2.8 sec
Description set: TEZ-2192
Recording test results
Email was triggered for: Success
Sending email for trigger: Success



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-2192) Relocalization does not check for source

2015-03-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389224#comment-14389224
 ] 

Hadoop QA commented on TEZ-2192:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12708484/TEZ-2192.1.patch
  against master revision 3bfe003.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/373//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/373//console

This message is automatically generated.

> Relocalization does not check for source
> 
>
> Key: TEZ-2192
> URL: https://issues.apache.org/jira/browse/TEZ-2192
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.6.0, 0.5.2
>Reporter: Rohini Palaniswamy
>Assignee: Hitesh Shah
>Priority: Blocker
> Attachments: TEZ-2192.1.patch
>
>
> PIG-4443 spills the input splits to disk if the serialized split size is greater 
> than some threshold. It runs into problems with relocalization when more than one 
> vertex has a job.split file. If a job.split file is already present when a container 
> is reused, that file is reused, causing the wrong data to be read.
> We need either a way to turn off relocalization or a check of the source+timestamp 
> that redownloads the file during relocalization. 





[jira] [Created] (TEZ-2257) NPEs in TaskReporter

2015-03-31 Thread Siddharth Seth (JIRA)
Siddharth Seth created TEZ-2257:
---

 Summary: NPEs in TaskReporter
 Key: TEZ-2257
 URL: https://issues.apache.org/jira/browse/TEZ-2257
 Project: Apache Tez
  Issue Type: Bug
Reporter: Siddharth Seth
Assignee: Siddharth Seth


The task reporter can end up throwing NPEs when adding events, reporting 
exceptions, or marking a task as complete.
currentCallable causes this.





[jira] [Updated] (TEZ-2257) NPEs in TaskReporter

2015-03-31 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-2257:

Attachment: TEZ-2257.1.txt

Patch synchronizes access to the currentCallable. [~hitesh] - please review.
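
As a rough picture of what synchronized access to currentCallable could look like (a minimal sketch under assumptions, not the contents of TEZ-2257.1.txt): `Heartbeat`, `registerTask`, `unregisterTask` and `addEvent` are hypothetical stand-ins. The field is read and written only under one lock, and a missing callable is reported rather than dereferenced.

```java
import java.util.ArrayList;
import java.util.List;

final class ReporterSketch {
  /** Hypothetical stand-in for the per-task heartbeat state. */
  static final class Heartbeat {
    private final List<String> events = new ArrayList<>();

    void add(String event) {
      events.add(event);
    }

    int pending() {
      return events.size();
    }
  }

  private Heartbeat currentCallable; // guarded by "this"; null while no task is registered

  synchronized void registerTask() {
    currentCallable = new Heartbeat();
  }

  synchronized void unregisterTask() {
    currentCallable = null;
  }

  /** Returns false instead of throwing an NPE when no task is registered. */
  synchronized boolean addEvent(String event) {
    if (currentCallable == null) {
      return false;
    }
    currentCallable.add(event);
    return true;
  }

  synchronized int pendingEvents() {
    return currentCallable == null ? 0 : currentCallable.pending();
  }

  public static void main(String[] args) {
    ReporterSketch reporter = new ReporterSketch();
    System.out.println("before register: " + reporter.addEvent("e1"));
    reporter.registerTask();
    System.out.println("after register: " + reporter.addEvent("e2"));
  }
}
```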

> NPEs in TaskReporter
> 
>
> Key: TEZ-2257
> URL: https://issues.apache.org/jira/browse/TEZ-2257
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-2257.1.txt
>
>
> The task reporter can end up throwing NPEs when adding events, reporting 
> exceptions, or marking a task as complete.
> currentCallable causes this.





[jira] [Updated] (TEZ-2225) Remove instances of LOG.isDebugEnabled

2015-03-31 Thread Vasanth kumar RJ (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vasanth kumar RJ updated TEZ-2225:
--
Attachment: TEZ-2225.1.patch

Attached patch. Please review.

> Remove instances of LOG.isDebugEnabled
> --
>
> Key: TEZ-2225
> URL: https://issues.apache.org/jira/browse/TEZ-2225
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Vasanth kumar RJ
>Assignee: Vasanth kumar RJ
>Priority: Minor
>  Labels: performance
> Attachments: TEZ-2225.1.patch
>
>
> Remove LOG.isDebugEnabled() and use parameterized debug logging
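
For illustration, the difference between a guarded call and a parameterized call can be shown with the JDK's own java.util.logging, used here only as a stand-in so the sketch runs without dependencies. The issue does not name the logging API, and this is not the attached patch. `Expensive` is a hypothetical class that counts how often it is rendered to a string.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class ParameterizedLoggingSketch {
  private static final Logger LOG = Logger.getLogger("ParameterizedLoggingSketch");

  static {
    LOG.setLevel(Level.INFO); // FINE (debug-level) messages are disabled in this sketch
  }

  /** Counts toString() calls so the cost of eager formatting becomes visible. */
  static final class Expensive {
    int rendered;

    @Override
    public String toString() {
      rendered++;
      return "expensive-state";
    }
  }

  /** Unguarded concatenation: the argument is rendered even though nothing is logged. */
  static void eager(Expensive e) {
    LOG.fine("state: " + e);
  }

  /** The guard style that the issue proposes to remove. */
  static void guarded(Expensive e) {
    if (LOG.isLoggable(Level.FINE)) {
      LOG.fine("state: " + e);
    }
  }

  /** Parameterized style: formatting is deferred until the level check has passed. */
  static void parameterized(Expensive e) {
    LOG.log(Level.FINE, "state: {0}", e);
  }

  public static void main(String[] args) {
    Expensive e = new Expensive();
    guarded(e);
    parameterized(e);
    System.out.println("renders after guarded + parameterized: " + e.rendered);
    eager(e);
    System.out.println("renders after eager: " + e.rendered);
  }
}
```

The parameterized form gets the same saving as the explicit guard with one line instead of three, which is presumably the readability gain the issue is after.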





[jira] [Updated] (TEZ-2257) NPEs in TaskReporter

2015-03-31 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-2257:

Description: 
The task reporter can end up throwing NPEs when adding events, reporting 
exceptions, or marking a task as complete.
currentCallable causes this.

{code}
15/02/23 15:31:28 [TezChild] INFO task.TezTaskRunner : Encounted an error while 
executing task: attempt_1424727586401_0019_1_00_00_0
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.io.IOException: java.io.IOException: java.lang.InterruptedException
  at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:186)
  at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
  at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
  at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
  at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
  at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:171)
  at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:166)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.io.IOException: java.io.IOException: java.lang.InterruptedException
  at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:71)
  at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:292)
  at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:163)
  ... 13 more
Caused by: java.io.IOException: java.io.IOException: 
java.lang.InterruptedException
  at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
  at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
  at 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
  at 
org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
  at 
org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
  at 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
  at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:126)
  at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
  at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
  ... 15 more
Caused by: java.io.IOException: java.lang.InterruptedException
  at 
org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:146)
  at 
org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:87)
  at 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
  ... 21 more
Caused by: java.lang.InterruptedException
  at java.lang.Object.wait(Native Method)
  at 
org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.nextCvb(LlapInputFormat.java:163)
  at 
org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:132)
  ... 23 more
15/02/23 15:31:28 [TezChild] INFO task.TezTaskRunner : Ignoring the following 
exception since a previous exception is already registered
java.lang.NullPointerException
  at 
org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.access$300(TaskReporter.java:120)
  at org.apache.tez.runtime.task.TaskReporter.taskFailed(TaskReporter.java:382)
  at 
org.apache.tez.runtime.task.TezTaskRunner.sendFailure(TezTaskRunner.java:260)
  at org.apache.tez.runtime.task.TezTaskRunner.access$600(TezTaskRunner.java:51)
  at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:227)
  at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
  at 
org.apache.tez.runtime.task.TezTaskRunner$T

[jira] [Updated] (TEZ-2216) Expose errors during AM initialization

2015-03-31 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated TEZ-2216:
--
Attachment: TEZ-2216.1.patch

> Expose errors during AM initialization
> --
>
> Key: TEZ-2216
> URL: https://issues.apache.org/jira/browse/TEZ-2216
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Bikas Saha
> Attachments: TEZ-2216.1.patch
>
>
> If there are bad configs or other issues that cause errors/exceptions during 
> AM initialization (eg. during service init) then those errors are not exposed 
> to the user. Exposing them would be useful in quickly debugging such issues.





[jira] [Commented] (TEZ-2216) Expose errors during AM initialization

2015-03-31 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389321#comment-14389321
 ] 

Chang Li commented on TEZ-2216:
---

[~bikassaha], [~zjffdu], I am interested in working on this issue. I just did a 
simple try catch within serviceInit in YarnTaskSchedulerService based on Jeff's 
finding. Is it sufficient? Thanks

> Expose errors during AM initialization
> --
>
> Key: TEZ-2216
> URL: https://issues.apache.org/jira/browse/TEZ-2216
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Bikas Saha
> Attachments: TEZ-2216.1.patch
>
>
> If there are bad configs or other issues that cause errors/exceptions during 
> AM initialization (eg. during service init) then those errors are not exposed 
> to the user. Exposing them would be useful in quickly debugging such issues.





[jira] [Comment Edited] (TEZ-2216) Expose errors during AM initialization

2015-03-31 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389321#comment-14389321
 ] 

Chang Li edited comment on TEZ-2216 at 3/31/15 8:21 PM:


[~bikassaha], [~zjffdu], I am interested in working on this issue. I just added 
a simple try catch block within serviceInit in YarnTaskSchedulerService based 
on Jeff's finding. Is it sufficient? Thanks


was (Author: lichangleo):
[~bikassaha], [~zjffdu], I am interested in working on this issue. I just did a 
simple try catch within serviceInit in YarnTaskSchedulerService based on Jeff's 
finding. Is it sufficient? Thanks

> Expose errors during AM initialization
> --
>
> Key: TEZ-2216
> URL: https://issues.apache.org/jira/browse/TEZ-2216
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Bikas Saha
> Attachments: TEZ-2216.1.patch
>
>
> If there are bad configs or other issues that cause errors/exceptions during 
> AM initialization (eg. during service init) then those errors are not exposed 
> to the user. Exposing them would be useful in quickly debugging such issues.





[jira] [Commented] (TEZ-2257) NPEs in TaskReporter

2015-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389375#comment-14389375
 ] 

Hitesh Shah commented on TEZ-2257:
--

Seems like a major issue. Why is this only targeted for 0.7.0? 

> NPEs in TaskReporter
> 
>
> Key: TEZ-2257
> URL: https://issues.apache.org/jira/browse/TEZ-2257
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-2257.1.txt
>
>
> The task reporter can end up throwing NPEs when adding events, reporting 
> exceptions or marking a task as complete.
> currentCallable causes this.
> {code}
> 15/02/23 15:31:28 [TezChild] INFO task.TezTaskRunner : Encounted an error 
> while executing task: attempt_1424727586401_0019_1_00_00_0
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.IOException: java.lang.InterruptedException
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:186)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:171)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:166)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.IOException: java.lang.InterruptedException
>   at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:71)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:292)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:163)
>   ... 13 more
> Caused by: java.io.IOException: java.io.IOException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
>   at 
> org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
>   at 
> org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:126)
>   at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
>   ... 15 more
> Caused by: java.io.IOException: java.lang.InterruptedException
>   at 
> org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:146)
>   at 
> org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:87)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
>   ... 21 more
> Caused by: java.lang.InterruptedException
>   at java.lang.Object.wait(Native Method)
>   at 
> org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.nextCvb(LlapInputFormat.java:163)
>   at 
> org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:132)
>   ... 23 more
> 15/02/23 15:31:28 [TezChild] INFO task.TezTaskRunner : Ignoring the following 
> exception since a previous exception is already registered
> java.lang.NullPointerException
>   at 
> org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.access$300(TaskReporter.java:120)
>   at 
> org.apache.tez.runtime.task.TaskReporter.taskFailed(TaskReporter.java:382)
>   at 
> org.apache.tez

[jira] [Commented] (TEZ-2257) NPEs in TaskReporter

2015-03-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389399#comment-14389399
 ] 

Hadoop QA commented on TEZ-2257:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12708512/TEZ-2257.1.txt
  against master revision 3bfe003.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
   org.apache.tez.runtime.library.common.shuffle.TestFetcher

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/375//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/375//console

This message is automatically generated.

> NPEs in TaskReporter
> 
>
> Key: TEZ-2257
> URL: https://issues.apache.org/jira/browse/TEZ-2257
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-2257.1.txt
>
>
> The task reporter can end up throwing NPEs when adding events, reporting 
> exceptions or marking a task as complete.
> currentCallable causes this.
> {code}
> 15/02/23 15:31:28 [TezChild] INFO task.TezTaskRunner : Encounted an error 
> while executing task: attempt_1424727586401_0019_1_00_00_0
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.IOException: java.lang.InterruptedException
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:186)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:171)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:166)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.IOException: java.lang.InterruptedException
>   at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:71)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:292)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:163)
>   ... 13 more
> Caused by: java.io.IOException: java.io.IOException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
>   at 
> org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
>   at 
> org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:126)
>   at org.apache.tez.mapreduce.lib.MRReader

[jira] [Created] (TEZ-2258) Spurious logging if dag is not running when getDAGStatus is invoked.

2015-03-31 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-2258:


 Summary: Spurious logging if dag is not running when getDAGStatus 
is invoked. 
 Key: TEZ-2258
 URL: https://issues.apache.org/jira/browse/TEZ-2258
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.6.0, 0.5.0
Reporter: Hitesh Shah


Remove spurious logging: 

2015-03-31 13:50:14,468 INFO [IPC Server handler 0 on 59179] ipc.Server: IPC 
Server handler 0 on 59179, call 
org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus 
from 127.0.0.1:59188 Call#123 Retry#0
org.apache.tez.dag.api.TezException: No running dag at present
at 
org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:89)
at 
org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:156)
at 
org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:95)
at 
org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7465)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)






Failed: TEZ-2225 PreCommit Build #374

2015-03-31 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2225
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/374/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2823 lines...]



{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12708513/TEZ-2225.1.patch
  against master revision 3bfe003.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/374//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/374//artifact/patchprocess/newPatchFindbugsWarningstez-mapreduce.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/374//artifact/patchprocess/newPatchFindbugsWarningstez-api.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/374//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
09c35bd3dc62ff73d65102ed323d41e5cf18cb23 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #373
Archived 44 artifacts
Archive block size is 32768
Received 0 blocks and 2745515 bytes
Compression is 0.0%
Took 3.2 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-2225) Remove instances of LOG.isDebugEnabled

2015-03-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389448#comment-14389448
 ] 

Hadoop QA commented on TEZ-2225:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12708513/TEZ-2225.1.patch
  against master revision 3bfe003.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/374//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/374//artifact/patchprocess/newPatchFindbugsWarningstez-mapreduce.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/374//artifact/patchprocess/newPatchFindbugsWarningstez-api.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/374//console

This message is automatically generated.

> Remove instances of LOG.isDebugEnabled
> --
>
> Key: TEZ-2225
> URL: https://issues.apache.org/jira/browse/TEZ-2225
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Vasanth kumar RJ
>Assignee: Vasanth kumar RJ
>Priority: Minor
>  Labels: performance
> Attachments: TEZ-2225.1.patch
>
>
> Remove LOG.isDebugEnabled() and use parameterized debug logging
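
For illustration only, a minimal sketch of why parameterized logging can replace the guard. It uses java.util.logging as a self-contained stand-in (an assumption; the patch itself is not shown here and Tez uses a different logging facade), and the class and method names are hypothetical. The three methods count how often an expensive toString() runs while debug output is disabled.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ParameterizedLoggingSketch {

    /** Hypothetical object whose toString() is costly; counts how often it is rendered. */
    static final class Expensive {
        final AtomicInteger renders = new AtomicInteger();

        @Override
        public String toString() {
            renders.incrementAndGet();
            return "expensive-state";
        }
    }

    /** Returns a logger with debug (FINE) output disabled. */
    private static Logger quietLogger() {
        Logger log = Logger.getLogger("sketch.parameterized");
        log.setLevel(Level.INFO);
        return log;
    }

    /** Unguarded concatenation: the message is built even though it is never logged. */
    static int eagerRenders() {
        Logger log = quietLogger();
        Expensive state = new Expensive();
        log.fine("State: " + state);
        return state.renders.get();
    }

    /** Explicit guard: the style the issue proposes to remove. */
    static int guardedRenders() {
        Logger log = quietLogger();
        Expensive state = new Expensive();
        if (log.isLoggable(Level.FINE)) {
            log.fine("State: " + state);
        }
        return state.renders.get();
    }

    /** Parameterized call: the argument is only formatted if the record is published. */
    static int parameterizedRenders() {
        Logger log = quietLogger();
        Expensive state = new Expensive();
        log.log(Level.FINE, "State: {0}", state);
        return state.renders.get();
    }

    public static void main(String[] args) {
        System.out.println("eager: " + eagerRenders());
        System.out.println("guarded: " + guardedRenders());
        System.out.println("parameterized: " + parameterizedRenders());
    }
}
```

The parameterized form only saves work when the arguments themselves are cheap to obtain; if computing an argument is expensive, a guard may still be worth keeping.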





Failed: TEZ-2257 PreCommit Build #375

2015-03-31 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2257
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/375/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2138 lines...]

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12708512/TEZ-2257.1.txt
  against master revision 3bfe003.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
   org.apache.tez.runtime.library.common.shuffle.TestFetcher

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/375//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/375//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
b11f36fa539d4bd73974545e503e3a410660c9a8 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #373
Archived 44 artifacts
Archive block size is 32768
Received 4 blocks and 2554699 bytes
Compression is 4.9%
Took 2.2 sec
[description-setter] Could not determine description.
Recording test results
Publish JUnit test result report is waiting for a checkpoint on 
PreCommit-TEZ-Build #374
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
1 tests failed.
REGRESSION:  
org.apache.tez.runtime.library.common.shuffle.TestFetcher.testSetupLocalDiskFetch

Error Message:
test timed out after 3000 milliseconds

Stack Trace:
java.lang.Exception: test timed out after 3000 milliseconds
at java.net.PlainDatagramSocketImpl.receive0(Native Method)
at 
java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:145)
at java.net.DatagramSocket.receive(DatagramSocket.java:786)
at com.sun.jndi.dns.DnsClient.doUdpQuery(DnsClient.java:416)
at com.sun.jndi.dns.DnsClient.query(DnsClient.java:210)
at com.sun.jndi.dns.Resolver.query(Resolver.java:81)
at com.sun.jndi.dns.DnsContext.c_getAttributes(DnsContext.java:430)
at 
com.sun.jndi.toolkit.ctx.ComponentDirContext.p_getAttributes(ComponentDirContext.java:231)
at 
com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.getAttributes(PartialCompositeDirContext.java:139)
at 
com.sun.jndi.toolkit.url.GenericURLDirContext.getAttributes(GenericURLDirContext.java:103)
at 
sun.security.krb5.KrbServiceLocator.getKerberosService(KrbServiceLocator.java:87)
at sun.security.krb5.Config.checkRealm(Config.java:1295)
at sun.security.krb5.Config.getRealmFromDNS(Config.java:1268)
at sun.security.krb5.Config.getDefaultRealm(Config.java:1162)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:84)
at 
org.apache.hadoop.security.authentication.util.KerberosName.(KerberosName.java:86)
at 
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroup

Failed: TEZ-2216 PreCommit Build #376

2015-03-31 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2216
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/376/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2753 lines...]



{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12708515/TEZ-2216.1.patch
  against master revision 3bfe003.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/376//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/376//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
1d042a3002c9e74d441a0a5335b16b5d3dcd614a logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #373
Archived 44 artifacts
Archive block size is 32768
Received 20 blocks and 2071650 bytes
Compression is 24.0%
Took 1.4 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-2216) Expose errors during AM initialization

2015-03-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389460#comment-14389460
 ] 

Hadoop QA commented on TEZ-2216:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12708515/TEZ-2216.1.patch
  against master revision 3bfe003.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/376//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/376//console

This message is automatically generated.

> Expose errors during AM initialization
> --
>
> Key: TEZ-2216
> URL: https://issues.apache.org/jira/browse/TEZ-2216
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Bikas Saha
> Attachments: TEZ-2216.1.patch
>
>
> If there are bad configs or other issues that cause errors/exceptions during 
> AM initialization (eg. during service init) then those errors are not exposed 
> to the user. Exposing them would be useful in quickly debugging such issues.





[jira] [Commented] (TEZ-2259) Push additional data to Timeline for Recovery for better consumption in UI

2015-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389481#comment-14389481
 ] 

Hitesh Shah commented on TEZ-2259:
--

\cc [~zjffdu]

> Push additional data to Timeline for Recovery for better consumption in UI
> --
>
> Key: TEZ-2259
> URL: https://issues.apache.org/jira/browse/TEZ-2259
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>
> Some things I can think of: 
>  
>- applicationAttemptId in which the dag was submitted
>- appAttemptId in which the dag was completed 
> The above implicitly shows how many app attempts the DAG spanned 
> (and therefore how many times it was recovered).
>   
>- Maybe an implicit event mentioning that the DAG was recovered and in 
> which attempt it was recovered. Possibly add information on what state was 
> recovered?





[jira] [Created] (TEZ-2259) Push additional data to Timeline for Recovery for better consumption in UI

2015-03-31 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-2259:


 Summary: Push additional data to Timeline for Recovery for better 
consumption in UI
 Key: TEZ-2259
 URL: https://issues.apache.org/jira/browse/TEZ-2259
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Hitesh Shah


Some things I can think of: 
 
   - applicationAttemptId in which the dag was submitted
   - appAttemptId in which the dag was completed 

The above implicitly shows how many app attempts the DAG spanned 
(and therefore how many times it was recovered).
  
   - Maybe an implicit event mentioning that the DAG was recovered and in which 
attempt it was recovered. Possibly add information on what state was recovered?







[jira] [Comment Edited] (TEZ-2259) Push additional data to Timeline for Recovery for better consumption in UI

2015-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389481#comment-14389481
 ] 

Hitesh Shah edited comment on TEZ-2259 at 3/31/15 9:53 PM:
---

\cc [~zjffdu] [~Sreenath] [~pramachandran]



was (Author: hitesh):
\cc [~zjffdu]

> Push additional data to Timeline for Recovery for better consumption in UI
> --
>
> Key: TEZ-2259
> URL: https://issues.apache.org/jira/browse/TEZ-2259
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>
> Some things I can think of: 
>  
>- applicationAttemptId in which the dag was submitted
>- appAttemptId in which the dag was completed 
> The above implicitly shows how many app attempts the DAG spanned 
> (and therefore how many times it was recovered).
>   
>- Maybe an implicit event mentioning that the DAG was recovered and in 
> which attempt it was recovered. Possibly add information on what state was 
> recovered?





[jira] [Commented] (TEZ-2259) Push additional data to Timeline for Recovery for better consumption in UI

2015-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389485#comment-14389485
 ] 

Hitesh Shah commented on TEZ-2259:
--

FWIW, once attempt ids are known, the DAG logs can be obtained by finding the log 
links in the data obtained from "http://RM:8088/ws/v1/cluster/apps//appattempts" 
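
A minimal sketch of building that request URL, assuming the application id sits between "apps/" and "/appattempts" (the id appears to have been dropped from the URL quoted above). The class name, method name and the example id are hypothetical and only for illustration.

```java
import java.net.URI;

// Hypothetical helper; not part of Tez or YARN.
final class RmAppAttemptsUrlSketch {

    /**
     * Builds the app-attempts URL for one application.
     * rmHostPort is the RM web address, e.g. "RM:8088" as in the comment above.
     * applicationId placement in the path is an assumption.
     */
    static URI appAttemptsUri(String rmHostPort, String applicationId) {
        if (rmHostPort == null || rmHostPort.isEmpty()
                || applicationId == null || applicationId.isEmpty()) {
            throw new IllegalArgumentException("RM address and application id are required");
        }
        return URI.create("http://" + rmHostPort
                + "/ws/v1/cluster/apps/" + applicationId + "/appattempts");
    }

    public static void main(String[] args) {
        // The application id below is an example value only.
        System.out.println(appAttemptsUri("RM:8088", "application_1427850436467_0007"));
    }
}
```

The response would then have to be searched for the per-attempt log links; that part is left out here because it depends on the response format.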

> Push additional data to Timeline for Recovery for better consumption in UI
> --
>
> Key: TEZ-2259
> URL: https://issues.apache.org/jira/browse/TEZ-2259
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>
> Some things I can think of: 
>  
>- applicationAttemptId in which the dag was submitted
>- appAttemptId in which the dag was completed 
> Above provides implicit information on how many app attempts the dag spanned 
> ( and therefore recovered how many times ).
>   
>- Maybe an implicit event mentioning that the DAG was recovered and in 
> which attempt it was recovered. Possibly add information on what state was 
> recovered?





[jira] [Commented] (TEZ-2216) Expose errors during AM initialization

2015-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389498#comment-14389498
 ] 

Hitesh Shah commented on TEZ-2216:
--

[~lichangleo] The problem statement is as follows: 

The DAGAppMaster class can be thought of as a composite service consisting of 
multiple services. The YarnSchedulerService is one of them. Today, if any 
service fails to init or start, the AM fails without fully unregistering from 
the RM. 

The objective here is to first pinpoint the error, i.e. which service failed to 
come up and why. Then, start the yarn scheduler service if it has not 
been started, and finally use it to unregister with the status set to failed 
and the diagnostics pointing to the error traced earlier (which service failed 
to init/start).
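
The flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: Service, SchedulerService, StartupResult and RecordingScheduler are hypothetical stand-ins, not the Tez or YARN classes, and the "FAILED" status string is a placeholder.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types are the real Tez or YARN classes.
final class AmStartupSketch {

    /** Minimal stand-in for one service inside a composite service. */
    interface Service {
        String name();
        void start() throws Exception;
    }

    /** Stand-in for the scheduler service, which can also unregister from the RM. */
    interface SchedulerService extends Service {
        void unregister(String finalStatus, String diagnostics);
    }

    /** Outcome of a startup attempt. */
    static final class StartupResult {
        final boolean succeeded;
        final String diagnostics;    // null when startup succeeded
        final boolean unregistered;  // true if the failure was reported via the scheduler

        StartupResult(boolean succeeded, String diagnostics, boolean unregistered) {
            this.succeeded = succeeded;
            this.diagnostics = diagnostics;
            this.unregistered = unregistered;
        }
    }

    /** Scheduler fake that records what it was asked to report. */
    static final class RecordingScheduler implements SchedulerService {
        boolean started;
        String reportedStatus;
        String reportedDiagnostics;

        @Override public String name() { return "scheduler"; }
        @Override public void start() { started = true; }
        @Override public void unregister(String finalStatus, String diagnostics) {
            reportedStatus = finalStatus;
            reportedDiagnostics = diagnostics;
        }
    }

    /** Starts services in order; on the first failure, records which one failed and why. */
    static StartupResult startAll(List<Service> services, SchedulerService scheduler) {
        List<Service> started = new ArrayList<>();
        for (Service service : services) {
            try {
                service.start();
                started.add(service);
            } catch (Exception e) {
                String diagnostics = "Service " + service.name() + " failed to start: " + e;
                return reportFailure(service, started, scheduler, diagnostics);
            }
        }
        return new StartupResult(true, null, false);
    }

    private static StartupResult reportFailure(Service failed, List<Service> started,
                                               SchedulerService scheduler, String diagnostics) {
        if (failed == scheduler) {
            // The scheduler itself is broken, so it cannot be used to unregister.
            return new StartupResult(false, diagnostics, false);
        }
        try {
            if (!started.contains(scheduler)) {
                scheduler.start(); // bring it up only so the failure can be reported
            }
            scheduler.unregister("FAILED", diagnostics);
            return new StartupResult(false, diagnostics, true);
        } catch (Exception e) {
            return new StartupResult(false, diagnostics + "; scheduler also failed: " + e, false);
        }
    }
}
```

The sketch only covers start failures; init failures would need the same treatment, and the case where the scheduler itself fails still needs some other way to surface the diagnostics.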

 





> Expose errors during AM initialization
> --
>
> Key: TEZ-2216
> URL: https://issues.apache.org/jira/browse/TEZ-2216
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Bikas Saha
> Attachments: TEZ-2216.1.patch
>
>
> If there are bad configs or other issues that cause errors/exceptions during 
> AM initialization (eg. during service init) then those errors are not exposed 
> to the user. Exposing them would be useful in quickly debugging such issues.





[jira] [Commented] (TEZ-2216) Expose errors during AM initialization

2015-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389503#comment-14389503
 ] 

Hitesh Shah commented on TEZ-2216:
--

Obviously, there also needs to be special case handling if YarnSchedulerService 
is the one having problems. 

> Expose errors during AM initialization
> --
>
> Key: TEZ-2216
> URL: https://issues.apache.org/jira/browse/TEZ-2216
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Bikas Saha
> Attachments: TEZ-2216.1.patch
>
>
> If there are bad configs or other issues that cause errors/exceptions during 
> AM initialization (eg. during service init) then those errors are not exposed 
> to the user. Exposing them would be useful in quickly debugging such issues.





[jira] [Commented] (TEZ-2231) Create project by-laws

2015-03-31 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389589#comment-14389589
 ] 

Siddharth Seth commented on TEZ-2231:
-

Thanks

bq. Could you explain? Maybe we should switch to a longer vote to avoid 
differences in that case?
Weekends differ from country to country.
I think a 3-day vote should be fine, but I am fine with a longer period.

> Create project by-laws
> --
>
> Key: TEZ-2231
> URL: https://issues.apache.org/jira/browse/TEZ-2231
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
> Attachments: by-laws.2.patch, by-laws.3.patch, by-laws.patch
>
>
> Define the Project by-laws.





[jira] [Commented] (TEZ-2257) NPEs in TaskReporter

2015-03-31 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389607#comment-14389607
 ] 

Siddharth Seth commented on TEZ-2257:
-

Updating the target versions.
We've never seen this (yet) in regular containers - it's absolutely possible 
though.

> NPEs in TaskReporter
> 
>
> Key: TEZ-2257
> URL: https://issues.apache.org/jira/browse/TEZ-2257
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-2257.1.txt
>
>
> The task reporter can end up throwing NPEs when adding events, reporting 
> exceptions or marking a task as complete.
> currentCallable causes this.
> {code}
> 15/02/23 15:31:28 [TezChild] INFO task.TezTaskRunner : Encounted an error 
> while executing task: attempt_1424727586401_0019_1_00_00_0
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.IOException: java.lang.InterruptedException
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:186)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:171)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:166)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.IOException: java.lang.InterruptedException
>   at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:71)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:292)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:163)
>   ... 13 more
> Caused by: java.io.IOException: java.io.IOException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
>   at 
> org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
>   at 
> org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:126)
>   at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
>   ... 15 more
> Caused by: java.io.IOException: java.lang.InterruptedException
>   at 
> org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:146)
>   at 
> org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:87)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
>   ... 21 more
> Caused by: java.lang.InterruptedException
>   at java.lang.Object.wait(Native Method)
>   at 
> org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.nextCvb(LlapInputFormat.java:163)
>   at 
> org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:132)
>   ... 23 more
> 15/02/23 15:31:28 [TezChild] INFO task.TezTaskRunner : Ignoring the following 
> exception since a previous exception is already registered
> java.lang.NullPointerException
>   at 
> org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.access$300(TaskReporter.java:120)
>   at 
> org.apache.tez.runtime.task.TaskReporter.

[jira] [Updated] (TEZ-2257) NPEs in TaskReporter

2015-03-31 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-2257:

Target Version/s: 0.5.4  (was: 0.7.0)

> NPEs in TaskReporter
> 
>
> Key: TEZ-2257
> URL: https://issues.apache.org/jira/browse/TEZ-2257
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-2257.1.txt
>
>
> The task reporter can end up throwing NPEs when adding events, reporting 
> exceptions or marking a task as complete.
> currentCallable causes this.
> {code}
> 15/02/23 15:31:28 [TezChild] INFO task.TezTaskRunner : Encounted an error 
> while executing task: attempt_1424727586401_0019_1_00_00_0
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.IOException: java.lang.InterruptedException
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:186)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:171)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:166)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.IOException: java.lang.InterruptedException
>   at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:71)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:292)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:163)
>   ... 13 more
> Caused by: java.io.IOException: java.io.IOException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
>   at 
> org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
>   at 
> org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:126)
>   at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
>   ... 15 more
> Caused by: java.io.IOException: java.lang.InterruptedException
>   at 
> org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:146)
>   at 
> org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:87)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
>   ... 21 more
> Caused by: java.lang.InterruptedException
>   at java.lang.Object.wait(Native Method)
>   at 
> org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.nextCvb(LlapInputFormat.java:163)
>   at 
> org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:132)
>   ... 23 more
> 15/02/23 15:31:28 [TezChild] INFO task.TezTaskRunner : Ignoring the following 
> exception since a previous exception is already registered
> java.lang.NullPointerException
>   at 
> org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.access$300(TaskReporter.java:120)
>   at 
> org.apache.tez.runtime.task.TaskReporter.taskFailed(TaskReporter.java:382)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner.sendFailure(TezTaskRunner.java:260)
>   at 
> 

[jira] [Commented] (TEZ-2258) Spurious logging if dag is not running when getDAGStatus is invoked.

2015-03-31 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389613#comment-14389613
 ] 

Siddharth Seth commented on TEZ-2258:
-

Dupe of TEZ-1961 ?

> Spurious logging if dag is not running when getDAGStatus is invoked. 
> -
>
> Key: TEZ-2258
> URL: https://issues.apache.org/jira/browse/TEZ-2258
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.0, 0.6.0
>Reporter: Hitesh Shah
>
> Remove spurious logging: 
> 2015-03-31 13:50:14,468 INFO [IPC Server handler 0 on 59179] ipc.Server: IPC 
> Server handler 0 on 59179, call 
> org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus 
> from 127.0.0.1:59188 Call#123 Retry#0
> org.apache.tez.dag.api.TezException: No running dag at present
>   at 
> org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:89)
>   at 
> org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:156)
>   at 
> org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:95)
>   at 
> org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7465)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)





[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-31 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389698#comment-14389698
 ] 

Jeff Zhang commented on TEZ-2224:
-

[~hitesh] I am looking at this. The build failure may be due to TEZ-2149 or this 
ticket.

> EventQueue empty doesn't mean events are consumed in RecoveryService
> 
>
> Key: TEZ-2224
> URL: https://issues.apache.org/jira/browse/TEZ-2224
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 0.7.0, 0.5.4, 0.6.1
>
> Attachments: TEZ-2224-1.patch, TEZ-2224-2-addendum.patch, 
> TEZ-2224-2.patch
>
>
> If the event queue is empty, an event may still be in processing. This should 
> be fixed the way it is in AsyncDispatcher.
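
A minimal sketch of the problem in the description above, assuming a hypothetical single-consumer queue wrapper. It counts events that are queued or still being handled, so "drained" means the handler has finished and not merely that the queue is empty. This uses a pending counter for illustration and is not a copy of what AsyncDispatcher or RecoveryService actually do.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch; not the RecoveryService or AsyncDispatcher implementation.
final class DrainAwareQueueSketch<E> {
    private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
    private final Object lock = new Object();
    private final Consumer<E> handler;
    private final Thread worker;
    private int pending = 0; // events queued or currently being handled; guarded by lock

    DrainAwareQueueSketch(Consumer<E> handler) {
        this.handler = handler;
        this.worker = new Thread(this::loop, "event-handling-thread");
        this.worker.setDaemon(true);
    }

    void start() {
        worker.start();
    }

    void stop() {
        worker.interrupt();
    }

    void enqueue(E event) {
        synchronized (lock) {
            pending++; // counted before it becomes visible to the worker
        }
        queue.add(event);
    }

    /** True as soon as the worker has taken the last event, even if it is still handling it. */
    boolean isQueueEmpty() {
        return queue.isEmpty();
    }

    /** Waits until every enqueued event has been fully handled; false on timeout or interrupt. */
    boolean awaitDrained(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        synchronized (lock) {
            while (pending > 0) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false;
                }
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        }
    }

    private void loop() {
        try {
            while (true) {
                E event = queue.take();
                try {
                    handler.accept(event);
                } finally {
                    synchronized (lock) {
                        pending--; // only now is the event really consumed
                        if (pending == 0) {
                            lock.notifyAll();
                        }
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop() was called
        }
    }
}
```

A caller that needs all events consumed would wait on awaitDrained rather than poll isQueueEmpty, since the latter can be true while the last event is still being handled.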





[jira] [Resolved] (TEZ-2258) Spurious logging if dag is not running when getDAGStatus is invoked.

2015-03-31 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah resolved TEZ-2258.
--
Resolution: Duplicate

> Spurious logging if dag is not running when getDAGStatus is invoked. 
> -
>
> Key: TEZ-2258
> URL: https://issues.apache.org/jira/browse/TEZ-2258
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.0, 0.6.0
>Reporter: Hitesh Shah
>
> Remove spurious logging: 
> 2015-03-31 13:50:14,468 INFO [IPC Server handler 0 on 59179] ipc.Server: IPC 
> Server handler 0 on 59179, call 
> org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus 
> from 127.0.0.1:59188 Call#123 Retry#0
> org.apache.tez.dag.api.TezException: No running dag at present
>   at 
> org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:89)
>   at 
> org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:156)
>   at 
> org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:95)
>   at 
> org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7465)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)





[jira] [Comment Edited] (TEZ-2259) Push additional data to Timeline for Recovery for better consumption in UI

2015-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389765#comment-14389765
 ] 

Hitesh Shah edited comment on TEZ-2259 at 4/1/15 12:33 AM:
---

For a single attempt, the logs can be found via 
"http://RM:8088/ws/v1/cluster/apps/"


was (Author: hitesh):
For a single attempt, the logs can be found via 
http://RM:8088/ws/v1/cluster/apps/

> Push additional data to Timeline for Recovery for better consumption in UI
> --
>
> Key: TEZ-2259
> URL: https://issues.apache.org/jira/browse/TEZ-2259
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>
> Some things I can think of: 
>  
>- applicationAttemptId in which the dag was submitted
>- appAttemptId in which the dag was completed 
> Above provides implicit information on how many app attempts the dag spanned 
> ( and therefore recovered how many times ).
>   
>- Maybe an implicit event mentioning that the DAG was recovered and in 
> which attempt it was recovered. Possibly add information on what state was 
> recovered?





[jira] [Commented] (TEZ-2259) Push additional data to Timeline for Recovery for better consumption in UI

2015-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389765#comment-14389765
 ] 

Hitesh Shah commented on TEZ-2259:
--

For a single attempt, the logs can be found via 
http://RM:8088/ws/v1/cluster/apps/

> Push additional data to Timeline for Recovery for better consumption in UI
> --
>
> Key: TEZ-2259
> URL: https://issues.apache.org/jira/browse/TEZ-2259
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>
> Some things I can think of: 
>  
>- applicationAttemptId in which the dag was submitted
>- appAttemptId in which the dag was completed 
> Above provides implicit information on how many app attempts the dag spanned 
> ( and therefore recovered how many times ).
>   
>- Maybe an implicit event mentioning that the DAG was recovered and in 
> which attempt it was recovered. Possibly add information on what state was 
> recovered?





[jira] [Created] (TEZ-2260) AM been shutdown due to NoSuchMethodError in DAGProtos

2015-03-31 Thread Jeff Zhang (JIRA)
Jeff Zhang created TEZ-2260:
---

 Summary: AM been shutdown due to NoSuchMethodError in DAGProtos
 Key: TEZ-2260
 URL: https://issues.apache.org/jira/browse/TEZ-2260
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang


Not sure why this happens; it may be due to an environment issue.

{code}
2015-04-01 09:08:49,757 INFO [Dispatcher thread: Central] 
history.HistoryEventHandler: 
[HISTORY][DAG:dag_1427850436467_0007_1][Event:TASK_ATTEMPT_FINISHED]: 
vertexName=datagen, taskAttemptId=attempt_1427850436467_0007_1_00_00_0, 
startTime=1427850527981, finishTime=1427850529750, timeTaken=1769, 
status=SUCCEEDED, errorEnum=, diagnostics=, counters=Counters: 8, File System 
Counters, HDFS_BYTES_READ=0, HDFS_BYTES_WRITTEN=953030, HDFS_READ_OPS=9, 
HDFS_LARGE_READ_OPS=0, HDFS_WRITE_OPS=6, 
org.apache.tez.common.counters.TaskCounter, GC_TIME_MILLIS=46, 
COMMITTED_HEAP_BYTES=257425408, OUTPUT_RECORDS=44195
2015-04-01 09:08:49,757 FATAL [RecoveryEventHandlingThread] 
yarn.YarnUncaughtExceptionHandler: Thread 
Thread[RecoveryEventHandlingThread,5,main] threw an Error.  Shutting down now...
java.lang.NoSuchMethodError: 
org.apache.tez.dag.api.records.DAGProtos$TezCountersProto$Builder.access$26000()Lorg/apache/tez/dag/api/records/DAGProtos$TezCountersProto$Builder;
at 
org.apache.tez.dag.api.records.DAGProtos$TezCountersProto.newBuilder(DAGProtos.java:24581)
at 
org.apache.tez.dag.api.DagTypeConverters.convertTezCountersToProto(DagTypeConverters.java:544)
at 
org.apache.tez.dag.history.events.TaskAttemptFinishedEvent.toProto(TaskAttemptFinishedEvent.java:97)
at 
org.apache.tez.dag.history.events.TaskAttemptFinishedEvent.toProtoStream(TaskAttemptFinishedEvent.java:120)
at 
org.apache.tez.dag.history.recovery.RecoveryService.handleRecoveryEvent(RecoveryService.java:403)
at 
org.apache.tez.dag.history.recovery.RecoveryService.access$700(RecoveryService.java:50)
at 
org.apache.tez.dag.history.recovery.RecoveryService$1.run(RecoveryService.java:158)
at java.lang.Thread.run(Thread.java:745)
2015-04-01 09:08:49,757 INFO [Dispatcher thread: Central] impl.TaskAttemptImpl: 
attempt_1427850436467_0007_1_00_00_0 TaskAttempt Transitioned from RUNNING 
to SUCCEEDED due to event TA_DONE
{code}

This issue results in several subsequent issues, because this error causes the AM 
to recover in the next attempt. In the next attempt, however, it hits the following 
issue; it looks like a data node crashed.
{code}
2015-04-01 09:09:00,093 WARN [Thread-82] hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 
127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, 
and a client may configure this via 
'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
configuration.
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-04-01 09:09:00,093 WARN [Dispatcher thread: Central] hdfs.DFSClient: Error 
while syncing
java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 
127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, 
and a client may configure this via 
'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
configuration.
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-04-01 09:09:00,094 ERROR [Dispatcher thread: Central] 
recovery.RecoveryService: Error handling summary event, 
eventType=VERTEX_FINISHED
java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 
127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, 
and a client may configure this via 
'dfs.client.block.write.replace-datanode-on-fa

[jira] [Updated] (TEZ-2260) AM been shutdown due to NoSuchMethodError in DAGProtos

2015-03-31 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-2260:

Attachment: applog.tar

Attached the full app logs from running 
TestTezJobs.testSortMergeJoinExamplePipeline

> AM been shutdown due to NoSuchMethodError in DAGProtos
> --
>
> Key: TEZ-2260
> URL: https://issues.apache.org/jira/browse/TEZ-2260
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jeff Zhang
> Attachments: applog.tar
>
>
> Not sure why this happens; it may be due to an environment issue.
> {code}
> 2015-04-01 09:08:49,757 INFO [Dispatcher thread: Central] 
> history.HistoryEventHandler: 
> [HISTORY][DAG:dag_1427850436467_0007_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=datagen, taskAttemptId=attempt_1427850436467_0007_1_00_00_0, 
> startTime=1427850527981, finishTime=1427850529750, timeTaken=1769, 
> status=SUCCEEDED, errorEnum=, diagnostics=, counters=Counters: 8, File System 
> Counters, HDFS_BYTES_READ=0, HDFS_BYTES_WRITTEN=953030, HDFS_READ_OPS=9, 
> HDFS_LARGE_READ_OPS=0, HDFS_WRITE_OPS=6, 
> org.apache.tez.common.counters.TaskCounter, GC_TIME_MILLIS=46, 
> COMMITTED_HEAP_BYTES=257425408, OUTPUT_RECORDS=44195
> 2015-04-01 09:08:49,757 FATAL [RecoveryEventHandlingThread] 
> yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[RecoveryEventHandlingThread,5,main] threw an Error.  Shutting down 
> now...
> java.lang.NoSuchMethodError: 
> org.apache.tez.dag.api.records.DAGProtos$TezCountersProto$Builder.access$26000()Lorg/apache/tez/dag/api/records/DAGProtos$TezCountersProto$Builder;
>   at 
> org.apache.tez.dag.api.records.DAGProtos$TezCountersProto.newBuilder(DAGProtos.java:24581)
>   at 
> org.apache.tez.dag.api.DagTypeConverters.convertTezCountersToProto(DagTypeConverters.java:544)
>   at 
> org.apache.tez.dag.history.events.TaskAttemptFinishedEvent.toProto(TaskAttemptFinishedEvent.java:97)
>   at 
> org.apache.tez.dag.history.events.TaskAttemptFinishedEvent.toProtoStream(TaskAttemptFinishedEvent.java:120)
>   at 
> org.apache.tez.dag.history.recovery.RecoveryService.handleRecoveryEvent(RecoveryService.java:403)
>   at 
> org.apache.tez.dag.history.recovery.RecoveryService.access$700(RecoveryService.java:50)
>   at 
> org.apache.tez.dag.history.recovery.RecoveryService$1.run(RecoveryService.java:158)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-04-01 09:08:49,757 INFO [Dispatcher thread: Central] 
> impl.TaskAttemptImpl: attempt_1427850436467_0007_1_00_00_0 TaskAttempt 
> Transitioned from RUNNING to SUCCEEDED due to event TA_DONE
> {code}
> This issue results in several subsequent issues, because this error causes the 
> AM to recover in the next attempt. In the next attempt, however, it hits the 
> following issue; it looks like a data node crashed.
> {code}
> 2015-04-01 09:09:00,093 WARN [Thread-82] hdfs.DFSClient: DataStreamer 
> Exception
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 
> 127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, 
> and a client may configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
> 2015-04-01 09:09:00,093 WARN [Dispatcher thread: Central] hdfs.DFSClient: 
> Error while syncing
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 
> 127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, 
> and a client may configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
> 2015-04-01 09:09:00,094 ERROR [Dispatcher thread: Central] 
> recovery.RecoveryServic

[jira] [Commented] (TEZ-2260) AM been shutdown due to NoSuchMethodError in DAGProtos

2015-03-31 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389882#comment-14389882
 ] 

Jeff Zhang commented on TEZ-2260:
-

Created TEZ-2261 to add diagnostics in DAGAppMaster when a recovery error 
happens.

> AM been shutdown due to NoSuchMethodError in DAGProtos
> --
>
> Key: TEZ-2260
> URL: https://issues.apache.org/jira/browse/TEZ-2260
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jeff Zhang
> Attachments: applog.tar
>
>
> Not sure why this happens; it may be due to an environment issue.
> {code}
> 2015-04-01 09:08:49,757 INFO [Dispatcher thread: Central] 
> history.HistoryEventHandler: 
> [HISTORY][DAG:dag_1427850436467_0007_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=datagen, taskAttemptId=attempt_1427850436467_0007_1_00_00_0, 
> startTime=1427850527981, finishTime=1427850529750, timeTaken=1769, 
> status=SUCCEEDED, errorEnum=, diagnostics=, counters=Counters: 8, File System 
> Counters, HDFS_BYTES_READ=0, HDFS_BYTES_WRITTEN=953030, HDFS_READ_OPS=9, 
> HDFS_LARGE_READ_OPS=0, HDFS_WRITE_OPS=6, 
> org.apache.tez.common.counters.TaskCounter, GC_TIME_MILLIS=46, 
> COMMITTED_HEAP_BYTES=257425408, OUTPUT_RECORDS=44195
> 2015-04-01 09:08:49,757 FATAL [RecoveryEventHandlingThread] 
> yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[RecoveryEventHandlingThread,5,main] threw an Error.  Shutting down 
> now...
> java.lang.NoSuchMethodError: 
> org.apache.tez.dag.api.records.DAGProtos$TezCountersProto$Builder.access$26000()Lorg/apache/tez/dag/api/records/DAGProtos$TezCountersProto$Builder;
>   at 
> org.apache.tez.dag.api.records.DAGProtos$TezCountersProto.newBuilder(DAGProtos.java:24581)
>   at 
> org.apache.tez.dag.api.DagTypeConverters.convertTezCountersToProto(DagTypeConverters.java:544)
>   at 
> org.apache.tez.dag.history.events.TaskAttemptFinishedEvent.toProto(TaskAttemptFinishedEvent.java:97)
>   at 
> org.apache.tez.dag.history.events.TaskAttemptFinishedEvent.toProtoStream(TaskAttemptFinishedEvent.java:120)
>   at 
> org.apache.tez.dag.history.recovery.RecoveryService.handleRecoveryEvent(RecoveryService.java:403)
>   at 
> org.apache.tez.dag.history.recovery.RecoveryService.access$700(RecoveryService.java:50)
>   at 
> org.apache.tez.dag.history.recovery.RecoveryService$1.run(RecoveryService.java:158)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-04-01 09:08:49,757 INFO [Dispatcher thread: Central] 
> impl.TaskAttemptImpl: attempt_1427850436467_0007_1_00_00_0 TaskAttempt 
> Transitioned from RUNNING to SUCCEEDED due to event TA_DONE
> {code}
> This issue leads to several follow-on issues: the error causes the AM to 
> recover in its next attempt, but that attempt then hits the following 
> issue; it looks like a data node crashed.
> {code}
> 2015-04-01 09:09:00,093 WARN [Thread-82] hdfs.DFSClient: DataStreamer 
> Exception
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 
> 127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, 
> and a client may configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
> 2015-04-01 09:09:00,093 WARN [Dispatcher thread: Central] hdfs.DFSClient: 
> Error while syncing
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 
> 127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, 
> and a client may configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
> 2015-04-01 09:09:00,094 ERROR [Dispatcher thread: Centr

[jira] [Created] (TEZ-2261) Should add diagnostics in DAGAppMaster when recovery error happens

2015-03-31 Thread Jeff Zhang (JIRA)

Jeff Zhang created TEZ-2261:
---

 Summary: Should add diagnostics in DAGAppMaster when recovery 
error happens
 Key: TEZ-2261
 URL: https://issues.apache.org/jira/browse/TEZ-2261
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang


Diagnostics should be added in DAGAppMaster when a recovery error happens; 
otherwise the AM is shut down and the next DAG submission just throws 
SessionNotRunningException, which would confuse users.

{code}
if (this.historyEventHandler.hasRecoveryFailed()) {
  LOG.warn("Recovery had a fatal error, shutting down session after" +
      " DAG completion");
  sessionStopped.set(true);
}
{code}
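
One possible shape for the requested change, as a minimal sketch only: the class, method and message text below are hypothetical stand-ins and not the actual DAGAppMaster code. The idea is to record a diagnostic string at the same point where the session is flagged as stopped, so that it can later be reported to the client rather than only logged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the session state kept by the AM; not Tez code. */
public class RecoveryDiagnosticsSketch {
  private final AtomicBoolean sessionStopped = new AtomicBoolean(false);
  private final List<String> diagnostics = new ArrayList<>();

  /** Called after a DAG completes; recoveryFailed mirrors hasRecoveryFailed(). */
  public void onDagCompleted(boolean recoveryFailed) {
    if (recoveryFailed) {
      // Record why the session is stopping before setting the stop flag.
      diagnostics.add("Session shut down: recovery had a fatal error");
      sessionStopped.set(true);
    }
  }

  public boolean isSessionStopped() {
    return sessionStopped.get();
  }

  /** Returns a copy so callers cannot mutate the recorded diagnostics. */
  public List<String> getDiagnostics() {
    return new ArrayList<>(diagnostics);
  }

  public static void main(String[] args) {
    RecoveryDiagnosticsSketch am = new RecoveryDiagnosticsSketch();
    am.onDagCompleted(true);
    System.out.println(am.isSessionStopped() + " " + am.getDiagnostics());
  }
}
```

How the recorded diagnostics would be surfaced to a client that later receives SessionNotRunningException is left open here.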





[jira] [Commented] (TEZ-145) Support a combiner processor that can run non-local to map/reduce nodes

2015-03-31 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390034#comment-14390034
 ] 

Tsuyoshi Ozawa commented on TEZ-145:


@Gopal V [~bikassaha] How can we move forward with this issue? I'll implement 
the missing parts if needed.

> Support a combiner processor that can run non-local to map/reduce nodes
> ---
>
> Key: TEZ-145
> URL: https://issues.apache.org/jira/browse/TEZ-145
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Tsuyoshi Ozawa
> Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch
>
>
> For aggregate operators that can benefit from running in multi-level trees, 
> supporting a combiner that runs in a non-local mode would allow 
> performance efficiencies to be gained by running a combiner at rack level.





[jira] [Commented] (TEZ-2237) Complex DAG freezes and fails (was BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers)

2015-03-31 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390095#comment-14390095
 ] 

Cyrille Chépélov commented on TEZ-2237:
---

Update:
# _run again using TEZ branch-0.6 as of 
974588e180ab53ea3e7243f2dea29a5d8ef2416d ("TEZ-2240"), cascading-3.0.0-wip-92_: 
** started yesterday at 09:30 local
** intense activity until about 10:20, followed by a marked drop until 12:15 
(multiple DAGs executed and completed during that time); another intense burst 
until 12:30, then the system pretty much "fell asleep".
** at 17:30 the system was still asleep
** at 01:09 today, the DAG's attempt #1 got killed (this is at about the 
24-hour "AMRM key regeneration cycle" time, given the switch to summer time 
this past weekend, and consistent with the 00:09 in my reports about the 
previous week).
** Attempt #2 started at 01:09. Activity was intense until 04:00, then subsided, 
until the system fell asleep at about 04:30.
** Attempt #2's last message is: 
{noformat}
2015-04-01 04:27:53,824 INFO [TezChild] element.TezBoundaryStage: calling 
UnorderedKVInput#start() on: Boundary(ECCC5DB0C5C04B2EBED0FC3187C8487A) 
ECCC5DB0C5C04B2EBED0FC3187C8487A
2015-04-01 04:27:53,824 INFO [TezChild] element.TezGroupGate: calling 
OrderedGroupedKVInput#start() on: GroupBy(_pipe_332+_pipe_333)[by:[{1}:'key']] 
DEF94DA9BECF4A5BA6C85388B1EAAD41, for 1 inputs
2015-04-01 04:27:53,824 INFO [TezChild] input.OrderedGroupedKVInput: 
OrderedGroupedKVInput#start(): OrderedGroupedKVInput was already started, not 
starting again!
{noformat}

I'll now:
# kill the DAG, extract the yarn-logs files and put them up here
# try again with ("tez.am.dag.scheduler.class" -> 
"org.apache.tez.dag.app.dag.impl.DAGSchedulerNaturalOrderControlled") 

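The override mentioned in the second step can be sketched as a plain key/value setting. This is only an illustration: `java.util.Properties` stands in for whatever job configuration object is actually used, and the class and method names are hypothetical; only the property key and the scheduler class name come from the comment above.

```java
import java.util.Properties;

/** Sketch only: java.util.Properties stands in for the job configuration. */
public class SchedulerOverrideSketch {
  // Both strings are taken verbatim from the comment above.
  static final String SCHEDULER_KEY = "tez.am.dag.scheduler.class";
  static final String SCHEDULER_CLASS =
      "org.apache.tez.dag.app.dag.impl.DAGSchedulerNaturalOrderControlled";

  /** Returns a copy of base with the DAG scheduler class overridden. */
  static Properties withNaturalOrderScheduler(Properties base) {
    Properties copy = new Properties();
    copy.putAll(base);
    copy.setProperty(SCHEDULER_KEY, SCHEDULER_CLASS);
    return copy;
  }

  public static void main(String[] args) {
    Properties conf = withNaturalOrderScheduler(new Properties());
    System.out.println(SCHEDULER_KEY + "=" + conf.getProperty(SCHEDULER_KEY));
  }
}
```
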


> Complex DAG freezes and fails (was BufferTooSmallException raised in 
> UnorderedPartitionedKVWriter then DAG lingers)
> ---
>
> Key: TEZ-2237
> URL: https://issues.apache.org/jira/browse/TEZ-2237
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.6.0
> Environment: Debian Linux "jessie"
> OpenJDK Runtime Environment (build 1.8.0_40-internal-b27)
> OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)
> 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system 
> disk + 4*1 or 2 TiB HDD for HDFS & local  (on-prem, dedicated hardware)
> Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220 
> to run Cascading 3.0.0-wip-90 with TEZ 0.6.0
>Reporter: Cyrille Chépélov
> Attachments: all_stacks.lst, alloc_mem.png, alloc_vcores.png, 
> application_142732418_1444.yarn-logs.red.txt.gz, 
> appmastersyslog_dag_1427282048097_0215_1.red.txt.gz, 
> appmastersyslog_dag_1427282048097_0237_1.red.txt.gz, 
> gc_count_MRAppMaster.png, mem_free.png, ordered-grouped-kv-input-traces.diff, 
> start_containers.png, stop_containers.png, 
> syslog_attempt_1427282048097_0215_1_21_14_0.red.txt.gz, 
> syslog_attempt_1427282048097_0237_1_70_28_0.red.txt.gz, yarn_rm_flips.png
>
>
> On a specific DAG with many vertices (actually part of a larger meta-DAG), 
> after about an hour of processing, several BufferTooSmallExceptions are raised 
> in UnorderedPartitionedKVWriter (about one every two or three spills).
> Once these exceptions are raised, the DAG remains indefinitely "active", 
> tying up memory and CPU resources as far as YARN is concerned, while little 
> if any actual processing takes place. 
> It seems two separate issues are at hand:
>   1. BufferTooSmallExceptions are raised even though, small as the actually 
> allocated buffers seem to be (around a couple of megabytes were allotted whereas 
> 100MiB were requested), the actual keys and values are never bigger than 24 
> and 1024 bytes respectively.
>   2. Once BufferTooSmallExceptions are raised, the DAG fails to stop 
> (stop requests appear to be sent 7 hours after the BTSE exceptions are 
> raised, but 9 hours after these stop requests, the DAG was still lingering on 
> with all containers present, tying up memory and CPU allocations).
> The emergence of the BTSEs prevents the Cascade from completing, which in turn 
> prevents validating the results against traditional MR1-based results. The lack 
> of conclusion renders the cluster queue unavailable.


