[jira] [Commented] (TEZ-394) Better scheduling for uneven DAGs

2019-01-04 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734580#comment-16734580
 ] 

Jason Lowe commented on TEZ-394:


Attaching a new patch that makes this behavior configurable and disabled by 
default.  This avoids the bad preemption behavior that Gopal encountered when 
running with the default YARN task scheduler but allows users to enable it in 
conjunction with a DAG-aware task scheduler like DagAwareYarnTaskScheduler.

> Better scheduling for uneven DAGs
> -
>
> Key: TEZ-394
> URL: https://issues.apache.org/jira/browse/TEZ-394
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Rohini Palaniswamy
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-394.001.patch, TEZ-394.002.patch, TEZ-394.003.patch, 
> TEZ-394.004.patch
>
>
>   Consider a series of joins or group-bys on dataset A with a few datasets 
> that takes 10 hours, followed by a final join with a dataset X. The vertex 
> that loads dataset X will be one of the top vertices and is initialized early 
> even though its output is not consumed until the end, 10 hours later. 
> 1) Could either use delayed-start logic for better resource allocation.
> 2) Else, if they are started upfront, need to handle failure/recovery cases 
> where the nodes that executed the MapTask might have gone down by the time 
> the final join happens. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-394) Better scheduling for uneven DAGs

2019-01-04 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-394:
---
Attachment: TEZ-394.004.patch

> Better scheduling for uneven DAGs
> -
>
> Key: TEZ-394
> URL: https://issues.apache.org/jira/browse/TEZ-394
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Rohini Palaniswamy
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-394.001.patch, TEZ-394.002.patch, TEZ-394.003.patch, 
> TEZ-394.004.patch
>
>
>   Consider a series of joins or group-bys on dataset A with a few datasets 
> that takes 10 hours, followed by a final join with a dataset X. The vertex 
> that loads dataset X will be one of the top vertices and is initialized early 
> even though its output is not consumed until the end, 10 hours later. 
> 1) Could either use delayed-start logic for better resource allocation.
> 2) Else, if they are started upfront, need to handle failure/recovery cases 
> where the nodes that executed the MapTask might have gone down by the time 
> the final join happens. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-4027) DagAwareYarnTaskScheduler can miscompute blocked vertices and cause a hang

2018-12-18 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724150#comment-16724150
 ] 

Jason Lowe commented on TEZ-4027:
-

Thanks for the patch!  +1 lgtm.  Committing this.

> DagAwareYarnTaskScheduler can miscompute blocked vertices and cause a hang
> --
>
> Key: TEZ-4027
> URL: https://issues.apache.org/jira/browse/TEZ-4027
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1, 0.10.0
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
>Priority: Major
> Attachments: TEZ-4027.001.patch, TEZ-4027.002.patch
>
>
> In a scenario where there are retroactive failures and the YARN queue is 
> too full to allow new container assignments, the scheduler can miscompute 
> the blocked vertex set: it flips the bits up to the length of the bitset, 
> which may not reflect the total number of vertices. This causes no 
> preemption, and the DAG will hang.
> {code}
> @GuardedBy("DagAwareYarnTaskScheduler.this")
> BitSet createVertexBlockedSet() {
>   BitSet blocked = new BitSet();
>   Entry<Priority, RequestPriorityStats> entry = priorityStats.lastEntry();
>   if (entry != null) {
>     RequestPriorityStats stats = entry.getValue();
>     blocked.or(stats.allowedVertices);
>     blocked.flip(0, blocked.length());
>     blocked.or(stats.descendants);
>   }
>   return blocked;
> }
> {code}
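> BitSet.length() returns one past the highest set bit rather than the number 
> of vertices, so flipping [0, length()) can leave high-numbered vertices out 
> of the blocked set. A minimal runnable sketch of the miscomputation (the 
> vertex count of 4 and the demo class are hypothetical, not from the patch):
> {code}
> import java.util.BitSet;
>
> public class BlockedSetDemo {
>   public static void main(String[] args) {
>     final int numVertices = 4;         // total vertices in the DAG
>     BitSet allowed = new BitSet();
>     allowed.set(1);                    // only vertex 1 is allowed
>
>     BitSet blocked = new BitSet();
>     blocked.or(allowed);
>     blocked.flip(0, blocked.length()); // length() == 2, not numVertices
>     System.out.println(blocked);       // {0} -- vertices 2 and 3 missed
>
>     BitSet fixed = new BitSet();
>     fixed.or(allowed);
>     fixed.flip(0, numVertices);        // flip across the full vertex range
>     System.out.println(fixed);         // {0, 2, 3}
>   }
> }
> {code}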



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-4021) API incompatibility wro4j-maven-plugin

2018-11-27 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-4021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701050#comment-16701050
 ] 

Jason Lowe commented on TEZ-4021:
-

+1 lgtm.  Committing this.

> API incompatibility wro4j-maven-plugin
> --
>
> Key: TEZ-4021
> URL: https://issues.apache.org/jira/browse/TEZ-4021
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-4021.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-4022) Upgrade Maven Surefire plugin to 3.0.0-M1

2018-11-26 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699598#comment-16699598
 ] 

Jason Lowe commented on TEZ-4022:
-

Thanks for the patch!  +1 lgtm pending Jenkins.

> Upgrade Maven Surefire plugin to 3.0.0-M1
> -
>
> Key: TEZ-4022
> URL: https://issues.apache.org/jira/browse/TEZ-4022
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-4022.001.patch
>
>
> Recently all the unit tests are failing. This is caused by the latest Java 8 
> issue reported at SUREFIRE-1588 and fixed in Maven Surefire plugin 3.0.0-M1. 
> We need to update the plugin.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3996) Reorder input failed events before data movement events

2018-10-01 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634175#comment-16634175
 ] 

Jason Lowe commented on TEZ-3996:
-

bq. we launch an external process in the processor and for various reasons 
can't restart the external process with the new version of DMEs

This limitation sounds like it lowers the fault tolerance in the DAG.  If I 
understand correctly, any retroactive failure of an upstream task forces any 
active downstream task to fail because we cannot update the downstream task 
with a new DME when the upstream task rerun completes.  That means it could 
only take four upstream task attempt reruns, across four different upstream 
tasks, to fail a downstream vertex if the upstream re-runs were spread out 
sufficiently in time and a downstream attempt was relaunched in between each 
upstream retroactive failure.  So instead of requiring any one upstream task to 
fail four times to fail the DAG, worst-case it becomes any four attempts 
_across_ the upstream tasks.

Dropping the DME event seems like the right approach, although I worry a bit 
that this may be an expensive thing to do on the AM side with a large number of 
upstream and downstream tasks.  We may need to refactor how those are tracked 
AM-side.  Another approach which isn't as clean but might scale better is to 
have the AM send over an event when the task attempt is "up to date" with 
events -- in other words, the pending event queue is drained on the AM side and 
it could be a while before more events are sent to the task attempt.  Then a 
downstream task can load up all the events, filtering DMEs that have been 
invalidated by later IFEs, until it receives the special, "up to date" event 
which indicates it's OK to start the processing of any valid DMEs received so 
far.
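
To make the "up to date" idea concrete, here is a minimal sketch -- the event 
classes, fields, and buffer below are purely illustrative stand-ins, not the 
actual Tez API:

{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Dme/Ife are hypothetical stand-ins for DataMovementEvent/InputFailedEvent.
class PendingInputBuffer {
  static final class Dme { final int inputId, version;
    Dme(int i, int v) { inputId = i; version = v; } }
  static final class Ife { final int inputId, version;
    Ife(int i, int v) { inputId = i; version = v; } }

  private final List<Dme> pending = new ArrayList<>();
  private final Set<Long> failed = new HashSet<>();

  private static long key(int inputId, int version) {
    return ((long) inputId << 32) | (version & 0xffffffffL);
  }

  void onDme(Dme e) {
    // buffer the DME unless a later IFE already invalidated this version
    if (!failed.contains(key(e.inputId, e.version))) {
      pending.add(e);
    }
  }

  void onIfe(Ife e) {
    failed.add(key(e.inputId, e.version));
    pending.removeIf(d -> failed.contains(key(d.inputId, d.version)));
  }

  // Called on the special "up to date" event: the AM's pending queue is
  // drained, so the surviving DMEs are safe to hand to the processor.
  List<Dme> onUpToDate() {
    return new ArrayList<>(pending);
  }
}
{code}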


> Reorder input failed events before data movement events
> ---
>
> Key: TEZ-3996
> URL: https://issues.apache.org/jira/browse/TEZ-3996
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Hitesh Sharma
>Priority: Minor
>
> We have a custom processor (AbstractLogicalIOProcessor) that waits for a 
> DataMovementEvent to arrive and then starts an external process to do some 
> work. When a revocation happens, the processor receives an 
> InputFailedEvent, which tells it about the failed input, and we fail the 
> processor as it is working on old inputs. When the new inputs are available, 
> Tez restarts the processor and sends the InputFailedEvent along with all 
> the DataMovementEvents, which include the older versions and the new version 
> that replaced the revoked one.
> The issue we are seeing is that the events arrive out of order, i.e. many 
> times we see the older DataMovementEvent first, at which point our processor 
> thinks it is good to start. We then receive the InputFailedEvent and the new 
> version of the DataMovementEvent, but that's late and the processor fails. 
> This keeps repeating on every subsequent task attempt and the task fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3996) Reorder input failed events before data movement events

2018-09-28 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16632502#comment-16632502
 ] 

Jason Lowe commented on TEZ-3996:
-

bq. Arguably a better fix is to simply not send DMEs to tasks where we know the 
input has failed rather than send it and then invalidate it.

What I meant here is to send neither the DME nor the input failed event.  In 
other words, if we know a DME is bad, just don't tell the task about it at 
all.  It's just a waste, right?  We can simply wait until a valid DME is 
generated later.

bq. The happy path cases go like task receiving DMEs and then some IFEs 
(typically after some time of receiving the initial DME).  This is fine and the 
processor can deal with it and in our case we mark the processor as a non-fatal 
failure and retry.

Wait, if it can handle this then why does it matter if they arrive in the same 
heartbeat?  There's no guarantee that the events from a previous heartbeat have 
been fully processed before the asynchronous task heartbeat retrieves more, 
correct?  The code should not be relying on whether these are arriving in a 
single heartbeat vs. multiple heartbeats.  We should address the core problem 
of an IFE arriving too early relative to a DME -- that seems to be the crux of 
the issue.  I'm not seeing why the relative time between DME and IFE is 
relevant to the nature of how the failure event is processed re: fatal vs. 
non-fatal.

> Reorder input failed events before data movement events
> ---
>
> Key: TEZ-3996
> URL: https://issues.apache.org/jira/browse/TEZ-3996
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Hitesh Sharma
>Priority: Minor
>
> We have a custom processor (AbstractLogicalIOProcessor) that waits for a 
> DataMovementEvent to arrive and then starts an external process to do some 
> work. When a revocation happens, the processor receives an 
> InputFailedEvent, which tells it about the failed input, and we fail the 
> processor as it is working on old inputs. When the new inputs are available, 
> Tez restarts the processor and sends the InputFailedEvent along with all 
> the DataMovementEvents, which include the older versions and the new version 
> that replaced the revoked one.
> The issue we are seeing is that the events arrive out of order, i.e. many 
> times we see the older DataMovementEvent first, at which point our processor 
> thinks it is good to start. We then receive the InputFailedEvent and the new 
> version of the DataMovementEvent, but that's late and the processor fails. 
> This keeps repeating on every subsequent task attempt and the task fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3994) Upgrade maven-surefire-plugin to 0.21.0 to support yetus

2018-09-28 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631994#comment-16631994
 ] 

Jason Lowe commented on TEZ-3994:
-

Thanks for the patch!  +1 lgtm.  Committing this.

> Upgrade maven-surefire-plugin to 0.21.0 to support yetus
> 
>
> Key: TEZ-3994
> URL: https://issues.apache.org/jira/browse/TEZ-3994
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3994.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3891) Migrate patch submission scripts and hooks to Yetus 0.8.0

2018-09-28 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631988#comment-16631988
 ] 

Jason Lowe commented on TEZ-3891:
-

Ah, never mind my comment.  I didn't see the subtasks at first; it looks like 
you're already separating the work into sub-JIRAs and this is just a testbed 
for an uber patch to see what work is left for new subtasks.  Thanks!

> Migrate patch submission scripts and hooks to Yetus 0.8.0
> --
>
> Key: TEZ-3891
> URL: https://issues.apache.org/jira/browse/TEZ-3891
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3891.001.patch, TEZ-3891.002-branch-0.9.patch, 
> TEZ-3891.002.patch
>
>
> Patch test/validation results are no longer posted to JIRA. This is due to 
> EOL for some APIs that were being used. 
> Discussed with [~jlowe] and [~jeagles]. 
> As suggested by [~aw], moving to Yetus 0.7.0 seems to make the most sense, 
> rather than trying to work around and carry forward the older scripts. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3891) Migrate patch submission scripts and hooks to Yetus 0.8.0

2018-09-28 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631948#comment-16631948
 ] 

Jason Lowe commented on TEZ-3891:
-

Thanks for driving this, [~jeagles]!  I'm curious about some of the code 
changes though.  I would not expect to see any changes to the Java files in 
order to convert to Yetus -- are these fixes for unit tests or other nits 
flagged by Yetus?  If so, I'm wondering if that's better handled in one or 
more separate JIRAs to keep this one focused on converting to Yetus.

> Migrate patch submission scripts and hooks to Yetus 0.8.0
> --
>
> Key: TEZ-3891
> URL: https://issues.apache.org/jira/browse/TEZ-3891
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3891.001.patch, TEZ-3891.002-branch-0.9.patch, 
> TEZ-3891.002.patch
>
>
> Patch test/validation results are no longer posted to JIRA. This is due to 
> EOL for some APIs that were being used. 
> Discussed with [~jlowe] and [~jeagles]. 
> As suggested by [~aw], moving to Yetus 0.7.0 seems to make the most sense, 
> rather than trying to work around and carry forward the older scripts. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3996) Reorder input failed events before data movement events

2018-09-28 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631901#comment-16631901
 ] 

Jason Lowe commented on TEZ-3996:
-

I believe every input failed event arrives _after_ the corresponding data 
movement event.  The input failed event is to notify the task that a prior DME 
is no longer valid.  Arguably a better fix is to simply not send DMEs to tasks 
where we know the input has failed rather than send it and then invalidate it.  
What worries me about sending them in reverse order is that a task may 
interpret the latter DME event as "oh, now the input is good and here's where 
to get it."

InputFailedEvent has traditionally been used to indicate inputs that are likely 
not going to be fetchable from a task, and a task is free to ignore the input 
failure if it was able to successfully fetch the input that supposedly has 
failed.  It sounds like input failure is being redefined a bit in this context 
where somehow the input is retrievable but considered invalid?

> Reorder input failed events before data movement events
> ---
>
> Key: TEZ-3996
> URL: https://issues.apache.org/jira/browse/TEZ-3996
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Hitesh Sharma
>Priority: Minor
>
> We have a custom processor (AbstractLogicalIOProcessor) that waits for a 
> DataMovementEvent to arrive and then starts an external process to do some 
> work. When a revocation happens, the processor receives an 
> InputFailedEvent, which tells it about the failed input, and we fail the 
> processor as it is working on old inputs. When the new inputs are available, 
> Tez restarts the processor and sends the InputFailedEvent along with all 
> the DataMovementEvents, which include the older versions and the new version 
> that replaced the revoked one.
> The issue we are seeing is that the events arrive out of order, i.e. many 
> times we see the older DataMovementEvent first, at which point our processor 
> thinks it is good to start. We then receive the InputFailedEvent and the new 
> version of the DataMovementEvent, but that's late and the processor fails. 
> This keeps repeating on every subsequent task attempt and the task fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3975) Please add OWASP Dependency Check to the build (pom.xml)

2018-09-26 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3975:

Fix Version/s: 0.10.1
   0.9.2

Thanks, [~jeagles]!  I committed this to master and branch-0.9.

> Please add OWASP Dependency Check to the build (pom.xml)
> 
>
> Key: TEZ-3975
> URL: https://issues.apache.org/jira/browse/TEZ-3975
> Project: Apache Tez
>  Issue Type: New Feature
>Affects Versions: 0.8.next, 0.9.next, 0.10.0, 0.10.1
> Environment: All development, build, test, environments.
>Reporter: Albert Baker
>Assignee: Jonathan Eagles
>Priority: Major
>  Labels: build, easy-fix, security
> Fix For: 0.9.2, 0.10.1
>
> Attachments: TEZ-3975.001.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
>  Please add OWASP Dependency Check to the build (pom.xml).  OWASP DC makes an 
> outbound REST call to MITRE Common Vulnerabilities & Exposures (CVE) to 
> perform a lookup for each dependent .jar and list any/all known 
> vulnerabilities for each jar.  This step is needed because a manual MITRE CVE 
> lookup/check on the main component does not include checking for 
> vulnerabilities in components or in dependent libraries.
> OWASP Dependency Check: 
> https://www.owasp.org/index.php/OWASP_Dependency_Check has plug-ins for most 
> Java build/make types (ant, maven, ivy, gradle).
> Also, add the appropriate command to the nightly build to generate a report 
> of all known vulnerabilities in any/all third-party libraries/dependencies 
> that get pulled in, for example: mvn -Powasp -Dtest=false -DfailIfNoTests=false 
> clean aggregate
> Generating this report nightly/weekly will help inform the project's 
> development team if any dependent libraries have reported known 
> vulnerabilities.  Project teams that keep up with removing vulnerabilities on 
> a weekly basis will help protect businesses that rely on these open source 
> components.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3975) Please add OWASP Dependency Check to the build (pom.xml)

2018-09-26 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629378#comment-16629378
 ] 

Jason Lowe commented on TEZ-3975:
-

Thanks for the patch!  +1 lgtm.  Committing this.

> Please add OWASP Dependency Check to the build (pom.xml)
> 
>
> Key: TEZ-3975
> URL: https://issues.apache.org/jira/browse/TEZ-3975
> Project: Apache Tez
>  Issue Type: New Feature
>Affects Versions: 0.8.next, 0.9.next, 0.10.0, 0.10.1
> Environment: All development, build, test, environments.
>Reporter: Albert Baker
>Assignee: Jonathan Eagles
>Priority: Major
>  Labels: build, easy-fix, security
> Attachments: TEZ-3975.001.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
>  Please add OWASP Dependency Check to the build (pom.xml).  OWASP DC makes an 
> outbound REST call to MITRE Common Vulnerabilities & Exposures (CVE) to 
> perform a lookup for each dependent .jar and list any/all known 
> vulnerabilities for each jar.  This step is needed because a manual MITRE CVE 
> lookup/check on the main component does not include checking for 
> vulnerabilities in components or in dependent libraries.
> OWASP Dependency Check: 
> https://www.owasp.org/index.php/OWASP_Dependency_Check has plug-ins for most 
> Java build/make types (ant, maven, ivy, gradle).
> Also, add the appropriate command to the nightly build to generate a report 
> of all known vulnerabilities in any/all third-party libraries/dependencies 
> that get pulled in, for example: mvn -Powasp -Dtest=false -DfailIfNoTests=false 
> clean aggregate
> Generating this report nightly/weekly will help inform the project's 
> development team if any dependent libraries have reported known 
> vulnerabilities.  Project teams that keep up with removing vulnerabilities on 
> a weekly basis will help protect businesses that rely on these open source 
> components.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3982) DAGAppMaster and tasks should not report negative or invalid progress

2018-09-21 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16623693#comment-16623693
 ] 

Jason Lowe commented on TEZ-3982:
-

+1 lgtm.  I verified this patch is identical to the master version, with the 
only change being MonotonicClock swapped for SystemClock in the test case.  
Verified the branch-0.9 build completes and the test case passes.  Committing 
this.
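
For reference, the branch-0.9 difference amounts to constructing the clock 
differently in the test (a sketch, not the full patch):

{code}
// org.apache.hadoop.yarn.util.MonotonicClock does not exist in the Hadoop
// release branch-0.9 builds against, so the test uses SystemClock instead.
import org.apache.hadoop.yarn.util.Clock;
import org.apache.hadoop.yarn.util.SystemClock;

public class ClockCompat {
  static Clock newTestClock() {
    return new SystemClock();  // master's test uses: new MonotonicClock()
  }
}
{code}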

> DAGAppMaster and tasks should not report negative or invalid progress
> -
>
> Key: TEZ-3982
> URL: https://issues.apache.org/jira/browse/TEZ-3982
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1, 0.10.0
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
>Priority: Major
> Fix For: 0.9.2, 0.10.1
>
> Attachments: TEZ-3982.001.patch, TEZ-3982.002.patch, 
> TEZ-3982.003.patch, TEZ-3982.004.patch, TEZ-3982.005.branch-0.9.patch
>
>
> The AM fails (AMRMClient expects non-negative progress) if any component 
> reports invalid or negative progress; DAGAppMaster/tasks should check and 
> report accordingly to allow the AM to execute.
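> A minimal sketch of such a check (the helper below is illustrative, not the 
> actual patch):
> {code}
> // Clamp before reporting so AMRMClient never sees NaN or an out-of-range value.
> static float sanitizeProgress(float progress) {
>   if (Float.isNaN(progress) || progress < 0.0f) {
>     return 0.0f;
>   }
>   return Math.min(progress, 1.0f);
> }
> {code}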



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (TEZ-3982) DAGAppMaster and tasks should not report negative or invalid progress

2018-09-21 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved TEZ-3982.
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 0.9.2

Thanks, [~kshukla]!  I committed this to branch-0.9.

> DAGAppMaster and tasks should not report negative or invalid progress
> -
>
> Key: TEZ-3982
> URL: https://issues.apache.org/jira/browse/TEZ-3982
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1, 0.10.0
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
>Priority: Major
> Fix For: 0.9.2, 0.10.1
>
> Attachments: TEZ-3982.001.patch, TEZ-3982.002.patch, 
> TEZ-3982.003.patch, TEZ-3982.004.patch, TEZ-3982.005.branch-0.9.patch
>
>
> The AM fails (AMRMClient expects non-negative progress) if any component 
> reports invalid or negative progress; DAGAppMaster/tasks should check and 
> report accordingly to allow the AM to execute.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3982) DAGAppMaster and tasks should not report negative or invalid progress

2018-09-21 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3982:

Fix Version/s: (was: 0.9.2)

> DAGAppMaster and tasks should not report negative or invalid progress
> -
>
> Key: TEZ-3982
> URL: https://issues.apache.org/jira/browse/TEZ-3982
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1, 0.10.0
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
>Priority: Major
> Fix For: 0.10.1
>
> Attachments: TEZ-3982.001.patch, TEZ-3982.002.patch, 
> TEZ-3982.003.patch, TEZ-3982.004.patch
>
>
> The AM fails (AMRMClient expects non-negative progress) if any component 
> reports invalid or negative progress; DAGAppMaster/tasks should check and 
> report accordingly to allow the AM to execute.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (TEZ-3982) DAGAppMaster and tasks should not report negative or invalid progress

2018-09-21 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reopened TEZ-3982:
-

This broke the branch-0.9 build.  It looks like MonotonicClock isn't in the 
version of Hadoop that branch-0.9 depends upon:
{noformat}
[ERROR] 
/tez/tez-dag/src/test/java/org/apache/tez/dag/app/TestDAGAppMaster.java:[17,35] 
cannot find symbol
  symbol:   class MonotonicClock
  location: package org.apache.hadoop.yarn.util
[ERROR] 
/tez/tez-dag/src/test/java/org/apache/tez/dag/app/TestDAGAppMaster.java:[431,32]
 cannot find symbol
  symbol:   class MonotonicClock
  location: class org.apache.tez.dag.app.TestDAGAppMaster
[INFO] 2 errors 
{noformat}

I reverted this from branch-0.9 to fix the build.

> DAGAppMaster and tasks should not report negative or invalid progress
> -
>
> Key: TEZ-3982
> URL: https://issues.apache.org/jira/browse/TEZ-3982
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1, 0.10.0
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
>Priority: Major
> Fix For: 0.10.1
>
> Attachments: TEZ-3982.001.patch, TEZ-3982.002.patch, 
> TEZ-3982.003.patch, TEZ-3982.004.patch
>
>
> The AM fails (AMRMClient expects non-negative progress) if any component 
> reports invalid or negative progress; DAGAppMaster/tasks should check and 
> report accordingly to allow the AM to execute.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3989) Fix by-laws related to emeritus clause

2018-09-18 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619104#comment-16619104
 ] 

Jason Lowe commented on TEZ-3989:
-

Apparently not.  I updated the site manually following the instructions at 
https://cwiki.apache.org/confluence/display/TEZ/Updating+the+Tez+Website


> Fix by-laws related to emeritus clause 
> ---
>
> Key: TEZ-3989
> URL: https://issues.apache.org/jira/browse/TEZ-3989
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>Priority: Major
> Fix For: 0.10.1
>
>
> The emeritus clause is not valid and needs to be updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3989) Fix by-laws related to emeritus clause

2018-09-13 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614044#comment-16614044
 ] 

Jason Lowe commented on TEZ-3989:
-

Does the Apache Tez site update on a Jenkins job or is it done manually?

> Fix by-laws related to emeritus clause 
> ---
>
> Key: TEZ-3989
> URL: https://issues.apache.org/jira/browse/TEZ-3989
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>Priority: Major
> Fix For: 0.10.1
>
>
> The emeritus clause is not valid and needs to be updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (TEZ-3989) Fix by-laws related to emeritus clause

2018-09-13 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved TEZ-3989.
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 0.10.1

Thanks, [~hitesh]! I committed this to master.

> Fix by-laws related to emeritus clause 
> ---
>
> Key: TEZ-3989
> URL: https://issues.apache.org/jira/browse/TEZ-3989
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>Priority: Major
> Fix For: 0.10.1
>
>
> The emeritus clause is not valid and needs to be updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3989) Fix by-laws related to emeritus clause

2018-09-13 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614040#comment-16614040
 ] 

Jason Lowe commented on TEZ-3989:
-

+1 lgtm.  This matches what was proposed on the dev list.

> Fix by-laws related to emeritus clause 
> ---
>
> Key: TEZ-3989
> URL: https://issues.apache.org/jira/browse/TEZ-3989
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>Priority: Major
>
> The emeritus clause is not valid and needs to be updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3987) Schedule giving priorities based on topological order

2018-09-06 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605846#comment-16605846
 ] 

Jason Lowe commented on TEZ-3987:
-

If vertex 5 is an ancestor of vertex 3 in the graph then I would not expect 
TEZ-394 to give vertex 3 a higher priority.  I plugged the topology into the 
testCriticalPathOrdering unit test added as part of the TEZ-394 patch, and it 
ordered the vertices like this (highest to lowest priority): v1, v2, v5, v4, 
v3, v6.  The existing topology ordering also orders v5 higher than v3.  The 
same test without the TEZ-394 change to DAG planning results in a vertex 
ordering of v2, v4, v5, v1, v3, v6.

What is the ordering you are striving for in that example?


> Schedule giving priorities based on topological order
> -
>
> Key: TEZ-3987
> URL: https://issues.apache.org/jira/browse/TEZ-3987
> Project: Apache Tez
>  Issue Type: New Feature
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
>
> It'd be an improvement for some DAGs to be scheduled in topological order, 
> as opposed to the scheduling based on distance from the root done by 
> {{DAGScheduler}} and {{DAGSchedulerControlled}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3987) Schedule giving priorities based on topological order

2018-09-05 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604969#comment-16604969
 ] 

Jason Lowe commented on TEZ-3987:
-

Is this related to TEZ-394?

> Schedule giving priorities based on topological order
> -
>
> Key: TEZ-3987
> URL: https://issues.apache.org/jira/browse/TEZ-3987
> Project: Apache Tez
>  Issue Type: New Feature
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
>
> It'd be an improvement for some DAGs to be scheduled in topological order, 
> as opposed to the scheduling based on distance from the root done by 
> {{DAGScheduler}} and {{DAGSchedulerControlled}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3980) ShuffleRunner: the wake loop needs to check for shutdown

2018-08-29 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3980:

Fix Version/s: 0.9.2

Thanks, [~gopalv]!  I committed this to branch-0.9 as well.

> ShuffleRunner: the wake loop needs to check for shutdown
> 
>
> Key: TEZ-3980
> URL: https://issues.apache.org/jira/browse/TEZ-3980
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Gopal V
>Assignee: Gopal V
>Priority: Major
> Fix For: 0.9.2, 0.10.0
>
> Attachments: TEZ-3980.1.patch
>
>
> In the ShuffleRunner threads, there's a loop which does not terminate if the 
> task threads get killed.
> {code}
>   while ((runningFetchers.size() >= numFetchers || pendingHosts.isEmpty())
>       && numCompletedInputs.get() < numInputs) {
>     inputContext.notifyProgress();
>     boolean ret = wakeLoop.await(1000, TimeUnit.MILLISECONDS);
>   }
> {code}
> The wakeLoop signal does not break this out of the loop, and the loop is 
> missing a shutdown check.
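> A sketch of the same loop with a shutdown check added (the {{isShutdown}} 
> flag is assumed for illustration, not taken from the actual patch):
> {code}
>   while ((runningFetchers.size() >= numFetchers || pendingHosts.isEmpty())
>       && numCompletedInputs.get() < numInputs) {
>     if (isShutdown.get()) {
>       break;  // exit promptly once the task is shutting down
>     }
>     inputContext.notifyProgress();
>     wakeLoop.await(1000, TimeUnit.MILLISECONDS);
>   }
> {code}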



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3978) DAGClientServer Socket exception when localhost name lookup failures

2018-08-16 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582880#comment-16582880
 ] 

Jason Lowe commented on TEZ-3978:
-

I did a quick test and it turns out InetSocketAddress(0).getHostString() 
returns "0.0.0.0" in practice.  However I noticed that the task communicator is 
binding this way, and MapReduce has been doing this also for a while, so it's 
probably fine.  I'd like to avoid the reverse host lookup, and this 
accomplishes that.

+1 for the latest patch. Committing this.
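
For the record, the quick test is plain JDK behavior and easy to reproduce (a 
standalone demo, not Tez code):

{code}
import java.net.InetSocketAddress;

public class HostStringDemo {
  public static void main(String[] args) {
    // Wildcard address on an ephemeral port; getHostString() returns the
    // literal "0.0.0.0" without triggering a reverse DNS lookup.
    InetSocketAddress addr = new InetSocketAddress(0);
    System.out.println(addr.getHostString());  // prints 0.0.0.0
  }
}
{code}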

> DAGClientServer Socket exception when localhost name lookup failures
> 
>
> Key: TEZ-3978
> URL: https://issues.apache.org/jira/browse/TEZ-3978
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3978.001.patch, TEZ-3978.002.patch
>
>
> Call From 0.0.0.0 to null:0 failed on socket exception: 
> java.net.SocketException: Invalid argument
> {code}
> 2018-08-10 21:19:55,523 [ERROR] [ServiceThread:DAGClientRPCServer] 
> |client.DAGClientServer|: Failed to start DAGClientServer: 
> java.net.SocketException: Call From 0.0.0.0 to null:0 failed on socket 
> exception: java.net.SocketException: Invalid argument; For more details see:  
> http://wiki.apache.org/hadoop/SocketException
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:804)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:777)
>   at org.apache.hadoop.ipc.Server.bind(Server.java:563)
>   at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:958)
>   at org.apache.hadoop.ipc.Server.<init>(Server.java:2657)
>   at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:968)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:367)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:342)
>   at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:810)
>   at 
> org.apache.tez.dag.api.client.DAGClientServer.createServer(DAGClientServer.java:134)
>   at 
> org.apache.tez.dag.api.client.DAGClientServer.serviceStart(DAGClientServer.java:82)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency.start(DAGAppMaster.java:1909)
>   at 
> org.apache.tez.dag.app.DAGAppMaster$ServiceThread.run(DAGAppMaster.java:1930)
> Caused by: java.net.SocketException: Invalid argument
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:433)
>   at sun.nio.ch.Net.bind(Net.java:425)
>   at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at org.apache.hadoop.ipc.Server.bind(Server.java:553)
>   ... 11 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3978) DAGClientServer Socket exception when localhost name lookup failures

2018-08-14 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16580411#comment-16580411
 ] 

Jason Lowe commented on TEZ-3978:
-

Thanks for the patch!

I'm a little concerned that we are changing the semantics of the bind.  This 
now binds to every interface on the machine rather than the primary interface 
as it did before.  Is that desired?  It might be safer to just pass the IP 
address from the sock addr without doing a reverse host lookup.

> DAGClientServer Socket exception when localhost name lookup failures
> 
>
> Key: TEZ-3978
> URL: https://issues.apache.org/jira/browse/TEZ-3978
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3978.001.patch
>
>
> Call From 0.0.0.0 to null:0 failed on socket exception: 
> java.net.SocketException: Invalid argument
> {code}
> 2018-08-10 21:19:55,523 [ERROR] [ServiceThread:DAGClientRPCServer] 
> |client.DAGClientServer|: Failed to start DAGClientServer: 
> java.net.SocketException: Call From 0.0.0.0 to null:0 failed on socket 
> exception: java.net.SocketException: Invalid argument; For more details see:  
> http://wiki.apache.org/hadoop/SocketException
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:804)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:777)
>   at org.apache.hadoop.ipc.Server.bind(Server.java:563)
>   at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:958)
>   at org.apache.hadoop.ipc.Server.<init>(Server.java:2657)
>   at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:968)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:367)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:342)
>   at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:810)
>   at 
> org.apache.tez.dag.api.client.DAGClientServer.createServer(DAGClientServer.java:134)
>   at 
> org.apache.tez.dag.api.client.DAGClientServer.serviceStart(DAGClientServer.java:82)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency.start(DAGAppMaster.java:1909)
>   at 
> org.apache.tez.dag.app.DAGAppMaster$ServiceThread.run(DAGAppMaster.java:1930)
> Caused by: java.net.SocketException: Invalid argument
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:433)
>   at sun.nio.ch.Net.bind(Net.java:425)
>   at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at org.apache.hadoop.ipc.Server.bind(Server.java:553)
>   ... 11 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3974) Tez: Correctness regression of TEZ-955 in TEZ-2937

2018-08-10 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3974:

 Hadoop Flags: Reviewed
Fix Version/s: 0.9.2

Thanks, [~jmarhuen]!  I committed this to branch-0.9 as well.

> Tez: Correctness regression of TEZ-955 in TEZ-2937
> --
>
> Key: TEZ-3974
> URL: https://issues.apache.org/jira/browse/TEZ-3974
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1
>Reporter: Gopal V
>Assignee: Jaume M
>Priority: Critical
> Fix For: 0.9.2, 0.10.0
>
> Attachments: TEZ-3974.1.patch, TEZ-3974.2.patch, TEZ-3974.3.patch
>
>
> TEZ-2937 might have introduced a race condition for Tez output events, along 
> with TEZ-2237
> {code}
>   // Close the Outputs.
>   for (OutputSpec outputSpec : outputSpecs) {
>     String destVertexName = outputSpec.getDestinationVertexName();
>     initializedOutputs.remove(destVertexName);
>     List<Event> closeOutputEvents =
>         ((LogicalOutputFrameworkInterface)outputsMap.get(destVertexName)).close();
>     sendTaskGeneratedEvents(closeOutputEvents,
>         EventProducerConsumerType.OUTPUT, taskSpec.getVertexName(),
>         destVertexName, taskSpec.getTaskAttemptID());
>   }
>   // Close the Processor.
>   processorClosed = true;
>   processor.close();
> {code}
> As part of TEZ-2237, the outputs send empty events when the output is closed 
> without being started (which happens in task init failures).
> These events are obsoleted when a task fails and this happens in the AM, but 
> not before the dispatcher looks at them.
> Depending on the timing, the empty events can escape obsoletion & be sent to 
> a downstream task.
> This gets marked as a SKIPPED event in the downstream task, which means that 
> further obsoletion events sent to the downstream task are ignored (because a 
> zero-byte fetch is not repeated on node failure).
> So the downstream task can exit without actually waiting for the retry of the 
> failed task and cause silent data loss in the case where the retry succeeds in 
> another attempt.
> So if processor.close() throws an exception, this introduces a race condition, 
> and if the AM is too fast, we end up with correctness issues.
> This was originally reported in TEZ-955



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

2018-08-07 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571755#comment-16571755
 ] 

Jason Lowe commented on TEZ-3976:
-

bq. Each of those translate into an event in the AM which finally crashes due 
to OOM after around 30 minutes and around 10 million shuffle input errors (and 
10 million lines like the previous ones).

To clarify, this is the task crashing with an OOM and not the AM?  If so it 
looks like yet another flow control issue with the RPC layer where the task is 
generating events faster than the network can push them out.

Normally this scenario does not occur because one of the following triggers 
first:
# The AM sees enough complaints about the inability to fetch an upstream task's 
inputs that it decides to retroactively fail the upstream task and re-run it.  
When that is decided the AM informs the downstream tasks about the now obsolete 
shuffle inputs, and the downstream tasks will stop trying to fetch it.
# The downstream task decides that it is getting far too many errors and 
insufficient overall shuffle progress, so it declares itself unhealthy and 
fails.

It would be useful to understand why neither of those is happening in this case.


> ShuffleManager reporting too many errors
> 
>
> Key: TEZ-3976
> URL: https://issues.apache.org/jira/browse/TEZ-3976
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jaume M
>Priority: Major
>
> The symptoms are a lot of these logs are being shown:
> {code:java}
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #0 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0], connectFailed: true
> 2018-06-15T18:09:35,811 WARN  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.Fetcher: copyInputs failed for 
> tasks [InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]]
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0], connectFailed: true
> {code}
> Each of those translate into an event in the AM which finally crashes due to 
> OOM after around 30 minutes and around 10 million shuffle input errors (and 
> 10 million lines like the previous ones). When the ShufflerManager is closed 
> and the counters reported there are many shuffle input errors, some of those 
> logs are:
> {code:java}
> 2018-06-15T17:46:30,988  INFO [TezTR-441963_21_34_4_0_4 
> (152901963_0021_34_04_00_4)] runtime.LogicalIOProcessorRuntimeTask: 
> Final Counters for attempt_152901963_0021_34_04_00_4: Counters: 43 
> [[org.apache.tez.common.counters.TaskCounter SPILLED_RECORDS=0, 
> NUM_SHUFFLED_INPUTS=26, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> INPUT_RECORDS_PROCESSED=26, OUTPUT_RECORDS=1, OUTPUT_LARGE_RECORDS=0, 
> OUTPUT_BYTES=779472, OUTPUT_BYTES_WITH_OVERHEAD=779483, 
> OUTPUT_BYTES_PHYSICAL=780146, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, 
> SHUFFLE_BYTES=4207563, SHUFFLE_BYTES_DECOMPRESSED=20266603, 
> SHUFFLE_BYTES_TO_MEM=3380616, SHUFFLE_BYTES_TO_DISK=0, 
> SHUFFLE_BYTES_DISK_DIRECT=826947, SHUFFLE_PHASE_TIME=52516, 
> FIRST_EVENT_RECEIVED=1, LAST_EVENT_RECEIVED=1185][HIVE 
> RECORDS_OUT_INTERMEDIATE_Reducer_12=1, 
> RECORDS_OUT_OPERATOR_GBY_159=1, 
> RECORDS_OUT_OPERATOR_RS_160=1][TaskCounter_Reducer_12_INPUT_Map_11
>  FIRST_EVENT_RECEIVED=1, INPUT_RECORDS_PROCESSED=26, 
> LAST_EVENT_RECEIVED=1185, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> NUM_SHUFFLED_INPUTS=26, SHUFFLE_BYTES=4207563, 
> SHUFFLE_BYTES_DECOMPRESSED=20266603, SHUFFLE_BYTES_DISK_DIRECT=826947, 
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=3380616, 
> SHUFFLE_PHASE_TIME=52516][TaskCounter_Reducer_12_OUTPUT_Map_1
>  ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779472, OUTPUT_BYTES_PHYSICAL=780146, 
> OUTPUT_BYTES_WITH_OVERHEAD=779483, 

[jira] [Commented] (TEZ-3942) RPC getTask writable optimization invalid in hadoop 2.8+

2018-07-24 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554381#comment-16554381
 ] 

Jason Lowe commented on TEZ-3942:
-

Thanks for the performance details!  Committing the 003 patch.

> RPC getTask writable optimization invalid in hadoop 2.8+
> 
>
> Key: TEZ-3942
> URL: https://issues.apache.org/jira/browse/TEZ-3942
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Nishant Dash
>Priority: Major
> Attachments: TEZ-3942.001.patch, TEZ-3942.002.patch, 
> TEZ-3942.003.patch, TEZ-3942.bench.patch, TEZ-3942.bench2.patch, 
> TEZ-3942.test.patch
>
>
> TEZ-3140 added an optimization to improve performance of RPC writable. 
> HADOOP-13426 added in hadoop 2.8 has invalidated the assumption of the added 
> optimization by changing the underlying output buffer.
> {noformat}
> "IPC Server handler 25 on 35274" #85 daemon prio=5 os_prio=0 
> tid=0x022c nid=0x1b40f runnable [0x2ba1a6627000]
>java.lang.Thread.State: RUNNABLE
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> - locked <0x00072fe9ac68> (a 
> org.apache.hadoop.ipc.ResponseBuffer$FramedBuffer)
> at java.io.DataOutputStream.write(DataOutputStream.java:107)
> - locked <0x00072fe9ac48> (a org.apache.hadoop.ipc.ResponseBuffer)
> at 
> org.apache.tez.dag.api.EntityDescriptor.write(EntityDescriptor.java:121)
> at org.apache.tez.runtime.api.impl.InputSpec.write(InputSpec.java:66)
> at org.apache.tez.runtime.api.impl.TaskSpec.write(TaskSpec.java:174)
> at org.apache.tez.common.ContainerTask.write(ContainerTask.java:77)
> at 
> org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:202)
> at 
> org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:128)
> at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:82)
> at 
> org.apache.hadoop.ipc.RpcWritable$WritableWrapper.writeTo(RpcWritable.java:75)
> at 
> org.apache.hadoop.ipc.Server.setupResponseForWritable(Server.java:2807)
> at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2792)
> at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2766)
> at org.apache.hadoop.ipc.Server.access$100(Server.java:138)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:905)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:810)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1949)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2523)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3942) RPC getTask writable optimization invalid in hadoop 2.8+

2018-07-19 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549887#comment-16549887
 ] 

Jason Lowe commented on TEZ-3942:
-

Thanks for the patch!  +1 lgtm.  [~nishantdash] do you have any performance 
metrics from before and after the patch?


> RPC getTask writable optimization invalid in hadoop 2.8+
> 
>
> Key: TEZ-3942
> URL: https://issues.apache.org/jira/browse/TEZ-3942
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Nishant Dash
>Priority: Major
> Attachments: TEZ-3942.001.patch, TEZ-3942.002.patch, 
> TEZ-3942.003.patch, TEZ-3942.bench.patch, TEZ-3942.test.patch
>
>
> TEZ-3140 added an optimization to improve performance of RPC writable. 
> HADOOP-13426 added in hadoop 2.8 has invalidated the assumption of the added 
> optimization by changing the underlying output buffer.
> {noformat}
> "IPC Server handler 25 on 35274" #85 daemon prio=5 os_prio=0 
> tid=0x022c nid=0x1b40f runnable [0x2ba1a6627000]
>java.lang.Thread.State: RUNNABLE
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> - locked <0x00072fe9ac68> (a 
> org.apache.hadoop.ipc.ResponseBuffer$FramedBuffer)
> at java.io.DataOutputStream.write(DataOutputStream.java:107)
> - locked <0x00072fe9ac48> (a org.apache.hadoop.ipc.ResponseBuffer)
> at 
> org.apache.tez.dag.api.EntityDescriptor.write(EntityDescriptor.java:121)
> at org.apache.tez.runtime.api.impl.InputSpec.write(InputSpec.java:66)
> at org.apache.tez.runtime.api.impl.TaskSpec.write(TaskSpec.java:174)
> at org.apache.tez.common.ContainerTask.write(ContainerTask.java:77)
> at 
> org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:202)
> at 
> org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:128)
> at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:82)
> at 
> org.apache.hadoop.ipc.RpcWritable$WritableWrapper.writeTo(RpcWritable.java:75)
> at 
> org.apache.hadoop.ipc.Server.setupResponseForWritable(Server.java:2807)
> at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2792)
> at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2766)
> at org.apache.hadoop.ipc.Server.access$100(Server.java:138)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:905)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:810)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1949)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2523)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3971) Incorrect query result in hive when hive.convert.join.bucket.mapjoin.tez=true

2018-07-16 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545328#comment-16545328
 ] 

Jason Lowe commented on TEZ-3971:
-

This problem report seems more appropriate for the HIVE project.  Hive is 
responsible for generating the DAG for Tez, and it sounds like it is generating 
an incorrect DAG to process the results.  That would be a bug in Hive rather 
than a bug in Tez.


> Incorrect query result in hive when hive.convert.join.bucket.mapjoin.tez=true
> -
>
> Key: TEZ-3971
> URL: https://issues.apache.org/jira/browse/TEZ-3971
> Project: Apache Tez
>  Issue Type: Bug
> Environment: We are using Hive 3, Hadoop 3.1 and Tez 0.9.1
>Reporter: Karthik
>Priority: Major
> Attachments: extended_explain.txt
>
>
> When hive.convert.join.bucket.mapjoin.tez=true and bucketed column is in 
> select clause but not in where clause, hive is performing a bucket map join 
> and returning incorrect results. When the bucketed column is removed from 
> select clause or  hive.convert.join.bucket.mapjoin.tez=false, returned query 
> results are correct.
>  
> create table my_fact(AMT decimal(20,3),bucket_col string ,join_col string )
> PARTITIONED BY (FISCAL_YEAR string ,ACCOUNTING_PERIOD string )
>  CLUSTERED BY (bucket_col) INTO 10 
> BUCKETS 
> stored as ORC
>  ;
> create table my_dim(join_col string,filter_col string) stored as orc;
> After populating and analyzing above tables, explain  plan looks as below 
> when  hive.convert.join.bucket.mapjoin.tez=TRUE:
>  
> explain  select T4.join_col as account1,my_fact.accounting_period
> FROM my_fact JOIN my_dim T4 ON my_fact.join_col = T4.join_col
> WHERE my_fact.fiscal_year = '2015'
>  AND T4.filter_col IN ( 'VAL1', 'VAL2' ) 
> and my_fact.accounting_period in (10);
> Vertex dependency in root stage
> Map 1 <- Map 2 (CUSTOM_EDGE)
> Stage-0
>  Fetch Operator
>  limit:-1
>  Stage-1
>  Map 1 vectorized, llap
>  File Output Operator [FS_24]
>  Select Operator [SEL_23] (rows=15282589 width=291)
>  Output:["_col0","_col1","_col2"]
>  Map Join Operator [MAPJOIN_22] (rows=15282589 width=291)
>  
> *BucketMapJoin*:true,Conds:SEL_21._col1=RS_19._col0(Inner),Output:["_col0","_col3","_col4"]
>  <-Map 2 [CUSTOM_EDGE] vectorized, llap
>  MULTICAST [RS_19]
>  PartitionCols:_col0
>  Select Operator [SEL_18] (rows=818 width=186)
>  Output:["_col0"]
>  Filter Operator [FIL_17] (rows=818 width=186)
>  predicate:((filter_col) IN ('VAL1', 'VAL2') and join_col is not null)
>  TableScan [TS_3] (rows=1635 width=186)
>  default@my_dim,t4,Tbl:COMPLETE,Col:NONE,Output:["join_col","filter_col"]
>  <-Select Operator [SEL_21] (rows=13893263 width=291)
>  Output:["_col0","_col1","_col3"]
>  Filter Operator [FIL_20] (rows=13893263 width=291)
>  predicate:join_col is not null
>  TableScan [TS_0] (rows=13893263 width=291)
>  
> default@my_fact,my_fact,Tbl:COMPLETE,Col:NONE,Output:["bucket_col","join_col"]
> [^extended_explain.txt] has more detailed plan.
> When  hive.convert.join.bucket.mapjoin.tez=false,  plan no longer has 
> bucketjoin and query results are correct.
> Vertex dependency in root stage
> Map 1 <- Map 2 (BROADCAST_EDGE)
> Stage-0
>  Fetch Operator
>  limit:-1
>  Stage-1
>  Map 1 vectorized, llap
>  File Output Operator [FS_24]
>  Select Operator [SEL_23] (rows=15282589 width=291)
>  Output:["_col0","_col1","_col2"]
>  Map Join Operator [MAPJOIN_22] (rows=15282589 width=291)
>  Conds:SEL_21._col1=RS_19._col0(Inner),Output:["_col0","_col3","_col4"]
>  <-Map 2 [BROADCAST_EDGE] vectorized, llap
>  BROADCAST [RS_19]
>  PartitionCols:_col0
>  Select Operator [SEL_18] (rows=818 width=186)
>  Output:["_col0"]
>  Filter Operator [FIL_17] (rows=818 width=186)
>  predicate:((filter_col) IN ('VAL1', 'VAL2') and join_col is not null)
>  TableScan [TS_3] (rows=1635 width=186)
>  default@my_dim,t4,Tbl:COMPLETE,Col:NONE,Output:["join_col","filter_col"]
>  <-Select Operator [SEL_21] (rows=13893263 width=291)
>  Output:["_col0","_col1","_col3"]
>  Filter Operator [FIL_20] (rows=13893263 width=291)
>  predicate:join_col is not null
>  TableScan [TS_0] (rows=13893263 width=291)
>  
> default@my_fact,my_fact,Tbl:COMPLETE,Col:NONE,Output:["bucket_col","join_col"]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3964) Inflater not closed in some places

2018-07-12 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541837#comment-16541837
 ] 

Jason Lowe commented on TEZ-3964:
-

Thanks for updating the patch!  +1 lgtm.  Committing this.

> Inflater not closed in some places
> --
>
> Key: TEZ-3964
> URL: https://issues.apache.org/jira/browse/TEZ-3964
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
> Attachments: TEZ-3964.2.patch
>
>
> We call [this 
> method|https://github.com/apache/tez/blob/314dfc79b4b3f528b680b4fee73ad0dca3a3a19b/tez-api/src/main/java/org/apache/tez/common/TezCommonUtils.java#L363]
>  from a few places. We don't call {{end()}} from most of the place where we 
> call and although it's not necessary to call it explicitly it's the 
> recommended way in the docs to do so.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3942) RPC getTask writable optimization invalid in hadoop 2.8+

2018-07-12 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541751#comment-16541751
 ] 

Jason Lowe commented on TEZ-3942:
-

Is it appropriate to target this to 0.9.2?  Tez 0.9.x currently works against 
Hadoop 2.7, so it's not a given that Tez 0.9 will have Hadoop 2.8+ underneath.


> RPC getTask writable optimization invalid in hadoop 2.8+
> 
>
> Key: TEZ-3942
> URL: https://issues.apache.org/jira/browse/TEZ-3942
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3942.001.patch, TEZ-3942.test.patch
>
>
> TEZ-3140 added an optimization to improve performance of RPC writable. 
> HADOOP-13426 added in hadoop 2.8 has invalidated the assumption of the added 
> optimization by changing the underlying output buffer.
> {noformat}
> "IPC Server handler 25 on 35274" #85 daemon prio=5 os_prio=0 
> tid=0x022c nid=0x1b40f runnable [0x2ba1a6627000]
>java.lang.Thread.State: RUNNABLE
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> - locked <0x00072fe9ac68> (a 
> org.apache.hadoop.ipc.ResponseBuffer$FramedBuffer)
> at java.io.DataOutputStream.write(DataOutputStream.java:107)
> - locked <0x00072fe9ac48> (a org.apache.hadoop.ipc.ResponseBuffer)
> at 
> org.apache.tez.dag.api.EntityDescriptor.write(EntityDescriptor.java:121)
> at org.apache.tez.runtime.api.impl.InputSpec.write(InputSpec.java:66)
> at org.apache.tez.runtime.api.impl.TaskSpec.write(TaskSpec.java:174)
> at org.apache.tez.common.ContainerTask.write(ContainerTask.java:77)
> at 
> org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:202)
> at 
> org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:128)
> at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:82)
> at 
> org.apache.hadoop.ipc.RpcWritable$WritableWrapper.writeTo(RpcWritable.java:75)
> at 
> org.apache.hadoop.ipc.Server.setupResponseForWritable(Server.java:2807)
> at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2792)
> at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2766)
> at org.apache.hadoop.ipc.Server.access$100(Server.java:138)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:905)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:810)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1949)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2523)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3955) Upgrade hadoop dependency to 3.0.3

2018-07-11 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540293#comment-16540293
 ] 

Jason Lowe commented on TEZ-3955:
-

+1 lgtm.  Committing this.

> Upgrade hadoop dependency to 3.0.3
> --
>
> Key: TEZ-3955
> URL: https://issues.apache.org/jira/browse/TEZ-3955
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3955.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3963) Possible InflaterInputStream leaked in TezCommonUtils and related classes

2018-07-06 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535394#comment-16535394
 ] 

Jason Lowe commented on TEZ-3963:
-

+1 lgtm.  Committing this.

> Possible InflaterInputStream leaked in TezCommonUtils and related classes 
> --
>
> Key: TEZ-3963
> URL: https://issues.apache.org/jira/browse/TEZ-3963
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
> Attachments: TEZ-3963.001.patch
>
>
> I don't think [this is 
> closed|https://github.com/apache/tez/blob/314dfc79b4b3f528b680b4fee73ad0dca3a3a19b/tez-api/src/main/java/org/apache/tez/common/TezCommonUtils.java#L397]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3953) Restore ABI-compat for DAGClient for TEZ-3951

2018-07-05 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3953:

Fix Version/s: 0.9.2
  Description: I committed this to branch-0.9.

> Restore ABI-compat for DAGClient for TEZ-3951
> -
>
> Key: TEZ-3953
> URL: https://issues.apache.org/jira/browse/TEZ-3953
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.10.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Fix For: 0.9.2, 0.10.0
>
> Attachments: TEZ-3953.patch
>
>
> I committed this to branch-0.9.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3963) Possible InflaterInputStream leaked in TezCommonUtils and related classes

2018-07-05 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3963:

Attachment: TEZ-3963.001.patch

> Possible InflaterInputStream leaked in TezCommonUtils and related classes 
> --
>
> Key: TEZ-3963
> URL: https://issues.apache.org/jira/browse/TEZ-3963
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
> Attachments: TEZ-3963.001.patch
>
>
> I don't think [this is 
> closed|https://github.com/apache/tez/blob/314dfc79b4b3f528b680b4fee73ad0dca3a3a19b/tez-api/src/main/java/org/apache/tez/common/TezCommonUtils.java#L397]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3963) Possible InflaterInputStream leaked in TezCommonUtils and related classes

2018-07-05 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534158#comment-16534158
 ] 

Jason Lowe commented on TEZ-3963:
-

Not sure the QA bot supports pull requests, so uploading the patch that is 
equivalent to the PR on github.

> Possible InflaterInputStream leaked in TezCommonUtils and related classes 
> --
>
> Key: TEZ-3963
> URL: https://issues.apache.org/jira/browse/TEZ-3963
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
> Attachments: TEZ-3963.001.patch
>
>
> I don't think [this is 
> closed|https://github.com/apache/tez/blob/314dfc79b4b3f528b680b4fee73ad0dca3a3a19b/tez-api/src/main/java/org/apache/tez/common/TezCommonUtils.java#L397]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (TEZ-3964) Inflater not closed in some places

2018-07-05 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned TEZ-3964:
---

Assignee: Jaume M

Thanks for the report and patch!

Looks good overall.  I think it would be prudent to wrap the code in 
convertDAGPlanToATSMap in a try..finally block for future maintenance.  The 
signature already declares it can throw IOException, so if someone adds some 
code that does throw and forgets to handle the inflater.end call the bug will 
reappear.  I'd recommend refactoring the body into a private method that takes 
an Inflater argument rather than indenting the whole body with the patch -- 
it's already a really long method.


> Inflater not closed in some places
> --
>
> Key: TEZ-3964
> URL: https://issues.apache.org/jira/browse/TEZ-3964
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
>
> We call [this 
> method|https://github.com/apache/tez/blob/314dfc79b4b3f528b680b4fee73ad0dca3a3a19b/tez-api/src/main/java/org/apache/tez/common/TezCommonUtils.java#L363]
>  from a few places. We don't call {{end()}} from most of the place where we 
> call and although it's not necessary to call it explicitly it's the 
> recommended way in the docs to do so.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3951) TezClient wait too long for the DAGClient for prewarm; tries to shut down the wrong DAG

2018-07-05 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3951:

Fix Version/s: (was: 0.9.next)
   0.9.2

> TezClient wait too long for the DAGClient for prewarm; tries to shut down the 
> wrong DAG
> ---
>
> Key: TEZ-3951
> URL: https://issues.apache.org/jira/browse/TEZ-3951
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Fix For: 0.9.2, 0.10.0
>
> Attachments: TEZ-3951.01.patch, TEZ-3951.patch
>
>
> Follow-up from TEZ-3943



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3953) Restore ABI-compat for DAGClient for TEZ-3951

2018-07-05 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534109#comment-16534109
 ] 

Jason Lowe commented on TEZ-3953:
-

Does this need to be committed to branch-0.9 since TEZ-3951 went to branch-0.9?

> Restore ABI-compat for DAGClient for TEZ-3951
> -
>
> Key: TEZ-3953
> URL: https://issues.apache.org/jira/browse/TEZ-3953
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.10.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Fix For: 0.10.0
>
> Attachments: TEZ-3953.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3962) Configuration decode leaks an Inflater object

2018-06-28 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526781#comment-16526781
 ] 

Jason Lowe commented on TEZ-3962:
-

Thanks for updating the patch!  The unit test failures are unrelated and are 
test timeouts caused by file sync changes in upstream Hadoop local dir 
allocator (which I believe they are undoing do to all the problems it caused).

+1 lgtm.  Committing this.


> Configuration decode leaks an Inflater object
> -
>
> Key: TEZ-3962
> URL: https://issues.apache.org/jira/browse/TEZ-3962
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.2, 0.10.0
>Reporter: Gopal V
>Assignee: Eric Wohlstadter
>Priority: Major
> Attachments: TEZ-3962.1.patch, TEZ-3962.2.patch
>
>
> {code}
> public static Configuration createConfFromByteString(ByteString byteString) 
> throws IOException {
> ...
> InflaterInputStream uncompressIs = new 
> InflaterInputStream(byteString.newInput());
> DAGProtos.ConfigurationProto confProto = 
> DAGProtos.ConfigurationProto.parseFrom(uncompressIs);
> {code}
> InflaterInputStream is never closed, this will get eventually collected - but 
> the off-heap buffers for Inflater leaks temporarily.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3962) Configuration decode leaks an Inflater object

2018-06-28 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526319#comment-16526319
 ] 

Jason Lowe commented on TEZ-3962:
-

Thanks for the patch!

Why does this patch change the semantics to translate IOExceptions to 
RuntimeExceptions?  That does not seem desirable and is unrelated to the 
reported issue.  I don't see a need to add a catch clause here, as there wasn't 
one before.  Is there another issue being fixed at the same time?

> Configuration decode leaks an Inflater object
> -
>
> Key: TEZ-3962
> URL: https://issues.apache.org/jira/browse/TEZ-3962
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.2, 0.10.0
>Reporter: Gopal V
>Assignee: Eric Wohlstadter
>Priority: Major
> Attachments: TEZ-3962.1.patch
>
>
> {code}
> public static Configuration createConfFromByteString(ByteString byteString) 
> throws IOException {
> ...
> InflaterInputStream uncompressIs = new 
> InflaterInputStream(byteString.newInput());
> DAGProtos.ConfigurationProto confProto = 
> DAGProtos.ConfigurationProto.parseFrom(uncompressIs);
> {code}
> InflaterInputStream is never closed, this will get eventually collected - but 
> the off-heap buffers for Inflater leaks temporarily.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3943) TezClient leaks DAGClient for prewarm

2018-05-29 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494101#comment-16494101
 ] 

Jason Lowe commented on TEZ-3943:
-

+1 lgtm.  The unit test failures are expected due to the recent update of netty 
dependencies conflicting with the Hadoop 2.x minicluster and should be resolved 
when TEZ-3923 is resolved.

Committing this.


> TezClient leaks DAGClient for prewarm
> -
>
> Key: TEZ-3943
> URL: https://issues.apache.org/jira/browse/TEZ-3943
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3943.01.patch, TEZ-3943.02.patch, TEZ-3943.patch
>
>
> This may in turn leak some security related threads.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3939) Remove performance hit of precondition check in AM for register running task attempt

2018-05-24 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489069#comment-16489069
 ] 

Jason Lowe commented on TEZ-3939:
-

Thanks for updating the patch!  +1 lgtm.  Committing this.

> Remove performance hit of precondition check in AM for register running task 
> attempt
> 
>
> Key: TEZ-3939
> URL: https://issues.apache.org/jira/browse/TEZ-3939
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3939.001.patch, TEZ-3939.002.patch
>
>
> {noformat}
>java.lang.Thread.State: RUNNABLE
> at org.apache.tez.dag.records.TezTaskID.appendTo(TezTaskID.java:118)
> at 
> org.apache.tez.dag.records.TezTaskAttemptID.appendTo(TezTaskAttemptID.java:97)
> at 
> org.apache.tez.dag.records.TezTaskAttemptID.toString(TezTaskAttemptID.java:119)
> at java.lang.String.valueOf(String.java:2994)
> at java.lang.StringBuilder.append(StringBuilder.java:131)
> at 
> org.apache.tez.dag.app.TezTaskCommunicatorImpl.registerRunningTaskAttempt(TezTaskCommunicatorImpl.java:225)
> at 
> org.apache.tez.dag.app.TaskCommunicatorWrapper.registerRunningTaskAttempt(TaskCommunicatorWrapper.java:56)
> at 
> org.apache.tez.dag.app.TaskCommunicatorManager.registerTaskAttempt(TaskCommunicatorManager.java:565)
> at 
> org.apache.tez.dag.app.rm.container.AMContainerImpl.registerAttemptWithListener(AMContainerImpl.java:1184)
> at 
> org.apache.tez.dag.app.rm.container.AMContainerImpl$AssignTaskAttemptTransition.transition(AMContainerImpl.java:656)
> at 
> org.apache.tez.dag.app.rm.container.AMContainerImpl$AssignTaskAttemptTransition.transition(AMContainerImpl.java:595)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> - locked <0x00079b9161f8> (a 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
> at 
> org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:59)
> at 
> org.apache.tez.dag.app.rm.container.AMContainerImpl.handle(AMContainerImpl.java:441)
> at 
> org.apache.tez.dag.app.rm.container.AMContainerImpl.handle(AMContainerImpl.java:78)
> at 
> org.apache.tez.dag.app.rm.container.AMContainerMap.handle(AMContainerMap.java:68)
> at 
> org.apache.tez.dag.app.rm.container.AMContainerMap.handle(AMContainerMap.java:40)
> at 
> org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180)
> at 
> org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3940) Reduce time to convert TaskFinishedEvent to string

2018-05-24 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489007#comment-16489007
 ] 

Jason Lowe commented on TEZ-3940:
-

Thanks for updating the patch!  +1 lgtm.  Committing this.

> Reduce time to convert TaskFinishedEvent to string
> --
>
> Key: TEZ-3940
> URL: https://issues.apache.org/jira/browse/TEZ-3940
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3940.001.patch, TEZ-3940.002.patch, 
> TEZ-3940.003.patch
>
>
> Found a small CPU improvement while investigating a high CPU AM.
> {noformat}
> "Dispatcher thread {Central}" #38 prio=5 os_prio=0 tid=0x2ba188535800 
> nid=0x1b3e3 runnable [0x2ba1a3e02000]
>java.lang.Thread.State: RUNNABLE
> at java.util.Arrays.copyOf(Arrays.java:3332)
> at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
> at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
> at java.lang.StringBuilder.append(StringBuilder.java:136)
> at 
> org.apache.tez.common.counters.AbstractCounters.toString(AbstractCounters.java:344)
> - locked <0x0007a2ed6b80> (a 
> org.apache.tez.common.counters.TezCounters)
> at 
> org.apache.tez.dag.history.events.TaskFinishedEvent.toString(TaskFinishedEvent.java:135)
> at 
> org.apache.tez.dag.history.HistoryEventHandler.handleCriticalEvent(HistoryEventHandler.java:155)
> at 
> org.apache.tez.dag.history.HistoryEventHandler.handle(HistoryEventHandler.java:259)
> at 
> org.apache.tez.dag.app.dag.impl.TaskImpl.logJobHistoryTaskFinishedEvent(TaskImpl.java:923)
> at 
> org.apache.tez.dag.app.dag.impl.TaskImpl$AttemptSucceededTransition.transition(TaskImpl.java:1116)
> at 
> org.apache.tez.dag.app.dag.impl.TaskImpl$AttemptSucceededTransition.transition(TaskImpl.java:1036)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> - locked <0x000717ed2120> (a 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
> at 
> org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:59)
> at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:826)
> at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:112)
> at 
> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:2312)
> at 
> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:2299)
> at 
> org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180)
> at 
> org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3940) Reduce time to convert TaskFinishedEvent to string

2018-05-23 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487812#comment-16487812
 ] 

Jason Lowe commented on TEZ-3940:
-

Thanks for updating the patch!  The patch doesn't apply to master for me.  The 
first hunk in TaskFinishedEvent fails.  Otherwise patch looks good pending 
Jenkins.

> Reduce time to convert TaskFinishedEvent to string
> --
>
> Key: TEZ-3940
> URL: https://issues.apache.org/jira/browse/TEZ-3940
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3940.001.patch, TEZ-3940.002.patch
>
>
> Found a small CPU improvement while investigating a high CPU AM.
> {noformat}
> "Dispatcher thread {Central}" #38 prio=5 os_prio=0 tid=0x2ba188535800 
> nid=0x1b3e3 runnable [0x2ba1a3e02000]
>java.lang.Thread.State: RUNNABLE
> at java.util.Arrays.copyOf(Arrays.java:3332)
> at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
> at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
> at java.lang.StringBuilder.append(StringBuilder.java:136)
> at 
> org.apache.tez.common.counters.AbstractCounters.toString(AbstractCounters.java:344)
> - locked <0x0007a2ed6b80> (a 
> org.apache.tez.common.counters.TezCounters)
> at 
> org.apache.tez.dag.history.events.TaskFinishedEvent.toString(TaskFinishedEvent.java:135)
> at 
> org.apache.tez.dag.history.HistoryEventHandler.handleCriticalEvent(HistoryEventHandler.java:155)
> at 
> org.apache.tez.dag.history.HistoryEventHandler.handle(HistoryEventHandler.java:259)
> at 
> org.apache.tez.dag.app.dag.impl.TaskImpl.logJobHistoryTaskFinishedEvent(TaskImpl.java:923)
> at 
> org.apache.tez.dag.app.dag.impl.TaskImpl$AttemptSucceededTransition.transition(TaskImpl.java:1116)
> at 
> org.apache.tez.dag.app.dag.impl.TaskImpl$AttemptSucceededTransition.transition(TaskImpl.java:1036)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> - locked <0x000717ed2120> (a 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
> at 
> org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:59)
> at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:826)
> at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:112)
> at 
> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:2312)
> at 
> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:2299)
> at 
> org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180)
> at 
> org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3940) Reduce time to convert TaskFinishedEvent to string

2018-05-23 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487431#comment-16487431
 ] 

Jason Lowe commented on TEZ-3940:
-

Thanks for the report and patch!

Nit: The ternary operator can be simplified with String.valueOf() instead.

There's some whitespace missing when the counters are converted.  The original 
counter toString method would insert newlines and indentation between lines but 
that is missing in this new version.


> Reduce time to convert TaskFinishedEvent to string
> --
>
> Key: TEZ-3940
> URL: https://issues.apache.org/jira/browse/TEZ-3940
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3940.001.patch
>
>
> Found a small CPU improvement while investigating a high CPU AM.
> {noformat}
> "Dispatcher thread {Central}" #38 prio=5 os_prio=0 tid=0x2ba188535800 
> nid=0x1b3e3 runnable [0x2ba1a3e02000]
>java.lang.Thread.State: RUNNABLE
> at java.util.Arrays.copyOf(Arrays.java:3332)
> at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
> at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
> at java.lang.StringBuilder.append(StringBuilder.java:136)
> at 
> org.apache.tez.common.counters.AbstractCounters.toString(AbstractCounters.java:344)
> - locked <0x0007a2ed6b80> (a 
> org.apache.tez.common.counters.TezCounters)
> at 
> org.apache.tez.dag.history.events.TaskFinishedEvent.toString(TaskFinishedEvent.java:135)
> at 
> org.apache.tez.dag.history.HistoryEventHandler.handleCriticalEvent(HistoryEventHandler.java:155)
> at 
> org.apache.tez.dag.history.HistoryEventHandler.handle(HistoryEventHandler.java:259)
> at 
> org.apache.tez.dag.app.dag.impl.TaskImpl.logJobHistoryTaskFinishedEvent(TaskImpl.java:923)
> at 
> org.apache.tez.dag.app.dag.impl.TaskImpl$AttemptSucceededTransition.transition(TaskImpl.java:1116)
> at 
> org.apache.tez.dag.app.dag.impl.TaskImpl$AttemptSucceededTransition.transition(TaskImpl.java:1036)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> - locked <0x000717ed2120> (a 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
> at 
> org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:59)
> at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:826)
> at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:112)
> at 
> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:2312)
> at 
> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:2299)
> at 
> org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180)
> at 
> org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3939) Remove performance hit of precondition check in AM for register running task attempt

2018-05-23 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487385#comment-16487385
 ] 

Jason Lowe commented on TEZ-3939:
-

+1 looks OK to me.  Note that I think we can get the benefit of the 
succinctness of the Preconditions check without the overhead if we changed it 
to use positional formatting arguments like this:
{code}
Preconditions.checkNotNull(containerInfo, "Cannot register task attempt %s 
to unknown container %s",
taskSpec.getTaskAttemptID(), containerId);
{code}


> Remove performance hit of precondition check in AM for register running task 
> attempt
> 
>
> Key: TEZ-3939
> URL: https://issues.apache.org/jira/browse/TEZ-3939
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3939.001.patch
>
>
> {noformat}
>java.lang.Thread.State: RUNNABLE
> at org.apache.tez.dag.records.TezTaskID.appendTo(TezTaskID.java:118)
> at 
> org.apache.tez.dag.records.TezTaskAttemptID.appendTo(TezTaskAttemptID.java:97)
> at 
> org.apache.tez.dag.records.TezTaskAttemptID.toString(TezTaskAttemptID.java:119)
> at java.lang.String.valueOf(String.java:2994)
> at java.lang.StringBuilder.append(StringBuilder.java:131)
> at 
> org.apache.tez.dag.app.TezTaskCommunicatorImpl.registerRunningTaskAttempt(TezTaskCommunicatorImpl.java:225)
> at 
> org.apache.tez.dag.app.TaskCommunicatorWrapper.registerRunningTaskAttempt(TaskCommunicatorWrapper.java:56)
> at 
> org.apache.tez.dag.app.TaskCommunicatorManager.registerTaskAttempt(TaskCommunicatorManager.java:565)
> at 
> org.apache.tez.dag.app.rm.container.AMContainerImpl.registerAttemptWithListener(AMContainerImpl.java:1184)
> at 
> org.apache.tez.dag.app.rm.container.AMContainerImpl$AssignTaskAttemptTransition.transition(AMContainerImpl.java:656)
> at 
> org.apache.tez.dag.app.rm.container.AMContainerImpl$AssignTaskAttemptTransition.transition(AMContainerImpl.java:595)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> - locked <0x00079b9161f8> (a 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
> at 
> org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:59)
> at 
> org.apache.tez.dag.app.rm.container.AMContainerImpl.handle(AMContainerImpl.java:441)
> at 
> org.apache.tez.dag.app.rm.container.AMContainerImpl.handle(AMContainerImpl.java:78)
> at 
> org.apache.tez.dag.app.rm.container.AMContainerMap.handle(AMContainerMap.java:68)
> at 
> org.apache.tez.dag.app.rm.container.AMContainerMap.handle(AMContainerMap.java:40)
> at 
> org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180)
> at 
> org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3935) DAG aware scheduler should release unassigned new containers rather than hold them

2018-05-22 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483935#comment-16483935
 ] 

Jason Lowe commented on TEZ-3935:
-

I'm not that familiar with the pre-warm feature, but I don't see any special 
handling for that feature in either the DAG aware scheduler or the original 
YarnTaskSchedulerService.  As I understand it, pre-warm is implemented by an 
initial, short-running vertex at the top of the DAG to get the containers 
started and primed.  To the schedulers, it just looks like any other task 
request, so no special handling is necessary.

> DAG aware scheduler should release unassigned new containers rather than hold 
> them
> --
>
> Key: TEZ-3935
> URL: https://issues.apache.org/jira/browse/TEZ-3935
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-3935.001.patch
>
>
> I saw a case for a very large job with many containers where the DAG aware 
> scheduler was getting behind on assigning containers.  Newly assigned 
> containers were not finding any matching request, so they were queued for 
> reuse processing.  However it took so long to get through all of the task and 
> container events that the container allocations expired before the container 
> was finally assigned and attempted to be launched.
> Newly assigned containers are assigned to their matching requests, even if 
> that violates the DAG priorities, so it should be safe to simply release 
> these if no tasks could be found to use them.  The matching request has 
> either been removed or already satisified with a reused container.  Besides, 
> if we can't find any tasks to take the newly assigned container then it is 
> very likely we have plenty of reusable containers already, and keeping more 
> containers just makes the job a resource hog on the cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3902) Upgrade to netty-3.10.5.Final.jar

2018-05-21 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482960#comment-16482960
 ] 

Jason Lowe commented on TEZ-3902:
-

Attached a new patch that updates netty to 3.10.5 across all the Tez 
subprojects.

bq. I believe async-http-client should be OK with the proposed version of 
netty, but I haven't had time to test it.
Well I had a bit of time to test it, and it did not work. :-(  I had to update 
the async-http-client dependency version as well, which had some 
incompatibilities that required corresponding updates to the Tez code calling 
that dependency.



> Upgrade to netty-3.10.5.Final.jar
> -
>
> Key: TEZ-3902
> URL: https://issues.apache.org/jira/browse/TEZ-3902
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Eric Wohlstadter
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-3902.001.patch, TEZ-3902.002.patch
>
>
> Hadoop 3 and Hive have upgraded to netty-3.10.5.Final, which is not 
> compatible with current Tez dependency netty-3.6.2.Final.
>  
> However, org.apache.tez.shufflehandler.ShuffleHandler depends on 3.6.2 
> specific methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3902) Upgrade to netty-3.10.5.Final.jar

2018-05-21 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3902:

Attachment: TEZ-3902.002.patch

> Upgrade to netty-3.10.5.Final.jar
> -
>
> Key: TEZ-3902
> URL: https://issues.apache.org/jira/browse/TEZ-3902
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Eric Wohlstadter
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-3902.001.patch, TEZ-3902.002.patch
>
>
> Hadoop 3 and Hive have upgraded to netty-3.10.5.Final, which is not 
> compatible with current Tez dependency netty-3.6.2.Final.
>  
> However, org.apache.tez.shufflehandler.ShuffleHandler depends on 3.6.2 
> specific methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3937) Empty partition BitSet to byte[] conversion creates one extra byte in rounding error

2018-05-21 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482555#comment-16482555
 ] 

Jason Lowe commented on TEZ-3937:
-

Thanks for updating the patch! I think this is a safer change.  The unit test 
failure is not related, and the test passes locally for me with the patch 
applied.

+1 for the latest patch.  Committing this.


> Empty partition BitSet to byte[] conversion creates one extra byte in 
> rounding error
> 
>
> Key: TEZ-3937
> URL: https://issues.apache.org/jira/browse/TEZ-3937
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3937.001.patch, TEZ-3937.002.patch
>
>
> Byte array length calculation is defined as (bitset.length / 8) + 1 which has 
> off by one errors on byte boundaries. For example, BitSet of length 0 is 
> converted to a byte array of length 1. This was introduced as part of TEZ-972 
> since BitSet.toByteArray and valueOf were not supported as Tez supported Java 
> 6 at the time and API was introduced in Java 7.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3937) Empty partition BitSet to byte[] conversion creates one extra byte in rounding error

2018-05-17 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479252#comment-16479252
 ] 

Jason Lowe commented on TEZ-3937:
-

Thanks for the patch!

I love how this drastically simplifies the code for BitSet serialization, but I 
do have one concern with the change.  If I'm reading the code correctly, the 
new vesion will serialize and deserialize the bytes in the opposite order that 
the old code does.  I think it's fine as long as all callers are pairing these 
two methods together, e.g.: all calls to deserialize a BitSet are passing byte 
arrays that were encoded with the new encoded method.

What worries me there is if a job ever ran where some Tez jars happened to be 
an older version before this change.  The BitSet would be deserialized mostly 
backwards (same order of bits within a byte but opposite byte order).  That 
could lead to silent dataloss since the consumer may think a partition is empty 
when in fact it's a completely different partition is empty, due to how shuffle 
interprets these BitSets.

If accidental mixing of Tez jars isn't a concern I think this change is fine.  
If we're mostly worried about the wasted byte when the number of bits is a 
multiple of 8 then I think a safer change is to fix the toByteArray method to 
size the destination byte array properly instead.


> Empty partition BitSet to byte[] conversion creates one extra byte in 
> rounding error
> 
>
> Key: TEZ-3937
> URL: https://issues.apache.org/jira/browse/TEZ-3937
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3937.001.patch
>
>
> Byte array length calculation is defined as (bitset.length / 8) + 1 which has 
> off by one errors on byte boundaries. For example, BitSet of length 0 is 
> converted to a byte array of length 1. This was introduced as part of TEZ-972 
> since BitSet.toByteArray and valueOf were not supported as Tez supported Java 
> 6 at the time and API was introduced in Java 7.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3824) MRCombiner creates new JobConf copy per spill

2018-05-14 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474573#comment-16474573
 ] 

Jason Lowe commented on TEZ-3824:
-

Thanks for updating the patch!  The unit test failure does not appear to be 
related, looks like the precommit machine got slow and triggered an aggressive 
test timeout.

+1 for the patch.  Committing this.


> MRCombiner creates new JobConf copy per spill
> -
>
> Key: TEZ-3824
> URL: https://issues.apache.org/jira/browse/TEZ-3824
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3824.001.patch, TEZ-3824.002.patch
>
>
> {noformat:title=scope-57(HASH_JOIN) stack trace}
> "SpillThread {scope_60_" #99 daemon prio=5 os_prio=0 tid=0x7f2128d21800 
> nid=0x7487 runnable [0x7f21154c4000]
>java.lang.Thread.State: RUNNABLE
> at 
> java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1012)
> at 
> java.util.concurrent.ConcurrentHashMap.putAll(ConcurrentHashMap.java:1084)
> at 
> java.util.concurrent.ConcurrentHashMap.(ConcurrentHashMap.java:852)
> at org.apache.hadoop.conf.Configuration.(Configuration.java:728)
> - locked <0xd1dc5240> (a org.apache.hadoop.conf.Configuration)
> at org.apache.hadoop.mapred.JobConf.(JobConf.java:442)
> at 
> org.apache.hadoop.mapreduce.task.JobContextImpl.(JobContextImpl.java:67)
> at 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl.(TaskAttemptContextImpl.java:49)
> at 
> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.(TaskInputOutputContextImpl.java:54)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.(ReduceContextImpl.java:95)
> at 
> org.apache.tez.mapreduce.combine.MRCombiner.createReduceContext(MRCombiner.java:237)
> at 
> org.apache.tez.mapreduce.combine.MRCombiner.runNewCombiner(MRCombiner.java:181)
> at 
> org.apache.tez.mapreduce.combine.MRCombiner.combine(MRCombiner.java:115)
> at 
> org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.runCombineProcessor(ExternalSorter.java:313)
> at 
> org.apache.tez.runtime.library.common.sort.impl.dflt.DefaultSorter.spill(DefaultSorter.java:937)
> at 
> org.apache.tez.runtime.library.common.sort.impl.dflt.DefaultSorter.sortAndSpill(DefaultSorter.java:861)
> at 
> org.apache.tez.runtime.library.common.sort.impl.dflt.DefaultSorter$SpillThread.run(DefaultSorter.java:780)
> {noformat}
> {code:title=JobConf copy construction for tez}
>   public JobContextImpl(Configuration conf, JobID jobId) {
> if (conf instanceof JobConf) {
>   this.conf = (JobConf)conf;
> } else {
> --->this.conf = new JobConf(conf);<
> }
> this.jobId = jobId;
> this.credentials = this.conf.getCredentials();
> try {
>   this.ugi = UserGroupInformation.getCurrentUser();
> } catch (IOException e) {
>   throw new RuntimeException(e);
> }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3935) DAG aware scheduler should release unassigned new containers rather than hold them

2018-05-14 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474437#comment-16474437
 ] 

Jason Lowe commented on TEZ-3935:
-

Attaching a patch that by default will release new containers if they are not 
assigned, but a user can set tez.am.container.reuse.new-containers.enabled=true 
to restore the old behavior if their particular job benefits from holding onto 
unassigned new containers and the impact on the cluster utilization is not a 
concern.

> DAG aware scheduler should release unassigned new containers rather than hold 
> them
> --
>
> Key: TEZ-3935
> URL: https://issues.apache.org/jira/browse/TEZ-3935
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-3935.001.patch
>
>
> I saw a case for a very large job with many containers where the DAG aware 
> scheduler was getting behind on assigning containers.  Newly assigned 
> containers were not finding any matching request, so they were queued for 
> reuse processing.  However it took so long to get through all of the task and 
> container events that the container allocations expired before the container 
> was finally assigned and attempted to be launched.
> Newly assigned containers are assigned to their matching requests, even if 
> that violates the DAG priorities, so it should be safe to simply release 
> these if no tasks could be found to use them.  The matching request has 
> either been removed or already satisified with a reused container.  Besides, 
> if we can't find any tasks to take the newly assigned container then it is 
> very likely we have plenty of reusable containers already, and keeping more 
> containers just makes the job a resource hog on the cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3935) DAG aware scheduler should release unassigned new containers rather than hold them

2018-05-14 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3935:

Attachment: TEZ-3935.001.patch

> DAG aware scheduler should release unassigned new containers rather than hold 
> them
> --
>
> Key: TEZ-3935
> URL: https://issues.apache.org/jira/browse/TEZ-3935
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-3935.001.patch
>
>
> I saw a case for a very large job with many containers where the DAG aware 
> scheduler was getting behind on assigning containers.  Newly assigned 
> containers were not finding any matching request, so they were queued for 
> reuse processing.  However it took so long to get through all of the task and 
> container events that the container allocations expired before the container 
> was finally assigned and attempted to be launched.
> Newly assigned containers are assigned to their matching requests, even if 
> that violates the DAG priorities, so it should be safe to simply release 
> these if no tasks could be found to use them.  The matching request has 
> either been removed or already satisified with a reused container.  Besides, 
> if we can't find any tasks to take the newly assigned container then it is 
> very likely we have plenty of reusable containers already, and keeping more 
> containers just makes the job a resource hog on the cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-3935) DAG aware scheduler should release unassigned new containers rather than hold them

2018-05-14 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3935:
---

 Summary: DAG aware scheduler should release unassigned new 
containers rather than hold them
 Key: TEZ-3935
 URL: https://issues.apache.org/jira/browse/TEZ-3935
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe
Assignee: Jason Lowe


I saw a case for a very large job with many containers where the DAG aware 
scheduler was getting behind on assigning containers.  Newly assigned 
containers were not finding any matching request, so they were queued for reuse 
processing.  However it took so long to get through all of the task and 
container events that the container allocations expired before the container 
was finally assigned and attempted to be launched.

Newly assigned containers are assigned to their matching requests, even if that 
violates the DAG priorities, so it should be safe to simply release these if no 
tasks could be found to use them.  The matching request has either been removed 
or already satisified with a reused container.  Besides, if we can't find any 
tasks to take the newly assigned container then it is very likely we have 
plenty of reusable containers already, and keeping more containers just makes 
the job a resource hog on the cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3824) MRCombiner creates new JobConf copy per spill

2018-05-10 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470476#comment-16470476
 ] 

Jason Lowe commented on TEZ-3824:
-

bq. As is, the patch will send in a null for the config in case the old API is 
being used?

The new JobConf object added by the patch is only initialized if the new API is 
being used, but it is also only used if the new API is being used.  
createReduceContext is only called by runNewCombiner, which in turn is only 
called if useNewApi is true.

My main comment on the patch is whether we really need a separate jobConf 
field.  The constructor can simply check the new API flag from the parsed 
configuration and either assign {{conf}} to the parsed Configuration object or 
to a JobConf.  That way we don't have to hold onto a Configuration object _and_ 
a JobConf object when doing the new API.


> MRCombiner creates new JobConf copy per spill
> -
>
> Key: TEZ-3824
> URL: https://issues.apache.org/jira/browse/TEZ-3824
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3824.001.patch
>
>
> {noformat:title=scope-57(HASH_JOIN) stack trace}
> "SpillThread {scope_60_" #99 daemon prio=5 os_prio=0 tid=0x7f2128d21800 
> nid=0x7487 runnable [0x7f21154c4000]
>java.lang.Thread.State: RUNNABLE
> at 
> java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1012)
> at 
> java.util.concurrent.ConcurrentHashMap.putAll(ConcurrentHashMap.java:1084)
> at 
> java.util.concurrent.ConcurrentHashMap.(ConcurrentHashMap.java:852)
> at org.apache.hadoop.conf.Configuration.(Configuration.java:728)
> - locked <0xd1dc5240> (a org.apache.hadoop.conf.Configuration)
> at org.apache.hadoop.mapred.JobConf.(JobConf.java:442)
> at 
> org.apache.hadoop.mapreduce.task.JobContextImpl.(JobContextImpl.java:67)
> at 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl.(TaskAttemptContextImpl.java:49)
> at 
> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.(TaskInputOutputContextImpl.java:54)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.(ReduceContextImpl.java:95)
> at 
> org.apache.tez.mapreduce.combine.MRCombiner.createReduceContext(MRCombiner.java:237)
> at 
> org.apache.tez.mapreduce.combine.MRCombiner.runNewCombiner(MRCombiner.java:181)
> at 
> org.apache.tez.mapreduce.combine.MRCombiner.combine(MRCombiner.java:115)
> at 
> org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.runCombineProcessor(ExternalSorter.java:313)
> at 
> org.apache.tez.runtime.library.common.sort.impl.dflt.DefaultSorter.spill(DefaultSorter.java:937)
> at 
> org.apache.tez.runtime.library.common.sort.impl.dflt.DefaultSorter.sortAndSpill(DefaultSorter.java:861)
> at 
> org.apache.tez.runtime.library.common.sort.impl.dflt.DefaultSorter$SpillThread.run(DefaultSorter.java:780)
> {noformat}
> {code:title=JobConf copy construction for tez}
>   public JobContextImpl(Configuration conf, JobID jobId) {
> if (conf instanceof JobConf) {
>   this.conf = (JobConf)conf;
> } else {
> --->this.conf = new JobConf(conf);<
> }
> this.jobId = jobId;
> this.credentials = this.conf.getCredentials();
> try {
>   this.ugi = UserGroupInformation.getCurrentUser();
> } catch (IOException e) {
>   throw new RuntimeException(e);
> }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3932) TaskSchedulerManager can throw NullPointerException during DAGAppMaster container cleanup race

2018-05-09 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468960#comment-16468960
 ] 

Jason Lowe commented on TEZ-3932:
-

Thanks for the patch!  +1 lgtm.  Committing this.

> TaskSchedulerManager can throw NullPointerException during DAGAppMaster 
> container cleanup race
> --
>
> Key: TEZ-3932
> URL: https://issues.apache.org/jira/browse/TEZ-3932
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: arch: x86 and ppc
> java: openjdk version "1.8.0_161"
>  OpenJDK Runtime Environment (build 1.8.0_161-b14)
>  OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
>Reporter: Valencia Edna Serrao
>Assignee: Jonathan Eagles
>Priority: Major
>  Labels: ppc, x86
> Attachments: TEZ-3932.001.patch, TEZ-3932.fail.patch, 
> org.apache.tez.test.TestExceptionPropagation-output.txt
>
>
> Test 
> org.apache.tez.test.TestExceptionPropagation.testExceptionPropagationSession 
> on x86 and ppc. I found related JIRA's TEZ-3746 and TEZ-3748. Though the 
> issue is marked as resolved in the related JIRA's, the issue exists. Below 
> are the error details:
> {code:java}
> ---
> Test set: org.apache.tez.test.TestExceptionPropagation
> ---
> Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 96.433 sec 
> <<< FAILURE!
> testExceptionPropagationSession(org.apache.tez.test.TestExceptionPropagation) 
>  Time elapsed: 52.7 sec  <<< ERROR!
> org.apache.tez.dag.api.SessionNotRunning: Application not running, 
> applicationId=application_1525667420557_0001, yarnApplicationState=FAILED, 
> finalApplicationStatus=FAILED, trackingUrl=N/A, diagnostics=[DAG completed 
> with an ERROR state. Shutting down AM, Session stats:submittedDAGs=11, 
> successfulDAGs=0, failedDAGs=12, killedDAGs=0]
>     at 
> org.apache.tez.client.TezClientUtils.getAMProxy(TezClientUtils.java:910)
>     at org.apache.tez.client.TezClient.getAMProxy(TezClient.java:1024)
>     at org.apache.tez.client.TezClient.waitForProxy(TezClient.java:1034)
>     at 
> org.apache.tez.client.TezClient.submitDAGSession(TezClient.java:652)
>     at org.apache.tez.client.TezClient.submitDAG(TezClient.java:588)
>     at 
> org.apache.tez.test.TestExceptionPropagation.testExceptionPropagationSession(TestExceptionPropagation.java:227
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3930) TestDagAwareYarnTaskScheduler fails on Hadoop 3

2018-05-07 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3930:

Attachment: TEZ-3930.001.patch

> TestDagAwareYarnTaskScheduler fails on Hadoop 3
> ---
>
> Key: TEZ-3930
> URL: https://issues.apache.org/jira/browse/TEZ-3930
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jonathan Eagles
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-3930.001.patch
>
>
> When scheduler shutdown is called, the AMRMClientAsyncImple serviceStop is 
> invoke, which then interrupts the heartbeat thread and then proceeds to join 
> on the heartbeat thread. The heartbeat thread continues to run and continues 
> to throw NullPointerExceptions. The interrupt doesn't seem to cause the 
> thread to be interrupted now in Hadoop 3 (is YARN-5999 to blame or Tez)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (TEZ-3930) TestDagAwareYarnTaskScheduler fails on Hadoop 3

2018-05-04 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned TEZ-3930:
---

Assignee: Jason Lowe

> TestDagAwareYarnTaskScheduler fails on Hadoop 3
> ---
>
> Key: TEZ-3930
> URL: https://issues.apache.org/jira/browse/TEZ-3930
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jonathan Eagles
>Assignee: Jason Lowe
>Priority: Major
>
> When scheduler shutdown is called, the AMRMClientAsyncImple serviceStop is 
> invoke, which then interrupts the heartbeat thread and then proceeds to join 
> on the heartbeat thread. The heartbeat thread continues to run and continues 
> to throw NullPointerExceptions. The interrupt doesn't seem to cause the 
> thread to be interrupted now in Hadoop 3 (is YARN-5999 to blame or Tez)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3926) Changes to master for 0.10.x line and 0.9 release branch

2018-05-02 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461378#comment-16461378
 ] 

Jason Lowe commented on TEZ-3926:
-

+1 lgtm.

> Changes to master for 0.10.x line and 0.9 release branch
> 
>
> Key: TEZ-3926
> URL: https://issues.apache.org/jira/browse/TEZ-3926
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3926.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3923) Move master to Hadoop 3+ and create separate 0.9.x line

2018-05-02 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3923:

Target Version/s: 0.10.0
   Fix Version/s: (was: 0.10.0)

> Move master to Hadoop 3+ and create separate 0.9.x line
> ---
>
> Key: TEZ-3923
> URL: https://issues.apache.org/jira/browse/TEZ-3923
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Eric Wohlstadter
>Assignee: Gopal V
>Priority: Major
>
> Move master to support minimum Hadoop 3+ (0.10.x line) and create separate 
> branch for Hadoop 2 (0.9.x line)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3914) Recovering a large DAG fails to size limit exceeded

2018-04-27 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456697#comment-16456697
 ] 

Jason Lowe commented on TEZ-3914:
-

Thanks for updating the patch!  +1 lgtm.  Committing this.

> Recovering a large DAG fails to size limit exceeded
> ---
>
> Key: TEZ-3914
> URL: https://issues.apache.org/jira/browse/TEZ-3914
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3914.001.patch, TEZ-3914.002.patch, 
> TEZ-3914.003.patch, TEZ-3914.004.patch, TEZ-3914.005.patch
>
>
> A large message will be failed to parse and will be treated as recovery file 
> EOF.
> {noformat}
> 2018-04-16 15:33:59,807 WARN  [Thread-2] app.RecoveryParser 
> (RecoveryParser.java:parseRecoveryData(771)) - Corrupt data found when trying 
> to read next event
> com.google.protobuf.InvalidProtocolBufferException: Protocol message was too 
> large.  May be malicious.  Use CodedInputStream.setSizeLimit() to increase 
> the size limit.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3817) DAGs can hang after more than one uncaught Exception during doTransition.

2018-04-23 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448840#comment-16448840
 ] 

Jason Lowe commented on TEZ-3817:
-

+1 lgtm.

> DAGs can hang after more than one uncaught Exception during doTransition.
> -
>
> Key: TEZ-3817
> URL: https://issues.apache.org/jira/browse/TEZ-3817
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.1, 0.9.0
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
>Priority: Major
> Attachments: TEZ-3817.001.patch, TEZ-3817.002.patch, 
> TEZ-3817.003.patch, TEZ-3817.004.patch, TEZ-3817.005.patch, 
> TEZ-3817.test.patch
>
>
> A Tez DAG can hang in the last "sane" state if the 
> statemachine.doTransition() throws a runtime exception more than once. The 
> transition for the Error state itself throws an exception, the DAG hangs. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-3922) Tez session should use a final status of SUCCEEDED if all DAGs succeeded

2018-04-20 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3922:
---

 Summary: Tez session should use a final status of SUCCEEDED if all 
DAGs succeeded
 Key: TEZ-3922
 URL: https://issues.apache.org/jira/browse/TEZ-3922
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jason Lowe


Tez currently exits with an ENDED final status if it was a session, but that 
conveys no details of the DAGs within the session to the user.  It would be 
convenient if the session exited with a SUCCEEDED status if all the DAGs within 
the session succeeded.  Then a user browsing their jobs on the UI doesn't have 
to dig into every ENDED session to verify all the DAGs within that session 
succeeded.  They only need to dig into the sessions that did not succeed 
knowing at least one DAG did not succeed in that session.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3914) Recovering a large DAG fails to size limit exceeded

2018-04-18 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442949#comment-16442949
 ] 

Jason Lowe commented on TEZ-3914:
-

Thanks for updating the patch and providing an overview of the approach!

Rather than wrapping debug calls with isDebugEnabled, we should leverage 
slf4j's positional parameter syntax to make the debug call cheap enough to just 
call unconditionally, e.g.:
{code}
LOG.debug("Read HistoryEvent, eventType={}, event={}", 
historyEvent.getEventType(), historyEvent);
{code}

RecoveryStream should be marked VisibleForTesting.

I think a close() method on RecoveryStream could clean up the following code 
and make it more maintainable for future usages:
{code}
  entry.getValue().codedOutputStream.flush();
  entry.getValue().outputStream.hflush();
  entry.getValue().outputStream.close();
{code}

Speaking of a close() method for RecoveryStream to help correctness when the 
stream needs to be shut down, aren't we missing a codedOutputStream.flush() 
call here?
{code}
if (outputStreamMap.containsKey(dagId)) {
  try {
outputStreamMap.get(dagId).outputStream.close();
{code}

Similar to the RecoveryStream close method, I could see moving doFlush to a 
flush method for RecoveryStream

Suggestion: RecoveryStream should just take an output stream and create the 
coded output stream in the constructor rather than requiring the caller to 
create it and pass it.  The caller doesn't care about holding onto that coded 
stream in practice.


> Recovering a large DAG fails to size limit exceeded
> ---
>
> Key: TEZ-3914
> URL: https://issues.apache.org/jira/browse/TEZ-3914
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3914.001.patch, TEZ-3914.002.patch, 
> TEZ-3914.003.patch
>
>
> A large message will be failed to parse and will be treated as recovery file 
> EOF.
> {noformat}
> 2018-04-16 15:33:59,807 WARN  [Thread-2] app.RecoveryParser 
> (RecoveryParser.java:parseRecoveryData(771)) - Corrupt data found when trying 
> to read next event
> com.google.protobuf.InvalidProtocolBufferException: Protocol message was too 
> large.  May be malicious.  Use CodedInputStream.setSizeLimit() to increase 
> the size limit.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3914) Recovering a large DAG hang job

2018-04-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437400#comment-16437400
 ] 

Jason Lowe commented on TEZ-3914:
-

Thanks for the report and patch!  Many of the unit test failures are related.  
Could you elaborate a bit more on the approach taken for the fix?  It's a 
rather sizeable patch, and a high-level overview would help for the review.  

> Recovering a large DAG hang job
> ---
>
> Key: TEZ-3914
> URL: https://issues.apache.org/jira/browse/TEZ-3914
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3914.001.patch, TEZ-3914.002.patch
>
>
> Any failure to parse recovery event is ignore and treated as eof. Job can 
> hang since some task completions may be missed and shuffle will hang.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3913) Precommit build fails to post to JIRA

2018-04-09 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3913:

Attachment: TEZ-3913.002.patch

> Precommit build fails to post to JIRA
> -
>
> Key: TEZ-3913
> URL: https://issues.apache.org/jira/browse/TEZ-3913
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-3913.001.patch, TEZ-3913.002.patch
>
>
> The precommit build is failing to post comments to Jira due to a 404 error:
> {noformat}
> Unable to log in to server: 
> https://issues.apache.org/jira/rpc/soap/jirasoapservice-v2 with user: tezqa.
>  Cause: (404)404
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3913) Precommit build fails to post to JIRA

2018-04-09 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431326#comment-16431326
 ] 

Jason Lowe commented on TEZ-3913:
-

The precommit script is failing because SED is not defined.  Posting an updated 
patch shortly.

> Precommit build fails to post to JIRA
> -
>
> Key: TEZ-3913
> URL: https://issues.apache.org/jira/browse/TEZ-3913
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Major
> Fix For: 0.9.2
>
> Attachments: TEZ-3913.001.patch
>
>
> The precommit build is failing to post comments to Jira due to a 404 error:
> {noformat}
> Unable to log in to server: 
> https://issues.apache.org/jira/rpc/soap/jirasoapservice-v2 with user: tezqa.
>  Cause: (404)404
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3913) Precommit build fails to post to JIRA

2018-04-09 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3913:

Attachment: TEZ-3913.001.patch

> Precommit build fails to post to JIRA
> -
>
> Key: TEZ-3913
> URL: https://issues.apache.org/jira/browse/TEZ-3913
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-3913.001.patch
>
>
> The precommit build is failing to post comments to Jira due to a 404 error:
> {noformat}
> Unable to log in to server: 
> https://issues.apache.org/jira/rpc/soap/jirasoapservice-v2 with user: tezqa.
>  Cause: (404)404
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-3913) Precommit build fails to post to JIRA

2018-04-09 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3913:
---

 Summary: Precommit build fails to post to JIRA
 Key: TEZ-3913
 URL: https://issues.apache.org/jira/browse/TEZ-3913
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe
Assignee: Jason Lowe


The precommit build is failing to post comments to Jira due to a 404 error:
{noformat}
Unable to log in to server: 
https://issues.apache.org/jira/rpc/soap/jirasoapservice-v2 with user: tezqa.
 Cause: (404)404
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-3912) Fetchers should be more robust to corrupted inputs

2018-04-09 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3912:
---

 Summary: Fetchers should be more robust to corrupted inputs
 Key: TEZ-3912
 URL: https://issues.apache.org/jira/browse/TEZ-3912
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe


I recently saw a case where a bad node in the cluster produced corrupted 
shuffle data that caused the codec to throw IllegalArgumentException when 
trying to fetch.  Fetchers currently only handle IOException and InternalError, 
and any other type of exception will cause the entire task to be torn down.  We 
should consider catching Exception like MapReduce does to be more robust in 
light of other types of errors coming from the codec and allow retries to occur.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-2686) TestFaultTolerance fails frequently

2018-04-05 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427594#comment-16427594
 ] 

Jason Lowe commented on TEZ-2686:
-

There's been a lot of changes since this ticket was filed.  I'm OK with closing 
it for now, we can always reopen or refile if it appears again.

> TestFaultTolerance fails frequently 
> 
>
> Key: TEZ-2686
> URL: https://issues.apache.org/jira/browse/TEZ-2686
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jeff Zhang
>Assignee: Zhiyuan Yang
>Priority: Major
> Attachments: log.tar, syslog_dag_1451372520174_0001_18, 
> syslog_dag_1451372520174_0001_18_post
>
>
> TestFaultTolerance will fail with a very little possibility. But it fails 
> frequently recently, need to take a look at it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3909) DAG can hang if vertex with no tasks is killed

2018-04-02 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3909:

Attachment: TEZ-3909.001.patch

> DAG can hang if vertex with no tasks is killed
> --
>
> Key: TEZ-3909
> URL: https://issues.apache.org/jira/browse/TEZ-3909
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-3909.001.patch
>
>
> If a vertex with no tasks is killed just as it is starting then the vertex 
> can fail to reach a terminal state if the terminate event arrives while the 
> vertex is still in the RUNNING state.  The terminate moves it to the 
> TERMINATING state which ignores the V_COMPLETED event that later arrives.  
> Once it drops the completed event, no other event will kick it out of the 
> TERMINATING state and the DAG hangs forever waiting for all vertices to 
> complete.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-3909) DAG can hang if vertex with no tasks is killed

2018-04-02 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3909:
---

 Summary: DAG can hang if vertex with no tasks is killed
 Key: TEZ-3909
 URL: https://issues.apache.org/jira/browse/TEZ-3909
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Jason Lowe
Assignee: Jason Lowe


If a vertex with no tasks is killed just as it is starting then the vertex can 
fail to reach a terminal state if the terminate event arrives while the vertex 
is still in the RUNNING state.  The terminate moves it to the TERMINATING state 
which ignores the V_COMPLETED event that later arrives.  Once it drops the 
completed event, no other event will kick it out of the TERMINATING state and 
the DAG hangs forever waiting for all vertices to complete.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (TEZ-3902) Upgrade to netty-3.10.5.Final.jar

2018-03-26 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned TEZ-3902:
---

Assignee: Jason Lowe  (was: Jonathan Eagles)

> Upgrade to netty-3.10.5.Final.jar
> -
>
> Key: TEZ-3902
> URL: https://issues.apache.org/jira/browse/TEZ-3902
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-3902.001.patch
>
>
> Hadoop 3 and Hive have upgraded to netty-3.10.5.Final, which is not 
> compatible with current Tez dependency netty-3.6.2.Final.
>  
> However, org.apache.tez.shufflehandler.ShuffleHandler depends on 3.6.2 
> specific methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3902) Upgrade to netty-3.10.5.Final.jar

2018-03-26 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414475#comment-16414475
 ] 

Jason Lowe commented on TEZ-3902:
-

Here's a patch that updates the netty version per my discussion above.  The 
only difference in the dist tarballs after this change is that 
tez-ext-service-tests.jar and netty-3.6.2.Final.jar are missing from the 
minimal tarball after this change.  I have not had any time to test this.  I 
verified it builds successfully and does what I believe is desired as far as 
removing the netty jar that is conflicting with Hive and Hadoop 3.

I *think* this is a safe change for Hadoop 2.x as well, since we're not 
changing the version of a jar we're shipping in the minimal build but rather 
relying on the netty version we get from Hadoop.

> Upgrade to netty-3.10.5.Final.jar
> -
>
> Key: TEZ-3902
> URL: https://issues.apache.org/jira/browse/TEZ-3902
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3902.001.patch
>
>
> Hadoop 3 and Hive have upgraded to netty-3.10.5.Final, which is not 
> compatible with current Tez dependency netty-3.6.2.Final.
>  
> However, org.apache.tez.shufflehandler.ShuffleHandler depends on 3.6.2 
> specific methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3902) Upgrade to netty-3.10.5.Final.jar

2018-03-26 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3902:

Attachment: TEZ-3902.001.patch

> Upgrade to netty-3.10.5.Final.jar
> -
>
> Key: TEZ-3902
> URL: https://issues.apache.org/jira/browse/TEZ-3902
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3902.001.patch
>
>
> Hadoop 3 and Hive have upgraded to netty-3.10.5.Final, which is not 
> compatible with current Tez dependency netty-3.6.2.Final.
>  
> However, org.apache.tez.shufflehandler.ShuffleHandler depends on 3.6.2 
> specific methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3902) Upgrade to netty-3.10.5.Final.jar

2018-03-26 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414459#comment-16414459
 ] 

Jason Lowe commented on TEZ-3902:
-

bq. Should we file a separate ticket for async-http-client or investigate it 
under this Jira?

It would be under this JIRA since this is about upgrading netty.  Either we 
upgrade it and async http is happy or we need to fix it as part of this JIRA.

I believe async-http-client should be OK with the proposed version of netty, 
but I haven't had time to test it.  I'll attach a patch shortly which removes 
the netty jar from the minimal tarball distribution and upgrades the netty 
version, assuming tez-ext-service-tests should never have been shipped in the 
first place as part of a Tez deploy.

> Upgrade to netty-3.10.5.Final.jar
> -
>
> Key: TEZ-3902
> URL: https://issues.apache.org/jira/browse/TEZ-3902
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Jonathan Eagles
>Priority: Major
>
> Hadoop 3 and Hive have upgraded to netty-3.10.5.Final, which is not 
> compatible with current Tez dependency netty-3.6.2.Final.
>  
> However, org.apache.tez.shufflehandler.ShuffleHandler depends on 3.6.2 
> specific methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3907) Improve log message to include the location the writers decide to spill output

2018-03-26 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414439#comment-16414439
 ] 

Jason Lowe commented on TEZ-3907:
-

QA bot said:

-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12916219/TEZ-3907.002.patch
  against master revision 85bd772.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2745//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2745//console

-

+1 lgtm.  Committing this.


> Improve log message to include the location the writers decide to spill output
> --
>
> Key: TEZ-3907
> URL: https://issues.apache.org/jira/browse/TEZ-3907
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
>Priority: Major
> Attachments: TEZ-3907.001.patch, TEZ-3907.002.patch
>
>
> It helps debugging if the log message that prints the start of spill with the 
> buffer sizes and other pointers to include the location as well. This Jira 
> will add the already known location string to the log message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3892) getClient API for TezClient

2018-03-26 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414289#comment-16414289
 ] 

Jason Lowe commented on TEZ-3892:
-

[~ewohlstadter] what more do you need for this?  If anything it should be a 
separate JIRA, since this one is committed and resolved.



> getClient API for TezClient
> ---
>
> Key: TEZ-3892
> URL: https://issues.apache.org/jira/browse/TEZ-3892
> Project: Apache Tez
>  Issue Type: New Feature
>Reporter: Eric Wohlstadter
>Assignee: Eric Wohlstadter
>Priority: Major
> Fix For: 0.9.2
>
> Attachments: TEZ-3892.1.patch, TEZ-3892.2.patch
>
>
> This is a proposed opt-in feature.
> Tez AM already supports long-lived sessions, if desired a AM session can live 
> indefinitely.
> However, new clients cannot connect to a long-lived AM session through the 
> standard TezClient API. 
> TezClient API only provides a "start" method to initiate a connection, which 
> always allocates a new AM from YARN.
>  # For interactive BI use-cases, this startup time can be significant.
>  # Hive is implementing a HiveServer2 High Availability feature.
>  ** When the singleton HS2 master server fails, the HS2 client is quickly 
> redirected to a pre-warmed HS2 backup. 
>  # For the failover to complete quickly end-to-end, a Tez AM must also be 
> pre-warmed and ready to accept connections.
> For more information, see design for: 
> https://issues.apache.org/jira/browse/HIVE-18281.
> 
> Anticipated changes:
>  # A getClient{{(ApplicationId)}} method is added to TezClient. The 
> functionality is similar to {{start}}
>  ** Code related to launching a new AM from the RM is factored out.
>  ** Since {{start}} and getClient will share some code, this code is 
> refactored into reusable helper methods.
>  ** A usage example is added to {{org/apache/tez/examples}}
>  # It is not a goal of this JIRA to ensure that running Tez DAGs can be 
> recovered by a client using the getClient API. The goal is only for 
> maintaining a pool of warm Tez AMs to skip RM/container/JVM startup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3902) Upgrade to netty-3.10.5.Final.jar

2018-03-26 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414051#comment-16414051
 ] 

Jason Lowe commented on TEZ-3902:
-

After poking around a bit, it looks like the netty jar being included in the 
lib/ directory for the minimal build is coming from the tez shuffle handler. 
The minimal build explicitly excludes the aux-services project. Instead it is 
coming in from tez-ext-service-tests which doesn't look like something we would 
want to ship with a minimal deploy anyway. So simply adding 
tez-ext-service-tests to the tez-dist-minimal.xml exclude list is enough to 
remove the offending netty jar.

I noticed netty 3.6.2-Final is also being referenced by one of other other 
dependencies, async-http-client. However it looks like it's happy to use 
whatever version of netty Tez is using. If I update Tez to use 3.10.5.Final 
then dependency:tree shows async-http-client using netty 3.10.5.Final.
{quote}When I run dependency:tree on the top level I do see this netty version 
being picked by zookeeper and other pieces. Not sure if that is something we 
care about.
{quote}
Zookeeper is not something we need to worry about since that's being pulled in 
by Hadoop. Hadoop and all of its dependencies are considered provided to the 
minimal tarball.  Zookeeper is not being used by Tez directly, and Hadoop will 
provide a version of netty to make Zookeeper and other Hadoop dependencies 
happy.  As long as async-http-client truly is happy with the version of netty 
that Hadoop has as a dependency then we should be good on a standard Hadoop 
install where Hadoop's dependencies are exposed.  However if we move to a 
shaded hadoop-client jar for providing Hadoop then async-http-client could 
break due to netty missing.  There's probably other things in Tez that will 
break in the minimal tarball since we don't ship transitive dependencies there.

> Upgrade to netty-3.10.5.Final.jar
> -
>
> Key: TEZ-3902
> URL: https://issues.apache.org/jira/browse/TEZ-3902
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Jonathan Eagles
>Priority: Major
>
> Hadoop 3 and Hive have upgraded to netty-3.10.5.Final, which is not 
> compatible with current Tez dependency netty-3.6.2.Final.
>  
> However, org.apache.tez.shufflehandler.ShuffleHandler depends on 3.6.2 
> specific methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3907) Improve log message to include the location the writers decide to spill output

2018-03-26 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413961#comment-16413961
 ] 

Jason Lowe commented on TEZ-3907:
-

Thanks for the patch!

Nit: " at :" s/b " at : " so the colon doesn't smash into the subsequent 
filename.  Actually I think it becomes more readable without the additional 
colon, both for this and the same log in UnorderedPartitionedKVWriter.

> Improve log message to include the location the writers decide to spill output
> --
>
> Key: TEZ-3907
> URL: https://issues.apache.org/jira/browse/TEZ-3907
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
>Priority: Major
> Attachments: TEZ-3907.001.patch
>
>
> It helps debugging if the log message that prints the start of spill with the 
> buffer sizes and other pointers to include the location as well. This Jira 
> will add the already known location string to the log message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3901) Add hadoop3 profile for upgrade to Jersey 1.19

2018-03-15 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400465#comment-16400465
 ] 

Jason Lowe commented on TEZ-3901:
-

My main concern with this patch is that it adds another dimension that needs to 
be regularly tested.  We already have a problem getting the existing hadoop28 
profile covered, since it was broken for months without being detected in 
Jenkins builds.  Adding yet another profile means another dimension to the test 
matrix.

Given there were no shim/java changes associated with this profile, just a 
dependency change, that tells me that Tez is OK with Jersey 1.19 as-is.  That 
implies that Tez likely "just works" if it uses the jersey version being 
provided by Hadoop, be that 1.9 or 1.19.  So this really does just look like a 
problem that stems around how Tez was deployed rather than a problem in Tez 
itself.


> Add hadoop3 profile for upgrade to Jersey 1.19
> --
>
> Key: TEZ-3901
> URL: https://issues.apache.org/jira/browse/TEZ-3901
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Eric Wohlstadter
>Priority: Major
> Attachments: TEZ-3901.1.patch, TEZ-3901.2.patch
>
>
> From [~harishjp]:
> "DAGAppMaster fails to start when using hadoop3 and ATSv15, because 
> TimelineWriter has been changed to use jersey-client 1.19 in hadoop3, but tez 
> packages jersey-client 1.9 with it. There are incompatible changes between 
> them, so we cannot upgrade to 1.19 for all versions, it should be 1.9 in 
> older hadoop and 1.19 in hadoop3."
>  
> This patch includes some copy and paste of the hadoop28 profile to a hadoop3 
> profile. Maven doesn't include anything like "profile inheritance".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3874) NPE in TezClientUtils when "yarn.resourcemanager.zk-address" is present in Configuration

2018-03-09 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393049#comment-16393049
 ] 

Jason Lowe commented on TEZ-3874:
-

>From precommit build:

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12913644/TEZ-3874.6.patch
  against master revision 82d73b3.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2741//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2741//console


Committing this.


> NPE in TezClientUtils when "yarn.resourcemanager.zk-address" is present in 
> Configuration
> 
>
> Key: TEZ-3874
> URL: https://issues.apache.org/jira/browse/TEZ-3874
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1
>Reporter: Eric Wohlstadter
>Assignee: Eric Wohlstadter
>Priority: Blocker
> Attachments: TEZ-3874.1.patch, TEZ-3874.3.patch, TEZ-3874.4.patch, 
> TEZ-3874.5.patch, TEZ-3874.6.patch, TEZ-3874.patch.2
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> "yarn.resourcemanager.zk-address" is deprecated in favor of 
> "hadoop.zk.address" for Hadoop 2.9+.
> Configuration base class does't auto-translate the deprecation. Only 
> YarnConfiguration applies the translation.
> In TezClientUtils.createFinalConfProtoForApp, a NPE is throw if 
> "yarn.resourcemanager.zk-address" is present in the Configuration.
> {code}
> for (Entry entry : amConf) {
>   PlanKeyValuePair.Builder kvp = PlanKeyValuePair.newBuilder();
>   kvp.setKey(entry.getKey());
>   kvp.setValue(amConf.get(entry.getKey()));
>   builder.addConfKeyValues(kvp);
> }
> {code}
> Even though Tez is not specifically looking for the deprecated property, 
> {{amConf.get(entry.getKey())}} will find it during the iteration, if it is in 
> any of the merged xml property resources. 
> {{amConf.get(entry.getKey())}} will return null, and {{kvp.setValue(null)}} 
> will trigger NPE.
> Suggested solution is to change to: 
> {code}
> YarnConfiguration wrappedConf = new YarnConfiguration(amConf);
> for (Entry entry : wrappedConf) {
>   PlanKeyValuePair.Builder kvp = PlanKeyValuePair.newBuilder();
>   kvp.setKey(entry.getKey());
>   kvp.setValue(wrappedConf.get(entry.getKey()));
>   builder.addConfKeyValues(kvp);
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3874) NPE in TezClientUtils when "yarn.resourcemanager.zk-address" is present in Configuration

2018-03-07 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389712#comment-16389712
 ] 

Jason Lowe commented on TEZ-3874:
-

Thanks for updating the patch!  Looks much better overall, just one nit left.  
testNotNullKvpWithValueReplacement is no longer testing Tez code.  It creates a 
mocked object and verifies the mock behaves properly, but that does not test 
anything from Tez.  The test along with MockKeyFailureConfiguration should be 
removed.

+1 after that change.


> NPE in TezClientUtils when "yarn.resourcemanager.zk-address" is present in 
> Configuration
> 
>
> Key: TEZ-3874
> URL: https://issues.apache.org/jira/browse/TEZ-3874
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1
>Reporter: Eric Wohlstadter
>Assignee: Eric Wohlstadter
>Priority: Blocker
> Attachments: TEZ-3874.1.patch, TEZ-3874.3.patch, TEZ-3874.4.patch, 
> TEZ-3874.5.patch, TEZ-3874.patch.2
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> "yarn.resourcemanager.zk-address" is deprecated in favor of 
> "hadoop.zk.address" for Hadoop 2.9+.
> Configuration base class does't auto-translate the deprecation. Only 
> YarnConfiguration applies the translation.
> In TezClientUtils.createFinalConfProtoForApp, a NPE is throw if 
> "yarn.resourcemanager.zk-address" is present in the Configuration.
> {code}
> for (Entry entry : amConf) {
>   PlanKeyValuePair.Builder kvp = PlanKeyValuePair.newBuilder();
>   kvp.setKey(entry.getKey());
>   kvp.setValue(amConf.get(entry.getKey()));
>   builder.addConfKeyValues(kvp);
> }
> {code}
> Even though Tez is not specifically looking for the deprecated property, 
> {{amConf.get(entry.getKey())}} will find it during the iteration, if it is in 
> any of the merged xml property resources. 
> {{amConf.get(entry.getKey())}} will return null, and {{kvp.setValue(null)}} 
> will trigger NPE.
> Suggested solution is to change to: 
> {code}
> YarnConfiguration wrappedConf = new YarnConfiguration(amConf);
> for (Entry entry : wrappedConf) {
>   PlanKeyValuePair.Builder kvp = PlanKeyValuePair.newBuilder();
>   kvp.setKey(entry.getKey());
>   kvp.setValue(wrappedConf.get(entry.getKey()));
>   builder.addConfKeyValues(kvp);
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3897) Tez Local Mode hang for vertices with broadcast input

2018-03-05 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386244#comment-16386244
 ] 

Jason Lowe commented on TEZ-3897:
-

Thanks for updating the patch!

+1 lgtm.  Committing this.


> Tez Local Mode hang for vertices with broadcast input
> -
>
> Key: TEZ-3897
> URL: https://issues.apache.org/jira/browse/TEZ-3897
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3897.001.patch, TEZ-3897.002.patch
>
>
> Broadcast edges are not taken into consideration for slow-start edges so 
> downstream tasks in local mode can start before upstream tasks. Without 
> preemption in the scheduler, there will be a hang.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3874) NPE in TezClientUtils when "yarn.resourcemanager.zk-address" is present in Configuration

2018-03-02 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383816#comment-16383816
 ] 

Jason Lowe commented on TEZ-3874:
-

Thanks for updating the patch! Here's the QA report:

{color:#FF}-1 overall{color}. Here are the results of testing the latest 
attachment
 [http://issues.apache.org/jira/secure/attachment/12912688/TEZ-3874.3.patch]
 against master revision bb40cf5.

{color:#008000}+1 @author{color}. The patch does not contain any @author tags.

{color:#008000}+1 tests included{color}. The patch appears to include 2 new or 
modified test files.

{color:#008000}+1 javac{color}. The applied patch does not increase the total 
number of javac compiler warnings.

{color:#008000}+1 javadoc{color}. There were no new javadoc warning messages.

{color:#008000}+1 findbugs{color}. The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:#008000}+1 release audit{color}. The applied patch does not increase the 
total number of release audit warnings.

{color:#FF}-1 core tests{color}. The patch failed these unit tests in :
 org.apache.tez.analyzer.TestAnalyzer

Test results: 
[https://builds.apache.org/job/PreCommit-TEZ-Build/2732//testReport/]
 Console output: 
[https://builds.apache.org/job/PreCommit-TEZ-Build/2732//console]

Again I'm not sure the notNullKvp method is worth the weight. I do not see the 
utility of a Precondition not null check vs. just letting it NPE on the line 
number where it is used. Both are going to throw an NPE and it will be obvious 
which is null for that line. And once we remove the Precondition null checks, 
the function boils down to:
{code:java}
  public static boolean notNullKvpWithValueReplacement(Map.Entry kvp, Configuration conf) {
String key = kvp.getKey();
String value = conf.get(key);
return value != null;
  }
{code}
which IMHO is just not worth it. For example:
{code:java}
  if(TezUtils.notNullKvpWithValueReplacement(entry, amConf)) {
PlanKeyValuePair.Builder kvp = PlanKeyValuePair.newBuilder();
kvp.setKey(entry.getKey());
kvp.setValue(amConf.get(entry.getKey()));
{code}
is equivalent to:
{code:java}
  String val = amConf.get(entry.getKey());
  if (val != null) {
PlanKeyValuePair.Builder kvp = PlanKeyValuePair.newBuilder();
kvp.setKey(entry.getKey());
kvp.setValue(val);
{code}
which is simpler to read and understand what's going on. As a bonus it avoids 
the double-lookup on the conf key in the common case where the value is not 
null.

There's no need for the debug check before doing the debug logs. One main 
benefit of SLF4J's API vs. log4j is avoiding all the explicit log level checks 
if all we're doing is passing positional parameters for the message that don't 
need to be computed just for the log.

I don't think it's worth exposing a public static for the debug log message. 
It's reaching cross-module and exposes potential reference outside of Tez.

> NPE in TezClientUtils when "yarn.resourcemanager.zk-address" is present in 
> Configuration
> 
>
> Key: TEZ-3874
> URL: https://issues.apache.org/jira/browse/TEZ-3874
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1
>Reporter: Eric Wohlstadter
>Assignee: Eric Wohlstadter
>Priority: Blocker
> Attachments: TEZ-3874.1.patch, TEZ-3874.3.patch, TEZ-3874.patch.2
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> "yarn.resourcemanager.zk-address" is deprecated in favor of 
> "hadoop.zk.address" for Hadoop 2.9+.
> Configuration base class does't auto-translate the deprecation. Only 
> YarnConfiguration applies the translation.
> In TezClientUtils.createFinalConfProtoForApp, a NPE is throw if 
> "yarn.resourcemanager.zk-address" is present in the Configuration.
> {code}
> for (Entry entry : amConf) {
>   PlanKeyValuePair.Builder kvp = PlanKeyValuePair.newBuilder();
>   kvp.setKey(entry.getKey());
>   kvp.setValue(amConf.get(entry.getKey()));
>   builder.addConfKeyValues(kvp);
> }
> {code}
> Even though Tez is not specifically looking for the deprecated property, 
> {{amConf.get(entry.getKey())}} will find it during the iteration, if it is in 
> any of the merged xml property resources. 
> {{amConf.get(entry.getKey())}} will return null, and {{kvp.setValue(null)}} 
> will trigger NPE.
> Suggested solution is to change to: 
> {code}
> YarnConfiguration wrappedConf = new YarnConfiguration(amConf);
> for (Entry entry : wrappedConf) {
>   PlanKeyValuePair.Builder kvp = PlanKeyValuePair.newBuilder();
>   kvp.setKey(entry.getKey());
>   kvp.setValue(wrappedConf.get(entry.getKey()));
>   builder.addConfKeyValues(kvp);
> }
> {code}

[jira] [Commented] (TEZ-3897) Tez Local Mode hang for vertices with broadcast input

2018-02-27 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379151#comment-16379151
 ] 

Jason Lowe commented on TEZ-3897:
-

Thanks for the patch!

Do we need to worry about task deallocations implying the task needs to be 
interrupted/killed?  The other schedulers will automatically deallocate a 
container if a task deallocate maps to a currently allocated task.

Seems like there shouldn't be a PreemptTaskRequest so much as a 
DeallocateContainerRequest.  Both of those kinds of requests don't need a 
priority, so whichever one is kept arguably shouldn't derive from TaskRequest 
but something like a SchedulerRequest that TaskRequest derives from as well.  
Or just have the queue hold Object rather than TaskRequest and do RTTI on 
everything in the queue as it already does.

Nit: It's a bit odd for addPreemptTaskRequest's signature to return an Object 
yet it always returns null.  Better as a void method?

I didn't see in the patch where the actual preemption of the running task 
occurs.  I would expect there to be a corresponding change in 
LocalContainerLauncher to implement the preempt of the running task, but it 
still explicitly ignores any requests to stop a container.


> Tez Local Mode hang for vertices with broadcast input
> -
>
> Key: TEZ-3897
> URL: https://issues.apache.org/jira/browse/TEZ-3897
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3897.001.patch
>
>
> Broadcast edges are not taken into consideration for slow-start edges so 
> downstream tasks in local mode can start before upstream tasks. Without 
> preemption in the scheduler, there will be a hang.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3898) TestTezCommonUtils fails when compiled against hadoop version >= 2.8

2018-02-16 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3898:

Attachment: TEZ-3898.001.patch

> TestTezCommonUtils fails when compiled against hadoop version >= 2.8
> 
>
> Key: TEZ-3898
> URL: https://issues.apache.org/jira/browse/TEZ-3898
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-3898.001.patch
>
>
> TestTezCommonUtils fails when compiled against hadoop 2.8 or later:
> {noformat}
> $ cd tez-api
> $ mvn test -Phadoop28 -P-hadoop27 -Dhadoop.version=2.8.3
> -Dtest=TestTezCommonUtilsRunning org.apache.tez.common.TestTezCommonUtils
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.266 sec <<< 
> FAILURE!
> org.apache.tez.common.TestTezCommonUtils  Time elapsed: 0.265 sec  <<< ERROR!
> java.lang.NoClassDefFoundError: 
> org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetFactory
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.hadoop.hdfs.server.datanode.FsDatasetTestUtils$Factory.getFactory(FsDatasetTestUtils.java:47)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster$Builder.(MiniDFSCluster.java:199)
>   at 
> org.apache.tez.common.TestTezCommonUtils.setup(TestTezCommonUtils.java:60)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-3898) TestTezCommonUtils fails when compiled against hadoop version >= 2.8

2018-02-16 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3898:
---

 Summary: TestTezCommonUtils fails when compiled against hadoop 
version >= 2.8
 Key: TEZ-3898
 URL: https://issues.apache.org/jira/browse/TEZ-3898
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe
Assignee: Jason Lowe


TestTezCommonUtils fails when compiled against hadoop 2.8 or later:
{noformat}
$ cd tez-api
$ mvn test -Phadoop28 -P-hadoop27 -Dhadoop.version=2.8.3
-Dtest=TestTezCommonUtilsRunning org.apache.tez.common.TestTezCommonUtils
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.266 sec <<< 
FAILURE!
org.apache.tez.common.TestTezCommonUtils  Time elapsed: 0.265 sec  <<< ERROR!
java.lang.NoClassDefFoundError: 
org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetFactory
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
org.apache.hadoop.hdfs.server.datanode.FsDatasetTestUtils$Factory.getFactory(FsDatasetTestUtils.java:47)
at 
org.apache.hadoop.hdfs.MiniDFSCluster$Builder.(MiniDFSCluster.java:199)
at 
org.apache.tez.common.TestTezCommonUtils.setup(TestTezCommonUtils.java:60)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3896) TestATSV15HistoryLoggingService#testNonSessionDomains is failing

2018-02-16 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3896:

Attachment: TEZ-3896.001.patch

> TestATSV15HistoryLoggingService#testNonSessionDomains is failing
> 
>
> Key: TEZ-3896
> URL: https://issues.apache.org/jira/browse/TEZ-3896
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-3896.001.patch
>
>
> TestATSV15HistoryLoggingService always fails:
> {noformat}
> Running org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.789 sec <<< 
> FAILURE!
> testNonSessionDomains(org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService)
>   Time elapsed: 0.477 sec  <<< FAILURE!
> org.mockito.exceptions.verification.TooManyActualInvocations: 
> historyACLPolicyManager.updateTimelineEntityDomain(
> ,
> "session-id"
> );
> Wanted 5 times:
> -> at 
> org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService.testNonSessionDomains(TestATSV15HistoryLoggingService.java:231)
> But was 6 times. Undesired invocation:
> -> at 
> org.apache.tez.dag.history.logging.ats.ATSV15HistoryLoggingService.logEntity(ATSV15HistoryLoggingService.java:389)
>   at 
> org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService.testNonSessionDomains(TestATSV15HistoryLoggingService.java:231)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-3896) TestATSV15HistoryLoggingService#testNonSessionDomains is failing

2018-02-16 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3896:
---

 Summary: TestATSV15HistoryLoggingService#testNonSessionDomains is 
failing
 Key: TEZ-3896
 URL: https://issues.apache.org/jira/browse/TEZ-3896
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe
Assignee: Jason Lowe


TestATSV15HistoryLoggingService always fails:
{noformat}
Running org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.789 sec <<< 
FAILURE!
testNonSessionDomains(org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService)
  Time elapsed: 0.477 sec  <<< FAILURE!
org.mockito.exceptions.verification.TooManyActualInvocations: 
historyACLPolicyManager.updateTimelineEntityDomain(
,
"session-id"
);
Wanted 5 times:
-> at 
org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService.testNonSessionDomains(TestATSV15HistoryLoggingService.java:231)
But was 6 times. Undesired invocation:
-> at 
org.apache.tez.dag.history.logging.ats.ATSV15HistoryLoggingService.logEntity(ATSV15HistoryLoggingService.java:389)

at 
org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService.testNonSessionDomains(TestATSV15HistoryLoggingService.java:231)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3893) Tez Local Mode can hang for cases

2018-02-14 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364346#comment-16364346
 ] 

Jason Lowe commented on TEZ-3893:
-

Thanks for updating the patch!

+1 lgtm.  Committing this.


> Tez Local Mode can hang for cases
> -
>
> Key: TEZ-3893
> URL: https://issues.apache.org/jira/browse/TEZ-3893
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3893.002.patch, TEZ-3893.003.patch, 
> TEZ-3893.004.patch, TEZ-3893.1.patch
>
>
> The scheduler has a race condition where events that notify can be added 
> while the blocking queue is not waiting, but just before waiting. In this 
> case, we can wait forever.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3893) Tez Local Mode can hang for cases

2018-02-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363239#comment-16363239
 ] 

Jason Lowe commented on TEZ-3893:
-

Thanks for updating the patch!  Looks good overall, only a few nits:

An error should be logged and/or an exception should be thrown if the 
dispatcher receives a message it doesn't understand.

shouldProcess and procesRequest were a bit confusing at first until I 
understood that it only apples to _some_ reqeuests (i.e.: allocation requests). 
 It would be nice if these were renamed to indicate that, e.g.: 
shouldProcessAllocateRequest or canAllocate.

A timeout was removed from a test. Intentional?


> Tez Local Mode can hang for cases
> -
>
> Key: TEZ-3893
> URL: https://issues.apache.org/jira/browse/TEZ-3893
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3893.002.patch, TEZ-3893.003.patch, TEZ-3893.1.patch
>
>
> The scheduler has a race condition where events that notify can be added 
> while the blocking queue is not waiting, but just before waiting. In this 
> case, we can wait forever.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3893) Tez Local Mode can hang for cases

2018-02-12 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361609#comment-16361609
 ] 

Jason Lowe commented on TEZ-3893:
-

Thanks for the patch!

A lot of the fragility in this code stems from the fact that there are items in 
the queue that we can process and items we cannot, and we're trying to juggle 
them in the same queue.  I'm wondering if this gets a lot cleaner if it is 
refactored into two parts, a front-end dispatcher/handler and a fixed-size 
thread pool executor to do the executions.  The front-end _always_ pulls from 
the queue (just FIFO, not priority).  If the message is an allocate, the 
dispatcher schedules the task with the fixed thread pool executor and tracks 
the Future from that schedule in a map.  If the message is a deallocate then it 
looks up the Future from the map and cancels it, which will prevent it from 
executing if it hasn't or should interrupt the thread that is currently 
executing the task.

After that refactoring then the queue management becomes very simple.  The 
dispatcher takes from the queue, always processes the message, then is ready to 
take from the queue again.  The fixed thread pool executor takes a task, 
executes it, then is ready to take the next task if any.

> Tez Local Mode can hang for cases
> -
>
> Key: TEZ-3893
> URL: https://issues.apache.org/jira/browse/TEZ-3893
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>Priority: Major
> Attachments: TEZ-3893.002.patch, TEZ-3893.1.patch
>
>
> The scheduler has a race condition where events that notify can be added 
> while the blocking queue is not waiting, but just before waiting. In this 
> case, we can wait forever.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3894) Tez intermediate outputs implicitly rely on permissive umask for shuffle

2018-02-05 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3894:

Attachment: TEZ-3894.001.patch

> Tez intermediate outputs implicitly rely on permissive umask for shuffle
> 
>
> Key: TEZ-3894
> URL: https://issues.apache.org/jira/browse/TEZ-3894
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-3894.001.patch
>
>
> Tez does not explicitly set the permissions of intermediate output files for 
> shuffle. In a secure cluster the shuffle service is running as a different 
> user than the task, so the output files require group readability in order to 
> serve up the data during the shuffle phase. If the umask is too restrictive 
> (e.g.: 077) then the task's file.out and file.out.index permissions can be 
> too restrictive to allow the shuffle handler to access them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3894) Tez intermediate outputs implicitly rely on permissive umask for shuffle

2018-01-31 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347600#comment-16347600
 ] 

Jason Lowe commented on TEZ-3894:
-

This is the Tez equivalent of MAPREDUCE-7033.  This will become more of an 
issue with Hadoop 3.x since HADOOP-11347 fixed a bug in the local filesystem to 
have it honor the configured fs.permission.umask-mode property where it was 
ignored in 2.x and implicitly relied on the UNIX umask.

> Tez intermediate outputs implicitly rely on permissive umask for shuffle
> 
>
> Key: TEZ-3894
> URL: https://issues.apache.org/jira/browse/TEZ-3894
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Major
>
> Tez does not explicitly set the permissions of intermediate output files for 
> shuffle. In a secure cluster the shuffle service is running as a different 
> user than the task, so the output files require group readability in order to 
> serve up the data during the shuffle phase. If the umask is too restrictive 
> (e.g.: 077) then the task's file.out and file.out.index permissions can be 
> too restrictive to allow the shuffle handler to access them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


  1   2   3   4   5   >