[jira] [Commented] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments

2015-01-07 Thread Ivan Mitic (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268394#comment-14268394
 ] 

Ivan Mitic commented on TEZ-1924:
-

Thanks for the quick turnaround [~Hitesh]!

> Tez AM does not register with AM with full FQDN causing jobs to fail in some 
> environments
> -
>
> Key: TEZ-1924
> URL: https://issues.apache.org/jira/browse/TEZ-1924
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.2
> Reporter: Ivan Mitic
> Assignee: Ivan Mitic
> Fix For: 0.5.4
>
> Attachments: TEZ-1924.2.patch, TEZ-20.patch
>
>
> Issue originally reported by [~Karam Singh].
> All OrderedWordCount, WordCount, and Tez faultTolerance system tests 
> failed due to java.net.UnknownHostException.
> Interestingly, other Tez examples such as mrrsleep, randomwriter, 
> randomtextwriter, sort, join_inner, join_outer, terasort, and 
> groupbyorderbymrrtest ran fine.
> One such failure follows:
> {code}
> RUNNING: /usr/lib/hadoop/bin/hadoop jar 
> /usr/lib/tez/tez-mapreduce-examples-0.4.0.2.1.7.0-784.jar orderedwordcount 
> "-DUSE_TEZ_SESSION=true" "-Dmapreduce.map.memory.mb=2048" 
> "-Dtez.am.shuffle-vertex-manager.max-src-fraction=0" 
> "-Dmapreduce.reduce.memory.mb=2048" "-Dmapreduce.framework.name=yarn-tez" 
> "-Dtez.am.container.reuse.enabled=false" "-Dtez.am.log.level=DEBUG" 
> "-Dmapreduce.map.java.opts=-Xmx1024m" 
> "-Dtez.am.shuffle-vertex-manager.min-src-fraction=0" 
> "-Dmapreduce.job.reduce.slowstart.completedmaps=0.01" 
> "-Dmapreduce.reduce.java.opts=-Xmx1024m" 
> "-Dtez.am.container.session.delay-allocation-millis=12" 
> /user/hrt_qa/Tez_CR_1/TestContainerReuse1 /user/hrt_qa/Tez_CROutput_1 
> /user/hrt_qa/Tez_CR_2/TestContainerReuse2 /user/hrt_qa/Tez_CROutput_2 
> -generateSplitsInClient true
> 14/12/19 09:20:05 INFO impl.TimelineClientImpl: Timeline service address: 
> http://0.0.0.0:8188/ws/v1/timeline/
> 14/12/19 09:20:05 INFO client.RMProxy: Connecting to ResourceManager at 
> headnode0.humb-tez1-ssh.d5.internal.cloudapp.net/10.0.0.87:8050
> 14/12/19 09:20:05 INFO client.AHSProxy: Connecting to Application History 
> server at /0.0.0.0:10200
> 14/12/19 09:20:06 INFO impl.MetricsConfig: loaded properties from 
> hadoop-metrics2.properties
> 14/12/19 09:20:06 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 
> 60 second(s).
> 14/12/19 09:20:06 INFO impl.MetricsSystemImpl: azure-file-system metrics 
> system started
> 14/12/19 09:20:07 INFO client.TezClientUtils: Permissions on staging 
> directory 
> wasb://humb-t...@humboldttesting.blob.core.windows.net/user/hrt_qa/.staging/application_1418977790315_0016
>  are incorrect: rwxr-xr-x. Fixing permissions to correct value rwx--
> 14/12/19 09:20:07 INFO examples.OrderedWordCount: Creating Tez Session
> 14/12/19 09:20:07 INFO impl.TimelineClientImpl: Timeline service address: 
> http://0.0.0.0:8188/ws/v1/timeline/
> 14/12/19 09:20:07 INFO client.RMProxy: Connecting to ResourceManager at 
> headnode0.humb-tez1-ssh.d5.internal.cloudapp.net/10.0.0.87:8050
> 14/12/19 09:20:07 INFO client.AHSProxy: Connecting to Application History 
> server at /0.0.0.0:10200
> 14/12/19 09:20:09 INFO impl.YarnClientImpl: Submitted application 
> application_1418977790315_0016
> 14/12/19 09:20:09 INFO examples.OrderedWordCount: Created Tez Session
> 14/12/19 09:20:09 INFO examples.OrderedWordCount: Running OrderedWordCount 
> DAG, dagIndex=1, inputPath=/user/hrt_qa/Tez_CR_1/TestContainerReuse1, 
> outputPath=/user/hrt_qa/Tez_CROutput_1
> 14/12/19 09:20:09 INFO hadoop.MRHelpers: Generating new input splits, 
> splitsDir=wasb://humb-t...@humboldttesting.blob.core.windows.net/user/hrt_qa/.staging/application_1418977790315_0016
> 14/12/19 09:20:09 INFO input.FileInputFormat: Total input paths to process : 
> 20
> 14/12/19 09:20:09 INFO examples.OrderedWordCount: Waiting for TezSession to 
> get into ready state
> 14/12/19 09:20:14 INFO client.TezSession: Failed to retrieve AM Status via 
> proxy
> org.apache.tez.dag.api.TezException: com.google.protobuf.ServiceException: 
> java.net.UnknownHostException: Invalid host name: local host is: (unknown); 
> destination host is: "workernode1":59575; java.net.UnknownHostException; For 
> more details see:  http://wiki.apache.org/hadoop/UnknownHost
>   at 
> org.apache.tez.client.TezSession.getSessionStatus(TezSession.java:351)
>   at 
> org.apache.tez.mapreduce.examples.OrderedWordCount.waitForTezSessionReady(OrderedWordCount.java:538)
>   at 
> org.apache.tez.mapreduce.examples.OrderedWordCount.main(OrderedWordCount.java:461)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodA

[jira] [Commented] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments

2015-01-07 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268328#comment-14268328
 ] 

Hitesh Shah commented on TEZ-1924:
--

+1. Looks good. Committing shortly. 


[jira] [Commented] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments

2015-01-07 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268280#comment-14268280
 ] 

Hitesh Shah commented on TEZ-1924:
--

Thanks for filing the issue [~ivanmi] and also for providing a patch. 

Some general comments:
  - It is usually better if the patch file is named after the JIRA (with a 
version number for successive iterations of the patch).
  - With respect to using the NM hostname, would it be better to extract the 
FQDN from the server object itself if possible? 
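
The second suggestion could be sketched roughly as follows. This is only an illustration using a plain `java.net.ServerSocket` as a stand-in for the AM's RPC server; the class `ServerHostSketch`, the method `advertisedHost`, and the fallback host name are hypothetical and are not taken from the patch or from Tez. Whether the resulting name is fully qualified still depends on how the server was bound.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical stand-in for the AM's RPC server; not Tez or Hadoop code.
final class ServerHostSketch {

    // Derives the host to advertise from the address the server is bound to.
    // A wildcard bind (0.0.0.0) names no host, so the caller's fallback is used.
    static String advertisedHost(InetSocketAddress bound, String fallbackHost) {
        InetAddress addr = bound.getAddress();
        if (addr == null || addr.isAnyLocalAddress()) {
            return fallbackHost;
        }
        // getHostString() returns the stored name or IP literal without a reverse lookup.
        return bound.getHostString();
    }

    public static void main(String[] args) throws IOException {
        // Bind to the loopback address on an ephemeral port, then read the address back.
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            InetSocketAddress bound = (InetSocketAddress) server.getLocalSocketAddress();
            System.out.println(advertisedHost(bound, "fallback.example.internal") + ":" + bound.getPort());
        }
    }
}
```

The fallback parameter is where a name from another source (for example the NM hostname) would be passed in when the server listens on all interfaces.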


[jira] [Commented] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments

2015-01-07 Thread Ivan Mitic (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268091#comment-14268091
 ] 

Ivan Mitic commented on TEZ-1924:
-

I think I have the root cause at this point. The Tez client is trying to talk 
to its AM, and because the AM is registered with a short host name 
(workernode0), the client cannot resolve it and fails to connect. If the Tez 
AM registered with the RM using an FQDN, we would not have this problem.
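
To illustrate the idea, here is a minimal sketch of preferring a fully qualified name over a short one when choosing the host to register. It uses only the JDK's `java.net.InetAddress`; the class `AmHostName` and its methods are hypothetical names for illustration and do not show how the attached patch is implemented.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Tez: picks the host name an AM could advertise.
final class AmHostName {

    // A name with an interior dot has a domain part ("workernode0.example.internal").
    // An IP literal such as "10.0.0.87" also passes; it is reachable without DNS.
    static boolean hasDomainPart(String host) {
        if (host == null) {
            return false;
        }
        int dot = host.indexOf('.');
        return dot > 0 && dot < host.length() - 1;
    }

    // Prefer the canonical name when it is qualified; otherwise keep the short name.
    static String chooseRegistrationHost(String shortName, String canonicalName) {
        return hasDomainPart(canonicalName) ? canonicalName : shortName;
    }

    // Resolves the local machine's names via the JDK resolver.
    static String localRegistrationHost() throws UnknownHostException {
        InetAddress local = InetAddress.getLocalHost();
        return chooseRegistrationHost(local.getHostName(), local.getCanonicalHostName());
    }

    public static void main(String[] args) throws UnknownHostException {
        System.out.println(localRegistrationHost());
    }
}
```

A client outside the cluster's DNS search domain can resolve a name like `workernode0.example.internal` (an assumed example name) but not a bare `workernode0`, which matches the `UnknownHostException` in the report.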
