[jira] [Commented] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
[ https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268394#comment-14268394 ]

Ivan Mitic commented on TEZ-1924:

Thanks for the quick turnaround [~Hitesh]!

> Tez AM does not register with AM with full FQDN causing jobs to fail in some
> environments
>
> Key: TEZ-1924
> URL: https://issues.apache.org/jira/browse/TEZ-1924
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.2
> Reporter: Ivan Mitic
> Assignee: Ivan Mitic
> Fix For: 0.5.4
>
> Attachments: TEZ-1924.2.patch, TEZ-20.patch
>
>
> Issue originally reported by [~Karam Singh].
> All OrderWordCount, WordCount and Tez tests faultTolerance system tests
> failed due to java.net.UnknownHostException
> Interesting other tez examples such as mrrsleep, randomwriter,
> randomtextwriter, sort, join_inner, join_outer, terasort,
> groupbyorderbymrrtest ran fine
> one such example is following
> {code}
> RUNNING: /usr/lib/hadoop/bin/hadoop jar
> /usr/lib/tez/tez-mapreduce-examples-0.4.0.2.1.7.0-784.jar orderedwordcount
> "-DUSE_TEZ_SESSION=true" "-Dmapreduce.map.memory.mb=2048"
> "-Dtez.am.shuffle-vertex-manager.max-src-fraction=0"
> "-Dmapreduce.reduce.memory.mb=2048" "-Dmapreduce.framework.name=yarn-tez"
> "-Dtez.am.container.reuse.enabled=false" "-Dtez.am.log.level=DEBUG"
> "-Dmapreduce.map.java.opts=-Xmx1024m"
> "-Dtez.am.shuffle-vertex-manager.min-src-fraction=0"
> "-Dmapreduce.job.reduce.slowstart.completedmaps=0.01"
> "-Dmapreduce.reduce.java.opts=-Xmx1024m"
> "-Dtez.am.container.session.delay-allocation-millis=12"
> /user/hrt_qa/Tez_CR_1/TestContainerReuse1 /user/hrt_qa/Tez_CROutput_1
> /user/hrt_qa/Tez_CR_2/TestContainerReuse2 /user/hrt_qa/Tez_CROutput_2
> -generateSplitsInClient true
> 14/12/19 09:20:05 INFO impl.TimelineClientImpl: Timeline service address:
> http://0.0.0.0:8188/ws/v1/timeline/
> 14/12/19 09:20:05 INFO client.RMProxy: Connecting to ResourceManager at
> headnode0.humb-tez1-ssh.d5.internal.cloudapp.net/10.0.0.87:8050
> 14/12/19 09:20:05 INFO client.AHSProxy: Connecting to Application History
> server at /0.0.0.0:10200
> 14/12/19 09:20:06 INFO impl.MetricsConfig: loaded properties from
> hadoop-metrics2.properties
> 14/12/19 09:20:06 INFO impl.MetricsSystemImpl: Scheduled snapshot period at
> 60 second(s).
> 14/12/19 09:20:06 INFO impl.MetricsSystemImpl: azure-file-system metrics
> system started
> 14/12/19 09:20:07 INFO client.TezClientUtils: Permissions on staging
> directory
> wasb://humb-t...@humboldttesting.blob.core.windows.net/user/hrt_qa/.staging/application_1418977790315_0016
> are incorrect: rwxr-xr-x. Fixing permissions to correct value rwx--
> 14/12/19 09:20:07 INFO examples.OrderedWordCount: Creating Tez Session
> 14/12/19 09:20:07 INFO impl.TimelineClientImpl: Timeline service address:
> http://0.0.0.0:8188/ws/v1/timeline/
> 14/12/19 09:20:07 INFO client.RMProxy: Connecting to ResourceManager at
> headnode0.humb-tez1-ssh.d5.internal.cloudapp.net/10.0.0.87:8050
> 14/12/19 09:20:07 INFO client.AHSProxy: Connecting to Application History
> server at /0.0.0.0:10200
> 14/12/19 09:20:09 INFO impl.YarnClientImpl: Submitted application
> application_1418977790315_0016
> 14/12/19 09:20:09 INFO examples.OrderedWordCount: Created Tez Session
> 14/12/19 09:20:09 INFO examples.OrderedWordCount: Running OrderedWordCount
> DAG, dagIndex=1, inputPath=/user/hrt_qa/Tez_CR_1/TestContainerReuse1,
> outputPath=/user/hrt_qa/Tez_CROutput_1
> 14/12/19 09:20:09 INFO hadoop.MRHelpers: Generating new input splits,
> splitsDir=wasb://humb-t...@humboldttesting.blob.core.windows.net/user/hrt_qa/.staging/application_1418977790315_0016
> 14/12/19 09:20:09 INFO input.FileInputFormat: Total input paths to process :
> 20
> 14/12/19 09:20:09 INFO examples.OrderedWordCount: Waiting for TezSession to
> get into ready state
> 14/12/19 09:20:14 INFO client.TezSession: Failed to retrieve AM Status via
> proxy
> org.apache.tez.dag.api.TezException: com.google.protobuf.ServiceException:
> java.net.UnknownHostException: Invalid host name: local host is: (unknown);
> destination host is: "workernode1":59575; java.net.UnknownHostException; For
> more details see: http://wiki.apache.org/hadoop/UnknownHost
> at
> org.apache.tez.client.TezSession.getSessionStatus(TezSession.java:351)
> at
> org.apache.tez.mapreduce.examples.OrderedWordCount.waitForTezSessionReady(OrderedWordCount.java:538)
> at
> org.apache.tez.mapreduce.examples.OrderedWordCount.main(OrderedWordCount.java:461)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodA
[jira] [Commented] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
[ https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268328#comment-14268328 ]

Hitesh Shah commented on TEZ-1924:

+1. Looks good. Committing shortly.

> Tez AM does not register with AM with full FQDN causing jobs to fail in some
> environments
>
> Key: TEZ-1924
> URL: https://issues.apache.org/jira/browse/TEZ-1924
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.2
> Reporter: Ivan Mitic
> Assignee: Ivan Mitic
> Attachments: TEZ-1924.2.patch, TEZ-20.patch
[jira] [Commented] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
[ https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268280#comment-14268280 ]

Hitesh Shah commented on TEZ-1924:

Thanks for filing the issue [~ivanmi] and also for providing a patch.

Some general comments:
- It is usually better if the patch file is named the same as the jira (with a version number for multiple iterations on the patch).
- With respect to using the NM hostname, would it be better to extract the FQDN from the server object itself if possible?

> Tez AM does not register with AM with full FQDN causing jobs to fail in some
> environments
>
> Key: TEZ-1924
> URL: https://issues.apache.org/jira/browse/TEZ-1924
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.2
> Reporter: Ivan Mitic
> Attachments: TEZ-20.patch
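The suggestion above, taking the FQDN from the server object rather than from the NM hostname, could look roughly like the sketch below. It is not the TEZ-1924 patch. It uses only `java.net`, with a plain `ServerSocket` standing in for the AM's RPC server; the class name `ListenerHost` and the method `advertisedHost` are hypothetical, and how the bound address would be obtained from the real Tez/Hadoop server object is an assumption left out here.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Tez or Hadoop.
final class ListenerHost {
    private ListenerHost() {}

    /**
     * Derives the host name to advertise from the address a server is bound to.
     * A wildcard bind (0.0.0.0 / ::) carries no usable name, so the local host
     * address is used instead. getCanonicalHostName() asks the resolver for the
     * fully qualified name and returns the textual IP if none is found.
     */
    static String advertisedHost(InetSocketAddress bound) {
        InetAddress addr = bound.getAddress();
        if (addr == null || addr.isAnyLocalAddress()) {
            try {
                addr = InetAddress.getLocalHost();
            } catch (UnknownHostException e) {
                // Local host name cannot be resolved; fall back to what we were given.
                return bound.getHostString();
            }
        }
        return addr.getCanonicalHostName();
    }

    public static void main(String[] args) throws IOException {
        // Port 0 lets the OS pick a free port, as an AM typically would.
        try (ServerSocket server = new ServerSocket(0)) {
            InetSocketAddress bound = (InetSocketAddress) server.getLocalSocketAddress();
            System.out.println(advertisedHost(bound) + ":" + bound.getPort());
        }
    }
}
```

Whether the result is actually fully qualified still depends on how the node's resolver is configured, so this only moves the lookup closer to the server; it does not guarantee an FQDN.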
[jira] [Commented] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
[ https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268091#comment-14268091 ]

Ivan Mitic commented on TEZ-1924:

I think I have the root cause at this point. Tez client is trying to talk to its AM, and given that AM is registered with a short host name (workernode0), Tez client is failing to talk to it. If Tez AM registered with the RM using a FQDN we would not have this problem.

> Tez AM does not register with AM with full FQDN causing jobs to fail in some
> environments
>
> Key: TEZ-1924
> URL: https://issues.apache.org/jira/browse/TEZ-1924
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.2
> Reporter: Ivan Mitic
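To check the root cause described above from the client side, a small diagnostic along the following lines may help. It is a sketch only: the class `AmHostCheck` and its methods are hypothetical names, and the "contains a dot" test is a rough heuristic (an IP literal such as 10.0.0.87 would also pass it). It simply reports whether a host name, as the RM hands it to the client, looks fully qualified and whether this machine can resolve it.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic, not part of Tez or Hadoop.
final class AmHostCheck {
    private AmHostCheck() {}

    /** Heuristic: a name without a dot is treated as a short (unqualified) host name. */
    static boolean looksFullyQualified(String host) {
        return host != null && host.indexOf('.') > 0;
    }

    /** Returns true if this machine's resolver can map the name to an address. */
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the short name seen in the stack trace.
        String host = args.length > 0 ? args[0] : "workernode1";
        System.out.println(host
                + " fullyQualified=" + looksFullyQualified(host)
                + " resolves=" + resolves(host));
    }
}
```

Run on the machine that submits the job, once with the short name from the exception and once with the node's full name. If only the full name resolves, that matches the diagnosis that the AM should register with its FQDN.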