[jira] [Updated] (TEZ-1882) Tez UI build does not work on Windows
[ https://issues.apache.org/jira/browse/TEZ-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran updated TEZ-1882: -- Attachment: TEZ-1882.2.patch Thanks [~bikassaha]. In patch 2, I have removed the batch file and changed the path to node to a full path; this works on both Windows and *nix. Tez UI build does not work on Windows - Key: TEZ-1882 URL: https://issues.apache.org/jira/browse/TEZ-1882 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Prakash Ramachandran Priority: Blocker Attachments: TEZ-1882.1.patch, TEZ-1882.2.patch It fails during Bower install because it cannot launch node/node. After working around that, the bower script itself fails because it's a bash script and will not run on Windows. Specifically, the following command in node_modules\.bin\bower fails: basedir=`dirname $0` -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1923) FetcherOrderedGrouped can get into infinite loop due to memory pressure
Rajesh Balamohan created TEZ-1923: - Summary: FetcherOrderedGrouped can get into infinite loop due to memory pressure Key: TEZ-1923 URL: https://issues.apache.org/jira/browse/TEZ-1923 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan - Ran a comparatively large job (temp table creation) at 10 TB scale. - Turned on intermediate mem-to-mem (tez.runtime.shuffle.memory-to-memory.enable=true and tez.runtime.shuffle.memory-to-memory.segments=4) - Some reducers get lots of data and quickly get into an infinite loop {code}
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms
2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms
2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms
2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
{code} Additional debug/patch statements revealed that InMemoryMerge is not invoked when it should be, and hence does not release memory back for fetchers to proceed. Example debug/patch messages are given below {code}
syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, mergeThreshold=708669632 === InMemoryMerge would be started in this case as commitMemory >= mergeThreshold
syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory.
syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO [fetcher [Map_1] #1] orderedgrouped.MergeManager: Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory.
{code} In MergeManager, in-memory merging is invoked under the following condition {code} if (!inMemoryMerger.isInProgress() && commitMemory >= mergeThreshold) {code} Attaching the sample hive command just for reference {code} $HIVE_HOME/bin/hive --hiveconf tez.runtime.io.sort.factor=200 --hiveconf hive.tez.auto.reducer.parallelism=false --hiveconf tez.am.heartbeat.interval-ms.max=20 --hiveconf tez.runtime.io.sort.mb=1200 --hiveconf tez.runtime.sort.threads=2 --hiveconf hive.tez.container.size=4096 --hiveconf tez.runtime.shuffle.memory-to-memory.enable=true --hiveconf tez.runtime.shuffle.memory-to-memory.segments=4 {code}
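The analysis above can be condensed into a tiny model of the trigger: because the merge is gated only on commitMemory, usedMemory can sit above the limit while the trigger never fires and fetchers spin on Status.WAIT. A minimal, illustrative sketch (class and method names are mine, not Tez's; the constants are the values from the logs):

```java
// Illustrative model of the merge trigger described above; names are
// hypothetical, not the actual Tez MergeManager code.
public class MergeTriggerSketch {
    public static final long MEMORY_LIMIT = 1073741824L;    // memoryLimit from the logs
    public static final long MERGE_THRESHOLD = 708669632L;  // mergeThreshold from the logs

    /** The existing trigger: merge starts only when commitMemory crosses the threshold. */
    public static boolean shouldStartMerge(boolean mergeInProgress, long commitMemory) {
        return !mergeInProgress && commitMemory >= MERGE_THRESHOLD;
    }

    public static void main(String[] args) {
        // Values from the second syslog line: usedMemory is over the limit,
        // but commitMemory is below the threshold, so no merge starts and
        // fetchers keep getting Status.WAIT.
        long usedMemory = 1273349784L;
        long commitMemory = 347296632L;
        System.out.println("usedMemory over limit: " + (usedMemory > MEMORY_LIMIT)); // true
        System.out.println("merge triggered: " + shouldStartMerge(false, commitMemory)); // false
    }
}
```

With the first syslog line's values (commitMemory=883028388), the same check returns true, matching the "would be started" annotation above.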
[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped can get into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1923: -- Attachment: TEZ-1923.1.patch Tried with the same hive job and it works fine without any infinite loop issues. [~sseth] - Can you please review when you have time? FetcherOrderedGrouped can get into infinite loop due to memory pressure --- Key: TEZ-1923 URL: https://issues.apache.org/jira/browse/TEZ-1923 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Attachments: TEZ-1923.1.patch - Ran a comparatively large job (temp table creation) at 10 TB scale. - Turned on intermediate mem-to-mem (tez.runtime.shuffle.memory-to-memory.enable=true and tez.runtime.shuffle.memory-to-memory.segments=4) - Some reducers get lots of data and quickly get into an infinite loop {code}
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms
2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms
2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms
2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
{code} Additional debug/patch statements revealed that InMemoryMerge is not invoked when it should be, and hence does not release memory back for fetchers to proceed. Example debug/patch messages are given below {code}
syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, mergeThreshold=708669632 === InMemoryMerge would be started in this case as commitMemory >= mergeThreshold
syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory.
syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO [fetcher [Map_1] #1] orderedgrouped.MergeManager: Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory.
{code} In MergeManager, in-memory merging is invoked under the following condition {code} if (!inMemoryMerger.isInProgress() && commitMemory >= mergeThreshold) {code} Attaching the
[jira] [Updated] (TEZ-1925) Remove npm WARN messages from the Tez UI build process.
[ https://issues.apache.org/jira/browse/TEZ-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated TEZ-1925: - Priority: Critical (was: Major) Remove npm WARN messages from the Tez UI build process. --- Key: TEZ-1925 URL: https://issues.apache.org/jira/browse/TEZ-1925 Project: Apache Tez Issue Type: Bug Reporter: Jonathan Eagles Assignee: Jonathan Eagles Priority: Critical Attachments: TEZ-1925-v1.patch The Tez UI currently has these npm WARN messages. [INFO] npm WARN package.json tez-ui@0.0.1 No description [INFO] npm WARN package.json tez-ui@0.0.1 No repository field. [INFO] npm WARN package.json tez-ui@0.0.1 No README data -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1928) Tez local mode hang in Pig tez local mode
[ https://issues.apache.org/jira/browse/TEZ-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated TEZ-1928: Attachment: TestMultiQuery.log Tez local mode hang in Pig tez local mode - Key: TEZ-1928 URL: https://issues.apache.org/jira/browse/TEZ-1928 Project: Apache Tez Issue Type: Bug Reporter: Daniel Dai Attachments: TestMultiQuery.log, TestScalarAliasesLocal.log Pig tez local mode tests hang under some scenarios. I attached several stack traces of hanging tests. By setting tez.am.inline.task.execution.max-tasks, the test does not hang. However, we cannot make this general since Pig backend code is not designed to be multithread-safe. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-1912) Merge exceptions are thrown when enabling tez.runtime.shuffle.memory-to-memory.enable tez.runtime.shuffle.memory-to-memory.segments
[ https://issues.apache.org/jira/browse/TEZ-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan resolved TEZ-1912. --- Resolution: Fixed Hadoop Flags: Reviewed Thanks [~sseth]. Committed to master. commit f1f87c1c81c29e1a1be69dc3261a28cd7151f2b9 Merge exceptions are thrown when enabling tez.runtime.shuffle.memory-to-memory.enable and tez.runtime.shuffle.memory-to-memory.segments -- Key: TEZ-1912 URL: https://issues.apache.org/jira/browse/TEZ-1912 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Attachments: TEZ-1912.1.patch Merge exceptions are thrown when running a hive query on tez with the following settings. It works fine without the mem-to-mem merge setting. {code}
2015-01-04 20:04:01,371 ERROR [ShuffleAndMergeRunner [Map_1]] orderedgrouped.Shuffle: ShuffleRunner failed with error
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: Error while doing final merge
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.call(Shuffle.java:364)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.call(Shuffle.java:327)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Rec# 22630125: Negative value-length: -1
	at org.apache.tez.runtime.library.common.sort.impl.IFile$Reader.positionToNextRecord(IFile.java:720)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.InMemoryReader.readRawKey(InMemoryReader.java:104)
	at org.apache.tez.runtime.library.common.sort.impl.TezMerger$Segment.readRawKey(TezMerger.java:329)
	at org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.adjustPriorityQueue(TezMerger.java:500)
	at org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.next(TezMerger.java:545)
	at org.apache.tez.runtime.library.common.sort.impl.TezMerger.writeFile(TezMerger.java:204)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.finalMerge(MergeManager.java:862)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.close(MergeManager.java:473)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.call(Shuffle.java:362)
	... 5 more
{code} {code}
$HIVE_HOME/bin/hive --hiveconf tez.runtime.io.sort.factor=200 --hiveconf tez.shuffle-vertex-manager.min-src-fraction=1.0 --hiveconf tez.shuffle-vertex-manager.max-src-fraction=1.0 --hiveconf hive.tez.auto.reducer.parallelism=false --hiveconf tez.am.heartbeat.interval-ms.max=20 --hiveconf tez.runtime.io.sort.mb=1200 --hiveconf tez.runtime.sort.threads=2 --hiveconf tez.history.logging.service.class=org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService --hiveconf hive.tez.container.size=4096 --hiveconf tez.runtime.shuffle.memory-to-memory.enable=true --hiveconf tez.runtime.shuffle.memory-to-memory.segments=4
--10 TB dataset
use tpcds4_bin_partitioned_orc_1;
drop table testData;
create table testData as select ss_sold_date_sk,ss_sold_time_sk,ss_item_sk,ss_customer_sk,ss_quantity,ss_sold_date from store_sales distribute by ss_sold_date;
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
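For context, the "Negative value-length" IOException in the stack trace above comes from a per-record sanity check on the record header read back during the merge. A simplified, hypothetical sketch of that kind of check (not the actual IFile source; it only illustrates why a corrupted in-memory segment surfaces as this message):

```java
import java.io.IOException;

// Simplified sketch of a per-record length sanity check of the kind that
// produces the "Negative value-length" IOException above; hypothetical code,
// not the actual Tez IFile implementation.
public class RecordLengthCheck {
    public static void validate(long recNo, int keyLength, int valueLength) throws IOException {
        // A negative length means the record header was misread or the
        // segment bytes are corrupt; fail fast with the record number.
        if (keyLength < 0) {
            throw new IOException("Rec# " + recNo + ": Negative key-length: " + keyLength);
        }
        if (valueLength < 0) {
            throw new IOException("Rec# " + recNo + ": Negative value-length: " + valueLength);
        }
    }

    public static void main(String[] args) throws IOException {
        validate(1L, 8, 16); // well-formed record: passes
        try {
            validate(22630125L, 8, -1); // corrupt segment: throws, as in the log above
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```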
[jira] [Updated] (TEZ-1928) Tez local mode hang in Pig tez local mode
[ https://issues.apache.org/jira/browse/TEZ-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated TEZ-1928: Attachment: TestMultiQueryBasic.log Tez local mode hang in Pig tez local mode - Key: TEZ-1928 URL: https://issues.apache.org/jira/browse/TEZ-1928 Project: Apache Tez Issue Type: Bug Reporter: Daniel Dai Attachments: TestMultiQuery.log, TestMultiQueryBasic.log, TestScalarAliasesLocal.log Pig tez local mode tests hang under some scenarios. I attached several stack traces of hanging tests. By setting tez.am.inline.task.execution.max-tasks, the test does not hang. However, we cannot make this general since Pig backend code is not designed to be multithread-safe. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1928) Tez local mode hang in Pig tez local mode
[ https://issues.apache.org/jira/browse/TEZ-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268920#comment-14268920 ] Siddharth Seth commented on TEZ-1928: - [~daijy] - if you're running with Tez 0.5.3 or higher, can you try setting the following - either programmatically or via tez-site: {code}
<property>
  <name>tez.am.dag.scheduler.class</name>
  <value>org.apache.tez.dag.app.dag.impl.DAGSchedulerNaturalOrderControlled</value>
</property>
{code} Tez local mode hang in Pig tez local mode - Key: TEZ-1928 URL: https://issues.apache.org/jira/browse/TEZ-1928 Project: Apache Tez Issue Type: Bug Reporter: Daniel Dai Attachments: TestMultiQuery.log, TestMultiQueryBasic.log, TestScalarAliasesLocal.log Pig tez local mode tests hang under some scenarios. I attached several stack traces of hanging tests. By setting tez.am.inline.task.execution.max-tasks, the test does not hang. However, we cannot make this general since Pig backend code is not designed to be multithread-safe. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
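Setting the same property programmatically, as the comment suggests, amounts to one key/value pair on the configuration passed to the Tez client. A sketch using java.util.Properties as a stand-in for the Hadoop Configuration object, purely to keep the example self-contained:

```java
import java.util.Properties;

// Sketch of applying the tez.am.dag.scheduler.class setting from the comment
// above programmatically. In a real program this pair would go into the
// Hadoop Configuration used to create the TezClient (or into tez-site.xml);
// Properties is used here only so the example runs stand-alone.
public class SchedulerConfig {
    public static Properties withNaturalOrderControlled(Properties props) {
        props.setProperty("tez.am.dag.scheduler.class",
            "org.apache.tez.dag.app.dag.impl.DAGSchedulerNaturalOrderControlled");
        return props;
    }

    public static void main(String[] args) {
        Properties p = withNaturalOrderControlled(new Properties());
        System.out.println(p.getProperty("tez.am.dag.scheduler.class"));
    }
}
```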
[jira] [Updated] (TEZ-1904) Fix findbugs warnings in tez-runtime-library
[ https://issues.apache.org/jira/browse/TEZ-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1904: Attachment: TEZ-1904.1.txt [~hitesh], [~rajesh.balamohan] - please review. Fix findbugs warnings in tez-runtime-library Key: TEZ-1904 URL: https://issues.apache.org/jira/browse/TEZ-1904 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Siddharth Seth Attachments: TEZ-1904.1.txt https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (TEZ-1904) Fix findbugs warnings in tez-runtime-library
[ https://issues.apache.org/jira/browse/TEZ-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth reassigned TEZ-1904: --- Assignee: Siddharth Seth Fix findbugs warnings in tez-runtime-library Key: TEZ-1904 URL: https://issues.apache.org/jira/browse/TEZ-1904 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Siddharth Seth Attachments: TEZ-1904.1.txt https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1925) Remove npm WARN messages from the Tez UI build process.
[ https://issues.apache.org/jira/browse/TEZ-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268798#comment-14268798 ] Prakash Ramachandran commented on TEZ-1925: --- +1 lgtm Remove npm WARN messages from the Tez UI build process. --- Key: TEZ-1925 URL: https://issues.apache.org/jira/browse/TEZ-1925 Project: Apache Tez Issue Type: Bug Reporter: Jonathan Eagles Assignee: Jonathan Eagles Priority: Critical Attachments: TEZ-1925-v1.patch The Tez UI currently has these npm WARN messages. [INFO] npm WARN package.json tez-ui@0.0.1 No description [INFO] npm WARN package.json tez-ui@0.0.1 No repository field. [INFO] npm WARN package.json tez-ui@0.0.1 No README data -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1923: -- Attachment: TEZ-1923.2.patch "Rajesh, I believe this will be triggered if there are parallel chunks being fetched without enough completing to hit the current merge condition?" Yes, this is very easily reproduced with mem-to-mem merging (even though such tight loops are possible without mem-to-mem merging). Agreed that the initial patch can end up triggering more spills to disk. Uploading a refined patch to address the following: 1. Fetchers would wait instead of getting into a tight loop. 2. IntermediateMemoryToMemoryMerger would start merging only when there is enough memory available. 3. When mem-to-mem merging is enabled, it would additionally check for (usedMemory > memoryLimit). If so, it would kick off mem-to-disk merging to release the memory pressure and to avoid fetchers waiting indefinitely. [~hitesh] - I can backport to 0.5.4 after review. FetcherOrderedGrouped gets into infinite loop due to memory pressure Key: TEZ-1923 URL: https://issues.apache.org/jira/browse/TEZ-1923 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch - Ran a comparatively large job (temp table creation) at 10 TB scale. - Turned on intermediate mem-to-mem (tez.runtime.shuffle.memory-to-memory.enable=true and tez.runtime.shuffle.memory-to-memory.segments=4) - Some reducers get lots of data and quickly get into an infinite loop {code}
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms
2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms
2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms
2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
{code} Additional debug/patch statements revealed that InMemoryMerge is not invoked when it should be, and hence does not release memory back for fetchers to proceed. Example debug/patch messages are given below {code}
syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, mergeThreshold=708669632 === InMemoryMerge would be started in this case as commitMemory >= mergeThreshold
syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is
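The fix direction described in point 3 of the comment above (kick off mem-to-disk merging when usedMemory exceeds memoryLimit even though commitMemory is still below the threshold) can be sketched as a widened trigger. The names and structure below are illustrative only; this is not the actual TEZ-1923.2.patch:

```java
// Illustrative sketch of the widened merge trigger described in the comment
// above; hypothetical names, not the actual Tez patch.
public class WidenedMergeTrigger {
    public static boolean shouldStartMemToDiskMerge(boolean mergeInProgress,
                                                    boolean memToMemEnabled,
                                                    long commitMemory, long mergeThreshold,
                                                    long usedMemory, long memoryLimit) {
        if (mergeInProgress) {
            return false;
        }
        // Original condition: enough committed memory to merge.
        if (commitMemory >= mergeThreshold) {
            return true;
        }
        // Additional check when mem-to-mem is enabled: if overall used memory
        // has exceeded the limit, merge to disk anyway so fetchers are not
        // left waiting indefinitely for memory that is never released.
        return memToMemEnabled && usedMemory > memoryLimit;
    }

    public static void main(String[] args) {
        // The previously-stuck case from the logs now triggers a merge.
        System.out.println(shouldStartMemToDiskMerge(
            false, true, 347296632L, 708669632L, 1273349784L, 1073741824L)); // true
    }
}
```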
[jira] [Assigned] (TEZ-1274) Remove Key/Value type checks in IFile
[ https://issues.apache.org/jira/browse/TEZ-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan reassigned TEZ-1274: - Assignee: Rajesh Balamohan Remove Key/Value type checks in IFile - Key: TEZ-1274 URL: https://issues.apache.org/jira/browse/TEZ-1274 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Rajesh Balamohan We check key and value types for each record - this should be removed from the tight loop. Maybe an assertion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1928) Tez local mode hang in Pig tez local mode
[ https://issues.apache.org/jira/browse/TEZ-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated TEZ-1928: Attachment: TestScalarAliasesLocal.log Tez local mode hang in Pig tez local mode - Key: TEZ-1928 URL: https://issues.apache.org/jira/browse/TEZ-1928 Project: Apache Tez Issue Type: Bug Reporter: Daniel Dai Attachments: TestScalarAliasesLocal.log Pig tez local mode tests hang under some scenarios. I attached several stack traces of hanging tests. By setting tez.am.inline.task.execution.max-tasks, the test does not hang. However, we cannot make this general since Pig backend code is not designed to be multithread-safe. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-669) [Umbrella] Security in Tez
[ https://issues.apache.org/jira/browse/TEZ-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved TEZ-669. Resolution: Fixed [Umbrella] Security in Tez -- Key: TEZ-669 URL: https://issues.apache.org/jira/browse/TEZ-669 Project: Apache Tez Issue Type: Task Reporter: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-785) Vertex.checkVertexForCompletion needs to handle additional VertexTerminationCauses
[ https://issues.apache.org/jira/browse/TEZ-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved TEZ-785. Resolution: Done I believe this is fixed as part of diagnostic improvements done elsewhere. Vertex.checkVertexForCompletion needs to handle additional VertexTerminationCauses -- Key: TEZ-785 URL: https://issues.apache.org/jira/browse/TEZ-785 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-912) Change ScatterGatherShuffle and BroadcastShuffle to use the same code path
[ https://issues.apache.org/jira/browse/TEZ-912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-912: --- Target Version/s: 0.7.0 Change ScatterGatherShuffle and BroadcastShuffle to use the same code path -- Key: TEZ-912 URL: https://issues.apache.org/jira/browse/TEZ-912 Project: Apache Tez Issue Type: Task Reporter: Siddharth Seth Currently there are two shuffle schedulers, two fetchers, etc., which is a maintenance headache. Merging the two together is a decent amount of work though, considering how Merge, Shuffle and Fetch are tied together in the case of ShuffledMergedInput. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-924) InputFailedEvent handling for Shuffle
[ https://issues.apache.org/jira/browse/TEZ-924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-924: --- Target Version/s: 0.7.0 InputFailedEvent handling for Shuffle - Key: TEZ-924 URL: https://issues.apache.org/jira/browse/TEZ-924 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth Priority: Critical Shuffle receives batches of Events to process from the AM. Given the way these events are sent over to the ShuffleHandlers and the way they're processed, it's possible that Shuffle will start fetching data from an Event which is subsequently marked as failed (via an InputFailedEvent). 1) The AM sends events in batches. An InputFailedEvent for a specific Input may not be part of the same batch which contained the original event that is being marked bad. 2) The ShuffleEventHandler processes the events in each batch one event at a time - so even if the InputFailedEvent follows, it's possible for Shuffle to start fetching data from a failed Input. The AM needs to change to invalidate Inputs up front, so that related events don't span batches. Alternately, it needs to apply the InputFailedEvent to the original event being sent. The Shuffle itself should process a batch update as a batch - that would prevent fetchers from starting early even though there may be additional events for the same host. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
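The batch-as-a-batch processing suggested above can be sketched as a two-pass loop: apply all failure events in the batch first, then decide which inputs to fetch, so a fetch is never started for an input that is invalidated later in the same batch. The event types below are simplified stand-ins, not Tez's actual event classes:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of two-pass batch handling for shuffle events; the event types are
// simplified stand-ins, not the real Tez DataMovementEvent/InputFailedEvent.
public class BatchEventSketch {
    public interface Event { String inputId(); }
    public record DataMovementEvent(String inputId) implements Event {}
    public record InputFailedEvent(String inputId) implements Event {}

    /** Pass 1: collect failed inputs. Pass 2: pick fetch targets, skipping them. */
    public static List<String> inputsToFetch(List<Event> batch) {
        Set<String> failed = new HashSet<>();
        for (Event e : batch) {
            if (e instanceof InputFailedEvent f) {
                failed.add(f.inputId());
            }
        }
        List<String> fetch = new ArrayList<>();
        for (Event e : batch) {
            if (e instanceof DataMovementEvent d && !failed.contains(d.inputId())) {
                fetch.add(d.inputId());
            }
        }
        return fetch;
    }

    public static void main(String[] args) {
        List<Event> batch = List.of(
            new DataMovementEvent("attempt_0"),
            new InputFailedEvent("attempt_0"),  // arrives later in the same batch
            new DataMovementEvent("attempt_1"));
        System.out.println(inputsToFetch(batch)); // only attempt_1 is fetched
    }
}
```

Processing one event at a time, by contrast, would start a fetch for attempt_0 before its InputFailedEvent is seen, which is exactly the race the description calls out.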
[jira] [Updated] (TEZ-941) Avoid writing out empty partitions in Sorter implementations
[ https://issues.apache.org/jira/browse/TEZ-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-941: --- Target Version/s: 0.7.0 Labels: (was: 0.4) Avoid writing out empty partitions in Sorter implementations Key: TEZ-941 URL: https://issues.apache.org/jira/browse/TEZ-941 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1529) ATS and TezClient integration in secure kerberos enabled cluster
[ https://issues.apache.org/jira/browse/TEZ-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268470#comment-14268470 ] Jonathan Eagles commented on TEZ-1529: -- [~pramachandran], I don't have much context on this issue, but can this issue be re-targeted for the Tez 0.6.1 or 0.7.0 release? ATS and TezClient integration in secure kerberos enabled cluster - Key: TEZ-1529 URL: https://issues.apache.org/jira/browse/TEZ-1529 Project: Apache Tez Issue Type: Bug Reporter: Prakash Ramachandran Assignee: Prakash Ramachandran Priority: Blocker This is a follow-up to TEZ-1495, which addresses ATS - TezClient integration; however, it does not enable it in a secure Kerberos-enabled cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1095) Enable tez.runtime.shuffle.memory-to-memory.enable for mem-to-mem merging
[ https://issues.apache.org/jira/browse/TEZ-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1095: Target Version/s: 0.7.0 Enable tez.runtime.shuffle.memory-to-memory.enable for mem-to-mem merging - Key: TEZ-1095 URL: https://issues.apache.org/jira/browse/TEZ-1095 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Currently tez.runtime.shuffle.memory-to-memory.enable is set to false by default. We need to evaluate the usefulness of this parameter and enable it by default if it provides good perf boosts. There is also a possibility that waitForInMemoryMerge() will return in sub milliseconds when this parameter is enabled, causing pressure on network resources. Related JIRA: https://issues.apache.org/jira/browse/TEZ-1091 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1094) Support pipelined data transfer for Unordered Output
[ https://issues.apache.org/jira/browse/TEZ-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1094: Target Version/s: 0.7.0 Support pipelined data transfer for Unordered Output Key: TEZ-1094 URL: https://issues.apache.org/jira/browse/TEZ-1094 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth For unsorted output (and possibly for sorted output), it should be possible to send data in small batches instead of waiting for everything to be generated before transmitting. For now, planning on getting started with UnsortedOutput / Input pairs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1212) Remove synchronization on the write method in OnFileSortedOutput
[ https://issues.apache.org/jira/browse/TEZ-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1212: Target Version/s: 0.7.0 Remove synchronization on the write method in OnFileSortedOutput Key: TEZ-1212 URL: https://issues.apache.org/jira/browse/TEZ-1212 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1211) Remove synchronization on the write method in OnFileUnorderedPartitionedOutput
[ https://issues.apache.org/jira/browse/TEZ-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1211: Target Version/s: 0.7.0 Remove synchronization on the write method in OnFileUnorderedPartitionedOutput -- Key: TEZ-1211 URL: https://issues.apache.org/jira/browse/TEZ-1211 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1274) Remove Key/Value type checks in IFile
[ https://issues.apache.org/jira/browse/TEZ-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1274: Target Version/s: 0.7.0 Remove Key/Value type checks in IFile - Key: TEZ-1274 URL: https://issues.apache.org/jira/browse/TEZ-1274 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth We check key and value types for each record - this should be removed from the tight loop. Maybe an assertion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
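The "maybe an assertion" suggestion above is the standard Java trick for hot loops: a `assert` check costs nothing in production (the JVM elides it unless run with `-ea`) while still catching type mismatches in tests. The sketch below is a hypothetical illustration of that transformation, not the actual IFile code; `AppendSketch` and its methods are invented names.

```java
// Hypothetical sketch (not the actual IFile code): replacing a per-record
// type check in a tight append loop with a Java assert, which the JVM
// skips entirely unless the process runs with -ea.
public class AppendSketch {
    private final Class<?> keyClass;

    public AppendSketch(Class<?> keyClass) {
        this.keyClass = keyClass;
    }

    // Before: an explicit check executed for every record appended.
    public boolean appendChecked(Object key) {
        if (key.getClass() != keyClass) {
            throw new IllegalArgumentException("wrong key type: " + key.getClass());
        }
        return true; // stand-in for the real append work
    }

    // After: the check becomes an assertion - free in production runs,
    // still enforced in test runs started with -ea.
    public boolean appendAsserted(Object key) {
        assert key.getClass() == keyClass : "wrong key type: " + key.getClass();
        return true;
    }
}
```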
[jira] [Updated] (TEZ-1491) Tez reducer-side merge's counter update is slow
[ https://issues.apache.org/jira/browse/TEZ-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1491: Target Version/s: 0.7.0 Tez reducer-side merge's counter update is slow --- Key: TEZ-1491 URL: https://issues.apache.org/jira/browse/TEZ-1491 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V Attachments: perf-top-counters.png TezMerger$MergeQueue::next() shows up in profiles due to a synchronized block in a tight loop. Part of the slow operation was due to DataInputBuffer issues identified earlier in HADOOP-10694, but along with that approximately 11% of my lock-prefix calls were originating from the following line. {code} mergeProgress.set(totalBytesProcessed * progPerByte); {code} It appears in two places within the core loop. !perf-top-counters.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
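One common way to keep a shared counter write like `mergeProgress.set(...)` out of a tight loop is to update the shared state once per batch of records rather than once per record. The sketch below is a minimal illustration of that idea, not the actual TezMerger fix; `ProgressSketch`, `countSharedWrites`, and the batch size are hypothetical names chosen for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: throttling a shared progress update so the
// (potentially lock-prefixed) write happens once per batch instead of
// once per record processed in the merge loop.
public class ProgressSketch {
    public static int countSharedWrites(int records, int batch) {
        AtomicInteger progress = new AtomicInteger(); // stands in for mergeProgress
        int writes = 0;
        for (int i = 1; i <= records; i++) {
            // ... process one record (the hot path stays write-free) ...
            if (i % batch == 0 || i == records) {   // touch shared state rarely
                progress.set(i);
                writes++;
            }
        }
        return writes;
    }
}
```

With a batch of 100, processing 1000 records performs 10 shared writes instead of 1000, which is the kind of reduction that would take the `lock` prefix out of the profile.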
[jira] [Updated] (TEZ-1526) LoadingCache for TezTaskID slow for large jobs
[ https://issues.apache.org/jira/browse/TEZ-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1526: Target Version/s: 0.6.0, 0.7.0 (was: 0.6.0) LoadingCache for TezTaskID slow for large jobs -- Key: TEZ-1526 URL: https://issues.apache.org/jira/browse/TEZ-1526 Project: Apache Tez Issue Type: Improvement Reporter: Jonathan Eagles Assignee: Jonathan Eagles Labels: performance Attachments: 10-TezTaskIDs.patch, TEZ-1526-v1.patch, TEZ-1526-v2.patch Using the LoadingCache with default builder settings, 100,000 TezTaskIDs are created in 10 seconds on my setup. With a LoadingCache initialCapacity of 10,000, they are created in 300 ms. With no LoadingCache, they are created in 10 ms. A test case is attached to illustrate the condition I would like to see sped up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
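The speedup reported above comes from presizing: a cache created with a large `initialCapacity` avoids repeated internal resize/rehash while the first tens of thousands of entries are inserted. The sketch below illustrates the same effect with a stdlib `HashMap` (Guava's `CacheBuilder.initialCapacity(...)` is the analogous knob for `LoadingCache`); `CacheSizingSketch` and `fill` are hypothetical names for the example, not Tez code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: presizing a map avoids repeated resize/rehash while
// inserting many entries - the same effect the JIRA observes from setting
// initialCapacity on the LoadingCache builder instead of the defaults.
public class CacheSizingSketch {
    public static Map<Integer, String> fill(int n, boolean presized) {
        // A default HashMap starts at 16 buckets and doubles repeatedly;
        // a presized one allocates enough buckets up front.
        Map<Integer, String> cache =
            presized ? new HashMap<>(n * 2) : new HashMap<>();
        for (int i = 0; i < n; i++) {
            cache.put(i, "task_" + i); // stand-in for cached TezTaskID instances
        }
        return cache;
    }
}
```

Either map ends up with the same contents; only the construction cost differs, which is why the patch tunes the builder rather than changing cache semantics.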
[jira] [Commented] (TEZ-1573) Exception from InputInitializer and VertexManagerPlugin is not propagated to client
[ https://issues.apache.org/jira/browse/TEZ-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268515#comment-14268515 ] Siddharth Seth commented on TEZ-1573: - [~jeffzhang] - is this fixed as part of 1267 and the related JIRAs? Exception from InputInitializer and VertexManagerPlugin is not propagated to client --- Key: TEZ-1573 URL: https://issues.apache.org/jira/browse/TEZ-1573 Project: Apache Tez Issue Type: Sub-task Reporter: Jeff Zhang Assignee: Jeff Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-1921) Intermediate data cleanup for long running sessions
[ https://issues.apache.org/jira/browse/TEZ-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved TEZ-1921. - Resolution: Duplicate Intermediate data cleanup for long running sessions --- Key: TEZ-1921 URL: https://issues.apache.org/jira/browse/TEZ-1921 Project: Apache Tez Issue Type: Sub-task Reporter: Bikas Saha Intermediate data for a DAG could be deleted after a DAG has completed. Else it accumulates until the session completes and could unnecessarily fill up the local disk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-776) Reduce AM mem usage caused by storing TezEvents
[ https://issues.apache.org/jira/browse/TEZ-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-776: --- Target Version/s: 0.7.0 Reduce AM mem usage caused by storing TezEvents --- Key: TEZ-776 URL: https://issues.apache.org/jira/browse/TEZ-776 Project: Apache Tez Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth This is open ended at the moment. A fair chunk of the AM heap is taken up by TezEvents (specifically DataMovementEvents - 64 bytes per event). Depending on the connection pattern - this puts limits on the number of tasks that can be processed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-705) Add a helper which can be used to obtain credentials, setup MRInput parameters
[ https://issues.apache.org/jira/browse/TEZ-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved TEZ-705. Resolution: Won't Fix MRInput / MROutput modified to make this simpler. Add a helper which can be used to obtain credentials, setup MRInput parameters -- Key: TEZ-705 URL: https://issues.apache.org/jira/browse/TEZ-705 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-690) Tez API Ease of Use
[ https://issues.apache.org/jira/browse/TEZ-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268452#comment-14268452 ] Siddharth Seth commented on TEZ-690: Can this be closed? Tez API Ease of Use --- Key: TEZ-690 URL: https://issues.apache.org/jira/browse/TEZ-690 Project: Apache Tez Issue Type: Task Reporter: Bikas Saha Assignee: Bikas Saha Recently we wrote the wordcount example from scratch using the Tez APIs in TEZ-689. The code shows some room for improvement in making the Tez APIs more concise and less error-prone. This JIRA tracks some of those changes. The improvements in this JIRA will be reflected in the cleanliness and conciseness of the word count example job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-770) Remove SessionLocalResources
[ https://issues.apache.org/jira/browse/TEZ-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved TEZ-770. Resolution: Done Resolved elsewhere. Remove SessionLocalResources Key: TEZ-770 URL: https://issues.apache.org/jira/browse/TEZ-770 Project: Apache Tez Issue Type: Task Reporter: Siddharth Seth Assignee: Siddharth Seth These are currently not used, or exposed to users. For now - they just end up adding additional steps when running on a secure cluster. Can be re-introduced if we need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-946) Tez loses buffer-cache performance by running interleaved vertexes
[ https://issues.apache.org/jira/browse/TEZ-946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-946: --- Target Version/s: 0.7.0 Tez loses buffer-cache performance by running interleaved vertexes -- Key: TEZ-946 URL: https://issues.apache.org/jira/browse/TEZ-946 Project: Apache Tez Issue Type: Bug Reporter: Gopal V Attachments: union-10.svg For a task which has multiple reduce vertexes running to generate UNION ops, the current Tez behaviour causes bad cache performance as well as bad perf with the object registry. The map spill files get paged in and out of cache, when I was running a large query which had multiple reducers pulling data off different shuffle edges at the same time. Along with this, whenever a map-join vertex is interleaved with a reducer vertex, the map-join hashtable gets dropped in the transition. It would be beneficial to schedule the vertexes at the same level with some priority so that we finish them faster through better buffer-cache hit-rate and object-registry hit-rate. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-943) Potential memory leaks caused by holding on to TaskAttemptIDs and Containers
[ https://issues.apache.org/jira/browse/TEZ-943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-943: --- Target Version/s: 0.7.0 Potential memory leaks caused by holding on to TaskAttemptIDs and Containers Key: TEZ-943 URL: https://issues.apache.org/jira/browse/TEZ-943 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth Details at https://issues.apache.org/jira/browse/TEZ-940?focusedCommentId=13938870&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13938870 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-485) Get rid of TezTaskStatus
[ https://issues.apache.org/jira/browse/TEZ-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-485: --- Target Version/s: 0.7.0 Get rid of TezTaskStatus Key: TEZ-485 URL: https://issues.apache.org/jira/browse/TEZ-485 Project: Apache Tez Issue Type: Sub-task Reporter: Siddharth Seth Priority: Minor Attachments: TEZ-485.1.txt TezTaskStatus is used by the MR Reporter only. We should be able to get rid of this interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (TEZ-485) Get rid of TezTaskStatus
[ https://issues.apache.org/jira/browse/TEZ-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth reassigned TEZ-485: -- Assignee: Siddharth Seth Get rid of TezTaskStatus Key: TEZ-485 URL: https://issues.apache.org/jira/browse/TEZ-485 Project: Apache Tez Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Priority: Minor Attachments: TEZ-485.1.txt TezTaskStatus is used by the MR Reporter only. We should be able to get rid of this interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-965) Tez needs a circuit-breaker to avoid mistaking network blips for task/node failures
[ https://issues.apache.org/jira/browse/TEZ-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-965: --- Target Version/s: 0.7.0 Tez needs a circuit-breaker to avoid mistaking network blips for task/node failures Key: TEZ-965 URL: https://issues.apache.org/jira/browse/TEZ-965 Project: Apache Tez Issue Type: Bug Environment: Flaky DNS cluster Reporter: Gopal V If DNS resolution fails for a period of 5-10 seconds, Tez restarts contra-flows in the query, triggering recovery of nearly everything it has run. Nodes are getting marked as bad because they can't shuffle (DNS resolution failed for all NMs), which results in log lines like {code} attempt_1394928384313_0234_1_25_000654_0 blamed for read error from attempt_1394928384313_0234_1_24_000366_0 {code} And the tasks restart from an earlier vertex. When a large number of such failures happen, the tasks shouldn't restart previous vertexes, but instead should trip a circuit-breaker and back off until the network blip disappears. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
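The behaviour proposed above is the classic circuit-breaker pattern: after a burst of correlated fetch failures, stop escalating (which re-runs upstream vertexes) and back off until the failures stop. The following is a minimal, hypothetical sketch of that pattern - `FetchCircuitBreaker` and its methods are invented for illustration and are not part of Tez.

```java
// Hypothetical sketch of the proposed behaviour: after a burst of fetch
// failures, stop blaming source attempts (which would re-run upstream
// vertexes) and instead back off until failures stop - a circuit breaker.
public class FetchCircuitBreaker {
    private final int threshold;
    private int consecutiveFailures = 0;
    private boolean open = false;

    public FetchCircuitBreaker(int threshold) {
        this.threshold = threshold;
    }

    public void onFailure() {
        if (++consecutiveFailures >= threshold) {
            open = true; // too many blips: back off rather than blame the source
        }
    }

    public void onSuccess() {
        consecutiveFailures = 0;
        open = false;    // network recovered: resume normal error handling
    }

    /** True while fetchers should back off instead of reporting read errors. */
    public boolean shouldBackOff() {
        return open;
    }
}
```

A real implementation would also need a half-open probe (retry one fetch after a delay) so the breaker can close itself once DNS recovers.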
[jira] [Updated] (TEZ-967) Expose list of running tasks along with meta information
[ https://issues.apache.org/jira/browse/TEZ-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-967: --- Target Version/s: 0.7.0 Expose list of running tasks along with meta information Key: TEZ-967 URL: https://issues.apache.org/jira/browse/TEZ-967 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth Useful to figure out what is running while executing a DAG - especially while debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1078) ValuesIterator does not need to deserialize keys for comparison
[ https://issues.apache.org/jira/browse/TEZ-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1078: Target Version/s: 0.7.0 ValuesIterator does not need to deserialize keys for comparison --- Key: TEZ-1078 URL: https://issues.apache.org/jira/browse/TEZ-1078 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth ValuesIterator - which provides a Key, Values view - ends up deserializing each key before comparing it to the previous key when trying to determine whether a new key has been found or the next K-V pair in the IFile belongs to the same key. It should be possible to use the compare(byte[], ...) method from the RawComparator interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
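The idea is to answer "is this the same key as the previous one?" by comparing the serialized bytes directly - the signature Hadoop's `RawComparator` exposes is `compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)` - instead of deserializing every key. The sketch below is a self-contained stand-in for that raw comparison (lexicographic unsigned-byte compare, similar in spirit to Hadoop's `WritableComparator.compareBytes`); `RawKeyCompare` and `sameKey` are illustrative names, not Tez code.

```java
// Sketch: deciding "same key as the previous one?" by comparing serialized
// bytes directly, as RawComparator.compare(byte[], int, int, byte[], int, int)
// allows, instead of deserializing each key first.
public class RawKeyCompare {
    // Lexicographic comparison of two byte ranges, treating bytes as
    // unsigned; a shorter range that is a prefix of the longer sorts first.
    public static int compareBytes(byte[] b1, int s1, int l1,
                                   byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) return a - b;
        }
        return l1 - l2;
    }

    public static boolean sameKey(byte[] prev, byte[] next) {
        return compareBytes(prev, 0, prev.length, next, 0, next.length) == 0;
    }
}
```

Note this byte-wise ordering only matches the deserialized ordering for key types whose serialized form sorts the same way (or when only equality is needed, as in the new-key check here).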
[jira] [Commented] (TEZ-485) Get rid of TezTaskStatus
[ https://issues.apache.org/jira/browse/TEZ-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268487#comment-14268487 ] Hitesh Shah commented on TEZ-485: - +1 Get rid of TezTaskStatus Key: TEZ-485 URL: https://issues.apache.org/jira/browse/TEZ-485 Project: Apache Tez Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Priority: Minor Attachments: TEZ-485.1.txt TezTaskStatus is used by the MR Reporter only. We should be able to get rid of this interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1275) Add an append method to IFile which does not check for RLE
[ https://issues.apache.org/jira/browse/TEZ-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1275: Target Version/s: 0.7.0 Add an append method to IFile which does not check for RLE -- Key: TEZ-1275 URL: https://issues.apache.org/jira/browse/TEZ-1275 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth The RLE and same key checks are primarily required for sorted output. For the unordered case - these checks should not be hit (and will almost always return false). I believe longer term, the plan is to have only a single method - which does not have the checks, and move all the key comparison and equality logic over to users of IFile - which would end up calling appendKV on new keys, and append(V / List<V>) for repeated values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1363) Make use of the regular scheduler when running in LocalMode
[ https://issues.apache.org/jira/browse/TEZ-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1363: Target Version/s: 0.7.0 Make use of the regular scheduler when running in LocalMode --- Key: TEZ-1363 URL: https://issues.apache.org/jira/browse/TEZ-1363 Project: Apache Tez Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Jonathan Eagles Attachments: TEZ-1363-v1.patch, TEZ-1363-v2.patch, TEZ-1363-v3.patch In TEZ-708, we decided to introduce a new scheduler for local mode - to keep things simple initially, and get local mode working. Eventually, however, scheduling should go through the regular task scheduler - which should be able to get containers from YARN / LocalAllocator / other sources - and treat them as a regular container for scheduling purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1518) Clean up ID caches on DAG completion
[ https://issues.apache.org/jira/browse/TEZ-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1518: Target Version/s: 0.6.0, 0.7.0 (was: 0.6.0) Clean up ID caches on DAG completion Key: TEZ-1518 URL: https://issues.apache.org/jira/browse/TEZ-1518 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-1546) Change InputInitializerContext.registerForVertexStateUpdates to return a list of pending state changes
[ https://issues.apache.org/jira/browse/TEZ-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved TEZ-1546. - Resolution: Won't Fix Change InputInitializerContext.registerForVertexStateUpdates to return a list of pending state changes -- Key: TEZ-1546 URL: https://issues.apache.org/jira/browse/TEZ-1546 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth Priority: Critical Sending pending events via the stateChange on the InputInitializer can be confusing - since multiple calls will be made back to back, without knowing how many events are coming in, or which one is the last. Returning all past state changes from the register call ensures that invocations of onStateChanged are for current events. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-485) Get rid of TezTaskStatus
[ https://issues.apache.org/jira/browse/TEZ-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved TEZ-485. Resolution: Fixed Fix Version/s: 0.7.0 Committed to master. Get rid of TezTaskStatus Key: TEZ-485 URL: https://issues.apache.org/jira/browse/TEZ-485 Project: Apache Tez Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Priority: Minor Fix For: 0.7.0 Attachments: TEZ-485.1.txt TezTaskStatus is used by the MR Reporter only. We should be able to get rid of this interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1910) Build fails against hadoop-2.2.0
[ https://issues.apache.org/jira/browse/TEZ-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1910: - Attachment: TEZ-1910.2.patch Modified patch to add relevant comments. Build fails against hadoop-2.2.0 Key: TEZ-1910 URL: https://issues.apache.org/jira/browse/TEZ-1910 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Priority: Blocker Attachments: TEZ-1910.1.patch, TEZ-1910.2.patch https://builds.apache.org/job/Tez-Build-Hadoop-2.2/2/console -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1910) Build fails against hadoop-2.2.0
[ https://issues.apache.org/jira/browse/TEZ-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268549#comment-14268549 ] Jonathan Eagles commented on TEZ-1910: -- +1. Will commit shortly. Thanks, [~hitesh]. Build fails against hadoop-2.2.0 Key: TEZ-1910 URL: https://issues.apache.org/jira/browse/TEZ-1910 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Priority: Blocker Attachments: TEZ-1910.1.patch, TEZ-1910.2.patch https://builds.apache.org/job/Tez-Build-Hadoop-2.2/2/console -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1925) Remove npm WARN messages from the Tez UI build process.
Jonathan Eagles created TEZ-1925: Summary: Remove npm WARN messages from the Tez UI build process. Key: TEZ-1925 URL: https://issues.apache.org/jira/browse/TEZ-1925 Project: Apache Tez Issue Type: Bug Reporter: Jonathan Eagles Assignee: Jonathan Eagles -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1925) Remove npm WARN messages from the Tez UI build process.
[ https://issues.apache.org/jira/browse/TEZ-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated TEZ-1925: - Attachment: TEZ-1925-v1.patch [~pramachandran], [~hitesh], can you review? Remove npm WARN messages from the Tez UI build process. --- Key: TEZ-1925 URL: https://issues.apache.org/jira/browse/TEZ-1925 Project: Apache Tez Issue Type: Bug Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: TEZ-1925-v1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1925) Remove npm WARN messages from the Tez UI build process.
[ https://issues.apache.org/jira/browse/TEZ-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated TEZ-1925: - Description: The Tez UI currently has these npm WARN messages. [INFO] npm WARN package.json tez-ui@0.0.1 No description [INFO] npm WARN package.json tez-ui@0.0.1 No repository field. [INFO] npm WARN package.json tez-ui@0.0.1 No README data Remove npm WARN messages from the Tez UI build process. --- Key: TEZ-1925 URL: https://issues.apache.org/jira/browse/TEZ-1925 Project: Apache Tez Issue Type: Bug Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: TEZ-1925-v1.patch The Tez UI currently has these npm WARN messages. [INFO] npm WARN package.json tez-ui@0.0.1 No description [INFO] npm WARN package.json tez-ui@0.0.1 No repository field. [INFO] npm WARN package.json tez-ui@0.0.1 No README data -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (TEZ-1900) Fix findbugs warnings in tez-dag
[ https://issues.apache.org/jira/browse/TEZ-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah reassigned TEZ-1900: Assignee: Hitesh Shah Fix findbugs warnings in tez-dag Key: TEZ-1900 URL: https://issues.apache.org/jira/browse/TEZ-1900 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Hitesh Shah Might need to be split out more. https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/newPatchFindbugsWarningstez-dag.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1926) fatalError reported in LogicalIOProcessorRuntimeTask isn't reported to AM
Siddharth Seth created TEZ-1926: --- Summary: fatalError reported in LogicalIOProcessorRuntimeTask isn't reported to AM Key: TEZ-1926 URL: https://issues.apache.org/jira/browse/TEZ-1926 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth May not need to be - but needs to be looked at. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268186#comment-14268186 ] Siddharth Seth commented on TEZ-1923: - Scratch that. This is linked to the MemoryToDisk merger and happens when data comes in parallel and takes time to complete - large chunks for example. commitMemory is only accounted for after a fetch completes; usedMemory is accounted on each reserve. Rajesh, I believe this will be triggered if there are parallel chunks being fetched without enough of them completing to hit the current merge condition? Even with this patch, I think it's possible to hit the WAIT loop in the following situations. 1) closeInMemoryFile is not invoked - i.e. all fetches are still in progress, and the merge doesn't get triggered - fetchers would end up in the WAIT loop. 2) If a single merge were to complete and trigger the new condition - the merge essentially writes a single segment, clears up some memory and goes back to condition 1. Also, this would imply writing out more files to disk than we would want to - since the merger will be triggered more often. One option would be to check and wait on usedMemory in the fetchers - instead of just relying on the merger to be running. FetcherOrderedGrouped gets into infinite loop due to memory pressure Key: TEZ-1923 URL: https://issues.apache.org/jira/browse/TEZ-1923 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-1923.1.patch - Ran a comparatively large job (temp table creation) at 10 TB scale. - Turned on intermediate mem-to-mem (tez.runtime.shuffle.memory-to-memory.enable=true and tez.runtime.shuffle.memory-to-memory.segments=4) - Some reducers get lots of data and quickly get into an infinite loop {code} 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 
{code} Additional debug/patch statements revealed that InMemoryMerge is not invoked appropriately and not releasing the memory back for fetchers to proceed. e.g., debug/patch messages are given below {code} syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, mergeThreshold=708669632 === InMemoryMerge would be started in this case as commitMemory >= mergeThreshold syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1273349784, memoryLimit=1073741824,
[jira] [Comment Edited] (TEZ-1915) Add public key to KEYS
[ https://issues.apache.org/jira/browse/TEZ-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268172#comment-14268172 ] Jonathan Eagles edited comment on TEZ-1915 at 1/7/15 8:42 PM: -- Simple release prep fix for 0.6. Committed to branch-0.6 and master. Also, published this key to http://pgp.mit.edu/ was (Author: jeagles): Simple release prep fix for 0.6. Committed to branch-0.6 and master Add public key to KEYS -- Key: TEZ-1915 URL: https://issues.apache.org/jira/browse/TEZ-1915 Project: Apache Tez Issue Type: Improvement Reporter: Jonathan Eagles Assignee: Jonathan Eagles Priority: Blocker Fix For: 0.6.0 Attachments: TEZ-1915-v1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1923: -- Summary: FetcherOrderedGrouped gets into infinite loop due to memory pressure (was: FetcherOrderedGrouped can get into infinite loop due to memory pressure) FetcherOrderedGrouped gets into infinite loop due to memory pressure Key: TEZ-1923 URL: https://issues.apache.org/jira/browse/TEZ-1923 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Attachments: TEZ-1923.1.patch - Ran a comparatively large job (temp table creation) at 10 TB scale. - Turned on intermediate mem-to-mem (tez.runtime.shuffle.memory-to-memory.enable=true and tez.runtime.shuffle.memory-to-memory.segments=4) - Some reducers get lots of data and quickly get into an infinite loop {code} 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 
2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... {code} Additional debug/patch statements revealed that InMemoryMerge is not invoked appropriately and not releasing the memory back for fetchers to proceed. e.g., debug/patch messages are given below {code} syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, mergeThreshold=708669632 === InMemoryMerge would be started in this case as commitMemory >= mergeThreshold syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory. 
syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO [fetcher [Map_1] #1] orderedgrouped.MergeManager: Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory. {code} In MergeManager, in memory merging is invoked under the following condition {code} if (!inMemoryMerger.isInProgress() commitMemory = mergeThreshold) {code} Attaching the sample hive
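The starvation described in this report can be modeled with a toy version of the trigger condition. This is a hedged sketch: the class and method names (MergeConditionSketch, shouldStartMerge, fetchersMustWait) are invented for illustration and the logic is simplified from the description, not the actual Tez MergeManager code.

```java
// Toy model of the merge-trigger gap described in TEZ-1923: an in-memory
// merge starts only when commitMemory >= mergeThreshold, while fetchers are
// told to wait whenever usedMemory exceeds memoryLimit. When the second
// condition holds but the first never becomes true, no memory is released
// and the fetchers wait forever.
public class MergeConditionSketch {

    /** Condition from the description: merge kicks in only past the threshold. */
    public static boolean shouldStartMerge(long commitMemory, long mergeThreshold,
                                           boolean mergeInProgress) {
        return !mergeInProgress && commitMemory >= mergeThreshold;
    }

    /** Fetchers get Status.WAIT once used memory exceeds the limit. */
    public static boolean fetchersMustWait(long usedMemory, long memoryLimit) {
        return usedMemory > memoryLimit;
    }

    public static void main(String[] args) {
        // First syslog line: commitMemory >= mergeThreshold, so a merge starts.
        System.out.println(shouldStartMerge(883028388L, 708669632L, false));
        // Second syslog line: fetchers must wait (usedMemory > memoryLimit)
        // yet no merge is triggered (commitMemory < mergeThreshold) -- livelock.
        System.out.println(fetchersMustWait(1273349784L, 1073741824L)
            && !shouldStartMerge(347296632L, 708669632L, false));
    }
}
```

Plugging in the numbers from the second syslog line above: usedMemory (1273349784) exceeds memoryLimit (1073741824), so fetchers wait, while commitMemory (347296632) stays below mergeThreshold (708669632), so the merge never starts.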
[jira] [Assigned] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan reassigned TEZ-1923:
-------------------------------------
    Assignee: Rajesh Balamohan

FetcherOrderedGrouped gets into infinite loop due to memory pressure
--------------------------------------------------------------------

                 Key: TEZ-1923
                 URL: https://issues.apache.org/jira/browse/TEZ-1923
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Rajesh Balamohan
            Assignee: Rajesh Balamohan
         Attachments: TEZ-1923.1.patch
[jira] [Commented] (TEZ-1913) Reduce deserialize cost in ValuesIterator
[ https://issues.apache.org/jira/browse/TEZ-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268331#comment-14268331 ]

Siddharth Seth commented on TEZ-1913:
-------------------------------------
Questions/comments on the patch.
- The ValuesIterator is used by the Combiners as well. I'm not sure the result of a merge (a RawKVIterator which supports isSameKey) is the only iterator which will be used in these cases. From the PipelinedSorter, there were some RawKVIterators which don't implement the method.
- EmptyIterator.isSameKey isn't implemented - I don't think we'll ever hit this, but the Merger can return an instance of this. Should probably change this to return false.
- Add a test in TestValuesIterator to validate that same keys work.

Reduce deserialize cost in ValuesIterator
-----------------------------------------

                 Key: TEZ-1913
                 URL: https://issues.apache.org/jira/browse/TEZ-1913
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Rajesh Balamohan
            Assignee: Rajesh Balamohan
              Labels: performance
         Attachments: TEZ-1913.1.patch

When TezRawKeyValueIterator.isSameKey() is added, it should be possible to reduce the number of deserializations in ValuesIterator.readNextKey(). Creating this ticket to track the issue.
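The optimization discussed in TEZ-1913 — skipping key deserialization when the raw key bytes match the previous key — can be sketched as follows. The class name SameKeySketch and its method shape are invented for illustration; this is not the actual Tez TezRawKeyValueIterator/ValuesIterator API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch of the isSameKey() idea: compare the raw serialized
// key bytes with the previous key's bytes, so the caller only deserializes
// a key when it has actually changed.
public class SameKeySketch {
    private byte[] prevRawKey;

    /** Returns true when the current raw key equals the previously seen one. */
    public boolean isSameKey(byte[] currRawKey) {
        boolean same = prevRawKey != null && Arrays.equals(prevRawKey, currRawKey);
        prevRawKey = currRawKey.clone(); // remember for the next comparison
        return same;
    }

    public static void main(String[] args) {
        SameKeySketch it = new SameKeySketch();
        byte[] k1 = "key1".getBytes(StandardCharsets.UTF_8);
        byte[] k2 = "key2".getBytes(StandardCharsets.UTF_8);
        System.out.println(it.isSameKey(k1)); // false: first key seen, must deserialize
        System.out.println(it.isSameKey(k1)); // true: raw bytes unchanged, skip deserialization
        System.out.println(it.isSameKey(k2)); // false: key changed, deserialize again
    }
}
```

The byte comparison is cheap relative to deserializing the key object, which is where the savings in readNextKey() would come from.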
[jira] [Commented] (TEZ-15) Support for DAG AM recovery
[ https://issues.apache.org/jira/browse/TEZ-15?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268339#comment-14268339 ]

Siddharth Seth commented on TEZ-15:
-----------------------------------
Can this be closed, since recovery is already supported?

Support for DAG AM recovery
---------------------------

                 Key: TEZ-15
                 URL: https://issues.apache.org/jira/browse/TEZ-15
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Bikas Saha
[jira] [Resolved] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
[ https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah resolved TEZ-1924.
------------------------------
    Resolution: Fixed
    Fix Version/s: 0.5.4

Committed to master, branch 0.5 and branch 0.6. Thanks for your contribution [~ivanmi]

Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
-----------------------------------------------------------------------------------------

                 Key: TEZ-1924
                 URL: https://issues.apache.org/jira/browse/TEZ-1924
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.5.2
            Reporter: Ivan Mitic
            Assignee: Ivan Mitic
             Fix For: 0.5.4
         Attachments: TEZ-1924.2.patch, TEZ-20.patch

Issue originally reported by [~Karam Singh]. All OrderedWordCount, WordCount and Tez faultTolerance system tests failed due to java.net.UnknownHostException. Interestingly, other tez examples such as mrrsleep, randomwriter, randomtextwriter, sort, join_inner, join_outer, terasort and groupbyorderbymrrtest ran fine. One such example follows:

{code}
RUNNING: /usr/lib/hadoop/bin/hadoop jar /usr/lib/tez/tez-mapreduce-examples-0.4.0.2.1.7.0-784.jar orderedwordcount -DUSE_TEZ_SESSION=true -Dmapreduce.map.memory.mb=2048 -Dtez.am.shuffle-vertex-manager.max-src-fraction=0 -Dmapreduce.reduce.memory.mb=2048 -Dmapreduce.framework.name=yarn-tez -Dtez.am.container.reuse.enabled=false -Dtez.am.log.level=DEBUG -Dmapreduce.map.java.opts=-Xmx1024m -Dtez.am.shuffle-vertex-manager.min-src-fraction=0 -Dmapreduce.job.reduce.slowstart.completedmaps=0.01 -Dmapreduce.reduce.java.opts=-Xmx1024m -Dtez.am.container.session.delay-allocation-millis=12 /user/hrt_qa/Tez_CR_1/TestContainerReuse1 /user/hrt_qa/Tez_CROutput_1 /user/hrt_qa/Tez_CR_2/TestContainerReuse2 /user/hrt_qa/Tez_CROutput_2 -generateSplitsInClient true
14/12/19 09:20:05 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/
14/12/19 09:20:05 INFO client.RMProxy: Connecting to ResourceManager at headnode0.humb-tez1-ssh.d5.internal.cloudapp.net/10.0.0.87:8050
14/12/19 09:20:05 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
14/12/19 09:20:06 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
14/12/19 09:20:06 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 60 second(s).
14/12/19 09:20:06 INFO impl.MetricsSystemImpl: azure-file-system metrics system started
14/12/19 09:20:07 INFO client.TezClientUtils: Permissions on staging directory wasb://humb-t...@humboldttesting.blob.core.windows.net/user/hrt_qa/.staging/application_1418977790315_0016 are incorrect: rwxr-xr-x. Fixing permissions to correct value rwx--
14/12/19 09:20:07 INFO examples.OrderedWordCount: Creating Tez Session
14/12/19 09:20:07 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/
14/12/19 09:20:07 INFO client.RMProxy: Connecting to ResourceManager at headnode0.humb-tez1-ssh.d5.internal.cloudapp.net/10.0.0.87:8050
14/12/19 09:20:07 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
14/12/19 09:20:09 INFO impl.YarnClientImpl: Submitted application application_1418977790315_0016
14/12/19 09:20:09 INFO examples.OrderedWordCount: Created Tez Session
14/12/19 09:20:09 INFO examples.OrderedWordCount: Running OrderedWordCount DAG, dagIndex=1, inputPath=/user/hrt_qa/Tez_CR_1/TestContainerReuse1, outputPath=/user/hrt_qa/Tez_CROutput_1
14/12/19 09:20:09 INFO hadoop.MRHelpers: Generating new input splits, splitsDir=wasb://humb-t...@humboldttesting.blob.core.windows.net/user/hrt_qa/.staging/application_1418977790315_0016
14/12/19 09:20:09 INFO input.FileInputFormat: Total input paths to process : 20
14/12/19 09:20:09 INFO examples.OrderedWordCount: Waiting for TezSession to get into ready state
14/12/19 09:20:14 INFO client.TezSession: Failed to retrieve AM Status via proxy
org.apache.tez.dag.api.TezException: com.google.protobuf.ServiceException: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: workernode1:59575; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost
	at org.apache.tez.client.TezSession.getSessionStatus(TezSession.java:351)
	at org.apache.tez.mapreduce.examples.OrderedWordCount.waitForTezSessionReady(OrderedWordCount.java:538)
	at org.apache.tez.mapreduce.examples.OrderedWordCount.main(OrderedWordCount.java:461)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at
[jira] [Commented] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
[ https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268280#comment-14268280 ]

Hitesh Shah commented on TEZ-1924:
----------------------------------
Thanks for filing the issue [~ivanmi] and also for providing a patch. Some general comments:
- It is usually better if the patch file is named the same as the jira (with a version number for multiple iterations on the patch).
- With respect to using the NM hostname, would it be better to extract the FQDN from the server object itself if possible?

Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
-----------------------------------------------------------------------------------------

                 Key: TEZ-1924
                 URL: https://issues.apache.org/jira/browse/TEZ-1924
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.5.2
            Reporter: Ivan Mitic
         Attachments: TEZ-20.patch
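The hostname question in TEZ-1924 comes down to how the AM derives the name it registers with. A minimal sketch of obtaining a canonical (fully-qualified) name via the JDK follows; this is an illustration of the general technique and assumes resolver lookups work in the target environment — it is not the actual Tez patch.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// getHostName() may return a short (non-qualified) name, while
// getCanonicalHostName() asks the resolver for the fully-qualified name.
// Registering the canonical name helps clients that cannot resolve the
// short form, avoiding UnknownHostException as seen in the stack trace.
public class FqdnSketch {
    public static String canonicalHost() throws UnknownHostException {
        return InetAddress.getLocalHost().getCanonicalHostName();
    }

    public static void main(String[] args) throws UnknownHostException {
        System.out.println("registering as: " + canonicalHost());
    }
}
```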
[jira] [Commented] (TEZ-1882) Tez UI build does not work on Windows
[ https://issues.apache.org/jira/browse/TEZ-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268283#comment-14268283 ]

Bikas Saha commented on TEZ-1882:
---------------------------------
Branch-0.6
commit 30d485dd41e83cb70b27195e4fa986b8bb586933
Author: Bikas Saha bi...@apache.org
Date:   Wed Jan 7 13:25:32 2015 -0800

    TEZ-1882. Tez UI build does not work on Windows (Prakash Ramachandran via bikas)
    (cherry picked from commit b9c834a283c0711655f84f51570e70ac38753426)

Tez UI build does not work on Windows
-------------------------------------

                 Key: TEZ-1882
                 URL: https://issues.apache.org/jira/browse/TEZ-1882
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Bikas Saha
            Assignee: Prakash Ramachandran
            Priority: Blocker
             Fix For: 0.6.0
         Attachments: TEZ-1882.1.patch, TEZ-1882.2.patch

It fails during Bower install because it cannot launch node/node. After working around that, the bower script itself fails because it's a bash script and will not run on Windows. Specifically, the following command fails in node_modules\.bin\bower: basedir=`dirname $0`
[jira] [Updated] (TEZ-305) Convert umbilical objects to PB serialization
[ https://issues.apache.org/jira/browse/TEZ-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-305:
-------------------------------
    Target Version/s: 0.7.0

Convert umbilical objects to PB serialization
---------------------------------------------

                 Key: TEZ-305
                 URL: https://issues.apache.org/jira/browse/TEZ-305
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Bikas Saha
              Labels: TEZ-0.2.0, engine
[jira] [Resolved] (TEZ-516) Add a join example using the Broadcast edge
[ https://issues.apache.org/jira/browse/TEZ-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth resolved TEZ-516.
--------------------------------
    Resolution: Done

Join/Intersect example was added elsewhere.

Add a join example using the Broadcast edge
-------------------------------------------

                 Key: TEZ-516
                 URL: https://issues.apache.org/jira/browse/TEZ-516
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Siddharth Seth
            Assignee: Siddharth Seth
[jira] [Updated] (TEZ-485) Get rid of TezTaskStatus
[ https://issues.apache.org/jira/browse/TEZ-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-485:
-------------------------------
    Priority: Minor  (was: Major)

Get rid of TezTaskStatus
------------------------

                 Key: TEZ-485
                 URL: https://issues.apache.org/jira/browse/TEZ-485
             Project: Apache Tez
          Issue Type: Sub-task
            Reporter: Siddharth Seth
            Priority: Minor

TezTaskStatus is used by the MR Reporter only. We should be able to get rid of this interface.
[jira] [Updated] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
[ https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah updated TEZ-1924:
-----------------------------
    Target Version/s: 0.5.4

Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
-----------------------------------------------------------------------------------------

                 Key: TEZ-1924
                 URL: https://issues.apache.org/jira/browse/TEZ-1924
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.5.2
            Reporter: Ivan Mitic
            Assignee: Ivan Mitic
         Attachments: TEZ-20.patch
[jira] [Updated] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
[ https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Mitic updated TEZ-1924:
----------------------------
    Attachment: TEZ-1924.2.patch

Thanks Hitesh for the quick review! Attaching the updated patch addressing your comments.

Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
-----------------------------------------------------------------------------------------

                 Key: TEZ-1924
                 URL: https://issues.apache.org/jira/browse/TEZ-1924
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.5.2
            Reporter: Ivan Mitic
            Assignee: Ivan Mitic
         Attachments: TEZ-1924.2.patch, TEZ-20.patch
[jira] [Commented] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
[ https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268328#comment-14268328 ]

Hitesh Shah commented on TEZ-1924:
----------------------------------
+1. Looks good. Committing shortly.

Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
-----------------------------------------------------------------------------------------

                 Key: TEZ-1924
                 URL: https://issues.apache.org/jira/browse/TEZ-1924
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.5.2
            Reporter: Ivan Mitic
            Assignee: Ivan Mitic
         Attachments: TEZ-1924.2.patch, TEZ-20.patch
[jira] [Commented] (TEZ-519) Misleading stack trace when using sessions
[ https://issues.apache.org/jira/browse/TEZ-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268365#comment-14268365 ] Siddharth Seth commented on TEZ-519: Is this still a problem ? Misleading stack trace when using sessions -- Key: TEZ-519 URL: https://issues.apache.org/jira/browse/TEZ-519 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha 13/09/27 12:43:00 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:54311 13/09/27 12:43:01 INFO examples.OrderedWordCount: Creating Tez Session 13/09/27 12:43:01 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:54311 13/09/27 12:43:03 INFO impl.YarnClientImpl: Submitted application application_1380218649569_0047 to ResourceManager at /0.0.0.0:54311 13/09/27 12:43:03 INFO examples.OrderedWordCount: Created Tez Session 13/09/27 12:43:03 INFO client.TezSession: Shutting down Tez Session, sessionName=OrderedWordCountSession, applicationId=application_1380218649569_0047 13/09/27 12:43:03 INFO client.TezSession: Could not connect to AM, killing session via YARN, sessionName=OrderedWordCountSession, applicationId=application_1380218649569_0047 13/09/27 12:43:03 INFO impl.YarnClientImpl: Killing application application_1380218649569_0047 org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /out already exists at org.apache.tez.mapreduce.examples.OrderedWordCount.main(OrderedWordCount.java:357) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at org.apache.tez.mapreduce.examples.ExampleDriver.main(ExampleDriver.java:79) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
[ https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268394#comment-14268394 ] Ivan Mitic commented on TEZ-1924: - Thanks for the quick turnaround [~Hitesh]! Tez AM does not register with AM with full FQDN causing jobs to fail in some environments - Key: TEZ-1924 URL: https://issues.apache.org/jira/browse/TEZ-1924 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Ivan Mitic Assignee: Ivan Mitic Fix For: 0.5.4 Attachments: TEZ-1924.2.patch, TEZ-20.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1903) Fix findbugs warnings in tez-runtime-internals
[ https://issues.apache.org/jira/browse/TEZ-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1903: Attachment: TEZ-1903.1.txt [~hitesh] - please review. Fix findbugs warnings in tez-runtime-internals -- Key: TEZ-1903 URL: https://issues.apache.org/jira/browse/TEZ-1903 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Siddharth Seth Attachments: TEZ-1903.1.txt https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/newPatchFindbugsWarningstez-runtime-internals.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1903) Fix findbugs warnings in tez-runtime-internals
[ https://issues.apache.org/jira/browse/TEZ-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1903: Target Version/s: 0.7.0 Fix findbugs warnings in tez-runtime-internals -- Key: TEZ-1903 URL: https://issues.apache.org/jira/browse/TEZ-1903 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Siddharth Seth Attachments: TEZ-1903.1.txt https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/newPatchFindbugsWarningstez-runtime-internals.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267859#comment-14267859 ] Hitesh Shah commented on TEZ-1923: -- [~rajesh.balamohan] This seems like a critical issue. Any reason why it is not targeted to 0.5.4? FetcherOrderedGrouped gets into infinite loop due to memory pressure Key: TEZ-1923 URL: https://issues.apache.org/jira/browse/TEZ-1923 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-1923.1.patch - Ran a comparatively large job (temp table creation) at 10 TB scale. - Turned on intermediate mem-to-mem (tez.runtime.shuffle.memory-to-memory.enable=true and tez.runtime.shuffle.memory-to-memory.segments=4) - Some reducers get lots of data and quickly gets into infinite loop {code} 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 
2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... {code} Additional debug/patch statements revealed that InMemoryMerge is not invoked appropriately and does not release memory back for fetchers to proceed. e.g. debug/patch messages are given below {code} syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, mergeThreshold=708669632 === InMemoryMerge would be started in this case as commitMemory >= mergeThreshold syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory.
syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO [fetcher [Map_1] #1] orderedgrouped.MergeManager: Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory. {code} In MergeManager, in-memory merging is invoked under the following condition {code} if (!inMemoryMerger.isInProgress() && commitMemory >= mergeThreshold) {code}
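The stall described above can be sketched as follows. This is an illustrative model of the reported condition, not Tez's actual MergeManager code; the method and class names here are made up for the example:

```java
// Illustrative sketch (not the real MergeManager): the merge trigger only
// looks at commitMemory, so usedMemory can exceed memoryLimit while the
// in-memory merge never starts and fetchers spin on Status.WAIT.
public class MergeTriggerSketch {
    static boolean shouldStartInMemoryMerge(boolean mergeInProgress,
                                            long commitMemory,
                                            long mergeThreshold) {
        // The condition from the description:
        // !inMemoryMerger.isInProgress() && commitMemory >= mergeThreshold
        return !mergeInProgress && commitMemory >= mergeThreshold;
    }

    public static void main(String[] args) {
        // First log line: commitMemory (883028388) >= mergeThreshold, merge starts.
        System.out.println(shouldStartInMemoryMerge(false, 883028388L, 708669632L)); // true

        // Second log line: usedMemory (1273349784) exceeds memoryLimit
        // (1073741824), but commitMemory (347296632) < mergeThreshold, so no
        // merge is triggered, no memory is released, and fetchers wait forever.
        System.out.println(shouldStartInMemoryMerge(false, 347296632L, 708669632L)); // false
    }
}
```

The sketch makes the gap visible: nothing in the trigger inspects `usedMemory`, which is exactly why the fetchers loop on Status.WAIT in the logs above.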
[jira] [Commented] (TEZ-1922) Fix comments: add UNSORTED_OUTPUT to TEZ_TASK_SCALE_MEMORY_WEIGHTED_RATIOS
[ https://issues.apache.org/jira/browse/TEZ-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268058#comment-14268058 ] Siddharth Seth commented on TEZ-1922: - +1. Fix comments: add UNSORTED_OUTPUT to TEZ_TASK_SCALE_MEMORY_WEIGHTED_RATIOS -- Key: TEZ-1922 URL: https://issues.apache.org/jira/browse/TEZ-1922 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Priority: Minor Attachments: TEZ-1922.1.patch Example provided for TEZ_TASK_SCALE_MEMORY_WEIGHTED_RATIOS in TezConfiguration is missing UNSORTED_OUTPUT. PARTITIONED_UNSORTED_OUTPUT:0,UNSORTED_INPUT:1,SORTED_OUTPUT:2,SORTED_MERGED_INPUT:3,PROCESSOR:1,OTHER:1 If user tries to set the value by referring to this, it would end up throwing exceptions in org.apache.tez.runtime.library.resources.WeightedScalingMemoryDistributor.populateTypeScaleMap() -- This message was sent by Atlassian JIRA (v6.3.4#6332)
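For illustration, here is a minimal sketch of the kind of name:ratio parsing that a method like populateTypeScaleMap() performs. The enum, class, and parsing details below are assumptions for the example, not Tez's actual code; the point is only that an unrecognized or missing type name in the comma-separated string surfaces as an exception at parse time:

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of parsing a "TYPE:ratio,TYPE:ratio,..." string such as
// TEZ_TASK_SCALE_MEMORY_WEIGHTED_RATIOS. Names here are illustrative.
public class RatioParseSketch {
    enum RequestType { PARTITIONED_UNSORTED_OUTPUT, UNSORTED_OUTPUT, UNSORTED_INPUT,
                       SORTED_OUTPUT, SORTED_MERGED_INPUT, PROCESSOR, OTHER }

    static Map<RequestType, Integer> parse(String ratios) {
        Map<RequestType, Integer> map = new EnumMap<>(RequestType.class);
        for (String pair : ratios.split(",")) {
            String[] kv = pair.split(":");
            // Enum.valueOf throws IllegalArgumentException for a name that is
            // not a declared constant -- the failure mode a user hits when a
            // documented example string does not match the expected types.
            map.put(RequestType.valueOf(kv[0]), Integer.parseInt(kv[1]));
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(parse(
            "PARTITIONED_UNSORTED_OUTPUT:0,UNSORTED_OUTPUT:1,UNSORTED_INPUT:1,"
            + "SORTED_OUTPUT:2,SORTED_MERGED_INPUT:3,PROCESSOR:1,OTHER:1").size()); // prints 7
    }
}
```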
[jira] [Assigned] (TEZ-1903) Fix findbugs warnings in tez-runtime-internals
[ https://issues.apache.org/jira/browse/TEZ-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth reassigned TEZ-1903: --- Assignee: Siddharth Seth Fix findbugs warnings in tez-runtime-internals -- Key: TEZ-1903 URL: https://issues.apache.org/jira/browse/TEZ-1903 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Siddharth Seth https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/newPatchFindbugsWarningstez-runtime-internals.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1912) Merge exceptions are thrown when enabling tez.runtime.shuffle.memory-to-memory.enable & tez.runtime.shuffle.memory-to-memory.segments
[ https://issues.apache.org/jira/browse/TEZ-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268082#comment-14268082 ] Siddharth Seth commented on TEZ-1912: - +1. Looks good. Merge exceptions are thrown when enabling tez.runtime.shuffle.memory-to-memory.enable tez.runtime.shuffle.memory-to-memory.segments -- Key: TEZ-1912 URL: https://issues.apache.org/jira/browse/TEZ-1912 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Attachments: TEZ-1912.1.patch Merge exceptions are thrown when running a hive query on tez with the following setting. It works fine without mem-to-mem merge setting. {code} 2015-01-04 20:04:01,371 ERROR [ShuffleAndMergeRunner [Map_1]] orderedgrouped.Shuffle: ShuffleRunner failed with error org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: Error while doing final merge at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.call(Shuffle.java:364) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.call(Shuffle.java:327) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Rec# 22630125: Negative value-length: -1 at org.apache.tez.runtime.library.common.sort.impl.IFile$Reader.positionToNextRecord(IFile.java:720) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.InMemoryReader.readRawKey(InMemoryReader.java:104) at org.apache.tez.runtime.library.common.sort.impl.TezMerger$Segment.readRawKey(TezMerger.java:329) at org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.adjustPriorityQueue(TezMerger.java:500) at org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.next(TezMerger.java:545) at 
org.apache.tez.runtime.library.common.sort.impl.TezMerger.writeFile(TezMerger.java:204) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.finalMerge(MergeManager.java:862) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.close(MergeManager.java:473) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.call(Shuffle.java:362) ... 5 more {code} {code} $HIVE_HOME/bin/hive -hiveconf tez.runtime.io.sort.factor=200 --hiveconf tez.shuffle-vertex-manager.min-src-fraction=1.0 --hiveconf tez.shuffle-vertex-manager.max-src-fraction=1.0 --hiveconf hive.tez.auto.reducer.parallelism=false --hiveconf tez.am.heartbeat.interval-ms.max=20 --hiveconf tez.runtime.io.sort.mb=1200 --hiveconf tez.runtime.sort.threads=2 --hiveconf tez.history.logging.service.class=org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService --hiveconf hive.tez.container.size=4096 --hiveconf tez.runtime.shuffle.memory-to-memory.enable=true --hiveconf tez.runtime.shuffle.memory-to-memory.segments=4 --10 TB dataset use tpcds4_bin_partitioned_orc_1; drop table testData; create table testData as select ss_sold_date_sk,ss_sold_time_sk,ss_item_sk,ss_customer_sk,ss_quantity,ss_sold_date from store_sales distribute by ss_sold_date; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
Ivan Mitic created TEZ-1924: --- Summary: Tez AM does not register with AM with full FQDN causing jobs to fail in some environments Key: TEZ-1924 URL: https://issues.apache.org/jira/browse/TEZ-1924 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Ivan Mitic Issue originally reported by [~Karam Singh]. All OrderWordCount, WordCount and Tez tests faultTolerance system tests failed due to java.net.UnknownHostException Interesting other tez examples such as mrrsleep, randomwriter, randomtextwriter, sort, join_inner, join_outer, terasort, groupbyorderbymrrtest ran fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268089#comment-14268089 ] Siddharth Seth commented on TEZ-1923: - This seems to affect the MemoryToMemory merger only. That should not be enabled in 0.5 since it hasn't been tested much. FetcherOrderedGrouped gets into infinite loop due to memory pressure Key: TEZ-1923 URL: https://issues.apache.org/jira/browse/TEZ-1923 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-1923.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
[ https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268091#comment-14268091 ]

Ivan Mitic commented on TEZ-1924:
---------------------------------

I think I have the root cause at this point. The Tez client is trying to talk to its AM, and given that the AM is registered with a short host name (workernode0), the Tez client is failing to talk to it. If the Tez AM registered with the RM using an FQDN, we would not have this problem.

Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
-----------------------------------------------------------------------------------------

                Key: TEZ-1924
                URL: https://issues.apache.org/jira/browse/TEZ-1924
            Project: Apache Tez
         Issue Type: Bug
   Affects Versions: 0.5.2
           Reporter: Ivan Mitic

Issue originally reported by [~Karam Singh].

All OrderedWordCount, WordCount and Tez faultTolerance system tests failed due to java.net.UnknownHostException. Interestingly, other tez examples such as mrrsleep, randomwriter, randomtextwriter, sort, join_inner, join_outer, terasort and groupbyorderbymrrtest ran fine. One such failing run is the following:

{code}
RUNNING: /usr/lib/hadoop/bin/hadoop jar /usr/lib/tez/tez-mapreduce-examples-0.4.0.2.1.7.0-784.jar orderedwordcount -DUSE_TEZ_SESSION=true -Dmapreduce.map.memory.mb=2048 -Dtez.am.shuffle-vertex-manager.max-src-fraction=0 -Dmapreduce.reduce.memory.mb=2048 -Dmapreduce.framework.name=yarn-tez -Dtez.am.container.reuse.enabled=false -Dtez.am.log.level=DEBUG -Dmapreduce.map.java.opts=-Xmx1024m -Dtez.am.shuffle-vertex-manager.min-src-fraction=0 -Dmapreduce.job.reduce.slowstart.completedmaps=0.01 -Dmapreduce.reduce.java.opts=-Xmx1024m -Dtez.am.container.session.delay-allocation-millis=12 /user/hrt_qa/Tez_CR_1/TestContainerReuse1 /user/hrt_qa/Tez_CROutput_1 /user/hrt_qa/Tez_CR_2/TestContainerReuse2 /user/hrt_qa/Tez_CROutput_2 -generateSplitsInClient true
14/12/19 09:20:05 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/
14/12/19 09:20:05 INFO client.RMProxy: Connecting to ResourceManager at headnode0.humb-tez1-ssh.d5.internal.cloudapp.net/10.0.0.87:8050
14/12/19 09:20:05 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
14/12/19 09:20:06 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
14/12/19 09:20:06 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 60 second(s).
14/12/19 09:20:06 INFO impl.MetricsSystemImpl: azure-file-system metrics system started
14/12/19 09:20:07 INFO client.TezClientUtils: Permissions on staging directory wasb://humb-t...@humboldttesting.blob.core.windows.net/user/hrt_qa/.staging/application_1418977790315_0016 are incorrect: rwxr-xr-x. Fixing permissions to correct value rwx--
14/12/19 09:20:07 INFO examples.OrderedWordCount: Creating Tez Session
14/12/19 09:20:07 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/
14/12/19 09:20:07 INFO client.RMProxy: Connecting to ResourceManager at headnode0.humb-tez1-ssh.d5.internal.cloudapp.net/10.0.0.87:8050
14/12/19 09:20:07 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
14/12/19 09:20:09 INFO impl.YarnClientImpl: Submitted application application_1418977790315_0016
14/12/19 09:20:09 INFO examples.OrderedWordCount: Created Tez Session
14/12/19 09:20:09 INFO examples.OrderedWordCount: Running OrderedWordCount DAG, dagIndex=1, inputPath=/user/hrt_qa/Tez_CR_1/TestContainerReuse1, outputPath=/user/hrt_qa/Tez_CROutput_1
14/12/19 09:20:09 INFO hadoop.MRHelpers: Generating new input splits, splitsDir=wasb://humb-t...@humboldttesting.blob.core.windows.net/user/hrt_qa/.staging/application_1418977790315_0016
14/12/19 09:20:09 INFO input.FileInputFormat: Total input paths to process : 20
14/12/19 09:20:09 INFO examples.OrderedWordCount: Waiting for TezSession to get into ready state
14/12/19 09:20:14 INFO client.TezSession: Failed to retrieve AM Status via proxy
org.apache.tez.dag.api.TezException: com.google.protobuf.ServiceException: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: workernode1:59575; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost
	at org.apache.tez.client.TezSession.getSessionStatus(TezSession.java:351)
	at org.apache.tez.mapreduce.examples.OrderedWordCount.waitForTezSessionReady(OrderedWordCount.java:538)
	at org.apache.tez.mapreduce.examples.OrderedWordCount.main(OrderedWordCount.java:461)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at
{code}
[jira] [Updated] (TEZ-1924) Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
[ https://issues.apache.org/jira/browse/TEZ-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Mitic updated TEZ-1924:
----------------------------
    Attachment: TEZ-20.patch

Attaching the patch. The patch is modeled on what the MRv2 AM does: basically, the Tez AM should use the NodeManager-supplied hostname when it registers with the RM.

Tez AM does not register with AM with full FQDN causing jobs to fail in some environments
-----------------------------------------------------------------------------------------

                Key: TEZ-1924
                URL: https://issues.apache.org/jira/browse/TEZ-1924
            Project: Apache Tez
         Issue Type: Bug
   Affects Versions: 0.5.2
           Reporter: Ivan Mitic
        Attachments: TEZ-20.patch
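The approach described in the update comment, registering with the NodeManager-supplied hostname rather than whatever the local JVM resolves, can be sketched as follows. YARN exposes the NodeManager's hostname to containers through the NM_HOST environment variable; the helper name and fallback structure here are illustrative, not the actual patch:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class AmHostname {

    /**
     * Picks the hostname the AM should advertise when registering with the RM.
     * Prefers the NodeManager-supplied name (YARN sets NM_HOST in the container
     * environment), and only falls back to local DNS resolution if it is absent.
     */
    static String registrationHostname(String nmHost) throws UnknownHostException {
        if (nmHost != null && !nmHost.isEmpty()) {
            return nmHost;  // NM-supplied name, as the MRv2 AM uses
        }
        // Fallback: ask DNS for this host's canonical (fully qualified) name.
        return InetAddress.getLocalHost().getCanonicalHostName();
    }

    public static void main(String[] args) throws UnknownHostException {
        System.out.println(registrationHostname(System.getenv("NM_HOST")));
    }
}
```

Using the NM-supplied name sidesteps the resolver entirely: the NM already knows the address it is reachable at from the rest of the cluster, so the AM never has to guess whether its locally resolved name is qualified.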