[jira] [Updated] (MAPREDUCE-6441) Improve temporary directory name generation in LocalDistributedCacheManager for concurrent processes
[ https://issues.apache.org/jira/browse/MAPREDUCE-6441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Chiang updated MAPREDUCE-6441:
----------------------------------
    Attachment: MAPREDUCE-6441.006.patch

Changed to the Executor model for the unit test. This is my third attempt at a unit test, but I haven't managed to get it to fail with the old code.

> Improve temporary directory name generation in LocalDistributedCacheManager for concurrent processes
> ----------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6441
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6441
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: William Watson
>            Assignee: Ray Chiang
>         Attachments: HADOOP-10924.02.patch, HADOOP-10924.03.jobid-plus-uuid.patch, MAPREDUCE-6441.004.patch, MAPREDUCE-6441.005.patch, MAPREDUCE-6441.006.patch
>
> Kicking off many sqoop processes in different threads results in the following when two are kicked off in the same second:
> {code}
> 2014-08-01 13:47:24 -0400: INFO - 14/08/01 13:47:22 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: java.util.concurrent.ExecutionException: java.io.IOException: Rename cannot overwrite non empty destination directory /tmp/hadoop-hadoop/mapred/local/1406915233073
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:149)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:163)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:731)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
> 2014-08-01 13:47:24 -0400: INFO -	at java.security.AccessController.doPrivileged(Native Method)
> 2014-08-01 13:47:24 -0400: INFO -	at javax.security.auth.Subject.doAs(Subject.java:415)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:186)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:159)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:239)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:645)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:415)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:502)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
> 2014-08-01 13:47:24 -0400: INFO -	at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
> {code}
> The issue is the following lines of code in the org.apache.hadoop.mapred.LocalDistributedCacheManager class:
> {code}
> // Generating unique numbers for FSDownload.
> AtomicLong uniqueNumberGenerator = new AtomicLong(System.currentTimeMillis());
> {code}
> and
> {code}
> Long.toString(uniqueNumberGenerator.incrementAndGet())),
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
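The collision described above can be reproduced in miniature. The sketch below is illustrative only: {{timestampName}} mirrors the seeded-counter scheme quoted in the issue, while {{uuidName}} is a hypothetical fix direction (the attached "jobid-plus-uuid" patch name suggests a similar idea, but the patch contents are not shown here):

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

public class UniqueNameDemo {
    // The scheme quoted above: a counter seeded from the wall clock. Two JVMs
    // starting in the same millisecond seed identical counters and therefore
    // generate the same "unique" directory names.
    static String timestampName(AtomicLong gen) {
        return Long.toString(gen.incrementAndGet());
    }

    // Hypothetical alternative: mix in a per-process random UUID so that
    // collisions across concurrently started processes become vanishingly
    // unlikely.
    static String uuidName(String processUuid, AtomicLong gen) {
        return processUuid + "_" + gen.incrementAndGet();
    }

    public static void main(String[] args) {
        long sameMillis = 1406915233073L; // both "processes" start in this millisecond
        AtomicLong proc1 = new AtomicLong(sameMillis);
        AtomicLong proc2 = new AtomicLong(sameMillis);
        // Same seed -> same first name -> the second process's rename hits a
        // non-empty destination directory and fails, as in the report above.
        System.out.println(timestampName(proc1).equals(timestampName(proc2))); // true

        String uuid1 = UUID.randomUUID().toString();
        String uuid2 = UUID.randomUUID().toString();
        System.out.println(uuidName(uuid1, new AtomicLong(sameMillis))
            .equals(uuidName(uuid2, new AtomicLong(sameMillis)))); // false
    }
}
```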
[jira] [Commented] (MAPREDUCE-6931) Remove TestDFSIO "Total Throughput" calculation
[ https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151228#comment-16151228 ]

Konstantin Shvachko commented on MAPREDUCE-6931:
------------------------------------------------
The change is very simple, even trivial, but it is important from an API viewpoint, as I explained above.

> Remove TestDFSIO "Total Throughput" calculation
> -----------------------------------------------
>
>                 Key: MAPREDUCE-6931
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6931
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: benchmarks, test
>    Affects Versions: 2.8.0
>            Reporter: Dennis Huo
>            Assignee: Dennis Huo
>            Priority: Critical
>             Fix For: 2.9.0, 3.0.0-beta1, 2.7.5, 2.8.3
>
>         Attachments: MAPREDUCE-6931-001.patch
>
> The new "Total Throughput" line added in https://issues.apache.org/jira/browse/HDFS-9153 is currently calculated as {{toMB(size) / ((float)execTime)}} and claims to be in units of "MB/s", but {{execTime}} is in milliseconds; thus, the reported number is 1/1000x the actual value:
> {code:java}
>     String resultLines[] = {
>         "----- TestDFSIO ----- : " + testType,
>         "            Date & time: " + new Date(System.currentTimeMillis()),
>         "        Number of files: " + tasks,
>         " Total MBytes processed: " + df.format(toMB(size)),
>         "      Throughput mb/sec: " + df.format(size * 1000.0 / (time * MEGA)),
>         "Total Throughput mb/sec: " + df.format(toMB(size) / ((float)execTime)),
>         " Average IO rate mb/sec: " + df.format(med),
>         "  IO rate std deviation: " + df.format(stdDev),
>         "     Test exec time sec: " + df.format((float)execTime / 1000),
>         "" };
> {code}
> The other calculated fields can also use toMB and a shared milliseconds-to-seconds conversion to make it easier to keep units consistent.
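The 1/1000x claim is easy to check numerically. The sketch below is not the TestDFSIO source; {{toMB}} and {{MEGA}} are reconstructed here from the snippet quoted in the issue:

```java
public class ThroughputUnits {
    // Reconstructed for illustration: MEGA is one binary megabyte, and toMB
    // converts a byte count to megabytes, matching how the snippet uses them.
    static final double MEGA = 1048576.0;
    static double toMB(long bytes) { return bytes / MEGA; }

    public static void main(String[] args) {
        long size = 1000L * 1048576L; // 1000 MB processed
        long execTime = 10_000L;      // 10 seconds of wall time, stored in milliseconds

        double buggy   = toMB(size) / (float) execTime;    // MB divided by milliseconds
        double correct = toMB(size) / (execTime / 1000.0); // MB divided by seconds

        System.out.println(buggy);   // 0.1   -- the value the report labels "mb/sec"
        System.out.println(correct); // 100.0 -- the actual throughput in MB/s
    }
}
```

Dividing megabytes by a millisecond count yields a number exactly 1000x too small, which is the bug being removed.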
[jira] [Updated] (MAPREDUCE-6931) Remove TestDFSIO "Total Throughput" calculation
[ https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated MAPREDUCE-6931:
-------------------------------------------
    Priority: Critical  (was: Trivial)
[jira] [Commented] (MAPREDUCE-6931) Remove TestDFSIO "Total Throughput" calculation
[ https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151042#comment-16151042 ]

Junping Du commented on MAPREDUCE-6931:
---------------------------------------
This JIRA is marked as trivial, but we are in the 2.8.2 RC stage. In my practice (different RMs may have different practices), commits below major priority should be skipped at this stage, balancing the importance of the fix against the risk of careless code changes or merges.
[jira] [Commented] (MAPREDUCE-6931) Remove TestDFSIO "Total Throughput" calculation
[ https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150945#comment-16150945 ]

Konstantin Shvachko commented on MAPREDUCE-6931:
------------------------------------------------
Hey [~djp], I got confused with the jira versions, as 2.8.3 was not available. Now it is, thanks.
But I hoped the confusing field, which this jira removes, would not sneak into any release at all, to avoid later questions about what it means and why it was removed. I would strongly recommend merging this into 2.8.2. The final decision, of course, is up to the release manager.
[jira] [Commented] (MAPREDUCE-5124) AM lacks flow control for task events
[ https://issues.apache.org/jira/browse/MAPREDUCE-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150458#comment-16150458 ]

Jason Lowe commented on MAPREDUCE-5124:
---------------------------------------
Having the AM send the heartbeat means the AM needs to be the client in the RPC connection, since only servers receive method calls. That creates two problems in practice. First is the discovery problem: how does the AM know the listening port for each task? Second is thread scaling, since the client RPC layer creates a thread for every connection. That means a thread per task, which is not going to work for large jobs.

bq. The actual code may use asynchronous calls not to create a thread for each task.

This is really the key and the only thing necessary to solve the problem. The root cause of this problem is that the AM quickly sends a response to each heartbeat without actually processing it. That creates a flow control issue, since the rate of processing heartbeats is disconnected from the incoming rate: we can receive heartbeats far faster than we can process them, causing an unbounded pileup of backlogged events. The AM behaves this way because it needs to free up the IPC Server handler thread so it can handle other task requests: other heartbeats, new task attempt connections, task requests, etc. There are lots of other places in YARN and MAPREDUCE where a similar tactic is taken, with the same resulting flow control issue.

The real fix is to not send a heartbeat reply until the heartbeat is completely processed. Then there will only ever be as many outstanding heartbeats and metrics status updates as there are task attempts running at the time, rather than an unbounded number driven by the rate difference between how fast the tasks post heartbeats and how fast the AsyncDispatcher can process them.

If we can synchronously process the heartbeat in a way that doesn't tie up an IPC Server handler thread for the duration of the call, then we're all set. Task heartbeats naturally slow down as the AM's ability to process them degrades. There is no need for the AM to explicitly reject requests or to do any 3-second sleeping itself. We just need to leverage the functionality added in HADOOP-11552 so we aren't compelled to reply to heartbeats before they are fully processed in order to free up an IPC Server thread.

> AM lacks flow control for task events
> -------------------------------------
>
>                 Key: MAPREDUCE-5124
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5124
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Haibo Chen
>         Attachments: MAPREDUCE-5124-proto.2.txt, MAPREDUCE-5124-prototype.txt
>
> The AM does not have any flow control to limit the incoming rate of events from tasks. If the AM is unable to keep pace with the rate of incoming events for a sufficient period of time then it will eventually exhaust the heap and crash. MAPREDUCE-5043 addressed a major bottleneck for event processing, but the AM could still get behind if it's starved for CPU and/or handling a very large job with tens of thousands of active tasks.
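The flow-control argument above can be illustrated with plain queues. This is a simplified model only, not the AM's actual dispatcher or IPC code; all names here are made up:

```java
import java.util.LinkedList;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class FlowControlSketch {
    public static void main(String[] args) {
        // What acking heartbeats before processing amounts to: the real work
        // lands on an unbounded queue, so the backlog grows without limit when
        // tasks post events faster than the dispatcher drains them.
        Queue<String> unbounded = new LinkedList<>();
        for (int i = 0; i < 100_000; i++) {
            unbounded.add("heartbeat-" + i); // always succeeds; heap is the only limit
        }
        System.out.println(unbounded.size()); // 100000

        // The fix direction: don't complete a heartbeat until it is processed.
        // A bounded queue models that -- once full, admission fails and the
        // sender has to back off, which is the backpressure that naturally
        // slows tasks down as the AM falls behind.
        BlockingQueue<String> bounded = new ArrayBlockingQueue<>(2);
        System.out.println(bounded.offer("hb-1")); // true
        System.out.println(bounded.offer("hb-2")); // true
        System.out.println(bounded.offer("hb-3")); // false: sender must wait
    }
}
```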
[jira] [Commented] (MAPREDUCE-6950) Error Launching job : java.io.IOException: Unknown Job job_xxx_xxx
[ https://issues.apache.org/jira/browse/MAPREDUCE-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150422#comment-16150422 ]

Jason Lowe commented on MAPREDUCE-6950:
---------------------------------------
Retries of the write are automatically performed by the HDFS client layer before it ultimately gives up and bubbles the error up to the application layer. Looking back in the AM logs you should be able to find indications of this. Since the HDFS layer is already retrying, the utility of retrying again at the application layer is questionable.

> Error Launching job : java.io.IOException: Unknown Job job_xxx_xxx
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6950
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6950
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am
>    Affects Versions: 2.7.1
>            Reporter: zhengchenyu
>             Fix For: 2.7.5
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Some jobs report an error like this:
> {code}
> hadoop.mapreduce.Job.monitorAndPrintJob(Job.java 1367) [main] : map 100% reduce 100%
> [2017-08-31T20:27:12.591+08:00] [INFO] hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java 277) [main] : Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
> [2017-08-31T20:27:12.821+08:00] [INFO] hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java 277) [main] : Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
> [2017-08-31T20:27:13.039+08:00] [INFO] hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java 277) [main] : Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
> [2017-08-31T20:27:13.256+08:00] [ERROR] hadoop.streaming.StreamJob.submitAndMonitorJob(StreamJob.java 1034) [main] : Error Launching job : java.io.IOException: Unknown Job job_xxx_xxx
> {code}
> In the AM container log below we can see the error happened in the write pipeline, possibly a datanode error. I also found another cause that closed the JobHistoryEventHandler. Either way, the MR AM could not write the job information to the JobHistory server, so the client could not tell whether the application had finished.
> {code}
> 2017-08-31 20:27:10,813 INFO [Thread-1968] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, writing event MAP_ATTEMPT_STARTED
> 2017-08-31 20:27:10,814 ERROR [Thread-1968] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing History Event: org.apache.hadoop.mapreduce.jobhistory.TaskAttemptStartedEvent@2055ea0a
> java.io.EOFException: Premature EOF: no length prefix available
> 	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2292)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1317)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1237)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
> 2017-08-31 20:27:10,814 INFO [Thread-1968] org.apache.hadoop.service.AbstractService: Service JobHistoryEventHandler failed in state STOPPED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.EOFException: Premature EOF: no length prefix available
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.EOFException: Premature EOF: no length prefix available
> 	at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:580)
> 	at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:374)
> 	at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> 	at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
> 	at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
> 	at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
> 	at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
> {code}
> This problem is serious, especially for Hive: jobs must be rerun pointlessly! So I think we need to retry the operation of writing the history event.
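If application-layer retries were added despite the reservation in the comment above, the shape would be something like the hypothetical helper below. This is illustration only, not the JobHistoryEventHandler API, and note that the HDFS client already retries internally before surfacing the error:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetrySketch {
    // Hypothetical bounded-retry helper: retry the operation on IOException,
    // up to maxAttempts times, then rethrow the last failure.
    static <T> T withRetries(Callable<T> op, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e; // transient write failure; try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated history-event write that fails twice before succeeding.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Premature EOF: no length prefix available");
            }
            return "event written";
        }, 5);
        System.out.println(result + " on attempt " + calls[0]); // event written on attempt 3
    }
}
```

Stacking a second retry layer like this multiplies worst-case latency against the HDFS client's own retries, which is one reason its utility is questioned above.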