[jira] [Updated] (MAPREDUCE-6441) Improve temporary directory name generation in LocalDistributedCacheManager for concurrent processes

2017-09-01 Thread Ray Chiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Chiang updated MAPREDUCE-6441:
--
Attachment: MAPREDUCE-6441.006.patch

Changed to the Executor model for the unit test.  This is my third attempt at a 
unit test, but I haven't managed to get it to fail with the old code.
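
For reference, a minimal sketch of the Executor-model test shape (class and 
method names are hypothetical; the actual patch may differ). It races several 
threads through timestamp-seeded name generation, the way concurrent local-mode 
jobs each seed their own generator today, and fails on any collision:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TestTempDirNameCollisions {
  public void testNoCollisions() throws Exception {
    final int THREADS = 8;
    ExecutorService pool = Executors.newFixedThreadPool(THREADS);
    CountDownLatch start = new CountDownLatch(1);
    Set<Long> names = ConcurrentHashMap.newKeySet();
    List<Future<Object>> results = new ArrayList<>();
    for (int i = 0; i < THREADS; i++) {
      results.add(pool.submit(() -> {
        start.await();                            // maximize thread overlap
        long name = System.currentTimeMillis();   // per-"process" seed
        if (!names.add(name)) {
          throw new IllegalStateException("duplicate dir name: " + name);
        }
        return null;
      }));
    }
    start.countDown();                            // release all threads at once
    for (Future<Object> f : results) {
      f.get();                                    // surfaces any collision
    }
    pool.shutdown();
  }
}
{code}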

> Improve temporary directory name generation in LocalDistributedCacheManager 
> for concurrent processes
> 
>
> Key: MAPREDUCE-6441
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6441
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: William Watson
>Assignee: Ray Chiang
> Attachments: HADOOP-10924.02.patch, 
> HADOOP-10924.03.jobid-plus-uuid.patch, MAPREDUCE-6441.004.patch, 
> MAPREDUCE-6441.005.patch, MAPREDUCE-6441.006.patch
>
>
> Kicking off many sqoop processes in different threads results in:
> {code}
> 2014-08-01 13:47:24 -0400:  INFO - 14/08/01 13:47:22 ERROR tool.ImportTool: 
> Encountered IOException running import job: java.io.IOException: 
> java.util.concurrent.ExecutionException: java.io.IOException: Rename cannot 
> overwrite non empty destination directory 
> /tmp/hadoop-hadoop/mapred/local/1406915233073
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:149)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:163)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:731)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> java.security.AccessController.doPrivileged(Native Method)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> javax.security.auth.Subject.doAs(Subject.java:415)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:186)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:159)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:239)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:645)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:415)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.tool.ImportTool.run(ImportTool.java:502)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.Sqoop.run(Sqoop.java:145)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.Sqoop.main(Sqoop.java:238)
> {code}
> This happens if two are kicked off in the same second: each process seeds the 
> generator from the current time and then increments it, so the names the two 
> processes generate can overlap. The issue is the following lines of code in 
> the org.apache.hadoop.mapred.LocalDistributedCacheManager class: 
> {code}
> // Generating unique numbers for FSDownload.
> AtomicLong uniqueNumberGenerator =
>new AtomicLong(System.currentTimeMillis());
> {code}
> and 
> {code}
> Long.toString(uniqueNumberGenerator.incrementAndGet())),
> {code}
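
The jobid-plus-uuid attachment name suggests the direction of the fix. A 
minimal sketch of that idea (assumed from the attachment name, not taken from 
the patch): make the base name unique per process by construction rather than 
by timestamp, so two processes started at the same instant cannot collide.

{code:java}
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only; MAPREDUCE-6441.006.patch may differ. The jobId parameter is a
// hypothetical stand-in for whatever per-job identifier is available.
class UniqueLocalDirNames {
  private final AtomicLong counter = new AtomicLong();
  private final String base;

  UniqueLocalDirNames(String jobId) {
    // Unique per process regardless of start time.
    this.base = jobId + "_" + UUID.randomUUID();
  }

  String next() {
    return base + "_" + counter.incrementAndGet();
  }
}
{code}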



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-6931) Remove TestDFSIO "Total Throughput" calculation

2017-09-01 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151228#comment-16151228
 ] 

Konstantin Shvachko commented on MAPREDUCE-6931:


The change is very simple, even trivial, but it is important from an API 
viewpoint, as I explained above.

> Remove TestDFSIO "Total Throughput" calculation
> ---
>
> Key: MAPREDUCE-6931
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6931
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: benchmarks, test
>Affects Versions: 2.8.0
>Reporter: Dennis Huo
>Assignee: Dennis Huo
>Priority: Critical
> Fix For: 2.9.0, 3.0.0-beta1, 2.7.5, 2.8.3
>
> Attachments: MAPREDUCE-6931-001.patch
>
>
> The new "Total Throughput" line added in 
> https://issues.apache.org/jira/browse/HDFS-9153 is currently calculated as 
> {{toMB(size) / ((float)execTime)}} and claims to be in units of "MB/s", but 
> {{execTime}} is in milliseconds; thus, the reported number is 1/1000x the 
> actual value:
> {code:java}
> String resultLines[] = {
> "- TestDFSIO - : " + testType,
> "Date & time: " + new Date(System.currentTimeMillis()),
> "Number of files: " + tasks,
> " Total MBytes processed: " + df.format(toMB(size)),
> "  Throughput mb/sec: " + df.format(size * 1000.0 / (time * 
> MEGA)),
> "Total Throughput mb/sec: " + df.format(toMB(size) / 
> ((float)execTime)),
> " Average IO rate mb/sec: " + df.format(med),
> "  IO rate std deviation: " + df.format(stdDev),
> " Test exec time sec: " + df.format((float)execTime / 1000),
> "" };
> {code}
> The different calculated fields can also use toMB and a shared 
> milliseconds-to-seconds conversion to make it easier to keep units consistent.
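
For comparison, a sketch of the shared-conversion idea from the description 
(helper names are assumed; the attached patch instead removes the misleading 
line entirely, as the summary says):

{code:java}
// Sketch only, not from MAPREDUCE-6931-001.patch. With both operands in
// consistent units (MB and seconds), the reported MB/s is no longer 1/1000x.
class ThroughputUnits {
  static final long MEGA = 0x100000;   // bytes per MB, as TestDFSIO defines it

  static double toMB(long bytes) {
    return ((double) bytes) / MEGA;
  }

  static double msToSecs(long millis) {
    return millis / 1000.0;
  }

  static double throughputMBs(long bytes, long execTimeMillis) {
    return toMB(bytes) / msToSecs(execTimeMillis);
  }
}
{code}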






[jira] [Updated] (MAPREDUCE-6931) Remove TestDFSIO "Total Throughput" calculation

2017-09-01 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated MAPREDUCE-6931:
---
Priority: Critical  (was: Trivial)

> Remove TestDFSIO "Total Throughput" calculation
> ---
>
> Key: MAPREDUCE-6931
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6931
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: benchmarks, test
>Affects Versions: 2.8.0
>Reporter: Dennis Huo
>Assignee: Dennis Huo
>Priority: Critical
> Fix For: 2.9.0, 3.0.0-beta1, 2.7.5, 2.8.3
>
> Attachments: MAPREDUCE-6931-001.patch
>






[jira] [Commented] (MAPREDUCE-6931) Remove TestDFSIO "Total Throughput" calculation

2017-09-01 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151042#comment-16151042
 ] 

Junping Du commented on MAPREDUCE-6931:
---

This JIRA is marked as trivial, but we are in the 2.8.2 RC stage. In my 
practice (different RMs may have different practices), commits below major 
priority should be skipped at this stage, balancing the importance of the fix 
against the risk of a careless code merge.

> Remove TestDFSIO "Total Throughput" calculation
> ---
>
> Key: MAPREDUCE-6931
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6931
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: benchmarks, test
>Affects Versions: 2.8.0
>Reporter: Dennis Huo
>Assignee: Dennis Huo
>Priority: Trivial
> Fix For: 2.9.0, 3.0.0-beta1, 2.7.5, 2.8.3
>
> Attachments: MAPREDUCE-6931-001.patch
>






[jira] [Commented] (MAPREDUCE-6931) Remove TestDFSIO "Total Throughput" calculation

2017-09-01 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150945#comment-16150945
 ] 

Konstantin Shvachko commented on MAPREDUCE-6931:


Hey [~djp], I got confused by the jira versions, as 2.8.3 was not available. 
Now it is, thanks.
But I had hoped the confusing field, which this jira removes, would not sneak 
into any releases at all, to avoid later questions about what it means and why 
it was removed.
I would strongly recommend merging this into 2.8.2. The final decision, of 
course, is up to the release manager.

> Remove TestDFSIO "Total Throughput" calculation
> ---
>
> Key: MAPREDUCE-6931
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6931
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: benchmarks, test
>Affects Versions: 2.8.0
>Reporter: Dennis Huo
>Assignee: Dennis Huo
>Priority: Trivial
> Fix For: 2.9.0, 3.0.0-beta1, 2.7.5, 2.8.3
>
> Attachments: MAPREDUCE-6931-001.patch
>






[jira] [Commented] (MAPREDUCE-5124) AM lacks flow control for task events

2017-09-01 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150458#comment-16150458
 ] 

Jason Lowe commented on MAPREDUCE-5124:
---

Having the AM send the heartbeat means the AM needs to be the client in the RPC 
connection, since only servers receive method calls.  That creates two problems 
in practice.  First is the discovery problem: how does the AM know the 
listening port for each task?  Second is thread scaling, since the client RPC 
layer creates a thread for every connection.  That means a thread per task, 
which is not going to work for large jobs.

bq. The actual code may use asynchronous calls not to create a thread for each 
task.

This is really the key and the only thing necessary to solve the problem.  The 
root cause is that the AM quickly sends a response to each heartbeat without 
actually processing it.  That creates a flow control issue, since the rate of 
processing heartbeats is disconnected from the incoming rate.  We can therefore 
receive heartbeats far faster than we can process them, causing an unbounded 
pileup of backlogged events.  The reason the AM behaves this way is that it 
needs to free up the IPC Server handler thread so it can handle other task 
requests: other heartbeats, new task attempt connections, etc.  There are lots 
of other places in YARN and MAPREDUCE where a similar tactic is taken, with the 
same resulting flow control issue.

The real fix is to not send a heartbeat reply until the heartbeat is completely 
processed.  Then there will only ever be as many outstanding heartbeats and 
metrics status updates as there are task attempts running at the time, rather 
than an unbounded number based on the rate difference between how fast the 
tasks post heartbeats and how fast the AsyncDispatcher can process them.  If we 
can process the heartbeat synchronously in a way that doesn't tie up an IPC 
Server handler thread for the duration of the call, then we're all set.  Task 
heartbeats naturally slow down as the AM's ability to process them degrades.  
There is no need for the AM to explicitly reject requests or to do any 
3-second sleeping itself.  We just need to leverage the functionality added in 
HADOOP-11552 so we aren't compelled to reply to heartbeats before they are 
fully processed in order to free up an IPC Server thread.
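
A toy illustration of the difference (plain JDK, not Hadoop's IPC layer): 
acking then enqueueing lets the backlog grow without bound, while replying only 
after processing bounds the backlog to the number of in-flight callers.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy model of the flow-control argument above, not Hadoop code.
class HeartbeatFlowControl {
  private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();

  // Anti-pattern: reply immediately and queue the real work. The queue grows
  // whenever heartbeats arrive faster than the dispatcher drains them.
  void ackThenEnqueue(Runnable heartbeatWork) {
    dispatcher.submit(heartbeatWork);   // unbounded backlog
    // (reply would be sent here, before the work has run)
  }

  // Fix: do not reply until the heartbeat is fully processed. Each caller has
  // at most one outstanding heartbeat, so the backlog is bounded by the number
  // of running task attempts, and senders naturally slow down.
  void processThenAck(Runnable heartbeatWork) throws Exception {
    dispatcher.submit(heartbeatWork).get();   // reply only after completion
  }
}
{code}

In the real IPC layer the reply would be deferred via the HADOOP-11552 handoff 
rather than blocking a handler thread the way this toy does.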


> AM lacks flow control for task events
> -
>
> Key: MAPREDUCE-5124
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5124
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mr-am
>Affects Versions: 2.0.3-alpha, 0.23.5
>Reporter: Jason Lowe
>Assignee: Haibo Chen
> Attachments: MAPREDUCE-5124-proto.2.txt, MAPREDUCE-5124-prototype.txt
>
>
> The AM does not have any flow control to limit the incoming rate of events 
> from tasks.  If the AM is unable to keep pace with the rate of incoming 
> events for a sufficient period of time then it will eventually exhaust the 
> heap and crash.  MAPREDUCE-5043 addressed a major bottleneck for event 
> processing, but the AM could still get behind if it's starved for CPU and/or 
> handling a very large job with tens of thousands of active tasks.






[jira] [Commented] (MAPREDUCE-6950) Error Launching job : java.io.IOException: Unknown Job job_xxx_xxx

2017-09-01 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150422#comment-16150422
 ] 

Jason Lowe commented on MAPREDUCE-6950:
---

Retries of the write are automatically performed by the HDFS client layer 
before it ultimately gives up and bubbles the error up to the application 
layer.  Looking back in the AM logs, you should be able to find indications of 
this.  Since the HDFS layer is already retrying, the utility of retrying again 
at the application layer is questionable.
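
For context, this is the kind of application-layer retry being weighed here, 
as a generic sketch (the helper is hypothetical, not a Hadoop API). As argued 
above, it only helps if the failure is transient beyond what the HDFS client 
already retried internally.

{code:java}
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical retry wrapper; assumes maxAttempts >= 1. Extra attempts mostly
// add latency unless the write can be reopened against a healthy pipeline,
// because the HDFS client has already retried before surfacing the error.
final class Retry {
  static <T> T withRetries(Callable<T> op, int maxAttempts) throws Exception {
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.call();
      } catch (IOException e) {
        last = e;
        Thread.sleep(1000L * attempt);   // linear backoff between attempts
      }
    }
    throw last;
  }
}
{code}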


> Error Launching job : java.io.IOException: Unknown Job job_xxx_xxx
> --
>
> Key: MAPREDUCE-6950
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6950
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am
>Affects Versions: 2.7.1
>Reporter: zhengchenyu
> Fix For: 2.7.5
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Some jobs report an error like this:
> {code}
> hadoop.mapreduce.Job.monitorAndPrintJob(Job.java 1367) [main] :  map 100% 
> reduce 100%
> [2017-08-31T20:27:12.591+08:00] [INFO] 
> hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java 277) 
> [main] : Application state is completed. FinalApplicationStatus=SUCCEEDED. 
> Redirecting to job history server
> [2017-08-31T20:27:12.821+08:00] [INFO] 
> hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java 277) 
> [main] : Application state is completed. FinalApplicationStatus=SUCCEEDED. 
> Redirecting to job history server
> [2017-08-31T20:27:13.039+08:00] [INFO] 
> hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java 277) 
> [main] : Application state is completed. FinalApplicationStatus=SUCCEEDED. 
> Redirecting to job history server
> [2017-08-31T20:27:13.256+08:00] [ERROR] 
> hadoop.streaming.StreamJob.submitAndMonitorJob(StreamJob.java 1034) [main] : 
> Error Launching job : java.io.IOException: Unknown Job job_xxx_xxx
> {code}
> I found the AM container log shown below. From it we can see the error 
> happened in the HDFS write pipeline, possibly due to a datanode error. I also 
> found another cause that closed the JobHistoryEventHandler. As a result, the 
> MR AM cannot write the job information for the JobHistory server, so the 
> client cannot tell whether the application has finished. 
> {code}
> 2017-08-31 20:27:10,813 INFO [Thread-1968] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, 
> writing event MAP_ATTEMPT_STARTED
> 2017-08-31 20:27:10,814 ERROR [Thread-1968] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing 
> History Event: 
> org.apache.hadoop.mapreduce.jobhistory.TaskAttemptStartedEvent@2055ea0a
> java.io.EOFException: Premature EOF: no length prefix available
> at 
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2292)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1317)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1237)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
> 2017-08-31 20:27:10,814 INFO [Thread-1968] 
> org.apache.hadoop.service.AbstractService: Service JobHistoryEventHandler 
> failed in state STOPPED; cause: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.EOFException: 
> Premature EOF: no length prefix available
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.EOFException: 
> Premature EOF: no length prefix available
> at 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:580)
> at 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:374)
>  
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
> at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
> at 
> org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
> at 
> org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
> {code}
> This problem is serious, especially for Hive: the job must be rerun 
> meaninglessly!  So I think we need to retry the operation of writing the 
> history event. 


