[jira] [Commented] (MAPREDUCE-6931) Remove TestDFSIO "Total Throughput" calculation

2017-08-25 Thread Dennis Huo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142510#comment-16142510
 ] 

Dennis Huo commented on MAPREDUCE-6931:
---

Done.

> Remove TestDFSIO "Total Throughput" calculation
> ---
>
> Key: MAPREDUCE-6931
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6931
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: benchmarks, test
>Affects Versions: 2.8.0
>Reporter: Dennis Huo
>Priority: Trivial
> Attachments: MAPREDUCE-6931-001.patch
>
>
> The new "Total Throughput" line added in 
> https://issues.apache.org/jira/browse/HDFS-9153 is currently calculated as 
> {{toMB(size) / ((float)execTime)}} and claims to be in units of "MB/s", but 
> {{execTime}} is in milliseconds; thus, the reported number is 1/1000x the 
> actual value:
> {code:java}
> String resultLines[] = {
> "- TestDFSIO - : " + testType,
> "Date & time: " + new Date(System.currentTimeMillis()),
> "Number of files: " + tasks,
> " Total MBytes processed: " + df.format(toMB(size)),
> "  Throughput mb/sec: " + df.format(size * 1000.0 / (time * 
> MEGA)),
> "Total Throughput mb/sec: " + df.format(toMB(size) / 
> ((float)execTime)),
> " Average IO rate mb/sec: " + df.format(med),
> "  IO rate std deviation: " + df.format(stdDev),
> " Test exec time sec: " + df.format((float)execTime / 1000),
> "" };
> {code}
> The different calculated fields can also use toMB and a shared 
> milliseconds-to-seconds conversion to make it easier to keep units consistent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-6931) Remove TestDFSIO "Total Throughput" calculation

2017-08-14 Thread Dennis Huo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125997#comment-16125997
 ] 

Dennis Huo commented on MAPREDUCE-6931:
---

Right, that line is part of the refactor to make time and byte conversions 
consistently use the helper functions instead of repeating the arithmetic in 
different places.

So the current pull request keeps the refactoring but removes the "Total 
Throughput" line as you suggested. If you prefer to also remove all the 
refactoring and keep the hard-coded "(float)execTime / 1000" expressions, I can 
do that too; just let me know.


[jira] [Commented] (MAPREDUCE-6931) Fix TestDFSIO "Total Throughput" calculation

2017-08-12 Thread Dennis Huo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124768#comment-16124768
 ] 

Dennis Huo commented on MAPREDUCE-6931:
---

Hmm, looking at github I only see the refactoring of the older messages, along 
with complete removal of the "Total Throughput" line.

The confusion might be that there's only one commit because I used "commit 
--amend", force of habit from other repos I've worked on where this convention 
is used for review-time changes to small patches. I could probably reconstruct 
the commit history if you prefer.


[jira] [Updated] (MAPREDUCE-6931) Remove TestDFSIO "Total Throughput" calculation

2017-08-12 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated MAPREDUCE-6931:
--
Summary: Remove TestDFSIO "Total Throughput" calculation  (was: Fix 
TestDFSIO "Total Throughput" calculation)


[jira] [Commented] (MAPREDUCE-6931) Fix TestDFSIO "Total Throughput" calculation

2017-08-10 Thread Dennis Huo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122505#comment-16122505
 ] 

Dennis Huo commented on MAPREDUCE-6931:
---

Fair enough, makes sense. I went ahead and removed that line, keeping the 
refactorings otherwise; I also updated my commit message and pull request title 
to reflect the "removal" rather than the "fix" of the line, but it sounds like 
guidelines are to avoid editing JIRAs in place, so I'll leave that untouched.


[jira] [Commented] (MAPREDUCE-6931) Fix TestDFSIO "Total Throughput" calculation

2017-08-10 Thread Dennis Huo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122308#comment-16122308
 ] 

Dennis Huo commented on MAPREDUCE-6931:
---

Thanks for the explanation! I have no strong preference about removing the 
particular "Total Throughput" metric, but from my own experience using 
TestDFSIO in the past, I do find that the "average single-stream throughput" 
calculation historically provided by TestDFSIO can itself be somewhat 
misleading in characterizing a cluster since it makes it difficult to infer the 
level of concurrency corresponding to that per-stream performance without 
backing out the numbers manually.

I see the new metric as being a useful measure of "Effective Aggregate 
Throughput", all-in including overhead.

For example, if I use memory settings that only fit 1 container per physical 
machine at a time, my TestDFSIO will trickle through 1 task per machine at a 
time, and those single tasks will have very high single-stream throughput. If I 
instead do memory packing so that every machine runs, say, 64 tasks 
concurrently, then single-stream throughput will suffer significantly, while 
total walltime will decrease significantly. With a walltime-based calculation, 
I can see at a glance the approximate total throughput rating of my cluster 
when everything is running at full throttle; I'd expect increasing concurrency 
to increase aggregate throughput until IO limits are reached, where aggregate 
throughput will become flat w.r.t. increasing concurrency or slightly declining 
due to thrashing.

This could also be my cloud bias, where it becomes more important to 
characterize a full-blast cluster against a remote filesystem vs caring so much 
about per-stream throughputs.

It seems like an "effective aggregate throughput" calculation would help 
encompass the cluster-wide effects of things like optimal CPU oversubscription 
ratios, scheduler settings, speculative execution vs failure rates, etc.

I agree the wording and computation as-is might not be the right fit for this 
though. I see a few options that might be worthwhile, possibly in some 
combination:

* Change the wording to "Effective Aggregate Throughput" to describe more 
accurately what the number means.
* Add a metric displaying the "time" as "Slot Seconds" or something like that 
so that the user doesn't have to compute it by dividing "Total MBytes 
processed" by "Throughput mb/sec" explicitly. This also helps clarify that the 
throughput is computed in terms of slot time, not walltime.
* Additionally, maybe provide a measure of "average concurrency", taking total 
slot time divided by walltime. This would legitimately consider scheduler 
overheads; if my whole test only ran 1 task in an hour, and it only had 30 
minutes of slot time, then a concurrency of 0.5 correctly characterizes the 
fact that I'm only squeezing out 0.5 utilization after factoring in delays.
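
The three proposals above amount to simple arithmetic, sketched below for 
illustration only. The class and method names are hypothetical and do not 
exist in TestDFSIO; the sketch assumes 1 MB = 1024 * 1024 bytes and that both 
the summed per-task IO time and the walltime are available in milliseconds.

```java
// Illustrative sketch; names are hypothetical, not part of TestDFSIO.
final class AggregateMetricsSketch {
  static final long MEGA = 1024L * 1024L; // assumed bytes per MB

  // "Slot Seconds": the summed per-task IO time, converted from ms to seconds.
  static double slotSeconds(long totalTaskTimeMs) {
    return totalTaskTimeMs / 1000.0;
  }

  // "Effective Aggregate Throughput": MB moved per second of walltime,
  // so scheduling and startup overhead are included in the denominator.
  static double effectiveAggregateThroughputMBps(long sizeBytes, long wallTimeMs) {
    return (sizeBytes / (double) MEGA) / (wallTimeMs / 1000.0);
  }

  // "Average concurrency": total slot time divided by walltime.
  static double averageConcurrency(long totalTaskTimeMs, long wallTimeMs) {
    return (double) totalTaskTimeMs / wallTimeMs;
  }

  public static void main(String[] args) {
    long slotMs = 30L * 60 * 1000; // one task with 30 minutes of slot time
    long wallMs = 60L * 60 * 1000; // the whole test took one hour
    System.out.println("slot seconds: " + slotSeconds(slotMs));
    System.out.println("average concurrency: " + averageConcurrency(slotMs, wallMs));
  }
}
```

With the one-task example from the last bullet (30 minutes of slot time in a 
one-hour run), `averageConcurrency` yields 0.5.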


In any case, happy to just delete the one line in-place to have the 
refactorings committed if you feel it's better not to change/add metrics or if 
these are better discussed in a followup JIRA, let me know.

Re: MAPREDUCE and HDFS, I'll be sure to remember TestDFSIO goes under HDFS in 
the future. For this one I looked at a search for "TestDFSIO" in JIRA and 
eyeballed that a plurality seemed to be under MAPREDUCE, a smaller fraction in 
HDFS, and then remaining ones in HADOOP. Combined with this code going under 
the hadoop-mapreduce directory, it looked like MAPREDUCE was more correct.


[jira] [Created] (MAPREDUCE-6931) Fix TestDFSIO "Total Throughput" calculation

2017-08-02 Thread Dennis Huo (JIRA)
Dennis Huo created MAPREDUCE-6931:
-

 Summary: Fix TestDFSIO "Total Throughput" calculation
 Key: MAPREDUCE-6931
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6931
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: benchmarks, test
Affects Versions: 2.8.0
Reporter: Dennis Huo
Priority: Trivial


The new "Total Throughput" line added in 
https://issues.apache.org/jira/browse/HDFS-9153 is currently calculated as 
{{toMB(size) / ((float)execTime)}} and claims to be in units of "MB/s", but 
{{execTime}} is in milliseconds; thus, the reported number is 1/1000x the 
actual value:

{code:java}
String resultLines[] = {
"- TestDFSIO - : " + testType,
"Date & time: " + new Date(System.currentTimeMillis()),
"Number of files: " + tasks,
" Total MBytes processed: " + df.format(toMB(size)),
"  Throughput mb/sec: " + df.format(size * 1000.0 / (time * MEGA)),
"Total Throughput mb/sec: " + df.format(toMB(size) / ((float)execTime)),
" Average IO rate mb/sec: " + df.format(med),
"  IO rate std deviation: " + df.format(stdDev),
" Test exec time sec: " + df.format((float)execTime / 1000),
"" };
{code}

The different calculated fields can also use toMB and a shared 
milliseconds-to-seconds conversion to make it easier to keep units consistent.
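
As a minimal sketch of the unit mismatch and of the shared-conversion idea, 
the class below contrasts the reported calculation with a unit-consistent one. 
It is an illustration, not the attached patch: the class name, the 
`msToSecs` helper, and the assumption that `MEGA` is 1024 * 1024 bytes are 
all hypothetical stand-ins modeled on the snippet above.

```java
// Illustrative sketch; helper names and MEGA's value are assumptions.
final class ThroughputUnitsSketch {
  static final long MEGA = 1024L * 1024L; // assumed bytes per MB

  // Bytes to megabytes.
  static float toMB(long bytes) {
    return ((float) bytes) / MEGA;
  }

  // Milliseconds to seconds, kept in one place so every field shares it.
  static float msToSecs(long millis) {
    return millis / 1000.0f;
  }

  // The calculation as reported: MB divided by milliseconds.
  static float reportedTotalThroughput(long sizeBytes, long execTimeMs) {
    return toMB(sizeBytes) / ((float) execTimeMs);
  }

  // Unit-consistent variant: MB divided by seconds.
  static float totalThroughputMBps(long sizeBytes, long execTimeMs) {
    return toMB(sizeBytes) / msToSecs(execTimeMs);
  }

  public static void main(String[] args) {
    long size = 1000L * MEGA; // 1000 MB processed
    long execTime = 10_000L;  // 10 seconds, expressed in milliseconds
    System.out.println("as reported:     " + reportedTotalThroughput(size, execTime));
    System.out.println("unit-consistent: " + totalThroughputMBps(size, execTime));
  }
}
```

For 1000 MB processed in 10 seconds, the reported form divides 1000 by 10000 
while the unit-consistent form divides 1000 by 10, which is the factor of 
1000 described above.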





[jira] [Created] (MAPREDUCE-6759) JobSubmitter/JobResourceUploader should parallelize upload of -libjars, -files, -archives

2016-08-17 Thread Dennis Huo (JIRA)
Dennis Huo created MAPREDUCE-6759:
-

 Summary: JobSubmitter/JobResourceUploader should parallelize 
upload of -libjars, -files, -archives
 Key: MAPREDUCE-6759
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6759
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: job submission
Reporter: Dennis Huo


During job submission, the {{JobResourceUploader}} currently uploads 
{{-libjars}}, {{-files}}, and {{-archives}} sequentially, one for-loop per 
option. This can significantly slow down job startup when a large number of 
files need to be uploaded, especially when staging the files to a cloud 
object-store based FileSystem implementation like S3, GCS, WABS, etc., where 
round-trip latencies may be higher than HDFS despite good throughput when 
parallelized:

{code:title=JobResourceUploader.java}
if (files != null) {
  FileSystem.mkdirs(jtFs, filesDir, mapredSysPerms);
  String[] fileArr = files.split(",");
  for (String tmpFile : fileArr) {
    URI tmpURI = null;
    try {
      tmpURI = new URI(tmpFile);
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
    Path tmp = new Path(tmpURI);
    Path newPath = copyRemoteFiles(filesDir, tmp, conf, replication);
    try {
      URI pathURI = getPathURI(newPath, tmpURI.getFragment());
      DistributedCache.addCacheFile(pathURI, conf);
    } catch (URISyntaxException ue) {
      // should not throw a uri exception
      throw new IOException("Failed to create uri for " + tmpFile, ue);
    }
  }
}

if (libjars != null) {
  FileSystem.mkdirs(jtFs, libjarsDir, mapredSysPerms);
  String[] libjarsArr = libjars.split(",");
  for (String tmpjars : libjarsArr) {
    Path tmp = new Path(tmpjars);
    Path newPath = copyRemoteFiles(libjarsDir, tmp, conf, replication);
    DistributedCache.addFileToClassPath(
        new Path(newPath.toUri().getPath()), conf, jtFs);
  }
}

if (archives != null) {
  FileSystem.mkdirs(jtFs, archivesDir, mapredSysPerms);
  String[] archivesArr = archives.split(",");
  for (String tmpArchives : archivesArr) {
    URI tmpURI;
    try {
      tmpURI = new URI(tmpArchives);
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
    Path tmp = new Path(tmpURI);
    Path newPath = copyRemoteFiles(archivesDir, tmp, conf, replication);
    try {
      URI pathURI = getPathURI(newPath, tmpURI.getFragment());
      DistributedCache.addCacheArchive(pathURI, conf);
    } catch (URISyntaxException ue) {
      // should not throw an uri excpetion
      throw new IOException("Failed to create uri for " + tmpArchives, ue);
    }
  }
}
{code}

Parallelizing the upload of these files would improve job submission time.
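
One possible shape for this, sketched under assumptions rather than proposed 
as the implementation: run the copies concurrently on a bounded thread pool, 
then register the results sequentially in the original order, since it is not 
established here that the {{DistributedCache}} registration calls are safe to 
run concurrently against a shared configuration. The class name, the `copy` 
function (a stand-in for {{copyRemoteFiles}}), and the `threads` parameter are 
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustrative sketch; not the JobResourceUploader implementation.
final class ParallelUploadSketch {

  // Copies every source concurrently and returns the new paths in input order.
  // 'copy' stands in for the per-file upload (hypothetical).
  static List<String> uploadAll(List<String> sources,
                                Function<String, String> copy,
                                int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<String>> tasks = new ArrayList<>();
      for (String src : sources) {
        tasks.add(() -> copy.apply(src));
      }
      List<String> uploaded = new ArrayList<>();
      // invokeAll waits for all tasks; futures come back in submission order.
      for (Future<String> done : pool.invokeAll(tasks)) {
        uploaded.add(done.get());
      }
      return uploaded;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while uploading", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("upload failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    List<String> libjars = Arrays.asList("a.jar", "b.jar", "c.jar");
    // Phase 1: parallel copy. Phase 2: sequential registration, in order.
    for (String newPath : uploadAll(libjars, s -> "/staging/libjars/" + s, 4)) {
      System.out.println("register " + newPath);
    }
  }
}
```

Keeping registration sequential preserves the ordering of the existing loops 
while moving only the latency-bound copies onto the pool.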





[jira] [Created] (MAPREDUCE-6758) TestDFSIO should parallelize its creation of control files on setup

2016-08-16 Thread Dennis Huo (JIRA)
Dennis Huo created MAPREDUCE-6758:
-

 Summary: TestDFSIO should parallelize its creation of control 
files on setup
 Key: MAPREDUCE-6758
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6758
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: test
Reporter: Dennis Huo


TestDFSIO currently performs a sequential for-loop to create {{nrFiles}} 
control files in the {{controlDir}} which is a subdirectory of the overall 
{{test.build.data}} directory, which may be a non-HDFS FileSystem 
implementation:

{code:java}
private void createControlFile(FileSystem fs,
                               long nrBytes, // in bytes
                               int nrFiles
                              ) throws IOException {
  LOG.info("creating control file: "+nrBytes+" bytes, "+nrFiles+" files");

  Path controlDir = getControlDir(config);
  fs.delete(controlDir, true);

  for(int i=0; i < nrFiles; i++) {
    String name = getFileName(i);
    Path controlFile = new Path(controlDir, "in_file_" + name);
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, config, controlFile,
                                         Text.class, LongWritable.class,
                                         CompressionType.NONE);
      writer.append(new Text(name), new LongWritable(nrBytes));
    } catch(Exception e) {
      throw new IOException(e.getLocalizedMessage());
    } finally {
      if (writer != null)
        writer.close();
      writer = null;
    }
  }
  LOG.info("created control files for: "+nrFiles+" files");
}
{code}

When testing in an object-store based filesystem with higher round-trip latency 
than HDFS (like S3 or GCS), this means job setup that might only take seconds 
in HDFS ends up taking minutes or even tens of minutes against the object 
stores if the test is using thousands of control files. In the same vein as 
other JIRAs in [https://issues.apache.org/jira/browse/HADOOP-11694], the 
control-file creation should be parallelized/multithreaded to efficiently 
launch large TestDFSIO jobs against FileSystem impls with high round-trip 
latency but which can still support high overall throughput/QPS.
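
A rough sketch of the multithreaded variant follows, with loud caveats: it 
writes small text files through `java.nio` to the local filesystem as a 
stand-in for {{SequenceFile.createWriter}} against a Hadoop FileSystem, it 
skips the initial delete of the control directory, and the class name, the 
`test_io_` file naming, and the `threads` parameter are assumptions made for 
illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch; local files stand in for SequenceFile control files.
final class ParallelControlFilesSketch {

  // Creates nrFiles control files concurrently; returns them in index order.
  static List<Path> createControlFiles(Path controlDir, int nrFiles,
                                       long nrBytes, int threads) {
    try {
      Files.createDirectories(controlDir);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<CompletableFuture<Path>> pending = new ArrayList<>();
      for (int i = 0; i < nrFiles; i++) {
        final String name = "test_io_" + i; // assumed naming scheme
        pending.add(CompletableFuture.supplyAsync(
            () -> writeOne(controlDir.resolve("in_file_" + name), name, nrBytes),
            pool));
      }
      List<Path> created = new ArrayList<>();
      for (CompletableFuture<Path> f : pending) {
        created.add(f.join()); // a failed write surfaces as CompletionException
      }
      return created;
    } finally {
      pool.shutdown();
    }
  }

  // Writes one "name <tab> nrBytes" record, mirroring the single append above.
  private static Path writeOne(Path file, String name, long nrBytes) {
    try {
      byte[] record = (name + "\t" + nrBytes + "\n").getBytes(StandardCharsets.UTF_8);
      return Files.write(file, record);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Each file is independent of the others, so the only coordination needed is 
waiting for all writes to finish before the job is launched.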


