[jira] [Commented] (MAPREDUCE-4659) Confusing output when running "hadoop version" from one hadoop installation when HADOOP_HOME points to another

2012-09-14 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456301#comment-13456301
 ] 

Harsh J commented on MAPREDUCE-4659:


Good idea, I've seen this bite a few users before. Note, though, that on 2.x and 
current 1.x, HADOOP_HOME was deprecated in favor of HADOOP_PREFIX and 
per-component homes. So maybe we could also look at HADOOP_COMMON_HOME, which is 
the lib set that has the VersionInfo?
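
A minimal sketch of the kind of report being discussed, written in Python 
purely for illustration. The function names and the lookup order of the 
variables are assumptions made for this example; they are not taken from the 
Hadoop scripts or from VersionInfo.

```python
def describe_hadoop_home(env):
    """Return (variable, value) for the first home variable that is set, else None."""
    # The lookup order below is an assumption for illustration only.
    for name in ("HADOOP_COMMON_HOME", "HADOOP_PREFIX", "HADOOP_HOME"):
        value = env.get(name)
        if value:
            return name, value
    return None


def version_banner(version, env):
    """Build a version line that also names the directory the jars come from."""
    found = describe_hadoop_home(env)
    if found is None:
        return f"Hadoop {version} (no Hadoop home variable set)"
    name, value = found
    return f"Hadoop {version} (jars resolved via {name}={value})"
```

Calling version_banner("Y", dict(os.environ)) from a launcher would make the 
hadoop-x/hadoop-y mismatch visible in the output itself.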

> Confusing output when running "hadoop version" from one hadoop installation 
> when HADOOP_HOME points to another
> --
>
> Key: MAPREDUCE-4659
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4659
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.20.2
>Reporter: Sandy Ryza
>
> Hadoop version X is downloaded to ~/hadoop-x, and Hadoop version Y is 
> downloaded to ~/hadoop-y.  HADOOP_HOME is set to hadoop-x.  A user running 
> hadoop-y/bin/hadoop might expect to be running the hadoop-y jars, but, 
> because of HADOOP_HOME, will actually be running hadoop-x jars.
> "hadoop version" could help clear this up a little by reporting the current 
> HADOOP_HOME.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-4647) We should only unjar jobjar if there is a lib directory in it.

2012-09-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456191#comment-13456191
 ] 

Hadoop QA commented on MAPREDUCE-4647:
--

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12545216/MR-4647.txt
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 5 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2855//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2855//console

This message is automatically generated.

> We should only unjar jobjar if there is a lib directory in it.
> --
>
> Key: MAPREDUCE-4647
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4647
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2
>Affects Versions: 0.23.3
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
> Attachments: MR-4647-branch-0.23.txt, MR-4647.txt, MR-4647.txt, 
> MR-4647.txt, MR-4647.txt
>
>
> For backwards compatibility we recently made it so that we unjar the job.jar 
> and add anything in the lib directory of that jar to the classpath.  But this 
> also slows job startup down a lot if the jar is large.  We should only unjar 
> it if doing so would actually add something new to the classpath.
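
The idea in the description can be sketched as follows. This is a Python 
illustration under stated assumptions, not the attached patch: the function 
names are hypothetical, and the check simply looks for any file entry under 
lib/ before paying the cost of unpacking.

```python
import zipfile


def has_lib_entries(jar_path, lib_prefix="lib/"):
    """Return True if the archive holds at least one file under lib/."""
    with zipfile.ZipFile(jar_path) as jar:
        return any(
            name.startswith(lib_prefix) and not name.endswith("/")
            for name in jar.namelist()
        )


def unjar_if_needed(jar_path, dest_dir):
    """Unpack the jar only when that would add something to the classpath."""
    if not has_lib_entries(jar_path):
        return False  # nothing under lib/, so skip the expensive unpack
    with zipfile.ZipFile(jar_path) as jar:
        jar.extractall(dest_dir)
    return True
```

Reading the entry list only touches the archive's central directory, so the 
check stays cheap even when the jar itself is large.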



[jira] [Updated] (MAPREDUCE-4647) We should only unjar jobjar if there is a lib directory in it.

2012-09-14 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-4647:
---

Attachment: MR-4647.txt
MR-4647-branch-0.23.txt

Attaching an updated version for trunk/branch-2 and another for branch-0.23.  
These address the comments from before.



[jira] [Commented] (MAPREDUCE-4647) We should only unjar jobjar if there is a lib directory in it.

2012-09-14 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456105#comment-13456105
 ] 

Robert Joseph Evans commented on MAPREDUCE-4647:


I'll look at updating the documentation and the warning.  Yes, there is a minor 
difference between 0.23 and 2.0.  The original patch applied to 0.23 but not 
2.0, so I rebased on trunk.  I will supply a 0.23 version with the updates as 
well.



[jira] [Updated] (MAPREDUCE-4645) Providing a random seed to Slive should make the sequence of filenames completely deterministic

2012-09-14 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated MAPREDUCE-4645:


Attachment: MAPREDUCE-4645.branch-0.23.patch

Thanks for your review and suggestion Konstantin! I've updated the patch to use 
the taskID to seed the RNG.
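
A rough Python sketch of the idea, for illustration only. Slive itself is 
Java, and how the patch combines the user's seed with the task ID is not 
shown in this thread, so the seed string and the filename format below are 
assumptions.

```python
import random


def filenames_for_task(base_seed, task_id, count):
    """Generate a reproducible sequence of filenames for one task."""
    # Each task gets its own generator, derived from the user-supplied seed
    # and the task ID, so tasks do not share or race on a single RNG.
    rng = random.Random(f"{base_seed}:{task_id}")
    return [f"file_{rng.randrange(10**9):09d}" for _ in range(count)]
```

With this scheme, rerunning with the same seed gives every task the same 
filename sequence regardless of scheduling order, while different tasks 
still draw from different sequences.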

> Providing a random seed to Slive should make the sequence of filenames 
> completely deterministic
> ---
>
> Key: MAPREDUCE-4645
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4645
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: performance, test
>Affects Versions: 0.23.1, 2.0.0-alpha
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>  Labels: performance, test
> Attachments: MAPREDUCE-4645.branch-0.23.patch, 
> MAPREDUCE-4645.branch-0.23.patch
>
>
> Using the -random seed option still doesn't produce a deterministic sequence 
> of filenames. Hence there's no way to replicate the performance test. If I'm 
> providing a seed, it's obvious that I want the test to be reproducible.



[jira] [Commented] (MAPREDUCE-4647) We should only unjar jobjar if there is a lib directory in it.

2012-09-14 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456065#comment-13456065
 ] 

Thomas Graves commented on MAPREDUCE-4647:
--

A few minor comments.

- Since Pattern is really only implemented for jars, but it still handles zip, 
tar.gz, and tar, we should try to document the behavior better.
- It might be nice to add something like "even though specified as a Pattern" 
to the warning message in unpack in FSDownload.java.



[jira] [Commented] (MAPREDUCE-4647) We should only unjar jobjar if there is a lib directory in it.

2012-09-14 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456067#comment-13456067
 ] 

Thomas Graves commented on MAPREDUCE-4647:
--

Also, your patch doesn't apply to branch-0.23.



[jira] [Updated] (MAPREDUCE-4659) Confusing output when running "hadoop version" from one hadoop installation when HADOOP_HOME points to another

2012-09-14 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated MAPREDUCE-4659:
--

Affects Version/s: 1.0.0



[jira] [Updated] (MAPREDUCE-4659) Confusing output when running "hadoop version" from one hadoop installation when HADOOP_HOME points to another

2012-09-14 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated MAPREDUCE-4659:
--

Affects Version/s: (was: 2.0.1-alpha)
   (was: 1.0.0)
   0.20.2



[jira] [Commented] (MAPREDUCE-4644) mapreduce-client-jobclient-tests do not run from dist tarball

2012-09-14 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455994#comment-13455994
 ] 

Alejandro Abdelnur commented on MAPREDUCE-4644:
---

Opened MAPREDUCE-4658

> mapreduce-client-jobclient-tests do not run from dist tarball
> -
>
> Key: MAPREDUCE-4644
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4644
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: build, test
>Affects Versions: 2.0.2-alpha
>Reporter: Jason Lowe
>Priority: Blocker
>
> The mapreduce jobclient tests rely on junit which is missing from the dist 
> tarball.  This prevents running often-used tests like sleep jobs.



[jira] [Created] (MAPREDUCE-4659) Confusing output when running "hadoop version" from one hadoop installation when HADOOP_HOME points to another

2012-09-14 Thread Sandy Ryza (JIRA)
Sandy Ryza created MAPREDUCE-4659:
-

 Summary: Confusing output when running "hadoop version" from one 
hadoop installation when HADOOP_HOME points to another
 Key: MAPREDUCE-4659
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4659
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: client
Affects Versions: 2.0.1-alpha
Reporter: Sandy Ryza


Hadoop version X is downloaded to ~/hadoop-x, and Hadoop version Y is 
downloaded to ~/hadoop-y.  HADOOP_HOME is set to hadoop-x.  A user running 
hadoop-y/bin/hadoop might expect to be running the hadoop-y jars, but, because 
of HADOOP_HOME, will actually be running hadoop-x jars.

"hadoop version" could help clear this up a little by reporting the current 
HADOOP_HOME.



[jira] [Commented] (MAPREDUCE-4502) Multi-level aggregation with combining the result of maps per node/rack

2012-09-14 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455909#comment-13455909
 ] 

Chris Douglas commented on MAPREDUCE-4502:
--

bq. This seems to be good approach to deal with rack-level aggregation. Do you 
have any results about the benchmark?

For reducing on key ranges, there's a paper in 
[SOCC|http://www.socc2012.org/papers] on Sailfish. I don't have a link to that 
paper, though there's a [tech 
report|http://research.yahoo.com/files/yl-2012-003.pdf]. For the benchmark, we 
were mostly handling cases without combiners; in our data, each combiner was 
too effective to benefit from an intermediate level.

> Multi-level aggregation with combining the result of maps per node/rack
> ---
>
> Key: MAPREDUCE-4502
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4502
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: applicationmaster, mrv2
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: speculative_draft.pdf
>
>
> The shuffle is expensive in Hadoop despite the existence of the combiner, 
> because the scope of combining is limited to a single MapTask. 
> To solve this problem, a good approach is to aggregate the results of maps 
> per node/rack by launching a combiner.
> This JIRA is to implement the multi-level aggregation infrastructure, 
> including combining per container (MAPREDUCE-3902 is related) and coordinating 
> containers by the application master without breaking fault tolerance of jobs.



[jira] [Created] (MAPREDUCE-4658) Move tools JARs into separate lib directories and have common bootstrap script.

2012-09-14 Thread Alejandro Abdelnur (JIRA)
Alejandro Abdelnur created MAPREDUCE-4658:
-

 Summary: Move tools JARs into separate lib directories and have 
common bootstrap script.
 Key: MAPREDUCE-4658
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4658
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.0.2-alpha
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur


This is a follow-up to the discussion going on in MAPREDUCE-4644:

--
Moving each tool's JARs into separate lib/ dirs is quite easy (modifying a 
single assembly). What we should think about is a common bootstrap script for 
that, so each tool does not have to duplicate (and get wrong) such a script. 
I'll open a JIRA for that.
--




[jira] [Commented] (MAPREDUCE-4644) mapreduce-client-jobclient-tests do not run from dist tarball

2012-09-14 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455900#comment-13455900
 ] 

Alejandro Abdelnur commented on MAPREDUCE-4644:
---

Moving each tool's JARs into separate lib/ dirs is quite easy (modifying a 
single assembly). What we should think about is a common bootstrap script for 
that, so each tool does not have to duplicate (and get wrong) such a script. 
I'll open a JIRA for that.
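
As a rough illustration of what a shared bootstrap step could do, here is a 
Python sketch that builds a classpath from a per-tool lib directory. The 
<tools_root>/<tool>/lib layout and the function name are hypothetical; 
nothing in this thread fixes the actual layout or the script's language.

```python
import os


def tool_classpath(tools_root, tool_name):
    """Build a classpath string from the JARs in <tools_root>/<tool_name>/lib."""
    lib_dir = os.path.join(tools_root, tool_name, "lib")
    if not os.path.isdir(lib_dir):
        raise FileNotFoundError(f"no lib directory for tool {tool_name!r}: {lib_dir}")
    # Sort so the resulting classpath is stable from run to run.
    jars = sorted(f for f in os.listdir(lib_dir) if f.endswith(".jar"))
    return os.pathsep.join(os.path.join(lib_dir, jar) for jar in jars)
```

Keeping this logic in one place is the point of the proposal: each tool would 
only name itself, and the classpath rules would live in the shared helper.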



[jira] [Commented] (MAPREDUCE-4651) Benchmarking random reads with DFSIO

2012-09-14 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455895#comment-13455895
 ] 

Ravi Prakash commented on MAPREDUCE-4651:
-

Thanks Konstantin! I applied the patch and ran the random and backward read 
tests on my single node dev box.

{noformat}
$HADOOP_PREFIX/bin/hadoop org.apache.hadoop.fs.TestDFSIO -read -random -fileSize 10MB
Average IO rate mb/sec: 134.43310546875
IO rate std deviation: 0.0089636501456

$HADOOP_PREFIX/bin/hadoop org.apache.hadoop.fs.TestDFSIO -read -backward -fileSize 10MB
Average IO rate mb/sec: 134.49253845214844
IO rate std deviation: 0.026679629420752023

$HADOOP_PREFIX/bin/hadoop org.apache.hadoop.fs.TestDFSIO -read -random -fileSize 1GB
Average IO rate mb/sec: 249.47183227539062
IO rate std deviation: 0.014617091655162118

$HADOOP_PREFIX/bin/hadoop org.apache.hadoop.fs.TestDFSIO -read -backward -fileSize 1GB
Average IO rate mb/sec: 295.8538818359375
IO rate std deviation: 0.061419808441541615

$HADOOP_PREFIX/bin/hadoop org.apache.hadoop.fs.TestDFSIO -read -random -fileSize 10GB
Average IO rate mb/sec: 320.3417663574219
IO rate std deviation: 0.05935480659067817

$HADOOP_PREFIX/bin/hadoop org.apache.hadoop.fs.TestDFSIO -read -backward -fileSize 10GB
Average IO rate mb/sec: 323.28045654296875
IO rate std deviation: 0.0598550775330073

$HADOOP_PREFIX/bin/hadoop org.apache.hadoop.fs.TestDFSIO -read -backward -fileSize 30GB
Average IO rate mb/sec: 390.9880065917969
IO rate std deviation: 0.06083891027478396

$HADOOP_PREFIX/bin/hadoop org.apache.hadoop.fs.TestDFSIO -read -random -fileSize 30GB
Average IO rate mb/sec: 369.2136535644531
IO rate std deviation: 0.056819116587427144
{noformat}

Could you please post recommended usage? And at what sizes do we expect to 
achieve stable IO rates?
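
To compare the runs above side by side, the pasted output can be parsed with 
a small script. This is a Python sketch for illustration; the function names 
are hypothetical, and the patterns only assume the two line shapes that 
appear in the output quoted in this comment.

```python
import re

# Patterns for the command line and the rate line shown in the output above.
CMD_RE = re.compile(r"TestDFSIO -read -(random|backward)\s+-fileSize (\S+)")
RATE_RE = re.compile(r"Average IO rate mb/sec:\s*([0-9.]+)")


def parse_runs(text):
    """Return (mode, file_size, avg_mb_per_sec) for each run in the log text."""
    # Rejoin commands that were wrapped onto a new line before "-fileSize".
    text = re.sub(r"\s*\n-fileSize", " -fileSize", text)
    runs = []
    mode = size = None
    for line in text.splitlines():
        cmd = CMD_RE.search(line)
        if cmd:
            mode, size = cmd.groups()
            continue
        rate = RATE_RE.search(line)
        if rate and mode is not None:
            runs.append((mode, size, float(rate.group(1))))
            mode = size = None
    return runs


def rates_by_size(runs):
    """Group rates as {file_size: {mode: rate}} for side-by-side comparison."""
    table = {}
    for mode, size, rate in runs:
        table.setdefault(size, {})[mode] = rate
    return table
```

Grouping by file size makes it easier to see where the random and backward 
rates start to level off as the size grows.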


> Benchmarking random reads with DFSIO
> 
>
> Key: MAPREDUCE-4651
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4651
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>  Components: benchmarks, test
>Affects Versions: 1.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: randomDFSIO.patch, randomDFSIO.patch
>
>
> TestDFSIO measures throughput of HDFS write, read, and append operations. It 
> will be useful to have an option to use it for benchmarking random reads.



[jira] [Commented] (MAPREDUCE-4502) Multi-level aggregation with combining the result of maps per node/rack

2012-09-14 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455842#comment-13455842
 ] 

Tsuyoshi OZAWA commented on MAPREDUCE-4502:
---

s/Do you have some result to/Do you have any results about the benchmark?/



[jira] [Commented] (MAPREDUCE-4502) Multi-level aggregation with combining the result of maps per node/rack

2012-09-14 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455840#comment-13455840
 ] 

Tsuyoshi OZAWA commented on MAPREDUCE-4502:
---

Chris and Karthik,

Thank you for sharing your experience and thoughts. They are very useful to 
me.

bq. ShuffleHandler is an auxiliary service loaded in the NodeManager. It's 
shared across all containers. 

I see. I have to redesign it to run the combiner in a container.

bq. Carlo Curino and I experimented with this, but (a) saw only slight 
improvements in job performance and (b) the changes to the AM to accommodate a 
new task type were extensive.

This is very interesting. In fact, my first prototype ran the combiner at the 
end of the MapTask, and its performance was good. In that case, I found that a 
new status needs to be added to MapTask to ensure fault tolerance. Is it 
acceptable for Hadoop to do that?

bq. With logic to manage skew, we're hoping that scheduling an aggressive range 
can have a similar effect to combiner tasks, without introducing the new task 
type.

This seems to be good approach to deal with rack-level aggregation. Do you have 
some result to 

bq. 1. Perform node-level aggregation (reduce) at the end of maps in 
co-ordination with AM.
bq. 2. Perform rack-level aggregation at the end of node-level aggregation 
again in co-ordination with AM. The aggregation could be performed in parallel 
across the involved nodes such that each node has aggregated values of 
different keys.
bq. 3. Schedule reducers taking the key-distribution into account across racks.

Nice wrap-up :-)
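
The first two steps of that wrap-up can be sketched in a few lines. This is 
a Python illustration under simplifying assumptions: map outputs are 
per-key counts, the combiner is a plain sum, and coordination with the AM, 
fault tolerance, and the per-node split of keys during rack-level 
aggregation are all left out. The function and parameter names are 
hypothetical.

```python
from collections import Counter, defaultdict


def combine(outputs):
    """Sum per-key counts; stands in for an associative, commutative combiner."""
    total = Counter()
    for out in outputs:
        total.update(out)
    return total


def multi_level_aggregate(map_outputs, node_of, rack_of):
    """Combine map outputs per node, then combine node results per rack.

    map_outputs: {map_task_id: Counter}, node_of: {map_task_id: node},
    rack_of: {node: rack}.
    """
    per_node = defaultdict(list)
    for task, out in map_outputs.items():
        per_node[node_of[task]].append(out)
    node_level = {node: combine(outs) for node, outs in per_node.items()}

    per_rack = defaultdict(list)
    for node, out in node_level.items():
        per_rack[rack_of[node]].append(out)
    rack_level = {rack: combine(outs) for rack, outs in per_rack.items()}
    return node_level, rack_level
```

Each level shrinks the number of records that must cross the next network 
boundary, which is where the shuffle savings would come from.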

bq. The con will be that the shuffle won't be asynchronous to map computation, 
but hopefully this wouldn't offset the gains of decreased network and disk I/O.

There is a balance between the gains from asynchronous processing and those 
from decreased network and disk I/O. In my previous experiment, it depended 
heavily on the number of reducers. I think these gains are a trade-off, so 
parameters are necessary to deal with various workloads.

bq. PS. http://dl.acm.org/citation.cfm?id=1901088 documents the advantages of 
multi-level aggregation in the context of graph algorithms modeled as iterative 
MR jobs.

I'm going to read it :)

It's time for me to create a new revision of the design note reflecting your 
opinions. 

Thanks,
-- Tsuyoshi



[jira] [Commented] (MAPREDUCE-4644) mapreduce-client-jobclient-tests do not run from dist tarball

2012-09-14 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455826#comment-13455826
 ] 

Robert Joseph Evans commented on MAPREDUCE-4644:


We need to be in the process of separating out true unit tests from system and 
integration tests.  Unit tests should run fast and are something that we can do 
for all components as part of the pre-commit build.  If a test is full-featured 
enough that it could be called a "tool", then there is no way that it is a true 
unit test. I am +1 for moving those out to be part of a tools or examples 
package somewhere.

We also need to look at cleaning up our classpaths.  Separating each tool out 
into a directory with a full list of its dependencies seems like a reasonable 
solution.  It is what Oozie asks users to do for their workflows and seems to 
work fairly well.  But that starts to sound like a larger effort than moving a 
few classes around and splitting a launcher into two.  I think it is something 
that needs to be done, but perhaps needs some design work, especially in 
relation to how we may want to do dependency isolation in the future with OSGi 
or something else.  Alejandro, I know you have already been looking at and 
thinking a lot about the classpath issue with YARN/MR and how we should package 
things (MAPREDUCE-3745, HADOOP-7935, and MAPREDUCE-4421). Do we need another 
JIRA explicitly for tools?  How do we handle the case of a tool having a 
map/reduce dependency now that the MR code is going to be separated out so that 
we can use a different version of MR?  Does that mean that they have to provide 
their own tools with a MR dependency and a config to point to them?  It just 
seems like a change like this needs a full design. 



[jira] [Commented] (MAPREDUCE-3357) TestMRWithDistributedCache fails on branch-20-security

2012-09-14 Thread Amir Sanjar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455821#comment-13455821
 ] 

Amir Sanjar commented on MAPREDUCE-3357:


Add "chmod a+x $HOME"; it should fix the problem. Let me know.

> TestMRWithDistributedCache fails on branch-20-security
> --
>
> Key: MAPREDUCE-3357
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3357
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: distributed-cache, test
>Affects Versions: 0.20.205.0
>Reporter: Eli Collins
>
> TestMRWithDistributedCache testLocalJobRunner fails on branch-20-security:
> {noformat}
> Testcase: testLocalJobRunner took 5.501 sec
> FAILED
> null
> junit.framework.AssertionFailedError: null
> at 
> org.apache.hadoop.filecache.TestMRWithDistributedCache.testWithConf(TestMRWithDistributedCache.java:154)
> at 
> org.apache.hadoop.filecache.TestMRWithDistributedCache.testLocalJobRunner(TestMRWithDistributedCache.java:162)
> {noformat}



[jira] [Commented] (MAPREDUCE-4644) mapreduce-client-jobclient-tests do not run from dist tarball

2012-09-14 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455793#comment-13455793
 ] 

Tom White commented on MAPREDUCE-4644:
--

bq. If desired we could separate these into "tests that are really tools" which 
would go into tools/ and shouldn't rely on junit or other test framework stuff 
and "tests that are really unit tests" that go into something like tests/.

+1. Many of the "tests that are really tools" are benchmarks so we could call 
them that.



[jira] [Commented] (MAPREDUCE-4644) mapreduce-client-jobclient-tests do not run from dist tarball

2012-09-14 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455782#comment-13455782
 ] 

Jason Lowe commented on MAPREDUCE-4644:
---

Actually, I'm thinking of cases where the test jars themselves cause the 
problems; see HDFS-3831.  There are a lot of things in these test jars besides 
the items that are invoked by ToolRunner, and not all test jars even use 
ToolRunner.  If desired we could separate these into "tests that are really 
tools" which would go into tools/ and shouldn't rely on junit or other test 
framework stuff and "tests that are really unit tests" that go into something 
like tests/.  That would make running the "tests that are tools" a bit easier, 
since we hopefully wouldn't need a separate classpath beyond TOOL_PATH, and the 
junit test cases would be completely out of the way.
