Re: hadoop1.2.1 speedup model

2013-09-09 Thread Robert Evans
How many times did you run the experiment at each setting?  What is the
standard deviation for each of these settings?  It could be that you are
simply running into the error bounds of Hadoop.  Hadoop is far from
consistent in its performance.  For our benchmarking we typically will
run the test 5 times, throw out the top and bottom results as possible
outliers, and then average the other runs.  Even with that we have to be
very careful that we weed out bad nodes or the numbers are useless for
comparison.  The other thing to look at is where all of the time was spent
for each of these settings.  The map portion should be very close to
linear with the number of tasks, assuming that there is no disk or network
contention.  The shuffle is far from linear as the number of fetches is a
function of the number of maps and the number of reducers.  The reduce
phase itself should be close to linear assuming that there isn't much skew
to your data.
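
One way to see why a curve like yours can wobble between sub-linear and
super-linear is to model the job as whole waves of tasks.  A rough,
illustrative sketch (all of the constants are made up and would have to be
fitted to measurements from your own cluster; 8 slots per node matches the
setup quoted below):

// Rough sketch only: map and reduce phases run in whole "waves" of tasks,
// and the shuffle performs maps * reduces fetches in total.
public class SpeedupSketch {
  static final double MAP_WAVE_SECONDS = 120.0;     // assumed cost of one wave of map tasks
  static final double REDUCE_WAVE_SECONDS = 150.0;  // assumed cost of one wave of reduce tasks
  static final double SECONDS_PER_FETCH = 0.02;     // assumed per-fetch shuffle overhead
  static final int SLOTS_PER_NODE = 8;              // matches the 8 map / 8 reduce slots below

  static double estimate(int nodes, int maps, int reduces) {
    int mapWaves = (int) Math.ceil((double) maps / (nodes * SLOTS_PER_NODE));
    int reduceWaves = (int) Math.ceil((double) reduces / (nodes * SLOTS_PER_NODE));
    // Each reducer fetches from every map, so the fetch overhead is paid wave by
    // wave and only shrinks stepwise as nodes are added -- the non-linear part.
    double shuffle = reduceWaves * maps * SECONDS_PER_FETCH;
    return mapWaves * MAP_WAVE_SECONDS + shuffle + reduceWaves * REDUCE_WAVE_SECONDS;
  }

  public static void main(String[] args) {
    double base = estimate(2, 64, 64);
    for (int n = 2; n <= 9; n++) {
      System.out.printf("nodes=%d  est=%.0fs  speedup-vs-2-nodes=%.2f%n",
          n, estimate(n, 64, 64), base / estimate(n, 64, 64));
    }
  }
}

Because the wave counts only change at certain node counts, the measured
speedup can look flat, sub-linear, or suddenly super-linear depending on
where the step falls.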

--Bobby

On 9/7/13 3:33 AM, 牛兆捷 nzjem...@gmail.com wrote:

But I still want to find the most efficient assignment and scale both data
and nodes as you said. For example, in my result, 2 is the best, and 8 is
better than 4.

Why is it sub-linear from 2 to 4 but super-linear from 4 to 8? I find it
hard to model this result. Can you give me some hint about this kind of
trend?


2013/9/7 Vinod Kumar Vavilapalli vino...@hortonworks.com


 Clearly your input size isn't changing. And depending on how they are
 distributed on the nodes, there could be Datanode/disks contention.

 The better way to model this is by scaling the input data also linearly.
 More nodes should process more data in the same amount of time.

 Thanks,
 +Vinod

 On Sep 6, 2013, at 8:27 AM, 牛兆捷 wrote:

  Hi all:
 
  I vary the computational nodes of the cluster and get the speedup result
 in the attachment.
 
  In my mind, there are three types of speedup models: linear, sub-linear
 and super-linear. However, the curve of my result seems a little strange.
 I have attached it.
  speedup.png
 
  This is the sort in example.jar; it is done using only the default
 map-reduce mechanism of Hadoop.
 
  I use hadoop-1.2.1, with 8 map slots and 8 reduce slots per node (12
 CPUs, 20 GB memory),
   io.sort.mb = 512, block size = 512mb, heap size = 1024mb,
  reduce.slowstart = 0.05, the others are default.
 
  Input data: 20 GB, which I divide into 64 files
 
  Sort example: 64 map tasks, 64 reduce tasks
 
  Computational nodes: varying from 2 to 9
 
  Why is the speedup like this? How can I model it properly?
 
  Thanks〜
 
  --
  Sincerely,
  Zhaojie
 






-- 
*Sincerely,*
*Zhaojie*



Re: [VOTE] Release Apache Hadoop 0.23.9

2013-07-02 Thread Robert Evans
+1 downloaded the release.  Ran a couple of simple jobs and everything
worked.

On 7/1/13 12:20 PM, Thomas Graves tgra...@yahoo-inc.com wrote:

I've created a release candidate (RC0) for hadoop-0.23.9 that I would like
to release.

The RC is available at:
http://people.apache.org/~tgraves/hadoop-0.23.9-candidate-0/
The RC tag in svn is here:
http://svn.apache.org/viewvc/hadoop/common/tags/release-0.23.9-rc0/

The maven artifacts are available via repository.apache.org.

Please try the release and vote; the vote will run for the usual 7 days
til July 8th.

I am +1 (binding).

thanks,
Tom Graves



Re: InputFormat to regroup splits of underlying InputFormat to control number of map tasks

2013-06-19 Thread Robert Evans
This sounds similar to MultiFileInputFormat

http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/h
adoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apach
e/hadoop/mapred/MultiFileInputFormat.java?revision=1239482&view=markup

It would be nice if you could take a look at it and see if there is
something we can do here to improve it/combine the two.

--Bobby

On 6/19/13 2:53 AM, Nicolae Marasoiu nmara...@adobe.com wrote:

Hi,

When running map-reduce with many splits, it would be nice from a
performance perspective to have fewer splits while maintaining data
locality, so that the overhead of running a map task (jvm creation, map
executor ramp-up e.g. spring context, etc.) is less impactful when
frequently running map-reduces with low data & processing.

I created such an AggregatingInputFormat that simply groups input splits
into composite ones with the same location and creates a record reader
that iterates over the record readers created by the underlying
inputFormat for the underlying raw splits.
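
For what it is worth, the core of that grouping step can be sketched
roughly like this (class and method names are hypothetical, not the actual
patch; a real InputFormat would also need a composite split type and a
chaining record reader):

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.mapreduce.InputSplit;

public class SplitGrouper {
  // Buckets raw splits by their first reported location so that one map task
  // can cover several co-located splits.
  public static Map<String, List<InputSplit>> groupByLocation(List<InputSplit> rawSplits)
      throws IOException, InterruptedException {
    Map<String, List<InputSplit>> byHost = new HashMap<String, List<InputSplit>>();
    for (InputSplit split : rawSplits) {
      String[] locations = split.getLocations();
      String host = (locations != null && locations.length > 0) ? locations[0] : "unknown";
      List<InputSplit> bucket = byHost.get(host);
      if (bucket == null) {
        bucket = new ArrayList<InputSplit>();
        byHost.put(host, bucket);
      }
      bucket.add(split);
    }
    return byHost;
  }
}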

Currently we intend to use it for hbase sharding but I would like to also
implement an optimal algorithm to ensure both fair distribution and
locality, which I can describe if you find it useful to apply in
multi-locations such as replicated kafka or hdfs.

Thanks,
waiting for your feedback,
Nicu Marasoiu
Adobe



Re: mapred.child.ulimit in MR2

2013-06-19 Thread Robert Evans
Sandy,

I think it was something that was missed in the port to YARN and the dead
code was cleaned up as part of HADOOP-8288.

If you have a use case for it or are worried about backwards compatibility
we can add it back in.  It is not that hard; all it did was add 'ulimit -v
number' to the shell script that launched the task, except on Windows.
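
For illustration only (this is a paraphrase of the old behaviour, not the
actual Hadoop 1.x code), the effect amounted to something like:

public class UlimitPrefix {
  // Prefix the child-JVM launch line with "ulimit -v <kb>" unless on Windows.
  static String buildLaunchLine(long ulimitKb, String javaCommand) {
    boolean onWindows = System.getProperty("os.name").startsWith("Windows");
    if (ulimitKb > 0 && !onWindows) {
      return "ulimit -v " + ulimitKb + " && " + javaCommand;
    }
    return javaCommand;
  }

  public static void main(String[] args) {
    // Hypothetical child command, just to show the prefixing.
    System.out.println(buildLaunchLine(2097152, "java -Xmx200m org.example.ChildTask"));
  }
}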

--Bobby

On 6/18/13 3:56 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

Hi yarn-dev/mapreduce-dev,

Is there a reason that mapred.child.ulimit no longer has an effect in MR2?
 Should it be added back in?

thanks for any help,
-Sandy



Re: Visual debugging tools for hadoop

2013-06-18 Thread Robert Evans
Yes data flow visualizations definitely sound like something that would be
good for Ambari.  If you are interested in debugging Hadoop jobs there is
also the Hadoop Development Tools project

http://incubator.apache.org/projects/hdt.html

It is taking the Eclipse plugin for Hadoop and really improving it.  I
know that there has been some work to try and get a debugger working over
there where you could walk through parts of your MR job line by line.

--Bobby

On 6/14/13 12:40 PM, Chris Nauroth cnaur...@hortonworks.com wrote:

Hi Saikat,

You might want to investigate contributing on Apache Ambari, which has
features for visualization of jobs and end-to-end flows consisting of
multiple dependent jobs.

http://incubator.apache.org/ambari/

Chris Nauroth
Hortonworks
http://hortonworks.com/



On Fri, Jun 14, 2013 at 8:20 AM, Saikat Kanjilal
sxk1...@hotmail.comwrote:

 Hi Folks,
 I was wondering if anyone is currently working on or thinking about visual
 debugging tools for mapreduce jobs. I was thinking about starting an effort
 to build an end-to-end visual tool that shows all the steps in the
 mapreduce workflow and data flows, with variable content changing, to
 speed up debugging of jobs.  Please ignore if something like this already
 exists, and if not I'd love to collaborate with folks to build something.


 Regards




Re: [VOTE] Release Apache Hadoop 0.23.8

2013-05-30 Thread Robert Evans
+1

Downloaded the release and ran a few basic tests.

--Bobby

On 5/28/13 11:00 AM, Thomas Graves tgra...@yahoo-inc.com wrote:


I've created a release candidate (RC0) for hadoop-0.23.8 that I would like
to release.

This release is a sustaining release with several important bug fixes in
it.  The most critical one is MAPREDUCE-5211.

The RC is available at:
http://people.apache.org/~tgraves/hadoop-0.23.8-candidate-0/
The RC tag in svn is here:
http://svn.apache.org/viewvc/hadoop/common/tags/release-0.23.8-rc0/

The maven artifacts are available via repository.apache.org.

Please try the release and vote; the vote will run for the usual 7 days.

I am +1 (binding).

thanks,
Tom Graves




Re: [VOTE] Plan to create release candidate for 0.23.8

2013-05-20 Thread Robert Evans
+1

On 5/17/13 4:10 PM, Thomas Graves tgra...@yahoo-inc.com wrote:

Hello all,

We've had a few critical issues come up in 0.23.7 that I think warrants a
0.23.8 release. The main one is MAPREDUCE-5211.  There are a couple of
other issues that I want finished up and get in before we spin it.  Those
include HDFS-3875, HDFS-4805, and HDFS-4835.  I think those are on track
to finish up early next week.   So I hope to spin 0.23.8 soon after this
vote completes.

Please vote '+1' to approve this plan. Voting will close on Friday May
24th at 2:00pm PDT.

Thanks,
Tom Graves




Re: Heads up - 2.0.5-beta

2013-05-03 Thread Robert Evans
I agree that destructive is not the correct word to describe features
like snapshots and windows support.  However, I also agree with Konstantin
that any large feature will have a destabilizing effect on the code base,
even if it is done on a branch and thoroughly tested before being merged
in. HDFS HA from what I have seen and heard is rock solid, but it took a
while to get there even after it was merged into branch-2. And we all know
how long YARN and MRv2 have taken to stabilize.

I also agree that no one individual is able to police all of Hadoop.  We
have to rely on the committers to make sure that what is placed in a
branch is appropriate for that branch in preparation for a release.  As a
community we need to decide what the goals of a branch are so that I as a
committer can know what is and is not appropriate to be placed in that
branch.  This is the reason why we are discussing API and binary
compatibility. This is the reason why I support having a vote for a
release plan.  The question for the community comes down to do we want to
release quickly and often off of trunk trying hard to maintain
compatibility between releases or do we want to follow what we have done
up to now where a single branch goes into stabilization, trunk gets
anything that is not compatible with that branch, and it takes a huge
effort to switch momentum from one branch to another.  Up to this point we
have almost successfully done this switch once, from 1.0 to 2.0. I have a
hard time believing that we are going to do this again for another 5 years.

There is nothing preventing the community from letting each organization
decide what they want to do and we end up with both.  But this results in
fragmentation of the community, and makes it difficult for those trying to
stabilize a release because there is no critical mass of individuals using
and testing that branch.  It also results in the scrambling we are seeing
now to try and revert the incompatibilities between 1.0 and 2.0 that were
introduced in the years between these releases.  If we are going to do the
same and make 3.0 compatible with 2.0 when the switch comes, why do we
even allow any incompatible changes in at all?  It just feels like trunk
is a place to put tech debt that we are going to try and revert later.  I
personally like the Linux and BSD models, where there is a new feature
merge window and any new features can come in, then the entire community
works together to stabilize the release before going on to the next merge
window.  If the release does not stabilize quickly the next merge window
gets pushed back. I realize this is very different from the current model
and is not likely to receive a lot of support, but it has worked for them
for a long time, and they have code bases just as large as Hadoop and even
larger and more diverse communities.

I am +1 for Konstantin's release plan and will vote as such on that thread.

--Bobby

On 5/3/13 3:06 AM, Konstantin Shvachko shv.had...@gmail.com wrote:

Hi Arun and Suresh,

I am glad my choice of words attracted your attention. I consider this
important for the project otherwise I wouldn't waste everybody's time.
You tend to react to the latest message taken out of context, which does
not reveal the full picture.
I'll try here to summarize my proposal and motivation expressed earlier in
these two threads:
http://s.apache.org/fs
http://s.apache.org/Streamlining

I am advocating
1. to make 2.0.5 a release that will
a) make any necessary changes so that Hadoop APIs could be fixed after
that
b) fix bugs: internal and those important for stabilizing downstream
projects
2. Release 2.1.0 stable. I.e. both with stable APIs and stable code base.
3. Produce a series of feature releases. Potentially catching up with the
state of trunk.
4. Release from trunk afterwards.

The main motivation to minimize changes in 2.0.5 is to let Hadoop users and
the downstream projects, that is the Hadoop community, start adapting to
the new APIs asap. This will provide certainty that people can build their
products on top of 2.0.5 APIs with minimal risk that the next release will
break them.
Thus Bobby in http://goo.gl/jm5am
is saying that the meaning of beta for him is locked-down APIs for wire and
binary compatibility. For Yahoo, using Hadoop 2.x is an opportunity to have
it tested at very large scale, which in turn will bring other users on
board.

I agree with Arun that we are not disagreeing on much. Just on the order of
execution: what goes first, stability or features.
I am not challenging any features, the implementations, or the developers.
But putting all changes together is destructive for the stability of the
release. Adding a 500 KB patch invalidates prior testing solely because it
is a big change that needs testing not only by itself but with upstream
applications.
With 2.0.3 , 2.0.4 tested thoroughly and widely in many organizations and
several distributions it seems like a perfect base for the stable release.
We could be just 

Re: JVM vs container memory configs

2013-05-03 Thread Robert Evans
For us we typically leave a 500MB difference between the heap and the
container size.  I think we can make this smaller, but we have not really
tried.
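
Concretely, with purely illustrative numbers, that sizing looks something
like:

import org.apache.hadoop.conf.Configuration;

public class MemoryHeadroomExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.memory.mb", 1536);     // container size for map tasks
    conf.setInt("mapreduce.reduce.memory.mb", 1536);  // container size for reduce tasks
    conf.set("mapred.child.java.opts", "-Xmx1024m");  // JVM heap, ~512MB below the container
  }
}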

--Bobby

On 5/3/13 11:20 AM, Karthik Kambatla ka...@cloudera.com wrote:

Hi

While looking into MAPREDUCE-5207 (adding defaults for
mapreduce.{map|reduce}.memory.mb), I was wondering how much headroom
should
be left on top of mapred.child.java.opts (or other similar JVM opts) for
the container memory itself?

Currently, mapred.child.java.opts (per mapred-default.xml) is set to 200
MB
by default. The default for mapreduce.{map|reduce}.memory.mb is 1024 in
the
code, which is significantly higher than the 200MB value.

Do we need more than 100 MB for non-JVM memory per container? If so, does
it make sense to make that a config property in itself, and to have code
verify that all 3 values are consistent?

Thanks
Karthik



Re: Versions - Confusion

2013-04-26 Thread Robert Evans
It is kind of complex.

Up until 0.20 everything was fairly regular like you would expect.  In
0.20 there was a split where security was added in to a branch and started
to be numbered as 0.20.20X.  But the other releases went on without
security and became 0.21 and 0.22.  0.23 was created when YARN was
introduced and it also had security merged in.  To be fair 0.22 had
security in it, but was never officially supported in a release.  At about
this same time the community decided that we needed to do something better
with numbering and renamed 0.20.20X to be 1.0 and started releasing more
versions from this line.  This is the current stable line. 0.23 was
renamed 2.0 and there have been a few releases but the code is still being
stabilized.  To make things even more confusing some people kept 0.23
alive and stabilized it, so there have been some releases of 0.23 in
parallel with 2.0.  The difference between the two is that 2.0 had HDFS HA
in it, whereas 0.23 does not.

--Bobby Evans

On 4/26/13 12:39 AM, Suresh S suresh...@gmail.com wrote:

Hello,

I was confused by Hadoop versioning.
I found that some people are working on versions starting with 0,
while others are working on versions starting with 2.
I was also confused about the branches.

Which version is really the current version?
*Regards*
*S.Suresh,*
*Research Scholar,*
*Department of Computer Applications,*
*National Institute of Technology,*
*Tiruchirappalli - 620015.*
*+91-9941506562*



Re: [VOTE] Release Apache Hadoop 2.0.4-alpha

2013-04-17 Thread Robert Evans
+1 (binding)

Downloaded the tar ball and ran some simple jobs.

--Bobby Evans

On 4/17/13 2:01 PM, Siddharth Seth seth.siddha...@gmail.com wrote:

+1 (binding)
Verified checksums and signatures.
Built from the source tar, deployed a single node cluster and tested a
couple of simple MR jobs.

- Sid


On Fri, Apr 12, 2013 at 2:56 PM, Arun C Murthy a...@hortonworks.com
wrote:

 Folks,

 I've created a release candidate (RC2) for hadoop-2.0.4-alpha that I
would
 like to release.

 The RC is available at:
 http://people.apache.org/~acmurthy/hadoop-2.0.4-alpha-rc2/
 The RC tag in svn is here:
 
http://svn.apache.org/repos/asf/hadoop/common/tags/release-2.0.4-alpha-rc
2

 The maven artifacts are available via repository.apache.org.

 Please try the release and vote; the vote will run for the usual 7 days.

 thanks,
 Arun


 --
 Arun C. Murthy
 Hortonworks Inc.
 http://hortonworks.com/






Re: Help on submitting a patch for an unassigned bug

2013-03-26 Thread Robert Evans
Also be aware that sometimes committers don't notice that a patch is not
in patch available, so if you need a review and no one has started
reviewing it, please send an e-mail to the dev list and we will do our
best to take a look at it.

--Bobby

On 3/26/13 5:28 AM, Harsh J ha...@cloudera.com wrote:

Hi Niranjan,

You can work on these by submitting a patch directly to the ticket. A
committer/reviewer will assign the issue to you the on first
contribution time on each project, and thereon you can assign them to
yourself as you work on more.

On Tue, Mar 26, 2013 at 9:26 AM, maisnam ns maisnam...@gmail.com wrote:
 Hello ,

 I recently started looking into the bugs in issues.apache.org/jira related
 to HADOOP/MAPREDUCE/HDFS which have not been assigned to anyone and whose
 status is unresolved. My intention is to fix those bugs. Can you please
 let me know if I can assign those bugs to myself and submit a patch with a
 unit test case, or do I have to send a mail to a committer asking for
 approval to assign the bug to myself?

 Could somebody help me out? If somebody can elaborate a little on the
 process it would be helpful. I have read the 'How to contribute to Hadoop'
 page on the wiki but couldn't find anything related to assigning the bug,
 or maybe I missed it somewhere.

 Thanks in advance.

 Regards,
 Niranjan



-- 
Harsh J



Re: [Vote] Merge branch-trunk-win to trunk

2013-02-28 Thread Robert Evans
 add Windows support for new features that are
   platform specific is it assumed that Windows development will either
   lag or will people actively work on keeping Windows up with the
   latest?  And vice versa in case Windows support is implemented
first.
  
   Is there a jira for resolving the outstanding TODOs in the code base
   (similar to HDFS-2148)?  Looks like this merge doesn't introduce
   many which is great (just did a quick diff and grep).
  
   Thanks,
   Eli
  
   On Wed, Feb 27, 2013 at 8:17 AM, Robert Evans ev...@yahoo-inc.com
  wrote:
After this is merged in is Windows still going to be a second
class citizen but happens to work for more than just development
or is it a fully supported platform where if something breaks it
can block a
   release?
 How do we as a community intend to keep Windows support from
 breaking?
We don't have any Jenkins slaves to be able to run nightly tests
to validate everything still compiles/runs.  This is not a blocker
for me because we often rely on individuals and groups to test
Hadoop, but I
  do
think we need to have this discussion before we put it in.
   
--Bobby
   
On 2/26/13 4:55 PM, Suresh Srinivas sur...@hortonworks.com
 wrote:
   
   I had posted heads up about merging branch-trunk-win to trunk on
   Feb
  8th.
   I
   am happy to announce that we are ready for the merge.
   
   Here is a brief recap on the highlights of the work done:
   - Command-line scripts for the Hadoop surface area
   - Mapping the HDFS permissions model to Windows
   - Abstracted and reconciled mismatches around differences in Path
   semantics in Java and Windows
   - Native Task Controller for Windows
   - Implementation of a Block Placement Policy to support cloud
   environments, more specifically Azure.
   - Implementation of Hadoop native libraries for Windows
   (compression codecs, native I/O)
   - Several reliability issues, including race-conditions,
   intermittent
   test
   failures, resource leaks.
   - Several new unit test cases written for the above changes
   
   Please find the details of the work in
   CHANGES.branch-trunk-win.txt - Common
   changeshttp://bit.ly/Xe7Ynv, HDFS changes
  http://bit.ly/13QOSo9
   ,
   and YARN and MapReduce changes http://bit.ly/128zzMt. This is
   the
  work
   ported from branch-1-win to a branch based on trunk.
   
   For details of the testing done, please see the thread -
   http://bit.ly/WpavJ4. Merge patch for this is available on
  HADOOP-8562
   https://issues.apache.org/jira/browse/HADOOP-8562.
   
   This was a large undertaking that involved developing code,
   testing the entire Hadoop stack, including scale tests. This is
   made possible only with the contribution from many many folks in
   the community. Following
  people
   contributed to this work: Ivan Mitic, Chuan Liu, Ramya Sunil,
   Bikas
  Saha,
   Kanna Karanam, John Gordon, Brandon Li, Chris Nauroth, David Lao,
   Sumadhur
   Reddy Bolli, Arpit Agarwal, Ahmed El Baz, Mike Liddell, Jing Zhao,
  Thejas
   Nair, Steve Maine, Ganeshan Iyer, Raja Aluri, Giridharan Kesavan,
   Ramya Bharathi Nimmagadda, Daryn Sharp, Arun Murthy, Tsz-Wo
   Nicholas Sze,
   Suresh
   Srinivas and Sanjay Radia. There are many others who contributed
   as
  well
   providing feedback and comments on numerous jiras.
   
   The vote will run for seven days and will end on March 5, 6:00PM
PST.
   
   Regards,
   Suresh
   
   
   
   
   On Thu, Feb 7, 2013 at 6:41 PM, Mahadevan Venkatraman
   mah...@microsoft.comwrote:
   
It is super exciting to look at the prospect of these changes
   being merged  to trunk. Having Windows as one of the supported
   Hadoop platforms is
  a
fantastic opportunity both for the Hadoop project and Microsoft
   customers.
   
This work began around a year back when a few of us started with
a
   basic
port of Hadoop on Windows. Ever since, the Hadoop team in
Microsoft
   have
made significant progress in the following areas:
(PS: Some of these items are already included in Suresh's email,
but including again for completeness)
   
- Command-line scripts for the Hadoop surface area
- Mapping the HDFS permissions model to Windows
- Abstracted and reconciled mismatches around differences in
Path semantics in Java and Windows
- Native Task Controller for Windows
- Implementation of a Block Placement Policy to support cloud
environments, more specifically Azure.
- Implementation of Hadoop native libraries for Windows
(compression codecs, native I/O) - Several reliability issues,
including race-conditions, intermittent test failures, resource
 leaks.
- Several new unit test cases written for the above changes
   
In the process, we have closely engaged with the Apache open
source community and have got great support and assistance from
the
  community
   in
terms of contributing fixes, code review comments

Re: [Vote] Merge branch-trunk-win to trunk

2013-02-27 Thread Robert Evans
After this is merged in is Windows still going to be a second class
citizen but happens to work for more than just development or is it a
fully supported platform where if something breaks it can block a release?
 How do we as a community intend to keep Windows support from breaking?
We don't have any Jenkins slaves to be able to run nightly tests to
validate everything still compiles/runs.  This is not a blocker for me
because we often rely on individuals and groups to test Hadoop, but I do
think we need to have this discussion before we put it in.

--Bobby

On 2/26/13 4:55 PM, Suresh Srinivas sur...@hortonworks.com wrote:

I had posted heads up about merging branch-trunk-win to trunk on Feb 8th.
I
am happy to announce that we are ready for the merge.

Here is a brief recap on the highlights of the work done:
- Command-line scripts for the Hadoop surface area
- Mapping the HDFS permissions model to Windows
- Abstracted and reconciled mismatches around differences in Path
semantics
in Java and Windows
- Native Task Controller for Windows
- Implementation of a Block Placement Policy to support cloud
environments,
more specifically Azure.
- Implementation of Hadoop native libraries for Windows (compression
codecs, native I/O)
- Several reliability issues, including race-conditions, intermittent test
failures, resource leaks.
- Several new unit test cases written for the above changes

Please find the details of the work in CHANGES.branch-trunk-win.txt -
Common changeshttp://bit.ly/Xe7Ynv, HDFS changeshttp://bit.ly/13QOSo9,
and YARN and MapReduce changes http://bit.ly/128zzMt. This is the work
ported from branch-1-win to a branch based on trunk.

For details of the testing done, please see the thread -
http://bit.ly/WpavJ4. Merge patch for this is available on HADOOP-8562
https://issues.apache.org/jira/browse/HADOOP-8562.

This was a large undertaking that involved developing code, testing the
entire Hadoop stack, including scale tests. This is made possible only
with
the contribution from many many folks in the community. Following people
contributed to this work: Ivan Mitic, Chuan Liu, Ramya Sunil, Bikas Saha,
Kanna Karanam, John Gordon, Brandon Li, Chris Nauroth, David Lao, Sumadhur
Reddy Bolli, Arpit Agarwal, Ahmed El Baz, Mike Liddell, Jing Zhao, Thejas
Nair, Steve Maine, Ganeshan Iyer, Raja Aluri, Giridharan Kesavan, Ramya
Bharathi Nimmagadda, Daryn Sharp, Arun Murthy, Tsz-Wo Nicholas Sze, Suresh
Srinivas and Sanjay Radia. There are many others who contributed as well
providing feedback and comments on numerous jiras.

The vote will run for seven days and will end on March 5, 6:00PM PST.

Regards,
Suresh




On Thu, Feb 7, 2013 at 6:41 PM, Mahadevan Venkatraman
mah...@microsoft.comwrote:

 It is super exciting to look at the prospect of these changes being
merged
 to trunk. Having Windows as one of the supported Hadoop platforms is a
 fantastic opportunity both for the Hadoop project and Microsoft
customers.

 This work began around a year back when a few of us started with a basic
 port of Hadoop on Windows. Ever since, the Hadoop team in Microsoft have
 made significant progress in the following areas:
 (PS: Some of these items are already included in Suresh's email, but
 including again for completeness)

 - Command-line scripts for the Hadoop surface area
 - Mapping the HDFS permissions model to Windows
 - Abstracted and reconciled mismatches around differences in Path
 semantics in Java and Windows
 - Native Task Controller for Windows
 - Implementation of a Block Placement Policy to support cloud
 environments, more specifically Azure.
 - Implementation of Hadoop native libraries for Windows (compression
 codecs, native I/O) - Several reliability issues, including
 race-conditions, intermittent test failures, resource leaks.
 - Several new unit test cases written for the above changes

 In the process, we have closely engaged with the Apache open source
 community and have got great support and assistance from the community
in
 terms of contributing fixes, code review comments and commits.

 In addition, the Hadoop team at Microsoft has also made good progress in
 other projects including Hive, Pig, Sqoop, Oozie, HCat and HBase. Many
of
 these changes have already been committed to the respective trunks with
 help from various committers and contributors. It is great to see the
 commitment of the community to support multiple platforms, and we look
 forward to the day when a developer/customer is able to successfully
deploy
 a complete solution stack based on Apache Hadoop releases.

 Next Steps:

 All of the above changes are part of the Windows Azure HDInsight and
 HDInsight Server products from Microsoft. We have successfully
on-boarded
 several internal customers and have been running production workloads on
 Windows Azure HDInsight. Our vision is to create a big data platform
based
 on Hadoop, and we are committed to helping make Hadoop a world-class
 solution that anyone can use to solve 

Re: tests in mapreduce.lib excluded in jenkins?

2013-02-26 Thread Robert Evans
All of the pre-commit builds only run tests for the projects that had
changes.  This is a known issue, but was done because the pre-commit
builds were taking a very long time.  There have been a few proposals to
improve the situation, like having any change in map/reduce run all of the
map/reduce tests instead of just a subset of them (sorry JIRA is acting
up right now so I don't have a reference to the JIRA number).  But none of
them have gone in yet.

--Bobby

On 2/25/13 6:45 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

A recent patch of mine
(https://issues.apache.org/jira/browse/MAPREDUCE-4994)
broke a couple of tests, but the Hadoop QA build (
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3321/testReport/)
didn't
catch anything wrong.  It looks like the tests under mapred.lib and
mapreduce.lib, such as TestChainMapReduce and TestLineRecordReader aren't
running.  Is this intentional?

thanks,
Sandy



timeout is now requested to be on all tests

2013-02-20 Thread Robert Evans
Sorry about cross posting, but this will impact all developers and I wanted to 
give you all a heads-up.

HADOOP-9112 (https://issues.apache.org/jira/browse/HADOOP-9112) was just checked 
in.  This means that the pre-commit build will now give a -1 for any patch with 
junit tests that do not include a timeout option.  See 
http://junit.sourceforge.net/javadoc/org/junit/Test.html for more info on that. 
This is to avoid surefire timing out junit when it gets stuck and not giving 
any real feedback on which test failed.
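
For example, a test that passes the new check simply carries a timeout
attribute (the value is up to the author):

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class TestExample {
  // The pre-commit check looks for a timeout (in milliseconds) on every @Test.
  @Test(timeout = 50000)
  public void testSomething() {
    assertEquals(4, 2 + 2);
  }
}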

--Bobby


Re: Doubt about map reduce version 2

2013-02-08 Thread Robert Evans
Suresh,

The 1.0 line is still the stable line and improvements there can have a
large impact on existing users.  That being said I think there will be a
lot of movement to Yarn/MRv2 starting in the second half of this year and
all of next year.  Also YARN scheduling is a larger area for study because
it doesn't just run Map/Reduce.  It allows you to explore how to
effectively schedule other workloads in a multi-tenant environment. There
has already been a lot of discussion about the scheduler and its protocol
recently because it is still a very new area to explore and no one really
knows how well the current solutions work for other work loads.

As for speculative execution in MRv2 it is completely pluggable by the
user.  This should make it very easy for you to explore and compare
different speculation schemes.
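
For example, plugging in your own speculator is roughly a one-line config
change.  The key name below is from memory, so double-check it against
MRJobConfig for your version, and the class is a hypothetical user
implementation:

import org.apache.hadoop.conf.Configuration;

public class CustomSpeculatorConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Assumed key name -- verify against MRJobConfig in your Hadoop version.
    conf.set("yarn.app.mapreduce.am.job.speculator.class",
             "com.example.MySpeculator");  // hypothetical Speculator implementation
  }
}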

--Bobby


On 2/7/13 11:39 PM, Suresh S suresh...@gmail.com wrote:

Hello Friends,

 I am working to propose an improved hadoop scheduling algorithm
or speculative execution algorithm as part of my PhD research work.  Now,
the new version of hadoop, YARN/MR v2, is available.

I have the following doubts:

Are the algorithms (particularly scheduling and speculation
algorithms) proposed for the old hadoop version applicable to the new
version of hadoop (YARN) or not?

Is it worthwhile and useful to propose an algorithm for the old hadoop
version now?

Can the user community support and discuss issues related to the old
version?

Thanks in Advance.

*Regards*
*S.Suresh,*
*Research Scholar,*
*Department of Computer Applications,*
*National Institute of Technology,*
*Tiruchirappalli - 620015.*
*+91-9941506562*



Re: [VOTE] Release hadoop-2.0.3-alpha

2013-02-07 Thread Robert Evans
I downloaded the binary package and ran a few example jobs on a 3 node
cluster.  Everything seems to be working OK on it, I did see

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable

For every shell command, but just like with 0.23.6 I don't think it is a
blocker.

+1 (Binding)

--Bobby

On 2/6/13 9:59 PM, Arun C Murthy a...@hortonworks.com wrote:

Folks,

I've created a release candidate (rc0) for hadoop-2.0.3-alpha that I
would like to release.

This release contains several major enhancements such as QJM for HDFS HA,
multi-resource scheduling for YARN, YARN ResourceManager restart etc.
Also YARN has achieved significant stability at scale (more details from
Y! folks here: http://s.apache.org/VYO).

The RC is available at:
http://people.apache.org/~acmurthy/hadoop-2.0.3-alpha-rc0/
The RC tag in svn is here:
http://svn.apache.org/viewvc/hadoop/common/tags/release-2.0.3-alpha-rc0/

The maven artifacts are available via repository.apache.org.

Please try the release and vote; the vote will run for the usual 7 days.

thanks,
Arun



--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/





Re: One output file per node

2012-12-13 Thread Robert Evans
Tejay,

The way the scheduler works you are not guaranteed to get one reducer per
node.  Reducers are not scheduled based off of locality of any kind, and
even if they were the scheduler typically treats rack local the same as
node local.  The partitioner interface only allows you to say what numeric
partition an entry should go to, nothing else. There is no way to map that
numeric partition to a particular machine.  You could try to play games
but they would be very difficult to get right, especially for the corner
cases where a task can fail and may be rerun.  If your partitioner is not
exactly deterministic, you could lose some data, and double count other
data in the case of a failure.

Why don't you want to send all of the data over the wire?  When you write
it out to HDFS it will all be sent over the wire. How do you plan on using
these indexes after they are generated?  Do you plan to read from all of
the indexes in parallel to search for a single entry or do you want to
merge them together again before actually using them?

You could do your original proposal by simulating the combiner within the
map itself.  If your data is small enough you could aggregate the data
within the mapper and then only output the aggregate when all the entries
have been processed.  If it is too big to fit into memory you could look
at having a disk backed data structure with in memory caching, or even
simulate Map/Reduce itself and write all of the data out to a local file,
sort the data and read it back in already partitioned.
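
A minimal sketch of that in-mapper aggregation idea, assuming the
per-mapper aggregate fits in memory (word-count style types, purely
illustrative):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperAggregatingMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {
  private final Map<String, Long> counts = new HashMap<String, Long>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    // Aggregate in memory instead of emitting one record per token.
    for (String token : value.toString().split("\\s+")) {
      Long current = counts.get(token);
      counts.put(token, current == null ? 1L : current + 1L);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit the aggregate only once, after all input records have been processed.
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
    }
  }
}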

--Bobby

On 12/13/12 1:47 AM, Aloke Ghoshal alghos...@gmail.com wrote:

Hi Tejay,

Building a consolidated index file for all your source files (for terms
within the source files) may not be doable this way. On the other hand,
building one index file per node is doable if you run a Reducer per Node &
use a Partitioner.

- Run one Reducer per node
- Let Mapper output carry *NodeHostName:Term* as the key
- Use a Partitioner based on the NodeHostName portion of the key
(KeyFieldBasedPartitioner) & a GroupingComparator based on the Term
portion

Regards,
Aloke
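
A small sketch of the partitioner described above (hypothetical class name;
note that, as Bobby points out, this only keeps one node's terms together
on one reducer -- it does not pin that reducer to the node):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HostnamePrefixPartitioner<V> extends Partitioner<Text, V> {
  @Override
  public int getPartition(Text key, V value, int numPartitions) {
    // Keys look like "NodeHostName:Term"; partition on the hostname prefix only.
    String composite = key.toString();
    int idx = composite.indexOf(':');
    String host = idx >= 0 ? composite.substring(0, idx) : composite;
    return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}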

On Wed, Dec 12, 2012 at 11:32 PM, Cardon, Tejay E
tejay.e.car...@lmco.comwrote:

  First, I hope I'm posting this to the right list.  I wasn't sure if
 developer questions belonged here or on the user list.

  Second, thanks for your thoughts.

  So I have a situation in which I'm building an index across many files.
 I don't want to send ALL the data across the wire by using reducers, so
 I'd like to use a map-only job.  However, I don't want one file per
 mapper; I'd like to consolidate them to only one file per node.
 Effectively, I'd like to have the output of the combiner go to file, but
 I know I can't trust the combiner to always run on all outputs for the
 map.

  Is this possible?  Perhaps some crafty partitioner that somehow sends
 all records to a reducer on the local node?? (I don't see this working)


 Thanks,
 Tejay

 Follow me on Eureka https://eureka.isgs.lmco.com/#people/cardonte and
 Brainstorm http://brainstorm.isgs.lmco.com/Person.aspx?id=1200




Re: Shuffle phase: fine-grained control of data flow

2012-11-08 Thread Robert Evans
Jiwei,

Ok so you are specifically looking at reducing overall network bandwidth
of skewed map outputs, not all map outputs.  That would very much mean
that #1 and #3 are off base. But as you point out it would only really be
a performance win if the data fits into memory. It seems like an interesting
idea. If the goal is to reduce bandwidth and not improve individual job
performance then it seems more plausible.  Do you have a benchmark (grid
mix run etc) that really taxes the network that you could use to measure
the impact such a change would have?  Something like this really needs
some hard numbers for a proper evaluation.

--Bobby Evans 

On 11/7/12 11:32 PM, Jiwei Li cxm...@gmail.com wrote:

Hi Bobby,

Thank you a lot for your suggestions. My whole idea is to minimize the
aggregate network bandwidth during the shuffle phase, that is, to keep the
hops to a minimum when transmitting data from map nodes to reduce nodes.
Usually, the Partitioner creates skew, so the JobTracker allocates
different amounts of map output to the participating reduce nodes. Placing
reduce nodes near the map outputs holding their largest partitions can
reduce the aggregate network bandwidth.

For #1, there is no need to schedule map tasks to be close to one another,
since it will only congest links among the cluster. For #2, the location
and size of each partition in each map output can be sent to JobTracker
along with the processing of InputSplit. Collecting enough such
information
(not necessarily waiting map tasks to finish), the JobTracker starts to
schedule reduce tasks to fetch map output data. #3 is the same as #1.

Now the tricky part is that if all map outputs are spilled to disks,
network bandwidth may not be a bottleneck, because the time consumed in
disk seeks outnumbers that in data transmission. If map outputs fit in
memory, then network must be taken seriously. Also note that for evenly
distributed map outputs, current scheduling policy works just fine.

Jiwei


On Wed, Nov 7, 2012 at 11:45 PM, Robert Evans ev...@yahoo-inc.com wrote:

 Jiwei,

 I think you could use that knowledge to launch reducers closer to the
map
 output, but I am not sure that it would make much difference.  It may
even
 slow things down. It is a question of several things

 1) Can we get enough map tasks close to one another that it will make a
 difference?
 2) Does the reduced shuffle time offset the overhead of waiting for the
 map location data before launching and fetching data early?
 3) and do the time savings also offset the overhead of getting the map
 tasks to be close to one another?

 For #2 you might be able to deal with this by using speculative
execution,
 and launching some reduce tasks later if you see a clustering of map
 output.  For #1 it will require changes to how we schedule tasks which
 depending on how well it is implemented will impact #3 as well.
 Additionally for #1 any job that approaches the same order of size as
the
 cluster will almost require the map tasks to be evenly distributed
around
 the cluster. If you can come up with a patch I would love to see some
 performance numbers.

 Personally I think spending time reducing the size of the data sent to
the
 reducers is a much bigger win.  Can you use a combiner? Do you really
need
 all of the data or can you sample the data to get a statistically
 significant picture of what is in the data?  Have you enabled
compression
 between the maps and the reducers?

 --Bobby

 On 11/7/12 8:05 AM, Harsh J ha...@cloudera.com wrote:

 Hi Jiwei,
 
 In trunk (i.e. MR2), the completion events selection + scheduling
 logic lies under class EventFetcher's getMapCompletionEvents() method,
 as viewable at
 
 
http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project
/
 
hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/a
pa
 che/hadoop/mapreduce/task/reduce/EventFetcher.java?view=markup
 
 This EventFetcher thread is used by the Shuffle (reduce package)
 class, to continually do the shuffling. The Shuffle class is then
 itself used by the ReduceTask class (look in mapred package of same
 maven module).
 
 I guess you can start there, to see if a better selection+scheduling
 logic would yield better results.
 
 On Wed, Nov 7, 2012 at 12:26 PM, Jiwei Li cxm...@gmail.com wrote:
  Dear all,
 
  For jobs like Sort, massive amounts of network traffic happen during
  shuffle phase. The simple mechanism in Hadoop 1.0.4 to choose reduce
 nodes
  does not help reduce network traffic. If JobTracker is fully aware of
  locations of every map output, why not take advantage of this
topology
  knowledge?
 
  So, is there anyone who knows where to develop such codes upon? Many
 thanks.
 
  Regards.
  --
  Jiwei
 
 
 
 --
 Harsh J




-- 
Jiwei Li



Re: Shuffle phase: fine-grained control of data flow

2012-11-07 Thread Robert Evans
Jiwei,

I think you could use that knowledge to launch reducers closer to the map
output, but I am not sure that it would make much difference.  It may even
slow things down. It is a question of several things

1) Can we get enough map tasks close to one another that it will make a
difference?
2) Does the reduced shuffle time offset the overhead of waiting for the
map location data before launching and fetching data early?
3) and do the time savings also offset the overhead of getting the map
tasks to be close to one another?

For #2 you might be able to deal with this by using speculative execution,
and launching some reduce tasks later if you see a clustering of map
output.  For #1 it will require changes to how we schedule tasks which
depending on how well it is implemented will impact #3 as well.
Additionally for #1 any job that approaches the same order of size as the
cluster will almost require the map tasks to be evenly distributed around
the cluster. If you can come up with a patch I would love to see some
performance numbers.

Personally I think spending time reducing the size of the data sent to the
reducers is a much bigger win.  Can you use a combiner? Do you really need
all of the data or can you sample the data to get a statistically
significant picture of what is in the data?  Have you enabled compression
between the maps and the reducers?
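
On that last point, map-output compression is just a config change; a
minimal example with the Hadoop 1.x key names (MRv2 uses
mapreduce.map.output.compress):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;

public class MapOutputCompressionExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.compress.map.output", true);   // compress intermediate map output
    conf.setClass("mapred.map.output.compression.codec",   // or Snappy/LZO where available
                  DefaultCodec.class, CompressionCodec.class);
  }
}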

--Bobby

On 11/7/12 8:05 AM, Harsh J ha...@cloudera.com wrote:

Hi Jiwei,

In trunk (i.e. MR2), the completion events selection + scheduling
logic lies under class EventFetcher's getMapCompletionEvents() method,
as viewable at 
http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/
hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apa
che/hadoop/mapreduce/task/reduce/EventFetcher.java?view=markup

This EventFetcher thread is used by the Shuffle (reduce package)
class, to continually do the shuffling. The Shuffle class is then
itself used by the ReduceTask class (look in mapred package of same
maven module).

I guess you can start there, to see if a better selection+scheduling
logic would yield better results.

On Wed, Nov 7, 2012 at 12:26 PM, Jiwei Li cxm...@gmail.com wrote:
 Dear all,

 For jobs like Sort, massive amounts of network traffic happen during
 shuffle phase. The simple mechanism in Hadoop 1.0.4 to choose reduce
nodes
 does not help reduce network traffic. If JobTracker is fully aware of
 locations of every map output, why not take advantage of this topology
 knowledge?

 So, is there anyone who knows where to develop such codes upon? Many
thanks.

 Regards.
 --
 Jiwei



-- 
Harsh J



Re: division by zero in getLocalPathForWrite()

2012-10-25 Thread Robert Evans
It looks like you are running with an older version of 2.0, even though it
does not really make much of a difference in this case.  The issue shows
up when getLocalPathForWrite thinks there is no space to write to on
any of the disks it has configured.  This could be because you do not have
any directories configured.  I really don't know for sure exactly what is
happening.  It might be disk fail-in-place removing disks for you because
of other issues. Either way we should file a JIRA against Hadoop to make
it so we never get the "/ by zero" error and provide a better way to handle
the possible causes.
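
The fix would be roughly this kind of guard (a sketch of the intent, not
the actual LocalDirAllocator code):

import java.util.Random;
import org.apache.hadoop.util.DiskChecker.DiskErrorException;

public class DirChooser {
  private static final Random RAND = new Random();

  // Picks a directory index weighted by available space, failing cleanly when
  // no directory has space -- the case that currently surfaces as "/ by zero".
  static int chooseDir(long[] availableBytes) throws DiskErrorException {
    long total = 0;
    for (long a : availableBytes) {
      total += a;
    }
    if (total <= 0) {
      throw new DiskErrorException("No space available in any of the local directories.");
    }
    long pick = (long) (RAND.nextDouble() * total);
    for (int i = 0; i < availableBytes.length; i++) {
      pick -= availableBytes[i];
      if (pick < 0) {
        return i;
      }
    }
    return availableBytes.length - 1;
  }
}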

--Bobby Evans

On 10/24/12 11:54 PM, Ted Yu yuzhih...@gmail.com wrote:

Hi,
HBase has Jenkins build against hadoop 2.0
I was checking why TestRowCounter sometimes failed:
https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/231/testReport/o
rg.apache.hadoop.hbase.mapreduce/TestRowCounter/testRowCounterExclusiveCol
umn/

I think the following could be the cause:

2012-10-22 23:46:32,571 WARN  [AsyncDispatcher event handler]
resourcemanager.RMAuditLogger(255): USER=jenkins   OPERATION=Application
Finished - Failed  TARGET=RMAppManager RESULT=FAILURE  DESCRIPTION=App
failed with state: FAILED  PERMISSIONS=Application
application_1350949562159_0002 failed 1 times due to AM Container for
appattempt_1350949562159_0002_01 exited with  exitCode: -1000 due
to: java.lang.ArithmeticException: / by zero
   at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathFor
Write(LocalDirAllocator.java:355)
   at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAlloca
tor.java:150)
   at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAlloca
tor.java:131)
   at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAlloca
tor.java:115)
   at 
org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocal
PathForWrite(LocalDirsHandlerService.java:257)
   at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.Resou
rceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.jav
a:849)

However, I don't seem to find where in getLocalPathForWrite() division by
zero could have arisen.

Comment / hint is welcome.

Thanks



Re: pluggable resources

2012-10-22 Thread Robert Evans
I agree that having it be pluggable opens up a lot of new possibilities.
+1 for the idea.  Although I think in the short term we are having enough
problems as it is with just CPU and memory that it may be a little while
before we get to a pluggable solution.  Once YARN-2 goes in, if you can
get an initial proof of concept patch for a generic solution I would be
happy to review it and push for it to go in.

--Bobby

On 10/22/12 5:41 AM, Radim Kolar h...@filez.com wrote:

I have a proposal for improved resource scheduling.

https://issues.apache.org/jira/browse/MAPREDUCE-4256

As I see it, development seems to be going the other way; for example, in
https://issues.apache.org/jira/browse/YARN-2 every added kind of
resource requires significant rework.

Do you not see the benefit of having a framework able to handle custom
resource types? It's not all about memory and cores. You need to schedule
jobs based on other factors (network capacity, availability of GPU
cores, data locality).

And every cluster might have special considerations, for example do not
overload the central SQL database. We usually have a few hundred submitted
jobs, so proper resource sharing is essential. There is no point in running
a job which needs a GPU that is in use by another mapper; better to run
some other jobs until the GPU becomes available again.



Re: Fix versions for commits branch-0.23

2012-10-09 Thread Robert Evans
I don't see much of a reason to have the same JIRA listed under both 0.23
and 2.0.  I can see some advantage of being able to see what went into
0.23.X by looking at a 2.0.X CHANGES.txt, but unless the two are released
at exactly the same time they will be out of date with each other in the
best case.  I personally think the only way to truly know what is in
0.23.X is to look at the CHANGES.txt on 0.23.X and similarly for 2.X.
Having JIRA be in sync is a huge help and we should definitely push for
that.  I just don't see much value in trying very hard to have the
CHANGES.txt stay in sync.

--Bobby

On 10/8/12 10:21 PM, Siddharth Seth seth.siddha...@gmail.com wrote:

Along with fix versions, does it make sense to add JIRAs under 0.23 as
well
as branch-2 in CHANGES.txt, if they're committed to both branches.
CHANGES.txt tends to get out of sync with the different release schedules
of the 2 branches.

Thanks
- Sid

On Sat, Sep 29, 2012 at 10:33 PM, Arun C Murthy a...@hortonworks.com
wrote:

 Guys,

  A request - can everyone please set fix-version to both 2.* and
0.23.*? I
 found some with only 0.23.*, makes generating release-notes very hard.

 thanks,
 Arun



Re: Commits breaking compilation of MR 'classic' tests

2012-09-26 Thread Robert Evans
That is fine. We may want to then mark it so that MAPREDUCE-4687 depends on the 
JIRA to port the tests, so the tests don't disappear before we are done.

--Bobby

From: Arun C Murthy a...@hortonworks.com
Date: Wednesday, September 26, 2012 12:31 PM
To: hdfs-...@hadoop.apache.org, Yahoo! Inc. ev...@yahoo-inc.com
Cc: common-...@hadoop.apache.org, yarn-...@hadoop.apache.org, 
mapreduce-dev@hadoop.apache.org
Subject: Re: Commits breaking compilation of MR 'classic' tests

Fair, however there are still tests which need to be ported over. We can remove 
them after the port.

On Sep 26, 2012, at 9:54 AM, Robert Evans wrote:

As per my comment on the bug, I thought we were going to remove them.

MAPREDUCE-4266 only needs a little bit more work, changing a patch to a
script, before they disappear entirely.  I would much rather see dead code
die than be maintained for a few tests that are mostly testing the dead
code itself.


--Bobby

On 9/26/12 9:39 AM, Arun C Murthy 
a...@hortonworks.commailto:a...@hortonworks.com wrote:

Point. I've opened https://issues.apache.org/jira/browse/MAPREDUCE-4687
to track this.

On Sep 25, 2012, at 9:33 PM, Eli Collins wrote:

How about adding this step to the MR PreCommit jenkins job so it's run
as part test-patch?

On Tue, Sep 25, 2012 at 7:48 PM, Arun C Murthy 
a...@hortonworks.commailto:a...@hortonworks.com
wrote:
Committers,

As most people are aware, the MapReduce 'classic' tests (in
hadoop-mapreduce-project/src/test) still need to built using ant since
they aren't mavenized yet.

I've seen several commits (and 2 within the last hour i.e.
MAPREDUCE-3681 and MAPREDUCE-3682) which lead me to believe
developers/committers aren't checking for this.

Henceforth, with all changes, before committing, please do run:
$ mvn install
$ cd hadoop-mapreduce-project
$ ant veryclean all-jars -Dresolvers=internal

These instructions were already in
http://wiki.apache.org/hadoop/HowToReleasePostMavenization and I've
just updated http://wiki.apache.org/hadoop/HowToContribute.

thanks,
Arun


--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: Speculative Execution...

2012-09-13 Thread Robert Evans
Under YARN (branch-2, branch-0.23, and trunk) the speculative execution
decision is pluggable, and can be replaced by a user.  If you could come
up with a better solution to speculative execution that would be great.
We have known for a while that it is not very good (most of the time we
run a speculative task it is just wasted).  In branch-2 we have a new
version that we think is better (more conservative), but I am not sure how
much of a study has been done on exactly how much better it is or what
else can be done to make it even better.  I would look at adding your
ideas into that plugin, and not so much at using the config to turn
speculation on or off dynamically, because there are some map/reduce
applications that abuse map/reduce somewhat and will not run correctly if
speculative execution is enabled.

--Bobby Evans 

On 9/13/12 1:08 AM, Suresh S suresh...@gmail.com wrote:

Hello Sir/Madam,

   I am doing a PhD. I am interested in doing research in Hadoop and
publishing papers.
I know a little bit about speculative execution of slow tasks. I know it is
possible to enable or disable speculative execution.

But,
   has any idea been published already for dynamically enabling or
disabling speculative execution depending on the application, cluster load
and other run-time parameters?

   Is it worth doing research in this direction?

   Is this contribution worth publishing as a conference or journal
paper(s)?

*Regards*
*S.Suresh,*
*Research Scholar,*
*Department of Computer Applications,*
*National Institute of Technology,*
*Tiruchirappalli - 620015.*
*India*
*Mobile: +91-9941506562*



Re: On the topic of task scheduling

2012-09-04 Thread Robert Evans
The other thing to point out too is that in order to solve this problem
perfectly you literally have to solve the halting problem.  You have to
predict if the maps are going to finish quickly or slowly.  If they finish
quickly then you want to launch reduces quickly to start fetching data
from the mappers, if they are going to finish very slowly, then you have a
lot of reducers taking up resources not doing anything.  That is why there
is the config parameter that can be set on a per job basis to tell the AM
when to start launch maps.  We have actually been experimenting with
setting this to 100% because it improves utilization of the cluster a lot.
But be careful, there are a lot of bugs that you might run into if you do
this.  I think we have fixed all of them, but I don't know how many have
been merged into 2.1 and how many are still sitting on 2.2.
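
That parameter is the reduce slowstart threshold; setting it per job looks
like this (1.0f reproduces the '100%' experiment, 0.05f is the default):

import org.apache.hadoop.conf.Configuration;

public class SlowstartExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Fraction of maps that must complete before the AM starts launching reducers.
    conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 1.0f);
  }
}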

--Bobby

On 9/2/12 1:46 PM, Arun C Murthy a...@hortonworks.com wrote:

Vasco,

 Welcome to Hadoop!

 Your observations are all correct - in the simplest case you launch all
reduces up front (we used to do that initially) and get a good 'pipeline'
between maps, shuffle (i.e. moving map-outputs to reduces) and the reduce
itself.

 However, one thing to remember is that keeping reduces up and running
without sufficient maps being completed is a waste of resources in the
cluster. As a result, we have a simple heuristic in hadoop-1 i.e. do not
launch reduces until a certain percentage of the job's maps are complete
- by default it's set to 5%. However, there still is a flaw with it
(regardless of what you set it to be i.e. 5% or 50%). If it's too high,
you lose the 'pipeline' and too low (5%), reduces still spin waiting for
all maps to complete wasting resources in the cluster.

 Given that, we've implemented the heuristic you've described below for
hadoop-2 which is better at balancing resource-utilization v/s pipelining
or job latency.

 However, as you've pointed out there are several improvements which are
feasible. But, remember that the complexity involved has on a number of
factors you've already mentioned:
 # Job size (a job with 100m/10r v/s 10m/1r)
 # Skew for reduces
 # Resource availability i.e. other active jobs/shuffles in the system,
network bandwidth etc.

 If you look at an ideal shuffle it will look so (pardon my primitive
scribble):
 http://people.apache.org/~acmurthy/ideal-shuffle.png

 From that graph:
 # X i.e. when to launch reduces depends on resource availability, job
size & maps' completion rate.
 # Slope of shuffles (red worm) depends on network b/w, skew etc.

 None of your points are invalid - I'm just pointing out the
possibilities and complexities.

 Your points about aggregation are also valid, look at
http://code.google.com/p/sailfish/ for e.g.

 One of the advantages of hadoop-2 is that anyone can play with these
heuristics and implement your own - I'd love to help if you are
interested in playing with them.

 Related jiras:
 https://issues.apache.org/jira/browse/MAPREDUCE-4584

hth,
Arun

On Sep 2, 2012, at 9:34 AM, Vasco Visser wrote:

 Hi,
 
 I am new to the list, I am working with hadoop in the context of my
 MSc graduation project (has nothing to do with task scheduling per
 se). I came across task scheduling because I ran into the fifo
 starvation bug (MAPREDUCE-4613). Now, I am running 2.1.0 branch where
 the fifo starvation issue is solved. The behavior of task scheduling I
 observe in this branch is as follows. It begins with all containers
 allocated to mappers. Pretty quickly reducers are starting to be
 scheduled. In a linear way more containers are given to reducers,
 until about 50% (does anybody know why 50%?) of available containers
 are reducers (this point is reached when ~ 50% of the mappers are
 finished). It stays ~50-50 for until all mappers are scheduled. Only
 then the proportion of containers allocated to reducers is increased
 to > 50%.
 
 I don't think this is in general quite the optimal (in terms of total
 job completion time) scheduling behavior. The reason being that the
 last reducer can only be scheduled when a free container becomes
 available after all mappers are scheduled. Thus, in order to shorten
 total job completion time the last reducer must be scheduled as early
 as possible.
 
 For the following gedankenexperiment, assume # reducer is set to 99%
 capacity, as suggested somewhere in the hadoop docs, and that each
 reducer will process roughly the same amount of work. I am going to
 schedule as in 2.1.0, but instead of allocating reducers slowly up to
 50 % of capacity, I am just going to take away containers. Thus, the
 amount of map work is the same as in 2.1.0, only no reduce work will
 be done. At the point that the proportion of reducers would increased
 to more than 50% of the containers (i.e., near the end of the map
 phase), I schedule all reducers in the containers I took away, making
 sure that the last reducer is scheduled at the same moment as it would
 be in 2.1.0.  My claim 

Re: On the topic of task scheduling

2012-09-04 Thread Robert Evans
You are correct about my typo, should be launching reducers, not maps.

We do want a solution that is good in most cases, and preferably
automatic, because most users are not going to change any default values.
But I think you also want to give administrators of a cluster and
individual users as well the knobs to adjust if resources are better spent
on improving overall throughput of the cluster or if the run time of a job
is a higher priority.  On our clusters some jobs have a tight SLA.  We
ideally want to do what we can to meet their SLA, even if it requires
using more resources.  On the other hand, running on the same cluster will
be jobs with either no SLA or a very lenient one.  In those cases we want
to use the resources as wisely as possible so as many jobs as possible can
complete in the given time frame.  This has bigger ramifications with the
RM's scheduling, but ideally AM would also adjust its timing of requests
as well so both work together for a common goal.

--Bobby Evans  

On 9/4/12 8:59 AM, Vasco Visser vasco.vis...@gmail.com wrote:

On Tue, Sep 4, 2012 at 3:11 PM, Robert Evans ev...@yahoo-inc.com wrote:
 The other thing to point out too is that in order to solve this problem
 perfectly you literally have to solve the halting problem.  You have to
 predict if the maps are going to finish quickly or slowly.  If they
finish
 quickly then you want to launch reduces quickly to start fetching data
 from the mappers, if they are going to finish very slowly, then you
have a
 lot of reducers taking up resources not doing anything.

I agree with you that a perfect solution is not going to be feasible.
The aim should probably be a solution that is good in many cases.

 That is why there
 is the config parameter that can be set on a per job basis to tell the
AM
 when to start launching maps.

I assume you mean start launching reducers

 We have actually been experimenting with
 setting this to 100% because it improves utilization of the cluster a
lot.

thanks for pointing this out, I didn't know about this config option.
That the utilization of the cluster improves by setting this to 1
doesn't surprise me.

Maybe it is a good idea to introduce a concept like job container
time that captures how much resources a job uses in its life time.
For example, if a job uses 10 mappers each for a minute and 10
reducers also each for a minute, then the container time would be 20
minutes. Having idle reducers will increase the container time.

A conceptually simple method to optimize the container time of a job
is to let the AM monitor for each scheduled reducer how much of the
time it is waiting for mappers to produce intermediate data  (maybe
embed this in the heartbeat?). If the average waiting for all
scheduled reducers is above a certain proportion (say waiting more
than 25% of the time or smt), then the AM can decide to discard
some/all reducers and give the freed resources to mappers.
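
A rough sketch of that heuristic, purely as an illustration (ReducerStats,
the field names and the 25% threshold are all assumptions, not an existing
AM API):

import java.util.List;

public class IdleReducerPolicy {
  static final double MAX_IDLE_FRACTION = 0.25;

  /** Returns how many reducer containers to hand back to the mappers. */
  public static int reducersToPreempt(List<ReducerStats> reducers) {
    if (reducers.isEmpty()) {
      return 0;
    }
    double idleSum = 0.0;
    for (ReducerStats r : reducers) {
      // Fraction of its lifetime this reducer spent waiting for map output.
      idleSum += (double) r.idleMillis / r.runningMillis;
    }
    double avgIdle = idleSum / reducers.size();
    // If reducers are mostly waiting, free half of them up for more maps.
    return avgIdle > MAX_IDLE_FRACTION ? reducers.size() / 2 : 0;
  }

  public static class ReducerStats {
    public long idleMillis;
    public long runningMillis;
  }
}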

This is just an idea, I don't know about the feasibility. Also I
didn't think about the relationship between optimizing container time
for a single job and optimizing it for all jobs utilizing on the
cluster. Might be that minimizing for each job gives minimal overall,
but not sure.

 On 9/2/12 1:46 PM, Arun C Murthy a...@hortonworks.com wrote:

Vasco,

 Welcome to Hadoop!

 Your observations are all correct - in the simplest case you launch all
reduces up front (we used to do that initially) and get a good
'pipeline'
between maps, shuffle (i.e. moving map-outputs to reduces) and the
reduce
itself.

 However, one thing to remember is that keeping reduces up and running
without sufficient maps being completed is a waste of resources in the
cluster. As a result, we have a simple heuristic in hadoop-1 i.e. do not
launch reduces until a certain percentage of the job's maps are complete
- by default it's set to 5%. However, there still is a flaw with it
(regardless of what you set it to be i.e. 5% or 50%). If it's too high,
you lose the 'pipeline' and too low (5%), reduces still spin waiting for
all maps to complete wasting resources in the cluster.

 Given that, we've implemented the heuristic you've described below for
hadoop-2 which is better at balancing resource-utilization v/s
pipelining
or job latency.

 However, as you've pointed out there are several improvements which are
feasible. But, remember that the complexity involved depends on a number of
factors you've already mentioned:
 # Job size (a job with 100m/10r v/s 10m/1r)
 # Skew for reduces
 # Resource availability i.e. other active jobs/shuffles in the system,
network bandwidth etc.

 If you look at an ideal shuffle it will look so (pardon my primitive
scribble):
 http://people.apache.org/~acmurthy/ideal-shuffle.png

 From that graph:
 # X i.e. when to launch reduces depends on resource availability, job
size & maps' completion rate.
 # Slope of shuffles (red worm) depends on network b/w, skew etc.

 None of your points are invalid - I'm just pointing out the
possibilities

Re: Cannot create a new Jira issue for MapReduce

2012-08-09 Thread Robert Evans
It is a bit worse than that though.  I found that it did create the JIRA,
but it is in a bad state where you cannot put it in patch available or
close it. So we may need to do some cleanup of these JIRAs later.

--Bobby

On 8/9/12 3:19 PM, Ted Yu yuzhih...@gmail.com wrote:

This has been reported by HBase developers as well.

See https://issues.apache.org/jira/browse/INFRA-5131

On Thu, Aug 9, 2012 at 1:10 PM, Benoy Antony bant...@gmail.com wrote:

 Hi,

 I am getting the following error when I try to create a Jira issue.

 Error creating issue: com.atlassian.jira.util.RuntimeIOException:
 java.io.IOException: read past EOF

 Anyone else face the same problem ?

 Thanks ,
 Benoy




Re: Multi-level aggregation with combining the result of maps per node/rack

2012-07-31 Thread Robert Evans
Tsuyoshi,


There has been a lot of work happening in the shuffle phase.  It is being
made pluggable in both 1.0 and 2.0/trunk (MAPREDUCE-4049).  There is also
some work being done to reuse containers in trunk/2.0 (MAPREDUCE-3902).
This should have a similar, although perhaps more limited result, because
when different map tasks run in the same container their outputs also go
through the same combiner.  I have heard that it is showing some good
results for both small and large jobs.  There was also some work to try
and pull in Sailfish (No JIRA just ramblings on the mailing list), which
moves the shuffle phase to a separate process.  I have not seen much
happen on that front recently, but it saw some large gains on big jobs,
but is worse on small jobs.  I think that this is something very
interesting and I would encourage you to file a JIRA and pursue it.

I don't know anything about your design, so please feel free to disregard
my comments if they do not apply.  I would encourage you to think about
security on this.  When you run the combiner you need to be sure that it
runs as the user that owns the data.  This should probably not be too
difficult if you hijack a mapper task that has just finished to try and
combine the data from others on the same node.  To do this you will
probably need some sort of a coordination system in the AM to tell that
mapper what other mappers to try and combine data from.  It would be nice
to coordinate this with the container reuse work, which currently just
tells the container to run another split through.  It could be another
option to tell it to combine with the map output from container X.

Another thing to be aware of is small jobs.  It would be great to see how
this impacts small jobs, and if it has a negative impact we should look
for an automated way to turn this off or on.

Thanks for your work,

Bobby Evans

On 7/30/12 8:11 PM, Tsuyoshi OZAWA ozawa.tsuyo...@gmail.com wrote:

Hi,

We consider the shuffle cost to be a main concern in MapReduce,
in particular for aggregation processing.
The shuffle cost is also expensive in Hadoop in spite of the
existence of the combiner, because the scope of combining is limited
to within only one MapTask.

To solve this problem, I've implemented the prototype that
combines the result of multiple maps per node[1].
This is the first step to make hadoop faster with multi-level
aggregation technique like Google Dremel[2].

I took a benchmark with the prototype.
We used a WordCount program with the in-mapper combining optimization
as the benchmark. The benchmark was run on 40 nodes [3].
The input data sets are 300GB, 500GB, 1TB, and 2TB of text generated
by the default RandomTextWriter. The number of reducers is configured
as 1, on the assumption that some workloads force a single reducer,
as in Google Dremel. The result is as follows:

                         | 300GB | 500GB |   1TB |   2TB |
Normal (sec)             |  4004 |  5551 | 12177 | 27608 |
Combining per node (sec) |  3678 |  3844 |  7440 | 15591 |

Note that a MapTask runs the combiner per node every 3 minutes in
the current prototype, so the aggregation rate is very limited.

"Normal" is the result of the current Hadoop, and "Combining per node"
is the result with my optimization.  Regardless of the 3-minute
restriction, the prototype is 1.7 times faster than normal Hadoop
in the 2TB case.  Another benchmark also shows that the shuffle cost
is cut down by 50%.
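
(For readers who have not seen the pattern, the in-mapper combining used in
this benchmark is typically written along these lines; a minimal sketch, not
the prototype code itself:)

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningWordCount
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  private final Map<String, Long> counts = new HashMap<String, Long>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    for (String word : value.toString().split("\\s+")) {
      if (!word.isEmpty()) {
        Long old = counts.get(word);
        counts.put(word, old == null ? 1L : old + 1L);
      }
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // One record per distinct word instead of one per occurrence is what
    // cuts down the data volume going into the shuffle.
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
    }
  }
}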

I want to know from you guys, do you think it is a useful feature?
If yes, I will work on contributing it.
You are also welcome to tell me the benchmarks that you want me to run
with my prototype.

Regards,
Tsuyoshi


[1] The idea is also described in Hadoop wiki:
http://wiki.apache.org/hadoop/HadoopResearchProjects
[2] Dremel paper is available at:
http://research.google.com/pubs/pub36632.html
[3] The specification of each nodes is as follows:
CPU Core(TM)2 Duo CPU E7400 2.80GHz x 2
Memory 8 GB
Network 1 GbE



Re: Can we use String.intern inside WritableUtils#readString()?

2012-07-13 Thread Robert Evans
Yes I filed a JIRA for something like this a while ago MAPREDUCE-4303.  I
have not done anything with it for this very reason.  There are some
potential fixes for this: we could keep a somewhat small weak-reference
cache of these strings so that if a string is read multiple times it is
deduped, and if it is collected we don't force it to stay around too long
and it is not placed in the permgen space.  But that is not a small change.
 If you want to take over that JIRA feel free, otherwise I will get around
to it eventually.
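
A minimal sketch of the weak-reference cache idea, just to make it concrete
(this is an illustration, not the eventual MAPREDUCE-4303 patch):

import java.lang.ref.WeakReference;
import java.util.WeakHashMap;

/** Dedupes equal strings without pinning them in memory or in permgen. */
public class WeakStringCache {
  private final WeakHashMap<String, WeakReference<String>> cache =
      new WeakHashMap<String, WeakReference<String>>();

  public synchronized String dedup(String s) {
    WeakReference<String> ref = cache.get(s);
    String cached = (ref == null) ? null : ref.get();
    if (cached != null) {
      return cached;   // hand back the instance we already have
    }
    cache.put(s, new WeakReference<String>(s));
    return s;          // first time this value has been seen
  }
}

Guava's Interners.newWeakInterner() gives roughly the same behavior off the
shelf, if pulling in that dependency is acceptable.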

--Bobby Evans

On 7/12/12 1:27 PM, Ramkumar Vadali ramkumar.vad...@gmail.com wrote:

String.intern() should be used with caution. The intern'ed strings go to
the perm gen space in the java process, which is limited. You could
easily run out of that space and get OOM errors even when the total usage
is well below the Xmx value. A better way would be to have a MapString,
String that de-deplicates string objects

Ramkumar

On Thu, Jul 12, 2012 at 6:02 AM, Bhallamudi Venkata Siva Kamesh 
kames...@imaginea.com wrote:

 Hi All,
  I noticed that WritableUtils.readString(), while deserializing the
 strings, creates a string object every time. But there may be applications
 which serialize a small number of strings a huge number of times. So while
 deserializing them, this may lead to OOMs sometimes.

 I think using intern() will reduce the number of String objects that are
 created. Please correct me if my understanding is wrong.

 --
 Thanks & Regards,
 Bh.V.S.Kamesh,
 +91-9652725948









Re: Cyclic dependency in JobControl job DAG

2012-06-25 Thread Robert Evans
I personally think it is useful.  I would say contribute it.

(Moved common-dev to bcc, we try not to cross post on these lists)

--Bobby Evans

On 6/25/12 3:37 AM, madhu phatak phatak@gmail.com wrote:

Hi,
 In the current implementation of JobControl, whenever there is a cyclic
dependency between the jobs it throws a StackOverflowError.
 For example,
   ControlledJob job1 = new ControlledJob(new Configuration());
job1.setJobName("job1");
ControlledJob job2 = new ControlledJob(new Configuration());
job2.setJobName("job2");
job1.addDependingJob(job2);
job2.addDependingJob(job1);
JobControl jobControl = new JobControl("jobcontrol");
jobControl.addJob(job1);
jobControl.addJob(job2);
jobControl.run();

throws
  java.lang.StackOverflowError
at java.util.ArrayList.get(ArrayList.java:322)
at
org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.checkState(ControlledJob.java:295)

Whenever we write a complex application, there is always a possibility of
cyclic dependencies. I have written a method which checks for the cyclic
dependency upfront and informs the user. I want to know from you
guys, do you think it is a useful feature? If yes, I can contribute it as a
patch.
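
For illustration, a check like that can be a straightforward depth-first
search over the dependency edges; the sketch below assumes
ControlledJob.getDependentJobs() (the list populated by addDependingJob())
is available:

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;

public class JobControlCycleCheck {

  /** Returns true if a dependency cycle is reachable from any of the given jobs. */
  public static boolean hasCycle(List<ControlledJob> jobs) {
    Set<ControlledJob> done = new HashSet<ControlledJob>();
    Set<ControlledJob> onStack = new HashSet<ControlledJob>();
    for (ControlledJob job : jobs) {
      if (visit(job, done, onStack)) {
        return true;
      }
    }
    return false;
  }

  private static boolean visit(ControlledJob job, Set<ControlledJob> done,
                               Set<ControlledJob> onStack) {
    if (onStack.contains(job)) {
      return true;               // back edge: we walked into our own ancestry
    }
    if (done.contains(job)) {
      return false;              // already fully explored, no cycle through here
    }
    onStack.add(job);
    List<ControlledJob> deps = job.getDependentJobs();
    if (deps != null) {
      for (ControlledJob dep : deps) {
        if (visit(dep, done, onStack)) {
          return true;
        }
      }
    }
    onStack.remove(job);
    done.add(job);
    return false;
  }
}

Running hasCycle() on the jobs before calling jobControl.run() turns the
StackOverflowError above into a clean error message.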

Regards,
Madhukara Phatak
--
https://github.com/zinnia-phatak-dev/Nectar



Re: try to fix hadoop streaming bug

2012-06-14 Thread Robert Evans
It looks like your jar's MANIFEST file is missing the Main-Class attribute.  It 
may have something to do with how you created the updated jar you are using.  
Hadoop is trying to run the jar, and because it did not find the MainClass in 
the jar's manifest it thinks you are supplying it as the next argument, and 
looking for the -mapper class, which obviously does not exist.  You can either 
update the MANIFEST when you build the jar, or you can supply the main class on 
the command line like

hadoop jar path/hadoop-streaming.jar org.apache.hadoop.streaming.HadoopStreaming 
-mapper ...

--Bobby Evans


On 6/14/12 5:01 AM, HU Wenjing A wenjing.a...@alcatel-sbell.com.cn wrote:

Hi all,

   I tried to fix the hadoop streaming bug for the version 0.21.0 (streaming 
overrides user given output key and value types). I saw some useful message 
about this issue on 
https://issues.apache.org/jira/browse/MAPREDUCE-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 and modified some code following the patch file. I modified and compiled the 
code. It seems only about thirteen .java files need to be modified. But when I 
tried to replace the old .class files with the new ones, I could only find 
StreamJob.class in ${hadoop_home}/ 
/root/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar.  And 
the other twelve modified files couldn't be found in any jar files in the 
${hadoop_home} directory.
Then I executed the command bin/hadoop jar 
mapred/contrib/streaming/hadoop-0.21.0-streaming.jar  -mapper 
org.apache.hadoop.mapred.lib.IdentityMapper  -reducer NONE -input input -output 
output with the modified streaming jar and just received some error 
information:

Exception in thread "main" java.lang.ClassNotFoundException: -mapper
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:185)


And I think this error should have something to do with the modification of 
StreamJob.java. But I saw someone say they have fixed the streaming 
override issue using the patch.
So, could anyone give me some suggestions about this issue? Or just give me 
another way to fix the bug?

Thanks in advance!  : )

Thanks & best regards,
Wenjing




Re: Hadoop optimization for Lustre FS

2012-05-16 Thread Robert Evans
Zam,

http://wiki.apache.org/hadoop/HowToContribute is a wiki that can tell you in 
more detail the steps you need to do for this. In general though to push the 
patch upstream you want to file a Map/Reduce JIRA, and attach your patch.  
After that several people from the community are likely to comment on the JIRA. 
 If you don't get feedback you can bug us on the dev mailing list about it.  As 
part of this you are also going to need to do a port to trunk, as we do not 
want to have new features go into any line without having it go into trunk as 
well.  Even though this sounds potentially complex because trunk uses YARN 
instead of the previous Map/Reduce-specific framework, both 1.0 and trunk are in 
the process of getting a pluggable shuffle service (MAPREDUCE-4049).  It would 
probably be best to port your patch to be a plugin for this.  Then hopefully 
the porting between trunk and 1.0 will be relatively simple.
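
As an illustration of what the plugin route looks like from the job side once
MAPREDUCE-4049 is in, it should come down to pointing a property at your class;
the property name below is taken from the later 2.x line and the class name is
hypothetical, so treat this only as a sketch:

import org.apache.hadoop.conf.Configuration;

public class ShufflePluginConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Have the reduce side fetch map output through a custom plugin
    // (e.g. one that hard-links files on a shared Lustre mount) instead
    // of the default HTTP shuffle.
    conf.set("mapreduce.job.reduce.shuffle.consumer.plugin.class",
             "com.example.LustreShuffleConsumerPlugin");
  }
}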

If this is the route you want to go you should put 1.1 and 3.0.0 as the target 
versions of the JIRA.  3.0.0 corresponds to trunk, and 1.1 is the next release 
of the 1 line that is accepting new major feature work.  You probably also want 
to link your JIRA to the MAPREDUCE-4049 JIRA as a dependency, if you are making 
it a plugin.

In addition because this is an optimization it would be nice to have some 
information in the JIRA showing the benchmarks you ran and the performance 
improvements you got.  Ultimately we are also going to want to have some 
documentation about this as well, but that is something that can come later 
after you lock down the code more.

--Bobby Evans


On 5/16/12 3:34 AM, Alexander Zarochentsev 
alexander_zarochent...@xyratex.com wrote:

Hello,

there is an optimization for Hadoop on Lustre FS, or any
high-performance distributed filesystem.

The research paper with test results can be found here
http://www.xyratex.com/pdfs/whitepapers/Xyratex_white_paper_MapReduce_1-4.pdf
and a presentation for LUG 2011:
http://www.olcf.ornl.gov/wp-content/events/lug2011/4-12-2011/1100-1130_Nathan_Rutman_MapReduce_Lug_2011.pptx

Basically the optimization is a replacement for the HTTP transport in the
shuffle phase by simply linking the target file to the source one. I
attached a draft patch against hadoop-1.0.0 to illustrate the idea.
How to push this patch upstream?

Thanks,
--

Alexander Zam Zarochentsev
alexander_zarochent...@xyratex.com






Re: Building first time

2012-05-09 Thread Robert Evans
http://wiki.apache.org/hadoop/HowToContribute is the best place to start.  
Checking the code in through git will not trigger a Jenkins build, unless you 
have a special setup that goes beyond what Apache provides.  You do not need to 
compile the entire tree to get Map/Reduce, but typically it is not a big deal 
to compile everything.

--Bobby Evans

On 5/8/12 11:50 PM, Radim Kolar h...@filez.com wrote:

I am interested in working on the mapreduce package, so I am not sure if I need
to compile the whole tree.

I work on branch-0.23. It can just be imported into SpringToolsSuite,
then click on Run - Maven - and type in the 'compile' target. It compiles the
module; it just fails on the Avro stuff. But it is good enough that you can
edit it in Eclipse with some comfort. Then just commit to git and let
Jenkins on Unix build it for you.



Mixed Mode Environments

2012-02-02 Thread Robert Evans
I just noticed that HADOOP-7484 and MAPREDUCE-3500 recently got committed to
trunk and 0.23.  I missed them before they were committed.  I am curious if
we are dropping support for running Hadoop in mixed mode environments?
Meaning I want Hadoop to run as 32-bit by default, because that is faster
then 64-bit, but if one of my users wants to launch a mapper or reducer in a
64-bit JVM to have access to more memory they can, and the native libraries
should be able to work with them.

--Bobby Evans



Re: Status of the completed containers (0.23)

2012-01-09 Thread Robert Evans
Praveen,

Looking at the code, it does not appear to currently be used outside of 
testing.  I really don't know.  Perhaps in the future if it is extended then it 
might be used more.  Or perhaps the author of the API added it in for 
completeness. Just speculating.

--Bobby Evans

On 1/9/12 7:42 AM, Praveen Sripati praveensrip...@gmail.com wrote:

Hi,

Documentation says that the NM sends the status of the completed containers
to the RM and the RM sends it to the AM. This is the interface (1) below.
What is the purpose of the interface (2)?

1) AMRMProtocol has the below method. The AllocateResponse has the list of
completed containers.

  public AllocateResponse allocate(AllocateRequest request)
  throws YarnRemoteException;

2) ContainerManager has the below method. The GetContainerStatusResponse
has the status of the container.

  GetContainerStatusResponse getContainerStatus(
  GetContainerStatusRequest request) throws YarnRemoteException;

Regards,
Praveen



Re: Reduce output is strange

2011-12-19 Thread Robert Evans
It looks mostly correct to me.  I am not an expert on sequence files, and I 
have not checked the text against the spec nor have I checked the binary 
numbers in it to be sure they add up to the correct lengths etc, but it looks 
good from a first glance.  I can see the SEQ tag at the beginning to mark it as 
a sequence file and the org.apache.hadoop.io.Text as the type for both the keys 
and the values.

--Bobby Evans

On 12/19/11 7:51 AM, Pedro Costa psdc1...@gmail.com wrote:

Hi,

In the hadoop MapReduce, I've executed the webdatascan example, and the
reduce output is in a SequenceFile. The result is shown here (
http://paste.lisp.org/display/126572). What's the trash (random
characters), like u 265
100 330 320 252  \n # ; 374 5 211 V ' 340 376 in the output? Is the
output correct?


000   S   E   Q 006 031   o   r   g   .   a   p   a   c   h   e   .
020   h   a   d   o   o   p   .   i   o   .   T   e   x   t 031   o
040   r   g   .   a   p   a   c   h   e   .   h   a   d   o   o   p
060   .   i   o   .   T   e   x   t  \0  \0  \0  \0  \0  \0   u 265
100 330 320 252 \n   #   ; 374   5 211   V   ' 340 376  \0  \0
120  \0   X  \0  \0  \0 037   a   p   p   l   e   a   p   p
140   l   e   b   a   n   a   n   a   a   p   p   l   e
160   a   p   p   l   e   7   c   a   r   r   o   t   c   a
200   r   r   o   t   c   a   r   r   o   t   c   a   r   r
220   o   t   a   p   p   l   e   b   a   n   a   n   a
240   c   a   r   r   o   t   b   a   n   a   n   a
256


--
Thanks,



Re: Reduce output is strange

2011-12-19 Thread Robert Evans
Oh, I forgot to say that part of the random characters are actually random 
characters.  Sequence files store a set of random characters as sync points 
within the file.  This allows for splitting the file easily without a high risk 
that the random sequence appears inside the data itself just by chance.

--Bobby Evans

On 12/19/11 7:51 AM, Pedro Costa psdc1...@gmail.com wrote:

Hi,

In the hadoop MapReduce, I've executed the webdatascan example, and the
reduce output is in a SequenceFile. The result is shown here (
http://paste.lisp.org/display/126572). What's the trash (random
characters), like u 265
100 330 320 252  \n # ; 374 5 211 V ' 340 376 in the output? Is the
output correct?


000   S   E   Q 006 031   o   r   g   .   a   p   a   c   h   e   .
020   h   a   d   o   o   p   .   i   o   .   T   e   x   t 031   o
040   r   g   .   a   p   a   c   h   e   .   h   a   d   o   o   p
060   .   i   o   .   T   e   x   t  \0  \0  \0  \0  \0  \0   u 265
100 330 320 252 \n   #   ; 374   5 211   V   ' 340 376  \0  \0
120  \0   X  \0  \0  \0 037   a   p   p   l   e   a   p   p
140   l   e   b   a   n   a   n   a   a   p   p   l   e
160   a   p   p   l   e   7   c   a   r   r   o   t   c   a
200   r   r   o   t   c   a   r   r   o   t   c   a   r   r
220   o   t   a   p   p   l   e   b   a   n   a   n   a
240   c   a   r   r   o   t   b   a   n   a   n   a
256


--
Thanks,



Re: Multiple resource requests for a given node (or all nodes)?

2011-12-13 Thread Robert Evans
Arun,

I am saying that I don't know what the correct solution is to updating the 
scheduler interface.  Perhaps the correct solution is no change, I have not 
taken the time to think about it much.  What I am saying is that there are a 
number of new features that are likely going to be going into the scheduler, 
and if we are going to change the interface, I want to be sure that we think 
about these use cases before we change it.  That is all I am saying.  I am not 
advocating for a particular interface at this point, as I said I have not taken 
the time to think about it in depth.

--Bobby Evans

On 12/13/11 12:42 AM, Arun C Murthy a...@hortonworks.com wrote:

I'd argue that Robert is complaining that the interface *is not* MR-centric 
enough.

IAC, priorities is fairly generic. MR AM uses it to get constraints to stick.

Arun

On Dec 12, 2011, at 7:50 PM, Patrick Wendell wrote:

 Todd - that's a good question and I haven't looked closely into
 whether simply adding a multimap is enough or if there are more deeply
 seated issues (at least to address this specific case). If it's the
 former I'll probably just submit a patch.

 Arun - that seems like a hack but I guess it is a sufficient
 workaround for current applications.

 I'm finishing up a bare-bones version of the Fair Scheduler right now
 (going to throw something up for review soon) but I haven't yet added
 preemption.  How this is going to work well with various types of
 applications is unclear. In the MR case we can probably just preempt
 based on priorities, since they are essentially just ordering
 constraints right now. As Robert points out, this interface is very
 MR-Centric right now - i'm not sure this generalizes well to other
 applications depending on how they use priorities.

 - Patrick

 On Mon, Dec 12, 2011 at 1:27 PM, Arun C Murthy a...@hortonworks.com wrote:
 Use priorities to ask for different resource types.

 Arun

 On Dec 10, 2011, at 12:23 PM, Patrick Wendell wrote:

 If you look at how resource requests are stored now, they use a map
 keyed on the node hostname.

 == AppSchedulingInfo.java ==

  final Map<Priority, Map<String, ResourceRequest>> requests =
      new HashMap<Priority, Map<String, ResourceRequest>>();

 

 What happens if an application wants to request multiple container
 types on a given node. E.g. say I need 10 2GB containers and 10 1GB
 containers, and I don't care which node they are on (i.e. RMNode.ANY).
 I really want to store 2 resource requests under RMNode.ANY in this
 case... don't I?

 Is the model just that an AM would ask for these in series?

 - Patrick





Re: Multiple resource requests for a given node (or all nodes)?

2011-12-12 Thread Robert Evans
I think there may be some need for a bigger redesign in how requests are made 
to the scheduler because the only use case really was map/reduce at the time it 
was designed.  It works very well for that purpose but has missed a few other 
use cases.  For example there could be something like  HBase where it wants a 
specific number of nodes with no overlap on the same physical machines (Yes you 
can do it now but it may take many iterations to get it right).   Or perhaps 
like with MPI or Storm where they don't really care where the nodes are so long 
as they are all relatively close to one another in the network topology.  Or 
things like with MPI where it cannot start any processing until all of the 
containers are ready (gang scheduling).

It gets even more complicated if we want to support preemption like with the 
fair scheduler.  Which imo is needed even more once MPI and other potentially 
very long lived jobs start to coexist with shorter jobs with tight SLAs.  In 
order to make a good decision about what to preempt the scheduler needs to know 
that if it preempts a mapper, even though it may have been running a lot 
shorter time then some reducer in the same application it is likely to slow 
things down further then if it preempts that reducer.  Or if it preempts an MPI 
node it might was well kill the entire application and start over, unless we 
some how give the scheduler the ability to tell MPI that it is going to be 
preempted and it needs to save its state away.  But even then the scheduler 
needs to know that preempting an MPI job will cause all progress on it, and all 
of the containers it is holding, to stop.

Even if we are not putting any of these scheduling features in now we need to 
think about them when designing the interface to not limit ourselves and force 
us to change things drastically later on.  I am just saying that I am not sure 
just switching to a multimap is enough.

--
Bobby Evans

On 12/10/11 6:21 PM, Todd Lipcon t...@cloudera.com wrote:

On Sat, Dec 10, 2011 at 12:23 PM, Patrick Wendell
pwend...@eecs.berkeley.edu wrote:

 What happens if an application wants to request multiple container
 types on a given node. E.g. say I need 10 2GB containers and 10 1GB
 containers, and I don't care which node they are on (i.e. RMNode.ANY).
 I really want to store 2 resource requests under RMNode.ANY in this
 case... don't I?

 Is the model just that an AM would ask for these in series?

My hunch is that this was overlooked because the resource sizes for MR
are basically set on a per-task-type level. That is, maps need X MB
and reduces need Y MB. Since maps and reduces are set at different
'priorities', they haven't conflicted.

Does it seem straightforward to change it to a multimap? Guava has a
nice implementation.
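
For the record, the multimap version of that structure would look roughly like
the sketch below (RequestBook is just an illustrative name, not proposed code):

import com.google.common.collect.HashMultimap;
import com.google.common.collect.Multimap;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class RequestBook {
  // For a single priority, one host key (or ANY) can now hold several
  // outstanding requests, e.g. ten 2GB containers and ten 1GB containers.
  private final Multimap<String, ResourceRequest> requests =
      HashMultimap.create();

  public void add(String hostOrAny, ResourceRequest request) {
    requests.put(hostOrAny, request);
  }

  public Iterable<ResourceRequest> get(String hostOrAny) {
    return requests.get(hostOrAny);
  }
}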

-Todd
--
Todd Lipcon
Software Engineer, Cloudera



Re: Incremental builds in 0.23 using Maven

2011-12-07 Thread Robert Evans
Praveen,

One thing to be aware of with removing the clean is that I have run into 
situations, in both hadoop and in other projects, where an API changed as part 
of the update or something and maven did not realize it and did not rebuild 
something that depended on it.  I then got a runtime error and after doing a 
clean I got a compile error.  I have also seen situations where the tar.gz 
package for hadoop was not properly updated and after deploying, it did not 
have my fix in it.  It took me a long time to figure out why my fix did not 
work.  After doing a clean and rebuilding/redeploying everything worked just 
fine.  I have not had the time to dig through why that is happening or file any 
JIRAs on them.  Also it is rather rare, but if I want to be sure something has 
my changes in it I always do a clean first.

--Bobby Evans

On 12/7/11 7:42 AM, Harsh J ha...@cloudera.com wrote:

Praveen,

Obviously, a clean target will wipe out all your existing build directories, 
and hence the other things start from scratch. That is your slowdown-causer.

Just remove the clean from that command and you're good to go.

On 07-Dec-2011, at 6:37 PM, Praveen Sripati wrote:

 Alejandro,

 Here is the command I use for branch-0.23

 mvn clean install package -Pdist -Dtar -DskipTests -Dmaven.javadoc.skip=true

 Regards,
 Praveen

 On Wed, Dec 7, 2011 at 11:24 AM, Alejandro Abdelnur t...@cloudera.comwrote:

 what is your 'do a build' command in both cases?

 On Tue, Dec 6, 2011 at 6:06 PM, Praveen Sripati praveensrip...@gmail.com
 wrote:

 Alejandro,

 Here is the sequence

 1. 'svn get '
 2. do a build
 3. 'svn up' with no changes
 4. do a build

 Tasks (2) and (4) are taking almost equal time. I expected task (4) to be
 much faster.

 Regards,
 Praveen

 On Tue, Dec 6, 2011 at 11:08 PM, Alejandro Abdelnur t...@cloudera.com
 wrote:

 Maven does incremental builds.

 taking time as in?

 Thanks.

 Alejandro

 On Tue, Dec 6, 2011 at 6:31 AM, Praveen Sripati 
 praveensrip...@gmail.com
 wrote:

 Could someone please respond to the below query?

 Regards,
 Praveen

 On Tue, Nov 22, 2011 at 11:43 AM, Praveen Sripati
 praveensrip...@gmail.comwrote:

 Hi,

 Does Maven support incremental builds? After `svn up', the build is
 taking
 time even without any updates from svn.

 Thanks,
 Praveen










Re: Automatically Documenting Apache Hadoop Configuration

2011-12-05 Thread Robert Evans
From my work on yarn trying to document the configs there and to standardize 
them, writing anything that is going to automatically detect config values 
through static analysis is going to be very difficult.  This is because most 
of the configs in yarn are now built up using static string concatenation.

public static String BASE = "yarn.base.";
public static String CONF = BASE + "config";

I am not sure that there is a good way around this short of using a full java 
parser to trace out all method calls, and try to resolve the parameters.  I 
know this is possible, just not that simple to do.
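
One cheap alternative to parsing the source, offered only as a sketch: because
the concatenation ends up in public static String fields, loading the constants
class and dumping the resolved values with reflection gets you the final key
names without any parsing (the class name below is just an example argument):

import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

public class DumpConfigKeys {
  public static void main(String[] args) throws Exception {
    // e.g. java DumpConfigKeys org.apache.hadoop.yarn.conf.YarnConfiguration
    Class<?> clazz = Class.forName(args[0]);
    for (Field f : clazz.getFields()) {
      if (Modifier.isStatic(f.getModifiers()) && f.getType() == String.class) {
        // The BASE + "config" concatenation is already resolved by the time
        // we read the field, so this prints the full key name.
        System.out.println(f.getName() + " = " + f.get(null));
      }
    }
  }
}

What this cannot do, of course, is tell you the default value or the
description, which is why the *-default.xml entries still matter.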

I am +1 for anything that will clean up configs and improve the documentation 
of them.  Even if we have to rewire or rewrite a lot of the Configuration class 
to make things work properly.

--Bobby Evans

On 12/5/11 11:54 AM, Harsh J ha...@cloudera.com wrote:

Praveen,

(Inline.)

On 05-Dec-2011, at 10:14 PM, Praveen Sripati wrote:

 Hi,

 Recently there was a query about the Hadoop framework being tolerant of
 map/reduce task failures towards job completion. And the solution was to
 set the 'mapreduce.map.failures.maxpercent' and
 'mapreduce.reduce.failures.maxpercent' properties. Although this feature
 was introduced a couple of years back, it was not documented. I had a similar
 experience with the 0.23 release also.

I do not know if we recommend using config strings directly when there's an API 
in Job/JobConf supporting setting the same thing. Just saying - that there was 
javadoc already available on this. But of course, it would be better if the 
tutorial covered this too. Doc-patches welcome!

 It would be really good for Hadoop adoption to automatically dig and
 document all the existing configurable properties in Hadoop and also to
 identify newly added properties in a particular release during the build
 process. Documentation would also lead to fewer queries in the forums.
 Cloudera has done something similar [1], though it's not 100% accurate, it
 would definitely help to some extent.

I'm +1 for this. We do request and consistently add entries to *-default.xml 
files if we find them undocumented today. I think we should also enforce it at 
the review level, so that patches do not go in undocumented -- at minimum the 
configuration tweaks at least.



Re: Start Nodemanager with webapp disabled.

2011-10-05 Thread Robert Evans
The simplest way is to use ephemeral ports.  Set the port number to 0 in the 
config and the node manager will pick a free port to listen on.  It will then 
heartbeat back into the Resource Manager with the port it is listening on and 
the RM can pass that info off to whoever else needs it.  I am not positive that 
this will work in all cases as I have not tried it myself.  There is some work 
to enable this in the mini yarn cluster.
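
Concretely, the ephemeral-port trick is just the :0 form of the address
Prashant mentions; a sketch using Configuration (setting the same value in
yarn-site.xml works the same way):

import org.apache.hadoop.conf.Configuration;

public class EphemeralWebappPort {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Port 0 lets the OS pick any free port; the NM then reports the real
    // port back to the RM in its heartbeat.
    conf.set("yarn.nodemanager.webapp.address", "0.0.0.0:0");
  }
}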

--Bobby Evans

On 10/5/11 3:24 AM, Prashant Sharma prashant.ii...@gmail.com wrote:

Hi all,
Is it possible to start the NM daemon with the webapp disabled? Overriding
the port address in yarn-site is not an option, since I want more than one
NM to be started. I tried passing the property
-Dyarn.nodemanager.webapp.address=localhost:port in the command line
options; unfortunately that does not seem to override the default.

Thanks
Prashant.



Re: Regarding 'branch-0.20-security'

2011-09-28 Thread Robert Evans
It is kind of a long history and I will try to leave out all of the politics 
involved to make it shorter.  For a long time 0.20 has been the stable release 
of Hadoop.  It is supposedly in sustaining releases now, but many new features 
keep going in because that is what most people use in production and they do 
not want to wait several years for a new interesting feature.  One of the very 
big features that went in is security.  There was a separate branch created for 
it which is branch-0.20-security.  Branch-0.20-security has essentially 
replaced branch-0.20 as the 0.20 release branch.  All features that go into 
branch-0.20-security or any other release branch are also supposed to go into 
trunk first, if they are not specific to branch-0.20-security.  So in theory 
everything in any release has also been applied to trunk.

--Bobby Evans

On 9/28/11 9:01 AM, Praveen Sripati praveensrip...@gmail.com wrote:

Hi,

There seem to be continuous changes to 'branch-0.20-security' and also
there are references to it once in a while in the mailing list. What is the
significance of 'branch-0.20-security'? Do all the security-related
features go into this branch and then get ported to others?

Thanks,
Praveen



Re: RecommenderJob Mahout Creating a data model

2011-09-14 Thread Robert Evans
This should probably be directed more toward the Mahout list than the Hadoop 
Map/Reduce one.

mahout-u...@apache.org

--Bobby Evans

On 9/14/11 6:28 AM, Amit Sangroya sangroyaa...@gmail.com wrote:

Hi all,

I am trying to run the example from
https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering
,

with the following command bin/mahout
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
-Dmapred.input.dir=input -Dmapred.output.dir=output --itemsFile itemfile
--tempDir tempDir

The algorithm estimates the preference of a user towards an item which he/she
has not yet seen. Once an algorithm can predict preferences it can also be
used to do Top-N-Recommendation where the task is to find the N items a
given user might like best. It is mentioned that given a DataModel, it can
produce recommendations.

The algorithm takes approx. 5 minutes to generate top 5 recommendations for
one user on a 10-node Hadoop cluster. The input is shortened to only
200 users from the 1 Million MovieLens dataset from Grouplens.org.

I have few questions:

1) I want to know if it is possible to isolate the data model building
step from generating recommendations.

2) Can we use the model, once generated using the training data, for
generating recommendations for a range of users?

3) To be specific, if I want to provide an on-line service that generates
recommendations for users, can I minimize the cost of MapReduce interactions
each time?

I am not a data mining expert. Please help me to understand this in a better
way.


Thanks and Regards,
Amit



500 error in review board

2011-09-12 Thread Robert Evans
Whenever I try to post a new patch to review board I get a 500 error.

Something broke! (Error 500)
 

 It appears something broke when you tried to go to here. This is either a
bug in Review Board or a server configuration error. Please report this to
your administrator.


Who should I talk to/report this to?

Thanks,

Bobby Evans



MAPREDUCE-2864 Has been merged to trunk and 0.23

2011-09-09 Thread Robert Evans
MAPREDUCE-2864 was an effort to rename and reorganize the YARN configuration
parameters to make them consistent.  If you are setting anything in your
yarn-site.xml then you will need to update your configuration.  The patch
did not provide backwards compatible mappings because there has never been a
release with these configs in it.

I have provided a script that should hopefully do the conversion for you.

https://issues.apache.org/jira/secure/attachment/12492495/update.pl

I have not fully tested it so please double check the results when it is
run.  This script is a bit of a hack, but it should take a config file name
on the command line as input and update all of the configs in it to use the
newly renamed ones. The original file will be saved with .orig at the end.

If you do have any problems with this please feel free to respond to this
e-mail and I will do my best to help you out.

--Bobby Evans



Re: MAPREDUCE-2864 Has been merged to trunk and 0.23

2011-09-09 Thread Robert Evans
A quick update.  I found a bug in the script, and it has now been fixed.
Please use this script instead.

https://issues.apache.org/jira/secure/attachment/12493787/update.pl

--Bobby Evans

On 9/9/11 8:53 AM, Robert Evans ev...@yahoo-inc.com wrote:

 MAPREDUCE-2864 was an effort to rename and reorganize the YARN configuration
 parameters to make them consistent.  If you are setting anything in your
 yarn-site.xml then you will need to update your configuration.  The patch did
 not provide backwards compatible mappings because there has never been a
 release with these configs in it.
 
 I have provided a script that should hopefully do the conversion for you.
 
 https://issues.apache.org/jira/secure/attachment/12492495/update.pl
 
 I have not fully tested it so please double check the results when it is run.
 This script is a bit of a hack, but it should take a config file name on the
 command line as input and update all of the configs in it to use the newly
 renamed ones. The original file will be saved with .orig at the end.
 
 If you do have any problems with this please feel free to respond to this
 e-mail and I will do my best to help you out.
 
 --Bobby Evans



Re: MRv1 in 0.23+

2011-09-07 Thread Robert Evans
There is a MiniYarnCluster and a MiniMRYarnCluster; it is just that the tests 
have not been ported over to use them yet.

--Bobby

On 9/7/11 2:01 PM, Eli Collins e...@cloudera.com wrote:

My understanding is that the MR1 code is currently needed to run the tests
because there is no Mini MR cluster for MR2.  So the code is needed until
the tests can run against MR2 (not sure if there's an effort underway).
However, see MR-2736: if we remove the ability to run the daemons I don't
think we need to maintain, e.g., the code for security patches. I.e. it seems
like 23 and trunk should be able to ignore the LTC fixes.

Thanks,
Eli

On Wed, Sep 7, 2011 at 11:22 AM, milind.bhandar...@emc.com wrote:

 Folks,

 Has the community decided how long MRv1 will remain part of the codebase
 after 0.23? The reason I am asking is, for those who are working on
 forward-porting LinuxTaskController fixes (from 0.20.2xx) to 0.22, will
 they have to patch 0.23 and trunk as well? Or should these branches be
 left alone?

 - Milind

 ---
 Milind Bhandarkar
 Greenplum Labs, EMC
 (Disclaimer: Opinions expressed in this email are those of the author, and
 do not necessarily represent the views of any organization, past or
 present, the author might be affiliated with.)





Re: Get Hadoop 0.24.0-SNAPSHOT ready for Eclipse fails on retrieve hadoop-yarn-common jar

2011-09-02 Thread Robert Evans
I believe that if you take off the -e then it will work.  If not, run mvn 
eclipse:clean and then mvn eclipse:eclipse.  It worked for me yesterday.

--Bobby


On 9/2/11 4:53 AM, Mario Pastorelli pastorelli.ma...@gmail.com wrote:

Hi all,
I'm trying to download and prepare Hadoop trunk to be used on Eclipse
using https://wiki.apache.org/hadoop/EclipseEnvironment but I'm having
problems with Yarn. In particular the command

mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true -e

outputs this (this is just the end; the other goals compile):

[INFO]

[INFO] Building hadoop-yarn-api 0.24.0-SNAPSHOT
[INFO]

[INFO]
[INFO]  maven-eclipse-plugin:2.8:eclipse (default-cli) @
hadoop-yarn-api 
[INFO]
[INFO] --- maven-antrun-plugin:1.6:run
(create-protobuf-generated-sources-directory) @ hadoop-yarn-api ---
[INFO] Executing tasks

main:
[INFO] Executed tasks
[INFO]
[INFO] --- exec-maven-plugin:1.2:exec (generate-sources) @
hadoop-yarn-api ---
[INFO]
[INFO] --- build-helper-maven-plugin:1.5:add-source (add-source) @
hadoop-yarn-api ---
[INFO] Source directory:
/home/rief/Programmazione/Java/Hadoop/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/target/generated-sources/proto

added.
[INFO]
[INFO]  maven-eclipse-plugin:2.8:eclipse (default-cli) @
hadoop-yarn-api 
[INFO]
[INFO] --- maven-eclipse-plugin:2.8:eclipse (default-cli) @
hadoop-yarn-api ---
[INFO] Using Eclipse Workspace: null
[INFO] Adding default classpath container:
org.eclipse.jdt.launching.JRE_CONTAINER
[INFO] File
/home/rief/Programmazione/Java/Hadoop/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/.project

already exists.
Additional settings will be preserved, run mvn eclipse:clean if
you want old settings to be removed.
[INFO] Wrote Eclipse project for hadoop-yarn-api to
/home/rief/Programmazione/Java/Hadoop/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api.
[INFO]
[INFO]
[INFO]

[INFO] Building hadoop-yarn-common 0.24.0-SNAPSHOT
[INFO]

[INFO]
[INFO]  maven-eclipse-plugin:2.8:eclipse (default-cli) @
hadoop-yarn-common 
[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Apache Hadoop Project POM . SUCCESS [1.437s]
[INFO] Apache Hadoop Annotations . SUCCESS [0.153s]
[INFO] Apache Hadoop Project Dist POM  SUCCESS [0.032s]
[INFO] Apache Hadoop Assemblies .. SUCCESS [0.059s]
[INFO] Apache Hadoop Auth  SUCCESS [0.222s]
[INFO] Apache Hadoop Auth Examples ... SUCCESS [0.138s]
[INFO] Apache Hadoop Common .. SUCCESS [2.172s]
[INFO] Apache Hadoop Common Project .. SUCCESS [0.017s]
[INFO] Apache Hadoop HDFS  SUCCESS [2.305s]
[INFO] Apache Hadoop HDFS Project  SUCCESS [0.016s]
[INFO] hadoop-yarn-api ... SUCCESS [1.964s]
[INFO] hadoop-yarn-common  FAILURE [0.234s]
[INFO] hadoop-yarn-server-common . SKIPPED
[INFO] hadoop-yarn-server-nodemanager  SKIPPED
[INFO] hadoop-yarn-server-resourcemanager  SKIPPED
[INFO] hadoop-yarn-server-tests .. SKIPPED
[INFO] hadoop-yarn-server  SKIPPED
[INFO] hadoop-yarn ... SKIPPED
[INFO] hadoop-mapreduce-client-core .. SKIPPED
[INFO] hadoop-mapreduce-client-common  SKIPPED
[INFO] hadoop-mapreduce-client-shuffle ... SKIPPED
[INFO] hadoop-mapreduce-client-app ... SKIPPED
[INFO] hadoop-mapreduce-client-hs  SKIPPED
[INFO] hadoop-mapreduce-client-jobclient . SKIPPED
[INFO] hadoop-mapreduce-client ... SKIPPED
[INFO] hadoop-mapreduce .. SKIPPED
[INFO] Apache Hadoop Main  SKIPPED
[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 11.370s
[INFO] Finished at: Fri Sep 02 10:58:42 CEST 2011
[INFO] Final Memory: 26M/269M
[INFO]

[ERROR] Failed to execute goal on project hadoop-yarn-common: Could not
resolve dependencies for project
org.apache.hadoop:hadoop-yarn-common:jar:0.24.0-SNAPSHOT: Failure to
find org.apache.hadoop:hadoop-yarn-api:jar:0.24.0-SNAPSHOT in

Re: Jenkins's Links to FindBugs warnings not useful

2011-09-02 Thread Robert Evans
You can do mvn findbugs:gui and then open up each of the findbugsXml.xml files 
manually.  Or you should be able to run mvn site to generate HTML.  You may 
need to modify the pom.xml file to include findbugs in the report section 
though.


On 9/2/11 9:38 AM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote:

Oh, I also just found this working link
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/lastSuccessfulBuild/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html on
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/ . Seems that the
artifacts are there only for the lastSuccessfulBuild though.

+Vinod


On Fri, Sep 2, 2011 at 8:03 PM, Vinod Kumar Vavilapalli 
vino...@hortonworks.com wrote:

 None of the links to the warnings related to FindBugs by Jenkins on
 submitting patch are working. You can see any of the JIRAs being built at
 https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/.

 OTOH, I ran ~/Applications/full-packages/apache-maven-3.0.3/bin/mvn clean
 test findbugs:findbugs -DskipTests -DHadoopPatchProcess to generate the
 warnings on my local box. I do see a bunch of findBugsXml.xml files which seem
 to indicate warnings, but they are hardly readable. Does anyone know how
 to generate html reports locally? Giri?

 Thanks,
 +Vinod




Trunk and 0.23 build failing with clean .m2 directory

2011-08-29 Thread Robert Evans
I am getting the following errors when I try to build either trunk or 0.23
with a clean maven cache.  I don't get any errors if I use my old cache.

[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @
hadoop-yarn-common ---
[INFO] Compiling 2 source files to
/home/evans/src/hadoop-git/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-
common/target/classes
[INFO] 
[INFO] 

[INFO] Building hadoop-yarn-server-common 0.24.0-SNAPSHOT
[INFO] 

[INFO] 

[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Hadoop Project POM . SUCCESS [0.714s]
[INFO] Apache Hadoop Annotations . SUCCESS [0.323s]
[INFO] Apache Hadoop Project Dist POM  SUCCESS [0.001s]
[INFO] Apache Hadoop Assemblies .. SUCCESS [0.025s]
[INFO] Apache Hadoop Alfredo . SUCCESS [0.067s]
[INFO] Apache Hadoop Common .. SUCCESS [2.117s]
[INFO] Apache Hadoop Common Project .. SUCCESS [0.001s]
[INFO] Apache Hadoop HDFS  SUCCESS [1.419s]
[INFO] Apache Hadoop HDFS Project  SUCCESS [0.001s]
[INFO] hadoop-yarn-api ... SUCCESS [7.019s]
[INFO] hadoop-yarn-common  SUCCESS [2.181s]
[INFO] hadoop-yarn-server-common . FAILURE [0.058s]
[INFO] hadoop-yarn-server-nodemanager  SKIPPED
[INFO] hadoop-yarn-server-resourcemanager  SKIPPED
[INFO] hadoop-yarn-server-tests .. SKIPPED
[INFO] hadoop-yarn-server  SKIPPED
[INFO] hadoop-yarn ... SKIPPED
[INFO] hadoop-mapreduce-client-core .. SKIPPED
[INFO] hadoop-mapreduce-client-common  SKIPPED
[INFO] hadoop-mapreduce-client-shuffle ... SKIPPED
[INFO] hadoop-mapreduce-client-app ... SKIPPED
[INFO] hadoop-mapreduce-client-hs  SKIPPED
[INFO] hadoop-mapreduce-client-jobclient . SKIPPED
[INFO] hadoop-mapreduce-client ... SKIPPED
[INFO] hadoop-mapreduce .. SKIPPED
[INFO] Apache Hadoop Main  SKIPPED
[INFO] 

[INFO] BUILD FAILURE
[INFO] 

[INFO] Total time: 14.938s
[INFO] Finished at: Mon Aug 29 11:18:06 CDT 2011
[INFO] Final Memory: 29M/207M
[INFO] 

[ERROR] Failed to execute goal on project hadoop-yarn-server-common: Could
not resolve dependencies for project
org.apache.hadoop:hadoop-yarn-server-common:jar:0.24.0-SNAPSHOT: Failure to
find org.apache.hadoop:hadoop-yarn-common:jar:tests:0.24.0-SNAPSHOT in
http://ymaven.corp.yahoo.com:/proximity/repository/apache.snapshot was
cached in the local repository, resolution will not be reattempted until the
update interval of local apache.snapshot mirror has elapsed or updates are
forced - [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionExcepti
on
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the
command
[ERROR]   mvn goals -rf :hadoop-yarn-server-common


Is anyone looking into this yet?

--Bobby



Re: Trunk and 0.23 build failing with clean .m2 directory

2011-08-29 Thread Robert Evans
Wow, this is odd: install works just fine, but compile fails unless I do an 
install first (I found this while trying to run test-patch).

$mvn --version
Apache Maven 3.0.3 (r1075438; 2011-02-28 11:31:09-0600)
Maven home: /home/evans/bin/maven
Java version: 1.6.0_22, vendor: Sun Microsystems Inc.
Java home: /home/evans/bin/jdk1.6.0/jre
Default locale: en_US, platform encoding: UTF-8
OS name: linux, version: 2.6.18-238.12.1.el5, arch: i386, family: unix

Has anyone else seen this, or is there something messed up with my machine?

Thanks,

Bobby

On 8/29/11 11:18 AM, Robert Evans ev...@yahoo-inc.com wrote:

I am getting the following errors when I try to build either trunk or 0.23
with a clean maven cache.  I don't get any errors if I use my old cache.

[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @
hadoop-yarn-common ---
[INFO] Compiling 2 source files to
/home/evans/src/hadoop-git/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-
common/target/classes
[INFO]
[INFO]

[INFO] Building hadoop-yarn-server-common 0.24.0-SNAPSHOT
[INFO]

[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Apache Hadoop Project POM . SUCCESS [0.714s]
[INFO] Apache Hadoop Annotations . SUCCESS [0.323s]
[INFO] Apache Hadoop Project Dist POM  SUCCESS [0.001s]
[INFO] Apache Hadoop Assemblies .. SUCCESS [0.025s]
[INFO] Apache Hadoop Alfredo . SUCCESS [0.067s]
[INFO] Apache Hadoop Common .. SUCCESS [2.117s]
[INFO] Apache Hadoop Common Project .. SUCCESS [0.001s]
[INFO] Apache Hadoop HDFS  SUCCESS [1.419s]
[INFO] Apache Hadoop HDFS Project  SUCCESS [0.001s]
[INFO] hadoop-yarn-api ... SUCCESS [7.019s]
[INFO] hadoop-yarn-common  SUCCESS [2.181s]
[INFO] hadoop-yarn-server-common . FAILURE [0.058s]
[INFO] hadoop-yarn-server-nodemanager  SKIPPED
[INFO] hadoop-yarn-server-resourcemanager  SKIPPED
[INFO] hadoop-yarn-server-tests .. SKIPPED
[INFO] hadoop-yarn-server  SKIPPED
[INFO] hadoop-yarn ... SKIPPED
[INFO] hadoop-mapreduce-client-core .. SKIPPED
[INFO] hadoop-mapreduce-client-common  SKIPPED
[INFO] hadoop-mapreduce-client-shuffle ... SKIPPED
[INFO] hadoop-mapreduce-client-app ... SKIPPED
[INFO] hadoop-mapreduce-client-hs  SKIPPED
[INFO] hadoop-mapreduce-client-jobclient . SKIPPED
[INFO] hadoop-mapreduce-client ... SKIPPED
[INFO] hadoop-mapreduce .. SKIPPED
[INFO] Apache Hadoop Main  SKIPPED
[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 14.938s
[INFO] Finished at: Mon Aug 29 11:18:06 CDT 2011
[INFO] Final Memory: 29M/207M
[INFO]

[ERROR] Failed to execute goal on project hadoop-yarn-server-common: Could
not resolve dependencies for project
org.apache.hadoop:hadoop-yarn-server-common:jar:0.24.0-SNAPSHOT: Failure to
find org.apache.hadoop:hadoop-yarn-common:jar:tests:0.24.0-SNAPSHOT in
http://ymaven.corp.yahoo.com:/proximity/repository/apache.snapshot was
cached in the local repository, resolution will not be reattempted until the
update interval of local apache.snapshot mirror has elapsed or updates are
forced - [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please
read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionExcepti
on
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the
command
[ERROR]   mvn goals -rf :hadoop-yarn-server-common


Is anyone looking into this yet?

--Bobby




Re: Trunk and 0.23 build failing with clean .m2 directory

2011-08-29 Thread Robert Evans
Thanks Alejandro,

That really clears things up. Is there a JIRA you know of to change test-patch to 
do mvn test -DskipTests instead of mvn compile?  If not, I can file one and do 
the work.  Test-patch failed for me because of this.

--Bobby

On 8/29/11 12:21 PM, Alejandro Abdelnur t...@cloudera.com wrote:

The reason for this failure is because of how Maven reactor/dependency
resolution works (IMO a bug).

Maven reactor/dependency resolution is smart enough to create the classpath
using the classes from all modules being built.

However, this smartness falls short just a bit. The dependencies are
resolved using the deepest maven phase used by the current mvn invocation. If
you are doing 'mvn compile' you don't get to the test compile phase.  This
means that the TEST classes are not resolved from the build but from the
cache/repo.

The solution is to run 'mvn test -DskipTests' instead of 'mvn compile'. This
will include the TEST classes from the build.

The same when creating the eclipse profile, run 'mvn test -DskipTests
eclipse:eclipse'
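
For example, from the top of the source tree (a rough sketch of the two
invocations above):

  # fails with a clean ~/.m2: test jars of sibling modules (e.g. the
  # hadoop-yarn-common tests artifact) are looked up in the repo, not the build
  mvn clean compile

  # works: the test phase is part of the invocation, so the reactor resolves
  # the TEST classes from the build itself
  mvn clean test -DskipTests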

Thanks.

Alejandro

On Mon, Aug 29, 2011 at 9:59 AM, Ravi Prakash ravihad...@gmail.com wrote:

 Yeah I've seen this before. Sometimes I had to descend into child
 directories to mvn install them, before I could maven install parents. I'm
 hoping/guessing that issue is fixed now

 On Mon, Aug 29, 2011 at 11:39 AM, Robert Evans ev...@yahoo-inc.com
 wrote:

  Wow this is odd install works just fine, but compile fails unless I do an
  install first (I found this trying to run test-patch).
 
  $mvn --version
  Apache Maven 3.0.3 (r1075438; 2011-02-28 11:31:09-0600)
  Maven home: /home/evans/bin/maven
  Java version: 1.6.0_22, vendor: Sun Microsystems Inc.
  Java home: /home/evans/bin/jdk1.6.0/jre
  Default locale: en_US, platform encoding: UTF-8
  OS name: linux, version: 2.6.18-238.12.1.el5, arch: i386, family:
  unix
 
  Has anyone else seen this, or is there something messed up with my
 machine?
 
  Thanks,
 
  Bobby
 
  On 8/29/11 11:18 AM, Robert Evans ev...@yahoo-inc.com wrote:
 
  I am getting the following errors when I try to build either trunk or
 0.23
  with a clean maven cache.  I don't get any errors if I use my old cache.
 
  [INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @
  hadoop-yarn-common ---
  [INFO] Compiling 2 source files to
 
 
 /home/evans/src/hadoop-git/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-
  common/target/classes
  [INFO]
  [INFO]
  
  [INFO] Building hadoop-yarn-server-common 0.24.0-SNAPSHOT
  [INFO]
  
  [INFO]
  
  [INFO] Reactor Summary:
  [INFO]
  [INFO] Apache Hadoop Project POM . SUCCESS
 [0.714s]
  [INFO] Apache Hadoop Annotations . SUCCESS
 [0.323s]
  [INFO] Apache Hadoop Project Dist POM  SUCCESS
 [0.001s]
  [INFO] Apache Hadoop Assemblies .. SUCCESS
 [0.025s]
  [INFO] Apache Hadoop Alfredo . SUCCESS
 [0.067s]
  [INFO] Apache Hadoop Common .. SUCCESS
 [2.117s]
  [INFO] Apache Hadoop Common Project .. SUCCESS
 [0.001s]
  [INFO] Apache Hadoop HDFS  SUCCESS
 [1.419s]
  [INFO] Apache Hadoop HDFS Project  SUCCESS
 [0.001s]
  [INFO] hadoop-yarn-api ... SUCCESS
 [7.019s]
  [INFO] hadoop-yarn-common  SUCCESS
 [2.181s]
  [INFO] hadoop-yarn-server-common . FAILURE
 [0.058s]
  [INFO] hadoop-yarn-server-nodemanager  SKIPPED
  [INFO] hadoop-yarn-server-resourcemanager  SKIPPED
  [INFO] hadoop-yarn-server-tests .. SKIPPED
  [INFO] hadoop-yarn-server  SKIPPED
  [INFO] hadoop-yarn ... SKIPPED
  [INFO] hadoop-mapreduce-client-core .. SKIPPED
  [INFO] hadoop-mapreduce-client-common  SKIPPED
  [INFO] hadoop-mapreduce-client-shuffle ... SKIPPED
  [INFO] hadoop-mapreduce-client-app ... SKIPPED
  [INFO] hadoop-mapreduce-client-hs  SKIPPED
  [INFO] hadoop-mapreduce-client-jobclient . SKIPPED
  [INFO] hadoop-mapreduce-client ... SKIPPED
  [INFO] hadoop-mapreduce .. SKIPPED
  [INFO] Apache Hadoop Main  SKIPPED
  [INFO]
  
  [INFO] BUILD FAILURE
  [INFO]
  
  [INFO] Total time: 14.938s
  [INFO] Finished at: Mon Aug 29 11:18:06 CDT 2011
  [INFO] Final Memory

Re: which Eclipse plugin to use for Maven?

2011-08-29 Thread Robert Evans
Jim,

The m2 plugin replaces the normal eclipse build system with maven.  If you want 
to use M2 then you don't need to run mvn eclipse:eclipse at all.  What mvn 
eclipse:eclipse does is generate source code and produce a .project and 
.classpath so that eclipse can use its normal build system to do the work.  The 
two approaches are not really compatible with each other.

--Bobby

On 8/29/11 11:52 AM, Jim Falgout jim.falg...@pervasive.com wrote:

Using the latest trunk code, I used the mvn eclipse:eclipse target to build the 
Eclipse project files. I've got the M2E plugin for Maven installed. After some 
trouble with lifecycle errors (Plugin execution not covered by lifecycle 
configuration error messages) I noticed this comment in the .project file: 
NO_M2ECLIPSE_SUPPORT: Project files created with the maven-eclipse-plugin are 
not supported in M2Eclipse.

Is there another recommendation for Maven integration using an Eclipse plugin 
that will work out of the box?

Thanks!





Re: Trunk and 0.23 build failing with clean .m2 directory

2011-08-29 Thread Robert Evans
Done: I filed HADOOP-7589 and uploaded my patch to it.  Alejandro, could you 
take a quick look at the patch, since you appear to be the maven expert?

Thanks,

Bobby Evans

On 8/29/11 12:39 PM, Mahadev Konar maha...@hortonworks.com wrote:

Bobby,
 You are right. The test-patch uses mvn compile. Please file a jira.
It should be a minor change:

thanks
mahadev

On Mon, Aug 29, 2011 at 10:34 AM, Robert Evans ev...@yahoo-inc.com wrote:
 Thanks Alejandro,

 That really clears things up. Is the a JIRA you know of to change test-patch 
 to do mvn test -DskipTests instead of mvn compile?  If not I can file one and 
 do the work.  Test-patch failed for me because of this.

 --Bobby

 On 8/29/11 12:21 PM, Alejandro Abdelnur t...@cloudera.com wrote:

 The reason for this failure is because of how Maven reactor/dependency
 resolution works (IMO a bug).

 Maven reactor/dependency resolution is smart enough to create the classpath
 using the classes from all modules being built.

 However, this smartness falls short just a bit. The dependencies are
 resolved using the deepest maven phase used by current mvn invocation. If
 you are doing 'mvn compile' you don't get to the test compile phase.  This
 means that the TEST classes are not resolved from the build but from the
 cache/repo.

 The solution is to run 'mvn test -DskipTests' instead 'mvn compile'. This
 will include the TEST classes from the build.

 The same when creating the eclipse profile, run 'mvn test -DskipTests
 eclipse:eclipse'

 Thanks.

 Alejandro

 On Mon, Aug 29, 2011 at 9:59 AM, Ravi Prakash ravihad...@gmail.com wrote:

 Yeah I've seen this before. Sometimes I had to descend into child
 directories to mvn install them, before I could maven install parents. I'm
 hoping/guessing that issue is fixed now

 On Mon, Aug 29, 2011 at 11:39 AM, Robert Evans ev...@yahoo-inc.com
 wrote:

  Wow this is odd install works just fine, but compile fails unless I do an
  install first (I found this trying to run test-patch).
 
  $mvn --version
  Apache Maven 3.0.3 (r1075438; 2011-02-28 11:31:09-0600)
  Maven home: /home/evans/bin/maven
  Java version: 1.6.0_22, vendor: Sun Microsystems Inc.
  Java home: /home/evans/bin/jdk1.6.0/jre
  Default locale: en_US, platform encoding: UTF-8
  OS name: linux, version: 2.6.18-238.12.1.el5, arch: i386, family:
  unix
 
  Has anyone else seen this, or is there something messed up with my
 machine?
 
  Thanks,
 
  Bobby
 
  On 8/29/11 11:18 AM, Robert Evans ev...@yahoo-inc.com wrote:
 
  I am getting the following errors when I try to build either trunk or
 0.23
  with a clean maven cache.  I don't get any errors if I use my old cache.
 
  [INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @
  hadoop-yarn-common ---
  [INFO] Compiling 2 source files to
 
 
 /home/evans/src/hadoop-git/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-
  common/target/classes
  [INFO]
  [INFO]
  
  [INFO] Building hadoop-yarn-server-common 0.24.0-SNAPSHOT
  [INFO]
  
  [INFO]
  
  [INFO] Reactor Summary:
  [INFO]
  [INFO] Apache Hadoop Project POM . SUCCESS
 [0.714s]
  [INFO] Apache Hadoop Annotations . SUCCESS
 [0.323s]
  [INFO] Apache Hadoop Project Dist POM  SUCCESS
 [0.001s]
  [INFO] Apache Hadoop Assemblies .. SUCCESS
 [0.025s]
  [INFO] Apache Hadoop Alfredo . SUCCESS
 [0.067s]
  [INFO] Apache Hadoop Common .. SUCCESS
 [2.117s]
  [INFO] Apache Hadoop Common Project .. SUCCESS
 [0.001s]
  [INFO] Apache Hadoop HDFS  SUCCESS
 [1.419s]
  [INFO] Apache Hadoop HDFS Project  SUCCESS
 [0.001s]
  [INFO] hadoop-yarn-api ... SUCCESS
 [7.019s]
  [INFO] hadoop-yarn-common  SUCCESS
 [2.181s]
  [INFO] hadoop-yarn-server-common . FAILURE
 [0.058s]
  [INFO] hadoop-yarn-server-nodemanager  SKIPPED
  [INFO] hadoop-yarn-server-resourcemanager  SKIPPED
  [INFO] hadoop-yarn-server-tests .. SKIPPED
  [INFO] hadoop-yarn-server  SKIPPED
  [INFO] hadoop-yarn ... SKIPPED
  [INFO] hadoop-mapreduce-client-core .. SKIPPED
  [INFO] hadoop-mapreduce-client-common  SKIPPED
  [INFO] hadoop-mapreduce-client-shuffle ... SKIPPED
  [INFO] hadoop-mapreduce-client-app ... SKIPPED
  [INFO] hadoop-mapreduce-client-hs  SKIPPED
  [INFO] hadoop-mapreduce-client-jobclient . SKIPPED
  [INFO] hadoop-mapreduce-client

Re: DistCpV2 in 0.23

2011-08-26 Thread Robert Evans
I agree with Mithun.  They are related but this goes beyond distcpv2 and should 
not block distcpv2 from going in.  It would be very nice, however, to get the 
layout settled soon so that we all know where to find something when we want to 
work on it.

Also +1 for Alejandro's suggestion: I also prefer to keep tools at the trunk level.

Even though HDFS, Common, and Mapreduce and perhaps soon tools are separate 
modules right now, there is still tight coupling between the different pieces, 
especially with tests.  IMO until we can reduce that coupling we should treat 
building and testing Hadoop as a single project instead of trying to keep them 
separate.

--Bobby

On 8/26/11 7:45 AM, Mithun Radhakrishnan mithun.radhakrish...@yahoo.com 
wrote:

Would it be acceptable if retooling of tools/ were taken up separately? It 
sounds to me like this might be a distinct (albeit related) task.

Mithun



From: Giridharan Kesavan gkesa...@hortonworks.com
To: mapreduce-dev@hadoop.apache.org
Sent: Friday, August 26, 2011 12:04 PM
Subject: Re: DistCpV2 in 0.23

+1 to Alejandro's

I prefer to keep the hadoop-tools at trunk level.

-Giri

On Thu, Aug 25, 2011 at 9:15 PM, Alejandro Abdelnur t...@cloudera.com wrote:
 I'd suggest putting hadoop-tools either at trunk/ level or having a tools
 aggregator module for hdfs and another for common.

 I personally would prefer it at trunk/.

 Thanks.

 Alejandro

 On Thu, Aug 25, 2011 at 9:06 PM, Amareshwari Sri Ramadasu 
 amar...@yahoo-inc.com wrote:

 Agree. It should be a separate maven module (and the patch puts it as a
 separate maven module now). A top level for hadoop tools is nice to have, but
 it becomes hard to maintain until the patch automation runs the tests under
 tools. Currently we often see changes in HDFS affecting the RAID tests in
 MapReduce. So, I'm fine putting the tools under hadoop-mapreduce.

 I propose we can have something like the following:

 trunk/
  - hadoop-mapreduce
  - hadoop-mr-client
  - hadoop-yarn
  - hadoop-tools
  - hadoop-streaming
  - hadoop-archives
  - hadoop-distcp

 Thoughts?

 @Eli and @JD, we did not replace the old legacy distcp because this is really
 a complete rewrite, and we did not want to remove it until users are familiar
 with the new one.

 On 8/26/11 12:51 AM, Todd Lipcon t...@cloudera.com wrote:

 Maybe a separate toplevel for hadoop-tools? Stuff like RAID could go
 in there as well - ie tools that are downstream of MR and/or HDFS.

 On Thu, Aug 25, 2011 at 12:09 PM, Mahadev Konar maha...@hortonworks.com
 wrote:
  +1 for a separate module in hadoop-mapreduce-project. I think
  hadoop-mapreduce-client might not be the right place for it. We might have
  to pick a new maven module under hadoop-mapreduce-project that could
  host streaming/distcp/hadoop archives.
 
  thanks
  mahadev
 
  On Thu, Aug 25, 2011 at 11:04 AM, Alejandro Abdelnur t...@cloudera.com
 wrote:
  Agree, it should be a separate maven module.
 
  And it should be under hadoop-mapreduce-client, right?
 
  And now that we are on the topic, the same should go for streaming, no?
 
  Thanks.
 
  Alejandro
 
  On Thu, Aug 25, 2011 at 10:58 AM, Todd Lipcon t...@cloudera.com
 wrote:
 
  On Thu, Aug 25, 2011 at 10:36 AM, Eli Collins e...@cloudera.com
 wrote:
   Nice work!   I definitely think this should go in 23 and 20x.
  
   Agree with JD that it should be in the core code, not contrib.  If
   it's going to be maintained then we should put it in the core code.
 
  Now that we're all mavenized, though, a separate maven module and
  artifact does make sense IMO - ie hadoop jar
  hadoop-distcp-0.23.0-SNAPSHOT rather than hadoop distcp
 
  -Todd
  --
  Todd Lipcon
  Software Engineer, Cloudera
 
 
 



 --
 Todd Lipcon
 Software Engineer, Cloudera






--
-Giri



Re: Picking up local common changes in mr

2011-08-19 Thread Robert Evans
One thing to be aware of is that with -SNAPSHOT at the end of the version, Maven 
will start looking at dates.  Suppose you have a 0.23.0-SNAPSHOT that you 
personally modified/built in your .m2 repository and go to build something that 
depends on it: if the nightly build has pushed a newer snapshot to the apache 
repo after you built your version, Maven might download that newer version, 
replacing your changes.  If your changes impact multiple components then your 
choices are to always build the entire project (or at least the subset that has 
dependent changes) or to always build with -o after your initial build/install.
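
For example, something like this should keep a locally modified common from
being replaced by a newer remote snapshot (a sketch; directory names are
illustrative):

  # rebuild and install the modified common into the local repo
  cd trunk/hadoop-common && mvn clean install -DskipTests

  # build mapreduce offline so Maven cannot pull a newer remote SNAPSHOT
  cd ../hadoop-mapreduce && mvn clean install -DskipTests -o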

--Bobby

On 8/19/11 11:41 AM, Matt Foley mfo...@hortonworks.com wrote:

Thanks for the nice clear statement, Alejandro.
--Matt

On Thu, Aug 18, 2011 at 4:40 PM, Alejandro Abdelnur t...@cloudera.comwrote:

 This is handled by the maven reactor.

 When you run Maven in a multimodule project (like we have), all modules
 that are part of the build (from the dir where you are) down are used for
 the build/test/packaging; all modules that are not part of the build are
 picked up from .m2/repo.

 For example

 cd trunk/hadoop-mapreduce; mvn compile uses hadoop-common & hadoop-hdfs
 from m2/repo

 cd trunk;mvn compile uses hadoop-common, hadoop-hdfs, hadoop-mapreduce
 from the build.

 HTH

 Thxs.

 Alejandro


 On Thu, Aug 18, 2011 at 4:35 PM, Matt Foley mfo...@hortonworks.com
 wrote:

  Since we put all the effort into un-splitting the components, shouldn't
  we
  have a switch
  that causes, eg, the MAPREDUCE build to pick up artifacts from COMMON and
  HDFS builds
  in specified sibling directories, without using m2 as an intermediary?
 
  Of course it should respect dependencies (via maven) so that if HDFS
 source
  has been modified,
  the HDFS artifacts will also be rebuilt before MAPREDUCE uses them :-)
 
  --Matt
 
  On Thu, Aug 18, 2011 at 3:30 PM, Giridharan Kesavan 
  gkesa...@hortonworks.com wrote:
 
   Hello,
  
   It's the same -Dresolvers=internal for the ant build system.  For the
   maven/yarn build system, as long as you have the latest common jar in
   the m2 cache it is going to resolve common from the maven cache, and if
   not, from the apache maven repo.  You can force the builds to use the
   cache by adding the -o option (offline builds).
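
   For example (a sketch; targets and paths are illustrative):

     # classic (ant) build, resolving common from locally built artifacts
     ant veryclean jar jar-test -Dresolvers=internal

     # maven/yarn build, forced to resolve only from the local m2 cache
     mvn clean install -DskipTests -o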
  
   Thanks,
   Giri
  
   On Thu, Aug 18, 2011 at 3:19 PM, Eli Collins e...@cloudera.com wrote:
Hey gang,
   
What's the new equivalent of resolvers=true in the new MR build? ie
how do you get a  a local common change to get picked up by mr?
   
Thanks,
Eli
   
  
 




Re: Notes for working on mapreduce trunk after the MR-279 merge.

2011-08-18 Thread Robert Evans
It looks like git has not seen the changes yet, even though the last change was 
over 90 mins ago.  Is there any way to kick git to pull in the changes sooner 
so I can rebase?

Thanks,

Bobby Evans

On 8/18/11 7:49 AM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote:

The MR-279 branch is merged into mapreduce trunk, and this changes things a
bit for developing on mapreduce.

You can get all the help that is needed from the INSTALL file at
http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce/INSTALL.
Reproducing some of those contents here for the short-term lookup.


Checking out source code

svn checkout 
http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce

--
Directory structure
--

trunk/
  - hadoop-mapreduce ( was mapreduce before)

trunk/hadoop-mapreduce - Classic code. JT/TT reside here
 - build.xml
 - src

trunk/hadoop-mapreduce/ - New code related to yarn reside here.
 - assembly
 - pom.xml
 - hadoop-mr-client
 - hadoop-yarn - Yarn APIs, libraries, and server code
   -- hadoop-yarn-api
   -- hadoop-yarn-common
   -- hadoop-yarn-server - Server code, ResourceManager, NodeManager,
server libraries and tests.
  --- hadoop-yarn-server-common
  --- hadoop-yarn-server-nodemanager
  --- hadoop-yarn-server-resourcemanager
  --- hadoop-yarn-server-tests
 - hadoop-mr-client - MapReduce server and client code
   -- hadoop-mapreduce-client-app
   -- hadoop-mapreduce-client-core
   -- hadoop-mapreduce-client-jobclient
   -- hadoop-mapreduce-client-common
   -- hadoop-mapreduce-client-hs
   -- hadoop-mapreduce-client-shuffle

---
Building
---
Building yarn code and install into the local maven cache.
 - mvn clean install
 - In case you want to skip the tests run: mvn clean install -DskipTests

Building classic code once yarn code is built.
 - ant veryclean jar jar-test  -Dresolvers=internal

--
Eclipse
---
 1) For hacking on the new yarn+MR code in eclipse, you should run
mvn eclipse:eclipse and then import the checked out source root as a
maven project.
 2) For developing on classic JT/TT code, running ant eclipse and
importing as java project should continue to work.

Hope that helps. If you run into issues, please send an email or
create a JIRA issue.

Thanks,
+Vinod



Re: Problem while running eclipse-files for Next Gen Mapreduce branch

2011-07-08 Thread Robert Evans
The mapreduce/INSTALL file also has some important information in it, and be aware 
that you do not have to install the avro plugin any more.  Maven can download 
it and install it automatically now, but the README was never updated.  Also be 
sure to install protocol buffers; the build will fail without it.

--Bobby

On 7/8/11 9:04 AM, Josh Wills jwi...@cloudera.com wrote:

You want to generate them using mvn instead.  See the mapreduce/yarn/README
file for how to do it.

On Fri, Jul 8, 2011 at 7:00 AM, Devaraj K devara...@huawei.com wrote:

 Hi,



   I am getting the errors below when I try to generate eclipse files using the
 eclipse-files target. Can anybody help me?





 Buildfile: D:\svn\nextgenmapreduce\mapreduce\build.xml

 ivy-download:
  [get] Getting:
 http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.2.0/ivy-2.2.0.jar
 http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.2.0/ivy-2.2.0.jar
  [get] To: D:\svn\nextgenmapreduce\mapreduce\ivy\ivy-2.2.0.jar
  [get] Not modified - so not downloaded

 ivy-init-dirs:

 ivy-probe-antlib:

 ivy-init-antlib:

 ivy-init:
 [ivy:configure] :: Ivy non official version -  ::
 http://ant.apache.org/ivy/ http://ant.apache.org/ivy/ ::
 [ivy:configure] :: loading settings :: file =
 D:\svn\nextgenmapreduce\mapreduce\ivy\ivysettings.xml

 ivy-resolve-common:
 [ivy:resolve]
 [ivy:resolve] :: problems summary ::
 [ivy:resolve]  WARNINGS
 [ivy:resolve] module not found:
 org.apache.hadoop#yarn-server-common;1.0-SNAPSHOT
 [ivy:resolve]  apache-snapshot: tried
 [ivy:resolve]
 
 https://repository.apache.org/content/repositories/snapshots/org/apache/had
 oop/yarn-server-common/1.0-SNAPSHOT/yarn-server-common-1.0-SNAPSHOT.pom

 https://repository.apache.org/content/repositories/snapshots/org/apache/hado
 op/yarn-server-common/1.0-SNAPSHOT/yarn-server-common-1.0-SNAPSHOT.pom
 [ivy:resolve]   -- artifact
 org.apache.hadoop#yarn-server-common;1.0-SNAPSHOT!yarn-server-common.jar:
 [ivy:resolve]
 
 https://repository.apache.org/content/repositories/snapshots/org/apache/had
 oop/yarn-server-common/1.0-SNAPSHOT/yarn-server-common-1.0-SNAPSHOT.jar

 https://repository.apache.org/content/repositories/snapshots/org/apache/hado
 op/yarn-server-common/1.0-SNAPSHOT/yarn-server-common-1.0-SNAPSHOT.jar
 [ivy:resolve]  maven2: tried
 [ivy:resolve]
 
 http://repo1.maven.org/maven2/org/apache/hadoop/yarn-server-common/1.0-SNAP
 SHOT/yarn-server-common-1.0-SNAPSHOT.pom

 http://repo1.maven.org/maven2/org/apache/hadoop/yarn-server-common/1.0-SNAPS
 HOT/yarn-server-common-1.0-SNAPSHOT.pom
 [ivy:resolve]   -- artifact
 org.apache.hadoop#yarn-server-common;1.0-SNAPSHOT!yarn-server-common.jar:
 [ivy:resolve]
 
 http://repo1.maven.org/maven2/org/apache/hadoop/yarn-server-common/1.0-SNAP
 SHOT/yarn-server-common-1.0-SNAPSHOT.jar

 http://repo1.maven.org/maven2/org/apache/hadoop/yarn-server-common/1.0-SNAPS
 HOT/yarn-server-common-1.0-SNAPSHOT.jar
 [ivy:resolve] module not found:
 org.apache.hadoop#hadoop-mapreduce-client-core;1.0-SNAPSHOT
 [ivy:resolve]  apache-snapshot: tried
 [ivy:resolve]
 
 https://repository.apache.org/content/repositories/snapshots/org/apache/had

 oop/hadoop-mapreduce-client-core/1.0-SNAPSHOT/hadoop-mapreduce-client-core-1
 .0-SNAPSHOT.pom

 https://repository.apache.org/content/repositories/snapshots/org/apache/hado

 op/hadoop-mapreduce-client-core/1.0-SNAPSHOT/hadoop-mapreduce-client-core-1.
 0-SNAPSHOT.pom
 [ivy:resolve]   -- artifact

 org.apache.hadoop#hadoop-mapreduce-client-core;1.0-SNAPSHOT!hadoop-mapreduce
 -client-core.jar:
 [ivy:resolve]
 
 https://repository.apache.org/content/repositories/snapshots/org/apache/had

 oop/hadoop-mapreduce-client-core/1.0-SNAPSHOT/hadoop-mapreduce-client-core-1
 .0-SNAPSHOT.jar

 https://repository.apache.org/content/repositories/snapshots/org/apache/hado

 op/hadoop-mapreduce-client-core/1.0-SNAPSHOT/hadoop-mapreduce-client-core-1.
 0-SNAPSHOT.jar
 [ivy:resolve]  maven2: tried
 [ivy:resolve]
 
 http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-cor
 e/1.0-SNAPSHOT/hadoop-mapreduce-client-core-1.0-SNAPSHOT.pom

 http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-core
 /1.0-SNAPSHOT/hadoop-mapreduce-client-core-1.0-SNAPSHOT.pom
 [ivy:resolve]   -- artifact

 org.apache.hadoop#hadoop-mapreduce-client-core;1.0-SNAPSHOT!hadoop-mapreduce
 -client-core.jar:
 [ivy:resolve]
 
 http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-cor
 e/1.0-SNAPSHOT/hadoop-mapreduce-client-core-1.0-SNAPSHOT.jar

 http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-core
 /1.0-SNAPSHOT/hadoop-mapreduce-client-core-1.0-SNAPSHOT.jar
 [ivy:resolve] module not found:
 org.apache.hadoop#yarn-common;1.0-SNAPSHOT
 [ivy:resolve]  apache-snapshot: tried
 [ivy:resolve]
 
 

Re: Reg ChainReducer usage

2011-06-02 Thread Robert Evans
Moving to mapreduce user.

Ravi,

The issue is with the shuffle.  The chain reducer cannot re-shuffle the output 
of a previous reducer.  If you want that, then you need to run a second 
reduce-only job.  Instead, usually the chain reducer has a single reducer 
followed by 0 or more mappers that can process the output of the reducer.
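
Roughly, the usual pattern looks like this with the old mapred chain API (a
sketch only, not tested; AMap, XReduce and CMap stand in for your own
mapper/reducer classes, and input/output paths are omitted):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextInputFormat;
  import org.apache.hadoop.mapred.TextOutputFormat;
  import org.apache.hadoop.mapred.lib.ChainMapper;
  import org.apache.hadoop.mapred.lib.ChainReducer;

  public class ChainExample {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(new Configuration(), ChainExample.class);
      conf.setJobName("chain");
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      // One or more mappers run before the single shuffle.
      ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
          Text.class, Text.class, true, new JobConf(false));

      // Exactly one reducer; its output is not shuffled again.
      ChainReducer.setReducer(conf, XReduce.class, Text.class, Text.class,
          Text.class, Text.class, true, new JobConf(false));

      // Zero or more mappers that post-process the reducer output in the same task.
      ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,
          Text.class, Text.class, true, new JobConf(false));

      JobClient.runJob(conf);
    }
  }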

--Bobby

On 6/2/11 5:25 AM, Ravi Teja ravit...@huawei.com wrote:

Hi,

I had some queries on the usage of the ChainReducer.

1) Only one reducer can be set. If we try to add a second reducer to the
chain, an IllegalArgumentException will be thrown. Then why is it called a
ChainReducer?

2) We have an option chain.reducer.byValue which decides whether the key/value
pairs are passed by value to the next Mapper/Reducer. But why is this property
significant, as only the reducer is called, last in the chain, no matter what
the order in the chain is, and there is nothing left to pass to?

Regards,
Ravi Teja

