Re: hadoop1.2.1 speedup model

2013-09-09 Thread Robert Evans
How many times did you run the experiment at each setting?  What is the
standard deviation for each of these settings?  It could be that you are
simply running into the error bounds of Hadoop.  Hadoop is far from
consistent in its performance.  For our benchmarking we typically will
run the test 5 times, throw out the top and bottom results as possible
outliers, and then average the other runs.  Even with that we have to be
very careful that we weed out bad nodes or the numbers are useless for
comparison.  The other thing to look at is where all of the time was spent
for each of these settings.  The map portion should be very close to
linear with the number of tasks, assuming that there is no disk or network
contention.  The shuffle is far from linear, as the number of fetches is a
function of the number of maps and the number of reducers.  The reduce
phase itself should be close to linear, assuming that there isn't much skew
in your data.
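
A back-of-the-envelope way to see why the curve need not be linear is to
treat the job time as map + shuffle + reduce, where the map and reduce terms
shrink with the cluster but the shuffle carries a per-fetch cost tied to
maps x reducers.  The Java sketch below only illustrates that idea plus the
trimmed-mean averaging described above; every constant and run time in it is
made up for illustration, not measured on this cluster.

    import java.util.Arrays;

    public class SpeedupSketch {

        // Drop the fastest and slowest run and average the rest (needs >= 3 runs).
        static double trimmedMean(double[] runSeconds) {
            double[] sorted = runSeconds.clone();
            Arrays.sort(sorted);
            double sum = 0.0;
            for (int i = 1; i < sorted.length - 1; i++) {
                sum += sorted[i];
            }
            return sum / (sorted.length - 2);
        }

        // Toy job-time model: map and reduce work split across nodes, while the
        // shuffle pays a fixed overhead per fetch (maps x reduces fetches total).
        static double modeledSeconds(int nodes, int maps, int reduces,
                                     double mapWork, double reduceWork,
                                     double perFetchOverhead) {
            double mapTime = mapWork / nodes;        // ~linear if no disk/net contention
            double reduceTime = reduceWork / nodes;  // ~linear if there is no skew
            double shuffleTime = perFetchOverhead * maps * reduces; // does not shrink
            return mapTime + shuffleTime + reduceTime;
        }

        public static void main(String[] args) {
            double oneNode = modeledSeconds(1, 64, 64, 600, 300, 0.02);
            for (int nodes = 2; nodes <= 9; nodes++) {
                double t = modeledSeconds(nodes, 64, 64, 600, 300, 0.02);
                System.out.printf("nodes=%d modeled=%.1fs speedup=%.2f%n",
                                  nodes, t, oneNode / t);
            }
            System.out.println("trimmed mean of 5 runs: "
                + trimmedMean(new double[]{712, 698, 705, 940, 701}));
        }
    }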

--Bobby

On 9/7/13 3:33 AM, "牛兆捷"  wrote:

>But I still want to find the most efficient assignment and scale both data
>and nodes as you said; for example, in my result 2 is the best, and 8 is
>better than 4.
>
>Why is it sub-linear from 2 to 4 and super-linear from 4 to 8? I find it
>hard to model this result. Can you give me some hint about this kind of
>trend?
>
>
>2013/9/7 Vinod Kumar Vavilapalli 
>
>>
>> Clearly your input size isn't changing. And depending on how they are
>> distributed on the nodes, there could be Datanode/disks contention.
>>
>> The better way to model this is by scaling the input data also linearly.
>> More nodes should process more data in the same amount of time.
>>
>> Thanks,
>> +Vinod
>>
>> On Sep 6, 2013, at 8:27 AM, 牛兆捷 wrote:
>>
>> > Hi all:
>> >
>> > I vary the computational nodes of cluster and get the speedup result
>>in
>> attachment.
>> >
>> > In my mind, there are three types of speedup model: linear, sub-linear
>> and super-linear. However the curve of my result seems a little
>>strange. I
>> have attached it.
>> > 
>> >
>> > This is sort in example.jar, actually it is done only using the
>>default
>> map-reduce mechanism of Hadoop.
>> >
>> > I use hadoop-1.2.1, set 8 map slots and 8 reduce slots per node (12
>>cpu,
>> 20g mem)
>> >  io.sort.mb = 512, block size = 512mb, heap size = 1024mb,
>>  reduce.slowstart = 0.05, the others are default.
>> >
>> > Input data: 20g, I divide it to 64 files
>> >
>> > Sort example: 64 map tasks, 64 reduce tasks
>> >
>> > Computational nodes: varying from 2 to 9
>> >
>> > Why is the speedup behaving like this? How can I model it properly?
>> >
>> > Thanks〜
>> >
>> > --
>> > Sincerely,
>> > Zhaojie
>> >
>>
>>
>
>
>
>-- 
>*Sincerely,*
>*Zhaojie*



Re: [VOTE] Release Apache Hadoop 0.23.9

2013-07-02 Thread Robert Evans
+1 downloaded the release.  Ran a couple of simple jobs and everything
worked.

On 7/1/13 12:20 PM, "Thomas Graves"  wrote:

>I've created a release candidate (RC0) for hadoop-0.23.9 that I would like
>to release.
>
>The RC is available at:
>http://people.apache.org/~tgraves/hadoop-0.23.9-candidate-0/
>The RC tag in svn is here:
>http://svn.apache.org/viewvc/hadoop/common/tags/release-0.23.9-rc0/
>
>The maven artifacts are available via repository.apache.org.
>
>Please try the release and vote; the vote will run for the usual 7 days
>til July 8th.
>
>I am +1 (binding).
>
>thanks,
>Tom Graves



Re: [VOTE] Release Apache Hadoop 0.23.8

2013-05-30 Thread Robert Evans
+1

Downloaded the release and ran a few basic tests.

--Bobby

On 5/28/13 11:00 AM, "Thomas Graves"  wrote:

>
>I've created a release candidate (RC0) for hadoop-0.23.8 that I would like
>to release.
>
>This release is a sustaining release with several important bug fixes in
>it.  The most critical one is MAPREDUCE-5211.
>
>The RC is available at:
>http://people.apache.org/~tgraves/hadoop-0.23.8-candidate-0/
>The RC tag in svn is here:
>http://svn.apache.org/viewvc/hadoop/common/tags/release-0.23.8-rc0/
>
>The maven artifacts are available via repository.apache.org.
>
>Please try the release and vote; the vote will run for the usual 7 days.
>
>I am +1 (binding).
>
>thanks,
>Tom Graves
>



Re: [VOTE] Plan to create release candidate for 0.23.8

2013-05-20 Thread Robert Evans
+1

On 5/17/13 4:10 PM, "Thomas Graves"  wrote:

>Hello all,
>
>We've had a few critical issues come up in 0.23.7 that I think warrants a
>0.23.8 release. The main one is MAPREDUCE-5211.  There are a couple of
>other issues that I want finished up and get in before we spin it.  Those
>include HDFS-3875, HDFS-4805, and HDFS-4835.  I think those are on track
>to finish up early next week.   So I hope to spin 0.23.8 soon after this
>vote completes.
>
>Please vote '+1' to approve this plan. Voting will close on Friday May
>24th at 2:00pm PDT.
>
>Thanks,
>Tom Graves
>



Re: [VOTE] - Release 2.0.5-beta

2013-05-16 Thread Robert Evans
-0 (Binding)

I have made my opinion known in the previous thread/vote, but I have spent
enough time discussing this and need to get back to my day job. If the
community is able to get snapshots and everything else in this list merged
and stable without breaking the stack above it in two weeks it will be
wonderful, but I have serious doubts that it is going to actually be
possible.

--Bobby

On 5/15/13 12:57 PM, "Arun C Murthy"  wrote:

>Folks,
>
>A considerable number of people have expressed confusion regarding the
>recent vote on 2.0.5, beta status etc. given lack of specifics, the
>voting itself (validity of the vote itself, whose votes are binding) etc.
>
>IMHO technical arguments (incompatibility b/w 2.0 & 2.1, current
>stability of 3 features under debate etc.) have been lost in the
>discussion in favor of non-technical (almost dramatic) nuances such as
>"seizing the moment". There is now dangerous talk of tolerating
>incompatibility b/w 2.0 and 2.1) - this is a red flag for me;
>particularly when there are just 3 features being debated and active
>committers and contributors are confident of and ready to stand by their
>work. All patches, I believe, are ready to be merged in the next few
>days per discussions on jira. This will, clearly, not delay the other API
>work which everyone agrees is crucial. As a result, I feel no recourse
>but to restart a new vote - all attempts at calm, reasoned, civil
>discussion based on technical arguments have come to naught - I apologize
>for the thrash caused to everyone's attention.
>
>To get past all of this confusion, I'd like to present an alternate,
>specific proposal for consideration.
>
>I propose we continue the original plan and make a 2.0.5-beta release by
>May end with the following content:
># HDFS-347
># HDFS Snapshots
># Windows support
># Necessary & final API/protocol changes such as:
> * Final YARN API changes: YARN-386
> * MR Binary Compatibility: MAPREDUCE-5108
> * Final RPC cleanup: HADOOP-8990
>
>People working on the above features have all expressed considerable
>comfort with them and are ready to stand-by to help expedite any
>necessary bug-fixes etc. to get to stabilization quickly. I'm confident
>we can get this release out by end of May. This sets stage for a
>hadoop-2.x GA release right after with some more testing - this means I
>think I can quickly turn around and make bug-fix releases as necessary
>right after 2.0.5-beta.
>
>I request that people consider helping out with this plan and sign up to
>help push hadoop-2.x to stability as outlined above. I believe this will
>help achieve our shared goals of quickly stabilizing hadoop-2 and help
>ensure we can support it for the foreseeable future in a compatible manner for
>the benefit of our users and downstream projects.
>
>Please vote, the vote will run the normal 7 days. Obviously, I'm +1.
>
>thanks,
>Arun
>
>PS: To keep this discussion grounded in technical details I've moved this
>to dev@ (bcc general@).
>



Re: Heads up - 2.0.5-beta

2013-05-03 Thread Robert Evans
I agree that "destructive" is not the correct word to describe features
like snapshots and windows support.  However, I also agree with Konstantin
that any large feature will have a destabilizing effect on the code base,
even if it is done on a branch and thoroughly tested before being merged
in. HDFS HA from what I have seen and heard is rock solid, but it took a
while to get there even after it was merged into branch-2. And we all know
how long YARN and MRv2 have taken to stabilize.

I also agree that no one individual is able to police all of Hadoop.  We
have to rely on the committers to make sure that what is placed in a
branch is appropriate for that branch in preparation for a release.  As a
community we need to decide what the goals of a branch are so that I as a
committer can know what is and is not appropriate to be placed in that
branch.  This is the reason why we are discussing API and binary
compatibility. This is the reason why I support having a vote for a
release plan.  The question for the community comes down to do we want to
release quickly and often off of trunk trying hard to maintain
compatibility between releases or do we want to follow what we have done
up to now where a single branch goes into stabilization, trunk gets
anything that is not "compatible" with that branch, and it takes a huge
effort to switch momentum from one branch to another.  Up to this point we
have almost successfully done this switch once, from 1.0 to 2.0. I have a
hard time believing that we are going to do this again for another 5 years.

There is nothing preventing the community from letting each organization
decide what they want to do and we end up with both.  But this results in
fragmentation of the community, and makes it difficult for those trying to
stabilize a release because there is no critical mass of individuals using
and testing that branch.  It also results in the scrambling we are seeing
now to try and revert the incompatibilities between 1.0 and 2.0 that were
introduced in the years between these releases.  If we are going to do the
same and make 3.0 compatible with 2.0 when the switch comes, why do we
even allow any incompatible changes in at all?  It just feels like trunk
is a place to put tech debt that we are going to try and revert later.  I
personally like the Linux and BSD models, where there is a new feature
merge window and any new features can come in, then the entire community
works together to stabilize the release before going on to the next merge
window.  If the release does not stabilize quickly the next merge window
gets pushed back. I realize this is very different from the current model
and is not likely to receive a lot of support, but it has worked for them
for a long time, and they have code bases just as large as Hadoop and even
larger and more diverse communities.

I am +1 for Konstantin's release plan and will vote as such on that thread.

--Bobby

On 5/3/13 3:06 AM, "Konstantin Shvachko"  wrote:

>Hi Arun and Suresh,
>
>I am glad my choice of words attracted your attention. I consider this
>important for the project otherwise I wouldn't waste everybody's time.
>You tend to react to the latest message taken out of context, which does not
>reveal full picture.
>I'll try here to summarize my proposal and motivation expressed earlier in
>these two threads:
>http://s.apache.org/fs
>http://s.apache.org/Streamlining
>
>I am advocating
>1. to make 2.0.5 a release that will
>a) make any necessary changes so that Hadoop APIs could be fixed after
>that
>b) fix bugs: internal and those important for stabilizing downstream
>projects
>2. Release 2.1.0 stable. I.e. both with stable APIs and stable code base.
>3. Produce a series of feature releases. Potentially catching up with the
>state of trunk.
>4. Release from trunk afterwards.
>
>The main motivation to minimize changes in 2.0.5 is to let Hadoop users
>and
>the downstream projects, that is the Hadoop community, start adapting
>to
>the new APIs asap. This will provide certainty that people can build their
>products on top of 2.0.5 APIs with minimal risk the next release will
>break
>them.
>Thus Bobby in http://goo.gl/jm5am
>is saying that the meaning of beta for him is locked down APIs for wire
>and
>binary compatibility. For Hadoop Yahoo using 2.x is an opportunity to have
>it tested at very large scale, which in turn will bring other users on
>board.
>
>I agree with Arun that we are not disagreeing on much. Just on the order
>of
>execution: what goes first stability or features.
>I am not challenging any features, the implementations, or the developers.
>But putting all changes together is destructive for the stability of the
>release. Adding a 500 KB patch invalidates prior testing solely because it
>is a big change that needs testing not only by itself but with upstream
>applications.
>With 2.0.3 , 2.0.4 tested thoroughly and widely in many organizations and
>several distributions it seems like a perfect base for the sta

Re: mrv1 vs YARN

2013-04-22 Thread Robert Evans
Like with most major releases of Hadoop, the releases are API compatible,
but not necessarily binary compatible.  That means a job for 1.0 can be
recompiled against 2.0 and it should compile and run similarly to 1.0.  If
it does not, feel free to file a JIRA on the incompatibility.  There have
been a few and we have worked to make them backwards compatible.  As far as
binary compatibility is concerned, for the most part you should be able to
run your jobs without recompiling.  There are some people trying to make
it as binary compatible as possible, but it is not a guarantee.
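
To make "recompile and run" concrete, here is a minimal job written against
the newer org.apache.hadoop.mapreduce API.  It is only a sketch: the class
and path names are made up, and the expectation (not a guarantee) is that the
same source compiles against either the 1.x or 2.x artifacts.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PortableWordCount {

        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // This constructor is present in both 1.x and 2.x (deprecated in 2.x).
            Job job = new Job(conf, "portable word count");
            job.setJarByClass(PortableWordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }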

I would expect to have a non-alpha semi-stable release of 2.0 by late June
or early July.  I am not an expert on this and there are lots of things
that could show up and cause those dates to slip.

--Bobby

On 4/21/13 6:45 PM, "Shekhar Gupta"  wrote:

>I am sorry Amir. I don't have answers for these questions. Because I don't
>use Hadoop for any real production jobs.
>Mainly I play with the Scheduler and ResourceManager of YARN as part of my
>thesis. So I just run some simple jobs to test the performance of the
>scheduler.
>
>
>On Sun, Apr 21, 2013 at 4:18 PM, Amir Sanjar  wrote:
>
>> thanks Shekhar, do you know when we will have a stable release for
>>hadoop
>> 2.0 (not alpha)?
>> Also, from your experience, which components of hadoop 2.0.3 are most
>>unstable
>> and are more likely to need more attention?
>>
>> Best Regards
>> Amir Sanjar
>>
>> System Management Architect
>> PowerLinux Open Source Hadoop development lead
>> IBM Senior Software Engineer
>> Phone# 512-286-8393
>> Fax#  512-838-8858
>>
>>
>>
>>
>> From: Shekhar Gupta 
>> To: common-dev@hadoop.apache.org,
>> Date: 04/21/2013 06:06 PM
>> Subject: Re: mrv1 vs YARN
>> --
>>
>>
>>
>> As per my experience the MapReduce API is the same for both YARN and MRv2.
>> Applications compiled against YARN should run smoothly on MRv1, and
>>vice
>> versa.
>> And in general YARN is pretty stable now.
>>
>> Regards,
>> Shekhar
>>
>>
>>
>> On Sun, Apr 21, 2013 at 3:45 PM, Amir Sanjar 
>>wrote:
>>
>> > Would an application compiled against YARN/MRv2 run transparently on
>> MRv1?
>> > Are there any API differences ?
>> > How stable is YARN/MRV2?
>> >
>> > Best Regards
>> > Amir Sanjar
>> >
>> > System Management Architect
>> > PowerLinux Open Source Hadoop development lead
>> > IBM Senior Software Engineer
>> > Phone# 512-286-8393
>> > Fax#  512-838-8858
>> >
>>
>>



Re: [VOTE] Release Apache Hadoop 2.0.4-alpha

2013-04-17 Thread Robert Evans
+1 (binding)

Downloaded the tar ball and ran some simple jobs.

--Bobby Evans

On 4/17/13 2:01 PM, "Siddharth Seth"  wrote:

>+1 (binding)
>Verified checksums and signatures.
>Built from the source tar, deployed a single node cluster and tested a
>couple of simple MR jobs.
>
>- Sid
>
>
>On Fri, Apr 12, 2013 at 2:56 PM, Arun C Murthy 
>wrote:
>
>> Folks,
>>
>> I've created a release candidate (RC2) for hadoop-2.0.4-alpha that I
>>would
>> like to release.
>>
>> The RC is available at:
>> http://people.apache.org/~acmurthy/hadoop-2.0.4-alpha-rc2/
>> The RC tag in svn is here:
>> 
>>http://svn.apache.org/repos/asf/hadoop/common/tags/release-2.0.4-alpha-rc
>>2
>>
>> The maven artifacts are available via repository.apache.org.
>>
>> Please try the release and vote; the vote will run for the usual 7 days.
>>
>> thanks,
>> Arun
>>
>>
>> --
>> Arun C. Murthy
>> Hortonworks Inc.
>> http://hortonworks.com/
>>
>>
>>



Re: [VOTE] Release Apache Hadoop 0.23.7

2013-04-16 Thread Robert Evans
+1 (binding)

I downloaded the release and ran a few sanity tests on it.

--Bobby

On 4/11/13 2:55 PM, "Thomas Graves"  wrote:

>I've created a release candidate (RC0) for hadoop-0.23.7 that I would like
>to release.
>
>This release is a sustaining release with several important bug fixes in
>it.
>
>The RC is available at:
>http://people.apache.org/~tgraves/hadoop-0.23.7-candidate-0/
>The RC tag in svn is here:
>http://svn.apache.org/viewvc/hadoop/common/tags/release-0.23.7-rc0/
>
>The maven artifacts are available via repository.apache.org.
>
>Please try the release and vote; the vote will run for the usual 7 days.
>
>thanks,
>Tom Graves
>



Re: Hadoop Source Code

2013-03-18 Thread Robert Evans
Look at 

http://wiki.apache.org/hadoop/HowToContribute

It gives step by step instructions.

--Bobby

On 3/18/13 6:43 AM, "Mustaqeem" <3m.mustaq...@gmail.com> wrote:

>I am also working in the same direction.
>As I am new, first of all I want to know what you have done to
>enhance
>Hadoop performance in a heterogeneous environment.
>I have a strategy for that.
>What I really need to know is how to modify the Hadoop source code and how
>to compile
>the source code.
>As I am very new to this field, I want step-by-step instructions.
>
>Please help me. It is for my thesis.
>I hope you will reply as soon as possible.
>



Re: [VOTE] Plan to create release candidate Monday 3/18

2013-03-15 Thread Robert Evans
+1

On 3/10/13 10:38 PM, "Matt Foley"  wrote:

>Hi all,
>I have created branch-1.2 from branch-1, and propose to cut the first
>release candidate for 1.2.0 on Monday 3/18 (a week from tomorrow), or as
>soon thereafter as I can achieve a stable build.
>
>Between 1.1.2 and the current 1.2.0, there are 176 patches!!  Draft
>release
>notes are available at .../branch-1.2/src/docs/releasenotes.html in the
>sources.
>
>Any non-destabilizing patches committed to branch-1.2 during the coming
>week (and of course also committed to branch-1) will be included in the
>RC.
> However, at this point I request that any big new developments not yet in
>branch-1.2 be targeted for 1.3.
>
>Release plans have to be voted on too, so please vote '+1' to approve this
>plan.  Voting will close on Sunday 3/17 at 8:30pm PDT.
>
>Thanks,
>--Matt
>(release manager)



Re: [VOTE] Plan to create release candidate for 0.23.7

2013-03-15 Thread Robert Evans
+1

On 3/13/13 11:31 AM, "Thomas Graves"  wrote:

>Hello all,
>
>I think enough critical bug fixes have went in to branch-0.23 that
>warrant another release. I plan on creating a 0.23.7 release by the end
>March.
>
>Please vote '+1' to approve this plan.  Voting will close on Wednesday
>3/20 at 10:00am PDT.
>
>Thanks,
>Tom Graves
>(release manager)



Re: testing

2013-03-05 Thread Robert Evans
I personally would start off with a bug in an area that you are interested
in.

https://issues.apache.org/jira/issues/?jql=project%20in%20%28HADOOP%2C%20MAPREDUCE%2C%20HDFS%2C%20YARN%29%20AND%20status%20%3D%20Open%20AND%20type%20%3D%20Bug%20AND%20assignee%20is%20EMPTY%20ORDER%20BY%20priority%20ASC

That is a JIRA query of open, unassigned bugs in Hadoop core.  Once you feel
comfortable changing the code and going through the entire process then
you can start looking at bigger things.  Also be aware that committers
sometimes get busy with other things.  If your patch is ready for review
and no one starts looking at it, ping the -dev mailing list associated
with the ticket and ask for someone to take a look at it.

--Bobby

On 3/5/13 10:01 AM, "VENKAT KAUSHIK"  wrote:

>Hi,
>I am following the instructions on
>HowToContribute wiki to get started with
>some of the tasks listed (test/research).
>
>Where is a good place to start contributing - test or research projects?
>I would like to choose something small and under represented. Please
>let me know.
>
>Thanks,
>Venkat
>
>-- 
>=
>Venkatesh Kaushik
>Research Associate
>
>University of Arizona  ATLAS Experiment@CERN
>Office: PAS 33440-1-C11
>Tucson, Arizona        Genève, Switzerland
>Tel: +1 520 626 7042   +41 22 76 79137
>ven...@physics.arizona.edu venkat.kaus...@cern.ch
>http://atlas.physics.arizona.edu/~venkat
>=



Re: [Vote] Merge branch-trunk-win to trunk

2013-02-28 Thread Robert Evans
>
>> If I submit a patch and it gets -1 "tests failed" on the Windows slave,
>> how am I supposed to proceed?
>>
>> I think a reasonable compromise would be that the tests should always
>> *build* on Windows before commit, and contributors should do their best
>>to
>> look at the test logs for any Windows-specific failures. But, beyond
>> looking at the logs, a "-1 Tests failed on windows" should not block a
>> commit.
>>
>> Those contributors who are interested in Windows being a first-class
>> platform should be responsible for watching the Windows builds and
>> debugging/fixing any regressions that might be Windows-specific.
>>
>> I also think the KDE model that Harsh pointed out is an interesting one
>>--
>> ie the idea that we would not merge windows support to trunk, but rather
>> treat it as a "parallel code line" which lives in the ASF and has its
>>own
>> builds and releases. The windows team would periodically merge
>>trunk->win
>> to pick up any new changes, and do a separate test/release process. I'm
>>not
>> convinced this is the best idea, but worth discussion of pros and cons.
>>
>> -Todd
>>
>>
>> >
>> > On Wed, Feb 27, 2013 at 11:56 AM, Eli Collins 
>>wrote:
>> >
>> > > Bobby raises some good questions.  A related one, since most current
>> > > developers won't add Windows support for new features that are
>> > > platform specific is it assumed that Windows development will either
>> > > lag or will people actively work on keeping Windows up with the
>> > > latest?  And vice versa in case Windows support is implemented
>>first.
>> > >
>> > > Is there a jira for resolving the outstanding TODOs in the code base
>> > > (similar to HDFS-2148)?  Looks like this merge doesn't introduce
>> > > many which is great (just did a quick diff and grep).
>> > >
>> > > Thanks,
>> > > Eli
>> > >
>> > > On Wed, Feb 27, 2013 at 8:17 AM, Robert Evans 
>> > wrote:
>> > > > After this is merged in is Windows still going to be a second
>> > > > class citizen but happens to work for more than just development
>> > > > or is it a fully supported platform where if something breaks it
>> > > > can block a
>> > > release?
>> > > >  How do we as a community intend to keep Windows support from
>> breaking?
>> > > > We don't have any Jenkins slaves to be able to run nightly tests
>> > > > to validate everything still compiles/runs.  This is not a blocker
>> > > > for me because we often rely on individuals and groups to test
>> > > > Hadoop, but I
>> > do
>> > > > think we need to have this discussion before we put it in.
>> > > >
>> > > > --Bobby
>> > > >
>> > > > On 2/26/13 4:55 PM, "Suresh Srinivas" 
>> wrote:
>> > > >
>> > > >>I had posted heads up about merging branch-trunk-win to trunk on
>> > > >>Feb
>> > 8th.
>> > > >>I
>> > > >>am happy to announce that we are ready for the merge.
>> > > >>
>> > > >>Here is a brief recap on the highlights of the work done:
>> > > >>- Command-line scripts for the Hadoop surface area
>> > > >>- Mapping the HDFS permissions model to Windows
>> > > >>- Abstracted and reconciled mismatches around differences in Path
>> > > >>semantics in Java and Windows
>> > > >>- Native Task Controller for Windows
>> > > >>- Implementation of a Block Placement Policy to support cloud
>> > > >>environments, more specifically Azure.
>> > > >>- Implementation of Hadoop native libraries for Windows
>> > > >>(compression codecs, native I/O)
>> > > >>- Several reliability issues, including race-conditions,
>> > > >>intermittent
>> > > test
>> > > >>failures, resource leaks.
>> > > >>- Several new unit test cases written for the above changes
>> > > >>
>> > > >>Please find the details of the work in
>> > > >>CHANGES.branch-trunk-win.txt - Common
>> > > >>changes<http://bit.ly/Xe7Ynv>, HDFS changes<
>> > http://bit.ly/13QOSo9
>> > > >,
>> >

Re: [Vote] Merge branch-trunk-win to trunk

2013-02-27 Thread Robert Evans
After this is merged in, is Windows still going to be a second-class
citizen that happens to work for more than just development, or is it a
fully supported platform where, if something breaks, it can block a release?
 How do we as a community intend to keep Windows support from breaking?
We don't have any Jenkins slaves to be able to run nightly tests to
validate everything still compiles/runs.  This is not a blocker for me
because we often rely on individuals and groups to test Hadoop, but I do
think we need to have this discussion before we put it in.

--Bobby

On 2/26/13 4:55 PM, "Suresh Srinivas"  wrote:

>I had posted heads up about merging branch-trunk-win to trunk on Feb 8th.
>I
>am happy to announce that we are ready for the merge.
>
>Here is a brief recap on the highlights of the work done:
>- Command-line scripts for the Hadoop surface area
>- Mapping the HDFS permissions model to Windows
>- Abstracted and reconciled mismatches around differences in Path
>semantics
>in Java and Windows
>- Native Task Controller for Windows
>- Implementation of a Block Placement Policy to support cloud
>environments,
>more specifically Azure.
>- Implementation of Hadoop native libraries for Windows (compression
>codecs, native I/O)
>- Several reliability issues, including race-conditions, intermittent test
>failures, resource leaks.
>- Several new unit test cases written for the above changes
>
>Please find the details of the work in CHANGES.branch-trunk-win.txt -
>Common changes, HDFS changes,
>and YARN and MapReduce changes . This is the work
>ported from branch-1-win to a branch based on trunk.
>
>For details of the testing done, please see the thread -
>http://bit.ly/WpavJ4. Merge patch for this is available on HADOOP-8562<
>https://issues.apache.org/jira/browse/HADOOP-8562>.
>
>This was a large undertaking that involved developing code, testing the
>entire Hadoop stack, including scale tests. This is made possible only
>with
>the contribution from many many folks in the community. Following people
>contributed to this work: Ivan Mitic, Chuan Liu, Ramya Sunil, Bikas Saha,
>Kanna Karanam, John Gordon, Brandon Li, Chris Nauroth, David Lao, Sumadhur
>Reddy Bolli, Arpit Agarwal, Ahmed El Baz, Mike Liddell, Jing Zhao, Thejas
>Nair, Steve Maine, Ganeshan Iyer, Raja Aluri, Giridharan Kesavan, Ramya
>Bharathi Nimmagadda, Daryn Sharp, Arun Murthy, Tsz-Wo Nicholas Sze, Suresh
>Srinivas and Sanjay Radia. There are many others who contributed as well
>providing feedback and comments on numerous jiras.
>
>The vote will run for seven days and will end on March 5, 6:00PM PST.
>
>Regards,
>Suresh
>
>
>
>
>On Thu, Feb 7, 2013 at 6:41 PM, Mahadevan Venkatraman
>wrote:
>
>> It is super exciting to look at the prospect of these changes being
>>merged
>> to trunk. Having Windows as one of the supported Hadoop platforms is a
>> fantastic opportunity both for the Hadoop project and Microsoft
>>customers.
>>
>> This work began around a year back when a few of us started with a basic
>> port of Hadoop on Windows. Ever since, the Hadoop team in Microsoft have
>> made significant progress in the following areas:
>> (PS: Some of these items are already included in Suresh's email, but
>> including again for completeness)
>>
>> - Command-line scripts for the Hadoop surface area
>> - Mapping the HDFS permissions model to Windows
>> - Abstracted and reconciled mismatches around differences in Path
>> semantics in Java and Windows
>> - Native Task Controller for Windows
>> - Implementation of a Block Placement Policy to support cloud
>> environments, more specifically Azure.
>> - Implementation of Hadoop native libraries for Windows (compression
>> codecs, native I/O) - Several reliability issues, including
>> race-conditions, intermittent test failures, resource leaks.
>> - Several new unit test cases written for the above changes
>>
>> In the process, we have closely engaged with the Apache open source
>> community and have got great support and assistance from the community
>>in
>> terms of contributing fixes, code review comments and commits.
>>
>> In addition, the Hadoop team at Microsoft has also made good progress in
>> other projects including Hive, Pig, Sqoop, Oozie, HCat and HBase. Many
>>of
>> these changes have already been committed to the respective trunks with
>> help from various committers and contributors. It is great to see the
>> commitment of the community to support multiple platforms, and we look
>> forward to the day when a developer/customer is able to successfully
>>deploy
>> a complete solution stack based on Apache Hadoop releases.
>>
>> Next Steps:
>>
>> All of the above changes are part of the Windows Azure HDInsight and
>> HDInsight Server products from Microsoft. We have successfully
>>on-boarded
>> several internal customers and have been running production workloads on
>> Windows Azure HDInsight. Our vision is to create a big data platform
>>base

timeout is now requested to be on all tests

2013-02-20 Thread Robert Evans
Sorry about cross posting, but this will impact all developers and I wanted to 
give you all a heads-up.

HADOOP-9112 was just checked
in.  This means that the pre-commit build will now give a -1 for any patch with
JUnit tests that do not include a timeout option.  See
http://junit.sourceforge.net/javadoc/org/junit/Test.html for more info on that.
This is to avoid Surefire timing out the whole JUnit run when a test gets stuck,
which gives no real feedback on which test failed.
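
For anyone touching existing tests, the option in question is the timeout
attribute on the JUnit 4 @Test annotation.  A minimal sketch (the class name
and the 30 second value are just examples):

    import static org.junit.Assert.assertTrue;

    import org.junit.Test;

    public class ExampleTimeoutTest {

        // Timeout is in milliseconds; if the test runs longer, JUnit fails this
        // one test instead of Surefire killing the whole run with no details.
        @Test(timeout = 30000)
        public void testFinishesQuickly() {
            assertTrue(doWork());
        }

        private boolean doWork() {
            return true; // placeholder for real test logic
        }
    }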

--Bobby


Re: [VOTE] Hadoop 1.1.2-rc5 release candidate vote

2013-02-08 Thread Robert Evans
Sorry about that.  +1 (binding)
I downloaded the binary tar, started everything up, and ran a few simple
jobs.

--Bobby

On 2/8/13 12:04 AM, "Matt Foley"  wrote:

>Wow, total apathy!  We only got one vote besides mine, and that was
>non-binding.
>I'll try again.  Please vote on this release candidate for Hadoop
>1.1.2-rc5.
>Voting will close one week from now, at 10pm PST on Thursday 14 Feb.
>
>Thanks,
>--Matt
>
>
>On Fri, Feb 1, 2013 at 11:28 AM, Chris Nauroth
>wrote:
>
>> +1 (non-binding)
>>
>> I deployed hadoop-1.1.2.tar.gz to 3 Ubuntu VMs and ran NN, JT, 2NN, 2 *
>> DN, and 2 * TT.  I verified the checksum.  I tested multiple command
>>line
>> HDFS interactions and MapReduce jobs.  I specifically tested for the
>> HDFS-4423 blocker bug fix, and it worked.  Since that change touched
>> checkpointing, I also verified that the 2NN could complete a successful
>> checkpoint.
>>
>> I'll also verify the PGP signature once I track down the public key that
>> was used for signing.
>>
>> Thank you,
>> --Chris
>>
>>
>> On Thu, Jan 31, 2013 at 7:13 PM, Matt Foley  wrote:
>>
>>> (resending with modified Subject line for RC5)
>>>
>>> Hadoop-1.1.2-rc4 is withdrawn.
>>>
>>> Hadoop-1.1.2-rc5 is available at
>>> http://people.apache.org/~mattf/hadoop-1.1.2-rc5/
>>> or in SVN at
>>> http://svn.apache.org/viewvc/hadoop/common/tags/release-1.1.2-rc5/
>>> or in the Maven repo.
>>>
>>> This candidate for a stabilization release of the Hadoop-1.1 branch
>>>has 24
>>> patches and several cleanups compared to the Hadoop-1.1.1 release.
>>>  Release
>>> notes are available at
>>> http://people.apache.org/~mattf/hadoop-1.1.2-rc5/releasenotes.html
>>>
>>> Please vote for this as the next release of Hadoop-1.  Voting will
>>>close
>>> next Thursday, 7 Feb, at 3:00pm PST.
>>>
>>> Thanks,
>>> --Matt
>>>
>>>



Re: [VOTE] Release hadoop-2.0.3-alpha

2013-02-07 Thread Robert Evans
I downloaded the binary package and ran a few example jobs on a 3 node
cluster.  Everything seems to be working OK on it.  I did see

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable

for every shell command, but just like with 0.23.6 I don't think it is a
blocker.

+1 (Binding)

--Bobby

On 2/6/13 9:59 PM, "Arun C Murthy"  wrote:

>Folks,
>
>I've created a release candidate (rc0) for hadoop-2.0.3-alpha that I
>would like to release.
>
>This release contains several major enhancements such as QJM for HDFS HA,
>multi-resource scheduling for YARN, YARN ResourceManager restart etc.
>Also YARN has achieved significant stability at scale (more details from
>Y! folks here: http://s.apache.org/VYO).
>
>The RC is available at:
>http://people.apache.org/~acmurthy/hadoop-2.0.3-alpha-rc0/
>The RC tag in svn is here:
>http://svn.apache.org/viewvc/hadoop/common/tags/release-2.0.3-alpha-rc0/
>
>The maven artifacts are available via repository.apache.org.
>
>Please try the release and vote; the vote will run for the usual 7 days.
>
>thanks,
>Arun
>
>
>
>--
>Arun C. Murthy
>Hortonworks Inc.
>http://hortonworks.com/
>
>



Re: More information regarding the Project suggestions given on the Hadoop website

2013-02-07 Thread Robert Evans
This conversation is probably better for common-user@ so I am moving it
over there, I put common-dev@ in the BCC.

I am not really sure what you mean by validate.  I assume you want to test
that your library does what you want it to do.  I would start out with
unit tests to validate the individual pieces work as you designed them to.
 After that you want to do some system level testing.  When I port an
algorithm over to Hadoop I typically have one of two goals:
I either want to reproduce the original algorithm exactly, or I want to
create a good enough approximation of it that is extremely scalable.

If you recreated the algorithm exactly you could validate it against the
single computer reference implementation and check that the results are
identical.  With machine learning this is often difficult because many
algorithms use random numbers as part of the process.  To get around this
you sometimes have to modify both implementations to be able to use a
consistent set of pseudo-random numbers.

The other alternative is to use statistics, and this works fairly well no
matter how you ported the algorithm.  Train using the same input data
multiple times using each implementation.  Compare the results against a
test set.  As grad students you probably already understand the stats
necessary to do this correctly.  Your advisor will probably also
be able to give you better advice on this too, because they can sit down
with you and give you much faster feedback.
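
As a rough illustration of the statistics route, the sketch below compares
test-set accuracies from repeated trainings of a reference implementation and
a ported one.  Everything in it is an assumption made for illustration: the
numbers are invented and the two-standard-error rule of thumb is only a
sanity check, not a substitute for a proper significance test.

    public class PortValidationSketch {

        static double mean(double[] xs) {
            double sum = 0.0;
            for (double x : xs) {
                sum += x;
            }
            return sum / xs.length;
        }

        // Sample standard deviation.
        static double stdDev(double[] xs) {
            double m = mean(xs);
            double ss = 0.0;
            for (double x : xs) {
                ss += (x - m) * (x - m);
            }
            return Math.sqrt(ss / (xs.length - 1));
        }

        public static void main(String[] args) {
            // Test-set accuracies from repeated trainings (made-up numbers).
            double[] reference = {0.912, 0.915, 0.909, 0.918, 0.911};
            double[] ported    = {0.905, 0.917, 0.910, 0.913, 0.908};

            double diff = Math.abs(mean(reference) - mean(ported));
            // Rough standard error of the difference between the two means.
            double se = Math.sqrt(
                Math.pow(stdDev(reference), 2) / reference.length
                + Math.pow(stdDev(ported), 2) / ported.length);

            System.out.printf("reference=%.4f ported=%.4f |diff|=%.4f se=%.4f%n",
                mean(reference), mean(ported), diff, se);
            System.out.println(diff <= 2 * se
                ? "Difference is within noise; follow up with a real t-test."
                : "Difference looks real; dig into the port.");
        }
    }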

--Bobby

On 2/7/13 12:55 AM, "Varsha Raveendran" 
wrote:

>Hello!
>
>
>Based on couple of existing genetic algorithms library available on the
>net, my team and I have come up with a design for the library. But we are
>not able to understand how to validate the library -
>
>Are there any test designs followed to test if a library is working
>correctly?
>
>
>I would like to again mention that we are graduate students and have just
>started working on Hadoop.
>
>Thanks in advance,
>Varsha
>
>
>
>On Sat, Jan 19, 2013 at 9:42 AM, Varsha Raveendran <
>varsha.raveend...@gmail.com> wrote:
>
>> Thank you! I will check with the Mahout team and also go through Commons
>> Math site.
>>
>> Thanks & Regards,
>> Varsha
>>
>>
>> On Sat, Jan 19, 2013 at 12:16 AM, Robert Evans
>>wrote:
>>
>>> I'm not sure I am exactly the right person for this, but I assume that
>>>you
>>> are familiar with genetic algorithms.  The Mahout Project is probably a
>>> good place to start http://mahout.apache.org/ they have a number of
>>> machine learning algorithms that run on top of Hadoop.  I did a search
>>>and
>>> it looks like there may already be some support for them in Mahout,
>>>but I
>>> don't know the current state of it.  It looked like there was some
>>> discussion about it being abandoned and might be deleted.  Either way
>>>it
>>> would be a good starting point.  Commons Math may be a good place to
>>>look
>>> too because there is an implementation there that is already Apache
>>> licensed. So if you borrow some of the code there is no issue
>>> http://commons.apache.org/math/userguide/genetics.html.
>>>
>>> --Bobby Evans
>>>
>>> On 1/16/13 8:24 AM, "Varsha Raveendran" 
>>> wrote:
>>>
>>> >Hello!
>>> >
>>> >I require information regarding a project given on the Hadoop website.
>>> Can
>>> >anyone guide me in the right direction?
>>> >
>>> >The project is "Implement a library/framework to support Genetic
>>> >Algorithms<http://en.wikipedia.org/wiki/Genetic_algorithm>on Hadoop
>>> >Map-Reduce."
>>> >
>>> >
>>> >Regards,
>>> >Varsha
>>> >
>>> >New to Hadoop :)
>>>
>>>
>>
>>
>> --
>> *-Varsha *
>>
>
>
>
>-- 
>*-Varsha *



Re: More information regarding the Project suggestions given on the Hadoop website

2013-01-18 Thread Robert Evans
I'm not sure I am exactly the right person for this, but I assume that you
are familiar with genetic algorithms.  The Mahout Project is probably a
good place to start (http://mahout.apache.org/); they have a number of
machine learning algorithms that run on top of Hadoop.  I did a search and
it looks like there may already be some support for them in Mahout, but I
don't know the current state of it.  It looked like there was some
discussion about it being abandoned and possibly deleted.  Either way it
would be a good starting point.  Commons Math may be a good place to look
too, because there is an implementation there that is already Apache
licensed, so if you borrow some of the code there is no issue:
http://commons.apache.org/math/userguide/genetics.html.

--Bobby Evans

On 1/16/13 8:24 AM, "Varsha Raveendran" 
wrote:

>Hello!
>
>I require information regarding a project given on the Hadoop website. Can
>anyone guide me in the right direction?
>
>The project is "Implement a library/framework to support Genetic
>Algorithms on Hadoop
>Map-Reduce."
>
>
>Regards,
>Varsha
>
>New to Hadoop :)



Re: Problem creating patch for HADOOP-9184

2013-01-10 Thread Robert Evans
http://wiki.apache.org/hadoop/HowToContribute should explain it.  There
are sections for both pre- and post-mavenization.  Look for test-patch.sh,
which is the script you run for the 1.0 branch.

--Bobby

On 1/10/13 9:54 AM, "Jeremy Karn"  wrote:

>The patch doesn't apply to trunk (I'm not that familiar with the latest
>code structure but it looks like the problem doesn't exist in trunk), but
>I'm not sure what I'm supposed to run for a non-trunk commit.  The
>dev-support directory doesn't exist and there's no pre-commit ant target
>for the 0.20 branch.  I tried running the test-contrib target but I was
>having tests fail because of timeouts.
>
>Is there documentation somewhere about what I should post to the jira for
>an older commit?
>
>
>On Tue, Jan 8, 2013 at 10:53 AM, Jeremy Karn  wrote:
>
>> Thanks!
>>
>>
>> On Tue, Jan 8, 2013 at 10:48 AM, Robert Evans 
>>wrote:
>>
>>> This is because your patch is against the 0.20 branch, not against
>>>trunk.
>>> Jenkins pre commit only works for trunk right now.  If the issue also
>>> exists on trunk then please provide a patch for trunk too, if it is a
>>> 1.0/0.20 specific issue then you
>>>
>>
>>



Re: Problem creating patch for HADOOP-9184

2013-01-08 Thread Robert Evans
This is because your patch is against the 0.20 branch, not against trunk.
Jenkins pre-commit only works for trunk right now.  If the issue also
exists on trunk then please provide a patch for trunk too; if it is a
1.0/0.20-specific issue then you can run the pre-commit tests yourself and
just post the results.

--Bobby

On 1/8/13 9:43 AM, "Jeremy Karn"  wrote:

>I opened this jira yesterday and tried to include a patch for the problem
>but the Jenkins pre-commit job keeps failing because it says it can't
>apply
>my patch 
>(https://builds.apache.org/job/PreCommit-HADOOP-Build/2012//console).
> I thought at first the problem was because I generated the patch with
>git,
>but I've since done a svn checkout, regenerated the patch file, and been
>able to apply the commit locally without a problem.
>
>Any help would be appreciated!  Thanks,
>
>-- 
>
>Jeremy Karn / Lead Developer
>MORTAR DATA / www.mortardata.com



Re: [VOTE] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

2012-11-26 Thread Robert Evans
+1, +1, 0

On 11/24/12 2:13 PM, "Matt Foley"  wrote:

>For discussion, please see previous thread "[PROPOSAL] introduce Python as
>build-time and run-time dependency for Hadoop and throughout Hadoop
>stack".
>
>This vote consists of three separate items:
>
>1. Contributors shall be allowed to use Python as a platform-independent
>scripting language for build-time tasks, and add Python as a build-time
>dependency.
>Please vote +1, 0, -1.
>
>2. Contributors shall be encouraged to use Maven tasks in combination with
>either plug-ins or Groovy scripts to do cross-platform build-time tasks,
>even under ant in Hadoop-1.
>Please vote +1, 0, -1.
>
>3. Contributors shall be allowed to use Python as a platform-independent
>scripting language for run-time tasks, and add Python as a run-time
>dependency.
>Please vote +1, 0, -1.
>
>Note that voting -1 on #1 and +1 on #2 essentially REQUIRES contributors
>to
>use Maven plug-ins or Groovy as the only means of cross-platform
>build-time
>tasks, or to simply continue using platform-dependent scripts as is being
>done today.
>
>Vote closes at 12:30pm PST on Saturday 1 December.
>-
>Personally, my vote is +1, +1, +1.
>I think #2 is preferable to #1, but still has many unknowns in it, and
>until those are worked out I don't want to delay moving to cross-platform
>scripts for build-time tasks.
>
>Best regards,
>--Matt



Re: [PROPOSAL] 1.1.1 and 1.2.0 scheduling

2012-11-09 Thread Robert Evans
+1

On 11/9/12 12:27 PM, "Steve Loughran"  wrote:

>On 9 November 2012 17:52, Matt Foley  wrote:
>
>> Hi all,
>> Hadoop 1.1.0 came out on Oct 12.  I think there's enough interest to do
>>a
>> maintenance release with some important patches.  I propose to code
>>freeze
>> branch-1.1 a week from today, Fri 16 Nov, and have a 1.1.1 release
>> candidate ready for eval & vote starting Mon 19 Nov.
>>
>> There's also a lot of good new stuff in branch-1.  I suggest that on
>>Dec.1,
>> I create a branch-1.2 from branch-1, with a code freeze on Dec.7, and
>>I'll
>> create a 1.2.0 release candidate on Mon 10 Dec.
>>
>> Please provide your +1 if this is acceptable to you.
>>
>
>+1
>
>
>>
>> For 1.1.1, I propose to include the below, and I am of course open to
>> additional high-priority patches if they are reliable and can be
>>committed
>> to branch-1.1 by the code freeze date.  Let's try to stick to serious
>>bugs
>> and not new features.  Thanks!
>>
>> --Matt Foley
>> Release Manager
>>
>> HADOOP-8823. ant package target should not depend on cn-docs.
>> (szetszwo)
>>
>> HADOOP-8878. Uppercase namenode hostname causes hadoop dfs calls
>>with
>> webhdfs filesystem and fsck to fail when security is on.
>> (Arpit Gupta via suresh)
>>
>> HADOOP-8882. Uppercase namenode host name causes fsck to fail when
>> useKsslAuth is on. (Arpit Gupta via suresh)
>>
>> HADOOP-8995. Remove unnecessary bogus exception from
>> Configuration.java.
>> (Jing Zhao via suresh)
>>
>> HDFS-2815. Namenode is not coming out of safemode when we perform
>> (NN crash + restart). Also FSCK report shows blocks missed.
>>(umamahesh)
>>
>> HDFS-3791. HDFS-173 Backport - Namenode will not block until a large
>> directory deletion completes. It allows other operations when the
>> deletion is in progress. (umamahesh via suresh)
>>
>> HDFS-4134. hadoop namenode and datanode entry points should return
>> negative exit code on bad arguments. (Steve Loughran via suresh)
>>
>> MAPREDUCE-4782. NLineInputFormat skips first line of last InputSplit
>> (Mark Fuhs via bobby)
>>



RE: [DISCUSS] remove packaging

2012-10-15 Thread Robert Evans
Eli answered my question I am a +1 too.

-Original Message-
From: Alejandro Abdelnur [mailto:t...@cloudera.com] 
Sent: Monday, October 15, 2012 11:02 AM
To: common-dev@hadoop.apache.org
Subject: Re: [DISCUSS] remove packaging

+1

Alejandro

On Oct 15, 2012, at 10:52 AM, Robert Evans  wrote:

> Eli,
> 
> By packaging I assume that you mean the RPM/Deb packages and not the tar.gz.  
> If that is the case I have no problem with them being removed because as you 
> said in the JIRA BigTop is already providing a working alternative.  If 
> someone else wants to step up to maintain them I also don't have a problem 
> with them staying, so long as they become a part of HADOOP-8914 (Automating 
> the release build).
> 
> --Bobby
> 
> 
> -Original Message-
> From: Eli Collins [mailto:e...@cloudera.com] 
> Sent: Monday, October 15, 2012 10:33 AM
> To: common-dev@hadoop.apache.org
> Subject: [DISCUSS] remove packaging
> 
> Hey guys,
> 
> Heads up: I filed HADOOP-8925 to remove the packaging from trunk and
> branch-2.  The packages are not currently being built, were never
> updated for MR2/YARN, I'm not aware of anyone planning to do this work
> or maintain them, etc. No sense in letting them continue to bit rot in
> the code base.
> 
> Thanks,
> Eli


RE: [DISCUSS] remove packaging

2012-10-15 Thread Robert Evans
Eli,

By packaging I assume that you mean the RPM/Deb packages and not the tar.gz.  
If that is the case I have no problem with them being removed because as you 
said in the JIRA BigTop is already providing a working alternative.  If someone 
else wants to step up to maintain them I also don't have a problem with them 
staying, so long as they become a part of HADOOP-8914 (Automating the release 
build).

--Bobby


-Original Message-
From: Eli Collins [mailto:e...@cloudera.com] 
Sent: Monday, October 15, 2012 10:33 AM
To: common-dev@hadoop.apache.org
Subject: [DISCUSS] remove packaging

Hey guys,

Heads up: I filed HADOOP-8925 to remove the packaging from trunk and
branch-2.  The packages are not currently being built, were never
updated for MR2/YARN, I'm not aware of anyone planning to do this work
or maintain them, etc. No sense in letting them continue to bit rot in
the code base.

Thanks,
Eli


Re: [VOTE] Hadoop-1.0.4-rc0

2012-10-10 Thread Robert Evans
I had to also update id.apache.org because the maven repo started to
complain that it didn't know who I was. I didn't have any issues checking
in my updated keys to the svn repo though.

--Bobby

On 10/9/12 5:38 PM, "Eli Collins"  wrote:

>On Tue, Oct 9, 2012 at 1:02 PM, Matt Foley  wrote:
>> Hi Eli,
>> Thanks for the suggestion.  Looks like this has gotten fleshed out a
>>little
>> more since I started doing releases.
>>
>> I've had my key posted at MIT since the beginning.  I've now also
>>uploaded
>> it to the PGP Global Directory, and uploaded the key fingerprint to my
>> profile at id.apache.org.
>>
>> However, when I tried to commit it to
>> KEYS,
>> I got error "svn:  access to
>> '/repos/asf/!svn/act/69e4489f-fcdc-45b3-9a14-637b3a078b13' forbidden".
>>  Also, at http://svn.apache.org/repos/asf/hadoop/common/dist/readme.txt
>>it
>> says:
>>
>>
>> To generate the KEYS file, use:
>>
>> % wget https://people.apache.org/keys/group/hadoop.asc > KEYS
>>
>> which would seem to argue against simply committing changes to KEYS.
>>Yet
>> the file at https://people.apache.org/keys/group/hadoop.asc
>> is considerably behind the file at
>> http://svn.apache.org/repos/asf/hadoop/common/dist/KEYS
>>
>> Any idea what the correct resolution of this is?  Or should I ping
>> infra@for the correct instructions?
>
>Bobby, you just updated the KEYS file right? Did you run into any of
>these issues?
>
>Thanks,
>Eli



Re: Fix versions for commits branch-0.23

2012-10-09 Thread Robert Evans
I don't see much of a reason to have the same JIRA listed under both 0.23
and 2.0.  I can see some advantage of being able to see what went into
0.23.X by looking at a 2.0.X CHANGES.txt, but unless the two are released
at exactly the same time they will be out of date with each other in the
best cases.  I personally think the only way to truly know what is in
0.23.X is to look at the CHANGES.txt on 0.23.X and similarly for 2.X.
Having JIRA be in sync is a huge help and we should definitely push for
that.  I just don't see much value in trying very hard to have the
CHANGES.txt stay in sync.

--Bobby

On 10/8/12 10:21 PM, "Siddharth Seth"  wrote:

>Along with fix versions, does it make sense to add JIRAs under 0.23 as
>well
>as branch-2 in CHANGES.txt, if they're committed to both branches.
>CHANGES.txt tends to get out of sync with the different release schedules
>of the 2 branches.
>
>Thanks
>- Sid
>
>On Sat, Sep 29, 2012 at 10:33 PM, Arun C Murthy 
>wrote:
>
>> Guys,
>>
>>  A request - can everyone please set fix-version to both 2.* and
>>0.23.*? I
>> found some with only 0.23.*, makes generating release-notes very hard.
>>
>> thanks,
>> Arun



Re: Commits breaking compilation of MR 'classic' tests

2012-09-26 Thread Robert Evans
That is fine. We may want to then mark it so that MAPREDUCE-4687 depends on the
JIRA to port the tests, so the tests don't disappear before we are done.

--Bobby

From: Arun C Murthy 
Date: Wednesday, September 26, 2012 12:31 PM
To: hdfs-...@hadoop.apache.org, "Yahoo! Inc." 
Cc: common-dev@hadoop.apache.org, yarn-...@hadoop.apache.org, mapreduce-...@hadoop.apache.org
Subject: Re: Commits breaking compilation of MR 'classic' tests

Fair, however there are still tests which need to be ported over. We can remove 
them after the port.

On Sep 26, 2012, at 9:54 AM, Robert Evans wrote:

As per my comment on the bug, I thought we were going to remove them.

MAPREDUCE-4266 only needs a little bit more work, changing a patch to a
script, before they disappear entirely.  I would much rather see dead code
die than be maintained for a few tests that are mostly testing the dead
code itself.


--Bobby

On 9/26/12 9:39 AM, "Arun C Murthy"  wrote:

Point. I've opened https://issues.apache.org/jira/browse/MAPREDUCE-4687
to track this.

On Sep 25, 2012, at 9:33 PM, Eli Collins wrote:

How about adding this step to the MR PreCommit jenkins job so it's run
as part test-patch?

On Tue, Sep 25, 2012 at 7:48 PM, Arun C Murthy 
wrote:
Committers,

As most people are aware, the MapReduce 'classic' tests (in
hadoop-mapreduce-project/src/test) still need to built using ant since
they aren't mavenized yet.

I've seen several commits (and 2 within the last hour i.e.
MAPREDUCE-3681 and MAPREDUCE-3682) which lead me to believe
developers/committers aren't checking for this.

Henceforth, with all changes, before committing, please do run:
$ mvn install
$ cd hadoop-mapreduce-project
$ ant veryclean all-jars -Dresolvers=internal

These instructions were already in
http://wiki.apache.org/hadoop/HowToReleasePostMavenization and I've
just updated http://wiki.apache.org/hadoop/HowToContribute.

thanks,
Arun


--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: Commits breaking compilation of MR 'classic' tests

2012-09-26 Thread Robert Evans
As per my comment on the bug, I thought we were going to remove them.

MAPREDUCE-4266 only needs a little bit more work, changing a patch to a
script, before they disappear entirely.  I would much rather see dead code
die than be maintained for a few tests that are mostly testing the dead
code itself.


--Bobby

On 9/26/12 9:39 AM, "Arun C Murthy"  wrote:

>Point. I've opened https://issues.apache.org/jira/browse/MAPREDUCE-4687
>to track this.
>
>On Sep 25, 2012, at 9:33 PM, Eli Collins wrote:
>
>> How about adding this step to the MR PreCommit jenkins job so it's run
>> as part test-patch?
>> 
>> On Tue, Sep 25, 2012 at 7:48 PM, Arun C Murthy 
>>wrote:
>>> Committers,
>>> 
>>> As most people are aware, the MapReduce 'classic' tests (in
>>>hadoop-mapreduce-project/src/test) still need to built using ant since
>>>they aren't mavenized yet.
>>> 
>>> I've seen several commits (and 2 within the last hour i.e.
>>>MAPREDUCE-3681 and MAPREDUCE-3682) which lead me to believe
>>>developers/committers aren't checking for this.
>>> 
>>> Henceforth, with all changes, before committing, please do run:
>>> $ mvn install
>>> $ cd hadoop-mapreduce-project
>>> $ ant veryclean all-jars -Dresolvers=internal
>>> 
>>> These instructions were already in
>>>http://wiki.apache.org/hadoop/HowToReleasePostMavenization and I've
>>>just updated http://wiki.apache.org/hadoop/HowToContribute.
>>> 
>>> thanks,
>>> Arun
>>> 
>
>--
>Arun C. Murthy
>Hortonworks Inc.
>http://hortonworks.com/
>
>



Re: About ant hadoop

2012-09-19 Thread Robert Evans
It would help if you could explain a bit more about what you changed.  It
is hard to debug something simply by saying it compiles but does not run
correctly.  You probably want to check the logs/UI for the JT and try to
trace down the path this job is taking.

--Bobby

On 9/19/12 2:01 AM, "Li Shengmei"  wrote:

>Hi,all
>
> I revised the source code of hadoop-1.0.3 and used ant to recompile
>hadoop. It compiles successfully. Then I ran "jar cvf hadoop-core-1.0.3.jar *"
>and copied the new hadoop-core-1.0.3.jar to overwrite the $HADOOP_HOME/
>hadoop-core-1.0.3.jar on every node machine. Then I used hadoop to
>test the wordcount application. But the application halts at map 0% reduce
>0%.
>
> Can anyone give suggestions?
>
>Thanks a lot.
>
> 
>
>May
>
> 
>
> 
>
> 
>



Re: Hadoop fs -ls behaviour compared to native ls

2012-09-11 Thread Robert Evans
I think most of the rationale is for backwards compatibility, but I could
be wrong.  If you want to change it, file a JIRA about it and we can
discuss the merits of the change on the JIRA.

--Bobby

On 9/11/12 6:28 AM, "Hemanth Yamijala"  wrote:

>Hi,
>
>hadoop fs -ls dirname
>
>lists entries like
>
>dirname/file1
>dirname/file2
>
>i.e. dirname is repeated. And it takes a small second to realize that
>there's actually no directory called dirname under dirname.
>
>Native ls doesn't repeat dirname when listing the output.
>
>I suppose the current behaviour is to handle globbing. So,
>
>hadoop fs -ls dirname*
>
>will list
>
>dirname1/file1
>dirname2/file1
>
>which makes a little more sense. In contrast, native ls will list the
>same as
>
>dirname1:
>file1
>
>dirname2:
>file1
>
>Is there a rationale to keep this as the listing behaviour in Hadoop ?
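
To make the globbing behaviour in the question concrete, below is a small
sketch against the FileSystem API that the shell builds on.  The paths here
are illustrative assumptions; the point is that a single glob can expand to
files under several different parent directories, so printing the full path
(e.g. dirname1/file1 and dirname2/file1) keeps each entry unambiguous.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GlobListing {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // globStatus can return null, so guard against it.
            FileStatus[] matches = fs.globStatus(new Path("dirname*/*"));
            if (matches != null) {
                for (FileStatus status : matches) {
                    // Each status carries its full path, including the parent.
                    System.out.println(status.getPath());
                }
            }
        }
    }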



Re: [DISCUSS] release branching scheme under maven was [VOTE] 0.23.3 release

2012-09-10 Thread Robert Evans
I forked the thread, because it is not really about the release vote any
more, although we both seem to be on the same page so this may be overkill
:)

On 9/10/12 3:50 PM, "Owen O'Malley"  wrote:

>On Mon, Sep 10, 2012 at 12:19 PM, Robert Evans 
>wrote:
>> Thanks for the info Owen I was not aware of that, I can see at the
>> beginning of the twiki
>> http://wiki.apache.org/hadoop/HowToReleasePostMavenization that it is
>>kind
>> of implied by the skip this section comment.  But, I was just confused
>> because to do an official release I need to change the version number in
>> several different places, and I didn't think that it was all that kosher
>> to do that directly in the tags area.
>
>Please don't make edits in the tags area.

Totally concur on that.

>
>>  It also seems to be a bit different
>> then what is happening on branch-2.
>
>Branch 2 also has issues, which we've also been ironing out. The
>branch-2.0.1-alpha and branch-2.0.2-alpha will be deleted soon.

Sounds good.

>
>> I was planning on deleting the branch once the vote for 0.23.3 finished,
>
>That would be fine.
>
>> but I can delete it now if you would prefer and then roll back the
>>version
>> numbers on branch-0.23 to be 0.23.3? Or would it be 0.23.3-SNAPSHOT
>>still?
>
>You can wait for 0.23.3 to clean up. The only odd part of that is that
>the 0.23.3 tag won't be on the branch-0.23.

Yes that is ugly.  I agree.

>
>With Ant, we could just override the version on the command line.
>Clearly Maven is a little pickier. You could:
>1. Set the version to 0.23.3
>2. Check in
>3. Make the tag
>4. Set the version to 0.23.4-SNAPSHOT
>5. Check in
>
>I'd also be ok with just waiting with 4+5 until the vote passes.
>Thoughts? (Of course as discussed this would be more about 0.23.4 :) )

I am fine with either way.  If we do go all the way to 5 and there are
blockers that force a re-spin of the RC then we are back to creating a new
branch off of the original tag to make changes and we get a tag that is
not off of branch-X.Y. But, if we don't go all the way to 5 then the
branch must stay locked for a week, which is not the end of the world, but
in some cases will slow down development a little.  It is a balancing
game, but I don't see major drawbacks to either situation.

>
>>  I also want to be sure that we want to not allow anyone to check
>>anything
>> in for a week+ on branch-0.23.  There are a few things I know of that
>>are
>> almost ready but are not Blockers.
>
>Sure. That makes sense.
>
>-- Owen

--Bobby



Re: [VOTE] 0.23.3 release

2012-09-10 Thread Robert Evans
Thanks for the info Owen I was not aware of that, I can see at the
beginning of the twiki
http://wiki.apache.org/hadoop/HowToReleasePostMavenization that it is kind
of implied by the skip this section comment.  But, I was just confused
because to do an official release I need to change the version number in
several different places, and I didn't think that it was all that kosher
to do that directly in the tags area.  It also seems to be a bit different
than what is happening on branch-2.

http://svn.apache.org/viewvc/hadoop/common/branches/branch-2/

http://svn.apache.org/viewvc/hadoop/common/branches/branch-2.0.1-alpha/

http://svn.apache.org/viewvc/hadoop/common/branches/branch-2.0.2-alpha/

I was planning on deleting the branch once the vote for 0.23.3 finished,
but I can delete it now if you would prefer and then roll back the version
numbers on branch-0.23 to be 0.23.3? Or would it be 0.23.3-SNAPSHOT still?
 I also want to be sure that we want to not allow anyone to check anything
in for a week+ on branch-0.23.  There are a few things I know of that are
almost ready but are not Blockers.

--Bobby 

On 9/10/12 12:33 PM, "Owen O'Malley"  wrote:

>On Fri, Sep 7, 2012 at 12:56 PM, Bobby Evans  wrote:
>> I have built an RC0 for 0.23.3 and am calling a vote on it
>
>Hi Bobby,
>   I noticed that you created a branch-0.23.3. The pattern we have
>used in Hadoop is that there is a release branch that is for a minor
>release (branch-0.23) and tags for the release candidates
>(release-0.23.3-rc0) and releases (release-0.23.3). Generally, the RM
>freezes the minor branch while they are trying to get the right set of
>patches together for the release candidate.
>
>-- Owen



Re: Branch 2 release names

2012-09-05 Thread Robert Evans
I must have misread it.  Thanks for clarifying.

--Bobby

From: Vinod Kumar Vavilapalli 
Reply-To: "common-dev@hadoop.apache.org" 
Date: Tuesday, September 4, 2012 7:29 PM
To: "common-dev@hadoop.apache.org" 
Subject: Re: Branch 2 release names

Maybe you misread the proposal. This is only about nuking 2.1.0-alpha and waiting 
for 0.23.3 to be stabilized and released. Once that happens, we can create a 
branch-2.1 off branch-2.

Does that sound okay?

Thanks,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On Sep 4, 2012, at 3:05 PM, Robert Evans wrote:

I am fine with that too, but it is going to be a fairly large amount of
work to pull in all of the bug fixes into 2.0 that have gone into 0.23.
There was already a lot of discussion about just rebasing 2.1 instead of
trying to merge everything back into it and 2.1 is a lot further along
than 2.0 is.  Just something to be aware of.

--Bobby Evans



Re: Branch 2 release names

2012-09-04 Thread Robert Evans
I am fine with that too, but it is going to be a fairly large amount of
work to pull in all of the bug fixes into 2.0 that have gone into 0.23.
There was already a lot of discussion about just rebasing 2.1 instead of
trying to merge everything back into it and 2.1 is a lot further along
than 2.0 is.  Just something to be aware of.

--Bobby Evans  

On 9/4/12 2:19 PM, "Vinod Kumar Vavilapalli" 
wrote:

>
>+1 for moving on with 2.0 till it gets GA'ed, given we haven't made much
>progress on 2.0.1-alpha.
>
>+1 for putting the alpha/beta tags only on releases, and not on branches.
>
>This also reduces some branch-clutter like I mentioned on the other
>thread on general@h.a.o.
>
>Thanks,
>+Vinod
>
>On Sep 4, 2012, at 11:55 AM, Owen O'Malley wrote:
>
>> While cleaning up the subversion branches, I thought more about the
>> branch 2 release names. I'm concerned if we backtrack and reuse
>> release numbers it will be extremely confusing to users. It also
>> creates problems for tools like Maven that parse version numbers and
>> expect a left to right release numbering scheme (eg. 2.1.1-alpha >
>> 2.1.0). It also seems better to keep on the 2.0.x minor release until
>> after we get a GA release off of the 2.0 branch.
>> 
>> Therefore, I'd like to propose:
>> 1. rename branch-2.0.1-alpha -> branch-2.0
>> 2. delete branch-2.1.0-alpha
>> 3. stabilizing goes into branch-2.0 until it gets to GA
>> 4. features go into branch-2 and will be branched into branch-2.1 later
>> 5. The release tags can have the alpha/beta tags on them.
>> 
>> Thoughts?
>> 
>> -- Owen
>



Re: Unused API in LocalDirAllocator

2012-09-04 Thread Robert Evans
I don't think it really matters that much.  The API is limited Private and
unstable, so I would say just remove it, but fixing it is fine too.
Either way file a JIRA on it.

--Bobby

On 9/4/12 6:34 AM, "Hemanth Yamijala"  wrote:

>Hi,
>
>Stumbled on the fact that LocalDirAllocator.ifExists() is not used
>anywhere. The previous usage of this API was in the IsolationRunner
>that was removed in MAPREDUCE-2606.
>
>This API doesn't call the confChanged method and hence there is an
>uninitialised variable that causes a NullPointerException. So, either
>we can fix that, or remove the API if it's not required. This is also
>one of the reasons why the IsolationRunner was broken in 1.0.
>
>Thoughts ?
>
>Thanks
>hemanth



Re: How to get TaskId from ContainerId or ApplicationId or Request in Hadoop 0.23??

2012-08-23 Thread Robert Evans
There really is no way.  The RM also has no knowledge of map tasks vs
reduce tasks nor should it know.

--Bobby

On 8/22/12 8:23 PM, "Shekhar Gupta"  wrote:

>In ResourceManager, is there any way to findout if the assigned container
>is going to execute a mapping task or a reduce task? I can access objects
>Container, Application and Request in ResourceManager, can I somehow get
>TaskId by using any of these objects?? Please let me know a way.
>
>Thanks.



Re: MultithreadedMapper

2012-07-26 Thread Robert Evans
In general multithreaded does not get you much in traditional Map/Reduce.
If you want the mappers to run faster you can drop the split size and get
a similar result, because you get more parallelism.  This is the use case
that we have typically concentrated on.  About the only time that
MultithreadedMapper makes a lot of sense is if there is a lot of
computation associated with each key/value pair, i.e. when your process is
very compute bound, and not I/O bound.  Wordcount is typically going to be I/O
bound.  I am not aware of any work that is being done to reduce lock
contention in these cases.  If you want to file a generic JIRA for the
lock contention that would be great.

My gut feeling is that the reason the lock is so coarse is because the
InputFormats themselves are not thread safe.  Perhaps the simplest thing
you could do is to change it so that each thread gets its own "split" of
the actual split, and then if one finishes early there could be some logic
to try and share a "split" among a limited number of threads. But like
with anything in performance never trust your gut, so please profile it
before doing any code changes.
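
For anyone who does want to try it on a compute-bound job, the wiring looks
roughly like the sketch below (new-API MultithreadedMapper; the mapper body and
the thread count are illustrative placeholders, not something from this thread):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedJobSetup {
  // Stand-in for a mapper that does heavy per-record computation, which is
  // the case where the multithreaded runner actually pays off.
  public static class WorkMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(value, new IntWritable(1));
    }
  }

  public static void configure(Job job) {
    job.setMapperClass(MultithreadedMapper.class);              // the threaded runner
    MultithreadedMapper.setMapperClass(job, WorkMapper.class);  // the real work
    MultithreadedMapper.setNumberOfThreads(job, 8);             // threads per map task
  }
}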

--Bobby Evans

On 7/26/12 12:47 AM, "kenyh"  wrote:

>
>Multithread Mapreduce introduces multithread execution in map task. In
>hadoop
>1.0.2, MultithreadedMapper implements multithread execution in mapper
>function. But I found that synchronization is needed for record
>reading (reading
>the input Key and Value) and result output. This contention brings heavy
>overhead in performance, increasing the execution time of a 50MB wordcount task
>from 40 seconds to 1 minute. I wonder if there are any optimizations for the
>multithread mapper to decrease the contention of input reading and
>output? 
>-- 
>View this message in context:
>http://old.nabble.com/MultithreadedMapper-tp34213805p34213805.html
>Sent from the Hadoop core-dev mailing list archive at Nabble.com.
>



Re: Shifting to Java 7 . Is it good choice?

2012-07-17 Thread Robert Evans
Oracle is dropping java 6 support by the end of the year.  So there is
likely to be a big shift to java 7 before then.  Currently Hadoop
officially supports java 6 so unless there is an official change of
position you cannot use Java 7 specific APIs if you want to check your
code into Hadoop. Hadoop currently should work on 7, like Radim said, and
if you are building something on top of Hadoop it is fine, but if we are
dropping support for java 6 that will require some discussion on the
mailing lists.

--Bobby Evans

On 7/17/12 2:35 PM, "Radim Kolar"  wrote:

>
>>I have to tweak a few classes and for this I needed few packages
>>which
>> are
>> only present in Java 7 like "java.nio.file" , So I was wondering If I
>>can
>> shift my
>> development environment of Hadoop to Java 7? Would this break anything ?
>openjdk 7 works, but nio async file access is slower then traditional.



Re: New JIRA version field for branch-2's next release?

2012-07-16 Thread Robert Evans
Thanks for catching that and fixing it Harsh and Arun.

On 7/15/12 10:26 PM, "Harsh J"  wrote:

>Ah looks like you've covered that edge too, many thanks!
>
>On Mon, Jul 16, 2012 at 8:40 AM, Harsh J  wrote:
>> Thanks Arun! I will now diff both branches and fix any places the JIRA
>> fix version needs to be corrected at.
>>
>> On Mon, Jul 16, 2012 at 8:30 AM, Arun C Murthy 
>>wrote:
>>> Done.
>>>
>>> On Jul 13, 2012, at 11:12 PM, Harsh J wrote:
>>>
 Hey devs,

 I noticed 2.0.1 has already been branched, but there's no newer JIRA
 version field added in for 2.1.0? Can someone with the right powers
 add it across all projects, so that backports to branch-2 can be
 marked properly in their fix versions field?

 Thanks!
 --
 Harsh J
>>>
>>> --
>>> Arun C. Murthy
>>> Hortonworks Inc.
>>> http://hortonworks.com/
>>>
>>>
>>
>>
>>
>> --
>> Harsh J
>
>
>
>-- 
>Harsh J



Re: Jetty fixes for Hadoop

2012-07-11 Thread Robert Evans
I am +1 on this also, although I think we need to look at moving to
Jetty-7 or possibly dropping Jetty completely and looking at Netty or even
Tomcat long term.  Jetty has just been way too unstable at Hadoop scale
and that has not really changed with newer versions of Jetty.  Sticking
with an old forked unsupported version of Jetty long term seems very risky
to me.

--Bobby

On 7/10/12 5:19 PM, "Todd Lipcon"  wrote:

>+1 from me too. We've had this in CDH since Sep '11 and been working
>much better than the stock 6.1.26.
>
>-Todd
>
>On Tue, Jul 10, 2012 at 3:14 PM, Owen O'Malley  wrote:
>> On Tue, Jul 10, 2012 at 2:59 PM, Thomas Graves
>>wrote:
>>
>>> I'm +1 for adding it.
>>>
>>
>> I'm +1 also.
>>
>> -- Owen
>
>
>
>-- 
>Todd Lipcon
>Software Engineer, Cloudera



Re: No mapred-site.xml in the hadoop-0.23.3 distribution

2012-07-09 Thread Robert Evans
I think so, or you have to have it set in your environment when you launch
the tasks.

On 7/9/12 3:27 PM, "Pavan Kulkarni"  wrote:

>Thanks a lot Robert. I also assume we need to set JAVA_HOME parameter in
>the hadoop-env.sh .
>I couldn't find this file as default just like mapred-site.xml.Am I
>correct?
>
>On Mon, Jul 9, 2012 at 12:36 PM, Robert Evans  wrote:
>
>> On 2.0 core-site, yarn-site, hdfs-site, and mapred-site are all kind of
>> needed.  The exact configs that you need to set may vary a lot based off
>> of what you are trying to do.
>> --Bobby Evans
>>
>> On 7/6/12 6:58 PM, "Pavan Kulkarni"  wrote:
>>
>> >Hi Robert,
>> >
>> > Can you please share what all configuration files are mandatory for
>>the
>> >hadoop-0.23.3 to work.
>> >I am tuning a few but still not able to set it up completely.Thanks
>> >
>> >On Fri, Jul 6, 2012 at 10:24 AM, Robert Evans 
>> wrote:
>> >
>> >> Sorry I don't know of a good source for that right now.  Perhaps
>>others
>> >>on
>> >> the list might know better than I do.
>> >>
>> >> On 7/6/12 12:05 PM, "Pavan Kulkarni"  wrote:
>> >>
>> >> >Bobby,
>> >> >
>> >> >  Thanks a lot for your clarification.
>> >> >Yes as you said it is just a template, but it may
>> >> >be quite confusing to new users while configuring.
>> >> >I have raised the Issue
>> >> ><https://issues.apache.org/jira/browse/HADOOP-8575>,
>> >> >in case you might want to
>> >> >have a look.Thanks
>> >> >
>> >> > Also I wanted to know is there a good source where I can
>> >> >look upto for running a multi-node 2nd generation Hadoop ?
>> >> >All I find is 1st generation Hadoop setup.
>> >> >
>> >> >On Fri, Jul 6, 2012 at 7:13 AM, Robert Evans 
>> >>wrote:
>> >> >
>> >> >> That may be something that we missed, as I have been providing my
>>own
>> >> >> mapred-site.xml for quite a while now.  Have you tried it with
>> >>branch-2
>> >> >>or
>> >> >> trunk to see if they are providing it?  In either case it is just
>> >>going
>> >> >>to
>> >> >> be a template for you to fill in, but it would be nice to package
>> >>that
>> >> >> template for our users to follow.  If you want to file a JIRA for
>> >>that
>> >> >>it
>> >> >> would be good, but I don't know how quickly we will be able to get
>> >> >>around
>> >> >> to doing it.
>> >> >>
>> >> >> --Bobby Evans
>> >> >>
>> >> >> On 7/5/12 7:23 PM, "Pavan Kulkarni" 
>>wrote:
>> >> >>
>> >> >> >Hi,
>> >> >> >
>> >> >> >  I downloaded the Hadoop-0.23.3 source and tweaked a few classes
>> >>and
>> >> >> >when I built the binary distribution and untar'd it .I don't see
>>the
>> >> >> >mapred-site.xml
>> >> >> >file in the /etc/hadoop directory. But by the details given on
>>how
>> >>to
>> >> >>run
>> >> >> >the
>> >> >> >Hadoop-0.23.3 the mapred-site.xml needs to be configured right?
>> >> >> >
>> >> >> >  So I was just wondering if we are supposed to create the
>> >> >>mapred-site.xml
>> >> >> >, or
>> >> >> >it doesn't exist at all? Thanks
>> >> >> >
>> >> >> >--
>> >> >> >
>> >> >> >--With Regards
>> >> >> >Pavan Kulkarni
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> >--
>> >> >
>> >> >--With Regards
>> >> >Pavan Kulkarni
>> >>
>> >>
>> >
>> >
>> >--
>> >
>> >--With Regards
>> >Pavan Kulkarni
>>
>>
>
>
>-- 
>
>--With Regards
>Pavan Kulkarni



Re: No mapred-site.xml in the hadoop-0.23.3 distribution

2012-07-09 Thread Robert Evans
On 2.0 core-site, yarn-site, hdfs-site, and mapred-site are all kind of
needed.  The exact configs that you need to set may vary a lot based off
of what you are trying to do.
--Bobby Evans

On 7/6/12 6:58 PM, "Pavan Kulkarni"  wrote:

>Hi Robert,
>
> Can you please share what all configuration files are mandatory for the
>hadoop-0.23.3 to work.
>I am tuning a few but still not able to set it up completely.Thanks
>
>On Fri, Jul 6, 2012 at 10:24 AM, Robert Evans  wrote:
>
>> Sorry I don't know of a good source for that right now.  Perhaps others
>>on
>> the list might know better than I do.
>>
>> On 7/6/12 12:05 PM, "Pavan Kulkarni"  wrote:
>>
>> >Bobby,
>> >
>> >  Thanks a lot for your clarification.
>> >Yes as you said it is just a template, but it may
>> >be quite confusing to new users while configuring.
>> >I have raised the Issue
>> ><https://issues.apache.org/jira/browse/HADOOP-8575>,
>> >in case you might want to
>> >have a look.Thanks
>> >
>> > Also I wanted to know is there a good source where I can
>> >look upto for running a multi-node 2nd generation Hadoop ?
>> >All I find is 1st generation Hadoop setup.
>> >
>> >On Fri, Jul 6, 2012 at 7:13 AM, Robert Evans 
>>wrote:
>> >
>> >> That may be something that we missed, as I have been providing my own
>> >> mapred-site.xml for quite a while now.  Have you tried it with
>>branch-2
>> >>or
>> >> trunk to see if they are providing it?  In either case it is just
>>going
>> >>to
>> >> be a template for you to fill in, but it would be nice to package
>>that
>> >> template for our users to follow.  If you want to file a JIRA for
>>that
>> >>it
>> >> would be good, but I don't know how quickly we will be able to get
>> >>around
>> >> to doing it.
>> >>
>> >> --Bobby Evans
>> >>
>> >> On 7/5/12 7:23 PM, "Pavan Kulkarni"  wrote:
>> >>
>> >> >Hi,
>> >> >
>> >> >  I downloaded the Hadoop-0.23.3 source and tweaked a few classes
>>and
>> >> >when I built the binary distribution and untar'd it .I don't see the
>> >> >mapred-site.xml
>> >> >file in the /etc/hadoop directory. But by the details given on how
>>to
>> >>run
>> >> >the
>> >> >Hadoop-0.23.3 the mapred-site.xml needs to be configured right?
>> >> >
>> >> >  So I was just wondering if we are supposed to create the
>> >>mapred-site.xml
>> >> >, or
>> >> >it doesn't exist at all? Thanks
>> >> >
>> >> >--
>> >> >
>> >> >--With Regards
>> >> >Pavan Kulkarni
>> >>
>> >>
>> >
>> >
>> >--
>> >
>> >--With Regards
>> >Pavan Kulkarni
>>
>>
>
>
>-- 
>
>--With Regards
>Pavan Kulkarni



Re: No mapred-site.xml in the hadoop-0.23.3 distribution

2012-07-06 Thread Robert Evans
Sorry I don't know of a good source for that right now.  Perhaps others on
the list might know better than I do.

On 7/6/12 12:05 PM, "Pavan Kulkarni"  wrote:

>Bobby,
>
>  Thanks a lot for your clarification.
>Yes as you said it is just a template, but it may
>be quite confusing to new users while configuring.
>I have raised the Issue
><https://issues.apache.org/jira/browse/HADOOP-8575>,
>in case you might want to
>have a look.Thanks
>
> Also I wanted to know is there a good source where I can
>look upto for running a multi-node 2nd generation Hadoop ?
>All I find is 1st generation Hadoop setup.
>
>On Fri, Jul 6, 2012 at 7:13 AM, Robert Evans  wrote:
>
>> That may be something that we missed, as I have been providing my own
>> mapred-site.xml for quite a while now.  Have you tried it with branch-2
>>or
>> trunk to see if they are providing it?  In either case it is just going
>>to
>> be a template for you to fill in, but it would be nice to package that
>> template for our users to follow.  If you want to file a JIRA for that
>>it
>> would be good, but I don't know how quickly we will be able to get
>>around
>> to doing it.
>>
>> --Bobby Evans
>>
>> On 7/5/12 7:23 PM, "Pavan Kulkarni"  wrote:
>>
>> >Hi,
>> >
>> >  I downloaded the Hadoop-0.23.3 source and tweaked a few classes and
>> >when I built the binary distribution and untar'd it .I don't see the
>> >mapred-site.xml
>> >file in the /etc/hadoop directory. But by the details given on how to
>>run
>> >the
>> >Hadoop-0.23.3 the mapred-site.xml needs to be configured right?
>> >
>> >  So I was just wondering if we are supposed to create the
>>mapred-site.xml
>> >, or
>> >it doesn't exist at all? Thanks
>> >
>> >--
>> >
>> >--With Regards
>> >Pavan Kulkarni
>>
>>
>
>
>-- 
>
>--With Regards
>Pavan Kulkarni



Re: No mapred-site.xml in the hadoop-0.23.3 distribution

2012-07-06 Thread Robert Evans
That may be something that we missed, as I have been providing my own
mapred-site.xml for quite a while now.  Have you tried it with branch-2 or
trunk to see if they are providing it?  In either case it is just going to
be a template for you to fill in, but it would be nice to package that
template for our users to follow.  If you want to file a JIRA for that it
would be good, but I don't know how quickly we will be able to get around
to doing it.

--Bobby Evans

On 7/5/12 7:23 PM, "Pavan Kulkarni"  wrote:

>Hi,
>
>  I downloaded the Hadoop-0.23.3 source and tweaked a few classes and
>when I built the binary distribution and untar'd it .I don't see the
>mapred-site.xml
>file in the /etc/hadoop directory. But by the details given on how to run
>the
>Hadoop-0.23.3 the mapred-site.xml needs to be configured right?
>
>  So I was just wondering if we are supposed to create the mapred-site.xml
>, or
>it doesn't exist at all? Thanks
>
>-- 
>
>--With Regards
>Pavan Kulkarni



Re: JobTracker/TaskTraker heartbeats communication mechanism?

2012-06-29 Thread Robert Evans
Daniel and Joao,

The RPC classes in Hadoop handle this.  Essentially a proxy object is
created on the client side for the interface, then when a method is called
on the proxy object the parameters are serialized and sent to the configured
server along with the method name, where they are deserialized and, through
reflection, the server method is called.  For the return
value/exceptions it is done in reverse.
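
A rough client-side sketch of the pattern (PingProtocol and the address here are
made up purely for illustration; the real interface between the two daemons is
InterTrackerProtocol):

import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.VersionedProtocol;

public class RpcProxySketch {
  // Made-up protocol interface, only to show the shape of the mechanism.
  public interface PingProtocol extends VersionedProtocol {
    long versionID = 1L;
    String ping(String message);
  }

  public static String callServer(Configuration conf) throws Exception {
    InetSocketAddress addr = new InetSocketAddress("jobtracker.example.com", 9000);
    PingProtocol proxy = (PingProtocol) RPC.getProxy(
        PingProtocol.class, PingProtocol.versionID, addr, conf);
    try {
      // Looks like a local call, but the method name and arguments are
      // serialized, shipped to the server, and invoked there via reflection.
      return proxy.ping("hello");
    } finally {
      RPC.stopProxy(proxy);
    }
  }
}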

I hope this helps.

--Bobby Evans 

On 6/29/12 8:43 AM, "Daniel Parreira"  wrote:

>Hello,
>
>We're having some problems understanding the communication system used in
>the heartbeats between the JobTracker and the TaskTracker. They are both
>calling the same interface function, but we don't understand how that
>allows for communication.
>Is there any compiler plugin or RMI-style mechanism involved that we
>might not be aware of? We found the interface as the point of
>communication in VersionedProtocol but we haven't found the actual
>implementation anywhere...
>
>Thank you,
>Daniel Parreira & João Silva



Re: Resolving find bug issue

2012-06-26 Thread Robert Evans
The issue you are running into is because you made the HOST variable public, 
when it was package-private previously.  Findbugs thinks that you want HOST to be a 
constant because it is ALL CAPS and is only set once and read all other times.  
By making it public it is now difficult to ensure that it is never written to, 
hence the suggestion to make it final.  I would prefer to actually switch it 
over to private and add in a new public method that returns the value of HOST.
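
Roughly this shape (illustrative only, not the actual class from the patch):

public class HostHolder {
  private static final String HOST = resolveHost();

  // Read-only access from other packages; nothing outside can reassign HOST.
  public static String getHost() {
    return HOST;
  }

  private static String resolveHost() {
    try {
      return java.net.InetAddress.getLocalHost().getHostName();
    } catch (java.net.UnknownHostException e) {
      return "localhost";
    }
  }
}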

--Bobby Evans


On 6/26/12 6:01 AM, "madhu phatak"  wrote:

Hi,
 I have submitted a patch for jira (HADOOP-8521) which is giving a findbugs
(https://issues.apache.org/jira/browse/HADOOP-8521) error. To fix the
issue, I have to duplicate the StreamUtil class in the newly introduced
mapreduce package. Is that good practice, or is there another way to fix this?


Regards,
Madhukara Phatak

--
https://github.com/zinnia-phatak-dev/Nectar



Re: Cyclic dependency in JobControl job DAG

2012-06-25 Thread Robert Evans
I personally think it is useful.  I would say contribute it.
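
For what it is worth, the upfront check can be a simple depth-first walk over the
dependency graph; a rough sketch of the idea (over a plain adjacency map rather
than the actual ControlledJob API):

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CycleCheckSketch {
  // deps maps a job name to the names of the jobs it depends on.
  public static boolean hasCycle(Map<String, List<String>> deps) {
    Set<String> done = new HashSet<String>();
    for (String job : deps.keySet()) {
      if (visit(job, deps, new HashSet<String>(), done)) {
        return true;
      }
    }
    return false;
  }

  private static boolean visit(String job, Map<String, List<String>> deps,
      Set<String> onPath, Set<String> done) {
    if (done.contains(job)) {
      return false;              // already fully explored, no cycle through here
    }
    if (!onPath.add(job)) {
      return true;               // job is already on the current path: cycle
    }
    List<String> children = deps.get(job);
    if (children != null) {
      for (String dep : children) {
        if (visit(dep, deps, onPath, done)) {
          return true;
        }
      }
    }
    onPath.remove(job);
    done.add(job);
    return false;
  }
}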

(Moved common-dev to bcc, we try not to cross post on these lists)

--Bobby Evans

On 6/25/12 3:37 AM, "madhu phatak"  wrote:

Hi,
 In the current implementation of JobControl, whenever there is a cyclic
dependency between the jobs it throws a StackOverflowError.
 For example,
   ControlledJob job1 = new ControlledJob(new Configuration());
job1.setJobName("job1");
ControlledJob job2 = new ControlledJob(new Configuration());
job2.setJobName("job2");
job1.addDependingJob(job2);
job2.addDependingJob(job1);
JobControl jobControl = new JobControl("jobcontrol");
jobControl.addJob(job1);
jobControl.addJob(job2);
jobControl.run();

throws
  java.lang.StackOverflowError
at java.util.ArrayList.get(ArrayList.java:322)
at
org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.checkState(ControlledJob.java:295)

Whenever we write complex applications, there is always a possibility of
cyclic dependencies. I have written a method which checks for the cyclic
dependency upfront and reports it to the user. I want to know from you
guys, do you think it is a useful feature? If yes I can contribute it as a
patch.

Regards,
Madhukara Phatak
--
https://github.com/zinnia-phatak-dev/Nectar



Re: contributing as student

2012-05-29 Thread Robert Evans
Also be aware that a lot of us are very busy, so sadly you may need to send 
mail to the appropriate mailing list if your patch is not reviewed quickly.

--Bobby Evans

On 5/29/12 4:32 AM, "Devaraj k"  wrote:

Good to hear Hasan.

Welcome to the Hadoop community. You can find more details in the below link 
for how to contribute.


http://wiki.apache.org/hadoop/HowToContribute


Thanks
Devaraj


From: Hasan Gürcan [hasan.guer...@googlemail.com]
Sent: Tuesday, May 29, 2012 2:37 PM
To: common-dev@hadoop.apache.org
Subject: contributing as student

Hello hadoop community,

I am a student at the Freie Universität Berlin and I am taking part in a class
called open source.

Therefore I have to contribute to an open source project, and I chose Hadoop
because it is revolutionary.

Could someone tell me how I could contribute?
I have a good understanding of the MapReduce algorithm and also of
HDFS.

Contributing can mean fixing a bug or doing a task, but also jobs like
translation or documentation.

best regards
Hasan Gürcan



Re: Need Urgent Help on Architecture

2012-05-21 Thread Robert Evans
All attachments are stripped when sent to the mailing list.  You will need to 
use another service if you want us to see the diagram.


On 5/18/12 12:50 PM, "samir das mohapatra"  wrote:

Hi harsh,

   I wanted to implement one Workflow within the MAPPER. I am sharing my 
concept through the Architecture Diagram; please correct me if I am wrong
   and suggest any good approach for that.

   Many thanks in advance.

  Thanks
samir



Re: Sailfish

2012-05-11 Thread Robert Evans
That makes perfect sense to me.  Especially because it really is a new 
implementation of shuffle that is optimized for very large jobs.  I am happy to 
see anything go in that is going to improve the performance of hadoop, and I 
look forward to running some benchmarks on the changes.  I am not super 
familiar with sailfish, but from what I remember from a while ago it is the 
modified version of KFS that is in reality doing the sorting.  The maps will 
output data to "chunks" aka blocks that when each chunk is full it is sorted.  
When the sorting is finished for a chunk the reducers are now free to pull the 
sorted data from the chunks and run.  I have a few concerns with it though.


 1.  How do we securely handle different comparators?  Currently comparators 
run as the user that launched the job, not as a privileged user.  Sailfish 
seems to require that comparators run as a privileged user, or we only support 
pure bitwise sorting of keys.
 2.  How does this work in a mixed environment?  Sailfish, as I understand it, 
is optimized for large map/reduce jobs, and can be slower on small jobs than 
the current implementation.  How do we make it so that large jobs are able to 
run faster, but not negatively impact the more common small jobs?  We could run 
both in parallel and switch between them depending on the size of the job's 
input, or a config key of some sort, but then the RAM needed to make these big 
jobs run fast would not be available for smaller jobs to use when no really big 
job is running.

--Bobby Evans

On 5/11/12 1:32 AM, "Todd Lipcon"  wrote:

Hey Sriram,

We discussed this before, but for the benefit of the wider audience: :)

It seems like the requirements imposed on KFS by Sailfish are in most
ways much simplier than the requirements of a full distributed
filesystem. The one thing we need is atomic record append -- but we
don't need anything else, like filesystem metadata/naming,
replication, corrupt data scanning, etc. All of the data is
transient/short-lived and at replication count 1.

So I think building something specific to this use case would be
pretty practical - and my guess is it might even have some benefits
over trying to use a full DFS.

In the MR2 architecture, I'd probably try to build this as a service
plugin in the NodeManager (similar to the way that the ShuffleHandler
in the current implementation works)

-Todd

On Thu, May 10, 2012 at 11:01 PM, Sriram Rao  wrote:
> Srivas,
>
> Sailfish builds upon record append (a feature not present in HDFS).
>
> The software that is currently released is based on Hadoop-0.20.2.  You use
> the Sailfish version of Hadoop-0.20.2, KFS for the intermediate data, and
> then HDFS (or KFS) for storing the job/input.  Since the changes are all in
> the handling of map output/reduce input, it is transparent to existing jobs.
>
> What is being proposed below is to bolt all the starting/stopping of the
> related deamons into YARN as a first step.  There are other approaches that
> are possible, which have a similar effect.
>
> Hope this helps.
>
> Sriram
>
>
> On Thu, May 10, 2012 at 10:50 PM, M. C. Srivas  wrote:
>
>> Sriram,   Sailfish depends on append. I just noticed the HDFS disabled
>> append. How does one use this with Hadoop?
>>
>>
>> On Wed, May 9, 2012 at 9:00 AM, Otis Gospodnetic <
>> otis_gospodne...@yahoo.com
>> > wrote:
>>
>> > Hi Sriram,
>> >
>> > >> The I-file concept could possibly be implemented here in a fairly self
>> > contained way. One
>> > >> could even colocate/embed a KFS filesystem with such an alternate
>> > >> shuffle, like how MR task temporary space is usually colocated with
>> > >> HDFS storage.
>> >
>> > >  Exactly.
>> >
>> > >> Does this seem reasonable in any way?
>> >
>> > > Great. Where do go from here?  How do we get a colloborative effort
>> > going?
>> >
>> >
>> > Sounds like a JIRA issue should be opened, the approach briefly
>> described,
>> > and the first implementation attempt made.  Then iterate.
>> >
>> > I look forward to seeing this! :)
>> >
>> > Otis
>> > --
>> >
>> > Performance Monitoring for Solr / ElasticSearch / HBase -
>> > http://sematext.com/spm
>> >
>> >
>> >
>> > >
>> > > From: Sriram Rao 
>> > >To: common-dev@hadoop.apache.org
>> > >Sent: Tuesday, May 8, 2012 6:48 PM
>> > >Subject: Re: Sailfish
>> > >
>> > >Dear Andy,
>> > >
>> > >> From: Andrew Purtell 
>> > >> ...
>> > >
>> > >> Do you intend this to be a joint project with the Hadoop community or
>> > >> a technology competitor?
>> > >
>> > >As I had said in my email, we are looking for folks to colloborate
>> > >with us to help get us integrated with Hadoop.  So, to be explicitly
>> > >clear, we are intending for this to be a joint project with the
>> > >community.
>> > >
>> > >> Regrettably, KFS is not a "drop in replacement" for HDFS.
>> > >> Hypothetically: I have several petabytes of data in an existing HDFS
>> > >> deployment, which is the norm, and a continuous MapReduce workflow.
>> 

Re: Hadoop: Trunk vs branch src code

2012-04-10 Thread Robert Evans
That depends on where you want your code to go in.

If it is a new feature then it needs to go into trunk at a minimum.  Trunk and 
branch-2 are very similar right now so if you want it to go into the next 
release with MRV2 you may want to target branch-2 as well.  It should be 
minimal effort to have it go into both trunk and branch-2.  If it is something 
for the current stable line (MRV1) you want to target that too, but in many 
cases it may be a lot of effort because trunk and branch-1 have diverged 
significantly.

--Bobby Evans

On 4/10/12 8:48 AM, "Amir Sanjar"  wrote:

Which one has the latest code? If we are planning to contribute, which code
base should we use?

Best Regards
Amir Sanjar

Linux System Management Architect and Lead
IBM Senior Software Engineer
Phone# 512-286-8393
Fax#  512-838-8858



Re: Help with error

2012-04-09 Thread Robert Evans
What do you mean by relocated some supporting files to HDFS?  How do you 
relocate them?  What API do you use?

--Bobby Evans


On 4/9/12 11:10 AM, "Ralph Castain"  wrote:

Hi folks

I'm trying to develop an AM for the 0.23 branch and running into a problem that 
I'm having difficulty debugging. My client relocates some supporting files to 
HDFS, creates the application object for the AM, and submits it to the RM.

The file relocation request doesn't generate an error, so I must assume it 
succeeded. It would be nice if there was some obvious way to verify that, but I 
haven't discovered it. Can anyone give me a hint? I tried asking hdfs to -ls, 
but all I get is that "." doesn't exist. I have no idea where the file would be 
placed, if it would persist once the job fails, etc.

When the job is submitted, all I get is an "Error 500", which tells me nothing. 
Reminds me of the old days of 40 years ago when you'd get the dreaded "error 
11", which meant anything from a divide by zero to a memory violation. Are 
there any debug flags I could set that might provide more info?

Thanks
Ralph




Re: Requirements for patch review

2012-04-04 Thread Robert Evans
I personally like the clarification and it is in line with how I understood the 
original bylaw when I read it.  I don't really want this to turn into a legal 
document but as this is getting more explicit with clarification it would be 
nice to put in a small exception for release managers when they are changing 
versions and setting up a new release branch.

--Bobby Evans

On 4/4/12 4:12 PM, "Todd Lipcon"  wrote:

Hi folks,

Some discussion between Nicholas, Aaron, and me started in the
comments of HDFS-3168 which I think is better exposed on the mailing
list instead of trailing an already-committed JIRA.

The question at hand is what the policy is with regarding our
review-then-commit policies. The bylaws state:

>>>
*Code Change*
A change made to a codebase of the project and committed by a
committer. This includes source code, documentation, website content,
etc. Lazy consensus of active committers, but with a minimum of one
+1. The code can be committed after the first +1, unless the code
change represents a merge from a branch, in which case three +1s are
required.
<<<

The wording here is ambiguous, though, whether the committer who
provides the minimum one +1 may also be the author of the code change.
If so, that would seem to imply that committers may always make code
changes by merely +1ing their own patches, which seems counter to the
whole point of "review-then-commit". So, I'm pretty sure that's not
what it means.

The question that came up, however, was whether a non-committer
contributor may provide a binding +1 for a patch written by a
committer. So, if I write a patch as a committer, and then a community
member reviews it, am I free to commit it without another committer
looking at it? My understanding has always been that this is not the
case, but we should clarify the by-laws if there is some ambiguity.

I would propose the following amendments:
A committer may not provide a binding +1 for his or her own patch.
However, in the case of trivial patches only, a committer may use a +1
from the problem reporter or other contributor in lieu of another
committer's +1. The definition of a trivial patch is subject to the
committer's best judgment, but in general should consist of things
such as: documentation fixes, spelling mistakes, log message changes,
or additional test cases.

I think the above strikes a reasonable balance between pragmatism for
quick changes, and keeping a rigorous review process for patches that
should have multiple experienced folks look over.

Thoughts?

Todd
--
Todd Lipcon
Software Engineer, Cloudera



Re: Proto files

2012-03-26 Thread Robert Evans
I responded in the JIRA for this.  Because we wrap proto in Hadoop RPC right 
now those .proto files are not going to do very many people a lot of good, 
unless they have a client that can also communicate over a simple form of 
Hadoop RPC.  I think it would be good to move to a pure PB RPC implementation, 
but that involves security changes and a lot of other things so it is not a 
small undertaking.

--Bobby Evans

On 3/24/12 8:38 PM, "Eli Collins"  wrote:

Good idea, no reason we shouldn't; the build probably wasn't updated to
include them when they were added. File a jira?

On Saturday, March 24, 2012, Ralph Castain  wrote:
> Hi folks
>
> I notice that the .proto files are not present in the built tarball. This
presents a problem to those of us working on 3rd party tools that need to
talk to Hadoop tools such as the resource manager. It means that anyone
wanting to build our tools has to install an svn checkout of the code as
opposed to simply installing the tarball.
>
> Is there any reason -not- to include the .proto files in the tarball for
distribution? It would help a great deal.
>
> Thanks
> Ralph
>
>



Re: Question about Hadoop-8192 and rackToBlocks ordering

2012-03-22 Thread Robert Evans
If it really is the ordering of the hash map I would say no it should not, and 
the code should be updated.  If ordering matters we need to use a map that 
guarantees a given order, and hash map is not one of them.
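
A tiny illustration of the point (nothing Hadoop-specific, just the two map types):

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderDemo {
  public static void main(String[] args) {
    Map<String, Integer> hashed = new HashMap<String, Integer>();
    Map<String, Integer> ordered = new LinkedHashMap<String, Integer>();
    for (String rack : new String[] {"/rack2", "/rack1", "/rack3"}) {
      hashed.put(rack, 1);
      ordered.put(rack, 1);
    }
    // HashMap iteration order is an accident of the implementation and the JVM;
    // LinkedHashMap guarantees insertion order.
    System.out.println("HashMap order:       " + hashed.keySet());
    System.out.println("LinkedHashMap order: " + ordered.keySet());
  }
}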

--Bobby Evans

On 3/22/12 7:24 AM, "Kumar Ravi"  wrote:

Hello,

 We have been looking at IBM JDK junit failures on Hadoop-1.0.1
independently and have ran into the same failures as reported in this JIRA.
I have a question based upon what I have observed below.

We started debugging the problems in the testcase -
org.apache.hadoop.mapred.lib.TestCombineFileInputFormat
The testcase fails because the number of splits returned back from
CombineFileInputFormat.getSplits() is 1 when using IBM JDK whereas the
expected return value is 2.

So far, we have found the reason for this difference in the number of splits is
that the order in which elements in the rackToBlocks hashmap get created
is the reverse of the order that the Sun JDK creates.

The question I have at this point is -- Should there be a strict dependency
in the order in which the rackToBlocks hashmap gets populated, to determine
the number of splits that should get created in a hadoop cluster? Is
this Working as designed?

Regards,
Kumar



Re: Compressor tweaks corresponding to HDFS-2834, 3051?

2012-03-07 Thread Robert Evans
I am a +1 on opening a new JIRA for a first stab at reducing the amount of data 
that gets copied around.

--Bobby Evans


On 3/7/12 1:26 AM, "Tim Broberg"  wrote:

In https://issues.apache.org/jira/browse/HDFS-2834, Todd says, "

  This is also useful whenever a native decompression codec is being used. In 
those cases, we generally have the following copies:

  1) Socket -> DirectByteBuffer (in SocketChannel implementation)
  2) DirectByteBuffer -> byte[] (in SocketInputStream)
  3) byte[] -> Native buffer (set up for decompression)
  4*) decompression to a different native buffer (not really a copy - 
decompression necessarily rewrites)
  5) native buffer -> byte[]

  with the proposed improvement we can hopefully eliminate #2,#3 for all 
applications, and #2,#3,and #5 for libhdfs.
"


It seems like we need to tweak the Decompressor (and Compressor?) classes to 
take DirectByteBuffer inputs / outputs rather than byte[]'s to support this 
improvement.

Is the right thing to do for me to open a jira in common for this and take a 
first stab at defining the interface?

- Tim.




Re: Hadoop on non x86 systems

2012-03-06 Thread Robert Evans
There are kind of two ways to submit a proposal.

1 - send an e-mail here with the proposal.
2 - file a JIRA and attach your proposal to it.

Usually it is a combination of the two.  You start a conversation on the 
mailing list, and then at some point file a JIRA to track the work and capture 
the discussion.

To become a contributor, from the perspective of apache, all you have to do is 
to submit a patch and grant the copyright to apache for submission.  Your 
company may have more guidelines/legal requirements and you probably want to 
check with them.  See the wiki http://wiki.apache.org/hadoop/HowToContribute 
for more details.

There are several conferences coming up.  Hadoop Summit is the next big one 
that I know of http://hadoopsummit.org/

As far as the build system is concerned are you interested in the 1.0 line that 
uses ant or the 0.23/trunk that uses maven?

Always glad to see more people wanting to make Hadoop better.

--Bobby Evans

On 3/6/12 7:51 AM, "Amir Sanjar"  wrote:

Hi all,
My team is actively involved in porting Hadoop 1.0.x to the IBM POWER
architecture, which of course includes building
Hadoop 1.0 using IBM JAVA 6. So to start I have the following questions:
1) How does hadoop-common support multi-architecture builds?
 If it doesn't, we have a proposal; what is the process to submit a
proposal?
2) What is the process to become a contributor?
3) When is the next hadoop conference?


Best Regards
Amir Sanjar

Linux System Management Architect and Lead
IBM Senior Software Engineer
Phone# 512-286-8393
Fax#  512-838-8858





Re: Execute a Map/Reduce Job Jar from Another Java Program.

2012-02-03 Thread Robert Evans
It looks like there is something wrong with your configuration where the 
default file system is coming back as the local file system, but you are 
passing in an HDFS URI to fs.exists(Path).  I cannot tell for sure because I don't 
have access to 
com.amd.kdf.protobuf.SequentialFileDriver.main(SequentialFileDriver.java:64).

If running it works just fine from the command line, you could try doing a 
fork/exec to launch the process and then monitor it.
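
Roughly like the following, assuming the hadoop script is on the PATH (the jar
name and paths are only placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class LaunchJob {
  public static int runJob(String jar, String input, String output) throws Exception {
    // Launch "hadoop jar <jar> <input> <output>" as a child process instead of
    // calling RunJar.main() inside this JVM.
    ProcessBuilder pb = new ProcessBuilder("hadoop", "jar", jar, input, output);
    pb.redirectErrorStream(true);
    Process p = pb.start();
    BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
    String line;
    while ((line = r.readLine()) != null) {
      System.out.println(line);  // surface the job's console output
    }
    return p.waitFor();          // non-zero exit usually means the job failed
  }
}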

--Bobby Evans

On 2/2/12 11:31 PM, "Abees Muhammad"  wrote:

Hi Evans,

Thanks for your reply. I have a mapreduce job jar file, let's call it 
test.jar. I am executing this jar file as hadoop jar test.jar inputpath 
outPath, and it executes successfully. Now I want to execute this job for a 
batch of files (a batch of 20 files). For this purpose I have created another 
java application; this application moves a batch of files from one location of 
hdfs to another location in hdfs. After that this application needs to execute 
the M/R job for this batch. We will invoke the second application (which will 
execute the M/R Job) from a Control-M job. But I don't know how to create the 
second java application which will invoke the M/R job. The code snippet I used 
for testing the jar which calls the M/R job is

List<String> arguments = new ArrayList<String>();
arguments.add("test.jar");
arguments.add("inputPath");
arguments.add(outputPath);
RunJar.main((String[]) arguments.toArray(new String[0]));

i executed this jar as java -jar M/RJobInvokeApp.jar but i got error as

java.lang.IllegalArgumentException: Wrong FS: hdfs://ip 
address:54310/tmp/test-out, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:410)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:56)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:379)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:748)
at 
com.amd.kdf.protobuf.SequentialFileDriver.main(SequentialFileDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
at com.amd.wrapper.main.ParserWrapper.main(ParserWrapper.java:31)






Thanks,
Abees

On 2 February 2012 23:02, Robert Evans  wrote:
What happens?  Is there an exception, does nothing happen?  I am curious.  Also 
how did you launch your other job that is trying to run this one?  The hadoop 
script sets up a lot of environment variables, classpath, etc. to make hadoop work 
properly, and some of that may not be set up correctly to make RunJar work.

--Bobby Evans


On 2/2/12 9:36 AM, "Harsh J"  wrote:

Moving to common-user. Common-dev is for project development
discussions, not user help.

Could you elaborate on how you used RunJar? What arguments did you
provide, and is the target jar a runnable one or a regular jar? What
error did you get?

On Thu, Feb 2, 2012 at 8:44 PM, abees muhammad  wrote:
>
> Hi,
>
> I am a newbie to Hadoop Development. I have a Map/Reduce job jar file, i
> want to execute this jar file programmatically from another java program. I
> used the following code to execute it.
>
> RunJar.main(String[] args). But The jar file is not executed.
>
> Can you please give me  a work around for this issue.
> --
> View this message in context: 
> http://old.nabble.com/Execute-a-Map-Reduce-Job-Jar-from-Another-Java-Program.-tp33250801p33250801.html
> Sent from the Hadoop core-dev mailing list archive at Nabble.com.
>



--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about





Re: Execute a Map/Reduce Job Jar from Another Java Program.

2012-02-02 Thread Robert Evans
What happens?  Is there an exception, does nothing happen?  I am curious.  Also 
how did you launch your other job that is trying to run this one?  The hadoop 
script sets up a lot of environment variables, classpath, etc. to make hadoop work 
properly, and some of that may not be set up correctly to make RunJar work.

--Bobby Evans

On 2/2/12 9:36 AM, "Harsh J"  wrote:

Moving to common-user. Common-dev is for project development
discussions, not user help.

Could you elaborate on how you used RunJar? What arguments did you
provide, and is the target jar a runnable one or a regular jar? What
error did you get?

On Thu, Feb 2, 2012 at 8:44 PM, abees muhammad  wrote:
>
> Hi,
>
> I am a newbie to Hadoop Development. I have a Map/Reduce job jar file, i
> want to execute this jar file programmatically from another java program. I
> used the following code to execute it.
>
> RunJar.main(String[] args). But The jar file is not executed.
>
> Can you please give me  a work around for this issue.
> --
> View this message in context: 
> http://old.nabble.com/Execute-a-Map-Reduce-Job-Jar-from-Another-Java-Program.-tp33250801p33250801.html
> Sent from the Hadoop core-dev mailing list archive at Nabble.com.
>



--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about



Re: Moving TB of data from NFS to HDFS

2012-01-25 Thread Robert Evans
Hadoop fs -put operates on a single thread at a time, and writes the data to 
HDFS in order.  Depending on the connectivity between the filer/NFS server and 
the datanodes it may be difficult to saturate that connection, which is the 
only way to really speed things up.  If there are multiple files, then like was 
said in other posts you can increase the thread count of transfers and do a 
better job of getting the data into HDFS faster.  Just be careful, like was 
stated before, that the NN can keep up with all of the data being transferred.
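
If the data is already split into many files, a small client that runs several
puts in parallel can get closer to saturating that link.  A rough sketch (the
paths and thread count are made up, and it assumes the NFS mount is visible from
the machine this runs on):

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    final FileSystem fs = FileSystem.get(conf);      // the default (HDFS) file system
    final Path dest = new Path("/data/ingest");
    ExecutorService pool = Executors.newFixedThreadPool(8);
    for (final File f : new File("/mnt/nfs/export").listFiles()) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            // Each task is the equivalent of one "hadoop fs -put" of one file.
            fs.copyFromLocalFile(new Path(f.getAbsolutePath()), dest);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.DAYS);
  }
}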

--Bobby Evans

On 1/24/12 11:16 PM, "Praveen Sripati"  wrote:

> If it is divided up into several files and you can mount your NFS
directory on each of the datanodes.

Just curious, how will this help.

Praveen

On Wed, Jan 25, 2012 at 12:39 AM, Robert Evans  wrote:

> If it is divided up into several files and you can mount your NFS
> directory on each of the datanodes, you could possibly use distcp to do it.
>  I have never tried using distcp for this, but it should work.  Or you can
> write your own streaming Map/Reduce script that does more or less the same
> thing as distcp and will take as input the list of files to copy, and will
> do a hadoop fs -put for each file having it come from NFS.
>
> --Bobby Evans
>
> On 1/24/12 12:49 AM, "rajmca2002"  wrote:
>
>
>
> Hi,
>
> I have TBs of data in NFS that I need to move to hdfs. I have used the
> hadoop put command to do the same, but it took hours to place
> the files in HDFS. Is there any good approach to move large files to hdfs?
>
> Please reply asap.
> --
> View this message in context:
> http://old.nabble.com/Moving-TB-of-data-from-NFS-to-HDFS-tp33193061p33193061.html
> Sent from the Hadoop core-dev mailing list archive at Nabble.com.
>
>
>



Re: Moving TB of data from NFS to HDFS

2012-01-24 Thread Robert Evans
If it is divided up into several files and you can mount your NFS directory on 
each of the datanodes, you could possibly use distcp to do it.  I have never 
tried using distcp for this, but it should work.  Or you can write your own 
streaming Map/Reduce script that does more or less the same thing as distcp and 
will take as input the list of files to copy, and will do a hadoop fs -put for 
each file having it come from NFS.

--Bobby Evans

On 1/24/12 12:49 AM, "rajmca2002"  wrote:



Hi,

I have TBs of data in NFS that I need to move to hdfs. I have used the
hadoop put command to do the same, but it took hours to place
the files in HDFS. Is there any good approach to move large files to hdfs?

Please reply asap.
--
View this message in context: 
http://old.nabble.com/Moving-TB-of-data-from-NFS-to-HDFS-tp33193061p33193061.html
Sent from the Hadoop core-dev mailing list archive at Nabble.com.




Re: Security in 0.23

2012-01-04 Thread Robert Evans

It should have all of the same security.  Some of it has been renamed, and some 
of the token work is still ongoing.  The LinuxTaskController has been renamed 
because "Tasks" are map reduce specific.  It is now the LinuxContainerExecutor. 
 I don't remember all of the updated config names off the top of my head.

--Bobby Evans

On 1/4/12 2:17 PM, "Benoy Antony"  wrote:

Hi All,

Does 0.23 have all the security features compared to (205 ) ?

I do not see the LinuxTaskController class and c++ code which enables the
ownership of the task processes. Which are the missing security features ?

Thanks and Regards,
Benoy Antony



Re: How Jobtracler stores tasktracker's information

2011-12-13 Thread Robert Evans
I am not completely sure what you mean by this.  In Hadoop the TaskTracker will 
heartbeat into the JobTracker to report its status and get new tasks to launch. 
 The Scheduler, which is pluggable, then matches up requests for tasks with the 
TaskTracker.  If you want to see where the matching up of tasks to task 
trackers takes place you should look at the schedulers.  There are two main 
ones in use by hadoop.  They are the Capacity Scheduler and the Fair Scheduler. 
 You should be able to find them under contrib.

--
Bobby Evans

On 12/13/11 3:02 AM, "hadoop anis"  wrote:

 Can anyone please tell me this:
  I want to know from where the Jobtracker sends a task (taskid) to a
tasktracker for scheduling,
  i.e. where it creates taskid & tasktracker pairs.



Thanks & Regards,

Mohmmadanis Moulavi

Student,
MTech (Computer Sci. & Engg.)
Walchand college of Engg. Sangli (M.S.) India



Re: Automatically Documenting Apache Hadoop Configuration

2011-12-05 Thread Robert Evans
From my work on yarn trying to document the configs there and to standardize 
them, writing anything that is going to automatically detect config values 
through static analysis is going to be very difficult.  This is because most 
of the configs in yarn are now built up using static string concatenation.

public static String BASE = "yarn.base.";
public static String CONF = BASE+"config";

I am not sure that there is a good way around this short of using a full java 
parser to trace out all method calls, and try to resolve the parameters.  I 
know this is possible, just not that simple to do.

I am +1 for anything that will clean up configs and improve the documentation 
of them.  Even if we have to rewire or rewrite a lot of the Configuration class 
to make things work properly.

--Bobby Evans

On 12/5/11 11:54 AM, "Harsh J"  wrote:

Praveen,

(Inline.)

On 05-Dec-2011, at 10:14 PM, Praveen Sripati wrote:

> Hi,
>
> Recently there was a query about the Hadoop framework being tolerant for
> map/reduce task failure towards the job completion. And the solution was to
> set the 'mapreduce.map.failures.maxpercent` and
> 'mapreduce.reduce.failures.maxpercent' properties. Although this feature
> was introduced couple of years back, it was not documented. Had similar
> experience with 0.23 release also.

I do not know if we recommend using config strings directly when there's an API 
in Job/JobConf supporting setting the same thing. Just saying - that there was 
javadoc already available on this. But of course, it would be better if the 
tutorial covered this too. Doc-patches welcome!

> It would be really good for Hadoop adoption to automatically dig and
> document all the existing configurable properties in Hadoop and also to
> identify newly added properties in a particular release during the build
> processes. Documentation would also lead to fewer queries in the forums.
> Cloudera has done something similar [1], though it's not 100% accurate, it
> would definitely help to some extent.

I'm +1 for this. We do request and consistently add entries to *-default.xml 
files if we find them undocumented today. I think we should also enforce it at 
the review level, so that patches do not go in undocumented -- at minimum the 
configuration tweaks at least.



Re: Hadoop - non disk based sorting?

2011-12-01 Thread Robert Evans
Mingxi,

My understanding was that, just like with the maps, when a reducer's in-
memory buffer fills up it too will spill to disk as part of the sort.  In fact 
I think it uses the exact same code for doing the sort as the map does.  There 
may be an issue where your sort buffer is somehow too large for the amount of 
heap that you requested as part of the mapred.child.java.opts.  I have 
personally run a reduce that took in 300GB of data, which it successfully 
sorted, to test this very thing.  And no the box did not have 300 GB of RAM.
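
If it does turn out to be a heap sizing problem, the knobs are roughly these
(example values only, and the property names are the Hadoop 1.x ones):

import org.apache.hadoop.mapred.JobConf;

public class ShuffleMemorySettings {
  public static void configure(JobConf conf) {
    conf.set("mapred.child.java.opts", "-Xmx1024m");  // heap given to each task JVM
    // Fraction of that heap the reduce side may use to hold map outputs in
    // memory during the shuffle before spilling to disk.
    conf.set("mapred.job.shuffle.input.buffer.percent", "0.70");
  }
}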

--Bobby Evans

On 12/1/11 4:12 AM, "Ravi teja ch n v"  wrote:

Hi Mingxi ,

>So, why when map outputs are huge, is the reducer not able to copy them?

The Reducer will copy the Map output into its in-memory buffer. When the 
Reducer JVM doesn't have enough memory to accommodate the
Map output, it leads to an OutOfMemoryError.

>Can you please kindly explain what's the function of mapred.child.java.opts? 
>How does it relate to the copy?

The Maps and Reducers will be launched in separate child JVMs at the 
Tasktrackers.
When the Tasktracker launches the Map or Reduce JVMs, it uses the 
mapred.child.java.opts as JVM arguments for the new child JVMs.

Regards,
Ravi Teja

From: Mingxi Wu [mingxi...@turn.com]
Sent: 01 December 2011 12:37:54
To: common-dev@hadoop.apache.org
Subject: RE: Hadoop - non disk based sorting?

Thanks Ravi.

So, why when map outputs are huge, is the reducer not able to copy them?

Can you please kindly explain what's the function of mapred.child.java.opts? 
How does it relate to the copy?

Thank you,

Mingxi

-Original Message-
From: Ravi teja ch n v [mailto:raviteja.c...@huawei.com]
Sent: Tuesday, November 29, 2011 9:46 PM
To: common-dev@hadoop.apache.org
Subject: RE: Hadoop - non disk based sorting?

Hi Mingxi,

From your stacktrace, I understand that the OutOfMemoryError has actually 
occurred while copying the MapOutputs, not while sorting them.

Since your Map outputs are huge and your reducer does not have enough heap memory, 
you got the problem.
When you increased the reducers to 200, your Map outputs got partitioned 
among 200 reducers, so you didn't get this problem.

By setting the max memory of your reducer with mapred.child.java.opts, you can 
get over this problem.

Regards,
Ravi teja



From: Mingxi Wu [mingxi...@turn.com]
Sent: 30 November 2011 05:14:49
To: common-dev@hadoop.apache.org
Subject: Hadoop - non disk based sorting?

Hi,

I have a question regarding the shuffle phase of reducer.

It appears that when there is large map output (in my case, 5 billion records), I 
will get an out of memory Error like below.

Error: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1592)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1452)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)

However, I thought the shuffle phase uses a disk-based sort, which is not 
constrained by memory.
So why would a user run into this out-of-memory error? After I increased my number 
of reducers from 100 to 200, the problem went away.

Any input regarding this memory issue would be appreciated!

Thanks,

Mingxi



Re: Which branch for my patch?

2011-11-30 Thread Robert Evans
Niels,

I think that the branch you put it on depends mostly on where you and others 
want to see this feature (splittable Gzip) go in.  At a minimum you should 
target trunk.  If you want to see it go into 1.* then you probably also want to 
port it to that line (branch-1).  Once they are in, porting it to other branches 
should be fairly simple and really depends on where you and others want to use 
your feature.

--Bobby Evans

On 11/30/11 10:23 AM, "Niels Basjes"  wrote:

Hi all,

A while ago I created a feature for Hadoop and submitted this to be
included (HADOOP-7076) .
Around the same time the MRv2 started happening and the entire source tree
was restructured.

At this moment I'm prepared to change the patch I created earlier so I can
submit it again for your consideration.

Caused by the email about the new branches (branch-1 and branch-1.0) I'm a
bit puzzled at this moment where to start.

I see the mentioned branches and the trunk at probable starting points.

As far as I understand the repository structure the branch-1 is the basis
for the "old style" Hadoop and the trunk is the basis for the "yarn" Hadoop.

For which branch of the source tree should I make my changes so you guys
will reevaluate it for inclusion?

Thanks.

--
Best regards / Met vriendelijke groeten,

Niels Basjes



Re: Parallel mapred jobs in Yarn

2011-11-09 Thread Robert Evans
The configuration options are somewhat different for YARN than they are for 
MRv1.  You probably want to generate the documentation for YARN

mvn site

And then read through it to see how to set up your cluster

./hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-site/target/site/index.html

There is documentation about setting up the capacity scheduler too.  If you run 
into any issues then reply here, and if the documentation needs to be cleaned up 
we can file a JIRA against it.  The documentation is kind of 
new, so it would be good to get some real feedback on it.
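
For reference, a hedged pointer to the knob involved (the property and class names 
below are my assumption for this era of YARN, so confirm them against the generated 
site before relying on them): in yarn-site.xml set

  yarn.resourcemanager.scheduler.class =
    org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler

and restart the ResourceManager.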

--Bobby Evans

On 11/9/11 3:07 AM, "Vinod Kumar Vavilapalli"  wrote:

FairScheduler isn't ported yet to YARN. The default scheduler is there (
which is FifoScheduler) and CapacityScheduler can be configured too.

HTH,
+Vinod


On Wed, Nov 9, 2011 at 10:14 AM, Bharath Ravi wrote:

> Thanks, Prashant!
> I'll try Yarn out with the Fairscheduler.
>
> On 8 November 2011 01:01, Prashant Sharma 
> wrote:
>
> > Yes! , you can do the same in yarn as well.
> > -P
> >
> > On Tue, Nov 8, 2011 at 3:24 AM, Bharath Ravi 
> > wrote:
> >
> > > Hi all,
> > >
> > > I have a beginner's question:
> > > How can I configure yarn to allow multiple parallel mapreduce jobs to
> > run?
> > > Currently, the execution is sequential: each submitted job waits for
> the
> > > previous to run.
> > >
> > > In MR1, this could be done by enabling the
> > fairscheduler/capacityscheduler.
> > > Is there a similar configuration in Yarn as well?
> > >
> > > Thanks a lot!
> > > --
> > > Bharath Ravi
> > >
> >
>
>
>
> --
> Bharath Ravi
>



Re: Viewing hadoop mapper output

2011-10-07 Thread Robert Evans
The difference in the command is where the shell script is coming from.  If you 
are using ~/mapper.sh then it will look in your home directory to run the 
script.  If you have a small cluster with your home directory mounted on all of 
them then it is not that big of a deal.  If you have a large cluster, though, 
NFS-mounting the directory on all of the boxes can cause a lot of issues, and 
you should use the distributed cache to send the script over instead (you are 
already sending it through the distributed cache by using the -file option).

I am not completely sure why it would be timing out.  Are all of them timing 
out, or is it just a single mapper?  One thing you can do is to run 
your streaming job with echo instead of mapper.sh, and then use 
its output as input to the command running on your local box.

./hadoop jar ../contrib/streaming/hadoop-0.20.2-streaming.jar -file ~/mapper.sh 
-mapper echo -input ../foo.txt -output output
./hadoop fs -cat output/part-0 | ~/mapper.sh

#or pick a different part file that corresponds to the mapper task that is 
timing out.
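
If it does turn out to be a single long fetch rather than a bad record, one hedged 
tweak worth trying (the reporter:status convention is standard streaming behaviour, 
but verify it on your release) is to have the wget loop quoted below report progress 
to stderr so the framework keeps the task alive between fetches:

while read line; do
  echo "reporter:status:fetching $line" >&2
  result="`wget -O - --timeout=500 "http://$line" 2>&1`"
  echo $result
done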

--Bobby Evans

On 10/7/11 1:43 AM, "Aishwarya Venkataraman"  wrote:

Robert,

My mapper job fails. I am basically trying to run a crawler on hadoop and
hadoop kills the crawler (mapper) if it has not heard from it for a certain
timeout period. But I already have a timeout set in my mapper (500 seconds),
which is less than Hadoop's timeout (900 seconds). The mapper just stalls
for some reason. My mapper code is as follows:

while read line;do
  result="`wget -O - --timeout=500 http://$line 2>&1`"
  echo $result
done

Any idea why my mapper is getting stalled ?

I don't see the difference between the command you have given and the one I
ran. I am not running in local mode. Is there some way by which I can get
intermediate mapper outputs ? I would like to see for which site the mapper
is getting stalled.

Thanks,
Aishwarya

On Thu, Oct 6, 2011 at 1:41 PM, Robert Evans  wrote:

> Alshwarya,
>
> Are you running in local mode?  If not you probably want to run
>
> hadoop jar ../contrib/streaming/hadoop-0.20.2-streaming.jar -file
> ~/mapper.sh -mapper ./mapper.sh -input ../foo.txt -output output
>
> You may also want to run hadoop fs -ls output/* to see what files were
> produced.  If your mappers failed for some reason then there will be no
> files in the output directory. And you may want to look at the stderr logs
> for your processes through the web UI.
>
> --Bobby Evans
>
> On 10/6/11 3:30 PM, "Aishwarya Venkataraman"  wrote:
>
> I ran the following (I am using IdentityReducer) :
>
> ./hadoop jar ../contrib/streaming/hadoop-0.20.2-streaming.jar -file
> ~/mapper.sh -mapper ~/mapper.sh -input ../foo.txt -output output
>
> When I do
> ./hadoop dfs -cat output/* I do not see any output on screen. Is this how I
> view the output of mapper ?
>
> Thanks,
> AIshwarya
>
> On Thu, Oct 6, 2011 at 12:37 PM, Robert Evans  wrote:
>
> > A streaming job's stderr is logged for the task, but its stdout is what is
> > sent to the reducer.  The simplest way to get it is to turn off the
> > reducers, and then look at the output in HDFS.
> >
> > --Bobby Evans
> >
> > On 10/6/11 1:16 PM, "Aishwarya Venkataraman" 
> wrote:
> >
> > Hello,
> >
> > I want to view the mapper output for a given hadoop streaming jobs (that
> > runs a shell script). However I am not able to find this in any log
> files.
> > Where should I look for this ?
> >
> > Thanks,
> > Aishwarya
> >
> >
>
>
> --
> Thanks,
> Aishwarya Venkataraman
> avenk...@cs.ucsd.edu
> Graduate Student | Department of Computer Science
> University of California, San Diego
>
>


--
Thanks,
Aishwarya Venkataraman
avenk...@cs.ucsd.edu
Graduate Student | Department of Computer Science
University of California, San Diego



Re: Viewing hadoop mapper output

2011-10-06 Thread Robert Evans
Alshwarya,

Are you running in local mode?  If not you probably want to run

hadoop jar ../contrib/streaming/hadoop-0.20.2-streaming.jar -file ~/mapper.sh 
-mapper ./mapper.sh -input ../foo.txt -output output

You may also want to run hadoop fs -ls output/* to see what files were 
produced.  If your mappers failed for some reason then there will be no files 
in the output directory. And you may want to look at the stderr logs for your 
processes through the web UI.

--Bobby Evans

On 10/6/11 3:30 PM, "Aishwarya Venkataraman"  wrote:

I ran the following (I am using IdentityReducer) :

./hadoop jar ../contrib/streaming/hadoop-0.20.2-streaming.jar -file
~/mapper.sh -mapper ~/mapper.sh -input ../foo.txt -output output

When I do
./hadoop dfs -cat output/* I do not see any output on screen. Is this how I
view the output of mapper ?

Thanks,
AIshwarya

On Thu, Oct 6, 2011 at 12:37 PM, Robert Evans  wrote:

> A streaming job's stderr is logged for the task, but its stdout is what is
> sent to the reducer.  The simplest way to get it is to turn off the
> reducers, and then look at the output in HDFS.
>
> --Bobby Evans
>
> On 10/6/11 1:16 PM, "Aishwarya Venkataraman"  wrote:
>
> Hello,
>
> I want to view the mapper output for a given hadoop streaming jobs (that
> runs a shell script). However I am not able to find this in any log files.
> Where should I look for this ?
>
> Thanks,
> Aishwarya
>
>


--
Thanks,
Aishwarya Venkataraman
avenk...@cs.ucsd.edu
Graduate Student | Department of Computer Science
University of California, San Diego



Re: Viewing hadoop mapper output

2011-10-06 Thread Robert Evans
A streaming job's stderr is logged for the task, but its stdout is what is sent 
to the reducer.  The simplest way to get it is to turn off the reducers, and 
then look at the output in HDFS.
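
For example, a hedged sketch against 0.20.2 (paths and file names are placeholders; 
the 0.20 streaming docs describe setting mapred.reduce.tasks to 0 for map-only jobs):

hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -input in.txt -output map-out \
  -mapper ./mapper.sh -file ~/mapper.sh
hadoop fs -cat 'map-out/part-*'   # the raw stdout of the mappers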

--Bobby Evans

On 10/6/11 1:16 PM, "Aishwarya Venkataraman"  wrote:

Hello,

I want to view the mapper output for a given hadoop streaming jobs (that
runs a shell script). However I am not able to find this in any log files.
Where should I look for this ?

Thanks,
Aishwarya



Re: problem of lost name-node

2011-09-28 Thread Robert Evans
There is also some work underway to add in HA and failover to the namenode.  
You might get more success if you send your note to hdfs-dev instead of 
common-dev.  One other thing that can sometimes get a discussion going is to 
just file a JIRA for it.  People interested in it are likely to start watching 
it, and you can often have a good conversation there about it.

--Bobby Evans

On 9/28/11 8:27 AM, "Ravi Prakash"  wrote:

Hi Mirko,

It seems like a great idea to me! The architects and senior developers
might have some more insight on this though.

I think part of the reason why the community might be lazy about
implementing this is that the NameNode being a single point of failure is
usually regarded as FUD. There are simple tricks (like writing the fsimage
and editslog to NFS) which can guard against some failure scenarios, and I
think most users of hadoop are satisfied with that.
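
(A hedged sketch of that trick, with the 0.20/1.x property name and placeholder 
paths; dfs.name.dir accepts a comma-separated list and the NameNode writes its 
fsimage and edits log to every directory in it.) In hdfs-site.xml:

  dfs.name.dir = /data/1/dfs/name,/mnt/nfs/namenode

so a local disk and an NFS mount each hold a full copy of the metadata.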

I wouldn't be too surprised if there is already a JIRA for this. But if you
could come up with a patch, I'm hopeful the community would be interested in
it.

Cheers
Ravi

2011/9/27 Mirko Kämpf 

> Hi,
> during the Cloudera Developer Training at Berlin I came up with an idea,
> regarding a lost name-node.
> As in this case all data blocks are lost. The solution could be to have a
> table which relates filenames and block_ids on that node, which can be
> scanned after a name-node is lost. Or every block could carry a kind of
> backlink to the filename and the total number of blocks and/or a total hashsum.
> This would make it easy to recover with minimal overhead.
>
> Now I would like to ask the developer community, if there is any good
> reason
> not to do this?
> Before I start to figure out where to start an implementation of such a
> feature.
>
> Thanks,
> Mirko
>



Re: Maven eclipse plugin issue

2011-09-20 Thread Robert Evans
Sorry that was a different page I looked at before going to the JIRA.

It said that we can manually add in the target/generated-test-source/java 
directory as a source directory with something like

target/generated-test-source/java/**/*.java

I have no idea if it will work, but it looked cleaner to me.

--Bobby Evans

On 9/20/11 9:14 AM, "Alejandro Abdelnur"  wrote:

Bobby,

What is the  POM change you are referring to?

Thanks.

Alejandro

On Tue, Sep 20, 2011 at 7:00 AM, Robert Evans  wrote:

> This is a known issue with the eclipse maven mojo
>
> http://jira.codehaus.org/browse/MECLIPSE-37
>
> The JIRA also describes a workaround, add the generated tests directory in
> the eclipse config with a pom change, which I think would be better then
> trying to  move the phase where test code is generated.  So please file a
> JIRA for it and we can discuss the proper fix in context of that JIRA.
>
> --Bobby Evans
>
>
> On 9/20/11 8:35 AM, "Alejandro Abdelnur"  wrote:
>
> Laxman,
>
> This is not an incorrect usage of maven phases, those generated Java classes
> are test classes, thus their generation is in the 'generate-test-sources' phase.
>
> The problem seems to be that eclipse does not recognize the
> target/generated-test-source/java directory as a source directory (for
> example, IntelliJ does).
>
> One thing we could try (not 100% correct but it would simplify the life of
> eclipse developers) is -as you suggest- to change the phase to
> 'generate-sources'. But the generated sources and corresponding compiled
> classes there must be compiled and used for testing, the classes should end
> up in the target/test-classes directory.
>
> If the above is doable it should be a nice workaround.
>
> Please open a JIRA to follow up with this. Note that is not only in common
> that code is generated, but in mapreduce as well. And there are different
> things being generated, avro, protobuf, etc.
>
> Thanks.
>
> Alejandro
>
> On Tue, Sep 20, 2011 at 2:41 AM, Laxman  wrote:
>
> > Hi All,
> >
> >
> >
> > I can see lot of compilation issues after setting up my development
> > environment using "mvn eclipse:eclipse".
> >
> > All these compilation issues are resolved after adding
> > "target/generated-test-sources" as a source folder to the common project.
> >
> >
> >
> > When verified the "pom.xml", it's noticed that these are included under
> > "generate-test-sources" phase.
> >
> > This seems to be a problem that occurred because of an incorrect
> > understanding/usage of "build-helper-maven-plugin" in the Common project.
> >
> >
> >
> > All these compilation issues are resolved after changing the phase to
> > "generate-sources".
> >
> >
> >
> > Please correct me if my understanding is wrong.
> >
> >
> >
> > I found similar issue here.
> >
> > https://issues.sonatype.org/browse/MNGECLIPSE-2387
> >
> > --
> >
> > Thanks,
> >
> > Laxman
> >
> >
>
>



Re: Maven eclipse plugin issue

2011-09-20 Thread Robert Evans
This is a known issue with the eclipse maven mojo

http://jira.codehaus.org/browse/MECLIPSE-37

The JIRA also describes a workaround, add the generated tests directory in the 
eclipse config with a pom change, which I think would be better then trying to  
move the phase where test code is generated.  So please file a JIRA for it and 
we can discuss the proper fix in context of that JIRA.

--Bobby Evans


On 9/20/11 8:35 AM, "Alejandro Abdelnur"  wrote:

Laxman,

This is not an incorrect usage of maven phases, those generated Java classes
are test classes, thus their generation is in the 'generate-test-sources' phase.

The problem seems to be that eclipse does not recognize the
target/generated-test-source/java directory as a source directory (for
example, IntelliJ does).

One thing we could try (not 100% correct but it would simplify the life of
eclipse developers) is -as you suggest- to change the phase to
'generate-sources'. But the generated sources and corresponding compiled
classes there must be compiled and used for testing, the classes should end
up in the target/test-classes directory.

If the above is doable it should be a nice workaround.

Please open a JIRA to follow up with this. Note that is not only in common
that code is generated, but in mapreduce as well. And there are different
things being generated, avro, protobuf, etc.

Thanks.

Alejandro

On Tue, Sep 20, 2011 at 2:41 AM, Laxman  wrote:

> Hi All,
>
>
>
> I can see lot of compilation issues after setting up my development
> environment using "mvn eclipse:eclipse".
>
> All these compilation issues are resolved after adding
> "target/generated-test-sources" as a source folder to the common project.
>
>
>
> When verified the "pom.xml", it's noticed that these are included under
> "generate-test-sources" phase.
>
> This seems to be a problem that occurred because of an incorrect
> understanding/usage of "build-helper-maven-plugin" in the Common project.
>
>
>
> All these compilation issues are resolved after changing the phase to
> "generate-sources".
>
>
>
> Please correct me if my understanding is wrong.
>
>
>
> I found similar issue here.
>
> https://issues.sonatype.org/browse/MNGECLIPSE-2387
>
> --
>
> Thanks,
>
> Laxman
>
>



Re: Platform MapReduce - Enterprise Features

2011-09-12 Thread Robert Evans
Chi,

Most of these features are things that Hadoop is working on.  There is an HA 
branch in progress that should go into trunk relatively soon.

As far as batch system integration is concerned, if what you care about is 
the scheduling of jobs (which jobs run when and on which machines), you can write 
your own scheduler against the standard scheduler API.

--Bobby Evans

On 9/12/11 1:04 PM, "Chi Chan"  wrote:

Are any Hadoop implementations planning to add "enterprise features"
in Platform MapReduce?

http://www.youtube.com/watch?v=QV4wJifsqbQ
http://www.youtube.com/watch?v=cDfZTx-BOyY
http://www.youtube.com/watch?v=MEKXo-1hnkQ

Platform said that its MapReduce implementation totally replaces the
JobTracker, while the rest of the Hadoop stack is unchanged. Is there
a Hadoop API that would allow external batch systems (like Grid Engine
or Open Grid Scheduler, PBS, Condor, SLURM, etc) to plug into Hadoop?

--Chi



Re: JIRA attachments order

2011-09-09 Thread Robert Evans
Can I ask, though, that we add branch information to the patches?  Too often 
a patch is intended to apply to some branch other than trunk, and there is no 
easy way to tell which branch it was intended for.
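
A hedged illustration of what that could look like (HADOOP-1234 is a placeholder; 
keeping each branch's file name stable also preserves the grey-out behaviour Ted 
and Doug describe below):

cd ~/src/branch-1 && svn diff > HADOOP-1234-branch-1.patch
cd ~/src/trunk && svn diff > HADOOP-1234-trunk.patch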

--Bobby Evans


On 9/9/11 10:52 AM, "Mattmann, Chris A (388J)"  
wrote:

Wow, I didn't know that!

Learn something new everyday, thanks guys.

Cheers,
Chris

On Sep 9, 2011, at 9:48 AM, Doug Cutting wrote:

> On 09/09/2011 07:27 AM, Ted Dunning wrote:
>> If you post the same patch with the same name, JIRA helps you out by greying
>> all the earlier versions out.
>
> Indeed.  That's the best practice, not to add version numbers to patch
> files, for this very reason.  We should perhaps note this on:
>
> http://wiki.apache.org/hadoop/HowToContribute
>
> I am a Jira administrator and would be happy to change the default
> ordering of attachments if it were possible, however I can see no option
> to do so.
>
> Doug


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




Re: ERROR building latest trunk for Hadoop project

2011-08-31 Thread Robert Evans
I ran into the same error with mvn compile.  There are some issues with 
dependency resolution in mvn and you need to run

mvn test -DskipTests

to compile the code.

--Bobby


On 8/30/11 7:21 AM, "Praveen Sripati"  wrote:

Rerun the build with the below options and see if you can get more
information to solve this.

>> [ERROR] To see the full stack trace of the errors, re-run Maven with the
-e switch.
>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.

Thanks,
Praveen

On Tue, Aug 30, 2011 at 12:16 PM, A BlueCoder wrote:

> Hi, I have checked out the HEAD version of Hadoop trunk from svn, and run
> into the following errors when I do 'mvn compile'.
>
> Can some one shed some light on what goes wrong?
>
> Thanks a lot,
>
> B.C. 008
>
> NB, I have separately built and installed ProtocolBuffer package:
> protocobuf-2.4.1 successfully.
>
> 
> screenshot
> --
> [INFO]
> [INFO] --- maven-antrun-plugin:1.6:run
> (create-protobuf-generated-sources-directory) @ hadoop-yarn-c
> ommon ---
> [INFO] Executing tasks
>
> main:
>[mkdir] Created dir:
> C:\temp\hadoop_trunk1\hadoop-mapreduce-project\hadoop-yarn\hadoop-yarn-comm
> on\target\generated-sources\proto
> [INFO] Executed tasks
> [INFO]
> [INFO] --- exec-maven-plugin:1.2:exec (generate-sources) @
> hadoop-yarn-common ---
> [INFO]
> [INFO] --- exec-maven-plugin:1.2:exec (generate-version) @
> hadoop-yarn-common ---
> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Apache Hadoop Project POM . SUCCESS [3.265s]
> [INFO] Apache Hadoop Annotations . SUCCESS [0.766s]
> [INFO] Apache Hadoop Project Dist POM  SUCCESS [0.000s]
> [INFO] Apache Hadoop Assemblies .. SUCCESS [0.156s]
> [INFO] Apache Hadoop Alfredo . SUCCESS [0.578s]
> [INFO] Apache Hadoop Common .. SUCCESS
> [52.877s]
> [INFO] Apache Hadoop Common Project .. SUCCESS [0.015s]
> [INFO] Apache Hadoop HDFS  SUCCESS [9.376s]
> [INFO] Apache Hadoop HDFS Project  SUCCESS [0.000s]
> [INFO] hadoop-yarn-api ... SUCCESS
> [20.985s]
> [INFO] hadoop-yarn-common  FAILURE [0.500s]
> [INFO] hadoop-yarn-server-common . SKIPPED
> [INFO] hadoop-yarn-server-nodemanager  SKIPPED
> [INFO] hadoop-yarn-server-resourcemanager  SKIPPED
> [INFO] hadoop-yarn-server-tests .. SKIPPED
> [INFO] hadoop-yarn-server  SKIPPED
> [INFO] hadoop-yarn ... SKIPPED
> [INFO] hadoop-mapreduce-client-core .. SKIPPED
> [INFO] hadoop-mapreduce-client-common  SKIPPED
> [INFO] hadoop-mapreduce-client-shuffle ... SKIPPED
> [INFO] hadoop-mapreduce-client-app ... SKIPPED
> [INFO] hadoop-mapreduce-client-hs  SKIPPED
> [INFO] hadoop-mapreduce-client-jobclient . SKIPPED
> [INFO] hadoop-mapreduce-client ... SKIPPED
> [INFO] hadoop-mapreduce .. SKIPPED
> [INFO] Apache Hadoop Main  SKIPPED
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 1:29.940s
> [INFO] Finished at: Mon Aug 29 23:41:34 PDT 2011
> [INFO] Final Memory: 11M/40M
> [INFO]
> 
> [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2:exec
> (generate-version) on pr
> oject hadoop-yarn-common: Command execution failed. Cannot run program
> "scripts\saveVersion.sh" (in
> directory
>
> "C:\temp\hadoop_trunk1\hadoop-mapreduce-project\hadoop-yarn\hadoop-yarn-common"):
> CreatePr
> ocess error=2, The system cannot find the file specified -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please
> read the following arti
> cles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
> [ERROR]   mvn  -rf :hadoop-yarn-common
>



Re: Question about invoking an executable from Hadoop mapper

2011-08-25 Thread Robert Evans
I think this is a Java issue.  I don't think that it is launching a shell to 
run your command.  I think it is just splitting on whitespace and then passing 
all the args to hadoop.  What you want to do is to run

sh -c 'hadoop dfs -cat file | myExec'

Or, with streaming, write a small shell script that has the command in it and 
then tell the mapper/reducer to use that.  The key with streaming is that you 
have to make sure that you read all of stdin before exiting, or it will error out:

#!/bin/sh
hadoop fs -cat file | myExec
# Read all of the input to this mapper.
cat > /dev/null


--Bobby Evans

On 8/25/11 10:32 AM, "Zhixuan Zhu"  wrote:

I think this should be a common use case. I'm trying to invoke an
executable from my mapper. Because my executable takes streaming input, I
used something like the following:

String lCmdStr = "hadoop dfs -cat file | myExec";

Process lChldProc = Runtime.getRuntime().exec(lCmdStr);

All executable and file names are full qualified name with path. I got
the following errors:

cat: File does not exist: |
cat: File does not exist: myExec

Looks like this kind of streaming/pipe method was not supported from a
mapper? I can take the exact command string and run it directly on a
Hadoop server and it works. Anybody have any experience with it?

I also tried to use Hadoop streaming and it did not work either. It did
not give any error, but nothing happened either. My program is supposed
to write a file on the local system and it's not there. I'm at my wit's
end. Any help is most appreciated. Hadoop version 0.20.2.

String lCmdStr = "hadoop jar hadoop-0.20.2-streaming.jar -input
inputFile -output outputFile -mapper myExec";

Thanks very much,
Grace




Re: how to pass a hdfs file to a c++ process

2011-08-23 Thread Robert Evans
Hadoop streaming is the simplest way to do this, if your program is set up to 
take stdin as its input, write its output to stdout, and each record ("file" 
in your case) is a single line of text.

You need to be able to have it work with the following shell script

hadoop fs -cat <file> | head -1 | ./myprocess > output.txt

And ideally what is stored in output.txt are lines of text that can have their 
order rearranged without impacting the result (this is not a requirement unless 
you want to use a reduce too, but streaming will still try to parse it that way).

If not, there are tricks you can play to make it work, but they are kind of ugly.
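
If the one-record-per-line constraint does hold, a hedged sketch of the whole 
invocation for 0.20.2 (paths and names are placeholders; mapred.reduce.tasks=0 
keeps the job map-only so the program's stdout lands straight in HDFS):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -input /user/grace/input \
  -output /user/grace/output \
  -mapper ./myprocess \
  -file /local/path/to/myprocess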

--Bobby Evans


On 8/22/11 2:57 PM, "Zhixuan Zhu"  wrote:

Hi All,

I'm using hadoop-0.20.2 to try out some simple tasks. I asked a question
about FileInputFormat a few days ago and get some prompt replys from
this forum and it helped a lot. Thanks again! Now I have another
question. I'm trying to invoke a C++ process from my mapper for each
hdfs file in the input directory to achieve some parallel processing.
But how do I pass the file to the program? I would want to do something
like the following in my mapper:

Process lChldProc = Runtime.getRuntime().exec("myprocess -file
$filepath");

How do I pass the hdfs filesystem to an outside process like that? Is
HadoopStreaming the direction I should go?

Thanks very much for any reply in advance.

Best,
Grace