Here are the relevant lines from the install scripts:

  HADOOP_VERSION=${HADOOP_VERSION:-0.20.2}
  HADOOP_HOME=/usr/local/hadoop-$HADOOP_VERSION
  HADOOP_CONF_DIR=$HADOOP_HOME/conf

Have you tried changing CHILD_OPTS in apache/hadoop/post-configure and
using that custom script to deploy a cluster? I don't have a running
cluster to check this right now.
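So with the default version above, the config files on each node end up
under /usr/local/hadoop-0.20.2/conf. Untested, but the change I have in
mind is just editing the variables your earlier mail (quoted below)
already identified in apache/hadoop/post-configure, e.g.:

  # modified apache/hadoop/post-configure -- untested sketch,
  # values taken from earlier in this thread
  MAX_MAP_TASKS=16
  MAX_REDUCE_TASKS=24
  CHILD_OPTS=-Xmx1700m

and then checking the generated hadoop-site.xml on a node to see whether
CHILD_OPTS actually makes it into mapred.child.java.opts.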
On Thu, Feb 3, 2011 at 7:18 PM, <praveen.pe...@nokia.com> wrote:
> Hi Tom/all,
> Where are the hadoop config files stored on the cluster nodes? I would
> like to debug this issue, since I need to give more memory to the child
> Java mapred processes so they can process huge chunks of data.
>
> Thanks
> Praveen
>
> -----Original Message-----
> From: ext praveen.pe...@nokia.com [mailto:praveen.pe...@nokia.com]
> Sent: Wednesday, February 02, 2011 5:23 PM
> To: whirr-user@incubator.apache.org
> Subject: RE: Running Mapred jobs after launching cluster
>
> Can anyone think of a reason why the property below is not honoured when
> I overwrote it along with other properties in post-configure? The other
> properties are correctly overwritten except this one. I need to set the
> mapred child task JVM heap to more than 200m.
>
> Praveen
> ________________________________________
> From: Peddi Praveen (Nokia-MS/Boston)
> Sent: Tuesday, February 01, 2011 11:21 AM
> To: whirr-user@incubator.apache.org
> Subject: RE: Running Mapred jobs after launching cluster
>
> Thanks Tom. Silly me, I should have thought of the property name. It
> works now except for one issue: I ran the wordcount example and saw that
> the number of map and reduce tasks is as I configured in the
> post-configure script, but for some reason the property below in job.xml
> is always -Xmx200m even though I set it to -Xmx1700m. Not sure if this
> property is special in some way.
>
>   mapred.child.java.opts  -Xmx200m
>
> Praveen
> ________________________________________
> From: ext Tom White [tom.e.wh...@gmail.com]
> Sent: Tuesday, February 01, 2011 12:13 AM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> Try setting whirr.run-url-base, not run-url-base.
>
> Tom
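(Side note for the archives: with the scripts copied to a local
webserver, hadoop.properties needs the whirr.-prefixed property name,

  whirr.run-url-base=http://localhost/

rather than the bare run-url-base=... line used in the earlier attempt
quoted below.)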
> On Mon, Jan 31, 2011 at 5:33 PM, <praveen.pe...@nokia.com> wrote:
> > I am not using cdh (for now anyway) but the default hadoop. I even
> > changed the "localhost" to the IP address and still no luck. It is
> > likely that I am doing something wrong, but I'm having a hard time
> > debugging it. Here are the properties I changed in
> > /var/www/apache/hadoop/post-configure, but when I run the job I am
> > not seeing these values:
> >
> >   MAX_MAP_TASKS=16
> >   MAX_REDUCE_TASKS=24
> >   CHILD_OPTS=-Xmx1700m
> >
> > Here is what I see in /tmp/runscript/runscript.sh on the master node.
> > It doesn't look like it used my scripts...
> >
> >   installRunUrl || exit 1
> >   runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/util/configure-hostnames -c cloudservers
> >   runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/sun/java/install
> >   runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/apache/hadoop/install -c cloudservers
> >
> > Any suggestions?
> > Praveen
> > ________________________________________
> > From: ext Tom White [tom.e.wh...@gmail.com]
> > Sent: Monday, January 31, 2011 6:23 PM
> > To: whirr-user@incubator.apache.org
> > Subject: Re: Running Mapred jobs after launching cluster
> >
> > On Mon, Jan 31, 2011 at 3:03 PM, <praveen.pe...@nokia.com> wrote:
> >> If I have to upload the files to a webserver anyway, do I still need
> >> the patch? It looks like the script has these properties that I can
> >> overwrite.
> >
> > I suggested you look at the patch (WHIRR-55) so you can see how it
> > will be possible once it's committed. To try it out you need to
> > upload the scripts to a webserver (since the patch changes one of
> > them).
> >
> >> BTW I tried the webserver path and I could not make it work so far.
> >>
> >> 1. I copied the scripts/apache folder to my /var/www folder and
> >>    modified the three properties mentioned above in
> >>    /var/www/apache/hadoop/post-configure.
> >> 2. I changed hadoop.properties, adding the following line:
> >>    run-url-base=http://localhost/
> >> 3. Launched the cluster and verified the job properties are not what
> >>    I changed them to. They are all defaults.
> >
> > This looks right to me. If you are using CDH you need to change
> > cloudera/cdh/post-configure.
> >
> >> How do I debug this issue?
> >
> > You can log into the instances (see the FAQ for how to do this) and
> > look at the scripts that actually ran (and their output) in the /tmp
> > directory.
> >
> > Tom
> >
> >> Praveen
> >>
> >> Launched the cluster and I didn't see the child JVM get the 2G
> >> allocation.
> >>
> >> -----Original Message-----
> >> From: ext Tom White [mailto:tom.e.wh...@gmail.com]
> >> Sent: Monday, January 31, 2011 3:02 PM
> >> To: whirr-user@incubator.apache.org
> >> Subject: Re: Running Mapred jobs after launching cluster
> >>
> >> Hi Praveen,
> >>
> >> I think removing the webserver dependency (or making it optional)
> >> would be a good goal, but we're not there yet. I've just created
> >> https://issues.apache.org/jira/browse/WHIRR-225 as a place to
> >> discuss the design and implementation.
> >>
> >> In the meantime you could take a look at
> >> https://issues.apache.org/jira/browse/WHIRR-55, and try using the
> >> patch there to override some Hadoop properties (you will still need
> >> to upload the scripts to a webserver until it is committed, since it
> >> modifies Hadoop's post-configure script).
> >>
> >> Hope this helps.
> >>
> >> Cheers,
> >> Tom
> >>
> >> BTW what are the security concerns you have? There are no
> >> credentials embedded in the scripts, so it should be safe to host
> >> them publicly, no?
> >>
> >> On Mon, Jan 31, 2011 at 11:00 AM, <praveen.pe...@nokia.com> wrote:
> >>> Hi Tom,
> >>> If the hadoop install is fixed, Whirr must be getting all the
> >>> default hadoop properties from the hadoop install itself, correct?
> >>> I sent an email about configuring hadoop properties and you
> >>> mentioned I need to put the modified scripts on a webserver that is
> >>> publicly accessible. I was wondering if there is a place inside the
> >>> hadoop install I can change so that I don't need to put the scripts
> >>> on a webserver (for security reasons). Do you think it is possible?
> >>> If so, how? I do not mind customizing the jar file for our
> >>> purposes. I want to change the following properties:
> >>>
> >>>   mapred.reduce.tasks=24
> >>>   mapred.map.tasks=64
> >>>   mapred.child.java.opts=-Xmx2048m
> >>>
> >>> Thanks in advance.
> >>> Praveen
> >>>
> >>> -----Original Message-----
> >>> From: ext Tom White [mailto:tom.e.wh...@gmail.com]
> >>> Sent: Friday, January 28, 2011 4:02 PM
> >>> To: whirr-user@incubator.apache.org
> >>> Subject: Re: Running Mapred jobs after launching cluster
> >>>
> >>> It is fixed, and currently on 0.20.2. It will be made configurable
> >>> in https://issues.apache.org/jira/browse/WHIRR-222.
> >>>
> >>> Cheers
> >>> Tom
> >>>
> >>> On Fri, Jan 28, 2011 at 12:56 PM, <praveen.pe...@nokia.com> wrote:
> >>>> Hi Tom,
> >>>> So the hadoop version is not going to change for a given Whirr
> >>>> install? I thought Whirr was getting the hadoop install
> >>>> dynamically from a URL that would always have the latest hadoop
> >>>> version. If that is not the case I guess I am fine. I just don't
> >>>> want to get a hadoop version mismatch six months after our
> >>>> software is released just because a new hadoop version came out.
> >>>>
> >>>> Thanks
> >>>> Praveen
> >>>>
> >>>> -----Original Message-----
> >>>> From: ext Tom White [mailto:tom.e.wh...@gmail.com]
> >>>> Sent: Friday, January 28, 2011 3:35 PM
> >>>> To: whirr-user@incubator.apache.org
> >>>> Subject: Re: Running Mapred jobs after launching cluster
> >>>>
> >>>> On Fri, Jan 28, 2011 at 12:06 PM, <praveen.pe...@nokia.com> wrote:
> >>>>> Thanks Tom. I think I got it working with my own driver, so I
> >>>>> will go with that for now (unless it proves to be a bad option).
> >>>>>
> >>>>> BTW, could you tell me how to stick with one hadoop version while
> >>>>> launching the cluster? I have hadoop-0.20.2 in my classpath but
> >>>>> it looks like Whirr gets the latest hadoop from the repository.
> >>>>> Since the latest version may change over time, I would like to
> >>>>> pin one version so that a hadoop version mismatch won't happen.
> >>>>
> >>>> You do need to make sure that the versions are the same. See the
> >>>> Hadoop integration tests, which specify the version of Hadoop to
> >>>> use in their POM.
> >>>>
> >>>>> Also, what jar files are necessary for launching a cluster using
> >>>>> Java? Currently I have the CLI version of the jar file, but
> >>>>> that's way too large since it has everything in it.
> >>>>
> >>>> You need Whirr's core and Hadoop jars, as well as their
> >>>> dependencies. If you look at the POMs in the source code they will
> >>>> tell you the dependencies.
> >>>>
> >>>> Cheers
> >>>> Tom
> >>>>
> >>>>> Thanks
> >>>>> Praveen
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: ext Tom White [mailto:tom.e.wh...@gmail.com]
> >>>>> Sent: Friday, January 28, 2011 2:12 PM
> >>>>> To: whirr-user@incubator.apache.org
> >>>>> Subject: Re: Running Mapred jobs after launching cluster
> >>>>>
> >>>>> On Fri, Jan 28, 2011 at 6:28 AM, <praveen.pe...@nokia.com> wrote:
> >>>>>> Thanks Tom. Could you elaborate a little more on the second
> >>>>>> option? What is HADOOP_CONF_DIR here, after launching the
> >>>>>> cluster?
> >>>>>
> >>>>> ~/.whirr/<cluster-name>
> >>>>>
> >>>>>> When you said run in a new process, did you mean using the
> >>>>>> command-line Whirr tool?
> >>>>>
> >>>>> I meant that you could launch Whirr using the CLI, or Java. Then
> >>>>> run the job in another process, with HADOOP_CONF_DIR set.
> >>>>>
> >>>>> The MR jobs you are running can, I assume, be run against an
> >>>>> arbitrary cluster, so you should be able to point them at a
> >>>>> cluster started by Whirr.
> >>>>>
> >>>>> Tom
> >>>>>
> >>>>>> I may finally end up writing my own driver for running external
> >>>>>> mapred jobs so I can have more control, but I was just curious
> >>>>>> whether option #2 is better than writing my own driver.
> >>>>>>
> >>>>>> Praveen
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: ext Tom White [mailto:t...@cloudera.com]
> >>>>>> Sent: Thursday, January 27, 2011 4:01 PM
> >>>>>> To: whirr-user@incubator.apache.org
> >>>>>> Subject: Re: Running Mapred jobs after launching cluster
> >>>>>>
> >>>>>> If they implement the Tool interface then you can set
> >>>>>> configuration on them. Failing that you could set
> >>>>>> HADOOP_CONF_DIR and run them in a new process.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Tom
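Putting Tom's two suggestions together: HADOOP_CONF_DIR ends up in
~/.whirr/<cluster-name> (see above), and any job whose driver implements
Tool accepts -D property overrides on the command line. So running an
unmodified job against a Whirr-launched cluster from a separate process
might look like this -- an untested sketch, with a made-up cluster name,
jar and class:

  # point the hadoop client at the config files Whirr wrote locally
  export HADOOP_CONF_DIR=~/.whirr/myhadoopcluster
  # Tool-based drivers pick up -D overrides via GenericOptionsParser
  hadoop jar my-job.jar com.example.MyJob \
      -D mapred.child.java.opts=-Xmx1700m input output

Mahout's job drivers implement Tool too, so the same -D mechanism should
work for them without touching their code.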
> >>>>>> On Thu, Jan 27, 2011 at 12:52 PM, <praveen.pe...@nokia.com> wrote:
> >>>>>>> Hmm...
> >>>>>>> Some of the map reduce jobs I am running were written by me,
> >>>>>>> but some are in external libraries (e.g. Mahout) which I don't
> >>>>>>> have control over. Since I can't modify the code in external
> >>>>>>> libraries, is there any other way to make this work?
> >>>>>>>
> >>>>>>> Praveen
> >>>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: ext Tom White [mailto:tom.e.wh...@gmail.com]
> >>>>>>> Sent: Thursday, January 27, 2011 3:42 PM
> >>>>>>> To: whirr-user@incubator.apache.org
> >>>>>>> Subject: Re: Running Mapred jobs after launching cluster
> >>>>>>>
> >>>>>>> You don't need to add anything to the classpath, but you need
> >>>>>>> to use the configuration in the org.apache.whirr.service.Cluster
> >>>>>>> object to populate your Hadoop Configuration object so that
> >>>>>>> your code knows which cluster to connect to. See the
> >>>>>>> getConfiguration() method in HadoopServiceController for how to
> >>>>>>> do this.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Tom
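The key point is that a bare new Configuration() knows nothing about the
remote cluster, which is why the copy in the code quoted below lands on
the local file system. A minimal sketch of what populating the
configuration boils down to -- the host names and ports are placeholders;
the real values come from the Whirr Cluster object (per Tom's pointer to
HadoopServiceController.getConfiguration()) or from the files under
~/.whirr/<cluster-name>:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsCopy {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Placeholder addresses: take the real ones from the Whirr
      // Cluster object or the generated hadoop-site.xml.
      conf.set("fs.default.name", "hdfs://namenode-host:8020/");
      conf.set("mapred.job.tracker", "jobtracker-host:8021");
      // With fs.default.name set, this returns HDFS, not the local FS.
      FileSystem fs = FileSystem.get(conf);
      fs.copyFromLocalFile(false, true,
          new Path(args[0]), new Path(args[1]));
    }
  }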
> >>>>>>> On Thu, Jan 27, 2011 at 12:21 PM, <praveen.pe...@nokia.com> wrote:
> >>>>>>>> Hello all,
> >>>>>>>> I wrote a Java class, HadoopLauncher, that is very similar to
> >>>>>>>> HadoopServiceController. I was successfully able to launch a
> >>>>>>>> cluster programmatically from my application using Whirr. Now
> >>>>>>>> I want to copy files to HDFS and also run a job
> >>>>>>>> programmatically.
> >>>>>>>>
> >>>>>>>> When I copy a file to HDFS it ends up on the local file
> >>>>>>>> system, not HDFS. Here is the code I used:
> >>>>>>>>
> >>>>>>>>   Configuration conf = new Configuration();
> >>>>>>>>   FileSystem hdfs = FileSystem.get(conf);
> >>>>>>>>   hdfs.copyFromLocalFile(false, true, new Path(localFilePath),
> >>>>>>>>       new Path(hdfsFileDirectory));
> >>>>>>>>
> >>>>>>>> Do I need to add anything else to the classpath so the Hadoop
> >>>>>>>> libraries know to talk to the dynamically launched cluster?
> >>>>>>>> When running Whirr from the command line I know it uses
> >>>>>>>> HADOOP_CONF_DIR to find the hadoop config files, but I am
> >>>>>>>> wondering how to solve this when doing the same from Java.
> >>>>>>>>
> >>>>>>>> Praveen

--
Andrei Savu -- andreisavu.ro