Hi Tom/all,

Where are the Hadoop config files stored on the cluster nodes? I would like to debug this issue, since I need to give the child Java mapred processes more memory to process huge chunks of data.

Thanks,
Praveen
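(For reference, the effective values a client would submit with a job can be printed with a minimal sketch like the one below, assuming the Hadoop 0.20 jars and the client config directory are on the classpath; the class name is just illustrative. On the nodes themselves the files typically sit under the conf/ directory of wherever the install script unpacked Hadoop, but the exact path depends on the script, so inspecting the scripts under /tmp, as Tom suggests further down, is the reliable way to locate them.

import org.apache.hadoop.mapred.JobConf;

public class PrintEffectiveConf {
    public static void main(String[] args) {
        // JobConf loads mapred-default.xml and mapred-site.xml in addition
        // to the core configuration resources.
        JobConf conf = new JobConf();
        // toString() lists the resource files this configuration was built from.
        System.out.println(conf);
        // Hadoop's shipped default for this property is -Xmx200m.
        System.out.println("mapred.child.java.opts = "
                + conf.get("mapred.child.java.opts"));
    }
}
)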
-----Original Message-----
From: ext praveen.pe...@nokia.com [mailto:praveen.pe...@nokia.com]
Sent: Wednesday, February 02, 2011 5:23 PM
To: whirr-user@incubator.apache.org
Subject: RE: Running Mapred jobs after launching cluster

Can anyone think of a reason why the below property is not honoured when I overwrote it, along with other properties, in post-configure? The other properties are correctly overwritten; only this one is not. I need to set the mapred task JVMs to something bigger than 200m.

Praveen

________________________________________
From: Peddi Praveen (Nokia-MS/Boston)
Sent: Tuesday, February 01, 2011 11:21 AM
To: whirr-user@incubator.apache.org
Subject: RE: Running Mapred jobs after launching cluster

Thanks Tom. Silly me, I should have thought of the property name. It works now, except for one issue: I ran the wordcount example and saw that the numbers of map and reduce tasks are as I configured in the post-configure script, but for some reason the below property in job.xml is always -Xmx200m, even though I set it to -Xmx1700m. Not sure if this property is special in some way.

mapred.child.java.opts    -Xmx200m

Praveen
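(For jobs submitted from your own client code, mapred.child.java.opts can also be set per job, independently of the cluster-side scripts; being a job-level property, the client value ends up in job.xml unless the cluster config marks the property final. A minimal sketch using the old mapred API and the -Xmx1700m value from this thread; the class name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class ChildOptsOverride {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Per-job override: this value is copied into job.xml and passed to
        // each child task JVM, unless the cluster marks the property final.
        conf.set("mapred.child.java.opts", "-Xmx1700m");
        JobConf job = new JobConf(conf);
        // ... set mapper/reducer and input/output paths, then JobClient.runJob(job)
        System.out.println(job.get("mapred.child.java.opts")); // prints -Xmx1700m
    }
}
)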
________________________________________
From: ext Tom White [tom.e.wh...@gmail.com]
Sent: Tuesday, February 01, 2011 12:13 AM
To: whirr-user@incubator.apache.org
Subject: Re: Running Mapred jobs after launching cluster

Try setting whirr.run-url-base, not run-url-base.

Tom

On Mon, Jan 31, 2011 at 5:33 PM, <praveen.pe...@nokia.com> wrote:
> I am not using CDH (for now anyway) but the default Hadoop. I even changed
> the "localhost" to the IP address and still no luck. It's likely that I am
> doing something wrong, but I'm having a hard time debugging it.
> Here are the properties I changed in /var/www/apache/hadoop/post-configure,
> but when I run the job I am not seeing these values:
>
> MAX_MAP_TASKS=16
> MAX_REDUCE_TASKS=24
> CHILD_OPTS=-Xmx1700m
>
> Here is what I see in /tmp/runscript/runscript.sh on the master node. It
> doesn't look like it used my scripts...
>
> installRunUrl || exit 1
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/util/configure-hostnames -c cloudservers
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/sun/java/install
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/apache/hadoop/install -c cloudservers
>
> Any suggestions?
> Praveen
> ________________________________________
> From: ext Tom White [tom.e.wh...@gmail.com]
> Sent: Monday, January 31, 2011 6:23 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> On Mon, Jan 31, 2011 at 3:03 PM, <praveen.pe...@nokia.com> wrote:
>> If I have to upload the files to a webserver anyway, do I still need the
>> patch? It looks like the script has these properties that I can
>> overwrite.
>
> I suggested you look at the patch (WHIRR-55) so you can see how it
> will be possible once it's committed. To try it out you need to upload
> the scripts to a webserver (since the patch changes one of them).
>
>>
>> BTW, I tried with the webserver path and I could not make it work so far.
>>
>> 1. I copied the scripts/apache folder to my /var/www folder and modified
>>    the below three properties in /var/www/apache/hadoop/post-configure.
>> 2. I changed hadoop.properties and added the following line:
>>    run-url-base=http://localhost/
>> 3. Launched the cluster and verified that the job properties are not what
>>    I changed them to. They are all defaults.
>
> This looks right to me. If you are using CDH you need to change
> cloudera/cdh/post-configure.
>
>>
>> How do I debug this issue?
>
> You can log into the instances (see the FAQ for how to do this) and
> look at the scripts that actually ran (and their output) in the /tmp
> directory.
>
> Tom
>
>> Praveen
>>
>> Launched the cluster and I didn't see the child JVM get the 2G allocation.
>>
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.wh...@gmail.com]
>> Sent: Monday, January 31, 2011 3:02 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> Hi Praveen,
>>
>> I think removing the webserver dependency (or making it optional)
>> would be a good goal, but we're not there yet. I've just created
>> https://issues.apache.org/jira/browse/WHIRR-225 as a place to discuss
>> the design and implementation.
>>
>> In the meantime you could take a look at
>> https://issues.apache.org/jira/browse/WHIRR-55, and try using the patch
>> there to override some Hadoop properties (you will still need to upload
>> the scripts to a webserver until it is committed, since it modifies
>> Hadoop's post-configure script).
>>
>> Hope this helps.
>>
>> Cheers,
>> Tom
>>
>> BTW, what are the security concerns you have? There are no credentials
>> embedded in the scripts, so it should be safe to host them publicly, no?
>>
>> On Mon, Jan 31, 2011 at 11:00 AM, <praveen.pe...@nokia.com> wrote:
>>> Hi Tom,
>>> If the Hadoop install is fixed, Whirr must be getting all the default
>>> Hadoop properties from the Hadoop install itself, correct? I sent an
>>> email about configuring Hadoop properties and you mentioned I need to
>>> put the modified scripts on a webserver that is publicly accessible. I
>>> was wondering if there is a place inside the Hadoop install I can change
>>> so that I don't need to put the scripts on a webserver (for security
>>> reasons). Do you think it is possible? If so, how? I do not mind
>>> customizing the jar file for our purposes. I want to change the
>>> following properties:
>>>
>>> mapred.reduce.tasks=24
>>> mapred.map.tasks=64
>>> mapred.child.java.opts=-Xmx2048m
>>>
>>> Thanks in advance.
>>> Praveen
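(For jobs launched from your own code, these three values can also be applied per job through the old mapred API, as a stopgap while the script-based override is sorted out. A minimal sketch, with an illustrative class name; note that the framework treats the map-task count only as a hint, while the reduce-task count is honored.

import org.apache.hadoop.mapred.JobConf;

public class PropertyOverrides {
    public static void main(String[] args) {
        JobConf job = new JobConf();
        job.setNumReduceTasks(24);                      // mapred.reduce.tasks (honored)
        job.setNumMapTasks(64);                         // mapred.map.tasks (a hint only;
                                                        // actual count follows input splits)
        job.set("mapred.child.java.opts", "-Xmx2048m"); // child task JVM heap
        // ... configure mapper/reducer and paths, then JobClient.runJob(job)
    }
}
)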
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom.e.wh...@gmail.com]
>>> Sent: Friday, January 28, 2011 4:02 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> It is fixed, and currently on 0.20.2. It will be made configurable in
>>> https://issues.apache.org/jira/browse/WHIRR-222.
>>>
>>> Cheers
>>> Tom
>>>
>>> On Fri, Jan 28, 2011 at 12:56 PM, <praveen.pe...@nokia.com> wrote:
>>>> Hi Tom,
>>>> So the Hadoop version is not going to change for a given Whirr install?
>>>> I thought Whirr was getting the Hadoop install dynamically from a URL
>>>> which is always going to have the latest Hadoop version. If that is not
>>>> the case, I guess I am fine. I just don't want to get a Hadoop version
>>>> mismatch six months after our software is released, just because a new
>>>> Hadoop version got released.
>>>>
>>>> Thanks
>>>> Praveen
>>>>
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:tom.e.wh...@gmail.com]
>>>> Sent: Friday, January 28, 2011 3:35 PM
>>>> To: whirr-user@incubator.apache.org
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>
>>>> On Fri, Jan 28, 2011 at 12:06 PM, <praveen.pe...@nokia.com> wrote:
>>>>> Thanks Tom. I think I got it working with my own driver, so I will go
>>>>> with that for now (unless it proves to be a bad option).
>>>>>
>>>>> BTW, could you tell me how to stick with one Hadoop version while
>>>>> launching the cluster? I have hadoop-0.20.2 in my classpath, but it
>>>>> looks like Whirr gets the latest Hadoop from the repository. Since the
>>>>> latest version may differ depending on the time, I would like to stick
>>>>> to one version so that a Hadoop version mismatch won't happen.
>>>>
>>>> You do need to make sure that the versions are the same. See the Hadoop
>>>> integration tests, which specify the version of Hadoop to use in their
>>>> POM.
>>>>
>>>>>
>>>>> Also, what jar files are necessary for launching a cluster using Java?
>>>>> Currently I have the CLI version of the jar file, but that's way too
>>>>> large since it has everything in it.
>>>>
>>>> You need Whirr's core and Hadoop jars, as well as their dependencies.
>>>> If you look at the POMs in the source code they will tell you the
>>>> dependencies.
>>>>
>>>> Cheers
>>>> Tom
>>>>
>>>>>
>>>>> Thanks
>>>>> Praveen
>>>>>
>>>>> -----Original Message-----
>>>>> From: ext Tom White [mailto:tom.e.wh...@gmail.com]
>>>>> Sent: Friday, January 28, 2011 2:12 PM
>>>>> To: whirr-user@incubator.apache.org
>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>
>>>>> On Fri, Jan 28, 2011 at 6:28 AM, <praveen.pe...@nokia.com> wrote:
>>>>>> Thanks Tom. Could you elaborate a little more on the second option?
>>>>>>
>>>>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>>>>
>>>>> ~/.whirr/<cluster-name>
>>>>>
>>>>>> When you said run in a new process, did you mean using the
>>>>>> command-line Whirr tool?
>>>>>
>>>>> I meant that you could launch Whirr using the CLI, or Java. Then run
>>>>> the job in another process, with HADOOP_CONF_DIR set.
>>>>>
>>>>> The MR jobs you are running I assume can be run against an arbitrary
>>>>> cluster, so you should be able to point them at a cluster started by
>>>>> Whirr.
>>>>>
>>>>> Tom
>>>>>
>>>>>>
>>>>>> I may finally end up writing my own driver for running external
>>>>>> mapred jobs so I can have more control, but I was just curious to
>>>>>> know if option #2 is better than writing my own driver.
>>>>>>
>>>>>> Praveen
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ext Tom White [mailto:t...@cloudera.com]
>>>>>> Sent: Thursday, January 27, 2011 4:01 PM
>>>>>> To: whirr-user@incubator.apache.org
>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>
>>>>>> If they implement the Tool interface then you can set configuration
>>>>>> on them. Failing that you could set HADOOP_CONF_DIR and run them in a
>>>>>> new process.
>>>>>>
>>>>>> Cheers,
>>>>>> Tom
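(To make the Tool option concrete, the pattern in the old mapred API looks roughly like the sketch below; the class name is a placeholder. A job written this way receives its Configuration from ToolRunner, so a caller can inject cluster settings programmatically, and -D key=value generic options on the command line are picked up as well.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch: a job that implements Tool accepts an externally supplied
// Configuration instead of building its own.
public class MyJob extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // injected by ToolRunner, plus -D overrides
        // ... build a JobConf from conf, set mapper/reducer and paths, submit
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // A caller could pass a Configuration populated from the Whirr
        // cluster here instead of a fresh one.
        System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
    }
}

Jobs in external libraries often follow this same pattern, which is what makes the HADOOP_CONF_DIR route workable for them too.)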
>>>>>>
>>>>>> On Thu, Jan 27, 2011 at 12:52 PM, <praveen.pe...@nokia.com> wrote:
>>>>>>> Hmm...
>>>>>>> I am running some map reduce jobs written by me, but some of them
>>>>>>> are in external libraries (e.g. Mahout) which I don't have control
>>>>>>> over. Since I can't modify the code in external libraries, is there
>>>>>>> any other way to make this work?
>>>>>>>
>>>>>>> Praveen
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ext Tom White [mailto:tom.e.wh...@gmail.com]
>>>>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>>>>> To: whirr-user@incubator.apache.org
>>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>>
>>>>>>> You don't need to add anything to the classpath, but you need to use
>>>>>>> the configuration in the org.apache.whirr.service.Cluster object to
>>>>>>> populate your Hadoop Configuration object, so that your code knows
>>>>>>> which cluster to connect to. See the getConfiguration() method in
>>>>>>> HadoopServiceController for how to do this.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Tom
>>>>>>>
>>>>>>> On Thu, Jan 27, 2011 at 12:21 PM, <praveen.pe...@nokia.com> wrote:
>>>>>>>> Hello all,
>>>>>>>> I wrote a Java class HadoopLauncher that is very similar to
>>>>>>>> HadoopServiceController. I was successfully able to launch a
>>>>>>>> cluster programmatically from my application using Whirr. Now I
>>>>>>>> want to copy files to HDFS and also run a job programmatically.
>>>>>>>>
>>>>>>>> When I copy a file to HDFS, it's copying to the local file system,
>>>>>>>> not HDFS. Here is the code I used:
>>>>>>>>
>>>>>>>> Configuration conf = new Configuration();
>>>>>>>> FileSystem hdfs = FileSystem.get(conf);
>>>>>>>> hdfs.copyFromLocalFile(false, true,
>>>>>>>>         new Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>>>>
>>>>>>>> Do I need to add anything else to the classpath so the Hadoop
>>>>>>>> libraries know that they need to talk to the dynamically launched
>>>>>>>> cluster? When running Whirr from the command line I know it uses
>>>>>>>> HADOOP_CONF_DIR to find the Hadoop config files, but when doing
>>>>>>>> the same from Java I am wondering how to solve this issue.
>>>>>>>>
>>>>>>>> Praveen
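(Following Tom's pointer, the fix is to populate the client Configuration with the launched cluster's settings before calling FileSystem.get: a bare Configuration defaults fs.default.name to file:///, which is why the copy landed on the local filesystem. A minimal sketch of the idea, with a placeholder namenode address and example paths standing in for whatever the Whirr cluster actually reports.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the default filesystem at the cluster's namenode; with the
        // stock file:/// default, FileSystem.get returns the local FS.
        conf.set("fs.default.name", "hdfs://namenode-host:8020/"); // placeholder address
        FileSystem hdfs = FileSystem.get(conf);
        hdfs.copyFromLocalFile(false, true,
                new Path("/tmp/data.txt"),     // example local path
                new Path("/user/praveen/in")); // example HDFS directory
    }
}

Alternatively, if Whirr has written a hadoop-site.xml under ~/.whirr/<cluster-name> (the directory Tom mentions above), calling conf.addResource() with that file achieves the same thing without hard-coding the address.)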