Hi Tom/all,

Where are the Hadoop config files stored on the cluster nodes? I would like to debug this issue, since I need to give the child Java mapred processes more memory to process huge chunks of data.

Thanks,
Praveen
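(For reference, the effective values a client would submit with a job can be printed with a minimal sketch like the one below, assuming the Hadoop 0.20 jars and the client config directory are on the classpath; the class name is just illustrative. On the nodes themselves the files typically sit under the conf/ directory of wherever the install script unpacked Hadoop, but the exact path depends on the script, so inspecting the scripts under /tmp, as Tom suggests further down, is the reliable way to locate them.

import org.apache.hadoop.mapred.JobConf;

public class PrintEffectiveConf {
    public static void main(String[] args) {
        // JobConf loads mapred-default.xml and mapred-site.xml in addition
        // to the core configuration resources.
        JobConf conf = new JobConf();
        // toString() lists the resource files this configuration was built from.
        System.out.println(conf);
        // Hadoop's shipped default for this property is -Xmx200m.
        System.out.println("mapred.child.java.opts = "
                + conf.get("mapred.child.java.opts"));
    }
}
)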
-----Original Message-----
From: ext praveen.pe...@nokia.com [mailto:praveen.pe...@nokia.com]
Sent: Wednesday, February 02, 2011 5:23 PM
To: whirr-user@incubator.apache.org
Subject: RE: Running Mapred jobs after launching cluster

Can anyone think of a reason why the below property is not honoured when I overwrote it, along with other properties, in post-configure? The other properties are correctly overwritten; only this one is not. I need to set the mapred task JVMs to something bigger than 200m.

Praveen

________________________________________
From: Peddi Praveen (Nokia-MS/Boston)
Sent: Tuesday, February 01, 2011 11:21 AM
To: whirr-user@incubator.apache.org
Subject: RE: Running Mapred jobs after launching cluster

Thanks Tom. Silly me, I should have thought of the property name. It works now, except for one issue: I ran the wordcount example and saw that the numbers of map and reduce tasks are as I configured in the post-configure script, but for some reason the below property in job.xml is always -Xmx200m, even though I set it to -Xmx1700m. Not sure if this property is special in some way.

mapred.child.java.opts    -Xmx200m

Praveen
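(For jobs submitted from your own client code, mapred.child.java.opts can also be set per job, independently of the cluster-side scripts; being a job-level property, the client value ends up in job.xml unless the cluster config marks the property final. A minimal sketch using the old mapred API and the -Xmx1700m value from this thread; the class name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class ChildOptsOverride {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Per-job override: this value is copied into job.xml and passed to
        // each child task JVM, unless the cluster marks the property final.
        conf.set("mapred.child.java.opts", "-Xmx1700m");
        JobConf job = new JobConf(conf);
        // ... set mapper/reducer and input/output paths, then JobClient.runJob(job)
        System.out.println(job.get("mapred.child.java.opts")); // prints -Xmx1700m
    }
}
)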
________________________________________
From: ext Tom White [tom.e.wh...@gmail.com]
Sent: Tuesday, February 01, 2011 12:13 AM
To: whirr-user@incubator.apache.org
Subject: Re: Running Mapred jobs after launching cluster

Try setting whirr.run-url-base, not run-url-base.

Tom

On Mon, Jan 31, 2011 at 5:33 PM, <praveen.pe...@nokia.com> wrote:
> I am not using CDH (for now anyway) but the default Hadoop. I even changed
> the "localhost" to the IP address and still no luck. It's likely that I am
> doing something wrong, but I'm having a hard time debugging it.
> Here are the properties I changed in /var/www/apache/hadoop/post-configure,
> but when I run the job I am not seeing these values:
>
> MAX_MAP_TASKS=16
> MAX_REDUCE_TASKS=24
> CHILD_OPTS=-Xmx1700m
>
> Here is what I see in /tmp/runscript/runscript.sh on the master node. It
> doesn't look like it used my scripts...
>
> installRunUrl || exit 1
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/util/configure-hostnames -c cloudservers
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/sun/java/install
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/apache/hadoop/install -c cloudservers
>
> Any suggestions?
> Praveen
> ________________________________________
> From: ext Tom White [tom.e.wh...@gmail.com]
> Sent: Monday, January 31, 2011 6:23 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> On Mon, Jan 31, 2011 at 3:03 PM, <praveen.pe...@nokia.com> wrote:
>> If I have to upload the files to a webserver anyway, do I still need the
>> patch? It looks like the script has these properties that I can
>> overwrite.
>
> I suggested you look at the patch (WHIRR-55) so you can see how it
> will be possible once it's committed. To try it out you need to upload
> the scripts to a webserver (since the patch changes one of them).
>
>>
>> BTW, I tried with the webserver path and I could not make it work so far.
>>
>> 1. I copied the scripts/apache folder to my /var/www folder and modified
>>    the below three properties in /var/www/apache/hadoop/post-configure.
>> 2. I changed hadoop.properties and added the following line:
>>    run-url-base=http://localhost/
>> 3. Launched the cluster and verified that the job properties are not what
>>    I changed them to. They are all defaults.
>
> This looks right to me. If you are using CDH you need to change
> cloudera/cdh/post-configure.
>
>>
>> How do I debug this issue?
>
> You can log into the instances (see the FAQ for how to do this) and
> look at the scripts that actually ran (and their output) in the /tmp
> directory.
>
> Tom
>
>> Praveen
>>
>> Launched the cluster and I didn't see the child JVM get the 2G allocation.
>>
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.wh...@gmail.com]
>> Sent: Monday, January 31, 2011 3:02 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> Hi Praveen,
>>
>> I think removing the webserver dependency (or making it optional)
>> would be a good goal, but we're not there yet. I've just created
>> https://issues.apache.org/jira/browse/WHIRR-225 as a place to discuss
>> the design and implementation.
>>
>> In the meantime you could take a look at
>> https://issues.apache.org/jira/browse/WHIRR-55, and try using the patch
>> there to override some Hadoop properties (you will still need to upload
>> the scripts to a webserver until it is committed, since it modifies
>> Hadoop's post-configure script).
>>
>> Hope this helps.
>>
>> Cheers,
>> Tom
>>
>> BTW, what are the security concerns you have? There are no credentials
>> embedded in the scripts, so it should be safe to host them publicly, no?
>>
>> On Mon, Jan 31, 2011 at 11:00 AM, <praveen.pe...@nokia.com> wrote:
>>> Hi Tom,
>>> If the Hadoop install is fixed, Whirr must be getting all the default
>>> Hadoop properties from the Hadoop install itself, correct? I sent an
>>> email about configuring Hadoop properties and you mentioned I need to
>>> put the modified scripts on a webserver that is publicly accessible. I
>>> was wondering if there is a place inside the Hadoop install I can change
>>> so that I don't need to put the scripts on a webserver (for security
>>> reasons). Do you think it is possible? If so, how? I do not mind
>>> customizing the jar file for our purposes. I want to change the
>>> following properties:
>>>
>>> mapred.reduce.tasks=24
>>> mapred.map.tasks=64
>>> mapred.child.java.opts=-Xmx2048m
>>>
>>> Thanks in advance.
>>> Praveen
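(For jobs launched from your own code, these three values can also be applied per job through the old mapred API, as a stopgap while the script-based override is sorted out. A minimal sketch, with an illustrative class name; note that the framework treats the map-task count only as a hint, while the reduce-task count is honored.

import org.apache.hadoop.mapred.JobConf;

public class PropertyOverrides {
    public static void main(String[] args) {
        JobConf job = new JobConf();
        job.setNumReduceTasks(24);                      // mapred.reduce.tasks (honored)
        job.setNumMapTasks(64);                         // mapred.map.tasks (a hint only;
                                                        // actual count follows input splits)
        job.set("mapred.child.java.opts", "-Xmx2048m"); // child task JVM heap
        // ... configure mapper/reducer and paths, then JobClient.runJob(job)
    }
}
)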
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom.e.wh...@gmail.com]
>>> Sent: Friday, January 28, 2011 4:02 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> It is fixed, and currently on 0.20.2. It will be made configurable in
>>> https://issues.apache.org/jira/browse/WHIRR-222.
>>>
>>> Cheers
>>> Tom
>>>
>>> On Fri, Jan 28, 2011 at 12:56 PM, <praveen.pe...@nokia.com> wrote:
>>>> Hi Tom,
>>>> So the Hadoop version is not going to change for a given Whirr install?
>>>> I thought Whirr was getting the Hadoop install dynamically from a URL
>>>> which is always going to have the latest Hadoop version. If that is not
>>>> the case, I guess I am fine. I just don't want to get a Hadoop version
>>>> mismatch six months after our software is released, just because a new
>>>> Hadoop version got released.
>>>>
>>>> Thanks
>>>> Praveen
>>>>
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:tom.e.wh...@gmail.com]
>>>> Sent: Friday, January 28, 2011 3:35 PM
>>>> To: whirr-user@incubator.apache.org
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>
>>>> On Fri, Jan 28, 2011 at 12:06 PM, <praveen.pe...@nokia.com> wrote:
>>>>> Thanks Tom. I think I got it working with my own driver, so I will go
>>>>> with that for now (unless it proves to be a bad option).
>>>>>
>>>>> BTW, could you tell me how to stick with one Hadoop version while
>>>>> launching the cluster? I have hadoop-0.20.2 in my classpath, but it
>>>>> looks like Whirr gets the latest Hadoop from the repository. Since the
>>>>> latest version may differ depending on the time, I would like to stick
>>>>> to one version so that a Hadoop version mismatch won't happen.
>>>>
>>>> You do need to make sure that the versions are the same. See the Hadoop
>>>> integration tests, which specify the version of Hadoop to use in their
>>>> POM.
>>>>
>>>>>
>>>>> Also, what jar files are necessary for launching a cluster using Java?
>>>>> Currently I have the CLI version of the jar file, but that's way too
>>>>> large since it has everything in it.
>>>>
>>>> You need Whirr's core and Hadoop jars, as well as their dependencies.
>>>> If you look at the POMs in the source code they will tell you the
>>>> dependencies.
>>>>
>>>> Cheers
>>>> Tom
>>>>
>>>>>
>>>>> Thanks
>>>>> Praveen
>>>>>
>>>>> -----Original Message-----
>>>>> From: ext Tom White [mailto:tom.e.wh...@gmail.com]
>>>>> Sent: Friday, January 28, 2011 2:12 PM
>>>>> To: whirr-user@incubator.apache.org
>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>
>>>>> On Fri, Jan 28, 2011 at 6:28 AM, <praveen.pe...@nokia.com> wrote:
>>>>>> Thanks Tom. Could you elaborate a little more on the second option?
>>>>>>
>>>>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>>>>
>>>>> ~/.whirr/<cluster-name>
>>>>>
>>>>>> When you said run in a new process, did you mean using the
>>>>>> command-line Whirr tool?
>>>>>
>>>>> I meant that you could launch Whirr using the CLI, or Java. Then run
>>>>> the job in another process, with HADOOP_CONF_DIR set.
>>>>>
>>>>> The MR jobs you are running I assume can be run against an arbitrary
>>>>> cluster, so you should be able to point them at a cluster started by
>>>>> Whirr.
>>>>>
>>>>> Tom
>>>>>
>>>>>>
>>>>>> I may finally end up writing my own driver for running external
>>>>>> mapred jobs so I can have more control, but I was just curious to
>>>>>> know if option #2 is better than writing my own driver.
>>>>>>
>>>>>> Praveen
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ext Tom White [mailto:t...@cloudera.com]
>>>>>> Sent: Thursday, January 27, 2011 4:01 PM
>>>>>> To: whirr-user@incubator.apache.org
>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>
>>>>>> If they implement the Tool interface then you can set configuration
>>>>>> on them. Failing that you could set HADOOP_CONF_DIR and run them in a
>>>>>> new process.
>>>>>>
>>>>>> Cheers,
>>>>>> Tom
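(To make the Tool option concrete, the pattern in the old mapred API looks roughly like the sketch below; the class name is a placeholder. A job written this way receives its Configuration from ToolRunner, so a caller can inject cluster settings programmatically, and -D key=value generic options on the command line are picked up as well.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch: a job that implements Tool accepts an externally supplied
// Configuration instead of building its own.
public class MyJob extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // injected by ToolRunner, plus -D overrides
        // ... build a JobConf from conf, set mapper/reducer and paths, submit
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // A caller could pass a Configuration populated from the Whirr
        // cluster here instead of a fresh one.
        System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
    }
}

Jobs in external libraries often follow this same pattern, which is what makes the HADOOP_CONF_DIR route workable for them too.)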
>>>>>>
>>>>>> On Thu, Jan 27, 2011 at 12:52 PM, <praveen.pe...@nokia.com> wrote:
>>>>>>> Hmm...
>>>>>>> I am running some map reduce jobs written by me, but some of them
>>>>>>> are in external libraries (e.g. Mahout) which I don't have control
>>>>>>> over. Since I can't modify the code in external libraries, is there
>>>>>>> any other way to make this work?
>>>>>>>
>>>>>>> Praveen
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ext Tom White [mailto:tom.e.wh...@gmail.com]
>>>>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>>>>> To: whirr-user@incubator.apache.org
>>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>>
>>>>>>> You don't need to add anything to the classpath, but you need to use
>>>>>>> the configuration in the org.apache.whirr.service.Cluster object to
>>>>>>> populate your Hadoop Configuration object, so that your code knows
>>>>>>> which cluster to connect to. See the getConfiguration() method in
>>>>>>> HadoopServiceController for how to do this.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Tom
>>>>>>>
>>>>>>> On Thu, Jan 27, 2011 at 12:21 PM, <praveen.pe...@nokia.com> wrote:
>>>>>>>> Hello all,
>>>>>>>> I wrote a Java class HadoopLauncher that is very similar to
>>>>>>>> HadoopServiceController. I was successfully able to launch a
>>>>>>>> cluster programmatically from my application using Whirr. Now I
>>>>>>>> want to copy files to HDFS and also run a job programmatically.
>>>>>>>>
>>>>>>>> When I copy a file to HDFS, it's copying to the local file system,
>>>>>>>> not HDFS. Here is the code I used:
>>>>>>>>
>>>>>>>> Configuration conf = new Configuration();
>>>>>>>> FileSystem hdfs = FileSystem.get(conf);
>>>>>>>> hdfs.copyFromLocalFile(false, true,
>>>>>>>>         new Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>>>>
>>>>>>>> Do I need to add anything else to the classpath so the Hadoop
>>>>>>>> libraries know that they need to talk to the dynamically launched
>>>>>>>> cluster? When running Whirr from the command line I know it uses
>>>>>>>> HADOOP_CONF_DIR to find the Hadoop config files, but when doing
>>>>>>>> the same from Java I am wondering how to solve this issue.
>>>>>>>>
>>>>>>>> Praveen
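(Following Tom's pointer, the fix is to populate the client Configuration with the launched cluster's settings before calling FileSystem.get: a bare Configuration defaults fs.default.name to file:///, which is why the copy landed on the local filesystem. A minimal sketch of the idea, with a placeholder namenode address and example paths standing in for whatever the Whirr cluster actually reports.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the default filesystem at the cluster's namenode; with the
        // stock file:/// default, FileSystem.get returns the local FS.
        conf.set("fs.default.name", "hdfs://namenode-host:8020/"); // placeholder address
        FileSystem hdfs = FileSystem.get(conf);
        hdfs.copyFromLocalFile(false, true,
                new Path("/tmp/data.txt"),     // example local path
                new Path("/user/praveen/in")); // example HDFS directory
    }
}

Alternatively, if Whirr has written a hadoop-site.xml under ~/.whirr/<cluster-name> (the directory Tom mentions above), calling conf.addResource() with that file achieves the same thing without hard-coding the address.)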