Re: Environment consideration for a research on scheduling

2011-09-24 Thread Merto Mertek
 I agree, we will go the standard route. Like you suggested, we will go step
by step to the full cluster deployment. After configuring the first node we
will use Clonezilla to replicate it and then set the machines up one by one.

On the worker nodes I was thinking of running Ubuntu Server; the namenode will
run Ubuntu Desktop. I am interested in how I should configure the environment
so that I can remotely monitor, analyse and configure the cluster. I will run
jobs from outside the local network via ssh to the namenode, but in that
situation I will not be able to access the jobtracker and tasktracker web
interfaces. So I am wondering how to analyse them, and how you configured your
environment to be as practical as possible.
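
One idea I had is to tunnel the web UI ports over the same ssh connection,
assuming the jobtracker runs on the namenode host and the default 0.20 web
ports are used (50030 for the jobtracker, 50070 for the namenode):

  ssh -L 50030:localhost:50030 -L 50070:localhost:50070 user@namenode

and then browse http://localhost:50030 locally. For the tasktracker pages on
the worker nodes a SOCKS proxy (ssh -D) would probably be needed instead, but
I am not sure this is the most practical setup.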

For monitoring the cluster I saw that Ganglia is one of the options, but at
this stage of testing the job-history files will probably be enough.
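
If I understand correctly, the history stored alongside each job's output can
be inspected afterwards with the job command, e.g. (the output path here is
just an example):

  hadoop job -history /user/mertok/wordcount-output

which should print the submit/launch/finish times and per-task details from
the _logs/history files.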

On 23 September 2011 17:09, GOEKE, MATTHEW (AG/1000) 
matthew.go...@monsanto.com wrote:

 If you are starting from scratch with no prior Hadoop install experience I
 would configure stand-alone, migrate to pseudo-distributed and then to fully
 distributed, verifying functionality at each step by doing a simple word
 count run (a sketch follows below). Also, if you don't mind using the CDH
 distribution, then SCM / their RPMs will greatly simplify both the binary
 installs and the user creation.
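
 For the word count check, something along these lines should work at each
 stage (the examples jar name and the paths will vary with your build):

   hadoop fs -mkdir input
   hadoop fs -put /etc/hosts input
   hadoop jar hadoop-examples-*.jar wordcount input output
   hadoop fs -cat 'output/part-*'

 If the counts come back sane at each of the three stages you can be fairly
 confident the install is healthy before moving on.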

 Your VM route will most likely work, but I imagine the number of hiccups
 during the migration from that to the real cluster will make it not worth
 your time.

 Matt

 -----Original Message-----
 From: Merto Mertek [mailto:masmer...@gmail.com]
 Sent: Friday, September 23, 2011 10:00 AM
 To: common-user@hadoop.apache.org
 Subject: Environment consideration for a research on scheduling

 Hi,
 in the first phase we are planning to establish a small cluster with a few
 commodity computers (each with 1 GB RAM, 200 GB disk, ..). The cluster would
 run Ubuntu Server 10.10 and a Hadoop build from the 0.20.204 branch (I had
 some issues with version 0.20.203 and missing libraries:
 http://hadoop-common.472056.n3.nabble.com/Development-enviroment-problems-eclipse-hadoop-0-20-203-td3186022.html#a3188567
 ).
 Would you suggest any other version?

 In the second phase we are planning to analyse, test and modify some of the
 Hadoop schedulers.

 Now I am interested in the best way to deploy Ubuntu and Hadoop to these few
 machines. I was thinking of configuring the system in a local VM and then
 converting it to each physical machine, but that is probably not the best
 option. If you know of any other, please share.

 Thank you!




Asking for Advice for *-site.xml

2011-09-24 Thread Aditya Budi
Hi Everyone,

I am new to Hadoop and have 4 machines, each with 4 cores and 4 GB of RAM. I
am planning the following scenarios:
Scenario 1: only one machine, utilizing 1 core
Scenario 2: only one machine, utilizing 4 cores
Scenario 3: only two machines, utilizing 1 core each
Scenario 4: all machines, utilizing all 4 cores each

Are these scenarios possible? And if so, is there any best practice for
achieving them? I fully understand that I need to modify the *-site.xml files
for each scenario.
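
From what I have read so far, I would guess the changes per scenario amount to
listing only the machines I want in conf/slaves and adjusting the task slots
per node. A sketch of what I would try for Scenario 2 (the slot split below is
only my guess):

-- mapred-site.xml fragment for Scenario 2 (one machine, 4 cores)
<!-- conf/slaves lists only the one host; splitting the 4 cores as
     3 map slots + 1 reduce slot here is purely illustrative -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>3</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>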

Below is my *-site.xml configuration.

Thank you so much for your help!

Best Regards,
Budi

-- core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>32</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>320</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>

-- hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>40</value>
  </property>
</configuration>

-- mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
  </property>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>20</value>
  </property>
  <property>
    <name>mapred.map.child.java.opts</name>
    <value>-Xmx512M</value>
  </property>
  <property>
    <name>mapred.reduce.child.java.opts</name>
    <value>-Xmx512M</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <value>100</value>
  </property>
</configuration>