How JobTracker stores tasktracker's information

2011-12-13 Thread hadoop anis
  Can anyone please tell me:
  I want to know where the JobTracker sends a task (task ID) to a
TaskTracker for scheduling,
  i.e. where it creates the (task ID, TaskTracker) pairs.



Thanks & Regards,

Mohmmadanis Moulavi

Student,
MTech (Computer Sci. & Engg.)
Walchand college of Engg. Sangli (M.S.) India


Where do i see Sysout statements after building example ?

2011-12-13 Thread ArunKumar
Hi guys!

I have a single node set up as per
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
1) I have put some sysout statements in the JobTracker and wordcount
(src/examples/org/..) code
2) ant build
3) Ran the example jar with wordcount

Where do I find the sysout statements? I have looked in the logs/
datanode, tasktracker *.out files.

Can anyone help me out ?


Arun


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Where-do-i-see-Sysout-statements-after-building-example-tp3582467p3582467.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Newbee Question: Do I must load XML files in KV store?

2011-12-13 Thread thedba

I have a constant feed of a large number of XML files that I would like to process with
MapReduce and Hive.
My questions are:
(1) Must I load the XML files into a KV store before I can use MapReduce?
(2) Must I load the XML files into a KV store before I can use Hive?

Thanks
TheDBA
-- 
View this message in context: 
http://old.nabble.com/Newbee-Question%3A-Do-I-must-load-XML-files-in-KV-store--tp32966497p32966497.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Where do i see Sysout statements after building example ?

2011-12-13 Thread Harsh J
JobTracker sysouts would go to logs/*-jobtracker*.out




Re: Where do i see Sysout statements after building example ?

2011-12-13 Thread Bejoy Ks
Adding on to Harsh's response.
If your sysouts are in mapper or reducer classes, you can get the same from
the web UI as well: http://<JT host>:50030/jobtracker.jsp . You need
to select your job and drill down to the individual task level.

Regards
Bejoy.K.S

On Tue, Dec 13, 2011 at 10:30 PM, Harsh J ha...@cloudera.com wrote:

 JobTracker sysouts would go to logs/*-jobtracker*.out





Re: Where do i see Sysout statements after building example ?

2011-12-13 Thread Mark Kerzner
For me, they go two levels deeper: under 'userlogs' in logs, then in the
directory that stores the run logs.

Here is what I see

root@ip-10-84-123-125
:/var/log/hadoop/userlogs/job_201112120200_0010/attempt_201112120200_0010_r_02_0#
ls
log.index  stderr  stdout  syslog

and there, in stdout, I see my write statements.

Mark

On Tue, Dec 13, 2011 at 11:00 AM, Harsh J ha...@cloudera.com wrote:

 JobTracker sysouts would go to logs/*-jobtracker*.out





RE: More cores Vs More Nodes ?

2011-12-13 Thread Brad Sarsfield
Praveenesh,

Your question is not naïve; in fact, optimal hardware design can ultimately be
a very difficult question to answer as to what would be "better". If you made me
pick one without much information, I'd go for more machines.  But...

It all depends; and there is no right answer :)   

More machines 
+May run your workload faster
+Will give you a higher degree of reliability protection from node / 
hardware / hard drive failure.
+More aggregate IO capabilities
- capex / opex may be higher than allocating more cores
More cores 
+May run your workload faster
+More cores may allow for more tasks to run on the same machine
+More cores/tasks may reduce network contention and increase
task-to-task data flow performance.

Notice that "May run your workload faster" appears in both lists, as it can be very
workload dependent.

My Experience:
I did a recent experiment and found that given the same number of cores (64) 
with the exact same network / machine configuration; 
A: I had 8 machines with 8 cores 
B: I had 28 machines with 2 cores (and 1x8 core head node)

B was able to outperform A by 2x using teragen and terasort. These machines 
were running in a virtualized environment; where some of the IO capabilities 
behind the scenes were being regulated to 400Mbps per node when running in the 
2 core configuration vs 1Gbps on the 8 core.  So I would expect the 
non-throttled scenario to work even better. 

~Brad


-Original Message-
From: praveenesh kumar [mailto:praveen...@gmail.com] 
Sent: Monday, December 12, 2011 8:51 PM
To: common-user@hadoop.apache.org
Subject: More cores Vs More Nodes ?

Hey Guys,

So I have a very naive question in my mind regarding Hadoop cluster nodes.

More cores or more nodes: shall I spend money on going from 2-core to 4-core machines,
or on buying more nodes with fewer cores, e.g. 2 machines of 2 cores each?

Thanks,
Praveenesh



Re: More cores Vs More Nodes ?

2011-12-13 Thread Prashant Kommireddi
Hi Brad, how many tasktrackers did you have on each node in both cases?

Thanks,
Prashant

Sent from my iPhone




RE: More cores Vs More Nodes ?

2011-12-13 Thread Tom Deutsch
It also helps to know the profile of your job in how you spec the 
machines. So in addition to Brad's response you should consider if you 
think your jobs will be more storage or compute oriented. 


Tom Deutsch
Program Director
Information Management
Big Data Technologies
IBM
3565 Harbor Blvd
Costa Mesa, CA 92626-1420
tdeut...@us.ibm.com








Re: More cores Vs More Nodes ?

2011-12-13 Thread real great..
More cores might help in Hadoop environments, as there would be more data
locality.
Your thoughts?





-- 
Regards,
R.V.


Re: More cores Vs More Nodes ?

2011-12-13 Thread Alexander Pivovarov
More nodes means more IO on reads during the map step.
If you use combiners, you may need to send only a small amount of data over
the network to the reducers.

Alexander
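Alexander's point about combiners can be sketched in plain Java. The snippet below is a simplified, framework-free model (not Hadoop's actual API): the combine step collapses one mapper's (word, 1) records into per-word partial sums before anything crosses the network.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {

    // Models the combine step: collapse one mapper's (word, 1) records
    // into (word, partialCount) pairs before they are shuffled.
    static Map<String, Integer> combine(List<String> mapOutputWords) {
        Map<String, Integer> partial = new HashMap<>();
        for (String word : mapOutputWords) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("the", "cat", "the", "dog", "the");
        Map<String, Integer> combined = combine(mapOutput);
        // 5 raw records shrink to 3 partial sums headed for the reducers.
        System.out.println(mapOutput.size() + " records -> "
            + combined.size() + " partial sums");
    }
}
```

In a real job the same effect comes from registering an aggregating reducer as the combiner (e.g. via `job.setCombinerClass(...)` in the MapReduce API), which is valid whenever the reduce operation is associative and commutative.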





Re: More cores Vs More Nodes ?

2011-12-13 Thread bharath vissapragada
Hey there,

I agree with Tom's response. One can decide based on the type of jobs
you run. I have been working on Hive, and I realized that increasing the number of
cores gives a very good performance boost, because joins and similar operations are
compute oriented and consume a lot of CPU on the reduce side. This may not be
the case with other applications (like HBase?).

Thanks






-- 
Regards,
Bharath .V
w:http://researchweb.iiit.ac.in/~bharath.v


Re: ArrayWritable usage

2011-12-13 Thread Brock Noland
Hi,

ArrayWritable is a touch hard to use. Say you have an array of
IntWritable[]. The get() method of ArrayWritable, after
serialization/deserialization, does in fact return an array of type
Writable[]. As such you cannot cast it directly to IntWritable[]. Individual
elements are of type IntWritable and can be cast as such.

Will not work:

IntWritable[] array = (IntWritable[]) writable.get();

Will work:

for(Writable element : writable.get()) {
  IntWritable intWritable = (IntWritable)element;
}
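Putting the two snippets together, a common pattern is to subclass ArrayWritable and add a typed accessor that performs the element-by-element cast. This sketch assumes Hadoop's `org.apache.hadoop.io` classes are on the classpath; `toIntArray` is a helper name chosen here for illustration, not part of the Hadoop API.

```java
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

// A concrete subclass is needed for MapReduce values anyway, so the
// framework knows which element type to instantiate on deserialization.
public class IntArrayWritable extends ArrayWritable {

    public IntArrayWritable() {
        super(IntWritable.class);
    }

    // Helper (not part of Hadoop): copy the Writable[] into a typed array,
    // casting element by element, since casting the array itself fails.
    public IntWritable[] toIntArray() {
        Writable[] raw = get();
        IntWritable[] typed = new IntWritable[raw.length];
        for (int i = 0; i < raw.length; i++) {
            typed[i] = (IntWritable) raw[i];
        }
        return typed;
    }
}
```

With this in place, an instance can be filled via `set(new IntWritable[]{...})` and read back through `toIntArray()` instead of the raw `get()`.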

Brock

On Sat, Dec 10, 2011 at 3:58 PM, zanurag zanu...@live.com wrote:

 Hi Dhruv,
 Is this working well for you? Are you able to do IntWritable[] abc =
 array.get();

 I am trying similar thing for IntTwoDArrayWritable.
 The array.set works but array.get returns Writable[][] and I am not able
 to cast it to IntWritable[][].

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/ArrayWritable-usage-tp3138520p3576386.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



remote hadoop in tomcat with jaas security

2011-12-13 Thread Avni, Itamar
Hi,

Our application runs in Tomcat 5.5, Java 6.17, with JAAS.
We give our own implementation to LoginModule and we start Tomcat with 
-Djava.security.auth.login.config.
We use Hadoop 0.20.203.0.

We want to execute Hadoop jobs, or Hadoop FileSystem methods from within our 
application, remotely.

Problem: once we use/initialize Hadoop's configuration (either by calling 
org.apache.hadoop.fs.FileSystem.get(Configuration) or by calling 
org.apache.hadoop.mapred.JobClient.runJob(JobConf)), the implementation of 
LoginModule changes from ours to 
org.apache.hadoop.security.UserGroupInformation.HadoopLoginModule, and we 
cannot login to our application anymore!

However, there is no problem with the remote execution of the jobs.


I saw this thread ("Use JAAS LoginContext for our
login": https://issues.apache.org/jira/browse/HADOOP-6299).
Does it fix the problem?
It looks like it will still create a new LoginModule, won't it?

Thanks for any help
Itamar






Re: HDFS Backup nodes

2011-12-13 Thread Suresh Srinivas
Srivas,

As you may know already, NFS is just being used in the first prototype for
HA.

Two options for editlog store are:
1. Using BookKeeper. Work has already been completed on trunk towards this. This
will replace the need for NFS to store the editlogs and is highly available.
This solution will also be used for HA.
2. We also have a short-term goal of enabling editlogs to be stored in HDFS itself.
The work is in progress.

Regards,
Suresh



 -- Forwarded message --
 From: M. C. Srivas mcsri...@gmail.com
 Date: Sun, Dec 11, 2011 at 10:47 PM
 Subject: Re: HDFS Backup nodes
 To: common-user@hadoop.apache.org


 You are out of luck if you don't want to use NFS, and yet want redundancy
 for the NN.  Even the new NN HA work being done by the community will
 require NFS ... and the NFS itself needs to be HA.

 But if you use a Netapp, then the likelihood of the Netapp crashing is
 lower than the likelihood of a garbage-collection-of-death happening in the
 NN.

 [ disclaimer:  I don't work for Netapp, I work for MapR ]


 On Wed, Dec 7, 2011 at 4:30 PM, randy randy...@comcast.net wrote:

  Thanks Joey. We've had enough problems with nfs (mainly under very high
  load) that we thought it might be riskier to use it for the NN.
 
  randy
 
 
  On 12/07/2011 06:46 PM, Joey Echeverria wrote:
 
  Hey Rand,
 
  It will mark that storage directory as failed and ignore it from then
  on. In order to do this correctly, you need a couple of options
  enabled on the NFS mount to make sure that it doesn't retry
   infinitely. I usually run with the tcp,soft,intr,timeo=10,retrans=10
  options set.
 
  -Joey
 
   On Wed, Dec 7, 2011 at 12:37 PM, randy...@comcast.net wrote:
 
  What happens then if the nfs server fails or isn't reachable? Does hdfs
  lock up? Does it gracefully ignore the nfs copy?
 
  Thanks,
  randy
 
  - Original Message -
   From: Joey Echeverria j...@cloudera.com
  To: common-user@hadoop.apache.org
  Sent: Wednesday, December 7, 2011 6:07:58 AM
  Subject: Re: HDFS Backup nodes
 
  You should also configure the Namenode to use an NFS mount for one of
   its storage directories. That will give the most up-to-date backup of
  the metadata in case of total node failure.
 
  -Joey
 
   On Wed, Dec 7, 2011 at 3:17 AM, praveenesh kumar praveen...@gmail.com
    wrote:
 
   This means we are still relying on the Secondary NameNode ideology for the
   Namenode's backup.
   Is OS-mirroring of the Namenode a good alternative to keep it alive all the
   time?
 
  Thanks,
  Praveenesh
 
  On Wed, Dec 7, 2011 at 1:35 PM, Uma Maheswara Rao G
   mahesw...@huawei.com wrote:

    AFAIK the backup node was introduced from version 0.21 onwards.
   ________________________________________
  From: praveenesh kumar [praveen...@gmail.com]
  Sent: Wednesday, December 07, 2011 12:40 PM
  To: common-user@hadoop.apache.org
  Subject: HDFS Backup nodes
 
   Does hadoop 0.20.205 support configuring HDFS backup nodes?
 
  Thanks,
  Praveenesh
 
 
 
 
  --
  Joseph Echeverria
  Cloudera, Inc.
  443.305.9434
 
 
 
 
 
 




Re: How JobTracker stores tasktracker's information

2011-12-13 Thread Arun C Murthy
Moving to mapreduce-user@, bcc common-user@. Please use project specific lists. 

Take a look at JobTracker.heartbeat -> *Scheduler.assignTasks.

After the scheduler 'assigns' tasks, the JT sends the corresponding 
'LaunchTaskAction' to the TaskTracker.
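As a rough illustration of this flow, and of where the (task ID, TaskTracker) pairing actually happens, here is a deliberately simplified model in plain Java. All class and method names below are illustrative stand-ins, not Hadoop's real internal API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class HeartbeatSketch {

    // Stand-in for the action the JT sends back in the heartbeat response.
    static class LaunchTaskAction {
        final String taskId;
        LaunchTaskAction(String taskId) { this.taskId = taskId; }
    }

    // Stand-in for the pluggable scheduler; assignTasks is where the
    // (task id, tasktracker) pairing is decided.
    interface TaskScheduler {
        List<String> assignTasks(String trackerName, int freeSlots);
    }

    // A trivial FIFO policy, in the spirit of a default job-queue scheduler.
    static class FifoScheduler implements TaskScheduler {
        final Deque<String> pending = new ArrayDeque<>();
        public List<String> assignTasks(String trackerName, int freeSlots) {
            List<String> assigned = new ArrayList<>();
            while (assigned.size() < freeSlots && !pending.isEmpty()) {
                assigned.add(pending.poll());
            }
            return assigned;
        }
    }

    // Models the heartbeat handler: ask the scheduler for assignments, then
    // wrap each one in a LaunchTaskAction for the responding tracker.
    static List<LaunchTaskAction> heartbeat(TaskScheduler scheduler,
                                            String trackerName, int freeSlots) {
        List<LaunchTaskAction> actions = new ArrayList<>();
        for (String taskId : scheduler.assignTasks(trackerName, freeSlots)) {
            actions.add(new LaunchTaskAction(taskId));
        }
        return actions;
    }

    public static void main(String[] args) {
        FifoScheduler scheduler = new FifoScheduler();
        scheduler.pending.addAll(Arrays.asList("attempt_m_000000", "attempt_m_000001"));
        List<LaunchTaskAction> actions = heartbeat(scheduler, "tracker_host1", 1);
        System.out.println(actions.size() + " action(s), first task: "
            + actions.get(0).taskId);
    }
}
```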

hth,
Arun




RE: More cores Vs More Nodes ?

2011-12-13 Thread Brad Sarsfield
Hi Prashant,

In each case I had a single tasktracker per node. I oversubscribed the total 
tasks per tasktracker/node by 1.5 x # of cores.

So for the 64 core allocation comparison:
   In A (8 cores): each machine had a single tasktracker with 8 map / 4
reduce slots, for 12 task slots total per machine x 8 machines (including head
node).
   In B (2 cores): each machine had a single tasktracker with 2 map
/ 1 reduce slots, for 3 slots total per machine x 29 machines (including head
node, which was running 8 cores).

The experiment was done in a cloud-hosted environment running a set of VMs.

~Brad
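The 1.5x oversubscription rule above can be checked with a few lines of arithmetic. This is a plain-Java sketch of the sizing just described; the map-slots-equal-cores split is simply read off the two configurations.

```java
public class SlotSizing {

    // The sizing rule above: total task slots per node = 1.5 x cores,
    // with map slots = cores and the remainder as reduce slots.
    static int totalSlots(int cores)  { return cores * 3 / 2; }
    static int mapSlots(int cores)    { return cores; }
    static int reduceSlots(int cores) { return totalSlots(cores) - mapSlots(cores); }

    public static void main(String[] args) {
        for (int cores : new int[] { 8, 2 }) {
            System.out.println(cores + " cores -> " + mapSlots(cores) + " map + "
                + reduceSlots(cores) + " reduce = " + totalSlots(cores) + " slots/node");
        }
    }
}
```

This reproduces both configurations: 8 cores gives 8 map + 4 reduce = 12 slots per node, and 2 cores gives 2 map + 1 reduce = 3 slots per node.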





Re: HDFS Backup nodes

2011-12-13 Thread Todd Lipcon
On Sun, Dec 11, 2011 at 10:47 PM, M. C. Srivas mcsri...@gmail.com wrote:
 But if you use a Netapp, then the likelihood of the Netapp crashing is
 lower than the likelihood of a garbage-collection-of-death happening in the
 NN.

This is pure FUD.

I've never seen a garbage collection of death ever in any NN with
smaller than a 40GB heap, and only a small handful of times on larger
heaps. So, unless you're running a 4000 node cluster, you shouldn't be
concerned with this. And the existence of many 4000 node clusters
running fine on HDFS indicates that a properly tuned NN does just
fine.

[Disclaimer: I don't spread FUD regardless of vendor affiliation.]

-Todd











-- 
Todd Lipcon
Software Engineer, Cloudera


Re: More cores Vs More Nodes ?

2011-12-13 Thread He Chen
Hi Brad

This is a really interesting experiment. I am curious why you did not use 32
nodes with 2 cores each; that would make the number of CPU cores in the two
groups equal.

Chen

On Tue, Dec 13, 2011 at 7:15 PM, Brad Sarsfield b...@bing.com wrote:

 Hi Prashant,

 In each case I had a single tasktracker per node. I oversubscribed the
 total tasks per tasktracker/node by 1.5 x # of cores.

 So for the 64 core allocation comparison.
In A: 8 cores; Each machine had a single tasktracker with 8 maps /
 4 reduce slots for 12 task slots total per machine x 8 machines (including
 head node)
    In B: 2 cores; Each machine had a single tasktracker with 2
 maps / 1 reduce slots for 3 slots total per machine x 29 machines
 (including head node which was running 8 cores)

 The experiment was done in a cloud hosted environment running set of VMs.

 ~Brad
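For reference, the per-tasktracker slot counts Brad describes for configuration A (8 map / 4 reduce slots per node) would correspond to a mapred-site.xml fragment roughly like this on 0.20.x (the values are taken from his description; treat this as an illustrative sketch, not his actual config):

```xml
<!-- mapred-site.xml fragment: per-tasktracker task slots (0.20.x property names) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>    <!-- 8 map slots per node, as in configuration A -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>    <!-- 4 reduce slots per node -->
</property>
```
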

 -Original Message-
 From: Prashant Kommireddi [mailto:prash1...@gmail.com]
 Sent: Tuesday, December 13, 2011 9:46 AM
 To: common-user@hadoop.apache.org
 Subject: Re: More cores Vs More Nodes ?

 Hi Brad, how many taskstrackers did you have on each node in both cases?

 Thanks,
 Prashant

 Sent from my iPhone

 On Dec 13, 2011, at 9:42 AM, Brad Sarsfield b...@bing.com wrote:

  Praveenesh,
 
  Your question is not naïve; in fact, what the optimal hardware design
 would be can ultimately be a very difficult question to answer. If you made
 me pick one without much information I'd go for more machines.  But...
 
  It all depends; and there is no right answer :)
 
  More machines
 +May run your workload faster
 +Will give you a higher degree of reliability protection from node /
 hardware / hard drive failure.
 +More aggregate IO capabilities
 -Capex / opex may be higher than allocating more cores

  More cores
 +May run your workload faster
 +More cores may allow for more tasks to run on the same machine
 +More cores/tasks may reduce network contention, increasing
 task-to-task data flow performance.
 
  Notice "May run your workload faster" is in both, as it can be very
 workload dependent.
 
  My Experience:
  I did a recent experiment and found that given the same number of cores
 (64) with the exact same network / machine configuration;
 A: I had 8 machines with 8 cores
 B: I had 28 machines with 2 cores (and 1x8 core head node)
 
  B was able to outperform A by 2x using teragen and terasort. These
 machines were running in a virtualized environment, where some of the IO
 capabilities behind the scenes were being regulated to 400Mbps per node
 when running in the 2 core configuration vs 1Gbps on the 8 core.  So I
 would expect the non-throttled scenario to work even better.
 
  ~Brad
 
 
  -Original Message-
  From: praveenesh kumar [mailto:praveen...@gmail.com]
  Sent: Monday, December 12, 2011 8:51 PM
  To: common-user@hadoop.apache.org
  Subject: More cores Vs More Nodes ?
 
  Hey Guys,
 
  So I have a very naive question in my mind regarding Hadoop cluster
 nodes ?
 
  More cores or more nodes - shall I spend money on going from 2-core to
 4-core machines, or spend it on buying more nodes with fewer cores, e.g. two
 2-core machines instead?
 
  Thanks,
  Praveenesh
 




Re: HDFS Backup nodes

2011-12-13 Thread M. C. Srivas
Suresh,

As of today, there is no option except to use NFS.  And as you yourself
mention, the first HA prototype when it comes out will require NFS.

(a) I wasn't aware that Bookkeeper had progressed that far. I wonder
whether it would be able to keep up with the data rates that are required in
order to hold the NN log without falling behind.

(b) I do know Karthik Ranga at FB just started a design to put the NN data
in HDFS itself, but that is in very preliminary design stages with no real
code there.

The problem is that the HA code written with NFS in mind is very different
from the HA code written with HDFS in mind, which are both quite different
from the code that is written with Bookkeeper in mind. Essentially the
three options will form three different implementations, since the failure
modes of each of the back-ends are different. Am I totally off base?

thanks,
Srivas.




On Tue, Dec 13, 2011 at 11:00 AM, Suresh Srinivas sur...@hortonworks.comwrote:

 Srivas,

 As you may know already, NFS is just being used in the first prototype for
 HA.

 Two options for editlog store are:
 1. Using BookKeeper. Work has already completed on trunk towards this. This
 will replace the need for NFS to store the editlogs and is highly available.
 This solution will also be used for HA.
 2. We have a short term goal also to enable editlogs going to HDFS itself.
 The work is in progress.

 Regards,
 Suresh


 
  -- Forwarded message --
  From: M. C. Srivas mcsri...@gmail.com
  Date: Sun, Dec 11, 2011 at 10:47 PM
  Subject: Re: HDFS Backup nodes
  To: common-user@hadoop.apache.org
 
 
  You are out of luck if you don't want to use NFS, and yet want redundancy
  for the NN.  Even the new NN HA work being done by the community will
  require NFS ... and the NFS itself needs to be HA.
 
  But if you use a Netapp, then the likelihood of the Netapp crashing is
  lower than the likelihood of a garbage-collection-of-death happening in
 the
  NN.
 
  [ disclaimer:  I don't work for Netapp, I work for MapR ]
 
 
  On Wed, Dec 7, 2011 at 4:30 PM, randy randy...@comcast.net wrote:
 
   Thanks Joey. We've had enough problems with nfs (mainly under very high
   load) that we thought it might be riskier to use it for the NN.
  
   randy
  
  
   On 12/07/2011 06:46 PM, Joey Echeverria wrote:
  
   Hey Rand,
  
   It will mark that storage directory as failed and ignore it from then
   on. In order to do this correctly, you need a couple of options
   enabled on the NFS mount to make sure that it doesn't retry
    infinitely. I usually run with the tcp,soft,intr,timeo=10,retrans=10
   options set.
  
   -Joey
  
   On Wed, Dec 7, 2011 at 12:37 PM,randy...@comcast.net  wrote:
  
   What happens then if the nfs server fails or isn't reachable? Does
 hdfs
   lock up? Does it gracefully ignore the nfs copy?
  
   Thanks,
   randy
  
   - Original Message -
   From: Joey Echeverriaj...@cloudera.com
   To: common-user@hadoop.apache.org
   Sent: Wednesday, December 7, 2011 6:07:58 AM
   Subject: Re: HDFS Backup nodes
  
   You should also configure the Namenode to use an NFS mount for one of
    its storage directories. That will give you the most up-to-date backup of
   the metadata in case of total node failure.
  
   -Joey
  
   On Wed, Dec 7, 2011 at 3:17 AM, praveenesh kumar
 praveen...@gmail.com
wrote:
  
    This means we are still relying on the Secondary NameNode ideology for
    the Namenode's backup.
    Is OS-mirroring of the Namenode a good alternative to keep it alive all
    the time ?
  
   Thanks,
   Praveenesh
  
   On Wed, Dec 7, 2011 at 1:35 PM, Uma Maheswara Rao G
   mahesw...@huawei.comwrote:
  
AFAIK backup node introduced in 0.21 version onwards.
    ________________________________
   From: praveenesh kumar [praveen...@gmail.com]
   Sent: Wednesday, December 07, 2011 12:40 PM
   To: common-user@hadoop.apache.org
   Subject: HDFS Backup nodes
  
    Does hadoop 0.20.205 support configuring HDFS backup nodes ?
  
   Thanks,
   Praveenesh
  
  
  
  
   --
   Joseph Echeverria
   Cloudera, Inc.
   443.305.9434
  
  
  
  
  
  
 
 



Re: HDFS Backup nodes

2011-12-13 Thread Todd Lipcon
On Tue, Dec 13, 2011 at 10:42 PM, M. C. Srivas mcsri...@gmail.com wrote:
 Any simple file meta-data test will cause the NN to spiral to death with
 infinite GC.  For example, try create many many files. Or even simple
 stat a bunch of file continuously.

Sure. If I run dd if=/dev/zero of=foo my laptop will spiral to
death also. I think this is what you're referring to -- continuously
write files until it is out of RAM.

This is a well understood design choice of HDFS. It is not designed as
general purpose storage for small files, and if you run tests against
it assuming it is, you'll get bad results. I agree there.


 The real FUD going on is refusing to acknowledge that there is indeed a
 real problem.

Yes, if you use HDFS for workloads for which it was never designed,
you'll have a problem. If you stick to commonly accepted best
practices I think you'll find the same thing that hundreds of other
companies have found: HDFS is stable and reliable and has no such GC
of death problems when used as intended.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: HDFS Backup nodes

2011-12-13 Thread Konstantin Boudnik
On Tue, Dec 13, 2011 at 11:00PM, M. C. Srivas wrote:
 Suresh,
 
 As of today, there is no option except to use NFS.  And as you yourself
 mention, the first HA prototype when it comes out will require NFS.

Well, in the interest of full disclosure, NFS is just one of the options,
not the only one. Any auxiliary storage will do. Distributed in-memory
redundant storage for sub-second failover? Sure, GigaSpaces has done this for
years using the very mature JINI.

NFS just happens to be readily available in any data center and doesn't
require much extra investment on top of what exists. NFS comes with its
own set of problems, of course. First and foremost is No-File-Security, which
requires use of something like Kerberos for third-party user management. And
when paired with something like LinuxTaskController it can produce some very
interesting effects.

Cos
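For concreteness, the soft NFS mount Joey recommends earlier in the thread might look like the following fstab entry (server name and paths are hypothetical; the mount options are the ones he quotes):

```
# /etc/fstab entry for an NFS-backed NN storage directory (hypothetical host/paths)
nfsserver:/export/nn  /mnt/nn-backup  nfs  tcp,soft,intr,timeo=10,retrans=10  0 0
```

The mount point would then be listed as an additional directory in dfs.name.dir so the NN writes its image and edits to both local disk and NFS.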

 (a) I wasn't aware that Bookkeeper had progressed that far. I wonder
 whether it would be able to keep up with the data rates that is required in
 order to hold the NN log without falling behind.
 
 (b) I do know Karthik Ranga at FB just started a design to put the NN data
 in HDFS itself, but that is in very preliminary design stages with no real
 code there.
 
 The problem is that the HA code written with NFS in mind is very different
 from the HA code written with HDFS in mind, which are both quite different
 from the code that is written with Bookkeeper in mind. Essentially the
 three options will form three different implementations, since the failure
 modes of each of the back-ends are different. Am I totally off base?
 
 thanks,
 Srivas.
 
 
 
 


Re: HDFS Backup nodes

2011-12-13 Thread Todd Lipcon
On Tue, Dec 13, 2011 at 11:00 PM, M. C. Srivas mcsri...@gmail.com wrote:
 (a) I wasn't aware that Bookkeeper had progressed that far. I wonder
 whether it would be able to keep up with the data rates that is required in
 order to hold the NN log without falling behind.

It's a good question - but one which has data relatively available.
Reading from Flavio Junqueira's slides from the Hadoop In China
conference a few weeks ago, he can maintain ~50k TPS with 20ms
latency, with 128 byte transactions. Given that HDFS does batch
multiple transactions per commit (standard group commit techniques) we
might imagine 4KB transactions where it looks like about 5K TPS,
equating to around 20MB/sec throughput. These transaction rates should
be plenty for the edit logging use case in my experience.
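Todd's back-of-envelope figures can be checked directly (a trivial sketch; the 5K TPS and 4KB numbers are the ones quoted above):

```java
// Sanity check of the edit-log throughput estimate quoted above.
public class EditLogThroughput {
    public static void main(String[] args) {
        int txPerSec = 5_000;          // ~5K TPS with 4 KB group-committed transactions
        int txSizeBytes = 4 * 1024;    // 4 KB per batched transaction
        double mbPerSec = (double) txPerSec * txSizeBytes / 1_000_000;
        System.out.println(mbPerSec + " MB/sec");   // prints "20.48 MB/sec"
    }
}
```

which agrees with the "around 20MB/sec" figure in the post.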


 (b) I do know Karthik Ranga at FB just started a design to put the NN data
 in HDFS itself, but that is in very preliminary design stages with no real
 code there.

Agreed. But it's not particularly complex either... things can move
from preliminary design to working code in short timelines.


 The problem is that the HA code written with NFS in mind is very different
 from the HA code written with HDFS in mind, which are both quite different
 from the code that is written with Bookkeeper in mind. Essentially the
 three options will form three different implementations, since the failure
 modes of each of the back-ends are different. Am I totally off base?

Actually since the beginning of the HA project we have been keeping in
mind that NFS is only a step along the way. The shared edits storage
only has to have the following very basic operations:
- write and append to files (log segments)
- read from closed files
- fence another writer (which can also be implemented with STONITH)

As I understand it, BK supports all of the above and in fact the BK
team has a working prototype of journal storage in BK. The interface
is already made pluggable as of last month. So this is not far-off
brainstorming but rather a very real implementation that's coming very
soon to stable releases.
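The three storage operations Todd lists can be made concrete with a toy sketch. The names here are hypothetical and for illustration only; the actual pluggable abstraction he refers to on trunk (the JournalManager interface) has different signatures:

```java
// Toy sketch of the shared-edits storage contract described above.
// Names are illustrative; HDFS's real pluggable interface differs in detail.
import java.util.ArrayList;
import java.util.List;

public class SharedEditsSketch {
    /** The minimal contract: append to segments, read closed ones, fence a writer. */
    interface SharedEditsStorage {
        void append(byte[] op);        // write/append to the current log segment
        List<byte[]> readClosed();     // read a finalized (closed) segment for replay
        void fence(String newWriter);  // fence the previous writer (or STONITH it)
    }

    /** Toy in-memory stand-in, just to make the contract concrete. */
    static class InMemoryStorage implements SharedEditsStorage {
        private final List<byte[]> segment = new ArrayList<>();
        private String activeWriter = "nn1";

        public void append(byte[] op) { segment.add(op); }
        public List<byte[]> readClosed() { return new ArrayList<>(segment); }
        public void fence(String newWriter) { activeWriter = newWriter; }
        String activeWriter() { return activeWriter; }
    }

    public static void main(String[] args) {
        InMemoryStorage storage = new InMemoryStorage();
        storage.append("OP_MKDIR /foo".getBytes());  // active NN logs an edit
        storage.fence("nn2");                        // standby fences it and takes over
        // The new active NN replays the closed segment it can now safely read.
        System.out.println(storage.readClosed().size() + " op(s), writer=" + storage.activeWriter());
    }
}
```
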

-Todd
