RE: Why most of the free reduce slots are NOT used for my Hadoop Jobs? Thanks.

2012-03-12 Thread WangRamon

Hi Satish

I'm not sure about this, but each node has two quad-core CPUs with Hyper-Threading, so 14 map and 14 reduce slots per node should be OK, right? I also find that each task finishes very quickly, in no more than 20 seconds, so is this the root cause of the problem?

Thanks
Ramon
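(A quick back-of-the-envelope check on the cores-vs-slots guess: two quad-core CPUs with Hyper-Threading expose 2 x 4 x 2 = 16 logical cores per node, while 14 map + 14 reduce = 28 slots, so the slots do oversubscribe the cores; whether that explains the idle reduce slots is a separate question.)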
 From: satish.se...@hcl.com
To: mapreduce-user@hadoop.apache.org; ramon_w...@hotmail.com
Date: Mon, 12 Mar 2012 09:00:49 +0530
Subject: RE: Why most of the free reduce slots are NOT used for my Hadoop Jobs? Thanks.

Just guessing whether this has something to do with the number of cores/CPUs. I noticed the same thing myself for the number of map tasks spawned: how many are spawned depends on the number of input files, and how many run concurrently depends on the number of cores/CPUs.

Thanks


From: WangRamon [ramon_w...@hotmail.com]

Sent: Saturday, March 10, 2012 5:35 PM

To: mapreduce-user@hadoop.apache.org

Subject: RE: Why most of the free reduce slots are NOT used for my Hadoop Jobs? Thanks.

Joey, here is the information:

 

Cluster Summary (Heap Size is 481.88 MB/1.74 GB)

Maps  Reduces  Total Submissions  Nodes  Map Task Capacity  Reduce Task Capacity  Avg. Tasks/Node  Blacklisted Nodes
0     6        11                 3      42                 42                    28.00            0

Cheers

Ramon

Subject: Re: Why most of the free reduce slots are NOT used for my Hadoop Jobs? Thanks.

From: j...@cloudera.com

Date: Sat, 10 Mar 2012 07:00:26 -0500

To: mapreduce-user@hadoop.apache.org



What does the jobtracker web page say is the total reduce capacity?



-Joey
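As an aside, the same numbers can be read programmatically from the JobTracker. A minimal sketch against the old mapred API, assuming the cluster configuration is on the classpath (untested):

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ReduceCapacity {
  public static void main(String[] args) throws Exception {
    // Connect to the JobTracker named in mapred-site.xml and dump reduce slot usage
    JobClient client = new JobClient(new JobConf());
    ClusterStatus status = client.getClusterStatus();
    System.out.println("Reduce task capacity: " + status.getMaxReduceTasks());
    System.out.println("Running reduce tasks: " + status.getReduceTasks());
  }
}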

On Mar 10, 2012, at 5:39, WangRamon ramon_w...@hotmail.com wrote:

Hi All

 

I'm using Hadoop-0.20-append. The cluster contains 3 nodes, and each node has 14 map and 14 reduce slots. Here is the configuration:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>14</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>14</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>73</value>
</property>

When I submit 5 Jobs simultaneously (the input data for each job is not big for this test, about 2~5 MB in size), I assume the Jobs will use the slots as much as possible. Each Job did create 73 Reduce Tasks as configured above, so there are 5 * 73 Reduce Tasks in total, but most of them are in the pending state and only about 12 of them are running, which is very small compared to the total of 42 reduce slots in the 3-node cluster.


 

What's interesting is that it is always about 12 of them running; I tried a few times.

 

So I thought it might be because of the scheduler. I changed it to the Fair Scheduler and created 3 pools; the configuration is as below:

 

<?xml version="1.0"?>
<allocations>
  <pool name="pool-a">
    <minMaps>14</minMaps>
    <minReduces>14</minReduces>
    <weight>1.0</weight>
  </pool>
  <pool name="pool-b">
    <minMaps>14</minMaps>
    <minReduces>14</minReduces>
    <weight>1.0</weight>
  </pool>
  <pool name="pool-c">
    <minMaps>14</minMaps>
    <minReduces>14</minReduces>
    <weight>1.0</weight>
  </pool>
</allocations>
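For completeness, a minimal sketch of the mapred-site.xml entries normally needed for this allocation file to take effect with the 0.20 Fair Scheduler (the fair scheduler contrib jar must also be on the JobTracker classpath; the file path below is just a placeholder):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/path/to/conf/fair-scheduler.xml</value>
</property>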

 

Then I submitted the 5 Jobs simultaneously again, assigning them to these pools at random. I can see the jobs were assigned to different pools, but it's still the same problem: only about 12 of the reduce tasks across the different pools are running. Here is the output I copied from the Fair Scheduler monitoring GUI:

 

pool-a 2 14 14 0 9

pool-b 0 14 14 0 0 

pool-c 2 14 14 0 3 

 

pool-a and pool-c have a total of 12 reduce tasks running, but I still have at least about 11 reduce slots available in my cluster.

 

So can anyone please give me some suggestions as to why NOT all my REDUCE SLOTS are being used? Thanks in advance.


 

Cheers 

Ramon



Re: Mapper Record Spillage

2012-03-12 Thread George Datskos
Actually, if you set {io.sort.mb} to 2048, your map tasks will always fail. The maximum {io.sort.mb} is hard-coded to 2047, which means that if you think you've set 2048 and your tasks aren't failing, then you probably haven't actually changed io.sort.mb. Double-check which configuration settings the JobTracker actually saw by looking at:

$ hadoop fs -cat hdfs://JOB_OUTPUT_DIR/_logs/history/*.xml | grep io.sort.mb




George


On 2012/03/11 22:38, Harsh J wrote:

Hans,

I don't think io.sort.mb can support a whole 2048 value (it builds one array of that size, and the JVM may not allow that). Can you lower it to 2000 ± 100 and try again?
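(A quick arithmetic note on why the cap is 2047: the sort buffer is allocated as a single Java byte array of io.sort.mb * 1024 * 1024 bytes, and 2048 MB is 2,147,483,648 bytes, one more than Integer.MAX_VALUE (2,147,483,647), the largest possible array size; 2047 MB still fits.)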

On Sun, Mar 11, 2012 at 1:36 PM, Hans Uhlig huh...@uhlisys.com wrote:

If that is the case, then these two lines should provide more than enough memory on a virtually unused cluster:

job.getConfiguration().setInt("io.sort.mb", 2048);
job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");

A conversion from 1 GB of CSV text to binary primitives should then fit easily, but Java still throws a heap error even when there is 25 GB of memory free.

On Sat, Mar 10, 2012 at 11:50 PM, Harsh J ha...@cloudera.com wrote:

Hans,

You can change memory requirements for tasks of a single job, but not
of a single task inside that job.

This is briefly how the 0.20 framework (by default) works: the TT has notions only of slots, and carries a maximum _number_ of simultaneous slots it may run. It does not know what each task, occupying one slot, will demand in resource terms. Your job then supplies a number of map tasks and the amount of memory required per map task, as configuration. The TTs then merely start the task JVMs with the provided heap configuration.

On Sun, Mar 11, 2012 at 11:24 AM, Hans Uhlig huh...@uhlisys.com wrote:

That was a typo in my email, not in the configuration. Is the memory reserved for the tasks when the tasktracker starts? You seem to be suggesting that I need to set the memory to be the same for all map tasks. Is there no way to override it for a single map task?


On Sat, Mar 10, 2012 at 8:41 PM, Harsh J ha...@cloudera.com wrote:

Hans,

It's possible you have a typo issue: mapred.map.child.jvm.opts - such a property does not exist. Perhaps you wanted mapred.map.child.java.opts?

Additionally, the computation you need to do is: (# of map slots on a TT * per-map-task heap requirement) should stay below (Total RAM - 2/3 GB). With your 4 GB requirement, I guess you can support a max of 6-7 slots per machine (i.e. not counting reducer heap requirements in parallel).
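As a worked illustration of that sizing rule (the exact numbers are assumptions to adjust per workload): with a 4 GB heap per map task on a 32 GB node, setting aside a couple of GB for the OS and Hadoop daemons leaves roughly 29-30 GB, and 29 / 4 ≈ 7, which is where the 6-7 map slot ceiling comes from before any reducer heap is counted. The corresponding tasktracker-side settings would then look roughly like:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx4096m</value>
</property>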

On Sun, Mar 11, 2012 at 9:30 AM, Hans Uhlig huh...@uhlisys.com wrote:

I am attempting to speed up a mapping process whose input is GZIP-compressed CSV files. The files range from 1-2 GB, and I am running on a cluster where each node has a total of 32 GB of memory available to use. I have attempted to tweak mapred.map.child.jvm.opts with -Xmx4096mb and io.sort.mb to 2048 to accommodate the size, but I keep getting Java heap errors or other memory-related problems. My row count per mapper is well below the Integer.MAX_VALUE limit by several orders of magnitude, and the box is NOT using anywhere close to its full memory allotment. How can I specify that this map task can have 3-4 GB of memory for the collection, partition, and sort process without constantly spilling records to disk?



--
Harsh J





--
Harsh J