Re: Mapper Record Spillage

2012-03-12 Thread George Datskos
Actually, if you set {io.sort.mb} to 2048, your map tasks will always
fail: the maximum {io.sort.mb} is hard-coded to 2047. Which means that if
you think you've set 2048 and your tasks aren't failing, you probably
haven't actually changed io.sort.mb. Double-check which configuration
settings the JobTracker actually saw by looking at


$ hadoop fs -cat hdfs:///_logs/history/*.xml | grep io.sort.mb
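
A complementary check is to log, from inside the task itself, the values the
task JVM actually received. Below is a minimal sketch (assuming the new-API
Mapper and the 0.20-era property name used in this thread; the class name is
illustrative only):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortMbCheckMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // io.sort.mb as this task actually sees it; 100 is the stock default,
    // so a value of 100 here means the override never reached the job.
    int sortMb = context.getConfiguration().getInt("io.sort.mb", 100);
    long heapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
    System.err.println("Effective io.sort.mb=" + sortMb + ", max heap=" + heapMb + "MB");
  }
}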




George


On 2012/03/11 22:38, Harsh J wrote:

Hans,

I don't think io.sort.mb can support a whole 2048 value (it builds one
array with the size, and JVM may not be allowing that). Can you lower
it to 2000 ± 100 and try again?

On Sun, Mar 11, 2012 at 1:36 PM, Hans Uhlig  wrote:

If that is the case then these two lines should provide more than enough
memory on a virtually unused cluster.

job.getConfiguration().setInt("io.sort.mb", 2048);
job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");

That should let a conversion from 1GB of CSV Text to binary primitives fit
easily, but Java still throws a heap error even when there is 25 GB of
memory free.

On Sat, Mar 10, 2012 at 11:50 PM, Harsh J  wrote:

Hans,

You can change memory requirements for tasks of a single job, but not
of a single task inside that job.

This is briefly how the 0.20 framework (by default) works: the TT has
notions only of "slots", and carries a maximum _number_ of simultaneous
slots it may run. It does not know what each task, occupying one slot,
will demand in resource terms. Your job then supplies a number of map
tasks, and the amount of memory required per map task in general, as
configuration. TTs then merely start the task JVMs with the provided
heap configuration.

On Sun, Mar 11, 2012 at 11:24 AM, Hans Uhlig  wrote:

That was a typo in my email, not in the configuration. Is the memory
reserved for the tasks when the task tracker starts? You seem to be
suggesting that I need to set the memory to be the same for all map
tasks. Is there no way to override it for a single map task?


On Sat, Mar 10, 2012 at 8:41 PM, Harsh J  wrote:

Hans,

It's possible you have a typo: mapred.map.child.jvm.opts -
such a property does not exist. Perhaps you wanted
"mapred.map.child.java.opts"?

Additionally, the computation you need to do is: (# of map slots on a
TT * per-map-task heap requirement) should stay below (Total RAM -
2-3 GB reserved for the system). With your 4 GB requirement, I guess you
can support a max of 6-7 slots per machine (i.e., not counting reducer
heap requirements in parallel).

On Sun, Mar 11, 2012 at 9:30 AM, Hans Uhlig  wrote:

I am attempting to speed up a mapping process whose input is GZIP
compressed CSV files. The files range from 1-2GB. I am running on a
cluster where each node has a total of 32GB of memory available to use.
I have attempted to tweak mapred.map.child.jvm.opts with -Xmx4096mb and
io.sort.mb to 2048 to accommodate the size, but I keep getting Java heap
errors or other memory-related problems. My row count per mapper is well
below the Integer.MAX_VALUE limit by several orders of magnitude, and the
box is NOT using anywhere close to its full memory allotment. How can I
specify that this map task can have 3-4 GB of memory for the collection,
partition and sort process without constantly spilling records to disk?



--
Harsh J





--
Harsh J










Re: Mapper Record Spillage

2012-03-11 Thread Harsh J
(Er, not sure how that ± got in there; I meant to type -100, lowered
further if it continues to show problems.)

On Sun, Mar 11, 2012 at 7:08 PM, Harsh J  wrote:
> Hans,
>
> I don't think io.sort.mb can support a whole 2048 value (it builds one
> array with the size, and JVM may not be allowing that). Can you lower
> it to 2000 ± 100 and try again?
>
> On Sun, Mar 11, 2012 at 1:36 PM, Hans Uhlig  wrote:
>> If that is the case then these two lines should provide more than enough
>> memory on a virtually unused cluster.
>>
>> job.getConfiguration().setInt("io.sort.mb", 2048);
>> job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");
>>
>> That should let a conversion from 1GB of CSV Text to binary primitives fit
>> easily, but Java still throws a heap error even when there is 25 GB of
>> memory free.
>>
>> On Sat, Mar 10, 2012 at 11:50 PM, Harsh J  wrote:
>>>
>>> Hans,
>>>
>>> You can change memory requirements for tasks of a single job, but not
>>> of a single task inside that job.
>>>
>>> This is briefly how the 0.20 framework (by default) works: the TT has
>>> notions only of "slots", and carries a maximum _number_ of simultaneous
>>> slots it may run. It does not know what each task, occupying one slot,
>>> will demand in resource terms. Your job then supplies a number of map
>>> tasks, and the amount of memory required per map task in general, as
>>> configuration. TTs then merely start the task JVMs with the provided
>>> heap configuration.
>>>
>>> On Sun, Mar 11, 2012 at 11:24 AM, Hans Uhlig  wrote:
>>> > That was a typo in my email, not in the configuration. Is the memory
>>> > reserved for the tasks when the task tracker starts? You seem to be
>>> > suggesting that I need to set the memory to be the same for all map
>>> > tasks. Is there no way to override it for a single map task?
>>> >
>>> >
>>> > On Sat, Mar 10, 2012 at 8:41 PM, Harsh J  wrote:
>>> >>
>>> >> Hans,
>>> >>
>>> >> It's possible you have a typo: mapred.map.child.jvm.opts -
>>> >> such a property does not exist. Perhaps you wanted
>>> >> "mapred.map.child.java.opts"?
>>> >>
>>> >> Additionally, the computation you need to do is: (# of map slots on a
>>> >> TT * per-map-task heap requirement) should stay below (Total RAM -
>>> >> 2-3 GB reserved for the system). With your 4 GB requirement, I guess you
>>> >> can support a max of 6-7 slots per machine (i.e., not counting reducer
>>> >> heap requirements in parallel).
>>> >>
>>> >> On Sun, Mar 11, 2012 at 9:30 AM, Hans Uhlig  wrote:
>>> >> > I am attempting to speed up a mapping process whose input is GZIP
>>> >> > compressed CSV files. The files range from 1-2GB. I am running on a
>>> >> > cluster where each node has a total of 32GB of memory available to use.
>>> >> > I have attempted to tweak mapred.map.child.jvm.opts with -Xmx4096mb and
>>> >> > io.sort.mb to 2048 to accommodate the size, but I keep getting Java heap
>>> >> > errors or other memory-related problems. My row count per mapper is well
>>> >> > below the Integer.MAX_VALUE limit by several orders of magnitude, and the
>>> >> > box is NOT using anywhere close to its full memory allotment. How can I
>>> >> > specify that this map task can have 3-4 GB of memory for the collection,
>>> >> > partition and sort process without constantly spilling records to disk?
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Harsh J
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>
>>
>
>
>
> --
> Harsh J



-- 
Harsh J


Re: Mapper Record Spillage

2012-03-11 Thread Harsh J
Hans,

I don't think io.sort.mb can support a whole 2048 value (it builds one
array with the size, and JVM may not be allowing that). Can you lower
it to 2000 ± 100 and try again?

On Sun, Mar 11, 2012 at 1:36 PM, Hans Uhlig  wrote:
> If that is the case then these two lines should provide more than enough
> memory on a virtually unused cluster.
>
> job.getConfiguration().setInt("io.sort.mb", 2048);
> job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");
>
> That should let a conversion from 1GB of CSV Text to binary primitives fit
> easily, but Java still throws a heap error even when there is 25 GB of
> memory free.
>
> On Sat, Mar 10, 2012 at 11:50 PM, Harsh J  wrote:
>>
>> Hans,
>>
>> You can change memory requirements for tasks of a single job, but not
>> of a single task inside that job.
>>
>> This is briefly how the 0.20 framework (by default) works: the TT has
>> notions only of "slots", and carries a maximum _number_ of simultaneous
>> slots it may run. It does not know what each task, occupying one slot,
>> will demand in resource terms. Your job then supplies a number of map
>> tasks, and the amount of memory required per map task in general, as
>> configuration. TTs then merely start the task JVMs with the provided
>> heap configuration.
>>
>> On Sun, Mar 11, 2012 at 11:24 AM, Hans Uhlig  wrote:
>> > That was a typo in my email, not in the configuration. Is the memory
>> > reserved for the tasks when the task tracker starts? You seem to be
>> > suggesting that I need to set the memory to be the same for all map
>> > tasks. Is there no way to override it for a single map task?
>> >
>> >
>> > On Sat, Mar 10, 2012 at 8:41 PM, Harsh J  wrote:
>> >>
>> >> Hans,
>> >>
>> >> It's possible you have a typo: mapred.map.child.jvm.opts -
>> >> such a property does not exist. Perhaps you wanted
>> >> "mapred.map.child.java.opts"?
>> >>
>> >> Additionally, the computation you need to do is: (# of map slots on a
>> >> TT * per-map-task heap requirement) should stay below (Total RAM -
>> >> 2-3 GB reserved for the system). With your 4 GB requirement, I guess you
>> >> can support a max of 6-7 slots per machine (i.e., not counting reducer
>> >> heap requirements in parallel).
>> >>
>> >> On Sun, Mar 11, 2012 at 9:30 AM, Hans Uhlig  wrote:
>> >> > I am attempting to speed up a mapping process whose input is GZIP
>> >> > compressed CSV files. The files range from 1-2GB. I am running on a
>> >> > cluster where each node has a total of 32GB of memory available to use.
>> >> > I have attempted to tweak mapred.map.child.jvm.opts with -Xmx4096mb and
>> >> > io.sort.mb to 2048 to accommodate the size, but I keep getting Java heap
>> >> > errors or other memory-related problems. My row count per mapper is well
>> >> > below the Integer.MAX_VALUE limit by several orders of magnitude, and the
>> >> > box is NOT using anywhere close to its full memory allotment. How can I
>> >> > specify that this map task can have 3-4 GB of memory for the collection,
>> >> > partition and sort process without constantly spilling records to disk?
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J


Re: Mapper Record Spillage

2012-03-11 Thread Hans Uhlig
If that is the case then these two lines should provide more than enough
memory on a virtually unused cluster.

job.getConfiguration().setInt("io.sort.mb", 2048);
job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");

That should let a conversion from 1GB of CSV Text to binary primitives fit
easily, but Java still throws a heap error even when there is 25 GB of
memory free.
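
For reference, a corrected version of those two lines, following the advice
elsewhere in this thread (io.sort.mb has a hard-coded maximum of 2047, and the
task heap should leave headroom above the sort buffer), might look like this
sketch; the exact values are illustrative:

// io.sort.mb must stay at or below 2047; 2000 leaves the 3072M heap
// roughly 1 GB of headroom for the map code and framework overhead.
job.getConfiguration().setInt("io.sort.mb", 2000);
job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");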

On Sat, Mar 10, 2012 at 11:50 PM, Harsh J  wrote:

> Hans,
>
> You can change memory requirements for tasks of a single job, but not
> of a single task inside that job.
>
> This is briefly how the 0.20 framework (by default) works: the TT has
> notions only of "slots", and carries a maximum _number_ of simultaneous
> slots it may run. It does not know what each task, occupying one slot,
> will demand in resource terms. Your job then supplies a number of map
> tasks, and the amount of memory required per map task in general, as
> configuration. TTs then merely start the task JVMs with the provided
> heap configuration.
>
> On Sun, Mar 11, 2012 at 11:24 AM, Hans Uhlig  wrote:
> > That was a typo in my email, not in the configuration. Is the memory
> > reserved for the tasks when the task tracker starts? You seem to be
> > suggesting that I need to set the memory to be the same for all map
> > tasks. Is there no way to override it for a single map task?
> >
> >
> > On Sat, Mar 10, 2012 at 8:41 PM, Harsh J  wrote:
> >>
> >> Hans,
> >>
> >> It's possible you have a typo: mapred.map.child.jvm.opts -
> >> such a property does not exist. Perhaps you wanted
> >> "mapred.map.child.java.opts"?
> >>
> >> Additionally, the computation you need to do is: (# of map slots on a
> >> TT * per-map-task heap requirement) should stay below (Total RAM -
> >> 2-3 GB reserved for the system). With your 4 GB requirement, I guess you
> >> can support a max of 6-7 slots per machine (i.e., not counting reducer
> >> heap requirements in parallel).
> >>
> >> On Sun, Mar 11, 2012 at 9:30 AM, Hans Uhlig  wrote:
> >> > I am attempting to speed up a mapping process whose input is GZIP
> >> > compressed CSV files. The files range from 1-2GB. I am running on a
> >> > cluster where each node has a total of 32GB of memory available to use.
> >> > I have attempted to tweak mapred.map.child.jvm.opts with -Xmx4096mb and
> >> > io.sort.mb to 2048 to accommodate the size, but I keep getting Java heap
> >> > errors or other memory-related problems. My row count per mapper is well
> >> > below the Integer.MAX_VALUE limit by several orders of magnitude, and the
> >> > box is NOT using anywhere close to its full memory allotment. How can I
> >> > specify that this map task can have 3-4 GB of memory for the collection,
> >> > partition and sort process without constantly spilling records to disk?
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>


Re: Mapper Record Spillage

2012-03-10 Thread Harsh J
Hans,

You can change memory requirements for tasks of a single job, but not
of a single task inside that job.

This is briefly how the 0.20 framework (by default) works: the TT has
notions only of "slots", and carries a maximum _number_ of simultaneous
slots it may run. It does not know what each task, occupying one slot,
will demand in resource terms. Your job then supplies a number of map
tasks, and the amount of memory required per map task in general, as
configuration. TTs then merely start the task JVMs with the provided
heap configuration.
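
A short sketch of what that leaves you to work with, using the 0.20-era
property names from this thread (values illustrative, and continuing the
job object from the snippets quoted below):

// Per-job knob: every map task of this job gets the same heap setting.
job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");
// There is no per-task override. The slot count itself
// (mapred.tasktracker.map.tasks.maximum) is a TaskTracker-side setting in
// mapred-site.xml, read when the TaskTracker daemon starts.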

On Sun, Mar 11, 2012 at 11:24 AM, Hans Uhlig  wrote:
> That was a typo in my email, not in the configuration. Is the memory reserved
> for the tasks when the task tracker starts? You seem to be suggesting that I
> need to set the memory to be the same for all map tasks. Is there no way to
> override it for a single map task?
>
>
> On Sat, Mar 10, 2012 at 8:41 PM, Harsh J  wrote:
>>
>> Hans,
>>
>> It's possible you have a typo: mapred.map.child.jvm.opts -
>> such a property does not exist. Perhaps you wanted
>> "mapred.map.child.java.opts"?
>>
>> Additionally, the computation you need to do is: (# of map slots on a
>> TT * per-map-task heap requirement) should stay below (Total RAM -
>> 2-3 GB reserved for the system). With your 4 GB requirement, I guess you
>> can support a max of 6-7 slots per machine (i.e., not counting reducer
>> heap requirements in parallel).
>>
>> On Sun, Mar 11, 2012 at 9:30 AM, Hans Uhlig  wrote:
>> > I am attempting to speed up a mapping process whose input is GZIP
>> > compressed CSV files. The files range from 1-2GB. I am running on a
>> > cluster where each node has a total of 32GB of memory available to use.
>> > I have attempted to tweak mapred.map.child.jvm.opts with -Xmx4096mb and
>> > io.sort.mb to 2048 to accommodate the size, but I keep getting Java heap
>> > errors or other memory-related problems. My row count per mapper is well
>> > below the Integer.MAX_VALUE limit by several orders of magnitude, and the
>> > box is NOT using anywhere close to its full memory allotment. How can I
>> > specify that this map task can have 3-4 GB of memory for the collection,
>> > partition and sort process without constantly spilling records to disk?
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J


Re: Mapper Record Spillage

2012-03-10 Thread Hans Uhlig
That was a typo in my email, not in the configuration. Is the memory
reserved for the tasks when the task tracker starts? You seem to be
suggesting that I need to set the memory to be the same for all map tasks.
Is there no way to override it for a single map task?

On Sat, Mar 10, 2012 at 8:41 PM, Harsh J  wrote:

> Hans,
>
> It's possible you have a typo: mapred.map.child.jvm.opts -
> such a property does not exist. Perhaps you wanted
> "mapred.map.child.java.opts"?
>
> Additionally, the computation you need to do is: (# of map slots on a
> TT * per-map-task heap requirement) should stay below (Total RAM -
> 2-3 GB reserved for the system). With your 4 GB requirement, I guess you
> can support a max of 6-7 slots per machine (i.e., not counting reducer
> heap requirements in parallel).
>
> On Sun, Mar 11, 2012 at 9:30 AM, Hans Uhlig  wrote:
> > I am attempting to speed up a mapping process whose input is GZIP
> > compressed CSV files. The files range from 1-2GB. I am running on a
> > cluster where each node has a total of 32GB of memory available to use.
> > I have attempted to tweak mapred.map.child.jvm.opts with -Xmx4096mb and
> > io.sort.mb to 2048 to accommodate the size, but I keep getting Java heap
> > errors or other memory-related problems. My row count per mapper is well
> > below the Integer.MAX_VALUE limit by several orders of magnitude, and the
> > box is NOT using anywhere close to its full memory allotment. How can I
> > specify that this map task can have 3-4 GB of memory for the collection,
> > partition and sort process without constantly spilling records to disk?
>
>
>
> --
> Harsh J
>


Re: Mapper Record Spillage

2012-03-10 Thread Harsh J
Hans,

It's possible you have a typo: mapred.map.child.jvm.opts -
such a property does not exist. Perhaps you wanted
"mapred.map.child.java.opts"?

Additionally, the computation you need to do is: (# of map slots on a
TT * per-map-task heap requirement) should stay below (Total RAM -
2-3 GB reserved for the system). With your 4 GB requirement, I guess you
can support a max of 6-7 slots per machine (i.e., not counting reducer
heap requirements in parallel).
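
As a worked version of that rule of thumb (a sketch using the numbers from
this thread; the 3 GB reserve is an assumption):

public class SlotMath {
  public static void main(String[] args) {
    int totalRamGb = 32;    // per node, from the original question
    int reservedGb = 3;     // assumed headroom for OS, DataNode, TaskTracker
    int perMapHeapGb = 4;   // requested -Xmx per map task
    // slots * per-map heap should stay below (total RAM - reserve)
    int maxMapSlots = (totalRamGb - reservedGb) / perMapHeapGb;
    System.out.println("max map slots per node ~ " + maxMapSlots);  // prints 7
  }
}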

On Sun, Mar 11, 2012 at 9:30 AM, Hans Uhlig  wrote:
> I am attempting to speed up a mapping process whose input is GZIP compressed
> CSV files. The files range from 1-2GB. I am running on a cluster where each
> node has a total of 32GB of memory available to use. I have attempted to tweak
> mapred.map.child.jvm.opts with -Xmx4096mb and io.sort.mb to 2048 to
> accommodate the size, but I keep getting Java heap errors or other
> memory-related problems. My row count per mapper is well below the
> Integer.MAX_VALUE limit by several orders of magnitude, and the box is NOT
> using anywhere close to its full memory allotment. How can I specify that this
> map task can have 3-4 GB of memory for the collection, partition and sort
> process without constantly spilling records to disk?



-- 
Harsh J


Re: Mapper Record Spillage

2012-03-10 Thread Hans Uhlig
I am attempting to specify this for a single job during its
creation/submission, not via the general configuration. I am using the new
API, so I am adding the values to the conf passed into new Job();
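
A minimal sketch of that per-job setup with the new API, in the same fragment
style as the snippets in this thread (values and job name are illustrative;
note the 2047 ceiling on io.sort.mb mentioned at the top of this thread):

Configuration conf = new Configuration();
conf.setInt("io.sort.mb", 2000);                      // hard-coded maximum is 2047
conf.set("mapred.map.child.java.opts", "-Xmx3072M");  // heap for this job's map tasks
Job job = new Job(conf, "csv-to-binary");
// ... set mapper class, input/output formats and paths, then:
// job.waitForCompletion(true);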

2012/3/10 WangRamon 

> How many map/reduce task slots do you have for each node? If the
> total number is 10, then you will use 10 * 4096MB of memory when all tasks are
> running, which is more than the total of 32GB of memory you have for each node.
>
> --
> Date: Sat, 10 Mar 2012 20:00:13 -0800
> Subject: Mapper Record Spillage
> From: huh...@uhlisys.com
> To: mapreduce-user@hadoop.apache.org
>
> I am attempting to speed up a mapping process whose input is GZIP compressed
> CSV files. The files range from 1-2GB. I am running on a cluster where each
> node has a total of 32GB of memory available to use. I have attempted to tweak
> mapred.map.child.jvm.opts with -Xmx4096mb and io.sort.mb to 2048 to
> accommodate the size, but I keep getting Java heap errors or other
> memory-related problems. My row count per mapper is well below the
> Integer.MAX_VALUE limit by several orders of magnitude, and the box is NOT
> using anywhere close to its full memory allotment. How can I specify that this
> map task can have 3-4 GB of memory for the collection, partition and sort
> process without constantly spilling records to disk?
>


RE: Mapper Record Spillage

2012-03-10 Thread WangRamon

How many map/reduce task slots do you have for each node? If the total number
is 10, then you will use 10 * 4096MB of memory when all tasks are running, which
is more than the total of 32GB of memory you have for each node.
 Date: Sat, 10 Mar 2012 20:00:13 -0800
Subject: Mapper Record Spillage
From: huh...@uhlisys.com
To: mapreduce-user@hadoop.apache.org

I am attempting to speed up a mapping process whose input is GZIP compressed
CSV files. The files range from 1-2GB. I am running on a cluster where each
node has a total of 32GB of memory available to use. I have attempted to tweak
mapred.map.child.jvm.opts with -Xmx4096mb and io.sort.mb to 2048 to accommodate
the size, but I keep getting Java heap errors or other memory-related problems.
My row count per mapper is well below the Integer.MAX_VALUE limit by several
orders of magnitude, and the box is NOT using anywhere close to its full memory
allotment. How can I specify that this map task can have 3-4 GB of memory for
the collection, partition and sort process without constantly spilling records
to disk?