Re: Mapper Record Spillage
Actually, if you set io.sort.mb to 2048, your map tasks will always fail: the maximum io.sort.mb is hard-coded to 2047. Which means that if you think you've set 2048 and your tasks aren't failing, you probably haven't actually changed io.sort.mb. Double-check which configuration settings the JobTracker actually saw by looking at the job's history files:

  $ hadoop fs -cat hdfs:///_logs/history/*.xml | grep io.sort.mb

George

On 2012/03/11 22:38, Harsh J wrote:
> Hans,
>
> I don't think io.sort.mb can support a whole 2048 value (it builds one
> array with the size, and the JVM may not be allowing that). Can you lower
> it to 2000 ± 100 and try again?
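Another way to confirm what a task actually received, if the history files are awkward to reach, is to log the value from inside the task itself. A minimal sketch, assuming the new-API Mapper (the class name here is illustrative, not from the thread):

  import org.apache.hadoop.mapreduce.Mapper;

  public class DebugMapper extends Mapper<Object, Object, Object, Object> {
    @Override
    protected void setup(Context context) {
      // Print the value this particular task JVM actually sees; it shows up in the task's stderr log.
      int sortMb = context.getConfiguration().getInt("io.sort.mb", 100); // 100 is the 0.20-era default
      System.err.println("Effective io.sort.mb = " + sortMb);
    }
  }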
Re: Mapper Record Spillage
(Er, not sure how that ± got in there; I meant to type -100, lowered further if it continued to show problems.)

On Sun, Mar 11, 2012 at 7:08 PM, Harsh J wrote:
> Hans,
>
> I don't think io.sort.mb can support a whole 2048 value (it builds one
> array with the size, and JVM may not be allowing that). Can you lower
> it to 2000 ± 100 and try again?

--
Harsh J
Re: Mapper Record Spillage
Hans,

I don't think io.sort.mb can support a whole 2048 value (it builds one array with the size, and the JVM may not be allowing that). Can you lower it to 2000 ± 100 and try again?

On Sun, Mar 11, 2012 at 1:36 PM, Hans Uhlig wrote:
> If that is the case then these two lines should make more than enough
> memory, on a virtually unused cluster:
>
> job.getConfiguration().setInt("io.sort.mb", 2048);
> job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");
>
> Such that a conversion from 1 GB of CSV text to binary primitives should fit
> easily. But Java still throws a heap error even when there is 25 GB of
> memory free.

--
Harsh J
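A rough illustration of the array-size limit Harsh is pointing at (my own sketch, not the actual Hadoop source): the map-side sort buffer is one contiguous byte[] sized from io.sort.mb, and Java array lengths are ints, so 2048 MB is just past what a single array can hold.

  public class SortBufferLimit {
    public static void main(String[] args) {
      long bytesFor2048 = 2048L << 20;                        // 2,147,483,648 bytes
      long bytesFor2047 = 2047L << 20;                        // 2,146,435,072 bytes
      System.out.println(bytesFor2048 > Integer.MAX_VALUE);   // true  -> one byte[] cannot hold it
      System.out.println(bytesFor2047 <= Integer.MAX_VALUE);  // true  -> 2047 MB still fits
    }
  }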
Re: Mapper Record Spillage
If that is the case then these two lines should make more than enough memory, on a virtually unused cluster:

  job.getConfiguration().setInt("io.sort.mb", 2048);
  job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");

Such that a conversion from 1 GB of CSV text to binary primitives should fit easily. But Java still throws a heap error even when there is 25 GB of memory free.

On Sat, Mar 10, 2012 at 11:50 PM, Harsh J wrote:
> Hans,
>
> You can change memory requirements for tasks of a single job, but not
> of a single task inside that job.
>
> This is briefly how the 0.20 framework (by default) works: the TT has
> notions only of "slots", and carries a maximum _number_ of
> simultaneous slots it may run. It does not know what each task,
> occupying one slot, would demand in resource terms. Your job then
> supplies a number of map tasks, and the amount of memory required per map
> task in general, as configuration. TTs then merely start the task JVMs
> with the provided heap configuration.
Re: Mapper Record Spillage
Hans,

You can change memory requirements for the tasks of a single job, but not for a single task inside that job.

This is briefly how the 0.20 framework (by default) works: the TT has notions only of "slots", and carries a maximum _number_ of simultaneous slots it may run. It does not know what each task, occupying one slot, will demand in resource terms. Your job then supplies a number of map tasks, and the amount of memory required per map task in general, as configuration. TTs then merely start the task JVMs with the provided heap configuration.

On Sun, Mar 11, 2012 at 11:24 AM, Hans Uhlig wrote:
> That was a typo in my email, not in the configuration. Is the memory reserved
> for the tasks when the task tracker starts? You seem to be suggesting that I
> need to set the memory to be the same for all map tasks. Is there no way to
> override it for a single map task?

--
Harsh J
Re: Mapper Record Spillage
That was a typo in my email, not in the configuration. Is the memory reserved for the tasks when the task tracker starts? You seem to be suggesting that I need to set the memory to be the same for all map tasks. Is there no way to override it for a single map task?

On Sat, Mar 10, 2012 at 8:41 PM, Harsh J wrote:
> Hans,
>
> It's possible you have a typo issue: mapred.map.child.jvm.opts - such a
> property does not exist. Perhaps you wanted "mapred.map.child.java.opts"?
>
> Additionally, the computation you need to do is: (# of map slots on a
> TT * per-map-task heap requirement) should be < (Total RAM - 2-3 GB).
> With your 4 GB requirement, I guess you can support a max of 6-7 slots
> per machine (i.e. not counting reducer heap requirements in parallel).
Re: Mapper Record Spillage
Hans,

It's possible you have a typo issue: mapred.map.child.jvm.opts - such a property does not exist. Perhaps you wanted "mapred.map.child.java.opts"?

Additionally, the computation you need to do is: (# of map slots on a TT * per-map-task heap requirement) should be < (Total RAM - 2-3 GB). With your 4 GB requirement, I guess you can support a max of 6-7 slots per machine (i.e. not counting reducer heap requirements in parallel).

On Sun, Mar 11, 2012 at 9:30 AM, Hans Uhlig wrote:
> I am attempting to speed up a mapping process whose input is GZIP-compressed
> CSV files. The files range from 1-2 GB. I am running on a cluster where each
> node has a total of 32 GB of memory available to use. I have attempted to
> tweak mapred.map.child.jvm.opts with -Xmx4096mb and io.sort.mb to 2048 to
> accommodate the size, but I keep getting Java heap errors or other
> memory-related problems. My row count per mapper is well below the
> Integer.MAX_INTEGER limit by several orders of magnitude, and the box is NOT
> using anywhere close to its full memory allotment. How can I specify that
> this map task can have 3-4 GB of memory for the collection, partition and
> sort process without constantly spilling records to disk?

--
Harsh J
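To put rough numbers on that (my arithmetic, using the figures already in the thread): on a 32 GB node, setting aside roughly 3 GB for the OS plus the DataNode and TaskTracker daemons leaves about 29 GB for task heaps; at 4 GB per map task that allows floor(29 / 4) = 7 map slots at most, and any reduce slots running in parallel have to fit in the same budget.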
Re: Mapper Record Spillage
I am attempting to specify this for a single job during its creation/submission, not via the general construct. I am using the new API, so I am adding the values to the conf passed into new Job().

2012/3/10 WangRamon:
> How many map/reduce task slots do you have for each node? If the total
> number is 10, then you will use 10 * 4096 MB of memory when all tasks are
> running, which is bigger than the total memory of 32 GB you have for each
> node.
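As an illustration of that pattern, here is a minimal new-API driver sketch; the class names, paths, and chosen values are mine, not from the thread, and the point is only that the settings go onto the Configuration before the Job is constructed:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

  public class CsvConversionDriver {
    // Illustrative pass-through mapper; the actual CSV-to-binary conversion is not shown in the thread.
    public static class CsvMapper extends Mapper<Object, Text, Text, NullWritable> {
      @Override
      protected void map(Object key, Text value, Context context)
          throws java.io.IOException, InterruptedException {
        context.write(value, NullWritable.get());
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Per-job settings, applied before the Job is created; only this job's map tasks see them.
      conf.setInt("io.sort.mb", 1024);                      // map-side sort buffer, in MB
      conf.set("mapred.map.child.java.opts", "-Xmx3072M");  // heap for each map-task JVM
      Job job = new Job(conf, "csv-to-binary");
      job.setJarByClass(CsvConversionDriver.class);
      job.setMapperClass(CsvMapper.class);
      job.setNumReduceTasks(0);                             // map-only conversion
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(NullWritable.class);
      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputFormatClass(SequenceFileOutputFormat.class);
      TextInputFormat.addInputPath(job, new Path(args[0]));
      SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }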
RE: Mapper Record Spillage
How many map/reduce task slots do you have for each node? If the total number is 10, then you will use 10 * 4096 MB of memory when all tasks are running, which is bigger than the total memory of 32 GB you have for each node.

Date: Sat, 10 Mar 2012 20:00:13 -0800
Subject: Mapper Record Spillage
From: huh...@uhlisys.com
To: mapreduce-user@hadoop.apache.org

I am attempting to speed up a mapping process whose input is GZIP-compressed CSV files. The files range from 1-2 GB. I am running on a cluster where each node has a total of 32 GB of memory available to use. I have attempted to tweak mapred.map.child.jvm.opts with -Xmx4096mb and io.sort.mb to 2048 to accommodate the size, but I keep getting Java heap errors or other memory-related problems. My row count per mapper is well below the Integer.MAX_INTEGER limit by several orders of magnitude, and the box is NOT using anywhere close to its full memory allotment. How can I specify that this map task can have 3-4 GB of memory for the collection, partition and sort process without constantly spilling records to disk?