I am trying to speed up a mapping process whose input is GZIP-compressed
CSV files. The files range from 1-2 GB, and I am running on a cluster where
each node has a total of 32 GB of memory available. I have tried tweaking
mapred.map.child.java.opts with -Xmx4096m and raising io.sort.mb to 2048 to
accommodate the size, but I keep getting Java heap errors or other
memory-related problems. My row count per mapper is below the
Integer.MAX_VALUE limit by several orders of magnitude, and the box is NOT
using anywhere close to its full memory allotment. How can I give this map
task 3-4 GB of memory for the collect, partition, and sort phases without
constantly spilling records to disk?
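
For reference, here is roughly how I am applying these settings in the job
driver. This is a minimal sketch assuming the old MRv1 property names
(mapred.map.child.java.opts, io.sort.mb, io.sort.spill.percent); the class
name and exact values are illustrative, not my full job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class GzipCsvMapJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Per-map-task JVM heap. Note the -Xmx suffix is "m", not "mb".
            conf.set("mapred.map.child.java.opts", "-Xmx3072m");
            // In-memory sort buffer, in MB. The buffer is allocated inside
            // the map JVM's heap, so it must fit comfortably within -Xmx.
            conf.setInt("io.sort.mb", 1536);
            // Let the buffer fill further before spilling (default 0.80).
            conf.setFloat("io.sort.spill.percent", 0.90f);
            Job job = new Job(conf, "gzip-csv-mapper");
            // ... set mapper class, input/output formats and paths here ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The intent is that the sort buffer plus the mapper's own working set stay
below the configured heap, so records collect in memory rather than
spilling on every buffer fill.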
