>Hi,
>
>1 - I would like to understand how a partition works in MapReduce. I
>know that MapReduce contains the IndexRecord class that indicates the
>length of something. Is it the length of a partition or of a spill?
>
>2 - In large map output, can a partition be a set of spills, or is a
>spill a set of partitions?
Are you asking about how the reducer will know whether to uncompress or
not? Maybe it checks config properties like mapreduce.map.output.compress;
I am not sure though.
-Ravi
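A minimal sketch of the setting mentioned above, using the post-rename
property names; the codec choice and class name are illustrative, not from
the thread:
[code]
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;

// Hypothetical example class, not from the thread.
public class CompressMapOutput {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Ask the framework to compress intermediate map output.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Codec for the intermediate data (GzipCodec is only an example).
        conf.setClass("mapreduce.map.output.compress.codec",
                      GzipCodec.class, CompressionCodec.class);
        System.out.println(conf.get("mapreduce.map.output.compress"));
    }
}
[/code]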
On 12/23/10 4:30 AM, "Pedro Costa" wrote:
I know that, but what's confusing me is that at
ReduceTask#MapOutputServlet, it has:
[code]
//open the map-output file
mapOutputIn = rfs.open(mapOutputFileName);
//seek to the correct offset for the reduce
mapOutputIn.seek(info.startOffset);
long rem = info.partLength;
[/code]
PartLength is the compressed length, as the map output data could be
compressed based on a config setting.
RawLength is the uncompressed length.
sortAndSpill() in MapTask.java fills these in as:
rec.startOffset = segmentStart;
rec.rawLength = writer.getRawLength();
rec.partLength = writer.getCompressedLength();
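To make the distinction concrete, here is a toy model of the three fields;
this is illustrative plain Java, not the Hadoop IndexRecord source, and the
numbers are made up:
[code]
// A toy model of the three index-record fields; not Hadoop source.
public class IndexRecordSketch {
    long startOffset; // where this partition begins in the map output file
    long rawLength;   // size of the serialized data before compression
    long partLength;  // size actually stored on disk (compressed)

    public static void main(String[] args) {
        IndexRecordSketch rec = new IndexRecordSketch();
        rec.startOffset = 0L;
        rec.rawLength = 10000L; // e.g. 10 KB of serialized key/value pairs
        rec.partLength = 3200L; // the same data after codec compression
        // Without compression the two lengths match (modulo any
        // per-segment overhead the writer adds).
        System.out.println(rec.partLength <= rec.rawLength);
    }
}
[/code]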
An index record contains 3 variables:
startOffset, rawLength and partLength.
What's the difference between a raw length and a partition length?
On Wed, Dec 22, 2010 at 10:05 PM, Ravi Gummadi wrote:
Each map task produces R partitions (as part of its output file) if the
number of reduce tasks for the job is R.
A reduce task asks the TaskTracker where the map ran for its input. The
TaskTracker serves the corresponding partition of the map output file based
on the reduce task id. For example, the TaskTracker gives partition i to the
reduce task with id i.
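For context, the logic that decides which of the R partitions a record lands
in is the job's Partitioner; Hadoop's default HashPartitioner boils down to
the following, shown here as a standalone sketch:
[code]
import org.apache.hadoop.mapreduce.Partitioner;

// Essentially the default HashPartitioner logic: each (key, value)
// pair maps to one of numReduceTasks partitions, so reduce task i
// always fetches partition i from every map's output file.
public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the result is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
[/code]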
So, I conclude that a partition is defined by the offset.
But, for example, a map task produces 5 partitions. How does the reduce
know that it must fetch the 5 partitions? Where is this information?
It is not given by the offset alone.
On Wed, Dec 22, 2010 at 9:07 PM, Ravi Gummadi wrote:
Each map task will generate a single intermediate file (i.e. the map output
file). This is obtained by merging multiple spills, if spilling was needed.
The index file gives the offset and length for each reducer: the offset is
the offset in the map output file where the input data for the particular
reduce task starts.
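A minimal sketch of such a lookup, assuming the plain layout implied above:
three longs per partition, so partition r's record starts at byte r * 24.
The real spill index file also carries a checksum, so treat this as
illustrative only:
[code]
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Illustrative only: reads (startOffset, rawLength, partLength) for one
// partition from an index file laid out as three longs per partition.
public class IndexLookupSketch {
    static final int RECORD_BYTES = 3 * 8;

    static long[] readRecord(String indexFile, int partition) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(indexFile))) {
            in.skipBytes(partition * RECORD_BYTES); // jump to partition r's record
            long startOffset = in.readLong(); // where the slice begins
            long rawLength   = in.readLong(); // uncompressed size
            long partLength  = in.readLong(); // on-disk (possibly compressed) size
            return new long[] { startOffset, rawLength, partLength };
        }
    }
}
[/code]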
Hi,
1 - I would like to understand how a partition works in MapReduce. I
know that MapReduce contains the IndexRecord class that indicates the
length of something. Is it the length of a partition or of a spill?
2 - In large map output, can a partition be a set of spills, or is a
spill a set of partitions?
I appreciate the insightful comments Todd. I now understand that 0.21 is not
a production release and never will be. That makes me much more confident to
keep working with the CDH3 version. It's difficult to get started with
Hadoop because information is so scattered. The fact that the libraries are…
On 12/21/2010 09:50 PM, Chase Bradford wrote:
If you want a tmp file on a task's local host, just use java's
createTempFile from the File class. It creates a file in
java.io.tmpdir, which the task runner sets up in the task's workspace
and is cleaned by the TT even if the child jvm exits badly.
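A minimal sketch of that suggestion; the class and file names here are made
up for illustration:
[code]
import java.io.File;
import java.io.IOException;

// Inside a task, java.io.tmpdir points at the task's workspace, so a
// temp file created this way is removed by the TaskTracker along with
// the rest of the attempt's directory.
public class TaskTempFileExample {
    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("scratch-", ".dat");
        tmp.deleteOnExit(); // extra safety; TT cleanup is the real net
        System.out.println("temp file: " + tmp.getAbsolutePath());
    }
}
[/code]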
Hi,
2010/12/22 Paweł Łoziński:
> Hi all,
>
> a quick question: is the config option name
> "mapred.reduce.tasks.speculative.execution" valid in hadoop-0.21, or
> did it change since 0.20?
It has changed. Those names are now deprecated.
New names are (still booleans):
mapreduce.map.speculative
mapreduce.reduce.speculative
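A small sketch using the renamed properties; the old mapred.* names still
work in 0.21 but log deprecation warnings:
[code]
import org.apache.hadoop.conf.Configuration;

// Hypothetical example class; the property names are the 0.21 ones above.
public class SpeculativeFlags {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        System.out.println(conf.get("mapreduce.reduce.speculative"));
    }
}
[/code]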
On Wed, Dec 22, 2010 at 7:14 PM, Eric wrote:
> Thank you for your suggestion. I need a temp directory, not a single file. I
> successfully used Koji's suggestion to use ./tmp, which is a preexisting
> directory for this purpose.
And whose name is also configurable via "mapred.child.tmp".
--
Harsh J
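A minimal sketch of resolving that directory from the configuration;
"./tmp" is the documented default, a relative value is resolved against the
task's working directory, and the helper class name is made up:
[code]
import java.io.File;
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper: resolve the task-local temp dir from mapred.child.tmp.
public class ChildTmpResolver {
    static File childTmp(Configuration conf) {
        String tmp = conf.get("mapred.child.tmp", "./tmp");
        File dir = new File(tmp);
        // Relative paths are taken relative to the task's working directory.
        return dir.isAbsolute() ? dir : new File(System.getProperty("user.dir"), tmp);
    }
}
[/code]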
Hi Eric,
Some thoughts inline below:
On Wed, Dec 22, 2010 at 3:39 AM, Eric wrote:
> This question may have been asked numerous times, and the answer will
> probably come down to the specific situation you are in, but I'm going to
> ask anyway:
>
> Which Hadoop version should I pick?
>
> I'm currently running Cloudera's CDH3 beta release, but I'm very tempted to
> install the latest Apache 0.21 release.
I would suggest you configure the latest version of Hadoop: Hadoop is
evolving and continuously updating, so lots of old libraries are getting
deprecated and your own application will not run on the latest version.
Also, its structure changed from 0.20.x to 0.21.
Regarding its stability, if you fou…
I am using hadoop 0.20.2 for data analysis at my company. I did not upgrade
to hadoop 0.21 because of the note at
http://hadoop.apache.org/common/releases.html#23+August%2C+2010%3A+release+0.21.0+available
On Wed, Dec 22, 2010 at 7:39 PM, Eric wrote:
> This question may have been asked numerous times…
2010/12/22 Chase Bradford
> If you want a tmp file on a task's local host, just use java's
> createTempFile from the File class. It creates a file in java.io.tmpdir,
> which the task runner sets up in the task's workspace and is cleaned by
> the TT even if the child jvm exits badly.
Thank you for your suggestion. I need a temp directory, not a single file. I
successfully used Koji's suggestion to use ./tmp, which is a preexisting
directory for this purpose.
Hi all,
a quick question: is the config option name
"mapred.reduce.tasks.speculative.execution" valid in hadoop-0.21, or
did it change since 0.20?
Regards,
Paweł Łoziński
This question may have been asked numerous times, and the answer will
probably come down to the specific situation you are in, but I'm going to
ask anyway:
Which Hadoop version should I pick?
I'm currently running Cloudera's CDH3 beta release, but I'm very tempted to
install the latest Apache 0.21 release.