Re: Spill and Map Output

2010-12-22 Thread 周俊清
> Hi,
>
> 1 - I would like to understand how a partition works in the Map Reduce. I know that the Map Reduce contains the IndexRecord class that indicates the length of something. Is it the length of a partition or of a spill?
>
> 2 - In large map output, a partition can be a set of spills, or a …

Re: Spill and Map Output

2010-12-22 Thread Ravi Gummadi
Are you asking about how the reducer will know whether to uncompress or not? Maybe it checks config properties like mapreduce.map.output.compress; I'm not sure though. -Ravi

On 12/23/10 4:30 AM, "Pedro Costa" wrote:
> I know that, but what's confusing me is that at ReduceTask#MapOutputServlet, …

Re: Spill and Map Output

2010-12-22 Thread Pedro Costa
I know that, but what's confusing me is that at ReduceTask#MapOutputServlet, it has:

[code]
// open the map-output file
mapOutputIn = rfs.open(mapOutputFileName);
// seek to the correct offset for the reduce
mapOutputIn.seek(info.startOffset);
long rem = info.partLength;
[/code]
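The snippet above can be sketched in plain Java. This is an illustrative stand-in, not Hadoop's actual servlet code: `PartitionReader` is a hypothetical name, and `RandomAccessFile` stands in for the Hadoop filesystem stream (`rfs.open`). It shows the mechanics the thread is discussing: seek to the index record's startOffset, then read partLength bytes.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class PartitionReader {
    // Read one reduce partition out of a map output file, given the
    // index record's start offset and (possibly compressed) part length.
    public static byte[] readPartition(String mapOutputFileName,
                                       long startOffset,
                                       long partLength) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(mapOutputFileName, "r")) {
            in.seek(startOffset);          // jump to this reducer's slice
            byte[] buf = new byte[(int) partLength];
            in.readFully(buf);             // read exactly partLength bytes
            return buf;
        }
    }
}
```

The point is that the servlet never parses the file contents to find the partition boundary; the index record alone tells it where to start and how many bytes to stream.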

Re: Spill and Map Output

2010-12-22 Thread Ravi Gummadi
PartLength is the compressed length, as the map output data could be compressed based on a config setting. RawLength is the uncompressed length. sortAndSpill() in MapTask.java has the details:

rec.startOffset = segmentStart;
rec.rawLength = writer.getRawLength();
…
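The three fields Ravi describes can be modeled as a plain-Java sketch. This is an illustrative model of the record shape discussed in the thread, not Hadoop's actual IndexRecord class; the `endOffset()` helper is an assumption added for clarity.

```java
// One record per reduce partition in the map task's index file.
public class IndexRecord {
    public final long startOffset; // byte offset of the partition in the map output file
    public final long rawLength;   // uncompressed length of the partition data
    public final long partLength;  // on-disk (possibly compressed) length

    public IndexRecord(long startOffset, long rawLength, long partLength) {
        this.startOffset = startOffset;
        this.rawLength = rawLength;
        this.partLength = partLength;
    }

    // Where this partition's bytes end on disk.
    public long endOffset() {
        return startOffset + partLength;
    }
}
```

With compression on, partLength would typically be smaller than rawLength; with compression off, the two lengths describe the same bytes.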

Re: Spill and Map Output

2010-12-22 Thread Pedro Costa
An index record contains 3 variables: startOffset, rawLength and partLength. What's the difference between a raw length and a partition length?

On Wed, Dec 22, 2010 at 10:05 PM, Ravi Gummadi wrote:
> Each map task produces R partitions (as part of its output file) if the
> number of reduce tasks …

Re: Spill and Map Output

2010-12-22 Thread Ravi Gummadi
Each map task produces R partitions (as part of its output file) if the number of reduce tasks for the job is R. The reduce task asks the TaskTracker where the map ran for its input. The TaskTracker serves the corresponding partition of the map output file based on the reduce task id. For example, the TaskTracker gives …
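How a key lands in one of those R partitions can be sketched in plain Java. This mirrors the logic of Hadoop's default HashPartitioner (hash code masked to non-negative, then modulo the number of reducers); jobs with a custom Partitioner may route keys differently, and the class name here is just illustrative.

```java
public class HashPartitionSketch {
    // Route a key to one of numReduceTasks partitions, as the default
    // HashPartitioner does: mask off the sign bit, then take the modulus.
    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because every map task applies the same function, partition i of every map output file holds data for the same reducer, which is why reducer i can simply ask each TaskTracker for "my partition".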

Re: Spill and Map Output

2010-12-22 Thread Pedro Costa
So, I conclude that a partition is defined by the offset. But, for example, say a Map Task produces 5 partitions. How does the reduce know that it must fetch the 5 partitions? Where's this information? It is not given by the offset alone.

On Wed, Dec 22, 2010 at 9:07 PM, Ravi Gummadi wrote …

Re: Spill and Map Output

2010-12-22 Thread Ravi Gummadi
Each map task will generate a single intermediate file (i.e. the map output file). This is obtained by merging multiple spills, if spills were needed. The index file gives the details of the offset and length for each reducer. The offset is the offset in the map output file where the input data for the …

Spill and Map Output

2010-12-22 Thread Pedro Costa
Hi,

1 - I would like to understand how a partition works in Map Reduce. I know that Map Reduce contains the IndexRecord class that indicates the length of something. Is it the length of a partition or of a spill?

2 - In a large map output, a partition can be a set of spills, or a spill is …

Re: Which version to choose

2010-12-22 Thread Eric
I appreciate the insightful comments, Todd. I now understand that 0.21 is not a production release and never will be. That makes me much more confident about continuing with the CDH3 version. It's difficult to get started with Hadoop because information is so scattered. The fact that the libraries are …

Re: Getting a temporary directory in map jobs

2010-12-22 Thread David Rosenstrauch
On 12/21/2010 09:50 PM, Chase Bradford wrote:
> If you want a tmp file on a task's local host, just use Java's
> createTempFile from the File class. It creates a file in java.io.tmpdir,
> which the task runner sets up in the task's workspace and is cleaned by
> the TT even if the child JVM exits badly.
…
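The approach Chase describes can be sketched with the standard library alone. The class name and prefix below are illustrative assumptions; the key point is that File.createTempFile with no directory argument uses java.io.tmpdir, which (per the thread) the task runner points at the task's workspace so the TaskTracker can clean it up.

```java
import java.io.File;
import java.io.IOException;

public class TaskTempFile {
    // Create a temp file under java.io.tmpdir. Inside a Hadoop task the
    // runner redirects java.io.tmpdir into the task's workspace, so the
    // file is cleaned up even if the child JVM exits badly.
    public static File createTaskTempFile() throws IOException {
        return File.createTempFile("tasktmp-", ".tmp");
    }
}
```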

Re: mapred.reduce.tasks.speculative.execution config option in hadoop-0.21

2010-12-22 Thread Harsh J
Hi,

2010/12/22 Paweł Łoziński:
> Hi all,
>
> a quick question: is the config option name
> "mapred.reduce.tasks.speculative.execution" valid in hadoop-0.21, or
> did it change since 0.20?

It has changed. Those names are now deprecated. The new names are (still booleans): mapreduce.map.speculative and mapreduce.reduce.speculative.
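In configuration-file terms, the new-style names Harsh lists would look like this. A minimal sketch of a mapred-site.xml fragment; the true/false values shown are just example settings:

```xml
<!-- 0.21+ names; the old mapred.*.tasks.speculative.execution names
     are deprecated. Both options take boolean values. -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>
```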

Re: Getting a temporary directory in map jobs

2010-12-22 Thread Harsh J
On Wed, Dec 22, 2010 at 7:14 PM, Eric wrote:
> Thank you for your suggestion. I need a temp directory, not a single file.
> I successfully used Koji's suggestion to use ./tmp, which is a preexisting
> directory for this purpose.

And whose name is also configurable via "mapred.child.tmp". -- Harsh J

Re: Which version to choose

2010-12-22 Thread Todd Lipcon
Hi Eric, some thoughts inline below:

On Wed, Dec 22, 2010 at 3:39 AM, Eric wrote:
> This question may have been asked numerous times, and the answer will
> probably come down to the specific situation you are in, but I'm going to
> ask anyway:
>
> Which Hadoop version should I pick?
>
> I'm currently …

Re: Which version to choose

2010-12-22 Thread rahul patodi
I would suggest you configure the latest version of Hadoop: since Hadoop is evolving and continuously updating, lots of old libraries are getting deprecated, and your application may not run on the latest version. Also, its structure changed from 0.20.x to 0.21. Regarding its stability, if you …

Re: Which version to choose

2010-12-22 Thread jingguo yao
I am using Hadoop 0.20.2 for data analysis at my company. I did not upgrade to Hadoop 0.21 because of the note at http://hadoop.apache.org/common/releases.html#23+August%2C+2010%3A+release+0.21.0+available

On Wed, Dec 22, 2010 at 7:39 PM, Eric wrote:
> This question may have been asked numerous times …

Re: Getting a temporary directory in map jobs

2010-12-22 Thread Eric
2010/12/22 Chase Bradford
> If you want a tmp file on a task's local host, just use Java's
> createTempFile from the File class. It creates a file in java.io.tmpdir,
> which the task runner sets up in the task's workspace and is cleaned by
> the TT even if the child JVM exits badly.

Thank you for …

mapred.reduce.tasks.speculative.execution config option in hadoop-0.21

2010-12-22 Thread Paweł Łoziński
Hi all, a quick question: is the config option name "mapred.reduce.tasks.speculative.execution" valid in hadoop-0.21, or did it change since 0.20? Regards, Paweł Łoziński

Which version to choose

2010-12-22 Thread Eric
This question may have been asked numerous times, and the answer will probably come down to the specific situation you are in, but I'm going to ask anyway:

Which Hadoop version should I pick?

I'm currently running Cloudera's CDH3 beta release, but I'm very tempted to install the latest Apache 0.21 …