RE: Partitioning reduce output by date

2008-03-20 Thread Runping Qi
> To: core-user@hadoop.apache.org > Subject: Re: Partitioning reduce output by date > > Thank you, Doug and Ted, this pointed me in the right direction, which > led to a custom OutputFormat and a RecordWriter that opens and closes the > DataOutputStream based on the current ke

Re: Partitioning reduce output by date

2008-03-20 Thread Otis Gospodnetic
ext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Doug Cutting <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Wednesday, March 19, 2008 4:39:04 PM Subject: Re: Partitioning reduce output by date Otis Gospodnetic wrote: > That "numPartition

Re: Partitioning reduce output by date

2008-03-19 Thread Doug Cutting
Otis Gospodnetic wrote: That "numPartitions" corresponds to the number of reduce tasks. What I need is partitioning that corresponds to the number of unique dates (yyyy-mm-dd) processed by the Mapper and not the number of reduce tasks. I don't know the number of distinct dates in the input a
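
To illustrate the point about "numPartitions": a partition function can only return an index below the number of reduce tasks, so many distinct dates end up sharing a partition. The following is a minimal Python sketch of that idea, not Hadoop's Partitioner API; the name partition_for and the use of crc32 as the hash are assumptions made for illustration.

```python
import zlib


def partition_for(date_key: str, num_partitions: int) -> int:
    """Map a date key to a partition index in [0, num_partitions).

    Hypothetical helper: crc32 is used because Python's built-in hash()
    of a str is not stable across interpreter runs.
    """
    return zlib.crc32(date_key.encode("utf-8")) % num_partitions


# 31 distinct dates, but only 2 reduce tasks (both numbers are illustrative).
dates = [f"2008-03-{day:02d}" for day in range(1, 32)]
num_reduces = 2
assignment = {d: partition_for(d, num_reduces) for d in dates}

# Every record with the same date lands in the same partition, but the
# number of partitions is bounded by num_reduces, not by the number of dates.
distinct_partitions = set(assignment.values())
```

The partition count therefore cannot be used to get one output per date; all it guarantees is that a given date is handled by exactly one reduce task.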

Re: Partitioning reduce output by date

2008-03-19 Thread Otis Gospodnetic
user@hadoop.apache.org Sent: Tuesday, March 18, 2008 9:24:14 PM Subject: Re: Partitioning reduce output by date Also see my comment about side effect files. Basically, if you partition on date, then each set of values in the reduce will have the same date. Thus the reducer can open a file, writ

Re: Partitioning reduce output by date

2008-03-18 Thread Ted Dunning
Also see my comment about side effect files. Basically, if you partition on date, then each set of values in the reduce will have the same date. Thus the reducer can open a file, write the values, close the file (repeat). This gives precisely the effect you were seeking. On 3/18/08 6:17 PM, "
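
The open, write, close, repeat pattern described here can be sketched outside Hadoop. This is a minimal Python illustration, assuming the records reach the reducer already grouped by date; write_by_date and the "<date>.txt" file naming are hypothetical, not part of any Hadoop API.

```python
import itertools
import os
import tempfile


def write_by_date(sorted_records, out_dir):
    """Write (date, value) pairs, already sorted by date, to one file per date."""
    written = []
    for date, group in itertools.groupby(sorted_records, key=lambda kv: kv[0]):
        path = os.path.join(out_dir, f"{date}.txt")
        # Records for one date are contiguous, so only one file is open at a time.
        with open(path, "w", encoding="utf-8") as out:
            for _, value in group:
                out.write(value + "\n")
        written.append(path)
    return written


# Illustrative input; sorting stands in for the grouping the framework provides.
records = sorted([
    ("2008-03-02", "baz"),
    ("2008-03-01", "foobar"),
    ("2008-03-01", "qux"),
])
out_dir = tempfile.mkdtemp()
paths = write_by_date(records, out_dir)
```

Because the file is closed as soon as the date changes, the number of distinct dates does not affect how many files are open at once.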

Re: Partitioning reduce output by date

2008-03-18 Thread Martin Traverso
> This makes it sound like the Partitioner is only for intermediate > map-outputs, and not outputs of reduces. Also, it sounds like the number of > distinct partitions is tied to the number of reduces. But what if my job > uses, say, only 2 reduce tasks, and my input has 100 distinct dates, and a

Re: Partitioning reduce output by date

2008-03-18 Thread Otis Gospodnetic
is -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Arun C Murthy <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Tuesday, March 18, 2008 8:17:32 PM Subject: Re: Partitioning reduce output by date On Mar 18, 2008, at 4:35 PM, Otis Gos

Re: Partitioning reduce output by date

2008-03-18 Thread Ted Dunning
I think that a custom partitioner is half of the answer. The other half is that the reducer can open and close output files as needed. With the partitioner, only one file need be kept open at a time. It is good practice to open the files relative to the task directory so that process failure is
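
One way to read the advice about the task directory is that files written under a per-task working directory can be discarded if the task fails and published only if it succeeds. The sketch below is a plain Python illustration under that assumption; run_task, the "_temporary" directory name, and the attempt identifiers are all hypothetical and do not reproduce Hadoop's actual mechanism.

```python
import os
import shutil
import tempfile


def run_task(final_dir, task_id, write_outputs):
    """Run write_outputs(work_dir) and publish its files only on success."""
    work_dir = os.path.join(final_dir, "_temporary", task_id)
    os.makedirs(work_dir)
    try:
        write_outputs(work_dir)
    except Exception:
        # Discard partial files from the failed attempt, then re-raise.
        shutil.rmtree(work_dir)
        raise
    # Success: move each finished file into the final output directory.
    for name in os.listdir(work_dir):
        os.replace(os.path.join(work_dir, name), os.path.join(final_dir, name))
    shutil.rmtree(work_dir)


def write_one_date(work_dir):
    # Paths are built relative to the task's working directory.
    with open(os.path.join(work_dir, "2008-03-01.txt"), "w", encoding="utf-8") as out:
        out.write("foobar baz\n")


final_dir = tempfile.mkdtemp()
run_task(final_dir, "attempt_0", write_one_date)
```

With this layout a task that dies halfway leaves nothing behind in the final output directory, so a retry of the same task can start cleanly.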

Re: Partitioning reduce output by date

2008-03-18 Thread Arun C Murthy
On Mar 18, 2008, at 4:35 PM, Otis Gospodnetic wrote: Hi, What is the best/right way to handle partitioning of the final job output (i.e. output of reduce tasks)? In my case, I am processing logs whose entries include dates (e.g. "2008-03-01foobar baz"). A single log file may