> To: core-user@hadoop.apache.org
> Subject: Re: Partitioning reduce output by date
>
> Thank you, Doug and Ted, this pointed me in the right direction, which
> led to a custom OutputFormat and a RecordWriter that opens and closes the
> DataOutputStream based on the current ke
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Doug Cutting <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, March 19, 2008 4:39:04 PM
Subject: Re: Partitioning reduce output by date
Otis Gospodnetic wrote:
That "numPartitions" corresponds to the number of reduce tasks. What I need is
partitioning that corresponds to the number of unique dates (yyyy-mm-dd) processed by the
Mapper and not the number of reduce tasks. I don't know the number of distinct dates in
the input a
To: core-user@hadoop.apache.org
Sent: Tuesday, March 18, 2008 9:24:14 PM
Subject: Re: Partitioning reduce output by date
Also see my comment about side effect files.
Basically, if you partition on date, then each set of values in the reduce
will have the same date. Thus the reducer can open a file, write the
values, close the file (repeat).
This gives precisely the effect you were seeking.
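To make the open/write/close pattern concrete, here is a rough sketch in plain Java. It deliberately does not use the Hadoop API, so every name in it (DatePartitionSketch, partitionFor, writeGroup, demo, the ".txt" suffix, the sample values) is hypothetical and only stands in for a real Partitioner and reducer. It assumes the key is the date string and that the framework hands the reducer all values for one date together.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a date partitioner plus a reducer that
// writes one file per date. Not the Hadoop API.
class DatePartitionSketch {

    // Hash the date so every record with the same date lands in the same
    // reduce task, even when there are far fewer tasks than dates.
    static int partitionFor(String date, int numReduceTasks) {
        return (date.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // One "reduce call": all values share one date, so open a file named
    // after the date, write the values, and close it again.
    static void writeGroup(Path taskDir, String date, List<String> values) {
        Path out = taskDir.resolve(date + ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (String v : values) {
                w.write(v);
                w.newLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Simulates one reduce task: groups arrive one date at a time, so only
    // one output file is open at any moment. Returns what was written.
    static Map<String, List<String>> demo() {
        Map<String, List<String>> grouped = new TreeMap<>();
        grouped.put("2008-03-01", Arrays.asList("foo", "bar"));
        grouped.put("2008-03-02", Arrays.asList("baz"));
        try {
            // A temp directory plays the role of the task's working directory.
            Path taskDir = Files.createTempDirectory("task-sketch");
            for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
                writeGroup(taskDir, e.getKey(), e.getValue());
            }
            Map<String, List<String>> readBack = new TreeMap<>();
            for (String date : grouped.keySet()) {
                Path file = taskDir.resolve(date + ".txt");
                readBack.put(date + ".txt", Files.readAllLines(file, StandardCharsets.UTF_8));
            }
            return readBack;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("2008-03-01", 2));
        System.out.println(demo());
    }
}
```

With 2 reduce tasks and 100 dates, each task would simply see about half of the dates and produce one file per date it sees. In a real job the files would be opened relative to the task's own output directory rather than a temp directory.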
On 3/18/08 6:17 PM, "
> This makes it sound like the Partitioner is only for intermediate
> map-outputs, and not outputs of reduces. Also, it sounds like the number of
> distinct partitions is tied to the number of reduces. But what if my job
> uses, say, only 2 reduce tasks, and my input has 100 distinct dates, and a
is
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Arun C Murthy <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, March 18, 2008 8:17:32 PM
Subject: Re: Partitioning reduce output by date
On Mar 18, 2008, at 4:35 PM, Otis Gos
I think that a custom partitioner is half of the answer. The other half is
that the reducer can open and close output files as needed. With the
partitioner, only one file need be kept open at a time. It is good practice
to open the files relative to the task directory so that process failure is
On Mar 18, 2008, at 4:35 PM, Otis Gospodnetic wrote:
Hi,
What is the best/right way to handle partitioning of the final job
output (i.e. output of reduce tasks)? In my case, I am processing
logs whose entries include dates (e.g. "2008-03-01foobar
baz"). A single log file may