[
https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Douglas updated HADOOP-3019:
----------------------------------
Status: Open (was: Patch Available)
bq. Since the sample points are kept in an array and sorted in memory, its
size is severely limited. Why not use a map/reduce job to generate the
sampling points and the partition file?
The client-side sampler is limited, no question, but it usually takes only a
few seconds to run (unlike a distributed job), generates decent results, and
can easily be rolled into the user's driver. The distributed sampler (planned;
I'm writing it now) can be more accurate, but will take longer to run. The
client-side sampler also needs to use the map class, so the sampling is done on
the map output key type and distribution rather than on the input.
The latter requires that most of the InputSampler be rewritten to use
MapRunnable, so I'm cancelling this patch for now.
> want input sampler & sorted partitioner
> ---------------------------------------
>
> Key: HADOOP-3019
> URL: https://issues.apache.org/jira/browse/HADOOP-3019
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Reporter: Doug Cutting
> Assignee: Chris Douglas
> Fix For: 0.19.0
>
> Attachments: 3019-0.patch, 3019-1.patch, 3019-2.patch
>
>
> The input sampler should generate a small, random sample of the input, saved
> to a file.
> The partitioner should read the sample file and partition keys into
> relatively even-sized key-ranges, where the partition numbers correspond to
> key order.
> Note that when the sampler is used for partitioning, the number of samples
> required is proportional to the number of reduce partitions. 10x the
> intended reducer count should give good results.
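The flow described above (sample the keys, derive evenly spaced split points, then partition by key order) can be sketched without Hadoop types. This is a minimal, Hadoop-free illustration, not the InputSampler/TotalOrderPartitioner API; the class and method names (SamplingPartitioner, splitPoints, getPartition) are assumptions for this sketch:

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch only: reservoir-sample the keys, pick split points
// from the sorted sample, and assign partitions by binary search so that
// partition numbers correspond to key order.
public class SamplingPartitioner {

    // Reservoir-sample up to sampleSize keys from the input stream.
    static String[] sample(Iterable<String> keys, int sampleSize, long seed) {
        Random rnd = new Random(seed);
        String[] reservoir = new String[sampleSize];
        int seen = 0;
        for (String k : keys) {
            if (seen < sampleSize) {
                reservoir[seen] = k;                // fill the reservoir first
            } else {
                int j = rnd.nextInt(seen + 1);      // replace with decreasing probability
                if (j < sampleSize) reservoir[j] = k;
            }
            seen++;
        }
        return Arrays.copyOf(reservoir, Math.min(seen, sampleSize));
    }

    // Pick numPartitions - 1 evenly spaced split points from the sorted sample,
    // yielding relatively even-sized key ranges.
    static String[] splitPoints(String[] sample, int numPartitions) {
        Arrays.sort(sample);
        String[] splits = new String[numPartitions - 1];
        double step = (double) sample.length / numPartitions;
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sample[(int) Math.round(i * step) - 1];
        }
        return splits;
    }

    // Binary-search the split points; keys less than splits[0] go to
    // partition 0, keys greater than the last split to the last partition.
    static int getPartition(String key, String[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        return pos < 0 ? -pos - 1 : pos + 1;
    }
}
```

In the real feature the split points would be written to and read back from the sample file, and the keys would be the map output key type rather than strings; the binary search is the part that makes partition numbers track key order.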