[
https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Douglas updated HADOOP-3019:
----------------------------------
Status: Open (was: Patch Available)
bq. Since the sample points are kept in an array and sorted in memory, its
size is severely limited. Why not use a map/reduce job to generate the
sampling points and the partition file?
The client-side sampler is limited, no question, but it usually takes only a
few seconds to run (unlike a distributed job), generates decent results, and
can easily be rolled into the user's driver. The distributed sampler (planned;
I'm writing it now) can be more accurate, but will take longer to run. The
client-side sampler also needs to use the map class, so the sampling is done on
the map output key type and distribution rather than on the input.
The latter requires that most of the InputSampler be rewritten to use
MapRunnable, so I'm cancelling this patch for now.
> want input sampler & sorted partitioner
> ---------------------------------------
>
> Key: HADOOP-3019
> URL: https://issues.apache.org/jira/browse/HADOOP-3019
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Reporter: Doug Cutting
> Assignee: Chris Douglas
> Fix For: 0.19.0
>
> Attachments: 3019-0.patch, 3019-1.patch, 3019-2.patch
>
>
> The input sampler should generate a small, random sample of the input, saved
> to a file.
> The partitioner should read the sample file and partition keys into
> relatively even-sized key-ranges, where the partition numbers correspond to
> key order.
> Note that when the sampler is used for partitioning, the number of samples
> required is proportional to the number of reduce partitions. 10x the
> intended reducer count should give good results.
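The flow described above (sample the keys, derive evenly spaced split points, then partition by key order) can be sketched without Hadoop types. This is a minimal, Hadoop-free illustration, not the InputSampler/TotalOrderPartitioner API; the class and method names (SamplingPartitioner, splitPoints, getPartition) are assumptions for this sketch:

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch only: reservoir-sample the keys, pick split points
// from the sorted sample, and assign partitions by binary search so that
// partition numbers correspond to key order.
public class SamplingPartitioner {

    // Reservoir-sample up to sampleSize keys from the input stream.
    static String[] sample(Iterable<String> keys, int sampleSize, long seed) {
        Random rnd = new Random(seed);
        String[] reservoir = new String[sampleSize];
        int seen = 0;
        for (String k : keys) {
            if (seen < sampleSize) {
                reservoir[seen] = k;                // fill the reservoir first
            } else {
                int j = rnd.nextInt(seen + 1);      // replace with decreasing probability
                if (j < sampleSize) reservoir[j] = k;
            }
            seen++;
        }
        return Arrays.copyOf(reservoir, Math.min(seen, sampleSize));
    }

    // Pick numPartitions - 1 evenly spaced split points from the sorted sample,
    // yielding relatively even-sized key ranges.
    static String[] splitPoints(String[] sample, int numPartitions) {
        Arrays.sort(sample);
        String[] splits = new String[numPartitions - 1];
        double step = (double) sample.length / numPartitions;
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sample[(int) Math.round(i * step) - 1];
        }
        return splits;
    }

    // Binary-search the split points; keys less than splits[0] go to
    // partition 0, keys greater than the last split to the last partition.
    static int getPartition(String key, String[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        return pos < 0 ? -pos - 1 : pos + 1;
    }
}
```

In the real feature the split points would be written to and read back from the sample file, and the keys would be the map output key type rather than strings; the binary search is the part that makes partition numbers track key order.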