[
https://issues.apache.org/jira/browse/CRUNCH-351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13909543#comment-13909543
]
Gabriel Reid commented on CRUNCH-351:
-------------------------------------
Looks to me like this will indeed work a lot better when there are a lot of
unique elements with a costly compare. I'd guess it will also be slower if the
input PCollection contains a large number of duplicate items with a cheap
compare: in the current version the shuffle would be extremely lightweight in
terms of IO, while with this patch all of the duplicate elements would still go
through the full shuffle.
I'm guessing that Shard will much more commonly be used for heavier-weight,
mostly-unique values, so it makes sense to optimize for that use case. This one
sounds like a good idea to me.
About the use of the random long as the key: I was thinking it might be even
better to generate a limited-range int instead, with the range just large
enough that we can be sure the keys will be decently distributed over the
partitions. That way the shuffle will (I think) become even lighter weight,
because there would be fewer unique keys to sort. Does that sound right?
Also on the subject of the random key generation, is there any drawback to
using a constant value as the seed for the Random? If not, it might be better
to just use a constant to avoid the dependency on the MapReduce framework.
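As a rough illustration of the limited-range suggestion, here is a minimal
Java sketch (standalone, not Crunch code; the class name, `spreadFactor`
multiplier, and seed value are all illustrative assumptions, not from the
patch). It draws keys from a small range, applies the default Hadoop-style
hash partitioning, and counts how many records land on each partition:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: distribute records over reduce partitions using a
// limited-range random int key instead of a full random long.
public class BoundedShardKeys {

    static Map<Integer, Integer> simulate(int numPartitions, int spreadFactor,
                                          long seed, int numRecords) {
        Random rand = new Random(seed); // constant seed, as suggested above
        Map<Integer, Integer> perPartition = new HashMap<>();
        for (int i = 0; i < numRecords; i++) {
            // Key range [0, numPartitions * spreadFactor): only a handful of
            // unique keys to sort, yet enough to cover every partition.
            int key = rand.nextInt(numPartitions * spreadFactor);
            // Default Hadoop HashPartitioner logic:
            int partition = (Integer.hashCode(key) & Integer.MAX_VALUE) % numPartitions;
            perPartition.merge(partition, 1, Integer::sum);
        }
        return perPartition;
    }

    public static void main(String[] args) {
        simulate(4, 8, 42L, 100_000)
            .forEach((p, n) -> System.out.println("partition " + p + ": " + n));
    }
}
```

With the key range a multiple of the partition count, each partition should
receive a roughly equal share of the records while the sort only ever sees
`numPartitions * spreadFactor` distinct keys.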
> Improve performance of Shard#shard on large records
> ---------------------------------------------------
>
> Key: CRUNCH-351
> URL: https://issues.apache.org/jira/browse/CRUNCH-351
> Project: Crunch
> Issue Type: Improvement
> Reporter: Chao Shi
> Assignee: Chao Shi
> Attachments: crunch-351.patch
>
>
> This avoids sorting on the input data, whose records may be large and make
> the shuffle phase slow. The improvement is to sort on pseudo-random numbers instead.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)