[jira] [Commented] (PIG-3900) SAMPLE and RANDOM should optionally stabilize their output from run-to-run, even across a large input set

li yuntian (JIRA) Tue, 19 May 2015 02:10:49 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550086#comment-14550086
 ]


li yuntian commented on PIG-3900:
---------------------------------

Have you solved the SAMPLE-with-seed problem yourself?

> SAMPLE and RANDOM should optionally stabilize their output from run-to-run, 
> even across a large input set
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-3900
>                 URL: https://issues.apache.org/jira/browse/PIG-3900
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Philip (flip) Kromer
>            Priority: Minor
>              Labels: features, random, sample, seed
>
> SAMPLE and RANDOM should be able to give output that is stable from 
> run-to-run, yet random across a large input set. Although PIG-2965 allows the 
> RANDOM function to be constructed with a seed, each mapper will generate the 
> same sequence of values, which is unacceptable.
> It's typically undesirable to have the output of a large job be completely 
> non-deterministic. Testing becomes difficult, and failed map tasks don't 
> provide the same output from attempt to attempt, which complicates debugging.
> The most desirable implementation would provide a guarantee that a given seed 
> and input data would produce an identical result in any environment. I 
> believe this is difficult in a distributed environment, however.
> If each mapper added the index of its task ID to the provided seed, then the 
> output would be stable for most practical purposes -- as long as the 
> assignment of input splits to mappers doesn't change from job to job, the 
> number produced for each row won't change from job to job. Doing it this way 
> would be backwards compatible with the current Pig 0.12.0 implementation 
> (PIG-2965) in the case of a single mapper (which is the only justifiable use 
> of the current seed feature). Alternatively, one could use a hash of the 
> input file path, the split offset, and the provided seed. Neither approach is 
> stable if the splitCombination logic is not stable. 
> Suggested documentation for new functionality of RANDOM:
> {quote}
> This example constructs a function, providing a seed to control the series of 
> numbers generated. Each of the three fields will have an independent series 
> of random values, and the output will be stable from run to run. (Note that 
> the result is only stable if the input splits remain stable).
> {code:sql}
> DEFINE rollRand  RANDOM('12345');
> DEFINE yawRand   RANDOM('69');
> DEFINE pitchRand RANDOM('42');
> position = LOAD 'position.tsv';
> orientation = FOREACH position GENERATE rollRand() AS roll:double, 
> pitchRand() AS pitch:double, yawRand() AS yaw:double;
> {code}
> {quote}
> Suggested documentation for new functionality of SAMPLE:
> {quote}
> In this example, we provide a seed that stabilizes which rows are selected 
> from run to run. (Note that the result is only stable if the input splits 
> remain stable).
> {code:sql}
> a = LOAD 'a.txt';
> b = SAMPLE a 0.1 SEED 42;
> {code}
> {quote}
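
For anyone who needs this before it lands in Pig, the two seeding schemes from the issue description quoted above can be sketched roughly as follows. This is only an illustration in plain Python, not Pig's implementation: the names task_seed, split_seed and sample_split are hypothetical, and the values will not match what Pig's Java-based RANDOM produces. It assumes the task index, input path and split offset are available to the function.

```python
import hashlib
import random


def task_seed(user_seed: int, task_index: int) -> int:
    """Scheme 1: offset the user-supplied seed by the map task's index.

    Task 0 keeps the original seed, so a single-mapper job behaves as before.
    """
    return user_seed + task_index


def split_seed(user_seed: int, path: str, offset: int) -> int:
    """Scheme 2: hash the input path, split offset and seed together."""
    key = f"{path}\x00{offset}\x00{user_seed}".encode("utf-8")
    # Use a fixed hash rather than Python's salted built-in hash(), so the
    # derived seed is identical across processes and machines.
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def sample_split(rows, fraction, seed):
    """Keep each row with probability `fraction`, using a per-split generator."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]
```

Each mapper would then call something like sample_split(rows, 0.1, split_seed(42, 'a.txt', offset)). As the description notes, either scheme is only repeatable while the assignment of splits to tasks stays the same from run to run.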



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
