[ https://issues.apache.org/jira/browse/DATAFU-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthew Hayes updated DATAFU-11:
--------------------------------

    Description: 
Reported by Barbara Mucha ([Issue #92 on GitHub|https://github.com/linkedin/datafu/issues/92]):

ReservoirSample does not behave as expected when grouping by a key other than ALL.

The sample appears to be drawn from the full input instead of from each group's input.

Given input:

{noformat}
a1,5
a1,6
a1,7
a2,5
a2,6
a2,7
{noformat}

with the following program

{noformat}
DEFINE ReservoirSample datafu.pig.sampling.ReservoirSample('2');
data = LOAD 'input.txt' USING PigStorage(',') AS (key: chararray, value: chararray);
grouped = GROUP data BY key;
sample2 = FOREACH grouped GENERATE ReservoirSample(data);
{noformat}

the expected output should be similar to

{noformat}
(a1, {(a1,5),(a1,7)})
(a2, {(a2,5),(a2,7)})
{noformat}

However, actual output may show up as

{noformat}
(a1, {(a1,5),(a1,7)})
(a2, {(a1,5),(a1,7)})
{noformat}
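
One way to check for this without relying on which rows happen to be sampled is to flatten each sampled bag and look for rows whose key differs from the group key. The script below is only a sketch: the aliases sampled, flat, and leaked and the field names group_key and picked are made up for illustration, and it assumes that the DataFu jar is already registered and that the sampled tuples keep the (key, value) layout of the input.

{noformat}
-- Assumes the DataFu jar has already been made available with REGISTER.
DEFINE ReservoirSample datafu.pig.sampling.ReservoirSample('2');
data = LOAD 'input.txt' USING PigStorage(',') AS (key: chararray, value: chararray);
grouped = GROUP data BY key;

-- Emit the group key next to the sample drawn for that group.
sampled = FOREACH grouped GENERATE group AS group_key, ReservoirSample(data) AS picked;

-- One output row per sampled tuple, still tagged with the group it was drawn for.
flat = FOREACH sampled GENERATE group_key, FLATTEN(picked) AS (key, value);

-- A sampled tuple whose own key differs from the group key came from another group.
leaked = FILTER flat BY group_key != key;
DUMP leaked;
{noformat}

If sampling is done per group, leaked should be empty. Any row in it means a tuple from another group ended up in that group's sample.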

  was:
ReservoirSample does not behave as expected when grouping by a key other than ALL.

It appears like the sample is done on the full input instead of the group input.

Given input:
a1,5
a1,6
a1,7
a2,5
a2,6
a2,7

with the following program
DEFINE ReservoirSample datafu.pig.sampling.ReservoirSample('2');
data = LOAD 'input.txt' USING PigStorage(',') AS (key: chararray, value: chararray);
grouped = GROUP data BY key;
sample2 = FOREACH grouped GENERATE ReservoirSample(data);

the expected output should be similar to
(a1, {(a1,5),(a1,7)})
(a2, {(a2,5),(a2,7)})

However, actual output may show up as
(a1, {(a1,5),(a1,7)})
(a2, {(a1,5),(a1,7)})


> ReservoirSample does not behave as expected when grouping by a key other than ALL
> ---------------------------------------------------------------------------------
>
>                 Key: DATAFU-11
>                 URL: https://issues.apache.org/jira/browse/DATAFU-11
>             Project: DataFu
>          Issue Type: Bug
>            Reporter: Will Vaughan



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
