[ https://issues.apache.org/jira/browse/DATAFU-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthew Hayes updated DATAFU-11:
--------------------------------
    Description:
Reported by Barbara Mucha ([Issue #92 on GitHub|https://github.com/linkedin/datafu/issues/92]):

ReservoirSample does not behave as expected when grouping by a key other than ALL. It appears that the sample is taken from the full input instead of each group's input.

Given input:

{noformat}
a1,5
a1,6
a1,7
a2,5
a2,6
a2,7
{noformat}

with the following program

{noformat}
DEFINE ReservoirSample datafu.pig.sampling.ReservoirSample('2');
data = LOAD 'input.txt' USING PigStorage(',') AS (key: chararray, value: chararray);
grouped = GROUP data BY key;
sample2 = FOREACH grouped GENERATE ReservoirSample(data);
{noformat}

the expected output should be similar to

{noformat}
(a1, {(a1,5),(a1,7)})
(a2, {(a2,5),(a2,7)})
{noformat}

However, actual output may show up as

{noformat}
(a1, {(a1,5),(a1,7)})
(a2, {(a1,5),(a1,7)})
{noformat}

  was:
ReservoirSample does not behave as expected when grouping by a key other than ALL. It appears that the sample is taken from the full input instead of each group's input.

Given input:

a1,5
a1,6
a1,7
a2,5
a2,6
a2,7

with the following program

DEFINE ReservoirSample datafu.pig.sampling.ReservoirSample('2');
data = LOAD 'input.txt' USING PigStorage(',') AS (key: chararray, value: chararray);
grouped = GROUP data BY key;
sample2 = FOREACH grouped GENERATE ReservoirSample(data);

the expected output should be similar to

(a1, {(a1,5),(a1,7)})
(a2, {(a2,5),(a2,7)})

However, actual output may show up as

(a1, {(a1,5),(a1,7)})
(a2, {(a1,5),(a1,7)})

> ReservoirSample does not behave as expected when grouping by a key other than ALL
> ---------------------------------------------------------------------------------
>
>                 Key: DATAFU-11
>                 URL: https://issues.apache.org/jira/browse/DATAFU-11
>             Project: DataFu
>          Issue Type: Bug
>            Reporter: Will Vaughan
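
For reference, the expected per-group behavior described in this issue can be sketched outside Pig. The Python below is only an illustration of the intended semantics, not DataFu's implementation: the function names `reservoir_sample` and `sample_per_group` are hypothetical, and the use of the classic reservoir algorithm (Algorithm R) is an assumption. The point it shows is that a fresh reservoir is built for each key, so a group's sample can only contain tuples from that group.

```python
import random
from collections import defaultdict


def reservoir_sample(items, k, rng):
    """Return a uniform sample of up to k items (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


def sample_per_group(rows, k, seed=0):
    """Group (key, value) rows by key, then sample each group independently."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append((key, value))
    rng = random.Random(seed)
    # A new reservoir is created per key, so no tuple can leak across groups.
    return {key: reservoir_sample(members, k, rng)
            for key, members in groups.items()}


# Same input as in the issue description.
rows = [("a1", "5"), ("a1", "6"), ("a1", "7"),
        ("a2", "5"), ("a2", "6"), ("a2", "7")]
result = sample_per_group(rows, 2)
for key, sample in result.items():
    print(key, sample)
```

Which two tuples are picked per key depends on the random seed, but every tuple listed under `a2` should carry the key `a2`, which is the property the reported output violates.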