[ https://issues.apache.org/jira/browse/DATAFU-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthew Hayes reassigned DATAFU-11:
-----------------------------------

    Assignee: Matthew Hayes

> ReservoirSample does not behave as expected when grouping by a key other than ALL
> ---------------------------------------------------------------------------------
>
>                 Key: DATAFU-11
>                 URL: https://issues.apache.org/jira/browse/DATAFU-11
>             Project: DataFu
>          Issue Type: Bug
>            Reporter: Will Vaughan
>            Assignee: Matthew Hayes
>         Attachments: DATAFU-11.patch
>
>
> Reported by Barbara Mucha ([Issue #92 on GitHub|https://github.com/linkedin/datafu/issues/92]):
> ReservoirSample does not behave as expected when grouping by a key other than ALL.
> It appears that the sample is taken from the full input instead of from each group's input.
> Given input:
> {noformat}
> a1,5
> a1,6
> a1,7
> a2,5
> a2,6
> a2,7
> {noformat}
> with the following program
> {noformat}
> DEFINE ReservoirSample datafu.pig.sampling.ReservoirSample('2');
> data = LOAD 'input.txt' USING PigStorage(',') AS (key: chararray, value: chararray);
> grouped = GROUP data BY key;
> sample2 = FOREACH grouped GENERATE ReservoirSample(data);
> {noformat}
> the expected output should be similar to
> {noformat}
> (a1, {(a1,5),(a1,7)})
> (a2, {(a2,5),(a2,7)})
> {noformat}
> However, actual output may show up as
> {noformat}
> (a1, {(a1,5),(a1,7)})
> (a2, {(a1,5),(a1,7)})
> {noformat}
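
For reference, the expected per-group behavior described in the report can be sketched outside of Pig. The following is only an illustrative Python sketch, not the DataFu implementation and not the attached patch: the function names reservoir_sample and sample_per_group are hypothetical, and the sampling step is the textbook reservoir algorithm (Algorithm R), which is assumed here rather than taken from the DataFu source. The point it shows is that each key gets its own fresh reservoir, so rows from one group cannot appear in another group's sample.

```python
import random
from collections import defaultdict


def reservoir_sample(rows, k, rng):
    """Return a uniform random sample of up to k rows (textbook Algorithm R)."""
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Pick a slot in [0, i]; the row is kept only if the slot is < k.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


def sample_per_group(records, k, seed=None):
    """Group (key, value) records by key and sample each group independently."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append((key, value))
    # A new reservoir is built for every key, so no state is shared across groups.
    return {key: reservoir_sample(rows, k, rng) for key, rows in groups.items()}


# Same rows as the input.txt in the report.
records = [
    ("a1", "5"), ("a1", "6"), ("a1", "7"),
    ("a2", "5"), ("a2", "6"), ("a2", "7"),
]

result = sample_per_group(records, 2, seed=0)
for key in sorted(result):
    print(key, result[key])
```

Which two rows are chosen per key depends on the seed, but every sampled row should carry the key of the group it is listed under, which is the property the actual output in the report violates.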