What about the Quantile UDF in DataFu? http://datafu.incubator.apache.org/docs/datafu/1.1.0/datafu/pig/stats/Quantile.html
Is that useful here? If not then can it be modified to cover Russell's use case?

Thanks,
Mitul

On Sun, Jul 12, 2015 at 11:16 AM, Russell Melick (JIRA) <j...@apache.org> wrote:

> Russell Melick created DATAFU-98:
> ------------------------------------
>
>              Summary: New UDF for Histogram / Frequency counting
>                  Key: DATAFU-98
>                  URL: https://issues.apache.org/jira/browse/DATAFU-98
>              Project: DataFu
>           Issue Type: New Feature
>             Reporter: Russell Melick
>
>
> I was thinking of creating a new UDF to compute histograms / frequency
> counts of input bags. It seems like it would make sense to support ints,
> longs, float, and doubles.
>
> I tried looking around to see if this was already implemented, but
> ValueHistogram and AggregateWordHistogram were about the only things I
> found. They seem to exist as an example job, and only work for Strings.
>
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/ValueHistogram.html
>
> https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/examples/AggregateWordHistogram.html
>
> Should the user specify the bin size or the number of bins? Specifying
> bin size probably makes the implementation simpler since you can bin things
> without having seen all of the data.
>
> I think it would make sense to implement a version of this that didn't
> need any reducers. It could use counters to keep track of the counts per
> bin without sending any data to a reducer. You would be able to call this
> without a preceding GROUP BY as well.
>
> Here's my proposal for the two udfs. This assumes the input data is two
> columns, memberId and numConnections.
>
> {code}
> DEFINE BinnedFrequency datafu.pig.stats.BinnedFrequency('min=0;binSize=50')
>
> connections = LOAD 'connections' AS memberId, numConnections;
> connectionHistogram = FOREACH (GROUP connections ALL) GENERATE
> BinnedFrequency(connections.numConnections);
> {code}
>
> The output here would be a bag with the frequency counts
> {code}
> {('0-49', 5), ('50-99', 0), ('100-149', 10)}
> {code}
>
> {code}
> DEFINE BinnedFrequencyCounter
> datafu.pig.stats.BinnedFrequencyCounter('min=0;binSize=50;name=numConnectionsHistogram')
>
> connections = LOAD 'connections' AS memberId, numConnections;
> connections = FOREACH connections GENERATE
> BinnedFrequencyCounter(numConnections);
> {code}
>
> The output here would just be a counter for each bin, all sharing the same
> group of numConnectionsHistogram. It would look something like
>
> numConnectionsHistogram.'0-49' = 5
> numConnectionsHistogram.'50-99' = 0
> numConnectionsHistogram.'100-149' = 10
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
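
For reference, the fixed-bin-size counting described in the quoted proposal can be sketched outside Pig. This is only an illustration in Python, not the proposed UDF and not an existing DataFu API: the function name `binned_frequency` and its parameters are hypothetical. It assumes integer inputs, the inclusive 'lo-hi' labels from the example output above, and that values below the minimum are simply skipped.

```python
from collections import Counter


def binned_frequency(values, min_value=0, bin_size=50):
    """Count integer values into fixed-width bins labelled 'lo-hi' (inclusive).

    Hypothetical helper for illustration only; not part of DataFu.
    """
    counts = Counter()
    for v in values:
        if v < min_value:
            continue  # assumption: values below the minimum are ignored
        # The bin index depends only on the value itself, so each value
        # can be binned as it arrives, without seeing the rest of the data.
        counts[(v - min_value) // bin_size] += 1

    if not counts:
        return []

    result = []
    # Emit every bin up to the highest one seen, including empty bins.
    for i in range(max(counts) + 1):
        lo = min_value + i * bin_size
        result.append((f"{lo}-{lo + bin_size - 1}", counts.get(i, 0)))
    return result


# Five members with 10 connections and ten members with 120 connections.
sample = [10] * 5 + [120] * 10
print(binned_frequency(sample))
# [('0-49', 5), ('50-99', 0), ('100-149', 10)]
```

Because each value maps to its bin independently, the same per-value step could in principle feed per-bin counters instead of a result list, which is the idea behind the counter-based variant in the proposal.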