[ https://issues.apache.org/jira/browse/DATAFU-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624120#comment-14624120 ]
Russell Melick commented on DATAFU-98:
--------------------------------------

Posted RB: https://reviews.apache.org/r/36439/

> New UDF for Histogram / Frequency counting
> ------------------------------------------
>
>                 Key: DATAFU-98
>                 URL: https://issues.apache.org/jira/browse/DATAFU-98
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Russell Melick
>         Attachments: DATAFU-98.patch
>
>
> I was thinking of creating a new UDF to compute histograms / frequency counts
> of input bags. It seems like it would make sense to support ints, longs,
> floats, and doubles.
> I tried looking around to see if this was already implemented, but
> ValueHistogram and AggregateWordHistogram were about the only things I found.
> They seem to exist as an example job, and only work for Strings.
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/ValueHistogram.html
> https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/examples/AggregateWordHistogram.html
> Should the user specify the bin size or the number of bins? Specifying bin
> size probably makes the implementation simpler, since you can bin things
> without having seen all of the data.
> I think it would make sense to implement a version of this that didn't need
> any reducers. It could use counters to keep track of the counts per bin
> without sending any data to a reducer. You would be able to call this
> without a preceding GROUP BY as well.
> Here's my proposal for the two UDFs. This assumes the input data is two
> columns, memberId and numConnections.
> {code}
> DEFINE BinnedFrequency datafu.pig.stats.BinnedFrequency('min=0;binSize=50');
> connections = LOAD 'connections' AS (memberId, numConnections);
> connectionHistogram = FOREACH (GROUP connections ALL) GENERATE
>     BinnedFrequency(connections.numConnections);
> {code}
> The output here would be a bag with the frequency counts:
> {code}
> {('0-49', 5), ('50-99', 0), ('100-149', 10)}
> {code}
> {code}
> DEFINE BinnedFrequencyCounter
>     datafu.pig.stats.BinnedFrequencyCounter('min=0;binSize=50;name=numConnectionsHistogram');
> connections = LOAD 'connections' AS (memberId, numConnections);
> connections = FOREACH connections GENERATE
>     BinnedFrequencyCounter(numConnections);
> {code}
> The output here would just be a counter for each bin, all sharing the same
> counter group, numConnectionsHistogram. It would look something like:
> numConnectionsHistogram.'0-49' = 5
> numConnectionsHistogram.'50-99' = 0
> numConnectionsHistogram.'100-149' = 10
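
For reference, the binning rule implied by the proposal (a fixed `min` and `binSize`, with labels such as '0-49') can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in DATAFU-98.patch: the class and method names (`BinnedFrequencySketch`, `binLabel`, `histogram`) are hypothetical, it handles only `long` values, and it assumes every value is at least `min`. Note that labelling a single value needs no knowledge of the rest of the data, which is what would make the counter-based variant possible, whereas emitting empty bins such as '50-99' requires knowing the largest value seen.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the binning rule; not taken from DATAFU-98.patch.
class BinnedFrequencySketch {

    // Returns the label of the bin holding `value`,
    // e.g. min=0, binSize=50, value=75 gives "50-99".
    static String binLabel(long value, long min, long binSize) {
        if (binSize <= 0) {
            throw new IllegalArgumentException("binSize must be positive");
        }
        if (value < min) {
            throw new IllegalArgumentException("value is below min");
        }
        long index = (value - min) / binSize;   // zero-based bin index
        long lower = min + index * binSize;     // inclusive lower edge
        long upper = lower + binSize - 1;       // inclusive upper edge
        return lower + "-" + upper;
    }

    // Counts values per bin. Bins between min and the largest value are
    // created up front so that empty bins are reported with a count of 0.
    static Map<String, Long> histogram(long[] values, long min, long binSize) {
        if (binSize <= 0) {
            throw new IllegalArgumentException("binSize must be positive");
        }
        long max = min;
        for (long v : values) {
            max = Math.max(max, v);
        }
        Map<String, Long> counts = new LinkedHashMap<>();
        long lastIndex = (max - min) / binSize;
        for (long i = 0; i <= lastIndex; i++) {
            counts.put(binLabel(min + i * binSize, min, binSize), 0L);
        }
        for (long v : values) {
            counts.merge(binLabel(v, min, binSize), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] numConnections = {3, 49, 100, 120, 149};
        // prints {0-49=2, 50-99=0, 100-149=3}
        System.out.println(histogram(numConnections, 0, 50));
    }
}
```

A counter-based version would presumably call something like `binLabel` once per record and increment a counter named after the returned label, so no reducer or preceding GROUP BY would be needed; empty bins would then simply not appear unless they were created separately.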