[ 
https://issues.apache.org/jira/browse/DATAFU-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624120#comment-14624120
 ] 

Russell Melick commented on DATAFU-98:
--------------------------------------

Posted RB: https://reviews.apache.org/r/36439/

> New UDF for Histogram / Frequency counting
> ------------------------------------------
>
>                 Key: DATAFU-98
>                 URL: https://issues.apache.org/jira/browse/DATAFU-98
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Russell Melick
>         Attachments: DATAFU-98.patch
>
>
> I was thinking of creating a new UDF to compute histograms / frequency counts 
> of input bags.  It seems like it would make sense to support ints, longs, 
> floats, and doubles.  
> I tried looking around to see if this was already implemented, but 
> ValueHistogram and AggregateWordHistogram were about the only things I found. 
>  They seem to exist as an example job, and only work for Strings.
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/ValueHistogram.html
> https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/examples/AggregateWordHistogram.html
> Should the user specify the bin size or the number of bins?  Specifying bin 
> size probably makes the implementation simpler since you can bin things 
> without having seen all of the data.
> I think it would make sense to implement a version of this that didn't need 
> any reducers.  It could use counters to keep track of the counts per bin 
> without sending any data to a reducer.  You would be able to call this 
> without a preceding GROUP BY as well.
> Here's my proposal for the two UDFs.  This assumes the input data has two 
> columns, memberId and numConnections.
> {code}
> DEFINE BinnedFrequency datafu.pig.stats.BinnedFrequency('min=0;binSize=50');
> connections = LOAD 'connections' AS (memberId, numConnections);
> connectionHistogram = FOREACH (GROUP connections ALL) GENERATE 
> BinnedFrequency(connections.numConnections);
> {code}
> The output here would be a bag with the frequency counts
> {code}
> {('0-49', 5), ('50-99', 0), ('100-149', 10)}
> {code}
> {code}
> DEFINE BinnedFrequencyCounter 
> datafu.pig.stats.BinnedFrequencyCounter('min=0;binSize=50;name=numConnectionsHistogram');
> connections = LOAD 'connections' AS (memberId, numConnections);
> connections = FOREACH connections GENERATE 
> BinnedFrequencyCounter(numConnections);
> {code}
> The output here would just be a counter for each bin, all in the same 
> counter group, numConnectionsHistogram.  It would look something like
> numConnectionsHistogram.'0-49' = 5
> numConnectionsHistogram.'50-99' = 0
> numConnectionsHistogram.'100-149' = 10
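
The fixed-bin-size idea in the quoted proposal can be sketched in Python as below. This is only an illustration of the binning arithmetic, not the DataFu implementation: the function names bin_label and binned_frequency are hypothetical, and skipping values below min, using integer-only labels, and emitting empty bins up to the highest observed bin are all assumptions.

```python
from collections import Counter


def bin_label(index, min_value=0, bin_size=50):
    """Label for the bin at `index`, e.g. '50-99' (assumes integer inputs)."""
    low = min_value + index * bin_size
    high = low + bin_size - 1
    return f"{low}-{high}"


def binned_frequency(values, min_value=0, bin_size=50):
    """Count values per fixed-width bin in a single pass.

    Each value is assigned to a bin as soon as it is seen, so the full
    data range does not need to be known up front.
    """
    counts = Counter()
    for value in values:
        if value < min_value:
            # Assumption: values below the configured minimum are skipped.
            continue
        counts[(value - min_value) // bin_size] += 1
    if not counts:
        return []
    # Emit every bin up to the highest one seen, so empty bins show a 0.
    return [
        (bin_label(i, min_value, bin_size), counts[i])
        for i in range(max(counts) + 1)
    ]


if __name__ == "__main__":
    sample = [10] * 5 + [120] * 10  # made-up numConnections values
    print(binned_frequency(sample))
```

With the made-up sample above, the result has the same shape as the bag in the proposal: one (label, count) pair per bin, including a zero count for the empty middle bin.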



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
