[
https://issues.apache.org/jira/browse/DATAFU-98?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthew Hayes closed DATAFU-98.
-------------------------------
Resolution: Won't Do
Closing this as it is quite old and there have been no updates.
> New UDF for Histogram / Frequency counting
> ------------------------------------------
>
> Key: DATAFU-98
> URL: https://issues.apache.org/jira/browse/DATAFU-98
> Project: DataFu
> Issue Type: New Feature
> Reporter: Russell Melick
> Priority: Major
> Attachments: DATAFU-98.patch
>
>
> I was thinking of creating a new UDF to compute histograms / frequency counts
> of input bags. It seems like it would make sense to support ints, longs,
> floats, and doubles.
> I tried looking around to see if this was already implemented, but
> ValueHistogram and AggregateWordHistogram were about the only things I found.
> They seem to exist as an example job, and only work for Strings.
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/ValueHistogram.html
> https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/examples/AggregateWordHistogram.html
> Should the user specify the bin size or the number of bins? Specifying bin
> size probably makes the implementation simpler since you can bin things
> without having seen all of the data.
> I think it would make sense to implement a version of this that didn't need
> any reducers. It could use counters to keep track of the counts per bin
> without sending any data to a reducer. You would be able to call this
> without a preceding GROUP BY as well.
> Here's my proposal for the two UDFs. This assumes the input data has two
> columns, memberId and numConnections.
> {code}
> DEFINE BinnedFrequency datafu.pig.stats.BinnedFrequency('min=0;binSize=50');
> connections = LOAD 'connections' AS (memberId, numConnections);
> connectionHistogram = FOREACH (GROUP connections ALL) GENERATE
> BinnedFrequency(connections.numConnections);
> {code}
> The output here would be a bag with the frequency counts:
> {code}
> {('0-49', 5), ('50-99', 0), ('100-149', 10)}
> {code}
> {code}
> DEFINE BinnedFrequencyCounter
> datafu.pig.stats.BinnedFrequencyCounter('min=0;binSize=50;name=numConnectionsHistogram');
> connections = LOAD 'connections' AS (memberId, numConnections);
> connections = FOREACH connections GENERATE
> BinnedFrequencyCounter(numConnections);
> {code}
> The output here would just be a counter for each bin, all in the same
> counter group, numConnectionsHistogram. It would look something like:
> numConnectionsHistogram.'0-49' = 5
> numConnectionsHistogram.'50-99' = 0
> numConnectionsHistogram.'100-149' = 10
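
For reference, the fixed-bin-size counting described in the quoted proposal can be sketched in plain Java roughly as follows. This is only an illustration under assumptions: the class and method names are hypothetical, input is limited to long values, and labels follow the 'low-high' style from the examples above. It is not the attached DATAFU-98.patch and not an existing DataFu API. How values below min should be handled is not specified in the proposal; here they simply land in negative-numbered bins.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of fixed-width binning; names are illustrative only.
class BinnedFrequencySketch {

    // Index of the bin that contains value. Because the bin size is fixed,
    // this needs no knowledge of the rest of the data.
    static long binIndex(long min, long binSize, long value) {
        return Math.floorDiv(value - min, binSize);
    }

    // Label in the "low-high" style of the proposal, e.g. "50-99".
    static String binLabel(long min, long binSize, long index) {
        long low = min + index * binSize;
        return low + "-" + (low + binSize - 1);
    }

    // Counts values per bin in a single pass, then emits every bin between
    // the first and last observed one so that empty bins show up as 0.
    static Map<String, Long> histogram(long min, long binSize, long[] values) {
        TreeMap<Long, Long> counts = new TreeMap<>();
        for (long v : values) {
            counts.merge(binIndex(min, binSize, v), 1L, Long::sum);
        }
        Map<String, Long> result = new LinkedHashMap<>();
        if (counts.isEmpty()) {
            return result;
        }
        for (long i = counts.firstKey(); i <= counts.lastKey(); i++) {
            result.put(binLabel(min, binSize, i), counts.getOrDefault(i, 0L));
        }
        return result;
    }

    public static void main(String[] args) {
        long[] numConnections = {3, 10, 49, 100, 149};
        // Prints {0-49=3, 50-99=0, 100-149=2}
        System.out.println(histogram(0, 50, numConnections));
    }
}
```

The counter-based variant would presumably reuse the same binIndex/binLabel step per input row and increment a named counter instead of building a map.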