Russell Melick created DATAFU-98:
------------------------------------
Summary: New UDF for Histogram / Frequency counting
Key: DATAFU-98
URL: https://issues.apache.org/jira/browse/DATAFU-98
Project: DataFu
Issue Type: New Feature
Reporter: Russell Melick
I am thinking of creating a new UDF to compute histograms (frequency counts)
of input bags. It seems sensible to support ints, longs, floats, and
doubles.
I looked around to see whether this was already implemented, but
ValueHistogram and AggregateWordHistogram were about the only things I found.
They appear to exist only as part of an example job, and they only work for Strings.
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/ValueHistogram.html
https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/examples/AggregateWordHistogram.html
Should the user specify the bin size or the number of bins? Specifying bin
size probably makes the implementation simpler since you can bin things without
having seen all of the data.
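To illustrate why a fixed bin size keeps the implementation simple, here is a minimal sketch in Python (not the proposed UDF itself). The names bin_index and streaming_counts are hypothetical. With min and binSize fixed up front, each value maps to its bin independently of every other value, so one pass is enough. If the user specified a number of bins instead, the data range would have to be known before anything could be binned.

```python
import math


def bin_index(value, min_value, bin_size):
    """Return the zero-based bin for value using only the fixed parameters."""
    return math.floor((value - min_value) / bin_size)


def streaming_counts(values, min_value, bin_size):
    """Count values per bin in a single pass, without knowing the data range."""
    counts = {}
    for v in values:
        i = bin_index(v, min_value, bin_size)
        counts[i] = counts.get(i, 0) + 1
    return counts
```

In this sketch a value below min gets a negative bin index. Whether such values should be dropped, clamped, or rejected is left open here.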
I think it would make sense to implement a version of this that didn't need any
reducers. It could use counters to keep track of the counts per bin without
sending any data to a reducer. You would be able to call this without a
preceding GROUP BY as well.
Here's my proposal for the two UDFs. This assumes the input data has two
columns, memberId and numConnections.
{code}
DEFINE BinnedFrequency datafu.pig.stats.BinnedFrequency('min=0;binSize=50');
connections = LOAD 'connections' AS (memberId, numConnections);
connectionHistogram = FOREACH (GROUP connections ALL) GENERATE
  BinnedFrequency(connections.numConnections);
{code}
The output here would be a bag with the frequency counts per bin:
{code}
{('0-49', 5), ('50-99', 0), ('100-149', 10)}
{code}
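As a rough sketch of how the bag above could be produced, the following Python function bins integer inputs and fills in empty bins up to the highest occupied one. The function name binned_frequency, the inclusive "lo-hi" label format for integers, and the choice to drop values below min are all assumptions made for illustration, not settled parts of the proposal.

```python
def binned_frequency(values, min_value=0, bin_size=50):
    """Return a list of (label, count) pairs, one per bin, for integer inputs."""
    counts = {}
    for v in values:
        if v < min_value:
            continue  # assumption: values below min are dropped
        i = (v - min_value) // bin_size
        counts[i] = counts.get(i, 0) + 1
    if not counts:
        return []
    result = []
    # Emit every bin up to the highest occupied one, so empty bins show as 0.
    for i in range(max(counts) + 1):
        lo = min_value + i * bin_size
        hi = lo + bin_size - 1
        result.append((f"{lo}-{hi}", counts.get(i, 0)))
    return result
```

Float and double inputs would need a different label convention (for example half-open ranges), which this sketch does not attempt.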
{code}
DEFINE BinnedFrequencyCounter datafu.pig.stats.BinnedFrequencyCounter('min=0;binSize=50;name=numConnectionsHistogram');
connections = LOAD 'connections' AS (memberId, numConnections);
connections = FOREACH connections GENERATE
  BinnedFrequencyCounter(numConnections);
{code}
The output here would just be a counter for each bin, all sharing the same
counter group, numConnectionsHistogram. It would look something like:
numConnectionsHistogram.'0-49' = 5
numConnectionsHistogram.'50-99' = 0
numConnectionsHistogram.'100-149' = 10
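
The counter-based variant can be simulated in plain Python to show why no reducer is needed: each task increments its own (group, bin) counters per record, and the per-task counters are simply summed afterwards. This is only a sketch. count_record and merge_counters are hypothetical names, collections.Counter stands in for the job counters, and integer inputs are assumed.

```python
from collections import Counter


def count_record(counters, group, value, min_value=0, bin_size=50):
    """Increment the counter for the bin that value falls into."""
    lo = min_value + ((value - min_value) // bin_size) * bin_size
    counters[(group, f"{lo}-{lo + bin_size - 1}")] += 1


def merge_counters(*per_task):
    """Sum per-task counters, standing in for the framework's aggregation."""
    return sum(per_task, Counter())


# Two simulated tasks, each seeing part of the data.
task_a, task_b = Counter(), Counter()
for v in [10] * 5:
    count_record(task_a, "numConnectionsHistogram", v)
for v in [120] * 10:
    count_record(task_b, "numConnectionsHistogram", v)
total = merge_counters(task_a, task_b)
```

One difference from the listing above: in this sketch a bin that never receives a value has no counter at all, so '50-99' = 0 would only appear if empty bins were created explicitly.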