[jira] [Commented] (DATAFU-98) New UDF for Histogram / Frequency counting

2016-10-25 Thread Eyal Allweil (JIRA)

[ https://issues.apache.org/jira/browse/DATAFU-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605952#comment-15605952 ]

Eyal Allweil commented on DATAFU-98:


Hi Russell.

First of all, I want to apologize for the time it's taken us to get to your 
contribution. I think it could be quite useful. Having said that, I wonder if 
the current version - without counters - gives us enough of an advantage over 
vanilla Pig. I think the following code (modified from your unit test) gives us 
nearly the same functionality as the UDF in the patch:

{noformat}
data_in = LOAD 'input' as (val:int);
-- data_in: "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "20"

intermediate_data = FOREACH data_in GENERATE val, (val / 5 * 5) AS binStart;

data_out = FOREACH (GROUP intermediate_data BY binStart) GENERATE group AS 
binStart, COUNT(intermediate_data) AS binCount;
-- data_out: (0,5),(5,5),(10,2),(20,1)

{noformat}
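
The binning step above relies on integer division: because val is loaded as an int, val / 5 truncates before the multiplication by 5. As a rough cross-check of the expected output, here is a minimal Python sketch of the same arithmetic. Python is used only for illustration, and the name bin_counts is hypothetical, not something from the patch. Note that Python's floor division matches Pig's int division only for the non-negative values used here.

```python
from collections import Counter


def bin_counts(values, bin_size):
    """Count values per bin, keyed by bin start. Empty bins are simply absent."""
    # v // bin_size * bin_size maps each value to the start of its bin,
    # mirroring (val / 5 * 5) in the Pig snippet for non-negative ints.
    return Counter(v // bin_size * bin_size for v in values)


# Same input as the Pig example: 0..11 plus 20.
values = list(range(12)) + [20]
counts = bin_counts(values, 5)
print(sorted(counts.items()))  # [(0, 5), (5, 5), (10, 2), (20, 1)]
```
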

Unlike your UDF, this code does not include missing bins. Including them can
be useful, but I do wonder whether a single skewed value could cause problems,
especially with small bin sizes and long values, since every empty bin between
the rest of the data and the outlier would also have to be produced.

As a performance-related aside, I would try to call FrequencyCounter.toBag()
only in the Final implementation, instead of in the first two stages of the
algebraic implementation, to minimize the data copied.
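
To make the missing-bin concern concrete, the following Python sketch fills in empty bins for a sparse result and estimates how many bins a dense histogram would need for a given value range. It is an illustration under stated assumptions, not the patch's logic: both function names are hypothetical, and bin starts are assumed to be multiples of the bin size.

```python
def fill_missing_bins(sparse_counts, bin_size):
    """Return a dense list of (bin_start, count), with zero counts for empty bins."""
    if not sparse_counts:
        return []
    lo, hi = min(sparse_counts), max(sparse_counts)
    # Walk every bin start from the lowest to the highest observed bin.
    return [(start, sparse_counts.get(start, 0))
            for start in range(lo, hi + bin_size, bin_size)]


def dense_bin_total(min_value, max_value, bin_size):
    """Number of bins a dense histogram covers between two non-negative values."""
    return max_value // bin_size - min_value // bin_size + 1


sparse = {0: 5, 5: 5, 10: 2, 20: 1}
print(fill_missing_bins(sparse, 5))  # adds the empty (15, 0) bin

# A single outlier near the top of the int range with bin size 5
# would force more than 400 million bins.
print(dense_bin_total(0, 2**31 - 1, 5))
```
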

So it seems to me the current UDF has the advantage of including the missing
bins, and it's obviously more readable and convenient than rewriting the Pig
code I wrote above. Did you (or you, [~andrew.musselman]) run any performance
tests? Maybe the Algebraic implementation runs faster than the vanilla Pig code
because it can use the combiner.
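
For context on why a combiner might help, Pig's Algebraic interface splits an aggregate into Initial, Intermed and Final stages, so partial results can be merged before reaching the reducer. The Python sketch below mimics that shape for bin counting. It is only a sketch: BIN_SIZE and the three function names are hypothetical, and it does not reproduce the code in DATAFU-98.patch. The partial counts stay in a compact map until the final stage, which is the only place they are converted to the output form.

```python
from collections import Counter

BIN_SIZE = 5  # hypothetical bin size for this illustration


def initial(value):
    """Map side: turn one non-negative value into a partial count for its bin."""
    return Counter({value // BIN_SIZE * BIN_SIZE: 1})


def intermed(partials):
    """Combiner side: merge partial counts without expanding them."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged


def final(partials):
    """Reducer side: merge once more, then convert to sorted (bin_start, count) pairs."""
    return sorted(intermed(partials).items())


# Two pretend map tasks, each combined locally before the final merge.
chunk_a = [0, 1, 2, 3, 4, 5, 6]
chunk_b = [7, 8, 9, 10, 11, 20]
combined_a = intermed(initial(v) for v in chunk_a)
combined_b = intermed(initial(v) for v in chunk_b)
print(final([combined_a, combined_b]))  # [(0, 5), (5, 5), (10, 2), (20, 1)]
```
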

Last (but not least!), the version you mentioned with counters sounds like it
could be really great.


> New UDF for Histogram / Frequency counting
> ------------------------------------------
>
>                 Key: DATAFU-98
>                 URL: https://issues.apache.org/jira/browse/DATAFU-98
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Russell Melick
>         Attachments: DATAFU-98.patch
>
>
> I was thinking of creating a new UDF to compute histograms / frequency counts
> of input bags.  It seems like it would make sense to support ints, longs,
> floats, and doubles.
>
> I tried looking around to see if this was already implemented, but
> ValueHistogram and AggregateWordHistogram were about the only things I found.
> They seem to exist as an example job, and only work for Strings.
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/ValueHistogram.html
> https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/examples/AggregateWordHistogram.html
>
> Should the user specify the bin size or the number of bins?  Specifying bin
> size probably makes the implementation simpler since you can bin things
> without having seen all of the data.
>
> I think it would make sense to implement a version of this that didn't need
> any reducers.  It could use counters to keep track of the counts per bin
> without sending any data to a reducer.  You would be able to call this
> without a preceding GROUP BY as well.
>
> Here's my proposal for the two UDFs.  This assumes the input data is two
> columns, memberId and numConnections.
> {code}
> DEFINE BinnedFrequency datafu.pig.stats.BinnedFrequency('min=0;binSize=50');
> connections = LOAD 'connections' AS (memberId, numConnections);
> connectionHistogram = FOREACH (GROUP connections ALL) GENERATE
> BinnedFrequency(connections.numConnections);
> {code}
> The output here would be a bag with the frequency counts
> {code}
> {('0-49', 5), ('50-99', 0), ('100-149', 10)}
> {code}
> {code}
> DEFINE BinnedFrequencyCounter
> datafu.pig.stats.BinnedFrequencyCounter('min=0;binSize=50;name=numConnectionsHistogram');
> connections = LOAD 'connections' AS (memberId, numConnections);
> connections = FOREACH connections GENERATE
> BinnedFrequencyCounter(numConnections);
> {code}
> The output here would just be a counter for each bin, all sharing the same 
> group of numConnectionsHistogram.  It would look something like
> numConnectionsHistogram.'0-49' = 5
> numConnectionsHistogram.'50-99' = 0
> numConnectionsHistogram.'100-149' = 10



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-98) New UDF for Histogram / Frequency counting

2015-08-26 Thread Andrew Musselman (JIRA)

[ https://issues.apache.org/jira/browse/DATAFU-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716098#comment-14716098 ]

Andrew Musselman commented on DATAFU-98:


Cool, this would be a good addition. I haven't tried it out yet, but I can when
I get back from vacation in the first week of September.



[jira] [Commented] (DATAFU-98) New UDF for Histogram / Frequency counting

2015-08-25 Thread Andrew Musselman (JIRA)

[ https://issues.apache.org/jira/browse/DATAFU-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711807#comment-14711807 ]

Andrew Musselman commented on DATAFU-98:


Have you tested that patch with Pig 0.12 by chance?  Could come in handy for me 
today.
