[jira] Commented: (HIVE-1397) histogram() UDAF for a numerical column

2010-06-15 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879177#action_12879177
 ] 

John Sichi commented on HIVE-1397:
--

+1.  Will commit if tests pass.


> histogram() UDAF for a numerical column
> ---
>
> Key: HIVE-1397
> URL: https://issues.apache.org/jira/browse/HIVE-1397
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Query Processor
>Affects Versions: 0.6.0
>Reporter: Mayank Lahiri
>Assignee: Mayank Lahiri
> Fix For: 0.6.0
>
> Attachments: Histogram_quality.png.jpg, HIVE-1397.1.patch, 
> HIVE-1397.2.patch
>
>
> A histogram() UDAF to generate an approximate histogram of a numerical (byte, 
> short, double, long, etc.) column. The result is returned as a map of (x,y) 
> histogram pairs, and can be plotted in Gnuplot using impulses (for example). 
> The algorithm is currently adapted from "A streaming parallel decision tree 
> algorithm" by Ben-Haim and Tom-Tov, JMLR 11 (2010), and uses space 
> proportional to the number of histogram bins specified. It has no 
> approximation guarantees, but seems to work well when there is a lot of data 
> and a large number (e.g. 50-100) of histogram bins specified.
> A typical call might be:
> SELECT histogram(val, 10) FROM some_table;
> where the result would be a histogram with 10 bins, returned as a Hive map 
> object.
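
The bin-merging idea that the description attributes to Ben-Haim and Tom-Tov can be sketched roughly as follows. This is a minimal illustration, not the code in the attached patches: the class name `StreamingHistogram` and its methods `add`, `merge`, and `_trim` are hypothetical. The sketch assumes each bin is a (centroid, count) pair and that, whenever the bin limit is exceeded, the two closest adjacent bins are replaced by their count-weighted average.

```python
import bisect


class StreamingHistogram:
    """Approximate histogram that keeps at most max_bins (centroid, count) bins.

    Hypothetical sketch; names and structure are assumptions, not the patch.
    """

    def __init__(self, max_bins):
        if max_bins < 1:
            raise ValueError("max_bins must be at least 1")
        self.max_bins = max_bins
        self.bins = []  # list of [centroid, count], sorted by centroid

    def add(self, value, count=1.0):
        """Insert one observation, then shrink back to max_bins if needed."""
        value = float(value)
        centroids = [b[0] for b in self.bins]
        i = bisect.bisect_left(centroids, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            # An existing bin already sits exactly on this value.
            self.bins[i][1] += count
        else:
            self.bins.insert(i, [value, count])
        self._trim()

    def merge(self, other):
        """Combine another partial histogram into this one."""
        self.bins = sorted([list(b) for b in self.bins + other.bins])
        self._trim()

    def _trim(self):
        """Merge the closest adjacent pair until only max_bins bins remain."""
        while len(self.bins) > self.max_bins:
            j = min(
                range(len(self.bins) - 1),
                key=lambda k: self.bins[k + 1][0] - self.bins[k][0],
            )
            (x1, y1), (x2, y2) = self.bins[j], self.bins[j + 1]
            total = y1 + y2
            # Replace the pair with its count-weighted centroid.
            self.bins[j:j + 2] = [[(x1 * y1 + x2 * y2) / total, total]]
```

Space stays proportional to `max_bins`, which matches the description above. The `merge` method hints at how partial results from separate tasks might be combined, although the thread does not say how the patch does this.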

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1397) histogram() UDAF for a numerical column

2010-06-15 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879094#action_12879094
 ] 

John Sichi commented on HIVE-1397:
--

histogram_numeric is fine.




[jira] Commented: (HIVE-1397) histogram() UDAF for a numerical column

2010-06-15 Thread Mayank Lahiri (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879081#action_12879081
 ] 

Mayank Lahiri commented on HIVE-1397:
-

Does 'histogram_numeric()' work? And I'm patching it to return a list of
(x,y) structs. 










[jira] Commented: (HIVE-1397) histogram() UDAF for a numerical column

2010-06-15 Thread HBase Review Board (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879077#action_12879077
 ] 

HBase Review Board commented on HIVE-1397:
--

Message from: "John Sichi" 


bq.  On None, John Sichi wrote:
bq.  > 
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java,
 line 346
bq.  > 
bq.  >
bq.  > Since eventually we would like to support histograms on non-numeric 
fields such as STRING, I think we should rename this one numeric_histogram 
(likewise for the Java class) to avoid confusion later when we have other 
algorithms.
bq.  >
bq.  
bq.  Mayank Lahiri wrote:
bq.  It might seem a little odd since histograms are generally used as 
approximations of numerical distributions. I would suggest either (a) 
overloading histogram() to behave differently on STRING arguments (perhaps 
STRING arguments that cause a NumberFormatException), or (b) creating a 
factor_histogram() function for general strings. 
bq.  
bq.  I could add in the code for computing frequencies of STRINGs quite 
easily, although there's no way to prevent it from choking if there are too 
many unique strings.

I don't think we should handle strings now, but we should rename this one to 
make it clear that it only works on numeric data.  And per our discussion 
offline, reject attempts to use it on non-numeric data.


- John


---
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/181/#review224
---








[jira] Commented: (HIVE-1397) histogram() UDAF for a numerical column

2010-06-15 Thread HBase Review Board (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879054#action_12879054
 ] 

HBase Review Board commented on HIVE-1397:
--

Message from: "Mayank Lahiri" 


bq.  On None, John Sichi wrote:
bq.  > 
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java,
 line 346
bq.  > 
bq.  >
bq.  > Since eventually we would like to support histograms on non-numeric 
fields such as STRING, I think we should rename this one numeric_histogram 
(likewise for the Java class) to avoid confusion later when we have other 
algorithms.
bq.  >

It might seem a little odd since histograms are generally used as 
approximations of numerical distributions. I would suggest either (a) 
overloading histogram() to behave differently on STRING arguments (perhaps 
STRING arguments that cause a NumberFormatException), or (b) creating a 
factor_histogram() function for general strings. 

I could add in the code for computing frequencies of STRINGs quite easily, 
although there's no way to prevent it from choking if there are too many unique 
strings.


bq.  On None, John Sichi wrote:
bq.  > 
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java,
 line 239
bq.  > 
bq.  >
bq.  > Shouldn't this method be private?

yes! thanks!


bq.  On None, John Sichi wrote:
bq.  > 
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java,
 line 186
bq.  > 
bq.  >
bq.  > Formatting nitpick:  Here and elsewhere, use braces even for 
single-line blocks; we use the NeedBraces checkstyle rule to flag this.
bq.  > 
bq.  > 
http://stackoverflow.com/questions/382633/can-the-checkstyle-module-needbraces-work-with-nested-if-else-blocks
bq.  >

Will re-submit patch, thanks.


bq.  On None, John Sichi wrote:
bq.  > 
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java,
 line 362
bq.  > 
bq.  >
bq.  > Under what conditions can this exception be encountered?  Shouldn't 
it be impossible since we already checked the type up front?

This could happen when a STRING row contains a non-numeric value. As a 
follow-up to my earlier comment, we could either drop the value or somehow 
"intelligently" switch to computing the histogram over strings instead of 
doubles.
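
The "drop the value" option mentioned above could look something like this. It is only a sketch under stated assumptions: the helper name `numeric_values` is hypothetical, rows are assumed to arrive as strings (or nulls), and skipping NaN and infinite values is an extra choice made here that the thread does not discuss.

```python
import math


def numeric_values(rows):
    """Yield each row that parses as a finite float; silently skip the rest.

    Hypothetical helper illustrating the "drop the value" option.
    """
    for raw in rows:
        try:
            value = float(raw)
        except (TypeError, ValueError):
            # Null or non-numeric string: drop it rather than fail the query.
            continue
        if math.isfinite(value):
            yield value
```

For example, `list(numeric_values(["1", "2.5", "abc", None]))` keeps only the two parseable values. The alternative raised in the thread, switching to a histogram over strings, would need a different aggregation and is not sketched here.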


bq.  On None, John Sichi wrote:
bq.  > 
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java,
 line 153
bq.  > 
bq.  >
bq.  > Is it possible to return an ARRAY 
instead?  That seems more natural (and compact) than a MAP.
bq.  > 
bq.  > But if you already got feedback that MAP is preferable, ignore this 
comment.
bq.  >

Not a problem. I was using the map to avoid an extra level of indirection, and 
possibly to make it compatible with an explode() extension that explodes maps 
as well as arrays. 
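
To make the two result shapes under discussion concrete, here is a small sketch that converts a map-style result into an ordered list of (x, y) records. The names `Bin` and `map_to_bins` are hypothetical, and the Python dict and list only stand in for the Hive MAP and ARRAY types.

```python
from typing import NamedTuple


class Bin(NamedTuple):
    """One histogram point: x is the bin centre, y is the bin height."""
    x: float
    y: float


def map_to_bins(hist_map):
    """Turn a {x: y} mapping into a list of Bin records ordered by x."""
    return [Bin(x, y) for x, y in sorted(hist_map.items())]
```

The list form carries an explicit ordering and named fields, which may be what the reviewer meant by "more natural"; the map form needs no wrapper record around each pair.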


- Mayank








[jira] Commented: (HIVE-1397) histogram() UDAF for a numerical column

2010-06-14 Thread HBase Review Board (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878808#action_12878808
 ] 

HBase Review Board commented on HIVE-1397:
--

Message from: "John Sichi" 

---
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/181/
---

Review request for Hive Developers.


Summary
---

Submitted on behalf of Mayank Lahiri.


This addresses bug HIVE-1397.
http://issues.apache.org/jira/browse/HIVE-1397


Diffs
-

  
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java
 953449 
  
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java
 PRE-CREATION 
  
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientpositive/udaf_histogram.q
 PRE-CREATION 
  
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientpositive/show_functions.q.out
 953449 
  
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientpositive/udaf_histogram.q.out
 PRE-CREATION 

Diff: http://review.hbase.org/r/181/diff


Testing
---

None


Thanks,

John







[jira] Commented: (HIVE-1397) histogram() UDAF for a numerical column

2010-06-14 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878807#action_12878807
 ] 

John Sichi commented on HIVE-1397:
--

I am testing out Review Board here:

http://review.hbase.org/r/181/

I think it will be automatically appending my comments here in JIRA as well, 
but give that URL a try to see if you can see them.





[jira] Commented: (HIVE-1397) histogram() UDAF for a numerical column

2010-06-09 Thread Ashish Thusoo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877233#action_12877233
 ] 

Ashish Thusoo commented on HIVE-1397:
-

+1.

This would be a cool contribution.





[jira] Commented: (HIVE-1397) histogram() UDAF for a numerical column

2010-06-09 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877220#action_12877220
 ] 

Edward Capriolo commented on HIVE-1397:
---

Looks great. Cannot wait.
