[jira] [Commented] (MADLIB-1128) Add GROUP BY to correlation functions

Nikhil (JIRA) Tue, 01 May 2018 12:08:28 -0700

    [ 
https://issues.apache.org/jira/browse/MADLIB-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16460030#comment-16460030
 ]


Nikhil commented on MADLIB-1128:
--------------------------------

To support grouping we have to change the query that deconstructs the matrix to 
get the lower triangle.
{code}
SELECT * FROM
{schema_madlib}.__deconstruct_lower_triangle(
(SELECT {cor_mat} FROM {temp_output_table})
) AS deconstructed(column_position integer, {variable_list_str})
{code}
There are 2 possibilities
1. Instead of passing each row to the __deconstruct_lower_triangle function, we 
can pass the entire column and use the group by clause. This syntax is not 
allowed in postgres/gpdb
{code}
SELECT * FROM
{schema_madlib}.__deconstruct_lower_triangle(
(SELECT {cor_mat}) AS deconstructed(column_position integer, 
{variable_list_str})
) FROM  {temp_output_table})
{code}
Two possibilities to work around this
a.) postgres 9.3 introduced LATERAL joins which solves the problem for us but 
it won't work with gpdb 4.3/5. 
{code}
SELECT * FROM 
    {temp_output_table}, 
    LATERAL (SELECT * FROM 
{schema_madlib}.__deconstruct_lower_triangle((cor_mat)) AS 
                deconstructed(column_position integer, {variable_list_str})) q
{code}
b.) Create our own type and change the __deconstruct_lower_triangle function to 
return this custom type instead of the record type. The problem here is that we 
don't know the type in advance which means we will need to create the function 
in runtime which is not something we want to do.

2. The other solution is to do a union all for all the groups, so the query 
will look like
{code}
SELECT *, {select_grouping_cols}
FROM {schema_madlib}.__deconstruct_lower_triangle(
(SELECT {cor_mat} FROM {temp_output_table} where gr = 1)
) AS deconstructed(column_position integer, {variable_list_str})
UNION ALL
SELECT *, {select_grouping_cols}
FROM {schema_madlib}.__deconstruct_lower_triangle(
(SELECT {cor_mat} FROM {temp_output_table} where gr = 2)
) AS deconstructed(column_position integer, {variable_list_str})
{code}

We have decided to go with the second solution

> Add GROUP BY to correlation functions
> -------------------------------------
>
>                 Key: MADLIB-1128
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1128
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Descriptive Statistics
>            Reporter: Frank McQuillan
>            Assignee: Nandish Jayaram
>            Priority: Minor
>             Fix For: v1.15
>
>
> http://madlib.incubator.apache.org/docs/latest/group__grp__correlation.html
> add:
> grouping_cols (optional)
> TEXT, default: null. A comma-separated list of columns on which to group 
> results. If NULL, correlations/covariance is produced on the complete table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (MADLIB-1128) Add GROUP BY to correlation functions

Reply via email to