Github user decibel commented on a diff in the pull request:
https://github.com/apache/incubator-madlib/pull/47#discussion_r67943088
--- Diff: src/ports/postgres/modules/utilities/pivot.py_in ---
@@ -0,0 +1,201 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Pivoting
+# The goal of the MADlib pivot function is to provide a data summarization
+# tool that can do basic OLAP-type operations on data stored in one table
+# and output the summarized data to a second table. Typical operations are
+# count, average, min, max, and standard deviation; however, user-defined
+# aggregates (UDAs) are also allowed.
+
+# Please refer to the pivot.sql_in file for the documentation
+
+"""
+@file pivot.py_in
+
+"""
+import plpy
+from utilities import _assert
+from utilities import split_quoted_delimited_str
+from utilities import strip_end_quotes
+from validate_args import table_exists
+from validate_args import columns_exist_in_table
+from validate_args import table_is_empty
+from validate_args import _get_table_schema_names
+from validate_args import get_first_schema
+
+m4_changequote(`<!', `!>')
+
+def pivot(schema_madlib, source_table, out_table,
+ index, pivot_cols, pivot_values,
+ aggregate_func, **kwargs):
+ """
+ Helper function that can be used to pivot tables
+ Args:
+ @param source_table The original data table
+ @param out_table The output table that contains the dummy
+ variable columns
+ @param index The index columns to group the records by
+ @param pivot_cols The columns to pivot the table by
+ @param pivot_values The value columns to be summarized in the
+ pivoted table
+ @param aggregate_func The aggregate function to be applied to the
+ values
+
+ """
+ indices = split_quoted_delimited_str(index)
+ pcols = split_quoted_delimited_str(pivot_cols)
+ pvals = split_quoted_delimited_str(pivot_values)
+ # aggregate_func = "sum"
+ validate_pivot_coding(source_table, out_table, indices, pcols, pvals)
+ new_col_names = []
+ sql_list = ["CREATE TABLE " + out_table + " AS (SELECT " + index]
+ # Preparation for multiple index, pivot, etc.
+ for pcol in pcols:
+ for pval in pvals:
+ pcol_no_quotes = strip_end_quotes(pcol.strip())
+ pval_no_quotes = strip_end_quotes(pval.strip())
+ distinct_values = plpy.execute(
--- End diff --
Running a separate query once per pivot column will be very slow. I think it
should be possible to get the same results by building a single query of the form
SELECT
array_agg(column1) OVER( PARTITION BY column1 ORDER BY column1 )
, array_agg(column2) OVER( PARTITION BY column2 ORDER BY column2 )
...
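To make the suggestion concrete, here is a minimal Python sketch of a helper that assembles one statement covering all pivot columns in a single pass. The helper name (`build_distinct_query`) is illustrative and not part of the patch, and it uses `array_agg(DISTINCT ...)` as a plain-aggregate relative of the window form above:

```python
# Hypothetical sketch: instead of issuing one "SELECT DISTINCT pcol FROM source"
# per pivot column inside the loop, build a single query that returns the
# distinct values of every pivot column at once.

def build_distinct_query(source_table, pivot_cols):
    """Return one SQL statement aggregating the distinct values of each
    pivot column into an array, e.g. for columns ["color", "size"]:
        SELECT array_agg(DISTINCT "color") AS "color",
               array_agg(DISTINCT "size") AS "size"
        FROM sales
    """
    select_list = ",\n       ".join(
        'array_agg(DISTINCT "{0}") AS "{0}"'.format(c) for c in pivot_cols)
    return "SELECT {0}\nFROM {1}".format(select_list, source_table)

# In pivot.py_in this would collapse the per-column plpy.execute() calls in
# the loop into a single round trip, roughly:
#   distinct_values = plpy.execute(
#       build_distinct_query(source_table, pcols))[0]
```

This trades N small queries for one wider scan of the source table, which is usually a win when the table is large relative to the number of pivot columns.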