Github user decibel commented on a diff in the pull request:
https://github.com/apache/incubator-madlib/pull/47#discussion_r67943088
--- Diff: src/ports/postgres/modules/utilities/pivot.py_in ---
@@ -0,0 +1,201 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Pivoting
+# The goal of the MADlib pivot function is to provide a data summarization
+# tool that can do basic OLAP-type operations on data stored in one table
+# and output the summarized data to a second table. Typical operations are
+# count, average, min, max, and standard deviation; however, user-defined
+# aggregates (UDAs) are also allowed.
+
+# Please refer to the pivot.sql_in file for the documentation
+
+"""
+@file pivot.py_in
+
+"""
+import plpy
+from utilities import _assert
+from utilities import split_quoted_delimited_str
+from utilities import strip_end_quotes
+from validate_args import table_exists
+from validate_args import columns_exist_in_table
+from validate_args import table_is_empty
+from validate_args import _get_table_schema_names
+from validate_args import get_first_schema
+
+m4_changequote(`<!', `!>')
+
+def pivot(schema_madlib, source_table, out_table,
+ index, pivot_cols, pivot_values,
+ aggregate_func, **kwargs):
+ """
+ Helper function that can be used to pivot tables
+ Args:
+ @param source_table The original data table
+ @param out_table The output table that contains the dummy
+ variable columns
+ @param index The index columns to group the records by
+ @param pivot_cols The columns to pivot the table by
+ @param pivot_values The value columns to be summarized in the
+ pivoted table
+ @param aggregate_func The aggregate function to be applied to the
+ values
+
+ """
+ indices = split_quoted_delimited_str(index)
+ pcols = split_quoted_delimited_str(pivot_cols)
+ pvals = split_quoted_delimited_str(pivot_values)
+ # aggregate_func = "sum"
+ validate_pivot_coding(source_table, out_table, indices, pcols, pvals)
+ new_col_names = []
+ sql_list = ["CREATE TABLE " + out_table + " AS (SELECT " + index]
+ # Preparation for multiple index, pivot, etc.
+ for pcol in pcols:
+ for pval in pvals:
+ pcol_no_quotes = strip_end_quotes(pcol.strip())
+ pval_no_quotes = strip_end_quotes(pval.strip())
+ distinct_values = plpy.execute(
--- End diff --
Running a separate query once per pivot column will be very slow. I think it
should be possible to get the same results by building a single query of the form
SELECT
array_agg(column1) OVER( PARTITION BY column1 ORDER BY column1 )
, array_agg(column2) OVER( PARTITION BY column2 ORDER BY column2 )
...
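To make the suggestion concrete, here is a minimal Python sketch of a helper that assembles one statement covering all pivot columns in a single pass. The helper name (`build_distinct_query`) is illustrative and not part of the patch, and it uses `array_agg(DISTINCT ...)` as a plain-aggregate relative of the window form above:

```python
# Hypothetical sketch: instead of issuing one "SELECT DISTINCT pcol FROM source"
# per pivot column inside the loop, build a single query that returns the
# distinct values of every pivot column at once.

def build_distinct_query(source_table, pivot_cols):
    """Return one SQL statement aggregating the distinct values of each
    pivot column into an array, e.g. for columns ["color", "size"]:
        SELECT array_agg(DISTINCT "color") AS "color",
               array_agg(DISTINCT "size") AS "size"
        FROM sales
    """
    select_list = ",\n       ".join(
        'array_agg(DISTINCT "{0}") AS "{0}"'.format(c) for c in pivot_cols)
    return "SELECT {0}\nFROM {1}".format(select_list, source_table)

# In pivot.py_in this would collapse the per-column plpy.execute() calls in
# the loop into a single round trip, roughly:
#   distinct_values = plpy.execute(
#       build_distinct_query(source_table, pcols))[0]
```

This trades N small queries for one wider scan of the source table, which is usually a win when the table is large relative to the number of pivot columns.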