Frank McQuillan created MADLIB-1265:
---------------------------------------
Summary: Load data from database into PL/Python
Key: MADLIB-1265
URL: https://issues.apache.org/jira/browse/MADLIB-1265
Project: Apache MADlib
Issue Type: New Feature
Components: Module: Utilities
Reporter: Frank McQuillan
Story
`As a data scientist`
I want to easily and efficiently load data from the database into PL/Python memory
`so that`
I can use the loaded data in my PL/Python code.
Interface
```
load_to_plpythonu (
    source_table,               -- source table
    list_of_columns,            -- columns you want in GD, could be '*'
    list_of_columns_to_exclude  -- columns explicitly not to load
);
```
Arguments
```
source_table
TEXT. Name of the table containing the data to load.
list_of_columns
TEXT. Comma-separated string of column names or expressions to load.
Can also be '*', implying that all columns are to be loaded (except for the
ones listed for exclusion in the next argument). The types of the columns can
be mixed. Array columns can also be included in the list and will be loaded
as is (i.e., not flattened). (???)
list_of_columns_to_exclude
TEXT. Comma-separated string of column names to exclude from the load.
Typically used when 'list_of_columns' is set to '*'.
```
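As a rough illustration of how 'list_of_columns' and 'list_of_columns_to_exclude' could interact, here is a minimal Python sketch. The helper name `resolve_columns` and its `table_columns` argument (the table's column names, however they are obtained) are assumptions for illustration and are not part of the proposed interface.

```python
def resolve_columns(table_columns, list_of_columns, list_of_columns_to_exclude=None):
    """Return the column names/expressions to load, in order (hypothetical helper)."""
    def split_csv(text):
        # Naive split: does not handle commas inside expressions such as f(a, b).
        return [item.strip() for item in (text or "").split(",") if item.strip()]

    excluded = set(split_csv(list_of_columns_to_exclude))
    if list_of_columns.strip() == "*":
        # '*' means every column of the source table.
        requested = list(table_columns)
    else:
        requested = split_csv(list_of_columns)
    # Exclusions apply in both cases, though they matter mostly with '*'.
    return [col for col in requested if col not in excluded]
```

For example, `resolve_columns(["id", "x", "y"], "*", "id")` keeps `x` and `y` and drops `id`.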
Details
1) This function will be user facing and will also be called internally by
other MADlib functions in the area of data parallel models.
2) The interface above is modeled on DT/RF. I think it should follow the same
general idea.
Open questions
1) Is the interface above the correct one? Are there any parameters missing?
2) Can we support array columns, and is it necessary to flatten them? i.e., can
we leave them unflattened, since that is preferable?
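To make open question 2 concrete, the sketch below shows what flattening would do to a row, assuming an array column arrives in PL/Python as a Python list (this depends on the PL/Python version in use). The helper `flatten_row` and the `name_i` naming scheme are hypothetical and only serve to illustrate the trade-off.

```python
def flatten_row(row):
    """Expand list-valued columns into name_1, name_2, ... scalar columns."""
    flat = {}
    for name, value in row.items():
        if isinstance(value, list):
            # One new key per array element; the array boundary is lost.
            for i, item in enumerate(value, start=1):
                flat["{0}_{1}".format(name, i)] = item
        else:
            flat[name] = value
    return flat
```

Leaving the row unflattened keeps one key per column. Flattening a 28x28 MNIST image stored as a single array would instead produce 784 keys per row, which is one reason the unflattened form looks preferable.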
Acceptance
1) Load the MNIST data set from PG or GP into PL/Python and print out a few
rows of the data.
2) Load array columns and mixed type data into PL/Python and confirm that
types and formats are preserved.
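One way the acceptance checks might be exercised is sketched below. It assumes the query runner is passed in as a callable (inside PL/Python this would be `plpy.execute`, whose result can be iterated as rows keyed by column name). The names `load_rows` and `column_types` are hypothetical and are not the proposed MADlib function.

```python
def load_rows(execute, source_table, columns):
    """Run a SELECT through `execute` and return the rows as a list of dicts."""
    # Identifiers are interpolated as-is in this sketch; real code should
    # validate or quote them before building the query.
    query = "SELECT {0} FROM {1}".format(", ".join(columns), source_table)
    return [dict(row) for row in execute(query)]


def column_types(rows):
    """Map each column name to the set of Python type names seen in it."""
    seen = {}
    for row in rows:
        for name, value in row.items():
            seen.setdefault(name, set()).add(type(value).__name__)
    return seen
```

Inside a PL/Python function, the result of `load_rows(plpy.execute, ...)` could be stored in `GD` and the first few rows printed for acceptance item 1. `column_types` then gives a quick view of whether mixed and array columns kept the expected Python types for item 2.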