[ https://issues.apache.org/jira/browse/MADLIB-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikhil updated MADLIB-1265:
---------------------------
    Description: 
Story
We need to productize the read data code so that the madlib.fit function can
call it along with the user-defined Python script.

Details
1. Figure out how to copy read_data.py and the user code to all hosts.
2. The read_data script will call the user-defined script and then write the
model. This way the user can easily iterate on their Python file.
3. Think about error handling in read_data.py. If we write all errors to a log
file, do we delete the log file every time the madlib.fit UDF is called? Do we
need to rotate the log files? (One rotation option is shown in the read_data.py
sketch after this list.)
4. We need to make sure that we take a lock on the model file while writing. An
alternative that avoids the need for locking is to create one file per segment.
We can append the segment id to the file name and use that name to create the
external readable table.
5. read_data.py can be copied to all the segments during MADlib install. This
file can take the user_defined_module as an argument, which will then be
dynamically imported (see the read_data.py sketch after this list).
6. How will the Python memory be managed? The postgres process for each segment
spawns a connection process because of the INSERT, which in turn spawns another
process to run our executable command from the CREATE WRITABLE EXTERNAL WEB
TABLE definition. This means that the memory is probably not restricted by
Greenplum.
7. Should read_data.py also get the column names and the grouping column
information? Can we pass the metadata without duplication? madlib.fit, which
will be a PL/Python function, can take care of this (see the madlib.fit sketch
after this list):
    a. Get the absolute path of the user-defined Python file and copy it to all
the segments (this is up for discussion; maybe there is a better way to copy
the user-defined code to all the segments).
    b. Parse the columns to include and get the types of all the columns using
plpy.execute(). Write this metadata along with any other relevant information
to a YAML file and copy it to all the segments.
8. The grouping column value should be written to the model file.
9. Since the data is read through a pipe, read_data.py can also provide an API
to stream rows to the user-defined Python file (see the sketch below).
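
Below is a minimal sketch of what read_data.py could look like, tying items 3,
4, 5, and 9 together. All names in it (the --user_module and --segment_id
arguments, the run(rows) entry point expected from the user module, the file
paths, the tab-delimited pipe format, and the pickle model format) are
hypothetical placeholders for discussion, not a settled design.

{code}
# read_data.py -- illustrative sketch only; all names and paths are hypothetical
import argparse
import importlib
import logging
import os
import pickle
import sys
from logging.handlers import RotatingFileHandler


def stream_rows(pipe):
    """Item 9: yield one tab-delimited row at a time from the pipe."""
    for line in pipe:
        yield line.rstrip('\n').split('\t')


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--user_module', required=True)   # item 5: module to import dynamically
    parser.add_argument('--segment_id', required=True)    # item 4: one model file per segment
    parser.add_argument('--output_dir', default='/tmp')
    args = parser.parse_args()

    # Item 3: rotate the log instead of deleting it on every madlib.fit call.
    logger = logging.getLogger('read_data')
    logger.setLevel(logging.INFO)
    logger.addHandler(RotatingFileHandler(
        os.path.join(args.output_dir, 'read_data.log'),
        maxBytes=10 * 1024 * 1024, backupCount=3))

    try:
        # Item 5: the user-defined module is imported at run time.
        user_module = importlib.import_module(args.user_module)
        # Item 9: hand the user code a row iterator instead of a materialized list.
        model = user_module.run(stream_rows(sys.stdin))
        # Item 4: the segment id in the file name avoids locking a shared model
        # file; the external readable table can be defined over these files.
        model_path = os.path.join(
            args.output_dir, 'model_seg{0}.pkl'.format(args.segment_id))
        with open(model_path, 'wb') as f:
            pickle.dump(model, f)
    except Exception:
        logger.exception('read_data.py failed')
        raise


if __name__ == '__main__':
    main()
{code}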
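
And a sketch of the metadata step in item 7b, assuming madlib.fit is a
PL/Python function and that PyYAML is available on the master; the query, the
YAML layout, and the file path are illustrative assumptions, and the
copy-to-segments step is not shown.

{code}
# Inside the madlib.fit PL/Python function body (sketch only; plpy is available
# only inside PL/Python, and every name below is illustrative).
import yaml


def write_metadata(source_table, cols_to_include, grouping_cols):
    # Item 7b: look up the types of the requested columns once, on the master.
    rows = plpy.execute(
        "SELECT column_name, data_type "
        "FROM information_schema.columns "
        "WHERE table_name = {0}".format(plpy.quote_literal(source_table)))
    col_types = dict((r['column_name'], r['data_type'])
                     for r in rows if r['column_name'] in cols_to_include)
    metadata = {
        'source_table': source_table,
        'columns': col_types,
        'grouping_cols': grouping_cols,
    }
    # Write the metadata once; copying the file to every segment host
    # (e.g. with gpscp or scp) is a separate step not shown here.
    with open('/tmp/madlib_fit_metadata.yml', 'w') as f:
        yaml.safe_dump(metadata, f, default_flow_style=False)
{code}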


Acceptance

  was:
Story

`As a data scientist`
I want to easily and efficiently load data from the database into PL/Python 
memory
`so that`
I can use the loaded data in my PL/Python code.

Interface

{code}
load_to_plpythonu (
    source_table,                 -- source table
    list_of_columns,              -- columns you want in GD, could be '*'
    list_of_columns_to_exclude    -- columns explicitly not to load
);
{code}

Arguments
{code}
source_table
TEXT. Name of the table containing the data to load.

list_of_columns
TEXT. Comma-separated string of column names or expressions to load.
Can also be '*', implying that all columns are to be loaded (except for the
ones included in the next argument that lists exclusions). The types of the
columns can be mixed.
Array columns can also be included in the list and will be loaded as is (i.e.,
not flattened). (???)

list_of_columns_to_exclude
TEXT. Comma-separated string of column names to exclude from the load.
Typically used when 'list_of_columns' is set to '*'.
{code}
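
A hedged usage sketch, assuming the function is invoked from inside a PL/Python
function and that the loaded rows land in the GD dictionary under a key such as
'loaded_data'; the key name, the row format, and the table and column names are
assumptions for illustration, not part of the interface above.

{code}
# Inside a PL/Python function body (sketch only; the GD key and row format are
# assumptions, not defined behavior of load_to_plpythonu).
plpy.execute(
    "SELECT load_to_plpythonu('mnist_train', '*', 'row_id')")
rows = GD.get('loaded_data', [])
for row in rows[:5]:
    plpy.info(row)   # print a few rows to verify the load
{code}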

Details

1) This function will be user facing and will also be called internally by
other MADlib functions in the area of data-parallel models.
2) The interface above is modeled on DT/RF. I think it should follow the same
general idea.


Open questions

1) Is the interface above the correct one? Are there any parameters missing?
2) Can we support array columns, and is it necessary to flatten them? I.e., can
we leave them unflattened, since that is preferable?


Acceptance

1) Load the MNIST data set from PG or GP into PL/Python and print out a few
rows of the data.
2) Load array columns and mixed-type data into PL/Python and confirm that
types and formats are preserved.
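
A sketch of how acceptance item 2 might be checked from inside PL/Python, again
assuming a hypothetical GD['loaded_data'] holding the rows as dicts:

{code}
# Acceptance sketch (assumes GD['loaded_data'] holds the loaded rows as a list
# of dicts -- an assumption about load_to_plpythonu's output, not its spec).
rows = GD.get('loaded_data', [])
sample = rows[0] if rows else {}
for name, value in sample.items():
    # Array columns should come back as Python lists, not flattened scalars.
    plpy.info('{0}: {1}'.format(name, type(value).__name__))
{code}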


> Formalize the read data code for parallel segment data loading
> --------------------------------------------------------------
>
>                 Key: MADLIB-1265
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1265
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v2.0
>


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
