[ 
https://issues.apache.org/jira/browse/MADLIB-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16869847#comment-16869847
 ] 

Domino Valdano commented on MADLIB-1345:
----------------------------------------

Also, a possible idea for getting around the 1GB limit so that we could handle 
large models whose state takes up more than 1GB:

  Pass the layers in as a tuple instead of an array.  In other words, instead 
of the weights being an array of layers (where each layer has an array of 
weights), let it be a row where each column is a different layer.  This may 
require creating a temporary user-defined type, and another unknown is 
whether the 1GB limit applies to each column or to the entire row.  But if the 
limit is per column, then this would expand the set of models supported by 
MADlib to anything where each layer is < 1GB.  For example, you might have a 
model with 16 layers that's 15GB in size, but as long as each layer is under 
1GB it can still work.
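
A rough sketch of the idea (purely illustrative: the type name, the fixed 
three-layer layout, and whether the 1GB varlena limit is per column or per 
row are all assumptions that would need to be verified on GPDB):

{code}
# Hypothetical PL/Python sketch: keep each layer's serialized weights in its
# own bytea column so that no single value has to exceed the 1GB limit.
plpy.execute("""
    CREATE TYPE madlib.layer_weights AS (
        layer_0 bytea,
        layer_1 bytea,
        layer_2 bytea
    )
""")

def reassemble_weights(layer_row):
    # Inside the fit/predict UDF a composite argument arrives as a dict;
    # stitch the per-layer blobs back into one list of flat arrays
    # (reshaping to the layer shapes would still be needed for set_weights).
    import numpy as np
    return [np.frombuffer(layer_row[k], dtype=np.float32)
            for k in ('layer_0', 'layer_1', 'layer_2')]
{code}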

> DL: Performance improvement in DL functions
> -------------------------------------------
>
>                 Key: MADLIB-1345
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1345
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Deep Learning
>            Reporter: Ekta Khanna
>            Priority: Major
>             Fix For: v2.0
>
>
> Currently, we pass around model_data, model_arch, etc. for each buffer/image 
> for fit(), predict() and evaluate(). This causes a lot of overhead and slows 
> down the query considerably.
> We tried setting model_data and model_arch using GD for predict. The 
> runtimes were:
> with GD: ~707 sec (with CPU) - 50K places10_20seg
> without GD: ~1650 sec (with CPU) - 50K places10_20seg
> Below is the patch for GD changes:
> {code}
> def set_predict_GD(model_architecture, model_data,
>                            is_response, normalizing_const, seg_ids,
>                            images_per_seg, gpus_per_host, segments_per_host,
>                            **kwargs):
>     GD = kwargs['GD']
>     GD['model_architecture'] = model_architecture
>     GD['model_data'] = model_data
>     GD['is_response'] = is_response
>     GD['normalizing_const'] = normalizing_const
>     #GD['current_seg_id'] = current_seg_id
>     GD['seg_ids'] = seg_ids
>     GD['images_per_seg'] = images_per_seg
>     GD['gpus_per_host'] = gpus_per_host
>     GD['segments_per_host'] = segments_per_host
> def predict():
> ....
> set_gd_query=plpy.prepare("""
>            SELECT set_predict_GD
>             ($MAD${model_arch}$MAD$,
>             $1,
>             {is_response},
>             {normalizing_const},
>             -- gp_segment_id,
>             ARRAY{seg_ids_test},
>             ARRAY{images_per_seg_test},
>             {gpus_per_host},
>             {segments_per_host}
>             ) from gp_dist_random('gp_id')
>             """.format(**locals()), ["bytea"])
>             # Using gp_dist_random('gp_id') in the query makes the UDF run
>             # on each segment
> plpy.execute(set_gd_query, [model_data])
> predict_query = plpy.execute("""
>     CREATE TABLE {output_table} AS
>     SELECT {id_col}, {prediction_select_clause}
>     FROM (
>         SELECT {test_table}.{id_col},
>                ({schema_madlib}.internal_keras_predict
>                    ({independent_varname}, {gp_segment_id_col})
>                ) AS {intermediate_col}
>     FROM {test_table}
>     ) q distributed by ({id_col})
>     """.format(**locals()))
> def internal_keras_predict(independent_var, current_seg_id, **kwargs):
>     start = time.time()
>     SD = kwargs['SD']
>     GD = kwargs['GD']
>     is_response = GD['is_response']
>     normalizing_const = GD['normalizing_const']
>     #current_seg_id = GD['current_seg_id']
>     seg_ids = GD['seg_ids']
>     images_per_seg = GD['images_per_seg']
>     gpus_per_host = GD['gpus_per_host']
>     segments_per_host = GD['segments_per_host']
>     device_name = get_device_name_and_set_cuda_env(gpus_per_host,
>                                                    current_seg_id)
>     ...
> {code}
> With the above changes, we found out that GD is not reliable for GPDB 
> for the following reasons:
> Consider a single-node GPDB cluster with 3 segments.
> Calling set_predict_GD using gp_dist_random() creates 1 process per segment 
> and sets GD on each of these processes:
> seg1 - pid 100 - GD is set here for seg1
> seg2 - pid 200 - GD is set here for seg2
> seg3 - pid 300 - GD is set here for seg3
> Now, the CREATE TABLE .. in predict() spins up 2 processes per segment: the 
> old process where GD was set, plus 1 new process per segment.
> seg1 - pid 100 - GD is set here for seg1 (reused from before)
> seg1 - pid 101 - GD is read here for seg1
> seg2 - pid 200 - GD is set here for seg2 (reused from before)
> seg2 - pid 201 - GD is read here for seg2
> seg3 - pid 300 - GD is set here for seg3 (reused from before)
> seg3 - pid 301 - GD is read here for seg3
> This causes problems because the process where GD is read is not the same 
> as the process where it was set.
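> As an illustration (a hypothetical guard, not part of the patch above): the 
> newly spawned per-segment process, e.g. pid 101, has its own empty GD, so a 
> read-side check fails there.
> {code}
> def internal_keras_predict(independent_var, current_seg_id, **kwargs):
>     GD = kwargs['GD']
>     # GD is per-process: a process spawned only for the CREATE TABLE slice
>     # never ran set_predict_GD, so the model state is missing here.
>     if 'model_architecture' not in GD:
>         plpy.error("model state not found in GD for this process")
>     ...
> {code}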
> A couple of ways to avoid this problem:
> # Change the predict code to run two plpy execute queries, the first one 
> being the internal predict query and the second one being the create table 
> query (see the sketch after this list).
> # Distribute the source table by the id column and, while creating the 
> predict output table, use that id column as the distribution key.
> We are not sure if this is good enough for all use cases, e.g. what if the 
> source table has an index which might do the same thing as the CREATE TABLE 
> command. Our goal is to keep the query from creating multiple processes.
> # Explore the GD option.
> # Explore alternatives so that we don't have to pass the model data for every 
> row/buffer/image in the transition function/UDF.
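> A minimal sketch of the first option, reusing the placeholder names from the 
> patch above ({intermediate_table} is made up here, and whether the first 
> query really stays on the processes where GD was set would need to be 
> confirmed):
> {code}
> # Step 1: run only the internal predict query, so it executes in the same
> # per-segment processes where set_predict_GD populated GD.
> plpy.execute("""
>     CREATE TEMP TABLE {intermediate_table} AS
>     SELECT {id_col},
>            {schema_madlib}.internal_keras_predict
>                ({independent_varname}, {gp_segment_id_col}) AS {intermediate_col}
>     FROM {test_table}
>     """.format(**locals()))
> 
> # Step 2: build the user-facing output table from the intermediate results.
> plpy.execute("""
>     CREATE TABLE {output_table} AS
>     SELECT {id_col}, {prediction_select_clause}
>     FROM {intermediate_table}
>     distributed by ({id_col})
>     """.format(**locals()))
> {code}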



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
