[
https://issues.apache.org/jira/browse/MADLIB-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ekta Khanna updated MADLIB-1345:
--------------------------------
Component/s: Deep Learning
> DL: Performance improvement in DL functions
> -------------------------------------------
>
> Key: MADLIB-1345
> URL: https://issues.apache.org/jira/browse/MADLIB-1345
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Deep Learning
> Reporter: Ekta Khanna
> Priority: Major
>
> Currently, we pass around model_data, model_arch, etc. for each buffer/image
> for fit(), predict() and evaluate(). This causes a lot of overhead and slows
> down the query considerably.
> We tried setting model_data and model_arch via GD for predict. The runtimes
> were:
> with GD: ~707 sec (with CPU) - 50K places10_20seg
> without GD: ~1650 sec (with CPU) - 50K places10_20seg
> Below is the patch for GD changes:
> {code}
> def set_predict_GD(model_architecture, model_data,
>                    is_response, normalizing_const, seg_ids,
>                    images_per_seg, gpus_per_host, segments_per_host,
>                    **kwargs):
>     GD = kwargs['GD']
>     GD['model_architecture'] = model_architecture
>     GD['model_data'] = model_data
>     GD['is_response'] = is_response
>     GD['normalizing_const'] = normalizing_const
>     # GD['current_seg_id'] = current_seg_id
>     GD['seg_ids'] = seg_ids
>     GD['images_per_seg'] = images_per_seg
>     GD['gpus_per_host'] = gpus_per_host
>     GD['segments_per_host'] = segments_per_host
>
> def predict():
>     ....
>     # Using gp_dist_random('gp_id') in the query makes the UDF run on
>     # each segment
>     set_gd_query = plpy.prepare("""
>         SELECT set_predict_GD(
>             $MAD${model_arch}$MAD$,
>             $1,
>             {is_response},
>             {normalizing_const},
>             -- gp_segment_id,
>             ARRAY{seg_ids_test},
>             ARRAY{images_per_seg_test},
>             {gpus_per_host},
>             {segments_per_host}
>         ) FROM gp_dist_random('gp_id')
>     """.format(**locals()), ["bytea"])
>     plpy.execute(set_gd_query, [model_data])
>     predict_query = plpy.execute("""
>         CREATE TABLE {output_table} AS
>         SELECT {id_col}, {prediction_select_clause}
>         FROM (
>             SELECT {test_table}.{id_col},
>                    ({schema_madlib}.internal_keras_predict
>                        ({independent_varname}, {gp_segment_id_col})
>                    ) AS {intermediate_col}
>             FROM {test_table}
>         ) q DISTRIBUTED BY ({id_col})
>     """.format(**locals()))
>
> def internal_keras_predict(independent_var, current_seg_id, **kwargs):
>     start = time.time()
>     SD = kwargs['SD']
>     GD = kwargs['GD']
>     is_response = GD['is_response']
>     normalizing_const = GD['normalizing_const']
>     # current_seg_id = GD['current_seg_id']
>     seg_ids = GD['seg_ids']
>     images_per_seg = GD['images_per_seg']
>     gpus_per_host = GD['gpus_per_host']
>     segments_per_host = GD['segments_per_host']
>     device_name = get_device_name_and_set_cuda_env(gpus_per_host,
>                                                    current_seg_id)
>     ...
> {code}
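> The saving from GD comes from deserializing the model once per backend
> process instead of once per row. A minimal self-contained sketch of the
> idea (a plain dict stands in for PL/Python's per-process GD; the function
> and field names here are illustrative, not MADlib API):

```python
import json

# GD stands in for PL/Python's per-process global dictionary: it survives
# across UDF calls made within the same backend process.
GD = {}
CALLS = {"deserialize": 0}

def expensive_deserialize(model_arch_json):
    # Placeholder for loading a model architecture + weights (hypothetical).
    CALLS["deserialize"] += 1
    return json.loads(model_arch_json)

def predict_row(row, model_arch_json):
    # Without GD, this deserialization would run once for every row/buffer.
    if 'model' not in GD:
        GD['model'] = expensive_deserialize(model_arch_json)
    model = GD['model']
    return sum(row) * model['scale']  # stand-in for a real forward pass

arch = json.dumps({"scale": 2})
results = [predict_row([i], arch) for i in range(1000)]
print(CALLS["deserialize"])  # -> 1: the model is deserialized only once
```

The 1000 calls pay the deserialization cost once, which mirrors the ~2x
speedup observed in the timings above.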
> With the above changes, we found that GD is not reliable in GPDB, for the
> following reasons:
> Consider a single-node GPDB cluster with 3 segments.
> Calling set_gd via gp_dist_random() creates one process per segment and
> sets GD in each of those processes:
> seg1 - pid 100 - GD is set here for seg1
> seg2 - pid 200 - GD is set here for seg2
> seg3 - pid 300 - GD is set here for seg3
> Now, the CREATE TABLE in predict() spins up two processes per segment: the
> old process where GD was set, plus one new process per segment.
> seg1 - pid 100 - GD is set here for seg1 (reused from before)
> seg1 - pid 101 - GD is read here for seg1
> seg2 - pid 200 - GD is set here for seg2 (reused from before)
> seg2 - pid 201 - GD is read here for seg2
> seg3 - pid 300 - GD is set here for seg3 (reused from before)
> seg3 - pid 301 - GD is read here for seg3
> This causes problems because the processes where GD is read are not the
> same processes where it was set.
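> The failure mode can be reproduced with plain dicts, one per simulated
> backend process: the setter populates one set of process-local GDs, but the
> reading query lands on freshly spawned processes whose GD is empty (the
> pids are illustrative, matching the walkthrough above):

```python
# Each backend process has its own GD; simulate processes as {pid: GD}.
process_gd = {}

def set_gd(pid, model_data):
    gd = process_gd.setdefault(pid, {})
    gd['model_data'] = model_data

def read_gd(pid):
    gd = process_gd.setdefault(pid, {})
    return gd.get('model_data')  # None if this process never ran the setter

# set_predict_GD via gp_dist_random(): one process per segment sets GD.
for pid in (100, 200, 300):
    set_gd(pid, "weights")

# CREATE TABLE spins up an extra process per segment: pids 101, 201, 301.
# Reads in the processes that ran the setter work fine...
assert all(read_gd(pid) == "weights" for pid in (100, 200, 300))
# ...but reads in the new processes find an empty GD.
print([read_gd(pid) for pid in (101, 201, 301)])  # -> [None, None, None]
```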
> A couple of ways to avoid this problem:
> # Change the predict code to run two plpy.execute queries: the first being
> the internal predict query, and the second being the CREATE TABLE query.
> # Distribute the source table by the id column, and when creating the
> predict output table, use that id column as the distribution key. We are
> not sure this is good enough for all use cases, e.g. what if the source
> table has an index which might do the same thing as the CREATE TABLE
> command. Our goal is to keep the query from creating multiple processes.
> Next steps:
> # Explore the GD option
> # Explore alternatives so that we don't have to pass the model data for
> every row/buffer/image in the transition function/UDF
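> A rough shape of the first option: run the predict UDF as its own query
> (so it executes in the same slice, and hence the same backend processes,
> where GD was populated), then build the output table in a second query
> that no longer touches GD. A stub plpy object stands in for the real
> PL/Python interface so the control flow can be shown self-contained; the
> SQL text and table/column names are placeholders, not the actual patch:

```python
class PlpyStub:
    """Minimal stand-in for PL/Python's plpy: just records executed SQL."""
    def __init__(self):
        self.executed = []
    def execute(self, query):
        # Normalize whitespace so the recorded SQL is easy to inspect.
        self.executed.append(" ".join(query.split()))

plpy = PlpyStub()

def predict(output_table, test_table):
    # Query 1: the internal predict query runs alone, in the slice where
    # set_predict_GD populated GD, and materializes its results.
    plpy.execute("""
        CREATE TEMP TABLE predict_scratch AS
        SELECT id, internal_keras_predict(x, gp_segment_id) AS pred
        FROM {t}
    """.format(t=test_table))
    # Query 2: a plain CREATE TABLE that never reads GD, so any extra
    # processes it spawns are harmless.
    plpy.execute("""
        CREATE TABLE {o} AS SELECT id, pred FROM predict_scratch
    """.format(o=output_table))

predict("pred_out", "places10_test")
print(len(plpy.executed))  # -> 2: two separate plpy.execute queries
```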
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)