reductionista opened a new pull request #525:
URL: https://github.com/apache/madlib/pull/525


   This is a major refactor of the model hopper parallelism of the deep 
learning module.
   
   **A new data flow for the hop/training cycle**
   
   Background:
   
   In v1.17, MADlib creates 3 temporary tables to store the model weights while 
they are being
   hopped around and trained:  mst_weights_tbl, weights_to_update_tbl, and 
model_output_table.
   This gives rise to 3 different long-running queries that each involve at 
least one Redistribute Motion:
   the UDA query (which does the actual training), the UPDATE query, and the 
HOP query.  Most of the time in
madlib_keras_fit_multiple() is spent in a loop over iterations and hops calling the run_training()
function, which performs each of these 3 stages; all of the models circulate among these 3
temporary tables and get back to where they started after each full iteration.
   The overall effect was a lot of unnecessary motion of the model weights, going back and forth
between different segment hosts several times for each call to run_training().  Each hop was
really 3 or 4 hops in terms of data motion, which for a modest-sized data set can cause
the hopping phase of the cycle to take just as long as the training phase, leaving GPUs with a lot
of idle time.
   
   3 Biggest Changes:
   
   1.  This refactor involves a major simplification where the 3 temporary 
tables are replaced by 2 temporary tables (model_input_tbl and 
model_output_tbl) and the UPDATE stage is eliminated to leave only the HOP and
   UDA stages.  There were also 3 shorter-running TRUNCATE & DROP queries in 
between the 3 longer queries.
   These are also reduced to only 2 TRUNCATE & DROP queries.
   
   Two additional performance improvements closely related to the above are:
   2.  A __dist_key__ column is added to the model output table, and both of the temporary tables are
       DISTRIBUTED BY this key instead of by the mst key.  The __dist_key__ values have a 1-to-1 mapping
       with segment_ids, as in the batched source table, and __dist_key__ now serves as the hash key for
       all JOINs.  The schedule table has both a __dist_key__ column and a __previous_dist_key__ column,
       so that it can guide the weights from the specific segment they were on previously
       to the one they are scheduled to be on for the next hop.  This avoids any additional
       Redistribute Motions caused by table Hash Joins.  With this change, the models no longer
       move at all except during the hop itself; they move exactly once per call to run_training().
       (A schematic sketch of the hop query follows this list.)
   
   and
   3.  For model weight averaging (madlib_keras_fit()), we needed a merge and a final function
        in addition to the fit_transition function, so a UDA was the natural choice.  For the model
        hopper (madlib_keras_fit_multiple()), there is no need for a merge or final function; all we
        need is to call the fit_transition function directly on each row--so it was less clear whether
        a UDA was the right choice.  We found that calling it directly as a UDF, and using SD to pass
        the state from one row to the next (as we were already doing anyway with the UDA), resulted
        in slightly better performance.  So now fit_transition can be called either as a UDF or as part
        of a UDA, and fit_transition_multiple_model() is always called as a UDF.  (A sketch of the
        direct UDF call also appears below.)
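
   To make point 2 concrete, here is a rough sketch of the hop stage under the __dist_key__ scheme.
   The table and column names are simplified and the SQL is only illustrative, not the exact query
   the module generates:

```sql
-- Illustrative sketch of the HOP stage (simplified names, not the generated SQL).
-- The schedule row records which segment each model's weights were on last
-- (__previous_dist_key__) and which segment they should train on next
-- (__dist_key__).  Joining on the previous key finds the weights where they
-- currently live; DISTRIBUTED BY (__dist_key__) then sends each row to its new
-- segment, so the hop itself is the only place the weights move.
CREATE TABLE model_input_tbl AS
    SELECT s.__dist_key__,
           o.mst_key,
           o.model_weights
    FROM model_output_tbl o
    JOIN schedule_tbl s
      ON o.__dist_key__ = s.__previous_dist_key__
DISTRIBUTED BY (__dist_key__);
```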
   
   
   # Other Performance Enhancements
   
   4.  There is no need to hop the models between the last hop of the previous iteration and the first hop of the next iteration, so once per iteration we skip the hop and instead just rename model_output_tbl to model_input_tbl (after dropping the old model_input_tbl), so the weights stay in place.  For example, if there were 4 segments, then in v1.17 that meant 4 model hops per iteration.  In this branch, the last hop is skipped and the next iteration just starts with the models on the same segments they are already on, so the 4 hops are reduced to only 3.  The same amount of training occurs as before, and each model is still paired exactly once per iteration with each segment--just in a slightly different order.
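
   A minimal sketch of the idea, assuming the stale input table is disposed of first (the real code
   manages the table names in the Python driver):

```sql
-- Sketch: at the iteration boundary the models stay on their current segments,
-- so the hop degenerates into catalog operations: drop the old input table and
-- rename the output table to take its place.  No rows move between segments.
DROP TABLE IF EXISTS model_input_tbl;
ALTER TABLE model_output_tbl RENAME TO model_input_tbl;
```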
   
   5.  Much faster initialization code:  previously, we were reading the weights
         in from the original model output table (during warm start) and the 
model
         arch table (for transfer learning) one mst row at a time from segment 
to
         master, then writing them each back out one row at a time from master
         back to segments with a large number of SELECT and INSERT queries.
         Now, we just use a single query to copy the weights directly from the
         original model output table into the new model output table on the
         segments, without ever sending them to master.  And a similar single
         query copies the transfer learning weights directly from model_arch to
         model_output for training.  Both of these happen in parallel on the
         segments, instead of in sequence on master.  During testing of warm
         start on a 20-segment cluster with 20 models, this resulted in a 10x reduction
         in initialization time (26s instead of 5 mins in v1.17).
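
   For instance, the warm-start copy can be expressed as a single INSERT ... SELECT that runs
   entirely on the segments (names are illustrative; the real query also carries the other model
   columns):

```sql
-- Illustrative sketch of warm-start initialization: copy the previously trained
-- weights straight from the original model output table into the new one,
-- re-keyed for the first schedule, in one segment-parallel query with no
-- per-row round trips through the master.
INSERT INTO model_output_tbl (__dist_key__, mst_key, model_weights)
SELECT s.__dist_key__,
       prev.mst_key,
       prev.model_weights
FROM original_model_output_tbl prev
JOIN schedule_tbl s
    USING (mst_key);
```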
   
   6.   Split get_model_arch_and_weights() into query_weights() and get_model_arch(), so we don't
          have to transfer weights from the segments to master in places where we only need the
          model_arch JSON.
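
   The split matters because the weights can be large while the architecture JSON is small; a caller
   that only needs the architecture can now run something like the following (table name and id are
   illustrative):

```sql
-- Illustrative: fetch only the architecture JSON, leaving the potentially large
-- model_weights column where it is instead of pulling it back to the master.
SELECT model_arch
FROM model_arch_table
WHERE model_id = 1;
```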
   
   7.     Enables JIT XLA auto-clustering, if available.
   
   # Other Changes
   
    8.  Simplified schedule rotation: the schedule table is created only once and then gets
         rotated on the segments, instead of being re-created many times by transferring
         data back and forth between master and segments on each hop.  We no longer
         need separate "current_schedule" and "grand_schedule" data structures.
         These tables do not contain model weights, so there is not much of a
         performance benefit, but it's at least simpler and more maintainable.
   
    9.   Added some debugging that can be enabled to help profile the
         performance of fit multiple and to track which segment each mst_key
         is located on during each hop.  This also serves as an example for
         the utils/debug PR this is rebased on top of.
   
    10.  Remove AutoML dependency on internals of fit_multiple (needed to make AutoML
         compatible with the new FitMultiple class, but also good for avoiding having to
         do the same thing for any future updates to fit_multiple).  Better modularity.
   
    11.  Improved exception handling: send the full stack traceback from the segment back to the master (soon to be renamed "coordinator").  Now when an exception happens in fit_transition (or merge, final, etc.) we'll be able to see the stack trace of the internal UDF or UDA running on the segments, along with the stack trace of the outer UDF that runs on the
   coordinator.  It's attached to the DETAIL of the Exception and includes the line number where the error occurred--something we had to guess at before, which made debugging difficult.

