reductionista opened a new pull request #525:
URL: https://github.com/apache/madlib/pull/525
This is a major refactor of the model hopper parallelism in the deep learning module.

**A new data flow for the hop/training cycle**

Background: In v1.17, MADlib creates 3 temporary tables to store the model weights while they are being hopped around and trained: `mst_weights_tbl`, `weights_to_update_tbl`, and `model_output_table`. This gives rise to 3 different long-running queries that each involve at least one Redistribute Motion: the UDA query (which does the actual training), the UPDATE query, and the HOP query. Most of the time in `madlib_keras_fit_multiple()` is spent in a loop over iterations and hops calling the `run_training()` function, which performs each of these 3 stages; all of the models circulate between these 3 temporary tables and get back to where they started after each full iteration. The overall effect was a lot of unnecessary motion of the model weights back and forth between different segment hosts, several times for each call to `run_training()`. Each hop was really 3 or 4 hops in terms of data motion, which for a modest-sized data set can make the hopping phase of the cycle take just as long as the training phase, leaving GPUs with a lot of idle time.

**3 Biggest Changes:**

1. This refactor involves a major simplification where the 3 temporary tables are replaced by 2 temporary tables (`model_input_tbl` and `model_output_tbl`) and the UPDATE stage is eliminated, leaving only the HOP and UDA stages. There were also 3 shorter-running TRUNCATE & DROP queries in between the 3 longer queries; these are likewise reduced to only 2 TRUNCATE & DROP queries. Two additional performance improvements closely related to the above are:

2. A `__dist_key__` column is added to the model output table, and both of the temporary tables are DISTRIBUTED BY this key instead of by `mst_key`. The `__dist_key__` values have a 1-to-1 mapping with segment_ids, as in the batched source table, and this key now serves as the hash key for all JOINs. The schedule table has both a `__dist_key__` column and a `__previous_dist_key__` column, so that it can guide the weights from the specific segment they were on previously to the one they are scheduled to be on for the next hop. This avoids any additional Redistribute Motions caused by hash joins. With this change, there is no longer any movement of the models at all except during the hop itself; they move exactly once per call to `run_training()`. (Sketched after this list.)

3. For model weight averaging (`madlib_keras_fit()`), we needed a merge and a final function in addition to the `fit_transition` function, so a UDA was the natural choice. For the model hopper (`madlib_keras_fit_multiple()`), there is no need for merge or final; all we need is to call the `fit_transition` function directly on each row, so it was less clear whether a UDA was the right choice. We found that calling it as a UDF directly, and using SD to pass the state from one row to the next (as we were already doing anyway with the UDA), resulted in slightly better performance. So now `fit_transition` can be called either as a UDF or as part of a UDA, and `fit_transition_multiple_model()` is always called as a UDF. (Also sketched after this list.)
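To make change 2 concrete, here is a minimal sketch of the shape the single HOP query can take when the join key is also the distribution key. This is illustrative only: the schedule table name and the `mst_key`/`model_weights` column names are assumptions, not lifted from the patch.

```python
# Illustrative sketch only -- not the actual MADlib source.  The hop is one
# INSERT ... SELECT where the join between the previous model table and the
# schedule table happens on __dist_key__ = __previous_dist_key__ (both
# distribution keys), so the only data motion is the redistribution implied
# by writing each row back out under its new __dist_key__.
def build_hop_query(schedule_tbl="schedule_tbl"):   # table name is hypothetical
    return """
        INSERT INTO model_input_tbl
        SELECT s.__dist_key__, o.mst_key, o.model_weights
        FROM model_output_tbl o
        JOIN {schedule} s
          ON o.__dist_key__ = s.__previous_dist_key__
    """.format(schedule=schedule_tbl)

print(build_hop_query())
```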
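And for change 3, a toy sketch of the SD-based state handoff that lets a plain UDF behave like a transition function. The real training step is replaced by a running sum and the argument names are made up for illustration; only the state-passing pattern is the point.

```python
# Toy sketch of the state handoff only (not the actual fit_transition code).
# When the transition function is called as a UDF, once per row, the
# per-segment SD dictionary carries the in-progress state from one call to
# the next, so no merge or final function is needed.
def fit_transition_as_udf(buffer_value, mst_key, is_last_row, SD):
    state = SD.get(mst_key, 0.0)      # state left behind by the previous row
    state += buffer_value             # stand-in for one training step
    if is_last_row:
        SD.pop(mst_key, None)         # done with this model on this segment
        return state                  # emit the updated model state
    SD[mst_key] = state               # stash state for the next row's call
    return None

# Row-by-row calling pattern a segment would see for one model (mst_key=7):
SD = {}
print([fit_transition_as_udf(v, 7, last, SD)
       for v, last in [(1.0, False), (2.0, False), (3.0, True)]])
# -> [None, None, 6.0]
```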
# Other performance enhancements

4. There is no need to hop the models between the last hop of the previous iteration and the first hop of the next iteration, so once per iteration we skip the hop and instead just rename `model_output_tbl` to `model_input_tbl` before truncating and dropping `model_output_tbl`. For example, with 4 segments, v1.17 performs 4 model hops per iteration; in this branch the last hop is skipped and the next iteration starts with the models on the segments they are already on, so the 4 hops are reduced to 3. The same amount of training occurs as before, and each model is still paired exactly once per iteration with each segment--just in a slightly different order.

5. Much faster initialization code: previously, we were reading the weights in from the original model output table (during warm start) and the model arch table (for transfer learning) one mst row at a time from segment to master, then writing them back out one row at a time from master to segments with a large number of SELECT and INSERT queries. Now we use a single query to copy the weights directly from the original model output table into the new model output table on the segments, without ever sending them to master, and a similar single query copies the transfer learning weights directly from model_arch to model_output for training. Both of these happen in parallel on the segments, instead of in sequence on master. During testing of warm start on a 20-segment cluster with 20 models, this resulted in a 10x reduction in initialization time (26s instead of 5 mins in v1.17).

6. Split `get_model_arch_and_weights()` into `query_weights()` and `get_model_arch()`, so we don't have to transfer weights from segment to master in places where we only need the model_arch json.

7. Enables JIT XLA auto-clustering, if available. (A sketch of the idea appears after this list.)

# Other Changes

8. Simplified schedule rotation: the schedule table is created only once, then gets rotated on the segments, instead of being re-created many times by transferring data back and forth from master to segments to master each hop. We no longer need separate "current_schedule" and "grand_schedule" data structures. These tables do not contain model weights, so there is not much of a performance benefit, but it's simpler and more maintainable.

9. Added some debugging that can be enabled to help profile the performance of fit multiple and track which segment each mst_key is located on during each hop. This also serves as an example for the utils/debug PR this is rebased on top of.

10. Removed the AutoML dependency on the internals of fit_multiple (needed to make AutoML compatible with the new FitMultiple class, but also good for avoiding having to do the same thing for any future updates to fit_multiple). Better modularity.

11. Improved exception handling: send the full stack traceback from the segment back to master (soon to be renamed "coordinator"). Now when an exception happens in fit_transition (or merge, final, etc.) we'll be able to see the stack trace of the internal UDF or UDA running on the segments, along with the stack trace of the outer UDF that runs on the coordinator. It's attached to the DETAIL of the Exception and includes the line number where the error occurred--something we had to guess at before, which made debugging difficult. (Also sketched after this list.)
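For item 7, a minimal sketch (assuming TensorFlow 1.x-style sessions, which is what the keras module drives) of how XLA JIT auto-clustering can be switched on via the session config. The helper name is hypothetical and not necessarily how the patch wires it in.

```python
import tensorflow as tf

def build_session_config(enable_xla=True):
    # Hypothetical helper: when requested, set the global JIT level so that
    # TensorFlow auto-clusters eligible ops and compiles them with XLA.
    config = tf.compat.v1.ConfigProto()
    if enable_xla:
        config.graph_options.optimizer_options.global_jit_level = \
            tf.compat.v1.OptimizerOptions.ON_1
    return config

sess = tf.compat.v1.Session(config=build_session_config())
```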
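And for item 11, a rough sketch of the general idea (not the actual code): wrap the segment-side call so that its full stack trace travels with the exception the coordinator ultimately reports.

```python
import traceback

def call_with_segment_traceback(transition_fn, *args):
    # Illustrative wrapper only: if the inner function raises on a segment,
    # re-raise with the segment-side stack trace attached, so the error seen
    # on the coordinator includes the real file and line number.
    try:
        return transition_fn(*args)
    except Exception as e:
        seg_trace = traceback.format_exc()
        raise RuntimeError("{0}\n\nsegment traceback:\n{1}".format(e, seg_trace))

# e.g. call_with_segment_traceback(fit_transition, state, ind_var, dep_var)
```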
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org