reductionista opened a new pull request #525:
URL: https://github.com/apache/madlib/pull/525
This is a major refactor of the model hopper parallelism of the deep
learning module.
**A new data flow for the hop/training cycle**
Background:
In v1.17, MADlib creates 3 temporary tables to store the model weights while
they are being hopped around and trained: mst_weights_tbl,
weights_to_update_tbl, and model_output_table.
This gives rise to 3 different long-running queries, each involving at least
one Redistribute Motion: the UDA query (which does the actual training), the
UPDATE query, and the HOP query. Most of the time in
madlib_keras_fit_multiple() is spent in a loop over iterations and hops
calling the run_training() function, which performs each of these 3 stages;
all of the models circulate between these 3 temporary tables and get back to
where they started after each full iteration.
The overall effect is a lot of unnecessary motion of the model weights back
and forth between different segment hosts for each call to run_training().
Each hop is really 3 or 4 hops in terms of data motion, which for a
modest-sized data set can make the hopping phase of the cycle take just as
long as the training phase, leaving the GPUs idle much of the time.
3 Biggest Changes:
1. This refactor is a major simplification: the 3 temporary tables are
replaced by 2 temporary tables (model_input_tbl and model_output_tbl), and
the UPDATE stage is eliminated, leaving only the HOP and UDA stages. There
were also 3 shorter-running TRUNCATE & DROP queries in between the 3 longer
queries; these are reduced to 2.
Two additional performance improvements closely related to the above are:
2. A __dist_key__ column is added to the model_output table, and both of the
temporary tables are DISTRIBUTED BY this key instead of by mst_key. The
__dist_key__ values have a 1-to-1 mapping with segment_id's, as in the
batched source table, and __dist_key__ now serves as the hash key for all
JOINs. The schedule table has both a __dist_key__ column and a
__previous_dist_key__ column, so that it can guide the weights from the
specific segment they were on previously to the one they are scheduled to
be on for the next hop. This avoids any additional Redistribute Motions
caused by Hash Joins: with this change, the models no longer move at all
except during the hop itself, so they move exactly once per call to
run_training() (see the sketch after this list).
3. For model weight averaging (madlib_keras_fit()), we needed a merge and a
final function in addition to the fit_transition function, so a UDA was the
natural choice. For the model hopper (madlib_keras_fit_multiple()), there is
no need for a merge or final function; all we need is to call the
fit_transition function directly on each row, so it was less clear whether a
UDA was the right choice. We found that calling it directly as a UDF, and
using SD to pass the state from one row to the next (as we were already
doing with the UDA), resulted in slightly better performance. So now
fit_transition can be called either as a UDF or as part of a UDA, and
fit_transition_multiple_model() is always called as a UDF.
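To make the new data flow concrete, here is a rough sketch of the two
remaining long-running queries per hop. This is not the exact SQL the module
generates: the column lists, the source/schedule table names, and the
fit_transition_multiple_model() signature shown here are illustrative only.

```sql
-- Training query: fit_transition_multiple_model() is called as a plain UDF on
-- the rows already co-located with each model.  Everything joins on
-- __dist_key__, which both temp tables and the batched source table are
-- DISTRIBUTED BY, so this query needs no Redistribute Motion.  Per-model
-- state is carried from one row to the next in SD, as with the UDA; the real
-- query keeps only the final weights per model (details omitted here).
TRUNCATE model_output_tbl;
INSERT INTO model_output_tbl (__dist_key__, mst_key, model_weights)
SELECT src.__dist_key__,
       mdl.mst_key,
       fit_transition_multiple_model(src.dependent_var,
                                     src.independent_var,
                                     mdl.model_weights,
                                     mdl.mst_key)
FROM source_table src
JOIN model_input_tbl mdl USING (__dist_key__);

-- HOP query: the schedule table records, for each model, the segment it was
-- just trained on (__previous_dist_key__) and the segment it should be on
-- next (__dist_key__).  Joining on __previous_dist_key__ means the only data
-- motion is the hop itself: each model moves exactly once per call to
-- run_training().
TRUNCATE model_input_tbl;
INSERT INTO model_input_tbl (__dist_key__, mst_key, model_weights)
SELECT sched.__dist_key__,
       o.mst_key,
       o.model_weights
FROM model_output_tbl o
JOIN schedule_tbl sched
  ON o.__dist_key__ = sched.__previous_dist_key__;
```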
# Other performance enhancements
4. There is no need to hop the models between the last hop of the previous
iteration and the first hop of the next iteration, so once per iteration we
skip the hop and instead just rename model_output_tbl to model_input_tbl
(after truncating and dropping the old model_input_tbl); see the sketches
after this list. For example, with 4 segments, v1.17 performs 4 model hops
per iteration; in this branch the last hop is skipped and the next iteration
starts with the models on the segments they are already on, reducing the 4
hops to 3. The same amount of training occurs as before, and each model is
still paired exactly once per iteration with each segment, just in a
slightly different order.
5. Much faster initialization code: previously, we were reading the weights
in from the original model output table (during warm start) and the model
arch table (for transfer learning) one mst row at a time, from the segments
to master, then writing each of them back out one row at a time from master
back to the segments, with a large number of SELECT and INSERT queries. Now
we just use a single query to copy the weights directly from the original
model output table into the new model output table on the segments, without
ever sending them to master, and a similar single query copies the transfer
learning weights directly from model_arch to model_output for training (see
the sketches after this list). Both of these happen in parallel on the
segments, instead of in sequence on master. During testing of warm start on
a 20-segment cluster with 20 models, this resulted in a 10x reduction in
initialization time (26s instead of 5 mins in v1.17).
6. Split get_model_arch_and_weights() into query_weights() and
get_model_arch(), so we don't have to transfer weights from the segments to
master in places where we only need the model_arch JSON.
7. Enables JIT XLA auto-clustering, if available.
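The skipped hop in item 4 is just a catalog-level swap rather than a
data-moving query. A rough sketch, reusing the illustrative table names from
the sketch above (the exact statements in the module may differ):

```sql
-- Once per iteration, instead of hopping: the trained models stay on the
-- segments they are already on, and the output table simply becomes the
-- next cycle's input table.
DROP TABLE IF EXISTS model_input_tbl;
ALTER TABLE model_output_tbl RENAME TO model_input_tbl;
-- A fresh, empty output table is then created for the next training query,
-- distributed the same way as before.
CREATE TABLE model_output_tbl (LIKE model_input_tbl)
DISTRIBUTED BY (__dist_key__);
```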
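For the faster initialization in item 5, a single INSERT ... SELECT moves the
weights segment-to-segment instead of round-tripping every mst row through
master. A rough sketch, where original_model_output_table stands in for the
user's existing output table and the join is simplified:

```sql
-- Warm start: seed the new model output table directly from the previous
-- fit's output table, entirely on the segments.  The schedule table supplies
-- the __dist_key__ (segment) each model should start out on.
INSERT INTO model_output_tbl (__dist_key__, mst_key, model_weights)
SELECT sched.__dist_key__,
       prev.mst_key,
       prev.model_weights
FROM original_model_output_table prev
JOIN schedule_tbl sched USING (mst_key);
```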
# Other Changes
8. Simplified schedule rotation: the schedule table is created only once,
then gets rotated in place on the segments, instead of being re-created many
times by transferring data back and forth from master to segments to master
each hop (see the sketch at the end of this list). We no longer need
separate "current_schedule" and "grand_schedule" data structures. These
tables do not contain model weights, so there is not much of a performance
benefit, but it's simpler and more maintainable.
9. Added some debugging output that can be enabled to help profile the
performance of fit multiple and to track which segment each mst_key is
located on during each hop. This also serves as an example for the
utils/debug PR this is rebased on top of.
10. Remove AutoML's dependency on the internals of fit_multiple (needed to
make AutoML compatible with the new FitMultiple class, but also good for
avoiding having to do the same thing for any future updates to
fit_multiple). Better modularity.
11. Improved exception handling: send the full stack traceback from the
segments back to master (soon to be renamed "coordinator"). Now when an
exception happens in fit_transition (or merge, final, etc.) we'll be able to
see the stack trace of the internal UDF or UDA running on the segments,
along with the stack trace of the outer UDF that runs on the coordinator.
It's attached to the DETAIL of the Exception and includes the line number
where the error occurred, something we previously had to guess at, making
debugging difficult.
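The schedule rotation in item 8 can be done with a single in-place UPDATE on
the segments. A rough sketch, where dist_key_ring is a hypothetical helper
table mapping each __dist_key__ to the next one in a fixed circular order
(the actual rotation logic in the module may differ):

```sql
-- Rotate the schedule for the next hop: each model's current placement
-- becomes its previous placement, and the model advances to the next
-- __dist_key__ in the ring.
UPDATE schedule_tbl s
SET __previous_dist_key__ = s.__dist_key__,
    __dist_key__ = ring.next_dist_key
FROM dist_key_ring ring
WHERE ring.__dist_key__ = s.__dist_key__;
```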