This is an automated email from the ASF dual-hosted git repository.

pingsutw pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/submarine.git

The following commit(s) were added to refs/heads/master by this push:
     new a94494b  SUBMARINE-928. [Quickstart] Rewrite quickstart guide
a94494b is described below

commit a94494bf4ba89d05b3e8680b3139736204781c35
Author: ByronHsu <byronhsu1...@gmail.com>
AuthorDate: Thu Jul 29 20:54:32 2021 +0800

    SUBMARINE-928. [Quickstart] Rewrite quickstart guide

    ### What is this PR for?
    Write an example that will walk users through the end-to-end usage of the submarine.

    ### What type of PR is it?
    [Documentation]

    ### Todos
    * [ ] - Task

    ### What is the Jira issue?
    https://issues.apache.org/jira/browse/SUBMARINE-928

    ### How should this be tested?

    ### Screenshots (if appropriate)

    ### Questions:
    * Do the license files need updating? No
    * Are there breaking changes for older versions? No
    * Does this need new documentation? No

    Author: ByronHsu <byronhsu1...@gmail.com>

    Signed-off-by: Kevin <pings...@apache.org>

    Closes #664 from ByronHsu/quickstart and squashes the following commits:

    673b2ab0 [ByronHsu] fix conflict
    dcad3bbc [ByronHsu] push quickstart to dockerhub
    d8be6fcd [ByronHsu] add workbench connection and mlflow demo
    b87b04b2 [ByronHsu] add example
    bcb01d79 [ByronHsu] version 1
---
 .github/workflows/deploy_docker_images.yml      |   5 +
 .../examples/quickstart/{post.sh => Dockerfile} |  31 +---
 .../examples/quickstart/{post.sh => build.sh}   |  49 ++---
 dev-support/examples/quickstart/post.sh         |   1 -
 dev-support/examples/quickstart/train.py        |  86 +++++++++
 website/docs/assets/quickstart-mlflow-2.png     | Bin 0 -> 267330 bytes
 website/docs/assets/quickstart-mlflow.png       | Bin 0 -> 309585 bytes
 website/docs/assets/quickstart-submit-1.png     | Bin 0 -> 245302 bytes
 website/docs/assets/quickstart-submit-2.png     | Bin 0 -> 244702 bytes
 website/docs/assets/quickstart-submit-3.png     | Bin 0 -> 251717 bytes
 website/docs/assets/quickstart-submit-4.png     | Bin 0 -> 332445 bytes
 website/docs/assets/quickstart-worbench.png     | Bin 0 -> 86036 bytes
 website/docs/gettingStarted/notebook.md         |   2 +-
 website/docs/gettingStarted/quickstart.md       | 203 +++++++++++++++++++++
 website/docusaurus.config.js                    |   2 +-
 website/sidebars.js                             |   6 +-
 16 files changed, 334 insertions(+), 51 deletions(-)

diff --git a/.github/workflows/deploy_docker_images.yml b/.github/workflows/deploy_docker_images.yml
index 3afd32e..93b55d6 100644
--- a/.github/workflows/deploy_docker_images.yml
+++ b/.github/workflows/deploy_docker_images.yml
@@ -79,3 +79,8 @@ jobs:
         run: ./dev-support/docker-images/serve/build.sh
       - name: Push submarine-serve docker image
         run: docker push apache/submarine:serve-$SUBMARINE_VERSION
+
+      - name: Build submarine quickstart
+        run: ./dev-support/examples/quickstart/build.sh
+      - name: Push submarine quickstart docker image
+        run: docker push apache/submarine:quickstart-$SUBMARINE_VERSION
diff --git a/dev-support/examples/quickstart/post.sh b/dev-support/examples/quickstart/Dockerfile
similarity index 62%
copy from dev-support/examples/quickstart/post.sh
copy to dev-support/examples/quickstart/Dockerfile
index 39336bc..ee6d66d 100644
--- a/dev-support/examples/quickstart/post.sh
+++ b/dev-support/examples/quickstart/Dockerfile
@@ -1,4 +1,3 @@
-#!/usr/bin/env bash
 # Licensed to the Apache Software Foundation (ASF) under one or more
 # contributor license agreements. See the NOTICE file distributed with
 # this work for additional information regarding copyright ownership.
@@ -14,26 +13,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
+FROM continuumio/anaconda3
+MAINTAINER Apache Software Foundation <dev@submarine.apache.org>

-curl -X POST -H "Content-Type: application/json" -d '
-{
-  "meta": {
-    "name": "quickstart",
-    "namespace": "default",
-    "framework": "TensorFlow",
-    "cmd": "python /opt/train.py",
-    "envVars": {
-      "ENV_1": "ENV1"
-    }
-  },
-  "environment": {
-    "image": "quickstart:0.6.0-SNAPSHOT"
-  },
-  "spec": {
-    "Worker": {
-      "replicas": 3,
-      "resources": "cpu=1,memory=1024M"
-    }
-  }
-}
-' http://127.0.0.1:32080/api/v1/experiment
\ No newline at end of file
+ADD ./tmp/submarine-sdk /opt/
+# install submarine-sdk locally
+RUN pip install /opt/pysubmarine/.[tf-latest]
+RUN pip install tensorflow_datasets
+
+ADD ./train.py /opt/
\ No newline at end of file
diff --git a/dev-support/examples/quickstart/post.sh b/dev-support/examples/quickstart/build.sh
old mode 100644
new mode 100755
similarity index 51%
copy from dev-support/examples/quickstart/post.sh
copy to dev-support/examples/quickstart/build.sh
index 39336bc..6865c39
--- a/dev-support/examples/quickstart/post.sh
+++ b/dev-support/examples/quickstart/build.sh
@@ -14,26 +14,31 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
+set -euxo pipefail

-curl -X POST -H "Content-Type: application/json" -d '
-{
-  "meta": {
-    "name": "quickstart",
-    "namespace": "default",
-    "framework": "TensorFlow",
-    "cmd": "python /opt/train.py",
-    "envVars": {
-      "ENV_1": "ENV1"
-    }
-  },
-  "environment": {
-    "image": "quickstart:0.6.0-SNAPSHOT"
-  },
-  "spec": {
-    "Worker": {
-      "replicas": 3,
-      "resources": "cpu=1,memory=1024M"
-    }
-  }
-}
-' http://127.0.0.1:32080/api/v1/experiment
\ No newline at end of file
+SUBMARINE_VERSION=0.6.0-SNAPSHOT
+SUBMARINE_IMAGE_NAME="apache/submarine:quickstart-${SUBMARINE_VERSION}"
+
+if [ -L ${BASH_SOURCE-$0} ]; then
+  PWD=$(dirname $(readlink "${BASH_SOURCE-$0}"))
+else
+  PWD=$(dirname ${BASH_SOURCE-$0})
+fi
+export CURRENT_PATH=$(cd "${PWD}">/dev/null; pwd)
+export SUBMARINE_HOME=${CURRENT_PATH}/../../..
+
+if [ -d "${CURRENT_PATH}/tmp" ] # if old tmp folder is still there, delete it.
+then
+  rm -rf "${CURRENT_PATH}/tmp"
+fi
+
+mkdir -p "${CURRENT_PATH}/tmp"
+cp -r "${SUBMARINE_HOME}/submarine-sdk" "${CURRENT_PATH}/tmp"
+
+# build image
+cd ${CURRENT_PATH}
+echo "Start building the ${SUBMARINE_IMAGE_NAME} docker image ..."
+docker build -t ${SUBMARINE_IMAGE_NAME} .
+
+# clean temp file
+rm -rf "${CURRENT_PATH}/tmp"
diff --git a/dev-support/examples/quickstart/post.sh b/dev-support/examples/quickstart/post.sh
old mode 100644
new mode 100755
index 39336bc..8c23c52
--- a/dev-support/examples/quickstart/post.sh
+++ b/dev-support/examples/quickstart/post.sh
@@ -14,7 +14,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-
 curl -X POST -H "Content-Type: application/json" -d '
 {
   "meta": {
diff --git a/dev-support/examples/quickstart/train.py b/dev-support/examples/quickstart/train.py
new file mode 100644
index 0000000..e33de68
--- /dev/null
+++ b/dev-support/examples/quickstart/train.py
@@ -0,0 +1,86 @@
+# Copyright 2020 The Kubeflow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""
+An example of multi-worker training with Keras model using Strategy API.
+https://github.com/kubeflow/tf-operator/blob/master/examples/v1/distribution_strategy/keras-API/multi_worker_strategy-with-keras.py
+"""
+import tensorflow_datasets as tfds
+import tensorflow as tf
+from tensorflow.keras import layers, models
+from submarine import ModelsClient
+
+def make_datasets_unbatched():
+    BUFFER_SIZE = 10000
+
+    # Scaling MNIST data from (0, 255] to (0., 1.]
+    def scale(image, label):
+        image = tf.cast(image, tf.float32)
+        image /= 255
+        return image, label
+
+    datasets, _ = tfds.load(name='mnist', with_info=True, as_supervised=True)
+
+    return datasets['train'].map(scale).cache().shuffle(BUFFER_SIZE)
+
+
+def build_and_compile_cnn_model():
+    model = models.Sequential()
+    model.add(
+        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
+    model.add(layers.MaxPooling2D((2, 2)))
+    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
+    model.add(layers.MaxPooling2D((2, 2)))
+    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
+    model.add(layers.Flatten())
+    model.add(layers.Dense(64, activation='relu'))
+    model.add(layers.Dense(10, activation='softmax'))
+
+    model.summary()
+
+    model.compile(optimizer='adam',
+                  loss='sparse_categorical_crossentropy',
+                  metrics=['accuracy'])
+
+    return model
+
+def main():
+    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
+        communication=tf.distribute.experimental.CollectiveCommunication.AUTO)
+
+    BATCH_SIZE_PER_REPLICA = 4
+    BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
+
+    with strategy.scope():
+        ds_train = make_datasets_unbatched().batch(BATCH_SIZE).repeat()
+        options = tf.data.Options()
+        options.experimental_distribute.auto_shard_policy = \
+            tf.data.experimental.AutoShardPolicy.DATA
+        ds_train = ds_train.with_options(options)
+        # Model building/compiling need to be within `strategy.scope()`.
+        multi_worker_model = build_and_compile_cnn_model()
+
+    class MyCallback(tf.keras.callbacks.Callback):
+        def on_epoch_end(self, epoch, logs=None):
+            # monitor the loss and accuracy
+            print(logs)
+            modelClient.log_metrics({"loss": logs["loss"], "accuracy": logs["accuracy"]}, epoch)
+
+    with modelClient.start() as run:
+        multi_worker_model.fit(ds_train, epochs=10, steps_per_epoch=70, callbacks=[MyCallback()])
+
+
+if __name__ == '__main__':
+    modelClient = ModelsClient()
+    main()
\ No newline at end of file
diff --git a/website/docs/assets/quickstart-mlflow-2.png b/website/docs/assets/quickstart-mlflow-2.png
new file mode 100644
index 0000000..6430164
Binary files /dev/null and b/website/docs/assets/quickstart-mlflow-2.png differ
diff --git a/website/docs/assets/quickstart-mlflow.png b/website/docs/assets/quickstart-mlflow.png
new file mode 100644
index 0000000..7600663
Binary files /dev/null and b/website/docs/assets/quickstart-mlflow.png differ
diff --git a/website/docs/assets/quickstart-submit-1.png b/website/docs/assets/quickstart-submit-1.png
new file mode 100644
index 0000000..a5d095f
Binary files /dev/null and b/website/docs/assets/quickstart-submit-1.png differ
diff --git a/website/docs/assets/quickstart-submit-2.png b/website/docs/assets/quickstart-submit-2.png
new file mode 100644
index 0000000..cc368d6
Binary files /dev/null and b/website/docs/assets/quickstart-submit-2.png differ
diff --git a/website/docs/assets/quickstart-submit-3.png b/website/docs/assets/quickstart-submit-3.png
new file mode 100644
index 0000000..0ca1daa
Binary files /dev/null and b/website/docs/assets/quickstart-submit-3.png differ
diff --git a/website/docs/assets/quickstart-submit-4.png b/website/docs/assets/quickstart-submit-4.png
new file mode 100644
index 0000000..ad7c60e
Binary files /dev/null and b/website/docs/assets/quickstart-submit-4.png differ
diff --git a/website/docs/assets/quickstart-worbench.png b/website/docs/assets/quickstart-worbench.png
new file mode 100644
index 0000000..a9ca304
Binary files /dev/null and b/website/docs/assets/quickstart-worbench.png differ
diff --git a/website/docs/gettingStarted/notebook.md b/website/docs/gettingStarted/notebook.md
index 532f5bc..1b58b59 100644
--- a/website/docs/gettingStarted/notebook.md
+++ b/website/docs/gettingStarted/notebook.md
@@ -1,5 +1,5 @@
 ---
-title: Notebook Tutorial
+title: Jupyter Notebook
 ---

 <!--
diff --git a/website/docs/gettingStarted/quickstart.md b/website/docs/gettingStarted/quickstart.md
new file mode 100644
index 0000000..0de4a93
--- /dev/null
+++ b/website/docs/gettingStarted/quickstart.md
@@ -0,0 +1,203 @@
+---
+title: Quickstart
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+This document gives you a quick view on the basic usage of Submarine platform. You can finish each step of ML model lifecycle on the platform without messing up with the troublesome environment problems.
+
+## Installation
+
+### Prepare a Kubernetes cluster
+
+1. Prerequisite
+
+- Check [dependency page](https://github.com/apache/submarine/blob/master/website/docs/devDocs/Dependencies.md) for the compatible version
+- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
+- [helm](https://helm.sh/docs/intro/install/) (Helm v3 is minimum requirement.)
+- [minikube](https://minikube.sigs.k8s.io/docs/start/).
+
+2. Start minikube cluster
+```
+$ minikube start --vm-driver=docker --cpus 8 --memory 4096 --kubernetes-version v1.15.11
+```
+
+### Launch submarine in the cluster
+
+1. Clone the project
+```
+$ git clone https://github.com/apache/submarine.git
+```
+
+2. Install the resources by helm chart
+```
+$ cd submarine
+$ helm install submarine ./helm-charts/submarine
+```
+### Ensure submarine is ready
+
+1. Use kubectl to query the status of pods
+```
+$ kubectl get pods
+```
+
+2. Make sure each pod is `Running`
+```
+NAME                                              READY   STATUS    RESTARTS   AGE
+notebook-controller-deployment-5d4f5f874c-vwds8   1/1     Running   0          3h33m
+pytorch-operator-844c866d54-q5ztd                 1/1     Running   0          3h33m
+submarine-database-674987ff7d-r8zqs               1/1     Running   0          3h33m
+submarine-minio-5fdd957785-xd987                  1/1     Running   0          3h33m
+submarine-mlflow-76bbf5c7b-g2ntd                  1/1     Running   0          3h33m
+submarine-server-66f7b8658b-sfmv8                 1/1     Running   0          3h33m
+submarine-tensorboard-6c44944dfb-tvbr9            1/1     Running   0          3h33m
+submarine-traefik-7cbcfd4bd9-4bczn                1/1     Running   0          3h33m
+tf-job-operator-6bb69fd44-mc8ww                   1/1     Running   0          3h33m
+```
+
+### Connect to workbench
+
+1. Port-forwarding
+
+```
+# using port-forwarding
+$ kubectl port-forward --address 0.0.0.0 service/submarine-traefik 32080:80
+```
+
+2. Open `http://0.0.0.0:32080`
+
+![](../assets/quickstart-worbench.png)
+
+## Example: Submit a mnist distributed example
+
+We put the code of this example [here](https://github.com/apache/submarine/tree/master/dev-support/examples/quickstart). `train.py` is our training script, and `build.sh` is the script to build a docker image.
+
+### 1. Write a python script for distributed training
+
+Take a simple mnist tensorflow script as an example. We choose `MultiWorkerMirroredStrategy` as our distributed strategy.
+
+```python
+"""
+./dev-support/examples/quickstart/train.py
+Reference: https://github.com/kubeflow/tf-operator/blob/master/examples/v1/distribution_strategy/keras-API/multi_worker_strategy-with-keras.py
+"""
+
+import tensorflow_datasets as tfds
+import tensorflow as tf
+from tensorflow.keras import layers, models
+from submarine import ModelsClient
+
+def make_datasets_unbatched():
+    BUFFER_SIZE = 10000
+
+    # Scaling MNIST data from (0, 255] to (0., 1.]
+    def scale(image, label):
+        image = tf.cast(image, tf.float32)
+        image /= 255
+        return image, label
+
+    datasets, _ = tfds.load(name='mnist', with_info=True, as_supervised=True)
+
+    return datasets['train'].map(scale).cache().shuffle(BUFFER_SIZE)
+
+
+def build_and_compile_cnn_model():
+    model = models.Sequential()
+    model.add(
+        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
+    model.add(layers.MaxPooling2D((2, 2)))
+    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
+    model.add(layers.MaxPooling2D((2, 2)))
+    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
+    model.add(layers.Flatten())
+    model.add(layers.Dense(64, activation='relu'))
+    model.add(layers.Dense(10, activation='softmax'))
+
+    model.summary()
+
+    model.compile(optimizer='adam',
+                  loss='sparse_categorical_crossentropy',
+                  metrics=['accuracy'])
+
+    return model
+
+def main():
+    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
+        communication=tf.distribute.experimental.CollectiveCommunication.AUTO)
+
+    BATCH_SIZE_PER_REPLICA = 4
+    BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
+
+    with strategy.scope():
+        ds_train = make_datasets_unbatched().batch(BATCH_SIZE).repeat()
+        options = tf.data.Options()
+        options.experimental_distribute.auto_shard_policy = \
+            tf.data.experimental.AutoShardPolicy.DATA
+        ds_train = ds_train.with_options(options)
+        # Model building/compiling need to be within `strategy.scope()`.
+        multi_worker_model = build_and_compile_cnn_model()
+
+    class MyCallback(tf.keras.callbacks.Callback):
+        def on_epoch_end(self, epoch, logs=None):
+            # monitor the loss and accuracy
+            print(logs)
+            modelClient.log_metrics({"loss": logs["loss"], "accuracy": logs["accuracy"]}, epoch)
+
+    with modelClient.start() as run:
+        multi_worker_model.fit(ds_train, epochs=10, steps_per_epoch=70, callbacks=[MyCallback()])
+
+
+if __name__ == '__main__':
+    modelClient = ModelsClient()
+    main()
+```
+
+### 2. Prepare an environment compatible with the training
+Build a docker image equipped with the requirement of the environment.
+
+```bash
+$ ./dev-support/examples/quickstart/build.sh
+```
+
+### 3. Submit the experiment
+
+1. Open submarine workbench and click `+ New Experiment`
+2. Fill the form accordingly. Here we set 3 workers.
+
+    1. Step 1
+    ![](../assets/quickstart-submit-1.png)
+    2. Step 2
+    ![](../assets/quickstart-submit-2.png)
+    3. Step 3
+    ![](../assets/quickstart-submit-3.png)
+    4. The experiment is successfully submitted
+    ![](../assets/quickstart-submit-4.png)
+
+### 4. Monitor the process (modelClient)
+
+1. In our code, we use `modelClient` from `submarine-sdk` to record the metrics. To see the result, click `MLflow UI` in the workbench.
+2. To compare the metrics of each worker, you can select all workers and then click `compare`
+
+  ![](../assets/quickstart-mlflow.png)
+
+  ![](../assets/quickstart-mlflow-2.png)
+
+
+### 5. Serve the model (In development)
diff --git a/website/docusaurus.config.js b/website/docusaurus.config.js
index bf482b3..23b1839 100644
--- a/website/docusaurus.config.js
+++ b/website/docusaurus.config.js
@@ -37,7 +37,7 @@ module.exports = {
       items: [
         {
           type: 'doc',
-          docId: 'gettingStarted/localDeployment',
+          docId: 'gettingStarted/quickstart',
           label: 'Docs',
           position: 'left',
         },
diff --git a/website/sidebars.js b/website/sidebars.js
index 80fd24d..b10ec3a 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -22,10 +22,10 @@ module.exports = {
     {
       "Introduction": [],
       "Getting Started": [
-        "gettingStarted/localDeployment",
-        "gettingStarted/kind",
+        "gettingStarted/quickstart",
+        // "gettingStarted/localDeployment",
         "gettingStarted/notebook",
-        "gettingStarted/python-sdk",
+        // "gettingStarted/python-sdk",
       ],
       "User Docs": [
         {

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@submarine.apache.org
For additional commands, e-mail: dev-h...@submarine.apache.org