DCausse has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/387442 )
Change subject: Update resource usage docs
......................................................................
Update resource usage docs
Add some information learned when evaluating different sizes of
executors used to train models.
Bug: T170009
Change-Id: I2cf916f999f053d1ce8545ac8279815e9b09a0f9
---
M docs/running-in-analytics.rst
M mjolnir/test/fixtures/load_config/example_train.expect
M mjolnir/test/utilities/test_spark.py
3 files changed, 89 insertions(+), 60 deletions(-)
Approvals:
DCausse: Verified; Looks good to me, approved
diff --git a/docs/running-in-analytics.rst b/docs/running-in-analytics.rst
index f728052..93a6a5e 100644
--- a/docs/running-in-analytics.rst
+++ b/docs/running-in-analytics.rst
@@ -295,10 +295,37 @@
Same as before the final argument is the wiki to limit data collection and
training to.
-Resource usage in the hadoop cluster
-====================================
-TODO
+Resource usage in the hadoop cluster when training
+==================================================
+If the training data all fits on a single executor, that is the most efficient
+use of cluster resources. This may not be the fastest way to train individual
+models, but if we are doing hyperparameter tuning we are generally training
+many models in parallel, and the lowest total CPU time per model comes
+from using a single executor.
+
+Training speed vs core count stays relatively flat up to about 6 cores. Less
+parallelism is again more efficient in terms of total cluster resource usage,
+but up to 6 cores the decrease is minimal. Beyond 6 cores the efficiency
+loss grows at a greater rate.
+
+Overall suggestions:
+
+* Train models with 4 or 6 cores per executor.
+* Aim for a single executor if reasonable.
+* Limitation: Cluster has ~2GB of memory per core, so training data (with
+  duplicates, due to spark storage, task data, and the xgboost DMatrix copy
+  in C++) needs to fit in 4*2 or 6*2 GB of memory. This is actually quite
+  reasonable with our current feature size, but may need to be revisited if
+  we dramatically increase the number of features used.
+
+Other:
+
+* Minimum amounts of memory that work fine for training a single model will
+  regularly overrun their memory allocation when used to train in mjolnir
+  with hyperparameter optimization. We need to over-provision memory vs what
+  it takes to spin up a spark instance and train a single model. Perhaps
+  this is some sort of leak, or late de-allocation, in xgboost? Unsure.
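+
+For example, a hypothetical training profile following these suggestions
+(values are illustrative only, taken from the example profiles in the test
+fixtures) might pass spark arguments like::
+
+    spark_args:
+      executor-cores: '4'
+      executor-memory: 2G
+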
Help! There are exceptions everywhere!
======================================
diff --git a/mjolnir/test/fixtures/load_config/example_train.expect b/mjolnir/test/fixtures/load_config/example_train.expect
index a423aa4..23e536f 100644
--- a/mjolnir/test/fixtures/load_config/example_train.expect
+++ b/mjolnir/test/fixtures/load_config/example_train.expect
@@ -14,7 +14,7 @@
SPARK_HOME: /home/pytest/spark-2.1.0-bin-hadoop2.6
USER: pytest
mjolnir_utility: data_pipeline
- mjolnir_utility_path: /vagrant/venv/bin/mjolnir-utilities.py
+ mjolnir_utility_path: /srv/mjolnir/venv/bin/mjolnir-utilities.py
paths:
dir_exist: !!set
/etc/spark/conf: null
@@ -22,11 +22,11 @@
/mnt/hdfs/user/pytest/mjolnir/marker: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
executor-cores: '1'
executor-memory: 2G
files: /usr/lib/libhdfs.so.0.0.0
@@ -51,11 +51,11 @@
/etc/spark/conf: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
executor-cores: '1'
executor-memory: 2G
files: /usr/lib/libhdfs.so.0.0.0
@@ -80,11 +80,11 @@
/etc/spark/conf: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
executor-cores: '4'
executor-memory: 2G
files: /usr/lib/libhdfs.so.0.0.0
@@ -108,18 +108,18 @@
SPARK_HOME: /home/pytest/spark-2.1.0-bin-hadoop2.6
USER: pytest
mjolnir_utility: training_pipeline
- mjolnir_utility_path: /vagrant/venv/bin/mjolnir-utilities.py
+ mjolnir_utility_path: /srv/mjolnir/venv/bin/mjolnir-utilities.py
paths:
dir_exist: !!set
/etc/spark/conf: null
/home/pytest/training_size: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
driver-memory: 3G
executor-cores: '1'
executor-memory: 2G
@@ -153,7 +153,7 @@
SPARK_HOME: /home/pytest/spark-2.1.0-bin-hadoop2.6
USER: pytest
mjolnir_utility: data_pipeline
- mjolnir_utility_path: /vagrant/venv/bin/mjolnir-utilities.py
+ mjolnir_utility_path: /srv/mjolnir/venv/bin/mjolnir-utilities.py
paths:
dir_exist: !!set
/etc/spark/conf: null
@@ -161,11 +161,11 @@
/mnt/hdfs/user/pytest/mjolnir/marker: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
executor-cores: '1'
executor-memory: 2G
files: /usr/lib/libhdfs.so.0.0.0
@@ -190,11 +190,11 @@
/etc/spark/conf: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
executor-cores: '1'
executor-memory: 2G
files: /usr/lib/libhdfs.so.0.0.0
@@ -219,11 +219,11 @@
/etc/spark/conf: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
executor-cores: '4'
executor-memory: 2G
files: /usr/lib/libhdfs.so.0.0.0
@@ -251,18 +251,18 @@
SPARK_HOME: /home/pytest/spark-2.1.0-bin-hadoop2.6
USER: pytest
mjolnir_utility: training_pipeline
- mjolnir_utility_path: /vagrant/venv/bin/mjolnir-utilities.py
+ mjolnir_utility_path: /srv/mjolnir/venv/bin/mjolnir-utilities.py
paths:
dir_exist: !!set
/etc/spark/conf: null
/home/pytest/training_size: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
driver-memory: 3G
executor-cores: '6'
executor-memory: 3G
@@ -298,7 +298,7 @@
SPARK_HOME: /home/pytest/spark-2.1.0-bin-hadoop2.6
USER: pytest
mjolnir_utility: data_pipeline
- mjolnir_utility_path: /vagrant/venv/bin/mjolnir-utilities.py
+ mjolnir_utility_path: /srv/mjolnir/venv/bin/mjolnir-utilities.py
paths:
dir_exist: !!set
/etc/spark/conf: null
@@ -306,11 +306,11 @@
/mnt/hdfs/user/pytest/mjolnir/marker: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
executor-cores: '1'
executor-memory: 2G
files: /usr/lib/libhdfs.so.0.0.0
@@ -335,11 +335,11 @@
/etc/spark/conf: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
executor-cores: '1'
executor-memory: 2G
files: /usr/lib/libhdfs.so.0.0.0
@@ -364,11 +364,11 @@
/etc/spark/conf: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
executor-cores: '4'
executor-memory: 2G
files: /usr/lib/libhdfs.so.0.0.0
@@ -396,18 +396,18 @@
SPARK_HOME: /home/pytest/spark-2.1.0-bin-hadoop2.6
USER: pytest
mjolnir_utility: training_pipeline
- mjolnir_utility_path: /vagrant/venv/bin/mjolnir-utilities.py
+ mjolnir_utility_path: /srv/mjolnir/venv/bin/mjolnir-utilities.py
paths:
dir_exist: !!set
/etc/spark/conf: null
/home/pytest/training_size: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
driver-memory: 3G
executor-cores: '6'
executor-memory: 3G
@@ -445,7 +445,7 @@
SPARK_HOME: /home/pytest/spark-2.1.0-bin-hadoop2.6
USER: pytest
mjolnir_utility: data_pipeline
- mjolnir_utility_path: /vagrant/venv/bin/mjolnir-utilities.py
+ mjolnir_utility_path: /srv/mjolnir/venv/bin/mjolnir-utilities.py
paths:
dir_exist: !!set
/etc/spark/conf: null
@@ -453,11 +453,11 @@
/mnt/hdfs/user/pytest/mjolnir/marker: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
executor-cores: '1'
executor-memory: 2G
files: /usr/lib/libhdfs.so.0.0.0
@@ -482,11 +482,11 @@
/etc/spark/conf: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
executor-cores: '1'
executor-memory: 2G
files: /usr/lib/libhdfs.so.0.0.0
@@ -511,11 +511,11 @@
/etc/spark/conf: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
executor-cores: '4'
executor-memory: 2G
files: /usr/lib/libhdfs.so.0.0.0
@@ -543,18 +543,18 @@
SPARK_HOME: /home/pytest/spark-2.1.0-bin-hadoop2.6
USER: pytest
mjolnir_utility: training_pipeline
- mjolnir_utility_path: /vagrant/venv/bin/mjolnir-utilities.py
+ mjolnir_utility_path: /srv/mjolnir/venv/bin/mjolnir-utilities.py
paths:
dir_exist: !!set
/etc/spark/conf: null
/home/pytest/training_size: null
file_exist: !!set
/home/pytest/spark-2.1.0-bin-hadoop2.6/bin/spark-submit: null
- /vagrant/mjolnir_venv.zip: null
- /vagrant/venv/bin/mjolnir-utilities.py: null
+ /srv/mjolnir/mjolnir_venv.zip: null
+ /srv/mjolnir/venv/bin/mjolnir-utilities.py: null
venv/bin/python: null
spark_args:
- archives: /vagrant/mjolnir_venv.zip#venv
+ archives: /srv/mjolnir/mjolnir_venv.zip#venv
driver-memory: 3G
executor-cores: '4'
executor-memory: 2G
diff --git a/mjolnir/test/utilities/test_spark.py b/mjolnir/test/utilities/test_spark.py
index ccea823..07e0b0c 100644
--- a/mjolnir/test/utilities/test_spark.py
+++ b/mjolnir/test/utilities/test_spark.py
@@ -70,7 +70,9 @@
monkeypatch.setenv('USER', 'pytest')
with open(test_file, 'r') as f:
-        global_config, profiles = mjolnir.utilities.spark.load_config(f, 'marker', {})
+        global_config, profiles = mjolnir.utilities.spark.load_config(f, 'marker', {
+            'mjolnir_dir': '/srv/mjolnir',
+        })
compare_fixture(expect_file, {
'global_config': global_config,
'profiles': profiles
--
To view, visit https://gerrit.wikimedia.org/r/387442
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: merged
Gerrit-Change-Id: I2cf916f999f053d1ce8545ac8279815e9b09a0f9
Gerrit-PatchSet: 5
Gerrit-Project: search/MjoLniR
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <[email protected]>
Gerrit-Reviewer: DCausse <[email protected]>
_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits