[MINOR] Added common errors and troubleshooting tricks Closes #428.
Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/bd232241
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/bd232241
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/bd232241

Branch: refs/heads/gh-pages
Commit: bd232241b432dbe28e952ae36f1dce03f5658e23
Parents: 358cfc9
Author: Niketan Pansare <npan...@us.ibm.com>
Authored: Mon Mar 13 13:53:45 2017 -0800
Committer: Niketan Pansare <npan...@us.ibm.com>
Committed: Mon Mar 13 14:53:45 2017 -0700

----------------------------------------------------------------------
 troubleshooting-guide.md | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/bd232241/troubleshooting-guide.md
----------------------------------------------------------------------
diff --git a/troubleshooting-guide.md b/troubleshooting-guide.md
index db8f060..629bcf5 100644
--- a/troubleshooting-guide.md
+++ b/troubleshooting-guide.md
@@ -94,3 +94,45 @@ Note: The default `SystemML-config.xml` is located in `<path to SystemML root>/c
 
     hadoop jar SystemML.jar [-? | -help | -f <filename>] (-config=<config_filename>) ([-args | -nvargs] <args-list>)
 
 See [Invoking SystemML in Hadoop Batch Mode](hadoop-batch-mode.html) for details of the syntax.
+
+## Total size of serialized results is bigger than spark.driver.maxResultSize
+
+Spark aborts a job if the estimated size of the collected results is greater than `spark.driver.maxResultSize`, to avoid out-of-memory errors in the driver.
+However, SystemML's optimizer estimates the memory required for each operator and itself guards against these driver out-of-memory errors.
+So, we recommend setting the configuration `--conf spark.driver.maxResultSize=0`.
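+
+For example, this setting can be passed to `spark-submit` along with the rest of the job configuration (the driver memory size and script name below are illustrative placeholders, not part of this change):
+
+    spark-submit --driver-memory 20G --conf spark.driver.maxResultSize=0 SystemML.jar -f myscript.dml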
+
+## File does not exist on HDFS/LFS error from remote parfor
+
+This error usually comes from an incorrect HDFS configuration on the worker nodes. To investigate this, we recommend:
+
+- Testing whether HDFS is accessible from the worker node: `hadoop fs -ls <file path>`
+- Synchronizing the Hadoop configuration across the worker nodes.
+- Setting the environment variable `HADOOP_CONF_DIR`. You may have to restart the cluster manager for the Hadoop configuration to take effect.
+
+## JVM Garbage Collection related flags
+
+We recommend allocating 10% of the maximum memory to the young generation and using the `-server` flag for a robust garbage collection policy.
+For example, if you intend to use a 20G driver and 60G executors, add the following to your configuration:
+
+    spark-submit --driver-memory 20G --executor-memory 60G --conf "spark.executor.extraJavaOptions=-Xmn6G -server" --conf "spark.driver.extraJavaOptions=-Xmn2G -server" ...
+
+## Memory overhead
+
+Spark sets `spark.yarn.executor.memoryOverhead`, `spark.yarn.driver.memoryOverhead` and `spark.yarn.am.memoryOverhead` to 10% of the memory provided
+to the executor, driver and YARN Application Master respectively (with a minimum of 384 MB). For certain workloads, you may have to increase this
+overhead to 12-15% of the memory budget.
+
+## Network timeout
+
+To avoid false-positive errors due to network failures in compute-bound scripts, you may have to increase the timeout `spark.network.timeout` (default: 120s).
+
+## Advanced developer statistics
+
+A few of our operators (for example, the convolution-related operators) and the GPU backend allow an expert user to collect advanced statistics
+by setting the configurations `systemml.stats.extraGPU` and `systemml.stats.extraDNN` in the file `SystemML-config.xml`.
+
+## Out-Of-Memory on executors
+
+Out-of-memory errors on executors are often caused by side effects of Spark's lazy evaluation and in-memory input data for large-scale problems.
+Though we are constantly improving our optimizer to address this scenario, a quick workaround is to reduce the number of cores allocated to each executor.
+We would highly appreciate it if you filed a bug report on our [issue tracker](https://issues.apache.org/jira/browse/SYSTEMML) if and when you encounter an OOM error.
\ No newline at end of file
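+
+As an illustrative sketch of this workaround (the memory size and core count are placeholders, not recommendations): keeping `--executor-memory` fixed while lowering `--executor-cores` leaves more heap for each concurrently running task.
+
+    spark-submit --executor-memory 60G --executor-cores 4 ...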