This is an automated email from the ASF dual-hosted git repository.

rickyma pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle.git

The following commit(s) were added to refs/heads/master by this push:
     new c92406f3f [#2018] docs: Add a troubleshooting document (#2032)
c92406f3f is described below

commit c92406f3faa93129883ecb5a419cef2b6b3f10ea
Author: maobaolong <baoloong...@tencent.com>
AuthorDate: Mon Aug 12 15:46:28 2024 +0800

    [#2018] docs: Add a troubleshooting document (#2032)

    ### What changes were proposed in this pull request?

    Introduce a document of troubleshooting.

    ### Why are the changes needed?

    Fix: #2018.

    ### Does this PR introduce _any_ user-facing change?

    Yes, it introduce a new document.

    ### How was this patch tested?

    No need.
---
 docs/index.md           |  2 ++
 docs/troubleshooting.md | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 84 insertions(+)

diff --git a/docs/index.md b/docs/index.md
index db5a2c609..0b872fcd4 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -36,6 +36,8 @@ More advanced details for Uniffle users are available in the following:
 - [Uniffle Shuffle Client Guide](client_guide/client_guide.md)
 
 - [Metrics Guide](metrics_guide.md)
+
+- [Troubleshooting](troubleshooting.md)
 
 - Here you can read API docs for Uniffle along with its submodules.
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
new file mode 100644
index 000000000..ca5c53a7d
--- /dev/null
+++ b/docs/troubleshooting.md
@@ -0,0 +1,82 @@
+---
+layout: page
+displayTitle: Uniffle Shuffle Server Guide
+title: Uniffle Shuffle Server Guide
+description: Uniffle Shuffle Server Guide
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements. See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.
You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+# Troubleshooting
+
+## Where is the Uniffle log file?
+
+Uniffle logs are stored in `$RSS_LOG_DIR`, which defaults to `${RSS_HOME}/logs`. The common log file names are `coordinator.log`, `shuffle-server.log`, and `dashboard.log`.
+
+## Audit logs
+
+The Uniffle cluster provides audit logs for each process. Audit logs are written to the same log directory; the file names are `coordinator_rpc_audit.log`, `shuffle_server_rpc_audit.log`, and `shuffle_server_storage_audit.log`.
+
+| Audit log name                   | Configuration                         | Default | Description                                                                     |
+|----------------------------------|---------------------------------------|---------|---------------------------------------------------------------------------------|
+| coordinator rpc audit log        | rss.coordinator.rpc.audit.log.enabled | true    | Records an audit entry for each coordinator RPC operation.                      |
+| shuffle server rpc audit log     | rss.server.rpc.audit.log.enabled      | true    | Records an audit entry for each shuffle server RPC operation.                   |
+| shuffle server storage audit log | rss.server.storage.audit.log.enabled  | false   | Records an audit entry for every disk write and delete operation on the server. |
+
+From these audit logs, you can check the details of each operation and how long it took.
+
+## Uniffle remote debug
+
+### Debugging Uniffle processes
+
+Java remote debugging makes it easier to debug Uniffle at the source level without modifying any code. You will need to set the JVM remote debugging parameters before starting the process.
There are several ways to add the remote debugging parameters; for example, you can export the following environment variables in your shell or in `conf/rss-env.sh`:
+
+```shell
+# Java 8
+export DASHBOARD_JAVA_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=5004"
+export COORDINATOR_JAVA_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=5006"
+export SHUFFLE_SERVER_JAVA_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=5005"
+
+# Java 11
+export DASHBOARD_JAVA_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=*:5004"
+export COORDINATOR_JAVA_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=*:5006"
+export SHUFFLE_SERVER_JAVA_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=*:5005"
+```
+
+In general, use `<PROCESS>_JAVA_OPTS` to specify how a debugger attaches to a Uniffle process.
+
+The `suspend={y | n}` parameter determines whether the JVM waits at startup until a debugger connects.
+
+The `address` parameter sets the port on which the Uniffle process listens for a debugger. If it is omitted, the JVM picks an available port itself.
+
+After completing this setup, see [To attach](#to-attach).
+
+### To attach
+
+For more detailed guidance, see the [comprehensive tutorial on how to attach to and debug a Java process in IntelliJ](https://www.jetbrains.com/help/idea/attaching-to-local-process.html).
+
+Start the process or a shell command of interest, then create a new Java remote configuration, set the debug server's host and port, and start the debug session.
+When execution reaches a breakpoint you have set, the IDE enters debug mode. You can then inspect the current context's variables, call stack, and thread list, and evaluate expressions.
+
+## Resource Leak Detection
+
+While operating a Uniffle cluster, you may notice a log message like:
+
+```
+[ERROR] ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
+```
+
+Uniffle uses Netty's built-in memory leak detection mechanism to help identify potential resource leaks. This message suggests that a bug in the Uniffle code may be causing a resource leak.
+If this message appears while the cluster is running, please open a GitHub Issue as a bug report and share your log message, along with any relevant stack traces associated with it.
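
When preparing the bug report described above, it can help to gather every leak message together with the lines that follow it. The snippet below is a minimal sketch and is not part of the commit: it assumes the logs live in `$RSS_LOG_DIR` (falling back to `${RSS_HOME}/logs`, as the document states) and that the log files end in `.log`. The function name `find_leak_reports` and the default of 30 trailing lines are illustrative choices, not Uniffle conventions.

```shell
# Hypothetical helper (not shipped with Uniffle): print each Netty leak
# message found in the Uniffle logs, plus the lines that follow it.
#   $1  log directory (default: $RSS_LOG_DIR, else ${RSS_HOME}/logs)
#   $2  number of trailing lines to print per match (default 30, an assumption)
find_leak_reports() {
  leak_log_dir="${1:-${RSS_LOG_DIR:-${RSS_HOME:-.}/logs}}"
  leak_context="${2:-30}"
  # -H prints the file name, -n the line number, -A the trailing context.
  find "$leak_log_dir" -type f -name '*.log' \
    -exec grep -H -n -A "$leak_context" 'ResourceLeakDetector - LEAK' {} +
}
```

For example, `find_leak_reports "$RSS_LOG_DIR" 50 > leaks.txt` would write the matches to a file (the name `leaks.txt` is arbitrary) that can be attached to the GitHub issue.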