This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new df15c8d7744b [SPARK-48471][CORE] Improve documentation and usage guide for history server
df15c8d7744b is described below

commit df15c8d7744becfd44cd4a447c362e8e007bd574
Author: Kent Yao <y...@apache.org>
AuthorDate: Thu May 30 17:16:31 2024 +0800

[SPARK-48471][CORE] Improve documentation and usage guide for history server

### What changes were proposed in this pull request?

In this PR, we improve the documentation and usage guide for the history server by:
- Identifying and printing **unrecognized options** specified by users
- Obtaining and printing all history server-related configurations dynamically, instead of using an incomplete, outdated hardcoded list
- Ensuring all configurations are documented for the usage guide

### Why are the changes needed?

- Revise the help guide for the history server to make it more user-friendly. Configurations missing from the help guide are not always reachable in our official documentation; e.g., spark.history.fs.safemodeCheck.interval has been missing from the docs since it was added in 1.6.
- Misuse should be reported to users.

### Does this PR introduce _any_ user-facing change?

No. The print style is unchanged; only more items are listed.

### How was this patch tested?

#### Without this PR

```
Usage: ./sbin/start-history-server.sh [options]
24/05/30 15:37:23 INFO SignalUtils: Registering signal handler for TERM
24/05/30 15:37:23 INFO SignalUtils: Registering signal handler for HUP
24/05/30 15:37:23 INFO SignalUtils: Registering signal handler for INT

Options:
  --properties-file FILE      Path to a custom Spark properties file.
                              Default is conf/spark-defaults.conf.

Configuration options can be set by setting the corresponding JVM system property.
History Server options are always available; additional options depend on the provider.
History Server options:

  spark.history.ui.port              Port where server will listen for connections
                                     (default 18080)
  spark.history.acls.enable          Whether to enable view acls for all applications
                                     (default false)
  spark.history.provider             Name of history provider class (defaults to
                                     file system-based provider)
  spark.history.retainedApplications Max number of application UIs to keep loaded in memory
                                     (default 50)
FsHistoryProvider options:

  spark.history.fs.logDirectory      Directory where app logs are stored
                                     (default: file:/tmp/spark-events)
  spark.history.fs.update.interval   How often to reload log data from storage
                                     (in seconds, default: 10)
```

#### For error

```java
Unrecognized options: --conf spark.history.ui.port=10000

Usage: HistoryServer [options]

Options:
  --properties-file FILE      Path to a custom Spark properties file.
                              Default is conf/spark-defaults.conf.
```

#### For help

```java
sbin/start-history-server.sh --help
Usage: ./sbin/start-history-server.sh [options]
{"ts":"2024-05-30T07:15:29.740Z","level":"INFO","msg":"Registering signal handler for TERM","context":{"signal":"TERM"},"logger":"SignalUtils"}
{"ts":"2024-05-30T07:15:29.741Z","level":"INFO","msg":"Registering signal handler for HUP","context":{"signal":"HUP"},"logger":"SignalUtils"}
{"ts":"2024-05-30T07:15:29.741Z","level":"INFO","msg":"Registering signal handler for INT","context":{"signal":"INT"},"logger":"SignalUtils"}

Options:
  --properties-file FILE      Path to a custom Spark properties file.
                              Default is conf/spark-defaults.conf.

Configuration options can be set by setting the corresponding JVM system property.
History Server options are always available; additional options depend on the provider.

History Server options:
  spark.history.custom.executor.log.url
      Specifies custom spark executor log url for supporting external log
      service instead of using cluster managers' application log urls in the
      history server. Spark will support some path variables via patterns
      which can vary on cluster manager. Please check the documentation for
      your cluster manager to see which patterns are supported, if any. This
      configuration has no effect on a live application, it only affects the
      history server.
      (Default: <undefined>)
  spark.history.custom.executor.log.url.applyIncompleteApplication
      Whether to apply custom executor log url, as specified by
      spark.history.custom.executor.log.url, to incomplete application as
      well. Even if this is true, this still only affects the behavior of the
      history server, not running spark applications.
      (Default: true)
  spark.history.kerberos.enabled
      Indicates whether the history server should use kerberos to login. This
      is required if the history server is accessing HDFS files on a secure
      Hadoop cluster.
      (Default: false)
  spark.history.kerberos.keytab
      When spark.history.kerberos.enabled=true, specifies location of the
      kerberos keytab file for the History Server.
      (Default: <undefined>)
  spark.history.kerberos.principal
      When spark.history.kerberos.enabled=true, specifies kerberos principal
      name for the History Server.
      (Default: <undefined>)
  spark.history.provider
      Name of the class implementing the application history backend.
      (Default: org.apache.spark.deploy.history.FsHistoryProvider)
  spark.history.retainedApplications
      The number of applications to retain UI data for in the cache. If this
      cap is exceeded, then the oldest applications will be removed from the
      cache. If an application is not in the cache, it will have to be loaded
      from disk if it is accessed from the UI.
      (Default: 50)
  spark.history.store.hybridStore.diskBackend
      Specifies a disk-based store used in hybrid store; ROCKSDB or LEVELDB
      (deprecated).
      (Default: ROCKSDB)
  spark.history.store.hybridStore.enabled
      Whether to use HybridStore as the store when parsing event logs.
      HybridStore will first write data to an in-memory store and having a
      background thread that dumps data to a disk store after the writing to
      in-memory store is completed.
      (Default: false)
  spark.history.store.hybridStore.maxMemoryUsage
      Maximum memory space that can be used to create HybridStore. The
      HybridStore co-uses the heap memory, so the heap memory should be
      increased through the memory option for SHS if the HybridStore is
      enabled.
      (Default: 2g)
  spark.history.store.maxDiskUsage
      Maximum disk usage for the local directory where the cache application
      history information are stored.
      (Default: 10g)
  spark.history.store.path
      Local directory where to cache application history information. By
      default this is not set, meaning all history information will be kept
      in memory.
      (Default: <undefined>)
  spark.history.store.serializer
      Serializer for writing/reading in-memory UI objects to/from disk-based
      KV Store; JSON or PROTOBUF. JSON serializer is the only choice before
      Spark 3.4.0, thus it is the default value. PROTOBUF serializer is fast
      and compact, and it is the default serializer for disk-based KV store
      of live UI.
      (Default: JSON)
  spark.history.ui.acls.enable
      Specifies whether ACLs should be checked to authorize users viewing the
      applications in the history server. If enabled, access control checks
      are performed regardless of what the individual applications had set
      for spark.ui.acls.enable. The application owner will always have
      authorization to view their own application and any users specified via
      spark.ui.view.acls and groups specified via spark.ui.view.acls.groups
      when the application was run will also have authorization to view that
      application. If disabled, no access control checks are made for any
      application UIs available through the history server.
      (Default: false)
  spark.history.ui.admin.acls
      Comma separated list of users that have view access to all the Spark
      applications in history server.
      (Default: )
  spark.history.ui.admin.acls.groups
      Comma separated list of groups that have view access to all the Spark
      applications in history server.
      (Default: )
  spark.history.ui.port
      Web UI port to bind Spark History Server
      (Default: 18080)
FsHistoryProvider options:
  spark.history.fs.cleaner.enabled
      Whether the History Server should periodically clean up event logs from
      storage
      (Default: false)
  spark.history.fs.cleaner.interval
      When spark.history.fs.cleaner.enabled=true, specifies how often the
      filesystem job history cleaner checks for files to delete.
      (Default: 1d)
  spark.history.fs.cleaner.maxAge
      When spark.history.fs.cleaner.enabled=true, history files older than
      this will be deleted when the filesystem history cleaner runs.
      (Default: 7d)
```

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46802 from yaooqinn/SPARK-48471.

Authored-by: Kent Yao <y...@apache.org>
Signed-off-by: Kent Yao <y...@apache.org>
---
 .../spark/deploy/history/HistoryServer.scala       |  1 -
 .../deploy/history/HistoryServerArguments.scala    | 88 +++++++++++++---------
 .../org/apache/spark/internal/config/History.scala | 50 ++++++++++--
 .../history/HistoryServerArgumentsSuite.scala      | 13 ++++
 4 files changed, 110 insertions(+), 42 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
index cad1797590a7..6e559dc4492e 100644
--- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
@@ -302,7 +302,6 @@ object HistoryServer extends Logging {
     val securityManager = createSecurityManager(conf)
 
     val providerName = conf.get(History.PROVIDER)
-      .getOrElse(classOf[FsHistoryProvider].getName())
     val provider = Utils.classForName[ApplicationHistoryProvider](providerName)
       .getConstructor(classOf[SparkConf])
       .newInstance(conf)
diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServerArguments.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServerArguments.scala
index 01cc59e1d2e6..2fdf7a473a29 100644
--- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServerArguments.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServerArguments.scala
@@ -21,6 +21,7 @@ import scala.annotation.tailrec
 
 import org.apache.spark.SparkConf
 import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config.{ConfigEntry, History}
 import org.apache.spark.util.Utils
 
 /**
@@ -44,47 +45,62 @@ private[history] class HistoryServerArguments(conf: SparkConf, args: Array[Strin
       case Nil =>
 
-      case _ =>
-        printUsageAndExit(1)
+      case other =>
+        val errorMsg = s"Unrecognized options: ${other.mkString(" ")}\n"
+        printUsageAndExit(1, errorMsg)
     }
   }
 
-  // This mutates the SparkConf, so all accesses to it must be made after this line
-  Utils.loadDefaultSparkProperties(conf, propertiesFile)
+  // This mutates the SparkConf, so all accesses to it must be made after this line
+  Utils.loadDefaultSparkProperties(conf, propertiesFile)
 
-  private def printUsageAndExit(exitCode: Int): Unit = {
-    // scalastyle:off println
-    System.err.println(
-      """
-        |Usage: HistoryServer [options]
-        |
-        |Options:
-        |  --properties-file FILE      Path to a custom Spark properties file.
-        |                              Default is conf/spark-defaults.conf.
-        |
-        |Configuration options can be set by setting the corresponding JVM system property.
-        |History Server options are always available; additional options depend on the provider.
-        |
-        |History Server options:
-        |
-        |  spark.history.ui.port              Port where server will listen for connections
-        |                                     (default 18080)
-        |  spark.history.acls.enable          Whether to enable view acls for all applications
-        |                                     (default false)
-        |  spark.history.provider             Name of history provider class (defaults to
-        |                                     file system-based provider)
-        |  spark.history.retainedApplications Max number of application UIs to keep loaded in memory
-        |                                     (default 50)
-        |FsHistoryProvider options:
-        |
-        |  spark.history.fs.logDirectory      Directory where app logs are stored
-        |                                     (default: file:/tmp/spark-events)
-        |  spark.history.fs.update.interval   How often to reload log data from storage
-        |                                     (in seconds, default: 10)
-        |""".stripMargin)
-    // scalastyle:on println
+  // scalastyle:off line.size.limit println
+  private def printUsageAndExit(exitCode: Int, error: String = ""): Unit = {
+    val configs = History.getClass.getDeclaredFields
+      .filter(f => classOf[ConfigEntry[_]].isAssignableFrom(f.getType))
+      .map { f =>
+        f.setAccessible(true)
+        f.get(History).asInstanceOf[ConfigEntry[_]]
+      }
+    val maxConfigLength = configs.map(_.key.length).max
+    val sb = new StringBuilder(
+      s"""
+        |${error}Usage: HistoryServer [options]
+        |
+        |Options:
+        |  ${"--properties-file FILE".padTo(maxConfigLength, ' ')} Path to a custom Spark properties file.
+        |  ${"".padTo(maxConfigLength, ' ')} Default is conf/spark-defaults.conf.
+        |
+        |Configuration options can be set by setting the corresponding JVM system property.
+        |History Server options are always available; additional options depend on the provider.
+        |
+        |""".stripMargin)
+
+    def printConfigs(configs: Array[ConfigEntry[_]]): Unit = {
+      configs.sortBy(_.key).foreach { conf =>
+        sb.append("  ").append(conf.key.padTo(maxConfigLength, ' '))
+        var currentDocLen = 0
+        val intention = "\n" + " " * (maxConfigLength + 2)
+        conf.doc.split("\\s+").foreach { word =>
+          if (currentDocLen + word.length > 60) {
+            sb.append(intention).append(" ").append(word)
+            currentDocLen = word.length + 1
+          } else {
+            sb.append(" ").append(word)
+            currentDocLen += word.length + 1
+          }
+        }
+        sb.append(intention).append(" (Default: ").append(conf.defaultValueString).append(")\n")
+      }
+    }
+    val (common, fs) = configs.partition(!_.key.startsWith("spark.history.fs."))
+    sb.append("History Server options:\n")
+    printConfigs(common)
+    sb.append("FsHistoryProvider options:\n")
+    printConfigs(fs)
+    System.err.println(sb.toString())
+    // scalastyle:on line.size.limit println
     System.exit(exitCode)
   }
-
 }
diff --git a/core/src/main/scala/org/apache/spark/internal/config/History.scala b/core/src/main/scala/org/apache/spark/internal/config/History.scala
index 2306856f9331..64a8681ca295 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/History.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/History.scala
@@ -28,16 +28,19 @@ private[spark] object History {
 
   val HISTORY_LOG_DIR = ConfigBuilder("spark.history.fs.logDirectory")
     .version("1.1.0")
+    .doc("Directory where app logs are stored")
     .stringConf
     .createWithDefault(DEFAULT_LOG_DIR)
 
   val SAFEMODE_CHECK_INTERVAL_S = ConfigBuilder("spark.history.fs.safemodeCheck.interval")
     .version("1.6.0")
+    .doc("Interval between HDFS safemode checks for the event log directory")
     .timeConf(TimeUnit.SECONDS)
     .createWithDefaultString("5s")
 
   val UPDATE_INTERVAL_S = ConfigBuilder("spark.history.fs.update.interval")
     .version("1.4.0")
+    .doc("How often(in seconds) to reload log data from storage")
     .timeConf(TimeUnit.SECONDS)
     .createWithDefaultString("10s")
 
@@ -53,16 +56,21 @@ private[spark] object History {
 
   val CLEANER_ENABLED = ConfigBuilder("spark.history.fs.cleaner.enabled")
     .version("1.4.0")
+    .doc("Whether the History Server should periodically clean up event logs from storage")
     .booleanConf
     .createWithDefault(false)
 
   val CLEANER_INTERVAL_S = ConfigBuilder("spark.history.fs.cleaner.interval")
     .version("1.4.0")
+    .doc("When spark.history.fs.cleaner.enabled=true, specifies how often the filesystem " +
+      "job history cleaner checks for files to delete.")
     .timeConf(TimeUnit.SECONDS)
     .createWithDefaultString("1d")
 
   val MAX_LOG_AGE_S = ConfigBuilder("spark.history.fs.cleaner.maxAge")
     .version("1.4.0")
+    .doc("When spark.history.fs.cleaner.enabled=true, history files older than this will be " +
+      "deleted when the filesystem history cleaner runs.")
     .timeConf(TimeUnit.SECONDS)
     .createWithDefaultString("7d")
 
@@ -96,6 +104,8 @@ private[spark] object History {
 
   val MAX_LOCAL_DISK_USAGE = ConfigBuilder("spark.history.store.maxDiskUsage")
     .version("2.3.0")
+    .doc("Maximum disk usage for the local directory where the cache application history " +
+      "information are stored.")
     .bytesConf(ByteUnit.BYTE)
     .createWithDefaultString("10g")
 
@@ -145,60 +155,90 @@ private[spark] object History {
 
   val DRIVER_LOG_CLEANER_ENABLED = ConfigBuilder("spark.history.fs.driverlog.cleaner.enabled")
     .version("3.0.0")
+    .doc("Specifies whether the History Server should periodically clean up driver logs from " +
+      "storage.")
     .fallbackConf(CLEANER_ENABLED)
 
-  val DRIVER_LOG_CLEANER_INTERVAL = ConfigBuilder("spark.history.fs.driverlog.cleaner.interval")
-    .version("3.0.0")
-    .fallbackConf(CLEANER_INTERVAL_S)
-
   val MAX_DRIVER_LOG_AGE_S = ConfigBuilder("spark.history.fs.driverlog.cleaner.maxAge")
     .version("3.0.0")
+    .doc(s"When ${DRIVER_LOG_CLEANER_ENABLED.key}=true, driver log files older than this will be " +
+      s"deleted when the driver log cleaner runs.")
     .fallbackConf(MAX_LOG_AGE_S)
 
+  val DRIVER_LOG_CLEANER_INTERVAL = ConfigBuilder("spark.history.fs.driverlog.cleaner.interval")
+    .version("3.0.0")
+    .doc(s"When ${DRIVER_LOG_CLEANER_ENABLED.key}=true, specifies how often the filesystem " +
+      s"driver log cleaner checks for files to delete. Files are only deleted if they are older " +
+      s"than ${MAX_DRIVER_LOG_AGE_S.key}.")
+    .fallbackConf(CLEANER_INTERVAL_S)
+
   val HISTORY_SERVER_UI_ACLS_ENABLE = ConfigBuilder("spark.history.ui.acls.enable")
     .version("1.0.1")
+    .doc("Specifies whether ACLs should be checked to authorize users viewing the applications " +
+      "in the history server. If enabled, access control checks are performed regardless of " +
+      "what the individual applications had set for spark.ui.acls.enable. The application owner " +
+      "will always have authorization to view their own application and any users specified via " +
+      "spark.ui.view.acls and groups specified via spark.ui.view.acls.groups when the " +
+      "application was run will also have authorization to view that application. If disabled, " +
+      "no access control checks are made for any application UIs available through the history " +
+      "server.")
     .booleanConf
     .createWithDefault(false)
 
   val HISTORY_SERVER_UI_ADMIN_ACLS = ConfigBuilder("spark.history.ui.admin.acls")
     .version("2.1.1")
+    .doc("Comma separated list of users that have view access to all the Spark applications in " +
+      "history server.")
     .stringConf
     .toSequence
     .createWithDefault(Nil)
 
   val HISTORY_SERVER_UI_ADMIN_ACLS_GROUPS = ConfigBuilder("spark.history.ui.admin.acls.groups")
     .version("2.1.1")
+    .doc("Comma separated list of groups that have view access to all the Spark applications " +
+      "in history server.")
     .stringConf
     .toSequence
     .createWithDefault(Nil)
 
   val NUM_REPLAY_THREADS = ConfigBuilder("spark.history.fs.numReplayThreads")
     .version("2.0.0")
+    .doc("Number of threads that will be used by history server to process event logs.")
     .intConf
     .createWithDefaultFunction(() => Math.ceil(Runtime.getRuntime.availableProcessors() / 4f).toInt)
 
   val RETAINED_APPLICATIONS = ConfigBuilder("spark.history.retainedApplications")
     .version("1.0.0")
+    .doc("The number of applications to retain UI data for in the cache. If this cap is " +
+      "exceeded, then the oldest applications will be removed from the cache. If an application " +
+      "is not in the cache, it will have to be loaded from disk if it is accessed from the UI.")
     .intConf
     .createWithDefault(50)
 
   val PROVIDER = ConfigBuilder("spark.history.provider")
     .version("1.1.0")
+    .doc("Name of the class implementing the application history backend.")
     .stringConf
-    .createOptional
+    .createWithDefault("org.apache.spark.deploy.history.FsHistoryProvider")
 
   val KERBEROS_ENABLED = ConfigBuilder("spark.history.kerberos.enabled")
     .version("1.0.1")
+    .doc("Indicates whether the history server should use kerberos to login. This is required " +
+      "if the history server is accessing HDFS files on a secure Hadoop cluster.")
     .booleanConf
     .createWithDefault(false)
 
   val KERBEROS_PRINCIPAL = ConfigBuilder("spark.history.kerberos.principal")
     .version("1.0.1")
+    .doc(s"When ${KERBEROS_ENABLED.key}=true, specifies kerberos principal name for " +
+      s"the History Server.")
     .stringConf
     .createOptional
 
   val KERBEROS_KEYTAB = ConfigBuilder("spark.history.kerberos.keytab")
     .version("1.0.1")
+    .doc(s"When ${KERBEROS_ENABLED.key}=true, specifies location of the kerberos keytab file " +
+      s"for the History Server.")
     .stringConf
     .createOptional
diff --git a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerArgumentsSuite.scala b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerArgumentsSuite.scala
index 5903ae71ec66..2b9b110a4142 100644
--- a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerArgumentsSuite.scala
+++ b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerArgumentsSuite.scala
@@ -22,6 +22,7 @@ import java.nio.charset.StandardCharsets._
 import com.google.common.io.Files
 
 import org.apache.spark._
+import org.apache.spark.internal.config.{ConfigEntry, History}
 import org.apache.spark.internal.config.History._
 import org.apache.spark.internal.config.Tests._
 
@@ -52,4 +53,16 @@ class HistoryServerArgumentsSuite extends SparkFunSuite {
       assert(conf.get("spark.test.CustomPropertyB") === "notblah")
     }
   }
+
+  test("SPARK-48471: all history configurations should have documentations") {
+    val configs = History.getClass.getDeclaredFields
+      .filter(f => classOf[ConfigEntry[_]].isAssignableFrom(f.getType))
+      .map { f =>
+        f.setAccessible(true)
+        f.get(History).asInstanceOf[ConfigEntry[_]]
+      }
+    configs.foreach { config =>
+      assert(config.doc.nonEmpty, s"Config ${config.key} doesn't have documentation")
+    }
+  }
 }

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
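The dynamic help text above is produced by reflecting over the `ConfigEntry` fields declared on the `History` config object, rather than a hardcoded list. A minimal, self-contained sketch of that idiom follows; note that `ConfigEntry`, `History`, and `UsageSketch` here are simplified stand-ins invented for illustration (Spark's real `org.apache.spark.internal.config.ConfigEntry` carries much more state), and the keys/docs are abbreviated examples, not the full set.

```scala
// Simplified stand-in for Spark's ConfigEntry (assumption: the real class
// also tracks version, value type, fallbacks, etc.).
final case class ConfigEntry(key: String, doc: String, defaultValueString: String)

// A toy config holder object, mirroring the role of
// org.apache.spark.internal.config.History in the patch.
object History {
  val UI_PORT = ConfigEntry("spark.history.ui.port",
    "Web UI port to bind Spark History Server", "18080")
  val LOG_DIR = ConfigEntry("spark.history.fs.logDirectory",
    "Directory where app logs are stored", "file:/tmp/spark-events")
  val notAConfig = 42 // non-ConfigEntry fields must be filtered out
}

object UsageSketch {
  // Collect every ConfigEntry-typed field declared on the holder object via
  // JVM reflection -- the same pattern printUsageAndExit uses in the patch.
  def configs: Seq[ConfigEntry] =
    History.getClass.getDeclaredFields.toSeq
      .filter(f => classOf[ConfigEntry].isAssignableFrom(f.getType))
      .map { f =>
        f.setAccessible(true)
        f.get(History).asInstanceOf[ConfigEntry]
      }

  // Render a padded "key  doc (Default: ...)" listing, sorted by key.
  def render: String = {
    val width = configs.map(_.key.length).max
    configs.sortBy(_.key).map { c =>
      s"  ${c.key.padTo(width, ' ')} ${c.doc} (Default: ${c.defaultValueString})"
    }.mkString("\n")
  }

  def main(args: Array[String]): Unit = println(render)
}
```

Because the listing is derived from the declared fields at runtime, any newly added configuration shows up in the usage text automatically, which is exactly the staleness problem the PR set out to fix.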