This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new df15c8d7744b [SPARK-48471][CORE] Improve documentation and usage guide for history server
df15c8d7744b is described below

commit df15c8d7744becfd44cd4a447c362e8e007bd574
Author: Kent Yao <y...@apache.org>
AuthorDate: Thu May 30 17:16:31 2024 +0800

    [SPARK-48471][CORE] Improve documentation and usage guide for history server
    
    ### What changes were proposed in this pull request?
    
    In this PR, we improve the documentation and usage guide for the history server by:
    - Identifying and printing **unrecognized options** specified by users
    - Obtaining and printing all history server-related configurations dynamically, instead of relying on an incomplete, outdated hardcoded list (see the sketch below)
    - Ensuring all configurations are documented for the usage guide
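
    The dynamic listing is driven by reflection over the `ConfigEntry` fields declared on the `History` config object, mirroring the `printUsageAndExit` change in the diff below. A minimal sketch of that enumeration:

    ```scala
    import org.apache.spark.internal.config.{ConfigEntry, History}

    // Collect every ConfigEntry declared on the History config object via
    // reflection, so the help output can no longer drift out of sync with
    // the actual set of configurations.
    val configs: Array[ConfigEntry[_]] = History.getClass.getDeclaredFields
      .filter(f => classOf[ConfigEntry[_]].isAssignableFrom(f.getType))
      .map { f =>
        f.setAccessible(true)
        f.get(History).asInstanceOf[ConfigEntry[_]]
      }

    // Provider-specific keys are printed in a separate section from the
    // always-available History Server options.
    val (common, fs) = configs.partition(!_.key.startsWith("spark.history.fs."))
    ```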
    
    ### Why are the changes needed?
    
    - Revise the help guide for the history server to make it more user-friendly. Configurations missing from the help guide are not always reachable in our official documentation; e.g., spark.history.fs.safemodeCheck.interval has been missing from the docs since it was added in 1.6.
    - Misuse should be reported to users
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. The print style stays as-is; only the number of listed items increases.
    
    ### How was this patch tested?
    
    #### Without this PR
    
    ```
    Usage: ./sbin/start-history-server.sh [options]
    24/05/30 15:37:23 INFO SignalUtils: Registering signal handler for TERM
    24/05/30 15:37:23 INFO SignalUtils: Registering signal handler for HUP
    24/05/30 15:37:23 INFO SignalUtils: Registering signal handler for INT
    
    Options:
      --properties-file FILE      Path to a custom Spark properties file.
                                  Default is conf/spark-defaults.conf.
    
    Configuration options can be set by setting the corresponding JVM system property.
    History Server options are always available; additional options depend on the provider.

    History Server options:

      spark.history.ui.port              Port where server will listen for connections
                                         (default 18080)
      spark.history.acls.enable          Whether to enable view acls for all applications
                                         (default false)
      spark.history.provider             Name of history provider class (defaults to
                                         file system-based provider)
      spark.history.retainedApplications Max number of application UIs to keep loaded in memory
                                         (default 50)
    FsHistoryProvider options:

      spark.history.fs.logDirectory      Directory where app logs are stored
                                         (default: file:/tmp/spark-events)
      spark.history.fs.update.interval   How often to reload log data from storage
                                         (in seconds, default: 10)
    ```
    #### For error
    ```
    Unrecognized options: --conf spark.history.ui.port=10000
    Usage: HistoryServer [options]
    
    Options:
      --properties-file FILE                                           Path to a custom Spark properties file.
                                                                       Default is conf/spark-defaults.conf.
    
    ```
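
    The error path comes from the parser's new catch-all case. Below is a minimal, self-contained sketch of that behavior; the `ArgParseSketch`/`parse` names and the `Either` return type are illustrative, not the actual `HistoryServerArguments` API:

    ```scala
    object ArgParseSketch {
      // Any argument that is not a recognized flag falls through to the
      // catch-all case and is reported back verbatim, as in the diff below.
      @scala.annotation.tailrec
      def parse(args: List[String], propertiesFile: Option[String] = None): Either[String, Option[String]] =
        args match {
          case "--properties-file" :: value :: tail => parse(tail, Some(value))
          case Nil => Right(propertiesFile)
          case other => Left(s"Unrecognized options: ${other.mkString(" ")}")
        }

      def main(args: Array[String]): Unit = {
        // Prints: Left(Unrecognized options: --conf spark.history.ui.port=10000)
        println(parse(List("--conf", "spark.history.ui.port=10000")))
      }
    }
    ```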
    
    #### For help
    ```
     sbin/start-history-server.sh --help
    Usage: ./sbin/start-history-server.sh [options]
    {"ts":"2024-05-30T07:15:29.740Z","level":"INFO","msg":"Registering signal 
handler for TERM","context":{"signal":"TERM"},"logger":"SignalUtils"}
    {"ts":"2024-05-30T07:15:29.741Z","level":"INFO","msg":"Registering signal 
handler for HUP","context":{"signal":"HUP"},"logger":"SignalUtils"}
    {"ts":"2024-05-30T07:15:29.741Z","level":"INFO","msg":"Registering signal 
handler for INT","context":{"signal":"INT"},"logger":"SignalUtils"}
    
    Options:
      --properties-file FILE                                           Path to a custom Spark properties file.
                                                                       Default is conf/spark-defaults.conf.

    Configuration options can be set by setting the corresponding JVM system property.
    History Server options are always available; additional options depend on the provider.

    History Server options:
      spark.history.custom.executor.log.url                            Specifies custom spark executor log url for supporting
                                                                       external log service instead of using cluster managers'
                                                                       application log urls in the history server. Spark will
                                                                       support some path variables via patterns which can vary on
                                                                       cluster manager. Please check the documentation for your
                                                                       cluster manager to see which patterns are supported, if any.
                                                                       This configuration has no effect on a live application, it
                                                                       only affects the history server.
                                                                       (Default: <undefined>)
      spark.history.custom.executor.log.url.applyIncompleteApplication Whether to apply custom executor log url, as specified by
                                                                       spark.history.custom.executor.log.url, to incomplete
                                                                       application as well. Even if this is true, this still only
                                                                       affects the behavior of the history server, not running
                                                                       spark applications.
                                                                       (Default: true)
      spark.history.kerberos.enabled                                   Indicates whether the history server should use kerberos to
                                                                       login. This is required if the history server is accessing
                                                                       HDFS files on a secure Hadoop cluster.
                                                                       (Default: false)
      spark.history.kerberos.keytab                                    When spark.history.kerberos.enabled=true, specifies location
                                                                       of the kerberos keytab file for the History Server.
                                                                       (Default: <undefined>)
      spark.history.kerberos.principal                                 When spark.history.kerberos.enabled=true, specifies kerberos
                                                                       principal name for the History Server.
                                                                       (Default: <undefined>)
      spark.history.provider                                           Name of the class implementing the application history
                                                                       backend.
                                                                       (Default: org.apache.spark.deploy.history.FsHistoryProvider)
      spark.history.retainedApplications                               The number of applications to retain UI data for in the
                                                                       cache. If this cap is exceeded, then the oldest applications
                                                                       will be removed from the cache. If an application is not in
                                                                       the cache, it will have to be loaded from disk if it is
                                                                       accessed from the UI.
                                                                       (Default: 50)
      spark.history.store.hybridStore.diskBackend                      Specifies a disk-based store used in hybrid store; ROCKSDB
                                                                       or LEVELDB (deprecated).
                                                                       (Default: ROCKSDB)
      spark.history.store.hybridStore.enabled                          Whether to use HybridStore as the store when parsing event
                                                                       logs. HybridStore will first write data to an in-memory
                                                                       store and having a background thread that dumps data to a
                                                                       disk store after the writing to in-memory store is
                                                                       completed.
                                                                       (Default: false)
      spark.history.store.hybridStore.maxMemoryUsage                   Maximum memory space that can be used to create HybridStore.
                                                                       The HybridStore co-uses the heap memory, so the heap memory
                                                                       should be increased through the memory option for SHS if the
                                                                       HybridStore is enabled.
                                                                       (Default: 2g)
      spark.history.store.maxDiskUsage                                 Maximum disk usage for the local directory where the cache
                                                                       application history information are stored.
                                                                       (Default: 10g)
      spark.history.store.path                                         Local directory where to cache application history
                                                                       information. By default this is not set, meaning all history
                                                                       information will be kept in memory.
                                                                       (Default: <undefined>)
      spark.history.store.serializer                                   Serializer for writing/reading in-memory UI objects to/from
                                                                       disk-based KV Store; JSON or PROTOBUF. JSON serializer is
                                                                       the only choice before Spark 3.4.0, thus it is the default
                                                                       value. PROTOBUF serializer is fast and compact, and it is
                                                                       the default serializer for disk-based KV store of live UI.
                                                                       (Default: JSON)
      spark.history.ui.acls.enable                                     Specifies whether ACLs should be checked to authorize users
                                                                       viewing the applications in the history server. If enabled,
                                                                       access control checks are performed regardless of what the
                                                                       individual applications had set for spark.ui.acls.enable.
                                                                       The application owner will always have authorization to view
                                                                       their own application and any users specified via
                                                                       spark.ui.view.acls and groups specified via
                                                                       spark.ui.view.acls.groups when the application was run will
                                                                       also have authorization to view that application. If
                                                                       disabled, no access control checks are made for any
                                                                       application UIs available through the history server.
                                                                       (Default: false)
      spark.history.ui.admin.acls                                      Comma separated list of users that have view access to all
                                                                       the Spark applications in history server.
                                                                       (Default: )
      spark.history.ui.admin.acls.groups                               Comma separated list of groups that have view access to all
                                                                       the Spark applications in history server.
                                                                       (Default: )
      spark.history.ui.port                                            Web UI port to bind Spark History Server
                                                                       (Default: 18080)
    FsHistoryProvider options:
      spark.history.fs.cleaner.enabled                                 Whether the History Server should periodically clean up
                                                                       event logs from storage
                                                                       (Default: false)
      spark.history.fs.cleaner.interval                                When spark.history.fs.cleaner.enabled=true, specifies how
                                                                       often the filesystem job history cleaner checks for files to
                                                                       delete.
                                                                       (Default: 1d)
      spark.history.fs.cleaner.maxAge                                  When spark.history.fs.cleaner.enabled=true, history files
                                                                       older than this will be deleted when the filesystem history
                                                                       cleaner runs.
                                                                       (Default: 7d)
    ```
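
    The 60-column wrapping of the descriptions above is produced by a greedy word-wrap over each config's doc string (the `printConfigs` helper in the diff below). A self-contained sketch, with `WrapSketch`/`wrapDoc` as illustrative names:

    ```scala
    object WrapSketch {
      // Append words until a description line would exceed ~60 characters,
      // then start a new line indented past the key column. Each word is
      // prefixed with a space, matching the output style above.
      def wrapDoc(doc: String, keyWidth: Int, limit: Int = 60): String = {
        val indent = "\n" + " " * (keyWidth + 2)
        val sb = new StringBuilder
        var lineLen = 0
        doc.split("\\s+").foreach { word =>
          if (lineLen + word.length > limit) {
            sb.append(indent).append(" ").append(word)
            lineLen = word.length + 1
          } else {
            sb.append(" ").append(word)
            lineLen += word.length + 1
          }
        }
        sb.toString
      }

      def main(args: Array[String]): Unit =
        println("spark.history.ui.port" + wrapDoc("Web UI port to bind Spark History Server", 21))
    }
    ```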
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes #46802 from yaooqinn/SPARK-48471.
    
    Authored-by: Kent Yao <y...@apache.org>
    Signed-off-by: Kent Yao <y...@apache.org>
---
 .../spark/deploy/history/HistoryServer.scala       |  1 -
 .../deploy/history/HistoryServerArguments.scala    | 88 +++++++++++++---------
 .../org/apache/spark/internal/config/History.scala | 50 ++++++++++--
 .../history/HistoryServerArgumentsSuite.scala      | 13 ++++
 4 files changed, 110 insertions(+), 42 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
index cad1797590a7..6e559dc4492e 100644
--- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
@@ -302,7 +302,6 @@ object HistoryServer extends Logging {
     val securityManager = createSecurityManager(conf)
 
     val providerName = conf.get(History.PROVIDER)
-      .getOrElse(classOf[FsHistoryProvider].getName())
     val provider = Utils.classForName[ApplicationHistoryProvider](providerName)
       .getConstructor(classOf[SparkConf])
       .newInstance(conf)
diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServerArguments.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServerArguments.scala
index 01cc59e1d2e6..2fdf7a473a29 100644
--- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServerArguments.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServerArguments.scala
@@ -21,6 +21,7 @@ import scala.annotation.tailrec
 
 import org.apache.spark.SparkConf
 import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config.{ConfigEntry, History}
 import org.apache.spark.util.Utils
 
 /**
@@ -44,47 +45,62 @@ private[history] class HistoryServerArguments(conf: SparkConf, args: Array[Strin
 
       case Nil =>
 
-      case _ =>
-        printUsageAndExit(1)
+      case other =>
+        val errorMsg = s"Unrecognized options: ${other.mkString(" ")}\n"
+        printUsageAndExit(1, errorMsg)
     }
   }
 
-   // This mutates the SparkConf, so all accesses to it must be made after this line
-   Utils.loadDefaultSparkProperties(conf, propertiesFile)
+  // This mutates the SparkConf, so all accesses to it must be made after this line
+  Utils.loadDefaultSparkProperties(conf, propertiesFile)
+  Utils.loadDefaultSparkProperties(conf, propertiesFile)
 
-  private def printUsageAndExit(exitCode: Int): Unit = {
-    // scalastyle:off println
-    System.err.println(
-      """
-      |Usage: HistoryServer [options]
-      |
-      |Options:
-      |  --properties-file FILE      Path to a custom Spark properties file.
-      |                              Default is conf/spark-defaults.conf.
-      |
-      |Configuration options can be set by setting the corresponding JVM system property.
-      |History Server options are always available; additional options depend on the provider.
-      |
-      |History Server options:
-      |
-      |  spark.history.ui.port              Port where server will listen for connections
-      |                                     (default 18080)
-      |  spark.history.acls.enable          Whether to enable view acls for all applications
-      |                                     (default false)
-      |  spark.history.provider             Name of history provider class (defaults to
-      |                                     file system-based provider)
-      |  spark.history.retainedApplications Max number of application UIs to keep loaded in memory
-      |                                     (default 50)
-      |FsHistoryProvider options:
-      |
-      |  spark.history.fs.logDirectory      Directory where app logs are stored
-      |                                     (default: file:/tmp/spark-events)
-      |  spark.history.fs.update.interval   How often to reload log data from storage
-      |                                     (in seconds, default: 10)
-      |""".stripMargin)
-    // scalastyle:on println
+  // scalastyle:off line.size.limit println
+  private def printUsageAndExit(exitCode: Int, error: String = ""): Unit = {
+    val configs = History.getClass.getDeclaredFields
+      .filter(f => classOf[ConfigEntry[_]].isAssignableFrom(f.getType))
+      .map { f =>
+        f.setAccessible(true)
+        f.get(History).asInstanceOf[ConfigEntry[_]]
+      }
+    val maxConfigLength = configs.map(_.key.length).max
+    val sb = new StringBuilder(
+      s"""
+         |${error}Usage: HistoryServer [options]
+         |
+         |Options:
+         |  ${"--properties-file FILE".padTo(maxConfigLength, ' ')} Path to a 
custom Spark properties file.
+         |  ${"".padTo(maxConfigLength, ' ')} Default is 
conf/spark-defaults.conf.
+         |
+         |Configuration options can be set by setting the corresponding JVM 
system property.
+         |History Server options are always available; additional options 
depend on the provider.
+         |
+         |""".stripMargin)
+
+    def printConfigs(configs: Array[ConfigEntry[_]]): Unit = {
+      configs.sortBy(_.key).foreach { conf =>
+        sb.append("  ").append(conf.key.padTo(maxConfigLength, ' '))
+        var currentDocLen = 0
+        val intention = "\n" + " " * (maxConfigLength + 2)
+        conf.doc.split("\\s+").foreach { word =>
+          if (currentDocLen + word.length > 60) {
+            sb.append(intention).append(" ").append(word)
+            currentDocLen = word.length + 1
+          } else {
+            sb.append(" ").append(word)
+            currentDocLen += word.length + 1
+          }
+        }
+        sb.append(intention).append(" (Default: ").append(conf.defaultValueString).append(")\n")
+      }
+    }
+    val (common, fs) = configs.partition(!_.key.startsWith("spark.history.fs."))
+    sb.append("History Server options:\n")
+    printConfigs(common)
+    sb.append("FsHistoryProvider options:\n")
+    printConfigs(fs)
+    System.err.println(sb.toString())
+    // scalastyle:on line.size.limit println
     System.exit(exitCode)
   }
-
 }
 
diff --git a/core/src/main/scala/org/apache/spark/internal/config/History.scala b/core/src/main/scala/org/apache/spark/internal/config/History.scala
index 2306856f9331..64a8681ca295 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/History.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/History.scala
@@ -28,16 +28,19 @@ private[spark] object History {
 
   val HISTORY_LOG_DIR = ConfigBuilder("spark.history.fs.logDirectory")
     .version("1.1.0")
+    .doc("Directory where app logs are stored")
     .stringConf
     .createWithDefault(DEFAULT_LOG_DIR)
 
  val SAFEMODE_CHECK_INTERVAL_S = ConfigBuilder("spark.history.fs.safemodeCheck.interval")
     .version("1.6.0")
+    .doc("Interval between HDFS safemode checks for the event log directory")
     .timeConf(TimeUnit.SECONDS)
     .createWithDefaultString("5s")
 
   val UPDATE_INTERVAL_S = ConfigBuilder("spark.history.fs.update.interval")
     .version("1.4.0")
+    .doc("How often(in seconds) to reload log data from storage")
     .timeConf(TimeUnit.SECONDS)
     .createWithDefaultString("10s")
 
@@ -53,16 +56,21 @@ private[spark] object History {
 
   val CLEANER_ENABLED = ConfigBuilder("spark.history.fs.cleaner.enabled")
     .version("1.4.0")
+    .doc("Whether the History Server should periodically clean up event logs 
from storage")
     .booleanConf
     .createWithDefault(false)
 
   val CLEANER_INTERVAL_S = ConfigBuilder("spark.history.fs.cleaner.interval")
     .version("1.4.0")
+    .doc("When spark.history.fs.cleaner.enabled=true, specifies how often the 
filesystem " +
+      "job history cleaner checks for files to delete.")
     .timeConf(TimeUnit.SECONDS)
     .createWithDefaultString("1d")
 
   val MAX_LOG_AGE_S = ConfigBuilder("spark.history.fs.cleaner.maxAge")
     .version("1.4.0")
+    .doc("When spark.history.fs.cleaner.enabled=true, history files older than 
this will be " +
+      "deleted when the filesystem history cleaner runs.")
     .timeConf(TimeUnit.SECONDS)
     .createWithDefaultString("7d")
 
@@ -96,6 +104,8 @@ private[spark] object History {
 
   val MAX_LOCAL_DISK_USAGE = ConfigBuilder("spark.history.store.maxDiskUsage")
     .version("2.3.0")
+    .doc("Maximum disk usage for the local directory where the cache 
application history " +
+      "information are stored.")
     .bytesConf(ByteUnit.BYTE)
     .createWithDefaultString("10g")
 
@@ -145,60 +155,90 @@ private[spark] object History {
 
  val DRIVER_LOG_CLEANER_ENABLED = ConfigBuilder("spark.history.fs.driverlog.cleaner.enabled")
    .version("3.0.0")
+    .doc("Specifies whether the History Server should periodically clean up driver logs from " +
+      "storage.")
    .fallbackConf(CLEANER_ENABLED)

-  val DRIVER_LOG_CLEANER_INTERVAL = ConfigBuilder("spark.history.fs.driverlog.cleaner.interval")
-    .version("3.0.0")
-    .fallbackConf(CLEANER_INTERVAL_S)
-
  val MAX_DRIVER_LOG_AGE_S = ConfigBuilder("spark.history.fs.driverlog.cleaner.maxAge")
    .version("3.0.0")
+    .doc(s"When ${DRIVER_LOG_CLEANER_ENABLED.key}=true, driver log files older than this will be " +
+      s"deleted when the driver log cleaner runs.")
    .fallbackConf(MAX_LOG_AGE_S)

+  val DRIVER_LOG_CLEANER_INTERVAL = ConfigBuilder("spark.history.fs.driverlog.cleaner.interval")
+    .version("3.0.0")
+    .doc(s"When ${DRIVER_LOG_CLEANER_ENABLED.key}=true, specifies how often the filesystem " +
+      s"driver log cleaner checks for files to delete. Files are only deleted if they are older " +
+      s"than ${MAX_DRIVER_LOG_AGE_S.key}.")
+    .fallbackConf(CLEANER_INTERVAL_S)
+
  val HISTORY_SERVER_UI_ACLS_ENABLE = ConfigBuilder("spark.history.ui.acls.enable")
     .version("1.0.1")
+    .doc("Specifies whether ACLs should be checked to authorize users viewing 
the applications " +
+      "in the history server. If enabled, access control checks are performed 
regardless of " +
+      "what the individual applications had set for spark.ui.acls.enable. The 
application owner " +
+      "will always have authorization to view their own application and any 
users specified via " +
+      "spark.ui.view.acls and groups specified via spark.ui.view.acls.groups 
when the " +
+      "application was run will also have authorization to view that 
application. If disabled, " +
+      "no access control checks are made for any application UIs available 
through the history " +
+      "server.")
     .booleanConf
     .createWithDefault(false)
 
  val HISTORY_SERVER_UI_ADMIN_ACLS = ConfigBuilder("spark.history.ui.admin.acls")
    .version("2.1.1")
+    .doc("Comma separated list of users that have view access to all the Spark applications in " +
+      "history server.")
     .stringConf
     .toSequence
     .createWithDefault(Nil)
 
  val HISTORY_SERVER_UI_ADMIN_ACLS_GROUPS = ConfigBuilder("spark.history.ui.admin.acls.groups")
    .version("2.1.1")
+    .doc("Comma separated list of groups that have view access to all the Spark applications " +
+      "in history server.")
     .stringConf
     .toSequence
     .createWithDefault(Nil)
 
   val NUM_REPLAY_THREADS = ConfigBuilder("spark.history.fs.numReplayThreads")
     .version("2.0.0")
+    .doc("Number of threads that will be used by history server to process 
event logs.")
     .intConf
     .createWithDefaultFunction(() => 
Math.ceil(Runtime.getRuntime.availableProcessors() / 4f).toInt)
 
  val RETAINED_APPLICATIONS = ConfigBuilder("spark.history.retainedApplications")
    .version("1.0.0")
+    .doc("The number of applications to retain UI data for in the cache. If this cap is " +
+      "exceeded, then the oldest applications will be removed from the cache. If an application " +
+      "is not in the cache, it will have to be loaded from disk if it is accessed from the UI.")
     .intConf
     .createWithDefault(50)
 
   val PROVIDER = ConfigBuilder("spark.history.provider")
     .version("1.1.0")
+    .doc("Name of the class implementing the application history backend.")
     .stringConf
-    .createOptional
+    .createWithDefault("org.apache.spark.deploy.history.FsHistoryProvider")
 
   val KERBEROS_ENABLED = ConfigBuilder("spark.history.kerberos.enabled")
     .version("1.0.1")
+    .doc("Indicates whether the history server should use kerberos to login. 
This is required " +
+      "if the history server is accessing HDFS files on a secure Hadoop 
cluster.")
     .booleanConf
     .createWithDefault(false)
 
   val KERBEROS_PRINCIPAL = ConfigBuilder("spark.history.kerberos.principal")
     .version("1.0.1")
+    .doc(s"When ${KERBEROS_ENABLED.key}=true, specifies kerberos principal 
name for " +
+      s" the History Server.")
     .stringConf
     .createOptional
 
   val KERBEROS_KEYTAB = ConfigBuilder("spark.history.kerberos.keytab")
     .version("1.0.1")
+    .doc(s"When ${KERBEROS_ENABLED.key}=true, specifies location of the 
kerberos keytab file " +
+      s"for the History Server.")
     .stringConf
     .createOptional
 
diff --git a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerArgumentsSuite.scala b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerArgumentsSuite.scala
index 5903ae71ec66..2b9b110a4142 100644
--- a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerArgumentsSuite.scala
+++ b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerArgumentsSuite.scala
@@ -22,6 +22,7 @@ import java.nio.charset.StandardCharsets._
 import com.google.common.io.Files
 
 import org.apache.spark._
+import org.apache.spark.internal.config.{ConfigEntry, History}
 import org.apache.spark.internal.config.History._
 import org.apache.spark.internal.config.Tests._
 
@@ -52,4 +53,16 @@ class HistoryServerArgumentsSuite extends SparkFunSuite {
       assert(conf.get("spark.test.CustomPropertyB") === "notblah")
     }
   }
+
+  test("SPARK-48471: all history configurations should have documentations") {
+    val configs = History.getClass.getDeclaredFields
+      .filter(f => classOf[ConfigEntry[_]].isAssignableFrom(f.getType))
+      .map { f =>
+        f.setAccessible(true)
+        f.get(History).asInstanceOf[ConfigEntry[_]]
+      }
+    configs.foreach { config =>
+      assert(config.doc.nonEmpty, s"Config ${config.key} doesn't have documentation")
+    }
+  }
 }

