This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push: new cd7ca92051b [SPARK-40675][DOCS] Supplement undocumented spark configurations in `configuration.md` cd7ca92051b is described below commit cd7ca92051b55c615b8db07030ea3af469dd4da4 Author: Qian.Sun <qian.sun2...@gmail.com> AuthorDate: Sun Oct 9 10:12:19 2022 -0500 [SPARK-40675][DOCS] Supplement undocumented spark configurations in `configuration.md` ### What changes were proposed in this pull request? This PR aims to supplement missing spark configurations in `org.apache.spark.internal.config` in `configuration.md`. ### Why are the changes needed? Help users to confirm configuration through documentation instead of code. ### Does this PR introduce _any_ user-facing change? Yes, more configurations in documentation. ### How was this patch tested? Pass the GitHub Actions. Closes #38131 from dcoliversun/SPARK-40675. Authored-by: Qian.Sun <qian.sun2...@gmail.com> Signed-off-by: Sean Owen <sro...@gmail.com> --- docs/configuration.md | 314 +++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 313 insertions(+), 1 deletion(-) diff --git a/docs/configuration.md b/docs/configuration.md index 16c9fdfdf9f..b528c766884 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -468,6 +468,43 @@ of the most common options to set are: </td> <td>3.0.0</td> </tr> +<tr> + <td><code>spark.decommission.enabled</code></td> + <td>false</td> + <td> + When decommissioning is enabled, Spark will try its best to shut down the executor gracefully. + Spark will try to migrate all the RDD blocks (controlled by <code>spark.storage.decommission.rddBlocks.enabled</code>) + and shuffle blocks (controlled by <code>spark.storage.decommission.shuffleBlocks.enabled</code>) from the decommissioning + executor to a remote executor when <code>spark.storage.decommission.enabled</code> is enabled.
+ With decommissioning enabled, Spark will also decommission an executor instead of killing it when <code>spark.dynamicAllocation.enabled</code> is enabled. + </td> + <td>3.1.0</td> +</tr> +<tr> + <td><code>spark.executor.decommission.killInterval</code></td> + <td>(none)</td> + <td> + Duration after which a decommissioned executor will be killed forcefully by an outside (e.g. non-Spark) service. + </td> + <td>3.1.0</td> +</tr> +<tr> + <td><code>spark.executor.decommission.forceKillTimeout</code></td> + <td>(none)</td> + <td> + Duration after which Spark will force a decommissioning executor to exit. + This should be set to a high value in most situations, as low values will prevent block migrations from having enough time to complete. + </td> + <td>3.2.0</td> +</tr> +<tr> + <td><code>spark.executor.decommission.signal</code></td> + <td>PWR</td> + <td> + The signal used to trigger the executor to start decommissioning. + </td> + <td>3.2.0</td> +</tr> </table> Apart from these, the following properties are also available, and may be useful in some situations: @@ -681,7 +718,7 @@ Apart from these, the following properties are also available, and may be useful </tr> <tr> <td><code>spark.redaction.regex</code></td> - <td>(?i)secret|password|token</td> + <td>(?i)secret|password|token|access[.]key</td> <td> Regex to decide which Spark configuration properties and environment variables in driver and executor environments contain sensitive information. When this regex matches a property key or @@ -689,6 +726,16 @@ Apart from these, the following properties are also available, and may be useful </td> <td>2.1.2</td> </tr> +<tr> + <td><code>spark.redaction.string.regex</code></td> + <td>(none)</td> + <td> + Regex to decide which parts of strings produced by Spark contain sensitive information. + When this regex matches a string part, that string part is replaced by a dummy value. + This is currently used to redact the output of SQL explain commands.
+ </td> + <td>2.2.0</td> +</tr> <tr> <td><code>spark.python.profile</code></td> <td>false</td> <td> @@ -906,6 +953,23 @@ Apart from these, the following properties are also available, and may be useful </td> <td>1.4.0</td> </tr> +<tr> + <td><code>spark.shuffle.unsafe.file.output.buffer</code></td> + <td>32k</td> + <td> + The size of the buffered output stream used by the unsafe shuffle writer when writing each partition to the file system. + In KiB unless otherwise specified. + </td> + <td>2.3.0</td> +</tr> +<tr> + <td><code>spark.shuffle.spill.diskWriteBufferSize</code></td> + <td>1024 * 1024</td> + <td> + The buffer size, in bytes, to use when writing the sorted records to an on-disk file. + </td> + <td>2.3.0</td> +</tr> <tr> <td><code>spark.shuffle.io.maxRetries</code></td> <td>3</td> <td> @@ -988,6 +1052,17 @@ Apart from these, the following properties are also available, and may be useful </td> <td>1.2.0</td> </tr> +<tr> + <td><code>spark.shuffle.service.name</code></td> + <td>spark_shuffle</td> + <td> + The configured name of the Spark shuffle service the client should communicate with. + This must match the name used to configure the shuffle service within the YARN NodeManager configuration + (<code>yarn.nodemanager.aux-services</code>). Only takes effect + when <code>spark.shuffle.service.enabled</code> is set to true. + </td> + <td>3.2.0</td> +</tr> <tr> <td><code>spark.shuffle.service.index.cache.size</code></td> <td>100m</td> <td> @@ -1028,6 +1103,14 @@ Apart from these, the following properties are also available, and may be useful </td> <td>1.1.1</td> </tr> +<tr> + <td><code>spark.shuffle.sort.io.plugin.class</code></td> + <td>org.apache.spark.shuffle.sort.io.LocalDiskShuffleDataIO</td> + <td> + Name of the class to use for shuffle IO.
+ </td> + <td>3.0.0</td> +</tr> <tr> <td><code>spark.shuffle.spill.compress</code></td> <td>true</td> <td> @@ -1063,6 +1146,58 @@ Apart from these, the following properties are also available, and may be useful </td> <td>2.3.0</td> </tr> +<tr> + <td><code>spark.shuffle.reduceLocality.enabled</code></td> + <td>true</td> + <td> + Whether to compute locality preferences for reduce tasks. + </td> + <td>1.5.0</td> +</tr> +<tr> + <td><code>spark.shuffle.mapOutput.minSizeForBroadcast</code></td> + <td>512k</td> + <td> + The size at which we use Broadcast to send the map output statuses to the executors. + </td> + <td>2.0.0</td> +</tr> +<tr> + <td><code>spark.shuffle.detectCorrupt</code></td> + <td>true</td> + <td> + Whether to detect any corruption in fetched blocks. + </td> + <td>2.2.0</td> +</tr> +<tr> + <td><code>spark.shuffle.detectCorrupt.useExtraMemory</code></td> + <td>false</td> + <td> + If enabled, part of a compressed/encrypted stream will be decompressed/decrypted using extra memory + to detect corruption early. Any IOException thrown will cause the task to be retried once, + and if it fails again with the same exception, a FetchFailedException will be thrown to retry the previous stage. + </td> + <td>3.0.0</td> +</tr> +<tr> + <td><code>spark.shuffle.useOldFetchProtocol</code></td> + <td>false</td> + <td> + Whether to use the old protocol while doing the shuffle block fetching. It is only needed for compatibility + when a job running on a new Spark version fetches shuffle blocks from an old-version external shuffle service. + </td> + <td>3.0.0</td> +</tr> +<tr> + <td><code>spark.shuffle.readHostLocalDisk</code></td> + <td>true</td> + <td> + If enabled (and <code>spark.shuffle.useOldFetchProtocol</code> is disabled), shuffle blocks requested from those block managers + which are running on the same host are read from the disk directly instead of being fetched as remote blocks over the network.
+ </td> + <td>3.0.0</td> +</tr> <tr> <td><code>spark.files.io.connectionTimeout</code></td> <td>value of <code>spark.network.timeout</code></td> @@ -1102,6 +1237,22 @@ Apart from these, the following properties are also available, and may be useful </td> <td>3.0.0</td> </tr> +<tr> + <td><code>spark.shuffle.service.db.enabled</code></td> + <td>true</td> + <td> + Whether to use a database in the ExternalShuffleService. Note that this only affects standalone mode. + </td> + <td>3.0.0</td> +</tr> +<tr> + <td><code>spark.shuffle.service.db.backend</code></td> + <td>LEVELDB</td> + <td> + Specifies the disk-based store used for the shuffle service's local database. Set to LEVELDB or ROCKSDB. + </td> + <td>3.4.0</td> +</tr> </table> ### Spark UI @@ -1735,6 +1886,14 @@ Apart from these, the following properties are also available, and may be useful </td> <td>1.6.0</td> </tr> +<tr> + <td><code>spark.storage.unrollMemoryThreshold</code></td> + <td>1024 * 1024</td> + <td> + Initial memory to request before unrolling any block. + </td> + <td>1.1.0</td> +</tr> <tr> <td><code>spark.storage.replication.proactive</code></td> <td>false</td> <td> @@ -1745,6 +1904,16 @@ Apart from these, the following properties are also available, and may be useful </td> <td>2.2.0</td> </tr> +<tr> + <td><code>spark.storage.localDiskByExecutors.cacheSize</code></td> + <td>1000</td> + <td> + The max number of executors for which the local dirs are stored. This size is applied to both the driver and + the executor side to avoid having an unbounded store. This cache is used to avoid network transfers + when fetching disk-persisted RDD blocks or shuffle blocks (when <code>spark.shuffle.readHostLocalDisk</code> is set) from the same host.
+ </td> + <td>3.0.0</td> +</tr> <tr> <td><code>spark.cleaner.periodicGC.interval</code></td> <td>30min</td> <td> @@ -1816,6 +1985,14 @@ Apart from these, the following properties are also available, and may be useful </td> <td>2.1.1</td> </tr> +<tr> + <td><code>spark.broadcast.UDFCompressionThreshold</code></td> + <td>1 * 1024 * 1024</td> + <td> + The threshold, in bytes unless otherwise specified, above which user-defined functions (UDFs) and Python RDD commands are compressed by broadcast. + </td> + <td>3.0.0</td> +</tr> <tr> <td><code>spark.executor.cores</code></td> <td> @@ -1891,6 +2068,24 @@ Apart from these, the following properties are also available, and may be useful </td> <td>1.0.0</td> </tr> +<tr> + <td><code>spark.files.ignoreCorruptFiles</code></td> + <td>false</td> + <td> + Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted or + non-existing files, and the contents that have been read will still be returned. + </td> + <td>2.1.0</td> +</tr> +<tr> + <td><code>spark.files.ignoreMissingFiles</code></td> + <td>false</td> + <td> + Whether to ignore missing files. If true, the Spark jobs will continue to run when encountering missing files and + the contents that have been read will still be returned. + </td> + <td>2.4.0</td> +</tr> <tr> <td><code>spark.files.maxPartitionBytes</code></td> <td>134217728 (128 MiB)</td> <td> @@ -1944,6 +2139,67 @@ Apart from these, the following properties are also available, and may be useful </td> <td>0.9.2</td> </tr> +<tr> + <td><code>spark.storage.decommission.enabled</code></td> + <td>false</td> + <td> + Whether to decommission the block manager when decommissioning an executor. + </td> + <td>3.1.0</td> +</tr> +<tr> + <td><code>spark.storage.decommission.shuffleBlocks.enabled</code></td> + <td>true</td> + <td> + Whether to transfer shuffle blocks during block manager decommissioning. Requires a migratable shuffle resolver + (like sort-based shuffle).
+ </td> + <td>3.1.0</td> +</tr> +<tr> + <td><code>spark.storage.decommission.shuffleBlocks.maxThreads</code></td> + <td>8</td> + <td> + Maximum number of threads to use in migrating shuffle files. + </td> + <td>3.1.0</td> +</tr> +<tr> + <td><code>spark.storage.decommission.rddBlocks.enabled</code></td> + <td>true</td> + <td> + Whether to transfer RDD blocks during block manager decommissioning. + </td> + <td>3.1.0</td> +</tr> +<tr> + <td><code>spark.storage.decommission.fallbackStorage.path</code></td> + <td>(none)</td> + <td> + The location for fallback storage during block manager decommissioning. For example, <code>s3a://spark-storage/</code>. + If empty, fallback storage is disabled. The storage should be managed with a TTL because Spark will not clean it up. + </td> + <td>3.1.0</td> +</tr> +<tr> + <td><code>spark.storage.decommission.fallbackStorage.cleanUp</code></td> + <td>false</td> + <td> + If true, Spark cleans up its fallback storage data at shutdown. + </td> + <td>3.2.0</td> +</tr> +<tr> + <td><code>spark.storage.decommission.shuffleBlocks.maxDiskSize</code></td> + <td>(none)</td> + <td> + Maximum disk space to use to store shuffle blocks before rejecting remote shuffle blocks. + Rejecting remote shuffle blocks means that an executor will not receive any shuffle migrations, + and if there are no other executors available for migration then shuffle blocks will be lost unless + <code>spark.storage.decommission.fallbackStorage.path</code> is configured. + </td> + <td>3.2.0</td> +</tr> <tr> <td><code>spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version</code></td> <td>1</td> <td> @@ -1971,6 +2227,7 @@ Apart from these, the following properties are also available, and may be useful </td> <td>3.0.0</td> </tr> +<tr> <td><code>spark.executor.processTreeMetrics.enabled</code></td> <td>false</td> <td> @@ -1981,6 +2238,7 @@ Apart from these, the following properties are also available, and may be useful exists.
</td> <td>3.0.0</td> +</tr> <tr> <td><code>spark.executor.metrics.pollingInterval</code></td> <td>0</td> <td> @@ -1993,6 +2251,32 @@ Apart from these, the following properties are also available, and may be useful </td> <td>3.0.0</td> </tr> +<tr> + <td><code>spark.eventLog.gcMetrics.youngGenerationGarbageCollectors</code></td> + <td>Copy,PS Scavenge,ParNew,G1 Young Generation</td> + <td> + Names of supported young generation garbage collectors. A name is usually the return value of GarbageCollectorMXBean.getName. + The built-in young generation garbage collectors are Copy,PS Scavenge,ParNew,G1 Young Generation. + </td> + <td>3.0.0</td> +</tr> +<tr> + <td><code>spark.eventLog.gcMetrics.oldGenerationGarbageCollectors</code></td> + <td>MarkSweepCompact,PS MarkSweep,ConcurrentMarkSweep,G1 Old Generation</td> + <td> + Names of supported old generation garbage collectors. A name is usually the return value of GarbageCollectorMXBean.getName. + The built-in old generation garbage collectors are MarkSweepCompact,PS MarkSweep,ConcurrentMarkSweep,G1 Old Generation. + </td> + <td>3.0.0</td> +</tr> +<tr> + <td><code>spark.executor.metrics.fileSystemSchemes</code></td> + <td>file,hdfs</td> + <td> + The file system schemes to report in executor metrics. + </td> + <td>3.1.0</td> +</tr> </table> ### Networking @@ -2321,6 +2605,16 @@ Apart from these, the following properties are also available, and may be useful </td> <td>2.4.1</td> </tr> +<tr> + <td><code>spark.standalone.submit.waitAppCompletion</code></td> + <td>false</td> + <td> + In standalone cluster mode, controls whether the client waits to exit until the application completes. + If set to true, the client process will stay alive polling the application's status. + Otherwise, the client process will exit after submission.
+ </td> + <td>3.1.0</td> +</tr> <tr> <td><code>spark.excludeOnFailure.enabled</code></td> <td> @@ -3342,6 +3636,15 @@ Push-based shuffle helps improve the reliability and performance of spark shuffl </td> <td>3.2.0</td> </tr> +<tr> + <td><code>spark.shuffle.push.numPushThreads</code></td> + <td>(none)</td> + <td> + Specifies the number of threads in the block pusher pool. These threads assist in creating connections and pushing blocks to remote external shuffle services. + By default, the threadpool size is equal to the number of Spark executor cores. + </td> + <td>3.2.0</td> +</tr> <tr> <td><code>spark.shuffle.push.maxBlockSizeToPush</code></td> <td><code>1m</code></td> <td> @@ -3360,6 +3663,15 @@ Push-based shuffle helps improve the reliability and performance of spark shuffl </td> <td>3.2.0</td> </tr> +<tr> + <td><code>spark.shuffle.push.merge.finalizeThreads</code></td> + <td>8</td> + <td> + Number of threads used by the driver to finalize shuffle merge. Since it could potentially take seconds for a large shuffle to finalize, + having multiple threads helps the driver handle concurrent shuffle merge finalize requests when push-based shuffle is enabled. + </td> + <td>3.3.0</td> +</tr> <tr> <td><code>spark.shuffle.push.minShuffleSizeToWait</code></td> <td><code>500m</code></td> --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
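As an aside on the `spark.redaction.regex` change in this patch: the updated default pattern can be sanity-checked with plain Python, since it is an ordinary case-insensitive regex matched against configuration keys. The `redact` helper below is a hypothetical illustration of the matching behavior, not Spark's actual implementation; the `*********(redacted)` placeholder mirrors the dummy value Spark is believed to use, but treat it as an assumption.

```python
import re

# Updated default of spark.redaction.regex from this patch.
# (?i) makes it case-insensitive; [.] matches a literal dot,
# so keys like "fs.s3a.access.key" now match.
REDACTION_REGEX = re.compile(r"(?i)secret|password|token|access[.]key")

def redact(conf: dict) -> dict:
    """Loose illustration only: replace values whose keys match the
    redaction regex with a dummy value, as Spark does for sensitive
    properties in its UI and logs."""
    return {
        k: "*********(redacted)" if REDACTION_REGEX.search(k) else v
        for k, v in conf.items()
    }

conf = {
    "spark.hadoop.fs.s3a.access.key": "AKIA...",   # matches access[.]key
    "spark.ssl.keyPassword": "hunter2",            # matches password
    "spark.executor.memory": "4g",                 # left untouched
}
print(redact(conf))
```

Note that the regex is applied with `search`, not `fullmatch`: any key merely containing one of the sensitive words is redacted, which is why the broadened `access[.]key` alternative was worth adding.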