FelixYBW commented on issue #8852: URL: https://github.com/apache/incubator-gluten/issues/8852#issuecomment-3659620776
Recording here: Spark 4.0 new configs.

Property Name | Default | Meaning | Since Version
-- | -- | -- | --
spark.driver.minMemoryOverhead | 384m | The minimum amount of non-heap memory to be allocated per driver process in cluster mode, in MiB unless otherwise specified, if spark.driver.memoryOverhead is not defined. This option is currently supported on YARN and Kubernetes. | 4.0.0
spark.executor.minMemoryOverhead | 384m | The minimum amount of non-heap memory to be allocated per executor process, in MiB unless otherwise specified, if spark.executor.memoryOverhead is not defined. This option is currently supported on YARN and Kubernetes. | 4.0.0
spark.driver.timeout | 0min | A timeout for the Spark driver, in minutes. 0 means infinite. For a positive value, the driver is terminated with exit code 124 if it runs past the timeout. Requires setting spark.plugins to org.apache.spark.deploy.DriverTimeoutPlugin. | 4.0.0
spark.driver.log.localDir | (none) | Specifies a local directory to write driver logs and enable the Driver Log UI tab. | 4.0.0

---

**Shuffle Behavior**

Property Name | Default | Meaning | Since Version
-- | -- | -- | --
spark.shuffle.file.merge.buffer | 32k | Size of the in-memory buffer for each shuffle file input stream, in KiB unless otherwise specified. These buffers are allocated off-heap and scale with the number of shuffle files, so overly large buffers should be avoided. | 4.0.0
spark.shuffle.localDisk.file.output.buffer | 32k | The output buffer size used after each partition is written in all local disk shuffle writers, in KiB unless otherwise specified. | 4.0.0

---

**Spark UI**

Property Name | Default | Meaning | Since Version
-- | -- | -- | --
spark.ui.threadDump.flamegraphEnabled | true | Whether to render the Flamegraph for executor thread dumps. | 4.0.0

---

**Compression and Serialization**

Property Name | Default | Meaning | Since Version
-- | -- | -- | --
spark.checkpoint.dir | (none) | Set the default directory for checkpointing. It can be overwritten by SparkContext.setCheckpointDir. | 4.0.0
spark.io.compression.zstd.workers | 0 | Number of threads spawned to compress in parallel when using Zstd. When the value is 0, no worker is spawned and compression runs single-threaded. When the value is > 0, asynchronous mode is triggered and the corresponding number of threads is spawned. More workers improve performance but also increase memory cost. | 4.0.0
spark.io.compression.lzf.parallel.enabled | false | When true, LZF compression will use multiple threads to compress data in parallel. | 4.0.0

---

**Scheduling**

Property Name | Default | Meaning | Since Version
-- | -- | -- | --
spark.excludeOnFailure.application.enabled | false | If set to "true", enables excluding executors for the entire application due to too many task failures, preventing Spark from scheduling tasks on them. This config overrides "spark.excludeOnFailure.enabled". | 4.0.0
spark.excludeOnFailure.taskAndStage.enabled | false | If set to "true", enables excluding executors at the task-set level due to too many task failures, preventing Spark from scheduling tasks on them. This config overrides "spark.excludeOnFailure.enabled". | 4.0.0

---

**Spark SQL**

Property Name | Default | Meaning | Since Version
-- | -- | -- | --
spark.sql.avro.xz.level | 6 | Compression level for the xz codec used when writing AVRO files. Valid values are in the range 1 to 9 inclusive. | 4.0.0
spark.sql.avro.zstandard.bufferPool.enabled | false | If true, enable the buffer pool of the ZSTD JNI library when writing AVRO files. | 4.0.0
spark.sql.avro.zstandard.level | 3 | Compression level for the zstandard codec used when writing AVRO files. | 4.0.0
spark.sql.binaryOutputStyle | (none) | The output style used to display binary data. Valid values are 'UTF-8', 'BASIC', 'BASE64', 'HEX', and 'HEX_DISCRETE'. | 4.0.0
spark.sql.defaultCacheStorageLevel | MEMORY_AND_DISK | The default storage level of dataset.cache(), catalog.cacheTable() and the SQL query CACHE TABLE t. | 4.0.0
spark.sql.execution.arrow.transformWithStateInPandas.maxRecordsPerBatch | 10000 | When using TransformWithStateInPandas, limit the maximum number of state records that can be written to a single ArrowRecordBatch in memory. | 4.0.0
spark.sql.execution.interruptOnCancel | true | When true, all running tasks will be interrupted if one cancels a query. | 4.0.0
spark.sql.execution.pandas.inferPandasDictAsMap | false | When true, spark.createDataFrame infers a dict from a Pandas DataFrame as a MapType. When false, it infers a dict as a StructType, which is the default inference from PyArrow. | 4.0.0
spark.sql.execution.pyspark.udf.faulthandler.enabled | (value of spark.python.worker.faulthandler.enabled) | Same as spark.python.worker.faulthandler.enabled for Python execution with DataFrame and SQL. It can change during runtime. | 4.0.0
spark.sql.execution.pyspark.udf.hideTraceback.enabled | false | When true, only show the message of the exception from Python UDFs, hiding the stack trace. If this is enabled, simplifiedTraceback has no effect. | 4.0.0
spark.sql.execution.pyspark.udf.idleTimeoutSeconds | (value of spark.python.worker.idleTimeoutSeconds) | Same as spark.python.worker.idleTimeoutSeconds for Python execution with DataFrame and SQL. It can change during runtime. | 4.0.0
spark.sql.execution.python.udf.buffer.size | (value of spark.buffer.size) | Same as spark.buffer.size but only applies to Python UDF executions. If it is not set, the fallback is spark.buffer.size. | 4.0.0
spark.sql.execution.python.udf.maxRecordsPerBatch | 100 | When using Python UDFs, limit the maximum number of records that can be batched for serialization/deserialization. | 4.0.0
spark.sql.execution.pythonUDF.arrow.concurrency.level | (none) | The level of concurrency used to execute Arrow-optimized Python UDFs. This can be useful if Python UDFs use I/O intensively. | 4.0.0
spark.sql.extendedExplainProviders | (none) | A comma-separated list of classes that implement the org.apache.spark.sql.ExtendedExplainGenerator trait. If provided, Spark will print extended plan information from the providers in the explain plan and in the UI. | 4.0.0
spark.sql.files.ignoreInvalidPartitionPaths | false | Whether to ignore invalid partition paths that do not match `<column>=<value>`. When enabled, a table with the two partition directories 'table/invalid' and 'table/col=1' will only load the latter directory and ignore the invalid partition. | 4.0.0
spark.sql.hive.convertInsertingUnpartitionedTable | true | When set to true, and spark.sql.hive.convertMetastoreParquet or spark.sql.hive.convertMetastoreOrc is true, the built-in ORC/Parquet writer is used to process inserts into unpartitioned ORC/Parquet tables created using the Hive SQL syntax. | 4.0.0
spark.sql.icu.caseMappings.enabled | true | When enabled, use the ICU library (instead of the JVM) to implement case mappings for strings under UTF8_BINARY collation. | 4.0.0
spark.sql.inMemoryColumnarStorage.hugeVectorReserveRatio | 1.2 | When spark.sql.inMemoryColumnarStorage.hugeVectorThreshold <= 0 or the required memory is smaller than spark.sql.inMemoryColumnarStorage.hugeVectorThreshold, Spark reserves required memory * 2; otherwise, Spark reserves required memory * this ratio, and releases the column vector memory before reading the next batch of rows. | 4.0.0
spark.sql.inMemoryColumnarStorage.hugeVectorThreshold | -1b | When the required memory is larger than this, Spark reserves required memory * spark.sql.inMemoryColumnarStorage.hugeVectorReserveRatio next time and releases the column vector memory before reading the next batch of rows. -1 disables the optimization. | 4.0.0
spark.sql.json.useUnsafeRow | false | When set to true, use UnsafeRow to represent struct results in the JSON parser. It can be overwritten by the JSON option useUnsafeRow. | 4.0.0
spark.sql.operatorPipeSyntaxEnabled | true | If true, enable operator pipe syntax for Apache Spark SQL. This uses the operator pipe marker \|> to indicate separation between clauses of SQL in a manner that describes the sequence of steps that the query performs in a composable fashion. | 4.0.0
spark.sql.optimizer.avoidCollapseUDFWithExpensiveExpr | true | Whether to avoid collapsing projections that would duplicate expensive expressions in UDFs. | 4.0.0
spark.sql.planner.pythonExecution.memory | (none) | Specifies the memory allocation for executing Python code in the Spark driver, in MiB. When set, it caps the memory for Python execution to the specified amount. If not set, Spark will not limit Python's memory usage and it is up to the application to avoid exceeding the overhead memory space shared with other non-JVM processes. Note: Windows does not support resource limiting, and actual resources are not limited on macOS. | 4.0.0
spark.sql.preserveCharVarcharTypeInfo | false | When true, Spark does not replace CHAR/VARCHAR types with the STRING type, which is the default behavior of Spark 3.0 and earlier versions. This means length checks for CHAR/VARCHAR types are enforced and the CHAR type is properly padded. | 4.0.0
spark.sql.pyspark.plotting.max_rows | 1000 | The visual limit on plots. If set to 1000, for top-n-based plots (pie, bar, barh) the first 1000 data points will be used for plotting; for sampled-based plots (scatter, area, line), 1000 data points will be randomly sampled. | 4.0.0
spark.sql.pyspark.udf.profiler | (none) | Configure the Python/Pandas UDF profiler by enabling or disabling it, with the option to choose between "perf" and "memory" types; unsetting the config disables the profiler. Disabled by default. | 4.0.0
spark.sql.scripting.enabled | false | The SQL Scripting feature is under development, and its use should be gated behind this feature flag. SQL Scripting enables users to write procedural SQL including control flow and error handling. | 4.0.0
spark.sql.shuffleDependency.fileCleanup.enabled | false | When enabled, shuffle files will be cleaned up at the end of Spark Connect SQL executions. | 4.0.0
spark.sql.shuffleDependency.skipMigration.enabled | false | When enabled, shuffle dependencies for a Spark Connect SQL execution are marked at the end of the execution, and they will not be migrated during decommissions. | 4.0.0
spark.sql.sources.v2.bucketing.allowCompatibleTransforms.enabled | false | Whether to allow storage-partition join in the case where the partition transforms are compatible but not identical. This config requires both spark.sql.sources.v2.bucketing.enabled and spark.sql.sources.v2.bucketing.pushPartValues.enabled to be enabled and spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled to be disabled. | 4.0.0
spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled | false | Whether to allow storage-partition join in the case where join keys are a subset of the partition keys of the source tables. At planning time, Spark will group the partitions by only those keys that are in the join keys. This is currently enabled only if spark.sql.requireAllClusterKeysForDistribution is false. | 4.0.0
spark.sql.sources.v2.bucketing.partition.filter.enabled | false | Whether to filter partitions when running storage-partition join. When enabled, partitions without matches on the other side can be omitted from scanning, if allowed by the join type. This config requires both spark.sql.sources.v2.bucketing.enabled and spark.sql.sources.v2.bucketing.pushPartValues.enabled to be enabled. | 4.0.0
spark.sql.sources.v2.bucketing.shuffle.enabled | false | During a storage-partitioned join, whether to allow shuffling only one side. When only one side is KeyGroupedPartitioning, and the conditions are met, Spark will only shuffle the other side. This optimization reduces the amount of data that needs to be shuffled. This config requires spark.sql.sources.v2.bucketing.enabled to be enabled. | 4.0.0
spark.sql.sources.v2.bucketing.sorting.enabled | false | When turned on, Spark will recognize the specific distribution reported by a V2 data source through SupportsReportPartitioning, and will try to avoid a shuffle if possible when sorting by those columns. This config requires spark.sql.sources.v2.bucketing.enabled to be enabled. | 4.0.0
spark.sql.stackTracesInDataFrameContext | 1 | The number of non-Spark stack traces in the captured DataFrame query context. | 4.0.0
spark.sql.statistics.updatePartitionStatsInAnalyzeTable.enabled | false | When this config is enabled, Spark will also update partition statistics in the ANALYZE TABLE command (i.e., ANALYZE TABLE .. COMPUTE STATISTICS [NOSCAN]). Note that the command also becomes more expensive. When this config is disabled, Spark will only update table-level statistics. | 4.0.0
spark.sql.streaming.stateStore.encodingFormat | unsaferow | The encoding format used by stateful operators to store information in the state store. | 4.0.0
spark.sql.streaming.transformWithState.stateSchemaVersion | 3 | The version of the state schema used by the transformWithState operator. | 4.0.0
spark.sql.timeTravelTimestampKey | timestampAsOf | The option name to specify the time travel timestamp when reading a table. | 4.0.0
spark.sql.timeTravelVersionKey | versionAsOf | The option name to specify the time travel table version when reading a table. | 4.0.0
spark.sql.transposeMaxValues | 500 | When doing a transpose without specifying values for the index column, this is the maximum number of values that will be transposed without error. | 4.0.0
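As a rough illustration of how the new spark.executor.minMemoryOverhead floor interacts with the existing 10% overhead factor when spark.executor.memoryOverhead is not set: the documented behavior is a floor on the computed overhead. The sketch below is a plain-Python model of that rule (an assumption based on the table above, not Spark's actual code path; `executor_overhead_mib` is a hypothetical helper name).

```python
# Sketch: how YARN/Kubernetes size executor memory overhead when
# spark.executor.memoryOverhead is NOT set. This mirrors the documented
# behavior only; it is not Spark's implementation.
MIN_MEMORY_OVERHEAD_MIB = 384   # spark.executor.minMemoryOverhead default (384m)
OVERHEAD_FACTOR = 0.10          # default overhead factor for JVM workloads

def executor_overhead_mib(executor_memory_mib: int,
                          min_overhead_mib: int = MIN_MEMORY_OVERHEAD_MIB) -> int:
    """Overhead = max(factor * executor memory, configured minimum floor)."""
    return max(int(executor_memory_mib * OVERHEAD_FACTOR), min_overhead_mib)

# Small executors hit the configurable floor; large ones use the 10% factor.
print(executor_overhead_mib(2048))                          # 384 (floor wins)
print(executor_overhead_mib(8192))                          # 819 (10% wins)
print(executor_overhead_mib(2048, min_overhead_mib=1024))   # 1024 (raised floor)
```

Raising the floor (e.g. `--conf spark.executor.minMemoryOverhead=1024m` at submit time) may be relevant for Gluten, since its native/off-heap allocations live in this overhead region.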
