FelixYBW commented on issue #8852: URL: https://github.com/apache/incubator-gluten/issues/8852#issuecomment-3659620776
Recording here: Spark 4.0 new configs.

Property Name | Default | Meaning | Since Version
-- | -- | -- | --
spark.driver.minMemoryOverhead | 384m | The minimum amount of non-heap memory to be allocated per driver process in cluster mode, in MiB unless otherwise specified, if spark.driver.memoryOverhead is not defined. This option is currently supported on YARN and Kubernetes. | 4.0.0
spark.executor.minMemoryOverhead | 384m | The minimum amount of non-heap memory to be allocated per executor process, in MiB unless otherwise specified, if spark.executor.memoryOverhead is not defined. This option is currently supported on YARN and Kubernetes. | 4.0.0
spark.driver.timeout | 0min | A timeout for the Spark driver, in minutes. 0 means infinite. For a positive value, the driver is terminated with exit code 124 if it runs past the timeout. Requires setting spark.plugins to org.apache.spark.deploy.DriverTimeoutPlugin. | 4.0.0
spark.driver.log.localDir | (none) | Specifies a local directory to write driver logs and enable the Driver Log UI tab. | 4.0.0

---

**Shuffle Behavior**

Property Name | Default | Meaning | Since Version
-- | -- | -- | --
spark.shuffle.file.merge.buffer | 32k | Size of the in-memory buffer for each shuffle file input stream, in KiB unless otherwise specified. These buffers are allocated off-heap and scale with the number of shuffle files, so overly large buffers should be avoided. | 4.0.0
spark.shuffle.localDisk.file.output.buffer | 32k | The output buffer size used after each partition is written in all local disk shuffle writers, in KiB unless otherwise specified. | 4.0.0

---

**Spark UI**

Property Name | Default | Meaning | Since Version
-- | -- | -- | --
spark.ui.threadDump.flamegraphEnabled | true | Whether to render the Flamegraph for executor thread dumps. | 4.0.0

---

**Compression and Serialization**

Property Name | Default | Meaning | Since Version
-- | -- | -- | --
spark.checkpoint.dir | (none) | Set the default directory for checkpointing. It can be overwritten by SparkContext.setCheckpointDir. | 4.0.0
spark.io.compression.zstd.workers | 0 | Number of threads spawned to compress in parallel when using Zstd. When the value is 0, no worker is spawned and compression runs single-threaded. When the value is > 0, asynchronous mode is triggered and the corresponding number of threads is spawned. More workers improve performance but also increase memory cost. | 4.0.0
spark.io.compression.lzf.parallel.enabled | false | When true, LZF compression will use multiple threads to compress data in parallel. | 4.0.0

---

**Scheduling**

Property Name | Default | Meaning | Since Version
-- | -- | -- | --
spark.excludeOnFailure.application.enabled | false | If set to "true", enables excluding executors for the entire application due to too many task failures, preventing Spark from scheduling tasks on them. This config overrides "spark.excludeOnFailure.enabled". | 4.0.0
spark.excludeOnFailure.taskAndStage.enabled | false | If set to "true", enables excluding executors at the task-set level due to too many task failures, preventing Spark from scheduling tasks on them. This config overrides "spark.excludeOnFailure.enabled". | 4.0.0

---

**Spark SQL**

Property Name | Default | Meaning | Since Version
-- | -- | -- | --
spark.sql.avro.xz.level | 6 | Compression level for the xz codec used when writing AVRO files. Valid values are in the range 1 to 9 inclusive. | 4.0.0
spark.sql.avro.zstandard.bufferPool.enabled | false | If true, enable the buffer pool of the ZSTD JNI library when writing AVRO files. | 4.0.0
spark.sql.avro.zstandard.level | 3 | Compression level for the zstandard codec used when writing AVRO files. | 4.0.0
spark.sql.binaryOutputStyle | (none) | The output style used to display binary data. Valid values are 'UTF-8', 'BASIC', 'BASE64', 'HEX', and 'HEX_DISCRETE'. | 4.0.0
spark.sql.defaultCacheStorageLevel | MEMORY_AND_DISK | The default storage level of dataset.cache(), catalog.cacheTable() and the SQL query CACHE TABLE t. | 4.0.0
spark.sql.execution.arrow.transformWithStateInPandas.maxRecordsPerBatch | 10000 | When using TransformWithStateInPandas, limit the maximum number of state records that can be written to a single ArrowRecordBatch in memory. | 4.0.0
spark.sql.execution.interruptOnCancel | true | When true, all running tasks will be interrupted if one cancels a query. | 4.0.0
spark.sql.execution.pandas.inferPandasDictAsMap | false | When true, spark.createDataFrame infers a dict from a Pandas DataFrame as a MapType. When false, it infers a dict as a StructType, which is the default inference from PyArrow. | 4.0.0
spark.sql.execution.pyspark.udf.faulthandler.enabled | (value of spark.python.worker.faulthandler.enabled) | Same as spark.python.worker.faulthandler.enabled for Python execution with DataFrame and SQL. It can change during runtime. | 4.0.0
spark.sql.execution.pyspark.udf.hideTraceback.enabled | false | When true, only show the message of the exception from Python UDFs, hiding the stack trace. If this is enabled, simplifiedTraceback has no effect. | 4.0.0
spark.sql.execution.pyspark.udf.idleTimeoutSeconds | (value of spark.python.worker.idleTimeoutSeconds) | Same as spark.python.worker.idleTimeoutSeconds for Python execution with DataFrame and SQL. It can change during runtime. | 4.0.0
spark.sql.execution.python.udf.buffer.size | (value of spark.buffer.size) | Same as spark.buffer.size but only applies to Python UDF executions. If it is not set, the fallback is spark.buffer.size. | 4.0.0
spark.sql.execution.python.udf.maxRecordsPerBatch | 100 | When using Python UDFs, limit the maximum number of records that can be batched for serialization/deserialization. | 4.0.0
spark.sql.execution.pythonUDF.arrow.concurrency.level | (none) | The level of concurrency used to execute Arrow-optimized Python UDFs. This can be useful if Python UDFs use I/O intensively. | 4.0.0
spark.sql.extendedExplainProviders | (none) | A comma-separated list of classes that implement the org.apache.spark.sql.ExtendedExplainGenerator trait. If provided, Spark will print extended plan information from the providers in the explain plan and in the UI. | 4.0.0
spark.sql.files.ignoreInvalidPartitionPaths | false | Whether to ignore invalid partition paths that do not match `<column>=<value>`. When enabled, a table with the two partition directories 'table/invalid' and 'table/col=1' will only load the latter directory and ignore the invalid partition. | 4.0.0
spark.sql.hive.convertInsertingUnpartitionedTable | true | When set to true, and spark.sql.hive.convertMetastoreParquet or spark.sql.hive.convertMetastoreOrc is true, the built-in ORC/Parquet writer is used to process inserts into unpartitioned ORC/Parquet tables created using the Hive SQL syntax. | 4.0.0
spark.sql.icu.caseMappings.enabled | true | When enabled, use the ICU library (instead of the JVM) to implement case mappings for strings under UTF8_BINARY collation. | 4.0.0
spark.sql.inMemoryColumnarStorage.hugeVectorReserveRatio | 1.2 | When spark.sql.inMemoryColumnarStorage.hugeVectorThreshold <= 0 or the required memory is smaller than spark.sql.inMemoryColumnarStorage.hugeVectorThreshold, Spark reserves required memory * 2; otherwise, Spark reserves required memory * this ratio, and releases the column vector memory before reading the next batch of rows. | 4.0.0
spark.sql.inMemoryColumnarStorage.hugeVectorThreshold | -1b | When the required memory is larger than this, Spark reserves required memory * spark.sql.inMemoryColumnarStorage.hugeVectorReserveRatio next time and releases the column vector memory before reading the next batch of rows. -1 disables the optimization. | 4.0.0
spark.sql.json.useUnsafeRow | false | When set to true, use UnsafeRow to represent struct results in the JSON parser. It can be overwritten by the JSON option useUnsafeRow. | 4.0.0
spark.sql.operatorPipeSyntaxEnabled | true | If true, enable operator pipe syntax for Apache Spark SQL. This uses the operator pipe marker \|> to indicate separation between clauses of SQL in a manner that describes the sequence of steps that the query performs in a composable fashion. | 4.0.0
spark.sql.optimizer.avoidCollapseUDFWithExpensiveExpr | true | Whether to avoid collapsing projections that would duplicate expensive expressions in UDFs. | 4.0.0
spark.sql.planner.pythonExecution.memory | (none) | Specifies the memory allocation for executing Python code in the Spark driver, in MiB. When set, it caps the memory for Python execution to the specified amount. If not set, Spark will not limit Python's memory usage and it is up to the application to avoid exceeding the overhead memory space shared with other non-JVM processes. Note: Windows does not support resource limiting, and actual resources are not limited on macOS. | 4.0.0
spark.sql.preserveCharVarcharTypeInfo | false | When true, Spark does not replace CHAR/VARCHAR types with the STRING type, which is the default behavior of Spark 3.0 and earlier versions. This means length checks for CHAR/VARCHAR types are enforced and the CHAR type is properly padded. | 4.0.0
spark.sql.pyspark.plotting.max_rows | 1000 | The visual limit on plots. If set to 1000, for top-n-based plots (pie, bar, barh) the first 1000 data points will be used for plotting; for sampled-based plots (scatter, area, line), 1000 data points will be randomly sampled. | 4.0.0
spark.sql.pyspark.udf.profiler | (none) | Configure the Python/Pandas UDF profiler by enabling or disabling it, with the option to choose between "perf" and "memory" types; unsetting the config disables the profiler. Disabled by default. | 4.0.0
spark.sql.scripting.enabled | false | The SQL Scripting feature is under development, and its use should be gated behind this feature flag. SQL Scripting enables users to write procedural SQL including control flow and error handling. | 4.0.0
spark.sql.shuffleDependency.fileCleanup.enabled | false | When enabled, shuffle files will be cleaned up at the end of Spark Connect SQL executions. | 4.0.0
spark.sql.shuffleDependency.skipMigration.enabled | false | When enabled, shuffle dependencies for a Spark Connect SQL execution are marked at the end of the execution, and they will not be migrated during decommissions. | 4.0.0
spark.sql.sources.v2.bucketing.allowCompatibleTransforms.enabled | false | Whether to allow storage-partition join in the case where the partition transforms are compatible but not identical. This config requires both spark.sql.sources.v2.bucketing.enabled and spark.sql.sources.v2.bucketing.pushPartValues.enabled to be enabled and spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled to be disabled. | 4.0.0
spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled | false | Whether to allow storage-partition join in the case where join keys are a subset of the partition keys of the source tables. At planning time, Spark will group the partitions by only those keys that are in the join keys. This is currently enabled only if spark.sql.requireAllClusterKeysForDistribution is false. | 4.0.0
spark.sql.sources.v2.bucketing.partition.filter.enabled | false | Whether to filter partitions when running storage-partition join. When enabled, partitions without matches on the other side can be omitted from scanning, if allowed by the join type. This config requires both spark.sql.sources.v2.bucketing.enabled and spark.sql.sources.v2.bucketing.pushPartValues.enabled to be enabled. | 4.0.0
spark.sql.sources.v2.bucketing.shuffle.enabled | false | During a storage-partitioned join, whether to allow shuffling only one side. When only one side is KeyGroupedPartitioning, and the conditions are met, Spark will only shuffle the other side. This optimization reduces the amount of data that needs to be shuffled. This config requires spark.sql.sources.v2.bucketing.enabled to be enabled. | 4.0.0
spark.sql.sources.v2.bucketing.sorting.enabled | false | When turned on, Spark will recognize the specific distribution reported by a V2 data source through SupportsReportPartitioning, and will try to avoid a shuffle if possible when sorting by those columns. This config requires spark.sql.sources.v2.bucketing.enabled to be enabled. | 4.0.0
spark.sql.stackTracesInDataFrameContext | 1 | The number of non-Spark stack traces in the captured DataFrame query context. | 4.0.0
spark.sql.statistics.updatePartitionStatsInAnalyzeTable.enabled | false | When this config is enabled, Spark will also update partition statistics in the ANALYZE TABLE command (i.e., ANALYZE TABLE .. COMPUTE STATISTICS [NOSCAN]). Note that the command also becomes more expensive. When this config is disabled, Spark will only update table-level statistics. | 4.0.0
spark.sql.streaming.stateStore.encodingFormat | unsaferow | The encoding format used by stateful operators to store information in the state store. | 4.0.0
spark.sql.streaming.transformWithState.stateSchemaVersion | 3 | The version of the state schema used by the transformWithState operator. | 4.0.0
spark.sql.timeTravelTimestampKey | timestampAsOf | The option name to specify the time travel timestamp when reading a table. | 4.0.0
spark.sql.timeTravelVersionKey | versionAsOf | The option name to specify the time travel table version when reading a table. | 4.0.0
spark.sql.transposeMaxValues | 500 | When doing a transpose without specifying values for the index column, this is the maximum number of values that will be transposed without error. | 4.0.0
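As a rough illustration of how the new spark.executor.minMemoryOverhead floor interacts with the existing 10% overhead factor when spark.executor.memoryOverhead is not set: the documented behavior is a floor on the computed overhead. The sketch below is a plain-Python model of that rule (an assumption based on the table above, not Spark's actual code path; `executor_overhead_mib` is a hypothetical helper name).

```python
# Sketch: how YARN/Kubernetes size executor memory overhead when
# spark.executor.memoryOverhead is NOT set. This mirrors the documented
# behavior only; it is not Spark's implementation.
MIN_MEMORY_OVERHEAD_MIB = 384   # spark.executor.minMemoryOverhead default (384m)
OVERHEAD_FACTOR = 0.10          # default overhead factor for JVM workloads

def executor_overhead_mib(executor_memory_mib: int,
                          min_overhead_mib: int = MIN_MEMORY_OVERHEAD_MIB) -> int:
    """Overhead = max(factor * executor memory, configured minimum floor)."""
    return max(int(executor_memory_mib * OVERHEAD_FACTOR), min_overhead_mib)

# Small executors hit the configurable floor; large ones use the 10% factor.
print(executor_overhead_mib(2048))                          # 384 (floor wins)
print(executor_overhead_mib(8192))                          # 819 (10% wins)
print(executor_overhead_mib(2048, min_overhead_mib=1024))   # 1024 (raised floor)
```

Raising the floor (e.g. `--conf spark.executor.minMemoryOverhead=1024m` at submit time) may be relevant for Gluten, since its native/off-heap allocations live in this overhead region.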
