This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push: new f39b75ccbdc [SPARK-40710][DOCS] Supplement undocumented parquet configurations in documentation f39b75ccbdc is described below commit f39b75ccbdcac6a9d67c61ed399f5c03603cada7 Author: Qian.Sun <qian.sun2...@gmail.com> AuthorDate: Sun Oct 9 10:11:05 2022 -0500 [SPARK-40710][DOCS] Supplement undocumented parquet configurations in documentation ### What changes were proposed in this pull request? This PR aims to supplement undocumented parquet configurations in documentation. ### Why are the changes needed? Help users to confirm configurations through documentation instead of code. ### Does this PR introduce _any_ user-facing change? Yes, more configurations in documentation. ### How was this patch tested? Pass the GA. Closes #38160 from dcoliversun/SPARK-40710. Authored-by: Qian.Sun <qian.sun2...@gmail.com> Signed-off-by: Sean Owen <sro...@gmail.com> --- docs/sql-data-sources-parquet.md | 122 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 122 insertions(+) diff --git a/docs/sql-data-sources-parquet.md b/docs/sql-data-sources-parquet.md index 2189892c928..de339c21ef2 100644 --- a/docs/sql-data-sources-parquet.md +++ b/docs/sql-data-sources-parquet.md @@ -454,6 +454,28 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession </td> <td>1.3.0</td> </tr> +<tr> + <td><code>spark.sql.parquet.int96TimestampConversion</code></td> + <td>false</td> + <td> + This controls whether timestamp adjustments should be applied to INT96 data when + converting to timestamps, for data written by Impala. This is necessary because Impala + stores INT96 data with a different timezone offset than Hive & Spark. + </td> + <td>2.3.0</td> +</tr> +<tr> + <td><code>spark.sql.parquet.outputTimestampType</code></td> + <td>INT96</td> + <td> + Sets which Parquet timestamp type to use when Spark writes data to Parquet files. + INT96 is a non-standard but commonly used timestamp type in Parquet. 
TIMESTAMP_MICROS + is a standard timestamp type in Parquet, which stores the number of microseconds from the + Unix epoch. TIMESTAMP_MILLIS is also standard, but with millisecond precision, which + means Spark has to truncate the microsecond portion of its timestamp value. + </td> + <td>2.3.0</td> +</tr> <tr> <td><code>spark.sql.parquet.compression.codec</code></td> <td>snappy</td> @@ -473,6 +495,17 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession <td>Enables Parquet filter push-down optimization when set to true.</td> <td>1.2.0</td> </tr> +<tr> + <td><code>spark.sql.parquet.aggregatePushdown</code></td> + <td>false</td> + <td> + If true, aggregates will be pushed down to Parquet for optimization. MIN, MAX + and COUNT are supported as aggregate expressions. For MIN/MAX, the boolean, integer, float and date + types are supported. For COUNT, all data types are supported. If statistics are missing from any Parquet file + footer, an exception is thrown. + </td> + <td>3.3.0</td> +</tr> <tr> <td><code>spark.sql.hive.convertMetastoreParquet</code></td> <td>true</td> @@ -493,6 +526,17 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession </td> <td>1.5.0</td> </tr> +<tr> + <td><code>spark.sql.parquet.respectSummaryFiles</code></td> + <td>false</td> + <td> + When true, we assume that all part-files of Parquet are consistent with + summary files and will ignore them when merging the schema. Otherwise, if this is + false, which is the default, we will merge all part-files. This should be considered + an expert-only option, and shouldn't be enabled without knowing exactly what it means. 
+ </td> + <td>1.5.0</td> +</tr> <tr> <td><code>spark.sql.parquet.writeLegacyFormat</code></td> <td>false</td> @@ -505,6 +549,84 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession </td> <td>1.6.0</td> </tr> +<tr> + <td><code>spark.sql.parquet.enableVectorizedReader</code></td> + <td>true</td> + <td> + Enables vectorized Parquet decoding. + </td> + <td>2.0.0</td> +</tr> +<tr> + <td><code>spark.sql.parquet.enableNestedColumnVectorizedReader</code></td> + <td>true</td> + <td> + Enables vectorized Parquet decoding for nested columns (e.g., struct, list, map). + Requires <code>spark.sql.parquet.enableVectorizedReader</code> to be enabled. + </td> + <td>3.3.0</td> +</tr> +<tr> + <td><code>spark.sql.parquet.recordLevelFilter.enabled</code></td> + <td>false</td> + <td> + If true, enables Parquet's native record-level filtering using the pushed down filters. + This configuration only has an effect when <code>spark.sql.parquet.filterPushdown</code> + is enabled and the vectorized reader is not used. You can ensure the vectorized reader + is not used by setting <code>spark.sql.parquet.enableVectorizedReader</code> to false. + </td> + <td>2.3.0</td> +</tr> +<tr> + <td><code>spark.sql.parquet.columnarReaderBatchSize</code></td> + <td>4096</td> + <td> + The number of rows to include in a Parquet vectorized reader batch. The number should + be carefully chosen to minimize overhead and avoid OOMs while reading data. + </td> + <td>2.4.0</td> +</tr> +<tr> + <td><code>spark.sql.parquet.fieldId.write.enabled</code></td> + <td>true</td> + <td> + Field ID is a native field of the Parquet schema spec. When enabled, + Parquet writers will populate the field ID metadata (if present) from the Spark schema into the Parquet schema. + </td> + <td>3.3.0</td> +</tr> +<tr> + <td><code>spark.sql.parquet.fieldId.read.enabled</code></td> + <td>false</td> + <td> + Field ID is a native field of the Parquet schema spec. 
When enabled, Parquet readers + will use field IDs (if present) in the requested Spark schema to look up Parquet + fields instead of using column names. + </td> + <td>3.3.0</td> +</tr> +<tr> + <td><code>spark.sql.parquet.fieldId.read.ignoreMissing</code></td> + <td>false</td> + <td> + When the Parquet file doesn't have any field IDs but the + Spark read schema is using field IDs to read, we will silently return nulls + when this flag is enabled, or error otherwise. + </td> + <td>3.3.0</td> +</tr> +<tr> + <td><code>spark.sql.parquet.timestampNTZ.enabled</code></td> + <td>true</td> + <td> + Enables <code>TIMESTAMP_NTZ</code> support for Parquet reads and writes. + When enabled, <code>TIMESTAMP_NTZ</code> values are written as Parquet timestamp + columns with annotation isAdjustedToUTC = false and are inferred in a similar way. + When disabled, such values are read as <code>TIMESTAMP_LTZ</code> and have to be + converted to <code>TIMESTAMP_LTZ</code> for writes. + </td> + <td>3.4.0</td> +</tr> <tr> <td>spark.sql.parquet.datetimeRebaseModeInRead</td> <td><code>EXCEPTION</code></td> --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org