This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new c4d159a368d [MINOR][DOC] revisions for spark sql performance tuning to improve readability and grammar
c4d159a368d is described below

commit c4d159a368d554a8567271dbfec8f291d1de70a5
Author: Dustin William Smith <dustin.sm...@deliveryhero.com>
AuthorDate: Sun Nov 6 18:04:10 2022 -0600

    [MINOR][DOC] revisions for spark sql performance tuning to improve readability and grammar

    ### What changes were proposed in this pull request?

    I made some small grammar fixes related to dependent clauses followed by independent clauses, starting a sentence with an introductory phrase, using the plural form when "are" is present in the sentence, and other small fixes to improve readability.

    https://spark.apache.org/docs/latest/sql-performance-tuning.html

    <img width="1065" alt="Screenshot 2022-11-04 at 15 24 17" src="https://user-images.githubusercontent.com/7563201/199998862-d9418bc1-2fcd-4eff-be8e-af412add6946.png">

    ### Why are the changes needed?

    These changes improve the readability of the Spark documentation for new users or those studying up.

    ### Does this PR introduce _any_ user-facing change?

    Yes, these changes impact the Spark documentation.

    ### How was this patch tested?

    No tests were created, as these changes were solely in markdown.

    Closes #38510 from dwsmith1983/minor-doc-revisions.

    Lead-authored-by: Dustin William Smith <dustin.sm...@deliveryhero.com>
    Co-authored-by: dustin <dwsmith1...@users.noreply.github.com>
    Co-authored-by: Dustin Smith <dustin.william.sm...@gmail.com>
    Signed-off-by: Sean Owen <sro...@gmail.com>
---
 docs/sql-performance-tuning.md | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index d736ff8f83f..6ac39d90527 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -40,7 +40,7 @@ Configuration of in-memory caching can be done using the `setConf` method on `Sp
   <td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td>
   <td>true</td>
   <td>
-    When set to true Spark SQL will automatically select a compression codec for each column based
+    When set to true, Spark SQL will automatically select a compression codec for each column based
     on statistics of the data.
   </td>
   <td>1.0.1</td>
@@ -77,8 +77,8 @@ that these options will be deprecated in future release as more optimizations ar
   <td><code>spark.sql.files.openCostInBytes</code></td>
   <td>4194304 (4 MB)</td>
   <td>
-    The estimated cost to open a file, measured by the number of bytes could be scanned in the same
-    time. This is used when putting multiple files into a partition. It is better to over-estimated,
+    The estimated cost to open a file, measured by the number of bytes that could be scanned in the same
+    time. This is used when putting multiple files into a partition. It is better to over-estimate,
     then the partitions with small files will be faster than partitions with bigger files (which is
     scheduled first). This configuration is effective only when using file-based sources such as
     Parquet, JSON and ORC.
@@ -110,7 +110,7 @@ that these options will be deprecated in future release as more optimizations ar
   <td>10485760 (10 MB)</td>
   <td>
     Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when
-    performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently
+    performing a join. By setting this value to -1, broadcasting can be disabled. Note that currently
     statistics are only supported for Hive Metastore tables where the command
     <code>ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan</code> has been run.
   </td>
@@ -140,8 +140,7 @@ that these options will be deprecated in future release as more optimizations ar
   <td>10000</td>
   <td>
     Configures the maximum listing parallelism for job input paths. In case the number of input
-    paths is larger than this value, it will be throttled down to use this value. Same as above,
-    this configuration is only effective when using file-based data sources such as Parquet, ORC
+    paths is larger than this value, it will be throttled down to use this value. This configuration is only effective when using file-based data sources such as Parquet, ORC
     and JSON.
   </td>
   <td>2.1.1</td>
@@ -215,8 +214,8 @@ For more details please refer to the documentation of [Join Hints](sql-ref-synta

 ## Coalesce Hints for SQL Queries

-Coalesce hints allows the Spark SQL users to control the number of output files just like the
-`coalesce`, `repartition` and `repartitionByRange` in Dataset API, they can be used for performance
+Coalesce hints allow Spark SQL users to control the number of output files just like
+`coalesce`, `repartition` and `repartitionByRange` in the Dataset API, they can be used for performance
 tuning and reducing the number of output files. The "COALESCE" hint only has a partition number as a
 parameter. The "REPARTITION" hint has a partition number, columns, or both/neither of them as parameters.
 The "REPARTITION_BY_RANGE" hint must have column names and a partition number is optional. The "REBALANCE"
@@ -295,7 +294,7 @@ AQE converts sort-merge join to broadcast hash join when the runtime statistics
   <td><code>spark.sql.adaptive.autoBroadcastJoinThreshold</code></td>
   <td>(none)</td>
   <td>
-    Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. The default value is same with <code>spark.sql.autoBroadcastJoinThreshold</code>. Note that, this config is used only in adaptive framework.
+    Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1, broadcasting can be disabled. The default value is the same as <code>spark.sql.autoBroadcastJoinThreshold</code>. Note that, this config is used only in adaptive framework.
   </td>
   <td>3.2.0</td>
 </tr>
@@ -309,7 +308,7 @@ AQE converts sort-merge join to shuffled hash join when all post shuffle partiti
   <td><code>spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold</code></td>
   <td>0</td>
   <td>
-    Configures the maximum size in bytes per partition that can be allowed to build local hash map. If this value is not smaller than <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code> and all the partition size are not larger than this config, join selection prefer to use shuffled hash join instead of sort merge join regardless of the value of <code>spark.sql.join.preferSortMergeJoin</code>.
+    Configures the maximum size in bytes per partition that can be allowed to build local hash map. If this value is not smaller than <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code> and all the partition sizes are not larger than this config, join selection prefers to use shuffled hash join instead of sort merge join regardless of the value of <code>spark.sql.join.preferSortMergeJoin</code>.
   </td>
   <td>3.2.0</td>
 </tr>
@@ -339,7 +338,7 @@ Data skew can severely downgrade the performance of join queries. This feature d
   <td><code>spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes</code></td>
   <td>256MB</td>
   <td>
-    A partition is considered as skewed if its size in bytes is larger than this threshold and also larger than <code>spark.sql.adaptive.skewJoin.skewedPartitionFactor</code> multiplying the median partition size. Ideally this config should be set larger than <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>.
+    A partition is considered as skewed if its size in bytes is larger than this threshold and also larger than <code>spark.sql.adaptive.skewJoin.skewedPartitionFactor</code> multiplying the median partition size. Ideally, this config should be set larger than <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>.
   </td>
   <td>3.0.0</td>
 </tr>

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
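Editor's note: for readers who want to try the options touched by the documentation edited above, here is a minimal Scala sketch (not part of the commit) showing how the broadcast-join thresholds and a repartition hint from the edited sections are typically exercised. The application name, temp view name `t`, and output path are arbitrary placeholders, and the threshold values are just the documented defaults.

```scala
// Illustrative only; assumes Spark 3.x on the classpath and a local master.
import org.apache.spark.sql.SparkSession

object SqlPerformanceTuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-performance-tuning-sketch") // placeholder name
      .master("local[*]")
      .getOrCreate()

    // Broadcast-join threshold described in the docs; setting -1 disables broadcasting.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760") // 10 MB default

    // Adaptive-execution counterpart; per the docs it is used only by the adaptive framework.
    spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "-1")

    // "Coalesce Hints for SQL Queries": control the number of output files with a hint.
    spark.range(0, 1000).createOrReplaceTempView("t")
    val hinted = spark.sql("SELECT /*+ REPARTITION(3) */ id FROM t")
    hinted.write.mode("overwrite").parquet("/tmp/repartition-hint-example") // placeholder path

    spark.stop()
  }
}
```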