This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new c4d159a368d [MINOR][DOC] revisions for spark sql performance tuning to improve readability and grammar
c4d159a368d is described below

commit c4d159a368d554a8567271dbfec8f291d1de70a5
Author: Dustin William Smith <dustin.sm...@deliveryhero.com>
AuthorDate: Sun Nov 6 18:04:10 2022 -0600

    [MINOR][DOC] revisions for spark sql performance tuning to improve readability and grammar

    ### What changes were proposed in this pull request?

    I made some small grammar fixes related to dependent clauses followed by independent clauses, starting a sentence with an introductory phrase, using the plural form when "are" is present in the sentence, and other small fixes to improve readability.

    https://spark.apache.org/docs/latest/sql-performance-tuning.html

    <img width="1065" alt="Screenshot 2022-11-04 at 15 24 17" src="https://user-images.githubusercontent.com/7563201/199998862-d9418bc1-2fcd-4eff-be8e-af412add6946.png">

    ### Why are the changes needed?

    These changes improve the readability of the Spark documentation for new users or those studying up.

    ### Does this PR introduce _any_ user-facing change?

    Yes, these changes impact the Spark documentation.

    ### How was this patch tested?

    No tests were created, as these changes were solely in markdown.

    Closes #38510 from dwsmith1983/minor-doc-revisions.

    Lead-authored-by: Dustin William Smith <dustin.sm...@deliveryhero.com>
    Co-authored-by: dustin <dwsmith1...@users.noreply.github.com>
    Co-authored-by: Dustin Smith <dustin.william.sm...@gmail.com>
    Signed-off-by: Sean Owen <sro...@gmail.com>
---
 docs/sql-performance-tuning.md | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index d736ff8f83f..6ac39d90527 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -40,7 +40,7 @@ Configuration of in-memory caching can be done using the `setConf` method on `Sp
   <td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td>
   <td>true</td>
   <td>
-    When set to true Spark SQL will automatically select a compression codec for each column based
+    When set to true, Spark SQL will automatically select a compression codec for each column based
     on statistics of the data.
   </td>
   <td>1.0.1</td>
@@ -77,8 +77,8 @@ that these options will be deprecated in future release as more optimizations ar
   <td><code>spark.sql.files.openCostInBytes</code></td>
   <td>4194304 (4 MB)</td>
   <td>
-    The estimated cost to open a file, measured by the number of bytes could be scanned in the same
-    time. This is used when putting multiple files into a partition. It is better to over-estimated,
+    The estimated cost to open a file, measured by the number of bytes that could be scanned in the same
+    time. This is used when putting multiple files into a partition. It is better to over-estimate,
     then the partitions with small files will be faster than partitions with bigger files (which is
     scheduled first). This configuration is effective only when using file-based sources such as
     Parquet, JSON and ORC.
@@ -110,7 +110,7 @@ that these options will be deprecated in future release as more optimizations ar
   <td>10485760 (10 MB)</td>
   <td>
     Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when
-    performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently
+    performing a join. By setting this value to -1, broadcasting can be disabled. Note that currently
     statistics are only supported for Hive Metastore tables where the command
     <code>ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan</code> has been run.
   </td>
@@ -140,8 +140,7 @@ that these options will be deprecated in future release as more optimizations ar
   <td>10000</td>
   <td>
     Configures the maximum listing parallelism for job input paths. In case the number of input
-    paths is larger than this value, it will be throttled down to use this value. Same as above,
-    this configuration is only effective when using file-based data sources such as Parquet, ORC
+    paths is larger than this value, it will be throttled down to use this value. This configuration is only effective when using file-based data sources such as Parquet, ORC
     and JSON.
   </td>
   <td>2.1.1</td>
@@ -215,8 +214,8 @@ For more details please refer to the documentation of [Join Hints](sql-ref-synta

 ## Coalesce Hints for SQL Queries

-Coalesce hints allows the Spark SQL users to control the number of output files just like the
-`coalesce`, `repartition` and `repartitionByRange` in Dataset API, they can be used for performance
+Coalesce hints allow Spark SQL users to control the number of output files just like
+`coalesce`, `repartition` and `repartitionByRange` in the Dataset API, they can be used for performance
 tuning and reducing the number of output files. The "COALESCE" hint only has a partition number as a
 parameter. The "REPARTITION" hint has a partition number, columns, or both/neither of them as parameters.
 The "REPARTITION_BY_RANGE" hint must have column names and a partition number is optional. The "REBALANCE"
@@ -295,7 +294,7 @@ AQE converts sort-merge join to broadcast hash join when the runtime statistics
   <td><code>spark.sql.adaptive.autoBroadcastJoinThreshold</code></td>
   <td>(none)</td>
   <td>
-    Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. The default value is same with <code>spark.sql.autoBroadcastJoinThreshold</code>. Note that, this config is used only in adaptive framework.
+    Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1, broadcasting can be disabled. The default value is the same as <code>spark.sql.autoBroadcastJoinThreshold</code>. Note that, this config is used only in adaptive framework.
   </td>
   <td>3.2.0</td>
 </tr>
@@ -309,7 +308,7 @@ AQE converts sort-merge join to shuffled hash join when all post shuffle partiti
   <td><code>spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold</code></td>
   <td>0</td>
   <td>
-    Configures the maximum size in bytes per partition that can be allowed to build local hash map. If this value is not smaller than <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code> and all the partition size are not larger than this config, join selection prefer to use shuffled hash join instead of sort merge join regardless of the value of <code>spark.sql.join.preferSortMergeJoin</code>.
+    Configures the maximum size in bytes per partition that can be allowed to build local hash map. If this value is not smaller than <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code> and all the partition sizes are not larger than this config, join selection prefers to use shuffled hash join instead of sort merge join regardless of the value of <code>spark.sql.join.preferSortMergeJoin</code>.
   </td>
   <td>3.2.0</td>
 </tr>
@@ -339,7 +338,7 @@ Data skew can severely downgrade the performance of join queries. This feature d
   <td><code>spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes</code></td>
   <td>256MB</td>
   <td>
-    A partition is considered as skewed if its size in bytes is larger than this threshold and also larger than <code>spark.sql.adaptive.skewJoin.skewedPartitionFactor</code> multiplying the median partition size. Ideally this config should be set larger than <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>.
+    A partition is considered as skewed if its size in bytes is larger than this threshold and also larger than <code>spark.sql.adaptive.skewJoin.skewedPartitionFactor</code> multiplying the median partition size. Ideally, this config should be set larger than <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>.
   </td>
   <td>3.0.0</td>
 </tr>

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
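Editor's note: for readers who want to try the options touched by the documentation edited above, here is a minimal Scala sketch (not part of the commit) showing how the broadcast-join thresholds and a repartition hint from the edited sections are typically exercised. The application name, temp view name `t`, and output path are arbitrary placeholders, and the threshold values are just the documented defaults.

```scala
// Illustrative only; assumes Spark 3.x on the classpath and a local master.
import org.apache.spark.sql.SparkSession

object SqlPerformanceTuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-performance-tuning-sketch") // placeholder name
      .master("local[*]")
      .getOrCreate()

    // Broadcast-join threshold described in the docs; setting -1 disables broadcasting.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760") // 10 MB default

    // Adaptive-execution counterpart; per the docs it is used only by the adaptive framework.
    spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "-1")

    // "Coalesce Hints for SQL Queries": control the number of output files with a hint.
    spark.range(0, 1000).createOrReplaceTempView("t")
    val hinted = spark.sql("SELECT /*+ REPARTITION(3) */ id FROM t")
    hinted.write.mode("overwrite").parquet("/tmp/repartition-hint-example") // placeholder path

    spark.stop()
  }
}
```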