Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226250306
  
    --- Diff: docs/sql-performance-turing.md ---
    @@ -0,0 +1,151 @@
    +---
    +layout: global
    +title: Performance Tuning
    +displayTitle: Performance Tuning
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +For some workloads, it is possible to improve performance by either caching data in memory, or by
    +turning on some experimental options.
    +
    +## Caching Data In Memory
    +
    +Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`.
    +Then Spark SQL will scan only required columns and will automatically tune compression to minimize
    +memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableName")` to remove the table from memory.
    +
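As an illustrative sketch (not part of this diff): the caching calls described in the paragraph above, assuming a local `SparkSession` and a hypothetical temporary view named `people`:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session and data, just to exercise the API described above.
val spark = SparkSession.builder().appName("cache-sketch").master("local[*]").getOrCreate()
val df = spark.range(0, 100000).selectExpr("id", "id % 10 AS bucket")
df.createOrReplaceTempView("people")

// Cache the table in the in-memory columnar format.
spark.catalog.cacheTable("people")
spark.table("people").count()   // the first action materializes the cache

// The DataFrame API equivalent.
df.cache()

// Drop the cached data when it is no longer needed.
spark.catalog.uncacheTable("people")
```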
    +Configuration of in-memory caching can be done using the `setConf` method on `SparkSession` or by running
    +`SET key=value` commands using SQL.
    +
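For illustration (again not from the diff), both routes side by side, reusing the `spark` session and `people` view from the previous sketch; `spark.conf.set` is used here as the programmatic path, and the batch size value is an arbitrary example:

```scala
// Programmatic route: runtime configuration on the session.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "20000")  // example value, not a recommendation

// SQL route: the same property via a SET command.
spark.sql("SET spark.sql.inMemoryColumnarStorage.batchSize=20000")

// Tables cached after this point use the updated settings.
spark.catalog.cacheTable("people")
```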
    +<table class="table">
    +<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
    +<tr>
    +  <td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td>
    +  <td>true</td>
    +  <td>
    +    When set to true Spark SQL will automatically select a compression codec for each column based
    +    on statistics of the data.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.sql.inMemoryColumnarStorage.batchSize</code></td>
    +  <td>10000</td>
    +  <td>
    +    Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization
    +    and compression, but risk OOMs when caching data.
    +  </td>
    +</tr>
    +
    +</table>
    +
    +## Other Configuration Options
    +
    +The following options can also be used to tune the performance of query execution. It is possible
    +that these options will be deprecated in a future release as more optimizations are performed automatically.
    +
    +<table class="table">
    +  <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
    +  <tr>
    +    <td><code>spark.sql.files.maxPartitionBytes</code></td>
    +    <td>134217728 (128 MB)</td>
    +    <td>
    +      The maximum number of bytes to pack into a single partition when reading files.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>spark.sql.files.openCostInBytes</code></td>
    +    <td>4194304 (4 MB)</td>
    +    <td>
    +      The estimated cost to open a file, measured by the number of bytes could be scanned in the same
    +      time. This is used when putting multiple files into a partition. It is better to over estimated,
    --- End diff --
    
    nit: `It is better to over estimated` -> `It is better to over-estimate`?
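Aside from the nit: for anyone experimenting with the file-reading options in the (truncated) table above, a small hedged sketch; the path and values are invented for illustration:

```scala
// Example values only; tune against your own data and file sizes.
spark.conf.set("spark.sql.files.maxPartitionBytes", (64 * 1024 * 1024).toString)  // pack at most ~64 MB per partition
spark.conf.set("spark.sql.files.openCostInBytes", (8 * 1024 * 1024).toString)     // treat opening a file like scanning 8 MB

// Hypothetical input path; a smaller maxPartitionBytes generally produces more, smaller partitions.
val files = spark.read.parquet("/tmp/example/parquet")
println(files.rdd.getNumPartitions)
```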


---
