Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226250306

    --- Diff: docs/sql-performance-turing.md ---
    @@ -0,0 +1,151 @@
    +---
    +layout: global
    +title: Performance Tuning
    +displayTitle: Performance Tuning
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +For some workloads, it is possible to improve performance by either caching data in memory, or by
    +turning on some experimental options.
    +
    +## Caching Data In Memory
    +
    +Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`.
    +Then Spark SQL will scan only required columns and will automatically tune compression to minimize
    +memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableName")` to remove the table from memory.
    +
    +Configuration of in-memory caching can be done using the `setConf` method on `SparkSession` or by running
    +`SET key=value` commands using SQL.
    +
    +<table class="table">
    +<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
    +<tr>
    +  <td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td>
    +  <td>true</td>
    +  <td>
    +    When set to true Spark SQL will automatically select a compression codec for each column based
    +    on statistics of the data.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.sql.inMemoryColumnarStorage.batchSize</code></td>
    +  <td>10000</td>
    +  <td>
    +    Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization
    +    and compression, but risk OOMs when caching data.
    +  </td>
    +</tr>
    +
    +</table>
    +
    +## Other Configuration Options
    +
    +The following options can also be used to tune the performance of query execution. It is possible
    +that these options will be deprecated in future release as more optimizations are performed automatically.
    +
    +<table class="table">
    +  <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
    +  <tr>
    +    <td><code>spark.sql.files.maxPartitionBytes</code></td>
    +    <td>134217728 (128 MB)</td>
    +    <td>
    +      The maximum number of bytes to pack into a single partition when reading files.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>spark.sql.files.openCostInBytes</code></td>
    +    <td>4194304 (4 MB)</td>
    +    <td>
    +      The estimated cost to open a file, measured by the number of bytes could be scanned in the same
    +      time. This is used when putting multiple files into a partition. It is better to over estimated,
    --- End diff --

    nit: `It is better to over estimated` -> `It is better to over-estimate`?
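
As an aside on the documentation quoted above: the sketch below shows one way the caching and configuration options it describes might be exercised from Scala. It is a minimal, hypothetical example, not part of the PR; the temp view name `records`, the local master setting, and the specific config values are assumptions made purely so the snippet is self-contained and runnable.

```scala
import org.apache.spark.sql.SparkSession

object CacheTuningSketch {
  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders so the sketch runs standalone.
    val spark = SparkSession.builder()
      .appName("cache-tuning-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical table: register a small DataFrame as a temp view named "records".
    spark.range(0L, 1000000L).toDF("id").createOrReplaceTempView("records")

    // Cache the table in the in-memory columnar format described in the doc.
    spark.catalog.cacheTable("records")

    // Configuration via a SQL SET command ...
    spark.sql("SET spark.sql.inMemoryColumnarStorage.batchSize=20000")
    // ... or programmatically through the session's runtime conf.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)

    // Subsequent scans of "records" read from the in-memory columnar cache.
    spark.sql("SELECT COUNT(*) FROM records").show()

    // Drop the cached data once it is no longer needed.
    spark.catalog.uncacheTable("records")
    spark.stop()
  }
}
```

The `SET key=value` statement and `spark.conf.set` correspond to the two configuration routes mentioned in the quoted text (SQL commands and session-level configuration, respectively).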