Github user HeartSaVioR commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22238#discussion_r213129120

    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -2812,6 +2812,12 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f
     # Additional Information

    +**Gotchas**
    +
    +- For structured streaming, modifying "spark.sql.shuffle.partitions" is restricted once you run the query.
    +  - This is because state is partitioned via key, hence number of partitions for state should be unchanged.
    +  - If you want to run less tasks for stateful operations, `coalesce` would help with avoiding unnecessary repartitioning. Please note that it will also affect downstream operators.
    --- End diff --

    It just means that the number of partitions in the stateful operations' output will be the same as the parameter passed to `coalesce`, and that number of partitions will be kept unless another shuffle happens. By default it is implicitly the same as `spark.sql.shuffle.partitions`, whose default value is 200.

    I'll add the code, but I'm not sure we need it per language in Scala / Java / Python tabs, since the code will be the same in each.
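To illustrate why the diff says the number of state partitions should stay unchanged, here is a minimal toy sketch in plain Python. It is not Spark code: `partition_for` and `moved_keys` are hypothetical helpers, and the "hash modulo partition count" rule is only an assumed stand-in for how a key is mapped to a state partition.

```python
def partition_for(key_hash: int, num_partitions: int) -> int:
    """Toy stand-in for hash partitioning: key hash modulo the partition count."""
    return key_hash % num_partitions


def moved_keys(key_hashes, old_partitions: int, new_partitions: int):
    """Return the key hashes whose partition differs between the two settings."""
    return [
        h for h in key_hashes
        if partition_for(h, old_partitions) != partition_for(h, new_partitions)
    ]


# Hypothetical key hashes 0..999; 200 mirrors the default mentioned above.
key_hashes = range(1000)
moved = moved_keys(key_hashes, old_partitions=200, new_partitions=100)

print(f"{len(moved)} of {len(key_hashes)} keys map to a different partition")
```

In this toy model, any key that lands in a different partition after the change would look for its state where none was stored, which is the situation the restriction on `spark.sql.shuffle.partitions` is meant to avoid.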