Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/spark/pull/21733#discussion_r201483544 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -825,6 +825,16 @@ object SQLConf { .intConf .createWithDefault(100) + val ADVANCED_REMOVE_REDUNDANT_IN_STATEFUL_AGGREGATION = --- End diff -- This is not compatible with current stateful aggregation (definitely, that's the improvement of this patch) and there is no undo. So once end users enable the option in the query, the option must be enabled unless end users clear out checkpoint. (I've added the new option to OffsetSeqMetadata to remember the first setting like partition count). I'm seeing performance on far or even slightly better on specific workload (publicized in description link), but I would say I cannot try out exhaustive workloads. I actually expected a tradeoff between performance vs state memory usage, so assuming if other workloads follow the tradeoff, end users may need to try out this option in their query with non-production environment (for example, staged) to ensure enabling option doesn't break their expectation of performance. That's why I also make changes available as an option instead of modifying default behavior. If we apply this to the default behavior, we need to provide state migration.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org