[GitHub] spark pull request #21733: [SPARK-24763][SS] Remove redundant key data from ...

HeartSaVioR Tue, 10 Jul 2018 13:29:43 -0700

Github user HeartSaVioR commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21733#discussion_r201483544
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -825,6 +825,16 @@ object SQLConf {
         .intConf
         .createWithDefault(100)
     
    +  val ADVANCED_REMOVE_REDUNDANT_IN_STATEFUL_AGGREGATION =
    --- End diff --
    
    This is not compatible with current stateful aggregation (definitely, 
that's the improvement of this patch) and there is no undo. So once end users 
enable the option in the query, the option must be enabled unless end users 
clear out checkpoint. (I've added the new option to OffsetSeqMetadata to 
remember the first setting like partition count).
    
    I'm seeing performance on far or even slightly better on specific workload 
(publicized in description link), but I would say I cannot try out exhaustive 
workloads. I actually expected a tradeoff between performance vs state memory 
usage, so assuming if other workloads follow the tradeoff, end users may need 
to try out this option in their query with non-production environment (for 
example, staged) to ensure enabling option doesn't break their expectation of 
performance.
    
    That's why I also make changes available as an option instead of modifying 
default behavior. If we apply this to the default behavior, we need to provide 
state migration.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21733: [SPARK-24763][SS] Remove redundant key data from ...

Reply via email to