HeartSaVioR opened a new pull request #34502:
URL: https://github.com/apache/spark/pull/34502


   ### What changes were proposed in this pull request?
   
   This PR proposes to optimize write path on RocksDB via removing unnecessary 
lookup. Removing unnecessary lookup unfortunately also disables the feasibility 
to track the number of rows, so this PR also introduces a new configuration for 
RocksDB state store provider to let end users turn it on and off based on their 
needs.
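   The effect of the removed lookup can be sketched with a toy store (a plain Python dict standing in for RocksDB; all names here are illustrative, not the PR's actual code). With tracking enabled, every `put`/`remove` must first check whether the key already exists, purely to keep the counter correct; with tracking disabled, the write goes straight through:

```python
class ToyStore:
    """Toy stand-in for the RocksDB state store (illustration only)."""

    def __init__(self, track_total_rows: bool):
        self.track_total_rows = track_total_rows
        self.data = {}
        self.num_rows = 0

    def put(self, key, value):
        if self.track_total_rows:
            # Extra read on the write path, needed only to maintain the counter.
            if key not in self.data:
                self.num_rows += 1
        self.data[key] = value

    def remove(self, key):
        if self.track_total_rows:
            if key in self.data:
                self.num_rows -= 1
        self.data.pop(key, None)
```

   On a real RocksDB instance the membership check is a full `get`, which is what this PR makes optional.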
   
   The new configuration is as follows:
   
   * config name: 
`spark.sql.streaming.stateStore.rocksdb.trackTotalNumberOfRows`
   * default value: true (since we already expose the number and want to avoid 
a breaking change)
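   As an illustration, the knob is set like any other SQL conf, e.g. via `--conf` on `spark-submit` or `SparkSession`'s `.config(...)`. The provider class name below is the RocksDB provider shipped in Spark 3.2; double-check it against your version:

```python
# Illustrative conf map; the values are plain strings as with any SQL conf.
conf = {
    "spark.sql.streaming.stateStore.providerClass":
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    # The new flag added by this PR; set to "false" to skip row-count tracking.
    "spark.sql.streaming.stateStore.rocksdb.trackTotalNumberOfRows": "false",
}
```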
   
   When the config is turned off, we will report "0" for the number-of-keys 
state store metric. A negative sentinel would be the ideal value, but SQL 
metrics currently don't allow negative values.
   
   We also handle the case where the config is flipped across a restart. This 
lets end users enjoy the performance benefit without permanently losing the 
ability to see the number of state rows: they can turn the flag off to maximize 
performance, and turn it back on (restart required) whenever they want to see 
the actual number of keys (for observability/debugging/etc.).
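   One plausible shape for the flip handling (an assumption for illustration, not necessarily the PR's exact mechanism) is to rebuild the count with a one-time scan when tracking is re-enabled at restart:

```python
# Hypothetical sketch: `data` stands in for the restored RocksDB contents.
def reported_num_keys(data: dict, track_now: bool, tracked_before: bool) -> int:
    if not track_now:
        return 0  # metric reports 0 while tracking is off
    if tracked_before:
        # Counter was maintained continuously; len() stands in for it here.
        return len(data)
    # Tracking was just turned back on: recount once via a full scan.
    return sum(1 for _key in data)
```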
   
   ### Why are the changes needed?
   
   This removes an unnecessary lookup in the write path, a lookup that is only 
needed to track the number of rows. While the metric is part of the basic 
metrics for stateful operators, we can sacrifice some observability to gain 
performance under heavy write load.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, a new configuration is added. This is neither a backward-incompatible 
change nor a behavior change, since the default value of the flag retains the 
existing behavior.
   
   But there's a glitch when rolling back to a previous Spark version: if you 
run the query with the config turned off (so the number of keys is lost) and 
then restart the query on an older Spark version, the older version will still 
try to track the number, and the number will be incorrect. You may want to turn 
the config on and run a few micro-batches before going back to the previous 
Spark version.
   
   ### How was this patch tested?
   
   New UT & benchmark.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


