HeartSaVioR opened a new pull request #34502: URL: https://github.com/apache/spark/pull/34502
### What changes were proposed in this pull request?

This PR proposes to optimize the write path on RocksDB by removing an unnecessary lookup. Removing the lookup unfortunately also makes it infeasible to track the number of rows, so this PR also introduces a new configuration for the RocksDB state store provider that lets end users turn tracking on and off based on their needs. The new configuration is the following:

* config name: `spark.sql.streaming.stateStore.rocksdb.trackTotalNumberOfRows`
* default value: `true` (since we already serve the number and we want to avoid a breaking change)

When the config is turned off, we report "0" for the number-of-keys metric in the state store. A negative value would be the ideal sentinel, but SQL metrics currently do not allow negative values. We also handle the case where the config is flipped during a restart.

This lets end users enjoy the benefit without losing the chance to know the number of state rows: they can turn the flag off to maximize performance, and turn it back on (restart required) when they want to see the actual number of keys (for observability, debugging, etc.).

### Why are the changes needed?

This removes an unnecessary lookup in the write path that is needed only to track the number of rows. While the metric is part of the basic metrics for stateful operators, we can sacrifice some observability to gain performance under heavy write load.

### Does this PR introduce _any_ user-facing change?

Yes, a new configuration is added. This is neither a backward-incompatible change nor a behavior change, since the default value of the flag retains the behavior as it is. However, there is a glitch when rolling back to a previous Spark version: if you run a query with the config turned off (so that the number of keys is lost) and restart the query on an older Spark version, the older version will still try to track the number, and the number will get messed up.
You may want to turn the config on and run some micro-batches before going back to the previous Spark version.

### How was this patch tested?

New UT & benchmark.
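The trade-off described above can be illustrated with a minimal sketch. This is a hypothetical toy store in Python, not Spark's actual RocksDB state store code: maintaining an exact key count forces a read before every write and delete, while disabling tracking skips those lookups at the cost of reporting 0 for the metric.

```python
class TinyStateStore:
    """Toy key-value store illustrating the track-total-rows trade-off.

    Hypothetical sketch; names and structure are illustrative only.
    """

    def __init__(self, track_total_rows=True):
        self._data = {}
        self.track_total_rows = track_total_rows
        self.num_keys = 0   # maintained only when tracking is on
        self.lookups = 0    # counts the extra reads caused by tracking

    def put(self, key, value):
        if self.track_total_rows:
            # Extra lookup: we must know whether the key already exists
            # to keep num_keys accurate.
            self.lookups += 1
            if key not in self._data:
                self.num_keys += 1
        self._data[key] = value

    def remove(self, key):
        if self.track_total_rows:
            self.lookups += 1
            if key in self._data:
                self.num_keys -= 1
        self._data.pop(key, None)

    def metric_num_keys(self):
        # When tracking is off we report 0, mirroring the PR's choice
        # (SQL metrics cannot be negative, so no sentinel like -1).
        return self.num_keys if self.track_total_rows else 0
```

With tracking on, every `put`/`remove` pays one extra lookup; with tracking off, writes touch the store only once and the metric degrades to a constant 0.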