HeartSaVioR edited a comment on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-637926400
Personally, I'd rather do the check in `StateStore`, accepting the additional overhead of reading "a" row up front, to achieve the same thing in all stateful operations.

```scala
/** Get or create a store associated with the id. */
def get(
    storeProviderId: StateStoreProviderId,
    keySchema: StructType,
    valueSchema: StructType,
    indexOrdinal: Option[Int],
    version: Long,
    storeConf: StateStoreConf,
    hadoopConf: Configuration): StateStore = {
  require(version >= 0)
  val storeProvider = loadedProviders.synchronized {
    startMaintenanceIfNeeded()
    val provider = loadedProviders.getOrElseUpdate(
      storeProviderId,
      StateStoreProvider.createAndInit(
        storeProviderId.storeId, keySchema, valueSchema, indexOrdinal, storeConf, hadoopConf)
    )
    reportActiveStoreInstance(storeProviderId)
    provider
  }
  val store = storeProvider.getStore(version)
  val iter = store.iterator()
  if (iter.nonEmpty) {
    val rowPair = iter.next()
    val key = rowPair.key
    val value = rowPair.value
    // TODO: validate key with key schema
    // TODO: validate value with value schema
  }
  store
}
```

For streaming aggregations it initializes "two" state stores, so the overhead goes up to "two" rows, but I don't think the overhead matters much.

If we are really concerned about the overhead of creating an additional iterator, or want to do the validation in an early phase (where the state store might never actually be accessed), we can just have a `StateStore` wrapper wrapping `store` that does the same thing, validating only once on the first "get".

Either way, we never need to restrict the functionality to streaming aggregation.
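The "validate once on first get" wrapper could be sketched roughly like this. This is a minimal, self-contained illustration, not Spark's actual `StateStore` trait: `SimpleStateStore`, `ValidateOnFirstGetStore`, and the string-typed rows are all hypothetical stand-ins, and the lazy one-time check is just one way to defer the validation until the store is actually read.

```scala
// Hypothetical stand-in for Spark's internal StateStore trait, reduced to
// the two operations relevant to the sketch.
trait SimpleStateStore {
  def get(key: String): Option[String]
  def iterator(): Iterator[(String, String)]
}

// Wrapper that defers validation of one existing row until the store is
// first accessed, so the cost is paid at most once (and never if the
// store is never read).
class ValidateOnFirstGetStore(
    underlying: SimpleStateStore,
    validateRow: ((String, String)) => Unit) extends SimpleStateStore {

  // Lazily validate the first existing row exactly once.
  private lazy val validated: Boolean = {
    val iter = underlying.iterator()
    if (iter.hasNext) validateRow(iter.next())
    true
  }

  override def get(key: String): Option[String] = {
    validated // triggers the one-time check on first access
    underlying.get(key)
  }

  override def iterator(): Iterator[(String, String)] = {
    validated
    underlying.iterator()
  }
}
```

Because `validated` is a `lazy val`, repeated calls to `get` or `iterator` run the validation callback only on the first access; a store that is created but never read skips it entirely.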