HeartSaVioR edited a comment on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-637926400


   Personally, I'd rather do the check in StateStore, at the additional cost 
of reading one row up front, so that all stateful operations get the same 
validation. 
   
   ```
     /** Get or create a store associated with the id. */
     def get(
         storeProviderId: StateStoreProviderId,
         keySchema: StructType,
         valueSchema: StructType,
         indexOrdinal: Option[Int],
         version: Long,
         storeConf: StateStoreConf,
         hadoopConf: Configuration): StateStore = {
       require(version >= 0)
       val storeProvider = loadedProviders.synchronized {
         startMaintenanceIfNeeded()
         val provider = loadedProviders.getOrElseUpdate(
           storeProviderId,
           StateStoreProvider.createAndInit(
             storeProviderId.storeId, keySchema, valueSchema, indexOrdinal, storeConf, hadoopConf)
         )
         reportActiveStoreInstance(storeProviderId)
         provider
       }
       val store = storeProvider.getStore(version)
       val iter = store.iterator()
       if (iter.nonEmpty) {
         val rowPair = iter.next()
         val key = rowPair.key
         val value = rowPair.value
         // TODO: validate key with key schema
         // TODO: validate value with value schema
       }
       store
     }
   ```
   
   Streaming aggregation initializes two state stores, so the overhead becomes 
two rows, but I don't think that matters much.
   
   If we are really concerned about the overhead of creating an additional 
iterator, or about doing the validation in an early phase (where the state 
store might end up never being accessed), we can just have a StateStore wrapper 
wrapping `store` and do the same there: validate only once, on the first "get". 
Either way, we never need to restrict the functionality to streaming aggregation.
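
   A rough sketch of the wrapper idea, just to illustrate the shape. This is 
not Spark code: `SimpleStore`, `InMemoryStore` and `ValidatingStore` are 
hypothetical stand-ins with String keys and values, and the `validate` callback 
is a placeholder for the real key/value schema checks.

   ```scala
   // Hypothetical, simplified stand-in for a state store (not Spark's StateStore API).
   trait SimpleStore {
     def get(key: String): Option[String]
     def put(key: String, value: String): Unit
     def iterator(): Iterator[(String, String)]
   }

   // Trivial in-memory implementation, only here so the sketch is runnable.
   final class InMemoryStore extends SimpleStore {
     private val data = scala.collection.mutable.LinkedHashMap.empty[String, String]
     override def get(key: String): Option[String] = data.get(key)
     override def put(key: String, value: String): Unit = data.update(key, value)
     override def iterator(): Iterator[(String, String)] = data.iterator
   }

   /** Wraps a store and validates one existing row, once, on the first access. */
   final class ValidatingStore(
       underlying: SimpleStore,
       validate: (String, String) => Unit) extends SimpleStore {

     private var validated = false

     // Reads at most one row and hands it to `validate`. If `validate` throws,
     // the flag stays false, so the next access fails the same way.
     private def ensureValidated(): Unit = synchronized {
       if (!validated) {
         val iter = underlying.iterator()
         if (iter.hasNext) {
           val (key, value) = iter.next()
           validate(key, value)
         }
         validated = true
       }
     }

     override def get(key: String): Option[String] = {
       ensureValidated()
       underlying.get(key)
     }

     override def put(key: String, value: String): Unit = {
       ensureValidated()
       underlying.put(key, value)
     }

     override def iterator(): Iterator[(String, String)] = {
       ensureValidated()
       underlying.iterator()
     }
   }
   ```

   With this shape, a store that is never accessed never pays for the extra 
iterator, and a store that is accessed pays for it exactly once.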




