Github user HeartSaVioR commented on the issue:

    https://github.com/apache/spark/pull/21700
  
    I would like to add numbers to pursuade how much this patch is helpful for 
end users of Apache Spark.
    
    I crafted and published a project which implements some stateful use cases 
with IoT Trucking example.
    
    https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming
    
    With running apps with I can see that cache (loadedMaps) in 
HDFSBackedStateStoreProvider consumes much more memory than one version of 
state. It's not like 10~30% but more than 1500% and even more than 8000% in 
specific case based on the update ratio of state.
    (Capturing overall map size of provider requires applying the patch #21469 )
    
    Below table is the result of the query, publishing query status to Kafka 
topic and query these data via Spark SQL.
    https://gist.github.com/HeartSaVioR/9d53b39052d4779a4c77e71ff7e989a3
    
    > Before applying the patch (`spark.sql.streaming.minBatchesToRetain` set 
to default value 100)
    
    * stream-stream join (IotTruckingAppJoinedAbnormalEvents.scala)
    
    batchId | numRowsTotal | numRowsUpdated | memoryUsedBytes | 
providerLoadedMapSize | stateExcessLoadingOverheadPercentage
    -- | -- | -- | -- | -- | --
    319 | 765456 | 2632 | 185499903 | 3307747279 | 17.8315310439811928
    
    * window aggregation (IotTruckingAppMovingAggregationsOnSpeed.scala)
    
    batchId | numRowsTotal | numRowsUpdated | memoryUsedBytes | 
providerLoadedMapSize | stateExcessLoadingOverheadPercentage
    -- | -- | -- | -- | -- | --
    142 | 184 | 138 | 72103 | 6214927 | 86.1951236425668835
    
    * deduplication (IotTruckingAppDistinctPairDriverAndTruck.scala)
    
    batchId | numRowsTotal | numRowsUpdated | memoryUsedBytes | 
providerLoadedMapSize | stateExcessLoadingOverheadPercentage
    -- | -- | -- | -- | -- | --
    634 | 598 | 0 | 136279 | 6587839 | 48.3408228707284321
    
    > After applying this patch 
(`spark.sql.streaming.maxBatchesToRetainInMemory` set to default value 2)
    
    * stream-stream join (IotTruckingAppJoinedAbnormalEvents.scala)
    
    batchId | numRowsTotal | numRowsUpdated | memoryUsedBytes | 
providerLoadedMapSize | stateExcessLoadingOverheadPercentage
    -- | -- | -- | -- | -- | --
    127 | 298452 | 4170 | 71023679 | 84454399 | 1.1891020035726395
    
    * window aggregation (IotTruckingAppMovingAggregationsOnSpeed.scala)
    
    batchId | numRowsTotal | numRowsUpdated | memoryUsedBytes | 
providerLoadedMapSize | stateExcessLoadingOverheadPercentage
    -- | -- | -- | -- | -- | --
    132 | 184 | 138 | 72319 | 162647 | 2.2490216955433565
    
    * deduplication (IotTruckingAppDistinctPairDriverAndTruck.scala)
    
    batchId | numRowsTotal | numRowsUpdated | memoryUsedBytes | 
providerLoadedMapSize | stateExcessLoadingOverheadPercentage
    -- | -- | -- | -- | -- | --
    133 | 598 | 0 | 136079 | 227863 | 1.6744905532815497


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to