HeartSaVioR commented on issue #25577: [WIP][CORE][SPARK-28867] InMemoryStore checkpoint to speed up replay log file in HistoryServer URL: https://github.com/apache/spark/pull/25577#issuecomment-525578717 Thanks for pointing out about "recover live entities". I've commented in your comment in the doc, but elaborate again to get some guides about what we should do from experts for event logging part. I thought `live*` fields are working as "cache" (code comment made me a bit confused). If it's designed as a "cache" instead of playing as truth of source, it should also load from KVStore if the entity is not in cache, and fill the cache if found. Looks like it's not: AppStatusListener doesn't seek KVStore if it can't find the entity from live* fields. Some part of SQLAppStatusListener deals with that, though it doesn't deal with cache behavior entirely. https://github.com/apache/spark/blob/8848af263570a82a985037a95468efff050c61b0/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L79-L101 It seems that AppStatusListener always assumes KVStore is empty and we're replaying all events. As we're breaking the assumption, at least as of now I'd say it's a bug and we should fix the bug, regardless of how to snapshot - once we load KVStore and provide to AppStatusListener we should fix. "initLiveEntities" in your PR seems to deal with this, so hopefully you can create smaller PR to fix the bug first. (You could test the functionality without snapshotting, as once you know the type of entities you could retrieve entities from KVStore.) One thing I would like to modify is, I don't think we should initialize only for certain condition. We should just initialize all the time for startup of AppStatusListener. It should be fast enough if KVStore is empty, and it must be done regardless of flags if KVStore is not empty - correctness matters, speed doesn't matter here. SQLAppStatusListener may also need to be fixed as well. Once liveEntities can be restored from KVStore, it would also work for incremental replaying, as well as incremental snapshotting (replaying some more events and snapshotting again). `live` flag makes thing a bit complicated due to "maybeUpdate" and "liveUpdate", but AppStatusListener guarantees to flush these live entities to KVStore eventually, at least when ElementTrackingStore is closed. I guess it would be pretty much helpful if ElementTrackingStore provides "flush" to force flush without closing it. @vanzin As you're expert on this area, could you go through this and provide guidance on the approach? Feels like we could deal with below steps: 1. address live entities initialization in AppStatusListener/SQLAppStatusListener with non-empty KVStore (based on the part of this PR, maybe @Ngone51 would be better to deal with this) 2. address snapshotting of KVStore 3. based on 1 and 2, address both issues concurrently, SPARK-28867 (@Ngone51), SPARK-28594 (me) The approach would be much better than the origin plan when you prefer smaller PR, as SPARK-28594 will be split to 3 sub-issues instead of 2. Let me know it makes sense to you. Thanks in advance!
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org