Hi Yun,

How would the underlying storage explain the fact that I can restore from the savepoint without rescaling? Does Flink write each file once or many times? If many times, the 50,000-blocks-per-blob limit could potentially become a problem, am I right? Should I try block blob with compaction, as described in [1], or without compaction?
Thanks,
Alexey

________________________________
From: Yun Tang <myas...@live.com>
Sent: Wednesday, March 17, 2021 9:31 PM
To: Alexey Trenikhun <yen...@msn.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

I am not familiar with Azure Blob Storage, and I cannot load the "_metadata" file you provided locally. At this point I strongly suspect that this strange rescaling behavior is related to your underlying storage. Could you try using block blobs instead of page blobs [1] to see whether the behavior persists?

[1] https://hadoop.apache.org/docs/current/hadoop-azure/index.html#Block_Blob_with_Compaction_Support_and_Configuration

Best
Yun Tang

________________________________
From: Alexey Trenikhun <yen...@msn.com>
Sent: Thursday, March 18, 2021 12:00
To: Yun Tang <myas...@live.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,

The Azure web UI shows the size of every file created by Flink as 128 MiB * X (128, 256, 640); see the attached screenshot. As I understand it, this is because Flink creates them as page blobs. In the same storage account, another application creates files as block blobs, and their sizes are not rounded to 128 MiB.

Thanks,
Alexey

________________________________
From: Yun Tang <myas...@live.com>
Sent: Wednesday, March 17, 2021 8:38 PM
To: Alexey Trenikhun <yen...@msn.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

I tried to load your _metadata as a checkpoint via Checkpoints#loadCheckpointMetadata [1], but found that this file is not actually savepoint metadata. Did you upload the correct files?
Moreover, I noticed that both 77e77928-cb26-4543-bd41-e785fcac49f0 and _metadata are 128 MB in size, which is much larger than they should be. Is this expected on Azure Blob Storage, or did you upload the wrong files?

[1] https://github.com/apache/flink/blob/956c0716fdbf20bf53305fe4d023fa2bea412595/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L99

Best
Yun Tang

________________________________
From: Alexey Trenikhun <yen...@msn.com>
Sent: Thursday, March 18, 2021 0:45
To: Yun Tang <myas...@live.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,

I've copied 77e77928-cb26-4543-bd41-e785fcac49f0 and _metadata to Google Drive:
https://drive.google.com/drive/folders/1J3nwvQupLBT5ZaN_qEmc2y_-MgFz0cLb?usp=sharing

Compression was never enabled. (The docs say that RocksDB's incremental checkpoints always use snappy compression; I'm not sure whether that has any effect on savepoints.)

Thanks,
Alexey

________________________________
From: Yun Tang <myas...@live.com>
Sent: Wednesday, March 17, 2021 12:33 AM
To: Alexey Trenikhun <yen...@msn.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

Thanks for your quick response. I have checked the two different logs and still cannot understand why this could happen. Take "wasbs://gsp-st...@gspstatewestus2dev.blob.core.windows.net/gsp/savepoints/savepoint-000000-67de6690143a/77e77928-cb26-4543-bd41-e785fcac49f0" for example: the key group range offsets were intersected correctly during rescale for task "Intake voice calls (6/7)". The only thing I doubt is whether Azure Blob Storage worked as expected when seeking to the offset [1]. Have you ever enabled snappy compression [2] [3] for savepoints?
Could you also share the file "wasbs://gsp-st...@gspstatewestus2dev.blob.core.windows.net/gsp/savepoints/savepoint-000000-67de6690143a/77e77928-cb26-4543-bd41-e785fcac49f0" so that I can seek in it locally and see whether it works as expected? Moreover, could you also share the savepoint metadata "wasbs://gsp-st...@gspstatewestus2dev.blob.core.windows.net/gsp/savepoints/savepoint-000000-67de6690143a/_metadata"?

[1] https://github.com/apache/flink/blob/dc404e2538fdfbc98b9c565951f30f922bf7cedd/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/restore/RocksDBFullRestoreOperation.java#L211
[2] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#compression
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#execution-checkpointing-snapshot-compression

Best
Yun Tang

________________________________
From: Alexey Trenikhun <yen...@msn.com>
Sent: Wednesday, March 17, 2021 14:25
To: Yun Tang <myas...@live.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Attached.

________________________________
From: Yun Tang <myas...@live.com>
Sent: Tuesday, March 16, 2021 11:13 PM
To: Alexey Trenikhun <yen...@msn.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

Thanks for your reply. Could you also share the logs from a normal restore, as I asked in my previous message, so that I can compare?
Best
Yun Tang

________________________________
From: Alexey Trenikhun <yen...@msn.com>
Sent: Wednesday, March 17, 2021 13:55
To: Yun Tang <myas...@live.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,

I'm attaching a shorter version of the log; it looks like the full version didn't come through.

Thanks,
Alexey

________________________________
From: Yun Tang <myas...@live.com>
Sent: Tuesday, March 16, 2021 8:05 PM
To: Alexey Trenikhun <yen...@msn.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

Judging by the line numbers of the method calls, I believe your exception messages were printed by Flink 1.12.2, not Flink 1.12.1. Could you share the exception message from Flink 1.12.1 when rescaling? Moreover, I hope you can share more logs from restoring and rescaling. I want to see the details of the key group handle [1].

[1] https://github.com/apache/flink/blob/dc404e2538fdfbc98b9c565951f30f922bf7cedd/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/restore/RocksDBFullRestoreOperation.java#L153

Best

________________________________
From: Alexey Trenikhun <yen...@msn.com>
Sent: Tuesday, March 16, 2021 15:10
To: Yun Tang <myas...@live.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Also, restoring from the same savepoint without a change in parallelism works fine.
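
For readers following the thread: the difference between a same-parallelism restore and a rescaled one comes down to which key groups each subtask has to read from each savepoint file. The sketch below is only an illustration, not Flink code. It assumes the range formula used by Flink 1.12's KeyGroupRangeAssignment, and the max parallelism (128) and old parallelism (4) are hypothetical values, since the thread does not state them. Only the new parallelism of 7 comes from the task name "Intake voice calls (6/7)".

```python
from typing import Optional, Tuple

KeyGroupRange = Tuple[int, int]  # inclusive (start, end)


def key_group_range(max_parallelism: int, parallelism: int,
                    subtask_index: int) -> KeyGroupRange:
    """Key groups owned by one subtask.

    Assumed to mirror KeyGroupRangeAssignment.computeKeyGroupRangeForOperatorIndex
    in Flink 1.12; check the Flink source before relying on it.
    """
    start = (subtask_index * max_parallelism + parallelism - 1) // parallelism
    end = ((subtask_index + 1) * max_parallelism - 1) // parallelism
    return start, end


def intersect(a: KeyGroupRange, b: KeyGroupRange) -> Optional[KeyGroupRange]:
    """Overlap of two inclusive ranges, or None if they do not overlap."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


MAX_PARALLELISM = 128  # hypothetical: not stated in the thread
OLD_PARALLELISM = 4    # hypothetical: not stated in the thread
NEW_PARALLELISM = 7    # from the task name "(6/7)"

# Subtask "6/7" is zero-based index 5.
new_range = key_group_range(MAX_PARALLELISM, NEW_PARALLELISM, 5)

for old_index in range(OLD_PARALLELISM):
    old_range = key_group_range(MAX_PARALLELISM, OLD_PARALLELISM, old_index)
    overlap = intersect(new_range, old_range)
    if overlap is not None:
        # Key groups at the front of the old file that the new subtask skips;
        # a non-zero count means its first read starts mid-file.
        skipped = overlap[0] - old_range[0]
        print(f"old subtask {old_index} {old_range}: read {overlap}, "
              f"skip {skipped} leading key groups")
```

With these assumed values, new subtask 6/7 owns key groups 92 to 109, which overlap the tail of one old subtask's range (64 to 95) and the head of the next (96 to 127). Its first read from the former therefore has to seek past 28 key groups, whereas a same-parallelism restore starts each file at its first key group.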
________________________________
From: Alexey Trenikhun <yen...@msn.com>
Sent: Monday, March 15, 2021 9:51 PM
To: Yun Tang <myas...@live.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

No, I believe original exception was from 1.12.1 to 1.12.1

Thanks,
Alexey

________________________________
From: Yun Tang <myas...@live.com>
Sent: Monday, March 15, 2021 8:07:07 PM
To: Alexey Trenikhun <yen...@msn.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi,

Can you scale the job at the same version from 1.12.1 to 1.12.1?

Best
Yun Tang

________________________________
From: Alexey Trenikhun <yen...@msn.com>
Sent: Tuesday, March 16, 2021 4:46
To: Tzu-Li (Gordon) Tai <tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Savepoint was taken with 1.12.1, I've tried to scale up using same version and 1.12.2

________________________________
From: Tzu-Li (Gordon) Tai <tzuli...@apache.org>
Sent: Monday, March 15, 2021 12:06 AM
To: user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi,

Could you provide info on the Flink version used?

Cheers,
Gordon

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/