Hi Alexey,

Flink writes each checkpointed file only once. Could you try writing the 
checkpointed files as block blobs and see whether the problem still exists?
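
For reference, a rough sketch of the hadoop-azure settings involved (the property 
names come from the "Block Blob with Compaction" page linked as [1] further down 
this thread; the directory values are only placeholders). WASB writes block blobs 
by default, so the state directories should only end up as page blobs if they are 
listed under fs.azure.page.blob.dir; listing them under 
fs.azure.block.blob.with.compaction.dir instead keeps block blobs well under the 
50,000-block limit you mentioned:

import org.apache.hadoop.conf.Configuration;

public class AzureBlobTypeConfig {

    // These keys normally live in core-site.xml; they are shown programmatically
    // here only to illustrate the shape of the values.
    public static Configuration blockBlobConfig() {
        Configuration conf = new Configuration();

        // Keep the Flink checkpoint/savepoint directories OUT of this list,
        // otherwise files under them are written as page blobs.
        conf.set("fs.azure.page.blob.dir", "/hbase/WALs");

        // Optional: write them as block blobs with compaction instead, to stay
        // under the 50,000 committed blocks per blob limit.
        conf.set("fs.azure.block.blob.with.compaction.dir",
                 "/gsp/checkpoints,/gsp/savepoints");
        return conf;
    }
}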

Best
Yun Tang
________________________________
From: Alexey Trenikhun <yen...@msn.com>
Sent: Thursday, March 18, 2021 13:54
To: Yun Tang <myas...@live.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; 
user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,
How does the underlying storage explain the fact that I can restore from the 
savepoint without re-scaling? Does Flink write each file once or many times? If 
many times, then potentially there could be a problem with the 50,000 blocks per 
blob limit, am I right? Should I try block blobs with compaction, as described in 
[1], or without compaction?

Thanks,
Alexey
________________________________
From: Yun Tang <myas...@live.com>
Sent: Wednesday, March 17, 2021 9:31 PM
To: Alexey Trenikhun <yen...@msn.com>; Tzu-Li (Gordon) Tai 
<tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

I am not familiar with Azure Blob Storage, and I could not load the "_metadata" 
from the file you provided locally.

Currently, I strongly suspect this strange rescaling behavior is related to your 
underlying storage. Could you try using block blobs instead of page blobs [1] to 
see whether the behavior still occurs?

[1] 
https://hadoop.apache.org/docs/current/hadoop-azure/index.html#Block_Blob_with_Compaction_Support_and_Configuration


Best
Yun Tang

________________________________
From: Alexey Trenikhun <yen...@msn.com>
Sent: Thursday, March 18, 2021 12:00
To: Yun Tang <myas...@live.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; 
user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,
The Azure web UI shows the size of every file created by Flink as 128 MiB * X 
(128, 256, 640); see the attached screenshot. My understanding is that this is 
because Flink creates them as page blobs. In the same storage account, another 
application creates files as block blobs, and their sizes are not rounded to 
multiples of 128 MiB.


Thanks,
Alexey

________________________________
From: Yun Tang <myas...@live.com>
Sent: Wednesday, March 17, 2021 8:38 PM
To: Alexey Trenikhun <yen...@msn.com>; Tzu-Li (Gordon) Tai 
<tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

I tried to load your _metadata as a checkpoint via 
Checkpoints#loadCheckpointMetadata [1], but found that this file is actually not 
savepoint metadata. Have you uploaded the correct files?
Moreover, I noticed that both 77e77928-cb26-4543-bd41-e785fcac49f0 and _metadata 
are 128 MB, which is much larger than their actual content. Is this expected on 
Azure Blob Storage, or did you upload the wrong files?

[1] 
https://github.com/apache/flink/blob/956c0716fdbf20bf53305fe4d023fa2bea412595/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L99
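
In case you want to reproduce the check locally, a minimal sketch along these 
lines (the local path is a placeholder, and the exact signature of 
loadCheckpointMetadata may differ slightly between Flink versions):

import org.apache.flink.runtime.checkpoint.Checkpoints;
import org.apache.flink.runtime.checkpoint.metadata.CheckpointMetadata;

import java.io.DataInputStream;
import java.io.FileInputStream;

public class LoadSavepointMetadata {

    public static void main(String[] args) throws Exception {
        // Placeholder path: point this at the downloaded _metadata file.
        String path = "/tmp/savepoint-000000-67de6690143a/_metadata";

        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            CheckpointMetadata metadata = Checkpoints.loadCheckpointMetadata(
                    in, LoadSavepointMetadata.class.getClassLoader(), path);
            System.out.println("checkpoint id:   " + metadata.getCheckpointId());
            System.out.println("operator states: " + metadata.getOperatorStates().size());
        }
    }
}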

Best
Yun Tang
________________________________
From: Alexey Trenikhun <yen...@msn.com>
Sent: Thursday, March 18, 2021 0:45
To: Yun Tang <myas...@live.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; 
user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,

I've copied 77e77928-cb26-4543-bd41-e785fcac49f0 and _metadata to Google drive:
https://drive.google.com/drive/folders/1J3nwvQupLBT5ZaN_qEmc2y_-MgFz0cLb?usp=sharing

Compression was never enabled (the docs say that RocksDB's incremental 
checkpoints always use snappy compression; I'm not sure whether that has any 
effect on savepoints or not).
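
For reference, the flag in question is ExecutionConfig#setUseSnapshotCompression 
(the execution.checkpointing.snapshot-compression key from [2]/[3] below), which 
defaults to false. A minimal sketch of how it can be checked:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SnapshotCompressionCheck {

    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Defaults to false and only affects full snapshots/savepoints of keyed
        // state; RocksDB incremental checkpoints use snappy internally regardless.
        System.out.println("snapshot compression enabled: "
                + env.getConfig().isUseSnapshotCompression());
    }
}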

Thanks,
Alexey
________________________________
From: Yun Tang <myas...@live.com>
Sent: Wednesday, March 17, 2021 12:33 AM
To: Alexey Trenikhun <yen...@msn.com>; Tzu-Li (Gordon) Tai 
<tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

Thanks for your quick response. I have checked two different logs and still 
cannot understand why this could happen.

Take 
"wasbs://gsp-st...@gspstatewestus2dev.blob.core.windows.net/gsp/savepoints/savepoint-000000-67de6690143a/77e77928-cb26-4543-bd41-e785fcac49f0"
 for example: the key group range offsets were intersected correctly during 
rescaling for task "Intake voice calls (6/7)". The only remaining place I can 
doubt is whether Azure Blob Storage works as expected when seeking to an offset [1].
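
If you want to rule that out directly, here is a rough sketch of the kind of 
check I have in mind (the offset is a placeholder, to be replaced with one of the 
key-group offsets from your logs, and the account/container part of the path is 
elided as in your mails): read a few bytes once by seeking and once by skipping 
sequentially, then compare.

import org.apache.flink.core.fs.FSDataInputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

import java.io.EOFException;
import java.util.Arrays;

public class SeekCheck {

    public static void main(String[] args) throws Exception {
        // Placeholder path/offset: fill in the full wasbs:// URI and a real
        // key-group offset taken from the restore logs.
        Path file = new Path("wasbs://.../savepoint-000000-67de6690143a/"
                + "77e77928-cb26-4543-bd41-e785fcac49f0");
        long offset = 1234567L;

        byte[] viaSeek = new byte[64];
        byte[] viaSkip = new byte[64];
        FileSystem fs = file.getFileSystem();

        // Read 1: jump straight to the offset with seek().
        try (FSDataInputStream in = fs.open(file)) {
            in.seek(offset);
            readFully(in, viaSeek);
        }

        // Read 2: reach the same offset by skipping sequentially.
        try (FSDataInputStream in = fs.open(file)) {
            long remaining = offset;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    throw new EOFException("could not skip to offset " + offset);
                }
                remaining -= skipped;
            }
            readFully(in, viaSkip);
        }

        System.out.println("seek matches sequential read: "
                + Arrays.equals(viaSeek, viaSkip));
    }

    private static void readFully(FSDataInputStream in, byte[] buf) throws Exception {
        int pos = 0;
        while (pos < buf.length) {
            int n = in.read(buf, pos, buf.length - pos);
            if (n < 0) {
                break;
            }
            pos += n;
        }
    }
}

If the two reads differ, the seek on the wasbs:// stream would be the suspect 
rather than Flink's key-group offsets.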

Have you ever enabled snappy compression [2] [3] for savepoints?
Could you also share the file 
"wasbs://gsp-st...@gspstatewestus2dev.blob.core.windows.net/gsp/savepoints/savepoint-000000-67de6690143a/77e77928-cb26-4543-bd41-e785fcac49f0"
 so that I can seek it locally and see whether that works as expected?
Moreover, could you also share the savepoint metadata 
"wasbs://gsp-st...@gspstatewestus2dev.blob.core.windows.net/gsp/savepoints/savepoint-000000-67de6690143a/_metadata"
 ?


[1] 
https://github.com/apache/flink/blob/dc404e2538fdfbc98b9c565951f30f922bf7cedd/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/restore/RocksDBFullRestoreOperation.java#L211
[2] 
https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#compression
[3] 
https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#execution-checkpointing-snapshot-compression

Best
Yun Tang
________________________________
From: Alexey Trenikhun <yen...@msn.com>
Sent: Wednesday, March 17, 2021 14:25
To: Yun Tang <myas...@live.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; 
user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Attached.

________________________________
From: Yun Tang <myas...@live.com>
Sent: Tuesday, March 16, 2021 11:13 PM
To: Alexey Trenikhun <yen...@msn.com>; Tzu-Li (Gordon) Tai 
<tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

Thanks for your reply. Could you also share the logs from a normal restore, as I 
asked in the previous thread, so that I can compare?

Best
Yun Tang
________________________________
From: Alexey Trenikhun <yen...@msn.com>
Sent: Wednesday, March 17, 2021 13:55
To: Yun Tang <myas...@live.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; 
user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,
I'm attaching a shorter version of the log; it looks like the full version didn't 
come through.

Thanks,
Alexey
________________________________
From: Yun Tang <myas...@live.com>
Sent: Tuesday, March 16, 2021 8:05 PM
To: Alexey Trenikhun <yen...@msn.com>; Tzu-Li (Gordon) Tai 
<tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

I believe your exception messages were printed by Flink 1.12.2, not Flink 1.12.1, 
judging by the line numbers of the method calls in the stack trace.

Could you share the exception message from Flink 1.12.1 when rescaling? Moreover, 
I hope you can share more logs from restoring and rescaling; I want to see the 
details of the key group handles [1].

[1] 
https://github.com/apache/flink/blob/dc404e2538fdfbc98b9c565951f30f922bf7cedd/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/restore/RocksDBFullRestoreOperation.java#L153

Best
________________________________
From: Alexey Trenikhun <yen...@msn.com>
Sent: Tuesday, March 16, 2021 15:10
To: Yun Tang <myas...@live.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; 
user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Also, restoring from the same savepoint without a change in parallelism works fine.

________________________________
From: Alexey Trenikhun <yen...@msn.com>
Sent: Monday, March 15, 2021 9:51 PM
To: Yun Tang <myas...@live.com>; Tzu-Li (Gordon) Tai <tzuli...@apache.org>; 
user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

No, I believe the original exception was from 1.12.1 to 1.12.1.

Thanks,
Alexey

________________________________
From: Yun Tang <myas...@live.com>
Sent: Monday, March 15, 2021 8:07:07 PM
To: Alexey Trenikhun <yen...@msn.com>; Tzu-Li (Gordon) Tai 
<tzuli...@apache.org>; user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi,

Can you rescale the job within the same version, from 1.12.1 to 1.12.1?

Best
Yun Tang

________________________________
From: Alexey Trenikhun <yen...@msn.com>
Sent: Tuesday, March 16, 2021 4:46
To: Tzu-Li (Gordon) Tai <tzuli...@apache.org>; user@flink.apache.org 
<user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

The savepoint was taken with 1.12.1; I've tried to scale up using both the same 
version and 1.12.2.

________________________________
From: Tzu-Li (Gordon) Tai <tzuli...@apache.org>
Sent: Monday, March 15, 2021 12:06 AM
To: user@flink.apache.org <user@flink.apache.org>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi,

Could you provide info on the Flink version used?

Cheers,
Gordon



