Hi,

What I can say is that any failures like OOMs should not corrupt checkpoint
files, because only successfully completed checkpoints are used for recovery by 
the job manager. Just to get a bit more info, are you using full or incremental 
checkpoints? Unfortunately, it is a bit hard to say from the given information 
what the cause of the problem is. Typically, these problems have been observed 
when something was wrong with a serializer or a stateful serializer was used 
from multiple threads.
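
For reference, with the RocksDB backend incremental checkpoints have to be enabled
explicitly when the backend is set up. A minimal sketch of that configuration (the
checkpoint URI and the interval here are placeholders, not taken from your setup):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000);

        // The second constructor argument switches on incremental checkpoints;
        // 'false' (or the single-argument constructor) means full checkpoints.
        env.setStateBackend(
                new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
    }
}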

Best,
Stefan 

> On 07.09.2018 at 05:04, vino yang <yanghua1...@gmail.com> wrote:
> 
> Hi Edward,
> 
> From this log line, "Caused by: java.io.EOFException", it seems that the state 
> metadata file has been corrupted.
> But I can't confirm it; maybe Stefan knows more details, so I am pinging him for you.
> 
> Thanks, vino.
> 
> Edward Rojas <edward.roja...@gmail.com> wrote on Friday, September 7, 2018 at 1:22 AM:
> Hello all,
> 
> We are running Flink 1.5.3 on Kubernetes with RocksDB as the state backend. 
> While performing some load testing we got an "OutOfMemoryError: native memory
> exhausted", causing the job to fail and be restarted.
> 
> After the TaskManager is restarted, the job is recovered from a checkpoint,
> but it seems there is a problem when trying to access the state. We get the
> error in the onTimer function of a ProcessFunction, triggered by onProcessingTime.
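> 
> For illustration, here is a minimal sketch of that pattern (class and field
> names like TimerStateFunction and MyPojo are placeholders, not our actual job
> code): a ProcessFunction on a keyed stream stores a POJO in ValueState backed
> by RocksDB and reads it back in onTimer, which is the call that fails:
> 
> import org.apache.flink.api.common.state.ValueState;
> import org.apache.flink.api.common.state.ValueStateDescriptor;
> import org.apache.flink.configuration.Configuration;
> import org.apache.flink.streaming.api.functions.ProcessFunction;
> import org.apache.flink.util.Collector;
> 
> public class TimerStateFunction
>         extends ProcessFunction<TimerStateFunction.MyPojo, String> {
> 
>     // Simple POJO, serialized into RocksDB by Flink's PojoSerializer.
>     public static class MyPojo {
>         public String id;
>         public long count;
>         public MyPojo() {}
>     }
> 
>     private transient ValueState<MyPojo> lastEvent;
> 
>     @Override
>     public void open(Configuration parameters) {
>         lastEvent = getRuntimeContext().getState(
>                 new ValueStateDescriptor<>("last-event", MyPojo.class));
>     }
> 
>     @Override
>     public void processElement(MyPojo value, Context ctx, Collector<String> out)
>             throws Exception {
>         // Write the POJO into keyed state and register a processing-time timer.
>         lastEvent.update(value);
>         ctx.timerService().registerProcessingTimeTimer(
>                 ctx.timerService().currentProcessingTime() + 60_000L);
>     }
> 
>     @Override
>     public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out)
>             throws Exception {
>         // This value() call deserializes the bytes via RocksDBValueState and
>         // is where the EOFException below is thrown.
>         MyPojo previous = lastEvent.value();
>         out.collect(previous == null ? "none" : previous.id + ":" + previous.count);
>     }
> }
> 
> (The function runs on a keyed stream, i.e. it is applied after a keyBy().)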
> 
> Could the OOM error have caused a corrupted state to be checkpointed?
> 
> We get Exceptions like:
> 
> TimerException{java.lang.RuntimeException: Error while retrieving data from
> RocksDB.}
>         at
> org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask.run(SystemProcessingTimeService.java:288)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:277)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:191)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>         at java.lang.Thread.run(Thread.java:811)
> Caused by: java.lang.RuntimeException: Error while retrieving data from
> RocksDB.
>         at
> org.apache.flink.contrib.streaming.state.RocksDBValueState.value(RocksDBValueState.java:89)
>         at com.xxx.ProcessFunction.onTimer(ProcessFunction.java:279)
>         at
> org.apache.flink.streaming.api.operators.KeyedProcessOperator.invokeUserFunction(KeyedProcessOperator.java:94)
>         at
> org.apache.flink.streaming.api.operators.KeyedProcessOperator.onProcessingTime(KeyedProcessOperator.java:78)
>         at
> org.apache.flink.streaming.api.operators.HeapInternalTimerService.onProcessingTime(HeapInternalTimerService.java:266)
>         at
> org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask.run(SystemProcessingTimeService.java:285)
>         ... 7 more
> Caused by: java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:208)
>         at java.io.DataInputStream.readUTF(DataInputStream.java:618)
>         at java.io.DataInputStream.readUTF(DataInputStream.java:573)
>         at
> org.apache.flink.api.java.typeutils.runtime.PojoSerializer.deserialize(PojoSerializer.java:381)
>         at
> org.apache.flink.contrib.streaming.state.RocksDBValueState.value(RocksDBValueState.java:87)
>         ... 12 more
> 
> 
> Thanks in advance for any help
> 
> 
> 
> 
> --
> Sent from: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ 
