Re: Flink checkpoint failure issue

2018-06-05 Thread Chesnay Schepler

Can you provide us with the TaskManager logs?

On 05.06.2018 12:30, James (Jian Wu) [FDS Data Platform] wrote:


Hi:

  I am running a continuous streaming query on Flink.

  Scenario:

A Kafka connector consumes a topic, and the job incrementally computes 
results over a 24-hour window, using processing time as the 
TimeCharacteristic. The state backend is RocksDB, the file system is 
HDFS, and the checkpoint interval is 5 minutes.


env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);


RocksDBStateBackend rocksdb = new RocksDBStateBackend(checkPointPath, true);

rocksdb.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);

env.setStateBackend(rocksdb);


env.enableCheckpointing(checkPointInterval);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(checkPointInterval);

  After the application has run for several hours, the INFO log shows:

2018-06-04 19:29:08,048 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator   - 
Triggering checkpoint 9 @ 1528108147985 for job 
33313f186b439312bd09e5672e8af661.


 but there is no corresponding completion log entry, and the checkpoint failed.

 According to the web UI metrics, the Kafka committed offset stops 
increasing while the Kafka current offset keeps advancing. After 
waiting for 2 hours, the Kafka consumer stops consuming messages.
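
One way to confirm which checkpoints were triggered but never finished is to scan the JobManager log. The following is a minimal sketch, not part of the original report. It assumes the log has one entry per line (unwrapped, unlike the excerpts in this mail) and that a successful checkpoint is logged with a message beginning "Completed checkpoint <id>"; that wording is an assumption and may differ between Flink versions. The class name, method name, and the shortened completion line in the sample are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnfinishedCheckpoints {
    // Matches the trigger message quoted in this thread.
    private static final Pattern TRIGGERED =
            Pattern.compile("Triggering checkpoint (\\d+) @ (\\d+) for job");
    // ASSUMPTION: completion entries start with "Completed checkpoint <id>".
    private static final Pattern COMPLETED =
            Pattern.compile("Completed checkpoint (\\d+)");

    /** Returns the ids of checkpoints that were triggered but have no completion entry. */
    static List<Long> findUnfinished(List<String> lines) {
        Set<Long> pending = new TreeSet<>();
        for (String line : lines) {
            Matcher triggered = TRIGGERED.matcher(line);
            if (triggered.find()) {
                pending.add(Long.parseLong(triggered.group(1)));
                continue;
            }
            Matcher completed = COMPLETED.matcher(line);
            if (completed.find()) {
                pending.remove(Long.parseLong(completed.group(1)));
            }
        }
        return new ArrayList<>(pending);
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "2018-06-04 19:23:58,967 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator"
                        + " - Triggering checkpoint 8 @ 1528107838933 for job 33313f186b439312bd09e5672e8af661.",
                // Hypothetical, shortened completion entry used only for illustration.
                "Completed checkpoint 8",
                "2018-06-04 19:29:08,048 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator"
                        + " - Triggering checkpoint 9 @ 1528108147985 for job 33313f186b439312bd09e5672e8af661.");
        System.out.println("Triggered but never completed: " + findUnfinished(sample));
    }
}
```

In practice the list passed to `findUnfinished` would come from reading the real log file, for example with `java.nio.file.Files.readAllLines`.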


Then I enabled debug logging and tried to reproduce the issue.

During the normal stage, the log shows the DFSClient sending data packets:

2018-06-04 19:23:58,933 DEBUG org.apache.hadoop.hdfs.DFSClient 
- /flink/cps/33313f186b439312bd09e5672e8af661/chk-8: 
masked=rwxr-xr-x


2018-06-04 19:23:58,934 DEBUG org.apache.hadoop.ipc.Client 
- The ping interval is 6 ms.


2018-06-04 19:23:58,934 DEBUG org.apache.hadoop.ipc.Client 
- Connecting to fds-hadoop-prod30-mp/10.10.22.50:8020


2018-06-04 19:23:58,935 DEBUG org.apache.hadoop.ipc.Client 
- IPC Client (2045458324) connection to 
fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin: starting, having 
connections 1


2018-06-04 19:23:58,936 DEBUG org.apache.hadoop.ipc.Client 
- IPC Client (2045458324) connection to 
fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin sending #1709


2018-06-04 19:23:58,967 DEBUG org.apache.hadoop.ipc.Client 
- IPC Client (2045458324) connection to 
fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin got value #1709


2018-06-04 19:23:58,967 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine 
- Call: mkdirs took 33ms


2018-06-04 19:23:58,967 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator   - 
Triggering checkpoint 8 @ 1528107838933 for job 
33313f186b439312bd09e5672e8af661.


2018-06-04 19:24:00,054 DEBUG org.apache.hadoop.hdfs.DFSClient 
- 
/flink/cps/33313f186b439312bd09e5672e8af661/chk-8/_metadata: 
masked=rw-r--r--


2018-06-04 19:24:00,055 DEBUG org.apache.hadoop.ipc.Client 
- IPC Client (2045458324) connection to 
fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin sending #1710


2018-06-04 19:24:00,060 DEBUG org.apache.hadoop.ipc.Client 
- IPC Client (2045458324) connection to 
fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin got value #1710


2018-06-04 19:24:00,060 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine 
- Call: create took 6ms


2018-06-04 19:24:00,060 DEBUG org.apache.hadoop.hdfs.DFSClient 
- computePacketChunkSize: 
src=/flink/cps/33313f186b439312bd09e5672e8af661/chk-8/_metadata, 
chunkSize=516, chunksPerPacket=126, packetSize=65016


2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.LeaseRenewer 
- Lease renewer daemon for 
[DFSClient_NONMAPREDUCE_-866487647_111] with renew id 1 started


2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.DFSClient 
- DFSClient writeChunk allocating new packet seqno=0, 
src=/flink/cps/33313f186b439312bd09e5672e8af661/chk-8/_metadata, 
packetSize=65016, chunksPerPacket=126, bytesCurBlock=0


2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.DFSClient 
- DFSClient flush(): bytesCurBlock=6567 lastFlushOffset=0 
createNewBlock=false


2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.DFSClient 
- Queued packet 0


2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.DFSClient 
- Waiting for ack for: 0


2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.DFSClient 
- Allocating new block


2018-06-04 19:24:00,062 DEBUG org.apache.hadoop.ipc.Client 
- IPC Client (2045458324) connection to 
fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin sending #1711


2018-06-04 19:24:00,068 DEBUG org.apache.hadoop.ipc.Client 
- IPC Client (2045458324) connection to 
fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin got value #1711


2018-06-04 19:24:00,068 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine 
- Call: addBlock took 6ms


20

Flink checkpoint failure issue

2018-06-05 Thread James (Jian Wu) [FDS Data Platform]
Hi:

  I am running a continuous streaming query on Flink.
  Scenario:
  A Kafka connector consumes a topic, and the job incrementally computes 
results over a 24-hour window, using processing time as the 
TimeCharacteristic. The state backend is RocksDB, the file system is HDFS, 
and the checkpoint interval is 5 minutes.

env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);


RocksDBStateBackend rocksdb = new RocksDBStateBackend(checkPointPath, true);
rocksdb.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);

env.setStateBackend(rocksdb);


env.enableCheckpointing(checkPointInterval);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(checkPointInterval);

  After the application has run for several hours, the INFO log shows:
2018-06-04 19:29:08,048 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
checkpoint 9 @ 1528108147985 for job 33313f186b439312bd09e5672e8af661.

 but there is no corresponding completion log entry, and the checkpoint failed.
 According to the web UI metrics, the Kafka committed offset stops increasing 
while the Kafka current offset keeps advancing. After waiting for 2 hours, 
the Kafka consumer stops consuming messages.
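
The number after the "@" in each "Triggering checkpoint" entry is an epoch timestamp in milliseconds, so the spacing between two triggers can be checked directly. The sketch below is an illustration only: it uses the two trigger timestamps that appear in this message (checkpoints 8 and 9), and the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

public class CheckpointGap {
    /** Time elapsed between two checkpoint trigger timestamps (epoch milliseconds). */
    static Duration gap(long earlierMillis, long laterMillis) {
        return Duration.between(Instant.ofEpochMilli(earlierMillis),
                                Instant.ofEpochMilli(laterMillis));
    }

    public static void main(String[] args) {
        long chk8 = 1528107838933L; // "Triggering checkpoint 8 @ 1528107838933"
        long chk9 = 1528108147985L; // "Triggering checkpoint 9 @ 1528108147985"

        Duration d = gap(chk8, chk9);
        // Instants are printed in UTC; the log timestamps are in the cluster's local time.
        System.out.println("checkpoint 8 triggered at " + Instant.ofEpochMilli(chk8));
        System.out.println("checkpoint 9 triggered at " + Instant.ofEpochMilli(chk9));
        System.out.println("gap = " + d.toMillis() + " ms"); // gap = 309052 ms
    }
}
```

The gap of 309052 ms (a little over 5 minutes) matches the configured 5-minute interval, which suggests that checkpoint 9 was triggered on schedule and that the problem lies in completing it rather than in triggering it.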

Then I enabled debug logging and tried to reproduce the issue.

During the normal stage, the log shows the DFSClient sending data packets:


2018-06-04 19:23:58,933 DEBUG org.apache.hadoop.hdfs.DFSClient  
- /flink/cps/33313f186b439312bd09e5672e8af661/chk-8: 
masked=rwxr-xr-x
2018-06-04 19:23:58,934 DEBUG org.apache.hadoop.ipc.Client  
- The ping interval is 6 ms.
2018-06-04 19:23:58,934 DEBUG org.apache.hadoop.ipc.Client  
- Connecting to fds-hadoop-prod30-mp/10.10.22.50:8020
2018-06-04 19:23:58,935 DEBUG org.apache.hadoop.ipc.Client  
- IPC Client (2045458324) connection to 
fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin: starting, having 
connections 1
2018-06-04 19:23:58,936 DEBUG org.apache.hadoop.ipc.Client  
- IPC Client (2045458324) connection to 
fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin sending #1709
2018-06-04 19:23:58,967 DEBUG org.apache.hadoop.ipc.Client  
- IPC Client (2045458324) connection to 
fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin got value #1709
2018-06-04 19:23:58,967 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine   
- Call: mkdirs took 33ms
2018-06-04 19:23:58,967 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
checkpoint 8 @ 1528107838933 for job 33313f186b439312bd09e5672e8af661.
2018-06-04 19:24:00,054 DEBUG org.apache.hadoop.hdfs.DFSClient  
- /flink/cps/33313f186b439312bd09e5672e8af661/chk-8/_metadata: 
masked=rw-r--r--
2018-06-04 19:24:00,055 DEBUG org.apache.hadoop.ipc.Client  
- IPC Client (2045458324) connection to 
fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin sending #1710
2018-06-04 19:24:00,060 DEBUG org.apache.hadoop.ipc.Client  
- IPC Client (2045458324) connection to 
fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin got value #1710
2018-06-04 19:24:00,060 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine   
- Call: create took 6ms
2018-06-04 19:24:00,060 DEBUG org.apache.hadoop.hdfs.DFSClient  
- computePacketChunkSize: 
src=/flink/cps/33313f186b439312bd09e5672e8af661/chk-8/_metadata, chunkSize=516, 
chunksPerPacket=126, packetSize=65016
2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.LeaseRenewer   
- Lease renewer daemon for [DFSClient_NONMAPREDUCE_-866487647_111] 
with renew id 1 started
2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.DFSClient  
- DFSClient writeChunk allocating new packet seqno=0, 
src=/flink/cps/33313f186b439312bd09e5672e8af661/chk-8/_metadata, 
packetSize=65016, chunksPerPacket=126, bytesCurBlock=0
2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.DFSClient  
- DFSClient flush(): bytesCurBlock=6567 lastFlushOffset=0 
createNewBlock=false
2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.DFSClient  
- Queued packet 0
2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.DFSClient  
- Waiting for ack for: 0
2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.DFSClient  
- Allocating new block
2018-06-04 19:24:00,062 DEBUG org.apache.hadoop.ipc.Client  
- IPC Client (2045458324) connection to 
fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin sending #1711
2018-06-04 19:24:00,068 DEBUG org.apache.hadoop.ipc.Client  
- IPC Client (2045458324) connection to 
fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin got value #1711
2018-06-04 19:24:00,068 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine