Re: flink 1.7 HA production setup going down completely

2019-05-08 Thread Till Rohrmann
Hi Manju, I guess this exception
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputS

Re: flink 1.7 HA production setup going down completely

2019-05-08 Thread Manjusha Vuyyuru
Hi Till, thanks for the response. Please see the attached log file.
*HA config is:*
high-availability: zookeeper
high-availability.storageDir: hdfs://flink-hdfs:9000/flink/checkpoints
From the logs I can see block missing exceptions from HDFS, but I can see that the jobgraph is still present in
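For reference, the two HA settings quoted in the message above would sit in `flink-conf.yaml` roughly as sketched here. The first two lines are taken from the thread. The `high-availability.zookeeper.quorum` key is a real Flink option that a ZooKeeper HA setup also needs, but the host names given for it are placeholders assumed for illustration and do not come from the thread.

```yaml
# Sketch of the HA section of flink-conf.yaml, based on the settings quoted in the thread.
high-availability: zookeeper
# Directory where the JobManager persists JobGraphs and other recovery metadata.
high-availability.storageDir: hdfs://flink-hdfs:9000/flink/checkpoints
# ASSUMPTION: placeholder ZooKeeper hosts, not taken from the thread.
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
```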

Re: flink 1.7 HA production setup going down completely

2019-05-08 Thread Till Rohrmann
Hi Manju, could you share the full logs or at least the full stack trace of the exception with us? I suspect that after a failover Flink tries to restore the JobGraph from persistent storage (the directory which you have configured via `high-availability.storageDir`) but is not able to do so. One

Re: flink 1.7 HA production setup going down completely

2019-05-08 Thread Manjusha Vuyyuru
Any update on this from the community side?
On Tue, May 7, 2019 at 6:43 PM Manjusha Vuyyuru wrote:
> I'm using 1.7.2.
>
> On Tue, May 7, 2019 at 5:50 PM miki haiat wrote:
>
>> Which flink version are you using?
>> I had similar issues with 1.5.x
>>
>> On Tue, May 7, 2019 at 2:49 PM Manjusha Vuyyu

Re: flink 1.7 HA production setup going down completely

2019-05-07 Thread Manjusha Vuyyuru
I'm using 1.7.2.
On Tue, May 7, 2019 at 5:50 PM miki haiat wrote:
> Which flink version are you using?
> I had similar issues with 1.5.x
>
> On Tue, May 7, 2019 at 2:49 PM Manjusha Vuyyuru wrote:
>
>> Hello,
>>
>> I have a flink setup with two job managers coordinated by zookeeper.
>>
>> I s

Re: flink 1.7 HA production setup going down completely

2019-05-07 Thread miki haiat
Which flink version are you using? I had similar issues with 1.5.x
On Tue, May 7, 2019 at 2:49 PM Manjusha Vuyyuru wrote:
> Hello,
>
> I have a flink setup with two job managers coordinated by zookeeper.
>
> I see the below exception and both jobmanagers are going down:
>
> 2019-05-07 08:29:13,

flink 1.7 HA production setup going down completely

2019-05-07 Thread Manjusha Vuyyuru
Hello, I have a flink setup with two job managers coordinated by zookeeper. I see the below exception and both jobmanagers are going down:
2019-05-07 08:29:13,346 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Released locks of job graph f8eb1b482d8ec8c1d3e94c4d0f79d