Glad to hear it!
Best,
Congxian
Adrian Vasiliu <vasi...@fr.ibm.com> wrote on Tue, Oct 15, 2019 at 9:10 PM:

> Hi,
> FYI, we've switched to a different Hadoop server, and the issue vanished...
> It does look as if the cause was on the Hadoop side.
> Thanks again, Congxian.
> Adrian
>
> ----- Original message -----
> From: "Adrian Vasiliu" <vasi...@fr.ibm.com>
> To: qcx978132...@gmail.com
> Cc: user@flink.apache.org
> Subject: [EXTERNAL] RE: FLINK-13497 / "Could not create file for checking if truncate works" / HDFS
> Date: Tue, Oct 15, 2019 8:37 AM
>
> Thanks, Congxian. The possible causes listed in the top-voted answer at
> https://stackoverflow.com/questions/36015864/hadoop-be-replicated-to-0-nodes-instead-of-minreplication-1-there-are-1/36310025
> do not seem to hold for us: we have other, quite similar Flink jobs using
> the same Hadoop server and root directory (under different HDFS paths),
> and they do work. So, in principle, the configuration on the Hadoop
> server side shouldn't be the cause. Also, according to the Ambari
> monitoring tools, the Hadoop server is healthy, and we did restart it.
> However, we'll check all the points mentioned in the various answers, in
> particular the one about temp files.
> Thanks
> Adrian
>
> ----- Original message -----
> From: Congxian Qiu <qcx978132...@gmail.com>
> To: Adrian Vasiliu <vasi...@fr.ibm.com>
> Cc: user <user@flink.apache.org>
> Subject: [EXTERNAL] Re: FLINK-13497 / "Could not create file for checking if truncate works" / HDFS
> Date: Tue, Oct 15, 2019 4:02 AM
>
> Hi,
>
> From the given stack trace, maybe you could solve the "replication
> problem" first:
>
>     File /okd-dev/3fe6b069-43bf-4d86-9762-4f501c9db16e could only be
>     replicated to 0 nodes instead of minReplication (=1). There are 2
>     datanode(s) running and no node(s) are excluded in this operation.
>
> Maybe the answer from SO [1] can help.
> [1]
> https://stackoverflow.com/questions/36015864/hadoop-be-replicated-to-0-nodes-instead-of-minreplication-1-there-are-1/36310025
>
> Best,
> Congxian
>
> Adrian Vasiliu <vasi...@fr.ibm.com> wrote on Mon, Oct 14, 2019 at 9:10 PM:
>
> > Hello,
> >
> > We recently upgraded our product from Flink 1.7.2 to Flink 1.9, and we
> > experience repeatedly failing jobs with:
> >
> > java.lang.RuntimeException: Could not create file for checking if truncate works. You can disable support for truncate() completely via BucketingSink.setUseTruncate(false).
> >     at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.reflectTruncate(BucketingSink.java:645)
> >     at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initializeState(BucketingSink.java:388)
> >     at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
> >     at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
> >     at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
> >     at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:281)
> >     at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:878)
> >     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:392)
> >     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
> >     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
> >     at java.lang.Thread.run(Thread.java:748)
> > Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /okd-dev/3fe6b069-43bf-4d86-9762-4f501c9db16e could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and no node(s) are excluded in this operation.
> >     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1719)
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3368)
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3292)
> >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:850)
> >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:504)
> >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
> >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:422)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
> >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
> >
> >     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
> >     at org.apache.hadoop.ipc.Client.call(Client.java:1435)
> >     at org.apache.hadoop.ipc.Client.call(Client.java:1345)
> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> >     at com.sun.proxy.$Proxy49.addBlock(Unknown Source)
> >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:444)
> >     at sun.reflect.GeneratedMethodAccessor87.invoke(Unknown Source)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >     at java.lang.reflect.Method.invoke(Method.java:498)
> >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
> >     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
> >     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
> >     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
> >     at com.sun.proxy.$Proxy50.addBlock(Unknown Source)
> >     at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1838)
> >     at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1638)
> >     at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:704)
> >
> > Reading through https://issues.apache.org/jira/browse/FLINK-13593, it looks related, but that issue is marked as fixed in 1.9.
> >
> > The discussion there then points to https://issues.apache.org/jira/browse/FLINK-13497, which is marked as unresolved / fixed in 1.10.
> >
> > Could you shed some light on the following:
> > 1/ Would you confirm that our stack trace is related to https://issues.apache.org/jira/browse/FLINK-13497?
> > 2/ Is there any ETA for a 1.9.x release fixing it?
> >
> > Thanks
> > Adrian Vasiliu
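For context on what actually fails here: the "Could not create file for checking if truncate works" message means BucketingSink writes a small probe file and then tries to truncate it, and in the stack trace above the write step itself dies because HDFS cannot place the block on any datanode. A rough, self-contained analogue of that probe pattern, using plain java.nio against a local filesystem (the real reflectTruncate uses the Hadoop FileSystem API via reflection; the class and file names below are made up for illustration):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative analogue of the check BucketingSink performs at restore time:
// create a small check file, then attempt to truncate it. In the reported
// failure, step 1 (creating the file) already fails on the HDFS side with
// "could only be replicated to 0 nodes instead of minReplication (=1)".
public class TruncateProbe {
    static boolean truncateWorks(Path dir) {
        Path probe = dir.resolve("truncate-probe-" + java.util.UUID.randomUUID());
        try {
            // Step 1: create the check file (where the original job died).
            Files.write(probe, "hello flink".getBytes(StandardCharsets.UTF_8));
            // Step 2: truncate it and verify the new length.
            try (FileChannel ch = FileChannel.open(probe, StandardOpenOption.WRITE)) {
                ch.truncate(5);
            }
            return Files.size(probe) == 5;
        } catch (IOException e) {
            // Either step failing means truncate-based recovery is unusable.
            return false;
        } finally {
            try { Files.deleteIfExists(probe); } catch (IOException ignored) {}
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("probe");
        System.out.println(truncateWorks(tmp)); // prints true on a local filesystem
    }
}
```

The workaround named in the error message, `BucketingSink.setUseTruncate(false)`, skips this probe entirely; Flink then records valid lengths in side files on recovery instead of truncating in-progress files.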
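For anyone checking the server side first: the causes in the linked SO answer mostly come down to datanode disk space and replication settings. A minimal hdfs-site.xml sketch of the properties typically involved (the values are illustrative placeholders, not recommendations for this cluster):

```xml
<!-- hdfs-site.xml (server side); illustrative values only -->
<configuration>
  <!-- Target number of replicas per block. A write fails outright when
       fewer than the minimum replication (default 1) datanodes can
       accept the block, which is the error seen in this thread. -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- Space (in bytes) reserved per volume for non-HDFS use. A datanode
       whose volumes drop below this threshold is effectively full and
       gets skipped when the namenode places new blocks. -->
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>1073741824</value>
  </property>
</configuration>
```

Running `hdfs dfsadmin -report` shows per-datanode capacity and remaining space, which is usually the quickest way to tell whether a "replicated to 0 nodes" failure is a disk-space problem rather than a connectivity one.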