Thanks Congxian. The possible causes listed in the top-voted answer of https://stackoverflow.com/questions/36015864/hadoop-be-replicated-to-0-nodes-instead-of-minreplication-1-there-are-1/36310025 do not seem to apply to us: we have other very similar Flink jobs using the same Hadoop server and root directory (under different HDFS paths), and they work fine. So in principle the server-side Hadoop configuration should not be the cause. Also, according to the Ambari monitoring tools the Hadoop server is healthy, and we did restart it. We will nevertheless go through all the points mentioned in the various answers, in particular the one about temporary files.
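For instance, one way to compare what the failing job and the working jobs actually resolve on the client side is a small standalone check like this (just a sketch, assuming the same Hadoop client configuration as the Flink jobs is on the classpath; the class name and directory are only illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientConfigCheck {
    public static void main(String[] args) throws Exception {
        // Loads the Hadoop *-site.xml configuration files found on the classpath
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Same root directory as the jobs use
            Path dir = new Path("/okd-dev");
            System.out.println("fs.defaultFS        = " + conf.get("fs.defaultFS"));
            System.out.println("default replication = " + fs.getDefaultReplication(dir));
            System.out.println("default block size  = " + fs.getDefaultBlockSize(dir));
        }
    }
}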
Thanks
Adrian
 
----- Original message -----
From: Congxian Qiu <qcx978132...@gmail.com>
To: Adrian Vasiliu <vasi...@fr.ibm.com>
Cc: user <user@flink.apache.org>
Subject: [EXTERNAL] Re: FLINK-13497 / "Could not create file for checking if truncate works" / HDFS
Date: Tue, Oct 15, 2019 4:02 AM
 
Hi
 
From the given stack trace, maybe you could solve the "replication problem" first: "File /okd-dev/3fe6b069-43bf-4d86-9762-4f501c9db16e could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and no node(s) are excluded in this operation." Maybe the answer from SO [1] can help.

[1] https://stackoverflow.com/questions/36015864/hadoop-be-replicated-to-0-nodes-instead-of-minreplication-1-there-are-1/36310025
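A quick way to see whether the replication error reproduces outside of Flink is a standalone write to the same directory (sketch only; the probe file name is arbitrary and the cluster's Hadoop client configuration is assumed to be on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Tiny file under the same directory the failing job writes to
            Path probe = new Path("/okd-dev/replication-probe");
            try (FSDataOutputStream out = fs.create(probe, true)) {
                out.writeUTF("probe");
            }
            // If HDFS itself is the problem, the create/close above should fail
            // with the same "replicated to 0 nodes" error; otherwise clean up.
            fs.delete(probe, false);
        }
    }
}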
 
 
Adrian Vasiliu <vasi...@fr.ibm.com> wrote on Mon, Oct 14, 2019 at 9:10 PM:
Hello, 
 
We recently upgraded our product from Flink 1.7.2 to Flink 1.9, and we are now experiencing repeatedly failing jobs with:
 
java.lang.RuntimeException: Could not create file for checking if truncate works. You can disable support for truncate() completely via BucketingSink.setUseTruncate(false).
    at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.reflectTruncate(BucketingSink.java:645)
    at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initializeState(BucketingSink.java:388)
    at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
    at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
    at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:281)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:878)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:392)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /okd-dev/3fe6b069-43bf-4d86-9762-4f501c9db16e could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1719)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3368)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3292)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:850)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:504)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
 
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
    at org.apache.hadoop.ipc.Client.call(Client.java:1435)
    at org.apache.hadoop.ipc.Client.call(Client.java:1345)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy49.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:444)
    at sun.reflect.GeneratedMethodAccessor87.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
    at com.sun.proxy.$Proxy50.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1838)
    at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1638)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:704)
 
Reading through https://issues.apache.org/jira/browse/FLINK-13593, it looks related, but that issue is marked as fixed in 1.9.
 
The discussion there then points to https://issues.apache.org/jira/browse/FLINK-13497, which is still unresolved, with the fix targeted for 1.10.

Could you shed any light on the following:
1/ Can you confirm that our stack trace is related to https://issues.apache.org/jira/browse/FLINK-13497?
2/ Is there an ETA for a 1.9.x release fixing it?
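In the meantime, we are considering the workaround the exception message itself suggests, i.e. disabling the truncate() support check on the sink (a minimal sketch; the output path and element type are placeholders):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;

public class TruncateWorkaround {
    public static void apply(DataStream<String> stream) {
        BucketingSink<String> sink = new BucketingSink<>("hdfs:///okd-dev/output");
        // Skip the truncate() probe that currently fails at state restore time
        sink.setUseTruncate(false);
        stream.addSink(sink);
    }
}

As far as we understand, the sink then falls back to writing .valid-length files on recovery instead of truncating, so we would treat this only as a stopgap until the actual fix is available.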
 
Thanks
Adrian Vasiliu
 
