[ https://issues.apache.org/jira/browse/FLUME-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962102#comment-15962102 ]
Hudson commented on FLUME-3080:
-------------------------------

UNSTABLE: Integrated in Jenkins build Flume-trunk-hbase-1 #243 (See [https://builds.apache.org/job/Flume-trunk-hbase-1/243/])
FLUME-3080. Close failure in HDFS Sink might cause data loss (bessbd: [http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=e5c3e6aa76cf2b0bb0838ff6dcd3853656bff704])
* (edit) flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java
* (edit) flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestHDFSEventSinkOnMiniCluster.java

> Close failure in HDFS Sink might cause data loss
> ------------------------------------------------
>
>                 Key: FLUME-3080
>                 URL: https://issues.apache.org/jira/browse/FLUME-3080
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: 1.7.0
>            Reporter: Denes Arvay
>            Assignee: Denes Arvay
>            Priority: Blocker
>             Fix For: 1.8.0
>
>
> If the HDFS Sink tries to close a file but the close fails (e.g. due to a timeout), the last block might not end up in the COMPLETE state. In this case block recovery should happen, but as the lease is still held by Flume, the NameNode will start the recovery process only after the hard limit of 1 hour expires.
> The lease recovery can be started manually with the {{hdfs debug recoverLease}} command.
> To reproduce, I removed the close call from the {{BucketWriter}} (https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java#L380) to simulate the failure and used the following config:
> {noformat}
> agent.sources = testsrc
> agent.sinks = testsink
> agent.channels = testch
> agent.sources.testsrc.type = netcat
> agent.sources.testsrc.bind = localhost
> agent.sources.testsrc.port = 9494
> agent.sources.testsrc.channels = testch
> agent.sinks.testsink.type = hdfs
> agent.sinks.testsink.hdfs.path = /user/flume/test
> agent.sinks.testsink.hdfs.rollInterval = 0
> agent.sinks.testsink.hdfs.rollCount = 3
> agent.sinks.testsink.serializer = avro_event
> agent.sinks.testsink.serializer.compressionCodec = snappy
> agent.sinks.testsink.hdfs.fileSuffix = .avro
> agent.sinks.testsink.hdfs.fileType = DataStream
> agent.sinks.testsink.hdfs.batchSize = 2
> agent.sinks.testsink.hdfs.writeFormat = Text
> agent.sinks.testsink.hdfs.idleTimeout = 20
> agent.sinks.testsink.channel = testch
> agent.channels.testch.type = memory
> {noformat}
> After ingesting 6 events ("a" to "f"), 2 files were created on HDFS, as expected, but events are missing when listing their contents in the Spark shell:
> {noformat}
> scala> sqlContext.read.avro("/user/flume/test/FlumeData.14908867127*.avro").collect().map(a => new String(a(1).asInstanceOf[Array[Byte]])).foreach(println)
> a
> b
> d
> {noformat}
> {{hdfs fsck}} also confirms that the blocks are still in the {{UNDER_CONSTRUCTION}} state:
> {noformat}
> $ hdfs fsck /user/flume/test/ -openforwrite -files -blocks
> FSCK started by root (auth:SIMPLE) from /172.31.114.3 for path /user/flume/test/ at Thu Mar 30 08:23:56 PDT 2017
> /user/flume/test/ <dir>
> /user/flume/test/FlumeData.1490887185312.avro 310 bytes, 1 block(s), OPENFORWRITE: MISSING 1 blocks of total size 310 B
> 0. BP-1285398861-172.31.114.3-1489845696835:blk_1073761923_21128{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-e0d04bef-a861-40b0-99dd-27bfb2871ecd:NORMAL:172.31.114.27:20002|RBW], ReplicaUnderConstruction[[DISK]DS-d1979e0c-db81-4790-b225-ae8a4cf42dd8:NORMAL:172.31.114.32:20002|RBW], ReplicaUnderConstruction[[DISK]DS-ca00550d-702e-4892-a54a-7105af0c19ee:NORMAL:172.31.114.24:20002|RBW]]} len=310 MISSING!
> /user/flume/test/FlumeData.1490887185313.avro 292 bytes, 1 block(s), OPENFORWRITE: MISSING 1 blocks of total size 292 B
> 0. BP-1285398861-172.31.114.3-1489845696835:blk_1073761924_21129{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ca00550d-702e-4892-a54a-7105af0c19ee:NORMAL:172.31.114.24:20002|RBW], ReplicaUnderConstruction[[DISK]DS-e0d04bef-a861-40b0-99dd-27bfb2871ecd:NORMAL:172.31.114.27:20002|RBW], ReplicaUnderConstruction[[DISK]DS-d1979e0c-db81-4790-b225-ae8a4cf42dd8:NORMAL:172.31.114.32:20002|RBW]]} len=292 MISSING!
> {noformat}
> These blocks need to be recovered by starting a lease recovery process on the NameNode (which will then run the block recovery as well). This can be triggered programmatically via the DFSClient.
> Adding this call when the close fails solves the issue, roughly along the lines of the sketch below.
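> A simplified sketch of the idea (hypothetical helper and names, not the actual Flume code) is to call {{DistributedFileSystem.recoverLease()}} on the file once the close attempt has failed:
> {noformat}
> import java.io.IOException;
>
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
>
> // Hypothetical helper: if close() fails, ask the NameNode to start lease
> // recovery for the still-open file instead of waiting for the one-hour
> // hard limit to expire.
> public class LeaseRecoveryHelper {
>   public static void recoverLease(FileSystem fs, Path path) {
>     if (!(fs instanceof DistributedFileSystem)) {
>       return; // lease recovery only applies to HDFS
>     }
>     try {
>       // recoverLease() returns true if the file is already closed,
>       // false if recovery was started but has not completed yet.
>       boolean closed = ((DistributedFileSystem) fs).recoverLease(path);
>       if (!closed) {
>         System.err.println("Lease recovery started but not yet complete for " + path);
>       }
>     } catch (IOException e) {
>       System.err.println("Unable to start lease recovery for " + path + ": " + e);
>     }
>   }
> }
> {noformat}
> Calling something like this from the close error handling in {{BucketWriter}} has the same effect as running {{hdfs debug recoverLease -path /user/flume/test/FlumeData.1490887185312.avro}} by hand, but without waiting for an operator to notice the stuck file.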
> cc [~jojochuang]

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)