[ https://issues.apache.org/jira/browse/FLUME-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962102#comment-15962102 ]

Hudson commented on FLUME-3080:
-------------------------------

UNSTABLE: Integrated in Jenkins build Flume-trunk-hbase-1 #243 (See [https://builds.apache.org/job/Flume-trunk-hbase-1/243/])
FLUME-3080. Close failure in HDFS Sink might cause data loss (bessbd: [http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=e5c3e6aa76cf2b0bb0838ff6dcd3853656bff704])
* (edit) flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java
* (edit) flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestHDFSEventSinkOnMiniCluster.java


> Close failure in HDFS Sink might cause data loss
> ------------------------------------------------
>
>                 Key: FLUME-3080
>                 URL: https://issues.apache.org/jira/browse/FLUME-3080
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: 1.7.0
>            Reporter: Denes Arvay
>            Assignee: Denes Arvay
>            Priority: Blocker
>             Fix For: 1.8.0
>
>
> If the HDFS Sink tries to close a file but the close fails (e.g. due to a timeout), the last block might not end up in the COMPLETE state. In this case block recovery should happen, but as the lease is still held by Flume, the NameNode will start the recovery process only after the hard limit of 1 hour expires.
> The lease recovery can be started manually by the {{hdfs debug recoverLease}} command.
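> For example, once the stuck file is identified (the path below is one of the files listed in the {{hdfs fsck}} output further down; the optional {{-retries}} count is an arbitrary choice here):
> {noformat}
> $ hdfs debug recoverLease -path /user/flume/test/FlumeData.1490887185312.avro -retries 5
> {noformat}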
> For reproduction I removed the close call from the {{BucketWriter}} (https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java#L380) to simulate the failure and used the following config:
> {noformat}
> agent.sources = testsrc
> agent.sinks = testsink
> agent.channels = testch
> agent.sources.testsrc.type = netcat
> agent.sources.testsrc.bind = localhost
> agent.sources.testsrc.port = 9494
> agent.sources.testsrc.channels = testch
> agent.sinks.testsink.type = hdfs
> agent.sinks.testsink.hdfs.path = /user/flume/test
> agent.sinks.testsink.hdfs.rollInterval = 0
> agent.sinks.testsink.hdfs.rollCount = 3
> agent.sinks.testsink.serializer = avro_event
> agent.sinks.testsink.serializer.compressionCodec = snappy
> agent.sinks.testsink.hdfs.fileSuffix = .avro
> agent.sinks.testsink.hdfs.fileType = DataStream
> agent.sinks.testsink.hdfs.batchSize = 2
> agent.sinks.testsink.hdfs.writeFormat = Text
> agent.sinks.testsink.hdfs.idleTimeout=20
> agent.sinks.testsink.channel = testch
> agent.channels.testch.type = memory
> {noformat}
> After ingesting 6 events ("a" through "f"), 2 files were created on HDFS, as expected. But events are missing when listing the contents in the Spark shell:
> {noformat}
> scala> sqlContext.read.avro("/user/flume/test/FlumeData.14908867127*.avro").collect().map(a => new String(a(1).asInstanceOf[Array[Byte]])).foreach(println)
> a
> b
> d
> {noformat}
> {{hdfs fsck}} also confirms that the blocks are still in the {{UNDER_CONSTRUCTION}} state:
> {noformat}
> $ hdfs fsck /user/flume/test/ -openforwrite -files -blocks
> FSCK started by root (auth:SIMPLE) from /172.31.114.3 for path /user/flume/test/ at Thu Mar 30 08:23:56 PDT 2017
> /user/flume/test/ <dir>
> /user/flume/test/FlumeData.1490887185312.avro 310 bytes, 1 block(s), OPENFORWRITE:  MISSING 1 blocks of total size 310 B
> 0. BP-1285398861-172.31.114.3-1489845696835:blk_1073761923_21128{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-e0d04bef-a861-40b0-99dd-27bfb2871ecd:NORMAL:172.31.114.27:20002|RBW], ReplicaUnderConstruction[[DISK]DS-d1979e0c-db81-4790-b225-ae8a4cf42dd8:NORMAL:172.31.114.32:20002|RBW], ReplicaUnderConstruction[[DISK]DS-ca00550d-702e-4892-a54a-7105af0c19ee:NORMAL:172.31.114.24:20002|RBW]]} len=310 MISSING!
> /user/flume/test/FlumeData.1490887185313.avro 292 bytes, 1 block(s), OPENFORWRITE:  MISSING 1 blocks of total size 292 B
> 0. BP-1285398861-172.31.114.3-1489845696835:blk_1073761924_21129{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ca00550d-702e-4892-a54a-7105af0c19ee:NORMAL:172.31.114.24:20002|RBW], ReplicaUnderConstruction[[DISK]DS-e0d04bef-a861-40b0-99dd-27bfb2871ecd:NORMAL:172.31.114.27:20002|RBW], ReplicaUnderConstruction[[DISK]DS-d1979e0c-db81-4790-b225-ae8a4cf42dd8:NORMAL:172.31.114.32:20002|RBW]]} len=292 MISSING!
> {noformat}
> These blocks need to be recovered by starting a lease recovery process on the NameNode (which will then run the block recovery as well). This can be triggered programmatically via the DFSClient.
> Adding this call when the close fails solves the issue; see the sketch below.
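> A minimal sketch of the idea (this is not the committed patch; the helper class and method names are hypothetical, but {{DistributedFileSystem.recoverLease(Path)}} is the public API that delegates to the DFSClient):
> {code:java}
> import java.io.IOException;
>
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
>
> // Hypothetical helper, for illustration only: after a failed close(), ask
> // the NameNode to start lease recovery so the file's last block does not
> // stay UNDER_CONSTRUCTION until the 1 hour hard limit expires.
> public final class LeaseRecoveryHelper {
>   private LeaseRecoveryHelper() {}
>
>   /** Returns true if the file is already closed, false if recovery was started. */
>   public static boolean recoverLease(FileSystem fs, Path path) throws IOException {
>     if (!(fs instanceof DistributedFileSystem)) {
>       return true; // nothing to recover on non-HDFS file systems
>     }
>     // recoverLease() goes through the DFSClient and asks the NameNode to
>     // begin lease (and hence block) recovery. If it returns false, recovery
>     // runs asynchronously and the NameNode will close the file when done.
>     return ((DistributedFileSystem) fs).recoverLease(path);
>   }
> }
> {code}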
> cc [~jojochuang]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
