[ https://issues.apache.org/jira/browse/FLUME-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962093#comment-15962093 ]
ASF subversion and git services commented on FLUME-3080:
---------------------------------------------------------

Commit e5c3e6aa76cf2b0bb0838ff6dcd3853656bff704 in flume's branch refs/heads/trunk from [~denes]
[ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=e5c3e6a ]

FLUME-3080. Close failure in HDFS Sink might cause data loss

If the HDFS Sink tries to close a file but it fails (e.g. due to timeout) the last block might not end up in COMPLETE state. In this case block recovery should happen but as the lease is still held by Flume the NameNode will start the recovery process only after the hard limit of 1 hour expires.

This change adds an explicit recoverLease() call in case of close failure.

This closes #127

Reviewers: Hari Shreedharan (Denes Arvay via Bessenyei Balázs Donát)
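
A minimal sketch of the idea in the commit message above, not the actual patch (which lives in the HDFS sink's writer classes): if close() fails, explicitly ask the NameNode to recover the lease via DistributedFileSystem.recoverLease() so block recovery can start right away instead of waiting for the one-hour hard limit. The closeAndRecoverLease() helper and its parameters are hypothetical.

{noformat}
import java.io.Closeable;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoveryExample {

  // Hypothetical helper (not the actual Flume BucketWriter code): close the
  // stream and, if close() fails, ask the NameNode to recover the lease so
  // the last block does not stay in UNDER_CONSTRUCTION state until the
  // 1 hour hard limit expires.
  static void closeAndRecoverLease(FileSystem fs, Path file, Closeable out)
      throws IOException {
    try {
      out.close();
    } catch (IOException closeFailure) {
      if (fs instanceof DistributedFileSystem) {
        // recoverLease() starts lease (and hence block) recovery; it returns
        // true if the file is already closed, false if recovery is still in
        // progress. Treated as best-effort here.
        ((DistributedFileSystem) fs).recoverLease(file);
      }
      throw closeFailure;
    }
  }
}
{noformat}
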
> Close failure in HDFS Sink might cause data loss
> ------------------------------------------------
>
>                 Key: FLUME-3080
>                 URL: https://issues.apache.org/jira/browse/FLUME-3080
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: 1.7.0
>            Reporter: Denes Arvay
>            Assignee: Denes Arvay
>            Priority: Blocker
>
> If the HDFS Sink tries to close a file but it fails (e.g. due to timeout) the last block might not end up in COMPLETE state. In this case block recovery should happen but as the lease is still held by Flume the NameNode will start the recovery process only after the hard limit of 1 hour expires.
> The lease recovery can be started manually by the {{hdfs debug recoverLease}} command.
> For reproduction I removed the close call from the {{BucketWriter}} (https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java#L380) to simulate the failure and used the following config:
> {noformat}
> agent.sources = testsrc
> agent.sinks = testsink
> agent.channels = testch
> agent.sources.testsrc.type = netcat
> agent.sources.testsrc.bind = localhost
> agent.sources.testsrc.port = 9494
> agent.sources.testsrc.channels = testch
> agent.sinks.testsink.type = hdfs
> agent.sinks.testsink.hdfs.path = /user/flume/test
> agent.sinks.testsink.hdfs.rollInterval = 0
> agent.sinks.testsink.hdfs.rollCount = 3
> agent.sinks.testsink.serializer = avro_event
> agent.sinks.testsink.serializer.compressionCodec = snappy
> agent.sinks.testsink.hdfs.fileSuffix = .avro
> agent.sinks.testsink.hdfs.fileType = DataStream
> agent.sinks.testsink.hdfs.batchSize = 2
> agent.sinks.testsink.hdfs.writeFormat = Text
> agent.sinks.testsink.hdfs.idleTimeout=20
> agent.sinks.testsink.channel = testch
> agent.channels.testch.type = memory
> {noformat}
> After ingesting 6 events ("a" - "f") 2 files were created on HDFS, as expected. But there are missing events when listing the contents in Spark shell:
> {noformat}
> scala> sqlContext.read.avro("/user/flume/test/FlumeData.14908867127*.avro").collect().map(a => new String(a(1).asInstanceOf[Array[Byte]])).foreach(println)
> a
> b
> d
> {noformat}
> {{hdfs fsck}} also confirms that the blocks are still in {{UNDER_CONSTRUCTION}} state:
> {noformat}
> $ hdfs fsck /user/flume/test/ -openforwrite -files -blocks
> FSCK started by root (auth:SIMPLE) from /172.31.114.3 for path /user/flume/test/ at Thu Mar 30 08:23:56 PDT 2017
> /user/flume/test/ <dir>
> /user/flume/test/FlumeData.1490887185312.avro 310 bytes, 1 block(s), OPENFORWRITE: MISSING 1 blocks of total size 310 B
> 0. BP-1285398861-172.31.114.3-1489845696835:blk_1073761923_21128{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-e0d04bef-a861-40b0-99dd-27bfb2871ecd:NORMAL:172.31.114.27:20002|RBW], ReplicaUnderConstruction[[DISK]DS-d1979e0c-db81-4790-b225-ae8a4cf42dd8:NORMAL:172.31.114.32:20002|RBW], ReplicaUnderConstruction[[DISK]DS-ca00550d-702e-4892-a54a-7105af0c19ee:NORMAL:172.31.114.24:20002|RBW]]} len=310 MISSING!
> /user/flume/test/FlumeData.1490887185313.avro 292 bytes, 1 block(s), OPENFORWRITE: MISSING 1 blocks of total size 292 B
> 0. BP-1285398861-172.31.114.3-1489845696835:blk_1073761924_21129{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ca00550d-702e-4892-a54a-7105af0c19ee:NORMAL:172.31.114.24:20002|RBW], ReplicaUnderConstruction[[DISK]DS-e0d04bef-a861-40b0-99dd-27bfb2871ecd:NORMAL:172.31.114.27:20002|RBW], ReplicaUnderConstruction[[DISK]DS-d1979e0c-db81-4790-b225-ae8a4cf42dd8:NORMAL:172.31.114.32:20002|RBW]]} len=292 MISSING!
> {noformat}
> These blocks need to be recovered by starting a lease recovery process on the NameNode (which will then run the block recovery as well). This can be triggered programmatically via the DFSClient.
> Adding this call if the close fails solves the issue.
> cc [~jojochuang]
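
The manual recovery mentioned in the quoted description can be run per file with {{hdfs debug recoverLease}}; a rough example against one of the open files from the fsck output above, with the optional {{-retries}} flag:

{noformat}
hdfs debug recoverLease -path /user/flume/test/FlumeData.1490887185312.avro -retries 3
{noformat}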