[ https://issues.apache.org/jira/browse/FLUME-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169844#comment-14169844 ]
Hari Shreedharan commented on FLUME-2451:
-----------------------------------------

Roshan:
- Do you think we should close *all* bucket writers on an HDFS IOException? Is there any specific way to identify if the namenode is down? We don't really need to close all bucket writers in all cases.
- The sfWritersMap is guarded by a lock - you should use the locking mechanism in this method too.

Also, the 2nd patch only has the unit test, not the code fix. Can you merge the two patches? Thanks!

> HDFS Sink Cannot Reconnect After NameNode Restart
> -------------------------------------------------
>
>                 Key: FLUME-2451
>                 URL: https://issues.apache.org/jira/browse/FLUME-2451
>             Project: Flume
>          Issue Type: Bug
>          Components: File Channel, Sinks+Sources
>    Affects Versions: v1.4.0
>         Environment: 8 node CDH 4.2.2 (2.0.0-cdh4.2.2) cluster
>                      All cluster machines are running Ubuntu 12.04 x86_64
>            Reporter: Andrew O'Neill
>            Assignee: Roshan Naik
>              Labels: HDFS, Sink
>         Attachments: FLUME-2451.patch, FLUME-2451.v2.patch
>
>
> I am testing a simple flume setup with a Sequence Generator Source, a File
> Channel, and an HDFS Sink (see my flume.conf below). This configuration works
> as expected until I reboot the cluster's NameNode or until I restart the HDFS
> service on the cluster. At this point, it appears that the Flume Agent cannot
> reconnect to HDFS and must be manually restarted.
> Here is our flume.conf:
>
> appserver.sources = rawtext
> appserver.channels = testchannel
> appserver.sinks = test_sink
> appserver.sources.rawtext.type = seq
> appserver.sources.rawtext.channels = testchannel
> appserver.channels.testchannel.type = file
> appserver.channels.testchannel.capacity = 10000000
> appserver.channels.testchannel.minimumRequiredSpace = 214748364800
> appserver.channels.testchannel.checkpointDir = /Users/aoneill/Desktop/testchannel/checkpoint
> appserver.channels.testchannel.dataDirs = /Users/aoneill/Desktop/testchannel/data
> appserver.channels.testchannel.maxFileSize = 20000000
> appserver.sinks.test_sink.type = hdfs
> appserver.sinks.test_sink.channel = testchannel
> appserver.sinks.test_sink.hdfs.path = hdfs://cluster01:8020/user/aoneill/flumetest
> appserver.sinks.test_sink.hdfs.closeTries = 3
> appserver.sinks.test_sink.hdfs.filePrefix = events-
> appserver.sinks.test_sink.hdfs.fileSuffix = .avro
> appserver.sinks.test_sink.hdfs.fileType = DataStream
> appserver.sinks.test_sink.hdfs.writeFormat = Text
> appserver.sinks.test_sink.hdfs.inUsePrefix = inuse-
> appserver.sinks.test_sink.hdfs.inUseSuffix = .avro
> appserver.sinks.test_sink.hdfs.rollCount = 100000
> appserver.sinks.test_sink.hdfs.rollInterval = 30
> appserver.sinks.test_sink.hdfs.rollSize = 10485760
>
> These are the two error messages that the Flume Agent outputs constantly after
> the restart:
>
> 2014-08-26 10:47:24,572 (SinkRunner-PollingRunner-DefaultSinkProcessor)
> [ERROR - org.apache.flume.sink.hdfs.AbstractHDFSWriter.isUnderReplicated(AbstractHDFSWriter.java:96)]
> Unexpected error while checking replication factor
> java.lang.reflect.InvocationTargetException
>         at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.flume.sink.hdfs.AbstractHDFSWriter.getNumCurrentReplicas(AbstractHDFSWriter.java:162)
>         at org.apache.flume.sink.hdfs.AbstractHDFSWriter.isUnderReplicated(AbstractHDFSWriter.java:82)
>         at org.apache.flume.sink.hdfs.BucketWriter.shouldRotate(BucketWriter.java:452)
>         at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:387)
>         at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:392)
>         at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
>         at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>         at java.lang.Thread.run(Thread.java:744)
> Caused by: java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
>         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:525)
>         at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1253)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:891)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:881)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:982)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:779)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
>
> and
>
> 2014-08-26 10:47:29,592 (SinkRunner-PollingRunner-DefaultSinkProcessor)
> [WARN - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:418)]
> HDFS IO error
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
>         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:525)
>         at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1253)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:891)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:881)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:982)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:779)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
>
> I can provide additional information if needed. Thank you very much for any
> insight you are able to provide into this problem.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
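For illustration of the review comment above (closing the cached bucket writers under the same lock that guards sfWritersMap, so the next append reopens a fresh HDFS lease/pipeline after the NameNode comes back), here is a minimal standalone Java sketch. This is not the actual Flume source: the class and field names (CloseAllWritersSketch, sfWriters, sfWritersLock) and the stub BucketWriter are assumptions made for the example, and the real BucketWriter lifecycle in org.apache.flume.sink.hdfs is considerably richer.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Minimal stand-in for Flume's BucketWriter (hypothetical stub).
class BucketWriter {
    private boolean open = true;
    void close() throws IOException { open = false; }
    boolean isOpen() { return open; }
}

public class CloseAllWritersSketch {
    // In the sink, the writer map is guarded by a lock; field names
    // here are assumptions, not the actual HDFSEventSink fields.
    private final ReentrantLock sfWritersLock = new ReentrantLock();
    private final Map<String, BucketWriter> sfWriters = new HashMap<>();

    void put(String path, BucketWriter w) {
        sfWritersLock.lock();
        try { sfWriters.put(path, w); } finally { sfWritersLock.unlock(); }
    }

    int size() {
        sfWritersLock.lock();
        try { return sfWriters.size(); } finally { sfWritersLock.unlock(); }
    }

    // On an HDFS IOException, close every cached writer under the same
    // lock that guards the map, then drop them so the next append
    // creates a fresh writer (and therefore a fresh pipeline).
    void closeAllWritersOnHdfsError() {
        sfWritersLock.lock();
        try {
            for (BucketWriter writer : sfWriters.values()) {
                try {
                    writer.close();
                } catch (IOException e) {
                    // Best effort: keep closing the remaining writers.
                }
            }
            sfWriters.clear();
        } finally {
            sfWritersLock.unlock();
        }
    }

    public static void main(String[] args) {
        CloseAllWritersSketch sink = new CloseAllWritersSketch();
        sink.put("/flumetest/events-1", new BucketWriter());
        sink.put("/flumetest/events-2", new BucketWriter());
        sink.closeAllWritersOnHdfsError();
        System.out.println(sink.size()); // prints 0: all writers closed and dropped
    }
}
```

Holding sfWritersLock for the whole sweep addresses the race the reviewer points out; whether to sweep on every IOException or only when the NameNode is confirmed down is exactly the open question in the comment.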