I’ve pulled over all of the Hadoop jar files for my Flume instance to use, and
I am now seeing some slightly different errors. Basically, I have two
identically configured Hadoop instances on the same subnet. Running Flume on
those same instances and pointing it at the local Hadoop/HDFS instance works
fine and the files get written. However, when I point it at the adjacent
Hadoop/HDFS instance I get many exceptions/errors (shown below) and the files
never get written. Here is my HDFS sink configuration on 10.0.0.14:
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://10.0.0.16:9000/tmp/
a1.sinks.k1.hdfs.filePrefix = twitter
a1.sinks.k1.hdfs.fileSuffix = .ds
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 10
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = DataStream
#a1.sinks.k1.serializer = TEXT
a1.sinks.k1.channel = c1
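
For comparison, the only thing that changes between the working (local) run
and the failing (remote) run is the path; everything else is identical.
Assuming the local NameNode listens on the same port, the working variant is:

# same sink pointed at the local HDFS instance (10.0.0.14) -- this one works
a1.sinks.k1.hdfs.path = hdfs://10.0.0.14:9000/tmp/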
Any idea why this is not working?
Thanks.
01 Oct 2014 01:59:45,098 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor]
(org.apache.flume.sink.hdfs.HDFSDataStream.configure:58) - Serializer = TEXT,
UseRawLocalFileSystem = false
01 Oct 2014 01:59:45,385 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor]
(org.apache.flume.sink.hdfs.BucketWriter.open:261) - Creating
hdfs://10.0.0.16:9000/tmp//twitter.1412128785099.ds.tmp
01 Oct 2014 01:59:45,997 INFO [Twitter4J Async Dispatcher[0]]
(org.apache.flume.source.twitter.TwitterSource.onStatus:178) - Processed 100
docs
01 Oct 2014 01:59:47,754 INFO [Twitter4J Async Dispatcher[0]]
(org.apache.flume.source.twitter.TwitterSource.onStatus:178) - Processed 200
docs
01 Oct 2014 01:59:49,379 INFO [Thread-7]
(org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream:1378)
- Exception in createBlockOutputStream
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1987)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1346)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1272)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:525)
01 Oct 2014 01:59:49,390 INFO [Thread-7]
(org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream:1275)
- Abandoning BP-1768727495-127.0.0.1-1412117897373:blk_1073743575_2751
01 Oct 2014 01:59:49,398 INFO [Thread-7]
(org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream:1278)
- Excluding datanode 127.0.0.1:50010
01 Oct 2014 01:59:49,431 WARN [Thread-7]
(org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run:627) - DataStreamer
Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/tmp/twitter.1412128785099.ds.tmp could only be replicated to 0 nodes instead
of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are
excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1430)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2684)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
    at org.apache.hadoop.ipc.Client.call(Client.java:1410)
    at org.apache.hadoop.ipc.Client.call(Client.java:1363)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy18.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
    at com.sun.proxy.$Proxy18.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:361)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1439)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1261)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:525)
01 Oct 2014 01:59:49,437 WARN [hdfs-k1-call-runner-2]
(org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync:1950) - Error while syncing
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/tmp/twitter.1412128785099.ds.tmp could only be replicated to 0 nodes instead
of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are
excluded in this operation.
    (stack trace identical to the one above)
01 Oct 2014 01:59:49,439 WARN [SinkRunner-PollingRunner-DefaultSinkProcessor]
(org.apache.flume.sink.hdfs.HDFSEventSink.process:463) - HDFS IO error
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/tmp/twitter.1412128785099.ds.tmp could only be replicated to 0 nodes instead
of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are
excluded in this operation.
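
One thing I notice in the trace: the remote NameNode appears to be handing
back 127.0.0.1:50010 as the DataNode address ("Excluding datanode
127.0.0.1:50010" above), which a remote client obviously can't reach. If that
is the problem, my guess is it could be worked around with something like the
following in the client's hdfs-site.xml (dfs.client.use.datanode.hostname is a
standard HDFS property; whether it applies here is an assumption on my part):

<property>
  <!-- Ask the NameNode for DataNode hostnames instead of (possibly loopback)
       IP addresses; the DataNode side has a matching
       dfs.datanode.use.datanode.hostname switch. -->
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>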
On Sep 30, 2014, at 3:18 PM, Hari Shreedharan <[email protected]> wrote:
> You'd need to add the jars that Hadoop itself depends on. Flume pulls them in
> if Hadoop is installed on that machine; otherwise you'd need to download and
> install them manually. If you are using Hadoop 2.x, install the RPM provided
> by Bigtop.
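>
> For example, if there is a Hadoop install (or an unpacked tarball) on the
> Flume box, something like this in flume-env.sh puts the whole Hadoop
> dependency tree on Flume's classpath (paths illustrative; Hadoop 2.x tarball
> layout assumed):
>
> # flume-env.sh -- add Hadoop's jars and all of their dependencies
> export FLUME_CLASSPATH="/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/hdfs/lib/*"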
>
> On Tue, Sep 30, 2014 at 12:12 PM, Ed Judge <[email protected]> wrote:
> I added commons-configuration and there is now another missing dependency.
> What do you mean by “all of Hadoop’s dependencies”?
>
>
> On Sep 30, 2014, at 2:51 PM, Hari Shreedharan <[email protected]>
> wrote:
>
>> You actually need to add all of Hadoop’s dependencies to the Flume
>> classpath. It looks like Apache Commons Configuration is missing from the
>> classpath.
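>>
>> For example, assuming a standard Hadoop 2.x layout, that jar ships with
>> Hadoop itself and can be copied straight into Flume's lib directory
>> (HADOOP_HOME and FLUME_HOME are placeholders for your install paths):
>>
>> # commons-configuration is bundled under Hadoop's common/lib
>> cp $HADOOP_HOME/share/hadoop/common/lib/commons-configuration-*.jar $FLUME_HOME/lib/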
>>
>> Thanks,
>> Hari
>>
>>
>> On Tue, Sep 30, 2014 at 11:48 AM, Ed Judge <[email protected]> wrote:
>>
>> Thank you. I am using Hadoop 2.5, which I think uses protobuf-java-2.5.0.jar.
>>
>> I am getting the following error even after adding those two jar files to my
>> flume-ng classpath:
>>
>> 30 Sep 2014 18:27:03,269 INFO [lifecycleSupervisor-1-0]
>> (org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start:61)
>> - Configuration provider starting
>> 30 Sep 2014 18:27:03,278 INFO [conf-file-poller-0]
>> (org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run:133)
>> - Reloading configuration file:./src.conf
>> 30 Sep 2014 18:27:03,288 INFO [conf-file-poller-0]
>> (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1016)
>> - Processing:k1
>> 30 Sep 2014 18:27:03,289 INFO [conf-file-poller-0]
>> (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:930)
>> - Added sinks: k1 Agent: a1
>> 30 Sep 2014 18:27:03,289 INFO [conf-file-poller-0]
>> (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1016)
>> - Processing:k1
>> 30 Sep 2014 18:27:03,292 WARN [conf-file-poller-0]
>> (org.apache.flume.conf.FlumeConfiguration.<init>:101) - Configuration
>> property ignored: i# = Describe the sink
>> 30 Sep 2014 18:27:03,292 INFO [conf-file-poller-0]
>> (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1016)
>> - Processing:k1
>> 30 Sep 2014 18:27:03,292 INFO [conf-file-poller-0]
>> (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1016)
>> - Processing:k1
>> 30 Sep 2014 18:27:03,293 INFO [conf-file-poller-0]
>> (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1016)
>> - Processing:k1
>> 30 Sep 2014 18:27:03,293 INFO [conf-file-poller-0]
>> (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1016)
>> - Processing:k1
>> 30 Sep 2014 18:27:03,293 INFO [conf-file-poller-0]
>> (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1016)
>> - Processing:k1
>> 30 Sep 2014 18:27:03,293 INFO [conf-file-poller-0]
>> (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1016)
>> - Processing:k1
>> 30 Sep 2014 18:27:03,293 INFO [conf-file-poller-0]
>> (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1016)
>> - Processing:k1
>> 30 Sep 2014 18:27:03,312 INFO [conf-file-poller-0]
>> (org.apache.flume.conf.FlumeConfiguration.validateConfiguration:140) -
>> Post-validation flume configuration contains configuration for agents: [a1]
>> 30 Sep 2014 18:27:03,312 INFO [conf-file-poller-0]
>> (org.apache.flume.node.AbstractConfigurationProvider.loadChannels:150) -
>> Creating channels
>> 30 Sep 2014 18:27:03,329 INFO [conf-file-poller-0]
>> (org.apache.flume.channel.DefaultChannelFactory.create:40) - Creating
>> instance of channel c1 type memory
>> 30 Sep 2014 18:27:03,351 INFO [conf-file-poller-0]
>> (org.apache.flume.node.AbstractConfigurationProvider.loadChannels:205) -
>> Created channel c1
>> 30 Sep 2014 18:27:03,352 INFO [conf-file-poller-0]
>> (org.apache.flume.source.DefaultSourceFactory.create:39) - Creating
>> instance of source r1, type org.apache.flume.source.twitter.TwitterSource
>> 30 Sep 2014 18:27:03,363 INFO [conf-file-poller-0]
>> (org.apache.flume.source.twitter.TwitterSource.configure:110) - Consumer
>> Key: 'tobhMtidckJoe1tByXDmI4pW3'
>> 30 Sep 2014 18:27:03,363 INFO [conf-file-poller-0]
>> (org.apache.flume.source.twitter.TwitterSource.configure:111) - Consumer
>> Secret: '6eZKRpd6JvGT3Dg9jtd9fG9UMEhBzGxoLhLUGP1dqzkKznrXuQ'
>> 30 Sep 2014 18:27:03,363 INFO [conf-file-poller-0]
>> (org.apache.flume.source.twitter.TwitterSource.configure:112) - Access
>> Token: '1588514408-o36mOSbXYCVacQ3p6Knsf6Kho17iCwNYLZyA9V5'
>> 30 Sep 2014 18:27:03,364 INFO [conf-file-poller-0]
>> (org.apache.flume.source.twitter.TwitterSource.configure:113) - Access
>> Token Secret: 'vBtp7wKsi2BOQqZSBpSBQSgZcc93oHea38T9OdckDCLKn'
>> 30 Sep 2014 18:27:03,825 INFO [conf-file-poller-0]
>> (org.apache.flume.sink.DefaultSinkFactory.create:40) - Creating instance of
>> sink: k1, type: hdfs
>> 30 Sep 2014 18:27:03,874 ERROR [conf-file-poller-0]
>> (org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run:145)
>> - Failed to start agent because dependencies were not found in classpath.
>> Error follows.
>> java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
>>     at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:38)
>>     at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:36)
>>     at org.apache.hadoop.security.UserGroupInformation$UgiMetrics.create(UserGroupInformation.java:106)
>>     at org.apache.hadoop.security.UserGroupInformation.<clinit>(UserGroupInformation.java:208)
>>     at org.apache.flume.sink.hdfs.HDFSEventSink.authenticate(HDFSEventSink.java:553)
>>     at org.apache.flume.sink.hdfs.HDFSEventSink.configure(HDFSEventSink.java:272)
>>     at org.apache.flume.conf.Configurables.configure(Configurables.java:41)
>>     at org.apache.flume.node.AbstractConfigurationProvider.loadSinks(AbstractConfigurationProvider.java:418)
>>     at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:103)
>>     at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:140)
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.lang.ClassNotFoundException: org.apache.commons.configuration.Configuration
>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>     ... 17 more
>> 30 Sep 2014 18:27:33,491 INFO [agent-shutdown-hook]
>> (org.apache.flume.lifecycle.LifecycleSupervisor.stop:79) - Stopping
>> lifecycle supervisor 10
>> 30 Sep 2014 18:27:33,493 INFO [agent-shutdown-hook]
>> (org.apache.flume.node.PollingPropertiesFileConfigurationProvider.stop:83)
>> - Configuration provider stopping
>> [vagrant@localhost 6]$
>>
>> Is there another jar file I need?
>>
>> Thanks.
>>
>> On Sep 29, 2014, at 9:04 PM, shengyi.pan <[email protected]> wrote:
>>
>>> You need hadoop-common-x.x.x.jar and hadoop-hdfs-x.x.x.jar on your
>>> flume-ng classpath, and the version of those Hadoop jars must match your
>>> Hadoop system.
>>>
>>> If you sink to hadoop-2.0.0, you should use "protobuf-java-2.4.1.jar" (by
>>> default, flume-1.5.0 ships "protobuf-java-2.5.0.jar" in its lib directory),
>>> because the protobuf interface of hdfs-2.0 is compiled with protobuf-2.4;
>>> with protobuf-2.5, flume-ng will fail to start.
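>>>
>>> For example (jar versions illustrative -- match them to your cluster, and
>>> adjust HADOOP_HOME/FLUME_HOME to your installs):
>>>
>>> # put the HDFS client jars where flume-ng can see them
>>> cp $HADOOP_HOME/share/hadoop/common/hadoop-common-2.0.0.jar $FLUME_HOME/lib/
>>> cp $HADOOP_HOME/share/hadoop/hdfs/hadoop-hdfs-2.0.0.jar $FLUME_HOME/lib/
>>> # for a hadoop-2.0.0 sink, swap Flume's bundled protobuf 2.5 for 2.4.1
>>> mv $FLUME_HOME/lib/protobuf-java-2.5.0.jar /tmp/
>>> cp protobuf-java-2.4.1.jar $FLUME_HOME/lib/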
>>>
>>>
>>>
>>>
>>> 2014-09-30
>>> shengyi.pan
>>> From: Ed Judge <[email protected]>
>>> Date: 2014-09-29 22:38
>>> Subject: HDFS sink to a remote HDFS node
>>> To: "[email protected]" <[email protected]>
>>> Cc:
>>>
>>> I am trying to run the flume-ng agent on one node with an HDFS sink
>>> pointing to an HDFS filesystem on another node. Is this possible? What
>>> packages/jar files are needed on the flume agent node for this to work? A
>>> secondary goal is to install only what is needed on the flume-ng node.
>>>
>>> # Describe the sink
>>> a1.sinks.k1.type = hdfs
>>> a1.sinks.k1.hdfs.path = hdfs://<remote IP address>/tmp/
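>>>
>>> (I assume the NameNode's RPC port also belongs in the URI when it isn't
>>> the default, e.g.:)
>>>
>>> a1.sinks.k1.hdfs.path = hdfs://<remote IP address>:9000/tmp/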
>>>
>>>
>>> Thanks,
>>> Ed
>>
>>
>
>