Hi Daniel,

I think you will need to write a custom event serializer for the HDFS sink that extends AbstractAvroEventSerializer and writes the data out using your specific Avro schema.
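Something along these lines might work (a rough sketch only -- the package name, class name, and the /myrecord.avsc schema file are placeholders, and it assumes the spooldir source's avro deserializer is handing you events whose bodies are Avro-binary records matching that schema):

package com.yourpackagename;

import java.io.IOException;
import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.AbstractAvroEventSerializer;
import org.apache.flume.serialization.EventSerializer;

public class CustomAvroEventSerializer
    extends AbstractAvroEventSerializer<GenericRecord> {

  private final OutputStream out;
  private final Schema schema;
  private final GenericDatumReader<GenericRecord> reader;

  private CustomAvroEventSerializer(OutputStream out) throws IOException {
    this.out = out;
    // Placeholder: load your own schema from the classpath (or hard-code it).
    this.schema = new Schema.Parser().parse(
        getClass().getResourceAsStream("/myrecord.avsc"));
    this.reader = new GenericDatumReader<GenericRecord>(schema);
  }

  @Override
  protected OutputStream getOutputStream() {
    return out;
  }

  @Override
  protected Schema getSchema() {
    return schema;
  }

  @Override
  protected GenericRecord convert(Event event) {
    try {
      // Decode the Avro-encoded event body back into a record so the
      // parent class can write it into a proper Avro container file.
      BinaryDecoder decoder =
          DecoderFactory.get().binaryDecoder(event.getBody(), null);
      return reader.read(null, decoder);
    } catch (IOException e) {
      throw new RuntimeException("Failed to deserialize event body", e);
    }
  }

  // The HDFS sink looks up this nested Builder by the name you put in the
  // sink's "serializer" property.
  public static class Builder implements EventSerializer.Builder {
    @Override
    public EventSerializer build(Context context, OutputStream out) {
      try {
        CustomAvroEventSerializer serializer =
            new CustomAvroEventSerializer(out);
        serializer.configure(context);
        return serializer;
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
  }
}

You'll also need to put the jar containing that class on the Flume agent's classpath (for example under plugins.d) so the sink can load the Builder.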
Then in your agent configuration add it like this:

a1.sinks.sink1.serializer = com.yourpackagename.CustomAvroEventSerializer$Builder

As a quick test you can use the default avro serializer (https://flume.apache.org/FlumeUserGuide.html#avro-event-serializer) like so:

a1.sinks.sink1.serializer = avro_event

I think this will end up just wrapping your avro data in Flume's default schema, but at least you can see whether valid avro files are getting written to HDFS.

Hope that gets you a little closer.

Best,
Ed

On Fri, Feb 7, 2014 at 11:51 AM, Daniel Rodriguez <[email protected]> wrote:

> Hi all,
>
> I have users writing AVRO files on different servers and I want to use
> Flume to move all those files into HDFS, so I can later use Hive or Pig
> to query/analyse the data.
>
> On the client I installed Flume and have a SpoolDir source and AVRO sink
> like this:
>
> a1.sources = src1
> a1.sinks = sink1
> a1.channels = c1
>
> a1.channels.c1.type = memory
>
> a1.sources.src1.type = spooldir
> a1.sources.src1.channels = c1
> a1.sources.src1.spoolDir = {directory}
> a1.sources.src1.fileHeader = true
> a1.sources.src1.deserializer = avro
>
> a1.sinks.sink1.type = avro
> a1.sinks.sink1.channel = c1
> a1.sinks.sink1.hostname = {IP}
> a1.sinks.sink1.port = 41414
>
> On the hadoop cluster I have this AVRO source and HDFS sink:
>
> a1.sources = avro1
> a1.sinks = sink1
> a1.channels = c1
>
> a1.channels.c1.type = memory
>
> a1.sources.avro1.type = avro
> a1.sources.avro1.channels = c1
> a1.sources.avro1.bind = 0.0.0.0
> a1.sources.avro1.port = 41414
>
> a1.sinks.sink1.type = hdfs
> a1.sinks.sink1.channel = c1
> a1.sinks.sink1.hdfs.path = {hdfs dir}
> a1.sinks.sink1.hdfs.fileSuffix = .avro
> a1.sinks.sink1.hdfs.rollSize = 67108864
> a1.sinks.sink1.hdfs.fileType = DataStream
>
> The problem is that the files on HDFS are not valid AVRO files! I am using
> the Hue UI to check whether a file is a valid AVRO file or not. If I
> upload an AVRO file that I generated on my PC to the cluster I can see its
> contents perfectly, and even create a Hive table and query it, but the
> files I send via Flume are not valid AVRO files.
>
> I tried the flume avro client that is included in flume, but it didn't
> work because it sends a flume event per line, breaking the avro files, so
> I fixed that using the spooldir source with deserializer = avro. So I
> think the problem is in the HDFS sink when it is writing the files.
>
> Using hdfs.fileType = DataStream it writes the values from the avro
> fields, not the whole avro file, losing all the schema information. If I
> use hdfs.fileType = SequenceFile the files are not valid for some reason.
>
> I appreciate any help.
>
> Thanks,
>
> Daniel
