Hi Ed,

Thanks for your response. I was afraid the answer would be to write my own serializer; I'm not the most expert Java programmer :P
But I think that is the only solution. Reading more in the docs:

"This deserializer is able to read an Avro container file, and it generates one
event per Avro record in the file. Each event is annotated with a header that
indicates the schema used. The body of the event is the binary Avro record data,
not including the schema or the rest of the container file elements."

So I tested with deserializer.schemaType = LITERAL and I can see a JSON header with the schema, and in the body I can see the binary data of the values.

So I think it should be "easy" to write a serializer based on an example I found:
https://github.com/brockn/avro-flume-hive-example/blob/master/src/main/java/com/cloudera/flume/serialization/FlumeEventStringBodyAvroEventSerializer.java

(I have pasted a rough, untested sketch of what I have in mind at the bottom of this message, below the quoted thread.)

I was hoping that a general Avro serializer already existed, since there is a general Avro deserializer, which I am using in the SpoolDir source.

I will post here if I come up with a solution.

Thanks

On Feb 6, 2014, at 9:10 PM, ed <[email protected]> wrote:

> Hi Daniel,
>
> I think you will need to write a custom event serializer for the HDFSSink
> that extends AbstractAvroEventSerializer to write out your data using your
> specific Avro Schema. Then in your agent configuration add it like this:
>
> a1.sinks.sink1.serializer = com.yourpackagename.CustomAvroEventSerializer$Builder
>
> As a quick test you can use the default avro serializer
> (https://flume.apache.org/FlumeUserGuide.html#avro-event-serializer) like so:
>
> a1.sinks.sink1.serializer = avro_event
>
> I think this will end up just wrapping your avro data in Flume's default
> schema, but at least you can see if valid avro files are getting written to
> HDFS. Hope that gets you a little closer.
>
> Best,
>
> Ed
>
>
> On Fri, Feb 7, 2014 at 11:51 AM, Daniel Rodriguez <[email protected]> wrote:
> Hi all,
>
> I have users writing AVRO files on different servers and I want to use Flume
> to move all those files into HDFS, so I can later use Hive or Pig to
> query/analyse the data.
>
> On the client I installed Flume with a SpoolDir source and an AVRO sink like
> this:
>
> a1.sources = src1
> a1.sinks = sink1
> a1.channels = c1
>
> a1.channels.c1.type = memory
>
> a1.sources.src1.type = spooldir
> a1.sources.src1.channels = c1
> a1.sources.src1.spoolDir = {directory}
> a1.sources.src1.fileHeader = true
> a1.sources.src1.deserializer = avro
>
> a1.sinks.sink1.type = avro
> a1.sinks.sink1.channel = c1
> a1.sinks.sink1.hostname = {IP}
> a1.sinks.sink1.port = 41414
>
> On the Hadoop cluster I have this AVRO source and HDFS sink:
>
> a1.sources = avro1
> a1.sinks = sink1
> a1.channels = c1
>
> a1.channels.c1.type = memory
>
> a1.sources.avro1.type = avro
> a1.sources.avro1.channels = c1
> a1.sources.avro1.bind = 0.0.0.0
> a1.sources.avro1.port = 41414
>
> a1.sinks.sink1.type = hdfs
> a1.sinks.sink1.channel = c1
> a1.sinks.sink1.hdfs.path = {hdfs dir}
> a1.sinks.sink1.hdfs.fileSuffix = .avro
> a1.sinks.sink1.hdfs.rollSize = 67108864
> a1.sinks.sink1.hdfs.fileType = DataStream
>
> The problem is that the files on HDFS are not valid AVRO files! I am using
> the Hue UI to check whether a file is a valid AVRO file or not. If I upload
> an AVRO file that I generated on my PC to the cluster, I can see its contents
> perfectly, and even create a Hive table and query it, but the files I send
> via Flume are not valid AVRO files.
>
> I tried the flume avro client that is included in flume, but that didn't work
> because it sends one flume event per line, breaking the avro files, so I fixed
> that by using the spooldir source with deserializer = avro. So I think the
> problem is in the HDFS sink when it is writing the files.
>
> Using hdfs.fileType = DataStream it writes only the values from the avro
> fields, not the whole avro file, losing all the schema information. If I use
> hdfs.fileType = SequenceFile the files are not valid for some reason.
>
> I appreciate any help.
>
> Thanks,
>
> Daniel
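P.S. To make the idea concrete, here is a rough, untested sketch of the kind of serializer I have in mind. I am implementing the EventSerializer interface directly instead of extending AbstractAvroEventSerializer, because I need to take the schema from the event header rather than hard-coding it. The package and class names are placeholders, and I am assuming the spooldir avro deserializer puts the writer schema in a "flume.avro.schema.literal" header when deserializer.schemaType = LITERAL (that is the header name I see in my events):

package com.yourpackagename;

import java.io.IOException;
import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.EventSerializer;

/**
 * Sketch of an HDFS sink serializer that turns Flume events whose bodies are
 * single binary-encoded Avro records (as produced by the spooldir avro
 * deserializer) back into a proper Avro container file.
 */
public class CustomAvroEventSerializer implements EventSerializer {

  private final OutputStream out;
  private Schema schema;                         // taken from the first event's header
  private GenericDatumReader<GenericRecord> datumReader;
  private DataFileWriter<GenericRecord> dataFileWriter;

  private CustomAvroEventSerializer(OutputStream out) {
    this.out = out;
  }

  @Override
  public void afterCreate() throws IOException {
    // The container file is created lazily in write(), once the first
    // event's schema header has been seen.
  }

  @Override
  public void afterReopen() throws IOException {
    throw new UnsupportedOperationException("Avro container files cannot be appended to");
  }

  @Override
  public void write(Event event) throws IOException {
    if (dataFileWriter == null) {
      // Assumes deserializer.schemaType = LITERAL on the spooldir source,
      // which should put the JSON schema in this header.
      String literal = event.getHeaders().get("flume.avro.schema.literal");
      schema = new Schema.Parser().parse(literal);
      datumReader = new GenericDatumReader<GenericRecord>(schema);
      dataFileWriter = new DataFileWriter<GenericRecord>(
          new GenericDatumWriter<GenericRecord>(schema));
      dataFileWriter.create(schema, out);
    }
    // The event body is one binary Avro record, without any container framing.
    GenericRecord record = datumReader.read(null,
        DecoderFactory.get().binaryDecoder(event.getBody(), null));
    dataFileWriter.append(record);
  }

  @Override
  public void flush() throws IOException {
    if (dataFileWriter != null) {
      dataFileWriter.flush();
    }
  }

  @Override
  public void beforeClose() throws IOException {
    // Flush the last Avro block; the HDFS sink closes the stream itself.
    flush();
  }

  @Override
  public boolean supportsReopen() {
    return false;
  }

  public static class Builder implements EventSerializer.Builder {
    @Override
    public EventSerializer build(Context context, OutputStream out) {
      return new CustomAvroEventSerializer(out);
    }
  }
}

If something like that works, the sink would be configured the way you described:

a1.sinks.sink1.serializer = com.yourpackagename.CustomAvroEventSerializer$Builder

with deserializer.schemaType = LITERAL on the spooldir source so the schema header is present.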
