Hi Ed,

Thanks for your response. I was afraid the answer would be to write my own serializer; I'm not the most expert Java programmer :P
But I think that is the only solution. Reading more in the docs:

"This deserializer is able to read an Avro container file, and it generates one
event per Avro record in the file. Each event is annotated with a header that
indicates the schema used. The body of the event is the binary Avro record data,
not including the schema or the rest of the container file elements."

So I tested with deserializer.schemaType = LITERAL and I can see a JSON header with the schema, and in the body I can see the binary data of the values.

So I think it should be "easy" to write a serializer based on an example I found:
https://github.com/brockn/avro-flume-hive-example/blob/master/src/main/java/com/cloudera/flume/serialization/FlumeEventStringBodyAvroEventSerializer.java

(I have pasted a rough, untested sketch of what I have in mind at the bottom of this message, below the quoted thread.)

I was hoping that a general Avro serializer already existed, since there is a general Avro deserializer, which I am using in the SpoolDir source.

I will post here if I come up with a solution.

Thanks

On Feb 6, 2014, at 9:10 PM, ed <[email protected]> wrote:

> Hi Daniel,
>
> I think you will need to write a custom event serializer for the HDFSSink
> that extends AbstractAvroEventSerializer to write out your data using your
> specific Avro Schema. Then in your agent configuration add it like this:
>
> a1.sinks.sink1.serializer = com.yourpackagename.CustomAvroEventSerializer$Builder
>
> As a quick test you can use the default avro serializer
> (https://flume.apache.org/FlumeUserGuide.html#avro-event-serializer) like so:
>
> a1.sinks.sink1.serializer = avro_event
>
> I think this will end up just wrapping your avro data in Flume's default
> schema, but at least you can see if valid avro files are getting written to
> HDFS. Hope that gets you a little closer.
>
> Best,
>
> Ed
>
>
> On Fri, Feb 7, 2014 at 11:51 AM, Daniel Rodriguez <[email protected]> wrote:
> Hi all,
>
> I have users writing AVRO files on different servers and I want to use Flume
> to move all those files into HDFS, so I can later use Hive or Pig to
> query/analyse the data.
>
> On the client I installed Flume with a SpoolDir source and an AVRO sink like
> this:
>
> a1.sources = src1
> a1.sinks = sink1
> a1.channels = c1
>
> a1.channels.c1.type = memory
>
> a1.sources.src1.type = spooldir
> a1.sources.src1.channels = c1
> a1.sources.src1.spoolDir = {directory}
> a1.sources.src1.fileHeader = true
> a1.sources.src1.deserializer = avro
>
> a1.sinks.sink1.type = avro
> a1.sinks.sink1.channel = c1
> a1.sinks.sink1.hostname = {IP}
> a1.sinks.sink1.port = 41414
>
> On the Hadoop cluster I have this AVRO source and HDFS sink:
>
> a1.sources = avro1
> a1.sinks = sink1
> a1.channels = c1
>
> a1.channels.c1.type = memory
>
> a1.sources.avro1.type = avro
> a1.sources.avro1.channels = c1
> a1.sources.avro1.bind = 0.0.0.0
> a1.sources.avro1.port = 41414
>
> a1.sinks.sink1.type = hdfs
> a1.sinks.sink1.channel = c1
> a1.sinks.sink1.hdfs.path = {hdfs dir}
> a1.sinks.sink1.hdfs.fileSuffix = .avro
> a1.sinks.sink1.hdfs.rollSize = 67108864
> a1.sinks.sink1.hdfs.fileType = DataStream
>
> The problem is that the files on HDFS are not valid AVRO files! I am using
> the Hue UI to check whether a file is a valid AVRO file or not. If I upload
> an AVRO file that I generated on my PC to the cluster, I can see its contents
> perfectly, and even create a Hive table and query it, but the files I send
> via Flume are not valid AVRO files.
>
> I tried the flume avro client that is included in flume, but that didn't work
> because it sends one flume event per line, breaking the avro files, so I fixed
> that by using the spooldir source with deserializer = avro. So I think the
> problem is in the HDFS sink when it is writing the files.
>
> Using hdfs.fileType = DataStream it writes only the values from the avro
> fields, not the whole avro file, losing all the schema information. If I use
> hdfs.fileType = SequenceFile the files are not valid for some reason.
>
> I appreciate any help.
>
> Thanks,
>
> Daniel
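P.S. To make the idea concrete, here is a rough, untested sketch of the kind of serializer I have in mind. I am implementing the EventSerializer interface directly instead of extending AbstractAvroEventSerializer, because I need to take the schema from the event header rather than hard-coding it. The package and class names are placeholders, and I am assuming the spooldir avro deserializer puts the writer schema in a "flume.avro.schema.literal" header when deserializer.schemaType = LITERAL (that is the header name I see in my events):

package com.yourpackagename;

import java.io.IOException;
import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.EventSerializer;

/**
 * Sketch of an HDFS sink serializer that turns Flume events whose bodies are
 * single binary-encoded Avro records (as produced by the spooldir avro
 * deserializer) back into a proper Avro container file.
 */
public class CustomAvroEventSerializer implements EventSerializer {

  private final OutputStream out;
  private Schema schema;                         // taken from the first event's header
  private GenericDatumReader<GenericRecord> datumReader;
  private DataFileWriter<GenericRecord> dataFileWriter;

  private CustomAvroEventSerializer(OutputStream out) {
    this.out = out;
  }

  @Override
  public void afterCreate() throws IOException {
    // The container file is created lazily in write(), once the first
    // event's schema header has been seen.
  }

  @Override
  public void afterReopen() throws IOException {
    throw new UnsupportedOperationException("Avro container files cannot be appended to");
  }

  @Override
  public void write(Event event) throws IOException {
    if (dataFileWriter == null) {
      // Assumes deserializer.schemaType = LITERAL on the spooldir source,
      // which should put the JSON schema in this header.
      String literal = event.getHeaders().get("flume.avro.schema.literal");
      schema = new Schema.Parser().parse(literal);
      datumReader = new GenericDatumReader<GenericRecord>(schema);
      dataFileWriter = new DataFileWriter<GenericRecord>(
          new GenericDatumWriter<GenericRecord>(schema));
      dataFileWriter.create(schema, out);
    }
    // The event body is one binary Avro record, without any container framing.
    GenericRecord record = datumReader.read(null,
        DecoderFactory.get().binaryDecoder(event.getBody(), null));
    dataFileWriter.append(record);
  }

  @Override
  public void flush() throws IOException {
    if (dataFileWriter != null) {
      dataFileWriter.flush();
    }
  }

  @Override
  public void beforeClose() throws IOException {
    // Flush the last Avro block; the HDFS sink closes the stream itself.
    flush();
  }

  @Override
  public boolean supportsReopen() {
    return false;
  }

  public static class Builder implements EventSerializer.Builder {
    @Override
    public EventSerializer build(Context context, OutputStream out) {
      return new CustomAvroEventSerializer(out);
    }
  }
}

If something like that works, the sink would be configured the way you described:

a1.sinks.sink1.serializer = com.yourpackagename.CustomAvroEventSerializer$Builder

with deserializer.schemaType = LITERAL on the spooldir source so the schema header is present.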
