Configure this serializer directly on the file_roll sink in your configuration:

  agent.sinks.persistence-sink.sink.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer
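For reference, a minimal sketch of the full sink block this gives (directory, batch size and roll interval copied from Jim's original config). Two assumptions worth stating: the HDFS sink jar that contains this class has to be on the file_roll agent's classpath, and, as I read that serializer, each incoming event needs to carry its schema in the flume.avro.schema.literal or flume.avro.schema.url header. Depending on the Flume version, the serializer factory may also expect the nested Builder class rather than the serializer class itself:

  agent.sinks.persistence-sink.type = file_roll
  agent.sinks.persistence-sink.sink.directory = /home/flume/persistence
  agent.sinks.persistence-sink.sink.rollInterval = 300
  agent.sinks.persistence-sink.batchSize = 1000
  agent.sinks.persistence-sink.sink.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer
  # or, if the serializer factory insists on a Builder:
  # agent.sinks.persistence-sink.sink.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder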
Please go through the documentation before you proceed.
https://flume.apache.org/FlumeUserGuide.html#avro-event-serializer

On 29 December 2016 at 23:29, Jim Langston <[email protected]> wrote:

> Thanks - I have pulled the Flume source. Are there any good step-by-step
> examples of creating a custom serializer and deploying it for Flume to use?
>
> Jim
> ------------------------------
> *From:* Laxman Ch <[email protected]>
> *Sent:* Wednesday, December 28, 2016 12:05:08 AM
> *To:* [email protected]
> *Subject:* Re: Avro configuration
>
> Jim,
>
> In one line: FlumeEventAvroEventSerializer and AvroEventDeserializer are
> not in sync, so they can't be used as a serde pair.
>
> Flume's built-in Avro serializer, FlumeEventAvroEventSerializer, serializes
> Flume events along with their FlumeEvent shell. It's important to note that
> the actual raw event is wrapped inside the FlumeEvent shell object, and the
> raw body is treated as opaque binary (which can be Thrift, Avro, just a
> byte array, etc.).
>
> Flume's built-in Avro deserializer, AvroEventDeserializer, deserializes any
> generic Avro event and wraps the deserialized event in another FlumeEvent
> shell object.
>
> This means that, with your configuration, the spool directory source
> (persistence-dev-source) will get a double-wrapped Flume event
> (FlumeEvent -> FlumeEvent -> raw event body).
>
> To solve this problem, the serializer and deserializer need to be in sync.
> We can achieve that with either of the following approaches.
>
> - Use a custom FlumeEventAvroEventDeserializer to extract the FlumeEvent
> directly, without the double wrapper, and use it with the spool directory
> source. A similar attempt has already been made by Sebastian here:
> https://issues.apache.org/jira/browse/FLUME-2942
> I personally recommend writing a new FlumeEventAvroEventDeserializer rather
> than modifying the existing one.
>
> - Use a custom AvroEventSerializer to serialize the Avro event directly and
> use it with the file_roll sink. A reference implementation is available in
> the HDFS sink (org.apache.flume.sink.hdfs.AvroEventSerializer); you can
> strip the HDFS dependencies out of it to achieve what you want.
>
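As a rough sketch of that second approach (class and package names below are invented for illustration, error handling is omitted, and it assumes every event body is Avro binary for the schema carried in the flume.avro.schema.literal header), a serializer for the file_roll sink could append each raw body straight into an Avro container, with no FlumeEvent wrapper:

package com.example.flume;   // hypothetical package, not part of Flume

import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.EventSerializer;

public class RawAvroEventSerializer implements EventSerializer {

  private final OutputStream out;
  private DataFileWriter<Object> writer;   // created lazily once a schema is seen

  private RawAvroEventSerializer(OutputStream out) {
    this.out = out;
  }

  @Override
  public void afterCreate() throws IOException {
    // Nothing to do for a freshly created file.
  }

  @Override
  public void afterReopen() throws IOException {
    // Appending to an existing container file is not supported in this sketch.
    throw new UnsupportedOperationException("reopen not supported");
  }

  @Override
  public void write(Event event) throws IOException {
    if (writer == null) {
      // Take the writer schema from the event header, as the HDFS AvroEventSerializer does.
      String literal = event.getHeaders().get("flume.avro.schema.literal");
      Schema schema = new Schema.Parser().parse(literal);
      writer = new DataFileWriter<Object>(new GenericDatumWriter<Object>(schema));
      writer.create(schema, out);
    }
    // The body is assumed to already be Avro binary for that same schema,
    // so append it as-is, without the FlumeEvent wrapper.
    writer.appendEncoded(ByteBuffer.wrap(event.getBody()));
  }

  @Override
  public void flush() throws IOException {
    if (writer != null) {
      writer.flush();
    }
  }

  @Override
  public void beforeClose() throws IOException {
    flush();
  }

  @Override
  public boolean supportsReopen() {
    return false;
  }

  // The file_roll sink resolves the serializer through this Builder.
  public static class Builder implements EventSerializer.Builder {
    @Override
    public EventSerializer build(Context context, OutputStream out) {
      return new RawAvroEventSerializer(out);
    }
  }
}

The file_roll sink would then reference the nested Builder, along the lines of agent.sinks.persistence-sink.sink.serializer = com.example.flume.RawAvroEventSerializer$Builder, with the jar placed on the Flume classpath or in a plugins.d directory.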
> On 28 December 2016 at 01:17, Jim Langston <[email protected]> wrote:
>
>> Hi all,
>>
>> I'm looking for some guidance. I have been trying to get a flow working
>> that involves the following:
>>
>> Source Avro --> mem channel --> file_roll
>>
>> File roll config:
>>
>> agent.sinks.persistence-sink.type = file_roll
>> agent.sinks.persistence-sink.sink.directory = /home/flume/persistence
>> agent.sinks.persistence-sink.sink.serializer = avro_event
>> agent.sinks.persistence-sink.batchSize = 1000
>> agent.sinks.persistence-sink.sink.rollInterval = 300
>>
>> Once the data is on local disk, I want to flume the data to another Flume
>> server:
>>
>> Source spooldir --> mem channel --> Avro sink (to another Flume server)
>>
>> agent.sources.persistence-dev-source.type = spooldir
>> agent.sources.persistence-dev-source.spoolDir = /home/flume/ready
>> agent.sources.persistence-dev-source.deserializer = avro
>> agent.sources.persistence-dev-source.deserializer.schemaType = LITERAL
>> agent.sources.persistence-dev-source.batchSize = 1000
>>
>> The problem is that file_roll puts the incoming Avro data into an Avro
>> container before storing it on the local file system. Then, when the data
>> is picked up by the spooldir source and sent to the Flume server, it has
>> the file_roll headers when being read by the interceptor.
>>
>> Is there a recommended way to save the incoming Avro data that maintains
>> its integrity when sending it on to another Flume server, which is waiting
>> on Avro data to multiplex and send to its channels?
>>
>> I have tried many different variations. With the configurations above, the
>> Avro data that was received does reach the other server, but the
>> applications see the container headers from the file_roll, not the headers
>> of the records from the initial Avro data.
>>
>> Thanks,
>>
>> Jim
>>
>> Schema that gets set by file_roll on its writes to disk:
>>
>> {
>>   "type" : "record",
>>   "name" : "Event",
>>   "fields" : [ {
>>     "name" : "headers",
>>     "type" : {
>>       "type" : "map",
>>       "values" : "string"
>>     }
>>   }, {
>>     "name" : "body",
>>     "type" : "bytes"
>>   } ]
>> }
>>
>
> --
> Thanks,
> Laxman

--
Thanks,
Laxman
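To make the double wrapping Jim describes concrete: a small, hypothetical snippet (the file path and class name are made up) that opens one of the file_roll output files with the plain Avro reader. Every record comes back with the wrapper Event schema shown in Jim's mail, and the original payload is only visible as the opaque body bytes:

import java.io.File;
import java.nio.ByteBuffer;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class InspectRolledFile {
  public static void main(String[] args) throws Exception {
    // Illustrative path: point it at one of the files file_roll produced.
    File rolled = new File("/home/flume/persistence/1483000000000-1");
    DataFileReader<GenericRecord> reader =
        new DataFileReader<>(rolled, new GenericDatumReader<GenericRecord>());
    try {
      // Prints the FlumeEvent wrapper schema, not the schema of the original data.
      System.out.println("writer schema: " + reader.getSchema());
      if (reader.hasNext()) {
        GenericRecord wrapped = reader.next();
        System.out.println("headers: " + wrapped.get("headers"));
        // The original record is only available here as opaque bytes and still
        // needs its own schema to decode.
        ByteBuffer body = (ByteBuffer) wrapped.get("body");
        System.out.println("body size: " + body.remaining());
      }
    } finally {
      reader.close();
    }
  }
}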
