On Thu, Nov 26, 2015 at 10:41:20AM +0800, Chris Miller wrote:
> I'm storing data generated from my web application in Apache Avro
> format. The data is serialized and sent to an Apache Kinesis Firehose
> that buffers and writes the data to Amazon S3 every 300 seconds or so.
> Since I have multiple web servers, this results in multiple blobs of
> Avro data being sent to Kinesis, which concatenates them and
> periodically writes them to S3.
> When I grab the file from S3, I can't use the normal Avro tools to
> decode it since it's actually multiple files in one. I could add a
> delimiter I suppose, but that seems risky in the event that the data
> being logged also has the same delimiter.
> What's the best way to deal with this? I couldn't find anything in the
> standard that supports multiple Avro files concatenated into the same
> file.

I don't know about a good way, but I've dealt with files that I accidentally
corrupted in a similar fashion (hadoop fs -getmerge or something like that).

I hacked up something to seek around and skip the header and sync marker,
using the following patch to the Python avro 1.7.7 library plus a driver program.

This may not have been the final version (i.e. it may not work) since I did
this a while ago, but it should give a general idea of how to handle this case.

https://gist.github.com/aeroevan/988dde466a17b70ff4ae
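
A rough sketch of that idea (not the gist itself, so best-effort only): scan
the blob for the 4-byte Avro magic ("Obj" followed by 0x01) that starts each
embedded container file, split at those offsets, and read each piece with the
normal tools. The file names here are made up, and the magic could in theory
also show up inside a data block, so sanity-check the output:

  import sys

  AVRO_MAGIC = b"Obj\x01"

  def split_concatenated_avro(path):
      with open(path, "rb") as f:
          blob = f.read()

      # Find every offset where an embedded Avro header appears to start.
      offsets = []
      pos = blob.find(AVRO_MAGIC)
      while pos != -1:
          offsets.append(pos)
          pos = blob.find(AVRO_MAGIC, pos + 1)

      # Write each segment to its own file so the usual Avro tools can read it.
      for i, start in enumerate(offsets):
          end = offsets[i + 1] if i + 1 < len(offsets) else len(blob)
          out = "%s.part%03d.avro" % (path, i)
          with open(out, "wb") as f:
              f.write(blob[start:end])
          print("wrote %s" % out)

  if __name__ == "__main__":
      split_concatenated_avro(sys.argv[1])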


If there is some way to synchronize the creation of each S3 object to ensure
that the Avro header is only written once (and the schema is the same and
doesn't change), you should be able to just append the raw Avro data with:

  writer = DataFileWriter(open(filename, "a+b"), DatumWriter())

instead of providing the schema as usual when using the Python library.
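
For what it's worth, a minimal sketch of that append mode (made-up file name
and record, assuming the 1.7.x Python API): the file has to be opened readable
as well as writable, because DataFileWriter reads the existing header back to
pick up the schema and sync marker when no schema is given:

  from avro.datafile import DataFileWriter
  from avro.io import DatumWriter

  # The file must already contain a valid Avro header; "a+b" lets
  # DataFileWriter read back the schema and sync marker before it
  # appends new data blocks at the end.
  writer = DataFileWriter(open("events.avro", "a+b"), DatumWriter())
  writer.append({"user": "alice", "action": "login"})  # must match the file's schema
  writer.close()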

I'm sure there's some way to do this same stuff in Java, but I didn't bother
looking into it since I found something that worked for my one-off issue.

Good luck.
-- 
Evan McClain
https://keybase.io/aeroevan
