I'm working with Apache Flink to read, parse, and process data from
S3. I'm using the DataSet API, as my data is bounded and doesn't need
streaming semantics.

My data is on S3 in binary protobuf format, in the form of a large number of
timestamped files. Each of these files has to be read, parsed (using
parseDelimitedFrom
<https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/Parser#parseDelimitedFrom-java.io.InputStream->
) into its custom protobuf Java class, and then processed.
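
For reference, the per-file parsing is essentially the following (MyEvent is
a stand-in for my generated protobuf class):

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// MyEvent is a stand-in for my generated protobuf class.
static List<MyEvent> parseFile(InputStream in) throws IOException {
    List<MyEvent> events = new ArrayList<>();
    MyEvent event;
    // parseDelimitedFrom returns null at end of stream
    while ((event = MyEvent.parseDelimitedFrom(in)) != null) {
        events.add(event);
    }
    return events;
}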

I'm currently using the aws-java-sdk to read these files, as I couldn't
figure out how to read binary protobufs via Flink semantics (env.readFile).
But I'm getting OOM errors because the number and size of the files are too
large.
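
For context, this is roughly what I'm doing today (simplified; the bucket
and prefix names are placeholders, and pagination is omitted). Everything
gets buffered in a single JVM, which is presumably where the OOM comes from:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

static List<MyEvent> readAll() throws IOException {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    List<MyEvent> all = new ArrayList<>();  // all files buffered in one JVM
    for (S3ObjectSummary summary :
            s3.listObjectsV2("my-bucket", "events/").getObjectSummaries()) {
        try (S3Object object = s3.getObject("my-bucket", summary.getKey())) {
            MyEvent event;
            while ((event = MyEvent.parseDelimitedFrom(
                    object.getObjectContent())) != null) {
                all.add(event);
            }
        }
    }
    return all;
}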

So I'm looking to do distributed/parallel reading and parsing of the files
in Flink. How can these custom binary files be read from S3 using the Flink
DataSet API (like env.readFile)? And how can they be read from S3 in a
distributed manner?
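
For what it's worth, I was considering something along these lines, but I'm
not sure it's the right approach (untested sketch; ProtobufInputFormat and
MyEvent are my own placeholder names):

import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.Path;
import java.io.IOException;

// Untested sketch: emits one protobuf message per record, reading each
// file as a single split since length-delimited protobuf can't be split
// mid-stream.
public class ProtobufInputFormat extends FileInputFormat<MyEvent> {

    private MyEvent next;

    public ProtobufInputFormat(Path filePath) {
        super(filePath);
        this.unsplittable = true;  // one split per file
    }

    @Override
    public void open(FileInputSplit split) throws IOException {
        super.open(split);
        // read ahead one message so reachedEnd() can be answered
        this.next = MyEvent.parseDelimitedFrom(this.stream);
    }

    @Override
    public boolean reachedEnd() {
        return next == null;
    }

    @Override
    public MyEvent nextRecord(MyEvent reuse) throws IOException {
        MyEvent current = next;
        next = MyEvent.parseDelimitedFrom(this.stream);
        return current;
    }
}

which I would then hope to use via something like
env.createInput(new ProtobufInputFormat(new Path("s3://my-bucket/events/"))).
Is this the right direction, and would the files actually be distributed
across the task managers this way?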


