Hello everyone,

At my company we are considering Apache Beam, with the Python SDK,
as part of our analytics system.

Our dataset consists of an unbounded collection of gzipped TAR
archives, each containing several JSON and binary files.
These TAR files need to be split into sub-categories, essentially
producing a new collection composed of smaller parts.
Our transforms will operate on this second collection.

The compressed TAR archives are around 10 MiB each, and the
largest binary files we have are around 16 MiB.
We only have a couple of those; the rest of the binary files are
smaller.

Also, in some cases, we may want some transformations to generate new
binary files from this collection.

The first problem I encountered is that Beam has no native way to
extract TAR archives, so my first approach was to unpack the TAR in
place (in a temporary directory) and then return the JSON files as
objects and the binary files as bytes.
However, this crashes the Flink runner because of the high memory
consumption.
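
For illustration, here is a minimal sketch of this kind of unpacking
step. It is not our actual code: the function name iter_tar_members
and the ".json" suffix check are hypothetical, and it reads the
archive from memory with the standard tarfile module instead of a
temporary directory. It yields one member at a time, but each member
is still loaded fully into memory.

```python
import io
import json
import tarfile


def iter_tar_members(archive_bytes):
    """Yield (member_name, payload) pairs from a gzipped TAR held in memory.

    JSON members are parsed into Python objects; every other regular
    file is returned as raw bytes. (Hypothetical helper, for illustration.)
    """
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            data = tar.extractfile(member).read()
            if member.name.endswith(".json"):
                yield member.name, json.loads(data)
            else:
                yield member.name, data
```

A generator like this could presumably be applied to a collection of
archive contents with beam.FlatMap(iter_tar_members), so that each
member becomes its own element.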

Is there a way to pass large binary files between the steps of the
pipeline?

I'm aware of fileio.py, and I tried using WriteToFiles to write the
unpacked binary files, without success.
Apparently WriteToFiles groups the data from all the files into the
same output file.

I'm also aware that I can implement my own IO transforms using
FileBasedSource and FileBasedSink, but these classes seem to be
"record oriented", which is not very useful for us.

Is Apache Beam the right framework for us?
Can we implement our system using Beam?

Thanks,
Ignacio.
