Beam IO Connector

Jeremy Bloom Sat, 12 Aug 2023 08:34:01 -0700

Hello-
I am fairly new to Beam but have been working with Apache Spark for a
number of years. The application I am developing uses a data pipeline to
ingest JSON with a particular schema, uses it to prepare data for a service
that I do not control (a mathematical optimization solver), runs the
application and recovers its results, and then publishes the results in
JSON (same schema).  Although I work in Java, colleagues of mine are
implementing in Python. This is an open-source, non-commercial project.


The application has three kinds of IO sources/sinks: files on the file
system (Windows now, but Unix in the future), URLs, and in-memory objects
(strings, byte buffers, etc.). The last kind is used primarily for
debugging, with the output displayed in a JTextArea.

I have not found a Beam IO connector that handles all three kinds of data
sources/sinks, particularly the in-memory sink. I have tried adapting
FileIO and TextIO; however, I continually run up against objects that are
not serializable, particularly Java OutputStream and its subclasses. I have
looked at the code for FileIO and TextIO as well as several other custom IO
implementations, but none of them addresses this particular problem.

The CSVSink example in the FileIO Javadoc uses a PrintWriter, which is not
serializable; when I tried the same thing, I got a not-serializable
exception. How does this example actually avoid this error? In the code for
TextIO.Sink, the PrintWriter field is marked transient, meaning that it is
not serialized, but again, when I tried the same thing, I got an exception.
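
To make the pattern in question concrete, here is a minimal sketch in plain
Java with no Beam dependency. The class names (TransientWriterSketch,
LazySink, EagerSink) and the canSerialize helper are hypothetical and are
not Beam APIs. The sketch only illustrates standard Java serialization
behavior: a writer field marked transient and created in open() is skipped
during serialization, whereas a non-transient field holding a live
PrintWriter causes a NotSerializableException. The in-memory case is
represented by a ByteArrayOutputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.io.Serializable;

public class TransientWriterSketch {

    // Hypothetical sink: only the header is serialized; the writer is
    // transient and is created lazily in open().
    static class LazySink implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String header;           // serializable configuration
        private transient PrintWriter writer;  // skipped by Java serialization

        LazySink(String header) {
            this.header = header;
        }

        void open(OutputStream out) {
            writer = new PrintWriter(out);
            writer.println(header);
        }

        void write(String line) {
            writer.println(line);
        }

        void flush() {
            writer.flush();
        }
    }

    // Hypothetical sink that holds a live, non-transient writer from the
    // moment it is constructed.
    static class EagerSink implements Serializable {
        private static final long serialVersionUID = 1L;
        private final PrintWriter writer;

        EagerSink(OutputStream out) {
            writer = new PrintWriter(out);
        }
    }

    // Returns true if standard Java serialization accepts the object.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream oos =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        LazySink lazy = new LazySink("id,value");
        lazy.open(buffer);          // in-memory destination
        lazy.write("1,42");
        lazy.flush();

        System.out.println("LazySink serializable:  " + canSerialize(lazy));
        System.out.println("EagerSink serializable: "
            + canSerialize(new EagerSink(buffer)));
        System.out.print(buffer.toString());
    }
}
```

In this sketch the destination stream is passed to open() rather than to
the constructor, so the serialized state never contains a stream. Whether
this matches what a given Beam runner does with a Sink is an assumption
that would need to be checked against the Beam source.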

Please explain, in particular, how to write a Sink that avoids the
not-serializable exception. In general, please explain how I can use a Beam
IO connector for the three kinds of data sources/sinks I want to use (file
system, URL, and in-memory).

After the frustrations I had with Spark, I have high hopes for Beam. This
issue is a blocker for me.

Thank you.
Jeremy Bloom
