Hello,

I am fairly new to Beam but have worked with Apache Spark for a number of years. The application I am developing uses a data pipeline to ingest JSON with a particular schema, uses it to prepare data for a service that I do not control (a mathematical optimization solver), runs that service and collects its results, and then publishes the results as JSON in the same schema. I work in Java, while colleagues of mine are implementing in Python. This is an open-source, non-commercial project.

The application has three kinds of IO sources/sinks: file system files (on Windows now, but Unix in the future), URLs, and in-memory objects (string, byte buffer, etc.). The last kind is used primarily for debugging, with the content displayed in a JTextArea. I have not found a Beam IO connector that handles all three kinds of sources/sinks, particularly the in-memory sink.

I have tried adapting FileIO and TextIO, but I keep running into objects that are not serializable, particularly Java OutputStream and its subclasses. I have looked at the code for FileIO and TextIO as well as several other custom IO implementations, but none of them addresses this particular problem.

The CSVSink example in the FileIO Javadoc uses a PrintWriter, which is not serializable; when I tried the same thing, I got a not-serializable exception. How does this example actually avoid the error? In the code for TextIO.Sink, the PrintWriter field is marked transient, meaning that it is not serialized, but when I tried the same thing, I again got an exception.

So I have two questions:

1. How do I write a Sink that avoids the not-serializable exception?
2. More generally, how can I use a Beam IO connector for the three kinds of data sources/sinks I want to use (file system, URL, and in-memory)?

After the frustrations I had with Spark, I have high hopes for Beam. This issue is a blocker for me.

Thank you.
Jeremy Bloom
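
For concreteness, the transient-field pattern mentioned above can be sketched in plain Java without Beam on the classpath. This is only an illustration under stated assumptions: the names TransientWriterSketch, LineSink, roundTrip, and writeToMemory are hypothetical, and LineSink merely imitates the open/write/flush shape of FileIO.Sink rather than implementing it. It shows that an object with a transient PrintWriter survives Java serialization as long as the writer is created in open() and not in the constructor, and that the same sink can write to an in-memory buffer through a WritableByteChannel. It says nothing about how any particular Beam runner serializes transforms.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Serializable;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

public class TransientWriterSketch {

    /** Hypothetical sink: same method shape as FileIO.Sink, but no Beam dependency. */
    static class LineSink implements Serializable {
        private static final long serialVersionUID = 1L;

        private final String header;           // plain configuration, serialized normally
        private transient PrintWriter writer;  // skipped by Java serialization

        LineSink(String header) {
            this.header = header;
        }

        // The writer is created here, after deserialization, never in the constructor.
        void open(WritableByteChannel channel) {
            writer = new PrintWriter(
                new OutputStreamWriter(Channels.newOutputStream(channel), StandardCharsets.UTF_8));
            writer.print(header + "\n");
        }

        void write(String line) {
            writer.print(line + "\n");
        }

        void flush() {
            writer.flush();
        }
    }

    /** Serializes and deserializes the sink, roughly what happens before it reaches a worker. */
    static LineSink roundTrip(LineSink sink) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(sink);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (LineSink) in.readObject();
        }
    }

    /** Uses an in-memory buffer as the destination by wrapping it in a channel. */
    static String writeToMemory(LineSink sink, String... lines) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        sink.open(Channels.newChannel(buffer));
        for (String line : lines) {
            sink.write(line);
        }
        sink.flush();
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        LineSink copy = roundTrip(new LineSink("id,value"));
        System.out.print(writeToMemory(copy, "1,a", "2,b"));
    }
}
```

LineSink is a static nested class on purpose. In Java, an anonymous or non-static inner class keeps a hidden reference to its enclosing instance, so serializing it also tries to serialize the enclosing object; that is one possible way to get a not-serializable exception even when the writer field itself is transient.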