Thanks. Is there a GitHub link to Devon's code?

On Mon, Aug 14, 2023 at 8:49 AM John Casey <theotherj...@google.com> wrote:
> I believe Devon Peticolas wrote a similar tool to create an IO that wrote
> to configurable sinks that might fit your use case.
>
> On Sat, Aug 12, 2023 at 12:18 PM Bruno Volpato via dev <
> dev@beam.apache.org> wrote:
>
>> Hi Jeremy,
>>
>> Apparently you are trying to use Beam's DirectRunner
>> <https://beam.apache.org/documentation/runners/direct/>, which is mostly
>> focused on small pipelines / testing purposes.
>> Even if it runs in the JVM, there are protections in place to make sure
>> your pipeline will be able to be distributed correctly when choosing a
>> production-ready runner (e.g., Dataflow, Spark, Flink), from the link above:
>>
>> - enforcing immutability of elements
>> - enforcing encodability of elements
>>
>> There are ways to disable those checks (--enforceEncodability=false,
>> --enforceImmutability=false), but to make sure you get the best out of
>> Beam and can run the pipeline in one of the runners in the future, I
>> believe the best way would be to write to a file, and read it back in the
>> GUI application (for the sink part).
>>
>> For the source part, you may want to use Create
>> <https://beam.apache.org/documentation/transforms/java/other/create/> to
>> create a PCollection with specific elements for the in-memory scenario.
>>
>> If you are getting exceptions for supported scenarios that you've
>> mentioned, there are a few things to check -- for example, if you are
>> using a lambda, sometimes Java will try to serialize the entire instance
>> that holds the members being used. Creating your own DoFn classes and
>> passing in only the Serializables that you need may resolve this.
>>
>> Best,
>> Bruno
>>
>> On Sat, Aug 12, 2023 at 11:34 AM Jeremy Bloom <jeremybl...@gmail.com>
>> wrote:
>>
>>> Hello-
>>> I am fairly new to Beam but have been working with Apache Spark for a
>>> number of years. The application I am developing uses a data pipeline to
>>> ingest JSON with a particular schema, uses it to prepare data for a service
>>> that I do not control (a mathematical optimization solver), runs the
>>> application and recovers its results, and then publishes the results in
>>> JSON (same schema). Although I work in Java, colleagues of mine are
>>> implementing in Python. This is an open-source, non-commercial project.
>>>
>>> The application has three kinds of IO sources/sinks: file system files
>>> (using Windows now, but Unix in the future), URL, and in-memory (string,
>>> byte buffer, etc). The last is primarily used for debugging, displayed in a
>>> JTextArea.
>>>
>>> I have not found a Beam IO connector that handles all three data
>>> sources/sinks, particularly the in-memory sink. I have tried adapting
>>> FileIO and TextIO; however, I continually run up against objects that are
>>> not serializable, particularly Java OutputStream and its subclasses. I have
>>> looked at the code for FileIO and TextIO as well as several other custom IO
>>> implementations, but none of them addresses this particular bug.
>>>
>>> The CSVSink example in the FileIO Javadoc uses a PrintWriter, which is
>>> not serializable; when I tried the same thing, I got a not-serializable
>>> exception. How does this example actually avoid this error? In the code for
>>> TextIO.Sink, the PrintWriter field is marked transient, meaning that it is
>>> not serialized, but again, when I tried the same thing, I got an exception.
>>>
>>> Please explain, in particular, how to write a Sink that avoids the
>>> not-serializable exception. In general, please explain how I can use a
>>> Beam IO connector for the three kinds of data sources/sinks I want to use
>>> (file system, URL, and in-memory).
>>>
>>> After the frustrations I had with Spark, I have high hopes for Beam.
>>> This issue is a blocker for me.
>>>
>>> Thank you.
>>> Jeremy Bloom
>>>
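
For reference, the transient-field point discussed in the thread can be illustrated with plain Java serialization, without Beam on the classpath. The following is only a minimal sketch: the class names (TransientSinkDemo, EagerSink, LazySink) and the isSerializable helper are hypothetical, and the open/write/flush methods merely imitate the general shape of a sink; they are not Beam's API. It assumes the exception in question comes from a non-transient writer field that is initialized when the object is constructed, rather than lazily after the object has been shipped to a worker.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.PrintWriter;
import java.io.Serializable;
import java.io.StringWriter;
import java.io.UncheckedIOException;

public class TransientSinkDemo {

    // Hypothetical sink that creates its writer eagerly in a non-transient
    // field. Java serialization then tries to serialize the PrintWriter too.
    static class EagerSink implements Serializable {
        private final PrintWriter writer = new PrintWriter(new StringWriter());
    }

    // Hypothetical sink that keeps only serializable configuration (header)
    // and creates its writer lazily in open(). Transient fields are skipped
    // when the object is serialized.
    static class LazySink implements Serializable {
        private final String header;
        private transient StringWriter buffer;
        private transient PrintWriter writer;

        LazySink(String header) {
            this.header = header;
        }

        void open() {
            buffer = new StringWriter();
            writer = new PrintWriter(buffer);
            writer.println(header);
        }

        void write(String line) {
            writer.println(line);
        }

        String flush() {
            writer.flush();
            return buffer.toString();
        }
    }

    // Returns true if Java serialization accepts the object, false if it
    // raises NotSerializableException.
    static boolean isSerializable(Object o) {
        try (ObjectOutputStream out =
                new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("EagerSink serializable: "
                + isSerializable(new EagerSink()));

        LazySink sink = new LazySink("a,b");
        sink.open();
        sink.write("1,2");
        System.out.println("LazySink serializable: " + isSerializable(sink));
        System.out.print(sink.flush());
    }
}
```

A transient field alone may not be enough if the sink is declared as an anonymous or non-static inner class, or as a lambda that uses members of the enclosing object: in those cases the enclosing instance (for example, a GUI class holding a JTextArea) is captured and serialized as well, which matches the lambda caveat raised earlier in the thread. Declaring the sink as a static nested or top-level class avoids that capture.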