Re: Beam IO Connector

Jeremy Bloom Mon, 14 Aug 2023 13:21:15 -0700

Thanks. Is there a github link to Devon's code?

On Mon, Aug 14, 2023 at 8:49 AM John Casey <theotherj...@google.com> wrote:


> I believe Devon Peticolas wrote a similar tool to create an IO that wrote
> to configurable sinks that might fit your use case
>
> On Sat, Aug 12, 2023 at 12:18 PM Bruno Volpato via dev <
> dev@beam.apache.org> wrote:
>
>> Hi Jeremy,
>>
>> Apparently you are trying to use Beam's DirectRunner
>> <https://beam.apache.org/documentation/runners/direct/>, which is mostly
>> focused on small pipelines / testing purposes.
>> Even if it runs in the JVM, there are protections in place to make sure
>> your pipeline will be able to be distributed correctly when choosing a
>> production-ready runner (e.g., Dataflow, Spark, Flink), from the link above:
>>
>> - enforcing immutability of elements
>> - enforcing encodability of elements
>>
>> There are ways to disable those checks (--enforceEncodability=false,
>> --enforceImmutability=false), but to get the most out of Beam and keep
>> the option of running the pipeline on one of those runners in the future,
>> I believe the best way would be to write to a file and read it back in
>> the GUI application (for the sink part).
>>
>> For the source part, you may want to use Create
>> <https://beam.apache.org/documentation/transforms/java/other/create/> to
>> create a PCollection with specific elements for the in-memory scenario.
>>
>> If you are getting exceptions in the supported scenarios that you
>> mentioned, there are a few things to check. For example, if you are using
>> a lambda, Java will sometimes try to serialize the entire enclosing
>> instance whose members the lambda uses. Creating your own DoFn classes and
>> passing in only the Serializable values you need may resolve this.
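
The lambda capture described above can be reproduced with plain Java serialization, without Beam on the classpath. The following is a minimal sketch, not code from this thread: `SerFn` is a hypothetical stand-in for a serializable function interface such as Beam's `SerializableFunction`, and `Panel` stands in for a GUI class that is not Serializable. All names are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

public class CaptureDemo {
    // Hypothetical stand-in for a serializable function type.
    interface SerFn<A, B> extends Function<A, B>, Serializable {}

    // Stand-in for a GUI class: it is NOT Serializable and holds a stream.
    static class Panel {
        private final String prefix;
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        Panel(String prefix) { this.prefix = prefix; }

        // Reads this.prefix, so the lambda captures the whole Panel instance.
        SerFn<String, String> capturingLambda() {
            return s -> prefix + s;
        }

        // Hands over only the String that the function needs.
        SerFn<String, String> standaloneFn() {
            return new PrefixFn(prefix);
        }
    }

    // Static nested class: no hidden reference to an enclosing instance.
    static class PrefixFn implements SerFn<String, String> {
        private final String prefix;
        PrefixFn(String prefix) { this.prefix = prefix; }
        @Override public String apply(String s) { return prefix + s; }
    }

    // Returns true if Java serialization accepts the object graph.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Panel panel = new Panel("> ");
        System.out.println("lambda capturing this: " + canSerialize(panel.capturingLambda()));
        System.out.println("standalone class:      " + canSerialize(panel.standaloneFn()));
    }
}
```

The lambda fails because serializing it also serializes its captured `Panel`. The standalone class carries only a `String`. The same reasoning should apply to a DoFn or sink written as an anonymous or inner class inside a GUI component.
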
>>
>>
>> Best,
>> Bruno
>>
>>
>>
>>
>> On Sat, Aug 12, 2023 at 11:34 AM Jeremy Bloom <jeremybl...@gmail.com>
>> wrote:
>>
>>> Hello-
>>> I am fairly new to Beam but have been working with Apache Spark for a
>>> number of years. The application I am developing uses a data pipeline to
>>> ingest JSON with a particular schema, uses it to prepare data for a service
>>> that I do not control (a mathematical optimization solver), runs the
>>> application and recovers its results, and then publishes the results in
>>> JSON (same schema).  Although I work in Java, colleagues of mine are
>>> implementing in Python. This is an open-source, non-commercial project.
>>>
>>> The application has three kinds of IO sources/sinks: file system files
>>> (using Windows now, but Unix in the future), URL, and in-memory (string,
>>> byte buffer, etc). The last is primarily used for debugging, displayed in a
>>> JTextArea.
>>>
>>> I have not found a Beam IO connector that handles all three data
>>> sources/sinks, particularly the in-memory sink. I have tried adapting
>>> FileIO and TextIO; however, I continually run up against objects that are
>>> not serializable, particularly Java OutputStream and its subclasses. I have
>>> looked at the code for FileIO and TextIO as well as several other custom IO
>>> implementations, but none of them addresses this particular bug.
>>>
>>> The CSVSink example in the FileIO Javadoc uses a PrintWriter, which is
>>> not serializable; when I tried the same thing, I got a not-serializable
>>> exception. How does this example actually avoid this error? In the code for
>>> TextIO.Sink, the PrintWriter field is marked transient, meaning that it is
>>> not serialized, but again, when I tried the same thing, I got an exception.
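
One plausible reading of how the Javadoc example avoids the error: the writer is never part of the serialized state. `FileIO.Sink` exposes `open(WritableByteChannel)`, `write(element)` and `flush()`, so the sink object is serialized while its writer field is still unset, and the writer is created in `open` only after deserialization. The sketch below imitates that lifecycle with plain Java serialization. `LineSink` and `roundTrip` are hypothetical names, not Beam classes, and `open` takes an `OutputStream` here only to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class TransientWriterDemo {
    // Hypothetical miniature sink with an open/write/flush lifecycle.
    static class LineSink implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String header;           // serializable configuration
        private transient PrintWriter writer;  // skipped by serialization

        LineSink(String header) { this.header = header; }

        // Called after deserialization; the stream is never stored in a
        // serialized field.
        void open(OutputStream out) {
            writer = new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
            writer.print(header + "\n");
        }

        void write(String line) { writer.print(line + "\n"); }

        void flush() { writer.flush(); }
    }

    // Serializes and deserializes an object, as a runner would before use.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Writes two lines through a sink that has been through serialization.
    static String demo() {
        LineSink copy = roundTrip(new LineSink("id,value"));
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        copy.open(target);
        copy.write("1,42");
        copy.flush();
        return new String(target.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

If a sink with a transient writer still throws, the non-serializable object is probably reachable some other way, for example through a non-transient field holding the OutputStream or through an enclosing instance captured by an inner class. That is a guess, since the failing code is not shown in this thread.
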
>>>
>>> Please explain, in particular, how to write a Sink that avoids the
>>> not-serializable exception. In general, please explain how I can use a
>>> Beam IO connector for the three kinds of data sources/sinks I want to use
>>> (file system, URL, and in-memory).
>>>
>>> After the frustrations I had with Spark, I have high hopes for Beam.
>>> This issue is a blocker for me.
>>>
>>> Thank you.
>>> Jeremy Bloom
>>>
>>
