Yes, I'm having a PR[1] under reviewing to update the IO development guide.
[1] https://github.com/apache/beam/pull/13227 <https://github.com/apache/beam/pull/13227> On Fri, Nov 27, 2020 at 8:00 AM Alexey Romanenko <[email protected]> wrote: > Seems like the documentation about creating new IO connectors [1] is out > of date and it makes people get confused about the recommended way of > developing : > > *“Splittable DoFn is a new sources framework that is under development and > will replace the other options for developing bounded and unbounded > sources.*" > > Do you think we need rewrite this section completely according to a large > progress with moving to SDF-based connectors in last time? Though, it would > be useful to keep an old (current) one since Source API is still used. > > > [1] https://beam.apache.org/documentation/io/developing-io-overview/ > > On 25 Nov 2020, at 19:37, Boyuan Zhang <[email protected]> wrote: > > +dev <[email protected]> > > Hi Bashir, > > Most recently we are recommending to use Splittable DoFn[1] to build new > IO connectors. We have several examples for that in our codebase: > Java examples: > > - Kafka > > <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java#L118> > - An I/O connector for Apache Kafka <https://kafka.apache.org/> (an > open-source distributed event streaming platform). > - Watch > > <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L787> > - Uses a polling function producing a growing set of outputs for each input > until a per-input termination condition is met. > - Parquet > > <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L365> > - An I/O connector for Apache Parquet <https://parquet.apache.org/> > (an open-source columnar storage format). > - HL7v2 > > <https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/healthcare/HL7v2IO.java#L493> > - An I/O connector for HL7v2 messages (a clinical messaging format that > provides data about events that occur inside an organization) part of > Google’s > Cloud Healthcare API <https://cloud.google.com/healthcare>. > - BoundedSource wrapper > > <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L248> > - A wrapper which converts an existing BoundedSource implementation to a > splittable DoFn. > - UnboundedSource wrapper > > <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L432> > - A wrapper which converts an existing UnboundedSource implementation to a > splittable DoFn. > > > Python examples: > > - BoundedSourceWrapper > > <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/python/apache_beam/io/iobase.py#L1375> > - A wrapper which converts an existing BoundedSource implementation to a > splittable DoFn. > > > [1] > https://beam.apache.org/documentation/programming-guide/#splittable-dofns > > On Wed, Nov 25, 2020 at 8:19 AM Bashir Sadjad <[email protected]> wrote: > >> Hi, >> >> I have a scenario in which a streaming pipeline should read update >> messages from MySQL binlog (through Debezium). To implement this pipeline >> using Beam, I understand there is a KafkaIO which I can use. But I also >> want to support a local mode in which there is no Kafka and the messages >> are directly consumed using embedded Debezium because this is a much >> simpler architecture (no Kafka, ZooKeeper, and Kafka Connect). >> >> I did a little bit of search and it seems there is no IO connector for >> Debezim, hence I have to implement one following this guide >> <https://beam.apache.org/documentation/io/developing-io-java/>. I wonder: >> >> 1) Does this approach make sense or is it better to rely on Kafka even >> for the local single machine use case? >> >> 2) Beside the above guide, is there any simple example IO that I can >> follow to implement the UnboundedSource/Reader? I have looked at some >> examples here <https://github.com/apache/beam/tree/master/sdks/java/io> but >> was wondering if there is a recommended/simple one as a tutorial. >> >> Thanks >> >> -B >> P.S. If this is better suited for dev@, please feel free to move it to >> that list. >> > >
