Re: Implementing an IO Connector for Debezium

2020-11-30 Thread Bashir Sadjad
Thanks Boyuan for the pointers.

If you or anyone else here have any recommendations about the two
approaches, i.e., implementing a connector for Beam using the embedded
version of Debezium or relying on Kafka (even for the single node case),
that would be great too.

Regards

-B

On Wed, Nov 25, 2020 at 1:37 PM Boyuan Zhang  wrote:

> +dev 
>
> Hi Bashir,
>
> Most recently we are recommending to use Splittable DoFn[1] to build new
> IO connectors. We have several examples for that in our codebase:
> Java examples:
>
>-
>
>Kafka
>
> <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java#L118>
>- An I/O connector for Apache Kafka <https://kafka.apache.org/> (an
>open-source distributed event streaming platform).
>-
>
>Watch
>
> <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L787>
>- Uses a polling function producing a growing set of outputs for each input
>until a per-input termination condition is met.
>-
>
>Parquet
>
> <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L365>
>- An I/O connector for Apache Parquet <https://parquet.apache.org/>
>(an open-source columnar storage format).
>-
>
>HL7v2
>
> <https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/healthcare/HL7v2IO.java#L493>
>- An I/O connector for HL7v2 messages (a clinical messaging format that
>provides data about events that occur inside an organization) part of 
> Google’s
>Cloud Healthcare API <https://cloud.google.com/healthcare>.
>-
>
>BoundedSource wrapper
>
> <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L248>
>- A wrapper which converts an existing BoundedSource implementation to a
>splittable DoFn.
>-
>
>UnboundedSource wrapper
>
> <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L432>
>- A wrapper which converts an existing UnboundedSource implementation to a
>splittable DoFn.
>
>
> Python examples:
>
>- BoundedSourceWrapper
>
> <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/python/apache_beam/io/iobase.py#L1375>
>- A wrapper which converts an existing BoundedSource implementation to a
>splittable DoFn.
>
>
> [1]
> https://beam.apache.org/documentation/programming-guide/#splittable-dofns
>
> On Wed, Nov 25, 2020 at 8:19 AM Bashir Sadjad  wrote:
>
>> Hi,
>>
>> I have a scenario in which a streaming pipeline should read update
>> messages from MySQL binlog (through Debezium). To implement this pipeline
>> using Beam, I understand there is a KafkaIO which I can use. But I also
>> want to support a local mode in which there is no Kafka and the messages
>> are directly consumed using embedded Debezium because this is a much
>> simpler architecture (no Kafka, ZooKeeper, and Kafka Connect).
>>
>> I did a little bit of search and it seems there is no IO connector for
>> Debezim, hence I have to implement one following this guide
>> <https://beam.apache.org/documentation/io/developing-io-java/>. I wonder:
>>
>> 1) Does this approach make sense or is it better to rely on Kafka even
>> for the local single machine use case?
>>
>> 2) Beside the above guide, is there any simple example IO that I can
>> follow to implement the UnboundedSource/Reader? I have looked at some
>> examples here <https://github.com/apache/beam/tree/master/sdks/java/io> but
>> was wondering if there is a recommended/simple one as a tutorial.
>>
>> Thanks
>>
>> -B
>> P.S. If this is better suited for dev@, please feel free to move it to
>> that list.
>>
>


Re: [VOTE] Policies for managing Beam dependencies

2018-06-11 Thread Bashir Sadjad
FWIW, I also think that this has relevance for users. I am a user of Beam
not a contributor and only monitor this list at a high level. But I think
the dependency issue is something that many users have to deal with. It has
bitten us at least twice over the last few months due to the fact that we
depend on other libraries too and sometimes we get version conflicts (which
is one of the issues highlighted in the doc

Cham shared). I usually go through file histories on GitHub to try to
figure out why a certain version requirement is there. It would be nice if
the reasons are maintained at a higher level easier to consume by users.

Cheers

-B

On Tue, Jun 12, 2018 at 12:19 AM Ahmet Altay  wrote:

> I think this is relevant for users. It makes sense for users to know about
> how Beam work with its dependencies and understand how conflicts will be
> addressed and when dependencies will be upgraded.
>
> On Mon, Jun 11, 2018 at 9:09 PM, Kenneth Knowles  wrote:
>
>> Do you think this has relevance for users?
>>
>> If not, it might be a good use of the new Confluence space. I'm not too
>> familiar with the way permission work, but perhaps we can have a more
>> locked down area that is for policy decisions like this.
>>
>> Kenn
>>
>> On Mon, Jun 11, 2018 at 3:58 PM Chamikara Jayalath 
>> wrote:
>>
>>> Hi All,
>>>
>>> Based on the vote (3 PMC +1s and no -1s) and based on the discussions in
>>> the doc (seems to be mostly positive), I think we can go ahead and
>>> implement some of the policies discussed so far.
>>>
>>> I have given some of the potential action items below.
>>>
>>> * Automatically generate human readable reports on status of Beam
>>> dependencies weekly and share in dev list.
>>> * Create JIRAs for significantly outdated dependencies based on above
>>> reports.
>>> * Copy some of the component level dependency version declarations to
>>> top level.
>>> * Try to identify owners for dependencies and specify owners in comments
>>> close to dependency declarations.
>>> * Vendor any dependencies that can cause issues if leaked to other
>>> components.
>>> * Add policies discussed so far to the Web site along with reasoning
>>> (from doc).
>>>
>>> Of course, I'm happy to refine or add to these polices as needed.
>>>
>>> Thanks,
>>> Cham
>>>
>>>
>>> On Thu, Jun 7, 2018 at 9:40 AM Lukasz Cwik  wrote:
>>>
 +1

 On Thu, Jun 7, 2018 at 5:18 AM Kenneth Knowles  wrote:

> +1 to these. Thanks for clarifying!
>
> Kenn
>
> On Wed, Jun 6, 2018 at 10:40 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
>
>> Hi Kenn,
>>
>> On Wed, Jun 6, 2018 at 8:14 PM Kenneth Knowles 
>> wrote:
>>
>>> +0.5
>>>
>>> I like the spirit of these policies. I think they need a little
>>> wording work. Comments inline.
>>>
>>> On Wed, Jun 6, 2018 at 4:53 PM, Chamikara Jayalath <
 chamik...@google.com> wrote:
>
>
> (1) Human readable reports on status of Beam dependencies are
> generated weekly and shared with the Beam community through the dev 
> list.
>

>>> Who is responsible for generating these? The mechanism or
>>> responsibility should be made clear.
>>>
>>> I clicked through a doc -> thread -> doc to find even some details.
>>> It looks like manual run of a gradle command was adopted. So the
>>> responsibility needs an owner, even if it is "unspecified volunteer on 
>>> dev@
>>> and feel free to complain or do it yourself if you don't see it"
>>>
>>
>> This is described in following doc (referenced by my doc).
>>
>> https://docs.google.com/document/d/1rqr_8a9NYZCgeiXpTIwWLCL7X8amPAVfRXsO72BpBwA/edit#
>>
>> Proposal is to run an automated Jenkins job that is run weekly, so no
>> need for someone to manually generate these reports.
>>
>>
>>>
>>> (2) Beam components should define dependencies and their versions at
> the top level.
>

>>> I think the big "should" works better with some guidance about when
>>> something might be an exception, or at least explicit mention that there
>>> can be rare exceptions. Unless you think that is never the case. If 
>>> there
>>> are no exceptions, then say "must" and if we hit a roadblock we can 
>>> revisit
>>> the policy.
>>>
>>
>> The idea was to allow exceptions. Added more details to the doc.
>>
>>
>>>
>>>
>>> (3) A significantly outdated dependency (identified manually or
> through tooling) should result in a JIRA that is a blocker for the 
> next
> release. Release manager may choose to push the blocker to the 
> subsequent
> release or downgrade from a blocker.
>

>>> How is "significantly outdated" defined? By dev@ discussion? Seems