Re: How to write an IO guide draft

2023-01-10 Thread Sachin Agarwal via dev
Totally agreed with that, but it's not bad as a statement of intent for our
vision -

On Tue, Jan 10, 2023 at 8:34 AM Alexey Romanenko 
wrote:

> I doubt that it will be a "de-facto" standard behaviour for all runners in
> the short term, since the cross-language functionality brings additional
> complexity into pipeline deployment as well as some performance overhead.
>
> Perhaps that will change in the long term, but for now I would guess that
> most Beam pipelines still use IO connectors from the same SDK as the
> pipeline itself.
>
> —
> Alexey
>
> On 10 Jan 2023, at 16:51, Sachin Agarwal via dev 
> wrote:
>
> I think the idea of cross-language is that an IO is written in only one
> language and others can use that IO. My feeling is that the idea of “what
> language is this IO in” becomes an implementation detail that folks won’t
> have to care about longer term. There are enhancements needed to the
> expansion service to make that happen, but that’s my understanding of the
> strategy.
>
> On Tue, Jan 10, 2023 at 7:40 AM Austin Bennett  wrote:
>
>> This is great, thanks for putting this together!
>>
>> A related question:  are we as a community targeting Java to be the
>> canonical/target IO language when an IO does not currently exist?  If that
>> is not the case, then I imagine we are hoping that we might eventually
>> also wind up with good examples for implementing IOs in other languages as
>> well [ not suggesting that you/John address that, but that we add GitHub
>> issues, as that might be worthwhile work to hope others take on ]?
>>
>>
>>
>> On Mon, Jan 9, 2023 at 8:58 AM John Casey via dev 
>> wrote:
>>
>>> Hi All,
>>>
>>> I spent the last few weeks of December drafting a "How to write an IO
>>> guide":
>>> https://docs.google.com/document/d/1-WxZTNu9RrLhh5O7Dl5PbnKqz3e5gm1x3gDBBhszVF8/edit#
>>>
>>> and an associated code sample: https://github.com/apache/beam/pull/24799
>>>
>>> My goal is to make it easier for a new IO developer to create a new IO
>>> from scratch. This is intended to complement the various standards
>>> documents that have been floating around. Where those are intended to
>>> prescribe structure of an IO, this is more focused on the mechanics of
>>> internal design.
>>>
>>> Please take a look and let me know what you think,
>>>
>>> John
>>>
>>
>
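The cross-language idea discussed in this thread can be sketched conceptually: a transform is identified by a URN, an "expansion service" owns the implementation, and a client SDK needs only the URN plus parameters. This is a toy, stdlib-only illustration, not Beam's actual expansion-service API; all names here are invented.

```python
# Conceptual sketch (NOT Beam's real API): a registry mapping URNs to
# transform constructors stands in for the expansion service.
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class ExpandedTransform:
    urn: str
    apply: Callable[[Iterable], List]


class ToyExpansionService:
    """Illustrative URN -> transform-constructor registry."""

    def __init__(self):
        self._registry: Dict[str, Callable[..., ExpandedTransform]] = {}

    def register(self, urn: str, constructor: Callable[..., ExpandedTransform]):
        self._registry[urn] = constructor

    def expand(self, urn: str, **params) -> ExpandedTransform:
        if urn not in self._registry:
            raise KeyError(f"No transform registered for {urn}")
        return self._registry[urn](**params)


# The "implementing SDK" side registers a transform under a URN...
service = ToyExpansionService()
service.register(
    "example:uppercase:v1",
    lambda: ExpandedTransform("example:uppercase:v1",
                              lambda pcoll: [s.upper() for s in pcoll]),
)

# ...and a client SDK uses it knowing only the URN, not the language.
transform = service.expand("example:uppercase:v1")
print(transform.apply(["kafka", "beam"]))  # -> ['KAFKA', 'BEAM']
```

This is the sense in which "what language is this IO in" becomes an implementation detail: the client only ever sees the URN and its parameters.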


Re: Beam Website Feedback

2023-01-09 Thread Sachin Agarwal via dev
These are great. Thank you so much!

On Mon, Jan 9, 2023 at 6:33 AM Alexey Romanenko 
wrote:

> Always happy to help!
>
> Many thanks for your work to make Beam website better!
>
> —
> Alexey
>
> On 6 Jan 2023, at 21:54, Alex Kosolapov 
> wrote:
>
> Thank you, Ahmet! Happy to help! Both changes [1] and [2] have been
> reviewed and merged by Alexey Romanenko.
>
> We wanted to thank Alexey Romanenko, David Huntsperger, Pablo Estrada,
> Alya Boiko for reviewing and helping to contribute 52 enhancements, fixes
> and case study related additions for the Beam website in the last 6 months
> since July’22! [3]
>
> [1] https://github.com/apache/beam/pull/1
> [2] https://github.com/apache/beam/pull/24747
> [3]
> https://github.com/apache/beam/pulls?page=1&q=is%3Apr+author%3Abullet03+is%3Aclosed+merged%3A%3E%3D2022-07-01
>
> *From: *Ahmet Altay 
> *Date: *Tuesday, January 3, 2023 at 2:22 PM
> *To: *Alex Kosolapov 
> *Cc: *"dev@beam.apache.org" , Rebecca Szper <
> rsz...@google.com>, Bulat Safiullin , Alexey
> Romanenko , Rajkumar Gupta <
> rajkumargu...@google.com>
> *Subject: *[EXTERNAL] Re: Beam Website Feedback
>
> Thank you Alex and Bulat for improving this. We all very much appreciate
> it.
>
> On Thu, Dec 22, 2022 at 9:21 AM Alex Kosolapov 
> wrote:
>
> Hi all,
>
> We were preparing some improvements for check-links.sh script that is used
> for testing Apache Beam website links during the website build with Bulat (
> @bullet03 ).
>
> We saw several categories of link checks and error statuses:
>
>- 404 - actual incorrect links - fixed in [1] and [2]
>- Valid links that appear to the script as incorrect, e.g., a 9xx
>status code from LinkedIn requiring authentication, some GitHub
>documentation links, example links, some Meetup links, etc.
>
>
> We propose to add a “verified_list” to check_links.sh so that manually
> verified links can be skipped in testing. The current verified list
> includes 15 links, based on a review of the most recent test run. An
> inconvenience of this approach is that a verified link may become outdated,
> which would require an update of the “verified_list” in check_links.sh.
> This approach is implemented in [3].
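The "verified_list" idea above amounts to filtering manually verified links out of the set the checker would otherwise test. A minimal sketch of that filtering step, in Python rather than the actual check-links.sh shell script, with a hypothetical list (the real list contents and names differ):

```python
# Illustrative only: links that were manually verified are skipped by the
# checker even if they return odd status codes (e.g. LinkedIn's 9xx for
# unauthenticated requests). The URLs below are made-up examples.
VERIFIED_LIST = {
    "https://www.linkedin.com/company/example",  # 999 without auth
    "https://www.meetup.com/example-group/",     # blocks bots
}

def links_to_check(found_links, verified=VERIFIED_LIST):
    """Return links that still need an HTTP check, preserving order."""
    return [link for link in found_links if link not in verified]

found = [
    "https://beam.apache.org/get-started/",
    "https://www.linkedin.com/company/example",
]
print(links_to_check(found))  # -> ['https://beam.apache.org/get-started/']
```

As the thread notes, the trade-off is that a skipped link can silently rot, so the verified list itself needs occasional review.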
>
> [3] also contains check-links.sh improvements:
>
>- Added a function that checks and reports Apache Beam staging website
>links to prevent the production website from having links to staging
>- Added a check that reports Apache Beam website absolute links
>(links of the form https://beam.apache.org/path) - relative links in
>the sources are preferred so that the staging website builds and
>reviews properly
>- Added sorting of any invalid links by their error code - this may be
>more convenient for reviewing output
>
>
> [4] - optionally, update absolute links to relative links so that a
> staging website more closely resembles the production website
>
> We submitted [3] and [4] for PR review and tagged Alexey Romanenko to
> kindly help with reviewing these PRs. Please share your comments about the
> proposed approach in the PRs or on the list.
>
> [1] https://github.com/apache/beam/pull/24635
> [2] https://github.com/apache/beam/pull/24744
> [3] https://github.com/apache/beam/pull/1
> [4] https://github.com/apache/beam/pull/24747
>
> Thank you,
> Alex
>
> *From: *Rebecca Szper via dev 
> *Reply-To: *"dev@beam.apache.org" , Rebecca Szper <
> rsz...@google.com>
> *Date: *Wednesday, December 21, 2022 at 10:15 AM
> *To: *Ahmet Altay 
> *Cc: *Alexey Romanenko , dev <
> dev@beam.apache.org>, Rajkumar Gupta 
> *Subject: *[EXTERNAL] Re: Beam Website Feedback
>
> Our team doesn't maintain the Beam website infrastructure, but last time
> something like this came up, David said that there are consultants that
> work on this type of thing. He pinged @bullet03 on the Beam ticket, who
> was able to help.
>
> On Tue, Dec 20, 2022 at 5:06 PM Ahmet Altay  wrote:
>
>
>
> On Tue, Dec 20, 2022 at 1:12 PM Ahmet Altay  wrote:
>
>
>
> On Tue, Dec 20, 2022 at 9:14 AM Alexey Romanenko 
> wrote:
>
> Thanks Ahmet! I’d prefer to fix the links as you did and add a redirect
> from the old one - perhaps there are other similar links that have been
> changed in the same way.
>
>
> Thank you for the review. I fixed it, and added a redirect too.
>
>
>
> Btw, I’m not sure that we still check for broken links as we did before,
> iirc, but it would probably be a good idea to add such a check before
> publishing the website.
>
>
> I agree. I also do not know about the state of this. It would be good to
> add that link checker again.
>
>
> Adding @Rebecca Szper  - in case this is something she
> can fix or would know who could fix it.
>
>
>
>
>
> —
> Alexey
>
>
>
>
>
> On 20 Dec 2022, at 18:04, Ahmet Altay via dev  wrote:
>
> I did a search and found a few places with the broken link. Correct links
> should be:
> https://beam.apache.org/get-started/resources/videos-and-podcasts/
>
> I created a PR to update the website (
> 

Re: Testing Multilanguage Pipelines?

2022-12-28 Thread Sachin Agarwal via dev
Given the increasing importance of multi-language pipelines, it does seem
that we should expand the capabilities of the DirectRunner or just go all
in on the FlinkRunner for testing and local / small-scale development.

On Wed, Dec 28, 2022 at 12:47 AM Robert Burke  wrote:

> Probably either on Flink, or the Python Portable runner at this juncture.
>
> On Tue, Dec 27, 2022, 8:40 PM Byron Ellis via dev 
> wrote:
>
>> Hi all,
>>
>> I spent some more time adding things to my dbt-for-Beam clone (
>> https://github.com/apache/beam/pull/24670) and actually made a fair
>> amount of progress, including starting to add in the profile support so I
>> can start to run it against real workloads (though at the moment only the
>> "test" connector is properly configured). More interestingly, though, is
>> adding in support for Python Dataframe external transforms... which expands
>> properly, but then (unsurprisingly) hangs if you try to actually run the
>> pipeline with Java's TestPipeline.
>>
>> I was wondering how people go about testing Java/Python hybrid pipelines
>> locally? The Java<->Python tests don't seem to actually execute a pipeline,
>> but I was hoping that maybe the direct runner could be set up properly to
>> do that?
>>
>> Best,
>> B
>>
>


Re: A Declarative API for Apache Beam

2022-12-16 Thread Sachin Agarwal via dev
> Transforms are defined using a YAML tag and named properties and can
>>> be used by constructing a YAML reference.
>>>
>>> That's an interesting idea. Can it be done inline as well?
>>>
>>> > DAG construction is done using a simple topological sort of transforms
>>> and their dependencies.
>>>
>>> Same.
>>>
>>> > Named side outputs can be referenced using a tag field.
>>>
>>> I didn't put this in any of the examples, but I do the same. If a
>>> transform Foo produces multiple outputs, one can (in fact must)
>>> reference the various outputs by Foo.output1, Foo.output2, etc.
>>>
>>> > Multiple inputs are merged with a Flatten transform.
>>>
>>> PTransforms can have named inputs as well (they're not always
>>> symmetric), so I let inputs be a map if they care to distinguish them.
>>>
>>> > Not sure if there's any inspiration left to take from this, but I
>>> figured I'd throw it up here to share.
>>>
>>> Thanks. It's neat to see others coming up with the same idea, with
>>> very similar conventions, and validates that it'd be both natural and
>>> useful.
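The "DAG construction via a simple topological sort of transforms and their dependencies" mentioned above can be sketched with Python's stdlib `graphlib`. The transform names here are invented for illustration; neither prototype necessarily uses this module.

```python
# Minimal sketch: each transform lists the transforms it depends on (its
# inputs), and a topological sort yields a valid execution order.
from graphlib import TopologicalSorter

# transform -> set of transforms it consumes output from
deps = {
    "Read": set(),
    "Parse": {"Read"},
    "CountA": {"Parse"},
    "CountB": {"Parse"},
    "Join": {"CountA", "CountB"},
    "Write": {"Join"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)
# Dependencies always precede dependents: "Read" comes first, "Write" last;
# the relative order of independent siblings (CountA/CountB) is unspecified.
```

`TopologicalSorter` also raises `CycleError` on cyclic definitions, which gives a pipeline spec a cheap validity check for free.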
>>>
>>>
>>> > On Thu, Dec 15, 2022 at 12:48 AM Chamikara Jayalath via dev <
>>> dev@beam.apache.org> wrote:
>>> >>
>>> >> +1 for these proposals and agree that these will simplify and
>>> demystify Beam for many new users. I think when combined with the
>>> x-lang/Schema-Aware transform binding, these might end up being adequate
>>> solutions for many production use-cases as well (unless users need to
>>> define custom composites, I/O connectors, etc.).
>>> >>
>>> >> Also, thanks for providing prototype implementations with examples.
>>> >>
>>> >> - Cham
>>> >>
>>> >>
>>> >> On Wed, Dec 14, 2022 at 3:01 PM Sachin Agarwal via dev <
>>> dev@beam.apache.org> wrote:
>>> >>>
>>> >>> To build on Kenn's point, if we leverage existing stuff like dbt we
>>> get access to a ready made community which can help drive both adoption and
>>> incremental innovation by bringing more folks to Beam
>>> >>>
>>> >>> On Wed, Dec 14, 2022 at 2:57 PM Kenneth Knowles 
>>> wrote:
>>> >>>>
>>> >>>> 1. I love the idea. Back in the early days people talked about an
>>> "XML SDK" or "JSON SDK" or "YAML SDK" and it didn't really make sense at
>>> the time. Portability and specifically cross-language schema transforms
>>> gives the right infrastructure so this is the perfect time: unique names
>>> (URNs) for transforms and explicit lists of parameters they require.
>>> >>>>
>>> >>>> 2. I like the idea of re-using some existing thing like dbt if it
>>> is pretty much what we were going to do anyhow. I don't think we should
>>> hold ourselves back. I also don't think we'll gain anything in terms of
>>> implementation. But at least it could fast-forward our design process
>>> because we simply don't have to make most of the decisions because they are
>>> made for us.
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Wed, Dec 14, 2022 at 2:44 PM Byron Ellis via dev <
>>> dev@beam.apache.org> wrote:
>>> >>>>>
>>> >>>>> And I guess also a PR for completeness to make it easier to find
>>> going forward instead of my random repo:
>>> https://github.com/apache/beam/pull/24670
>>> >>>>>
>>> >>>>> On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis 
>>> wrote:
>>> >>>>>>
>>> >>>>>> Since Robert opened that can of worms (and we happened to talk
>>> about it yesterday)... :-)
>>> >>>>>>
>>> >>>>>> I figured I'd also share my start on a "port" of dbt to the Beam
>>> SDK. This would be complementary as it doesn't really provide a way of
>>> specifying a pipeline, more orchestrating and packaging a complex
>>> pipeline---dbt itself supports SQL and Python Dataframes, which both seem
>>> like reasonable things for Beam and it wouldn't be a stretch to include
>>> something like the format above. Though in my head I had imagined people
>>> would tend to write 

Re: A Declarative API for Apache Beam

2022-12-14 Thread Sachin Agarwal via dev
To build on Kenn's point, if we leverage existing stuff like dbt we get
access to a ready made community which can help drive both adoption and
incremental innovation by bringing more folks to Beam

On Wed, Dec 14, 2022 at 2:57 PM Kenneth Knowles  wrote:

> 1. I love the idea. Back in the early days people talked about an "XML
> SDK" or "JSON SDK" or "YAML SDK" and it didn't really make sense at the
> time. Portability and specifically cross-language schema transforms gives
> the right infrastructure so this is the perfect time: unique names (URNs)
> for transforms and explicit lists of parameters they require.
>
> 2. I like the idea of re-using some existing thing like dbt if it is
> pretty much what we were going to do anyhow. I don't think we should hold
> ourselves back. I also don't think we'll gain anything in terms of
> implementation. But at least it could fast-forward our design process
> because we simply don't have to make most of the decisions because they are
> made for us.
>
>
>
> On Wed, Dec 14, 2022 at 2:44 PM Byron Ellis via dev 
> wrote:
>
>> And I guess also a PR for completeness to make it easier to find going
>> forward instead of my random repo:
>> https://github.com/apache/beam/pull/24670
>>
>> On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis 
>> wrote:
>>
>>> Since Robert opened that can of worms (and we happened to talk about it
>>> yesterday)... :-)
>>>
>>> I figured I'd also share my start on a "port" of dbt to the Beam SDK.
>>> This would be complementary as it doesn't really provide a way of
>>> specifying a pipeline, more orchestrating and packaging a complex
>>> pipeline---dbt itself supports SQL and Python Dataframes, which both seem
>>> like reasonable things for Beam and it wouldn't be a stretch to include
>>> something like the format above. Though in my head I had imagined people
>>> would tend to write composite transforms in the SDK of their choosing that
>>> are then exposed at this layer. I decided to go with dbt as it also
>>> provides a number of nice "quality of life" features for its users like
>>> documentation, validation, environments and so on.
>>>
>>> I did a really quick proof-of-viability implementation here:
>>> https://github.com/byronellis/beam/tree/structured-pipeline-definitions
>>>
>>> And you can see a really simple pipeline that reads a seed file
>>> (TextIO), runs it through a couple of SQLTransforms and then drops it out
>>> to a logger via a simple DoFn here:
>>> https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
>>>
>>> I've also heard a rumor there might also be a textproto-based
>>> representation floating around too :-)
>>>
>>> Best,
>>> B
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Hello Robert,

 I'm replying to say that I've been waiting for something like this ever
 since I started learning Beam and I'm grateful you are pushing this 
 forward.

 Best,

 Damon

 On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw 
 wrote:

> While Beam provides powerful APIs for authoring sophisticated data
> processing pipelines, it often still has too high a barrier for
> getting started and authoring simple pipelines. Even setting up the
> environment, installing the dependencies, and setting up the project
> can be an overwhelming amount of boilerplate for some (though
> https://beam.apache.org/blog/beam-starter-projects/ has gone a long
> way in making this easier). At the other extreme, the Dataflow project
> has the notion of templates which are pre-built Beam pipelines that
> can be easily launched from the command line, or even from your
> browser, but they are fairly restrictive, limited to pre-assembled
> pipelines taking a small number of parameters.
>
> The idea of creating a yaml-based description of pipelines has come up
> several times in several contexts and this last week I decided to code
> up what it could look like. Here's a proposal.
>
> pipeline:
>   - type: chain
> transforms:
>   - type: ReadFromText
> args:
>  file_pattern: "wordcount.yaml"
>   - type: PyMap
> fn: "str.lower"
>   - type: PyFlatMap
> fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>   - type: PyTransform
> name: Count
> constructor:
> "apache_beam.transforms.combiners.Count.PerElement"
>   - type: PyMap
> fn: str
>   - type: WriteToText
> file_path_prefix: "counts.txt"
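A toy, stdlib-only sketch of how a "chain" like the one above could be interpreted once parsed (the real prototype is in the linked PR; this registry, its transform names, and the flat dict shape are invented for illustration and assume the YAML has already been loaded):

```python
# Each step names a registered transform; the chain applies them in order
# to an in-memory collection. Real Beam would resolve these to PTransforms.
import re

REGISTRY = {
    "Lower": lambda elems: [e.lower() for e in elems],
    "ExtractWords": lambda elems: [w for e in elems
                                   for w in re.findall(r"[a-z]+", e)],
}

def run_chain(spec, inputs):
    """Apply each transform in the chain, in order, to the collection."""
    elems = list(inputs)
    for step in spec["transforms"]:
        elems = REGISTRY[step["type"]](elems)
    return elems

pipeline = {"transforms": [{"type": "Lower"}, {"type": "ExtractWords"}]}
print(run_chain(pipeline, ["Hello Beam", "YAML Pipelines"]))
# -> ['hello', 'beam', 'yaml', 'pipelines']
```

The "chain" shape is the easy case; the non-chain case is where the topological sort and named inputs/outputs discussed earlier in the thread come in.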
>
> Some more examples at
> https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>
> A prototype (feedback welcome) can be found at
> https://github.com/apache/beam/pull/24667. It can be invoked as
>
> python -m 

Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-14 Thread Sachin Agarwal via dev
I strongly believe that we should continue to have Beam optimize for the
user - and while having separate components would allow those of us who are
contributors and committers to move faster, the downsides of not having
everything "in one box" for a new user, where the components are all
relatively guaranteed to work together at a given version level, are very high.

Beam having everything included is absolutely a competitive advantage for
Beam and I would not want to lose that.

On Wed, Dec 14, 2022 at 9:31 AM Byron Ellis via dev 
wrote:

> Take it with a grain of salt since I'm not even a committer, but is
> perhaps the reorganization of Beam into smaller components the real work of
> a 3.0 effort? Splitting of Beam into smaller more independently managed
> components would be a pretty huge breaking change from a dependency
> management perspective which would potentially be largely separate from any
> code changes.
>
> Best,
> B
>
> On Wed, Dec 14, 2022 at 9:23 AM Alexey Romanenko 
> wrote:
>
>> On 12 Dec 2022, at 22:23, Robert Bradshaw via dev 
>> wrote:
>>
>>
>> Saving up all the breaking changes until a major release definitely
>> has its downsides (look at Python 3). The migration path is often as
>> important (if not more so) than the final destination.
>>
>>
>> Actually, it proves that the major releases *should not* be delayed for
>> a long period of time and *should* be issued more often to reduce the
>> number of breaking changes (which, of course, may well happen). That will
>> help users make much smoother and less risky upgrades, and help developers
>> avoid carrying the burden forever. Beam 2.0.0 was released back in May 2017 and
>> we've almost never talked about Beam 3.0 and what are the criteria for it.
>> I understand that it’s a completely different discussion but seems that
>> this time has come =)
>>
>> As for this particular change, I would question how the benefit (it's
>> unclear what the exact benefit is--better internal organization?)
>> exceeds the pain of making every user refactor their code. I think a
>> stronger case can be made for things like the Avro dependency that
>> cause real pain.
>>
>>
>> Agree. I think that if it doesn’t bring any pain with additional external
>> dependencies and this code is used in almost every other SDK module, then
>> there are no reasons for such breaking changes. On the other hand, Avro
>> case, that you mentioned above, is a good example why sometimes it would be
>> better to keep such code outside of “core”.
>>
>> As for the pipeline update feature, we've long discussed having
>> "pick-your-implementation" transforms that specify alternative,
>> equivalent implementations. Upgrades can choose the old one whereas
>> new pipelines can get the latest and greatest. It won't solve all
>> issues, and requires keeping old codepaths around, but could be an
>> important step forward.
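The "pick-your-implementation" idea above can be sketched as a transform that registers several equivalent implementations, with pipeline options selecting which one runs. All names and the flag shape here are hypothetical, not an existing Beam API:

```python
# Illustrative only: a running pipeline being updated pins the old code
# path, while a fresh pipeline gets the latest implementation.
IMPLEMENTATIONS = {
    "WordLength": {
        "v1": lambda words: [len(w) for w in words],  # legacy path
        "v2": lambda words: list(map(len, words)),    # newer, equivalent path
    }
}

def resolve(transform, prefer="v2", updating_from=None):
    """Pick the pinned version when updating a running pipeline,
    otherwise the preferred (latest) one."""
    versions = IMPLEMENTATIONS[transform]
    return versions[updating_from] if updating_from else versions[prefer]

fresh = resolve("WordLength")                       # new pipeline: v2
updated = resolve("WordLength", updating_from="v1")  # update: stays on v1
assert fresh(["beam"]) == updated(["beam"]) == [4]   # equivalent results
```

As the message notes, this keeps old code paths alive, so the cost is carrying both implementations until the old one can be retired.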
>>
>> On Mon, Dec 12, 2022 at 10:20 AM Kenneth Knowles  wrote:
>>
>>
>> I agree with Mortiz. To answer a few specifics in my own words:
>>
>> - It is a perfectly sensible refactor, but as a counterpoint without
>> file-based IO the SDK isn't functional so it is also a reasonable design
>> point to have this included. There are other things in the core SDK that
>> are far less "core" and could be moved out with greater benefit. The main
>> goal for any separation of modules would be lighter weight transitive
>> dependencies, IMO.
>>
>> - No, Beam has not made any deliberate breaking changes of this nature.
>> Hence we are still on major version 2. We have made some bugfixes for data
>> loss risks that could be called "breaking changes" but since the feature
>> was unsafe to use in the first place we did not bump the major version.
>>
>> - It is sometimes possible to do such a refactor and have the deprecated
>> location proxy to the new location. In this case that seems hard to achieve.
>>
>> - It is not actually necessary to maintain both locations, as we can
>> declare the old location will be unmaintained (but left alone) and all new
>> development goes to the new location. That isn't a great choice for users
>> who may simply upgrade their SDK version and not notice that their old code
>> is now pointing at a version that will not receive e.g. security updates.
>>
>> - I like the style where if/when we transition from Beam 2 to Beam 3 we
>> should have the exact functionality of Beam 3 available as an opt-in flag
>> first. So if a user passes --beam-3 they get exactly what will be the
>> default functionality when we bump the major version. It really is a
>> problem to do a whole bunch of stuff feverishly before a major version
>> bump. The other style that I think works well is the linux kernel style
>> where major versions alternate between stable and unstable (in other words,
>> returning to the 0.x style with every alternating version).
>>
>> - I do think Beam suffers from fear and inability to do significant code
>> gardening. I don't think backwards compatibility in the 

Re: [Proposal] Adopt a Beam I/O Standard

2022-12-13 Thread Sachin Agarwal via dev
It would be helpful to explain the scope here - if the previous iteration
was too heavyweight, it would be good to be intentional about what this one covers.

I think all would agree that being more prescriptive would help IO makers
(especially those from startups looking to expand their reach).

On Mon, Dec 12, 2022 at 7:32 PM Chamikara Jayalath 
wrote:

> Yeah, I don't think we either finalized or documented (on the website) the
> previous iteration. This doc seems to contain details from the documents
> shared in the previous iteration.
>
> Thanks,
> Cham
>
>
>
> On Mon, Dec 12, 2022 at 6:49 PM Robert Burke  wrote:
>
>> I think ultimately: until the docs are clearly available on the Beam site
>> itself, it's not documentation. See also, design docs, previous emails, and
>> similar.
>>
>> On Mon, Dec 12, 2022, 6:07 PM Andrew Pilloud via dev 
>> wrote:
>>
>>> I believe the previous iteration was here:
>>> https://lists.apache.org/thread/3o8glwkn70kqjrf6wm4dyf8bt27s52hk
>>>
>>> The associated docs are:
>>> https://s.apache.org/beam-io-api-standard-documentation
>>> https://s.apache.org/beam-io-api-standard
>>>
>>> This is missing all the relational stuff that was in those docs; this
>>> appears to be another attempt starting from the beginning?
>>>
>>> Andrew
>>>
>>>
>>> On Mon, Dec 12, 2022 at 9:57 AM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
 Thanks for writing this!

 IIRC, a similar design doc was sent for review here a while ago. Is
 this just an updated version or a new one?

 —
 Alexey

 On 11 Dec 2022, at 15:16, Herman Mak via dev 
 wrote:

 Hello Everyone,

 *TLDR*

 Should we adopt a set of standards that Connector I/Os should adhere
 to?
 Attached is a first version of a Beam I/O Standards guideline that
 includes opinionated best practices across important components of a
 Connector I/O, namely Documentation, Development and Testing.

 *The Long Version*

 Apache Beam is a unified open-source programming model for both batch
 and streaming. It runs on multiple platform runners and integrates with
 over 50 services using individually developed I/O Connectors
 .

 Given that Apache Beam connectors are written by many different
 developers and at varying points in time, they vary in syntax style,
 documentation completeness and testing done. For a new adopter of Apache
 Beam, that can definitely cause some uncertainty.

 So should we adopt a set of standards that Connector I/Os should adhere
 to?
 Attached is a first version, in Doc format, of a Beam I/O Standards
 guideline that includes opinionated best practices across important
 components of a Connector I/O, namely Documentation, Development and
 Testing. And the aim is to incorporate this into the documentation and to
 have it referenced as standards for new Connector I/Os (and ideally have
 existing Connectors upgraded over time). If it looks helpful, the immediate
 next step is that we can convert it into a .md as a PR into the Beam repo!

 Thanks and looking forward to feedback and discussion,

  [PUBLIC] Beam I/O Standards
 

 Herman Mak |  Customer Engineer, Hong Kong, Google Cloud |
 herman...@google.com |  +852-3923-5417 <+852%203923%205417>






Re: [DISCUSSION][JAVA] Current state of Java 17 support

2022-12-01 Thread Sachin Agarwal via dev
This is a good heads up, thank you Cristian.

On Thu, Dec 1, 2022 at 8:13 AM Cristian Constantinescu 
wrote:

> Hi,
>
> I came across some Kafka info and would like to share for those
> unaware. Kafka is planning to drop support for Java 8 in Kafka 4 (Java
> 8 is deprecated in Kafka 3), see KIP-750 [1].
>
> I'm not sure when Kafka 4 is scheduled to be released (probably a few
> years down the road), but when it happens, KafkaIO may not be able to
> support it if we maintain Java 8 compatibility unless it remains on
> Kafka 3.
>
> Anyways, if not already done, I think it's a good idea to start
> putting up serious warning flags around Beam used with Java 8, even
> for Google cloud customers ;)
>
> Cheers,
> Cristian
>
> [1] https://issues.apache.org/jira/browse/KAFKA-12894
>
> On Wed, Nov 30, 2022 at 12:59 PM Kenneth Knowles  wrote:
> >
> > An important thing is to ensure that we do not accidentally depend on
> something that would break Java 8 support.
> >
> > Currently our Java 11 and 17 tests build the code with Java 8 (just like
> our released artifacts) and then compile and run the test code with the
> newer JDK. This roughly matches the user scenario, I think. So it is a
> little more complex than just having separate test runs for different JDK
> versions. But it would be good to make this more symmetrical between JDK
> versions to develop the mindset that JDK is always explicit.
> >
> > Kenn
> >
> > On Wed, Nov 30, 2022 at 9:48 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
> >>
> >>
> >> On 30 Nov 2022, at 03:56, Tomo Suzuki via dev 
> wrote:
> >>
> >> > Do we still need to support Java 8 SDK?
> >>
> >> Yes, for Google Cloud customers who still use Java 8, I want Apache
> Beam to support Java 8. Do you observe any special burden maintaining Java
> 8?
> >>
> >>
> >> I can only think of the additional resource costs if we test all
> supported JDKs, as Austin mentioned above. Imho, we should do that for all
> JDKs that are officially supported.
> >> Another less-costly way is to run the Java tests for all JDKs only
> during the release preparation stage.
> >>
> >> I agree that it would make sense to continue to support Java 8 while a
> significant number of users are still using it.
> >>
> >> —
> >> Alexey
> >>
> >>
> >>
> >> Regards,
> >> Tomo
> >>
> >> On Tue, Nov 29, 2022 at 21:48 Austin Bennett  wrote:
> >>>
> >>> -1 for ongoing Java8 support [ or, said another way, +1 for dropping
> support of Java8 ]
> >>>
> >>> +1 for having tests that run for ANY JDK that we say we support.  Is
> there any reason the resources to support this are too costly [ or outweigh
> the benefits of additional confidence in ensuring we support what we say we
> do ]?  I am not certain whether this would only be critical for releases,
> or should be done as part of regular CI.
> >>>
> >>> On Tue, Nov 29, 2022 at 8:51 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
> 
>  Hello,
> 
>  I’m sorry if it’s already discussed somewhere but I find myself a
> little bit lost in the subject.
>  So, I’d like to clarify this - what is the current official state of
> Java 17 support in Beam?
> 
>  I recall that a great job was done to make Beam compatible with Java
> 17 [1] and Beam already provides “beam_java17_sdk” Docker image [2] but,
> iiuc, Java 8 is still the default JVM to run all Java tests on Jenkins
> ("Java PreCommit" in the first order) and there are only limited number of
> tests that are running with JDK 11 and 17 on Jenkins by dedicated jobs.
> 
>  So, my question is: if Beam officially supports Java 17 (and 11), do we
> need to run all Beam Java SDK-related tests (VR and IT tests included)
> against all supported Java SDKs?
> 
>  Do we still need to support Java 8 SDK?
> 
>  In the same time, as we are heading to move everything from Jenkins
> to GitHub actions, what would be the default JDK there or we will run all
> Java-related actions against all supported JDKs?
> 
>  —
>  Alexey
> 
>  [1] https://issues.apache.org/jira/browse/BEAM-12240
>  [2] https://hub.docker.com/r/apache/beam_java17_sdk
> 
> 
> 
> >> --
> >> Regards,
> >> Tomo
> >>
> >>
>


Re: Questions on primitive transforms hierarchy

2022-11-14 Thread Sachin Agarwal via dev
Would it be helpful to add these answers to the Beam docs?

On Mon, Nov 14, 2022 at 4:35 AM Jan Lukavský  wrote:

> I somehow missed these answers, Reuven and Kenn, thanks for the
> discussion, it helped me clarify my understanding.
>
>  Jan
> On 10/26/22 21:10, Kenneth Knowles wrote:
>
>
>
> On Tue, Oct 25, 2022 at 5:53 AM Jan Lukavský  wrote:
>
>> > Not quite IMO. It is a subtle difference. Perhaps these transforms can
>> be *implemented* using stateful DoFn, but defining their semantics directly
>> at a high level is more powerful. The higher level we can make transforms,
>> the more flexibility we have in the runners. You *could* suggest that we
>> take the same approach as we do with Combine: not a primitive, but a
>> special transform that we optimize. You could say that "vanilla ParDo" is a
>> composite that has a stateful ParDo implementation, but a runner can
>> implement the composite more efficiently (without a shuffle). Same with
>> CoGBK. You could say that there is a default expansion of CoGBK that uses
>> stateful DoFn (which implies a shuffle) but that smart runners will not use
>> that expansion.
>>
>> Yes, semantics > optimizations. For optimizations Beam already has a
>> facility - PTransformOverride. There is no fundamental difference about how
>> we treat Combine wrt GBK. It *can* be expanded using GBK, but "smart
>> runners will not use that expansion". This is essentially the root of this
>> discussion.
>>
>> If I rephrase it:
>>
>>  a) why do we distinguish between "some" actually composite transforms
>> treating them as primitive, while others have expansions, although the
>> fundamental reasoning seems the same for both (performance)?
>>
> It is identical to why you can choose different axioms for formal logic
> and get all the same provable statements. You have to choose something. But
> certainly a runner that just executes primitives is the bare minimum and
> all runners are really expected to take advantage of known composites.
> Before portability, the benefit was minimal to have the runner (written in
> Java) execute a transform directly vs calling a user DoFn. Now with
> portability it could be huge if it avoids a Fn API crossing.
>
>  b) is there a fundamental reason why we do not support stateful DoFn for
>> merging windows?
>>
> No reason. The original design was to force users to only use "mergeable"
> state in a stateful DoFn for merging windows. That is an annoying
> restriction that we don't really need. So I think the best way is to have
> an OnMerge callback. The internal legacy Java APIs for this are way too
> complex. But portability wire protocols support it (I think?) and making a
> good user facing API for all the SDKs shouldn't be too hard.
>
> Kenn
>
>
>> I feel that these are related and have historical reasons, but I'd like
>> to know that for sure. :)
>>
>>  Jan
>> On 10/24/22 19:59, Kenneth Knowles wrote:
>>
>>
>>
>> On Mon, Oct 24, 2022 at 5:51 AM Jan Lukavský  wrote:
>>
>>> On 10/22/22 21:47, Reuven Lax via dev wrote:
>>>
>>> I think we stated that CoGroupByKey was also a primitive, though in
>>> practice it's implemented in terms of GroupByKey today.
>>>
>>> On Fri, Oct 21, 2022 at 3:05 PM Kenneth Knowles  wrote:
>>>


 On Fri, Oct 21, 2022 at 5:24 AM Jan Lukavský  wrote:

> Hi,
>
> I have some missing pieces in my understanding of the set of Beam's
> primitive transforms, which I'd like to fill. First a quick recap of what 
> I
> think is the current state. We have (basically) the following primitive
> transforms:
>
>  - DoFn (stateless, stateful, splittable)
>
>  - Window
>
>  - Impulse
>
>  - GroupByKey
>
>  - Combine
>

 Not a primitive, just a well-defined transform that runners can execute
 in special ways.

>>> Yep, OK, agree. Performance is orthogonal to semantics.
>>>
>>>

>
>
>  - Flatten (pCollections)
>

 The rest, yes.



> Inside runners, we most often transform GBK into ReduceFn
> (ReduceFnRunner), which does the actual logic for both GBK and stateful
> DoFn.
>

 ReduceFnRunner is for windowing / triggers and has special feature to
 use a CombineFn while doing it. Nothing to do with stateful DoFn.

>>> My bad, wrong wording. The point was that *all* of the semantics of GBK
>>> and Combine can be defined in terms of stateful DoFn. There are some
>>> changes needed to stateful DoFn to support the Combine functionality. But
>>> as mentioned above - optimization is orthogonal to semantics.
>>>
>>
>> Not quite IMO. It is a subtle difference. Perhaps these transforms can be
>> *implemented* using stateful DoFn, but defining their semantics directly at
>> a high level is more powerful. The higher level we can make transforms, the
>> more flexibility we have in the runners. You *could* suggest that we take
>> the same approach as we do with Combine: not a primitive, but a special transform that we optimize.
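The Combine-versus-GBK point discussed in this thread — a default expansion via GBK versus a runner's optimized ("lifted") execution — can be sketched in plain Python with no Beam dependency (all names here are illustrative, not Beam APIs):

```python
from collections import defaultdict


def group_by_key(pairs):
    """The GBK primitive: gather all values for each key."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)


def combine_via_gbk(pairs, combine_fn):
    """Default expansion: Combine = GBK followed by a per-key fold.

    Semantically correct, but it materializes every value per key
    (i.e. shuffles all the data)."""
    return {k: combine_fn(vs) for k, vs in group_by_key(pairs).items()}


def combine_lifted(pairs, combine_fn):
    """What a 'smart runner' does instead: fold eagerly per key as
    elements arrive, never materializing the grouped values
    (combiner lifting). Same result, different execution."""
    acc = {}
    for k, v in pairs:
        acc[k] = combine_fn([acc[k], v]) if k in acc else v
    return acc
```

Both paths produce identical results for an associative, commutative `combine_fn`, which is why the choice of expansion is an optimization question rather than a semantics question — the point made above.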

Re: Experimental WebAssembly Example | Go Beam SDK

2022-11-10 Thread Sachin Agarwal via dev
This is super interesting, thank you Damon!

On Thu, Nov 10, 2022 at 10:51 AM Damon Douglas via dev 
wrote:

> Hello Everyone,
>
> I created https://github.com/apache/beam/pull/24081 to start a
> conversation around WebAssembly support in Beam.
>
> WebAssembly is an experimental technology.  According to WebAssembly.org,
> "Wasm is designed as a portable compilation target for programming
> languages".  This PR is a simple example of how to embed a wasm compiled
> function, written in Rust, within a DoFn using the Go Beam SDK.
>
> Best,
>
> Damon
>


Re: [ANNOUNCE] New committer: Yi Hu

2022-11-09 Thread Sachin Agarwal via dev
Congratulations Yi!

On Wed, Nov 9, 2022 at 10:32 AM Kenneth Knowles  wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Yi Hu (y...@apache.org)
>
> Yi started contributing to Beam in early 2022. Yi's contributions are very
> diverse! I/Os, performance tests, Jenkins, support for Schema logical
> types. Not only code but a very large amount of code review. Yi is also
> noted for picking up smaller issues that normally would be left on the
> backburner and filing issues that he finds rather than ignoring them.
>
> Considering their contributions to the project over this timeframe, the
> Beam PMC trusts Yi with the responsibilities of a Beam committer. [1]
>
> Thank you Yi! And we are looking to see more of your contributions!
>
> Kenn, on behalf of the Apache Beam PMC
>
> [1]
>
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>


github reviewer help / tips

2022-11-08 Thread Sachin Agarwal via dev
Hey folks,

I've found myself repeatedly being very untimely in providing reviews on
PRs where I've been added as a reviewer.  (Mea culpa and thank you for your
understanding to those who have tagged me and emailed me to nudge me along.)

Does anyone have any great tips about how to be super on top of things in
the Beam repos?  Any Github experts who can get my SLA from three weeks to
a day or so would be great.

Many thanks in advance -

Cheers,
Sachin


Re: [ANNOUNCE] New committer: Ritesh Ghorse

2022-11-03 Thread Sachin Agarwal via dev
Congrats Ritesh!

On Thu, Nov 3, 2022 at 4:16 PM Kenneth Knowles  wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming a new committer:
> Ritesh Ghorse (riteshgho...@apache.org)
>
> Ritesh started contributing to Beam in mid-2021 and has contributed
> immensely to bringing the Go SDK to fruition, in addition to contributions
> to Java and Python and release validation.
>
> Considering their contributions to the project over this timeframe, the
> Beam PMC trusts Ritesh with the responsibilities of a Beam committer. [1]
>
> Thank you Ritesh! And we are looking to see more of your contributions!
>
> Kenn, on behalf of the Apache Beam PMC
>
> [1]
>
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>


Re: Support existing IOs with Schema Transforms

2022-11-03 Thread Sachin Agarwal via dev
I think this is a great idea - making as many existing IOs as possible
available to developers in any language is a huge win (and helps reduce the
need to re-implement IOs on a language-by-language basis going forward).

On Thu, Nov 3, 2022 at 11:25 AM Ahmed Abualsaud via dev 
wrote:

> Hi all,
>
> There has been an effort to add SchemaTransform capabilities to our
> connectors to facilitate the use of multi-lang pipelines. I've drafted a
> document below that provides guidelines and examples of how to support IOs
> with SchemaTransforms. Please take a look and share your thoughts and
> suggestions!
>
>  Supporting existing connectors with SchemaTrans...
> 
>
>
> Best,
> Ahmed
>


Re: Beam Website Feedback

2022-10-27 Thread Sachin Agarwal via dev
No objections here.  The latter (the surviving page) is the one linked in
the top navigation bar and has the x-lang details that help.

On Thu, Oct 27, 2022 at 2:09 PM Brian Hulette  wrote:

> Hm, it seems like we need to drop
> https://beam.apache.org/documentation/io/built-in/ as it's been
> superseded by https://beam.apache.org/documentation/io/connectors/
>
> Would there be any objections to that?
>
> On Thu, Oct 27, 2022 at 2:04 PM Sachin Agarwal via dev <
> dev@beam.apache.org> wrote:
>
>> JDBCIO is available as a Java-based IO.  It is also listed on
>> https://beam.apache.org/documentation/io/connectors/
>>
>> On Thu, Oct 27, 2022 at 2:01 PM Charles Kangai <
>> char...@charleskangai.co.uk> wrote:
>>
>>> What about jdbc?
>>> I want to use Beam to read/write to/from a relational database, e.g.
>>> Oracle or Microsoft SQL Server.
>>> I don’t see a connector on your page:
>>> https://beam.apache.org/documentation/io/built-in
>>>
>>> Thanks,
>>> Charles Kangai
>>>
>>>
>>>
>>


Re: Beam Website Feedback

2022-10-27 Thread Sachin Agarwal via dev
JDBCIO is available as a Java-based IO.  It is also listed on
https://beam.apache.org/documentation/io/connectors/

On Thu, Oct 27, 2022 at 2:01 PM Charles Kangai 
wrote:

> What about jdbc?
> I want to use Beam to read/write to/from a relational database, e.g.
> Oracle or Microsoft SQL Server.
> I don’t see a connector on your page:
> https://beam.apache.org/documentation/io/built-in
>
> Thanks,
> Charles Kangai
>
>
>


Re: [idea] A new IO connector named DataLakeIO, which supports connecting Beam to data lakes such as Delta Lake, Apache Hudi, and Apache Iceberg.

2022-09-26 Thread Sachin Agarwal via dev
It turns out there was a commit submitted here!
https://github.com/nanhu-lab/beam/commit/d4f5fa4c41602b4696737929dd1bdd5ae2302a65

Related GH issue: https://github.com/apache/beam/issues/23074

On Tue, Aug 30, 2022 at 10:28 AM Sachin Agarwal  wrote:

> I would posit that something is better than nothing - did we ever see that
> generic implementation?
>
> On Tue, Aug 30, 2022 at 10:22 AM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> Is there enough commonality across Delta, Hudi, Iceberg for this generic
>> solution?  I imagined we'd potentially have individual IOs for each.  A
>> generic one seems possible, but certainly would like to learn more.
>>
>> Also, are others in the community working on connectors for ANY of those
>> Delta Lake, Hudi, or Iceberg IOs?  Would hope for some form of coordination
>> and/or at least awareness between people addressing
>> complementary/overlapping areas.
>>
>> On Mon, Aug 29, 2022 at 4:15 PM Neil Kolban via dev 
>> wrote:
>>
>>> Howdy,
>>> I have a client who would be interested to use this.  Is there a link to
>>> a GitHub repo or other place I can read more?
>>>
>>> Neil  (kol...@google.com)
>>>
>>> On 2022/08/05 07:23:31 张涛 wrote:
>>> >
>>> > Hi, we developed a new IO connector named DataLakeIO, to connect Beam
>>> and data lakes such as Delta Lake, Apache Hudi, and Apache Iceberg. Beam can
>>> use DataLakeIO to read data from data lake, and write data to data lake. We
>>> did not find data lake IO on
>>> https://beam.apache.org/documentation/io/built-in/, we want to
>>> contribute this new IO connector to Beam, what should we do next? Thank you
>>> very much!
>>>
>>


Re: [ANNOUNCE][Testing] TPC-DS benchmark suite in Beam

2022-09-16 Thread Sachin Agarwal via dev
This is wonderful - thank you so much to you and the whole Talend team to
make Beam better!

On Fri, Sep 16, 2022 at 9:11 AM Alexey Romanenko 
wrote:

> Hi everybody,
>
> As some of you may know, at Talend, we’ve been working for a while to add
> TPC-DS benchmark suite into Beam. We believe that having TPC-DS as a part
> of Beam testing workflow and release routine will help a community to
> detect quickly the performance regressions or improvements, identify
> missing or incorrect Beam SQL features and execute Beam SQL on different
> runtime environments with different runners.
>
> What is TPC-DS? From TPC-DS specification document [1]:
>
> *“TPC-DS is a decision support benchmark that models several generally
> applicable aspects of a decision support system, including queries and data
> maintenance. The benchmark provides a representative evaluation of
> performance as a general purpose decision support system.” *
>
> The TPC-DS benchmark suite for Beam is implemented as a separate testing
> tool for the Java SDK (like the well-known Nexmark benchmark suite) [2]. It
> supports a limited number of TPC-DS SQL queries for now (mostly because of
> limited SQL syntax support in Beam), CSV and Parquet as input data formats,
> and it runs on Jenkins with the three most popular Beam runners (Spark [3],
> Flink [4],
> Dataflow [5]). The job metrics are stored in InfluxDB and can be accessed
> through Grafana dashboards [6][7][8].
>
> More details can be found in Beam documentation [9].
>
> There are still plenty of things to do, like adding new runners and
> supporting other SDKs, data formats, etc., so your contributions are very
> welcome in any form. That said, we already have a first working,
> automated version that can be used by the community.
>
> Also, I’d like to thank everybody who worked on this improvement!
>
> —
> Alexey
>
>
> [1]
> https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp
> [2] https://github.com/apache/beam/tree/master/sdks/java/testing/tpcds
> [3] https://ci-beam.apache.org/job/beam_PostCommit_Java_Tpcds_Spark/
> [4] https://ci-beam.apache.org/job/beam_PostCommit_Java_Tpcds_Flink/
> [5] https://ci-beam.apache.org/job/beam_PostCommit_Java_Tpcds_Dataflow/
> [6]
> http://metrics.beam.apache.org/d/tkqc0AdGk2/tpc-ds-spark-classic-new-sql?orgId=1
> [7] http://metrics.beam.apache.org/d/8INnSY9Mv/tpc-ds-flink-sql?orgId=1
> [8]
> http://metrics.beam.apache.org/d/tkqc0AdGk2/tpc-ds-spark-classic-new-sql?orgId=1
> [9] https://beam.apache.org/documentation/sdks/java/testing/tpcds/
>
>
>
>
>
>


Re: Beam Website Feedback

2022-09-13 Thread Sachin Agarwal via dev
Andrew,

Thanks so much for the feedback and glad the getting started materials are
helping.

Would you like a downloadable container that works out of the box for your
local machine or to spin up on AWS or GCP or DO or something like that?
Just trying to make sure we understand the gap correctly so we can fix it
for everyone!

Cheers,
Sachin

On Tue, Sep 13, 2022 at 9:55 AM Andrew Hoyle via dev 
wrote:

> Hello,
>
> This is an exciting technology that I'm enjoying learning about.  I'm
> learning best through the examples throughout the Beam docs.  (My favorite
> examples have been those around the clicks, views, and batch RPC calls in
> the state and timer sections.)  I'm able to run proofs of concept locally
> with direct, interactive pipeline runners using those helpful examples.
> I'm finding it difficult, however, to make less trivial local Docker
> environments that use other runners like FlinkRunner.  There are many
> pipeline options to weed through and I'd benefit if they were explained in
> more detail.  If there were a complete example of a Docker Compose Flink
> runner environment + the correct pipeline settings, that would help
> immensely.  It seems other devs on the corners of Stack Overflow are having
> similar problems understanding the different options like
> environment_type=LOOPBACK|EXTERNAL|DOCKER|PROCESS|etc. and are also
> struggling to get a Flink environment spun up.  (I'm using the Python and
> Go SDKs.)
>
> Thanks,
> Andrew
>
>
>
>


Re: [idea] A new IO connector named DataLakeIO, which supports connecting Beam to data lakes such as Delta Lake, Apache Hudi, and Apache Iceberg.

2022-08-30 Thread Sachin Agarwal via dev
I would posit that something is better than nothing - did we ever see that
generic implementation?

On Tue, Aug 30, 2022 at 10:22 AM Austin Bennett 
wrote:

> Is there enough commonality across Delta, Hudi, Iceberg for this generic
> solution?  I imagined we'd potentially have individual IOs for each.  A
> generic one seems possible, but certainly would like to learn more.
>
> Also, are others in the community working on connectors for ANY of those
> Delta Lake, Hudi, or Iceberg IOs?  Would hope for some form of coordination
> and/or at least awareness between people addressing
> complementary/overlapping areas.
>
> On Mon, Aug 29, 2022 at 4:15 PM Neil Kolban via dev 
> wrote:
>
>> Howdy,
>> I have a client who would be interested to use this.  Is there a link to
>> a GitHub repo or other place I can read more?
>>
>> Neil  (kol...@google.com)
>>
>> On 2022/08/05 07:23:31 张涛 wrote:
>> >
>> > Hi, we developed a new IO connector named DataLakeIO, to connect Beam
>> and data lakes such as Delta Lake, Apache Hudi, and Apache Iceberg. Beam can
>> use DataLakeIO to read data from data lake, and write data to data lake. We
>> did not find data lake IO on
>> https://beam.apache.org/documentation/io/built-in/, we want to
>> contribute this new IO connector to Beam, what should we do next? Thank you
>> very much!
>>
>


Re: [idea] A new IO connector named DataLakeIO, which supports connecting Beam to data lakes such as Delta Lake, Apache Hudi, and Apache Iceberg.

2022-08-05 Thread Sachin Agarwal via dev
This is wonderful to hear -
https://beam.apache.org/contribute/get-started-contributing/#contribute-code
has the process to contribute; we're very much looking forward to seeing
your DataLakeIO!

On Fri, Aug 5, 2022 at 9:02 AM 张涛  wrote:

>
> Hi, we developed a new IO connector named DataLakeIO, to connect Beam and
> data lakes such as Delta Lake, Apache Hudi, and Apache Iceberg. Beam can use
> DataLakeIO to read data from data lake, and write data to data lake. We did
> not find data lake IO on
> https://beam.apache.org/documentation/io/built-in/, we want to contribute
> this new IO connector to Beam, what should we do next? Thank you very
> much!
>


Re: BigTable reader for Python?

2022-07-26 Thread Sachin Agarwal via dev
On Tue, Jul 26, 2022 at 6:12 PM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:

>
>
> On Mon, Jul 25, 2022 at 12:53 PM Lina Mårtensson via dev <
> dev@beam.apache.org> wrote:
>
>> Hi dev,
>>
>> We're starting to incorporate BigTable in our stack and I've delighted
>> my co-workers with how easy it was to create some BigTables with
>> Beam... but there doesn't appear to be a reader for BigTable in
>> Python.
>>
>> First off, is there a good reason why not/any reason why it would be
>> difficult?
>>
>
> There was a previous effort to implement a Python BT source but that was
> not completed:
> https://github.com/apache/beam/pull/11295#issuecomment-646378304
>
>
>>
>> I could write one, but before I start, I'd love some input to make it
>> easier.
>>
>> It appears that there would be two options: either write one in
>> Python, or try to set one up with x-language from Java which I see is
>> done e.g. with the Spanner IO Connector.
>> Any recommendation on which one to pick or potential pitfalls in either
>> choice?
>>
>> If I write one in Python, what should I think about?
>> It is not obvious to me how to achieve parallelization, so any tips
>> here would be welcome.
>>
>
> I would strongly prefer developing a  Python wrapper for the existing Java
> BT source using Beam's Multi-language Pipelines framework over developing a
> new Python source.
>
> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>

This is the way.

>
> 
>
> Thanks,
> Cham
>
>
>
>>
>> Thanks!
>> -Lina
>>
>
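On the parallelization question raised above: range-based sources (like the Java Bigtable connector) typically split the row-key space into contiguous ranges that workers scan independently. A dependency-free sketch of the idea — function and parameter names are illustrative, and real sources derive their split points from row keys sampled from the service:

```python
def ranges_from_sample_keys(sample_keys, start=b"", end=b"\xff"):
    """Turn sampled split keys into contiguous [start, end) row ranges.

    Each resulting range can be handed to a separate worker (or, in Beam
    terms, become its own restriction for a splittable DoFn) and scanned
    independently -- which is where the read parallelism comes from.
    """
    bounds = [start]
    for key in sorted(sample_keys):
        if start < key < end:  # ignore samples outside the requested range
            bounds.append(key)
    bounds.append(end)
    # Pair up adjacent bounds: [a, b, c] -> [(a, b), (b, c)]
    return list(zip(bounds, bounds[1:]))
```

For example, sample keys `b"g"` and `b"m"` yield three ranges that three workers can read in parallel; with no samples, the whole key space is one range read by a single worker. This is only the splitting concept, not the connector itself, and it is orthogonal to the cross-language recommendation above.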