So it seems that the Java SDK has two different Join libraries?

With Schema:
https://github.com/apache/beam/tree/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms

And another one:
https://github.com/apache/beam/blob/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/extensions/join-library/src/main/java/org/apache/beam/sdk/extensions/joinlibrary/Join.java

So how does it handle that?

On Mon, Jul 8, 2019 at 12:39 PM Shannon Duncan <joseph.dun...@liveramp.com> wrote:

> Yeah these are for local testing right now. I was hoping to gain insight
> on better naming.
>
> I was thinking of creating an "extras" module.
>
> On Mon, Jul 8, 2019, 12:28 PM Robin Qiu <robi...@google.com> wrote:
>
>> Hi Shannon,
>>
>> Thanks for sharing the repo! I took a quick look and I have a concern
>> with the naming of the transforms.
>>
>> Currently, Beam Java already has "Select" and "Join" transforms.
>> However, they work on schemas, a feature that is not yet implemented in
>> Beam Python. (See
>> https://github.com/apache/beam/tree/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms)
>>
>> To maintain consistency between SDKs, I think it is good to avoid having
>> two different transforms with the same name but different functions. So
>> maybe you can consider renaming the transforms and/or putting them in an
>> extension Python module, instead of the main ones?
>>
>> Best,
>> Robin
>>
>> On Mon, Jul 8, 2019 at 9:19 AM Shannon Duncan <joseph.dun...@liveramp.com>
>> wrote:
>>
>>> As a follow-up, here is the repo that contains the utilities for now:
>>> https://github.com/shadowcodex/apache-beam-utilities. Will put together
>>> a proper PR as code gets closer to production quality.
>>>
>>> - Shannon
>>>
>>> On Mon, Jul 8, 2019 at 9:20 AM Shannon Duncan
>>> <joseph.dun...@liveramp.com> wrote:
>>>
>>>> Thanks Frederik,
>>>>
>>>> That's exactly where I was looking. I did get permission to open source
>>>> the utilities module. So I'm going to throw them up on my personal GitHub
>>>> soon and share with the email group for a look over.
>>>>
>>>> I'm going to work on the utilities there because it's a quick dev
>>>> environment and then once they are ready for proper PR I'll begin working
>>>> them into the actual SDK for a PR.
>>>>
>>>> I also joined the Slack #beam and #beam-python channels; I was unsure
>>>> of where most collaborators discussed items.
>>>>
>>>> - Shannon
>>>>
>>>> On Mon, Jul 8, 2019 at 9:09 AM Frederik Bode <frederik.b...@ml6.eu>
>>>> wrote:
>>>>
>>>>> Hi Shannon,
>>>>>
>>>>> This is probably a good starting point:
>>>>> https://github.com/apache/beam/blob/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5/sdks/python/apache_beam/transforms/combiners.py#L68
>>>>>
>>>>> Frederik
>>>>>
>>>>> Frederik Bode
>>>>> ML6 Ghent
>>>>>
>>>>> On Mon, 8 Jul 2019 at 15:40, Shannon Duncan
>>>>> <joseph.dun...@liveramp.com> wrote:
>>>>>
>>>>>> I'm sure I could use some of the existing aggregations as a guide on
>>>>>> how to make aggregations to fill the gap of missing ones, such as
>>>>>> creating Sum/Max/Min.
>>>>>>
>>>>>> GroupBy is really already handled with GroupByKey and CoGroupByKey
>>>>>> unless you are thinking of a different type of GroupBy?
>>>>>>
>>>>>> - Shannon
>>>>>>
>>>>>> On Sun, Jul 7, 2019 at 10:47 PM Rui Wang <ruw...@google.com> wrote:
>>>>>>
>>>>>>> Maybe also adding Aggregation/GroupBy as utilities?
>>>>>>>
>>>>>>> -Rui
>>>>>>>
>>>>>>> On Sun, Jul 7, 2019 at 1:46 PM Shannon Duncan
>>>>>>> <joseph.dun...@liveramp.com> wrote:
>>>>>>>
>>>>>>>> Thanks Valentyn,
>>>>>>>>
>>>>>>>> I'll outline the utilities and accept any suggestions to add /
>>>>>>>> modify. These are really just shortcut PTransforms that I am working
>>>>>>>> on to simplify creating pipelines.
>>>>>>>>
>>>>>>>> Currently the utilities contain the following PTransforms:
>>>>>>>>
>>>>>>>> - Inner Join
>>>>>>>> - Left Outer Join
>>>>>>>> - Right Outer Join
>>>>>>>> - Full Outer Join
>>>>>>>> - PrepareKey (for selecting items in a dictionary to act as a key
>>>>>>>>   for the joins)
>>>>>>>> - Select (very simple filter that returns only items you want from
>>>>>>>>   the dictionary) (allows for defining a default nullValue)
>>>>>>>>
>>>>>>>> Currently these operations only work with dictionaries, but I'd be
>>>>>>>> interested to see how it would work for <K,V> tuples.
>>>>>>>>
>>>>>>>> I'm new to Python so they may not be optimized or the best way, but
>>>>>>>> from my understanding these seem to be the best way to do these types
>>>>>>>> of operations.
>>>>>>>> Essentially I created a pipeline to be able to convert a simple
>>>>>>>> SQL query into a flow of these utilities. Using PrepareKey to define
>>>>>>>> your joining key, joining, and then selecting from the join allows
>>>>>>>> you to do a lot of powerful manipulation in a simple / familiar way.
>>>>>>>>
>>>>>>>> If this is something that we'd like to add to the Beam SDK I don't
>>>>>>>> mind looking at the contributor license agreement, and conversing
>>>>>>>> more on how to get them in.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Shannon
>>>>>>>>
>>>>>>>> On Wed, Jul 3, 2019 at 5:16 PM Valentyn Tymofieiev
>>>>>>>> <valen...@google.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Shannon,
>>>>>>>>>
>>>>>>>>> Thanks for considering a contribution to the Beam Python SDK. With a
>>>>>>>>> direct contribution to the Beam SDK, your change will reach a larger
>>>>>>>>> audience of users, and you will not have to maintain a separate
>>>>>>>>> project and keep it up to date with new releases of Beam.
>>>>>>>>>
>>>>>>>>> I encourage you to take a look at
>>>>>>>>> https://beam.apache.org/contribute/ for general advice on how to
>>>>>>>>> get started. To echo some points mentioned in the guide:
>>>>>>>>>
>>>>>>>>> - If your change is large or it is your first change, it is a good
>>>>>>>>>   idea to discuss it on the dev@ mailing list.
>>>>>>>>> - For large changes create a design doc (template, examples) and
>>>>>>>>>   email it to the dev@ mailing list.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Valentyn
>>>>>>>>>
>>>>>>>>> On Wed, Jul 3, 2019 at 3:04 PM Shannon Duncan
>>>>>>>>> <joseph.dun...@liveramp.com> wrote:
>>>>>>>>>
>>>>>>>>>> I have been writing a bunch of utilities for the Python SDK such
>>>>>>>>>> as joins, selections, composite transforms, etc...
>>>>>>>>>>
>>>>>>>>>> I am working with my company to see if I can open source the
>>>>>>>>>> utilities.
>>>>>>>>>> Would it be best to post them on a separate PyPI project, or to
>>>>>>>>>> PR them into the Beam SDK? I assume if they let me open source it
>>>>>>>>>> they will want some attribution or something like that.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Shannon
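
For anyone skimming the thread, here is a rough sketch of the PrepareKey -> join -> Select flow described in the Jul 7 message. It is plain Python over lists of dictionaries, not Beam code and not the code in the apache-beam-utilities repo. The function names (prepare_key, join, select), the `how` and `null_value` parameters, and the sample users/orders data are all made up for illustration. In an actual pipeline the grouping step would presumably be done with CoGroupByKey, as mentioned in the thread.

```python
from collections import defaultdict
from itertools import product


def prepare_key(rows, key_fields):
    """Pair each dict with a tuple key built from the chosen fields (hypothetical helper)."""
    return [(tuple(row[f] for f in key_fields), row) for row in rows]


def _group(keyed_rows):
    """Collect rows that share a key, roughly what a group-by-key step does."""
    groups = defaultdict(list)
    for key, row in keyed_rows:
        groups[key].append(row)
    return groups


def join(left, right, how="inner"):
    """Join two keyed lists; `how` is 'inner', 'left', 'right', or 'full'."""
    lgroups, rgroups = _group(left), _group(right)
    if how == "inner":
        keys = [k for k in lgroups if k in rgroups]
    elif how == "left":
        keys = list(lgroups)
    elif how == "right":
        keys = list(rgroups)
    elif how == "full":
        keys = list(lgroups) + [k for k in rgroups if k not in lgroups]
    else:
        raise ValueError(f"unknown join type: {how}")

    joined = []
    for key in keys:
        # A missing side contributes one empty dict so the other side's rows survive.
        lrows = lgroups.get(key, [{}])
        rrows = rgroups.get(key, [{}])
        for lrow, rrow in product(lrows, rrows):
            joined.append({**lrow, **rrow})
    return joined


def select(rows, fields, null_value=None):
    """Keep only the requested fields, filling absent ones with null_value."""
    return [{f: row.get(f, null_value) for f in fields} for row in rows]


# Illustrative data only.
users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [{"id": 1, "total": 30}, {"id": 3, "total": 5}]

keyed_users = prepare_key(users, ["id"])
keyed_orders = prepare_key(orders, ["id"])

# Roughly: SELECT name, total FROM users JOIN orders USING (id)
inner = select(join(keyed_users, keyed_orders, "inner"), ["name", "total"])

# Roughly: SELECT name, total FROM users LEFT JOIN orders USING (id), nulls as 0
left = select(join(keyed_users, keyed_orders, "left"), ["name", "total"], null_value=0)
```

The outer joins pad the missing side with an empty dictionary, which is why select takes a default value for fields that never arrived. Whether the real utilities handle missing sides this way is an assumption on my part.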