On Thu, Jan 24, 2019 at 5:08 PM Thomas Weise <t...@apache.org> wrote:
>
> Exciting to see the cross-language train gathering steam :)
>
> It may be useful to flesh out the user facing aspects a bit more before
> going too deep on the service / expansion side, or maybe that was done
> elsewhere?
It's been discussed, but no resolution yet.

> A few examples (of varying complexity) of how the shim/proxy transforms
> would look in the other SDKs. Perhaps Java KafkaIO in Python and Go would
> be a good candidate?

The core implementation would, almost by definition, be
input.apply(ExternalTransform(URN, payload, service_address)). Nicer shims
would just be composite transforms that call this, filling in the URNs,
payloads, and possibly service details from more user-friendly parameters.
(A rough, purely illustrative sketch is at the bottom of this mail.)

> One problem we discovered with custom Flink native transforms for Python
> was handling of lambdas / functions. An example could be a user defined
> watermark timestamp extractor that the user should be able to supply in
> Python and the JVM cannot handle.

Yes, this has never been resolved satisfactorily. For now, if UDFs can be
reified in terms of a commonly-understood URN + payload, it'll work. A
transform could provide a wide range of "useful" URNs for its internal
callbacks; anything beyond that would require significant design if it
can't be pre- or post-fixed.

> On Wed, Jan 23, 2019 at 7:04 PM Chamikara Jayalath <chamik...@google.com>
> wrote:
>>
>> On Wed, Jan 23, 2019 at 1:03 PM Robert Bradshaw <rober...@google.com> wrote:
>>>
>>> On Wed, Jan 23, 2019 at 6:38 PM Maximilian Michels <m...@apache.org> wrote:
>>> >
>>> > Thank you for starting on the cross-language feature, Robert!
>>> >
>>> > Just to recap: Each SDK runs an ExpansionService which can be contacted
>>> > during pipeline translation to expand transforms that are unknown to the
>>> > SDK. The service returns the Proto definitions to the querying process.
>>>
>>> Yep. Technically it doesn't have to be the SDK, or even if it is, there
>>> may be a variety of services (e.g. one offering SQL, one offering
>>> different IOs).
>>>
>>> > There will be multiple environments such that during execution
>>> > cross-language pipelines select the appropriate environment for a
>>> > transform.
>>>
>>> Exactly. And only steps with compatible environments get fused together.
>>>
>>> > It's not clear to me: should the expansion happen during pipeline
>>> > construction or during translation by the Runner?
>>>
>>> I think it needs to happen as part of construction because the set of
>>> outputs (and their properties) can be dynamic based on the expansion.
>>
>> Also, without expansion at pipeline construction, we'll have to define all
>> composite cross-language transforms as runner-native transforms, which
>> won't be practical?
>>
>>> > Thanks,
>>> > Max
>>> >
>>> > On 23.01.19 04:12, Robert Bradshaw wrote:
>>> > > No, this PR simply takes an endpoint address as a parameter, expecting
>>> > > it to already be up and available. More convenient APIs, e.g. ones
>>> > > that spin up an endpoint and tear it down, or catalog and locate code
>>> > > and services offering these endpoints, could be provided as wrappers
>>> > > on top of or extensions of this.
>>> > >
>>> > > On Wed, Jan 23, 2019 at 12:19 AM Kenneth Knowles <k...@apache.org>
>>> > > wrote:
>>> > >>
>>> > >> Nice! If I recall correctly, there was mostly concern about how to
>>> > >> launch and manage the expansion service (Docker? Vendor-specific?
>>> > >> Etc.) Does this PR take a position on that question?
>>> > >> Kenn
>>> > >>
>>> > >> On Tue, Jan 22, 2019 at 1:44 PM Chamikara Jayalath
>>> > >> <chamik...@google.com> wrote:
>>> > >>>
>>> > >>> On Tue, Jan 22, 2019 at 11:35 AM Udi Meiri <eh...@google.com> wrote:
>>> > >>>>
>>> > >>>> Also debuggability: collecting logs from each of these systems.
>>> > >>>
>>> > >>> Agree.
>>> > >>>
>>> > >>>> On Tue, Jan 22, 2019 at 10:53 AM Chamikara Jayalath
>>> > >>>> <chamik...@google.com> wrote:
>>> > >>>>>
>>> > >>>>> Thanks Robert.
>>> > >>>>>
>>> > >>>>> On Tue, Jan 22, 2019 at 4:39 AM Robert Bradshaw
>>> > >>>>> <rober...@google.com> wrote:
>>> > >>>>>>
>>> > >>>>>> Now that we have the FnAPI, I started playing around with support
>>> > >>>>>> for cross-language pipelines. This will allow things like IOs to
>>> > >>>>>> be shared across all languages, SQL to be invoked from non-Java,
>>> > >>>>>> TFX tensorflow transforms to be invoked from non-Python, etc., and
>>> > >>>>>> I think it is the next step in extending (and taking advantage of)
>>> > >>>>>> the portability layer we've developed. These are often composite
>>> > >>>>>> transforms whose inner structure depends in non-trivial ways on
>>> > >>>>>> their configuration.
>>> > >>>>>
>>> > >>>>> Some additional benefits of cross-language transforms are given
>>> > >>>>> below.
>>> > >>>>>
>>> > >>>>> (1) The current large collection of Java IO connectors will become
>>> > >>>>> available to other languages.
>>> > >>>>> (2) Current Java and Python transforms will be available for Go and
>>> > >>>>> any other future SDKs.
>>> > >>>>> (3) New transform authors will be able to pick their language of
>>> > >>>>> choice and make their transform available to all Beam SDKs. For
>>> > >>>>> example, this can be the language the transform author is most
>>> > >>>>> familiar with, or the only language for which a client library is
>>> > >>>>> available for connecting to an external data store.
>>> > >>>>>
>>> > >>>>>> I created a PR [1] that basically follows the "expand via an
>>> > >>>>>> external process" over RPC alternative from the proposals we came
>>> > >>>>>> up with when we were discussing this last time [2]. There are
>>> > >>>>>> still some unknowns, e.g. how to handle artifacts supplied by an
>>> > >>>>>> alternative SDK (they currently must be provided by the
>>> > >>>>>> environment), but I think this is a good incremental step forward
>>> > >>>>>> that will already be useful in a large number of cases. It would
>>> > >>>>>> be good to validate the general direction, and I would be
>>> > >>>>>> interested in any feedback others may have on it.
>>> > >>>>>
>>> > >>>>> I think there are multiple semi-dependent problems we have to
>>> > >>>>> tackle to reach the final goal of supporting fully-fledged
>>> > >>>>> cross-language transforms in Beam. I agree with taking an
>>> > >>>>> incremental approach here with the overall vision in mind. Some of
>>> > >>>>> the other problems we have to tackle include the following.
>>> > >>>>>
>>> > >>>>> * Defining a user API that will allow pipelines defined in SDK X
>>> > >>>>> to use transforms defined in SDK Y.
>>> > >>>>> * Updating various runners to use the URN/payload-based
>>> > >>>>> environment definition [1].
>>> > >>>>> * Updating various runners to support starting containers for
>>> > >>>>> multiple environments/languages for the same pipeline, and to
>>> > >>>>> support executing pipeline steps in the containers started for
>>> > >>>>> those environments.
>>> > >>>
>>> > >>> I've been working with +Heejong Lee to add some of the missing
>>> > >>> pieces mentioned above.
>>> > >>>
>>> > >>> We created the following doc, which captures some of the ongoing
>>> > >>> work related to cross-language transforms and which will hopefully
>>> > >>> serve as a knowledge base for anybody who wishes to quickly learn
>>> > >>> the context related to this. Feel free to refer to it and/or add to
>>> > >>> it.
>>> > >>>
>>> > >>> https://docs.google.com/document/d/1H3yCyVFI9xYs1jsiF1GfrDtARgWGnLDEMwG5aQIx2AU/edit?usp=sharing
>>> > >>>
>>> > >>>>> Thanks,
>>> > >>>>> Cham
>>> > >>>>>
>>> > >>>>> [1] https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L952
>>> > >>>>>
>>> > >>>>>> - Robert
>>> > >>>>>>
>>> > >>>>>> [1] https://github.com/apache/beam/pull/7316
>>> > >>>>>> [2] https://s.apache.org/beam-mixed-language-pipelines
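
P.S. To make the shim idea near the top of this mail a bit more concrete,
here is a rough, purely illustrative sketch of what a user-facing Kafka read
wrapper could look like on the Python side. The ReadFromKafka name, the URNs,
the JSON payload encoding, and the exact ExternalTransform constructor (I'm
assuming the ExternalTransform(urn, payload, service_address) shape from
above) are all placeholders, not a settled API.

import json

import apache_beam as beam
# Assumed import path for the primitive from the PR; the exact location and
# signature may differ. It applies a transform identified by URN + payload,
# asking the expansion service at the given address to expand it.
from apache_beam.transforms.external import ExternalTransform

# Hypothetical URNs; a real shim would use whatever URNs the Java KafkaIO
# expansion actually registers.
KAFKA_READ_URN = 'beam:external:java:kafkaio:read:v1'
# A "useful" URN naming a common callback, so that no Python lambda has to
# cross the language boundary (cf. the watermark/timestamp extractor problem
# above).
LOG_APPEND_TIME_POLICY_URN = 'beam:timestamp_policy:log_append_time:v1'


class ReadFromKafka(beam.PTransform):
  """Illustrative Python shim around a Java KafkaIO read."""

  def __init__(self, bootstrap_servers, topics,
               timestamp_policy=LOG_APPEND_TIME_POLICY_URN,
               expansion_service='localhost:8097'):
    super().__init__()
    # Encode the user-friendly parameters into a payload the Java-side
    # expansion would understand (JSON here is only for illustration).
    self._payload = json.dumps({
        'bootstrap_servers': bootstrap_servers,
        'topics': topics,
        # The callback is named by URN rather than passed as a Python
        # function.
        'timestamp_policy': timestamp_policy,
    }).encode('utf-8')
    self._expansion_service = expansion_service

  def expand(self, pbegin):
    # The composite just fills in URN, payload, and service address and
    # delegates to the cross-language primitive.
    return pbegin | ExternalTransform(
        KAFKA_READ_URN, self._payload, self._expansion_service)

With something like this a user would just write

  p | ReadFromKafka(bootstrap_servers='broker:9092', topics=['my_topic'])

and never need to know that an expansion service or another SDK's environment
is involved. Note also that the timestamp policy is named by a URN rather
than supplied as a Python callable, which is the kind of UDF reification
mentioned above.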