Exciting to see the cross-language train gathering steam :) It may be useful to flesh out the user facing aspects a bit more before going too deep on the service / expansion side or maybe that was done elsewhere?
A few examples (of varying complexity) of how the shim/proxy transforms would look like in the other SDKs. Perhaps Java KafkaIO in Python and Go would be a good candidate? One problem we discovered with custom Flink native transforms for Python was handling of lambdas / functions. An example could be a user defined watermark timestamp extractor that the user should be able to supply in Python and the JVM cannot handle. Thanks, Thomas On Wed, Jan 23, 2019 at 7:04 PM Chamikara Jayalath <chamik...@google.com> wrote: > > > On Wed, Jan 23, 2019 at 1:03 PM Robert Bradshaw <rober...@google.com> > wrote: > >> On Wed, Jan 23, 2019 at 6:38 PM Maximilian Michels <m...@apache.org> >> wrote: >> > >> > Thank you for starting on the cross-language feature Robert! >> > >> > Just to recap: Each SDK runs an ExpansionService which can be contacted >> during >> > pipeline translation to expand transforms that are unknown to the SDK. >> The >> > service returns the Proto definitions to the querying process. >> >> Yep. Technically it doesn't have to be the SDK, or even if it is there >> may be a variety of services (e.g. one offering SQL, one offering >> different IOs). >> >> > There will be multiple environments such that during execution >> cross-language >> > pipelines select the appropriate environment for a transform. >> >> Exactly. And fuses only those steps with compatible environments together. >> >> > It's not clear to me, should the expansion happen during pipeline >> construction >> > or during translation by the Runner? >> >> I think it need to happen as part of construction because the set of >> outputs (and their properties) can be dynamic based on the expansion. >> > > Also, without expansion at pipeline construction, we'll have to define all > composite cross-language transforms as runner-native transforms which won't > be practical ? > > >> >> > Thanks, >> > Max >> > >> > On 23.01.19 04:12, Robert Bradshaw wrote: >> > > No, this PR simply takes an endpoint address as a parameter, expecting >> > > it to already be up and available. More convenient APIs, e.g. ones >> > > that spin up and endpoint and tear it down, or catalog and locate code >> > > and services offering these endpoints, could be provided as wrappers >> > > on top of or extensions of this. >> > > >> > > On Wed, Jan 23, 2019 at 12:19 AM Kenneth Knowles <k...@apache.org> >> wrote: >> > >> >> > >> Nice! If I recall correctly, there was mostly concern about how to >> launch and manage the expansion service (Docker? Vendor-specific? Etc). >> Does this PR a position on that question? >> > >> >> > >> Kenn >> > >> >> > >> On Tue, Jan 22, 2019 at 1:44 PM Chamikara Jayalath < >> chamik...@google.com> wrote: >> > >>> >> > >>> >> > >>> >> > >>> On Tue, Jan 22, 2019 at 11:35 AM Udi Meiri <eh...@google.com> >> wrote: >> > >>>> >> > >>>> Also debugability: collecting logs from each of these systems. >> > >>> >> > >>> >> > >>> Agree. >> > >>> >> > >>>> >> > >>>> >> > >>>> On Tue, Jan 22, 2019 at 10:53 AM Chamikara Jayalath < >> chamik...@google.com> wrote: >> > >>>>> >> > >>>>> Thanks Robert. >> > >>>>> >> > >>>>> On Tue, Jan 22, 2019 at 4:39 AM Robert Bradshaw < >> rober...@google.com> wrote: >> > >>>>>> >> > >>>>>> Now that we have the FnAPI, I started playing around with >> support for >> > >>>>>> cross-language pipelines. This will allow things like IOs to be >> shared >> > >>>>>> across all languages, SQL to be invoked from non-Java, TFX >> tensorflow >> > >>>>>> transforms to be invoked from non-Python, etc. and I think is >> the next >> > >>>>>> step in extending (and taking advantage of) the portability layer >> > >>>>>> we've developed. These are often composite transforms whose inner >> > >>>>>> structure depends in non-trivial ways on their configuration. >> > >>>>> >> > >>>>> >> > >>>>> Some additional benefits of cross-language transforms are given >> below. >> > >>>>> >> > >>>>> (1) Current large collection of Java IO connectors will be become >> available to other languages. >> > >>>>> (2) Current Java and Python transforms will be available for Go >> and any other future SDKs. >> > >>>>> (3) New transform authors will be able to pick their language of >> choice and make their transform available to all Beam SDKs. For example, >> this can be the language the transform author is most familiar with or the >> only language for which a client library is available for connecting to an >> external data store. >> > >>>>> >> > >>>>>> >> > >>>>>> I created a PR [1] that basically follows the "expand via an >> external >> > >>>>>> process" over RPC alternative from the proposals we came up with >> when >> > >>>>>> we were discussing this last time [2]. There are still some >> unknowns, >> > >>>>>> e.g. how to handle artifacts supplied by an alternative SDK (they >> > >>>>>> currently must be provided by the environment), but I think this >> is a >> > >>>>>> good incremental step forward that will already be useful in a >> large >> > >>>>>> number of cases. It would be good to validate the general >> direction >> > >>>>>> and I would be interested in any feedback others may have on it. >> > >>>>> >> > >>>>> >> > >>>>> I think there are multiple semi-dependent problems we have to >> tackle to reach the final goal of supporting fully-fledged cross-language >> transforms in Beam. I agree with taking an incremental approach here with >> overall vision in mind. Some other problems we have to tackle involve >> following. >> > >>>>> >> > >>>>> * Defining a user API that will allow pipelines defined in a SDK >> X to use transforms defined in SDK Y. >> > >>>>> * Update various runners to use URN/payload based environment >> definition [1] >> > >>>>> * Updating various runners to support starting containers for >> multiple environments/languages for the same pipeline and supporting >> executing pipeline steps in containers started for multiple environments. >> > >>> >> > >>> >> > >>> I've been working with +Heejong Lee to add some of the missing >> pieces mentioned above. >> > >>> >> > >>> We created following doc that captures some of the ongoing work >> related to cross-language transforms and which will hopefully serve as a >> knowledge base for anybody who wish to quickly learn context related to >> this. >> > >>> Feel free to refer to this and/or add to this. >> > >>> >> > >>> >> https://docs.google.com/document/d/1H3yCyVFI9xYs1jsiF1GfrDtARgWGnLDEMwG5aQIx2AU/edit?usp=sharing >> > >>> >> > >>> >> > >>>>> >> > >>>>> >> > >>>>> Thanks, >> > >>>>> Cham >> > >>>>> >> > >>>>> [1] >> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L952 >> > >>>>> >> > >>>>> >> > >>>>> >> > >>>>> >> > >>>>> >> > >>>>> >> > >>>>> >> > >>>>>> >> > >>>>>> >> > >>>>>> - Robert >> > >>>>>> >> > >>>>>> [1] https://github.com/apache/beam/pull/7316 >> > >>>>>> [2] https://s.apache.org/beam-mixed-language-pipelines >> >