Romain, if we are specifically discussing the use of protocol buffers and gRPC, this is the result of community discussion on the dev list back in 2016. Many options were considered: JSON, Thrift, Kryo, and proto among them. The decision that protocol buffers and gRPC were the best solution for the portability Fn API was arrived at via a community discussion that many people took part in.
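The structured-vs-unstructured distinction that drove that choice can be illustrated outside of protobuf entirely. The sketch below is hypothetical plain Python, not Beam code (the message and field names are invented): parsed JSON is untyped, so a malformed field travels silently, while a typed contract of the kind protobuf generates per language rejects it at the boundary.

```python
import json
from dataclasses import dataclass
from typing import List

# Parsed JSON is just untyped dicts: a wrong type in a field travels
# silently until some consumer deep in the stack trips over it.
payload = json.loads('{"urn": "beam:transform:pardo:v1", "inputs": 7}')
assert isinstance(payload["inputs"], int)  # nothing complained

# A typed contract (what protobuf code generation gives each language
# for free) rejects malformed messages at the boundary instead.
@dataclass
class TransformMsg:
    urn: str
    inputs: List[str]  # hypothetical field names, not Beam's actual schema

    @classmethod
    def from_json(cls, raw: str) -> "TransformMsg":
        data = json.loads(raw)
        if not isinstance(data.get("urn"), str):
            raise TypeError("urn must be a string")
        if not isinstance(data.get("inputs"), list):
            raise TypeError("inputs must be a list")
        return cls(urn=data["urn"], inputs=data["inputs"])

try:
    TransformMsg.from_json('{"urn": "beam:transform:pardo:v1", "inputs": 7}')
except TypeError as err:
    print("rejected at the boundary:", err)
```

With protobuf, the generated classes enforce this in every SDK language at once; with JSON, each consumer must reimplement (or skip) the validation shown here.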
Reuven

On Fri, May 11, 2018 at 11:48 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:

> On Fri, May 11, 2018 at 18:15, Andrew Pilloud <apill...@google.com> wrote:
>
>> Json and Protobuf aren't the same thing. Json is for exchanging unstructured data, Protobuf is for exchanging structured data. The point of Portability is to define a protocol for exchanging structured messages across languages. What do you propose using on top of Json to define message structure?
>
> I'm fine with protobuf contracts, but not with all the rest (libs*). JSON has the advantage of not requiring much from consumers and being easy to integrate and proxy. Protobuf imposes a lot on that layer, which will be typed by the runner anyway, so there is no need for two typing layers.
>
>> I'd like to see the generic runner rewritten in Golang so we can eliminate the significant overhead imposed by the JVM. I would argue that Go is the best language for low-overhead infrastructure, and it is already widely used by projects in this space such as Docker, Kubernetes, and InfluxDB. Even SQL can take advantage of this. For example, several runners could be passed raw SQL and use their own SQL engines to implement more efficient transforms than generic Beam can. Users will save significant $$$ on infrastructure by not having to involve the JVM at all.
>
> Yes... or no. JVM overhead is very low for such infra, less than 64M of RAM and almost no CPU, so it will not help much for clusters or long-lived processes like the ones we are talking about.
>
> Also, the Beam community is Java; don't answer that it is Python or Go without checking ;). I'm not sure adding a new language will help or make the project one people will like to contribute to or use.
>
> Currently I'd say the way the router runner is done is a detail, but the choice to rethink the current impl is a crucial architectural point.
> No issue with having N router impls either: one in Java, one in Go (but I thought we had very few Go resources/lovers in Beam?), one in Python where it would make a lot of sense and would show a router without an actual primitive impl (delegating to the direct Java runner), etc.
>
> But one thing at a time: is anyone against stopping the current impl track, reverting it, and moving to a higher-level runner?
>
>> Andrew
>>
>> On Fri, May 11, 2018 at 8:53 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>
>>> On Wed, May 9, 2018 at 17:41, Eugene Kirpichov <kirpic...@google.com> wrote:
>>>
>>>> On Wed, May 9, 2018 at 1:08 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>
>>>>> On Wed, May 9, 2018 at 00:57, Henning Rohde <hero...@google.com> wrote:
>>>>>
>>>>>> There are indeed lots of possibilities for interesting docker alternatives with different tradeoffs and capabilities, but in general both the runner and the SDK must support them for it to work. As mentioned, docker (as used in the container contract) is meant as a flexible main option but not necessarily the only option. I see no problem with certain pipeline-SDK-runner combinations additionally supporting a specialized setup. The pipeline can be a factor, because some transforms might depend on aspects of the runtime environment, such as system libraries or shelling out to a /bin/foo.
>>>>>>
>>>>>> The worker boot code is tied to the current container contract, so pre-launched workers would presumably not use that code path and are not bound by its assumptions. In particular, such a setup might want to invert who initiates the connection from the SDK worker to the runner.
>>>>>> Pipeline options and global state in the SDK and user-function process might make it difficult to safely reuse worker processes across pipelines, but it is also doable in certain scenarios.
>>>>>
>>>>> This is not that hard actually, and most Java environments do it.
>>>>>
>>>>> The main concerns are 1. being tied to an impl detail and 2. a bad architecture which doesn't embrace the community.
>>>>
>>>> Could you please be more specific? Concerns about the Docker dependency have already been repeatedly addressed in this thread.
>>>
>>> My concern is that Beam is being driven by an implementation instead of a clear and scalable architecture.
>>>
>>> The best demonstration is the protobuf usage, which is far from being the best choice for portability these days due to the implications of its stack in several languages (nobody wants it in their classpath in Java/Scala these days, for instance, because of the conflicts or the security care it requires). JSON is very well tooled and trivial to use with whatever lib you want to rely on, in any language or environment, to cite just one alternative.
>>>
>>> Being portable (language-wise) is a good goal but IMHO requires:
>>>
>>> 1. Runners in each language (otherwise fall back on JSR 223 and you are good with just a JSON facade)
>>> 2. A generic runner able to route each task to the right native runner
>>> 3.
>>> A way to run in a single runner when relevant (keep in mind most Java users don't even want to see Python or portable code or APIs in their classpath and runner)
>>>
>>>>>> Henning
>>>>>>
>>>>>> On Tue, May 8, 2018 at 3:51 PM Thomas Weise <t...@apache.org> wrote:
>>>>>>
>>>>>>> On Sat, May 5, 2018 at 3:58 PM, Robert Bradshaw <rober...@google.com> wrote:
>>>>>>>
>>>>>>>> I would welcome changes to https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730 that would provide alternatives to docker. One that comes to mind is "I already brought up worker(s) for you (which could be the same process that handled pipeline construction in testing scenarios); here's how to connect to it/them." Another option, which would seem to appeal to you in particular, would be "the worker code is linked into the runner's binary; use this process as the worker" (though note that even for Java-on-Java, it can be advantageous to shield the worker and runner code from each other's environments, dependencies, and version requirements). This latter should still likely use the FnApi to talk to itself (either over gRPC on local ports, or possibly better via direct function calls eliminating the RPC overhead altogether; this is how the fast local runner in Python works). There may be runner environments well controlled enough that "start up the workers" could be specified as "run this command line."
>>>>>>>> We should make this environment message extensible to alternatives other than "docker container url," though of course we don't want the set of options to grow too large, or we lose the promise of portability unless every runner supports every protocol.
>>>>>>>
>>>>>>> The pre-launched worker would be an interesting option, which might work well for a sidecar deployment.
>>>>>>>
>>>>>>> The current worker boot code, though, makes the assumption that the runner endpoint to phone home to is known when the process is launched. That doesn't work so well with a runner that establishes its endpoint dynamically. Also, the assumption is baked in that a worker will only serve a single pipeline (provisioning API etc.).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thomas
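The extensible environment message and the pre-launched-worker option discussed in this thread can be sketched as a simple dispatch on an environment urn. Everything below is hypothetical plain Python for illustration only: the urns, field names, and functions are invented, not Beam's actual model. The "external" strategy is the inversion Thomas describes, where the runner connects out to a worker that already exists instead of a freshly booted worker phoning home.

```python
from typing import Callable, Dict

# Hypothetical environment strategies; each takes an environment payload
# and returns a description of how the worker gets provisioned.
def start_docker(env: dict) -> str:
    # The current main option: boot a container from an image url.
    return f"docker run {env['container_image']}"

def connect_external(env: dict) -> str:
    # Pre-launched worker: the runner dials out, inverting who connects.
    return f"connect to pre-launched worker at {env['endpoint']}"

def run_embedded(env: dict) -> str:
    # Worker linked into the runner binary; direct calls, no RPC overhead.
    return "invoke worker in-process"

STRATEGIES: Dict[str, Callable[[dict], str]] = {
    "env:docker:v1": start_docker,      # invented urns, not Beam's
    "env:external:v1": connect_external,
    "env:embedded:v1": run_embedded,
}

def provision_worker(env: dict) -> str:
    try:
        return STRATEGIES[env["urn"]](env)
    except KeyError:
        # Unknown environments fail fast, keeping the portability promise
        # honest: a runner advertises exactly the strategies it supports.
        raise ValueError(f"runner does not support environment {env['urn']!r}")

print(provision_worker({"urn": "env:external:v1", "endpoint": "localhost:50051"}))
```

The point of the sketch is that "docker container url" becomes one entry in a table rather than a hard-coded assumption, which is the extensibility Robert asks for above, while still letting each runner reject environments it cannot honor.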