Romain, if we are specifically discussing the use of protocol buffers and
gRPC, this is the result of community discussion on the dev list back in
2016. Many options were considered: JSON, Thrift, Kryo, and proto among
them. The decision that protocol buffers and gRPC were the best solutions
for the portability FnAPI was arrived at via a community discussion that
many people took part in.
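
For anyone skimming, the practical payoff is that the wire format is
defined once, in beam_runner_api.proto, and every SDK reads and writes the
same bytes. A rough Java sketch (class names from Beam's generated model
jar, beam-model-pipeline; a snippet, not a complete program):

    import org.apache.beam.model.pipeline.v1.RunnerApi;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Bytes produced by any SDK's pipeline translation, e.g. Python's.
    byte[] wire = Files.readAllBytes(Paths.get("pipeline.pb"));
    RunnerApi.Pipeline pipeline = RunnerApi.Pipeline.parseFrom(wire);
    pipeline.getRootTransformIdsList()
        .forEach(id -> System.out.println("root transform: " + id));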

Reuven

On Fri, May 11, 2018 at 11:48 AM Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

>
>
> On Fri, May 11, 2018 at 6:15 PM, Andrew Pilloud <apill...@google.com> wrote:
>
>> JSON and Protobuf aren't the same thing. JSON is for exchanging
>> unstructured data; Protobuf is for exchanging structured data. The point of
>> Portability is to define a protocol for exchanging structured messages
>> across languages. What do you propose using on top of JSON to define
>> message structure?
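>>
>> To make the distinction concrete, a rough sketch given some serialized
>> bytes (TransformSpec is a made-up message, not anything from the actual
>> model; the JSON side uses Jackson):
>>
>>   // Generated from: message TransformSpec { string urn = 1; }
>>   TransformSpec spec = TransformSpec.parseFrom(bytes); // schema enforced here
>>   String urn = spec.getUrn();                          // typed accessor
>>
>>   // The bare-JSON equivalent is a convention every consumer re-implements:
>>   JsonNode node = new ObjectMapper().readTree(bytes);
>>   JsonNode maybeUrn = node.get("urn"); // may be absent, null, or mistyped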
>>
>
> I'm fine with protobuf contracts, just not with all the rest (the libs*).
> JSON has the advantage of requiring very little from consumers and being
> easy to integrate and proxy. Protobuf imposes a lot on that layer, which
> will be typed by the runner anyway, so there is no need for two typing
> layers.
>
>
>> I'd like to see the generic runner rewritten in Golang so we can
>> eliminate the significant overhead imposed by the JVM. I would argue that
>> Go is the best language for low-overhead infrastructure, and it is already
>> widely used by projects in this space such as Docker, Kubernetes, and
>> InfluxDB. Even SQL can take advantage of this: for example, several runners
>> could be passed raw SQL and use their own SQL engines to implement more
>> efficient transforms than generic Beam can. Users will save significant $$$
>> on infrastructure by not having to involve the JVM at all.
>>
>
> Yes... or no. JVM overhead is very low for such infrastructure, less than
> 64 MB of RAM and almost no CPU, so it will not help much for clusters or
> long-lived processes like the ones we are talking about.
>
> Also, the Beam community is Java (don't answer that it is Python or Go
> without checking ;)). I'm not sure adding a new language will help, or give
> the project a face people will want to contribute to or use.
>
> Currently I'd say the way the router runner is done is a detail, but the
> choice to rethink the current implementation is a crucial architectural
> point.
>
> No issue having N router implementations either: one in Java, one in Go
> (though I thought we had very few Go resources/lovers in Beam?), one in
> Python, where it would make a lot of sense and would show a router without
> an actual primitive implementation (delegating to the direct Java runner),
> etc.
>
> But one thing at a time: is anyone against stopping the current
> implementation track, reverting it, and moving to a higher-level runner?
>
>
>> Andrew
>>
>> On Fri, May 11, 2018 at 8:53 AM Romain Manni-Bucau <rmannibu...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, May 9, 2018 at 5:41 PM, Eugene Kirpichov <kirpic...@google.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Wed, May 9, 2018 at 1:08 AM Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, May 9, 2018 at 12:57 AM, Henning Rohde <hero...@google.com> wrote:
>>>>>
>>>>>> There are indeed lots of possibilities for interesting docker
>>>>>> alternatives with different tradeoffs and capabilities, but in general
>>>>>> both the runner and the SDK must support them for it to work. As
>>>>>> mentioned, docker (as used in the container contract) is meant as a
>>>>>> flexible main option but not necessarily the only option. I see no
>>>>>> problem with certain pipeline-SDK-runner combinations additionally
>>>>>> supporting a specialized setup. The pipeline can be a factor, because
>>>>>> some transforms might depend on aspects of the runtime environment --
>>>>>> such as system libraries or shelling out to a /bin/foo.
>>>>>>
>>>>>> The worker boot code is tied to the current container contract, so
>>>>>> pre-launched workers would presumably not use that code path and are
>>>>>> not bound by its assumptions. In particular, such a setup might want to
>>>>>> invert who initiates the connection from the SDK worker to the runner.
>>>>>> Pipeline options and global state in the SDK and user-function process
>>>>>> might make it difficult to safely reuse worker processes across
>>>>>> pipelines, but it is doable in certain scenarios.
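>>>>>>
>>>>>> A sketch of that inverted direction (real gRPC classes, but
>>>>>> FnControlService is a made-up control implementation), just to show
>>>>>> who dials whom: the pre-launched worker listens, and the runner
>>>>>> connects once it knows the worker's address:
>>>>>>
>>>>>>   // Worker side: starts first, on a well-known port, and waits.
>>>>>>   Server worker = ServerBuilder.forPort(50051)
>>>>>>       .addService(new FnControlService()) // hypothetical
>>>>>>       .build()
>>>>>>       .start();
>>>>>>
>>>>>>   // Runner side: dials the worker when its own endpoint is ready,
>>>>>>   // instead of the worker phoning home at boot.
>>>>>>   ManagedChannel ch = ManagedChannelBuilder
>>>>>>       .forAddress("sdk-worker", 50051).usePlaintext().build();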
>>>>>>
>>>>>
>>>>> This is not that hard actually; most Java environments do it.
>>>>>
>>>>> My main concerns are 1. being tied to an implementation detail and 2. a
>>>>> bad architecture which doesn't embrace the community.
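>>>>>
>>>>> For reference, the usual Java trick for reusing one process across
>>>>> deployments (roughly what app servers do; a sketch, not Beam code) is
>>>>> one throwaway classloader per pipeline, so user code and static state
>>>>> cannot leak between pipelines:
>>>>>
>>>>>   URL[] jars = { new File("user-pipeline.jar").toURI().toURL() };
>>>>>   try (URLClassLoader cl =
>>>>>       new URLClassLoader(jars, ClassLoader.getSystemClassLoader())) {
>>>>>     Class<?> fn = cl.loadClass("com.example.MyDoFn"); // hypothetical
>>>>>     Object instance = fn.getDeclaredConstructor().newInstance();
>>>>>   } // closing the loader lets the pipeline's classes be collected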
>>>>>
>>>> Could you please be more specific? Concerns about Docker dependency
>>>> have already been repeatedly addressed in this thread.
>>>>
>>>
>>> My concern is that Beam is being driven by an implementation instead of
>>> a clear and scalable architecture.
>>>
>>> The best demonstration is the protobuf usage, which is far from the best
>>> choice for portability these days because of what its stack implies in
>>> several languages (nobody wants it on their classpath in Java/Scala these
>>> days, for instance, because of the conflicts and the security care it
>>> requires). JSON, to cite just one alternative, is very well tooled and
>>> trivial to use with whatever library you want to rely on, in any language
>>> or environment.
>>>
>>> Being portable (language-wise) is a good goal but IMHO requires:
>>>
>>> 1. Runners in each language (otherwise fall back on JSR-223 and you are
>>> good with just a JSON facade; see the sketch below)
>>> 2. A generic runner able to route each task to the right native runner
>>> 3. A way to run in a single runner when relevant (keep in mind most Java
>>> users don't even want to see Python or portable code or APIs in their
>>> classpath and runner)
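>>>
>>> The sketch for point 1 (field names and the urn value are made up;
>>> Jackson only because it is already everywhere):
>>>
>>>   ObjectMapper mapper = new ObjectMapper();
>>>   String graph = "{\"transforms\":[{\"urn\":\"beam:transform:pardo:v1\"}]}";
>>>   JsonNode root = mapper.readTree(graph);
>>>   for (JsonNode t : root.get("transforms")) {
>>>     System.out.println(t.get("urn").asText());
>>>   }
>>>
>>> No generated code and no library conflicts; any language or proxy can
>>> produce and inspect this payload with its standard tooling.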
>>>
>>>
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Henning
>>>>>>
>>>>>> On Tue, May 8, 2018 at 3:51 PM Thomas Weise <t...@apache.org> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 5, 2018 at 3:58 PM, Robert Bradshaw <rober...@google.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I would welcome changes to
>>>>>>>> https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
>>>>>>>> that would provide alternatives to docker. One that comes to mind is
>>>>>>>> "I already brought up a worker (or workers) for you (which could be
>>>>>>>> the same process that handled pipeline construction in testing
>>>>>>>> scenarios); here's how to connect to it/them." Another option, which
>>>>>>>> would seem to appeal to you in particular, would be "the worker code
>>>>>>>> is linked into the runner's binary; use this process as the worker"
>>>>>>>> (though note that even for Java-on-Java, it can be advantageous to
>>>>>>>> shield the worker and runner code from each other's environments,
>>>>>>>> dependencies, and version requirements). This latter should still
>>>>>>>> likely use the FnAPI to talk to itself (either over gRPC on local
>>>>>>>> ports, or possibly better via direct function calls, eliminating the
>>>>>>>> RPC overhead altogether -- this is how the fast local runner in
>>>>>>>> Python works). There may be runner environments well controlled
>>>>>>>> enough that "start up the workers" could be specified as "run this
>>>>>>>> command line." We should make this environment message extensible to
>>>>>>>> alternatives other than "docker container url," though of course we
>>>>>>>> don't want the set of options to grow too large or we lose the
>>>>>>>> promise of portability, unless every runner supports every protocol.
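>>>>>>>>
>>>>>>>> (A sketch of the "same process, still FnAPI" variant using gRPC's
>>>>>>>> in-process transport -- real gRPC classes, made-up service impl; the
>>>>>>>> messages stay identical, only the network hop disappears.)
>>>>>>>>
>>>>>>>>   Server server = InProcessServerBuilder.forName("fn-api")
>>>>>>>>       .addService(new FnControlService()) // hypothetical
>>>>>>>>       .directExecutor()
>>>>>>>>       .build()
>>>>>>>>       .start();
>>>>>>>>   ManagedChannel channel = InProcessChannelBuilder.forName("fn-api")
>>>>>>>>       .directExecutor()
>>>>>>>>       .build();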
>>>>>>>>
>>>>>>>>
>>>>>>> The pre-launched worker would be an interesting option, which might
>>>>>>> work well for a sidecar deployment.
>>>>>>>
>>>>>>> The current worker boot code, though, assumes that the runner
>>>>>>> endpoint to phone home to is known when the process is launched. That
>>>>>>> doesn't work so well with a runner that establishes its endpoint
>>>>>>> dynamically. Also, the assumption is baked in that a worker will only
>>>>>>> serve a single pipeline (provisioning API etc.).
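>>>>>>>
>>>>>>> To make that assumption visible (a sketch, not the actual boot code;
>>>>>>> the discovery client in the second part is hypothetical):
>>>>>>>
>>>>>>>   // Today: endpoint fixed at launch, one pipeline per process.
>>>>>>>   String controlEndpoint = System.getenv("CONTROL_ENDPOINT");
>>>>>>>
>>>>>>>   // Dynamic alternative: ask a long-lived registry each time the
>>>>>>>   // runner materializes an endpoint for a new pipeline.
>>>>>>>   String endpoint = discovery.waitForRunnerEndpoint(pipelineId);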
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thomas
>>>>>>>
>>>>>>>
>>>>>>
