Hi Hai, Thanks for the PR. Added a couple of comments. Will take a detailed look later.
Thanks,
Ankur

*From: *Hai Lu <lhai...@apache.org>
*Date: *Thu, May 16, 2019 at 8:02 PM
*To: * <lc...@google.com>, <goe...@google.com>
*Cc: * <dev@beam.apache.org>, <danx...@gmail.com>, <xi...@linkedin.com>

> Hi Lukasz and Ankur,
>
> Here is the PR that implements the idea:
> https://github.com/apache/beam/pull/8597
>
> Would appreciate it if you could take a look.
>
> Thanks,
> Hai
>
> On Tue, Apr 30, 2019 at 9:13 AM Hai Lu <lhai...@gmail.com> wrote:
>
>> One thing to clarify is that we do not use docker. I don't have too much
>> experience with docker; I assume docker itself already has network
>> isolation, and that's why it was never necessary to enable security in
>> portable runner before?
>>
>> For us, because we simply use processes, we need this extra secret
>> (through file system) for authentication.
>>
>> Let me create a ticket and send a PR, which should explain my intention
>> better.
>>
>> Thanks,
>> Hai
>>
>> On Mon, Apr 29, 2019 at 1:03 PM Lukasz Cwik <lc...@google.com> wrote:
>>
>>> Changing the address to be loopback based upon how the environment is
>>> started (docker container/process/external/...) makes sense.
>>>
>>> How would the SDK and runner support storing/sharing this secret? (For
>>> example, in the docker container, how would the secret get there?)
>>>
>>> On Mon, Apr 29, 2019 at 9:23 AM Hai Lu <lhai...@gmail.com> wrote:
>>>
>>>> Hi Lukasz and Ankur,
>>>>
>>>> Thank you so much for your response! This is what we're
>>>> doing/implementing in our internal fork right now:
>>>>
>>>>    1. We assume that the Java process and Python process *are always
>>>>    colocated in the same host*, so first of all we use "loopback"
>>>>    address instead of "any address" that's currently being used on the
>>>>    java side. That way, the traffic between sdk worker and runner is
>>>>    limited to the host but not exposed to network.
>>>>    2. Because of the multi-tenant nature of our environment, we still
>>>>    want to have authentication even for local host, so that data ports
>>>>    are not connected by random processes. Because different jobs have
>>>>    their own user name, it's sufficient to *use file system to store an
>>>>    ad-hoc secret*, which can be shared by both Python sdk and java
>>>>    runner. The runner uses this secret to authenticate the worker (by
>>>>    using gRPC's interceptor for this customized auth).
>>>>    3. By having the 2 steps above, we *no longer need transport layer
>>>>    security* (SSL/TLS). So we abandon our initial plan to enable
>>>>    SSL/TLS.
>>>>
>>>> Above is the high level plan that I'm implementing. I would like to
>>>> have a similar solution in the open source to be merged with our
>>>> internal fork. Let me know what you think. If this sounds OK I will
>>>> create a ticket for myself and will first send out a short write-up in
>>>> google doc to collect comments soon.
>>>>
>>>> Thanks,
>>>> Hai
>>>>
>>>> On Fri, Apr 26, 2019 at 5:24 PM Ankur Goenka <goe...@google.com> wrote:
>>>>
>>>>> In an offline chat with Hai, it seems useful for users to be able to
>>>>> provide custom authentication like a secret which can be distributed
>>>>> out of band by the infrastructure and can be provided via file system,
>>>>> rpc to another service etc.
>>>>> gRPC already has some mechanism for standard and custom
>>>>> authentication[1].
>>>>> Instrumenting gRPC channel using command line option or environment
>>>>> variable on the worker machines can be useful.
>>>>>
>>>>> [1] https://grpc.io/docs/guides/auth/
>>>>>
>>>>> On Fri, Apr 26, 2019 at 4:33 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> The link to the ApiServiceDescriptor is
>>>>>> https://github.com/apache/beam/blob/476e17ed6badd4d5c06c4caf8a824805f40a8e7a/model/pipeline/src/main/proto/endpoints.proto#L31
>>>>>>
>>>>>> On Fri, Apr 26, 2019 at 4:32 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>>>
>>>>>>> I had originally taken a look at this a while ago but not much has
>>>>>>> progressed since then. The original idea was that the
>>>>>>> ApiServiceDescriptor would be extended to support secure ways of
>>>>>>> authentication/communication. I was prototyping with an OAuth2
>>>>>>> client credentials grant at the time but dropped it as other things
>>>>>>> were more important. The only currently supported mode across all
>>>>>>> SDKs is an implicit authenticated/secure mode where all
>>>>>>> communication is assumed to already be encrypted/private (e.g. over
>>>>>>> VPN that is managed externally with trusted services) and hence the
>>>>>>> gRPC channel itself is insecure and there is no authentication being
>>>>>>> performed.
>>>>>>>
>>>>>>> Even though sdk_worker.py seems like it supports credentials, no one
>>>>>>> invokes the constructor with credentials enabled as can be seen by
>>>>>>> this comment by Robert[1].
>>>>>>>
>>>>>>> For SSL/TLS support it seems like we need some way to configure a
>>>>>>> runner to be told to use SSL/TLS (potentially with a custom private
>>>>>>> key and trust chain). Do you have some suggestions on how we add
>>>>>>> support for passing around channel/call[2] credentials?
>>>>>>>
>>>>>>> 1: https://github.com/apache/beam/blob/476e17ed6badd4d5c06c4caf8a824805f40a8e7a/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L139
>>>>>>> 2: https://grpc.io/docs/guides/auth/
>>>>>>>
>>>>>>> On Tue, Apr 23, 2019 at 5:06 PM Hai Lu <lhai...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> This is Hai from LinkedIn. Daniel and I have been working on
>>>>>>>> productionizing the Samza portable runner. BTW, Daniel didn't
>>>>>>>> mention in his previous email that he has enabled and validated
>>>>>>>> Python 3 for the Samza runner and it worked smoothly. Kudos to the
>>>>>>>> team!
>>>>>>>>
>>>>>>>> Here I have a few security related questions about portability. At
>>>>>>>> LinkedIn, we enable SSL/TLS and ACLs for Kafka data and any data
>>>>>>>> exchange. In the case of the portable runner, we're required to
>>>>>>>> secure the data channels between Java and Python processes as well
>>>>>>>> because our Samza jobs are running in a multi-tenant environment.
>>>>>>>> While I'm currently working on this on our internal branch, I do
>>>>>>>> want to keep it clean and consistent with the master branch.
>>>>>>>>
>>>>>>>> My questions are: were there any plans/thoughts around security for
>>>>>>>> portability? I see that sdk_worker.py does have some code to create
>>>>>>>> secured gRPC channels; is anyone actually leveraging that code? I
>>>>>>>> don't see any work done on the Java side, though.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Hai Lu
>>>>>>>>
>>>>>>>
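
For reference, the loopback-plus-file-secret scheme Hai describes in this thread could be sketched roughly as below. This is a minimal, standard-library-only illustration and not the code in the PR: the metadata key `x-worker-secret` and the helper names `write_secret`, `read_secret`, `is_authorized`, and `bind_loopback` are all hypothetical, and the sketch assumes that file permissions of the job's user are enough to keep other tenants from reading the secret.

```python
import hmac
import os
import secrets
import socket
import tempfile

# Hypothetical metadata key; not a Beam or gRPC constant.
AUTH_HEADER = "x-worker-secret"


def write_secret(path):
    """Create a fresh random secret in a file only the owning user can read."""
    token = secrets.token_hex(32)
    # O_EXCL refuses to reuse an existing file; 0o600 limits access to the owner.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(token)
    return token


def read_secret(path):
    """Read the secret back, as the worker process would at startup."""
    with open(path) as f:
        return f.read().strip()


def is_authorized(metadata, expected):
    """Check (key, value) call metadata for the expected secret."""
    presented = dict(metadata).get(AUTH_HEADER, "")
    # compare_digest avoids leaking how many characters matched via timing.
    return hmac.compare_digest(presented.encode(), expected.encode())


def bind_loopback():
    """Bind a TCP socket to the loopback interface on an OS-chosen port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))  # loopback only, instead of "any address"
    return sock
```

In a real setup the runner would presumably call something like `is_authorized` from a gRPC server interceptor on each incoming call's metadata, and the worker would attach the value returned by `read_secret` to its outgoing calls; how the secret path is passed to the worker is left open here, as it is in the thread.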