Let me take a shot at summarizing the requirements for ApplicationRunner that we have discussed so far:

1. Support an environment for passing user-defined objects (potentially
   streams) into ApplicationRunner (*Beam*)
2. Improve ease of use for ApplicationRunner by avoiding complex
   configurations such as zkCoordinator and zkCoordinationService
   (*Standalone*)
3. Clean up ApplicationRunner into a single interface (*Fluent*). We can have
   one or more implementations, but they are hidden from users.
4. Separate StreamGraph from the environment so that it can be serializable
   (*Beam, Yarn*)
5. Better life cycle management of the application, including
   start/stop/stats (*Standalone, Beam*)

One way to address 2 and 3 is to provide pre-packaged runners through static
factory methods whose return type is the ApplicationRunner interface. So we
can have:

    ApplicationRunner runner = ApplicationRunner.zk() /
        ApplicationRunner.local() / ApplicationRunner.remote() /
        ApplicationRunner.test()

Internally we will package the right configs and run-time environment with
the runner. For example, ApplicationRunner.zk() will define all the configs
needed for zk coordination.

To support 1 and 4, can we pass a lambda function into the runner and then
run the stream graph? Like the following:

    ApplicationRunner.zk().env(config -> environment).run(streamGraph);

Then we need a way to pass the environment into the StreamGraph. This can be
done either by adding an extra parameter to each operator or by having a
getEnv() function in MessageStream, which seems pretty hacky.

What do you think?

Thanks,
Xinyu

On Sun, Apr 23, 2017 at 11:01 PM, Prateek Maheshwari
<pmaheshw...@linkedin.com.invalid> wrote:

> Thanks for putting this together Yi!
>
> I agree with Jake, it does seem like there are a few too many moving parts
> here. That said, the problem being solved is pretty broad, so let me try to
> summarize my current understanding of the requirements. Please correct me
> if I'm wrong or missing something.
>
> ApplicationRunner and JobRunner first, ignoring test environment for the
> moment.
>
> ApplicationRunner:
> 1. Create execution plan: Same in Standalone and Yarn.
> 2. Create intermediate streams: Same logic but different leader election
>    (ZK-based or pre-configured in standalone, AM in Yarn).
> 3. Run jobs: In JVM in standalone. Submit to the cluster in Yarn.
>
> JobRunner:
> 1. Run the StreamProcessors: Same process in Standalone & Test. Remote host
>    in Yarn.
>
> To get a single ApplicationRunner implementation, like Jake suggested, we
> need to make leader election and the JobRunner implementation pluggable.
> There's still the question of whether the ApplicationRunner#run API should
> be blocking or non-blocking. It has to be non-blocking in YARN. We want it
> to be blocking in standalone, but it seems like the main reason is ease of
> use when launched from main(). I'd prefer making it consistently
> non-blocking instead, esp. since in embedded standalone mode (where the
> processor is running in another container) a blocking API would not be
> user-friendly either. If not, we can add both run and runBlocking.
>
> Coming to RuntimeEnvironment, which is the least clear to me so far:
> 1. I don't think RuntimeEnvironment should be responsible for providing
>    StreamSpecs for streamIds - they can be obtained with a config/util
>    class. The StreamProcessor should only know about logical streamIds, and
>    the streamId <-> actual stream mapping should happen within the
>    SystemProducer/Consumer/Admins provided by the RuntimeEnvironment.
> 2. There are also other components that the user might be interested in
>    providing implementations of in embedded Standalone mode (i.e., not just
>    in tests) - MetricsRegistry and JMXServer come to mind.
> 3. Most importantly, it's not clear to me who creates and manages the
>    RuntimeEnvironment. It seems like it should be the ApplicationRunner or
>    the user because of (2) above and because StreamManager also needs
>    access to SystemAdmins for creating intermediate streams which users
>    might want to mock.
>    But it also needs to be passed down to the StreamProcessor - how would
>    this work on Yarn?
>
> I think we should figure out how to integrate RuntimeEnvironment with
> ApplicationRunner before we can make a call on one vs. multiple
> ApplicationRunner implementations. If we do keep LocalApplicationRunner and
> RemoteApplicationRunner (and TestApplicationRunner) separate, I agree with
> Jake that we should remove the JobRunners and roll them up into the
> respective ApplicationRunners.
>
> - Prateek
>
> On Thu, Apr 20, 2017 at 10:06 AM, Jacob Maes <jacob.m...@gmail.com> wrote:
>
> > Thanks for the SEP!
> >
> > +1 on introducing these new components
> > -1 on the current definition of their roles (see Design feedback below)
> >
> > *Design*
> >
> > - If LocalJobRunner and RemoteJobRunner handle the different methods of
> >   launching a Job, what additional value do the different types of
> >   ApplicationRunner and RuntimeEnvironment provide? It seems like a red
> >   flag that all 3 would need to change from environment to environment.
> >   It indicates that they don't have proper modularity. The
> >   call-sequence-figures support this; LocalApplicationRunner and
> >   RemoteApplicationRunner make the same calls and the diagram only varies
> >   after jobRunner.start()
> > - As far as I can tell, the only difference between Local and Remote
> >   ApplicationRunner is that one is blocking and the other is
> >   non-blocking. If that's all they're for, then either the names should
> >   be changed to reflect this, or they should be combined into one
> >   ApplicationRunner that just exposes separate methods for run() and
> >   runBlocking()
> > - There isn't much detail on why the main() methods for Local/Remote
> >   have such different implementations, how they receive the Application
> >   (direct vs config), and concretely how the deployment scripts, if any,
> >   should interact with them.
> >
> > *Style*
> >
> > - nit: None of the 11 uses of the word "actual" in the doc are
> >   *actually* needed. :-)
> > - nit: Colors of the runtime blocks in the diagrams are unconventional
> >   and a little distracting. Reminds me of nai won bao. Now I'm hungry.
> >   :-)
> > - Prefer the name "ExecutionEnvironment" over "RuntimeEnvironment". The
> >   term "execution environment" is used
> > - The code comparisons for the ApplicationRunners are not apples-apples.
> >   The local runner example is an application that USES the local runner.
> >   The remote runner example is just the runner code itself. So, it's not
> >   readily apparent that we're comparing the main() methods and not the
> >   application itself.
> >
> > On Mon, Apr 17, 2017 at 5:02 PM, Yi Pan <nickpa...@gmail.com> wrote:
> >
> > > Made some updates to clarify the role and functions of
> > > RuntimeEnvironment in SEP-2.
> > >
> > > On Fri, Apr 14, 2017 at 9:30 AM, Yi Pan <nickpa...@gmail.com> wrote:
> > >
> > > > Hi, everyone,
> > > >
> > > > In light of new features such as fluent API and standalone that
> > > > introduce new deployment / application launch models in Samza, I
> > > > created a new SEP-2 to address the new use cases. SEP-2 link:
> > > > https://cwiki.apache.org/confluence/display/SAMZA/SEP-2%3A+ApplicationRunner+Design
> > > >
> > > > Please take a look and give feedback!
> > > >
> > > > Thanks!
> > > >
> > > > -Yi
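
For reference, the factory-method idea from the top of this thread
(ApplicationRunner.zk().env(config -> environment).run(streamGraph)) could
look roughly like the Java sketch below. It is only an illustration under
assumptions: StreamGraph, Environment, and PackagedRunner are hypothetical
stand-ins rather than Samza classes, the config keys are made up, and run()
just returns a status string instead of launching anything. How the
environment reaches individual operators is left open, as in the discussion.

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch only: none of these types are Samza's real classes.
final class RunnerSketch {

    /** Hypothetical stream graph; serializable because it holds no environment. */
    static final class StreamGraph implements Serializable {
        final String name;
        StreamGraph(String name) { this.name = name; }
    }

    /** Hypothetical holder for user-defined objects (for example, pre-built streams). */
    static final class Environment {
        final Map<String, Object> objects;
        Environment(Map<String, Object> objects) { this.objects = objects; }
    }

    /** The single interface users see; implementations hide behind the factories. */
    interface ApplicationRunner {
        /** Registers a lambda that builds the environment from the packaged config. */
        ApplicationRunner env(Function<Map<String, String>, Environment> factory);

        /** Runs the graph; here it only reports what it would do. */
        String run(StreamGraph graph);

        // Pre-packaged runners: each factory bundles its own (made-up) config.
        static ApplicationRunner zk() {
            return new PackagedRunner("zk", Collections.singletonMap("coordination", "zk"));
        }

        static ApplicationRunner local() {
            return new PackagedRunner("local", Collections.singletonMap("coordination", "none"));
        }
    }

    /** One hidden implementation shared by all factories in this sketch. */
    static final class PackagedRunner implements ApplicationRunner {
        private final String mode;
        private final Map<String, String> config;
        // Default: an empty environment when the user supplies none.
        private Function<Map<String, String>, Environment> envFactory =
            cfg -> new Environment(Collections.<String, Object>emptyMap());

        PackagedRunner(String mode, Map<String, String> config) {
            this.mode = mode;
            this.config = config;
        }

        @Override
        public ApplicationRunner env(Function<Map<String, String>, Environment> factory) {
            this.envFactory = factory;
            return this;
        }

        @Override
        public String run(StreamGraph graph) {
            // The environment is built at run time, so the graph itself stays serializable.
            Environment environment = envFactory.apply(config);
            return mode + " runner started '" + graph.name + "' with "
                + environment.objects.size() + " user object(s)";
        }
    }

    public static void main(String[] args) {
        String status = ApplicationRunner.zk()
            .env(config -> new Environment(
                Collections.<String, Object>singletonMap("input", "my-stream")))
            .run(new StreamGraph("page-views"));
        System.out.println(status);
    }
}
```

Because the environment is resolved inside run(), the StreamGraph carries no
reference to it, which is one possible way to keep requirements 1 and 4
separate.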