Thanks for putting this together Yi!

I agree with Jake, it does seem like there are a few too many moving parts
here. That said, the problem being solved is pretty broad, so let me try to
summarize my current understanding of the requirements. Please correct me
if I'm wrong or missing something.

ApplicationRunner and JobRunner first, ignoring test environment for the
moment.
ApplicationRunner:
1. Create execution plan: Same in Standalone and Yarn
2. Create intermediate streams: Same logic but different leader election
(ZK-based or pre-configured in standalone, AM in Yarn).
3. Run jobs: In JVM in standalone. Submit to the cluster in Yarn.

JobRunner:
1. Run the StreamProcessors: Same process in Standalone & Test. Remote host
in Yarn.

To get a single ApplicationRunner implementation, like Jake suggested, we
need to make leader election and JobRunner implementation pluggable.
There's still the question of whether ApplicationRunner#run API should be
blocking or non-blocking. It has to be non-blocking in YARN. We want it to
be blocking in standalone, but seems like the main reason is ease of use
when launched from main(). I'd prefer making it consitently non-blocking
instead, esp. since in embedded standalone mode (where the processor is
running in another container) a blocking API would not be user-friendly
either. If not, we can add both run and runBlocking.

Coming to RuntimeEnvironment, which is the least clear to me so far:
1. I don't think RuntimeEnvironment should be responsible for providing
StreamSpecs for streamIds - they can be obtained with a config/util class.
The StreamProcessor should only know about logical streamIds and the
streamId <-> actual stream mapping should happen within the
SystemProducer/Consumer/Admins provided by the RuntimeEnvironment.
2. There's also other components that the user might be interested in
providing implementations of in embedded Standalone mode (i.e., not just in
tests) - MetricsRegistry and JMXServer come to mind.
3. Most importantly, it's not clear to me who creates and manages the
RuntimeEnvironment. It seems like it should be the ApplicationRunner or the
user because of (2) above and because StreamManager also needs access to
SystemAdmins for creating intermediate streams which users might want to
mock. But it also needs to be passed down to the StreamProcessor - how
would this work on Yarn?

I think we should figure out how to integrate RuntimeEnvironment with
ApplicationRunner before we can make a call on one vs. multiple
ApplicationRunner implementations. If we do keep LocalApplicationRunner and
RemoteApplication (and TestApplicationRunner) separate, agree with Jake
that we should remove the JobRunners and roll them up into the respective
ApplicationRunners.

- Prateek

On Thu, Apr 20, 2017 at 10:06 AM, Jacob Maes <jacob.m...@gmail.com> wrote:

> Thanks for the SEP!
>
> +1 on introducing these new components
> -1 on the current definition of their roles (see Design feedback below)
>
> *Design*
>
>    - If LocalJobRunner and RemoteJobRunner handle the different methods of
>    launching a Job, what additional value do the different types of
>    ApplicationRunner and RuntimeEnvironment provide? It seems like a red
> flag
>    that all 3 would need to change from environment to environment. It
>    indicates that they don't have proper modularity. The
> call-sequence-figures
>    support this; LocalApplicationRunner and RemoteApplicationRunner make
> the
>    same calls and the diagram only varies after jobRunner.start()
>    - As far as I can tell, the only difference between Local and Remote
>    ApplicationRunner is that one is blocking and the other is
> non-blocking. If
>    that's all they're for then either the names should be changed to
> reflect
>    this, or they should be combined into one ApplicationRunner and just
> expose
>    separate methods for run() and runBlocking()
>    - There isn't much detail on why the main() methods for Local/Remote
>    have such different implementations, how they receive the Application
>    (direct vs config), and concretely how the deployment scripts, if any,
>    should interact with them.
>
>
> *Style*
>
>    - nit: None of the 11 uses of the word "actual" in the doc are
> *actually*
>    needed. :-)
>    - nit: Colors of the runtime blocks in the diagrams are unconventional
>    and a little distracting. Reminds me of nai won bao. Now I'm hungry. :-)
>    - Prefer the name "ExecutionEnvironment" over "RuntimeEnvironment". The
>    term "execution environment" is used
>    - The code comparisons for the ApplicationRunners are not apples-apples.
>    The local runner example is an application that USES the local runner.
> The
>    remote runner example is the just the runner code itself. So, it's not
>    readily apparent that we're comparing the main() methods and not the
>    application itself.
>
>
> On Mon, Apr 17, 2017 at 5:02 PM, Yi Pan <nickpa...@gmail.com> wrote:
>
> > Made some updates to clarify the role and functions of RuntimeEnvironment
> > in SEP-2.
> >
> > On Fri, Apr 14, 2017 at 9:30 AM, Yi Pan <nickpa...@gmail.com> wrote:
> >
> > > Hi, everyone,
> > >
> > > In light of new features such as fluent API and standalone that
> introduce
> > > new deployment / application launch models in Samza, I created a new
> > SEP-2
> > > to address the new use cases. SEP-2 link: https://cwiki.apache.
> > > org/confluence/display/SAMZA/SEP-2%3A+ApplicationRunner+Design
> > >
> > > Please take a look and give feedbacks!
> > >
> > > Thanks!
> > >
> > > -Yi
> > >
> >
>

Reply via email to