Re: [DISCUSS] Dependencies resolution and action level dependencies

2018-10-23 Thread Yaniv Rodenski
Excellent,

I've added:
AMATERASU-53 - Support action level dependencies

AMATERASU-54 - Use Docker for Mesos containerization

I suggest we review how to approach the YARN implementation after
AMATERASU-54.

Cheers,
Yaniv

On Tue, Oct 23, 2018 at 9:51 PM Yariv Triffon  wrote:

> Hi Yaniv,
> I'm good to grab the task of moving Mesos to use Docker containers.
>
> Cheers,
> Yariv
>
> On Tue, Oct 23, 2018 at 5:13 PM Yaniv Rodenski  wrote:
>
> > Thanks, Kirupa,
> >
> > I'll create the JIRA tasks shortly and assign that one to you.
> >
> >
> >
> > > On Tue, Oct 23, 2018 at 5:09 PM Kirupa Devarajan <kirupagara...@gmail.com> wrote:
> >
> > > Hi Yaniv,
> > >
> > > I am happy to pick up the following task
> > >
> > > 1. Add to the JobManager the functionality to read action level
> > > dependencies
> > >
> > > Regards,
> > > Kirupa
> > >
> > > On Tue., 23 Oct. 2018, 11:04 am Yaniv Rodenski, wrote:
> > >
> > > > Hi Nadav,
> > > >
> > > > It does make sense. In fact, we already have action level resources;
> > > > however, they are limited to the configuration files for the
> > > > container.
> > > > I also think that we need to revisit the way we set those up.
> > > > Currently we use YARN/Mesos to copy dependencies to the containers.
> > > > With YARN 3.0 I think it makes sense to move to Docker as the way to
> > > > manage resources in the containers.
> > > > This should also have performance benefits and will make life easier
> > > > (I hope) when we start working on K8s.
> > > >
> > > > To do this, I think we need to add the following tasks:
> > > > 1. Add to the JobManager the functionality to read action level
> > > > dependencies
> > > > 2. Move from Mesos/YARN containers to Docker (probably at least two
> > > > tasks)
> > > >
> > > > I'll add them to JIRA ASAP, for version 0.2.1-incubating, if everyone
> > > > is OK with it.
> > > >
> > > > On Sat, Oct 20, 2018 at 6:43 PM Nadav Har Tzvi <nadavhart...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hey everyone,
> > > > >
> > > > > Yaniv and I were just discussing how to resolve dependencies in the
> > > > > new frameworks architecture and integrate the dependencies with the
> > > > > concrete cluster resource manager (Mesos/YARN).
> > > > > We rolled with the idea of each runner (or base runner) performing
> > > > > the dependency resolution on its own.
> > > > > So, for example, the Spark Scala runner would resolve the required
> > > > > JARs and do whatever it needs to do with them (e.g. spark-submit
> > > > > --jars --packages --repositories, etc.).
> > > > > The base Python provider will resolve dependencies and dynamically
> > > > > generate a requirements.txt file that will be deployed to the
> > > > > executor.
> > > > > The handling of the requirements.txt file differs between different
> > > > > concrete Python runners. For example, a regular Python runner would
> > > > > simply run pip install, while the pyspark runner would need to
> > > > > rearrange the dependencies in a way that is acceptable to
> > > > > spark-submit (
> > > > > https://bytes.grubhub.com/managing-dependencies-and-artifacts-in-pyspark-7641aa89ddb7
> > > > > sounds like a decent approach; please comment if you have a better
> > > > > idea).
> > > > >
> > > > > So far, I hope it makes sense.
> > > > >
> > > > > The next item I want to discuss is as follows:
> > > > > In the new architecture, we do hierarchical runtime environment
> > > > > resolution, starting at the top job level and drilling down to the
> > > > > action level, outputting one unified environment configuration file
> > > > > that is deployed to the executor.
> > > > > I suggest doing the same with dependencies.
> > > > > Currently, we only have job level dependencies. I suggest that we
> > > > > provide action level dependencies and resolve them in exactly the
> > > > > same manner as we resolve the environment.
> > > > > There should be quite a few benefits to this approach:
> > > > >
> > > > >    1. It gives the option to have different versions of the same
> > > > >    package in different actions. This is especially important if you
> > > > >    have 2+ pipeline developers working independently, as it reduces
> > > > >    integration costs by letting each action be more self-contained.
> > > > >    2. It should lower the startup time per action. The more
> > > > >    dependencies you have, the longer it takes to resolve and install
> > > > >    them. Actions will no longer get any unnecessary dependencies.
> > > > >
> > > > >
> > > > > What do you think? Does it make sense?
> > > > >
> > > > > Cheers,
> > > > > Nadav
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Yaniv Rodenski
> > > >
> > > > +61 477 778 405
> > 

[DISCUSS] Dependencies resolution and action level dependencies

2018-10-20 Thread Nadav Har Tzvi
Hey everyone,

Yaniv and I were just discussing how to resolve dependencies in the new
frameworks architecture and integrate the dependencies with the concrete
cluster resource manager (Mesos/YARN).
We rolled with the idea of each runner (or base runner) performing the
dependency resolution on its own.
So, for example, the Spark Scala runner would resolve the required JARs and
do whatever it needs to do with them (e.g. spark-submit --jars --packages
--repositories, etc.).
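To make the runner idea concrete, here is a minimal sketch of how a Spark runner might assemble its spark-submit invocation from the dependencies it resolved. This is illustrative only; the function and variable names are assumptions, not actual Amaterasu APIs, though the --jars/--packages/--repositories flags are real spark-submit options:

```python
# Hypothetical sketch of a runner building a spark-submit command line
# from dependencies it has already resolved. Names are illustrative.
from typing import List


def build_spark_submit_args(jars: List[str],
                            packages: List[str],
                            repositories: List[str],
                            app_jar: str) -> List[str]:
    """Assemble the argument list for a spark-submit call."""
    args = ["spark-submit"]
    if jars:
        # spark-submit expects comma-separated lists for these flags
        args += ["--jars", ",".join(jars)]
    if packages:
        args += ["--packages", ",".join(packages)]
    if repositories:
        args += ["--repositories", ",".join(repositories)]
    args.append(app_jar)  # the application JAR goes last
    return args


args = build_spark_submit_args(
    jars=["/opt/libs/util.jar"],
    packages=["org.apache.kafka:kafka-clients:2.0.0"],
    repositories=["https://repo1.maven.org/maven2"],
    app_jar="action.jar",
)
```

A real runner would then hand this argument list to a process launcher rather than shelling out blindly, but the shape of the command is the same.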
The base Python provider will resolve dependencies and dynamically generate
a requirements.txt file that will be deployed to the executor.
The handling of the requirements.txt file differs between different
concrete Python runners. For example, a regular Python runner would simply
run pip install, while the pyspark runner would need to rearrange the
dependencies in a way that is acceptable to spark-submit (
https://bytes.grubhub.com/managing-dependencies-and-artifacts-in-pyspark-7641aa89ddb7
sounds like a decent approach; please comment if you have a better idea).
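The Python provider side of this could look roughly like the sketch below: the provider serializes the resolved dependencies into requirements.txt, and a plain Python runner installs it on the executor with pip. Again, this is a sketch under assumed names, not actual Amaterasu code:

```python
# Illustrative sketch (not actual Amaterasu code): the base Python provider
# writes resolved dependencies to requirements.txt; a plain Python runner
# would then install that file on the executor via pip.
import subprocess
import sys
from typing import List


def write_requirements(deps: List[str], path: str = "requirements.txt") -> None:
    """Serialize the resolved dependencies, one pinned requirement per line."""
    with open(path, "w") as f:
        f.write("\n".join(deps) + "\n")


def pip_install(path: str = "requirements.txt") -> None:
    """What a regular Python runner might do on the executor side."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", path])


# Provider side: produce the file that gets shipped to the executor.
write_requirements(["requests==2.19.1", "numpy==1.15.2"])
```

A pyspark runner would consume the same requirements.txt but repackage the installed dependencies for spark-submit instead of installing them in place.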

So far I hope it makes sense.

The next item I want to discuss is as follows:
In the new architecture, we do hierarchical runtime environment resolution,
starting at the top job level and drilling down to the action level,
outputting one unified environment configuration file that is deployed to
the executor.
I suggest doing the same with dependencies.
Currently, we only have job level dependencies. I suggest that we provide
action level dependencies and resolve them in exactly the same manner as we
resolve the environment.
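The hierarchical resolution described above could be sketched as a simple two-level merge, job level first and action level overriding on conflict, mirroring how the environment files are unified. The data shapes here (package-to-version maps) are assumptions for illustration:

```python
# Sketch only: merging job level and action level dependencies the same
# way the hierarchical environment resolution works. Job level entries
# are the defaults; action level entries win on conflict.
from typing import Dict


def resolve_dependencies(job_deps: Dict[str, str],
                         action_deps: Dict[str, str]) -> Dict[str, str]:
    """Return the unified package -> version map for a single action."""
    resolved = dict(job_deps)      # start from the job level defaults
    resolved.update(action_deps)   # action level overrides on conflict
    return resolved


merged = resolve_dependencies(
    job_deps={"requests": "2.19.1", "numpy": "1.15.2"},
    action_deps={"numpy": "1.14.0"},  # this action pins an older numpy
)
```

This is what makes the first benefit below possible: two actions in the same job can pin different versions of the same package without touching the job level defaults.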
There should be quite a few benefits to this approach:

   1. It gives the option to have different versions of the same
   package in different actions. This is especially important if you have 2+
   pipeline developers working independently, as it reduces the
   integration costs by letting each action be more self-contained.
   2. It should lower the startup time per action. The more dependencies
   you have, the longer it takes to resolve and install them. Actions will no
   longer get any unnecessary dependencies.


What do you think? Does it make sense?

Cheers,
Nadav