Hey everyone,

So we have this issue: Anaconda takes forever to deploy on the executors,
whether we run on YARN or Mesos.

Let's first discuss why it is like this right now.

First, let's look at how Apache Amaterasu interacts with each underlying
platform, specifically at the smallest independent unit that gets its own
isolated execution environment.

*Apache Mesos:*
In Apache Mesos, we get our own nifty set of instances and executors. An
instance can host multiple executors, depending on its capacity.
Thus the smallest independent unit here is the executor itself.

*Apache Hadoop YARN*:
On YARN, we have a similar set of resources: we have nodes, and each node
hosts containers. Thus the smallest independent unit here is the container.

Great, so far it sounds similar, right? Here is where Apache Amaterasu
handles things a bit differently on each platform.

In Apache Mesos, everything runs on the same executor, regardless of how
many actions the job has. So if the job has 20 actions, they will run
sequentially on the same executor. In practice, the smallest independent
unit is the job itself, as only the job gets its own running environment.

On Hadoop, things are very different.
To start, each action is treated by YARN as a different application, with
its own set of containers. This means that on YARN, the action is the
smallest independent unit.
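
To make the difference concrete, here is a minimal sketch of how actions end
up grouped into execution environments on each platform. It is only an
illustration of the description above, not Amaterasu code: the function name
execution_environments and the action names are made up for the example.

```python
def execution_environments(platform, actions):
    """Group a job's actions by the execution environment they share.

    Hypothetical helper for illustration only; it just mirrors the
    description in this email and is not part of Amaterasu.
    """
    if platform == "mesos":
        # All actions run sequentially on the same executor,
        # so the whole job shares a single environment.
        return [list(actions)]
    if platform == "yarn":
        # Each action is a separate YARN application with its own
        # containers, so every action gets a fresh environment.
        return [[action] for action in actions]
    raise ValueError(f"unknown platform: {platform}")


actions = ["ingest", "transform", "train"]  # made-up action names
mesos_envs = execution_environments("mesos", actions)
yarn_envs = execution_environments("yarn", actions)
print(len(mesos_envs), "environment(s) on Mesos")
print(len(yarn_envs), "environment(s) on YARN")
```

Every environment in the returned list is a place where anything we depend
on has to be set up from scratch.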

So what's the problem, actually? In general, the problem is that we cannot
rely on the existence of 3rd party utilities, libraries, you name it, on
the target execution environment. This forces us to bundle anything we need
along with the job execution process.
Anaconda is exactly such a 3rd party utility, one that we desperately need
in order to run PySpark code that depends on more than PySpark itself and
pure Python (pandas, numpy, sklearn; there are more than enough examples
out there).
We need to install Anaconda once for each execution environment. In Apache
Mesos, our smallest reliable execution environment is the executor itself,
and since the whole job runs on one executor, we install Anaconda once per
job.
In YARN, our smallest execution environment is the container, hence we need
to install Anaconda over and over, once for each action.
This obviously poses a problem, for several reasons:
1. While we can excuse the first action as setup time, it is obvious that
from the second action on we are wasting a lot of time. To compare Mesos
and YARN: starting the second action on Mesos is a matter of seconds. On
YARN it is measured in minutes.
2. We do the same thing over and over again, even if we run on the same
machine. This makes no sense whatsoever! We are losing the ability to cache
things. So for example, if I need numpy and that takes about 20-30 seconds
to download and install, why do I need to install it from scratch over and
over again?
3. It causes code reliability issues. If Miniconda isn't there and I need
to run a PySpark job, I now have to set up guards and fallbacks and
whatnot. Even worse, I have to find weird tricks to even get access to the
Miniconda environment, and those are different on Mesos and YARN, so now I
have a jungle in the code!
4. On YARN, PySpark runs in yet another container! Guess what?! This
container has no access to Miniconda! We currently use --py-files to send a
list of a gazillion packages. This is different on Mesos, where PySpark
itself runs in the same executor as the main Amaterasu process.
So guess what? I now have a jungle in my PySpark invocation code too!
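
To put a number on point 2, here is a rough sketch that counts installs for
a given placement of action containers on machines, with and without
reusing an install that already exists on the machine. The placement, the
node names and the per-machine reuse are all assumptions made up for the
illustration; today we do not reuse anything.

```python
def count_installs(placements, reuse_per_machine):
    """Count Anaconda installs for a list of container placements.

    placements: one machine name per action container (hypothetical data).
    reuse_per_machine: if True, assume an install already present on a
    machine could be reused. This is a what-if, not current behaviour.
    """
    if reuse_per_machine:
        # One install per distinct machine, no matter how many
        # containers land on it.
        return len(set(placements))
    # Current situation on YARN: every container installs from scratch.
    return len(placements)


# Made-up placement: four actions, three of them on the same machine.
placements = ["node-1", "node-1", "node-2", "node-1"]
print("installs today:", count_installs(placements, reuse_per_machine=False))
print("installs with reuse:", count_installs(placements, reuse_per_machine=True))
```

The gap between the two numbers is the work we repeat on a machine that has
already done it.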

Also note that the current implementation for resolving Python 3rd party
dependencies is Anaconda. This gives us an isolated environment that
doesn't rely on the existing Python (because maybe, for some reason, you
have Python 2.5 on your cluster, which is not supported by new versions of
data libraries such as pandas, numpy and so forth). In addition, it gives
us the nifty Conda package manager.
However, it doesn't mean that it has to stay that way. If the need
arises, we may also have to support pip and support using the native
Python version (instead of the one supplied by Anaconda).

I want to discuss the possible solutions to this. Please feel free to bring
up your ideas.

Cheers,
Nadav
