I don't think the Spark runner is special; it's just the way it has been until now, and that's why I brought up the subject here.
The main issue is that currently, if a user wants to write a Beam app using the Spark runner, they have to provide the Spark dependencies themselves, or they'll get a ClassNotFoundException (which is exactly the case for beam-examples). This happens because the Spark runner declares its Spark dependency with provided scope (not transitive). The Flink runner avoids this issue by having a compile dependency on Flink, which is transitive.

By having the cluster provide them, I mean that the Spark installation is aware of the binaries pre-deployed on the cluster and adds them to the classpath of the app submitted for execution on the cluster - this is common (AFAIK) for Spark and Spark on YARN, and vendors provide similar binaries, for example: spark-1.6_hadoop-2.4.0_hdp.xxx.jar (Hortonworks).

Putting aside our (Beam) issues, the current "beam-runners-spark" artifact is better suited to running on clusters with pre-deployed binaries than to a quick standalone execution with a single dependency that takes care of everything Spark-related; it is more cumbersome for users trying to get going for the first time, which is not good!

I guess Flink uses a compile dependency for the same reason Spark uses provided - because it fits them. What about other runners?

Hope this clarifies some of the questions here.

Thanks,
Amit

On Fri, Jul 8, 2016 at 12:52 AM Dan Halperin <dhalp...@google.com.invalid> wrote:

> hey folks,
>
> In general, we should optimize for running on clusters rather than running
> locally. Examples is a runner-independent module, with non-compile-time
> deps on runners. Most runners are currently listed as being runtime deps --
> it sounds like that works, for most cases, but might not be the best fit
> for Spark.
>
> Q: What does dependencies being provided by the cluster mean? I'm a little
> naive here, but how would a user submit a pipeline to a Spark cluster
> without actually depending on Spark in mvn?
> Is it not by running the main
> method in an example like in all other runners?
>
> I'd like to understand the above better, but suppose that to optimize for
> Spark-on-a-cluster, we should default to provided deps in the examples.
> That would be fine -- but couldn't we just make a profile for local Spark
> that overrides the deps from provided to runtime?
>
> To summarize, I think we do not need new artifacts, but we could use a
> profile for local testing if absolutely necessary.
>
> Thanks,
> Dan
>
> On Thu, Jul 7, 2016 at 2:27 PM, Ismaël Mejía <ieme...@gmail.com> wrote:
>
> > Good discussion subject Amit,
> >
> > I'll let the whole Beam distribution subject continue in BEAM-320; however,
> > there is a not-yet-discussed aspect of the Spark runner: the Maven
> > behavior.
> >
> > When you import the Beam Spark runner as a dependency, you are obliged to
> > provide your Spark dependencies by hand too; with the other runners, once
> > you import the runner everything just works, e.g.
> > google-cloud-dataflow-runner and flink-runner. I understand the arguments
> > for the current setup (the ones you mention), but I think it is more
> > user-friendly to be consistent with the other runners and have something
> > that just works as the default (and solve the examples issue as a
> > consequence). Anyway, I think in the Spark case we need both: a
> > 'spark-included' flavor and the current one, which is really useful for
> > including the runner as a Spark library dependency (like Jesse did in his
> > video) or as a spark-package.
> >
> > Actually, both the all-included and the runner-only make sense for Flink
> > too, but this is a different discussion ;)
> >
> > What do you think about this? What do the others think?
> >
> > Ismaël
> >
> >
> > On Thu, Jul 7, 2016 at 10:19 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> > > No problem and good idea to discuss in the Jira.
> > >
> > > Actually, I started to experiment a bit with Beam distributions on a
> > > branch (which I can share with people interested).
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 07/07/2016 10:12 PM, Amit Sela wrote:
> > >
> > >> Thanks JB, I've missed that one.
> > >>
> > >> I suggest we continue this in the ticket comments.
> > >>
> > >> Thanks,
> > >> Amit
> > >>
> > >> On Thu, Jul 7, 2016 at 11:05 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> > >> wrote:
> > >>
> > >>> Hi Amit,
> > >>>
> > >>> I think your proposal is related to:
> > >>>
> > >>> https://issues.apache.org/jira/browse/BEAM-320
> > >>>
> > >>> As described in the Jira, what I'm planning to provide (in dedicated
> > >>> Maven modules) is a Beam distribution including:
> > >>> - an uber jar to wrap the dependencies
> > >>> - the underlying runtime backends
> > >>> - etc.
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>> On 07/07/2016 07:49 PM, Amit Sela wrote:
> > >>>
> > >>>> Hi everyone,
> > >>>>
> > >>>> Lately I've encountered a number of issues concerning the fact that
> > >>>> the Spark runner does not package Spark along with it, forcing people
> > >>>> to do this on their own.
> > >>>> In addition, this seems to get in the way of having beam-examples
> > >>>> executed against the Spark runner, again because it would have to add
> > >>>> Spark dependencies.
> > >>>>
> > >>>> When running on a cluster (which I guess was the original goal here),
> > >>>> it is recommended to have Spark provided by the cluster - this makes
> > >>>> sense for Spark clusters and more so for Spark + YARN clusters, where
> > >>>> you might have your Spark built against a specific Hadoop version or
> > >>>> using a vendor distribution.
> > >>>>
> > >>>> In order to make the runner more accessible to new adopters, I suggest
> > >>>> considering releasing a "spark-included" artifact as well.
> > >>>>
> > >>>> Thoughts?
> > >>>>
> > >>>> Thanks,
> > >>>> Amit
> > >>>>
> > >>>>
> > >>> --
> > >>> Jean-Baptiste Onofré
> > >>> jbono...@apache.org
> > >>> http://blog.nanthrax.net
> > >>> Talend - http://www.talend.com
> > >>>
> > >>>
> > >>
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
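[Editor's note: the two Maven setups debated in this thread can be sketched in a single application POM fragment. This is only an illustration; the artifact IDs and version properties (`${beam.version}`, `${spark.version}`, the Scala suffix `_2.10`) are assumptions from the Beam 0.x / Spark 1.x era and may not match the released coordinates exactly.]

```xml
<!-- Illustrative application POM fragment (coordinates are assumptions). -->
<dependencies>
  <!-- The Spark runner declares its Spark dependency with provided scope,
       so Spark is NOT pulled in transitively. On a cluster, spark-submit
       puts the pre-deployed Spark binaries on the classpath; locally, the
       app fails with ClassNotFoundException unless the user adds Spark. -->
  <dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-spark</artifactId>
    <version>${beam.version}</version>
  </dependency>
</dependencies>

<!-- Dan's suggestion: keep cluster execution as the default, and switch
     Spark to runtime scope for local runs via a profile, activated with
     e.g. `mvn exec:java -Plocal-spark ...`. -->
<profiles>
  <profile>
    <id>local-spark</id>
    <dependencies>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>${spark.version}</version>
        <scope>runtime</scope>
      </dependency>
    </dependencies>
  </profile>
</profiles>
```

With this layout, cluster users get the lean, provided-scope artifact by default, while first-time users can run locally by enabling the profile instead of hand-picking Spark dependencies.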