Thanks Amit, that does clear things up!

On Thu, Jul 7, 2016 at 3:30 PM, Amit Sela <amitsel...@gmail.com> wrote:

> I don't think that the Spark runner is special, it's just the way it was
> until now and that's why I brought up the subject here.
>
> The main issue is that currently, if a user wants to write a beam app using
> the Spark runner, he'll have to provide the Spark dependencies, or he'll
> get a ClassNotFoundException (which is exactly the case for beam-examples).
> This of course happens because the Spark runner has a provided dependency on
> Spark (not transitive).
>

Having provided dependencies and making the user include them in their pom is
pretty normal, I think. We already require users to provide an slf4j logger and
Hamcrest+JUnit (if they use PAssert).
    (We include all of these in the examples pom.xml
<https://github.com/apache/incubator-beam/blob/master/examples/java/pom.xml#L286>.)

I don't see any problem with a user who wants to use the Spark runner adding
these provided deps to their own pom (i.e., the same thing we do by declaring
them as runtime deps in the examples pom.xml).
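
For concreteness, here's a rough sketch of the kind of thing a user might add to
their own pom.xml (the group/artifact ids and version properties are illustrative
placeholders, not exact coordinates):

  <!-- Illustrative sketch only: declare the Spark runner plus the Spark bits
       the runner marks as provided. Use scope "provided" when the cluster
       supplies Spark, or "runtime" for a local/standalone run. -->
  <dependencies>
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-runners-spark</artifactId>
      <version>${beam.version}</version>
      <scope>runtime</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>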


> The Flink runner avoids this issue by having a compile dependency on flink,
> thus being transitive.
>
> By having the cluster provide them I mean that the Spark installation is
> aware of the binaries pre-deployed on the cluster and adds them to the
> classpath of the app submitted for execution on the cluster - this is
> common (AFAIK) for Spark and Spark on YARN, and vendors provide similar
> binaries, for example: spark-1.6_hadoop-2.4.0_hdp.xxx.jar (Hortonworks).
>

Makes sense. So a user submitting to a cluster would submit a jar and
command-line
options, and the cluster itself would add the provided deps.


> Putting aside our (Beam) issues, the current artifact "beam-runners-spark"
> is more suitable to run on clusters with pre-deployed binaries rather than a
> quick standalone execution with a single dependency that takes care of
> everything (Spark related),


Great!


> but is more cumbersome for users trying to get
> going for the first time, which is not good!
>

We should decide which experience we're trying to optimize for (I'd lean toward
cluster), but either way I think we should update the examples pom.xml to
support both:

* With cluster mode as the default, we would add a profile for 'local' mode
  (-PsparkIncluded or something) that overrides the provided deps to be runtime
  deps instead (rough sketch below).

* We can cover switching on that profile for local mode in the "getting
  started" instructions.

Dan

> I guess Flink uses a compile dependency for the same reason Spark uses
> provided - because it fits them - what about other runners?
>
> Hope this clarifies some of the questions here.
>
> Thanks,
> Amit
>
> On Fri, Jul 8, 2016 at 12:52 AM Dan Halperin <dhalp...@google.com.invalid>
> wrote:
>
> > hey folks,
> >
> > In general, we should optimize for running on clusters rather than running
> > locally. Examples is a runner-independent module, with non-compile-time
> > deps on runners. Most runners are currently listed as being runtime deps --
> > it sounds like that works for most cases, but might not be the best fit
> > for Spark.
> >
> > Q: What does dependencies being provided by the cluster mean? I'm a little
> > naive here, but how would a user submit a pipeline to a Spark cluster
> > without actually depending on Spark in mvn? Is it not by running the main
> > method in an example like in all other runners?
> >
> > I'd like to understand the above better, but suppose that to optimize for
> > Spark-on-a-cluster, we should default to provided deps in the examples.
> > That would be fine -- but couldn't we just make a profile for local Spark
> > that overrides the deps from provided to runtime?
> >
> > To summarize, I think we do not need new artifacts, but we could use a
> > profile for local testing if absolutely necessary.
> >
> > Thanks,
> > Dan
> >
> > On Thu, Jul 7, 2016 at 2:27 PM, Ismaël Mejía <ieme...@gmail.com> wrote:
> >
> > > Good discussion subject Amit,
> > >
> > > I let the whole beam distribution subject continue in BEAM-320; however,
> > > there is a not-yet-discussed aspect of the spark runner, the maven behavior:
> > >
> > > When you import the beam spark runner as a dependency you are obliged to
> > > provide your spark dependencies by hand too; with the other runners, once
> > > you import the runner everything just works, e.g. google-cloud-dataflow-runner
> > > and flink-runner. I understand the arguments for the current setup (the ones
> > > you mention), but I think it is more user friendly to be consistent with the
> > > other runners and have something that just works as the default (and solve
> > > the examples issue as a consequence). Anyway, I think in the spark case we
> > > need both: a 'spark-included' flavor and the current one, which is really
> > > useful for including the runner as a spark library dependency (like Jesse did
> > > in his video) or as a spark-package.
> > >
> > > Actually both the all-included and the runner-only flavors make sense for
> > > flink too, but this is a different discussion ;)
> > >
> > > What do you think about this? What do the others think?
> > >
> > > Ismaël
> > >
> > >
> > > On Thu, Jul 7, 2016 at 10:19 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > > wrote:
> > >
> > > > No problem and good idea to discuss in the Jira.
> > > >
> > > > Actually, I started to experiment a bit with beam distributions on a
> > > > branch (that I can share with people interested).
> > > >
> > > > Regards
> > > > JB
> > > >
> > > >
> > > > On 07/07/2016 10:12 PM, Amit Sela wrote:
> > > >
> > > >> Thanks JB, I've missed that one.
> > > >>
> > > >> I suggest we continue this in the ticket comments.
> > > >>
> > > >> Thanks,
> > > >> Amit
> > > >>
> > > >> On Thu, Jul 7, 2016 at 11:05 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> > > >> wrote:
> > > >>
> > > >>> Hi Amit,
> > > >>>
> > > >>> I think your proposal is related to:
> > > >>>
> > > >>> https://issues.apache.org/jira/browse/BEAM-320
> > > >>>
> > > >>> As described in the Jira, I'm planning to provide (in dedicated Maven
> > > >>> modules) a Beam distribution including:
> > > >>> - an uber jar to wrap the dependencies
> > > >>> - the underlying runtime backends
> > > >>> - etc
> > > >>>
> > > >>> Regards
> > > >>> JB
> > > >>>
> > > >>> On 07/07/2016 07:49 PM, Amit Sela wrote:
> > > >>>
> > > >>>> Hi everyone,
> > > >>>>
> > > >>>> Lately I've encountered a number of issues concerning the fact that
> > > >>>> the Spark runner does not package Spark along with it, forcing people
> > > >>>> to do this on their own.
> > > >>>> In addition, this seems to get in the way of having beam-examples
> > > >>>> executed against the Spark runner, again because it would have to add
> > > >>>> Spark dependencies.
> > > >>>>
> > > >>>> When running on a cluster (which I guess was the original goal here),
> > > >>>> it is recommended to have Spark provided by the cluster - this makes
> > > >>>> sense for Spark clusters and more so for Spark + YARN clusters where
> > > >>>> you might have your Spark built against a specific Hadoop version or
> > > >>>> using a vendor distribution.
> > > >>>>
> > > >>>> In order to make the runner more accessible to new adopters, I
> > > >>>> suggest considering releasing a "spark-included" artifact as well.
> > > >>>>
> > > >>>> Thoughts ?
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Amit
> > > >>>>
> > > >>>>
> > > >>> --
> > > >>> Jean-Baptiste Onofré
> > > >>> jbono...@apache.org
> > > >>> http://blog.nanthrax.net
> > > >>> Talend - http://www.talend.com
> > > >>>
> > > >>>
> > > >>
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> >
>
