I don't think the Spark runner is special; it's just the way it has been until now, and that's why I brought up the subject here.
The main issue is that currently, if a user wants to write a Beam app using the Spark runner, they have to provide the Spark dependencies themselves, or they'll get a ClassNotFoundException (which is exactly the case for beam-examples). This happens because the Spark runner declares its Spark dependency with provided scope (not transitive). The Flink runner avoids this issue by having a compile dependency on Flink, which is transitive.

By having the cluster provide them, I mean that the Spark installation is aware of the binaries pre-deployed on the cluster and adds them to the classpath of the app submitted for execution on the cluster - this is common (AFAIK) for Spark and Spark on YARN, and vendors provide similar binaries, for example: spark-1.6_hadoop-2.4.0_hdp.xxx.jar (Hortonworks).

Putting aside our (Beam) issues, the current "beam-runners-spark" artifact is better suited to running on clusters with pre-deployed binaries than to a quick standalone execution with a single dependency that takes care of everything Spark-related; it is more cumbersome for users trying to get going for the first time, which is not good!

I guess Flink uses a compile dependency for the same reason Spark uses provided - because it fits them. What about other runners?

Hope this clarifies some of the questions here.

Thanks,
Amit

On Fri, Jul 8, 2016 at 12:52 AM Dan Halperin <dhalp...@google.com.invalid> wrote:

> hey folks,
>
> In general, we should optimize for running on clusters rather than running
> locally. Examples is a runner-independent module, with non-compile-time
> deps on runners. Most runners are currently listed as being runtime deps --
> it sounds like that works, for most cases, but might not be the best fit
> for Spark.
>
> Q: What does dependencies being provided by the cluster mean? I'm a little
> naive here, but how would a user submit a pipeline to a Spark cluster
> without actually depending on Spark in mvn?
> Is it not by running the main
> method in an example like in all other runners?
>
> I'd like to understand the above better, but suppose that to optimize for
> Spark-on-a-cluster, we should default to provided deps in the examples.
> That would be fine -- but couldn't we just make a profile for local Spark
> that overrides the deps from provided to runtime?
>
> To summarize, I think we do not need new artifacts, but we could use a
> profile for local testing if absolutely necessary.
>
> Thanks,
> Dan
>
> On Thu, Jul 7, 2016 at 2:27 PM, Ismaël Mejía <ieme...@gmail.com> wrote:
>
> > Good discussion subject Amit,
> >
> > I'll let the whole Beam distribution subject continue in BEAM-320; however,
> > there is a not-yet-discussed aspect of the Spark runner: the Maven
> > behavior.
> >
> > When you import the Beam Spark runner as a dependency, you are obliged to
> > provide your Spark dependencies by hand too; with the other runners, once
> > you import the runner everything just works, e.g.
> > google-cloud-dataflow-runner and flink-runner. I understand the arguments
> > for the current setup (the ones you mention), but I think it is more
> > user-friendly to be consistent with the other runners and have something
> > that just works as the default (and solve the examples issue as a
> > consequence). Anyway, I think in the Spark case we need both: a
> > 'spark-included' flavor and the current one, which is really useful for
> > including the runner as a Spark library dependency (like Jesse did in his
> > video) or as a spark-package.
> >
> > Actually, both the all-included and the runner-only make sense for Flink
> > too, but this is a different discussion ;)
> >
> > What do you think about this? What do the others think?
> >
> > Ismaël
> >
> >
> > On Thu, Jul 7, 2016 at 10:19 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> > > No problem and good idea to discuss in the Jira.
> > >
> > > Actually, I started to experiment a bit with Beam distributions on a
> > > branch (which I can share with people interested).
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 07/07/2016 10:12 PM, Amit Sela wrote:
> > >
> > >> Thanks JB, I've missed that one.
> > >>
> > >> I suggest we continue this in the ticket comments.
> > >>
> > >> Thanks,
> > >> Amit
> > >>
> > >> On Thu, Jul 7, 2016 at 11:05 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> > >> wrote:
> > >>
> > >>> Hi Amit,
> > >>>
> > >>> I think your proposal is related to:
> > >>>
> > >>> https://issues.apache.org/jira/browse/BEAM-320
> > >>>
> > >>> As described in the Jira, what I'm planning to provide (in dedicated
> > >>> Maven modules) is a Beam distribution including:
> > >>> - an uber jar to wrap the dependencies
> > >>> - the underlying runtime backends
> > >>> - etc.
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>> On 07/07/2016 07:49 PM, Amit Sela wrote:
> > >>>
> > >>>> Hi everyone,
> > >>>>
> > >>>> Lately I've encountered a number of issues concerning the fact that
> > >>>> the Spark runner does not package Spark along with it, forcing people
> > >>>> to do this on their own.
> > >>>> In addition, this seems to get in the way of having beam-examples
> > >>>> executed against the Spark runner, again because it would have to add
> > >>>> Spark dependencies.
> > >>>>
> > >>>> When running on a cluster (which I guess was the original goal here),
> > >>>> it is recommended to have Spark provided by the cluster - this makes
> > >>>> sense for Spark clusters and more so for Spark + YARN clusters, where
> > >>>> you might have your Spark built against a specific Hadoop version or
> > >>>> using a vendor distribution.
> > >>>>
> > >>>> In order to make the runner more accessible to new adopters, I suggest
> > >>>> considering releasing a "spark-included" artifact as well.
> > >>>>
> > >>>> Thoughts?
> > >>>>
> > >>>> Thanks,
> > >>>> Amit
> > >>>>
> > >>>>
> > >>> --
> > >>> Jean-Baptiste Onofré
> > >>> jbono...@apache.org
> > >>> http://blog.nanthrax.net
> > >>> Talend - http://www.talend.com
> > >>>
> > >>>
> > >>
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
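[Editor's note: the two Maven setups debated in this thread can be sketched in a single application POM fragment. This is only an illustration; the artifact IDs and version properties (`${beam.version}`, `${spark.version}`, the Scala suffix `_2.10`) are assumptions from the Beam 0.x / Spark 1.x era and may not match the released coordinates exactly.]

```xml
<!-- Illustrative application POM fragment (coordinates are assumptions). -->
<dependencies>
  <!-- The Spark runner declares its Spark dependency with provided scope,
       so Spark is NOT pulled in transitively. On a cluster, spark-submit
       puts the pre-deployed Spark binaries on the classpath; locally, the
       app fails with ClassNotFoundException unless the user adds Spark. -->
  <dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-spark</artifactId>
    <version>${beam.version}</version>
  </dependency>
</dependencies>

<!-- Dan's suggestion: keep cluster execution as the default, and switch
     Spark to runtime scope for local runs via a profile, activated with
     e.g. `mvn exec:java -Plocal-spark ...`. -->
<profiles>
  <profile>
    <id>local-spark</id>
    <dependencies>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>${spark.version}</version>
        <scope>runtime</scope>
      </dependency>
    </dependencies>
  </profile>
</profiles>
```

With this layout, cluster users get the lean, provided-scope artifact by default, while first-time users can run locally by enabling the profile instead of hand-picking Spark dependencies.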