Re: shapeless in spark 2.1.0

2016-12-29 Thread Ryan Williams
The other option would presumably be for someone to make a release of breeze
with the old shapeless shaded... unless shapeless classes are exposed in
breeze's public API, in which case you'd have to copy the relevant
shapeless classes into breeze and then publish that?
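
For reference, the "shade it in your app" route would look roughly like the
following with sbt-assembly (a minimal sketch only; the relocated package
prefix is arbitrary and I haven't tested this against breeze specifically):

// build.sbt, assuming the sbt-assembly plugin is enabled
assemblyShadeRules in assembly := Seq(
  // relocate the shapeless your app compiles against, out of the way of the
  // 2.0.0 that spark-mllib -> breeze puts on the runtime classpath
  ShadeRule.rename("shapeless.**" -> "myapp.shaded.shapeless.@1").inAll
)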

On Thu, Dec 29, 2016, 1:05 PM Sean Owen  wrote:

> It is breeze, but, what's the option? It can't be excluded. I think this
> falls in the category of things an app would need to shade in this
> situation.
>
> On Thu, Dec 29, 2016, 16:49 Koert Kuipers  wrote:
>
> i just noticed that spark 2.1.0 brings in a new transitive dependency on
> shapeless 2.0.0
>
> shapeless is a popular library for scala users, and shapeless 2.0.0 is old
> (2014) and not compatible with more current versions.
>
> so this means a spark user that uses shapeless in his own development
> cannot upgrade safely from 2.0.0 to 2.1.0, i think.
>
> wish i had noticed this sooner
>
>


Re: shapeless in spark 2.1.0

2016-12-29 Thread Ryan Williams
`mvn dependency:tree -Dverbose -Dincludes=:shapeless_2.11` shows:

[INFO] \- org.apache.spark:spark-mllib_2.11:jar:2.1.0:provided
[INFO]    \- org.scalanlp:breeze_2.11:jar:0.12:provided
[INFO]       \- com.chuusai:shapeless_2.11:jar:2.0.0:provided

On Thu, Dec 29, 2016 at 12:11 PM Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:

> Which dependency pulls in shapeless?
>
> On Thu, Dec 29, 2016 at 5:49 PM, Koert Kuipers  wrote:
>
> i just noticed that spark 2.1.0 brings in a new transitive dependency on
> shapeless 2.0.0
>
> shapeless is a popular library for scala users, and shapeless 2.0.0 is old
> (2014) and not compatible with more current versions.
>
> so this means a spark user that uses shapeless in his own development
> cannot upgrade safely from 2.0.0 to 2.1.0, i think.
>
> wish i had noticed this sooner
>
>
>
>
> --
>
> Herman van Hövell
>
> Software Engineer
>
> Databricks Inc.
>
> hvanhov...@databricks.com
>
> +31 6 420 590 27
>
> databricks.com
>
>


Re: spark-core "compile"-scope transitive-dependency on scalatest

2016-12-15 Thread Ryan Williams
ah I see this thread
<http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-has-a-compile-dependency-on-scalatest-td19639.html>
now, thanks; interestingly I don't think the solution I've proposed here
(splitting spark-tags' test-bits into a "-tests" JAR and having spark-core
"test"-depend on that) is discussed there.

thanks for re-opening the JIRA; I can't promise a PR for it atm but I will
think about it :)

On Thu, Dec 15, 2016 at 7:41 PM Marcelo Vanzin <van...@cloudera.com> wrote:

> You're right; we had a discussion here recently about this.
>
> I'll re-open that bug, if you want to send a PR. (I think it's just a
> matter of making the scalatest dependency "provided" in spark-tags, if
> I remember the discussion.)
>
> On Thu, Dec 15, 2016 at 4:15 PM, Ryan Williams
> <ryan.blake.willi...@gmail.com> wrote:
> > spark-core depends on spark-tags (compile scope) which depends on
> scalatest
> > (compile scope), so spark-core leaks test-deps into downstream libraries'
> > "compile"-scope classpath.
> >
> > The cause is that spark-core has logical "test->test" and
> "compile->compile"
> > dependencies on spark-tags, but spark-tags publishes both its
> test-oriented
> > and non-test-oriented bits in its default ("compile") artifact.
> >
> > spark-tags' test-bits should be in a "-tests"-JAR that spark-core can
> > "test"-scope depend on (in addition to "compile"-scope depending on
> > spark-tags as it does today).
> >
> > SPARK-17807 was "Not a Problem"d but I don't think that's the right
> outcome;
> > spark-core should not be leaking test-deps into downstream libraries'
> > classpaths when depended on in "compile" scope.
> >
>
>
>
> --
> Marcelo
>


spark-core "compile"-scope transitive-dependency on scalatest

2016-12-15 Thread Ryan Williams
spark-core depends on spark-tags (compile scope) which depends on scalatest
(compile scope), so spark-core leaks test-deps into downstream libraries'
"compile"-scope classpath.

The cause is that spark-core has logical "test->test" and
"compile->compile" dependencies on spark-tags, but spark-tags publishes
both its test-oriented and non-test-oriented bits in its default
("compile") artifact.

spark-tags' test-bits should be in a "-tests"-JAR that spark-core can
"test"-scope depend on (in addition to "compile"-scope depending on
spark-tags as it does today).

SPARK-17807  was "Not a
Problem"d but I don't think that's the right outcome; spark-core should not
be leaking test-deps into downstream libraries' classpaths when depended on
in "compile" scope.


Re: Compatibility of 1.6 spark.eventLog with a 2.0 History Server

2016-09-15 Thread Ryan Williams
What is meant by:

"""
(This is because clicking the refresh button in browser, updates the UI
with latest events, where-as in the 1.6 code base, this does not happen)
"""

Hasn't refreshing the page updated all the information in the UI through
the 1.x line?


Re: Setting YARN executors' JAVA_HOME

2016-08-18 Thread Ryan Williams
Ah, I guess I missed that by only looking in the YARN config docs, but this
is a more general parameter and not documented there. Thanks!
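
For anyone who lands on this thread later, the combination that covers both
the AM and the executors is roughly the following (a sketch; the Java path is
illustrative):

// spark.executorEnv.* sets environment variables in executor containers;
// spark.yarn.appMasterEnv.* does the same for the YARN application master
val conf = new org.apache.spark.SparkConf()
  .set("spark.executorEnv.JAVA_HOME", "/path/to/java8")
  .set("spark.yarn.appMasterEnv.JAVA_HOME", "/path/to/java8")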

On Thu, Aug 18, 2016 at 2:51 PM dhruve ashar <dhruveas...@gmail.com> wrote:

> Hi Ryan,
>
> You can get more info on this here:  Spark documentation
> <http://spark.apache.org/docs/latest/configuration.html>.
>
> The page addresses what you need. You can look for 
> spark.executorEnv.[EnvironmentVariableName]
> and set your java home as
> spark.executorEnv.JAVA_HOME=
>
> Regards,
> Dhruve
>
> On Thu, Aug 18, 2016 at 12:49 PM, Ryan Williams <
> ryan.blake.willi...@gmail.com> wrote:
>
>> I need to tell YARN a JAVA_HOME to use when spawning containers (to run a
>> Java 8 app on Java 7 YARN).
>>
>> The only way I've found that works is
>> setting SPARK_YARN_USER_ENV="JAVA_HOME=/path/to/java8".
>>
>> The code
>> <https://github.com/apache/spark/blob/b72bb62d421840f82d663c6b8e3922bd14383fbb/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L762>
>> implies that this is deprecated and users should use "the config", but I
>> can't figure out what config is being referenced.
>>
>> Passing "--conf spark.yarn.appMasterEnv.JAVA_HOME=/path/to/java8" seems
>> to set it for the AM but not for executors.
>>
>> Likewise, spark.executor.extraLibraryPath and
>> spark.driver.extraLibraryPath don't appear to set JAVA_HOME (and maybe
>> aren't even supposed to?).
>>
>> The 1.0.1 docs
>> <https://spark.apache.org/docs/1.0.1/running-on-yarn.html#environment-variables>
>>  are the last ones to reference the SPARK_YARN_USER_ENV var, afaict.
>>
>> What's the preferred way of passing YARN a custom JAVA_HOME that will be
>> applied to executors' containers?
>>
>> Thanks!
>>
>
>
>
> --
> -Dhruve Ashar
>
>


Setting YARN executors' JAVA_HOME

2016-08-18 Thread Ryan Williams
I need to tell YARN a JAVA_HOME to use when spawning containers (to run a
Java 8 app on Java 7 YARN).

The only way I've found that works is
setting SPARK_YARN_USER_ENV="JAVA_HOME=/path/to/java8".

The code implies that this is deprecated and users should use "the config", but I
can't figure out what config is being referenced.

Passing "--conf spark.yarn.appMasterEnv.JAVA_HOME=/path/to/java8" seems to
set it for the AM but not for executors.

Likewise, spark.executor.extraLibraryPath and spark.driver.extraLibraryPath
don't appear to set JAVA_HOME (and maybe aren't even supposed to?).

The 1.0.1 docs are the last ones to reference the SPARK_YARN_USER_ENV var, afaict.

What's the preferred way of passing YARN a custom JAVA_HOME that will be
applied to executors' containers?

Thanks!


Re: Latency due to driver fetching sizes of output statuses

2016-01-23 Thread Ryan Williams
yea, they're all skipped, here's a gif
<http://f.cl.ly/items/413l3k363u290U173W00/Screen%20Recording%202016-01-23%20at%2005.08%20PM.gif>
scrolling through the DAG viz.

Thanks for the JIRA pointer, I'll keep an eye on that one!

On Sat, Jan 23, 2016 at 4:53 PM Mark Hamstra <m...@clearstorydata.com>
wrote:

> Do all of those thousands of Stages end up being actual Stages that need
> to be computed, or are the vast majority of them eventually "skipped"
> Stages?  If the latter, then there is the potential to modify the
> DAGScheduler to avoid much of this behavior:
> https://issues.apache.org/jira/browse/SPARK-10193
> https://github.com/apache/spark/pull/8427
>
> On Sat, Jan 23, 2016 at 1:40 PM, Ryan Williams <
> ryan.blake.willi...@gmail.com> wrote:
>
>> I have a recursive algorithm that performs a few jobs on successively
>> smaller RDDs, and then a few more jobs on successively larger RDDs as the
>> recursion unwinds, resulting in a somewhat deeply-nested (a few dozen
>> levels) RDD lineage.
>>
>> I am observing significant delays starting jobs while the
>> MapOutputTrackerMaster calculates the sizes of the output statuses for all
>> previous shuffles. By the end of my algorithm's execution, the driver
>> spends about a minute doing this before each job, during which time my
>> entire cluster is sitting idle. This output-status info is the same every
>> time it computes it, no executors have joined or left the cluster.
>>
>> In this gist
>> <https://gist.github.com/ryan-williams/445ef8736a688bd78edb#file-job-108>
>> you can see two jobs stalling for almost a minute each between "Starting
>> job:" and "Got job"; with larger input datasets my RDD lineages and this
>> latency would presumably only grow.
>>
>> Additionally, the "DAG Visualization" on the job page of the web UI shows
>> a huge horizontal-scrolling lineage of thousands of stages, indicating that
>> the driver is tracking far more information than would seem necessary.
>>
>> I'm assuming the short answer is that I need to truncate RDDs' lineage,
>> and the only way to do that is by checkpointing them to disk. I've done
>> that and it avoids this issue, but means that I am now serializing my
>> entire dataset to disk dozens of times during the course of execution,
>> which feels unnecessary/wasteful.
>>
>> Is there a better way to deal with this scenario?
>>
>> Thanks,
>>
>> -Ryan
>>
>
>


Latency due to driver fetching sizes of output statuses

2016-01-23 Thread Ryan Williams
I have a recursive algorithm that performs a few jobs on successively
smaller RDDs, and then a few more jobs on successively larger RDDs as the
recursion unwinds, resulting in a somewhat deeply-nested (a few dozen
levels) RDD lineage.

I am observing significant delays starting jobs while the
MapOutputTrackerMaster calculates the sizes of the output statuses for all
previous shuffles. By the end of my algorithm's execution, the driver
spends about a minute doing this before each job, during which time my
entire cluster is sitting idle. This output-status info is the same every
time it is computed; no executors have joined or left the cluster.

In this gist
<https://gist.github.com/ryan-williams/445ef8736a688bd78edb#file-job-108>
you can see two jobs stalling for almost a minute each between "Starting
job:" and "Got job"; with larger input datasets my RDD lineages and this
latency would presumably only grow.

Additionally, the "DAG Visualization" on the job page of the web UI shows a
huge horizontal-scrolling lineage of thousands of stages, indicating that
the driver is tracking far more information than would seem necessary.

I'm assuming the short answer is that I need to truncate RDDs' lineage, and
the only way to do that is by checkpointing them to disk. I've done that
and it avoids this issue, but means that I am now serializing my entire
dataset to disk dozens of times during the course of execution, which feels
unnecessary/wasteful.
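
For the archive, a minimal sketch of that checkpointing workaround (the toy
lineage, paths, and checkpoint cadence here are illustrative, not my actual
algorithm):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]") // local just for illustration
val sc = new SparkContext(conf)
sc.setCheckpointDir("/tmp/spark-checkpoints") // on a cluster this would be an HDFS path

var rdd = sc.parallelize(1 to 1000000)
for (depth <- 1 to 30) {
  rdd = rdd.map(_ + 1)     // stand-in for one level of the recursive computation
  if (depth % 10 == 0) {   // periodically cap the lineage depth
    rdd.cache()            // avoid recomputing the RDD when the checkpoint is written
    rdd.checkpoint()       // lineage is dropped once the RDD is materialized
    rdd.count()            // force materialization now
  }
}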

Is there a better way to deal with this scenario?

Thanks,

-Ryan


Re: Live UI

2015-10-12 Thread Ryan Williams
Yea, definitely check out Spree! It
functions as a "live" UI, history server, and archival storage of event-log
data.

There are pros and cons to building something like it in Spark trunk (and
running it in the Spark driver, presumably) that I've spent a lot of time
thinking about and am happy to talk through (here, offline, or in the Spree
gitter room) if you want to go that
route.


On Mon, Oct 12, 2015 at 5:36 PM Jakob Odersky  wrote:

> Hi everyone,
> I am just getting started working on spark and was thinking of a first way
> to contribute whilst still trying to wrap my head around the codebase.
>
> Exploring the web UI, I noticed it is a classic request-response website,
> requiring manual refresh to get the latest data.
> I think it would be great to have a "live" website where data would be
> displayed real-time without the need to hit the refresh button. I would be
> very interested in contributing this feature if it is acceptable.
>
> Specifically, I was thinking of using websockets with a ScalaJS front-end.
> Please let me know if this design would be welcome or if it introduces
> unwanted dependencies, I'll be happy to discuss this further in detail.
>
> thanks for your feedback,
> --Jakob
>


Re: An alternate UI for Spark.

2015-09-14 Thread Ryan Williams
You can check out Spree for one data
point about how this can be done; it is a near-clone of the Spark web UI
that updates in real time.

It uses JsonRelay, a
SparkListener that sends events as JSON over the network; a server called
slim receives those events, aggregates stats similar to the
JobProgressListener, and writes them to Mongo; Spree then
uses Meteor to display a real-time web UI based on the data in Mongo.
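
For concreteness, wiring a listener like that into an app looks roughly like
this (a sketch; the class name is the one the JsonRelay repo uses, since it
declares itself in org.apache.spark, so double-check it against the repo):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("listener-demo")
  .set("spark.extraListeners", "org.apache.spark.JsonRelay") // registered by class name at startup
val sc = new SparkContext(conf)
// ...run jobs as usual; the listener receives SparkListenerEvents as they happen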

On Mon, Sep 14, 2015 at 2:18 AM Prashant Sharma 
wrote:

> Hi all,
>
> TLDR;
> Some of my colleagues at Imaginea are interested in building an alternate
> UI for Spark. Basically allow people or groups to build an alternate UI for
> Spark.
>
> More Details:
> Looking at feasibility, it feels definitely possible to do. But we need a
> consensus on a public(can be experimental initially ) interface which would
> give access to UI in core. Given this is done, their job will be easy.
>
> In fact, it opens up a lot of possibilities for an alternate UI for Apache
> Spark. Also considering a pluggable UI - where an alternate UI can just be a
> plugin. Of course, implementing the latter can be a long-term goal. Elasticsearch
> is a good example of the latter approach.
>
> My knowledge on this is certainly limited. Comments and criticism
> appreciated.
>
> Thanks,
> Prashant
>


Re: [ANNOUNCE] Announcing Spark 1.5.0

2015-09-11 Thread Ryan Williams
Any idea why 1.5.0 is not in Maven Central yet? Is
that a separate release process?

On Wed, Sep 9, 2015 at 12:40 PM andy petrella 
wrote:

> You can try it out really quickly by "building" a Spark Notebook from
> http://spark-notebook.io/.
>
> Just choose the master branch and 1.5.0, a correct hadoop version (default
> to 2.2.0 though) and there you go :-)
>
>
> On Wed, Sep 9, 2015 at 6:39 PM Ted Yu  wrote:
>
>> Jerry:
>> I just tried building hbase-spark module with 1.5.0 and I see:
>>
>> ls -l ~/.m2/repository/org/apache/spark/spark-core_2.10/1.5.0
>> total 21712
>> -rw-r--r--  1 tyu  staff       196 Sep  9 09:37 _maven.repositories
>> -rw-r--r--  1 tyu  staff  11081542 Sep  9 09:37 spark-core_2.10-1.5.0.jar
>> -rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.jar.sha1
>> -rw-r--r--  1 tyu  staff     19816 Sep  9 09:37 spark-core_2.10-1.5.0.pom
>> -rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.pom.sha1
>>
>> FYI
>>
>> On Wed, Sep 9, 2015 at 9:35 AM, Jerry Lam  wrote:
>>
>>> Hi Spark Developers,
>>>
>>> I'm eager to try it out! However, I got problems in resolving
>>> dependencies:
>>> [warn] [NOT FOUND  ]
>>> org.apache.spark#spark-core_2.10;1.5.0!spark-core_2.10.jar (0ms)
>>> [warn]  jcenter: tried
>>>
>>> When the package will be available?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>>
>>> On Wed, Sep 9, 2015 at 9:30 AM, Dimitris Kouzis - Loukas <
>>> look...@gmail.com> wrote:
>>>
 Yeii!

 On Wed, Sep 9, 2015 at 2:25 PM, Yu Ishikawa <
 yuu.ishikawa+sp...@gmail.com> wrote:

> Great work, everyone!
>
>
>
> -
> -- Yu Ishikawa
> --
>

>>>
>> --
> andy
>


Spree: Live-updating web UI for Spark

2015-07-27 Thread Ryan Williams
Hi dev@spark, I wanted to quickly ping about Spree
http://www.hammerlab.org/2015/07/25/spree-58-a-live-updating-web-ui-for-spark/,
a live-updating web UI for Spark that I released on Friday (along with some
supporting infrastructure), and mention a couple things that came up while
I worked on it that are relevant to this list.

This blog post
http://www.hammerlab.org/2015/07/25/spree-58-a-live-updating-web-ui-for-spark/
and github https://github.com/hammerlab/spree/ have lots of info about
functionality, implementation details, and installation instructions, but
the tl;dr is:

   - You register a SparkListener called JsonRelay
     https://github.com/hammerlab/spark-json-relay via the
     spark.extraListeners conf (thanks @JoshRosen!).
   - That listener ships SparkListenerEvents to a server called slim
     https://github.com/hammerlab/slim that stores them in Mongo.
     - Really what it stores are a bunch of stats similar to those
       maintained by JobProgressListener.
   - A Meteor https://www.meteor.com/ app displays live-updating views
     of what’s in Mongo.

Feel free to read about it / try it! but the rest of this email is just
questions about Spark APIs and plans.
JsonProtocol scoping

The most awkward thing about Spree is that JsonRelay declares itself to be
in org.apache.spark
https://github.com/hammerlab/spark-json-relay/blob/1.0.0/src/main/scala/org/apache/spark/JsonRelay.scala#L1
so that it can use JsonProtocol.

Will JsonProtocol be private[spark] forever, on purpose, or is it just not
considered stable enough yet, so you want to discourage direct use? I’m
relatively impartial at this point since I’ve done the hacky thing and it
works for my purposes, but thought I’d ask in case there are interesting
perspectives on the ideal scope for it going forward.
@DeveloperApi trait SparkListener

Another set of tea leaves I wasn’t sure how to read was the @DeveloperApi-ness
of SparkListener
https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L131-L132.
I assumed I was doing something frowny by having JsonRelay implement the
SparkListener interface. However, I just noticed that I’m actually
extending SparkFirehoseListener
https://github.com/apache/spark/blob/v1.4.1/core/src/main/java/org/apache/spark/SparkFirehoseListener.java,
which is *not* @DeveloperApi afaict, so maybe I’m ok there after all?

Are there other SparkListener implementations of note in the wild (seems
like “no”)? Is that an API that people can and should use externally (seems
like “yes” to me)? I saw @vanzin recently imply on this list that the
answers may be “no” and “no”
http://apache-spark-developers-list.1001551.n3.nabble.com/Slight-API-incompatibility-caused-by-SPARK-4072-tp13257.html
.
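
For concreteness, the shape of such a listener is roughly the following (an
illustrative sketch, not JsonRelay's actual code):

import org.apache.spark.SparkFirehoseListener
import org.apache.spark.scheduler.SparkListenerEvent

// SparkFirehoseListener funnels every event type into a single callback,
// which is handy for relay-style listeners like this one
class ForwardingListener extends SparkFirehoseListener {
  override def onEvent(event: SparkListenerEvent): Unit = {
    // e.g. serialize the event and ship it over the network
    println(s"saw ${event.getClass.getSimpleName}")
  }
}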
Augmenting JsonProtocol

JsonRelay does two things that JsonProtocol does not:

   - adds an appId field to all events; this makes it possible/easy for
   downstream things (slim, in this case) to handle information about
   multiple Spark applications.
   - JSON-serializes SparkListenerExecutorMetricsUpdate events. This was
   added to JsonProtocol in SPARK-9036
   https://issues.apache.org/jira/browse/SPARK-9036 (though it’s unused
   in the Spark repo currently), but I’ll have to leave my version in as long
   as I want to support Spark <= 1.4.1.
  - From one perspective, JobProgressListener was sort of “cheating” by
  using these events that were previously not accessible via
  JsonProtocol.

It seems like making an effort to let external tools get the same kinds of
data as the internal listeners is a good principle to try to maintain,
which is also relevant to the scoping questions about JsonProtocol above.

Should JsonProtocol add appIds to all events itself? Should Spark make it
easier for downstream things to process events from multiple Spark
applications? JsonRelay currently pulls the app ID out of the SparkConf
that it is instantiated with
https://github.com/hammerlab/spark-json-relay/blob/1.0.0/src/main/scala/org/apache/spark/JsonRelay.scala#L16;
it works, but also feels hacky and like maybe I’m doing things I’m not
supposed to.
Thrift SparkListenerEvent Implementation?

A few months ago I built a first version of this project involving a
SparkListener called Spear https://github.com/hammerlab/spear that
aggregated stats from SparkListenerEvents *and* wrote those stats to Mongo,
combining JsonRelay and slim from above.

Spear used a couple of libraries (Rogue
https://github.com/foursquare/rogue and Spindle
https://github.com/foursquare/spindle) to define schemas in thrift,
generate Scala for those classes, and do all the Mongo querying in a nice,
type-safe way.

Unfortunately for me, all of the Mongo queries were synchronous in that
implementation, which led to events being dropped
https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L40
when I tested it on large jobs (thanks a lot to @squito for helping 

Re: Resource usage of a spark application

2015-05-21 Thread Ryan Williams
On Thu, May 21, 2015 at 5:22 AM Peter Prettenhofer 
peter.prettenho...@gmail.com wrote:

 Thanks Akhil, Ryan!

 @Akhil: YARN can only tell me how much vcores my app has been granted but
 not actual cpu usage, right? Pulling mem/cpu usage from the OS means i need
 to map JVM executor processes to the context they belong to, right?

 @Ryan: what a great blog post -- this is super relevant for me to analyze
 the state of the cluster as a whole. However, it seems to me that those
 metrics are mostly reported globally and not per spark application.


Thanks! You can definitely analyze metrics per-application in several ways:

   - If you're running Spark on YARN, use the app URL param
   https://github.com/hammerlab/grafana-spark-dashboards#appyarn-app-id
   to specify a YARN application ID, which will set the Spark application ID
   as well as parse job start/end times.
   - Set the prefix URL param
   https://github.com/hammerlab/grafana-spark-dashboards#prefixmetric-prefix
   to your Spark app's ID, and all metrics will be namespaced to that app ID.
  - You actually have to do one of these two, otherwise it doesn't know
  what app's metrics to look for; it is set up specifically to view per-app
  metrics.
   - There is a dropdown in the upper-left of the page (sorry, don't have a
   screenshot right now) that will let you select from all app IDs that
   graphite has seen metrics from.

Let me know, here or in issues on the repo, if you have any issues with
that or that doesn't make sense!



 2015-05-19 21:43 GMT+02:00 Ryan Williams ryan.blake.willi...@gmail.com:

 Hi Peter, a few months ago I was using MetricsSystem to export to
 Graphite and then view in Grafana; relevant scripts and some
 instructions are here
 https://github.com/hammerlab/grafana-spark-dashboards/ if you want to
 take a look.


 On Sun, May 17, 2015 at 8:48 AM Peter Prettenhofer 
 peter.prettenho...@gmail.com wrote:

 Hi all,

 I'm looking for a way to measure the current memory / cpu usage of a
 spark application to provide users feedback how much resources are actually
 being used.
 It seems that the metric system provides this information to some
 extent. It logs metrics on application level (nr of cores granted) and on
 the JVM level (memory usage).
 Is this the recommended way to gather this kind of information? If so,
 how do i best map a spark application to the corresponding JVM processes?

 If not, should i rather request this information from the resource
 manager (e.g. Mesos/YARN)?

 thanks,
  Peter

 --
 Peter Prettenhofer




 --
 Peter Prettenhofer



Re: Resource usage of a spark application

2015-05-19 Thread Ryan Williams
Hi Peter, a few months ago I was using MetricsSystem to export to Graphite
and then view in Grafana; relevant scripts and some instructions are here
https://github.com/hammerlab/grafana-spark-dashboards/ if you want to
take a look.

On Sun, May 17, 2015 at 8:48 AM Peter Prettenhofer 
peter.prettenho...@gmail.com wrote:

 Hi all,

 I'm looking for a way to measure the current memory / cpu usage of a spark
 application to provide users feedback how much resources are actually being
 used.
 It seems that the metric system provides this information to some extent.
 It logs metrics on application level (nr of cores granted) and on the JVM
 level (memory usage).
 Is this the recommended way to gather this kind of information? If so, how
 do i best map a spark application to the corresponding JVM processes?

 If not, should i rather request this information from the resource manager
 (e.g. Mesos/YARN)?

 thanks,
  Peter

 --
 Peter Prettenhofer



Monitoring Spark with Graphite and Grafana

2015-02-26 Thread Ryan Williams
If anyone is curious to try exporting Spark metrics to Graphite, I just
published a post about my experience doing that, building dashboards in
Grafana http://grafana.org/, and using them to monitor Spark jobs:
http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/

Code for generating Grafana dashboards tailored to the metrics emitted by
Spark is here: https://github.com/hammerlab/grafana-spark-dashboards.
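
If anyone wants to try it, a minimal conf/metrics.properties for the
GraphiteSink looks roughly like this (host, port, and prefix are
illustrative; period/unit control how often metrics are pushed):

# conf/metrics.properties (values illustrative)
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark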

If anyone else is interested in working on expanding MetricsSystem to make
this sort of thing more useful, let me know; I've been working on it a fair
amount and have a bunch of ideas about where it should go.

Thanks,

-Ryan


Re: Building Spark with Pants

2015-02-16 Thread Ryan Williams
I worked on Pants at Foursquare for a while and when coming up to speed on
Spark was interested in the possibility of building it with Pants,
particularly because allowing developers to share/reuse each others'
compilation artifacts seems like it would be a boon to productivity; that
was/is Pants' killer feature for Foursquare, as mentioned on the
pants-devel thread.

Given the monumental nature of the task of making Spark build with Pants,
most of my enthusiasm was deflected to SPARK-1517
https://issues.apache.org/jira/browse/SPARK-1517, which deals with
publishing nightly builds (or better, exposing all assembly JARs built by
Jenkins?) that people could use rather than having to assemble their own.

Anyway, it's an intriguing idea, Nicholas, I'm glad you are pursuing it!

On Sat Feb 14 2015 at 4:21:16 AM Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 FYI: Here is the matching discussion over on the Pants dev list.
 https://groups.google.com/forum/#!topic/pants-devel/rTaU-iIOIFE

 On Mon Feb 02 2015 at 4:50:33 PM Nicholas Chammas
 nicholas.cham...@gmail.com
 http://mailto:nicholas.cham...@gmail.com wrote:

 To reiterate, I'm asking from an experimental perspective. I'm not
  proposing we change Spark to build with Pants or anything like that.
 
  I'm interested in trying Pants out and I'm wondering if anyone else
 shares
  my interest or already has experience with Pants that they can share.
 
  On Mon Feb 02 2015 at 4:40:45 PM Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  I'm asking from an experimental standpoint; this is not happening
 anytime
  soon.
 
  Of course, if the experiment turns out very well, Pants would replace
  both sbt and Maven (like it has at Twitter, for example). Pants also
 works
  with IDEs http://pantsbuild.github.io/index.html#using-pants-with.
 
  On Mon Feb 02 2015 at 4:33:11 PM Stephen Boesch java...@gmail.com
  wrote:
 
  There is a significant investment in sbt and maven - and they are not
 at
  all likely to be going away. A third build tool?  Note that there is
 also
  the perspective of building within an IDE - which actually works
 presently
  for sbt and with a little bit of tweaking with maven as well.
 
  2015-02-02 16:25 GMT-08:00 Nicholas Chammas 
 nicholas.cham...@gmail.com
  :
 
  Does anyone here have experience with Pants
 
  http://pantsbuild.github.io/index.html or interest in trying to
 build
 
 
  Spark with it?
 
  Pants has an interesting story. It was born at Twitter to help them
  build
  their Scala, Java, and Python projects as several independent
  components in
  one monolithic repo. (It was inspired by a similar build tool at
 Google
  called blaze.) The mix of languages and sub-projects at Twitter seems
  similar to the breakdown we have in Spark.
 
  Pants has an interesting take on how a build system should work, and
  Twitter and Foursquare (who use Pants as their primary build tool)
  claim it
  helps enforce better build hygiene and maintainability.
 
  Some relevant talks:
 
 - Building Scala Hygienically with Pants
 https://www.youtube.com/watch?v=ukqke8iTuH0
 - The Pants Build Tool at Twitter
 https://engineering.twitter.com/university/videos/the-pant
  s-build-tool-at-twitter
 - Getting Started with the Pants Build System: Why Pants?
 https://engineering.twitter.com/university/videos/getting-
  started-with-the-pants-build-system-why-pants
 
 
 
  At some point I may take a shot at converting Spark to use Pants as an
  experiment and just see what it’s like.
 
  Nick
  ​
 
  ​



Present/Future of monitoring spark jobs, MetricsSystem vs. Web UI, etc.

2015-01-09 Thread Ryan Williams
I've long wished the web UI gave me a better sense of how the metrics it
reports are changing over time, so I was intrigued to stumble across the
MetricsSystem
https://github.com/apache/spark/blob/b6aa557300275b835cce7baa7bc8a80eb5425cbb/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala
infrastructure the other day.

I've set up a very basic Graphite instance and had dummy Spark jobs report
to it, but that process was a little bumpy (and the docs sparse
https://spark.apache.org/docs/latest/monitoring.html#metrics) so I wanted
to come up for air and ask a few questions about the present/future plans
for monitoring Spark jobs.

In rough order of increasing scope:

   - Do most people monitor their Spark jobs in realtime by repeatedly
   refreshing the web UI (cf. SPARK-5106
   https://issues.apache.org/jira/browse/SPARK-5106), or is there a
   better way?
   - Does anyone use or rely on the GraphiteSink? Quick googling turned up
   no evidence of anyone using it.
  - Likewise the other Sinks? GangliaSink?
   - Do people have custom Sink subclasses and dashboards that they've
   built to monitor Spark jobs, as was suggested by the appearance of a
   mysterious Ooyala DatadogSink gist
   
https://gist.github.com/ibuenros/9b94736c2bad2f4b8e23#file-sparkutils-scala-L336
   in the recent thread on this list about custom metrics
   
http://apache-spark-developers-list.1001551.n3.nabble.com/Registering-custom-metrics-tp9030p10041.html
   ?
   - What is the longer-term plan for how people should monitor / diagnose
   problems at runtime?
  - Will the official Spark web UI remain the main way that the average
  user will monitor their jobs?
  - Or, will SPARK-3644
  https://issues.apache.org/jira/browse/SPARK-3644 usher in an era of
  many external implementations of Spark web UIs, so that the average user
  will take one of those off the shelf that they like best (because its
  graphs are prettier or it emphasizes / pivots around certain metrics that
  others do not)?
  - Is the MetricsSystem infrastructure redundant with the REST API
  discussed in SPARK-3644
  https://issues.apache.org/jira/browse/SPARK-3644?
 - Would more robust versions of each start to be redundant in the
 future?
 - I feel like the answers are somewhat yes and yes, and would
 like to hear other perspectives.

Basically, I want to live in a world where:

   - I can see all of the stats currently exposed on the Web UI,
   - as well as others that aren't there yet,
     - number of records assigned to each task,
     - number of records completed by each task in realtime,
     - gc stats in realtime,
     - # of spill events,
     - size of spill events,
   - and all kinds of derivatives of the above,
     - latencies/histograms for everything
       - records per second per task,
       - records per second per executor,
       - top N slowest/worst of any metric,
       - avg spill size,
       - etc.
     - over time,
   - at scale https://issues.apache.org/jira/browse/SPARK-2017


Are we going to get to this world by improving the web UI that ships with
Spark? I am pessimistic about that approach:

   - It may be impossible to do in a way that satisfies all stakeholders'
   aesthetic sensibilities and preferences for what stats/views are important.
   - It would be a monumental undertaking relative to the amount of
   attention that seems to have been directed at improving the web UI in the
   last few quarters.

OTOH, if the space of derivative stats and slices thereof that we want to
support is as complex as the outline I gave above suggests it might be,
then Graphite (or some equivalent) could be well suited to the task.
However, this is at odds with the relative obscurity that the MetricsSystem
seems to reside in and my impression that it is not something that core
developers think about or are focused on.

Finally, while the existence of SPARK-3644 (and Josh et al's great work on
it thus far) implies that the REST API / let 1000 [web UIs] bloom vision
is at least nominally being pursued, it seems like it's still a long way
from fostering a world where my dream use-cases above are realized, and
it's not clear from the outside whether fulfilling that vision is a
priority.

So I'm interested to hear peoples' thoughts on the above questions and what
the plan is / should be going forward. Having learned a lot about how Spark
works, the process of figuring out Why My Spark Jobs Are Failing still
feels daunting (at best) using the tools I've come across; we need to do a
better job of empowering people to figure these things out.


Re: zinc invocation examples

2014-12-05 Thread Ryan Williams
fwiw I've been using `zinc -scala-home $SCALA_HOME -nailed -start` which:

- starts a nailgun server as well,
- uses my installed scala 2.{10,11}, as opposed to zinc's default 2.9.2
https://github.com/typesafehub/zinc#scala: If no options are passed to
locate a version of Scala then Scala 2.9.2 is used by default (which is
bundled with zinc).

The latter seems like it might be especially important.


On Thu Dec 04 2014 at 4:25:32 PM Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Oh, derp. I just assumed from looking at all the options that there was
 something to it. Thanks Sean.

 On Thu Dec 04 2014 at 7:47:33 AM Sean Owen so...@cloudera.com wrote:

  You just run it once with zinc -start and leave it running as a
  background process on your build machine. You don't have to do
  anything for each build.
 
  On Wed, Dec 3, 2014 at 3:44 PM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
   https://github.com/apache/spark/blob/master/docs/
  building-spark.md#speeding-up-compilation-with-zinc
  
   Could someone summarize how they invoke zinc as part of a regular
   build-test-etc. cycle?
  
   I'll add it in to the aforelinked page if appropriate.
  
   Nick
 



Re: Spurious test failures, testing best practices

2014-12-04 Thread Ryan Williams
Thanks Marcelo, "this is just how Maven works (unfortunately)" answers my
question.

Another related question: I tried to use `mvn scala:cc` and discovered that
it only seems to scan the src/main and src/test directories (according to its
docs http://scala-tools.org/mvnsites/maven-scala-plugin/usage_cc.html),
and so can only be run from within submodules, not from the root directory.

I'll add a note about this to building-spark.html unless there is a way to
do it for all modules / from the root directory that I've missed. Let me
know!




On Tue Dec 02 2014 at 5:49:58 PM Marcelo Vanzin van...@cloudera.com wrote:

 On Tue, Dec 2, 2014 at 4:40 PM, Ryan Williams
 ryan.blake.willi...@gmail.com wrote:
  But you only need to compile the others once.
 
  once... every time I rebase off master, or am obliged to `mvn clean` by
 some
  other build-correctness bug, as I said before. In my experience this
 works
  out to a few times per week.

 No, you only need to do it something upstream from core changed (i.e.,
 spark-parent, network/common or network/shuffle) in an incompatible
 way. Otherwise, you can rebase and just recompile / retest core,
 without having to install everything else. I do this kind of thing all
 the time. If you have to do mvn clean often you're probably doing
 something wrong somewhere else.

 I understand where you're coming from, but the way you're thinking is
 just not how maven works. I too find annoying that maven requires lots
 of things to be installed before you can use them, when they're all
 part of the same project. But well, that's the way things are.

 --
 Marcelo



Re: Spurious test failures, testing best practices

2014-12-02 Thread Ryan Williams
Following on Mark's Maven examples, here is another related issue I'm
having:

I'd like to compile just the `core` module after a `mvn clean`, without
building an assembly JAR first. Is this possible?

Attempting to do it myself, the steps I performed were:

- `mvn compile -pl core`: fails because `core` depends on `network/common`
and `network/shuffle`, neither of which is installed in my local maven
cache (and which don't exist in central Maven repositories, I guess? I
thought Spark is publishing snapshot releases?)

- `network/shuffle` also depends on `network/common`, so I'll `mvn install`
the latter first: `mvn install -DskipTests -pl network/common`. That
succeeds, and I see a newly built 1.3.0-SNAPSHOT jar in my local maven
repository.

- However, `mvn install -DskipTests -pl network/shuffle` subsequently
fails, seemingly due to not finding network/common. Here's
https://gist.github.com/ryan-williams/1711189e7d0af558738d a sample full
output from running `mvn install -X -U -DskipTests -pl network/shuffle`
from such a state (the -U was to get around a previous failure based on
having cached a failed lookup of network-common-1.3.0-SNAPSHOT).

- Thinking maven might be special-casing -SNAPSHOT versions, I tried
replacing 1.3.0-SNAPSHOT with 1.3.0.1 globally and repeating these
steps, but the error seems to be the same
https://gist.github.com/ryan-williams/37fcdd14dd92fa562dbe.

Any ideas?

Thanks,

-Ryan

On Sun Nov 30 2014 at 6:37:28 PM Mark Hamstra m...@clearstorydata.com
wrote:

 
  - Start the SBT interactive console with sbt/sbt
  - Build your assembly by running the assembly target in the assembly
  project: assembly/assembly
  - Run all the tests in one module: core/test
  - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
 (this
  also supports tab completion)


 The equivalent using Maven:

 - Start zinc
 - Build your assembly using the mvn package or install target
 (install is actually the equivalent of SBT's publishLocal) -- this step
 is the first step in
 http://spark.apache.org/docs/latest/building-with-maven.
 html#spark-tests-in-maven
 - Run all the tests in one module: mvn -pl core test
 - Run a specific suite: mvn -pl core
 -DwildcardSuites=org.apache.spark.rdd.RDDSuite test (the -pl option isn't
 strictly necessary if you don't mind waiting for Maven to scan through all
 the other sub-projects only to do nothing; and, of course, it needs to be
 something other than core if the test you want to run is in another
 sub-project.)

 You also typically want to carry along in each subsequent step any relevant
 command line options you added in the package/install step.

 On Sun, Nov 30, 2014 at 3:06 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

  Hi Ryan,
 
  As a tip (and maybe this isn't documented well), I normally use SBT for
  development to avoid the slow build process, and use its interactive
  console to run only specific tests. The nice advantage is that SBT can
 keep
  the Scala compiler loaded and JITed across builds, making it faster to
  iterate. To use it, you can do the following:
 
  - Start the SBT interactive console with sbt/sbt
  - Build your assembly by running the assembly target in the assembly
  project: assembly/assembly
  - Run all the tests in one module: core/test
  - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
 (this
  also supports tab completion)
 
  Running all the tests does take a while, and I usually just rely on
  Jenkins for that once I've run the tests for the things I believed my
 patch
  could break. But this is because some of them are integration tests (e.g.
  DistributedSuite, which creates multi-process mini-clusters). Many of the
  individual suites run fast without requiring this, however, so you can
 pick
  the ones you want. Perhaps we should find a way to tag them so people
 can
  do a quick-test that skips the integration ones.
 
  The assembly builds are annoying but they only take about a minute for me
  on a MacBook Pro with SBT warmed up. The assembly is actually only
 required
  for some of the integration tests (which launch new processes), but I'd
  recommend doing it all the time anyway since it would be very confusing
 to
  run those with an old assembly. The Scala compiler crash issue can also
 be
  a problem, but I don't see it very often with SBT. If it happens, I exit
  SBT and do sbt clean.
 
  Anyway, this is useful feedback and I think we should try to improve some
  of these suites, but hopefully you can also try the faster SBT process.
 At
  the end of the day, if we want integration tests, the whole test process
  will take an hour, but most of the developers I know leave that to
 Jenkins
  and only run individual tests locally before submitting a patch.
 
  Matei
 
 
   On Nov 30, 2014, at 2:39 PM, Ryan Williams 
  ryan.blake.willi...@gmail.com wrote:
  
   In the course of trying to make contributions to Spark, I have had a
 lot
  of
   trouble running Spark's tests

Re: Spurious test failures, testing best practices

2014-12-02 Thread Ryan Williams
Marcelo: by my count, there are 19 maven modules in the codebase. I am
typically only concerned with core (and therefore its two dependencies as
well, `network/{shuffle,common}`).

The `mvn package` workflow (and its sbt equivalent) that most people
apparently use involves (for me) compiling+packaging 16 other modules that
I don't care about; I pay this cost whenever I rebase off of master or
encounter the sbt-compiler-crash bug, among other possible scenarios.

Compiling one module (after building/installing its dependencies) seems
like the sort of thing that should be possible, and I don't see why my
previously-documented attempt is failing.

re: Marcelo's comment about missing the 'spark-parent' project, I saw
that error message too and tried to ascertain what it could mean. Why would
`network/shuffle` need something from the parent project? AFAICT
`network/common` has the same references to the parent project as
`network/shuffle` (namely just a parent block in its POM), and yet I can
`mvn install -pl` the former but not the latter. Why would this be? One
difference is that `network/shuffle` has a dependency on another module,
while `network/common` does not.

Does Maven not let you build modules that depend on *any* other modules
without building *all* modules, or is there a way to do this that we've not
found yet?
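
(One thing I haven't tried yet: Maven's `-am` / `--also-make` flag is
documented to build the selected projects plus the in-reactor projects they
depend on, which sounds like exactly this case. A sketch, untested here:)

# build core plus only the reactor modules it depends on
mvn compile -pl core -am

# or just network/shuffle and what it needs
mvn install -DskipTests -pl network/shuffle -am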

Patrick: per my response to Marcelo above, I am trying to avoid having to
compile and package a bunch of stuff I am not using, which both `mvn
package` and `mvn install` on the parent project do.





On Tue Dec 02 2014 at 3:45:48 PM Marcelo Vanzin van...@cloudera.com wrote:

 On Tue, Dec 2, 2014 at 2:40 PM, Ryan Williams
 ryan.blake.willi...@gmail.com wrote:
  Following on Mark's Maven examples, here is another related issue I'm
  having:
 
  I'd like to compile just the `core` module after a `mvn clean`, without
  building an assembly JAR first. Is this possible?

 Out of curiosity, may I ask why? What's the problem with running mvn
 install -DskipTests first (or package instead of install,
 although I generally do the latter)?

 You can probably do what you want if you manually build / install all
 the needed dependencies first; you found two, but it seems you're also
 missing the spark-parent project (which is the top-level pom). That
 sounds like a lot of trouble though, for not any gains that I can
 see... after the first build you should be able to do what you want
 easily.

 --
 Marcelo



Re: Spurious test failures, testing best practices

2014-12-02 Thread Ryan Williams
On Tue Dec 02 2014 at 4:46:20 PM Marcelo Vanzin van...@cloudera.com wrote:

 On Tue, Dec 2, 2014 at 3:39 PM, Ryan Williams
 ryan.blake.willi...@gmail.com wrote:
  Marcelo: by my count, there are 19 maven modules in the codebase. I am
  typically only concerned with core (and therefore its two dependencies
 as
  well, `network/{shuffle,common}`).

 But you only need to compile the others once.


once... every time I rebase off master, or am obliged to `mvn clean` by
some other build-correctness bug, as I said before. In my experience this
works out to a few times per week.


 Once you've established
 the baseline, you can just compile / test core to your heart's
 desire.


I understand that this is a workflow that does what I want as a side effect
of doing 3-5x more work (depending on whether you count [number of modules
built] or [lines of scala/java compiled]), none of the extra work being
useful to me (more on that below).


 Core tests won't even run until you build the assembly anyway,
 since some of them require the assembly to be present.


The tests you refer to are exactly the ones that I'd like to let Jenkins
run from here on out, per advice I was given elsewhere in this thread and
due to the myriad unpleasantries I've encountered in trying to run them
myself.



 Also, even if you work in core - I'd say especially if you work in
 core - you should still, at some point, compile and test everything
 else that depends on it.


Last response applies.



 So, do this ONCE:


again, s/ONCE/several times a week/, in my experience.



   mvn install -DskipTests

 Then do this as many times as you want:

   mvn -pl spark-core_2.10 something

 That doesn't seem too bad to me.

(Be aware of the assembly comment
 above, since testing spark-core means you may have to rebuild the
 assembly from time to time, if your changes affect those tests.)

  re: Marcelo's comment about missing the 'spark-parent' project, I saw
 that
  error message too and tried to ascertain what it could mean. Why would
  `network/shuffle` need something from the parent project?

 The spark-parent project is the main pom that defines dependencies
 and their version, along with lots of build plugins and
 configurations. It's needed by all modules to compile correctly.


- I understand the parent POM has that information.

- I don't understand why Maven would feel that it is unable to compile the
`network/shuffle` module without having first compiled, packaged, and
installed 17 modules (19 minus `network/shuffle` and its dependency
`network/common`) that are not transitive dependencies of `network/shuffle`.

- I am trying to understand whether my failure to get Maven to compile
`network/shuffle` stems from my not knowing the correct incantation to feed
to Maven or from Maven's having a different (and seemingly worse) model for
how it handles module dependencies than I expected.




 --
 Marcelo



Spurious test failures, testing best practices

2014-11-30 Thread Ryan Williams
 this level of flakiness from spark tests?
- Do other people bother running dev/run-tests locally, or just let Jenkins
do it during the CR process?
- Needing to run a full assembly post-clean just to continue running one
specific test case feels especially wasteful, and the failure output when
naively attempting to run a specific test without having built an assembly
jar is not always clear about what the issue is or how to fix it; even the
fact that certain tests require building the world is not something I
would have expected, and has cost me hours of confusion.
- Should a person running spark tests assume that they must build an
assembly JAR before running anything?
- Are there some proper unit tests that are actually self-contained /
able to be run without building an assembly jar?
- Can we better document/demarcate which tests have which dependencies?
- Is there something finer-grained than building an assembly JAR that
is sufficient in some cases?
- If so, can we document that?
- If not, can we move to a world of finer-grained dependencies for
some of these?
- Leaving all of these spurious failures aside, the process of assembling
and testing a new JAR is not a quick one (40 and 60 mins for me typically,
respectively). I would guess that there are dozens (hundreds?) of people
who build a Spark assembly from various ToTs on any given day, and who all
wait on the exact same compilation / assembly steps to occur. Expanding on
the recent work to publish nightly snapshots [20], can we do a better job
caching/sharing compilation artifacts at a more granular level (pre-built
assembly JARs at each SHA? pre-built JARs per-maven-module, per-SHA? more
granular maven modules, plus the previous two?), or otherwise save some of
the considerable amount of redundant compilation work that I had to do over
the course of my odyssey this weekend?

Ramping up on most projects involves some amount of supplementing the
documentation with trial and error to figure out what to run, which
errors are real errors and which can be ignored, etc., but navigating
that minefield on Spark has proved especially challenging and
time-consuming for me. Some of that comes directly from scala's relatively
slow compilation times and immature build-tooling ecosystem, but that is
the world we live in and it would be nice if Spark took the alleviation of
the resulting pain more seriously, as one of the more interesting and
well-known large scala projects around right now. The official
documentation around how to build different subsets of the codebase is
somewhat sparse [21], and there have been many mixed [22] accounts [23] on
this mailing list about preferred ways to build on mvn vs. sbt (none of
which has made it into official documentation, as far as I've seen).
Expecting new contributors to piece together all of this received
folk-wisdom about how to build/test in a sane way by trawling mailing list
archives seems suboptimal.

Thanks for reading, looking forward to hearing your ideas!

-Ryan

P.S. Is best practice for emailing this list to not incorporate any HTML
in the body? It seems like all of the archives I've seen strip it out, but
other people have used it and gmail displays it.


[1]
https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail
(57 mins)
[2]
https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail
(6 mins)
[3]
https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%20pass%20test,%20fail%20subsequent%20compile
(4 mins)
[4]
http://apache-spark-user-list.1001560.n3.nabble.com/scalac-crash-when-compiling-DataTypeConversions-scala-td17083.html
[5]
https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%20clean,%20need%20dependencies%20built
[6]
https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%20post%20clean
(50 mins)
[7]
https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#file-dev-run-tests-failure-too-many-files-open-then-hang-L5260
(1hr)
[8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
[9] https://issues.apache.org/jira/browse/SPARK-3867
[10] https://gist.github.com/ryan-williams/735adf543124c99647cc
[11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
[12]
https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#file-gistfile1-txt-L853
(~90 mins)
[13]
https://gist.github.com/ryan-williams/718f6324af358819b496#file-gistfile1-txt-L852
(91 mins)
[14]
https

Re: Spurious test failures, testing best practices

2014-11-30 Thread Ryan Williams
thanks for the info, Matei and Brennon. I will try to switch my workflow to
using sbt. Other potential action items:

- currently the docs only contain information about building with maven,
and even then don't cover many important cases, as I described in my
previous email. If SBT is as much better as you've described then that
should be made much more obvious. Wasn't it the case recently that there
was only a page about building with SBT, and not one about building with
maven? Clearer messaging around this needs to exist in the documentation,
not just on the mailing list, imho.

- +1 to better distinguishing between unit and integration tests, having
separate scripts for each, improving documentation around common workflows,
expectations of brittleness with each kind of test, advisability of just
relying on Jenkins for certain kinds of tests to not waste too much time,
etc. Things like the compiler crash should be discussed in the
documentation, not just in the mailing list archives, if new contributors
are likely to run into them through no fault of their own.

- What is the algorithm you use to decide what tests you might have broken?
Can we codify it in some scripts that other people can use?



On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hi Ryan,

 As a tip (and maybe this isn't documented well), I normally use SBT for
 development to avoid the slow build process, and use its interactive
 console to run only specific tests. The nice advantage is that SBT can keep
 the Scala compiler loaded and JITed across builds, making it faster to
 iterate. To use it, you can do the following:

 - Start the SBT interactive console with sbt/sbt
 - Build your assembly by running the assembly target in the assembly
 project: assembly/assembly
 - Run all the tests in one module: core/test
 - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this
 also supports tab completion)

 Running all the tests does take a while, and I usually just rely on
 Jenkins for that once I've run the tests for the things I believed my patch
 could break. But this is because some of them are integration tests (e.g.
 DistributedSuite, which creates multi-process mini-clusters). Many of the
 individual suites run fast without requiring this, however, so you can pick
 the ones you want. Perhaps we should find a way to tag them so people  can
 do a quick-test that skips the integration ones.

 The assembly builds are annoying but they only take about a minute for me
 on a MacBook Pro with SBT warmed up. The assembly is actually only required
 for some of the integration tests (which launch new processes), but I'd
 recommend doing it all the time anyway since it would be very confusing to
 run those with an old assembly. The Scala compiler crash issue can also be
 a problem, but I don't see it very often with SBT. If it happens, I exit
 SBT and do sbt clean.

 Anyway, this is useful feedback and I think we should try to improve some
 of these suites, but hopefully you can also try the faster SBT process. At
 the end of the day, if we want integration tests, the whole test process
 will take an hour, but most of the developers I know leave that to Jenkins
 and only run individual tests locally before submitting a patch.

 Matei


  On Nov 30, 2014, at 2:39 PM, Ryan Williams 
 ryan.blake.willi...@gmail.com wrote:
 
  In the course of trying to make contributions to Spark, I have had a lot
 of
  trouble running Spark's tests successfully. The main pain points I've
  experienced are:
 
 1) frequent, spurious test failures
 2) high latency of running tests
 3) difficulty running specific tests in an iterative fashion
 
  Here is an example series of failures that I encountered this weekend
  (along with footnote links to the console output from each and
  approximately how long each took):
 
  - `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
  before.
  - `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
  - `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]: BroadcastSuite
  passed, but scala compiler crashed on the catalyst project.
  - `mvn clean`: some attempts to run earlier commands (that previously
  didn't crash the compiler) all result in the same compiler crash.
 Previous
  discussion on this list implies this can only be solved by a `mvn clean`
  [4].
  - `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
  BroadcastSuite can't run because assembly is not built.
  - `./dev/run-tests` again [6]: pyspark tests fail, some messages about
  version mismatches and python 2.6. The machine this ran on has python
 2.7,
  so I don't know what that's about.
  - `./dev/run-tests` again [7]: too many open files errors in several
  tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this is
  not enough, but only some of the time? I increased it to 8192 and tried
  again.
  - `./dev/run-tests` again [8]: same pyspark

Re: Spurious test failures, testing best practices

2014-11-30 Thread Ryan Williams
Thanks Mark, most of those commands are ones I've already been using (and
used in my original post), except for "Start zinc". I now see the section
about it on the unpublished building-spark page
https://github.com/apache/spark/blob/master/docs/building-spark.md#speeding-up-compilation-with-zinc
and will try using it.
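
For anyone else landing on this thread, my reading of that section (hedged,
since I haven't actually tried it yet) is that the workflow is roughly:
install zinc (e.g. `brew install zinc` on OS X), start it once as a
long-running background server, and then run maven as usual, which should
pick up the running server and skip the compiler-startup cost on each build:

    zinc -start                     # start the incremental-compile server
    mvn -DskipTests clean package   # subsequent mvn runs reuse the server
    # ...edit, re-run mvn, repeat...
    zinc -shutdown                  # stop the server when done

If those flags aren't exactly right, defer to the zinc project page; I'm
only going off the doc section linked above.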

Even so, finding those commands took a nontrivial amount of trial and
error. I've not seen them well documented outside of this list; your and
Matei's emails (and previous emails to this list) each have more info about
building/testing with Maven and SBT (respectively) than building-spark
https://github.com/apache/spark/blob/master/docs/building-spark.md#spark-tests-in-maven
does. Beyond that, the per-suite invocation still requires an assembly in
some cases (with no warning, from my perspective, since I haven't read up
on the names of all Spark integration tests), spurious failures still
abound, and there's still no good way to run only the things that a given
change actually could have broken.

Anyway, hopefully zinc brings me to the world of ~minute iteration times
that have been reported on this thread.


On Sun Nov 30 2014 at 6:53:57 PM Ryan Williams 
ryan.blake.willi...@gmail.com wrote:

 Thanks Nicholas, glad to hear that some of this info will be pushed to the
 main site soon, but this brings up yet another point of confusion that I've
 struggled with, namely whether the documentation on github or that on
 spark.apache.org should be considered the primary reference for people
 seeking to learn about best practices for developing Spark.

 Trying to read docs starting from
 https://github.com/apache/spark/blob/master/docs/index.md right now, I
 find that all of the links to other parts of the documentation are broken:
 they point to relative paths that end in .html, which will work when
 published on the docs-site, but which would have to end in .md for a person
 to be able to navigate them on github.

 So expecting people to use the up-to-date docs on github (where all
 internal URLs 404 and the main github README suggests that the latest
 Spark documentation can be found on the actually-months-old docs-site
 https://github.com/apache/spark#online-documentation) is not a good
 solution. On the other hand, consulting months-old docs on the site is also
 problematic, as this thread and your last email have borne out.  The result
 is that there is no good place on the internet to learn about the most
 up-to-date best practices for using/developing Spark.

 Why not build http://spark.apache.org/docs/latest/ nightly (or every
 commit) off of what's in github, rather than having that URL point to the
 last release's docs (up to ~3 months old)? This way, casual users who want
 the docs for the released version they happen to be using (which is already
 frequently != /latest today, for many Spark users) can (still) find them
 at http://spark.apache.org/docs/X.Y.Z, and the github README can safely
 point people to a site (/latest) that actually has up-to-date docs that
 reflect ToT and whose links work.

 If there are concerns about existing semantics around /latest URLs being
 broken, some new URL could be used, like
 http://spark.apache.org/docs/snapshot/, but given that everything under
 http://spark.apache.org/docs/latest/ is in a state of
 planned-backwards-incompatible-changes every ~3mos, that doesn't sound like
 that serious an issue to me; anyone sending around permanent links to
 things under /latest is already going to have those links break / not make
 sense in the near future.
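
 For what it's worth, the build half of that seems small: docs/README.md
 describes building the site with jekyll, so a nightly job could be roughly
 the following (treat the exact commands as a sketch; the publishing side is
 the part I don't know):

     git clone https://github.com/apache/spark.git && cd spark/docs
     SKIP_API=1 jekyll build   # SKIP_API=1 skips the slow scaladoc/epydoc step
     # then push/rsync _site/ to wherever /docs/snapshot would be served from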


 On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:


- currently the docs only contain information about building with maven,
  and even then don’t cover many important cases

  All other points aside, I just want to point out that the docs document
 both how to use Maven and SBT and clearly state
 https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt
 that Maven is the “build of reference” while SBT may be preferable for
 day-to-day development.

 I believe the main reason most people miss this documentation is that,
 though it’s up-to-date on GitHub, it hasn’t been published yet to the docs
 site. It should go out with the 1.2 release.

 Improvements to the documentation on building Spark belong here:
 https://github.com/apache/spark/blob/master/docs/building-spark.md

 If there are clear recommendations that come out of this thread but are
 not in that doc, they should be added in there. Other, less important
 details may possibly be better suited for the Contributing to Spark
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
 guide.

 Nick

 On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell pwend...@gmail.com
 wrote:

 Hey Ryan,

 A few more things here. You should feel free to send patches to
 Jenkins to test them, since this is the reference environment in which
 we regularly run tests. This is the normal workflow

Re: Spurious test failures, testing best practices

2014-11-30 Thread Ryan Williams
Thanks Patrick, great to hear that docs-snapshots-via-jenkins is already
JIRA'd; you can interpret some of this thread as a gigantic +1 from me on
prioritizing that, which it looks like you are doing :)

I do understand the limitations of the github vs. official site status
quo; I was mostly responding to a perceived implication that I should have
been getting building/testing-spark advice from the github .md files
instead of from /latest. I agree that neither one works very well
currently, and that docs-snapshots-via-jenkins is the right solution. Per
my other email, leaving /latest as-is sounds reasonable, as long as jenkins
is putting the latest docs *somewhere*.

On Sun Nov 30 2014 at 7:19:33 PM Patrick Wendell pwend...@gmail.com wrote:

 Btw - the documentation on github represents the source code of our
 docs, which is versioned with each release. Unfortunately github will
 always try to render .md files so it could look to a passerby like
 this is supposed to represent published docs. This is a feature
 limitation of github, AFAIK we cannot disable it.

 The official published docs are associated with each release and
 available on the apache.org website. I think /latest is a common
 convention for referring to the latest *published release* docs, so
 probably we can't change that (the audience for /latest is orders of
 magnitude larger than for snapshot docs). However we could just add
 /snapshot and publish docs there.

 - Patrick

 On Sun, Nov 30, 2014 at 6:15 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Hey Ryan,
 
  The existing JIRA also covers publishing nightly docs:
  https://issues.apache.org/jira/browse/SPARK-1517
 
  - Patrick
 
  On Sun, Nov 30, 2014 at 5:53 PM, Ryan Williams
  ryan.blake.willi...@gmail.com wrote:
  Thanks Nicholas, glad to hear that some of this info will be pushed to
 the
  main site soon, but this brings up yet another point of confusion that
 I've
  struggled with, namely whether the documentation on github or that on
  spark.apache.org should be considered the primary reference for people
  seeking to learn about best practices for developing Spark.
 
  Trying to read docs starting from
  https://github.com/apache/spark/blob/master/docs/index.md right now, I
 find
  that all of the links to other parts of the documentation are broken:
 they
  point to relative paths that end in .html, which will work when
 published
  on the docs-site, but that would have to end in .md if a person was
 to be
  able to navigate them on github.
 
  So expecting people to use the up-to-date docs on github (where all
  internal URLs 404 and the main github README suggests that the latest
  Spark documentation can be found on the actually-months-old docs-site
  https://github.com/apache/spark#online-documentation) is not a good
  solution. On the other hand, consulting months-old docs on the site is
 also
  problematic, as this thread and your last email have borne out.  The
 result
  is that there is no good place on the internet to learn about the most
  up-to-date best practices for using/developing Spark.
 
  Why not build http://spark.apache.org/docs/latest/ nightly (or every
  commit) off of what's in github, rather than having that URL point to
 the
  last release's docs (up to ~3 months old)? This way, casual users who
 want
  the docs for the released version they happen to be using (which is
 already
  frequently != /latest today, for many Spark users) can (still) find
 them
  at http://spark.apache.org/docs/X.Y.Z, and the github README can safely
  point people to a site (/latest) that actually has up-to-date docs that
  reflect ToT and whose links work.
 
  If there are concerns about existing semantics around /latest URLs
 being
  broken, some new URL could be used, like
  http://spark.apache.org/docs/snapshot/, but given that everything under
  http://spark.apache.org/docs/latest/ is in a state of
  planned-backwards-incompatible-changes every ~3mos, that doesn't sound
 like
  that serious an issue to me; anyone sending around permanent links to
  things under /latest is already going to have those links break / not
 make
  sense in the near future.
 
 
  On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
 
 - currently the docs only contain information about building with maven,
   and even then don't cover many important cases
 
   All other points aside, I just want to point out that the docs
 document
  both how to use Maven and SBT and clearly state
  https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt
  that Maven is the build of reference while SBT may be preferable for
  day-to-day development.
 
  I believe the main reason most people miss this documentation is that,
  though it's up-to-date on GitHub, it hasn't been published yet to the
 docs
  site. It should go out with the 1.2 release.
 
  Improvements to the documentation on building Spark belong here:
  https://github.com