Fwd: Monitoring / Instrumenting jobs in 1.0

2014-05-31 Thread Mayur Rustagi
We have a JSON feed of the Spark application interface that we use for easier
instrumentation and monitoring. Has that been considered/found relevant?
It was already sent as a pull request against 0.9.0; would that work, or should we
update it to 1.0.0?


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



-- Forwarded message --
From: Patrick Wendell pwend...@gmail.com
Date: Sat, May 31, 2014 at 9:09 AM
Subject: Re: Monitoring / Instrumenting jobs in 1.0
To: u...@spark.apache.org


The main change here was refactoring the SparkListener interface, which
is where we expose internal state about a Spark job to other
applications. We've cleaned up these APIs a bunch and also added a
way to log all data as JSON for post-hoc analysis:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala
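
For illustration, here is a minimal sketch of hooking into that interface from an application. This is not code from this thread; it assumes Spark 1.0's SparkContext.addSparkListener and the spark.eventLog.enabled setting for the JSON event log, so check the details against the linked source.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

// A listener that prints a line whenever a stage or task finishes.
class SimpleMonitoringListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
    println(s"Stage ${stageCompleted.stageInfo.stageId} completed")

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    println(s"Task finished in stage ${taskEnd.stageId}, reason: ${taskEnd.reason}")
}

object MonitoringExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("listener-example")
      .set("spark.eventLog.enabled", "true") // persist events as JSON for post-hoc analysis
    val sc = new SparkContext(conf)
    sc.addSparkListener(new SimpleMonitoringListener)
    sc.parallelize(1 to 1000).map(_ * 2).count()
    sc.stop()
  }
}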

- Patrick

On Fri, May 30, 2014 at 7:09 AM, Daniel Siegmann
daniel.siegm...@velos.io wrote:
 The Spark 1.0.0 release notes state "Internal instrumentation has been added
 to allow applications to monitor and instrument Spark jobs." Can anyone
 point me to the docs for this?

 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
 E: daniel.siegm...@velos.io W: www.velos.io


Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread prabeesh k
Hi,

Scenario: read data from HDFS, apply a Hive query on it, and write the result
back to HDFS.

Schema creation, querying, and saveAsTextFile all work fine in the following
modes:

   - local mode
   - Mesos cluster with a single node
   - Spark cluster with multiple nodes

Schema creation and querying also work fine on the multi-node Mesos cluster,
but while trying to write back to HDFS using saveAsTextFile, the following
error occurs:

14/05/30 10:16:35 INFO DAGScheduler: The failed fetch was from Stage 4
(mapPartitionsWithIndex at Operator.scala:333); marking it for resubmission
14/05/30 10:16:35 INFO DAGScheduler: Executor lost:
201405291518-3644595722-5050-17933-1 (epoch 148)
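
For reference, a minimal sketch of the kind of job described above, written against Spark 1.0's HiveContext. The table name, paths, and query are invented, and the original job may be using a different Hive layer (such as Shark), so treat this only as an outline of the workflow.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HdfsHiveQueryExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-query-example"))
    val hiveContext = new HiveContext(sc)

    // Assumed table and paths: register data living on HDFS and query it with HiveQL.
    hiveContext.hql(
      "CREATE EXTERNAL TABLE IF NOT EXISTS logs (line STRING) LOCATION 'hdfs:///data/logs'")
    val result = hiveContext.hql("SELECT line FROM logs WHERE line LIKE '%ERROR%'")

    // Write the query result back to HDFS; this is the step that fails on the
    // multi-node Mesos cluster in the report above.
    result.map(_.getString(0)).saveAsTextFile("hdfs:///data/query-output")

    sc.stop()
  }
}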

Let me know your thoughts regarding this.

Regards,
prabeesh


Re: Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread Patrick Wendell
Can you look at the logs from the executor or in the UI? They should
give an exception with the reason for the task failure. Also, in the
future, for this type of e-mail please e-mail only the user@ list
and not both lists.

- Patrick

On Sat, May 31, 2014 at 3:22 AM, prabeesh k prabsma...@gmail.com wrote:
 Hi,

 Scenario: read data from HDFS, apply a Hive query on it, and write the result
 back to HDFS.

 Schema creation, querying, and saveAsTextFile all work fine in the following
 modes:

 local mode
 Mesos cluster with a single node
 Spark cluster with multiple nodes

 Schema creation and querying also work fine on the multi-node Mesos cluster,
 but while trying to write back to HDFS using saveAsTextFile, the following
 error occurs:

  14/05/30 10:16:35 INFO DAGScheduler: The failed fetch was from Stage 4
 (mapPartitionsWithIndex at Operator.scala:333); marking it for resubmission
 14/05/30 10:16:35 INFO DAGScheduler: Executor lost:
 201405291518-3644595722-5050-17933-1 (epoch 148)

 Let me know your thoughts regarding this.

 Regards,
 prabeesh


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-31 Thread Patrick Wendell
One other consideration popped into my head:

5. Shading our dependencies could mess up our external APIs if we
ever return types that are outside of the spark package, because we'd
then be returning shaded types that users have to deal with. E.g. where
before we returned an o.a.flume.AvroFlumeEvent, we'd have to return a
some.namespace.AvroFlumeEvent. Then users downstream would have to
deal with converting our types into the correct namespace if they want
to interoperate with other libraries. We generally try to avoid ever
returning types from other libraries, but it would be good to audit
our APIs and see if we ever do this.
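
As a hypothetical illustration of that concern (the object and method below are invented and not an actual Spark API; the import assumes the Flume SDK's AvroFlumeEvent class):

// Hypothetical, for illustration only -- not an actual Spark API.
object UnshadedStreamingApi {
  import org.apache.flume.source.avro.AvroFlumeEvent

  // Users can pass these events straight to other Flume-based libraries.
  def pollEvents(): Seq[AvroFlumeEvent] = Seq.empty
}

// If the Flume dependency were shaded, the same method would instead leak the
// relocated class, e.g. Seq[some.shaded.namespace.AvroFlumeEvent], and callers
// linking against the real Flume jar would have to convert between the two types.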

On Fri, May 30, 2014 at 10:54 PM, Patrick Wendell pwend...@gmail.com wrote:
 Spark is a bit different from Hadoop MapReduce, so maybe that's a
 source of some confusion. Spark is often used as a substrate for
 building different types of analytics applications, so @DeveloperApi
 marks internal APIs that we'd like to expose to application writers,
 but that might be more volatile. This is like the internal APIs in
 the Linux kernel: they aren't stable, but of course we try to minimize
 changes to them. If people want to write lower-level modules against
 them, that's fine with us, but they know the interfaces might change.

 This has worked pretty well over the years, even with many different
 companies writing against those APIs.

 @Experimental marks user-facing features we are trying out. Hopefully
 that one is clearer.

 In terms of making a big jar that shades all of our dependencies - I'm
 curious how that would actually work in practice. It would be good to
 explore. There are a few potential challenges I see:

 1. If any of our dependencies encode class name information in IPC
 messages, this would break. E.g. can you definitely shade the Hadoop
 client, protobuf, the HBase client, etc. and have them send messages over
 the wire? This could break things if class names are ever encoded in a
 wire format.
 2. Many libraries like logging subsystems, configuration systems, etc
 rely on static state and initialization. I'm not totally sure how e.g.
 slf4j initializes itself if you have both a shaded and non-shaded copy
 of slf4j present.
 3. This would mean the spark-core jar would be really massive because
 it would inline all of our deps. We've actually been thinking of
 avoiding the current assembly jar approach because, due to Scala
 specialized classes, our assemblies now have more than 65,000 class
 files in them, leading to all kinds of bad issues. We'd have to stick
 with a big uber assembly-like jar if we decide to shade stuff.
 4. I'm not totally sure how this would work when people want to, e.g.,
 build Spark with different Hadoop versions. Would we publish different
 shaded uber-jars for every Hadoop version? Would the Hadoop dep just
 not be shaded... if so, what about all its dependencies?

 Anyways just some things to consider... simplifying our classpath is
 definitely an avenue worth exploring!




 On Fri, May 30, 2014 at 2:56 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote:
 On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey guys, thanks for the insights. Also, I realize Hadoop has gotten
 way better about this with 2.2+ and I think it's great progress.

 We have well-defined API levels in Spark and also automated checking
 of API violations for new pull requests. When doing code reviews we
 always enforce the narrowest possible visibility:

 1. private
 2. private[spark]
 3. @Experimental or @DeveloperApi
 4. public

 Our automated checks exclude 1-3. Anything that breaks 4 will trigger
 a build failure.
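
As a small illustrative sketch of the four levels (the package and member names below are invented; only the annotations are Spark's real ones from org.apache.spark.annotation):

package org.apache.spark.example // illustrative: private[spark] must live under the spark package

import org.apache.spark.annotation.{DeveloperApi, Experimental}

class VisibilityLevels {
  // 1. private: visible only within this class.
  private val internalCounter = 0

  // 2. private[spark]: visible to anything under org.apache.spark,
  //    but not to user code compiled against the published jar.
  private[spark] def schedulerHook(): Unit = ()

  // 3. Bytecode-public but annotated: callable by applications, yet may change between releases.
  @DeveloperApi
  def lowLevelMetrics(): Map[String, Long] = Map.empty

  @Experimental
  def tryNewFeature(): Unit = ()

  // 4. Plain public: covered by the automated API-compatibility checks.
  def stableApi(): Int = 42
}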


 That's really excellent.  Great job.

 I like the private[spark] visibility level -- sounds like this is another
 way Scala has greatly improved on Java.

 The Scala compiler prevents anyone external from using 1 or 2. We do
 have bytecode-public but annotated (3) APIs that we might change.
 We spent a lot of time looking into whether these could produce compiler
 warnings, but we haven't found a way to do this and do not see a
 better alternative at this point.


 It would be nice if the production build could strip this stuff out.
  Otherwise, it feels a lot like a @private, @unstable Hadoop API... and we
 know how those turned out.


 Regarding Scala compatibility, Scala 2.11+ is source code
 compatible, meaning we'll be able to cross-compile Spark for
 different versions of Scala. We've already been in touch with Typesafe
 about this and they've offered to integrate Spark into their
 compatibility test suite. They've also committed to patching 2.11 with
 a minor release if bugs are found.


 Thanks, I hadn't heard about this plan.  Hopefully we can get everyone on
 2.11 ASAP.


 Anyways, my point is we've actually thought a lot about this already.

 The CLASSPATH thing is different from API stability, but indeed also a
 form of compatibility. This is something where I'd also like to see
 Spark have better isolation of user classes 

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-31 Thread Colin McCabe
On Sat, May 31, 2014 at 10:45 AM, Patrick Wendell pwend...@gmail.com
wrote:

 One other consideration popped into my head:

 5. Shading our dependencies could mess up our external APIs if we
 ever return types that are outside of the spark package, because we'd
 then be returning shaded types that users have to deal with. E.g. where
 before we returned an o.a.flume.AvroFlumeEvent, we'd have to return a
 some.namespace.AvroFlumeEvent. Then users downstream would have to
 deal with converting our types into the correct namespace if they want
 to interoperate with other libraries. We generally try to avoid ever
 returning types from other libraries, but it would be good to audit
 our APIs and see if we ever do this.


That's a good point.  It seems to me that if Spark is returning a type in
the public API, that type is part of the public API (for better or worse).
So this is a case where you wouldn't want to shade that type.  But it
would be nice to avoid doing this, for exactly the reasons you state...

On Fri, May 30, 2014 at 10:54 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Spark is a bit different from Hadoop MapReduce, so maybe that's a
  source of some confusion. Spark is often used as a substrate for
  building different types of analytics applications, so @DeveloperApi
  marks internal APIs that we'd like to expose to application writers,
  but that might be more volatile. This is like the internal APIs in
  the Linux kernel: they aren't stable, but of course we try to minimize
  changes to them. If people want to write lower-level modules against
  them, that's fine with us, but they know the interfaces might change.


MapReduce is used as a substrate in a lot of cases, too.  Hive has
traditionally created MR jobs to do what it needs to do.  Similarly, Oozie
can create MR jobs.  It seems that @DeveloperApi is pretty similar to
@LimitedPrivate in Hadoop.  If I understand correctly, your hope is that
frameworks will use @DeveloperApi, but individual application developers
will steer clear.  That is a good plan, as long as you can ensure that the
framework developers are willing to lock their versions to a certain Spark
version.  Otherwise they will make the same arguments we've heard before:
that they don't want to transition off of a deprecated @DeveloperApi
because they want to keep support for Spark 1.0.0 (or whatever).  We hear
these arguments in Hadoop all the time...  now that Spark has a 1.0 release
they will carry more weight.  Remember, Hadoop APIs started nice and simple
too :)


  This has worked pretty well over the years, even with many different
  companies writing against those API's.
 
  @Experimental are user-facing features we are trying out. Hopefully
  that one is more clear.
 
  In terms of making a big jar that shades all of our dependencies - I'm
  curious how that would actually work in practice. It would be good to
  explore. There are a few potential challenges I see:
 
  1. If any of our dependencies encode class name information in IPC
  messages, this would break. E.g. can you definitely shade the Hadoop
  client, protobuf, hbase client, etc and have them send messages over
  the wire? This could break things if class names are ever encoded in a
  wire format.


Google protocol buffers assume a fixed schema.  That is to say, they do not
include metadata identifying the types of what is placed in them.  The
types are determined by convention.  It is possible to change the Java
package in which the protobuf classes reside with no harmful effects.  (See
HDFS-4909 for an example of this.)  The RPC itself does include a Java
class name for the interface we're talking to, though.  The code for
handling this is all under our control, so if we had to make any
minor modifications to make shading work, we could.

 2. Many libraries like logging subsystems, configuration systems, etc
  rely on static state and initialization. I'm not totally sure how e.g.
  slf4j initializes itself if you have both a shaded and non-shaded copy
  of slf4j present.


I guess the worst-case scenario would be that the shaded version of slf4j
creates a log file, but then the app's unshaded version overwrites that log
file.  I don't see how the two versions could cooperate, since they aren't
sharing static state.  The only solutions I can see are leaving slf4j
unshaded, or setting up separate log files for spark-core versus the
application.  I haven't thought this through completely, but my gut feeling
is that if you're sharing a log file, you probably want to share the
logging code too.
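
A tiny sketch of why that static state matters, assuming slf4j's standard LoggerFactory API (the shaded package name is hypothetical):

import org.slf4j.LoggerFactory // the application's unshaded copy
// A relocated copy would live at e.g. some.shaded.namespace.org.slf4j.LoggerFactory (hypothetical)
// and resolve its own static binding at class-load time, so the two copies
// cannot see each other's configuration or appenders.

object LoggingExample {
  private val log = LoggerFactory.getLogger(getClass)

  def main(args: Array[String]): Unit = {
    // This line goes only to whatever binding the unshaded copy found on the classpath.
    log.info("logged via the application's slf4j, invisible to a shaded copy")
  }
}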


  3. This would mean the spark-core jar would be really massive because
  it would inline all of our deps. We've actually been thinking of
  avoiding the current assembly jar approach because, due to scala
  specialized classes, our assemblies now have more than 65,000 class
  files in them leading to all kinds of bad issues. We'd have to stick
  with a big uber assembly-like jar if we decide to shade 

SCALA_HOME or SCALA_LIBRARY_PATH not set during build

2014-05-31 Thread Soren Macbeth
Hello,

Following the instructions for building spark 1.0.0, I encountered the
following error:

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project
spark-core_2.10: An Ant BuildException has occured: Please set the
SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment
variables and retry.
[ERROR] around Ant part ...fail message=Please set the SCALA_HOME (or
SCALA_LIBRARY_PATH if scala is on the path) environment variables and
retry @ 6:126 in
/Users/soren/src/spark-1.0.0/core/target/antrun/build-main.xml

Nowhere in the documentation does it mention that Scala needs to be installed
and one of these env vars set, nor which version should be installed.
Setting these env vars wasn't required for 0.9.1 with sbt.

I was able to get past it by downloading the Scala 2.10.4 binary package to
a temp dir and setting SCALA_HOME to that dir.

Ideally, it would be nice not to require people to have a standalone Scala
installation, but at a minimum this requirement should be documented in the
build instructions, no?

-Soren


Re: SCALA_HOME or SCALA_LIBRARY_PATH not set during build

2014-05-31 Thread Colin McCabe
Spark currently supports two build systems, sbt and Maven.  sbt will
download the correct version of Scala, but with Maven you need to supply it
yourself and set SCALA_HOME.

It sounds like the instructions need to be updated -- perhaps create a JIRA?

best,
Colin


On Sat, May 31, 2014 at 7:06 PM, Soren Macbeth so...@yieldbot.com wrote:

 Hello,

 Following the instructions for building spark 1.0.0, I encountered the
 following error:

 [ERROR] Failed to execute goal
 org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project
 spark-core_2.10: An Ant BuildException has occured: Please set the
 SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment
 variables and retry.
 [ERROR] around Ant part ...fail message=Please set the SCALA_HOME (or
 SCALA_LIBRARY_PATH if scala is on the path) environment variables and
 retry @ 6:126 in
 /Users/soren/src/spark-1.0.0/core/target/antrun/build-main.xml

 Nowhere in the documentation does it mention that Scala needs to be installed
 and one of these env vars set, nor which version should be installed.
 Setting these env vars wasn't required for 0.9.1 with sbt.

 I was able to get past it by downloading the Scala 2.10.4 binary package to
 a temp dir and setting SCALA_HOME to that dir.

 Ideally, it would be nice not to require people to have a standalone Scala
 installation, but at a minimum this requirement should be documented in the
 build instructions, no?

 -Soren