Fwd: Monitoring / Instrumenting jobs in 1.0
We have a JSON feed of the Spark application web interface that we use for easier instrumentation and monitoring. Has that been considered/found relevant? It was already sent as a pull request against 0.9.0; would that work, or should we update it to 1.0.0?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi

---------- Forwarded message ----------
From: Patrick Wendell pwend...@gmail.com
Date: Sat, May 31, 2014 at 9:09 AM
Subject: Re: Monitoring / Instrumenting jobs in 1.0
To: u...@spark.apache.org

The main change here was refactoring the SparkListener interface, which is where we expose internal state about a Spark job to other applications. We've cleaned up these APIs a bunch and also added a way to log all the data as JSON for post-hoc analysis:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala

- Patrick

On Fri, May 30, 2014 at 7:09 AM, Daniel Siegmann daniel.siegm...@velos.io wrote:

> The Spark 1.0.0 release notes state, "Internal instrumentation has been added to allow applications to monitor and instrument Spark jobs." Can anyone point me to the docs for this?
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegm...@velos.io  W: www.velos.io
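As a rough sketch of what the refactored interface supports, the following Scala program registers a custom listener and also turns on the JSON event log. The class name, app name, and the particular metrics printed are illustrative, not an existing feature:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

    // Illustrative listener: print basic stage/task metrics as events arrive.
    class SimpleMetricsListener extends SparkListener {
      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
        val info = stageCompleted.stageInfo
        println(s"Stage ${info.stageId} (${info.name}) finished, ${info.numTasks} tasks")
      }
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
        println(s"Task in stage ${taskEnd.stageId} took ${taskEnd.taskInfo.duration} ms")
      }
    }

    object ListenerDemo {
      def main(args: Array[String]) {
        val conf = new SparkConf()
          .setAppName("listener-demo")
          .set("spark.eventLog.enabled", "true") // also persist every event as JSON
        val sc = new SparkContext(conf)
        sc.addSparkListener(new SimpleMetricsListener)
        sc.parallelize(1 to 1000).map(_ * 2).count()
        sc.stop()
      }
    }

With spark.eventLog.enabled set, the same events the listener sees are also written out as JSON, which is the post-hoc log referred to above.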
Unable to execute saveAsTextFile on multi node mesos
Hi,

Scenario: read data from HDFS, apply a Hive query to it, and write the result back to HDFS.

Schema creation, querying, and saveAsTextFile are working fine in the following modes:

- local mode
- mesos cluster with a single node
- spark cluster with multiple nodes

Schema creation and querying also work fine with a multi-node mesos cluster. But while trying to write back to HDFS using saveAsTextFile, the following error occurs:

14/05/30 10:16:35 INFO DAGScheduler: The failed fetch was from Stage 4 (mapPartitionsWithIndex at Operator.scala:333); marking it for resubmission
14/05/30 10:16:35 INFO DAGScheduler: Executor lost: 201405291518-3644595722-5050-17933-1 (epoch 148)

Let me know your thoughts regarding this.

Regards,
prabeesh
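For context, here is a minimal sketch of this kind of pipeline on Spark 1.0 using HiveContext; the table, query, and output path are placeholders rather than the reporter's actual job:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Sketch: run a Hive query and write the result back to HDFS as text.
    object HiveQueryToHdfs {
      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("hive-to-hdfs"))
        val hiveContext = new HiveContext(sc)
        // Placeholder table and query; hql returns a SchemaRDD of Rows.
        val result = hiveContext.hql("SELECT key, value FROM src WHERE value IS NOT NULL")
        // A Row is a Seq[Any], so mkString yields a simple delimited text line.
        result.map(_.mkString("\t")).saveAsTextFile("hdfs:///tmp/query-output")
        sc.stop()
      }
    }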
Re: Unable to execute saveAsTextFile on multi node mesos
Can you look at the logs from the executor or in the UI? They should give an exception with the reason for the task failure.

Also, in the future, for this type of e-mail please only e-mail the user@ list and not both lists.

- Patrick
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
One other consideration popped into my head:

5. Shading our dependencies could mess up our external APIs if we ever return types that are outside of the spark package, because we'd then be returning shaded types that users have to deal with. E.g. where before we returned an o.a.flume.AvroFlumeEvent, we'd have to return a some.namespace.AvroFlumeEvent. Then users downstream would have to deal with converting our types into the correct namespace if they want to inter-operate with other libraries. We generally try to avoid ever returning types from other libraries, but it would be good to audit our APIs and see if we ever do this.

On Fri, May 30, 2014 at 10:54 PM, Patrick Wendell pwend...@gmail.com wrote:

> Spark is a bit different than Hadoop MapReduce, so maybe that's a source of some confusion. Spark is often used as a substrate for building different types of analytics applications, so @DeveloperApi are internal APIs that we'd like to expose to application writers, but that might be more volatile. This is like the internal APIs in the Linux kernel: they aren't stable, but of course we try to minimize changes to them. If people want to write lower-level modules against them, that's fine with us, but they know the interfaces might change. This has worked pretty well over the years, even with many different companies writing against those APIs.
>
> @Experimental are user-facing features we are trying out. Hopefully that one is more clear.
>
> In terms of making a big jar that shades all of our dependencies, I'm curious how that would actually work in practice. It would be good to explore. There are a few potential challenges I see:
>
> 1. If any of our dependencies encode class name information in IPC messages, this would break. E.g. can you definitely shade the Hadoop client, protobuf, HBase client, etc. and have them send messages over the wire? This could break things if class names are ever encoded in a wire format.
>
> 2. Many libraries like logging subsystems, configuration systems, etc. rely on static state and initialization. I'm not totally sure how e.g. slf4j initializes itself if you have both a shaded and a non-shaded copy of slf4j present.
>
> 3. This would mean the spark-core jar would be really massive because it would inline all of our deps. We've actually been thinking of avoiding the current assembly jar approach because, due to Scala specialized classes, our assemblies now have more than 65,000 class files in them, leading to all kinds of bad issues. We'd have to stick with a big uber assembly-like jar if we decide to shade stuff.
>
> 4. I'm not totally sure how this would work when people want to e.g. build Spark with different Hadoop versions. Would we publish different shaded uber-jars for every Hadoop version? Would the Hadoop dep just not be shaded? If so, what about all its dependencies?
>
> Anyways, just some things to consider... simplifying our classpath is definitely an avenue worth exploring!
>
> On Fri, May 30, 2014 at 2:56 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote:
>
>> On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote:
>>
>>> Hey guys, thanks for the insights. Also, I realize Hadoop has gotten way better about this with 2.2+, and I think that's great progress. We have well-defined API levels in Spark and also automated checking of API violations for new pull requests. When doing code reviews we always enforce the narrowest possible visibility:
>>>
>>> 1. private
>>> 2. private[spark]
>>> 3. @Experimental or @DeveloperApi
>>> 4. public
>>>
>>> Our automated checks exclude 1-3. Anything that breaks 4 will trigger a build failure.
>>
>> That's really excellent. Great job. I like the private[spark] visibility level; sounds like this is another way Scala has greatly improved on Java.
>>
>>> The Scala compiler prevents anyone external from using 1 or 2. We do have bytecode-public but annotated (3) APIs that we might change. We spent a lot of time looking into whether these can offer compiler warnings, but we haven't found a way to do this and do not see a better alternative at this point.
>>
>> It would be nice if the production build could strip this stuff out. Otherwise, it feels a lot like a @private, @unstable Hadoop API... and we know how those turned out.
>>
>>> Regarding Scala compatibility, Scala 2.11+ is source code compatible, meaning we'll be able to cross-compile Spark for different versions of Scala. We've already been in touch with Typesafe about this, and they've offered to integrate Spark into their compatibility test suite. They've also committed to patching 2.11 with a minor release if bugs are found.
>>
>> Thanks, I hadn't heard about this plan. Hopefully we can get everyone on 2.11 ASAP.
>>
>>> Anyways, my point is we've actually thought a lot about this already.
>>
>> The CLASSPATH thing is different than API stability, but indeed also a form of compatibility. This is something where I'd also like to see Spark have better isolation of user classes
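For readers less familiar with the Scala side, here is a small sketch of how those four levels look in source; the package and class names are made up for illustration:

    package org.apache.spark.myfeature

    import org.apache.spark.annotation.{DeveloperApi, Experimental}

    // 1. private: visible only inside the enclosing package
    private class InternalHelper

    // 2. private[spark]: usable anywhere under org.apache.spark,
    //    and the Scala compiler hides it from applications entirely
    private[spark] class SchedulerInternals

    // 3. bytecode-public but annotated: callable, yet may change between releases
    @DeveloperApi
    class LowLevelHook

    @Experimental
    class PreviewFeature

    // 4. public: covered by the automated API-stability checks
    class StablePublicApi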
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
On Sat, May 31, 2014 at 10:45 AM, Patrick Wendell pwend...@gmail.com wrote:

> One other consideration popped into my head:
>
> 5. Shading our dependencies could mess up our external APIs if we ever return types that are outside of the spark package, because we'd then be returning shaded types that users have to deal with. E.g. where before we returned an o.a.flume.AvroFlumeEvent, we'd have to return a some.namespace.AvroFlumeEvent. Then users downstream would have to deal with converting our types into the correct namespace if they want to inter-operate with other libraries. We generally try to avoid ever returning types from other libraries, but it would be good to audit our APIs and see if we ever do this.

That's a good point. It seems to me that if Spark is returning a type in the public API, that type is part of the public API (for better or worse). So this is a case where you wouldn't want to shade that type. But it would be nice to avoid doing this, for exactly the reasons you state...

> On Fri, May 30, 2014 at 10:54 PM, Patrick Wendell pwend...@gmail.com wrote:
>
>> Spark is a bit different than Hadoop MapReduce, so maybe that's a source of some confusion. Spark is often used as a substrate for building different types of analytics applications, so @DeveloperApi are internal APIs that we'd like to expose to application writers, but that might be more volatile. This is like the internal APIs in the Linux kernel: they aren't stable, but of course we try to minimize changes to them. If people want to write lower-level modules against them, that's fine with us, but they know the interfaces might change.

MapReduce is used as a substrate in a lot of cases, too. Hive has traditionally created MR jobs to do what it needs to do. Similarly, Oozie can create MR jobs.

It seems that @DeveloperApi is pretty similar to @LimitedPrivate in Hadoop. If I understand correctly, your hope is that frameworks will use @DeveloperApi, but individual application developers will steer clear. That is a good plan, as long as you can ensure that the framework developers are willing to lock their versions to a certain Spark version. Otherwise they will make the same arguments we've heard before: that they don't want to transition off of a deprecated @DeveloperApi because they want to keep support for Spark 1.0.0 (or whatever). We hear these arguments in Hadoop all the time... and now that Spark has a 1.0 release, they will carry more weight. Remember, Hadoop APIs started nice and simple too :)

>> This has worked pretty well over the years, even with many different companies writing against those APIs.
>>
>> @Experimental are user-facing features we are trying out. Hopefully that one is more clear.
>>
>> In terms of making a big jar that shades all of our dependencies, I'm curious how that would actually work in practice. It would be good to explore. There are a few potential challenges I see:
>>
>> 1. If any of our dependencies encode class name information in IPC messages, this would break. E.g. can you definitely shade the Hadoop client, protobuf, HBase client, etc. and have them send messages over the wire? This could break things if class names are ever encoded in a wire format.

Google protocol buffers assume a fixed schema; that is to say, they do not include metadata identifying the types of what is placed in them. The types are determined by convention. It is possible to change the Java package in which the protobuf classes reside with no harmful effects. (See HDFS-4909 for an example of this.) The RPC itself does include a Java class name for the interface we're talking to, though. The code for handling this is all under our control, so if we had to make any minor modifications to make shading work, we could.

>> 2. Many libraries like logging subsystems, configuration systems, etc. rely on static state and initialization. I'm not totally sure how e.g. slf4j initializes itself if you have both a shaded and a non-shaded copy of slf4j present.

I guess the worst-case scenario would be that the shaded version of slf4j creates a log file, but then the app's unshaded version overwrites that log file. I don't see how the two versions could cooperate, since they aren't sharing static state. The only solutions I can see are leaving slf4j unshaded, or setting up separate log files for spark-core versus the application. I haven't thought this through completely, but my gut feeling is that if you're sharing a log file, you probably want to share the logging code too.

>> 3. This would mean the spark-core jar would be really massive because it would inline all of our deps. We've actually been thinking of avoiding the current assembly jar approach because, due to Scala specialized classes, our assemblies now have more than 65,000 class files in them, leading to all kinds of bad issues. We'd have to stick with a big uber assembly-like jar if we decide to shade stuff.
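As a concrete sketch of what relocating a dependency involves, here is a build.sbt fragment using sbt-assembly's shade rules (a plugin feature that arrived after this discussion); the target package name is illustrative:

    // build.sbt fragment; assumes an sbt-assembly version with shading support.
    assemblyShadeRules in assembly := Seq(
      // Relocate protobuf classes into a Spark-private namespace. This is safe
      // only because protobuf messages carry no class-name metadata on the wire
      // (cf. HDFS-4909). slf4j is deliberately left unshaded so its static
      // logging state stays shared with the application.
      ShadeRule.rename("com.google.protobuf.**" -> "org.spark_project.protobuf.@1").inAll
    )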
SCALA_HOME or SCALA_LIBRARY_PATH not set during build
Hello,

Following the instructions for building Spark 1.0.0, I encountered the following error:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project spark-core_2.10: An Ant BuildException has occured: Please set the SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment variables and retry.
[ERROR] around Ant part ...fail message=Please set the SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment variables and retry @ 6:126 in /Users/soren/src/spark-1.0.0/core/target/antrun/build-main.xml

Nowhere in the documentation does it mention that Scala must be installed, that either of these env vars must be set, or what version should be installed. Setting these env vars wasn't required for 0.9.1 with sbt.

I was able to get past it by downloading the Scala 2.10.4 binary package to a temp dir and setting SCALA_HOME to that dir.

Ideally, it would be nice not to require people to have a standalone Scala installation, but at a minimum this requirement should be documented in the build instructions, no?

-Soren
Re: SCALA_HOME or SCALA_LIBRARY_PATH not set during build
Spark currently supports two build systems, sbt and Maven. sbt will download the correct version of Scala, but with Maven you need to supply it yourself and set SCALA_HOME. It sounds like the instructions need to be updated; perhaps create a JIRA?

best,
Colin