Re: SCALA_HOME or SCALA_LIBRARY_PATH not set during build

2014-05-31 Thread Colin McCabe
Spark currently supports two build systems, sbt and Maven.  sbt will
download the correct version of Scala, but with Maven you need to supply it
yourself and set SCALA_HOME.

It sounds like the instructions need to be updated-- perhaps create a JIRA?

best,
Colin


On Sat, May 31, 2014 at 7:06 PM, Soren Macbeth  wrote:

> Hello,
>
> Following the instructions for building spark 1.0.0, I encountered the
> following error:
>
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project
> spark-core_2.10: An Ant BuildException has occured: Please set the
> SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment
> variables and retry.
> [ERROR] around Ant part .. @ 6:126 in
> /Users/soren/src/spark-1.0.0/core/target/antrun/build-main.xml
>
> Nowhere in the documentation does it mention that Scala must be installed
> and either of these env vars set, nor what version of Scala is required.
> Setting these env vars wasn't required for 0.9.1 with sbt.
>
> I was able to get past it by downloading the scala 2.10.4 binary package to
> a temp dir and setting SCALA_HOME to that dir.
>
> Ideally, it would be nice not to require a standalone Scala installation,
> but at a minimum this requirement should be documented in the build
> instructions, no?
>
> -Soren
>


SCALA_HOME or SCALA_LIBRARY_PATH not set during build

2014-05-31 Thread Soren Macbeth
Hello,

Following the instructions for building spark 1.0.0, I encountered the
following error:

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project
spark-core_2.10: An Ant BuildException has occured: Please set the
SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment
variables and retry.
[ERROR] around Ant part .. @ 6:126 in
/Users/soren/src/spark-1.0.0/core/target/antrun/build-main.xml

Nowhere in the documentation does it mention that Scala must be installed
and either of these env vars set, nor what version of Scala is required.
Setting these env vars wasn't required for 0.9.1 with sbt.

I was able to get past it by downloading the scala 2.10.4 binary package to
a temp dir and setting SCALA_HOME to that dir.

Ideally, it would be nice not to require a standalone Scala installation,
but at a minimum this requirement should be documented in the build
instructions, no?

-Soren


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-31 Thread Colin McCabe
On Sat, May 31, 2014 at 10:45 AM, Patrick Wendell 
wrote:

> One other consideration popped into my head:
>
> 5. Shading our dependencies could mess up our external API's if we
> ever return types that are outside of the spark package because we'd
> then be returned shaded types that users have to deal with. E.g. where
> before we returned an o.a.flume.AvroFlumeEvent we'd have to return a
> some.namespace.AvroFlumeEvent. Then users downstream would have to
> deal with converting our types into the correct namespace if they want
> to inter-operate with other libraries. We generally try to avoid ever
> returning types from other libraries, but it would be good to audit
> our API's and see if we ever do this.


That's a good point.  It seems to me that if Spark is returning a type in
the public API, that type is part of the public API (for better or worse).
 So this is a case where you wouldn't want to shade that type.  But it
would be nice to avoid doing this, for exactly the reasons you state...
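
To make the concern concrete, here is a small, self-contained Scala sketch of
the problem; every package and type name in it is invented for illustration
and merely stands in for a real dependency type like Flume's AvroFlumeEvent:

// The third-party type as users know it (a hypothetical stand-in for a real
// dependency class).
object thirdparty { case class FlumeEvent(body: String) }

// The same class after a hypothetical shading/relocation step.
object shadedthirdparty { case class FlumeEvent(body: String) }

object StreamApi {
  // Unshaded: callers receive thirdparty.FlumeEvent and can hand it straight
  // to other libraries compiled against that same package.
  def nextEvent(): thirdparty.FlumeEvent = thirdparty.FlumeEvent("payload")
}

object Caller {
  // User code written against the original package name.
  val e: thirdparty.FlumeEvent = StreamApi.nextEvent()
  // If StreamApi were rebuilt to return the relocated copy, nextEvent() would
  // return shadedthirdparty.FlumeEvent and this line would no longer compile.
}

Once a foreign type shows up in a public signature, relocating it quietly
rewrites that signature for every caller.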

> On Fri, May 30, 2014 at 10:54 PM, Patrick Wendell 
> wrote:
> > Spark is a bit different than Hadoop MapReduce, so maybe that's a
> > source of some confusion. Spark is often used as a substrate for
> > building different types of analytics applications, so @DeveloperAPI
> > are internal API's that we'd like to expose to application writers,
> > but that might be more volatile. This is like the internal API's in
> > the linux kernel, they aren't stable, but of course we try to minimize
> > changes to them. If people want to write lower-level modules against
> > them, that's fine with us, but they know the interfaces might change.
>

MapReduce is used as a substrate in a lot of cases, too.  Hive has
traditionally created MR jobs to do what it needs to do.  Similarly, Oozie
can create MR jobs.  It seems that @DeveloperAPI is pretty similar to
@LimitedPrivate in Hadoop.  If I understand correctly, your hope is that
frameworks will use @DeveloperAPI, but individual application developers
will steer clear.  That is a good plan, as long as you can ensure that the
framework developers are willing to lock their versions to a certain Spark
version.  Otherwise they will make the same arguments we've heard before,
that they don't want to transition off of a deprecated @DeveloperAPI
because they want to keep support for Spark 1.0.0 (or whatever).  We hear
these arguments in Hadoop all the time...  Now that Spark has a 1.0 release,
they will carry more weight.  Remember, Hadoop APIs started nice and simple
too :)
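
For readers outside the Spark codebase, here is a rough sketch of how such a
marker annotation behaves; this is an illustrative stand-in, not Spark's
actual annotation source, and the hook class below is made up:

import scala.annotation.StaticAnnotation

// Illustrative stand-in for a marker like @DeveloperApi: it documents intent
// but carries no compile-time enforcement.
class DeveloperApi extends StaticAnnotation

// A made-up "framework-facing" hook annotated as a developer API.
@DeveloperApi
class SchedulerBackendHook {
  @DeveloperApi
  def onExecutorLost(executorId: String): Unit = ()
}

The marker is documentation only: nothing at compile time stops a framework
from depending on the annotated member and then lobbying against changes to
it later.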

>
> > This has worked pretty well over the years, even with many different
> > companies writing against those API's.
> >
> > @Experimental are user-facing features we are trying out. Hopefully
> > that one is more clear.
> >
> > In terms of making a big jar that shades all of our dependencies - I'm
> > curious how that would actually work in practice. It would be good to
> > explore. There are a few potential challenges I see:
> >
> > 1. If any of our dependencies encode class name information in IPC
> > messages, this would break. E.g. can you definitely shade the Hadoop
> > client, protobuf, hbase client, etc and have them send messages over
> > the wire? This could break things if class names are ever encoded in a
> > wire format.
>

Google protocol buffers assume a fixed schema.  That is to say, they do not
include metadata identifying the types of what is placed in them; the
types are determined by convention.  It is possible to change the Java
package in which the protobuf classes reside with no harmful effects.  (See
HDFS-4909 for an example of this.)  The RPC itself does include a Java
class name for the interface we're talking to.  The code for handling this
is all under our control, though, so if we had to make any minor
modifications to make shading work, we could.

> > 2. Many libraries like logging subsystems, configuration systems, etc
> > rely on static state and initialization. I'm not totally sure how e.g.
> > slf4j initializes itself if you have both a shaded and non-shaded copy
> > of slf4j present.
>

I guess the worst case scenario would be that the shaded version of slf4j
creates a log file, but then the app's unshaded version overwrites that log
file.  I don't see how the two versions could "cooperate" since they aren't
sharing static state.  The only solutions I can see are leaving slf4j
unshaded, or setting up separate log files for the spark-core versus the
application.  I haven't thought this through completely, but my gut feeling
is that if you're sharing a log file, you probably want to share the
logging code too.


> > 3. This would mean the spark-core jar would be really massive because
> > it would inline all of our deps. We've actually been thinking of
> > avoiding the current assembly jar approach because, due to scala
> > specialized classes, our assemblies now have more than 65,000 class
> > files in them leading to all kinds of bad issues. We'd have to stick
> > with a big uber assembly-like jar if we decide to shade stuff.

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-31 Thread Patrick Wendell
One other consideration popped into my head:

5. Shading our dependencies could mess up our external API's if we
ever return types that are outside of the spark package because we'd
then be returned shaded types that users have to deal with. E.g. where
before we returned an o.a.flume.AvroFlumeEvent we'd have to return a
some.namespace.AvroFlumeEvent. Then users downstream would have to
deal with converting our types into the correct namespace if they want
to inter-operate with other libraries. We generally try to avoid ever
returning types from other libraries, but it would be good to audit
our API's and see if we ever do this.

On Fri, May 30, 2014 at 10:54 PM, Patrick Wendell  wrote:
> Spark is a bit different than Hadoop MapReduce, so maybe that's a
> source of some confusion. Spark is often used as a substrate for
> building different types of analytics applications, so @DeveloperAPI
> are internal API's that we'd like to expose to application writers,
> but that might be more volatile. This is like the internal API's in
> the linux kernel, they aren't stable, but of course we try to minimize
> changes to them. If people want to write lower-level modules against
> them, that's fine with us, but they know the interfaces might change.
>
> This has worked pretty well over the years, even with many different
> companies writing against those API's.
>
> @Experimental are user-facing features we are trying out. Hopefully
> that one is more clear.
>
> In terms of making a big jar that shades all of our dependencies - I'm
> curious how that would actually work in practice. It would be good to
> explore. There are a few potential challenges I see:
>
> 1. If any of our dependencies encode class name information in IPC
> messages, this would break. E.g. can you definitely shade the Hadoop
> client, protobuf, hbase client, etc and have them send messages over
> the wire? This could break things if class names are ever encoded in a
> wire format.
> 2. Many libraries like logging subsystems, configuration systems, etc
> rely on static state and initialization. I'm not totally sure how e.g.
> slf4j initializes itself if you have both a shaded and non-shaded copy
> of slf4j present.
> 3. This would mean the spark-core jar would be really massive because
> it would inline all of our deps. We've actually been thinking of
> avoiding the current assembly jar approach because, due to scala
> specialized classes, our assemblies now have more than 65,000 class
> files in them leading to all kinds of bad issues. We'd have to stick
> with a big uber assembly-like jar if we decide to shade stuff.
> 4. I'm not totally sure how this would work when people want to e.g.
> build Spark with different Hadoop versions. Would we publish different
> shaded uber-jars for every Hadoop version? Would the Hadoop dep just
> not be shaded... if so, what about all its dependencies?
>
> Anyways just some things to consider... simplifying our classpath is
> definitely an avenue worth exploring!
>
>
>
>
> On Fri, May 30, 2014 at 2:56 PM, Colin McCabe  wrote:
>> On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell  wrote:
>>
>>> Hey guys, thanks for the insights. Also, I realize Hadoop has gotten
>>> way better about this with 2.2+ and I think it's great progress.
>>>
>>> We have well defined API levels in Spark and also automated checking
>>> of API violations for new pull requests. When doing code reviews we
>>> always enforce the narrowest possible visibility:
>>>
>>> 1. private
>>> 2. private[spark]
>>> 3. @Experimental or @DeveloperApi
>>> 4. public
>>>
>>> Our automated checks exclude 1-3. Anything that breaks 4 will trigger
>>> a build failure.
>>>
>>>
>> That's really excellent.  Great job.
>>
>> I like the private[spark] visibility level-- sounds like this is another
>> way Scala has greatly improved on Java.
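
As a minimal sketch of the four levels, for readers less familiar with Scala
(all names below are illustrative, not real Spark code):

package org.apache.spark.example

object VisibilityDemo {
  // (1) private: visible only inside this object.
  private def internalOnly(): Int = 1

  // (2) private[spark]: visible anywhere under the org.apache.spark package,
  //     but invisible to user code outside it.
  private[spark] def sparkInternal(): Int = 2

  // (3) bytecode-public but flagged as unstable, e.g. with a marker
  //     annotation such as @DeveloperApi or @Experimental.
  def developerHook(): Int = 3

  // (4) plain public: part of the API that the automated checks guard.
  def stablePublicApi(): Int = 4
}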
>>
>>> The Scala compiler prevents anyone external from using 1 or 2. We do
>>> have "bytecode public but annotated" (3) API's that we might change.
>>> We spent a lot of time looking into whether these can offer compiler
>>> warnings, but we haven't found a way to do this and do not see a
>>> better alternative at this point.
>>>
>>
>> It would be nice if the production build could strip this stuff out.
>>  Otherwise, it feels a lot like a @private, @unstable Hadoop API... and we
>> know how those turned out.
>>
>>
>>> Regarding Scala compatibility, Scala 2.11+ is "source code
>>> compatible", meaning we'll be able to cross-compile Spark for
>>> different versions of Scala. We've already been in touch with Typesafe
>>> about this and they've offered to integrate Spark into their
>>> compatibility test suite. They've also committed to patching 2.11 with
>>> a minor release if bugs are found.
>>>
>>
>> Thanks, I hadn't heard about this plan.  Hopefully we can get everyone on
>> 2.11 ASAP.
>>
>>
>>> Anyways, my point is we've actually thought a lot about this already.
>>>
>>> The CLASSPATH thing is different than API stability, but indeed al

Re: Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread Patrick Wendell
Can you look at the logs from the executor or in the UI? They should
give an exception with the reason for the task failure. Also in the
future, for this type of e-mail please only e-mail the "user@" list
and not both lists.

- Patrick

On Sat, May 31, 2014 at 3:22 AM, prabeesh k  wrote:
> Hi,
>
> Scenario: read data from HDFS, apply a Hive query on it, and write the
> result back to HDFS.
>
> Schema creation, querying, and saveAsTextFile are working fine in the
> following modes:
>
> local mode
> mesos cluster with single node
> spark cluster with multi node
>
> Schema creation and querying are working fine with the multi-node Mesos
> cluster, but while trying to write back to HDFS using saveAsTextFile, the
> following error occurs:
>
>  14/05/30 10:16:35 INFO DAGScheduler: The failed fetch was from Stage 4
> (mapPartitionsWithIndex at Operator.scala:333); marking it for resubmission
> 14/05/30 10:16:35 INFO DAGScheduler: Executor lost:
> 201405291518-3644595722-5050-17933-1 (epoch 148)
>
> Let me know your thoughts regarding this.
>
> Regards,
> prabeesh


Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread prabeesh k
Hi,

Scenario: read data from HDFS, apply a Hive query on it, and write the
result back to HDFS.

Schema creation, querying, and saveAsTextFile are working fine in the
following modes:

   - local mode
   - mesos cluster with single node
   - spark cluster with multi node

Schema creation and querying are working fine with the multi-node Mesos
cluster, but while trying to write back to HDFS using saveAsTextFile, the
following error occurs:

14/05/30 10:16:35 INFO DAGScheduler: The failed fetch was from Stage 4
(mapPartitionsWithIndex at Operator.scala:333); marking it for resubmission
14/05/30 10:16:35 INFO DAGScheduler: Executor lost:
201405291518-3644595722-5050-17933-1 (epoch 148)

Let me know your thoughts regarding this.

Regards,
prabeesh
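
For reference, here is a minimal sketch of the kind of pipeline described
above, written against the Spark 1.0-era Hive API; the table, query, and
output path are invented for illustration and this is not the poster's
actual code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveToHdfs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveToHdfs"))
    val hive = new HiveContext(sc)

    // Schema creation and querying: reported to work in all of the modes above.
    hive.hql("CREATE TABLE IF NOT EXISTS logs (line STRING)")
    val result = hive.hql("SELECT line FROM logs")

    // The step that reportedly fails on the multi-node Mesos cluster.
    result.saveAsTextFile("hdfs:///tmp/query-output")

    sc.stop()
  }
}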


Fwd: Monitoring / Instrumenting jobs in 1.0

2014-05-31 Thread Mayur Rustagi
We have a JSON feed of the Spark application interface that we use for easier
instrumentation & monitoring. Has that been considered, or found relevant?
We already sent it as a pull request against 0.9.0; would that work, or
should we update it to 1.0.0?


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



-- Forwarded message --
From: Patrick Wendell 
Date: Sat, May 31, 2014 at 9:09 AM
Subject: Re: Monitoring / Instrumenting jobs in 1.0
To: u...@spark.apache.org


The main change here was refactoring the SparkListener interface which
is where we expose internal state about a Spark job to other
applications. We've cleaned up these API's a bunch and also added a
way to log all data as JSON for post-hoc analysis:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala
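
For anyone looking for a starting point, below is a minimal sketch of
registering a custom listener against that interface; it assumes the Spark
1.0 event and field names from the linked source, so treat it as a sketch
rather than a reference:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerTaskEnd}

// A toy listener that prints per-task and per-job information as it arrives.
class SimpleMetricsListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    println(s"task ${taskEnd.taskInfo.taskId} finished in ${taskEnd.taskInfo.duration} ms")
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    println(s"job ${jobEnd.jobId} ended with result ${jobEnd.jobResult}")
  }
}

object ListenerDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ListenerDemo").setMaster("local"))
    sc.addSparkListener(new SimpleMetricsListener)
    sc.parallelize(1 to 100).map(_ * 2).count()
    sc.stop()
  }
}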

- Patrick

On Fri, May 30, 2014 at 7:09 AM, Daniel Siegmann
 wrote:
> The Spark 1.0.0 release notes state "Internal instrumentation has been added
> to allow applications to monitor and instrument Spark jobs." Can anyone
> point me to the docs for this?
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegm...@velos.io W: www.velos.io