Assorted project updates (tests, build, etc)

2014-06-22 Thread Patrick Wendell
Hey All,

1. The original test infrastructure hosted by the AMPLab has been
fully restored and also expanded with many more executor slots for
tests. Thanks to Matt Massie at the AMPLab for helping with this.

2. We now have a nightly build matrix across different Hadoop
versions. It appears that the Maven build is failing tests with some
of the newer Hadoop versions. If people from the community are
interested, patches that diagnose and fix these test issues would be
very welcome (they are all dependency related).

https://issues.apache.org/jira/browse/SPARK-2232

3. Prashant Sharma has spent a lot of time making it possible for our
sbt build to read dependencies from Maven. This will save us a huge
amount of headache in keeping the builds consistent. I just wanted to
give a heads up to users about this - we should retain compatibility
with features of the sbt build, but if you are e.g. hooking into deep
internals of our build it may affect you. I'm hoping this can be
updated and merged in the next week:

https://github.com/apache/spark/pull/77

4. We've moved most of the documentation over to recommending users
build with Maven when creating official packages. This is just to
provide a single "reference build" of Spark: it's the one we test
and package for releases, the one where we make sure all transitive
dependencies are correct, etc. I'd recommend that all downstream
packagers use this build.

For day-to-day development I imagine sbt will remain more popular
(repl, incremental builds, etc). Prashant's work allows us to get the
"best of both worlds" which is great.

- Patrick


Re: Scala examples for Spark do not work as written in documentation

2014-06-20 Thread Patrick Wendell
Those are pretty old - but I think the reason Matei did that was to
make it less confusing for brand new users. `spark` is actually a
valid identifier because it's just a variable name (val spark = new
SparkContext()) but I agree this could be confusing for users who want
to drop into the shell.
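For anyone pasting the website examples into the shell, here is a minimal
sketch of the Pi example as it runs in spark-shell, where the pre-built
context is bound to `sc` (NUM_SAMPLES is an illustrative value, not
something the shell defines for you):

// In spark-shell the context is already available as `sc`; in a standalone
// program you would construct it yourself, e.g. val spark = new SparkContext(conf).
val NUM_SAMPLES = 100000  // illustrative sample count
val count = sc.parallelize(1 to NUM_SAMPLES).map { _ =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)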

On Fri, Jun 20, 2014 at 12:04 PM, Will Benton  wrote:
> Hey, sorry to reanimate this thread, but just a quick question:  why do the 
> examples (on http://spark.apache.org/examples.html) use "spark" for the 
> SparkContext reference?  This is minor, but it seems like it could be a 
> little confusing for people who want to run them in the shell and need to 
> change "spark" to "sc".  (I noticed because this was a speedbump for a 
> colleague who is trying out Spark.)
>
>
> thanks,
> wb
>
> - Original Message -
>> From: "Andy Konwinski" 
>> To: dev@spark.apache.org
>> Sent: Tuesday, May 20, 2014 4:06:33 PM
>> Subject: Re: Scala examples for Spark do not work as written in documentation
>>
>> I fixed the bug, but I kept the parameter "i" instead of "_" since that (1)
>> keeps it more parallel to the python and java versions which also use
>> functions with a named variable and (2) doesn't require readers to know
>> this particular use of the "_" syntax in Scala.
>>
>> Thanks for catching this Glenn.
>>
>> Andy
>>
>>
>> On Fri, May 16, 2014 at 12:38 PM, Mark Hamstra
>> wrote:
>>
>> > Sorry, looks like an extra line got inserted in there.  One more try:
>> >
>> > val count = spark.parallelize(1 to NUM_SAMPLES).map { _ =>
>> >   val x = Math.random()
>> >   val y = Math.random()
>> >   if (x*x + y*y < 1) 1 else 0
>> > }.reduce(_ + _)
>> >
>> >
>> >
>> > On Fri, May 16, 2014 at 12:36 PM, Mark Hamstra > > >wrote:
>> >
>> > > Actually, the better way to write the multi-line closure would be:
>> > >
>> > > val count = spark.parallelize(1 to NUM_SAMPLES).map { _ =>
>> > >
>> > >   val x = Math.random()
>> > >   val y = Math.random()
>> > >   if (x*x + y*y < 1) 1 else 0
>> > > }.reduce(_ + _)
>> > >
>> > >
>> > > On Fri, May 16, 2014 at 9:41 AM, GlennStrycker > > >wrote:
>> > >
>> > >> On the webpage http://spark.apache.org/examples.html, there is an
>> > example
>> > >> written as
>> > >>
>> > >> val count = spark.parallelize(1 to NUM_SAMPLES).map(i =>
>> > >>   val x = Math.random()
>> > >>   val y = Math.random()
>> > >>   if (x*x + y*y < 1) 1 else 0
>> > >> ).reduce(_ + _)
>> > >> println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)
>> > >>
>> > >> This does not execute in Spark, which gives me an error:
>> > >> :2: error: illegal start of simple expression
>> > >>  val x = Math.random()
>> > >>  ^
>> > >>
>> > >> If I rewrite the query slightly, adding in {}, it works:
>> > >>
>> > >> val count = spark.parallelize(1 to 1).map(i =>
>> > >>{
>> > >>val x = Math.random()
>> > >>val y = Math.random()
>> > >>if (x*x + y*y < 1) 1 else 0
>> > >>}
>> > >> ).reduce(_ + _)
>> > >> println("Pi is roughly " + 4.0 * count / 1.0)
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >> View this message in context:
>> > >>
>> > http://apache-spark-developers-list.1001551.n3.nabble.com/Scala-examples-for-Spark-do-not-work-as-written-in-documentation-tp6593.html
>> > >> Sent from the Apache Spark Developers List mailing list archive at
>> > >> Nabble.com.
>> > >>
>> > >
>> > >
>> >
>>


Re: Trailing Tasks Saving to HDFS

2014-06-19 Thread Patrick Wendell
I'll make a comment on the JIRA - thanks for reporting this, let's get
to the bottom of it.

On Thu, Jun 19, 2014 at 11:19 AM, Surendranauth Hiraman
 wrote:
> I've created an issue for this but if anyone has any advice, please let me
> know.
>
> Basically, on about 10 GBs of data, saveAsTextFile() to HDFS hangs on two
> remaining tasks (out of 320). Those tasks seem to be waiting on data from
> another task on another node. Eventually (about 2 hours later) they time out
> with a connection reset by peer.
>
> All the data actually seems to be on HDFS as the expected part files. It
> just seems like the remaining tasks have corrupted "metadata", so that they
> do not realize that they are done. Just a guess though.
>
> https://issues.apache.org/jira/browse/SPARK-2202
>
> -Suren
>
>
>
>
> On Wed, Jun 18, 2014 at 8:35 PM, Surendranauth Hiraman
>  wrote:
>>
>> Looks like eventually there was some type of reset or timeout and the
>> tasks have been reassigned. I'm guessing they'll keep failing until max
>> failure count.
>>
>> The machine it disconnected from was a remote machine, though I've seen
>> such failures from connections to itself with other problems. The log lines
>> from the remote machine are also below.
>>
>> Any thoughts or guesses would be appreciated!
>>
>> "HUNG" WORKER
>>
>> 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from
>> connection to ConnectionManagerId(172.16.25.103,57626)
>>
>> java.io.IOException: Connection reset by peer
>>
>> at sun.nio.ch.FileDispatcher.read0(Native Method)
>>
>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>
>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>>
>> at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>>
>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>>
>> at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
>>
>> at
>> org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>
>> at java.lang.Thread.run(Thread.java:679)
>>
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Handling connection
>> error on connection to ConnectionManagerId(172.16.25.103,57626)
>>
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
>> ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
>>
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
>> SendingConnection to ConnectionManagerId(172.16.25.103,57626)
>>
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
>> ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
>>
>> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding
>> SendingConnectionManagerId not found
>>
>>
>> REMOTE WORKER
>>
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
>> ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
>>
>> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding
>> SendingConnectionManagerId not found
>>
>>
>>
>>
>> On Wed, Jun 18, 2014 at 7:16 PM, Surendranauth Hiraman
>>  wrote:
>>>
>>> I have a flow that ends with saveAsTextFile() to HDFS.
>>>
>>> It seems all the expected files per partition have been written out,
>>> based on the number of part files and the file sizes.
>>>
>>> But the driver logs show 2 tasks still not completed and has no activity
>>> and the worker logs show no activity for those two tasks for a while now.
>>>
>>> Has anyone run into this situation? It's happened to me a couple of times
>>> now.
>>>
>>> Thanks.
>>>
>>> -- Suren
>>>
>>> SUREN HIRAMAN, VP TECHNOLOGY
>>> Velos
>>> Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR
>>> NEW YORK, NY 10001
>>> O: (917) 525-2466 ext. 105
>>> F: 646.349.4063
>>> E: suren.hira...@velos.io
>>> W: www.velos.io
>>>
>>
>>
>>
>> --
>>
>> SUREN HIRAMAN, VP TECHNOLOGY
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR
>> NEW YORK, NY 10001
>> O: (917) 525-2466 ext. 105
>> F: 646.349.4063
>> E: suren.hira...@velos.io
>> W: www.velos.io
>>
>
>
>
> --
>
> SUREN HIRAMAN, VP TECHNOLOGY
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR
> NEW YORK, NY 10001
> O: (917) 525-2466 ext. 105
> F: 646.349.4063
> E: suren.hira...@velos.io
> W: www.velos.io
>


Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-18 Thread Patrick Wendell
Just wondering, do you get this particular exception if you are not
consolidating shuffle data?
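For anyone who wants to test that, shuffle consolidation is controlled by a
single SparkConf flag (it also appears in the quoted config below). A minimal
sketch of turning it off for a comparison run; the app name and master URL
here are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-consolidation-off")          // placeholder
  .setMaster("spark://your-master:7077")            // placeholder
  .set("spark.shuffle.consolidateFiles", "false")   // off is the default
val sc = new SparkContext(conf)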

On Wed, Jun 18, 2014 at 12:15 PM, Mridul Muralidharan  wrote:
> On Wed, Jun 18, 2014 at 6:19 PM, Surendranauth Hiraman
>  wrote:
>> Patrick,
>>
>> My team is using shuffle consolidation but not speculation. We are also
>> using persist(DISK_ONLY) for caching.
>
>
> Use of shuffle consolidation is probably what is causing the issue.
> Would be good idea to try again with that turned off (which is the default).
>
> It should get fixed most likely in 1.1 timeframe.
>
>
> Regards,
> Mridul
>
>
>>
>> Here are some config changes that are in our work-in-progress.
>>
>> We've been trying for 2 weeks to get our production flow (maybe around
>> 50-70 stages, a few forks and joins with up to 20 branches in the forks) to
>> run end to end without any success, running into other problems besides
>> this one as well. For example, we have run into situations where saving to
>> HDFS just hangs on a couple of tasks, which are printing out nothing in
>> their logs and not taking any CPU. For testing, our input data is 10 GB
>> across 320 input splits and generates maybe around 200-300 GB of
>> intermediate and final data.
>>
>>
>> conf.set("spark.executor.memory", "14g") // TODO make this
>> configurable
>>
>> // shuffle configs
>> conf.set("spark.default.parallelism", "320") // TODO make this
>> configurable
>> conf.set("spark.shuffle.consolidateFiles","true")
>>
>> conf.set("spark.shuffle.file.buffer.kb", "200")
>> conf.set("spark.reducer.maxMbInFlight", "96")
>>
>> conf.set("spark.rdd.compress","true"
>>
>> // we ran into a problem with the default timeout of 60 seconds
>> // this is also being set in the master's spark-env.sh. Not sure if
>> it needs to be in both places
>> conf.set("spark.worker.timeout","180")
>>
>> // akka settings
>> conf.set("spark.akka.threads", "300")
>>     conf.set("spark.akka.timeout", "180")
>> conf.set("spark.akka.frameSize", "100")
>> conf.set("spark.akka.batchSize", "30")
>> conf.set("spark.akka.askTimeout", "30")
>>
>> // block manager
>> conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
>> conf.set("spark.blockManagerHeartBeatMs", "8")
>>
>> -Suren
>>
>>
>>
>> On Wed, Jun 18, 2014 at 1:42 AM, Patrick Wendell  wrote:
>>
>>> Out of curiosity - are you guys using speculation, shuffle
>>> consolidation, or any other non-default option? If so that would help
>>> narrow down what's causing this corruption.
>>>
>>> On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
>>>  wrote:
>>> > Matt/Ryan,
>>> >
>>> > Did you make any headway on this? My team is running into this also.
>>> > Doesn't happen on smaller datasets. Our input set is about 10 GB but we
>>> > generate 100s of GBs in the flow itself.
>>> >
>>> > -Suren
>>> >
>>> >
>>> >
>>> >
>>> > On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton 
>>> wrote:
>>> >
>>> >> Just ran into this today myself. I'm on branch-1.0 using a CDH3
>>> >> cluster (no modifications to Spark or its dependencies). The error
>>> >> appeared trying to run GraphX's .connectedComponents() on a ~200GB
>>> >> edge list (GraphX worked beautifully on smaller data).
>>> >>
>>> >> Here's the stacktrace (it's quite similar to yours
>>> >> https://imgur.com/7iBA4nJ ).
>>> >>
>>> >> 14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed
>>> >> 4 times; aborting job
>>> >> 14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at
>>> >> VertexRDD.scala:100
>>> >> Exception in thread "main" org.apache.spark.SparkException: Job
>>> >> aborted due to stage failure: Task 5.599:39 failed 4 times, most
>>> >> recent failure: Exception failure in TID 29735 on host node18:
>

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-17 Thread Patrick Wendell
Out of curiosity - are you guys using speculation, shuffle
consolidation, or any other non-default option? If so that would help
narrow down what's causing this corruption.

On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
 wrote:
> Matt/Ryan,
>
> Did you make any headway on this? My team is running into this also.
> Doesn't happen on smaller datasets. Our input set is about 10 GB but we
> generate 100s of GBs in the flow itself.
>
> -Suren
>
>
>
>
> On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton  wrote:
>
>> Just ran into this today myself. I'm on branch-1.0 using a CDH3
>> cluster (no modifications to Spark or its dependencies). The error
>> appeared trying to run GraphX's .connectedComponents() on a ~200GB
>> edge list (GraphX worked beautifully on smaller data).
>>
>> Here's the stacktrace (it's quite similar to yours
>> https://imgur.com/7iBA4nJ ).
>>
>> 14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed
>> 4 times; aborting job
>> 14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at
>> VertexRDD.scala:100
>> Exception in thread "main" org.apache.spark.SparkException: Job
>> aborted due to stage failure: Task 5.599:39 failed 4 times, most
>> recent failure: Exception failure in TID 29735 on host node18:
>> java.io.StreamCorruptedException: invalid type code: AC
>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355)
>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
>>
>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>>
>> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
>> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
>> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>
>> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>>
>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>
>> org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192)
>>
>> org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78)
>>
>> org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
>>
>> org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
>> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>>
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>> org.apache.spark.scheduler.Task.run(Task.scala:51)
>>
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> java.lang.Thread.run(Thread.java:662)
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
>> at scala.Option.foreach(Option.scala:236)
>> at
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>> at
>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at
>> scala.concurren

Re: Emergency maintenance on Jenkins

2014-06-10 Thread Patrick Wendell
Hey just to update people - as of around 1pm PT we were back up and
running with Jenkins slaves on EC2. Sorry about the disruption.

- Patrick

On Tue, Jun 10, 2014 at 1:15 AM, Patrick Wendell  wrote:
> No luck with this tonight - unfortunately our Python tests aren't
> working well with Python 2.6 and some other issues made it hard to get
> the EC2 worker up to speed. Hopefully we can have this up and running
> tomorrow.
>
> - Patrick
>
> On Mon, Jun 9, 2014 at 10:17 PM, Patrick Wendell  wrote:
>> Just a heads up - due to an outage at UCB we've lost several of the
>> Jenkins slaves. I'm trying to spin up new slaves on EC2 in order to
>> compensate, but this might fail some ongoing builds.
>>
>> The good news is if we do get it working with EC2 workers, then we
>> will have burst capability in the future - e.g. on release deadlines.
>> So it's not all bad!
>>
>> - Patrick


Re: Emergency maintenance on Jenkins

2014-06-10 Thread Patrick Wendell
No luck with this tonight - unfortunately our Python tests aren't
working well with Python 2.6 and some other issues made it hard to get
the EC2 worker up to speed. Hopefully we can have this up and running
tomorrow.

- Patrick

On Mon, Jun 9, 2014 at 10:17 PM, Patrick Wendell  wrote:
> Just a heads up - due to an outage at UCB we've lost several of the
> Jenkins slaves. I'm trying to spin up new slaves on EC2 in order to
> compensate, but this might fail some ongoing builds.
>
> The good news is if we do get it working with EC2 workers, then we
> will have burst capability in the future - e.g. on release deadlines.
> So it's not all bad!
>
> - Patrick


Emergency maintenance on Jenkins

2014-06-09 Thread Patrick Wendell
Just a heads up - due to an outage at UCB we've lost several of the
Jenkins slaves. I'm trying to spin up new slaves on EC2 in order to
compensate, but this might fail some ongoing builds.

The good news is if we do get it working with EC2 workers, then we
will have burst capability in the future - e.g. on release deadlines.
So it's not all bad!

- Patrick


Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-08 Thread Patrick Wendell
Okay I think I've isolated this a bit more. Let's discuss over on the JIRA:

https://issues.apache.org/jira/browse/SPARK-2075

On Sun, Jun 8, 2014 at 1:16 PM, Paul Brown  wrote:
>
> Hi, Patrick --
>
> Java 7 on the development machines:
>
> » java -version
> 1 ↵
> java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
>
>
> And on the deployed boxes:
>
> $ java -version
> java version "1.7.0_55"
> OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
> OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
>
>
> Also, "unzip -l" in place of "jar tvf" gives the same results, so I don't
> think it's an issue with jar not reporting the files.  Also, the classes do
> get correctly packaged into the uberjar:
>
> unzip -l /target/[deleted]-driver.jar | grep 'rdd/RDD' | grep 'saveAs'
>  1519  06-08-14 12:05
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>  1560  06-08-14 12:05
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
>
>
> Best.
> -- Paul
>
> —
> p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>
>
> On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell  wrote:
>>
>> Paul,
>>
>> Could you give the version of Java that you are building with and the
>> version of Java you are running with? Are they the same?
>>
>> Just off the cuff, I wonder if this is related to:
>> https://issues.apache.org/jira/browse/SPARK-1520
>>
>> If it is, it could appear that certain functions are not in the jar
>> because they go beyond the extended zip boundary `jar tvf` won't list
>> them.
>>
>> - Patrick
>>
>> On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown  wrote:
>> > Moving over to the dev list, as this isn't a user-scope issue.
>> >
>> > I just ran into this issue with the missing saveAsTextFile, and here's a
>> > little additional information:
>> >
>> > - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases.
>> > - Driver built as an uberjar via Maven.
>> > - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with
>> > Spark 1.0.0-hadoop1 downloaded from Apache.
>> >
>> > Given that it functions correctly in local mode but not in a standalone
>> > cluster, this suggests to me that the issue is in a difference between
>> > the
>> > Maven version and the hadoop1 version.
>> >
>> > In the spirit of taking the computer at its word, we can just have a
>> > look
>> > in the JAR files.  Here's what's in the Maven dep as of 1.0.0:
>> >
>> > jar tvf
>> >
>> > ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>> > | grep 'rdd/RDD' | grep 'saveAs'
>> >   1519 Mon May 26 13:57:58 PDT 2014
>> > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>> >   1560 Mon May 26 13:57:58 PDT 2014
>> > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
>> >
>> >
>> > And here's what's in the hadoop1 distribution:
>> >
>> > jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep
>> > 'saveAs'
>> >
>> >
>> > I.e., it's not there.  It is in the hadoop2 distribution:
>> >
>> > jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep
>> > 'saveAs'
>> >   1519 Mon May 26 07:29:54 PDT 2014
>> > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>> >   1560 Mon May 26 07:29:54 PDT 2014
>> > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
>> >
>> >
>> > So something's clearly broken with the way that the distribution
>> > assemblies
>> > are created.
>> >
>> > FWIW and IMHO, the "right" way to publish the hadoop1 and hadoop2
>> > flavors
>> > of Spark to Maven Central would be as *entirely different* artifacts
>> > (spark-core-h1, spark-core-h2).
>> >
>> > Logged as SPARK-2075 <https://issues.apache.org/jira/browse/SPARK-2075>.
>> >
>> > Cheers.
>> > -- Paul
>> >
>> >
>> >
>> > --
>> > p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>> >
>> >
>> > On Fri, Jun 6, 2014 at 2:45 AM, HenriV  wrote:
>> 

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-08 Thread Patrick Wendell
Also I should add - thanks for taking time to help narrow this down!

On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell  wrote:
> Paul,
>
> Could you give the version of Java that you are building with and the
> version of Java you are running with? Are they the same?
>
> Just off the cuff, I wonder if this is related to:
> https://issues.apache.org/jira/browse/SPARK-1520
>
> If it is, it could appear that certain functions are not in the jar
> because they go beyond the extended zip boundary `jar tvf` won't list
> them.
>
> - Patrick
>
> On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown  wrote:
>> Moving over to the dev list, as this isn't a user-scope issue.
>>
>> I just ran into this issue with the missing saveAsTextFile, and here's a
>> little additional information:
>>
>> - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases.
>> - Driver built as an uberjar via Maven.
>> - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with
>> Spark 1.0.0-hadoop1 downloaded from Apache.
>>
>> Given that it functions correctly in local mode but not in a standalone
>> cluster, this suggests to me that the issue is in a difference between the
>> Maven version and the hadoop1 version.
>>
>> In the spirit of taking the computer at its word, we can just have a look
>> in the JAR files.  Here's what's in the Maven dep as of 1.0.0:
>>
>> jar tvf
>> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>> | grep 'rdd/RDD' | grep 'saveAs'
>>   1519 Mon May 26 13:57:58 PDT 2014
>> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>>   1560 Mon May 26 13:57:58 PDT 2014
>> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
>>
>>
>> And here's what's in the hadoop1 distribution:
>>
>> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
>>
>>
>> I.e., it's not there.  It is in the hadoop2 distribution:
>>
>> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>>   1519 Mon May 26 07:29:54 PDT 2014
>> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>>   1560 Mon May 26 07:29:54 PDT 2014
>> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
>>
>>
>> So something's clearly broken with the way that the distribution assemblies
>> are created.
>>
>> FWIW and IMHO, the "right" way to publish the hadoop1 and hadoop2 flavors
>> of Spark to Maven Central would be as *entirely different* artifacts
>> (spark-core-h1, spark-core-h2).
>>
>> Logged as SPARK-2075 <https://issues.apache.org/jira/browse/SPARK-2075>.
>>
>> Cheers.
>> -- Paul
>>
>>
>>
>> --
>> p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>>
>>
>> On Fri, Jun 6, 2014 at 2:45 AM, HenriV  wrote:
>>
>>> I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0.
>>> Im using google compute engine and cloud storage. but saveAsTextFile is
>>> returning errors while saving in the cloud or saving local. When i start a
>>> job in the cluster i do get an error but after this error it keeps on
>>> running fine untill the saveAsTextFile. ( I don't know if the two are
>>> connected)
>>>
>>> ---Error at job startup---
>>>  ERROR metrics.MetricsSystem: Sink class
>>> org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
>>> java.lang.reflect.InvocationTargetException
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>> Method)
>>> at
>>>
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>> at
>>>
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>> at
>>>
>>> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
>>> at
>>>
>>> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130)
>>> at
>>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>>> at
>>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>>> at
>

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-08 Thread Patrick Wendell
Paul,

Could you give the version of Java that you are building with and the
version of Java you are running with? Are they the same?

Just off the cuff, I wonder if this is related to:
https://issues.apache.org/jira/browse/SPARK-1520

If it is, it could appear that certain functions are not in the jar
because they go beyond the extended zip boundary, and `jar tvf` won't
list them.

- Patrick
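A minimal runtime smoke test that sidesteps `jar tvf` entirely: if the
saveAsTextFile closure classes really are missing from the deployed assembly,
a trivial job like this fails on the cluster (typically with a
NoClassDefFoundError) even though it works in local mode. The output path is
just a placeholder:

// Run against the standalone cluster, not local[n].
val smoke = sc.parallelize(1 to 100)
smoke.saveAsTextFile("hdfs:///tmp/save-as-text-file-smoke")  // placeholder path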

On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown  wrote:
> Moving over to the dev list, as this isn't a user-scope issue.
>
> I just ran into this issue with the missing saveAsTextFile, and here's a
> little additional information:
>
> - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases.
> - Driver built as an uberjar via Maven.
> - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with
> Spark 1.0.0-hadoop1 downloaded from Apache.
>
> Given that it functions correctly in local mode but not in a standalone
> cluster, this suggests to me that the issue is in a difference between the
> Maven version and the hadoop1 version.
>
> In the spirit of taking the computer at its word, we can just have a look
> in the JAR files.  Here's what's in the Maven dep as of 1.0.0:
>
> jar tvf
> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
> | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 13:57:58 PDT 2014
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 13:57:58 PDT 2014
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
>
>
> And here's what's in the hadoop1 distribution:
>
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
>
>
> I.e., it's not there.  It is in the hadoop2 distribution:
>
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 07:29:54 PDT 2014
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 07:29:54 PDT 2014
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
>
>
> So something's clearly broken with the way that the distribution assemblies
> are created.
>
> FWIW and IMHO, the "right" way to publish the hadoop1 and hadoop2 flavors
> of Spark to Maven Central would be as *entirely different* artifacts
> (spark-core-h1, spark-core-h2).
>
> Logged as SPARK-2075.
>
> Cheers.
> -- Paul
>
>
>
> --
> p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>
>
> On Fri, Jun 6, 2014 at 2:45 AM, HenriV  wrote:
>
>> I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0.
>> Im using google compute engine and cloud storage. but saveAsTextFile is
>> returning errors while saving in the cloud or saving local. When i start a
>> job in the cluster i do get an error but after this error it keeps on
>> running fine untill the saveAsTextFile. ( I don't know if the two are
>> connected)
>>
>> ---Error at job startup---
>>  ERROR metrics.MetricsSystem: Sink class
>> org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
>> java.lang.reflect.InvocationTargetException
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>> at
>>
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>> at
>>
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>> at
>>
>> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
>> at
>>
>> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130)
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>> at
>> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>> at
>>
>> org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130)
>> at
>> org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:84)
>> at
>>
>> org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167)
>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230)
>> at org.apache.spark.SparkContext.<init>(SparkContext.scala:202)
>> at Hello$.main(Hello.scala:101)
>> at Hello.main(Hello.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>>

MIMA Compatibility Checks

2014-06-08 Thread Patrick Wendell
Hey All,

Some people may have noticed PR failures due to binary compatibility
checks. We've had these enabled in several of the sub-modules since
the 0.9.0 release, but we've now turned them on in Spark core post-1.0.0,
which has much higher churn.

The checks are based on the "migration manager" (MIMA) tool from
Typesafe. One issue is that the tool doesn't support package-private
declarations of classes or methods. Prashant Sharma has built
instrumentation that adds partial support for package-privacy (via a
workaround), but since there isn't really native support for this in
MIMA we are still finding cases that trigger false positives.

In the next week or two we'll make it a priority to handle more of
these false-positive cases. In the mean time users can add manual
excludes to:

project/MimaExcludes.scala

to avoid triggering warnings for certain issues.
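For reference, entries in that file use MIMA's ProblemFilters API; a rough
sketch of what an exclude looks like (the class and method names below are
made up for illustration):

import com.typesafe.tools.mima.core._

object MimaExcludes {
  // Each entry silences one reported incompatibility; names here are placeholders.
  val excludes = Seq(
    ProblemFilters.exclude[MissingMethodProblem](
      "org.apache.spark.SomeInternalClass.someRemovedMethod"),
    ProblemFilters.exclude[MissingClassProblem](
      "org.apache.spark.SomeInternalClass")
  )
}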

This is definitely annoying - sorry about that. Unfortunately we are
the first open source Scala project to ever do this, so we are dealing
with uncharted territory.

Longer term I'd actually like to see us just write our own sbt-based
tool to do this in a better way (we've had trouble trying to extend
MIMA itself; e.g. it has copy-pasted code in it from an old version of
the Scala compiler). If someone in the community is a Scala fan and
wants to take that on, I'm happy to give more details.

- Patrick


Re: Announcing Spark 1.0.0

2014-06-04 Thread Patrick Wendell
Hey Rahul,

The v1.0.0 tag is correct. When we release Spark we create multiple
candidates. One of the candidates is promoted to the full release. So
rc11 is also the same as the official v1.0.0 release.

- Patrick

On Wed, Jun 4, 2014 at 8:29 PM, Rahul Singhal  wrote:
> Could someone please clarify my confusion or is this not an issue that we
> should be concerned about?
>
> Thanks,
> Rahul Singhal
>
>
>
>
>
> On 30/05/14 5:28 PM, "Rahul Singhal"  wrote:
>
>>Is it intentional/ok that the tag v1.0.0 is behind tag v1.0.0-rc11?
>>
>>
>>Thanks,
>>Rahul Singhal
>>
>>
>>
>>
>>
>>On 30/05/14 3:43 PM, "Patrick Wendell"  wrote:
>>
>>>I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
>>>is a milestone release as the first in the 1.0 line of releases,
>>>providing API stability for Spark's core interfaces.
>>>
>>>Spark 1.0.0 is Spark's largest release ever, with contributions from
>>>117 developers. I'd like to thank everyone involved in this release -
>>>it was truly a community effort with fixes, features, and
>>>optimizations contributed from dozens of organizations.
>>>
>>>This release expands Spark's standard libraries, introducing a new SQL
>>>package (SparkSQL) which lets users integrate SQL queries into
>>>existing Spark workflows. MLlib, Spark's machine learning library, is
>>>expanded with sparse vector support and several new algorithms. The
>>>GraphX and Streaming libraries also introduce new features and
>>>optimizations. Spark's core engine adds support for secured YARN
>>>clusters, a unified tool for submitting Spark applications, and
>>>several performance and stability improvements. Finally, Spark adds
>>>support for Java 8 lambda syntax and improves coverage of the Java and
>>>Python API's.
>>>
>>>Those features only scratch the surface - check out the release notes
>>>here:
>>>http://spark.apache.org/releases/spark-release-1-0-0.html
>>>
>>>Note that since release artifacts were posted recently, certain
>>>mirrors may not have working downloads for a few hours.
>>>
>>>- Patrick
>>
>


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-06-04 Thread Patrick Wendell
Hey There,

The best way is to use the v1.0.0 tag:
https://github.com/apache/spark/releases/tag/v1.0.0

- Patrick

On Wed, Jun 4, 2014 at 12:19 PM, Debasish Das  wrote:
> Hi Patrick,
>
> We maintain internal Spark mirror in sync with Spark github master...
>
> What's the way to get the 1.0.0 stable release from github to deploy on our
> production cluster ? Is there a tag for 1.0.0 that I should use to deploy ?
>
> Thanks.
> Deb
>
>
>
> On Wed, Jun 4, 2014 at 10:49 AM, Patrick Wendell  wrote:
>
>> Received!
>>
>> On Wed, Jun 4, 2014 at 10:47 AM, Tom Graves
>>  wrote:
>> > Testing... Resending as it appears my message didn't go through last
>> week.
>> >
>> > Tom
>> >
>> >
>> > On Wednesday, May 28, 2014 4:12 PM, Tom Graves 
>> wrote:
>> >
>> >
>> >
>> > +1. Tested spark on yarn (cluster mode, client mode, pyspark,
>> spark-shell) on hadoop 0.23 and 2.4.
>> >
>> > Tom
>> >
>> >
>> > On Wednesday, May 28, 2014 3:07 PM, Sean McNamara
>>  wrote:
>> >
>> >
>> >
>> > Pulled down, compiled, and tested examples on OS X and ubuntu.
>> > Deployed app we are building on spark and poured data through it.
>> >
>> > +1
>> >
>> > Sean
>> >
>> >
>> >
>> > On May 26, 2014, at 8:39 AM, Tathagata Das 
>> wrote:
>> >
>> >> Please vote on releasing the following candidate as Apache Spark
>> version 1.0.0!
>> >>
>> >> This has a few important bug fixes on top of rc10:
>> >> SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
>> >> SPARK-1870: https://github.com/apache/spark/pull/848
>> >> SPARK-1897: https://github.com/apache/spark/pull/849
>> >>
>> >> The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
>> >>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a
>> >>
>> >> The release files, including signatures, digests, etc. can be found at:
>> >> http://people.apache.org/~tdas/spark-1.0.0-rc11/
>> >>
>> >> Release
>> >  artifacts are signed with the following key:
>> >> https://people.apache.org/keys/committer/tdas.asc
>> >>
>> >> The staging repository for this release can be found at:
>> >> https://repository.apache.org/content/repositories/orgapachespark-1019/
>> >>
>> >> The documentation corresponding to this release can be found at:
>> >> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/
>> >>
>> >> Please vote on releasing this package as Apache Spark 1.0.0!
>> >>
>> >> The vote is open until
>> >  Thursday, May 29, at 16:00 UTC and passes if
>> >> a majority of at least 3 +1 PMC votes are cast.
>> >>
>> >> [ ] +1 Release this package as Apache Spark 1.0.0
>> >> [ ] -1 Do not release this package because ...
>> >>
>> >> To learn more about Apache Spark, please see
>> >> http://spark.apache.org/
>> >>
>> >> == API Changes ==
>> >> We welcome users to compile Spark applications against 1.0. There are
>> >> a few API changes in this release. Here are links to the associated
>> >> upgrade guides - user facing changes have been kept as small as
>> >> possible.
>> >>
>> >> Changes to ML vector specification:
>> >>
>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10
>> >>
>> >> Changes to the Java API:
>> >>
>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>> >>
>> >> Changes to the streaming API:
>> >>
>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
>> >>
>> >> Changes to the GraphX API:
>> >>
>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
>> >>
>> >> Other changes:
>> >> coGroup and related functions now return Iterable[T] instead of Seq[T]
>> >> ==> Call toSeq on the result to restore the old behavior
>> >>
>> >> SparkContext.jarOfClass returns Option[String] instead of
>> >  Seq[String]
>> >> ==> Call toSeq on the result to restore old behavior
>>
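A small sketch of the two "call toSeq" migrations listed above; the RDD
contents and the class passed to jarOfClass are purely illustrative:

// cogroup in 1.0 yields Iterable values per key instead of Seq.
val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
val right = sc.parallelize(Seq((1, 10), (2, 20)))
val restored = left.cogroup(right).mapValues { case (as, bs) => (as.toSeq, bs.toSeq) }

// SparkContext.jarOfClass now returns Option[String]; toSeq restores the old shape.
val jars: Seq[String] = org.apache.spark.SparkContext.jarOfClass(getClass).toSeq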


Re: What is the correct Spark version of master/branch-1.0?

2014-06-04 Thread Patrick Wendell
It should be 1.1-SNAPSHOT. Feel free to submit a PR to clean up any
inconsistencies.

On Tue, Jun 3, 2014 at 8:33 PM, Takuya UESHIN  wrote:
> Hi all,
>
> I'm wondering what is the correct Spark version of each HEAD of master
> and branch-1.0.
>
> current master HEAD (e8d93ee5284cb6a1d4551effe91ee8d233323329):
> - pom.xml: 1.0.0-SNAPSHOT
> - SparkBuild.scala: 1.1.0-SNAPSHOT
>
> It should be 1.1.0-SNAPSHOT?
>
>
> current branch-1.0 HEAD (d96794132e37cf57f8dd945b9d11f8adcfc30490):
> - pom.xml: 1.0.1-SNAPSHOT
> - SparkBuild.scala: 1.0.0
>
> It should be 1.0.1-SNAPSHOT?
>
>
> Thanks.
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-06-04 Thread Patrick Wendell
Received!

On Wed, Jun 4, 2014 at 10:47 AM, Tom Graves
 wrote:
> Testing... Resending as it appears my message didn't go through last week.
>
> Tom
>
>
> On Wednesday, May 28, 2014 4:12 PM, Tom Graves  wrote:
>
>
>
> +1. Tested spark on yarn (cluster mode, client mode, pyspark, spark-shell) on 
> hadoop 0.23 and 2.4.
>
> Tom
>
>
> On Wednesday, May 28, 2014 3:07 PM, Sean McNamara 
>  wrote:
>
>
>
> Pulled down, compiled, and tested examples on OS X and ubuntu.
> Deployed app we are building on spark and poured data through it.
>
> +1
>
> Sean
>
>
>
> On May 26, 2014, at 8:39 AM, Tathagata Das  
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 1.0.0!
>>
>> This has a few important bug fixes on top of rc10:
>> SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
>> SPARK-1870: https://github.com/apache/spark/pull/848
>> SPARK-1897: https://github.com/apache/spark/pull/849
>>
>> The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~tdas/spark-1.0.0-rc11/
>>
>> Release
>  artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/tdas.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1019/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.0.0!
>>
>> The vote is open until
>  Thursday, May 29, at 16:00 UTC and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.0.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == API Changes ==
>> We welcome users to compile Spark applications against 1.0. There are
>> a few API changes in this release. Here are links to the associated
>> upgrade guides - user facing changes have been kept as small as
>> possible.
>>
>> Changes to ML vector specification:
>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10
>>
>> Changes to the Java API:
>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>>
>> Changes to the streaming API:
>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
>>
>> Changes to the GraphX API:
>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
>>
>> Other changes:
>> coGroup and related functions now return Iterable[T] instead of Seq[T]
>> ==> Call toSeq on the result to restore the old behavior
>>
>> SparkContext.jarOfClass returns Option[String] instead of
>  Seq[String]
>> ==> Call toSeq on the result to restore old behavior


Spark 1.1 Window and 1.0 Wrap-up

2014-06-02 Thread Patrick Wendell
Hey All,

I wanted to announce the Spark 1.1 release window:
June 1 - Merge window opens
July 25 - Cut-off for new pull requests
August 1 - Merge window closes (code freeze), QA period starts
August 15+ - RC's and voting

This is consistent with the "3 month" release cycle we are targeting.
I'd really encourage people submitting larger features to do so during
the month of June, as features submitted closer to the window closing
could end up getting pushed into the next release.

I wanted to reflect a bit as well on the 1.0 release. First, thanks to
everyone who was involved in this release. It was the largest release
ever and it's something we should all be proud of.

In the 1.0 release, we cleaned up and consolidated several parts of the
Spark code base. In particular, we unified the previously
fragmented process of submitting Spark jobs across a wide variety of
environments {YARN/Mesos/Standalone, Windows/Unix, Python/Java/Scala}.
We also brought the three language API's into much closer alignment.
These were difficult (but critical) tasks towards having a stable
deployment environment on which higher level libraries can build.

These cross-cutting changes also carried an associated test burden,
resulting in an extended QA period. The 1.1, 1.2, and 1.3 releases are
intended to be smaller, and I'd like to deliver them with
very predictable timing to the community. This will mean being fairly
strict about freezes and investing in QA infrastructure to allow us to
get through voting more quickly.

With 1.0 shipped, now is a great time to catch up on code reviews and
look at outstanding patches. Despite the large queue, we've actually
been consistently merging/closing about 80% of proposed PR's, which
is definitely good (for instance, we have 170 outstanding out of 950
proposed). But a lot of people are still waiting on reviews, and that's
something everyone can help with!

Thanks again to everyone involved. Looking forward to more great releases!

- Patrick


Re: Which version does the binary compatibility test against by default?

2014-06-02 Thread Patrick Wendell
Yeah - check out sparkPreviousArtifact in the build:
https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L325

- Patrick

On Mon, Jun 2, 2014 at 5:30 PM, Xiangrui Meng  wrote:
> Is there a way to specify the target version? -Xiangrui


Re: SCALA_HOME or SCALA_LIBRARY_PATH not set during build

2014-06-01 Thread Patrick Wendell
I went ahead and created a JIRA for this and backported the
improvement into branch-1.0. This wasn't a regression per se, because
the behavior existed in all previous versions, but it's annoying
behavior, so it's best to fix it.

https://issues.apache.org/jira/browse/SPARK-1984

- Patrick

On Sun, Jun 1, 2014 at 11:13 AM, Patrick Wendell  wrote:
> This is a false error message actually - the Maven build no longer
> requires SCALA_HOME but the message/check was still there. This was
> fixed recently in master:
>
> https://github.com/apache/spark/commit/d8c005d5371f81a2a06c5d27c7021e1ae43d7193
>
> I can backport that fix into branch-1.0 so it will be in 1.0.1 as
> well. For other people running into this, you can export SCALA_HOME to
> any value and it will work.
>
> - Patrick
>
> On Sat, May 31, 2014 at 8:34 PM, Colin McCabe  wrote:
>> Spark currently supports two build systems, sbt and maven.  sbt will
>> download the correct version of scala, but with Maven you need to supply it
>> yourself and set SCALA_HOME.
>>
>> It sounds like the instructions need to be updated-- perhaps create a JIRA?
>>
>> best,
>> Colin
>>
>>
>> On Sat, May 31, 2014 at 7:06 PM, Soren Macbeth  wrote:
>>
>>> Hello,
>>>
>>> Following the instructions for building spark 1.0.0, I encountered the
>>> following error:
>>>
>>> [ERROR] Failed to execute goal
>>> org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project
>>> spark-core_2.10: An Ant BuildException has occured: Please set the
>>> SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment
>>> variables and retry.
>>> [ERROR] around Ant part .. @ 6:126 in
>>> /Users/soren/src/spark-1.0.0/core/target/antrun/build-main.xml
>>>
>>> No where in the documentation does it mention that having scala installed
>>> and either of these env vars set nor what version should be installed.
>>> Setting these env vars wasn't required for 0.9.1 with sbt.
>>>
>>> I was able to get past it by downloading the scala 2.10.4 binary package to
>>> a temp dir and setting SCALA_HOME to that dir.
>>>
>>> Ideally, it would be nice to not have to require people to have a
>>> standalone scala installation but at a minimum this requirement should be
>>> documented in the build instructions no?
>>>
>>> -Soren
>>>


Re: SCALA_HOME or SCALA_LIBRARY_PATH not set during build

2014-06-01 Thread Patrick Wendell
This is a false error message actually - the Maven build no longer
requires SCALA_HOME but the message/check was still there. This was
fixed recently in master:

https://github.com/apache/spark/commit/d8c005d5371f81a2a06c5d27c7021e1ae43d7193

I can backport that fix into branch-1.0 so it will be in 1.0.1 as
well. For other people running into this, you can export SCALA_HOME to
any value and it will work.

- Patrick

On Sat, May 31, 2014 at 8:34 PM, Colin McCabe  wrote:
> Spark currently supports two build systems, sbt and maven.  sbt will
> download the correct version of scala, but with Maven you need to supply it
> yourself and set SCALA_HOME.
>
> It sounds like the instructions need to be updated-- perhaps create a JIRA?
>
> best,
> Colin
>
>
> On Sat, May 31, 2014 at 7:06 PM, Soren Macbeth  wrote:
>
>> Hello,
>>
>> Following the instructions for building spark 1.0.0, I encountered the
>> following error:
>>
>> [ERROR] Failed to execute goal
>> org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project
>> spark-core_2.10: An Ant BuildException has occured: Please set the
>> SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment
>> variables and retry.
>> [ERROR] around Ant part .. @ 6:126 in
>> /Users/soren/src/spark-1.0.0/core/target/antrun/build-main.xml
>>
>> No where in the documentation does it mention that having scala installed
>> and either of these env vars set nor what version should be installed.
>> Setting these env vars wasn't required for 0.9.1 with sbt.
>>
>> I was able to get past it by downloading the scala 2.10.4 binary package to
>> a temp dir and setting SCALA_HOME to that dir.
>>
>> Ideally, it would be nice to not have to require people to have a
>> standalone scala installation but at a minimum this requirement should be
>> documented in the build instructions no?
>>
>> -Soren
>>


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-31 Thread Patrick Wendell
One other consideration popped into my head:

5. Shading our dependencies could mess up our external API's if we
ever return types that are outside of the spark package, because we'd
then be returning shaded types that users have to deal with. E.g. where
before we returned an o.a.flume.AvroFlumeEvent, we'd have to return a
some.namespace.AvroFlumeEvent. Then users downstream would have to
deal with converting our types into the correct namespace if they want
to inter-operate with other libraries. We generally try to avoid ever
returning types from other libraries, but it would be good to audit
our API's and see if we ever do this.
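A hypothetical, self-contained illustration of point 5; the nested objects
below stand in for an upstream library and a shaded copy of it, and are not
real Spark or Flume namespaces:

object ShadingIllustration {
  object upstream { case class AvroFlumeEvent(body: String) }  // stand-in for the real library
  object shaded   { case class AvroFlumeEvent(body: String) }  // stand-in for a relocated copy

  // User code compiled against the upstream type.
  def userHandler(e: upstream.AvroFlumeEvent): Unit = println(e.body)

  def main(args: Array[String]): Unit = {
    userHandler(upstream.AvroFlumeEvent("ok"))
    // userHandler(shaded.AvroFlumeEvent("boom"))  // would not compile: different JVM type
  }
}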

On Fri, May 30, 2014 at 10:54 PM, Patrick Wendell  wrote:
> Spark is a bit different than Hadoop MapReduce, so maybe that's a
> source of some confusion. Spark is often used as a substrate for
> building different types of analytics applications, so @DeveloperAPI
> are internal API's that we'd like to expose to application writers,
> but that might be more volatile. This is like the internal API's in
> the linux kernel, they aren't stable, but of course we try to minimize
> changes to them. If people want to write lower-level modules against
> them, that's fine with us, but they know the interfaces might change.
>
> This has worked pretty well over the years, even with many different
> companies writing against those API's.
>
> @Experimental are user-facing features we are trying out. Hopefully
> that one is more clear.
>
> In terms of making a big jar that shades all of our dependencies - I'm
> curious how that would actually work in practice. It would be good to
> explore. There are a few potential challenges I see:
>
> 1. If any of our dependencies encode class name information in IPC
> messages, this would break. E.g. can you definitely shade the Hadoop
> client, protobuf, hbase client, etc and have them send messages over
> the wire? This could break things if class names are ever encoded in a
> wire format.
> 2. Many libraries like logging subsystems, configuration systems, etc
> rely on static state and initialization. I'm not totally sure how e.g.
> slf4j initializes itself if you have both a shaded and non-shaded copy
> of slf4j present.
> 3. This would mean the spark-core jar would be really massive because
> it would inline all of our deps. We've actually been thinking of
> avoiding the current assembly jar approach because, due to scala
> specialized classes, our assemblies now have more than 65,000 class
> files in them leading to all kinds of bad issues. We'd have to stick
> with a big uber assembly-like jar if we decide to shade stuff.
> 4. I'm not totally sure how this would work when people want to e.g.
> build Spark with different Hadoop versions. Would we publish different
> shaded uber-jars for every Hadoop version? Would the Hadoop dep just
> not be shaded... if so what about all its dependencies.
>
> Anyways just some things to consider... simplifying our classpath is
> definitely an avenue worth exploring!
>
>
>
>
> On Fri, May 30, 2014 at 2:56 PM, Colin McCabe  wrote:
>> On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell  wrote:
>>
>>> Hey guys, thanks for the insights. Also, I realize Hadoop has gotten
>>> way better about this with 2.2+ and I think it's great progress.
>>>
>>> We have well defined API levels in Spark and also automated checking
>>> of API violations for new pull requests. When doing code reviews we
>>> always enforce the narrowest possible visibility:
>>>
>>> 1. private
>>> 2. private[spark]
>>> 3. @Experimental or @DeveloperApi
>>> 4. public
>>>
>>> Our automated checks exclude 1-3. Anything that breaks 4 will trigger
>>> a build failure.
>>>
>>>
>> That's really excellent.  Great job.
>>
>> I like the private[spark] visibility level-- sounds like this is another
>> way Scala has greatly improved on Java.
>>
>> The Scala compiler prevents anyone external from using 1 or 2. We do
>>> have "bytecode public but annotated" (3) API's that we might change.
>>> We spent a lot of time looking into whether these can offer compiler
>>> warnings, but we haven't found a way to do this and do not see a
>>> better alternative at this point.
>>>
>>
>> It would be nice if the production build could strip this stuff out.
>>  Otherwise, it feels a lot like a @private, @unstable Hadoop API... and we
>> know how those turned out.
>>
>>
>>> Regarding Scala compatibility, Scala 2.11+ is "source code
>>> compatible"

Re: Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread Patrick Wendell
Can you look at the logs from the executor or in the UI? They should
give an exception with the reason for the task failure. Also in the
future, for this type of e-mail please only e-mail the "user@" list
and not both lists.

- Patrick

On Sat, May 31, 2014 at 3:22 AM, prabeesh k  wrote:
> Hi,
>
> scenario : Read data from HDFS and apply hive query  on it and the result is
> written back to HDFS.
>
> Schema creation, querying, and saveAsTextFile are working fine in the
> following modes:
>
> local mode
> mesos cluster with single node
> spark cluster with multi node
>
> Schema creation and querying are also working fine on the Mesos multi-node cluster.
> But while trying to write back to HDFS using saveAsTextFile, the following
> error occurs:
>
>  14/05/30 10:16:35 INFO DAGScheduler: The failed fetch was from Stage 4
> (mapPartitionsWithIndex at Operator.scala:333); marking it for resubmission
> 14/05/30 10:16:35 INFO DAGScheduler: Executor lost:
> 201405291518-3644595722-5050-17933-1 (epoch 148)
>
> Let me know your thoughts regarding this.
>
> Regards,
> prabeesh


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-30 Thread Patrick Wendell
Spark is a bit different than Hadoop MapReduce, so maybe that's a
source of some confusion. Spark is often used as a substrate for
building different types of analytics applications, so @DeveloperAPI
are internal API's that we'd like to expose to application writers,
but that might be more volatile. This is like the internal API's in
the linux kernel, they aren't stable, but of course we try to minimize
changes to them. If people want to write lower-level modules against
them, that's fine with us, but they know the interfaces might change.

This has worked pretty well over the years, even with many different
companies writing against those API's.

@Experimental are user-facing features we are trying out. Hopefully
that one is more clear.

In terms of making a big jar that shades all of our dependencies - I'm
curious how that would actually work in practice. It would be good to
explore. There are a few potential challenges I see:

1. If any of our dependencies encode class name information in IPC
messages, this would break. E.g. can you definitely shade the Hadoop
client, protobuf, hbase client, etc and have them send messages over
the wire? This could break things if class names are ever encoded in a
wire format.
2. Many libraries like logging subsystems, configuration systems, etc
rely on static state and initialization. I'm not totally sure how e.g.
slf4j initializes itself if you have both a shaded and non-shaded copy
of slf4j present.
3. This would mean the spark-core jar would be really massive because
it would inline all of our deps. We've actually been thinking of
avoiding the current assembly jar approach because, due to scala
specialized classes, our assemblies now have more than 65,000 class
files in them leading to all kinds of bad issues. We'd have to stick
with a big uber assembly-like jar if we decide to shade stuff.
4. I'm not totally sure how this would work when people want to e.g.
build Spark with different Hadoop versions. Would we publish different
shaded uber-jars for every Hadoop version? Would the Hadoop dep just
not be shaded... if so, what about all its dependencies?

Anyways just some things to consider... simplifying our classpath is
definitely an avenue worth exploring!




On Fri, May 30, 2014 at 2:56 PM, Colin McCabe  wrote:
> On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell  wrote:
>
>> Hey guys, thanks for the insights. Also, I realize Hadoop has gotten
>> way better about this with 2.2+ and I think it's great progress.
>>
>> We have well defined API levels in Spark and also automated checking
>> of API violations for new pull requests. When doing code reviews we
>> always enforce the narrowest possible visibility:
>>
>> 1. private
>> 2. private[spark]
>> 3. @Experimental or @DeveloperApi
>> 4. public
>>
>> Our automated checks exclude 1-3. Anything that breaks 4 will trigger
>> a build failure.
>>
>>
> That's really excellent.  Great job.
>
> I like the private[spark] visibility level-- sounds like this is another
> way Scala has greatly improved on Java.
>
> The Scala compiler prevents anyone external from using 1 or 2. We do
>> have "bytecode public but annotated" (3) API's that we might change.
>> We spent a lot of time looking into whether these can offer compiler
>> warnings, but we haven't found a way to do this and do not see a
>> better alternative at this point.
>>
>
> It would be nice if the production build could strip this stuff out.
>  Otherwise, it feels a lot like a @private, @unstable Hadoop API... and we
> know how those turned out.
>
>
>> Regarding Scala compatibility, Scala 2.11+ is "source code
>> compatible", meaning we'll be able to cross-compile Spark for
>> different versions of Scala. We've already been in touch with Typesafe
>> about this and they've offered to integrate Spark into their
>> compatibility test suite. They've also committed to patching 2.11 with
>> a minor release if bugs are found.
>>
>
> Thanks, I hadn't heard about this plan.  Hopefully we can get everyone on
> 2.11 ASAP.
>
>
>> Anyways, my point is we've actually thought a lot about this already.
>>
>> The CLASSPATH thing is different than API stability, but indeed also a
>> form of compatibility. This is something where I'd also like to see
>> Spark have better isolation of user classes from Spark's own
>> execution...
>>
>>
> I think the best thing to do is just "shade" all the dependencies.  Then
> they will be in a different namespace, and clients can have their own
> versions of whatever dependencies they like without conflicting.  As
> 

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-30 Thread Patrick Wendell
Hey guys, thanks for the insights. Also, I realize Hadoop has gotten
way better about this with 2.2+ and I think it's great progress.

We have well defined API levels in Spark and also automated checking
of API violations for new pull requests. When doing code reviews we
always enforce the narrowest possible visibility:

1. private
2. private[spark]
3. @Experimental or @DeveloperApi
4. public

Our automated checks exclude 1-3. Anything that breaks 4 will trigger
a build failure.

The Scala compiler prevents anyone external from using 1 or 2. We do
have "bytecode public but annotated" (3) API's that we might change.
We spent a lot of time looking into whether these can offer compiler
warnings, but we haven't found a way to do this and do not see a
better alternative at this point.
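
For anyone less familiar with the Scala side of this, here is a rough sketch
of what the four levels look like in code (the class and method names below
are made up for illustration and are not taken from the Spark source):

  package org.apache.spark.example

  import org.apache.spark.annotation.DeveloperApi

  class Levels {
    private val internalState = 0           // (1) visible only inside this class
  }

  private[spark] class SchedulerInternals   // (2) visible anywhere under org.apache.spark,
                                            //     but not to user applications

  @DeveloperApi
  class LowLevelHook {                      // (3) bytecode-public, but annotated as unstable
    def register(): Unit = ()
  }

  class StableApi {                         // (4) fully public; covered by the automated API checks
    def run(): Unit = ()
  }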

Regarding Scala compatibility, Scala 2.11+ is "source code
compatible", meaning we'll be able to cross-compile Spark for
different versions of Scala. We've already been in touch with Typesafe
about this and they've offered to integrate Spark into their
compatibility test suite. They've also committed to patching 2.11 with
a minor release if bugs are found.

Anyways, my point is we've actually thought a lot about this already.

The CLASSPATH thing is different than API stability, but indeed also a
form of compatibility. This is something where I'd also like to see
Spark have better isolation of user classes from Spark's own
execution...

- Patrick



On Fri, May 30, 2014 at 12:30 PM, Marcelo Vanzin  wrote:
> On Fri, May 30, 2014 at 12:05 PM, Colin McCabe  wrote:
>> I don't know if Scala provides any mechanisms to do this beyond what Java 
>> provides.
>
> In fact it does. You can say something like "private[foo]" and the
> annotated element will be visible for all classes under "foo" (where
> "foo" is any package in the hierarchy leading up to the class). That's
> used a lot in Spark.
>
> I haven't fully looked at how the @DeveloperApi is used, but I agree
> with you - annotations are not a good way to do this. The Scala
> feature above would be much better, but it might still leak things at
> the Java bytecode level (don't know how Scala implements it under the
> cover, but I assume it's not by declaring the element as a Java
> "private").
>
> Another thing is that in Scala the default visibility is public, which
> makes it very easy to inadvertently add things to the API. I'd like to
> see more care in making things have the proper visibility - I
> generally declare things private first, and relax that as needed.
> Using @VisibleForTesting would be great too, when the Scala
> private[foo] approach doesn't work.
>
>> Does Spark also expose its CLASSPATH in
>> this way to executors?  I was under the impression that it did.
>
> If you're using the Spark assemblies, yes, there is a lot of things
> that your app gets exposed to. For example, you can see Guava and
> Jetty (and many other things) there. This is something that has always
> bugged me, but I don't really have a good suggestion of how to fix it;
> shading goes a certain way, but it also breaks codes that uses
> reflection (e.g. Class.forName()-style class loading).
>
> What is worse is that Spark doesn't even agree with the Hadoop code it
> depends on; e.g., Spark uses Guava 14.x while Hadoop is still in Guava
> 11.x. So when you run your Scala app, what gets loaded?
>
>> At some point we will also have to confront the Scala version issue.  Will
>> there be flag days where Spark jobs need to be upgraded to a new,
>> incompatible version of Scala to run on the latest Spark?
>
> Yes, this could be an issue - I'm not sure Scala has a policy towards
> this, but updates (at least minor, e.g. 2.9 -> 2.10) tend to break
> binary compatibility.
>
> Scala also makes some API updates tricky - e.g., adding a new named
> argument to a Scala method is not a binary compatible change (while,
> e.g., adding a new keyword argument in a python method is just fine).
> The use of implicits and other Scala features make this even more
> opaque...
>
> Anyway, not really any solutions in this message, just a few comments
> I wanted to throw out there. :-)
>
> --
> Marcelo


Re: Streaming example stops outputting (Java, Kafka at least)

2014-05-30 Thread Patrick Wendell
Yeah - Spark streaming needs at least two threads to run. I actually
thought we warned the user if they only use one (@tdas?) but the
warning might not be working correctly - or I'm misremembering.
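
For reference, the working configuration looks roughly like this (a minimal
sketch, not taken from Sean's code):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // "local[2]" gives the receiver its own thread and leaves one for processing;
  // with plain "local" the single thread is consumed by the receiver, so no
  // batches are ever processed and the job produces no output.
  val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingTest")
  val ssc = new StreamingContext(conf, Seconds(2))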

On Fri, May 30, 2014 at 6:38 AM, Sean Owen  wrote:
> Thanks Nan, that does appear to fix it. I was using "local". Can
> anyone say whether that's to be expected or whether it could be a bug
> somewhere?
>
> On Fri, May 30, 2014 at 2:42 PM, Nan Zhu  wrote:
>> Hi, Sean
>>
>> I ran into the same problem,
>>
>> but when I changed MASTER="local" to MASTER="local[2]"
>>
>> everything went back to normal.
>>
>> I hadn't gotten a chance to ask here.
>>
>> Best,
>>
>> --
>> Nan Zhu
>>


Announcing Spark 1.0.0

2014-05-30 Thread Patrick Wendell
I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
is a milestone release as the first in the 1.0 line of releases,
providing API stability for Spark's core interfaces.

Spark 1.0.0 is Spark's largest release ever, with contributions from
117 developers. I'd like to thank everyone involved in this release -
it was truly a community effort with fixes, features, and
optimizations contributed from dozens of organizations.

This release expands Spark's standard libraries, introducing a new SQL
package (SparkSQL) which lets users integrate SQL queries into
existing Spark workflows. MLlib, Spark's machine learning library, is
expanded with sparse vector support and several new algorithms. The
GraphX and Streaming libraries also introduce new features and
optimizations. Spark's core engine adds support for secured YARN
clusters, a unified tool for submitting Spark applications, and
several performance and stability improvements. Finally, Spark adds
support for Java 8 lambda syntax and improves coverage of the Java and
Python API's.

Those features only scratch the surface - check out the release notes here:
http://spark.apache.org/releases/spark-release-1-0-0.html

Note that since release artifacts were posted recently, certain
mirrors may not have working downloads for a few hours.

- Patrick


Re: [RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Patrick Wendell
Congrats everyone! This is a huge accomplishment!

On Thu, May 29, 2014 at 1:24 PM, Tathagata Das
 wrote:
> Hello everyone,
>
> The vote on Spark 1.0.0 RC11 passes with 13 "+1" votes, one "0" vote and no
> "-1" vote.
>
> Thanks to everyone who tested the RC and voted. Here are the totals:
>
> +1: (13 votes)
> Matei Zaharia*
> Mark Hamstra*
> Holden Karau
> Nick Pentreath*
> Will Benton
> Henry Saputra
> Sean McNamara*
> Xiangrui Meng*
> Andy Konwinski*
> Krishna Sankar
> Kevin Markey
> Patrick Wendell*
> Tathagata Das*
>
> 0: (1 vote)
> Ankur Dave*
>
> -1: (0 vote)
>
> Please hold off announcing Spark 1.0.0 until Apache Software Foundation
> makes the press release tomorrow. Thank you very much for your cooperation.
>
> TD


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-29 Thread Patrick Wendell
[tl;dr stable API's are important - sorry, this is slightly meandering]

Hey - just wanted to chime in on this as I was travelling. Sean, you
bring up great points here about the velocity and stability of Spark.
Many projects have fairly customized semantics around what versions
actually mean (HBase is a good, if somewhat hard-to-comprehend,
example).

What the 1.X label means to Spark is that we are willing to guarantee
stability for Spark's core API. This is something that actually, Spark
has been doing for a while already (we've made few or no breaking
changes to the Spark core API for several years) and we want to codify
this for application developers. In this regard Spark has made a bunch
of changes to enforce the integrity of our API's:

- We went through and clearly annotated internal, or experimental
API's. This was a huge project-wide effort and included Scaladoc and
several other components to make it clear to users.
- We implemented automated byte-code verification of all proposed pull
requests that they don't break public API's. Pull requests after 1.0
will fail if they break API's that are not explicitly declared private
or experimental.

I can't possibly emphasize enough the importance of API stability.
What we want to avoid is the Hadoop approach. Candidly, Hadoop does a
poor job on this. There really isn't a well defined stable API for any
of the Hadoop components, for a few reasons:

1. Hadoop projects don't do any rigorous checking that new patches
don't break API's. Of course, this results in regular API breaks and a
poor understanding of what is a public API.
2. In several cases it's not possible to do basic things in Hadoop
without using deprecated or private API's.
3. There is significant vendor fragmentation of API's.

The main focus of the Hadoop vendors is making consistent cuts of the
core projects work together (HDFS/Pig/Hive/etc) - so API breaks are
sometimes considered "fixed" as long as the other projects work around
them (see [1]). We also regularly need to do archaeology (see [2]) and
directly interact with Hadoop committers to understand what API's are
stable and in which versions.

One goal of Spark is to deal with the pain of inter-operating with
Hadoop so that application writers don't have to. We'd like to retain the
property that if you build an application against the (well defined,
stable) Spark API's right now, you'll be able to run it across many
Hadoop vendors and versions for the entire Spark 1.X release cycle.

Writing apps against Hadoop can be very difficult... consider how much
more engineering effort we spent maintaining YARN support than Mesos
support. There are many factors, but one is that Mesos has a single,
narrow, stable API. We've never had to make a change in Mesos due to
an API change, for several years. YARN on the other hand, there are at
least 3 YARN API's that currently exist, which are all binary
incompatible. We'd like to offer apps the ability to build against
Spark's API and just let us deal with it.

As more vendors package Spark, I'd like to see us put tools in the
upstream Spark repo that do validation for vendor packages of Spark,
so that we don't end up with fragmentation. Of course, vendors can
enhance the API and are encouraged to, but we need a kernel of API's
that vendors must maintain (think POSIX) to be considered compliant
with Apache Spark. I believe some other projects like OpenStack have
done this to avoid fragmentation.

- Patrick

[1] https://issues.apache.org/jira/browse/MAPREDUCE-5830
[2] 
http://2.bp.blogspot.com/-GO6HF0OAFHw/UOfNEH-4sEI/AD0/dEWFFYTRgYw/s1600/output-file.png

On Sun, May 18, 2014 at 2:13 AM, Mridul Muralidharan  wrote:
> So I think I need to clarify a few things here - particularly since
> this mail went to the wrong mailing list and a much wider audience
> than I intended it for :-)
>
>
> Most of the issues I mentioned are internal implementation details of
> Spark core, which means we can enhance them in the future without
> disruption to our userbase (ability to support a large number of
> input/output partitions. Note: this is of order of 100k input and
> output partitions with uniform spread of keys - very rarely seen
> outside of some crazy jobs).
>
> Some of the issues I mentioned would require DeveloperApi changes -
> which are not user exposed : they would impact developer use of these
> api's - which are mostly internally provided by spark. (Like fixing
> blocks > 2G would require change to Serializer api)
>
> A smaller faction might require interface changes - note, I am
> referring specifically to configuration changes (removing/deprecating
> some) and possibly newer options to submit/env, etc - I don't envision
> any programming api change itself.
> The only api change we did was from Seq -> Iterable - which is
> actually to address some of the issues I mentioned (join/cogroup).
>
> Remaining are bugs which need to be addressed or the feature
> removed/enhanced like shuffle consolidation.
>
> 

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Patrick Wendell
+1

I spun up a few EC2 clusters and ran my normal audit checks. Tests
passing, sigs, CHANGES and NOTICE look good

Thanks TD for helping cut this RC!

On Wed, May 28, 2014 at 9:38 PM, Kevin Markey  wrote:
> +1
>
> Built -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0
> Ran current version of one of my applications on 1-node pseudocluster
> (sorry, unable to test on full cluster).
> yarn-cluster mode
> Ran regression tests.
>
> Thanks
> Kevin
>
>
> On 05/28/2014 09:55 PM, Krishna Sankar wrote:
>>
>> +1
>> Pulled & built on MacOS X, EC2 Amazon Linux
>> Ran test programs on OS X, 5 node c3.4xlarge cluster
>> Cheers
>> 
>>
>>
>> On Wed, May 28, 2014 at 7:36 PM, Andy Konwinski
>> wrote:
>>
>>> +1
>>> On May 28, 2014 7:05 PM, "Xiangrui Meng"  wrote:
>>>
 +1

 Tested apps with standalone client mode and yarn cluster and client
>>>
>>> modes.

 Xiangrui

 On Wed, May 28, 2014 at 1:07 PM, Sean McNamara
  wrote:
>
> Pulled down, compiled, and tested examples on OS X and ubuntu.
> Deployed app we are building on spark and poured data through it.
>
> +1
>
> Sean
>
>
> On May 26, 2014, at 8:39 AM, Tathagata Das <
>>>
>>> tathagata.das1...@gmail.com>

 wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark

 version 1.0.0!
>>
>> This has a few important bug fixes on top of rc10:
>> SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
>> SPARK-1870: https://github.com/apache/spark/pull/848
>> SPARK-1897: https://github.com/apache/spark/pull/849
>>
>> The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
>>
>>>
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a
>>
>> The release files, including signatures, digests, etc. can be found
>>>
>>> at:
>>
>> http://people.apache.org/~tdas/spark-1.0.0-rc11/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/tdas.asc
>>
>> The staging repository for this release can be found at:
>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1019/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.0.0!
>>
>> The vote is open until Thursday, May 29, at 16:00 UTC and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.0.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == API Changes ==
>> We welcome users to compile Spark applications against 1.0. There are
>> a few API changes in this release. Here are links to the associated
>> upgrade guides - user facing changes have been kept as small as
>> possible.
>>
>> Changes to ML vector specification:
>>
>>>
>>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10
>>
>> Changes to the Java API:
>>
>>>
>>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>>
>> Changes to the streaming API:
>>
>>>
>>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
>>
>> Changes to the GraphX API:
>>
>>>
>>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
>>
>> Other changes:
>> coGroup and related functions now return Iterable[T] instead of Seq[T]
>> ==> Call toSeq on the result to restore the old behavior
>>
>> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
>> ==> Call toSeq on the result to restore old behavior
>
>


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-26 Thread Patrick Wendell
Hey Ankur,

That does seem like a good fix, but right now we are only blocking the
release on major regressions that affect all components. So I don't
think this is sufficient to block it from going forward and cutting a
new candidate. This is because we are in the very late stage of the
release.

We can slot that for the 1.0.1 release and merge it into the 1.0
branch so people can get access to the fix easily.

On Mon, May 26, 2014 at 6:50 PM, ankurdave  wrote:
> -1
>
> I just fixed SPARK-1931, which was a critical bug in Graph#partitionBy.
> Since this is an important
> part of the GraphX API, I think Spark 1.0.0 should include the fix:
> https://github.com/apache/spark/pull/885.
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-RC11-tp6797p6802.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: all values for a key must fit in memory

2014-05-25 Thread Patrick Wendell
Nilesh - out of curiosity - what operation are you doing on the values
for the key?

On Sun, May 25, 2014 at 6:35 PM, Nilesh  wrote:
> Hi Andrew,
>
> Thanks for the reply!
>
> It's clearer about the API part now. That's what I wanted to know.
>
> Wow, tuples, why didn't that occur to me. That's a lovely ugly hack. :) I
> also came across something that solved my real problem though - the
> RDD.toLocalIterator method from 1.0, the logic of which thankfully works
> with 0.9.1 too, no new API changes there.
>
> Cheers,
> Nilesh
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/all-values-for-a-key-must-fit-in-memory-tp6342p6794.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: No output from Spark Streaming program with Spark 1.0

2014-05-23 Thread Patrick Wendell
Also, one other thing to try: remove all of the logic from inside
of foreach and just print something. It could be that somehow an
exception is being triggered inside of your foreach block and as a
result the output goes away.
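
Something along these lines (a stripped-down sketch of the msgstream defined
in the code below, just to verify that any output is produced at all):

  msgstream.foreach(rdd => {
    // no parsing, no accumulators - just prove that each batch fires
    System.out.println("batch at " + new java.util.Date() + ": " + rdd.count() + " records")
  })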

On Fri, May 23, 2014 at 6:00 PM, Patrick Wendell  wrote:
> Hey Jim,
>
> Do you see the same behavior if you run this outside of eclipse?
>
> Also, what happens if you print something to standard out when setting
> up your streams (i.e. not inside of the foreach) do you see that? This
> could be a streaming issue, but it could also be something related to
> the way it's running in eclipse.
>
> - Patrick
>
> On Fri, May 23, 2014 at 2:57 PM, Jim Donahue  wrote:
>> I'm trying out 1.0 on a set of small Spark Streaming tests and am running
>> into problems.  Here's one of the little programs I've used for a long
>> time - it reads a Kafka stream that contains Twitter JSON tweets and does
>> some simple counting.  The program starts OK (it connects to the Kafka
>> stream fine) and generates a stream of INFO logging messages, but never
>> generates any output. :-(
>>
>> I'm running this in Eclipse, so there may be some class loading issue
>> (loading the wrong class or something like that), but I'm not seeing
>> anything in the console output.
>>
>> Thanks,
>>
>> Jim Donahue
>> Adobe
>>
>>
>>
>> val kafka_messages =
>>   KafkaUtils.createStream[Array[Byte], Array[Byte],
>> kafka.serializer.DefaultDecoder, kafka.serializer.DefaultDecoder](ssc,
>> propsMap, topicMap, StorageLevel.MEMORY_AND_DISK)
>>
>>
>>  val messages = kafka_messages.map(_._2)
>>
>>
>>  val total = ssc.sparkContext.accumulator(0)
>>
>>
>>  val startTime = new java.util.Date().getTime()
>>
>>
>>  val jsonstream = messages.map[JSONObject](message =>
>>   {val string = new String(message);
>>   val json = new JSONObject(string);
>>   total += 1
>>   json
>>   }
>> )
>>
>>
>> val deleted = ssc.sparkContext.accumulator(0)
>>
>>
>> val msgstream = jsonstream.filter(json =>
>>   if (!json.has("delete")) true else { deleted += 1; false}
>>   )
>>
>>
>> msgstream.foreach(rdd => {
>>   if(rdd.count() > 0){
>>   val data = rdd.map(json => (json.has("entities"),
>> json.length())).collect()
>>   val entities: Double = data.count(t => t._1)
>>   val fieldCounts = data.sortBy(_._2)
>>   val minFields = fieldCounts(0)._2
>>   val maxFields = fieldCounts(fieldCounts.size - 1)._2
>>   val now = new java.util.Date()
>>   val interval = (now.getTime() - startTime) / 1000
>>   System.out.println(now.toString)
>>   System.out.println("processing time: " + interval + " seconds")
>>   System.out.println("total messages: " + total.value)
>>   System.out.println("deleted messages: " + deleted.value)
>>   System.out.println("message receipt rate: " + (total.value/interval)
>> + " per second")
>>   System.out.println("messages this interval: " + data.length)
>>   System.out.println("message fields varied between: " + minFields + "
>> and " + maxFields)
>>   System.out.println("fraction with entities is " + (entities /
>> data.length))
>>   }
>> }
>> )
>>
>> ssc.start()
>>


Re: No output from Spark Streaming program with Spark 1.0

2014-05-23 Thread Patrick Wendell
Hey Jim,

Do you see the same behavior if you run this outside of eclipse?

Also, what happens if you print something to standard out when setting
up your streams (i.e. not inside of the foreach) do you see that? This
could be a streaming issue, but it could also be something related to
the way it's running in eclipse.

- Patrick

On Fri, May 23, 2014 at 2:57 PM, Jim Donahue  wrote:
> I'm trying out 1.0 on a set of small Spark Streaming tests and am running
> into problems.  Here's one of the little programs I've used for a long
> time - it reads a Kafka stream that contains Twitter JSON tweets and does
> some simple counting.  The program starts OK (it connects to the Kafka
> stream fine) and generates a stream of INFO logging messages, but never
> generates any output. :-(
>
> I'm running this in Eclipse, so there may be some class loading issue
> (loading the wrong class or something like that), but I'm not seeing
> anything in the console output.
>
> Thanks,
>
> Jim Donahue
> Adobe
>
>
>
> val kafka_messages =
>   KafkaUtils.createStream[Array[Byte], Array[Byte],
> kafka.serializer.DefaultDecoder, kafka.serializer.DefaultDecoder](ssc,
> propsMap, topicMap, StorageLevel.MEMORY_AND_DISK)
>
>
>  val messages = kafka_messages.map(_._2)
>
>
>  val total = ssc.sparkContext.accumulator(0)
>
>
>  val startTime = new java.util.Date().getTime()
>
>
>  val jsonstream = messages.map[JSONObject](message =>
>   {val string = new String(message);
>   val json = new JSONObject(string);
>   total += 1
>   json
>   }
> )
>
>
> val deleted = ssc.sparkContext.accumulator(0)
>
>
> val msgstream = jsonstream.filter(json =>
>   if (!json.has("delete")) true else { deleted += 1; false}
>   )
>
>
> msgstream.foreach(rdd => {
>   if(rdd.count() > 0){
>   val data = rdd.map(json => (json.has("entities"),
> json.length())).collect()
>   val entities: Double = data.count(t => t._1)
>   val fieldCounts = data.sortBy(_._2)
>   val minFields = fieldCounts(0)._2
>   val maxFields = fieldCounts(fieldCounts.size - 1)._2
>   val now = new java.util.Date()
>   val interval = (now.getTime() - startTime) / 1000
>   System.out.println(now.toString)
>   System.out.println("processing time: " + interval + " seconds")
>   System.out.println("total messages: " + total.value)
>   System.out.println("deleted messages: " + deleted.value)
>   System.out.println("message receipt rate: " + (total.value/interval)
> + " per second")
>   System.out.println("messages this interval: " + data.length)
>   System.out.println("message fields varied between: " + minFields + "
> and " + maxFields)
>   System.out.println("fraction with entities is " + (entities /
> data.length))
>   }
> }
> )
>
> ssc.start()
>


Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-21 Thread Patrick Wendell
Hey I just looked at the fix here:
https://github.com/apache/spark/pull/848

Given that this is quite simple, maybe it's best to just go with this
and just explain that we don't support adding jars dynamically in YARN
in Spark 1.0. That seems like a reasonable thing to do.
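
For anyone following along, the practical upshot is that on YARN the extra
jars should be listed at submit time rather than added at runtime with
sc.addJar. A rough sketch - the jar, class, and application names here are
placeholders:

  ./bin/spark-submit \
    --master yarn-cluster \
    --class com.example.MyApp \
    --jars /path/to/utility.jar \
    myapp.jar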

On Wed, May 21, 2014 at 3:15 PM, Patrick Wendell  wrote:
> Of these two solutions I'd definitely prefer 2 in the short term. I'd
> imagine the fix is very straightforward (it would mostly just be
> removing code), and we'd be making this more consistent with the
> standalone mode which makes things way easier to reason about.
>
> In the long term we'll definitely want to exploit the distributed
> cache more, but at this point it's premature optimization at a high
> complexity cost. Writing stuff to HDFS is so slow anyways that I'd
> guess that serving it directly from the driver is still faster in most
> cases (though for very large jar sizes or very large clusters, yes,
> we'll need the distributed cache).
>
> - Patrick
>
> On Wed, May 21, 2014 at 2:41 PM, Xiangrui Meng  wrote:
>> That's a good example. If we really want to cover that case, there are
>> two solutions:
>>
>> 1. Follow DB's patch, adding jars to the system classloader. Then we
>> cannot put a user class in front of an existing class.
>> 2. Do not send the primary jar and secondary jars to executors'
>> distributed cache. Instead, add them to "spark.jars" in SparkSubmit
>> and serve them via http by called sc.addJar in SparkContext.
>>
>> What is your preference?
>>
>> On Wed, May 21, 2014 at 2:27 PM, Sandy Ryza  wrote:
>>> Is that an assumption we can make?  I think we'd run into an issue in this
>>> situation:
>>>
>>> *In primary jar:*
>>> def makeDynamicObject(clazz: String) = Class.forName(clazz).newInstance()
>>>
>>> *In app code:*
>>> sc.addJar("dynamicjar.jar")
>>> ...
>>> rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))
>>>
>>> It might be fair to say that the user should make sure to use the context
>>> classloader when instantiating dynamic classes, but I think it's weird that
>>> this code would work on Spark standalone but not on YARN.
>>>
>>> -Sandy
>>>
>>>
>>> On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng  wrote:
>>>
>>>> I think adding jars dynamically should work as long as the primary jar
>>>> and the secondary jars do not depend on dynamically added jars, which
>>>> should be the correct logic. -Xiangrui
>>>>
>>>> On Wed, May 21, 2014 at 1:40 PM, DB Tsai  wrote:
>>>> > This will be another separate story.
>>>> >
>>>> > Since in the yarn deployment, as Sandy said, the app.jar will be always
>>>> in
>>>> > the systemclassloader which means any object instantiated in app.jar will
>>>> > have parent loader of systemclassloader instead of custom one. As a
>>>> result,
>>>> > the custom class loader in yarn will never work without specifically
>>>> using
>>>> > reflection.
>>>> >
>>>> > Solution will be not using system classloader in the classloader
>>>> hierarchy,
>>>> > and add all the resources in system one into custom one. This is the
>>>> > approach of tomcat takes.
>>>> >
>>>> > Or we can directly overwirte the system class loader by calling the
>>>> > protected method `addURL` which will not work and throw exception if the
>>>> > code is wrapped in security manager.
>>>> >
>>>> >
>>>> > Sincerely,
>>>> >
>>>> > DB Tsai
>>>> > ---
>>>> > My Blog: https://www.dbtsai.com
>>>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> >
>>>> >
>>>> > On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza 
>>>> wrote:
>>>> >
>>>> >> This will solve the issue for jars added upon application submission,
>>>> but,
>>>> >> on top of this, we need to make sure that anything dynamically added
>>>> >> through sc.addJar works as well.
>>>> >>
>>>> >> To do so, we need to make sure that any jars retrieved via the driver's
>>>> >> HTTP server are loaded by the same classloader that lo

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-21 Thread Patrick Wendell
>>> >> > than just the master, and set the classpath to include spark jar,
>>> >> > primary app jar, and secondary jars before executor starts. In this
>>> >> > way, user only needs to specify secondary jars via --jars instead of
>>> >> > calling sc.addJar inside the code. It also solves the scalability
>>> >> > problem of serving all the jars via http.
>>> >> >
>>> >> > If this solution sounds good, I can try to make a patch.
>>> >> >
>>> >> > Best,
>>> >> > Xiangrui
>>> >> >
>>> >> > On Mon, May 19, 2014 at 10:04 PM, DB Tsai 
>>> wrote:
>>> >> > > In 1.0, there is a new option for users to choose which classloader
>>> has
>>> >> > > higher priority via spark.files.userClassPathFirst, I decided to
>>> submit
>>> >> > the
>>> >> > > PR for 0.9 first. We use this patch in our lab and we can use those
>>> >> jars
>>> >> > > added by sc.addJar without reflection.
>>> >> > >
>>> >> > > https://github.com/apache/spark/pull/834
>>> >> > >
>>> >> > > Can anyone comment if it's a good approach?
>>> >> > >
>>> >> > > Thanks.
>>> >> > >
>>> >> > >
>>> >> > > Sincerely,
>>> >> > >
>>> >> > > DB Tsai
>>> >> > > ---
>>> >> > > My Blog: https://www.dbtsai.com
>>> >> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>>> >> > >
>>> >> > >
>>> >> > > On Mon, May 19, 2014 at 7:42 PM, DB Tsai 
>>> wrote:
>>> >> > >
>>> >> > >> Good summary! We fixed it in branch 0.9 since our production is
>>> still
>>> >> in
>>> >> > >> 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for
>>> 1.0
>>> >> > >> tonight.
>>> >> > >>
>>> >> > >>
>>> >> > >> Sincerely,
>>> >> > >>
>>> >> > >> DB Tsai
>>> >> > >> ---
>>> >> > >> My Blog: https://www.dbtsai.com
>>> >> > >> LinkedIn: https://www.linkedin.com/in/dbtsai
>>> >> > >>
>>> >> > >>
>>> >> > >> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <
>>> sandy.r...@cloudera.com
>>> >> > >wrote:
>>> >> > >>
>>> >> > >>> It just hit me why this problem is showing up on YARN and not on
>>> >> > >>> standalone.
>>> >> > >>>
>>> >> > >>> The relevant difference between YARN and standalone is that, on
>>> YARN,
>>> >> > the
>>> >> > >>> app jar is loaded by the system classloader instead of Spark's
>>> custom
>>> >> > URL
>>> >> > >>> classloader.
>>> >> > >>>
>>> >> > >>> On YARN, the system classloader knows about [the classes in the
>>> spark
>>> >> > >>> jars,
>>> >> > >>> the classes in the primary app jar].   The custom classloader
>>> knows
>>> >> > about
>>> >> > >>> [the classes in secondary app jars] and has the system
>>> classloader as
>>> >> > its
>>> >> > >>> parent.
>>> >> > >>>
>>> >> > >>> A few relevant facts (mostly redundant with what Sean pointed
>>> out):
>>> >> > >>> * Every class has a classloader that loaded it.
>>> >> > >>> * When an object of class B is instantiated inside of class A, the
>>> >> > >>> classloader used for loading B is the classloader that was used
>>> for
>>> >> > >>> loading
>>> >> > >>> A.
>>> >> > >>> * When a classloader fails to load a class, it lets its parent
>>> >> > classloader
>>> >> > >>

Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-19 Thread Patrick Wendell
We're cancelling this RC in favor of rc10. There were two blockers: an
issue with Windows run scripts and an issue with the packaging for
Hadoop 1 when hive support is bundled.

https://issues.apache.org/jira/browse/SPARK-1875
https://issues.apache.org/jira/browse/SPARK-1876

Thanks everyone for the testing. TD will be cutting rc10, since I'm
travelling this week (thanks TD!).

- Patrick

On Mon, May 19, 2014 at 7:06 PM, Nan Zhu  wrote:
> just rerun my test on rc5
>
> everything works
>
> build applications with sbt and the spark-*.jar which is compiled with Hadoop 
> 2.3
>
> +1
>
> --
> Nan Zhu
>
>
> On Sunday, May 18, 2014 at 11:07 PM, witgo wrote:
>
>> How to reproduce this bug?
>>
>>
>> -- Original --
>> From: "Patrick Wendell";mailto:pwend...@gmail.com)>;
>> Date: Mon, May 19, 2014 10:08 AM
>> To: "dev@spark.apache.org 
>> (mailto:dev@spark.apache.org)"> (mailto:dev@spark.apache.org)>;
>> Cc: "Tom Graves"mailto:tgraves...@yahoo.com)>;
>> Subject: Re: [VOTE] Release Apache Spark 1.0.0 (rc9)
>>
>>
>>
>> Hey Matei - the issue you found is not related to security. This patch
>> a few days ago broke builds for Hadoop 1 with YARN support enabled.
>> The patch directly altered the way we deal with commons-lang
>> dependency, which is what is at the base of this stack trace.
>>
>> https://github.com/apache/spark/pull/754
>>
>> - Patrick
>>
>> On Sun, May 18, 2014 at 5:28 PM, Matei Zaharia > (mailto:matei.zaha...@gmail.com)> wrote:
>> > Alright, I've opened https://github.com/apache/spark/pull/819 with the 
>> > Windows fixes. I also found one other likely bug, 
>> > https://issues.apache.org/jira/browse/SPARK-1875, in the binary packages 
>> > for Hadoop1 built in this RC. I think this is due to Hadoop 1's security 
>> > code depending on a different version of org.apache.commons than Hadoop 2, 
>> > but it needs investigation. Tom, any thoughts on this?
>> >
>> > Matei
>> >
>> > On May 18, 2014, at 12:33 PM, Matei Zaharia > > (mailto:matei.zaha...@gmail.com)> wrote:
>> >
>> > > I took the always fun task of testing it on Windows, and unfortunately, 
>> > > I found some small problems with the prebuilt packages due to recent 
>> > > changes to the launch scripts: bin/spark-class2.cmd looks in ./jars 
>> > > instead of ./lib for the assembly JAR, and bin/run-example2.cmd doesn't 
>> > > quite match the master-setting behavior of the Unix based one. I'll send 
>> > > a pull request to fix them soon.
>> > >
>> > > Matei
>> > >
>> > >
>> > > On May 17, 2014, at 11:32 AM, Sandy Ryza > > > (mailto:sandy.r...@cloudera.com)> wrote:
>> > >
>> > > > +1
>> > > >
>> > > > Reran my tests from rc5:
>> > > >
>> > > > * Built the release from source.
>> > > > * Compiled Java and Scala apps that interact with HDFS against it.
>> > > > * Ran them in local mode.
>> > > > * Ran them against a pseudo-distributed YARN cluster in both 
>> > > > yarn-client
>> > > > mode and yarn-cluster mode.
>> > > >
>> > > >
>> > > > On Sat, May 17, 2014 at 10:08 AM, Andrew Or > > > > (mailto:and...@databricks.com)> wrote:
>> > > >
>> > > > > +1
>> > > > >
>> > > > >
>> > > > > 2014-05-17 8:53 GMT-07:00 Mark Hamstra > > > > > (mailto:m...@clearstorydata.com)>:
>> > > > >
>> > > > > > +1
>> > > > > >
>> > > > > >
>> > > > > > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell 
>> > > > > > mailto:pwend...@gmail.com)
>> > > > > > > wrote:
>> > > > > >
>> > > > > >
>> > > > > > > I'll start the voting with a +1.
>> > > > > > >
>> > > > > > > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell 
>> > > > > > > mailto:pwend...@gmail.com)>
>> > > > > > > wrote:
>> > > > > > > > Please vote on releasing the following candidate as Apache 
>> > > > > > > > Spark
>> > > > > > >
>> > > > > >

Re: spark 1.0 standalone application

2014-05-19 Thread Patrick Wendell
Whenever we publish a release candidate, we create a temporary maven
repository that hosts the artifacts. We do this precisely for the case
you are running into (where a user wants to build an application
against it to test).

You can build against the release candidate by just adding that
repository in your sbt build, then linking against "spark-core"
version "1.0.0". For rc9 the repository is in the vote e-mail:

http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc9-td6629.html
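
In sbt terms that amounts to something like this (a sketch - the staging
repository number is a placeholder for whatever URL the vote e-mail lists):

  resolvers += "Spark RC staging" at "https://repository.apache.org/content/repositories/orgapachespark-NNNN/"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"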

On Mon, May 19, 2014 at 7:03 PM, Mark Hamstra  wrote:
> That's the crude way to do it.  If you run `sbt/sbt publishLocal`, then you
> can resolve the artifact from your local cache in the same way that you
> would resolve it if it were deployed to a remote cache.  That's just the
> build step.  Actually running the application will require the necessary
> jars to be accessible by the cluster nodes.
>
>
> On Mon, May 19, 2014 at 7:04 PM, Nan Zhu  wrote:
>
>> en, you have to put spark-assembly-*.jar to the lib directory of your
>> application
>>
>> Best,
>>
>> --
>> Nan Zhu
>>
>>
>> On Monday, May 19, 2014 at 9:48 PM, nit wrote:
>>
>> > I am not very comfortable with sbt. I want to build a standalone
>> application
>> > using spark 1.0 RC9. I can build sbt assembly for my application with
>> Spark
>> > 0.9.1, and I think in that case spark is pulled from Aka Repository?
>> >
>> > Now if I want to use 1.0 RC9 for my application; what is the process ?
>> > (FYI, I was able to build spark-1.0 via sbt/assembly and I can see
>> > sbt-assembly jar; and I think I will have to copy my jar somewhere? and
>> > update build.sbt?)
>> >
>> > PS: I am not sure if this is the right place for this question; but since
>> > 1.0 is still RC, I felt that this may be appropriate forum.
>> >
>> > thank!
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-1-0-standalone-application-tp6698.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com (http://Nabble.com).
>> >
>> >
>>
>>
>>


Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread Patrick Wendell
Having a user define a custom class inside of an added jar and
instantiate it directly inside of an executor is definitely supported
in Spark and has been for a really long time (several years). This is
something we do all the time in Spark.

DB - I'd hold off on re-architecting this until we identify
exactly what is causing the bug you are running into.

In a nutshell, when the bytecode "new Foo()" is run on the executor,
it will ask the driver for the class over HTTP using a custom
classloader. Something in that pipeline is breaking here, possibly
related to the YARN deployment stuff.
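
In other words, the intended pattern is simply the direct version - a sketch
with a made-up class name, assuming the class is on the application's
compile-time classpath and the jar itself is shipped with sc.addJar:

  sc.addJar("/path/to/utility.jar")
  data.mapPartitions { lines =>
    val parser = new com.example.CSVRecordParser(',')
    lines.map(line => parser.parseLine(line))
  }

No reflection should be needed for this to work.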


On Mon, May 19, 2014 at 12:29 AM, Sean Owen  wrote:
> I don't think a custom classloader is necessary.
>
> Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc
> all run custom user code that creates new user objects without
> reflection. I should go see how that's done. Maybe it's totally valid
> to set the thread's context classloader for just this purpose, and I
> am not thinking clearly.
>
> On Mon, May 19, 2014 at 8:26 AM, Andrew Ash  wrote:
>> Sounds like the problem is that classloaders always look in their parents
>> before themselves, and Spark users want executors to pick up classes from
>> their custom code before the ones in Spark plus its dependencies.
>>
>> Would a custom classloader that delegates to the parent after first
>> checking itself fix this up?
>>
>>
>> On Mon, May 19, 2014 at 12:17 AM, DB Tsai  wrote:
>>
>>> Hi Sean,
>>>
>>> It's true that the issue here is classloader, and due to the classloader
>>> delegation model, users have to use reflection in the executors to pick up
>>> the classloader in order to use those classes added by sc.addJars APIs.
>>> However, it's very inconvenience for users, and not documented in spark.
>>>
>>> I'm working on a patch to solve it by calling the protected method addURL
>>> in URLClassLoader to update the current default classloader, so no
>>> customClassLoader anymore. I wonder if this is an good way to go.
>>>
>>>   private def addURL(url: URL, loader: URLClassLoader) {
>>>     try {
>>>       val method: Method =
>>>         classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
>>>       method.setAccessible(true)
>>>       method.invoke(loader, url)
>>>     } catch {
>>>       case t: Throwable =>
>>>         throw new IOException("Error, could not add URL to system classloader")
>>>     }
>>>   }
>>>
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ---
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>>
>>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen  wrote:
>>>
>>> > I might be stating the obvious for everyone, but the issue here is not
>>> > reflection or the source of the JAR, but the ClassLoader. The basic
>>> > rules are this.
>>> >
>>> > "new Foo" will use the ClassLoader that defines Foo. This is usually
>>> > the ClassLoader that loaded whatever it is that first referenced Foo
>>> > and caused it to be loaded -- usually the ClassLoader holding your
>>> > other app classes.
>>> >
>>> > ClassLoaders can have a parent-child relationship. ClassLoaders always
>>> > look in their parent before themselves.
>>> >
>>> > (Careful then -- in contexts like Hadoop or Tomcat where your app is
>>> > loaded in a child ClassLoader, and you reference a class that Hadoop
>>> > or Tomcat also has (like a lib class) you will get the container's
>>> > version!)
>>> >
>>> > When you load an external JAR it has a separate ClassLoader which does
>>> > not necessarily bear any relation to the one containing your app
>>> > classes, so yeah it is not generally going to make "new Foo" work.
>>> >
>>> > Reflection lets you pick the ClassLoader, yes.
>>> >
>>> > I would not call setContextClassLoader.
>>> >
>>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza 
>>> > wrote:
>>> > > I spoke with DB offline about this a little while ago and he confirmed
>>> > that
>>> > > he was able to access the jar from the driver.
>>> > >
>>> > > The issue appears to be a general Java issue: you can't directly
>>> > > instantiate a class from a dynamically loaded jar.
>>> > >
>>> > > I reproduced it locally outside of Spark with:
>>> > > ---
>>> > > URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
>>> > > File("myotherjar.jar").toURI().toURL() }, null);
>>> > > Thread.currentThread().setContextClassLoader(urlClassLoader);
>>> > > MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
>>> > > ---
>>> > >
>>> > > I was able to load the class with reflection.
>>> >
>>>


Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-18 Thread Patrick Wendell
Hey Matei - the issue you found is not related to security. This patch
a few days ago broke builds for Hadoop 1 with YARN support enabled.
The patch directly altered the way we deal with commons-lang
dependency, which is what is at the base of this stack trace.

https://github.com/apache/spark/pull/754

- Patrick

On Sun, May 18, 2014 at 5:28 PM, Matei Zaharia  wrote:
> Alright, I've opened https://github.com/apache/spark/pull/819 with the 
> Windows fixes. I also found one other likely bug, 
> https://issues.apache.org/jira/browse/SPARK-1875, in the binary packages for 
> Hadoop1 built in this RC. I think this is due to Hadoop 1's security code 
> depending on a different version of org.apache.commons than Hadoop 2, but it 
> needs investigation. Tom, any thoughts on this?
>
> Matei
>
> On May 18, 2014, at 12:33 PM, Matei Zaharia  wrote:
>
>> I took the always fun task of testing it on Windows, and unfortunately, I 
>> found some small problems with the prebuilt packages due to recent changes 
>> to the launch scripts: bin/spark-class2.cmd looks in ./jars instead of ./lib 
>> for the assembly JAR, and bin/run-example2.cmd doesn't quite match the 
>> master-setting behavior of the Unix based one. I'll send a pull request to 
>> fix them soon.
>>
>> Matei
>>
>>
>> On May 17, 2014, at 11:32 AM, Sandy Ryza  wrote:
>>
>>> +1
>>>
>>> Reran my tests from rc5:
>>>
>>> * Built the release from source.
>>> * Compiled Java and Scala apps that interact with HDFS against it.
>>> * Ran them in local mode.
>>> * Ran them against a pseudo-distributed YARN cluster in both yarn-client
>>> mode and yarn-cluster mode.
>>>
>>>
>>> On Sat, May 17, 2014 at 10:08 AM, Andrew Or  wrote:
>>>
>>>> +1
>>>>
>>>>
>>>> 2014-05-17 8:53 GMT-07:00 Mark Hamstra :
>>>>
>>>>> +1
>>>>>
>>>>>
>>>>> On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell >>>>> wrote:
>>>>>
>>>>>> I'll start the voting with a +1.
>>>>>>
>>>>>> On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell 
>>>>>> wrote:
>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version
>>>>>> 1.0.0!
>>>>>>> This has one bug fix and one minor feature on top of rc8:
>>>>>>> SPARK-1864: https://github.com/apache/spark/pull/808
>>>>>>> SPARK-1808: https://github.com/apache/spark/pull/799
>>>>>>>
>>>>>>> The tag to be voted on is v1.0.0-rc9 (commit 920f947):
>>>>>>>
>>>>>>
>>>>>
>>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75
>>>>>>>
>>>>>>> The release files, including signatures, digests, etc. can be found
>>>> at:
>>>>>>> http://people.apache.org/~pwendell/spark-1.0.0-rc9/
>>>>>>>
>>>>>>> Release artifacts are signed with the following key:
>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>
>>>>>>> The staging repository for this release can be found at:
>>>>>>>
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1017/
>>>>>>>
>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>> http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/
>>>>>>>
>>>>>>> Please vote on releasing this package as Apache Spark 1.0.0!
>>>>>>>
>>>>>>> The vote is open until Tuesday, May 20, at 08:56 UTC and passes if
>>>>>>> amajority of at least 3 +1 PMC votes are cast.
>>>>>>>
>>>>>>> [ ] +1 Release this package as Apache Spark 1.0.0
>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>
>>>>>>> To learn more about Apache Spark, please see
>>>>>>> http://spark.apache.org/
>>>>>>>
>>>>>>> == API Changes ==
>>>>>>> We welcome users to compile Spark applications against 1.0. There are
>>>>>>> a few API changes in this release. Here are links to the associated
>>>>>>> upgrade guides - user facing changes have been kept as small as
>>>>>>> possible.
>>>>>>>
>>>>>>> changes to ML vector specification:
>>>>>>>
>>>>>>
>>>>>
>>>> http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10
>>>>>>>
>>>>>>> changes to the Java API:
>>>>>>>
>>>>>>
>>>>>
>>>> http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>>>>>>>
>>>>>>> changes to the streaming API:
>>>>>>>
>>>>>>
>>>>>
>>>> http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
>>>>>>>
>>>>>>> changes to the GraphX API:
>>>>>>>
>>>>>>
>>>>>
>>>> http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
>>>>>>>
>>>>>>> coGroup and related functions now return Iterable[T] instead of
>>>> Seq[T]
>>>>>>> ==> Call toSeq on the result to restore the old behavior
>>>>>>>
>>>>>>> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
>>>>>>> ==> Call toSeq on the result to restore old behavior
>>>>>>
>>>>>
>>>>
>>
>


Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Patrick Wendell
@xiangrui - we don't expect these to be present on the system
classpath, because they get dynamically added by Spark (e.g. your
application can call sc.addJar well after the JVM's have started).

@db - I'm pretty surprised to see that behavior. It's definitely not
intended that users need reflection to instantiate their classes -
something odd is going on in your case. If you could create an
isolated example and post it to the JIRA, that would be great.

On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng  wrote:
> I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870
>
> DB, could you add more info to that JIRA? Thanks!
>
> -Xiangrui
>
> On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng  wrote:
>> Btw, I tried
>>
>> rdd.map { i =>
>>   System.getProperty("java.class.path")
>> }.collect()
>>
>> but didn't see the jars added via "--jars" on the executor classpath.
>>
>> -Xiangrui
>>
>> On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng  wrote:
>>> I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
>>> reflection approach mentioned by DB didn't work either. I checked the
>>> distributed cache on a worker node and found the jar there. It is also
>>> in the Environment tab of the WebUI. The workaround is making an
>>> assembly jar.
>>>
>>> DB, could you create a JIRA and describe what you have found so far? Thanks!
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan  
>>> wrote:
 Can you try moving your mapPartitions to another class/object which is
 referenced only after sc.addJar ?

 I would suspect CNFEx is coming while loading the class containing
 mapPartitions before addJars is executed.

 In general though, dynamic loading of classes means you use reflection to
 instantiate it since expectation is you don't know which implementation
 provides the interface ... If you statically know it apriori, you bundle it
 in your classpath.

 Regards
 Mridul
 On 17-May-2014 7:28 am, "DB Tsai"  wrote:

> Finally found a way out of the ClassLoader maze! It took me some time to
> understand how it works; I think it's worth documenting it in a separate
> thread.
>
> We're trying to add external utility.jar which contains CSVRecordParser,
> and we added the jar to executors through sc.addJar APIs.
>
> If the instance of CSVRecordParser is created without reflection, it
> raises *ClassNotFound
> Exception*.
>
> data.mapPartitions(lines => {
> val csvParser = new CSVRecordParser(delimiter.charAt(0))
> lines.foreach(line => {
>   val lineElems = csvParser.parseLine(line)
> })
> ...
> ...
>  )
>
>
> If the instance of CSVRecordParser is created through reflection, it 
> works.
>
> data.mapPartitions(lines => {
> val loader = Thread.currentThread.getContextClassLoader
> val CSVRecordParser =
> loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
>
> val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
> .newInstance(delimiter.charAt(0).asInstanceOf[Character])
>
> val parseLine = CSVRecordParser
> .getDeclaredMethod("parseLine", classOf[String])
>
> lines.foreach(line => {
>val lineElems = parseLine.invoke(csvParser,
> line).asInstanceOf[Array[String]]
> })
> ...
> ...
>  )
>
>
> This is identical to this question,
>
> http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
>
> It's not intuitive for users to load external classes through reflection,
> but a couple of available solutions, including 1) messing around
> systemClassLoader by calling systemClassLoader.addURI through reflection 
> or
> 2) forking another JVM to add jars into classpath before bootstrap loader
> are very tricky.
>
> Any thought on fixing it properly?
>
> @Xiangrui,
> netlib-java jniloader is loaded from netlib-java through reflection, so
> this problem will not be seen.
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>


Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Patrick Wendell
@db - it's possible that you aren't including the jar in the classpath
of your driver program (I think this is what mridul was suggesting).
It would be helpful to see the stack trace of the CNFE.

- Patrick

On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell  wrote:
> @xiangrui - we don't expect these to be present on the system
> classpath, because they get dynamically added by Spark (e.g. your
> application can call sc.addJar well after the JVM's have started).
>
> @db - I'm pretty surprised to see that behavior. It's definitely not
> intended that users need reflection to instantiate their classes -
> something odd is going on in your case. If you could create an
> isolated example and post it to the JIRA, that would be great.
>
> On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng  wrote:
>> I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870
>>
>> DB, could you add more info to that JIRA? Thanks!
>>
>> -Xiangrui
>>
>> On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng  wrote:
>>> Btw, I tried
>>>
>>> rdd.map { i =>
>>>   System.getProperty("java.class.path")
>>> }.collect()
>>>
>>> but didn't see the jars added via "--jars" on the executor classpath.
>>>
>>> -Xiangrui
>>>
>>> On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng  wrote:
>>>> I can reproduce the error with Spark 1.0-RC and YARN (CDH-5). The
>>>> reflection approach mentioned by DB didn't work either. I checked the
>>>> distributed cache on a worker node and found the jar there. It is also
>>>> in the Environment tab of the WebUI. The workaround is making an
>>>> assembly jar.
>>>>
>>>> DB, could you create a JIRA and describe what you have found so far? 
>>>> Thanks!
>>>>
>>>> Best,
>>>> Xiangrui
>>>>
>>>> On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan  
>>>> wrote:
>>>>> Can you try moving your mapPartitions to another class/object which is
>>>>> referenced only after sc.addJar?
>>>>>
>>>>> I would suspect the CNFEx (ClassNotFoundException) is coming while loading
>>>>> the class containing mapPartitions before sc.addJar is executed.
>>>>>
>>>>> In general though, dynamic loading of classes means you use reflection to
>>>>> instantiate them, since the expectation is that you don't know which
>>>>> implementation provides the interface ... If you know it statically, a
>>>>> priori, you bundle it in your classpath.
>>>>>
>>>>> Regards
>>>>> Mridul
>>>>> On 17-May-2014 7:28 am, "DB Tsai"  wrote:
>>>>>
>>>>>> Finally found a way out of the ClassLoader maze! It took me some time to
>>>>>> understand how it works; I think it's worth documenting in a separate
>>>>>> thread.
>>>>>>
>>>>>> We're trying to add an external utility.jar which contains CSVRecordParser,
>>>>>> and we added the jar to the executors through the sc.addJar API.
>>>>>>
>>>>>> If the instance of CSVRecordParser is created without reflection, it
>>>>>> raises a *ClassNotFoundException*.
>>>>>>
>>>>>> data.mapPartitions(lines => {
>>>>>>   val csvParser = new CSVRecordParser(delimiter.charAt(0))
>>>>>>   lines.foreach(line => {
>>>>>>     val lineElems = csvParser.parseLine(line)
>>>>>>   })
>>>>>>   ...
>>>>>>   ...
>>>>>> })
>>>>>>
>>>>>>
>>>>>> If the instance of CSVRecordParser is created through reflection, it 
>>>>>> works.
>>>>>>
>>>>>> data.mapPartitions(lines => {
>>>>>>   val loader = Thread.currentThread.getContextClassLoader
>>>>>>   val CSVRecordParser =
>>>>>>     loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
>>>>>>
>>>>>>   val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
>>>>>>     .newInstance(delimiter.charAt(0).asInstanceOf[Character])
>>>>>>
>>>>>>   val parseLine = CSVRecordParser
>>>>>>     .getDeclaredMethod("parseLine", classOf[String])
>>>>>>
>>>>>>   lines.foreach(line => {
>>>>>>     val lineElems = parseLine.invoke(csvParser,
>>>>>>       line).asInstanceOf[Array[String]]
>>>>>>   })
>>>>>>   ...
>>>>>>   ...
>>>>>> })
>>>>>>
>>>>>>
>>>>>> This is essentially the same issue as this question:
>>>>>>
>>>>>> http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
>>>>>>
>>>>>> It's not intuitive for users to have to load external classes through
>>>>>> reflection, and the couple of available workarounds are very tricky:
>>>>>> 1) messing with the system class loader by calling its addURL method
>>>>>> through reflection, or 2) forking another JVM so the jars can be added to
>>>>>> the classpath before the bootstrap loader runs.
>>>>>>
>>>>>> Any thoughts on fixing it properly?
>>>>>>
>>>>>> @Xiangrui,
>>>>>> netlib-java jniloader is loaded from netlib-java through reflection, so
>>>>>> this problem will not be seen.
>>>>>>
>>>>>> Sincerely,
>>>>>>
>>>>>> DB Tsai
>>>>>> ---
>>>>>> My Blog: https://www.dbtsai.com
>>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>


[VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-17 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.0.0!
This has one bug fix and one minor feature on top of rc8:
SPARK-1864: https://github.com/apache/spark/pull/808
SPARK-1808: https://github.com/apache/spark/pull/799

The tag to be voted on is v1.0.0-rc9 (commit 920f947):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc9/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1017/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Tuesday, May 20, at 08:56 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

changes to ML vector specification:
http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10

changes to the Java API:
http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

changes to the streaming API:
http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

changes to the GraphX API:
http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior
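
As a concrete illustration of the last two notes (the RDDs below are made up
and an existing SparkContext sc is assumed), code written against 0.9 can
keep its Seq-based behavior by calling toSeq on the new return types:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD functions such as cogroup

val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
val right = sc.parallelize(Seq(("a", 3)))

// cogroup now yields Iterable values; map them back to Seq if needed
val grouped = left.cogroup(right)
  .mapValues { case (ls, rs) => (ls.toSeq, rs.toSeq) }

// jarOfClass now returns Option[String]; toSeq restores the old Seq[String]
val jars: Seq[String] = SparkContext.jarOfClass(getClass).toSeq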


Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-17 Thread Patrick Wendell
I'll start the voting with a +1.

On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 1.0.0!
> This has one bug fix and one minor feature on top of rc8:
> SPARK-1864: https://github.com/apache/spark/pull/808
> SPARK-1808: https://github.com/apache/spark/pull/799
>
> The tag to be voted on is v1.0.0-rc9 (commit 920f947):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.0.0-rc9/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1017/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/
>
> Please vote on releasing this package as Apache Spark 1.0.0!
>
> The vote is open until Tuesday, May 20, at 08:56 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.0.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == API Changes ==
> We welcome users to compile Spark applications against 1.0. There are
> a few API changes in this release. Here are links to the associated
> upgrade guides - user facing changes have been kept as small as
> possible.
>
> changes to ML vector specification:
> http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10
>
> changes to the Java API:
> http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>
> changes to the streaming API:
> http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
>
> changes to the GraphX API:
> http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
>
> coGroup and related functions now return Iterable[T] instead of Seq[T]
> ==> Call toSeq on the result to restore the old behavior
>
> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> ==> Call toSeq on the result to restore old behavior


[RESULT] [VOTE] Release Apache Spark 1.0.0 (rc8)

2014-05-17 Thread Patrick Wendell
Cancelled in favor of rc9.

On Sat, May 17, 2014 at 12:51 AM, Patrick Wendell  wrote:
> Due to the issue discovered by Michael, this vote is cancelled in favor of 
> rc9.
>
> On Fri, May 16, 2014 at 6:22 PM, Michael Armbrust
>  wrote:
>> -1
>>
>> We found a regression in the way configuration is passed to executors.
>>
>> https://issues.apache.org/jira/browse/SPARK-1864
>> https://github.com/apache/spark/pull/808
>>
>> Michael
>>
>>
>> On Fri, May 16, 2014 at 3:57 PM, Mark Hamstra 
>> wrote:
>>>
>>> +1
>>>
>>>
>>> On Fri, May 16, 2014 at 2:16 AM, Patrick Wendell 
>>> wrote:
>>>
>>> > [Due to ASF e-mail outage, I'm not sure if anyone will actually receive
>>> > this.]
>>> >
>>> > Please vote on releasing the following candidate as Apache Spark version
>>> > 1.0.0!
>>> > This has only minor changes on top of rc7.
>>> >
>>> > The tag to be voted on is v1.0.0-rc8 (commit 80eea0f):
>>> >
>>> >
>>> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=80eea0f111c06260ffaa780d2f3f7facd09c17bc
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8/
>>> >
>>> > Release artifacts are signed with the following key:
>>> > https://people.apache.org/keys/committer/pwendell.asc
>>> >
>>> > The staging repository for this release can be found at:
>>> > https://repository.apache.org/content/repositories/orgapachespark-1016/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/
>>> >
>>> > Please vote on releasing this package as Apache Spark 1.0.0!
>>> >
>>> > The vote is open until Monday, May 19, at 10:15 UTC and passes if a
>>> > majority of at least 3 +1 PMC votes are cast.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 1.0.0
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see
>>> > http://spark.apache.org/
>>> >
>>> > == API Changes ==
>>> > We welcome users to compile Spark applications against 1.0. There are
>>> > a few API changes in this release. Here are links to the associated
>>> > upgrade guides - user facing changes have been kept as small as
>>> > possible.
>>> >
>>> > changes to ML vector specification:
>>> >
>>> >
>>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10
>>> >
>>> > changes to the Java API:
>>> >
>>> >
>>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>>> >
>>> > changes to the streaming API:
>>> >
>>> >
>>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
>>> >
>>> > changes to the GraphX API:
>>> >
>>> >
>>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
>>> >
>>> > coGroup and related functions now return Iterable[T] instead of Seq[T]
>>> > ==> Call toSeq on the result to restore the old behavior
>>> >
>>> > SparkContext.jarOfClass returns Option[String] instead of Seq[String]
>>> > ==> Call toSeq on the result to restore old behavior
>>> >
>>
>>


Re: [VOTE] Release Apache Spark 1.0.0 (rc8)

2014-05-17 Thread Patrick Wendell
Due to the issue discovered by Michael, this vote is cancelled in favor of rc9.

On Fri, May 16, 2014 at 6:22 PM, Michael Armbrust
 wrote:
> -1
>
> We found a regression in the way configuration is passed to executors.
>
> https://issues.apache.org/jira/browse/SPARK-1864
> https://github.com/apache/spark/pull/808
>
> Michael
>
>
> On Fri, May 16, 2014 at 3:57 PM, Mark Hamstra 
> wrote:
>>
>> +1
>>
>>
>> On Fri, May 16, 2014 at 2:16 AM, Patrick Wendell 
>> wrote:
>>
>> > [Due to ASF e-mail outage, I'm not sure if anyone will actually receive
>> > this.]
>> >
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 1.0.0!
>> > This has only minor changes on top of rc7.
>> >
>> > The tag to be voted on is v1.0.0-rc8 (commit 80eea0f):
>> >
>> >
>> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=80eea0f111c06260ffaa780d2f3f7facd09c17bc
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1016/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/
>> >
>> > Please vote on releasing this package as Apache Spark 1.0.0!
>> >
>> > The vote is open until Monday, May 19, at 10:15 UTC and passes if a
>> > majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.0.0
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see
>> > http://spark.apache.org/
>> >
>> > == API Changes ==
>> > We welcome users to compile Spark applications against 1.0. There are
>> > a few API changes in this release. Here are links to the associated
>> > upgrade guides - user facing changes have been kept as small as
>> > possible.
>> >
>> > changes to ML vector specification:
>> >
>> >
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10
>> >
>> > changes to the Java API:
>> >
>> >
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>> >
>> > changes to the streaming API:
>> >
>> >
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
>> >
>> > changes to the GraphX API:
>> >
>> >
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
>> >
>> > coGroup and related functions now return Iterable[T] instead of Seq[T]
>> > ==> Call toSeq on the result to restore the old behavior
>> >
>> > SparkContext.jarOfClass returns Option[String] instead of Seq[String]
>> > ==> Call toSeq on the result to restore old behavior
>> >
>
>


Re: [VOTE] Release Apache Spark 1.0.0 (rc7)

2014-05-16 Thread Patrick Wendell
Hey all,

My vote threads seem to be running about 24 hours behind and/or
getting swallowed by infra e-mail.

I sent RC8 yesterday and we might send another one tonight as well. I'll
make sure to close all the existing vote threads.

There have been only small "polish" changes in the recent RCs since
RC5, so testing any of these should be roughly equivalent. I'll make
sure I close all the other threads by tonight.

- Patrick

On Fri, May 16, 2014 at 1:10 PM, Mark Hamstra  wrote:
> Sorry for the duplication, but I think this is the current VOTE candidate
> -- we're not voting on rc8 yet?
>
> +1, but just barely.  We've got quite a number of outstanding bugs
> identified, and many of them have fixes in progress.  I'd hate to see those
> efforts get lost in a post-1.0.0 flood of new features targeted at 1.1.0 --
> in other words, I'd like to see 1.0.1 retain a high priority relative to
> 1.1.0.
>
> Looking through the unresolved JIRAs, it doesn't look like any of the
> identified bugs are show-stoppers or strictly regressions, so I'm not
> currently seeing a reason not to release. (I will note that one I have in
> progress, SPARK-1749, is a bug that we introduced with recent work -- it's
> not strictly a regression only because previously the DAGScheduler
> exceptions weren't handled at all, which was equally bad, whereas now they
> are slightly mis-handled.)
>
>
> On Fri, May 16, 2014 at 11:42 AM, Henry Saputra 
> wrote:
>
>> Ah ok, thanks Aaron
>>
>> Just to make sure we VOTE the right RC.
>>
>> Thanks,
>>
>> Henry
>>
>> On Fri, May 16, 2014 at 11:37 AM, Aaron Davidson 
>> wrote:
>> > It was, but due to the apache infra issues, some may not have received
>> the
>> > email yet...
>> >
>> > On Fri, May 16, 2014 at 10:48 AM, Henry Saputra > >
>> > wrote:
>> >>
>> >> Hi Patrick,
>> >>
>> >> Just want to make sure that VOTE for rc6 also cancelled?
>> >>
>> >>
>> >> Thanks,
>> >>
>> >> Henry
>> >>
>> >> On Thu, May 15, 2014 at 1:15 AM, Patrick Wendell 
>> >> wrote:
>> >> > I'll start the voting with a +1.
>> >> >
>> >> > On Thu, May 15, 2014 at 1:14 AM, Patrick Wendell 
>> >> > wrote:
>> >> >> Please vote on releasing the following candidate as Apache Spark
>> >> >> version 1.0.0!
>> >> >>
>> >> >> This patch has minor documentation changes and fixes on top of rc6.
>> >> >>
>> >> >> The tag to be voted on is v1.0.0-rc7 (commit 9212b3e):
>> >> >>
>> >> >>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=9212b3e5bb5545ccfce242da8d89108e6fb1c464
>> >> >>
>> >> >> The release files, including signatures, digests, etc. can be found
>> at:
>> >> >> http://people.apache.org/~pwendell/spark-1.0.0-rc7/
>> >> >>
>> >> >> Release artifacts are signed with the following key:
>> >> >> https://people.apache.org/keys/committer/pwendell.asc
>> >> >>
>> >> >> The staging repository for this release can be found at:
>> >> >>
>> https://repository.apache.org/content/repositories/orgapachespark-1015
>> >> >>
>> >> >> The documentation corresponding to this release can be found at:
>> >> >> http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/
>> >> >>
>> >> >> Please vote on releasing this package as Apache Spark 1.0.0!
>> >> >>
>> >> >> The vote is open until Sunday, May 18, at 09:12 UTC and passes if a
>> >> >> majority of at least 3 +1 PMC votes are cast.
>> >> >>
>> >> >> [ ] +1 Release this package as Apache Spark 1.0.0
>> >> >> [ ] -1 Do not release this package because ...
>> >> >>
>> >> >> To learn more about Apache Spark, please see
>> >> >> http://spark.apache.org/
>> >> >>
>> >> >> == API Changes ==
>> >> >> We welcome users to compile Spark applications against 1.0. There are
>> >> >> a few API changes in this release. Here are links to the associated
>> >> >> upgrade guides - user facing changes have been kept as small as
>> >> >> possible.
>> >> >>
>> >> >> changes to ML vector specification:
>> >> >>
>> >> >>
>> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10
>> >> >>
>> >> >> changes to the Java API:
>> >> >>
>> >> >>
>> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>> >> >>
>> >> >> changes to the streaming API:
>> >> >>
>> >> >>
>> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
>> >> >>
>> >> >> changes to the GraphX API:
>> >> >>
>> >> >>
>> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
>> >> >>
>> >> >> coGroup and related functions now return Iterable[T] instead of
>> Seq[T]
>> >> >> ==> Call toSeq on the result to restore the old behavior
>> >> >>
>> >> >> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
>> >> >> ==> Call toSeq on the result to restore old behavior
>> >
>> >
>>


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-16 Thread Patrick Wendell
Thanks for your feedback. Since it's not a regression, it won't block
the release.

On Wed, May 14, 2014 at 12:17 AM, witgo  wrote:
> SPARK-1817 will cause users to get incorrect results, and RDD.zip is common
> usage. This should be the highest priority. I think we should fix the bug,
> and we should also test the previous release.
> -- Original Message --
> From: "Patrick Wendell"
> Date: Wed, May 14, 2014 03:02 PM
> To: dev@spark.apache.org
>
> Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
>
>
>
> Hey @witgo - those bugs are not severe enough to block the release,
> but it would be nice to get them fixed.
>
> At this point we are focused on severe bugs with an immediate fix, or
> regressions from previous versions of Spark. Anything that misses this
> release will get merged into the branch-1.0 branch and make it into
> the 1.0.1 release, so people will have access to it.
>
> On Tue, May 13, 2014 at 5:32 PM, witgo  wrote:
>> -1
>> The following bug should be fixed:
>> https://issues.apache.org/jira/browse/SPARK-1817
>> https://issues.apache.org/jira/browse/SPARK-1712
>>
>>
>> -- Original Message --
>> From: "Patrick Wendell"
>> Date: Wed, May 14, 2014 04:07 AM
>> To: dev@spark.apache.org
>>
>> Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
>>
>>
>>
>> Hey all - there were some earlier RC's that were not presented to the
>> dev list because issues were found with them. Also, there seem to be
>> some issues with the reliability of the dev list e-mail. Just a heads
>> up.
>>
>> I'll lead with a +1 for this.
>>
>> On Tue, May 13, 2014 at 8:07 AM, Nan Zhu  wrote:
>>> just curious, where is rc4 VOTE?
>>>
>>> I searched my gmail but didn't find that?
>>>
>>>
>>>
>>>
>>> On Tue, May 13, 2014 at 9:49 AM, Sean Owen  wrote:
>>>
>>>> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell 
>>>> wrote:
>>>> > The release files, including signatures, digests, etc. can be found at:
>>>> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>>>>
>>>> Good news is that the sigs, MD5 and SHA are all correct.
>>>>
>>>> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
>>>> use SHA512, which took me a bit of head-scratching to figure out.
>>>>
>>>> If another RC comes out, I might suggest making it SHA1 everywhere?
>>>> But there is nothing wrong with these signatures and checksums.
>>>>
>>>> Now to look at the contents...
>>>>
>> .
> .


[VOTE] Release Apache Spark 1.0.0 (rc7)

2014-05-16 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.0.0!

This patch has minor documentation changes and fixes on top of rc6.

The tag to be voted on is v1.0.0-rc7 (commit 9212b3e):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=9212b3e5bb5545ccfce242da8d89108e6fb1c464

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc7/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1015

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Sunday, May 18, at 09:12 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

changes to ML vector specification:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10

changes to the Java API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

changes to the streaming API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

changes to the GraphX API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior


[VOTE] Release Apache Spark 1.0.0 (rc8)

2014-05-16 Thread Patrick Wendell
[Due to ASF e-mail outage, I'm not sure if anyone will actually receive this.]

Please vote on releasing the following candidate as Apache Spark version 1.0.0!
This has only minor changes on top of rc7.

The tag to be voted on is v1.0.0-rc8 (commit 80eea0f):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=80eea0f111c06260ffaa780d2f3f7facd09c17bc

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc8/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1016/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Monday, May 19, at 10:15 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

changes to ML vector specification:
http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10

changes to the Java API:
http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

changes to the streaming API:
http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

changes to the GraphX API:
http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-16 Thread Patrick Wendell
Hey Everyone,

Just a heads up - I've sent other release candidates to the list, but
they appear to be getting swallowed (i.e. they are not on nabble). I
think there is an issue with Apache mail servers.

I'm going to keep trying... if you get duplicate e-mails I apologize in advance.

On Thu, May 15, 2014 at 10:23 AM, Patrick Wendell  wrote:
> Thanks for your feedback. Since it's not a regression, it won't block
> the release.
>
> On Wed, May 14, 2014 at 12:17 AM, witgo  wrote:
>> SPARK-1817 will cause users to get incorrect results, and RDD.zip is common
>> usage. This should be the highest priority. I think we should fix the bug,
>> and we should also test the previous release.
>> -- Original Message --
>> From: "Patrick Wendell"
>> Date: Wed, May 14, 2014 03:02 PM
>> To: dev@spark.apache.org
>>
>> Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
>>
>>
>>
>> Hey @witgo - those bugs are not severe enough to block the release,
>> but it would be nice to get them fixed.
>>
>> At this point we are focused on severe bugs with an immediate fix, or
>> regressions from previous versions of Spark. Anything that misses this
>> release will get merged into the branch-1.0 branch and make it into
>> the 1.0.1 release, so people will have access to it.
>>
>> On Tue, May 13, 2014 at 5:32 PM, witgo  wrote:
>>> -1
>>> The following bug should be fixed:
>>> https://issues.apache.org/jira/browse/SPARK-1817
>>> https://issues.apache.org/jira/browse/SPARK-1712
>>>
>>>
>>> -- Original Message --
>>> From: "Patrick Wendell"
>>> Date: Wed, May 14, 2014 04:07 AM
>>> To: dev@spark.apache.org
>>>
>>> Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
>>>
>>>
>>>
>>> Hey all - there were some earlier RC's that were not presented to the
>>> dev list because issues were found with them. Also, there seem to be
>>> some issues with the reliability of the dev list e-mail. Just a heads
>>> up.
>>>
>>> I'll lead with a +1 for this.
>>>
>>> On Tue, May 13, 2014 at 8:07 AM, Nan Zhu  wrote:
>>>> just curious, where is rc4 VOTE?
>>>>
>>>> I searched my gmail but didn't find that?
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, May 13, 2014 at 9:49 AM, Sean Owen  wrote:
>>>>
>>>>> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell 
>>>>> wrote:
>>>>> > The release files, including signatures, digests, etc. can be found at:
>>>>> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>>>>>
>>>>> Good news is that the sigs, MD5 and SHA are all correct.
>>>>>
>>>>> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
>>>>> use SHA512, which took me a bit of head-scratching to figure out.
>>>>>
>>>>> If another RC comes out, I might suggest making it SHA1 everywhere?
>>>>> But there is nothing wrong with these signatures and checksums.
>>>>>
>>>>> Now to look at the contents...
>>>>>
>>> .
>> .


[RESULT][VOTE] Release Apache Spark 1.0.0 (rc6)

2014-05-16 Thread Patrick Wendell
This vote is cancelled in favor of rc7.

On Wed, May 14, 2014 at 1:02 PM, Patrick Wendell  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 1.0.0!
>
> This patch has a few minor fixes on top of rc5. I've also built the
> binary artifacts with Hive support enabled so people can test this
> configuration. When we release 1.0 we might just release both vanilla
> and Hive-enabled binaries.
>
> The tag to be voted on is v1.0.0-rc6 (commit 54133a):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=54133abdce0246f6643a1112a5204afb2c4caa82
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.0.0-rc6/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachestratos-1011
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.0.0-rc6-docs/
>
> Please vote on releasing this package as Apache Spark 1.0.0!
>
> The vote is open until Saturday, May 17, at 20:58 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.0.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == API Changes ==
> We welcome users to compile Spark applications against 1.0. There are
> a few API changes in this release. Here are links to the associated
> upgrade guides - user facing changes have been kept as small as
> possible.
>
> changes to ML vector specification:
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10
>
> changes to the Java API:
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>
> changes to the streaming API:
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
>
> changes to the GraphX API:
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
>
> coGroup and related functions now return Iterable[T] instead of Seq[T]
> ==> Call toSeq on the result to restore the old behavior
>
> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> ==> Call toSeq on the result to restore old behavior


Re: [VOTE] Release Apache Spark 1.0.0 (rc7)

2014-05-16 Thread Patrick Wendell
I'll start the voting with a +1.

On Thu, May 15, 2014 at 1:14 AM, Patrick Wendell  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 1.0.0!
>
> This patch has minor documentation changes and fixes on top of rc6.
>
> The tag to be voted on is v1.0.0-rc7 (commit 9212b3e):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=9212b3e5bb5545ccfce242da8d89108e6fb1c464
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.0.0-rc7/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1015
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/
>
> Please vote on releasing this package as Apache Spark 1.0.0!
>
> The vote is open until Sunday, May 18, at 09:12 UTC and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.0.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == API Changes ==
> We welcome users to compile Spark applications against 1.0. There are
> a few API changes in this release. Here are links to the associated
> upgrade guides - user facing changes have been kept as small as
> possible.
>
> changes to ML vector specification:
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10
>
> changes to the Java API:
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>
> changes to the streaming API:
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
>
> changes to the GraphX API:
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
>
> coGroup and related functions now return Iterable[T] instead of Seq[T]
> ==> Call toSeq on the result to restore the old behavior
>
> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> ==> Call toSeq on the result to restore old behavior


[VOTE] Release Apache Spark 1.0.0 (rc6)

2014-05-15 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.0.0!

This patch has a few minor fixes on top of rc5. I've also built the
binary artifacts with Hive support enabled so people can test this
configuration. When we release 1.0 we might just release both vanilla
and Hive-enabled binaries.

The tag to be voted on is v1.0.0-rc6 (commit 54133a):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=54133abdce0246f6643a1112a5204afb2c4caa82

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc6/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachestratos-1011

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc6-docs/

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Saturday, May 17, at 20:58 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

changes to ML vector specification:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10

changes to the Java API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

changes to the streaming API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

changes to the GraphX API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior


[RESULT] [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-15 Thread Patrick Wendell
This vote is cancelled in favor of rc6.

On Wed, May 14, 2014 at 1:04 PM, Patrick Wendell  wrote:
> I'm cancelling this vote in favor of rc6.
>
> On Tue, May 13, 2014 at 8:01 AM, Sean Owen  wrote:
>> On Tue, May 13, 2014 at 2:49 PM, Sean Owen  wrote:
>>> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell  wrote:
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>>>
>>> Good news is that the sigs, MD5 and SHA are all correct.
>>>
>>> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
>>> use SHA512, which took me a bit of head-scratching to figure out.
>>>
>>> If another RC comes out, I might suggest making it SHA1 everywhere?
>>> But there is nothing wrong with these signatures and checksums.
>>>
>>> Now to look at the contents...
>>
>> This is a bit of drudgery that probably needs to be done too: a review
>> of the LICENSE and NOTICE file. Having dumped the licenses of
>> dependencies, I don't believe these reflect all of the software that's
>> going to be distributed in 1.0.
>>
>> (Good news is there's no forbidden license stuff included AFAICT.)
>>
>> And good news is that NOTICE can be auto-generated, largely, with a
>> Maven plugin. This can be done manually for now.
>>
>> And there is a license plugin that will list all known licenses of
>> transitive dependencies so that LICENSE can be filled out fairly
>> easily.
>>
>> What say? want a JIRA with details?


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-15 Thread Patrick Wendell
I'm cancelling this vote in favor of rc6.

On Tue, May 13, 2014 at 8:01 AM, Sean Owen  wrote:
> On Tue, May 13, 2014 at 2:49 PM, Sean Owen  wrote:
>> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell  wrote:
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>>
>> Good news is that the sigs, MD5 and SHA are all correct.
>>
>> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
>> use SHA512, which took me a bit of head-scratching to figure out.
>>
>> If another RC comes out, I might suggest making it SHA1 everywhere?
>> But there is nothing wrong with these signatures and checksums.
>>
>> Now to look at the contents...
>
> This is a bit of drudgery that probably needs to be done too: a review
> of the LICENSE and NOTICE file. Having dumped the licenses of
> dependencies, I don't believe these reflect all of the software that's
> going to be distributed in 1.0.
>
> (Good news is there's no forbidden license stuff included AFAICT.)
>
> And good news is that NOTICE can be auto-generated, largely, with a
> Maven plugin. This can be done manually for now.
>
> And there is a license plugin that will list all known licenses of
> transitive dependencies so that LICENSE can be filled out fairly
> easily.
>
> What say? want a JIRA with details?


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-14 Thread Patrick Wendell
Hey @witgo - those bugs are not severe enough to block the release,
but it would be nice to get them fixed.

At this point we are focused on severe bugs with an immediate fix, or
regressions from previous versions of Spark. Anything that misses this
release will get merged into the branch-1.0 branch and make it into
the 1.0.1 release, so people will have access to it.

On Tue, May 13, 2014 at 5:32 PM, witgo  wrote:
> -1
> The following bug should be fixed:
> https://issues.apache.org/jira/browse/SPARK-1817
> https://issues.apache.org/jira/browse/SPARK-1712
>
>
> -- Original Message --
> From: "Patrick Wendell"
> Date: Wed, May 14, 2014 04:07 AM
> To: dev@spark.apache.org
>
> Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
>
>
>
> Hey all - there were some earlier RC's that were not presented to the
> dev list because issues were found with them. Also, there seem to be
> some issues with the reliability of the dev list e-mail. Just a heads
> up.
>
> I'll lead with a +1 for this.
>
> On Tue, May 13, 2014 at 8:07 AM, Nan Zhu  wrote:
>> just curious, where is rc4 VOTE?
>>
>> I searched my gmail but didn't find that?
>>
>>
>>
>>
>> On Tue, May 13, 2014 at 9:49 AM, Sean Owen  wrote:
>>
>>> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell 
>>> wrote:
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>>>
>>> Good news is that the sigs, MD5 and SHA are all correct.
>>>
>>> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
>>> use SHA512, which took me a bit of head-scratching to figure out.
>>>
>>> If another RC comes out, I might suggest making it SHA1 everywhere?
>>> But there is nothing wrong with these signatures and checksums.
>>>
>>> Now to look at the contents...
>>>
> .


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Patrick Wendell
Hey all - there were some earlier RC's that were not presented to the
dev list because issues were found with them. Also, there seem to be
some issues with the reliability of the dev list e-mail. Just a heads
up.

I'll lead with a +1 for this.

On Tue, May 13, 2014 at 8:07 AM, Nan Zhu  wrote:
> just curious, where is rc4 VOTE?
>
> I searched my gmail but didn't find that?
>
>
>
>
> On Tue, May 13, 2014 at 9:49 AM, Sean Owen  wrote:
>
>> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell 
>> wrote:
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>>
>> Good news is that the sigs, MD5 and SHA are all correct.
>>
>> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
>> use SHA512, which took me a bit of head-scratching to figure out.
>>
>> If another RC comes out, I might suggest making it SHA1 everywhere?
>> But there is nothing wrong with these signatures and checksums.
>>
>> Now to look at the contents...
>>


[VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.0.0!

The tag to be voted on is v1.0.0-rc5 (commit 18f0623):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=18f062303303824139998e8fc8f4158217b0dbc3

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc5/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1012/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Friday, May 16, at 09:30 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

changes to ML vector specification:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10

changes to the Java API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

changes to the streaming API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

changes to the GraphX API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior


Re: Updating docs for running on Mesos

2014-05-11 Thread Patrick Wendell
Andrew,

Updating these docs would be great! I think this would be a welcome change.

In terms of packaging, it would be good to mention the binaries
produced by the upstream project as well, in addition to Mesosphere.

- Patrick

On Thu, May 8, 2014 at 12:51 AM, Andrew Ash  wrote:
> The docs for how to run Spark on Mesos have changed very little since
> 0.6.0, but setting it up is much easier now than then.  Does it make sense
> to revamp with the below changes?
>
>
> You no longer need to build mesos yourself as pre-built versions are
> available from Mesosphere: http://mesosphere.io/downloads/
>
> And the instructions guide you towards compiling your own distribution of
> Spark, when you can use the prebuilt versions of Spark as well.
>
>
> I'd like to split that portion of the documentation into two sections, a
> build-from-scratch section and a use-prebuilt section.  The new outline
> would look something like this:
>
>
> *Running Spark on Mesos*
>
> Installing Mesos
> - using prebuilt (recommended)
>  - pointer to mesosphere's packages
> - from scratch
>  - (similar to current)
>
>
> Connecting Spark to Mesos
> - loading distribution into an accessible location
> - Spark settings
>
> Mesos Run Modes
> - (same as current)
>
> Running Alongside Hadoop
> - (trim this down)
>
>
>
> Does that work for people?
>
>
> Thanks!
> Andrew
>
>
> PS Basically all the same:
>
> http://spark.apache.org/docs/0.6.0/running-on-mesos.html
> http://spark.apache.org/docs/0.6.2/running-on-mesos.html
> http://spark.apache.org/docs/0.7.3/running-on-mesos.html
> http://spark.apache.org/docs/0.8.1/running-on-mesos.html
> http://spark.apache.org/docs/0.9.1/running-on-mesos.html
> https://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/running-on-mesos.html


Re: SparkSubmit and --driver-java-options

2014-04-30 Thread Patrick Wendell
Patch here:
https://github.com/apache/spark/pull/609

On Wed, Apr 30, 2014 at 2:26 PM, Patrick Wendell  wrote:
> Dean - our e-mails crossed, but thanks for the tip. Was independently
> arriving at your solution :)
>
> Okay I'll submit something.
>
> - Patrick
>
> On Wed, Apr 30, 2014 at 2:14 PM, Marcelo Vanzin  wrote:
>> Cool, that seems to work. Thanks!
>>
>> On Wed, Apr 30, 2014 at 2:09 PM, Patrick Wendell  wrote:
>>> Marcelo - Mind trying the following diff locally? If it works I can
>>> send a patch:
>>>
>>> patrick@patrick-t430s:~/Documents/spark$ git diff bin/spark-submit
>>> diff --git a/bin/spark-submit b/bin/spark-submit
>>> index dd0d95d..49bc262 100755
>>> --- a/bin/spark-submit
>>> +++ b/bin/spark-submit
>>> @@ -18,7 +18,7 @@
>>>  #
>>>
>>>  export SPARK_HOME="$(cd `dirname $0`/..; pwd)"
>>> -ORIG_ARGS=$@
>>> +ORIG_ARGS=("$@")
>>>
>>>  while (($#)); do
>>>if [ "$1" = "--deploy-mode" ]; then
>>> @@ -39,5 +39,5 @@ if [ ! -z $DRIVER_MEMORY ] && [ ! -z $DEPLOY_MODE ]
>>> && [ $DEPLOY_MODE = "client"
>>>export SPARK_MEM=$DRIVER_MEMORY
>>>  fi
>>>
>>> -$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS
>>> +$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit
>>> "${ORIG_ARGS[@]}"
>>>
>>> On Wed, Apr 30, 2014 at 1:51 PM, Patrick Wendell  wrote:
>>>> So I reproduced the problem here:
>>>>
>>>> == test.sh ==
>>>> #!/bin/bash
>>>> for x in "$@"; do
>>>>   echo "arg: $x"
>>>> done
>>>> ARGS_COPY=$@
>>>> for x in "$ARGS_COPY"; do
>>>>   echo "arg_copy: $x"
>>>> done
>>>> ==
>>>>
>>>> ./test.sh a b "c d e" f
>>>> arg: a
>>>> arg: b
>>>> arg: c d e
>>>> arg: f
>>>> arg_copy: a b c d e f
>>>>
>>>> I'll dig around a bit more and see if we can fix it. Pretty sure we
>>>> aren't passing these argument arrays around correctly in bash.
>>>>
>>>> On Wed, Apr 30, 2014 at 1:48 PM, Marcelo Vanzin  
>>>> wrote:
>>>>> On Wed, Apr 30, 2014 at 1:41 PM, Patrick Wendell  
>>>>> wrote:
>>>>>> Yeah I think the problem is that the spark-submit script doesn't pass
>>>>>> the argument array to spark-class in the right way, so any quoted
>>>>>> strings get flattened.
>>>>>>
>>>>>> I think we'll need to figure out how to do this correctly in the bash
>>>>>> script so that quoted strings get passed in the right way.
>>>>>
>>>>> I tried a few different approaches but finally ended up giving up; my
>>>>> bash-fu is apparently not strong enough. If you can make it work
>>>>> great, but I have "-J" working locally in case you give up like me.
>>>>> :-)
>>>>>
>>>>> --
>>>>> Marcelo
>>
>>
>>
>> --
>> Marcelo


Re: SparkSubmit and --driver-java-options

2014-04-30 Thread Patrick Wendell
Dean - our e-mails crossed, but thanks for the tip. Was independently
arriving at your solution :)

Okay I'll submit something.

- Patrick

On Wed, Apr 30, 2014 at 2:14 PM, Marcelo Vanzin  wrote:
> Cool, that seems to work. Thanks!
>
> On Wed, Apr 30, 2014 at 2:09 PM, Patrick Wendell  wrote:
>> Marcelo - Mind trying the following diff locally? If it works I can
>> send a patch:
>>
>> patrick@patrick-t430s:~/Documents/spark$ git diff bin/spark-submit
>> diff --git a/bin/spark-submit b/bin/spark-submit
>> index dd0d95d..49bc262 100755
>> --- a/bin/spark-submit
>> +++ b/bin/spark-submit
>> @@ -18,7 +18,7 @@
>>  #
>>
>>  export SPARK_HOME="$(cd `dirname $0`/..; pwd)"
>> -ORIG_ARGS=$@
>> +ORIG_ARGS=("$@")
>>
>>  while (($#)); do
>>if [ "$1" = "--deploy-mode" ]; then
>> @@ -39,5 +39,5 @@ if [ ! -z $DRIVER_MEMORY ] && [ ! -z $DEPLOY_MODE ]
>> && [ $DEPLOY_MODE = "client"
>>export SPARK_MEM=$DRIVER_MEMORY
>>  fi
>>
>> -$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS
>> +$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit
>> "${ORIG_ARGS[@]}"
>>
>> On Wed, Apr 30, 2014 at 1:51 PM, Patrick Wendell  wrote:
>>> So I reproduced the problem here:
>>>
>>> == test.sh ==
>>> #!/bin/bash
>>> for x in "$@"; do
>>>   echo "arg: $x"
>>> done
>>> ARGS_COPY=$@
>>> for x in "$ARGS_COPY"; do
>>>   echo "arg_copy: $x"
>>> done
>>> ==
>>>
>>> ./test.sh a b "c d e" f
>>> arg: a
>>> arg: b
>>> arg: c d e
>>> arg: f
>>> arg_copy: a b c d e f
>>>
>>> I'll dig around a bit more and see if we can fix it. Pretty sure we
>>> aren't passing these argument arrays around correctly in bash.
>>>
>>> On Wed, Apr 30, 2014 at 1:48 PM, Marcelo Vanzin  wrote:
>>>> On Wed, Apr 30, 2014 at 1:41 PM, Patrick Wendell  
>>>> wrote:
>>>>> Yeah I think the problem is that the spark-submit script doesn't pass
>>>>> the argument array to spark-class in the right way, so any quoted
>>>>> strings get flattened.
>>>>>
>>>>> I think we'll need to figure out how to do this correctly in the bash
>>>>> script so that quoted strings get passed in the right way.
>>>>
>>>> I tried a few different approaches but finally ended up giving up; my
>>>> bash-fu is apparently not strong enough. If you can make it work
>>>> great, but I have "-J" working locally in case you give up like me.
>>>> :-)
>>>>
>>>> --
>>>> Marcelo
>
>
>
> --
> Marcelo


Re: SparkSubmit and --driver-java-options

2014-04-30 Thread Patrick Wendell
Marcelo - Mind trying the following diff locally? If it works I can
send a patch:

patrick@patrick-t430s:~/Documents/spark$ git diff bin/spark-submit
diff --git a/bin/spark-submit b/bin/spark-submit
index dd0d95d..49bc262 100755
--- a/bin/spark-submit
+++ b/bin/spark-submit
@@ -18,7 +18,7 @@
 #

 export SPARK_HOME="$(cd `dirname $0`/..; pwd)"
-ORIG_ARGS=$@
+ORIG_ARGS=("$@")

 while (($#)); do
   if [ "$1" = "--deploy-mode" ]; then
@@ -39,5 +39,5 @@ if [ ! -z $DRIVER_MEMORY ] && [ ! -z $DEPLOY_MODE ]
&& [ $DEPLOY_MODE = "client"
   export SPARK_MEM=$DRIVER_MEMORY
 fi

-$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS
+$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit
"${ORIG_ARGS[@]}"

On Wed, Apr 30, 2014 at 1:51 PM, Patrick Wendell  wrote:
> So I reproduced the problem here:
>
> == test.sh ==
> #!/bin/bash
> for x in "$@"; do
>   echo "arg: $x"
> done
> ARGS_COPY=$@
> for x in "$ARGS_COPY"; do
>   echo "arg_copy: $x"
> done
> ==
>
> ./test.sh a b "c d e" f
> arg: a
> arg: b
> arg: c d e
> arg: f
> arg_copy: a b c d e f
>
> I'll dig around a bit more and see if we can fix it. Pretty sure we
> aren't passing these argument arrays around correctly in bash.
>
> On Wed, Apr 30, 2014 at 1:48 PM, Marcelo Vanzin  wrote:
>> On Wed, Apr 30, 2014 at 1:41 PM, Patrick Wendell  wrote:
>>> Yeah I think the problem is that the spark-submit script doesn't pass
>>> the argument array to spark-class in the right way, so any quoted
>>> strings get flattened.
>>>
>>> I think we'll need to figure out how to do this correctly in the bash
>>> script so that quoted strings get passed in the right way.
>>
>> I tried a few different approaches but finally ended up giving up; my
>> bash-fu is apparently not strong enough. If you can make it work
>> great, but I have "-J" working locally in case you give up like me.
>> :-)
>>
>> --
>> Marcelo


Re: SparkSubmit and --driver-java-options

2014-04-30 Thread Patrick Wendell
So I reproduced the problem here:

== test.sh ==
#!/bin/bash
for x in "$@"; do
  echo "arg: $x"
done
ARGS_COPY=$@
for x in "$ARGS_COPY"; do
  echo "arg_copy: $x"
done
==

./test.sh a b "c d e" f
arg: a
arg: b
arg: c d e
arg: f
arg_copy: a b c d e f

I'll dig around a bit more and see if we can fix it. Pretty sure we
aren't passing these argument arrays around correctly in bash.

On Wed, Apr 30, 2014 at 1:48 PM, Marcelo Vanzin  wrote:
> On Wed, Apr 30, 2014 at 1:41 PM, Patrick Wendell  wrote:
>> Yeah I think the problem is that the spark-submit script doesn't pass
>> the argument array to spark-class in the right way, so any quoted
>> strings get flattened.
>>
>> I think we'll need to figure out how to do this correctly in the bash
>> script so that quoted strings get passed in the right way.
>
> I tried a few different approaches but finally ended up giving up; my
> bash-fu is apparently not strong enough. If you can make it work
> great, but I have "-J" working locally in case you give up like me.
> :-)
>
> --
> Marcelo


Re: SparkSubmit and --driver-java-options

2014-04-30 Thread Patrick Wendell
Yeah I think the problem is that the spark-submit script doesn't pass
the argument array to spark-class in the right way, so any quoted
strings get flattened.

We do:
ORIG_ARGS=$@
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS

This works:
// remove all the code relating to `shift`ing the arguments
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

I think the issue is that when you copy $@ into a plain variable in bash,
it gets flattened from an argument array into a single string, so the
quoting of the individual arguments is lost.

My patch fixes this for spark-shell but I didn't realize that
spark-submit does the same thing.
https://github.com/apache/spark/pull/576/files#diff-bc287993dfd11fd18794041e169ffd72L23

I think we'll need to figure out how to do this correctly in the bash
script so that quoted strings get passed in the right way.

On Wed, Apr 30, 2014 at 1:06 PM, Marcelo Vanzin  wrote:
> Just pulled again just in case. Verified your fix is there.
>
> $ ./bin/spark-submit --master yarn --deploy-mode client
> --driver-java-options "-Dfoo -Dbar" blah blah blah
> error: Unrecognized option '-Dbar'.
> run with --help for more information or --verbose for debugging output
>
>
> On Wed, Apr 30, 2014 at 12:49 PM, Patrick Wendell  wrote:
>> I added a fix for this recently and it didn't require adding -J
>> notation - are you trying it with this patch?
>>
>> https://issues.apache.org/jira/browse/SPARK-1654
>>
>>  ./bin/spark-shell --driver-java-options "-Dfoo=a -Dbar=b"
>> scala> sys.props.get("foo")
>> res0: Option[String] = Some(a)
>> scala> sys.props.get("bar")
>> res1: Option[String] = Some(b)
>>
>> - Patrick
>>
>> On Wed, Apr 30, 2014 at 11:29 AM, Marcelo Vanzin  wrote:
>>> Hello all,
>>>
>>> Maybe my brain is not evolved enough to be able to trace through what
>>> happens with command-line arguments as they're parsed through all the
>>> shell scripts... but I really can't figure out how to pass more than a
>>> single JVM option on the command line.
>>>
>>> Unless someone has an obvious workaround that I'm missing, I'd like to
>>> propose something that is actually pretty standard in JVM tools: using
>>> -J. From javac:
>>>
>>>   -J   Pass  directly to the runtime system
>>>
>>> So "javac -J-Xmx1g" would pass "-Xmx1g" to the underlying JVM. You can
>>> use several of those to pass multiple options (unlike
>>> --driver-java-options), so it helps that it's a short syntax.
>>>
>>> Unless someone has some issue with that I'll work on a patch for it...
>>> (well, I'm going to do it locally for me anyway because I really can't
>>> figure out how to do what I want to otherwise.)
>>>
>>>
>>> --
>>> Marcelo
>
>
>
> --
> Marcelo


Re: SparkSubmit and --driver-java-options

2014-04-30 Thread Patrick Wendell
I added a fix for this recently and it didn't require adding -J
notation - are you trying it with this patch?

https://issues.apache.org/jira/browse/SPARK-1654

 ./bin/spark-shell --driver-java-options "-Dfoo=a -Dbar=b"
scala> sys.props.get("foo")
res0: Option[String] = Some(a)
scala> sys.props.get("bar")
res1: Option[String] = Some(b)

- Patrick

On Wed, Apr 30, 2014 at 11:29 AM, Marcelo Vanzin  wrote:
> Hello all,
>
> Maybe my brain is not evolved enough to be able to trace through what
> happens with command-line arguments as they're parsed through all the
> shell scripts... but I really can't figure out how to pass more than a
> single JVM option on the command line.
>
> Unless someone has an obvious workaround that I'm missing, I'd like to
> propose something that is actually pretty standard in JVM tools: using
> -J. From javac:
>
>   -J   Pass  directly to the runtime system
>
> So "javac -J-Xmx1g" would pass "-Xmx1g" to the underlying JVM. You can
> use several of those to pass multiple options (unlike
> --driver-java-options), so it helps that it's a short syntax.
>
> Unless someone has some issue with that I'll work on a patch for it...
> (well, I'm going to do it locally for me anyway because I really can't
> figure out how to do what I want to otherwise.)
>
>
> --
> Marcelo
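
Spark ended up not needing -J (the quoted fix above handles multiple options),
but for readers unfamiliar with the javac convention Marcelo describes, here is
a hypothetical sketch of how a launcher script could collect -J options. This
is only an illustration of the proposal, not code that exists in Spark:

#!/bin/bash
JAVA_OPTS=()
REMAINING=()
for arg in "$@"; do
  case "$arg" in
    -J*) JAVA_OPTS+=("${arg#-J}") ;;   # strip the -J prefix: -J-Xmx1g -> -Xmx1g
    *)   REMAINING+=("$arg") ;;
  esac
done
echo "JVM options: ${JAVA_OPTS[@]}"
echo "Remaining:   ${REMAINING[@]}"
# ./launcher.sh -J-Xmx1g -J-Dfoo=bar app-arg
#   JVM options: -Xmx1g -Dfoo=bar
#   Remaining:   app-arg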


Re: Spark 1.0.0 rc3

2014-04-29 Thread Patrick Wendell
That suggestion got lost along the way and IIRC the patch didn't have
that. It's a good idea though, if nothing else to provide a simple
means for backwards compatibility.

I created a JIRA for this. It's very straightforward so maybe someone
can pick it up quickly:
https://issues.apache.org/jira/browse/SPARK-1677


On Tue, Apr 29, 2014 at 2:20 PM, Dean Wampler  wrote:
> Thanks. I'm fine with the logic change, although I was a bit surprised to
> see Hadoop used for file I/O.
>
> Anyway, the jira issue and pull request discussions mention a flag to
> enable overwrites. That would be very convenient for a tutorial I'm
> writing, although I wouldn't recommend it for normal use, of course.
> However, I can't figure out if this actually exists. I found the
> spark.files.overwrite property, but that doesn't apply.  Does this override
> flag, method call, or method argument actually exist?
>
> Thanks,
> Dean
>
>
> On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell  wrote:
>
>> Hi Dean,
>>
>> We always used the Hadoop libraries here to read and write local
>> files. In Spark 1.0 we started enforcing the rule that you can't
>> over-write an existing directory because it can cause
>> confusing/undefined behavior if multiple jobs output to the directory
>> (they partially clobber each other's output).
>>
>> https://issues.apache.org/jira/browse/SPARK-1100
>> https://github.com/apache/spark/pull/11
>>
>> In the JIRA I actually proposed slightly deviating from Hadoop
>> semantics and allowing the directory to exist if it is empty, but I
>> think in the end we decided to just go with the exact same semantics
>> as Hadoop (i.e. empty directories are a problem).
>>
>> - Patrick
>>
>> On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler 
>> wrote:
>> > I'm observing one anomalous behavior. With the 1.0.0 libraries, it's
>> using
>> > HDFS classes for file I/O, while the same script compiled and running
>> with
>> > 0.9.1 uses only the local-mode File IO.
>> >
>> > The script is a variation of the Word Count script. Here are the "guts":
>> >
>> > object WordCount2 {
>> >   def main(args: Array[String]) = {
>> >
>> > val sc = new SparkContext("local", "Word Count (2)")
>> >
>> > val input = sc.textFile(".../some/local/file").map(line =>
>> > line.toLowerCase)
>> > input.cache
>> >
>> > val wc2 = input
>> >   .flatMap(line => line.split("""\W+"""))
>> >   .map(word => (word, 1))
>> >   .reduceByKey((count1, count2) => count1 + count2)
>> >
>> > wc2.saveAsTextFile("output/some/directory")
>> >
>> > sc.stop()
>> >
>> > It works fine compiled and executed with 0.9.1. If I recompile and run
>> with
>> > 1.0.0-RC1, where the same output directory still exists, I get this
>> > familiar Hadoop-ish exception:
>> >
>> > [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException:
>> > Output directory
>> >
>> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
>> > already exists
>> > org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
>> >
>> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
>> > already exists
>> >  at
>> >
>> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
>> > at
>> >
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
>> >  at
>> >
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
>> > at
>> >
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
>> >  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
>> > at spark.activator.WordCount2$.main(WordCount2.scala:42)
>> >  at spark.activator.WordCount2.main(WordCount2.scala)
>> > ...
>> >
>> > Thoughts?
>> >
>> >
>> > On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell 
>> wrote:
>> >
>> >> Hey All,
>> >>
>> >> This is not an official vote, but I wanted to cut an RC so that people
>> can
>> >> test against the Maven artifacts, test building with their
>> configuration,
>> >> etc. We are still chasing down a few issues and updating docs, etc.

Re: Spark 1.0.0 rc3

2014-04-29 Thread Patrick Wendell
Hi Dean,

We always used the Hadoop libraries here to read and write local
files. In Spark 1.0 we started enforcing the rule that you can't
over-write an existing directory because it can cause
confusing/undefined behavior if multiple jobs output to the directory
(they partially clobber each other's output).

https://issues.apache.org/jira/browse/SPARK-1100
https://github.com/apache/spark/pull/11

In the JIRA I actually proposed slightly deviating from Hadoop
semantics and allowing the directory to exist if it is empty, but I
think in the end we decided to just go with the exact same semantics
as Hadoop (i.e. empty directories are a problem).

- Patrick

On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler  wrote:
> I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using
> HDFS classes for file I/O, while the same script compiled and running with
> 0.9.1 uses only the local-mode File IO.
>
> The script is a variation of the Word Count script. Here are the "guts":
>
> object WordCount2 {
>   def main(args: Array[String]) = {
>
> val sc = new SparkContext("local", "Word Count (2)")
>
> val input = sc.textFile(".../some/local/file").map(line =>
> line.toLowerCase)
> input.cache
>
> val wc2 = input
>   .flatMap(line => line.split("""\W+"""))
>   .map(word => (word, 1))
>   .reduceByKey((count1, count2) => count1 + count2)
>
> wc2.saveAsTextFile("output/some/directory")
>
> sc.stop()
>
> It works fine compiled and executed with 0.9.1. If I recompile and run with
> 1.0.0-RC1, where the same output directory still exists, I get this
> familiar Hadoop-ish exception:
>
> [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException:
> Output directory
> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
> already exists
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
> already exists
>  at
> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
>  at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
>  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
> at spark.activator.WordCount2$.main(WordCount2.scala:42)
>  at spark.activator.WordCount2.main(WordCount2.scala)
> ...
>
> Thoughts?
>
>
> On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell  wrote:
>
>> Hey All,
>>
>> This is not an official vote, but I wanted to cut an RC so that people can
>> test against the Maven artifacts, test building with their configuration,
>> etc. We are still chasing down a few issues and updating docs, etc.
>>
>> If you have issues or bug reports for this release, please send an e-mail
>> to the Spark dev list and/or file a JIRA.
>>
>> Commit: d636772 (v1.0.0-rc3)
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
>>
>> Binaries:
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
>>
>> Docs:
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
>>
>> Repository:
>> https://repository.apache.org/content/repositories/orgapachespark-1012/
>>
>> == API Changes ==
>> If you want to test building against Spark there are some minor API
>> changes. We'll get these written up for the final release but I'm noting a
>> few here (not comprehensive):
>>
>> changes to ML vector specification:
>>
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
>>
>> changes to the Java API:
>>
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>>
>> coGroup and related functions now return Iterable[T] instead of Seq[T]
>> ==> Call toSeq on the result to restore the old behavior
>>
>> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
>> ==> Call toSeq on the result to restore old behavior
>>
>> Streaming classes have been renamed:
>> NetworkReceiver -> Receiver
>>
>
>
>
> --
> Dean Wampler, Ph.D.
> Typesafe
> @deanwampler
> http://typesafe.com
> http://polyglotprogramming.com


Re: Spark 1.0.0 rc3

2014-04-29 Thread Patrick Wendell
Sorry got cut off. For 0.9.0 and 1.0.0 they are not binary compatible
and in a few cases not source compatible. 1.X will be source
compatible. We are also planning to support binary compatibility in
1.X but I'm waiting until we make a few releases to officially promise
that, since Scala makes this pretty tricky.

On Tue, Apr 29, 2014 at 11:47 AM, Patrick Wendell  wrote:
>> What are the expectations / guarantees on binary compatibility between
>> 0.9 and 1.0?
>
> There are no guarantees.


Re: Spark 1.0.0 rc3

2014-04-29 Thread Patrick Wendell
> What are the expectations / guarantees on binary compatibility between
> 0.9 and 1.0?

There are no guarantees.


Spark 1.0.0 rc3

2014-04-29 Thread Patrick Wendell
Hey All,

This is not an official vote, but I wanted to cut an RC so that people can
test against the Maven artifacts, test building with their configuration,
etc. We are still chasing down a few issues and updating docs, etc.

If you have issues or bug reports for this release, please send an e-mail
to the Spark dev list and/or file a JIRA.

Commit: d636772 (v1.0.0-rc3)
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221

Binaries:
http://people.apache.org/~pwendell/spark-1.0.0-rc3/

Docs:
http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/

Repository:
https://repository.apache.org/content/repositories/orgapachespark-1012/

== API Changes ==
If you want to test building against Spark there are some minor API
changes. We'll get these written up for the final release but I'm noting a
few here (not comprehensive):

changes to ML vector specification:
http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10

changes to the Java API:
http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior

Streaming classes have been renamed:
NetworkReceiver -> Receiver


Re: Fw: Is there any way to make a quick test on some pre-commit code?

2014-04-24 Thread Patrick Wendell
This is already on the wiki:

https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools



On Wed, Apr 23, 2014 at 6:52 PM, Nan Zhu  wrote:

> I'm just asked by others for the same question
>
> I think Reynold gave a pretty helpful tip on this,
>
> Shall we put this on Contribute-to-Spark wiki?
>
> --
> Nan Zhu
>
>
> Forwarded message:
>
> > From: Reynold Xin 
> > Reply To: d...@spark.incubator.apache.org
> > To: d...@spark.incubator.apache.org 
> > Date: Thursday, February 6, 2014 at 7:50:57 PM
> > Subject: Re: Is there any way to make a quick test on some pre-commit
> code?
> >
> > You can do
> >
> > sbt/sbt assemble-deps
> >
> >
> > and then just run
> >
> > sbt/sbt package
> >
> > each time.
> >
> >
> > You can even do
> >
> > sbt/sbt ~package
> >
> > for automatic incremental compilation.
> >
> >
> >
> > On Thu, Feb 6, 2014 at 4:46 PM, Nan Zhu  zhunanmcg...@gmail.com)> wrote:
> >
> > > Hi, all
> > >
> > > Is it always necessary to run sbt assembly when you want to test some
> code,
> > >
> > > Sometimes you just repeatedly change one or two lines for some failed
> test
> > > case, it is really time-consuming to sbt assembly every time
> > >
> > > any faster way?
> > >
> > > Best,
> > >
> > > --
> > > Nan Zhu
> > >
> >
> >
> >
> >
>
>
>
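
Pulling the quoted tips together into one quick edit-compile-test loop (the
suite name is only an example, and "test-only" is the task name for the sbt
version of this era):

sbt/sbt assemble-deps                                 # build the dependency assembly once
sbt/sbt package                                       # rebuild only Spark's own classes after edits
sbt/sbt ~package                                      # or recompile automatically on each save
sbt/sbt "test-only org.apache.spark.rdd.RDDSuite"     # run just one suite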


Re: all values for a key must fit in memory

2014-04-20 Thread Patrick Wendell
Just wanted to mention - one common thing I've seen users do is use
groupByKey, then do something that is commutative and associative once the
values are grouped. Really users here should be doing reduceByKey.

rdd.groupByKey().map { case (key, values) => (key, values.sum) }
rdd.reduceByKey(_ + _)

I've seen this happen particularly for users coming from MapReduce where
they are used to having to write their own combiners and it's not intuitive
that these functions are very different.

Sandy - have you heard from users who have a specific problem they can't
solve using an associative function? I'm sure they exist, but I wonder how
often it's this vs. they just don't understand the API.

I wonder if we should actually warn about this in the groupByKey
documentation.

- Patrick
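
To make the comparison concrete, a small self-contained sketch: both
approaches produce the same per-key sums, but reduceByKey combines values
map-side before the shuffle, while groupByKey ships every value across the
network and materializes all values for a key at once.

val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Groups all values for a key first, then sums them.
val viaGroup  = rdd.groupByKey().map { case (key, values) => (key, values.sum) }

// Sums incrementally with map-side combining -- usually what you want.
val viaReduce = rdd.reduceByKey(_ + _)

viaGroup.collect()    // Array((a,4), (b,2))  (order may vary)
viaReduce.collect()   // Array((a,4), (b,2))  (order may vary)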


On Sun, Apr 20, 2014 at 8:13 PM, Matei Zaharia wrote:

> We've updated the user-facing API of groupBy in 1.0 to allow this:
> https://issues.apache.org/jira/browse/SPARK-1271. The ShuffleFetcher API
> is internal to Spark, it doesn't really matter what it is because we can
> change it. But the problem before was that groupBy and cogroup were defined
> as returning (Key, Seq[Value]). Now they return (Key, Iterable[Value]),
> which will allow us to make the internal changes to allow spilling to disk
> within a key. This will happen after 1.0 though, but it will be doable
> without any changes to user programs.
>
> Matei
>
> On Apr 20, 2014, at 5:55 PM, Sandy Ryza  wrote:
>
> > The issue isn't that the Iterator[P] can't be disk-backed.  It's that,
> with
> > a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read
> > into memory at once.  The ShuffledRDD is agnostic to what goes inside P.
> >
> > On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan  >wrote:
> >
> >> An iterator does not imply data has to be memory resident.
> >> Think merge sort output as an iterator (disk backed).
> >>
> >> Tom is actually planning to work on something similar with me on this
> >> hopefully this or next month.
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >> On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza 
> >> wrote:
> >>> Hey all,
> >>>
> >>> After a shuffle / groupByKey, Hadoop MapReduce allows the values for a
> >> key
> >>> to not all fit in memory.  The current ShuffleFetcher.fetch API, which
> >>> doesn't distinguish between keys and values, only returning an
> >> Iterator[P],
> >>> seems incompatible with this.
> >>>
> >>> Any thoughts on how we could achieve parity here?
> >>>
> >>> -Sandy
> >>
>
>


Re: [jira] [Commented] (SPARK-1496) SparkContext.jarOfClass should return Option instead of a sequence

2014-04-16 Thread Patrick Wendell
Just option [string] is fine. Happy to accept a fix but I'll probably
submit one tonight if no one else has.

---
sent from my phone
On Apr 16, 2014 10:16 AM, "haosdent (JIRA)"  wrote:

>
> [
> https://issues.apache.org/jira/browse/SPARK-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13971671#comment-13971671]
>
> haosdent commented on SPARK-1496:
> -
>
> Is it should return Option[Seq[String]]? Maybe I could help you fix this
> issue. :-)
>
> > SparkContext.jarOfClass should return Option instead of a sequence
> > --
> >
> > Key: SPARK-1496
> > URL: https://issues.apache.org/jira/browse/SPARK-1496
> > Project: Spark
> >  Issue Type: Improvement
> >      Components: Spark Core
> >Reporter: Patrick Wendell
> >Assignee: Patrick Wendell
> > Fix For: 1.0.0
> >
> >
> > This is pretty confusing, especially since addJar expects to take a
> single jar.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>


Re: It seems that jenkins for PR is not working

2014-04-15 Thread Patrick Wendell
There are a few things going on here wrt tests.

1. I fixed up the RAT issues with a hotfix.

2. The Hive tests were actually disabled for a while accidentally. A recent
fix correctly re-enabled them. Without Hive, Spark tests run in about 40
minutes; with Hive they run in 1 hour and 15 minutes, so it's a big
difference.

To ease things I committed a patch today that only runs the Hive tests if
the change touches Spark SQL. So this should make it simpler for normal
tests.

We can actually generalize this to do much finer grained testing, e.g. if
something in MLLib changes we don't need to re-run the streaming tests.
I've added this JIRA to track it:
https://issues.apache.org/jira/browse/SPARK-1455

3. Overall we've experienced more race conditions with tests recently. I
noticed a few zombie test processes on Jenkins hogging up 100% of CPU so I
think this has triggered several previously unseen races due to CPU
contention on the test cluster. I killed them and we'll see if they crop up
again.

4. Please try to keep an eye on the length of new tests that get committed.
It's common to see people commit tests that e.g. sleep for several seconds
or do things that take a long time. Almost always this can be avoided and
usually avoiding it makes the test cleaner anyways (e.g. use proper
synchronization instead of sleeping).

- Patrick


On Tue, Apr 15, 2014 at 9:34 AM, Mark Hamstra wrote:

> The RAT path issue is now fixed, but it appears to me that some recent
> change has dramatically altered the behavior of the testing framework, so
> that I am now seeing many individual tests taking more than a minute to run
> and the complete test run taking a very, very long time.  I expect that
> this is what is causing Jenkins to now timeout repeatedly.
>
>
> On Mon, Apr 14, 2014 at 1:32 PM, Nan Zhu  wrote:
>
> > +1
> >
> > --
> > Nan Zhu
> >
> >
> > On Friday, April 11, 2014 at 5:35 PM, DB Tsai wrote:
> >
> > > I always got
> > >
> =
> > >
> > > Could not find Apache license headers in the following files:
> > > !? /root/workspace/SparkPullRequestBuilder/python/metastore/db.lck
> > > !?
> >
> /root/workspace/SparkPullRequestBuilder/python/metastore/service.properties
> > >
> > >
> > > Sincerely,
> > >
> > > DB Tsai
> > > ---
> > > My Blog: https://www.dbtsai.com
> > > LinkedIn: https://www.linkedin.com/in/dbtsai
> > >
> > >
> >
> >
> >
>


Re: org.apache.spark.util.Vector is deprecated what next ?

2014-04-10 Thread Patrick Wendell
You'll need to use the associated functionality in Breeze and then create a
dense vector from a Breeze vector. I have a JIRA for us to update the
examples for 1.0...  I'm hoping Xiangrui can take a look at it.

https://issues.apache.org/jira/browse/SPARK-1464

https://github.com/scalanlp/breeze/wiki/Breeze-Linear-Algebra
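
For the original question (a zero vector), a short sketch of what this can
look like. Converting via toArray is just one way to bridge a Breeze vector
into an mllib vector and is an illustration, not necessarily the recommended
API:

import breeze.linalg.{DenseVector => BreezeDenseVector}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// A zero vector built directly with mllib:
val zeros: Vector = Vectors.dense(new Array[Double](10))

// Or build it in Breeze and copy the values over:
val breezeZeros = BreezeDenseVector.zeros[Double](10)
val mllibZeros: Vector = Vectors.dense(breezeZeros.toArray)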



On Thu, Apr 10, 2014 at 12:56 PM, techaddict  wrote:

> org.apache.spark.util.Vector is deprecated so what should be done to use
> say
> if want to create a vector with zeros, def zeros(length: Int) in
> util.Vector
> using new mllib.linalg.Vector ?
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/org-apache-spark-util-Vector-is-deprecated-what-next-tp6288.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>


branch-1.0 cut

2014-04-09 Thread Patrick Wendell
Hey All,

In accordance with the scheduled window for the release I've cut a 1.0
branch. Thanks a ton to everyone for being so active in reviews during the
last week. In the last 7 days we've merged 66 new patches, and every one of
them has undergone thorough peer-review. Tons of committers have been
active in code review - pretty cool!

At this point the 1.0 branch transitions to a normal maintenance branch*.
Bug fixes, documentation are still welcome or additions to higher level
libraries (e.g. MLLib). The focus though is shifting to QA, fixes, and
documentation for the release.

Thanks again to everyone who participated in the last week!

- Patrick

*caveat: we will still merge in some API visibility patches and a few
remaining loose ends in the next day or two.


Re: Flaky streaming tests

2014-04-07 Thread Patrick Wendell
TD - do you know what is going on here?

I looked into this a bit and at least a few of these use Thread.sleep()
and assume the sleep will be exact, which is wrong. We should disable all
the tests that do this, and they should probably be re-written to
virtualize time.

- Patrick
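
As an example of the kind of rewrite meant here (a generic sketch, not a
patch to any specific suite): instead of sleeping for a fixed interval and
then asserting, poll the condition with a timeout, e.g. with ScalaTest's
Eventually trait:

import scala.collection.mutable.ArrayBuffer
import org.scalatest.FunSuite
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

class PollingInsteadOfSleepSuite extends FunSuite {
  test("wait for async output without a fixed sleep") {
    val output = new ArrayBuffer[Int]()
    // Stand-in for a receiver/stream producing results asynchronously.
    new Thread(new Runnable {
      override def run() {
        Thread.sleep(500)
        output.synchronized { output += 1 }
      }
    }).start()

    // Instead of Thread.sleep(2000); assert(...), retry until the assertion
    // passes or the timeout expires -- robust to slow or loaded test machines.
    eventually(timeout(10.seconds), interval(100.milliseconds)) {
      assert(output.synchronized { output.nonEmpty })
    }
  }
}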


On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout wrote:

> Hi all,
>
> The InputStreamsSuite seems to have some serious flakiness issues -- I've
> seen the file input stream fail many times and now I'm seeing some actor
> input stream test failures (
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull
> )
> on what I think is an unrelated change.  Does anyone know anything about
> these?  Should we just remove some of these tests since they seem to be
> constantly failing?
>
> -Kay
>


Re: Master compilation

2014-04-05 Thread Patrick Wendell
If you want to submit a hot fix for this issue specifically please do. I'm
not sure why it didn't fail our build...


On Sat, Apr 5, 2014 at 2:30 PM, Debasish Das wrote:

> I verified this is happening for both CDH4.5 and 1.0.4...My deploy
> environment is Java 6...so Java 7 compilation is not going to help...
>
> Is this the PR which caused it ?
>
> Andre Schumacher
>
> fbebaed "Spark parquet improvements": A few improvements to the Parquet
> support for SQL queries: instead of files, a ParquetRelation is now backed
> by a directory, which simplifies importing data from other sources;
> InsertIntoParquetTable now supports switching between overwriting
> or appending (at least in HiveQL); tests now use the new API; Parquet
> logging can be set to WARNING level (default); default compression for
> Parquet files (GZIP, as in parquet-mr). Author: Andre Schumacher
> (SPARK-1383, committed 2 days ago)
>
> I will go to a stable checkin before this
>
>
>
>
> On Sat, Apr 5, 2014 at 2:22 PM, Debasish Das  >wrote:
>
> > I can compile with Java 7...let me try that...
> >
> >
> > On Sat, Apr 5, 2014 at 2:19 PM, Sean Owen  wrote:
> >
> >> That method was added in Java 7. The project is on Java 6, so I think
> >> this was just an inadvertent error in a recent PR (it was the 'Spark
> >> parquet improvements' one).
> >>
> >> I'll open a hot-fix PR after looking for other stuff like this that
> >> might have snuck in.
> >> --
> >> Sean Owen | Director, Data Science | London
> >>
> >>
> >> On Sat, Apr 5, 2014 at 10:04 PM, Debasish Das  >
> >> wrote:
> >> > I am synced with apache/spark master but getting error in spark/sql
> >> > compilation...
> >> >
> >> > Is the master broken ?
> >> >
> >> > [info] Compiling 34 Scala sources to
> >> > /home/debasish/spark_deploy/sql/core/target/scala-2.10/classes...
> >> > [error]
> >> >
> >>
> /home/debasish/spark_deploy/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala:106:
> >> > value getGlobal is not a member of object java.util.logging.Logger
> >> > [error]   logger.setParent(Logger.getGlobal)
> >> > [error]   ^
> >> > [error] one error found
> >> > [error] (sql/compile:compile) Compilation failed
> >> > [error] Total time: 171 s, completed Apr 5, 2014 4:58:41 PM
> >> >
> >> > Thanks.
> >> > Deb
> >>
> >
> >
>
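
For context, Logger.getGlobal was added in Java 7, which is why the compile
fails on Java 6. One Java 6-compatible way to get the same logger is to look
it up by name; this is only a sketch ("parquet" is a placeholder logger name),
and the actual hotfix may have done something different:

import java.util.logging.Logger

val logger = Logger.getLogger("parquet")   // placeholder for the logger being configured

// Java 7 only: logger.setParent(Logger.getGlobal)
// Java 6 compatible: the global logger is simply the logger named "global".
logger.setParent(Logger.getLogger("global"))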


Re: Recent heartbeats

2014-04-04 Thread Patrick Wendell
I answered this over on the user list...


On Fri, Apr 4, 2014 at 6:13 PM, Debasish Das wrote:

> Hi,
>
> Also posted it on user but then I realized it might be more involved.
>
> In my ALS runs I am noticing messages that complain about heart beats:
>
> 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
> BlockManagerId(17, machine1, 53419, 0) with no recent heart beats: 48476ms
> exceeds 45000ms
> 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
> BlockManagerId(12, machine2, 60714, 0) with no recent heart beats: 45328ms
> exceeds 45000ms
> 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
> BlockManagerId(19, machine3, 39496, 0) with no recent heart beats: 53259ms
> exceeds 45000ms
>
> Is this some issue with the underlying jvm over which akka is run ? Can I
> increase the heartbeat somehow to get these messages resolved ?
>
> Any more insight about the possible cause for the heartbeat will be
> helpful...
>
> Thanks.
> Deb
>
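
For reference (hedged, since the actual answer went to the user list): the
45000ms in those warnings corresponds to the block manager slave timeout.
Assuming spark.storage.blockManagerSlaveTimeoutMs is the relevant setting in
this Spark version, it can be raised in the application, though that only
hides the symptom if executors are stalling in long GC pauses:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ALS")
  .set("spark.storage.blockManagerSlaveTimeoutMs", "120000")  // 2 minutes
val sc = new SparkContext(conf)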


Re: Would anyone mind having a quick look at PR#288?

2014-04-02 Thread Patrick Wendell
Hey Evan,

Ya thanks this is a pretty small patch. Should definitely be do-able for
1.0.

- Patrick


On Wed, Apr 2, 2014 at 10:25 AM, Evan Chan  wrote:

> https://github.com/apache/spark/pull/288
>
> It's for fixing SPARK-1154, which would help Spark be a better citizen for
> most deploys, and should be really small and easy to review.
>
> thanks,
> Evan
>
>
> --
> --
> Evan Chan
> Staff Engineer
> e...@ooyala.com  |
>
>


Re: sbt-package-bin

2014-04-01 Thread Patrick Wendell
Ya there is already some fragmentation here. Maven has some "dist" targets
and there is also ./make-distribution.sh.


On Tue, Apr 1, 2014 at 11:31 AM, Mark Hamstra wrote:

> A basic Debian package can already be created from the Maven build: mvn
> -Pdeb ...
>
>
> On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan  wrote:
>
> > Also, I understand this is the last week / merge window for 1.0, so if
> > folks are interested I'd like to get in a PR quickly.
> >
> > thanks,
> > Evan
> >
> >
> >
> > On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan  wrote:
> >
> > > Hey folks,
> > >
> > > We are in the middle of creating a Chef recipe for Spark.   As part of
> > > that we want to create a Debian package for Spark.
> > >
> > > What do folks think of adding the sbt-package-bin plugin to allow easy
> > > creation of a Spark .deb file?  I believe it adds all dependency jars
> > into
> > > a single lib/ folder, so in some ways it's even easier to manage than
> the
> > > assembly.
> > >
> > > Also I'm not sure if there's an equivalent plugin for Maven.
> > >
> > > thanks,
> > > Evan
> > >
> > >
> > > --
> > > --
> > >  Evan Chan
> > > Staff Engineer
> > > e...@ooyala.com  |
> > >
> > >  <
> > http://www.linkedin.com/company/ooyala>
> > >
> > >
> >
> >
> > --
> > --
> > Evan Chan
> > Staff Engineer
> > e...@ooyala.com  |
> >
> >
>
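
To summarize the existing options mentioned in this thread (exact flags for
make-distribution.sh vary by version, so treat these as examples):

# Debian package from the Maven build (the -Pdeb profile mentioned above):
mvn -Pdeb -DskipTests package

# Binary distribution tarball via the helper script in the repo root:
./make-distribution.sh --tgz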


Re: sbt-package-bin

2014-04-01 Thread Patrick Wendell
And there is a deb target as well - ah didn't see Mark's email.


On Tue, Apr 1, 2014 at 11:36 AM, Patrick Wendell  wrote:

> Ya there is already some fragmentation here. Maven has some "dist" targets
> and there is also ./make-distribution.sh.
>
>
> On Tue, Apr 1, 2014 at 11:31 AM, Mark Hamstra wrote:
>
>> A basic Debian package can already be created from the Maven build: mvn
>> -Pdeb ...
>>
>>
>> On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan  wrote:
>>
>> > Also, I understand this is the last week / merge window for 1.0, so if
>> > folks are interested I'd like to get in a PR quickly.
>> >
>> > thanks,
>> > Evan
>> >
>> >
>> >
>> > On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan  wrote:
>> >
>> > > Hey folks,
>> > >
>> > > We are in the middle of creating a Chef recipe for Spark.   As part of
>> > > that we want to create a Debian package for Spark.
>> > >
>> > > What do folks think of adding the sbt-package-bin plugin to allow easy
>> > > creation of a Spark .deb file?  I believe it adds all dependency jars
>> > into
>> > > a single lib/ folder, so in some ways it's even easier to manage than
>> the
>> > > assembly.
>> > >
>> > > Also I'm not sure if there's an equivalent plugin for Maven.
>> > >
>> > > thanks,
>> > > Evan
>> > >
>> > >
>> > > --
>> > > --
>> > >  Evan Chan
>> > > Staff Engineer
>> > > e...@ooyala.com  |
>> > >
>> > > <http://www.ooyala.com/> <http://www.facebook.com/ooyala><
>> > http://www.linkedin.com/company/ooyala><http://www.twitter.com/ooyala>
>> > >
>> > >
>> >
>> >
>> > --
>> > --
>> > Evan Chan
>> > Staff Engineer
>> > e...@ooyala.com  |
>> >
>> > <http://www.ooyala.com/>
>> > <http://www.facebook.com/ooyala><http://www.linkedin.com/company/ooyala
>> ><
>> > http://www.twitter.com/ooyala>
>> >
>>
>
>


Re: [VOTE] Release Apache Spark 0.9.1 (RC3)

2014-04-01 Thread Patrick Wendell
Tom,

Given this is a pretty straightforward workaround, what do you think about
the following course of action:

(a) We can put the workaround in the docs for 0.9.1. We don't need to do a
new RC/vote for this since we can update the published docs independently.

(b) We try to get a fix in for this into the 0.9 branch so it can end up in
0.9.2. But this takes the fix off the critical path for this release.

- Patrick
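
For anyone hitting this on 0.9.1, the workaround Tom describes below amounts
to roughly the following (other YARN-related environment variables such as
SPARK_JAR / SPARK_YARN_APP_JAR omitted; details depend on your setup):

# Workaround for spark-shell in yarn-client mode against secure HDFS on 0.9.1:
export SPARK_YARN_MODE=true
MASTER=yarn-client ./bin/spark-shell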


On Tue, Apr 1, 2014 at 7:28 AM, Tom Graves  wrote:

> Thanks for extending the voting.
>
> Unfortunately I've found an issue with the spark-shell in yarn-client
> mode.  It doesn't work with secure HDFS unless you
> export SPARK_YARN_MODE=true before starting the shell, or if you happen to
> do something immediately with HDFS.  If you wait for the connection to the
> namenode to timeout it will fail.
>
> I think it was actually this way in the 0.9 release also so I thought I
> would send this and get peoples feedback to see if you want it fixed?
>
> Another option would be to document that you have to export
> SPARK_YARN_MODE=true for the shell.   The fix actually went in with the
> authentication changes I made in master but I never realized that change
> needed to apply to 0.9.
>
>
> https://github.com/apache/spark/commit/7edbea41b43e0dc11a2de156be220db8b7952d01#diff-0ae5b834ce90ec37c19af35aa7a5e1a0
>
> See the SparkILoop diff.
>
> Tom
>
>
> On Monday, March 31, 2014 1:33 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
> Yes, lets extend the vote for two more days from now. So the vote is open
> till Wednesday, April 02, at 20:00 UTC
>
> On that note, my +1
>
>
> TD
>
>
>
>
>
>
> On Mon, Mar 31, 2014 at 9:57 AM, Patrick Wendell 
> wrote:
>
> Yeah good point. Let's just extend this vote another few days?
> >
> >
> >
> >On Mon, Mar 31, 2014 at 8:12 AM, Tom Graves  wrote:
> >
> >> I should probably pull this off into another thread, but going forward
> can
> >> we try to not have the release votes end on a weekend? Since we only
> seem
> >> to give 3 days, it makes it really hard for anyone who is offline for
> the
> >> weekend to try it out.   Either that or extend the voting for more then
> 3
> >> days.
> >>
> >> Tom
> >> On Monday, March 31, 2014 12:50 AM, Patrick Wendell  >
> >> wrote:
> >>
> >> TD - I downloaded and did some local testing. Looks good to me!
> >>
> >> +1
> >>
> >> You should cast your own vote - at that point it's enough to pass.
> >>
> >> - Patrick
> >>
> >>
> >>
> >> On Sun, Mar 30, 2014 at 9:47 PM, prabeesh k 
> wrote:
> >>
> >> > +1
> >> > tested on Ubuntu12.04 64bit
> >> >
> >> >
> >> > On Mon, Mar 31, 2014 at 3:56 AM, Matei Zaharia <
> matei.zaha...@gmail.com
> >> > >wrote:
> >> >
> >> > > +1 tested on Mac OS X.
> >> > >
> >> > > Matei
> >> > >
> >> > > On Mar 27, 2014, at 1:32 AM, Tathagata Das <
> >> tathagata.das1...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Please vote on releasing the following candidate as Apache Spark
> >> > version
> >> > > 0.9.1
> >> > > >
> >> > > > A draft of the release notes along with the CHANGES.txt file is
> >> > > > attached to this e-mail.
> >> > > >
> >> > > > The tag to be voted on is v0.9.1-rc3 (commit 4c43182b):
> >> > > >
> >> > >
> >> >
> >>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4c43182b6d1b0b7717423f386c0214fe93073208
> >> > > >
> >> > > > The release files, including signatures, digests, etc. can be
> found
> >> at:
> >> > > > http://people.apache.org/~tdas/spark-0.9.1-rc3/
> >> > > >
> >> > > > Release artifacts are signed with the following key:
> >> > > > https://people.apache.org/keys/committer/tdas.asc
> >> > > >
> >> > > > The staging repository for this release can be found at:
> >> > > >
> >> >
> https://repository.apache.org/content/repositories/orgapachespark-1009/
> >> > > >
> >> > > > The documentation corresponding to this release can be found at:
> >> > > > http://people.apache.org/~tdas/spark-0.9.1-rc3-docs/
> >> > > >
> >> > > > Please vote on releasing this package as Apache Spark 0.9.1!
> >> > > >
> >> > > > The vote is open until Sunday, March 30, at 10:00 UTC and passes
> if
> >> > > > a majority of at least 3 +1 PMC votes are cast.
> >> > > >
> >> > > > [ ] +1 Release this package as Apache Spark 0.9.1
> >> > > > [ ] -1 Do not release this package because ...
> >> > > >
> >> > > > To learn more about Apache Spark, please see
> >> > > > http://spark.apache.org/
> >> > > > 
> >> > >
> >> > >
> >> >
> >>
> >
>


Re: [VOTE] Release Apache Spark 0.9.1 (RC3)

2014-03-31 Thread Patrick Wendell
Yeah good point. Let's just extend this vote another few days?


On Mon, Mar 31, 2014 at 8:12 AM, Tom Graves  wrote:

> I should probably pull this off into another thread, but going forward can
> we try to not have the release votes end on a weekend? Since we only seem
> to give 3 days, it makes it really hard for anyone who is offline for the
> weekend to try it out.   Either that or extend the voting for more then 3
> days.
>
> Tom
> On Monday, March 31, 2014 12:50 AM, Patrick Wendell 
> wrote:
>
> TD - I downloaded and did some local testing. Looks good to me!
>
> +1
>
> You should cast your own vote - at that point it's enough to pass.
>
> - Patrick
>
>
>
> On Sun, Mar 30, 2014 at 9:47 PM, prabeesh k  wrote:
>
> > +1
> > tested on Ubuntu12.04 64bit
> >
> >
> > On Mon, Mar 31, 2014 at 3:56 AM, Matei Zaharia  > >wrote:
> >
> > > +1 tested on Mac OS X.
> > >
> > > Matei
> > >
> > > On Mar 27, 2014, at 1:32 AM, Tathagata Das <
> tathagata.das1...@gmail.com>
> > > wrote:
> > >
> > > > Please vote on releasing the following candidate as Apache Spark
> > version
> > > 0.9.1
> > > >
> > > > A draft of the release notes along with the CHANGES.txt file is
> > > > attached to this e-mail.
> > > >
> > > > The tag to be voted on is v0.9.1-rc3 (commit 4c43182b):
> > > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4c43182b6d1b0b7717423f386c0214fe93073208
> > > >
> > > > The release files, including signatures, digests, etc. can be found
> at:
> > > > http://people.apache.org/~tdas/spark-0.9.1-rc3/
> > > >
> > > > Release artifacts are signed with the following key:
> > > > https://people.apache.org/keys/committer/tdas.asc
> > > >
> > > > The staging repository for this release can be found at:
> > > >
> > https://repository.apache.org/content/repositories/orgapachespark-1009/
> > > >
> > > > The documentation corresponding to this release can be found at:
> > > > http://people.apache.org/~tdas/spark-0.9.1-rc3-docs/
> > > >
> > > > Please vote on releasing this package as Apache Spark 0.9.1!
> > > >
> > > > The vote is open until Sunday, March 30, at 10:00 UTC and passes if
> > > > a majority of at least 3 +1 PMC votes are cast.
> > > >
> > > > [ ] +1 Release this package as Apache Spark 0.9.1
> > > > [ ] -1 Do not release this package because ...
> > > >
> > > > To learn more about Apache Spark, please see
> > > > http://spark.apache.org/
> > > > 
> > >
> > >
> >
>


Re: The difference between driver and master in Spark

2014-03-31 Thread Patrick Wendell
Checkout this page:
http://spark.incubator.apache.org/docs/latest/cluster-overview.html


On Mon, Mar 31, 2014 at 9:11 AM, Nan Zhu  wrote:

> master is managing the resources in the cluster, e.g. ensuring all
> components can work together, master/worker/driver
>
> e.g. you have to submit your application with the path: driver -> master
> -> worker
>
> then
>
> the driver take most of the responsibility of running your application,
> e.g. scheduling jobs/tasks
>
> the driver is more like a user-facing component, while master is more
> transparent to the user
>
> Best,
>
> --
> Nan Zhu
>
>
> On Monday, March 31, 2014 at 10:48 AM, Dan wrote:
>
> > Hi,
> >
> > I've been recently reading spark code and confused about driver and
> > master. What's the difference between them?
> >
> > When I run spark in standalone cluster, from the log it seems that the
> > driver has not been launched.
> >
> > Thanks,
> > Dan
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/The-difference-between-driver-and-master-in-Spark-tp6158.html
> > Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com (http://Nabble.com).
> >
> >
>
>
>


Re: [VOTE] Release Apache Spark 0.9.1 (RC3)

2014-03-30 Thread Patrick Wendell
TD - I downloaded and did some local testing. Looks good to me!

+1

You should cast your own vote - at that point it's enough to pass.

- Patrick


On Sun, Mar 30, 2014 at 9:47 PM, prabeesh k  wrote:

> +1
> tested on Ubuntu12.04 64bit
>
>
> On Mon, Mar 31, 2014 at 3:56 AM, Matei Zaharia  >wrote:
>
> > +1 tested on Mac OS X.
> >
> > Matei
> >
> > On Mar 27, 2014, at 1:32 AM, Tathagata Das 
> > wrote:
> >
> > > Please vote on releasing the following candidate as Apache Spark
> version
> > 0.9.1
> > >
> > > A draft of the release notes along with the CHANGES.txt file is
> > > attached to this e-mail.
> > >
> > > The tag to be voted on is v0.9.1-rc3 (commit 4c43182b):
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4c43182b6d1b0b7717423f386c0214fe93073208
> > >
> > > The release files, including signatures, digests, etc. can be found at:
> > > http://people.apache.org/~tdas/spark-0.9.1-rc3/
> > >
> > > Release artifacts are signed with the following key:
> > > https://people.apache.org/keys/committer/tdas.asc
> > >
> > > The staging repository for this release can be found at:
> > >
> https://repository.apache.org/content/repositories/orgapachespark-1009/
> > >
> > > The documentation corresponding to this release can be found at:
> > > http://people.apache.org/~tdas/spark-0.9.1-rc3-docs/
> > >
> > > Please vote on releasing this package as Apache Spark 0.9.1!
> > >
> > > The vote is open until Sunday, March 30, at 10:00 UTC and passes if
> > > a majority of at least 3 +1 PMC votes are cast.
> > >
> > > [ ] +1 Release this package as Apache Spark 0.9.1
> > > [ ] -1 Do not release this package because ...
> > >
> > > To learn more about Apache Spark, please see
> > > http://spark.apache.org/
> > > 
> >
> >
>


Migration to the new Spark JIRA

2014-03-29 Thread Patrick Wendell
Hey All,

We've successfully migrated the Spark JIRA to the Apache infrastructure.
This turned out to be a huge effort, led by Andy Konwinski, who deserves
all of our deepest appreciation for managing this complex migration.

Since Apache runs the same JIRA version as Spark's existing JIRA, there is
no new software to learn. A few things to note though:

- The issue tracker for Spark is now at:
https://issues.apache.org/jira/browse/SPARK

- You can sign up to receive an e-mail feed of JIRA updates by e-mailing:
issues-subscr...@spark.apache.org

- DO NOT create issues on the old JIRA. I'll try to disable this so that it
is read-only.

- You'll need to create an account at the new site if you don't have one
already.

- We've imported all the old JIRAs. In some cases the import tool can't
correctly guess the assignee for the JIRA, so we may have to do some manual
assignment.

- If you feel like you don't have sufficient permissions on the new JIRA,
please send me an e-mail. I tried to add all of the committers as
administrators but I may have missed some.

Thanks,
Patrick


Re: Could you undo the JIRA dev list e-mails?

2014-03-29 Thread Patrick Wendell
Okay cool - sorry about that. Infra should be able to migrate these over to
an issues@ list shortly. I'd rather bother a few moderators than the entire
dev list... but ya I realize it's annoying :/


On Sat, Mar 29, 2014 at 1:22 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> I "reverted" this Patrick, per your request:
>
> [hermes] 8:21pm spark.apache.org > ezmlm-list dev | grep jira
> j...@apache.org
> [hermes] 8:21pm spark.apache.org > ezmlm-unsub dev j...@apache.org
> [hermes] 8:21pm spark.apache.org > ezmlm-list dev | grep jira
> [hermes] 8:21pm spark.apache.org >
>
> Note, that I an other moderators will now receive moderation
>
> emails until the infra ticket is fixed but others will not.
> I'll set up a mail filter.
>
> Chris
>
>
> -Original Message-
> From: , Chris Mattmann 
> Date: Saturday, March 29, 2014 1:11 PM
> To: Patrick Wendell , Chris Mattmann
> 
> Cc: "dev@spark.apache.org" 
> Subject: Re: Could you undo the JIRA dev list e-mails?
>
> >Patrick,
> >
> >No problem -- at the same time realize that I and the other
> >moderators were getting spammed by moderation emails from JIRA,
> >
> >so you should take that into consideration as well.
> >
> >Cheers,
> >Chris
> >
> >
> >-Original Message-
> >From: Patrick Wendell 
> >Date: Saturday, March 29, 2014 11:59 AM
> >To: Chris Mattmann 
> >Cc: "d...@spark.incubator.apache.org" 
> >Subject: Re: Could you undo the JIRA dev list e-mails?
> >
> >>Okay I think I managed to revert this by just removing jira@a.o from our
> >>dev list.
> >>
> >>
> >>On Sat, Mar 29, 2014 at 11:37 AM, Patrick Wendell
> >> wrote:
> >>
> >>Hey Chris,
> >>
> >>
> >>I don't think our JIRA has been fully migrated to Apache infra, so it's
> >>really confusing to send people e-mails referring to the new JIRA since
> >>we haven't announced it yet. There is some content there because we've
> >>been trying to do the migration, but
> >> I'm not sure it's entirely finished.
> >>
> >>
> >>Also, right now our github comments go to a commits@ list. I'm actually
> >>-1 copying all of these to JIRA because we do a bunch of review level
> >>comments that are going to pollute the JIRA a bunch.
> >>
> >>
> >>In any case, can you revert the change whatever it was that sent these to
> >>the dev list? We should have a coordinated plan about this transition and
> >>the e-mail changes we plan to make.
> >>
> >>
> >>- Patrick
> >>
> >>
> >>
> >>
> >>
> >>
> >
>
>


Re: JIRA. github and asf updates

2014-03-29 Thread Patrick Wendell
I'm working with infra to get the following set-up:

1. Don't post github updates to jira comments (they are too low level). If
users want these they can subscribe to commits@s.a.o.
2. Jira comment stream will go to issues@s.a.o so people can opt into that.

One thing YARN has set up is e-mailing *new* JIRAs to the dev list. That
might be cool for us to set up in the future as well.


On Sat, Mar 29, 2014 at 1:15 PM, Mridul Muralidharan wrote:

> If the PR comments are going to be replicated into the jira's and they
> are going to be set to dev@, then we could keep that and remove
> [Github] updates ?
> The last was added since discussions were happening off apache lists -
> which should be handled by the jira updates ?
>
> I dont mind the mails if they had content - this is just duplication
> of the same message in three mails :-)
> Btw, this is a good problem to have - a vibrant and very actively
> engaged community generated a lot of meaningful traffic !
> I just dont want to get distracted from it by repetitions.
>
> Regards,
> Mridul
>
>
> On Sat, Mar 29, 2014 at 11:46 PM, Patrick Wendell 
> wrote:
> > Ah sorry I see - Jira updates are going to the dev list. Maybe that's not
> > desirable. I think we should send them to the issues@ list.
> >
> >
> > On Sat, Mar 29, 2014 at 11:16 AM, Patrick Wendell  >wrote:
> >
> >> Mridul,
> >>
> >> You can unsubscribe yourself from any of these sources, right?
> >>
> >> - Patrick
> >>
> >>
> >> On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan  >wrote:
> >>
> >>> Hi,
> >>>
> >>>   So we are now receiving updates from three sources for each change to
> >>> the PR.
> >>> While each of them handles a corner case which others might miss,
> >>> would be great if we could minimize the volume of duplicated
> >>> communication.
> >>>
> >>>
> >>> Regards,
> >>> Mridul
> >>>
> >>
> >>
>


Re: Could you undo the JIRA dev list e-mails?

2014-03-29 Thread Patrick Wendell
Okay I think I managed to revert this by just removing jira@a.o from our
dev list.


On Sat, Mar 29, 2014 at 11:37 AM, Patrick Wendell wrote:

> Hey Chris,
>
> I don't think our JIRA has been fully migrated to Apache infra, so it's
> really confusing to send people e-mails referring to the new JIRA since we
> haven't announced it yet. There is some content there because we've been
> trying to do the migration, but I'm not sure it's entirely finished.
>
> Also, right now our github comments go to a commits@ list. I'm actually
> -1 copying all of these to JIRA because we do a bunch of review level
> comments that are going to pollute the JIRA a bunch.
>
> In any case, can you revert the change whatever it was that sent these to
> the dev list? We should have a coordinated plan about this transition and
> the e-mail changes we plan to make.
>
> - Patrick
>


Could you undo the JIRA dev list e-mails?

2014-03-29 Thread Patrick Wendell
Hey Chris,

I don't think our JIRA has been fully migrated to Apache infra, so it's
really confusing to send people e-mails referring to the new JIRA since we
haven't announced it yet. There is some content there because we've been
trying to do the migration, but I'm not sure it's entirely finished.

Also, right now our github comments go to a commits@ list. I'm actually -1
copying all of these to JIRA because we do a bunch of review level comments
that are going to pollute the JIRA a bunch.

In any case, can you revert the change whatever it was that sent these to
the dev list? We should have a coordinated plan about this transition and
the e-mail changes we plan to make.

- Patrick


Re: JIRA. github and asf updates

2014-03-29 Thread Patrick Wendell
Ah sorry I see - Jira updates are going to the dev list. Maybe that's not
desirable. I think we should send them to the issues@ list.


On Sat, Mar 29, 2014 at 11:16 AM, Patrick Wendell wrote:

> Mridul,
>
> You can unsubscribe yourself from any of these sources, right?
>
> - Patrick
>
>
> On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan wrote:
>
>> Hi,
>>
>>   So we are now receiving updates from three sources for each change to
>> the PR.
>> While each of them handles a corner case which others might miss,
>> would be great if we could minimize the volume of duplicated
>> communication.
>>
>>
>> Regards,
>> Mridul
>>
>
>


Re: JIRA. github and asf updates

2014-03-29 Thread Patrick Wendell
Mridul,

You can unsubscribe yourself from any of these sources, right?

- Patrick


On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan wrote:

> Hi,
>
>   So we are now receiving updates from three sources for each change to
> the PR.
> While each of them handles a corner case which others might miss,
> would be great if we could minimize the volume of duplicated
> communication.
>
>
> Regards,
> Mridul
>


Re: Scala 2.10.4

2014-03-28 Thread Patrick Wendell
Really - I didn't know this ever was changed. But in any case, I think you
can compile with 2.10.4 and run with 2.10.3 and it's fine - right?


On Fri, Mar 28, 2014 at 11:48 AM, Matei Zaharia wrote:

> We don't actually use Scala from the user's OS anymore, we use it from the
> Spark build, so it's not a big deal. This release just has some bug fixes.
>
> Matei
>
> On Mar 28, 2014, at 11:26 AM, Kay Ousterhout 
> wrote:
>
> > What do we get by upgrading to 2.10.4?  Just wondering if it's worth the
> > annoyance of everyone needing to download a new version of Scala, making
> > yet another version of the AMIs, etc.
> >
> > -Kay
> >
> >
> > On Thu, Mar 27, 2014 at 4:33 PM, Matei Zaharia  >wrote:
> >
> >> Sounds good. Feel free to send a PR even though it's a small change (it
> >> leads to better Git history and such).
> >>
> >> Matei
> >>
> >> On Mar 27, 2014, at 4:15 PM, Mark Hamstra 
> wrote:
> >>
> >>> FYI, Spark master does build cleanly and the tests do run successfully
> >> with
> >>> Scala version set to 2.10.4, so we can probably bump 1.0.0-SNAPSHOT to
> >> use
> >>> the new version anytime we care to.
> >>
> >>
>
>


Re: Mailbomb from amplabs jenkins ?

2014-03-27 Thread Patrick Wendell
Yeah sorry guys - Jenkins is having some issues and there isn't a way to
fix this that doesn't spam people following github. Apologies!


On Thu, Mar 27, 2014 at 8:16 PM, Nan Zhu  wrote:

> yes, it sends for every PR you were involved
>
> I think Patrick is doing something on Jenkins, he just stopped some
> testing jobs manually
>
> Best,
>
> --
> Nan Zhu
>
>
> On Thursday, March 27, 2014 at 11:07 PM, Mridul Muralidharan wrote:
>
> > Got some 100 odd mails from jenkins (?) with "Can one of the admins
> > verify this patch?"
> > Part of upgrade or some other issue ?
> > Significantly reduced the snr of my inbox !
> >
> > Regards,
> > Mridul
> >
> >
>
>
>


Re: Spark 0.9.1 release

2014-03-26 Thread Patrick Wendell
Hey TD,

This one we just merged into master this morning:
https://spark-project.atlassian.net/browse/SPARK-1322

It should definitely go into the 0.9 branch because there was a bug in the
semantics of top() which at this point is unreleased in Python.

I didn't backport it yet because I figured you might want to do this at a
specific time. So please go ahead and backport it. Not sure whether this
warrants another RC.

- Patrick


On Tue, Mar 25, 2014 at 10:47 PM, Mridul Muralidharan wrote:

> On Wed, Mar 26, 2014 at 10:53 AM, Tathagata Das
>  wrote:
> > PR 159 seems like a fairly big patch to me. And quite recent, so its
> impact
> > on the scheduling is not clear. It may also depend on other changes that
> > may have gotten into the DAGScheduler but not pulled into branch 0.9. I
> am
> > not sure it is a good idea to pull that in. We can pull those changes
> later
> > for 0.9.2 if required.
>
>
> There is no impact on scheduling : it only has an impact on error
> handling - it ensures that you can actually use spark on yarn in
> multi-tennent clusters more reliably.
> Currently, any reasonably long running job (30 mins+) working on non
> trivial dataset will fail due to accumulated failures in spark.
>
>
> Regards,
> Mridul
>
>
> >
> > TD
> >
> >
> >
> >
> > On Tue, Mar 25, 2014 at 8:44 PM, Mridul Muralidharan  >wrote:
> >
> >> Forgot to mention this in the earlier request for PR's.
> >> If there is another RC being cut, please add
> >> https://github.com/apache/spark/pull/159 to it too (if not done
> >> already !).
> >>
> >> Thanks,
> >> Mridul
> >>
> >> On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
> >>  wrote:
> >> >  Hello everyone,
> >> >
> >> > Since the release of Spark 0.9, we have received a number of important
> >> bug
> >> > fixes and we would like to make a bug-fix release of Spark 0.9.1. We
> are
> >> > going to cut a release candidate soon and we would love it if people
> test
> >> > it out. We have backported several bug fixes into the 0.9 and updated
> >> JIRA
> >> > accordingly<
> >>
> https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
> >> >.
> >> > Please let me know if there are fixes that were not backported but you
> >> > would like to see them in 0.9.1.
> >> >
> >> > Thanks!
> >> >
> >> > TD
> >>
>


Re: Travis CI

2014-03-25 Thread Patrick Wendell
Ya, it's been a little bit slow lately because of a high error rate in
interactions with the GitHub API. Unfortunately we are pretty slammed
for the release and haven't had a ton of time to do further debugging.

- Patrick

On Tue, Mar 25, 2014 at 7:13 PM, Nan Zhu  wrote:
> I just found that the Jenkins is not working from this afternoon
>
> for one PR, the first time build failed after 90 minutes, the second time it
> has run for more than 2 hours, no result is returned
>
> Best,
>
> --
> Nan Zhu
>
>
> On Tuesday, March 25, 2014 at 10:06 PM, Patrick Wendell wrote:
>
> That's not correct - like Michael said the Jenkins build remains the
> reference build for now.
>
> On Tue, Mar 25, 2014 at 7:03 PM, Nan Zhu  wrote:
>
> I assume the Jenkins is not working now?
>
> Best,
>
> --
> Nan Zhu
>
>
> On Tuesday, March 25, 2014 at 6:42 PM, Michael Armbrust wrote:
>
> Just a quick note to everyone that Patrick and I are playing around with
> Travis CI on the Spark github repository. For now, travis does not run all
> of the test cases, so will only be turned on experimentally. Long term it
> looks like Travis might give better integration with github, so we are
> going to see if it is feasible to get all of our tests running on it.
>
> *Jenkins remains the reference CI and should be consulted before merging
> pull requests, independent of what Travis says.*
>
> If you have any questions or want to help out with the investigation, let
> me know!
>
> Michael
>
>


Re: Travis CI

2014-03-25 Thread Patrick Wendell
That's not correct - like Michael said the Jenkins build remains the
reference build for now.

On Tue, Mar 25, 2014 at 7:03 PM, Nan Zhu  wrote:
> I assume the Jenkins is not working now?
>
> Best,
>
> --
> Nan Zhu
>
>
> On Tuesday, March 25, 2014 at 6:42 PM, Michael Armbrust wrote:
>
> Just a quick note to everyone that Patrick and I are playing around with
> Travis CI on the Spark github repository. For now, travis does not run all
> of the test cases, so will only be turned on experimentally. Long term it
> looks like Travis might give better integration with github, so we are
> going to see if it is feasible to get all of our tests running on it.
>
> *Jenkins remains the reference CI and should be consulted before merging
> pull requests, independent of what Travis says.*
>
> If you have any questions or want to help out with the investigation, let
> me know!
>
> Michael
>
>


Re: Spark 0.9.1 release

2014-03-24 Thread Patrick Wendell
> Spark's dependency graph in a maintenance
*Modifying* Spark's dependency graph...


Re: Spark 0.9.1 release

2014-03-24 Thread Patrick Wendell
Hey Evan and TD,

Spark's dependency graph in a maintenance release seems potentially
harmful, especially upgrading a minor version (not just a patch
version) like this. This could affect other downstream users. For
instance, they might now hit some new problem in fastutil 6.5 without
even knowing that their fastutil dependency got bumped.
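
A minimal sketch of the downstream side of that concern, assuming an sbt
project that depends on spark-core and wants fastutil pinned to a known
version (fastutil's usual coordinates; the version numbers are placeholders):

    // build.sbt fragment for a hypothetical downstream project
    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0"

    // Pin fastutil explicitly so a transitive bump in a Spark maintenance
    // release cannot change it silently.
    dependencyOverrides += "it.unimi.dsi" % "fastutil" % "6.4.4"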

- Patrick

On Mon, Mar 24, 2014 at 12:02 AM, Tathagata Das
 wrote:
> @Shivaram, That is a useful patch but I am a bit afraid to merge it in.
> Randomizing the executor offers has performance implications, especially for
> Spark Streaming. The non-randomized ordering of allocating machines to tasks
> was subtly helping to speed up certain window-based shuffle operations. For
> example, corresponding shuffle partitions in multiple shuffles using the
> same partitioner were likely to be co-located, that is, shuffle partition 0
> was likely to be on the same machine for multiple shuffles. While this is
> not a mechanism to rely on, randomization may lead to performance
> degradation. So I am afraid to merge this one without understanding the
> consequences.
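
A minimal sketch (not from the thread) of the co-location effect being
described, assuming two windowed count RDDs that shuffle with the same
HashPartitioner; the data and names are illustrative:

    import org.apache.spark.{HashPartitioner, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD functions in 0.9.x

    val sc = new SparkContext("local[2]", "colocation-sketch")
    val part = new HashPartitioner(4)

    // Two "windows" of counts, both shuffled with the same partitioner,
    // so partition i of each result holds the same keys.
    val window1 = sc.parallelize(Seq(("a", 1), ("b", 1))).reduceByKey(part, _ + _)
    val window2 = sc.parallelize(Seq(("a", 2), ("b", 2))).reduceByKey(part, _ + _)

    // With matching partitioners the join is a narrow dependency and needs
    // no further shuffle; with non-randomized offers the corresponding
    // reduce tasks also tended to land on the same machine, which is the
    // co-location described above.
    window1.join(window2).collect().foreach(println)
    sc.stop()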
>
> @Evan, I have already cut a release! You can submit the PR and we can merge
> it into branch-0.9. If we have to cut another release, then we can include it.
>
>
>
> On Sun, Mar 23, 2014 at 11:42 PM, Evan Chan  wrote:
>
>> I also have a really minor fix for SPARK-1057  (upgrading fastutil),
>> could that also make it in?
>>
>> -Evan
>>
>>
>> On Sun, Mar 23, 2014 at 11:01 PM, Shivaram Venkataraman
>>  wrote:
>> > Sorry this request is coming in a bit late, but would it be possible to
>> > backport SPARK-979[1] to branch-0.9? This is the patch for randomizing
>> > executor offers and I would like to use this in a release sooner rather
>> > than later.
>> >
>> > Thanks
>> > Shivaram
>> >
>> > [1]
>> >
>> https://github.com/apache/spark/commit/556c56689bbc32c6cec0d07b57bd3ec73ceb243e#diff-8ef3258646b0e6a4793d6ad99848eacd
>> >
>> >
>> > On Thu, Mar 20, 2014 at 10:18 PM, Bhaskar Dutta 
>> wrote:
>> >
>> >> Thank You! We plan to test out 0.9.1 on YARN once it is out.
>> >>
>> >> Regards,
>> >> Bhaskar
>> >>
>> >> On Fri, Mar 21, 2014 at 12:42 AM, Tom Graves 
>> wrote:
>> >>
>> >> > I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running
>> >> > on YARN - JIRA and [SPARK-1051] On Yarn, executors don't doAs as
>> >> > submitting user - JIRA in. The pyspark one I would consider more of an
>> >> > enhancement, so it might not be appropriate for a point release.
>> >> >
>> >> >
>> >> >  [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on YA...
>> >> > org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set
>> >> >   at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:49)
>> >> >   at org.apache.spark.schedule...
>> >> >
>> >> >
>> >> >  [SPARK-1051] On Yarn, executors don't doAs as submitting user - JIRA
>> >> > This means that they can't write/read from files that the yarn user
>> >> > doesn't have permissions to, but the submitting user does.
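
For reference (not from the original message), the "doAs" in question is
Hadoop's UserGroupInformation API; a rough sketch of running file access as
the submitting user, with the user name purely illustrative:

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation

    // Run work under the identity of the submitting user instead of the
    // "yarn" user the container runs as (simple-auth clusters; a Kerberos
    // setup would need real credentials).
    val submitter = UserGroupInformation.createRemoteUser("alice")  // illustrative
    submitter.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        // HDFS reads and writes here are checked against alice's permissions.
      }
    })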
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta
>> >> > wrote:
>> >> >
>> >> > It will be great if "SPARK-1101: Umbrella for hardening Spark on YARN"
>> >> > can get into 0.9.1.
>> >> >
>> >> > Thanks,
>> >> > Bhaskar
>> >> >
>> >> >
>> >> > On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
>> >> > wrote:
>> >> >
>> >> > >  Hello everyone,
>> >> > >
>> >> > > Since the release of Spark 0.9, we have received a number of important
>> >> > > bug fixes and we would like to make a bug-fix release of Spark 0.9.1. We
>> >> > > are going to cut a release candidate soon and we would love it if people
>> >> > > test it out. We have backported several bug fixes into the 0.9 branch and
>> >> > > updated JIRA accordingly:
>> >> > > https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
>> >> > > Please let me know if there are fixes that were not backported but you
>> >> > > would like to see in 0.9.1.
>> >> > >
>> >> > > Thanks!
>> >> > >
>> >> > > TD
>> >> > >
>> >> >
>> >>
>>
>>
>>
>> --
>> --
>> Evan Chan
>> Staff Engineer
>> e...@ooyala.com  |
>>

