Announcing Spark 1.4.1!

2015-07-15 Thread Patrick Wendell
Hi All,

I'm happy to announce the Spark 1.4.1 maintenance release.
We recommend all users on the 1.4 branch upgrade to
this release, which contains several important bug fixes.

Download Spark 1.4.1 - http://spark.apache.org/downloads.html
Release notes - http://spark.apache.org/releases/spark-release-1-4-1.html
Comprehensive list of fixes - http://s.apache.org/spark-1.4.1

Thanks to the 85 developers who worked on this release!

Please contact me directly for errata in the release notes.

- Patrick

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkHub: a new community site for Apache Spark

2015-07-10 Thread Patrick Wendell
Hi All,

Today, I'm happy to announce SparkHub
(http://sparkhub.databricks.com), a service for the Apache Spark
community to easily find the most relevant Spark resources on the web.

SparkHub is a curated list of Spark news, videos and talks, package
releases, upcoming events around the world, and a Spark Meetup
directory to help you find a meetup close to you.

We will continue to expand the site in the coming months and add more
content. I hope SparkHub can help you find Spark related information
faster and more easily than is currently possible. Everything is
sourced from the Spark community, and we welcome input from you as
well!

- Patrick

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Fully in-memory shuffles

2015-06-11 Thread Patrick Wendell
Hey Corey,

Yes, when shuffles are smaller than available memory to the OS, most
often the outputs never get stored to disk. I believe the same holds
for the YARN shuffle service, because the write path is actually the
same, i.e. we don't fsync the writes and force them to disk. I would
guess in such shuffles the bottleneck is serializing the data rather
than raw IO, so I'm not sure explicitly buffering the data in the JVM
process would yield a large improvement.

Writing shuffle to an explicitly pinned memory filesystem is also
possible (per Davies' suggestion), but it's brittle because the job
will fail if shuffle output exceeds memory.
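
For what it's worth, a minimal sketch of that approach (assuming your
nodes have a tmpfs mount at /dev/shm with enough headroom, which is an
assumption about your setup) is just to point the scratch directory
there, e.g. in spark-defaults.conf:

  # sketch only -- the /dev/shm path is an assumption about your nodes
  spark.local.dir    /dev/shm/spark-tmp

Shuffle and spill files then live entirely in RAM, with the failure
mode described above if they outgrow it.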

- Patrick

On Wed, Jun 10, 2015 at 9:50 PM, Davies Liu dav...@databricks.com wrote:
 If you have enough memory, you can put the temporary work directory in
 tmpfs (an in-memory file system).

 On Wed, Jun 10, 2015 at 8:43 PM, Corey Nolet cjno...@gmail.com wrote:
 Ok so it is the case that small shuffles can be done without hitting any
 disk. Is this the same case for the aux shuffle service in yarn? Can that be
 done without hitting disk?

 On Wed, Jun 10, 2015 at 9:17 PM, Patrick Wendell pwend...@gmail.com wrote:

 In many cases the shuffle will actually hit the OS buffer cache and
 not ever touch spinning disk if it is a size that is less than memory
 on the machine.

 - Patrick

 On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet cjno...@gmail.com wrote:
  So with this... to help my understanding of Spark under the hood-
 
  Is this statement correct: "When data needs to pass between multiple
  JVMs, a shuffle will always hit disk"?
 
  On Wed, Jun 10, 2015 at 10:11 AM, Josh Rosen rosenvi...@gmail.com
  wrote:
 
  There's a discussion of this at
  https://github.com/apache/spark/pull/5403
 
 
 
  On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet cjno...@gmail.com wrote:
 
  Is it possible to configure Spark to do all of its shuffling FULLY in
  memory (given that I have enough memory to store all the data)?
 
 
 
 
 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[ANNOUNCE] Announcing Spark 1.4

2015-06-11 Thread Patrick Wendell
Hi All,

I'm happy to announce the availability of Spark 1.4.0! Spark 1.4.0 is
the fifth release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 210 developers and more
than 1,000 commits!

A huge thanks goes to all of the individuals and organizations involved
in development and testing of this release.

Visit the release notes [1] to read about the new features, or
download [2] the release today.

For errata in the contributions or release notes, please e-mail me
*directly* (not on-list).

Thanks to everyone who helped work on this release!

[1] http://spark.apache.org/releases/spark-release-1-4-0.html
[2] http://spark.apache.org/downloads.html

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Fully in-memory shuffles

2015-06-10 Thread Patrick Wendell
In many cases the shuffle will actually hit the OS buffer cache and
not ever touch spinning disk if it is a size that is less than memory
on the machine.

- Patrick

On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet cjno...@gmail.com wrote:
 So with this... to help my understanding of Spark under the hood-

 Is this statement correct: "When data needs to pass between multiple JVMs, a
 shuffle will always hit disk"?

 On Wed, Jun 10, 2015 at 10:11 AM, Josh Rosen rosenvi...@gmail.com wrote:

 There's a discussion of this at https://github.com/apache/spark/pull/5403



 On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet cjno...@gmail.com wrote:

 Is it possible to configure Spark to do all of its shuffling FULLY in
 memory (given that I have enough memory to store all the data)?






-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark timeout issue

2015-04-26 Thread Patrick Wendell
Hi Deepak - please direct this to the user@ list. This list is for
development of Spark itself.

On Sun, Apr 26, 2015 at 12:42 PM, Deepak Gopalakrishnan
dgk...@gmail.com wrote:
 Hello All,

 I'm trying to process a 3.5GB file on standalone mode using spark. I could
 run my spark job successfully on a 100MB file and it works as expected. But,
 when I try to run it on the 3.5GB file, I run into the below error :


 15/04/26 12:45:50 INFO BlockManagerMaster: Updated info of block taskresult_83
 15/04/26 12:46:46 WARN AkkaUtils: Error sending message [message =
 Heartbeat(2,[Lscala.Tuple2;@790223d3,BlockManagerId(2,
 master.spark.com, 39143))] in 1 attempts
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
 at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
 15/04/26 12:47:15 INFO MemoryStore: ensureFreeSpace(26227673) called
 with curMem=265897, maxMem=5556991426
 15/04/26 12:47:15 INFO MemoryStore: Block taskresult_92 stored as
 bytes in memory (estimated size 25.0 MB, free 5.2 GB)
 15/04/26 12:47:16 INFO MemoryStore: ensureFreeSpace(26272879) called
 with curMem=26493570, maxMem=5556991426
 15/04/26 12:47:16 INFO MemoryStore: Block taskresult_94 stored as
 bytes in memory (estimated size 25.1 MB, free 5.1 GB)
 15/04/26 12:47:18 INFO MemoryStore: ensureFreeSpace(26285327) called
 with curMem=52766449, maxMem=5556991426


 and the job fails.


 I'm on AWS and have opened all ports. Also, since the 100MB file works, it
 should not be a connection issue. I have an r3.xlarge and 2 m3.large instances.

 Can anyone suggest a way to fix this?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Announcing Spark 1.3.1 and 1.2.2

2015-04-17 Thread Patrick Wendell
Hi All,

I'm happy to announce the Spark 1.3.1 and 1.2.2 maintenance releases.
We recommend all users on the 1.3 and 1.2 Spark branches upgrade to
these releases, which contain several important bug fixes.

Download Spark 1.3.1 or 1.2.2:
http://spark.apache.org/downloads.html

Release notes:
1.3.1: http://spark.apache.org/releases/spark-release-1-3-1.html
1.2.2:  http://spark.apache.org/releases/spark-release-1-2-2.html

Comprehensive list of fixes:
1.3.1: http://s.apache.org/spark-1.3.1
1.2.2: http://s.apache.org/spark-1.2.2

Thanks to everyone who worked on these releases!

- Patrick

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Configuring amount of disk space available to spark executors in mesos?

2015-04-13 Thread Patrick Wendell
Hey Jonathan,

Are you referring to disk space used for storing persisted RDDs? For
that, Spark does not bound the amount of data persisted to disk. It's
a similar story to how Spark's shuffle disk output works (and also
Hadoop and other frameworks make this assumption as well for their
shuffle data, AFAIK).

We could (in theory) add a storage level that bounds the amount of
data persisted to disk and forces re-computation if the partition did
not fit. I'd be interested to hear more about a workload where that's
relevant though, before going that route. Maybe if people are using
SSDs that would make sense.

- Patrick

On Mon, Apr 13, 2015 at 8:19 AM, Jonathan Coveney jcove...@gmail.com wrote:
 I'm surprised that I haven't been able to find this via google, but I
 haven't...

 What is the setting that requests some amount of disk space for the
 executors? Maybe I'm misunderstanding how this is configured...

 Thanks for any help!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: RDD resiliency -- does it keep state?

2015-03-27 Thread Patrick Wendell
If you invoke this, you will get at-least-once semantics on failure.
For instance, if a machine dies in the middle of executing the foreach
for a single partition, that will be re-executed on another machine.
It could even fully complete on one machine, but the machine dies
immediately before reporting the result back to the driver.

This means you need to make sure the side-effects are idempotent, or
use some transactional locking. Spark's own output operations, such as
saving to Hadoop, use such mechanisms. For instance, in the case of
Hadoop it uses the OutputCommitter classes.
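
As a rough sketch of the idempotent approach (plain JDBC here; the
jdbcUrl, table, columns, and PostgreSQL-style upsert syntax are all
assumptions about your sink, not a Spark API, and the rdd is assumed to
be an RDD[(Long, Double)]):

  rdd.foreachPartition { records =>
    val conn = java.sql.DriverManager.getConnection(jdbcUrl) // jdbcUrl assumed defined
    val stmt = conn.prepareStatement(
      "INSERT INTO results (id, value) VALUES (?, ?) " +
      "ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value") // keyed upsert
    records.foreach { case (id, value) =>
      stmt.setLong(1, id)
      stmt.setDouble(2, value)
      stmt.executeUpdate() // replaying the partition just rewrites the same rows
    }
    conn.close()
  }

Because each record maps to a fixed key, a partition that is re-executed
after a failure overwrites the rows it already wrote instead of
duplicating them.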

- Patrick

On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos michal.klo...@gmail.com wrote:
 Hi Spark group,

 We haven't been able to find clear descriptions of how Spark handles the
 resiliency of RDDs in relationship to executing actions with side-effects.
 If you do an `rdd.foreach(someSideEffect)`, then you are doing a side-effect
 for each element in the RDD. If a partition goes down -- the resiliency
 rebuilds the data, but does it keep track of how far it got in the
 partition's set of data, or will it start from the beginning again? So will
 it do at-least-once execution of foreach closures or at-most-once?

 thanks,
 Michal

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.3 Source - Github and source tar does not seem to match

2015-03-27 Thread Patrick Wendell
The source code should match the Spark commit
4aaf48d46d13129f0f9bdafd771dd80fe568a7dc. Do you see any differences?

On Fri, Mar 27, 2015 at 11:28 AM, Manoj Samel manojsamelt...@gmail.com wrote:
 While looking into an issue, I noticed that the source displayed on the Github
 site does not match the downloaded tar for 1.3.

 Thoughts ?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Patrick Wendell
I think we have a version of mapPartitions that allows you to tell
Spark the partitioning is preserved:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L639

We could also add a map function that does the same. Or you can just write
your map using an iterator.
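
For example, a rough sketch of a key-preserving map written that way
(assuming pairs is an RDD[(String, Int)] that already has a partitioner):

  val mapped = pairs.mapPartitions(
    iter => iter.map { case (k, v) => (k, v + 1) },  // key left untouched
    preservesPartitioning = true)                    // so the partitioner is kept

A subsequent reduceByKey on mapped can then avoid a second shuffle.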

- Patrick

On Thu, Mar 26, 2015 at 3:07 PM, Jonathan Coveney jcove...@gmail.com wrote:
 This is just a deficiency of the api, imo. I agree: mapValues could
 definitely be a function (K, V) => V1. The option isn't set by the function,
 it's on the RDD. So you could look at the code and do this.
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

  def mapValues[U](f: V => U): RDD[(K, U)] = {
    val cleanF = self.context.clean(f)
    new MapPartitionsRDD[(K, U), (K, V)](self,
      (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
      preservesPartitioning = true)
  }

 What you want:

  def mapValues[U](f: (K, V) => U): RDD[(K, U)] = {
    val cleanF = self.context.clean(f)
    new MapPartitionsRDD[(K, U), (K, V)](self,
      (context, pid, iter) => iter.map { case t@(k, _) => (k, cleanF(t)) },
      preservesPartitioning = true)
  }

 One of the nice things about spark is that making such new operators is very
 easy :)

 2015-03-26 17:54 GMT-04:00 Zhan Zhang zzh...@hortonworks.com:

 Thanks Jonathan. You are right regarding rewrite the example.

 I mean providing such option to developer so that it is controllable. The
 example may seems silly, and I don't know the use cases.

 But for example, if I also want to operate on both the key and the value to
 generate some new value while keeping the key untouched, then mapValues may
 not be able to do this.

 Changing the code to allow this is trivial, but I don't know whether there
 is some special reason behind this.

 Thanks.

 Zhan Zhang




 On Mar 26, 2015, at 2:49 PM, Jonathan Coveney jcove...@gmail.com wrote:

 I believe if you do the following:


 sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString

 (8) MapPartitionsRDD[34] at reduceByKey at <console>:23 []
  |  MapPartitionsRDD[33] at mapValues at <console>:23 []
  |  ShuffledRDD[32] at reduceByKey at <console>:23 []
  +-(8) MapPartitionsRDD[31] at map at <console>:23 []
 |  ParallelCollectionRDD[30] at parallelize at <console>:23 []

 The difference is that Spark has no way to know that your map closure
 doesn't change the key. If you only use mapValues, it does. Pretty cool that
 they optimized that :)

 2015-03-26 17:44 GMT-04:00 Zhan Zhang zzh...@hortonworks.com:

 Hi Folks,

 Does anybody know the reason for not allowing preservesPartitioning
 in RDD.map? Am I missing something here?

 The following example involves two shuffles. I think if preservesPartitioning
 were allowed, we could avoid the second one, right?

  val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4))
  val r2 = r1.map((_, 1))
  val r3 = r2.reduceByKey(_+_)
  val r4 = r3.map(x => (x._1, x._2 + 1))
  val r5 = r4.reduceByKey(_+_)
  r5.collect.foreach(println)

 scala> r5.toDebugString
 res2: String =
 (8) ShuffledRDD[4] at reduceByKey at <console>:29 []
  +-(8) MapPartitionsRDD[3] at map at <console>:27 []
 |  ShuffledRDD[2] at reduceByKey at <console>:25 []
 +-(8) MapPartitionsRDD[1] at map at <console>:23 []
|  ParallelCollectionRDD[0] at parallelize at <console>:21 []

 Thanks.

 Zhan Zhang

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: 1.3 Hadoop File System problem

2015-03-24 Thread Patrick Wendell
Hey Jim,

Thanks for reporting this. Can you give a small end-to-end code
example that reproduces it? If so, we can definitely fix it.

- Patrick

On Tue, Mar 24, 2015 at 4:55 PM, Jim Carroll jimfcarr...@gmail.com wrote:

 I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to
 find the s3 hadoop file system.

  I get "java.lang.IllegalArgumentException: Wrong FS: s3://[path to my
  file], expected: file:///" when I try to save a parquet file. This worked in
  1.2.1.

 Has anyone else seen this?

 I'm running spark using local[8] so it's all internal. These are actually
 unit tests in our app that are failing now.

 Thanks.
 Jim




 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/1-3-Hadoop-File-System-problem-tp22207.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-23 Thread Patrick Wendell
Hey Yiannis,

If you just perform a count on each (name, date) pair... can it succeed?
If so, can you do a count and then order by to find the largest one?

I'm wondering if there is a single pathologically large group here that is
somehow causing OOM.

Also, to be clear, you are getting GC limit warnings on the executors, not
the driver. Correct?
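
Something along these lines (a sketch against the 1.3 DataFrame API,
reusing the names from your snippet below) would surface the biggest
groups:

  val counts = people.groupBy("name", "date").count()
  counts.orderBy(counts("count").desc).show(20)  // largest groups first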

- Patrick

On Mon, Mar 23, 2015 at 10:21 AM, Martin Goodson mar...@skimlinks.com
wrote:

 Have you tried to repartition() your original data to make more partitions
 before you aggregate?


 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240

 On Mon, Mar 23, 2015 at 4:12 PM, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Hi Yin,

 Yes, I have set spark.executor.memory to 8g and the worker memory to 16g
 without any success.
 I cannot figure out how to increase the number of mapPartitions tasks.

 Thanks a lot

 On 20 March 2015 at 18:44, Yin Huai yh...@databricks.com wrote:

  spark.sql.shuffle.partitions only controls the number of tasks in the
  second stage (the number of reducers). For your case, I'd say that the
  number of tasks in the first stage (the number of mappers) will be the number
 of files you have.

 Actually, have you changed spark.executor.memory (it controls the
 memory for an executor of your application)? I did not see it in your
 original email. The difference between worker memory and executor memory
 can be found at (
 http://spark.apache.org/docs/1.3.0/spark-standalone.html),

 SPARK_WORKER_MEMORY
 Total amount of memory to allow Spark applications to use on the
 machine, e.g. 1000m, 2g (default: total memory minus 1 GB); note that
 each application's individual memory is configured using its
 spark.executor.memory property.


 On Fri, Mar 20, 2015 at 9:25 AM, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Actually I realized that the correct way is:

  sqlContext.sql("set spark.sql.shuffle.partitions=1000")

 but I am still experiencing the same behavior/error.

 On 20 March 2015 at 16:04, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Hi Yin,

 the way I set the configuration is:

 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  sqlContext.setConf("spark.sql.shuffle.partitions", "1000");

 it is the correct way right?
 In the mapPartitions task (the first task which is launched), I get
 again the same number of tasks and again the same error. :(

 Thanks a lot!

 On 19 March 2015 at 17:40, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Hi Yin,

 thanks a lot for that! Will give it a shot and let you know.

 On 19 March 2015 at 16:30, Yin Huai yh...@databricks.com wrote:

 Was the OOM thrown during the execution of first stage (map) or the
 second stage (reduce)? If it was the second stage, can you increase the
 value of spark.sql.shuffle.partitions and see if the OOM disappears?

  This setting controls the number of reducers Spark SQL will use, and
 the default is 200. Maybe there are too many distinct values and the 
 memory
 pressure on every task (of those 200 reducers) is pretty high. You can
 start with 400 and increase it until the OOM disappears. Hopefully this
 will help.

 Thanks,

 Yin


 On Wed, Mar 18, 2015 at 4:46 PM, Yiannis Gkoufas 
 johngou...@gmail.com wrote:

 Hi Yin,

 Thanks for your feedback. I have 1700 parquet files, sized 100MB
 each. The number of tasks launched is equal to the number of parquet 
 files.
 Do you have any idea on how to deal with this situation?

 Thanks a lot
 On 18 Mar 2015 17:35, Yin Huai yh...@databricks.com wrote:

 Seems there are too many distinct groups processed in a task,
 which trigger the problem.

  How many files does your dataset have and how large is each file? Seems
 your query will be executed with two stages, table scan and map-side
 aggregation in the first stage and the final round of reduce-side
 aggregation in the second stage. Can you take a look at the numbers of
 tasks launched in these two stages?

 Thanks,

 Yin

 On Wed, Mar 18, 2015 at 11:42 AM, Yiannis Gkoufas 
 johngou...@gmail.com wrote:

 Hi there, I set the executor memory to 8g but it didn't help

 On 18 March 2015 at 13:59, Cheng Lian lian.cs@gmail.com
 wrote:

 You should probably increase executor memory by setting
 spark.executor.memory.

 Full list of available configurations can be found here
 http://spark.apache.org/docs/latest/configuration.html

 Cheng


 On 3/18/15 9:15 PM, Yiannis Gkoufas wrote:

 Hi there,

 I was trying the new DataFrame API with some basic operations
 on a parquet dataset.
 I have 7 nodes of 12 cores and 8GB RAM allocated to each worker
 in a standalone cluster mode.
 The code is the following:

  val people = sqlContext.parquetFile("/data.parquet")
  val res = people.groupBy("name", "date").
    agg(sum("power"), sum("supply")).take(10)
  System.out.println(res)

 The dataset consists of 16 billion entries.
 The error I get is java.lang.OutOfMemoryError: GC overhead
 limit exceeded

 My configuration is:

 spark.serializer org.apache.spark.serializer.KryoSerializer
 

[ANNOUNCE] Announcing Spark 1.3!

2015-03-13 Thread Patrick Wendell
Hi All,

I'm happy to announce the availability of Spark 1.3.0! Spark 1.3.0 is
the fourth release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 172 developers and more
than 1,000 commits!

Visit the release notes [1] to read about the new features, or
download [2] the release today.

For errata in the contributions or release notes, please e-mail me
*directly* (not on-list).

Thanks to everyone who helped work on this release!

[1] http://spark.apache.org/releases/spark-release-1-3-0.html
[2] http://spark.apache.org/downloads.html

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to set per-user spark.local.dir?

2015-03-11 Thread Patrick Wendell
We don't support expressions or wildcards in that configuration. For
each application, the local directories need to be constant. If you
have users submitting different Spark applications, those can each set
spark.local.dir.
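
For example, a sketch of how each user could do that at submit time
(letting the shell expand $USER, since Spark won't expand it for you;
the class and jar names here are just placeholders):

  spark-submit \
    --conf spark.local.dir=/x/home/$USER/spark/tmp \
    --class com.example.MyApp myapp.jar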

- Patrick

On Wed, Mar 11, 2015 at 12:14 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:
 Hi,

 I need to set per-user spark.local.dir, how can I do that?

 I tried both

   /x/home/${user.name}/spark/tmp
 and
   /x/home/${USER}/spark/tmp

 And neither worked. Looks like it has to be a constant setting in
 spark-defaults.conf. Right?

 Any ideas how to do that?

 Thanks,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark v1.2.1 failing under BigTop build in External Flume Sink (due to missing Netty library)

2015-03-05 Thread Patrick Wendell
You may need to add the -Phadoop-2.4 profile. When building our release
packages for Hadoop 2.4 we use the following flags:

-Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn
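
So a full build command would look roughly like this (the Hadoop version
string is just an example; use yours):

  mvn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Pyarn -Phive -Phive-thriftserver \
    -DskipTests clean package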

- Patrick

On Thu, Mar 5, 2015 at 12:47 PM, Kelly, Jonathan jonat...@amazon.com wrote:
 I confirmed that this has nothing to do with BigTop by running the same mvn
 command directly in a fresh clone of the Spark package at the v1.2.1 tag.  I
 got the same exact error.


 Jonathan Kelly

 Elastic MapReduce - SDE

 Port 99 (SEA35) 08.220.C2


 From: Kelly, Jonathan Kelly jonat...@amazon.com
 Date: Thursday, March 5, 2015 at 10:39 AM
 To: user@spark.apache.org
 Subject: Spark v1.2.1 failing under BigTop build in External Flume Sink (due
 to missing Netty library)

 I'm running into an issue building Spark v1.2.1 (as well as the latest in
 branch-1.2 and v1.3.0-rc2 and the latest in branch-1.3) with BigTop (v0.9,
 which is not quite released yet).  The build fails in the External Flume
 Sink subproject with the following error:

 [INFO] Compiling 5 Scala sources and 3 Java sources to
 /workspace/workspace/bigtop.spark-rpm/build/spark/rpm/BUILD/spark-1.3.0/external/flume-sink/target/scala-2.10/classes...
 [WARNING] Class org.jboss.netty.channel.ChannelFactory not found -
 continuing with a stub.
 [ERROR] error while loading NettyServer, class file
 '/home/ec2-user/.m2/repository/org/apache/avro/avro-ipc/1.7.6/avro-ipc-1.7.6.jar(org/apache/avro/ipc/NettyServer.class)'
 is broken
 (class java.lang.NullPointerException/null)
 [WARNING] one warning found
 [ERROR] one error found

 It seems like what is happening is that the Netty library is missing at
 build time, which happens because it is explicitly excluded in the pom.xml
 (see
 https://github.com/apache/spark/blob/v1.2.1/external/flume-sink/pom.xml#L42).
 I attempted removing the exclusions and the explicit re-add for the test
 scope on lines 77-88, and that allowed the build to succeed, though I don't
 know if that will cause problems at runtime.  I don't have any experience
 with the Flume Sink, so I don't really know how to test it.  (And, to be
 clear, I'm not necessarily trying to get the Flume Sink to work-- I just
 want the project to build successfully, though of course I'd still want the
 Flume Sink to work for whomever does need it.)

 Does anybody have any idea what's going on here?  Here is the command BigTop
 is running to build Spark:

 mvn -Pbigtop-dist -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl
 -Divy.home=/home/ec2-user/.ivy2 -Dsbt.ivy.home=/home/ec2-user/.ivy2
 -Duser.home=/home/ec2-user -Drepo.maven.org=
 -Dreactor.repo=file:///home/ec2-user/.m2/repository
 -Dhadoop.version=2.4.0-amzn-3-SNAPSHOT -Dyarn.version=2.4.0-amzn-3-SNAPSHOT
 -Dprotobuf.version=2.5.0 -Dscala.version=2.10.3 -Dscala.binary.version=2.10
 -DskipTests -DrecompileMode=all install

 As I mentioned above, if I switch to the latest in branch-1.2, to
 v1.3.0-rc2, or to the latest in branch-1.3, I get the same exact error.  I
 was not getting the error with Spark v1.1.0, though there weren't any
 changes to the external/flume-sink/pom.xml between v1.1.0 and v1.2.1.


 ~ Jonathan Kelly

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is SPARK_CLASSPATH really deprecated?

2015-02-27 Thread Patrick Wendell
I think we need to just update the docs, it is a bit unclear right
now. At the time, we made it worded fairly sternly because we really
wanted people to use --jars when we deprecated SPARK_CLASSPATH. But
there are other types of deployments where there is a legitimate need
to augment the classpath of every executor.

I think it should probably say something more like

Extra classpath entries to append to the classpath of executors. This
is sometimes used in deployment environments where dependencies of
Spark are present in a specific place on all nodes.
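
For example, a sketch of such an entry in spark-defaults.conf (the jar
paths are placeholders for wherever your deployment keeps them):

  spark.executor.extraClassPath  /opt/hbase/lib/hbase-client.jar:/opt/hbase/lib/hbase-common.jar
  spark.driver.extraClassPath    /opt/hbase/lib/hbase-client.jar:/opt/hbase/lib/hbase-common.jar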

Kannan - if you want to submit a patch I can help review it.

On Thu, Feb 26, 2015 at 8:24 PM, Kannan Rajah kra...@maprtech.com wrote:
 Thanks Marcelo. Do you think it would be useful to have
 spark.executor.extraClassPath pick up some environment variable
 that can be set from spark-env.sh? Here is an example.

 spark-env.sh
 --
 executor_extra_cp = get_hbase_jars_for_cp
 export executor_extra_cp

 spark-defaults.conf
 -
 spark.executor.extraClassPath = ${executor_extra_cp}

 This will let us add logic inside the get_hbase_jars_for_cp function to pick the
 right version of the hbase jars. There could be multiple versions installed on the
 node.



 --
 Kannan

 On Thu, Feb 26, 2015 at 6:08 PM, Marcelo Vanzin van...@cloudera.com wrote:

 On Thu, Feb 26, 2015 at 5:12 PM, Kannan Rajah kra...@maprtech.com wrote:
  Also, I would like to know if there is a localization overhead when we
  use
  spark.executor.extraClassPath. Again, in the case of hbase, these jars
  would
  be typically available on all nodes. So there is no need to localize
  them
  from the node where job was submitted. I am wondering if we use the
  SPARK_CLASSPATH approach, then it would not do localization. That would
  be
  an added benefit.
  Please clarify.

 spark.executor.extraClassPath doesn't localize anything. It just
 prepends those classpath entries to the usual classpath used to launch
 the executor. There's no copying of files or anything, so they're
 expected to exist on the nodes.

 It's basically exactly the same as SPARK_CLASSPATH, but broken down to
 two options (one for the executors, and one for the driver).

 --
 Marcelo



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Add PredictionIO to Powered by Spark

2015-02-24 Thread Patrick Wendell
Added - thanks! I trimmed it down a bit to fit our normal description length.

On Mon, Jan 5, 2015 at 8:24 AM, Thomas Stone tho...@prediction.io wrote:
 Please can we add PredictionIO to
 https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

 PredictionIO
 http://prediction.io/

 PredictionIO is an open source machine learning server for software
 developers to easily build and deploy predictive applications in production.

 PredictionIO currently offers two engine templates for Apache Spark MLlib
 for recommendation (MLlib ALS) and classification (MLlib Naive Bayes). With
 these templates, you can create a custom predictive engine for production
 deployment efficiently. A standard PredictionIO stack is built on top of
 solid open source technology, such as Scala, Apache Spark, HBase and
 Elasticsearch.

 We are already featured on https://databricks.com/certified-on-spark

 Kind regards and Happy New Year!

 Thomas

 --

 This page tracks the users of Spark. To add yourself to the list, please
 email user@spark.apache.org with your organization name, URL, a list of
 which Spark components you are using, and a short description of your use
 case.


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Can you add Big Industries to the Powered by Spark page?

2015-02-24 Thread Patrick Wendell
I've added it, thanks!

On Fri, Feb 20, 2015 at 12:22 AM, Emre Sevinc emre.sev...@gmail.com wrote:

 Hello,

 Could you please add Big Industries to the Powered by Spark page at
 https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark ?


 Company Name: Big Industries

 URL: http://www.bigindustries.be/

 Spark Components: Spark Streaming

 Use Case: Big Content Platform

 Summary: The Big Content Platform is a business-to-business content asset
 management service providing a searchable, aggregated source of live news
 feeds, public domain media and archives of content.

 The platform is founded on Apache Hadoop, uses the HDFS filesystem, Apache
 Spark, Titan Distributed Graph Database, HBase, and Solr. Additionally, the
 platform leverages public datasets like Freebase, DBpedia, Wiktionary, and
 Geonames to support semantic text enrichment.



 Kind regards,

 Emre Sevinç
 http://www.bigindustries.be/


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Patrick Wendell
What happens if you submit from the master node itself on ec2 (in
client mode)? Does that work? What about in cluster mode?

It would be helpful if you could print the full command with which the
executor is being launched and failing. That might show that spark.driver.host is being
set strangely. IIRC we print the launch command before starting the
executor.

Overall the standalone cluster mode is not as well tested across
environments with asymmetric connectivity. I didn't actually realize
that akka (which the submission uses) can handle this scenario. But it
does seem like the job is submitted, it's just not starting correctly.

- Patrick

On Mon, Feb 23, 2015 at 1:13 AM, Oleg Shirokikh o...@solver.com wrote:
 Patrick,

 I haven't changed the configs much. I just executed the ec2 script to create a
 1-master, 2-slave cluster. Then I try to submit the jobs from a remote machine,
 leaving all the defaults configured by the Spark scripts. I've tried to
 change configs as suggested in other mailing-list and stack overflow threads 
 (such as setting spark.driver.host, etc...), removed (hopefully) all 
 security/firewall restrictions from AWS, etc. but it didn't help.

 I think that what you are saying is exactly the issue: on my master node UI 
 at the bottom I can see the list of Completed Drivers all with ERROR 
 state...

 Thanks,
 Oleg

 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Monday, February 23, 2015 12:59 AM
 To: Oleg Shirokikh
 Cc: user@spark.apache.org
 Subject: Re: Submitting jobs to Spark EC2 cluster remotely

 Can you list other configs that you are setting? It looks like the executor 
 can't communicate back to the driver. I'm actually not sure it's a good idea 
 to set spark.driver.host here, you want to let spark set that automatically.

 - Patrick

 On Mon, Feb 23, 2015 at 12:48 AM, Oleg Shirokikh o...@solver.com wrote:
 Dear Patrick,

 Thanks a lot for your quick response. Indeed, following your advice I've
 uploaded the jar onto S3; the FileNotFoundException is gone now and the job is
 submitted in cluster deploy mode.

 However, now both (client and cluster) fail with the following errors in 
 executors (they keep exiting/killing executors as I see in UI):

 15/02/23 08:42:46 ERROR security.UserGroupInformation:
 PriviledgedActionException as:oleg
 cause:java.util.concurrent.TimeoutException: Futures timed out after
 [30 seconds]


 Full log is:

 15/02/23 01:59:11 INFO executor.CoarseGrainedExecutorBackend:
 Registered signal handlers for [TERM, HUP, INT]
 15/02/23 01:59:12 INFO spark.SecurityManager: Changing view acls to:
 root,oleg
 15/02/23 01:59:12 INFO spark.SecurityManager: Changing modify acls to:
 root,oleg
 15/02/23 01:59:12 INFO spark.SecurityManager: SecurityManager:
 authentication disabled; ui acls disabled; users with view
 permissions: Set(root, oleg); users with modify permissions: Set(root,
 oleg)
 15/02/23 01:59:12 INFO slf4j.Slf4jLogger: Slf4jLogger started
 15/02/23 01:59:12 INFO Remoting: Starting remoting
 15/02/23 01:59:13 INFO Remoting: Remoting started; listening on
 addresses
 :[akka.tcp://driverpropsfetc...@ip-172-31-33-194.us-west-2.compute.int
 ernal:39379]
 15/02/23 01:59:13 INFO util.Utils: Successfully started service 
 'driverPropsFetcher' on port 39379.
 15/02/23 01:59:43 ERROR security.UserGroupInformation:
 PriviledgedActionException as:oleg 
 cause:java.util.concurrent.TimeoutException: Futures timed out after [30 
 seconds] Exception in thread main 
 java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
 at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
 at
 org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrai
 nedExecutorBackend.scala) Caused by:
 java.security.PrivilegedActionException: 
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
 ... 4 more
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
 [30 seconds]
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-12 Thread Patrick Wendell
The map will start with a capacity of 64, but will grow to accommodate
new data. Are you using the groupBy operator in Spark or are you using
Spark SQL's group by? This usually happens if you are grouping or
aggregating in a way that doesn't sufficiently condense the data
created from each input partition.
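
As a sketch of what I mean by condensing (the hbasePairs name is a
hypothetical stand-in for your HBase-derived pair RDD, say an
RDD[(String, Long)]):

  // combines partial sums within each partition before the shuffle, so the
  // per-task map holds roughly one entry per distinct key
  val summary = hbasePairs.reduceByKey(_ + _)

  // by contrast, groupByKey materializes every value for a key in memory
  // on the reduce side before you can sum them
  val grouped = hbasePairs.groupByKey().mapValues(_.sum)

If your aggregation can be expressed with reduceByKey / aggregateByKey
(or an equivalent Spark SQL aggregate), the per-task maps stay much smaller.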

- Patrick

On Wed, Feb 11, 2015 at 9:37 PM, fightf...@163.com fightf...@163.com wrote:
 Hi,

 Really have no adequate solution got for this issue. Expecting any available
 analytical rules or hints.

 Thanks,
 Sun.

 
 fightf...@163.com


 From: fightf...@163.com
 Date: 2015-02-09 11:56
 To: user; dev
 Subject: Re: Sort Shuffle performance issues about using AppendOnlyMap for
 large data sets
 Hi,
 Problem still exists. Any experts would take a look at this?

 Thanks,
 Sun.

 
 fightf...@163.com


 From: fightf...@163.com
 Date: 2015-02-06 17:54
 To: user; dev
 Subject: Sort Shuffle performance issues about using AppendOnlyMap for large
 data sets
 Hi, all
 Recently we hit performance issues when using Spark 1.2.0 to read
 data from HBase and do some summary work.
 Our scenario is: read large data sets from HBase (maybe 100G+ files),
 form an HBase RDD, transform it to a SchemaRDD,
 group by and aggregate the data into fewer new summary data sets,
 and load the data into HBase (Phoenix).

 Our major issue: aggregating the large datasets into summary data sets
 takes too long (1 hour+), while the
 performance should not be so bad. We got the dump file attached and a
 stacktrace from jstack like the following:

 From the stacktrace and dump file we can see that processing large
 datasets causes the AppendOnlyMap to grow frequently,
 leading to a huge map entry size. We looked at the source code of
 org.apache.spark.util.collection.AppendOnlyMap and found that
 the map is initialized with a capacity of 64. That would be too small
 for our use case.

 So the question is: has anyone encountered such issues before? How were
 they resolved? I cannot find any JIRA issues for such problems, and
 if someone has seen any, please kindly let us know.

 More specifically: is there any way for the user to configure
 the map capacity in Spark? If so, please
 tell us how to achieve that.

 Best Thanks,
 Sun.

Thread 22432: (state = IN_JAVA)
 - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87,
 line=224 (Compiled frame; information may be imprecise)
 - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable()
 @bci=1, line=38 (Interpreted frame)
 - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22,
 line=198 (Compiled frame)
 -
 org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object,
 scala.Function2) @bci=201, line=145 (Compiled frame)
 -
 org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object,
 scala.Function2) @bci=3, line=32 (Compiled frame)
 -
 org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator)
 @bci=141, line=205 (Compiled frame)
 -
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator)
 @bci=74, line=58 (Interpreted frame)
 -
 org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
 @bci=169, line=68 (Interpreted frame)
 -
 org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
 @bci=2, line=41 (Interpreted frame)
 - org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted
 frame)
 - org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196
 (Interpreted frame)
 -
 java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
 @bci=95, line=1145 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)


 Thread 22431: (state = IN_JAVA)
 - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87,
 line=224 (Compiled frame; information may be imprecise)
 - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable()
 @bci=1, line=38 (Interpreted frame)
 - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22,
 line=198 (Compiled frame)
 -
 org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object,
 scala.Function2) @bci=201, line=145 (Compiled frame)
 -
 org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object,
 scala.Function2) @bci=3, line=32 (Compiled frame)
 -
 org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator)
 @bci=141, line=205 (Compiled frame)
 -
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator)
 @bci=74, line=58 (Interpreted frame)
 -
 org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
 

Re: feeding DataFrames into predictive algorithms

2015-02-11 Thread Patrick Wendell
I think there is a minor error here in that the first example needs a
.tail after the toSeq:

df.map { row =>
  (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double]))
}.toDataFrame("label", "features")

On Wed, Feb 11, 2015 at 7:46 PM, Michael Armbrust
mich...@databricks.com wrote:
 It sounds like you probably want to do a standard Spark map, that results in
 a tuple with the structure you are looking for.  You can then just assign
 names to turn it back into a dataframe.

 Assuming the first column is your label and the rest are features you can do
 something like this:

  val df = sc.parallelize(
    (1.0, 2.3, 2.4) ::
    (1.2, 3.4, 1.2) ::
    (1.2, 2.3, 1.2) :: Nil).toDataFrame("a", "b", "c")

  df.map { row =>
    (row.getDouble(0), row.toSeq.map(_.asInstanceOf[Double]))
  }.toDataFrame("label", "features")

  df: org.apache.spark.sql.DataFrame = [label: double, features:
  array<double>]

 If you'd prefer to stick closer to SQL you can define a UDF:

  val createArray = udf((a: Double, b: Double) => Seq(a, b))
  df.select('a as 'label, createArray('b, 'c) as 'features)

  df: org.apache.spark.sql.DataFrame = [label: double, features:
  array<double>]

 We'll add createArray as a first class member of the DSL.

 Michael

 On Wed, Feb 11, 2015 at 6:37 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Hey All,

 I've been playing around with the new DataFrame and ML pipelines APIs and
  am having trouble accomplishing what seems like it should be a fairly basic
 task.

 I have a DataFrame where each column is a Double.  I'd like to turn this
 into a DataFrame with a features column and a label column that I can feed
 into a regression.

 So far all the paths I've gone down have led me to internal APIs or
 convoluted casting in and out of RDD[Row] and DataFrame.  Is there a simple
 way of accomplishing this?

 any assistance (lookin' at you Xiangrui) much appreciated,
 Sandy



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Questions about Spark standalone resource scheduler

2015-02-02 Thread Patrick Wendell
Hey Jerry,

I think standalone mode will still add more features over time, but
the goal isn't really for it to become equivalent to what Mesos/YARN
are today. Or at least, I doubt Spark Standalone will ever attempt to
manage _other_ frameworks outside of Spark and become a general
purpose resource manager.

In terms of having better support for multi tenancy, meaning multiple
*Spark* instances, this is something I think could be in scope in the
future. For instance, we added H/A to the standalone scheduler a while
back, because it let us support H/A streaming apps in a totally native
way. It's a trade off of adding new features and keeping the scheduler
very simple and easy to use. We've tended to bias towards simplicity
as the main goal, since this is something we want to be really easy
out of the box.

One thing to point out, a lot of people use the standalone mode with
some coarser grained scheduler, such as running in a cloud service. In
this case they really just want a simple inner cluster manager. This
may even be the majority of all Spark installations. This is slightly
different than Hadoop environments, where they might just want nice
integration into the existing Hadoop stack via something like YARN.

- Patrick

On Mon, Feb 2, 2015 at 12:24 AM, Shao, Saisai saisai.s...@intel.com wrote:
 Hi all,



 I have some questions about the future development of Spark's standalone
 resource scheduler. We've heard some users have the requirements to have
 multi-tenant support in standalone mode, like multi-user management,
  resource management and isolation, and whitelisting of users. It seems current Spark
  standalone does not support such functionality, while resource
  schedulers like YARN offer such advanced management. I'm not sure
  what the future target of the standalone resource scheduler is: will it only
  target a simple implementation and shift advanced usage to YARN? Or
  does it plan to add some simple multi-tenant related functionality?



 Thanks a lot for your comments.



 BR

 Jerry

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Bouncing Mails

2015-01-17 Thread Patrick Wendell
Akhil,

Those are handled by ASF infrastructure, not anyone in the Spark
project. So this list is not the appropriate place to ask for help.

- Patrick

On Sat, Jan 17, 2015 at 12:56 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
 My mails to the mailing list are getting rejected, have opened a Jira issue,
 can someone take a look at it?

 https://issues.apache.org/jira/browse/INFRA-9032






 Thanks
 Best Regards

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Accumulator value in Spark UI

2015-01-14 Thread Patrick Wendell
It should appear in the page for any stage in which accumulators are updated.
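
For example, a minimal Scala sketch (assuming an RDD[Int] called rdd):

  val negatives = sc.accumulator(0, "negatives")  // the name is what makes it show up
  rdd.foreach { x => if (x < 0) negatives += 1 }
  // After an action runs, the "negatives" accumulator appears in the
  // details page of the stages that updated it.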

On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip yipjus...@prediction.io wrote:
 Hello,

  The accumulator documentation says that if the accumulator is named, it
  will be displayed in the WebUI. However, I cannot find it anywhere.

 Do I need to specify anything in the spark ui config?

 Thanks.

 Justin

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Long-running job cleanup

2014-12-28 Thread Patrick Wendell
What do you mean when you say the overhead of Spark shuffles starts to
accumulate? Could you elaborate more?

In newer versions of Spark shuffle data is cleaned up automatically
when an RDD goes out of scope. It is safe to remove shuffle data at
this point because the RDD can no longer be referenced. If you are
seeing a large build up of shuffle data, it's possible you are
retaining references to older RDDs inadvertently. Could you explain
what your job is actually doing?
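
As a rough sketch of what I mean (all of the names here -- initialRdd,
step, numIterations -- are hypothetical stand-ins for your job):

  var current = initialRdd.cache()
  for (i <- 1 to numIterations) {
    val next = current.map(step).reduceByKey(_ + _).cache()
    next.count()          // materialize this iteration
    current.unpersist()   // drop the previous iteration's cached blocks
    current = next        // the old RDD is now unreferenced, so its shuffle
                          // files become eligible for automatic cleanup
  }

If instead you keep references to every iteration's RDDs (for example in
a growing list), the cleaner can never reclaim their shuffle output.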

- Patrick

On Mon, Dec 22, 2014 at 2:36 PM, Ganelin, Ilya
ilya.gane...@capitalone.com wrote:
 Hi all, I have a long running job iterating over a huge dataset. Parts of
 this operation are cached. Since the job runs for so long, eventually the
 overhead of spark shuffles starts to accumulate culminating in the driver
 starting to swap.

 I am aware of the spark.cleaner.ttl parameter that allows me to configure
 when cleanup happens but the issue with doing this is that it isn't done
 safely, e.g. I can be in the middle of processing a stage when this cleanup
 happens and my cached RDDs get cleared. This ultimately causes a
 KeyNotFoundException when I try to reference the now cleared cached RDD.
 This behavior doesn't make much sense to me, I would expect the cached RDD
 to either get regenerated or at the very least for there to be an option to
 execute this cleanup without deleting those RDDs.

 Is there a programmatically safe way of doing this cleanup that doesn't
 break everything?

 If I instead tear down the spark context and bring up a new context for
 every iteration (assuming that each iteration is sufficiently long-lived),
 would memory get released appropriately?

 

 The information contained in this e-mail is confidential and/or proprietary
 to Capital One and/or its affiliates. The information transmitted herewith
 is intended only for use by the individual or entity to which it is
 addressed.  If the reader of this message is not the intended recipient, you
 are hereby notified that any review, retransmission, dissemination,
 distribution, copying or other use of, or taking of any action in reliance
 upon this information is strictly prohibited. If you have received this
 communication in error, please contact the sender and delete the material
 from your computer.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: action progress in ipython notebook?

2014-12-28 Thread Patrick Wendell
Hey Eric,

I'm just curious - which specific features in 1.2 do you find most
help with usability? This is a theme we're focusing on for 1.3 as
well, so it's helpful to hear what makes a difference.

- Patrick

On Sun, Dec 28, 2014 at 1:36 AM, Eric Friedman
eric.d.fried...@gmail.com wrote:
 Hi Josh,

 Thanks for the informative answer. Sounds like one should await your changes
 in 1.3. As information, I found the following set of options for doing the
 visual in a notebook.

 http://nbviewer.ipython.org/github/ipython/ipython/blob/3607712653c66d63e0d7f13f073bde8c0f209ba8/docs/examples/notebooks/Animations_and_Progress.ipynb


 On Dec 27, 2014, at 4:07 PM, Josh Rosen rosenvi...@gmail.com wrote:

 The console progress bars are implemented on top of a new stable status
 API that was added in Spark 1.2.  It's possible to query job progress using
 this interface (in older versions of Spark, you could implement a custom
 SparkListener and maintain the counts of completed / running / failed tasks
 / stages yourself).

 There are actually several subtleties involved in implementing job-level
 progress bars which behave in an intuitive way; there's a pretty extensive
 discussion of the challenges at https://github.com/apache/spark/pull/3009.
 Also, check out the pull request for the console progress bars for an
 interesting design discussion around how they handle parallel stages:
 https://github.com/apache/spark/pull/3029.

 I'm not sure about the plumbing that would be necessary to display live
 progress updates in the IPython notebook UI, though.  The general pattern
 would probably involve a mapping to relate notebook cells to Spark jobs (you
 can do this with job groups, I think), plus some periodic timer that polls
 the driver for the status of the current job in order to update the progress
 bar.

 For Spark 1.3, I'm working on designing a REST interface to accesses this
 type of job / stage / task progress information, as well as expanding the
 types of information exposed through the stable status API interface.

 - Josh

 On Thu, Dec 25, 2014 at 10:01 AM, Eric Friedman eric.d.fried...@gmail.com
 wrote:

 Spark 1.2.0 is SO much more usable than previous releases -- many thanks
 to the team for this release.

 A question about progress of actions.  I can see how things are
 progressing using the Spark UI.  I can also see the nice ASCII art animation
 on the spark driver console.

 Has anyone come up with a way to accomplish something similar in an
 iPython notebook using pyspark?

 Thanks
 Eric



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
Is it sufficient to set spark.hadoop.validateOutputSpecs to false?

http://spark.apache.org/docs/latest/configuration.html
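
For example, a minimal sketch (the input/output paths are placeholders,
and the master URL is assumed to come from spark-submit):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("overwrite-example")
    .set("spark.hadoop.validateOutputSpecs", "false")
  val sc = new SparkContext(conf)
  // With the check disabled, the save proceeds even if the directory exists.
  sc.textFile("hdfs:///data/input").saveAsTextFile("hdfs:///data/output")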

- Patrick

On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote:
 Hi,



  We have a requirement to save RDD output to HDFS with a saveAsTextFile-
  like API, but we need to overwrite the data if it exists. I'm not sure if current
  Spark supports such an operation, or whether I need to check this manually?



  There's a thread on the mailing list that discussed this
  (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html),
  but I'm not sure whether this feature is enabled or not, or whether it needs some configuration?



 Appreciate your suggestions.



 Thanks a lot

 Jerry

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
So the behavior of overwriting existing directories IMO is something
we don't want to encourage. The reason why the Hadoop client has these
checks is that it's very easy for users to do unsafe things without
them. For instance, a user could overwrite an RDD that had 100
partitions with an RDD that has 10 partitions... and if they read back
the RDD they would get a corrupted RDD that has a combination of data
from the old and new RDD.

If users want to circumvent these safety checks, we need to make them
explicitly disable them. Given this, I think a config option is as
reasonable as any alternatives. This is already pretty easy IMO.

- Patrick

On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao hao.ch...@intel.com wrote:
 I am wondering if we can provide a more friendly API, rather than a configuration,
 for this purpose. What do you think Patrick?

 Cheng Hao

 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Thursday, December 25, 2014 3:22 PM
 To: Shao, Saisai
 Cc: user@spark.apache.org; d...@spark.apache.org
 Subject: Re: Question on saveAsTextFile with overwrite option

 Is it sufficient to set spark.hadoop.validateOutputSpecs to false?

 http://spark.apache.org/docs/latest/configuration.html

 - Patrick

 On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote:
 Hi,



 We have such requirements to save RDD output to HDFS with
 saveAsTextFile like API, but need to overwrite the data if existed.
 I'm not sure if current Spark support such kind of operations, or I need to 
 check this manually?



 There's a thread in mailing list discussed about this
 (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Sp
 ark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html),
 I'm not sure this feature is enabled or not, or with some configurations?



 Appreciate your suggestions.



 Thanks a lot

 Jerry

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
 commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Announcing Spark 1.2!

2014-12-19 Thread Patrick Wendell
I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is
the third release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 172 developers and more
than 1,000 commits!

This release brings operational and performance improvements in Spark
core, including a new network transport subsystem designed for very
large shuffles. Spark SQL introduces an API for external data sources
along with Hive 13 support, dynamic partitioning, and the
fixed-precision decimal type. MLlib adds a new pipeline-oriented
package (spark.ml) for composing multiple algorithms. Spark Streaming
adds a Python API and a write ahead log for fault tolerance. Finally,
GraphX has graduated from alpha and introduces a stable API along with
performance improvements.

Visit the release notes [1] to read about the new features, or
download [2] the release today.

For errata in the contributions or release notes, please e-mail me
*directly* (not on-list).

Thanks to everyone involved in creating, testing, and documenting this release!

[1] http://spark.apache.org/releases/spark-release-1-2-0.html
[2] http://spark.apache.org/downloads.html

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark streaming kafa best practices ?

2014-12-17 Thread Patrick Wendell
Foreach is slightly more efficient because Spark doesn't bother to try
and collect results from each task, since it's understood there will be
no return value. I think the difference is very marginal though - it's
mostly stylistic... typically you use foreach for something that is
intended to produce a side effect and map for something that will
return a new dataset.
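
A minimal sketch of the two styles, reusing the kafkaStream and process() names
from the code quoted below (both are placeholders from the original question):

  // Side effect only: foreach returns nothing to the driver.
  kafkaStream.foreachRDD { rdd =>
    rdd.foreach(event => process(event))
  }

  // Transformation: map when you want a new dataset of results
  // (a transformed DStream still needs an output operation to run).
  val processed = kafkaStream.map(event => process(event))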

On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas gerard.m...@gmail.com wrote:
 Patrick,

 I was wondering why one would choose for rdd.map vs rdd.foreach to execute a
 side-effecting function on an RDD.

 -kr, Gerard.

 On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell pwend...@gmail.com wrote:

 The second choice is better. Once you call collect() you are pulling
 all of the data onto a single node, you want to do most of the
 processing  in parallel on the cluster, which is what map() will do.
 Ideally you'd try to summarize the data or reduce it before calling
 collect().

 On Fri, Dec 5, 2014 at 5:26 AM, david david...@free.fr wrote:
  hi,
 
   What is the best way to process a batch window in Spark Streaming:
 
  kafkaStream.foreachRDD(rdd => {
    rdd.collect().foreach(event => {
      // process the event
      process(event)
    })
  })
 
 
  Or
 
  kafkaStream.foreachRDD(rdd => {
    rdd.map(event => {
      // process the event
      process(event)
    }).collect()
  })
 
 
  thank's
 
 
 
  --
  View this message in context:
  http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-kafa-best-practices-tp20470.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Server - How to implement

2014-12-12 Thread Patrick Wendell
Hey Manoj,

One proposal potentially of interest is the Spark Kernel project from
IBM - you should look at their JIRA (linked below). The interface in that project is more
of a remote REPL interface, i.e. you submit commands (as strings)
and get back results (as strings), but you don't have direct
programmatic access to state like in the JobServer. Not sure if this
is what you need.

https://issues.apache.org/jira/browse/SPARK-4605

This type of higher level execution context is something we've
generally defined to be outside of scope for the core Spark
distribution because they can be cleanly built on the stable API, and
from what I've seen of different applications that build on Spark, the
requirements are fairly different for different applications. I'm
guessing that in the next year we'll see a handful of community
projects pop up around providing various types of execution services
for spark apps.

- Patrick

On Fri, Dec 12, 2014 at 10:06 AM, Manoj Samel manojsamelt...@gmail.com wrote:
 Thanks Marcelo.

 Spark Gurus/Databricks team - do you have something in roadmap for such a
 spark server ?

 Thanks,

 On Thu, Dec 11, 2014 at 5:43 PM, Marcelo Vanzin van...@cloudera.com wrote:

 Oops, sorry, fat fingers.

 We've been playing with something like that inside Hive:
 https://github.com/apache/hive/tree/spark/spark-client

 That seems to have at least a few of the characteristics you're
 looking for; but it's a very young project, and at this moment we're
 not developing it as a public API, but mostly for internal Hive use.
 It can give you a few ideas, though. Also, SPARK-3215.


 On Thu, Dec 11, 2014 at 5:41 PM, Marcelo Vanzin van...@cloudera.com
 wrote:
  Hi Manoj,
 
  I'm not aware of any public projects that do something like that,
  except for the Ooyala server which you say doesn't cover your needs.
 
  We've been playing with something like that inside Hive, though:
 
  On Thu, Dec 11, 2014 at 5:33 PM, Manoj Samel manojsamelt...@gmail.com
  wrote:
  Hi,
 
  If spark based services are to be exposed as a continuously available
  server, what are the options?
 
  * The API exposed to client will be proprietary and fine grained (RPC
  style
  ..), not a Job level API
  * The client API need not be SQL so the Thrift JDBC server does not
  seem to
  be option .. but I could be wrong here ...
  * Ooyala implementation is a REST API for job submission, but as
  mentioned
  above; the desired API is a finer grain API, not a job submission
 
  Any existing implementation?
 
  Is it build your own server? Any thoughts on approach to use ?
 
  Thanks,
 
 
 
 
 
 
 
  --
  Marcelo



 --
 Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Stateful mapPartitions

2014-12-05 Thread Patrick Wendell
Yeah the main way to do this would be to have your own static cache of
connections. These could be using an object in Scala or just a static
variable in Java (for instance a set of connections that you can
borrow from).

- Patrick
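
A minimal Scala sketch of the static connection cache idea (the JDBC URL,
myRdd, and lookup() are illustrative placeholders):

  import java.sql.{Connection, DriverManager}
  import java.util.concurrent.ConcurrentLinkedQueue

  // One pool per executor JVM; connections are created lazily and reused across tasks.
  object ConnectionPool {
    private val pool = new ConcurrentLinkedQueue[Connection]()
    def borrow(): Connection =
      Option(pool.poll()).getOrElse(DriverManager.getConnection("jdbc:..."))  // placeholder URL
    def give(c: Connection): Unit = pool.offer(c)
  }

  val result = myRdd.mapPartitions { iter =>
    val conn = ConnectionPool.borrow()
    // Materialize before returning the connection -- the iterator is consumed lazily.
    val out = iter.map(row => lookup(conn, row)).toVector
    ConnectionPool.give(conn)
    out.iterator
  }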

On Thu, Dec 4, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote:
 Hi,

 On Fri, Dec 5, 2014 at 3:56 AM, Akshat Aranya aara...@gmail.com wrote:

 Is it possible to have some state across multiple calls to mapPartitions
 on each partition, for instance, if I want to keep a database connection
 open?


 If you're using Scala, you can use a singleton object, this will exist once
 per JVM (i.e., once per executor), like

 object DatabaseConnector {
   lazy val conn = ...
 }

 Please be aware that shutting down the connection is much harder than
 opening it, because you basically have no idea when processing is done for
 an executor, AFAIK.

 Tobias


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Exception adding resource files in latest Spark

2014-12-04 Thread Patrick Wendell
Thanks for flagging this. I reverted the relevant YARN fix in the Spark
1.2 release. We can try to debug this in master.

On Thu, Dec 4, 2014 at 9:51 PM, Jianshi Huang jianshi.hu...@gmail.com wrote:
 I created a ticket for this:

   https://issues.apache.org/jira/browse/SPARK-4757


 Jianshi

 On Fri, Dec 5, 2014 at 1:31 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Correction:

 According to Liancheng, this hotfix might be the root cause:


 https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce

 Jianshi

 On Fri, Dec 5, 2014 at 12:45 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Looks like the datanucleus*.jar shouldn't appear in the hdfs path in
 Yarn-client mode.

 Maybe this patch broke yarn-client.


 https://github.com/apache/spark/commit/a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53

 Jianshi

 On Fri, Dec 5, 2014 at 12:02 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Actually my HADOOP_CLASSPATH has already been set to include
 /etc/hadoop/conf/*

 export
 HADOOP_CLASSPATH=/etc/hbase/conf/hbase-site.xml:/usr/lib/hbase/lib/hbase-protocol.jar:$(hbase
 classpath)

 Jianshi

 On Fri, Dec 5, 2014 at 11:54 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Looks like somehow Spark failed to find the core-site.xml in
 /et/hadoop/conf

 I've already set the following env variables:

 export YARN_CONF_DIR=/etc/hadoop/conf
 export HADOOP_CONF_DIR=/etc/hadoop/conf
 export HBASE_CONF_DIR=/etc/hbase/conf

 Should I put $HADOOP_CONF_DIR/* to HADOOP_CLASSPATH?

 Jianshi

 On Fri, Dec 5, 2014 at 11:37 AM, Jianshi Huang
 jianshi.hu...@gmail.com wrote:

 I got the following error during Spark startup (Yarn-client mode):

 14/12/04 19:33:58 INFO Client: Uploading resource
 file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar
 -
 hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar
 java.lang.IllegalArgumentException: Wrong FS:
 hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar,
 expected: file:///
 at
 org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
 at
 org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:79)
 at
 org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:506)
 at
 org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
 at
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
 at
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
 at
 org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67)
 at
 org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:257)
 at
 org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:242)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:242)
 at
 org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:35)
 at
 org.apache.spark.deploy.yarn.ClientBase$class.createContainerLaunchContext(ClientBase.scala:350)
 at
 org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:35)
 at
 org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:80)
 at
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
 at
 org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:140)
 at
 org.apache.spark.SparkContext.init(SparkContext.scala:335)
 at
 org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986)
 at $iwC$$iwC.init(console:9)
 at $iwC.init(console:18)
 at init(console:20)
 at .init(console:24)

 I'm using latest Spark built from master HEAD yesterday. Is this a
 bug?

 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-30 Thread Patrick Wendell
Thanks Judy. While this is not directly caused by a Spark issue, it is
likely other users will run into this. This is an unfortunate
consequence of the way that we've shaded Guava in this release: we
rely on byte-code shading of Hadoop itself as well, and if the user
has their own Hadoop classes present it can cause issues.

On Sun, Nov 30, 2014 at 10:53 PM, Judy Nash
judyn...@exchange.microsoft.com wrote:
 Thanks Patrick and Cheng for the suggestions.

 The issue was Hadoop common jar was added to a classpath. After I removed 
 Hadoop common jar from both master and slave, I was able to bypass the error.
 This was caused by a local change, so no impact on the 1.2 release.
 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Wednesday, November 26, 2014 8:17 AM
 To: Judy Nash
 Cc: Denny Lee; Cheng Lian; u...@spark.incubator.apache.org
 Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on 
 Guava

 Just to double check - I looked at our own assembly jar and I confirmed that 
 our Hadoop configuration class does use the correctly shaded version of 
 Guava. My best guess here is that somehow a separate Hadoop library is ending 
up on the classpath, possibly because Spark put it there somehow.

 tar xvzf spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar
 cd org/apache/hadoop/
 javap -v Configuration | grep Precond

 Warning: Binary file Configuration contains 
 org.apache.hadoop.conf.Configuration

#497 = Utf8   org/spark-project/guava/common/base/Preconditions

#498 = Class  #497 //
 org/spark-project/guava/common/base/Preconditions

#502 = Methodref  #498.#501//
 org/spark-project/guava/common/base/Preconditions.checkArgument:(ZLjava/lang/Object;)V

 12: invokestatic  #502// Method
 org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V

 50: invokestatic  #502// Method
 org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V

 On Wed, Nov 26, 2014 at 11:08 AM, Patrick Wendell pwend...@gmail.com wrote:
 Hi Judy,

 Are you somehow modifying Spark's classpath to include jars from
 Hadoop and Hive that you have running on the machine? The issue seems
 to be that you are somehow including a version of Hadoop that
 references the original guava package. The Hadoop that is bundled in
 the Spark jars should not do this.

 - Patrick

 On Wed, Nov 26, 2014 at 1:45 AM, Judy Nash
 judyn...@exchange.microsoft.com wrote:
 Looks like a config issue. I ran spark-pi job and still failing with
 the same guava error

 Command ran:

 .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class
 org.apache.spark.examples.SparkPi --master spark://headnodehost:7077
 --executor-memory 1G --num-executors 1
 .\lib\spark-examples-1.2.1-SNAPSHOT-hadoop2.4.0.jar 100



 Had used the same build steps on spark 1.1 and had no issue.



 From: Denny Lee [mailto:denny.g@gmail.com]
 Sent: Tuesday, November 25, 2014 5:47 PM
 To: Judy Nash; Cheng Lian; u...@spark.incubator.apache.org


 Subject: Re: latest Spark 1.2 thrift server fail with
 NoClassDefFoundError on Guava



 To determine if this is a Windows vs. other configuration, can you
 just try to call the Spark-class.cmd SparkSubmit without actually
 referencing the Hadoop or Thrift server classes?





 On Tue Nov 25 2014 at 5:42:09 PM Judy Nash
 judyn...@exchange.microsoft.com
 wrote:

 I traced the code and used the following to call:

 Spark-class.cmd org.apache.spark.deploy.SparkSubmit --class
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
 spark-internal --hiveconf hive.server2.thrift.port=1



 The issue ended up to be much more fundamental however. Spark doesn't
 work at all in configuration below. When open spark-shell, it fails
 with the same ClassNotFound error.

 Now I wonder if this is a windows-only issue or the hive/Hadoop
 configuration that is having this problem.



 From: Cheng Lian [mailto:lian.cs@gmail.com]
 Sent: Tuesday, November 25, 2014 1:50 AM


 To: Judy Nash; u...@spark.incubator.apache.org
 Subject: Re: latest Spark 1.2 thrift server fail with
 NoClassDefFoundError on Guava



 Oh so you're using Windows. What command are you using to start the
 Thrift server then?

 On 11/25/14 4:25 PM, Judy Nash wrote:

 Made progress but still blocked.

 After recompiling the code on cmd instead of PowerShell, now I can
 see all 5 classes as you mentioned.

 However I am still seeing the same error as before. Anything else I
 can check for?



 From: Judy Nash [mailto:judyn...@exchange.microsoft.com]
 Sent: Monday, November 24, 2014 11:50 PM
 To: Cheng Lian; u...@spark.incubator.apache.org
 Subject: RE: latest Spark 1.2 thrift server fail with
 NoClassDefFoundError on Guava



 This is what I got from jar tf:

 org/spark-project/guava/common/base/Preconditions.class

 org/spark-project/guava/common/math/MathPreconditions.class

Re: Opening Spark on IntelliJ IDEA

2014-11-29 Thread Patrick Wendell
I recently posted instructions on loading Spark in Intellij from scratch:

https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-BuildingSparkinIntelliJIDEA

You need to do a few extra steps for the YARN project to work.

Also, for questions like this that relate to developing Spark, you can
e-mail the dev@ list as well.

On Sat, Nov 29, 2014 at 12:15 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 Restarting the intellij multiple times and adding/changing project's SDK
 would resolve the issue.

 Thanks
 Best Regards

 On Thu, Nov 27, 2014 at 11:30 AM, Taeyun Kim taeyun@innowireless.com
 wrote:

 Hi,



 I'm trying to open the Spark source code with IntelliJ IDEA.

 I opened pom.xml on the Spark source code root directory.

 Project tree is displayed in the Project tool window.

 But, when I open a source file, say
 org.apache.spark.deploy.yarn.ClientBase.scala, a lot of red marks shows on
 the editor scroll bar.

 It is the 'Cannot resolve symbol' error. Even it cannot resolve
 StringOps.format.

 How can it be fixed?



 The versions I'm using are as follows:

 - OS: Windows 7

 - IntelliJ IDEA: 13.1.6

 - Scala plugin: 0.41.2

 - Spark source code: 1.1.1 (with a few file modified by me)



 I've tried to fix this and error state changed somewhat, but eventually I
 gave up fixing it on my own (with googling) and deleted .idea folder and
 started over. So now I'm seeing the errors described above.



 Thank you.





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-26 Thread Patrick Wendell
Hi Judy,

Are you somehow modifying Spark's classpath to include jars from
Hadoop and Hive that you have running on the machine? The issue seems
to be that you are somehow including a version of Hadoop that
references the original guava package. The Hadoop that is bundled in
the Spark jars should not do this.

- Patrick

On Wed, Nov 26, 2014 at 1:45 AM, Judy Nash
judyn...@exchange.microsoft.com wrote:
 Looks like a config issue. I ran spark-pi job and still failing with the
 same guava error

 Command ran:

 .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class
 org.apache.spark.examples.SparkPi --master spark://headnodehost:7077
 --executor-memory 1G --num-executors 1
 .\lib\spark-examples-1.2.1-SNAPSHOT-hadoop2.4.0.jar 100



 Had used the same build steps on spark 1.1 and had no issue.



 From: Denny Lee [mailto:denny.g@gmail.com]
 Sent: Tuesday, November 25, 2014 5:47 PM
 To: Judy Nash; Cheng Lian; u...@spark.incubator.apache.org


 Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError
 on Guava



 To determine if this is a Windows vs. other configuration, can you just try
 to call the Spark-class.cmd SparkSubmit without actually referencing the
 Hadoop or Thrift server classes?





 On Tue Nov 25 2014 at 5:42:09 PM Judy Nash judyn...@exchange.microsoft.com
 wrote:

 I traced the code and used the following to call:

 Spark-class.cmd org.apache.spark.deploy.SparkSubmit --class
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal
 --hiveconf hive.server2.thrift.port=1



 The issue ended up to be much more fundamental however. Spark doesn't work
 at all in configuration below. When open spark-shell, it fails with the same
 ClassNotFound error.

 Now I wonder if this is a windows-only issue or the hive/Hadoop
 configuration that is having this problem.



 From: Cheng Lian [mailto:lian.cs@gmail.com]
 Sent: Tuesday, November 25, 2014 1:50 AM


 To: Judy Nash; u...@spark.incubator.apache.org
 Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError
 on Guava



 Oh so you're using Windows. What command are you using to start the Thrift
 server then?

 On 11/25/14 4:25 PM, Judy Nash wrote:

 Made progress but still blocked.

 After recompiling the code on cmd instead of PowerShell, now I can see all 5
 classes as you mentioned.

 However I am still seeing the same error as before. Anything else I can
 check for?



 From: Judy Nash [mailto:judyn...@exchange.microsoft.com]
 Sent: Monday, November 24, 2014 11:50 PM
 To: Cheng Lian; u...@spark.incubator.apache.org
 Subject: RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError
 on Guava



 This is what I got from jar tf:

 org/spark-project/guava/common/base/Preconditions.class

 org/spark-project/guava/common/math/MathPreconditions.class

 com/clearspring/analytics/util/Preconditions.class

 parquet/Preconditions.class



 I seem to have the line that reported missing, but I am missing this file:

 com/google/inject/internal/util/$Preconditions.class



 Any suggestion on how to fix this?

 Very much appreciate the help as I am very new to Spark and open source
 technologies.



 From: Cheng Lian [mailto:lian.cs@gmail.com]
 Sent: Monday, November 24, 2014 8:24 PM
 To: Judy Nash; u...@spark.incubator.apache.org
 Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError
 on Guava



 Hm, I tried exactly the same commit and the build command locally, but
 couldn't reproduce this.

 Usually this kind of errors are caused by classpath misconfiguration. Could
 you please try this to ensure corresponding Guava classes are included in
 the assembly jar you built?

 jar tf
 assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop2.4.0.jar |
 grep Preconditions

 On my machine I got these lines (the first line is the one reported as
 missing in your case):

 org/spark-project/guava/common/base/Preconditions.class

 org/spark-project/guava/common/math/MathPreconditions.class

 com/clearspring/analytics/util/Preconditions.class

 parquet/Preconditions.class

 com/google/inject/internal/util/$Preconditions.class

 On 11/25/14 6:25 AM, Judy Nash wrote:

 Thank you Cheng for responding.


 Here is the commit SHA1 on the 1.2 branch I saw this failure in:

 commit 6f70e0295572e3037660004797040e026e440dbd

 Author: zsxwing zsxw...@gmail.com

 Date:   Fri Nov 21 00:42:43 2014 -0800



 [SPARK-4472][Shell] Print Spark context available as sc. only when
 SparkContext is created...



 ... successfully



 It's weird that printing Spark context available as sc when creating
 SparkContext unsuccessfully.



 Let me know if you need anything else.



 From: Cheng Lian [mailto:lian.cs@gmail.com]
 Sent: Friday, November 21, 2014 8:02 PM
 To: Judy Nash; u...@spark.incubator.apache.org
 Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError
 on Guava



 Hi Judy, could you please provide the commit SHA1 of the version you're
 

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-26 Thread Patrick Wendell
Just to double check - I looked at our own assembly jar and I
confirmed that our Hadoop configuration class does use the correctly
shaded version of Guava. My best guess here is that somehow a separate
Hadoop library is ending up on the classpath, possibly because Spark
put it there somehow.

 tar xvzf spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar
 cd org/apache/hadoop/
 javap -v Configuration | grep Precond

Warning: Binary file Configuration contains org.apache.hadoop.conf.Configuration

   #497 = Utf8   org/spark-project/guava/common/base/Preconditions

   #498 = Class  #497 //
org/spark-project/guava/common/base/Preconditions

   #502 = Methodref  #498.#501//
org/spark-project/guava/common/base/Preconditions.checkArgument:(ZLjava/lang/Object;)V

12: invokestatic  #502// Method
org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V

50: invokestatic  #502// Method
org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V

On Wed, Nov 26, 2014 at 11:08 AM, Patrick Wendell pwend...@gmail.com wrote:
 Hi Judy,

 Are you somehow modifying Spark's classpath to include jars from
 Hadoop and Hive that you have running on the machine? The issue seems
 to be that you are somehow including a version of Hadoop that
 references the original guava package. The Hadoop that is bundled in
 the Spark jars should not do this.

 - Patrick

 On Wed, Nov 26, 2014 at 1:45 AM, Judy Nash
 judyn...@exchange.microsoft.com wrote:
 Looks like a config issue. I ran spark-pi job and still failing with the
 same guava error

 Command ran:

 .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class
 org.apache.spark.examples.SparkPi --master spark://headnodehost:7077
 --executor-memory 1G --num-executors 1
 .\lib\spark-examples-1.2.1-SNAPSHOT-hadoop2.4.0.jar 100



 Had used the same build steps on spark 1.1 and had no issue.



 From: Denny Lee [mailto:denny.g@gmail.com]
 Sent: Tuesday, November 25, 2014 5:47 PM
 To: Judy Nash; Cheng Lian; u...@spark.incubator.apache.org


 Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError
 on Guava



 To determine if this is a Windows vs. other configuration, can you just try
 to call the Spark-class.cmd SparkSubmit without actually referencing the
 Hadoop or Thrift server classes?





 On Tue Nov 25 2014 at 5:42:09 PM Judy Nash judyn...@exchange.microsoft.com
 wrote:

 I traced the code and used the following to call:

 Spark-class.cmd org.apache.spark.deploy.SparkSubmit --class
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal
 --hiveconf hive.server2.thrift.port=1



 The issue ended up to be much more fundamental however. Spark doesn't work
 at all in configuration below. When open spark-shell, it fails with the same
 ClassNotFound error.

 Now I wonder if this is a windows-only issue or the hive/Hadoop
 configuration that is having this problem.



 From: Cheng Lian [mailto:lian.cs@gmail.com]
 Sent: Tuesday, November 25, 2014 1:50 AM


 To: Judy Nash; u...@spark.incubator.apache.org
 Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError
 on Guava



 Oh so you're using Windows. What command are you using to start the Thrift
 server then?

 On 11/25/14 4:25 PM, Judy Nash wrote:

 Made progress but still blocked.

 After recompiling the code on cmd instead of PowerShell, now I can see all 5
 classes as you mentioned.

 However I am still seeing the same error as before. Anything else I can
 check for?



 From: Judy Nash [mailto:judyn...@exchange.microsoft.com]
 Sent: Monday, November 24, 2014 11:50 PM
 To: Cheng Lian; u...@spark.incubator.apache.org
 Subject: RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError
 on Guava



 This is what I got from jar tf:

 org/spark-project/guava/common/base/Preconditions.class

 org/spark-project/guava/common/math/MathPreconditions.class

 com/clearspring/analytics/util/Preconditions.class

 parquet/Preconditions.class



 I seem to have the line that reported missing, but I am missing this file:

 com/google/inject/internal/util/$Preconditions.class



 Any suggestion on how to fix this?

 Very much appreciate the help as I am very new to Spark and open source
 technologies.



 From: Cheng Lian [mailto:lian.cs@gmail.com]
 Sent: Monday, November 24, 2014 8:24 PM
 To: Judy Nash; u...@spark.incubator.apache.org
 Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError
 on Guava



 Hm, I tried exactly the same commit and the build command locally, but
 couldn't reproduce this.

 Usually this kind of errors are caused by classpath misconfiguration. Could
 you please try this to ensure corresponding Guava classes are included in
 the assembly jar you built?

 jar tf
 assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop2.4.0.jar |
 grep Preconditions

 On my machine I got

Re: toLocalIterator in Spark 1.0.0

2014-11-13 Thread Patrick Wendell
It looks like you are trying to directly import the toLocalIterator
function. You can't import functions; it should just appear as a
method on an existing RDD if you have one.

- Patrick
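
For reference, a minimal sketch (assuming an existing SparkContext `sc`, e.g.
in spark-shell):

  val rdd = sc.parallelize(1 to 10, 2)
  // toLocalIterator is a method on the RDD itself; it streams one partition
  // at a time back to the driver instead of collecting everything at once.
  val it: Iterator[Int] = rdd.toLocalIterator
  it.foreach(println)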

On Thu, Nov 13, 2014 at 10:21 PM, Deep Pradhan
pradhandeep1...@gmail.com wrote:
 Hi,

 I am using Spark 1.0.0 and Scala 2.10.3.

 I want to use toLocalIterator in a code but the spark shell tells

 not found: value toLocalIterator

 I also did import org.apache.spark.rdd but even after this the shell tells

 object toLocalIterator is not a member of package org.apache.spark.rdd

 Can anyone help me in this?

 Thank You

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Still struggling with building documentation

2014-11-11 Thread Patrick Wendell
The doc build appears to be broken in master. We'll get it patched up
before the release:

https://issues.apache.org/jira/browse/SPARK-4326

On Tue, Nov 11, 2014 at 10:50 AM, Alessandro Baretta
alexbare...@gmail.com wrote:
 Nicholas and Patrick,

 Thanks for your help, but, no, it still does not work. The latest master
 produces the following scaladoc errors:

 [error]
 /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55:
 not found: type Type
 [error]   protected Type type() { return Type.UPLOAD_BLOCK; }
 [error] ^
 [error]
 /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:39:
 not found: type Type
 [error]   protected Type type() { return Type.STREAM_HANDLE; }
 [error] ^
 [error]
 /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40:
 not found: type Type
 [error]   protected Type type() { return Type.OPEN_BLOCKS; }
 [error] ^
 [error]
 /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44:
 not found: type Type
 [error]   protected Type type() { return Type.REGISTER_EXECUTOR; }
 [error] ^

 ...

 [error] four errors found
 [error] (spark/javaunidoc:doc) javadoc returned nonzero exit code
 [error] (spark/scalaunidoc:doc) Scaladoc generation failed
 [error] Total time: 140 s, completed Nov 11, 2014 10:20:53 AM
 Moving back into docs dir.
 Making directory api/scala
 cp -r ../target/scala-2.10/unidoc/. api/scala
 Making directory api/java
 cp -r ../target/javaunidoc/. api/java
 Moving to python/docs directory and building sphinx.
 Makefile:14: *** The 'sphinx-build' command was not found. Make sure you
 have Sphinx installed, then set the SPHINXBUILD environment variable to
 point to the full path of the 'sphinx-build' executable. Alternatively you
 can add the directory with the executable to your PATH. If you don't have
 Sphinx installed, grab it from http://sphinx-doc.org/.  Stop.

 Moving back into home dir.
 Making directory api/python
 cp -r python/docs/_build/html/. docs/api/python
 /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or directory -
 python/docs/_build/html/. (Errno::ENOENT)
 from /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `block in fu_each_src_dest'
 from /usr/lib/ruby/1.9.1/fileutils.rb:1529:in `fu_each_src_dest0'
 from /usr/lib/ruby/1.9.1/fileutils.rb:1513:in `fu_each_src_dest'
 from /usr/lib/ruby/1.9.1/fileutils.rb:436:in `cp_r'
 from /home/alex/git/spark/docs/_plugins/copy_api_dirs.rb:79:in `top
 (required)'
 from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
 from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
 from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:76:in `block in setup'
 from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `each'
 from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `setup'
 from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:30:in `initialize'
 from /usr/bin/jekyll:224:in `new'
 from /usr/bin/jekyll:224:in `main'

 What next?

 Alex




 On Fri, Nov 7, 2014 at 12:54 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:

 I believe the web docs need to be built separately according to the
 instructions here.

 Did you give those a shot?

 It's annoying to have a separate thing with new dependencies in order to
 build the web docs, but that's how it is at the moment.

 Nick

 On Fri, Nov 7, 2014 at 3:39 PM, Alessandro Baretta alexbare...@gmail.com
 wrote:

 I finally came to realize that there is a special maven target to build
 the scaladocs, although arguably a very unintuitive on: mvn verify. So now I
 have scaladocs for each package, but not for the whole spark project.
 Specifically, build/docs/api/scala/index.html is missing. Indeed the whole
 build/docs/api directory referenced in api.html is missing. How do I build
 it?

 Alex Baretta




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark and Play

2014-11-11 Thread Patrick Wendell
Hi There,

Because Akka versions are not binary compatible with one another, it
might not be possible to integrate Play with Spark 1.1.0.

- Patrick

On Tue, Nov 11, 2014 at 8:21 AM, Akshat Aranya aara...@gmail.com wrote:
 Hi,

 Sorry if this has been asked before; I didn't find a satisfactory answer
 when searching.  How can I integrate a Play application with Spark?  I'm
 getting into issues of akka-actor versions.  Play 2.2.x uses akka-actor 2.0,
 whereas Play 2.3.x uses akka-actor 2.3.4, neither of which work fine with
 Spark 1.1.0.  Is there something I should do with libraryDependencies in my
 build.sbt to make it work?

 Thanks,
 Akshat

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Support Hive 0.13 .1 in Spark SQL

2014-10-28 Thread Patrick Wendell
Hey Cheng,

Right now we aren't using stable APIs to communicate with the Hive
metastore. We didn't want to drop support for Hive 0.12, so we are
using a shim layer to support compiling for both 0.12 and 0.13. This
is very costly to maintain.

If Hive has a stable meta-data API for talking to a Metastore, we
should use that (is HCatalog sufficient for this purpose?). Ideally we
would be able to talk to multiple versions of the Hive metastore while
keeping a single internal version of Hive for our use of SerDes, etc.

I've created SPARK-4114 for this:
https://issues.apache.org/jira/browse/SPARK-4114

This is a very important issue for Spark SQL, so I'd welcome comments
on that JIRA from anyone who is familiar with Hive/HCatalog internals.

- Patrick

On Mon, Oct 27, 2014 at 9:54 PM, Cheng, Hao hao.ch...@intel.com wrote:
 Hi, all

I have some PRs blocked by hive upgrading (e.g.
 https://github.com/apache/spark/pull/2570), the problem is some internal
 hive method signature changed, it's hard to make the compatible in code
 level (sql/hive) when switching back/forth the Hive versions.



   I guess the motivation of the upgrading is to support the Metastore with
 different Hive versions. So, how about just keep the metastore related hive
 jars upgrading or utilize the HCatalog directly? And of course we can either
 leaving hive-exec.jar hive-cli.jar etc as 0.12 or upgrade to 0.13.1, but not
 support them both.



 Sorry if I missed some discussion of Hive upgrading.



 Cheng Hao

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Ending a job early

2014-10-28 Thread Patrick Wendell
Hey Jim,

There are some experimental (unstable) APIs that support running jobs
which might short-circuit:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1126

This can be used for doing online aggregations like you are
describing. And in one or two cases we've exposed functions that rely
on this:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L334

I would expect more robust support for online aggregation to show up
in a future version of Spark.

- Patrick
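
As one concrete illustration, countApprox is among the functions already
exposed on top of this API; a minimal sketch (assuming an existing SparkContext
`sc`; the timeout and confidence values are illustrative):

  val data = sc.parallelize(1 to 1000000, 100)
  // Returns after roughly the timeout with a confidence interval rather than an
  // exact count, even if not all partitions have finished.
  val partial = data.countApprox(timeout = 2000L, confidence = 0.95)
  println(partial.initialValue)  // a BoundedDouble with mean / low / high bounds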

On Tue, Oct 28, 2014 at 7:27 AM, Jim Carroll jimfcarr...@gmail.com wrote:

 We have some very large datasets where the calculation converge on a result.
 Our current implementation allows us to track how quickly the calculations
 are converging and end the processing early. This can significantly speed up
 some of our processing.

 Is there a way to do the same thing is spark?

 A trivial example might be a column average on a dataset. As we're
 'aggregating' rows into columnar averages I can track how fast these
 averages are moving and decide to stop after a low percentage of the rows
 has been processed, producing an estimate rather than an exact value.

 Within a partition, or better yet, within a worker across 'reduce' steps, is
 there a way to stop all of the aggregations and just continue on with
 reduces of already processed data?

 Thanks
 JIm




 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Ending-a-job-early-tp17505.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: scalac crash when compiling DataTypeConversions.scala

2014-10-23 Thread Patrick Wendell
Hey Ryan,

I've found that filing issues with the Scala/Typesafe JIRA is pretty
helpful if the issue can be fully reproduced, and even sometimes
helpful if it can't. You can file bugs here:

https://issues.scala-lang.org/secure/Dashboard.jspa

The Spark SQL code in particular is typically the source of these, as
we use more fancy scala features. In a pinch it is also possible to
recompile and test code without building SQL if you just run tests for
the specific module (e.g. streaming). In sbt this sort of just
works:

sbt/sbt streaming/test-only a.b.c*

In maven it's more clunky but if you do a mvn install first then (I
think) you can test sub-modules independently:

mvn test -pl streaming ...

- Patrick

On Wed, Oct 22, 2014 at 10:00 PM, Ryan Williams
ryan.blake.willi...@gmail.com wrote:
 I started building Spark / running Spark tests this weekend and on maybe
 5-10 occasions have run into a compiler crash while compiling
 DataTypeConversions.scala.

 Here is a full gist of an innocuous test command (mvn test
 -Dsuites='*KafkaStreamSuite') exhibiting this behavior. Problem starts on
 L512 and there's a final stack trace at the bottom.

 mvn clean or ./sbt/sbt clean fix it (I believe I've observed the issue
 while compiling with each tool), but are annoying/time-consuming to do,
 obvs, and it's happening pretty frequently for me when doing only small
 numbers of incremental compiles punctuated by e.g. checking out different
 git commits.

 Have other people seen this? This post on this list is basically the same
 error, but in TestSQLContext.scala and this SO post claims to be hitting it
 when trying to build in intellij.

 It seems likely to be a bug in scalac; would finding a consistent repro case
 and filing it somewhere be useful?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: About Memory usage in the Spark UI

2014-10-23 Thread Patrick Wendell
It shows the amount of memory used to store RDD blocks, which are created
when you run .cache()/.persist() on an RDD.
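
A minimal sketch of what populates that column (assuming an existing
SparkContext `sc`):

  val cached = sc.parallelize(1 to 1000000).cache()
  cached.count()  // the action materializes the blocks; "Memory Used" grows on the Storage page
  // Without cache()/persist(), the same count() leaves "Memory Used" unchanged.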

On Wed, Oct 22, 2014 at 10:07 PM, Haopu Wang hw...@qilinsoft.com wrote:

  Hi, please take a look at the attached screen-shot. I wonders what's the
 Memory Used column mean.



 I give 2GB memory to the driver process and 12GB memory to the executor
 process.



 Thank you!






Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-08 Thread Patrick Wendell
IIRC - the random is seeded with the index, so it will always produce
the same result for the same index. Maybe I don't totally follow
though. Could you give a small example of how this might change the
RDD ordering in a way that you don't expect? In general repartition()
will not preserve the ordering of an RDD.

On Wed, Oct 8, 2014 at 3:42 PM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
 I noticed that repartition will result in non-deterministic lineage because
 it'll result in changed orders for rows.

 So for instance, if you do things like:

 val data = read(...)
 val k = data.repartition(5)
 val h = k.repartition(5)

 It seems that this results in different ordering of rows for 'k' each time
 you call it.
 And because of this different ordering, 'h' will result in different
 partitions even, because 'repartition' distributes through a random number
 generator with the 'index' as the key.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: sparksql connect remote hive cluster

2014-10-08 Thread Patrick Wendell
Spark will need to connect both to the hive metastore and to all HDFS
nodes (NN and DN's). If that is all in place then it should work. In
this case it looks like maybe it can't connect to a datanode in HDFS
to get the raw data. Keep in mind that the performance might not be
very good if you are trying to read large amounts of data over the
network.

On Wed, Oct 8, 2014 at 5:33 AM, jamborta jambo...@gmail.com wrote:
 Hi all,

 just wondering if is it possible to allow spark to connect to hive on
 another cluster located remotely?

 I have set up hive-site.xml and amended the hive-metastore URI, and also opened
 the port for zookeeper, webhdfs and hive metastore.

 It seems it connects to hive, then it fails with the following:

 org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
 BP-1886934195-100.73.212.101-1411645855947:blk_1073763904_23146
 file=/user/tja01/datasets/00ab46fa4d6711e4afb70003ff41ebbf/part-3

 not sure if some of the ports are not open or it needs access to additional
 things.

 thanks,




 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/sparksql-connect-remote-hive-cluster-tp15928.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spot instances on Amazon EMR

2014-09-18 Thread Patrick Wendell
Hey Grzegorz,

EMR is a service that is not maintained by the Spark community. So
this list isn't the right place to ask EMR questions.

- Patrick

On Thu, Sep 18, 2014 at 3:19 AM, Grzegorz Białek
grzegorz.bia...@codilime.com wrote:
 Hi,
 I would like to run Spark application on Amazon EMR. I have some questions
 about that:
 1. I have input data on other hdfs (not on Amazon). Can I send all input
 data from that cluster to HDFS on Amazon EMR cluster (if it has enough
 storage memory) or do I have send it to Amazon S3 storage and then load this
 data on EMR cluster where I want to run my application?
 2. Which nodes should be on-demand instances and which can be spot instances
 (I don't want to spend to much money but I also lost my data or have to
 recompute everything after spot instance interruption)?
 3. Can I use Amazon S3 storage for input and output data to have less
 on-demand instances and more spot instances? (Or maybe there is another
 solution to lower costs)

 I would like to run this application once and computation would take around
 30h I think.

 Could you answer on (at least some of) this questions?

 Thanks,
 Grzegorz

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: partitioned groupBy

2014-09-17 Thread Patrick Wendell
If you'd like to re-use the resulting inverted map, you can persist the result:

x = myRdd.mapPartitions(/* create inverted map */).persist()

Your function would create the reverse map and then return an iterator
over the keys in that map.
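
A minimal self-contained sketch of that per-partition inversion (assuming an
existing SparkContext `sc`; the sample data mirrors the example quoted below):

  import scala.collection.mutable

  val pairs = sc.parallelize(
    Seq("K1" -> Seq("V1", "V2"), "K2" -> Seq("V2"), "K3" -> Seq("V1"), "K4" -> Seq("V3")), 2)

  val inverted = pairs.mapPartitions { iter =>
    // Build the inverse map for this partition only -- no shuffle involved.
    val inverse = mutable.HashMap.empty[String, mutable.ArrayBuffer[String]]
    for ((k, vs) <- iter; v <- vs) inverse.getOrElseUpdate(v, mutable.ArrayBuffer.empty[String]) += k
    inverse.iterator.map { case (v, ks) => (v, ks.toSeq) }
  }.persist()  // persist if the inverted map will be reused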


On Wed, Sep 17, 2014 at 1:04 PM, Akshat Aranya aara...@gmail.com wrote:
 Patrick,

 If I understand this correctly, I won't be able to do this in the closure
 provided to mapPartitions() because that's going to be stateless, in the
 sense that a hash map that I create within the closure would only be useful
 for one call of MapPartitionsRDD.compute().  I guess I would need to
 override mapPartitions() directly within my RDD.  Right?

 On Tue, Sep 16, 2014 at 4:57 PM, Patrick Wendell pwend...@gmail.com wrote:

 If each partition can fit in memory, you can do this using
 mapPartitions and then building an inverse mapping within each
 partition. You'd need to construct a hash map within each partition
 yourself.

 On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote:
  I have a use case where my RDD is set up such:
 
  Partition 0:
  K1 -> [V1, V2]
  K2 -> [V2]

  Partition 1:
  K3 -> [V1]
  K4 -> [V3]
 
  I want to invert this RDD, but only within a partition, so that the
  operation does not require a shuffle.  It doesn't matter if the
  partitions
  of the inverted RDD have non unique keys across the partitions, for
  example:
 
  Partition 0:
  V1 -> [K1]
  V2 -> [K1, K2]

  Partition 1:
  V1 -> [K3]
  V3 -> [K4]
 
  Is there a way to do only a per-partition groupBy, instead of shuffling
  the
  entire data?
 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: partitioned groupBy

2014-09-16 Thread Patrick Wendell
If each partition can fit in memory, you can do this using
mapPartitions and then building an inverse mapping within each
partition. You'd need to construct a hash map within each partition
yourself.

On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote:
 I have a use case where my RDD is set up such:

 Partition 0:
 K1 -> [V1, V2]
 K2 -> [V2]

 Partition 1:
 K3 -> [V1]
 K4 -> [V3]

 I want to invert this RDD, but only within a partition, so that the
 operation does not require a shuffle.  It doesn't matter if the partitions
 of the inverted RDD have non unique keys across the partitions, for example:

 Partition 0:
 V1 -> [K1]
 V2 -> [K1, K2]

 Partition 1:
 V1 -> [K3]
 V3 -> [K4]

 Is there a way to do only a per-partition groupBy, instead of shuffling the
 entire data?


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-1.1.0 with make-distribution.sh problem

2014-09-14 Thread Patrick Wendell
Yeah, that issue has been fixed by adding better docs; it just didn't make
it in time for the release:

https://github.com/apache/spark/blob/branch-1.1/make-distribution.sh#L54


On Thu, Sep 11, 2014 at 11:57 PM, Zhanfeng Huo huozhanf...@gmail.com
wrote:

 resolved:

 ./make-distribution.sh --name spark-hadoop-2.3.0 --tgz --with-tachyon
 -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests

 This code is a bit misleading


 --
 Zhanfeng Huo


 *From:* Zhanfeng Huo huozhanf...@gmail.com
 *Date:* 2014-09-12 14:13
 *To:* user user@spark.apache.org
 *Subject:* spark-1.1.0 with make-distribution.sh problem
 Hi,

 I compile spark with cmd  bash -x make-distribution.sh -Pyarn -Phive
 --skip-java-test --with-tachyon --tgz -Pyarn.version=2.3.0
 -Phadoop.version=2.3.0, it errors.

 How to use it correct?

message:
 + set -o pipefail
 + set -e
 +++ dirname make-distribution.sh
 ++ cd .
 ++ pwd
 + FWDIR=/home/syn/spark/spark-1.1.0
 + DISTDIR=/home/syn/spark/spark-1.1.0/dist
 + SPARK_TACHYON=false
 + MAKE_TGZ=false
 + NAME=none
 + (( 7 ))
 + case $1 in
 + break
 + '[' -z /home/syn/usr/jdk1.7.0_55 ']'
 + '[' -z /home/syn/usr/jdk1.7.0_55 ']'
 + which git
 ++ git rev-parse --short HEAD
 + GITREV=5f6f219
 + '[' '!' -z 5f6f219 ']'
 + GITREVSTRING=' (git revision 5f6f219)'
 + unset GITREV
 + which mvn
 ++ mvn help:evaluate -Dexpression=project.version
 ++ grep -v INFO
 ++ tail -n 1
 + VERSION=1.1.0
 ++ mvn help:evaluate -Dexpression=hadoop.version -Pyarn -Phive
 --skip-java-test --with-tachyon --tgz -Pyarn.version=2.3.0
 -Phadoop.version=2.3.0
 ++ grep -v INFO
 ++ tail -n 1
 + SPARK_HADOOP_VERSION=' -X,--debug Produce execution debug output'

 Best Regards
 --
 Zhanfeng Huo




Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Patrick Wendell
[moving to user@]

This would typically be accomplished with a union() operation. You
can't mutate an RDD in-place, but you can create a new RDD with a
union() which is an inexpensive operator.
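
A minimal sketch (assuming an existing SparkContext `sc`; the two small batches
stand in for the hourly data):

  val cachedBatch = sc.parallelize(1 to 100).cache()
  cachedBatch.count()                          // materialize the cached blocks
  val newBatch   = sc.parallelize(101 to 200)  // the next (uncached) batch
  // union() just records both lineages -- it does not recompute or un-cache the old batch.
  val combined = cachedBatch.union(newBatch)
  combined.count()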

On Fri, Sep 12, 2014 at 5:28 AM, Archit Thakur
archit279tha...@gmail.com wrote:
 Hi,

 We have a use case where we are planning to keep sparkcontext alive in a
 server and run queries on it. But the issue is we have  a continuous
 flowing data the comes in batches of constant duration(say, 1hour). Now we
 want to exploit the schemaRDD and its benefits of columnar caching and
 compression. Is there a way I can append the new batch (uncached) to the
 older(cached) batch without losing the older data from cache and caching
 the whole dataset.

 Thanks and Regards,


 Archit Thakur.
 Sr Software Developer,
 Guavus, Inc.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.1.0: Cannot load main class from JAR

2014-09-12 Thread Patrick Wendell
Hey SK,

Yeah, the documented format is the same (we expect users to add the
jar at the end) but the old spark-submit had a bug where it would
actually accept inputs that did not match the documented format. Sorry
if this was difficult to find!

- Patrick
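
For example, the documented form places the application JAR after all options
(class name, master, and paths are illustrative):

  ./bin/spark-submit --class com.example.MyApp --master local[4] /path/to/my-app.jar arg1 arg2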

On Fri, Sep 12, 2014 at 1:50 PM, SK skrishna...@gmail.com wrote:
 This issue is resolved. Looks like in the new spark-submit, the jar path has
 to be at the end of the options. Earlier I could specify this path in any
 order on the command line.

 thanks



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-0-Cannot-load-main-class-from-JAR-tp14123p14124.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Announcing Spark 1.1.0!

2014-09-11 Thread Patrick Wendell
I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is
the second release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 171 developers!

This release brings operational and performance improvements in Spark
core including a new implementation of the Spark shuffle designed for
very large scale workloads. Spark 1.1 adds significant extensions to
the newest Spark modules, MLlib and Spark SQL. Spark SQL introduces a
JDBC server, byte code generation for fast expression evaluation, a
public types API, JSON support, and other features and optimizations.
MLlib introduces a new statistics library along with several new
algorithms and optimizations. Spark 1.1 also builds out Spark's Python
support and adds new components to the Spark Streaming module.

Visit the release notes [1] to read about the new features, or
download [2] the release today.

[1] http://spark.eu.apache.org/releases/spark-release-1-1-0.html
[2] http://spark.eu.apache.org/downloads.html

NOTE: SOME ASF DOWNLOAD MIRRORS WILL NOT CONTAIN THE RELEASE FOR SEVERAL HOURS.

Please e-mail me directly for any type-o's in the release notes or name listing.

Thanks, and congratulations!
- Patrick

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Deployment model popularity - Standard vs. YARN vs. Mesos vs. SIMR

2014-09-07 Thread Patrick Wendell
I would say that the first three are all used pretty heavily. Mesos
was the first one supported (long ago), standalone is the simplest and
most popular today, and YARN is newer but growing a lot
in activity.

SIMR is not used as much... it was designed mostly for environments
where users had access to Hadoop and couldn't easily install Spark.
Now most Hadoop vendors bundle Spark anyways, so it's not needed.

On Sun, Sep 7, 2014 at 6:29 PM, Otis Gospodnetic
otis.gospodne...@gmail.com wrote:
 Hi,

 I'm trying to determine which Spark deployment models are the most popular -
 Standalone, YARN, Mesos, or SIMR.  Anyone knows?

 I thought I'm use search-hadoop.com to help me figure this out and this is
 what I found:


 1) Standalone
 http://search-hadoop.com/?q=standalonefc_project=Sparkfc_type=mail+_hash_+user
 (seems the most popular?)

 2) YARN
  http://search-hadoop.com/?q=yarnfc_project=Sparkfc_type=mail+_hash_+user
 (almost as popular as standalone?)

 3) Mesos
 http://search-hadoop.com/?q=mesosfc_project=Sparkfc_type=mail+_hash_+user
 (less popular than yarn or standalone)

 4) SIMR
 http://search-hadoop.com/?q=simrfc_project=Sparkfc_type=mail+_hash_+user
 (no mentions?)

 This is obviously not very accurate but is the order right?

 Thanks,
 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr  Elasticsearch Support * http://sematext.com/


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: memory size for caching RDD

2014-09-03 Thread Patrick Wendell
Changing this is not supported; it is immutable, similar to other Spark
configuration settings.
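
A minimal sketch of setting it before the context starts, which is the only
point at which it takes effect (the value and master are illustrative):

  import org.apache.spark.{SparkConf, SparkContext}

  // The fraction is read once, when the context (and its executors) start.
  val conf = new SparkConf()
    .setAppName("cache-sizing")
    .setMaster("local[2]")
    .set("spark.storage.memoryFraction", "0.5")
  val sc = new SparkContext(conf)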

On Wed, Sep 3, 2014 at 8:13 PM, 牛兆捷 nzjem...@gmail.com wrote:
 Dear all:

 Spark uses memory to cache RDD and the memory size is specified by
 spark.storage.memoryFraction.

 One the Executor starts, does Spark support adjusting/resizing memory size
 of this part dynamically?

 Thanks.

 --
 *Regards,*
 *Zhaojie*

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Patrick Wendell
Yeah - each batch will produce a new RDD.

On Wed, Aug 27, 2014 at 3:33 PM, Soumitra Kumar
kumar.soumi...@gmail.com wrote:
 Thanks.

 Just to double check, rdd.id would be unique for a batch in a DStream?


 On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote:

 You can use RDD id as the seed, which is unique in the same spark
 context. Suppose none of the RDDs would contain more than 1 billion
 records. Then you can use

  rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid)

 Just a hack ..

 On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar
 kumar.soumi...@gmail.com wrote:
  So, I guess zipWithUniqueId will be similar.
 
  Is there a way to get unique index?
 
 
  On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote:
 
  No. The indices start at 0 for every RDD. -Xiangrui
 
  On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
  kumar.soumi...@gmail.com wrote:
   Hello,
  
   If I do:
  
   DStream transform {
   rdd.zipWithIndex.map {
  
   Is the index guaranteed to be unique across all RDDs here?
  
   }
   }
  
   Thanks,
   -Soumitra.
 
 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Submit to the Powered By Spark Page!

2014-08-26 Thread Patrick Wendell
Hi All,

I want to invite users to submit to the Spark Powered By page. This page
is a great way for people to learn about Spark use cases. Since Spark
activity has increased a lot in the higher level libraries and people often
ask who uses each one, we'll include information about which components
each organization uses as well. If you are interested, simply respond to
this e-mail (or e-mail me off-list) with:

1) Organization name
2) URL
3) Which Spark components you use: Core, SQL, Streaming, MLlib, GraphX
4) A 1-2 sentence description of your use case.

I'll post any new entries here:
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

- Patrick


Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-25 Thread Patrick Wendell
Hey Andrew,

We might create a new JIRA for it, but it doesn't exist yet. We'll create
JIRA's for the major 1.2 issues at the beginning of September.

- Patrick


On Mon, Aug 25, 2014 at 8:53 AM, Andrew Ash and...@andrewash.com wrote:

 Hi Patrick,

 For the spilling within on key work you mention might land in Spark 1.2,
 is that being tracked in https://issues.apache.org/jira/browse/SPARK-1823
 or is there another ticket I should be following?

 Thanks!
 Andrew


 On Tue, Aug 5, 2014 at 3:39 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 Hi Jens,

 Within a partition things will spill - so the current documentation is
 correct. This spilling can only occur *across keys* at the moment. Spilling
 cannot occur within a key at present.

 This is discussed in the video here:

 https://www.youtube.com/watch?v=dmL0N3qfSc8index=3list=PL-x35fyliRwj7qNxXLgMRJaOk7o9inHBZ

 Spilling within one key for GroupBy's is likely to end up in the next
 release of Spark, Spark 1.2. In most cases we see when users hit this, they
 are actually trying to just do aggregations which would be more efficiently
 implemented without the groupBy operator.

 If the goal is literally to just write out to disk all the values
 associated with each group, and the values associated with a single group
 are larger than fit in memory, this cannot be accomplished right now with
 the groupBy operator.

 The best way to work around this depends a bit on what you are trying to
 do with the data down stream. Typically approaches involve sub-dividing any
 very large groups, for instance, appending a hashed value in a small range
 (1-10) to large keys. Then your downstream code has to deal with
 aggregating partial values for each group. If your goal is just to lay each
 group out sequentially on disk on one big file, you can call `sortByKey`
 with a hashed suffix as well. The sort functions are externalized in Spark
 1.1 (which is in pre-release).

 - Patrick


 On Tue, Aug 5, 2014 at 2:39 PM, Jens Kristian Geyti sp...@jkg.dk wrote:

 Patrick Wendell wrote
  In the latest version of Spark we've added documentation to make this
  distinction more clear to users:
 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390

 That is a very good addition to the documentation. Nice and clear about
 the
 dangers of groupBy.


 Patrick Wendell wrote
  Currently groupBy requires that all
  of the values for one key can fit in memory.

 Is that really true? Will partitions not spill to disk, hence the
 recommendation in the documentation to up the parallelism of groupBy et
 al?

 A better question might be: How exactly does partitioning affect groupBy
 with regards to memory consumption. What will **have** to fit in memory,
 and
 what may be spilled to disk, if running out of memory?

 And if it really is true, that Spark requires all groups' values to fit
 in
 memory, how do I do an on-disk grouping of results, similar to what I'd
 do
 in a Hadoop job by using a mapper emitting (groupId, value) key-value
 pairs,
 and having an identity reducer writing results to disk?




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11487.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Advantage of using cache()

2014-08-23 Thread Patrick Wendell
Yep - that's correct. As an optimization we save the shuffle output and
re-use it if you execute a stage twice. So this can make A:B tests like
this a bit confusing.

- Patrick

On Friday, August 22, 2014, Nieyuan qiushuiwuh...@gmail.com wrote:

 Because map-reduce tasks like join will save shuffle data to disk . So the
 only diffrence with caching or no-caching version is :
.map { case (x, (n, i)) => (x, n)}



 -
 Thanks,
 Nieyuan
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Advantage-of-using-cache-tp12480p12634.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org javascript:;
 For additional commands, e-mail: user-h...@spark.apache.org javascript:;




Re: Advantage of using cache()

2014-08-20 Thread Patrick Wendell
Your rdd2 and rdd3 differ in two ways so it's hard to track the exact
effect of caching. In rdd3, in addition to the fact that rdd will be
cached, you are also doing a bunch of extra random number generation. So it
will be hard to isolate the effect of caching.
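
A rough sketch of a more controlled comparison, where the only difference
between the two runs is the cache() call (the data and the deliberately
wasteful map are made up for illustration, and `sc` is assumed to be an
existing SparkContext):

  val base = sc.parallelize(1 to 1000000, 100)
    .map(x => (x % 100, (1 to 200).map(i => i * x.toLong).sum))  // some busy work
    .reduceByKey(_ + _)

  base.cache()   // comment this line out for the uncached run
  base.count()   // first action: computes (and, if enabled, caches) base
  base.count()   // time this one: it should be much faster with cache()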


On Wed, Aug 20, 2014 at 7:48 AM, Grzegorz Białek 
grzegorz.bia...@codilime.com wrote:

 Hi,

 I tried to write small program which shows that using cache() can speed up
 execution but results with and without cache were similar. Could help me
 with this issue? I tried to compute rdd and use it later in two places and
 I thought in second usage this rdd is recomputed but it doesn't:

   val help = sc.parallelize(Array.range(1, 2)).repartition(100)
 .map(x => (scala.util.Random.nextInt(10), x))
   val rdd = sc.parallelize(Array.range(1,2))
 .repartition(100)
 .map(x => (scala.util.Random.nextInt(10), x))
 .join(help)
 .map { case (x, (n, i)) => (x, n)}
 .reduceByKey(_ + _)
 .cache()

   val rdd2 = sc.parallelize(Array.range(1,1000)).map(x => (x, x))
 .join(rdd).saveAsTextFile("output/1")
   val rdd3 = sc.parallelize(Array.range(1,1000)).map(x =>
 (scala.util.Random.nextInt(1000), x))
 .join(rdd).saveAsTextFile("output/2")

 Thanks,
 Grzegorz



Re: Broadcast vs simple variable

2014-08-20 Thread Patrick Wendell
For large objects, it will be more efficient to broadcast it. If your array
is small it won't really matter. How many centers do you have? Unless you
are finding that you have very large tasks (and Spark will print a warning
about this), it could be okay to just reference it directly.
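
A minimal sketch of the two options (the centers, points, and distance helper
below are made-up stand-ins, and `sc` is assumed to be an existing
SparkContext):

  val centers = Array(Array(0.0, 0.0), Array(5.0, 5.0))
  val points  = sc.parallelize(Seq(Array(0.1, 0.2), Array(4.9, 5.1)))

  def nearest(p: Array[Double], cs: Array[Array[Double]]): Int =
    cs.indices.minBy(i => cs(i).zip(p).map { case (c, x) => (c - x) * (c - x) }.sum)

  // Option 1: reference the array directly; it is shipped inside every task's closure.
  val assigned1 = points.map(p => nearest(p, centers))

  // Option 2: broadcast it once per executor; better as the array grows.
  val bcCenters = sc.broadcast(centers)
  val assigned2 = points.map(p => nearest(p, bcCenters.value))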


On Wed, Aug 20, 2014 at 1:18 AM, Julien Naour julna...@gmail.com wrote:

 Hi,

 I have a question about broadcast. I'm working on a clustering algorithm
 close to KMeans. It seems that KMeans broadcast clusters centers at each
 step. For the moment I just use my centers as Array that I call directly in
 my map at each step. Could it be more efficient to use broadcast instead of
 simple variable?

 Cheers,

 Julien Naour



Re: Web UI doesn't show some stages

2014-08-20 Thread Patrick Wendell
The reason is that some operators get pipelined into a single stage.
rdd.map(XX).filter(YY) - this executes in a single stage since there is no
data movement needed in between these operations.

If you call toDebugString on the final RDD it will give you some
information about the exact lineage. In Spark 1.1 this will return
information about stage boundaries as well.
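
For example, something along these lines (a sketch; the exact output format of
toDebugString differs between Spark versions):

  val rdd = sc.parallelize(1 to 100)
    .map(x => (x % 10, x))
    .filter { case (k, _) => k > 2 }   // map + filter are pipelined into one stage
    .reduceByKey(_ + _)                // the shuffle here starts a new stage

  println(rdd.toDebugString)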


On Wed, Aug 20, 2014 at 4:22 AM, Grzegorz Białek 
grzegorz.bia...@codilime.com wrote:

 Hi,

 I am wondering why in web UI some stages (like join, filter) are not
 visible. For example this code:

 val simple = sc.parallelize(Array.range(0,100))
 val simple2 = sc.parallelize(Array.range(0,100))

   val toJoin = simple.map(x => (x, x.toString + x.toString))
   val rdd = simple2
 .map(x => (scala.util.Random.nextInt(100), x))
 .join(toJoin)
 .map { case (r, (x, s)) => (r, x)}
 .reduceByKey(_ + _)
 .sortByKey()
 .cache()
   rdd.saveAsTextFile("output/1")

   val rdd2 = toJoin
 .groupBy{ case (x, _) => x}
 .filter{ case (x, _) => x < 10}
   rdd2.saveAsTextFile("output/2")

   println(rdd2.join(toJoin).count())

 in UI doesn't show join and filter stages and moreover it shows sortByKey
 and reduceByKey twice.
 Could anyone explain how it works?

 Thanks,
 Grzegorz



Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-05 Thread Patrick Wendell
Hi Jens,

Within a partition things will spill - so the current documentation is
correct. This spilling can only occur *across keys* at the moment. Spilling
cannot occur within a key at present.

This is discussed in the video here:
https://www.youtube.com/watch?v=dmL0N3qfSc8index=3list=PL-x35fyliRwj7qNxXLgMRJaOk7o9inHBZ

Spilling within one key for GroupBy's is likely to end up in the next
release of Spark, Spark 1.2. In most cases we see when users hit this, they
are actually trying to just do aggregations which would be more efficiently
implemented without the groupBy operator.

If the goal is literally to just write out to disk all the values
associated with each group, and the values associated with a single group
are larger than fit in memory, this cannot be accomplished right now with
the groupBy operator.

The best way to work around this depends a bit on what you are trying to do
with the data down stream. Typically approaches involve sub-dividing any
very large groups, for instance, appending a hashed value in a small range
(1-10) to large keys. Then your downstream code has to deal with
aggregating partial values for each group. If your goal is just to lay each
group out sequentially on disk on one big file, you can call `sortByKey`
with a hashed suffix as well. The sort functions are externalized in Spark
1.1 (which is in pre-release).
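
A rough sketch of the key-salting idea described above (the pair RDD and the
salt range of 10 are made up for illustration; downstream code still has to
merge the partial results per original key):

  val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))   // stand-in pair RDD
  val salted = data.map { case (k, v) => ((k, scala.util.Random.nextInt(10)), v) }

  // Each sub-group now holds roughly 1/10th of the values for a large key.
  val partialGroups = salted.groupByKey()

  // Or, to lay the (sub-)groups out sequentially on disk instead:
  salted.sortByKey().saveAsTextFile("output/grouped")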

- Patrick


On Tue, Aug 5, 2014 at 2:39 PM, Jens Kristian Geyti sp...@jkg.dk wrote:

 Patrick Wendell wrote
  In the latest version of Spark we've added documentation to make this
  distinction more clear to users:
 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390

 That is a very good addition to the documentation. Nice and clear about the
 dangers of groupBy.


 Patrick Wendell wrote
  Currently groupBy requires that all
  of the values for one key can fit in memory.

 Is that really true? Will partitions not spill to disk, hence the
 recommendation in the documentation to up the parallelism of groupBy et al?

 A better question might be: How exactly does partitioning affect groupBy
 with regards to memory consumption. What will **have** to fit in memory,
 and
 what may be spilled to disk, if running out of memory?

 And if it really is true, that Spark requires all groups' values to fit in
 memory, how do I do an on-disk grouping of results, similar to what I'd do
 in a Hadoop job by using a mapper emitting (groupId, value) key-value
 pairs,
 and having an identity reducer writing results to disk?




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11487.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-04 Thread Patrick Wendell
It seems possible that you are running out of memory unrolling a single
partition of the RDD. This is something that can cause your executor to
OOM, especially if the cache is close to being full so the executor doesn't
have much free memory left. How large are your executors? At the time of
failure, is the cache already nearly full?

I also believe the Snappy compression codec in Hadoop is not splittable.
This means that each of your JSON files is read in its entirety as one
spark partition. If you have files that are larger than the standard block
size (128MB), it will exacerbate this shortcoming of Spark. Incidentally,
this means minPartitions won't help you at all here.

This is fixed in the master branch and will be fixed in Spark 1.1. As a
debugging step (if this is doable), it's worth running this job on the
master branch and seeing if it succeeds.

https://github.com/apache/spark/pull/1165

A (potential) workaround would be to first persist your data to disk, then
re-partition it, then cache it. I'm not 100% sure whether that will work
though.

 val a =
sc.textFile("s3n://some-path/*.json").persist(DISK_ONLY).repartition(<larger
nr of partitions>).cache()

- Patrick


On Fri, Aug 1, 2014 at 10:17 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 On Fri, Aug 1, 2014 at 12:39 PM, Sean Owen so...@cloudera.com wrote:

 Isn't this your worker running out of its memory for computations,
 rather than for caching RDDs?

 I'm not sure how to interpret the stack trace, but let's say that's true.
 I'm even seeing this with a simple a = sc.textFile().cache() and then
 a.count(). Spark shouldn't need that much memory for this kind of work,
 no?

 then the answer is that you should tell
 it to use less memory for caching.

 I can try that. That's done by changing spark.storage.memoryFraction,
 right?

 This still seems strange though. The default fraction of the JVM left for
 non-cache activity (1 - 0.6 = 40%
 http://spark.apache.org/docs/latest/configuration.html#execution-behavior)
 should be plenty for just counting elements. I'm using m1.xlarge nodes
 that have 15GB of memory apiece.

 Nick
 



Re: What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-04 Thread Patrick Wendell
BTW - the reason why the workaround could help is because when persisting
to DISK_ONLY, we explicitly avoid materializing the RDD partition in
memory... we just pass it through to disk


On Mon, Aug 4, 2014 at 1:10 AM, Patrick Wendell pwend...@gmail.com wrote:

 It seems possible that you are running out of memory unrolling a single
 partition of the RDD. This is something that can cause your executor to
 OOM, especially if the cache is close to being full so the executor doesn't
 have much free memory left. How large are your executors? At the time of
 failure, is the cache already nearly full?

 I also believe the Snappy compression codec in Hadoop is not splittable.
 This means that each of your JSON files is read in its entirety as one
 spark partition. If you have files that are larger than the standard block
 size (128MB), it will exacerbate this shortcoming of Spark. Incidentally,
 this means minPartitions won't help you at all here.

 This is fixed in the master branch and will be fixed in Spark 1.1. As a
 debugging step (if this is doable), it's worth running this job on the
 master branch and seeing if it succeeds.

 https://github.com/apache/spark/pull/1165

 A (potential) workaround would be to first persist your data to disk, then
 re-partition it, then cache it. I'm not 100% sure whether that will work
 though.

  val a =
 sc.textFile("s3n://some-path/*.json").persist(DISK_ONLY).repartition(<larger
 nr of partitions>).cache()

 - Patrick


 On Fri, Aug 1, 2014 at 10:17 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 On Fri, Aug 1, 2014 at 12:39 PM, Sean Owen so...@cloudera.com wrote:

 Isn't this your worker running out of its memory for computations,
 rather than for caching RDDs?

 I'm not sure how to interpret the stack trace, but let's say that's true.
 I'm even seeing this with a simple a = sc.textFile().cache() and then
 a.count(). Spark shouldn't need that much memory for this kind of work,
 no?

 then the answer is that you should tell
 it to use less memory for caching.

 I can try that. That's done by changing spark.storage.memoryFraction,
 right?

 This still seems strange though. The default fraction of the JVM left for
 non-cache activity (1 - 0.6 = 40%
 http://spark.apache.org/docs/latest/configuration.html#execution-behavior)
 should be plenty for just counting elements. I'm using m1.xlarge nodes
 that have 15GB of memory apiece.

 Nick





Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Patrick Wendell
For hortonworks, I believe it should work to just link against the
corresponding upstream version. I.e. just set the Hadoop version to 2.4.0

Does that work?
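
For example, one way this is often done against the 1.0.x build scripts (a
sketch; the exact flags and environment variables may differ in other Spark
versions):

  # sbt build
  SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true sbt/sbt assembly

  # or via the distribution script
  ./make-distribution.sh --hadoop 2.4.0 --with-yarn --tgz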

- Patrick


On Mon, Aug 4, 2014 at 12:13 AM, Ron's Yahoo! zlgonza...@yahoo.com.invalid
wrote:

 Hi,
   Not sure whose issue this is, but if I run make-distribution using HDP
 2.4.0.2.1.3.0-563 as the hadoop version (replacing it in
 make-distribution.sh), I get a strange error with the exception below. If I
 use a slightly older version of HDP (2.4.0.2.1.2.0-402) with
 make-distribution, using the generated assembly all works fine for me.
 Either 1.0.0 or 1.0.1 will work fine.

   Should I file a JIRA or is this a known issue?

 Thanks,
 Ron

 Exception in thread main org.apache.spark.SparkException: Job aborted
 due to stage failure: Task 0.0:0 failed 1 times, most recent failure:
 Exception failure in TID 0 on host localhost:
 java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(
 AvroKeyInputFormat.java:47)
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(
 NewHadoopRDD.scala:111)
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111
 )
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(
 Executor.scala:187)
 java.util.concurrent.ThreadPoolExecutor.runWorker(
 ThreadPoolExecutor.java:1145)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(
 ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)



Re: Cached RDD Block Size - Uneven Distribution

2014-08-04 Thread Patrick Wendell
Are you directly caching files from Hadoop or are you doing some
transformation on them first? If you are doing a groupBy or some type of
transformation, then you could be causing data skew that way.


On Sun, Aug 3, 2014 at 1:19 PM, iramaraju iramar...@gmail.com wrote:

 I am running spark 1.0.0, Tachyon 0.5 and Hadoop 1.0.4.

 I am selecting a subset of a large dataset and trying to run queries on the
 cached schema RDD. Strangely, in web UI, I see the following.

 150 Partitions

  Block Name    Storage Level                       Size in Memory ▴  Size on Disk  Executors
  rdd_30_68     Memory Deserialized 1x Replicated   307.5 MB          0.0 B         ip-172-31-45-100.ec2.internal:37796
  rdd_30_133    Memory Deserialized 1x Replicated   216.0 MB          0.0 B         ip-172-31-45-101.ec2.internal:55947
  rdd_30_18     Memory Deserialized 1x Replicated   194.2 MB          0.0 B         ip-172-31-42-159.ec2.internal:43543
  rdd_30_24     Memory Deserialized 1x Replicated   173.3 MB          0.0 B         ip-172-31-45-101.ec2.internal:55947
  rdd_30_70     Memory Deserialized 1x Replicated   168.2 MB          0.0 B         ip-172-31-18-220.ec2.internal:39847
  rdd_30_105    Memory Deserialized 1x Replicated   154.1 MB          0.0 B         ip-172-31-45-102.ec2.internal:36700
  rdd_30_79     Memory Deserialized 1x Replicated   153.9 MB          0.0 B         ip-172-31-45-99.ec2.internal:59538
  rdd_30_60     Memory Deserialized 1x Replicated   4.2 MB            0.0 B         ip-172-31-45-102.ec2.internal:36700
  rdd_30_99     Memory Deserialized 1x Replicated   112.0 B           0.0 B         ip-172-31-45-102.ec2.internal:36700
  rdd_30_90     Memory Deserialized 1x Replicated   112.0 B           0.0 B         ip-172-31-45-102.ec2.internal:36700
  rdd_30_9      Memory Deserialized 1x Replicated   112.0 B           0.0 B         ip-172-31-18-220.ec2.internal:39847
  rdd_30_89     Memory Deserialized 1x Replicated   112.0 B           0.0 B         ip-172-31-45-102.ec2.internal:36700

 What is strange to me is that the size in memory is mostly 112 bytes except for 8
 of them. (I have 9 data files in Hadoop, which are well-distributed 64 MB
 blocks.)

 The tasks processing the RDD are getting stuck after finishing a few initial
 tasks. I am wondering whether it is because Spark has hit the large blocks
 and
 is trying to process them on one worker per task.

 Any suggestions on how I can distribute them more evenly (in block size)?
 And why are my Hadoop blocks nicely even while the Spark cached RDD has such an
 uneven distribution? Any help is appreciated.

 Regards
 Ram



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Cached-RDD-Block-Size-Uneven-Distribution-tp11286.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: disable log4j for spark-shell

2014-08-03 Thread Patrick Wendell
If you want to customize the logging behavior - the simplest way is to copy
conf/log4j.properties.template to conf/log4j.properties. Then you can go and
modify the log level in there. The spark shells should pick this up.
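
For instance (a sketch; WARN is just an example level, and only the
rootCategory line needs to change):

  cp conf/log4j.properties.template conf/log4j.properties
  # then edit the first line of conf/log4j.properties, e.g.:
  log4j.rootCategory=WARN, console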




On Sun, Aug 3, 2014 at 6:16 AM, Sean Owen so...@cloudera.com wrote:

 That's just a template. Nothing consults that file by default.

 It's looking inside the Spark .jar. If you edit
 core/src/main/resources/org/apache/spark/log4j-defaults.properties and
 rebuild Spark, it will pick up those changes.

 I think you could also use the JVM argument
 -Dlog4j.configuration=conf/log4j-defaults.properties to force it to
 look at your local, edited props file.

 Someone  may have to correct me, but I think that in master right now,
 that means using --driver-java-options=.. to set an argument to the
 JVM that runs the shell?

 On Sun, Aug 3, 2014 at 2:07 PM, Gil Vernik g...@il.ibm.com wrote:
  Hi,
 
  I would like to run spark-shell without any INFO messages printed.
  To achieve this I edited /conf/log4j.properties and added line
  log4j.rootLogger=OFF
  that suppose to disable all logging.
 
  However, when I  run ./spark-shell I see the message
  4/08/03 16:02:15 INFO SecurityManager: Using Spark's default log4j
 profile:
  org/apache/spark/log4j-defaults.properties
 
  And then spark-shell prints all INFO messages as is.
 
  What did i missed? Why spark-shell uses default log4j properties and not
 the
  one defined in /conf directory?
  Is there another solution to prevent spark-shell from printing INFO
  messages?
 
  Thanks,
  Gil.
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Patrick Wendell
This is a Scala bug - I filed something upstream, hopefully they can fix it
soon and/or we can provide a work around:

https://issues.scala-lang.org/browse/SI-8772

- Patrick


On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote:

 Currently scala 2.10.2 can't be pulled in from maven central it seems,
 however if you have it in your ivy cache it should work.


 On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote:

 Me 3


 On Fri, Aug 1, 2014 at 11:15 AM, nit nitinp...@gmail.com wrote:

 I also ran into same issue. What is the solution?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Compiling-Spark-master-284771ef-with-sbt-sbt-assembly-fails-on-EC2-tp11155p11189.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




 --
 Cell : 425-233-8271




 --
 Cell : 425-233-8271



Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Patrick Wendell
I've had intermittent access to the artifacts themselves, but for me the
directory listing always 404's.

I think if sbt hits a 404 on the directory, it sends a somewhat confusing
error message that it can't download the artifact.

- Patrick


On Fri, Aug 1, 2014 at 3:28 PM, Shivaram Venkataraman 
shiva...@eecs.berkeley.edu wrote:

 This fails for me too. I have no idea why it happens as I can wget the pom
 from maven central. To work around this I just copied the ivy xmls and jars
 from this github repo

 https://github.com/peterklipfel/scala_koans/tree/master/ivyrepo/cache/org.scala-lang/scala-library
 and put it in /root/.ivy2/cache/org.scala-lang/scala-library

 Thanks
 Shivaram




 On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote:

 Currently scala 2.10.2 can't be pulled in from maven central it seems,
 however if you have it in your ivy cache it should work.


 On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca
 wrote:

 Me 3


 On Fri, Aug 1, 2014 at 11:15 AM, nit nitinp...@gmail.com wrote:

 I also ran into same issue. What is the solution?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Compiling-Spark-master-284771ef-with-sbt-sbt-assembly-fails-on-EC2-tp11155p11189.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




 --
 Cell : 425-233-8271




 --
 Cell : 425-233-8271





Re: how to publish spark inhouse?

2014-07-28 Thread Patrick Wendell
All of the scripts we use to publish Spark releases are in the Spark
repo itself, so you could follow these as a guideline. The publishing
process in Maven is similar to in SBT:

https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L65

On Mon, Jul 28, 2014 at 12:39 PM, Koert Kuipers ko...@tresata.com wrote:
 ah ok thanks. guess i am gonna read up about maven-release-plugin then!


 On Mon, Jul 28, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote:

 This is not something you edit yourself. The Maven release plugin
 manages setting all this. I think virtually everything you're worried
 about is done for you by this plugin.

 Maven requires artifacts to set a version and it can't inherit one. I
 feel like I understood the reason this is necessary at one point.

 On Mon, Jul 28, 2014 at 8:33 PM, Koert Kuipers ko...@tresata.com wrote:
  and if i want to change the version, it seems i have to change it in all
  23
  pom files? mhhh. is it mandatory for these sub-project pom files to
  repeat
  that version info? useful?
 
  spark$ grep 1.1.0-SNAPSHOT * -r  | wc -l
  23
 
 
 
  On Mon, Jul 28, 2014 at 3:05 PM, Koert Kuipers ko...@tresata.com
  wrote:
 
  hey we used to publish spark inhouse by simply overriding the publishTo
  setting. but now that we are integrated in SBT with maven i cannot find
  it
  anymore.
 
  i tried looking into the pom file, but after reading 1144 lines of xml
  i
  1) havent found anything that looks like publishing
  2) i feel somewhat sick too
  3) i am considering alternative careers to developing...
 
  where am i supposed to look?
  thanks for your help!
 
 




Re: Catalyst dependency on Spark Core

2014-07-14 Thread Patrick Wendell
Adding new build modules is pretty high overhead, so if this is a case
where a small amount of duplicated code could get rid of the
dependency, that could also be a good short-term option.

- Patrick

On Mon, Jul 14, 2014 at 2:15 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Yeah, I'd just add a spark-util that has these things.

 Matei

 On Jul 14, 2014, at 1:04 PM, Michael Armbrust mich...@databricks.com
 wrote:

 Yeah, sadly this dependency was introduced when someone consolidated the
 logging infrastructure.  However, the dependency should be very small and
 thus easy to remove, and I would like catalyst to be usable outside of
 Spark.  A pull request to make this possible would be welcome.

 Ideally, we'd create some sort of spark common package that has things like
 logging.  That way catalyst could depend on that, without pulling in all of
 Hadoop, etc.  Maybe others have opinions though, so I'm cc-ing the dev list.


 On Mon, Jul 14, 2014 at 12:21 AM, Yanbo Liang yanboha...@gmail.com wrote:

 Making Catalyst independent of Spark is the goal of Catalyst; it may just need
 time and evolution.
 I noticed that the package org.apache.spark.sql.catalyst.util imports
 org.apache.spark.util.{Utils => SparkUtils},
 so Catalyst has a dependency on Spark core.
 I'm not sure whether it will be replaced by another component independent of
 Spark in a later release.


 2014-07-14 11:51 GMT+08:00 Aniket Bhatnagar aniket.bhatna...@gmail.com:

 As per the recent presentation given in Scala days
 (http://people.apache.org/~marmbrus/talks/SparkSQLScalaDays2014.pdf), it was
 mentioned that Catalyst is independent of Spark. But on inspecting pom.xml
 of sql/catalyst module, it seems it has a dependency on Spark Core. Any
 particular reason for the dependency? I would love to use Catalyst outside
 Spark

 (reposted as previous email bounced. Sorry if this is a duplicate).






Announcing Spark 1.0.1

2014-07-11 Thread Patrick Wendell
I am happy to announce the availability of Spark 1.0.1! This release
includes contributions from 70 developers. Spark 1.0.1 includes fixes
across several areas of Spark, including the core API, PySpark, and
MLlib. It also includes new features in Spark's (alpha) SQL library,
including support for JSON data and performance and stability fixes.

Visit the release notes[1] to read about this release or download[2]
the release today.

[1] http://spark.apache.org/releases/spark-release-1-0-1.html
[2] http://spark.apache.org/downloads.html


Re: How to clear the list of Completed Appliations in Spark web UI?

2014-07-09 Thread Patrick Wendell
There isn't currently a way to do this, but it will start dropping
older applications once more than 200 are stored.

On Wed, Jul 9, 2014 at 4:04 PM, Haopu Wang hw...@qilinsoft.com wrote:
 Besides restarting the Master, is there any other way to clear the
 Completed Applications in Master web UI?


Re: Purpose of spark-submit?

2014-07-09 Thread Patrick Wendell
It fulfills a few different functions. The main one is giving users a
way to inject Spark as a runtime dependency separately from their
program and make sure they get exactly the right version of Spark. So
a user can bundle an application and then use spark-submit to send it
to different types of clusters (or using different versions of Spark).

It also unifies the way you bundle and submit an app for Yarn, Mesos,
etc... this was something that became very fragmented over time before
this was added.

Another feature is allowing users to set configuration values
dynamically rather than compile them inside of their program. That's
the one you mention here. You can choose to use this feature or not.
If you know your configs are not going to change, then you don't need
to set them with spark-submit.
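
As a concrete sketch (the class name, master URL, and jar are placeholders):

  ./bin/spark-submit \
    --class com.example.MyApp \
    --master spark://master-host:7077 \
    --executor-memory 4g \
    my-app-assembly.jar arg1 arg2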


On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote:
 What is the purpose of spark-submit? Does it do anything outside of
 the standard val conf = new SparkConf ... val sc = new SparkContext
 ... ?


Re: issues with ./bin/spark-shell for standalone mode

2014-07-09 Thread Patrick Wendell
Hey Mikhail,

I think (hope?) the -em and -dm options were never in an official
Spark release. They were just in the master branch at some point. Did
you use these during a previous Spark release or were you just on
master?

- Patrick

On Wed, Jul 9, 2014 at 9:18 AM, Mikhail Strebkov streb...@gmail.com wrote:
 Thanks Andrew,
  ./bin/spark-shell --master spark://10.2.1.5:7077 --total-executor-cores 30
 --executor-memory 20g --driver-memory 10g
 works well, just wanted to make sure that I'm not missing anything



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9111.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: hadoop + yarn + spark

2014-06-27 Thread Patrick Wendell
Hi There,

There is an issue with PySpark-on-YARN that requires users build with
Java 6. The issue has to do with how Java 6 and 7 package jar files
differently.

Can you try building spark with Java 6 and trying again?
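
One way to try that (a sketch; the JDK path is a placeholder, and putting the
JDK 6 binaries first on the PATH is just a simple way to make sure the build
uses them):

  export JAVA_HOME=/path/to/jdk1.6        # placeholder
  export PATH=$JAVA_HOME/bin:$PATH
  sbt/sbt clean assembly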

- Patrick

On Fri, Jun 27, 2014 at 5:00 PM, sdeb sangha...@gmail.com wrote:
 Hello,

 I have installed spark on top of hadoop + yarn.
 when I launch the pyspark shell & try to compute something I get this error.

 Error from python worker:
   /usr/bin/python: No module named pyspark

 The pyspark module should be there, do I have to put an external link to it?

 --Sanghamitra.




 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/hadoop-yarn-spark-tp8466.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: 1.0.1 release plan

2014-06-20 Thread Patrick Wendell
Hey There,

I'd like to start voting on this release shortly because there are a
few important fixes that have queued up. We're just waiting to fix an
akka issue. I'd guess we'll cut a vote in the next few days.

- Patrick

On Thu, Jun 19, 2014 at 10:47 AM, Mingyu Kim m...@palantir.com wrote:
 Hi all,

 Is there any plan for 1.0.1 release?

 Mingyu


Re: Trailing Tasks Saving to HDFS

2014-06-19 Thread Patrick Wendell
I'll make a comment on the JIRA - thanks for reporting this, let's get
to the bottom of it.

On Thu, Jun 19, 2014 at 11:19 AM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
 I've created an issue for this but if anyone has any advice, please let me
 know.

 Basically, on about 10 GBs of data, saveAsTextFile() to HDFS hangs on two
 remaining tasks (out of 320). Those tasks seem to be waiting on data from
 another task on another node. Eventually (about 2 hours later) they time out
 with a connection reset by peer.

 All the data actually seems to be on HDFS as the expected part files. It
 just seems like the remaining tasks have corrupted metadata, so that they
 do not realize that they are done. Just a guess though.

 https://issues.apache.org/jira/browse/SPARK-2202

 -Suren




 On Wed, Jun 18, 2014 at 8:35 PM, Surendranauth Hiraman
 suren.hira...@velos.io wrote:

 Looks like eventually there was some type of reset or timeout and the
 tasks have been reassigned. I'm guessing they'll keep failing until max
 failure count.

 The machine it disconnected from was a remote machine, though I've seen
 such failures from connections to itself with other problems. The log lines
 from the remote machine are also below.

 Any thoughts or guesses would be appreciated!

 HUNG WORKER

 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from
 connection to ConnectionManagerId(172.16.25.103,57626)

 java.io.IOException: Connection reset by peer

 at sun.nio.ch.FileDispatcher.read0(Native Method)

 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)

 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)

 at sun.nio.ch.IOUtil.read(IOUtil.java:224)

 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)

 at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)

 at
 org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)

 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)

 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)

 at java.lang.Thread.run(Thread.java:679)

 14/06/18 19:41:18 INFO network.ConnectionManager: Handling connection
 error on connection to ConnectionManagerId(172.16.25.103,57626)

 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
 ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)

 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
 SendingConnection to ConnectionManagerId(172.16.25.103,57626)

 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
 ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)

 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding
 SendingConnectionManagerId not found


 REMOTE WORKER

 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
 ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)

 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding
 SendingConnectionManagerId not found




 On Wed, Jun 18, 2014 at 7:16 PM, Surendranauth Hiraman
 suren.hira...@velos.io wrote:

 I have a flow that ends with saveAsTextFile() to HDFS.

 It seems all the expected files per partition have been written out,
 based on the number of part files and the file sizes.

 But the driver logs show 2 tasks still not completed and has no activity
 and the worker logs show no activity for those two tasks for a while now.

 Has anyone run into this situation? It's happened to me a couple of times
 now.

 Thanks.

 -- Suren

 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hira...@velos.io
 W: www.velos.io




 --

 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hira...@velos.io
 W: www.velos.io




 --

 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hira...@velos.io
 W: www.velos.io



Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Patrick Wendell
Hey Jeremy,

This is patched in the 1.0 and 0.9 branches of Spark. We're likely to
make a 1.0.1 release soon (this patch being one of the main reasons),
but if you are itching for this sooner, you can just checkout the head
of branch-1.0 and you will be able to use r3.XXX instances.

- Patrick

On Tue, Jun 17, 2014 at 4:17 PM, Jeremy Lee
unorthodox.engine...@gmail.com wrote:
 Some people (me included) might have wondered why all our m1.large spot
 instances (in us-west-1) shut down a few hours ago...

 Simple reason: The EC2 spot price for Spark's default m1.large instances
 just jumped from 0.016 per hour, to about 0.750. Yes, Fifty times. Probably
 something to do with world cup.

 So far this is just us-west-1, but prices have a tendency to equalize across
 centers as the days pass. Time to make backups and plans.

 m3 spot prices are still down at $0.02 (and being new, will be bypassed by
 older systems), so it would be REAAALLYY nice if there had been some
 progress on that issue. Let me know if I can help with testing and whatnot.


 --
 Jeremy Lee  BCompSci(Hons)
   The Unorthodox Engineers


Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Patrick Wendell
By the way, in case it's not clear, I mean our maintenance branches:

https://github.com/apache/spark/tree/branch-1.0

On Tue, Jun 17, 2014 at 8:35 PM, Patrick Wendell pwend...@gmail.com wrote:
 Hey Jeremy,

 This is patched in the 1.0 and 0.9 branches of Spark. We're likely to
 make a 1.0.1 release soon (this patch being one of the main reasons),
 but if you are itching for this sooner, you can just checkout the head
 of branch-1.0 and you will be able to use r3.XXX instances.

 - Patrick

 On Tue, Jun 17, 2014 at 4:17 PM, Jeremy Lee
 unorthodox.engine...@gmail.com wrote:
 Some people (me included) might have wondered why all our m1.large spot
 instances (in us-west-1) shut down a few hours ago...

 Simple reason: The EC2 spot price for Spark's default m1.large instances
 just jumped from 0.016 per hour, to about 0.750. Yes, Fifty times. Probably
 something to do with world cup.

 So far this is just us-west-1, but prices have a tendency to equalize across
 centers as the days pass. Time to make backups and plans.

 m3 spot prices are still down at $0.02 (and being new, will be bypassed by
 older systems), so it would be REAAALLYY nice if there had been some
 progress on that issue. Let me know if I can help with testing and whatnot.


 --
 Jeremy Lee  BCompSci(Hons)
   The Unorthodox Engineers


Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Patrick Wendell
Actually you'll just want to clone the 1.0 branch then use the
spark-ec2 script in there to launch your cluster. The --spark-git-repo
flag is if you want to launch with a different version of Spark on the
cluster. In your case you just need a different version of the launch
script itself, which will be present in the 1.0 branch of Spark.
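
Roughly (a sketch; the key pair, identity file, cluster size, and cluster name
are placeholders):

  git clone -b branch-1.0 https://github.com/apache/spark.git
  cd spark/ec2
  ./spark-ec2 -k my-key -i ~/my-key.pem -s 4 --instance-type=r3.large launch my-cluster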

- Patrick

On Tue, Jun 17, 2014 at 9:29 PM, Jeremy Lee
unorthodox.engine...@gmail.com wrote:
 I am about to spin up some new clusters, so I may give that a go... any
 special instructions for making them work? I assume I use the 
 --spark-git-repo= option on the spark-ec2 command. Is it as easy as
 concatenating your string as the value?

 On cluster management GUIs... I've been looking around at Amabari, Datastax,
 Cloudera, OpsCenter etc. Not totally convinced by any of them yet. Anyone
 using a good one I should know about? I'm really beginning to lean in the
 direction of Cassandra as the distributed data store...


 On Wed, Jun 18, 2014 at 1:46 PM, Patrick Wendell pwend...@gmail.com wrote:

 By the way, in case it's not clear, I mean our maintenance branches:

 https://github.com/apache/spark/tree/branch-1.0

 On Tue, Jun 17, 2014 at 8:35 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Hey Jeremy,
 
  This is patched in the 1.0 and 0.9 branches of Spark. We're likely to
  make a 1.0.1 release soon (this patch being one of the main reasons),
  but if you are itching for this sooner, you can just checkout the head
  of branch-1.0 and you will be able to use r3.XXX instances.
 
  - Patrick
 
  On Tue, Jun 17, 2014 at 4:17 PM, Jeremy Lee
  unorthodox.engine...@gmail.com wrote:
  Some people (me included) might have wondered why all our m1.large spot
  instances (in us-west-1) shut down a few hours ago...
 
  Simple reason: The EC2 spot price for Spark's default m1.large
  instances
  just jumped from 0.016 per hour, to about 0.750. Yes, Fifty times.
  Probably
  something to do with world cup.
 
  So far this is just us-west-1, but prices have a tendency to equalize
  across
  centers as the days pass. Time to make backups and plans.
 
  m3 spot prices are still down at $0.02 (and being new, will be
  bypassed by
  older systems), so it would be REAAALLYY nice if there had been some
  progress on that issue. Let me know if I can help with testing and
  whatnot.
 
 
  --
  Jeremy Lee  BCompSci(Hons)
The Unorthodox Engineers




 --
 Jeremy Lee  BCompSci(Hons)
   The Unorthodox Engineers


Re: Wildcard support in input path

2014-06-17 Thread Patrick Wendell
These paths get passed directly to the Hadoop FileSystem API and I
think they support globbing out of the box. So AFAIK it should just
work.

On Tue, Jun 17, 2014 at 9:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
 Hi Jianshi,

 I have used wild card characters (*) in my program and it worked..
 My code was like this
 b = sc.textFile("hdfs:///path to file/data_file_2013SEP01*")

 Thanks  Regards,
 Meethu M


 On Wednesday, 18 June 2014 9:29 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:


 It would be convenient if Spark's textFile, parquetFile, etc. can support
 path with wildcard, such as:

   hdfs://domain/user/jianshuang/data/parquet/table/month=2014*

 Or is there already a way to do it now?

 Jianshi

 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-17 Thread Patrick Wendell
Out of curiosity - are you guys using speculation, shuffle
consolidation, or any other non-default option? If so that would help
narrow down what's causing this corruption.

On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
 Matt/Ryan,

 Did you make any headway on this? My team is running into this also.
 Doesn't happen on smaller datasets. Our input set is about 10 GB but we
 generate 100s of GBs in the flow itself.

 -Suren




 On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton compton.r...@gmail.com wrote:

 Just ran into this today myself. I'm on branch-1.0 using a CDH3
 cluster (no modifications to Spark or its dependencies). The error
 appeared trying to run GraphX's .connectedComponents() on a ~200GB
 edge list (GraphX worked beautifully on smaller data).

 Here's the stacktrace (it's quite similar to yours
 https://imgur.com/7iBA4nJ ).

 14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed
 4 times; aborting job
 14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at
 VertexRDD.scala:100
 Exception in thread main org.apache.spark.SparkException: Job
 aborted due to stage failure: Task 5.599:39 failed 4 times, most
 recent failure: Exception failure in TID 29735 on host node18:
 java.io.StreamCorruptedException: invalid type code: AC
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)

 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)

 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)

 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192)

 org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78)

 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)

 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)

 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 org.apache.spark.scheduler.Task.run(Task.scala:51)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 

Re: Setting spark memory limit

2014-06-09 Thread Patrick Wendell
If you run locally, then Spark doesn't launch remote executors. However,
in this case you can set the memory with the --driver-memory flag to
spark-submit. Does that work?
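
For example, reusing the command from the quoted mail below (a sketch):

  ~/spark-1.0.0-bin-hadoop1/bin/spark-submit --class SimpleApp --master local[4] \
    --driver-memory 10g \
    target/scala-2.10/simple-project_2.10-1.0.jar /tmp/mdata0-10.tsd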

- Patrick

On Mon, Jun 9, 2014 at 3:24 PM, Henggang Cui cuihengg...@gmail.com wrote:
 Hi,

 I'm trying to run the SimpleApp example
 (http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala)
 on a larger dataset.

 The input file is about 1GB, but when I run the Spark program, it
 says:java.lang.OutOfMemoryError: GC overhead limit exceeded, the full
 error output is attached at the end of the E-mail.

 Then I tried multiple ways of setting the memory limit.

 In SimpleApp.scala file, I set the following configurations:
 val conf = new SparkConf()
.setAppName("Simple Application")
.set("spark.executor.memory", "10g")

 And I have also tried appending the following configuration to
 conf/spark-defaults.conf file:
 spark.executor.memory   10g

 But neither of them works. In the error message, it claims (estimated size
 103.8 MB, free 191.1 MB), so the total available memory is still 300MB.
 Why?

 Thanks,
 Cui


 $ ~/spark-1.0.0-bin-hadoop1/bin/spark-submit --class SimpleApp --master
 local[4] target/scala-2.10/simple-project_2.10-1.0.jar /tmp/mdata0-10.tsd
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 14/06/09 15:06:29 INFO SecurityManager: Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 14/06/09 15:06:29 INFO SecurityManager: Changing view acls to: cuihe
 14/06/09 15:06:29 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(cuihe)
 14/06/09 15:06:29 INFO Slf4jLogger: Slf4jLogger started
 14/06/09 15:06:29 INFO Remoting: Starting remoting
 14/06/09 15:06:30 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sp...@131-1.bfc.hpl.hp.com:40779]
 14/06/09 15:06:30 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://sp...@131-1.bfc.hpl.hp.com:40779]
 14/06/09 15:06:30 INFO SparkEnv: Registering MapOutputTracker
 14/06/09 15:06:30 INFO SparkEnv: Registering BlockManagerMaster
 14/06/09 15:06:30 INFO DiskBlockManager: Created local directory at
 /tmp/spark-local-20140609150630-eaa9
 14/06/09 15:06:30 INFO MemoryStore: MemoryStore started with capacity 294.9
 MB.
 14/06/09 15:06:30 INFO ConnectionManager: Bound socket to port 47164 with id
 = ConnectionManagerId(131-1.bfc.hpl.hp.com,47164)
 14/06/09 15:06:30 INFO BlockManagerMaster: Trying to register BlockManager
 14/06/09 15:06:30 INFO BlockManagerInfo: Registering block manager
 131-1.bfc.hpl.hp.com:47164 with 294.9 MB RAM
 14/06/09 15:06:30 INFO BlockManagerMaster: Registered BlockManager
 14/06/09 15:06:30 INFO HttpServer: Starting HTTP Server
 14/06/09 15:06:30 INFO HttpBroadcast: Broadcast server started at
 http://16.106.36.131:48587
 14/06/09 15:06:30 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-35e1c47b-bfa1-4fba-bc64-df8eee287bb7
 14/06/09 15:06:30 INFO HttpServer: Starting HTTP Server
 14/06/09 15:06:30 INFO SparkUI: Started SparkUI at
 http://131-1.bfc.hpl.hp.com:4040
 14/06/09 15:06:30 INFO SparkContext: Added JAR
 file:/data/cuihe/spark-app/target/scala-2.10/simple-project_2.10-1.0.jar at
 http://16.106.36.131:35579/jars/simple-project_2.10-1.0.jar with timestamp
 1402351590741
 14/06/09 15:06:30 INFO MemoryStore: ensureFreeSpace(32856) called with
 curMem=0, maxMem=309225062
 14/06/09 15:06:30 INFO MemoryStore: Block broadcast_0 stored as values to
 memory (estimated size 32.1 KB, free 294.9 MB)
 14/06/09 15:06:30 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 14/06/09 15:06:30 WARN LoadSnappy: Snappy native library not loaded
 14/06/09 15:06:30 INFO FileInputFormat: Total input paths to process : 1
 14/06/09 15:06:30 INFO SparkContext: Starting job: count at
 SimpleApp.scala:14
 14/06/09 15:06:31 INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:14)
 with 7 output partitions (allowLocal=false)
 14/06/09 15:06:31 INFO DAGScheduler: Final stage: Stage 0(count at
 SimpleApp.scala:14)
 14/06/09 15:06:31 INFO DAGScheduler: Parents of final stage: List()
 14/06/09 15:06:31 INFO DAGScheduler: Missing parents: List()
 14/06/09 15:06:31 INFO DAGScheduler: Submitting Stage 0 (FilteredRDD[2] at
 filter at SimpleApp.scala:14), which has no missing parents
 14/06/09 15:06:31 INFO DAGScheduler: Submitting 7 missing tasks from Stage 0
 (FilteredRDD[2] at filter at SimpleApp.scala:14)
 14/06/09 15:06:31 INFO TaskSchedulerImpl: Adding task set 0.0 with 7 tasks
 14/06/09 15:06:31 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on
 executor localhost: localhost (PROCESS_LOCAL)
 14/06/09 15:06:31 INFO TaskSetManager: Serialized task 0.0:0 as 1839 bytes
 in 2 ms
 14/06/09 15:06:31 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on
 executor localhost: localhost (PROCESS_LOCAL)

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-08 Thread Patrick Wendell
Paul,

Could you give the version of Java that you are building with and the
version of Java you are running with? Are they the same?

Just off the cuff, I wonder if this is related to:
https://issues.apache.org/jira/browse/SPARK-1520

If it is, it could appear that certain functions are not in the jar
because they go beyond the extended zip boundary, and `jar tvf` won't list
them.

- Patrick

On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown p...@mult.ifario.us wrote:
 Moving over to the dev list, as this isn't a user-scope issue.

 I just ran into this issue with the missing saveAsTextFile, and here's a
 little additional information:

 - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases.
 - Driver built as an uberjar via Maven.
 - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with
 Spark 1.0.0-hadoop1 downloaded from Apache.

 Given that it functions correctly in local mode but not in a standalone
 cluster, this suggests to me that the issue is in a difference between the
 Maven version and the hadoop1 version.

 In the spirit of taking the computer at its word, we can just have a look
 in the JAR files.  Here's what's in the Maven dep as of 1.0.0:

 jar tvf
 ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
 | grep 'rdd/RDD' | grep 'saveAs'
   1519 Mon May 26 13:57:58 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
   1560 Mon May 26 13:57:58 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class


 And here's what's in the hadoop1 distribution:

 jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'


 I.e., it's not there.  It is in the hadoop2 distribution:

 jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
   1519 Mon May 26 07:29:54 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
   1560 Mon May 26 07:29:54 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class


 So something's clearly broken with the way that the distribution assemblies
 are created.

 FWIW and IMHO, the right way to publish the hadoop1 and hadoop2 flavors
 of Spark to Maven Central would be as *entirely different* artifacts
 (spark-core-h1, spark-core-h2).

 Logged as SPARK-2075 https://issues.apache.org/jira/browse/SPARK-2075.

 Cheers.
 -- Paul



 --
 p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


 On Fri, Jun 6, 2014 at 2:45 AM, HenriV henri.vanh...@vdab.be wrote:

 I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0.
 I'm using Google Compute Engine and Cloud Storage, but saveAsTextFile is
 returning errors while saving in the cloud or saving locally. When I start a
 job in the cluster I do get an error, but after this error it keeps on
 running fine until the saveAsTextFile. (I don't know if the two are
 connected.)

 ---Error at job startup---
  ERROR metrics.MetricsSystem: Sink class
 org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
 java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
 at

 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at

 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at

 org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
 at

 org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130)
 at
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
 at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
 at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
 at

 org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130)
 at
 org.apache.spark.metrics.MetricsSystem.init(MetricsSystem.scala:84)
 at

 org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230)
 at org.apache.spark.SparkContext.init(SparkContext.scala:202)
 at Hello$.main(Hello.scala:101)
 at Hello.main(Hello.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at sbt.Run.invokeMain(Run.scala:72)
 at sbt.Run.run0(Run.scala:65)
 at 

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-08 Thread Patrick Wendell
Also I should add - thanks for taking time to help narrow this down!

On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell pwend...@gmail.com wrote:
 Paul,

 Could you give the version of Java that you are building with and the
 version of Java you are running with? Are they the same?

 Just off the cuff, I wonder if this is related to:
 https://issues.apache.org/jira/browse/SPARK-1520

 If it is, it could appear that certain functions are not in the jar
 because they go beyond the extended zip boundary `jar tvf` won't list
 them.

 - Patrick
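
As a quick cross-check, the same question can be answered at runtime from
spark-shell on the cluster, independent of any jar listing. The snippet below
is only an illustrative sketch, not something from this thread:

  // which jar this JVM actually loaded the RDD class from
  val rddClass = classOf[org.apache.spark.rdd.RDD[_]]
  println(rddClass.getProtectionDomain.getCodeSource.getLocation)
  // whether that loaded class exposes saveAsTextFile at runtime
  println(rddClass.getMethods.exists(_.getName == "saveAsTextFile"))

This at least shows which assembly the JVM is using and whether the method is
visible there, which sidesteps any zip-boundary quirks in the jar listing.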

 On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown p...@mult.ifario.us wrote:
 Moving over to the dev list, as this isn't a user-scope issue.

  I just ran into this issue with the missing saveAsTextFile, and here's a
 little additional information:

 - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases.
 - Driver built as an uberjar via Maven.
 - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with
 Spark 1.0.0-hadoop1 downloaded from Apache.

 Given that it functions correctly in local mode but not in a standalone
 cluster, this suggests to me that the issue is in a difference between the
 Maven version and the hadoop1 version.

 In the spirit of taking the computer at its word, we can just have a look
 in the JAR files.  Here's what's in the Maven dep as of 1.0.0:

 jar tvf
 ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
 | grep 'rdd/RDD' | grep 'saveAs'
   1519 Mon May 26 13:57:58 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
   1560 Mon May 26 13:57:58 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class


 And here's what's in the hadoop1 distribution:

 jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'


 I.e., it's not there.  It is in the hadoop2 distribution:

 jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
   1519 Mon May 26 07:29:54 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
   1560 Mon May 26 07:29:54 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class


 So something's clearly broken with the way that the distribution assemblies
 are created.

 FWIW and IMHO, the right way to publish the hadoop1 and hadoop2 flavors
 of Spark to Maven Central would be as *entirely different* artifacts
 (spark-core-h1, spark-core-h2).

 Logged as SPARK-2075 https://issues.apache.org/jira/browse/SPARK-2075.

 Cheers.
 -- Paul



 --
 p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


 On Fri, Jun 6, 2014 at 2:45 AM, HenriV henri.vanh...@vdab.be wrote:

 I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0.
 I'm using Google Compute Engine and Cloud Storage, but saveAsTextFile is
 returning errors whether saving to the cloud or saving locally. When I start a
 job on the cluster I do get an error, but after this error it keeps on
 running fine until the saveAsTextFile. (I don't know if the two are
 connected.)

 ---Error at job startup---
  ERROR metrics.MetricsSystem: Sink class
 org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
 java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
 at

 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at

 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at

 org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
 at

 org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130)
 at
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
 at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
 at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
 at

 org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130)
 at
 org.apache.spark.metrics.MetricsSystem.init(MetricsSystem.scala:84)
 at

 org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230)
 at org.apache.spark.SparkContext.init(SparkContext.scala:202)
 at Hello$.main(Hello.scala:101)
 at Hello.main(Hello.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-08 Thread Patrick Wendell
Okay I think I've isolated this a bit more. Let's discuss over on the JIRA:

https://issues.apache.org/jira/browse/SPARK-2075

On Sun, Jun 8, 2014 at 1:16 PM, Paul Brown p...@mult.ifario.us wrote:

 Hi, Patrick --

 Java 7 on the development machines:

 » java -version
 java version 1.7.0_51
 Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)


 And on the deployed boxes:

 $ java -version
 java version 1.7.0_55
 OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
 OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)


 Also, unzip -l in place of jar tvf gives the same results, so I don't
 think it's an issue with jar not reporting the files.  Also, the classes do
 get correctly packaged into the uberjar:

 unzip -l /target/[deleted]-driver.jar | grep 'rdd/RDD' | grep 'saveAs'
  1519  06-08-14 12:05
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
  1560  06-08-14 12:05
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class


 Best.
 -- Paul

 —
 p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


 On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell pwend...@gmail.com wrote:

 Paul,

 Could you give the version of Java that you are building with and the
 version of Java you are running with? Are they the same?

 Just off the cuff, I wonder if this is related to:
 https://issues.apache.org/jira/browse/SPARK-1520

 If it is, it could appear that certain functions are not in the jar
 because they go beyond the extended zip boundary `jar tvf` won't list
 them.

 - Patrick

 On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown p...@mult.ifario.us wrote:
  Moving over to the dev list, as this isn't a user-scope issue.
 
   I just ran into this issue with the missing saveAsTextFile, and here's a
  little additional information:
 
  - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases.
  - Driver built as an uberjar via Maven.
  - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with
  Spark 1.0.0-hadoop1 downloaded from Apache.
 
  Given that it functions correctly in local mode but not in a standalone
  cluster, this suggests to me that the issue is in a difference between
  the
  Maven version and the hadoop1 version.
 
  In the spirit of taking the computer at its word, we can just have a
  look
  in the JAR files.  Here's what's in the Maven dep as of 1.0.0:
 
  jar tvf
 
  ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
  | grep 'rdd/RDD' | grep 'saveAs'
1519 Mon May 26 13:57:58 PDT 2014
  org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
1560 Mon May 26 13:57:58 PDT 2014
  org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
 
 
  And here's what's in the hadoop1 distribution:
 
  jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep
  'saveAs'
 
 
  I.e., it's not there.  It is in the hadoop2 distribution:
 
  jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep
  'saveAs'
1519 Mon May 26 07:29:54 PDT 2014
  org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
1560 Mon May 26 07:29:54 PDT 2014
  org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
 
 
  So something's clearly broken with the way that the distribution
  assemblies
  are created.
 
  FWIW and IMHO, the right way to publish the hadoop1 and hadoop2
  flavors
  of Spark to Maven Central would be as *entirely different* artifacts
  (spark-core-h1, spark-core-h2).
 
  Logged as SPARK-2075 https://issues.apache.org/jira/browse/SPARK-2075.
 
  Cheers.
  -- Paul
 
 
 
  --
  p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
 
 
  On Fri, Jun 6, 2014 at 2:45 AM, HenriV henri.vanh...@vdab.be wrote:
 
  I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0.
  I'm using Google Compute Engine and Cloud Storage, but saveAsTextFile is
  returning errors whether saving to the cloud or saving locally. When I
  start a
  job on the cluster I do get an error, but after this error it keeps on
  running fine until the saveAsTextFile. (I don't know if the two are
  connected.)
 
  ---Error at job startup---
   ERROR metrics.MetricsSystem: Sink class
  org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
  java.lang.reflect.InvocationTargetException
  at
  sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
  Method)
  at
 
 
  sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at
 
 
  sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at
  java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at
 
 
  org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
  at
 
 
  org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130

Re: Setting executor memory when using spark-shell

2014-06-06 Thread Patrick Wendell
In 1.0+ you can just pass the --executor-memory flag to ./bin/spark-shell.
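
For instance, a purely illustrative invocation (the 2g is just an example):

  ./bin/spark-shell --executor-memory 2g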

On Fri, Jun 6, 2014 at 12:32 AM, Oleg Proudnikov
oleg.proudni...@gmail.com wrote:
 Thank you, Hassan!


 On 6 June 2014 03:23, hassan hellfire...@gmail.com wrote:

 just use -Dspark.executor.memory=



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Setting-executor-memory-when-using-spark-shell-tp7082p7103.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




 --
 Kind regards,

 Oleg



Re: Spark 1.0 embedded Hive libraries

2014-06-06 Thread Patrick Wendell
They are forked and slightly modified for two reasons:

(a) Hive embeds a bunch of other dependencies in their published jars
such that it makes it really hard for other projects to depend on
them. If you look at the hive-exec jar they copy a bunch of other
dependencies directly into this jar. We modified the Hive 0.12 build
to produce jars that do not include other dependencies inside of them.

(b) Hive relies on a version of protobuf that makes it
incompatible with certain Hadoop versions. We used a shaded version of
the protobuf dependency to avoid this.

The forked copy is here - feel free to take a look:
https://github.com/pwendell/hive/commits/branch-0.12-shaded-protobuf

I'm hoping the upstream Hive project will change their published
artifacts to make them usable as a library for other applications.
Unfortunately as it stands we had to fork our own copy of these to
make it work. I think it's being tracked by this JIRA:

https://issues.apache.org/jira/browse/HIVE-5733

- Patrick

On Fri, Jun 6, 2014 at 12:08 PM, Silvio Fiorito
silvio.fior...@granturing.com wrote:
 Is there a repo somewhere with the code for the Hive dependencies
 (hive-exec, hive-serde,  hive-metastore) used in SparkSQL? Are they forked
 with Spark-specific customizations, like Shark, or simply relabeled with a
 new package name (org.spark-project.hive)? I couldn't find any repos on
 Github or Apache main.

 I'm wanting to use some Hive packages outside of the ones burned into the
 Spark JAR but I'm having all sorts of headaches due to jar-hell with the
 Hive JARs in CDH or even HDP mismatched with the Spark Hive JARs.

 Thanks,
 Silvio


Re: Spark 1.0.0 fails if mesos.coarse set to true

2014-06-04 Thread Patrick Wendell
Hey, thanks a lot for reporting this. Do you mind making a JIRA with
the details so we can track it?

- Patrick

On Wed, Jun 4, 2014 at 9:24 AM, Marek Wiewiorka
marek.wiewio...@gmail.com wrote:
 Exactly the same story - it used to work with 0.9.1 and does not work
 anymore with 1.0.0.
 I ran tests using spark-shell as well as my application (so tested turning on
 coarse mode via env variable and SparkContext properties explicitly).

 M.


 2014-06-04 18:12 GMT+02:00 ajatix a...@sigmoidanalytics.com:

 I'm running a manually built cluster on EC2. I have mesos (0.18.2) and
 hdfs
 (2.0.0-cdh4.5.0) installed on all slaves (3) and masters (3). I have
 spark-1.0.0 on one master and the executor file is on hdfs for the slaves.
 Whenever I try to launch a spark application on the cluster, it starts a
 task on each slave (I'm using default configs) and they start FAILING with
 the error msg - 'Is spark installed on it?'



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-0-fails-if-mesos-coarse-set-to-true-tp6817p6945.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: is there any easier way to define a custom RDD in Java

2014-06-04 Thread Patrick Wendell
Hey There,

This is only possible in Scala right now. However, this is almost
never needed since the core API is fairly flexible. I have the same
question as Andrew... what are you trying to do with your RDD?

- Patrick
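
For anyone curious what the Scala route looks like, a bare-bones custom RDD
only needs getPartitions and compute. The sketch below is purely illustrative;
RangeRDD and RangePartition are made-up names, not part of Spark:

  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // one partition covering the integer range [start, end)
  class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

  // an RDD that generates the integers [0, n), split across numSlices partitions
  class RangeRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {

    override def getPartitions: Array[Partition] =
      (0 until numSlices).map { i =>
        new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices): Partition
      }.toArray

    override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
      val p = split.asInstanceOf[RangePartition]
      (p.start until p.end).iterator
    }
  }

  // usage: new RangeRDD(sc, 1000, 4).count() should return 1000

getPartitions defines how the data is split and compute produces the iterator
for one partition; everything else (map, filter, count, ...) comes from RDD.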

On Wed, Jun 4, 2014 at 7:49 AM, Andrew Ash and...@andrewash.com wrote:
 Just curious, what do you want your custom RDD to do that the normal ones
 don't?


 On Wed, Jun 4, 2014 at 6:30 AM, bluejoe2008 bluejoe2...@gmail.com wrote:

 hi, folks,
 is there any easier way to define a custom RDD in Java?
  I am wondering if I have to define a new Java class which extends RDD
 from scratch? It is really a hard job for developers!

 2014-06-04
 
 bluejoe2008




Re: error with cdh 5 spark installation

2014-06-04 Thread Patrick Wendell
Hey Chirag,

Those init scripts are part of the Cloudera Spark package (they are
not in the Spark project itself) so you might try e-mailing their
support lists directly.

- Patrick

On Wed, Jun 4, 2014 at 7:19 AM, chirag lakhani chirag.lakh...@gmail.com wrote:
 I recently spun up an AWS cluster with cdh 5 using Cloudera Manager.  I am
 trying to install spark and simply used the install command, as stated in
 the CDH 5 documentation.


 sudo apt-get install spark-core spark-master spark-worker spark-python

 I get the following error

 Setting up spark-master
 (0.9.0+cdh5.0.1+33-1.cdh5.0.1.p0.25~precise-cdh5.0.1) ...
  * Starting Spark master (spark-master):
 invoke-rc.d: initscript spark-master, action start failed.
 dpkg: error processing spark-master (--configure):
  subprocess installed post-installation script returned error exit status 1
 Errors were encountered while processing:
  spark-master


 Has anyone else encountered this?  Does anyone have any suggestions of what
 to do about it?

 Chirag




Re: Can't seem to link external/twitter classes from my own app

2014-06-04 Thread Patrick Wendell
Hey Jeremy,

The issue is that you are using one of the external libraries and
these aren't actually packaged with Spark on the cluster, so you need
to create an uber jar that includes them.

You can look at the example here (I recently did this for a kafka
project and the idea is the same):

https://github.com/pwendell/kafka-spark-example

You'll want to make an uber jar that includes these packages (run sbt
assembly) and then submit that jar to spark-submit. Also, I'd try
running it locally first (if you aren't already) just to make the
debugging simpler.

- Patrick
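
For the build file later in this thread, the dependency side of that uber jar
might look roughly like the sketch below. This is only indicative: the version
numbers are examples, and the sbt-assembly plugin itself still has to be wired
into the project in whatever form matches your sbt version:

  libraryDependencies ++= Seq(
    // already on the cluster, so keep these out of the uber jar
    "org.apache.spark" %% "spark-core"              % "1.0.0" % "provided",
    "org.apache.spark" %% "spark-streaming"         % "1.0.0" % "provided",
    // not shipped with Spark, so these must go into the uber jar
    "org.apache.spark" %% "spark-streaming-twitter" % "1.0.0",
    "org.twitter4j"    %  "twitter4j-stream"        % "3.0.3"
  )

Marking the Spark core artifacts as "provided" keeps the assembled jar small
while still bundling the twitter module that the cluster does not ship.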


On Wed, Jun 4, 2014 at 6:16 AM, Sean Owen so...@cloudera.com wrote:
 Ah sorry, this may be the thing I learned for the day. The issue is
 that classes from that particular artifact are missing though. Worth
 interrogating the resulting .jar file with jar tf to see if it made
 it in?

 On Wed, Jun 4, 2014 at 2:12 PM, Nick Pentreath nick.pentre...@gmail.com 
 wrote:
 @Sean, the %% syntax in SBT should automatically add the Scala major version
 qualifier (_2.10, _2.11 etc) for you, so that does appear to be correct
 syntax for the build.

 I seemed to run into this issue with some missing Jackson deps, and solved
 it by including the jar explicitly on the driver class path:

 bin/spark-submit --driver-class-path
 SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar --class SimpleApp
 SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar

 Seems redundant to me since I thought that the JAR as argument is copied to
 driver and made available. But this solved it for me so perhaps give it a
 try?



 On Wed, Jun 4, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote:

 Those aren't the names of the artifacts:


 http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-streaming-twitter_2.10%22

 The name is spark-streaming-twitter_2.10

 On Wed, Jun 4, 2014 at 1:49 PM, Jeremy Lee
 unorthodox.engine...@gmail.com wrote:
  Man, this has been hard going. Six days, and I finally got a Hello
  World
  App working that I wrote myself.
 
  Now I'm trying to make a minimal streaming app based on the twitter
  examples, (running standalone right now while learning) and when running
  it
  like this:
 
  bin/spark-submit --class SimpleApp
  SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar
 
  I'm getting this error:
 
  Exception in thread main java.lang.NoClassDefFoundError:
  org/apache/spark/streaming/twitter/TwitterUtils$
 
  Which I'm guessing is because I haven't put in a dependency on
  external/twitter in the .sbt, but _how_? I can't find any docs on it.
  Here's my build file so far:
 
  simple.sbt
  --
  name := "Simple Project"

  version := "1.0"

  scalaVersion := "2.10.4"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

  libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.0"

  libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"

  libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3"

  resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
  --
 
  I've tried a few obvious things like adding:
 
  libraryDependencies += "org.apache.spark" %% "spark-external" % "1.0.0"

  libraryDependencies += "org.apache.spark" %% "spark-external-twitter" %
  "1.0.0"
 
  because, well, that would match the naming scheme implied so far, but it
  errors.
 
 
  Also, I just realized I don't completely understand if:
  (a) the spark-submit command _sends_ the .jar to all the workers, or
  (b) the spark-submit commands sends a _job_ to the workers, which are
  supposed to already have the jar file installed (or in hdfs), or
  (c) the Context is supposed to list the jars to be distributed. (is that
  deprecated?)
 
  One part of the documentation says:
 
   Once you have an assembled jar you can call the bin/spark-submit
  script as
  shown here while passing your jar.
 
  but another says:
 
  application-jar: Path to a bundled jar including your application and
  all
  dependencies. The URL must be globally visible inside of your cluster,
  for
  instance, an hdfs:// path or a file:// path that is present on all
  nodes.
 
  I suppose both could be correct if you take a certain point of view.
 
  --
  Jeremy Lee  BCompSci(Hons)
The Unorthodox Engineers




Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Patrick Wendell
Hey Sam,

You mentioned two problems here, did your VPC error message get fixed
or only the key permissions problem?

I noticed we had some report a similar issue with the VPC stuff a long
time back (but there is no real resolution here):
https://spark-project.atlassian.net/browse/SPARK-1166

If that's still an issue, one thing to try is just changing the name
of the cluster. We create groups that are identified with the cluster
name, and there might be something that just got screwed up with the
original group creation and AWS isn't happy.

- Patrick



On Wed, Jun 4, 2014 at 12:55 PM, Sam Taylor Steyer sste...@stanford.edu wrote:
 Awesome, that worked. Thank you!

 - Original Message -
 From: Krishna Sankar ksanka...@gmail.com
 To: user@spark.apache.org
 Sent: Wednesday, June 4, 2014 12:52:00 PM
 Subject: Re: Trouble launching EC2 Cluster with Spark

 chmod 600 path/FinalKey.pem

 Cheers

 k/


 On Wed, Jun 4, 2014 at 12:49 PM, Sam Taylor Steyer sste...@stanford.edu
 wrote:

 Also, once my friend logged in to his cluster he received the error
 Permissions 0644 for 'FinalKey.pem' are too open. This sounds like the
 other problem described. How do we make the permissions more private?

 Thanks very much,
 Sam

 - Original Message -
 From: Sam Taylor Steyer sste...@stanford.edu
 To: user@spark.apache.org
 Sent: Wednesday, June 4, 2014 12:42:04 PM
 Subject: Re: Trouble launching EC2 Cluster with Spark

 Thank you! The regions advice solved the problem for my friend who was
 getting the key pair does not exist problem. I am still getting the error:

 ERROR:boto:400 Bad Request
 ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid
 value 'null' for protocol. VPC security group rules must specify protocols
 explicitly.</Message></Error></Errors><RequestID>7ff92687-b95a-4a39-94cb-e2d00a6928fd</RequestID></Response>

 This sounds like it could have to do with the access settings of the
 security group, but I don't know how to change. Any advice would be much
 appreciated!

 Sam

 - Original Message -
 From: Krishna Sankar ksanka...@gmail.com
 To: user@spark.apache.org
 Sent: Wednesday, June 4, 2014 8:52:59 AM
 Subject: Re: Trouble launching EC2 Cluster with Spark

 One reason could be that the keys are in a different region. Need to create
 the keys in us-east-1 (North Virginia).
 Cheers
 k/


 On Wed, Jun 4, 2014 at 7:45 AM, Sam Taylor Steyer sste...@stanford.edu
 wrote:

  Hi,
 
  I am trying to launch an EC2 cluster from spark using the following
  command:
 
  ./spark-ec2 -k HackerPair -i [path]/HackerPair.pem -s 2 launch
  HackerCluster
 
  I set my access key id and secret access key. I have been getting an
 error
  in the setting up security groups... phase:
 
  Invalid value 'null' for protocol. VPC security groups must specify
  protocols explicitly.
 
  My project partner gets one step further and then gets the error
 
  The key pair 'JamesAndSamTest' does not exist.
 
  Any thoughts as to how we could fix these problems? Thanks a lot!
  Sam
 



Re: spark 1.0 not using properties file from SPARK_CONF_DIR

2014-06-03 Thread Patrick Wendell
You can set an arbitrary properties file by adding --properties-file
argument to spark-submit. It would be nice to have spark-submit also
look in SPARK_CONF_DIR as well by default. If you opened a JIRA for
that I'm sure someone would pick it up.
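
For illustration only (the file name, class, and jar below are placeholders),
a properties file uses the same whitespace-separated key/value format as
spark-defaults.conf:

  spark.executor.memory   2g
  spark.serializer        org.apache.spark.serializer.KryoSerializer

and is passed like this:

  ./bin/spark-submit --properties-file conf/my-defaults.conf \
    --class MyApp my-app.jar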

On Tue, Jun 3, 2014 at 7:47 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote:
 Is it on purpose that when setting SPARK_CONF_DIR, spark-submit still loads
 the properties file from SPARK_HOME/conf/spark-defaults.conf?

 IMO it would be more natural to override what is defined in SPARK_HOME/conf
 by SPARK_CONF_DIR when defined (and SPARK_CONF_DIR being overridden by
 command line args).

 Eugen


Re: spark 1.0.0 on yarn

2014-06-02 Thread Patrick Wendell
Okay I'm guessing that our upstream Hadoop2 package isn't new
enough to work with CDH5. We should probably clarify this in our
downloads. Thanks for reporting this. What was the exact string you
used when building? Also which CDH-5 version are you building against?
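
For reference, a CDH5 build usually means pointing the build at the matching
Hadoop version, along the lines of the Maven invocation below; the exact
version string here is only an example and should match the cluster:

  mvn -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1 -DskipTests clean package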

On Mon, Jun 2, 2014 at 8:11 AM, Xu (Simon) Chen xche...@gmail.com wrote:
 OK, rebuilding the assembly jar file with cdh5 works now...
 Thanks..

 -Simon


 On Sun, Jun 1, 2014 at 9:37 PM, Xu (Simon) Chen xche...@gmail.com wrote:

 That helped a bit... Now I have a different failure: the start up process
 is stuck in an infinite loop outputting the following message:

 14/06/02 01:34:56 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:
 appMasterRpcPort: -1
 appStartTime: 1401672868277
 yarnAppState: ACCEPTED

 I am using the hadoop 2 prebuild package. Probably it doesn't have the
 latest yarn client.

 -Simon




 On Sun, Jun 1, 2014 at 9:03 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 As a debugging step, does it work if you use a single resource manager
 with the key yarn.resourcemanager.address instead of using two named
 resource managers? I wonder if somehow the YARN client can't detect
 this multi-master set-up.

 On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen xche...@gmail.com
 wrote:
  Note that everything works fine in spark 0.9, which is packaged in
  CDH5: I
  can launch a spark-shell and interact with workers spawned on my yarn
  cluster.
 
  So in my /opt/hadoop/conf/yarn-site.xml, I have:
  ...
  <property>
  <name>yarn.resourcemanager.address.rm1</name>
  <value>controller-1.mycomp.com:23140</value>
  </property>
  ...
  <property>
  <name>yarn.resourcemanager.address.rm2</name>
  <value>controller-2.mycomp.com:23140</value>
  </property>
  ...
 
  And the other usual stuff.
 
  So spark 1.0 is launched like this:
  Spark Command: java -cp
 
  ::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf
  -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
  org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client
  --class
  org.apache.spark.repl.Main
 
  I do see /opt/hadoop/conf included, but not sure it's the right
  place.
 
  Thanks..
  -Simon
 
 
 
  On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  I would agree with your guess, it looks like the yarn library isn't
  correctly finding your yarn-site.xml file. If you look in
  yarn-site.xml do you definitely the resource manager
  address/addresses?
 
  Also, you can try running this command with
  SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being
  set-up correctly.
 
  - Patrick
 
  On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com
  wrote:
   Hi all,
  
   I tried a couple ways, but couldn't get it to work..
  
   The following seems to be what the online document
   (http://spark.apache.org/docs/latest/running-on-yarn.html) is
   suggesting:
  
  
   SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
   YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client
  
   Help info of spark-shell seems to be suggesting --master yarn
   --deploy-mode
   cluster.
  
   But either way, I am seeing the following messages:
   14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager
   at
   /0.0.0.0:8032
   14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server:
   0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
   RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
   SECONDS)
   14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server:
   0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is
   RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
   SECONDS)
  
   My guess is that spark-shell is trying to talk to resource manager
   to
   setup
   spark master/worker nodes - I am not sure where 0.0.0.0:8032 came
   from
   though. I am running CDH5 with two resource managers in HA mode.
   Their
   IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both
   HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up.
  
   Any ideas? Thanks.
   -Simon
 
 





Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
Hey There,

The issue was that the old behavior could cause users to silently
overwrite data, which is pretty bad, so to be conservative we decided
to enforce the same checks that Hadoop does.

This was documented by this JIRA:
https://issues.apache.org/jira/browse/SPARK-1100
https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1

However, it would be very easy to add an option that allows preserving
the old behavior. Is anyone here interested in contributing that? I
created a JIRA for it:

https://issues.apache.org/jira/browse/SPARK-1993

- Patrick
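
Until such an option exists, a common workaround is to delete the output
directory yourself before saving. A minimal Scala sketch follows; the helper
name and the paths are just placeholders:

  import org.apache.hadoop.fs.Path
  import org.apache.spark.SparkContext

  // recursively delete outputDir if it already exists
  def deleteIfExists(sc: SparkContext, outputDir: String) {
    val path = new Path(outputDir)
    val fs = path.getFileSystem(sc.hadoopConfiguration)
    if (fs.exists(path)) fs.delete(path, true)
  }

  // usage:
  //   deleteIfExists(sc, "hdfs:///tmp/my-output")
  //   rdd.saveAsTextFile("hdfs:///tmp/my-output")

This mirrors the old overwrite behaviour explicitly, so nothing is deleted
silently.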

On Mon, Jun 2, 2014 at 9:22 AM, Pierre Borckmans
pierre.borckm...@realimpactanalytics.com wrote:
 Indeed, the behavior has changed for good or for bad. I mean, I agree with
 the danger you mention but I'm not sure it's happening like that. Isn't
 there a mechanism for overwrite in Hadoop that automatically removes part
 files, then writes a _temporary folder and then only the part files along
 with the _success folder?

 In any case this change of behavior should be documented IMO.

 Cheers
 Pierre

 Message sent from a mobile device - excuse typos and abbreviations

 On 2 June 2014 at 17:42, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 What I've found using saveAsTextFile() against S3 (prior to Spark 1.0.0) is
 that files get overwritten automatically. There is one danger to this though.
 If I save to a directory that already has 20 part- files, but this time
 around I'm only saving 15 part- files, then there will be 5 leftover part-
 files from the previous set mixed in with the 15 newer files. This is
 potentially dangerous.

 I haven't checked to see if this behavior has changed in 1.0.0. Are you
 saying it has, Pierre?

 On Mon, Jun 2, 2014 at 9:41 AM, Pierre B
 pierre.borckm...@realimpactanalytics.com
 wrote:

 Hi Michaël,

 Thanks for this. We could indeed do that.

 But I guess the question is more about the change of behaviour from 0.9.1
 to
 1.0.0.
 We never had to care about that in previous versions.

 Does that mean we have to manually remove existing files or is there a way
 to automatically overwrite when using saveAsTextFile?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-tp6696p6700.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
Thanks for pointing that out. I've assigned you to SPARK-1677 (I think
I accidentally assigned myself way back when I created it). This
should be an easy fix.

On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
 Hi, Patrick,

 I think https://issues.apache.org/jira/browse/SPARK-1677 is talking about
 the same thing?

 How about assigning it to me?

 I think I missed the configuration part in my previous commit, though I
 declared that in the PR description

 Best,

 --
 Nan Zhu

 On Monday, June 2, 2014 at 3:03 PM, Patrick Wendell wrote:

 Hey There,

 The issue was that the old behavior could cause users to silently
 overwrite data, which is pretty bad, so to be conservative we decided
 to enforce the same checks that Hadoop does.

 This was documented by this JIRA:
 https://issues.apache.org/jira/browse/SPARK-1100
 https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1

 However, it would be very easy to add an option that allows preserving
 the old behavior. Is anyone here interested in contributing that? I
 created a JIRA for it:

 https://issues.apache.org/jira/browse/SPARK-1993

 - Patrick

 On Mon, Jun 2, 2014 at 9:22 AM, Pierre Borckmans
 pierre.borckm...@realimpactanalytics.com wrote:

 Indeed, the behavior has changed for good or for bad. I mean, I agree with
 the danger you mention but I'm not sure it's happening like that. Isn't
 there a mechanism for overwrite in Hadoop that automatically removes part
 files, then writes a _temporary folder and then only the part files along
 with the _success folder?

 In any case this change of behavior should be documented IMO.

 Cheers
 Pierre

 Message sent from a mobile device - excuse typos and abbreviations

 On 2 June 2014 at 17:42, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 What I've found using saveAsTextFile() against S3 (prior to Spark 1.0.0) is
 that files get overwritten automatically. There is one danger to this though.
 If I save to a directory that already has 20 part- files, but this time
 around I'm only saving 15 part- files, then there will be 5 leftover part-
 files from the previous set mixed in with the 15 newer files. This is
 potentially dangerous.

 I haven't checked to see if this behavior has changed in 1.0.0. Are you
 saying it has, Pierre?

 On Mon, Jun 2, 2014 at 9:41 AM, Pierre B
 pierre.borckm...@realimpactanalytics.com
 wrote:


 Hi Michaël,

 Thanks for this. We could indeed do that.

 But I guess the question is more about the change of behaviour from 0.9.1
 to
 1.0.0.
 We never had to care about that in previous versions.

 Does that mean we have to manually remove existing files or is there a way
 to automatically overwrite when using saveAsTextFile?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-tp6696p6700.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



