Announcing Spark 1.4.1!
Hi All, I'm happy to announce the Spark 1.4.1 maintenance release. We recommend all users on the 1.4 branch upgrade to this release, which contains several important bug fixes. Download Spark 1.4.1 - http://spark.apache.org/downloads.html Release notes - http://spark.apache.org/releases/spark-release-1-4-1.html Comprehensive list of fixes - http://s.apache.org/spark-1.4.1 Thanks to the 85 developers who worked on this release! Please contact me directly for errata in the release notes. - Patrick - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
SparkHub: a new community site for Apache Spark
Hi All, Today, I'm happy to announce SparkHub (http://sparkhub.databricks.com), a service for the Apache Spark community to easily find the most relevant Spark resources on the web. SparkHub is a curated list of Spark news, videos and talks, package releases, upcoming events around the world, and a Spark Meetup directory to help you find a meetup close to you. We will continue to expand the site in the coming months and add more content. I hope SparkHub can help you find Spark related information faster and more easily than is currently possible. Everything is sourced from the Spark community, and we welcome input from you as well! - Patrick - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Fully in-memory shuffles
Hey Corey, Yes, when shuffles are smaller than the memory available to the OS, most often the outputs never get stored to disk. I believe the same holds for the YARN shuffle service, because the write path is actually the same, i.e. we don't fsync the writes and force them to disk. I would guess that in such shuffles the bottleneck is serializing the data rather than raw IO, so I'm not sure explicitly buffering the data in the JVM process would yield a large improvement. Writing shuffle output to an explicitly pinned in-memory filesystem is also possible (per Davies' suggestion), but it's brittle because the job will fail if the shuffle output exceeds memory. - Patrick On Wed, Jun 10, 2015 at 9:50 PM, Davies Liu dav...@databricks.com wrote: If you have enough memory, you can put the temporary work directory in tmpfs (an in-memory file system). On Wed, Jun 10, 2015 at 8:43 PM, Corey Nolet cjno...@gmail.com wrote: Ok so it is the case that small shuffles can be done without hitting any disk. Is this the same case for the aux shuffle service in yarn? Can that be done without hitting disk? On Wed, Jun 10, 2015 at 9:17 PM, Patrick Wendell pwend...@gmail.com wrote: In many cases the shuffle will actually hit the OS buffer cache and not ever touch spinning disk if it is a size that is less than memory on the machine. - Patrick On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet cjno...@gmail.com wrote: So with this... to help my understanding of Spark under the hood- Is this statement correct When data needs to pass between multiple JVMs, a shuffle will always hit disk? On Wed, Jun 10, 2015 at 10:11 AM, Josh Rosen rosenvi...@gmail.com wrote: There's a discussion of this at https://github.com/apache/spark/pull/5403 On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet cjno...@gmail.com wrote: Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
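A minimal sketch of the tmpfs approach Davies suggests: point spark.local.dir at an in-memory mount before creating the SparkContext. The /dev/shm/spark-tmp path is only an assumption about where a tmpfs mount exists on the machines, and as noted above the job will fail if the shuffle output outgrows the memory backing it.

    import org.apache.spark.{SparkConf, SparkContext}

    // Assumes a tmpfs (in-memory) mount exists at /dev/shm/spark-tmp on every node.
    // Shuffle and spill files written under spark.local.dir then never touch spinning disk,
    // but the job fails if they exceed the memory backing the mount.
    val conf = new SparkConf()
      .setAppName("in-memory-shuffle")
      .set("spark.local.dir", "/dev/shm/spark-tmp")
    val sc = new SparkContext(conf)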
[ANNOUNCE] Announcing Spark 1.4
Hi All, I'm happy to announce the availability of Spark 1.4.0! Spark 1.4.0 is the fifth release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 210 developers and more than 1,000 commits! A huge thanks goes to all of the individuals and organizations involved in development and testing of this release. Visit the release notes [1] to read about the new features, or download [2] the release today. For errata in the contributions or release notes, please e-mail me *directly* (not on-list). Thanks to everyone who helped work on this release! [1] http://spark.apache.org/releases/spark-release-1-4-0.html [2] http://spark.apache.org/downloads.html - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Fully in-memory shuffles
In many cases the shuffle will actually hit the OS buffer cache and not ever touch spinning disk if it is a size that is less than memory on the machine. - Patrick On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet cjno...@gmail.com wrote: So with this... to help my understanding of Spark under the hood- Is this statement correct When data needs to pass between multiple JVMs, a shuffle will always hit disk? On Wed, Jun 10, 2015 at 10:11 AM, Josh Rosen rosenvi...@gmail.com wrote: There's a discussion of this at https://github.com/apache/spark/pull/5403 On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet cjno...@gmail.com wrote: Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark timeout issue
Hi Deepak - please direct this to the user@ list. This list is for development of Spark itself. On Sun, Apr 26, 2015 at 12:42 PM, Deepak Gopalakrishnan dgk...@gmail.com wrote: Hello All, I'm trying to process a 3.5GB file on standalone mode using spark. I could run my spark job succesfully on a 100MB file and it works as expected. But, when I try to run it on the 3.5GB file, I run into the below error : 15/04/26 12:45:50 INFO BlockManagerMaster: Updated info of block taskresult_83 15/04/26 12:46:46 WARN AkkaUtils: Error sending message [message = Heartbeat(2,[Lscala.Tuple2;@790223d3,BlockManagerId(2, master.spark.com, 39143))] in 1 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427) 15/04/26 12:47:15 INFO MemoryStore: ensureFreeSpace(26227673) called with curMem=265897, maxMem=5556991426 15/04/26 12:47:15 INFO MemoryStore: Block taskresult_92 stored as bytes in memory (estimated size 25.0 MB, free 5.2 GB) 15/04/26 12:47:16 INFO MemoryStore: ensureFreeSpace(26272879) called with curMem=26493570, maxMem=5556991426 15/04/26 12:47:16 INFO MemoryStore: Block taskresult_94 stored as bytes in memory (estimated size 25.1 MB, free 5.1 GB) 15/04/26 12:47:18 INFO MemoryStore: ensureFreeSpace(26285327) called with curMem=52766449, maxMem=5556991426 and the job fails. I'm on AWS and have opened all ports. Also, since the 100MB file works, it should not be a connection issue. I've a r3 xlarge and 2 m3 large. Can anyone suggest a way to fix this? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Announcing Spark 1.3.1 and 1.2.2
Hi All, I'm happy to announce the Spark 1.3.1 and 1.2.2 maintenance releases. We recommend all users on the 1.3 and 1.2 Spark branches upgrade to these releases, which contain several important bug fixes. Download Spark 1.3.1 or 1.2.2: http://spark.apache.org/downloads.html Release notes: 1.3.1: http://spark.apache.org/releases/spark-release-1-3-1.html 1.2.2: http://spark.apache.org/releases/spark-release-1-2-2.html Comprehensive list of fixes: 1.3.1: http://s.apache.org/spark-1.3.1 1.2.2: http://s.apache.org/spark-1.2.2 Thanks to everyone who worked on these releases! - Patrick - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Configuring amount of disk space available to spark executors in mesos?
Hey Jonathan, Are you referring to disk space used for storing persisted RDDs? For that, Spark does not bound the amount of data persisted to disk. It's a similar story to how Spark's shuffle disk output works (Hadoop and other frameworks make the same assumption for their shuffle data, AFAIK). We could (in theory) add a storage level that bounds the amount of data persisted to disk and forces re-computation if a partition did not fit. I'd be interested to hear more about a workload where that's relevant, though, before going that route. Maybe if people are using SSDs that would make sense. - Patrick On Mon, Apr 13, 2015 at 8:19 AM, Jonathan Coveney jcove...@gmail.com wrote: I'm surprised that I haven't been able to find this via google, but I haven't... What is the setting that requests some amount of disk space for the executors? Maybe I'm misunderstanding how this is configured... Thanks for any help! - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
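As a concrete illustration of the current behavior, a sketch of persisting with a disk-backed storage level: partitions that do not fit in memory spill to the executor's local directories, and Spark places no bound on how much disk that uses. The input path is illustrative.

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("hdfs:///some/large/input")   // illustrative path
      .persist(StorageLevel.MEMORY_AND_DISK)             // overflow partitions spill to local disk, unbounded
    data.count()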
Re: RDD resiliency -- does it keep state?
If you invoke this, you will get at-least-once semantics on failure. For instance, if a machine dies in the middle of executing the foreach for a single partition, that will be re-executed on another machine. It could even fully complete on one machine, but the machine dies immediately before reporting the result back to the driver. This means you need to make sure the side-effects are idempotent, or use some transactional locking. Spark's own output operations, such as saving to Hadoop, use such mechanisms. For instance, in the case of Hadoop it uses the OutputCommitter classes. - Patrick On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos michal.klo...@gmail.com wrote: Hi Spark group, We haven't been able to find clear descriptions of how Spark handles the resiliency of RDDs in relationship to executing actions with side-effects. If you do an `rdd.foreach(someSideEffect)`, then you are doing a side-effect for each element in the RDD. If a partition goes down -- the resiliency rebuilds the data, but did it keep track of how far it go in the partition's set of data or will it start from the beginning again. So will it do at-least-once execution of foreach closures or at-most-once? thanks, Michal - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
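A minimal sketch of the idempotence point above: key the side-effecting writes by a deterministic id so a re-executed partition overwrites rather than duplicates. KeyValueStore and record.id are hypothetical placeholders for whatever external system and record schema you use, not Spark APIs.

    rdd.foreachPartition { records =>
      val store = KeyValueStore.connect()      // hypothetical external client, not a Spark API
      try {
        records.foreach { record =>
          store.put(record.id, record)         // deterministic key => a retry overwrites instead of duplicating
        }
      } finally {
        store.close()
      }
    }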
Re: Spark 1.3 Source - Github and source tar does not seem to match
The source code should match the Spark commit 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc. Do you see any differences? On Fri, Mar 27, 2015 at 11:28 AM, Manoj Samel manojsamelt...@gmail.com wrote: While looking into a issue, I noticed that the source displayed on Github site does not matches the downloaded tar for 1.3 Thoughts ? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: RDD.map does not allowed to preservesPartitioning?
I think we have a version of mapPartitions that allows you to tell Spark the partitioning is preserved: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L639 We could also add a map function that does same. Or you can just write your map using an iterator. - Patrick On Thu, Mar 26, 2015 at 3:07 PM, Jonathan Coveney jcove...@gmail.com wrote: This is just a deficiency of the api, imo. I agree: mapValues could definitely be a function (K, V)=V1. The option isn't set by the function, it's on the RDD. So you could look at the code and do this. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala def mapValues[U](f: V = U): RDD[(K, U)] = { val cleanF = self.context.clean(f) new MapPartitionsRDD[(K, U), (K, V)](self, (context, pid, iter) = iter.map { case (k, v) = (k, cleanF(v)) }, preservesPartitioning = true) } What you want: def mapValues[U](f: (K, V) = U): RDD[(K, U)] = { val cleanF = self.context.clean(f) new MapPartitionsRDD[(K, U), (K, V)](self, (context, pid, iter) = iter.map { case t@(k, _) = (k, cleanF(t)) }, preservesPartitioning = true) } One of the nice things about spark is that making such new operators is very easy :) 2015-03-26 17:54 GMT-04:00 Zhan Zhang zzh...@hortonworks.com: Thanks Jonathan. You are right regarding rewrite the example. I mean providing such option to developer so that it is controllable. The example may seems silly, and I don't know the use cases. But for example, if I also want to operate both the key and value part to generate some new value with keeping key part untouched. Then mapValues may not be able to do this. Changing the code to allow this is trivial, but I don't know whether there is some special reason behind this. Thanks. Zhan Zhang On Mar 26, 2015, at 2:49 PM, Jonathan Coveney jcove...@gmail.com wrote: I believe if you do the following: sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString (8) MapPartitionsRDD[34] at reduceByKey at console:23 [] | MapPartitionsRDD[33] at mapValues at console:23 [] | ShuffledRDD[32] at reduceByKey at console:23 [] +-(8) MapPartitionsRDD[31] at map at console:23 [] | ParallelCollectionRDD[30] at parallelize at console:23 [] The difference is that spark has no way to know that your map closure doesn't change the key. if you only use mapValues, it does. Pretty cool that they optimized that :) 2015-03-26 17:44 GMT-04:00 Zhan Zhang zzh...@hortonworks.com: Hi Folks, Does anybody know what is the reason not allowing preserverPartitioning in RDD.map? Do I miss something here? Following example involves two shuffles. I think if preservePartitioning is allowed, we can avoid the second one, right? val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)) val r2 = r1.map((_, 1)) val r3 = r2.reduceByKey(_+_) val r4 = r3.map(x=(x._1, x._2 + 1)) val r5 = r4.reduceByKey(_+_) r5.collect.foreach(println) scala r5.toDebugString res2: String = (8) ShuffledRDD[4] at reduceByKey at console:29 [] +-(8) MapPartitionsRDD[3] at map at console:27 [] | ShuffledRDD[2] at reduceByKey at console:25 [] +-(8) MapPartitionsRDD[1] at map at console:23 [] | ParallelCollectionRDD[0] at parallelize at console:21 [] Thanks. Zhan Zhang - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
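A sketch of the "write your map using an iterator" option, reusing the example from the thread: the key-preserving map is expressed with mapPartitions and preservesPartitioning = true, so the second reduceByKey can reuse the existing partitioning instead of shuffling again.

    val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4))
    val r3 = r1.map((_, 1)).reduceByKey(_ + _)
    val r4 = r3.mapPartitions(
      iter => iter.map { case (k, v) => (k, v + 1) },   // key left untouched
      preservesPartitioning = true)
    val r5 = r4.reduceByKey(_ + _)                      // reuses r3's partitioning, no extra shuffle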
Re: 1.3 Hadoop File System problem
Hey Jim, Thanks for reporting this. Can you give a small end-to-end code example that reproduces it? If so, we can definitely fix it. - Patrick On Tue, Mar 24, 2015 at 4:55 PM, Jim Carroll jimfcarr...@gmail.com wrote: I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to find the s3 hadoop file system. I get the java.lang.IllegalArgumentException: Wrong FS: s3://path to my file], expected: file:/// when I try to save a parquet file. This worked in 1.2.1. Has anyone else seen this? I'm running spark using local[8] so it's all internal. These are actually unit tests in our app that are failing now. Thanks. Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/1-3-Hadoop-File-System-problem-tp22207.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: DataFrame operation on parquet: GC overhead limit exceeded
Hey Yiannis, If you just perform a count on each name, date pair... can it succeed? If so, can you do a count and then order by to find the largest one? I'm wondering if there is a single pathologically large group here that is somehow causing OOM. Also, to be clear, you are getting GC limit warnings on the executors, not the driver. Correct? - Patrick On Mon, Mar 23, 2015 at 10:21 AM, Martin Goodson mar...@skimlinks.com wrote: Have you tried to repartition() your original data to make more partitions before you aggregate? -- Martin Goodson | VP Data Science (0)20 3397 1240 [image: Inline image 1] On Mon, Mar 23, 2015 at 4:12 PM, Yiannis Gkoufas johngou...@gmail.com wrote: Hi Yin, Yes, I have set spark.executor.memory to 8g and the worker memory to 16g without any success. I cannot figure out how to increase the number of mapPartitions tasks. Thanks a lot On 20 March 2015 at 18:44, Yin Huai yh...@databricks.com wrote: spark.sql.shuffle.partitions only control the number of tasks in the second stage (the number of reducers). For your case, I'd say that the number of tasks in the first state (number of mappers) will be the number of files you have. Actually, have you changed spark.executor.memory (it controls the memory for an executor of your application)? I did not see it in your original email. The difference between worker memory and executor memory can be found at ( http://spark.apache.org/docs/1.3.0/spark-standalone.html), SPARK_WORKER_MEMORY Total amount of memory to allow Spark applications to use on the machine, e.g. 1000m, 2g (default: total memory minus 1 GB); note that each application's individual memory is configured using its spark.executor.memory property. On Fri, Mar 20, 2015 at 9:25 AM, Yiannis Gkoufas johngou...@gmail.com wrote: Actually I realized that the correct way is: sqlContext.sql(set spark.sql.shuffle.partitions=1000) but I am still experiencing the same behavior/error. On 20 March 2015 at 16:04, Yiannis Gkoufas johngou...@gmail.com wrote: Hi Yin, the way I set the configuration is: val sqlContext = new org.apache.spark.sql.SQLContext(sc) sqlContext.setConf(spark.sql.shuffle.partitions,1000); it is the correct way right? In the mapPartitions task (the first task which is launched), I get again the same number of tasks and again the same error. :( Thanks a lot! On 19 March 2015 at 17:40, Yiannis Gkoufas johngou...@gmail.com wrote: Hi Yin, thanks a lot for that! Will give it a shot and let you know. On 19 March 2015 at 16:30, Yin Huai yh...@databricks.com wrote: Was the OOM thrown during the execution of first stage (map) or the second stage (reduce)? If it was the second stage, can you increase the value of spark.sql.shuffle.partitions and see if the OOM disappears? This setting controls the number of reduces Spark SQL will use and the default is 200. Maybe there are too many distinct values and the memory pressure on every task (of those 200 reducers) is pretty high. You can start with 400 and increase it until the OOM disappears. Hopefully this will help. Thanks, Yin On Wed, Mar 18, 2015 at 4:46 PM, Yiannis Gkoufas johngou...@gmail.com wrote: Hi Yin, Thanks for your feedback. I have 1700 parquet files, sized 100MB each. The number of tasks launched is equal to the number of parquet files. Do you have any idea on how to deal with this situation? Thanks a lot On 18 Mar 2015 17:35, Yin Huai yh...@databricks.com wrote: Seems there are too many distinct groups processed in a task, which trigger the problem. How many files do your dataset have and how large is a file? 
Seems your query will be executed with two stages, table scan and map-side aggregation in the first stage and the final round of reduce-side aggregation in the second stage. Can you take a look at the numbers of tasks launched in these two stages? Thanks, Yin On Wed, Mar 18, 2015 at 11:42 AM, Yiannis Gkoufas johngou...@gmail.com wrote: Hi there, I set the executor memory to 8g but it didn't help On 18 March 2015 at 13:59, Cheng Lian lian.cs@gmail.com wrote: You should probably increase executor memory by setting spark.executor.memory. Full list of available configurations can be found here http://spark.apache.org/docs/latest/configuration.html Cheng On 3/18/15 9:15 PM, Yiannis Gkoufas wrote: Hi there, I was trying the new DataFrame API with some basic operations on a parquet dataset. I have 7 nodes of 12 cores and 8GB RAM allocated to each worker in a standalone cluster mode. The code is the following: val people = sqlContext.parquetFile(/data.parquet); val res = people.groupBy(name,date). agg(sum(power),sum(supply)).take(10); System.out.println(res); The dataset consists of 16 billion entries. The error I get is java.lang.OutOfMemoryError: GC overhead limit exceeded My configuration is: spark.serializer org.apache.spark.serializer.KryoSerializer
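A sketch of the diagnostic Patrick suggests above: count the rows in each name, date group and look at the largest ones, to see whether a single pathological group is behind the OOM. Column names follow the example in the thread; the aggregate column is assumed to be named "count" as produced by the Spark 1.3 DataFrame API.

    val people = sqlContext.parquetFile("/data.parquet")
    val groupSizes = people.groupBy("name", "date").count()   // one row per group with its size
    groupSizes.orderBy(groupSizes("count").desc).show(20)     // largest groups first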
[ANNOUNCE] Announcing Spark 1.3!
Hi All, I'm happy to announce the availability of Spark 1.3.0! Spark 1.3.0 is the fourth release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 172 developers and more than 1,000 commits! Visit the release notes [1] to read about the new features, or download [2] the release today. For errata in the contributions or release notes, please e-mail me *directly* (not on-list). Thanks to everyone who helped work on this release! [1] http://spark.apache.org/releases/spark-release-1-3-0.html [2] http://spark.apache.org/downloads.html - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to set per-user spark.local.dir?
We don't support expressions or wildcards in that configuration. For each application, the local directories need to be constant. If you have users submitting different Spark applications, each of those can set spark.local.dir. - Patrick On Wed, Mar 11, 2015 at 12:14 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I need to set per-user spark.local.dir, how can I do that? I tried both /x/home/${user.name}/spark/tmp and /x/home/${USER}/spark/tmp And neither worked. Looks like it has to be a constant setting in spark-defaults.conf. Right? Any ideas how to do that? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
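A sketch of the per-application approach: since Spark does not expand expressions like ${user.name}, the application (or its launch script) builds the path itself before the SparkContext starts. The directory layout mirrors the one in the question.

    import org.apache.spark.{SparkConf, SparkContext}

    // Expansion happens in application code, not in spark-defaults.conf.
    val conf = new SparkConf()
      .setAppName("per-user-local-dir")
      .set("spark.local.dir", s"/x/home/${sys.props("user.name")}/spark/tmp")
    val sc = new SparkContext(conf)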
Re: Spark v1.2.1 failing under BigTop build in External Flume Sink (due to missing Netty library)
You may need to add the -Phadoop-2.4 profile. When building or release packages for Hadoop 2.4 we use the following flags: -Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn - Patrick On Thu, Mar 5, 2015 at 12:47 PM, Kelly, Jonathan jonat...@amazon.com wrote: I confirmed that this has nothing to do with BigTop by running the same mvn command directly in a fresh clone of the Spark package at the v1.2.1 tag. I got the same exact error. Jonathan Kelly Elastic MapReduce - SDE Port 99 (SEA35) 08.220.C2 From: Kelly, Jonathan Kelly jonat...@amazon.com Date: Thursday, March 5, 2015 at 10:39 AM To: user@spark.apache.org user@spark.apache.org Subject: Spark v1.2.1 failing under BigTop build in External Flume Sink (due to missing Netty library) I'm running into an issue building Spark v1.2.1 (as well as the latest in branch-1.2 and v1.3.0-rc2 and the latest in branch-1.3) with BigTop (v0.9, which is not quite released yet). The build fails in the External Flume Sink subproject with the following error: [INFO] Compiling 5 Scala sources and 3 Java sources to /workspace/workspace/bigtop.spark-rpm/build/spark/rpm/BUILD/spark-1.3.0/external/flume-sink/target/scala-2.10/classes... [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing with a stub. [ERROR] error while loading NettyServer, class file '/home/ec2-user/.m2/repository/org/apache/avro/avro-ipc/1.7.6/avro-ipc-1.7.6.jar(org/apache/avro/ipc/NettyServer.class)' is broken (class java.lang.NullPointerException/null) [WARNING] one warning found [ERROR] one error found It seems like what is happening is that the Netty library is missing at build time, which happens because it is explicitly excluded in the pom.xml (see https://github.com/apache/spark/blob/v1.2.1/external/flume-sink/pom.xml#L42). I attempted removing the exclusions and the explicit re-add for the test scope on lines 77-88, and that allowed the build to succeed, though I don't know if that will cause problems at runtime. I don't have any experience with the Flume Sink, so I don't really know how to test it. (And, to be clear, I'm not necessarily trying to get the Flume Sink to work-- I just want the project to build successfully, though of course I'd still want the Flume Sink to work for whomever does need it.) Does anybody have any idea what's going on here? Here is the command BigTop is running to build Spark: mvn -Pbigtop-dist -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl -Divy.home=/home/ec2-user/.ivy2 -Dsbt.ivy.home=/home/ec2-user/.ivy2 -Duser.home=/home/ec2-user -Drepo.maven.org= -Dreactor.repo=file:///home/ec2-user/.m2/repository -Dhadoop.version=2.4.0-amzn-3-SNAPSHOT -Dyarn.version=2.4.0-amzn-3-SNAPSHOT -Dprotobuf.version=2.5.0 -Dscala.version=2.10.3 -Dscala.binary.version=2.10 -DskipTests -DrecompileMode=all install As I mentioned above, if I switch to the latest in branch-1.2, to v1.3.0-rc2, or to the latest in branch-1.3, I get the same exact error. I was not getting the error with Spark v1.1.0, though there weren't any changes to the external/flume-sink/pom.xml between v1.1.0 and v1.2.1. ~ Jonathan Kelly - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Is SPARK_CLASSPATH really deprecated?
I think we just need to update the docs; it is a bit unclear right now. At the time, we worded it fairly sternly because we really wanted people to use --jars when we deprecated SPARK_CLASSPATH. But there are other types of deployments where there is a legitimate need to augment the classpath of every executor. I think it should probably say something more like Extra classpath entries to append to the classpath of executors. This is sometimes used in deployment environments where dependencies of Spark are present in a specific place on all nodes. Kannan - if you want to submit a patch I can help review it. On Thu, Feb 26, 2015 at 8:24 PM, Kannan Rajah kra...@maprtech.com wrote: Thanks Marcelo. Do you think it would be useful to make spark.executor.extraClassPath pick up some environment variable that can be set from spark-env.sh? Here is an example. spark-env.sh -- executor_extra_cp = get_hbase_jars_for_cp export executor_extra_cp spark-defaults.conf - spark.executor.extraClassPath = ${executor_extra_cp} This will let us add logic inside the get_hbase_jars_for_cp function to pick the right version of the hbase jars. There could be multiple versions installed on the node. -- Kannan On Thu, Feb 26, 2015 at 6:08 PM, Marcelo Vanzin van...@cloudera.com wrote: On Thu, Feb 26, 2015 at 5:12 PM, Kannan Rajah kra...@maprtech.com wrote: Also, I would like to know if there is a localization overhead when we use spark.executor.extraClassPath. Again, in the case of hbase, these jars would typically be available on all nodes. So there is no need to localize them from the node where the job was submitted. I am wondering if we use the SPARK_CLASSPATH approach, then it would not do localization. That would be an added benefit. Please clarify. spark.executor.extraClassPath doesn't localize anything. It just prepends those classpath entries to the usual classpath used to launch the executor. There's no copying of files or anything, so they're expected to exist on the nodes. It's basically exactly the same as SPARK_CLASSPATH, but broken down into two options (one for the executors, and one for the driver). -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Add PredictionIO to Powered by Spark
Added - thanks! I trimmed it down a bit to fit our normal description length. On Mon, Jan 5, 2015 at 8:24 AM, Thomas Stone tho...@prediction.io wrote: Please can we add PredictionIO to https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark PredictionIO http://prediction.io/ PredictionIO is an open source machine learning server for software developers to easily build and deploy predictive applications on production. PredictionIO currently offers two engine templates for Apache Spark MLlib for recommendation (MLlib ALS) and classification (MLlib Naive Bayes). With these templates, you can create a custom predictive engine for production deployment efficiently. A standard PredictionIO stack is built on top of solid open source technology, such as Scala, Apache Spark, HBase and Elasticsearch. We are already featured on https://databricks.com/certified-on-spark Kind regards and Happy New Year! Thomas -- This page tracks the users of Spark. To add yourself to the list, please email user@spark.apache.org with your organization name, URL, a list of which Spark components you are using, and a short description of your use case. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Can you add Big Industries to the Powered by Spark page?
I've added it, thanks! On Fri, Feb 20, 2015 at 12:22 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, Could you please add Big Industries to the Powered by Spark page at https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark ? Company Name: Big Industries URL: http://www.bigindustries.be/ Spark Components: Spark Streaming Use Case: Big Content Platform Summary: The Big Content Platform is a business-to-business content asset management service providing a searchable, aggregated source of live news feeds, public domain media and archives of content. The platform is founded on Apache Hadoop, uses the HDFS filesystem, Apache Spark, Titan Distributed Graph Database, HBase, and Solr. Additionally, the platform leverages public datasets like Freebase, DBpedia, Wiktionary, and Geonames to support semantic text enrichment. Kind regards, Emre Sevinç http://www.bigindustries.be/ - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: FW: Submitting jobs to Spark EC2 cluster remotely
What happens if you submit from the master node itself on ec2 (in client mode), does that work? What about in cluster mode? It would be helpful if you could print the full command that the executor is failing. That might show that spark.driver.host is being set strangely. IIRC we print the launch command before starting the executor. Overall the standalone cluster mode is not as well tested across environments with asymmetric connectivity. I didn't actually realize that akka (which the submission uses) can handle this scenario. But it does seem like the job is submitted, it's just not starting correctly. - Patrick On Mon, Feb 23, 2015 at 1:13 AM, Oleg Shirokikh o...@solver.com wrote: Patrick, I haven't changed the configs much. I just executed ec2-script to create 1 master, 2 slaves cluster. Then I try to submit the jobs from remote machine leaving all defaults configured by Spark scripts as default. I've tried to change configs as suggested in other mailing-list and stack overflow threads (such as setting spark.driver.host, etc...), removed (hopefully) all security/firewall restrictions from AWS, etc. but it didn't help. I think that what you are saying is exactly the issue: on my master node UI at the bottom I can see the list of Completed Drivers all with ERROR state... Thanks, Oleg -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Monday, February 23, 2015 12:59 AM To: Oleg Shirokikh Cc: user@spark.apache.org Subject: Re: Submitting jobs to Spark EC2 cluster remotely Can you list other configs that you are setting? It looks like the executor can't communicate back to the driver. I'm actually not sure it's a good idea to set spark.driver.host here, you want to let spark set that automatically. - Patrick On Mon, Feb 23, 2015 at 12:48 AM, Oleg Shirokikh o...@solver.com wrote: Dear Patrick, Thanks a lot for your quick response. Indeed, following your advice I've uploaded the jar onto S3 and FileNotFoundException is gone now and job is submitted in cluster deploy mode. However, now both (client and cluster) fail with the following errors in executors (they keep exiting/killing executors as I see in UI): 15/02/23 08:42:46 ERROR security.UserGroupInformation: PriviledgedActionException as:oleg cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Full log is: 15/02/23 01:59:11 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 15/02/23 01:59:12 INFO spark.SecurityManager: Changing view acls to: root,oleg 15/02/23 01:59:12 INFO spark.SecurityManager: Changing modify acls to: root,oleg 15/02/23 01:59:12 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, oleg); users with modify permissions: Set(root, oleg) 15/02/23 01:59:12 INFO slf4j.Slf4jLogger: Slf4jLogger started 15/02/23 01:59:12 INFO Remoting: Starting remoting 15/02/23 01:59:13 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@ip-172-31-33-194.us-west-2.compute.int ernal:39379] 15/02/23 01:59:13 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 39379. 
15/02/23 01:59:43 ERROR security.UserGroupInformation: PriviledgedActionException as:oleg cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread main java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrai nedExecutorBackend.scala) Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) ... 4 more Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107
Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets
The map will start with a capacity of 64, but will grow to accommodate new data. Are you using the groupBy operator in Spark or are you using Spark SQL's group by? This usually happens if you are grouping or aggregating in a way that doesn't sufficiently condense the data created from each input partition. - Patrick On Wed, Feb 11, 2015 at 9:37 PM, fightf...@163.com fightf...@163.com wrote: Hi, Really have no adequate solution got for this issue. Expecting any available analytical rules or hints. Thanks, Sun. fightf...@163.com From: fightf...@163.com Date: 2015-02-09 11:56 To: user; dev Subject: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets Hi, Problem still exists. Any experts would take a look at this? Thanks, Sun. fightf...@163.com From: fightf...@163.com Date: 2015-02-06 17:54 To: user; dev Subject: Sort Shuffle performance issues about using AppendOnlyMap for large data sets Hi, all Recently we had caught performance issues when using spark 1.2.0 to read data from hbase and do some summary work. Our scenario means to : read large data sets from hbase (maybe 100G+ file) , form hbaseRDD, transform to schemardd, groupby and aggregate the data while got fewer new summary data sets, loading data into hbase (phoenix). Our major issue lead to : aggregate large datasets to get summary data sets would consume too long time (1 hour +) , while that should be supposed not so bad performance. We got the dump file attached and stacktrace from jstack like the following: From the stacktrace and dump file we can identify that processing large datasets would cause frequent AppendOnlyMap growing, and leading to huge map entrysize. We had referenced the source code of org.apache.spark.util.collection.AppendOnlyMap and found that the map had been initialized with capacity of 64. That would be too small for our use case. So the question is : Does anyone had encounted such issues before? How did that be resolved? I cannot find any jira issues for such problems and if someone had seen, please kindly let us know. More specified solution would goes to : Does any possibility exists for user defining the map capacity releatively in spark? If so, please tell how to achieve that. Best Thanks, Sun. 
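A sketch of what "condensing the data created from each input partition" can look like in practice: aggregate with reduceByKey, which combines map-side, instead of collecting all raw values per key, so each partition emits already-summarized records and the shuffle-side AppendOnlyMap stays small. hbaseRDD, extractKey and extractMetric are placeholders for the HBase read and field extraction described in the thread.

    // Placeholders: hbaseRDD is the RDD read from HBase; extractKey/extractMetric pull out
    // the grouping key and the numeric field being summarized.
    val summary = hbaseRDD
      .map(row => (extractKey(row), extractMetric(row)))
      .reduceByKey(_ + _)      // map-side combining keeps per-partition state small before the shuffle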
Thread 22432: (state = IN_JAVA) - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, line=224 (Compiled frame; information may be imprecise) - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() @bci=1, line=38 (Interpreted frame) - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, line=198 (Compiled frame) - org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=201, line=145 (Compiled frame) - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=3, line=32 (Compiled frame) - org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator) @bci=141, line=205 (Compiled frame) - org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator) @bci=74, line=58 (Interpreted frame) - org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=169, line=68 (Interpreted frame) - org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=2, line=41 (Interpreted frame) - org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted frame) - org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame) Thread 22431: (state = IN_JAVA) - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, line=224 (Compiled frame; information may be imprecise) - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() @bci=1, line=38 (Interpreted frame) - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, line=198 (Compiled frame) - org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=201, line=145 (Compiled frame) - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=3, line=32 (Compiled frame) - org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator) @bci=141, line=205 (Compiled frame) - org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator) @bci=74, line=58 (Interpreted frame) - org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
Re: feeding DataFrames into predictive algorithms
I think there is a minor error here in that the first example needs a tail after the seq: df.map { row = (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double])) }.toDataFrame(label, features) On Wed, Feb 11, 2015 at 7:46 PM, Michael Armbrust mich...@databricks.com wrote: It sounds like you probably want to do a standard Spark map, that results in a tuple with the structure you are looking for. You can then just assign names to turn it back into a dataframe. Assuming the first column is your label and the rest are features you can do something like this: val df = sc.parallelize( (1.0, 2.3, 2.4) :: (1.2, 3.4, 1.2) :: (1.2, 2.3, 1.2) :: Nil).toDataFrame(a, b, c) df.map { row = (row.getDouble(0), row.toSeq.map(_.asInstanceOf[Double])) }.toDataFrame(label, features) df: org.apache.spark.sql.DataFrame = [label: double, features: arraydouble] If you'd prefer to stick closer to SQL you can define a UDF: val createArray = udf((a: Double, b: Double) = Seq(a, b)) df.select('a as 'label, createArray('b,'c) as 'features) df: org.apache.spark.sql.DataFrame = [label: double, features: arraydouble] We'll add createArray as a first class member of the DSL. Michael On Wed, Feb 11, 2015 at 6:37 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hey All, I've been playing around with the new DataFrame and ML pipelines APIs and am having trouble accomplishing what seems like should be a fairly basic task. I have a DataFrame where each column is a Double. I'd like to turn this into a DataFrame with a features column and a label column that I can feed into a regression. So far all the paths I've gone down have led me to internal APIs or convoluted casting in and out of RDD[Row] and DataFrame. Is there a simple way of accomplishing this? any assistance (lookin' at you Xiangrui) much appreciated, Sandy - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
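Putting Patrick's correction into the first example from the thread, assuming a SQLContext and its implicits are in scope; toDataFrame is the name used in the thread, which predates the final 1.3 API.

    val df = sc.parallelize(
      (1.0, 2.3, 2.4) ::
      (1.2, 3.4, 1.2) ::
      (1.2, 2.3, 1.2) :: Nil).toDataFrame("a", "b", "c")

    df.map { row =>
      (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double]))   // tail drops the label from the features
    }.toDataFrame("label", "features")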
Re: Questions about Spark standalone resource scheduler
Hey Jerry, I think standalone mode will still add more features over time, but the goal isn't really for it to become equivalent to what Mesos/YARN are today. Or at least, I doubt Spark Standalone will ever attempt to manage _other_ frameworks outside of Spark and become a general purpose resource manager. In terms of having better support for multi tenancy, meaning multiple *Spark* instances, this is something I think could be in scope in the future. For instance, we added H/A to the standalone scheduler a while back, because it let us support H/A streaming apps in a totally native way. It's a trade off of adding new features and keeping the scheduler very simple and easy to use. We've tended to bias towards simplicity as the main goal, since this is something we want to be really easy out of the box. One thing to point out, a lot of people use the standalone mode with some coarser grained scheduler, such as running in a cloud service. In this case they really just want a simple inner cluster manager. This may even be the majority of all Spark installations. This is slightly different than Hadoop environments, where they might just want nice integration into the existing Hadoop stack via something like YARN. - Patrick On Mon, Feb 2, 2015 at 12:24 AM, Shao, Saisai saisai.s...@intel.com wrote: Hi all, I have some questions about the future development of Spark's standalone resource scheduler. We've heard some users have the requirements to have multi-tenant support in standalone mode, like multi-user management, resource management and isolation, whitelist of users. Seems current Spark standalone do not support such kind of functionalities, while resource schedulers like Yarn offers such kind of advanced managements, I'm not sure what's the future target of standalone resource scheduler, will it only target on simple implementation, and for advanced usage shift to YARN? Or will it plan to add some simple multi-tenant related functionalities? Thanks a lot for your comments. BR Jerry - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Bouncing Mails
Akhil, Those are handled by ASF infrastructure, not anyone in the Spark project. So this list is not the appropriate place to ask for help. - Patrick On Sat, Jan 17, 2015 at 12:56 AM, Akhil Das ak...@sigmoidanalytics.com wrote: My mails to the mailing list are getting rejected, have opened a Jira issue, can someone take a look at it? https://issues.apache.org/jira/browse/INFRA-9032 Thanks Best Regards - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Accumulator value in Spark UI
It should appear in the page for any stage in which accumulators are updated. On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip yipjus...@prediction.io wrote: Hello, From accumulator documentation, it says that if the accumulator is named, it will be displayed in the WebUI. However, I cannot find it anywhere. Do I need to specify anything in the spark ui config? Thanks. Justin - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
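A minimal example of a named accumulator; the name is what makes it appear on a stage's page once tasks have updated it (the numbers here are arbitrary).

    val errorCount = sc.accumulator(0L, "error count")   // the name is shown in the web UI
    sc.parallelize(1 to 1000).foreach { i =>
      if (i % 100 == 0) errorCount += 1L
    }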
Re: Long-running job cleanup
What do you mean when you say the overhead of spark shuffles starts to accumulate? Could you elaborate more? In newer versions of Spark, shuffle data is cleaned up automatically when an RDD goes out of scope. It is safe to remove shuffle data at this point because the RDD can no longer be referenced. If you are seeing a large build-up of shuffle data, it's possible you are retaining references to older RDDs inadvertently. Could you explain what your job is actually doing? - Patrick On Mon, Dec 22, 2014 at 2:36 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi all, I have a long running job iterating over a huge dataset. Parts of this operation are cached. Since the job runs for so long, eventually the overhead of spark shuffles starts to accumulate, culminating in the driver starting to swap. I am aware of the spark.cleaner.ttl parameter that allows me to configure when cleanup happens, but the issue with doing this is that it isn't done safely, e.g. I can be in the middle of processing a stage when this cleanup happens and my cached RDDs get cleared. This ultimately causes a KeyNotFoundException when I try to reference the now cleared cached RDD. This behavior doesn't make much sense to me, I would expect the cached RDD to either get regenerated or at the very least for there to be an option to execute this cleanup without deleting those RDDs. Is there a programmatically safe way of doing this cleanup that doesn't break everything? If I instead tear down the spark context and bring up a new context for every iteration (assuming that each iteration is sufficiently long-lived), would memory get released appropriately? The information contained in this e-mail is confidential and/or proprietary to Capital One and/or its affiliates. The information transmitted herewith is intended only for use by the individual or entity to which it is addressed. If the reader of this message is not the intended recipient, you are hereby notified that any review, retransmission, dissemination, distribution, copying or other use of, or taking of any action in reliance upon this information is strictly prohibited. If you have received this communication in error, please contact the sender and delete the material from your computer. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
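A sketch of the "references going out of scope" point for an iterative job: keep only the current iteration's RDD referenced so older shuffle data becomes eligible for automatic cleanup. The input RDD and the per-iteration transformation are placeholders.

    import org.apache.spark.SparkContext._      // pair-RDD functions, needed on older Spark versions

    var current = initialRdd                    // placeholder: an RDD[(String, Long)]
    for (i <- 1 to 100) {
      val next = current.mapValues(_ + 1).reduceByKey(_ + _).cache()   // placeholder iteration step
      next.count()                              // materialize before dropping the old reference
      current.unpersist(blocking = false)       // release the previous iteration's cached blocks
      current = next                            // the old RDD is now unreferenced, so its shuffle files can be cleaned
    }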
Re: action progress in ipython notebook?
Hey Eric, I'm just curious - which specific features in 1.2 do you find most help with usability? This is a theme we're focusing on for 1.3 as well, so it's helpful to hear what makes a difference. - Patrick On Sun, Dec 28, 2014 at 1:36 AM, Eric Friedman eric.d.fried...@gmail.com wrote: Hi Josh, Thanks for the informative answer. Sounds like one should await your changes in 1.3. As information, I found the following set of options for doing the visual in a notebook. http://nbviewer.ipython.org/github/ipython/ipython/blob/3607712653c66d63e0d7f13f073bde8c0f209ba8/docs/examples/notebooks/Animations_and_Progress.ipynb On Dec 27, 2014, at 4:07 PM, Josh Rosen rosenvi...@gmail.com wrote: The console progress bars are implemented on top of a new stable status API that was added in Spark 1.2. It's possible to query job progress using this interface (in older versions of Spark, you could implement a custom SparkListener and maintain the counts of completed / running / failed tasks / stages yourself). There are actually several subtleties involved in implementing job-level progress bars which behave in an intuitive way; there's a pretty extensive discussion of the challenges at https://github.com/apache/spark/pull/3009. Also, check out the pull request for the console progress bars for an interesting design discussion around how they handle parallel stages: https://github.com/apache/spark/pull/3029. I'm not sure about the plumbing that would be necessary to display live progress updates in the IPython notebook UI, though. The general pattern would probably involve a mapping to relate notebook cells to Spark jobs (you can do this with job groups, I think), plus some periodic timer that polls the driver for the status of the current job in order to update the progress bar. For Spark 1.3, I'm working on designing a REST interface to accesses this type of job / stage / task progress information, as well as expanding the types of information exposed through the stable status API interface. - Josh On Thu, Dec 25, 2014 at 10:01 AM, Eric Friedman eric.d.fried...@gmail.com wrote: Spark 1.2.0 is SO much more usable than previous releases -- many thanks to the team for this release. A question about progress of actions. I can see how things are progressing using the Spark UI. I can also see the nice ASCII art animation on the spark driver console. Has anyone come up with a way to accomplish something similar in an iPython notebook using pyspark? Thanks Eric - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Question on saveAsTextFile with overwrite option
Is it sufficient to set spark.hadoop.validateOutputSpecs to false? http://spark.apache.org/docs/latest/configuration.html - Patrick On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, We have such requirements to save RDD output to HDFS with saveAsTextFile like API, but need to overwrite the data if existed. I'm not sure if current Spark support such kind of operations, or I need to check this manually? There's a thread in mailing list discussed about this (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html), I'm not sure this feature is enabled or not, or with some configurations? Appreciate your suggestions. Thanks a lot Jerry - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Question on saveAsTextFile with overwrite option
So the behavior of overwriting existing directories IMO is something we don't want to encourage. The reason why the Hadoop client has these checks is that it's very easy for users to do unsafe things without them. For instance, a user could overwrite an RDD that had 100 partitions with an RDD that has 10 partitions... and if they read back the RDD they would get a corrupted RDD that has a combination of data from the old and new RDD. If users want to circumvent these safety checks, we need to make them explicitly disable them. Given this, I think a config option is as reasonable as any alternatives. This is already pretty easy IMO. - Patrick On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao hao.ch...@intel.com wrote: I am wondering if we can provide more friendly API, other than configuration for this purpose. What do you think Patrick? Cheng Hao -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:22 PM To: Shao, Saisai Cc: user@spark.apache.org; d...@spark.apache.org Subject: Re: Question on saveAsTextFile with overwrite option Is it sufficient to set spark.hadoop.validateOutputSpecs to false? http://spark.apache.org/docs/latest/configuration.html - Patrick On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, We have such requirements to save RDD output to HDFS with saveAsTextFile like API, but need to overwrite the data if existed. I'm not sure if current Spark support such kind of operations, or I need to check this manually? There's a thread in mailing list discussed about this (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Sp ark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html), I'm not sure this feature is enabled or not, or with some configurations? Appreciate your suggestions. Thanks a lot Jerry - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
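A sketch of explicitly disabling the check for an application that really wants to write into an existing output directory; as noted above this bypasses a safety check, so stale part files from a previous, wider-partitioned run are not cleaned up for you. The output path is illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("overwrite-output")
      .set("spark.hadoop.validateOutputSpecs", "false")   // skip the "output directory already exists" check
    val sc = new SparkContext(conf)
    sc.parallelize(1 to 100).saveAsTextFile("hdfs:///tmp/output")   // illustrative path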
Announcing Spark 1.2!
I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is the third release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 172 developers and more than 1,000 commits! This release brings operational and performance improvements in Spark core including a new network transport subsystem designed for very large shuffles. Spark SQL introduces an API for external data sources along with Hive 13 support, dynamic partitioning, and the fixed-precision decimal type. MLlib adds a new pipeline-oriented package (spark.ml) for composing multiple algorithms. Spark Streaming adds a Python API and a write-ahead log for fault tolerance. Finally, GraphX has graduated from alpha and introduces a stable API along with performance improvements. Visit the release notes [1] to read about the new features, or download [2] the release today. For errata in the contributions or release notes, please e-mail me *directly* (not on-list). Thanks to everyone involved in creating, testing, and documenting this release! [1] http://spark.apache.org/releases/spark-release-1-2-0.html [2] http://spark.apache.org/downloads.html - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark streaming kafa best practices ?
Foreach is slightly more efficient because Spark doesn't bother to try and collect results from each task since it's understood there will be no return type. I think the difference is very marginal though - it's mostly stylistic... typically you use foreach for something that is intended to produce a side effect and map for something that will return a new dataset. On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas gerard.m...@gmail.com wrote: Patrick, I was wondering why one would choose for rdd.map vs rdd.foreach to execute a side-effecting function on an RDD. -kr, Gerard. On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell pwend...@gmail.com wrote: The second choice is better. Once you call collect() you are pulling all of the data onto a single node, you want to do most of the processing in parallel on the cluster, which is what map() will do. Ideally you'd try to summarize the data or reduce it before calling collect(). On Fri, Dec 5, 2014 at 5:26 AM, david david...@free.fr wrote: hi, What is the bet way to process a batch window in SparkStreaming : kafkaStream.foreachRDD(rdd = { rdd.collect().foreach(event = { // process the event process(event) }) }) Or kafkaStream.foreachRDD(rdd = { rdd.map(event = { // process the event process(event) }).collect() }) thank's -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-kafa-best-practices-tp20470.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
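A sketch of the recommended shape of the streaming example above: run the side effect with foreach on the executors (optionally once per partition) rather than collecting to the driver. kafkaStream and process are the user's own names from the thread.

    kafkaStream.foreachRDD { rdd =>
      rdd.foreachPartition { events =>
        events.foreach(event => process(event))   // side effect runs in parallel on the executors
      }
    }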
Re: Spark Server - How to implement
Hey Manoj, One proposal potentially of interest is the Spark Kernel project from IBM - you should look at their proposal (linked below). The interface in that project is more of a remote REPL interface, i.e. you submit commands (as strings) and get back results (as strings), but you don't have direct programmatic access to state like in the JobServer. Not sure if this is what you need. https://issues.apache.org/jira/browse/SPARK-4605 This type of higher level execution context is something we've generally defined to be outside of scope for the core Spark distribution because these can be cleanly built on the stable API, and from what I've seen of different applications that build on Spark, the requirements are fairly different for different applications. I'm guessing that in the next year we'll see a handful of community projects pop up around providing various types of execution services for Spark apps. - Patrick On Fri, Dec 12, 2014 at 10:06 AM, Manoj Samel manojsamelt...@gmail.com wrote: Thanks Marcelo. Spark Gurus/Databricks team - do you have something in roadmap for such a spark server ? Thanks, On Thu, Dec 11, 2014 at 5:43 PM, Marcelo Vanzin van...@cloudera.com wrote: Oops, sorry, fat fingers. We've been playing with something like that inside Hive: https://github.com/apache/hive/tree/spark/spark-client That seems to have at least a few of the characteristics you're looking for; but it's a very young project, and at this moment we're not developing it as a public API, but mostly for internal Hive use. It can give you a few ideas, though. Also, SPARK-3215. On Thu, Dec 11, 2014 at 5:41 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Manoj, I'm not aware of any public projects that do something like that, except for the Ooyala server which you say doesn't cover your needs. We've been playing with something like that inside Hive, though: On Thu, Dec 11, 2014 at 5:33 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi, If spark based services are to be exposed as a continuously available server, what are the options? * The API exposed to client will be proprietary and fine grained (RPC style ..), not a Job level API * The client API need not be SQL so the Thrift JDBC server does not seem to be option .. but I could be wrong here ... * Ooyala implementation is a REST API for job submission, but as mentioned above; the desired API is a finer grain API, not a job submission Any existing implementation? Is it build your own server? Any thoughts on approach to use ? Thanks, -- Marcelo -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Stateful mapPartitions
Yeah the main way to do this would be to have your own static cache of connections. These could be using an object in Scala or just a static variable in Java (for instance a set of connections that you can borrow from). - Patrick On Thu, Dec 4, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Fri, Dec 5, 2014 at 3:56 AM, Akshat Aranya aara...@gmail.com wrote: Is it possible to have some state across multiple calls to mapPartitions on each partition, for instance, if I want to keep a database connection open? If you're using Scala, you can use a singleton object, this will exist once per JVM (i.e., once per executor), like object DatabaseConnector { lazy val conn = ... } Please be aware that shutting down the connection is much harder than opening it, because you basically have no idea when processing is done for an executor, AFAIK. Tobias - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
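A minimal sketch of the singleton-object approach described above, using plain JDBC as an illustration (the JDBC URL is a placeholder and myRdd is an assumed RDD of keys):

import java.sql.{Connection, DriverManager}

// One connection per executor JVM, created lazily the first time any task touches it.
object DatabaseConnector {
  lazy val conn: Connection = DriverManager.getConnection("jdbc:postgresql://host/db") // placeholder URL
}

val enriched = myRdd.mapPartitions { iter =>
  val conn = DatabaseConnector.conn // reused across all partitions processed by this JVM
  iter.map { key =>
    // ... use conn to look up or write something for this key (application-specific) ...
    (key, conn.isValid(1))
  }
  // note: cleanly closing conn when the executor exits is harder, per Tobias' caveat above
}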
Re: Exception adding resource files in latest Spark
Thanks for flagging this. I reverted the relevant YARN fix in Spark 1.2 release. We can try to debug this in master. On Thu, Dec 4, 2014 at 9:51 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: I created a ticket for this: https://issues.apache.org/jira/browse/SPARK-4757 Jianshi On Fri, Dec 5, 2014 at 1:31 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Correction: According to Liancheng, this hotfix might be the root cause: https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce Jianshi On Fri, Dec 5, 2014 at 12:45 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Looks like the datanucleus*.jar shouldn't appear in the hdfs path in Yarn-client mode. Maybe this patch broke yarn-client. https://github.com/apache/spark/commit/a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53 Jianshi On Fri, Dec 5, 2014 at 12:02 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Actually my HADOOP_CLASSPATH has already been set to include /etc/hadoop/conf/* export HADOOP_CLASSPATH=/etc/hbase/conf/hbase-site.xml:/usr/lib/hbase/lib/hbase-protocol.jar:$(hbase classpath) Jianshi On Fri, Dec 5, 2014 at 11:54 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Looks like somehow Spark failed to find the core-site.xml in /et/hadoop/conf I've already set the following env variables: export YARN_CONF_DIR=/etc/hadoop/conf export HADOOP_CONF_DIR=/etc/hadoop/conf export HBASE_CONF_DIR=/etc/hbase/conf Should I put $HADOOP_CONF_DIR/* to HADOOP_CLASSPATH? Jianshi On Fri, Dec 5, 2014 at 11:37 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: I got the following error during Spark startup (Yarn-client mode): 14/12/04 19:33:58 INFO Client: Uploading resource file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar - hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar java.lang.IllegalArgumentException: Wrong FS: hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:79) at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:506) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397) at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:257) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:242) at scala.Option.foreach(Option.scala:236) at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:242) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:35) at org.apache.spark.deploy.yarn.ClientBase$class.createContainerLaunchContext(ClientBase.scala:350) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:35) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:80) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:140) at 
org.apache.spark.SparkContext.init(SparkContext.scala:335) at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986) at $iwC$$iwC.init(console:9) at $iwC.init(console:18) at init(console:20) at .init(console:24) I'm using latest Spark built from master HEAD yesterday. Is this a bug? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava
Thanks Judy. While this is not directly caused by a Spark issue, it is likely other users will run into this. This is an unfortunate consequence of the way that we've shaded Guava in this release, we rely on byte code shading of Hadoop itself as well. And if the user has their own Hadoop classes present it can cause issues. On Sun, Nov 30, 2014 at 10:53 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Thanks Patrick and Cheng for the suggestions. The issue was Hadoop common jar was added to a classpath. After I removed Hadoop common jar from both master and slave, I was able to bypass the error. This was caused by a local change, so no impact on the 1.2 release. -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Wednesday, November 26, 2014 8:17 AM To: Judy Nash Cc: Denny Lee; Cheng Lian; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava Just to double check - I looked at our own assembly jar and I confirmed that our Hadoop configuration class does use the correctly shaded version of Guava. My best guess here is that somehow a separate Hadoop library is ending up on the classpath, possible because Spark put it there somehow. tar xvzf spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar cd org/apache/hadoop/ javap -v Configuration | grep Precond Warning: Binary file Configuration contains org.apache.hadoop.conf.Configuration #497 = Utf8 org/spark-project/guava/common/base/Preconditions #498 = Class #497 // org/spark-project/guava/common/base/Preconditions #502 = Methodref #498.#501// org/spark-project/guava/common/base/Preconditions.checkArgument:(ZLjava/lang/Object;)V 12: invokestatic #502// Method org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V 50: invokestatic #502// Method org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V On Wed, Nov 26, 2014 at 11:08 AM, Patrick Wendell pwend...@gmail.com wrote: Hi Judy, Are you somehow modifying Spark's classpath to include jars from Hadoop and Hive that you have running on the machine? The issue seems to be that you are somehow including a version of Hadoop that references the original guava package. The Hadoop that is bundled in the Spark jars should not do this. - Patrick On Wed, Nov 26, 2014 at 1:45 AM, Judy Nash judyn...@exchange.microsoft.com wrote: Looks like a config issue. I ran spark-pi job and still failing with the same guava error Command ran: .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class org.apache.spark.examples.SparkPi --master spark://headnodehost:7077 --executor-memory 1G --num-executors 1 .\lib\spark-examples-1.2.1-SNAPSHOT-hadoop2.4.0.jar 100 Had used the same build steps on spark 1.1 and had no issue. From: Denny Lee [mailto:denny.g@gmail.com] Sent: Tuesday, November 25, 2014 5:47 PM To: Judy Nash; Cheng Lian; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava To determine if this is a Windows vs. other configuration, can you just try to call the Spark-class.cmd SparkSubmit without actually referencing the Hadoop or Thrift server classes? On Tue Nov 25 2014 at 5:42:09 PM Judy Nash judyn...@exchange.microsoft.com wrote: I traced the code and used the following to call: Spark-class.cmd org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal --hiveconf hive.server2.thrift.port=1 The issue ended up to be much more fundamental however. 
Spark doesn't work at all in configuration below. When open spark-shell, it fails with the same ClassNotFound error. Now I wonder if this is a windows-only issue or the hive/Hadoop configuration that is having this problem. From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Tuesday, November 25, 2014 1:50 AM To: Judy Nash; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava Oh so you're using Windows. What command are you using to start the Thrift server then? On 11/25/14 4:25 PM, Judy Nash wrote: Made progress but still blocked. After recompiling the code on cmd instead of PowerShell, now I can see all 5 classes as you mentioned. However I am still seeing the same error as before. Anything else I can check for? From: Judy Nash [mailto:judyn...@exchange.microsoft.com] Sent: Monday, November 24, 2014 11:50 PM To: Cheng Lian; u...@spark.incubator.apache.org Subject: RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava This is what I got from jar tf: org/spark-project/guava/common/base/Preconditions.class org/spark-project/guava/common/math/MathPreconditions.class
Re: Opening Spark on IntelliJ IDEA
I recently posted instructions on loading Spark in Intellij from scratch: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-BuildingSparkinIntelliJIDEA You need to do a few extra steps for the YARN project to work. Also, for questions like this that relate to developing Spark, you can e-mail the dev@ list as well. On Sat, Nov 29, 2014 at 12:15 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Restarting the intellij multiple times and adding/changing project's SDK would resolve the issue. Thanks Best Regards On Thu, Nov 27, 2014 at 11:30 AM, Taeyun Kim taeyun@innowireless.com wrote: Hi, I'm trying to open the Spark source code with IntelliJ IDEA. I opened pom.xml on the Spark source code root directory. Project tree is displayed in the Project tool window. But, when I open a source file, say org.apache.spark.deploy.yarn.ClientBase.scala, a lot of red marks shows on the editor scroll bar. It is the 'Cannot resolve symbol' error. Even it cannot resolve StringOps.format. How can it be fixed? The versions I'm using are as follows: - OS: Windows 7 - IntelliJ IDEA: 13.1.6 - Scala plugin: 0.41.2 - Spark source code: 1.1.1 (with a few file modified by me) I've tried to fix this and error state changed somewhat, but eventually I gave up fixing it on my own (with googling) and deleted .idea folder and started over. So now I'm seeing the errors described above. Thank you. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava
Hi Judy, Are you somehow modifying Spark's classpath to include jars from Hadoop and Hive that you have running on the machine? The issue seems to be that you are somehow including a version of Hadoop that references the original guava package. The Hadoop that is bundled in the Spark jars should not do this. - Patrick On Wed, Nov 26, 2014 at 1:45 AM, Judy Nash judyn...@exchange.microsoft.com wrote: Looks like a config issue. I ran spark-pi job and still failing with the same guava error Command ran: .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class org.apache.spark.examples.SparkPi --master spark://headnodehost:7077 --executor-memory 1G --num-executors 1 .\lib\spark-examples-1.2.1-SNAPSHOT-hadoop2.4.0.jar 100 Had used the same build steps on spark 1.1 and had no issue. From: Denny Lee [mailto:denny.g@gmail.com] Sent: Tuesday, November 25, 2014 5:47 PM To: Judy Nash; Cheng Lian; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava To determine if this is a Windows vs. other configuration, can you just try to call the Spark-class.cmd SparkSubmit without actually referencing the Hadoop or Thrift server classes? On Tue Nov 25 2014 at 5:42:09 PM Judy Nash judyn...@exchange.microsoft.com wrote: I traced the code and used the following to call: Spark-class.cmd org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal --hiveconf hive.server2.thrift.port=1 The issue ended up to be much more fundamental however. Spark doesn't work at all in configuration below. When open spark-shell, it fails with the same ClassNotFound error. Now I wonder if this is a windows-only issue or the hive/Hadoop configuration that is having this problem. From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Tuesday, November 25, 2014 1:50 AM To: Judy Nash; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava Oh so you're using Windows. What command are you using to start the Thrift server then? On 11/25/14 4:25 PM, Judy Nash wrote: Made progress but still blocked. After recompiling the code on cmd instead of PowerShell, now I can see all 5 classes as you mentioned. However I am still seeing the same error as before. Anything else I can check for? From: Judy Nash [mailto:judyn...@exchange.microsoft.com] Sent: Monday, November 24, 2014 11:50 PM To: Cheng Lian; u...@spark.incubator.apache.org Subject: RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava This is what I got from jar tf: org/spark-project/guava/common/base/Preconditions.class org/spark-project/guava/common/math/MathPreconditions.class com/clearspring/analytics/util/Preconditions.class parquet/Preconditions.class I seem to have the line that reported missing, but I am missing this file: com/google/inject/internal/util/$Preconditions.class Any suggestion on how to fix this? Very much appreciate the help as I am very new to Spark and open source technologies. From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Monday, November 24, 2014 8:24 PM To: Judy Nash; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava Hm, I tried exactly the same commit and the build command locally, but couldn't reproduce this. Usually this kind of errors are caused by classpath misconfiguration. Could you please try this to ensure corresponding Guava classes are included in the assembly jar you built? 
jar tf assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop2.4.0.jar | grep Preconditions On my machine I got these lines (the first line is the one reported as missing in your case): org/spark-project/guava/common/base/Preconditions.class org/spark-project/guava/common/math/MathPreconditions.class com/clearspring/analytics/util/Preconditions.class parquet/Preconditions.class com/google/inject/internal/util/$Preconditions.class On 11/25/14 6:25 AM, Judy Nash wrote: Thank you Cheng for responding. Here is the commit SHA1 on the 1.2 branch I saw this failure in: commit 6f70e0295572e3037660004797040e026e440dbd Author: zsxwing zsxw...@gmail.com Date: Fri Nov 21 00:42:43 2014 -0800 [SPARK-4472][Shell] Print Spark context available as sc. only when SparkContext is created... ... successfully It's weird that printing Spark context available as sc when creating SparkContext unsuccessfully. Let me know if you need anything else. From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Friday, November 21, 2014 8:02 PM To: Judy Nash; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava Hi Judy, could you please provide the commit SHA1 of the version you're
Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava
Just to double check - I looked at our own assembly jar and I confirmed that our Hadoop configuration class does use the correctly shaded version of Guava. My best guess here is that somehow a separate Hadoop library is ending up on the classpath, possible because Spark put it there somehow. tar xvzf spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar cd org/apache/hadoop/ javap -v Configuration | grep Precond Warning: Binary file Configuration contains org.apache.hadoop.conf.Configuration #497 = Utf8 org/spark-project/guava/common/base/Preconditions #498 = Class #497 // org/spark-project/guava/common/base/Preconditions #502 = Methodref #498.#501// org/spark-project/guava/common/base/Preconditions.checkArgument:(ZLjava/lang/Object;)V 12: invokestatic #502// Method org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V 50: invokestatic #502// Method org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V On Wed, Nov 26, 2014 at 11:08 AM, Patrick Wendell pwend...@gmail.com wrote: Hi Judy, Are you somehow modifying Spark's classpath to include jars from Hadoop and Hive that you have running on the machine? The issue seems to be that you are somehow including a version of Hadoop that references the original guava package. The Hadoop that is bundled in the Spark jars should not do this. - Patrick On Wed, Nov 26, 2014 at 1:45 AM, Judy Nash judyn...@exchange.microsoft.com wrote: Looks like a config issue. I ran spark-pi job and still failing with the same guava error Command ran: .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class org.apache.spark.examples.SparkPi --master spark://headnodehost:7077 --executor-memory 1G --num-executors 1 .\lib\spark-examples-1.2.1-SNAPSHOT-hadoop2.4.0.jar 100 Had used the same build steps on spark 1.1 and had no issue. From: Denny Lee [mailto:denny.g@gmail.com] Sent: Tuesday, November 25, 2014 5:47 PM To: Judy Nash; Cheng Lian; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava To determine if this is a Windows vs. other configuration, can you just try to call the Spark-class.cmd SparkSubmit without actually referencing the Hadoop or Thrift server classes? On Tue Nov 25 2014 at 5:42:09 PM Judy Nash judyn...@exchange.microsoft.com wrote: I traced the code and used the following to call: Spark-class.cmd org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal --hiveconf hive.server2.thrift.port=1 The issue ended up to be much more fundamental however. Spark doesn't work at all in configuration below. When open spark-shell, it fails with the same ClassNotFound error. Now I wonder if this is a windows-only issue or the hive/Hadoop configuration that is having this problem. From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Tuesday, November 25, 2014 1:50 AM To: Judy Nash; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava Oh so you're using Windows. What command are you using to start the Thrift server then? On 11/25/14 4:25 PM, Judy Nash wrote: Made progress but still blocked. After recompiling the code on cmd instead of PowerShell, now I can see all 5 classes as you mentioned. However I am still seeing the same error as before. Anything else I can check for? 
From: Judy Nash [mailto:judyn...@exchange.microsoft.com] Sent: Monday, November 24, 2014 11:50 PM To: Cheng Lian; u...@spark.incubator.apache.org Subject: RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava This is what I got from jar tf: org/spark-project/guava/common/base/Preconditions.class org/spark-project/guava/common/math/MathPreconditions.class com/clearspring/analytics/util/Preconditions.class parquet/Preconditions.class I seem to have the line that reported missing, but I am missing this file: com/google/inject/internal/util/$Preconditions.class Any suggestion on how to fix this? Very much appreciate the help as I am very new to Spark and open source technologies. From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Monday, November 24, 2014 8:24 PM To: Judy Nash; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava Hm, I tried exactly the same commit and the build command locally, but couldn't reproduce this. Usually this kind of errors are caused by classpath misconfiguration. Could you please try this to ensure corresponding Guava classes are included in the assembly jar you built? jar tf assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop2.4.0.jar | grep Preconditions On my machine I got
Re: toLocalIterator in Spark 1.0.0
It looks like you are trying to directly import the toLocalIterator function. You can't import functions; toLocalIterator should just appear as a method on an existing RDD if you have one. - Patrick On Thu, Nov 13, 2014 at 10:21 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using Spark 1.0.0 and Scala 2.10.3. I want to use toLocalIterator in my code but the spark shell tells me not found: value toLocalIterator I also did import org.apache.spark.rdd but even after this the shell tells me object toLocalIterator is not a member of package org.apache.spark.rdd Can anyone help me with this? Thank You - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
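A minimal example of using it as a method, for anyone hitting the same error (run in spark-shell, where sc already exists):

val rdd = sc.parallelize(1 to 1000, 10)
// toLocalIterator streams one partition at a time back to the driver,
// instead of materializing the whole RDD at once like collect()
rdd.toLocalIterator.foreach(println)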
Re: Still struggling with building documentation
The doc build appears to be broken in master. We'll get it patched up before the release: https://issues.apache.org/jira/browse/SPARK-4326 On Tue, Nov 11, 2014 at 10:50 AM, Alessandro Baretta alexbare...@gmail.com wrote: Nichols and Patrick, Thanks for your help, but, no, it still does not work. The latest master produces the following scaladoc errors: [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55: not found: type Type [error] protected Type type() { return Type.UPLOAD_BLOCK; } [error] ^ [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:39: not found: type Type [error] protected Type type() { return Type.STREAM_HANDLE; } [error] ^ [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40: not found: type Type [error] protected Type type() { return Type.OPEN_BLOCKS; } [error] ^ [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44: not found: type Type [error] protected Type type() { return Type.REGISTER_EXECUTOR; } [error] ^ ... [error] four errors found [error] (spark/javaunidoc:doc) javadoc returned nonzero exit code [error] (spark/scalaunidoc:doc) Scaladoc generation failed [error] Total time: 140 s, completed Nov 11, 2014 10:20:53 AM Moving back into docs dir. Making directory api/scala cp -r ../target/scala-2.10/unidoc/. api/scala Making directory api/java cp -r ../target/javaunidoc/. api/java Moving to python/docs directory and building sphinx. Makefile:14: *** The 'sphinx-build' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the 'sphinx-build' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/. Stop. Moving back into home dir. Making directory api/python cp -r python/docs/_build/html/. docs/api/python /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or directory - python/docs/_build/html/. (Errno::ENOENT) from /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `block in fu_each_src_dest' from /usr/lib/ruby/1.9.1/fileutils.rb:1529:in `fu_each_src_dest0' from /usr/lib/ruby/1.9.1/fileutils.rb:1513:in `fu_each_src_dest' from /usr/lib/ruby/1.9.1/fileutils.rb:436:in `cp_r' from /home/alex/git/spark/docs/_plugins/copy_api_dirs.rb:79:in `top (required)' from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require' from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:76:in `block in setup' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `each' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `setup' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:30:in `initialize' from /usr/bin/jekyll:224:in `new' from /usr/bin/jekyll:224:in `main' What next? Alex On Fri, Nov 7, 2014 at 12:54 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I believe the web docs need to be built separately according to the instructions here. Did you give those a shot? It's annoying to have a separate thing with new dependencies in order to build the web docs, but that's how it is at the moment. 
Nick On Fri, Nov 7, 2014 at 3:39 PM, Alessandro Baretta alexbare...@gmail.com wrote: I finally came to realize that there is a special maven target to build the scaladocs, although arguably a very unintuitive one: mvn verify. So now I have scaladocs for each package, but not for the whole Spark project. Specifically, build/docs/api/scala/index.html is missing. Indeed the whole build/docs/api directory referenced in api.html is missing. How do I build it? Alex Baretta - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark and Play
Hi There, Because Akka versions are not binary compatible with one another, it might not be possible to integrate Play with Spark 1.1.0. - Patrick On Tue, Nov 11, 2014 at 8:21 AM, Akshat Aranya aara...@gmail.com wrote: Hi, Sorry if this has been asked before; I didn't find a satisfactory answer when searching. How can I integrate a Play application with Spark? I'm getting into issues of akka-actor versions. Play 2.2.x uses akka-actor 2.0, whereas Play 2.3.x uses akka-actor 2.3.4, neither of which work fine with Spark 1.1.0. Is there something I should do with libraryDependencies in my build.sbt to make it work? Thanks, Akshat - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Support Hive 0.13 .1 in Spark SQL
Hey Cheng, Right now we aren't using stable API's to communicate with the Hive Metastore. We didn't want to drop support for Hive 0.12 so right now we are using a shim layer to support compiling for 0.12 and 0.13. This is very costly to maintain. If Hive has a stable meta-data API for talking to a Metastore, we should use that (is HCatalog sufficient for this purpose?). Ideally we would be able to talk to multiple versions of the Hive metastore and we can keep a single internal version of Hive for our use of Serde's, etc. I've created SPARK-4114 for this: https://issues.apache.org/jira/browse/SPARK-4114 This is a very important issue for Spark SQL, so I'd welcome comments on that JIRA from anyone who is familiar with Hive/HCatalog internals. - Patrick On Mon, Oct 27, 2014 at 9:54 PM, Cheng, Hao hao.ch...@intel.com wrote: Hi, all I have some PRs blocked by hive upgrading (e.g. https://github.com/apache/spark/pull/2570), the problem is some internal hive method signature changed, it's hard to make the compatible in code level (sql/hive) when switching back/forth the Hive versions. I guess the motivation of the upgrading is to support the Metastore with different Hive versions. So, how about just keep the metastore related hive jars upgrading or utilize the HCatalog directly? And of course we can either leaving hive-exec.jar hive-cli.jar etc as 0.12 or upgrade to 0.13.1, but not support them both. Sorry if I missed some discussion of Hive upgrading. Cheng Hao - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Ending a job early
Hey Jim, There are some experimental (unstable) API's that support running jobs which might short-circuit: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1126 This can be used for doing online aggregations like you are describing. And in one or two cases we've exposed functions that rely on this: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L334 I would expect more robust support for online aggregation to show up in a future version of Spark. - Patrick On Tue, Oct 28, 2014 at 7:27 AM, Jim Carroll jimfcarr...@gmail.com wrote: We have some very large datasets where the calculation converge on a result. Our current implementation allows us to track how quickly the calculations are converging and end the processing early. This can significantly speed up some of our processing. Is there a way to do the same thing is spark? A trivial example might be a column average on a dataset. As we're 'aggregating' rows into columnar averages I can track how fast these averages are moving and decide to stop after a low percentage of the rows has been processed, producing an estimate rather than an exact value. Within a partition, or better yet, within a worker across 'reduce' steps, is there a way to stop all of the aggregations and just continue on with reduces of already processed data? Thanks JIm -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Ending-a-job-early-tp17505.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
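One of the functions already exposed on top of that mechanism is the approximate/partial result API; below is a minimal sketch for the column-average example (the input path and column index are placeholders, and the timeout/confidence values are purely illustrative):

// Parse one numeric column (placeholder path and column index).
val values = sc.textFile("hdfs:///data/input.csv").map(_.split(",")(2).toDouble)

// Ask for an estimate after at most 10 seconds at 95% confidence; the job
// stops waiting on remaining partitions once the timeout is reached.
val estimate = values.meanApprox(timeout = 10000, confidence = 0.95)
println(estimate.initialValue)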
Re: scalac crash when compiling DataTypeConversions.scala
Hey Ryan, I've found that filing issues with the Scala/Typesafe JIRA is pretty helpful if the issue can be fully reproduced, and even sometimes helpful if it can't. You can file bugs here: https://issues.scala-lang.org/secure/Dashboard.jspa The Spark SQL code in particular is typically the source of these, as we use more fancy scala features. In a pinch it is also possible to recompile and test code without building SQL if you just run tests for the specific module (e.g. streaming). In sbt this sort of just works: sbt/sbt streaming/test-only a.b.c* In maven it's more clunky but if you do a mvn install first then (I think) you can test sub-modules independently: mvn test -pl streaming ... - Patrick On Wed, Oct 22, 2014 at 10:00 PM, Ryan Williams ryan.blake.willi...@gmail.com wrote: I started building Spark / running Spark tests this weekend and on maybe 5-10 occasions have run into a compiler crash while compiling DataTypeConversions.scala. Here is a full gist of an innocuous test command (mvn test -Dsuites='*KafkaStreamSuite') exhibiting this behavior. Problem starts on L512 and there's a final stack trace at the bottom. mvn clean or ./sbt/sbt clean fix it (I believe I've observed the issue while compiling with each tool), but are annoying/time-consuming to do, obvs, and it's happening pretty frequently for me when doing only small numbers of incremental compiles punctuated by e.g. checking out different git commits. Have other people seen this? This post on this list is basically the same error, but in TestSQLContext.scala and this SO post claims to be hitting it when trying to build in intellij. It seems likely to be a bug in scalac; would finding a consistent repro case and filing it somewhere be useful? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: About Memory usage in the Spark UI
It shows the amount of memory used to store RDD blocks, which are created when you run .cache()/.persist() on an RDD. On Wed, Oct 22, 2014 at 10:07 PM, Haopu Wang hw...@qilinsoft.com wrote: Hi, please take a look at the attached screen-shot. I wonder what the Memory Used column means. I give 2GB memory to the driver process and 12GB memory to the executor process. Thank you!
Re: coalesce with shuffle or repartition is not necessarily fault-tolerant
IIRC - the random is seeded with the index, so it will always produce the same result for the same index. Maybe I don't totally follow though. Could you give a small example of how this might change the RDD ordering in a way that you don't expect? In general repartition() will not preserve the ordering of an RDD. On Wed, Oct 8, 2014 at 3:42 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: I noticed that repartition will result in non-deterministic lineage because it'll result in changed orders for rows. So for instance, if you do things like: val data = read(...) val k = data.repartition(5) val h = k.repartition(5) It seems that this results in different ordering of rows for 'k' each time you call it. And because of this different ordering, 'h' will result in different partitions even, because 'repartition' distributes through a random number generator with the 'index' as the key. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: sparksql connect remote hive cluster
Spark will need to connect both to the hive metastore and to all HDFS nodes (NN and DN's). If that is all in place then it should work. In this case it looks like maybe it can't connect to a datanode in HDFS to get the raw data. Keep in mind that the performance might not be very good if you are trying to read large amounts of data over the network. On Wed, Oct 8, 2014 at 5:33 AM, jamborta jambo...@gmail.com wrote: Hi all, just wondering if is it possible to allow spark to connect to hive on another cluster located remotely? I have setup hive-site.xml and amended the hive-metatstore uri, also opened the port for zookeeper, webhdfs and hive metastore. It seems it connects to hive, then it fails with the following: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1886934195-100.73.212.101-1411645855947:blk_1073763904_23146 file=/user/tja01/datasets/00ab46fa4d6711e4afb70003ff41ebbf/part-3 not sure if some of the ports are not open or it needs access to additional things. thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sparksql-connect-remote-hive-cluster-tp15928.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spot instances on Amazon EMR
Hey Grzegorz, EMR is a service that is not maintained by the Spark community. So this list isn't the right place to ask EMR questions. - Patrick On Thu, Sep 18, 2014 at 3:19 AM, Grzegorz Białek grzegorz.bia...@codilime.com wrote: Hi, I would like to run Spark application on Amazon EMR. I have some questions about that: 1. I have input data on other hdfs (not on Amazon). Can I send all input data from that cluster to HDFS on Amazon EMR cluster (if it has enough storage memory) or do I have send it to Amazon S3 storage and then load this data on EMR cluster where I want to run my application? 2. Which nodes should be on-demand instances and which can be spot instances (I don't want to spend to much money but I also lost my data or have to recompute everything after spot instance interruption)? 3. Can I use Amazon S3 storage for input and output data to have less on-demand instances and more spot instances? (Or maybe there is another solution to lower costs) I would like to run this application once and computation would take around 30h I think. Could you answer on (at least some of) this questions? Thanks, Grzegorz - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: partitioned groupBy
If you'd like to re-use the resulting inverted map, you can persist the result: x = myRdd.mapPartitions(create inverted map).persist() Your function would create the reverse map and then return an iterator over the keys in that map. On Wed, Sep 17, 2014 at 1:04 PM, Akshat Aranya aara...@gmail.com wrote: Patrick, If I understand this correctly, I won't be able to do this in the closure provided to mapPartitions() because that's going to be stateless, in the sense that a hash map that I create within the closure would only be useful for one call of MapPartitionsRDD.compute(). I guess I would need to override mapPartitions() directly within my RDD. Right? On Tue, Sep 16, 2014 at 4:57 PM, Patrick Wendell pwend...@gmail.com wrote: If each partition can fit in memory, you can do this using mapPartitions and then building an inverse mapping within each partition. You'd need to construct a hash map within each partition yourself. On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote: I have a use case where my RDD is set up such: Partition 0: K1 - [V1, V2] K2 - [V2] Partition 1: K3 - [V1] K4 - [V3] I want to invert this RDD, but only within a partition, so that the operation does not require a shuffle. It doesn't matter if the partitions of the inverted RDD have non unique keys across the partitions, for example: Partition 0: V1 - [K1] V2 - [K1, K2] Partition 1: V1 - [K3] V3 - [K4] Is there a way to do only a per-partition groupBy, instead of shuffling the entire data? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
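A minimal sketch of what that looks like, with the per-partition inverse map built inside the closure (the key/value types are illustrative; myRdd is assumed to be an RDD of key -> values pairs as in the example partitions quoted above):

import scala.collection.mutable

// Invert each partition locally: no shuffle, and keys in the result may repeat
// across partitions (which is fine for this use case).
val inverted = myRdd.mapPartitions { iter =>
  val inverse = mutable.Map.empty[String, List[String]] // value -> original keys (types assumed)
  for ((k, values) <- iter; v <- values) {
    inverse(v) = k :: inverse.getOrElse(v, Nil)
  }
  inverse.iterator
}.persist()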
Re: partitioned groupBy
If each partition can fit in memory, you can do this using mapPartitions and then building an inverse mapping within each partition. You'd need to construct a hash map within each partition yourself. On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote: I have a use case where my RDD is set up such: Partition 0: K1 - [V1, V2] K2 - [V2] Partition 1: K3 - [V1] K4 - [V3] I want to invert this RDD, but only within a partition, so that the operation does not require a shuffle. It doesn't matter if the partitions of the inverted RDD have non unique keys across the partitions, for example: Partition 0: V1 - [K1] V2 - [K1, K2] Partition 1: V1 - [K3] V3 - [K4] Is there a way to do only a per-partition groupBy, instead of shuffling the entire data? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark-1.1.0 with make-distribution.sh problem
Yeah that issue has been fixed by adding better docs, it just didn't make it in time for the release: https://github.com/apache/spark/blob/branch-1.1/make-distribution.sh#L54 On Thu, Sep 11, 2014 at 11:57 PM, Zhanfeng Huo huozhanf...@gmail.com wrote: resolved: ./make-distribution.sh --name spark-hadoop-2.3.0 --tgz --with-tachyon -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests This code is a bit misleading -- Zhanfeng Huo *From:* Zhanfeng Huo huozhanf...@gmail.com *Date:* 2014-09-12 14:13 *To:* user user@spark.apache.org *Subject:* spark-1.1.0 with make-distribution.sh problem Hi, I compile spark with cmd bash -x make-distribution.sh -Pyarn -Phive --skip-java-test --with-tachyon --tgz -Pyarn.version=2.3.0 -Phadoop.version=2.3.0, it errors. How to use it correct? message: + set -o pipefail + set -e +++ dirname make-distribution.sh ++ cd . ++ pwd + FWDIR=/home/syn/spark/spark-1.1.0 + DISTDIR=/home/syn/spark/spark-1.1.0/dist + SPARK_TACHYON=false + MAKE_TGZ=false + NAME=none + (( 7 )) + case $1 in + break + '[' -z /home/syn/usr/jdk1.7.0_55 ']' + '[' -z /home/syn/usr/jdk1.7.0_55 ']' + which git ++ git rev-parse --short HEAD + GITREV=5f6f219 + '[' '!' -z 5f6f219 ']' + GITREVSTRING=' (git revision 5f6f219)' + unset GITREV + which mvn ++ mvn help:evaluate -Dexpression=project.version ++ grep -v INFO ++ tail -n 1 + VERSION=1.1.0 ++ mvn help:evaluate -Dexpression=hadoop.version -Pyarn -Phive --skip-java-test --with-tachyon --tgz -Pyarn.version=2.3.0 -Phadoop.version=2.3.0 ++ grep -v INFO ++ tail -n 1 + SPARK_HADOOP_VERSION=' -X,--debug Produce execution debug output' Best Regards -- Zhanfeng Huo
Re: Use Case of mutable RDD - any ideas around will help.
[moving to user@] This would typically be accomplished with a union() operation. You can't mutate an RDD in-place, but you can create a new RDD with a union() which is an inexpensive operator. On Fri, Sep 12, 2014 at 5:28 AM, Archit Thakur archit279tha...@gmail.com wrote: Hi, We have a use case where we are planning to keep sparkcontext alive in a server and run queries on it. But the issue is we have a continuous flowing data the comes in batches of constant duration(say, 1hour). Now we want to exploit the schemaRDD and its benefits of columnar caching and compression. Is there a way I can append the new batch (uncached) to the older(cached) batch without losing the older data from cache and caching the whole dataset. Thanks and Regards, Archit Thakur. Sr Software Developer, Guavus, Inc. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
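A minimal sketch of the union() approach for the hourly-batch case (paths are placeholders; the same idea applies to SchemaRDDs registered with a SQLContext):

// The batch that is already cached from previous hours.
val cached = sc.textFile("hdfs:///events/2014-09-12-00").cache()

// The newly arrived hour of data.
val fresh = sc.textFile("hdfs:///events/2014-09-12-01")

// union() just concatenates the two lineages; the old partitions stay cached,
// and the combined RDD can itself be cached going forward if desired.
val combined = cached.union(fresh)
combined.cache()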
Re: Spark 1.1.0: Cannot load main class from JAR
Hey SK, Yeah, the documented format is the same (we expect users to add the jar at the end) but the old spark-submit had a bug where it would actually accept inputs that did not match the documented format. Sorry if this was difficult to find! - Patrick On Fri, Sep 12, 2014 at 1:50 PM, SK skrishna...@gmail.com wrote: This issue is resolved. Looks like in the new spark-submit, the jar path has to be at the end of the options. Earlier I could specify this path in any order on the command line. thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-0-Cannot-load-main-class-from-JAR-tp14123p14124.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Announcing Spark 1.1.0!
I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is the second release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 171 developers! This release brings operational and performance improvements in Spark core including a new implementation of the Spark shuffle designed for very large scale workloads. Spark 1.1 adds significant extensions to the newest Spark modules, MLlib and Spark SQL. Spark SQL introduces a JDBC server, byte code generation for fast expression evaluation, a public types API, JSON support, and other features and optimizations. MLlib introduces a new statistics library along with several new algorithms and optimizations. Spark 1.1 also builds out Spark's Python support and adds new components to the Spark Streaming module. Visit the release notes [1] to read about the new features, or download [2] the release today. [1] http://spark.eu.apache.org/releases/spark-release-1-1-0.html [2] http://spark.eu.apache.org/downloads.html NOTE: SOME ASF DOWNLOAD MIRRORS WILL NOT CONTAIN THE RELEASE FOR SEVERAL HOURS. Please e-mail me directly for any type-o's in the release notes or name listing. Thanks, and congratulations! - Patrick - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Deployment model popularity - Standard vs. YARN vs. Mesos vs. SIMR
I would say that the first three are all used pretty heavily. Mesos was the first one supported (long ago), the standalone is the simplest and most popular today, and YARN is newer but growing a lot in activity. SIMR is not used as much... it was designed mostly for environments where users had access to Hadoop and couldn't easily install Spark. Now most Hadoop vendors bundle Spark anyways, so it's not needed. On Sun, Sep 7, 2014 at 6:29 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, I'm trying to determine which Spark deployment models are the most popular - Standalone, YARN, Mesos, or SIMR. Anyone knows? I thought I'm use search-hadoop.com to help me figure this out and this is what I found: 1) Standalone http://search-hadoop.com/?q=standalonefc_project=Sparkfc_type=mail+_hash_+user (seems the most popular?) 2) YARN http://search-hadoop.com/?q=yarnfc_project=Sparkfc_type=mail+_hash_+user (almost as popular as standalone?) 3) Mesos http://search-hadoop.com/?q=mesosfc_project=Sparkfc_type=mail+_hash_+user (less popular than yarn or standalone) 4) SIMR http://search-hadoop.com/?q=simrfc_project=Sparkfc_type=mail+_hash_+user (no mentions?) This is obviously not very accurate but is the order right? Thanks, Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: memory size for caching RDD
Changing this is not supported; it is immutable, similar to other Spark configuration settings. On Wed, Sep 3, 2014 at 8:13 PM, 牛兆捷 nzjem...@gmail.com wrote: Dear all: Spark uses memory to cache RDD and the memory size is specified by spark.storage.memoryFraction. Once the Executor starts, does Spark support adjusting/resizing the memory size of this part dynamically? Thanks. -- *Regards,* *Zhaojie* - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
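Since the fraction is fixed once the executors start, it has to be chosen up front when the SparkContext is created; a minimal sketch (the 0.4 value is purely illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// spark.storage.memoryFraction is read once at startup and cannot be resized later.
val conf = new SparkConf()
  .setAppName("cache-sizing-example")
  .set("spark.storage.memoryFraction", "0.4") // default is 0.6

val sc = new SparkContext(conf)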
Re: Spark Streaming: DStream - zipWithIndex
Yeah - each batch will produce a new RDD. On Wed, Aug 27, 2014 at 3:33 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Thanks. Just to double check, rdd.id would be unique for a batch in a DStream? On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote: You can use RDD id as the seed, which is unique in the same spark context. Suppose none of the RDDs would contain more than 1 billion records. Then you can use rdd.zipWithUniqueId().mapValues(uid = rdd.id * 1e9.toLong + uid) Just a hack .. On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: So, I guess zipWithUniqueId will be similar. Is there a way to get unique index? On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote: No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Hello, If I do: DStream transform { rdd.zipWithIndex.map { Is the index guaranteed to be unique across all RDDs here? } } Thanks, -Soumitra. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
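Spelling out Xiangrui's hack as a sketch (it assumes, as stated above, that no single RDD in the stream holds more than about a billion records, and that dstream is the input DStream):

val withIds = dstream.transform { rdd =>
  // rdd.id is unique within a SparkContext, so offsetting by it makes the
  // per-RDD indices from zipWithUniqueId unique across batches as well.
  val offset = rdd.id.toLong * 1000000000L
  rdd.zipWithUniqueId().map { case (value, uid) => (offset + uid, value) }
}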
Submit to the Powered By Spark Page!
Hi All, I want to invite users to submit to the Spark Powered By page. This page is a great way for people to learn about Spark use cases. Since Spark activity has increased a lot in the higher level libraries and people often ask who uses each one, we'll include information about which components each organization uses as well. If you are interested, simply respond to this e-mail (or e-mail me off-list) with: 1) Organization name 2) URL 3) Which Spark components you use: Core, SQL, Streaming, MLlib, GraphX 4) A 1-2 sentence description of your use case. I'll post any new entries here: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark - Patrick
Re: Understanding RDD.GroupBy OutOfMemory Exceptions
Hey Andrew, We might create a new JIRA for it, but it doesn't exist yet. We'll create JIRA's for the major 1.2 issues at the beginning of September. - Patrick On Mon, Aug 25, 2014 at 8:53 AM, Andrew Ash and...@andrewash.com wrote: Hi Patrick, For the spilling within on key work you mention might land in Spark 1.2, is that being tracked in https://issues.apache.org/jira/browse/SPARK-1823 or is there another ticket I should be following? Thanks! Andrew On Tue, Aug 5, 2014 at 3:39 PM, Patrick Wendell pwend...@gmail.com wrote: Hi Jens, Within a partition things will spill - so the current documentation is correct. This spilling can only occur *across keys* at the moment. Spilling cannot occur within a key at present. This is discussed in the video here: https://www.youtube.com/watch?v=dmL0N3qfSc8index=3list=PL-x35fyliRwj7qNxXLgMRJaOk7o9inHBZ Spilling within one key for GroupBy's is likely to end up in the next release of Spark, Spark 1.2. In most cases we see when users hit this, they are actually trying to just do aggregations which would be more efficiently implemented without the groupBy operator. If the goal is literally to just write out to disk all the values associated with each group, and the values associated with a single group are larger than fit in memory, this cannot be accomplished right now with the groupBy operator. The best way to work around this depends a bit on what you are trying to do with the data down stream. Typically approaches involve sub-dividing any very large groups, for instance, appending a hashed value in a small range (1-10) to large keys. Then your downstream code has to deal with aggregating partial values for each group. If your goal is just to lay each group out sequentially on disk on one big file, you can call `sortByKey` with a hashed suffix as well. The sort functions are externalized in Spark 1.1 (which is in pre-release). - Patrick On Tue, Aug 5, 2014 at 2:39 PM, Jens Kristian Geyti sp...@jkg.dk wrote: Patrick Wendell wrote In the latest version of Spark we've added documentation to make this distinction more clear to users: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390 That is a very good addition to the documentation. Nice and clear about the dangers of groupBy. Patrick Wendell wrote Currently groupBy requires that all of the values for one key can fit in memory. Is that really true? Will partitions not spill to disk, hence the recommendation in the documentation to up the parallelism of groupBy et al? A better question might be: How exactly does partitioning affect groupBy with regards to memory consumption. What will **have** to fit in memory, and what may be spilled to disk, if running out of memory? And if it really is true, that Spark requires all groups' values to fit in memory, how do I do a on-disk grouping of results, similar to what I'd to in a Hadoop job by using a mapper emitting (groupId, value) key-value pairs, and having an entity reducer writing results to disk? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11487.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
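A minimal sketch of the key-salting idea for aggregations with very large groups (pairs is an assumed RDD[(String, Long)], and summing stands in for whatever per-group aggregation is actually needed):

import scala.util.Random

// Append a small random suffix so no single key's values must fit in memory at once.
val salted = pairs.map { case (k, v) => ((k, Random.nextInt(10)), v) }

// First aggregate per (key, salt) sub-group...
val partial = salted.reduceByKey(_ + _)

// ...then merge the partial results back down to one value per original key.
val merged = partial.map { case ((k, _), v) => (k, v) }.reduceByKey(_ + _)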
Re: Advantage of using cache()
Yep - that's correct. As an optimization we save the shuffle output and re-use it if you execute a stage twice. So this can make A:B tests like this a bit confusing. - Patrick On Friday, August 22, 2014, Nieyuan qiushuiwuh...@gmail.com wrote: Because map-reduce tasks like join will save shuffle data to disk. So the only difference between the caching and no-caching versions is: .map { case (x, (n, i)) => (x, n)} - Thanks, Nieyuan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Advantage-of-using-cache-tp12480p12634.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Advantage of using cache()
Your rdd2 and rdd3 differ in two ways so it's hard to track the exact effect of caching. In rdd3, in addition to the fact that rdd will be cached, you are also doing a bunch of extra random number generation. So it will be hard to isolate the effect of caching. On Wed, Aug 20, 2014 at 7:48 AM, Grzegorz Białek grzegorz.bia...@codilime.com wrote: Hi, I tried to write small program which shows that using cache() can speed up execution but results with and without cache were similar. Could help me with this issue? I tried to compute rdd and use it later in two places and I thought in second usage this rdd is recomputed but it doesn't: val help = sc.parallelize(Array.range(1, 2)).repartition(100) .map(x = (scala.util.Random.nextInt(10), x)) val rdd = sc.parallelize(Array.range(1,2)) .repartition(100) .map(x = (scala.util.Random.nextInt(10), x)) .join(help) .map { case (x, (n, i)) = (x, n)} .reduceByKey(_ + _) .cache() val rdd2 = sc.parallelize(Array.range(1,1000)).map(x = (x, x)) .join(rdd).saveAsTextFile(output/1) val rdd3 = sc.parallelize(Array.range(1,1000)).map(x = (scala.util.Random.nextInt(1000), x)) .join(rdd).saveAsTextFile(output/2) Thanks, Grzegorz
Re: Broadcast vs simple variable
For large objects, it will be more efficient to broadcast it. If your array is small it won't really matter. How many centers do you have? Unless you are finding that you have very large tasks (and Spark will print a warning about this), it could be okay to just reference it directly. On Wed, Aug 20, 2014 at 1:18 AM, Julien Naour julna...@gmail.com wrote: Hi, I have a question about broadcast. I'm working on a clustering algorithm close to KMeans. It seems that KMeans broadcast clusters centers at each step. For the moment I just use my centers as Array that I call directly in my map at each step. Could it be more efficient to use broadcast instead of simple variable? Cheers, Julien Naour
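A minimal sketch of the broadcast version for a KMeans-like loop (the points RDD and the centers array are assumed; the distance calculation is written out only to keep the sketch self-contained):

// centers: Array[Array[Double]] recomputed on the driver at each iteration (assumed).
val bcCenters = sc.broadcast(centers)

val assignments = points.map { p =>
  val cs = bcCenters.value // fetched once per executor, not shipped with every task closure
  cs.indices.minBy { i =>
    // squared Euclidean distance between center i and point p
    cs(i).zip(p).map { case (a, b) => (a - b) * (a - b) }.sum
  }
}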
Re: Web UI doesn't show some stages
The reason is that some operators get pipelined into a single stage. rdd.map(XX).filter(YY) - this executes in a single stage since there is no data movement needed in between these operations. If you call toDebugString on the final RDD it will give you some information about the exact lineage. In Spark 1.1 this will return information about stage boundaries as well. On Wed, Aug 20, 2014 at 4:22 AM, Grzegorz Białek grzegorz.bia...@codilime.com wrote: Hi, I am wondering why in web UI some stages (like join, filter) are not visible. For example this code: val simple = sc.parallelize(Array.range(0,100)) val simple2 = sc.parallelize(Array.range(0,100)) val toJoin = simple.map(x => (x, x.toString + x.toString)) val rdd = simple2 .map(x => (scala.util.Random.nextInt(100), x)) .join(toJoin) .map { case (r, (x, s)) => (r, x)} .reduceByKey(_ + _) .sortByKey() .cache() rdd.saveAsTextFile("output/1") val rdd2 = toJoin .groupBy{ case (x, _) => x} .filter{ case (x, _) => x 10} rdd2.saveAsTextFile("output/2") println(rdd2.join(toJoin).count()) in UI doesn't show join and filter stages and moreover it shows sortByKey and reduceByKey twice. Could anyone explain how it works? Thanks, Grzegorz
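For example, the pipelining is easy to see with toDebugString (run in spark-shell):

val pipelined = sc.parallelize(1 to 100).map(_ * 2).filter(_ % 3 == 0)
// map and filter require no data movement, so they are fused into one stage;
// toDebugString prints the lineage (and, in Spark 1.1+, stage boundaries).
println(pipelined.toDebugString)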
Re: Understanding RDD.GroupBy OutOfMemory Exceptions
Hi Jens, Within a partition things will spill - so the current documentation is correct. This spilling can only occur *across keys* at the moment. Spilling cannot occur within a key at present. This is discussed in the video here: https://www.youtube.com/watch?v=dmL0N3qfSc8index=3list=PL-x35fyliRwj7qNxXLgMRJaOk7o9inHBZ Spilling within one key for GroupBy's is likely to end up in the next release of Spark, Spark 1.2. In most cases we see when users hit this, they are actually trying to just do aggregations which would be more efficiently implemented without the groupBy operator. If the goal is literally to just write out to disk all the values associated with each group, and the values associated with a single group are larger than fit in memory, this cannot be accomplished right now with the groupBy operator. The best way to work around this depends a bit on what you are trying to do with the data down stream. Typically approaches involve sub-dividing any very large groups, for instance, appending a hashed value in a small range (1-10) to large keys. Then your downstream code has to deal with aggregating partial values for each group. If your goal is just to lay each group out sequentially on disk on one big file, you can call `sortByKey` with a hashed suffix as well. The sort functions are externalized in Spark 1.1 (which is in pre-release). - Patrick On Tue, Aug 5, 2014 at 2:39 PM, Jens Kristian Geyti sp...@jkg.dk wrote: Patrick Wendell wrote In the latest version of Spark we've added documentation to make this distinction more clear to users: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390 That is a very good addition to the documentation. Nice and clear about the dangers of groupBy. Patrick Wendell wrote Currently groupBy requires that all of the values for one key can fit in memory. Is that really true? Will partitions not spill to disk, hence the recommendation in the documentation to up the parallelism of groupBy et al? A better question might be: How exactly does partitioning affect groupBy with regards to memory consumption. What will **have** to fit in memory, and what may be spilled to disk, if running out of memory? And if it really is true, that Spark requires all groups' values to fit in memory, how do I do a on-disk grouping of results, similar to what I'd to in a Hadoop job by using a mapper emitting (groupId, value) key-value pairs, and having an entity reducer writing results to disk? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11487.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: What should happen if we try to cache more data than the cluster can hold in memory?
It seems possible that you are running out of memory unrolling a single partition of the RDD. This is something that can cause your executor to OOM, especially if the cache is close to being full so the executor doesn't have much free memory left. How large are your executors? At the time of failure, is the cache already nearly full? I also believe the Snappy compression codec in Hadoop is not splittable. This means that each of your JSON files is read in its entirety as one spark partition. If you have files that are larger than the standard block size (128MB), it will exacerbate this shortcoming of Spark. Incidentally, this means minPartitions won't help you at all here. This is fixed in the master branch and will be fixed in Spark 1.1. As a debugging step (if this is doable), it's worth running this job on the master branch and seeing if it succeeds. https://github.com/apache/spark/pull/1165 A (potential) workaround would be to first persist your data to disk, then re-partition it, then cache it. I'm not 100% sure whether that will work though. val a = sc.textFile(s3n://some-path/*.json).persist(DISK_ONLY).repartition(larger nr of partitions).cache() - Patrick On Fri, Aug 1, 2014 at 10:17 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Fri, Aug 1, 2014 at 12:39 PM, Sean Owen so...@cloudera.com wrote: Isn't this your worker running out of its memory for computations, rather than for caching RDDs? I'm not sure how to interpret the stack trace, but let's say that's true. I'm even seeing this with a simple a = sc.textFile().cache() and then a.count(). Spark shouldn't need that much memory for this kind of work, no? then the answer is that you should tell it to use less memory for caching. I can try that. That's done by changing spark.storage.memoryFraction, right? This still seems strange though. The default fraction of the JVM left for non-cache activity (1 - 0.6 = 40% http://spark.apache.org/docs/latest/configuration.html#execution-behavior) should be plenty for just counting elements. I'm using m1.xlarge nodes that have 15GB of memory apiece. Nick
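For reference, a cleaned-up version of that workaround (the path and the partition count are placeholders, and as said above there is no guarantee it avoids the unroll OOM):

  import org.apache.spark.storage.StorageLevel
  val a = sc.textFile("s3n://some-path/*.json")
            .persist(StorageLevel.DISK_ONLY)
            .repartition(500)   // pick something comfortably larger than the number of input files
            .cache()
  a.count()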
Re: What should happen if we try to cache more data than the cluster can hold in memory?
BTW - the reason why the workaround could help is because when persisting to DISK_ONLY, we explicitly avoid materializing the RDD partition in memory... we just pass it through to disk On Mon, Aug 4, 2014 at 1:10 AM, Patrick Wendell pwend...@gmail.com wrote: It seems possible that you are running out of memory unrolling a single partition of the RDD. This is something that can cause your executor to OOM, especially if the cache is close to being full so the executor doesn't have much free memory left. How large are your executors? At the time of failure, is the cache already nearly full? I also believe the Snappy compression codec in Hadoop is not splittable. This means that each of your JSON files is read in its entirety as one spark partition. If you have files that are larger than the standard block size (128MB), it will exacerbate this shortcoming of Spark. Incidentally, this means minPartitions won't help you at all here. This is fixed in the master branch and will be fixed in Spark 1.1. As a debugging step (if this is doable), it's worth running this job on the master branch and seeing if it succeeds. https://github.com/apache/spark/pull/1165 A (potential) workaround would be to first persist your data to disk, then re-partition it, then cache it. I'm not 100% sure whether that will work though. val a = sc.textFile(s3n://some-path/*.json).persist(DISK_ONLY).repartition(larger nr of partitions).cache() - Patrick On Fri, Aug 1, 2014 at 10:17 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Fri, Aug 1, 2014 at 12:39 PM, Sean Owen so...@cloudera.com wrote: Isn't this your worker running out of its memory for computations, rather than for caching RDDs? I'm not sure how to interpret the stack trace, but let's say that's true. I'm even seeing this with a simple a = sc.textFile().cache() and then a.count(). Spark shouldn't need that much memory for this kind of work, no? then the answer is that you should tell it to use less memory for caching. I can try that. That's done by changing spark.storage.memoryFraction, right? This still seems strange though. The default fraction of the JVM left for non-cache activity (1 - 0.6 = 40% http://spark.apache.org/docs/latest/configuration.html#execution-behavior) should be plenty for just counting elements. I'm using m1.xlarge nodes that have 15GB of memory apiece. Nick
Re: Issues with HDP 2.4.0.2.1.3.0-563
For Hortonworks, I believe it should work to just link against the corresponding upstream version, i.e. just set the Hadoop version to 2.4.0. Does that work? - Patrick On Mon, Aug 4, 2014 at 12:13 AM, Ron's Yahoo! zlgonza...@yahoo.com.invalid wrote: Hi, Not sure whose issue this is, but if I run make-distribution using HDP 2.4.0.2.1.3.0-563 as the hadoop version (replacing it in make-distribution.sh), I get a strange error with the exception below. If I use a slightly older version of HDP (2.4.0.2.1.2.0-402) with make-distribution, using the generated assembly all works fine for me. Either 1.0.0 or 1.0.1 will work fine. Should I file a JIRA or is this a known issue? Thanks, Ron Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception failure in TID 0 on host localhost: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47) org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111) org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99) org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745)
Re: Cached RDD Block Size - Uneven Distribution
Are you directly caching files from Hadoop or are you doing some transformation on them first? If you are doing a groupBy or some type of transformation, then you could be causing data skew that way. On Sun, Aug 3, 2014 at 1:19 PM, iramaraju iramar...@gmail.com wrote: I am running spark 1.0.0, Tachyon 0.5 and Hadoop 1.0.4. I am selecting a subset of a large dataset and trying to run queries on the cached schema RDD. Strangely, in the web UI, I see the following (150 partitions):

Block Name   Storage Level                      Size in Memory   Size on Disk   Executors
rdd_30_68    Memory Deserialized 1x Replicated  307.5 MB         0.0 B          ip-172-31-45-100.ec2.internal:37796
rdd_30_133   Memory Deserialized 1x Replicated  216.0 MB         0.0 B          ip-172-31-45-101.ec2.internal:55947
rdd_30_18    Memory Deserialized 1x Replicated  194.2 MB         0.0 B          ip-172-31-42-159.ec2.internal:43543
rdd_30_24    Memory Deserialized 1x Replicated  173.3 MB         0.0 B          ip-172-31-45-101.ec2.internal:55947
rdd_30_70    Memory Deserialized 1x Replicated  168.2 MB         0.0 B          ip-172-31-18-220.ec2.internal:39847
rdd_30_105   Memory Deserialized 1x Replicated  154.1 MB         0.0 B          ip-172-31-45-102.ec2.internal:36700
rdd_30_79    Memory Deserialized 1x Replicated  153.9 MB         0.0 B          ip-172-31-45-99.ec2.internal:59538
rdd_30_60    Memory Deserialized 1x Replicated  4.2 MB           0.0 B          ip-172-31-45-102.ec2.internal:36700
rdd_30_99    Memory Deserialized 1x Replicated  112.0 B          0.0 B          ip-172-31-45-102.ec2.internal:36700
rdd_30_90    Memory Deserialized 1x Replicated  112.0 B          0.0 B          ip-172-31-45-102.ec2.internal:36700
rdd_30_9     Memory Deserialized 1x Replicated  112.0 B          0.0 B          ip-172-31-18-220.ec2.internal:39847
rdd_30_89    Memory Deserialized 1x Replicated  112.0 B          0.0 B          ip-172-31-45-102.ec2.internal:36700

What is strange to me is the Size in Memory is mostly 112 bytes, except for 8 of them. (I have 9 data files in Hadoop, which are well-distributed 64MB blocks.) The tasks processing the RDD are getting stuck after finishing a few initial tasks. I am wondering if it is because Spark has hit the large blocks and is trying to process them on one worker per task. Any suggestions on how I can distribute them (the block sizes) more evenly? And why are my Hadoop blocks nicely even while the Spark cached RDD has such an uneven distribution? Any help is appreciated. Regards Ram -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cached-RDD-Block-Size-Uneven-Distribution-tp11286.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
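If it helps to confirm where the skew is before caching, counting records per partition is a quick check (a minimal sketch; rdd here stands for whatever is about to be cached):

  val sizes = rdd.mapPartitionsWithIndex((i, it) => Iterator((i, it.size))).collect()
  sizes.sortBy(-_._2).take(10).foreach(println)   // the heaviest partitions print first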
Re: disable log4j for spark-shell
If you want to customize the logging behavior - the simplest way is to copy conf/log4j.properties.template to conf/log4j.properties. Then you can go and modify the log level in there. The Spark shells should pick this up. On Sun, Aug 3, 2014 at 6:16 AM, Sean Owen so...@cloudera.com wrote: That's just a template. Nothing consults that file by default. It's looking inside the Spark .jar. If you edit core/src/main/resources/org/apache/spark/log4j-defaults.properties and rebuild Spark, it will pick up those changes. I think you could also use the JVM argument -Dlog4j.configuration=conf/log4j-defaults.properties to force it to look at your local, edited props file. Someone may have to correct me, but I think that in master right now, that means using --driver-java-options=.. to set an argument to the JVM that runs the shell? On Sun, Aug 3, 2014 at 2:07 PM, Gil Vernik g...@il.ibm.com wrote: Hi, I would like to run spark-shell without any INFO messages printed. To achieve this I edited /conf/log4j.properties and added the line log4j.rootLogger=OFF, which is supposed to disable all logging. However, when I run ./spark-shell I see the message 14/08/03 16:02:15 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties And then spark-shell prints all INFO messages as before. What did I miss? Why does spark-shell use the default log4j properties and not the one defined in the /conf directory? Is there another solution to prevent spark-shell from printing INFO messages? Thanks, Gil. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2
This is a Scala bug - I filed something upstream; hopefully they can fix it soon and/or we can provide a workaround: https://issues.scala-lang.org/browse/SI-8772 - Patrick On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote: Currently Scala 2.10.2 can't be pulled in from Maven Central, it seems; however, if you have it in your ivy cache it should work. On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote: Me 3 On Fri, Aug 1, 2014 at 11:15 AM, nit nitinp...@gmail.com wrote: I also ran into the same issue. What is the solution? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Compiling-Spark-master-284771ef-with-sbt-sbt-assembly-fails-on-EC2-tp11155p11189.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- Cell : 425-233-8271
Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2
I've had intermittent access to the artifacts themselves, but for me the directory listing always 404's. I think if sbt hits a 404 on the directory, it sends a somewhat confusing error message that it can't download the artifact. - Patrick On Fri, Aug 1, 2014 at 3:28 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: This fails for me too. I have no idea why it happens, as I can wget the pom from Maven Central. To work around this I just copied the ivy xmls and jars from this github repo https://github.com/peterklipfel/scala_koans/tree/master/ivyrepo/cache/org.scala-lang/scala-library and put them in /root/.ivy2/cache/org.scala-lang/scala-library Thanks Shivaram On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote: Currently Scala 2.10.2 can't be pulled in from Maven Central, it seems; however, if you have it in your ivy cache it should work. On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote: Me 3 On Fri, Aug 1, 2014 at 11:15 AM, nit nitinp...@gmail.com wrote: I also ran into the same issue. What is the solution? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Compiling-Spark-master-284771ef-with-sbt-sbt-assembly-fails-on-EC2-tp11155p11189.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- Cell : 425-233-8271
Re: how to publish spark inhouse?
All of the scripts we use to publish Spark releases are in the Spark repo itself, so you could follow these as a guideline. The publishing process in Maven is similar to in SBT: https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L65 On Mon, Jul 28, 2014 at 12:39 PM, Koert Kuipers ko...@tresata.com wrote: ah ok thanks. guess i am gonna read up about maven-release-plugin then! On Mon, Jul 28, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote: This is not something you edit yourself. The Maven release plugin manages setting all this. I think virtually everything you're worried about is done for you by this plugin. Maven requires artifacts to set a version and it can't inherit one. I feel like I understood the reason this is necessary at one point. On Mon, Jul 28, 2014 at 8:33 PM, Koert Kuipers ko...@tresata.com wrote: and if i want to change the version, it seems i have to change it in all 23 pom files? mhhh. is it mandatory for these sub-project pom files to repeat that version info? useful? spark$ grep 1.1.0-SNAPSHOT * -r | wc -l 23 On Mon, Jul 28, 2014 at 3:05 PM, Koert Kuipers ko...@tresata.com wrote: hey we used to publish spark inhouse by simply overriding the publishTo setting. but now that we are integrated in SBT with maven i cannot find it anymore. i tried looking into the pom file, but after reading 1144 lines of xml i 1) havent found anything that looks like publishing 2) i feel somewhat sick too 3) i am considering alternative careers to developing... where am i supposed to look? thanks for your help!
Re: Catalyst dependency on Spark Core
Adding new build modules is pretty high overhead, so if this is a case where a small amount of duplicated code could get rid of the dependency, that could also be a good short-term option. - Patrick On Mon, Jul 14, 2014 at 2:15 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yeah, I'd just add a spark-util that has these things. Matei On Jul 14, 2014, at 1:04 PM, Michael Armbrust mich...@databricks.com wrote: Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus easy to remove, and I would like catalyst to be usable outside of Spark. A pull request to make this possible would be welcome. Ideally, we'd create some sort of spark common package that has things like logging. That way catalyst could depend on that, without pulling in all of Hadoop, etc. Maybe others have opinions though, so I'm cc-ing the dev list. On Mon, Jul 14, 2014 at 12:21 AM, Yanbo Liang yanboha...@gmail.com wrote: Making Catalyst independent of Spark is the goal of Catalyst; it may need time and evolution. I noticed that the package org.apache.spark.sql.catalyst.util imports org.apache.spark.util.{Utils => SparkUtils}, so Catalyst has a dependency on Spark core. I'm not sure whether it will be replaced by another component independent of Spark in a later release. 2014-07-14 11:51 GMT+08:00 Aniket Bhatnagar aniket.bhatna...@gmail.com: As per the recent presentation given at Scala Days (http://people.apache.org/~marmbrus/talks/SparkSQLScalaDays2014.pdf), it was mentioned that Catalyst is independent of Spark. But on inspecting pom.xml of the sql/catalyst module, it seems it has a dependency on Spark Core. Any particular reason for the dependency? I would love to use Catalyst outside Spark (reposted as previous email bounced. Sorry if this is a duplicate).
Announcing Spark 1.0.1
I am happy to announce the availability of Spark 1.0.1! This release includes contributions from 70 developers. Spark 1.0.1 includes fixes across several areas of Spark, including the core API, PySpark, and MLlib. It also includes new features in Spark's (alpha) SQL library, including support for JSON data, along with performance and stability fixes. Visit the release notes[1] to read about this release or download[2] the release today. [1] http://spark.apache.org/releases/spark-release-1-0-1.html [2] http://spark.apache.org/downloads.html
Re: How to clear the list of Completed Appliations in Spark web UI?
There isn't currently a way to do this, but it will start dropping older applications once more than 200 are stored. On Wed, Jul 9, 2014 at 4:04 PM, Haopu Wang hw...@qilinsoft.com wrote: Besides restarting the Master, is there any other way to clear the Completed Applications in Master web UI?
Re: Purpose of spark-submit?
It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ?
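To make the compiled-in vs. submit-time distinction concrete, a minimal sketch (the app name and memory value are illustrative):

  import org.apache.spark.{SparkConf, SparkContext}
  // Hard-coded: this value ships inside the jar and needs a rebuild to change.
  val conf = new SparkConf().setAppName("my-app").set("spark.executor.memory", "4g")
  val sc = new SparkContext(conf)
  // Alternatively, leave the setting out of the code and let spark-submit supply it at
  // launch time (e.g. --executor-memory 4g), so the same jar can run with different configs.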
Re: issues with ./bin/spark-shell for standalone mode
Hey Mikhail, I think (hope?) the -em and -dm options were never in an official Spark release. They were just in the master branch at some point. Did you use these during a previous Spark release or were you just on master? - Patrick On Wed, Jul 9, 2014 at 9:18 AM, Mikhail Strebkov streb...@gmail.com wrote: Thanks Andrew, ./bin/spark-shell --master spark://10.2.1.5:7077 --total-executor-cores 30 --executor-memory 20g --driver-memory 10g works well, just wanted to make sure that I'm not missing anything -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9111.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: hadoop + yarn + spark
Hi There, There is an issue with PySpark-on-YARN that requires users to build with Java 6. The issue has to do with how Java 6 and 7 package jar files differently. Can you try building Spark with Java 6 and trying again? - Patrick On Fri, Jun 27, 2014 at 5:00 PM, sdeb sangha...@gmail.com wrote: Hello, I have installed spark on top of hadoop + yarn. When I launch the pyspark shell and try to compute something I get this error: Error from python worker: /usr/bin/python: No module named pyspark The pyspark module should be there, do I have to put an external link to it? --Sanghamitra. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/hadoop-yarn-spark-tp8466.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: 1.0.1 release plan
Hey There, I'd like to start voting on this release shortly because there are a few important fixes that have queued up. We're just waiting to fix an akka issue. I'd guess we'll cut a vote in the next few days. - Patrick On Thu, Jun 19, 2014 at 10:47 AM, Mingyu Kim m...@palantir.com wrote: Hi all, Is there any plan for 1.0.1 release? Mingyu
Re: Trailing Tasks Saving to HDFS
I'll make a comment on the JIRA - thanks for reporting this, let's get to the bottom of it. On Thu, Jun 19, 2014 at 11:19 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: I've created an issue for this but if anyone has any advice, please let me know. Basically, on about 10 GBs of data, saveAsTextFile() to HDFS hangs on two remaining tasks (out of 320). Those tasks seem to be waiting on data from another task on another node. Eventually (about 2 hours later) they time out with a connection reset by peer. All the data actually seems to be on HDFS as the expected part files. It just seems like the remaining tasks have corrupted metadata, so that they do not realize that they are done. Just a guess though. https://issues.apache.org/jira/browse/SPARK-2202 -Suren On Wed, Jun 18, 2014 at 8:35 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: Looks like eventually there was some type of reset or timeout and the tasks have been reassigned. I'm guessing they'll keep failing until max failure count. The machine it disconnected from was a remote machine, though I've seen such failures from connections to itself with other problems. The log lines from the remote machine are also below. Any thoughts or guesses would be appreciated! HUNG WORKER 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.103,57626) java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251) at sun.nio.ch.IOUtil.read(IOUtil.java:224) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254) at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496) at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:679) 14/06/18 19:41:18 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(172.16.25.103,57626) 14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.103,57626) 14/06/18 19:41:18 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.103,57626) 14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.103,57626) 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found REMOTE WORKER 14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.124,55610) 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found On Wed, Jun 18, 2014 at 7:16 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: I have a flow that ends with saveAsTextFile() to HDFS. It seems all the expected files per partition have been written out, based on the number of part files and the file sizes. But the driver logs show 2 tasks still not completed and has no activity and the worker logs show no activity for those two tasks for a while now. Has anyone run into this situation? It's happened to me a couple of times now. Thanks. 
-- Suren SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hira...@velos.io W: www.velos.io
Re: Enormous EC2 price jump makes r3.large patch more important
Hey Jeremy, This is patched in the 1.0 and 0.9 branches of Spark. We're likely to make a 1.0.1 release soon (this patch being one of the main reasons), but if you are itching for this sooner, you can just checkout the head of branch-1.0 and you will be able to use r3.XXX instances. - Patrick On Tue, Jun 17, 2014 at 4:17 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Some people (me included) might have wondered why all our m1.large spot instances (in us-west-1) shut down a few hours ago... Simple reason: The EC2 spot price for Spark's default m1.large instances just jumped from 0.016 per hour, to about 0.750. Yes, Fifty times. Probably something to do with world cup. So far this is just us-west-1, but prices have a tendency to equalize across centers as the days pass. Time to make backups and plans. m3 spot prices are still down at $0.02 (and being new, will be bypassed by older systems), so it would be REAAALLYY nice if there had been some progress on that issue. Let me know if I can help with testing and whatnot. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: Enormous EC2 price jump makes r3.large patch more important
By the way, in case it's not clear, I mean our maintenance branches: https://github.com/apache/spark/tree/branch-1.0 On Tue, Jun 17, 2014 at 8:35 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Jeremy, This is patched in the 1.0 and 0.9 branches of Spark. We're likely to make a 1.0.1 release soon (this patch being one of the main reasons), but if you are itching for this sooner, you can just checkout the head of branch-1.0 and you will be able to use r3.XXX instances. - Patrick On Tue, Jun 17, 2014 at 4:17 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Some people (me included) might have wondered why all our m1.large spot instances (in us-west-1) shut down a few hours ago... Simple reason: The EC2 spot price for Spark's default m1.large instances just jumped from 0.016 per hour, to about 0.750. Yes, Fifty times. Probably something to do with world cup. So far this is just us-west-1, but prices have a tendency to equalize across centers as the days pass. Time to make backups and plans. m3 spot prices are still down at $0.02 (and being new, will be bypassed by older systems), so it would be REAAALLYY nice if there had been some progress on that issue. Let me know if I can help with testing and whatnot. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: Enormous EC2 price jump makes r3.large patch more important
Actually you'll just want to clone the 1.0 branch then use the spark-ec2 script in there to launch your cluster. The --spark-git-repo flag is if you want to launch with a different version of Spark on the cluster. In your case you just need a different version of the launch script itself, which will be present in the 1.0 branch of Spark. - Patrick On Tue, Jun 17, 2014 at 9:29 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: I am about to spin up some new clusters, so I may give that a go... any special instructions for making them work? I assume I use the --spark-git-repo= option on the spark-ec2 command. Is it as easy as concatenating your string as the value? On cluster management GUIs... I've been looking around at Amabari, Datastax, Cloudera, OpsCenter etc. Not totally convinced by any of them yet. Anyone using a good one I should know about? I'm really beginning to lean in the direction of Cassandra as the distributed data store... On Wed, Jun 18, 2014 at 1:46 PM, Patrick Wendell pwend...@gmail.com wrote: By the way, in case it's not clear, I mean our maintenance branches: https://github.com/apache/spark/tree/branch-1.0 On Tue, Jun 17, 2014 at 8:35 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Jeremy, This is patched in the 1.0 and 0.9 branches of Spark. We're likely to make a 1.0.1 release soon (this patch being one of the main reasons), but if you are itching for this sooner, you can just checkout the head of branch-1.0 and you will be able to use r3.XXX instances. - Patrick On Tue, Jun 17, 2014 at 4:17 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Some people (me included) might have wondered why all our m1.large spot instances (in us-west-1) shut down a few hours ago... Simple reason: The EC2 spot price for Spark's default m1.large instances just jumped from 0.016 per hour, to about 0.750. Yes, Fifty times. Probably something to do with world cup. So far this is just us-west-1, but prices have a tendency to equalize across centers as the days pass. Time to make backups and plans. m3 spot prices are still down at $0.02 (and being new, will be bypassed by older systems), so it would be REAAALLYY nice if there had been some progress on that issue. Let me know if I can help with testing and whatnot. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: Wildcard support in input path
These paths get passed directly to the Hadoop FileSystem API, and I think they support globbing out of the box. So AFAIK it should just work. On Tue, Jun 17, 2014 at 9:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi Jianshi, I have used wild card characters (*) in my program and it worked. My code was like this: b = sc.textFile("hdfs:///path to file/data_file_2013SEP01*") Thanks Regards, Meethu M On Wednesday, 18 June 2014 9:29 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: It would be convenient if Spark's textFile, parquetFile, etc. could support paths with wildcards, such as: hdfs://domain/user/jianshuang/data/parquet/table/month=2014* Or is there already a way to do it now? Jianshi -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
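For what it's worth, since the path string is handed to the Hadoop input layer, comma-separated lists of paths work as well as globs (a small sketch; the paths are made up):

  val byGlob = sc.textFile("hdfs:///logs/2014-06-*.gz")
  val byList = sc.textFile("hdfs:///logs/part-a.txt,hdfs:///logs/part-b.txt")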
Re: Java IO Stream Corrupted - Invalid Type AC?
Out of curiosity - are you guys using speculation, shuffle consolidation, or any other non-default option? If so that would help narrow down what's causing this corruption. On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Matt/Ryan, Did you make any headway on this? My team is running into this also. Doesn't happen on smaller datasets. Our input set is about 10 GB but we generate 100s of GBs in the flow itself. -Suren On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton compton.r...@gmail.com wrote: Just ran into this today myself. I'm on branch-1.0 using a CDH3 cluster (no modifications to Spark or its dependencies). The error appeared trying to run GraphX's .connectedComponents() on a ~200GB edge list (GraphX worked beautifully on smaller data). Here's the stacktrace (it's quite similar to yours https://imgur.com/7iBA4nJ ). 14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed 4 times; aborting job 14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at VertexRDD.scala:100 Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.599:39 failed 4 times, most recent failure: Exception failure in TID 29735 on host node18: java.io.StreamCorruptedException: invalid type code: AC java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355) java.io.ObjectInputStream.readObject(ObjectInputStream.java:350) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192) org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78) org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75) org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at
Re: Setting spark memory limit
If you run locally then Spark doesn't launch remote executors. However, in this case you can set the memory with the --driver-memory flag to spark-submit. Does that work? - Patrick On Mon, Jun 9, 2014 at 3:24 PM, Henggang Cui cuihengg...@gmail.com wrote: Hi, I'm trying to run the SimpleApp example (http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala) on a larger dataset. The input file is about 1GB, but when I run the Spark program, it says: java.lang.OutOfMemoryError: GC overhead limit exceeded; the full error output is attached at the end of the e-mail. Then I tried multiple ways of setting the memory limit. In the SimpleApp.scala file, I set the following configuration: val conf = new SparkConf() .setAppName("Simple Application") .set("spark.executor.memory", "10g") And I have also tried appending the following configuration to the conf/spark-defaults.conf file: spark.executor.memory 10g But neither of them works. In the error message, it claims (estimated size 103.8 MB, free 191.1 MB), so the total available memory is still 300MB. Why? Thanks, Cui $ ~/spark-1.0.0-bin-hadoop1/bin/spark-submit --class SimpleApp --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar /tmp/mdata0-10.tsd Spark assembly has been built with Hive, including Datanucleus jars on classpath 14/06/09 15:06:29 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/06/09 15:06:29 INFO SecurityManager: Changing view acls to: cuihe 14/06/09 15:06:29 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cuihe) 14/06/09 15:06:29 INFO Slf4jLogger: Slf4jLogger started 14/06/09 15:06:29 INFO Remoting: Starting remoting 14/06/09 15:06:30 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@131-1.bfc.hpl.hp.com:40779] 14/06/09 15:06:30 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@131-1.bfc.hpl.hp.com:40779] 14/06/09 15:06:30 INFO SparkEnv: Registering MapOutputTracker 14/06/09 15:06:30 INFO SparkEnv: Registering BlockManagerMaster 14/06/09 15:06:30 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140609150630-eaa9 14/06/09 15:06:30 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.
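A brief note on why the in-code setting cannot help here (a sketch; the 10g figure is just the one from the question): in local mode the executor is the driver JVM, and spark.executor.memory set after the JVM has started cannot grow a heap that is already fixed.

  // Runs inside an already-started JVM, so it cannot enlarge the heap in local mode:
  val conf = new SparkConf().setAppName("Simple Application").set("spark.executor.memory", "10g")
  // The heap has to be sized when the JVM is launched instead, e.g.
  //   spark-submit --driver-memory 10g --class SimpleApp ... (the flag discussed above)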
14/06/09 15:06:30 INFO ConnectionManager: Bound socket to port 47164 with id = ConnectionManagerId(131-1.bfc.hpl.hp.com,47164) 14/06/09 15:06:30 INFO BlockManagerMaster: Trying to register BlockManager 14/06/09 15:06:30 INFO BlockManagerInfo: Registering block manager 131-1.bfc.hpl.hp.com:47164 with 294.9 MB RAM 14/06/09 15:06:30 INFO BlockManagerMaster: Registered BlockManager 14/06/09 15:06:30 INFO HttpServer: Starting HTTP Server 14/06/09 15:06:30 INFO HttpBroadcast: Broadcast server started at http://16.106.36.131:48587 14/06/09 15:06:30 INFO HttpFileServer: HTTP File server directory is /tmp/spark-35e1c47b-bfa1-4fba-bc64-df8eee287bb7 14/06/09 15:06:30 INFO HttpServer: Starting HTTP Server 14/06/09 15:06:30 INFO SparkUI: Started SparkUI at http://131-1.bfc.hpl.hp.com:4040 14/06/09 15:06:30 INFO SparkContext: Added JAR file:/data/cuihe/spark-app/target/scala-2.10/simple-project_2.10-1.0.jar at http://16.106.36.131:35579/jars/simple-project_2.10-1.0.jar with timestamp 1402351590741 14/06/09 15:06:30 INFO MemoryStore: ensureFreeSpace(32856) called with curMem=0, maxMem=309225062 14/06/09 15:06:30 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 32.1 KB, free 294.9 MB) 14/06/09 15:06:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/06/09 15:06:30 WARN LoadSnappy: Snappy native library not loaded 14/06/09 15:06:30 INFO FileInputFormat: Total input paths to process : 1 14/06/09 15:06:30 INFO SparkContext: Starting job: count at SimpleApp.scala:14 14/06/09 15:06:31 INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:14) with 7 output partitions (allowLocal=false) 14/06/09 15:06:31 INFO DAGScheduler: Final stage: Stage 0(count at SimpleApp.scala:14) 14/06/09 15:06:31 INFO DAGScheduler: Parents of final stage: List() 14/06/09 15:06:31 INFO DAGScheduler: Missing parents: List() 14/06/09 15:06:31 INFO DAGScheduler: Submitting Stage 0 (FilteredRDD[2] at filter at SimpleApp.scala:14), which has no missing parents 14/06/09 15:06:31 INFO DAGScheduler: Submitting 7 missing tasks from Stage 0 (FilteredRDD[2] at filter at SimpleApp.scala:14) 14/06/09 15:06:31 INFO TaskSchedulerImpl: Adding task set 0.0 with 7 tasks 14/06/09 15:06:31 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL) 14/06/09 15:06:31 INFO TaskSetManager: Serialized task 0.0:0 as 1839 bytes in 2 ms 14/06/09 15:06:31 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0
Paul, Could you give the version of Java that you are building with and the version of Java you are running with? Are they the same? Just off the cuff, I wonder if this is related to: https://issues.apache.org/jira/browse/SPARK-1520 If it is, it could appear that certain functions are not in the jar because they go beyond the extended zip boundary `jar tvf` won't list them. - Patrick On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown p...@mult.ifario.us wrote: Moving over to the dev list, as this isn't a user-scope issue. I just ran into this issue with the missing saveAsTestFile, and here's a little additional information: - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases. - Driver built as an uberjar via Maven. - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with Spark 1.0.0-hadoop1 downloaded from Apache. Given that it functions correctly in local mode but not in a standalone cluster, this suggests to me that the issue is in a difference between the Maven version and the hadoop1 version. In the spirit of taking the computer at its word, we can just have a look in the JAR files. Here's what's in the Maven dep as of 1.0.0: jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class And here's what's in the hadoop1 distribution: jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' I.e., it's not there. It is in the hadoop2 distribution: jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class So something's clearly broken with the way that the distribution assemblies are created. FWIW and IMHO, the right way to publish the hadoop1 and hadoop2 flavors of Spark to Maven Central would be as *entirely different* artifacts (spark-core-h1, spark-core-h2). Logged as SPARK-2075 https://issues.apache.org/jira/browse/SPARK-2075. Cheers. -- Paul -- p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Fri, Jun 6, 2014 at 2:45 AM, HenriV henri.vanh...@vdab.be wrote: I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0. Im using google compute engine and cloud storage. but saveAsTextFile is returning errors while saving in the cloud or saving local. When i start a job in the cluster i do get an error but after this error it keeps on running fine untill the saveAsTextFile. 
( I don't know if the two are connected) ---Error at job startup--- ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136) at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130) at org.apache.spark.metrics.MetricsSystem.init(MetricsSystem.scala:84) at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230) at org.apache.spark.SparkContext.init(SparkContext.scala:202) at Hello$.main(Hello.scala:101) at Hello.main(Hello.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at sbt.Run.invokeMain(Run.scala:72) at sbt.Run.run0(Run.scala:65) at
Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0
Also I should add - thanks for taking time to help narrow this down! On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell pwend...@gmail.com wrote: Paul, Could you give the version of Java that you are building with and the version of Java you are running with? Are they the same? Just off the cuff, I wonder if this is related to: https://issues.apache.org/jira/browse/SPARK-1520 If it is, it could appear that certain functions are not in the jar because they go beyond the extended zip boundary `jar tvf` won't list them. - Patrick On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown p...@mult.ifario.us wrote: Moving over to the dev list, as this isn't a user-scope issue. I just ran into this issue with the missing saveAsTestFile, and here's a little additional information: - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases. - Driver built as an uberjar via Maven. - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with Spark 1.0.0-hadoop1 downloaded from Apache. Given that it functions correctly in local mode but not in a standalone cluster, this suggests to me that the issue is in a difference between the Maven version and the hadoop1 version. In the spirit of taking the computer at its word, we can just have a look in the JAR files. Here's what's in the Maven dep as of 1.0.0: jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class And here's what's in the hadoop1 distribution: jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' I.e., it's not there. It is in the hadoop2 distribution: jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class So something's clearly broken with the way that the distribution assemblies are created. FWIW and IMHO, the right way to publish the hadoop1 and hadoop2 flavors of Spark to Maven Central would be as *entirely different* artifacts (spark-core-h1, spark-core-h2). Logged as SPARK-2075 https://issues.apache.org/jira/browse/SPARK-2075. Cheers. -- Paul -- p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Fri, Jun 6, 2014 at 2:45 AM, HenriV henri.vanh...@vdab.be wrote: I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0. Im using google compute engine and cloud storage. but saveAsTextFile is returning errors while saving in the cloud or saving local. When i start a job in the cluster i do get an error but after this error it keeps on running fine untill the saveAsTextFile. 
( I don't know if the two are connected) ---Error at job startup--- ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136) at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130) at org.apache.spark.metrics.MetricsSystem.init(MetricsSystem.scala:84) at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230) at org.apache.spark.SparkContext.init(SparkContext.scala:202) at Hello$.main(Hello.scala:101) at Hello.main(Hello.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43
Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0
Okay I think I've isolated this a bit more. Let's discuss over on the JIRA: https://issues.apache.org/jira/browse/SPARK-2075 On Sun, Jun 8, 2014 at 1:16 PM, Paul Brown p...@mult.ifario.us wrote: Hi, Patrick -- Java 7 on the development machines: » java -version 1 ↵ java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) And on the deployed boxes: $ java -version java version 1.7.0_55 OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1) OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode) Also, unzip -l in place of jar tvf gives the same results, so I don't think it's an issue with jar not reporting the files. Also, the classes do get correctly packaged into the uberjar: unzip -l /target/[deleted]-driver.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 06-08-14 12:05 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 06-08-14 12:05 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class Best. -- Paul — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell pwend...@gmail.com wrote: Paul, Could you give the version of Java that you are building with and the version of Java you are running with? Are they the same? Just off the cuff, I wonder if this is related to: https://issues.apache.org/jira/browse/SPARK-1520 If it is, it could appear that certain functions are not in the jar because they go beyond the extended zip boundary `jar tvf` won't list them. - Patrick On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown p...@mult.ifario.us wrote: Moving over to the dev list, as this isn't a user-scope issue. I just ran into this issue with the missing saveAsTestFile, and here's a little additional information: - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases. - Driver built as an uberjar via Maven. - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with Spark 1.0.0-hadoop1 downloaded from Apache. Given that it functions correctly in local mode but not in a standalone cluster, this suggests to me that the issue is in a difference between the Maven version and the hadoop1 version. In the spirit of taking the computer at its word, we can just have a look in the JAR files. Here's what's in the Maven dep as of 1.0.0: jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class And here's what's in the hadoop1 distribution: jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' I.e., it's not there. It is in the hadoop2 distribution: jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class So something's clearly broken with the way that the distribution assemblies are created. FWIW and IMHO, the right way to publish the hadoop1 and hadoop2 flavors of Spark to Maven Central would be as *entirely different* artifacts (spark-core-h1, spark-core-h2). Logged as SPARK-2075 https://issues.apache.org/jira/browse/SPARK-2075. Cheers. -- Paul -- p...@mult.ifario.us | Multifarious, Inc. 
| http://mult.ifario.us/ On Fri, Jun 6, 2014 at 2:45 AM, HenriV henri.vanh...@vdab.be wrote: I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0. Im using google compute engine and cloud storage. but saveAsTextFile is returning errors while saving in the cloud or saving local. When i start a job in the cluster i do get an error but after this error it keeps on running fine untill the saveAsTextFile. ( I don't know if the two are connected) ---Error at job startup--- ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136) at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130
Re: Setting executor memory when using spark-shell
In 1.0+ you can just pass the --executor-memory flag to ./bin/spark-shell. On Fri, Jun 6, 2014 at 12:32 AM, Oleg Proudnikov oleg.proudni...@gmail.com wrote: Thank you, Hassan! On 6 June 2014 03:23, hassan hellfire...@gmail.com wrote: just use -Dspark.executor.memory= -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Setting-executor-memory-when-using-spark-shell-tp7082p7103.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- Kind regards, Oleg
Re: Spark 1.0 embedded Hive libraries
They are forked and slightly modified, for two reasons: (a) Hive embeds a bunch of other dependencies in its published jars, which makes it really hard for other projects to depend on them. If you look at the hive-exec jar they copy a bunch of other dependencies directly into this jar. We modified the Hive 0.12 build to produce jars that do not include other dependencies inside of them. (b) Hive relies on a version of protobuf that makes it incompatible with certain Hadoop versions. We used a shaded version of the protobuf dependency to avoid this. The forked copy is here - feel free to take a look: https://github.com/pwendell/hive/commits/branch-0.12-shaded-protobuf I'm hoping the upstream Hive project will change their published artifacts to make them usable as a library for other applications. Unfortunately, as it stands we had to fork our own copy of these to make it work. I think it's being tracked by this JIRA: https://issues.apache.org/jira/browse/HIVE-5733 - Patrick On Fri, Jun 6, 2014 at 12:08 PM, Silvio Fiorito silvio.fior...@granturing.com wrote: Is there a repo somewhere with the code for the Hive dependencies (hive-exec, hive-serde, hive-metastore) used in SparkSQL? Are they forked with Spark-specific customizations, like Shark, or simply relabeled with a new package name (org.spark-project.hive)? I couldn't find any repos on Github or Apache main. I want to use some Hive packages outside of the ones burned into the Spark JAR, but I'm having all sorts of headaches due to jar-hell with the Hive JARs in CDH or even HDP mismatched with the Spark Hive JARs. Thanks, Silvio
Re: Spark 1.0.0 fails if mesos.coarse set to true
Hey, thanks a lot for reporting this. Do you mind making a JIRA with the details so we can track it? - Patrick On Wed, Jun 4, 2014 at 9:24 AM, Marek Wiewiorka marek.wiewio...@gmail.com wrote: Exactly the same story - it used to work with 0.9.1 and does not work anymore with 1.0.0. I ran tests using spark-shell as well as my application(so tested turning on coarse mode via env variable and SparkContext properties explicitly) M. 2014-06-04 18:12 GMT+02:00 ajatix a...@sigmoidanalytics.com: I'm running a manually built cluster on EC2. I have mesos (0.18.2) and hdfs (2.0.0-cdh4.5.0) installed on all slaves (3) and masters (3). I have spark-1.0.0 on one master and the executor file is on hdfs for the slaves. Whenever I try to launch a spark application on the cluster, it starts a task on each slave (i'm using default configs) and they start FAILING with the error msg - 'Is spark installed on it?' -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-0-fails-if-mesos-coarse-set-to-true-tp6817p6945.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: is there any easier way to define a custom RDD in Java
Hey There, This is only possible in Scala right now. However, this is almost never needed since the core API is fairly flexible. I have the same question as Andrew... what are you trying to do with your RDD? - Patrick On Wed, Jun 4, 2014 at 7:49 AM, Andrew Ash and...@andrewash.com wrote: Just curious, what do you want your custom RDD to do that the normal ones don't? On Wed, Jun 4, 2014 at 6:30 AM, bluejoe2008 bluejoe2...@gmail.com wrote: Hi folks, is there an easier way to define a custom RDD in Java? I am wondering if I have to define a new Java class that extends RDD from scratch? It is really a hard job for developers! 2014-06-04 bluejoe2008
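For the record, a custom RDD in Scala only has to provide getPartitions and compute. The sketch below is a made-up example that emits ranges of integers; it is meant to show the shape of the API, not something you would normally need:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: each partition covers a half-open range of Ints.
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

class SimpleRangeRDD(sc: SparkContext, numPartitions: Int, perPartition: Int)
  extends RDD[Int](sc, Nil) {

  // Describe how the dataset is split across partitions.
  override protected def getPartitions: Array[Partition] =
    (0 until numPartitions).map { i =>
      new RangePartition(i, i * perPartition, (i + 1) * perPartition): Partition
    }.toArray

  // Produce the records for a single partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}

Using it is just new SimpleRangeRDD(sc, 4, 100).collect(); from Java the practical route is usually to start from sc.parallelize or an existing RDD and use mapPartitions rather than subclassing.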
Re: error with cdh 5 spark installation
Hey Chirag, Those init scripts are part of the Cloudera Spark package (they are not in the Spark project itself), so you might try e-mailing their support lists directly. - Patrick On Wed, Jun 4, 2014 at 7:19 AM, chirag lakhani chirag.lakh...@gmail.com wrote: I recently spun up an AWS cluster with CDH 5 using Cloudera Manager. I am trying to install Spark and simply used the install command, as stated in the CDH 5 documentation: sudo apt-get install spark-core spark-master spark-worker spark-python I get the following error: Setting up spark-master (0.9.0+cdh5.0.1+33-1.cdh5.0.1.p0.25~precise-cdh5.0.1) ... * Starting Spark master (spark-master): invoke-rc.d: initscript spark-master, action start failed. dpkg: error processing spark-master (--configure): subprocess installed post-installation script returned error exit status 1 Errors were encountered while processing: spark-master Has anyone else encountered this? Does anyone have any suggestions of what to do about it? Chirag
Re: Can't seem to link external/twitter classes from my own app
Hey Jeremy, The issue is that you are using one of the external libraries and these aren't actually packaged with Spark on the cluster, so you need to create an uber jar that includes them. You can look at the example here (I recently did this for a Kafka project and the idea is the same): https://github.com/pwendell/kafka-spark-example You'll want to make an uber jar that includes these packages (run sbt assembly) and then submit that jar to spark-submit (there is a rough sbt sketch at the end of this thread). Also, I'd try running it locally first (if you aren't already) just to make the debugging simpler. - Patrick On Wed, Jun 4, 2014 at 6:16 AM, Sean Owen so...@cloudera.com wrote: Ah sorry, this may be the thing I learned for the day. The issue is that classes from that particular artifact are missing though. Worth interrogating the resulting .jar file with jar tf to see if it made it in? On Wed, Jun 4, 2014 at 2:12 PM, Nick Pentreath nick.pentre...@gmail.com wrote: @Sean, the %% syntax in SBT should automatically add the Scala major version qualifier (_2.10, _2.11 etc) for you, so that does appear to be correct syntax for the build. I seemed to run into this issue with some missing Jackson deps, and solved it by including the jar explicitly on the driver class path: bin/spark-submit --driver-class-path SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar --class SimpleApp SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar Seems redundant to me since I thought that the JAR passed as an argument is copied to the driver and made available. But this solved it for me, so perhaps give it a try? On Wed, Jun 4, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote: Those aren't the names of the artifacts: http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-streaming-twitter_2.10%22 The name is spark-streaming-twitter_2.10 On Wed, Jun 4, 2014 at 1:49 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Man, this has been hard going. Six days, and I finally got a Hello World app working that I wrote myself. Now I'm trying to make a minimal streaming app based on the Twitter examples (running standalone right now while learning), and when running it like this: bin/spark-submit --class SimpleApp SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar I'm getting this error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/twitter/TwitterUtils$ Which I'm guessing is because I haven't put in a dependency on external/twitter in the .sbt, but _how_? I can't find any docs on it. Here's my build file so far: simple.sbt -- name := "Simple Project" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.0" libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.0.0" libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3" resolvers += "Akka Repository" at "http://repo.akka.io/releases/" -- I've tried a few obvious things like adding: libraryDependencies += "org.apache.spark" %% "spark-external" % "1.0.0" libraryDependencies += "org.apache.spark" %% "spark-external-twitter" % "1.0.0" because, well, that would match the naming scheme implied so far, but it errors. Also, I just realized I don't completely understand if: (a) the spark-submit command _sends_ the .jar to all the workers, or (b) the spark-submit command sends a _job_ to the workers, which are supposed to already have the jar file installed (or in hdfs), or (c) the Context is supposed to list the jars to be distributed. (is that deprecated?)
One part of the documentation says: Once you have an assembled jar you can call the bin/spark-submit script as shown here while passing your jar. but another says: application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes. I suppose both could be correct if you take a certain point of view. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
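To make the uber-jar suggestion at the top of this thread concrete, here is a rough sketch of the sbt-assembly setup; the plugin version is illustrative, and the "provided" scoping assumes the core Spark jars are already on the cluster:

// project/plugins.sbt (plugin version is illustrative)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

// build.sbt additions (sketch): keep core Spark out of the uber jar, but bundle
// the external Twitter connector so the workers can see TwitterUtils.
// Depending on the sbt-assembly version you may also need its assemblySettings.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"

Running sbt assembly then produces a single jar under target/scala-2.10/ that you pass to spark-submit in place of the plain packaged jar.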
Re: Trouble launching EC2 Cluster with Spark
Hey Sam, You mentioned two problems here - did your VPC error message get fixed, or only the key permissions problem? I noticed someone reported a similar issue with the VPC stuff a long time back (but there is no real resolution here): https://spark-project.atlassian.net/browse/SPARK-1166 If that's still an issue, one thing to try is just changing the name of the cluster. We create groups that are identified with the cluster name, and there might be something that just got screwed up with the original group creation and AWS isn't happy. - Patrick On Wed, Jun 4, 2014 at 12:55 PM, Sam Taylor Steyer sste...@stanford.edu wrote: Awesome, that worked. Thank you! - Original Message - From: Krishna Sankar ksanka...@gmail.com To: user@spark.apache.org Sent: Wednesday, June 4, 2014 12:52:00 PM Subject: Re: Trouble launching EC2 Cluster with Spark chmod 600 path/FinalKey.pem Cheers k/ On Wed, Jun 4, 2014 at 12:49 PM, Sam Taylor Steyer sste...@stanford.edu wrote: Also, once my friend logged in to his cluster he received the error "Permissions 0644 for 'FinalKey.pem' are too open." This sounds like the other problem described. How do we make the permissions more private? Thanks very much, Sam - Original Message - From: Sam Taylor Steyer sste...@stanford.edu To: user@spark.apache.org Sent: Wednesday, June 4, 2014 12:42:04 PM Subject: Re: Trouble launching EC2 Cluster with Spark Thank you! The regions advice solved the problem for my friend who was getting the "key pair does not exist" problem. I am still getting the error: ERROR:boto:400 Bad Request ERROR:boto:<?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'null' for protocol. VPC security group rules must specify protocols explicitly.</Message></Error></Errors><RequestID>7ff92687-b95a-4a39-94cb-e2d00a6928fd</RequestID></Response> This sounds like it could have to do with the access settings of the security group, but I don't know how to change them. Any advice would be much appreciated! Sam - Original Message - From: Krishna Sankar ksanka...@gmail.com To: user@spark.apache.org Sent: Wednesday, June 4, 2014 8:52:59 AM Subject: Re: Trouble launching EC2 Cluster with Spark One reason could be that the keys are in a different region. They need to be created in us-east-1 (North Virginia). Cheers k/ On Wed, Jun 4, 2014 at 7:45 AM, Sam Taylor Steyer sste...@stanford.edu wrote: Hi, I am trying to launch an EC2 cluster from Spark using the following command: ./spark-ec2 -k HackerPair -i [path]/HackerPair.pem -s 2 launch HackerCluster I set my access key ID and secret access key. I have been getting an error in the "setting up security groups..." phase: Invalid value 'null' for protocol. VPC security groups must specify protocols explicitly. My project partner gets one step further and then gets the error "The key pair 'JamesAndSamTest' does not exist." Any thoughts as to how we could fix these problems? Thanks a lot! Sam
Re: spark 1.0 not using properties file from SPARK_CONF_DIR
You can set an arbitrary properties file by adding the --properties-file argument to spark-submit. It would be nice to have spark-submit also look in SPARK_CONF_DIR by default. If you opened a JIRA for that, I'm sure someone would pick it up. On Tue, Jun 3, 2014 at 7:47 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Is it on purpose that when setting SPARK_CONF_DIR, spark-submit still loads the properties file from SPARK_HOME/conf/spark-defaults.conf? IMO it would be more natural to override what is defined in SPARK_HOME/conf with SPARK_CONF_DIR when defined (and SPARK_CONF_DIR being overridden by command line args). Eugen
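Worth noting for anyone tuning this: properties set directly on SparkConf in the application take precedence over anything loaded from a defaults file (with spark-submit flags in between). A small Scala sketch, where the key and value are just examples:

import org.apache.spark.SparkConf

// Sketch: an explicit SparkConf setting wins over the same key in
// spark-defaults.conf or a file passed with --properties-file.
val conf = new SparkConf()
  .set("spark.executor.memory", "2g")        // example key/value
println(conf.get("spark.executor.memory"))   // prints 2g regardless of the defaults file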
Re: spark 1.0.0 on yarn
Okay, I'm guessing that our upstream Hadoop 2 package isn't new enough to work with CDH5. We should probably clarify this in our downloads. Thanks for reporting this. What was the exact string you used when building? Also, which CDH 5 version are you building against? On Mon, Jun 2, 2014 at 8:11 AM, Xu (Simon) Chen xche...@gmail.com wrote: OK, rebuilding the assembly jar file with cdh5 works now... Thanks.. -Simon On Sun, Jun 1, 2014 at 9:37 PM, Xu (Simon) Chen xche...@gmail.com wrote: That helped a bit... Now I have a different failure: the startup process is stuck in an infinite loop outputting the following message: 14/06/02 01:34:56 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1401672868277 yarnAppState: ACCEPTED I am using the Hadoop 2 prebuilt package. Probably it doesn't have the latest yarn client. -Simon On Sun, Jun 1, 2014 at 9:03 PM, Patrick Wendell pwend...@gmail.com wrote: As a debugging step, does it work if you use a single resource manager with the key yarn.resourcemanager.address instead of using two named resource managers? I wonder if somehow the YARN client can't detect this multi-master set-up. On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen xche...@gmail.com wrote: Note that everything works fine in spark 0.9, which is packaged in CDH5: I can launch a spark-shell and interact with workers spawned on my yarn cluster. So in my /opt/hadoop/conf/yarn-site.xml, I have: ... <property> <name>yarn.resourcemanager.address.rm1</name> <value>controller-1.mycomp.com:23140</value> </property> ... <property> <name>yarn.resourcemanager.address.rm2</name> <value>controller-2.mycomp.com:23140</value> </property> ... And the other usual stuff. So spark 1.0 is launched like this: Spark Command: java -cp ::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client --class org.apache.spark.repl.Main I do see /opt/hadoop/conf included, but I'm not sure it's the right place. Thanks.. -Simon On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com wrote: I would agree with your guess, it looks like the yarn library isn't correctly finding your yarn-site.xml file. If you look in yarn-site.xml, do you definitely see the resource manager address/addresses? Also, you can try running this command with SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being set up correctly. - Patrick On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com wrote: Hi all, I tried a couple of ways, but couldn't get it to work.. The following seems to be what the online document (http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting: SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client The help info of spark-shell seems to be suggesting --master yarn --deploy-mode cluster. But either way, I am seeing the following messages: 14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032.
Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) 14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) My guess is that spark-shell is trying to talk to the resource manager to set up Spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from though. I am running CDH5 with two resource managers in HA mode. Their IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up. Any ideas? Thanks. -Simon
Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file
Hey There, The issue was that the old behavior could cause users to silently overwrite data, which is pretty bad, so to be conservative we decided to enforce the same checks that Hadoop does. This was documented by this JIRA: https://issues.apache.org/jira/browse/SPARK-1100 https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1 However, it would be very easy to add an option that allows preserving the old behavior. Is anyone here interested in contributing that? I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-1993 - Patrick On Mon, Jun 2, 2014 at 9:22 AM, Pierre Borckmans pierre.borckm...@realimpactanalytics.com wrote: Indeed, the behavior has changed for good or for bad. I mean, I agree with the danger you mention, but I'm not sure it's happening like that. Isn't there a mechanism for overwrite in Hadoop that automatically removes part files, then writes a _temporary folder and then only the part files along with the _success folder? In any case, this change of behavior should be documented IMO. Cheers, Pierre Message sent from a mobile device - excuse typos and abbreviations On 2 June 2014 at 17:42, Nicholas Chammas nicholas.cham...@gmail.com wrote: What I've found using saveAsTextFile() against S3 (prior to Spark 1.0.0) is that files get overwritten automatically. There is one danger to this though. If I save to a directory that already has 20 part- files, but this time around I'm only saving 15 part- files, then there will be 5 leftover part- files from the previous set mixed in with the 15 newer files. This is potentially dangerous. I haven't checked to see if this behavior has changed in 1.0.0. Are you saying it has, Pierre? On Mon, Jun 2, 2014 at 9:41 AM, Pierre B pierre.borckm...@realimpactanalytics.com wrote: Hi Michaël, Thanks for this. We could indeed do that. But I guess the question is more about the change of behaviour from 0.9.1 to 1.0.0. We never had to care about that in previous versions. Does that mean we have to manually remove existing files, or is there a way to automatically overwrite when using saveAsTextFile? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-tp6696p6700.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
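Until that option exists, one workaround (a sketch, not part of the Spark API; the helper name is made up) is to remove the target directory through the Hadoop FileSystem API before writing, which also avoids the stale part- files Nicholas mentions:

import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch: delete any previous output, then save. The recursive delete also
// clears leftover part- files from earlier, larger runs.
def saveOverwriting(sc: SparkContext, rdd: RDD[String], out: String): Unit = {
  val path = new Path(out)
  val fs = path.getFileSystem(sc.hadoopConfiguration)
  if (fs.exists(path)) {
    fs.delete(path, true)
  }
  rdd.saveAsTextFile(out)
}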
Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file
Thanks for pointing that out. I've assigned you to SPARK-1677 (I think I accidentally assigned myself way back when I created it). This should be an easy fix. On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, Patrick, I think https://issues.apache.org/jira/browse/SPARK-1677 is talking about the same thing? How about assigning it to me? I think I missed the configuration part in my previous commit, though I declared that in the PR description. Best, -- Nan Zhu
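As far as I know, the configuration part discussed here eventually shipped as the spark.hadoop.validateOutputSpecs setting (please verify the key against the release you are running); a minimal Scala sketch of using it:

import org.apache.spark.SparkConf

// Sketch (key name should be verified for your Spark version): skip the
// existing-output-directory check so saveAsTextFile can overwrite.
val conf = new SparkConf().set("spark.hadoop.validateOutputSpecs", "false")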