Re: Spark + Kinesis
Assembly settings have an option to exclude jars. You need something similar to:

    assemblyExcludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
      val excludes = Set("minlog-1.2.jar")
      cp filter { jar => excludes(jar.data.getName) }
    }

in your build file (may need to be refactored into a .scala file).

On Fri, Apr 3, 2015 at 12:57 PM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote:

Removed "provided" and got the following error:

    [error] (*:assembly) deduplicate: different file contents found in the following:
    [error] /Users/vb/.ivy2/cache/com.esotericsoftware.kryo/kryo/bundles/kryo-2.21.jar:com/esotericsoftware/minlog/Log$Logger.class
    [error] /Users/vb/.ivy2/cache/com.esotericsoftware.minlog/minlog/jars/minlog-1.2.jar:com/esotericsoftware/minlog/Log$Logger.class

On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das t...@databricks.com wrote:

Just remove "provided" for spark-streaming-kinesis-asl:

    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote:

Thanks. So how do I fix it?

On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com wrote:

spark-streaming-kinesis-asl is not part of the Spark distribution on your cluster, so you cannot have it be just a "provided" dependency. This is also why the KCL and its dependencies were not included in the assembly (but yes, they should be). ~ Jonathan Kelly

From: Vadim Bichutskiy vadim.bichuts...@gmail.com
Date: Friday, April 3, 2015 at 12:26 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: Spark + Kinesis

Hi all, Good news! I was able to create a Kinesis consumer and assemble it into an uber jar following http://spark.apache.org/docs/latest/streaming-kinesis-integration.html and the example https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala. However, when I try to spark-submit it I get the following exception:

    Exception in thread "main" java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider

Do I need to include the KCL dependency in build.sbt? Here's what it looks like currently:

    import AssemblyKeys._
    name := "Kinesis Consumer"
    version := "1.0"
    organization := "com.myconsumer"
    scalaVersion := "2.11.5"
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" % "provided"
    assemblySettings
    jarName in assembly := "consumer-assembly.jar"
    assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala=false)

Any help appreciated. Thanks, Vadim

On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan jonat...@amazon.com wrote:

It looks like you're attempting to mix Scala versions, so that's going to cause some problems. If you really want to use Scala 2.11.5, you must also use Spark package versions built for Scala 2.11 rather than 2.10. Anyway, that's not quite the correct way to specify Scala dependencies in build.sbt. Instead of placing the Scala version after the artifactId (like spark-core_2.10), what you actually want is to use just spark-core with two percent signs before it. Using two percent signs will make it use the version of Scala that matches your declared scalaVersion. For example:

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

I think that may get you a little closer, though I think you're probably going to run into the same problems I ran into in this thread: https://www.mail-archive.com/user@spark.apache.org/msg23891.html I never really got an answer for that, and I temporarily moved on to other things for now. ~ Jonathan Kelly

From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
Date: Thursday, April 2, 2015 at 9:53 AM
To: user@spark.apache.org
Subject: Spark + Kinesis

Hi all, I am trying to write an Amazon Kinesis consumer Scala app that processes data in the Kinesis stream. Is this the correct way to specify build.sbt?

    import AssemblyKeys._
    name := "Kinesis Consumer"
    version := "1.0"
    organization := "com.myconsumer"
    scalaVersion := "2.11.5"
    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
      "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
      "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")
    assemblySettings
    jarName in assembly := "consumer-assembly.jar"
    assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala=false)
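Pulling the pieces of this thread together, a complete build.sbt might look roughly like the sketch below. This is only a sketch: the 2.10 Scala version (to match the prebuilt Spark 1.3.0 artifacts) is an assumption, and the exact assembly keys (assemblyExcludedJars vs. excludedJars, jarName, assemblyOption) depend on which sbt-assembly version is in use.

    import AssemblyKeys._

    name := "Kinesis Consumer"

    version := "1.0"

    organization := "com.myconsumer"

    // assumption: 2.10.x to match the prebuilt Spark 1.3.0 artifacts pulled in by %%
    scalaVersion := "2.10.4"

    // spark-core and spark-streaming are already on the cluster, so they stay "provided";
    // spark-streaming-kinesis-asl is not, so it must be bundled into the uber jar
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"

    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

    assemblySettings

    jarName in assembly := "consumer-assembly.jar"

    assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

    // drop the standalone minlog jar so its classes don't collide with the copy bundled in kryo
    assemblyExcludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
      val excludes = Set("minlog-1.2.jar")
      cp filter { jar => excludes(jar.data.getName) }
    }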
Re: Spark-EC2 Security Group Error
Appears to be a problem with boto. Make sure you have boto 2.34 on your system.

On Wed, Apr 1, 2015 at 11:19 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote:

Hi all - I'm trying to bring up a Spark EC2 cluster with the script below and see the following error. Can anyone please advise as to what's going on? Is this indicative of me being unable to connect in the first place? The keys are known to work (they're used elsewhere).

    ./spark-ec2 -k $AWS_KEYPAIR_NAME -i $AWS_PRIVATE_KEY -s 2 --region=us-west1 --zone=us-west-1a --instance-type=r3.2xlarge launch FuntimePartyLand

    Setting up security groups...
    Traceback (most recent call last):
      File "./spark_ec2.py", line 1383, in <module>
        main()
      File "./spark_ec2.py", line 1375, in main
        real_main()
      File "./spark_ec2.py", line 1210, in real_main
        (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
      File "./spark_ec2.py", line 431, in launch_cluster
        master_group = get_or_make_group(conn, cluster_name + "-master", opts.vpc_id)
      File "./spark_ec2.py", line 310, in get_or_make_group
        groups = conn.get_all_security_groups()
    AttributeError: 'NoneType' object has no attribute 'get_all_security_groups'

Thank you, Ilya Ganelin

-- Dan Osipov, Shazam
Re: Spark on EC2
You're probably requesting more instances than allowed by your account, so the error gets generated for the extra instances. Try launching a smaller cluster.

On Wed, Apr 1, 2015 at 12:41 PM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote:

Hi all, I just tried launching a Spark cluster on EC2 as described in http://spark.apache.org/docs/1.3.0/ec2-scripts.html and got the following response:

    <Response><Errors><Error><Code>PendingVerification</Code><Message>Your account is currently being verified. Verification normally takes less than 2 hours. Until your account is verified, you may not be able to launch additional instances or create additional volumes. If you are still receiving this message after more than 2 hours, please let us know by writing to aws-verificat...@amazon.com. We appreciate your patience...

However, I can see the EC2 instances in the AWS console as running. Any thoughts on what's going on? Thanks, Vadim

-- Dan Osipov, Shazam
Re: GraphX pregel: getting the current iteration number
I don't think it's possible to access it. What I've done before is send the current or next iteration index with the message, where the message is a case class. HTH, Dan

On Tue, Feb 3, 2015 at 10:20 AM, Matthew Cornell corn...@cs.umass.edu wrote:

Hi Folks, I'm new to GraphX and Scala, and my sendMsg function needs to index into an input list to my algorithm based on the pregel()() iteration number, but I don't see a way to access that. I see in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala that it's just an index variable i in a while loop, but is there a way for sendMsg to access it within the loop's scope? I don't think so, so barring that, given Scala's functional stateless nature, what other approaches would you take to do this? I'm considering a closure, but a var that gets updated by all the sendMsgs seems a recipe for trouble. Thank you, matt
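A minimal sketch of the idea above, with hypothetical vertex-state and message types: the message carries the superstep index, each vertex stores the last index it saw, and sendMsg forwards index + 1. The merge logic and attribute types are placeholders, not part of the original question.

    import org.apache.spark.graphx._

    // hypothetical vertex state and message types; only the iteration bookkeeping matters here
    case class VState(iter: Int, value: Double)
    case class Msg(iter: Int, value: Double)

    def run(graph: Graph[VState, Double]): Graph[VState, Double] =
      graph.pregel(Msg(0, 0.0))(
        // vprog: remember the iteration index delivered with the message
        (id, state, msg) => VState(msg.iter, state.value + msg.value),
        // sendMsg: tag outgoing messages with the next iteration index, so lookups into a
        // per-iteration input list can use triplet.srcAttr.iter
        triplet => Iterator((triplet.dstId, Msg(triplet.srcAttr.iter + 1, triplet.attr))),
        // mergeMsg: messages in the same superstep carry the same index
        (a, b) => Msg(a.iter, a.value + b.value)
      )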
Re: Spark S3 Performance
Can you verify that it's reading the entire file on each worker using network monitoring stats? If it does, that would be a bug in my opinion.

On Mon, Nov 24, 2014 at 2:06 PM, Nitay Joffe ni...@actioniq.co wrote:

Andrei, Ashish, To be clear, I don't think it's *counting* the entire file. It just seems from the logging and the timing that it is doing a get of the entire file, then figures out it only needs some certain blocks, and does another get of only the specific block. Regarding # partitions - I think I see now it has to do with Hadoop's block size being set at 64MB. This is not a big deal to me; the main issue is the first one: why is every worker doing a call to get the entire file, followed by the *real* call to get only the specific partitions it needs? Best, - Nitay Founder CTO

On Sat, Nov 22, 2014 at 8:28 PM, Andrei faithlessfri...@gmail.com wrote:

Concerning your second question, I believe you try to set the number of partitions with something like this: rdd = sc.textFile(..., 8) but things like `textFile()` don't actually take a fixed number of partitions. Instead, they expect a *minimal* number of partitions. Since your file has 21 blocks of data, it creates exactly 21 partitions (which is greater than 8, as expected). To set an exact number of partitions, use `repartition()` or its full version - `coalesce()` (see example [1]) [1]: http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#coalesce

On Sat, Nov 22, 2014 at 10:04 PM, Ashish Rangole arang...@gmail.com wrote:

What makes you think that each executor is reading the whole file? If that were the case, then the count value returned to the driver would be actual x NumOfExecutors. Is that the case when compared with actual lines in the input file? If the count returned is the same as actual, then you probably don't have an extra read problem. I also see this in your logs, which indicates a read that starts from an offset and reads one split size (64MB) worth of data:

    14/11/20 15:39:45 [Executor task launch worker-1] INFO HadoopRDD: Input split: s3n://mybucket/myfile:335544320+67108864

On Nov 22, 2014 7:23 AM, Nitay Joffe ni...@actioniq.co wrote:

Err I meant #1 :) - Nitay Founder CTO

On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe ni...@actioniq.co wrote:

Anyone have any thoughts on this? Trying to understand especially #2, whether it's a legit bug or something I'm doing wrong. - Nitay Founder CTO

On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe ni...@actioniq.co wrote:

I have a simple S3 job to read a text file and do a line count. Specifically I'm doing sc.textFile("s3n://mybucket/myfile").count. The file is about 1.2GB. My setup is a standalone Spark cluster with 4 workers, each with 2 cores / 16GB RAM. I'm using branch-1.2 code built against hadoop 2.4 (though I'm not actually using HDFS, just straight S3 => Spark). The whole count is taking on the order of a couple of minutes, which seems extremely slow. I've been looking into it and so far have noticed two things, hoping the community has seen this before and knows what to do...

1) Every executor seems to make an S3 call to read the *entire file* before making another call to read just its split. Here's a paste I've cleaned up to show just one task: http://goo.gl/XCfyZA. I've verified this happens in every task. It is taking a long time (40-50 seconds), and I don't see why it is doing this.

2) I've tried a few numPartitions parameters. When I make the parameter anything below 21 it seems to get ignored.
Under the hood FileInputFormat is doing something that always ends up with at least 21 partitions of ~64MB or so. I've also tried 40, 60, and 100 partitions and have seen that the performance only gets worse as I increase it beyond 21. I would like to try 8 just to see, but again I don't see how to force it to go below 21. Thanks for the help, - Nitay Founder CTO
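For the partition-count part of this thread, a small sketch of the distinction discussed above (the bucket, key, and target count are hypothetical):

    import org.apache.spark.SparkContext

    // `sc` is an existing SparkContext
    def partitionCounts(sc: SparkContext): Unit = {
      // the second argument is only a *minimum* number of partitions; a 1.2GB file split
      // into 64MB blocks still comes back as ~21 splits even if you ask for 8
      val lines = sc.textFile("s3n://mybucket/myfile", 8)
      println(lines.partitions.length)

      // to force an exact partition count, reshuffle after reading
      val shuffled = lines.repartition(8)   // full shuffle
      val narrowed = lines.coalesce(8)      // narrow dependency, avoids a shuffle
      println(shuffled.partitions.length + " " + narrowed.partitions.length)
    }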
Re: Multiple Spark Context
It's not recommended to have multiple Spark contexts in one JVM, but you could launch a separate JVM per context. How resources get allocated is probably outside the scope of Spark, and more of a task for the cluster manager.

On Fri, Nov 14, 2014 at 12:58 PM, Charles charles...@cenx.com wrote:

I need to continuously run multiple calculations concurrently on a cluster. They are not sharing RDDs. Each of the calculations needs a different number of cores and memory. Also, some of them are long-running calculations and others are short-running calculations. They all need to be run on a regular basis and finish in time. The short ones cannot wait for the long ones to finish. They cannot run too slowly either, since the short ones' running interval is short as well. It looks like the sharing inside a SparkContext cannot guarantee that the short ones will get enough resources to finish in time if long ones are already running. Or am I wrong about that? I tried to create a SparkContext for each of the calculations, but only the first one stays alive. The rest die. I am getting the error below. Is it possible to create multiple SparkContexts from inside one application JVM?

    ERROR 2014-11-14 14:59:46 akka.actor.OneForOneStrategy: spark.httpBroadcast.uri
    java.util.NoSuchElementException: spark.httpBroadcast.uri
      at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:151)
      at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:151)
      at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
      at scala.collection.AbstractMap.getOrElse(Map.scala:58)
      at org.apache.spark.SparkConf.get(SparkConf.scala:151)
      at org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:104)
      at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcast.scala:70)
      at org.apache.spark.broadcast.BroadcastManager.initialize(Broadcast.scala:81)
      at org.apache.spark.broadcast.BroadcastManager.<init>(Broadcast.scala:68)
      at org.apache.spark.SparkEnv$.create(SparkEnv.scala:175)
      at org.apache.spark.executor.Executor.<init>(Executor.scala:110)
      at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:56)
      at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
      at akka.actor.ActorCell.invoke(ActorCell.scala:456)
      at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
      at akka.dispatch.Mailbox.run(Mailbox.scala:219)
      at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
      at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
      at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
      at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    INFO 2014-11-14 14:59:46 org.apache.spark.executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@172.32.1.12:51590/user/CoarseGrainedScheduler
    ERROR 2014-11-14 14:59:46 org.apache.spark.executor.CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID: 1
GraphX: Get edges for a vertex
Hello, I'm attempting to implement a clustering algorithm on top of the Pregel implementation in GraphX; however, I'm hitting a wall. Ideally, I'd like to be able to get all edges for a specific vertex, since they factor into the calculation. My understanding was that the sendMsg function would receive all relevant edges of participating vertices (all initially, declining as they converge and stop changing state), and I was planning to keep vertex edges associated with each vertex and propagate them to other vertices that need to know about these edges. What I'm finding is that not all edges get iterated on by sendMsg before messages are sent to vprog. Even if I try to keep track of edges, I don't account for all of them, leading to incorrect results. The graph I'm testing on has edges between all nodes, one for each direction, and I'm using EdgeDirection.Both. Has anyone seen something similar, and do you have some suggestions? Dan
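One workaround, not discussed further in this thread, is to pre-compute each vertex's incident edges outside of Pregel and join them into the vertex attribute before the iterations start, so sendMsg no longer has to see every edge itself. A hedged sketch using the 1.x aggregation API; the Double attribute types are placeholders, and whether this fits the clustering computation is an assumption:

    import org.apache.spark.graphx._

    // attach each vertex's incident edges to its attribute before running Pregel
    def withIncidentEdges(graph: Graph[Double, Double]): Graph[(Double, Array[Edge[Double]]), Double] = {
      // send every edge to both of its endpoints and concatenate per vertex
      val edgesPerVertex: VertexRDD[Array[Edge[Double]]] =
        graph.mapReduceTriplets[Array[Edge[Double]]](
          t => {
            val e = Array(Edge(t.srcId, t.dstId, t.attr))
            Iterator((t.srcId, e), (t.dstId, e))
          },
          _ ++ _
        )
      // keep the original attribute alongside the collected edge list
      graph.outerJoinVertices(edgesPerVertex) { (_, attr, edges) =>
        (attr, edges.getOrElse(Array.empty[Edge[Double]]))
      }
    }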
Re: Example of Fold
You should look at how fold is used in Scala in general to help. Here is a blog post that may also give some guidance: http://blog.madhukaraphatak.com/spark-rdd-fold The zero value should be your bean, with the 4th parameter set to the minimum value. Your fold function should compare the 4th param in the incoming records and choose the larger one. If you want to group by the 3 parameters prior to folding, you're probably better off using a reduce function.

On Fri, Oct 31, 2014 at 12:01 PM, Ron Ayoub ronalday...@live.com wrote:

I want to fold an RDD into a smaller RDD with max elements. I have simple bean objects with 4 properties. I want to group by 3 of the properties and then select the max of the 4th. So I believe fold is the appropriate method for this. My question is, is there a good fold example out there? Additionally, what is the zero value used for as the first argument? Thanks.
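A sketch of the reduce-based approach suggested above; the Record fields and the input RDD are hypothetical stand-ins for the bean described in the question:

    import org.apache.spark.SparkContext._   // pair RDD functions (reduceByKey) in Spark 1.x
    import org.apache.spark.rdd.RDD

    // hypothetical bean: three grouping fields plus the value to maximize
    case class Record(a: String, b: String, c: String, value: Double)

    def maxPerGroup(records: RDD[Record]): RDD[Record] =
      records
        .keyBy(r => (r.a, r.b, r.c))                              // group by the first 3 fields
        .reduceByKey((x, y) => if (x.value >= y.value) x else y)  // keep the record with the largest 4th
        .values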
Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?
You can use the --spark-version argument to spark-ec2 to specify a Git hash corresponding to the version you want to check out. If you made changes that are not in the master repository, you can use --spark-git-repo to specify the git repository to pull Spark down from, one that contains the specified commit hash.

On Tue, Oct 21, 2014 at 3:52 PM, sameerf same...@databricks.com wrote:

Hi, Can you post what the error looks like? Sameer F.
Re: Getting spark to use more than 4 cores on Amazon EC2
How are you launching the cluster, and how are you submitting the job to it? Can you list any Spark configuration parameters you provide?

On Mon, Oct 20, 2014 at 12:53 PM, Daniel Mahler dmah...@gmail.com wrote:

I am launching EC2 clusters using the spark-ec2 scripts. My understanding is that this configures Spark to use the available resources. I can see that Spark will use the available memory on larger instance types. However, I have never seen Spark running at more than 400% (using 100% on 4 cores) on machines with many more cores. Am I misunderstanding the docs? Is it just that high-end EC2 instances get I/O starved when running Spark? It would be strange if that consistently produced a 400% hard limit though. thanks Daniel
Re: Multipart uploads to Amazon S3 from Apache Spark
Not directly related, but FWIW, EMR seems to back away from s3n usage: Previously, Amazon EMR used the S3 Native FileSystem with the URI scheme, s3n. While this still works, we recommend that you use the s3 URI scheme for the best performance, security, and reliability. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html On Mon, Oct 13, 2014 at 1:42 PM, Nick Chammas nicholas.cham...@gmail.com wrote: Cross posting an interesting question on Stack Overflow http://stackoverflow.com/questions/26321947/multipart-uploads-to-amazon-s3-from-apache-spark . Nick -- View this message in context: Multipart uploads to Amazon S3 from Apache Spark http://apache-spark-user-list.1001560.n3.nabble.com/Multipart-uploads-to-Amazon-S3-from-Apache-Spark-tp16315.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
Re: S3 Bucket Access
(Copying the user list) You should use the spark-ec2 script to configure the cluster. If you use the trunk version, you can use the new --copy-aws-credentials option to configure the S3 parameters automatically; otherwise, either include them in your SparkConf variable or add them to /root/spark/ephemeral-hdfs/conf/core-site.xml

On Mon, Oct 13, 2014 at 2:56 PM, Ranga sra...@gmail.com wrote:

The cluster is deployed on EC2 and I am trying to access the S3 files from within a spark-shell session.

On Mon, Oct 13, 2014 at 2:51 PM, Daniil Osipov daniil.osi...@shazam.com wrote:

So is your cluster running on EC2, or locally? If you're running locally, you should still be able to access S3 files; you just need to locate the core-site.xml and add the parameters as defined in the error.

On Mon, Oct 13, 2014 at 2:49 PM, Ranga sra...@gmail.com wrote:

Hi Daniil, No, I didn't create the Spark cluster using the EC2 scripts. Is that something that I need to do? I just downloaded Spark-1.1.0 and Hadoop-2.4. However, I am trying to access files on S3 from this cluster. - Ranga

On Mon, Oct 13, 2014 at 2:36 PM, Daniil Osipov daniil.osi...@shazam.com wrote:

Did you add the fs.s3n.aws* configuration parameters in /root/spark/ephemeral-hdfs/conf/core-site.xml?

On Mon, Oct 13, 2014 at 11:03 AM, Ranga sra...@gmail.com wrote:

Hi, I am trying to access files/buckets in S3 and encountering a permissions issue. The buckets are configured to authenticate using an IAMRole provider. I have set the KeyId and Secret using environment variables (AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable to access the S3 buckets. Before setting the access key and secret, the error was:

    java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

After setting the access key and secret, the error is: "The AWS Access Key Id you provided does not exist in our records." The id/secret being set are the right values. This makes me believe that something else (token, etc.) needs to be set as well. Any help is appreciated. - Ranga
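If editing core-site.xml isn't convenient, the two properties named in the error message can also be set programmatically; a minimal sketch for a spark-shell session (the key values, bucket, and prefix are placeholders):

    // inside spark-shell, where `sc` already exists
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

    // s3n:// paths should then resolve
    val rdd = sc.textFile("s3n://some-bucket/some/prefix")
    println(rdd.count())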
Re: Cannot read from s3 using sc.textFile
Try using s3n:// instead of s3 (for the credential configuration as well).

On Tue, Oct 7, 2014 at 9:51 AM, Sunny Khatri sunny.k...@gmail.com wrote:

Not sure if it's supposed to work. Can you try newAPIHadoopFile(), passing in the required configuration object?

On Tue, Oct 7, 2014 at 4:20 AM, Tomer Benyamini tomer@gmail.com wrote:

Hello, I'm trying to read from S3 using a simple Spark Java app:

    SparkConf sparkConf = new SparkConf().setAppName("TestApp");
    sparkConf.setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");
    sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "XX");
    String path = "s3://bucket/test/testdata";
    JavaRDD<String> textFile = sc.textFile(path);
    System.out.println(textFile.count());

But getting this error:

    org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://bucket/test/testdata
      at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
      at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
      at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:175)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
      at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1097)
      at org.apache.spark.rdd.RDD.count(RDD.scala:861)
      at org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:365)
      at org.apache.spark.api.java.JavaRDD.count(JavaRDD.scala:29)

Looking at the debug log I see that org.jets3t.service.impl.rest.httpclient.RestS3Service returned a 404 error trying to locate the file. Using a simple Java program with com.amazonaws.services.s3.AmazonS3Client works just fine. Any idea? Thanks, Tomer
Re: compiling spark source code
In the spark source folder, execute `sbt/sbt assembly` On Thu, Sep 11, 2014 at 8:27 AM, rapelly kartheek kartheek.m...@gmail.com wrote: HI, Can someone please tell me how to compile the spark source code to effect the changes in the source code. I was trying to ship the jars to all the slaves, but in vain. -Karthik
Re: Spark on Raspberry Pi?
Limited memory could also cause you some problems and limit usability. If you're looking for a local testing environment, vagrant boxes may serve you much better. On Thu, Sep 11, 2014 at 6:18 AM, Chen He airb...@gmail.com wrote: Pi's bus speed, memory size and access speed, and processing ability are limited. The only benefit could be the power consumption. On Thu, Sep 11, 2014 at 8:04 AM, Sandeep Singh sand...@techaddict.me wrote: Has anyone tried using Raspberry Pi for Spark? How efficient is it to use around 10 Pi's for local testing env ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Raspberry-Pi-tp13965.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: PrintWriter error in foreach
Try providing the full path to the file you want to write, and make sure the directory exists and is writable by the Spark process.

On Wed, Sep 10, 2014 at 3:46 PM, Arun Luthra arun.lut...@gmail.com wrote:

I have a Spark program that worked in local mode, but throws an error in yarn-client mode on a cluster. On the edge node in my home directory, I have an output directory (called transout) which is ready to receive files. The Spark job I'm running is supposed to write a few hundred files into that directory, once for each iteration of a foreach function. This works in local mode, and my only guess as to why this would fail in yarn-client mode is that the RDD is distributed across many nodes and the program is trying to use the PrintWriter on the datanodes, where the output directory doesn't exist. Is this what's happening? Any proposed solution?

Abbreviation of the code:

    import java.io.PrintWriter
    ...
    rdd.foreach {
      val outFile = new PrintWriter("transoutput/output.%s".format(id))
      outFile.println("test")
      outFile.close()
    }

Error:

    14/09/10 16:57:09 WARN TaskSetManager: Lost TID 1826 (task 0.0:26)
    14/09/10 16:57:09 WARN TaskSetManager: Loss was due to java.io.FileNotFoundException
    java.io.FileNotFoundException: transoutput/input.598718 (No such file or directory)
      at java.io.FileOutputStream.open(Native Method)
      at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
      at java.io.FileOutputStream.<init>(FileOutputStream.java:84)
      at java.io.PrintWriter.<init>(PrintWriter.java:146)
      at com.att.bdcoe.cip.ooh.TransformationLayer$$anonfun$main$3.apply(TransformLayer.scala:98)
      at com.att.bdcoe.cip.ooh.TransformationLayer$$anonfun$main$3.apply(TransformLayer.scala:95)
      at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:703)
      at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:703)
      at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
      at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
      at org.apache.spark.scheduler.Task.run(Task.scala:51)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      at java.lang.Thread.run(Thread.java:662)
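A sketch of the suggestion above applied to the snippet from the question: use an absolute path to a directory that exists (and is writable) on every node the tasks run on, since foreach executes on the workers, not on the edge node. The path, RDD element type, and parameter name are made-up examples.

    import java.io.PrintWriter
    import org.apache.spark.rdd.RDD

    // `rdd` stands in for the RDD from the question
    def writePerRecord(rdd: RDD[String]): Unit = {
      // hypothetical absolute path; must exist and be writable on every worker node
      val outDir = "/home/me/transoutput"
      rdd.foreach { id =>
        val outFile = new PrintWriter("%s/output.%s".format(outDir, id))
        outFile.println("test")
        outFile.close()
      }
    }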
Re: Crawler and Scraper with different priorities
Depending on what you want to do with the result of the scraping, Spark may not be the best framework for your use case. Take a look at a general Akka application.

On Sun, Sep 7, 2014 at 12:15 AM, Sandeep Singh sand...@techaddict.me wrote:

Hi all, I am implementing a crawler and scraper. It should be able to process requests for crawling and scraping within a few seconds of submitting the job (around 1mil/sec); for the rest I can take some time (scheduled evenly over the day). What is the best way to implement this? Thanks.
Re: spark-ec2 [Errno 110] Connection time out
Make sure your key pair is configured to access whatever region you're deploying to - it defaults to us-east-1, but you can provide a custom one with the --region parameter.

On Sat, Aug 30, 2014 at 12:53 AM, David Matheson david.j.mathe...@gmail.com wrote:

I'm following the latest documentation on configuring a cluster on EC2 (http://spark.apache.org/docs/latest/ec2-scripts.html). Running ./spark-ec2 -k Blah -i .ssh/Blah.pem -s 2 launch spark-ec2-test gets a generic timeout error that's coming from

    File "./spark_ec2.py", line 717, in real_main
      conn = ec2.connect_to_region(opts.region)

Any suggestions on how to debug the cause of the timeout? Note: I replaced the name of my keypair with Blah. Thanks, David
FileNotFoundException (No space left on device) writing to S3
Hello, I've been seeing the following errors when trying to save to S3:

    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 4058 in stage 2.1 failed 4 times, most recent failure: Lost task 4058.3 in stage 2.1 (TID 12572, ip-10-81-151-40.ec2.internal): java.io.FileNotFoundException: /mnt/spark/spark-local-20140827191008-05ae/0c/shuffle_1_7570_5768 (No space left on device)
      java.io.FileOutputStream.open(Native Method)
      java.io.FileOutputStream.<init>(FileOutputStream.java:221)
      org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:107)
      org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:175)
      org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
      org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
      scala.collection.Iterator$class.foreach(Iterator.scala:727)
      scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
      org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
      org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
      org.apache.spark.scheduler.Task.run(Task.scala:54)
      org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      java.lang.Thread.run(Thread.java:745)

df tells me there is plenty of space left on the worker node:

    [root@ip-10-81-151-40 ~]$ df -h
    Filesystem      Size  Used  Avail  Use%  Mounted on
    /dev/xvda1      7.9G  4.6G   3.3G   59%  /
    tmpfs           7.4G     0   7.4G    0%  /dev/shm
    /dev/xvdb        37G   11G    25G   30%  /mnt
    /dev/xvdf        37G  9.5G    26G   27%  /mnt2

Any suggestions? Dan
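For context on the path in the stack trace: the spark-local directories where shuffle files are written are governed by spark.local.dir. A minimal sketch of pointing it at the larger instance-store volumes is below; the app name and directories are examples, and whether this would resolve this particular failure is not established in this thread.

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.local.dir takes a comma-separated list of scratch directories used for
    // shuffle files like the one named in the FileNotFoundException above
    val conf = new SparkConf()
      .setAppName("s3-save-job")
      .set("spark.local.dir", "/mnt/spark,/mnt2/spark")
    val sc = new SparkContext(conf)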
Re: countByWindow save the count ?
You could try to use foreachRDD on the result of countByWindow with a function that performs the save operation. On Fri, Aug 22, 2014 at 1:58 AM, Josh J joshjd...@gmail.com wrote: Hi, Hopefully a simple question. Though is there an example of where to save the output of countByWindow ? I would like to save the results to external storage (kafka or redis). The examples show only stream.print() Thanks, Josh
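A small sketch of that suggestion; the DStream, window and slide durations, and output path are hypothetical, and a Kafka or Redis client call would replace the file write inside the foreachRDD block.

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // `lines` stands in for whatever DStream is being counted; note that countByWindow
    // requires a checkpoint directory to be set on the StreamingContext
    def saveWindowCounts(lines: DStream[String]): Unit = {
      val counts = lines.countByWindow(Seconds(60), Seconds(10))
      counts.foreachRDD { (rdd, time) =>
        // persist each window's count to external storage
        rdd.saveAsTextFile("hdfs:///counts/" + time.milliseconds)
      }
    }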
OOM Java heap space error on saveAsTextFile
Hello, My job keeps failing on saveAsTextFile stage (frustrating after a 3 hour run) with an OOM exception. The log is below. I'm running the job on an input of ~8Tb gzipped JSON files, executing on 15 m3.xlarge instances. Executor is given 13Gb memory, and I'm setting two custom preferences in the job: spark.akka.frameSize: 50 (otherwise it fails due to exceeding the limit of 10Mb), spark.storage.memoryFraction: 0.2 Any suggestions? 14/08/21 19:29:26 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to spark@ip-10-99-160-181.ec2.internal:36962 14/08/21 19:29:31 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 17541459 bytes 14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to spark@ip-10-144-221-26.ec2.internal:49973 14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to spark@ip-10-69-31-121.ec2.internal:34569 14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to spark@ip-10-165-70-221.ec2.internal:49193 14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to spark@ip-10-218-181-93.ec2.internal:57648 14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to spark@ip-10-142-187-230.ec2.internal:48115 14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to spark@ip-10-101-178-68.ec2.internal:51931 14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to spark@ip-10-99-165-121.ec2.internal:38153 14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to spark@ip-10-179-187-182.ec2.internal:55645 14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to spark@ip-10-182-231-107.ec2.internal:54088 14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to spark@ip-10-165-79-9.ec2.internal:40112 14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to spark@ip-10-111-169-138.ec2.internal:40394 14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to spark@ip-10-203-161-222.ec2.internal:47447 14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to spark@ip-10-153-141-230.ec2.internal:53906 14/08/21 19:29:32 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [spark-akka.actor.default-dispatcher-20] shutting down ActorSystem [spark] java.lang.OutOfMemoryError: Java heap space at com.google.protobuf_spark.AbstractMessageLite.toByteArray(AbstractMessageLite.java:62) at akka.remote.transport.AkkaPduProtobufCodec$.constructPayload(AkkaPduCodec.scala:145) at akka.remote.transport.AkkaProtocolHandle.write(AkkaProtocolTransport.scala:156) at akka.remote.EndpointWriter$$anonfun$7.applyOrElse(Endpoint.scala:569) at akka.remote.EndpointWriter$$anonfun$7.applyOrElse(Endpoint.scala:544) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at akka.actor.FSM$class.processEvent(FSM.scala:595) at akka.remote.EndpointWriter.processEvent(Endpoint.scala:443) at akka.actor.FSM$class.akka$actor$FSM$$processMsg(FSM.scala:589) at 
akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:583) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/08/21 19:29:32 INFO scheduler.DAGScheduler: Failed to run saveAsTextFile at RecRateApp.scala:88 Exception in thread main org.apache.spark.SparkException: Job cancelled because SparkContext was shut down at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:639) at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:638) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at
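For reference, a sketch of how the two settings mentioned in this post are applied; the app name is taken from the stack trace, and the values are the ones quoted above, not tuning recommendations.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("RecRateApp")
      .set("spark.akka.frameSize", "50")              // in MB; the 10MB default was exceeded
      .set("spark.storage.memoryFraction", "0.2")
    val sc = new SparkContext(conf)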