Re: Shuffle Files
From the BlockManager and ShuffleMapTask code, it writes under spark.local.dir or java.io.tmpdir. val diskBlockManager = new DiskBlockManager(shuffleBlockManager, conf.get("spark.local.dir", System.getProperty("java.io.tmpdir"))) On Mon, Mar 3, 2014 at 10:45 PM, Usman Ghani us...@platfora.com wrote: Where on the filesystem does Spark write the shuffle files? -- ...:::Aniket:::... Quetzalco@tl
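A minimal sketch of setting spark.local.dir explicitly so the shuffle files land where you expect (the directory path, master URL and app name below are placeholders, not anything from the original thread):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: point Spark's scratch space (shuffle files, spilled data) at a
// directory of your choosing. "/mnt/spark-tmp" is just a placeholder path.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ShuffleDirExample")
  .set("spark.local.dir", "/mnt/spark-tmp")
val sc = new SparkContext(conf)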
RE: Connection Refused When Running SparkPi Locally
I've encountered similar problems. Maybe you can try using the hostname or FQDN (rather than the IP address) of your node for the master URI. In my case, Akka picks the FQDN for the master URI and the worker has to use exactly the same string to connect. From: Benny Thompson [mailto:ben.d.tho...@gmail.com] Sent: Saturday, March 01, 2014 10:18 AM To: u...@spark.incubator.apache.org Subject: Connection Refused When Running SparkPi Locally I'm trying to run a simple execution of the SparkPi example. I started the master and one worker, then executed the job on my local cluster, but ended up getting a sequence of errors all ending with Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /127.0.0.1:39398 I originally tried running my master and worker without configuration but ended up with the same error. I tried changing to 127.0.0.1 to test if it was maybe just a firewall issue, since the server is locked down from the outside world. My conf/spark-conf.sh contains the following: export SPARK_MASTER_IP=127.0.0.1 Here is the order and the commands I run: 1) sbin/start-master.sh (to start the master) 2) bin/spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077 --ip 127.0.0.1 --port (in a different session on the same machine to start the slave) 3) bin/run-example org.apache.spark.examples.SparkPi spark://127.0.0.1:7077 (in a different session on the same machine to start the job) I find it hard to believe that I'm locked down enough that running locally would cause problems. Any help is greatly appreciated! Thanks, Benny
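One way to follow the FQDN suggestion in the driver, as a rough sketch (the hostname lookup and port are assumptions; the key point is that master, worker, and driver must all use the identical master URL string):

import java.net.InetAddress
import org.apache.spark.{SparkConf, SparkContext}

// Derive the FQDN once and reuse the exact same master URL everywhere.
// 7077 is the standalone master's default port; adjust if yours differs.
val fqdn = InetAddress.getLocalHost.getCanonicalHostName
val masterUrl = s"spark://$fqdn:7077"

val conf = new SparkConf().setMaster(masterUrl).setAppName("SparkPi")
val sc = new SparkContext(conf)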
RE: Actors and sparkcontext actions
Hi Ognen, See if this helps. I was working on this: class MyClass[T](sc : SparkContext, flag1 : Boolean, rdd : RDD[T], hdfsPath : String) extends Actor { def act(){ if(flag1) this.process() else this.count } private def process(){ println(sc.textFile(hdfsPath).count) //do the processing } private def count(){ println(rdd.count) //do the counting } } Thanks and Regards, Suraj Sheth -Original Message- From: Ognen Duzlevski [mailto:og...@nengoiksvelzud.com] Sent: 27 February 2014 01:09 To: u...@spark.incubator.apache.org Subject: Actors and sparkcontext actions Can someone point me to a simple, short code example of creating a basic Actor that gets a context and runs an operation such as .textFile.count? I am trying to figure out how to create just a basic actor that gets a message like this: case class Msg(filename: String, ctx: SparkContext) and then something like this: class HelloActor extends Actor { import context.dispatcher def receive = { case Msg(fn, ctx) => { // get the count here! // ctx.textFile(fn).count } case _ => println("huh?") } } Where I would want to do something like: val conf = new SparkConf().setMaster("spark://192.168.10.29:7077").setAppName("Hello").setSparkHome("/Users/maketo/plainvanilla/spark-0.9") val sc = new SparkContext(conf) val system = ActorSystem("mySystem") val helloActor1 = system.actorOf(Props[HelloActor], name = "helloactor1") helloActor1 ! new Msg("test.json", sc) Thanks, Ognen
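Putting those pieces together, a runnable sketch along those lines (master URL, file path and names are placeholders; it assumes the Akka version bundled with Spark 0.9):

import akka.actor.{Actor, ActorSystem, Props}
import org.apache.spark.{SparkConf, SparkContext}

// Message carrying a file name plus an already created SparkContext.
case class Msg(filename: String, ctx: SparkContext)

class HelloActor extends Actor {
  def receive = {
    case Msg(fn, ctx) =>
      // Run a Spark action from inside the actor and report the result.
      val n = ctx.textFile(fn).count()
      println(fn + " has " + n + " lines")
    case _ =>
      println("huh?")
  }
}

object ActorCountExample {
  def main(args: Array[String]) {
    // "local[2]" and "test.json" are placeholders for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("ActorCount")
    val sc = new SparkContext(conf)

    val system = ActorSystem("mySystem")
    val helloActor1 = system.actorOf(Props[HelloActor], name = "helloactor1")
    helloActor1 ! Msg("test.json", sc)
  }
}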
Re: Problem with delete spark temp dir on spark 0.8.1
Hi, Try to clean your temp dir, System.getProperty("java.io.tmpdir"). Also, can you paste a longer stack trace? Thanks Best Regards On Tue, Mar 4, 2014 at 2:55 PM, goi cto goi@gmail.com wrote: Hi, I am running a Spark Java program on a local machine. When I try to write the output to a file (RDD.saveAsTextFile) I am getting this exception: Exception in thread Delete Spark temp dir ... This is running on my local Windows machine. Any ideas? -- Eran | CTO -- Thanks Best Regards
Re: Problem with delete spark temp dir on spark 0.8.1
Exception in thread "delete Spark temp dir C:\Users\..." java.io.IOException: failed to delete: C:\Users\...\simple-project-1.0.jar at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:495) at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:491) I deleted my temp dir as suggested and indeed all spark.. directories were deleted. After that I ran the program again and got the same error again. Also, a spark-... directory containing simple-project-1.0.jar was indeed left on the file system. I had no problem deleting this file once the program completed. Eran On Tue, Mar 4, 2014 at 11:36 AM, Akhil Das ak...@mobipulse.in wrote: Hi, Try to clean your temp dir, System.getProperty("java.io.tmpdir"). Also, can you paste a longer stack trace? Thanks Best Regards On Tue, Mar 4, 2014 at 2:55 PM, goi cto goi@gmail.com wrote: Hi, I am running a Spark Java program on a local machine. When I try to write the output to a file (RDD.saveAsTextFile) I am getting this exception: Exception in thread Delete Spark temp dir ... This is running on my local Windows machine. Any ideas? -- Eran | CTO -- Thanks Best Regards -- Eran | CTO
RDD Manipulation in Scala.
Hello, I am using Spark with Scala and I am attempting to understand the different filtering and mapping capabilities available. I haven't found an example of the specific task I would like to do. I am trying to read in a tab-separated text file and filter specific entries. I would like this filter to be applied to particular columns and not whole lines. I was using the following to split the data, but attempts to filter by column afterwards are not working. - val data = sc.textFile("test_data.txt") var parsedData = data.map(_.split("\t").map(_.toString)) -- To give a more concrete example of my goal, suppose the data file is: A1 A2 A3 A4 B1 B2 A3 A4 C1 A2 C2 C3 How would I filter the data based on the second column to only return those entries which have A2 in column two? So that the resulting RDD would just be: A1 A2 A3 A4 C1 A2 C2 C3 Is there a convenient way to do this? Any suggestions or assistance would be appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Manipulation-in-Scala-tp2285.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
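For reference, a minimal sketch of the column filter being asked about (it reuses the file name from the post and assumes an existing SparkContext sc, e.g. the Spark shell):

// Split each line into columns, then keep only rows whose second column is "A2".
val data = sc.textFile("test_data.txt")
val parsed = data.map(_.split("\t"))
val filtered = parsed.filter(cols => cols.length > 1 && cols(1) == "A2")

// Back to tab-separated lines, e.g. for printing or saving.
filtered.map(_.mkString("\t")).collect().foreach(println)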
Re: RDD Manipulation in Scala.
Thanks Sean, I think that is doing what I needed. It was much simpler than what I had been attempting. Is it possible to do an OR filter, so that, for example, column 2 is filtered on A2 and column 3 on A4? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Manipulation-in-Scala-tp2285p2287.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
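A sketch of that OR-style filter (same assumptions as the earlier sketch: an existing SparkContext sc and tab-separated input; the column indices are illustrative):

val parsed = sc.textFile("test_data.txt").map(_.split("\t"))

// Keep rows where column 2 is "A2" OR column 3 is "A4" (0-based indices 1 and 2).
val filtered = parsed.filter { cols =>
  (cols.length > 1 && cols(1) == "A2") || (cols.length > 2 && cols(2) == "A4")
}
filtered.map(_.mkString("\t")).collect().foreach(println)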
Re: Actors and sparkcontext actions
Hi Ognen, Any particular reason for choosing Scalatra over options like Play or Spray? Is Scalatra much better at serving APIs, or is it due to its similarity with Ruby's Sinatra? Did you try the other options and then pick Scalatra? Thanks. Deb On Tue, Mar 4, 2014 at 4:50 AM, Ognen Duzlevski og...@plainvanillagames.com wrote: Suraj, I posted to this list a link to my blog where I detail how to do a simple actor/SparkContext thing with the added obstacle of it being within a Scalatra servlet. Thanks for the code! Ognen On 3/4/14, 3:20 AM, Suraj Satishkumar Sheth wrote: Hi Ognen, See if this helps. I was working on this: class MyClass[T](sc : SparkContext, flag1 : Boolean, rdd : RDD[T], hdfsPath : String) extends Actor { def act(){ if(flag1) this.process() else this.count } private def process(){ println(sc.textFile(hdfsPath).count) //do the processing } private def count(){ println(rdd.count) //do the counting } } Thanks and Regards, Suraj Sheth -Original Message- From: Ognen Duzlevski [mailto:og...@nengoiksvelzud.com] Sent: 27 February 2014 01:09 To: u...@spark.incubator.apache.org Subject: Actors and sparkcontext actions Can someone point me to a simple, short code example of creating a basic Actor that gets a context and runs an operation such as .textFile.count? I am trying to figure out how to create just a basic actor that gets a message like this: case class Msg(filename: String, ctx: SparkContext) and then something like this: class HelloActor extends Actor { import context.dispatcher def receive = { case Msg(fn, ctx) => { // get the count here! // ctx.textFile(fn).count } case _ => println("huh?") } } Where I would want to do something like: val conf = new SparkConf().setMaster("spark://192.168.10.29:7077").setAppName("Hello").setSparkHome("/Users/maketo/plainvanilla/spark-0.9") val sc = new SparkContext(conf) val system = ActorSystem("mySystem") val helloActor1 = system.actorOf(Props[HelloActor], name = "helloactor1") helloActor1 ! new Msg("test.json", sc) Thanks, Ognen -- Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski
Re: o.a.s.u.Vector instances for equality
Thanks. Does it make sense to add an ==/equals method for Vector with this (or the same) behavior? 2014-03-04 6:00 GMT+02:00 Shixiong Zhu zsxw...@gmail.com: Vector is an enhanced Array[Double]. You can compare it like an Array[Double]. E.g., scala> val v1 = Vector(1.0, 2.0) v1: org.apache.spark.util.Vector = (1.0, 2.0) scala> val v2 = Vector(1.0, 2.0) v2: org.apache.spark.util.Vector = (1.0, 2.0) scala> val exactResult = v1.elements.sameElements(v2.elements) // exact comparison exactResult: Boolean = true scala> val delta = 1E-6 delta: Double = 1.0E-6 scala> val inexactResult = v1.elements.length == v2.elements.length && v1.elements.zip(v2.elements).forall { case (x, y) => (x - y).abs < delta } // inexact comparison inexactResult: Boolean = true Best Regards, Shixiong Zhu 2014-03-04 4:23 GMT+08:00 Oleksandr Olgashko alexandrolg...@gmail.com: Hello. How should I best check two Vectors for equality? val a = new Vector(Array(1)) val b = new Vector(Array(1)) println(a == b) // false
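If approximate comparison is needed in several places, a small helper along these lines may be convenient (a sketch against org.apache.spark.util.Vector as shown above; the helper name and tolerance are arbitrary):

import org.apache.spark.util.Vector

// Approximate equality: same length and every pair of elements within eps.
def approxEquals(a: Vector, b: Vector, eps: Double = 1e-6): Boolean =
  a.elements.length == b.elements.length &&
    a.elements.zip(b.elements).forall { case (x, y) => math.abs(x - y) < eps }

val v1 = Vector(1.0, 2.0)
val v2 = Vector(1.0, 2.0 + 1e-9)
println(approxEquals(v1, v2)) // true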
Re: Actors and sparkcontext actions
Deb, On 3/4/14, 9:02 AM, Debasish Das wrote: Hi Ognen, Any particular reason for choosing Scalatra over options like Play or Spray? Is Scalatra much better at serving APIs, or is it due to its similarity with Ruby's Sinatra? Did you try the other options and then pick Scalatra? Not really. I just happen to like Scalatra; it is easy to read and easy to write. If it did not work out, I would probably have gone with something like Unfiltered. Ognen
Re: Missing Spark URL after starting the master
Hi Mayur, I am using CDH4.6.0p0.26. And the latest Cloudera Spark parcel is Spark 0.9.0 CDH4.6.0p0.50. As I mentioned, somehow the Cloudera Spark version doesn't contain the run-example shell scripts. However, it is automatically configured and it is pretty easy to set up across the cluster... Thanks, Bin On Tue, Mar 4, 2014 at 10:59 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: I have it on a Cloudera VM: http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_Cloudera_VM Which version are you trying to set up on Cloudera, and which Cloudera version are you using? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Mar 3, 2014 at 4:29 PM, Bin Wang binwang...@gmail.com wrote: Hi Ognen/Mayur, Thanks for the reply, and it is good to know how easy it is to set up Spark on an AWS cluster. My situation is a bit different from yours: our company already has a cluster and it really doesn't make that much sense not to use it. That is why I have been going through this. I really wish there were some tutorials teaching you how to set up a Spark cluster on a bare-metal CDH cluster, or some way to tweak the CDH Spark distribution so it is up to date. Ognen, of course it will be very helpful if you can 'history | grep spark... ' and document the work that you have done, since you've already made it! Bin On Mon, Mar 3, 2014 at 2:06 PM, Ognen Duzlevski og...@plainvanillagames.com wrote: I should add that in this setup you really do not need to look for the printout of the master node's IP - you set it yourself a priori. If anyone is interested, let me know, I can write it all up so that people can follow some set of instructions. Who knows, maybe I can come up with a set of scripts to automate it all... Ognen On 3/3/14, 3:02 PM, Ognen Duzlevski wrote: I have a standalone Spark cluster running in an Amazon VPC that I set up by hand. All I did was provision the machines from a common AMI image (my underlying distribution is Ubuntu), I created a sparkuser on each machine, and I have a /home/sparkuser/spark folder where I downloaded Spark. I did this on the master only, I did sbt/sbt assembly and I set up conf/spark-env.sh to point to the master, which is an IP address (in my case 10.10.0.200, the port is the default 7077). I also set up the slaves file in the same subdirectory to have all 16 IP addresses of the worker nodes (in my case 10.10.0.201-216). After sbt/sbt assembly was done on the master, I then did cd ~/; tar -czf spark.tgz spark/ and I copied the resulting tgz file to each worker using the same sparkuser account and unpacked the .tgz on each slave (this will effectively replicate everything from master to all slaves - you can script this so you don't do it by hand). Your AMI should have the distribution's version of Java and git installed, by the way. All you have to do then is sparkuser@spark-master spark/sbin/start-all.sh (for 0.9; in 0.8.1 it is spark/bin/start-all.sh) and it will all automagically start :) All my Amazon nodes come with 4x400 GB of ephemeral space which I have set up into a 1.6TB RAID0 array on each node, and I am pooling this into an HDFS filesystem which is operated by a namenode outside the Spark cluster, while all the datanodes are the same nodes as the Spark workers. This enables replication and extremely fast access, since ephemeral is much faster than EBS or anything else on Amazon (you can do even better with SSD drives on this setup but it will cost ya).
If anyone is interested I can document our pipeline setup - I came up with it myself and do not have a clue as to what the industry standards are, since I could not find any written instructions anywhere online about how to set up a whole data analytics pipeline from the point of ingestion to the point of analytics (people don't want to share their secrets? or am I just in the dark and incapable of using Google properly?). My requirement was that I wanted this to run within a VPC for added security and simplicity; the Amazon security groups get really old quickly. An added bonus is that you can use a VPN as an entry into the whole system, and your cluster instantly becomes local to you in terms of IPs etc. I use OpenVPN since I like neither Cisco nor Juniper (the only two options Amazon provides for their VPN gateways). Ognen On 3/3/14, 1:00 PM, Bin Wang wrote: Hi there, I have a CDH cluster set up, and I tried using the Spark parcel that comes with Cloudera Manager, but it turned out they don't even have the run-example shell command in the bin folder. Then I removed it from the cluster and cloned incubator-spark onto the name node of my cluster, and built from source there successfully with everything as default. I ran a few examples and everything seems to work fine in local mode. Then I am thinking about scaling it to my cluster, which is what the
Spark Streaming Maven Build
Hi there, I tried the Kafka WordCount example and it works perfectly, and the code is pretty straightforward to understand. Can anyone show me how to start my own Maven project with the KafkaWordCount example with minimum effort? 1. What should the pom file look like (including jar-plugin? assembly-plugin? etc.)? 2. mvn install, or mvn clean install, or mvn install compile assembly:single? 3. Once you have a jar file, how do you execute it instead of using bin/run-example...? To answer those who might ask what I have already done: here is a derivative of the KafkaWordCount example that I wrote to do an IP count, where the input data from Kafka is actually a JSON string: https://github.com/biwa7636/binwangREPO I had such bad luck getting it to work. So if anyone can copy the code of the KafkaWordCount example, build it using Maven, and get it working, please share your pom and those Maven commands; I will be so appreciative! Best regards, and let me know if you need further info. Bin
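For the application side of such a project, a rough sketch of what a standalone main class could look like (the class name, master URL, ZooKeeper quorum, group id and topic are all placeholders; it assumes the spark-streaming and spark-streaming-kafka artifacts matching your Spark version are declared as Maven dependencies):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils

// Minimal standalone word count over a Kafka topic (Spark 0.9-style API).
object MyKafkaWordCount {
  def main(args: Array[String]) {
    val ssc = new StreamingContext("local[2]", "MyKafkaWordCount", Seconds(2))

    val zkQuorum = "localhost:2181"    // placeholder ZooKeeper quorum
    val group = "my-consumer-group"    // placeholder consumer group
    val topics = Map("my-topic" -> 1)  // topic -> number of receiver threads

    // Each Kafka record arrives as (key, message); keep only the message text.
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topics).map(_._2)
    val counts = lines.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}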