Spark with Spark Streaming
Hi! Is there any way to use Spark and Spark Streaming together to create a real-time architecture? How can I merge the Spark (batch) and Spark Streaming results in real time, and drop the streaming result once the batch result has been generated? Thanks
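One possible approach, as a minimal sketch rather than a definitive design: compute the batch result with the regular Spark API, then combine it with each micro-batch via DStream.transform(), preferring the batch value whenever it exists. All paths, host names, and ports below are placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._                  // pair-RDD functions
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._    // pair-DStream functions

object BatchPlusStreaming {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("BatchPlusStreaming"))
    val ssc = new StreamingContext(sc, Seconds(10))

    // Batch result computed up front with the regular Spark API.
    val batchCounts = sc.textFile("hdfs:///path/to/batch/input")
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    // Streaming counts over the same key space.
    val streamCounts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    // transform() combines every micro-batch RDD with the batch RDD;
    // the batch value wins (the streaming value is dropped) when both exist.
    val merged = streamCounts.transform { rdd =>
      rdd.leftOuterJoin(batchCounts).mapValues {
        case (streamVal, batchVal) => batchVal.getOrElse(streamVal)
      }
    }
    merged.print()

    ssc.start()
    ssc.awaitTermination()
  }
}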
Re: best practice: write and debug Spark application in scala-ide and maven
I think that you have two options:
- To run your code locally, you can use local mode by using the 'local' master, like so: new SparkConf().setMaster("local[4]"), where 4 is the number of cores assigned to local mode.
- To run your code remotely, you need to build the jar with dependencies and add it to your context: new SparkConf().setMaster("spark://uri").setJars(Array("/path/to/target/jar-with-dependencies.jar")). You will need to run Maven before running your program to ensure the latest version of your jar is built.
-regards, Gerard.
On Sat, Jun 7, 2014 at 3:10 AM, Wei Tan w...@us.ibm.com wrote: Hi, I am trying to write and debug Spark applications in scala-ide and maven, and in my code I target a Spark instance at spark://xxx

object App {
  def main(args: Array[String]) {
    println("Hello World!")
    val sparkConf = new SparkConf().setMaster("spark://xxx:7077").setAppName("WordCount")
    val spark = new SparkContext(sparkConf)
    val file = spark.textFile("hdfs://xxx:9000/wcinput/pg1184.txt")
    val counts = file.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://flex05.watson.ibm.com:9000/wcoutput")
  }
}

I added spark-core and hadoop-client as Maven dependencies, so the code compiles fine. When I click Run in Eclipse I get this error: 14/06/06 20:52:18 WARN scheduler.TaskSetManager: Loss was due to java.lang.ClassNotFoundException java.lang.ClassNotFoundException: samples.App$$anonfun$2 I googled this error and it seems that I need to package my code into a jar file and push it to the Spark nodes. But since I am debugging the code, it would be handy if I could quickly see results without packaging and uploading jars. What is the best practice for writing a Spark application (like wordcount) and debugging it quickly on a remote Spark instance? Thanks! Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center http://researcher.ibm.com/person/us-wtan
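For reference, a hedged sketch of the two SparkConf set-ups described above; the master URI and jar path are placeholders, not real values:

import org.apache.spark.{SparkConf, SparkContext}

// Local debugging: everything runs inside the Eclipse/scala-ide JVM, no jar needed.
val localConf = new SparkConf()
  .setMaster("local[4]")                    // 4 local worker threads
  .setAppName("WordCount-debug")

// Remote cluster: the jar-with-dependencies built by Maven has to be shipped to
// the executors, otherwise they cannot load your closures and you get the
// ClassNotFoundException shown above.
val clusterConf = new SparkConf()
  .setMaster("spark://master-host:7077")    // placeholder master URI
  .setAppName("WordCount")
  .setJars(Seq("/path/to/target/jar-with-dependencies.jar"))

val sc = new SparkContext(localConf)        // or clusterConf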
Re: Scheduling code for Spark
Hi, The scheduling-related code can be found at: https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/scheduler The DAG (Directed Acyclic Graph) scheduler is a good starting point: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala Enjoy! -Gerard. On Sat, Jun 7, 2014 at 10:24 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I am new to the Spark framework. I have understood Spark to some extent, and I have some experience with Hadoop as well. The concepts of in-memory computation and RDDs are extremely fascinating. I am trying to understand the scheduler of the Spark framework. Can someone help me find where to look for the Spark scheduler code? Thank you!!
Re: Using Spark on Data size larger than Memory size
Aaron, Thank you for your response and for clarifying things. -Vibhor On Sun, Jun 1, 2014 at 11:40 AM, Aaron Davidson ilike...@gmail.com wrote: There is no fundamental issue if you're running on data that is larger than the cluster memory size. Many operations can stream data through, and thus memory usage is independent of input data size. Certain operations require an entire *partition* (not dataset) to fit in memory, but there are not many instances of this left (sorting comes to mind, and this is being worked on). In general, one problem with Spark today is that you *can* OOM under certain configurations, and it's possible you'll need to change from the default configuration if you're doing very memory-intensive jobs. However, there are very few cases where Spark would simply fail as a matter of course -- for instance, you can always increase the number of partitions to decrease the size of any given one, or repartition data to eliminate skew. Regarding impact on performance, as Mayur said, there may absolutely be an impact depending on your jobs. If you're doing a join on a very large amount of data with few partitions, then we'll have to spill to disk. If you can't cache your working set of data in memory, you will also see a performance degradation. Spark enables the use of memory to make things fast, but if you just don't have enough memory, it won't be terribly fast. On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: Clearly there will be an impact on performance, but frankly it depends on what you are trying to achieve with the dataset. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga vibhorba...@gmail.com wrote: Some input would be really helpful. Thanks, -Vibhor On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com wrote: Hi all, I am planning to use Spark with HBase, where I generate an RDD by reading data from an HBase table. I want to know: in the case where the size of the HBase table grows larger than the size of the RAM available in the cluster, will the application fail, or will there only be an impact on performance? Any thoughts in this direction will be helpful and are welcome. Thanks, -Vibhor -- Vibhor Banga Software Development Engineer Flipkart Internet Pvt. Ltd., Bangalore
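To make the "increase the number of partitions" advice concrete, a small sketch; the paths and partition counts below are placeholders, not tuned values:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions

val sc = new SparkContext(new SparkConf().setAppName("partition-tuning"))

// Read with more partitions than the default so each task holds less data.
val big   = sc.textFile("hdfs:///big/dataset", 2000).map(l => (l.take(8), l))
val other = sc.textFile("hdfs:///other/dataset", 2000).map(l => (l.take(8), l))

// A join with an explicit partition count keeps each shuffle partition small
// enough to fit in memory (or spill cleanly) instead of blowing up one task.
val joined = big.join(other, 2000)

// Re-balance a skewed RDD so no single partition holds most of the data.
val balanced = big.repartition(2000)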
Re: New user streaming question
I would make sure that your workers are running. It is very difficult to tell from the console dribble whether you just have no data or the workers have disassociated from the master. Gino B. On Jun 6, 2014, at 11:32 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Yup, when it's running, DStream.print() will print out a timestamped block for every time step, even if the block is empty (for v1.0.0, which I have running in the other window). If you're not getting that, I'd guess the stream hasn't started up properly. On Sat, Jun 7, 2014 at 11:50 AM, Michael Campbell michael.campb...@gmail.com wrote: I've been playing with Spark and streaming and have a question on stream outputs. The symptom is I don't get any. I have run spark-shell and everything works as I expect, but when I run the word-count example with streaming, it *works* in that things happen and there are no errors, but I never get any output. Am I understanding how it is supposed to work correctly? Is the DStream.print() method supposed to print the output for every (micro)batch of the streamed data? If that's the case, I'm not seeing it. I'm using the netcat example and the StreamingContext uses the network to read words, but as I said, nothing comes out. I tried changing the .print() to .saveAsTextFiles(), and I AM getting a file, but nothing is in it other than a _temporary subdir. I'm sure I'm confused here, but not sure where. Help? -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
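For comparison, a minimal sketch of the netcat word count under discussion (host and port are placeholders; "nc -lk 9999" on the same machine is the usual test source). One common cause of "no errors but no output" with a receiver-based stream is running with a single core: the receiver occupies it and nothing is left to process batches, hence local[2] below.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream functions

object NetcatWordCount {
  def main(args: Array[String]): Unit = {
    // At least 2 threads: one for the socket receiver, one to process batches.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetcatWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()   // prints a timestamped block every 5 seconds, even when empty

    ssc.start()
    ssc.awaitTermination()
  }
}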
Re: Spark Streaming, download a s3 file to run a script shell on it
So you can run a job / Spark job to get the data onto disk/HDFS, then run a DStream from an HDFS folder. As you move your files in, the DStream will kick in. Regards Mayur On 6 Jun 2014 21:13, Gianluca Privitera gianluca.privite...@studio.unibo.it wrote: Where are the APIs for QueueStream and RDDQueue? In my solution I cannot open a DStream with the S3 location because I need to run a script on the file (a script that unluckily doesn't accept stdin as input), so I have to download it to my disk somehow and then handle it from there before creating the stream. Thanks Gianluca On 06/06/2014 02:19, Mayur Rustagi wrote: You can look at creating a DStream directly from the S3 location using a file stream. If you want to use any specific logic, you can rely on QueueStream: read the data yourself from S3, process it, and push it into an RDD queue. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Jun 6, 2014 at 3:00 AM, Gianluca Privitera gianluca.privite...@studio.unibo.it wrote: Hi, I've got a weird question, but maybe someone has already dealt with it. My Spark Streaming application needs to - download a file from an S3 bucket, - run a script with the file as input, - create a DStream from this script's output. I've already got the second part done with the rdd.pipe() API, which really fits my needs, but I have no idea how to manage the first part. How can I manage to download a file and run a script on it inside a Spark Streaming application? Should I use process() from Scala, or will that not work? Thanks Gianluca
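A rough sketch of the queueStream-plus-pipe() idea suggested above; the S3 download helper, script path, and interval are placeholders for whatever logic actually fetches the file to local disk:

import scala.collection.mutable.Queue
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3ScriptStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("S3ScriptStream"), Seconds(30))

    val rddQueue = new Queue[RDD[String]]()
    val stream = ssc.queueStream(rddQueue)   // each queued RDD becomes one batch
    stream.print()
    ssc.start()

    // Driver-side loop: download the file yourself, run the script over it with
    // pipe(), and push the result into the queue so it shows up in the DStream.
    while (true) {
      val localPath = downloadFromS3("s3://my-bucket/input")   // hypothetical helper
      val scripted = ssc.sparkContext.textFile(localPath).pipe("/path/to/script.sh")
      rddQueue.synchronized { rddQueue += scripted }
      Thread.sleep(30000)
    }
  }

  // Placeholder only: real code would use the AWS SDK (or read s3n:// directly).
  def downloadFromS3(uri: String): String = ???
}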
ec2 deployment regions supported
Hi, I am interested in deploying Spark 1.0.0 on EC2 and wanted to know which regions are supported. I was able to deploy the previous version in the east region, but I had a hard time launching the cluster due to a bad connection: the provided script would fail to SSH into a node after a couple of tries and stop.
Re: New user streaming question
Thanks all - I still don't know what the underlying problem is, but I KIND OF got it working by dumping my random-words stuff to a file and pointing Spark Streaming to that. So it's not streaming, as such, but I got output. More investigation to follow =) On Sat, Jun 7, 2014 at 8:22 AM, Gino Bustelo lbust...@gmail.com wrote: I would make sure that your workers are running. It is very difficult to tell from the console dribble whether you just have no data or the workers have disassociated from the master. Gino B. On Jun 6, 2014, at 11:32 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Yup, when it's running, DStream.print() will print out a timestamped block for every time step, even if the block is empty (for v1.0.0, which I have running in the other window). If you're not getting that, I'd guess the stream hasn't started up properly. On Sat, Jun 7, 2014 at 11:50 AM, Michael Campbell michael.campb...@gmail.com wrote: I've been playing with Spark and streaming and have a question on stream outputs. The symptom is I don't get any. I have run spark-shell and everything works as I expect, but when I run the word-count example with streaming, it *works* in that things happen and there are no errors, but I never get any output. Am I understanding how it is supposed to work correctly? Is the DStream.print() method supposed to print the output for every (micro)batch of the streamed data? If that's the case, I'm not seeing it. I'm using the netcat example and the StreamingContext uses the network to read words, but as I said, nothing comes out. I tried changing the .print() to .saveAsTextFiles(), and I AM getting a file, but nothing is in it other than a _temporary subdir. I'm sure I'm confused here, but not sure where. Help? -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
How to process multiple classification with SVM in MLlib
Hi All, As we know, in MLlib the SVM is used for binary classification. I wonder how to train an SVM model for multiclass classification in MLlib. In addition, how can I apply a machine learning algorithm in Spark if the algorithm isn't included in MLlib? Thank you.
Re: error loading large files in PySpark 0.9.0
Ah, looking at that InputFormat, it should just work out of the box using sc.newAPIHadoopFile ... Would be interested to hear if it works as expected for you (in Python you'll end up with bytearray values). N — Sent from Mailbox On Fri, Jun 6, 2014 at 9:38 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: Oh cool, thanks for the heads up! Especially for the Hadoop InputFormat support. We recently wrote a custom Hadoop input format so we can support flat binary files (https://github.com/freeman-lab/thunder/tree/master/scala/src/main/scala/thunder/util/io/hadoop), and have been testing it in Scala. So I was following Nick's progress and was eager to check this out when ready. Will let you guys know how it goes. -- J
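In case it helps, a hedged sketch of the newAPIHadoopFile call shape on the Scala side. TextInputFormat is used here only so the snippet stands alone; the custom binary InputFormat would slot into the same three class parameters with its own key/value Writable types, and the path is a placeholder.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hadoop-inputformat-read"))

val records = sc.newAPIHadoopFile(
  "hdfs:///path/to/input",      // placeholder path
  classOf[TextInputFormat],     // the InputFormat class to plug in
  classOf[LongWritable],        // key type emitted by that InputFormat
  classOf[Text])                // value type emitted by that InputFormat

records.map { case (offset, line) => (offset.get, line.toString) }
  .take(5)
  .foreach(println)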
Re: Spark Streaming, download a s3 file to run a script shell on it
The QueueStream example is in the Spark Streaming examples: http://www.boyunjian.com/javasrc/org.spark-project/spark-examples_2.9.3/0.7.2/_/spark/streaming/examples/QueueStream.scala Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Sat, Jun 7, 2014 at 6:41 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: So you can run a job / Spark job to get the data onto disk/HDFS, then run a DStream from an HDFS folder. As you move your files in, the DStream will kick in. Regards Mayur On 6 Jun 2014 21:13, Gianluca Privitera gianluca.privite...@studio.unibo.it wrote: Where are the APIs for QueueStream and RDDQueue? In my solution I cannot open a DStream with the S3 location because I need to run a script on the file (a script that unluckily doesn't accept stdin as input), so I have to download it to my disk somehow and then handle it from there before creating the stream. Thanks Gianluca On 06/06/2014 02:19, Mayur Rustagi wrote: You can look at creating a DStream directly from the S3 location using a file stream. If you want to use any specific logic, you can rely on QueueStream: read the data yourself from S3, process it, and push it into an RDD queue. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Jun 6, 2014 at 3:00 AM, Gianluca Privitera gianluca.privite...@studio.unibo.it wrote: Hi, I've got a weird question, but maybe someone has already dealt with it. My Spark Streaming application needs to - download a file from an S3 bucket, - run a script with the file as input, - create a DStream from this script's output. I've already got the second part done with the rdd.pipe() API, which really fits my needs, but I have no idea how to manage the first part. How can I manage to download a file and run a script on it inside a Spark Streaming application? Should I use process() from Scala, or will that not work? Thanks Gianluca
Re: Using Java functions in Spark
Increasing the number of partitions on the data file solved the problem. On 6 June 2014 18:46, Oleg Proudnikov oleg.proudni...@gmail.com wrote: Additional observation - the map and mapValues are pipelined and executed - as expected - in pairs. This means that there is a simple sequence of steps - first a read from Cassandra and then processing for each value of K. This is the exact behaviour of a normal Java loop with these two steps inside. I understand that this eliminates batch loading first and the pile-up of massive text arrays. Also, the keys are relatively evenly distributed across executors. The question is - why is this still so slow? I would appreciate any suggestions on where to focus my search. Thank you, Oleg On 6 June 2014 16:24, Oleg Proudnikov oleg.proudni...@gmail.com wrote: Hi All, I am passing Java static methods into the RDD transformations map and mapValues. The first map is from a simple string K into a (K,V) where V is a Java ArrayList of large text strings, 50K each, read from Cassandra. mapValues processes these text blocks into very small ArrayLists. The code runs quite slowly compared to running it in parallel on the same servers from plain Java. I gave the same heap size to the executors and to plain Java. Does Java run slower under Spark, do I suffer from excess heap pressure, or am I missing something? Thank you for any insight, Oleg -- Kind regards, Oleg -- Kind regards, Oleg
Re: cache spark sql parquet file in memory?
Not a stupid question! I would like to be able to do this. For now, you might try writing the data to tachyon http://tachyon-project.org/ instead of HDFS. This is untested though, please report any issues you run into. Michael On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen xche...@gmail.com wrote: This might be a stupid question... but it seems that saveAsParquetFile() writes everything back to HDFS. I am wondering if it is possible to cache parquet-format intermediate results in memory, and therefore making spark sql queries faster. Thanks. -Simon
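A minimal, untested sketch of that suggestion: point saveAsParquetFile at a tachyon:// URI instead of HDFS. The Tachyon host, port, paths, and table name are placeholders, and this assumes the Tachyon client is available on the classpath.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Record(key: String, value: Int)

val sc = new SparkContext(new SparkConf().setAppName("parquet-on-tachyon"))
val sqlContext = new SQLContext(sc)
import sqlContext._   // implicit conversion of case-class RDDs to SchemaRDD

val data = sc.parallelize(1 to 100).map(i => Record("k" + i, i))

// Write the intermediate result to Tachyon (memory-centric FS) instead of HDFS...
data.saveAsParquetFile("tachyon://tachyon-master:19998/tmp/intermediate.parquet")

// ...and read it back for later SQL queries.
val cached = sqlContext.parquetFile("tachyon://tachyon-master:19998/tmp/intermediate.parquet")
cached.registerAsTable("intermediate")
sqlContext.sql("SELECT COUNT(*) FROM intermediate").collect().foreach(println)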
Re: cache spark sql parquet file in memory?
I was also thinking of using Tachyon to store Parquet files - maybe tomorrow I will give it a try as well. 2014-06-07 20:01 GMT+02:00 Michael Armbrust mich...@databricks.com: Not a stupid question! I would like to be able to do this. For now, you might try writing the data to tachyon http://tachyon-project.org/ instead of HDFS. This is untested though, please report any issues you run into. Michael On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen xche...@gmail.com wrote: This might be a stupid question... but it seems that saveAsParquetFile() writes everything back to HDFS. I am wondering if it is possible to cache parquet-format intermediate results in memory, and therefore making spark sql queries faster. Thanks. -Simon
Re: best practice: write and debug Spark application in scala-ide and maven
For debugging, I run locally inside Eclipse without maven. I just add the Spark assembly jar to my Eclipse project build path and click 'Run As... Scala Application'. I have done the same with Java and Scala Test, it's quick and easy. I didn't see any third party jar dependencies in your code, so that should be sufficient for your example. - Madhu https://www.linkedin.com/in/msiddalingaiah
Re: cache spark sql parquet file in memory?
Is there a way to start tachyon on top of a yarn cluster? On Jun 7, 2014 2:11 PM, Marek Wiewiorka marek.wiewio...@gmail.com wrote: I was also thinking of using tachyon to store parquet files - maybe tomorrow I will give a try as well. 2014-06-07 20:01 GMT+02:00 Michael Armbrust mich...@databricks.com: Not a stupid question! I would like to be able to do this. For now, you might try writing the data to tachyon http://tachyon-project.org/ instead of HDFS. This is untested though, please report any issues you run into. Michael On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen xche...@gmail.com wrote: This might be a stupid question... but it seems that saveAsParquetFile() writes everything back to HDFS. I am wondering if it is possible to cache parquet-format intermediate results in memory, and therefore making spark sql queries faster. Thanks. -Simon
Dumping Metrics on HDFS
Hi All, I am running Spark applications in yarn-cluster mode and need to read the Spark application metrics even after the application is over. I was planning to use the CSV sink, but it seems that Codahale's CsvReporter only supports dumping metrics to the local filesystem. Any suggestions to navigate around this limitation would be helpful. Thanks, Rahul Singhal
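For reference, a hedged sketch of a conf/metrics.properties CSV sink (the directory is a placeholder): since the CsvSink itself only writes to a node-local directory, one workaround is to write there on each node and then copy or aggregate those files to HDFS after the application finishes, e.g. with hadoop fs -put from the node or an existing log-aggregation step.

# Enable the CSV sink for all metric instances (master, worker, driver, executor).
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
# Must be a local directory on each node; CsvReporter cannot write to HDFS directly.
*.sink.csv.directory=/tmp/spark-metrics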
Re: Gradient Descent with MLBase
Hi Aslan, You can check out the unit test code for GradientDescent.runMiniBatchSGD: https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sat, Jun 7, 2014 at 6:24 AM, Aslan Bekirov aslanbeki...@gmail.com wrote: Hi All, I have to create a model using SGD in MLbase. I examined MLbase a bit and ran some samples of classification, collaborative filtering, etc., but I could not run gradient descent. I have to run val model = GradientDescent.runMiniBatchSGD(params) - of course, the params must be computed first. I tried but could not manage to pass the parameters correctly. Can anyone explain the parameters a bit and give a code example? BR, Aslan
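A hedged sketch of calling runMiniBatchSGD directly, with toy labels and features; the step size, iteration count, and regularization values below are arbitrary placeholders, and the linked suite is the authoritative reference for the exact signature in your Spark version.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{GradientDescent, LogisticGradient, SquaredL2Updater}

val sc = new SparkContext(new SparkConf().setAppName("sgd-example"))

// data is an RDD of (label, feature vector) pairs.
val data = sc.parallelize(Seq(
  (1.0, Vectors.dense(1.0, 2.0)),
  (0.0, Vectors.dense(-1.0, -2.5)),
  (1.0, Vectors.dense(0.5, 1.5))
)).cache()

val (weights, lossHistory) = GradientDescent.runMiniBatchSGD(
  data,
  new LogisticGradient(),    // gradient of the loss being minimized
  new SquaredL2Updater(),    // weight update / regularization rule
  0.1,                       // stepSize
  100,                       // numIterations
  0.01,                      // regParam
  1.0,                       // miniBatchFraction (1.0 = use the full batch each iteration)
  Vectors.dense(0.0, 0.0))   // initialWeights

println("final weights: " + weights + ", last loss: " + lossHistory.last)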
Re: How to process multiple classification with SVM in MLlib
At this time, you need to do one-vs-all manually for multiclass training. For your second question, if the algorithm is implemented in Java/Scala/Python and designed for a single machine, you can broadcast the dataset to each worker and train models on the workers. If the algorithm is implemented in a different language, maybe you need pipe() to train the models outside the JVM (similar to Hadoop Streaming). If the algorithm is designed for a different parallel platform, then it may be hard to use it in Spark. -Xiangrui On Sat, Jun 7, 2014 at 7:15 AM, littlebird cxp...@163.com wrote: Hi All, As we know, in MLlib the SVM is used for binary classification. I wonder how to train an SVM model for multiclass classification in MLlib. In addition, how can I apply a machine learning algorithm in Spark if the algorithm isn't included in MLlib? Thank you.
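A hedged sketch of the manual one-vs-all scheme mentioned above, assuming class labels 0.0 ... numClasses-1 in the LabeledPoints and a placeholder iteration count: train one binary SVM per class and predict with the model that returns the largest raw margin.

import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainOneVsAll(data: RDD[LabeledPoint], numClasses: Int, numIterations: Int = 100): Array[SVMModel] =
  (0 until numClasses).map { k =>
    // Relabel: this class vs. everything else.
    val binary = data.map(p => LabeledPoint(if (p.label == k.toDouble) 1.0 else 0.0, p.features))
    val model = SVMWithSGD.train(binary, numIterations)
    model.clearThreshold()   // keep raw margins so scores are comparable across models
    model
  }.toArray

def predictOneVsAll(models: Array[SVMModel], features: Vector): Int =
  models.zipWithIndex.maxBy { case (m, _) => m.predict(features) }._2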