Spark with Spark Streaming

2014-06-07 Thread b0c1
Hi! Is there any way to use Spark and Spark Streaming together to create a real-time architecture? How can I merge the Spark and Spark Streaming results in real time (and drop the streaming result once the Spark result is generated)? Thanks
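[A minimal sketch, not from the thread, of one way to do this: join each streaming micro-batch against a precomputed batch RDD inside transform(), preferring the batch value when one exists. All paths, hosts, and field layouts below are hypothetical:]

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("BatchPlusStreaming")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Precomputed batch view, keyed the same way as the stream.
    val batchResults = ssc.sparkContext
      .textFile("hdfs:///views/batch")
      .map { line => val Array(k, v) = line.split(","); (k, v) }

    val stream = ssc.socketTextStream("localhost", 9999)
      .map { line => val Array(k, v) = line.split(","); (k, v) }

    // Prefer the batch value when it exists, else keep the streaming value.
    val merged = stream.transform { rdd =>
      rdd.leftOuterJoin(batchResults).mapValues {
        case (_, Some(batchV)) => batchV
        case (v, None)         => v
      }
    }
    merged.print()
    ssc.start()
    ssc.awaitTermination()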

Re: best practice: write and debug Spark application in scala-ide and maven

2014-06-07 Thread Gerard Maas
I think that you have two options: - to run your code locally, you can use local mode by setting the 'local' master, like so: new SparkConf().setMaster("local[4]"), where 4 is the number of cores assigned to local mode. - to run your code remotely, you need to build the jar with dependencies and
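[For reference, a minimal runnable sketch of the local-mode option; note the quotes, since "local[4]" is a string, and the app name is illustrative:]

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("DebugLocally") // name is illustrative
      .setMaster("local[4]")      // in-process, 4 worker threads
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum()) // 5050.0
    sc.stop()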

Re: Scheduling code for Spark

2014-06-07 Thread Gerard Maas
Hi, The scheduling-related code can be found at: https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/scheduler The DAG (Directed Acyclic Graph) scheduler is a good starting point:

Re: Using Spark on Data size larger than Memory size

2014-06-07 Thread Vibhor Banga
Aaron, Thank you for your response and for clarifying things. -Vibhor On Sun, Jun 1, 2014 at 11:40 AM, Aaron Davidson ilike...@gmail.com wrote: There is no fundamental issue if you're running on data that is larger than the cluster memory size. Many operations can stream data through, and thus

Re: New user streaming question

2014-06-07 Thread Gino Bustelo
I would make sure that your workers are running. It is very difficult to tell from the console dribble whether you just have no data or the workers have disassociated from the master. Gino B. On Jun 6, 2014, at 11:32 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Yup, when it's running,

Re: Spark Streaming, download a s3 file to run a script shell on it

2014-06-07 Thread Mayur Rustagi
So you can run a Spark job to get data to disk/HDFS, then run a DStream from an HDFS folder. As you move your files in, the DStream will kick in. Regards Mayur On 6 Jun 2014 21:13, Gianluca Privitera gianluca.privite...@studio.unibo.it wrote: Where are the APIs for QueueStream and RddQueue?
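[A short sketch of the file-based DStream Mayur describes, assuming a hypothetical HDFS directory; files must be moved into the watched directory atomically (write elsewhere, then rename):]

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("WatchHdfsDir")
    val ssc = new StreamingContext(conf, Seconds(30))

    // Every file moved into this directory after the job starts
    // becomes part of the next 30-second batch.
    val lines = ssc.textFileStream("hdfs:///incoming")
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()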

ec2 deployment regions supported

2014-06-07 Thread Joe Mathai
Hi, I am interested in deploying Spark 1.0.0 on EC2 and wanted to know which regions are supported. I was able to deploy the previous version in the east region, but I had a hard time launching the cluster due to a bad connection: the provided script would fail to SSH into a node after a couple of
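[For reference, the spark-ec2 script takes a --region flag (it defaults to us-east-1), so launching in another region looks roughly like this; the key pair, instance type, region, and cluster name are placeholders:]

    ./spark-ec2 --key-pair=mykey --identity-file=mykey.pem \
      --region=eu-west-1 --instance-type=m1.large \
      launch my-cluster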

Re: New user streaming question

2014-06-07 Thread Michael Campbell
Thanks all - I still don't know what the underlying problem is, but I KIND OF got it working by dumping my random-words stuff to a file and pointing Spark Streaming at that. So it's not Streaming, as such, but I got output. More investigation to follow =) On Sat, Jun 7, 2014 at 8:22 AM, Gino

How to process multiple classification with SVM in MLlib

2014-06-07 Thread littlebird
Hi All, As we know, in MLlib the SVM is used for binary classification. I wonder how to train an SVM model for multiclass classification in MLlib. In addition, how can I apply a machine learning algorithm in Spark if the algorithm isn't included in MLlib? Thank you.

Re: error loading large files in PySpark 0.9.0

2014-06-07 Thread Nick Pentreath
Ah, looking at that InputFormat, it should just work out of the box using sc.newAPIHadoopFile ... Would be interested to hear if it works as expected for you (in Python you'll end up with bytearray values). N On Fri, Jun 6, 2014 at 9:38 PM, Jeremy Freeman

Re: Spark Streaming, download a s3 file to run a script shell on it

2014-06-07 Thread Mayur Rustagi
The QueueStream example is in the Spark Streaming examples: http://www.boyunjian.com/javasrc/org.spark-project/spark-examples_2.9.3/0.7.2/_/spark/streaming/examples/QueueStream.scala Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi
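[The linked example boils down to pushing RDDs onto a queue that the streaming context polls each batch interval; a condensed sketch, with interval and values illustrative:]

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("QueueStreamSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Each batch interval consumes one RDD from the queue.
    val queue = new mutable.SynchronizedQueue[RDD[Int]]()
    val stream = ssc.queueStream(queue)
    stream.reduce(_ + _).print()

    ssc.start()
    for (i <- 1 to 5) {
      queue += ssc.sparkContext.makeRDD(1 to 100)
      Thread.sleep(1000)
    }
    ssc.stop()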

Re: Using Java functions in Spark

2014-06-07 Thread Oleg Proudnikov
Increasing the number of partitions on the data file solved the problem. On 6 June 2014 18:46, Oleg Proudnikov oleg.proudni...@gmail.com wrote: An additional observation - the map and mapValues are pipelined and executed - as expected - in pairs. This means that there is a simple sequence of steps -
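[For anyone hitting the same issue, two common ways to raise the partition count, sketched with illustrative paths and numbers, assuming sc is a SparkContext:]

    // 1) Ask for a minimum number of input partitions when reading the file.
    val data = sc.textFile("hdfs:///data/input.txt", 64)

    // 2) Repartition an existing RDD (this causes a shuffle).
    val repartitioned = data.repartition(64)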

Re: cache spark sql parquet file in memory?

2014-06-07 Thread Michael Armbrust
Not a stupid question! I would like to be able to do this. For now, you might try writing the data to Tachyon http://tachyon-project.org/ instead of HDFS. This is untested, though, so please report any issues you run into. Michael On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen xche...@gmail.com
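[A sketch of what this could look like in Spark 1.0 SQL: write the SchemaRDD to a tachyon:// URI instead of hdfs://. The Tachyon master address, table path, and schema below are hypothetical, and per Michael's caveat the combination is untested:]

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext._

    // Hypothetical schema and data; the point is only the output URI scheme.
    case class Record(key: Int, value: String)
    val rdd = sc.parallelize(1 to 100).map(i => Record(i, s"val_$i"))

    // Writing to tachyon:// keeps the Parquet file in Tachyon's memory tier.
    rdd.saveAsParquetFile("tachyon://tachyon-master:19998/tables/records")

    // Read it back and register it for SQL queries.
    val parquet = sqlContext.parquetFile("tachyon://tachyon-master:19998/tables/records")
    parquet.registerAsTable("records")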

Re: cache spark sql parquet file in memory?

2014-06-07 Thread Marek Wiewiorka
I was also thinking of using Tachyon to store Parquet files - maybe tomorrow I will give it a try as well. 2014-06-07 20:01 GMT+02:00 Michael Armbrust mich...@databricks.com: Not a stupid question! I would like to be able to do this. For now, you might try writing the data to Tachyon

Re: best practice: write and debug Spark application in scala-ide and maven

2014-06-07 Thread Madhu
For debugging, I run locally inside Eclipse without Maven. I just add the Spark assembly jar to my Eclipse project build path and click 'Run As... Scala Application'. I have done the same with Java and ScalaTest; it's quick and easy. I didn't see any third-party jar dependencies in your code, so
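[A minimal self-contained app of the kind Madhu describes, runnable via 'Run As... Scala Application' with just the Spark assembly jar on the build path; the input path is a placeholder:]

    import org.apache.spark.{SparkConf, SparkContext}

    object DebugApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("DebugApp").setMaster("local[2]"))
        val counts = sc.textFile("src/test/resources/sample.txt")
          .flatMap(_.split("\\s+"))
          .map((_, 1))
          .reduceByKey(_ + _)
        counts.take(10).foreach(println) // set breakpoints anywhere in here
        sc.stop()
      }
    }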

Re: cache spark sql parquet file in memory?

2014-06-07 Thread Xu (Simon) Chen
Is there a way to start tachyon on top of a yarn cluster? On Jun 7, 2014 2:11 PM, Marek Wiewiorka marek.wiewio...@gmail.com wrote: I was also thinking of using tachyon to store parquet files - maybe tomorrow I will give a try as well. 2014-06-07 20:01 GMT+02:00 Michael Armbrust

Dumping Metrics on HDFS

2014-06-07 Thread Rahul Singhal
Hi All, I am running Spark applications in yarn-cluster mode and need to read the Spark application metrics even after the application is over. I was planning to use the CSV sink, but it seems that Codahale's CsvReporter only supports dumping metrics to the local filesystem. Any suggestions to
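[For context, the CSV sink is enabled through metrics.properties along these lines; CsvSink's directory is a local path on whichever host the driver or executor runs on, which is exactly the limitation described above (period and directory values illustrative):]

    # Enable the CSV sink for all instances (driver, executor, ...).
    *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
    *.sink.csv.period=10
    *.sink.csv.unit=seconds
    *.sink.csv.directory=/tmp/spark-metrics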

Re: Gradient Descent with MLBase

2014-06-07 Thread DB Tsai
Hi Aslan, You can check out the unit test code for GradientDescent.runMiniBatchSGD: https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala Sincerely, DB Tsai
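[The suite exercises GradientDescent.runMiniBatchSGD directly; a condensed sketch of such a call, where data is assumed to be an RDD[(Double, Vector)] of (label, features) pairs and numFeatures is defined elsewhere, with all hyperparameters illustrative:]

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{GradientDescent, LogisticGradient, SimpleUpdater}

    val (weights, lossHistory) = GradientDescent.runMiniBatchSGD(
      data,
      new LogisticGradient(),
      new SimpleUpdater(),
      1.0,   // stepSize
      100,   // numIterations
      0.0,   // regParam
      1.0,   // miniBatchFraction (1.0 = full batch)
      Vectors.dense(new Array[Double](numFeatures))) // zero initial weights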

Re: How to process multiple classification with SVM in MLlib

2014-06-07 Thread Xiangrui Meng
At this time, you need to do one-vs-all manually for multiclass training. For your second question, if the algorithm is implemented in Java/Scala/Python and designed for a single machine, you can broadcast the dataset to each worker and train models on the workers. If the algorithm is implemented in a
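[A rough sketch of the manual one-vs-all approach on top of MLlib's binary SVM; the training set is assumed to be an RDD[LabeledPoint] with labels 0.0 through numClasses - 1, cached beforehand, and the iteration count is illustrative:]

    import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Train one binary SVM per class: class k vs. everything else.
    def trainOneVsAll(data: RDD[LabeledPoint],
                      numClasses: Int,
                      numIterations: Int = 100): Seq[SVMModel] =
      (0 until numClasses).map { k =>
        val binary = data.map(p =>
          LabeledPoint(if (p.label == k) 1.0 else 0.0, p.features))
        val model = SVMWithSGD.train(binary, numIterations)
        model.clearThreshold() // predict() now returns the raw margin
        model
      }

    // Classify by picking the class whose model reports the largest margin.
    def predict(models: Seq[SVMModel], features: Vector): Int =
      models.zipWithIndex.maxBy { case (m, _) => m.predict(features) }._2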