Spark with Spark Streaming

2014-06-07 Thread b0c1
Hi!

Is there any way to use Spark and Spark Streaming together to create a real-time
architecture?
How can I merge the Spark and Spark Streaming results in real time (and drop the
streaming result once the Spark result is generated)?

Thanks





Re: best practice: write and debug Spark application in scala-ide and maven

2014-06-07 Thread Gerard Maas
I think that you have two options:

- to run your code locally, you can use local mode by setting a 'local'
master like so:
 new SparkConf().setMaster("local[4]")  where 4 is the number of cores
assigned to the local mode.

- to run your code remotely you need to build the jar with dependencies and
add it to your configuration:
new SparkConf().setMaster("spark://uri")
  .setJars(Array("/path/to/target/jar-with-dependencies.jar"))
You will need to run maven before running your program to ensure the latest
version of your jar is built.

-regards, Gerard.
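
For concreteness, here is a minimal sketch combining both options; the master host,
jar path, and the debugLocally flag are placeholders of mine, not part of the
original code:

  import org.apache.spark.{SparkConf, SparkContext}

  object WordCountApp {
    def main(args: Array[String]) {
      // Flip this while iterating in the IDE; set it to false (after rebuilding the
      // jar-with-dependencies with maven) to run against the remote cluster.
      val debugLocally = true

      val conf =
        if (debugLocally)
          new SparkConf().setMaster("local[4]").setAppName("WordCount")
        else
          new SparkConf()
            .setMaster("spark://master-host:7077")
            .setAppName("WordCount")
            .setJars(Array("/path/to/target/myapp-jar-with-dependencies.jar"))

      val sc = new SparkContext(conf)
      // ... the rest of the job as in the original example ...
      sc.stop()
    }
  }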



On Sat, Jun 7, 2014 at 3:10 AM, Wei Tan w...@us.ibm.com wrote:

 Hi,

   I am trying to write and debug Spark applications in scala-ide and
 maven, and in my code I target a Spark instance at spark://xxx

 object App {

   def main(args: Array[String]) {
     println("Hello World!")
     val sparkConf = new SparkConf()
       .setMaster("spark://xxx:7077")
       .setAppName("WordCount")

     val spark = new SparkContext(sparkConf)
     val file = spark.textFile("hdfs://xxx:9000/wcinput/pg1184.txt")
     val counts = file.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
     counts.saveAsTextFile("hdfs://flex05.watson.ibm.com:9000/wcoutput")
   }

 }

 I added spark-core and hadoop-client as maven dependencies, so the code
 compiles fine.

 When I click Run in Eclipse, I get this error:

 14/06/06 20:52:18 WARN scheduler.TaskSetManager: Loss was due to
 java.lang.ClassNotFoundException
 java.lang.ClassNotFoundException: samples.App$$anonfun$2

 I googled this error and it seems that I need to package my code into a
 jar file and push it to the spark nodes. But since I am debugging the code, it
 would be handy if I could quickly see results without packaging and uploading
 jars.

 What is the best practice for writing a Spark application (like wordcount)
 and debugging it quickly against a remote Spark instance?

 Thanks!
 Wei


 -
 Wei Tan, PhD
 Research Staff Member
 IBM T. J. Watson Research Center
 http://researcher.ibm.com/person/us-wtan


Re: Scheduling code for Spark

2014-06-07 Thread Gerard Maas
Hi,

The scheduling related code can be found at:
https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/scheduler

The DAG (Directed Acyclic Graph) scheduler is a good starting point:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

Enjoy!

-Gerard.



On Sat, Jun 7, 2014 at 10:24 AM, rapelly kartheek kartheek.m...@gmail.com
wrote:

 Hi,
 I am new to the Spark framework and understand it to some extent. I also have
 some experience with Hadoop. The concepts of in-memory computation and RDDs
 are extremely fascinating.

 I am trying to understand the scheduler of the Spark framework. Can someone
 point me to where the Spark scheduler code lives?

 Thank you!!



Re: Using Spark on Data size larger than Memory size

2014-06-07 Thread Vibhor Banga
Aaron, Thank You for your response and clarifying things.

-Vibhor


On Sun, Jun 1, 2014 at 11:40 AM, Aaron Davidson ilike...@gmail.com wrote:

 There is no fundamental issue if you're running on data that is larger
 than cluster memory size. Many operations can stream data through, and thus
 memory usage is independent of input data size. Certain operations require
 an entire *partition* (not dataset) to fit in memory, but there are not
 many instances of this left (sorting comes to mind, and this is being
 worked on).

 In general, one problem with Spark today is that you *can* OOM under
 certain configurations, and it's possible you'll need to change from the
 default configuration if you're doing very memory-intensive jobs.
 However, there are very few cases where Spark would simply fail as a matter
 of course -- for instance, you can always increase the number of
 partitions to decrease the size of any given one, or repartition data to
 eliminate skew.

 Regarding impact on performance, as Mayur said, there may absolutely be an
 impact depending on your jobs. If you're doing a join on a very large
 amount of data with few partitions, then we'll have to spill to disk. If
 you can't cache your working set of data in memory, you will also see a
 performance degradation. Spark enables the use of memory to make things
 fast, but if you just don't have enough memory, it won't be terribly fast.
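
 To make those two knobs concrete, a small sketch (the paths and partition
 counts are made up, and it assumes an existing SparkContext sc):

   import org.apache.spark.storage.StorageLevel

   // More partitions => smaller partitions, so each one is likelier to fit in memory.
   val raw = sc.textFile("hdfs://namenode:9000/exported-table", 400) // minPartitions hint
   // ...or rebalance an existing RDD to shrink oversized/skewed partitions:
   val balanced = raw.repartition(400)

   // Let blocks that don't fit in memory spill to disk instead of failing.
   balanced.persist(StorageLevel.MEMORY_AND_DISK)
   println(balanced.count())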


 On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi mayur.rust...@gmail.com
 wrote:

 Clearly there will be an impact on performance, but frankly it depends on what
 you are trying to achieve with the dataset.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Some inputs will be really helpful.

 Thanks,
 -Vibhor


 On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Hi all,

 I am planning to use Spark with HBase, where I generate an RDD by reading
 data from an HBase table.

 I want to know: in the case when the HBase table grows larger than the RAM
 available in the cluster, will the application fail, or will there just be an
 impact on performance?

 Any thoughts in this direction will be helpful and are welcome.

 Thanks,
 -Vibhor




 --
 Vibhor Banga
 Software Development Engineer
 Flipkart Internet Pvt. Ltd., Bangalore






Re: New user streaming question

2014-06-07 Thread Gino Bustelo
I would make sure that your workers are running. It is very difficult to tell 
from the console dribble if you just have no data or the workers just 
disassociated from masters. 

Gino B.

 On Jun 6, 2014, at 11:32 PM, Jeremy Lee unorthodox.engine...@gmail.com 
 wrote:
 
 Yup, when it's running, DStream.print() will print out a timestamped block 
 for every time step, even if the block is empty. (for v1.0.0, which I have 
 running in the other window)
 
 If you're not getting that, I'd guess the stream hasn't started up properly. 
 
 
 On Sat, Jun 7, 2014 at 11:50 AM, Michael Campbell 
 michael.campb...@gmail.com wrote:
 I've been playing with spark and streaming and have a question on stream 
 outputs.  The symptom is I don't get any.
 
 I have run spark-shell and all does as I expect, but when I run the 
 word-count example with streaming, it *works* in that things happen and 
 there are no errors, but I never get any output.  
 
 Am I understanding how it is supposed to work correctly?  Is the
 Dstream.print() method supposed to print the output for every (micro)batch
 of the streamed data?  If that's the case, I'm not seeing it.
 
 I'm using the netcat example and the StreamingContext uses the network to 
 read words, but as I said, nothing comes out. 
 
 I tried changing the .print() to .saveAsTextFiles(), and I AM getting a 
 file, but nothing is in it other than a _temporary subdir.
 
 I'm sure I'm confused here, but not sure where.  Help?
 
 
 
 -- 
 Jeremy Lee  BCompSci(Hons)
   The Unorthodox Engineers
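
For reference, here is a minimal version of the netcat word count being discussed
(host, port, and batch interval are assumptions). Run "nc -lk 9999" in another
terminal and type some words; note that local mode needs at least two threads so
the socket receiver doesn't starve the batch processing:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object NetcatWordCount {
    def main(args: Array[String]) {
      // local[2]: one thread for the socket receiver, at least one for processing.
      val conf = new SparkConf().setMaster("local[2]").setAppName("NetcatWordCount")
      val ssc = new StreamingContext(conf, Seconds(1))

      val lines = ssc.socketTextStream("localhost", 9999)
      val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
      counts.print() // prints a timestamped block every batch, even when it is empty

      ssc.start()
      ssc.awaitTermination()
    }
  }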


Re: Spark Streaming, download a s3 file to run a script shell on it

2014-06-07 Thread Mayur Rustagi
So you can run a batch/Spark job to get the data to disk/HDFS, then run a
DStream on an HDFS folder. As you move your files in, the DStream will kick
in.
Regards
Mayur
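
A rough sketch of that approach follows (all hosts and paths are assumptions).
textFileStream only picks up files that appear in the watched directory after the
stream starts, which is why moving completed files in works well:

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Assumes an existing SparkContext sc.
  val ssc = new StreamingContext(sc, Seconds(30))

  // Watch an HDFS directory; each file moved into it becomes part of a batch.
  val processed = ssc.textFileStream("hdfs://namenode:9000/streaming/input")
  processed.print()
  ssc.start()

  // Meanwhile, a separate job/script: download the S3 file to local disk, run the
  // shell script on it, write the output to .../streaming/_staging/, and finally
  //   hdfs dfs -mv /streaming/_staging/part-0000 /streaming/input/
  // so the completed file shows up atomically for the DStream.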
On 6 Jun 2014 21:13, Gianluca Privitera 
gianluca.privite...@studio.unibo.it wrote:

  Where are the APIs for QueueStream and RDDQueue?
 In my solution I cannot open a DStream on an S3 location because I need to
 run a script on the file (a script that unfortunately doesn't accept stdin as
 input), so I have to download it to my disk somehow and then handle it from
 there before creating the stream.

 Thanks
 Gianluca

 On 06/06/2014 02:19, Mayur Rustagi wrote:

 You can look at creating a DStream directly from the S3 location using a file
 stream. If you want to use any specific logic, you can rely on QueueStream &
 read data yourself from S3, process it & push it into an RDDQueue.

  Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
  @mayur_rustagi https://twitter.com/mayur_rustagi



 On Fri, Jun 6, 2014 at 3:00 AM, Gianluca Privitera 
 gianluca.privite...@studio.unibo.it wrote:

 Hi,
 I've got a weird question but maybe someone has already dealt with it.
 My Spark Streaming application needs to:
 - download a file from an S3 bucket,
 - run a script with the file as input,
 - create a DStream from this script's output.
 I've already got the second part done with the rdd.pipe() API, which really
 fits my needs, but I have no idea how to manage the first part.
 How can I manage to download a file and run a script on it inside a
 Spark Streaming application?
 Should I use process() from Scala, or won't that work?

 Thanks
 Gianluca






ec2 deployment regions supported

2014-06-07 Thread Joe Mathai
Hi ,

I am interested in deploying Spark 1.0.0 on EC2 and wanted to know
which regions are supported. I have been able to deploy the
previous version in us-east, but I had a hard time launching the cluster
due to a bad connection: the provided script would fail to SSH into a
node after a couple of tries and stop.
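
For what it's worth, the region is selected with the spark-ec2 script's --region
flag; a rough example (key pair, region, and cluster size are placeholders):

  # Placeholders: key pair, identity file, slave count, region, cluster name.
  ./spark-ec2 --key-pair=my-key --identity-file=my-key.pem \
    --region=eu-west-1 --slaves=2 launch my-spark-cluster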


Re: New user streaming question

2014-06-07 Thread Michael Campbell
Thanks all - I still don't know what the underlying problem is, but I KIND
OF got it working by dumping my random-words stuff to a file and pointing
spark streaming to that.  So it's not Streaming, as such, but I got
output.

More investigation to follow =)


On Sat, Jun 7, 2014 at 8:22 AM, Gino Bustelo lbust...@gmail.com wrote:

 I would make sure that your workers are running. It is very difficult to
 tell from the console dribble if you just have no data or the workers just
 disassociated from masters.

 Gino B.

 On Jun 6, 2014, at 11:32 PM, Jeremy Lee unorthodox.engine...@gmail.com
 wrote:

 Yup, when it's running, DStream.print() will print out a timestamped block
 for every time step, even if the block is empty. (for v1.0.0, which I have
 running in the other window)

 If you're not getting that, I'd guess the stream hasn't started up
 properly.


 On Sat, Jun 7, 2014 at 11:50 AM, Michael Campbell 
 michael.campb...@gmail.com wrote:

 I've been playing with spark and streaming and have a question on stream
 outputs.  The symptom is I don't get any.

 I have run spark-shell and all does as I expect, but when I run the
 word-count example with streaming, it *works* in that things happen and
 there are no errors, but I never get any output.

 Am I understanding how it is supposed to work correctly?  Is the
 Dstream.print() method supposed to print the output for every (micro)batch
 of the streamed data?  If that's the case, I'm not seeing it.

 I'm using the netcat example and the StreamingContext uses the network
 to read words, but as I said, nothing comes out.

 I tried changing the .print() to .saveAsTextFiles(), and I AM getting a
 file, but nothing is in it other than a _temporary subdir.

 I'm sure I'm confused here, but not sure where.  Help?




 --
 Jeremy Lee  BCompSci(Hons)
   The Unorthodox Engineers




How to process multiple classification with SVM in MLlib

2014-06-07 Thread littlebird
Hi All,
  As we know, in MLlib the SVM is used for binary classification. I wonder
how to train an SVM model for multiclass classification in MLlib. In addition,
how can I apply a machine learning algorithm in Spark if the algorithm isn't
included in MLlib? Thank you.





Re: error loading large files in PySpark 0.9.0

2014-06-07 Thread Nick Pentreath
Ah, looking at that InputFormat, it should just work out of the box using
sc.newAPIHadoopFile ...


Would be interested to hear if it works as expected for you (in python you'll 
end up with bytearray values).




N
—
Sent from Mailbox

On Fri, Jun 6, 2014 at 9:38 PM, Jeremy Freeman freeman.jer...@gmail.com
wrote:

 Oh cool, thanks for the heads up! Especially for the Hadoop InputFormat
 support. We recently wrote a custom hadoop input format so we can support
 flat binary files
 (https://github.com/freeman-lab/thunder/tree/master/scala/src/main/scala/thunder/util/io/hadoop),
 and have been testing it in Scala. So I was following Nick's progress and
 was eager to check this out when ready. Will let you guys know how it goes.
 -- J

Re: Spark Streaming, download a s3 file to run a script shell on it

2014-06-07 Thread Mayur Rustagi
QueueStream example is in Spark Streaming examples:
http://www.boyunjian.com/javasrc/org.spark-project/spark-examples_2.9.3/0.7.2/_/spark/streaming/examples/QueueStream.scala


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
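
A tiny sketch of the queueStream/RDD-queue pattern from that example (the script
name and local path are assumptions, and it assumes an existing SparkContext sc):

  import scala.collection.mutable.Queue
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(10))

  // The DStream is fed by whatever RDDs we push into this queue.
  val rddQueue = new Queue[RDD[String]]()
  val stream = ssc.queueStream(rddQueue)
  stream.print()
  ssc.start()

  // Produced elsewhere (e.g. after downloading the S3 file and running the script on it):
  val scriptOutput: RDD[String] =
    sc.textFile("/tmp/downloaded-from-s3.txt").pipe("./my_script.sh")
  rddQueue.synchronized { rddQueue += scriptOutput }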



On Sat, Jun 7, 2014 at 6:41 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:

 So you can run a batch/Spark job to get the data to disk/HDFS, then run a
 DStream on an HDFS folder. As you move your files in, the DStream will kick
 in.
 Regards
 Mayur
 On 6 Jun 2014 21:13, Gianluca Privitera 
 gianluca.privite...@studio.unibo.it wrote:

  Where are the APIs for QueueStream and RDDQueue?
 In my solution I cannot open a DStream on an S3 location because I need to
 run a script on the file (a script that unfortunately doesn't accept stdin as
 input), so I have to download it to my disk somehow and then handle it from
 there before creating the stream.

 Thanks
 Gianluca

 On 06/06/2014 02:19, Mayur Rustagi wrote:

 You can look at creating a DStream directly from the S3 location using a file
 stream. If you want to use any specific logic, you can rely on QueueStream &
 read data yourself from S3, process it & push it into an RDDQueue.

  Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
  @mayur_rustagi https://twitter.com/mayur_rustagi



 On Fri, Jun 6, 2014 at 3:00 AM, Gianluca Privitera 
 gianluca.privite...@studio.unibo.it wrote:

 Hi,
 I've got a weird question but maybe someone has already dealt with it.
 My Spark Streaming application needs to:
 - download a file from an S3 bucket,
 - run a script with the file as input,
 - create a DStream from this script's output.
 I've already got the second part done with the rdd.pipe() API, which really
 fits my needs, but I have no idea how to manage the first part.
 How can I manage to download a file and run a script on it inside a
 Spark Streaming application?
 Should I use process() from Scala, or won't that work?

 Thanks
 Gianluca






Re: Using Java functions in Spark

2014-06-07 Thread Oleg Proudnikov
Increasing the number of partitions on the data file solved the problem.
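
For anyone hitting the same thing, the change amounts to something like this (the
path and partition count below are made up):

  // Ask for more partitions when reading...
  val data = sc.textFile("hdfs://namenode:9000/keys.txt", 200) // minPartitions hint
  // ...or split up an existing RDD before the expensive map/mapValues steps:
  val widened = data.repartition(200)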


On 6 June 2014 18:46, Oleg Proudnikov oleg.proudni...@gmail.com wrote:

 Additional observation - the map and mapValues are pipelined and executed
 - as expected - in pairs. This means that there is a simple sequence of
 steps - first read from Cassandra and then processing for each value of K.
 This is the exact behaviour of a normal Java loop with these two steps
 inside. I understand that this eliminates batch loading first and pile up
 of massive text arrays.

 Also the keys are relatively evenly distributed across Executors.

 The question is - why is this still so slow? I would appreciate any
 suggestions on where to focus my search.

 Thank you,
 Oleg



 On 6 June 2014 16:24, Oleg Proudnikov oleg.proudni...@gmail.com wrote:

 Hi All,

 I am passing Java static methods into the RDD transformations map and
 mapValues. The first map goes from a simple string K to a (K, V) pair, where V
 is a Java ArrayList of large text strings, 50K each, read from Cassandra.
 mapValues then processes these text blocks into very small ArrayLists.

 The code runs quite slow compared to running it in parallel on the same
 servers from plain Java.

 I gave the same heap to the Executors and to plain Java. Does Java run slower
 under Spark, or do I suffer from excess heap pressure, or am I missing
 something?

 Thank you for any insight,
 Oleg




 --
 Kind regards,

 Oleg




-- 
Kind regards,

Oleg


Re: cache spark sql parquet file in memory?

2014-06-07 Thread Michael Armbrust
Not a stupid question!  I would like to be able to do this.  For now, you
might try writing the data to tachyon http://tachyon-project.org/ instead
of HDFS.  This is untested though, please report any issues you run into.

Michael
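
A rough sketch of that suggestion, assuming a running Tachyon master; the host,
port, query, and paths below are placeholders:

  import org.apache.spark.sql.SQLContext

  // Assumes an existing SparkContext sc.
  val sqlContext = new SQLContext(sc)

  val results = sqlContext.sql("SELECT * FROM some_table") // hypothetical registered table

  // Write the Parquet data to Tachyon instead of plain HDFS...
  results.saveAsParquetFile("tachyon://tachyon-master:19998/tmp/results.parquet")

  // ...so later reads are served from Tachyon's in-memory tier.
  val reloaded = sqlContext.parquetFile("tachyon://tachyon-master:19998/tmp/results.parquet")
  reloaded.registerAsTable("cached_results")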


On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen xche...@gmail.com wrote:

 This might be a stupid question... but it seems that saveAsParquetFile()
 writes everything back to HDFS. I am wondering if it is possible to cache
 parquet-format intermediate results in memory, and therefore making spark
 sql queries faster.

 Thanks.
 -Simon



Re: cache spark sql parquet file in memory?

2014-06-07 Thread Marek Wiewiorka
I was also thinking of using tachyon to store parquet files - maybe
tomorrow I will give it a try as well.


2014-06-07 20:01 GMT+02:00 Michael Armbrust mich...@databricks.com:

 Not a stupid question!  I would like to be able to do this.  For now, you
 might try writing the data to tachyon http://tachyon-project.org/
 instead of HDFS.  This is untested though, please report any issues you run
 into.

 Michael


 On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen xche...@gmail.com wrote:

 This might be a stupid question... but it seems that saveAsParquetFile()
 writes everything back to HDFS. I am wondering if it is possible to cache
 parquet-format intermediate results in memory, and therefore making spark
 sql queries faster.

 Thanks.
 -Simon





Re: best practice: write and debug Spark application in scala-ide and maven

2014-06-07 Thread Madhu
For debugging, I run locally inside Eclipse without maven.
I just add the Spark assembly jar to my Eclipse project build path and click
'Run As... Scala Application'.
I have done the same with Java and ScalaTest; it's quick and easy.
I didn't see any third-party jar dependencies in your code, so that should
be sufficient for your example.



-
Madhu
https://www.linkedin.com/in/msiddalingaiah


Re: cache spark sql parquet file in memory?

2014-06-07 Thread Xu (Simon) Chen
Is there a way to start Tachyon on top of a YARN cluster?
 On Jun 7, 2014 2:11 PM, Marek Wiewiorka marek.wiewio...@gmail.com
wrote:

 I was also thinking of using tachyon to store parquet files - maybe
 tomorrow I will give it a try as well.


 2014-06-07 20:01 GMT+02:00 Michael Armbrust mich...@databricks.com:

 Not a stupid question!  I would like to be able to do this.  For now, you
 might try writing the data to tachyon http://tachyon-project.org/
 instead of HDFS.  This is untested though, please report any issues you run
 into.

 Michael


 On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen xche...@gmail.com
 wrote:

 This might be a stupid question... but it seems that saveAsParquetFile()
 writes everything back to HDFS. I am wondering if it is possible to cache
 parquet-format intermediate results in memory, and therefore making spark
 sql queries faster.

 Thanks.
 -Simon






Dumping Metrics on HDFS

2014-06-07 Thread Rahul Singhal
Hi All,

I am running Spark applications in yarn-cluster mode and need to read the Spark
application metrics even after the application is over. I was planning to use
the CSV sink, but it seems that Codahale's CsvReporter only supports dumping
metrics to the local filesystem.

Any suggestions for working around this limitation would be helpful.

Thanks,
Rahul Singhal
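
For reference, enabling the CSV sink looks roughly like the following in
conf/metrics.properties (the period and directory below are placeholders); since
CsvReporter writes only to a local path, one workaround is to copy that directory
to HDFS after the application finishes (e.g. with hdfs dfs -put):

  *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
  *.sink.csv.period=10
  *.sink.csv.unit=seconds
  *.sink.csv.directory=/tmp/spark-metrics/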


Re: Gradient Descent with MLBase

2014-06-07 Thread DB Tsai
Hi Aslan,

You can check out the unit test code for GradientDescent.runMiniBatchSGD:

https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
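
In case it helps, here is a rough sketch of a call, assuming Spark 1.0's signature
(data as (label, features) pairs, then gradient, updater, stepSize, numIterations,
regParam, miniBatchFraction, initialWeights); all concrete values below are made up,
and the gradient/updater choice depends on the model you want:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.optimization.{GradientDescent, LogisticGradient, SquaredL2Updater}

  // Training data as (label, features) pairs; assumes an existing SparkContext sc.
  val data = sc.parallelize(Seq(
    (1.0, Vectors.dense(1.0, 2.0)),
    (0.0, Vectors.dense(-1.0, -2.0))
  ))

  val (weights, lossHistory) = GradientDescent.runMiniBatchSGD(
    data,
    new LogisticGradient(),  // gradient of the loss being minimized (logistic loss here)
    new SquaredL2Updater(),  // weight update rule, here with L2 regularization
    0.1,                     // stepSize: initial SGD step size
    100,                     // numIterations
    0.01,                    // regParam: regularization strength
    1.0,                     // miniBatchFraction: fraction of data sampled each iteration
    Vectors.dense(0.0, 0.0)) // initialWeights

  println(s"weights: $weights, final loss: ${lossHistory.last}")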


On Sat, Jun 7, 2014 at 6:24 AM, Aslan Bekirov aslanbeki...@gmail.com
wrote:

 Hi All,

 I have to create a model using SGD in MLbase. I examined MLbase a bit and
 ran some samples of classification, collaborative filtering, etc., but I
 could not run gradient descent. I have to run

  val model = GradientDescent.runMiniBatchSGD(params)

 Of course, the params must be computed beforehand. I tried but could not
 manage to give the parameters correctly.

 Can anyone explain the parameters a bit and give a code example?

 BR,
 Aslan




Re: How to process multiple classification with SVM in MLlib

2014-06-07 Thread Xiangrui Meng
At this time, you need to do one-vs-all manually for multiclass
training. For your second question, if the algorithm is implemented in
Java/Scala/Python and designed for a single machine, you can broadcast
the dataset to each worker and train models on the workers. If the algorithm
is implemented in a different language, you may need pipe to train
the models outside the JVM (similar to Hadoop Streaming). If the algorithm
is designed for a different parallel platform, then it may be hard to
use it in Spark. -Xiangrui
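
For the first question, a rough sketch of the manual one-vs-all approach (it assumes
clearThreshold() is available on SVMModel in your MLlib version, so that predict
returns raw margins that can be compared across models):

  import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.rdd.RDD

  // Train one binary SVM per class: class k vs. everything else.
  def trainOneVsAll(data: RDD[LabeledPoint], numClasses: Int, numIterations: Int): Array[SVMModel] =
    (0 until numClasses).map { k =>
      val binary = data.map(p => LabeledPoint(if (p.label == k) 1.0 else 0.0, p.features))
      val model = SVMWithSGD.train(binary, numIterations)
      model.clearThreshold() // return raw margins instead of 0/1 labels
      model
    }.toArray

  // Predict by picking the class whose model gives the largest margin.
  def predictOneVsAll(models: Array[SVMModel], features: Vector): Double =
    models.zipWithIndex.maxBy { case (m, _) => m.predict(features) }._2.toDouble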

On Sat, Jun 7, 2014 at 7:15 AM, littlebird cxp...@163.com wrote:
 Hi All,
   As we know, in MLlib the SVM is used for binary classification. I wonder
 how to train an SVM model for multiclass classification in MLlib. In addition,
 how can I apply a machine learning algorithm in Spark if the algorithm isn't
 included in MLlib? Thank you.


