Spark Standard Application to Test
Hello, I am preparing some tests to run on Spark in order to manipulate properties and check how the results vary. For this, I need a standard application in my environment, like the well-known Hadoop apps: TeraSort https://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html and especially Terrier http://terrier.org/, or something similar. I do not need the wordcount and grep applications because I have already used them. Can anyone suggest something? Thanks!
Re: MLlib - Show an element in RDD[(Int, Iterable[Array[Double]])]
I solved the question with this code:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("/opt/testAppSpark/data/height-weight.txt").map { line =>
  Vectors.dense(line.split(' ').map(_.toDouble))
}.cache()

val cluster = KMeans.train(data, 3, 20)

val vectorsAndClusterIdx = data.map { point =>
  val prediction = cluster.predict(point)
  (point.toString, prediction)
}

vectorsAndClusterIdx.collect()
MLlib - Show an element in RDD[(Int, Iterable[Array[Double]])]
Hi, I'm learning Spark and testing the Spark MLlib library with the K-means algorithm. So, I created a file height-weight.txt like this:

65.0 220.0
73.0 160.0
59.0 110.0
61.0 120.0
...

And the code (executed in spark-shell):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("/opt/testAppSpark/data/height-weight.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

val numCluster = 3
val numIterations = 30
val cluster = KMeans.train(parsedData, numCluster, numIterations)

val groups = data.map(_.split(' ').map(_.toDouble)).groupBy(arr => cluster.predict(Vectors.dense(arr)))
groups.collect

When I typed groups.collect, I got output like:

res29: Array[(Int, Iterable[Array[Double]])] = Array((0,CompactBuffer([D@12c6123d, [D@9d76c6c, [D@1e0f2b80, [D@75f0efea, [D@1d172824, [D@5b4c6267, [D@73d08704)), (2,CompactBuffer([D@7f505302, [D@7279e99a, [D@21d7b82d, [D@597ca3b6, [D@5e02fa0)), (1,CompactBuffer([D@4156b463, [D@235cf118, [D@2ad870cb, [D@67d53566, [D@5ea4f0cb, [D@1ebccff8, [D@7df9b28b, [D@1439044a)))

Typing groups at the command line I see:

res1: org.apache.spark.rdd.RDD[(Int, Iterable[Array[Double]])] = ShuffledRDD[28] at groupBy at <console>:24

How can I see the results? Thanks.
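For context, the [D@12c6123d entries are just the JVM's default toString for Array[Double], so the cluster assignments are there but unreadable. A minimal sketch of one way to print the actual values, reusing the groups RDD from this post:

// A sketch: render each Array[Double] through mkString so collect()
// shows the numbers instead of [D@... object references.
val readable = groups.map { case (clusterId, arrays) =>
  (clusterId, arrays.map(_.mkString("[", ", ", "]")))
}
readable.collect().foreach { case (clusterId, points) =>
  println(clusterId + " -> " + points.mkString("; "))
}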
Re: Spark metrics for ganglia
Thanks tsingfu, I used this configuration based on your post (with Ganglia in unicast mode):

# Enable GangliaSink for all instances
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=10.0.0.7
*.sink.ganglia.port=8649
*.sink.ganglia.period=15
*.sink.ganglia.unit=seconds
*.sink.ganglia.ttl=1
*.sink.ganglia.mode=unicast

But now I get the following error:

ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.GangliaSink cannot be instantialized
java.lang.ClassNotFoundException: org.apache.spark.metrics.sink.GangliaSink
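One way to confirm what is going on (a sketch; the assembly jar's exact path varies by build and version) is to check whether the sink class made it into the Spark assembly at all, since GangliaSink only exists in builds that include the spark-ganglia-lgpl module:

jar tf assembly/target/scala-2.10/spark-assembly-*.jar | grep GangliaSink

If grep finds nothing, the metrics configuration above is fine and the build itself is the problem.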
Re: Can not see any spark metrics on ganglia-web
I used the command below, because I'm using Spark 1.0.2 built with SBT, and it worked:

SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true SPARK_GANGLIA_LGPL=true sbt/sbt assembly
Re: Spark metrics for ganglia
Hello Samudrala, did you solve this issue with viewing metrics in Ganglia? I have the same problem. Thanks.
Re: A question about streaming throughput
Ok, I understand. But in both cases the data are on the same processing node.
A question about streaming throughput
Hi, I'm learning about Apache Spark Streaming and I'm doing some tests. Right now I have a modified version of the NetworkWordCount app that performs a reduceByKeyAndWindow with a 10-second window sliding every 5 seconds. I'm also measuring the rate of records per second like this:

words.foreachRDD(rdd => {
  val count = rdd.count()
  println("Current rate: " + (count / 1) + " records/second")
})

Then, on my computer with 4 cores and 8 GB (running local[4]) I get this average result:

Current rate: 130 000

Running locally with my computer as both master and worker I get:

Current rate: 25 000

And running on an Azure cloud instance with 4 cores and 7 GB, the result is:

Current rate: 10 000

I read the Spark Streaming paper http://www.eecs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf and the performance evaluation for a similar application was 250 000 records/second. To send data to the socket I'm using an application similar to this: http://apache-spark-user-list.1001560.n3.nabble.com/streaming-code-to-simulate-a-network-socket-data-source-td3431.html#a13814

So, can anyone suggest something to improve these rates? (I increased the executor memory and did not get better results.) Thanks!
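One small caveat with the measurement itself: foreachRDD fires once per batch, so dividing the count by the batch interval rather than by 1 gives the actual records per second. A hedged sketch of that adjustment, assuming words is the pre-window stream and a 5-second batch interval:

// A sketch: divide each batch's count by the batch interval in seconds
// so the printed figure is a true records-per-second rate.
val batchIntervalSeconds = 5
words.foreachRDD { rdd =>
  val count = rdd.count()
  println("Current rate: " + (count / batchIntervalSeconds) + " records/second")
}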
Re: Can not see any spark metrics on ganglia-web
Hi tsingfu, I want to see metrics in Ganglia too, but I don't understand this step:

./make-distribution.sh --tgz --skip-java-test -Phadoop-2.3 -Pyarn -Phive -Pspark-ganglia-lgpl

Are you enabling Hadoop, YARN, Hive AND Ganglia? What if I want to enable just Ganglia? Can you suggest something? Thanks!
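For what it's worth, the -P flags select Maven build profiles rather than installing Hadoop, YARN, or Hive themselves, so presumably a build that only needs the Ganglia sink can drop the other profiles (a sketch, assuming you don't run on YARN or use Hive):

./make-distribution.sh --tgz --skip-java-test -Pspark-ganglia-lgpl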
Re: Can not see any spark metrics on ganglia-web
Ok Krishna Sankar. In relation to this information on the Spark monitoring webpage:

For sbt users, set the SPARK_GANGLIA_LGPL environment variable before building. For Maven users, enable the -Pspark-ganglia-lgpl profile.

Do you know what I need to do to build this with sbt? Thanks.
Spark Monitoring with Ganglia
Hi, I need to monitor some aspects of my cluster, like network and resources. Ganglia looks like a good option for what I need. Then, I found out that Spark has support for Ganglia. On the Spark monitoring webpage there is this information: "To install the GangliaSink you'll need to perform a custom build of Spark." I found the directory extras/spark-ganglia-lgpl in my Spark tree, but I don't know how to install it. How can I install Ganglia to monitor a Spark cluster? How do I do this custom build? Thanks!
Re: Question About Submit Application
I'll do this test and reply with the result afterwards. Thank you, Marcelo.
Question About Submit Application
Hello, I'm learning about Spark Streaming and I'm really excited. Today I was testing packaging some apps and submitting them to a standalone cluster on my computer locally. That worked fine. Then I created a virtual machine with a bridged network and tried to submit the app to this VM from my local PC, but I got errors like these:

WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@spark-01:7077: akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@spark-01:7077]
14/09/24 16:28:48 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

Can I submit an application from another computer to a remote master with workers? Can anybody suggest something? Thanks a lot.
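For what it's worth, submitting from a machine outside the cluster is possible in standalone mode, but the workers must also be able to connect back to the driver running on your PC. A minimal sketch of configuring that in the application itself (the addresses and app name are illustrative, not from the original post):

import org.apache.spark.{SparkConf, SparkContext}

// A sketch: point the master URL at the VM, and make the driver advertise
// an address of the local PC that the VM's workers can reach back to.
val conf = new SparkConf()
  .setMaster("spark://spark-01:7077")
  .setAppName("SubmitTest")
  .set("spark.driver.host", "192.168.1.10")
val sc = new SparkContext(conf)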
Re: Spark Streaming Twitter Example Error
I solved this problem using the sbt-assembly SBT plugin. It's very good! Bye.
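For anyone landing here, a minimal sketch of what that setup can look like (the plugin version is illustrative and must match your sbt release):

// project/plugins.sbt — a sketch; check the sbt-assembly README for the
// version matching your sbt release
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

// build.sbt additions for sbt-assembly 0.11.x (newer versions differ)
import AssemblyKeys._

assemblySettings

Running sbt assembly then produces a fat jar under target/scala-2.10/ that bundles the twitter4j classes; marking the spark-core and spark-streaming dependencies as "provided" keeps Spark itself out of the jar.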
Re: Memory/Network Intensive Workload
Thank you for the suggestion! Bye.
Re: Question About Submit Application
One more piece of information: when I submitted the application from my local PC to my VM, the VM was both master and worker, and my local PC wasn't part of the cluster. Thanks.
Re: streaming: code to simulate a network socket data source
Hello Diana, how can I include this implementation in my code so that this task starts together with NetworkWordCount? In my case, I have a directory with several files. Then I include this line:

StreamingDataGenerator.streamingGenerator(NetPort, BytesSecond, DirFiles)

But the program stays in my loop over the files, and only afterwards returns to NetworkWordCount. Can you suggest something to start these tasks in parallel? Thanks!
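A minimal sketch of one way to do that, reusing the StreamingDataGenerator call from this post and assuming a StreamingContext named ssc: run the generator on its own thread so the file loop no longer blocks the streaming job.

// A sketch: the generator loops over files and blocks, so push it onto a
// background thread before starting the streaming computation.
val generatorThread = new Thread(new Runnable {
  def run(): Unit =
    StreamingDataGenerator.streamingGenerator(NetPort, BytesSecond, DirFiles)
})
generatorThread.setDaemon(true) // don't keep the JVM alive just for the generator
generatorThread.start()

ssc.start()
ssc.awaitTermination()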
Re: streaming: code to simulate a network socket data source
I ran this code separately, but the program blocks at this line:

val socket = listener.accept()

Do you have any suggestion? Thanks
Records - Input Byte
Hi, I was reading the Spark Streaming paper, Discretized Streams: Fault-Tolerant Streaming Computation at Scale. I read that the performance evaluation used 100-byte input records in the Grep and WordCount tests. I don't have much experience, and I'd like to know how I can control this value in my records (like words in an input file). Can anyone suggest something to get me started? Thanks!
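One hedged interpretation of "100-byte records" is that each input line is padded or truncated to exactly 100 bytes before being written or streamed. A minimal Scala sketch of that (the helper name is made up for illustration):

// A sketch: force every record to exactly recordSize bytes. Padding uses
// spaces; truncation assumes ASCII-ish input, so cutting at a byte
// boundary does not split a multi-byte character.
val recordSize = 100
def toFixedSize(line: String): String = {
  val bytes = line.getBytes("UTF-8")
  new String(bytes.padTo(recordSize, ' '.toByte).take(recordSize), "UTF-8")
}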
Spark Streaming Twitter Example Error
Hi! I'm beginning development with Spark Streaming, and I'm learning from the examples available in the Spark directory. There are several applications, and I want to make modifications. I can execute TwitterPopularTags normally with the command:

./bin/run-example TwitterPopularTags auth

So, I moved the source code to a separate folder with the structure ./src/main/scala/ containing the files:

- TwitterPopularTags
- TwitterUtils
- StreamingExamples
- TwitterInputDStream

But when I run the command:

./bin/spark-submit --class TwitterPopularTags --master local[4] /MY_DIR/TwitterTest/target/scala-2.10/simple-project_2.10-1.0.jar auth

I receive the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: twitter4j/auth/Authorization
at TwitterUtils$.createStream(TwitterUtils.scala:42)
at TwitterPopularTags$.main(TwitterPopularTags.scala:65)
at TwitterPopularTags.main(TwitterPopularTags.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: twitter4j.auth.Authorization
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 10 more

This is my sbt build file:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.2"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.2"

libraryDependencies += "org.twitter4j" % "twitter4j-core" % "3.0.3"

libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

Can anybody help me? Thanks a lot!
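As the reply earlier in this thread notes, building a fat jar with sbt-assembly is one fix. Another hedged workaround is to hand the twitter4j jars to spark-submit explicitly via --jars so they are on the runtime classpath (the jar paths below are illustrative):

./bin/spark-submit --class TwitterPopularTags --master local[4] \
  --jars /path/to/twitter4j-core-3.0.3.jar,/path/to/twitter4j-stream-3.0.3.jar \
  /MY_DIR/TwitterTest/target/scala-2.10/simple-project_2.10-1.0.jar auth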
Memory/Network Intensive Workload
Hello, I'm studying the Spark platform and I'd like to run experiments on its Spark Streaming extension. I figure that memory- and network-intensive workloads are good options. Can anyone suggest a few typical Spark Streaming workloads that are network/memory intensive? Other suggestions for good Spark Streaming workloads would be interesting too. Thanks!
Re: Interconnect benchmarking
Hi, according to the research paper below by Matei Zaharia, Spark's creator:

http://people.csail.mit.edu/matei/papers/2013/sosp_spark_streaming.pdf

he says on page 10 that Grep is network-bound due to the cost of replicating the input data to multiple nodes. So, I guess it can be a good initial recommendation, but I would like to know about other workloads too. Best regards.