Re: saveAsTextFile error
Hi Niko,

It looks like you are calling a method on DStream that does not exist. Check out the output operations on DStreams, in particular saveAsTextFiles: https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#output-operations-on-dstreams

Harold

On Fri, Nov 14, 2014 at 10:39 AM, Niko Gamulin niko.gamu...@gmail.com wrote:

Hi,

I tried to modify the NetworkWordCount example in order to save the output to a file. In NetworkWordCount.scala I replaced the line

    wordCounts.print()

with

    wordCounts.saveAsTextFile("/home/bart/rest_services/output.txt")

When I ran sbt/sbt package it returned the following error:

    [error] /home/bart/spark-1.1.0/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCountModified.scala:57: value saveAsTextFile is not a member of org.apache.spark.streaming.dstream.DStream[(String, Int)]
    [error] wordCounts.saveAsTextFile("/home/bart/rest_services/output.txt")
    [error]            ^
    [error] one error found
    [error] (examples/compile:compile) Compilation failed
    [error] Total time: 5 s, completed Nov 14, 2014 9:38:53 PM

Does anyone know why this error occurs? Is there any other way to save the results to a text file?

Regards,
Niko
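A minimal sketch of Harold's suggestion (untested; the path is Niko's, and wordCounts is the DStream[(String, Int)] from the example). DStream has saveAsTextFiles rather than saveAsTextFile; it takes a prefix and an optional suffix and writes one output directory per batch interval:

    // Writes directories named like /home/bart/rest_services/output-<batchTimeMs>.txt,
    // one per batch interval
    wordCounts.saveAsTextFiles("/home/bart/rest_services/output", "txt")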
Re: Can spark read and write to cassandra without HDFS?
Hi Kevin,

Yes, Spark can read and write to Cassandra without Hadoop. Have you seen this? https://github.com/datastax/spark-cassandra-connector

Harold

On Wed, Nov 12, 2014 at 9:28 PM, Kevin Burton bur...@spinn3r.com wrote:

We have all our data in Cassandra, so I'd prefer not to have to bring up Hadoop/HDFS, as that's just another thing that can break. But I'm reading that Spark requires a shared filesystem like HDFS or S3. Can I use Tachyon or something similarly simple as a shared filesystem?

--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
... or check out my Google+ profile https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
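A minimal sketch of what the connector gives you (host, keyspace "test", and table "kv" are placeholders; assumes the connector jar is on the classpath):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.cassandra.connection.host", "127.0.0.1")  // your Cassandra node
      .setAppName("CassandraNoHdfs")
    val sc = new SparkContext(conf)

    // Read a table straight from Cassandra -- no HDFS involved
    val rdd = sc.cassandraTable("test", "kv")

    // Write an RDD back to Cassandra
    sc.parallelize(Seq(("key1", 1)))
      .saveToCassandra("test", "kv", SomeColumns("key", "value"))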
Spark Streaming - Most popular Twitter Hashtags
Hi all,

I was just reading this nice documentation here: http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html

It ends with:

"Note that there are more efficient ways to get the top 10 hashtags. For example, instead of sorting the entire set of 5-minute counts (thereby incurring the cost of a data shuffle), one can get the top 10 hashtags in each partition, collect them together at the driver, and then find the top 10 hashtags among them. We leave this as an exercise for the reader to try."

I was just wondering if anyone had managed to do this and was willing to share an example :) This seems to be exactly the use case that will help me!

Thanks!

Harold
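A sketch of the exercise as described (untested against the tutorial's code; assumes counts is the RDD[(String, Int)] of (hashtag, count) pairs for the 5-minute window):

    import org.apache.spark.rdd.RDD

    def top10Hashtags(counts: RDD[(String, Int)]): Seq[(String, Int)] = {
      // Local top 10 inside each partition -- no shuffle
      val candidates = counts.mapPartitions { iter =>
        iter.toSeq.sortBy(-_._2).take(10).iterator
      }.collect()                        // at most 10 pairs per partition reach the driver
      candidates.sortBy(-_._2).take(10)  // global top 10 among the candidates
    }

For what it's worth, RDD.top does essentially this internally, so counts.top(10)(Ordering.by[(String, Int), Int](_._2)) should give the same result in one call.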
Re: Manipulating RDDs within a DStream
Thanks Lalit and Helena,

What I'd like to do is manipulate the values within a DStream like this:

    DStream.foreachRDD { rdd =>
      val arr = rdd.toArray
    }

I'd then like to insert results from the arr array back into Cassandra, after I've manipulated it. However, in all the examples I've seen, inserting into Cassandra looks something like:

    val collection = sc.parallelize(Seq(foo, bar))

where foo and bar could be elements of the arr array. So I would like to know how to insert into Cassandra at the worker level.

Best wishes,

Harold

On Thu, Oct 30, 2014 at 11:48 PM, lalit1303 la...@sigmoidanalytics.com wrote:

Hi,

Since the Cassandra object is not serializable, you can't open the connection at the driver level and access the object inside foreachRDD (i.e. at the worker level). You have to open the connection inside foreachRDD only, perform the operation, and then close the connection. For example:

    wordCounts.foreachRDD { rdd =>
      val arr = rdd.toArray
      // OPEN cassandra connection
      // store arr
      // CLOSE cassandra connection
    }

Thanks
- Lalit Yadav
la...@sigmoidanalytics.com

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Manipulating-RDDs-within-a-DStream-tp17740p17800.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
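A sketch of Lalit's pattern with the connection opened once per partition instead of once per batch or per record (createCassandraConnection and insert are hypothetical placeholder helpers, not connector API; the connector's own saveToCassandra handles all of this for you):

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn = createCassandraConnection()   // hypothetical helper; runs on the worker
        records.foreach(r => insert(conn, r))    // hypothetical per-record write
        conn.close()
      }
    }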
NonSerializable Exception in foreachRDD
Hi all,

In Spark Streaming, when I do foreachRDD on my DStreams, I get a NonSerializable exception when I try to do something like:

    DStream.foreachRDD { rdd =>
      sc.parallelize(Seq(("test", "blah")))
    }

Is there any way around that?

Thanks,

Harold
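One pattern that may help (a sketch, not a definitive fix): the function passed to foreachRDD runs on the driver, so an RDD can be created there; reaching the SparkContext through the RDD itself avoids capturing an outer sc reference whose enclosing object may not be serializable.

    dstream.foreachRDD { rdd =>
      // Runs on the driver; rdd.sparkContext is the context that created the stream
      val extra = rdd.sparkContext.parallelize(Seq(("test", "blah")))
      // ... use `extra` here ...
    }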
Re: Manipulating RDDs within a DStream
Hi,

Sorry, there's a typo there:

    val arr = rdd.toArray

Harold

On Thu, Oct 30, 2014 at 9:58 AM, Harold Nguyen har...@nexgate.com wrote:

Hi all,

I'd like to be able to modify values in a DStream and then send them off to an external source like Cassandra, but I keep getting serialization errors and am not sure how to use the correct design pattern. I was wondering if you could help me.

I'd like to be able to do the following:

    wordCounts.foreachRDD { rdd =>
      val arr = record.toArray
      ...
    }

I would then like to use arr to send data back to Cassandra, for instance like this:

    val collection = sc.parallelize(Seq(a.head._1, a.head._2))
    collection.saveToCassandra()

Or something like that. But as you know, I can't do this within foreachRDD, only at the driver level. How do I use the arr variable to do something like that?

Thanks for any help,

Harold
Manipulating RDDs within a DStream
Hi all,

I'd like to be able to modify values in a DStream and then send them off to an external source like Cassandra, but I keep getting serialization errors and am not sure how to use the correct design pattern. I was wondering if you could help me.

I'd like to be able to do the following:

    wordCounts.foreachRDD { rdd =>
      val arr = record.toArray
      ...
    }

I would then like to use arr to send data back to Cassandra, for instance like this:

    val collection = sc.parallelize(Seq(a.head._1, a.head._2))
    collection.saveToCassandra()

Or something like that. But as you know, I can't do this within foreachRDD, only at the driver level. How do I use the arr variable to do something like that?

Thanks for any help,

Harold
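A sketch of the driver-level pattern being asked about (assumes the connector is on the classpath and a test.kv table exists; the toUpperCase step is just an illustrative manipulation):

    import com.datastax.spark.connector._

    wordCounts.foreachRDD { rdd =>
      val arr = rdd.collect()   // toArray is the older, deprecated name for this
      val modified = arr.map { case (word, n) => (word.toUpperCase, n) }
      rdd.sparkContext.parallelize(modified)
        .saveToCassandra("test", "kv", SomeColumns("key", "value"))
    }

This only makes sense for small batches, since collect() pulls everything to the driver; for anything large, transform the RDD itself and save it directly without collecting.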
Spark Streaming from Kafka
Hi,

Just wondering if you've seen the following error when reading from Kafka:

    ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - java.lang.NoClassDefFoundError: scala/reflect/ClassManifest
        at kafka.utils.Log4jController$.<init>(Log4jController.scala:29)
        at kafka.utils.Log4jController$.<clinit>(Log4jController.scala)
        at kafka.utils.Logging$class.$init$(Logging.scala:29)
        at kafka.utils.VerifiableProperties.<init>(VerifiableProperties.scala:24)
        at kafka.consumer.ConsumerConfig.<init>(ConsumerConfig.scala:78)
        at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:96)
        at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
        at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
        at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
        at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
    Caused by: java.lang.ClassNotFoundException: scala.reflect.ClassManifest
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 18 more

Thanks,

Harold
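scala.reflect.ClassManifest is a Scala 2.9-era class, so this error usually points to a Scala version mismatch, for example a Kafka artifact built for Scala 2.9 ending up on a 2.10 classpath. A build.sbt sketch with consistent versions (version numbers illustrative):

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      // %% appends _2.10 automatically, keeping every artifact on the same Scala version
      "org.apache.spark" %% "spark-streaming"       % "1.1.0",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0"
    )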
Spark Streaming with Kinesis
Hi all,

I followed the guide here: http://spark.apache.org/docs/latest/streaming-kinesis-integration.html

But got this error:

    Exception in thread "main" java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider

Would you happen to know what dependency or jar is needed?

Harold
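A build.sbt sketch (artifact name as given in that integration guide; version illustrative): the kinesis-asl module pulls in the AWS SDK classes, including AWSCredentialsProvider.

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming"             % "1.1.0",
      // Not bundled in the Spark assembly because of its Amazon Software License
      "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.1.0"
    )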
Re: Spark Streaming with Kinesis
Hi again,

After working through several dependency issues, I finally got to this non-dependency error:

    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V

It looks very similar to this post: http://stackoverflow.com/questions/24788949/nosuchmethoderror-while-running-aws-s3-client-on-spark-while-javap-shows-otherwi

Since I'm a little new to everything, would someone be able to provide step-by-step guidance for that?

Harold

On Wed, Oct 29, 2014 at 9:22 AM, Harold Nguyen har...@nexgate.com wrote:

Hi all,

I followed the guide here: http://spark.apache.org/docs/latest/streaming-kinesis-integration.html

But got this error:

    Exception in thread "main" java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider

Would you happen to know what dependency or jar is needed?

Harold
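If it is the same problem as that Stack Overflow post, the two-argument DefaultClientConnectionOperator constructor (the one taking a DnsResolver) only exists in httpclient 4.2 and later, so an older httpclient pulled in by another dependency would explain the NoSuchMethodError. A build.sbt sketch (version illustrative):

    // Force a new-enough httpclient even if a transitive dependency asks for an older one
    dependencyOverrides += "org.apache.httpcomponents" % "httpclient" % "4.2.5"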
Convert DStream to String
Hi all,

How do I convert a DStream to a string? For instance, I want to be able to do:

    val myword = words.filter(word => word.startsWith("blah"))

and use myword in other places, like tacking it onto (key, value) pairs, like so:

    val pairs = words.map(word => (myword + "_" + word, 1))

Thanks for any help,

Harold
Re: Convert DStream to String
Hi Sean,

I'd just like to take the first word of every line and use it as a variable later. Is there a way to do that? Here's the gist of what I want to do:

    val lines = KafkaUtils.createStream(ssc, "localhost:2181", "test", Map("test" -> 10)).map(_._2)
    val words = lines.flatMap(_.split(" "))
    val acct = words.filter(word => word.startsWith("SECRETWORD"))
    val pairs = words.map(word => (acct + "_" + word, 1))

That is: take all lines coming into Kafka, and prefix each word with the 'acct' word. As an example, here is a line:

    hello world you are SECRETWORDthebest hello world

and it should produce:

    (SECRETWORDthebest_hello, 2), (SECRETWORDthebest_world, 2), (SECRETWORDthebest_you, 1), etc...

Harold

On Wed, Oct 29, 2014 at 3:36 PM, Sean Owen so...@cloudera.com wrote:

What would it mean to make a DStream into a String? It's inherently a sequence of things over time, each of which might be a string, but which are usually RDDs of things.

On Wed, Oct 29, 2014 at 11:15 PM, Harold Nguyen har...@nexgate.com wrote:

Hi all,

How do I convert a DStream to a string? For instance, I want to be able to do:

    val myword = words.filter(word => word.startsWith("blah"))

and use myword in other places, like tacking it onto (key, value) pairs, like so:

    val pairs = words.map(word => (myword + "_" + word, 1))

Thanks for any help,

Harold
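A sketch of one way to get that output without pulling a value out of the DStream: per Sean's point, the per-line work has to stay inside a transformation. This assumes the imports and the lines stream from the gist above; the "UNKNOWN" fallback for lines with no secret word is an assumption.

    val pairs = lines.flatMap { line =>
      val words = line.split(" ")
      // The marker word can appear anywhere in the line, as in the example above
      val acct = words.find(_.startsWith("SECRETWORD")).getOrElse("UNKNOWN")
      words.filterNot(_.startsWith("SECRETWORD")).map(w => (acct + "_" + w, 1))
    }.reduceByKey(_ + _)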
Saving to Cassandra from Spark Streaming
Hi all,

I'm having trouble troubleshooting this particular block of code for Spark Streaming and saving to Cassandra:

    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

    // -- Writing it to Cassandra
    wordCounts.saveToCassandra("test", "kv", SomeColumns("key", "value"), 1)

Could you tell me where I'm going wrong? Can I not call wordCounts.saveToCassandra? Here's the error:

    Exception in thread "main" java.lang.NoClassDefFoundError: com/datastax/spark/connector/mapper/ColumnMapper

Thanks,

Harold
Including jars in Spark-shell vs Spark-submit
Hi all,

The following works fine when submitting dependency jars through spark-shell:

    ./bin/spark-shell --master spark://ip-172-31-38-112:7077 --jars /home/ubuntu/spark-cassandra-connector/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar

But not through spark-submit:

    ./bin/spark-submit --class org.apache.spark.examples.streaming.CassandraSave --master spark://ip-172-31-38-112:7077 streaming-test/target/scala-2.10/simple-streaming_2.10-1.0.jar --jars local:///home/ubuntu/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar

Am I issuing the spark-submit command incorrectly? Each of the workers has the built jar (spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar) in its respective directory.

Thanks,

Harold
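A likely fix (sketch): spark-submit treats everything after the application jar as arguments to the application itself, so the --jars flag above is being handed to CassandraSave rather than parsed by spark-submit. Putting all flags before the application jar should behave like the spark-shell case:

    ./bin/spark-submit \
      --class org.apache.spark.examples.streaming.CassandraSave \
      --master spark://ip-172-31-38-112:7077 \
      --jars local:///home/ubuntu/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar \
      streaming-test/target/scala-2.10/simple-streaming_2.10-1.0.jar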
Spark Streaming into Cassandra - NoClass ColumnMapper
Hi Spark friends,

I'm trying to connect Spark Streaming to Cassandra by modifying the NetworkWordCount.scala streaming example, making as few changes as possible while having it insert data into Cassandra. Could you let me know if you see any errors?

I'm using the spark-cassandra-connector, and I receive this error when I submit my Spark jar:

==
Exception in thread "main" java.lang.NoClassDefFoundError: com/datastax/spark/connector/mapper/ColumnMapper
    at org.apache.spark.examples.streaming.CassandraWordCount.main(CassandraWordCount.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.datastax.spark.connector.mapper.ColumnMapper
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 8 more
===

And here is the class I'm using:

package org.apache.spark.examples.streaming

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.storage.StorageLevel
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

object CassandraWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "ec2-54-191-235-127.us-west-2.compute.amazonaws.com")
      .setAppName("CassandraWordCount")
    val sc = new SparkContext("spark://ip-172-31-38-112:7077", "test", conf)

    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sc, Seconds(1))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

    // -- Writing it to Cassandra
    wordCounts.saveToCassandra("test", "kv", SomeColumns("key", "value"))

    ssc.start()
    ssc.awaitTermination()
  }
}

===

Finally, here is my sbt build file:

==
name := "Simple Streaming"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.10" % "1.1.0",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-alpha3" withSources() withJavadoc(),
  "org.apache.spark" %% "spark-sql" % "1.1.0"
)
=

Any help would be appreciated! Thanks so much!

Harold
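One way to avoid the NoClassDefFoundError entirely (a sketch, not the only option) is to bundle the connector into the application jar with sbt-assembly, so spark-submit needs no extra --jars. The plugin coordinates below are illustrative; check the sbt-assembly README for the setup matching your sbt version:

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

Running sbt assembly then produces a fat jar that contains the com.datastax.spark.connector classes, which you submit as the application jar.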
Re: Spark Streaming into Cassandra - NoClass ColumnMapper
Hi again,

Not sure if this is the right thing to do, but I pulled down the latest spark-cassandra-connector, built the jars, and added the --jars flag to spark-submit (and added the jar to all workers). It looks like it gets a little further, but now I have the following error (with all file contents the same):

    14/10/28 04:43:44 INFO AppClient$ClientActor: Connecting to master spark://ip-172-31-38-112:7077...
    14/10/28 04:43:44 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
    Exception in thread "main" java.lang.NoSuchMethodError: com.datastax.spark.connector.streaming.DStreamFunctions.saveToCassandra(Ljava/lang/String;Ljava/lang/String;Lcom/datastax/spark/connector/SomeColumns;Lcom/datastax/spark/connector/writer/RowWriterFactory;)V
        at org.apache.spark.examples.streaming.CassandraWordCount$.main(CassandraWordCount.scala:52)
        at org.apache.spark.examples.streaming.CassandraWordCount.main(CassandraWordCount.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

On Mon, Oct 27, 2014 at 9:22 PM, Harold Nguyen har...@nexgate.com wrote:

Hi Spark friends,

I'm trying to connect Spark Streaming to Cassandra by modifying the NetworkWordCount.scala streaming example, making as few changes as possible while having it insert data into Cassandra. Could you let me know if you see any errors?

I'm using the spark-cassandra-connector, and I receive this error when I submit my Spark jar:

==
Exception in thread "main" java.lang.NoClassDefFoundError: com/datastax/spark/connector/mapper/ColumnMapper
    at org.apache.spark.examples.streaming.CassandraWordCount.main(CassandraWordCount.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.datastax.spark.connector.mapper.ColumnMapper
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 8 more
===

And here is the class I'm using:

package org.apache.spark.examples.streaming

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.storage.StorageLevel
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

object CassandraWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "ec2-54-191-235-127.us-west-2.compute.amazonaws.com")
      .setAppName("CassandraWordCount")
    val sc = new SparkContext("spark://ip-172-31-38-112:7077", "test", conf)

    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sc, Seconds(1))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

    // -- Writing it to Cassandra
    wordCounts.saveToCassandra("test", "kv", SomeColumns("key", "value"))

    ssc.start()
    ssc.awaitTermination()
  }
}

===

Finally, here is my sbt build file:

==
name := "Simple Streaming"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.10" % "1.1.0",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-alpha3"
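The NoSuchMethodError suggests a compile/runtime version mismatch rather than a missing jar: the class was compiled against connector 1.1.0-alpha3 (per the build file), while the assembly shipped to the workers is 1.2.0-SNAPSHOT, whose saveToCassandra signature evidently differs. A sketch of one fix, assuming the locally built connector was installed with sbt publishLocal so the snapshot is resolvable:

    // build.sbt: compile against the same connector version deployed on the workers
    libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0-SNAPSHOT"

Alternatively, ship the 1.1.0-alpha3 jar to the workers instead of the snapshot assembly, so that runtime matches what was compiled against.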