Re: saveAsTextFile error

2014-11-14 Thread Harold Nguyen
Hi Niko,

It looks like you are calling a method that does not exist on DStream.

Check out:
https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#output-operations-on-dstreams

for the method saveAsTextFiles
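
For instance (a minimal sketch, assuming wordCounts is a
DStream[(String, Int)]; saveAsTextFiles writes one directory of output
per batch, named from the prefix and suffix):

// each batch interval produces its own output directory
wordCounts.saveAsTextFiles("/home/bart/rest_services/output", "txt")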

Harold

On Fri, Nov 14, 2014 at 10:39 AM, Niko Gamulin niko.gamu...@gmail.com
wrote:

 Hi,

 I tried to modify the NetworkWordCount example in order to save the output to
 a file.

 In NetworkWordCount.scala I replaced the line

 wordCounts.print()
 with
 wordCounts.saveAsTextFile("/home/bart/rest_services/output.txt")

 When I ran sbt/sbt package it returned the following error:

 [error]
 /home/bart/spark-1.1.0/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCountModified.scala:57:
 value saveAsTextFile is not a member of
 org.apache.spark.streaming.dstream.DStream[(String, Int)]
 [error]
 wordCounts.saveAsTextFile("/home/bart/rest_services/output.txt")
 [error]^
 [error] one error found
 [error] (examples/compile:compile) Compilation failed
 [error] Total time: 5 s, completed Nov 14, 2014 9:38:53 PM

 Does anyone know why this error occurs?
 Is there any other way to save the results to a text file?

 Regards,

 Niko


Re: Can spark read and write to cassandra without HDFS?

2014-11-12 Thread Harold Nguyen
Hi Kevin,

Yes, Spark can read and write to Cassandra without Hadoop. Have you seen
this:

https://github.com/datastax/spark-cassandra-connector
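
For example (a minimal sketch, assuming a local Cassandra node and a
test.kv table with key/value columns; the master, app name, and host
are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf(true)
  .setMaster("local[2]")                              // placeholder master
  .setAppName("CassandraDemo")                        // placeholder name
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// read a Cassandra table as an RDD -- no HDFS involved
val rdd = sc.cassandraTable("test", "kv")

// write an RDD back to Cassandra
sc.parallelize(Seq(("key1", 1)))
  .saveToCassandra("test", "kv", SomeColumns("key", "value"))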

Harold

On Wed, Nov 12, 2014 at 9:28 PM, Kevin Burton bur...@spinn3r.com wrote:

 We have all our data in Cassandra so I’d prefer to not have to bring up
 Hadoop/HDFS as that’s just another thing that can break.

 But I’m reading that Spark requires a shared filesystem like HDFS or S3…
 Can I use Tachyon, or something else simple, for a shared filesystem?


 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com




Spark Streaming - Most popular Twitter Hashtags

2014-11-03 Thread Harold Nguyen
Hi all,

I was just reading this nice documentation here:
http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html

And got to the end of it, which says:

Note that there are more efficient ways to get the top 10 hashtags. For
example, instead of sorting the entire set of 5-minute counts (thereby
incurring the cost of a data shuffle), one can get the top 10 hashtags in
each partition, collect them together at the driver, and then find the top
10 hashtags among them. We leave this as an exercise for the reader to try.
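
A minimal sketch of that per-partition approach (assuming hashtagCounts
is an RDD[(String, Int)] of (hashtag, count) pairs for the window):

val top10 = hashtagCounts
  .mapPartitions(iter => iter.toList.sortBy(-_._2).take(10).iterator) // top 10 per partition
  .collect()     // at most 10 candidates per partition reach the driver
  .sortBy(-_._2) // merge the candidates at the driver
  .take(10)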

I was just wondering if anyone had managed to do this, and would be willing
to share an example :) This seems to be exactly the use case that will help
me!

Thanks!

Harold


Re: Manipulating RDDs within a DStream

2014-10-31 Thread Harold Nguyen
Thanks Lalit, and Helena,

What I'd like to do is manipulate the values within a DStream like this:

DStream.foreachRDD( rdd => {

   val arr = record.toArray

})

I'd then like to be able to insert results from arr back into
Cassandra, after I've manipulated the arr array.
However, for all the examples I've seen, inserting into Cassandra is
something like:

val collection = sc.parallelize(Seq("foo", "bar"))

Where "foo" and "bar" could be elements in the arr array. So I would like
to know how to insert into Cassandra at the worker level.

Best wishes,

Harold

On Thu, Oct 30, 2014 at 11:48 PM, lalit1303 la...@sigmoidanalytics.com
wrote:

 Hi,

 Since the Cassandra object is not serializable, you can't open the
 connection at the driver level and access the object inside foreachRDD
 (i.e. at the worker level). You have to open the connection inside
 foreachRDD only, perform the operation, and then close the connection.

 For example:

  wordCounts.foreachRDD( rdd => {

val arr = rdd.toArray

OPEN cassandra connection
store arr
CLOSE cassandra connection

 })
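
 One concrete way to fill in the OPEN/CLOSE steps with the
 spark-cassandra-connector (a sketch; the keyspace, table, and CQL
 statement are placeholders):

 import com.datastax.spark.connector.cql.CassandraConnector

 val connector = CassandraConnector(sc.getConf) // serializable connection factory
 wordCounts.foreachRDD { rdd =>
   rdd.foreachPartition { partition =>
     connector.withSessionDo { session =>       // one session per partition
       partition.foreach { case (word, count) =>
         session.execute(
           s"INSERT INTO test.kv (key, value) VALUES ('$word', $count)")
       }
     }
   }
 }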


 Thanks



 -
 Lalit Yadav
 la...@sigmoidanalytics.com
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Manipulating-RDDs-within-a-DStream-tp17740p17800.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




NonSerializable Exception in foreachRDD

2014-10-30 Thread Harold Nguyen
Hi all,

In Spark Streaming, when I do foreachRDD on my DStreams, I get a
NonSerializable exception when I try to do something like:

DStream.foreachRDD( rdd => {
  sc.parallelize(Seq(("test", "blah")))
})

Is there any way around that?
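
One pattern that avoids this (a sketch; it assumes the exception comes
from a worker-side closure capturing the non-serializable SparkContext):
the function given to foreachRDD runs on the driver, so sc is usable
there, but closures shipped to the executors must not reference it:

dstream.foreachRDD { rdd =>
  // driver side: the context is available here, e.g. via rdd.context
  val n = rdd.count() // example of a driver-side action
  rdd.foreachPartition { partition =>
    // worker side: this closure is serialized and shipped;
    // it must not capture sc or other driver-only objects
    partition.foreach(println)
  }
}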

Thanks,

Harold


Re: Manipulating RDDs within a DStream

2014-10-30 Thread Harold Nguyen
Hi,

Sorry, there's a typo there:

val arr = rdd.toArray


Harold

On Thu, Oct 30, 2014 at 9:58 AM, Harold Nguyen har...@nexgate.com wrote:

 Hi all,

 I'd like to be able to modify values in a DStream, and then send it off to
 an external source like Cassandra, but I keep getting Serialization errors
 and am not sure how to use the correct design pattern. I was wondering if
 you could help me.

 I'd like to be able to do the following:

  wordCounts.foreachRDD( rdd => {

val arr = record.toArray
...

 })

 I would like to use arr to send results back to Cassandra, for instance,

 like this:

 val collection = sc.parallelize(Seq(a.head._1, a.head._2))
 collection.saveToCassandra()

 Or something like that, but as you know, I can't do this within the
 foreachRDD but only at the driver level. How do I use the arr variable
 to do something like that?

 Thanks for any help,

 Harold




Manipulating RDDs within a DStream

2014-10-30 Thread Harold Nguyen
Hi all,

I'd like to be able to modify values in a DStream, and then send it off to
an external source like Cassandra, but I keep getting Serialization errors
and am not sure how to use the correct design pattern. I was wondering if
you could help me.

I'd like to be able to do the following:

 wordCounts.foreachRDD( rdd => {

   val arr = record.toArray
   ...

})

I would like to use arr to send results back to Cassandra, for instance,

like this:

val collection = sc.parallelize(Seq(a.head._1, a.head._2))
collection.saveToCassandra()

Or something like that, but as you know, I can't do this within the
foreachRDD but only at the driver level. How do I use the arr variable
to do something like that?

Thanks for any help,

Harold


Spark Streaming from Kafka

2014-10-29 Thread Harold Nguyen
Hi,

Just wondering if you've seen the following error when reading from Kafka:

ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting
receiver 0 - java.lang.NoClassDefFoundError: scala/reflect/ClassManifest
at kafka.utils.Log4jController$.<init>(Log4jController.scala:29)
at kafka.utils.Log4jController$.<clinit>(Log4jController.scala)
at kafka.utils.Logging$class.$init$(Logging.scala:29)
at kafka.utils.VerifiableProperties.<init>(VerifiableProperties.scala:24)
at kafka.consumer.ConsumerConfig.<init>(ConsumerConfig.scala:78)
at
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:96)
at
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
at
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.ClassNotFoundException: scala.reflect.ClassManifest
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 18 more
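
(For later readers: scala.reflect.ClassManifest exists as a class only
up to Scala 2.9 -- in 2.10 it is just a deprecated alias -- so this
usually means a Kafka jar built against Scala 2.9 is on a Scala 2.10
classpath. A guess at the fix for an sbt build on Spark 1.1.0:

libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.1.0"

which brings in a Kafka artifact compiled for Scala 2.10.)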

Thanks,

Harold


Spark Streaming with Kinesis

2014-10-29 Thread Harold Nguyen
Hi all,

I followed the guide here:
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html

But got this error:
Exception in thread "main" java.lang.NoClassDefFoundError:
com/amazonaws/auth/AWSCredentialsProvider

Would you happen to know what dependency or jar is needed?
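
(A guess at the missing piece for an sbt build: the Kinesis receiver
lives in a separate Spark module that pulls in the AWS SDK, so something
like

libraryDependencies += "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.1.0"

with the version matched to the Spark build, may be what's needed.)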

Harold


Re: Spark Streaming with Kinesis

2014-10-29 Thread Harold Nguyen
Hi again,

After getting through several dependencies, I finally got to this
non-dependency-type error:

Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V

It looks very similar to this post:

http://stackoverflow.com/questions/24788949/nosuchmethoderror-while-running-aws-s3-client-on-spark-while-javap-shows-otherwi

Since I'm a little new to everything, would someone be able to provide
step-by-step guidance for that?
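
(One hedged guess: the two-argument DefaultClientConnectionOperator
constructor in the error was added in httpclient 4.2, so an older
httpclient is probably shadowing the newer one on the classpath.
Forcing a 4.2+ version in sbt, e.g.

libraryDependencies += "org.apache.httpcomponents" % "httpclient" % "4.2.6"

is one thing to try; the exact version is an assumption.)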

Harold

On Wed, Oct 29, 2014 at 9:22 AM, Harold Nguyen har...@nexgate.com wrote:

 Hi all,

 I followed the guide here:
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html

 But got this error:
 Exception in thread "main" java.lang.NoClassDefFoundError:
 com/amazonaws/auth/AWSCredentialsProvider

 Would you happen to know what dependency or jar is needed?

 Harold




Convert DStream to String

2014-10-29 Thread Harold Nguyen
Hi all,

How do I convert a DStream to a string?

For instance, I want to be able to:

val myword = words.filter(word => word.startsWith("blah"))

And use myword in other places, like tacking it onto (key, value) pairs,
like so:

val pairs = words.map(word => (myword + "_" + word, 1))

Thanks for any help,

Harold


Re: Convert DStream to String

2014-10-29 Thread Harold Nguyen
Hi Sean,

I'd just like to take the first word of every line, and use it as a
variable for later. Is there a way to do that?

Here's the gist of what I want to do:

  val lines = KafkaUtils.createStream(ssc, "localhost:2181", "test",
Map("test" -> 10)).map(_._2)
  val words = lines.flatMap(_.split(" "))
  val acct = words.filter(word => word.startsWith("SECRETWORD"))
  val pairs = words.map(word => (acct + "_" + word, 1))

Take all lines coming into Kafka, and prefix each word with the 'acct' value.

As an example, here is a line:

hello world you are SECRETWORDthebest hello world

And it should do this:

(SECRETWORDthebest_hello, 2), (SECRETWORDthebest_world, 2),
(SECRETWORDthebest_you, 1), etc...
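
(One way to get this per line without a cross-DStream variable -- a
sketch, assuming reduceByKey is in scope via
import org.apache.spark.streaming.StreamingContext._ -- is to extract
the SECRETWORD token inside the same transformation:

val pairs = lines.flatMap { line =>
  val ws = line.split(" ")
  ws.find(_.startsWith("SECRETWORD")) match {
    case Some(acct) => ws.filter(_ != acct).map(w => (acct + "_" + w, 1))
    case None       => Array.empty[(String, Int)]  // no token: drop the line
  }
}.reduceByKey(_ + _)

On the sample line above this yields (SECRETWORDthebest_hello, 2),
(SECRETWORDthebest_world, 2), (SECRETWORDthebest_you, 1), and so on.)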

Harold


On Wed, Oct 29, 2014 at 3:36 PM, Sean Owen so...@cloudera.com wrote:

 What would it mean to make a DStream into a String? It's inherently a
 sequence of things over time, each of which might be a string but
 which are usually RDDs of things.

 On Wed, Oct 29, 2014 at 11:15 PM, Harold Nguyen har...@nexgate.com
 wrote:
  Hi all,
 
  How do I convert a DStream to a string?
 
  For instance, I want to be able to:
 
  val myword = words.filter(word => word.startsWith("blah"))
 
  And use myword in other places, like tacking it onto (key, value)
 pairs,
  like so:
 
  val pairs = words.map(word => (myword + "_" + word, 1))
 
  Thanks for any help,
 
  Harold
 
 
 
 



Saving to Cassandra from Spark Streaming

2014-10-28 Thread Harold Nguyen
Hi all,

I'm having trouble troubleshooting this particular block of code for Spark
Streaming and saving to Cassandra:

val lines = ssc.socketTextStream(args(0), args(1).toInt,
StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

//-- Writing it to Cassandra
wordCounts.saveToCassandra("test", "kv", SomeColumns("key", "value"), 1)

Could you tell me where I'm going wrong? Can I not call
wordCounts.saveToCassandra?

Here's the error:

Exception in thread "main" java.lang.NoClassDefFoundError:
com/datastax/spark/connector/mapper/ColumnMapper

Thanks,

Harold


Including jars in Spark-shell vs Spark-submit

2014-10-28 Thread Harold Nguyen
Hi all,

The following works fine when submitting dependency jars through
Spark-Shell:

./bin/spark-shell --master spark://ip-172-31-38-112:7077 --jars
/home/ubuntu/spark-cassandra-connector/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar

But not through spark-submit:

 ./bin/spark-submit --class
org.apache.spark.examples.streaming.CassandraSave --master
spark://ip-172-31-38-112:7077
streaming-test/target/scala-2.10/simple-streaming_2.10-1.0.jar --jars
local:///home/ubuntu/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar

Am I issuing the spark-submit command incorrectly? Each of the workers has
that built jar in their respective directories
(spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar)
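
For reference: spark-submit treats everything after the application jar
as arguments to the application itself, so the trailing --jars above is
never seen by spark-submit. Moving it before the application jar, with
the same paths, should work:

./bin/spark-submit --class org.apache.spark.examples.streaming.CassandraSave \
  --master spark://ip-172-31-38-112:7077 \
  --jars local:///home/ubuntu/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar \
  streaming-test/target/scala-2.10/simple-streaming_2.10-1.0.jar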

Thanks,

Harold


Spark Streaming into Cassandra - NoClass ColumnMapper

2014-10-27 Thread Harold Nguyen
Hi Spark friends,

I'm trying to connect Spark Streaming to Cassandra by modifying the
NetworkWordCount.scala streaming example, making as few changes as
possible while having it insert data into Cassandra.

Could you let me know if you see any errors?

I'm using the spark-cassandra-connector, and I receive this error when I
submit my spark jar:

==

Exception in thread "main" java.lang.NoClassDefFoundError:
com/datastax/spark/connector/mapper/ColumnMapper
at
org.apache.spark.examples.streaming.CassandraWordCount.main(CassandraWordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException:
com.datastax.spark.connector.mapper.ColumnMapper
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 8 more


===

And here is the class I'm using:

package org.apache.spark.examples.streaming

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.storage.StorageLevel

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

object CassandraWordCount {
  def main(args: Array[String]) {
if (args.length < 2) {
  System.err.println("Usage: NetworkWordCount <hostname> <port>")
  System.exit(1)
}

val conf = new SparkConf(true).set("spark.cassandra.connection.host",
"ec2-54-191-235-127.us-west-2.compute.amazonaws.com"
).setAppName("CassandraWordCount")
val sc = new SparkContext("spark://ip-172-31-38-112:7077", "test", conf)

StreamingExamples.setStreamingLogLevels()

// Create the context with a 1 second batch size
val ssc = new StreamingContext(sc, Seconds(1))

// Create a socket stream on target ip:port and count the
// words in input stream of \n delimited text (eg. generated by 'nc')
// Note that no duplication in storage level only for running locally.
// Replication necessary in distributed scenario for fault tolerance.
val lines = ssc.socketTextStream(args(0), args(1).toInt,
StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

//-- Writing it to Cassandra
wordCounts.saveToCassandra("test", "kv", SomeColumns("key", "value"))

ssc.start()
ssc.awaitTermination()
  }
}

===

Finally, here is my sbt build file:

==

name := "Simple Streaming"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.10" % "1.1.0",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-alpha3"
withSources() withJavadoc(),
  "org.apache.spark" %% "spark-sql" % "1.1.0"
)


=

Any help would be appreciated! Thanks so much!

Harold


Re: Spark Streaming into Cassandra - NoClass ColumnMapper

2014-10-27 Thread Harold Nguyen
Hi again,

Not sure if this is the right thing to do, but I pulled down the latest
spark-cassandra-connector, built the jars, and added the --jars flag to
spark-submit (and added the jar to all workers).

It looks like it moves a little further, but now I have the following error
(with all file content the same):


14/10/28 04:43:44 INFO AppClient$ClientActor: Connecting to master
spark://ip-172-31-38-112:7077...
14/10/28 04:43:44 INFO SparkDeploySchedulerBackend: SchedulerBackend is
ready for scheduling beginning after reached minRegisteredResourcesRatio:
0.0
Exception in thread "main" java.lang.NoSuchMethodError:
com.datastax.spark.connector.streaming.DStreamFunctions.saveToCassandra(Ljava/lang/String;Ljava/lang/String;Lcom/datastax/spark/connector/SomeColumns;Lcom/datastax/spark/connector/writer/RowWriterFactory;)V
at
org.apache.spark.examples.streaming.CassandraWordCount$.main(CassandraWordCount.scala:52)
at
org.apache.spark.examples.streaming.CassandraWordCount.main(CassandraWordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
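
(A possible cause, reading the two messages together: the job was
compiled against connector 1.1.0-alpha3 but is now running against the
freshly built 1.2.0-SNAPSHOT assembly, whose saveToCassandra signature
differs. Keeping the sbt dependency and the deployed assembly at the
same version, e.g.

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-alpha3"

together with the matching assembly jar on --jars, should avoid the
NoSuchMethodError.)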

On Mon, Oct 27, 2014 at 9:22 PM, Harold Nguyen har...@nexgate.com wrote:

 Hi Spark friends,

 I'm trying to connect Spark Streaming to Cassandra by modifying the
 NetworkWordCount.scala streaming example, making as few changes as
 possible while having it insert data into Cassandra.

 Could you let me know if you see any errors?

 I'm using the spark-cassandra-connector, and I receive this error when I
 submit my spark jar:

 ==

 Exception in thread "main" java.lang.NoClassDefFoundError:
 com/datastax/spark/connector/mapper/ColumnMapper
 at
 org.apache.spark.examples.streaming.CassandraWordCount.main(CassandraWordCount.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException:
 com.datastax.spark.connector.mapper.ColumnMapper
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 8 more


 ===

 And here is the class I'm using:

 package org.apache.spark.examples.streaming

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming.StreamingContext._
 import org.apache.spark.storage.StorageLevel

 import com.datastax.spark.connector._
 import com.datastax.spark.connector.streaming._

 object CassandraWordCount {
   def main(args: Array[String]) {
 if (args.length < 2) {
   System.err.println("Usage: NetworkWordCount <hostname> <port>")
   System.exit(1)
 }

 val conf = new SparkConf(true).set("spark.cassandra.connection.host",
 "ec2-54-191-235-127.us-west-2.compute.amazonaws.com"
 ).setAppName("CassandraWordCount")
 val sc = new SparkContext("spark://ip-172-31-38-112:7077", "test",
 conf)

 StreamingExamples.setStreamingLogLevels()

 // Create the context with a 1 second batch size
 val ssc = new StreamingContext(sc, Seconds(1))

 // Create a socket stream on target ip:port and count the
 // words in input stream of \n delimited text (eg. generated by 'nc')
 // Note that no duplication in storage level only for running locally.
 // Replication necessary in distributed scenario for fault tolerance.
 val lines = ssc.socketTextStream(args(0), args(1).toInt,
 StorageLevel.MEMORY_AND_DISK_SER)
 val words = lines.flatMap(_.split(" "))
 val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

 //-- Writing it to Cassandra
 wordCounts.saveToCassandra("test", "kv", SomeColumns("key", "value"))

 ssc.start()
 ssc.awaitTermination()
   }
 }

 ===

 Finally, here is my sbt build file:

 ==

 name := "Simple Streaming"

 version := "1.0"

 scalaVersion := "2.10.4"

 libraryDependencies ++= Seq(
   "org.apache.spark" % "spark-streaming_2.10" % "1.1.0",
   "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-alpha3"