Re: Does Spark Streaming calculate during a batch?

2014-11-13 Thread Michael Campbell
On Thu, Nov 13, 2014 at 11:02 AM, Sean Owen  wrote:

> Yes. Data is collected for 5 minutes, then processing starts at the
> end. The result may be an arbitrary function of the data in the
> interval, so the interval has to finish before computation can start.
>

Thanks everyone.
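
For reference, the interval Sean describes is fixed when the StreamingContext is built, and output operations run once per completed batch. A minimal sketch of that shape, assuming the Spark 1.x streaming API (the socket source, host, master, and app name are placeholders, not the original poster's job):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream ops on Spark < 1.3

    val conf = new SparkConf().setAppName("FiveMinuteBatches").setMaster("local[2]") // placeholder master
    // The batch interval is fixed here, when the context is built.
    val ssc = new StreamingContext(conf, Minutes(5))

    val lines = ssc.socketTextStream("somehost", 9999) // placeholder source
    // These transformations describe what to do with each completed batch; the actual
    // work (including the reduceByKey shuffle) runs only when the 5-minute interval
    // closes, which is why CPU usage stays low during collection and spikes at the end.
    lines.map(line => (line, 1L)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()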


Does Spark Streaming calculate during a batch?

2014-11-13 Thread Michael Campbell
I was running a proof of concept for my company with spark streaming, and
the conclusion I came to is that spark collects data for the
batch-duration, THEN starts the data-pipeline calculations.

My batch size was 5 minutes, and the CPUs were all but idle for those 5
minutes; then, when the interval was up, they would spike for a while,
presumably doing the calculations.

Is this presumption true, or is it running the data through the calculation
pipeline before the batch is up?

What could lead to the periodic CPU spike - I had a reduceByKey, so was it
doing that only after all the batch data was in?

Thanks


Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2014-10-23 Thread Michael Campbell
Can you list what your fix was so others can benefit?

On Wed, Oct 22, 2014 at 8:15 PM, arthur.hk.c...@gmail.com <
arthur.hk.c...@gmail.com> wrote:

> Hi,
>
> I have managed to resolve it; it was caused by a wrong setting. Please ignore this.
>
> Regards
> Arthur
>
> On 23 Oct, 2014, at 5:14 am, arthur.hk.c...@gmail.com <
> arthur.hk.c...@gmail.com> wrote:
>
>
> 14/10/23 05:09:04 WARN ConnectionManager: All connections not cleaned up
>
>
>


Re: Submission to cluster fails (Spark SQL; NoSuchMethodError on SchemaRDD)

2014-10-17 Thread Michael Campbell
For posterity's sake, I solved this.  The problem was that the Cloudera cluster
I was submitting to was running Spark 1.0, while I was compiling against the
latest 1.1 release.  Downgrading my build to 1.0 got me past this.
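
For anyone hitting the same NoSuchMethodError: the usual fix is to build against the same Spark version the cluster is running. A sketch of what that looks like in build.sbt, assuming sbt and the 1.0.x line from this thread (marking Spark as "provided" keeps it out of the uber jar, since the cluster supplies it):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-sql"  % "1.0.0" % "provided"
    )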

On Tue, Oct 14, 2014 at 6:08 PM, Michael Campbell <
michael.campb...@gmail.com> wrote:

> Hey all, I'm trying a very basic spark SQL job and apologies as I'm new to
> a lot of this, but I'm getting this failure:
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> org.apache.spark.sql.SchemaRDD.take(I)[Lorg/apache/spark/sql/catalyst/expressions/Row;
>
> I've tried a variety of uber-jar creation, but it always comes down to
> this.  Right now my jar has *NO* dependencies on other jars (other than
> Spark itself), and my "uber jar" contains essentially only the .class file
> of my own code.
>
> My own code simply reads a parquet file and does a count(*) on the
> contents. I'm sure this is very basic, but I'm at a loss.
>
> Thoughts and/or debugging tips welcome.
>
> Here's my code, which I call from the "main" object which passes in the
> context, Parquet file name, and some arbitrary sql, (which for this is just
> "select count(*) from flows").
>
> I have run the equivalent successfully in spark-shell in the cluster.
>
>   def readParquetFile(sparkCtx: SparkContext, dataFileName: String, sql: String) = {
>     val sqlContext = new SQLContext(sparkCtx)
>     val parquetFile = sqlContext.parquetFile(dataFileName)
>
>     parquetFile.registerAsTable("flows")
>
>     println(s"About to run $sql")
>
>     val start = System.nanoTime()
>     val countRDD = sqlContext.sql(sql)
>     val rows: Array[Row] = countRDD.take(1)  // DIES HERE
>     val stop = System.nanoTime()
>
>     println(s"result: ${rows(0)}")
>     println(s"Query took ${(stop - start) / 1e9} seconds.")
>     println(s"Query was $sql")
>   }
>
>
>
>


Spark Bug? job fails to run when given options on spark-submit (but starts and fails without)

2014-10-16 Thread Michael Campbell
TL;DR - a spark SQL job fails with an OOM (Out of heap space) error.  If
given "--executor-memory" values, it won't even start.  Even (!) if the
values given ARE THE SAME AS THE DEFAULT.



Without --executor-memory:

14/10/16 17:14:58 INFO TaskSetManager: Serialized task 1.0:64 as 14710
bytes in 1 ms
14/10/16 17:14:58 WARN TaskSetManager: Lost TID 26 (task 1.0:25)
14/10/16 17:14:58 WARN TaskSetManager: Loss was due to
java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
at
parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:609)
at
parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:360)
...


USING --executor-memory (WITH ANY VALUE), even "1G" which is the default:

Parsed arguments:
  master  spark://:7077
  deployMode  null
  executorMemory  1G
...

System properties:
spark.executor.memory -> 1G
spark.eventLog.enabled -> true
...

14/10/16 17:14:23 INFO TaskSchedulerImpl: Adding task set 1.0 with 678 tasks
14/10/16 17:14:38 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient memory



Spark 1.0.0.  Is this a bug?
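
No resolution is recorded here, so the following is only a guess at where to look. The warning itself points at worker memory, and (if memory serves) the spark.executor.memory default in Spark 1.0 is 512m rather than 1G, so an explicit --executor-memory 1G asks for more per executor than the no-flag run did; if the workers advertise less than that, no executors are granted and the job sits at "Initial job has not accepted any resources". A hedged sketch of setting the value programmatically, below whatever the workers report on the master's web UI (the value is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ExecutorMemorySketch")
      // Must fit within the memory each worker advertises to the master,
      // otherwise no executors are granted to the application.
      .set("spark.executor.memory", "512m")

    val sc = new SparkContext(conf)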


Re: jsonRDD: NoSuchMethodError

2014-10-15 Thread Michael Campbell
How did you resolve it?

On Tue, Jul 15, 2014 at 3:50 AM, SK  wrote:

> The problem is resolved. Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/jsonRDD-NoSuchMethodError-tp9688p9742.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Submission to cluster fails (Spark SQL; NoSuchMethodError on SchemaRDD)

2014-10-14 Thread Michael Campbell
Hey all, I'm trying a very basic spark SQL job and apologies as I'm new to
a lot of this, but I'm getting this failure:

Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.spark.sql.SchemaRDD.take(I)[Lorg/apache/spark/sql/catalyst/expressions/Row;

I've tried a variety of uber-jar creation, but it always comes down to
this.  Right now my jar has *NO* dependencies on other jars (other than
Spark itself), and my "uber jar" contains essentially only the .class file
of my own code.

My own code simply reads a parquet file and does a count(*) on the
contents. I'm sure this is very basic, but I'm at a loss.

Thoughts and/or debugging tips welcome.

Here's my code, which I call from the "main" object, which passes in the
context, the Parquet file name, and some arbitrary SQL (which for this is just
"select count(*) from flows").

I have run the equivalent successfully in spark-shell in the cluster.

  def readParquetFile(sparkCtx: SparkContext, dataFileName: String, sql: String) = {
    val sqlContext = new SQLContext(sparkCtx)
    val parquetFile = sqlContext.parquetFile(dataFileName)

    parquetFile.registerAsTable("flows")

    println(s"About to run $sql")

    val start = System.nanoTime()
    val countRDD = sqlContext.sql(sql)
    val rows: Array[Row] = countRDD.take(1)  // DIES HERE
    val stop = System.nanoTime()

    println(s"result: ${rows(0)}")
    println(s"Query took ${(stop - start) / 1e9} seconds.")
    println(s"Query was $sql")
  }


Re: can't print DStream after reduce

2014-07-15 Thread Michael Campbell
I think you typo'd the jira id; it should be
https://issues.apache.org/jira/browse/SPARK-2475  "Check whether #cores >
#receivers in local mode"


On Mon, Jul 14, 2014 at 3:57 PM, Tathagata Das 
wrote:

> The problem is not really for local[1] or local. The problem arises when
> there are more input streams than there are cores.
> But I agree, for people who are just beginning to use it by running it
> locally, there should be a check addressing this.
>
> I made a JIRA for this.
> https://issues.apache.org/jira/browse/SPARK-2464
>
> TD
>
>
> On Sun, Jul 13, 2014 at 4:26 PM, Sean Owen  wrote:
>
>> How about a PR that rejects a context configured for local or local[1]?
>> As I understand it is not intended to work and has bitten several people.
>> On Jul 14, 2014 12:24 AM, "Michael Campbell" 
>> wrote:
>>
>>> This almost had me not using Spark; I couldn't get any output.  It is
>>> not at all obvious what's going on here to the layman (and to the best of
>>> my knowledge, not documented anywhere), but now you know you'll be able to
>>> answer this question for the numerous people that will also have it.
>>>
>>>
>>> On Sun, Jul 13, 2014 at 5:24 PM, Walrus theCat 
>>> wrote:
>>>
>>>> Great success!
>>>>
>>>> I was able to get output to the driver console by changing the
>>>> construction of the Streaming Spark Context from:
>>>>
>>>>  val ssc = new StreamingContext("local" /**TODO change once a cluster
>>>> is up **/,
>>>> "AppName", Seconds(1))
>>>>
>>>>
>>>> to:
>>>>
>>>> val ssc = new StreamingContext("local[2]" /**TODO change once a cluster
>>>> is up **/,
>>>> "AppName", Seconds(1))
>>>>
>>>>
>>>> I found something that tipped me off that this might work by digging
>>>> through this mailing list.
>>>>
>>>>
>>>> On Sun, Jul 13, 2014 at 2:00 PM, Walrus theCat 
>>>> wrote:
>>>>
>>>>> More strange behavior:
>>>>>
>>>>> lines.foreachRDD(x => println(x.first)) // works
>>>>> lines.foreachRDD(x => println((x.count,x.first))) // no output is
>>>>> printed to driver console
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Jul 13, 2014 at 11:47 AM, Walrus theCat <
>>>>> walrusthe...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Thanks for your interest.
>>>>>>
>>>>>> lines.foreachRDD(x => println(x.count))
>>>>>>
>>>>>>  And I got 0 every once in a while (which I think is strange, because
>>>>>> lines.print prints the input I'm giving it over the socket.)
>>>>>>
>>>>>>
>>>>>> When I tried:
>>>>>>
>>>>>> lines.map(_->1).reduceByKey(_+_).foreachRDD(x => println(x.count))
>>>>>>
>>>>>> I got no count.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>> On Sun, Jul 13, 2014 at 11:34 AM, Tathagata Das <
>>>>>> tathagata.das1...@gmail.com> wrote:
>>>>>>
>>>>>>> Try doing DStream.foreachRDD and then printing the RDD count and
>>>>>>> further inspecting the RDD.
>>>>>>>  On Jul 13, 2014 1:03 AM, "Walrus theCat" 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have a DStream that works just fine when I say:
>>>>>>>>
>>>>>>>> dstream.print
>>>>>>>>
>>>>>>>> If I say:
>>>>>>>>
>>>>>>>> dstream.map(_,1).print
>>>>>>>>
>>>>>>>> that works, too.  However, if I do the following:
>>>>>>>>
>>>>>>>> dstream.reduce{case(x,y) => x}.print
>>>>>>>>
>>>>>>>> I don't get anything on my console.  What's going on?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>


Re: can't print DStream after reduce

2014-07-15 Thread Michael Campbell
Thank you Tathagata.  This had me going for far too long.


On Mon, Jul 14, 2014 at 3:57 PM, Tathagata Das 
wrote:

> The problem is not really for local[1] or local. The problem arises when
> there are more input streams than there are cores.
> But I agree, for people who are just beginning to use it by running it
> locally, there should be a check addressing this.
>
> I made a JIRA for this.
> https://issues.apache.org/jira/browse/SPARK-2464
>
> TD
>
>
> On Sun, Jul 13, 2014 at 4:26 PM, Sean Owen  wrote:
>
>> How about a PR that rejects a context configured for local or local[1]?
>> As I understand it is not intended to work and has bitten several people.
>> On Jul 14, 2014 12:24 AM, "Michael Campbell" 
>> wrote:
>>
>>> This almost had me not using Spark; I couldn't get any output.  It is
>>> not at all obvious what's going on here to the layman (and to the best of
>>> my knowledge, not documented anywhere), but now you know you'll be able to
>>> answer this question for the numerous people that will also have it.
>>>
>>>
>>> On Sun, Jul 13, 2014 at 5:24 PM, Walrus theCat 
>>> wrote:
>>>
>>>> Great success!
>>>>
>>>> I was able to get output to the driver console by changing the
>>>> construction of the Streaming Spark Context from:
>>>>
>>>>  val ssc = new StreamingContext("local" /**TODO change once a cluster
>>>> is up **/,
>>>> "AppName", Seconds(1))
>>>>
>>>>
>>>> to:
>>>>
>>>> val ssc = new StreamingContext("local[2]" /**TODO change once a cluster
>>>> is up **/,
>>>> "AppName", Seconds(1))
>>>>
>>>>
>>>> I found something that tipped me off that this might work by digging
>>>> through this mailing list.
>>>>
>>>>
>>>> On Sun, Jul 13, 2014 at 2:00 PM, Walrus theCat 
>>>> wrote:
>>>>
>>>>> More strange behavior:
>>>>>
>>>>> lines.foreachRDD(x => println(x.first)) // works
>>>>> lines.foreachRDD(x => println((x.count,x.first))) // no output is
>>>>> printed to driver console
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Jul 13, 2014 at 11:47 AM, Walrus theCat <
>>>>> walrusthe...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Thanks for your interest.
>>>>>>
>>>>>> lines.foreachRDD(x => println(x.count))
>>>>>>
>>>>>>  And I got 0 every once in a while (which I think is strange, because
>>>>>> lines.print prints the input I'm giving it over the socket.)
>>>>>>
>>>>>>
>>>>>> When I tried:
>>>>>>
>>>>>> lines.map(_->1).reduceByKey(_+_).foreachRDD(x => println(x.count))
>>>>>>
>>>>>> I got no count.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>> On Sun, Jul 13, 2014 at 11:34 AM, Tathagata Das <
>>>>>> tathagata.das1...@gmail.com> wrote:
>>>>>>
>>>>>>> Try doing DStream.foreachRDD and then printing the RDD count and
>>>>>>> further inspecting the RDD.
>>>>>>>  On Jul 13, 2014 1:03 AM, "Walrus theCat" 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have a DStream that works just fine when I say:
>>>>>>>>
>>>>>>>> dstream.print
>>>>>>>>
>>>>>>>> If I say:
>>>>>>>>
>>>>>>>> dstream.map(_,1).print
>>>>>>>>
>>>>>>>> that works, too.  However, if I do the following:
>>>>>>>>
>>>>>>>> dstream.reduce{case(x,y) => x}.print
>>>>>>>>
>>>>>>>> I don't get anything on my console.  What's going on?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>


Re: not getting output from socket connection

2014-07-13 Thread Michael Campbell
Make sure you use "local[n]" (where n > 1) in your context setup too, (if
you're running locally), or you won't get output.
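
For example, using the constructor form that shows up elsewhere in these threads (app name and batch interval are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // "local[2]": one core for the socket receiver, at least one left for processing.
    val ssc = new StreamingContext("local[2]", "AppName", Seconds(1))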


On Sat, Jul 12, 2014 at 11:36 PM, Walrus theCat 
wrote:

> Thanks!
>
> I thought it would get "passed through" netcat, but given your email, I
> was able to follow this tutorial and get it to work:
>
> http://docs.oracle.com/javase/tutorial/networking/sockets/clientServer.html
>
>
>
>
> On Fri, Jul 11, 2014 at 1:31 PM, Sean Owen  wrote:
>
>> netcat is listening for a connection on port . It is echoing what
>> you type to its console to anything that connects to  and reads.
>> That is what Spark streaming does.
>>
>> If you yourself connect to  and write, nothing happens except that
>> netcat echoes it. This does not cause Spark to somehow get that data.
>> nc is only echoing input from the console.
>>
>> On Fri, Jul 11, 2014 at 9:25 PM, Walrus theCat 
>> wrote:
>> > Hi,
>> >
>> > I have a java application that is outputting a string every second.  I'm
>> > running the wordcount example that comes with Spark 1.0, and running nc
>> -lk
>> > . When I type words into the terminal running netcat, I get counts.
>> > However, when I write the String onto a socket on port , I don't get
>> > counts.  I can see the strings showing up in the netcat terminal, but no
>> > counts from Spark.  If I paste in the string, I get counts.
>> >
>> > Any ideas?
>> >
>> > Thanks
>>
>
>
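
A minimal sketch of the kind of one-way server the linked Oracle tutorial covers, written in Scala over java.net so that Spark's socketTextStream has something to connect to and read newline-terminated records from (the port, message text, and object name are placeholders):

    import java.io.PrintWriter
    import java.net.ServerSocket

    object LineServer {
      def main(args: Array[String]): Unit = {
        val server = new ServerSocket(9999)
        while (true) {
          val socket = server.accept()                             // wait for Spark's receiver to connect
          val out = new PrintWriter(socket.getOutputStream, true)  // autoflush each line
          (1 to 100).foreach { i =>
            out.println(s"message number $i")                      // one line per record
            Thread.sleep(1000)
          }
          socket.close()
        }
      }
    }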


Re: can't print DStream after reduce

2014-07-13 Thread Michael Campbell
This almost had me not using Spark; I couldn't get any output.  It is not
at all obvious what's going on here to the layman (and to the best of my
knowledge, not documented anywhere), but now you know you'll be able to
answer this question for the numerous people that will also have it.


On Sun, Jul 13, 2014 at 5:24 PM, Walrus theCat 
wrote:

> Great success!
>
> I was able to get output to the driver console by changing the
> construction of the Streaming Spark Context from:
>
>  val ssc = new StreamingContext("local" /**TODO change once a cluster is
> up **/,
> "AppName", Seconds(1))
>
>
> to:
>
> val ssc = new StreamingContext("local[2]" /**TODO change once a cluster is
> up **/,
> "AppName", Seconds(1))
>
>
> I found something that tipped me off that this might work by digging
> through this mailing list.
>
>
> On Sun, Jul 13, 2014 at 2:00 PM, Walrus theCat 
> wrote:
>
>> More strange behavior:
>>
>> lines.foreachRDD(x => println(x.first)) // works
>> lines.foreachRDD(x => println((x.count,x.first))) // no output is printed
>> to driver console
>>
>>
>>
>>
>> On Sun, Jul 13, 2014 at 11:47 AM, Walrus theCat 
>> wrote:
>>
>>>
>>> Thanks for your interest.
>>>
>>> lines.foreachRDD(x => println(x.count))
>>>
>>>  And I got 0 every once in a while (which I think is strange, because
>>> lines.print prints the input I'm giving it over the socket.)
>>>
>>>
>>> When I tried:
>>>
>>> lines.map(_->1).reduceByKey(_+_).foreachRDD(x => println(x.count))
>>>
>>> I got no count.
>>>
>>> Thanks
>>>
>>>
>>> On Sun, Jul 13, 2014 at 11:34 AM, Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 Try doing DStream.foreachRDD and then printing the RDD count and
 further inspecting the RDD.
  On Jul 13, 2014 1:03 AM, "Walrus theCat" 
 wrote:

> Hi,
>
> I have a DStream that works just fine when I say:
>
> dstream.print
>
> If I say:
>
> dstream.map(_,1).print
>
> that works, too.  However, if I do the following:
>
> dstream.reduce{case(x,y) => x}.print
>
> I don't get anything on my console.  What's going on?
>
> Thanks
>

>>>
>>
>


Re: HELP!? Re: Streaming trouble (reduceByKey causes all printing to stop)

2014-06-12 Thread Michael Campbell
I don't know if this matters, but I'm looking at the web UI that Spark puts
up, and under the streaming tab I see:


   - Started at: Thu Jun 12 11:42:10 EDT 2014
   - Time since start: 6 minutes 3 seconds
   - Network receivers: 1
   - Batch interval: 5 seconds
   - Processed batches: 0
   - Waiting batches: 1




Why would a batch be "waiting" for far longer than my batch time of 5 seconds?


On Thu, Jun 12, 2014 at 10:18 AM, Michael Campbell <
michael.campb...@gmail.com> wrote:

> And... it's NOT working.
>
> Here's the code:
>
> val bytes = kafkaStream.map({ case (key, messageBytes) => messageBytes }) // Map to just get the bytes part out...
> val things = bytes.flatMap(bytesArrayToThings) // convert to a thing
> val srcDestinations = things.map(thing =>
>   (ipToString(thing.getSourceIp), Set(ipToString(thing.getDestinationIp))))
> // up to this point works fine.
>
> // this fails to print
> val srcDestinationSets = srcDestinations.reduceByKey(
>   (exist: Set[String], addl: Set[String]) => exist ++ addl)
>
>
> What it does...
>
> From a kafka message containing many "things", convert the message to an
> array of said "things", flatMap them out to a stream of 1 "thing" at a
> time, pull out and make a Tuple of a (SourceIP, DestinationIP).
>
> ALL THAT WORKS.  If I do a "srcDestinations.print()" I get output like the
> following, every 5 seconds, which is my batch size.
>
> ---
> Time: 140258200 ms
> ---
> (10.30.51.216,Set(10.20.1.1))
> (10.20.11.3,Set(10.10.61.98))
> (10.20.11.3,Set(10.10.61.95))
> ...
>
>
>
> What I want is a SET of (sourceIP -> Set(all the destination Ips))  Using
> a set because as you can see above, the same source may have the same
> destination multiple times and I want to eliminate dupes on the destination
> side.
>
> When I call the reduceByKey() method, I never get any output.  When I do a
> "srcDestinationSets.print()"  NOTHING EVER PRINTS.  Ever.  Never.
>
> What am I doing wrong?  (The same happens for "reduceByKeyAndWindow(...,
> Seconds(5))".)
>
> I'm sure this is something I've done, but I cannot figure out what it was.
>
> Help, please?
>
>
>


HELP!? Re: Streaming trouble (reduceByKey causes all printing to stop)

2014-06-12 Thread Michael Campbell
And... it's NOT working.

Here's the code:

val bytes = kafkaStream.map({ case (key, messageBytes) => messageBytes }) // Map to just get the bytes part out...
val things = bytes.flatMap(bytesArrayToThings) // convert to a thing
val srcDestinations = things.map(thing =>
  (ipToString(thing.getSourceIp), Set(ipToString(thing.getDestinationIp))))
// up to this point works fine.

// this fails to print
val srcDestinationSets = srcDestinations.reduceByKey(
  (exist: Set[String], addl: Set[String]) => exist ++ addl)


What it does...

From a kafka message containing many "things", convert the message to an
array of said "things", flatMap them out to a stream of 1 "thing" at a
time, pull out and make a Tuple of a (SourceIP, DestinationIP).

ALL THAT WORKS.  If I do a "srcDestinations.print()" I get output like the
following, every 5 seconds, which is my batch size.

---
Time: 140258200 ms
---
(10.30.51.216,Set(10.20.1.1))
(10.20.11.3,Set(10.10.61.98))
(10.20.11.3,Set(10.10.61.95))
...



What I want is a SET of (sourceIP -> Set(all the destination IPs)).  I'm using a
set because, as you can see above, the same source may have the same
destination multiple times and I want to eliminate dupes on the destination
side.

When I call the reduceByKey() method, I never get any output.  When I do a
"srcDestinationSets.print()"  NOTHING EVER PRINTS.  Ever.  Never.

What am I doing wrong?  (The same happens for "reduceByKeyAndWindow(...,
Seconds(5))".)

I'm sure this is something I've done, but I cannot figure out what it was.

Help, please?
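
No direct answer to this one appears in the archive; it may or may not be the same local-mode receiver/cores issue resolved in the "can't print DStream after reduce" thread elsewhere in this archive. One way to separate the transformation from the Kafka/receiver setup while debugging is to drive the same reduceByKey-over-sets shape from a queueStream; a self-contained sketch, with all names and sample data as placeholders:

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream ops on Spark < 1.3

    val conf = new SparkConf().setMaster("local[2]").setAppName("ReduceByKeySetsSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Stand-in for the (sourceIp, Set(destinationIp)) pairs coming off the Kafka stream.
    val rddQueue = new mutable.Queue[RDD[(String, Set[String])]]()
    rddQueue += ssc.sparkContext.parallelize(Seq(
      ("10.30.51.216", Set("10.20.1.1")),
      ("10.20.11.3",   Set("10.10.61.98")),
      ("10.20.11.3",   Set("10.10.61.95"))
    ))

    val srcDestinations = ssc.queueStream(rddQueue)
    // Union the destination sets per source IP; Set ++ Set drops duplicates.
    val srcDestinationSets = srcDestinations.reduceByKey(_ ++ _)
    srcDestinationSets.print()

    ssc.start()
    ssc.awaitTermination()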


Kafka client - specify offsets?

2014-06-11 Thread Michael Campbell
Is there a way in the Apache Spark Kafka Utils to specify an offset to
start reading?  Specifically, from the start of the queue, or failing that,
a specific point?
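
No reply is recorded here. For the "start of the queue" half of the question, one hedged sketch, assuming Spark 1.x with Kafka 0.8 and the receiver-based KafkaUtils.createStream overload that accepts raw consumer properties: setting the high-level consumer's auto.offset.reset to "smallest" makes a consumer group with no committed offsets begin at the earliest retained message (ssc, the hosts, topic, and group id are placeholders; reading from an arbitrary specific offset isn't covered by this consumer):

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "zookeeper.connect" -> "zkhost:2181",    // placeholder
      "group.id"          -> "my-new-group",   // a group with no committed offsets yet
      "auto.offset.reset" -> "smallest"        // start from the earliest retained offset
    )

    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)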


Re: Having trouble with streaming (updateStateByKey)

2014-06-11 Thread Michael Campbell
I rearranged my code to use reduceByKey, which I think is working.  I also
don't think the problem was the updateStateByKey call but something else;
unfortunately I changed a lot while looking for this issue, so I'm not sure
what the actual fix was, but I think it's working now.


On Wed, Jun 11, 2014 at 1:47 PM, Michael Campbell <
michael.campb...@gmail.com> wrote:

> I'm having a little trouble getting an "updateStateByKey()" call to work;
> was wondering if anyone could help.
>
> In my chain of calls from getting Kafka messages out of the queue to
> converting the message to a set of "things", then pulling out 2 attributes
> of those things to a Tuple2, everything works.
>
> So what I end up with is about a 1 second dump of things like this (this
> is crufted up data, but it's basically 2 IPV6 addresses...)
>
> ---
> Time: 1402507839000 ms
> ---
> (:::a14:b03,:::a0a:2bd4)
> (:::a14:b03,:::a0a:2bd4)
> (:::a0a:25a7,:::a14:b03)
> (:::a14:b03,:::a0a:2483)
> (:::a14:b03,:::a0a:2480)
> (:::a0a:2d96,:::a14:b03)
> (:::a0a:abb5,:::a14:100)
> (:::a0a:dcd7,:::a14:28)
> (:::a14:28,:::a0a:da94)
> (:::a14:b03,:::a0a:2def)
> ...
>
>
> This works ok.
>
> The problem is when I add a call to updateStateByKey - the streaming app
> runs and runs and runs and never outputs anything.  When I debug, I can't
> confirm that my state update passed-in function is ever actually getting
> called.
>
> Indeed I have breakpoints and println statements in my updateFunc and it
> LOOKS like it's never getting called.  I can confirm that the
> updateStateByKey function IS getting called (via it stopping on a
> breakpoint).
>
> Is there something obvious I'm missing?
>


Having trouble with streaming (updateStateByKey)

2014-06-11 Thread Michael Campbell
I'm having a little trouble getting an "updateStateByKey()" call to work;
was wondering if anyone could help.

In my chain of calls from getting Kafka messages out of the queue to
converting the message to a set of "things", then pulling out 2 attributes
of those things to a Tuple2, everything works.

So what I end up with is about a 1 second dump of things like this (this is
crufted up data, but it's basically 2 IPV6 addresses...)

---
Time: 1402507839000 ms
---
(:::a14:b03,:::a0a:2bd4)
(:::a14:b03,:::a0a:2bd4)
(:::a0a:25a7,:::a14:b03)
(:::a14:b03,:::a0a:2483)
(:::a14:b03,:::a0a:2480)
(:::a0a:2d96,:::a14:b03)
(:::a0a:abb5,:::a14:100)
(:::a0a:dcd7,:::a14:28)
(:::a14:28,:::a0a:da94)
(:::a14:b03,:::a0a:2def)
...


This works ok.

The problem is that when I add a call to updateStateByKey, the streaming app
runs and runs and runs and never outputs anything.  When I debug, I can't
confirm that the update function I pass in is ever actually getting
called.

Indeed I have breakpoints and println statements in my updateFunc and it
LOOKS like it's never getting called.  I can confirm that the
updateStateByKey function IS getting called (via it stopping on a
breakpoint).

Is there something obvious I'm missing?
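
For reference, a minimal sketch of the updateStateByKey pattern under discussion, assuming the Spark 1.x API and a pairs DStream of (sourceIp, destinationIp) like the one described above (the checkpoint path and names are placeholders). One easy-to-miss requirement: stateful operators need a checkpoint directory set on the context, or the job refuses to start:

    import org.apache.spark.streaming.StreamingContext._  // pair-DStream ops on Spark < 1.3

    ssc.checkpoint("/tmp/streaming-checkpoint")  // required by stateful operators

    // Called per key per batch: newValues holds this batch's values for the key,
    // state holds whatever was returned for that key in earlier batches.
    def updateFunc(newValues: Seq[String], state: Option[Set[String]]): Option[Set[String]] =
      Some(state.getOrElse(Set.empty[String]) ++ newValues)

    val destinationsBySource = pairs.updateStateByKey(updateFunc _)
    destinationsBySource.print()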


Re: New user streaming question

2014-06-07 Thread Michael Campbell
Thanks all - I still don't know what the underlying problem is, but I KIND
OF got it working by dumping my random-words stuff to a file and pointing
spark streaming to that.  So it's not "Streaming", as such, but I got
output.

More investigation to follow =)
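
For reference, the file-based variant mentioned here looks roughly like this (a sketch; ssc is an existing StreamingContext and the directory is a placeholder). textFileStream watches a directory and only picks up files that appear, e.g. are moved in atomically, after the job has started:

    import org.apache.spark.streaming.StreamingContext._  // pair-DStream ops on Spark < 1.3

    // Each newly appearing file in the directory becomes part of the next batch.
    val lines = ssc.textFileStream("/tmp/streaming-input")
    lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()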


On Sat, Jun 7, 2014 at 8:22 AM, Gino Bustelo  wrote:

> I would make sure that your workers are running. It is very difficult to
> tell from the console dribble if you just have no data or the workers just
> disassociated from masters.
>
> Gino B.
>
> On Jun 6, 2014, at 11:32 PM, Jeremy Lee 
> wrote:
>
> Yup, when it's running, DStream.print() will print out a timestamped block
> for every time step, even if the block is empty. (for v1.0.0, which I have
> running in the other window)
>
> If you're not getting that, I'd guess the stream hasn't started up
> properly.
>
>
> On Sat, Jun 7, 2014 at 11:50 AM, Michael Campbell <
> michael.campb...@gmail.com> wrote:
>
>> I've been playing with spark and streaming and have a question on stream
>> outputs.  The symptom is I don't get any.
>>
>> I have run spark-shell and all does as I expect, but when I run the
>> word-count example with streaming, it *works* in that things happen and
>> there are no errors, but I never get any output.
>>
>> Am I understanding how it is supposed to work correctly?  Is the
>> Dstream.print() method supposed to print the output for every (micro)batch
>> of the streamed data?  If that's the case, I'm not seeing it.
>>
>> I'm using the "netcat" example and the StreamingContext uses the network
>> to read words, but as I said, nothing comes out.
>>
>> I tried changing the .print() to .saveAsTextFiles(), and I AM getting a
>> file, but nothing is in it other than a "_temporary" subdir.
>>
>> I'm sure I'm confused here, but not sure where.  Help?
>>
>
>
>
> --
> Jeremy Lee  BCompSci(Hons)
>   The Unorthodox Engineers
>
>


New user streaming question

2014-06-06 Thread Michael Campbell
I've been playing with spark and streaming and have a question on stream
outputs.  The symptom is I don't get any.

I have run spark-shell and all does as I expect, but when I run the
word-count example with streaming, it *works* in that things happen and
there are no errors, but I never get any output.

Am I understanding how it is supposed to work correctly?  Is the
Dstream.print() method supposed to print the output for every (micro)batch
of the streamed data?  If that's the case, I'm not seeing it.

I'm using the "netcat" example and the StreamingContext uses the network to
read words, but as I said, nothing comes out.

I tried changing the .print() to .saveAsTextFiles(), and I AM getting a
file, but nothing is in it other than a "_temporary" subdir.

I'm sure I'm confused here, but not sure where.  Help?
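
No fix is recorded in this thread itself, but the other threads in this archive ("can't print DStream after reduce", "not getting output from socket connection") point at the usual local-mode culprit: the master must be local[n] with n >= 2 so the socket receiver does not occupy the only core. A sketch of the netcat word count with that applied (host and port are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream ops on Spark < 1.3

    // local[2]: one core for the socket receiver, one for the batch processing.
    val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()  // prints a timestamped block per batch, even when the batch is empty

    ssc.start()
    ssc.awaitTermination()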