Re: Streaming example stops outputting (Java, Kafka at least)

2014-05-30 Thread Nan Zhu
Hi, Sean

I ran into the same problem,

but when I changed MASTER="local" to MASTER="local[2]",

everything went back to normal.

I hadn't gotten a chance to ask about it here yet.

Best,

--  
Nan Zhu
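
The difference between the two master strings comes down to thread count. As a rough sketch (this is a hypothetical helper, not Spark's actual master-string parser): "local" gives one worker thread, "local[N]" gives N, and a receiver-based stream needs at least one more thread than it has receivers:

```scala
// Hypothetical sketch (not Spark's actual parser) of how a master string
// like "local" vs "local[2]" maps to available worker threads, and why a
// receiver-based stream needs at least two.
object MasterThreads {
  // "local" -> 1 thread; "local[N]" -> N threads; anything else -> None here
  def threadCount(master: String): Option[Int] = master match {
    case "local" => Some(1)
    case s if s.startsWith("local[") && s.endsWith("]") =>
      scala.util.Try(s.stripPrefix("local[").stripSuffix("]").toInt).toOption
    case _ => None
  }

  // Each receiver occupies one thread full-time, so processing needs one more.
  def enoughForStreaming(master: String, numReceivers: Int = 1): Boolean =
    threadCount(master).exists(_ >= numReceivers + 1)
}
```

Under this model, "local" (one thread, one receiver) leaves nothing free to run the print() jobs, which matches the behavior Sean describes.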


On Friday, May 30, 2014 at 9:09 AM, Sean Owen wrote:

 Guys, I'm struggling to debug some strange behavior in a simple
 Streaming + Java + Kafka example -- in fact, a simplified version of
 JavaKafkaWordCount that just calls print() on a sequence of
 messages.
  
 Data is flowing, but it only appears to work for a few periods --
 sometimes 0 -- before ceasing to call any actions. Sorry for lots of
 log posting but it may illustrate to someone who knows this better
 what is happening:
  
  
  
 Key action in the logs seems to be as follows -- it works a few times:
  
 ...
 2014-05-30 13:53:50 INFO ReceiverTracker:58 - Stream 0 received 0 blocks
 2014-05-30 13:53:50 INFO JobScheduler:58 - Added jobs for time 140145443 ms
 ---
 Time: 140145443 ms
 ---
  
 2014-05-30 13:53:50 INFO JobScheduler:58 - Starting job streaming job
 140145443 ms.0 from job set of time 140145443 ms
 2014-05-30 13:53:50 INFO JobScheduler:58 - Finished job streaming job
 140145443 ms.0 from job set of time 140145443 ms
 2014-05-30 13:53:50 INFO JobScheduler:58 - Total delay: 0.004 s for
 time 140145443 ms (execution: 0.000 s)
 2014-05-30 13:53:50 INFO MappedRDD:58 - Removing RDD 2 from persistence list
 2014-05-30 13:53:50 INFO BlockManager:58 - Removing RDD 2
 2014-05-30 13:53:50 INFO BlockRDD:58 - Removing RDD 1 from persistence list
 2014-05-30 13:53:50 INFO BlockManager:58 - Removing RDD 1
 2014-05-30 13:53:50 INFO KafkaInputDStream:58 - Removing blocks of
 RDD BlockRDD[1] at BlockRDD at ReceiverInputDStream.scala:69 of time
 140145443 ms
 2014-05-30 13:54:00 INFO ReceiverTracker:58 - Stream 0 received 0 blocks
 2014-05-30 13:54:00 INFO JobScheduler:58 - Added jobs for time 140145444 ms
 ...
  
  
 Then works with some additional, different output in the logs -- here
 you see output is flowing too:
  
 ...
 2014-05-30 13:54:20 INFO ReceiverTracker:58 - Stream 0 received 2 blocks
 2014-05-30 13:54:20 INFO JobScheduler:58 - Added jobs for time 140145446 ms
 2014-05-30 13:54:20 INFO JobScheduler:58 - Starting job streaming job
 140145446 ms.0 from job set of time 140145446 ms
 2014-05-30 13:54:20 INFO SparkContext:58 - Starting job: take at
 DStream.scala:593
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Got job 1 (take at
 DStream.scala:593) with 1 output partitions (allowLocal=true)
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Final stage: Stage 1(take
 at DStream.scala:593)
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Parents of final stage: List()
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Missing parents: List()
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Computing the requested
 partition locally
 2014-05-30 13:54:20 INFO BlockManager:58 - Found block
 input-0-1401454458400 locally
 2014-05-30 13:54:20 INFO SparkContext:58 - Job finished: take at
 DStream.scala:593, took 0.007007 s
 2014-05-30 13:54:20 INFO SparkContext:58 - Starting job: take at
 DStream.scala:593
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Got job 2 (take at
 DStream.scala:593) with 1 output partitions (allowLocal=true)
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Final stage: Stage 2(take
 at DStream.scala:593)
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Parents of final stage: List()
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Missing parents: List()
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Computing the requested
 partition locally
 2014-05-30 13:54:20 INFO BlockManager:58 - Found block
 input-0-1401454459400 locally
 2014-05-30 13:54:20 INFO SparkContext:58 - Job finished: take at
 DStream.scala:593, took 0.002217 s
 ---
 Time: 140145446 ms
 ---
 99,true,-0.11342268416043325
 17,false,1.6732879882133793
 ...
  
  
 Then keeps repeating the following with no more evidence that the
 print() action is being called:
  
 ...
 2014-05-30 13:54:20 INFO JobScheduler:58 - Finished job streaming job
 140145446 ms.0 from job set of time 140145446 ms
 2014-05-30 13:54:20 INFO MappedRDD:58 - Removing RDD 8 from persistence list
 2014-05-30 13:54:20 INFO JobScheduler:58 - Total delay: 0.019 s for
 time 140145446 ms (execution: 0.015 s)
 2014-05-30 13:54:20 INFO BlockManager:58 - Removing RDD 8
 2014-05-30 13:54:20 INFO BlockRDD:58 - Removing RDD 7 from persistence list
 2014-05-30 13:54:20 INFO BlockManager:58 - Removing RDD 7
 2014-05-30 13:54:20 INFO KafkaInputDStream:58 - Removing blocks of
 RDD BlockRDD[7] at BlockRDD at ReceiverInputDStream.scala:69 of time
 140145446 ms
 2014-05-30 13:54:20 INFO MemoryStore:58 - ensureFreeSpace(100) called
 with curMem=201, maxMem=2290719129
 2014-05-30 13:54:20 INFO MemoryStore:58 - Block input-0-1401454460400
 

Re: Streaming example stops outputting (Java, Kafka at least)

2014-05-30 Thread Sean Owen
Thanks Nan, that does appear to fix it. I was using "local". Can
anyone say whether that's to be expected, or whether it could be a bug
somewhere?

On Fri, May 30, 2014 at 2:42 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
 Hi, Sean

 I ran into the same problem,

 but when I changed MASTER="local" to MASTER="local[2]",

 everything went back to normal.

 I hadn't gotten a chance to ask about it here yet.

 Best,

 --
 Nan Zhu



Re: Streaming example stops outputting (Java, Kafka at least)

2014-05-30 Thread Patrick Wendell
Yeah - Spark Streaming needs at least two threads to run. I actually
thought we warned the user if they only use one (@tdas?), but the
warning might not be working correctly - or I'm misremembering.
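
The starvation Patrick describes can be illustrated without Spark at all. In this toy model (plain `java.util.concurrent`, not Spark code), a long-running "receiver" task holds the only thread in a single-thread pool, so the "processing" task never runs; with two threads, both run:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

// Toy illustration (not Spark code): with one thread, a long-running
// "receiver" task starves the "processing" task -- which is what happens
// to a receiver-based DStream under MASTER=local.
object StarvationDemo {
  def processingEverRuns(nThreads: Int): Boolean = {
    val pool = Executors.newFixedThreadPool(nThreads)
    val processed = new AtomicBoolean(false)
    // The "receiver": grabs a thread and holds it, like a Kafka receiver does.
    pool.submit(new Runnable {
      def run(): Unit =
        try Thread.sleep(5000) catch { case _: InterruptedException => () }
    })
    // The "processing" job: only runs if a second thread is available.
    pool.submit(new Runnable {
      def run(): Unit = processed.set(true)
    })
    Thread.sleep(300) // give the pool time to schedule both tasks
    pool.shutdownNow()
    pool.awaitTermination(1, TimeUnit.SECONDS)
    processed.get
  }
}
```

With `nThreads = 1` the processing task stays queued behind the sleeping receiver, mirroring the logs above where jobs are "Added" and "Finished" instantly but no take()/print() output ever appears.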

On Fri, May 30, 2014 at 6:38 AM, Sean Owen so...@cloudera.com wrote:
 Thanks Nan, that does appear to fix it. I was using "local". Can
 anyone say whether that's to be expected, or whether it could be a bug
 somewhere?

 On Fri, May 30, 2014 at 2:42 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
 Hi, Sean

 I ran into the same problem,

 but when I changed MASTER="local" to MASTER="local[2]",

 everything went back to normal.

 I hadn't gotten a chance to ask about it here yet.

 Best,

 --
 Nan Zhu



Re: Streaming example stops outputting (Java, Kafka at least)

2014-05-30 Thread Nan Zhu
If local[2] is required, then the streaming doc is actually misleading,

as the given example is:

import org.apache.spark.api.java.function._
import org.apache.spark.streaming._
import org.apache.spark.streaming.api._
// Create a StreamingContext with a local master
val ssc = new StreamingContext("local", "NetworkWordCount", Seconds(1))

http://spark.apache.org/docs/latest/streaming-programming-guide.html
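
For reference, the fix to that guide example would presumably change only the master string (this is a sketch of the corrected snippet, not quoted from the PR; it needs a Spark runtime to execute):

```scala
import org.apache.spark.streaming._

// "local[2]" provides one thread for the receiver and one for processing;
// plain "local" leaves no thread free to process the received data.
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
```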

I created a JIRA and a PR 

https://github.com/apache/spark/pull/924 

-- 
Nan Zhu


On Friday, May 30, 2014 at 1:53 PM, Patrick Wendell wrote:

 Yeah - Spark Streaming needs at least two threads to run. I actually
 thought we warned the user if they only use one (@tdas?), but the
 warning might not be working correctly - or I'm misremembering.
 
 On Fri, May 30, 2014 at 6:38 AM, Sean Owen so...@cloudera.com 
 (mailto:so...@cloudera.com) wrote:
  Thanks Nan, that does appear to fix it. I was using "local". Can
  anyone say whether that's to be expected, or whether it could be a bug
  somewhere?
  
  On Fri, May 30, 2014 at 2:42 PM, Nan Zhu zhunanmcg...@gmail.com 
  (mailto:zhunanmcg...@gmail.com) wrote:
   Hi, Sean

   I ran into the same problem,

   but when I changed MASTER="local" to MASTER="local[2]",

   everything went back to normal.

   I hadn't gotten a chance to ask about it here yet.

   Best,

   --
   Nan Zhu