The last successful batch before stop re-executes after restarting DStreams from a checkpoint

2018-03-11 Thread Terry Hoo
Experts, I see the last batch before stop (graceful shutdown) always re-execute after restart the DStream from a checkpoint, is this a expected behavior? I see a bug in JIRA: https://issues.apache.org/jira/browse/SPARK-20050, whic reports duplicates on Kafka, I also see this with HDFS file.

Re: Getting memory error when starting spark shell but not often

2016-09-06 Thread Terry Hoo
Maybe not enough continues memory (10G?) in your host Regards, - Terry On Wed, Sep 7, 2016 at 10:51 AM, Divya Gehlot wrote: > Hi, > I am using EMR 4.7 with Spark 1.6 > Sometimes when I start the spark shell I get below error > > OpenJDK 64-Bit Server VM warning: INFO:

Re: Spark 2.0: how to use SparkSession and StreamingContext at the same time

2016-07-25 Thread Terry Hoo
Kevin, Try to create the StreamingContext as follows: val ssc = new StreamingContext(spark.sparkContext, Seconds(2)) On Tue, Jul 26, 2016 at 11:25 AM, kevin wrote: > hi,all: > I want to read data from kafka and register it as a table, then join a jdbc > table. > My
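A minimal sketch of that suggestion: build the StreamingContext on top of the SparkSession's underlying SparkContext instead of creating a second SparkContext (app name is illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val spark = SparkSession.builder()
      .appName("session-plus-streaming")
      .getOrCreate()

    // Reuse the session's SparkContext for streaming.
    val ssc = new StreamingContext(spark.sparkContext, Seconds(2))
    // ssc can now consume Kafka while spark handles the SQL/JDBC side.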

Re: Another problem about parallel computing

2016-06-13 Thread Terry Hoo
hero, Did you check whether there is any exception after the retry? If the port is 0, the Spark worker should bind to a random port. BTW, what's the Spark version? Regards, - Terry On Mon, Jun 13, 2016 at 4:24 PM, hero wrote: > Hi, guys > > I have another problem about

Re: StackOverflow in Spark

2016-06-13 Thread Terry Hoo
Maybe the same issue as SPARK-6847, which has been fixed in Spark 2.0. Regards - Terry On Mon, Jun 13, 2016 at 3:15 PM, Michel Hubert wrote: > > > I’ve found my problem. > > > > I’ve got a DAG with two consecutive
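For readers on pre-2.0 versions, one common mitigation for stack overflows from long stateful-DStream lineage is to checkpoint the stateful stream more frequently so the lineage stays shallow. A sketch only; the interval, stream, and update function here are illustrative, not taken from the thread:

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // Shorten the checkpoint interval on a stateful stream so the RDD
    // lineage (and the serialized object graph) is truncated regularly.
    def keepLineageShallow(pairs: DStream[(String, Int)]): Unit = {
      val updateFunc = (values: Seq[Int], state: Option[Int]) =>
        Some(values.sum + state.getOrElse(0))
      val stateStream = pairs.updateStateByKey(updateFunc)
      stateStream.checkpoint(Seconds(10)) // illustrative interval
      stateStream.print()
    }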

ArrayIndexOutOfBoundsException in model selection via cross-validation sample with spark 1.6.1

2016-05-04 Thread Terry Hoo
All, I met an ArrayIndexOutOfBoundsException when running the model selection via cross-validation sample with Spark 1.6.1; did anyone else meet this before? How can it be resolved? Call stack:
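The call stack itself is cut off in the archive. For reference, a condensed version of the Spark 1.6 model-selection sample the question refers to, assuming `training` is a DataFrame with (id, text, label) columns:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
    import org.apache.spark.sql.DataFrame

    def runCrossValidation(training: DataFrame) = {
      val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
      val hashingTF = new HashingTF()
        .setInputCol(tokenizer.getOutputCol).setOutputCol("features")
      val lr = new LogisticRegression().setMaxIter(10)
      val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

      // Grid of hyperparameters searched by cross-validation.
      val paramGrid = new ParamGridBuilder()
        .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
        .addGrid(lr.regParam, Array(0.1, 0.01))
        .build()

      new CrossValidator()
        .setEstimator(pipeline)
        .setEvaluator(new BinaryClassificationEvaluator())
        .setEstimatorParamMaps(paramGrid)
        .setNumFolds(2)
        .fit(training) // the step where the reported exception would surface
    }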

Re: Number of batches in the Streaming Statistics visualization screen

2016-01-29 Thread Terry Hoo
on of the property "How > many finished batches the Spark UI and status APIs remember before garbage > collecting." So the data is stored in memory, but the memory of which > component ... I imagine the driver? > > regards, > > On Fri, Jan 29, 2016 at 10:52 AM,

Re: Number of batches in the Streaming Statistics visualization screen

2016-01-29 Thread Terry Hoo
Hi Mehdi, Did you try a larger value of "spark.streaming.ui.retainedBatches" (default is 1000)? Regards, - Terry On Fri, Jan 29, 2016 at 5:45 PM, Mehdi Ben Haj Abbes wrote: > Hi folks, > > I have a streaming job running for more than 24 hours. It seems that there > is a
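A sketch of that suggestion: raise the number of finished batches the streaming UI retains, so long-running jobs keep more history visible (the value shown is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-ui-history")
      // Default is 1000 finished batches; larger values cost driver memory.
      .set("spark.streaming.ui.retainedBatches", "5000")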

Re: [Spark 1.6][Streaming] About the behavior of mapWithState

2016-01-17 Thread Terry Hoo
reduceByKey(_ + _).mapWithState(spec) > > On Fri, Jan 15, 2016 at 1:21 AM, Terry Hoo <hujie.ea...@gmail.com> wrote: > >> Hi, >> I am doing a simple test with mapWithState, and get some events >> unexpected, is this correct? >> >> The test is very simple: s

[Spark 1.6][Streaming] About the behavior of mapWithState

2016-01-15 Thread Terry Hoo
Hi, I am doing a simple test with mapWithState and get some unexpected events; is this correct? The test is very simple: sum the value of each key. val mappingFunction = (key: Int, value: Option[Int], state: State[Int]) => { state.update(state.getOption().getOrElse(0) + value.getOrElse(0))
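The quoted code is cut off mid-snippet; a completed sketch of the same test (the emitted return value and the stream source are assumptions where the excerpt ends):

    import org.apache.spark.streaming.{State, StateSpec}
    import org.apache.spark.streaming.dstream.DStream

    // Sum the value of each key, keeping the running total in state.
    val mappingFunction = (key: Int, value: Option[Int], state: State[Int]) => {
      val sum = state.getOption().getOrElse(0) + value.getOrElse(0)
      state.update(sum)
      (key, sum) // emit the running sum for this key
    }

    def summed(pairs: DStream[(Int, Int)]): DStream[(Int, Int)] = {
      val spec = StateSpec.function(mappingFunction)
      // Requires checkpointing to be enabled on the StreamingContext.
      pairs.mapWithState(spec)
    }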

[Streaming] Long time to catch up when streaming application restarts from checkpoint

2015-11-06 Thread Terry Hoo
All, I have a streaming application that monitors an HDFS folder and computes some metrics based on this data; the data in this folder is updated by another uploader application. The streaming application's batch interval is 1 minute, and the batch processing time is about 30 seconds,
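A sketch of the setup described, assuming textFileStream as the folder monitor; paths and the metric computation are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("hdfs-folder-metrics")
    // 1-minute batch interval, as described above.
    val ssc = new StreamingContext(conf, Seconds(60))
    // Checkpointing enabled so the application can restart and catch up.
    ssc.checkpoint("hdfs:///tmp/metrics-checkpoint")

    val files = ssc.textFileStream("hdfs:///data/incoming")
    files.count().print() // stand-in for the actual metric computation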

[SQL] Memory leak with spark streaming and spark sql in spark 1.5.1

2015-10-14 Thread Terry Hoo
All, Does anyone meet a memory leak issue with Spark Streaming and Spark SQL in Spark 1.5.1? I can see the memory increasing all the time when running this simple sample: val sc = new SparkContext(conf) val sqlContext = new HiveContext(sc) import sqlContext.implicits._
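The sample is cut off after these lines; a hedged reconstruction of the shape it describes, i.e. a streaming job that runs Spark SQL on every batch. Everything past the quoted lines (the source, table name, and query) is an assumption:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("streaming-sql-leak-repro")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val ssc = new StreamingContext(sc, Seconds(10))
    val stream = ssc.socketTextStream("localhost", 9999) // illustrative source
    stream.foreachRDD { rdd =>
      // Build a DataFrame and run SQL on each batch.
      val df = rdd.map(line => (line, line.length)).toDF("value", "len")
      df.registerTempTable("events")
      sqlContext.sql("SELECT len, COUNT(*) FROM events GROUP BY len").show()
    }
    ssc.start()
    ssc.awaitTermination()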

Re: Streaming Application Unable to get Stream from Kafka

2015-10-09 Thread Terry Hoo
Hi Prateek, How many cores (threads) do you assign to Spark in local mode? It is very likely the local Spark does not have enough resources to proceed. You can check http://yourip:4040 for the details. Thanks! Terry On Fri, Oct 9, 2015 at 10:34 PM, Prateek . wrote: >
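A sketch of the point being made: receiver-based sources (such as the Kafka receiver) occupy one thread, so local mode needs at least two, otherwise the processing side starves and no output appears. The app name and interval are illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setMaster("local[2]") // >= 2: one thread for the receiver, one for tasks
      .setAppName("kafka-local-test")
    val ssc = new StreamingContext(conf, Seconds(5))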

Re: Can't perform full outer join

2015-09-29 Thread Terry Hoo
Saif, Maybe you can rename the columns of one of the dataframes first, then do an outer join and a select like this: val cur_d = cur_data.toDF("Date_1", "Value_1") val r = data.join(cur_d, data("DATE") === cur_d("Date_1"), "outer").select($"DATE", $"VALUE", $"Value_1") Thanks, Terry On Tue,

Re: Why Checkpoint is throwing "actor.OneForOneStrategy: NullPointerException"

2015-09-24 Thread Terry Hoo
I met this before: in my program, some DStreams were not initialized since they were not in the path of output. You can check whether your case is the same. Thanks! - Terry On Fri, Sep 25, 2015 at 10:22 AM, Tathagata Das wrote: > Are you by any chance setting
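A sketch of that failure mode: a transformed DStream that never reaches an output operation is not initialized, which can surface as a NullPointerException when the checkpoint is written. Registering any output op (print, foreachRDD, saveAs...) puts the stream on the output path. Names here are illustrative:

    import org.apache.spark.streaming.dstream.DStream

    def ensureOnOutputPath(lines: DStream[String]): Unit = {
      val parsed = lines.map(_.split(","))
      // Without this (or another output operation), `parsed` stays
      // uninitialized and checkpointing can fail.
      parsed.print()
    }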