Spark Streaming data checkpoint performance

2015-11-01 Thread Thúy Hằng
Hi Spark gurus, I am evaluating Spark Streaming. In my application I need to maintain cumulative statistics (e.g. the total running word count), so I need to call the updateStateByKey function on every micro-batch. After setting this up, I observed the following behaviors: * The Processing Time
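A minimal sketch of the running word count described above, assuming PySpark's convention that the update function receives the new values for one key in the current micro-batch plus that key's previous state (None the first time the key is seen). The name `update_count` and the sample batches are hypothetical. The loop only simulates, per key, what updateStateByKey would do, so it runs without a cluster.

```python
def update_count(new_values, running_count):
    """Fold this micro-batch's counts for one key into its running total."""
    return sum(new_values) + (running_count or 0)


# Hypothetical data: three micro-batches of words.
batches = [["a", "b", "a"], ["b"], ["a", "c"]]

state = {}
for batch in batches:
    # Group the batch into (word -> [1, 1, ...]), like a pair stream of (word, 1).
    grouped = {}
    for word in batch:
        grouped.setdefault(word, []).append(1)
    # The update function is applied to every key that has state or new data.
    for word in set(state) | set(grouped):
        state[word] = update_count(grouped.get(word, []), state.get(word))

print(sorted(state.items()))
```

With a real stream, the same function would be passed as `pairs.updateStateByKey(update_count)`, where `pairs` is a hypothetical DStream of (word, 1) tuples. Checkpointing must be enabled for that call, which is where the checkpoint cost discussed in this thread comes from.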

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Thúy Hằng
seconds for checkpoint. Now my application has an average latency of 30 seconds, and it keeps increasing. 2015-11-06 11:11 GMT+07:00 Thúy Hằng Lê <thuyhang...@gmail.com>: > Thanks all, it would be great to have this feature soon. > Do you know what the release plan for 1.6 is? > > In a

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Thúy Hằng
e. } } Without using updateStateByKey, I only have the stats of the last micro-batch. Any advice on this? 2015-11-07 11:35 GMT+07:00 Aniket Bhatnagar <aniket.bhatna...@gmail.com>: > Can you try storing the state (word count) in an external key value store? > >

Re: Spark Streaming data checkpoint performance

2015-11-02 Thread Thúy Hằng
our state object more complicated and try to prune out words > with very few occurrences or that haven’t been updated for a long time > - You can do this by emitting None from updateStateByKey > > Hope this helps, > -adrian > > From: Thúy Hằng Lê > Date: Monday, Novemb
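The pruning advice quoted above can be sketched as an update function that returns None to drop a key from the state. This is an illustration only: the state layout `(count, idle_batches)` and the thresholds `MIN_COUNT`, `RARE_IDLE_BATCHES`, and `MAX_IDLE_BATCHES` are hypothetical and do not come from the thread.

```python
MIN_COUNT = 2           # hypothetical: words seen fewer times than this are "rare"
RARE_IDLE_BATCHES = 1   # hypothetical: rare words are dropped after this many idle batches
MAX_IDLE_BATCHES = 3    # hypothetical: any word is dropped after this many idle batches


def update_with_pruning(new_values, state):
    """Return the new (count, idle_batches) state, or None to drop the key."""
    count, idle = state if state is not None else (0, 0)
    if new_values:
        # The key was updated in this batch, so reset its idle counter.
        return (count + sum(new_values), 0)
    idle += 1
    stale = idle >= MAX_IDLE_BATCHES
    rare = count < MIN_COUNT and idle >= RARE_IDLE_BATCHES
    return None if (stale or rare) else (count, idle)


print(update_with_pruning([1, 1], None))
print(update_with_pruning([], (5, 0)))
print(update_with_pruning([], (5, 2)))
```

Dropping keys this way keeps the state small, so there is less to write on each checkpoint. The thresholds would need tuning against the actual data.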

Re: Spark Streaming data checkpoint performance

2015-11-05 Thread Thúy Hằng
"Adrian Tanase" <atan...@adobe.com> wrote: > Nice! Thanks for sharing, I wasn’t aware of the new API. > > Left some comments on the JIRA and design doc. > > -adrian > > From: Shixiong Zhu > Date: Tuesday, November 3, 2015 at 3:32 AM > To: Thúy Hằng Lê

Re: Using Spark for portfolio manager app

2015-09-22 Thread Thúy Hằng
ecific for metrics - > chose the RDDs that write to OpenTSDB using foreachRDD > > -adrian > > -- > *From:* Thúy Hằng Lê <thuyhang...@gmail.com> > *Sent:* Sunday

Using Spark for portfolio manager app

2015-09-18 Thread Thúy Hằng
Hi all, I am going to build a financial application for Portfolio Manager, where each portfolio contains a list of stocks, the number of shares purchased, and the purchase price. Another source of information is stock prices from market data. The application needs to calculate real-time gain or
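One way to read the calculation described above is as unrealized gain or loss per position: market price minus purchase price, times shares held. The sketch below assumes that formula. The `Position` type, the symbols, and the prices are hypothetical, and it is plain Python rather than a streaming job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    symbol: str            # hypothetical ticker
    shares: int            # number of shares purchased
    purchase_price: float  # price paid per share


def portfolio_gain(positions, market_prices):
    """Sum (market - purchase) * shares over positions that have a known price."""
    return sum(
        (market_prices[p.symbol] - p.purchase_price) * p.shares
        for p in positions
        if p.symbol in market_prices
    )


# Hypothetical portfolio and market snapshot.
positions = [Position("AAA", 10, 100.0), Position("BBB", 5, 40.0)]
prices = {"AAA": 110.0, "BBB": 38.0}
print(portfolio_gain(positions, prices))
```

In a streaming setting, the portfolio would be the slowly changing side and `prices` the fast-moving market feed. Positions without a current price are skipped here, not treated as zero.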

Re: Using Spark for portfolio manager app

2015-09-25 Thread Thúy Hằng
Thanks all for the feedback so far. I haven't decided which external storage will be used yet. HBase is cool but it requires Hadoop in production. I only have 3-4 servers for the whole thing (I am thinking of a relational database for this, such as MariaDB, MemSQL, or MySQL), but they are hard to

Re: RDD partition after calling mapToPair

2015-11-23 Thread Thúy Hằng
Thanks Cody, I still have concerns about this. What do you mean by saying the Spark direct stream doesn't have a default partitioner? Could you please explain more? When I assign 20 cores to 20 Kafka partitions, I expect each core to work on one partition. Is that correct? I'm