Thanks all. Using external storage seems to be the best solution for now.
Btw, has anyone heard of the following Spark Streaming module from Intel?
https://github.com/Intel-bigdata/spark-streamingsql
It seems to allow querying a Spark stream on the fly; however, it hasn't been
updated for 9 months, so I'm not sure it's still good to use.

2015-09-20 13:17 GMT+07:00 Jörn Franke <jornfra...@gmail.com>:

> I think generally the way forward would be to put aggregate statistics into
> an external storage (e.g. HBase) - it should not have that much influence on
> latency. You will probably need it anyway if you need to store historical
> information. W.r.t. deltas - always a tricky topic. You may want to work
> with absolute values, and when the application queries the external
> datastore, it calculates the deltas. Once this works you can decide whether
> you still need the delta approach or not.
>
> On Sun, Sep 20, 2015 at 6:26, Thúy Hằng Lê <thuyhang...@gmail.com> wrote:
>
>> Thanks Adrian and Jorn for the answers.
>>
>> Yes, you're right, there are a lot of things I need to consider if I want
>> to use Spark for my app.
>>
>> I still have a few concerns/questions based on your information:
>>
>> 1/ I need to combine the trading stream with the tick stream; I am planning
>> to use Kafka for that.
>> If I use approach #2 (Direct Approach) in this tutorial
>> https://spark.apache.org/docs/latest/streaming-kafka-integration.html
>> will I get exactly-once semantics? Or do I have to add some logic in my
>> code to achieve that?
>> As you suggested using delta updates, exactly-once semantics are
>> required for this application.
>>
>> 2/ For ad-hoc queries, I must write the output of Spark to external storage
>> and query that, right?
>> Is there any way to do ad-hoc queries on Spark? My application could have
>> 50k updates per second at peak time.
>> Persisting to external storage leads to high latency in my app.
>>
>> 3/ How do I get real-time statistics from Spark?
>> In most of the Spark Streaming examples, the statistics are echoed to
>> stdout.
>> However, I want to display those statistics in a GUI. Is there any way to
>> retrieve data from Spark directly without using external storage?
>>
>>
>> 2015-09-19 16:23 GMT+07:00 Jörn Franke <jornfra...@gmail.com>:
>>
>>> If you want to let your users query their portfolios, then you
>>> may want to think about storing the current state of the portfolios in
>>> HBase/Phoenix; alternatively, a cluster of relational databases can make
>>> sense. For the rest you may use Spark.
>>>
>>> On Sat, Sep 19, 2015 at 4:43, Thúy Hằng Lê <thuyhang...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am going to build a financial application for Portfolio Manager,
>>>> where each portfolio contains a list of stocks, the number of shares
>>>> purchased, and the purchase price.
>>>> Another source of information is stock prices from market data. The
>>>> application needs to calculate the real-time gain or loss of each stock in
>>>> each portfolio (compared to the purchase price).
>>>>
>>>> I am new to Spark. I know that using Spark Streaming I can aggregate
>>>> portfolio positions in real time, for example:
>>>> user A contains:
>>>> - 100 IBM stock with transactionValue=$15000
>>>> - 500 AAPL stock with transactionValue=$11400
>>>>
>>>> Now, given that stock prices change in real time too, e.g. if the IBM price
>>>> is at 151, I want to update its gain or loss: gainOrLost(IBM) = 151*100 -
>>>> 15000 = $100
>>>>
>>>> My questions are:
>>>>
>>>> * What is the best method to combine 2 real-time streams
>>>> (transactions made by users and market pricing data) in Spark?
>>>> * How can I run real-time ad-hoc SQL against portfolio positions? Is there
>>>> any way I can do SQL on the output of Spark Streaming?
>>>> For example,
>>>> select sum(gainOrLost) from portfolio where user='A';
>>>> * What are the preferred external storages for Spark in this use case?
>>>> * Is Spark the right choice for my use case?
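For reference, here is a rough sketch of the "absolute values in external storage, compute at query time" idea from the thread. It is plain Python, not Spark code: the two dictionaries only stand in for tables in an external store such as HBase, and the names positions, latest_prices, gain_or_loss and total_gain are hypothetical. The IBM numbers come from the example in the thread; the AAPL price is an assumed value for illustration.

```python
# Hypothetical stand-ins for tables in an external store (e.g. HBase).
# positions: (user, symbol) -> (shares, transaction_value); values from the thread.
positions = {
    ("A", "IBM"): (100, 15000.0),
    ("A", "AAPL"): (500, 11400.0),
}

# latest_prices: symbol -> last price seen on the tick stream.
# The IBM price is from the thread; the AAPL price is an assumed value.
latest_prices = {"IBM": 151.0, "AAPL": 23.0}


def gain_or_loss(user, symbol):
    """Gain or loss of one position: price * shares - transaction value."""
    shares, transaction_value = positions[(user, symbol)]
    return latest_prices[symbol] * shares - transaction_value


def total_gain(user):
    """Rough equivalent of: select sum(gainOrLost) from portfolio where user=..."""
    return sum(
        gain_or_loss(u, symbol) for (u, symbol) in positions if u == user
    )


print(gain_or_loss("A", "IBM"))  # 151*100 - 15000 = 100.0
print(total_gain("A"))           # 100.0 (IBM) + 100.0 (AAPL) = 200.0
```

In this sketch the streaming job would only overwrite the stored absolute values (positions and last prices), so a replayed message rewrites the same value instead of being counted twice. That is one possible reason to start with absolute values before deciding whether deltas are needed.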