That's a great answer, Adrian. I found a lot of information here. I have a direction for the application now; I will try your suggestions :)
On Tuesday, September 22, 2015, Adrian Tanase <atan...@adobe.com> wrote:

>    1. reading from Kafka has exactly-once guarantees - we are using it in
>    production today (with the direct receiver)
>       1. you will probably have 2 topics; loading both into Spark and
>       joining / unioning as needed is not an issue
>       2. there are tons of optimizations you can do there, assuming
>       everything else works
>    2. for ad-hoc queries I would say you absolutely need to look at
>    external storage
>       1. querying the DStream or Spark's RDDs directly should be done
>       mostly for aggregates/metrics, not by users
>       2. if you look at HBase or Cassandra for storage then 50k
>       writes/sec are not a problem at all, especially combined with a smart
>       client that does batch puts (like async hbase
>       <https://github.com/OpenTSDB/asynchbase>)
>       3. you could also consider writing the updates to another Kafka
>       topic and have a different component that updates the DB, if you think
>       of other optimizations there
>    3. by stats I assume you mean metrics (operational or business)
>       1. there are multiple ways to do this; however, I would not
>       encourage you to query Spark directly, especially if you need an
>       archive/history of your datapoints
>       2. we are using OpenTSDB (we already have an HBase cluster) +
>       Grafana for dashboarding
>       3. collecting the metrics is a bit hairy in a streaming app - we
>       have experimented with both accumulators and RDDs specific for
>       metrics, and chose the RDDs that write to OpenTSDB using foreachRDD
>
> -adrian
>
> ------------------------------
> *From:* Thúy Hằng Lê <thuyhang...@gmail.com>
> *Sent:* Sunday, September 20, 2015 7:26 AM
> *To:* Jörn Franke
> *Cc:* user@spark.apache.org
> *Subject:* Re: Using Spark for portfolio manager app
>
> Thanks Adrian and Jörn for the answers.
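Adrian's point 2.2 above (a "smart client that does batch puts") can be sketched generically. The snippet below is a simplified illustration, not the asynchbase API — all names in it are hypothetical. In a Spark Streaming job, a writer like this would typically be created inside foreachRDD / foreachPartition so each partition flushes its updates to HBase/Cassandra in batches rather than one put per record:

```python
# Hypothetical batching writer -- NOT the asynchbase API, just the idea
# of buffering individual puts and persisting them in batches.

class BatchingWriter:
    """Buffers individual puts and flushes them to storage in batches."""

    def __init__(self, flush_fn, batch_size=500):
        self.flush_fn = flush_fn      # callback that persists one whole batch
        self.batch_size = batch_size
        self.buffer = []

    def put(self, row_key, value):
        self.buffer.append((row_key, value))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # one round-trip per batch
            self.buffer = []

# Toy usage: collect batches into a list instead of writing to a real DB.
batches = []
writer = BatchingWriter(batches.append, batch_size=3)
for i in range(7):
    writer.put("user-%d" % i, i)
writer.flush()  # don't forget the partial batch at the end
```

The same buffering trades a little latency for far fewer round-trips, which is what makes 50k writes/sec tractable against HBase or Cassandra.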
> Yes, you're right, there are a lot of things I need to consider if I want
> to use Spark for my app.
>
> I still have a few concerns/questions about your information:
>
> 1/ I need to combine the trading stream with the tick stream; I am
> planning to use Kafka for that.
> If I am using approach #2 (the Direct Approach) in this tutorial
> <https://spark.apache.org/docs/latest/streaming-kafka-integration.html>,
> will I receive exactly-once semantics? Or do I have to add some logic in
> my code to achieve that?
> Per your suggestion of using delta updates, exactly-once semantics are
> required for this application.
>
> 2/ For ad-hoc queries, must I write the output of Spark to external
> storage and query on that? Is there any way to do ad-hoc queries on Spark
> itself? My application could have 50k updates per second at peak time.
> Persisting to external storage leads to high latency in my app.
>
> 3/ How can I get real-time statistics from Spark?
> In most of the Spark Streaming examples, the statistics are echoed to
> stdout. However, I want to display those statistics on a GUI; is there
> any way to retrieve data from Spark directly without using external
> storage?
>
> 2015-09-19 16:23 GMT+07:00 Jörn Franke <jornfra...@gmail.com>:
>
>> If you want to be able to let your users query their portfolios, then
>> you may want to think about storing the current state of the portfolios
>> in HBase/Phoenix; alternatively, a cluster of relational databases can
>> make sense. For the rest you may use Spark.
>>
>> On Sat, Sep 19, 2015 at 4:43, Thúy Hằng Lê <thuyhang...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I am going to build a financial application for a Portfolio Manager,
>>> where each portfolio contains a list of stocks, the number of shares
>>> purchased, and the purchase price.
>>> Another source of information is stock prices from market data. The
>>> application needs to calculate the real-time gain or loss of each stock
>>> in each portfolio (compared to the purchase price).
>>>
>>> I am new to Spark. I know that using Spark Streaming I can aggregate
>>> portfolio positions in real time, for example:
>>> user A holds:
>>> - 100 IBM shares with transactionValue=$15000
>>> - 500 AAPL shares with transactionValue=$11400
>>>
>>> Now, given that stock prices change in real time too, e.g. if the IBM
>>> price is at 151, I want to update its gain or loss:
>>> gainOrLoss(IBM) = 151*100 - 15000 = $100
>>>
>>> My questions are:
>>>
>>> * What is the best method to combine 2 real-time streams (transactions
>>> made by users and market pricing data) in Spark?
>>> * How can I run real-time ad-hoc SQL against the portfolio's positions?
>>> Is there any way I can do SQL on the output of Spark Streaming?
>>> For example:
>>> select sum(gainOrLoss) from portfolio where user='A';
>>> * What are the preferred external storages for Spark in this use case?
>>> * Is Spark the right choice for my use case?
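For concreteness, the gain/loss computation in the original question can be sketched without Spark as a join of positions with the latest prices; in Spark Streaming this would be a join of the two streams keyed by symbol. Plain dicts stand in for the streams here, and the AAPL price of 23.0 is made up for illustration:

```python
# Positions keyed by (user, symbol) -> (shares, purchase cost), joined
# with the latest market prices to compute gain or loss per position.
# This is only a sketch of the arithmetic, not Spark code; the AAPL
# price below is hypothetical.

positions = {
    ("A", "IBM"): (100, 15000),   # 100 IBM shares, transactionValue=$15000
    ("A", "AAPL"): (500, 11400),  # 500 AAPL shares, transactionValue=$11400
}
latest_prices = {"IBM": 151.0, "AAPL": 23.0}  # AAPL price is invented

def gain_or_loss(positions, prices):
    """gainOrLoss = currentPrice * shares - purchaseCost, per position."""
    return {
        (user, sym): prices[sym] * shares - cost
        for (user, sym), (shares, cost) in positions.items()
        if sym in prices  # skip symbols with no price tick yet
    }

result = gain_or_loss(positions, latest_prices)
# IBM at 151: 151 * 100 - 15000 = 100, matching the example in the question
```

Recomputing only the positions whose symbol appears in the current price micro-batch is the delta-update idea discussed earlier in the thread.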