Thuy, if you decide to go with HBase for external storage, consider using a lightweight SQL layer such as Apache Phoenix. It has a Spark plugin <https://phoenix.apache.org/phoenix_spark.html> and a JDBC driver, and throughput is pretty good even for a heavy market-data feed (make sure to use batched commits).
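For reference, here is a minimal sketch of what the batched-commit pattern looks like over the Phoenix JDBC driver -- the "positions" table, its columns, and the ZooKeeper quorum are all invented for illustration:

    import java.sql.DriverManager

    // Phoenix speaks plain JDBC; the URL points at the HBase ZooKeeper quorum.
    // The "positions" table and its columns are hypothetical.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
    conn.setAutoCommit(false) // Phoenix buffers upserts client-side until commit()

    val stmt = conn.prepareStatement(
      "UPSERT INTO positions (user_id, symbol, shares, txn_value) VALUES (?, ?, ?, ?)")

    val rows = Seq(("A", "IBM", 100L, 15000.0), ("A", "AAPL", 500L, 11400.0))
    try {
      for (((user, symbol, shares, value), i) <- rows.zipWithIndex) {
        stmt.setString(1, user)
        stmt.setString(2, symbol)
        stmt.setLong(3, shares)
        stmt.setDouble(4, value)
        stmt.executeUpdate()
        if ((i + 1) % 1000 == 0) conn.commit() // flush a batch of mutations to HBase
      }
      conn.commit() // flush whatever is left
    } finally {
      stmt.close()
      conn.close()
    }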
In our case we send Kafka streams directly into HBase via Phoenix JDBC upserts <https://phoenix.apache.org/language/index.html#upsert_values>, and Spark DataFrames are mapped to Phoenix tables for downstream analytics. Alternatively you can use Cassandra <https://databricks.com/blog/2015/06/16/zen-and-the-art-of-spark-maintenance-with-cassandra.html> for the backend, but Phoenix saves you a lot of coding, and a lot of the optimizations for joins & aggregations are already done for you (it plugs into HBase coprocessors).

Alex

On Tue, Sep 22, 2015 at 12:12 PM, Thúy Hằng Lê <thuyhang...@gmail.com> wrote:

> That's a great answer, Adrian.
> I found a lot of information here. I have a direction for the application
> now; I will try your suggestion :)
>
> On Tuesday, September 22, 2015, Adrian Tanase <atan...@adobe.com> wrote:
>
>> 1. Reading from Kafka has exactly-once guarantees - we are using it in
>>    production today (with the direct receiver).
>>    1. You will probably have 2 topics; loading both into Spark and
>>       joining / unioning as needed is not an issue.
>>    2. There are tons of optimizations you can do there, assuming
>>       everything else works.
>> 2. For ad-hoc queries I would say you absolutely need to look at
>>    external storage.
>>    1. Querying the DStream or Spark's RDDs directly should be done
>>       mostly for aggregates/metrics, not by users.
>>    2. If you look at HBase or Cassandra for storage then 50k writes/sec
>>       are not a problem at all, especially combined with a smart client
>>       that does batch puts (like async hbase
>>       <https://github.com/OpenTSDB/asynchbase>).
>>    3. You could also consider writing the updates to another Kafka topic
>>       and have a different component update the DB, if you think of
>>       other optimisations there.
>> 3. By stats I assume you mean metrics (operational or business).
>>    1. There are multiple ways to do this; however, I would not encourage
>>       you to query Spark directly, especially if you need an
>>       archive/history of your datapoints.
>>    2. We are using OpenTSDB (we already have an HBase cluster) + Grafana
>>       for dashboarding.
>>    3. Collecting the metrics is a bit hairy in a streaming app - we have
>>       experimented with both accumulators and RDDs specific for metrics,
>>       and chose the RDDs that write to OpenTSDB using foreachRDD.
>>
>> -adrian
>>
>> ------------------------------
>> *From:* Thúy Hằng Lê <thuyhang...@gmail.com>
>> *Sent:* Sunday, September 20, 2015 7:26 AM
>> *To:* Jörn Franke
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Using Spark for portfolio manager app
>>
>> Thanks Adrian and Jörn for the answers.
>>
>> Yes, you're right, there are a lot of things I need to consider if I
>> want to use Spark for my app.
>>
>> I still have a few concerns/questions about the information you gave:
>>
>> 1/ I need to combine the trading stream with the tick stream, and I am
>> planning to use Kafka for that.
>> If I use approach #2 (the Direct Approach) in this tutorial
>> <https://spark.apache.org/docs/latest/streaming-kafka-integration.html>,
>> will I receive exactly-once semantics? Or do I have to add some logic
>> in my code to achieve that?
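For concreteness, a minimal sketch of the direct approach being discussed, against the Spark 1.4-era streaming-kafka API (broker addresses and topic names are placeholders). Note the direct stream gives exactly-once *receiving*; end-to-end exactly-once still depends on idempotent or transactional writes downstream:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("portfolio"), Seconds(1))

    // No receiver: Spark tracks Kafka offsets itself (checkpointed), so each
    // record enters exactly one batch. Downstream writes should be idempotent
    // (e.g. Phoenix UPSERTs) to make the pipeline exactly-once end to end.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val trades = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("trades"))
    val ticks = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("ticks"))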
>> As per your suggestion of using delta updates, exactly-once semantics
>> are required for this application.
>>
>> 2/ For ad-hoc queries, I must write the output of Spark to external
>> storage and query on that, right?
>> Is there any way to do ad-hoc queries on Spark itself? My application
>> could have 50k updates per second at peak time.
>> Persisting to external storage leads to high latency in my app.
>>
>> 3/ How do I get real-time statistics out of Spark?
>> In most of the Spark Streaming examples, the statistics are echoed to
>> stdout.
>> However, I want to display those statistics on a GUI; is there any way
>> to retrieve data from Spark directly without using external storage?
>>
>> 2015-09-19 16:23 GMT+07:00 Jörn Franke <jornfra...@gmail.com>:
>>
>>> If you want to be able to let your users query their portfolios, then
>>> you may want to think about storing the current state of the portfolios
>>> in HBase/Phoenix; alternatively, a cluster of relational databases can
>>> make sense. For the rest you may use Spark.
>>>
>>> On Sat, Sep 19, 2015 at 4:43 AM, Thúy Hằng Lê <thuyhang...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am going to build a financial application for portfolio managers,
>>>> where each portfolio contains a list of stocks, the number of shares
>>>> purchased, and the purchase price.
>>>> Another source of information is stock prices from market data. The
>>>> application needs to calculate the real-time gain or loss of each
>>>> stock in each portfolio (compared to the purchase price).
>>>>
>>>> I am new to Spark. I know that using Spark Streaming I can aggregate
>>>> portfolio positions in real time, for example:
>>>> user A holds:
>>>> - 100 IBM shares with transactionValue=$15000
>>>> - 500 AAPL shares with transactionValue=$11400
>>>>
>>>> Now, given that stock prices change in real time too, e.g. if the IBM
>>>> price is at 151, I want to update its gain or loss:
>>>> gainOrLoss(IBM) = 151*100 - 15000 = $100
>>>>
>>>> My questions are:
>>>>
>>>> * What is the best method to combine 2 real-time streams (transactions
>>>> made by users and market pricing data) in Spark?
>>>> * How can I run real-time ad-hoc SQL against the portfolio's positions?
>>>> Is there any way I can do SQL on the output of Spark Streaming?
>>>> For example:
>>>> select sum(gainOrLoss) from portfolio where user='A';
>>>> * What are the preferred external storage options for Spark in this
>>>> use case?
>>>> * Is Spark the right choice for my use case?
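For what it's worth, here is a rough sketch of the two-stream combination the original question asks about. It uses queueStream stand-ins so it runs without Kafka; every name and number in it is invented for illustration (in the real app the inputs would be the parsed trade and tick streams):

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("gainOrLoss"), Seconds(1))

    // Stand-in inputs, keyed by symbol so the two streams can be joined.
    val positionQueue = mutable.Queue[RDD[(String, (String, Long, Double))]](
      ssc.sparkContext.parallelize(Seq(
        "IBM"  -> ("A", 100L, 15000.0),
        "AAPL" -> ("A", 500L, 11400.0))))
    val priceQueue = mutable.Queue[RDD[(String, Double)]](
      ssc.sparkContext.parallelize(Seq("IBM" -> 151.0, "AAPL" -> 115.0)))

    val positions = ssc.queueStream(positionQueue) // (symbol, (user, shares, txnValue))
    val prices    = ssc.queueStream(priceQueue)    // (symbol, latestPrice)

    // Join on symbol and apply the formula from the question:
    // gainOrLoss = latestPrice * shares - txnValue
    val gainOrLoss = positions.join(prices).map {
      case (symbol, ((user, shares, txnValue), price)) =>
        (user, symbol, price * shares - txnValue)
    }
    gainOrLoss.print() // IBM for user A: 151.0 * 100 - 15000.0 = 100.0

    ssc.start()
    ssc.awaitTermination()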