I have designed this prototype for a risk business. Here I would like to discuss issues with batch layer. *Apologies about being long winded.*
*Business objective* Reduce risk in the credit business while making better credit and trading decisions. Specifically, to identify risk trends within certain years of trading data. For example, measure the risk exposure in a give portfolio by industry, region, credit rating and other parameters. At the macroscopic level, analyze data across market sectors, over a given time horizon to asses risk changes *Deliverable* Enable real time and batch analysis of risk data *Batch technology stack used* Kafka -> zookeeper, Flume, HDFS (raw data), Hive, cron, Spark as the query tool, Zeppelin *Test volumes for POC* 1 message queue (csv format), 100 stock prices streaming in very 2 seconds, 180K prices per hour, 4 million + per day 1. prices to Kafka -> Zookeeper -> Flume -> HDFS 2. HDFS daily partition for that day's data 3. Hive external table looking at HDFS partitioned location 4. Hive managed table populated every 15 minutes via cron from Hive external table (table type ORC partitioned by date). This is purely Hive job. Hive table is populated using insert/overwrite for that day to avoid boundary value/missing data etc. 5. Typical batch ingestion time (Hive table populated from HDFS files) ~ 2 minutes 6. Data in Hive table has 15 minutes latency 7. Zeppelin to be used as UI with Spark Zeppelin will use Spark SQL (on Spark Thrift Server) and Spark shell. Within Spark shell, users can access batch tables in Hive *or *they have a choice of accessing raw data on HDFS files which gives them* real time access * (not to be confused with speed layer). Using typical query with Spark, to see the last 15 minutes of real time data (T-15 -Now) takes 1 min. Running the same query (my typical query not user query) on Hive tables this time using Spark takes 6 seconds. However, there are some design concerns: 1. Zeppelin starts slowing down by the end of day. Sometimes it throws broken pipe message. I resolve this by restarting Zeppelin daemon. Potential show stopper 2. As the volume of data increases throughout the day, performance becomes an issue 3. Every 15 minutes when the cron starts, Hive insert/overwrites can potentially get in conflict with users throwing queries from Zeppelin/Spark. I am sure that with exclusive writes, Hive will block all users from accessing these tables (at partition level) until insert overwrite is done. This can be improved by better partitioning of Hive tables or relaxing ingestion time to half hour or one hour at a cost of more lagging. I tried Parquet tables in Hive but really no difference in performance gain. I have thought of replacing Hive with Hbase etc. but that brings new complications in as well without necessarily solving the issue. 4. I am not convinced this design can scale up easily with 5 times more volume of data. 5. We will also get real time data from RDBMS tables (Oracle, Sybase, MSSQL)using replication technologies such as Sap Replication Server. These currently deliver changed log data to Hive tables. So there is some compatibility issue here. So I am sure some members can add useful ideas :) Thanks Mich LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.