I have designed this prototype for a risk business. Here I would like to
discuss issues with batch layer. *Apologies about being long winded.*

*Business objective*

Reduce risk in the credit business while making better credit and trading
decisions. Specifically, to identify risk trends within certain years of
trading data. For example, measure the risk exposure in a give portfolio by
industry, region, credit rating and other parameters. At the macroscopic
level, analyze data across market sectors, over a given time horizon to
asses risk changes


*Deliverable*
Enable real time and batch analysis of risk data

*Batch technology stack used*
Kafka -> zookeeper, Flume, HDFS (raw data), Hive, cron, Spark as the query
tool, Zeppelin

*Test volumes for POC*
1 message queue (csv format), 100 stock prices streaming in very 2 seconds,
180K prices per hour, 4 million + per day



   1. prices to Kafka -> Zookeeper -> Flume -> HDFS
   2. HDFS daily partition for that day's data
   3. Hive external table looking at HDFS partitioned location
   4. Hive managed table populated every 15 minutes via cron from Hive
   external table (table type ORC partitioned by date). This is purely Hive
   job. Hive table is populated using insert/overwrite for that day to
   avoid boundary value/missing data etc.
   5. Typical batch ingestion time (Hive table populated from HDFS files) ~
   2 minutes
   6. Data in Hive table has 15 minutes latency
   7. Zeppelin to be used as UI with Spark


Zeppelin will use Spark SQL (on Spark Thrift Server) and Spark shell.
Within Spark shell, users can access batch tables in Hive *or *they have a
choice of accessing raw data on HDFS files which gives them* real time
access * (not to be confused with speed layer).  Using typical query with
Spark, to see the last 15 minutes of real time data (T-15 -Now) takes 1
min. Running the same query (my typical query not user query) on Hive
tables this time using Spark takes 6 seconds.

However, there are some  design concerns:


   1. Zeppelin starts slowing down by the end of day. Sometimes it throws
   broken pipe message. I resolve this by restarting Zeppelin daemon.
   Potential show stopper
   2. As the volume of data increases throughout the day, performance
   becomes an issue
   3. Every 15 minutes when the cron starts, Hive insert/overwrites can
   potentially get in conflict with users throwing queries from
   Zeppelin/Spark. I am sure that with exclusive writes, Hive will block all
   users from accessing these tables (at partition level) until insert
   overwrite is done. This can be improved by better partitioning of Hive
   tables or relaxing ingestion time to half hour or one hour at a cost of
   more lagging. I tried Parquet tables in Hive but really no difference in
   performance gain. I have thought of replacing Hive with Hbase etc. but that
   brings new complications in as well without necessarily solving the issue.
   4. I am not convinced this design can scale up easily with 5 times more
   volume of data.
   5. We will also get real time data from RDBMS tables (Oracle, Sybase,
   MSSQL)using replication technologies such as Sap Replication Server. These
   currently deliver changed log data to Hive tables. So there is some
   compatibility issue here.


So I am sure some members can add useful ideas :)

Thanks

Mich




LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Reply via email to