OK… so what’s the tricky part? Spark Streaming isn’t truly real time (it processes data in micro-batches), so if you don’t mind a slight delay in processing, it would work.
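
To make the "slight delay" concrete, here is a minimal sketch in plain Java (no Spark dependency) of how a micro-batch engine groups events by a fixed batch interval. The class name, method names, and the 2-second interval are assumptions for illustration only; this is not Spark's API, just the arithmetic behind the wait.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of micro-batching; not part of Spark.
public class MicroBatchSketch {

    // Index of the batch window an event falls into, for a fixed interval.
    public static long batchIndex(long eventTimeMs, long batchIntervalMs) {
        return eventTimeMs / batchIntervalMs;
    }

    // Earliest time the event can be processed: the end of its batch window.
    public static long earliestProcessingTimeMs(long eventTimeMs, long batchIntervalMs) {
        return (batchIndex(eventTimeMs, batchIntervalMs) + 1) * batchIntervalMs;
    }

    // Group event timestamps into batches, keyed by batch index (sorted).
    public static Map<Long, List<Long>> groupIntoBatches(List<Long> eventTimesMs,
                                                         long batchIntervalMs) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : eventTimesMs) {
            batches.computeIfAbsent(batchIndex(t, batchIntervalMs),
                                    k -> new ArrayList<>()).add(t);
        }
        return batches;
    }

    public static void main(String[] args) {
        long intervalMs = 2000; // assumed batch interval, for illustration
        List<Long> events = List.of(100L, 1900L, 2100L, 5500L);
        groupIntoBatches(events, intervalMs).forEach((batch, times) -> {
            for (long t : times) {
                long wait = earliestProcessingTimeMs(t, intervalMs) - t;
                System.out.println("batch " + batch + ": event at " + t
                        + " ms waits at least " + wait + " ms");
            }
        });
    }
}
```

An event arriving just after a batch boundary waits almost a full interval before it is even picked up, so the batch interval sets a floor on end-to-end latency.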
The drawback is that you now have a long-running Spark job (presumably under YARN), and that could become a problem in terms of security and resources.
(How well does YARN handle long-running jobs these days in a secured cluster? Steve L. may have some insight…)

Raw HDFS would become a problem because Apache HDFS is still WORM (write once, read many).
(Do you want to write your own compaction code? Or use Hive 1.x+?)

HBase? Depending on your admin… stability could be a problem.

Cassandra? That would be a separate cluster, and that in itself could be a problem…

YMMV, so you need to weigh the pros and cons of each tool against your own environment and skill level.

HTH

-Mike

> On Sep 29, 2016, at 8:54 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>
> I have a somewhat tricky use case, and I'm looking for ideas.
>
> I have 5-6 Kafka producers, reading various APIs, and writing their raw data
> into Kafka.
>
> I need to:
>
> - Do ETL on the data, and standardize it.
>
> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
> ElasticSearch / Postgres)
>
> - Query this data to generate reports / analytics (There will be a web UI
> which will be the front-end to the data, and will show the reports)
>
> Java is being used as the backend language for everything (backend of the web
> UI, as well as the ETL layer)
>
> I'm considering:
>
> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive
> raw data from Kafka, standardize & store it)
>
> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, and
> to allow queries
>
> - In the backend of the web UI, I could either use Spark to run queries
> across the data (mostly filters), or directly run queries against Cassandra /
> HBase
>
> I'd appreciate some thoughts / suggestions on which of these alternatives I
> should go with (e.g, using raw Kafka consumers vs Spark for ETL, which
> persistent data store to use, and how to query that data store in the backend
> of the web UI, for displaying the reports).
>
>
> Thanks.
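
As a rough illustration of the "standardize" step in the question above, here is a minimal Java sketch that maps raw records from different producers into one common shape. Everything in it is hypothetical: the source names (`api_a`, `api_b`), the field names, and the `StandardRecord` class are made up for the example. It is kept independent of Kafka and Spark, so the same method could in principle be called from a raw consumer's poll loop or from a streaming job's map step.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a per-source standardization step.
public class StandardizeSketch {

    // Common shape every source is mapped into (fields are assumptions).
    public static final class StandardRecord {
        public final String source;
        public final String id;
        public final long timestampMs;

        public StandardRecord(String source, String id, long timestampMs) {
            this.source = source;
            this.id = id;
            this.timestampMs = timestampMs;
        }
    }

    // One mapping function per producer, keyed by a made-up source name.
    private static final Map<String, Function<Map<String, String>, StandardRecord>> MAPPERS =
            new HashMap<>();

    static {
        // Assumed: source "api_a" already reports milliseconds in "ts_ms".
        MAPPERS.put("api_a", raw -> new StandardRecord(
                "api_a", raw.get("id"), Long.parseLong(raw.get("ts_ms"))));
        // Assumed: source "api_b" reports seconds in "epoch_seconds".
        MAPPERS.put("api_b", raw -> new StandardRecord(
                "api_b", raw.get("uuid"), Long.parseLong(raw.get("epoch_seconds")) * 1000L));
    }

    // Convert one raw record into the common shape, or fail loudly.
    public static StandardRecord standardize(String source, Map<String, String> raw) {
        Function<Map<String, String>, StandardRecord> mapper = MAPPERS.get(source);
        if (mapper == null) {
            throw new IllegalArgumentException("Unknown source: " + source);
        }
        return mapper.apply(raw);
    }

    public static void main(String[] args) {
        Map<String, String> raw = new HashMap<>();
        raw.put("uuid", "x1");
        raw.put("epoch_seconds", "10");
        StandardRecord r = standardize("api_b", raw);
        System.out.println(r.source + " " + r.id + " " + r.timestampMs);
    }
}
```

Keeping the mapping logic free of any transport or engine dependency means the raw-consumer vs. Spark Streaming decision can be made, or changed later, without rewriting the standardization itself.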