OK… so what’s the tricky part?
Spark Streaming isn’t true real time; it processes data in micro-batches. So
if you don’t mind a slight delay in processing… it would work.
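
To make the "slight delay" concrete, here is a rough plain-Java sketch of the micro-batch idea: records are buffered and only become available once their batch interval closes. This is NOT Spark's API; the class name MicroBatcher, its methods, and the timestamps are hypothetical and purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of micro-batching; this is not Spark code.
public class MicroBatcher {
    private final long intervalMs;   // batch interval, e.g. 1000 ms
    private long batchStartMs;       // start time of the batch being filled
    private List<String> current = new ArrayList<>();
    private final List<List<String>> completed = new ArrayList<>();

    public MicroBatcher(long intervalMs, long startMs) {
        this.intervalMs = intervalMs;
        this.batchStartMs = startMs;
    }

    // A record arrives at time tsMs (timestamps assumed non-decreasing).
    public void add(String record, long tsMs) {
        // Close every interval that ended before this record arrived.
        // Intervals with no records produce an empty batch.
        while (tsMs >= batchStartMs + intervalMs) {
            completed.add(current);
            current = new ArrayList<>();
            batchStartMs += intervalMs;
        }
        current.add(record);
    }

    // Batches that are ready for processing; the open batch is not included.
    public List<List<String>> completedBatches() {
        return completed;
    }
}
```

With a 1000 ms interval, a record that arrives 100 ms into the interval is not visible to downstream processing until the interval closes, so the worst-case added latency is roughly one batch interval plus the processing time.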

The drawback is that you now have a long-running Spark job (assuming it runs
under YARN), and that could become a problem in terms of security and resources.
(How well does YARN handle long-running jobs these days in a secured cluster?
Steve L. may have some insight…)

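Raw HDFS would become a problem because Apache HDFS is still WORM (write once,
read many): you can't update files in place, and a streaming ETL job tends to
leave you with lots of small files. (Do you want to write your own compaction
code? Or use Hive 1.x+?)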
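
For a feel of what "your own compaction code" involves, here is a minimal sketch. It uses the local filesystem via java.nio as a stand-in for HDFS, so it is not HDFS client code. The class name SmallFileCompactor, the ".part" suffix, and the ".tmp" naming are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: local files stand in for HDFS files.
public class SmallFileCompactor {

    // Merges all "*.part" files in dir into a new file at target and
    // returns the number of files merged. Nothing is modified in place:
    // a new file is written, then the small inputs are deleted.
    public static int compact(Path dir, Path target) throws IOException {
        List<Path> parts;
        try (Stream<Path> listing = Files.list(dir)) {
            parts = listing
                    .filter(p -> p.getFileName().toString().endsWith(".part"))
                    .sorted() // path order, so the merge order is deterministic
                    .collect(Collectors.toList());
        }
        if (parts.isEmpty()) {
            return 0;
        }
        // Write to a temporary name first so readers never see a half-written file.
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
        // Fails if target already exists; each compaction run needs a fresh name.
        Files.move(tmp, target);
        for (Path part : parts) {
            Files.delete(part);
        }
        return parts.size();
    }
}
```

A real version also has to cope with readers that still hold the old files open, a crash between the move and the deletes, and new files landing mid-run, which is where most of the work is.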

HBase? Depending on your admin… stability could be a problem. 
Cassandra? That would be a separate cluster and that in itself could be a 
problem… 

YMMV, so you need to weigh the pros and cons of each tool against your
environment and skill level.

HTH

-Mike

> On Sep 29, 2016, at 8:54 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
> 
> I have a somewhat tricky use case, and I'm looking for ideas.
> 
> I have 5-6 Kafka producers, reading various APIs, and writing their raw data 
> into Kafka.
> 
> I need to:
> 
> - Do ETL on the data, and standardize it.
> 
> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
> ElasticSearch / Postgres)
> 
> - Query this data to generate reports / analytics (There will be a web UI 
> which will be the front-end to the data, and will show the reports)
> 
> Java is being used as the backend language for everything (backend of the web 
> UI, as well as the ETL layer)
> 
> I'm considering:
> 
> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive 
> raw data from Kafka, standardize & store it)
> 
> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, and 
> to allow queries
> 
> - In the backend of the web UI, I could either use Spark to run queries 
> across the data (mostly filters), or directly run queries against Cassandra / 
> HBase
> 
> I'd appreciate some thoughts / suggestions on which of these alternatives I 
> should go with (e.g, using raw Kafka consumers vs Spark for ETL, which 
> persistent data store to use, and how to query that data store in the backend 
> of the web UI, for displaying the reports).
> 
> 
> Thanks.
