Since the inflow is huge, Flume would also need to run with multiple channels in a distributed fashion, and in that case resource utilization will be high as well.
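A rough, untested sketch of one such agent (Flume 1.7 Kafka-source property
names; brokers, topic, paths and the rest are placeholders), with two HDFS
sinks draining the same channel to raise write throughput:

KafkaAgent.sources = kafka-source
KafkaAgent.channels = ch1
KafkaAgent.sinks = hdfs-sink-1 hdfs-sink-2

KafkaAgent.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
KafkaAgent.sources.kafka-source.kafka.bootstrap.servers = broker1:9092,broker2:9092
KafkaAgent.sources.kafka-source.kafka.topics = raw-events
# Agents started with the same group id split the topic's partitions,
# so adding agents scales the ingest horizontally
KafkaAgent.sources.kafka-source.kafka.consumer.group.id = flume-hdfs
KafkaAgent.sources.kafka-source.channels = ch1

# A file channel survives agent restarts; a memory channel is faster but lossy
KafkaAgent.channels.ch1.type = file

# Each event goes to exactly one sink, so two sinks on one channel roughly
# double the drain rate; distinct file prefixes avoid filename collisions
KafkaAgent.sinks.hdfs-sink-1.type = hdfs
KafkaAgent.sinks.hdfs-sink-1.channel = ch1
KafkaAgent.sinks.hdfs-sink-1.hdfs.path = hdfs:///data/raw/%Y-%m-%d
KafkaAgent.sinks.hdfs-sink-1.hdfs.filePrefix = sink1
KafkaAgent.sinks.hdfs-sink-1.hdfs.useLocalTimeStamp = true
KafkaAgent.sinks.hdfs-sink-2.type = hdfs
KafkaAgent.sinks.hdfs-sink-2.channel = ch1
KafkaAgent.sinks.hdfs-sink-2.hdfs.path = hdfs:///data/raw/%Y-%m-%d
KafkaAgent.sinks.hdfs-sink-2.hdfs.filePrefix = sink2
KafkaAgent.sinks.hdfs-sink-2.hdfs.useLocalTimeStamp = true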
Thanks
Deepak

On Thu, Sep 29, 2016 at 8:11 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
>
> You don't need Spark Streaming to read data from Kafka and store it on
> HDFS. It is a waste of resources.
>
> Couple Flume to use Kafka as the source and HDFS as the sink directly:
>
> KafkaAgent.sources = kafka-sources
> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>
> That will be your batch layer. To analyse, you can read the HDFS files
> directly with Spark, or simply store the data in a database of your choice
> via cron or something. Do not mix your batch layer with your speed layer.
>
> Your speed layer will ingest the same data directly from Kafka into Spark
> Streaming, and that will be online or near real time (defined by your
> window).
>
> Then you have a serving layer to present data from both the speed layer
> (the one from Spark Streaming) and the batch layer.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com> wrote:
>
>> The web UI is actually the speed layer; it needs to be able to query the
>> data online and show the results in real time.
>>
>> It also needs a custom front-end, so a system like Tableau can't be used;
>> it must have a custom backend + front-end.
>>
>> Thanks for the recommendation of Flume. Do you think this will work:
>>
>> - Spark Streaming to read data from Kafka
>> - Storing the data on HDFS using Flume
>> - Using Spark to query the data in the backend of the web UI?
>>
>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> You need a batch layer and a speed layer. Data from Kafka can be stored
>>> on HDFS using Flume.
>>>
>>> - Query this data to generate reports / analytics (There will be a web
>>> UI which will be the front-end to the data, and will show the reports)
>>>
>>> This is basically the batch layer, and you need something like Tableau
>>> or Zeppelin to query the data.
>>>
>>> You will also need Spark Streaming to query data online for the speed
>>> layer. That data could be stored in some transient fabric like Ignite or
>>> even Druid.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
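For the speed layer Mich describes, a bare-bones Java sketch using the
spark-streaming-kafka-0-10 direct stream (brokers, topic and group id are
placeholders, and the foreachRDD body is only a stand-in for whatever
standardization and serving logic you need):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class SpeedLayer {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("speed-layer");
        // The batch interval is the "window" that defines near real time
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "speed-layer");

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("raw-events"), kafkaParams));

        // Keep only the message payload, then push each micro-batch out;
        // println is a placeholder for writes to the serving store
        stream.map(ConsumerRecord::value)
              .foreachRDD(rdd -> rdd.foreach(v -> System.out.println(v)));

        jssc.start();
        jssc.awaitTermination();
    }
}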
>>>
>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>
>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>
>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>
>>>>> What is the message inflow? If it's really high, Spark will
>>>>> definitely be of great use.
>>>>>
>>>>> Thanks
>>>>> Deepak
>>>>>
>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>>>>>
>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>
>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>> raw data into Kafka.
>>>>>>
>>>>>> I need to:
>>>>>>
>>>>>> - Do ETL on the data, and standardize it.
>>>>>>
>>>>>> - Store the standardized data somewhere (HBase / Cassandra / raw HDFS
>>>>>> / Elasticsearch / Postgres)
>>>>>>
>>>>>> - Query this data to generate reports / analytics (there will be a
>>>>>> web UI which will be the front-end to the data, and will show the reports)
>>>>>>
>>>>>> Java is being used as the backend language for everything (the
>>>>>> backend of the web UI, as well as the ETL layer).
>>>>>>
>>>>>> I'm considering:
>>>>>>
>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>> (receive raw data from Kafka, standardize it, and store it)
>>>>>>
>>>>>> - Using Cassandra, HBase, or raw HDFS to store the standardized data
>>>>>> and to allow queries
>>>>>>
>>>>>> - In the backend of the web UI, either using Spark to run queries
>>>>>> across the data (mostly filters), or running queries directly against
>>>>>> Cassandra / HBase
>>>>>>
>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>> alternatives I should go with (e.g. raw Kafka consumers vs Spark for
>>>>>> ETL, which persistent data store to use, and how to query that data
>>>>>> store in the backend of the web UI to display the reports).
>>>>>>
>>>>>> Thanks.

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
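PS. If the web UI's backend ends up querying HDFS with Spark directly, the
batch side can start as small as this Java sketch (Spark 2.0 SparkSession;
the path, the JSON layout and the column names are made-up placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BatchQueries {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("batch-queries")
                .getOrCreate();

        // Read whatever Flume landed on HDFS (assuming JSON records here)
        Dataset<Row> events = spark.read().json("hdfs:///data/raw/*");
        events.createOrReplaceTempView("events");

        // The reports are mostly filters, so plain Spark SQL covers them;
        // "source" and "event_date" are hypothetical column names
        Dataset<Row> report = spark.sql(
            "SELECT source, COUNT(*) AS cnt FROM events "
            + "WHERE event_date >= '2016-09-01' GROUP BY source");
        report.show();
    }
}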