Hi Cody
The Spark direct stream is just fine for this use case.
But why Postgres and not Cassandra?
Is there anything specific here that I may not be aware of?

Thanks
Deepak

On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <c...@koeninger.org> wrote:

> How are you going to handle ETL failures? Do you care about lost /
> duplicated data? Are your writes idempotent?
>
> Absent any other information about the problem, I'd stay away from
> Cassandra/Flume/HDFS/HBase/whatever, and use a Spark direct stream
> feeding Postgres.
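>
> To make that concrete, a rough sketch of a direct stream feeding Postgres
> might look like the following (Java, spark-streaming-kafka-0-10). The topic,
> table, and column names ("raw-events", "events", "event_id", "payload") are
> placeholders, not anything agreed in this thread; the ON CONFLICT clause
> (Postgres 9.5+) is what makes each write idempotent, so a replayed batch
> after a failure does not duplicate rows:
>
>   import java.sql.Connection;
>   import java.sql.DriverManager;
>   import java.sql.PreparedStatement;
>   import java.util.Collections;
>   import java.util.HashMap;
>   import java.util.Map;
>
>   import org.apache.kafka.clients.consumer.ConsumerRecord;
>   import org.apache.kafka.common.serialization.StringDeserializer;
>   import org.apache.spark.SparkConf;
>   import org.apache.spark.streaming.Durations;
>   import org.apache.spark.streaming.api.java.JavaInputDStream;
>   import org.apache.spark.streaming.api.java.JavaStreamingContext;
>   import org.apache.spark.streaming.kafka010.ConsumerStrategies;
>   import org.apache.spark.streaming.kafka010.KafkaUtils;
>   import org.apache.spark.streaming.kafka010.LocationStrategies;
>
>   public class KafkaToPostgres {
>     public static void main(String[] args) throws Exception {
>       SparkConf conf = new SparkConf().setAppName("kafka-to-postgres");
>       JavaStreamingContext ssc =
>           new JavaStreamingContext(conf, Durations.seconds(10));
>
>       Map<String, Object> kafkaParams = new HashMap<>();
>       kafkaParams.put("bootstrap.servers", "localhost:9092");
>       kafkaParams.put("key.deserializer", StringDeserializer.class);
>       kafkaParams.put("value.deserializer", StringDeserializer.class);
>       kafkaParams.put("group.id", "etl");
>       kafkaParams.put("enable.auto.commit", false);
>
>       // Direct stream: one Kafka partition maps to one Spark partition.
>       JavaInputDStream<ConsumerRecord<String, String>> stream =
>           KafkaUtils.createDirectStream(
>               ssc,
>               LocationStrategies.PreferConsistent(),
>               ConsumerStrategies.<String, String>Subscribe(
>                   Collections.singletonList("raw-events"), kafkaParams));
>
>       stream.foreachRDD(rdd -> rdd.foreachPartition(records -> {
>         // One connection per partition; the upsert keeps writes idempotent.
>         try (Connection c = DriverManager.getConnection(
>                  "jdbc:postgresql://localhost/etl");
>              PreparedStatement ps = c.prepareStatement(
>                  "INSERT INTO events (event_id, payload) VALUES (?, ?) "
>                  + "ON CONFLICT (event_id) DO NOTHING")) {
>           while (records.hasNext()) {
>             ConsumerRecord<String, String> r = records.next();
>             ps.setString(1, r.key());   // assumes the key is a stable id
>             ps.setString(2, r.value());
>             ps.addBatch();
>           }
>           ps.executeBatch();
>         }
>       }));
>
>       ssc.start();
>       ssc.awaitTermination();
>     }
>   }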
>
> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
> > Is there an advantage to that vs directly consuming from Kafka? Nothing is
> > being done to the data except some light ETL and then storing it in
> > Cassandra.
> >
> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <deepakmc...@gmail.com>
> > wrote:
> >>
> >> It's better to use Spark's direct stream to ingest from Kafka.
> >>
> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com>
> >> wrote:
> >>>
> >>> I don't think I need separate speed storage and batch storage. Just
> >>> taking in raw data from Kafka, standardizing it, and storing it somewhere
> >>> the web UI can query seems like it will be enough.
> >>>
> >>> I'm thinking about:
> >>>
> >>> - Reading data from Kafka via Spark Streaming
> >>> - Standardizing, then storing it in Cassandra
> >>> - Querying Cassandra from the web ui
> >>>
> >>> That seems like it will work. My question now is whether to use Spark
> >>> Streaming to read Kafka, or use Kafka consumers directly.
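> >>>
> >>> For comparison, the "raw Kafka consumers" route would be roughly the
> >>> sketch below: a plain poll loop that standardizes each record and writes
> >>> it to Cassandra with the DataStax Java driver. The keyspace/table names
> >>> ("etl", "events") and the standardize() helper are made-up placeholders.
> >>> Since Cassandra inserts are upserts on the primary key, a record replayed
> >>> after a consumer restart just overwrites the same row:
> >>>
> >>>   import java.util.Collections;
> >>>   import java.util.Properties;
> >>>
> >>>   import com.datastax.driver.core.Cluster;
> >>>   import com.datastax.driver.core.PreparedStatement;
> >>>   import com.datastax.driver.core.Session;
> >>>   import org.apache.kafka.clients.consumer.ConsumerRecord;
> >>>   import org.apache.kafka.clients.consumer.ConsumerRecords;
> >>>   import org.apache.kafka.clients.consumer.KafkaConsumer;
> >>>
> >>>   public class RawConsumerEtl {
> >>>     public static void main(String[] args) {
> >>>       Properties props = new Properties();
> >>>       props.put("bootstrap.servers", "localhost:9092");
> >>>       props.put("group.id", "etl");
> >>>       props.put("key.deserializer",
> >>>           "org.apache.kafka.common.serialization.StringDeserializer");
> >>>       props.put("value.deserializer",
> >>>           "org.apache.kafka.common.serialization.StringDeserializer");
> >>>
> >>>       Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
> >>>       Session session = cluster.connect("etl");
> >>>       PreparedStatement insert =
> >>>           session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)");
> >>>
> >>>       try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
> >>>         consumer.subscribe(Collections.singletonList("raw-events"));
> >>>         while (true) {
> >>>           ConsumerRecords<String, String> records = consumer.poll(100);
> >>>           for (ConsumerRecord<String, String> r : records) {
> >>>             // Cassandra writes are idempotent per primary key.
> >>>             session.execute(insert.bind(r.key(), standardize(r.value())));
> >>>           }
> >>>           consumer.commitSync();
> >>>         }
> >>>       }
> >>>     }
> >>>
> >>>     // Placeholder for whatever light ETL / standardization is needed.
> >>>     private static String standardize(String raw) {
> >>>       return raw.trim();
> >>>     }
> >>>   }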
> >>>
> >>>
> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
> >>> <mich.talebza...@gmail.com> wrote:
> >>>>
> >>>> - Spark Streaming to read data from Kafka
> >>>> - Storing the data on HDFS using Flume
> >>>>
> >>>> You don't need Spark Streaming to read data from Kafka and store it on
> >>>> HDFS; that is a waste of resources.
> >>>>
> >>>> Couple Flume to Kafka as the source and HDFS as the sink directly:
> >>>>
> >>>> KafkaAgent.sources = kafka-sources
> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
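> >>>>
> >>>> (A fuller version of that agent definition might look roughly like the
> >>>> following. The property names follow the Flume 1.7 Kafka source; older
> >>>> releases use zookeeperConnect/topic instead, and the topic name and HDFS
> >>>> path here are only placeholders.)
> >>>>
> >>>>   KafkaAgent.sources = kafka-sources
> >>>>   KafkaAgent.channels = mem-channel
> >>>>   KafkaAgent.sinks = hdfs-sinks
> >>>>
> >>>>   KafkaAgent.sources.kafka-sources.type = org.apache.flume.source.kafka.KafkaSource
> >>>>   KafkaAgent.sources.kafka-sources.kafka.bootstrap.servers = localhost:9092
> >>>>   KafkaAgent.sources.kafka-sources.kafka.topics = raw-events
> >>>>   KafkaAgent.sources.kafka-sources.channels = mem-channel
> >>>>
> >>>>   KafkaAgent.channels.mem-channel.type = memory
> >>>>
> >>>>   KafkaAgent.sinks.hdfs-sinks.type = hdfs
> >>>>   KafkaAgent.sinks.hdfs-sinks.hdfs.path = hdfs:///data/raw-events/%Y-%m-%d
> >>>>   KafkaAgent.sinks.hdfs-sinks.hdfs.useLocalTimeStamp = true
> >>>>   KafkaAgent.sinks.hdfs-sinks.hdfs.fileType = DataStream
> >>>>   KafkaAgent.sinks.hdfs-sinks.channel = mem-channel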
> >>>>
> >>>> That will be your batch layer. To analyse it, you can read the HDFS
> >>>> files directly with Spark, or simply store the data in a database of
> >>>> your choice via cron or something. Do not mix your batch layer with
> >>>> your speed layer.
> >>>>
> >>>> Your speed layer will ingest the same data directly from Kafka into
> >>>> Spark Streaming, and that will be online or near real time (defined by
> >>>> your window).
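> >>>>
> >>>> (As a toy illustration of the window part: assuming "stream" is a Kafka
> >>>> direct stream of ConsumerRecord<String, String> and eventType() is a
> >>>> placeholder for parsing the payload, a sliding count per event type
> >>>> would be roughly this fragment in Java:)
> >>>>
> >>>>   import scala.Tuple2;
> >>>>   import org.apache.spark.streaming.Durations;
> >>>>   import org.apache.spark.streaming.api.java.JavaPairDStream;
> >>>>
> >>>>   // Counts per event type over the last 60 seconds, refreshed every 10s.
> >>>>   JavaPairDStream<String, Long> countsByType = stream
> >>>>       .mapToPair(record -> new Tuple2<>(eventType(record.value()), 1L))
> >>>>       .reduceByKeyAndWindow(
> >>>>           (a, b) -> a + b,
> >>>>           Durations.seconds(60),   // window length
> >>>>           Durations.seconds(10));  // slide interval
> >>>>   countsByType.print();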
> >>>>
> >>>> Then you have a serving layer to present data from both the speed layer
> >>>> (the one from Spark Streaming) and the batch layer.
> >>>>
> >>>> HTH
> >>>>
> >>>> Dr Mich Talebzadeh
> >>>>
> >>>> LinkedIn:
> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>>
> >>>> http://talebzadehmich.wordpress.com
> >>>>
> >>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
> >>>> loss, damage or destruction of data or any other property which may arise
> >>>> from relying on this email's technical content is explicitly disclaimed.
> >>>> The author will in no case be liable for any monetary damages arising
> >>>> from such loss, damage or destruction.
> >>>>
> >>>>
> >>>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> The web UI is actually the speed layer; it needs to be able to query
> >>>>> the data online and show the results in real time.
> >>>>>
> >>>>> It also needs a custom front-end, so a system like Tableau can't be
> >>>>> used; it must have a custom backend + front-end.
> >>>>>
> >>>>> Thanks for the recommendation of Flume. Do you think this will work:
> >>>>>
> >>>>> - Spark Streaming to read data from Kafka
> >>>>> - Storing the data on HDFS using Flume
> >>>>> - Using Spark to query the data in the backend of the web UI?
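> >>>>>
> >>>>> (For the last bullet, one hedged sketch: a long-lived SparkSession in
> >>>>> the Java backend reading whatever Flume lands on HDFS. The path, the
> >>>>> "ts" field, and the JSON format are assumptions made up for the
> >>>>> example, and the string-built filter is only for illustration.)
> >>>>>
> >>>>>   import org.apache.spark.sql.Dataset;
> >>>>>   import org.apache.spark.sql.Row;
> >>>>>   import org.apache.spark.sql.SparkSession;
> >>>>>
> >>>>>   public class ReportQueries {
> >>>>>     // One shared session for the web backend (Spark 2.x).
> >>>>>     private final SparkSession spark = SparkSession.builder()
> >>>>>         .appName("report-queries")
> >>>>>         .getOrCreate();
> >>>>>
> >>>>>     // An example "mostly filters" query over the files on HDFS.
> >>>>>     public long eventsSince(String isoTimestamp) {
> >>>>>       Dataset<Row> events = spark.read().json("hdfs:///data/raw-events/*");
> >>>>>       return events.filter("ts >= '" + isoTimestamp + "'").count();
> >>>>>     }
> >>>>>   }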
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
> >>>>> <mich.talebza...@gmail.com> wrote:
> >>>>>>
> >>>>>> You need a batch layer and a speed layer. Data from Kafka can be
> >>>>>> stored on HDFS using Flume.
> >>>>>>
> >>>>>> - Query this data to generate reports / analytics (There will be a
> >>>>>> web UI which will be the front-end to the data, and will show the
> >>>>>> reports)
> >>>>>>
> >>>>>> This is basically the batch layer, and you need something like Tableau
> >>>>>> or Zeppelin to query the data.
> >>>>>>
> >>>>>> You will also need Spark Streaming to query data online for the speed
> >>>>>> layer. That data could be stored in some transient fabric like Ignite,
> >>>>>> or even Druid.
> >>>>>>
> >>>>>> HTH
> >>>>>>
> >>>>>> Dr Mich Talebzadeh
> >>>>>>
> >>>>>> LinkedIn:
> >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>>>>
> >>>>>> http://talebzadehmich.wordpress.com
> >>>>>>
> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> It needs to be able to scale to a very large amount of data, yes.
> >>>>>>>
> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
> >>>>>>> <deepakmc...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> What is the message inflow?
> >>>>>>>> If it's really high, Spark will definitely be of great use.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Deepak
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
> >>>>>>>>>
> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
> >>>>>>>>> raw data into Kafka.
> >>>>>>>>>
> >>>>>>>>> I need to:
> >>>>>>>>>
> >>>>>>>>> - Do ETL on the data, and standardize it.
> >>>>>>>>>
> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw
> >>>>>>>>> HDFS / ElasticSearch / Postgres)
> >>>>>>>>>
> >>>>>>>>> - Query this data to generate reports / analytics (There will be a
> >>>>>>>>> web UI which will be the front-end to the data, and will show the
> >>>>>>>>> reports)
> >>>>>>>>>
> >>>>>>>>> Java is being used as the backend language for everything (backend
> >>>>>>>>> of the web UI, as well as the ETL layer).
> >>>>>>>>>
> >>>>>>>>> I'm considering:
> >>>>>>>>>
> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
> >>>>>>>>>
> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS for storing the standardized
> >>>>>>>>> data, and to allow queries
> >>>>>>>>>
> >>>>>>>>> - In the backend of the web UI, I could either use Spark to run
> >>>>>>>>> queries across the data (mostly filters), or directly run queries
> >>>>>>>>> against Cassandra / HBase
> >>>>>>>>>
> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
> >>>>>>>>> alternatives I should go with (e.g., using raw Kafka consumers vs
> >>>>>>>>> Spark for ETL, which persistent data store to use, and how to query
> >>>>>>>>> that data store in the backend of the web UI, for displaying the
> >>>>>>>>> reports).
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> Thanks
> >> Deepak
> >> www.bigdatabig.com
> >> www.keosha.net
> >
> >
>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
