How are you going to handle ETL failures?  Do you care about lost /
duplicated data?  Are your writes idempotent?

Absent any other information about the problem, I'd stay away from
Cassandra/Flume/HDFS/HBase/whatever, and use a Spark direct stream
feeding Postgres.
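
Something along these lines, as a rough sketch (assuming Spark 2.0 with the
spark-streaming-kafka-0-10 integration and a Postgres table
events(id text primary key, payload text); the broker, topic, table and
connection details are placeholders, and offset management is elided).
The ON CONFLICT upsert is what makes a replayed batch safe:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class KafkaToPostgres {
  public static void main(String[] args) throws Exception {
    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "broker1:9092"); // placeholder
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "etl-job");
    kafkaParams.put("enable.auto.commit", false);

    JavaStreamingContext jssc = new JavaStreamingContext(
        new SparkConf().setAppName("kafka-to-postgres"), Durations.seconds(5));

    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Arrays.asList("events"), kafkaParams));

    stream.foreachRDD(rdd -> rdd.foreachPartition(records -> {
      // One connection per partition; the upsert makes redelivered records harmless.
      try (Connection conn =
               DriverManager.getConnection("jdbc:postgresql://dbhost/etl");
           PreparedStatement ps = conn.prepareStatement(
               "INSERT INTO events (id, payload) VALUES (?, ?) "
                   + "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload")) {
        while (records.hasNext()) {
          ConsumerRecord<String, String> r = records.next();
          ps.setString(1, r.key());   // assumes the Kafka key is a stable unique id
          ps.setString(2, r.value());
          ps.addBatch();
        }
        ps.executeBatch();
      }
    }));

    jssc.start();
    jssc.awaitTermination();
  }
}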

On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
> Is there an advantage to that vs directly consuming from Kafka? Nothing is
> being done to the data except some light ETL and then storing it in
> Cassandra.
>
> On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>>
>> It's better to use Spark's direct stream to ingest from Kafka.
>>
>> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>
>>> I don't think I need separate speed and batch storage. Just taking in raw
>>> data from Kafka, standardizing it, and storing it somewhere the web UI can
>>> query it seems like it will be enough.
>>>
>>> I'm thinking about:
>>>
>>> - Reading data from Kafka via Spark Streaming
>>> - Standardizing, then storing it in Cassandra
>>> - Querying Cassandra from the web ui
>>>
>>> That seems like it will work. My question now is whether to use Spark
>>> Streaming to read Kafka, or use Kafka consumers directly.
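>>>
>>> Roughly, I'm picturing something like this for the middle step (a sketch
>>> using the DataStax spark-cassandra-connector Java API, given a Kafka direct
>>> stream like the one discussed; the keyspace/table names, the Event bean and
>>> the standardize() helper are placeholders I made up):
>>>
>>> import org.apache.spark.api.java.JavaRDD;
>>> import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
>>> import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
>>>
>>> // Bean mapping to a Cassandra table events(id text PRIMARY KEY, payload text)
>>> public class Event implements java.io.Serializable {
>>>   private String id;
>>>   private String payload;
>>>   public Event() {}
>>>   public Event(String id, String payload) { this.id = id; this.payload = payload; }
>>>   public String getId() { return id; }
>>>   public String getPayload() { return payload; }
>>> }
>>>
>>> stream.foreachRDD(rdd -> {
>>>   JavaRDD<Event> events = rdd.map(r -> standardize(r.value())); // the light ETL
>>>   javaFunctions(events)
>>>       .writerBuilder("analytics", "events", mapToRow(Event.class))
>>>       .saveToCassandra();
>>> });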
>>>
>>>
>>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>>> <mich.talebza...@gmail.com> wrote:
>>>>
>>>> - Spark Streaming to read data from Kafka
>>>> - Storing the data on HDFS using Flume
>>>>
>>>> You don't need Spark Streaming to read data from Kafka and store it on
>>>> HDFS. It is a waste of resources.
>>>>
>>>> Couple Flume to use Kafka as the source and HDFS as the sink directly:
>>>>
>>>> KafkaAgent.sources = kafka-source
>>>> KafkaAgent.channels = mem-channel
>>>> KafkaAgent.sinks = hdfs-sink
>>>> KafkaAgent.sinks.hdfs-sink.type = hdfs
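>>>>
>>>> and the rest of the wiring along these lines (a sketch: the kafka.*
>>>> property names are from the Flume 1.7 Kafka source, older releases use
>>>> zookeeperConnect/topic instead, and the broker, topic and path are
>>>> placeholders):
>>>>
>>>> KafkaAgent.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
>>>> KafkaAgent.sources.kafka-source.kafka.bootstrap.servers = broker1:9092
>>>> KafkaAgent.sources.kafka-source.kafka.topics = events
>>>> KafkaAgent.sources.kafka-source.channels = mem-channel
>>>>
>>>> KafkaAgent.channels.mem-channel.type = memory
>>>>
>>>> KafkaAgent.sinks.hdfs-sink.channel = mem-channel
>>>> KafkaAgent.sinks.hdfs-sink.hdfs.path = /data/kafka/events
>>>> KafkaAgent.sinks.hdfs-sink.hdfs.fileType = DataStream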
>>>>
>>>> That will be your batch layer. For analysis you can read the HDFS files
>>>> directly with Spark, or simply load the data into a database of your
>>>> choice via cron or something. Do not mix your batch layer with your
>>>> speed layer.
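>>>>
>>>> For that analysis step, reading the files back with Spark is short (a
>>>> sketch, assuming Spark 2.0 and JSON events; the path and the "type"
>>>> column are placeholders):
>>>>
>>>> import org.apache.spark.sql.Dataset;
>>>> import org.apache.spark.sql.Row;
>>>> import org.apache.spark.sql.SparkSession;
>>>>
>>>> SparkSession spark = SparkSession.builder().appName("batch-reports").getOrCreate();
>>>> // Read whatever Flume has rolled onto HDFS
>>>> Dataset<Row> events = spark.read().json("hdfs:///data/kafka/events");
>>>> events.groupBy("type").count().show();   // e.g. a simple report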
>>>>
>>>> Your speed layer will ingest the same data directly from Kafka into
>>>> Spark Streaming, and that will be online or near real-time (defined by
>>>> your window).
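>>>>
>>>> e.g., given a Kafka direct DStream called stream (a sketch; the 60s window
>>>> sliding every 10s is arbitrary):
>>>>
>>>> jssc.checkpoint("hdfs:///checkpoints/speed");   // windowed ops need a checkpoint dir
>>>> stream.map(r -> r.value())                      // just the message payload
>>>>       .countByWindow(Durations.seconds(60), Durations.seconds(10))
>>>>       .print();                                 // refreshed every slide interval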
>>>>
>>>> Then you have a serving layer to present data from both the speed layer
>>>> (the one from Spark Streaming) and the batch layer.
>>>>
>>>> HTH
>>>>
>>>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>>
>>>>> The web UI is actually the speed layer: it needs to be able to query
>>>>> the data online and show the results in real time.
>>>>>
>>>>> It also needs a custom front-end, so a system like Tableau can't be
>>>>> used; it must have a custom backend and front-end.
>>>>>
>>>>> Thanks for the recommendation of Flume. Do you think this will work:
>>>>>
>>>>> - Spark Streaming to read data from Kafka
>>>>> - Storing the data on HDFS using Flume
>>>>> - Using Spark to query the data in the backend of the web UI?
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> You need a batch layer and a speed layer. Data from Kafka can be
>>>>>> stored on HDFS using Flume.
>>>>>>
>>>>>> -  Query this data to generate reports / analytics (There will be a
>>>>>> web UI which will be the front-end to the data, and will show the 
>>>>>> reports)
>>>>>>
>>>>>> This is basically the batch layer, and you need something like Tableau
>>>>>> or Zeppelin to query the data.
>>>>>>
>>>>>> You will also need Spark Streaming to query data online for the speed
>>>>>> layer. That data could be stored in some transient fabric like Ignite
>>>>>> or even Druid.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>>>>
>>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>>>>>>> <deepakmc...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> What is the message inflow?
>>>>>>>> If it's really high, Spark will definitely be of great use.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Deepak
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>>>>
>>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>>>>> raw data into Kafka.
>>>>>>>>>
>>>>>>>>> I need to:
>>>>>>>>>
>>>>>>>>> - Do ETL on the data, and standardize it.
>>>>>>>>>
>>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw
>>>>>>>>> HDFS / ElasticSearch / Postgres)
>>>>>>>>>
>>>>>>>>> - Query this data to generate reports / analytics (There will be a
>>>>>>>>> web UI which will be the front-end to the data, and will show the 
>>>>>>>>> reports)
>>>>>>>>>
>>>>>>>>> Java is being used as the backend language for everything (backend
>>>>>>>>> of the web UI, as well as the ETL layer).
>>>>>>>>>
>>>>>>>>> I'm considering:
>>>>>>>>>
>>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>>>>>
>>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>>>>>> data, and to allow queries
>>>>>>>>>
>>>>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>>>>> queries across the data (mostly filters), or directly run queries 
>>>>>>>>> against
>>>>>>>>> Cassandra / HBase
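>>>>>>>>>
>>>>>>>>> For the last option, I'm picturing the plain DataStax Java driver in
>>>>>>>>> the web backend; a sketch with a made-up keyspace, table, and
>>>>>>>>> partition key:
>>>>>>>>>
>>>>>>>>> import com.datastax.driver.core.Cluster;
>>>>>>>>> import com.datastax.driver.core.ResultSet;
>>>>>>>>> import com.datastax.driver.core.Row;
>>>>>>>>> import com.datastax.driver.core.Session;
>>>>>>>>>
>>>>>>>>> Cluster cluster = Cluster.builder().addContactPoint("cassandra-host").build();
>>>>>>>>> Session session = cluster.connect("analytics");
>>>>>>>>> // Filter-style report query; "day" is a made-up partition key
>>>>>>>>> ResultSet rs = session.execute(
>>>>>>>>>     "SELECT id, payload FROM events WHERE day = ?", "2016-09-29");
>>>>>>>>> for (Row row : rs) {
>>>>>>>>>   // render into the report
>>>>>>>>> }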
>>>>>>>>>
>>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>>>>> alternatives I should go with (e.g., using raw Kafka consumers vs Spark for
>>>>>>>>> Spark for
>>>>>>>>> ETL, which persistent data store to use, and how to query that data 
>>>>>>>>> store in
>>>>>>>>> the backend of the web UI, for displaying the reports).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>
>
