Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
I have a somewhat tricky use case, and I'm looking for ideas.

I have 5-6 Kafka producers, reading various APIs, and writing their raw
data into Kafka.

I need to:

- Do ETL on the data, and standardize it.

- Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
ElasticSearch / Postgres)

- Query this data to generate reports / analytics (There will be a web UI
which will be the front-end to the data, and will show the reports)

Java is being used as the backend language for everything (the backend of
the web UI, as well as the ETL layer).

I'm considering:

- Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive
raw data from Kafka, standardize & store it)

- Using Cassandra, HBase, or raw HDFS, for storing the standardized data,
and to allow queries

- In the backend of the web UI, I could either use Spark to run queries
across the data (mostly filters), or directly run queries against Cassandra
/ HBase

I'd appreciate some thoughts / suggestions on which of these alternatives I
should go with (e.g., using raw Kafka consumers vs Spark for ETL, which
persistent data store to use, and how to query that data store in the
backend of the web UI, for displaying the reports).


Thanks.


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
It needs to be able to scale to a very large amount of data, yes.



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
You need a batch layer and a speed layer. Data from Kafka can be stored on
HDFS using Flume.

- Query this data to generate reports / analytics (There will be a web UI
which will be the front-end to the data, and will show the reports)

This is basically the batch layer, and you need something like Tableau or
Zeppelin to query the data.

You will also need Spark Streaming to query data online for the speed
layer. That data could be stored in some transient fabric like Ignite or
even Druid.

HTH

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
What is the message inflow?
If it's really high, definitely Spark will be of great use.

Thanks
Deepak



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
The web UI is actually the speed layer; it needs to be able to query the
data online and show the results in real-time.

It also needs a custom front-end, so a system like Tableau can't be used;
it must have a custom backend + front-end.

Thanks for the recommendation of Flume. Do you think this will work:

- Spark Streaming to read data from Kafka
- Storing the data on HDFS using Flume
- Using Spark to query the data in the backend of the web UI?


RE: Architecture recommendations for a tricky use case

2016-09-29 Thread Tauzell, Dave
Spark Streaming needs to store the output somewhere.  Cassandra is a possible 
target for that.
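
For instance, with the DataStax spark-cassandra-connector's Java API, the
save might look roughly like this sketch (the keyspace, table, and Event
bean are made-up names; the bean's fields are assumed to match the table
columns):

import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import com.datastax.spark.connector.japi.CassandraStreamingJavaUtil;
import org.apache.spark.streaming.api.java.JavaDStream;

public class CassandraSink {

    // Hypothetical event bean; field names must match the Cassandra columns
    public static class Event implements java.io.Serializable {
        private String id;
        private String payload;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getPayload() { return payload; }
        public void setPayload(String payload) { this.payload = payload; }
    }

    // Persist each micro-batch of standardized events to Cassandra
    public static void save(JavaDStream<Event> standardized) {
        CassandraStreamingJavaUtil.javaFunctions(standardized)
            .writerBuilder("reports", "events", mapToRow(Event.class))
            .saveToCassandra();
    }
}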

-Dave



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
For the UI, you need a DB such as Cassandra that is designed to serve
queries.
Ingest the data into Spark Streaming (the speed layer) and write it to HDFS
(for the batch layer).
Now you have data at rest as well as in motion (real time).
From Spark Streaming itself, do further processing and write the final
result to Cassandra / a NoSQL DB, as in the sketch below.
The UI can pick the data up from the DB.
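
A rough Java sketch of that pipeline, using the receiver-less direct stream
API for Kafka 0.8 (the topic, broker, and HDFS path are made up, and the
standardize / Cassandra-write steps are placeholders):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class SpeedAndBatchPipeline {

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("speed-and-batch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "broker1:9092");
        Set<String> topics = Collections.singleton("raw-events");

        // Receiver-less "direct" stream; offsets are tracked by Spark itself
        JavaDStream<String> raw = KafkaUtils.createDirectStream(
                jssc, String.class, String.class,
                StringDecoder.class, StringDecoder.class,
                kafkaParams, topics)
            .map(record -> record._2());

        // Cached because two outputs consume it below
        JavaDStream<String> standardized =
            raw.map(SpeedAndBatchPipeline::standardize).cache();

        // Batch layer: append every micro-batch to HDFS
        standardized.foreachRDD((rdd, time) ->
            rdd.saveAsTextFile("hdfs:///data/standardized/" + time.milliseconds()));

        // Speed/serving layer: write the processed result to Cassandra (or another store)
        standardized.foreachRDD(rdd -> rdd.foreachPartition(events -> {
            while (events.hasNext()) {
                writeToCassandra(events.next()); // placeholder for a real driver call
            }
        }));

        jssc.start();
        jssc.awaitTermination();
    }

    private static String standardize(String rawEvent) { return rawEvent.trim(); } // placeholder ETL
    private static void writeToCassandra(String event) { /* placeholder */ }
}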

Thanks
Deepak

On Thu, Sep 29, 2016 at 8:00 PM, Alonso Isidoro Roman wrote:

> "Using Spark to query the data in the backend of the web UI?"
>
> Don't do that. I would recommend that the Spark Streaming process store
> data into some NoSQL or SQL database, and that the web UI query data from
> that database.
>
> Alonso Isidoro Roman
> about.me/alonso.isidoro.roman
>


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
- Spark Streaming to read data from Kafka
- Storing the data on HDFS using Flume

You don't need Spark Streaming to read data from Kafka and store it on
HDFS; that is a waste of resources.

Couple Flume to Kafka as the source and HDFS as the sink directly:

KafkaAgent.sources = kafka-sources
KafkaAgent.sinks.hdfs-sinks.type = hdfs
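
In full, a minimal agent definition along those lines might look like this
(the topic, ZooKeeper address, HDFS path, and channel sizing are
illustrative; property names here follow Flume 1.6, while newer releases
use kafka.bootstrap.servers and kafka.topics instead of zookeeperConnect
and topic):

KafkaAgent.sources = kafka-sources
KafkaAgent.channels = mem-channel
KafkaAgent.sinks = hdfs-sinks

KafkaAgent.sources.kafka-sources.type = org.apache.flume.source.kafka.KafkaSource
KafkaAgent.sources.kafka-sources.zookeeperConnect = zk1:2181
KafkaAgent.sources.kafka-sources.topic = raw-events
KafkaAgent.sources.kafka-sources.channels = mem-channel

KafkaAgent.channels.mem-channel.type = memory
KafkaAgent.channels.mem-channel.capacity = 10000

KafkaAgent.sinks.hdfs-sinks.type = hdfs
KafkaAgent.sinks.hdfs-sinks.channel = mem-channel
KafkaAgent.sinks.hdfs-sinks.hdfs.path = hdfs:///data/raw-events/%Y-%m-%d
KafkaAgent.sinks.hdfs-sinks.hdfs.fileType = DataStream
KafkaAgent.sinks.hdfs-sinks.hdfs.useLocalTimeStamp = true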

That will be for your batch layer. To analyse it, you can read the HDFS
files directly with Spark, or simply store the data in a database of your
choice via cron or something. Do not mix your batch layer with the speed
layer.

Your speed layer will ingest the same data directly from Kafka into Spark
Streaming, and that will be online or near real time (defined by your
window).

Then you have a serving layer to present data from both the speed layer
(the one from Spark Streaming) and the batch layer.

HTH

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
I don't think I need separate speed storage and batch storage. Just taking
in raw data from Kafka, standardizing it, and storing it somewhere the web
UI can query seems like it will be enough.

I'm thinking about:

- Reading data from Kafka via Spark Streaming
- Standardizing, then storing it in Cassandra
- Querying Cassandra from the web UI

That seems like it will work. My question now is whether to use Spark
Streaming to read Kafka, or use Kafka consumers directly.
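
For comparison, the raw-consumer version would not be much code either; a
minimal sketch with the Kafka Java client (the topic, group id, and the
standardize/store helpers are hypothetical placeholders):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EtlConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "etl-standardizer");
        props.put("enable.auto.commit", "false"); // commit only after a successful store
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("raw-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    String standardized = standardize(record.value()); // placeholder ETL step
                    store(standardized);                               // placeholder DB write
                }
                consumer.commitSync(); // offsets advance only once the batch is stored
            }
        }
    }

    private static String standardize(String raw) { return raw.trim(); }
    private static void store(String row) { /* placeholder */ }
}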



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Since the inflow is huge, Flume would also need to be run with multiple
channels in distributed fashion. In that case, the resource utilization
will be high as well.

Thanks
Deepak


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
It's better to use Spark's direct stream to ingest from Kafka.


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
How are you going to handle ETL failures? Do you care about lost /
duplicated data? Are your writes idempotent?

Absent any other information about the problem, I'd stay away from
Cassandra / Flume / HDFS / HBase / whatever, and use a Spark direct stream
feeding Postgres.
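
Given a direct stream like the one sketched earlier in the thread, the
Postgres write could be a per-partition JDBC batch; a rough fragment, where
the connection URL and table are made up, stream is assumed to be the
JavaPairInputDStream from a createDirectStream call, and parse is a
hypothetical helper splitting a record into id and payload:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

stream.foreachRDD(rdd -> rdd.foreachPartition(partition -> {
    // One connection per partition, not per record
    try (Connection conn = DriverManager.getConnection("jdbc:postgresql://db:5432/analytics");
         PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO events (event_id, payload) VALUES (?, ?)")) {
        while (partition.hasNext()) {
            String[] fields = parse(partition.next()._2()); // hypothetical id/payload split
            ps.setString(1, fields[0]);
            ps.setString(2, fields[1]);
            ps.addBatch();
        }
        ps.executeBatch(); // make this an upsert if the stream can replay records
    }
}));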


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
Is there an advantage to that vs directly consuming from Kafka? Nothing is
being done to the data except some light ETL, and then storing it in
Cassandra.


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Hi Cody
Spark direct stream is just fine for this use case.
But why Postgres and not Cassandra?
Is there anything specific here that I may not be aware of?

Thanks
Deepak


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
I wouldn't give up the flexibility and maturity of a relational
database unless you have a very specific use case. I'm not trashing
Cassandra; I've used Cassandra. But if all I know is that you're doing
analytics, I wouldn't want to give up the ability to easily do ad-hoc
aggregations without a lot of forethought. If you're worried about
scaling, there are several options for horizontally scaling Postgres
in particular. One of the current best, from what I've worked with, is
Citus.


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
My concern with Postgres / Cassandra is only scalability. I will look
further into Postgres horizontal scaling, thanks.

Writes could be made idempotent if done as upserts; otherwise updates will be
idempotent, but inserts will not.
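
For what it's worth, a minimal sketch of that upsert-style idempotent write
over plain JDBC against Postgres (the events table and its columns are
hypothetical stand-ins):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class UpsertExample {
    // Postgres "ON CONFLICT" upsert: replaying the same record leaves the
    // row in the same state, which is what makes the write idempotent.
    private static final String UPSERT_SQL =
        "INSERT INTO events (event_id, user_id, payload) VALUES (?, ?, ?) " +
        "ON CONFLICT (event_id) DO UPDATE SET payload = EXCLUDED.payload";

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/analytics", "app", "secret");
             PreparedStatement ps = conn.prepareStatement(UPSERT_SQL)) {
            ps.setString(1, "evt-123");  // unique key, so replays are harmless
            ps.setString(2, "user-42");
            ps.setString(3, "{\"source\":\"api-1\"}");
            ps.executeUpdate();
        }
    }
}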

Data should not be lost. The system should be as fault tolerant as possible.

What's the advantage of using Spark for reading Kafka instead of direct
Kafka consumers?

On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger wrote:

> I wouldn't give up the flexibility and maturity of a relational
> database, unless you have a very specific use case.  I'm not trashing
> cassandra, I've used cassandra, but if all I know is that you're doing
> analytics, I wouldn't want to give up the ability to easily do ad-hoc
> aggregations without a lot of forethought.  If you're worried about
> scaling, there are several options for horizontally scaling Postgres
> in particular.  One of the current best from what I've worked with is
> Citus.
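
As background on the Citus suggestion: Citus shards an ordinary Postgres
table across worker nodes by a distribution column, while the application
keeps speaking plain SQL. A rough sketch, assuming a Citus-enabled cluster
(table and column names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CitusSetup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://citus-coordinator:5432/analytics",
                "app", "secret");
             Statement st = conn.createStatement()) {
            // Ordinary Postgres DDL first. Note the distribution column
            // (user_id) has to be part of the primary key.
            st.execute("CREATE TABLE IF NOT EXISTS events ("
                + "event_id text, user_id text, payload jsonb, "
                + "PRIMARY KEY (user_id, event_id))");
            // Then ask Citus to shard the table by user_id, so per-user
            // report queries are served from a single worker.
            st.execute("SELECT create_distributed_table('events', 'user_id')");
        }
    }
}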
>
> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma 
> wrote:
> > Hi Cody
> > Spark direct stream is just fine for this use case.
> > But why postgres and not cassandra?
> > Is there anything specific here that i may not be aware?
> >
> > Thanks
> > Deepak
> >
> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger 
> wrote:
> >>
> >> How are you going to handle etl failures?  Do you care about lost /
> >> duplicated data?  Are your writes idempotent?
> >>
> >> Absent any other information about the problem, I'd stay away from
> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
> >> feeding postgres.
> >>
> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar 
> wrote:
> >> > Is there an advantage to that vs directly consuming from Kafka?
> Nothing
> >> > is
> >> > being done to the data except some light ETL and then storing it in
> >> > Cassandra
> >> >
> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma  >
> >> > wrote:
> >> >>
> >> >> Its better you use spark's direct stream to ingest from kafka.
> >> >>
> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar 
> >> >> wrote:
> >> >>>
> >> >>> I don't think I need a different speed storage and batch storage.
> Just
> >> >>> taking in raw data from Kafka, standardizing, and storing it
> somewhere
> >> >>> where
> >> >>> the web UI can query it, seems like it will be enough.
> >> >>>
> >> >>> I'm thinking about:
> >> >>>
> >> >>> - Reading data from Kafka via Spark Streaming
> >> >>> - Standardizing, then storing it in Cassandra
> >> >>> - Querying Cassandra from the web ui
> >> >>>
> >> >>> That seems like it will work. My question now is whether to use
> Spark
> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
> >> >>>
> >> >>>
> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
> >> >>>  wrote:
> >> 
> >>  - Spark Streaming to read data from Kafka
> >>  - Storing the data on HDFS using Flume
> >> 
> >>  You don't need Spark streaming to read data from Kafka and store on
> >>  HDFS. It is a waste of resources.
> >> 
> >>  Couple Flume to use Kafka as source and HDFS as sink directly
> >> 
> >>  KafkaAgent.sources = kafka-sources
> >>  KafkaAgent.sinks.hdfs-sinks.type = hdfs
> >> 
> >>  That will be for your batch layer. To analyse you can directly read
> >>  from
> >>  hdfs files with Spark or simply store data in a database of your
> >>  choice via
> >>  cron or something. Do not mix your batch layer with speed layer.
> >> 
> >>  Your speed layer will ingest the same data directly from Kafka into
> >>  spark streaming and that will be  online or near real time (defined
> >>  by your
> >>  window).
> >> 
> >>  Then you have a a serving layer to present data from both speed
> (the
> >>  one from SS) and batch layer.
> >> 
> >>  HTH
> >> 
> >> 
> >> 
> >> 
> >>  Dr Mich Talebzadeh
> >> 
> >> 
> >> 
> >>  LinkedIn
> >> 
> >>  https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >> 
> >> 
> >> 
> >>  http://talebzadehmich.wordpress.com
> >> 
> >> 
> >>  Disclaimer: Use it at your own risk. Any and all responsibility for
> >>  any
> >>  loss, damage or destruction of data or any other property which may
> >>  arise
> >>  from relying on this email's technical content is explicitly
> >>  disclaimed. The
> >>  author will in no case be liable for any monetary damages arising
> >>  from such
> >>  loss, damage or destruction.
> >> 
> >> 
> >> 
> >> 
> >>  On 29 September 2016 at 15:15, Ali Akhtar 
> >>  wrote:
> >> >
> >> > The web UI is actually the speed layer, it needs to be able to
> query
> >> > the data online, and show the results in real-time.
> >> >
> >> > It also needs a custom front-end, so a system like Tableau can't
> be
> >> > used, it must have a custom backend + front-end.
> >> >
> >> 

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Yes, but these writes from Spark still have to go through JDBC, correct?

Having said that, I don't see how going through Spark Streaming to Postgres
is going to be faster than source -> Kafka -> Flume (via ZooKeeper) -> HDFS.

I believe there is direct streaming from Kafka to Hive as well, and from
Flume to HBase.

I would have thought that if one wanted to do real-time analytics with Spark
Streaming, that would be a good fit with a real-time dashboard.

What is not so clear is the business use case for this.

HTH




Dr Mich Talebzadeh




Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
If you're doing any kind of pre-aggregation during ETL, the Spark direct
stream will let you more easily get the delivery semantics you need,
especially if you're using a transactional data store.

If you're literally just copying individual, uniquely keyed items from
Kafka to a key-value store, then sure, use Kafka consumers.
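
A bare-bones sketch of that second path: a plain Kafka consumer copying
uniquely keyed records into a key-value store (the topic name and the
KeyValueStore sink are hypothetical stand-ins):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CopyToKvStore {
    interface KeyValueStore { void put(String key, String value); } // stand-in sink

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "etl-copy");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        KeyValueStore store = (k, v) -> { /* e.g. upsert into Cassandra by key */ };

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("raw-events"));
            while (true) {
                ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> rec : records) {
                    store.put(rec.key(), rec.value()); // unique key => idempotent
                }
                consumer.commitSync(); // commit offsets only after the writes
            }
        }
    }
}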


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
If you use Spark direct streams, it ensures end-to-end guarantees for
messages.



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Hi Ali,

What is the business use case for this?

Dr Mich Talebzadeh




Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
No, direct stream in and of itself won't ensure an end-to-end
guarantee, because it doesn't know anything about your output actions.

You still need to do some work. The point is that having easy access to
offsets for batches on a per-partition basis makes it easier to do that
work, especially in conjunction with aggregation.
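
For reference, this is roughly what that per-partition offset access looks
like with the Kafka 0.10 direct stream in Java (a sketch only; constructing
the stream via KafkaUtils.createDirectStream is omitted):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class OffsetAccess {
    static void wire(JavaInputDStream<ConsumerRecord<String, String>> stream) {
        stream.foreachRDD((JavaRDD<ConsumerRecord<String, String>> rdd) -> {
            // The exact per-partition offset ranges covered by this batch:
            OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            for (OffsetRange r : ranges) {
                System.out.printf("%s-%d: %d..%d%n",
                    r.topic(), r.partition(), r.fromOffset(), r.untilOffset());
            }
            // ...process the batch, then persist `ranges` together with the
            // results so a restart can resume from a known-good position...
        });
    }
}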


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
The business use case is to read a user's data from a variety of different
services through their APIs, and then allow the user to query that data on a
per-service basis, as well as in aggregate across all services.

The way I'm considering doing it is to do some basic ETL (drop all the
unnecessary fields, rename some fields into something more manageable, etc.)
and then store the data in Cassandra / Postgres.
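
A toy sketch of that standardization step, here using Jackson (the dropped
and renamed field names are invented for illustration):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class Standardize {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Drop noise fields and rename service-specific ones into a common schema.
    static String standardize(String rawJson) throws Exception {
        ObjectNode node = (ObjectNode) MAPPER.readTree(rawJson);
        node.remove("internal_trace_id");      // hypothetical unneeded field
        if (node.has("usr")) {                 // hypothetical rename: usr -> user_id
            node.set("user_id", node.remove("usr"));
        }
        return MAPPER.writeValueAsString(node);
    }
}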

Then, when the user wants to view a particular report, query the respective
table in Cassandra / Postgres (select .. from data where user = ? and date
between ? and ? and some_field = ?).
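
And a sketch of what that per-user, per-date-range lookup could look like
over JDBC (table and column names are hypothetical):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class ReportQuery {
    // Matches the shape of the query above: one user, one date range.
    static void printReport(Connection conn, String userId,
                            Timestamp from, Timestamp to) throws Exception {
        String sql = "SELECT service, payload FROM data "
            + "WHERE user_id = ? AND event_date BETWEEN ? AND ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, userId);
            ps.setTimestamp(2, from);
            ps.setTimestamp(3, to);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(
                        rs.getString("service") + " -> " + rs.getString("payload"));
                }
            }
        }
    }
}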

How will Spark Streaming help w/ aggregation? Couldn't the data be queried
from Cassandra / Postgres via the Kafka consumer and aggregated that way?


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
The way I see this, there are three things involved:

   1. Data ingestion from the source into Kafka
   2. Data conversion and storage (ETL/ELT)
   3. Presentation

Item 2 is the one that needs to be designed correctly. I presume the raw data
has to conform to some form of MDM that requires schema mapping etc. before
being put into persistent storage (DB, HDFS etc.). Which one to choose depends
on your volume of ingestion, your cluster size, and the complexity of the data
conversion. Then your users will use some form of UI (Tableau, QlikView,
Zeppelin, direct SQL) to query the data one way or another. Your users can
directly use a UI like Tableau that offers built-in analytics on SQL (Spark
SQL offers the same). Your mileage varies according to your needs.

I still don't understand why writing to a transactional database with
locking and concurrency (reads and writes) through JDBC will be fast for
this sort of data ingestion. If you ask me, if I wanted to choose an RDBMS
to write to as my sink, I would use Oracle, which offers the best locking
and concurrency among RDBMSs and also handles key-value pairs (assuming
that is what you want). In addition, it can be used as a Data Warehouse as
well.

HTH



Dr Mich Talebzadeh




Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
Spark Streaming helps with aggregation because:

A. raw Kafka consumers have no built-in framework for shuffling
amongst nodes, short of writing into an intermediate topic (I'm not
touching Kafka Streams here, I don't have experience with it), and

B. it deals with batches, so you can transactionally decide to commit
or roll back your aggregate data and your offsets. Otherwise your
offsets and data store can get out of sync, leading to lost /
duplicate data.

Regarding long-running Spark jobs, I have streaming jobs in the
standalone manager that have been running for 6 months or more.
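
A compressed sketch of the transactional pattern in (B): write a batch's
aggregates and its Kafka offsets in one database transaction, so a failed
batch gets replayed rather than double-counted (table names are hypothetical,
and production code would also check the stored offset before applying):

import java.sql.Connection;
import java.sql.PreparedStatement;

public class TransactionalCommit {
    // One batch: bump an aggregate and record how far we read, atomically.
    static void commitBatch(Connection conn, String userId, long count,
                            String topic, int partition, long untilOffset)
            throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement agg = conn.prepareStatement(
                "INSERT INTO daily_counts (user_id, cnt) VALUES (?, ?) "
                + "ON CONFLICT (user_id) DO UPDATE "
                + "SET cnt = daily_counts.cnt + EXCLUDED.cnt");
             PreparedStatement off = conn.prepareStatement(
                "INSERT INTO kafka_offsets (topic, part, until_offset) "
                + "VALUES (?, ?, ?) ON CONFLICT (topic, part) DO UPDATE "
                + "SET until_offset = EXCLUDED.until_offset")) {
            agg.setString(1, userId);
            agg.setLong(2, count);
            agg.executeUpdate();
            off.setString(1, topic);
            off.setInt(2, partition);
            off.setLong(3, untilOffset);
            off.executeUpdate();
            conn.commit();    // both land or neither does
        } catch (Exception e) {
            conn.rollback();  // the batch will be reprocessed from stored offsets
            throw e;
        }
    }
}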

On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel wrote:
> Ok… so what’s the tricky part?
> Spark Streaming isn’t real time so if you don’t mind a slight delay in 
> processing… it would work.
>
> The drawback is that you now have a long running Spark Job (assuming under 
> YARN) and that could become a problem in terms of security and resources.
> (How well does Yarn handle long running jobs these days in a secured Cluster? 
> Steve L. may have some insight… )
>
> Raw HDFS would become a problem because Apache HDFS is still WORM (write
> once, read many). (Do you want to write your own compaction code? Or use
> Hive 1.x+?)
>
> HBase? Depending on your admin… stability could be a problem.
> Cassandra? That would be a separate cluster and that in itself could be a 
> problem…
>
> YMMV so you need to address the pros/cons of each tool specific to your 
> environment and skill level.
>
> HTH
>
> -Mike


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
> I still don't understand why writing to a transactional database with locking
> and concurrency (reads and writes) through JDBC will be fast for this sort of
> data ingestion.

Who cares about fast if your data is wrong? And it's still plenty fast enough:

https://youtu.be/NVl9_6J1G60?list=WL&t=1819

https://www.citusdata.com/blog/2016/09/22/announcing-citus-mx/




Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Gwen Shapira
The original post made no mention of throughput, latency, or correctness
requirements, so pretty much any data store will fit the bill... discussions
of "what is better" degrade fast when there are no concrete requirements to
choose between.

Who cares about anything when we don't know what we need? :)


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
The OP didn't say anything about Yarn, and why are you contemplating
putting Kafka or Spark on public networks to begin with?

Gwen's right: absent any actual requirements, this is kind of pointless.

On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel wrote:
> Spark standalone is not Yarn… or secure for that matter… ;-)


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Avi Flax

> On Sep 29, 2016, at 09:54, Ali Akhtar wrote:
> 
> I'd appreciate some thoughts / suggestions on which of these alternatives I
> should go with (e.g, using raw Kafka consumers vs Spark for ETL, which
> persistent data store to use, and how to query that data store in the
> backend of the web UI, for displaying the reports).

Hi Ali, I’m no expert in any of this, but I’m working on a project that is 
broadly similar to yours, and FWIW I’m evaluating Druid as the datastore which 
would host the queryable data and, well, actually handle and fulfill queries.

Since Druid has built-in support for streaming ingestion from Kafka topics, I’m 
tentatively thinking of doing my ETL in a stream processing topology (I’m using 
Kafka Streams, FWIW), which would write the events destined for Druid into 
certain topics, from which Druid would ingest those events.

HTH,
Avi


Software Architect @ Park Assist
We’re hiring! http://tech.parkassist.com/jobs/



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
Avi,

Why did you choose Druid over Postgres / Cassandra / Elasticsearch?



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
OP mentioned HBase or HDFS as persisted storage. Therefore they have to be
running YARN if they are considering Spark.
(Assuming that you’re not trying to do a storage / compute model and use
standalone Spark outside your cluster. You can, but you have more moving
parts…)

I never said anything about putting something on a public network. I mentioned
running a secured cluster.
You don’t deal with PII or other regulated data, do you?

If you read my original post, you are correct: we don’t have a lot, if any,
real information.
Based on what the OP said, there are design considerations, since every tool he
mentioned has pluses and minuses, and the problem isn’t really that challenging
unless you have something extraordinary, like high velocity or some other
constraint, that makes it so.

BTW, depending on scale and velocity… your relational engines may become 
problematic. 
HTH

-Mike


> On Sep 29, 2016, at 1:51 PM, Cody Koeninger  wrote:
> 
> The OP didn't say anything about Yarn, and why are you contemplating
> putting Kafka or Spark on public networks to begin with?
> 
> Gwen's right, absent any actual requirements this is kind of pointless.



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Spark standalone is not Yarn… or secure for that matter… ;-)

> On Sep 29, 2016, at 11:18 AM, Cody Koeninger  wrote:
> 
> Spark Streaming helps with aggregation because
> 
> A. raw Kafka consumers have no built-in framework for shuffling
> amongst nodes, short of writing into an intermediate topic (I'm not
> touching Kafka Streams here, I don't have experience), and
> 
> B. it deals with batches, so you can transactionally decide to commit
> or roll back your aggregate data and your offsets. Otherwise your
> offsets and data store can get out of sync, leading to lost /
> duplicate data.
> 
> Regarding long-running Spark jobs, I have streaming jobs in the
> standalone manager that have been running for 6 months or more.
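
The pattern behind point B above, sketched with a plain Java consumer for
contrast (table names and SQL are hypothetical; assumes Postgres 9.5+ for
ON CONFLICT): record the offsets in the same database transaction as the data
they produced, so the two cannot diverge.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class TransactionalEtl {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "etl");
            props.put("enable.auto.commit", "false"); // offsets live in the DB
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 Connection db = DriverManager
                         .getConnection("jdbc:postgresql://localhost/etl")) {
                db.setAutoCommit(false);
                consumer.subscribe(Collections.singletonList("raw-events"));
                // On startup you would also seek() each partition to the offset
                // recorded in the offsets table; omitted for brevity.
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    for (ConsumerRecord<String, String> r : records) {
                        try (PreparedStatement upsert = db.prepareStatement(
                                "INSERT INTO events (id, body) VALUES (?, ?) "
                                + "ON CONFLICT (id) DO UPDATE SET body = EXCLUDED.body");
                             PreparedStatement offs = db.prepareStatement(
                                "UPDATE offsets SET next_off = ? WHERE topic = ? AND part = ?")) {
                            upsert.setString(1, r.key());
                            upsert.setString(2, r.value());
                            upsert.executeUpdate();
                            offs.setLong(1, r.offset() + 1);
                            offs.setString(2, r.topic());
                            offs.setInt(3, r.partition());
                            offs.executeUpdate();
                        }
                    }
                    db.commit(); // data and offsets succeed or fail together
                }
            }
        }
    }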



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Ok… so what’s the tricky part?
Spark Streaming isn’t real time, so if you don’t mind a slight delay in
processing… it would work.

The drawback is that you now have a long-running Spark job (assuming under
YARN), and that could become a problem in terms of security and resources.
(How well does YARN handle long-running jobs these days in a secured cluster?
Steve L. may have some insight… )

Raw HDFS would become a problem because Apache HDFS is still WORM (write once,
read many). (Do you want to write your own compaction code? Or use Hive 1.x+?)

HBase? Depending on your admin… stability could be a problem.
Cassandra? That would be a separate cluster, and that in itself could be a
problem…

YMMV, so you need to address the pros/cons of each tool specific to your
environment and skill level.

HTH

-Mike




Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Alonso Isidoro Roman
"Using Spark to query the data in the backend of the web UI?"

Dont do that. I would recommend that spark streaming process stores data
into some nosql or sql database and the web ui to query data from that
database.

Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman


2016-09-29 16:15 GMT+02:00 Ali Akhtar :

> The web UI is actually the speed layer, it needs to be able to query the
> data online, and show the results in real-time.
>
> It also needs a custom front-end, so a system like Tableau can't be used;
> it must have a custom backend + front-end.
>
> Thanks for the recommendation of Flume. Do you think this will work:
>
> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
> - Using Spark to query the data in the backend of the web UI?
>


Re: Architecture recommendations for a tricky use case

2016-09-30 Thread Andrew Stevenson
- Kafka Connect for ingress, the “E”

- Kafka Streams, Flink or Spark Streaming for the “T”: read from and
write back to Kafka. Keep the sources of data for your processing engine
small; separation of concerns: why should Spark care about where your
upstream sources are, for example?

- Kafka Connect for egress, the “L”, to a datastore of your choice: Kudu,
HDFS, Cassandra, RethinkDB, HBase, Postgres, etc.

- REST Proxy from Confluent or
https://github.com/datamountaineer/stream-reactor/tree/master/kafka-socket-streamer
for UI on real-time streams



https://github.com/datamountaineer/stream-reactor
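
To make the egress step concrete: with the Confluent HDFS sink connector, the
“L” is little more than a properties file. A hypothetical sketch (topic name
and namenode address are placeholders):

    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    topics=standardized-events
    hdfs.url=hdfs://namenode:8020
    flush.size=1000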





On 29/09/16 17:11, "Cody Koeninger"  wrote:

> How are you going to handle ETL failures? Do you care about lost /
> duplicated data? Are your writes idempotent?
>
> Absent any other information about the problem, I'd stay away from
> Cassandra/Flume/HDFS/HBase/whatever, and use a Spark direct stream
> feeding Postgres.
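
A sketch of that approach in Java, using the Spark 1.6 / Kafka 0.8 direct
stream API (topic, table, and SQL are hypothetical; the Postgres 9.5+ upsert
keeps writes idempotent, so a replayed batch is harmless):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.HashMap;
    import java.util.HashSet;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;
    import scala.Tuple2;

    public class DirectStreamToPostgres {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("etl");
            JavaStreamingContext jssc =
                    new JavaStreamingContext(conf, Durations.seconds(5));

            HashMap<String, String> kafkaParams = new HashMap<>();
            kafkaParams.put("metadata.broker.list", "localhost:9092");
            HashSet<String> topics = new HashSet<>();
            topics.add("raw-events");

            JavaPairInputDStream<String, String> stream =
                    KafkaUtils.createDirectStream(jssc, String.class, String.class,
                            StringDecoder.class, StringDecoder.class,
                            kafkaParams, topics);

            stream.foreachRDD(rdd -> rdd.foreachPartition(partition -> {
                // One connection per partition; the upsert keeps writes idempotent.
                try (Connection db = DriverManager
                             .getConnection("jdbc:postgresql://localhost/etl");
                     PreparedStatement ps = db.prepareStatement(
                             "INSERT INTO events (id, body) VALUES (?, ?) "
                             + "ON CONFLICT (id) DO UPDATE SET body = EXCLUDED.body")) {
                    while (partition.hasNext()) {
                        Tuple2<String, String> record = partition.next();
                        ps.setString(1, record._1());
                        ps.setString(2, record._2().trim()); // light ETL placeholder
                        ps.executeUpdate();
                    }
                }
            }));

            jssc.start();
            jssc.awaitTermination();
        }
    }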



On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar  wrote:

> Is there an advantage to that vs directly consuming from Kafka? Nothing is
> being done to the data except some light ETL and then storing it in
> Cassandra.
>
> On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma  wrote:
>>
>> It's better you use Spark's direct stream to ingest from Kafka.
>>
>> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar  wrote:
>>>
>>> I don't think I need a different speed storage and batch storage. Just
>>> taking in raw data from Kafka, standardizing, and storing it somewhere
>>> where the web UI can query it, seems like it will be enough.
>>>
>>> I'm thinking about:
>>>
>>> - Reading data from Kafka via Spark Streaming
>>> - Standardizing, then storing it in Cassandra
>>> - Querying Cassandra from the web UI
>>>
>>> That seems like it will work. My question now is whether to use Spark
>>> Streaming to read Kafka, or use Kafka consumers directly.
>>>
>>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh  wrote:
>>>> - Spark Streaming to read data from Kafka
>>>> - Storing the data on HDFS using Flume
>>>>
>>>> You don't need Spark Streaming to read data from Kafka and store it on
>>>> HDFS. It is a waste of resources.
>>>>
>>>> Couple Flume to use Kafka as source and HDFS as sink directly:
>>>>
>>>> KafkaAgent.sources = kafka-sources
>>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>>>
>>>> That will be your batch layer. To analyse, you can directly read the
>>>> HDFS files with Spark, or simply store the data in a database of your
>>>> choice via cron or something. Do not mix your batch layer with the
>>>> speed layer.
>>>>
>>>> Your speed layer will ingest the same data directly from Kafka into
>>>> Spark Streaming, and that will be online or near real time (defined by
>>>> your window).
>>>>
>>>> Then you have a serving layer to present data from both the speed
>>>> layer (the one from Spark Streaming) and the batch layer.
>>>>
>>>> HTH
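
A fuller (hypothetical) sketch of the agent outlined above, for Flume 1.7+
with a Kafka source, a file channel, and an HDFS sink; agent name, hosts, and
paths are placeholders:

    KafkaAgent.sources  = kafka-source
    KafkaAgent.channels = file-channel
    KafkaAgent.sinks    = hdfs-sink

    KafkaAgent.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
    KafkaAgent.sources.kafka-source.kafka.bootstrap.servers = localhost:9092
    KafkaAgent.sources.kafka-source.kafka.topics = raw-events
    KafkaAgent.sources.kafka-source.channels = file-channel

    KafkaAgent.channels.file-channel.type = file

    KafkaAgent.sinks.hdfs-sink.type = hdfs
    KafkaAgent.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/data/raw
    KafkaAgent.sinks.hdfs-sink.hdfs.fileType = DataStream
    KafkaAgent.sinks.hdfs-sink.channel = file-channel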

Re: Architecture recommendations for a tricky use case

2016-10-05 Thread Avi Flax

> On Sep 29, 2016, at 16:39, Ali Akhtar  wrote:
> 
> Why did you choose Druid over Postgres / Cassandra / Elasticsearch?

Well, to be clear, we haven’t chosen it yet — we’re evaluating it.

That said, it is looking quite promising for our use case.

The Druid docs say it well:

> Druid is an open source data store designed for OLAP queries on event data.

And that’s exactly what we need. The other options you listed are excellent 
systems, but they’re more general than Druid. Because Druid is specifically 
focused on OLAP queries on event data, it has features and properties that make 
it very well suited to such use cases.

In addition, Druid has built-in support for ingesting events from Kafka topics 
and making those events available for querying with very low latency. This is 
very attractive for my use case.

If you’d like to learn more about Druid I recommend this talk from last month 
at Strange Loop: https://www.youtube.com/watch?v=vbH8E0nH2Nw

HTH!

Avi


Software Architect @ Park Assist
We’re hiring! http://tech.parkassist.com/jobs/