Kafka use case for my setup

2022-06-08 Thread Jay Linux
Hi there,

I would like to automate some of my tasks using Apache Kafka. Previously I
did the same with Apache Airflow, which worked fine, but I want to explore
whether Kafka works better than Airflow for this.


1) Kafka runs on Server A
2) Kafka polls Server B every 10 or 20 minutes to check whether a file named
test.xml has been created.
3) Once Kafka detects that the file has been created, the job starts as follows:
a) Create a Jira ticket and record each step's execution on Jira
b) Trigger an rsync command
c) Unarchive the files using the tar command
d) Run a script against the unarchived files
e) Re-archive the files and rsync them to a different location
f) Send an email once all tasks are finished
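
A minimal sketch of how step 2 could feed Kafka, assuming a small poller that
can see Server B's file system (for example via a mount) and a hypothetical
topic named file-events; the later steps would be handled by consumers of that
topic:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Properties;
    import java.util.concurrent.TimeUnit;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class FileWatchProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "serverA:9092");  // assumption
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            Path watched = Path.of("/data/incoming/test.xml");  // assumed path
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                while (true) {
                    if (Files.exists(watched)) {
                        // Publish one event; consumers of file-events run the
                        // Jira/rsync/tar steps and send the final email.
                        producer.send(new ProducerRecord<>("file-events", "test.xml", "CREATED"));
                        producer.flush();
                        break;
                    }
                    TimeUnit.MINUTES.sleep(10);  // poll every 10 minutes
                }
            }
        }
    }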

Please advise whether Kafka is a sensible fit for this to begin with, or let
me know if there are other open-source products that can perform these
actions. By the way, I prefer to set this up as a docker-compose based
installation.

Thanks
Jay



Re: Use case: Per tenant deployments talking to multi tenant kafka cluster

2021-12-09 Thread Luke Chen
Hi Christian,

> For multiple partitions is it the correct behaviour to simply assign to
partition number:offset or do I have to provide offsets for the other
partitions too?

I'm not sure I fully understand the question here. If you are asking whether
you should commit offsets for partitions that this consumer does not consume,
then you don't have to. For example:

Consumer A is assigned to partition 0 and consumer B is assigned to partition
1. When committing offsets, consumer A only has to commit the offset of
partition 0, and consumer B only commits the offset of partition 1.

For the Consumer#assign, you can check this link for more information:
https://stackoverflow.com/a/53938397
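
A minimal sketch of that pattern, assuming the offsets are stored by the
application itself and using a placeholder topic name ("events") and partition
number: the consumer assigns itself a single partition, seeks to its stored
offset, and commits only that partition's offset.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class SinglePartitionConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumption
            props.put("group.id", "env-1-consumer");           // assumption
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            TopicPartition myPartition = new TopicPartition("events", 1);  // this environment's partition
            long storedOffset = 42L;  // loaded from the application's own storage

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.assign(Collections.singletonList(myPartition));  // only this partition
                consumer.seek(myPartition, storedOffset);                 // resume from the stored offset
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        // process the record ...
                    }
                    // Commit only the partition this consumer is assigned to.
                    consumer.commitSync(Map.of(myPartition,
                        new OffsetAndMetadata(consumer.position(myPartition))));
                }
            }
        }
    }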

> In this case do we still need a custom
producer partitioner or is it enough to simply assign to the topic like
described above?

No. If you want each consumer to receive all the messages of all
environments, then you don't need a custom partitioner.


Thank you,
Luke

On Wed, Dec 8, 2021 at 11:05 PM Christian Schneider 
wrote:

> Hi Luke,
>
> thanks for the hints. This helps a lot already.
>
> We already use assign as we manage offsets on the consumer side. Currently
> we only have one partition and simply assign a stored offset on partition
> 0.
> For multiple partitions is it the correct behaviour to simply assign to
> partition number:offset or do I have to provide offsets for the other
> partitions too? I only want to listen to one partition.
> You mentioned custom producer partitioner. We currently use a random
> consumer group name for each consumer as we want each consumer to receive
> all messages of the environment. In this case do we still need a custom
> producer partitioner or is it enough to simply assign to the topic like
> described above?
>
> Christian
>
> Am Mi., 8. Dez. 2021 um 11:19 Uhr schrieb Luke Chen :
>
> > Hi Christian,
> > Answering your question below:
> >
> > > Let's assume we just have one topic with 10 partitions for simplicity.
> > We can now use the environment id as a key for the messages to make sure
> > the messages of each environment arrive in order while sharing the load
> on
> > the partitions.
> >
> > > Now we want each environment to only read the minimal number of
> messages
> > while consuming. Ideally we would like to to only consume its own
> messages.
> > Can we somehow filter to only
> > receive messages with a certain key? Can we maybe only listen to a
> certain
> > partition at least?
> >
> >
> > Unfortunately, Kafka doesn't have the feature to filter the messages on
> > broker before sending to consumer.
> > But for your 2nd question:
> > > Can we maybe only listen to a certain partition at least?
> >
> > Actually, yes. Kafka has a way to just fetch data from a certain
> partition
> > of a topic. You can use Consumer#assign API to achieve that. So, to do
> > that, I think you also need to have a custom producer partitioner for
> your
> > purpose. Let's say, in your example, you have 10 partitions, and 10
> > environments. Your partitioner should send to the specific partition
> based
> > on the environment ID, ex: env ID 1 -> partition 1, env ID 2 -> partition
> > 2 So, in your consumer, you can just assign to the partition
> containing
> > its environment ID.
> >
> > And for the idea of encrypting the messages to achieve isolation, it's
> > interesting! I've never thought about it! :)
> >
> > Hope it helps.
> >
> > Thank you.
> > Luke
> >
> >
> > On Wed, Dec 8, 2021 at 4:48 PM Christian Schneider <
> > ch...@die-schneider.net>
> > wrote:
> >
> > > We have a single tenant application that we deploy to a kubernetes
> > cluster
> > > in many instances.
> > > Every customer has several environments of the application. Each
> > > application lives in a separate namespace and should be isolated from
> > other
> > > applications.
> > >
> > > We plan to use kafka to communicate inside an environment (between the
> > > different pods).
> > > As setting up one kafka cluster per such environment is a lot of
> overhead
> > > and cost we would like to just use a single multi tenant kafka cluster.
> > >
> > > Let's assume we just have one topic with 10 partitions for simplicity.
> > > We can now use the environment id as a key for the messages to make
> sure
> > > the messages of each environment arrive in order while sharing the load
> > on
> > > the partitions.
> > >
> > > Now we want each environment to only read the minimal number of
> messages
> > > while consuming. Ideally we would like to to only consume its own
> > messages.
> > > Can we somehow filter to only
> > > receive messages with a certain key? Can we maybe only listen to a
> > certain
> > > partition at least?
> > >
> > > Additionally we ideally would like to have enforced isolation. So each
> > > environment can only see its own messages even if it might receive
> > messages
> > > of other environments from the same partition.
> > > I think in worst case we can make this happen by encrypting the
> messages
> > > but it would be great if we could filter on broker side.

Re: Use case: Per tenant deployments talking to multi tenant kafka cluster

2021-12-08 Thread Christian Schneider
Hi Luke,

thanks for the hints. This helps a lot already.

We already use assign, as we manage offsets on the consumer side. Currently
we only have one partition and simply assign a stored offset on partition 0.
For multiple partitions, is it correct behaviour to simply assign partition
number:offset, or do I have to provide offsets for the other partitions too?
I only want to listen to one partition.
You mentioned a custom producer partitioner. We currently use a random
consumer group name for each consumer, as we want each consumer to receive
all messages of the environment. In this case, do we still need a custom
producer partitioner, or is it enough to simply assign to the topic as
described above?

Christian

Am Mi., 8. Dez. 2021 um 11:19 Uhr schrieb Luke Chen :

> Hi Christian,
> Answering your question below:
>
> > Let's assume we just have one topic with 10 partitions for simplicity.
> We can now use the environment id as a key for the messages to make sure
> the messages of each environment arrive in order while sharing the load on
> the partitions.
>
> > Now we want each environment to only read the minimal number of messages
> while consuming. Ideally we would like to to only consume its own messages.
> Can we somehow filter to only
> receive messages with a certain key? Can we maybe only listen to a certain
> partition at least?
>
>
> Unfortunately, Kafka doesn't have the feature to filter the messages on
> broker before sending to consumer.
> But for your 2nd question:
> > Can we maybe only listen to a certain partition at least?
>
> Actually, yes. Kafka has a way to just fetch data from a certain partition
> of a topic. You can use Consumer#assign API to achieve that. So, to do
> that, I think you also need to have a custom producer partitioner for your
> purpose. Let's say, in your example, you have 10 partitions, and 10
> environments. Your partitioner should send to the specific partition based
> on the environment ID, ex: env ID 1 -> partition 1, env ID 2 -> partition
> 2 So, in your consumer, you can just assign to the partition containing
> its environment ID.
>
> And for the idea of encrypting the messages to achieve isolation, it's
> interesting! I've never thought about it! :)
>
> Hope it helps.
>
> Thank you.
> Luke
>
>
> On Wed, Dec 8, 2021 at 4:48 PM Christian Schneider <
> ch...@die-schneider.net>
> wrote:
>
> > We have a single tenant application that we deploy to a kubernetes
> cluster
> > in many instances.
> > Every customer has several environments of the application. Each
> > application lives in a separate namespace and should be isolated from
> other
> > applications.
> >
> > We plan to use kafka to communicate inside an environment (between the
> > different pods).
> > As setting up one kafka cluster per such environment is a lot of overhead
> > and cost we would like to just use a single multi tenant kafka cluster.
> >
> > Let's assume we just have one topic with 10 partitions for simplicity.
> > We can now use the environment id as a key for the messages to make sure
> > the messages of each environment arrive in order while sharing the load
> on
> > the partitions.
> >
> > Now we want each environment to only read the minimal number of messages
> > while consuming. Ideally we would like to to only consume its own
> messages.
> > Can we somehow filter to only
> > receive messages with a certain key? Can we maybe only listen to a
> certain
> > partition at least?
> >
> > Additionally we ideally would like to have enforced isolation. So each
> > environment can only see its own messages even if it might receive
> messages
> > of other environments from the same partition.
> > I think in worst case we can make this happen by encrypting the messages
> > but it would be great if we could filter on broker side.
> >
> > Christian
> >
> > --
> > --
> > Christian Schneider
> > http://www.liquid-reality.de
> >
> > Computer Scientist
> > http://www.adobe.com
> >
>


-- 
-- 
Christian Schneider
http://www.liquid-reality.de

Computer Scientist
http://www.adobe.com


Re: Use case: Per tenant deployments talking to multi tenant kafka cluster

2021-12-08 Thread Luke Chen
Hi Christian,
Answering your question below:

> Let's assume we just have one topic with 10 partitions for simplicity.
We can now use the environment id as a key for the messages to make sure
the messages of each environment arrive in order while sharing the load on
the partitions.

> Now we want each environment to only read the minimal number of messages
while consuming. Ideally we would like to to only consume its own messages.
Can we somehow filter to only
receive messages with a certain key? Can we maybe only listen to a certain
partition at least?


Unfortunately, Kafka doesn't have a feature to filter messages on the broker
before sending them to the consumer.
But for your 2nd question:
> Can we maybe only listen to a certain partition at least?

Actually, yes. Kafka has a way to fetch data from just a certain partition of
a topic: you can use the Consumer#assign API to achieve that. To make that
work, I think you also need a custom producer partitioner. Let's say, as in
your example, you have 10 partitions and 10 environments. Your partitioner
should send to a specific partition based on the environment ID, e.g. env ID 1
-> partition 1, env ID 2 -> partition 2, and so on. Then, in your consumer,
you can just assign to the partition corresponding to its environment ID.
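
A minimal sketch of that routing, with assumed names (a topic whose record key
is the environment ID as a string, and a simple env-ID-modulo-partition-count
mapping); the producer registers the partitioner, and each environment's
consumer assigns itself only its own partition as in the earlier example.

    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;

    // Producer-side partitioner: route each record to the partition matching
    // its environment ID. Assumes the record key is the environment ID, e.g. "3".
    public class EnvironmentPartitioner implements Partitioner {
        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            int envId = Integer.parseInt((String) key);
            return envId % numPartitions;  // env 1 -> partition 1, env 2 -> partition 2, ...
        }

        @Override
        public void close() {}

        @Override
        public void configure(Map<String, ?> configs) {}
    }

It would be registered on the producer via the partitioner.class setting, and
each environment's consumer then calls assign() on its own TopicPartition.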

And for the idea of encrypting the messages to achieve isolation, it's
interesting! I've never thought about it! :)

Hope it helps.

Thank you.
Luke


On Wed, Dec 8, 2021 at 4:48 PM Christian Schneider 
wrote:

> We have a single tenant application that we deploy to a kubernetes cluster
> in many instances.
> Every customer has several environments of the application. Each
> application lives in a separate namespace and should be isolated from other
> applications.
>
> We plan to use kafka to communicate inside an environment (between the
> different pods).
> As setting up one kafka cluster per such environment is a lot of overhead
> and cost we would like to just use a single multi tenant kafka cluster.
>
> Let's assume we just have one topic with 10 partitions for simplicity.
> We can now use the environment id as a key for the messages to make sure
> the messages of each environment arrive in order while sharing the load on
> the partitions.
>
> Now we want each environment to only read the minimal number of messages
> while consuming. Ideally we would like to to only consume its own messages.
> Can we somehow filter to only
> receive messages with a certain key? Can we maybe only listen to a certain
> partition at least?
>
> Additionally we ideally would like to have enforced isolation. So each
> environment can only see its own messages even if it might receive messages
> of other environments from the same partition.
> I think in worst case we can make this happen by encrypting the messages
> but it would be great if we could filter on broker side.
>
> Christian
>
> --
> --
> Christian Schneider
> http://www.liquid-reality.de
>
> Computer Scientist
> http://www.adobe.com
>


Use case: Per tenant deployments talking to multi tenant kafka cluster

2021-12-08 Thread Christian Schneider
We have a single-tenant application that we deploy to a Kubernetes cluster
in many instances.
Every customer has several environments of the application. Each
application lives in a separate namespace and should be isolated from the
other applications.

We plan to use Kafka to communicate inside an environment (between the
different pods).
As setting up one Kafka cluster per environment adds a lot of overhead and
cost, we would like to use a single multi-tenant Kafka cluster.

Let's assume we just have one topic with 10 partitions for simplicity.
We can now use the environment id as a key for the messages to make sure
the messages of each environment arrive in order while sharing the load on
the partitions.
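
A minimal sketch of that keying, with assumed topic and serializer settings:
with the default partitioner, all messages carrying the same environment ID
key land on the same partition and therefore stay ordered per environment.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class EnvironmentKeyedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka:9092");  // assumption
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String environmentId = "env-42";  // assumption
                // Keying by environment ID keeps each environment's messages on one
                // partition while spreading different environments across partitions.
                producer.send(new ProducerRecord<>("tenant-events", environmentId, "payload"));
            }
        }
    }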

Now we want each environment to read only the minimal number of messages
while consuming. Ideally each environment would only consume its own messages.
Can we somehow filter to only receive messages with a certain key? Or can we
at least listen to only a certain partition?

Additionally, we would ideally like enforced isolation, so that each
environment can only see its own messages even if it receives messages of
other environments from the same partition.
I think in the worst case we can make this happen by encrypting the messages,
but it would be great if we could filter on the broker side.

Christian

-- 
-- 
Christian Schneider
http://www.liquid-reality.de

Computer Scientist
http://www.adobe.com


Re: Right Use Case For Kafka Streams?

2021-03-19 Thread Ning Zhang
Just my 2 cents -- the best answer always comes from real-world practice :)

RocksDB (https://rocksdb.org/) is the implementation of the "state store" in
Kafka Streams, and it is an "embedded" KV store (as opposed to a distributed KV
store). The "state store" in Kafka Streams is also backed by a "changelog"
topic, where the physical KV data is stored.

A performance hit may happen if:
(1) one of the application nodes (that runs Kafka Streams, since Kafka Streams
is a library) goes away: the "state store" has to be rebuilt from the changelog
topic, and if the changelog topic is huge, the rebuild time can be long.
(2) the stream topology is complex, with multiple state stores and aggregation
("reduce") operations: the rebuild or recovery time after a failure can be
long.

`num.standby.replicas` should help significantly reduce the rebuild time, but
it comes with a storage cost, since the "state store" is replicated on a
different node.
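
A minimal sketch of turning on standby replicas in a Kafka Streams
application (the application ID, bootstrap servers, and the choice of a single
standby are assumptions):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class StandbyConfigExample {
        public static Properties streamsConfig() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metrics-aggregator");  // assumption
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");       // assumption
            // Keep one warm copy of each state store on another instance so a
            // failed task can resume quickly instead of replaying the changelog.
            props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
            return props;
        }
    }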





On 2021/03/16 01:11:00, Gareth Collins  wrote: 
> Hi,
> 
> We have a requirement to calculate metrics on a huge number of keys (could
> be hundreds of millions, perhaps billions of keys - attempting caching on
> individual keys in many cases will have almost a 0% cache hit rate). Is
> Kafka Streams with RocksDB and compacting topics the right tool for a task
> like that?
> 
> As well, just from playing with Kafka Streams for a week it feels like it
> wants to create a lot of separate stores by default (if I want to calculate
> aggregates on five, ten and 30 days I will get three separate stores by
> default for this state data). Coming from a different distributed storage
> solution, I feel like I want to put them together in one store as I/O has
> always been my bottleneck (1 big read and 1 big write is better than three
> small separate reads and three small separate writes).
> 
> But am I perhaps missing something here? I don't want to avoid the DSL that
> Kafka Streams provides if I don't have to. Will the Kafka Streams RocksDB
> solution be so much faster than a distributed read that it won't be the
> bottleneck even with huge amounts of data?
> 
> Any info/opinions would be greatly appreciated.
> 
> thanks in advance,
> Gareth Collins
> 


Re: Right Use Case For Kafka Streams?

2021-03-17 Thread Guozhang Wang
Hello Gareth,

A common practice for rolling up aggregations with Kafka Streams is to compute
the finest granularity in the processor (5 days in your case), and to do the
coarse-grained rollup at query-serving time through the interactive query
API -- i.e. whenever a query is issued for a 30-day aggregate, you do a range
scan on the 5-day-aggregate store and compute the rollup on the fly.

If you'd prefer to still materialize all of the granularities, maybe because
their query frequency is high enough, then just go with three stores, but
build them as three chained aggregations (i.e. a stream aggregation into
5 days, the 5-day table aggregated to 10 days, and the 10-day table aggregated
to 30 days).
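
A minimal sketch of that pattern with assumed names (input topic "metrics",
store "agg-5d", String keys, a simple count as the aggregate) and the window
API of Kafka 3.x (older releases use TimeWindows.of): aggregate into 5-day
windows, then sum the windows of the last 30 days at query time via an
interactive range fetch.

    import java.time.Duration;
    import java.time.Instant;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StoreQueryParameters;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.state.QueryableStoreTypes;
    import org.apache.kafka.streams.state.ReadOnlyWindowStore;
    import org.apache.kafka.streams.state.WindowStore;
    import org.apache.kafka.streams.state.WindowStoreIterator;

    public class RollupExample {

        // Finest granularity: count per key in 5-day windows, materialized as "agg-5d".
        public static void buildTopology(StreamsBuilder builder) {
            builder.<String, String>stream("metrics")
                .groupByKey()
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofDays(5)))
                .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("agg-5d")
                    .withValueSerde(Serdes.Long()));
        }

        // Query-time rollup: sum the 5-day windows that fall in the last 30 days.
        public static long thirtyDayTotal(KafkaStreams streams, String key) {
            ReadOnlyWindowStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType("agg-5d",
                    QueryableStoreTypes.<String, Long>windowStore()));
            long total = 0;
            Instant now = Instant.now();
            try (WindowStoreIterator<Long> it =
                     store.fetch(key, now.minus(Duration.ofDays(30)), now)) {
                while (it.hasNext()) {
                    total += it.next().value;
                }
            }
            return total;
        }
    }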

Guozhang

On Mon, Mar 15, 2021 at 6:11 PM Gareth Collins 
wrote:

> Hi,
>
> We have a requirement to calculate metrics on a huge number of keys (could
> be hundreds of millions, perhaps billions of keys - attempting caching on
> individual keys in many cases will have almost a 0% cache hit rate). Is
> Kafka Streams with RocksDB and compacting topics the right tool for a task
> like that?
>
> As well, just from playing with Kafka Streams for a week it feels like it
> wants to create a lot of separate stores by default (if I want to calculate
> aggregates on five, ten and 30 days I will get three separate stores by
> default for this state data). Coming from a different distributed storage
> solution, I feel like I want to put them together in one store as I/O has
> always been my bottleneck (1 big read and 1 big write is better than three
> small separate reads and three small separate writes).
>
> But am I perhaps missing something here? I don't want to avoid the DSL that
> Kafka Streams provides if I don't have to. Will the Kafka Streams RocksDB
> solution be so much faster than a distributed read that it won't be the
> bottleneck even with huge amounts of data?
>
> Any info/opinions would be greatly appreciated.
>
> thanks in advance,
> Gareth Collins
>


-- 
-- Guozhang


Right Use Case For Kafka Streams?

2021-03-15 Thread Gareth Collins
Hi,

We have a requirement to calculate metrics on a huge number of keys (could
be hundreds of millions, perhaps billions of keys - attempting caching on
individual keys in many cases will have almost a 0% cache hit rate). Is
Kafka Streams with RocksDB and compacting topics the right tool for a task
like that?

As well, just from playing with Kafka Streams for a week it feels like it
wants to create a lot of separate stores by default (if I want to calculate
aggregates on five, ten and 30 days I will get three separate stores by
default for this state data). Coming from a different distributed storage
solution, I feel like I want to put them together in one store as I/O has
always been my bottleneck (1 big read and 1 big write is better than three
small separate reads and three small separate writes).

But am I perhaps missing something here? I don't want to avoid the DSL that
Kafka Streams provides if I don't have to. Will the Kafka Streams RocksDB
solution be so much faster than a distributed read that it won't be the
bottleneck even with huge amounts of data?

Any info/opinions would be greatly appreciated.

thanks in advance,
Gareth Collins


Re: Is this a valid use case for reading local store ?

2020-10-01 Thread Parthasarathy, Mohan
Thanks. I was curious about other real-world use cases, i.e. what do people
use it for? Is this widely used, or mostly for debugging purposes? Any
caveats?

Thanks
Mohan


On 10/1/20, 5:55 PM, "Guozhang Wang"  wrote:

Mohan,

I think you can build a REST API on top of app1 directly leveraging on its
IQ interface. For some examples code you can refer to

https://github.com/confluentinc/kafka-streams-examples/tree/6.0.0-post/src/main/java/io/confluent/examples/streams/interactivequeries

Guozhang

On Thu, Oct 1, 2020 at 10:40 AM Parthasarathy, Mohan 
wrote:

> Hi Guozhang,
>
> The async event trigger process is not running as a kafka streams
> application. It offers REST interface where other applications post events
> which in turn needs to go through App1's state and send requests to App2
> via Kafka. Here is the diagram:
>
>KafkaTopics--->  App1 ---> App2
>|
>V
> REST >App3
>
> REST API to App3 and read the local store of App1 (IQ) and send requests
> to App2 (through kafka topic, not shown above).  Conceptually it looks 
same
> as your use case. What do people do if a kafka streams application (App1)
> has to offer REST interface also ?
>
> -thanks
> Mohan
>
> On 9/30/20, 5:01 PM, "Guozhang Wang"  wrote:
>
> Hello Mohan,
>
> If I understand correctly, your async event trigger process runs out
> of the
> streams application, that reads the state stores of app2 through the
> interactive query interface, right? This is actually a pretty common
> use
> case pattern for IQ :)
>
>
> Guozhang
>
> On Wed, Sep 30, 2020 at 1:22 PM Parthasarathy, Mohan  >
> wrote:
>
> > Hi,
> >
> > A traditional kafka streams application (App1)  reading data from a
> kafka
> > topic, doing aggregations resulting in some local state. The output
> of this
> > application is consumed by a different application(App2) for doing a
> > different task. Under some conditions, there is an external trigger
> (async
> > event) which needs to trigger requests for all the keys in the local
> store
> > to App2. To achieve this, we can read the local stores from all the
> > replicas and send the request to App2.
> >
> > This async event happens less frequently compared to the normal case
> that
> > leads to the state creation in the first place. Are there any
> caveats doing
> > it this way ? If not, any other suggestions ?
> >
> > Thanks
> > Mohan
> >
> >
>
> --
> -- Guozhang
>
>
>

-- 
-- Guozhang




Re: Is this a valid use case for reading local store ?

2020-10-01 Thread Guozhang Wang
Mohan,

I think you can build a REST API on top of App1 directly, leveraging its IQ
interface. For some example code you can refer to
https://github.com/confluentinc/kafka-streams-examples/tree/6.0.0-post/src/main/java/io/confluent/examples/streams/interactivequeries
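
A minimal sketch of what App1's REST handler could call into, with assumed
names (store "app1-state", String keys and values) and the HTTP layer left
out; it simply looks up a key in the local state store through the interactive
query API.

    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StoreQueryParameters;
    import org.apache.kafka.streams.state.QueryableStoreTypes;
    import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

    public class App1QueryService {
        private final KafkaStreams streams;

        public App1QueryService(KafkaStreams streams) {
            this.streams = streams;
        }

        // Called from the REST handler: read the value for a key out of App1's local store.
        public String lookup(String key) {
            ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("app1-state",  // store name is an assumption
                    QueryableStoreTypes.<String, String>keyValueStore()));
            return store.get(key);  // null if this instance does not host the key's partition
        }
    }

With more than one App1 instance, the handler would first use
KafkaStreams#queryMetadataForKey to find the instance that hosts the key, as
the linked examples show.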

Guozhang

On Thu, Oct 1, 2020 at 10:40 AM Parthasarathy, Mohan 
wrote:

> Hi Guozhang,
>
> The async event trigger process is not running as a kafka streams
> application. It offers REST interface where other applications post events
> which in turn needs to go through App1's state and send requests to App2
> via Kafka. Here is the diagram:
>
>KafkaTopics--->  App1 ---> App2
>|
>V
> REST >App3
>
> REST API to App3 and read the local store of App1 (IQ) and send requests
> to App2 (through kafka topic, not shown above).  Conceptually it looks same
> as your use case. What do people do if a kafka streams application (App1)
> has to offer REST interface also ?
>
> -thanks
> Mohan
>
> On 9/30/20, 5:01 PM, "Guozhang Wang"  wrote:
>
> Hello Mohan,
>
> If I understand correctly, your async event trigger process runs out
> of the
> streams application, that reads the state stores of app2 through the
> interactive query interface, right? This is actually a pretty common
> use
> case pattern for IQ :)
>
>
> Guozhang
>
> On Wed, Sep 30, 2020 at 1:22 PM Parthasarathy, Mohan  >
> wrote:
>
> > Hi,
> >
> > A traditional kafka streams application (App1)  reading data from a
> kafka
> > topic, doing aggregations resulting in some local state. The output
> of this
> > application is consumed by a different application(App2) for doing a
> > different task. Under some conditions, there is an external trigger
> (async
> > event) which needs to trigger requests for all the keys in the local
> store
> > to App2. To achieve this, we can read the local stores from all the
> > replicas and send the request to App2.
> >
> > This async event happens less frequently compared to the normal case
> that
> > leads to the state creation in the first place. Are there any
> caveats doing
> > it this way ? If not, any other suggestions ?
> >
> > Thanks
> > Mohan
> >
> >
>
> --
> -- Guozhang
>
>
>

-- 
-- Guozhang


Re: Is this a valid use case for reading local store ?

2020-10-01 Thread Parthasarathy, Mohan
Hi Guozhang,

The async event trigger process is not running as a Kafka Streams application.
It offers a REST interface where other applications post events, which in turn
need to go through App1's state and result in requests to App2 via Kafka. Here
is the diagram:

   KafkaTopics ---> App1 ---> App2
                     |
                     V
          REST ---> App3

App3 serves the REST API, reads the local store of App1 (via IQ), and sends
requests to App2 (through a Kafka topic, not shown above). Conceptually it
looks the same as your use case. What do people do if a Kafka Streams
application (App1) also has to offer a REST interface?

-thanks
Mohan

On 9/30/20, 5:01 PM, "Guozhang Wang"  wrote:

Hello Mohan,

If I understand correctly, your async event trigger process runs out of the
streams application, that reads the state stores of app2 through the
interactive query interface, right? This is actually a pretty common use
case pattern for IQ :)


Guozhang

On Wed, Sep 30, 2020 at 1:22 PM Parthasarathy, Mohan 
wrote:

> Hi,
>
> A traditional kafka streams application (App1)  reading data from a kafka
> topic, doing aggregations resulting in some local state. The output of 
this
> application is consumed by a different application(App2) for doing a
> different task. Under some conditions, there is an external trigger (async
> event) which needs to trigger requests for all the keys in the local store
> to App2. To achieve this, we can read the local stores from all the
> replicas and send the request to App2.
>
> This async event happens less frequently compared to the normal case that
> leads to the state creation in the first place. Are there any caveats 
doing
> it this way ? If not, any other suggestions ?
>
> Thanks
> Mohan
>
>

-- 
-- Guozhang




Re: Is this a valid use case for reading local store ?

2020-09-30 Thread Guozhang Wang
Hello Mohan,

If I understand correctly, your async event trigger process runs outside of
the streams application and reads the state stores of app2 through the
interactive query interface, right? This is actually a pretty common use case
pattern for IQ :)


Guozhang

On Wed, Sep 30, 2020 at 1:22 PM Parthasarathy, Mohan 
wrote:

> Hi,
>
> A traditional kafka streams application (App1)  reading data from a kafka
> topic, doing aggregations resulting in some local state. The output of this
> application is consumed by a different application(App2) for doing a
> different task. Under some conditions, there is an external trigger (async
> event) which needs to trigger requests for all the keys in the local store
> to App2. To achieve this, we can read the local stores from all the
> replicas and send the request to App2.
>
> This async event happens less frequently compared to the normal case that
> leads to the state creation in the first place. Are there any caveats doing
> it this way ? If not, any other suggestions ?
>
> Thanks
> Mohan
>
>

-- 
-- Guozhang


Is this a valid use case for reading local store ?

2020-09-30 Thread Parthasarathy, Mohan
Hi,

A traditional Kafka Streams application (App1) reads data from a Kafka topic
and does aggregations, resulting in some local state. The output of this
application is consumed by a different application (App2) for a different
task. Under some conditions, there is an external trigger (async event) which
needs to trigger requests to App2 for all the keys in the local store. To
achieve this, we can read the local stores from all the replicas and send the
requests to App2.

This async event happens less frequently than the normal flow that creates the
state in the first place. Are there any caveats to doing it this way? If not,
any other suggestions?

Thanks
Mohan



Re: First time building a streaming app and I need help understanding how to build out my use case

2019-06-10 Thread SenthilKumar K
```*When I get a request for all of the messages containing a given user ID,
I need to query into the topic and get the content of those messages. Does
that make sense and is it a thing Kafka can do?*``` - If I understand
correctly, your requirement is to query Kafka topics based on a key. Example:
topic `user_data` [key: userid, value: JSON or some other data]. Given a
userid, you need to consume the JSON data from topic user_data for that
user_id -- is this correct? If yes, Kafka is not recommended for use as a
query service. If you only have data for a small number of users, you can
still achieve this by consuming all the data and applying a filter based on
user_id.
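
A minimal sketch of that brute-force filter (topic name user_data,
String-keyed records, bootstrap servers, and the bounded number of polls are
assumptions): consume the topic from the beginning and keep only the records
whose key matches the requested user.

    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class UserDataFilter {
        public static List<String> messagesForUser(String userId) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumption
            props.put("group.id", "user-data-filter");         // assumption
            props.put("auto.offset.reset", "earliest");        // read the topic from the beginning
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            List<String> matches = new ArrayList<>();
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("user_data"));
                // A handful of polls for the sketch; a real job would track end offsets.
                for (int i = 0; i < 10; i++) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        if (userId.equals(record.key())) {  // keep only this user's messages
                            matches.add(record.value());
                        }
                    }
                }
            }
            return matches;
        }
    }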

--Senthil

On Mon, Jun 10, 2019 at 9:45 PM Simon Calvin 
wrote:

> Martin,
>
> Thank you very much for your reply. I appreciate the perspective on
> securing communications with Kafka, but before I get to that point I'm
> trying to figure out if/how I can implement this use case specifically in
> Kafka.
>
> The point that I'm stuck on is needing to query for specific messages
> within a topic when the app receives a request. To simplify the example,
> consider a service that is subscribed to messages that contain a user id.
> When I get a request for all of the messages containing a given user ID, I
> need to query in to the topic and get the content of those messages. Does
> that make sense and is it a think Kafka can do?
>
> Thanks again for your help and attention!
>
> Simon
>
> 
> From: Martin Gainty 
> Sent: Monday, June 10, 2019 8:20 AM
> To: users@kafka.apache.org
> Subject: Re: First time building a streaming app and I need help
> understanding how to build out my use case
>
> MG>below
>
> 
> From: Simon Calvin 
> Sent: Friday, June 7, 2019 3:39 PM
> To: users@kafka.apache.org
> Subject: First time building a streaming app and I need help understanding
> how to build out my use case
>
> Hello, everyone. I feel like I have a use case that it is well suited to
> the Kafka streaming paradigm, but I'm having a difficult time understanding
> how certain aspects will work as I'm prototyping.
>
> So here's my use case: Service 1 assigns a job to a user which is
> published as an event to Kafka. Service 2 is a domain service that owns the
> definition for all jobs. In this case, the definition boils down to a bunch
> of form fields that need to be filled in. As changes are made to the
> definitions, the updated versions are published by Service 2 to Kafka (I
> think this is a KTable?). The job from Service 1 and the definition from
> Service 2 get joined together to create a "bill of materials" that the user
> needs to fulfill.
>  Service 3, a REST API,
>
> MG>can you risk implementing a non-secured HTTP connection?... then go
> ahead
> MG>if not you will need to look into some manner of PKI implementation for
> your Kafka Streams (user_login or certs&keys)
>
> needs to pull any unfulfilled bills for a given user. Ideally we want the
> bill to contain the most current version of the job definition at the point
> it is retrieved (vs the version at the point that the job assignment was
> published). Then, as the user fulfills the items, we update the bill with
> their responses. Once the bill is complete it gets pushed on to the one or
> more additional services (all basic consumers).
>
> MG>for Ktable stream example please reference
> org.apache.kafka.streams.smoketest.SmokeTestClient createKafkaStreams
>
> The part I'm having the most trouble with is the retrieval of bills for a
> user in Service 3. I got this idea in my head that because Kafka is
> effectively a storage system there was a(n at least fairly) straightforward
> way of querying out messages that were keyed/tagged a certain way (i.e.,
> with the user ID), but it's not clear to me if and how that works in
> practice. I'm very new to the idea of streaming and so I think a lot of the
> issue is that I'm trying to force foreign concepts (the non-streaming way
> I'm used to doing things) in to the streaming paradigm. Any help is
> appreciated!
>
> MG>assuming your ID is *NOT* generated for your table
> MG>if implementing HTTPS request/response you might want to consider using
> identifier of unique secured SESSION_ID
>
> https://security.stackexchange.com/questions/87269/how-is-the-session-id-sent-securely

Re: First time building a streaming app and I need help understanding how to build out my use case

2019-06-10 Thread Simon Calvin
Martin,

Thank you very much for your reply. I appreciate the perspective on securing 
communications with Kafka, but before I get to that point I'm trying to figure 
out if/how I can implement this use case specifically in Kafka.

The point that I'm stuck on is needing to query for specific messages within a
topic when the app receives a request. To simplify the example, consider a
service that is subscribed to messages that contain a user ID. When I get a
request for all of the messages containing a given user ID, I need to query
into the topic and get the content of those messages. Does that make sense,
and is it a thing Kafka can do?

Thanks again for your help and attention!

Simon


From: Martin Gainty 
Sent: Monday, June 10, 2019 8:20 AM
To: users@kafka.apache.org
Subject: Re: First time building a streaming app and I need help understanding 
how to build out my use case

MG>below


From: Simon Calvin 
Sent: Friday, June 7, 2019 3:39 PM
To: users@kafka.apache.org
Subject: First time building a streaming app and I need help understanding how 
to build out my use case

Hello, everyone. I feel like I have a use case that it is well suited to the 
Kafka streaming paradigm, but I'm having a difficult time understanding how 
certain aspects will work as I'm prototyping.

So here's my use case: Service 1 assigns a job to a user which is published as 
an event to Kafka. Service 2 is a domain service that owns the definition for 
all jobs. In this case, the definition boils down to a bunch of form fields 
that need to be filled in. As changes are made to the definitions, the updated 
versions are published by Service 2 to Kafka (I think this is a KTable?). The 
job from Service 1 and the definition from Service 2 get joined together to 
create a "bill of materials" that the user needs to fulfill.
 Service 3, a REST API,

MG>can you risk implementing a non-secured HTTP connection?... then go ahead
MG>if not you will need to look into some manner of PKI implementation for your 
Kafka Streams (user_login or certs&keys)

needs to pull any unfulfilled bills for a given user. Ideally we want the bill 
to contain the most current version of the job definition at the point it is 
retrieved (vs the version at the point that the job assignment was published). 
Then, as the user fulfills the items, we update the bill with their responses. 
Once the bill is complete it gets pushed on to the one or more additional 
services (all basic consumers).

MG>for Ktable stream example please reference 
org.apache.kafka.streams.smoketest.SmokeTestClient createKafkaStreams

The part I'm having the most trouble with is the retrieval of bills for a user 
in Service 3. I got this idea in my head that because Kafka is effectively a 
storage system there was a(n at least fairly) straightforward way of querying 
out messages that were keyed/tagged a certain way (i.e., with the user ID), but 
it's not clear to me if and how that works in practice. I'm very new to the 
idea of streaming and so I think a lot of the issue is that I'm trying to force 
foreign concepts (the non-streaming way I'm used to doing things) in to the 
streaming paradigm. Any help is appreciated!

MG>assuming your ID is *NOT* generated for your table
MG>if implementing HTTPS request/response you might want to consider using 
identifier of unique secured SESSION_ID
https://security.stackexchange.com/questions/87269/how-is-the-session-id-sent-securely


Thanks very much for your kind attention!

Simon Calvin


Re: First time building a streaming app and I need help understanding how to build out my use case

2019-06-10 Thread Martin Gainty
MG>below


From: Simon Calvin 
Sent: Friday, June 7, 2019 3:39 PM
To: users@kafka.apache.org
Subject: First time building a streaming app and I need help understanding how 
to build out my use case

Hello, everyone. I feel like I have a use case that it is well suited to the 
Kafka streaming paradigm, but I'm having a difficult time understanding how 
certain aspects will work as I'm prototyping.

So here's my use case: Service 1 assigns a job to a user which is published as 
an event to Kafka. Service 2 is a domain service that owns the definition for 
all jobs. In this case, the definition boils down to a bunch of form fields 
that need to be filled in. As changes are made to the definitions, the updated 
versions are published by Service 2 to Kafka (I think this is a KTable?). The 
job from Service 1 and the definition from Service 2 get joined together to 
create a "bill of materials" that the user needs to fulfill.
 Service 3, a REST API,

MG>can you risk implementing a non-secured HTTP connection?... then go ahead
MG>if not you will need to look into some manner of PKI implementation for your 
Kafka Streams (user_login or certs&keys)

needs to pull any unfulfilled bills for a given user. Ideally we want the bill 
to contain the most current version of the job definition at the point it is 
retrieved (vs the version at the point that the job assignment was published). 
Then, as the user fulfills the items, we update the bill with their responses. 
Once the bill is complete it gets pushed on to the one or more additional 
services (all basic consumers).

MG>for Ktable stream example please reference 
org.apache.kafka.streams.smoketest.SmokeTestClient createKafkaStreams

The part I'm having the most trouble with is the retrieval of bills for a user 
in Service 3. I got this idea in my head that because Kafka is effectively a 
storage system there was a(n at least fairly) straightforward way of querying 
out messages that were keyed/tagged a certain way (i.e., with the user ID), but 
it's not clear to me if and how that works in practice. I'm very new to the 
idea of streaming and so I think a lot of the issue is that I'm trying to force 
foreign concepts (the non-streaming way I'm used to doing things) in to the 
streaming paradigm. Any help is appreciated!

MG>assuming your ID is *NOT* generated for your table
MG>if implementing HTTPS request/response you might want to consider using 
identifier of unique secured SESSION_ID
https://security.stackexchange.com/questions/87269/how-is-the-session-id-sent-securely


Thanks very much for your kind attention!

Simon Calvin


First time building a streaming app and I need help understanding how to build out my use case

2019-06-07 Thread Simon Calvin
Hello, everyone. I feel like I have a use case that is well suited to the
Kafka streaming paradigm, but I'm having a difficult time understanding how
certain aspects will work as I'm prototyping.

So here's my use case: Service 1 assigns a job to a user which is published as 
an event to Kafka. Service 2 is a domain service that owns the definition for 
all jobs. In this case, the definition boils down to a bunch of form fields 
that need to be filled in. As changes are made to the definitions, the updated 
versions are published by Service 2 to Kafka (I think this is a KTable?). The 
job from Service 1 and the definition from Service 2 get joined together to 
create a "bill of materials" that the user needs to fulfill. Service 3, a REST 
API, needs to pull any unfulfilled bills for a given user. Ideally we want the 
bill to contain the most current version of the job definition at the point it 
is retrieved (vs the version at the point that the job assignment was 
published). Then, as the user fulfills the items, we update the bill with their 
responses. Once the bill is complete it gets pushed on to the one or more 
additional services (all basic consumers).
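
A minimal sketch of the join described above, with assumed topic names
("job-definitions" and "job-assignments", both keyed by job ID) and String
values standing in for the real payloads: the definitions are read as a KTable
and each assignment is joined against the latest definition to form a bill of
materials.

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class BillOfMaterialsTopology {
        public static void build(StreamsBuilder builder) {
            // Latest job definition per job ID, published by Service 2 (changelog -> KTable).
            KTable<String, String> definitions = builder.table("job-definitions");

            // Job assignments per job ID, published by Service 1.
            KStream<String, String> assignments = builder.stream("job-assignments");

            // Join each assignment with the current definition to form a "bill of materials".
            KStream<String, String> bills = assignments.join(definitions,
                (assignment, definition) -> assignment + "|" + definition);

            bills.to("bills-of-materials");  // output topic name is an assumption
        }
    }

Note that this captures the definition current at the time the assignment is
processed; serving the very latest definition at retrieval time would instead
mean looking the definition up again when the bill is queried.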

The part I'm having the most trouble with is the retrieval of bills for a user
in Service 3. I got this idea in my head that, because Kafka is effectively a
storage system, there was a(n at least fairly) straightforward way of querying
out messages that were keyed/tagged a certain way (i.e., with the user ID), but
it's not clear to me if and how that works in practice. I'm very new to the
idea of streaming, and so I think a lot of the issue is that I'm trying to
force foreign concepts (the non-streaming way I'm used to doing things) into
the streaming paradigm. Any help is appreciated!

Thanks very much for your kind attention!

Simon Calvin


Re: Configuration guidelines for a specific use-case

2019-01-11 Thread Ryanne Dolan
Given your high throughput on the consumer side, you might consider adding
more partitions than 6, so you can scale up beyond 6 consumers per group if
need be.

Ryanne

On Wed, Jan 9, 2019 at 10:37 AM Gioacchino Vino 
wrote:

> Hi Ryanne,
>
>
> I just forgot to insert the "linger.ms=0" configuration.
>
> I got this result:
>
>
> 5000 records sent, 706793.701055 records/sec (67.41 MB/sec), 7.29 ms
> avg latency, 1245.00 ms max latency, 0 ms 50th, 3 ms 95th, 197 ms 99th,
> 913 ms 99.9th.
>
>
> it's pretty good but I would like to improve it just a bit.
>
> Do you think using 6 partitions in a 3 broker cluster is a good choice?
>
>
> Gioacchino
>
>
> On 08/01/2019 18:52, Ryanne Dolan wrote:
> > Latency sounds high to me, maybe your JVMs are GC'ing a lot?
> >
> > Ryanne
> >
> > On Tue, Jan 8, 2019, 11:45 AM Gioacchino Vino  > wrote:
> >
> >> Hi expert,
> >>
> >>
> >> I would ask you some guidelines, web-pages or comments regarding my
> >> use-case.
> >>
> >>
> >> *Requirements*:
> >>
> >> - 2000+ producers
> >>
> >> - input rate 600k messages/s
> >>
> >> - consumers must write in 3 different databases (so i assume 3 consumer
> >> groups) at 600k messages/s overall (200k messages/s/database)
> >>
> >> - latency < 500ms between producers and databases
> >>
> >> - good availability
> >>
> >> - Possibility to process messages before to send them to the databases
> >> (Kafka stream? Of course in HA. Docker? Marathon?)
> >>
> >> - it's tolerate missing data ( 0.5% max ) (disk writing is not strictly
> >> required), latency has higher priority
> >>
> >> - record size: 100-1000
> >>
> >>
> >> *Resources*:
> >>
> >> brokers ( Bandwidth: 25 Gbps, 32Cpus, 1 disk (I/O 99.0 MB/s)
> >>
> >> producers -> brokers -> consumers ( Bandwidth: 1 Gbps )
> >>
> >>
> >> *My* *configuration*:
> >>
> >> 3 brokers
> >>
> >> 6 partition (without replication in order to minimize latency)
> >>
> >> ack = 0 (missing data is tolerate)
> >>
> >> batch.size = 1024 (with 8196 the throughput is max)
> >>
> >> producers -> compression.type=none
> >>
> >>
> >>
> >> I did test using kafka-producer-perf-test.sh and
> >> kafka-consumer-perf-test.sh and i have a good throughput (500-600k
> >> messages/s using 3 producers and 3 consumers) but i would improve
> >> latency (0.3-2 sec) or features I'm not still considering.
> >>
> >>
> >> I thank you in advance.
> >>
> >> Cheers,
> >>
> >>
> >> Gioacchino
> >>
> >>
>


Re: Configuration guidelines for a specific use-case

2019-01-09 Thread Gioacchino Vino

Hi Ryanne,


I just forgot to insert the "linger.ms=0" configuration.

I got this result:


5000 records sent, 706793.701055 records/sec (67.41 MB/sec), 7.29 ms 
avg latency, 1245.00 ms max latency, 0 ms 50th, 3 ms 95th, 197 ms 99th, 
913 ms 99.9th.



it's pretty good but I would like to improve it just a bit.

Do you think using 6 partitions in a 3 broker cluster is a good choice?


Gioacchino


On 08/01/2019 18:52, Ryanne Dolan wrote:

Latency sounds high to me, maybe your JVMs are GC'ing a lot?

Ryanne

On Tue, Jan 8, 2019, 11:45 AM Gioacchino Vino 
Hi expert,


I would ask you some guidelines, web-pages or comments regarding my
use-case.


*Requirements*:

- 2000+ producers

- input rate 600k messages/s

- consumers must write in 3 different databases (so i assume 3 consumer
groups) at 600k messages/s overall (200k messages/s/database)

- latency < 500ms between producers and databases

- good availability

- Possibility to process messages before to send them to the databases
(Kafka stream? Of course in HA. Docker? Marathon?)

- it's tolerate missing data ( 0.5% max ) (disk writing is not strictly
required), latency has higher priority

- record size: 100-1000


*Resources*:

brokers ( Bandwidth: 25 Gbps, 32Cpus, 1 disk (I/O 99.0 MB/s)

producers -> brokers -> consumers ( Bandwidth: 1 Gbps )


*My* *configuration*:

3 brokers

6 partition (without replication in order to minimize latency)

ack = 0 (missing data is tolerate)

batch.size = 1024 (with 8196 the throughput is max)

producers -> compression.type=none



I did test using kafka-producer-perf-test.sh and
kafka-consumer-perf-test.sh and i have a good throughput (500-600k
messages/s using 3 producers and 3 consumers) but i would improve
latency (0.3-2 sec) or features I'm not still considering.


I thank you in advance.

Cheers,


Gioacchino




Re: Configuration guidelines for a specific use-case

2019-01-08 Thread Ryanne Dolan
Latency sounds high to me, maybe your JVMs are GC'ing a lot?

Ryanne

On Tue, Jan 8, 2019, 11:45 AM Gioacchino Vino wrote:
> Hi expert,
>
>
> I would ask you some guidelines, web-pages or comments regarding my
> use-case.
>
>
> *Requirements*:
>
> - 2000+ producers
>
> - input rate 600k messages/s
>
> - consumers must write in 3 different databases (so i assume 3 consumer
> groups) at 600k messages/s overall (200k messages/s/database)
>
> - latency < 500ms between producers and databases
>
> - good availability
>
> - Possibility to process messages before to send them to the databases
> (Kafka stream? Of course in HA. Docker? Marathon?)
>
> - it's tolerate missing data ( 0.5% max ) (disk writing is not strictly
> required), latency has higher priority
>
> - record size: 100-1000
>
>
> *Resources*:
>
> brokers ( Bandwidth: 25 Gbps, 32Cpus, 1 disk (I/O 99.0 MB/s)
>
> producers -> brokers -> consumers ( Bandwidth: 1 Gbps )
>
>
> *My* *configuration*:
>
> 3 brokers
>
> 6 partition (without replication in order to minimize latency)
>
> ack = 0 (missing data is tolerate)
>
> batch.size = 1024 (with 8196 the throughput is max)
>
> producers -> compression.type=none
>
>
>
> I did test using kafka-producer-perf-test.sh and
> kafka-consumer-perf-test.sh and i have a good throughput (500-600k
> messages/s using 3 producers and 3 consumers) but i would improve
> latency (0.3-2 sec) or features I'm not still considering.
>
>
> I thank you in advance.
>
> Cheers,
>
>
> Gioacchino
>
>


Configuration guidelines for a specific use-case

2019-01-08 Thread Gioacchino Vino

Hi experts,

I would like to ask you for some guidelines, web pages or comments regarding
my use case.



*Requirements*:

- 2000+ producers

- input rate: 600k messages/s

- consumers must write to 3 different databases (so I assume 3 consumer
groups) at 600k messages/s overall (200k messages/s per database)

- latency < 500 ms between producers and databases

- good availability

- possibility to process messages before sending them to the databases
(Kafka Streams? Of course in HA. Docker? Marathon?)

- missing data is tolerated (0.5% max; disk writes are not strictly required),
latency has higher priority

- record size: 100-1000


*Resources*:

brokers (bandwidth: 25 Gbps, 32 CPUs, 1 disk with 99.0 MB/s I/O)

producers -> brokers -> consumers (bandwidth: 1 Gbps)


*My* *configuration*:

3 brokers

6 partitions (without replication, in order to minimize latency)

acks = 0 (missing data is tolerated)

batch.size = 1024 (with 8196 the throughput is at its maximum)

producers -> compression.type=none
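
A minimal sketch of that producer configuration as Java properties (bootstrap
servers and serializers are assumptions; linger.ms=0 is the setting mentioned
elsewhere in the thread):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class LowLatencyProducerConfig {
        public static KafkaProducer<String, String> create() {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "broker1:9092,broker2:9092,broker3:9092");         // assumption
            props.put(ProducerConfig.ACKS_CONFIG, "0");             // losing a little data is tolerated
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 1024);      // small batches favour latency
            props.put(ProducerConfig.LINGER_MS_CONFIG, 0);          // send as soon as possible
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "none");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
            return new KafkaProducer<>(props);
        }
    }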



I tested using kafka-producer-perf-test.sh and kafka-consumer-perf-test.sh
and got good throughput (500-600k messages/s using 3 producers and 3
consumers), but I would like to improve latency (0.3-2 s), or learn about
features I'm not yet considering.



I thank you in advance.

Cheers,


Gioacchino



Kafka use case

2018-09-26 Thread Shibi Ns
I have 2 systems:

   1. System I - a web-based interface backed by an Oracle DB, with no REST
   API support
   2. System II - supports REST APIs and also has a web-based interface.

   When a record is created or updated in either of the systems, I want to
   propagate the data to the other system. Can I use Kafka here as a mediator?
   Kafka would receive the data, check who the receiver is, and if it is
   System I, make a JDBC connection; if it is System II, make a REST API call.


Shibi


Re: Is this a decent use case for Kafka Streams?

2017-07-13 Thread Jon Yeargers
Unfortunately this notion isn't applicable: "...At the end of a time window..."

If you comb through the archives of this group you'll see many questions
about notifications for the 'end of an aggregation window' and a similar
number of replies from the Kafka group stating that such a notion doesn't
really exist. Each window is kept open so that late-arriving records can be
incorporated. You can specify the lifetime of a given window, but you don't
get any sort of signal when it expires. A record that arrives after said
expiration will trigger a new window to be created.


On Wed, Jul 12, 2017 at 5:06 PM, Stephen Powis 
wrote:

> Hey! I was hoping I could get some input from people more experienced with
> Kafka Streams to determine if they'd be a good use case/solution for me.
>
> I have multi-tenant clients submitting data to a Kafka topic that they want
> ETL'd to a third party service.  I'd like to batch and group these by
> tenant over a time window, somewhere between 1 and 5 minutes.  At the end
> of a time window then issue an API request to the third party service for
> each tenant sending the batch of data over.
>
> Other points of note:
> - Ideally we'd have exactly-once semantics, sending data multiple times
> would typically be bad.  But we'd need to gracefully handle things like API
> request errors / service outages.
>
> - We currently use Storm for doing stream processing, but the long running
> time-windows and potentially large amount of data stored in memory make me
> a bit nervous to use it for this.
>
> Thoughts?  Thanks in Advance!
> Stephen
>


Re: Is this a decent use case for Kafka Streams?

2017-07-13 Thread Eno Thereska
From just looking at your description of the problem, I'd say yes, this looks 
like a typical scenario for Kafka Streams. Kafka Streams supports exactly once 
semantics too in 0.11.

Cheers
Eno

> On 12 Jul 2017, at 17:06, Stephen Powis  wrote:
> 
> Hey! I was hoping I could get some input from people more experienced with
> Kafka Streams to determine if they'd be a good use case/solution for me.
> 
> I have multi-tenant clients submitting data to a Kafka topic that they want
> ETL'd to a third party service.  I'd like to batch and group these by
> tenant over a time window, somewhere between 1 and 5 minutes.  At the end
> of a time window then issue an API request to the third party service for
> each tenant sending the batch of data over.
> 
> Other points of note:
> - Ideally we'd have exactly-once semantics, sending data multiple times
> would typically be bad.  But we'd need to gracefully handle things like API
> request errors / service outages.
> 
> - We currently use Storm for doing stream processing, but the long running
> time-windows and potentially large amount of data stored in memory make me
> a bit nervous to use it for this.
> 
> Thoughts?  Thanks in Advance!
> Stephen



Is this a decent use case for Kafka Streams?

2017-07-12 Thread Stephen Powis
Hey! I was hoping I could get some input from people more experienced with
Kafka Streams to determine if they'd be a good use case/solution for me.

I have multi-tenant clients submitting data to a Kafka topic that they want
ETL'd to a third-party service.  I'd like to batch and group these by tenant
over a time window, somewhere between 1 and 5 minutes.  At the end of a time
window, I'd then issue an API request to the third-party service for each
tenant, sending the batch of data over.
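
A minimal sketch of that windowed grouping (topic names, a 5-minute window,
String payloads, records keyed by tenant ID, and the window API of Kafka 3.x
are all assumptions); a downstream consumer of the output topic would make the
third-party API call:

    import java.time.Duration;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.state.WindowStore;

    public class TenantBatchTopology {
        // Assumes records are keyed by tenant ID and the default serdes are Strings.
        public static void build(StreamsBuilder builder) {
            builder.<String, String>stream("tenant-events")
                .groupByKey()  // group per tenant
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
                .aggregate(
                    () -> "",                                          // start each window with an empty batch
                    (tenantId, event, batch) -> batch + event + "\n",  // append events to the batch
                    Materialized.<String, String, WindowStore<Bytes, byte[]>>as("tenant-batches"))
                .toStream((windowedKey, batch) -> windowedKey.key())   // drop the window, keep the tenant ID
                .to("tenant-batches-out");                             // downstream consumer calls the API
        }
    }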

Other points of note:
- Ideally we'd have exactly-once semantics, sending data multiple times
would typically be bad.  But we'd need to gracefully handle things like API
request errors / service outages.

- We currently use Storm for doing stream processing, but the long running
time-windows and potentially large amount of data stored in memory make me
a bit nervous to use it for this.

Thoughts?  Thanks in Advance!
Stephen


Re: How to implement use case

2017-04-27 Thread Steven Schlansker

> On Apr 27, 2017, at 3:25 AM, Vladimir Lalovic  wrote:
> 
> Hi all,
> 
> 
> 
> Our system is about ride reservations and acts as broker between customers
> and drivers.
> 
...

> Most of our rules are function of time and some reservation’s property
> (e.g. check if there are any reservations where remaining time before
> pickup is less than x).
> 
> Number of reservations we currently fetching is ~5000 and number of
> notification/alerting rules is ~20
> 
> 
> Based on documentation and some blog posts I have impression that Kafka and
> Kafka Stream library are good choice for this use case but I would like to
> confirm that with someone from Kafka team or to get some recommendations ...

We use Kafka Streams for an event dispatch system (although ours uses SMS)

I have run into a couple of edge issues where it was clear that most users
use KS differently (most for continuous streaming analysis jobs), but
fundamentally the library is also good for building event-driven applications,
and we are quite happy with our choice.

Just know that you're at the far edge of what the community uses KS for, so
you might trip over an issue or two.  I'd still highly recommend it!
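
For the time-based rules you describe (e.g. alerting when the remaining time
before pickup drops below a threshold), a minimal sketch with the low-level
Processor API might look roughly like the following. It assumes the 0.10.x-era
API plus a hypothetical Reservation value class, a made-up store name and a
made-up 30-minute threshold, so treat it as an illustration rather than code
from our system:

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class PickupAlertProcessor implements Processor<String, Reservation> {

    private ProcessorContext context;
    private KeyValueStore<String, Reservation> reservations;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        // "reservations-store" is a made-up store name registered on the topology
        this.reservations =
            (KeyValueStore<String, Reservation>) context.getStateStore("reservations-store");
        context.schedule(60 * 1000L); // punctuate roughly once a minute
    }

    @Override
    public void process(String reservationId, Reservation reservation) {
        // keep only the latest state of each reservation
        reservations.put(reservationId, reservation);
    }

    @Override
    public void punctuate(long timestamp) {
        // time-based rule: forward reservations whose pickup is < 30 minutes away
        try (KeyValueIterator<String, Reservation> iter = reservations.all()) {
            while (iter.hasNext()) {
                KeyValue<String, Reservation> entry = iter.next();
                if (entry.value.getPickupTimeMs() - timestamp < 30 * 60 * 1000L) {
                    context.forward(entry.key, entry.value); // on to an alerts sink topic
                }
            }
        }
    }

    @Override
    public void close() { }
}

In newer releases the punctuation callback moved to a Punctuator registered via
context.schedule(interval, type, punctuator), but the idea is the same.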



How to implement use case

2017-04-27 Thread Vladimir Lalovic
Hi all,



Our system is about ride reservations and acts as broker between customers
and drivers.

Something similar to what Uber does, with the major difference that we are
mostly focused on reservations scheduled in advance.

So between the moment a reservation is created and the moment the
reservation/ride is actually delivered by the driver, we run multiple
checks/notifications/alerts.


Currently that is implemented with queries scheduled to be executed every
60 seconds.

The scheduled-query approach becomes inefficient as the number of
reservations and notification/alerting rules increases.

An additional reason we want to rewrite that part of the functionality is to
decompose the current monolith, and a large part of that decomposition is
moving our services from a scheduled (timer-based) to an event-based
mechanism.


The table below is a simplified example of what we have. Basically we have
two streams: one stream of reservation-related events, and another stream
of time ticks.

Time stream                        | Stream of reservation events
Deliver ticks (e.g. each minute)   | Event time           | Reservation's status  | Reservation Scheduled for
..                                 | 25/04/2017 11.22 PM  | CREATED               | 15/05/2017 16.30 PM
…                                  | 26/04/2017 15.15 PM  | OFFERED               | 15/05/2017 16.30 PM
…                                  | 26/04/2017 21.12 PM  | ASSIGNED              | 15/05/2017 16.30 PM
…                                  | …                    |                       | 15/05/2017 16.30 PM
…                                  | 15/05/2017 15.51 PM  | DRIVER EN ROUTE       | 15/05/2017 16.30 PM
…                                  | 15/05/2017 15.25 PM  | DRIVER ON LOCATION    | 15/05/2017 16.30 PM



Most of our rules are a function of time and some reservation property
(e.g. check whether there are any reservations where the remaining time
before pickup is less than x).

The number of reservations we currently fetch is ~5000, and the number of
notification/alerting rules is ~20.


Based on the documentation and some blog posts I have the impression that
Kafka and the Kafka Streams library are a good choice for this use case,
but I would like to confirm that with someone from the Kafka team, or to
get some recommendations ...

Thanks,
Vladimir


Kafka Topics best practice for logging data pipeline use case

2017-03-14 Thread Ram Vittal
We are using the latest Kafka and Logstash versions for ingesting several
business apps' logs (a few now, but eventually 100+) into ELK. We have a
standardized logging structure that business apps use to log data into Kafka
topics, and we are able to ingest it into ELK via the Kafka topics input plugin.

Currently, we are using one Kafka topic for each business app for pushing
data into Logstash. We have 3 Logstash consumers, with 3 partitions on each
topic.

I am wondering about the best practice for using Kafka/Logstash. Is the
above config a good approach, or is there a better approach?

For example, instead of having one Kafka topic for each app, should we have
one Kafka topic across all apps? What are the pros and cons?

If you are not familiar with Logstash, it is part of the Elastic stack and
is just another consumer for Kafka.

Would appreciate your input!
-- 
Thanks,
Ram Vittal


Re: [KIP-94] SessionWindows - IndexOutOfBoundsException in simple use case

2017-02-22 Thread Damian Guy
Hi Marco,

Did you run this example with the same store name using TimeWindows? It
looks to me like it is trying to restore state from a changelog that was
previously used with TimeWindows. The data in that topic will be incompatible
with SessionWindows, as the keys are in a different format.

You'll either need to use a different store name, i.e., change "aggs", or
you will need to use the streams reset tool to reset the topics:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Application+Reset+Tool
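
For reference, a rough sketch of invoking the reset tool that ships with the
Kafka distribution (exact flags vary by version; the application id below is
just a placeholder, and the input topic matches your example):

bin/kafka-streams-application-reset.sh \
    --application-id my-streams-app \
    --bootstrap-servers localhost:9092 \
    --input-topics test01

After resetting, also clean up the local state (KafkaStreams#cleanUp(), or
wipe the state directory) before restarting, so the local stores are rebuilt
from the reset changelogs.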

Thanks,
Damian

On Wed, 22 Feb 2017 at 09:35 Marco Abitabile 
wrote:

> Hello,
>
> I apologies with Matthias since I posted yesterday this issue on the wrong
> place on github :(
>
> I'm trying a simple use case of session windowing. TimeWindows works
> perfectly, however as I replace with SessionWindows, this exception is
> thrown:
>
> Exception in thread "StreamThread-1"
> org.apache.kafka.streams.errors.StreamsException: stream-thread
> [StreamThread-1] Failed to rebalance
> at
>
> org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:612)
> at
>
> org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:368)
> Caused by: java.lang.IndexOutOfBoundsException
> at java.nio.Buffer.checkIndex(Buffer.java:546)
> at java.nio.HeapByteBuffer.getLong(HeapByteBuffer.java:416)
> at
>
> org.apache.kafka.streams.kstream.internals.SessionKeySerde.extractEnd(SessionKeySerde.java:117)
> at
>
> org.apache.kafka.streams.state.internals.SessionKeySchema.segmentTimestamp(SessionKeySchema.java:45)
> at
>
> org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore.put(RocksDBSegmentedBytesStore.java:71)
> at
>
> org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore$1.restore(RocksDBSegmentedBytesStore.java:104)
> at
>
> org.apache.kafka.streams.processor.internals.ProcessorStateManager.restoreActiveState(ProcessorStateManager.java:230)
> at
>
> org.apache.kafka.streams.processor.internals.ProcessorStateManager.register(ProcessorStateManager.java:193)
> at
>
> org.apache.kafka.streams.processor.internals.AbstractProcessorContext.register(AbstractProcessorContext.java:99)
> at
>
> org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore.init(RocksDBSegmentedBytesStore.java:101)
> at
>
> org.apache.kafka.streams.state.internals.ChangeLoggingSegmentedBytesStore.init(ChangeLoggingSegmentedBytesStore.java:68)
> at
>
> org.apache.kafka.streams.state.internals.MeteredSegmentedBytesStore.init(MeteredSegmentedBytesStore.java:66)
> at
>
> org.apache.kafka.streams.state.internals.RocksDBSessionStore.init(RocksDBSessionStore.java:78)
> at
>
> org.apache.kafka.streams.state.internals.CachingSessionStore.init(CachingSessionStore.java:97)
> at
>
> org.apache.kafka.streams.processor.internals.AbstractTask.initializeStateStores(AbstractTask.java:86)
> at
>
> org.apache.kafka.streams.processor.internals.StreamTask.(StreamTask.java:141)
> at
>
> org.apache.kafka.streams.processor.internals.StreamThread.createStreamTask(StreamThread.java:834)
> at
>
> org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:1207)
> at
>
> org.apache.kafka.streams.processor.internals.StreamThread$AbstractTaskCreator.retryWithBackoff(StreamThread.java:1180)
> at
>
> org.apache.kafka.streams.processor.internals.StreamThread.addStreamTasks(StreamThread.java:937)
> at
>
> org.apache.kafka.streams.processor.internals.StreamThread.access$500(StreamThread.java:69)
> at
>
> org.apache.kafka.streams.processor.internals.StreamThread$1.onPartitionsAssigned(StreamThread.java:236)
> at
>
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:255)
> at
>
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:339)
> at
>
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:303)
> at
>
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:286)
> at
>
> org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1030)
> at
>
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
> at
>
> org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:582)
> ... 1 more
>
> the code is very simple:
>
> KStreamBuilder builder = new KStreamBuilder();
> KStream macs = builder.stream(stringSerde,
> stringSerde, "test01");
> macs
> .groupByKey()
> .aggregate(() -> new Stri

[KIP-94] SessionWindows - IndexOutOfBoundsException in simple use case

2017-02-22 Thread Marco Abitabile
Hello,

I apologize to Matthias, since yesterday I posted this issue in the wrong
place on GitHub :(

I'm trying a simple use case of session windowing. TimeWindows works
perfectly, however as I replace with SessionWindows, this exception is
thrown:

Exception in thread "StreamThread-1"
org.apache.kafka.streams.errors.StreamsException: stream-thread
[StreamThread-1] Failed to rebalance
at
org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:612)
at
org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:368)
Caused by: java.lang.IndexOutOfBoundsException
at java.nio.Buffer.checkIndex(Buffer.java:546)
at java.nio.HeapByteBuffer.getLong(HeapByteBuffer.java:416)
at
org.apache.kafka.streams.kstream.internals.SessionKeySerde.extractEnd(SessionKeySerde.java:117)
at
org.apache.kafka.streams.state.internals.SessionKeySchema.segmentTimestamp(SessionKeySchema.java:45)
at
org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore.put(RocksDBSegmentedBytesStore.java:71)
at
org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore$1.restore(RocksDBSegmentedBytesStore.java:104)
at
org.apache.kafka.streams.processor.internals.ProcessorStateManager.restoreActiveState(ProcessorStateManager.java:230)
at
org.apache.kafka.streams.processor.internals.ProcessorStateManager.register(ProcessorStateManager.java:193)
at
org.apache.kafka.streams.processor.internals.AbstractProcessorContext.register(AbstractProcessorContext.java:99)
at
org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore.init(RocksDBSegmentedBytesStore.java:101)
at
org.apache.kafka.streams.state.internals.ChangeLoggingSegmentedBytesStore.init(ChangeLoggingSegmentedBytesStore.java:68)
at
org.apache.kafka.streams.state.internals.MeteredSegmentedBytesStore.init(MeteredSegmentedBytesStore.java:66)
at
org.apache.kafka.streams.state.internals.RocksDBSessionStore.init(RocksDBSessionStore.java:78)
at
org.apache.kafka.streams.state.internals.CachingSessionStore.init(CachingSessionStore.java:97)
at
org.apache.kafka.streams.processor.internals.AbstractTask.initializeStateStores(AbstractTask.java:86)
at
org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:141)
at
org.apache.kafka.streams.processor.internals.StreamThread.createStreamTask(StreamThread.java:834)
at
org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:1207)
at
org.apache.kafka.streams.processor.internals.StreamThread$AbstractTaskCreator.retryWithBackoff(StreamThread.java:1180)
at
org.apache.kafka.streams.processor.internals.StreamThread.addStreamTasks(StreamThread.java:937)
at
org.apache.kafka.streams.processor.internals.StreamThread.access$500(StreamThread.java:69)
at
org.apache.kafka.streams.processor.internals.StreamThread$1.onPartitionsAssigned(StreamThread.java:236)
at
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:255)
at
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:339)
at
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:303)
at
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:286)
at
org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1030)
at
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
at
org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:582)
... 1 more

the code is very simple:

KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> macs = builder.stream(stringSerde,
        stringSerde, "test01");
macs
    .groupByKey()
    .aggregate(() -> new String(),
        (String aggKey, String value, String aggregate) -> {
            return aggregate += value;
        },
        (String arg0, String arg1, String arg2) -> {
            return arg1 += arg2;
        },
        SessionWindows.with(30 * 1000).until(10 * 60 * 1000), //TimeWindows.of(1000).until(1000),
        stringSerde, "aggs")
    .toStream().map((Windowed<String> key, String value) -> {
        return KeyValue.pair(key.key(), value);
    }).print();

KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();

My real use case doesn't work either.
While debugging, I've noticed that it doesn't even reach the beginning
of the stream pipeline (groupByKey).

Can you please help investigating this issue?

Best.
Marco


Re: How does one deploy to consumers without causing re-balancing for real time use case?

2017-02-20 Thread Onur Karaman
We've only started using kafka-based group coordination for small and
simple use cases at LinkedIn so far.

Given that you kill -9 your process, your explanation for the long
stabilization time makes sense. I'd recommend calling KafkaConsumer.close.
It should speed up the rebalance times.
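
For illustration, a hedged sketch of that shutdown path (topic name and poll
timeout are placeholders; this is just the generic consumer pattern, not
anything LinkedIn-specific):

import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props); // your existing config
final Thread mainThread = Thread.currentThread();

Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    consumer.wakeup();        // makes a blocked poll() throw WakeupException
    try {
        mainThread.join();    // wait for the poll loop to finish and close()
    } catch (InterruptedException ignored) { }
}));

try {
    consumer.subscribe(Collections.singletonList("my-topic"));   // placeholder topic
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        // process records ...
    }
} catch (WakeupException e) {
    // expected on shutdown
} finally {
    consumer.close();         // sends the LeaveGroupRequest so the rebalance starts right away
}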

Another idea: it sounds like you sequentially deploy changes to your
consumers. Is this required? If not, then adding some parallelism to the
deployment would reduce the number of rebalances and therefore cause the
group to stabilize sooner.

On Fri, Feb 17, 2017 at 10:55 PM, Praveen  wrote:

> Hey Onur,
>
> I was just watching your talk on rebalancing from last year -
> https://www.youtube.com/watch?v=QaeXDh12EhE
> Nice talk!.
>
> I think I have an idea as to why it takes 1 hr in my case based on the
> talk in the video. In my case with 32 boxes / consumers from the same
> group, I believe the current state of the group coordinator's state machine
> gets messed up each time a new one is added until the very last consumer.
> Also I have a heartbeat set to 97 seconds (97 secs b/c normal processing
> could take that long and we don't want coordinator to think consumer is
> dead). I think both of these coupled together is why the cluster restart
> takes > 1hr. I'm curious how linkedin does clean cluster restarts? How do
> you handle the scenario described above?
>
> Praveen
>
>
> On Wed, Feb 15, 2017 at 10:22 AM, Praveen  wrote:
>
>> I still think a clean cluster start should not take > 1 hr for balancing
>> though. Is this expected or am i doing something different?
>>
>> I thought this would be a common use case.
>>
>> Praveen
>>
>> On Fri, Feb 10, 2017 at 10:26 AM, Onur Karaman <
>> okara...@linkedin.com.invalid> wrote:
>>
>>> Pradeep is right.
>>>
>>> close() will try and send out a LeaveGroupRequest while a kill -9 will
>>> not.
>>>
>>> On Fri, Feb 10, 2017 at 10:19 AM, Pradeep Gollakota <
>>> pradeep...@gmail.com>
>>> wrote:
>>>
>>> > I believe if you're calling the .close() method on shutdown, then the
>>> > LeaveGroupRequest will be made. If you're doing a kill -9, I'm not
>>> sure if
>>> > that request will be made.
>>> >
>>> > On Fri, Feb 10, 2017 at 8:47 AM, Praveen  wrote:
>>> >
>>> > > @Pradeep - I just read your thread, the 1hr pause was when all the
>>> > > consumers where shutdown simultaneously.  I'm testing out rolling
>>> restart
>>> > > to get the actual numbers. The initial numbers are promising.
>>> > >
>>> > > `STOP (1) (1min later kicks off) -> REBALANCE -> START (1) ->
>>> REBALANCE
>>> > > (takes 1min to get a partition)`
>>> > >
>>> > > In your thread, Ewen says -
>>> > >
>>> > > "The LeaveGroupRequest is only sent on a graceful shutdown. If a
>>> > > consumer knows it is going to
>>> > > shutdown, it is good to proactively make sure the group knows it
>>> needs to
>>> > > rebalance work because some of the partitions that were handled by
>>> the
>>> > > consumer need to be handled by some other group members."
>>> > >
>>> > > So does this mean that the consumer should inform the group ahead of
>>> > > time before it goes down? Currently, I just shutdown the process.
>>> > >
>>> > >
>>> > > On Fri, Feb 10, 2017 at 8:35 AM, Pradeep Gollakota <
>>> pradeep...@gmail.com
>>> > >
>>> > > wrote:
>>> > >
>>> > > > I asked a similar question a while ago. There doesn't appear to be
>>> a
>>> > way
>>> > > to
>>> > > > not triggering the rebalance. But I'm not sure why it would be
>>> taking >
>>> > > 1hr
>>> > > > in your case. For us it was pretty fast.
>>> > > >
>>> > > > https://www.mail-archive.com/users@kafka.apache.org/msg23925.html
>>> > > >
>>> > > >
>>> > > >
>>> > > > On Fri, Feb 10, 2017 at 4:28 AM, Krzysztof Lesniewski, Nexiot AG <
>>> > > > krzysztof.lesniew...@nexiot.ch> wrote:
>>> > > >
>>> > > > > Would be great to get some input on it.
>>> > > > >
>>> > > > > - Krzysztof Lesniewski
>>> > > > >
>>> > > > >
>>> > > > > On 06.02.2017 08:27, Praveen wrote:
>>> > > > >
>>> > > > >> I have a 16 broker kafka cluster. There is a topic with 32
>>> > partitions
>>> > > > >> containing real time data and on the other side, I have 32
>>> boxes w/
>>> > 1
>>> > > > >> consumer reading from these partitions.
>>> > > > >>
>>> > > > >> Today our deployment strategy is stop, deploy and start the
>>> > processes
>>> > > on
>>> > > > >> all the 32 consumers. This triggers re-balancing and takes a
>>> long
>>> > > period
>>> > > > >> of
>>> > > > >> time (> 1hr). Such a long pause isn't good for real time
>>> processing.
>>> > > > >>
>>> > > > >> I was thinking of rolling deploy but I think that will still
>>> cause
>>> > > > >> re-balancing b/c we will still have consumers go down and come
>>> up.
>>> > > > >>
>>> > > > >> How do you deploy to consumers without triggering re-balancing
>>> (or
>>> > > > >> triggering one that doesn't affect your SLA) when doing real
>>> time
>>> > > > >> processing?
>>> > > > >>
>>> > > > >> Thanks,
>>> > > > >> Praveen
>>> > > > >>
>>> > > > >>
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>


Re: How does one deploy to consumers without causing re-balancing for real time use case?

2017-02-17 Thread Praveen
Hey Onur,

I was just watching your talk on rebalancing from last year -
https://www.youtube.com/watch?v=QaeXDh12EhE
Nice talk!.

I think I have an idea as to why it takes 1 hr in my case, based on the talk
in the video. In my case, with 32 boxes / consumers from the same group, I
believe the current state of the group coordinator's state machine gets
messed up each time a new consumer is added, until the very last consumer
joins. Also, I have the heartbeat set to 97 seconds (97 secs b/c normal
processing could take that long and we don't want the coordinator to think
the consumer is dead). I think both of these coupled together are why the
cluster restart takes > 1hr. I'm curious how LinkedIn does clean cluster
restarts? How do you handle the scenario described above?

Praveen


On Wed, Feb 15, 2017 at 10:22 AM, Praveen  wrote:

> I still think a clean cluster start should not take > 1 hr for balancing
> though. Is this expected or am i doing something different?
>
> I thought this would be a common use case.
>
> Praveen
>
> On Fri, Feb 10, 2017 at 10:26 AM, Onur Karaman <
> okara...@linkedin.com.invalid> wrote:
>
>> Pradeep is right.
>>
>> close() will try and send out a LeaveGroupRequest while a kill -9 will
>> not.
>>
>> On Fri, Feb 10, 2017 at 10:19 AM, Pradeep Gollakota > >
>> wrote:
>>
>> > I believe if you're calling the .close() method on shutdown, then the
>> > LeaveGroupRequest will be made. If you're doing a kill -9, I'm not sure
>> if
>> > that request will be made.
>> >
>> > On Fri, Feb 10, 2017 at 8:47 AM, Praveen  wrote:
>> >
>> > > @Pradeep - I just read your thread, the 1hr pause was when all the
>> > > consumers where shutdown simultaneously.  I'm testing out rolling
>> restart
>> > > to get the actual numbers. The initial numbers are promising.
>> > >
>> > > `STOP (1) (1min later kicks off) -> REBALANCE -> START (1) ->
>> REBALANCE
>> > > (takes 1min to get a partition)`
>> > >
>> > > In your thread, Ewen says -
>> > >
>> > > "The LeaveGroupRequest is only sent on a graceful shutdown. If a
>> > > consumer knows it is going to
>> > > shutdown, it is good to proactively make sure the group knows it
>> needs to
>> > > rebalance work because some of the partitions that were handled by the
>> > > consumer need to be handled by some other group members."
>> > >
>> > > So does this mean that the consumer should inform the group ahead of
>> > > time before it goes down? Currently, I just shutdown the process.
>> > >
>> > >
>> > > On Fri, Feb 10, 2017 at 8:35 AM, Pradeep Gollakota <
>> pradeep...@gmail.com
>> > >
>> > > wrote:
>> > >
>> > > > I asked a similar question a while ago. There doesn't appear to be a
>> > way
>> > > to
>> > > > not triggering the rebalance. But I'm not sure why it would be
>> taking >
>> > > 1hr
>> > > > in your case. For us it was pretty fast.
>> > > >
>> > > > https://www.mail-archive.com/users@kafka.apache.org/msg23925.html
>> > > >
>> > > >
>> > > >
>> > > > On Fri, Feb 10, 2017 at 4:28 AM, Krzysztof Lesniewski, Nexiot AG <
>> > > > krzysztof.lesniew...@nexiot.ch> wrote:
>> > > >
>> > > > > Would be great to get some input on it.
>> > > > >
>> > > > > - Krzysztof Lesniewski
>> > > > >
>> > > > >
>> > > > > On 06.02.2017 08:27, Praveen wrote:
>> > > > >
>> > > > >> I have a 16 broker kafka cluster. There is a topic with 32
>> > partitions
>> > > > >> containing real time data and on the other side, I have 32 boxes
>> w/
>> > 1
>> > > > >> consumer reading from these partitions.
>> > > > >>
>> > > > >> Today our deployment strategy is stop, deploy and start the
>> > processes
>> > > on
>> > > > >> all the 32 consumers. This triggers re-balancing and takes a long
>> > > period
>> > > > >> of
>> > > > >> time (> 1hr). Such a long pause isn't good for real time
>> processing.
>> > > > >>
>> > > > >> I was thinking of rolling deploy but I think that will still
>> cause
>> > > > >> re-balancing b/c we will still have consumers go down and come
>> up.
>> > > > >>
>> > > > >> How do you deploy to consumers without triggering re-balancing
>> (or
>> > > > >> triggering one that doesn't affect your SLA) when doing real time
>> > > > >> processing?
>> > > > >>
>> > > > >> Thanks,
>> > > > >> Praveen
>> > > > >>
>> > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>
>
>


Re: How does one deploy to consumers without causing re-balancing for real time use case?

2017-02-15 Thread Praveen
I still think a clean cluster start should not take > 1 hr for balancing,
though. Is this expected, or am I doing something different?

I thought this would be a common use case.

Praveen

On Fri, Feb 10, 2017 at 10:26 AM, Onur Karaman <
okara...@linkedin.com.invalid> wrote:

> Pradeep is right.
>
> close() will try and send out a LeaveGroupRequest while a kill -9 will not.
>
> On Fri, Feb 10, 2017 at 10:19 AM, Pradeep Gollakota 
> wrote:
>
> > I believe if you're calling the .close() method on shutdown, then the
> > LeaveGroupRequest will be made. If you're doing a kill -9, I'm not sure
> if
> > that request will be made.
> >
> > On Fri, Feb 10, 2017 at 8:47 AM, Praveen  wrote:
> >
> > > @Pradeep - I just read your thread, the 1hr pause was when all the
> > > consumers where shutdown simultaneously.  I'm testing out rolling
> restart
> > > to get the actual numbers. The initial numbers are promising.
> > >
> > > `STOP (1) (1min later kicks off) -> REBALANCE -> START (1) -> REBALANCE
> > > (takes 1min to get a partition)`
> > >
> > > In your thread, Ewen says -
> > >
> > > "The LeaveGroupRequest is only sent on a graceful shutdown. If a
> > > consumer knows it is going to
> > > shutdown, it is good to proactively make sure the group knows it needs
> to
> > > rebalance work because some of the partitions that were handled by the
> > > consumer need to be handled by some other group members."
> > >
> > > So does this mean that the consumer should inform the group ahead of
> > > time before it goes down? Currently, I just shutdown the process.
> > >
> > >
> > > On Fri, Feb 10, 2017 at 8:35 AM, Pradeep Gollakota <
> pradeep...@gmail.com
> > >
> > > wrote:
> > >
> > > > I asked a similar question a while ago. There doesn't appear to be a
> > way
> > > to
> > > > not triggering the rebalance. But I'm not sure why it would be
> taking >
> > > 1hr
> > > > in your case. For us it was pretty fast.
> > > >
> > > > https://www.mail-archive.com/users@kafka.apache.org/msg23925.html
> > > >
> > > >
> > > >
> > > > On Fri, Feb 10, 2017 at 4:28 AM, Krzysztof Lesniewski, Nexiot AG <
> > > > krzysztof.lesniew...@nexiot.ch> wrote:
> > > >
> > > > > Would be great to get some input on it.
> > > > >
> > > > > - Krzysztof Lesniewski
> > > > >
> > > > >
> > > > > On 06.02.2017 08:27, Praveen wrote:
> > > > >
> > > > >> I have a 16 broker kafka cluster. There is a topic with 32
> > partitions
> > > > >> containing real time data and on the other side, I have 32 boxes
> w/
> > 1
> > > > >> consumer reading from these partitions.
> > > > >>
> > > > >> Today our deployment strategy is stop, deploy and start the
> > processes
> > > on
> > > > >> all the 32 consumers. This triggers re-balancing and takes a long
> > > period
> > > > >> of
> > > > >> time (> 1hr). Such a long pause isn't good for real time
> processing.
> > > > >>
> > > > >> I was thinking of rolling deploy but I think that will still cause
> > > > >> re-balancing b/c we will still have consumers go down and come up.
> > > > >>
> > > > >> How do you deploy to consumers without triggering re-balancing (or
> > > > >> triggering one that doesn't affect your SLA) when doing real time
> > > > >> processing?
> > > > >>
> > > > >> Thanks,
> > > > >> Praveen
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>


Re: How does one deploy to consumers without causing re-balancing for real time use case?

2017-02-10 Thread Onur Karaman
Pradeep is right.

close() will try and send out a LeaveGroupRequest while a kill -9 will not.

On Fri, Feb 10, 2017 at 10:19 AM, Pradeep Gollakota 
wrote:

> I believe if you're calling the .close() method on shutdown, then the
> LeaveGroupRequest will be made. If you're doing a kill -9, I'm not sure if
> that request will be made.
>
> On Fri, Feb 10, 2017 at 8:47 AM, Praveen  wrote:
>
> > @Pradeep - I just read your thread, the 1hr pause was when all the
> > consumers where shutdown simultaneously.  I'm testing out rolling restart
> > to get the actual numbers. The initial numbers are promising.
> >
> > `STOP (1) (1min later kicks off) -> REBALANCE -> START (1) -> REBALANCE
> > (takes 1min to get a partition)`
> >
> > In your thread, Ewen says -
> >
> > "The LeaveGroupRequest is only sent on a graceful shutdown. If a
> > consumer knows it is going to
> > shutdown, it is good to proactively make sure the group knows it needs to
> > rebalance work because some of the partitions that were handled by the
> > consumer need to be handled by some other group members."
> >
> > So does this mean that the consumer should inform the group ahead of
> > time before it goes down? Currently, I just shutdown the process.
> >
> >
> > On Fri, Feb 10, 2017 at 8:35 AM, Pradeep Gollakota  >
> > wrote:
> >
> > > I asked a similar question a while ago. There doesn't appear to be a
> way
> > to
> > > not triggering the rebalance. But I'm not sure why it would be taking >
> > 1hr
> > > in your case. For us it was pretty fast.
> > >
> > > https://www.mail-archive.com/users@kafka.apache.org/msg23925.html
> > >
> > >
> > >
> > > On Fri, Feb 10, 2017 at 4:28 AM, Krzysztof Lesniewski, Nexiot AG <
> > > krzysztof.lesniew...@nexiot.ch> wrote:
> > >
> > > > Would be great to get some input on it.
> > > >
> > > > - Krzysztof Lesniewski
> > > >
> > > >
> > > > On 06.02.2017 08:27, Praveen wrote:
> > > >
> > > >> I have a 16 broker kafka cluster. There is a topic with 32
> partitions
> > > >> containing real time data and on the other side, I have 32 boxes w/
> 1
> > > >> consumer reading from these partitions.
> > > >>
> > > >> Today our deployment strategy is stop, deploy and start the
> processes
> > on
> > > >> all the 32 consumers. This triggers re-balancing and takes a long
> > period
> > > >> of
> > > >> time (> 1hr). Such a long pause isn't good for real time processing.
> > > >>
> > > >> I was thinking of rolling deploy but I think that will still cause
> > > >> re-balancing b/c we will still have consumers go down and come up.
> > > >>
> > > >> How do you deploy to consumers without triggering re-balancing (or
> > > >> triggering one that doesn't affect your SLA) when doing real time
> > > >> processing?
> > > >>
> > > >> Thanks,
> > > >> Praveen
> > > >>
> > > >>
> > > >
> > >
> >
>


Re: How does one deploy to consumers without causing re-balancing for real time use case?

2017-02-10 Thread Pradeep Gollakota
I believe if you're calling the .close() method on shutdown, then the
LeaveGroupRequest will be made. If you're doing a kill -9, I'm not sure if
that request will be made.

On Fri, Feb 10, 2017 at 8:47 AM, Praveen  wrote:

> @Pradeep - I just read your thread, the 1hr pause was when all the
> consumers where shutdown simultaneously.  I'm testing out rolling restart
> to get the actual numbers. The initial numbers are promising.
>
> `STOP (1) (1min later kicks off) -> REBALANCE -> START (1) -> REBALANCE
> (takes 1min to get a partition)`
>
> In your thread, Ewen says -
>
> "The LeaveGroupRequest is only sent on a graceful shutdown. If a
> consumer knows it is going to
> shutdown, it is good to proactively make sure the group knows it needs to
> rebalance work because some of the partitions that were handled by the
> consumer need to be handled by some other group members."
>
> So does this mean that the consumer should inform the group ahead of
> time before it goes down? Currently, I just shutdown the process.
>
>
> On Fri, Feb 10, 2017 at 8:35 AM, Pradeep Gollakota 
> wrote:
>
> > I asked a similar question a while ago. There doesn't appear to be a way
> to
> > not triggering the rebalance. But I'm not sure why it would be taking >
> 1hr
> > in your case. For us it was pretty fast.
> >
> > https://www.mail-archive.com/users@kafka.apache.org/msg23925.html
> >
> >
> >
> > On Fri, Feb 10, 2017 at 4:28 AM, Krzysztof Lesniewski, Nexiot AG <
> > krzysztof.lesniew...@nexiot.ch> wrote:
> >
> > > Would be great to get some input on it.
> > >
> > > - Krzysztof Lesniewski
> > >
> > >
> > > On 06.02.2017 08:27, Praveen wrote:
> > >
> > >> I have a 16 broker kafka cluster. There is a topic with 32 partitions
> > >> containing real time data and on the other side, I have 32 boxes w/ 1
> > >> consumer reading from these partitions.
> > >>
> > >> Today our deployment strategy is stop, deploy and start the processes
> on
> > >> all the 32 consumers. This triggers re-balancing and takes a long
> period
> > >> of
> > >> time (> 1hr). Such a long pause isn't good for real time processing.
> > >>
> > >> I was thinking of rolling deploy but I think that will still cause
> > >> re-balancing b/c we will still have consumers go down and come up.
> > >>
> > >> How do you deploy to consumers without triggering re-balancing (or
> > >> triggering one that doesn't affect your SLA) when doing real time
> > >> processing?
> > >>
> > >> Thanks,
> > >> Praveen
> > >>
> > >>
> > >
> >
>


Re: How does one deploy to consumers without causing re-balancing for real time use case?

2017-02-10 Thread Praveen
@Pradeep - I just read your thread; the 1hr pause was when all the
consumers were shut down simultaneously.  I'm testing out a rolling restart
to get the actual numbers. The initial numbers are promising.

`STOP (1) (1min later kicks off) -> REBALANCE -> START (1) -> REBALANCE
(takes 1min to get a partition)`

In your thread, Ewen says -

"The LeaveGroupRequest is only sent on a graceful shutdown. If a
consumer knows it is going to
shutdown, it is good to proactively make sure the group knows it needs to
rebalance work because some of the partitions that were handled by the
consumer need to be handled by some other group members."

So does this mean that the consumer should inform the group ahead of
time before it goes down? Currently, I just shut down the process.


On Fri, Feb 10, 2017 at 8:35 AM, Pradeep Gollakota 
wrote:

> I asked a similar question a while ago. There doesn't appear to be a way to
> not triggering the rebalance. But I'm not sure why it would be taking > 1hr
> in your case. For us it was pretty fast.
>
> https://www.mail-archive.com/users@kafka.apache.org/msg23925.html
>
>
>
> On Fri, Feb 10, 2017 at 4:28 AM, Krzysztof Lesniewski, Nexiot AG <
> krzysztof.lesniew...@nexiot.ch> wrote:
>
> > Would be great to get some input on it.
> >
> > - Krzysztof Lesniewski
> >
> >
> > On 06.02.2017 08:27, Praveen wrote:
> >
> >> I have a 16 broker kafka cluster. There is a topic with 32 partitions
> >> containing real time data and on the other side, I have 32 boxes w/ 1
> >> consumer reading from these partitions.
> >>
> >> Today our deployment strategy is stop, deploy and start the processes on
> >> all the 32 consumers. This triggers re-balancing and takes a long period
> >> of
> >> time (> 1hr). Such a long pause isn't good for real time processing.
> >>
> >> I was thinking of rolling deploy but I think that will still cause
> >> re-balancing b/c we will still have consumers go down and come up.
> >>
> >> How do you deploy to consumers without triggering re-balancing (or
> >> triggering one that doesn't affect your SLA) when doing real time
> >> processing?
> >>
> >> Thanks,
> >> Praveen
> >>
> >>
> >
>


Re: How does one deploy to consumers without causing re-balancing for real time use case?

2017-02-10 Thread Pradeep Gollakota
I asked a similar question a while ago. There doesn't appear to be a way to
not triggering the rebalance. But I'm not sure why it would be taking > 1hr
in your case. For us it was pretty fast.

https://www.mail-archive.com/users@kafka.apache.org/msg23925.html



On Fri, Feb 10, 2017 at 4:28 AM, Krzysztof Lesniewski, Nexiot AG <
krzysztof.lesniew...@nexiot.ch> wrote:

> Would be great to get some input on it.
>
> - Krzysztof Lesniewski
>
>
> On 06.02.2017 08:27, Praveen wrote:
>
>> I have a 16 broker kafka cluster. There is a topic with 32 partitions
>> containing real time data and on the other side, I have 32 boxes w/ 1
>> consumer reading from these partitions.
>>
>> Today our deployment strategy is stop, deploy and start the processes on
>> all the 32 consumers. This triggers re-balancing and takes a long period
>> of
>> time (> 1hr). Such a long pause isn't good for real time processing.
>>
>> I was thinking of rolling deploy but I think that will still cause
>> re-balancing b/c we will still have consumers go down and come up.
>>
>> How do you deploy to consumers without triggering re-balancing (or
>> triggering one that doesn't affect your SLA) when doing real time
>> processing?
>>
>> Thanks,
>> Praveen
>>
>>
>


Re: How does one deploy to consumers without causing re-balancing for real time use case?

2017-02-10 Thread Krzysztof Lesniewski, Nexiot AG

Would be great to get some input on it.

- Krzysztof Lesniewski

On 06.02.2017 08:27, Praveen wrote:

I have a 16 broker kafka cluster. There is a topic with 32 partitions
containing real time data and on the other side, I have 32 boxes w/ 1
consumer reading from these partitions.

Today our deployment strategy is stop, deploy and start the processes on
all the 32 consumers. This triggers re-balancing and takes a long period of
time (> 1hr). Such a long pause isn't good for real time processing.

I was thinking of rolling deploy but I think that will still cause
re-balancing b/c we will still have consumers go down and come up.

How do you deploy to consumers without triggering re-balancing (or
triggering one that doesn't affect your SLA) when doing real time
processing?

Thanks,
Praveen





How does one deploy to consumers without causing re-balancing for real time use case?

2017-02-05 Thread Praveen
I have a 16 broker kafka cluster. There is a topic with 32 partitions
containing real time data and on the other side, I have 32 boxes w/ 1
consumer reading from these partitions.

Today our deployment strategy is stop, deploy and start the processes on
all the 32 consumers. This triggers re-balancing and takes a long period of
time (> 1hr). Such a long pause isn't good for real time processing.

I was thinking of rolling deploy but I think that will still cause
re-balancing b/c we will still have consumers go down and come up.

How do you deploy to consumers without triggering re-balancing (or
triggering one that doesn't affect your SLA) when doing real time
processing?

Thanks,
Praveen


Re: Architecture recommendations for a tricky use case

2016-10-05 Thread Avi Flax

> On Sep 29, 2016, at 16:39, Ali Akhtar  wrote:
> 
> Why did you choose Druid over Postgres / Cassandra / Elasticsearch?

Well, to be clear, we haven’t chosen it yet — we’re evaluating it.

That said, it is looking quite promising for our use case.

The Druid docs say it well:

> Druid is an open source data store designed for OLAP queries on event data.

And that’s exactly what we need. The other options you listed are excellent 
systems, but they’re more general than Druid. Because Druid is specifically 
focused on OLAP queries on event data, it has features and properties that make 
it very well suited to such use cases.

In addition, Druid has built-in support for ingesting events from Kafka topics 
and making those events available for querying with very low latency. This is 
very attractive for my use case.

If you’d like to learn more about Druid I recommend this talk from last month 
at Strange Loop: https://www.youtube.com/watch?v=vbH8E0nH2Nw

HTH!

Avi


Software Architect @ Park Assist
We’re hiring! http://tech.parkassist.com/jobs/



Re: Architecture recommendations for a tricky use case

2016-09-30 Thread Andrew Stevenson
>>>>> It also needs a custom front-end, so a system like Tableau can't be
>>>>> used, it must have a custom backend + front-end.
>>>>>
>>>>> Thanks for the recommendation of Flume. Do you think this will work:
>>>>>
>>>>> - Spark Streaming to read data from Kafka
>>>>> - Storing the data on HDFS using Flume
>>>>> - Using Spark to query the data in the backend of the web UI?
>>>>>
>>>>>
>>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>>>>>  wrote:
>>>>>>
>>>>>> You need a batch layer and a speed layer. Data from Kafka can be
>>>>>> stored on HDFS using flume.
>>>>>>
>>>>>> -  Query this data to generate reports / analytics (There will be a
>>>>>> web UI which will be the front-end to the data, and will show the reports)
>>>>>>
>>>>>> This is basically batch layer and you need something like Tableau or
>>>>>> Zeppelin to query data
>>>>>>
>>>>>> You will also need spark streaming to query data online for speed
>>>>>> layer. That data could be stored in some transient fabric like ignite or
>>>>>> even druid.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> LinkedIn
>>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>>>>>> any loss, damage or destruction of data or any other property which may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>> On 29 September 2016 at 15:01, Ali Akhtar 
>>>>>> wrote:
>>>>>>>
>>>>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>>>>
>>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>>>>>>>  wrote:
>>>>>>>>
>>>>>>>> What is the message inflow ?
>>>>>>>> If it's really high , definitely spark will be of great use .
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Deepak
>>>>>>>>
>>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar"  wrote:
>>>>>>>>>
>>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>>>>
>>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>>>>> raw data into Kafka.
>>>>>>>>>
>>>>>>>>> I need to:
>>>>>>>>>
>>>>>>>>> - Do ETL on the data, and standardize it.
>>>>>>>>>
>>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw
>>>>>>>>> HDFS / ElasticSearch / Postgres)
>>>>>>>>>
>>>>>>>>> - Query this data to generate reports / analytics (There will be a
>>>>>>>>> web UI which will be the front-end to the data, and will show the reports)
>>>>>>>>>
>>>>>>>>> Java is being used as the backend language for everything (backend
>>>>>>>>> of the web UI, as well as the ETL layer)
>>>>>>>>>
>>>>>>>>> I'm considering:
>>>>>>>>>
>>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>>>>>
>>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>>>>>> data, and to allow queries
>>>>>>>>>
>>>>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>>>>> queries across the data (mostly filters), or directly run queries against
>>>>>>>>> Cassandra / HBase
>>>>>>>>>
>>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>>>>>>>> ETL, which persistent data store to use, and how to query that data store in
>>>>>>>>> the backend of the web UI, for displaying the reports).
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>




Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Alonso Isidoro Roman
"Using Spark to query the data in the backend of the web UI?"

Don't do that. I would recommend that the Spark Streaming process store the
data in some NoSQL or SQL database, and that the web UI query the data from
that database.

Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman

2016-09-29 16:15 GMT+02:00 Ali Akhtar :

> The web UI is actually the speed layer, it needs to be able to query the
> data online, and show the results in real-time.
>
> It also needs a custom front-end, so a system like Tableau can't be used,
> it must have a custom backend + front-end.
>
> Thanks for the recommendation of Flume. Do you think this will work:
>
> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
> - Using Spark to query the data in the backend of the web UI?
>
>
>
> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> You need a batch layer and a speed layer. Data from Kafka can be stored
>> on HDFS using flume.
>>
>> -  Query this data to generate reports / analytics (There will be a web
>> UI which will be the front-end to the data, and will show the reports)
>>
>> This is basically batch layer and you need something like Tableau or
>> Zeppelin to query data
>>
>> You will also need spark streaming to query data online for speed layer.
>> That data could be stored in some transient fabric like ignite or even
>> druid.
>>
>> HTH
>>
>>
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 29 September 2016 at 15:01, Ali Akhtar  wrote:
>>
>>> It needs to be able to scale to a very large amount of data, yes.
>>>
>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma 
>>> wrote:
>>>
>>>> What is the message inflow ?
>>>> If it's really high , definitely spark will be of great use .
>>>>
>>>> Thanks
>>>> Deepak
>>>>
>>>> On Sep 29, 2016 19:24, "Ali Akhtar"  wrote:
>>>>
>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>
>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>> raw data into Kafka.
>>>>>
>>>>> I need to:
>>>>>
>>>>> - Do ETL on the data, and standardize it.
>>>>>
>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS
>>>>> / ElasticSearch / Postgres)
>>>>>
>>>>> - Query this data to generate reports / analytics (There will be a web
>>>>> UI which will be the front-end to the data, and will show the reports)
>>>>>
>>>>> Java is being used as the backend language for everything (backend of
>>>>> the web UI, as well as the ETL layer)
>>>>>
>>>>> I'm considering:
>>>>>
>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>
>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>> data, and to allow queries
>>>>>
>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>> queries across the data (mostly filters), or directly run queries against
>>>>> Cassandra / HBase
>>>>>
>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>>>> ETL, which persistent data store to use, and how to query that data store
>>>>> in the backend of the web UI, for displaying the reports).
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>
>>>
>>
>


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Ok… so what’s the tricky part? 
Spark Streaming isn’t real time so if you don’t mind a slight delay in 
processing… it would work.

The drawback is that you now have a long running Spark Job (assuming under 
YARN) and that could become a problem in terms of security and resources. 
(How well does Yarn handle long running jobs these days in a secured Cluster? 
Steve L. may have some insight… ) 

Raw HDFS would become a problem because Apache HDFS is still WORM (write once,
read many) storage. (Do you want to write your own compaction code? Or use Hive 1.x+?)

HBase? Depending on your admin… stability could be a problem. 
Cassandra? That would be a separate cluster and that in itself could be a 
problem… 

YMMV so you need to address the pros/cons of each tool specific to your 
environment and skill level. 

HTH

-Mike

> On Sep 29, 2016, at 8:54 AM, Ali Akhtar  wrote:
> 
> I have a somewhat tricky use case, and I'm looking for ideas.
> 
> I have 5-6 Kafka producers, reading various APIs, and writing their raw data 
> into Kafka.
> 
> I need to:
> 
> - Do ETL on the data, and standardize it.
> 
> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
> ElasticSearch / Postgres)
> 
> - Query this data to generate reports / analytics (There will be a web UI 
> which will be the front-end to the data, and will show the reports)
> 
> Java is being used as the backend language for everything (backend of the web 
> UI, as well as the ETL layer)
> 
> I'm considering:
> 
> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive 
> raw data from Kafka, standardize & store it)
> 
> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, and 
> to allow queries
> 
> - In the backend of the web UI, I could either use Spark to run queries 
> across the data (mostly filters), or directly run queries against Cassandra / 
> HBase
> 
> I'd appreciate some thoughts / suggestions on which of these alternatives I 
> should go with (e.g, using raw Kafka consumers vs Spark for ETL, which 
> persistent data store to use, and how to query that data store in the backend 
> of the web UI, for displaying the reports).
> 
> 
> Thanks.



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Spark standalone is not Yarn… or secure for that matter… ;-)

> On Sep 29, 2016, at 11:18 AM, Cody Koeninger  wrote:
> 
> Spark streaming helps with aggregation because
> 
> A. raw kafka consumers have no built in framework for shuffling
> amongst nodes, short of writing into an intermediate topic (I'm not
> touching Kafka Streams here, I don't have experience), and
> 
> B. it deals with batches, so you can transactionally decide to commit
> or rollback your aggregate data and your offsets.  Otherwise your
> offsets and data store can get out of sync, leading to lost /
> duplicate data.
> 
> Regarding long running spark jobs, I have streaming jobs in the
> standalone manager that have been running for 6 months or more.
> 
> On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
>  wrote:
>> Ok… so what’s the tricky part?
>> Spark Streaming isn’t real time so if you don’t mind a slight delay in 
>> processing… it would work.
>> 
>> The drawback is that you now have a long running Spark Job (assuming under 
>> YARN) and that could become a problem in terms of security and resources.
>> (How well does Yarn handle long running jobs these days in a secured 
>> Cluster? Steve L. may have some insight… )
>> 
>> Raw HDFS would become a problem because Apache HDFS is still a worm. (Do you 
>> want to write your own compaction code? Or use Hive 1.x+?)
>> 
>> HBase? Depending on your admin… stability could be a problem.
>> Cassandra? That would be a separate cluster and that in itself could be a 
>> problem…
>> 
>> YMMV so you need to address the pros/cons of each tool specific to your 
>> environment and skill level.
>> 
>> HTH
>> 
>> -Mike
>> 
>>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar  wrote:
>>> 
>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>> 
>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw 
>>> data into Kafka.
>>> 
>>> I need to:
>>> 
>>> - Do ETL on the data, and standardize it.
>>> 
>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
>>> ElasticSearch / Postgres)
>>> 
>>> - Query this data to generate reports / analytics (There will be a web UI 
>>> which will be the front-end to the data, and will show the reports)
>>> 
>>> Java is being used as the backend language for everything (backend of the 
>>> web UI, as well as the ETL layer)
>>> 
>>> I'm considering:
>>> 
>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive 
>>> raw data from Kafka, standardize & store it)
>>> 
>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, 
>>> and to allow queries
>>> 
>>> - In the backend of the web UI, I could either use Spark to run queries 
>>> across the data (mostly filters), or directly run queries against Cassandra 
>>> / HBase
>>> 
>>> I'd appreciate some thoughts / suggestions on which of these alternatives I 
>>> should go with (e.g, using raw Kafka consumers vs Spark for ETL, which 
>>> persistent data store to use, and how to query that data store in the 
>>> backend of the web UI, for displaying the reports).
>>> 
>>> 
>>> Thanks.
>> 



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
The OP mentioned HBase or HDFS as persistent storage, therefore they have to be
running YARN if they are considering Spark.
(Assuming that you're not trying to do a split storage / compute model and use
standalone Spark outside your cluster. You can, but you have more moving
parts…)

I never said anything about putting something on a public network. I mentioned 
running a secured cluster.
You don’t deal with PII or other regulated data, do you? 


If you read my original post, you are correct that we don't have a lot of, if
any, real information.
Based on what the OP said, there are design considerations, since every tool he
mentioned has pluses and minuses, and the problem isn't really that challenging
unless you have something extraordinary like high velocity or some other
constraint.

BTW, depending on scale and velocity… your relational engines may become 
problematic. 
HTH

-Mike


> On Sep 29, 2016, at 1:51 PM, Cody Koeninger  wrote:
> 
> The OP didn't say anything about Yarn, and why are you contemplating
> putting Kafka or Spark on public networks to begin with?
> 
> Gwen's right, absent any actual requirements this is kind of pointless.
> 
> On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel
>  wrote:
>> Spark standalone is not Yarn… or secure for that matter… ;-)
>> 
>>> On Sep 29, 2016, at 11:18 AM, Cody Koeninger  wrote:
>>> 
>>> Spark streaming helps with aggregation because
>>> 
>>> A. raw kafka consumers have no built in framework for shuffling
>>> amongst nodes, short of writing into an intermediate topic (I'm not
>>> touching Kafka Streams here, I don't have experience), and
>>> 
>>> B. it deals with batches, so you can transactionally decide to commit
>>> or rollback your aggregate data and your offsets.  Otherwise your
>>> offsets and data store can get out of sync, leading to lost /
>>> duplicate data.
>>> 
>>> Regarding long running spark jobs, I have streaming jobs in the
>>> standalone manager that have been running for 6 months or more.
>>> 
>>> On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
>>>  wrote:
>>>> Ok… so what’s the tricky part?
>>>> Spark Streaming isn’t real time so if you don’t mind a slight delay in 
>>>> processing… it would work.
>>>> 
>>>> The drawback is that you now have a long running Spark Job (assuming under 
>>>> YARN) and that could become a problem in terms of security and resources.
>>>> (How well does Yarn handle long running jobs these days in a secured 
>>>> Cluster? Steve L. may have some insight… )
>>>> 
>>>> Raw HDFS would become a problem because Apache HDFS is still a worm. (Do 
>>>> you want to write your own compaction code? Or use Hive 1.x+?)
>>>> 
>>>> HBase? Depending on your admin… stability could be a problem.
>>>> Cassandra? That would be a separate cluster and that in itself could be a 
>>>> problem…
>>>> 
>>>> YMMV so you need to address the pros/cons of each tool specific to your 
>>>> environment and skill level.
>>>> 
>>>> HTH
>>>> 
>>>> -Mike
>>>> 
>>>>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar  wrote:
>>>>> 
>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>> 
>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw 
>>>>> data into Kafka.
>>>>> 
>>>>> I need to:
>>>>> 
>>>>> - Do ETL on the data, and standardize it.
>>>>> 
>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
>>>>> ElasticSearch / Postgres)
>>>>> 
>>>>> - Query this data to generate reports / analytics (There will be a web UI 
>>>>> which will be the front-end to the data, and will show the reports)
>>>>> 
>>>>> Java is being used as the backend language for everything (backend of the 
>>>>> web UI, as well as the ETL layer)
>>>>> 
>>>>> I'm considering:
>>>>> 
>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer 
>>>>> (receive raw data from Kafka, standardize & store it)
>>>>> 
>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, 
>>>>> and to allow queries
>>>>> 
>>>>> - In the backend of the web UI, I could either use Spark to run queries 
>>>>> across the data (mostly filters), or directly run queries against 
>>>>> Cassandra / HBase
>>>>> 
>>>>> I'd appreciate some thoughts / suggestions on which of these alternatives 
>>>>> I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which 
>>>>> persistent data store to use, and how to query that data store in the 
>>>>> backend of the web UI, for displaying the reports).
>>>>> 
>>>>> 
>>>>> Thanks.
>>>> 
>> 



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
Avi,

Why did you choose Druid over Postgres / Cassandra / Elasticsearch?

On Fri, Sep 30, 2016 at 1:09 AM, Avi Flax  wrote:

>
> > On Sep 29, 2016, at 09:54, Ali Akhtar  wrote:
> >
> > I'd appreciate some thoughts / suggestions on which of these
> alternatives I
> > should go with (e.g, using raw Kafka consumers vs Spark for ETL, which
> > persistent data store to use, and how to query that data store in the
> > backend of the web UI, for displaying the reports).
>
> Hi Ali, I’m no expert in any of this, but I’m working on a project that is
> broadly similar to yours, and FWIW I’m evaluating Druid as the datastore
> which would host the queryable data and, well, actually handle and fulfill
> queries.
>
> Since Druid has built-in support for streaming ingestion from Kafka
> topics, I’m tentatively thinking of doing my ETL in a stream processing
> topology (I’m using Kafka Streams, FWIW), which would write the events
> destined for Druid into certain topics, from which Druid would ingest those
> events.
>
> HTH,
> Avi
>
> 
> Software Architect @ Park Assist
> We’re hiring! http://tech.parkassist.com/jobs/
>
>


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Avi Flax

> On Sep 29, 2016, at 09:54, Ali Akhtar  wrote:
> 
> I'd appreciate some thoughts / suggestions on which of these alternatives I
> should go with (e.g, using raw Kafka consumers vs Spark for ETL, which
> persistent data store to use, and how to query that data store in the
> backend of the web UI, for displaying the reports).

Hi Ali, I’m no expert in any of this, but I’m working on a project that is 
broadly similar to yours, and FWIW I’m evaluating Druid as the datastore which 
would host the queryable data and, well, actually handle and fulfill queries.

Since Druid has built-in support for streaming ingestion from Kafka topics, I’m 
tentatively thinking of doing my ETL in a stream processing topology (I’m using 
Kafka Streams, FWIW), which would write the events destined for Druid into 
certain topics, from which Druid would ingest those events.
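
For illustration, a minimal sketch of the shape of topology I have in mind,
using the Kafka Streams Java API (newer StreamsBuilder style); the topic
names and the standardize step are placeholders, not anything from a real
project:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class EtlTopology {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "etl-for-druid");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read raw events, standardize them, and write them to the topic
        // that Druid is configured to ingest from.
        KStream<String, String> raw = builder.stream("raw-events");
        raw.mapValues(EtlTopology::standardize)
           .to("events-for-druid");

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder for the actual field dropping / renaming / cleanup.
    static String standardize(String rawJson) {
        return rawJson;
    }
}

The ETL stays entirely inside the stream processor, and Druid just sees a
clean topic on the other side.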

HTH,
Avi


Software Architect @ Park Assist
We’re hiring! http://tech.parkassist.com/jobs/



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
The OP didn't say anything about Yarn, and why are you contemplating
putting Kafka or Spark on public networks to begin with?

Gwen's right, absent any actual requirements this is kind of pointless.

On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel
 wrote:
> Spark standalone is not Yarn… or secure for that matter… ;-)
>
>> On Sep 29, 2016, at 11:18 AM, Cody Koeninger  wrote:
>>
>> Spark streaming helps with aggregation because
>>
>> A. raw kafka consumers have no built in framework for shuffling
>> amongst nodes, short of writing into an intermediate topic (I'm not
>> touching Kafka Streams here, I don't have experience), and
>>
>> B. it deals with batches, so you can transactionally decide to commit
>> or rollback your aggregate data and your offsets.  Otherwise your
>> offsets and data store can get out of sync, leading to lost /
>> duplicate data.
>>
>> Regarding long running spark jobs, I have streaming jobs in the
>> standalone manager that have been running for 6 months or more.
>>
>> On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
>>  wrote:
>>> Ok… so what’s the tricky part?
>>> Spark Streaming isn’t real time so if you don’t mind a slight delay in 
>>> processing… it would work.
>>>
>>> The drawback is that you now have a long running Spark Job (assuming under 
>>> YARN) and that could become a problem in terms of security and resources.
>>> (How well does Yarn handle long running jobs these days in a secured 
>>> Cluster? Steve L. may have some insight… )
>>>
>>> Raw HDFS would become a problem because Apache HDFS is still a worm. (Do 
>>> you want to write your own compaction code? Or use Hive 1.x+?)
>>>
>>> HBase? Depending on your admin… stability could be a problem.
>>> Cassandra? That would be a separate cluster and that in itself could be a 
>>> problem…
>>>
>>> YMMV so you need to address the pros/cons of each tool specific to your 
>>> environment and skill level.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar  wrote:
>>>>
>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>
>>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw 
>>>> data into Kafka.
>>>>
>>>> I need to:
>>>>
>>>> - Do ETL on the data, and standardize it.
>>>>
>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
>>>> ElasticSearch / Postgres)
>>>>
>>>> - Query this data to generate reports / analytics (There will be a web UI 
>>>> which will be the front-end to the data, and will show the reports)
>>>>
>>>> Java is being used as the backend language for everything (backend of the 
>>>> web UI, as well as the ETL layer)
>>>>
>>>> I'm considering:
>>>>
>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive 
>>>> raw data from Kafka, standardize & store it)
>>>>
>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, 
>>>> and to allow queries
>>>>
>>>> - In the backend of the web UI, I could either use Spark to run queries 
>>>> across the data (mostly filters), or directly run queries against 
>>>> Cassandra / HBase
>>>>
>>>> I'd appreciate some thoughts / suggestions on which of these alternatives 
>>>> I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which 
>>>> persistent data store to use, and how to query that data store in the 
>>>> backend of the web UI, for displaying the reports).
>>>>
>>>>
>>>> Thanks.
>>>
>


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Gwen Shapira
The original post made no mention of throughput or latency or
correctness requirements, so pretty much any data store will fit the
bill... discussions of "what is better" degrade fast when there are no
concrete standards to choose between.

Who cares about anything when we don't know what we need? :)

On Thu, Sep 29, 2016 at 9:23 AM, Cody Koeninger  wrote:
>> I still don't understand why writing to a transactional database with 
>> locking and concurrency (read and writes) through JDBC will be fast for this 
>> sort of data ingestion.
>
> Who cares about fast if your data is wrong?  And it's still plenty fast enough
>
> https://youtu.be/NVl9_6J1G60?list=WL&t=1819
>
> https://www.citusdata.com/blog/2016/09/22/announcing-citus-mx/
>
>
>
> On Thu, Sep 29, 2016 at 11:16 AM, Mich Talebzadeh
>  wrote:
>> The way I see this, there are two things involved.
>>
>> Data ingestion through source to Kafka
>> Date conversion and Storage ETL/ELT
>> Presentation
>>
>> Item 2 is the one that needs to be designed correctly. I presume raw data
>> has to confirm to some form of MDM that requires schema mapping etc before
>> putting into persistent storage (DB, HDFS etc). Which one to choose depends
>> on your volume of ingestion and your cluster size and complexity of data
>> conversion. Then your users will use some form of UI (Tableau, QlikView,
>> Zeppelin, direct SQL) to query data one way or other. Your users can
>> directly use UI like Tableau that offer in built analytics on SQL. Spark sql
>> offers the same). Your mileage varies according to your needs.
>>
>> I still don't understand why writing to a transactional database with
>> locking and concurrency (read and writes) through JDBC will be fast for this
>> sort of data ingestion. If you ask me if I wanted to choose an RDBMS to
>> write to as my sink,I would use Oracle which offers the best locking and
>> concurrency among RDBMs and also handles key value pairs as well (assuming
>> that is what you want). In addition, it can be used as a Data Warehouse as
>> well.
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed. The
>> author will in no case be liable for any monetary damages arising from such
>> loss, damage or destruction.
>>
>>
>>
>>
>> On 29 September 2016 at 16:49, Ali Akhtar  wrote:
>>>
>>> The business use case is to read a user's data from a variety of different
>>> services through their API, and then allowing the user to query that data,
>>> on a per service basis, as well as an aggregate across all services.
>>>
>>> The way I'm considering doing it, is to do some basic ETL (drop all the
>>> unnecessary fields, rename some fields into something more manageable, etc)
>>> and then store the data in Cassandra / Postgres.
>>>
>>> Then, when the user wants to view a particular report, query the
>>> respective table in Cassandra / Postgres. (select .. from data where user =
>>> ? and date between  and  and some_field = ?)
>>>
>>> How will Spark Streaming help w/ aggregation? Couldn't the data be queried
>>> from Cassandra / Postgres via the Kafka consumer and aggregated that way?
>>>
>>> On Thu, Sep 29, 2016 at 8:43 PM, Cody Koeninger 
>>> wrote:
>>>>
>>>> No, direct stream in and of itself won't ensure an end-to-end
>>>> guarantee, because it doesn't know anything about your output actions.
>>>>
>>>> You still need to do some work.  The point is having easy access to
>>>> offsets for batches on a per-partition basis makes it easier to do
>>>> that work, especially in conjunction with aggregation.
>>>>
>>>> On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma 
>>>> wrote:
>>>> > If you use spark direct streams , it ensure end to end guarantee for
>>>> > messages.
>>>> >
>>>> >
>>>> > On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar 
>>>> > wrote:
>>>> >>
>>>> >> My concern with Postgres / Cassand

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
> I still don't understand why writing to a transactional database with locking 
> and concurrency (read and writes) through JDBC will be fast for this sort of 
> data ingestion.

Who cares about fast if your data is wrong?  And it's still plenty fast enough

https://youtu.be/NVl9_6J1G60?list=WL&t=1819

https://www.citusdata.com/blog/2016/09/22/announcing-citus-mx/



On Thu, Sep 29, 2016 at 11:16 AM, Mich Talebzadeh
 wrote:
> The way I see this, there are two things involved.
>
> Data ingestion through source to Kafka
> Date conversion and Storage ETL/ELT
> Presentation
>
> Item 2 is the one that needs to be designed correctly. I presume raw data
> has to confirm to some form of MDM that requires schema mapping etc before
> putting into persistent storage (DB, HDFS etc). Which one to choose depends
> on your volume of ingestion and your cluster size and complexity of data
> conversion. Then your users will use some form of UI (Tableau, QlikView,
> Zeppelin, direct SQL) to query data one way or other. Your users can
> directly use UI like Tableau that offer in built analytics on SQL. Spark sql
> offers the same). Your mileage varies according to your needs.
>
> I still don't understand why writing to a transactional database with
> locking and concurrency (read and writes) through JDBC will be fast for this
> sort of data ingestion. If you ask me if I wanted to choose an RDBMS to
> write to as my sink,I would use Oracle which offers the best locking and
> concurrency among RDBMs and also handles key value pairs as well (assuming
> that is what you want). In addition, it can be used as a Data Warehouse as
> well.
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 29 September 2016 at 16:49, Ali Akhtar  wrote:
>>
>> The business use case is to read a user's data from a variety of different
>> services through their API, and then allowing the user to query that data,
>> on a per service basis, as well as an aggregate across all services.
>>
>> The way I'm considering doing it, is to do some basic ETL (drop all the
>> unnecessary fields, rename some fields into something more manageable, etc)
>> and then store the data in Cassandra / Postgres.
>>
>> Then, when the user wants to view a particular report, query the
>> respective table in Cassandra / Postgres. (select .. from data where user =
>> ? and date between  and  and some_field = ?)
>>
>> How will Spark Streaming help w/ aggregation? Couldn't the data be queried
>> from Cassandra / Postgres via the Kafka consumer and aggregated that way?
>>
>> On Thu, Sep 29, 2016 at 8:43 PM, Cody Koeninger 
>> wrote:
>>>
>>> No, direct stream in and of itself won't ensure an end-to-end
>>> guarantee, because it doesn't know anything about your output actions.
>>>
>>> You still need to do some work.  The point is having easy access to
>>> offsets for batches on a per-partition basis makes it easier to do
>>> that work, especially in conjunction with aggregation.
>>>
>>> On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma 
>>> wrote:
>>> > If you use spark direct streams , it ensure end to end guarantee for
>>> > messages.
>>> >
>>> >
>>> > On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar 
>>> > wrote:
>>> >>
>>> >> My concern with Postgres / Cassandra is only scalability. I will look
>>> >> further into Postgres horizontal scaling, thanks.
>>> >>
>>> >> Writes could be idempotent if done as upserts, otherwise updates will
>>> >> be
>>> >> idempotent but not inserts.
>>> >>
>>> >> Data should not be lost. The system should be as fault tolerant as
>>> >> possible.
>>> >>
>>> >> What's the advantage of using Spark for reading Kafka instead of
>>> >> direct
>>> >> Kafka consumers?
>>> >>
>>> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger 
>>> >> wrote:
>>> >>>
>&g

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
Spark streaming helps with aggregation because

A. raw kafka consumers have no built in framework for shuffling
amongst nodes, short of writing into an intermediate topic (I'm not
touching Kafka Streams here, I don't have experience), and

B. it deals with batches, so you can transactionally decide to commit
or rollback your aggregate data and your offsets.  Otherwise your
offsets and data store can get out of sync, leading to lost /
duplicate data.

Regarding long running spark jobs, I have streaming jobs in the
standalone manager that have been running for 6 months or more.
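
To make B concrete, a minimal sketch of the commit-together idea, assuming
Postgres 9.5+ and made-up table names (aggregates, kafka_offsets); the
topic / partition / untilOffset values would come from the batch's offset
ranges in a direct stream:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

public class TransactionalSink {
    public static void saveBatch(Map<String, Long> countsByKey,
                                 String topic, int partition, long untilOffset) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/etl")) {
            conn.setAutoCommit(false);
            try (PreparedStatement agg = conn.prepareStatement(
                     "INSERT INTO aggregates (k, cnt) VALUES (?, ?) "
                   + "ON CONFLICT (k) DO UPDATE SET cnt = aggregates.cnt + EXCLUDED.cnt");
                 PreparedStatement off = conn.prepareStatement(
                     "UPDATE kafka_offsets SET until_offset = ? "
                   + "WHERE topic = ? AND kafka_partition = ?")) {
                for (Map.Entry<String, Long> e : countsByKey.entrySet()) {
                    agg.setString(1, e.getKey());
                    agg.setLong(2, e.getValue());
                    agg.addBatch();
                }
                agg.executeBatch();
                // Assumes the offsets row was seeded when the job was first set up.
                off.setLong(1, untilOffset);
                off.setString(2, topic);
                off.setInt(3, partition);
                off.executeUpdate();
                conn.commit();   // aggregates and offsets land together...
            } catch (Exception e) {
                conn.rollback(); // ...or not at all, so they can't drift apart
                throw e;
            }
        }
    }
}

On restart you read kafka_offsets and resume from there, so a batch is
either fully applied or replayed from scratch.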

On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
 wrote:
> Ok… so what’s the tricky part?
> Spark Streaming isn’t real time so if you don’t mind a slight delay in 
> processing… it would work.
>
> The drawback is that you now have a long running Spark Job (assuming under 
> YARN) and that could become a problem in terms of security and resources.
> (How well does Yarn handle long running jobs these days in a secured Cluster? 
> Steve L. may have some insight… )
>
> Raw HDFS would become a problem because Apache HDFS is still a worm. (Do you 
> want to write your own compaction code? Or use Hive 1.x+?)
>
> HBase? Depending on your admin… stability could be a problem.
> Cassandra? That would be a separate cluster and that in itself could be a 
> problem…
>
> YMMV so you need to address the pros/cons of each tool specific to your 
> environment and skill level.
>
> HTH
>
> -Mike
>
>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar  wrote:
>>
>> I have a somewhat tricky use case, and I'm looking for ideas.
>>
>> I have 5-6 Kafka producers, reading various APIs, and writing their raw data 
>> into Kafka.
>>
>> I need to:
>>
>> - Do ETL on the data, and standardize it.
>>
>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
>> ElasticSearch / Postgres)
>>
>> - Query this data to generate reports / analytics (There will be a web UI 
>> which will be the front-end to the data, and will show the reports)
>>
>> Java is being used as the backend language for everything (backend of the 
>> web UI, as well as the ETL layer)
>>
>> I'm considering:
>>
>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive 
>> raw data from Kafka, standardize & store it)
>>
>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, 
>> and to allow queries
>>
>> - In the backend of the web UI, I could either use Spark to run queries 
>> across the data (mostly filters), or directly run queries against Cassandra 
>> / HBase
>>
>> I'd appreciate some thoughts / suggestions on which of these alternatives I 
>> should go with (e.g, using raw Kafka consumers vs Spark for ETL, which 
>> persistent data store to use, and how to query that data store in the 
>> backend of the web UI, for displaying the reports).
>>
>>
>> Thanks.
>


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
The way I see this, there are three things involved.


   1. Data ingestion from the source into Kafka
   2. Data conversion and storage (ETL/ELT)
   3. Presentation

Item 2 is the one that needs to be designed correctly. I presume raw data
has to conform to some form of MDM that requires schema mapping etc. before
being put into persistent storage (DB, HDFS etc.). Which one to choose depends
on your volume of ingestion, your cluster size and the complexity of data
conversion. Then your users will use some form of UI (Tableau, QlikView,
Zeppelin, direct SQL) to query the data one way or another. Your users can
directly use a UI like Tableau that offers built-in analytics on SQL (Spark
SQL offers the same). Your mileage varies according to your needs.

I still don't understand why writing to a transactional database with
locking and concurrency (reads and writes) through JDBC will be fast for
this sort of data ingestion. If you asked me to choose an RDBMS to write
to as my sink, I would use Oracle, which offers the best locking and
concurrency among RDBMSs and also handles key-value pairs (assuming
that is what you want). In addition, it can be used as a data warehouse as
well.

HTH



Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 16:49, Ali Akhtar  wrote:

> The business use case is to read a user's data from a variety of different
> services through their API, and then allowing the user to query that data,
> on a per service basis, as well as an aggregate across all services.
>
> The way I'm considering doing it, is to do some basic ETL (drop all the
> unnecessary fields, rename some fields into something more manageable, etc)
> and then store the data in Cassandra / Postgres.
>
> Then, when the user wants to view a particular report, query the
> respective table in Cassandra / Postgres. (select .. from data where user =
> ? and date between  and  and some_field = ?)
>
> How will Spark Streaming help w/ aggregation? Couldn't the data be queried
> from Cassandra / Postgres via the Kafka consumer and aggregated that way?
>
> On Thu, Sep 29, 2016 at 8:43 PM, Cody Koeninger 
> wrote:
>
>> No, direct stream in and of itself won't ensure an end-to-end
>> guarantee, because it doesn't know anything about your output actions.
>>
>> You still need to do some work.  The point is having easy access to
>> offsets for batches on a per-partition basis makes it easier to do
>> that work, especially in conjunction with aggregation.
>>
>> On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma 
>> wrote:
>> > If you use spark direct streams , it ensure end to end guarantee for
>> > messages.
>> >
>> >
>> > On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar 
>> wrote:
>> >>
>> >> My concern with Postgres / Cassandra is only scalability. I will look
>> >> further into Postgres horizontal scaling, thanks.
>> >>
>> >> Writes could be idempotent if done as upserts, otherwise updates will
>> be
>> >> idempotent but not inserts.
>> >>
>> >> Data should not be lost. The system should be as fault tolerant as
>> >> possible.
>> >>
>> >> What's the advantage of using Spark for reading Kafka instead of direct
>> >> Kafka consumers?
>> >>
>> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger 
>> >> wrote:
>> >>>
>> >>> I wouldn't give up the flexibility and maturity of a relational
>> >>> database, unless you have a very specific use case.  I'm not trashing
>> >>> cassandra, I've used cassandra, but if all I know is that you're doing
>> >>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>> >>> aggregations without a lot of forethought.  If you're worried about
>> >>> scaling, there are several options for horizontally scaling Postgres
>> >>> in particular.  One of the current best from what I've worked with is
>> >>> Citus.
>> >>>
>> >>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <
>> deepakmc...@gm

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
The business use case is to read a user's data from a variety of different
services through their APIs, and then allow the user to query that data,
on a per-service basis, as well as in aggregate across all services.

The way I'm considering doing it, is to do some basic ETL (drop all the
unnecessary fields, rename some fields into something more manageable, etc)
and then store the data in Cassandra / Postgres.

Then, when the user wants to view a particular report, query the respective
table in Cassandra / Postgres. (select .. from data where user = ? and date
between ? and ? and some_field = ?)
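
Concretely, on the Postgres side that per-report query is just a prepared
statement, something like the sketch below (table and column names are
placeholders for whatever the ETL ends up producing):

import java.sql.Connection;
import java.sql.Date;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.time.LocalDate;

public class ReportDao {
    private final Connection conn;

    public ReportDao(Connection conn) {
        this.conn = conn;
    }

    public void printReport(String userId, LocalDate from, LocalDate to,
                            String service) throws Exception {
        String sql = "SELECT event_date, metric, value FROM data "
                   + "WHERE user_id = ? AND event_date BETWEEN ? AND ? AND service = ? "
                   + "ORDER BY event_date";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, userId);
            ps.setDate(2, Date.valueOf(from));
            ps.setDate(3, Date.valueOf(to));
            ps.setString(4, service);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Rendering is up to the web UI; this just walks the rows.
                    System.out.printf("%s %s %s%n",
                        rs.getDate(1), rs.getString(2), rs.getBigDecimal(3));
                }
            }
        }
    }
}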

How will Spark Streaming help w/ aggregation? Couldn't the data be queried
from Cassandra / Postgres via the Kafka consumer and aggregated that way?

On Thu, Sep 29, 2016 at 8:43 PM, Cody Koeninger  wrote:

> No, direct stream in and of itself won't ensure an end-to-end
> guarantee, because it doesn't know anything about your output actions.
>
> You still need to do some work.  The point is having easy access to
> offsets for batches on a per-partition basis makes it easier to do
> that work, especially in conjunction with aggregation.
>
> On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma 
> wrote:
> > If you use spark direct streams , it ensure end to end guarantee for
> > messages.
> >
> >
> > On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar 
> wrote:
> >>
> >> My concern with Postgres / Cassandra is only scalability. I will look
> >> further into Postgres horizontal scaling, thanks.
> >>
> >> Writes could be idempotent if done as upserts, otherwise updates will be
> >> idempotent but not inserts.
> >>
> >> Data should not be lost. The system should be as fault tolerant as
> >> possible.
> >>
> >> What's the advantage of using Spark for reading Kafka instead of direct
> >> Kafka consumers?
> >>
> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger 
> >> wrote:
> >>>
> >>> I wouldn't give up the flexibility and maturity of a relational
> >>> database, unless you have a very specific use case.  I'm not trashing
> >>> cassandra, I've used cassandra, but if all I know is that you're doing
> >>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
> >>> aggregations without a lot of forethought.  If you're worried about
> >>> scaling, there are several options for horizontally scaling Postgres
> >>> in particular.  One of the current best from what I've worked with is
> >>> Citus.
> >>>
> >>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma  >
> >>> wrote:
> >>> > Hi Cody
> >>> > Spark direct stream is just fine for this use case.
> >>> > But why postgres and not cassandra?
> >>> > Is there anything specific here that i may not be aware?
> >>> >
> >>> > Thanks
> >>> > Deepak
> >>> >
> >>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger 
> >>> > wrote:
> >>> >>
> >>> >> How are you going to handle etl failures?  Do you care about lost /
> >>> >> duplicated data?  Are your writes idempotent?
> >>> >>
> >>> >> Absent any other information about the problem, I'd stay away from
> >>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
> >>> >> feeding postgres.
> >>> >>
> >>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar 
> >>> >> wrote:
> >>> >> > Is there an advantage to that vs directly consuming from Kafka?
> >>> >> > Nothing
> >>> >> > is
> >>> >> > being done to the data except some light ETL and then storing it
> in
> >>> >> > Cassandra
> >>> >> >
> >>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
> >>> >> > 
> >>> >> > wrote:
> >>> >> >>
> >>> >> >> Its better you use spark's direct stream to ingest from kafka.
> >>> >> >>
> >>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <
> ali.rac...@gmail.com>
> >>> >> >> wrote:
> >>> >> >>>
> >>> >> >>> I don't think I need a different speed storage and batch
> storage.
> >>> >&

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
No, direct stream in and of itself won't ensure an end-to-end
guarantee, because it doesn't know anything about your output actions.

You still need to do some work.  The point is having easy access to
offsets for batches on a per-partition basis makes it easier to do
that work, especially in conjunction with aggregation.

On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma  wrote:
> If you use spark direct streams , it ensure end to end guarantee for
> messages.
>
>
> On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar  wrote:
>>
>> My concern with Postgres / Cassandra is only scalability. I will look
>> further into Postgres horizontal scaling, thanks.
>>
>> Writes could be idempotent if done as upserts, otherwise updates will be
>> idempotent but not inserts.
>>
>> Data should not be lost. The system should be as fault tolerant as
>> possible.
>>
>> What's the advantage of using Spark for reading Kafka instead of direct
>> Kafka consumers?
>>
>> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger 
>> wrote:
>>>
>>> I wouldn't give up the flexibility and maturity of a relational
>>> database, unless you have a very specific use case.  I'm not trashing
>>> cassandra, I've used cassandra, but if all I know is that you're doing
>>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>>> aggregations without a lot of forethought.  If you're worried about
>>> scaling, there are several options for horizontally scaling Postgres
>>> in particular.  One of the current best from what I've worked with is
>>> Citus.
>>>
>>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma 
>>> wrote:
>>> > Hi Cody
>>> > Spark direct stream is just fine for this use case.
>>> > But why postgres and not cassandra?
>>> > Is there anything specific here that i may not be aware?
>>> >
>>> > Thanks
>>> > Deepak
>>> >
>>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger 
>>> > wrote:
>>> >>
>>> >> How are you going to handle etl failures?  Do you care about lost /
>>> >> duplicated data?  Are your writes idempotent?
>>> >>
>>> >> Absent any other information about the problem, I'd stay away from
>>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>>> >> feeding postgres.
>>> >>
>>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar 
>>> >> wrote:
>>> >> > Is there an advantage to that vs directly consuming from Kafka?
>>> >> > Nothing
>>> >> > is
>>> >> > being done to the data except some light ETL and then storing it in
>>> >> > Cassandra
>>> >> >
>>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
>>> >> > 
>>> >> > wrote:
>>> >> >>
>>> >> >> Its better you use spark's direct stream to ingest from kafka.
>>> >> >>
>>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar 
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> I don't think I need a different speed storage and batch storage.
>>> >> >>> Just
>>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>>> >> >>> somewhere
>>> >> >>> where
>>> >> >>> the web UI can query it, seems like it will be enough.
>>> >> >>>
>>> >> >>> I'm thinking about:
>>> >> >>>
>>> >> >>> - Reading data from Kafka via Spark Streaming
>>> >> >>> - Standardizing, then storing it in Cassandra
>>> >> >>> - Querying Cassandra from the web ui
>>> >> >>>
>>> >> >>> That seems like it will work. My question now is whether to use
>>> >> >>> Spark
>>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>>> >> >>>
>>> >> >>>
>>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>>> >> >>>  wrote:
>>> >> >>>>
>>> >> >>>> - Spark Streaming to read data from Kafka
>>> >> >>>&

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Hi Ali,

What is the business use case for this?

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 16:40, Deepak Sharma  wrote:

> If you use spark direct streams , it ensure end to end guarantee for
> messages.
>
>
> On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar  wrote:
>
>> My concern with Postgres / Cassandra is only scalability. I will look
>> further into Postgres horizontal scaling, thanks.
>>
>> Writes could be idempotent if done as upserts, otherwise updates will be
>> idempotent but not inserts.
>>
>> Data should not be lost. The system should be as fault tolerant as
>> possible.
>>
>> What's the advantage of using Spark for reading Kafka instead of direct
>> Kafka consumers?
>>
>> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger 
>> wrote:
>>
>>> I wouldn't give up the flexibility and maturity of a relational
>>> database, unless you have a very specific use case.  I'm not trashing
>>> cassandra, I've used cassandra, but if all I know is that you're doing
>>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>>> aggregations without a lot of forethought.  If you're worried about
>>> scaling, there are several options for horizontally scaling Postgres
>>> in particular.  One of the current best from what I've worked with is
>>> Citus.
>>>
>>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma 
>>> wrote:
>>> > Hi Cody
>>> > Spark direct stream is just fine for this use case.
>>> > But why postgres and not cassandra?
>>> > Is there anything specific here that i may not be aware?
>>> >
>>> > Thanks
>>> > Deepak
>>> >
>>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger 
>>> wrote:
>>> >>
>>> >> How are you going to handle etl failures?  Do you care about lost /
>>> >> duplicated data?  Are your writes idempotent?
>>> >>
>>> >> Absent any other information about the problem, I'd stay away from
>>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>>> >> feeding postgres.
>>> >>
>>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar 
>>> wrote:
>>> >> > Is there an advantage to that vs directly consuming from Kafka?
>>> Nothing
>>> >> > is
>>> >> > being done to the data except some light ETL and then storing it in
>>> >> > Cassandra
>>> >> >
>>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <
>>> deepakmc...@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Its better you use spark's direct stream to ingest from kafka.
>>> >> >>
>>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar 
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> I don't think I need a different speed storage and batch storage.
>>> Just
>>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>>> somewhere
>>> >> >>> where
>>> >> >>> the web UI can query it, seems like it will be enough.
>>> >> >>>
>>> >> >>> I'm thinking about:
>>> >> >>>
>>> >> >>> - Reading data from Kafka via Spark Streaming
>>> >> >>> - Standardizing, then storing it in Cassandra
>>> >> >>> - Querying Cassandra from the web ui
>>> >> >>>
>>> >> >>> That seems like it will work. My question now is whether to use
>>> Spark
>>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>>> >> >>>
>>> >> >>>
>>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
&g

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
If you use spark direct streams, it ensures an end-to-end guarantee for
messages.


On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar  wrote:

> My concern with Postgres / Cassandra is only scalability. I will look
> further into Postgres horizontal scaling, thanks.
>
> Writes could be idempotent if done as upserts, otherwise updates will be
> idempotent but not inserts.
>
> Data should not be lost. The system should be as fault tolerant as
> possible.
>
> What's the advantage of using Spark for reading Kafka instead of direct
> Kafka consumers?
>
> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger 
> wrote:
>
>> I wouldn't give up the flexibility and maturity of a relational
>> database, unless you have a very specific use case.  I'm not trashing
>> cassandra, I've used cassandra, but if all I know is that you're doing
>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>> aggregations without a lot of forethought.  If you're worried about
>> scaling, there are several options for horizontally scaling Postgres
>> in particular.  One of the current best from what I've worked with is
>> Citus.
>>
>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma 
>> wrote:
>> > Hi Cody
>> > Spark direct stream is just fine for this use case.
>> > But why postgres and not cassandra?
>> > Is there anything specific here that i may not be aware?
>> >
>> > Thanks
>> > Deepak
>> >
>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger 
>> wrote:
>> >>
>> >> How are you going to handle etl failures?  Do you care about lost /
>> >> duplicated data?  Are your writes idempotent?
>> >>
>> >> Absent any other information about the problem, I'd stay away from
>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>> >> feeding postgres.
>> >>
>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar 
>> wrote:
>> >> > Is there an advantage to that vs directly consuming from Kafka?
>> Nothing
>> >> > is
>> >> > being done to the data except some light ETL and then storing it in
>> >> > Cassandra
>> >> >
>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <
>> deepakmc...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Its better you use spark's direct stream to ingest from kafka.
>> >> >>
>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar 
>> >> >> wrote:
>> >> >>>
>> >> >>> I don't think I need a different speed storage and batch storage.
>> Just
>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>> somewhere
>> >> >>> where
>> >> >>> the web UI can query it, seems like it will be enough.
>> >> >>>
>> >> >>> I'm thinking about:
>> >> >>>
>> >> >>> - Reading data from Kafka via Spark Streaming
>> >> >>> - Standardizing, then storing it in Cassandra
>> >> >>> - Querying Cassandra from the web ui
>> >> >>>
>> >> >>> That seems like it will work. My question now is whether to use
>> Spark
>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>> >> >>>
>> >> >>>
>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>> >> >>>  wrote:
>> >> >>>>
>> >> >>>> - Spark Streaming to read data from Kafka
>> >> >>>> - Storing the data on HDFS using Flume
>> >> >>>>
>> >> >>>> You don't need Spark streaming to read data from Kafka and store
>> on
>> >> >>>> HDFS. It is a waste of resources.
>> >> >>>>
>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>> >> >>>>
>> >> >>>> KafkaAgent.sources = kafka-sources
>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>> >> >>>>
>> >> >>>> That will be for your batch layer. To analyse you can directly
>> read
>> >> >>>> from
>> >> >>>> hdfs files with Spark or simply store data in a database of you

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
If you're doing any kind of pre-aggregation during ETL, spark direct
stream will let you more easily get the delivery semantics you need,
especially if you're using a transactional data store.

If you're literally just copying individual uniquely keyed items from
kafka to a key-value store, use kafka consumers, sure.
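
For the "literally just copying" case, the plain consumer loop is about
this much code (a sketch; upsert() and standardize() are placeholders for
your store write and your light ETL, and poll(Duration) is the newer
client API):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CopyLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "etl-copy");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("raw-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // Idempotent write keyed on the record key: a re-delivered
                    // record just overwrites the same row.
                    upsert(r.key(), standardize(r.value()));
                }
                // Commit only after the writes succeed; a crash before this
                // point means re-delivery, which the upsert makes harmless.
                consumer.commitSync();
            }
        }
    }

    static String standardize(String raw) { return raw; }

    static void upsert(String key, String value) { /* Cassandra / Postgres write */ }
}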

On Thu, Sep 29, 2016 at 10:35 AM, Ali Akhtar  wrote:
> My concern with Postgres / Cassandra is only scalability. I will look
> further into Postgres horizontal scaling, thanks.
>
> Writes could be idempotent if done as upserts, otherwise updates will be
> idempotent but not inserts.
>
> Data should not be lost. The system should be as fault tolerant as possible.
>
> What's the advantage of using Spark for reading Kafka instead of direct
> Kafka consumers?
>
> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger  wrote:
>>
>> I wouldn't give up the flexibility and maturity of a relational
>> database, unless you have a very specific use case.  I'm not trashing
>> cassandra, I've used cassandra, but if all I know is that you're doing
>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>> aggregations without a lot of forethought.  If you're worried about
>> scaling, there are several options for horizontally scaling Postgres
>> in particular.  One of the current best from what I've worked with is
>> Citus.
>>
>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma 
>> wrote:
>> > Hi Cody
>> > Spark direct stream is just fine for this use case.
>> > But why postgres and not cassandra?
>> > Is there anything specific here that i may not be aware?
>> >
>> > Thanks
>> > Deepak
>> >
>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger 
>> > wrote:
>> >>
>> >> How are you going to handle etl failures?  Do you care about lost /
>> >> duplicated data?  Are your writes idempotent?
>> >>
>> >> Absent any other information about the problem, I'd stay away from
>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>> >> feeding postgres.
>> >>
>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar 
>> >> wrote:
>> >> > Is there an advantage to that vs directly consuming from Kafka?
>> >> > Nothing
>> >> > is
>> >> > being done to the data except some light ETL and then storing it in
>> >> > Cassandra
>> >> >
>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
>> >> > 
>> >> > wrote:
>> >> >>
>> >> >> Its better you use spark's direct stream to ingest from kafka.
>> >> >>
>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar 
>> >> >> wrote:
>> >> >>>
>> >> >>> I don't think I need a different speed storage and batch storage.
>> >> >>> Just
>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>> >> >>> somewhere
>> >> >>> where
>> >> >>> the web UI can query it, seems like it will be enough.
>> >> >>>
>> >> >>> I'm thinking about:
>> >> >>>
>> >> >>> - Reading data from Kafka via Spark Streaming
>> >> >>> - Standardizing, then storing it in Cassandra
>> >> >>> - Querying Cassandra from the web ui
>> >> >>>
>> >> >>> That seems like it will work. My question now is whether to use
>> >> >>> Spark
>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>> >> >>>
>> >> >>>
>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>> >> >>>  wrote:
>> >> >>>>
>> >> >>>> - Spark Streaming to read data from Kafka
>> >> >>>> - Storing the data on HDFS using Flume
>> >> >>>>
>> >> >>>> You don't need Spark streaming to read data from Kafka and store
>> >> >>>> on
>> >> >>>> HDFS. It is a waste of resources.
>> >> >>>>
>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>> >> >>>>
>> >> >>>> KafkaAgent.sources = kafka-sources
>>

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Yes, but still these writes from Spark have to go through JDBC? Correct.

Having said that, I don't see how doing this through Spark Streaming to
Postgres is going to be faster than source -> Kafka -> Flume (via ZooKeeper)
-> HDFS.

I believe there is direct streaming from Kafka to Hive as well, and from
Flume to HBase.

I would have thought that if one wanted to do real time analytics with SS,
then that would be a good fit with a real time dashboard.

What is not so clear is the business use case for this.

HTH




Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 16:28, Cody Koeninger  wrote:

> I wouldn't give up the flexibility and maturity of a relational
> database, unless you have a very specific use case.  I'm not trashing
> cassandra, I've used cassandra, but if all I know is that you're doing
> analytics, I wouldn't want to give up the ability to easily do ad-hoc
> aggregations without a lot of forethought.  If you're worried about
> scaling, there are several options for horizontally scaling Postgres
> in particular.  One of the current best from what I've worked with is
> Citus.
>
> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma 
> wrote:
> > Hi Cody
> > Spark direct stream is just fine for this use case.
> > But why postgres and not cassandra?
> > Is there anything specific here that i may not be aware?
> >
> > Thanks
> > Deepak
> >
> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger 
> wrote:
> >>
> >> How are you going to handle etl failures?  Do you care about lost /
> >> duplicated data?  Are your writes idempotent?
> >>
> >> Absent any other information about the problem, I'd stay away from
> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
> >> feeding postgres.
> >>
> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar 
> wrote:
> >> > Is there an advantage to that vs directly consuming from Kafka?
> Nothing
> >> > is
> >> > being done to the data except some light ETL and then storing it in
> >> > Cassandra
> >> >
> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma  >
> >> > wrote:
> >> >>
> >> >> Its better you use spark's direct stream to ingest from kafka.
> >> >>
> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar 
> >> >> wrote:
> >> >>>
> >> >>> I don't think I need a different speed storage and batch storage.
> Just
> >> >>> taking in raw data from Kafka, standardizing, and storing it
> somewhere
> >> >>> where
> >> >>> the web UI can query it, seems like it will be enough.
> >> >>>
> >> >>> I'm thinking about:
> >> >>>
> >> >>> - Reading data from Kafka via Spark Streaming
> >> >>> - Standardizing, then storing it in Cassandra
> >> >>> - Querying Cassandra from the web ui
> >> >>>
> >> >>> That seems like it will work. My question now is whether to use
> Spark
> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
> >> >>>
> >> >>>
> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
> >> >>>  wrote:
> >> >>>>
> >> >>>> - Spark Streaming to read data from Kafka
> >> >>>> - Storing the data on HDFS using Flume
> >> >>>>
> >> >>>> You don't need Spark streaming to read data from Kafka and store on
> >> >>>> HDFS. It is a waste of resources.
> >> >>>>
> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
> >> >>>>
> >> >>>> KafkaAgent.sources = kafka-sources
> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
> >> >>>>
> >> >>>> That will be for your batch layer. To analyse you can

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
My concern with Postgres / Cassandra is only scalability. I will look
further into Postgres horizontal scaling, thanks.

Writes could be idempotent if done as upserts, otherwise updates will be
idempotent but not inserts.

Data should not be lost. The system should be as fault tolerant as possible.

What's the advantage of using Spark for reading Kafka instead of direct
Kafka consumers?
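
On the upsert point, this is roughly what I mean for the Postgres case
(sketch only, made-up table and columns, needs Postgres 9.5+): replaying
the same record rewrites the same row, so inserts become idempotent too.

import java.sql.Connection;
import java.sql.PreparedStatement;

public class IdempotentWriter {
    // "Insert as upsert": event_id is the primary key, so a replayed record
    // overwrites the existing row instead of creating a duplicate.
    public static void write(Connection conn, String eventId, String payload) throws Exception {
        String sql = "INSERT INTO events (event_id, payload) VALUES (?, ?) "
                   + "ON CONFLICT (event_id) DO UPDATE SET payload = EXCLUDED.payload";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, eventId);
            ps.setString(2, payload);
            ps.executeUpdate();
        }
    }
}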

On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger  wrote:

> I wouldn't give up the flexibility and maturity of a relational
> database, unless you have a very specific use case.  I'm not trashing
> cassandra, I've used cassandra, but if all I know is that you're doing
> analytics, I wouldn't want to give up the ability to easily do ad-hoc
> aggregations without a lot of forethought.  If you're worried about
> scaling, there are several options for horizontally scaling Postgres
> in particular.  One of the current best from what I've worked with is
> Citus.
>
> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma 
> wrote:
> > Hi Cody
> > Spark direct stream is just fine for this use case.
> > But why postgres and not cassandra?
> > Is there anything specific here that i may not be aware?
> >
> > Thanks
> > Deepak
> >
> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger 
> wrote:
> >>
> >> How are you going to handle etl failures?  Do you care about lost /
> >> duplicated data?  Are your writes idempotent?
> >>
> >> Absent any other information about the problem, I'd stay away from
> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
> >> feeding postgres.
> >>
> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar 
> wrote:
> >> > Is there an advantage to that vs directly consuming from Kafka?
> Nothing
> >> > is
> >> > being done to the data except some light ETL and then storing it in
> >> > Cassandra
> >> >
> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma  >
> >> > wrote:
> >> >>
> >> >> Its better you use spark's direct stream to ingest from kafka.
> >> >>
> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar 
> >> >> wrote:
> >> >>>
> >> >>> I don't think I need a different speed storage and batch storage.
> Just
> >> >>> taking in raw data from Kafka, standardizing, and storing it
> somewhere
> >> >>> where
> >> >>> the web UI can query it, seems like it will be enough.
> >> >>>
> >> >>> I'm thinking about:
> >> >>>
> >> >>> - Reading data from Kafka via Spark Streaming
> >> >>> - Standardizing, then storing it in Cassandra
> >> >>> - Querying Cassandra from the web ui
> >> >>>
> >> >>> That seems like it will work. My question now is whether to use
> Spark
> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
> >> >>>
> >> >>>
> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
> >> >>>  wrote:
> >> >>>>
> >> >>>> - Spark Streaming to read data from Kafka
> >> >>>> - Storing the data on HDFS using Flume
> >> >>>>
> >> >>>> You don't need Spark streaming to read data from Kafka and store on
> >> >>>> HDFS. It is a waste of resources.
> >> >>>>
> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
> >> >>>>
> >> >>>> KafkaAgent.sources = kafka-sources
> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
> >> >>>>
> >> >>>> That will be for your batch layer. To analyse you can directly read
> >> >>>> from
> >> >>>> hdfs files with Spark or simply store data in a database of your
> >> >>>> choice via
> >> >>>> cron or something. Do not mix your batch layer with speed layer.
> >> >>>>
> >> >>>> Your speed layer will ingest the same data directly from Kafka into
> >> >>>> spark streaming and that will be  online or near real time (defined
> >> >>>> by your
> >> >>>> window).
> >> >>>>
> >> >>>> Then you have a a serving layer to present dat

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
I wouldn't give up the flexibility and maturity of a relational
database, unless you have a very specific use case.  I'm not trashing
cassandra, I've used cassandra, but if all I know is that you're doing
analytics, I wouldn't want to give up the ability to easily do ad-hoc
aggregations without a lot of forethought.  If you're worried about
scaling, there are several options for horizontally scaling Postgres
in particular.  One of the current best from what I've worked with is
Citus.

On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma  wrote:
> Hi Cody
> Spark direct stream is just fine for this use case.
> But why postgres and not cassandra?
> Is there anything specific here that i may not be aware?
>
> Thanks
> Deepak
>
> On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger  wrote:
>>
>> How are you going to handle etl failures?  Do you care about lost /
>> duplicated data?  Are your writes idempotent?
>>
>> Absent any other information about the problem, I'd stay away from
>> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>> feeding postgres.
>>
>> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar  wrote:
>> > Is there an advantage to that vs directly consuming from Kafka? Nothing
>> > is
>> > being done to the data except some light ETL and then storing it in
>> > Cassandra
>> >
>> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma 
>> > wrote:
>> >>
>> >> Its better you use spark's direct stream to ingest from kafka.
>> >>
>> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar 
>> >> wrote:
>> >>>
>> >>> I don't think I need a different speed storage and batch storage. Just
>> >>> taking in raw data from Kafka, standardizing, and storing it somewhere
>> >>> where
>> >>> the web UI can query it, seems like it will be enough.
>> >>>
>> >>> I'm thinking about:
>> >>>
>> >>> - Reading data from Kafka via Spark Streaming
>> >>> - Standardizing, then storing it in Cassandra
>> >>> - Querying Cassandra from the web ui
>> >>>
>> >>> That seems like it will work. My question now is whether to use Spark
>> >>> Streaming to read Kafka, or use Kafka consumers directly.
>> >>>
>> >>>
>> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>> >>>  wrote:
>> >>>>
>> >>>> - Spark Streaming to read data from Kafka
>> >>>> - Storing the data on HDFS using Flume
>> >>>>
>> >>>> You don't need Spark streaming to read data from Kafka and store on
>> >>>> HDFS. It is a waste of resources.
>> >>>>
>> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>> >>>>
>> >>>> KafkaAgent.sources = kafka-sources
>> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>> >>>>
>> >>>> That will be for your batch layer. To analyse you can directly read
>> >>>> from
>> >>>> hdfs files with Spark or simply store data in a database of your
>> >>>> choice via
>> >>>> cron or something. Do not mix your batch layer with speed layer.
>> >>>>
>> >>>> Your speed layer will ingest the same data directly from Kafka into
>> >>>> spark streaming and that will be  online or near real time (defined
>> >>>> by your
>> >>>> window).
>> >>>>
>> >>>> Then you have a a serving layer to present data from both speed  (the
>> >>>> one from SS) and batch layer.
>> >>>>
>> >>>> HTH
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> Dr Mich Talebzadeh
>> >>>>
>> >>>>
>> >>>>
>> >>>> LinkedIn
>> >>>>
>> >>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >>>>
>> >>>>
>> >>>>
>> >>>> http://talebzadehmich.wordpress.com
>> >>>>
>> >>>>
>> >>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> >>>> any
>> >>>> loss, damage or destruction of data or any other p

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Hi Cody
Spark direct stream is just fine for this use case.
But why postgres and not cassandra?
Is there anything specific here that I may not be aware of?

Thanks
Deepak

On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger  wrote:

> How are you going to handle etl failures?  Do you care about lost /
> duplicated data?  Are your writes idempotent?
>
> Absent any other information about the problem, I'd stay away from
> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
> feeding postgres.
>
> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar  wrote:
> > Is there an advantage to that vs directly consuming from Kafka? Nothing
> is
> > being done to the data except some light ETL and then storing it in
> > Cassandra
> >
> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma 
> > wrote:
> >>
> >> Its better you use spark's direct stream to ingest from kafka.
> >>
> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar 
> wrote:
> >>>
> >>> I don't think I need a different speed storage and batch storage. Just
> >>> taking in raw data from Kafka, standardizing, and storing it somewhere
> where
> >>> the web UI can query it, seems like it will be enough.
> >>>
> >>> I'm thinking about:
> >>>
> >>> - Reading data from Kafka via Spark Streaming
> >>> - Standardizing, then storing it in Cassandra
> >>> - Querying Cassandra from the web ui
> >>>
> >>> That seems like it will work. My question now is whether to use Spark
> >>> Streaming to read Kafka, or use Kafka consumers directly.
> >>>
> >>>
> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
> >>>  wrote:
> >>>>
> >>>> - Spark Streaming to read data from Kafka
> >>>> - Storing the data on HDFS using Flume
> >>>>
> >>>> You don't need Spark streaming to read data from Kafka and store on
> >>>> HDFS. It is a waste of resources.
> >>>>
> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
> >>>>
> >>>> KafkaAgent.sources = kafka-sources
> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
> >>>>
> >>>> That will be for your batch layer. To analyse you can directly read
> from
> >>>> hdfs files with Spark or simply store data in a database of your
> choice via
> >>>> cron or something. Do not mix your batch layer with speed layer.
> >>>>
> >>>> Your speed layer will ingest the same data directly from Kafka into
> >>>> spark streaming and that will be  online or near real time (defined
> by your
> >>>> window).
> >>>>
> >>>> Then you have a a serving layer to present data from both speed  (the
> >>>> one from SS) and batch layer.
> >>>>
> >>>> HTH
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Dr Mich Talebzadeh
> >>>>
> >>>>
> >>>>
> >>>> LinkedIn
> >>>> https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>>
> >>>>
> >>>>
> >>>> http://talebzadehmich.wordpress.com
> >>>>
> >>>>
> >>>> Disclaimer: Use it at your own risk. Any and all responsibility for
> any
> >>>> loss, damage or destruction of data or any other property which may
> arise
> >>>> from relying on this email's technical content is explicitly
> disclaimed. The
> >>>> author will in no case be liable for any monetary damages arising
> from such
> >>>> loss, damage or destruction.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On 29 September 2016 at 15:15, Ali Akhtar 
> wrote:
> >>>>>
> >>>>> The web UI is actually the speed layer, it needs to be able to query
> >>>>> the data online, and show the results in real-time.
> >>>>>
> >>>>> It also needs a custom front-end, so a system like Tableau can't be
> >>>>> used, it must have a custom backend + front-end.
> >>>>>
> >>>>> Thanks for the recommendation of Flume. Do you think this will work:
> >>>>>
> >>>>> - Spark Streaming to read data from Kafka
> >&

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
Is there an advantage to that vs directly consuming from Kafka? Nothing is
being done to the data except some light ETL and then storing it in
Cassandra

On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma 
wrote:

> Its better you use spark's direct stream to ingest from kafka.
>
> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar  wrote:
>
>> I don't think I need a different speed storage and batch storage. Just
>> taking in raw data from Kafka, standardizing, and storing it somewhere
>> where the web UI can query it, seems like it will be enough.
>>
>> I'm thinking about:
>>
>> - Reading data from Kafka via Spark Streaming
>> - Standardizing, then storing it in Cassandra
>> - Querying Cassandra from the web ui
>>
>> That seems like it will work. My question now is whether to use Spark
>> Streaming to read Kafka, or use Kafka consumers directly.
>>
>>
>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> - Spark Streaming to read data from Kafka
>>> - Storing the data on HDFS using Flume
>>>
>>> You don't need Spark streaming to read data from Kafka and store on
>>> HDFS. It is a waste of resources.
>>>
>>> Couple Flume to use Kafka as source and HDFS as sink directly
>>>
>>> KafkaAgent.sources = kafka-sources
>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>>
>>> That will be for your batch layer. To analyse you can directly read from
>>> hdfs files with Spark or simply store data in a database of your choice via
>>> cron or something. Do not mix your batch layer with speed layer.
>>>
>>> Your speed layer will ingest the same data directly from Kafka into
>>> spark streaming and that will be  online or near real time (defined by your
>>> window).
>>>
>>> Then you have a a serving layer to present data from both speed  (the
>>> one from SS) and batch layer.
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 29 September 2016 at 15:15, Ali Akhtar  wrote:
>>>
>>>> The web UI is actually the speed layer, it needs to be able to query
>>>> the data online, and show the results in real-time.
>>>>
>>>> It also needs a custom front-end, so a system like Tableau can't be
>>>> used, it must have a custom backend + front-end.
>>>>
>>>> Thanks for the recommendation of Flume. Do you think this will work:
>>>>
>>>> - Spark Streaming to read data from Kafka
>>>> - Storing the data on HDFS using Flume
>>>> - Using Spark to query the data in the backend of the web UI?
>>>>
>>>>
>>>>
>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> You need a batch layer and a speed layer. Data from Kafka can be
>>>>> stored on HDFS using flume.
>>>>>
>>>>> -  Query this data to generate reports / analytics (There will be a
>>>>> web UI which will be the front-end to the data, and will show the reports)
>>>>>
>>>>> This is basically batch layer and you need something like Tableau or
>>>>> Zeppelin to query data
>>>>>
>>>>> You will also need spark streaming to query data online for speed
>>>>> layer. That data could be stored in some transient fabric like ignite or
>>>>> even druid.
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
>>>>>> This is basically batch layer and you need something like Tableau or
>>>>>> Zeppelin to query data
>>>>>>
>>>>>> You will also need spark streaming to query data online for speed
>>>>>> layer. That data could be stored in some transient fabric like ignite or
>>>>>> even druid.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn
>>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>>>>>> any loss, damage or destruction of data or any other property which may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 29 September 2016 at 15:01, Ali Akhtar 
>>>>>> wrote:
>>>>>>>
>>>>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>>>>
>>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>>>>>>>  wrote:
>>>>>>>>
>>>>>>>> What is the message inflow ?
>>>>>>>> If it's really high , definitely spark will be of great use .
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Deepak
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar"  wrote:
>>>>>>>>>
>>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>>>>
>>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>>>>> raw data into Kafka.
>>>>>>>>>
>>>>>>>>> I need to:
>>>>>>>>>
>>>>>>>>> - Do ETL on the data, and standardize it.
>>>>>>>>>
>>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw
>>>>>>>>> HDFS / ElasticSearch / Postgres)
>>>>>>>>>
>>>>>>>>> - Query this data to generate reports / analytics (There will be a
>>>>>>>>> web UI which will be the front-end to the data, and will show the 
>>>>>>>>> reports)
>>>>>>>>>
>>>>>>>>> Java is being used as the backend language for everything (backend
>>>>>>>>> of the web UI, as well as the ETL layer)
>>>>>>>>>
>>>>>>>>> I'm considering:
>>>>>>>>>
>>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>>>>>
>>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>>>>>> data, and to allow queries
>>>>>>>>>
>>>>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>>>>> queries across the data (mostly filters), or directly run queries 
>>>>>>>>> against
>>>>>>>>> Cassandra / HBase
>>>>>>>>>
>>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs 
>>>>>>>>> Spark for
>>>>>>>>> ETL, which persistent data store to use, and how to query that data 
>>>>>>>>> store in
>>>>>>>>> the backend of the web UI, for displaying the reports).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>
>


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
It's better to use Spark's direct stream to ingest from Kafka.
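
For reference, a minimal sketch of that option using the spark-streaming-kafka-0-10 direct-stream integration (Java; the broker address, group id and topic name below are placeholders, and the ETL/store step is only hinted at):

import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class DirectStreamSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("kafka-etl");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092");      // placeholder
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "etl-group");                   // placeholder
        kafkaParams.put("auto.offset.reset", "latest");

        // Direct stream: one Spark partition per Kafka partition, offsets tracked by Spark.
        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("raw-events"), kafkaParams));

        // Light ETL goes here: standardize each record and hand it to the chosen store.
        stream.foreachRDD(rdd -> rdd.foreach(record ->
            System.out.println(record.key() + " -> " + record.value())));

        jssc.start();
        jssc.awaitTermination();
    }
}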

On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar  wrote:

> I don't think I need a different speed storage and batch storage. Just
> taking in raw data from Kafka, standardizing, and storing it somewhere
> where the web UI can query it, seems like it will be enough.
>
> I'm thinking about:
>
> - Reading data from Kafka via Spark Streaming
> - Standardizing, then storing it in Cassandra
> - Querying Cassandra from the web ui
>
> That seems like it will work. My question now is whether to use Spark
> Streaming to read Kafka, or use Kafka consumers directly.
>
>
> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> - Spark Streaming to read data from Kafka
>> - Storing the data on HDFS using Flume
>>
>> You don't need Spark streaming to read data from Kafka and store on HDFS.
>> It is a waste of resources.
>>
>> Couple Flume to use Kafka as source and HDFS as sink directly
>>
>> KafkaAgent.sources = kafka-sources
>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>
>> That will be for your batch layer. To analyse you can directly read from
>> hdfs files with Spark or simply store data in a database of your choice via
>> cron or something. Do not mix your batch layer with speed layer.
>>
>> Your speed layer will ingest the same data directly from Kafka into spark
>> streaming and that will be  online or near real time (defined by your
>> window).
>>
>> Then you have a a serving layer to present data from both speed  (the one
>> from SS) and batch layer.
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 29 September 2016 at 15:15, Ali Akhtar  wrote:
>>
>>> The web UI is actually the speed layer, it needs to be able to query the
>>> data online, and show the results in real-time.
>>>
>>> It also needs a custom front-end, so a system like Tableau can't be
>>> used, it must have a custom backend + front-end.
>>>
>>> Thanks for the recommendation of Flume. Do you think this will work:
>>>
>>> - Spark Streaming to read data from Kafka
>>> - Storing the data on HDFS using Flume
>>> - Using Spark to query the data in the backend of the web UI?
>>>
>>>
>>>
>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> You need a batch layer and a speed layer. Data from Kafka can be stored
>>>> on HDFS using flume.
>>>>
>>>> -  Query this data to generate reports / analytics (There will be a web
>>>> UI which will be the front-end to the data, and will show the reports)
>>>>
>>>> This is basically batch layer and you need something like Tableau or
>>>> Zeppelin to query data
>>>>
>>>> You will also need spark streaming to query data online for speed
>>>> layer. That data could be stored in some transient fabric like ignite or
>>>> even druid.
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content 

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Since the inflow is huge, Flume would also need to be run with multiple
channels in a distributed fashion.
In that case, resource utilization will be high as well.

Thanks
Deepak

On Thu, Sep 29, 2016 at 8:11 PM, Mich Talebzadeh 
wrote:

> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
>
> You don't need Spark streaming to read data from Kafka and store on HDFS.
> It is a waste of resources.
>
> Couple Flume to use Kafka as source and HDFS as sink directly
>
> KafkaAgent.sources = kafka-sources
> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>
> That will be for your batch layer. To analyse you can directly read from
> hdfs files with Spark or simply store data in a database of your choice via
> cron or something. Do not mix your batch layer with speed layer.
>
> Your speed layer will ingest the same data directly from Kafka into spark
> streaming and that will be  online or near real time (defined by your
> window).
>
> Then you have a a serving layer to present data from both speed  (the one
> from SS) and batch layer.
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 September 2016 at 15:15, Ali Akhtar  wrote:
>
>> The web UI is actually the speed layer, it needs to be able to query the
>> data online, and show the results in real-time.
>>
>> It also needs a custom front-end, so a system like Tableau can't be used,
>> it must have a custom backend + front-end.
>>
>> Thanks for the recommendation of Flume. Do you think this will work:
>>
>> - Spark Streaming to read data from Kafka
>> - Storing the data on HDFS using Flume
>> - Using Spark to query the data in the backend of the web UI?
>>
>>
>>
>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> You need a batch layer and a speed layer. Data from Kafka can be stored
>>> on HDFS using flume.
>>>
>>> -  Query this data to generate reports / analytics (There will be a web
>>> UI which will be the front-end to the data, and will show the reports)
>>>
>>> This is basically batch layer and you need something like Tableau or
>>> Zeppelin to query data
>>>
>>> You will also need spark streaming to query data online for speed layer.
>>> That data could be stored in some transient fabric like ignite or even
>>> druid.
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 29 September 2016 at 15:01, Ali Akhtar  wrote:
>>>
>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>
>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma 
>>>> wrote:
>>>>
>>>>> What is the message inflow ?
>>>>> If it's really high , definitely spark will be of great use .
>>>>>
>>>>> Thanks
>>>>> Deepak
>>>>>
>>>>> On Sep 29, 2016 19:24, "Ali Akhtar"  wrote:
>>>>>
>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>
>>>>>>

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
I don't think I need a different speed storage and batch storage. Just
taking in raw data from Kafka, standardizing, and storing it somewhere
where the web UI can query it, seems like it will be enough.

I'm thinking about:

- Reading data from Kafka via Spark Streaming
- Standardizing, then storing it in Cassandra
- Querying Cassandra from the web ui

That seems like it will work. My question now is whether to use Spark
Streaming to read Kafka, or use Kafka consumers directly.


On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh 
wrote:

> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
>
> You don't need Spark streaming to read data from Kafka and store on HDFS.
> It is a waste of resources.
>
> Couple Flume to use Kafka as source and HDFS as sink directly
>
> KafkaAgent.sources = kafka-sources
> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>
> That will be for your batch layer. To analyse you can directly read from
> hdfs files with Spark or simply store data in a database of your choice via
> cron or something. Do not mix your batch layer with speed layer.
>
> Your speed layer will ingest the same data directly from Kafka into spark
> streaming and that will be  online or near real time (defined by your
> window).
>
> Then you have a a serving layer to present data from both speed  (the one
> from SS) and batch layer.
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 September 2016 at 15:15, Ali Akhtar  wrote:
>
>> The web UI is actually the speed layer, it needs to be able to query the
>> data online, and show the results in real-time.
>>
>> It also needs a custom front-end, so a system like Tableau can't be used,
>> it must have a custom backend + front-end.
>>
>> Thanks for the recommendation of Flume. Do you think this will work:
>>
>> - Spark Streaming to read data from Kafka
>> - Storing the data on HDFS using Flume
>> - Using Spark to query the data in the backend of the web UI?
>>
>>
>>
>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> You need a batch layer and a speed layer. Data from Kafka can be stored
>>> on HDFS using flume.
>>>
>>> -  Query this data to generate reports / analytics (There will be a web
>>> UI which will be the front-end to the data, and will show the reports)
>>>
>>> This is basically batch layer and you need something like Tableau or
>>> Zeppelin to query data
>>>
>>> You will also need spark streaming to query data online for speed layer.
>>> That data could be stored in some transient fabric like ignite or even
>>> druid.
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 29 September 2016 at 15:01, Ali Akhtar  wrote:
>>>
>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>
>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma 
>>>> wrote:
>>>>
>>>>> What is the message inflow ?
>>>>> If it's really high , definitely spark will be of great use .
>>>>>
>>>&g

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
- Spark Streaming to read data from Kafka
- Storing the data on HDFS using Flume

You don't need Spark streaming to read data from Kafka and store on HDFS.
It is a waste of resources.

Couple Flume to use Kafka as source and HDFS as sink directly

KafkaAgent.sources = kafka-sources
KafkaAgent.sinks.hdfs-sinks.type = hdfs

That will be for your batch layer. To analyse you can directly read from
hdfs files with Spark or simply store data in a database of your choice via
cron or something. Do not mix your batch layer with speed layer.

Your speed layer will ingest the same data directly from Kafka into Spark
Streaming and that will be online or near real time (defined by your
window).

Then you have a serving layer to present data from both the speed layer (the one
from Spark Streaming) and the batch layer.

HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 15:15, Ali Akhtar  wrote:

> The web UI is actually the speed layer, it needs to be able to query the
> data online, and show the results in real-time.
>
> It also needs a custom front-end, so a system like Tableau can't be used,
> it must have a custom backend + front-end.
>
> Thanks for the recommendation of Flume. Do you think this will work:
>
> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
> - Using Spark to query the data in the backend of the web UI?
>
>
>
> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> You need a batch layer and a speed layer. Data from Kafka can be stored
>> on HDFS using flume.
>>
>> -  Query this data to generate reports / analytics (There will be a web
>> UI which will be the front-end to the data, and will show the reports)
>>
>> This is basically batch layer and you need something like Tableau or
>> Zeppelin to query data
>>
>> You will also need spark streaming to query data online for speed layer.
>> That data could be stored in some transient fabric like ignite or even
>> druid.
>>
>> HTH
>>
>>
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 29 September 2016 at 15:01, Ali Akhtar  wrote:
>>
>>> It needs to be able to scale to a very large amount of data, yes.
>>>
>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma 
>>> wrote:
>>>
>>>> What is the message inflow ?
>>>> If it's really high , definitely spark will be of great use .
>>>>
>>>> Thanks
>>>> Deepak
>>>>
>>>> On Sep 29, 2016 19:24, "Ali Akhtar"  wrote:
>>>>
>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>
>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>> raw data into Kafka.
>>>>>
>>>>> I need to:
>>>>>
>>>>> - Do ETL on the data, and standardize it.
>>>>>
>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS
>>>>> / ElasticSearch / Postgres)
>>>>>
>>>>> - Query this data to generate reports / analytics (There will be a web
>>>>> UI which will be the front-end to the data, and will show the reports)
>>>>>
>>>>> Java is being used as the backend language for everything (backend of
>>>>> the web UI, as well as the ETL layer)
>>>>>
>>>>> I'm considering:
>>>>>
>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>
>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>> data, and to allow queries
>>>>>
>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>> queries across the data (mostly filters), or directly run queries against
>>>>> Cassandra / HBase
>>>>>
>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>>>> ETL, which persistent data store to use, and how to query that data store
>>>>> in the backend of the web UI, for displaying the reports).
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>
>>>
>>
>


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
For the UI, you need a DB such as Cassandra that is designed to serve
queries.
Ingest the data into Spark Streaming (speed layer) and write to HDFS (for the
batch layer).
Now you have data at rest as well as in motion (real time).
From Spark Streaming itself, do further processing and write the final
result to Cassandra or another NoSQL DB.
The UI can then pick up the data from the DB.
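
As a rough illustration of that last step, a sketch of the web UI backend reading from Cassandra with the DataStax Java driver (the contact point, keyspace, table and column names here are invented for the example):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class ReportQuery {
    public static void main(String[] args) {
        // Connect to the cluster and keyspace that the streaming job writes into.
        try (Cluster cluster = Cluster.builder().addContactPoint("cassandra-host").build();
             Session session = cluster.connect("analytics")) {

            // Mostly-filter queries map well to a table keyed by the filter columns.
            ResultSet rs = session.execute(
                "SELECT event_type, total FROM daily_counts WHERE day = ?", "2016-09-29");

            for (Row row : rs) {
                System.out.println(row.getString("event_type") + " = " + row.getLong("total"));
            }
        }
    }
}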

Thanks
Deepak

On Thu, Sep 29, 2016 at 8:00 PM, Alonso Isidoro Roman 
wrote:

> "Using Spark to query the data in the backend of the web UI?"
>
> Don't do that. I would recommend that the Spark Streaming process store data
> into some NoSQL or SQL database, and that the web UI query data from that
> database.
>
> Alonso Isidoro Roman
> https://about.me/alonso.isidoro.roman
>
>
> 2016-09-29 16:15 GMT+02:00 Ali Akhtar :
>
>> The web UI is actually the speed layer, it needs to be able to query the
>> data online, and show the results in real-time.
>>
>> It also needs a custom front-end, so a system like Tableau can't be used,
>> it must have a custom backend + front-end.
>>
>> Thanks for the recommendation of Flume. Do you think this will work:
>>
>> - Spark Streaming to read data from Kafka
>> - Storing the data on HDFS using Flume
>> - Using Spark to query the data in the backend of the web UI?
>>
>>
>>
>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> You need a batch layer and a speed layer. Data from Kafka can be stored
>>> on HDFS using flume.
>>>
>>> -  Query this data to generate reports / analytics (There will be a web
>>> UI which will be the front-end to the data, and will show the reports)
>>>
>>> This is basically batch layer and you need something like Tableau or
>>> Zeppelin to query data
>>>
>>> You will also need spark streaming to query data online for speed layer.
>>> That data could be stored in some transient fabric like ignite or even
>>> druid.
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 29 September 2016 at 15:01, Ali Akhtar  wrote:
>>>
>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>
>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma 
>>>> wrote:
>>>>
>>>>> What is the message inflow ?
>>>>> If it's really high , definitely spark will be of great use .
>>>>>
>>>>> Thanks
>>>>> Deepak
>>>>>
>>>>> On Sep 29, 2016 19:24, "Ali Akhtar"  wrote:
>>>>>
>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>
>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>> raw data into Kafka.
>>>>>>
>>>>>> I need to:
>>>>>>
>>>>>> - Do ETL on the data, and standardize it.
>>>>>>
>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS
>>>>>> / ElasticSearch / Postgres)
>>>>>>
>>>>>> - Query this data to generate reports / analytics (There will be a
>>>>>> web UI which will be the front-end to the data, and will show the 
>>>>>> reports)
>>>>>>
>>>>>> Java is being used as the backend language for everything (backend of
>>>>>> the web UI, as well as the ETL layer)
>>>>>>
>>>>>> I'm considering:
>>>>>>
>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>>
>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>>> data, and to allow queries
>>>>>>
>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>> queries across the data (mostly filters), or directly run queries against
>>>>>> Cassandra / HBase
>>>>>>
>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark 
>>>>>> for
>>>>>> ETL, which persistent data store to use, and how to query that data store
>>>>>> in the backend of the web UI, for displaying the reports).
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


RE: Architecture recommendations for a tricky use case

2016-09-29 Thread Tauzell, Dave
Spark Streaming needs to store the output somewhere.  Cassandra is a possible 
target for that.

-Dave

-Original Message-
From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Thursday, September 29, 2016 9:16 AM
Cc: users@kafka.apache.org; spark users
Subject: Re: Architecture recommendations for a tricky use case

The web UI is actually the speed layer, it needs to be able to query the data 
online, and show the results in real-time.

It also needs a custom front-end, so a system like Tableau can't be used, it 
must have a custom backend + front-end.

Thanks for the recommendation of Flume. Do you think this will work:

- Spark Streaming to read data from Kafka
- Storing the data on HDFS using Flume
- Using Spark to query the data in the backend of the web UI?



On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh 
wrote:

> You need a batch layer and a speed layer. Data from Kafka can be
> stored on HDFS using flume.
>
> -  Query this data to generate reports / analytics (There will be a
> web UI which will be the front-end to the data, and will show the
> reports)
>
> This is basically batch layer and you need something like Tableau or
> Zeppelin to query data
>
> You will also need spark streaming to query data online for speed layer.
> That data could be stored in some transient fabric like ignite or even
> druid.
>
> HTH
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn *
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which
> may arise from relying on this email's technical content is explicitly 
> disclaimed.
> The author will in no case be liable for any monetary damages arising
> from such loss, damage or destruction.
>
>
>
> On 29 September 2016 at 15:01, Ali Akhtar  wrote:
>
>> It needs to be able to scale to a very large amount of data, yes.
>>
>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>> 
>> wrote:
>>
>>> What is the message inflow ?
>>> If it's really high , definitely spark will be of great use .
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Sep 29, 2016 19:24, "Ali Akhtar"  wrote:
>>>
>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>
>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>> raw data into Kafka.
>>>>
>>>> I need to:
>>>>
>>>> - Do ETL on the data, and standardize it.
>>>>
>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw
>>>> HDFS / ElasticSearch / Postgres)
>>>>
>>>> - Query this data to generate reports / analytics (There will be a
>>>> web UI which will be the front-end to the data, and will show the
>>>> reports)
>>>>
>>>> Java is being used as the backend language for everything (backend
>>>> of the web UI, as well as the ETL layer)
>>>>
>>>> I'm considering:
>>>>
>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>> (receive raw data from Kafka, standardize & store it)
>>>>
>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>> data, and to allow queries
>>>>
>>>> - In the backend of the web UI, I could either use Spark to run
>>>> queries across the data (mostly filters), or directly run queries
>>>> against Cassandra / HBase
>>>>
>>>> I'd appreciate some thoughts / suggestions on which of these
>>>> alternatives I should go with (e.g, using raw Kafka consumers vs
>>>> Spark for ETL, which persistent data store to use, and how to query
>>>> that data store in the backend of the web UI, for displaying the reports).
>>>>
>>>>
>>>> Thanks.
>>>>
>>>
>>
>
This e-mail and any files transmitted with it are confidential, may contain 
sensitive information, and are intended solely for the use of the individual or 
entity to whom they are addressed. If you have received this e-mail in error, 
please notify the sender by reply e-mail immediately and destroy all copies of 
the e-mail and any attachments.


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
The web UI is actually the speed layer; it needs to be able to query the
data online and show the results in real time.

It also needs a custom front-end, so a system like Tableau can't be used;
it must have a custom backend + front-end.

Thanks for the recommendation of Flume. Do you think this will work:

- Spark Streaming to read data from Kafka
- Storing the data on HDFS using Flume
- Using Spark to query the data in the backend of the web UI?



On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh 
wrote:

> You need a batch layer and a speed layer. Data from Kafka can be stored on
> HDFS using flume.
>
> -  Query this data to generate reports / analytics (There will be a web UI
> which will be the front-end to the data, and will show the reports)
>
> This is basically batch layer and you need something like Tableau or
> Zeppelin to query data
>
> You will also need spark streaming to query data online for speed layer.
> That data could be stored in some transient fabric like ignite or even
> druid.
>
> HTH
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 September 2016 at 15:01, Ali Akhtar  wrote:
>
>> It needs to be able to scale to a very large amount of data, yes.
>>
>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma 
>> wrote:
>>
>>> What is the message inflow ?
>>> If it's really high , definitely spark will be of great use .
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Sep 29, 2016 19:24, "Ali Akhtar"  wrote:
>>>
>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>
>>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw
>>>> data into Kafka.
>>>>
>>>> I need to:
>>>>
>>>> - Do ETL on the data, and standardize it.
>>>>
>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
>>>> ElasticSearch / Postgres)
>>>>
>>>> - Query this data to generate reports / analytics (There will be a web
>>>> UI which will be the front-end to the data, and will show the reports)
>>>>
>>>> Java is being used as the backend language for everything (backend of
>>>> the web UI, as well as the ETL layer)
>>>>
>>>> I'm considering:
>>>>
>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>> (receive raw data from Kafka, standardize & store it)
>>>>
>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>> data, and to allow queries
>>>>
>>>> - In the backend of the web UI, I could either use Spark to run queries
>>>> across the data (mostly filters), or directly run queries against Cassandra
>>>> / HBase
>>>>
>>>> I'd appreciate some thoughts / suggestions on which of these
>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>>> ETL, which persistent data store to use, and how to query that data store
>>>> in the backend of the web UI, for displaying the reports).
>>>>
>>>>
>>>> Thanks.
>>>>
>>>
>>
>


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
What is the message inflow?
If it's really high, Spark will definitely be of great use.

Thanks
Deepak

On Sep 29, 2016 19:24, "Ali Akhtar"  wrote:

> I have a somewhat tricky use case, and I'm looking for ideas.
>
> I have 5-6 Kafka producers, reading various APIs, and writing their raw
> data into Kafka.
>
> I need to:
>
> - Do ETL on the data, and standardize it.
>
> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
> ElasticSearch / Postgres)
>
> - Query this data to generate reports / analytics (There will be a web UI
> which will be the front-end to the data, and will show the reports)
>
> Java is being used as the backend language for everything (backend of the
> web UI, as well as the ETL layer)
>
> I'm considering:
>
> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive
> raw data from Kafka, standardize & store it)
>
> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data,
> and to allow queries
>
> - In the backend of the web UI, I could either use Spark to run queries
> across the data (mostly filters), or directly run queries against Cassandra
> / HBase
>
> I'd appreciate some thoughts / suggestions on which of these alternatives
> I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which
> persistent data store to use, and how to query that data store in the
> backend of the web UI, for displaying the reports).
>
>
> Thanks.
>


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
You need a batch layer and a speed layer. Data from Kafka can be stored on
HDFS using flume.

-  Query this data to generate reports / analytics (There will be a web UI
which will be the front-end to the data, and will show the reports)

This is basically the batch layer, and you need something like Tableau or
Zeppelin to query the data.

You will also need Spark Streaming to query data online for the speed layer.
That data could be stored in some transient fabric like Ignite or even
Druid.

HTH








Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 15:01, Ali Akhtar  wrote:

> It needs to be able to scale to a very large amount of data, yes.
>
> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma 
> wrote:
>
>> What is the message inflow ?
>> If it's really high , definitely spark will be of great use .
>>
>> Thanks
>> Deepak
>>
>> On Sep 29, 2016 19:24, "Ali Akhtar"  wrote:
>>
>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>
>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw
>>> data into Kafka.
>>>
>>> I need to:
>>>
>>> - Do ETL on the data, and standardize it.
>>>
>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
>>> ElasticSearch / Postgres)
>>>
>>> - Query this data to generate reports / analytics (There will be a web
>>> UI which will be the front-end to the data, and will show the reports)
>>>
>>> Java is being used as the backend language for everything (backend of
>>> the web UI, as well as the ETL layer)
>>>
>>> I'm considering:
>>>
>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>> (receive raw data from Kafka, standardize & store it)
>>>
>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>> data, and to allow queries
>>>
>>> - In the backend of the web UI, I could either use Spark to run queries
>>> across the data (mostly filters), or directly run queries against Cassandra
>>> / HBase
>>>
>>> I'd appreciate some thoughts / suggestions on which of these
>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>> ETL, which persistent data store to use, and how to query that data store
>>> in the backend of the web UI, for displaying the reports).
>>>
>>>
>>> Thanks.
>>>
>>
>


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
It needs to be able to scale to a very large amount of data, yes.

On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma 
wrote:

> What is the message inflow ?
> If it's really high , definitely spark will be of great use .
>
> Thanks
> Deepak
>
> On Sep 29, 2016 19:24, "Ali Akhtar"  wrote:
>
>> I have a somewhat tricky use case, and I'm looking for ideas.
>>
>> I have 5-6 Kafka producers, reading various APIs, and writing their raw
>> data into Kafka.
>>
>> I need to:
>>
>> - Do ETL on the data, and standardize it.
>>
>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
>> ElasticSearch / Postgres)
>>
>> - Query this data to generate reports / analytics (There will be a web UI
>> which will be the front-end to the data, and will show the reports)
>>
>> Java is being used as the backend language for everything (backend of the
>> web UI, as well as the ETL layer)
>>
>> I'm considering:
>>
>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>> (receive raw data from Kafka, standardize & store it)
>>
>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data,
>> and to allow queries
>>
>> - In the backend of the web UI, I could either use Spark to run queries
>> across the data (mostly filters), or directly run queries against Cassandra
>> / HBase
>>
>> I'd appreciate some thoughts / suggestions on which of these alternatives
>> I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which
>> persistent data store to use, and how to query that data store in the
>> backend of the web UI, for displaying the reports).
>>
>>
>> Thanks.
>>
>


Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
I have a somewhat tricky use case, and I'm looking for ideas.

I have 5-6 Kafka producers, reading various APIs, and writing their raw
data into Kafka.

I need to:

- Do ETL on the data, and standardize it.

- Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
ElasticSearch / Postgres)

- Query this data to generate reports / analytics (There will be a web UI
which will be the front-end to the data, and will show the reports)

Java is being used as the backend language for everything (backend of the
web UI, as well as the ETL layer)

I'm considering:

- Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive
raw data from Kafka, standardize & store it)

- Using Cassandra, HBase, or raw HDFS, for storing the standardized data,
and to allow queries

- In the backend of the web UI, I could either use Spark to run queries
across the data (mostly filters), or directly run queries against Cassandra
/ HBase

I'd appreciate some thoughts / suggestions on which of these alternatives I
should go with (e.g, using raw Kafka consumers vs Spark for ETL, which
persistent data store to use, and how to query that data store in the
backend of the web UI, for displaying the reports).


Thanks.
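
For the "raw Kafka consumers as the ETL layer" option above, a minimal sketch of the consume-standardize-store loop (the broker address, group id, topic name and the standardize/store helpers are placeholders for whatever the real pipeline does):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EtlWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");                   // placeholder
        props.put("group.id", "etl-workers");                             // placeholder
        props.put("enable.auto.commit", "false");                         // commit only after the store succeeds
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("raw-events"));  // placeholder topic

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(500);
                for (ConsumerRecord<String, String> record : records) {
                    String standardized = standardize(record.value());    // hypothetical ETL step
                    store(record.key(), standardized);                     // hypothetical write to Cassandra/HBase/HDFS
                }
                consumer.commitSync();  // at-least-once: records may be reprocessed after a failure
            }
        }
    }

    private static String standardize(String raw) { return raw.trim(); }   // stand-in for real normalization
    private static void store(String key, String value) { /* write to the chosen data store */ }
}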


RE: Re : A specific use case

2016-08-05 Thread Hamza HACHANI
Thanks Guozhang Wang.


Hamza


From: Guozhang Wang 
Sent: Thursday, August 4, 2016 06:58:22
To: users@kafka.apache.org
Subject: Re: Re : A specific use case

Yeah, if you can buffer yourself in the process() function and then rely on
punctuate() for generating the outputs that would resolve your issue.

Remember that punctuate() function itself is event-time driven so if you do
not have any data coming in then it may not be triggered. Details:

https://github.com/apache/kafka/pull/1689

Guozhang

On Wed, Aug 3, 2016 at 8:53 PM, Hamza HACHANI 
wrote:

> Hi,
> Yes, in fact.
> And I found a solution.
> It was by editing the punctuate() method in the Kafka Streams processor.
>
> - Reply message -
> From: "Guozhang Wang" 
> To: "users@kafka.apache.org" 
> Subject: A specific use case
> Date: Wed., Aug 3, 2016 23:38
>
> Hello Hamza,
>
> By saying "broker" I think you are actually referring to a Kafka Streams
> instance?
>
>
> Guozhang
>
> On Mon, Aug 1, 2016 at 1:01 AM, Hamza HACHANI 
> wrote:
>
> > Good morning,
> >
> > I'm working on a specific use case. In fact i'm receiving messages from
> an
> > operator network and trying to do statistics on their number per
> > minute,perhour,per day ...
> >
> > I would like to create a broker that receives the messages and generates
> a
> > message every minute. These producted messages are consumed by a consumer
> > from in one hand and also se,t to an other topic which receives them and
> > generates messages every minute.
> >
> > I've  been doing that for a while without a success. In fact the first
> > broker in any time it receives a messages ,it produces one and send it to
> > the other topic.
> >
> > My question is ,what i'm trying to do,Is it possible without passing by
> an
> > intermediate java processus which is out of kafka.
> >
> > If yes , How ?
> >
> > Thanks In advance.
> >
>
>
>
> --
> -- Guozhang
>



--
-- Guozhang


Re: Re : A specific use case

2016-08-04 Thread Guozhang Wang
Yeah, if you can buffer yourself in the process() function and then rely on
punctuate() for generating the outputs that would resolve your issue.

Remember that punctuate() function itself is event-time driven so if you do
not have any data coming in then it may not be triggered. Details:

https://github.com/apache/kafka/pull/1689
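
For anyone landing on this thread later, a minimal sketch of the buffer-in-process() / emit-in-punctuate() pattern against the 0.10-era Processor API (the schedule/punctuate mechanism was reworked in later releases, so treat this as illustrative; the counts here live only in memory, and a state store would be needed to make them fault tolerant):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;

public class MinuteCountProcessor implements Processor<String, String> {
    private ProcessorContext context;
    private final Map<String, Long> counts = new HashMap<>();

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        context.schedule(60 * 1000L);             // request punctuate() roughly every minute (stream-time driven)
    }

    @Override
    public void process(String key, String value) {
        counts.merge(key, 1L, Long::sum);         // buffer only: count per key, emit nothing yet
    }

    @Override
    public void punctuate(long timestamp) {
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            context.forward(entry.getKey(), entry.getValue().toString());  // one record per key per minute
        }
        counts.clear();
        context.commit();
    }

    @Override
    public void close() { }
}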

Guozhang

On Wed, Aug 3, 2016 at 8:53 PM, Hamza HACHANI 
wrote:

> Hi,
> Yes, in fact.
> And I found a solution.
> It was by editing the punctuate() method in the Kafka Streams processor.
>
> - Reply message -
> From: "Guozhang Wang" 
> To: "users@kafka.apache.org" 
> Subject: A specific use case
> Date: Wed., Aug 3, 2016 23:38
>
> Hello Hamza,
>
> By saying "broker" I think you are actually referring to a Kafka Streams
> instance?
>
>
> Guozhang
>
> On Mon, Aug 1, 2016 at 1:01 AM, Hamza HACHANI 
> wrote:
>
> > Good morning,
> >
> > I'm working on a specific use case. In fact i'm receiving messages from
> an
> > operator network and trying to do statistics on their number per
> > minute,perhour,per day ...
> >
> > I would like to create a broker that receives the messages and generates
> a
> > message every minute. These producted messages are consumed by a consumer
> > from in one hand and also se,t to an other topic which receives them and
> > generates messages every minute.
> >
> > I've  been doing that for a while without a success. In fact the first
> > broker in any time it receives a messages ,it produces one and send it to
> > the other topic.
> >
> > My question is ,what i'm trying to do,Is it possible without passing by
> an
> > intermediate java processus which is out of kafka.
> >
> > If yes , How ?
> >
> > Thanks In advance.
> >
>
>
>
> --
> -- Guozhang
>



-- 
-- Guozhang


Re : A specific use case

2016-08-03 Thread Hamza HACHANI
Hi,
Yes, in fact.
And I found a solution.
It was by editing the punctuate() method in the Kafka Streams processor.

- Reply message -
From: "Guozhang Wang" 
To: "users@kafka.apache.org" 
Subject: A specific use case
Date: Wed., Aug 3, 2016 23:38

Hello Hamza,

By saying "broker" I think you are actually referring to a Kafka Streams
instance?


Guozhang

On Mon, Aug 1, 2016 at 1:01 AM, Hamza HACHANI 
wrote:

> Good morning,
>
> I'm working on a specific use case. In fact i'm receiving messages from an
> operator network and trying to do statistics on their number per
> minute,perhour,per day ...
>
> I would like to create a broker that receives the messages and generates a
> message every minute. These producted messages are consumed by a consumer
> from in one hand and also se,t to an other topic which receives them and
> generates messages every minute.
>
> I've  been doing that for a while without a success. In fact the first
> broker in any time it receives a messages ,it produces one and send it to
> the other topic.
>
> My question is ,what i'm trying to do,Is it possible without passing by an
> intermediate java processus which is out of kafka.
>
> If yes , How ?
>
> Thanks In advance.
>



--
-- Guozhang


Re: A specific use case

2016-08-03 Thread Guozhang Wang
Hello Hamza,

By saying "broker" I think you are actually referring to a Kafka Streams
instance?


Guozhang

On Mon, Aug 1, 2016 at 1:01 AM, Hamza HACHANI 
wrote:

> Good morning,
>
> I'm working on a specific use case. In fact i'm receiving messages from an
> operator network and trying to do statistics on their number per
> minute,perhour,per day ...
>
> I would like to create a broker that receives the messages and generates a
> message every minute. These producted messages are consumed by a consumer
> from in one hand and also se,t to an other topic which receives them and
> generates messages every minute.
>
> I've  been doing that for a while without a success. In fact the first
> broker in any time it receives a messages ,it produces one and send it to
> the other topic.
>
> My question is ,what i'm trying to do,Is it possible without passing by an
> intermediate java processus which is out of kafka.
>
> If yes , How ?
>
> Thanks In advance.
>



-- 
-- Guozhang


A specific use case

2016-08-01 Thread Hamza HACHANI
Good morning,

I'm working on a specific use case. I'm receiving messages from an
operator network and trying to do statistics on their count per
minute, per hour, per day ...

I would like to create a broker that receives the messages and generates a
message every minute. These produced messages are consumed by a consumer on
one hand and also sent to another topic, which receives them and generates
messages every minute.

I've been trying to do that for a while without success. In fact, any time the
first broker receives a message, it produces one and sends it to the other
topic.

My question is: is what I'm trying to do possible without going through an
intermediate Java process outside of Kafka?

If yes, how?

Thanks In advance.


Re: Question about a kafka use case : sequence once a partition is added

2016-07-19 Thread Gerard Klijs
You can't; you only get a guarantee on the order within each partition, not
across partitions. Adding partitions will possibly make it a lot worse, since
items with the same key will land in other partitions. For example, with two
partitions this is roughly how the key hashes map to partitions:
partition-0: 0,2,4,6,8
partition-1: 1,3,5,7,9
With three partitions:
partition-0: 0,3,6,9
partition-1: 1,4,7
partition-2: 2,5,8
So items with a key which hashes to 2 will move from partition-0 to
partition-2. I think if you really need to be able to guarantee order, you
will need to add a sequential id to the messages and buffer them when
reading, but this will have all sorts of drawbacks, like losing the
messages in the buffer in case of error (or you must make the commit offset
dependent on the buffer).
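
A tiny sketch of that remapping effect (plain hash-mod for illustration; Kafka's default partitioner actually applies murmur2 to the serialized key, but the point about keys moving when the partition count changes is the same):

public class PartitionRemapDemo {
    // Illustrative only: map a key hash to a partition the way a hash-mod partitioner would.
    static int partitionFor(int keyHash, int numPartitions) {
        return Math.abs(keyHash) % numPartitions;
    }

    public static void main(String[] args) {
        for (int keyHash = 0; keyHash < 10; keyHash++) {
            // e.g. hash 2 lands on partition 0 with two partitions but on partition 2 with three,
            // so per-key ordering across the resize cannot be guaranteed.
            System.out.printf("hash %d: 2 partitions -> p%d, 3 partitions -> p%d%n",
                    keyHash, partitionFor(keyHash, 2), partitionFor(keyHash, 3));
        }
    }
}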

On Mon, Jul 18, 2016 at 9:19 PM Fumo, Vincent 
wrote:

> I want to use Kafka for notifications of changes to data in a
> dataservice/database. For each object that changes, a kafka message will be
> sent. This is easy and we've got that working no problem.
>
> Here is my use case : I want to be able to fire up a process that will
>
> 1) determine the current location of the kafka topic (right now we use 2
> partitions so that would be the offset for each partition)
> 2) do a long running process that will copy data from the database
> 3) once the process is over, put the location back into a kafka consumer
> and start processing notifications in sequence
>
> This isn't very hard either but there is a problem that we'd face if
> during step (2) partitions are added to the topic (say by our operations
> team).
>
> I know we can set up a ConsumerRebalanceListener but I don't think that
> will help because we'd need to back to a time when we had our original
> number of partitions and then we'd need to know exactly when to start
> reading from the new partition(s).
>
> for example
>
> start : 2 partitions (0,1) at offsets p0,100 and p1,100
>
> 1) we store the offsets and partitions : p0,100 and p1,100
> 2) we run the db ingest
> 3) messages are posted to p0 and p1
> 4) OPS team adds p2 and our ConsumerRebalanceListener would be notified
> 5) we are done and we set our consumer to p0,100 and p1,100 (and p2,0
> thanks to the ConsumerRebalanceListener)
>
> how would we guarantee the order of messages received from our consumer
> across all 3 partitions?
>
>
>


Question about a kafka use case : sequence once a partition is added

2016-07-18 Thread Fumo, Vincent
I want to use Kafka for notifications of changes to data in a 
dataservice/database. For each object that changes, a kafka message will be 
sent. This is easy and we've got that working no problem.

Here is my use case : I want to be able to fire up a process that will 

1) determine the current location of the kafka topic (right now we use 2 
partitions so that would be the offset for each partition)
2) do a long running process that will copy data from the database
3) once the process is over, put the location back into a kafka consumer and 
start processing notifications in sequence

This isn't very hard either but there is a problem that we'd face if during 
step (2) partitions are added to the topic (say by our operations team).

I know we can set up a ConsumerRebalanceListener but I don't think that will 
help because we'd need to back to a time when we had our original number of 
partitions and then we'd need to know exactly when to start reading from the 
new partition(s).

for example

start : 2 partitions (0,1) at offsets p0,100 and p1,100

1) we store the offsets and partitions : p0,100 and p1,100
2) we run the db ingest
3) messages are posted to p0 and p1 
4) OPS team adds p2 and our ConsumerRebalanceListener would be notified
5) we are done and we set our consumer to p0,100 and p1,100 (and p2,0 thanks to 
the ConsumerRebalanceListener)

how would we guarantee the order of messages received from our consumer across 
all 3 partitions?
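
Assuming the partition count stays fixed (which, as the reply above points out, is the fragile part), a minimal sketch of steps 1 and 3 with the Java consumer: record the current positions, run the long copy, then seek back. Note this only gives ordering within each partition, not across them. The broker address, group id, topic name and helper methods are placeholders:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ResumeAfterIngest {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder
        props.put("group.id", "notification-catchup");    // placeholder
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> parts = Arrays.asList(
                    new TopicPartition("notifications", 0),   // placeholder topic, 2 partitions as in the example
                    new TopicPartition("notifications", 1));
            consumer.assign(parts);

            // 1) Record where each partition is right now.
            Map<TopicPartition, Long> marks = new HashMap<>();
            for (TopicPartition tp : parts) {
                marks.put(tp, consumer.position(tp));
            }

            runDatabaseIngest();                              // 2) hypothetical long-running copy

            // 3) Rewind to the recorded positions and process notifications per partition, in order.
            for (TopicPartition tp : parts) {
                consumer.seek(tp, marks.get(tp));
            }
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(500)) {
                    applyNotification(rec);                   // hypothetical handler
                }
            }
        }
    }

    private static void runDatabaseIngest() { /* long-running DB copy */ }
    private static void applyNotification(ConsumerRecord<String, String> rec) { /* process one change */ }
}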




Re: My Use case is I want to delete the records instantly after consuming them

2016-07-02 Thread Navneet Kumar
Hi All
Should the number of consumer instances equal the number of partitions
created for the topic?
I have created three partitions for the topic, but I am using only two
consumer pollers to subscribe to the records. Sometimes I have noticed
that the messages are polled very late. What would be a good
polling strategy? Please advise.

/home/kafka/kafka_2.11-0.9.0.0/bin/kafka-topics.sh --create
--zookeeper 172.16.8.216:2181 --replication-factor 3 --partitions 3
--topic EmailOCTracker
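
A small sketch that makes the partition/poller relationship visible: each instance of this loop logs the partitions it is assigned, so with two pollers you will see one of them owning two of the three partitions; running three instances gives every partition its own poller, and keeping the poll loop tight (heavy work done elsewhere) helps avoid late polls. The broker address and group id below are assumptions:

import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class TrackerPoller {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "172.16.8.216:9092");   // assumption: broker reachable on this host/port
        props.put("group.id", "email-oc-tracker");              // placeholder group id
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Log the assignment so it is obvious how the three partitions are split across pollers.
            consumer.subscribe(Collections.singletonList("EmailOCTracker"),
                new ConsumerRebalanceListener() {
                    public void onPartitionsRevoked(Collection<TopicPartition> parts) {
                        System.out.println("Revoked: " + parts);
                    }
                    public void onPartitionsAssigned(Collection<TopicPartition> parts) {
                        System.out.println("Assigned: " + parts);
                    }
                });

            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(100)) {
                    System.out.println("partition " + record.partition() + " offset " + record.offset());
                }
            }
        }
    }
}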




Thanks and Regards,
Navneet Kumar


On Sat, Jul 2, 2016 at 10:49 PM, Navneet Kumar
 wrote:
> Thank you so much Ian
>
>
>
>
> Thanks and Regards,
> Navneet Kumar
>
>
> On Sat, Jul 2, 2016 at 9:45 PM, Ian Wrigley  wrote:
>> That’s really not what Kafka was designed to do. You can set a short log 
>> retention period, which will mean messages are deleted relatively soon after 
>> they were written to Kafka, but there’s no mechanism for deleting records on 
>> consumption.
>>
>> Ian.
>>
>>
>> ---
>> Ian Wrigley
>> Director, Education Services
>> Confluent, Inc
>>
>>> On Jul 2, 2016, at 11:08 AM, Navneet Kumar  
>>> wrote:
>>>
>>> Hi All
>>> My Use case is I want to delete the records instantly after consuming
>>> them. I am using Kafka 0.90
>>>
>>>
>>>
>>> Thanks and Regards,
>>> Navneet Kumar
>>


Re: My Use case is I want to delete the records instantly after consuming them

2016-07-02 Thread Navneet Kumar
Thank you so much Ian




Thanks and Regards,
Navneet Kumar


On Sat, Jul 2, 2016 at 9:45 PM, Ian Wrigley  wrote:
> That’s really not what Kafka was designed to do. You can set a short log 
> retention period, which will mean messages are deleted relatively soon after 
> they were written to Kafka, but there’s no mechanism for deleting records on 
> consumption.
>
> Ian.
>
>
> ---
> Ian Wrigley
> Director, Education Services
> Confluent, Inc
>
>> On Jul 2, 2016, at 11:08 AM, Navneet Kumar  
>> wrote:
>>
>> Hi All
>> My Use case is I want to delete the records instantly after consuming
>> them. I am using Kafka 0.90
>>
>>
>>
>> Thanks and Regards,
>> Navneet Kumar
>


Re: My Use case is I want to delete the records instantly after consuming them

2016-07-02 Thread Ian Wrigley
That’s really not what Kafka was designed to do. You can set a short log 
retention period, which will mean messages are deleted relatively soon after 
they were written to Kafka, but there’s no mechanism for deleting records on 
consumption.

Ian.


---
Ian Wrigley
Director, Education Services
Confluent, Inc
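
As an illustration of the retention approach described above, a hedged sketch
that sets a short retention.ms on a topic through the AdminClient API (which
appeared in releases after the 0.9 version mentioned in this thread); the topic
name and the one-minute value are placeholders, and note that the broker
deletes whole log segments once they age out, not individual consumed records:

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;

public class ShortRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            // retention.ms = 60000 keeps records for roughly one minute; the
            // broker deletes whole segments once they age out, not single records
            AlterConfigOp op = new AlterConfigOp(
                new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG, "60000"),
                AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                Collections.singletonMap(topic, Collections.singletonList(op));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}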

> On Jul 2, 2016, at 11:08 AM, Navneet Kumar  
> wrote:
> 
> Hi All
> My Use case is I want to delete the records instantly after consuming
> them. I am using Kafka 0.90
> 
> 
> 
> Thanks and Regards,
> Navneet Kumar



My Use case is I want to delete the records instantly after consuming them

2016-07-02 Thread Navneet Kumar
 Hi All
My Use case is I want to delete the records instantly after consuming
them. I am using Kafka 0.90



Thanks and Regards,
Navneet Kumar


Re: Kafka Streams: finding a solution to a particular use case

2016-04-20 Thread Guozhang Wang
ld be no problem to read from them and execute a new
> stream
> > > > > > process,
> > > > > > > > right? (like a new joins, counts...).
> > > > > > > >
> > > > > > > > Thanks!!
> > > > > > > >
> > > > > > > > 2016-04-15 17:37 GMT+02:00 Guozhang Wang  >:
> > > > > > > >
> > > > > > > > > 1) There are three types of joins for KTable-KTable join,
> the
> > > > > follow
> > > > > > > the
> > > > > > > > > same semantics in SQL joins:
> > > > > > > > >
> > > > > > > > > KTable.join(KTable): when there is no matching record from
> > > inner
> > > > > > table
> > > > > > > > when
> > > > > > > > > received a new record from outer table, no output; and vice
> > > > versa.
> > > > > > > > > KTable.leftjoin(KTable): when there is no matching record
> > from
> > > > > inner
> > > > > > > > table
> > > > > > > > > when received a new record from outer table, output (a,
> > null);
> > > on
> > > > > the
> > > > > > > > other
> > > > > > > > > direction no output.
> > > > > > > > > KTable.outerjoin(KTable): when there is no matching record
> > from
> > > > > > inner /
> > > > > > > > > outer table when received a new record from outer / inner
> > > table,
> > > > > > output
> > > > > > > > (a,
> > > > > > > > > null) or (null, b).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2) The result topic is also a changelog topic, although it
> > will
> > > > be
> > > > > > log
> > > > > > > > > compacted on the key over time, if you consume immediately
> > the
> > > > log
> > > > > > may
> > > > > > > > not
> > > > > > > > > be yet compacted.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Guozhang
> > > > > > > > >
> > > > > > > > > On Fri, Apr 15, 2016 at 2:11 AM, Guillermo Lammers Corral <
> > > > > > > > > guillermo.lammers.cor...@tecsisa.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Guozhang,
> > > > > > > > > >
> > > > > > > > > > Thank you very much for your reply and sorry for the
> > generic
> > > > > > > question,
> > > > > > > > > I'll
> > > > > > > > > > try to explain with some pseudocode.
> > > > > > > > > >
> > > > > > > > > > I have two KTable with a join:
> > > > > > > > > >
> > > > > > > > > > ktable1: KTable[String, String] = builder.table("topic1")
> > > > > > > > > > ktable2: KTable[String, String] = builder.table("topic2")
> > > > > > > > > >
> > > > > > > > > > result: KTable[String, ResultUnion] =
> > > > > > > > > > ktable1.join(ktable2, (data1, data2) => new
> > > ResultUnion(data1,
> > > > > > > data2))
> > > > > > > > > >
> > > > > > > > > > I send the result to a topic result.to("resultTopic").
> > > > > > > > > >
> > > > > > > > > > My questions are related with the following scenario:
> > > > > > > > > >
> > > > > > > > > > - The streming is up & running without data in topics
> > > > > > > > > >
> > > > > > > > > > - I send data to "topic2", for example a key/value like
> > that
> > > > > > > > > ("uniqueKey1",
> > > > > > > > > > "hello")
> > >

Re: Kafka Streams: finding a solution to a particular use case

2016-04-20 Thread Henry Cai
om outer table, output (a,
> null);
> > on
> > > > the
> > > > > > > other
> > > > > > > > direction no output.
> > > > > > > > KTable.outerjoin(KTable): when there is no matching record
> from
> > > > > inner /
> > > > > > > > outer table when received a new record from outer / inner
> > table,
> > > > > output
> > > > > > > (a,
> > > > > > > > null) or (null, b).
> > > > > > > >
> > > > > > > >
> > > > > > > > 2) The result topic is also a changelog topic, although it
> will
> > > be
> > > > > log
> > > > > > > > compacted on the key over time, if you consume immediately
> the
> > > log
> > > > > may
> > > > > > > not
> > > > > > > > be yet compacted.
> > > > > > > >
> > > > > > > >
> > > > > > > > Guozhang
> > > > > > > >
> > > > > > > > On Fri, Apr 15, 2016 at 2:11 AM, Guillermo Lammers Corral <
> > > > > > > > guillermo.lammers.cor...@tecsisa.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Guozhang,
> > > > > > > > >
> > > > > > > > > Thank you very much for your reply and sorry for the
> generic
> > > > > > question,
> > > > > > > > I'll
> > > > > > > > > try to explain with some pseudocode.
> > > > > > > > >
> > > > > > > > > I have two KTable with a join:
> > > > > > > > >
> > > > > > > > > ktable1: KTable[String, String] = builder.table("topic1")
> > > > > > > > > ktable2: KTable[String, String] = builder.table("topic2")
> > > > > > > > >
> > > > > > > > > result: KTable[String, ResultUnion] =
> > > > > > > > > ktable1.join(ktable2, (data1, data2) => new
> > ResultUnion(data1,
> > > > > > data2))
> > > > > > > > >
> > > > > > > > > I send the result to a topic result.to("resultTopic").
> > > > > > > > >
> > > > > > > > > My questions are related with the following scenario:
> > > > > > > > >
> > > > > > > > > - The streming is up & running without data in topics
> > > > > > > > >
> > > > > > > > > - I send data to "topic2", for example a key/value like
> that
> > > > > > > > ("uniqueKey1",
> > > > > > > > > "hello")
> > > > > > > > >
> > > > > > > > > - I see null values in topic "resultTopic", i.e.
> > ("uniqueKey1",
> > > > > null)
> > > > > > > > >
> > > > > > > > > - If I send data to "topic1", for example a key/value like
> > that
> > > > > > > > > ("uniqueKey1", "world") then I see this values in topic
> > > > > > "resultTopic",
> > > > > > > > > ("uniqueKey1", ResultUnion("hello", "world"))
> > > > > > > > >
> > > > > > > > > Q: If we send data for one of the KTable that does not have
> > the
> > > > > > > > > corresponding data by key in the other one, obtain null
> > values
> > > in
> > > > > the
> > > > > > > > > result final topic is the expected behavior?
> > > > > > > > >
> > > > > > > > > My next step would be use Kafka Connect to persist result
> > data
> > > in
> > > > > C*
> > > > > > (I
> > > > > > > > > have not read yet the Connector docs...), is this the way
> to
> > do
> > > > it?
> > > > > > (I
> > > > > > > > mean
> > > > > > > > > prepare the data in the topic).
> > > > > > > > >
> > > > > > > > > Q: On the other hand, just to try, I have a KTable that
> read
> > > > > messages
> > > > > > > in
> > > > > > > > > "resultTopic" and prints them. If the stream is a KTable I
> am
> > > > > > wondering
> > > > > > > > why
> > > > > > > > > is getting all the values from the topic even those with
> the
> > > same
> > > > > > key?
> > > > > > > > >
> > > > > > > > > Thanks in advance! Great job answering community!
> > > > > > > > >
> > > > > > > > > 2016-04-14 20:00 GMT+02:00 Guozhang Wang <
> wangg...@gmail.com
> > >:
> > > > > > > > >
> > > > > > > > > > Hi Guillermo,
> > > > > > > > > >
> > > > > > > > > > 1) Yes in your case, the streams are really a "changelog"
> > > > stream,
> > > > > > > hence
> > > > > > > > > you
> > > > > > > > > > should create the stream as KTable, and do KTable-KTable
> > > join.
> > > > > > > > > >
> > > > > > > > > > 2) Could elaborate about "achieving this"? What behavior
> do
> > > > > require
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > application logic?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Guozhang
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Apr 14, 2016 at 1:30 AM, Guillermo Lammers
> Corral <
> > > > > > > > > > guillermo.lammers.cor...@tecsisa.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > I am a newbie to Kafka Streams and I am using it trying
> > to
> > > > > solve
> > > > > > a
> > > > > > > > > > > particular use case. Let me explain.
> > > > > > > > > > >
> > > > > > > > > > > I have two sources of data both like that:
> > > > > > > > > > >
> > > > > > > > > > > Key (string)
> > > > > > > > > > > DateTime (hourly granularity)
> > > > > > > > > > > Value
> > > > > > > > > > >
> > > > > > > > > > > I need to join the two sources by key and date (hour of
> > > day)
> > > > to
> > > > > > > > obtain:
> > > > > > > > > > >
> > > > > > > > > > > Key (string)
> > > > > > > > > > > DateTime (hourly granularity)
> > > > > > > > > > > ValueSource1
> > > > > > > > > > > ValueSource2
> > > > > > > > > > >
> > > > > > > > > > > I think that first I'd need to push the messages in
> Kafka
> > > > > topics
> > > > > > > with
> > > > > > > > > the
> > > > > > > > > > > date as part of the key because I'll group by key
> taking
> > > into
> > > > > > > account
> > > > > > > > > the
> > > > > > > > > > > date. So maybe the key must be a new string like
> > > > key_timestamp.
> > > > > > > But,
> > > > > > > > of
> > > > > > > > > > > course, it is not the main problem, is just an
> additional
> > > > > > > > explanation.
> > > > > > > > > > >
> > > > > > > > > > > Ok, so data are in topics, here we go!
> > > > > > > > > > >
> > > > > > > > > > > - Multiple records allows per key but only the latest
> > value
> > > > > for a
> > > > > > > > > record
> > > > > > > > > > > key will be considered. I should use two KTable with
> some
> > > > join
> > > > > > > > > strategy,
> > > > > > > > > > > right?
> > > > > > > > > > >
> > > > > > > > > > > - Data of both sources could arrive at any time. What
> > can I
> > > > do
> > > > > to
> > > > > > > > > achieve
> > > > > > > > > > > this?
> > > > > > > > > > >
> > > > > > > > > > > Thanks in advance.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > -- Guozhang
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > -- Guozhang
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
>
> --
> -- Guozhang
>


Re: Kafka Streams: finding a solution to a particular use case

2016-04-20 Thread Guozhang Wang
> > > Hi Guozhang,
> > > > > > > >
> > > > > > > > Thank you very much for your reply and sorry for the generic
> > > > > question,
> > > > > > > I'll
> > > > > > > > try to explain with some pseudocode.
> > > > > > > >
> > > > > > > > I have two KTable with a join:
> > > > > > > >
> > > > > > > > ktable1: KTable[String, String] = builder.table("topic1")
> > > > > > > > ktable2: KTable[String, String] = builder.table("topic2")
> > > > > > > >
> > > > > > > > result: KTable[String, ResultUnion] =
> > > > > > > > ktable1.join(ktable2, (data1, data2) => new
> ResultUnion(data1,
> > > > > data2))
> > > > > > > >
> > > > > > > > I send the result to a topic result.to("resultTopic").
> > > > > > > >
> > > > > > > > My questions are related with the following scenario:
> > > > > > > >
> > > > > > > > - The streming is up & running without data in topics
> > > > > > > >
> > > > > > > > - I send data to "topic2", for example a key/value like that
> > > > > > > ("uniqueKey1",
> > > > > > > > "hello")
> > > > > > > >
> > > > > > > > - I see null values in topic "resultTopic", i.e.
> ("uniqueKey1",
> > > > null)
> > > > > > > >
> > > > > > > > - If I send data to "topic1", for example a key/value like
> that
> > > > > > > > ("uniqueKey1", "world") then I see this values in topic
> > > > > "resultTopic",
> > > > > > > > ("uniqueKey1", ResultUnion("hello", "world"))
> > > > > > > >
> > > > > > > > Q: If we send data for one of the KTable that does not have
> the
> > > > > > > > corresponding data by key in the other one, obtain null
> values
> > in
> > > > the
> > > > > > > > result final topic is the expected behavior?
> > > > > > > >
> > > > > > > > My next step would be use Kafka Connect to persist result
> data
> > in
> > > > C*
> > > > > (I
> > > > > > > > have not read yet the Connector docs...), is this the way to
> do
> > > it?
> > > > > (I
> > > > > > > mean
> > > > > > > > prepare the data in the topic).
> > > > > > > >
> > > > > > > > Q: On the other hand, just to try, I have a KTable that read
> > > > messages
> > > > > > in
> > > > > > > > "resultTopic" and prints them. If the stream is a KTable I am
> > > > > wondering
> > > > > > > why
> > > > > > > > is getting all the values from the topic even those with the
> > same
> > > > > key?
> > > > > > > >
> > > > > > > > Thanks in advance! Great job answering community!
> > > > > > > >
> > > > > > > > 2016-04-14 20:00 GMT+02:00 Guozhang Wang  >:
> > > > > > > >
> > > > > > > > > Hi Guillermo,
> > > > > > > > >
> > > > > > > > > 1) Yes in your case, the streams are really a "changelog"
> > > stream,
> > > > > > hence
> > > > > > > > you
> > > > > > > > > should create the stream as KTable, and do KTable-KTable
> > join.
> > > > > > > > >
> > > > > > > > > 2) Could elaborate about "achieving this"? What behavior do
> > > > require
> > > > > > in
> > > > > > > > the
> > > > > > > > > application logic?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Guozhang
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Apr 14, 2016 at 1:30 AM, Guillermo Lammers Corral <
> > > > > > > > > guillermo.lammers.cor...@tecsisa.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I am a newbie to Kafka Streams and I am using it trying
> to
> > > > solve
> > > > > a
> > > > > > > > > > particular use case. Let me explain.
> > > > > > > > > >
> > > > > > > > > > I have two sources of data both like that:
> > > > > > > > > >
> > > > > > > > > > Key (string)
> > > > > > > > > > DateTime (hourly granularity)
> > > > > > > > > > Value
> > > > > > > > > >
> > > > > > > > > > I need to join the two sources by key and date (hour of
> > day)
> > > to
> > > > > > > obtain:
> > > > > > > > > >
> > > > > > > > > > Key (string)
> > > > > > > > > > DateTime (hourly granularity)
> > > > > > > > > > ValueSource1
> > > > > > > > > > ValueSource2
> > > > > > > > > >
> > > > > > > > > > I think that first I'd need to push the messages in Kafka
> > > > topics
> > > > > > with
> > > > > > > > the
> > > > > > > > > > date as part of the key because I'll group by key taking
> > into
> > > > > > account
> > > > > > > > the
> > > > > > > > > > date. So maybe the key must be a new string like
> > > key_timestamp.
> > > > > > But,
> > > > > > > of
> > > > > > > > > > course, it is not the main problem, is just an additional
> > > > > > > explanation.
> > > > > > > > > >
> > > > > > > > > > Ok, so data are in topics, here we go!
> > > > > > > > > >
> > > > > > > > > > - Multiple records allows per key but only the latest
> value
> > > > for a
> > > > > > > > record
> > > > > > > > > > key will be considered. I should use two KTable with some
> > > join
> > > > > > > > strategy,
> > > > > > > > > > right?
> > > > > > > > > >
> > > > > > > > > > - Data of both sources could arrive at any time. What
> can I
> > > do
> > > > to
> > > > > > > > achieve
> > > > > > > > > > this?
> > > > > > > > > >
> > > > > > > > > > Thanks in advance.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > -- Guozhang
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > -- Guozhang
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
-- Guozhang


Re: Kafka Streams: finding a solution to a particular use case

2016-04-20 Thread Guillermo Lammers Corral
> >> versa.
> >>>>>>> KTable.leftjoin(KTable): when there is no matching record from
> >>> inner
> >>>>>> table
> >>>>>>> when received a new record from outer table, output (a, null); on
> >>> the
> >>>>>> other
> >>>>>>> direction no output.
> >>>>>>> KTable.outerjoin(KTable): when there is no matching record from
> >>>> inner /
> >>>>>>> outer table when received a new record from outer / inner table,
> >>>> output
> >>>>>> (a,
> >>>>>>> null) or (null, b).
> >>>>>>>
> >>>>>>>
> >>>>>>> 2) The result topic is also a changelog topic, although it will
> >> be
> >>>> log
> >>>>>>> compacted on the key over time, if you consume immediately the
> >> log
> >>>> may
> >>>>>> not
> >>>>>>> be yet compacted.
> >>>>>>>
> >>>>>>>
> >>>>>>> Guozhang
> >>>>>>>
> >>>>>>> On Fri, Apr 15, 2016 at 2:11 AM, Guillermo Lammers Corral <
> >>>>>>> guillermo.lammers.cor...@tecsisa.com> wrote:
> >>>>>>>
> >>>>>>>> Hi Guozhang,
> >>>>>>>>
> >>>>>>>> Thank you very much for your reply and sorry for the generic
> >>>>> question,
> >>>>>>> I'll
> >>>>>>>> try to explain with some pseudocode.
> >>>>>>>>
> >>>>>>>> I have two KTable with a join:
> >>>>>>>>
> >>>>>>>> ktable1: KTable[String, String] = builder.table("topic1")
> >>>>>>>> ktable2: KTable[String, String] = builder.table("topic2")
> >>>>>>>>
> >>>>>>>> result: KTable[String, ResultUnion] =
> >>>>>>>> ktable1.join(ktable2, (data1, data2) => new ResultUnion(data1,
> >>>>> data2))
> >>>>>>>>
> >>>>>>>> I send the result to a topic result.to("resultTopic").
> >>>>>>>>
> >>>>>>>> My questions are related with the following scenario:
> >>>>>>>>
> >>>>>>>> - The streming is up & running without data in topics
> >>>>>>>>
> >>>>>>>> - I send data to "topic2", for example a key/value like that
> >>>>>>> ("uniqueKey1",
> >>>>>>>> "hello")
> >>>>>>>>
> >>>>>>>> - I see null values in topic "resultTopic", i.e. ("uniqueKey1",
> >>>> null)
> >>>>>>>>
> >>>>>>>> - If I send data to "topic1", for example a key/value like that
> >>>>>>>> ("uniqueKey1", "world") then I see this values in topic
> >>>>> "resultTopic",
> >>>>>>>> ("uniqueKey1", ResultUnion("hello", "world"))
> >>>>>>>>
> >>>>>>>> Q: If we send data for one of the KTable that does not have the
> >>>>>>>> corresponding data by key in the other one, obtain null values
> >> in
> >>>> the
> >>>>>>>> result final topic is the expected behavior?
> >>>>>>>>
> >>>>>>>> My next step would be use Kafka Connect to persist result data
> >> in
> >>>> C*
> >>>>> (I
> >>>>>>>> have not read yet the Connector docs...), is this the way to do
> >>> it?
> >>>>> (I
> >>>>>>> mean
> >>>>>>>> prepare the data in the topic).
> >>>>>>>>
> >>>>>>>> Q: On the other hand, just to try, I have a KTable that read
> >>>> messages
> >>>>>> in
> >>>>>>>> "resultTopic" and prints them. If the stream is a KTable I am
> >>>>> wondering
> >>>>>>> why
> >>>>>>>> is getting all the values from the topic even those with the
> >> same
> >>>>> key?
> >>>>>>>>
> >>>>>>>> Thanks in advance! Great job answering community!
> >>>>>>>>
> >>>>>>>> 2016-04-14 20:00 GMT+02:00 Guozhang Wang :
> >>>>>>>>
> >>>>>>>>> Hi Guillermo,
> >>>>>>>>>
> >>>>>>>>> 1) Yes in your case, the streams are really a "changelog"
> >>> stream,
> >>>>>> hence
> >>>>>>>> you
> >>>>>>>>> should create the stream as KTable, and do KTable-KTable
> >> join.
> >>>>>>>>>
> >>>>>>>>> 2) Could elaborate about "achieving this"? What behavior do
> >>>> require
> >>>>>> in
> >>>>>>>> the
> >>>>>>>>> application logic?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Guozhang
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Apr 14, 2016 at 1:30 AM, Guillermo Lammers Corral <
> >>>>>>>>> guillermo.lammers.cor...@tecsisa.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I am a newbie to Kafka Streams and I am using it trying to
> >>>> solve
> >>>>> a
> >>>>>>>>>> particular use case. Let me explain.
> >>>>>>>>>>
> >>>>>>>>>> I have two sources of data both like that:
> >>>>>>>>>>
> >>>>>>>>>> Key (string)
> >>>>>>>>>> DateTime (hourly granularity)
> >>>>>>>>>> Value
> >>>>>>>>>>
> >>>>>>>>>> I need to join the two sources by key and date (hour of
> >> day)
> >>> to
> >>>>>>> obtain:
> >>>>>>>>>>
> >>>>>>>>>> Key (string)
> >>>>>>>>>> DateTime (hourly granularity)
> >>>>>>>>>> ValueSource1
> >>>>>>>>>> ValueSource2
> >>>>>>>>>>
> >>>>>>>>>> I think that first I'd need to push the messages in Kafka
> >>>> topics
> >>>>>> with
> >>>>>>>> the
> >>>>>>>>>> date as part of the key because I'll group by key taking
> >> into
> >>>>>> account
> >>>>>>>> the
> >>>>>>>>>> date. So maybe the key must be a new string like
> >>> key_timestamp.
> >>>>>> But,
> >>>>>>> of
> >>>>>>>>>> course, it is not the main problem, is just an additional
> >>>>>>> explanation.
> >>>>>>>>>>
> >>>>>>>>>> Ok, so data are in topics, here we go!
> >>>>>>>>>>
> >>>>>>>>>> - Multiple records allows per key but only the latest value
> >>>> for a
> >>>>>>>> record
> >>>>>>>>>> key will be considered. I should use two KTable with some
> >>> join
> >>>>>>>> strategy,
> >>>>>>>>>> right?
> >>>>>>>>>>
> >>>>>>>>>> - Data of both sources could arrive at any time. What can I
> >>> do
> >>>> to
> >>>>>>>> achieve
> >>>>>>>>>> this?
> >>>>>>>>>>
> >>>>>>>>>> Thanks in advance.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> -- Guozhang
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> -- Guozhang
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> -- Guozhang
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> -- Guozhang
> >>
> >
>
>


Re: Kafka Streams: finding a solution to a particular use case

2016-04-20 Thread Matthias J. Sax
explain with some pseudocode.
>>>>>>>>
>>>>>>>> I have two KTable with a join:
>>>>>>>>
>>>>>>>> ktable1: KTable[String, String] = builder.table("topic1")
>>>>>>>> ktable2: KTable[String, String] = builder.table("topic2")
>>>>>>>>
>>>>>>>> result: KTable[String, ResultUnion] =
>>>>>>>> ktable1.join(ktable2, (data1, data2) => new ResultUnion(data1,
>>>>> data2))
>>>>>>>>
>>>>>>>> I send the result to a topic result.to("resultTopic").
>>>>>>>>
>>>>>>>> My questions are related with the following scenario:
>>>>>>>>
>>>>>>>> - The streming is up & running without data in topics
>>>>>>>>
>>>>>>>> - I send data to "topic2", for example a key/value like that
>>>>>>> ("uniqueKey1",
>>>>>>>> "hello")
>>>>>>>>
>>>>>>>> - I see null values in topic "resultTopic", i.e. ("uniqueKey1",
>>>> null)
>>>>>>>>
>>>>>>>> - If I send data to "topic1", for example a key/value like that
>>>>>>>> ("uniqueKey1", "world") then I see this values in topic
>>>>> "resultTopic",
>>>>>>>> ("uniqueKey1", ResultUnion("hello", "world"))
>>>>>>>>
>>>>>>>> Q: If we send data for one of the KTable that does not have the
>>>>>>>> corresponding data by key in the other one, obtain null values
>> in
>>>> the
>>>>>>>> result final topic is the expected behavior?
>>>>>>>>
>>>>>>>> My next step would be use Kafka Connect to persist result data
>> in
>>>> C*
>>>>> (I
>>>>>>>> have not read yet the Connector docs...), is this the way to do
>>> it?
>>>>> (I
>>>>>>> mean
>>>>>>>> prepare the data in the topic).
>>>>>>>>
>>>>>>>> Q: On the other hand, just to try, I have a KTable that read
>>>> messages
>>>>>> in
>>>>>>>> "resultTopic" and prints them. If the stream is a KTable I am
>>>>> wondering
>>>>>>> why
>>>>>>>> is getting all the values from the topic even those with the
>> same
>>>>> key?
>>>>>>>>
>>>>>>>> Thanks in advance! Great job answering community!
>>>>>>>>
>>>>>>>> 2016-04-14 20:00 GMT+02:00 Guozhang Wang :
>>>>>>>>
>>>>>>>>> Hi Guillermo,
>>>>>>>>>
>>>>>>>>> 1) Yes in your case, the streams are really a "changelog"
>>> stream,
>>>>>> hence
>>>>>>>> you
>>>>>>>>> should create the stream as KTable, and do KTable-KTable
>> join.
>>>>>>>>>
>>>>>>>>> 2) Could elaborate about "achieving this"? What behavior do
>>>> require
>>>>>> in
>>>>>>>> the
>>>>>>>>> application logic?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Guozhang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Apr 14, 2016 at 1:30 AM, Guillermo Lammers Corral <
>>>>>>>>> guillermo.lammers.cor...@tecsisa.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am a newbie to Kafka Streams and I am using it trying to
>>>> solve
>>>>> a
>>>>>>>>>> particular use case. Let me explain.
>>>>>>>>>>
>>>>>>>>>> I have two sources of data both like that:
>>>>>>>>>>
>>>>>>>>>> Key (string)
>>>>>>>>>> DateTime (hourly granularity)
>>>>>>>>>> Value
>>>>>>>>>>
>>>>>>>>>> I need to join the two sources by key and date (hour of
>> day)
>>> to
>>>>>>> obtain:
>>>>>>>>>>
>>>>>>>>>> Key (string)
>>>>>>>>>> DateTime (hourly granularity)
>>>>>>>>>> ValueSource1
>>>>>>>>>> ValueSource2
>>>>>>>>>>
>>>>>>>>>> I think that first I'd need to push the messages in Kafka
>>>> topics
>>>>>> with
>>>>>>>> the
>>>>>>>>>> date as part of the key because I'll group by key taking
>> into
>>>>>> account
>>>>>>>> the
>>>>>>>>>> date. So maybe the key must be a new string like
>>> key_timestamp.
>>>>>> But,
>>>>>>> of
>>>>>>>>>> course, it is not the main problem, is just an additional
>>>>>>> explanation.
>>>>>>>>>>
>>>>>>>>>> Ok, so data are in topics, here we go!
>>>>>>>>>>
>>>>>>>>>> - Multiple records allows per key but only the latest value
>>>> for a
>>>>>>>> record
>>>>>>>>>> key will be considered. I should use two KTable with some
>>> join
>>>>>>>> strategy,
>>>>>>>>>> right?
>>>>>>>>>>
>>>>>>>>>> - Data of both sources could arrive at any time. What can I
>>> do
>>>> to
>>>>>>>> achieve
>>>>>>>>>> this?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> -- Guozhang
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> -- Guozhang
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> -- Guozhang
>>>>
>>>
>>
>>
>>
>> --
>> -- Guozhang
>>
> 





Re: Kafka Streams: finding a solution to a particular use case

2016-04-19 Thread Henry Cai
> > - I send data to "topic2", for example a key/value like that
> > > > > > ("uniqueKey1",
> > > > > > > "hello")
> > > > > > >
> > > > > > > - I see null values in topic "resultTopic", i.e. ("uniqueKey1",
> > > null)
> > > > > > >
> > > > > > > - If I send data to "topic1", for example a key/value like that
> > > > > > > ("uniqueKey1", "world") then I see this values in topic
> > > > "resultTopic",
> > > > > > > ("uniqueKey1", ResultUnion("hello", "world"))
> > > > > > >
> > > > > > > Q: If we send data for one of the KTable that does not have the
> > > > > > > corresponding data by key in the other one, obtain null values
> in
> > > the
> > > > > > > result final topic is the expected behavior?
> > > > > > >
> > > > > > > My next step would be use Kafka Connect to persist result data
> in
> > > C*
> > > > (I
> > > > > > > have not read yet the Connector docs...), is this the way to do
> > it?
> > > > (I
> > > > > > mean
> > > > > > > prepare the data in the topic).
> > > > > > >
> > > > > > > Q: On the other hand, just to try, I have a KTable that read
> > > messages
> > > > > in
> > > > > > > "resultTopic" and prints them. If the stream is a KTable I am
> > > > wondering
> > > > > > why
> > > > > > > is getting all the values from the topic even those with the
> same
> > > > key?
> > > > > > >
> > > > > > > Thanks in advance! Great job answering community!
> > > > > > >
> > > > > > > 2016-04-14 20:00 GMT+02:00 Guozhang Wang :
> > > > > > >
> > > > > > > > Hi Guillermo,
> > > > > > > >
> > > > > > > > 1) Yes in your case, the streams are really a "changelog"
> > stream,
> > > > > hence
> > > > > > > you
> > > > > > > > should create the stream as KTable, and do KTable-KTable
> join.
> > > > > > > >
> > > > > > > > 2) Could elaborate about "achieving this"? What behavior do
> > > require
> > > > > in
> > > > > > > the
> > > > > > > > application logic?
> > > > > > > >
> > > > > > > >
> > > > > > > > Guozhang
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Apr 14, 2016 at 1:30 AM, Guillermo Lammers Corral <
> > > > > > > > guillermo.lammers.cor...@tecsisa.com> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I am a newbie to Kafka Streams and I am using it trying to
> > > solve
> > > > a
> > > > > > > > > particular use case. Let me explain.
> > > > > > > > >
> > > > > > > > > I have two sources of data both like that:
> > > > > > > > >
> > > > > > > > > Key (string)
> > > > > > > > > DateTime (hourly granularity)
> > > > > > > > > Value
> > > > > > > > >
> > > > > > > > > I need to join the two sources by key and date (hour of
> day)
> > to
> > > > > > obtain:
> > > > > > > > >
> > > > > > > > > Key (string)
> > > > > > > > > DateTime (hourly granularity)
> > > > > > > > > ValueSource1
> > > > > > > > > ValueSource2
> > > > > > > > >
> > > > > > > > > I think that first I'd need to push the messages in Kafka
> > > topics
> > > > > with
> > > > > > > the
> > > > > > > > > date as part of the key because I'll group by key taking
> into
> > > > > account
> > > > > > > the
> > > > > > > > > date. So maybe the key must be a new string like
> > key_timestamp.
> > > > > But,
> > > > > > of
> > > > > > > > > course, it is not the main problem, is just an additional
> > > > > > explanation.
> > > > > > > > >
> > > > > > > > > Ok, so data are in topics, here we go!
> > > > > > > > >
> > > > > > > > > - Multiple records allows per key but only the latest value
> > > for a
> > > > > > > record
> > > > > > > > > key will be considered. I should use two KTable with some
> > join
> > > > > > > strategy,
> > > > > > > > > right?
> > > > > > > > >
> > > > > > > > > - Data of both sources could arrive at any time. What can I
> > do
> > > to
> > > > > > > achieve
> > > > > > > > > this?
> > > > > > > > >
> > > > > > > > > Thanks in advance.
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > -- Guozhang
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
>
> --
> -- Guozhang
>


Re: Kafka Streams: finding a solution to a particular use case

2016-04-19 Thread Guozhang Wang
> > should be no problem to read from them and execute a new stream
> > process,
> > > > right? (like a new joins, counts...).
> > > >
> > > > Thanks!!
> > > >
> > > > 2016-04-15 17:37 GMT+02:00 Guozhang Wang :
> > > >
> > > > > 1) There are three types of joins for KTable-KTable join, the
> follow
> > > the
> > > > > same semantics in SQL joins:
> > > > >
> > > > > KTable.join(KTable): when there is no matching record from inner
> > table
> > > > when
> > > > > received a new record from outer table, no output; and vice versa.
> > > > > KTable.leftjoin(KTable): when there is no matching record from
> inner
> > > > table
> > > > > when received a new record from outer table, output (a, null); on
> the
> > > > other
> > > > > direction no output.
> > > > > KTable.outerjoin(KTable): when there is no matching record from
> > inner /
> > > > > outer table when received a new record from outer / inner table,
> > output
> > > > (a,
> > > > > null) or (null, b).
> > > > >
> > > > >
> > > > > 2) The result topic is also a changelog topic, although it will be
> > log
> > > > > compacted on the key over time, if you consume immediately the log
> > may
> > > > not
> > > > > be yet compacted.
> > > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > > On Fri, Apr 15, 2016 at 2:11 AM, Guillermo Lammers Corral <
> > > > > guillermo.lammers.cor...@tecsisa.com> wrote:
> > > > >
> > > > > > Hi Guozhang,
> > > > > >
> > > > > > Thank you very much for your reply and sorry for the generic
> > > question,
> > > > > I'll
> > > > > > try to explain with some pseudocode.
> > > > > >
> > > > > > I have two KTable with a join:
> > > > > >
> > > > > > ktable1: KTable[String, String] = builder.table("topic1")
> > > > > > ktable2: KTable[String, String] = builder.table("topic2")
> > > > > >
> > > > > > result: KTable[String, ResultUnion] =
> > > > > > ktable1.join(ktable2, (data1, data2) => new ResultUnion(data1,
> > > data2))
> > > > > >
> > > > > > I send the result to a topic result.to("resultTopic").
> > > > > >
> > > > > > My questions are related with the following scenario:
> > > > > >
> > > > > > - The streming is up & running without data in topics
> > > > > >
> > > > > > - I send data to "topic2", for example a key/value like that
> > > > > ("uniqueKey1",
> > > > > > "hello")
> > > > > >
> > > > > > - I see null values in topic "resultTopic", i.e. ("uniqueKey1",
> > null)
> > > > > >
> > > > > > - If I send data to "topic1", for example a key/value like that
> > > > > > ("uniqueKey1", "world") then I see this values in topic
> > > "resultTopic",
> > > > > > ("uniqueKey1", ResultUnion("hello", "world"))
> > > > > >
> > > > > > Q: If we send data for one of the KTable that does not have the
> > > > > > corresponding data by key in the other one, obtain null values in
> > the
> > > > > > result final topic is the expected behavior?
> > > > > >
> > > > > > My next step would be use Kafka Connect to persist result data in
> > C*
> > > (I
> > > > > > have not read yet the Connector docs...), is this the way to do
> it?
> > > (I
> > > > > mean
> > > > > > prepare the data in the topic).
> > > > > >
> > > > > > Q: On the other hand, just to try, I have a KTable that read
> > messages
> > > > in
> > > > > > "resultTopic" and prints them. If the stream is a KTable I am
> > > wondering
> > > > > why
> > > > > > is getting all the values from the topic even those with the same
> > > key?
>

Re: Kafka Streams: finding a solution to a particular use case

2016-04-19 Thread Henry Cai
log
> may
> > > not
> > > > be yet compacted.
> > > >
> > > >
> > > > Guozhang
> > > >
> > > > On Fri, Apr 15, 2016 at 2:11 AM, Guillermo Lammers Corral <
> > > > guillermo.lammers.cor...@tecsisa.com> wrote:
> > > >
> > > > > Hi Guozhang,
> > > > >
> > > > > Thank you very much for your reply and sorry for the generic
> > question,
> > > > I'll
> > > > > try to explain with some pseudocode.
> > > > >
> > > > > I have two KTable with a join:
> > > > >
> > > > > ktable1: KTable[String, String] = builder.table("topic1")
> > > > > ktable2: KTable[String, String] = builder.table("topic2")
> > > > >
> > > > > result: KTable[String, ResultUnion] =
> > > > > ktable1.join(ktable2, (data1, data2) => new ResultUnion(data1,
> > data2))
> > > > >
> > > > > I send the result to a topic result.to("resultTopic").
> > > > >
> > > > > My questions are related with the following scenario:
> > > > >
> > > > > - The streming is up & running without data in topics
> > > > >
> > > > > - I send data to "topic2", for example a key/value like that
> > > > ("uniqueKey1",
> > > > > "hello")
> > > > >
> > > > > - I see null values in topic "resultTopic", i.e. ("uniqueKey1",
> null)
> > > > >
> > > > > - If I send data to "topic1", for example a key/value like that
> > > > > ("uniqueKey1", "world") then I see this values in topic
> > "resultTopic",
> > > > > ("uniqueKey1", ResultUnion("hello", "world"))
> > > > >
> > > > > Q: If we send data for one of the KTable that does not have the
> > > > > corresponding data by key in the other one, obtain null values in
> the
> > > > > result final topic is the expected behavior?
> > > > >
> > > > > My next step would be use Kafka Connect to persist result data in
> C*
> > (I
> > > > > have not read yet the Connector docs...), is this the way to do it?
> > (I
> > > > mean
> > > > > prepare the data in the topic).
> > > > >
> > > > > Q: On the other hand, just to try, I have a KTable that read
> messages
> > > in
> > > > > "resultTopic" and prints them. If the stream is a KTable I am
> > wondering
> > > > why
> > > > > is getting all the values from the topic even those with the same
> > key?
> > > > >
> > > > > Thanks in advance! Great job answering community!
> > > > >
> > > > > 2016-04-14 20:00 GMT+02:00 Guozhang Wang :
> > > > >
> > > > > > Hi Guillermo,
> > > > > >
> > > > > > 1) Yes in your case, the streams are really a "changelog" stream,
> > > hence
> > > > > you
> > > > > > should create the stream as KTable, and do KTable-KTable join.
> > > > > >
> > > > > > 2) Could elaborate about "achieving this"? What behavior do
> require
> > > in
> > > > > the
> > > > > > application logic?
> > > > > >
> > > > > >
> > > > > > Guozhang
> > > > > >
> > > > > >
> > > > > > On Thu, Apr 14, 2016 at 1:30 AM, Guillermo Lammers Corral <
> > > > > > guillermo.lammers.cor...@tecsisa.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am a newbie to Kafka Streams and I am using it trying to
> solve
> > a
> > > > > > > particular use case. Let me explain.
> > > > > > >
> > > > > > > I have two sources of data both like that:
> > > > > > >
> > > > > > > Key (string)
> > > > > > > DateTime (hourly granularity)
> > > > > > > Value
> > > > > > >
> > > > > > > I need to join the two sources by key and date (hour of day) to
> > > > obtain:
> > > > > > >
> > > > > > > Key (string)
> > > > > > > DateTime (hourly granularity)
> > > > > > > ValueSource1
> > > > > > > ValueSource2
> > > > > > >
> > > > > > > I think that first I'd need to push the messages in Kafka
> topics
> > > with
> > > > > the
> > > > > > > date as part of the key because I'll group by key taking into
> > > account
> > > > > the
> > > > > > > date. So maybe the key must be a new string like key_timestamp.
> > > But,
> > > > of
> > > > > > > course, it is not the main problem, is just an additional
> > > > explanation.
> > > > > > >
> > > > > > > Ok, so data are in topics, here we go!
> > > > > > >
> > > > > > > - Multiple records allows per key but only the latest value
> for a
> > > > > record
> > > > > > > key will be considered. I should use two KTable with some join
> > > > > strategy,
> > > > > > > right?
> > > > > > >
> > > > > > > - Data of both sources could arrive at any time. What can I do
> to
> > > > > achieve
> > > > > > > this?
> > > > > > >
> > > > > > > Thanks in advance.
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
>
>
>
> --
> -- Guozhang
>


Re: Kafka Streams: finding a solution to a particular use case

2016-04-19 Thread Guozhang Wang
:
> > > >
> > > > - The streming is up & running without data in topics
> > > >
> > > > - I send data to "topic2", for example a key/value like that
> > > ("uniqueKey1",
> > > > "hello")
> > > >
> > > > - I see null values in topic "resultTopic", i.e. ("uniqueKey1", null)
> > > >
> > > > - If I send data to "topic1", for example a key/value like that
> > > > ("uniqueKey1", "world") then I see this values in topic
> "resultTopic",
> > > > ("uniqueKey1", ResultUnion("hello", "world"))
> > > >
> > > > Q: If we send data for one of the KTable that does not have the
> > > > corresponding data by key in the other one, obtain null values in the
> > > > result final topic is the expected behavior?
> > > >
> > > > My next step would be use Kafka Connect to persist result data in C*
> (I
> > > > have not read yet the Connector docs...), is this the way to do it?
> (I
> > > mean
> > > > prepare the data in the topic).
> > > >
> > > > Q: On the other hand, just to try, I have a KTable that read messages
> > in
> > > > "resultTopic" and prints them. If the stream is a KTable I am
> wondering
> > > why
> > > > is getting all the values from the topic even those with the same
> key?
> > > >
> > > > Thanks in advance! Great job answering community!
> > > >
> > > > 2016-04-14 20:00 GMT+02:00 Guozhang Wang :
> > > >
> > > > > Hi Guillermo,
> > > > >
> > > > > 1) Yes in your case, the streams are really a "changelog" stream,
> > hence
> > > > you
> > > > > should create the stream as KTable, and do KTable-KTable join.
> > > > >
> > > > > 2) Could elaborate about "achieving this"? What behavior do require
> > in
> > > > the
> > > > > application logic?
> > > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > >
> > > > > On Thu, Apr 14, 2016 at 1:30 AM, Guillermo Lammers Corral <
> > > > > guillermo.lammers.cor...@tecsisa.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am a newbie to Kafka Streams and I am using it trying to solve
> a
> > > > > > particular use case. Let me explain.
> > > > > >
> > > > > > I have two sources of data both like that:
> > > > > >
> > > > > > Key (string)
> > > > > > DateTime (hourly granularity)
> > > > > > Value
> > > > > >
> > > > > > I need to join the two sources by key and date (hour of day) to
> > > obtain:
> > > > > >
> > > > > > Key (string)
> > > > > > DateTime (hourly granularity)
> > > > > > ValueSource1
> > > > > > ValueSource2
> > > > > >
> > > > > > I think that first I'd need to push the messages in Kafka topics
> > with
> > > > the
> > > > > > date as part of the key because I'll group by key taking into
> > account
> > > > the
> > > > > > date. So maybe the key must be a new string like key_timestamp.
> > But,
> > > of
> > > > > > course, it is not the main problem, is just an additional
> > > explanation.
> > > > > >
> > > > > > Ok, so data are in topics, here we go!
> > > > > >
> > > > > > - Multiple records allows per key but only the latest value for a
> > > > record
> > > > > > key will be considered. I should use two KTable with some join
> > > > strategy,
> > > > > > right?
> > > > > >
> > > > > > - Data of both sources could arrive at any time. What can I do to
> > > > achieve
> > > > > > this?
> > > > > >
> > > > > > Thanks in advance.
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>



-- 
-- Guozhang


Re: Kafka Streams: finding a solution to a particular use case

2016-04-19 Thread Henry Cai
Related to the log compaction question ("it will be log compacted on the
key over time"): how do we control the time for log compaction? And for the
log compaction implementation, is the storage used to map a new value for a
given key kept in memory or on disk?
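
For what it's worth, a hedged sketch of the topic-level settings that influence
when compaction can run, with illustrative values; compaction only touches
rolled (inactive) segments, so segment.ms matters as much as
min.cleanable.dirty.ratio. As far as I know, the cleaner builds its
key-to-offset dedupe map in memory (bounded by the broker's
log.cleaner.dedupe.buffer.size) while the cleaned segments themselves are
rewritten on disk:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, String> configs = new HashMap<>();
            configs.put(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT);
            // compaction only runs on rolled (inactive) segments, so a smaller
            // segment.ms lets the cleaner reach recent data sooner
            configs.put(TopicConfig.SEGMENT_MS_CONFIG, "600000");
            // start cleaning once at least 10% of the log is "dirty"
            configs.put(TopicConfig.MIN_CLEANABLE_DIRTY_RATIO_CONFIG, "0.1");
            // how long tombstones (null values) are retained after compaction
            configs.put(TopicConfig.DELETE_RETENTION_MS_CONFIG, "3600000");

            NewTopic topic = new NewTopic("resultTopic", 3, (short) 1).configs(configs);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}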

On Tue, Apr 19, 2016 at 8:58 AM, Guillermo Lammers Corral <
guillermo.lammers.cor...@tecsisa.com> wrote:

> Hello,
>
> Thanks again for your reply :)
>
> 1) In my example when I send a record from outer table and there is no
> matching record from inner table I receive data to the output topic and
> vice versa. I am trying it with the topics empties at the first execution.
> How is possible?
>
> Why KTable joins does not support windowing strategies? I think that for
> this use cases I need it, what do you think?
>
> 2) What does it means? Although the log may not be yet compacted, there
> should be no problem to read from them and execute a new stream process,
> right? (like a new joins, counts...).
>
> Thanks!!
>
> 2016-04-15 17:37 GMT+02:00 Guozhang Wang :
>
> > 1) There are three types of joins for KTable-KTable join, the follow the
> > same semantics in SQL joins:
> >
> > KTable.join(KTable): when there is no matching record from inner table
> when
> > received a new record from outer table, no output; and vice versa.
> > KTable.leftjoin(KTable): when there is no matching record from inner
> table
> > when received a new record from outer table, output (a, null); on the
> other
> > direction no output.
> > KTable.outerjoin(KTable): when there is no matching record from inner /
> > outer table when received a new record from outer / inner table, output
> (a,
> > null) or (null, b).
> >
> >
> > 2) The result topic is also a changelog topic, although it will be log
> > compacted on the key over time, if you consume immediately the log may
> not
> > be yet compacted.
> >
> >
> > Guozhang
> >
> > On Fri, Apr 15, 2016 at 2:11 AM, Guillermo Lammers Corral <
> > guillermo.lammers.cor...@tecsisa.com> wrote:
> >
> > > Hi Guozhang,
> > >
> > > Thank you very much for your reply and sorry for the generic question,
> > I'll
> > > try to explain with some pseudocode.
> > >
> > > I have two KTable with a join:
> > >
> > > ktable1: KTable[String, String] = builder.table("topic1")
> > > ktable2: KTable[String, String] = builder.table("topic2")
> > >
> > > result: KTable[String, ResultUnion] =
> > > ktable1.join(ktable2, (data1, data2) => new ResultUnion(data1, data2))
> > >
> > > I send the result to a topic result.to("resultTopic").
> > >
> > > My questions are related with the following scenario:
> > >
> > > - The streming is up & running without data in topics
> > >
> > > - I send data to "topic2", for example a key/value like that
> > ("uniqueKey1",
> > > "hello")
> > >
> > > - I see null values in topic "resultTopic", i.e. ("uniqueKey1", null)
> > >
> > > - If I send data to "topic1", for example a key/value like that
> > > ("uniqueKey1", "world") then I see this values in topic "resultTopic",
> > > ("uniqueKey1", ResultUnion("hello", "world"))
> > >
> > > Q: If we send data for one of the KTable that does not have the
> > > corresponding data by key in the other one, obtain null values in the
> > > result final topic is the expected behavior?
> > >
> > > My next step would be use Kafka Connect to persist result data in C* (I
> > > have not read yet the Connector docs...), is this the way to do it? (I
> > mean
> > > prepare the data in the topic).
> > >
> > > Q: On the other hand, just to try, I have a KTable that read messages
> in
> > > "resultTopic" and prints them. If the stream is a KTable I am wondering
> > why
> > > is getting all the values from the topic even those with the same key?
> > >
> > > Thanks in advance! Great job answering community!
> > >
> > > 2016-04-14 20:00 GMT+02:00 Guozhang Wang :
> > >
> > > > Hi Guillermo,
> > > >
> > > > 1) Yes in your case, the streams are really a "changelog" stream,
> hence
> > > you
> > > > should create the stream as KTable, and do KTable-KTable join.
> > > >
> > > > 2) Could elaborate about

Re: Kafka Streams: finding a solution to a particular use case

2016-04-19 Thread Guillermo Lammers Corral
Hello,

Thanks again for your reply :)

1) In my example, when I send a record from the outer table and there is no
matching record from the inner table, I still receive data in the output
topic, and vice versa. I am trying it with the topics empty on the first
execution. How is that possible?

Why don't KTable joins support windowing strategies? I think that for this
use case I need them, what do you think?

2) What does that mean? Although the log may not be compacted yet, there
should be no problem reading from it and executing a new stream process,
right? (like new joins, counts...).

Thanks!!
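
To illustrate the windowing point: in the Streams DSL it is the KStream-KStream
join that takes a window, not the KTable-KTable join, so if a time window is
really needed one option looks roughly like the sketch below (topic names are
illustrative and the JoinWindows factory method shown is from a recent client
version):

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

public class WindowedJoinSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source1 = builder.stream("topic1");
        KStream<String, String> source2 = builder.stream("topic2");

        // join records that share a key and arrive within one hour of each other
        KStream<String, String> joined = source1.join(
            source2,
            (v1, v2) -> v1 + "|" + v2,
            JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofHours(1)));

        joined.to("resultTopic");
        // builder.build() would then be passed to KafkaStreams as usual
    }
}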

2016-04-15 17:37 GMT+02:00 Guozhang Wang :

> 1) There are three types of joins for KTable-KTable join, the follow the
> same semantics in SQL joins:
>
> KTable.join(KTable): when there is no matching record from inner table when
> received a new record from outer table, no output; and vice versa.
> KTable.leftjoin(KTable): when there is no matching record from inner table
> when received a new record from outer table, output (a, null); on the other
> direction no output.
> KTable.outerjoin(KTable): when there is no matching record from inner /
> outer table when received a new record from outer / inner table, output (a,
> null) or (null, b).
>
>
> 2) The result topic is also a changelog topic, although it will be log
> compacted on the key over time, if you consume immediately the log may not
> be yet compacted.
>
>
> Guozhang
>
> On Fri, Apr 15, 2016 at 2:11 AM, Guillermo Lammers Corral <
> guillermo.lammers.cor...@tecsisa.com> wrote:
>
> > Hi Guozhang,
> >
> > Thank you very much for your reply and sorry for the generic question,
> I'll
> > try to explain with some pseudocode.
> >
> > I have two KTable with a join:
> >
> > ktable1: KTable[String, String] = builder.table("topic1")
> > ktable2: KTable[String, String] = builder.table("topic2")
> >
> > result: KTable[String, ResultUnion] =
> > ktable1.join(ktable2, (data1, data2) => new ResultUnion(data1, data2))
> >
> > I send the result to a topic result.to("resultTopic").
> >
> > My questions are related with the following scenario:
> >
> > - The streming is up & running without data in topics
> >
> > - I send data to "topic2", for example a key/value like that
> ("uniqueKey1",
> > "hello")
> >
> > - I see null values in topic "resultTopic", i.e. ("uniqueKey1", null)
> >
> > - If I send data to "topic1", for example a key/value like that
> > ("uniqueKey1", "world") then I see this values in topic "resultTopic",
> > ("uniqueKey1", ResultUnion("hello", "world"))
> >
> > Q: If we send data for one of the KTable that does not have the
> > corresponding data by key in the other one, obtain null values in the
> > result final topic is the expected behavior?
> >
> > My next step would be use Kafka Connect to persist result data in C* (I
> > have not read yet the Connector docs...), is this the way to do it? (I
> mean
> > prepare the data in the topic).
> >
> > Q: On the other hand, just to try, I have a KTable that read messages in
> > "resultTopic" and prints them. If the stream is a KTable I am wondering
> why
> > is getting all the values from the topic even those with the same key?
> >
> > Thanks in advance! Great job answering community!
> >
> > 2016-04-14 20:00 GMT+02:00 Guozhang Wang :
> >
> > > Hi Guillermo,
> > >
> > > 1) Yes in your case, the streams are really a "changelog" stream, hence
> > you
> > > should create the stream as KTable, and do KTable-KTable join.
> > >
> > > 2) Could elaborate about "achieving this"? What behavior do require in
> > the
> > > application logic?
> > >
> > >
> > > Guozhang
> > >
> > >
> > > On Thu, Apr 14, 2016 at 1:30 AM, Guillermo Lammers Corral <
> > > guillermo.lammers.cor...@tecsisa.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am a newbie to Kafka Streams and I am using it trying to solve a
> > > > particular use case. Let me explain.
> > > >
> > > > I have two sources of data both like that:
> > > >
> > > > Key (string)
> > > > DateTime (hourly granularity)
> > > > Value
> > > >
> > > > I need to join the two sources by key and date (hour of day) to
> obtain

Re: Kafka Streams: finding a solution to a particular use case

2016-04-15 Thread Guozhang Wang
1) There are three types of KTable-KTable join; they follow the same
semantics as SQL joins:

KTable.join(KTable): when a new record arrives from the outer table and
there is no matching record in the inner table, there is no output; and
vice versa.
KTable.leftJoin(KTable): when a new record arrives from the outer table and
there is no matching record in the inner table, the output is (a, null); in
the other direction there is no output.
KTable.outerJoin(KTable): when a new record arrives from the outer / inner
table and there is no matching record in the inner / outer table, the
output is (a, null) or (null, b).


2) The result topic is also a changelog topic; although it will be log
compacted on the key over time, if you consume it immediately the log may
not yet be compacted.


Guozhang
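
To make the three variants above concrete, a hedged sketch in the Java DSL; the
topic names mirror the pseudocode in this thread, ResultUnion is a stand-in
record (Java 16+), and the result is stringified only so the default String
serde can be reused on the output topic:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

public class KTableJoinSketch {
    // stand-in for the ResultUnion(data1, data2) value used in the thread
    record ResultUnion(String value1, String value2) {}

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KTable<String, String> table1 = builder.table("topic1");
        KTable<String, String> table2 = builder.table("topic2");

        // inner join: output only when both sides have a value for the key
        KTable<String, ResultUnion> inner = table1.join(table2, ResultUnion::new);
        // left join: (a, null) when the right side has no value yet
        KTable<String, ResultUnion> left = table1.leftJoin(table2, ResultUnion::new);
        // outer join: (a, null) or (null, b) when either side is missing
        KTable<String, ResultUnion> outer = table1.outerJoin(table2, ResultUnion::new);

        // stringify so the default String serde can be used on the output topic
        inner.toStream().mapValues(ResultUnion::toString).to("resultTopic");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ktable-join-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}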

On Fri, Apr 15, 2016 at 2:11 AM, Guillermo Lammers Corral <
guillermo.lammers.cor...@tecsisa.com> wrote:

> Hi Guozhang,
>
> Thank you very much for your reply and sorry for the generic question, I'll
> try to explain with some pseudocode.
>
> I have two KTable with a join:
>
> ktable1: KTable[String, String] = builder.table("topic1")
> ktable2: KTable[String, String] = builder.table("topic2")
>
> result: KTable[String, ResultUnion] =
> ktable1.join(ktable2, (data1, data2) => new ResultUnion(data1, data2))
>
> I send the result to a topic result.to("resultTopic").
>
> My questions are related with the following scenario:
>
> - The streming is up & running without data in topics
>
> - I send data to "topic2", for example a key/value like that ("uniqueKey1",
> "hello")
>
> - I see null values in topic "resultTopic", i.e. ("uniqueKey1", null)
>
> - If I send data to "topic1", for example a key/value like that
> ("uniqueKey1", "world") then I see this values in topic "resultTopic",
> ("uniqueKey1", ResultUnion("hello", "world"))
>
> Q: If we send data for one of the KTable that does not have the
> corresponding data by key in the other one, obtain null values in the
> result final topic is the expected behavior?
>
> My next step would be use Kafka Connect to persist result data in C* (I
> have not read yet the Connector docs...), is this the way to do it? (I mean
> prepare the data in the topic).
>
> Q: On the other hand, just to try, I have a KTable that read messages in
> "resultTopic" and prints them. If the stream is a KTable I am wondering why
> is getting all the values from the topic even those with the same key?
>
> Thanks in advance! Great job answering community!
>
> 2016-04-14 20:00 GMT+02:00 Guozhang Wang :
>
> > Hi Guillermo,
> >
> > 1) Yes in your case, the streams are really a "changelog" stream, hence
> you
> > should create the stream as KTable, and do KTable-KTable join.
> >
> > 2) Could elaborate about "achieving this"? What behavior do require in
> the
> > application logic?
> >
> >
> > Guozhang
> >
> >
> > On Thu, Apr 14, 2016 at 1:30 AM, Guillermo Lammers Corral <
> > guillermo.lammers.cor...@tecsisa.com> wrote:
> >
> > > Hi,
> > >
> > > I am a newbie to Kafka Streams and I am using it trying to solve a
> > > particular use case. Let me explain.
> > >
> > > I have two sources of data both like that:
> > >
> > > Key (string)
> > > DateTime (hourly granularity)
> > > Value
> > >
> > > I need to join the two sources by key and date (hour of day) to obtain:
> > >
> > > Key (string)
> > > DateTime (hourly granularity)
> > > ValueSource1
> > > ValueSource2
> > >
> > > I think that first I'd need to push the messages in Kafka topics with
> the
> > > date as part of the key because I'll group by key taking into account
> the
> > > date. So maybe the key must be a new string like key_timestamp. But, of
> > > course, it is not the main problem, is just an additional explanation.
> > >
> > > Ok, so data are in topics, here we go!
> > >
> > > - Multiple records allows per key but only the latest value for a
> record
> > > key will be considered. I should use two KTable with some join
> strategy,
> > > right?
> > >
> > > - Data of both sources could arrive at any time. What can I do to
> achieve
> > > this?
> > >
> > > Thanks in advance.
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
-- Guozhang


Re: Kafka Streams: finding a solution to a particular use case

2016-04-15 Thread Guillermo Lammers Corral
Hi Guozhang,

Thank you very much for your reply, and sorry for the generic question. I'll
try to explain with some pseudocode.

I have two KTables with a join:

ktable1: KTable[String, String] = builder.table("topic1")
ktable2: KTable[String, String] = builder.table("topic2")

result: KTable[String, ResultUnion] =
ktable1.join(ktable2, (data1, data2) => new ResultUnion(data1, data2))

I send the result to a topic with result.to("resultTopic").

My questions are related to the following scenario:

- The streaming application is up & running without data in the topics

- I send data to "topic2", for example a key/value pair like ("uniqueKey1",
"hello")

- I see null values in topic "resultTopic", i.e. ("uniqueKey1", null)

- If I then send data to "topic1", for example a key/value pair like
("uniqueKey1", "world"), I see this value in topic "resultTopic":
("uniqueKey1", ResultUnion("hello", "world"))

Q: If we send data for one of the KTables and there is no corresponding
data for that key in the other one, is getting null values in the final
result topic the expected behavior?

My next step would be to use Kafka Connect to persist the result data in C*
(I have not read the Connector docs yet...). Is this the way to do it? (I
mean, preparing the data in the topic.)

Q: On the other hand, just to try it out, I have a KTable that reads
messages from "resultTopic" and prints them. If the stream is a KTable, I am
wondering why it is getting all the values from the topic, even those with
the same key?

Thanks in advance! Great job answering the community!

2016-04-14 20:00 GMT+02:00 Guozhang Wang :

> Hi Guillermo,
>
> 1) Yes in your case, the streams are really a "changelog" stream, hence you
> should create the stream as KTable, and do KTable-KTable join.
>
> 2) Could elaborate about "achieving this"? What behavior do require in the
> application logic?
>
>
> Guozhang
>
>
> On Thu, Apr 14, 2016 at 1:30 AM, Guillermo Lammers Corral <
> guillermo.lammers.cor...@tecsisa.com> wrote:
>
> > Hi,
> >
> > I am a newbie to Kafka Streams and I am using it trying to solve a
> > particular use case. Let me explain.
> >
> > I have two sources of data both like that:
> >
> > Key (string)
> > DateTime (hourly granularity)
> > Value
> >
> > I need to join the two sources by key and date (hour of day) to obtain:
> >
> > Key (string)
> > DateTime (hourly granularity)
> > ValueSource1
> > ValueSource2
> >
> > I think that first I'd need to push the messages in Kafka topics with the
> > date as part of the key because I'll group by key taking into account the
> > date. So maybe the key must be a new string like key_timestamp. But, of
> > course, it is not the main problem, is just an additional explanation.
> >
> > Ok, so data are in topics, here we go!
> >
> > - Multiple records allows per key but only the latest value for a record
> > key will be considered. I should use two KTable with some join strategy,
> > right?
> >
> > - Data of both sources could arrive at any time. What can I do to achieve
> > this?
> >
> > Thanks in advance.
> >
>
>
>
> --
> -- Guozhang
>


Re: Kafka Streams: finding a solution to a particular use case

2016-04-14 Thread Guozhang Wang
Hi Guillermo,

1) Yes, in your case the streams are really "changelog" streams, hence you
should create them as KTables and do a KTable-KTable join.

2) Could you elaborate on "achieving this"? What behavior do you require in
the application logic?


Guozhang


On Thu, Apr 14, 2016 at 1:30 AM, Guillermo Lammers Corral <
guillermo.lammers.cor...@tecsisa.com> wrote:

> Hi,
>
> I am a newbie to Kafka Streams and I am using it trying to solve a
> particular use case. Let me explain.
>
> I have two sources of data both like that:
>
> Key (string)
> DateTime (hourly granularity)
> Value
>
> I need to join the two sources by key and date (hour of day) to obtain:
>
> Key (string)
> DateTime (hourly granularity)
> ValueSource1
> ValueSource2
>
> I think that first I'd need to push the messages in Kafka topics with the
> date as part of the key because I'll group by key taking into account the
> date. So maybe the key must be a new string like key_timestamp. But, of
> course, it is not the main problem, is just an additional explanation.
>
> Ok, so data are in topics, here we go!
>
> - Multiple records allows per key but only the latest value for a record
> key will be considered. I should use two KTable with some join strategy,
> right?
>
> - Data of both sources could arrive at any time. What can I do to achieve
> this?
>
> Thanks in advance.
>



-- 
-- Guozhang
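
Finally, on the idea (quoted above) of making the date part of the key: a rough
sketch of re-keying a stream as "<key>_<hour>" before building the tables; the
hourOf helper is hypothetical and would have to parse whatever timestamp the
record value actually carries, and the topic names are illustrative:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class RekeyByHourSketch {
    // hypothetical helper: extract an hour bucket such as "2016041413" from the
    // DateTime carried inside the record value
    static String hourOf(String value) {
        return value.substring(0, 10); // placeholder parsing, adjust to the real format
    }

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source1 = builder.stream("source1");

        // make the hour part of the key so the later KTable-KTable join matches
        // on key + hour of day
        source1.selectKey((key, value) -> key + "_" + hourOf(value))
               .to("source1-by-key-and-hour");

        KTable<String, String> table1 = builder.table("source1-by-key-and-hour");
        // ... same for source2, then table1.join(table2, ...) as discussed above
    }
}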

