Kafka use case for my setup
Hi there,

I would like to automate some of my tasks using Apache Kafka. Previously I did the same using Apache Airflow, and that worked fine, but I want to explore whether Kafka works better than Airflow for this.

1) Kafka runs on Server A.
2) Kafka polls Server B every 10 or 20 minutes for a file named test.xml, checking whether it has been created.
3) Once Kafka detects that the file has been created, the job starts as follows:
   a) Create a Jira ticket and log each event's execution on Jira.
   b) Trigger an rsync command.
   c) Unarchive the files using tar.
   d) Run a script against the unarchived files.
   e) Re-archive the files and rsync them to a different location.
   f) Send an email once all tasks are finished.

Please advise whether this is a sensible thing to begin with in Kafka. If you know of any other open source products that can perform these actions, please let me know. By the way, I prefer to set this up as a docker-compose based installation.

Thanks,
Jay
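For reference, steps 1-3 above describe a scheduled file-watching workflow rather than a stream of events, which is the kind of job a scheduler (Airflow, cron) models directly. A minimal Python sketch of the poll-then-run flow described above; the step names and `POLL_INTERVAL` are hypothetical placeholders, not part of any Kafka or Airflow API:

```python
import time
from pathlib import Path

POLL_INTERVAL = 10 * 60  # seconds; "every 10 or 20 mins" per step 2

# Hypothetical step names for a) through f) above.
PIPELINE = [
    "create_jira_ticket",   # a) create ticket, log executions
    "rsync_from_server_b",  # b) trigger rsync
    "untar_files",          # c) unarchive with tar
    "run_processing",       # d) run scripts on the unarchived files
    "archive_and_rsync",    # e) re-archive and rsync elsewhere
    "send_email",           # f) notify once everything finished
]

def pending_steps(completed):
    """Steps still left to run, in pipeline order."""
    return [step for step in PIPELINE if step not in completed]

def watch(path, poll=lambda p: Path(p).exists(), sleep=time.sleep):
    """Block until the watched file (e.g. test.xml) appears."""
    while not poll(path):
        sleep(POLL_INTERVAL)
```

Airflow's FileSensor already does what `watch()` does; Kafka itself has no component that polls a remote filesystem, so a broker adds nothing to this step.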
Re: Use case: Per tenant deployments talking to multi tenant kafka cluster
Hi Christian,

> For multiple partitions is it the correct behaviour to simply assign to partition number:offset or do I have to provide offsets for the other partitions too?

I'm not sure I get your question here. If you are asking whether you should commit offsets of partitions that this consumer doesn't consume, then no, you don't have to. For example: consumer A is assigned partition 0 and consumer B is assigned partition 1. So, when committing offsets, consumer A only has to commit the offset of partition 0, and consumer B just commits the offset of partition 1.

For Consumer#assign, you can check this link for more information: https://stackoverflow.com/a/53938397

> In this case do we still need a custom producer partitioner or is it enough to simply assign to the topic like described above?

No. If you want each consumer to receive all the messages of all environments, then you don't need a custom partitioner.

Thank you,
Luke

On Wed, Dec 8, 2021 at 11:05 PM Christian Schneider wrote:

> Hi Luke,
>
> thanks for the hints. This helps a lot already.
>
> We already use assign as we manage offsets on the consumer side. Currently we only have one partition and simply assign a stored offset on partition 0.
> For multiple partitions is it the correct behaviour to simply assign to partition number:offset or do I have to provide offsets for the other partitions too? I only want to listen to one partition.
> You mentioned custom producer partitioner. We currently use a random consumer group name for each consumer as we want each consumer to receive all messages of the environment. In this case do we still need a custom producer partitioner or is it enough to simply assign to the topic like described above?
>
> Christian
>
> Am Mi., 8. Dez. 2021 um 11:19 Uhr schrieb Luke Chen :
>
> > Hi Christian,
> > Answering your question below:
> >
> > > Let's assume we just have one topic with 10 partitions for simplicity. We can now use the environment id as a key for the messages to make sure the messages of each environment arrive in order while sharing the load on the partitions.
> >
> > > Now we want each environment to only read the minimal number of messages while consuming. Ideally we would like to only consume its own messages. Can we somehow filter to only receive messages with a certain key? Can we maybe only listen to a certain partition at least?
> >
> > Unfortunately, Kafka doesn't have the feature to filter the messages on the broker before sending to the consumer.
> > But for your 2nd question:
> > > Can we maybe only listen to a certain partition at least?
> >
> > Actually, yes. Kafka has a way to just fetch data from a certain partition of a topic. You can use the Consumer#assign API to achieve that. So, to do that, I think you also need to have a custom producer partitioner for your purpose. Let's say, in your example, you have 10 partitions and 10 environments. Your partitioner should send to the specific partition based on the environment ID, ex: env ID 1 -> partition 1, env ID 2 -> partition 2. So, in your consumer, you can just assign to the partition containing its environment ID.
> >
> > And for the idea of encrypting the messages to achieve isolation, it's interesting! I've never thought about it! :)
> >
> > Hope it helps.
> >
> > Thank you.
> > Luke
> >
> > On Wed, Dec 8, 2021 at 4:48 PM Christian Schneider <ch...@die-schneider.net> wrote:
> >
> > > We have a single tenant application that we deploy to a kubernetes cluster in many instances.
> > > Every customer has several environments of the application. Each application lives in a separate namespace and should be isolated from other applications.
> > >
> > > We plan to use kafka to communicate inside an environment (between the different pods).
> > > As setting up one kafka cluster per such environment is a lot of overhead and cost we would like to just use a single multi tenant kafka cluster.
> > >
> > > Let's assume we just have one topic with 10 partitions for simplicity. We can now use the environment id as a key for the messages to make sure the messages of each environment arrive in order while sharing the load on the partitions.
> > >
> > > Now we want each environment to only read the minimal number of messages while consuming. Ideally we would like to only consume its own messages. Can we somehow filter to only receive messages with a certain key? Can we maybe only listen to a certain partition at least?
> > >
> > > Additionally we ideally would like to have enforced isolation. So each environment can only see its own messages even if it might receive messages of other environments from the same partition.
> > > I think in the worst case we can make this happen by encrypting the messages but it would be great if we could filter on the broker side.
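Luke's two points above (assign only your own partition at a stored offset, and commit only that partition's offset) can be sketched in plain Python; the dicts below are illustrative stand-ins for the Kafka client's TopicPartition/offset types, not a real client API:

```python
def offsets_to_commit(assigned_partitions, consumed_offsets):
    """Keep only the offsets of partitions this consumer owns (Luke's point:
    consumer A commits partition 0, consumer B commits partition 1)."""
    return {p: off for p, off in consumed_offsets.items()
            if p in assigned_partitions}

def assignment(partition, stored_offset):
    """With manual assign(), only the assigned partition needs an offset;
    nothing is provided for partitions this consumer doesn't listen to."""
    return {partition: stored_offset}

# Consumer A owns partition 0 only, so only partition 0's offset is committed:
a_commit = offsets_to_commit({0}, {0: 42, 1: 7})  # -> {0: 42}
```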
Re: Use case: Per tenant deployments talking to multi tenant kafka cluster
Hi Luke,

thanks for the hints. This helps a lot already.

We already use assign as we manage offsets on the consumer side. Currently we only have one partition and simply assign a stored offset on partition 0. For multiple partitions is it the correct behaviour to simply assign to partition number:offset, or do I have to provide offsets for the other partitions too? I only want to listen to one partition.

You mentioned a custom producer partitioner. We currently use a random consumer group name for each consumer as we want each consumer to receive all messages of the environment. In this case do we still need a custom producer partitioner, or is it enough to simply assign to the topic like described above?

Christian

Am Mi., 8. Dez. 2021 um 11:19 Uhr schrieb Luke Chen :

> Hi Christian,
> Answering your question below:
>
> > Let's assume we just have one topic with 10 partitions for simplicity. We can now use the environment id as a key for the messages to make sure the messages of each environment arrive in order while sharing the load on the partitions.
>
> > Now we want each environment to only read the minimal number of messages while consuming. Ideally we would like to only consume its own messages. Can we somehow filter to only receive messages with a certain key? Can we maybe only listen to a certain partition at least?
>
> Unfortunately, Kafka doesn't have the feature to filter the messages on the broker before sending to the consumer.
> But for your 2nd question:
> > Can we maybe only listen to a certain partition at least?
>
> Actually, yes. Kafka has a way to just fetch data from a certain partition of a topic. You can use the Consumer#assign API to achieve that. So, to do that, I think you also need to have a custom producer partitioner for your purpose. Let's say, in your example, you have 10 partitions and 10 environments. Your partitioner should send to the specific partition based on the environment ID, ex: env ID 1 -> partition 1, env ID 2 -> partition 2. So, in your consumer, you can just assign to the partition containing its environment ID.
>
> And for the idea of encrypting the messages to achieve isolation, it's interesting! I've never thought about it! :)
>
> Hope it helps.
>
> Thank you.
> Luke
>
> On Wed, Dec 8, 2021 at 4:48 PM Christian Schneider <ch...@die-schneider.net> wrote:
>
> > We have a single tenant application that we deploy to a kubernetes cluster in many instances.
> > Every customer has several environments of the application. Each application lives in a separate namespace and should be isolated from other applications.
> >
> > We plan to use kafka to communicate inside an environment (between the different pods).
> > As setting up one kafka cluster per such environment is a lot of overhead and cost we would like to just use a single multi tenant kafka cluster.
> >
> > Let's assume we just have one topic with 10 partitions for simplicity. We can now use the environment id as a key for the messages to make sure the messages of each environment arrive in order while sharing the load on the partitions.
> >
> > Now we want each environment to only read the minimal number of messages while consuming. Ideally we would like to only consume its own messages. Can we somehow filter to only receive messages with a certain key? Can we maybe only listen to a certain partition at least?
> >
> > Additionally we ideally would like to have enforced isolation. So each environment can only see its own messages even if it might receive messages of other environments from the same partition.
> > I think in the worst case we can make this happen by encrypting the messages but it would be great if we could filter on the broker side.
> >
> > Christian
> >
> > --
> > Christian Schneider
> > http://www.liquid-reality.de
> >
> > Computer Scientist
> > http://www.adobe.com

--
Christian Schneider
http://www.liquid-reality.de

Computer Scientist
http://www.adobe.com
Re: Use case: Per tenant deployments talking to multi tenant kafka cluster
Hi Christian,
Answering your question below:

> Let's assume we just have one topic with 10 partitions for simplicity. We can now use the environment id as a key for the messages to make sure the messages of each environment arrive in order while sharing the load on the partitions.

> Now we want each environment to only read the minimal number of messages while consuming. Ideally we would like to only consume its own messages. Can we somehow filter to only receive messages with a certain key? Can we maybe only listen to a certain partition at least?

Unfortunately, Kafka doesn't have the feature to filter the messages on the broker before sending to the consumer.
But for your 2nd question:

> Can we maybe only listen to a certain partition at least?

Actually, yes. Kafka has a way to just fetch data from a certain partition of a topic. You can use the Consumer#assign API to achieve that. So, to do that, I think you also need to have a custom producer partitioner for your purpose. Let's say, in your example, you have 10 partitions and 10 environments. Your partitioner should send to the specific partition based on the environment ID, ex: env ID 1 -> partition 1, env ID 2 -> partition 2. So, in your consumer, you can just assign to the partition containing its environment ID.

And for the idea of encrypting the messages to achieve isolation, it's interesting! I've never thought about it! :)

Hope it helps.

Thank you.
Luke

On Wed, Dec 8, 2021 at 4:48 PM Christian Schneider wrote:

> We have a single tenant application that we deploy to a kubernetes cluster in many instances.
> Every customer has several environments of the application. Each application lives in a separate namespace and should be isolated from other applications.
>
> We plan to use kafka to communicate inside an environment (between the different pods).
> As setting up one kafka cluster per such environment is a lot of overhead and cost we would like to just use a single multi tenant kafka cluster.
>
> Let's assume we just have one topic with 10 partitions for simplicity. We can now use the environment id as a key for the messages to make sure the messages of each environment arrive in order while sharing the load on the partitions.
>
> Now we want each environment to only read the minimal number of messages while consuming. Ideally we would like to only consume its own messages. Can we somehow filter to only receive messages with a certain key? Can we maybe only listen to a certain partition at least?
>
> Additionally we ideally would like to have enforced isolation. So each environment can only see its own messages even if it might receive messages of other environments from the same partition.
> I think in the worst case we can make this happen by encrypting the messages but it would be great if we could filter on the broker side.
>
> Christian
>
> --
> Christian Schneider
> http://www.liquid-reality.de
>
> Computer Scientist
> http://www.adobe.com
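The env-ID-to-partition routing described above (env ID 1 -> partition 1, etc.) can be sketched as a pure function. This is plain Python for illustration; in a real producer this logic would live in a custom partitioner (e.g. Kafka's Java Partitioner interface), and it assumes numeric environment IDs:

```python
def partition_for(env_id: int, num_partitions: int) -> int:
    """Pin each environment to one partition, so a consumer can assign()
    itself just the partition for its own environment."""
    return env_id % num_partitions

# 10 partitions, 10 environments: each environment maps to its own
# partition, and per-environment message ordering is preserved.
routing = {env: partition_for(env, 10) for env in range(10)}
```

Note the trade-off: with more environments than partitions, several environments share a partition, and each consumer again reads (and must skip) other tenants' messages.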
Use case: Per tenant deployments talking to multi tenant kafka cluster
We have a single tenant application that we deploy to a kubernetes cluster in many instances. Every customer has several environments of the application. Each application lives in a separate namespace and should be isolated from other applications.

We plan to use kafka to communicate inside an environment (between the different pods). As setting up one kafka cluster per such environment is a lot of overhead and cost, we would like to just use a single multi tenant kafka cluster.

Let's assume we just have one topic with 10 partitions for simplicity. We can now use the environment id as a key for the messages to make sure the messages of each environment arrive in order while sharing the load on the partitions.

Now we want each environment to only read the minimal number of messages while consuming. Ideally we would like to only consume its own messages. Can we somehow filter to only receive messages with a certain key? Can we maybe only listen to a certain partition at least?

Additionally we would ideally like to have enforced isolation, so each environment can only see its own messages even if it might receive messages of other environments from the same partition. I think in the worst case we can make this happen by encrypting the messages, but it would be great if we could filter on the broker side.

Christian

--
Christian Schneider
http://www.liquid-reality.de

Computer Scientist
http://www.adobe.com
Re: Right Use Case For Kafka Streams?
just my 2 cents - the best answers always come from real-world practice :)

RocksDB (https://rocksdb.org/) is the implementation of the "state store" in Kafka Streams, and it is an "embedded" KV store (which is different from a distributed KV store). The "state store" in Kafka Streams is also backed by a "changelog" topic, where the physical KV data is stored.

A performance hit may happen if:
(1) one of the application nodes (that runs Kafka Streams - remember Kafka Streams is a library) goes down; the "state store" has to be rebuilt from the changelog topic, and if the changelog topic is huge, the rebuild time can be long.
(2) the stream topology is complex, with multiple state stores / aggregation (or "reduce") operations; the rebuild or recovery time after a failure can be long.

`num.standby.replicas` should help to significantly reduce the rebuild time, but it comes with a storage cost, since the "state store" is replicated on a different node.

On 2021/03/16 01:11:00, Gareth Collins wrote:

> Hi,
>
> We have a requirement to calculate metrics on a huge number of keys (could be hundreds of millions, perhaps billions of keys - attempting caching on individual keys in many cases will have almost a 0% cache hit rate). Is Kafka Streams with RocksDB and compacting topics the right tool for a task like that?
>
> As well, just from playing with Kafka Streams for a week it feels like it wants to create a lot of separate stores by default (if I want to calculate aggregates on five, ten and 30 days I will get three separate stores by default for this state data). Coming from a different distributed storage solution, I feel like I want to put them together in one store as I/O has always been my bottleneck (1 big read and 1 big write is better than three small separate reads and three small separate writes).
>
> But am I perhaps missing something here? I don't want to avoid the DSL that Kafka Streams provides if I don't have to. Will the Kafka Streams RocksDB solution be so much faster than a distributed read that it won't be the bottleneck even with huge amounts of data?
>
> Any info/opinions would be greatly appreciated.
>
> thanks in advance,
> Gareth Collins
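The standby-replica trade-off mentioned above maps to a single Kafka Streams setting; a minimal illustration (the value is just an example, and each standby multiplies the store's disk footprint across instances):

```properties
# Kafka Streams config: keep one warm copy of each state store on
# another instance, so failover avoids a full changelog replay.
num.standby.replicas=1
```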
Re: Right Use Case For Kafka Streams?
Hello Gareth,

A common practice for rolling up aggregations with Kafka Streams is to do the finest granularity in the processor (5 days in your case), and to do the coarse-grained roll-up at query-serving time through the interactive query API -- i.e. whenever a query is issued for a 30-day aggregate, you do a range scan on the 5-day-aggregate store and compute the roll-up on the fly.

If you'd prefer to still materialize all of the granularities, since maybe their query frequency is high enough, then just go with three stores, but as three concatenated aggregations (i.e. a stream aggregation into 5 days, the 5-day table aggregated to 10 days, and the 10-day table aggregated to 30 days).

Guozhang

On Mon, Mar 15, 2021 at 6:11 PM Gareth Collins wrote:

> Hi,
>
> We have a requirement to calculate metrics on a huge number of keys (could be hundreds of millions, perhaps billions of keys - attempting caching on individual keys in many cases will have almost a 0% cache hit rate). Is Kafka Streams with RocksDB and compacting topics the right tool for a task like that?
>
> As well, just from playing with Kafka Streams for a week it feels like it wants to create a lot of separate stores by default (if I want to calculate aggregates on five, ten and 30 days I will get three separate stores by default for this state data). Coming from a different distributed storage solution, I feel like I want to put them together in one store as I/O has always been my bottleneck (1 big read and 1 big write is better than three small separate reads and three small separate writes).
>
> But am I perhaps missing something here? I don't want to avoid the DSL that Kafka Streams provides if I don't have to. Will the Kafka Streams RocksDB solution be so much faster than a distributed read that it won't be the bottleneck even with huge amounts of data?
>
> Any info/opinions would be greatly appreciated.
>
> thanks in advance,
> Gareth Collins

--
Guozhang
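Guozhang's first suggestion (materialize only the 5-day buckets, roll up to 30 days at query time) amounts to a range-scan-and-sum. A plain-Python sketch, where the dict stands in for a RocksDB window store keyed by bucket start day (an illustrative layout, not the Streams API):

```python
def rollup(store, start_day, end_day, bucket_days=5):
    """Sum the fine-grained buckets covering [start_day, end_day)."""
    return sum(store.get(day, 0)
               for day in range(start_day, end_day, bucket_days))

# Materialized 5-day aggregates for one key over days 0-29:
five_day = {0: 10, 5: 20, 10: 5, 15: 0, 20: 7, 25: 3}
month = rollup(five_day, 0, 30)  # 45, computed on the fly at query time
```

This keeps one store (one read/write path) at the cost of summing a handful of buckets per query, which is usually cheap next to the I/O saved.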
Right Use Case For Kafka Streams?
Hi,

We have a requirement to calculate metrics on a huge number of keys (could be hundreds of millions, perhaps billions of keys - attempting caching on individual keys in many cases will have almost a 0% cache hit rate). Is Kafka Streams with RocksDB and compacting topics the right tool for a task like that?

As well, just from playing with Kafka Streams for a week, it feels like it wants to create a lot of separate stores by default (if I want to calculate aggregates on five, ten and 30 days I will get three separate stores by default for this state data). Coming from a different distributed storage solution, I feel like I want to put them together in one store as I/O has always been my bottleneck (1 big read and 1 big write is better than three small separate reads and three small separate writes).

But am I perhaps missing something here? I don't want to avoid the DSL that Kafka Streams provides if I don't have to. Will the Kafka Streams RocksDB solution be so much faster than a distributed read that it won't be the bottleneck even with huge amounts of data?

Any info/opinions would be greatly appreciated.

thanks in advance,
Gareth Collins
Re: Is this a valid use case for reading local store ?
Thanks. I was curious about the other real-world use cases, i.e. what do people use it for? Is this widely used, or mostly for debugging purposes? Any caveats?

Thanks
Mohan

On 10/1/20, 5:55 PM, "Guozhang Wang" wrote:

> Mohan,
>
> I think you can build a REST API on top of App1, directly leveraging its IQ interface. For some example code you can refer to https://github.com/confluentinc/kafka-streams-examples/tree/6.0.0-post/src/main/java/io/confluent/examples/streams/interactivequeries
>
> Guozhang
>
> On Thu, Oct 1, 2020 at 10:40 AM Parthasarathy, Mohan wrote:
>
> > Hi Guozhang,
> >
> > The async event trigger process is not running as a kafka streams application. It offers a REST interface where other applications post events, which in turn need to go through App1's state and send requests to App2 via Kafka. Here is the diagram:
> >
> >     KafkaTopics ---> App1 ---> App2
> >                       |
> >                       V
> >                      REST
> >                      App3
> >
> > App3 uses the REST API to read the local store of App1 (IQ) and send requests to App2 (through a kafka topic, not shown above). Conceptually it looks the same as your use case. What do people do if a kafka streams application (App1) also has to offer a REST interface?
> >
> > -thanks
> > Mohan
> >
> > On 9/30/20, 5:01 PM, "Guozhang Wang" wrote:
> >
> > > Hello Mohan,
> > >
> > > If I understand correctly, your async event trigger process runs outside of the streams application, and reads the state stores of app2 through the interactive query interface, right? This is actually a pretty common use case pattern for IQ :)
> > >
> > > Guozhang
> > >
> > > On Wed, Sep 30, 2020 at 1:22 PM Parthasarathy, Mohan wrote:
> > >
> > > > Hi,
> > > >
> > > > A traditional kafka streams application (App1) reads data from a kafka topic, doing aggregations that result in some local state. The output of this application is consumed by a different application (App2) for doing a different task. Under some conditions, there is an external trigger (async event) which needs to trigger requests for all the keys in the local store to App2. To achieve this, we can read the local stores from all the replicas and send the requests to App2.
> > > >
> > > > This async event happens less frequently compared to the normal case that leads to the state creation in the first place. Are there any caveats doing it this way? If not, any other suggestions?
> > > >
> > > > Thanks
> > > > Mohan
>
> --
> Guozhang
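The "REST API on top of App1 via IQ" pattern Guozhang points to usually needs one extra piece: routing a key's query to the instance that actually hosts it. A plain-Python sketch of that routing decision; the host mapping is a hypothetical stand-in for what Kafka Streams' key-metadata lookup returns, not the real API:

```python
def route_query(key, local_host, partition_for_key, host_for_partition):
    """Serve from the local store if this instance owns the key's
    partition; otherwise forward the REST call to the owning instance."""
    owner = host_for_partition[partition_for_key(key)]
    return ("local", None) if owner == local_host else ("forward", owner)

hosts = {0: "app1-a:8080", 1: "app1-b:8080"}
# A deterministic toy partitioner for the example (length mod 2):
decision = route_query("user-42", "app1-a:8080", lambda k: len(k) % 2, hosts)
# -> ("forward", "app1-b:8080")
```

The linked confluentinc examples implement exactly this local-vs-forward split with the real Streams metadata API.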
Re: Is this a valid use case for reading local store ?
Mohan,

I think you can build a REST API on top of App1, directly leveraging its IQ interface. For some example code you can refer to
https://github.com/confluentinc/kafka-streams-examples/tree/6.0.0-post/src/main/java/io/confluent/examples/streams/interactivequeries

Guozhang

On Thu, Oct 1, 2020 at 10:40 AM Parthasarathy, Mohan wrote:

> Hi Guozhang,
>
> The async event trigger process is not running as a kafka streams application. It offers a REST interface where other applications post events, which in turn need to go through App1's state and send requests to App2 via Kafka. Here is the diagram:
>
>     KafkaTopics ---> App1 ---> App2
>                       |
>                       V
>                      REST
>                      App3
>
> App3 uses the REST API to read the local store of App1 (IQ) and send requests to App2 (through a kafka topic, not shown above). Conceptually it looks the same as your use case. What do people do if a kafka streams application (App1) also has to offer a REST interface?
>
> -thanks
> Mohan
>
> On 9/30/20, 5:01 PM, "Guozhang Wang" wrote:
>
> > Hello Mohan,
> >
> > If I understand correctly, your async event trigger process runs outside of the streams application, and reads the state stores of app2 through the interactive query interface, right? This is actually a pretty common use case pattern for IQ :)
> >
> > Guozhang
> >
> > On Wed, Sep 30, 2020 at 1:22 PM Parthasarathy, Mohan wrote:
> >
> > > Hi,
> > >
> > > A traditional kafka streams application (App1) reads data from a kafka topic, doing aggregations that result in some local state. The output of this application is consumed by a different application (App2) for doing a different task. Under some conditions, there is an external trigger (async event) which needs to trigger requests for all the keys in the local store to App2. To achieve this, we can read the local stores from all the replicas and send the requests to App2.
> > >
> > > This async event happens less frequently compared to the normal case that leads to the state creation in the first place. Are there any caveats doing it this way? If not, any other suggestions?
> > >
> > > Thanks
> > > Mohan

--
Guozhang
Re: Is this a valid use case for reading local store ?
Hi Guozhang,

The async event trigger process is not running as a kafka streams application. It offers a REST interface where other applications post events, which in turn need to go through App1's state and send requests to App2 via Kafka. Here is the diagram:

    KafkaTopics ---> App1 ---> App2
                      |
                      V
                     REST
                     App3

App3 uses the REST API to read the local store of App1 (IQ) and send requests to App2 (through a kafka topic, not shown above). Conceptually it looks the same as your use case. What do people do if a kafka streams application (App1) also has to offer a REST interface?

-thanks
Mohan

On 9/30/20, 5:01 PM, "Guozhang Wang" wrote:

> Hello Mohan,
>
> If I understand correctly, your async event trigger process runs outside of the streams application, and reads the state stores of app2 through the interactive query interface, right? This is actually a pretty common use case pattern for IQ :)
>
> Guozhang
>
> On Wed, Sep 30, 2020 at 1:22 PM Parthasarathy, Mohan wrote:
>
> > Hi,
> >
> > A traditional kafka streams application (App1) reads data from a kafka topic, doing aggregations that result in some local state. The output of this application is consumed by a different application (App2) for doing a different task. Under some conditions, there is an external trigger (async event) which needs to trigger requests for all the keys in the local store to App2. To achieve this, we can read the local stores from all the replicas and send the requests to App2.
> >
> > This async event happens less frequently compared to the normal case that leads to the state creation in the first place. Are there any caveats doing it this way? If not, any other suggestions?
> >
> > Thanks
> > Mohan

--
Guozhang
Re: Is this a valid use case for reading local store ?
Hello Mohan,

If I understand correctly, your async event trigger process runs outside of the streams application, and reads the state stores of app2 through the interactive query interface, right? This is actually a pretty common use case pattern for IQ :)

Guozhang

On Wed, Sep 30, 2020 at 1:22 PM Parthasarathy, Mohan wrote:

> Hi,
>
> A traditional kafka streams application (App1) reads data from a kafka topic, doing aggregations that result in some local state. The output of this application is consumed by a different application (App2) for doing a different task. Under some conditions, there is an external trigger (async event) which needs to trigger requests for all the keys in the local store to App2. To achieve this, we can read the local stores from all the replicas and send the requests to App2.
>
> This async event happens less frequently compared to the normal case that leads to the state creation in the first place. Are there any caveats doing it this way? If not, any other suggestions?
>
> Thanks
> Mohan

--
Guozhang
Is this a valid use case for reading local store ?
Hi,

We have a traditional kafka streams application (App1) reading data from a kafka topic and doing aggregations that result in some local state. The output of this application is consumed by a different application (App2) for doing a different task. Under some conditions, there is an external trigger (async event) which needs to trigger requests for all the keys in the local store to App2. To achieve this, we can read the local stores from all the replicas and send the requests to App2.

This async event happens less frequently compared to the normal case that leads to the state creation in the first place. Are there any caveats doing it this way? If not, any other suggestions?

Thanks
Mohan
Re: First time building a streaming app and I need help understanding how to build out my use case
```*When I get a request for all of the messages containing a given user ID, I need to query into the topic and get the content of those messages. Does that make sense and is it a thing Kafka can do?*```

If I understand correctly, your requirement is to query Kafka topics by key. Example: topic `user_data` [ key: userid, value: JSON or some other data ]. Given a userid, all you need is to consume the JSON data from topic user_data for the supplied user_id. Is this correct? If yes: Kafka is not recommended for use as a query service. If you have data for only a small number of users, you can still achieve this by consuming all the data and applying a filter on user_id.

--Senthil

On Mon, Jun 10, 2019 at 9:45 PM Simon Calvin wrote:
> Martin,
>
> Thank you very much for your reply. [...]
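Senthil's "consume everything and filter" fallback can be sketched without a broker; this is a minimal illustration of the idea only, with the consumer loop simulated by an in-memory list of (key, value) records that a real application would instead receive from a Kafka consumer's poll loop. Topic contents and keys are invented for the example.

```python
import json

def messages_for_user(records, user_id):
    """Return the decoded values of all records whose key matches user_id."""
    return [json.loads(value) for key, value in records if key == user_id]

# Simulated topic contents; in practice these come from a Kafka consumer.
topic_records = [
    ("u1", '{"event": "login"}'),
    ("u2", '{"event": "purchase"}'),
    ("u1", '{"event": "logout"}'),
]

print(messages_for_user(topic_records, "u1"))
# This scans the whole topic on every query, which is exactly why Kafka
# alone is a poor fit as a query service for a large keyspace.
```

The full-scan cost per request is the reason Senthil only recommends this for a small number of users.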
Re: First time building a streaming app and I need help understanding how to build out my use case
Martin,

Thank you very much for your reply. I appreciate the perspective on securing communications with Kafka, but before I get to that point I'm trying to figure out if/how I can implement this use case specifically in Kafka.

The point that I'm stuck on is needing to query for specific messages within a topic when the app receives a request. To simplify the example, consider a service that is subscribed to messages that contain a user ID. When I get a request for all of the messages containing a given user ID, I need to query into the topic and get the content of those messages. Does that make sense, and is it a thing Kafka can do?

Thanks again for your help and attention!

Simon

From: Martin Gainty
Sent: Monday, June 10, 2019 8:20 AM
To: users@kafka.apache.org
Subject: Re: First time building a streaming app and I need help understanding how to build out my use case

> [...]
Re: First time building a streaming app and I need help understanding how to build out my use case
MG>below

From: Simon Calvin
Sent: Friday, June 7, 2019 3:39 PM
To: users@kafka.apache.org
Subject: First time building a streaming app and I need help understanding how to build out my use case

Hello, everyone. I feel like I have a use case that is well suited to the Kafka streaming paradigm, but I'm having a difficult time understanding how certain aspects will work as I'm prototyping.

So here's my use case: Service 1 assigns a job to a user, which is published as an event to Kafka. Service 2 is a domain service that owns the definitions for all jobs. In this case, a definition boils down to a bunch of form fields that need to be filled in. As changes are made to the definitions, the updated versions are published by Service 2 to Kafka (I think this is a KTable?). The job from Service 1 and the definition from Service 2 get joined together to create a "bill of materials" that the user needs to fulfill. Service 3, a REST API,

MG>can you risk implementing a non-secured HTTP connection?... then go ahead
MG>if not you will need to look into some manner of PKI implementation for your Kafka Streams (user_login or certs&keys)

needs to pull any unfulfilled bills for a given user. Ideally we want the bill to contain the most current version of the job definition at the point it is retrieved (vs. the version at the point that the job assignment was published). Then, as the user fulfills the items, we update the bill with their responses. Once the bill is complete it gets pushed on to one or more additional services (all basic consumers).

MG>for a KTable stream example please reference org.apache.kafka.streams.smoketest.SmokeTestClient createKafkaStreams

The part I'm having the most trouble with is the retrieval of bills for a user in Service 3. I got this idea in my head that because Kafka is effectively a storage system there was a(n at least fairly) straightforward way of querying out messages that were keyed/tagged a certain way (i.e., with the user ID), but it's not clear to me if and how that works in practice. I'm very new to the idea of streaming, and so I think a lot of the issue is that I'm trying to force foreign concepts (the non-streaming way I'm used to doing things) into the streaming paradigm. Any help is appreciated!

MG>assuming your ID is *NOT* generated for your table
MG>if implementing HTTPS request/response you might want to consider using the identifier of a unique secured SESSION_ID
https://security.stackexchange.com/questions/87269/how-is-the-session-id-sent-securely

Thanks very much for your kind attention!

Simon Calvin
First time building a streaming app and I need help understanding how to build out my use case
Hello, everyone. I feel like I have a use case that is well suited to the Kafka streaming paradigm, but I'm having a difficult time understanding how certain aspects will work as I'm prototyping.

So here's my use case: Service 1 assigns a job to a user, which is published as an event to Kafka. Service 2 is a domain service that owns the definitions for all jobs. In this case, a definition boils down to a bunch of form fields that need to be filled in. As changes are made to the definitions, the updated versions are published by Service 2 to Kafka (I think this is a KTable?). The job from Service 1 and the definition from Service 2 get joined together to create a "bill of materials" that the user needs to fulfill. Service 3, a REST API, needs to pull any unfulfilled bills for a given user. Ideally we want the bill to contain the most current version of the job definition at the point it is retrieved (vs. the version at the point that the job assignment was published). Then, as the user fulfills the items, we update the bill with their responses. Once the bill is complete it gets pushed on to one or more additional services (all basic consumers).

The part I'm having the most trouble with is the retrieval of bills for a user in Service 3. I got this idea in my head that because Kafka is effectively a storage system there was a(n at least fairly) straightforward way of querying out messages that were keyed/tagged a certain way (i.e., with the user ID), but it's not clear to me if and how that works in practice. I'm very new to the idea of streaming, and so I think a lot of the issue is that I'm trying to force foreign concepts (the non-streaming way I'm used to doing things) into the streaming paradigm. Any help is appreciated!

Thanks very much for your kind attention!

Simon Calvin
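A common answer to the retrieval question above is to stop thinking of the topic as something you query and instead keep consuming it into a local materialized view keyed by user ID, then serve REST requests from that view — the idea behind Kafka Streams state stores and interactive queries. The sketch below simulates this pattern in plain Python; the record shape, field names, and sample data are illustrative, not from the thread.

```python
from collections import defaultdict

class BillView:
    """Local materialized view of the bills topic, keyed by user ID."""

    def __init__(self):
        self._bills = defaultdict(list)  # user_id -> list of bill payloads

    def apply(self, record):
        """Apply one consumed (user_id, bill) record to the view."""
        user_id, bill = record
        self._bills[user_id].append(bill)

    def unfulfilled_for(self, user_id):
        """What Service 3 would serve on a GET for this user."""
        return [b for b in self._bills[user_id] if not b.get("fulfilled")]

# Simulated consumption of the topic; a real service keeps polling forever.
view = BillView()
for rec in [("alice", {"job": "inspection", "fulfilled": False}),
            ("alice", {"job": "survey", "fulfilled": True})]:
    view.apply(rec)

print(view.unfulfilled_for("alice"))
```

The REST handler then reads only from the in-memory (or RocksDB-backed) view, so request latency is independent of topic size.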
Re: Configuration guidelines for a specific use-case
Given your high throughput on the consumer side, you might consider adding more partitions than 6, so you can scale up beyond 6 consumers per group if need be.

Ryanne

On Wed, Jan 9, 2019 at 10:37 AM Gioacchino Vino wrote:
> [...]
Re: Configuration guidelines for a specific use-case
Hi Ryanne,

I just forgot to insert the "linger.ms=0" configuration. I got this result:

5000 records sent, 706793.701055 records/sec (67.41 MB/sec), 7.29 ms avg latency, 1245.00 ms max latency, 0 ms 50th, 3 ms 95th, 197 ms 99th, 913 ms 99.9th.

It's pretty good, but I would like to improve it just a bit. Do you think using 6 partitions in a 3-broker cluster is a good choice?

Gioacchino

On 08/01/2019 18:52, Ryanne Dolan wrote:
> [...]
Re: Configuration guidelines for a specific use-case
Latency sounds high to me, maybe your JVMs are GC'ing a lot?

Ryanne

On Tue, Jan 8, 2019, 11:45 AM Gioacchino Vino wrote:
> [...]
Configuration guidelines for a specific use-case
Hi experts,

I would like to ask you for some guidelines, web pages, or comments regarding my use case.

*Requirements*:

- 2000+ producers
- input rate of 600k messages/s
- consumers must write to 3 different databases (so I assume 3 consumer groups) at 600k messages/s overall (200k messages/s/database)
- latency < 500 ms between producers and databases
- good availability
- possibility to process messages before sending them to the databases (Kafka Streams? Of course in HA. Docker? Marathon?)
- missing data is tolerated (0.5% max; disk writing is not strictly required), latency has higher priority
- record size: 100-1000 bytes

*Resources*:

- brokers (bandwidth: 25 Gbps, 32 CPUs, 1 disk (I/O 99.0 MB/s))
- producers -> brokers -> consumers (bandwidth: 1 Gbps)

*My configuration*:

- 3 brokers
- 6 partitions (without replication, in order to minimize latency)
- acks = 0 (missing data is tolerated)
- batch.size = 1024 (with 8196 the throughput is max)
- producers -> compression.type=none

I did tests using kafka-producer-perf-test.sh and kafka-consumer-perf-test.sh and I get good throughput (500-600k messages/s using 3 producers and 3 consumers), but I would like to improve latency (0.3-2 s) or find features I'm not yet considering.

I thank you in advance.

Cheers,

Gioacchino
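The stated requirements admit a quick back-of-envelope sanity check: 600k messages/s of 100-1000 byte records over a 1 Gbps producer/consumer link, spread across 6 partitions. The numbers below come straight from the post; only the arithmetic is added.

```python
msgs_per_sec = 600_000
partitions = 6
link_bytes_per_sec = 1e9 / 8        # 1 Gbps ≈ 125 MB/s per NIC

per_partition_rate = msgs_per_sec / partitions   # load per partition
low_mb_s = msgs_per_sec * 100 / 1e6              # MB/s at 100-byte records
high_mb_s = msgs_per_sec * 1000 / 1e6            # MB/s at 1000-byte records

print(f"{per_partition_rate:.0f} msgs/s per partition")
print(f"data rate: {low_mb_s:.0f}-{high_mb_s:.0f} MB/s "
      f"vs {link_bytes_per_sec / 1e6:.0f} MB/s per 1 Gbps link")
```

At the top of the record-size range the aggregate data rate (600 MB/s) far exceeds a single 1 Gbps link (~125 MB/s), so compression, more NICs, or more producer/consumer hosts would be needed there, independent of any broker tuning.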
Kafka use case
I have 2 systems:

1. System I - a web-based interface on an Oracle DB, with no REST API support.
2. System II - supports REST APIs and also has a web-based interface.

When a record is created or updated in either system, I want to propagate the data to the other system. Can I use Kafka here as a mediator? That is, Kafka receives the data and checks who the receiver is: if it's System I, make a JDBC connection; if it's System II, make a REST API call.

Shibi
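The mediator idea above boils down to a consumer that routes each change event to the right delivery mechanism. This is a minimal sketch of that routing step only; the `target` field, handler names, and return strings are invented placeholders for a real JDBC call (System I) and a real HTTP client call (System II).

```python
def deliver_via_jdbc(record):
    # Placeholder for a real JDBC insert/update into the Oracle DB.
    return f"JDBC insert for {record['payload']}"

def deliver_via_rest(record):
    # Placeholder for a real HTTP POST to System II's REST API.
    return f"POST /records with {record['payload']}"

ROUTES = {"system1": deliver_via_jdbc, "system2": deliver_via_rest}

def route(record):
    """Dispatch one consumed change event to the receiving system."""
    handler = ROUTES.get(record["target"])
    if handler is None:
        raise ValueError(f"unknown target: {record['target']}")
    return handler(record)

print(route({"target": "system2", "payload": {"id": 7}}))
```

Kafka Connect's JDBC sink and HTTP sink connectors cover much of this out of the box, so the hand-written consumer is only needed if the routing logic is custom.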
Re: Is this a decent use case for Kafka Streams?
Unfortunately this notion isn't applicable: "...At the end of a time window..."

If you comb through the archives of this group you'll see many questions about notifications for the "end of an aggregation window" and a similar number of replies from the Kafka group stating that such a notion doesn't really exist. Each window is kept open so that late-arriving records can be incorporated. You can specify the lifetime of a given window, but you don't get any sort of signal when it expires. A record that arrives after said expiration will trigger a new window to be created.

On Wed, Jul 12, 2017 at 5:06 PM, Stephen Powis wrote:
> [...]
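The "no end-of-window signal" behavior described above can be made concrete with a toy model: records are assigned to 1-minute tumbling windows, and a window's batch is only observed to be complete when a record for a later window arrives. This is a simplified illustration of the semantics, not Kafka Streams code; the window size and events are arbitrary.

```python
WINDOW_MS = 60_000  # 1-minute tumbling windows

def window_start(ts):
    return ts - ts % WINDOW_MS

def batch_stream(records):
    """records: iterable of (timestamp_ms, value). Yields (window_start, batch)."""
    current, batch = None, []
    for ts, value in records:
        w = window_start(ts)
        if current is not None and w != current:
            yield current, batch   # the previous window "closes" only now,
            batch = []             # i.e. when a later record shows up
        current = w
        batch.append(value)
    if batch:
        yield current, batch       # the final window never closes on its own

events = [(0, "a"), (1_000, "b"), (61_000, "c")]
print(list(batch_stream(events)))
```

Note that window (0, ['a', 'b']) is emitted only because "c" arrived; if the stream went quiet, nothing would ever fire — which is exactly the trap for the "issue an API request at the end of each window" design.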
Re: Is this a decent use case for Kafka Streams?
From just looking at your description of the problem, I'd say yes, this looks like a typical scenario for Kafka Streams. Kafka Streams supports exactly-once semantics too, as of 0.11.

Cheers
Eno

> On 12 Jul 2017, at 17:06, Stephen Powis wrote:
> [...]
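The exactly-once support Eno mentions is enabled through a single Streams configuration property (introduced in 0.11 via KIP-129); a minimal sketch of the relevant settings, where the application id and broker address are placeholders:

```properties
# Kafka Streams 0.11+: opt in to exactly-once processing
application.id=tenant-etl-app        # placeholder
bootstrap.servers=broker:9092        # placeholder
processing.guarantee=exactly_once    # default is at_least_once
```

Note this covers processing within Kafka (read-process-write cycles); calls out to a third-party API, as in Stephen's case, still need their own idempotency or retry handling.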
Is this a decent use case for Kafka Streams?
Hey! I was hoping I could get some input from people more experienced with Kafka Streams to determine if it'd be a good use case/solution for me.

I have multi-tenant clients submitting data to a Kafka topic that they want ETL'd to a third-party service. I'd like to batch and group these by tenant over a time window, somewhere between 1 and 5 minutes, and at the end of a time window issue an API request to the third-party service for each tenant, sending the batch of data over.

Other points of note:

- Ideally we'd have exactly-once semantics; sending data multiple times would typically be bad. But we'd need to gracefully handle things like API request errors / service outages.
- We currently use Storm for doing stream processing, but the long-running time windows and potentially large amount of data stored in memory make me a bit nervous to use it for this.

Thoughts? Thanks in advance!
Stephen
Re: How to implement use case
> On Apr 27, 2017, at 3:25 AM, Vladimir Lalovic wrote:
>
> Hi all,
>
> Our system is about ride reservations and acts as a broker between customers and drivers.
> ...
> Most of our rules are a function of time and some reservation property (e.g. check if there are any reservations where the remaining time before pickup is less than x).
>
> The number of reservations we currently fetch is ~5000 and the number of notification/alerting rules is ~20.
>
> Based on the documentation and some blog posts I have the impression that Kafka and the Kafka Streams library are a good choice for this use case, but I would like to confirm that with someone from the Kafka team or to get some recommendations ...

We use Kafka Streams for an event dispatch system (although ours uses SMS). I have run into a couple of edge issues where it was clear that most users use KS differently (most for continuous streaming analysis jobs), but fundamentally the library is also good for building event-driven applications, and we are quite happy with our choice. Just know that you're at the far edge of what the community uses KS for, so you might trip over an issue or two. I'd still highly recommend it!
How to implement use case
Hi all,

Our system is about ride reservations and acts as a broker between customers and drivers. It is similar to what Uber does, with the major difference that we are mostly focused on reservations scheduled in advance. So between the moment a reservation is created and the moment the reservation/ride is actually delivered by a driver, we perform multiple checks/notifications/alerts. Currently that is implemented with queries scheduled to execute every 60 seconds. The scheduled-query approach becomes inefficient as the number of reservations and notification/alerting rules increases. An additional reason we want to rewrite this part of the functionality is to decompose the current monolith, and a large part of that decomposition is moving our services from a scheduled (timer-based) to an event-based mechanism.

The table below is a simplified example of what we have. Basically we have two streams: one stream of reservation-related events, and another stream of time ticks.

  Time stream                  Stream of reservation events
  (tick, e.g. each minute)     Event time            Status               Scheduled for
  ...                          25/04/2017 11.22 PM   CREATED              15/05/2017 16.30 PM
  ...                          26/04/2017 15.15 PM   OFFERED              15/05/2017 16.30 PM
  ...                          26/04/2017 21.12 PM   ASSIGNED             15/05/2017 16.30 PM
  ...                          15/05/2017 15.51 PM   DRIVER EN ROUTE      15/05/2017 16.30 PM
  ...                          15/05/2017 15.25 PM   DRIVER ON LOCATION   15/05/2017 16.30 PM

Most of our rules are a function of time and some reservation property (e.g. check if there are any reservations where the remaining time before pickup is less than x).

The number of reservations we currently fetch is ~5000, and the number of notification/alerting rules is ~20.

Based on the documentation and some blog posts I have the impression that Kafka and the Kafka Streams library are a good choice for this use case, but I would like to confirm that with someone from the Kafka team or to get some recommendations ...

Thanks,
Vladimir
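The tick-driven rule from the table above ("remaining time before pickup is less than x") can be sketched as a pure function evaluated on each tick. The reservation shape, the 2-hour threshold, and the sample data are illustrative, not from the post.

```python
from datetime import datetime, timedelta

def pickup_alerts(reservations, now, threshold=timedelta(hours=2)):
    """On a timer tick at `now`, return reservations whose pickup is
    in the future but closer than `threshold`."""
    return [r for r in reservations
            if timedelta(0) < r["pickup"] - now < threshold]

now = datetime(2017, 5, 15, 15, 0)
reservations = [
    {"id": 1, "pickup": datetime(2017, 5, 15, 16, 30)},  # 1.5 h away -> alert
    {"id": 2, "pickup": datetime(2017, 5, 16, 9, 0)},    # far away -> no alert
]
print([r["id"] for r in pickup_alerts(reservations, now)])
```

In a Kafka-based design the tick stream would trigger this check against a table of reservations materialized from the reservation-event stream, replacing the scheduled DB queries.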
Kafka Topics best practice for logging data pipeline use case
We are using the latest Kafka and Logstash versions for ingesting several business apps' logs (a few now, but eventually 100+) into ELK. We have a standardized logging structure for business apps to log data into Kafka topics, and we are able to ingest it into ELK via the Kafka topics input plugin.

Currently, we are using one Kafka topic for each business app for pushing data into Logstash. We have 3 Logstash consumers, with 3 partitions on each topic. I am wondering about the best practice for using Kafka/Logstash. Is the above config a good approach, or is there a better one? For example, instead of having one Kafka topic for each app, should we have one Kafka topic across all apps? What are the pros and cons?

If you are not familiar with Logstash, it is part of the Elastic stack and it is just another consumer for Kafka. Would appreciate your input!

--
Thanks,
Ram Vittal
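One concrete difference between the two layouts: with a single shared topic, each record must carry the app name so the pipeline can still separate streams downstream (e.g. one Elasticsearch index per app), whereas with topic-per-app the topic name itself carries that information. The field, topic, and index names below are invented for illustration.

```python
def target_index(record, topic):
    """Derive the destination index for a log record under either layout."""
    if topic == "app-logs":
        # Shared-topic layout: the app name must be a field in the record.
        return f"logs-{record['app']}"
    # Topic-per-app layout: the app name is encoded in the topic name.
    return f"logs-{topic.removeprefix('app-logs-')}"

shared = target_index({"app": "billing", "msg": "ok"}, "app-logs")
per_app = target_index({"msg": "ok"}, "app-logs-billing")
print(shared, per_app)
```

The shared topic keeps partition counts and ACLs manageable at 100+ apps but mixes retention and throughput profiles; per-app topics isolate noisy apps and allow per-app retention at the cost of more topics to manage.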
Re: [KIP-94] SessionWindows - IndexOutOfBoundsException in simple use case
Hi Marco,

Did you run this example with the same store name using TimeWindows? It looks to me like it is trying to restore state from a changelog that was used with TimeWindows. The data in that topic will be incompatible with SessionWindows, as the keys are in a different format. You'll either need to use a different store name, i.e., change "aggs", or you will need to use the streams reset tool to reset the topics:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Application+Reset+Tool

Thanks,
Damian

On Wed, 22 Feb 2017 at 09:35 Marco Abitabile wrote:
> [...]
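Damian's diagnosis can be illustrated with a deliberately simplified byte-layout model (these are not the exact serde formats, just an analogy): a time-window key carries one trailing timestamp, while a session-window key carries two, so decoding old time-window changelog bytes with the session-key layout runs out of bytes, much like the IndexOutOfBoundsException in SessionKeySerde.extractEnd above.

```python
import struct

def time_window_key(key: bytes, start_ms: int) -> bytes:
    # Simplified: key bytes followed by one 8-byte window-start timestamp.
    return key + struct.pack(">q", start_ms)

def parse_session_key(raw: bytes):
    # Simplified session layout: key bytes followed by TWO 8-byte timestamps.
    end_ms, start_ms = struct.unpack(">qq", raw[-16:])
    return raw[:-16], end_ms, start_ms

raw = time_window_key(b"user1", 1000)  # 13 bytes: too short for session layout
try:
    parse_session_key(raw)
except struct.error as e:
    print("restore fails:", e)
```

Hence Damian's two remedies: a fresh store name (new, empty changelog) or the application reset tool (wipe the old one).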
[KIP-94] SessionWindows - IndexOutOfBoundsException in simple use case
Hello,

I apologize to Matthias, since yesterday I posted this issue in the wrong place on GitHub :(

I'm trying a simple use case of session windowing. TimeWindows works perfectly; however, when I replace it with SessionWindows, this exception is thrown:

Exception in thread "StreamThread-1" org.apache.kafka.streams.errors.StreamsException: stream-thread [StreamThread-1] Failed to rebalance
    at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:612)
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:368)
Caused by: java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkIndex(Buffer.java:546)
    at java.nio.HeapByteBuffer.getLong(HeapByteBuffer.java:416)
    at org.apache.kafka.streams.kstream.internals.SessionKeySerde.extractEnd(SessionKeySerde.java:117)
    at org.apache.kafka.streams.state.internals.SessionKeySchema.segmentTimestamp(SessionKeySchema.java:45)
    at org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore.put(RocksDBSegmentedBytesStore.java:71)
    at org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore$1.restore(RocksDBSegmentedBytesStore.java:104)
    at org.apache.kafka.streams.processor.internals.ProcessorStateManager.restoreActiveState(ProcessorStateManager.java:230)
    at org.apache.kafka.streams.processor.internals.ProcessorStateManager.register(ProcessorStateManager.java:193)
    at org.apache.kafka.streams.processor.internals.AbstractProcessorContext.register(AbstractProcessorContext.java:99)
    at org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore.init(RocksDBSegmentedBytesStore.java:101)
    at org.apache.kafka.streams.state.internals.ChangeLoggingSegmentedBytesStore.init(ChangeLoggingSegmentedBytesStore.java:68)
    at org.apache.kafka.streams.state.internals.MeteredSegmentedBytesStore.init(MeteredSegmentedBytesStore.java:66)
    at org.apache.kafka.streams.state.internals.RocksDBSessionStore.init(RocksDBSessionStore.java:78)
    at org.apache.kafka.streams.state.internals.CachingSessionStore.init(CachingSessionStore.java:97)
    at org.apache.kafka.streams.processor.internals.AbstractTask.initializeStateStores(AbstractTask.java:86)
    at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:141)
    at org.apache.kafka.streams.processor.internals.StreamThread.createStreamTask(StreamThread.java:834)
    at org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:1207)
    at org.apache.kafka.streams.processor.internals.StreamThread$AbstractTaskCreator.retryWithBackoff(StreamThread.java:1180)
    at org.apache.kafka.streams.processor.internals.StreamThread.addStreamTasks(StreamThread.java:937)
    at org.apache.kafka.streams.processor.internals.StreamThread.access$500(StreamThread.java:69)
    at org.apache.kafka.streams.processor.internals.StreamThread$1.onPartitionsAssigned(StreamThread.java:236)
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:255)
    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:339)
    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:303)
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:286)
    at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1030)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
    at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:582)
    ... 1 more

The code is very simple:

    KStreamBuilder builder = new KStreamBuilder();
    KStream<String, String> macs = builder.stream(stringSerde, stringSerde, "test01");
    macs.groupByKey()
        .aggregate(
            () -> new String(),
            (String aggKey, String value, String aggregate) -> aggregate + value,
            (String aggKey, String agg1, String agg2) -> agg1 + agg2, // session merger
            SessionWindows.with(30 * 1000).until(10 * 60 * 1000), // TimeWindows.of(1000).until(1000),
            stringSerde,
            "aggs")
        .toStream()
        .map((Windowed<String> key, String value) -> KeyValue.pair(key.key(), value))
        .print();
    KafkaStreams streams = new KafkaStreams(builder, props);
    streams.start();

My real use case doesn't work either. While debugging, I've noticed that it doesn't even reach the beginning of the stream pipeline (the groupByKey). Can you please help investigate this issue?

Best,
Marco
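A note on where this fails: the stack trace is in the restore path (ProcessorStateManager.restoreActiveState), i.e. Streams is rebuilding the "aggs" store from its changelog topic, and SessionKeySerde.extractEnd is parsing each changelog key. If that changelog already holds keys written in a different format — for example, left over from the earlier TimeWindows run under the same store name "aggs" — the parse can run off the end of the buffer. A minimal, self-contained sketch of the mechanism (the byte layout below is an assumption for illustration, not the exact SessionKeySerde format):

```java
import java.nio.ByteBuffer;

public class SessionKeySketch {
    static final int TS = Long.BYTES; // 8 bytes per timestamp

    // Hypothetical session-key layout: [serialized key][end ts][start ts]
    static byte[] sessionKey(byte[] key, long start, long end) {
        ByteBuffer buf = ByteBuffer.allocate(key.length + 2 * TS);
        buf.put(key).putLong(end).putLong(start);
        return buf.array();
    }

    // Conceptually what extractEnd does: read a long at a fixed offset
    // measured from the end of the binary key.
    static long extractEnd(byte[] binaryKey) {
        return ByteBuffer.wrap(binaryKey).getLong(binaryKey.length - 2 * TS);
    }

    public static void main(String[] args) {
        byte[] ok = sessionKey("mac-1".getBytes(), 1000L, 1030L);
        System.out.println(extractEnd(ok)); // 1030

        // A changelog key written by a *non-session* store carries no trailing
        // timestamps; for a short key the computed offset is negative, and
        // getLong throws IndexOutOfBoundsException -- matching the trace above.
        try {
            extractEnd("mac".getBytes());
        } catch (IndexOutOfBoundsException e) {
            System.out.println("IndexOutOfBoundsException");
        }
    }
}
```

If stale changelog data is indeed the cause, resetting the application (or changing the store name / application.id so a fresh changelog is created) would be worth trying before digging deeper.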
Re: How does one deploy to consumers without causing re-balancing for real time use case?
We've only started using Kafka-based group coordination for small and simple use cases at LinkedIn so far. Given that you kill -9 your process, your explanation for the long stabilization time makes sense. I'd recommend calling KafkaConsumer.close(); it should speed up the rebalance times.

Another idea: it sounds like you deploy changes to your consumers sequentially. Is this required? If not, adding some parallelism to the deployment would reduce the number of rebalances and therefore let the group stabilize sooner.
Re: How does one deploy to consumers without causing re-balancing for real time use case?
Hey Onur,

I was just watching your talk on rebalancing from last year - https://www.youtube.com/watch?v=QaeXDh12EhE - nice talk!

I think I have an idea as to why it takes 1 hr in my case, based on the talk. With 32 boxes / consumers in the same group, I believe the group coordinator's state machine gets disrupted each time a new consumer is added, until the very last one joins. Also, I have the heartbeat set to 97 seconds (because normal processing can take that long and we don't want the coordinator to think a consumer is dead). I think these two coupled together are why the cluster restart takes > 1 hr. I'm curious how LinkedIn does clean cluster restarts? How do you handle the scenario described above?

Praveen
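The intuition above can be made concrete with back-of-envelope arithmetic. Under a rough model (an assumption, not a measurement) where each kill -9 leaves a dead member in the group until the 97-second session timeout expires, and a sequential restart of 32 boxes serializes those waits, the numbers land right in the observed range:

```java
public class RebalanceEstimate {
    public static void main(String[] args) {
        int consumers = 32;          // boxes restarted one at a time
        int sessionTimeoutSec = 97;  // heartbeat/session timeout from the thread

        // Rough model: with kill -9 no LeaveGroupRequest is sent, so the
        // coordinator only notices each dead member after the session timeout,
        // and a sequential rolling restart serializes those waits.
        int worstCaseSec = consumers * sessionTimeoutSec;
        System.out.printf("worst case ~%d s (~%d min)%n", worstCaseSec, worstCaseSec / 60);
    }
}
```

32 × 97 s ≈ 51 minutes of timeout-waiting alone, before any actual rebalancing work — consistent with a > 1 hr stabilization, and with why close() (which sends the leave request immediately) or a more parallel deploy shrinks it.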
Re: How does one deploy to consumers without causing re-balancing for real time use case?
I still think a clean cluster start should not take > 1 hr to balance, though. Is this expected, or am I doing something different?

I thought this would be a common use case.

Praveen
Re: How does one deploy to consumers without causing re-balancing for real time use case?
Pradeep is right.

close() will try to send out a LeaveGroupRequest, while a kill -9 will not.
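The graceful-shutdown pattern implied here is a JVM shutdown hook that breaks the poll loop so close() runs before the process exits. A stdlib-only sketch (PollLoop is a stand-in for KafkaConsumer, which isn't available here; with the real client you'd call consumer.wakeup() from the hook and consumer.close() in the finally block, which is what sends the LeaveGroupRequest):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class GracefulShutdownSketch {
    // Stand-in for KafkaConsumer (assumption for illustration):
    // wakeup() breaks the poll loop, close() leaves the group.
    static class PollLoop {
        final AtomicBoolean running = new AtomicBoolean(true);
        boolean leftGroup = false;
        void wakeup() { running.set(false); } // real client: consumer.wakeup()
        void close()  { leftGroup = true; }   // real client: sends LeaveGroupRequest
    }

    public static void main(String[] args) {
        PollLoop consumer = new PollLoop();

        // SIGTERM runs shutdown hooks; kill -9 does not -- which is why
        // kill -9 leaves the group waiting out the full session timeout.
        Runtime.getRuntime().addShutdownHook(new Thread(consumer::wakeup));

        int polls = 0;
        try {
            while (consumer.running.get()) {
                // poll() and process records here
                if (++polls == 3) consumer.wakeup(); // simulate a deploy stopping us
            }
        } finally {
            consumer.close(); // group rebalances immediately instead of timing out
        }
        System.out.println(consumer.leftGroup ? "left group cleanly" : "zombie member");
    }
}
```

The key point for deployments: stop the process with a signal that lets this path run, rather than kill -9.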
Re: How does one deploy to consumers without causing re-balancing for real time use case?
I believe if you're calling the .close() method on shutdown, then the LeaveGroupRequest will be made. If you're doing a kill -9, I'm not sure if that request will be made.
Re: How does one deploy to consumers without causing re-balancing for real time use case?
@Pradeep - I just read your thread; the 1 hr pause was when all the consumers were shut down simultaneously. I'm testing out a rolling restart to get the actual numbers. The initial numbers are promising.

`STOP (1) (1 min later kicks off) -> REBALANCE -> START (1) -> REBALANCE (takes 1 min to get a partition)`

In your thread, Ewen says -

"The LeaveGroupRequest is only sent on a graceful shutdown. If a consumer knows it is going to shutdown, it is good to proactively make sure the group knows it needs to rebalance work because some of the partitions that were handled by the consumer need to be handled by some other group members."

So does this mean that the consumer should inform the group ahead of time before it goes down? Currently, I just shut down the process.
Re: How does one deploy to consumers without causing re-balancing for real time use case?
I asked a similar question a while ago. There doesn't appear to be a way to avoid triggering the rebalance. But I'm not sure why it would be taking > 1 hr in your case. For us it was pretty fast.

https://www.mail-archive.com/users@kafka.apache.org/msg23925.html
Re: How does one deploy to consumers without causing re-balancing for real time use case?
Would be great to get some input on it.

- Krzysztof Lesniewski
How does one deploy to consumers without causing re-balancing for real time use case?
I have a 16 broker kafka cluster. There is a topic with 32 partitions containing real time data and on the other side, I have 32 boxes w/ 1 consumer reading from these partitions. Today our deployment strategy is stop, deploy and start the processes on all the 32 consumers. This triggers re-balancing and takes a long period of time (> 1hr). Such a long pause isn't good for real time processing. I was thinking of rolling deploy but I think that will still cause re-balancing b/c we will still have consumers go down and come up. How do you deploy to consumers without triggering re-balancing (or triggering one that doesn't affect your SLA) when doing real time processing? Thanks, Praveen
Re: Architecture recommendations for a tricky use case
> On Sep 29, 2016, at 16:39, Ali Akhtar wrote: > > Why did you choose Druid over Postgres / Cassandra / Elasticsearch? Well, to be clear, we haven’t chosen it yet — we’re evaluating it. That said, it is looking quite promising for our use case. The Druid docs say it well: > Druid is an open source data store designed for OLAP queries on event data. And that’s exactly what we need. The other options you listed are excellent systems, but they’re more general than Druid. Because Druid is specifically focused on OLAP queries on event data, it has features and properties that make it very well suited to such use cases. In addition, Druid has built-in support for ingesting events from Kafka topics and making those events available for querying with very low latency. This is very attractive for my use case. If you’d like to learn more about Druid I recommend this talk from last month at Strange Loop: https://www.youtube.com/watch?v=vbH8E0nH2Nw HTH! Avi Software Architect @ Park Assist We’re hiring! http://tech.parkassist.com/jobs/
Re: Architecture recommendations for a tricky use case
"Using Spark to query the data in the backend of the web UI?"

Don't do that. I would recommend that the Spark Streaming process store the data in some NoSQL or SQL database, and that the web UI query data from that database.

Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman

2016-09-29 16:15 GMT+02:00 Ali Akhtar:

> The web UI is actually the speed layer; it needs to be able to query the
> data online, and show the results in real time.
>
> It also needs a custom front-end, so a system like Tableau can't be used;
> it must have a custom backend + front-end.
>
> Thanks for the recommendation of Flume. Do you think this will work:
>
> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
> - Using Spark to query the data in the backend of the web UI?
>
> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh wrote:
>
>> You need a batch layer and a speed layer. Data from Kafka can be stored
>> on HDFS using Flume.
>>
>> - Query this data to generate reports / analytics (there will be a web
>> UI which will be the front-end to the data, and will show the reports)
>>
>> This is basically the batch layer, and you need something like Tableau or
>> Zeppelin to query the data.
>>
>> You will also need Spark Streaming to query data online for the speed
>> layer. That data could be stored in some transient fabric like Ignite or
>> even Druid.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> http://talebzadehmich.wordpress.com
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary
>> damages arising from such loss, damage or destruction.
>>
>> On 29 September 2016 at 15:01, Ali Akhtar wrote:
>>
>>> It needs to be able to scale to a very large amount of data, yes.
>>>
>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma wrote:
>>>
>>>> What is the message inflow?
>>>> If it's really high, definitely Spark will be of great use.
>>>>
>>>> Thanks
>>>> Deepak
>>>>
>>>> On Sep 29, 2016 19:24, "Ali Akhtar" wrote:
>>>>
>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>
>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>> raw data into Kafka.
>>>>>
>>>>> I need to:
>>>>>
>>>>> - Do ETL on the data, and standardize it.
>>>>>
>>>>> - Store the standardized data somewhere (HBase / Cassandra / raw
>>>>> HDFS / ElasticSearch / Postgres)
>>>>>
>>>>> - Query this data to generate reports / analytics (there will be a
>>>>> web UI which will be the front-end to the data, and will show the
>>>>> reports)
>>>>>
>>>>> Java is being used as the backend language for everything (backend
>>>>> of the web UI, as well as the ETL layer)
>>>>>
>>>>> I'm considering:
>>>>>
>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>
>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>> data, and to allow queries
>>>>>
>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>> queries across the data (mostly filters), or directly run queries
>>>>> against Cassandra / HBase
>>>>>
>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>> alternatives I should go with (e.g., using raw Kafka consumers vs
>>>>> Spark for ETL, which persistent data store to use, and how to query
>>>>> that data store in the backend of the web UI, for displaying the
>>>>> reports).
>>>>>
>>>>> Thanks.
Re: Architecture recommendations for a tricky use case
Ok… so what's the tricky part?

Spark Streaming isn't real time, so if you don't mind a slight delay in processing… it would work.

The drawback is that you now have a long-running Spark job (assuming under YARN), and that could become a problem in terms of security and resources. (How well does YARN handle long-running jobs these days in a secured cluster? Steve L. may have some insight…)

Raw HDFS would become a problem because Apache HDFS is still a WORM (write once, read many) store. (Do you want to write your own compaction code? Or use Hive 1.x+?)

HBase? Depending on your admin… stability could be a problem.
Cassandra? That would be a separate cluster, and that in itself could be a problem…

YMMV, so you need to weigh the pros/cons of each tool specific to your environment and skill level.

HTH

-Mike

> On Sep 29, 2016, at 8:54 AM, Ali Akhtar wrote:
>
> [original post snipped; quoted in full earlier in the thread]
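[Editor's note] The "standardize" step in the original post (drop unnecessary fields, rename others) is simple enough to sketch as a pure function, independent of whether raw Kafka consumers or Spark Streaming drive it. This is a minimal illustration only; the field names (`usr_id`, `debugInfo`, etc.) are hypothetical, not from the thread:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class Standardizer {
    // Hypothetical mapping from raw API field names to canonical names.
    private static final Map<String, String> RENAMES = Map.of(
            "usr_id", "userId",
            "ts_millis", "eventTime");

    // Hypothetical noise fields dropped outright during standardization.
    private static final Set<String> DROPPED = Set.of("debugInfo", "rawHeaders");

    /** Standardize one raw event: drop noise fields, rename the rest. */
    static Map<String, Object> standardize(Map<String, Object> raw) {
        Map<String, Object> out = new HashMap<>();
        for (Map.Entry<String, Object> e : raw.entrySet()) {
            if (DROPPED.contains(e.getKey())) continue;
            out.put(RENAMES.getOrDefault(e.getKey(), e.getKey()), e.getValue());
        }
        return out;
    }
}
```

Keeping this logic in a plain Java function means the same code can be called from a raw consumer's poll loop or from a Spark `map` stage, so the ETL-engine choice stays reversible.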
Re: Architecture recommendations for a tricky use case
Spark standalone is not YARN… or secure for that matter… ;-)

> On Sep 29, 2016, at 11:18 AM, Cody Koeninger wrote:
>
> Spark streaming helps with aggregation because
>
> A. raw Kafka consumers have no built-in framework for shuffling
> amongst nodes, short of writing into an intermediate topic (I'm not
> touching Kafka Streams here, I don't have experience), and
>
> B. it deals with batches, so you can transactionally decide to commit
> or roll back your aggregate data and your offsets. Otherwise your
> offsets and data store can get out of sync, leading to lost /
> duplicate data.
>
> Regarding long-running Spark jobs, I have streaming jobs in the
> standalone manager that have been running for 6 months or more.
>
> [earlier messages snipped]
Re: Architecture recommendations for a tricky use case
OP mentioned HBase or HDFS as persisted storage. Therefore they have to be running YARN if they are considering Spark. (Assuming that you're not trying to do a storage/compute split and run standalone Spark outside your cluster. You can, but you have more moving parts…)

I never said anything about putting something on a public network. I mentioned running a secured cluster. You don't deal with PII or other regulated data, do you?

If you read my original post, you are correct: we don't have a lot, if any, real information. Based on what the OP said, there are design considerations, since every tool he mentioned has pluses and minuses, and the problem isn't really that challenging unless there is something extraordinary like high velocity or some other constraint that makes it challenging.

BTW, depending on scale and velocity… your relational engines may become problematic.

HTH

-Mike

> On Sep 29, 2016, at 1:51 PM, Cody Koeninger wrote:
>
> The OP didn't say anything about Yarn, and why are you contemplating
> putting Kafka or Spark on public networks to begin with?
>
> Gwen's right, absent any actual requirements this is kind of pointless.
>
> [earlier messages snipped]
Re: Architecture recommendations for a tricky use case
Avi,

Why did you choose Druid over Postgres / Cassandra / Elasticsearch?

On Fri, Sep 30, 2016 at 1:09 AM, Avi Flax wrote:

> [Avi's message snipped; it appears in full below]
Re: Architecture recommendations for a tricky use case
> On Sep 29, 2016, at 09:54, Ali Akhtar wrote:
>
> I'd appreciate some thoughts / suggestions on which of these alternatives I
> should go with (e.g., using raw Kafka consumers vs Spark for ETL, which
> persistent data store to use, and how to query that data store in the
> backend of the web UI, for displaying the reports).

Hi Ali, I'm no expert in any of this, but I'm working on a project that is broadly similar to yours, and FWIW I'm evaluating Druid as the datastore which would host the queryable data and, well, actually handle and fulfill queries.

Since Druid has built-in support for streaming ingestion from Kafka topics, I'm tentatively thinking of doing my ETL in a stream processing topology (I'm using Kafka Streams, FWIW), which would write the events destined for Druid into certain topics, from which Druid would ingest those events.

HTH,
Avi

Software Architect @ Park Assist
We're hiring! http://tech.parkassist.com/jobs/
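[Editor's note] Avi's pipeline (a Kafka Streams topology writing Druid-bound events to a topic) hinges on emitting flat, timestamped records that Druid's Kafka ingestion can index directly. Here is a minimal sketch of just that flattening step; the field names are hypothetical and the Streams wiring itself is omitted:

```java
import java.util.Map;

class DruidEventMapper {
    /**
     * Flatten an event into the flat, timestamped JSON shape a Druid
     * Kafka ingestion spec can index: one timestamp column, a dimension,
     * and numeric metric columns. All names here are illustrative.
     */
    static String toDruidJson(long timestampMillis, String service,
                              Map<String, ? extends Number> metrics) {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"timestamp\":").append(timestampMillis);
        sb.append(",\"service\":\"").append(service).append('"');
        for (Map.Entry<String, ? extends Number> e : metrics.entrySet()) {
            sb.append(",\"").append(e.getKey()).append("\":").append(e.getValue());
        }
        return sb.append('}').toString();
    }
}
```

In a real topology this function would sit in a `mapValues` step before producing to the topic named in the Druid ingestion spec; a production version would use a proper JSON library rather than string concatenation.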
Re: Architecture recommendations for a tricky use case
The OP didn't say anything about YARN, and why are you contemplating putting Kafka or Spark on public networks to begin with?

Gwen's right: absent any actual requirements, this is kind of pointless.

On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel wrote:

> Spark standalone is not Yarn… or secure for that matter… ;-)
>
> [earlier messages snipped]
Re: Architecture recommendations for a tricky use case
The original post made no mention of throughput, latency, or correctness requirements, so pretty much any data store will fit the bill... discussions of "what is better" degrade fast when there are no concrete standards to choose between. Who cares about anything when we don't know what we need? :)

On Thu, Sep 29, 2016 at 9:23 AM, Cody Koeninger wrote:

>> I still don't understand why writing to a transactional database with
>> locking and concurrency (reads and writes) through JDBC will be fast for
>> this sort of data ingestion.
>
> Who cares about fast if your data is wrong? And it's still plenty fast
> enough.
>
> https://youtu.be/NVl9_6J1G60?list=WL&t=1819
> https://www.citusdata.com/blog/2016/09/22/announcing-citus-mx/
>
> [earlier messages snipped]
Re: Architecture recommendations for a tricky use case
> I still don't understand why writing to a transactional database with
> locking and concurrency (reads and writes) through JDBC will be fast for
> this sort of data ingestion.

Who cares about fast if your data is wrong? And it's still plenty fast enough:

https://youtu.be/NVl9_6J1G60?list=WL&t=1819
https://www.citusdata.com/blog/2016/09/22/announcing-citus-mx/

On Thu, Sep 29, 2016 at 11:16 AM, Mich Talebzadeh wrote:

> [Mich's message snipped; quoted in full elsewhere in the thread]
Re: Architecture recommendations for a tricky use case
Spark streaming helps with aggregation because:

A. raw Kafka consumers have no built-in framework for shuffling amongst nodes, short of writing into an intermediate topic (I'm not touching Kafka Streams here, I don't have experience), and

B. it deals with batches, so you can transactionally decide to commit or roll back your aggregate data and your offsets. Otherwise your offsets and data store can get out of sync, leading to lost / duplicate data.

Regarding long-running Spark jobs, I have streaming jobs in the standalone manager that have been running for 6 months or more.

On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel wrote:

> Ok… so what's the tricky part?
>
> [message snipped; quoted in full elsewhere in the thread]
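[Editor's note] Cody's point B (commit aggregate data and offsets together, so they cannot drift apart) is the crux of exactly-once output. With a JDBC sink it amounts to writing the aggregate rows and the per-partition offsets inside one database transaction. The following is a toy in-memory sketch of the invariant, not real Kafka or JDBC code; all names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

class AtomicOffsetStore {
    // topic-partition -> last committed end offset
    private final Map<String, Long> offsets = new HashMap<>();
    // aggregate key -> running total
    private final Map<String, Long> aggregates = new HashMap<>();

    /**
     * Apply a batch's aggregate deltas and its ending offset atomically:
     * either both survive or neither does. In a real sink this would be
     * one database transaction containing both UPDATE statements.
     */
    synchronized boolean commitBatch(String topicPartition, long endOffset,
                                     Map<String, Long> deltas) {
        Long current = offsets.get(topicPartition);
        if (current != null && endOffset <= current) {
            // Batch already committed: a replay is a no-op, not a double count.
            return false;
        }
        for (Map.Entry<String, Long> e : deltas.entrySet()) {
            aggregates.merge(e.getKey(), e.getValue(), Long::sum);
        }
        offsets.put(topicPartition, endOffset);
        return true;
    }

    synchronized long aggregate(String key) {
        return aggregates.getOrDefault(key, 0L);
    }
}
```

The offset check is what turns at-least-once delivery into effectively-once output: if the job crashes after writing but before Kafka's own commit, the replayed batch is detected and skipped.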
Re: Architecture recommendations for a tricky use case
The way I see this, there are two things involved. 1. Data ingestion through source to Kafka 2. Date conversion and Storage ETL/ELT 3. Presentation Item 2 is the one that needs to be designed correctly. I presume raw data has to confirm to some form of MDM that requires schema mapping etc before putting into persistent storage (DB, HDFS etc). Which one to choose depends on your volume of ingestion and your cluster size and complexity of data conversion. Then your users will use some form of UI (Tableau, QlikView, Zeppelin, direct SQL) to query data one way or other. Your users can directly use UI like Tableau that offer in built analytics on SQL. Spark sql offers the same). Your mileage varies according to your needs. I still don't understand why writing to a transactional database with locking and concurrency (read and writes) through JDBC will be fast for this sort of data ingestion. If you ask me if I wanted to choose an RDBMS to write to as my sink,I would use Oracle which offers the best locking and concurrency among RDBMs and also handles key value pairs as well (assuming that is what you want). In addition, it can be used as a Data Warehouse as well. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
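Mich's item 2 (conforming raw data to an MDM schema before it hits persistent storage) can be sketched in a few lines. This is a minimal illustration only; the field names, the mapping, and the required-field set are all hypothetical:

```python
# Hedged sketch: conform a raw ingested record to a canonical (MDM-style)
# schema before persisting it. All names below are made up for illustration.

RAW_TO_MDM = {          # raw field name -> canonical field name
    "usr": "user_id",
    "ts": "event_time",
    "svc": "service",
    "val": "value",
}

REQUIRED = {"user_id", "event_time", "service"}

def conform(raw: dict) -> dict:
    """Rename known fields, drop unknown ones, and check required keys."""
    out = {RAW_TO_MDM[k]: v for k, v in raw.items() if k in RAW_TO_MDM}
    missing = REQUIRED - out.keys()
    if missing:
        raise ValueError(f"record missing required fields: {sorted(missing)}")
    return out

record = conform({"usr": 42, "ts": "2016-09-29T16:49:00Z",
                  "svc": "billing", "val": 9.5, "junk": "drop me"})
print(record)
```

In a real pipeline this function would run inside whatever does the conversion (Spark, a plain consumer, etc.) between the Kafka topic and the sink.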
Re: Architecture recommendations for a tricky use case
The business use case is to read a user's data from a variety of different services through their APIs, and then allow the user to query that data, on a per-service basis, as well as in aggregate across all services.

The way I'm considering doing it is to do some basic ETL (drop all the unnecessary fields, rename some fields into something more manageable, etc.) and then store the data in Cassandra / Postgres.

Then, when the user wants to view a particular report, query the respective table in Cassandra / Postgres (select .. from data where user = ? and date between ? and ? and some_field = ?).

How will Spark Streaming help with aggregation? Couldn't the data be queried from Cassandra / Postgres via the Kafka consumer and aggregated that way?
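As a rough illustration of the per-service report query described above, here is a self-contained sketch using sqlite3 as a stand-in for Postgres/Cassandra; the table and column names are made up:

```python
import sqlite3

# Hypothetical "data" table standing in for the real store; sqlite3 is used
# only so the sketch is self-contained and runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id TEXT, service TEXT, date TEXT, value REAL)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", [
    ("ali", "billing", "2016-09-01", 10.0),
    ("ali", "billing", "2016-09-15", 20.0),
    ("ali", "search",  "2016-09-20", 5.0),
    ("bob", "billing", "2016-09-10", 99.0),
])

# Per-service report for one user, parameterized in the same spirit as the
# "select .. from data where user = ? and date between ? and ?" query above.
rows = conn.execute(
    "SELECT service, SUM(value) FROM data "
    "WHERE user_id = ? AND date BETWEEN ? AND ? "
    "GROUP BY service ORDER BY service",
    ("ali", "2016-09-01", "2016-09-30"),
).fetchall()
print(rows)  # [('billing', 30.0), ('search', 5.0)]
```

The aggregation happens at query time here, which is exactly the alternative to pre-aggregating in Spark Streaming that the question raises.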
Re: Architecture recommendations for a tricky use case
No, direct stream in and of itself won't ensure an end-to-end guarantee, because it doesn't know anything about your output actions.

You still need to do some work. The point is that having easy access to offsets for batches on a per-partition basis makes it easier to do that work, especially in conjunction with aggregation.
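Cody's point about per-partition offsets can be made concrete with a toy sink that commits its output together with the last processed offset, so that replaying a batch after a failure is a no-op. Everything here is an in-memory stand-in (the names are illustrative, not any real API); in practice the same pattern uses a transaction in a store like Postgres:

```python
# Toy "transactional" sink: results and per-partition offsets are committed
# together, so redelivering a batch (as happens after a failure under
# at-least-once delivery) has no effect on the stored counts.

store = {"counts": {}, "offsets": {}}  # committed state, per partition

def process_batch(partition, from_offset, records):
    """Apply a batch idempotently: skip it if its offsets were already committed."""
    committed = store["offsets"].get(partition, -1)
    if from_offset <= committed:
        return  # batch already applied; replaying it is safe
    for key in records:
        store["counts"][key] = store["counts"].get(key, 0) + 1
    # The "transaction": output and offset updated together.
    store["offsets"][partition] = from_offset + len(records) - 1

process_batch(0, 0, ["a", "b", "a"])
process_batch(0, 0, ["a", "b", "a"])  # replayed batch: ignored
print(store["counts"])  # {'a': 2, 'b': 1}
```

With aggregated output like these counts, this offset check is what turns at-least-once delivery into effectively-once results; without it, the replay would double-count.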
Re: Architecture recommendations for a tricky use case
Hi Ali,

What is the business use case for this?

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com
Re: Architecture recommendations for a tricky use case
If you use Spark direct streams, it ensures an end-to-end guarantee for messages.
Re: Architecture recommendations for a tricky use case
If you're doing any kind of pre-aggregation during ETL, Spark direct stream will let you more easily get the delivery semantics you need, especially if you're using a transactional data store.

If you're literally just copying individual, uniquely keyed items from Kafka to a key-value store, use Kafka consumers, sure.
Re: Architecture recommendations for a tricky use case
Yes, but still, these writes from Spark have to go through JDBC? Correct.

Having said that, I don't see how doing this through Spark Streaming to Postgres is going to be faster than source -> Kafka -> Flume (via ZooKeeper) -> HDFS. I believe there is direct streaming from Kafka to Hive as well, and from Flume to HBase.

I would have thought that if one wanted to do real-time analytics with SS, then that would be a good fit with a real-time dashboard. What is not so clear is the business use case for this.

HTH

Dr Mich Talebzadeh
Re: Architecture recommendations for a tricky use case
My concern with Postgres / Cassandra is only scalability. I will look further into Postgres horizontal scaling, thanks.

Writes could be idempotent if done as upserts; otherwise updates will be idempotent, but not inserts.

Data should not be lost. The system should be as fault tolerant as possible.

What's the advantage of using Spark for reading Kafka instead of direct Kafka consumers?
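The upsert-vs-insert distinction above (upserts are idempotent, plain inserts are not) in miniature, again with sqlite3 standing in for the real store and made-up table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keyed (id TEXT PRIMARY KEY, value REAL)")
conn.execute("CREATE TABLE unkeyed (value REAL)")

def upsert(event_id, value):
    # Idempotent: replaying the same write leaves the same single row behind.
    conn.execute("INSERT OR REPLACE INTO keyed (id, value) VALUES (?, ?)",
                 (event_id, value))

# Deliver the same message twice, e.g. after a consumer retry.
for _ in range(2):
    upsert("msg-1", 42.0)
    conn.execute("INSERT INTO unkeyed (value) VALUES (?)", (42.0,))  # not idempotent

print(conn.execute("SELECT COUNT(*) FROM keyed").fetchone()[0])    # 1
print(conn.execute("SELECT COUNT(*) FROM unkeyed").fetchone()[0])  # 2
```

The keyed table ends up with one row no matter how many times the write is replayed; the unkeyed table duplicates on every redelivery, which is why plain inserts need exactly-once delivery (or a unique key) to avoid duplicates.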
Re: Architecture recommendations for a tricky use case
I wouldn't give up the flexibility and maturity of a relational database unless you have a very specific use case. I'm not trashing Cassandra (I've used Cassandra), but if all I know is that you're doing analytics, I wouldn't want to give up the ability to easily do ad-hoc aggregations without a lot of forethought. If you're worried about scaling, there are several options for horizontally scaling Postgres in particular. One of the current best, from what I've worked with, is Citus.
Re: Architecture recommendations for a tricky use case
Hi Cody,

Spark direct stream is just fine for this use case. But why Postgres and not Cassandra? Is there anything specific here that I may not be aware of?

Thanks
Deepak

On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger wrote:

> How are you going to handle ETL failures? Do you care about lost /
> duplicated data? Are your writes idempotent?
>
> Absent any other information about the problem, I'd stay away from
> Cassandra/Flume/HDFS/HBase/whatever, and use a Spark direct stream
> feeding Postgres.
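Cody's questions above — can the pipeline tolerate replays, and are the writes idempotent — are the crux of the raw-consumer-vs-Spark decision. If every record carries a stable key, a keyed upsert makes replays harmless. A minimal sketch, using SQLite as a stand-in for the Postgres (or Cassandra) target; the table and column names are invented for illustration:

```python
import sqlite3

# SQLite stands in for the real store; ON CONFLICT upserts mirror the
# Postgres syntax and need SQLite >= 3.24 (bundled with modern Python).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, value REAL)")

def write_event(event_id: str, value: float) -> None:
    """Idempotent keyed write: replaying the same event overwrites, never duplicates."""
    conn.execute(
        "INSERT INTO events (event_id, value) VALUES (?, ?) "
        "ON CONFLICT(event_id) DO UPDATE SET value = excluded.value",
        (event_id, value),
    )

# Simulate a stream replay after a consumer failure: the same batch is written twice.
batch = [("e1", 1.0), ("e2", 2.0)]
for _ in range(2):
    for eid, val in batch:
        write_event(eid, val)

rows = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
assert rows == 2  # no duplicates despite the replay
```

With writes shaped like this, at-least-once delivery from Kafka (direct stream or plain consumer) is enough; exactly-once semantics fall out of the store rather than the transport.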
Re: Architecture recommendations for a tricky use case
Is there an advantage to that vs directly consuming from Kafka? Nothing is being done to the data except some light ETL and then storing it in Cassandra.

On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma wrote:

> Its better you use spark's direct stream to ingest from kafka.
Re: Architecture recommendations for a tricky use case
;>>>> This is basically batch layer and you need something like Tableau or >>>>>> Zeppelin to query data >>>>>> >>>>>> You will also need spark streaming to query data online for speed >>>>>> layer. That data could be stored in some transient fabric like ignite or >>>>>> even druid. >>>>>> >>>>>> HTH >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> Dr Mich Talebzadeh >>>>>> >>>>>> >>>>>> >>>>>> LinkedIn >>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>>>> >>>>>> >>>>>> >>>>>> http://talebzadehmich.wordpress.com >>>>>> >>>>>> >>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for >>>>>> any loss, damage or destruction of data or any other property which may >>>>>> arise from relying on this email's technical content is explicitly >>>>>> disclaimed. The author will in no case be liable for any monetary damages >>>>>> arising from such loss, damage or destruction. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar >>>>>> wrote: >>>>>>> >>>>>>> It needs to be able to scale to a very large amount of data, yes. >>>>>>> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma >>>>>>> wrote: >>>>>>>> >>>>>>>> What is the message inflow ? >>>>>>>> If it's really high , definitely spark will be of great use . >>>>>>>> >>>>>>>> Thanks >>>>>>>> Deepak >>>>>>>> >>>>>>>> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" wrote: >>>>>>>>> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas. >>>>>>>>> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their >>>>>>>>> raw data into Kafka. >>>>>>>>> >>>>>>>>> I need to: >>>>>>>>> >>>>>>>>> - Do ETL on the data, and standardize it. 
>>>>>>>>> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw >>>>>>>>> HDFS / ElasticSearch / Postgres) >>>>>>>>> >>>>>>>>> - Query this data to generate reports / analytics (There will be a >>>>>>>>> web UI which will be the front-end to the data, and will show the >>>>>>>>> reports) >>>>>>>>> >>>>>>>>> Java is being used as the backend language for everything (backend >>>>>>>>> of the web UI, as well as the ETL layer) >>>>>>>>> >>>>>>>>> I'm considering: >>>>>>>>> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer >>>>>>>>> (receive raw data from Kafka, standardize & store it) >>>>>>>>> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized >>>>>>>>> data, and to allow queries >>>>>>>>> >>>>>>>>> - In the backend of the web UI, I could either use Spark to run >>>>>>>>> queries across the data (mostly filters), or directly run queries >>>>>>>>> against >>>>>>>>> Cassandra / HBase >>>>>>>>> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these >>>>>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs >>>>>>>>> Spark for >>>>>>>>> ETL, which persistent data store to use, and how to query that data >>>>>>>>> store in >>>>>>>>> the backend of the web UI, for displaying the reports). >>>>>>>>> >>>>>>>>> >>>>>>>>> Thanks. >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >> >> >> -- >> Thanks >> Deepak >> www.bigdatabig.com >> www.keosha.net > >
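The "light ETL / standardize" step in the original problem is independent of whether raw consumers or Spark Streaming do the reading: either way, heterogeneous records from the 5-6 producers must be mapped onto one schema before storage. A minimal sketch of what "standardize" might mean here — the field names and target schema are invented for illustration, not taken from the thread:

```python
from datetime import datetime, timezone

# Target schema every producer's record is normalized into (invented for illustration).
STANDARD_FIELDS = ("source", "event_id", "timestamp", "payload")

def standardize(source: str, raw: dict) -> dict:
    """Map one producer's raw record onto the common schema.

    Different producers name the same fields differently; the ETL layer's
    job is to reconcile them before the record is stored.
    """
    event_id = raw.get("id") or raw.get("event_id") or raw.get("uuid")
    ts = raw.get("ts") or raw.get("timestamp") or raw.get("created_at")
    if event_id is None or ts is None:
        raise ValueError(f"record from {source} missing id/timestamp: {raw}")
    return {
        "source": source,
        "event_id": str(event_id),
        "timestamp": datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat(),
        # Everything that is not an identity/time field is carried along as payload.
        "payload": {k: v for k, v in raw.items()
                    if k not in ("id", "event_id", "uuid", "ts", "timestamp", "created_at")},
    }

# Example: two producers, two record shapes, one output schema.
a = standardize("api_a", {"id": 7, "ts": 1475158800, "temp": 21.5})
b = standardize("api_b", {"uuid": "x-1", "created_at": 1475158860, "temp": 22.0})
assert set(a) == set(STANDARD_FIELDS) == set(b)
```

The same function body runs unchanged inside a plain consumer poll loop or a Spark `map`, which is why the storage and failure-handling questions, not the transformation itself, should drive the choice.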
Re: Architecture recommendations for a tricky use case
Its better you use spark's direct stream to ingest from kafka.

On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar wrote:

> I don't think I need a different speed storage and batch storage. Just
> taking in raw data from Kafka, standardizing, and storing it somewhere
> where the web UI can query it, seems like it will be enough.
>
> I'm thinking about:
>
> - Reading data from Kafka via Spark Streaming
> - Standardizing, then storing it in Cassandra
> - Querying Cassandra from the web UI
>
> That seems like it will work. My question now is whether to use Spark
> Streaming to read Kafka, or use Kafka consumers directly.
Re: Architecture recommendations for a tricky use case
Since the inflow is huge, Flume would also need to be run with multiple channels in distributed fashion, so its resource utilization will be high as well.

Thanks
Deepak

On Thu, Sep 29, 2016 at 8:11 PM, Mich Talebzadeh wrote:

> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
>
> You don't need Spark Streaming to read data from Kafka and store it on
> HDFS. It is a waste of resources.
>
> Couple Flume to use Kafka as source and HDFS as sink directly:
>
> KafkaAgent.sources = kafka-sources
> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>
> That will be for your batch layer. To analyse you can directly read the
> HDFS files with Spark, or simply store the data in a database of your
> choice via cron or something. Do not mix your batch layer with your
> speed layer.
>
> Your speed layer will ingest the same data directly from Kafka into
> Spark Streaming, and that will be online or near real time (defined by
> your window).
>
> Then you have a serving layer to present data from both the speed layer
> (the one from SS) and the batch layer.
>
> HTH
>
> Dr Mich Talebzadeh
Re: Architecture recommendations for a tricky use case
I don't think I need a different speed storage and batch storage. Just taking in raw data from Kafka, standardizing, and storing it somewhere where the web UI can query it, seems like it will be enough.

I'm thinking about:

- Reading data from Kafka via Spark Streaming
- Standardizing, then storing it in Cassandra
- Querying Cassandra from the web UI

That seems like it will work. My question now is whether to use Spark Streaming to read Kafka, or use Kafka consumers directly.

On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh wrote:

> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
>
> You don't need Spark Streaming to read data from Kafka and store it on
> HDFS. It is a waste of resources.
>
> Couple Flume to use Kafka as source and HDFS as sink directly:
>
> KafkaAgent.sources = kafka-sources
> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>
> That will be for your batch layer. Do not mix your batch layer with
> your speed layer. Your speed layer will ingest the same data directly
> from Kafka into Spark Streaming, and that will be online or near real
> time (defined by your window).
>
> HTH
>
> Dr Mich Talebzadeh
Re: Architecture recommendations for a tricky use case
- Spark Streaming to read data from Kafka
- Storing the data on HDFS using Flume

You don't need Spark Streaming to read data from Kafka and store it on HDFS. It is a waste of resources.

Couple Flume to use Kafka as source and HDFS as sink directly:

KafkaAgent.sources = kafka-sources
KafkaAgent.sinks.hdfs-sinks.type = hdfs

That will be for your batch layer. To analyse, you can directly read the HDFS files with Spark, or simply store the data in a database of your choice via cron or something. Do not mix your batch layer with your speed layer.

Your speed layer will ingest the same data directly from Kafka into Spark Streaming, and that will be online or near real time (defined by your window).

Then you have a serving layer to present data from both the speed layer (the one from SS) and the batch layer.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 29 September 2016 at 15:15, Ali Akhtar wrote:

> The web UI is actually the speed layer, it needs to be able to query
> the data online, and show the results in real-time.
>
> It also needs a custom front-end, so a system like Tableau can't be
> used, it must have a custom backend + front-end.
>
> Thanks for the recommendation of Flume. Do you think this will work:
>
> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
> - Using Spark to query the data in the backend of the web UI?
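The two-line KafkaAgent snippet above is only a fragment: a working Flume agent also needs a source type, a channel, and source/sink-to-channel wiring. A minimal sketch of what the full agent config might look like — broker address, topic name, channel name, and HDFS path are all assumptions for illustration, and the property names follow the Flume 1.7 KafkaSource (older releases used ZooKeeper-based properties instead):

```properties
# Hypothetical Flume agent coupling Kafka (source) to HDFS (sink).
KafkaAgent.sources  = kafka-source
KafkaAgent.channels = mem-channel
KafkaAgent.sinks    = hdfs-sink

# Kafka source: which brokers and topic to read (assumed values).
KafkaAgent.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
KafkaAgent.sources.kafka-source.kafka.bootstrap.servers = broker1:9092
KafkaAgent.sources.kafka-source.kafka.topics = raw-events
KafkaAgent.sources.kafka-source.channels = mem-channel

# In-memory channel buffering events between source and sink.
KafkaAgent.channels.mem-channel.type = memory
KafkaAgent.channels.mem-channel.capacity = 10000

# HDFS sink: where the batch-layer files land (assumed path).
KafkaAgent.sinks.hdfs-sink.type = hdfs
KafkaAgent.sinks.hdfs-sink.hdfs.path = /data/raw/events/%Y-%m-%d
KafkaAgent.sinks.hdfs-sink.hdfs.fileType = DataStream
KafkaAgent.sinks.hdfs-sink.channel = mem-channel
```

As Deepak notes elsewhere in the thread, at high inflow the memory channel and single agent become the bottleneck; a file channel and multiple agents are the usual scale-out path.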
Re: Architecture recommendations for a tricky use case
For the UI, you need a DB such as Cassandra that is designed to work around queries. Ingest the data into Spark Streaming (speed layer) and write it to HDFS (for the batch layer). Now you have data at rest as well as in motion (real time). From Spark Streaming itself, do further processing and write the final result to Cassandra or another NoSQL DB. The UI can pick the data up from the DB now.

Thanks
Deepak

On Thu, Sep 29, 2016 at 8:00 PM, Alonso Isidoro Roman wrote:

> "Using Spark to query the data in the backend of the web UI?"
>
> Don't do that. I would recommend that the Spark Streaming process store
> data into some NoSQL or SQL database, and that the web UI query data
> from that database.
>
> Alonso Isidoro Roman
> https://about.me/alonso.isidoro.roman

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
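The batch/speed split Deepak describes is the classic Lambda layout, and the serving step it implies reduces to one rule: answer reads from the complete-but-stale batch view, overlaid with the fresh-but-partial speed view. A toy sketch with plain dicts standing in for the HDFS-derived and Spark-Streaming-derived views (the key format is invented for illustration):

```python
def serve(key, batch_view, speed_view):
    """Lambda-style read: prefer the fresh speed-layer value when present,
    fall back to the batch-layer value computed from data at rest."""
    if key in speed_view:
        return speed_view[key]
    return batch_view.get(key)

# Batch view: recomputed periodically from HDFS.
# Speed view: updated per streaming micro-batch, reset when the batch view catches up.
batch_view = {"clicks:page1": 1000, "clicks:page2": 400}
speed_view = {"clicks:page1": 1027}   # newer count not yet folded into the batch view

assert serve("clicks:page1", batch_view, speed_view) == 1027  # speed layer wins
assert serve("clicks:page2", batch_view, speed_view) == 400   # batch-only key
```

If both views live in the same database, as Alonso suggests, this merge can even be done at write time, leaving the web UI with a single plain query.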
RE: Architecture recommendations for a tricky use case
Spark Streaming needs to store the output somewhere. Cassandra is a possible target for that.

-Dave

-Original Message-
From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Thursday, September 29, 2016 9:16 AM
Cc: users@kafka.apache.org; spark users
Subject: Re: Architecture recommendations for a tricky use case
Re: Architecture recommendations for a tricky use case
The web UI is actually the speed layer; it needs to be able to query the data online and show the results in real-time. It also needs a custom front-end, so a system like Tableau can't be used; it must have a custom backend + front-end.

Thanks for the recommendation of Flume. Do you think this will work:

- Spark Streaming to read data from Kafka
- Storing the data on HDFS using Flume
- Using Spark to query the data in the backend of the web UI?
Re: Architecture recommendations for a tricky use case
What is the message inflow? If it's really high, definitely Spark will be of great use.

Thanks
Deepak
Re: Architecture recommendations for a tricky use case
You need a batch layer and a speed layer. Data from Kafka can be stored on HDFS using Flume.

- Query this data to generate reports / analytics (there will be a web UI which will be the front-end to the data, and will show the reports)

This is basically the batch layer, and you need something like Tableau or Zeppelin to query the data.

You will also need Spark Streaming to query data online for the speed layer. That data could be stored in some transient fabric like Ignite or even Druid.

HTH

Dr Mich Talebzadeh
http://talebzadehmich.wordpress.com
Re: Architecture recommendations for a tricky use case
It needs to be able to scale to a very large amount of data, yes.
Architecture recommendations for a tricky use case
I have a somewhat tricky use case, and I'm looking for ideas.

I have 5-6 Kafka producers, reading various APIs, and writing their raw data into Kafka.

I need to:

- Do ETL on the data, and standardize it.
- Store the standardized data somewhere (HBase / Cassandra / raw HDFS / ElasticSearch / Postgres)
- Query this data to generate reports / analytics (there will be a web UI which will be the front-end to the data, and will show the reports)

Java is being used as the backend language for everything (backend of the web UI, as well as the ETL layer).

I'm considering:

- Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive raw data from Kafka, standardize & store it)
- Using Cassandra, HBase, or raw HDFS, for storing the standardized data, and to allow queries
- In the backend of the web UI, either using Spark to run queries across the data (mostly filters), or directly running queries against Cassandra / HBase

I'd appreciate some thoughts / suggestions on which of these alternatives I should go with (e.g. using raw Kafka consumers vs Spark for ETL, which persistent data store to use, and how to query that data store in the backend of the web UI, for displaying the reports).

Thanks.
RE: Re : A specific use case
Thanks Guozhang Wang.

Hamza
Re: Re : A specific use case
Yeah, if you can buffer yourself in the process() function and then rely on punctuate() for generating the outputs, that would resolve your issue. Remember that the punctuate() function itself is event-time driven, so if you do not have any data coming in it may not be triggered.

Details: https://github.com/apache/kafka/pull/1689

Guozhang
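The buffer-in-process(), emit-in-punctuate() pattern described above can be sketched without the Streams library. This is a plain-Java illustration only: the method names mirror the Processor API shape, but the class itself is hypothetical and does not use any Kafka API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Library-free sketch: accumulate per-key counts in process(), flush them in punctuate().
public class BufferingCounter {
    private final Map<String, Long> buffer = new HashMap<>();

    // Called once per incoming record: only buffers, emits nothing.
    public void process(String key) {
        buffer.merge(key, 1L, Long::sum);
    }

    // Called periodically (in real Streams, on the punctuation schedule):
    // drains the buffer and returns one aggregate record per key.
    public List<String> punctuate() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : buffer.entrySet()) {
            out.add(e.getKey() + "=" + e.getValue());
        }
        buffer.clear();
        return out;
    }
}
```

In the real Processor API the punctuate() call is scheduled by the framework and, as noted above, is event-time driven, so it only fires while records are flowing.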
Re : A specific use case
Hi,

Yes, in fact. And I found a solution: it was editing the punctuate() method in the Kafka Streams processor.
Re: A specific use case
Hello Hamza,

By saying "broker" I think you are actually referring to a Kafka Streams instance?

Guozhang
A specific use case
Good morning,

I'm working on a specific use case. I'm receiving messages from an operator network and trying to compute statistics on their number per minute, per hour, per day...

I would like to create a component that receives the messages and generates a message every minute. These produced messages are consumed by a consumer on one hand, and are also sent to another topic, which receives them and generates messages every minute in turn.

I've been trying to do that for a while without success. The problem is that the first component produces a message and sends it to the other topic every time it receives one, instead of once per minute.

My question is: is what I'm trying to do possible without passing through an intermediate Java process outside of Kafka? If yes, how?

Thanks in advance.
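The per-minute statistics described above come down to truncating each message's timestamp to the start of its minute and counting per bucket. A minimal sketch follows (class and method names are assumptions for illustration; timestamps are epoch milliseconds; in Kafka Streams this would correspond to a windowed count):

```java
import java.util.HashMap;
import java.util.Map;

public class MinuteStats {
    // Truncate an epoch-millisecond timestamp to the start of its minute.
    static long minuteBucket(long tsMillis) {
        return tsMillis - (tsMillis % 60_000L);
    }

    private final Map<Long, Long> counts = new HashMap<>();

    // Count one message into its minute bucket.
    void record(long tsMillis) {
        counts.merge(minuteBucket(tsMillis), 1L, Long::sum);
    }

    // Number of messages seen in the minute containing the given timestamp.
    long countFor(long tsMillis) {
        return counts.getOrDefault(minuteBucket(tsMillis), 0L);
    }
}
```

Hourly and daily statistics are the same idea with 3,600,000 ms and 86,400,000 ms bucket sizes.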
Re: Question about a kafka use case : sequence once a partition is added
You can't: you only get a guarantee on the order within each partition, not across partitions. Adding partitions will possibly make it a lot worse, since items with the same key will land in other partitions. For example, with two partitions these will be roughly the hashes in each partition:

partition-0: 0,2,4,6,8
partition-1: 1,3,5,7,9

With three partitions:

partition-0: 0,3,6,9
partition-1: 1,4,7
partition-2: 2,5,8

So items with a key which hashes to 2 will move from partition-0 to partition-2.

I think if you really need to be able to guarantee order, you will need to add a sequential id to the messages and buffer them when reading, but this will have all sorts of drawbacks, like losing the messages in the buffer in case of error (or you must make the committed offset dependent on the buffer).
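The renumbering in the tables above follows directly from `hash(key) mod numPartitions`. A sketch (note: Kafka's default partitioner actually uses murmur2 on the serialized key bytes, not `String.hashCode()`, but the effect of changing the partition count is the same):

```java
public class PartitionMove {
    // Map a key hash to a partition, as hash mod numPartitions.
    static int partitionFor(int keyHash, int numPartitions) {
        return Math.floorMod(keyHash, numPartitions);
    }

    public static void main(String[] args) {
        // Show where each hash 0..9 lands with 2 vs. 3 partitions.
        for (int h = 0; h <= 9; h++) {
            System.out.printf("hash %d: p%d -> p%d%n",
                h, partitionFor(h, 2), partitionFor(h, 3));
        }
    }
}
```

Running it reproduces the move described in the message: a key hashing to 2 sits in partition-0 with two partitions but in partition-2 with three.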
Question about a kafka use case : sequence once a partition is added
I want to use Kafka for notifications of changes to data in a dataservice/database. For each object that changes, a Kafka message will be sent. This is easy and we've got that working, no problem.

Here is my use case: I want to be able to fire up a process that will

1) determine the current location of the Kafka topic (right now we use 2 partitions, so that would be the offset for each partition)
2) do a long-running process that will copy data from the database
3) once the process is over, put the location back into a Kafka consumer and start processing notifications in sequence

This isn't very hard either, but there is a problem we'd face if during step (2) partitions are added to the topic (say by our operations team).

I know we can set up a ConsumerRebalanceListener, but I don't think that will help, because we'd need to go back to a time when we had our original number of partitions, and then we'd need to know exactly when to start reading from the new partition(s).

For example:

start: 2 partitions (0,1) at offsets p0,100 and p1,100

1) we store the offsets and partitions: p0,100 and p1,100
2) we run the db ingest
3) messages are posted to p0 and p1
4) OPS team adds p2, and our ConsumerRebalanceListener would be notified
5) we are done, and we set our consumer to p0,100 and p1,100 (and p2,0 thanks to the ConsumerRebalanceListener)

How would we guarantee the order of messages received from our consumer across all 3 partitions?
Re: My Use case is I want to delete the records instantly after consuming them
Hi All,

Is the number of consumer components equal to the number of partitions created in the cluster? I have created three partitions in the cluster, but I am using only two consumer pollers to subscribe to the records. Sometimes I have noticed that the messages are polled very late. What would a good polling strategy be? Please suggest.

/home/kafka/kafka_2.11-0.9.0.0/bin/kafka-topics.sh --create --zookeeper 172.16.8.216:2181 --replication-factor 3 --partitions 3 --topic EmailOCTracker

Thanks and Regards,
Navneet Kumar
Re: My Use case is I want to delete the records instantly after consuming them
Thank you so much Ian

Thanks and Regards,
Navneet Kumar
Re: My Use case is I want to delete the records instantly after consuming them
That’s really not what Kafka was designed to do. You can set a short log retention period, which will mean messages are deleted relatively soon after they were written to Kafka, but there’s no mechanism for deleting records on consumption.

Ian.

---
Ian Wrigley
Director, Education Services
Confluent, Inc
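The short retention period Ian mentions is the topic-level `retention.ms` setting. A sketch using the same 0.9-era CLI as elsewhere in this thread (the topic name and ZooKeeper address are placeholders, not from the original messages):

```shell
# Retain records for roughly 60 seconds; older log segments are deleted.
# Note: deletion happens per segment on the broker's cleanup schedule, not
# per record, so this only approximates "delete soon after consumption".
kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 \
  --topic short-lived-topic \
  --config retention.ms=60000 --config segment.ms=60000
```

Setting `segment.ms` small as well matters here, because only closed segments are eligible for deletion.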
My Use case is I want to delete the records instantly after consuming them
Hi All,

My use case is that I want to delete the records instantly after consuming them. I am using Kafka 0.9.0.

Thanks and Regards,
Navneet Kumar
Re: Kafka Streams: finding a solution to a particular use case
ld be no problem to read from them and execute a new > stream > > > > > > process, > > > > > > > > right? (like a new joins, counts...). > > > > > > > > > > > > > > > > Thanks!! > > > > > > > > > > > > > > > > 2016-04-15 17:37 GMT+02:00 Guozhang Wang >: > > > > > > > > > > > > > > > > > 1) There are three types of joins for KTable-KTable join, > the > > > > > follow > > > > > > > the > > > > > > > > > same semantics in SQL joins: > > > > > > > > > > > > > > > > > > KTable.join(KTable): when there is no matching record from > > > inner > > > > > > table > > > > > > > > when > > > > > > > > > received a new record from outer table, no output; and vice > > > > versa. > > > > > > > > > KTable.leftjoin(KTable): when there is no matching record > > from > > > > > inner > > > > > > > > table > > > > > > > > > when received a new record from outer table, output (a, > > null); > > > on > > > > > the > > > > > > > > other > > > > > > > > > direction no output. > > > > > > > > > KTable.outerjoin(KTable): when there is no matching record > > from > > > > > > inner / > > > > > > > > > outer table when received a new record from outer / inner > > > table, > > > > > > output > > > > > > > > (a, > > > > > > > > > null) or (null, b). > > > > > > > > > > > > > > > > > > > > > > > > > > > 2) The result topic is also a changelog topic, although it > > will > > > > be > > > > > > log > > > > > > > > > compacted on the key over time, if you consume immediately > > the > > > > log > > > > > > may > > > > > > > > not > > > > > > > > > be yet compacted. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > Guozhang > > > > > > > > > > > > > > > > > > On Fri, Apr 15, 2016 at 2:11 AM, Guillermo Lammers Corral < > > > > > > > > > guillermo.lammers.cor...@tecsisa.com> wrote: > > > > > > > > > > > > > > > > > > > Hi Guozhang, > > > > > > > > > > > > > > > > > > > > Thank you very much for your reply and sorry for the > > generic > > > > > > > question, > > > > > > > > > I'll > > > > > > > > > > try to explain with some pseudocode. > > > > > > > > > > > > > > > > > > > > I have two KTable with a join: > > > > > > > > > > > > > > > > > > > > ktable1: KTable[String, String] = builder.table("topic1") > > > > > > > > > > ktable2: KTable[String, String] = builder.table("topic2") > > > > > > > > > > > > > > > > > > > > result: KTable[String, ResultUnion] = > > > > > > > > > > ktable1.join(ktable2, (data1, data2) => new > > > ResultUnion(data1, > > > > > > > data2)) > > > > > > > > > > > > > > > > > > > > I send the result to a topic result.to("resultTopic"). > > > > > > > > > > > > > > > > > > > > My questions are related with the following scenario: > > > > > > > > > > > > > > > > > > > > - The streming is up & running without data in topics > > > > > > > > > > > > > > > > > > > > - I send data to "topic2", for example a key/value like > > that > > > > > > > > > ("uniqueKey1", > > > > > > > > > > "hello") > > >
Re: Kafka Streams: finding a solution to a particular use case
om outer table, output (a, null); in the other direction there is no output.
KTable.outerjoin(KTable): when there is no matching record from the inner /
outer table when a new record is received from the outer / inner table,
output (a, null) or (null, b).

2) The result topic is also a changelog topic. Although it will be log
compacted on the key over time, if you consume it immediately the log may
not yet be compacted.

Guozhang

On Fri, Apr 15, 2016 at 2:11 AM, Guillermo Lammers Corral <
guillermo.lammers.cor...@tecsisa.com> wrote:

> Hi Guozhang,
>
> Thank you very much for your reply, and sorry for the generic question;
> I'll try to explain with some pseudocode.
>
> I have two KTables with a join:
>
> ktable1: KTable[String, String] = builder.table("topic1")
> ktable2: KTable[String, String] = builder.table("topic2")
>
> result: KTable[String, ResultUnion] =
>   ktable1.join(ktable2, (data1, data2) => new ResultUnion(data1, data2))
>
> I send the result to a topic with result.to("resultTopic").
>
> My questions relate to the following scenario:
>
> - The streaming app is up and running without data in the topics.
>
> - I send data to "topic2", for example a key/value pair like
>   ("uniqueKey1", "hello").
>
> - I see null values in topic "resultTopic", i.e. ("uniqueKey1", null).
>
> - If I then send data to "topic1", for example ("uniqueKey1", "world"),
>   I see ("uniqueKey1", ResultUnion("hello", "world")) in "resultTopic".
>
> Q: If we send data for one of the KTables that has no corresponding data
> by key in the other one, is a null value in the final result topic the
> expected behavior?
>
> My next step would be to use Kafka Connect to persist the result data in
> C* (I have not yet read the Connector docs); is this the way to do it?
> (I mean preparing the data in the topic.)
>
> Q: On the other hand, just to experiment, I have a KTable that reads
> messages from "resultTopic" and prints them. If the stream is a KTable,
> I am wondering why it gets all the values from the topic, even those
> with the same key?
>
> Thanks in advance! Great job answering the community!
>
> 2016-04-14 20:00 GMT+02:00 Guozhang Wang <wangg...@gmail.com>:
>
> > Hi Guillermo,
> >
> > 1) Yes, in your case the streams are really "changelog" streams, hence
> > you should create them as KTables and do a KTable-KTable join.
> >
> > 2) Could you elaborate on "achieving this"? What behavior do you
> > require in the application logic?
> >
> > Guozhang
> >
> > On Thu, Apr 14, 2016 at 1:30 AM, Guillermo Lammers Corral <
> > guillermo.lammers.cor...@tecsisa.com> wrote:
> >
> > > Hi,
> > >
> > > I am a newbie to Kafka Streams and I am trying to use it to solve a
> > > particular use case. Let me explain.
> > >
> > > I have two sources of data, both shaped like this:
> > >
> > > Key (string)
> > > DateTime (hourly granularity)
> > > Value
> > >
> > > I need to join the two sources by key and date (hour of day) to
> > > obtain:
> > >
> > > Key (string)
> > > DateTime (hourly granularity)
> > > ValueSource1
> > > ValueSource2
> > >
> > > I think that first I'd need to push the messages into Kafka topics
> > > with the date as part of the key, because I'll group by key taking
> > > the date into account. So maybe the key must be a new string like
> > > key_timestamp. But of course that is not the main problem; it is
> > > just an additional explanation.
> > >
> > > OK, so the data are in topics; here we go!
> > >
> > > - Multiple records are allowed per key, but only the latest value
> > >   for a record key will be considered. I should use two KTables
> > >   with some join strategy, right?
> > >
> > > - Data from both sources could arrive at any time. What can I do to
> > >   handle this?
> > >
> > > Thanks in advance.

--
-- Guozhang
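The join semantics discussed in this thread can be illustrated with a small simulation. This is a sketch in Python, not Kafka Streams code: the function `ktable_join`, the `(side, key, value)` record layout, and the mode names are invented for illustration only. It models the emission rules described above and shows how a left or outer join can emit a (key, null) result when only one side has data, i.e. the shape of the ("uniqueKey1", null) record observed in the thread.

```python
# Minimal simulation of KTable-KTable join semantics (NOT Kafka Streams code).
# Each update is applied to a per-side table, then a result record is emitted
# according to the join mode, mirroring the rules described in the thread.

def ktable_join(updates, mode="inner"):
    """updates: list of (side, key, value), where side is 'left' or 'right'.
    Returns the records emitted to the result changelog, in order."""
    left, right = {}, {}
    emitted = []
    for side, key, value in updates:
        (left if side == "left" else right)[key] = value
        l, r = left.get(key), right.get(key)
        if mode == "inner":
            # Emit only when both sides have a value for the key.
            if l is not None and r is not None:
                emitted.append((key, (l, r)))
        elif mode == "left":
            # Left-side updates always emit (r may be None -> (a, null));
            # right-side updates emit only when the left side matches.
            if side == "left" or l is not None:
                emitted.append((key, (l, r)))
        elif mode == "outer":
            # Either side may be missing -> (a, null) or (null, b).
            emitted.append((key, (l, r)))
    return emitted

updates = [("right", "uniqueKey1", "hello"),   # only topic2 has data
           ("left", "uniqueKey1", "world")]    # now topic1 matches
print(ktable_join(updates, "inner"))
# -> [('uniqueKey1', ('world', 'hello'))]
print(ktable_join(updates, "outer"))
# -> [('uniqueKey1', (None, 'hello')), ('uniqueKey1', ('world', 'hello'))]
```

Note that under these rules a plain inner join emits nothing until both sides have a value for the key, so seeing nulls from `join` (rather than from `leftJoin`/`outerJoin`) is worth double-checking against the DSL method actually used.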
Re: Kafka Streams: finding a solution to a particular use case
Related to the log compaction question ("it will be log compacted on the key over time"): how do we control when log compaction happens? And in the log compaction implementation, is the structure used to map a key to its latest value kept in memory or on disk?

On Tue, Apr 19, 2016 at 8:58 AM, Guillermo Lammers Corral <guillermo.lammers.cor...@tecsisa.com> wrote:
> [...]
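A note on the two compaction questions above, as a sketch rather than a definitive answer: compaction is not scheduled directly. The log cleaner only compacts rolled (inactive) segments, and its timing is governed by topic and broker configs. The segments themselves live on disk, while the cleaner builds its key-to-latest-offset map in memory (bounded by the broker setting log.cleaner.dedupe.buffer.size). The settings below are real topic-level Kafka configs; the values are only illustrative:

```properties
# Enable log compaction for this topic
cleanup.policy=compact
# Force the active segment to roll after 7 days; only rolled segments are cleaned
segment.ms=604800000
# Clean once at least 50% of the log is uncompacted ("dirty")
min.cleanable.dirty.ratio=0.5
# Keep delete tombstones for 1 day after compaction
delete.retention.ms=86400000
```

Lowering segment.ms and min.cleanable.dirty.ratio makes compaction kick in sooner, at the cost of more cleaner I/O.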
Re: Kafka Streams: finding a solution to a particular use case
Hello,

Thanks again for your reply :)

1) In my example, when I send a record from the outer table and there is no matching record in the inner table, I still receive data in the output topic, and vice versa. I am trying it with the topics empty at the first execution. How is this possible?

Why don't KTable joins support windowing strategies? I think I need them for this use case, what do you think?

2) What does that mean? Even though the log may not be compacted yet, there should be no problem reading from it and running a new stream process, right? (like new joins, counts...)

Thanks!!

2016-04-15 17:37 GMT+02:00 Guozhang Wang :
> [...]
Re: Kafka Streams: finding a solution to a particular use case
1) There are three types of KTable-KTable joins; they follow the same semantics as SQL joins:

KTable.join(KTable): when a new record received from the outer table has no matching record in the inner table, there is no output; and vice versa.

KTable.leftJoin(KTable): when a new record received from the outer table has no matching record in the inner table, the output is (a, null); in the other direction there is no output.

KTable.outerJoin(KTable): when a new record received from the outer/inner table has no matching record in the inner/outer table, the output is (a, null) or (null, b).

2) The result topic is also a changelog topic; although it will be log compacted on the key over time, if you consume it immediately the log may not be compacted yet.

Guozhang

On Fri, Apr 15, 2016 at 2:11 AM, Guillermo Lammers Corral <guillermo.lammers.cor...@tecsisa.com> wrote:
> [...]

-- 
-- Guozhang
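The three join variants described in this thread can be sketched, independently of Kafka, as operations over two already-materialized tables. This is a minimal simulation of the output rules, not Kafka Streams API code; the class and method names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class JoinSemantics {

    // join: output only for keys present on both sides (inner-join semantics)
    static Map<String, String[]> innerJoin(Map<String, String> left, Map<String, String> right) {
        Map<String, String[]> out = new HashMap<>();
        left.forEach((k, l) -> {
            if (right.containsKey(k)) out.put(k, new String[] { l, right.get(k) });
        });
        return out;
    }

    // leftJoin: every left-side key produces output; a missing right side is null
    static Map<String, String[]> leftJoin(Map<String, String> left, Map<String, String> right) {
        Map<String, String[]> out = new HashMap<>();
        left.forEach((k, l) -> out.put(k, new String[] { l, right.get(k) }));
        return out;
    }

    // outerJoin: every key on either side produces output, null on the missing side
    static Map<String, String[]> outerJoin(Map<String, String> left, Map<String, String> right) {
        Map<String, String[]> out = leftJoin(left, right);
        right.forEach((k, r) -> {
            if (!left.containsKey(k)) out.put(k, new String[] { null, r });
        });
        return out;
    }
}
```

In the real API each join output is itself a changelog, so every table update re-evaluates the join for that key; the maps above only show the per-key output rule.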
Re: Kafka Streams: finding a solution to a particular use case
Hi Guozhang,

Thank you very much for your reply, and sorry for the generic question; I'll try to explain with some pseudocode.

I have two KTables with a join:

ktable1: KTable[String, String] = builder.table("topic1")
ktable2: KTable[String, String] = builder.table("topic2")

result: KTable[String, ResultUnion] =
  ktable1.join(ktable2, (data1, data2) => new ResultUnion(data1, data2))

I send the result to a topic with result.to("resultTopic").

My questions are related to the following scenario:

- The streaming app is up & running without data in the topics.

- I send data to "topic2", for example a key/value pair like ("uniqueKey1", "hello").

- I see null values in topic "resultTopic", i.e. ("uniqueKey1", null).

- If I then send data to "topic1", for example a key/value pair like ("uniqueKey1", "world"), I see these values in topic "resultTopic": ("uniqueKey1", ResultUnion("hello", "world")).

Q: If we send data for one of the KTables that has no corresponding data by key in the other one, is getting null values in the final result topic the expected behavior?

My next step would be to use Kafka Connect to persist the result data in C* (I have not read the Connector docs yet...); is this the way to do it? (I mean, prepare the data in the topic.)

Q: On the other hand, just to try, I have a KTable that reads the messages in "resultTopic" and prints them. If the stream is a KTable, I am wondering why it gets all the values from the topic, even those with the same key?

Thanks in advance! Great job answering the community!

2016-04-14 20:00 GMT+02:00 Guozhang Wang :
> [...]
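Regarding the last question in this thread (why the KTable built on "resultTopic" sees all the values, even repeated keys): the result topic is a changelog, so it physically retains every record until log compaction runs, while a KTable only materializes the latest value per key. A minimal sketch of that materialization, with illustrative names and assuming a null value acts as a delete (tombstone):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChangelogTable {

    // One changelog record: a key and a value, where a null value is a tombstone.
    record Entry(String key, String value) {}

    // Replay a changelog: later records for a key overwrite earlier ones and a
    // tombstone removes the key. This is the view a KTable materializes, even
    // though the underlying topic still holds every record until compaction.
    static Map<String, String> materialize(List<Entry> changelog) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Entry e : changelog) {
            if (e.value() == null) {
                table.remove(e.key());
            } else {
                table.put(e.key(), e.value());
            }
        }
        return table;
    }
}
```

So a consumer that simply prints "resultTopic" sees the whole changelog, while the materialized table keeps only the last update per key.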
Re: Kafka Streams: finding a solution to a particular use case
Hi Guillermo,

1) Yes, in your case the streams are really "changelog" streams, hence you should create them as KTables and do a KTable-KTable join.

2) Could you elaborate on "achieving this"? What behavior do you require in the application logic?

Guozhang

On Thu, Apr 14, 2016 at 1:30 AM, Guillermo Lammers Corral <guillermo.lammers.cor...@tecsisa.com> wrote:

> Hi,
>
> I am a newbie to Kafka Streams and I am trying to use it to solve a
> particular use case. Let me explain.
>
> I have two sources of data, both like this:
>
> Key (string)
> DateTime (hourly granularity)
> Value
>
> I need to join the two sources by key and date (hour of day) to obtain:
>
> Key (string)
> DateTime (hourly granularity)
> ValueSource1
> ValueSource2
>
> I think that first I'd need to push the messages into the Kafka topics
> with the date as part of the key, because I'll group by key taking the
> date into account. So maybe the key must be a new string like
> key_timestamp. But of course that is not the main problem, it is just an
> additional explanation.
>
> Ok, so the data are in topics, here we go!
>
> - Multiple records are allowed per key, but only the latest value for a
> record key will be considered. I should use two KTables with some join
> strategy, right?
>
> - Data from both sources could arrive at any time. What can I do to
> achieve this?
>
> Thanks in advance.

-- 
-- Guozhang
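The key_timestamp idea from the original question can be sketched as a small helper that buckets the record timestamp to the hour and appends it to the key, so the two sources meet on the same composite key per hour before the KTable-KTable join. The helper name and key format are illustrative, not part of any Kafka API:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class HourlyKey {

    // Illustrative composite key: the original key plus the epoch second of
    // its hour bucket, so records from both sources with the same key and
    // hour land on the same changelog key.
    static String compositeKey(String key, Instant timestamp) {
        Instant hour = timestamp.truncatedTo(ChronoUnit.HOURS);
        return key + "_" + hour.getEpochSecond();
    }
}
```

With such a key, the latest value per (key, hour) wins in each KTable, which matches the "only the latest value for a record key will be considered" requirement above.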