>> I'd like to avoid repeated trips to the db, and caching a large amount of
>> data in memory.

Lookups against the DB would be hard to get right anyway. I.e., they
would not perform well, as all your calls would need to be synchronous...


>> Is it possible to send a message w/ the id as the partition key to a topic,
>> and then use the same id as the key, so the same node which will receive
>> the data for an id is the one which will process it?

That is what I proposed (maybe it was not clear). If you use Connect,
you can just import the ID into Kafka and leave the value empty (i.e.,
null). This reduces your cached data to a minimum. And the KStream-KTable
join works as you described it :)


-Matthias

On 4/27/17 2:37 PM, Ali Akhtar wrote:
> I'd like to avoid repeated trips to the db, and caching a large amount of
> data in memory.
> 
> Is it possible to send a message w/ the id as the partition key to a topic,
> and then use the same id as the key, so the same node which will receive
> the data for an id is the one which will process it?
> 
> 
> On Fri, Apr 28, 2017 at 2:32 AM, Matthias J. Sax <matth...@confluent.io>
> wrote:
> 
>> The recommended solution would be to use Kafka Connect to load your DB
>> data into a Kafka topic.
>>
>> With Kafka Streams, you read your db-topic as a KTable and do an (inner)
>> KStream-KTable join to look up the IDs.
>>
>>
>> -Matthias
>>
>> On 4/27/17 2:22 PM, Ali Akhtar wrote:
>>> I have a Kafka topic which will receive a large amount of data.
>>>
>>> This data has an 'id' field. I need to look up the id in an external db,
>>> see if we are tracking that id, and if yes, we process that message; if
>>> not, we ignore it.
>>>
>>> 99% of the data will be for ids which are not being tracked - 1% or so
>>> will be for ids which are tracked.
>>>
>>> My concern is that there'd be a lot of round trips to the db made just
>>> to check the id, and whether it'd be better to cache the ids being
>>> tracked somewhere, so other ids are ignored.
>>>
>>> I was considering sending a message to another (or the same) topic
>>> whenever a new id is added to the track list, and that id should then
>>> get processed on the node which will process the messages.
>>>
>>> Should I just cache all ids on all nodes (which may be a large amount),
>>> or is there a way to only cache the id on the same Kafka Streams node
>>> which will receive data for that id?
>>>
>>
>>
> 
