Hi folks,

I am building a streaming job that processes HTTP requests from our web site and
enriches them with additional information.

For this I have a lookup table that contains session ids from historic requests
and the emails that were used within these sessions in the past:


    lookup - hashtable:     session_id: String => emails: Set[String]


During processing of each NEW http request:

- the lookup table should be used to fetch the previous emails and enrich the
current stream item
- new candidates for the lookup table will be discovered while processing these
items and should be added to the lookup table (these changes should also be
visible across the cluster)
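To make the per-record logic concrete, here is a minimal sketch in plain Java
(Enricher, Request and the in-memory HashMap are all placeholders of mine; the
real shared lookup store is exactly the open question):

```java
import java.util.*;

// Sketch of the per-record enrich-and-update step. The HashMap stands in
// for whatever the cluster-wide lookup store ends up being.
class Enricher {
    // session_id -> emails seen so far for that session
    private final Map<String, Set<String>> lookup = new HashMap<>();

    static class Request {
        final String sessionId;
        final String email;
        Request(String sessionId, String email) {
            this.sessionId = sessionId;
            this.email = email;
        }
    }

    // Returns the emails previously seen for this session (the enrichment),
    // then records the request's own email as a new lookup-table candidate.
    Set<String> enrich(Request req) {
        Set<String> previous =
            new HashSet<>(lookup.getOrDefault(req.sessionId, Collections.emptySet()));
        if (req.email != null) {
            lookup.computeIfAbsent(req.sessionId, k -> new HashSet<>())
                  .add(req.email);
        }
        return previous;
    }
}
```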

I see at least the following issues:

(1) loading the state as a whole from the data store into memory burns a huge
amount of memory (and making changes visible cluster-wide is an issue as well)

(2) not loading into memory but using something like Cassandra / Redis as an
external lookup store would certainly work, but introduces a lot of network
requests (possible ideas: use a distributed cache? broadcast updates within the
Flink cluster?)
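To make the "broadcast updates" idea from (2) concrete, a toy sketch of what I
have in mind (nothing Flink-specific; the peer list just stands in for a real
broadcast channel, and all names are placeholders):

```java
import java.util.*;

// Each worker keeps a local copy of the lookup table; newly discovered
// emails are applied locally and fanned out to all other workers.
class BroadcastLookup {
    private final Map<String, Set<String>> local = new HashMap<>();
    private final List<BroadcastLookup> peers = new ArrayList<>();

    void connect(BroadcastLookup peer) { peers.add(peer); }

    Set<String> get(String sessionId) {
        return local.getOrDefault(sessionId, Collections.emptySet());
    }

    // apply an update to the local copy (called when a broadcast arrives)
    void apply(String sessionId, String email) {
        local.computeIfAbsent(sessionId, k -> new HashSet<>()).add(email);
    }

    // discover a new candidate: update locally, then broadcast to all peers
    void discover(String sessionId, String email) {
        apply(sessionId, email);
        for (BroadcastLookup peer : peers) peer.apply(sessionId, email);
    }
}
```

This keeps lookups local at the cost of duplicating the table on every node,
which loops back to problem (1).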

(3) how should I integrate the changes to the table with Flink's checkpointing?

I really don't see how to solve this best, and my current solution is far from
elegant...

So, is there any best practice for supporting "large lookup tables that change
during stream processing"?

Cheers
Peter



