In our Samza app, we need to read reference data from MySQL (a reference table) alongside a stream. So the requirements are:
* Read the reference data into each Samza task before processing any message.
* The Samza task should be able to listen to updates happening in MySQL.

I did some research, scanning through relevant conversations and JIRAs in the community, but did not find a solution yet, nor a recommended way to do this. If my data stream comes from a topic called *topicD*, the options in my mind are:

- Use Kafka
  1. Use a CDC-based solution to replicate the data in MySQL to a Kafka topic (see https://github.com/wushujames/mysql-cdc-projects/wiki). Say the topic is called *topicR*.
  2. In my Samza app, read the reference table from *topicR* and persist it in a cache in each Samza task's local storage.
     - If the data in *topicR* is NOT partitioned the same way as *topicD*, can we configure each individual Samza task to read data from all partitions of a topic?
     - If the answer to the above question is no, do I need to create *topicR* with the same number of partitions as *topicD* and replicate the data to all partitions?
     - On start, how do I make a Samza task block processing the first message from *topicD* until it has read all data from *topicR*?
  3. Any new updates/deletes in *topicR* will be consumed to update the local cache of each Samza task.
  4. On failure or restart, each Samza task will read *topicR* from the beginning.
- Not use Kafka
  - Each Samza task reads a snapshot of the database and builds its local cache, then periodically re-reads the database to update it. I have read a few blogs, and this doesn't sound like a solid approach in the long term.

Any thoughts?

Chen

--
Chen Song
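For what it's worth, a rough sketch of how the Kafka option might look in Samza job config, using Samza's bootstrap-stream and changelog-backed store features (topic and store names here are assumptions taken from this mail, not a tested setup):

```
# Consume both the data stream and the CDC-replicated reference topic
task.inputs=kafka.topicD,kafka.topicR

# Bootstrap stream: Samza reads topicR to its current head before
# processing any messages from other inputs (addresses the "block on
# start" question), and always restarts it from the oldest offset
# (addresses the "re-read on restart" question)
streams.topicR.samza.bootstrap=true
streams.topicR.samza.reset.offset=true
streams.topicR.samza.offset.default=oldest

# Local cache as a RocksDB key-value store; the changelog topic lets
# Samza restore the store on container failure ("ref-cache" is a
# hypothetical store name)
stores.ref-cache.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.ref-cache.changelog=kafka.ref-cache-changelog
stores.ref-cache.key.serde=string
stores.ref-cache.msg.serde=json
```

On the "read all partitions" question: newer Samza versions also have broadcast streams (e.g. `task.broadcast.inputs=kafka.topicR#[0-3]`), which deliver the listed partitions to every task, so *topicR* would not need to be co-partitioned with *topicD*. Worth checking whether that feature is available in your Samza version.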