In our Samza app, we need to join a stream against reference data stored in
a MySQL table. The requirements are:

* Read data into each Samza task before processing any message.
* The Samza task should be able to pick up subsequent updates made in MySQL.

I did some research, scanning through relevant conversations and JIRAs in
the community, but did not find a solution, nor a recommended way to do
this.

If my data stream comes from a topic called *topicD*, the options in my
mind are:

   - Use Kafka
      1. Use a CDC-based solution to replicate the data in MySQL to a
      Kafka topic (see https://github.com/wushujames/mysql-cdc-projects/wiki).
      Say the topic is called *topicR*.
      2. In my Samza app, read the reference table from *topicR* and persist
      it in a cache in each Samza task's local storage.
         - If the data in *topicR* is NOT partitioned in the same way as
         *topicD*, can we configure each individual Samza task to read data
         from all partitions of a topic?
         - If the answer to the above question is no, do I need to create
         *topicR* with the same number of partitions as *topicD*, and
         replicate the data to all partitions?
         - On startup, how do I make a Samza task block processing the first
         message from *topicD* until it has read all data from *topicR*?
      3. Any new updates/deletes in *topicR* will be consumed to update the
      local cache of each Samza task.
      4. On failure or restart, each Samza task will re-read *topicR* from
      the beginning.
   - Not use Kafka
      - Each Samza task reads a snapshot of the database to build its local
      cache, and then polls the database periodically to keep that cache up
      to date. I have read a few blogs about this, and it doesn't sound like
      a solid approach in the long term.
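
For the Kafka option, the setup I have in mind looks roughly like the
sketch below. It assumes Samza's bootstrap-stream support works the way I
understand it from the docs (topicR is read to its head before any topicD
messages are delivered, and is re-read from the beginning on restart) --
please correct me if these property names or semantics are off:

    # Consume both topics from the same Kafka system
    task.inputs=kafka.topicD,kafka.topicR

    # Treat topicR as a bootstrap stream, so (I assume) the task catches
    # up on all reference data before processing topicD
    streams.topicR.samza.bootstrap=true

    # Re-read topicR from the beginning on startup/restart
    streams.topicR.samza.offset.default=oldest
    streams.topicR.samza.reset.offset=true

If something like this is the recommended approach, my remaining question
is mainly the partitioning one above.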

Any thoughts?

Chen

-- 
Chen Song
