Hi Chen Song,

Sorry for the late reply. What you describe is a typical bootstrap use case.
Check http://samza.apache.org/learn/documentation/0.9/container/streams.html,
specifically the bootstrap stream configuration. With it, Samza always reads
*topicR* from the beginning when it starts or restarts, and only treats
*topicR* as a normal stream after it has consumed all of the existing messages
in *topicR*. That also covers your blocking question: no messages from
*topicD* are processed until the bootstrap of *topicR* has completed.
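
A minimal sketch of that configuration (assuming *topicR* lives on a system
named "kafka" in your job config; adjust the system and stream names to your
setup):

    # Fully consume topicR before any other input stream is processed
    systems.kafka.streams.topicR.samza.bootstrap=true
    # On every container start, ignore the checkpoint and rewind to the oldest offset
    systems.kafka.streams.topicR.samza.reset.offset=true
    systems.kafka.streams.topicR.samza.offset.default=oldest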


== Can we configure each individual Samza task to read data from all partitions
of a topic?
This works in 0.10.0 by using a broadcast stream. In 0.9.0, you have to
"create topicR with the same number of partitions as *topicD*, and replicate
data to all partitions".
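
In 0.10.0, the broadcast stream is configured with task.broadcast.inputs. A
sketch, assuming *topicR* has four partitions on a system named "kafka" (the
#[0-3] range makes every task consume all four partitions):

    # every task consumes partitions 0-3 of topicR, in addition to its own topicD partition
    task.broadcast.inputs=kafka.topicR#[0-3]

For your steps 2 and 3 (caching *topicR* in each task's local storage), the
task side could look roughly like the sketch below. This is only an
illustration: JoinTask, the "reference-cache" store name, and the String
key/value types are placeholders, and the store itself still has to be
declared in the job config (stores.reference-cache.factory=..., etc.).

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    public class JoinTask implements StreamTask, InitableTask {
      private KeyValueStore<String, String> referenceCache;

      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        // getStore() returns Object, so a cast to the configured store type is needed
        referenceCache = (KeyValueStore<String, String>) context.getStore("reference-cache");
      }

      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        String stream = envelope.getSystemStreamPartition().getStream();
        if ("topicR".equals(stream)) {
          // CDC update from MySQL: refresh or delete the cached row
          if (envelope.getMessage() == null) {
            referenceCache.delete((String) envelope.getKey());
          } else {
            referenceCache.put((String) envelope.getKey(), (String) envelope.getMessage());
          }
        } else {
          // topicD message: look up the cached reference row and process
          String referenceRow = referenceCache.get((String) envelope.getKey());
          // ... your processing logic here ...
        }
      }
    }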


Hope this still helps.


Thanks,
Yan


At 2015-10-22 04:44:41, "Chen Song" <chen.song...@gmail.com> wrote:
>In our Samza app, we need to read data from MySQL (a reference table)
>alongside a stream. So the requirements are:
>
>* Read data into each Samza task before processing any message.
>* The Samza task should be able to listen to updates happening in MySQL.
>
>I did some research, scanning through relevant conversations and JIRAs in
>the community, but did not find a solution yet, nor a recommended way to do
>this.
>
>If my data stream comes from a topic called *topicD*, the options in my mind
>are:
>
>   - Use Kafka
>      1. Use one of the CDC-based solutions to replicate the data in MySQL to
>      a Kafka topic: https://github.com/wushujames/mysql-cdc-projects/wiki.
>      Say the topic is called *topicR*.
>      2. In my Samza app, read the reference table from *topicR* and persist
>      it in a cache in each Samza task's local storage.
>         - If the data in *topicR* is NOT partitioned in the same way as
>         *topicD*, can we configure each individual Samza task to read data
>         from all partitions of a topic?
>         - If the answer to the above question is no, do I need to create
>         *topicR* with the same number of partitions as *topicD*, and
>         replicate data to all partitions?
>         - On start, how do we make the Samza task block processing of the
>         first message from *topicD* until all data from *topicR* has been
>         read?
>      3. Any new updates/deletes to *topicR* will be consumed to update the
>      local cache of each Samza task.
>      4. On failure or restarts, each Samza task will read *topicR* from the
>      beginning.
>   - Not use Kafka
>      - Each Samza task reads a snapshot of the database and builds its local
>      cache, and then needs to re-read periodically to update that cache. I
>      have read a few blogs, and this doesn't sound like a solid approach in
>      the long term.
>
>Any thoughts?
>
>Chen
>
>
>-- 
>Chen Song
