Hi Chen Song,
Sorry for the late reply. What you describe is a typical bootstrap use case. See the bootstrap configuration at http://samza.apache.org/learn/documentation/0.9/container/streams.html . With it, Samza always reads *topicR* from the beginning when it restarts, and then treats *topicR* as a normal topic once the existing messages in *topicR* have been consumed.

== can we configure each individual Samza task to read data from all partitions from a topic?

This works in 0.10.0 by using a broadcast stream. In 0.9.0, you have to "create *topicR* with the same number of partitions as *topicD*, and replicate data to all partitions".

Hope this still helps.

Thanks,
Yan

At 2015-10-22 04:44:41, "Chen Song" <chen.song...@gmail.com> wrote:
>In our Samza app, we need to read data from MySQL (a reference table) along
>with a stream. So the requirements are:
>
>* Read data into each Samza task before processing any message.
>* The Samza task should be able to listen to updates happening in MySQL.
>
>I did some research after scanning through some relevant conversations and
>JIRAs in the community but did not find a solution yet, nor a recommended
>way to do this.
>
>If my data stream comes from a topic called *topicD*, the options in my
>mind are:
>
>   - Use Kafka
>      1. Use a CDC-based solution to replicate data from MySQL to a
>      Kafka topic. https://github.com/wushujames/mysql-cdc-projects/wiki.
>      Say the topic is called *topicR*.
>      2. In my Samza app, read the reference table from *topicR* and persist
>      it in a cache in each Samza task's local storage.
>         - If the data in *topicR* is NOT partitioned in the same way as
>         *topicD*, can we configure each individual Samza task to read data
>         from all partitions of a topic?
>         - If the answer to the above question is no, do I need to create
>         *topicR* with the same number of partitions as *topicD*, and
>         replicate data to all partitions?
>         - On start, how do I make a Samza task block processing the first
>         message from *topicD* until it has read all data from *topicR*?
>      3. Any new updates/deletes to *topicR* will be consumed to update the
>      local cache of each Samza task.
>      4. On failure or restarts, each Samza task will read *topicR* from the
>      beginning.
>   - Not use Kafka
>      - Each Samza task reads a snapshot of the database and builds its
>      local cache, then periodically re-reads the database to update that
>      cache. I have read a few blogs about this, and it doesn't sound like a
>      solid approach in the long term.
>
>Any thoughts?
>
>Chen
>
>--
>Chen Song
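
P.S. For reference, the two configurations I mentioned might look roughly like this. This is only a sketch: I am assuming your Kafka system is registered as "kafka" in the job config and, for the broadcast example, that *topicR* has 4 partitions; adjust both to your setup.

```properties
# Bootstrap stream (0.9+): topicR is fully consumed before messages from
# other input streams are processed, and is re-read from the beginning
# on every restart because the checkpointed offset is reset.
systems.kafka.streams.topicR.samza.bootstrap=true
systems.kafka.streams.topicR.samza.reset.offset=true
systems.kafka.streams.topicR.samza.offset.default=oldest

# Broadcast stream (0.10.0): every task consumes the listed partitions of
# topicR, so topicR does not need to be partitioned the same way as topicD.
task.broadcast.inputs=kafka.topicR#[0-3]
```

With the broadcast approach, each task sees all of *topicR* regardless of how *topicD* is partitioned, which avoids having to replicate the reference data into every partition.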