Well written doc. I think I can understand every point even with limited knowledge of Samza :)
A few comments: 1) stateful_job.png Could you make the changelog streams into the box of tasks to demo they are collocated? And could we also add the KV stores into the box? At first glance I thought the changelog is actually the store. Also the name "changelog stream" is a little confusing. At first I was wondering why this is called a stream and would it be just a "log". I only get this while reading the fault tolerance section. 2) "This does not impact consistency—a task always reads what it wrote (since it checks the cache first)" I think another main reason is that a task only read/write its own partition. 3) Do we need guarantee consistency between the incoming stream and the change log stream. For example, the change log should indicate that this change entry is for the state processed until offset X of the incoming stream so that on failover we know where we should set the offset of the incoming stream? 4) Would reading directly from LevelDB a better approach than replaying the whole (although compacted) changelog? Writes to LevelDB should be persistent, and we only need to make sure we have checkpoints, for example, the current state (excluding writes that are still in cache) represent all the changes till the offset of the changelog? Guozhang ________________________________________ From: Jay Kreps [[email protected]] Sent: Thursday, September 05, 2013 5:14 PM To: [email protected] Subject: state management docs I took a pass at improving the state management documentation (talking to people, I don't think anyone understood what we were saying): http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html I would love to get some feedback on this, especially from anyone who doesn't already know Samza. Does this make any sense? Does it tell you what you need to know in the right order? -Jay
