Thanks for the interest! Responses inline. > # Do you plan to automate the Samza config generation portion, so that you > can run the full flow from a single command?
Yes, definitely. The config generation is just there so I could keep the existing workflow mostly unchanged, but it's easy to swap out. (I'm being intentionally noncommittal about this for now.) My hacky config-writing code is here: https://github.com/bkirwi/incubator-samza-hello-samza/blob/hello-coast/samza-wikipedia/src/main/scala/samza/examples/coast/ConfigGenerationApp.scala#L52-L62 If you replace those lines with `val job = jobFactory.getJob(config); job.submit()` or similar, you should get a single-command launcher. I haven't done it with YARN yet, but my integration tests work like this. > # What can we do to help? Fielding these questions has already been a big help; thanks! My goal right now is to get something that works on the existing 0.8 branch; once that's done, it should be more obvious if there's anything Samza can do to make things cleaner / more performant. > Samza, itself, will handle exactly-once messaging at some point, but it > depends on yet-to-be-implemented Kafka features. Are you saying that > you're trying to implement exactly-once messaging on top of Samza? Yes, that's right. > If so, yes, this will be pretty tricky, and likely impact performance. I > believe > Trident's exactly-once performance suffers for this reason, as well. We > looked at doing things this way, but opted for the Kafka-based approach. I don't actually think this is as crazy as it may seem. A Kafka log is actually a decent coordination primitive, and performance is generally very good -- unlike the Trident / Kafka-transactional approach, many jobs require no extra I/O on the hot path or access to a central coordinator. There's definitely a cognitive overhead to doing this manually, but I think coast actually has a pretty good shot at making that easier -- it has quite a lot of 'structural' knowledge about the flow of data, so it should be able to do a pretty good job of inserting the necessary checks / checkpoints / etc. one DAG node at a time. The toughest part so far has actually been dealing with the Kafka ecosystem... a lot of tools try and hide the log abstraction from you or bake in particular assumptions, which makes life hard when you need precise control. One thing that pulled me to Samza is that it's pretty explicit about the log model, so these things are closer to the surface. Anyways, I get that I'm swimming against the current here, but that's the plan for now. Should be straightforward to switch to another approach if things go badly... >> Should I create the system myself from config during task init, or is >>there a better way? > > Yea, for now, you'll have to create the system yourself. One thing that > I'd been batting around was the idea of exposing all of the objects via > the SamzaContainerContext, but that's far off. Great, that works. Having access to these things in the context would be convenient, but it's certainly not a blocker. >> Am I missing anything? If not, would it be worth adding support for this? > > You're not missing anything. This seems like a worth-while feature, and is > similar to the TaskCoordinator.shutdown() work that Martin did to add a > request scope. Can you open a JIRA for this? Done. For those following along at home, the ticket is here: https://issues.apache.org/jira/browse/SAMZA-459
