I love the hello-samza project -- it's quite magical to run a bunch of commands and see real data flow through the example job. Great idea to use Wikipedia's IRC feed!
However, I feel the setup process is still a bit intimidating and fragile. I just wanted to bounce around some ideas about how we could make it quicker to get started:

• YARN is very heavyweight (100MB download). Could we avoid using YARN in hello-samza, in favour of LocalJobFactory? Does Kafka have a local mode for development that doesn't require Zookeeper? The fewer dependencies the better.

• The Vagrant bootstrap script was quite broken -- I submitted a pull request (https://github.com/linkedin/hello-samza/pull/18) which should hopefully fix it.

• I somehow got my setup into a bad state (where YARN was running but its web UI wouldn't load); I think it happened because I ran `vagrant up` at the same time as `bin/grid bootstrap` outside of the VM, and the two processes trampled on each other. Deleting the 'deploy' directory and starting from a clean slate fixed it. Can we isolate Vagrant and local-OS bootstrap from each other?

• Can we make task logs go to stdout by default? Logs provide reassurance that something is happening, and at the moment you have to dig around somewhere in the deploy directory to find the log files.

• Can we shorten the commands? Having to unpack the .tar.gz file and then copy/paste a scary long run-job.sh line makes the process feel arcane, and obscures what is really happening. Perhaps just a shell script wrapper for run-job.sh or a maven goal would do it.

• Would it be possible to have maven download the dependencies, rather than bin/grid calling curl on random URLs? Somehow it feels weird to have a script download and run random code off the internet (although of course that's what every package manager does, it's irrational). It would also avoid re-downloading everything in case you decide to blow away the deploy directory.

What do you think? Please chime in. I'm happy to work on these things, just wanted to get a read on what people think first.

Martin
