- I am not convinced that LocalJobFactory should be the default mode for hello-samza. The target users for Samza are developers. Showing how awesome it is to set up Samza with Kafka and YARN and consume wiki edit events in 5-10 minutes is really the big win. I don't think we gain much by reducing this time to 1 minute. I am also not a fan of having many ways to do a quickstart, which is my next point.
- Having three ways to do a quickstart defeats the purpose, and I vote for moving Vagrant out into another repo. However, I do think the default should use YARN, as mentioned above. I don't see the value add in making it LocalJobFactory.

> On Feb 5, 2014, at 6:21 AM, Martin Kleppmann <[email protected]> wrote:
>
> Hi Chris,
>
> On 4 Feb 2014, at 19:05, Chris Riccomini <[email protected]> wrote:
>>> [...]
>>> • YARN is very heavyweight (100MB download). Could we avoid using YARN in
>>> hello-samza, in favour of LocalJobFactory? Does Kafka have a local mode
>>> for development that doesn't require Zookeeper? The fewer dependencies
>>> the better.
>>
>> On the one hand, I agree with you that it's annoying to have so many
>> dependencies get pulled in. On the other hand, these systems are
>> non-trivial to install, and getting them up and running, and showing the
>> full power of Samza is a big deal. When I wrote hello-samza, I originally
>> was just going to use LocalJobFactory, and not even use Kafka. This would
>> have eliminated all dependencies. I opted against this because I felt like
>> it gave a much poorer feel of what Samza was, and how it worked in the
>> real world. For example, having the AM dashboard is really helpful, and
>> allows us to illustrate what containers are, etc.
>
> I agree that it's good to show the full power of Samza, and make it easy to
> get started with YARN etc. But that raises the question: who is hello-samza
> intended for?
>
> - Is it for somebody who just saw a link to the Samza website in a tweet, but
> who hasn't read the documentation yet, and who just wants to quickly decide
> whether to invest more time into finding out about Samza? (The
> "2-minute-quickly-playing-around" use case)
>
> - Or is it for somebody who has already decided to try Samza, and wants a
> reference project as a starting point for their own project? (The
> "1-hour-experimentation" use case)
>
> Both are valid use cases. The fact that "Hello Samza" appears as the very
> first item in the website navigation suggests that it's intended for the
> first case, whereas the full-on YARN install is more appropriate to the
> second case.
>
> In that light, I'd like to suggest the following:
>
> - We move both the Vagrant setup and bin/grid into a separate repository
> (call it "samza-instant-grid" or something like that). Since the Vagrant
> setup depends on bin/grid, it makes sense for the two to be in the same
> repository. That repo doesn't contain a particular Samza job -- it's focused
> on the purpose of getting to a working YARN+Kafka+ZK setup as quickly as
> possible, either on the local OS or inside a VM.
>
> - We change hello-samza to use LocalJobFactory by default, for instant
> gratification of people who are completely new to Samza. And at the end of
> the hello-samza instructions we say something like: "Congratulations, you've
> run your first Samza job! But it was running in local mode, which is only for
> development, and doesn't have the resource isolation or fault tolerance
> features of a real Samza deployment. Check out [samza-instant-grid](LINK) to
> set up a miniature Samza cluster on your machine in 10 minutes. You can then
> deploy samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz to your
> local cluster, and see the same job running in a YARN container."
>
> That would allow hello-samza to satisfy both the
> 2-minute-quickly-playing-around use case and the 1-hour-experimentation use
> case. And it would have the side benefit of showing how to set up a project
> to use both local mode for development (which I think is genuinely useful)
> and also generate an artifact that is deployable to YARN.
>
> Does that make sense?
>
>>> • I somehow got my setup into a bad state (where YARN was running but its
>>> web UI wouldn't load); I think it happened because I ran `vagrant up` at
>>> the same time as `bin/grid bootstrap` outside of the VM, and the two
>>> processes trampled on each other. Deleting the 'deploy' directory and
>>> starting from a clean slate fixed it. Can we isolate Vagrant and local-OS
>>> bootstrap from each other?
>>
>> Yea, we really need to think this through. Originally, we only had local
>> bin/grid (no Vagrant). Now, we have two different ways to run hello-samza,
>> which is really confusing (especially since the README only talks about
>> Vagrant, and the Samza website only talks about local mode). Jakob and I
>> were talking about this as well. It seems like a good thing to move the
>> Vagrant stuff somewhere else, and be clear about the two different ways of
>> bootstrapping. Not quite sure about the best way to do this, but Jakob had
>> some thoughts.
>
> Jakob, would be interested to hear what you think.
>
>>> • Can we make task logs go to stdout by default? Logs provide reassurance
>>> that something is happening, and at the moment you have to dig around
>>> somewhere in the deploy directory to find the log files.
>>
>> Not quite sure what you mean here. You mean the ZK/YARN/Kafka logs?
>
> The run-job.sh commands currently give no visual feedback as to what is
> happening -- you just start it, but then the job disappears into a 'black
> hole'. You can start the kafka-console-consumer to see the output of a job,
> or you can find it on the YARN web UI, but a more immediate form of feedback
> would be for the job's startup logs to appear on stdout.
>
> I noticed a file deploy/samza/undefined-samza-container-name.log, which
> included some info from the Samza job starting up, such as the MOTD sent by
> the Wikipedia IRC gateway after connecting. That's the kind of output I was
> thinking of.
>
> Showing logs on stdout probably makes most sense when a job is run through
> LocalJobFactory. If a job is deployed to YARN, it's understandable that the
> logs are not shown (because they are generated in a different process,
> potentially on a different machine).
>
>>> • Can we shorten the commands? Having to unpack the .tar.gz file and then
>>> copy/paste a scary long run-job.sh line makes the process feel arcane,
>>> and obscures what is really happening. Perhaps just a shell script
>>> wrapper for run-job.sh or a maven goal would do it.
>>
>> Regarding the mkdir and .tar.gz unpacking, we should just do this as part
>> of `mvn package`. If you want to make that change, I'm all for it.
>>
>> As for hiding the run-job.sh, I'm not as convinced of getting rid of it. I
>> kind of like exposing how Samza actually works to the developer, so they
>> know. Hiding it behind some one-off script doesn't really help them
>> understand Samza (of course the same argument could be made for hiding
>> YARN/ZK/Kafka behind bin/grid). Perhaps we just need more documentation in
>> the walkthrough about what this command does and what the parameters are?
>
> If run-job.sh is part of samza-instant-grid, I think it's ok to keep it
> as-is, and document it.
>
> For the 2-minute-quickly-playing-around use case, I fear that a long command
> mentioning factories is more confusing than enlightening. Am I right in
> thinking that when using LocalJobFactory, run-job.sh is not needed?
>
>>> • Would it be possible to have maven download the dependencies, rather
>>> than bin/grid calling curl on random URLs? Somehow it feels weird to have
>>> a script download and run random code off the internet (although of
>>> course that's what every package manager does, it's irrational). It would
>>> also avoid re-downloading everything in case you decide to blow away the
>>> deploy directory.
>>
>> Not sure about this. All of this stuff is up in Apache's HTTP servers, but
>> I'm not sure if the release packages for these projects are published into
>> Maven central (I'm nearly 100% certain that Kafka isn't). If they're not,
>> then having Maven download the packages is no different than having the
>> shell script do it.
>>
>> One alternative would be to have the bin/grid script cache the files
>> locally somewhere, so that blowing away the deploy directory doesn't
>> trigger a re-download of YARN/ZK/Kafka again.
>
> Ok, having the shell script cache the files in another directory sounds good.
> I'm happy to make that change.
>
> Cheers,
> Martin
>
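
On the last point in the quoted thread (having bin/grid cache its downloads outside the deploy directory), here is a rough sketch of how that could look. It is only an illustration: the function names, the CACHE_DIR variable and its default location are made up here and are not taken from the actual bin/grid script.

```shell
#!/bin/sh
# Hypothetical sketch; none of these names come from the real bin/grid.

# Assumed cache location, deliberately outside the deploy directory so that
# deleting deploy/ does not throw away the downloaded archives.
CACHE_DIR="${CACHE_DIR:-$HOME/.samza/download}"

# fetch URL DEST: download a single file with curl.
# Kept separate so the caching logic below does not depend on curl directly.
fetch() {
  curl -fsSL "$1" -o "$2"
}

# cached_download URL: print the path of the cached archive,
# downloading it only if it is not in the cache yet.
cached_download() {
  url="$1"
  file="$CACHE_DIR/$(basename "$url")"
  mkdir -p "$CACHE_DIR" || return 1
  if [ ! -f "$file" ]; then
    # Download to a temporary name first, so an interrupted download
    # never leaves a truncated file that looks like a cache hit.
    if ! fetch "$url" "$file.part"; then
      rm -f "$file.part"
      return 1
    fi
    mv "$file.part" "$file" || return 1
  fi
  printf '%s\n' "$file"
}
```

An install step could then unpack from the cache instead of calling curl itself, for example `tar -xzf "$(cached_download "$PACKAGE_URL")" -C deploy`, where PACKAGE_URL stands for whichever archive URL the script already uses.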
