Hey Guys, I think I agree with Sriram on this one.
It seems to me that, if we move the Vagrant stuff out into a separate
repo, hello-samza is pretty straightforward. Copying and pasting a long
line isn't that scary to me; I just ignore it and follow the directions,
but that's just my personality. :)

Cheers,
Chris

On 2/5/14 8:20 AM, "Sriram" wrote:

>- I am not convinced that LocalJobFactory should be the default mode for
>hello-samza. The target users for Samza are developers. Showing how
>awesome it is to set up Samza with Kafka and YARN and consume wiki edit
>events in 5-10 minutes is really the big win. I don't think we gain
>much by reducing this time to 1 minute. I am also not a fan of having
>many ways to do a quickstart, which is my next point.
>
>- Having three ways to do a quickstart defeats the purpose, and I vote
>for moving Vagrant out into another repo. However, I do think the
>default should use YARN, as mentioned above. I don't see the value in
>making it LocalJobFactory.
>
>> On Feb 5, 2014, at 6:21 AM, Martin Kleppmann wrote:
>>
>> Hi Chris,
>>
>> On 4 Feb 2014, at 19:05, Chris Riccomini wrote:
>>>> [...]
>>>> • YARN is very heavyweight (100MB download). Could we avoid using
>>>> YARN in hello-samza, in favour of LocalJobFactory? Does Kafka have a
>>>> local mode for development that doesn't require Zookeeper? The fewer
>>>> dependencies the better.
>>>
>>> On the one hand, I agree with you that it's annoying to have so many
>>> dependencies get pulled in. On the other hand, these systems are
>>> non-trivial to install, and getting them up and running, and showing
>>> the full power of Samza is a big deal. When I wrote hello-samza, I
>>> originally was just going to use LocalJobFactory, and not even use
>>> Kafka. This would have eliminated all dependencies. I opted against
>>> this because I felt like it gave a much poorer feel of what Samza
>>> was, and how it worked in the real world.
>>> For example, having the AM dashboard is really helpful, and allows
>>> us to illustrate what containers are, etc.
>>
>> I agree that it's good to show the full power of Samza, and make it
>> easy to get started with YARN etc. But that raises the question: who
>> is hello-samza intended for?
>>
>> - Is it for somebody who just saw a link to the Samza website in a
>> tweet, but who hasn't read the documentation yet, and who just wants
>> to quickly decide whether to invest more time into finding out about
>> Samza? (The "2-minute-quickly-playing-around" use case)
>>
>> - Or is it for somebody who has already decided to try Samza, and
>> wants a reference project as a starting point for their own project?
>> (The "1-hour-experimentation" use case)
>>
>> Both are valid use cases. The fact that "Hello Samza" appears as the
>> very first item in the website navigation suggests that it's intended
>> for the first case, whereas the full-on YARN install is more
>> appropriate to the second case.
>>
>> In that light, I'd like to suggest the following:
>>
>> - We move both the Vagrant setup and bin/grid into a separate
>> repository (call it "samza-instant-grid" or something like that).
>> Since the Vagrant setup depends on bin/grid, it makes sense for the
>> two to be in the same repository. That repo doesn't contain a
>> particular Samza job -- it's focused on the purpose of getting to a
>> working YARN+Kafka+ZK setup as quickly as possible, either on the
>> local OS or inside a VM.
>>
>> - We change hello-samza to use LocalJobFactory by default, for instant
>> gratification of people who are completely new to Samza. And at the
>> end of the hello-samza instructions we say something like:
>> "Congratulations, you've run your first Samza job! But it was running
>> in local mode, which is only for development, and doesn't have the
>> resource isolation or fault tolerance features of a real Samza
>> deployment.
>> Check out [samza-instant-grid](LINK) to set up a miniature Samza
>> cluster on your machine in 10 minutes. You can then deploy
>> samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz to your
>> local cluster, and see the same job running in a YARN container."
>>
>> That would allow hello-samza to satisfy both the
>> 2-minute-quickly-playing-around use case and the
>> 1-hour-experimentation use case. And it would have the side benefit of
>> showing how to set up a project to use both local mode for development
>> (which I think is genuinely useful) and also generate an artifact that
>> is deployable to YARN.
>>
>> Does that make sense?
>>
>>>> • I somehow got my setup into a bad state (where YARN was running
>>>> but its web UI wouldn't load); I think it happened because I ran
>>>> `vagrant up` at the same time as `bin/grid bootstrap` outside of the
>>>> VM, and the two processes trampled on each other. Deleting the
>>>> 'deploy' directory and starting from a clean slate fixed it. Can we
>>>> isolate Vagrant and local-OS bootstrap from each other?
>>>
>>> Yea, we really need to think this through. Originally, we only had
>>> local bin/grid (no Vagrant). Now, we have two different ways to run
>>> hello-samza, which is really confusing (especially since the README
>>> only talks about Vagrant, and the Samza website only talks about
>>> local mode). Jakob and I were talking about this as well. It seems
>>> like a good thing to move the Vagrant stuff somewhere else, and be
>>> clear about the two different ways of bootstrapping. Not quite sure
>>> about the best way to do this, but Jakob had some thoughts.
>>
>> Jakob, would be interested to hear what you think.
>>
>>>> • Can we make task logs go to stdout by default? Logs provide
>>>> reassurance that something is happening, and at the moment you have
>>>> to dig around somewhere in the deploy directory to find the log
>>>> files.
>>>
>>> Not quite sure what you mean here.
>>> You mean the ZK/YARN/Kafka logs?
>>
>> The run-job.sh commands currently give no visual feedback as to what
>> is happening -- you just start it, but then the job disappears into a
>> 'black hole'. You can start the kafka-console-consumer to see the
>> output of a job, or you can find it on the YARN web UI, but a more
>> immediate form of feedback would be for the job's startup logs to
>> appear on stdout.
>>
>> I noticed a file deploy/samza/undefined-samza-container-name.log,
>> which included some info from the Samza job starting up, such as the
>> MOTD sent by the Wikipedia IRC gateway after connecting. That's the
>> kind of output I was thinking of.
>>
>> Showing logs on stdout probably makes most sense when a job is run
>> through LocalJobFactory. If a job is deployed to YARN, it's
>> understandable that the logs are not shown (because they are generated
>> in a different process, potentially on a different machine).
>>
>>>> • Can we shorten the commands? Having to unpack the .tar.gz file and
>>>> then copy/paste a scary long run-job.sh line makes the process feel
>>>> arcane, and obscures what is really happening. Perhaps just a shell
>>>> script wrapper for run-job.sh or a maven goal would do it.
>>>
>>> Regarding the mkdir and .tar.gz unpacking, we should just do this as
>>> part of `mvn package`. If you want to make that change, I'm all for
>>> it.
>>>
>>> As for hiding the run-job.sh, I'm not as convinced of getting rid of
>>> it. I kind of like exposing how Samza actually works to the
>>> developer, so they know. Hiding it behind some one-off script doesn't
>>> really help them understand Samza (of course the same argument could
>>> be made for hiding YARN/ZK/Kafka behind bin/grid). Perhaps we just
>>> need more documentation in the walkthrough about what this command
>>> does and what the parameters are?
>>
>> If run-job.sh is part of samza-instant-grid, I think it's ok to keep
>> it as-is, and document it.
>>
>> For the 2-minute-quickly-playing-around use case, I fear that a long
>> command mentioning factories is more confusing than enlightening. Am I
>> right in thinking that when using LocalJobFactory, run-job.sh is not
>> needed?
>>
>>>> • Would it be possible to have maven download the dependencies,
>>>> rather than bin/grid calling curl on random URLs? Somehow it feels
>>>> weird to have a script download and run random code off the internet
>>>> (although of course that's what every package manager does, it's
>>>> irrational). It would also avoid re-downloading everything in case
>>>> you decide to blow away the deploy directory.
>>>
>>> Not sure about this. All of this stuff is up in Apache's HTTP
>>> servers, but I'm not sure if the release packages for these projects
>>> are published into Maven central (I'm nearly 100% certain that Kafka
>>> isn't). If they're not, then having Maven download the packages is no
>>> different than having the shell script do it.
>>>
>>> One alternative would be to have the bin/grid script cache the files
>>> locally somewhere, so that blowing away the deploy directory doesn't
>>> trigger a re-download of YARN/ZK/Kafka again.
>>
>> Ok, having the shell script cache the files in another directory
>> sounds good. I'm happy to make that change.
>>
>> Cheers,
>> Martin
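
The caching idea the thread ends on (keep downloaded packages in a directory outside 'deploy', so deleting 'deploy' does not force a re-download) could look roughly like the sketch below. It is written in Python purely for illustration and is not the actual bin/grid shell script; the names fetch_cached and install, the cache location, and any URL passed in are hypothetical.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path
from typing import Callable


def download(url: str, dest: Path) -> None:
    """Default fetcher: stream the URL to dest (stands in for the curl call)."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


def fetch_cached(url: str, cache_dir: Path,
                 fetch: Callable[[str, Path], None] = download) -> Path:
    """Return a local copy of url, downloading only if it is not cached yet."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: cache entries are keyed by the file name.
    cached = cache_dir / url.rstrip("/").rsplit("/", 1)[-1]
    if not cached.exists():
        # Download to a temporary file first, so an interrupted download
        # never leaves a truncated file that looks like a valid cache entry.
        fd, tmp_name = tempfile.mkstemp(dir=cache_dir, suffix=".part")
        os.close(fd)
        try:
            fetch(url, Path(tmp_name))
            os.replace(tmp_name, cached)
        finally:
            if os.path.exists(tmp_name):
                os.remove(tmp_name)
    return cached


def install(url: str, cache_dir: Path, deploy_dir: Path,
            fetch: Callable[[str, Path], None] = download) -> Path:
    """Copy the cached package into deploy_dir, which can be deleted freely."""
    deploy_dir.mkdir(parents=True, exist_ok=True)
    package = fetch_cached(url, cache_dir, fetch)
    return Path(shutil.copy2(package, deploy_dir / package.name))
```

With something like this, deleting the deploy directory and calling install again only copies the file from the cache; the network is touched once per package. The fetch parameter exists only so the download step can be swapped out, for example in a test.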
