Re: Making hello-samza easier to get started with

Sriram Wed, 05 Feb 2014 08:22:01 -0800

- I am not convinced that LocalJobFactory should be the default mode for 
hello-samza. The target users for Samza are developers. Showing how awesome it 
is to set up Samza with Kafka and YARN and consume wiki edit events in 5-10 
minutes is really the big win. I don't think we gain much by reducing this time 
to 1 minute. I am also not a fan of having many ways to do a quickstart, which 
is my next point.


- Having three ways to do a quickstart defeats the purpose, and I vote for 
moving Vagrant out into another repo. However, I do think the default should 
use YARN, as mentioned above. I don't see the value in making it 
LocalJobFactory.

> On Feb 5, 2014, at 6:21 AM, Martin Kleppmann <[email protected]> wrote:
> 
> Hi Chris,
> 
> On 4 Feb 2014, at 19:05, Chris Riccomini <[email protected]> wrote:
>>> [...]
>>> - YARN is very heavyweight (100MB download). Could we avoid using YARN in
>>> hello-samza, in favour of LocalJobFactory? Does Kafka have a local mode
>>> for development that doesn't require Zookeeper? The fewer dependencies
>>> the better.
>> 
>> On the one hand, I agree with you that it's annoying to have so many
>> dependencies get pulled in. On the other hand, these systems are
>> non-trivial to install, and getting them up and running, and showing the
>> full power of Samza is a big deal. When I wrote hello-samza, I originally
>> was just going to use LocalJobFactory, and not even use Kafka. This would
>> have eliminated all dependencies. I opted against this because I felt like
>> it gave a much poorer feel of what Samza was, and how it worked in the
>> real world. For example, having the AM dashboard is really helpful, and
>> allows us to illustrate what containers are, etc.
> 
> I agree that it's good to show the full power of Samza, and make it easy to 
> get started with YARN etc. But that raises the question: who is hello-samza 
> intended for?
> 
> - Is it for somebody who just saw a link to the Samza website in a tweet, but 
> who hasn't read the documentation yet, and who just wants to quickly decide 
> whether to invest more time into finding out about Samza? (The 
> "2-minute-quickly-playing-around" use case)
> 
> - Or is it for somebody who has already decided to try Samza, and wants a 
> reference project as a starting point for their own project? (The 
> "1-hour-experimentation" use case)
> 
> Both are valid use cases. The fact that "Hello Samza" appears as the very 
> first item in the website navigation suggests that it's intended for the 
> first case, whereas the full-on YARN install is more appropriate to the 
> second case.
> 
> In that light, I'd like to suggest the following:
> 
> - We move both the Vagrant setup and bin/grid into a separate repository 
> (call it "samza-instant-grid" or something like that). Since the Vagrant 
> setup depends on bin/grid, it makes sense for the two to be in the same 
> repository. That repo doesn't contain a particular Samza job -- it's focused 
> on the purpose of getting to a working YARN+Kafka+ZK setup as quickly as 
> possible, either on the local OS or inside a VM.
> 
> - We change hello-samza to use LocalJobFactory by default, for instant 
> gratification of people who are completely new to Samza. And at the end of 
> the hello-samza instructions we say something like: "Congratulations, you've 
> run your first Samza job! But it was running in local mode, which is only for 
> development, and doesn't have the resource isolation or fault tolerance 
> features of a real Samza deployment. Check out [samza-instant-grid](LINK) to 
> set up a miniature Samza cluster on your machine in 10 minutes. You can then 
> deploy samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz to your 
> local cluster, and see the same job running in a YARN container."
> 
> That would allow hello-samza to satisfy both the 
> 2-minute-quickly-playing-around use case and the 1-hour-experimentation use 
> case. And it would have the side benefit of showing how to set up a project 
> to use both local mode for development (which I think is genuinely useful) 
> and also generate an artifact that is deployable to YARN.
> 
> Does that make sense?
> 
>>> - I somehow got my setup into a bad state (where YARN was running but its
>>> web UI wouldn't load); I think it happened because I ran `vagrant up` at
>>> the same time as `bin/grid bootstrap` outside of the VM, and the two
>>> processes trampled on each other. Deleting the 'deploy' directory and
>>> starting from a clean slate fixed it. Can we isolate Vagrant and local-OS
>>> bootstrap from each other?
>> 
>> Yea, we really need to think this through. Originally, we only had local
>> bin/grid (no Vagrant). Now, we have two different ways to run hello-samza,
>> which is really confusing (especially since the README only talks about
>> Vagrant, and the Samza website only talks about local mode). Jakob and I
>> were talking about this as well. It seems like a good thing to move the
>> Vagrant stuff somewhere else, and be clear about the two different ways of
>> bootstrapping. Not quite sure about the best way to do this, but Jakob had
>> some thoughts.
> 
> Jakob, would be interested to hear what you think.
> 
>>> - Can we make task logs go to stdout by default? Logs provide reassurance
>>> that something is happening, and at the moment you have to dig around
>>> somewhere in the deploy directory to find the log files.
>> 
>> Not quite sure what you mean here. You mean the ZK/YARN/Kafka logs?
> 
> The run-job.sh commands currently give no visual feedback as to what is 
> happening -- you just start it, but then the job disappears into a 'black 
> hole'. You can start the kafka-console-consumer to see the output of a job, 
> or you can find it on the YARN web UI, but a more immediate form of feedback 
> would be for the job's startup logs to appear on stdout.
> 
> I noticed a file deploy/samza/undefined-samza-container-name.log, which 
> included some info from the Samza job starting up, such as the MOTD sent by 
> the Wikipedia IRC gateway after connecting. That's the kind of output I was 
> thinking of.
> 
> Showing logs on stdout probably makes most sense when a job is run through 
> LocalJobFactory. If a job is deployed to YARN, it's understandable that the 
> logs are not shown (because they are generated in a different process, 
> potentially on a different machine).
> 
>>> - Can we shorten the commands? Having to unpack the .tar.gz file and then
>>> copy/paste a scary long run-job.sh line makes the process feel arcane,
>>> and obscures what is really happening. Perhaps just a shell script
>>> wrapper for run-job.sh or a maven goal would do it.
>> 
>> Regarding the mkdir and .tar.gz unpacking, we should just do this as part
>> of `mvn package`. If you want to make that change, I'm all for it.
>> 
>> As for hiding the run-job.sh, I'm not as convinced of getting rid of it. I
>> kind of like exposing how Samza actually works to the developer, so they
>> know. Hiding it behind some one-off script doesn't really help them
>> understand Samza (of course the same argument could be made for hiding
>> YARN/ZK/Kafka behind bin/grid). Perhaps we just need more documentation in
>> the walkthrough about what this command does and what the parameters are?
> 
> If run-job.sh is part of samza-instant-grid, I think it's ok to keep it 
> as-is, and document it.
> 
> For the 2-minute-quickly-playing-around use case, I fear that a long command 
> mentioning factories is more confusing than enlightening. Am I right in 
> thinking that when using LocalJobFactory, run-job.sh is not needed?
> 
>>> - Would it be possible to have maven download the dependencies, rather
>>> than bin/grid calling curl on random URLs? Somehow it feels weird to have
>>> a script download and run random code off the internet (although of
>>> course that's what every package manager does, it's irrational). It would
>>> also avoid re-downloading everything in case you decide to blow away the
>>> deploy directory.
>> 
>> Not sure about this. All of this stuff is up in Apache's HTTP servers, but
>> I'm not sure if the release packages for these projects are published into
>> Maven central (I'm nearly 100% certain that Kafka isn't). If they're not,
>> then having Maven download the packages is no different than having the
>> shell script do it.
>> 
>> One alternative would be to have the bin/grid script cache the files
>> locally somewhere, so that blowing away the deploy directory doesn't
>> trigger a re-download of YARN/ZK/Kafka again.
> 
> Ok, having the shell script cache the files in another directory sounds good. 
> I'm happy to make that change.
> 
> Cheers,
> Martin
>
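
For reference, the caching idea at the end of the quoted thread (keep downloaded packages outside the deploy directory so that deleting it does not trigger a re-download) could look roughly like the sketch below. This is only an illustration under stated assumptions: bin/grid is a shell script that calls curl, whereas this sketch uses Python's standard library, and the function name `fetch_cached` and the `cache_dir`/`deploy_dir` parameters are hypothetical, not part of hello-samza.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_cached(url: str, cache_dir: Path, deploy_dir: Path) -> Path:
    """Place the file at `url` into deploy_dir, downloading only on a cache miss.

    Hypothetical helper: cache_dir is meant to live outside deploy_dir, so
    wiping deploy_dir does not force another download.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    deploy_dir.mkdir(parents=True, exist_ok=True)

    # Use the last path segment of the URL as the cached file name.
    file_name = url.rstrip("/").rsplit("/", 1)[-1]
    cached = cache_dir / file_name

    if not cached.exists():
        # Download under a temporary name first, so an interrupted download
        # never leaves a truncated file that looks like a valid cache entry.
        partial = cached.with_name(cached.name + ".part")
        urllib.request.urlretrieve(url, str(partial))
        partial.replace(cached)

    # Copy (rather than move) so the cache entry survives a deploy wipe.
    target = deploy_dir / file_name
    shutil.copy2(cached, target)
    return target
```

Calling it once per package, with whatever URL the bootstrap script already downloads, would leave the deploy directory disposable. The cache is keyed only by file name, so a changed upstream file with the same name would not be picked up; that trade-off seems acceptable for versioned release tarballs, but it is an assumption.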
