Re: Making hello-samza easier to get started with

Chris Riccomini Wed, 05 Feb 2014 11:10:56 -0800

Hey Guys,

I think I agree with Sriram on this one.


It seems to me that, if we move the Vagrant stuff out into a separate
repo, hello-samza is pretty straightforward. Copying and pasting a long
line isn't that scary to me; I just ignore it and follow the directions,
but that's just my personality. :)

Cheers,
Chris

On 2/5/14 8:20 AM, "Sriram" <[email protected]> wrote:

>- I am not convinced that LocalJobFactory should be the default mode for
>hello-samza. The target users for Samza are developers. Showing how
>awesome it is to set up Samza with Kafka and YARN and consume wiki edit
>events in 5 - 10 minutes is really the big win. I don't think we gain
>much in reducing this time to 1 minute. I am also not a fan of having
>many ways to do a quickstart, which is my next point.
>
>- Having three ways to do a quickstart defeats the purpose, and I vote for
>moving Vagrant out into another repo. However, I do think the default
>should use YARN, as mentioned above. I don't see the value in making
>it LocalJobFactory.
>
>> On Feb 5, 2014, at 6:21 AM, Martin Kleppmann <[email protected]>
>>wrote:
>> 
>> Hi Chris,
>> 
>> On 4 Feb 2014, at 19:05, Chris Riccomini <[email protected]>
>>wrote:
>>>> [...]
>>>> - YARN is very heavyweight (100MB download). Could we avoid using YARN
>>>> in hello-samza, in favour of LocalJobFactory? Does Kafka have a local
>>>> mode for development that doesn't require Zookeeper? The fewer
>>>> dependencies the better.
>>> 
>>> On the one hand, I agree with you that it's annoying to have so many
>>> dependencies get pulled in. On the other hand, these systems are
>>> non-trivial to install, and getting them up and running, and showing
>>> the full power of Samza is a big deal. When I wrote hello-samza, I
>>> originally was just going to use LocalJobFactory, and not even use
>>> Kafka. This would have eliminated all dependencies. I opted against
>>> this because I felt like it gave a much poorer feel of what Samza was,
>>> and how it worked in the real world. For example, having the AM
>>> dashboard is really helpful, and allows us to illustrate what
>>> containers are, etc.
>> 
>> I agree that it's good to show the full power of Samza, and make it
>>easy to get started with YARN etc. But that raises the question: who is
>>hello-samza intended for?
>> 
>> - Is it for somebody who just saw a link to the Samza website in a
>>tweet, but who hasn't read the documentation yet, and who just wants to
>>quickly decide whether to invest more time into finding out about Samza?
>>(The "2-minute-quickly-playing-around" use case)
>> 
>> - Or is it for somebody who has already decided to try Samza, and wants
>>a reference project as a starting point for their own project? (The
>>"1-hour-experimentation" use case)
>> 
>> Both are valid use cases. The fact that "Hello Samza" appears as the
>>very first item in the website navigation suggests that it's intended
>>for the first case, whereas the full-on YARN install is more appropriate
>>to the second case.
>> 
>> In that light, I'd like to suggest the following:
>> 
>> - We move both the Vagrant setup and bin/grid into a separate
>>repository (call it "samza-instant-grid" or something like that). Since
>>the Vagrant setup depends on bin/grid, it makes sense for the two to be
>>in the same repository. That repo doesn't contain a particular Samza job
>>-- it's focused on the purpose of getting to a working YARN+Kafka+ZK
>>setup as quickly as possible, either on the local OS or inside a VM.
>> 
>> - We change hello-samza to use LocalJobFactory by default, for instant
>>gratification of people who are completely new to Samza. And at the end
>>of the hello-samza instructions we say something like: "Congratulations,
>>you've run your first Samza job! But it was running in local mode, which
>>is only for development, and doesn't have the resource isolation or
>>fault tolerance features of a real Samza deployment. Check out
>>[samza-instant-grid](LINK) to set up a miniature Samza cluster on your
>>machine in 10 minutes. You can then deploy
>>samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz to your
>>local cluster, and see the same job running in a YARN container."
>> 
>> That would allow hello-samza to satisfy both the
>>2-minute-quickly-playing-around use case and the 1-hour-experimentation
>>use case. And it would have the side benefit of showing how to set up a
>>project to use both local mode for development (which I think is
>>genuinely useful) and also generate an artifact that is deployable to
>>YARN.
>> 
>> Does that make sense?
>> 
>>>> - I somehow got my setup into a bad state (where YARN was running but
>>>> its web UI wouldn't load); I think it happened because I ran
>>>> `vagrant up` at the same time as `bin/grid bootstrap` outside of the
>>>> VM, and the two processes trampled on each other. Deleting the
>>>> 'deploy' directory and starting from a clean slate fixed it. Can we
>>>> isolate Vagrant and local-OS bootstrap from each other?
>>> 
>>> Yea, we really need to think this through. Originally, we only had
>>> local bin/grid (no Vagrant). Now, we have two different ways to run
>>> hello-samza, which is really confusing (especially since the README
>>> only talks about Vagrant, and the Samza website only talks about local
>>> mode). Jakob and I were talking about this as well. It seems like a
>>> good thing to move the Vagrant stuff somewhere else, and be clear
>>> about the two different ways of bootstrapping. Not quite sure about
>>> the best way to do this, but Jakob had some thoughts.
>> 
>> Jakob, would be interested to hear what you think.
>> 
>>>> - Can we make task logs go to stdout by default? Logs provide
>>>> reassurance that something is happening, and at the moment you have
>>>> to dig around somewhere in the deploy directory to find the log
>>>> files.
>>> 
>>> Not quite sure what you mean here. You mean the ZK/YARN/Kafka logs?
>> 
>> The run-job.sh commands currently give no visual feedback as to what is
>>happening -- you just start it, but then the job disappears into a
>>'black hole'. You can start the kafka-console-consumer to see the output
>>of a job, or you can find it on the YARN web UI, but a more immediate
>>form of feedback would be for the job's startup logs to appear on stdout.
>> 
>> I noticed a file deploy/samza/undefined-samza-container-name.log, which
>>included some info from the Samza job starting up, such as the MOTD sent
>>by the Wikipedia IRC gateway after connecting. That's the kind of output
>>I was thinking of.
>> 
>> Showing logs on stdout probably makes most sense when a job is run
>>through LocalJobFactory. If a job is deployed to YARN, it's
>>understandable that the logs are not shown (because they are generated
>>in a different process, potentially on a different machine).
>> 
>>>> - Can we shorten the commands? Having to unpack the .tar.gz file and
>>>> then copy/paste a scary long run-job.sh line makes the process feel
>>>> arcane, and obscures what is really happening. Perhaps just a shell
>>>> script wrapper for run-job.sh or a maven goal would do it.
>>> 
>>> Regarding the mkdir and .tar.gz unpacking, we should just do this as
>>> part of `mvn package`. If you want to make that change, I'm all for it.
>>> 
>>> As for hiding the run-job.sh, I'm not as convinced of getting rid of
>>> it. I kind of like exposing how Samza actually works to the developer,
>>> so they know. Hiding it behind some one-off script doesn't really help
>>> them understand Samza (of course the same argument could be made for
>>> hiding YARN/ZK/Kafka behind bin/grid). Perhaps we just need more
>>> documentation in the walkthrough about what this command does and what
>>> the parameters are?
>> 
>> If run-job.sh is part of samza-instant-grid, I think it's ok to keep it
>>as-is, and document it.
>> 
>> For the 2-minute-quickly-playing-around use case, I fear that a long
>>command mentioning factories is more confusing than enlightening. Am I
>>right in thinking that when using LocalJobFactory, run-job.sh is not
>>needed?
>> 
>>>> - Would it be possible to have maven download the dependencies,
>>>> rather than bin/grid calling curl on random URLs? Somehow it feels
>>>> weird to have a script download and run random code off the internet
>>>> (although of course that's what every package manager does, it's
>>>> irrational). It would also avoid re-downloading everything in case
>>>> you decide to blow away the deploy directory.
>>> 
>>> Not sure about this. All of this stuff is up in Apache's HTTP servers,
>>> but I'm not sure if the release packages for these projects are
>>> published into Maven central (I'm nearly 100% certain that Kafka
>>> isn't). If they're not, then having Maven download the packages is no
>>> different than having the shell script do it.
>>> 
>>> One alternative would be to have the bin/grid script cache the files
>>> locally somewhere, so that blowing away the deploy directory doesn't
>>> trigger a re-download of YARN/ZK/Kafka again.
>> 
>> Ok, having the shell script cache the files in another directory sounds
>>good. I'm happy to make that change.
>> 
>> Cheers,
>> Martin
>>
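
For reference, the download-caching idea agreed on at the end of the quoted
thread could look roughly like the sketch below. This is only an
illustration, not the actual bin/grid code: the function name fetch_cached,
the SAMZA_DOWNLOAD_CACHE variable, and the default cache location are all
hypothetical. The only assumption taken from the thread is that bin/grid
fetches packages with curl.

```shell
#!/bin/sh
# Hypothetical sketch: keep downloaded packages in a cache directory that
# lives outside 'deploy', so deleting 'deploy' does not force a re-download.
# The variable name and default path below are assumptions.
CACHE_DIR="${SAMZA_DOWNLOAD_CACHE:-$HOME/.samza/download}"

# fetch_cached URL DEST_DIR
# Downloads URL into the cache if it is not already there, then copies the
# cached file into DEST_DIR.
fetch_cached() {
  url="$1"
  dest_dir="$2"
  cached="$CACHE_DIR/${url##*/}"   # cache key is the file name in the URL

  mkdir -p "$CACHE_DIR" "$dest_dir"

  if [ ! -f "$cached" ]; then
    # Download under a temporary name so that an interrupted transfer is
    # never mistaken for a complete package on the next run.
    if ! curl -fsSL "$url" -o "$cached.part"; then
      rm -f "$cached.part"
      return 1
    fi
    mv "$cached.part" "$cached"
  fi

  cp "$cached" "$dest_dir/"
}
```

A bootstrap step would then call something like
`fetch_cached "$PACKAGE_URL" deploy`, where PACKAGE_URL is a placeholder for
whatever URL the script already uses. Keying the cache on the file name alone
is the simplest choice; it assumes each package version has a distinct file
name.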
