Re: Making hello-samza easier to get started with

Chris Riccomini Tue, 04 Feb 2014 11:12:57 -0800

Hey Martin,

Responses inline.

The things in your list that I'm most excited about are:

1. Splitting Vagrant out.
2. Collapsing mkdir and tar -xvf into `mvn package`.
3. Making bin/grid cache downloads outside of the deploy directory.

I'm not really opposed to some of the other stuff, but we need to think it
through more (and probably need feedback from others).

Cheers,
Chris

On 2/4/14 9:51 AM, "Martin Kleppmann" <[email protected]> wrote:

>I love the hello-samza project -- it's quite magical to run a bunch of
>commands and see real data flow through the example job. Great idea to
>use Wikipedia's IRC feed!
>
>However, I feel the setup process is still a bit intimidating and
>fragile. I just wanted to bounce around some ideas about how we could
>make it quicker to get started:
>
>* YARN is very heavyweight (100MB download). Could we avoid using YARN in
>hello-samza, in favour of LocalJobFactory? Does Kafka have a local mode
>for development that doesn't require Zookeeper? The fewer dependencies
>the better.

On the one hand, I agree with you that it's annoying to have so many
dependencies get pulled in. On the other hand, these systems are
non-trivial to install, and getting them up and running, and showing the
full power of Samza is a big deal. When I wrote hello-samza, I originally
was just going to use LocalJobFactory, and not even use Kafka. This would
have eliminated all dependencies. I opted against this because I felt like
it gave a much poorer feel for what Samza is, and how it works in the
real world. For example, having the AM dashboard is really helpful, and
allows us to illustrate what containers are, etc.

>
>* The Vagrant bootstrap script was quite broken -- I submitted a pull
>request (https://github.com/linkedin/hello-samza/pull/18) which should
>hopefully fix it.

Took a look. Looks good to me. Will merge if no one has any objections.

>
>* I somehow got my setup into a bad state (where YARN was running but its
>web UI wouldn't load); I think it happened because I ran `vagrant up` at
>the same time as `bin/grid bootstrap` outside of the VM, and the two
>processes trampled on each other. Deleting the 'deploy' directory and
>starting from a clean slate fixed it. Can we isolate Vagrant and local-OS
>bootstrap from each other?

Yeah, we really need to think this through. Originally, we only had local
bin/grid (no Vagrant). Now, we have two different ways to run hello-samza,
which is really confusing (especially since the README only talks about
Vagrant, and the Samza website only talks about local mode). Jakob and I
were talking about this as well. It seems like a good thing to move the
Vagrant stuff somewhere else, and be clear about the two different ways of
bootstrapping. Not quite sure about the best way to do this, but Jakob had
some thoughts.

>
>* Can we make task logs go to stdout by default? Logs provide reassurance
>that something is happening, and at the moment you have to dig around
>somewhere in the deploy directory to find the log files.

Not quite sure what you mean here. You mean the ZK/YARN/Kafka logs?

>
>* Can we shorten the commands? Having to unpack the .tar.gz file and then
>copy/paste a scary long run-job.sh line makes the process feel arcane,
>and obscures what is really happening. Perhaps just a shell script
>wrapper for run-job.sh or a maven goal would do it.

Regarding the mkdir and .tar.gz unpacking, we should just do this as part
of `mvn package`. If you want to make that change, I'm all for it.

As for hiding run-job.sh, I'm less convinced that we should get rid of it. I
kind of like exposing how Samza actually works to the developer, so they
know. Hiding it behind some one-off script doesn't really help them
understand Samza (of course the same argument could be made for hiding
YARN/ZK/Kafka behind bin/grid). Perhaps we just need more documentation in
the walkthrough about what this command does and what the parameters are?

>
>* Would it be possible to have maven download the dependencies, rather
>than bin/grid calling curl on random URLs? Somehow it feels weird to have
>a script download and run random code off the internet (although of
>course that's what every package manager does, it's irrational). It would
>also avoid re-downloading everything in case you decide to blow away the
>deploy directory.

Not sure about this. All of this stuff is up on Apache's HTTP servers, but
I'm not sure if the release packages for these projects are published into
Maven central (I'm nearly 100% certain that Kafka isn't). If they're not,
then having Maven download the packages is no different than having the
shell script do it.

One alternative would be to have the bin/grid script cache the files
locally somewhere, so that blowing away the deploy directory doesn't
trigger a re-download of YARN/ZK/Kafka.
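
A rough sketch of what that caching could look like, just to make the idea
concrete. This is not the current bin/grid code: CACHE_DIR, its default
location, and the fetch and download_cached functions are all hypothetical
names made up for illustration. The only real tool assumed is curl, which
bin/grid already calls.

```shell
#!/usr/bin/env bash
# Sketch only: CACHE_DIR, fetch, and download_cached are hypothetical names,
# not taken from the real bin/grid.

# The cache lives outside deploy/, so deleting deploy/ does not discard it.
# The default path below is an assumption; any directory outside deploy/ works.
CACHE_DIR="${CACHE_DIR:-$HOME/.samza/download}"

# Download URL $1 to file $2. Kept as its own function so it is easy to swap.
fetch() {
  curl -sSfL -o "$2" "$1"
}

# Print the path of the cached copy of URL $1, downloading only if missing.
download_cached() {
  local url="$1"
  local cached
  cached="$CACHE_DIR/$(basename "$url")"
  mkdir -p "$CACHE_DIR" || return 1
  if [ ! -f "$cached" ]; then
    # Download to a temporary name first, so an interrupted transfer is
    # never mistaken for a complete cached file.
    fetch "$url" "$cached.part" || { rm -f "$cached.part"; return 1; }
    mv "$cached.part" "$cached"
  fi
  echo "$cached"
}
```

The install steps would then unpack from the path that download_cached
prints into the deploy directory, instead of downloading straight into it.
Keying the cache on the file name alone is the simplest thing that could
work; it assumes each package version has a distinct archive name.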

>
>What do you think? Please chime in. I'm happy to work on these things,
>just wanted to get a read on what people think first.
>
>Martin
>
