Hi Gwen,

Congrats on becoming a committer!
I submitted a (ducktape) pull request for the cluster size issue here: https://github.com/confluentinc/ducktape/pull/67, which hopefully makes the error less confusing. Maybe we can punt on the slightly funky Vagrantfile.local setup in the Kafka repository. Anything else to add? I'm thinking it would be nice to move forward with the merge and iterate on the workflow/user experience when more people get a chance to try it. Thanks, Geoff On Thu, Jun 25, 2015 at 12:17 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote: > To add some reasoning to Geoff's explanation, there are a couple of > reasons I think separating node configuration from test execution is better: > > 1. Allows for (slightly) more sane configuration across different > platforms -- if we wanted an RPM-based distro or something, that would just > be a separate script that needs to set up the same stuff. > 2. Makes some config options, like JDK version, easily testable without > requiring test support. > 3. You don't want to have to handle node configuration as part of a normal > test run. For jenkins-style tests that might make sense, but often you want > that step isolated. > 4. It's obviously generalizable and makes the config as easy as the method > you use to script it. Docker would be a good example here. You could do it > all through SSH, but that might actually suck if the DockerCluster > implementation for ducktape allocates new containers for each test run > (which would require configuring from scratch every time; if you already > had an image, that would imply you already ran that configuration as a > pre-process). > > It has some drawbacks: > > 1. More things to look at/be aware of. Setup process is more complex, and > I think this is part of what you encountered yesterday. > 2. Configuration scripts for different cluster types can get disconnected. > Ideally automated testing across all types of configs would make this > better, but if people add even just a couple of platforms, that's going to > be a *lot* of overhead to get set up. > > Maybe we should formalize that configuration step somehow in ducktape? > i.e. add some support to the cluster interface? Currently we assume the > cluster is already there + configured, and I'm not sure how this would work > out for "dynamic" clusters (like localhost or docker where you can add > nodes arbitrarily). But it doesn't sound like a bad idea to be able to boil > the commands down to: > > ducktape cluster configure > ducktape tests/ > > -Ewen > > P.S. We don't have a ducktape mailing list or anything right now (and it'd > be pretty lonely if we did), but we might also want to move some of this > discussion back to public lists/JIRAs. There are lots of good ideas flowing > here, don't want to lose any! > > On Thu, Jun 25, 2015 at 12:15 PM, Geoffrey Anderson <ge...@confluent.io> > wrote: > >> Hi Gwen, >> >> Provisioning and installation is *not* baked into ducktape >> (intentionally), so it's expected that the user has a mechanism to >> provision whatever machines they're using. In our kafkatest case, the >> Vagrant provisioning scripts do the work of installing zk, kafka, etc. on the >> slave machines. >> >> Within ducktape there currently exist three Cluster classes: >> LocalHostCluster, VagrantCluster, JsonCluster. The --cluster option >> specifies which cluster class to use.
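For illustration, selecting one of these on the command line looks roughly like this (the exact module path for the class may differ between ducktape versions, so treat it as a sketch):

    $ ducktape tests/ --cluster ducktape.cluster.vagrant.VagrantCluster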
>> >> The most generic of these is JsonCluster, which searches for a >> "cluster.json" file or takes JSON in its constructor, and this JSON data >> specifies enough information so that ssh commands can be run on individual >> nodes. >> >> VagrantCluster is actually just a JsonCluster but it parses vagrant >> ssh-config to create its json data. For example: >> >> {'nodes': [{'ssh_hostname': '127.0.0.1', 'hostname': 'worker1', 'user': >> 'vagrant', 'ssh_args': "-o 'HostName 127.0.0.1' -o 'Port 2222' -o >> 'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -o >> 'PasswordAuthentication no' -o 'IdentityFile >> /Users/geoffreyanderson/Google_Drive/Confluent_code/cp_system_tests/muckrake/.vagrant/machines/worker1/virtualbox/private_key' >> -o 'IdentitiesOnly yes' -o 'LogLevel FATAL' "}, {'ssh_hostname': >> '127.0.0.1', 'hostname': 'worker2', 'user': 'vagrant', 'ssh_args': "-o >> 'HostName 127.0.0.1' -o 'Port 2200' -o 'UserKnownHostsFile /dev/null' -o >> 'StrictHostKeyChecking no' -o 'PasswordAuthentication no' -o 'IdentityFile >> /Users/geoffreyanderson/Google_Drive/Confluent_code/cp_system_tests/muckrake/.vagrant/machines/worker2/virtualbox/private_key' >> -o 'IdentitiesOnly yes' -o 'LogLevel FATAL' "}, {'ssh_hostname': >> '127.0.0.1', 'hostname': 'worker3', 'user': 'vagrant', 'ssh_args': "-o >> 'HostName 127.0.0.1' -o 'Port 2201' -o 'UserKnownHostsFile /dev/null' -o >> 'StrictHostKeyChecking no' -o 'PasswordAuthentication no' -o 'IdentityFile >> /Users/geoffreyanderson/Google_Drive/Confluent_code/cp_system_tests/muckrake/.vagrant/machines/worker3/virtualbox/private_key' >> -o 'IdentitiesOnly yes' -o 'LogLevel FATAL' "}]} >> >> Does that make sense? >> >> Cheers, >> Geoff >> >> >> >> >> >> >> On Thu, Jun 25, 2015 at 11:40 AM, Gwen Shapira <gshap...@cloudera.com> >> wrote: >> >>> Looping back here: >>> >>> The generic cluster option takes a list of empty hosts (in json >>> format) and takes care of everything on them? i.e. installing Kafka >>> somewhere, installing ZK somewhere else, etc? >>> >>> The help says something different: >>> " --cluster CLUSTER cluster class to use to allocate nodes for >>> tests" >>> >>> :) >>> >>> On Wed, Jun 24, 2015 at 8:07 PM, Ewen Cheslack-Postava >>> <e...@confluent.io> wrote: >>> > Ah, yes. So this actually brings up the issue of non-Vagrant cluster >>> > implementations for ducktape. >>> > >>> > I hesitate to suggest making the easy path running on localhost since >>> it >>> > requires making the services work with random ports, not assume layout >>> of >>> > jars/code in the VM, etc., but that will always be the easiest setup >>> for the >>> > user. >>> > >>> > There's also the generic json-cluster version which just uses whatever >>> > machines you put into a JSON file (with hosts, ports, ssh info, etc.). >>> This >>> > is "easy" in the sense that there aren't many steps, but requires you >>> to >>> > make sure each server is setup well. >>> > >>> > I'd love suggestions about how to simplify these steps -- it's hard to >>> get >>> > right, but well worth the effort. Another idea is to wrap up some of >>> the >>> > steps for a from-scratch setup into an automated script. >>> > >>> > And Gwen, since we were just talking about the Docker version on >>> Twitter, I >>> > realized that my suggestion is the way I would do it using Vagrant. >>> However, >>> > you might also consider a Docker cluster plugin for Docker (which >>> would be >>> > pretty easy to create). 
You'd need to add the scripts to build the >>> Docker >>> > image that work the same as the VM setup, but I think you could just >>> > directly reuse the Vagrant provisioner scripts. >>> > >>> > -Ewen >>> > >>> > >>> > >>> > On Wed, Jun 24, 2015 at 8:01 PM, Gwen Shapira <gshap...@cloudera.com> >>> wrote: >>> >> >>> >> It is too complicated in the EC2 section. I skipped that part >>> completely. >>> >> >>> >> If I get a vote, I'd go for very simple 1. 2. 3. steps in the Readme >>> >> and point to external docs for more details (if you are using Vagrant, >>> >> press 1. If you are using EC2, press 2. If you are using Docker, >>> >> submit a pull request.). >>> >> >>> >> >>> >> >>> >> >>> >> On Wed, Jun 24, 2015 at 7:57 PM, Ewen Cheslack-Postava >>> >> <e...@confluent.io> wrote: >>> >> > Wow, awesome. That was easier than I thought it would be :) >>> >> > >>> >> > Geoff, want to take a stab at clarifying the README? This is also >>> just a >>> >> > bit >>> >> > tough since there are a variety of different ways to run it -- the >>> >> > vagrant/README.md was kind of a pain too, and I still think it's too >>> >> > complicated in the EC2 section... >>> >> > >>> >> > -Ewen >>> >> > >>> >> > On Wed, Jun 24, 2015 at 7:53 PM, Gwen Shapira < >>> gshap...@cloudera.com> >>> >> > wrote: >>> >> >> >>> >> >> OMG! It works! Slow, but it works! >>> >> >> >>> >> >> [INFO:2015-06-24 19:46:02,318]: SerialTestRunner: running 14 >>> tests... >>> >> >> [INFO:2015-06-24 19:46:02,319]: SerialTestRunner: >>> >> >> kafkatest.tests.benchmark_test.Benchmark.test_end_to_end_latency: >>> >> >> running test 1 of 14 >>> >> >> [INFO:2015-06-24 19:46:02,320]: SerialTestRunner: >>> >> >> kafkatest.tests.benchmark_test.Benchmark.test_end_to_end_latency: >>> >> >> setting up >>> >> >> [INFO:2015-06-24 19:46:36,345]: SerialTestRunner: >>> >> >> kafkatest.tests.benchmark_test.Benchmark.test_end_to_end_latency: >>> >> >> running >>> >> >> [INFO:2015-06-24 19:47:25,235]: SerialTestRunner: >>> >> >> kafkatest.tests.benchmark_test.Benchmark.test_end_to_end_latency: >>> PASS >>> >> >> [INFO:2015-06-24 19:47:25,237]: SerialTestRunner: >>> >> >> kafkatest.tests.benchmark_test.Benchmark.test_end_to_end_latency: >>> >> >> tearing down >>> >> >> >>> >> >> >>> >> >> >>> ====================================================================================================================================================================================================== >>> >> >> test_id: >>> >> >> >>> >> >> >>> 2015-06-24--003.kafkatest.tests.benchmark_test.Benchmark.test_end_to_end_latency >>> >> >> status: PASS >>> >> >> run time: 54.254 seconds >>> >> >> {"latency_99th_ms": 13.0, "latency_50th_ms": 3.0, >>> "latency_999th_ms": >>> >> >> 18.0} >>> >> >> >>> >> >> If we can either clarify the doc or package the thing a bit better >>> (or >>> >> >> both), I'm ready to +1. >>> >> >> >>> >> >> On Wed, Jun 24, 2015 at 7:45 PM, Gwen Shapira < >>> gshap...@cloudera.com> >>> >> >> wrote: >>> >> >> > The README was clear enough, but I'm wonder if we can package it >>> >> >> > better so we will provide the Vagrantfile.local somewhere under >>> tests >>> >> >> > and tell users to run "vagrant up" from there? Or even include a >>> >> >> > setup >>> >> >> > script that will make sure we got the right file? 
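For reference, the manual sequence such a setup script would wrap is short (settings are the ones Geoff suggests further down; his mail uses num_kafka where Ewen's list of defaults uses num_brokers, so it's worth double-checking the names against the checked-in Vagrantfile):

    # Vagrantfile.local, in the kafka repository root (gitignored)
    num_zookeepers = 0
    num_kafka = 0
    num_workers = 5

    $ vagrant destroy -f   # only needed if the default zk/broker VMs are already up
    $ vagrant up           # brings up worker VMs with everything installed, nothing running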
>>> >> >> > >>> >> >> > And opened Ducktape issue #62 :) >>> >> >> > >>> >> >> > Gwen >>> >> >> > >>> >> >> > On Wed, Jun 24, 2015 at 7:32 PM, Ewen Cheslack-Postava >>> >> >> > <e...@confluent.io> wrote: >>> >> >> >> >>> >> >> >> >>> >> >> >> On Wed, Jun 24, 2015 at 7:19 PM, Gwen Shapira >>> >> >> >> <gshap...@cloudera.com> >>> >> >> >> wrote: >>> >> >> >>> >>> >> >> >>> I clearly did something wrong :) >>> >> >> >>> >>> >> >> >>> gshapira-MBP-2:tests gshapira$ vagrant status >>> >> >> >>> Current machine states: >>> >> >> >>> >>> >> >> >>> zk1 running (virtualbox) >>> >> >> >>> broker1 running (virtualbox) >>> >> >> >>> broker2 running (virtualbox) >>> >> >> >>> broker3 running (virtualbox) >>> >> >> >> >>> >> >> >> >>> >> >> >> Ah, so these are the defaults you get out of the Vagrantfile. >>> The >>> >> >> >> way >>> >> >> >> this >>> >> >> >> works is that there is a Vagrantfile checked in that includes >>> >> >> >> support >>> >> >> >> for >>> >> >> >> pulling up a small Kafka cluster. This was the original use >>> case for >>> >> >> >> the >>> >> >> >> Vagrantfile (and from what I see in some bug reports, is in fact >>> >> >> >> being >>> >> >> >> used >>> >> >> >> this way). The Vagrantfile supports bringing up clusters of >>> varying >>> >> >> >> sizes >>> >> >> >> for ZK and Kafka, and the default settings are: >>> >> >> >> >>> >> >> >> num_zookeepers=1 >>> >> >> >> num_brokers=3 >>> >> >> >> num_workers=0 >>> >> >> >> >>> >> >> >> On ZK and broker machines, we actually start those services. >>> >> >> >> >>> >> >> >> However, for these tests, we actually want VMs with everything >>> >> >> >> installed, >>> >> >> >> but nothing running yet. When I created the Vagrantfile >>> originally, >>> >> >> >> I >>> >> >> >> added >>> >> >> >> the num_workers setting for a similar reason -- so you could run >>> >> >> >> one-off >>> >> >> >> commands from within a VM, e.g. console-producer/consumer, the >>> perf >>> >> >> >> tools, >>> >> >> >> etc. We reuse that same type of node for these tests. >>> >> >> >> >>> >> >> >> You can add a file Vagrantfile.local (which is gitignored) that >>> >> >> >> overrides >>> >> >> >> any of the settings in the Vagrantfile, so in this case you'll >>> add >>> >> >> >> the >>> >> >> >> settings Geoff suggested. >>> >> >> >> >>> >> >> >> *BUT* just in case you're not super familiar with Vagrant, >>> you'll >>> >> >> >> want >>> >> >> >> to >>> >> >> >> >>> >> >> >> vagrant destroy >>> >> >> >> >>> >> >> >> those other nodes first. Otherwise the change to the vagrantfile >>> >> >> >> will >>> >> >> >> cause >>> >> >> >> Vagrant to lose track of the already running VMs. >>> >> >> >> >>> >> >> >> This kind of sucks for your first setup because you'll need to >>> >> >> >> rebuild >>> >> >> >> those >>> >> >> >> machines, which is a bit slow if you're running them in >>> Virtualbox. >>> >> >> >> In >>> >> >> >> theory you could just add a couple of workers and kill the >>> >> >> >> appropriate >>> >> >> >> processes in the zkX and brokerX VMs, but it's probably simpler >>> to >>> >> >> >> just >>> >> >> >> get >>> >> >> >> the Vagrantfile config right from the get go. >>> >> >> >> >>> >> >> >>> >>> >> >> >>> >>> >> >> >>> Any idea where it picked up this configuration? >>> >> >> >>> I'll add the default you recommended. It was actually >>> mentioned in >>> >> >> >>> Ducktape readme, but I thought I saw a default file in the PR, >>> so I >>> >> >> >>> skipped this step. 
>>> >> >> >> >>> >> >> >> >>> >> >> >> Not sure if this is specific to you having seen the PR (which >>> has an >>> >> >> >> example >>> >> >> >> file, but expected you to already know about the >>> Vagrantfile.local >>> >> >> >> file) or >>> >> >> >> if this is just a general problem with the README that needs >>> >> >> >> clarification. >>> >> >> >> >>> >> >> >>> >>> >> >> >>> >>> >> >> >>> The error message should probably include how many exist, how >>> many >>> >> >> >>> are >>> >> >> >>> expected and if at all possible, where was the number of >>> expected >>> >> >> >>> defined. Want me to open a Ducktape issue? >>> >> >> >> >>> >> >> >> >>> >> >> >> Yes please! This is the kind of stuff we've already gotten used >>> to >>> >> >> >> that >>> >> >> >> clearly could have better useability. >>> >> >> >> >>> >> >> >> -Ewen >>> >> >> >> >>> >> >> >>> >>> >> >> >>> >>> >> >> >>> Gwen >>> >> >> >>> >>> >> >> >>> >>> >> >> >>> On Wed, Jun 24, 2015 at 7:08 PM, Geoffrey Anderson >>> >> >> >>> <ge...@confluent.io> >>> >> >> >>> wrote: >>> >> >> >>> > Hey Gwen, >>> >> >> >>> > >>> >> >> >>> > Cool, nice to see you're giving it a spin! >>> >> >> >>> > >>> >> >> >>> > First question: what do you see if you run >>> >> >> >>> > $ vagrant status >>> >> >> >>> > >>> >> >> >>> > You should see (and no nodes named zk or broker) >>> >> >> >>> > worker1 ... >>> >> >> >>> > worker2 ... >>> >> >> >>> > worker3 ... >>> >> >> >>> > >>> >> >> >>> > VagrantCluster in ducktape assumes the workers are named >>> >> >> >>> > "workerX" >>> >> >> >>> > >>> >> >> >>> > You can guarantee this is the case by creating >>> Vagrantfile.local >>> >> >> >>> > in >>> >> >> >>> > your >>> >> >> >>> > kafka directory and making sure it looks like >>> >> >> >>> > num_zookeepers = 0 >>> >> >> >>> > num_kafka = 0 >>> >> >> >>> > num_workers = 5 >>> >> >> >>> > >>> >> >> >>> > Second, it looks like you'll need at least 5 workers for this >>> >> >> >>> > test >>> >> >> >>> > (1 >>> >> >> >>> > for >>> >> >> >>> > zookeeper, 3 for kafka, 1 for running EndToEndLatency) >>> >> >> >>> > >>> >> >> >>> > The default min_cluster_size method is on the base Test >>> class and >>> >> >> >>> > provides a >>> >> >> >>> > way to error out early if the cluster is too small to run the >>> >> >> >>> > test. >>> >> >> >>> > Hmm >>> >> >> >>> > it >>> >> >> >>> > looks like this error message could be updated to be more >>> useful. >>> >> >> >>> > >>> >> >> >>> > Thanks, >>> >> >> >>> > Geoff >>> >> >> >>> > >>> >> >> >>> > >>> >> >> >>> > >>> >> >> >>> > >>> >> >> >>> > >>> >> >> >>> > On Wed, Jun 24, 2015 at 6:45 PM, Gwen Shapira >>> >> >> >>> > <gshap...@cloudera.com> >>> >> >> >>> > wrote: >>> >> >> >>> >> >>> >> >> >>> >> Sorry, I can't get the tests to work, and it may be because >>> I'm >>> >> >> >>> >> so >>> >> >> >>> >> unfamiliar with the stack. I figured I'll spare dev@kafka >>> my >>> >> >> >>> >> newbie >>> >> >> >>> >> questions. 
>>> >> >> >>> >> >>> >> >> >>> >> I did my best to follow the instructions: I installed >>> vagrant, >>> >> >> >>> >> started >>> >> >> >>> >> the nodes, validated that I have 3 broker servers, installed >>> >> >> >>> >> ducktape >>> >> >> >>> >> and tried running the tests: >>> >> >> >>> >> >>> >> >> >>> >> [INFO:2015-06-24 18:06:49,554]: SerialTestRunner: >>> >> >> >>> >> >>> >> >> >>> >> >>> kafkatest.tests.benchmark_test.Benchmark.test_end_to_end_latency: >>> >> >> >>> >> running test 1 of 14 >>> >> >> >>> >> [INFO:2015-06-24 18:06:49,554]: SerialTestRunner: >>> >> >> >>> >> >>> >> >> >>> >> >>> kafkatest.tests.benchmark_test.Benchmark.test_end_to_end_latency: >>> >> >> >>> >> setting up >>> >> >> >>> >> [INFO:2015-06-24 18:07:23,630]: SerialTestRunner: >>> >> >> >>> >> >>> >> >> >>> >> >>> kafkatest.tests.benchmark_test.Benchmark.test_end_to_end_latency: >>> >> >> >>> >> running >>> >> >> >>> >> [INFO:2015-06-24 18:07:23,631]: SerialTestRunner: >>> >> >> >>> >> >>> >> >> >>> >> >>> kafkatest.tests.benchmark_test.Benchmark.test_end_to_end_latency: >>> >> >> >>> >> FAIL >>> >> >> >>> >> [INFO:2015-06-24 18:07:23,634]: SerialTestRunner: >>> >> >> >>> >> >>> >> >> >>> >> >>> kafkatest.tests.benchmark_test.Benchmark.test_end_to_end_latency: >>> >> >> >>> >> tearing down >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> ==================================================================================================================================================================================== >>> >> >> >>> >> test_id: >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> 2015-06-24--002.kafkatest.tests.benchmark_test.Benchmark.test_end_to_end_latency >>> >> >> >>> >> status: FAIL >>> >> >> >>> >> run time: 2.929 seconds >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> There aren't enough available nodes to satisfy the >>> resource >>> >> >> >>> >> request. Your test has almost certainly incorrectly >>> implemented >>> >> >> >>> >> its >>> >> >> >>> >> min_cluster_size() method. 
>>> >> >> >>> >> Traceback (most recent call last): >>> >> >> >>> >> File >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> "/Library/Python/2.7/site-packages/ducktape-0.2.0-py2.7.egg/ducktape/tests/runner.py", >>> >> >> >>> >> line 88, in run_all_tests >>> >> >> >>> >> result.data = self.run_single_test() >>> >> >> >>> >> File >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> "/Library/Python/2.7/site-packages/ducktape-0.2.0-py2.7.egg/ducktape/tests/runner.py", >>> >> >> >>> >> line 133, in run_single_test >>> >> >> >>> >> return >>> self.current_test_context.function(self.current_test) >>> >> >> >>> >> File >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> "/Users/gshapira/workspace/confluent-kafka/tests/kafkatest/tests/benchmark_test.py", >>> >> >> >>> >> line 166, in test_end_to_end_latency >>> >> >> >>> >> self.perf.run() >>> >> >> >>> >> File >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> "/Library/Python/2.7/site-packages/ducktape-0.2.0-py2.7.egg/ducktape/services/service.py", >>> >> >> >>> >> line 174, in run >>> >> >> >>> >> self.start() >>> >> >> >>> >> File >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> "/Library/Python/2.7/site-packages/ducktape-0.2.0-py2.7.egg/ducktape/services/service.py", >>> >> >> >>> >> line 101, in start >>> >> >> >>> >> self.allocate_nodes() >>> >> >> >>> >> File >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> "/Library/Python/2.7/site-packages/ducktape-0.2.0-py2.7.egg/ducktape/services/service.py", >>> >> >> >>> >> line 81, in allocate_nodes >>> >> >> >>> >> self.nodes = self.cluster.request(self.num_nodes) >>> >> >> >>> >> File >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> "/Library/Python/2.7/site-packages/ducktape-0.2.0-py2.7.egg/ducktape/cluster/json.py", >>> >> >> >>> >> line 48, in request >>> >> >> >>> >> "certainly incorrectly implemented its >>> min_cluster_size() >>> >> >> >>> >> method.") >>> >> >> >>> >> RuntimeError: There aren't enough available nodes to >>> satisfy the >>> >> >> >>> >> resource request. Your test has almost certainly incorrectly >>> >> >> >>> >> implemented its min_cluster_size() method. >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> I didn't see any min_cluster_size() method in the tests - is >>> >> >> >>> >> that >>> >> >> >>> >> the >>> >> >> >>> >> issue? Do we need to add it? >>> >> >> >>> >> >>> >> >> >>> >> Gwen >>> >> >> >>> > >>> >> >> >>> > >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> -- >>> >> >> >> Thanks, >>> >> >> >> Ewen >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > -- >>> >> > Thanks, >>> >> > Ewen >>> > >>> > >>> > >>> > >>> > -- >>> > Thanks, >>> > Ewen >>> >> >> > > > -- > Thanks, > Ewen >
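To make the min_cluster_size() hook concrete: it's just a method on ducktape's base Test class that a test overrides to declare how many nodes it needs, so the runner can fail fast before allocating services. A rough sketch (the class name is illustrative; the node counts are the ones Geoff gives above for the end-to-end latency test):

    from ducktape.tests.test import Test

    class EndToEndLatencyBenchmark(Test):
        def min_cluster_size(self):
            # 1 node for ZooKeeper + 3 Kafka brokers + 1 node to drive EndToEndLatency
            return 5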