[jira] [Commented] (KAFKA-5608) System test failure due to timeout starting Jmx tool
[ https://issues.apache.org/jira/browse/KAFKA-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16096194#comment-16096194 ] ASF GitHub Bot commented on KAFKA-5608: --- Github user asfgit closed the pull request at: https://github.com/apache/kafka/pull/3553 > System test failure due to timeout starting Jmx tool > > > Key: KAFKA-5608 > URL: https://issues.apache.org/jira/browse/KAFKA-5608 > Project: Kafka > Issue Type: Bug >Reporter: Jason Gustafson >Assignee: Ewen Cheslack-Postava > Fix For: 0.11.0.1, 0.11.1.0 > > > Began seeing this in some failing system tests: > {code} > [INFO - 2017-07-18 14:25:55,375 - background_thread - _protected_worker - > lineno:39]: Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/services/background_thread.py", > line 35, in _protected_worker > self._worker(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/console_consumer.py", > line 261, in _worker > self.start_jmx_tool(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/monitor/jmx.py", > line 73, in start_jmx_tool > wait_until(lambda: self._jmx_has_output(node), timeout_sec=10, > backoff_sec=.5, err_msg="%s: Jmx tool took too long to start" % node.account) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: ubuntu@worker7: Jmx tool took too long to start > {code} > This is immediately followed by a consumer timeout in the failing cases: > {code} > [INFO - 2017-07-18 14:26:46,907 - runner_client - log - lineno:221]: > RunnerClient: > kafkatest.tests.core.security_rolling_upgrade_test.TestSecurityRollingUpgrade.test_rolling_upgrade_phase_two.broker_protocol=SASL_SSL.client_protocol=SASL_SSL: > FAIL: Consumer failed to consume messages for 60s. > Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 123, in run > data = self.run_test() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 176, in run_test > return self.test_context.function(self.test) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py", > line 321, in wrapper > return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py", > line 148, in test_rolling_upgrade_phase_two > self.run_produce_consume_validate(self.roll_in_secured_settings, > client_protocol, broker_protocol) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 106, in run_produce_consume_validate > self.start_producer_and_consumer() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 79, in start_producer_and_consumer > self.consumer_start_timeout_sec) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: Consumer failed to consume messages for 60s. > {code} > There does not appear to be anything wrong with the consumer in the logs, so > the timeout seems to be caused by the Jmx tool timeout. > Possibly due to https://github.com/apache/kafka/pull/3447/files? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-5608) System test failure due to timeout starting Jmx tool
[ https://issues.apache.org/jira/browse/KAFKA-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094541#comment-16094541 ] ASF GitHub Bot commented on KAFKA-5608: --- GitHub user ijuma opened a pull request: https://github.com/apache/kafka/pull/3553 KAFKA-5608: Follow-up to fix potential NPE and clarify method name You can merge this pull request into a Git repository by running: $ git pull https://github.com/ijuma/kafka kafka-5608-follow-up Alternatively you can review and apply these changes as the patch at: https://github.com/apache/kafka/pull/3553.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3553 commit 7f43b0f23402c7018532f92aac438929bc83ca68 Author: Ismael Juma Date: 2017-07-20T11:09:02Z KAFKA-5608: Follow-up to fix potential NPE and clarify method name > System test failure due to timeout starting Jmx tool > > > Key: KAFKA-5608 > URL: https://issues.apache.org/jira/browse/KAFKA-5608 > Project: Kafka > Issue Type: Bug >Reporter: Jason Gustafson >Assignee: Ewen Cheslack-Postava > Fix For: 0.11.0.1, 0.11.1.0 > > > Began seeing this in some failing system tests: > {code} > [INFO - 2017-07-18 14:25:55,375 - background_thread - _protected_worker - > lineno:39]: Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/services/background_thread.py", > line 35, in _protected_worker > self._worker(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/console_consumer.py", > line 261, in _worker > self.start_jmx_tool(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/monitor/jmx.py", > line 73, in start_jmx_tool > wait_until(lambda: self._jmx_has_output(node), timeout_sec=10, > backoff_sec=.5, err_msg="%s: Jmx tool took too long to start" % node.account) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: ubuntu@worker7: Jmx tool took too long to start > {code} > This is immediately followed by a consumer timeout in the failing cases: > {code} > [INFO - 2017-07-18 14:26:46,907 - runner_client - log - lineno:221]: > RunnerClient: > kafkatest.tests.core.security_rolling_upgrade_test.TestSecurityRollingUpgrade.test_rolling_upgrade_phase_two.broker_protocol=SASL_SSL.client_protocol=SASL_SSL: > FAIL: Consumer failed to consume messages for 60s. > Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 123, in run > data = self.run_test() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 176, in run_test > return self.test_context.function(self.test) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py", > line 321, in wrapper > return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py", > line 148, in test_rolling_upgrade_phase_two > self.run_produce_consume_validate(self.roll_in_secured_settings, > client_protocol, broker_protocol) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 106, in run_produce_consume_validate > self.start_producer_and_consumer() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 79, in start_producer_and_consumer > self.consumer_start_timeout_sec) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: Consumer failed to consume messages for 60s. > {code} > There does not appear to be anything wrong with the consumer in the logs, so > the timeout seems to be caused by the Jmx tool timeout. > Possibly due to https://github.com/apache/kafka/pull/3447/files? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-5608) System test failure due to timeout starting Jmx tool
[ https://issues.apache.org/jira/browse/KAFKA-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094188#comment-16094188 ] Ewen Cheslack-Postava commented on KAFKA-5608: -- For anyone following up, there was quite a bit more conversation on the PR itself: https://github.com/apache/kafka/pull/3547 > System test failure due to timeout starting Jmx tool > > > Key: KAFKA-5608 > URL: https://issues.apache.org/jira/browse/KAFKA-5608 > Project: Kafka > Issue Type: Bug >Reporter: Jason Gustafson >Assignee: Ewen Cheslack-Postava > Fix For: 0.11.0.1, 0.11.1.0 > > > Began seeing this in some failing system tests: > {code} > [INFO - 2017-07-18 14:25:55,375 - background_thread - _protected_worker - > lineno:39]: Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/services/background_thread.py", > line 35, in _protected_worker > self._worker(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/console_consumer.py", > line 261, in _worker > self.start_jmx_tool(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/monitor/jmx.py", > line 73, in start_jmx_tool > wait_until(lambda: self._jmx_has_output(node), timeout_sec=10, > backoff_sec=.5, err_msg="%s: Jmx tool took too long to start" % node.account) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: ubuntu@worker7: Jmx tool took too long to start > {code} > This is immediately followed by a consumer timeout in the failing cases: > {code} > [INFO - 2017-07-18 14:26:46,907 - runner_client - log - lineno:221]: > RunnerClient: > kafkatest.tests.core.security_rolling_upgrade_test.TestSecurityRollingUpgrade.test_rolling_upgrade_phase_two.broker_protocol=SASL_SSL.client_protocol=SASL_SSL: > FAIL: Consumer failed to consume messages for 60s. > Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 123, in run > data = self.run_test() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 176, in run_test > return self.test_context.function(self.test) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py", > line 321, in wrapper > return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py", > line 148, in test_rolling_upgrade_phase_two > self.run_produce_consume_validate(self.roll_in_secured_settings, > client_protocol, broker_protocol) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 106, in run_produce_consume_validate > self.start_producer_and_consumer() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 79, in start_producer_and_consumer > self.consumer_start_timeout_sec) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: Consumer failed to consume messages for 60s. > {code} > There does not appear to be anything wrong with the consumer in the logs, so > the timeout seems to be caused by the Jmx tool timeout. > Possibly due to https://github.com/apache/kafka/pull/3447/files? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-5608) System test failure due to timeout starting Jmx tool
[ https://issues.apache.org/jira/browse/KAFKA-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094164#comment-16094164 ] ASF GitHub Bot commented on KAFKA-5608: --- Github user asfgit closed the pull request at: https://github.com/apache/kafka/pull/3547 > System test failure due to timeout starting Jmx tool > > > Key: KAFKA-5608 > URL: https://issues.apache.org/jira/browse/KAFKA-5608 > Project: Kafka > Issue Type: Bug >Reporter: Jason Gustafson >Assignee: Ewen Cheslack-Postava > Fix For: 0.11.0.1, 0.11.1.0 > > > Began seeing this in some failing system tests: > {code} > [INFO - 2017-07-18 14:25:55,375 - background_thread - _protected_worker - > lineno:39]: Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/services/background_thread.py", > line 35, in _protected_worker > self._worker(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/console_consumer.py", > line 261, in _worker > self.start_jmx_tool(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/monitor/jmx.py", > line 73, in start_jmx_tool > wait_until(lambda: self._jmx_has_output(node), timeout_sec=10, > backoff_sec=.5, err_msg="%s: Jmx tool took too long to start" % node.account) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: ubuntu@worker7: Jmx tool took too long to start > {code} > This is immediately followed by a consumer timeout in the failing cases: > {code} > [INFO - 2017-07-18 14:26:46,907 - runner_client - log - lineno:221]: > RunnerClient: > kafkatest.tests.core.security_rolling_upgrade_test.TestSecurityRollingUpgrade.test_rolling_upgrade_phase_two.broker_protocol=SASL_SSL.client_protocol=SASL_SSL: > FAIL: Consumer failed to consume messages for 60s. > Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 123, in run > data = self.run_test() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 176, in run_test > return self.test_context.function(self.test) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py", > line 321, in wrapper > return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py", > line 148, in test_rolling_upgrade_phase_two > self.run_produce_consume_validate(self.roll_in_secured_settings, > client_protocol, broker_protocol) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 106, in run_produce_consume_validate > self.start_producer_and_consumer() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 79, in start_producer_and_consumer > self.consumer_start_timeout_sec) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: Consumer failed to consume messages for 60s. > {code} > There does not appear to be anything wrong with the consumer in the logs, so > the timeout seems to be caused by the Jmx tool timeout. > Possibly due to https://github.com/apache/kafka/pull/3447/files? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-5608) System test failure due to timeout starting Jmx tool
[ https://issues.apache.org/jira/browse/KAFKA-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092578#comment-16092578 ] ASF GitHub Bot commented on KAFKA-5608: --- GitHub user ewencp opened a pull request: https://github.com/apache/kafka/pull/3547 KAFKA-5608: Add --wait option for JmxTool and use in system tests to avoid race between JmxTool and monitored services You can merge this pull request into a Git repository by running: $ git pull https://github.com/ewencp/kafka wait-jmx-metrics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/kafka/pull/3547.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3547 commit b33306053eda3ece107965b0db6111f36cc3232e Author: Ewen Cheslack-Postava Date: 2017-07-19T04:31:23Z KAFKA-5608: Add --wait option for JmxTool and use in system tests to avoid race between JmxTool and monitored services > System test failure due to timeout starting Jmx tool > > > Key: KAFKA-5608 > URL: https://issues.apache.org/jira/browse/KAFKA-5608 > Project: Kafka > Issue Type: Bug >Reporter: Jason Gustafson >Assignee: Ewen Cheslack-Postava > > Began seeing this in some failing system tests: > {code} > [INFO - 2017-07-18 14:25:55,375 - background_thread - _protected_worker - > lineno:39]: Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/services/background_thread.py", > line 35, in _protected_worker > self._worker(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/console_consumer.py", > line 261, in _worker > self.start_jmx_tool(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/monitor/jmx.py", > line 73, in start_jmx_tool > wait_until(lambda: self._jmx_has_output(node), timeout_sec=10, > backoff_sec=.5, err_msg="%s: Jmx tool took too long to start" % node.account) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: ubuntu@worker7: Jmx tool took too long to start > {code} > This is immediately followed by a consumer timeout in the failing cases: > {code} > [INFO - 2017-07-18 14:26:46,907 - runner_client - log - lineno:221]: > RunnerClient: > kafkatest.tests.core.security_rolling_upgrade_test.TestSecurityRollingUpgrade.test_rolling_upgrade_phase_two.broker_protocol=SASL_SSL.client_protocol=SASL_SSL: > FAIL: Consumer failed to consume messages for 60s. > Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 123, in run > data = self.run_test() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 176, in run_test > return self.test_context.function(self.test) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py", > line 321, in wrapper > return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py", > line 148, in test_rolling_upgrade_phase_two > self.run_produce_consume_validate(self.roll_in_secured_settings, > client_protocol, broker_protocol) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 106, in run_produce_consume_validate > self.start_producer_and_consumer() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 79, in start_producer_and_consumer > self.consumer_start_timeout_sec) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: Consumer failed to consume messages for 60s. > {code} > There does not appear to be anything wrong with the consumer in the logs, so > the timeout seems to be caused by the Jmx tool timeout. > Possibly due to https://github.com/apache/kafka/pull/3447/files? -- Thi
[jira] [Commented] (KAFKA-5608) System test failure due to timeout starting Jmx tool
[ https://issues.apache.org/jira/browse/KAFKA-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092497#comment-16092497 ] Jason Gustafson commented on KAFKA-5608: [~ewencp] I think there is a race condition when starting up the jmx tool at the same time as the console consumer. If the metrics haven't been registered at initialization time, we may end up querying for nothing. We should probably modify the tool to wait until at least one expected name has been found before beginning polling. I did a quick dirty fix locally and it seemed to work. Does that sound plausible? > System test failure due to timeout starting Jmx tool > > > Key: KAFKA-5608 > URL: https://issues.apache.org/jira/browse/KAFKA-5608 > Project: Kafka > Issue Type: Bug >Reporter: Jason Gustafson >Assignee: Ewen Cheslack-Postava > > Began seeing this in some failing system tests: > {code} > [INFO - 2017-07-18 14:25:55,375 - background_thread - _protected_worker - > lineno:39]: Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/services/background_thread.py", > line 35, in _protected_worker > self._worker(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/console_consumer.py", > line 261, in _worker > self.start_jmx_tool(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/monitor/jmx.py", > line 73, in start_jmx_tool > wait_until(lambda: self._jmx_has_output(node), timeout_sec=10, > backoff_sec=.5, err_msg="%s: Jmx tool took too long to start" % node.account) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: ubuntu@worker7: Jmx tool took too long to start > {code} > This is immediately followed by a consumer timeout in the failing cases: > {code} > [INFO - 2017-07-18 14:26:46,907 - runner_client - log - lineno:221]: > RunnerClient: > kafkatest.tests.core.security_rolling_upgrade_test.TestSecurityRollingUpgrade.test_rolling_upgrade_phase_two.broker_protocol=SASL_SSL.client_protocol=SASL_SSL: > FAIL: Consumer failed to consume messages for 60s. > Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 123, in run > data = self.run_test() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 176, in run_test > return self.test_context.function(self.test) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py", > line 321, in wrapper > return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py", > line 148, in test_rolling_upgrade_phase_two > self.run_produce_consume_validate(self.roll_in_secured_settings, > client_protocol, broker_protocol) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 106, in run_produce_consume_validate > self.start_producer_and_consumer() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 79, in start_producer_and_consumer > self.consumer_start_timeout_sec) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: Consumer failed to consume messages for 60s. > {code} > There does not appear to be anything wrong with the consumer in the logs, so > the timeout seems to be caused by the Jmx tool timeout. > Possibly due to https://github.com/apache/kafka/pull/3447/files? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-5608) System test failure due to timeout starting Jmx tool
[ https://issues.apache.org/jira/browse/KAFKA-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092467#comment-16092467 ] Ewen Cheslack-Postava commented on KAFKA-5608: -- Another interesting point by [~hachikuji] is that all the tests that fail with this error seem to use SASL. > System test failure due to timeout starting Jmx tool > > > Key: KAFKA-5608 > URL: https://issues.apache.org/jira/browse/KAFKA-5608 > Project: Kafka > Issue Type: Bug >Reporter: Jason Gustafson >Assignee: Ewen Cheslack-Postava > > Began seeing this in some failing system tests: > {code} > [INFO - 2017-07-18 14:25:55,375 - background_thread - _protected_worker - > lineno:39]: Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/services/background_thread.py", > line 35, in _protected_worker > self._worker(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/console_consumer.py", > line 261, in _worker > self.start_jmx_tool(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/monitor/jmx.py", > line 73, in start_jmx_tool > wait_until(lambda: self._jmx_has_output(node), timeout_sec=10, > backoff_sec=.5, err_msg="%s: Jmx tool took too long to start" % node.account) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: ubuntu@worker7: Jmx tool took too long to start > {code} > This is immediately followed by a consumer timeout in the failing cases: > {code} > [INFO - 2017-07-18 14:26:46,907 - runner_client - log - lineno:221]: > RunnerClient: > kafkatest.tests.core.security_rolling_upgrade_test.TestSecurityRollingUpgrade.test_rolling_upgrade_phase_two.broker_protocol=SASL_SSL.client_protocol=SASL_SSL: > FAIL: Consumer failed to consume messages for 60s. > Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 123, in run > data = self.run_test() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 176, in run_test > return self.test_context.function(self.test) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py", > line 321, in wrapper > return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py", > line 148, in test_rolling_upgrade_phase_two > self.run_produce_consume_validate(self.roll_in_secured_settings, > client_protocol, broker_protocol) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 106, in run_produce_consume_validate > self.start_producer_and_consumer() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 79, in start_producer_and_consumer > self.consumer_start_timeout_sec) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: Consumer failed to consume messages for 60s. > {code} > There does not appear to be anything wrong with the consumer in the logs, so > the timeout seems to be caused by the Jmx tool timeout. > Possibly due to https://github.com/apache/kafka/pull/3447/files? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-5608) System test failure due to timeout starting Jmx tool
[ https://issues.apache.org/jira/browse/KAFKA-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092414#comment-16092414 ] Ewen Cheslack-Postava commented on KAFKA-5608: -- Further adding to the confusion, I've found a test ConnectMysqlTest.test_query_modes.query_mode=table.incremental_mode=timestamp which has this in the jmx_tool.err.log: {quote} Trying to connect to JMX url: service:jmx:rmi:///jndi/rmi://127.0.0.1:9192/jmxrmi. Trying to connect to JMX url: service:jmx:rmi:///jndi/rmi://127.0.0.1:9192/jmxrmi. {quote} But based on the code https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/tools/JmxTool.scala#L98-L110 it doesn't seem that it should be possible to see that message twice without also seeing an error message... > System test failure due to timeout starting Jmx tool > > > Key: KAFKA-5608 > URL: https://issues.apache.org/jira/browse/KAFKA-5608 > Project: Kafka > Issue Type: Bug >Reporter: Jason Gustafson >Assignee: Ewen Cheslack-Postava > > Began seeing this in some failing system tests: > {code} > [INFO - 2017-07-18 14:25:55,375 - background_thread - _protected_worker - > lineno:39]: Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/services/background_thread.py", > line 35, in _protected_worker > self._worker(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/console_consumer.py", > line 261, in _worker > self.start_jmx_tool(idx, node) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/monitor/jmx.py", > line 73, in start_jmx_tool > wait_until(lambda: self._jmx_has_output(node), timeout_sec=10, > backoff_sec=.5, err_msg="%s: Jmx tool took too long to start" % node.account) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: ubuntu@worker7: Jmx tool took too long to start > {code} > This is immediately followed by a consumer timeout in the failing cases: > {code} > [INFO - 2017-07-18 14:26:46,907 - runner_client - log - lineno:221]: > RunnerClient: > kafkatest.tests.core.security_rolling_upgrade_test.TestSecurityRollingUpgrade.test_rolling_upgrade_phase_two.broker_protocol=SASL_SSL.client_protocol=SASL_SSL: > FAIL: Consumer failed to consume messages for 60s. > Traceback (most recent call last): > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 123, in run > data = self.run_test() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 176, in run_test > return self.test_context.function(self.test) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py", > line 321, in wrapper > return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py", > line 148, in test_rolling_upgrade_phase_two > self.run_produce_consume_validate(self.roll_in_secured_settings, > client_protocol, broker_protocol) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 106, in run_produce_consume_validate > self.start_producer_and_consumer() > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 79, in start_producer_and_consumer > self.consumer_start_timeout_sec) > File > "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: Consumer failed to consume messages for 60s. > {code} > There does not appear to be anything wrong with the consumer in the logs, so > the timeout seems to be caused by the Jmx tool timeout. > Possibly due to https://github.com/apache/kafka/pull/3447/files? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-5608) System test failure due to timeout starting Jmx tool
[ https://issues.apache.org/jira/browse/KAFKA-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092286#comment-16092286 ] Ewen Cheslack-Postava commented on KAFKA-5608: -- It looks like the JmxTool really did take too long to generate output: {quote} [INFO - 2017-07-18 14:25:44,820 - kafka - bootstrap_servers - lineno:549]: Bootstrap client port is: 9095 [DEBUG - 2017-07-18 14:25:44,820 - console_consumer - _worker - lineno:255]: Console consumer 1 command: export JMX_PORT=9192; export LOG_DIR=/mnt/console_consumer/logs; export KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:/mnt/console_consumer/tools-log4j.properties"; export KAFKA_OPTS="-Djava.security.auth.login.config=/mnt/security/jaas.conf -Djava.security.krb5.conf=/mnt/security/krb5.conf"; /opt/kafka-dev/bin/kafka -console-consumer.sh --topic test_topic --consumer.config /mnt/console_consumer/console_consumer.properties --bootstrap-server worker11:9095,worker8:9095,worker4:9095 --isolation-level read_uncommitted --from-beginning --timeout-ms 6 --formatter kafka.tools.LoggingMessageFormatter 2>> /mnt/console_consumer/console_consumer.stderr | tee -a /mnt/console_consumer/console_consumer.stdout & [DEBUG - 2017-07-18 14:25:44,820 - remoteaccount - _log - lineno:158]: ubuntu@worker7: Running ssh command: export JMX_PORT=9192; export LOG_DIR=/mnt/console_consumer/logs; export KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:/mnt/console_consumer/tools-log4j.properties"; export KAFKA_OPTS="-Djava.security.auth.login.config=/mnt/security/jaas.conf -Djava.security.krb5.conf=/mnt/security/krb5.conf"; /opt/kafka-dev/bin/ka fka-console-consumer.sh --topic test_topic --consumer.config /mnt/console_consumer/console_consumer.properties --bootstrap-server worker11:9095,worker8:9095,worker4:9095 --isolation-level read_uncommitted --from-beginning --timeout-ms 6 --formatter kafka.tools.LoggingMessageFormatter 2>> /mnt/console_consumer/console_consumer.stderr | tee -a /mnt/console_consumer/console_consumer.stdout & [DEBUG - 2017-07-18 14:25:44,823 - console_consumer - _worker - lineno:260]: collecting following jmx objects: ['kafka.consumer:type=consumer-coordinator-metrics,client-id=console-consumer'] [DEBUG - 2017-07-18 14:25:44,823 - remoteaccount - _log - lineno:158]: ubuntu@worker7: Running ssh command: nc -z 127.0.0.1 9192 [DEBUG - 2017-07-18 14:25:44,929 - remoteaccount - _log - lineno:158]: ubuntu@worker7: Running ssh command: nc -z 127.0.0.1 9192 [DEBUG - 2017-07-18 14:25:45,035 - remoteaccount - _log - lineno:158]: ubuntu@worker7: Running ssh command: nc -z 127.0.0.1 9192 [DEBUG - 2017-07-18 14:25:45,141 - remoteaccount - _log - lineno:158]: ubuntu@worker7: Running ssh command: nc -z 127.0.0.1 9192 [DEBUG - 2017-07-18 14:25:45,148 - jmx - start_jmx_tool - lineno:71]: ubuntu@worker7: Start JmxTool 1 command: /opt/kafka-dev/bin/kafka-run-class.sh kafka.tools.JmxTool --reporting-interval 1000 --jmx-url service:jmx:rmi:///jndi/rmi://127.0.0.1:9192/jmxrmi --object-name kafka.consumer:type=consumer-coordinator-metrics,client-id=console-consumer --attributes assigned-partitions 1>> /mnt/jmx_tool.log 2>> /mnt/jmx_tool.err.log & [DEBUG - 2017-07-18 14:25:45,148 - remoteaccount - _log - lineno:158]: ubuntu@worker7: Running ssh command: /opt/kafka-dev/bin/kafka-run-class.sh kafka.tools.JmxTool --reporting-interval 1000 --jmx-url service:jmx:rmi:///jndi/rmi://127.0.0.1:9192/jmxrmi --object-name kafka.consumer:type=consumer-coordinator-metrics,client-id=console-consumer --attributes assigned-partitions 1>> /mnt/jmx_tool.log 2>> /mnt/jmx_tool.err.log & [DEBUG - 2017-07-18 14:25:45,210 - remoteaccount - _log - lineno:158]: ubuntu@worker10: Running ssh command: mkdir -p /mnt/security [DEBUG - 2017-07-18 14:25:45,215 - remoteaccount - _log - lineno:158]: ubuntu@worker10: Running ssh command: java -version [DEBUG - 2017-07-18 14:25:45,216 - remoteaccount - _log - lineno:158]: ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)" [INFO - 2017-07-18 14:25:45,356 - verifiable_producer - _worker - lineno:131]: verifiable_producer.properties: [INFO - 2017-07-18 14:25:45,356 - verifiable_producer - _worker - lineno:132]: ssl.endpoint.identification.algorithm=HTTPS sasl.kerberos.service.name=kafka security.protocol=SASL_SSL ssl.keystore.location=/mnt/security/test.keystore.jks ssl.truststore.location=/mnt/security/test.truststore.jks ssl.keystore.password=test-ks-passwd sasl.mechanism=GSSAPI ssl.truststore.password=test-ts-passwd ssl.key.password=test-key-passwd sasl.mechanism.inter.broker.protocol=GSSAPI request.timeout.ms=3 [DEBUG - 2017-07-18 14:25:45,359 - kafka_path - path - lineno:92]: Using path resolver KafkaSystemTestPathResolver [INFO - 2017-07-18 14:25:45,359 - kafka - bootstrap_servers - lineno:549]: Bootstrap client port is: 9095 [DEBUG - 2017-07-18 14:25:45,359 - verifiable_producer - _worker -