[ 
https://issues.apache.org/jira/browse/KAFKA-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092286#comment-16092286
 ] 

Ewen Cheslack-Postava commented on KAFKA-5608:
----------------------------------------------

It looks like the JmxTool really did take too long to generate output:

{quote}
[INFO  - 2017-07-18 14:25:44,820 - kafka - bootstrap_servers - lineno:549]: 
Bootstrap client port is: 9095
[DEBUG - 2017-07-18 14:25:44,820 - console_consumer - _worker - lineno:255]: 
Console consumer 1 command: export JMX_PORT=9192; export 
LOG_DIR=/mnt/console_consumer/logs; export 
KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:/mnt/console_consumer/tools-log4j.properties";
 export KAFKA_OPTS="-Djava.security.auth.login.config=/mnt/security/jaas.conf 
-Djava.security.krb5.conf=/mnt/security/krb5.conf"; /opt/kafka-dev/bin/kafka
-console-consumer.sh --topic test_topic --consumer.config 
/mnt/console_consumer/console_consumer.properties --bootstrap-server 
worker11:9095,worker8:9095,worker4:9095 --isolation-level read_uncommitted 
--from-beginning --timeout-ms 60000 --formatter 
kafka.tools.LoggingMessageFormatter 2>> 
/mnt/console_consumer/console_consumer.stderr | tee -a 
/mnt/console_consumer/console_consumer.stdout &
[DEBUG - 2017-07-18 14:25:44,820 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: export JMX_PORT=9192; export 
LOG_DIR=/mnt/console_consumer/logs; export 
KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:/mnt/console_consumer/tools-log4j.properties";
 export KAFKA_OPTS="-Djava.security.auth.login.config=/mnt/security/jaas.conf 
-Djava.security.krb5.conf=/mnt/security/krb5.conf"; /opt/kafka-dev/bin/ka
fka-console-consumer.sh --topic test_topic --consumer.config 
/mnt/console_consumer/console_consumer.properties --bootstrap-server 
worker11:9095,worker8:9095,worker4:9095 --isolation-level read_uncommitted 
--from-beginning --timeout-ms 60000 --formatter 
kafka.tools.LoggingMessageFormatter 2>> 
/mnt/console_consumer/console_consumer.stderr | tee -a 
/mnt/console_consumer/console_consumer.stdout &
[DEBUG - 2017-07-18 14:25:44,823 - console_consumer - _worker - lineno:260]: 
collecting following jmx objects: 
['kafka.consumer:type=consumer-coordinator-metrics,client-id=console-consumer']
[DEBUG - 2017-07-18 14:25:44,823 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: nc -z 127.0.0.1 9192
[DEBUG - 2017-07-18 14:25:44,929 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: nc -z 127.0.0.1 9192
[DEBUG - 2017-07-18 14:25:45,035 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: nc -z 127.0.0.1 9192
[DEBUG - 2017-07-18 14:25:45,141 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: nc -z 127.0.0.1 9192
[DEBUG - 2017-07-18 14:25:45,148 - jmx - start_jmx_tool - lineno:71]: 
ubuntu@worker7: Start JmxTool 1 command: /opt/kafka-dev/bin/kafka-run-class.sh 
kafka.tools.JmxTool --reporting-interval 1000 --jmx-url 
service:jmx:rmi:///jndi/rmi://127.0.0.1:9192/jmxrmi --object-name 
kafka.consumer:type=consumer-coordinator-metrics,client-id=console-consumer 
--attributes assigned-partitions 1>> /mnt/jmx_tool.log 2>> 
/mnt/jmx_tool.err.log &
[DEBUG - 2017-07-18 14:25:45,148 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: /opt/kafka-dev/bin/kafka-run-class.sh 
kafka.tools.JmxTool --reporting-interval 1000 --jmx-url 
service:jmx:rmi:///jndi/rmi://127.0.0.1:9192/jmxrmi --object-name 
kafka.consumer:type=consumer-coordinator-metrics,client-id=console-consumer 
--attributes assigned-partitions 1>> /mnt/jmx_tool.log 2>> 
/mnt/jmx_tool.err.log &
[DEBUG - 2017-07-18 14:25:45,210 - remoteaccount - _log - lineno:158]: 
ubuntu@worker10: Running ssh command: mkdir -p /mnt/security
[DEBUG - 2017-07-18 14:25:45,215 - remoteaccount - _log - lineno:158]: 
ubuntu@worker10: Running ssh command: java -version
[DEBUG - 2017-07-18 14:25:45,216 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[INFO  - 2017-07-18 14:25:45,356 - verifiable_producer - _worker - lineno:131]: 
verifiable_producer.properties:
[INFO  - 2017-07-18 14:25:45,356 - verifiable_producer - _worker - lineno:132]:
ssl.endpoint.identification.algorithm=HTTPS
sasl.kerberos.service.name=kafka
security.protocol=SASL_SSL
ssl.keystore.location=/mnt/security/test.keystore.jks
ssl.truststore.location=/mnt/security/test.truststore.jks
ssl.keystore.password=test-ks-passwd
sasl.mechanism=GSSAPI
ssl.truststore.password=test-ts-passwd
ssl.key.password=test-key-passwd
sasl.mechanism.inter.broker.protocol=GSSAPI

request.timeout.ms=30000

[DEBUG - 2017-07-18 14:25:45,359 - kafka_path - path - lineno:92]: Using path 
resolver KafkaSystemTestPathResolver
[INFO  - 2017-07-18 14:25:45,359 - kafka - bootstrap_servers - lineno:549]: 
Bootstrap client port is: 9095
[DEBUG - 2017-07-18 14:25:45,359 - verifiable_producer - _worker - lineno:136]: 
VerifiableProducer 1 command: export LOG_DIR=/mnt/verifiable_producer/logs; 
export KAFKA_OPTS="-Djava.security.auth.login.config=/mnt/security/jaas.conf 
-Djava.security.krb5.conf=/mnt/security/krb5.conf"; export 
KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:/mnt/verifiable_producer/tools-log4j.properties";
 /opt/kafka-dev/bin/kafka-run-class.sh 
org.apache.kafka.tools.VerifiableProducer --topic test_topic --broker-list 
worker11:9095,worker8:9095,worker4:9095 --throughput 100 --producer.config 
/mnt/verifiable_producer/verifiable_producer.properties 2>> 
/mnt/verifiable_producer/verifiable_producer.stdout | tee -a 
/mnt/verifiable_producer/verifiable_producer.stdout &
[DEBUG - 2017-07-18 14:25:45,359 - remoteaccount - _log - lineno:158]: 
ubuntu@worker10: Running ssh command: export 
LOG_DIR=/mnt/verifiable_producer/logs; export 
KAFKA_OPTS="-Djava.security.auth.login.config=/mnt/security/jaas.conf 
-Djava.security.krb5.conf=/mnt/security/krb5.conf"; export 
KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:/mnt/verifiable_producer/tools-log4j.properties";
 /opt/kafka-dev/bin/kafka-run-class.sh 
org.apache.kafka.tools.VerifiableProducer --topic test_topic --broker-list 
worker11:9095,worker8:9095,worker4:9095 --throughput 100 --producer.config 
/mnt/verifiable_producer/verifiable_producer.properties 2>> 
/mnt/verifiable_producer/verifiable_producer.stdout | tee -a 
/mnt/verifiable_producer/verifiable_producer.stdout &
[DEBUG - 2017-07-18 14:25:45,761 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:46,267 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:46,773 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:47,279 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:47,785 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:48,291 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:48,797 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:49,303 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:49,808 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:50,314 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:50,821 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:51,327 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:51,833 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:52,339 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:52,844 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:53,350 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:53,856 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:54,362 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-18 14:25:54,868 - remoteaccount - _log - lineno:158]: 
ubuntu@worker7: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[INFO  - 2017-07-18 14:25:55,374 - background_thread - _protected_worker - 
lineno:38]: BackgroundThreadService threw exception:
[INFO  - 2017-07-18 14:25:55,375 - background_thread - _protected_worker - 
lineno:39]: Traceback (most recent call last):
  File 
"/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/services/background_thread.py",
 line 35, in _protected_worker
    self._worker(idx, node)
  File 
"/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/console_consumer.py",
 line 261, in _worker
    self.start_jmx_tool(idx, node)
  File 
"/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/monitor/jmx.py",
 line 73, in start_jmx_tool
    wait_until(lambda: self._jmx_has_output(node), timeout_sec=10, 
backoff_sec=.5, err_msg="%s: Jmx tool took too long to start" % node.account)
  File 
"/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py",
 line 36, in wait_until
    raise TimeoutError(err_msg)
TimeoutError: ubuntu@worker7: Jmx tool took too long to start
{quote}

You can see the calls to nc initially, then checking for the availability of 
the JMX port. There's some other output unrelated because the jmx stuff is 
happening in a background thread. Then you can see it start checking for the 
log file, but it never seems to show up.

In this case there are logs for jmx (whereas I've seen some other tests where 
there aren't):

{quote}
$ cat jmx_tool.err.log
Trying to connect to JMX url: 
service:jmx:rmi:///jndi/rmi://127.0.0.1:9192/jmxrmi.
{quote}

{quote}
$ cat jmx_tool.log
{quote}

So no output was generated, but it did at least log itself trying to connect.

And unlike some cases I've seen before where the timeouts were a problem, the 
range of timestamps in the consumer log seems to overlap plenty with the tests 
for the log file:

{quote}
[2017-07-18 14:25:45,514] INFO ConsumerConfig values:
        auto.commit.interval.ms = 5000
...
[2017-07-18 14:26:46,963] TRACE Completed receive from node 2147483645 for 
LEAVE_GROUP with correlation id 5854, received 
{throttle_time_ms=0,error_code=0} (org.apache.kafka.clients.NetworkClient)
[2017-07-18 14:26:46,963] DEBUG LeaveGroup request for group group returned 
successfully (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
{quote}

Unfortunately, we can't tell about JmxTool itself because it doesn't log any 
timestamp information.

> System test failure due to timeout starting Jmx tool
> ----------------------------------------------------
>
>                 Key: KAFKA-5608
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5608
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Ewen Cheslack-Postava
>
> Began seeing this in some failing system tests:
> {code}
> [INFO  - 2017-07-18 14:25:55,375 - background_thread - _protected_worker - 
> lineno:39]: Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/services/background_thread.py",
>  line 35, in _protected_worker
>     self._worker(idx, node)
>   File 
> "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/console_consumer.py",
>  line 261, in _worker
>     self.start_jmx_tool(idx, node)
>   File 
> "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/services/monitor/jmx.py",
>  line 73, in start_jmx_tool
>     wait_until(lambda: self._jmx_has_output(node), timeout_sec=10, 
> backoff_sec=.5, err_msg="%s: Jmx tool took too long to start" % node.account)
>   File 
> "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py",
>  line 36, in wait_until
>     raise TimeoutError(err_msg)
> TimeoutError: ubuntu@worker7: Jmx tool took too long to start
> {code}
> This is immediately followed by a consumer timeout in the failing cases:
> {code}
> [INFO  - 2017-07-18 14:26:46,907 - runner_client - log - lineno:221]: 
> RunnerClient: 
> kafkatest.tests.core.security_rolling_upgrade_test.TestSecurityRollingUpgrade.test_rolling_upgrade_phase_two.broker_protocol=SASL_SSL.client_protocol=SASL_SSL:
>  FAIL: Consumer failed to consume messages for 60s.
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 123, in run
>     data = self.run_test()
>   File 
> "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 176, in run_test
>     return self.test_context.function(self.test)
>   File 
> "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 321, in wrapper
>     return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py",
>  line 148, in test_rolling_upgrade_phase_two
>     self.run_produce_consume_validate(self.roll_in_secured_settings, 
> client_protocol, broker_protocol)
>   File 
> "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 106, in run_produce_consume_validate
>     self.start_producer_and_consumer()
>   File 
> "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 79, in start_producer_and_consumer
>     self.consumer_start_timeout_sec)
>   File 
> "/home/jenkins/workspace/system-test-kafka-0.11.0/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py",
>  line 36, in wait_until
>     raise TimeoutError(err_msg)
> TimeoutError: Consumer failed to consume messages for 60s.
> {code}
> There does not appear to be anything wrong with the consumer in the logs, so 
> the timeout seems to be caused by the Jmx tool timeout.
> Possibly due to https://github.com/apache/kafka/pull/3447/files?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to