daniellavoie opened a new issue #7008:
URL: https://github.com/apache/incubator-pinot/issues/7008
# Context
A table was mistakenly configured to point at a TLS port of Kafka (MSK, to be specific) while the rest of its configuration still implied non-TLS (PLAINTEXT) connectivity.

While this is a problem in itself, the side effect is even worse: the misconfiguration exhausts all available heap memory of the Pinot Controller.

This means a user-provided table configuration can compromise the stability of the whole cluster. I am not yet sure whether the root cause lies entirely inside the Kafka client, but I would like us to keep track of this, since the side effects on a Pinot cluster are severe. Regardless of the root cause, we should evaluate whether anything can be done on the Pinot side to prevent it.
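One possible Pinot-side mitigation would be to reject an obviously inconsistent stream configuration before any Kafka consumer is created. The Java sketch below is hypothetical: `TlsPortSanityCheck`, `findMismatches`, and the assumption that port 9094 is TLS-only (true for the MSK cluster in this report, not in general) are illustrative and are not existing Pinot or Kafka APIs. It only relies on the standard Kafka property names `bootstrap.servers` and `security.protocol`, whose default is `PLAINTEXT` (as the logged `ConsumerConfig` shows).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

/**
 * Hypothetical pre-flight check, not part of Pinot or the Kafka client.
 * Flags brokers on ports assumed to be TLS-only when the consumer would
 * connect without TLS (security.protocol PLAINTEXT or SASL_PLAINTEXT).
 */
final class TlsPortSanityCheck {

  // Assumption: 9094 is the TLS listener port, as for the MSK cluster in this
  // issue. A real check would need this set to be configurable.
  static final Set<Integer> ASSUMED_TLS_PORTS = Collections.singleton(9094);

  static List<String> findMismatches(Properties props) {
    List<String> problems = new ArrayList<>();
    // Kafka falls back to PLAINTEXT when security.protocol is not set.
    String protocol = props.getProperty("security.protocol", "PLAINTEXT").trim().toUpperCase(Locale.ROOT);
    boolean usesTls = protocol.equals("SSL") || protocol.equals("SASL_SSL");
    if (usesTls) {
      return problems; // the client will negotiate TLS, nothing to flag
    }
    for (String entry : props.getProperty("bootstrap.servers", "").split(",")) {
      String hostPort = entry.trim();
      int colon = hostPort.lastIndexOf(':');
      if (colon < 0) {
        continue; // no explicit port, nothing to compare
      }
      try {
        int port = Integer.parseInt(hostPort.substring(colon + 1));
        if (ASSUMED_TLS_PORTS.contains(port)) {
          problems.add(hostPort + " looks like a TLS port, but security.protocol=" + protocol);
        }
      } catch (NumberFormatException e) {
        problems.add("Cannot parse port in broker address: " + hostPort);
      }
    }
    return problems;
  }

  public static void main(String[] args) {
    // Mirrors the misconfigured table: TLS port, security.protocol left at its default.
    Properties misconfigured = new Properties();
    misconfigured.setProperty("bootstrap.servers", "broker-1:9094,broker-2:9094");
    System.out.println(findMismatches(misconfigured));

    // A consistent TLS setup declares SSL (truststore settings omitted here).
    Properties tls = new Properties();
    tls.setProperty("bootstrap.servers", "broker-1:9094,broker-2:9094");
    tls.setProperty("security.protocol", "SSL");
    System.out.println(findMismatches(tls));
  }
}
```

A port heuristic like this only catches the common case; it cannot tell whether an arbitrary port speaks TLS, so it would complement, not replace, a fix in how the client handles unexpected bytes.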
# Output from `jmap -histo:live`
```
 num     #instances         #bytes  class name
----------------------------------------------
   1:          6784        7540704  [B
   2:         75925        6518760  [C
   3:        262144        6291456  org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapper
   4:         18167        2030352  [Ljava.lang.Object;
   5:         12060        1928160  [I
   6:         17019        1890224  java.lang.Class
   7:         75277        1806648  java.lang.String
   8:         39975        1279200  java.util.concurrent.ConcurrentHashMap$Node
   9:         20138         805520  java.util.LinkedHashMap$Entry
  10:          9096         729648  [Ljava.util.HashMap$Node;
  11:         19423         621536  java.util.HashMap$Node
  12:          5255         462440  java.lang.reflect.Method
  13:         25127         402032  java.lang.Object
  14:          6919         387464  java.util.LinkedHashMap
  15:          7191         287640  java.lang.ref.SoftReference
```
# Relevant logs
```
2021/05/18 21:38:44.298 INFO [PeriodicTaskScheduler] [pool-10-thread-2] Starting RealtimeSegmentValidationManager with running frequency of 3600 seconds.
2021/05/18 21:38:44.298 INFO [BasePeriodicTask] [pool-10-thread-2] Start running task: RealtimeSegmentValidationManager
2021/05/18 21:38:44.299 INFO [ControllerPeriodicTask] [pool-10-thread-2] Processing 5 tables in task: RealtimeSegmentValidationManager
2021/05/18 21:38:44.299 INFO [RealtimeSegmentValidationManager] [pool-10-thread-2] Run segment-level validation
2021/05/18 21:38:44.327 INFO [ConsumerConfig] [pool-10-thread-2] ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = latest
bootstrap.servers = [*******:9094, *************:9094] <------ TLS port of MSK
check.crcs = true
client.id =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = true
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id =
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
session.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.BytesDeserializer
2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'realtime.segment.flush.threshold.rows' was supplied but isn't a known config.
2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'stream.kafka.decoder.class.name' was supplied but isn't a known config.
2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'streamType' was supplied but isn't a known config.
2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'realtime.segment.flush.segment.size' was supplied but isn't a known config.
2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'stream.kafka.consumer.type' was supplied but isn't a known config.
2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'stream.kafka.broker.list' was supplied but isn't a known config.
2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'realtime.segment.flush.threshold.time' was supplied but isn't a known config.
2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'stream.kafka.consumer.prop.auto.offset.reset' was supplied but isn't a known config.
2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'stream.kafka.consumer.factory.class.name' was supplied but isn't a known config.
2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'stream.kafka.topic.name' was supplied but isn't a known config.
2021/05/18 21:38:44.518 INFO [AppInfoParser] [pool-10-thread-2] Kafka version : 2.0.0
2021/05/18 21:38:44.518 INFO [AppInfoParser] [pool-10-thread-2] Kafka commitId : 3402a8361b734732
2021/05/18 21:38:46.543 INFO [PeriodicTaskScheduler] [pool-10-thread-3] Starting OfflineSegmentIntervalChecker with running frequency of 86400 seconds.
2021/05/18 21:38:46.543 INFO [BasePeriodicTask] [pool-10-thread-3] Start running task: OfflineSegmentIntervalChecker
2021/05/18 21:38:46.544 INFO [ControllerPeriodicTask] [pool-10-thread-3] Processing 5 tables in task: OfflineSegmentIntervalChecker
2021/05/18 21:38:46.545 INFO [ControllerPeriodicTask] [pool-10-thread-3] Finish processing 5/5 tables in task: OfflineSegmentIntervalChecker
2021/05/18 21:38:46.545 INFO [BasePeriodicTask] [pool-10-thread-3] Finish running task: OfflineSegmentIntervalChecker in 2ms
2021/05/18 21:38:46.598 WARN [PeriodicTaskScheduler] [pool-10-thread-2] Caught exception while running Task: RealtimeSegmentValidationManager
java.lang.OutOfMemoryError: Java heap space
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) ~[?:1.8.0_282]
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) ~[?:1.8.0_282]
	at org.apache.kafka.common.memory.MemoryPool$1.tryAllocate(MemoryPool.java:30) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:112) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:560) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:496) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.kafka.common.network.Selector.poll(Selector.java:425) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:271) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:242) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:218) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:274) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1774) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.pinot.plugin.stream.kafka20.KafkaStreamMetadataProvider.fetchPartitionCount(KafkaStreamMetadataProvider.java:46) ~[pinot-kafka-2.0-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.pinot.spi.stream.PartitionCountFetcher.call(PartitionCountFetcher.java:65) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.pinot.spi.stream.PartitionCountFetcher.call(PartitionCountFetcher.java:29) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:50) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.pinot.controller.helix.core.PinotTableIdealStateBuilder.getPartitionCount(PinotTableIdealStateBuilder.java:121) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.pinot.controller.helix.core.realtime.PinotLLCRealtimeSegmentManager.getNumPartitions(PinotLLCRealtimeSegmentManager.java:637) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.pinot.controller.helix.core.realtime.PinotLLCRealtimeSegmentManager.ensureAllPartitionsConsuming(PinotLLCRealtimeSegmentManager.java:753) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.pinot.controller.validation.RealtimeSegmentValidationManager.processTable(RealtimeSegmentValidationManager.java:102) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.pinot.controller.validation.RealtimeSegmentValidationManager.processTable(RealtimeSegmentValidationManager.java:48) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.processTables(ControllerPeriodicTask.java:95) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.runTask(ControllerPeriodicTask.java:68) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.pinot.core.periodictask.BasePeriodicTask.run(BasePeriodicTask.java:120) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.pinot.core.periodictask.PeriodicTaskScheduler.lambda$start$0(PeriodicTaskScheduler.java:85) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
	at org.apache.pinot.core.periodictask.PeriodicTaskScheduler$$Lambda$413/1311004473.run(Unknown Source) ~[?:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_282]
```
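The stack trace shows the allocation failing inside `NetworkReceive.readFrom`, i.e. while the client reads a broker response. Kafka frames each response with a 4-byte big-endian size, and the client allocates a buffer of that size (here via `MemoryPool.tryAllocate`) before reading the body. A plausible explanation, assuming the TLS listener answers the plaintext request with a TLS alert record, is that the TLS record header gets decoded as that size. The Java sketch below only illustrates the arithmetic and a bounded-allocation guard; `TlsBytesAsKafkaSize`, `sizePrefix`, `allocateBounded`, and the 100 MiB cap are hypothetical and are not Kafka APIs.

```java
import java.nio.ByteBuffer;

/**
 * Illustration only (not Kafka code): what happens when the first bytes of a
 * TLS record are read as a Kafka response's 4-byte size prefix.
 */
final class TlsBytesAsKafkaSize {

  // Kafka frames every response with a 4-byte big-endian size;
  // ByteBuffer reads big-endian by default.
  static int sizePrefix(byte[] header) {
    return ByteBuffer.wrap(header, 0, 4).getInt();
  }

  // Hypothetical guard: validate the claimed size against a cap before
  // allocating, instead of allocating whatever the peer claims.
  static ByteBuffer allocateBounded(int size, int maxSize) {
    if (size < 0 || size > maxSize) {
      throw new IllegalStateException("Invalid receive size " + size + " (max " + maxSize + ")");
    }
    return ByteBuffer.allocate(size);
  }

  public static void main(String[] args) {
    // Assumed start of a TLS 1.2 alert record: content type 0x15 (alert),
    // version 0x03 0x03, then the high byte of the record length.
    byte[] tlsAlertHeader = {0x15, 0x03, 0x03, 0x00};
    int size = sizePrefix(tlsAlertHeader);
    System.out.println("Claimed response size: " + size + " bytes (~" + (size >> 20) + " MiB)");

    try {
      allocateBounded(size, 100 * 1024 * 1024); // assumed 100 MiB cap
    } catch (IllegalStateException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
```

Under that assumption, the header decodes to 352,518,912 bytes (about 336 MiB) for a single buffer, which can easily exceed a Controller heap sized for metadata work; a handshake record (content type 0x16) would decode to an even larger size. Whether the broker actually sends such a record, and whether the client could enforce a cap like this, still needs to be confirmed against the Kafka client code.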