daniellavoie opened a new issue #7008:
URL: https://github.com/apache/incubator-pinot/issues/7008


   # Context
   
   A table was misconfigured to point at the TLS port of a Kafka cluster (MSK, to be specific) while the rest of the stream configuration still implied non-TLS connectivity (`security.protocol = PLAINTEXT`). While this is a problem in itself, the side effect is even worse: the misconfiguration will actually exhaust all available heap memory of the Pinot Controller.
   
   This means a user-provided table configuration can impact the stability of the whole cluster. I am not yet sure whether the root cause is 100% inside the Kafka client, but I would like us to keep track of this since the side effects are really bad for a Pinot cluster. Regardless of the root cause, we should evaluate whether anything can be done on the Pinot side to prevent it.
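   My working theory (an assumption on my part, not yet confirmed for this exact stack): this matches the known failure mode of a `PLAINTEXT` Kafka client talking to a TLS listener (cf. KAFKA-4090). The client treats the first four bytes of the broker's TLS response as a big-endian int32 message-size prefix, which decodes to hundreds of megabytes, and then tries to allocate a receive buffer of that size. A minimal sketch of the misread:

```java
import java.nio.ByteBuffer;

public class TlsSizeMisread {
    // Size a PLAINTEXT Kafka client would "read" from the first four
    // bytes of a TLS alert record (content type 0x15, version TLS 1.2,
    // length 0x0002) when it interprets them as an int32 length prefix.
    static int misreadSize() {
        byte[] tlsAlert = {0x15, 0x03, 0x03, 0x00, 0x02};
        return ByteBuffer.wrap(tlsAlert).getInt();
    }

    public static void main(String[] args) {
        // 0x15030300 = 352518912 bytes, roughly 336 MiB per attempt
        System.out.println(misreadSize() + " bytes");
    }
}
```

   A few allocations of that size, retried periodically by the validation task, would plausibly explain the heap exhaustion below.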
   
   # Output from `jmap -histo:live`
   
   ```
    num     #instances         #bytes  class name
   ----------------------------------------------
      1:          6784        7540704  [B
      2:         75925        6518760  [C
      3:        262144        6291456  org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapper
      4:         18167        2030352  [Ljava.lang.Object;
      5:         12060        1928160  [I
      6:         17019        1890224  java.lang.Class
      7:         75277        1806648  java.lang.String
      8:         39975        1279200  java.util.concurrent.ConcurrentHashMap$Node
      9:         20138         805520  java.util.LinkedHashMap$Entry
     10:          9096         729648  [Ljava.util.HashMap$Node;
     11:         19423         621536  java.util.HashMap$Node
     12:          5255         462440  java.lang.reflect.Method
     13:         25127         402032  java.lang.Object
     14:          6919         387464  java.util.LinkedHashMap
     15:          7191         287640  java.lang.ref.SoftReference
   ```
   
   # Relevant logs
   
   ```
   2021/05/18 21:38:44.298 INFO [PeriodicTaskScheduler] [pool-10-thread-2] Starting RealtimeSegmentValidationManager with running frequency of 3600 seconds.
   2021/05/18 21:38:44.298 INFO [BasePeriodicTask] [pool-10-thread-2] Start running task: RealtimeSegmentValidationManager
   2021/05/18 21:38:44.299 INFO [ControllerPeriodicTask] [pool-10-thread-2] Processing 5 tables in task: RealtimeSegmentValidationManager
   2021/05/18 21:38:44.299 INFO [RealtimeSegmentValidationManager] [pool-10-thread-2] Run segment-level validation
   2021/05/18 21:38:44.327 INFO [ConsumerConfig] [pool-10-thread-2] ConsumerConfig values: 
        auto.commit.interval.ms = 5000
        auto.offset.reset = latest
        bootstrap.servers = [*******:9094, *************:9094] <------ TLS port of MSK
        check.crcs = true
        client.id = 
        connections.max.idle.ms = 540000
        default.api.timeout.ms = 60000
        enable.auto.commit = true
        exclude.internal.topics = true
        fetch.max.bytes = 52428800
        fetch.max.wait.ms = 500
        fetch.min.bytes = 1
        group.id = 
        heartbeat.interval.ms = 3000
        interceptor.classes = []
        internal.leave.group.on.close = true
        isolation.level = read_uncommitted
        key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
        max.partition.fetch.bytes = 1048576
        max.poll.interval.ms = 300000
        max.poll.records = 500
        metadata.max.age.ms = 300000
        metric.reporters = []
        metrics.num.samples = 2
        metrics.recording.level = INFO
        metrics.sample.window.ms = 30000
        partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
        receive.buffer.bytes = 65536
        reconnect.backoff.max.ms = 1000
        reconnect.backoff.ms = 50
        request.timeout.ms = 30000
        retry.backoff.ms = 100
        sasl.client.callback.handler.class = null
        sasl.jaas.config = null
        sasl.kerberos.kinit.cmd = /usr/bin/kinit
        sasl.kerberos.min.time.before.relogin = 60000
        sasl.kerberos.service.name = null
        sasl.kerberos.ticket.renew.jitter = 0.05
        sasl.kerberos.ticket.renew.window.factor = 0.8
        sasl.login.callback.handler.class = null
        sasl.login.class = null
        sasl.login.refresh.buffer.seconds = 300
        sasl.login.refresh.min.period.seconds = 60
        sasl.login.refresh.window.factor = 0.8
        sasl.login.refresh.window.jitter = 0.05
        sasl.mechanism = GSSAPI
        security.protocol = PLAINTEXT
        send.buffer.bytes = 131072
        session.timeout.ms = 10000
        ssl.cipher.suites = null
        ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
        ssl.endpoint.identification.algorithm = https
        ssl.key.password = null
        ssl.keymanager.algorithm = SunX509
        ssl.keystore.location = null
        ssl.keystore.password = null
        ssl.keystore.type = JKS
        ssl.protocol = TLS
        ssl.provider = null
        ssl.secure.random.implementation = null
        ssl.trustmanager.algorithm = PKIX
        ssl.truststore.location = null
        ssl.truststore.password = null
        ssl.truststore.type = JKS
        value.deserializer = class org.apache.kafka.common.serialization.BytesDeserializer
   
   2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'realtime.segment.flush.threshold.rows' was supplied but isn't a known config.
   2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'stream.kafka.decoder.class.name' was supplied but isn't a known config.
   2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'streamType' was supplied but isn't a known config.
   2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'realtime.segment.flush.segment.size' was supplied but isn't a known config.
   2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'stream.kafka.consumer.type' was supplied but isn't a known config.
   2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'stream.kafka.broker.list' was supplied but isn't a known config.
   2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'realtime.segment.flush.threshold.time' was supplied but isn't a known config.
   2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'stream.kafka.consumer.prop.auto.offset.reset' was supplied but isn't a known config.
   2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'stream.kafka.consumer.factory.class.name' was supplied but isn't a known config.
   2021/05/18 21:38:44.516 WARN [ConsumerConfig] [pool-10-thread-2] The configuration 'stream.kafka.topic.name' was supplied but isn't a known config.
   2021/05/18 21:38:44.518 INFO [AppInfoParser] [pool-10-thread-2] Kafka version : 2.0.0
   2021/05/18 21:38:44.518 INFO [AppInfoParser] [pool-10-thread-2] Kafka commitId : 3402a8361b734732
   2021/05/18 21:38:46.543 INFO [PeriodicTaskScheduler] [pool-10-thread-3] Starting OfflineSegmentIntervalChecker with running frequency of 86400 seconds.
   2021/05/18 21:38:46.543 INFO [BasePeriodicTask] [pool-10-thread-3] Start running task: OfflineSegmentIntervalChecker
   2021/05/18 21:38:46.544 INFO [ControllerPeriodicTask] [pool-10-thread-3] Processing 5 tables in task: OfflineSegmentIntervalChecker
   2021/05/18 21:38:46.545 INFO [ControllerPeriodicTask] [pool-10-thread-3] Finish processing 5/5 tables in task: OfflineSegmentIntervalChecker
   2021/05/18 21:38:46.545 INFO [BasePeriodicTask] [pool-10-thread-3] Finish running task: OfflineSegmentIntervalChecker in 2ms
   2021/05/18 21:38:46.598 WARN [PeriodicTaskScheduler] [pool-10-thread-2] Caught exception while running Task: RealtimeSegmentValidationManager
   java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) ~[?:1.8.0_282]
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) ~[?:1.8.0_282]
        at org.apache.kafka.common.memory.MemoryPool$1.tryAllocate(MemoryPool.java:30) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:112) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:560) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:496) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.kafka.common.network.Selector.poll(Selector.java:425) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:271) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:242) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:218) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:274) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1774) ~[pinot-confluent-avro-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.pinot.plugin.stream.kafka20.KafkaStreamMetadataProvider.fetchPartitionCount(KafkaStreamMetadataProvider.java:46) ~[pinot-kafka-2.0-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.pinot.spi.stream.PartitionCountFetcher.call(PartitionCountFetcher.java:65) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.pinot.spi.stream.PartitionCountFetcher.call(PartitionCountFetcher.java:29) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:50) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.pinot.controller.helix.core.PinotTableIdealStateBuilder.getPartitionCount(PinotTableIdealStateBuilder.java:121) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.pinot.controller.helix.core.realtime.PinotLLCRealtimeSegmentManager.getNumPartitions(PinotLLCRealtimeSegmentManager.java:637) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.pinot.controller.helix.core.realtime.PinotLLCRealtimeSegmentManager.ensureAllPartitionsConsuming(PinotLLCRealtimeSegmentManager.java:753) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.pinot.controller.validation.RealtimeSegmentValidationManager.processTable(RealtimeSegmentValidationManager.java:102) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.pinot.controller.validation.RealtimeSegmentValidationManager.processTable(RealtimeSegmentValidationManager.java:48) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.processTables(ControllerPeriodicTask.java:95) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.runTask(ControllerPeriodicTask.java:68) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.pinot.core.periodictask.BasePeriodicTask.run(BasePeriodicTask.java:120) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.pinot.core.periodictask.PeriodicTaskScheduler.lambda$start$0(PeriodicTaskScheduler.java:85) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-593d23780d0110afabebd5a2127863ee10f8fd03]
        at org.apache.pinot.core.periodictask.PeriodicTaskScheduler$$Lambda$413/1311004473.run(Unknown Source) ~[?:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_282]
   ```
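   For completeness, the user-side fix is to tell the consumer that port 9094 speaks TLS. A sketch of the relevant `streamConfigs` fragment, under the assumption that this Pinot version forwards `stream.kafka.consumer.prop.*` keys to the Kafka consumer (the log above shows `stream.kafka.consumer.prop.auto.offset.reset` being forwarded the same way; the broker hostname here is a placeholder):
   
   ```json
   "streamConfigs": {
     "streamType": "kafka",
     "stream.kafka.consumer.type": "lowlevel",
     "stream.kafka.broker.list": "b-1.example.amazonaws.com:9094",
     "stream.kafka.consumer.prop.security.protocol": "SSL"
   }
   ```
   
   That fixes the individual table, but it does not address the cluster-stability concern: the controller should arguably survive a table whose config is wrong.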

