tibrewalpratik17 opened a new pull request, #13036: URL: https://github.com/apache/pinot/pull/13036
We have intermittently seen issues in our clusters while creating streamMessageDecoder. Stack trace: ``` java.lang.RuntimeException: Caught exception while creating StreamMessageDecoder from stream config: StreamConfig{_type='kafka', _topicName='<redacted>', _consumerTypes=[LOWLEVEL], _consumerFactoryClassName='redacted>', _offsetCriteria='OffsetCriteria{_offsetType=LARGEST, _offsetString='largest'}', _connectionTimeoutMillis=30000, _fetchTimeoutMillis=5000, _idleTimeoutMillis=180000, _flushThresholdRows=80000000, _flushThresholdTimeMillis=86400000, _flushSegmentDesiredSizeBytes=209715200, _flushAutotuneInitialRows=100000, _decoderClass='redacted', _decoderProperties={}, _groupId='null', _topicConsumptionRateLimit=-1.0, _tableNameWithType='redacted'} at org.apache.pinot.spi.stream.StreamDecoderProvider.create(StreamDecoderProvider.java:48) at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.(LLRealtimeSegmentDataManager.java:1424) at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.addSegment(RealtimeTableDataManager.java:446) at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addRealtimeSegment(HelixInstanceDataManager.java:228) at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:168) at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline(SegmentOnlineOfflineStateModelFactory.java:83) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:350) at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:278) at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97) at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) ``` This stops consumption in one of the replicas and once the other replica starts committing, this stopped replica always ends up in ERROR state. The only way to fix this is to reset this replica's segment. The behaviour of not consuming in one replica is also dangerous as if the other replica's hosts restarts / goes down due to any reason, it can cause data loss scenarios. Having a retry policy during StreamMessageDecoder.create() may help reduce the chances of such scenarios. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org