[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId

2020-01-24 Thread Oleksii Boiko (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021982#comment-17021982
 ] 

Oleksii Boiko edited comment on KAFKA-8803 at 1/24/20 11:10 AM:


Hi all,

We are facing the same issue. It started reproducing more frequently after we migrated to Kafka Streams 2.3.
In our case the increase was caused by a change in commit behaviour in Kafka Streams: previously (we used 2.0.1), "commit offsets" was executed only when a new record was consumed on the source node, but in version 2.3 "commit offsets" executes on every "punctuate" call even when no changes were made. We have a punctuator with a 1-second wall-clock scheduler, so the number of commit-offsets operations grew accordingly.
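To make the setup concrete, here is a minimal sketch of the kind of punctuator described above (the class name and body are illustrative, not our actual code):
{noformat}
import java.time.Duration
import org.apache.kafka.streams.processor.{AbstractProcessor, ProcessorContext, PunctuationType, Punctuator}

// Illustrative processor: the 1-second wall-clock punctuator fires even when no new
// records have arrived, and with Streams 2.3 each punctuation is followed by a commit,
// as described above.
class ExampleProcessor extends AbstractProcessor[String, String] {
  override def init(context: ProcessorContext): Unit = {
    super.init(context)
    context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, new Punctuator {
      override def punctuate(timestamp: Long): Unit = {
        // periodic work goes here
      }
    })
  }

  override def process(key: String, value: String): Unit = ()
}
{noformat}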
In addition to this, we detected a broker-side error that causes the transactional id to get stuck:

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=6, txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to unexpected metadata
{noformat}
This error appears almost every day at the same time, and the txnStartTimestamp of the previous transaction state is more recent than the txnStartTimestamp of the target transaction state.

If I'm correct, that means the transaction can never be moved to the "Ongoing" state, and as a result it never expires, because only "Ongoing" transactions can be expired:
kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
  def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
    val now = time.milliseconds()
    inReadLock(stateLock) {
      transactionMetadataCache.filter { case (txnPartitionId, _) =>
        !leavingPartitions.exists(_.txnPartitionId == txnPartitionId)
      }.flatMap { case (_, entry) =>
        entry.metadataPerTransactionalId.filter { case (_, txnMetadata) =>
          if (txnMetadata.pendingTransitionInProgress) {
            false
          } else {
            txnMetadata.state match {
              case Ongoing =>
                txnMetadata.txnStartTimestamp + txnMetadata.txnTimeoutMs < now
              case _ => false
            }
          }
        }.map { case (txnId, txnMetadata) =>
          TransactionalIdAndProducerIdEpoch(txnId, txnMetadata.producerId, txnMetadata.producerEpoch)
        }
      }
    }
  }
{noformat}
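In other words (a minimal sketch with simplified types, not the broker's real classes), a transactional id that never reaches "Ongoing" is never selected for expiration:
{noformat}
object StuckTransactionSketch extends App {
  sealed trait TxnState
  case object Empty   extends TxnState
  case object Ongoing extends TxnState

  case class Txn(state: TxnState, pendingTransitionInProgress: Boolean,
                 txnStartTimestamp: Long, txnTimeoutMs: Long)

  // Same shape as the predicate above: only Ongoing transactions can time out.
  def isExpirable(t: Txn, now: Long): Boolean =
    if (t.pendingTransitionInProgress) false
    else t.state match {
      case Ongoing => t.txnStartTimestamp + t.txnTimeoutMs < now
      case _       => false
    }

  // Start timestamp taken from the error above; the 60000 ms timeout and the Empty state
  // are assumptions. Because the transition to Ongoing failed, the id stays in its previous
  // state and the timeout sweep never picks it up, no matter how far the clock advances.
  val stuck = Txn(Empty, pendingTransitionInProgress = false,
                  txnStartTimestamp = 1577934186261L, txnTimeoutMs = 60000L)
  println(isExpirable(stuck, now = Long.MaxValue))  // false
}
{noformat}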
Based on these inputs we found that the issue reproduces at the time of NTP synchronization.
Our broker clock drifts ahead of NTP by up to +2 seconds per 24 hours, and the NTP sync rolls this delta back.

After disabling NTP sync the issue was gone.
We found that a similar issue was already fixed previously, but the fix does not cover all cases: https://issues.apache.org/jira/browse/KAFKA-5415
There is one more place where such a timestamp comparison exists:

kafka.coordinator.transaction.TransactionMetadata#completeTransitionTo
{noformat}
case Ongoing => // from addPartitions
  if (!validProducerEpoch(transitMetadata) ||
      !topicPartitions.subsetOf(transitMetadata.topicPartitions) ||
      txnTimeoutMs != transitMetadata.txnTimeoutMs ||
      txnStartTimestamp > transitMetadata.txnStartTimestamp) {
    throwStateTransitionFailure(transitMetadata)
  } else {
    txnStartTimestamp = transitMetadata.txnStartTimestamp
    addPartitions(transitMetadata.topicPartitions)
  }
{noformat}
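To illustrate how the NTP rollback trips this guard (hypothetical values: the first timestamp comes from the error above, the second assumes the sync just rolled the broker clock back by about 2 seconds):
{noformat}
object ClockRollbackSketch extends App {
  val storedTxnStartTimestamp  = 1577934186261L                   // existing metadata, pre-rollback clock
  val transitTxnStartTimestamp = storedTxnStartTimestamp - 2000L  // new transition, assumed post-rollback clock

  // Same comparison as the Ongoing branch of completeTransitionTo above:
  if (storedTxnStartTimestamp > transitTxnStartTimestamp)
    println("throwStateTransitionFailure -> IllegalStateException: ... due to unexpected metadata")
}
{noformat}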
 

 

 


[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId

2020-01-23 Thread Oleksii Boiko (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022001#comment-17022001
 ] 

Oleksii Boiko edited comment on KAFKA-8803 at 1/23/20 12:02 PM:


Original exception, full stack trace (broker 2.0.1):
{noformat}
[2020-01-02 03:03:06,262] ERROR [KafkaApi-0] Error when handling request {transactional_id=>,producer_id=61432,producer_epoch=0,group_id=>} (kafka.server.KafkaApis)
java.lang.IllegalStateException: TransactionalId > failed transition to state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=6, txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to unexpected metadata
    at kafka.coordinator.transaction.TransactionMetadata.throwStateTransitionFailure(TransactionMetadata.scala:390)
    at kafka.coordinator.transaction.TransactionMetadata.completeTransitionTo(TransactionMetadata.scala:326)
    at kafka.coordinator.transaction.TransactionStateManager$$anonfun$kafka$coordinator$transaction$TransactionStateManager$$updateCacheCallback$1$1.apply$mcV$sp(TransactionStateManager.scala:542)
    at kafka.coordinator.transaction.TransactionStateManager$$anonfun$kafka$coordinator$transaction$TransactionStateManager$$updateCacheCallback$1$1.apply(TransactionStateManager.scala:534)
    at kafka.coordinator.transaction.TransactionStateManager$$anonfun$kafka$coordinator$transaction$TransactionStateManager$$updateCacheCallback$1$1.apply(TransactionStateManager.scala:534)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
    at kafka.coordinator.transaction.TransactionMetadata.inLock(TransactionMetadata.scala:172)
    at kafka.coordinator.transaction.TransactionStateManager.kafka$coordinator$transaction$TransactionStateManager$$updateCacheCallback$1(TransactionStateManager.scala:533)
    at kafka.coordinator.transaction.TransactionStateManager$$anonfun$appendTransactionToLog$1$$anonfun$apply$mcV$sp$11.apply(TransactionStateManager.scala:628)
    at kafka.coordinator.transaction.TransactionStateManager$$anonfun$appendTransactionToLog$1$$anonfun$apply$mcV$sp$11.apply(TransactionStateManager.scala:628)
    at kafka.server.DelayedProduce.onComplete(DelayedProduce.scala:129)
    at kafka.server.DelayedOperation.forceComplete(DelayedOperation.scala:70)
    at kafka.server.DelayedProduce.tryComplete(DelayedProduce.scala:110)
    at kafka.server.DelayedOperationPurgatory.tryCompleteElseWatch(DelayedOperation.scala:232)
    at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:495)
    at kafka.coordinator.transaction.TransactionStateManager$$anonfun$appendTransactionToLog$1.apply$mcV$sp(TransactionStateManager.scala:622)
    at kafka.coordinator.transaction.TransactionStateManager$$anonfun$appendTransactionToLog$1.apply(TransactionStateManager.scala:599)
    at kafka.coordinator.transaction.TransactionStateManager$$anonfun$appendTransactionToLog$1.apply(TransactionStateManager.scala:599)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
    at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:257)
    at kafka.coordinator.transaction.TransactionStateManager.appendTransactionToLog(TransactionStateManager.scala:593)
    at kafka.coordinator.transaction.TransactionCoordinator.handleAddPartitionsToTransaction(TransactionCoordinator.scala:272)
    at kafka.server.KafkaApis.handleAddOffsetsToTxnRequest(KafkaApis.scala:1852)
    at kafka.server.KafkaApis.handle(KafkaApis.scala:138)
    at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:69)
    at java.lang.Thread.run(Thread.java:748)
{noformat}


> Stream will not start due to TimeoutException: Timeout expired after 
> 60000milliseconds while awaiting InitProducerId
> 
>
> Key: KAFKA-8803
> URL: https://issues.apache.org/jira/browse/KAFKA-8803
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Raman Gupta
>Assignee: Boyang Chen
>Priority: Major
> Attachments: logs.txt.gz, screenshot-1.png
>
>
> One streams app is consistently failing at startup with the following 
> exception:
> {code}
> 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] 
> org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout 
> exception caught when initializing transactions for task 0_36. This might 
> happen if the broker is slow to respond, if the network connection to the 
> broker was interrupted, or if similar circumstances arise. You can increase 
> producer parameter `max.block.ms` to increase this timeout.
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after 
> 60000milliseconds while awaiting InitProducerId
> {code}
> These same brokers are used by many other streams without any issue, 
> including some in the very same processes for the str
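As a side note, the quoted error suggests raising the producer's max.block.ms. A minimal sketch of passing that through the Streams configuration (the application id, bootstrap address, and 180000 ms value are illustrative assumptions, not a recommendation from this thread):
{noformat}
import java.util.Properties
import org.apache.kafka.clients.producer.ProducerConfig
import org.apache.kafka.streams.StreamsConfig

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-streams-app")   // illustrative
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")        // illustrative
// Give InitProducerId (and other blocking producer calls) longer than the default 60000 ms.
props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), "180000")
{noformat}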

[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId

2020-01-23 Thread Oleksii Boiko (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021982#comment-17021982
 ] 

Oleksii Boiko edited comment on KAFKA-8803 at 1/23/20 11:57 AM:


Hi all,

We are facing same issue. The frequency of issue reproducing was increased 
after migrating to Kafka-streams 2.3
 For our case increase was caused by changing commit behaviour in 
Kafka-streams. Previously(we used 2.0.1), "commit offsets" was executed only in 
case when new record was consumed on source node but in version 2.3 "commit 
offset" executes on each "punctuate" call even no changes were made. We have 
punctuator with 1s wall-clock scheduler. As the the result commit offsets 
operations count was grown. Additionally to this we had detected that error 
which causes transctional id stuck

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to 
state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=6, 
txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), 
txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to 
unexpected metadata{noformat}
appears almost every day  at the same time and txnStartTimestamp of previous 
transaction state is younger than txnStartTimestamp of transaction target state

If I'm correct that means that transaction can not be transferred to "Ongoing" 
state and as the result it never expires due to only "Ongoing" transactions can 
be expired
 kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
val now = time.milliseconds()
inReadLock(stateLock) {
  transactionMetadataCache.filter { case (txnPartitionId, _) =>
!leavingPartitions.exists(_.txnPartitionId == txnPartitionId)
  }.flatMap { case (_, entry) =>
entry.metadataPerTransactionalId.filter { case (_, txnMetadata) =>
  if (txnMetadata.pendingTransitionInProgress) {
false
  } else {
txnMetadata.state match {
  case Ongoing =>
txnMetadata.txnStartTimestamp + txnMetadata.txnTimeoutMs < now
  case _ => false
}
  }
}.map { case (txnId, txnMetadata) =>
  TransactionalIdAndProducerIdEpoch(txnId, txnMetadata.producerId, 
txnMetadata.producerEpoch)
}
  }
}
  }{noformat}
Based on this inputs we had detected that time of issue reproducing is the time 
of ntp synchronization. 
 Our broker timer goes froward in comparison with ntp and have up to +2 seconds 
per 24 hour and ntp sync rollbacks this delta.

After disabling ntp sync issue was gone
 We had found that similar issue was already fixed previously but it does not 
cover all the cases https://issues.apache.org/jira/browse/KAFKA-5415
 There is one more place where timestamp comparison exists

kafka.coordinator.transaction.TransactionMetadata#completeTransitionTo
{noformat}
case Ongoing => // from addPartitions if (!validProducerEpoch(transitMetadata) 
|| !topicPartitions.subsetOf(transitMetadata.topicPartitions) || txnTimeoutMs 
!= transitMetadata.txnTimeoutMs || txnStartTimestamp > 
transitMetadata.txnStartTimestamp) { 
throwStateTransitionFailure(transitMetadata) } else { txnStartTimestamp = 
transitMetadata.txnStartTimestamp 
addPartitions(transitMetadata.topicPartitions) }
{noformat}
 

 

 


was (Author: oleksii.boiko):
Hi all,

We are facing same issue. The frequency of issue reproducing was increased 
after migrating to Kafka-streams 2.3
 For our case increase was caused by changing commit behaviour in 
Kafka-streams. Previously(we used 2.0.1), "commit offsets" was executed only in 
case when new record was consumed on source node but in version 2.3 "commit 
offset" executes on each "punctuate" call even no changes were made. We have 
punctuator with 1s wall-clock scheduler. As the the result commit offsets 
operations count was grown. Additionally to this we had detected that error 
which causes transctional id stuck

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to 
state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=6, 
txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), 
txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to 
unexpected metadata{noformat}
appears almost every day  at the same time

If I'm correct that means that transaction can not be transferred to "Ongoing" 
state and as the result it never expires due to only "Ongoing" transactions can 
be expired
 kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
val now = time.milliseconds()
inReadLock(stateLock) {
  transactionMetadataCache.fil

[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId

2020-01-23 Thread Oleksii Boiko (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021982#comment-17021982
 ] 

Oleksii Boiko edited comment on KAFKA-8803 at 1/23/20 11:49 AM:


Hi all,

We are facing same issue. The frequency of issue reproducing was increased 
after migrating to Kafka-streams 2.3
 For our case increase was caused by changing commit behaviour in 
Kafka-streams. Previously(we used 2.0.1), "commit offsets" was executed only in 
case when new record was consumed on source node but in version 2.3 "commit 
offset" executes on each "punctuate" call even no changes were made. We have 
punctuator with 1s wall-clock scheduler. As the the result commit offsets 
operations count was grown. Additionally to this we had detected that error 
which causes transctional id stuck

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to 
state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=6, 
txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), 
txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to 
unexpected metadata{noformat}
appears almost every day  at the same time

If I'm correct that means that transaction can not be transferred to "Ongoing" 
state and as the result it never expires due to only "Ongoing" transactions can 
be expired
 kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
val now = time.milliseconds()
inReadLock(stateLock) {
  transactionMetadataCache.filter { case (txnPartitionId, _) =>
!leavingPartitions.exists(_.txnPartitionId == txnPartitionId)
  }.flatMap { case (_, entry) =>
entry.metadataPerTransactionalId.filter { case (_, txnMetadata) =>
  if (txnMetadata.pendingTransitionInProgress) {
false
  } else {
txnMetadata.state match {
  case Ongoing =>
txnMetadata.txnStartTimestamp + txnMetadata.txnTimeoutMs < now
  case _ => false
}
  }
}.map { case (txnId, txnMetadata) =>
  TransactionalIdAndProducerIdEpoch(txnId, txnMetadata.producerId, 
txnMetadata.producerEpoch)
}
  }
}
  }{noformat}
Based on this inputs we had detected that time of issue reproducing is the time 
of ntp synchronization. 
 Our broker timer goes froward in comparison with ntp and have up to +2 seconds 
per 24 hour and ntp sync rollbacks this delta.

After disabling ntp sync issue was gone
 We had found that similar issue was already fixed previously but it does not 
cover all the cases https://issues.apache.org/jira/browse/KAFKA-5415
 There is one more place where timestamp comparison exists

kafka.coordinator.transaction.TransactionMetadata#completeTransitionTo
{noformat}
case Ongoing => // from addPartitions if (!validProducerEpoch(transitMetadata) 
|| !topicPartitions.subsetOf(transitMetadata.topicPartitions) || txnTimeoutMs 
!= transitMetadata.txnTimeoutMs || txnStartTimestamp > 
transitMetadata.txnStartTimestamp) { 
throwStateTransitionFailure(transitMetadata) } else { txnStartTimestamp = 
transitMetadata.txnStartTimestamp 
addPartitions(transitMetadata.topicPartitions) }
{noformat}
 

 

 


was (Author: oleksii.boiko):
Hi all,

We are facing same issue. The frequency of issue reproducing was increased 
after migrating to Kafka-streams 2.3
 For our case increase was caused by changing commit behaviour in 
Kafka-streams. Previously(we used 2.0.1), "commit offsets" was executed only in 
case when new record was consumed on source node but in version 2.3 "commit 
offset" executes on each "punctuate" call even no changes were made. We have 
punctuator with 1s wall-clock scheduler. As the the result commit offsets 
operations count was grown. Additionally to this we had detected that error 
which causes transctional id stuck

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to 
state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=6, 
txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), 
txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to 
unexpected metadata{noformat}
appears at the same time almost every day 

If I'm correct that means that transaction can not be transferred to "Ongoing" 
state and as the result it never expires due to only "Ongoing" transactions can 
be expired
 kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
val now = time.milliseconds()
inReadLock(stateLock) {
  transactionMetadataCache.filter { case (txnPartitionId, _) =>
!leavingPartitions.exists(_.txnPartitionId == txnPartitionId)
  }.fla

[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId

2020-01-23 Thread Oleksii Boiko (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021982#comment-17021982
 ] 

Oleksii Boiko edited comment on KAFKA-8803 at 1/23/20 11:47 AM:


Hi all,

We are facing same issue. The frequency of issue reproducing was increased 
after migrating to Kafka-streams 2.3
 For our case increase was caused by changing commit behaviour in 
Kafka-streams. Previously(we used 2.0.1), "commit offsets" was executed only in 
case when new record was consumed on source node but in version 2.3 "commit 
offset" executes on each "punctuate" call even no changes were made. We have 
punctuator with 1s wall-clock scheduler. As the the result commit offsets 
operations count was grown. Additionally to this we had detected that error 
which causes transctional id stuck

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to 
state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=6, 
txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), 
txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to 
unexpected metadata{noformat}
appears at the same time almost every day 

If I'm correct that means that transaction can not be transferred to "Ongoing" 
state and as the result it never expires due to only "Ongoing" transactions can 
be expired
 kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
val now = time.milliseconds()
inReadLock(stateLock) {
  transactionMetadataCache.filter { case (txnPartitionId, _) =>
!leavingPartitions.exists(_.txnPartitionId == txnPartitionId)
  }.flatMap { case (_, entry) =>
entry.metadataPerTransactionalId.filter { case (_, txnMetadata) =>
  if (txnMetadata.pendingTransitionInProgress) {
false
  } else {
txnMetadata.state match {
  case Ongoing =>
txnMetadata.txnStartTimestamp + txnMetadata.txnTimeoutMs < now
  case _ => false
}
  }
}.map { case (txnId, txnMetadata) =>
  TransactionalIdAndProducerIdEpoch(txnId, txnMetadata.producerId, 
txnMetadata.producerEpoch)
}
  }
}
  }{noformat}
Based on this inputs we had detected that time of issue reproducing is the time 
of ntp synchronization. 
 Our broker timer goes froward in comparison with ntp and have up to +2 seconds 
per 24 hour and ntp sync rollbacks this delta.

After disabling ntp sync issue was gone
 We had found that similar issue was already fixed previously but it does not 
cover all the cases https://issues.apache.org/jira/browse/KAFKA-5415
 There is one more place where timestamp comparison exists

kafka.coordinator.transaction.TransactionMetadata#completeTransitionTo
{noformat}
case Ongoing => // from addPartitions if (!validProducerEpoch(transitMetadata) 
|| !topicPartitions.subsetOf(transitMetadata.topicPartitions) || txnTimeoutMs 
!= transitMetadata.txnTimeoutMs || txnStartTimestamp > 
transitMetadata.txnStartTimestamp) { 
throwStateTransitionFailure(transitMetadata) } else { txnStartTimestamp = 
transitMetadata.txnStartTimestamp 
addPartitions(transitMetadata.topicPartitions) }
{noformat}
 

 

 


was (Author: oleksii.boiko):
Hi all,

We are facing same issue. The sequence of issue reproducing was increased after 
migrating to Kafka-streams 2.3
 For our case increase was caused by changing commit behaviour in 
Kafka-streams. Previously(we used 2.0.1), "commit offsets" was executed only in 
case when new record was consumed on source node but in version 2.3 "commit 
offset" executes on each "punctuate" call even no changes were made. We have 
punctuator with 1s wall-clock scheduler. As the the result commit offsets 
operations count was grown. Additionally to this we had detected that error 
which causes transctional id stuck

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to 
state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=6, 
txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), 
txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to 
unexpected metadata{noformat}
appears at the same time almost every day 

If I'm correct that means that transaction can not be transferred to "Ongoing" 
state and as the result it never expires due to only "Ongoing" transactions can 
be expired
 kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
val now = time.milliseconds()
inReadLock(stateLock) {
  transactionMetadataCache.filter { case (txnPartitionId, _) =>
!leavingPartitions.exists(_.txnPartitionId == txnPartitionId)
  }.flat

[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId

2020-01-23 Thread Oleksii Boiko (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021982#comment-17021982
 ] 

Oleksii Boiko edited comment on KAFKA-8803 at 1/23/20 11:36 AM:


Hi all,

We are facing same issue. The sequence of issue reproducing was increased after 
migrating to Kafka-streams 2.3
 For our case increase was caused by changing commit behaviour in 
Kafka-streams. Previously(we used 2.0.1), "commit offsets" was executed only in 
case when new record was consumed on source node but in version 2.3 "commit 
offset" executes on each "punctuate" call even no changes were made. We have 
punctuator with 1s wall-clock scheduler. As the the result commit offsets 
operations count was grown. Additionally to this we had detected that error 
which causes transctional id stuck

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to 
state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=6, 
txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), 
txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to 
unexpected metadata{noformat}
appears at the same time almost every day 

If I'm correct that means that transaction can not be transferred to "Ongoing" 
state and as the result it never expires due to only "Ongoing" transactions can 
be expired
 kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
val now = time.milliseconds()
inReadLock(stateLock) {
  transactionMetadataCache.filter { case (txnPartitionId, _) =>
!leavingPartitions.exists(_.txnPartitionId == txnPartitionId)
  }.flatMap { case (_, entry) =>
entry.metadataPerTransactionalId.filter { case (_, txnMetadata) =>
  if (txnMetadata.pendingTransitionInProgress) {
false
  } else {
txnMetadata.state match {
  case Ongoing =>
txnMetadata.txnStartTimestamp + txnMetadata.txnTimeoutMs < now
  case _ => false
}
  }
}.map { case (txnId, txnMetadata) =>
  TransactionalIdAndProducerIdEpoch(txnId, txnMetadata.producerId, 
txnMetadata.producerEpoch)
}
  }
}
  }{noformat}
Based on this inputs we had detected that time of issue reproducing is the time 
of ntp synchronization. 
 Our broker timer goes froward in comparison with ntp and have up to +2 seconds 
per 24 hour and ntp sync rollbacks this delta.

After disabling ntp sync issue was gone
 We had found that similar issue was already fixed previously but it does not 
cover all the cases https://issues.apache.org/jira/browse/KAFKA-5415
 There is one more place where timestamp comparison exists

kafka.coordinator.transaction.TransactionMetadata#completeTransitionTo
{noformat}
case Ongoing => // from addPartitions if (!validProducerEpoch(transitMetadata) 
|| !topicPartitions.subsetOf(transitMetadata.topicPartitions) || txnTimeoutMs 
!= transitMetadata.txnTimeoutMs || txnStartTimestamp > 
transitMetadata.txnStartTimestamp) { 
throwStateTransitionFailure(transitMetadata) } else { txnStartTimestamp = 
transitMetadata.txnStartTimestamp 
addPartitions(transitMetadata.topicPartitions) }
{noformat}
 

 

 


was (Author: oleksii.boiko):
Hi all,

We are facing same issue. The sequence of issue reproducing was increased after 
migrating to Kafka-streams 2.3
 For our case increase was caused by changing commit behaviour in 
Kafka-streams. Previously(we used 2.0.1), "commit offsets" was executed only in 
case when new record was consumed on source node but in version 2.3 "commit 
offset" executes on each "punctuate" call even no changes were made. We have 
punctuator with 1s wall-clock scheduler. As the the result commit offsets 
operations count was grown. Additionally to this we had detected that error 
which causes transctional id stuck

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to 
state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=6, 
txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), 
txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to 
unexpected metadata{noformat}
appears at the same time almost every day 

If I'm correct that means that transaction can not be transferred to "Ongoing" 
state and as the result it never expires due to only "Ongoing" transactions can 
be expired
 kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
val now = time.milliseconds()
inReadLock(stateLock) {
  transactionMetadataCache.filter { case (txnPartitionId, _) =>
!leavingPartitions.exists(_.txnPartitionId == txnPartitionId)
  }.flatM

[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId

2020-01-23 Thread Oleksii Boiko (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021982#comment-17021982
 ] 

Oleksii Boiko edited comment on KAFKA-8803 at 1/23/20 11:35 AM:


Hi all,

We are facing same issue. The sequence of issue reproducing was increased after 
migrating to Kafka-streams 2.3
 For our case increase was caused by changing commit behaviour in 
Kafka-streams. Previously(we used 2.0.1), "commit offsets" was executed only in 
case when new record was consumed on source node but in version 2.3 "commit 
offset" executes on each "punctuate" call even no changes were made. We have 
punctuator with 1s wall-clock scheduler. As the the result commit offsets 
operations count was grown. Additionally to this we had detected that error 
which causes transctional id stuck

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to 
state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=6, 
txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), 
txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to 
unexpected metadata{noformat}
appears at the same time almost every day 

If I'm correct that means that transaction can not be transferred to "Ongoing" 
state and as the result it never expires due to only "Ongoing" transactions can 
be expired
 kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
val now = time.milliseconds()
inReadLock(stateLock) {
  transactionMetadataCache.filter { case (txnPartitionId, _) =>
!leavingPartitions.exists(_.txnPartitionId == txnPartitionId)
  }.flatMap { case (_, entry) =>
entry.metadataPerTransactionalId.filter { case (_, txnMetadata) =>
  if (txnMetadata.pendingTransitionInProgress) {
false
  } else {
txnMetadata.state match {
  case Ongoing =>
txnMetadata.txnStartTimestamp + txnMetadata.txnTimeoutMs < now
  case _ => false
}
  }
}.map { case (txnId, txnMetadata) =>
  TransactionalIdAndProducerIdEpoch(txnId, txnMetadata.producerId, 
txnMetadata.producerEpoch)
}
  }
}
  }{noformat}
Based on this inputs we had detected that time of issue reproducing is the time 
of ntp synchronization. 
 Our broker timer goes froward in comparison with ntp and have up to +2 seconds 
per 24 hour and ntp sync rollbacks this delta.

After disabling ntp sync issue was gone
 We had found that similar issue was already fixed previously but it does not 
cover all the cases https://issues.apache.org/jira/browse/KAFKA-5415
 There is one more place where we have timestamp comparison

kafka.coordinator.transaction.TransactionMetadata#completeTransitionTo
{noformat}
case Ongoing => // from addPartitions if (!validProducerEpoch(transitMetadata) 
|| !topicPartitions.subsetOf(transitMetadata.topicPartitions) || txnTimeoutMs 
!= transitMetadata.txnTimeoutMs || txnStartTimestamp > 
transitMetadata.txnStartTimestamp) { 
throwStateTransitionFailure(transitMetadata) } else { txnStartTimestamp = 
transitMetadata.txnStartTimestamp 
addPartitions(transitMetadata.topicPartitions) }
{noformat}
 

 

 


was (Author: oleksii.boiko):
Hi all,

We are facing same issue. The sequence of issue reproducing was increased after 
migrating to Kafka-streams 2.3
 For our case increase was caused by changing commit behaviour in 
Kafka-streams. Previously(we used 2.0.1), "commit offsets" was executed only in 
case when new record was consumed on source node but in version 2.3 "commit 
offset" executes on each "punctuate" call even no changes were made. We have 
punctuator with 1s wall-clock scheduler. As the the result commit offsets 
operations count was grown. Additionally to this we had detected that error 
which causes transctional id stuck

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to 
state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=6, 
txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), 
txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to 
unexpected metadata{noformat}
appears at the same time almost every day 

If I'm correct that means that transaction can not be transfered to "Ongoing" 
state and as the result it never expires due to only "Ongoing" transactions can 
be expired
 kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
val now = time.milliseconds()
inReadLock(stateLock) {
  transactionMetadataCache.filter { case (txnPartitionId, _) =>
!leavingPartitions.exists(_.txnPartitionId == txnPartitionId)
  }.flatM

[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId

2020-01-23 Thread Oleksii Boiko (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021982#comment-17021982
 ] 

Oleksii Boiko edited comment on KAFKA-8803 at 1/23/20 11:34 AM:


Hi all,

We are facing same issue. The sequence of issue reproducing was increased after 
migrating to Kafka-streams 2.3
 For our case increase was caused by changing commit behaviour in 
Kafka-streams. Previously(we used 2.0.1), "commit offsets" was executed only in 
case when new record was consumed on source node but in version 2.3 "commit 
offset" executes on each "punctuate" call even no changes were made. We have 
punctuator with 1s wall-clock scheduler. As the the result commit offsets 
operations count was grown. Additionally to this we had detected that error 
which causes transctional id stuck

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to 
state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=6, 
txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), 
txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to 
unexpected metadata{noformat}
appears at the same time almost every day 

If I'm correct that means that transaction can not be transfered to "Ongoing" 
state and as the result it never expires due to only "Ongoing" transactions can 
be expired
 kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
val now = time.milliseconds()
inReadLock(stateLock) {
  transactionMetadataCache.filter { case (txnPartitionId, _) =>
!leavingPartitions.exists(_.txnPartitionId == txnPartitionId)
  }.flatMap { case (_, entry) =>
entry.metadataPerTransactionalId.filter { case (_, txnMetadata) =>
  if (txnMetadata.pendingTransitionInProgress) {
false
  } else {
txnMetadata.state match {
  case Ongoing =>
txnMetadata.txnStartTimestamp + txnMetadata.txnTimeoutMs < now
  case _ => false
}
  }
}.map { case (txnId, txnMetadata) =>
  TransactionalIdAndProducerIdEpoch(txnId, txnMetadata.producerId, 
txnMetadata.producerEpoch)
}
  }
}
  }{noformat}
Based on this inputs we had detected that time of issue reproducing is the time 
of ntp synchronization. 
 Our broker timer goes froward in comparison with ntp and have up to +2 seconds 
per 24 hour and ntp sync rollbacks this delta.

After disabling ntp sync issue was gone
 We had found that similar issue was already fixed previously but it does not 
cover all the cases https://issues.apache.org/jira/browse/KAFKA-5415
 There is one more place where we have timestamp comparison

kafka.coordinator.transaction.TransactionMetadata#completeTransitionTo
{noformat}
case Ongoing => // from addPartitions if (!validProducerEpoch(transitMetadata) 
|| !topicPartitions.subsetOf(transitMetadata.topicPartitions) || txnTimeoutMs 
!= transitMetadata.txnTimeoutMs || txnStartTimestamp > 
transitMetadata.txnStartTimestamp) { 
throwStateTransitionFailure(transitMetadata) } else { txnStartTimestamp = 
transitMetadata.txnStartTimestamp 
addPartitions(transitMetadata.topicPartitions) }
{noformat}
 

 

 


was (Author: oleksii.boiko):
Hi all,

We are facing same issue. The sequence of issue reproducing was increased after 
migrating to Kafka-streams 2.3
 For our case increase was caused by changing commit behaviour in 
Kafka-streams. Previously(we used 2.0.1), "commit offsets" was executed only in 
case when new record was consumed on source node but in version 2.3 "commit 
offset" executes on each "punctuate" call even no changes were made. We have 
punctuator with 1s wall-clock scheduler. As the the result commit offsets 
operations count was grown. Additionally to this we had detected that error 
which causes transctional id stuck

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to 
state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=6, 
txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), 
txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to 
unexpected metadata{noformat}
appears at the same time almost every day and transaction can not be transfered 
to "Ongoing" state

If I'm correct as the result transaction never expires due to only "Ongoing" 
transactions can be expired
 kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
val now = time.milliseconds()
inReadLock(stateLock) {
  transactionMetadataCache.filter { case (txnPartitionId, _) =>
!leavingPartitions.exists(_.txnPartitionId == txnPartitionId)
  }.flatMap { case

[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId

2020-01-23 Thread Oleksii Boiko (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021982#comment-17021982
 ] 

Oleksii Boiko edited comment on KAFKA-8803 at 1/23/20 11:32 AM:


Hi all,

We are facing same issue. The sequence of issue reproducing was increased after 
migrating to Kafka-streams 2.3
 For our case increase was caused by changing commit behaviour in 
Kafka-streams. Previously(we used 2.0.1), "commit offsets" was executed only in 
case when new record was consumed on source node but in version 2.3 "commit 
offset" executes on each "punctuate" call even no changes were made. We have 
punctuator with 1s wall-clock scheduler. As the the result commit offsets 
operations count was grown. Additionally to this we had detected that error 
which causes transctional id stuck

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to 
state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=6, 
txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), 
txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to 
unexpected metadata{noformat}
appears at the same time almost every day and transaction can not be transfered 
to "Ongoing" state

If I'm correct as the result transaction never expires due to only "Ongoing" 
transactions can be expired
 kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
val now = time.milliseconds()
inReadLock(stateLock) {
  transactionMetadataCache.filter { case (txnPartitionId, _) =>
!leavingPartitions.exists(_.txnPartitionId == txnPartitionId)
  }.flatMap { case (_, entry) =>
entry.metadataPerTransactionalId.filter { case (_, txnMetadata) =>
  if (txnMetadata.pendingTransitionInProgress) {
false
  } else {
txnMetadata.state match {
  case Ongoing =>
txnMetadata.txnStartTimestamp + txnMetadata.txnTimeoutMs < now
  case _ => false
}
  }
}.map { case (txnId, txnMetadata) =>
  TransactionalIdAndProducerIdEpoch(txnId, txnMetadata.producerId, 
txnMetadata.producerEpoch)
}
  }
}
  }{noformat}
Based on these inputs we found that the issue reproduces at the time of ntp synchronization.
 Our broker clock runs ahead of ntp by up to +2 seconds per 24 hours, and the ntp sync rolls back this delta.

After disabling ntp sync the issue was gone.
 We found that a similar issue was already fixed previously, but that fix does not cover all the cases: https://issues.apache.org/jira/browse/KAFKA-5415
 There is one more place where a timestamp comparison exists:

kafka.coordinator.transaction.TransactionMetadata#completeTransitionTo
{noformat}
case Ongoing => // from addPartitions
  if (!validProducerEpoch(transitMetadata) ||
    !topicPartitions.subsetOf(transitMetadata.topicPartitions) ||
    txnTimeoutMs != transitMetadata.txnTimeoutMs ||
    txnStartTimestamp > transitMetadata.txnStartTimestamp) {

    throwStateTransitionFailure(transitMetadata)
  } else {
    txnStartTimestamp = transitMetadata.txnStartTimestamp
    addPartitions(transitMetadata.topicPartitions)
  }
{noformat}
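To make the suspected failure mode concrete, a rough illustration (the first value below is hypothetical, written while the broker clock was still running ahead of ntp; the second is the timestamp from the exception above):
{noformat}
// txnStartTimestamp already stored in TransactionMetadata for this transactionalId,
// written while the broker clock was ~2s ahead of ntp (hypothetical value):
val storedTxnStartTimestamp  = 1577934188000L

// After the ntp sync rolls the clock back, the next transaction's transit metadata
// carries the (now earlier) timestamp seen in the exception above:
val transitTxnStartTimestamp = 1577934186261L

// Assuming the exception indeed comes from the check above, completeTransitionTo(Ongoing)
// evaluates txnStartTimestamp > transitMetadata.txnStartTimestamp, which is now true,
// so throwStateTransitionFailure raises the "due to unexpected metadata" error and the
// transactionalId never reaches the "Ongoing" state (and, per the filter above, never expires).
assert(storedTxnStartTimestamp > transitTxnStartTimestamp)
{noformat}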
 

 

 


was (Author: oleksii.boiko):
Hi all,

We are facing the same issue. The frequency with which the issue reproduces increased after migrating to Kafka-streams 2.3.
For our case the increase was caused by a change of commit behaviour in Kafka-streams. Previously (we used 2.0.1), "commit offsets" was executed only when a new record was consumed on a source node, but in version 2.3 "commit offsets" executes on each "punctuate" call even if no changes were made. We have a punctuator on a 1s wall-clock schedule. As a result the number of commit-offsets operations grew. Additionally, we detected an error that causes the transactional id to get stuck:

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to 
state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=60000,
txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), 
txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to 
unexpected metadata{noformat}
This exception appears at the same time each day, and the transaction cannot be transferred to the "Ongoing" state.

If I'm correct, this means the transaction never expires, because only "Ongoing" transactions can be expired:
kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
val now = time.milliseconds()
inReadLock(stateLock) {
  transactionMetadataCache.filter { case (txnPartitionId, _) =>
!leavingPartitions.exists(_.txnPartitionId == txnPartitionId)
  }.flatMap { case (_, entry) =>
entr

[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId

2020-01-23 Thread Oleksii Boiko (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021982#comment-17021982
 ] 

Oleksii Boiko commented on KAFKA-8803:
--

Hi all,

We are facing the same issue. The frequency with which the issue reproduces increased after migrating to Kafka-streams 2.3.
For our case the increase was caused by a change of commit behaviour in Kafka-streams. Previously (we used 2.0.1), "commit offsets" was executed only when a new record was consumed on a source node, but in version 2.3 "commit offsets" executes on each "punctuate" call even if no changes were made. We have a punctuator on a 1s wall-clock schedule. As a result the number of commit-offsets operations grew. Additionally, we detected an error that causes the transactional id to get stuck:

 
{noformat}
java.lang.IllegalStateException: TransactionalId  failed transition to 
state TxnTransitMetadata(producerId=61432, producerEpoch=0, txnTimeoutMs=60000,
txnState=Ongoing, topicPartitions=Set(__consumer_offsets-10), 
txnStartTimestamp=1577934186261, txnLastUpdateTimestamp=1577934186261) due to 
unexpected metadata{noformat}
This exception appears at the same time each day, and the transaction cannot be transferred to the "Ongoing" state.

If I'm correct, this means the transaction never expires, because only "Ongoing" transactions can be expired:
kafka.coordinator.transaction.TransactionStateManager#timedOutTransactions
{noformat}
def timedOutTransactions(): Iterable[TransactionalIdAndProducerIdEpoch] = {
  val now = time.milliseconds()
  inReadLock(stateLock) {
    transactionMetadataCache.filter { case (txnPartitionId, _) =>
      !leavingPartitions.exists(_.txnPartitionId == txnPartitionId)
    }.flatMap { case (_, entry) =>
      entry.metadataPerTransactionalId.filter { case (_, txnMetadata) =>
        if (txnMetadata.pendingTransitionInProgress) {
          false
        } else {
          txnMetadata.state match {
            case Ongoing =>
              txnMetadata.txnStartTimestamp + txnMetadata.txnTimeoutMs < now
            case _ => false
          }
        }
      }.map { case (txnId, txnMetadata) =>
        TransactionalIdAndProducerIdEpoch(txnId, txnMetadata.producerId, txnMetadata.producerEpoch)
      }
    }
  }
}
{noformat}


Based on these inputs we found that the issue reproduces at the time of ntp synchronization.
Our broker clock runs ahead of ntp by up to +2 seconds per 24 hours, and the ntp sync rolls back this delta.

After disabling ntp sync the issue was gone.
We found that a similar issue was already fixed previously, but that fix does not cover all the cases: https://issues.apache.org/jira/browse/KAFKA-5415
There is one more place where a timestamp comparison exists:

kafka.coordinator.transaction.TransactionMetadata#completeTransitionTo
{noformat}
case Ongoing => // from addPartitions
  if (!validProducerEpoch(transitMetadata) ||
    !topicPartitions.subsetOf(transitMetadata.topicPartitions) ||
    txnTimeoutMs != transitMetadata.txnTimeoutMs ||
    txnStartTimestamp > transitMetadata.txnStartTimestamp) {

    throwStateTransitionFailure(transitMetadata)
  } else {
    txnStartTimestamp = transitMetadata.txnStartTimestamp
    addPartitions(transitMetadata.topicPartitions)
  }
{noformat}
 

 

 

> Stream will not start due to TimeoutException: Timeout expired after 
> 60000milliseconds while awaiting InitProducerId
> 
>
> Key: KAFKA-8803
> URL: https://issues.apache.org/jira/browse/KAFKA-8803
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Raman Gupta
>Assignee: Boyang Chen
>Priority: Major
> Attachments: logs.txt.gz, screenshot-1.png
>
>
> One streams app is consistently failing at startup with the following 
> exception:
> {code}
> 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] 
> org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout 
> exception caught when initializing transactions for task 0_36. This might 
> happen if the broker is slow to respond, if the network connection to the 
> broker was interrupted, or if similar circumstances arise. You can increase 
> producer parameter `max.block.ms` to increase this timeout.
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after 
> 60000milliseconds while awaiting InitProducerId
> {code}
> These same brokers are used by many other streams without any issue, 
> including some in the very same processes for the stream which consistently 
> throws this exception.
> *UPDATE 08/16:*
> The very first instance of this error is August 13th 2019, 17:03:36.754 and 
> it happened for 4 different streams. For 3 of these streams, the error only 
> happened once, and then the stream recovered. F