[jira] [Updated] (KAFKA-17941) TransactionStateManager.loadTransactionMetadata method may get stuck in an infinite loop

2024-11-04 Thread Vincent Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Jiang updated KAFKA-17941:
--
Description: 
When loading transaction metadata from a transaction log partition, if the 
partition contains a segment ending with an empty batch, the "currOffset" update 
logic at 
[https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L482]
 will be skipped. Since "currOffset" is not properly advanced to the next offset 
after the last batch, the TransactionStateManager.loadTransactionMetadata method 
will get stuck in the "while" loop at 
[https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L438].

After the change in https://issues.apache.org/jira/browse/KAFKA-17076, the 
compaction process is more likely to generate segments ending with an empty 
batch. As a result, this issue is more likely to be hit now than before the 
KAFKA-17076 change.
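
A simplified, self-contained sketch of the failure mode (the names below are 
illustrative and do not match Kafka's internal API): the loop only advances 
"currOffset" from batches that contain records, so a trailing empty batch leaves 
the loop condition permanently true.
{code:scala}
// Hypothetical model of the loading loop; names do not match Kafka's internal API.
final case class Batch(baseOffset: Long, nextOffset: Long, records: Seq[(String, String)])

def loadTransactionMetadata(logEndOffset: Long, read: Long => Seq[Batch]): Map[String, String] = {
  var loaded = Map.empty[String, String]
  var currOffset = 0L
  while (currOffset < logEndOffset) {
    for (batch <- read(currOffset)) {
      loaded ++= batch.records // apply each record to the in-memory transaction state
      // Bug being described: currOffset is advanced only for batches that contain
      // records, so a segment ending with an empty batch never moves currOffset
      // forward and the surrounding while loop never terminates.
      if (batch.records.nonEmpty) currOffset = batch.nextOffset
      // A fix along the lines of the description would advance unconditionally,
      // e.g. currOffset = batch.nextOffset for every batch returned by the read.
    }
  }
  loaded
}
{code}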

  was:When loading transaction metadata from a transaction log partition, if 
the partition contains a segment ending with an empty batch, the "currOffset" 
update logic at 
[https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L482]
 will be skipped. Since "currOffset" is not properly advanced to the next offset 
after the last batch, the TransactionStateManager.loadTransactionMetadata method 
will get stuck in the "while" loop at 
[https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L438].


> TransactionStateManager.loadTransactionMetadata method may get stuck in an 
> infinite loop
> 
>
> Key: KAFKA-17941
> URL: https://issues.apache.org/jira/browse/KAFKA-17941
> Project: Kafka
>  Issue Type: Bug
>Reporter: Vincent Jiang
>Assignee: Vincent Jiang
>Priority: Major
>
> When loading transaction metadata from a transaction log partition, if the 
> partition contains a segment ending with an empty batch, the "currOffset" 
> update logic at 
> [https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L482]
> will be skipped. Since "currOffset" is not properly advanced to the next offset 
> after the last batch, the TransactionStateManager.loadTransactionMetadata method 
> will get stuck in the "while" loop at 
> [https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L438].
>
> After the change in https://issues.apache.org/jira/browse/KAFKA-17076, the 
> compaction process is more likely to generate segments ending with an empty 
> batch. As a result, this issue is more likely to be hit now than before the 
> KAFKA-17076 change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-17941) TransactionStateManager.loadTransactionMetadata method may get stuck in an infinite loop

2024-11-04 Thread Vincent Jiang (Jira)
Vincent Jiang created KAFKA-17941:
-

 Summary: TransactionStateManager.loadTransactionMetadata method 
may get stuck in an infinite loop
 Key: KAFKA-17941
 URL: https://issues.apache.org/jira/browse/KAFKA-17941
 Project: Kafka
  Issue Type: Bug
Reporter: Vincent Jiang


When loading transaction metadata from a transaction log partition, if the 
partition contains a segment ending with an empty batch, the "currOffset" update 
logic at 
[https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L482]
 will be skipped. Since "currOffset" is not properly advanced to the next offset 
after the last batch, the TransactionStateManager.loadTransactionMetadata method 
will get stuck in the "while" loop at 
[https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L438].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-15375) When running in KRaft mode, LogManager may create CleanShutdown file by mistake

2023-08-17 Thread Vincent Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Jiang reassigned KAFKA-15375:
-

Assignee: Vincent Jiang

> When running in KRaft mode, LogManager may create CleanShutdown file by 
> mistake 
> -
>
> Key: KAFKA-15375
> URL: https://issues.apache.org/jira/browse/KAFKA-15375
> Project: Kafka
>  Issue Type: Bug
>Reporter: Vincent Jiang
>Assignee: Vincent Jiang
>Priority: Major
>
> Consider the following sequence when running Kafka in KRaft mode:
>  # A partition log "log1" is created under "logDir1", and some records are 
> appended to it.
>  # Broker crashes. No clean shutdown file is created in "logDir1".
>  # Broker is restarted. BrokerServer.startup is called.
>  # On a different thread, LogManager.startup is called by 
> BrokerMetadataPublisher.
>  # Before LogManager.startup finishes recovering logs under "logDir1", a fatal 
> exception is thrown in BrokerServer.startup.
>  # In the exception handler, BrokerServer.startup calls LogManager.shutdown. As 
> a result, a clean shutdown file is created under "logDir1".
>  # Broker is restarted again. Due to the clean shutdown file created in step 
> 6, recovery is skipped for logs under "logDir1", which is not right because 
> "logDir1" was not fully recovered in step 5.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15375) When running in KRaft mode, LogManager may create CleanShutdown file by mistake

2023-08-17 Thread Vincent Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Jiang updated KAFKA-15375:
--
Description: 
Consider the following sequence when running Kafka in KRaft mode:
 # A partition log "log1" is created under "logDir1", and some records are 
appended to it.
 # Broker crashes. No clean shutdown file is created in "logDir1".
 # Broker is restarted. BrokerServer.startup is called.
 # On a different thread, LogManager.startup is called by 
BrokerMetadataPublisher.
 # Before LogManager.startup finishes recovering logs under "logDir1", a fatal 
exception is thrown in BrokerServer.startup.
 # In the exception handler, BrokerServer.startup calls LogManager.shutdown. As a 
result, a clean shutdown file is created under "logDir1".
 # Broker is restarted again. Due to the clean shutdown file created in step 6, 
recovery is skipped for logs under "logDir1", which is not right because 
"logDir1" was not fully recovered in step 5. (A sketch of one possible guard 
follows below.)

  was:
Consider the following sequence when running Kafka in KRaft mode:
 # A partition log "log1" is created under "logDir1", and some records are 
appended to it.
 # Broker crashes. No clean shutdown file is created in "logDir1".
 # Broker is restarted. BrokerServer.startup is called.
 # On a different thread, LogManager.startup is called by 
BrokerMetadataPublisher.
 # Before LogManager.startup finishes recovering logs under "logDir1", a fatal 
exception is thrown in BrokerServer.startup.
 # In the exception handler, BrokerServer.startup calls LogManager.shutdown. As a 
result, a clean shutdown file is created under "logDir1".
 # Broker is restarted again. Due to the clean shutdown file created in step 6, 
recovery is skipped for logs under "logDir1", which is not right because 
"logDir1" needs recovery.

 


> When running in KRaft mode, LogManager may create CleanShutdown file by 
> mistake 
> -
>
> Key: KAFKA-15375
> URL: https://issues.apache.org/jira/browse/KAFKA-15375
> Project: Kafka
>  Issue Type: Bug
>Reporter: Vincent Jiang
>Priority: Major
>
> Consider the following sequence when running Kafka in KRaft mode:
>  # A partition log "log1" is created under "logDir1", and some records are 
> appended to it.
>  # Broker crashes. No clean shutdown file is created in "logDir1".
>  # Broker is restarted. BrokerServer.startup is called.
>  # On a different thread, LogManager.startup is called by 
> BrokerMetadataPublisher.
>  # Before LogManager.startup finishes recovering logs under "logDir1", a fatal 
> exception is thrown in BrokerServer.startup.
>  # In the exception handler, BrokerServer.startup calls LogManager.shutdown. As 
> a result, a clean shutdown file is created under "logDir1".
>  # Broker is restarted again. Due to the clean shutdown file created in step 
> 6, recovery is skipped for logs under "logDir1", which is not right because 
> "logDir1" was not fully recovered in step 5.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15375) When running in KRaft mode, LogManager may create CleanShutdown file by mistake

2023-08-17 Thread Vincent Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Jiang updated KAFKA-15375:
--
Description: 
Consider the following sequence when running Kafka in KRaft mode:
 # A partition log "log1" is created under "logDir1", and some records are 
appended to it.
 # Broker crashes. No clean shutdown file is created in "logDir1".
 # Broker is restarted. BrokerServer.startup is called.
 # On a different thread, LogManager.startup is called by 
BrokerMetadataPublisher.
 # Before LogManager.startup finishes recovering logs under "logDir1", a fatal 
exception is thrown in BrokerServer.startup.
 # In the exception handler, BrokerServer.startup calls LogManager.shutdown. As a 
result, a clean shutdown file is created under "logDir1".
 # Broker is restarted again. Due to the clean shutdown file created in step 6, 
recovery is skipped for logs under "logDir1", which is not right because 
"logDir1" needs recovery.

 

> When running in KRaft mode, LogManager may create CleanShutdown file by 
> mistake 
> -
>
> Key: KAFKA-15375
> URL: https://issues.apache.org/jira/browse/KAFKA-15375
> Project: Kafka
>  Issue Type: Bug
>Reporter: Vincent Jiang
>Priority: Major
>
> Consider the following sequence when running Kafka in KRaft mode:
>  # A partition log "log1" is created under "logDir1", and some records are 
> appended to it.
>  # Broker crashes. No clean shutdown file is created in "logDir1".
>  # Broker is restarted. BrokerServer.startup is called.
>  # On a different thread, LogManager.startup is called by 
> BrokerMetadataPublisher.
>  # Before LogManager.startup finishes recovering logs under "logDir1", a fatal 
> exception is thrown in BrokerServer.startup.
>  # In the exception handler, BrokerServer.startup calls LogManager.shutdown. As 
> a result, a clean shutdown file is created under "logDir1".
>  # Broker is restarted again. Due to the clean shutdown file created in step 
> 6, recovery is skipped for logs under "logDir1", which is not right because 
> "logDir1" needs recovery.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15375) When running in KRaft mode, LogManager may create CleanShutdown file by mistake

2023-08-17 Thread Vincent Jiang (Jira)
Vincent Jiang created KAFKA-15375:
-

 Summary: When running in KRaft mode, LogManager may create 
CleanShutdown file by mistake 
 Key: KAFKA-15375
 URL: https://issues.apache.org/jira/browse/KAFKA-15375
 Project: Kafka
  Issue Type: Bug
Reporter: Vincent Jiang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14497) LastStableOffset is advanced prematurely when a log is reopened.

2022-12-15 Thread Vincent Jiang (Jira)
Vincent Jiang created KAFKA-14497:
-

 Summary: LastStableOffset is advanced prematurely when a log is 
reopened.
 Key: KAFKA-14497
 URL: https://issues.apache.org/jira/browse/KAFKA-14497
 Project: Kafka
  Issue Type: Bug
Reporter: Vincent Jiang


In the test case below, the last stable offset of a log is advanced prematurely 
after a reopen:
 # producer #1 appends transactional records to the leader. offsets = [0, 1, 2, 3]
 # producer #2 appends transactional records to the leader. offsets = [4, 5, 6, 7]
 # all records are replicated to followers and the high watermark advances to 8.
 # at this point, lastStableOffset = 0 (the first offset of an open transaction).
 # producer #1 aborts its transaction by writing an abort marker at offset 8. 
ProducerStateManager.unreplicatedTxns contains the aborted transaction 
(firstOffset=0, lastOffset=8).
 # the log is then closed and reopened.
 # after the reopen, log.lastStableOffset is initialized to 4. This is because 
ProducerStateManager.unreplicatedTxns is empty after reopening the log.

We should rebuild ProducerStateManager.unreplicatedTxns when reloading a log, so 
that lastStableOffset remains unchanged before and after the reopen.
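
As a rough illustration of the idea (simplified names, not the actual 
ProducerStateManager API): on reload, any completed transaction whose marker sits 
at or above the high watermark has not been fully replicated yet and should still 
hold back the last stable offset.
{code:scala}
// Hypothetical sketch; CompletedTxn here only models the idea described above.
final case class CompletedTxn(producerId: Long, firstOffset: Long, lastOffset: Long)

// Rebuild the "unreplicated" set on reload: markers at or above the high watermark
// are not yet replicated, so their transactions must still bound the LSO.
def rebuildUnreplicatedTxns(completed: Seq[CompletedTxn], highWatermark: Long): Seq[CompletedTxn] =
  completed.filter(_.lastOffset >= highWatermark)

def lastStableOffset(unreplicated: Seq[CompletedTxn],
                     ongoingTxnFirstOffset: Option[Long],
                     highWatermark: Long): Long = {
  val bounds = unreplicated.map(_.firstOffset) ++ ongoingTxnFirstOffset
  if (bounds.isEmpty) highWatermark else math.min(bounds.min, highWatermark)
}
{code}
In the scenario above, rebuilding yields the aborted transaction (firstOffset=0, 
lastOffset=8) because its marker at offset 8 is not below the high watermark, so 
the LSO stays at 0 across the reopen instead of jumping to 4.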



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14347) deleted records may be kept unexpectedly when leader changes while adding a new replica

2022-11-01 Thread Vincent Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627365#comment-17627365
 ] 

Vincent Jiang commented on KAFKA-14347:
---

Related issue: https://issues.apache.org/jira/browse/KAFKA-4546

> deleted records may be kept unexpectedly when leader changes while adding a 
> new replica
> ---
>
> Key: KAFKA-14347
> URL: https://issues.apache.org/jira/browse/KAFKA-14347
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Vincent Jiang
>Priority: Major
>
> Consider that in a compacted topic, a regular record _k1=v1_ is deleted by a 
> later tombstone record _k1=null_. And imagine that somehow log compaction is 
> making different progress on the three replicas _r1_, _r2_ and _r3_:
> - on replica _r1_, log compaction has not cleaned _k1=v1_ or _k1=null_ yet.
> - on replica _r2_, log compaction has cleaned and removed both _k1=v1_ and 
> _k1=null_.
> In this case, the following sequence can cause record _k1=v1_ to be kept 
> unexpectedly:
> 1. Replica _r3_ is re-assigned to a different node and starts to replicate 
> data from the leader.
> 2. At the beginning, _r1_ is the leader, so _r3_ replicates record _k1=v1_ 
> from _r1_.
> 3. Before _k1=null_ is replicated from _r1_, the leader changes to _r2_.
> 4. _r3_ replicates data from _r2_. Because the _k1=null_ record has been 
> cleaned on _r2_, it is not replicated.
> As a result, _r3_ has record _k1=v1_ but not _k1=null_.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14347) deleted records may be kept unexpectedly when leader changes while adding a new replica

2022-11-01 Thread Vincent Jiang (Jira)
Vincent Jiang created KAFKA-14347:
-

 Summary: deleted records may be kept unexpectedly when leader 
changes while adding a new replica
 Key: KAFKA-14347
 URL: https://issues.apache.org/jira/browse/KAFKA-14347
 Project: Kafka
  Issue Type: Improvement
Reporter: Vincent Jiang


Consider that in a compacted topic, a regular record _k1=v1_ is deleted by a 
later tombstone record _k1=null_. And imagine that somehow log compaction is 
making different progress on the three replicas _r1_, _r2_ and _r3_:
- on replica _r1_, log compaction has not cleaned _k1=v1_ or _k1=null_ yet.
- on replica _r2_, log compaction has cleaned and removed both _k1=v1_ and 
_k1=null_.

In this case, the following sequence can cause record _k1=v1_ to be kept 
unexpectedly:
1. Replica _r3_ is re-assigned to a different node and starts to replicate 
data from the leader.
2. At the beginning, _r1_ is the leader, so _r3_ replicates record _k1=v1_ from 
_r1_.
3. Before _k1=null_ is replicated from _r1_, the leader changes to _r2_.
4. _r3_ replicates data from _r2_. Because the _k1=null_ record has been 
cleaned on _r2_, it is not replicated.

As a result, _r3_ has record _k1=v1_ but not _k1=null_.
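
A tiny, purely illustrative simulation of this divergence (it only models 
compacted-log contents as key/value pairs; none of these names come from Kafka):
{code:scala}
// Purely illustrative simulation of the scenario above; not Kafka code.
// A replica's compacted log is modeled as a sequence of (key, value) pairs,
// where value = None stands for a tombstone.
def simulate(): Vector[(String, Option[String])] = {
  val r1 = Vector("k1" -> Option("v1"), "k1" -> Option.empty[String]) // nothing cleaned yet
  val r2 = Vector.empty[(String, Option[String])]                     // both records cleaned away

  var r3 = r1.take(1) // r3 replicates from leader r1 and only gets as far as k1=v1
  r3 = r3 ++ r2       // leadership moves to r2, which has nothing left to send for k1
  r3                  // r3 keeps k1=v1 and never sees the tombstone
}
{code}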



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14096) Race Condition in Log Rolling Leading to Disk Failure

2022-08-12 Thread Vincent Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579155#comment-17579155
 ] 

Vincent Jiang commented on KAFKA-14096:
---

I have two questions about this issue:
1. At 2022-07-18 18:24:47,782, logging shows segment 141935201 was scheduled 
to be deleted, which means the segment should have been removed from 
Log.segments. Then at 2022-07-18 18:24:48,024, how did the Log.flush method 
still see the segment?
2. By default, there is a 60-second delay after a segment is scheduled for 
deletion, so the segment deletion should not have happened yet at 2022-07-18 
18:24:48,024. If so, why did Log.flush see a ClosedChannelException? I assume 
renaming the file should not cause a ClosedChannelException.

[~eazama], to better understand the issue, could you share the full broker 
log?

> Race Condition in Log Rolling Leading to Disk Failure
> -
>
> Key: KAFKA-14096
> URL: https://issues.apache.org/jira/browse/KAFKA-14096
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.5.1
>Reporter: Eric Azama
>Priority: Major
>
> We've recently encountered what appears to be a race condition that can lead 
> to disk being marked offline. One of our brokers recently crashed because its 
> log directory failed. We found the following in the server.log file
> {code:java}
> [2022-07-18 18:24:42,940] INFO [Log partition=TOPIC-REDACTED-15, 
> dir=/data1/kafka-logs] Rolled new log segment at offset 141946850 in 37 ms. 
> (kafka.log.Log)
> [...]
> [2022-07-18 18:24:47,782] INFO [Log partition=TOPIC-REDACTED-15, 
> dir=/data1/kafka-logs] Scheduling segments for deletion 
> List(LogSegment(baseOffset=141935201, size=1073560219, 
> lastModifiedTime=1658168598869, largestTime=1658168495678)) (kafka.log.Log)
> [2022-07-18 18:24:48,024] ERROR Error while flushing log for 
> TOPIC-REDACTED-15 in dir /data1/kafka-logs with offset 141935201 
> (kafka.server.LogDirFailureChannel)
> java.nio.channels.ClosedChannelException
> at java.base/sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:150)
> at java.base/sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:452)
> at org.apache.kafka.common.record.FileRecords.flush(FileRecords.java:176)
> at kafka.log.LogSegment.$anonfun$flush$1(LogSegment.scala:472)
> at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
> at kafka.log.LogSegment.flush(LogSegment.scala:471)
> at kafka.log.Log.$anonfun$flush$4(Log.scala:1956)
> at kafka.log.Log.$anonfun$flush$4$adapted(Log.scala:1955)
> at scala.collection.Iterator.foreach(Iterator.scala:941)
> at scala.collection.Iterator.foreach$(Iterator.scala:941)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> at kafka.log.Log.$anonfun$flush$2(Log.scala:1955)
> at kafka.log.Log.flush(Log.scala:2322)
> at kafka.log.Log.$anonfun$roll$9(Log.scala:1925)
> at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> [2022-07-18 18:24:48,036] ERROR Uncaught exception in scheduled task 
> 'flush-log' (kafka.utils.KafkaScheduler)
> org.apache.kafka.common.errors.KafkaStorageException: Error while flushing 
> log for TOPIC-REDACTED-15 in dir /data1/kafka-logs with offset 141935201{code}
> and the following in the log-cleaner.log file
> {code:java}
> [2022-07-18 18:24:47,062] INFO Cleaner 0: Cleaning 
> LogSegment(baseOffset=141935201, size=1073560219, 
> lastModifiedTime=1658168598869, largestTime=1658168495678) in log 
> TOPIC-REDACTED-15 into 141935201 with deletion horizon 1658082163480, 
> retaining deletes. (kafka.log.LogCleaner) {code}
> The timing of the log-cleaner log shows that the log flush failed because the 
> log segment had been cleaned and the underlying file was already renamed or 
> deleted.
> The stacktrace indicates that the log flush that triggered the exception was 
> part of the process of rolling a new log segment. (at 
> kafka.log.Log.$anonfun$roll$9([Log.scala:1925|https://github.com/apache/kafka/blob/2.5.1/core/src/main/scala/kafka/log/Log.scala#L1925]))

[jira] [Updated] (KAFKA-14151) Add additional validation to protect on-disk log segment data from being corrupted

2022-08-09 Thread Vincent Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Jiang updated KAFKA-14151:
--
Priority: Major  (was: Minor)

> Add additional validation to protect on-disk log segment data from being 
> corrupted
> --
>
> Key: KAFKA-14151
> URL: https://issues.apache.org/jira/browse/KAFKA-14151
> Project: Kafka
>  Issue Type: Improvement
>  Components: log
>Reporter: Vincent Jiang
>Priority: Major
>
> We received escalations reporting bad records being written to on-disk log 
> segment data due to environmental issues (a bug in an old JVM JIT version). We 
> should consider adding additional validation to protect the on-disk data 
> from being corrupted by inadvertent bugs or environmental issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14151) Add additional validation to protect on-disk log segment data from being corrupted

2022-08-09 Thread Vincent Jiang (Jira)
Vincent Jiang created KAFKA-14151:
-

 Summary: Add additional validation to protect on-disk log segment 
data from being corrupted
 Key: KAFKA-14151
 URL: https://issues.apache.org/jira/browse/KAFKA-14151
 Project: Kafka
  Issue Type: Improvement
  Components: log
Reporter: Vincent Jiang


We received escalations reporting bad records being written to on-disk log 
segment data due to environmental issues (a bug in an old JVM JIT version). We 
should consider adding additional validation to protect the on-disk data from 
being corrupted by inadvertent bugs or environmental issues.
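
One possible form of such validation, sketched here purely as an illustration 
(the class and method names are hypothetical, not an existing Kafka API): 
recompute a batch's checksum over its payload just before it is written and 
refuse the write on a mismatch.
{code:scala}
// Illustrative sketch only; names are hypothetical.
import java.util.zip.CRC32C

final case class BatchToWrite(payload: Array[Byte], expectedCrc: Long)

def validateBeforeWrite(batch: BatchToWrite): Unit = {
  val crc = new CRC32C()
  crc.update(batch.payload, 0, batch.payload.length)
  // Reject the write instead of letting a corrupted batch reach the segment file.
  if (crc.getValue != batch.expectedCrc)
    throw new IllegalStateException(
      s"Refusing to write batch: CRC mismatch (expected ${batch.expectedCrc}, got ${crc.getValue})")
}
{code}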



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14151) Add additional validation to protect on-disk log segment data from being corrupted

2022-08-09 Thread Vincent Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Jiang updated KAFKA-14151:
--
Priority: Minor  (was: Major)

> Add additional validation to protect on-disk log segment data from being 
> corrupted
> --
>
> Key: KAFKA-14151
> URL: https://issues.apache.org/jira/browse/KAFKA-14151
> Project: Kafka
>  Issue Type: Improvement
>  Components: log
>Reporter: Vincent Jiang
>Priority: Minor
>
> We received escalations reporting bad records being written to on-disk log 
> segment data due to environmental issues (a bug in an old JVM JIT version). We 
> should consider adding additional validation to protect the on-disk data 
> from being corrupted by inadvertent bugs or environmental issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14005) LogCleaner doesn't clean log if there is no dirty range

2022-06-16 Thread Vincent Jiang (Jira)
Vincent Jiang created KAFKA-14005:
-

 Summary: LogCleaner doesn't clean log if there is no dirty range
 Key: KAFKA-14005
 URL: https://issues.apache.org/jira/browse/KAFKA-14005
 Project: Kafka
  Issue Type: Bug
Reporter: Vincent Jiang


When there is no dirty range to clean (firstDirtyOffset == 
firstUncleanableOffset), buildOffsetMap for the dirty range returns an empty 
offset map, with map.latestOffset = -1.

The target cleaning offset range then becomes [startOffset, map.latestOffset + 1) 
= [startOffset, 0), so no segments are cleaned.

The correct cleaning offset range should be [startOffset, firstDirtyOffset), so 
that the log can be cleaned again to remove abort/commit markers or tombstones.
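
A small sketch of the arithmetic (illustrative names only, not the LogCleaner's 
actual code): with an empty offset map the upper bound collapses to 0, while 
falling back to firstDirtyOffset keeps the already-clean portion eligible for 
recleaning.
{code:scala}
// Illustrative only; models the offset-range arithmetic described above.
def buggyCleanUpTo(mapLatestOffset: Long): Long =
  mapLatestOffset + 1 // with an empty offset map, latestOffset = -1, so this is 0

def fixedCleanUpTo(mapLatestOffset: Long, firstDirtyOffset: Long): Long =
  if (mapLatestOffset < 0) firstDirtyOffset else mapLatestOffset + 1

// Example: startOffset = 0 and firstDirtyOffset = firstUncleanableOffset = 100.
// buggyCleanUpTo(-1L)       == 0   -> cleaning range [0, 0): nothing is cleaned
// fixedCleanUpTo(-1L, 100L) == 100 -> cleaning range [0, 100): markers and tombstones can be removed
{code}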

 

The LogCleanerTest.FakeOffsetMap.clear() method has a bug: it doesn't reset 
lastOffset. This bug causes test cases like testAbortMarkerRemoval() to pass 
false-positively.
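
A minimal sketch of the test-helper fix (the fields below are assumptions; the 
real FakeOffsetMap may differ): clear() must reset lastOffset along with the map.
{code:scala}
// Hypothetical model of the test helper; field names are assumptions.
class FakeOffsetMapModel {
  private var map = Map.empty[String, Long]
  var lastOffset: Long = -1L

  def put(key: String, offset: Long): Unit = {
    map += key -> offset
    lastOffset = offset
  }

  def clear(): Unit = {
    map = Map.empty
    lastOffset = -1L // the reported bug: omitting this reset lets stale state leak between cleanings
  }
}
{code}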

  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (KAFKA-13717) KafkaConsumer.close throws authorization exception even when commit offsets is empty

2022-03-07 Thread Vincent Jiang (Jira)
Vincent Jiang created KAFKA-13717:
-

 Summary: KafkaConsumer.close throws authorization exception even 
when commit offsets is empty
 Key: KAFKA-13717
 URL: https://issues.apache.org/jira/browse/KAFKA-13717
 Project: Kafka
  Issue Type: Bug
  Components: unit tests
Reporter: Vincent Jiang


When offsets is empty and the coordinator is unknown, KafkaConsumer.close did not 
throw an exception before commit 
[https://github.com/apache/kafka/commit/4b468a9d81f7380f7197a2a6b859c1b4dca84bd9]. 
After this commit, KafkaConsumer.close may throw an authorization exception.

The root cause is that the commit changed the logic to call lookupCoordinator 
even if offsets is empty.

Even if a consumer doesn't have access to a group or a topic, it might be better 
not to throw an authorization exception in this case, because the close() call 
doesn't actually access any resource.
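
A hedged sketch of the kind of guard the description suggests (simplified, not 
the actual consumer coordinator code): skip the coordinator lookup, and therefore 
any authorization check, when there is nothing to commit.
{code:scala}
// Illustrative model only; not the actual consumer internals.
def commitOffsetsOnClose(offsets: Map[String, Long],
                         lookupCoordinator: () => Unit,
                         sendCommit: Map[String, Long] => Unit): Unit = {
  if (offsets.isEmpty) return // nothing to commit: no coordinator lookup, no authorization error
  lookupCoordinator()         // may surface an authorization exception
  sendCommit(offsets)
}
{code}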



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (KAFKA-13706) org.apache.kafka.test.MockSelector doesn't remove closed connections from its 'ready' field

2022-03-03 Thread Vincent Jiang (Jira)
Vincent Jiang created KAFKA-13706:
-

 Summary: org.apache.kafka.test.MockSelector doesn't remove closed 
connections from its 'ready' field
 Key: KAFKA-13706
 URL: https://issues.apache.org/jira/browse/KAFKA-13706
 Project: Kafka
  Issue Type: Bug
  Components: unit tests
Reporter: Vincent Jiang


The MockSelector.close(String id) method doesn't remove the closed connection 
from its "ready" field.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (KAFKA-13461) KafkaController stops functioning as active controller after ZooKeeperClient auth failure

2021-11-17 Thread Vincent Jiang (Jira)
Vincent Jiang created KAFKA-13461:
-

 Summary: KafkaController stops functioning as active controller 
after ZooKeeperClient auth failure
 Key: KAFKA-13461
 URL: https://issues.apache.org/jira/browse/KAFKA-13461
 Project: Kafka
  Issue Type: Bug
  Components: zkclient
Reporter: Vincent Jiang


When java.security.auth.login.config is present but there is no "Client" 
section, ZookeeperSaslClient creation fails and raises a LoginException, 
resulting in a warning log:
{code:java}
WARN SASL configuration failed: javax.security.auth.login.LoginException: No 
JAAS configuration section named 'Client' was found in specified JAAS 
configuration file: '***'. Will continue connection to Zookeeper server without 
SASL authentication, if Zookeeper server allows it.{code}
When this happens after initial startup, ClientCnxn enqueues an AuthFailed 
event, which triggers the following sequence:
 # zkclient reinitialization is triggered.
 # The controller resigns.
 # Before the controller's ZK session expires, the controller successfully 
connects to ZK and maintains the current session.
 # In KafkaController.elect(), the controller sets activeControllerId to itself 
and short-circuits the rest of the election. Since the controller resigned 
earlier and also skips the call to onControllerFailover(), the controller is not 
actually functioning as the active controller (e.g. the necessary ZK watchers 
haven't been registered). A simplified sketch of this short-circuit follows below.
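
A simplified, hypothetical model of the short-circuit described in step 4 (not 
the actual KafkaController code):
{code:scala}
// Illustrative model of the election short-circuit; not the actual KafkaController.
class ControllerModel(brokerId: Int, readActiveControllerIdFromZk: () => Int) {
  var activeControllerId: Int = -1
  var isActiveController: Boolean = false

  private def onControllerFailover(): Unit = {
    // register ZK watchers, load controller state, etc.
    isActiveController = true
  }

  def elect(): Unit = {
    activeControllerId = readActiveControllerIdFromZk()
    if (activeControllerId == brokerId) {
      // Short-circuit described in step 4: the broker believes it is already the
      // active controller and skips onControllerFailover(), so after an earlier
      // resignation it never re-registers the watchers and state it gave up.
      return
    }
    // ... normal path: win the election in ZK, then complete the failover ...
    onControllerFailover()
  }
}
{code}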

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (KAFKA-13305) NullPointerException in LogCleanerManager "uncleanable-bytes" gauge

2021-09-15 Thread Vincent Jiang (Jira)
Vincent Jiang created KAFKA-13305:
-

 Summary: NullPointerException in LogCleanerManager 
"uncleanable-bytes" gauge
 Key: KAFKA-13305
 URL: https://issues.apache.org/jira/browse/KAFKA-13305
 Project: Kafka
  Issue Type: Bug
  Components: log cleaner
Reporter: Vincent Jiang


We've seen the following exception in a production environment:
{quote}java.lang.NullPointerException: Cannot invoke 
"kafka.log.UnifiedLog.logStartOffset()" because "log" is null at 
kafka.log.LogCleanerManager$.cleanableOffsets(LogCleanerManager.scala:599)
{quote}
It looks like uncleanablePartitions never has partitions removed from it to 
reflect partition deletion or reassignment.

We should fix the NullPointerException and remove deleted partitions from 
uncleanablePartitions.
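
A minimal sketch of the defensive fix suggested above (illustrative names; the 
real LogCleanerManager differs): skip partitions whose log no longer exists when 
computing the uncleanable-bytes gauge, and drop removed partitions from the 
uncleanable set.
{code:scala}
// Illustrative model only; not the actual LogCleanerManager code.
final case class LogModel(logStartOffset: Long, logEndOffset: Long)

def uncleanableBytes(uncleanablePartitions: Set[String],
                     logs: Map[String, LogModel]): Long =
  uncleanablePartitions.iterator
    .flatMap(tp => logs.get(tp)) // ignore partitions whose log was deleted or moved
    .map(log => log.logEndOffset - log.logStartOffset)
    .sum

// On partition deletion or reassignment, the partition should also leave the uncleanable set:
def removePartition(uncleanablePartitions: Set[String], tp: String): Set[String] =
  uncleanablePartitions - tp
{code}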

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)