[jira] [Commented] (KAFKA-7519) Transactional Ids Left in Pending State by TransactionStateManager During Transactional Id Expiration Are Unusable

2018-10-20 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657983#comment-16657983
 ] 

Dong Lin commented on KAFKA-7519:
-

[~ijuma] Thanks for updating the status. I would like to help review it, but it 
seems more related to stream processing and transaction semantics, so it may be 
safer if someone with more expertise in those two areas takes a look :)

> Transactional Ids Left in Pending State by TransactionStateManager During 
> Transactional Id Expiration Are Unusable
> --
>
> Key: KAFKA-7519
> URL: https://issues.apache.org/jira/browse/KAFKA-7519
> Project: Kafka
>  Issue Type: Bug
>  Components: core, producer 
>Affects Versions: 2.0.0
>Reporter: Bridger Howell
>Priority: Blocker
> Fix For: 2.1.0
>
> Attachments: KAFKA-7519.patch, image-2018-10-18-13-02-22-371.png
>
>
>  
> After digging into a case where an exactly-once streams process was bizarrely 
> unable to process incoming data, we observed the following:
>  * StreamThreads stalling while creating a producer, eventually resulting in 
> no consumption by that streams process. Looking into those threads, we found 
> they were stuck in a loop, sending InitProducerIdRequests and always 
> receiving back the retriable error CONCURRENT_TRANSACTIONS and trying again. 
> These requests always had the same transactional id.
>  * After changing the streams process to not use exactly-once, it was able to 
> process messages with no problems.
>  * Alternatively, changing the applicationId for that streams process allowed 
> it to process with no problems.
>  * Every hour, every broker would fail the task `transactionalId-expiration` 
> with the following error:
>  ** 
> {code:java}
> {"exception":{"stacktrace":"java.lang.IllegalStateException: Preparing 
> transaction state transition to Dead while it already a pending sta
> te Dead
>     at 
> kafka.coordinator.transaction.TransactionMetadata.prepareTransitionTo(TransactionMetadata.scala:262)
>     at kafka.coordinator
> .transaction.TransactionMetadata.prepareDead(TransactionMetadata.scala:237)
>     at kafka.coordinator.transaction.TransactionStateManager$$a
> nonfun$enableTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1$$anonfun$2$$anonfun$apply$9$$anonfun$3.apply(TransactionStateManager.scal
> a:151)
>     at 
> kafka.coordinator.transaction.TransactionStateManager$$anonfun$enableTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1$$ano
> nfun$2$$anonfun$apply$9$$anonfun$3.apply(TransactionStateManager.scala:151)
>     at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
>     at
>  
> kafka.coordinator.transaction.TransactionMetadata.inLock(TransactionMetadata.scala:172)
>     at kafka.coordinator.transaction.TransactionSt
> ateManager$$anonfun$enableTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1$$anonfun$2$$anonfun$apply$9.apply(TransactionStateManager.sc
> ala:150)
>     at 
> kafka.coordinator.transaction.TransactionStateManager$$anonfun$enableTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1$$a
> nonfun$2$$anonfun$apply$9.apply(TransactionStateManager.scala:149)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(Traversable
> Like.scala:234)
>     at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.immutable.Li
> st.foreach(List.scala:392)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>     at scala.collection.immutable.Li
> st.map(List.scala:296)
>     at 
> kafka.coordinator.transaction.TransactionStateManager$$anonfun$enableTransactionalIdExpiration$1$$anonfun$app
> ly$mcV$sp$1$$anonfun$2.apply(TransactionStateManager.scala:149)
>     at kafka.coordinator.transaction.TransactionStateManager$$anonfun$enabl
> eTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1$$anonfun$2.apply(TransactionStateManager.scala:142)
>     at scala.collection.Traversabl
> eLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>     at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.
> scala:241)
>     at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
>     at scala.collection.mutable.HashMap$$anon
> fun$foreach$1.apply(HashMap.scala:130)
>     at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
>     at scala.collec
> tion.mutable.HashMap.foreachEntry(HashMap.scala:40)
>     at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
>     at scala.collecti
> on.TraversableLike$class.flatMap(TraversableLike.scala:241)
>     at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>     a
> t 
> kafka.coordinator.transaction.TransactionStateManager

[jira] [Updated] (KAFKA-7464) Fail to shutdown ReplicaManager during broker cleaned shutdown

2018-10-20 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7464:

Fix Version/s: 2.0.1

> Fail to shutdown ReplicaManager during broker cleaned shutdown
> --
>
> Key: KAFKA-7464
> URL: https://issues.apache.org/jira/browse/KAFKA-7464
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Zhanxiang (Patrick) Huang
>Assignee: Zhanxiang (Patrick) Huang
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> In a 2.0 deployment, we saw the following log when shutting down the 
> ReplicaManager during a clean broker shutdown:
> {noformat}
> 2018/09/27 08:22:18.699 WARN [CoreUtils$] [Thread-1] [kafka-server] [] null
> java.lang.IllegalArgumentException: null
> at java.nio.Buffer.position(Buffer.java:244) ~[?:1.8.0_121]
> at sun.nio.ch.IOUtil.write(IOUtil.java:68) ~[?:1.8.0_121]
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) 
> ~[?:1.8.0_121]
> at 
> org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:214)
>  ~[kafka-clients-2.0.0.22.jar:?]
> at 
> org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:164)
>  ~[kafka-clients-2.0.0.22.jar:?]
> at org.apache.kafka.common.utils.Utils.closeAll(Utils.java:806) 
> ~[kafka-clients-2.0.0.22.jar:?]
> at 
> org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:107) 
> ~[kafka-clients-2.0.0.22.jar:?]
> at 
> org.apache.kafka.common.network.Selector.doClose(Selector.java:751) 
> ~[kafka-clients-2.0.0.22.jar:?]
> at org.apache.kafka.common.network.Selector.close(Selector.java:739) 
> ~[kafka-clients-2.0.0.22.jar:?]
> at org.apache.kafka.common.network.Selector.close(Selector.java:701) 
> ~[kafka-clients-2.0.0.22.jar:?]
> at org.apache.kafka.common.network.Selector.close(Selector.java:315) 
> ~[kafka-clients-2.0.0.22.jar:?]
> at 
> org.apache.kafka.clients.NetworkClient.close(NetworkClient.java:595) 
> ~[kafka-clients-2.0.0.22.jar:?]
> at 
> kafka.server.ReplicaFetcherBlockingSend.close(ReplicaFetcherBlockingSend.scala:107)
>  ~[kafka_2.11-2.0.0.22.jar:?]
> at 
> kafka.server.ReplicaFetcherThread.initiateShutdown(ReplicaFetcherThread.scala:108)
>  ~[kafka_2.11-2.0.0.22.jar:?]
> at 
> kafka.server.AbstractFetcherManager$$anonfun$closeAllFetchers$2.apply(AbstractFetcherManager.scala:183)
>  ~[kafka_2.11-2.0.0.22.jar:?]
> at 
> kafka.server.AbstractFetcherManager$$anonfun$closeAllFetchers$2.apply(AbstractFetcherManager.scala:182)
>  ~[kafka_2.11-2.0.0.22.jar:?]
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  ~[scala-library-2.11.12.jar:?]
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130) 
> ~[scala-library-2.11.12.jar:?]
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130) 
> ~[scala-library-2.11.12.jar:?]
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236) 
> ~[scala-library-2.11.12.jar:?]
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) 
> ~[scala-library-2.11.12.jar:?]
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:130) 
> ~[scala-library-2.11.12.jar:?]
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  ~[scala-library-2.11.12.jar:?]
> at 
> kafka.server.AbstractFetcherManager.closeAllFetchers(AbstractFetcherManager.scala:182)
>  ~[kafka_2.11-2.0.0.22.jar:?]
> at 
> kafka.server.ReplicaFetcherManager.shutdown(ReplicaFetcherManager.scala:37) 
> ~[kafka_2.11-2.0.0.22.jar:?]
> at kafka.server.ReplicaManager.shutdown(ReplicaManager.scala:1471) 
> ~[kafka_2.11-2.0.0.22.jar:?]
> at 
> kafka.server.KafkaServer$$anonfun$shutdown$12.apply$mcV$sp(KafkaServer.scala:616)
>  ~[kafka_2.11-2.0.0.22.jar:?]
> at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:86) 
> ~[kafka_2.11-2.0.0.22.jar:?]
> at kafka.server.KafkaServer.shutdown(KafkaServer.scala:616) 
> ~[kafka_2.11-2.0.0.22.jar:?]
> {noformat}
> After that, we noticed that some of the replica fetcher threads failed to 
> shut down:
> {noformat}
> 2018/09/27 08:22:46.176 ERROR [LogDirFailureChannel] 
> [ReplicaFetcherThread-26-13085] [kafka-server] [] Error while rolling log 
> segment for video-social-gestures-30 in dir /export/content/kafka/i001_caches
> java.nio.channels.ClosedChannelException: null
> at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) 
> ~[?:1.8.0_121]
> at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:300) 
> ~[?:1.8.0_121]
> at 
> org.apache.kafka.c

[jira] [Updated] (KAFKA-5403) Transactions system test should dedup consumed messages by offset

2018-10-23 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5403:

Fix Version/s: (was: 2.1.0)

> Transactions system test should dedup consumed messages by offset
> -
>
> Key: KAFKA-5403
> URL: https://issues.apache.org/jira/browse/KAFKA-5403
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Major
>
> In KAFKA-5396, we saw that the consumers which verify the data in multiple 
> topics could read the same offsets multiple times, for instance when a 
> rebalance happens. 
> This would detect spurious duplicates, causing the test to fail. We should 
> dedup the consumed messages by offset and only fail the test if we see 
> duplicate values for a unique set of offsets.
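As an illustration of the dedup idea (a minimal Java sketch with a hypothetical 
ConsumedRecord holder, not code from the actual system test): collapse re-reads 
of the same (partition, offset) first, and only then look for genuinely 
duplicated values.
{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: re-reads of the same (partition, offset) overwrite each other and are
// not counted; only the same value appearing at two distinct offsets is a
// real duplicate.
public class OffsetDedupCheck {
    public static boolean hasRealDuplicates(Iterable<ConsumedRecord> records) {
        Map<String, String> byOffset = new HashMap<>(); // "partition-offset" -> value
        for (ConsumedRecord r : records) {
            byOffset.put(r.partition + "-" + r.offset, r.value);
        }
        Set<String> seenValues = new HashSet<>();
        for (String value : byOffset.values()) {
            if (!seenValues.add(value)) {
                return true; // same value at two distinct offsets
            }
        }
        return false;
    }

    // Minimal record holder used only for this sketch.
    public static class ConsumedRecord {
        final int partition;
        final long offset;
        final String value;

        ConsumedRecord(int partition, long offset, String value) {
            this.partition = partition;
            this.offset = offset;
            this.value = value;
        }
    }
}
{code}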



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-4690) IllegalStateException using DeleteTopicsRequest when delete.topic.enable=false

2018-10-23 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-4690:

Fix Version/s: (was: 2.1.0)

> IllegalStateException using DeleteTopicsRequest when delete.topic.enable=false
> --
>
> Key: KAFKA-4690
> URL: https://issues.apache.org/jira/browse/KAFKA-4690
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.2.0
> Environment: OS X
>Reporter: Jon Chiu
>Assignee: Manikumar
>Priority: Major
> Attachments: delete-topics-request.java
>
>
> There is no indication as to why the delete request fails. Perhaps an error 
> code?
> This can be reproduced with the following steps:
> 1. Start ZK and 1 broker (with default {{delete.topic.enable=false}})
> 2. Create a topic test
> {noformat}
> bin/kafka-topics.sh --zookeeper localhost:2181 \
>   --create --topic test --partitions 1 --replication-factor 1
> {noformat}
> 3. Delete topic by sending a DeleteTopicsRequest
> 4. An error is returned
> {noformat}
> org.apache.kafka.common.errors.DisconnectException
> {noformat}
> or 
> {noformat}
> java.lang.IllegalStateException: Attempt to retrieve exception from future 
> which hasn't failed
>   at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.exception(RequestFuture.java:99)
>   at 
> io.confluent.adminclient.KafkaAdminClient.send(KafkaAdminClient.java:195)
>   at 
> io.confluent.adminclient.KafkaAdminClient.deleteTopic(KafkaAdminClient.java:152)
> {noformat}
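For illustration, a minimal sketch of issuing the same delete with the public 
org.apache.kafka.clients.admin.AdminClient available in newer releases (the 
report itself used an internal Confluent admin client against 0.10.2); the open 
question in this ticket is which error, if any, surfaces when 
delete.topic.enable=false:
{code:java}
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

// Sketch: send a delete-topics request and surface whatever error the broker returns.
public class DeleteTopicExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            try {
                admin.deleteTopics(Collections.singleton("test")).all().get();
                System.out.println("Topic deletion request succeeded");
            } catch (ExecutionException e) {
                // With delete.topic.enable=false, the interesting question is which
                // exception (if any) shows up here rather than a silent failure.
                System.out.println("Topic deletion failed: " + e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
{code}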



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-6045) All access to log should fail if log is closed

2018-10-23 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-6045:

Fix Version/s: (was: 2.1.0)

> All access to log should fail if log is closed
> --
>
> Key: KAFKA-6045
> URL: https://issues.apache.org/jira/browse/KAFKA-6045
> Project: Kafka
>  Issue Type: Bug
>Reporter: Dong Lin
>Priority: Major
>
> After log.close() or log.closeHandlers() is called for a given log, all uses 
> of the Log's API should fail with a proper exception. For example, 
> log.appendAsLeader() should throw KafkaStorageException. APIs such as 
> Log.activeProducersWithLastSequence() should also fail, but not necessarily 
> with KafkaStorageException, since KafkaStorageException indicates a failure 
> to access disk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5857) Excessive heap usage on controller node during reassignment

2018-10-23 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5857:

Fix Version/s: (was: 2.1.0)

> Excessive heap usage on controller node during reassignment
> ---
>
> Key: KAFKA-5857
> URL: https://issues.apache.org/jira/browse/KAFKA-5857
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.11.0.0
> Environment: CentOs 7, Java 1.8
>Reporter: Raoufeh Hashemian
>Priority: Major
>  Labels: reliability
> Attachments: CPU.png, disk_write_x.png, memory.png, 
> reassignment_plan.txt
>
>
> I was trying to expand our Kafka cluster from 6 broker nodes to 12 broker 
> nodes. 
> Before expansion, we had a single topic with 960 partitions and a replication 
> factor of 3, so each node had 480 partitions. The size of data on each node 
> was 3 TB. 
> To do the expansion, I submitted a partition reassignment plan (see attached 
> file for the current/new assignments). The plan was optimized to minimize 
> data movement and be rack aware. 
> When I submitted the plan, it took approximately 3 hours for moving data from 
> old to new nodes to complete. After that, it started deleting source 
> partitions (I say this based on the number of file descriptors) and 
> rebalancing leaders, which was not successful. Meanwhile, the heap usage on 
> the controller node started to climb steeply (along with long GC times); it 
> took 5 hours for the controller to run out of memory, and another controller 
> then showed the same behaviour for another 4 hours. At that point ZooKeeper 
> ran out of disk and the service stopped.
> To recover from this condition:
> 1) Removed zk logs to free up disk and restarted all 3 zk nodes
> 2) Deleted the /kafka/admin/reassign_partitions node from zk
> 3) Did unclean restarts of the kafka service on the OOM'd controller nodes, 
> which took 3 hours to complete. After this stage there were still 676 
> under-replicated partitions.
> 4) Did a clean restart on all 12 broker nodes.
> After step 4, the number of under-replicated partitions went to 0.
> So I was wondering whether this memory footprint from the controller is 
> expected for ~1k partitions? Did we do something wrong, or is it a bug?
> Attached are some resource usage graphs covering this 30-hour event, plus the 
> reassignment plan. I'll try to add log files as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7481) Consider options for safer upgrade of offset commit value schema

2018-10-23 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7481:

Fix Version/s: (was: 2.1.0)

> Consider options for safer upgrade of offset commit value schema
> 
>
> Key: KAFKA-7481
> URL: https://issues.apache.org/jira/browse/KAFKA-7481
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Blocker
>
> KIP-211 and KIP-320 add new versions of the offset commit value schema. The 
> use of the new schema version is controlled by the 
> `inter.broker.protocol.version` configuration.  Once the new inter-broker 
> version is in use, it is not possible to downgrade since the older brokers 
> will not be able to parse the new schema. 
> The options at the moment are the following:
> 1. Do nothing. Users can try the new version and keep 
> `inter.broker.protocol.version` locked to the old release. Downgrade will 
> still be possible, but users will not be able to test new capabilities which 
> depend on inter-broker protocol changes.
> 2. Instead of using `inter.broker.protocol.version`, we could use 
> `message.format.version`. This would basically extend the use of this config 
> to apply to all persistent formats. The advantage is that it allows users to 
> upgrade the broker and begin using the new inter-broker protocol while still 
> allowing downgrade. But features which depend on the persistent format could 
> not be tested.
> Any other options?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7481) Consider options for safer upgrade of offset commit value schema

2018-10-23 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661482#comment-16661482
 ] 

Dong Lin commented on KAFKA-7481:
-

I removed the 2.1.0 tag so that release.py will generate a release for 
2.1.0-RC0. The goal is for open source users to be able to help test this 
release even before we have addressed all issues. We can add the tag back later.

> Consider options for safer upgrade of offset commit value schema
> 
>
> Key: KAFKA-7481
> URL: https://issues.apache.org/jira/browse/KAFKA-7481
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Blocker
>
> KIP-211 and KIP-320 add new versions of the offset commit value schema. The 
> use of the new schema version is controlled by the 
> `inter.broker.protocol.version` configuration.  Once the new inter-broker 
> version is in use, it is not possible to downgrade since the older brokers 
> will not be able to parse the new schema. 
> The options at the moment are the following:
> 1. Do nothing. Users can try the new version and keep 
> `inter.broker.protocol.version` locked to the old release. Downgrade will 
> still be possible, but users will not be able to test new capabilities which 
> depend on inter-broker protocol changes.
> 2. Instead of using `inter.broker.protocol.version`, we could use 
> `message.format.version`. This would basically extend the use of this config 
> to apply to all persistent formats. The advantage is that it allows users to 
> upgrade the broker and begin using the new inter-broker protocol while still 
> allowing downgrade. But features which depend on the persistent format could 
> not be tested.
> Any other options?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7481) Consider options for safer upgrade of offset commit value schema

2018-10-24 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7481:

Fix Version/s: 2.1.0

> Consider options for safer upgrade of offset commit value schema
> 
>
> Key: KAFKA-7481
> URL: https://issues.apache.org/jira/browse/KAFKA-7481
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Blocker
> Fix For: 2.1.0
>
>
> KIP-211 and KIP-320 add new versions of the offset commit value schema. The 
> use of the new schema version is controlled by the 
> `inter.broker.protocol.version` configuration.  Once the new inter-broker 
> version is in use, it is not possible to downgrade since the older brokers 
> will not be able to parse the new schema. 
> The options at the moment are the following:
> 1. Do nothing. Users can try the new version and keep 
> `inter.broker.protocol.version` locked to the old release. Downgrade will 
> still be possible, but users will not be able to test new capabilities which 
> depend on inter-broker protocol changes.
> 2. Instead of using `inter.broker.protocol.version`, we could use 
> `message.format.version`. This would basically extend the use of this config 
> to apply to all persistent formats. The advantage is that it allows users to 
> upgrade the broker and begin using the new inter-broker protocol while still 
> allowing downgrade. But features which depend on the persistent format could 
> not be tested.
> Any other options?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7481) Consider options for safer upgrade of offset commit value schema

2018-10-24 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7481:

Priority: Critical  (was: Blocker)

> Consider options for safer upgrade of offset commit value schema
> 
>
> Key: KAFKA-7481
> URL: https://issues.apache.org/jira/browse/KAFKA-7481
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Critical
> Fix For: 2.1.0
>
>
> KIP-211 and KIP-320 add new versions of the offset commit value schema. The 
> use of the new schema version is controlled by the 
> `inter.broker.protocol.version` configuration.  Once the new inter-broker 
> version is in use, it is not possible to downgrade since the older brokers 
> will not be able to parse the new schema. 
> The options at the moment are the following:
> 1. Do nothing. Users can try the new version and keep 
> `inter.broker.protocol.version` locked to the old release. Downgrade will 
> still be possible, but users will not be able to test new capabilities which 
> depend on inter-broker protocol changes.
> 2. Instead of using `inter.broker.protocol.version`, we could use 
> `message.format.version`. This would basically extend the use of this config 
> to apply to all persistent formats. The advantage is that it allows users to 
> upgrade the broker and begin using the new inter-broker protocol while still 
> allowing downgrade. But features which depend on the persistent format could 
> not be tested.
> Any other options?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7481) Consider options for safer upgrade of offset commit value schema

2018-10-24 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662555#comment-16662555
 ] 

Dong Lin commented on KAFKA-7481:
-

Thanks [~ewencp] for the suggestion. I agree. Will do this in the future.

> Consider options for safer upgrade of offset commit value schema
> 
>
> Key: KAFKA-7481
> URL: https://issues.apache.org/jira/browse/KAFKA-7481
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Critical
> Fix For: 2.1.0
>
>
> KIP-211 and KIP-320 add new versions of the offset commit value schema. The 
> use of the new schema version is controlled by the 
> `inter.broker.protocol.version` configuration.  Once the new inter-broker 
> version is in use, it is not possible to downgrade since the older brokers 
> will not be able to parse the new schema. 
> The options at the moment are the following:
> 1. Do nothing. Users can try the new version and keep 
> `inter.broker.protocol.version` locked to the old release. Downgrade will 
> still be possible, but users will not be able to test new capabilities which 
> depend on inter-broker protocol changes.
> 2. Instead of using `inter.broker.protocol.version`, we could use 
> `message.format.version`. This would basically extend the use of this config 
> to apply to all persistent formats. The advantage is that it allows users to 
> upgrade the broker and begin using the new inter-broker protocol while still 
> allowing downgrade. But features which depend on the persistent format could 
> not be tested.
> Any other options?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7481) Consider options for safer upgrade of offset commit value schema

2018-10-24 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7481:

Priority: Blocker  (was: Critical)

> Consider options for safer upgrade of offset commit value schema
> 
>
> Key: KAFKA-7481
> URL: https://issues.apache.org/jira/browse/KAFKA-7481
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Blocker
> Fix For: 2.1.0
>
>
> KIP-211 and KIP-320 add new versions of the offset commit value schema. The 
> use of the new schema version is controlled by the 
> `inter.broker.protocol.version` configuration.  Once the new inter-broker 
> version is in use, it is not possible to downgrade since the older brokers 
> will not be able to parse the new schema. 
> The options at the moment are the following:
> 1. Do nothing. Users can try the new version and keep 
> `inter.broker.protocol.version` locked to the old release. Downgrade will 
> still be possible, but users will not be able to test new capabilities which 
> depend on inter-broker protocol changes.
> 2. Instead of using `inter.broker.protocol.version`, we could use 
> `message.format.version`. This would basically extend the use of this config 
> to apply to all persistent formats. The advantage is that it allows users to 
> upgrade the broker and begin using the new inter-broker protocol while still 
> allowing downgrade. But features which depend on the persistent format could 
> not be tested.
> Any other options?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7481) Consider options for safer upgrade of offset commit value schema

2018-10-24 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662610#comment-16662610
 ] 

Dong Lin commented on KAFKA-7481:
-

After reading through the discussion and thinking through the problem more, I 
have the following alternative solution.

 

Basically we have the following three categories of features:

1) Features that require only the protocol change. These features can be rolled 
back.

2) Features that require both the protocol and disk format change without 
having a performance impact. These features cannot be rolled back.

3) Features that require both the protocol and disk format change and have a 
performance impact. These features cannot be rolled back.

 

 

My proposed solution is this:

1) For features that require only the protocol change, let the broker 
automatically detect the protocol version (e.g. KIP-35) as the lower of the two 
brokers' versions in a given connection, instead of controlling the version 
explicitly using inter.broker.protocol.version.

2) For features that require both the protocol and disk format change without 
having a performance impact, the version can be specified explicitly using the 
existing inter.broker.protocol.version config. And we tell users that this 
config cannot be rolled back after it is bumped.

3) For features that require both the protocol and disk format change and have 
a performance impact, the version will be specified explicitly using 
message.format.version as we are currently doing. There is no change in this 
category.

This solution does not increase the config or testing surface area, which meets 
Ismael's goal. It also minimizes the cases in which we do not allow broker 
version downgrade, which meets Jason's and Ewen's goal. One additional benefit, 
as Ewen mentioned with KIP-35, is that for features that require only the 
protocol change, which is the common case, users only need one rolling bounce 
to pick up the feature.
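To make the knobs concrete, a minimal server.properties sketch (illustrative 
only, not part of the comment; note that message.format.version is the 
topic-level name, and the broker-level config is log.message.format.version): 
an operator who wants to keep downgrade possible pins both versioned configs to 
the old release while upgrading binaries, and only bumps them later.
{noformat}
# During a rolling upgrade of the broker binaries (e.g. to 2.1.0), keep both
# versioned configs at the old values so a downgrade remains possible:
inter.broker.protocol.version=2.0
log.message.format.version=2.0

# Later, bump the protocol version once downgrade is no longer needed
# (category 2 above), and bump the message format only after clients have been
# upgraded (category 3 above):
# inter.broker.protocol.version=2.1
# log.message.format.version=2.1
{noformat}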

 

Does this sound good?

 

> Consider options for safer upgrade of offset commit value schema
> 
>
> Key: KAFKA-7481
> URL: https://issues.apache.org/jira/browse/KAFKA-7481
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Blocker
> Fix For: 2.1.0
>
>
> KIP-211 and KIP-320 add new versions of the offset commit value schema. The 
> use of the new schema version is controlled by the 
> `inter.broker.protocol.version` configuration.  Once the new inter-broker 
> version is in use, it is not possible to downgrade since the older brokers 
> will not be able to parse the new schema. 
> The options at the moment are the following:
> 1. Do nothing. Users can try the new version and keep 
> `inter.broker.protocol.version` locked to the old release. Downgrade will 
> still be possible, but users will not be able to test new capabilities which 
> depend on inter-broker protocol changes.
> 2. Instead of using `inter.broker.protocol.version`, we could use 
> `message.format.version`. This would basically extend the use of this config 
> to apply to all persistent formats. The advantage is that it allows users to 
> upgrade the broker and begin using the new inter-broker protocol while still 
> allowing downgrade. But features which depend on the persistent format could 
> not be tested.
> Any other options?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7481) Consider options for safer upgrade of offset commit value schema

2018-10-24 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662619#comment-16662619
 ] 

Dong Lin commented on KAFKA-7481:
-

BTW, if we think this is the long-term solution, then this issue is no longer a 
blocker for the 2.1.0 release, because we will tell users that both 
inter.broker.protocol.version and message.format.version involve a disk format 
change and thus cannot be rolled back.

> Consider options for safer upgrade of offset commit value schema
> 
>
> Key: KAFKA-7481
> URL: https://issues.apache.org/jira/browse/KAFKA-7481
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Blocker
> Fix For: 2.1.0
>
>
> KIP-211 and KIP-320 add new versions of the offset commit value schema. The 
> use of the new schema version is controlled by the 
> `inter.broker.protocol.version` configuration.  Once the new inter-broker 
> version is in use, it is not possible to downgrade since the older brokers 
> will not be able to parse the new schema. 
> The options at the moment are the following:
> 1. Do nothing. Users can try the new version and keep 
> `inter.broker.protocol.version` locked to the old release. Downgrade will 
> still be possible, but users will not be able to test new capabilities which 
> depend on inter-broker protocol changes.
> 2. Instead of using `inter.broker.protocol.version`, we could use 
> `message.format.version`. This would basically extend the use of this config 
> to apply to all persistent formats. The advantage is that it allows users to 
> upgrade the broker and begin using the new inter-broker protocol while still 
> allowing downgrade. But features which depend on the persistent format could 
> not be tested.
> Any other options?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7535) KafkaConsumer doesn't report records-lag if isolation.level is read_committed

2018-10-24 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7535:

Fix Version/s: 2.1.0

> KafkaConsumer doesn't report records-lag if isolation.level is read_committed
> -
>
> Key: KAFKA-7535
> URL: https://issues.apache.org/jira/browse/KAFKA-7535
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.0.0
>Reporter: Alexey Vakhrenev
>Assignee: lambdaliu
>Priority: Major
>  Labels: regression
> Fix For: 2.1.0, 2.1.1, 2.0.2
>
>
> Starting from 2.0.0, {{KafkaConsumer}} doesn't report {{records-lag}} if 
> {{isolation.level}} is {{read_committed}}. The last version where it works 
> is {{1.1.1}}.
> The issue can be easily reproduced in {{kafka.api.PlaintextConsumerTest}} by 
> adding {{consumerConfig.setProperty("isolation.level", "read_committed")}} 
> within the related tests:
>  - {{testPerPartitionLagMetricsCleanUpWithAssign}}
>  - {{testPerPartitionLagMetricsCleanUpWithSubscribe}}
>  - {{testPerPartitionLagWithMaxPollRecords}}
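For reference, a minimal sketch of the same read_committed setting on a plain 
Java consumer (illustrative only, not taken from the test); these are the 
conditions under which the per-partition records-lag metrics were reported as 
missing:
{code:java}
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch: a consumer configured with isolation.level=read_committed.
public class ReadCommittedConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "lag-metric-check");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("test"));
            consumer.poll(Duration.ofSeconds(1));
            // The per-partition records-lag metrics can then be inspected via consumer.metrics().
        }
    }
}
{code}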



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7535) KafkaConsumer doesn't report records-lag if isolation.level is read_committed

2018-10-24 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662916#comment-16662916
 ] 

Dong Lin commented on KAFKA-7535:
-

[~ijuma] Thanks for the notice. Yeah I would like to have this issue fixed in 
2.1.0.

> KafkaConsumer doesn't report records-lag if isolation.level is read_committed
> -
>
> Key: KAFKA-7535
> URL: https://issues.apache.org/jira/browse/KAFKA-7535
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.0.0
>Reporter: Alexey Vakhrenev
>Assignee: lambdaliu
>Priority: Major
>  Labels: regression
> Fix For: 2.1.0, 2.1.1, 2.0.2
>
>
> Starting from 2.0.0, {{KafkaConsumer}} doesn't report {{records-lag}} if 
> {{isolation.level}} is {{read_committed}}. The last version where it works 
> is {{1.1.1}}.
> The issue can be easily reproduced in {{kafka.api.PlaintextConsumerTest}} by 
> adding {{consumerConfig.setProperty("isolation.level", "read_committed")}} 
> within the related tests:
>  - {{testPerPartitionLagMetricsCleanUpWithAssign}}
>  - {{testPerPartitionLagMetricsCleanUpWithSubscribe}}
>  - {{testPerPartitionLagWithMaxPollRecords}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7313) StopReplicaRequest should attempt to remove future replica for the partition only if future replica exists

2018-10-24 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7313:

Fix Version/s: 2.1.1

> StopReplicaRequest should attempt to remove future replica for the partition 
> only if future replica exists
> --
>
> Key: KAFKA-7313
> URL: https://issues.apache.org/jira/browse/KAFKA-7313
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dong Lin
>Assignee: Dong Lin
>Priority: Major
> Fix For: 2.1.1
>
>
> This patch fixes two issues:
> 1) Currently if a broker receives a StopReplicaRequest with delete=true twice 
> for the same offline replica, the first StopReplicaRequest will show 
> KafkaStorageException and the second StopReplicaRequest will show 
> ReplicaNotAvailableException. This is because the first StopReplicaRequest 
> will remove the mapping (tp -> ReplicaManager.OfflinePartition) from 
> ReplicaManager.allPartitions before returning KafkaStorageException, so the 
> second StopReplicaRequest will not find this partition as offline.
> This result appears to be inconsistent. And since the replica is already 
> offline and the broker will not be able to delete files for this replica, the 
> StopReplicaRequest should fail without making any change, and the broker 
> should still remember that this replica is offline. 
> 2) Currently if the broker receives a StopReplicaRequest with delete=true, the 
> broker will attempt to remove the future replica for the partition, which will 
> cause KafkaStorageException in the StopReplicaResponse if this replica does 
> not have a future replica. It is problematic to always return 
> KafkaStorageException in the response if the future replica does not exist.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7481) Consider options for safer upgrade of offset commit value schema

2018-10-24 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663274#comment-16663274
 ] 

Dong Lin commented on KAFKA-7481:
-

Hey [~ewencp], regarding the change for the 2.1.0 release, assuming we don't 
make further code changes for this issue, I agree it is good to clarify in the 
upgrade note that inter.broker.protocol.version cannot be reverted after it is 
bumped to 2.1.0.

Regarding the short-term solution, I also prefer not to make a big code change, 
e.g. using the KIP-35 idea, to solve the issue here. I would prefer to just 
clarify in the upgrade note that inter.broker.protocol.version cannot be 
reverted after it is bumped to 2.1.0. Also, since we currently do not mention 
anything about downgrade in the upgrade note, and the other config 
log.message.format.version cannot be downgraded, I am not sure users actually 
expect to be able to downgrade inter.broker.protocol.version. So I feel this 
short-term solution is OK and, strictly speaking, it does not break any 
semantic guarantee.

Regarding the long-term solution, it seems that we actually want users to 
manually manage the protocol version config in order to pick up any new feature 
that changes the data format on disk. Otherwise, if we always make things work 
with one rolling bounce, then whenever there is a feature that changes the data 
format on disk, we would have to bump the major version of the next Kafka 
release to indicate that it cannot be downgraded, which delays acceptance of 
the release. Also, if we automatically bump message.format.version for the new 
broker version, broker performance would degrade significantly, because most 
users in an organization won't have had time to upgrade their client library 
version yet.

 

> Consider options for safer upgrade of offset commit value schema
> 
>
> Key: KAFKA-7481
> URL: https://issues.apache.org/jira/browse/KAFKA-7481
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Blocker
> Fix For: 2.1.0
>
>
> KIP-211 and KIP-320 add new versions of the offset commit value schema. The 
> use of the new schema version is controlled by the 
> `inter.broker.protocol.version` configuration.  Once the new inter-broker 
> version is in use, it is not possible to downgrade since the older brokers 
> will not be able to parse the new schema. 
> The options at the moment are the following:
> 1. Do nothing. Users can try the new version and keep 
> `inter.broker.protocol.version` locked to the old release. Downgrade will 
> still be possible, but users will not be able to test new capabilities which 
> depend on inter-broker protocol changes.
> 2. Instead of using `inter.broker.protocol.version`, we could use 
> `message.format.version`. This would basically extend the use of this config 
> to apply to all persistent formats. The advantage is that it allows users to 
> upgrade the broker and begin using the new inter-broker protocol while still 
> allowing downgrade. But features which depend on the persistent format could 
> not be tested.
> Any other options?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-7560) Client Quota - system test failure

2018-10-29 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin reassigned KAFKA-7560:
---

Assignee: Dong Lin

> Client Quota - system test failure
> --
>
> Key: KAFKA-7560
> URL: https://issues.apache.org/jira/browse/KAFKA-7560
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Stanislav Kozlovski
>Assignee: Dong Lin
>Priority: Major
>
> The test under `kafkatest.tests.client.quota_test.QuotaTest.test_quota` fails 
> when I run it locally. It produces the following error message:
> {code:java}
>  File "/opt/kafka-dev/tests/kafkatest/tests/client/quota_test.py", line 196, 
> in validate     metric.value for k, metrics in 
> producer.metrics(group='producer-metrics', name='outgoing-byte-rate', 
> client_id=producer.client_id) for metric in metrics ValueError: max() arg is 
> an empty sequence
> {code}
> I assume it cannot find the metric it's searching for



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-7403) Offset commit failure after upgrading brokers past KIP-211/KAFKA-4682

2018-10-29 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin reassigned KAFKA-7403:
---

Assignee: Vahid Hashemian

> Offset commit failure after upgrading brokers past KIP-211/KAFKA-4682
> -
>
> Key: KAFKA-7403
> URL: https://issues.apache.org/jira/browse/KAFKA-7403
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Reporter: Jon Lee
>Assignee: Vahid Hashemian
>Priority: Blocker
> Fix For: 2.1.0
>
>
> I am currently trying broker upgrade from 0.11 to 2.0 with some patches 
> including KIP-211/KAFKA-4682. After the upgrade, however, applications with 
> 0.10.2 Kafka clients failed with the following error:
> {code:java}
> 2018/09/11 19:34:52.814 ERROR Failed to commit offsets. Exiting. 
> org.apache.kafka.common.KafkaException: Unexpected error in commit: The 
> server experienced an unexpected error when processing the request at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:784)
>  ~[kafka-clients-0.10.2.86.jar:?] at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:722)
>  ~[kafka-clients-0.10.2.86.jar:?] at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:784)
>  ~[kafka-clients-0.10.2.86.jar:?] at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:765)
>  ~[kafka-clients-0.10.2.86.jar:?] at 
> org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:186)
>  ~[kafka-clients-0.10.2.86.jar:?] at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:149)
>  ~[kafka-clients-0.10.2.86.jar:?] at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:116)
>  ~[kafka-clients-0.10.2.86.jar:?] at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:493)
>  ~[kafka-clients-0.10.2.86.jar:?] at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:322)
>  ~[kafka-clients-0.10.2.86.jar:?] at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:253)
>  ~[kafka-clients-0.10.2.86.jar:?]
> {code}
> From my reading of the code, it looks like the following happened:
>  # The 0.10.2 client sends a v2 OffsetCommitRequest to the broker. It sets 
> the retentionTime field of the OffsetCommitRequest to DEFAULT_RETENTION_TIME.
>  # In the 2.0 broker code, upon receiving an OffsetCommitRequest with 
> DEFAULT_RETENTION_TIME, KafkaApis.handleOffsetCommitRequest() sets the 
> "expireTimestamp" field of OffsetAndMetadata to None.
>  # Later in the code path, GroupMetadataManager.offsetCommitValue() expects 
> OffsetAndMetadata to have a non-empty "expireTimestamp" field if the 
> inter.broker.protocol.version is < KAFKA_2_1_IV0.
>  # However, the inter.broker.protocol.version was set to "1.0" prior to the 
> upgrade, and as a result, the following code in offsetCommitValue() raises an 
> error because expireTimestamp is None:
> {code:java}
> value.set(OFFSET_VALUE_EXPIRE_TIMESTAMP_FIELD_V1, 
> offsetAndMetadata.expireTimestamp.get){code}
>  
> Here is the stack trace for the broker side error
> {code:java}
> java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:347) ~[scala-library-2.11.12.jar:?]
> at scala.None$.get(Option.scala:345) ~[scala-library-2.11.12.jar:?]
> at 
> kafka.coordinator.group.GroupMetadataManager$.offsetCommitValue(GroupMetadataManager.scala:1109)
>  ~[kafka_2.11-2.0.0.10.jar:?]
> at 
> kafka.coordinator.group.GroupMetadataManager$$anonfun$7.apply(GroupMetadataManager.scala:326)
>  ~[kafka_2.11-2.0.0.10.jar:?]
> at 
> kafka.coordinator.group.GroupMetadataManager$$anonfun$7.apply(GroupMetadataManager.scala:324)
>  ~[kafka_2.11-2.0.0.10.jar:?]
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  ~[scala-library-2.11.12.jar:?]
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  ~[scala-library-2.11.12.jar:?]
> at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221) 
> ~[scala-library-2.11.12.jar:?]
> at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428) 
> ~[scala-library-2.11.12.jar:?]
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) 
> ~[scala-library-2.11.12.jar:?]
> at scala.collection.AbstractTraversable.map(Traversable.scala:104) 
> ~[scala-

[jira] [Commented] (KAFKA-7481) Consider options for safer upgrade of offset commit value schema

2018-10-30 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669323#comment-16669323
 ] 

Dong Lin commented on KAFKA-7481:
-

[~ijuma] Regarding `I am still not sure if there's a lot of value in having two 
separate upgrade states`, I think we need at least one separate upgrade state 
for changes that cannot be rolled back, since it seems weird not to be able to 
downgrade when there is only a minor version change in Kafka. The rationale for 
the second separate upgrade state is that there are two categories of changes 
that prevent downgrade: those that change topic schemas and those that change 
the message format. It is common for users to be willing to pick up the first 
category of change very soon, and only pick up the second category much later, 
after the client libraries have been upgraded, to reduce the performance cost 
on the broker.

> Consider options for safer upgrade of offset commit value schema
> 
>
> Key: KAFKA-7481
> URL: https://issues.apache.org/jira/browse/KAFKA-7481
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Blocker
> Fix For: 2.1.0
>
>
> KIP-211 and KIP-320 add new versions of the offset commit value schema. The 
> use of the new schema version is controlled by the 
> `inter.broker.protocol.version` configuration.  Once the new inter-broker 
> version is in use, it is not possible to downgrade since the older brokers 
> will not be able to parse the new schema. 
> The options at the moment are the following:
> 1. Do nothing. Users can try the new version and keep 
> `inter.broker.protocol.version` locked to the old release. Downgrade will 
> still be possible, but users will not be able to test new capabilities which 
> depend on inter-broker protocol changes.
> 2. Instead of using `inter.broker.protocol.version`, we could use 
> `message.format.version`. This would basically extend the use of this config 
> to apply to all persistent formats. The advantage is that it allows users to 
> upgrade the broker and begin using the new inter-broker protocol while still 
> allowing downgrade. But features which depend on the persistent format could 
> not be tested.
> Any other options?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7481) Consider options for safer upgrade of offset commit value schema

2018-10-30 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669391#comment-16669391
 ] 

Dong Lin commented on KAFKA-7481:
-

[~ijuma] Yeah, this works, and this is why I think we need two separate upgrade 
states. Currently there are three possible upgrade states: binary version 
upgraded; binary version + inter.broker.protocol.version upgraded; and binary 
version + inter.broker.protocol.version + message.format.version upgraded. My 
point is that it is reasonable to keep these three states in the long term.

> Consider options for safer upgrade of offset commit value schema
> 
>
> Key: KAFKA-7481
> URL: https://issues.apache.org/jira/browse/KAFKA-7481
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Blocker
> Fix For: 2.1.0
>
>
> KIP-211 and KIP-320 add new versions of the offset commit value schema. The 
> use of the new schema version is controlled by the 
> `inter.broker.protocol.version` configuration.  Once the new inter-broker 
> version is in use, it is not possible to downgrade since the older brokers 
> will not be able to parse the new schema. 
> The options at the moment are the following:
> 1. Do nothing. Users can try the new version and keep 
> `inter.broker.protocol.version` locked to the old release. Downgrade will 
> still be possible, but users will not be able to test new capabilities which 
> depend on inter-broker protocol changes.
> 2. Instead of using `inter.broker.protocol.version`, we could use 
> `message.format.version`. This would basically extend the use of this config 
> to apply to all persistent formats. The advantage is that it allows users to 
> upgrade the broker and begin using the new inter-broker protocol while still 
> allowing downgrade. But features which depend on the persistent format could 
> not be tested.
> Any other options?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7560) Client Quota - system test failure

2018-11-06 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7560:

Priority: Blocker  (was: Major)

> Client Quota - system test failure
> --
>
> Key: KAFKA-7560
> URL: https://issues.apache.org/jira/browse/KAFKA-7560
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Stanislav Kozlovski
>Assignee: Dong Lin
>Priority: Blocker
>
> The test under `kafkatest.tests.client.quota_test.QuotaTest.test_quota` fails 
> when I run it locally. It produces the following error message:
> {code:java}
>  File "/opt/kafka-dev/tests/kafkatest/tests/client/quota_test.py", line 196, 
> in validate     metric.value for k, metrics in 
> producer.metrics(group='producer-metrics', name='outgoing-byte-rate', 
> client_id=producer.client_id) for metric in metrics ValueError: max() arg is 
> an empty sequence
> {code}
> I assume it cannot find the metric it's searching for



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7560) Client Quota - system test failure

2018-11-06 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676512#comment-16676512
 ] 

Dong Lin commented on KAFKA-7560:
-

Initially I thought the test failure was related to the quota logic in this 
test and thus that I would be one of the best people to debug it. Now it seems 
that the test failed because the test suite is not able to read metrics from 
the producer using the solution developed in 
[https://github.com/apache/kafka/pull/4072]. More specifically, the log message 
shows that 5 messages are successfully produced and consumed, but do_POST in 
http.py is never called, and thus we get the exception shown in the Jira 
description. 

[~ewencp] [~apurva] do you have time to take a look, since you are probably 
more familiar with the HTTP-based approach of sending metrics here? I will also 
try to debug further.

> Client Quota - system test failure
> --
>
> Key: KAFKA-7560
> URL: https://issues.apache.org/jira/browse/KAFKA-7560
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Stanislav Kozlovski
>Assignee: Dong Lin
>Priority: Blocker
>
> The test under `kafkatest.tests.client.quota_test.QuotaTest.test_quota` fails 
> when I run it locally. It produces the following error message:
> {code:java}
>  File "/opt/kafka-dev/tests/kafkatest/tests/client/quota_test.py", line 196, 
> in validate     metric.value for k, metrics in 
> producer.metrics(group='producer-metrics', name='outgoing-byte-rate', 
> client_id=producer.client_id) for metric in metrics ValueError: max() arg is 
> an empty sequence
> {code}
> I assume it cannot find the metric it's searching for



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7560) Client Quota - system test failure

2018-11-06 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677126#comment-16677126
 ] 

Dong Lin commented on KAFKA-7560:
-

Never mind. I just found the issue.

> Client Quota - system test failure
> --
>
> Key: KAFKA-7560
> URL: https://issues.apache.org/jira/browse/KAFKA-7560
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Stanislav Kozlovski
>Assignee: Dong Lin
>Priority: Blocker
>
> The test under `kafkatest.tests.client.quota_test.QuotaTest.test_quota` fails 
> when I run it locally. It produces the following error message:
> {code:java}
>  File "/opt/kafka-dev/tests/kafkatest/tests/client/quota_test.py", line 196, 
> in validate     metric.value for k, metrics in 
> producer.metrics(group='producer-metrics', name='outgoing-byte-rate', 
> client_id=producer.client_id) for metric in metrics ValueError: max() arg is 
> an empty sequence
> {code}
> I assume it cannot find the metric it's searching for



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7560) PushHttpMetricsReporter should not convert metric value to double

2018-11-06 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7560:

Description: 
Currently metricValue

 

The test under `kafkatest.tests.client.quota_test.QuotaTest.test_quota` fails 
when I run it locally. It produces the following error message:
{code:java}
 File "/opt/kafka-dev/tests/kafkatest/tests/client/quota_test.py", line 196, in 
validate     metric.value for k, metrics in 
producer.metrics(group='producer-metrics', name='outgoing-byte-rate', 
client_id=producer.client_id) for metric in metrics ValueError: max() arg is an 
empty sequence
{code}
I assume it cannot find the metric it's searching for

  was:
The test under `kafkatest.tests.client.quota_test.QuotaTest.test_quota` fails 
when I run it locally. It produces the following error message:


{code:java}
 File "/opt/kafka-dev/tests/kafkatest/tests/client/quota_test.py", line 196, in 
validate     metric.value for k, metrics in 
producer.metrics(group='producer-metrics', name='outgoing-byte-rate', 
client_id=producer.client_id) for metric in metrics ValueError: max() arg is an 
empty sequence
{code}
I assume it cannot find the metric it's searching for


> PushHttpMetricsReporter should not convert metric value to double
> -
>
> Key: KAFKA-7560
> URL: https://issues.apache.org/jira/browse/KAFKA-7560
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Stanislav Kozlovski
>Assignee: Dong Lin
>Priority: Blocker
>
> Currently metricValue
>  
> The test under `kafkatest.tests.client.quota_test.QuotaTest.test_quota` fails 
> when I run it locally. It produces the following error message:
> {code:java}
>  File "/opt/kafka-dev/tests/kafkatest/tests/client/quota_test.py", line 196, 
> in validate     metric.value for k, metrics in 
> producer.metrics(group='producer-metrics', name='outgoing-byte-rate', 
> client_id=producer.client_id) for metric in metrics ValueError: max() arg is 
> an empty sequence
> {code}
> I assume it cannot find the metric it's searching for



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7560) PushHttpMetricsReporter should not convert metric value to double

2018-11-06 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7560:

Summary: PushHttpMetricsReporter should not convert metric value to double  
(was: Client Quota - system test failure)

> PushHttpMetricsReporter should not convert metric value to double
> -
>
> Key: KAFKA-7560
> URL: https://issues.apache.org/jira/browse/KAFKA-7560
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Stanislav Kozlovski
>Assignee: Dong Lin
>Priority: Blocker
>
> The test under `kafkatest.tests.client.quota_test.QuotaTest.test_quota` fails 
> when I run it locally. It produces the following error message:
> {code:java}
>  File "/opt/kafka-dev/tests/kafkatest/tests/client/quota_test.py", line 196, 
> in validate     metric.value for k, metrics in 
> producer.metrics(group='producer-metrics', name='outgoing-byte-rate', 
> client_id=producer.client_id) for metric in metrics ValueError: max() arg is 
> an empty sequence
> {code}
> I assume it cannot find the metric it's searching for



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7560) PushHttpMetricsReporter should not convert metric value to double

2018-11-06 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7560:

Description: 
Currently PushHttpMetricsReporter converts the value from 
KafkaMetric.metricValue() to double. This does not work for non-numerical 
metrics such as the version metric in AppInfoParser, whose value can be a 
String. This has caused an issue for PushHttpMetricsReporter, which in turn 
caused the system test kafkatest.tests.client.quota_test.QuotaTest.test_quota 
to fail with the following exception:  
{code:java}
 File "/opt/kafka-dev/tests/kafkatest/tests/client/quota_test.py", line 196, in 
validate     metric.value for k, metrics in 
producer.metrics(group='producer-metrics', name='outgoing-byte-rate', 
client_id=producer.client_id) for metric in metrics ValueError: max() arg is an 
empty sequence
{code}
Since we allow the metric value to be an Object, PushHttpMetricsReporter should 
also read the metric value as an Object and pass it to the HTTP server.
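
A minimal sketch of the intended behaviour (not the actual 
PushHttpMetricsReporter code; the class and method names below are 
illustrative only): read the value via KafkaMetric.metricValue(), which returns 
an Object, and only treat it as numeric when it really is a number.
{code:java}
import org.apache.kafka.common.metrics.KafkaMetric;

final class MetricValueSketch {
    // Illustrative only: report the raw Object value, so String metrics (e.g. the
    // "version" metric registered by AppInfoParser) survive, while numeric metrics
    // such as outgoing-byte-rate are still reported as numbers.
    static Object reportedValue(KafkaMetric metric) {
        Object value = metric.metricValue();  // returns Object, not double
        return (value instanceof Number) ? value : String.valueOf(value);
    }
}
{code}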

  was:
Currently metricValue

 

The test under `kafkatest.tests.client.quota_test.QuotaTest.test_quota` fails 
when I run it locally. It produces the following error message:
{code:java}
 File "/opt/kafka-dev/tests/kafkatest/tests/client/quota_test.py", line 196, in 
validate     metric.value for k, metrics in 
producer.metrics(group='producer-metrics', name='outgoing-byte-rate', 
client_id=producer.client_id) for metric in metrics ValueError: max() arg is an 
empty sequence
{code}
I assume it cannot find the metric it's searching for


> PushHttpMetricsReporter should not convert metric value to double
> -
>
> Key: KAFKA-7560
> URL: https://issues.apache.org/jira/browse/KAFKA-7560
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Stanislav Kozlovski
>Assignee: Dong Lin
>Priority: Blocker
>
> Currently PushHttpMetricsReporter converts the value from 
> KafkaMetric.metricValue() to double. This does not work for non-numerical 
> metrics such as the version metric in AppInfoParser, whose value can be a 
> String. This has caused an issue for PushHttpMetricsReporter, which in turn 
> caused the system test kafkatest.tests.client.quota_test.QuotaTest.test_quota 
> to fail with the following exception:  
> {code:java}
>  File "/opt/kafka-dev/tests/kafkatest/tests/client/quota_test.py", line 196, 
> in validate     metric.value for k, metrics in 
> producer.metrics(group='producer-metrics', name='outgoing-byte-rate', 
> client_id=producer.client_id) for metric in metrics ValueError: max() arg is 
> an empty sequence
> {code}
> Since we allow the metric value to be an Object, PushHttpMetricsReporter 
> should also read the metric value as an Object and pass it to the HTTP server.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7559) ConnectStandaloneFileTest system tests do not pass

2018-11-06 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7559:

Fix Version/s: 2.1.0
   2.0.1

> ConnectStandaloneFileTest system tests do not pass
> --
>
> Key: KAFKA-7559
> URL: https://issues.apache.org/jira/browse/KAFKA-7559
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 2.1.0
>Reporter: Stanislav Kozlovski
>Assignee: Randall Hauch
>Priority: Major
> Fix For: 2.0.1, 2.1.0
>
>
> Both tests `test_skip_and_log_to_dlq` and `test_file_source_and_sink` under 
> `kafkatest.tests.connect.connect_test.ConnectStandaloneFileTest` fail with 
> error messages similar to:
> "TimeoutError: Kafka Connect failed to start on node: ducker@ducker04 in 
> condition mode: LISTEN"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-7559) ConnectStandaloneFileTest system tests do not pass

2018-11-06 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin resolved KAFKA-7559.
-
Resolution: Fixed

> ConnectStandaloneFileTest system tests do not pass
> --
>
> Key: KAFKA-7559
> URL: https://issues.apache.org/jira/browse/KAFKA-7559
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 2.1.0
>Reporter: Stanislav Kozlovski
>Assignee: Randall Hauch
>Priority: Major
> Fix For: 2.0.1, 2.1.0
>
>
> Both tests `test_skip_and_log_to_dlq` and `test_file_source_and_sink` under 
> `kafkatest.tests.connect.connect_test.ConnectStandaloneFileTest` fail with 
> error messages similar to:
> "TimeoutError: Kafka Connect failed to start on node: ducker@ducker04 in 
> condition mode: LISTEN"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7313) StopReplicaRequest should attempt to remove future replica for the partition only if future replica exists

2018-11-06 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7313:

Fix Version/s: (was: 2.1.1)
   2.1.0
   2.0.1

> StopReplicaRequest should attempt to remove future replica for the partition 
> only if future replica exists
> --
>
> Key: KAFKA-7313
> URL: https://issues.apache.org/jira/browse/KAFKA-7313
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dong Lin
>Assignee: Dong Lin
>Priority: Major
> Fix For: 2.0.1, 2.1.0
>
>
> This patch fixes two issues:
> 1) Currently if a broker receives a StopReplicaRequest with delete=true for 
> the same offline replica, the first StopReplicaRequest will show 
> KafkaStorageException and the second StopReplicaRequest will show 
> ReplicaNotAvailableException. This is because the first StopReplicaRequest 
> will remove the mapping (tp -> ReplicaManager.OfflinePartition) from 
> ReplicaManager.allPartitions before returning KafkaStorageException, thus the 
> second StopReplicaRequest will not find this partition as offline.
> This result appears to be inconsistent. Since the replica is already offline 
> and the broker will not be able to delete files for this replica, the 
> StopReplicaRequest should fail without making any change, and the broker 
> should still remember that this replica is offline. 
> 2) Currently if the broker receives a StopReplicaRequest with delete=true, 
> the broker will attempt to remove the future replica for the partition, which 
> will cause KafkaStorageException in the StopReplicaResponse if this partition 
> does not have a future replica. It is problematic to always return 
> KafkaStorageException in the response when the future replica does not exist.
>  
>  
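
The fix for point 2 amounts to an existence check before touching the future 
replica. A hedged sketch (illustrative only; the real change lives in the 
broker's ReplicaManager, and the helper methods below are hypothetical 
placeholders):
{code:java}
import org.apache.kafka.common.TopicPartition;

final class StopReplicaSketch {
    // Only attempt to remove the future replica when one exists, instead of
    // unconditionally failing the StopReplicaResponse with KafkaStorageException.
    void maybeRemoveFutureReplica(TopicPartition tp) {
        if (futureLogExists(tp)) {
            removeFutureLog(tp);
        }
        // No future replica means there is nothing to do; this is not an error.
    }

    private boolean futureLogExists(TopicPartition tp) { return false; }  // placeholder
    private void removeFutureLog(TopicPartition tp) { }                   // placeholder
}
{code}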



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-7481) Consider options for safer upgrade of offset commit value schema

2018-11-06 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin reassigned KAFKA-7481:
---

Assignee: Jason Gustafson

> Consider options for safer upgrade of offset commit value schema
> 
>
> Key: KAFKA-7481
> URL: https://issues.apache.org/jira/browse/KAFKA-7481
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Blocker
> Fix For: 2.1.0
>
>
> KIP-211 and KIP-320 add new versions of the offset commit value schema. The 
> use of the new schema version is controlled by the 
> `inter.broker.protocol.version` configuration.  Once the new inter-broker 
> version is in use, it is not possible to downgrade since the older brokers 
> will not be able to parse the new schema. 
> The options at the moment are the following:
> 1. Do nothing. Users can try the new version and keep 
> `inter.broker.protocol.version` locked to the old release. Downgrade will 
> still be possible, but users will not be able to test new capabilities which 
> depend on inter-broker protocol changes.
> 2. Instead of using `inter.broker.protocol.version`, we could use 
> `message.format.version`. This would basically extend the use of this config 
> to apply to all persistent formats. The advantage is that it allows users to 
> upgrade the broker and begin using the new inter-broker protocol while still 
> allowing downgrade. But features which depend on the persistent format could 
> not be tested.
> Any other options?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (KAFKA-7481) Consider options for safer upgrade of offset commit value schema

2018-11-06 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin reopened KAFKA-7481:
-

> Consider options for safer upgrade of offset commit value schema
> 
>
> Key: KAFKA-7481
> URL: https://issues.apache.org/jira/browse/KAFKA-7481
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Blocker
> Fix For: 2.1.0
>
>
> KIP-211 and KIP-320 add new versions of the offset commit value schema. The 
> use of the new schema version is controlled by the 
> `inter.broker.protocol.version` configuration.  Once the new inter-broker 
> version is in use, it is not possible to downgrade since the older brokers 
> will not be able to parse the new schema. 
> The options at the moment are the following:
> 1. Do nothing. Users can try the new version and keep 
> `inter.broker.protocol.version` locked to the old release. Downgrade will 
> still be possible, but users will not be able to test new capabilities which 
> depend on inter-broker protocol changes.
> 2. Instead of using `inter.broker.protocol.version`, we could use 
> `message.format.version`. This would basically extend the use of this config 
> to apply to all persistent formats. The advantage is that it allows users to 
> upgrade the broker and begin using the new inter-broker protocol while still 
> allowing downgrade. But features which depend on the persistent format could 
> not be tested.
> Any other options?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-7481) Consider options for safer upgrade of offset commit value schema

2018-11-06 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin resolved KAFKA-7481.
-
Resolution: Fixed

> Consider options for safer upgrade of offset commit value schema
> 
>
> Key: KAFKA-7481
> URL: https://issues.apache.org/jira/browse/KAFKA-7481
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Blocker
> Fix For: 2.1.0
>
>
> KIP-211 and KIP-320 add new versions of the offset commit value schema. The 
> use of the new schema version is controlled by the 
> `inter.broker.protocol.version` configuration.  Once the new inter-broker 
> version is in use, it is not possible to downgrade since the older brokers 
> will not be able to parse the new schema. 
> The options at the moment are the following:
> 1. Do nothing. Users can try the new version and keep 
> `inter.broker.protocol.version` locked to the old release. Downgrade will 
> still be possible, but users will not be able to test new capabilities which 
> depend on inter-broker protocol changes.
> 2. Instead of using `inter.broker.protocol.version`, we could use 
> `message.format.version`. This would basically extend the use of this config 
> to apply to all persistent formats. The advantage is that it allows users to 
> upgrade the broker and begin using the new inter-broker protocol while still 
> allowing downgrade. But features which depend on the persistent format could 
> not be tested.
> Any other options?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7481) Consider options for safer upgrade of offset commit value schema

2018-11-06 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7481:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Consider options for safer upgrade of offset commit value schema
> 
>
> Key: KAFKA-7481
> URL: https://issues.apache.org/jira/browse/KAFKA-7481
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Blocker
> Fix For: 2.2.0
>
>
> KIP-211 and KIP-320 add new versions of the offset commit value schema. The 
> use of the new schema version is controlled by the 
> `inter.broker.protocol.version` configuration.  Once the new inter-broker 
> version is in use, it is not possible to downgrade since the older brokers 
> will not be able to parse the new schema. 
> The options at the moment are the following:
> 1. Do nothing. Users can try the new version and keep 
> `inter.broker.protocol.version` locked to the old release. Downgrade will 
> still be possible, but users will not be able to test new capabilities which 
> depend on inter-broker protocol changes.
> 2. Instead of using `inter.broker.protocol.version`, we could use 
> `message.format.version`. This would basically extend the use of this config 
> to apply to all persistent formats. The advantage is that it allows users to 
> upgrade the broker and begin using the new inter-broker protocol while still 
> allowing downgrade. But features which depend on the persistent format could 
> not be tested.
> Any other options?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-7313) StopReplicaRequest should attempt to remove future replica for the partition only if future replica exists

2018-11-06 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin resolved KAFKA-7313.
-
Resolution: Fixed

> StopReplicaRequest should attempt to remove future replica for the partition 
> only if future replica exists
> --
>
> Key: KAFKA-7313
> URL: https://issues.apache.org/jira/browse/KAFKA-7313
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dong Lin
>Assignee: Dong Lin
>Priority: Major
> Fix For: 2.0.1, 2.1.0
>
>
> This patch fixes two issues:
> 1) Currently if a broker receives a StopReplicaRequest with delete=true for 
> the same offline replica, the first StopReplicaRequest will show 
> KafkaStorageException and the second StopReplicaRequest will show 
> ReplicaNotAvailableException. This is because the first StopReplicaRequest 
> will remove the mapping (tp -> ReplicaManager.OfflinePartition) from 
> ReplicaManager.allPartitions before returning KafkaStorageException, thus the 
> second StopReplicaRequest will not find this partition as offline.
> This result appears to be inconsistent. Since the replica is already offline 
> and the broker will not be able to delete files for this replica, the 
> StopReplicaRequest should fail without making any change, and the broker 
> should still remember that this replica is offline. 
> 2) Currently if the broker receives a StopReplicaRequest with delete=true, 
> the broker will attempt to remove the future replica for the partition, which 
> will cause KafkaStorageException in the StopReplicaResponse if this partition 
> does not have a future replica. It is problematic to always return 
> KafkaStorageException in the response when the future replica does not exist.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7549) Old ProduceRequest with zstd compression does not return error to client

2018-11-07 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678602#comment-16678602
 ] 

Dong Lin commented on KAFKA-7549:
-

I actually think it is reasonable for the client to receive 
InvalidRequestException if it sends ProduceRequest V3 with the zstd codec.

We cannot send UnsupportedCompressionTypeException (the 
UNSUPPORTED_COMPRESSION_TYPE error) because, if the ProduceRequest is V3, there 
is no guarantee that the client library can understand error code 76. The 
meaning of UnsupportedVersionException is currently "The version of API is not 
supported", which does not match the scenario here because ProduceRequest V3 is 
actually supported by the broker. InvalidRequestException is reasonable because 
the issue here is that ProduceRequest V3 is used with the zstd codec, which 
makes the entire request invalid.

[~edenhill] [~dongjin] By saying "client side bug" in the Jira description, is 
this a bug in Apache Kafka or in another custom client library? If it is in 
Apache Kafka, is there a Jira that tracks this issue?
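
For illustration, a hedged sketch of the broker-side check being discussed 
(not the broker's actual validation code; the class and method names are made 
up, and the version-7 cut-off reflects the produce API bump from KIP-110):
{code:java}
import org.apache.kafka.common.errors.InvalidRequestException;
import org.apache.kafka.common.record.CompressionType;

final class ZstdVersionCheck {
    // Reject zstd outright for produce versions below 7, since such clients may
    // not understand error code 76 (UNSUPPORTED_COMPRESSION_TYPE).
    static void validate(short produceRequestVersion, CompressionType compressionType) {
        if (compressionType == CompressionType.ZSTD && produceRequestVersion < 7) {
            throw new InvalidRequestException("Produce requests with version "
                + produceRequestVersion + " are not allowed to use ZStandard compression");
        }
    }
}
{code}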

 

 

 

> Old ProduceRequest with zstd compression does not return error to client
> 
>
> Key: KAFKA-7549
> URL: https://issues.apache.org/jira/browse/KAFKA-7549
> Project: Kafka
>  Issue Type: Bug
>  Components: compression
>Reporter: Magnus Edenhill
>Assignee: Lee Dongjin
>Priority: Major
> Fix For: 2.1.0
>
>
> Kafka broker v2.1.0rc0.
>  
> KIP-110 states that:
> "Zstd will only be allowed for the bumped produce API. That is, for older 
> version clients(=below KAFKA_2_1_IV0), we return UNSUPPORTED_COMPRESSION_TYPE 
> regardless of the message format."
>  
> However, sending a ProduceRequest V3 with zstd compression (which is a client 
> side bug) closes the connection with the following exception rather than 
> returning UNSUPPORTED_COMPRESSION_TYPE in the ProduceResponse:
>  
> {noformat}
> [2018-10-25 11:40:31,813] ERROR Exception while processing request from 
> 127.0.0.1:60723-127.0.0.1:60656-94 (kafka.network.Processor)
> org.apache.kafka.common.errors.InvalidRequestException: Error getting request 
> for apiKey: PRODUCE, apiVersion: 3, connectionId: 
> 127.0.0.1:60723-127.0.0.1:60656-94, listenerName: ListenerName(PLAINTEXT), 
> principal: User:ANONYMOUS
> Caused by: org.apache.kafka.common.record.InvalidRecordException: Produce 
> requests with version 3 are note allowed to use ZStandard compression
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-7549) Old ProduceRequest with zstd compression does not return error to client

2018-11-07 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678602#comment-16678602
 ] 

Dong Lin edited comment on KAFKA-7549 at 11/7/18 6:25 PM:
--

I actually think it is reasonable for the client to receive 
InvalidRequestException if it sends ProduceRequest V3 with the zstd codec.

We cannot send UnsupportedCompressionTypeException (the 
UNSUPPORTED_COMPRESSION_TYPE error) because, if the ProduceRequest is V3, there 
is no guarantee that the client library can understand error code 76. The 
meaning of UnsupportedVersionException is currently "The version of API is not 
supported", which does not match the scenario here because ProduceRequest V3 is 
actually supported by the broker. InvalidRequestException is reasonable because 
the issue here is that ProduceRequest V3 is used with the zstd codec, which 
makes the entire request invalid.

[~edenhill] [~dongjin] By saying "client side bug" in the Jira description, is 
this a bug in Apache Kafka or in another custom client library? If it is in 
Apache Kafka, is there a Jira that tracks this issue?

[~ijuma] [~hachikuji] Does this sound reasonable?


was (Author: lindong):
I actually think it is reasonable for the client to receive 
InvalidRequestException if it sends ProduceRequest V3 with the zstd codec.

We cannot send UnsupportedCompressionTypeException (the 
UNSUPPORTED_COMPRESSION_TYPE error) because, if the ProduceRequest is V3, there 
is no guarantee that the client library can understand error code 76. The 
meaning of UnsupportedVersionException is currently "The version of API is not 
supported", which does not match the scenario here because ProduceRequest V3 is 
actually supported by the broker. InvalidRequestException is reasonable because 
the issue here is that ProduceRequest V3 is used with the zstd codec, which 
makes the entire request invalid.

[~edenhill] [~dongjin] By saying "client side bug" in the Jira description, is 
this a bug in Apache Kafka or in another custom client library? If it is in 
Apache Kafka, is there a Jira that tracks this issue?

 

 

 

> Old ProduceRequest with zstd compression does not return error to client
> 
>
> Key: KAFKA-7549
> URL: https://issues.apache.org/jira/browse/KAFKA-7549
> Project: Kafka
>  Issue Type: Bug
>  Components: compression
>Reporter: Magnus Edenhill
>Assignee: Lee Dongjin
>Priority: Major
> Fix For: 2.1.0
>
>
> Kafka broker v2.1.0rc0.
>  
> KIP-110 states that:
> "Zstd will only be allowed for the bumped produce API. That is, for older 
> version clients(=below KAFKA_2_1_IV0), we return UNSUPPORTED_COMPRESSION_TYPE 
> regardless of the message format."
>  
> However, sending a ProduceRequest V3 with zstd compression (which is a client 
> side bug) closes the connection with the following exception rather than 
> returning UNSUPPORTED_COMPRESSION_TYPE in the ProduceResponse:
>  
> {noformat}
> [2018-10-25 11:40:31,813] ERROR Exception while processing request from 
> 127.0.0.1:60723-127.0.0.1:60656-94 (kafka.network.Processor)
> org.apache.kafka.common.errors.InvalidRequestException: Error getting request 
> for apiKey: PRODUCE, apiVersion: 3, connectionId: 
> 127.0.0.1:60723-127.0.0.1:60656-94, listenerName: ListenerName(PLAINTEXT), 
> principal: User:ANONYMOUS
> Caused by: org.apache.kafka.common.record.InvalidRecordException: Produce 
> requests with version 3 are note allowed to use ZStandard compression
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7603) Producer should negotiate message format version with broker

2018-11-07 Thread Dong Lin (JIRA)
Dong Lin created KAFKA-7603:
---

 Summary: Producer should negotiate message format version with 
broker
 Key: KAFKA-7603
 URL: https://issues.apache.org/jira/browse/KAFKA-7603
 Project: Kafka
  Issue Type: Improvement
Reporter: Dong Lin


Currently the producer will always send records with the highest magic format 
version that is supported by both the producer and the broker library, 
regardless of the log.message.format.version config in the broker.

This causes unnecessary message down-conversion overhead if 
log.message.format.version has not been upgraded while the producer/broker 
libraries have been. It is preferable for the producer to produce messages with 
a format version no higher than the log.message.format.version configured on 
the broker.
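
A hedged sketch of the idea (no such negotiation exists today; 
brokerMessageFormatMagic below is a hypothetical input that would have to be 
discovered from the broker, e.g. via ApiVersions or metadata):
{code:java}
import org.apache.kafka.common.record.RecordBatch;

final class MagicNegotiationSketch {
    // Cap the record batch magic at whatever the broker's log.message.format.version
    // implies, instead of always using the highest magic the producer supports.
    static byte negotiatedMagic(byte brokerMessageFormatMagic) {
        byte producerMagic = RecordBatch.CURRENT_MAGIC_VALUE;  // what the producer uses today
        // Never produce with a newer format than the broker's log is configured for,
        // so the broker does not have to down-convert on fetch.
        return (byte) Math.min(producerMagic, brokerMessageFormatMagic);
    }
}
{code}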



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-7603) Producer should negotiate message format version with broker

2018-11-07 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin reassigned KAFKA-7603:
---

Assignee: Dong Lin

> Producer should negotiate message format version with broker
> 
>
> Key: KAFKA-7603
> URL: https://issues.apache.org/jira/browse/KAFKA-7603
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dong Lin
>Assignee: Dong Lin
>Priority: Major
>
> Currently the producer will always send records with the highest magic format 
> version that is supported by both the producer and the broker library, 
> regardless of the log.message.format.version config in the broker.
> This causes unnecessary message down-conversion overhead if 
> log.message.format.version has not been upgraded while the producer/broker 
> libraries have been. It is preferable for the producer to produce messages 
> with a format version no higher than the log.message.format.version 
> configured on the broker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-6262) KIP-232: Detect outdated metadata by adding ControllerMetadataEpoch field

2018-11-07 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin resolved KAFKA-6262.
-
Resolution: Duplicate

The design in this Jira has been moved to KIP-320.

> KIP-232: Detect outdated metadata by adding ControllerMetadataEpoch field
> -
>
> Key: KAFKA-6262
> URL: https://issues.apache.org/jira/browse/KAFKA-6262
> Project: Kafka
>  Issue Type: Bug
>Reporter: Dong Lin
>Assignee: Dong Lin
>Priority: Major
>
> Currently the following sequence of events may happen and cause the consumer 
> to rewind back to the earliest offset even if there is no log truncation in 
> Kafka. This can be a problem for MM by forcing MM to lag behind significantly 
> and duplicate a large amount of data.
> - Say there are three brokers 1,2,3 for a given partition P. Broker 1 is the 
> leader. Initially they are all in ISR. HW and LEO are both 10.
> - SRE does a controlled shutdown of broker 1. The controller sends 
> LeaderAndIsrRequest to all three brokers so that leader = broker 2 and 
> isr_set = [broker 2, broker 3].
> - Broker 2 and 3 receive and process the LeaderAndIsrRequest almost 
> instantaneously. Now broker 2 and broker 3 can accept ProduceRequest and 
> FetchRequest for the partition P. 
> However, broker 1 has not processed this LeaderAndIsrRequest due to backlog 
> in its request queue. So broker 1 still thinks it is the leader for the 
> partition P.
> - Because there is leadership movement, a consumer receives 
> NotLeaderForPartitionException, which triggers this consumer to send a 
> MetadataRequest to a randomly selected broker, say broker 2. Broker 2 tells 
> the consumer that it is the leader for partition P. The consumer fetches data 
> of partition P from broker 2. The latest data has offset 20.
> - Later this consumer receives NotLeaderForPartitionException for another 
> partition. It sends a MetadataRequest to a randomly selected broker again. 
> This time it sends the MetadataRequest to broker 1, which tells the consumer 
> that it is the leader for partition P.
> - This consumer issues a FetchRequest for the partition P at offset 21. 
> Broker 1 returns OffsetOutOfRangeException because it thinks the LogEndOffset 
> for this partition is 10.
> There are two possible solutions for this problem. The long-term solution is 
> probably to include a version in the MetadataResponse so that the consumer 
> knows whether the metadata is outdated. This requires a KIP.
> The short-term solution, which should solve the problem in most cases, is to 
> let the consumer keep fetching metadata from the same (initially randomly 
> picked) broker until the connection to this broker is disconnected. The 
> metadata version will not go back in time if the consumer keeps fetching 
> metadata from the same broker.
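
A hedged sketch of that short-term, sticky-broker idea (illustrative only, not 
the actual consumer code; the class and method names are made up):
{code:java}
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;
import org.apache.kafka.common.Node;

final class StickyMetadataTarget {
    private Node current;

    // Keep sending MetadataRequests to the same broker until the connection to it
    // drops, so the metadata observed by the consumer cannot move backwards in time
    // by bouncing between brokers with different views.
    Node nextTarget(List<Node> liveBrokers, Predicate<Node> isConnected) {
        if (current == null || !isConnected.test(current)) {
            // Pick a new broker at random only when we have none or lost the old one.
            current = liveBrokers.get(ThreadLocalRandom.current().nextInt(liveBrokers.size()));
        }
        return current;
    }
}
{code}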



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7603) Producer should negotiate message format version with broker

2018-11-07 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678711#comment-16678711
 ] 

Dong Lin commented on KAFKA-7603:
-

[~ijuma] Yes. I have not thought about how to do this yet. I just wanted to 
create a ticket to document the possible improvement here.

> Producer should negotiate message format version with broker
> 
>
> Key: KAFKA-7603
> URL: https://issues.apache.org/jira/browse/KAFKA-7603
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dong Lin
>Assignee: Dong Lin
>Priority: Major
>
> Currently the producer will always send records with the highest magic format 
> version that is supported by both the producer and the broker library, 
> regardless of the log.message.format.version config in the broker.
> This causes unnecessary message down-conversion overhead if 
> log.message.format.version has not been upgraded while the producer/broker 
> libraries have been. It is preferable for the producer to produce messages 
> with a format version no higher than the log.message.format.version 
> configured on the broker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7549) Old ProduceRequest with zstd compression does not return error to client

2018-11-08 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680509#comment-16680509
 ] 

Dong Lin commented on KAFKA-7549:
-

[~ijuma] [~hachikuji] Given the reasoning provided above, are you OK with 
moving this issue out of the 2.1 release?

> Old ProduceRequest with zstd compression does not return error to client
> 
>
> Key: KAFKA-7549
> URL: https://issues.apache.org/jira/browse/KAFKA-7549
> Project: Kafka
>  Issue Type: Bug
>  Components: compression
>Reporter: Magnus Edenhill
>Assignee: Lee Dongjin
>Priority: Major
> Fix For: 2.1.0
>
>
> Kafka broker v2.1.0rc0.
>  
> KIP-110 states that:
> "Zstd will only be allowed for the bumped produce API. That is, for older 
> version clients(=below KAFKA_2_1_IV0), we return UNSUPPORTED_COMPRESSION_TYPE 
> regardless of the message format."
>  
> However, sending a ProduceRequest V3 with zstd compression (which is a client 
> side bug) closes the connection with the following exception rather than 
> returning UNSUPPORTED_COMPRESSION_TYPE in the ProduceResponse:
>  
> {noformat}
> [2018-10-25 11:40:31,813] ERROR Exception while processing request from 
> 127.0.0.1:60723-127.0.0.1:60656-94 (kafka.network.Processor)
> org.apache.kafka.common.errors.InvalidRequestException: Error getting request 
> for apiKey: PRODUCE, apiVersion: 3, connectionId: 
> 127.0.0.1:60723-127.0.0.1:60656-94, listenerName: ListenerName(PLAINTEXT), 
> principal: User:ANONYMOUS
> Caused by: org.apache.kafka.common.record.InvalidRecordException: Produce 
> requests with version 3 are note allowed to use ZStandard compression
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7486) Flaky test `DeleteTopicTest.testAddPartitionDuringDeleteTopic`

2018-11-16 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690099#comment-16690099
 ] 

Dong Lin commented on KAFKA-7486:
-

Hey [~chia7712], I think [~hachikuji] probably missed your message. I am sure 
[~hachikuji] (and I) would be happy for you to take over this JIRA.

> Flaky test `DeleteTopicTest.testAddPartitionDuringDeleteTopic`
> --
>
> Key: KAFKA-7486
> URL: https://issues.apache.org/jira/browse/KAFKA-7486
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Major
>
> Starting to see more of this recently:
> {code}
> 10:06:28 kafka.admin.DeleteTopicTest > testAddPartitionDuringDeleteTopic 
> FAILED
> 10:06:28 kafka.admin.AdminOperationException: 
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = 
> NoNode for /brokers/topics/test
> 10:06:28 at 
> kafka.zk.AdminZkClient.writeTopicPartitionAssignment(AdminZkClient.scala:162)
> 10:06:28 at 
> kafka.zk.AdminZkClient.createOrUpdateTopicPartitionAssignmentPathInZK(AdminZkClient.scala:102)
> 10:06:28 at 
> kafka.zk.AdminZkClient.addPartitions(AdminZkClient.scala:229)
> 10:06:28 at 
> kafka.admin.DeleteTopicTest.testAddPartitionDuringDeleteTopic(DeleteTopicTest.scala:266)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7645) Fix flaky unit test for 2.1 branch

2018-11-16 Thread Dong Lin (JIRA)
Dong Lin created KAFKA-7645:
---

 Summary: Fix flaky unit test for 2.1 branch
 Key: KAFKA-7645
 URL: https://issues.apache.org/jira/browse/KAFKA-7645
 Project: Kafka
  Issue Type: Task
Reporter: Dong Lin
Assignee: Dong Lin






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7486) Flaky test `DeleteTopicTest.testAddPartitionDuringDeleteTopic`

2018-11-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7486:

Issue Type: Sub-task  (was: Bug)
Parent: KAFKA-7645

> Flaky test `DeleteTopicTest.testAddPartitionDuringDeleteTopic`
> --
>
> Key: KAFKA-7486
> URL: https://issues.apache.org/jira/browse/KAFKA-7486
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Jason Gustafson
>Priority: Major
>
> Starting to see more of this recently:
> {code}
> 10:06:28 kafka.admin.DeleteTopicTest > testAddPartitionDuringDeleteTopic 
> FAILED
> 10:06:28 kafka.admin.AdminOperationException: 
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = 
> NoNode for /brokers/topics/test
> 10:06:28 at 
> kafka.zk.AdminZkClient.writeTopicPartitionAssignment(AdminZkClient.scala:162)
> 10:06:28 at 
> kafka.zk.AdminZkClient.createOrUpdateTopicPartitionAssignmentPathInZK(AdminZkClient.scala:102)
> 10:06:28 at 
> kafka.zk.AdminZkClient.addPartitions(AdminZkClient.scala:229)
> 10:06:28 at 
> kafka.admin.DeleteTopicTest.testAddPartitionDuringDeleteTopic(DeleteTopicTest.scala:266)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7646) Flaky test SaslOAuthBearerSslEndToEndAuthorizationTest.testNoConsumeWithDescribeAclViaSubscribe

2018-11-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7646:

Issue Type: Sub-task  (was: Task)
Parent: KAFKA-7645

> Flaky test 
> SaslOAuthBearerSslEndToEndAuthorizationTest.testNoConsumeWithDescribeAclViaSubscribe
> ---
>
> Key: KAFKA-7646
> URL: https://issues.apache.org/jira/browse/KAFKA-7646
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Dong Lin
>Priority: Major
>
> This test was reported to fail by [~enothereska] as part of the 2.1.0 RC0 
> release certification.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7647) Flaky test LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic

2018-11-16 Thread Dong Lin (JIRA)
Dong Lin created KAFKA-7647:
---

 Summary: Flaky test 
LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic
 Key: KAFKA-7647
 URL: https://issues.apache.org/jira/browse/KAFKA-7647
 Project: Kafka
  Issue Type: Task
Reporter: Dong Lin


kafka.log.LogCleanerParameterizedIntegrationTest >
testCleansCombinedCompactAndDeleteTopic[3] FAILED
    java.lang.AssertionError: Contents of the map shouldn't change
expected: (340,340), 5 -> (345,345), 10 -> (350,350), 14 ->
(354,354), 1 -> (341,341), 6 -> (346,346), 9 -> (349,349), 13 -> (353,353),
2 -> (342,342), 17 -> (357,357), 12 -> (352,352), 7 -> (347,347), 3 ->
(343,343), 18 -> (358,358), 16 -> (356,356), 11 -> (351,351), 8 ->
(348,348), 19 -> (359,359), 4 -> (344,344), 15 -> (355,355))> but
was: (340,340), 5 -> (345,345), 10 -> (350,350), 14 -> (354,354),
1 -> (341,341), 6 -> (346,346), 9 -> (349,349), 13 -> (353,353), 2 ->
(342,342), 17 -> (357,357), 12 -> (352,352), 7 -> (347,347), 3 ->
(343,343), 18 -> (358,358), 16 -> (356,356), 11 -> (351,351), 99 ->
(299,299), 8 -> (348,348), 19 -> (359,359), 4 -> (344,344), 15 ->
(355,355))>
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:834)
        at org.junit.Assert.assertEquals(Assert.java:118)
        at
kafka.log.LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic(LogCleanerParameterizedIntegrationTest.scala:129)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7647) Flaky test LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic

2018-11-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7647:

Issue Type: Sub-task  (was: Task)
Parent: KAFKA-7645

> Flaky test 
> LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic
> -
>
> Key: KAFKA-7647
> URL: https://issues.apache.org/jira/browse/KAFKA-7647
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Dong Lin
>Priority: Major
>
> kafka.log.LogCleanerParameterizedIntegrationTest >
> testCleansCombinedCompactAndDeleteTopic[3] FAILED
>     java.lang.AssertionError: Contents of the map shouldn't change
> expected: (340,340), 5 -> (345,345), 10 -> (350,350), 14 ->
> (354,354), 1 -> (341,341), 6 -> (346,346), 9 -> (349,349), 13 -> (353,353),
> 2 -> (342,342), 17 -> (357,357), 12 -> (352,352), 7 -> (347,347), 3 ->
> (343,343), 18 -> (358,358), 16 -> (356,356), 11 -> (351,351), 8 ->
> (348,348), 19 -> (359,359), 4 -> (344,344), 15 -> (355,355))> but
> was: (340,340), 5 -> (345,345), 10 -> (350,350), 14 -> (354,354),
> 1 -> (341,341), 6 -> (346,346), 9 -> (349,349), 13 -> (353,353), 2 ->
> (342,342), 17 -> (357,357), 12 -> (352,352), 7 -> (347,347), 3 ->
> (343,343), 18 -> (358,358), 16 -> (356,356), 11 -> (351,351), 99 ->
> (299,299), 8 -> (348,348), 19 -> (359,359), 4 -> (344,344), 15 ->
> (355,355))>
>         at org.junit.Assert.fail(Assert.java:88)
>         at org.junit.Assert.failNotEquals(Assert.java:834)
>         at org.junit.Assert.assertEquals(Assert.java:118)
>         at
> kafka.log.LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic(LogCleanerParameterizedIntegrationTest.scala:129)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7648) Flaky test DeleteTopicsRequestTest.testValidDeleteTopicRequests

2018-11-16 Thread Dong Lin (JIRA)
Dong Lin created KAFKA-7648:
---

 Summary: Flaky test 
DeleteTopicsRequestTest.testValidDeleteTopicRequests
 Key: KAFKA-7648
 URL: https://issues.apache.org/jira/browse/KAFKA-7648
 Project: Kafka
  Issue Type: Task
Reporter: Dong Lin


Observed in 
[https://builds.apache.org/job/kafka-2.1-jdk8/52/testReport/junit/kafka.server/DeleteTopicsRequestTest/testValidDeleteTopicRequests/]

 
h3. Error Message
org.apache.kafka.common.errors.TopicExistsException: Topic 'topic-4' already 
exists.
h3. Stacktrace
org.apache.kafka.common.errors.TopicExistsException: Topic 'topic-4' already 
exists.
h3. Standard Output
[2018-11-07 17:53:10,812] ERROR [ReplicaFetcher replicaId=2, leaderId=0, 
fetcherId=0] Error for partition topic-3-3 at offset 0 
(kafka.server.ReplicaFetcherThread:76) 
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition. [2018-11-07 17:53:10,812] ERROR 
[ReplicaFetcher replicaId=1, leaderId=0, fetcherId=0] Error for partition 
topic-3-0 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition. [2018-11-07 17:53:14,805] WARN Client 
session timed out, have not heard from server in 4000ms for sessionid 
0x10051eebf480003 (org.apache.zookeeper.ClientCnxn:1112) [2018-11-07 
17:53:14,806] WARN Unable to read additional data from client sessionid 
0x10051eebf480003, likely client has closed socket 
(org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:14,807] WARN 
Client session timed out, have not heard from server in 4002ms for sessionid 
0x10051eebf480002 (org.apache.zookeeper.ClientCnxn:1112) [2018-11-07 
17:53:14,807] WARN Unable to read additional data from client sessionid 
0x10051eebf480002, likely client has closed socket 
(org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:14,823] WARN 
Client session timed out, have not heard from server in 4002ms for sessionid 
0x10051eebf480001 (org.apache.zookeeper.ClientCnxn:1112) [2018-11-07 
17:53:14,824] WARN Unable to read additional data from client sessionid 
0x10051eebf480001, likely client has closed socket 
(org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:15,423] WARN 
Client session timed out, have not heard from server in 4002ms for sessionid 
0x10051eebf48 (org.apache.zookeeper.ClientCnxn:1112) [2018-11-07 
17:53:15,423] WARN Unable to read additional data from client sessionid 
0x10051eebf48, likely client has closed socket 
(org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:15,879] WARN 
fsync-ing the write ahead log in SyncThread:0 took 4456ms which will adversely 
effect operation latency. See the ZooKeeper troubleshooting guide 
(org.apache.zookeeper.server.persistence.FileTxnLog:338) [2018-11-07 
17:53:16,831] ERROR [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error 
for partition topic-4-0 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition. [2018-11-07 17:53:23,087] ERROR 
[ReplicaFetcher replicaId=1, leaderId=0, fetcherId=0] Error for partition 
invalid-timeout-1 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition. [2018-11-07 17:53:23,088] ERROR 
[ReplicaFetcher replicaId=1, leaderId=2, fetcherId=0] Error for partition 
invalid-timeout-3 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition. [2018-11-07 17:53:23,137] ERROR 
[ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error for partition 
invalid-timeout-0 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition.
 
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7648) Flaky test DeleteTopicsRequestTest.testValidDeleteTopicRequests

2018-11-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7648:

Issue Type: Sub-task  (was: Task)
Parent: KAFKA-7645

> Flaky test DeleteTopicsRequestTest.testValidDeleteTopicRequests
> ---
>
> Key: KAFKA-7648
> URL: https://issues.apache.org/jira/browse/KAFKA-7648
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Dong Lin
>Priority: Major
>
> Observed in 
> [https://builds.apache.org/job/kafka-2.1-jdk8/52/testReport/junit/kafka.server/DeleteTopicsRequestTest/testValidDeleteTopicRequests/]
>  
> h3. Error Message
> org.apache.kafka.common.errors.TopicExistsException: Topic 'topic-4' already 
> exists.
> h3. Stacktrace
> org.apache.kafka.common.errors.TopicExistsException: Topic 'topic-4' already 
> exists.
> h3. Standard Output
> [2018-11-07 17:53:10,812] ERROR [ReplicaFetcher replicaId=2, leaderId=0, 
> fetcherId=0] Error for partition topic-3-3 at offset 0 
> (kafka.server.ReplicaFetcherThread:76) 
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition. [2018-11-07 17:53:10,812] ERROR 
> [ReplicaFetcher replicaId=1, leaderId=0, fetcherId=0] Error for partition 
> topic-3-0 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition. [2018-11-07 17:53:14,805] WARN Client 
> session timed out, have not heard from server in 4000ms for sessionid 
> 0x10051eebf480003 (org.apache.zookeeper.ClientCnxn:1112) [2018-11-07 
> 17:53:14,806] WARN Unable to read additional data from client sessionid 
> 0x10051eebf480003, likely client has closed socket 
> (org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:14,807] 
> WARN Client session timed out, have not heard from server in 4002ms for 
> sessionid 0x10051eebf480002 (org.apache.zookeeper.ClientCnxn:1112) 
> [2018-11-07 17:53:14,807] WARN Unable to read additional data from client 
> sessionid 0x10051eebf480002, likely client has closed socket 
> (org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:14,823] 
> WARN Client session timed out, have not heard from server in 4002ms for 
> sessionid 0x10051eebf480001 (org.apache.zookeeper.ClientCnxn:1112) 
> [2018-11-07 17:53:14,824] WARN Unable to read additional data from client 
> sessionid 0x10051eebf480001, likely client has closed socket 
> (org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:15,423] 
> WARN Client session timed out, have not heard from server in 4002ms for 
> sessionid 0x10051eebf48 (org.apache.zookeeper.ClientCnxn:1112) 
> [2018-11-07 17:53:15,423] WARN Unable to read additional data from client 
> sessionid 0x10051eebf48, likely client has closed socket 
> (org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:15,879] 
> WARN fsync-ing the write ahead log in SyncThread:0 took 4456ms which will 
> adversely effect operation latency. See the ZooKeeper troubleshooting guide 
> (org.apache.zookeeper.server.persistence.FileTxnLog:338) [2018-11-07 
> 17:53:16,831] ERROR [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] 
> Error for partition topic-4-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76) 
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition. [2018-11-07 17:53:23,087] ERROR 
> [ReplicaFetcher replicaId=1, leaderId=0, fetcherId=0] Error for partition 
> invalid-timeout-1 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition. [2018-11-07 17:53:23,088] ERROR 
> [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=0] Error for partition 
> invalid-timeout-3 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition. [2018-11-07 17:53:23,137] ERROR 
> [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error for partition 
> invalid-timeout-0 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7312) Transient failure in kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts

2018-11-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7312:

Issue Type: Sub-task  (was: Improvement)
Parent: KAFKA-7645

> Transient failure in 
> kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts
> 
>
> Key: KAFKA-7312
> URL: https://issues.apache.org/jira/browse/KAFKA-7312
> Project: Kafka
>  Issue Type: Sub-task
>  Components: unit tests
>Reporter: Guozhang Wang
>Priority: Major
>
> {code}
> Error Message
> org.junit.runners.model.TestTimedOutException: test timed out after 12 
> milliseconds
> Stacktrace
> org.junit.runners.model.TestTimedOutException: test timed out after 12 
> milliseconds
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Object.wait(Object.java:502)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:92)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:262)
>   at 
> kafka.utils.TestUtils$.assertFutureExceptionTypeEquals(TestUtils.scala:1345)
>   at 
> kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts(AdminClientIntegrationTest.scala:1080)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7648) Flaky test DeleteTopicsRequestTest.testValidDeleteTopicRequests

2018-11-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7648:

Description: 
Observed in 
[https://builds.apache.org/job/kafka-2.1-jdk8/52/testReport/junit/kafka.server/DeleteTopicsRequestTest/testValidDeleteTopicRequests/]

 

{code}

Error Message

org.apache.kafka.common.errors.TopicExistsException: Topic 'topic-4' already 
exists.
h3. Stacktrace

org.apache.kafka.common.errors.TopicExistsException: Topic 'topic-4' already 
exists.
h3. Standard Output

[2018-11-07 17:53:10,812] ERROR [ReplicaFetcher replicaId=2, leaderId=0, 
fetcherId=0] Error for partition topic-3-3 at offset 0 
(kafka.server.ReplicaFetcherThread:76) 
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition. [2018-11-07 17:53:10,812] ERROR 
[ReplicaFetcher replicaId=1, leaderId=0, fetcherId=0] Error for partition 
topic-3-0 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition. [2018-11-07 17:53:14,805] WARN Client 
session timed out, have not heard from server in 4000ms for sessionid 
0x10051eebf480003 (org.apache.zookeeper.ClientCnxn:1112) [2018-11-07 
17:53:14,806] WARN Unable to read additional data from client sessionid 
0x10051eebf480003, likely client has closed socket 
(org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:14,807] WARN 
Client session timed out, have not heard from server in 4002ms for sessionid 
0x10051eebf480002 (org.apache.zookeeper.ClientCnxn:1112) [2018-11-07 
17:53:14,807] WARN Unable to read additional data from client sessionid 
0x10051eebf480002, likely client has closed socket 
(org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:14,823] WARN 
Client session timed out, have not heard from server in 4002ms for sessionid 
0x10051eebf480001 (org.apache.zookeeper.ClientCnxn:1112) [2018-11-07 
17:53:14,824] WARN Unable to read additional data from client sessionid 
0x10051eebf480001, likely client has closed socket 
(org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:15,423] WARN 
Client session timed out, have not heard from server in 4002ms for sessionid 
0x10051eebf48 (org.apache.zookeeper.ClientCnxn:1112) [2018-11-07 
17:53:15,423] WARN Unable to read additional data from client sessionid 
0x10051eebf48, likely client has closed socket 
(org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:15,879] WARN 
fsync-ing the write ahead log in SyncThread:0 took 4456ms which will adversely 
effect operation latency. See the ZooKeeper troubleshooting guide 
(org.apache.zookeeper.server.persistence.FileTxnLog:338) [2018-11-07 
17:53:16,831] ERROR [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error 
for partition topic-4-0 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition. [2018-11-07 17:53:23,087] ERROR 
[ReplicaFetcher replicaId=1, leaderId=0, fetcherId=0] Error for partition 
invalid-timeout-1 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition. [2018-11-07 17:53:23,088] ERROR 
[ReplicaFetcher replicaId=1, leaderId=2, fetcherId=0] Error for partition 
invalid-timeout-3 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition. [2018-11-07 17:53:23,137] ERROR 
[ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error for partition 
invalid-timeout-0 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition.
  
{code}

  was:
Observed in 
[https://builds.apache.org/job/kafka-2.1-jdk8/52/testReport/junit/kafka.server/DeleteTopicsRequestTest/testValidDeleteTopicRequests/]

 
h3. Error Message
org.apache.kafka.common.errors.TopicExistsException: Topic 'topic-4' already 
exists.
h3. Stacktrace
org.apache.kafka.common.errors.TopicExistsException: Topic 'topic-4' already 
exists.
h3. Standard Output
[2018-11-07 17:53:10,812] ERROR [ReplicaFetcher replicaId=2, leaderId=0, 
fetcherId=0] Error for partition topic-3-3 at offset 0 
(kafka.server.ReplicaFetcherThread:76) 
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition. [2018-11-07 17:53:10,812] ERROR 
[ReplicaFetcher replicaId=1, leaderId=0, fetcherId=0] Error for partition 
topic-3-0 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
does not host this topic-partition. [2018-11-07 17:53:14,805] WARN Client 
session timed out, have not heard f

[jira] [Created] (KAFKA-7649) Flaky test SslEndToEndAuthorizationTest.testNoProduceWithDescribeAcl

2018-11-16 Thread Dong Lin (JIRA)
Dong Lin created KAFKA-7649:
---

 Summary: Flaky test 
SslEndToEndAuthorizationTest.testNoProduceWithDescribeAcl
 Key: KAFKA-7649
 URL: https://issues.apache.org/jira/browse/KAFKA-7649
 Project: Kafka
  Issue Type: Task
Reporter: Dong Lin


Observed in 
https://builds.apache.org/job/kafka-2.1-jdk8/49/testReport/junit/kafka.api/SslEndToEndAuthorizationTest/testNoProduceWithDescribeAcl/

{code}
Error Message
java.lang.SecurityException: zookeeper.set.acl is true, but the verification of 
the JAAS login file failed.
Stacktrace
java.lang.SecurityException: zookeeper.set.acl is true, but the verification of 
the JAAS login file failed.
at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:361)
at kafka.server.KafkaServer.startup(KafkaServer.scala:202)
at kafka.utils.TestUtils$.createServer(TestUtils.scala:135)
at 
kafka.integration.KafkaServerTestHarness$$anonfun$setUp$1.apply(KafkaServerTestHarness.scala:101)
at 
kafka.integration.KafkaServerTestHarness$$anonfun$setUp$1.apply(KafkaServerTestHarness.scala:100)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at 
kafka.integration.KafkaServerTestHarness.setUp(KafkaServerTestHarness.scala:100)
at 
kafka.api.IntegrationTestHarness.doSetup(IntegrationTestHarness.scala:81)
at 
kafka.api.IntegrationTestHarness.setUp(IntegrationTestHarness.scala:73)
at 
kafka.api.EndToEndAuthorizationTest.setUp(EndToEndAuthorizationTest.scala:180)
at 
kafka.api.SslEndToEndAuthorizationTest.setUp(SslEndToEndAuthorizationTest.scala:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106)
at 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
at 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
at 
org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66)
at 
org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
at 
org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
at 
org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32)
at 
org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
at com.sun.proxy.$Proxy2.processTestClass(Unknown Source)
at 
org.gradle.api.internal.tasks.testing.wo

[jira] [Updated] (KAFKA-7649) Flaky test SslEndToEndAuthorizationTest.testNoProduceWithDescribeAcl

2018-11-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7649:

Issue Type: Sub-task  (was: Task)
Parent: KAFKA-7645

> Flaky test SslEndToEndAuthorizationTest.testNoProduceWithDescribeAcl
> 
>
> Key: KAFKA-7649
> URL: https://issues.apache.org/jira/browse/KAFKA-7649
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Dong Lin
>Priority: Major
>
> Observed in 
> https://builds.apache.org/job/kafka-2.1-jdk8/49/testReport/junit/kafka.api/SslEndToEndAuthorizationTest/testNoProduceWithDescribeAcl/
> {code}
> Error Message
> java.lang.SecurityException: zookeeper.set.acl is true, but the verification 
> of the JAAS login file failed.
> Stacktrace
> java.lang.SecurityException: zookeeper.set.acl is true, but the verification 
> of the JAAS login file failed.
>   at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:361)
>   at kafka.server.KafkaServer.startup(KafkaServer.scala:202)
>   at kafka.utils.TestUtils$.createServer(TestUtils.scala:135)
>   at 
> kafka.integration.KafkaServerTestHarness$$anonfun$setUp$1.apply(KafkaServerTestHarness.scala:101)
>   at 
> kafka.integration.KafkaServerTestHarness$$anonfun$setUp$1.apply(KafkaServerTestHarness.scala:100)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> kafka.integration.KafkaServerTestHarness.setUp(KafkaServerTestHarness.scala:100)
>   at 
> kafka.api.IntegrationTestHarness.doSetup(IntegrationTestHarness.scala:81)
>   at 
> kafka.api.IntegrationTestHarness.setUp(IntegrationTestHarness.scala:73)
>   at 
> kafka.api.EndToEndAuthorizationTest.setUp(EndToEndAuthorizationTest.scala:180)
>   at 
> kafka.api.SslEndToEndAuthorizationTest.setUp(SslEndToEndAuthorizationTest.scala:72)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
>   at 
> org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66)
>   at 
> org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
>   at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
>   at 
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
>   a

[jira] [Created] (KAFKA-7646) Flaky test SaslOAuthBearerSslEndToEndAuthorizationTest.testNoConsumeWithDescribeAclViaSubscribe

2018-11-16 Thread Dong Lin (JIRA)
Dong Lin created KAFKA-7646:
---

 Summary: Flaky test 
SaslOAuthBearerSslEndToEndAuthorizationTest.testNoConsumeWithDescribeAclViaSubscribe
 Key: KAFKA-7646
 URL: https://issues.apache.org/jira/browse/KAFKA-7646
 Project: Kafka
  Issue Type: Task
Reporter: Dong Lin


This test was reported as failing by [~enothereska] as part of the 2.1.0 RC0 release 
certification.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7541) Transient Failure: kafka.server.DynamicBrokerReconfigurationTest.testUncleanLeaderElectionEnable

2018-11-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7541:

Issue Type: Sub-task  (was: Bug)
Parent: KAFKA-7645

> Transient Failure: 
> kafka.server.DynamicBrokerReconfigurationTest.testUncleanLeaderElectionEnable
> 
>
> Key: KAFKA-7541
> URL: https://issues.apache.org/jira/browse/KAFKA-7541
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: John Roesler
>Priority: Major
>
> Observed on Java 11: 
> [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/264/testReport/junit/kafka.server/DynamicBrokerReconfigurationTest/testUncleanLeaderElectionEnable/]
>  
> Stacktrace:
> {noformat}
> java.lang.AssertionError: Unclean leader not elected
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at 
> kafka.server.DynamicBrokerReconfigurationTest.testUncleanLeaderElectionEnable(DynamicBrokerReconfigurationTest.scala:487)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
>   at 
> org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66)
>   at 
> org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
>   at 
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
>   at 
> org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32)
>   at 
> org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
>   at com.sun.proxy.$Proxy2.processTestClass(Unknown Source)
>   at 
> org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:117)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Delegati

[jira] [Commented] (KAFKA-7486) Flaky test `DeleteTopicTest.testAddPartitionDuringDeleteTopic`

2018-11-16 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690115#comment-16690115
 ] 

Dong Lin commented on KAFKA-7486:
-

The test calls `AdminUtils.addPartitions` to add a partition to the topic after 
the async topic deletion has been triggered. It assumes that the topic deletion 
has not yet completed when `AdminUtils.addPartitions` is executed, but this is 
not guaranteed. This race is the root cause of the test failure, so it does not 
indicate any bug in the code and is therefore not a blocking issue for the 2.1 release.
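
Below is a minimal sketch of how the test could tolerate this race instead of failing 
on it. It is only illustrative: `addPartitionsToTopic` is a hypothetical stand-in for 
the partition-add step of the test, and a NoNodeException simply means the deletion 
finished first.

{code}
import org.apache.zookeeper.KeeperException.NoNodeException

// Hypothetical guard around the partition-add step of the test. In the real test the
// error surfaces as kafka.admin.AdminOperationException wrapping a NoNodeException for
// /brokers/topics/test; here `addPartitionsToTopic` is a stand-in closure assumed to
// rethrow the underlying NoNodeException.
def addPartitionsIgnoringDeletedTopic(addPartitionsToTopic: () => Unit): Boolean =
  try {
    addPartitionsToTopic()
    true  // partition added before the deletion completed
  } catch {
    case _: NoNodeException => false  // deletion won the race; nothing left to add to
  }
{code}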

> Flaky test `DeleteTopicTest.testAddPartitionDuringDeleteTopic`
> --
>
> Key: KAFKA-7486
> URL: https://issues.apache.org/jira/browse/KAFKA-7486
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Jason Gustafson
>Priority: Major
>
> Starting to see more of this recently:
> {code}
> 10:06:28 kafka.admin.DeleteTopicTest > testAddPartitionDuringDeleteTopic 
> FAILED
> 10:06:28 kafka.admin.AdminOperationException: 
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = 
> NoNode for /brokers/topics/test
> 10:06:28 at 
> kafka.zk.AdminZkClient.writeTopicPartitionAssignment(AdminZkClient.scala:162)
> 10:06:28 at 
> kafka.zk.AdminZkClient.createOrUpdateTopicPartitionAssignmentPathInZK(AdminZkClient.scala:102)
> 10:06:28 at 
> kafka.zk.AdminZkClient.addPartitions(AdminZkClient.scala:229)
> 10:06:28 at 
> kafka.admin.DeleteTopicTest.testAddPartitionDuringDeleteTopic(DeleteTopicTest.scala:266)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7541) Transient Failure: kafka.server.DynamicBrokerReconfigurationTest.testUncleanLeaderElectionEnable

2018-11-16 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690130#comment-16690130
 ] 

Dong Lin commented on KAFKA-7541:
-

According to the source code of 
DynamicBrokerReconfigurationTest.testUncleanLeaderElectionEnable(), the test 
fails with the exception above if leader election does not complete within 
15 seconds. Thus the test may fail when there is a long GC pause. We can reduce 
the chance of failure by increasing the wait time.

Given the above understanding, and the fact that the test passes with high 
probability, this flaky test does not indicate a bug and should not be a blocking 
issue for the 2.1.0 release.
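
As a rough illustration (not the actual test code), the condition could be polled 
against a larger deadline, in the spirit of TestUtils.waitUntilTrue; 
`uncleanLeaderElected` below is a hypothetical stand-in for the test's real check.

{code}
// Minimal polling helper: keep checking until the deadline instead of giving up
// after the current 15 seconds. Returns whether the condition ever held.
def waitForCondition(condition: () => Boolean,
                     timeoutMs: Long = 60000L,
                     pauseMs: Long = 100L): Boolean = {
  val deadline = System.currentTimeMillis() + timeoutMs
  while (System.currentTimeMillis() < deadline) {
    if (condition()) return true
    Thread.sleep(pauseMs)
  }
  condition()
}

// e.g. assertTrue("Unclean leader not elected", waitForCondition(() => uncleanLeaderElected()))
{code}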

> Transient Failure: 
> kafka.server.DynamicBrokerReconfigurationTest.testUncleanLeaderElectionEnable
> 
>
> Key: KAFKA-7541
> URL: https://issues.apache.org/jira/browse/KAFKA-7541
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: John Roesler
>Priority: Major
>
> Observed on Java 11: 
> [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/264/testReport/junit/kafka.server/DynamicBrokerReconfigurationTest/testUncleanLeaderElectionEnable/]
>  
> Stacktrace:
> {noformat}
> java.lang.AssertionError: Unclean leader not elected
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at 
> kafka.server.DynamicBrokerReconfigurationTest.testUncleanLeaderElectionEnable(DynamicBrokerReconfigurationTest.scala:487)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
>   at 
> org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66)
>   at 
> org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
>   at 
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
>   at 
> org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32)
>   at 
> org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(Pro

[jira] [Comment Edited] (KAFKA-7486) Flaky test `DeleteTopicTest.testAddPartitionDuringDeleteTopic`

2018-11-16 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690115#comment-16690115
 ] 

Dong Lin edited comment on KAFKA-7486 at 11/16/18 11:13 PM:


The test calls `AdminUtils.addPartitions` to add a partition to the topic after 
the async topic deletion has been triggered. It assumes that the topic deletion 
has not yet completed when `AdminUtils.addPartitions` is executed, but this is 
not guaranteed. This race is the root cause of the test failure, so it does not 
indicate any bug in the code and is therefore not a blocking issue for the 2.1.0 release.


was (Author: lindong):
The test calls `AdminUtils.addPartitions` to add partition to the topic after 
the async topic deletion is triggered. It assumes that the topic deletion is 
not completed when `AdminUtils.addPartitions` is executed but this is not 
guaranteed. This is the root cause of the test failure. So it does not indicate 
any bug in the code and thus this is not a blocking issue for 2.1 release.

> Flaky test `DeleteTopicTest.testAddPartitionDuringDeleteTopic`
> --
>
> Key: KAFKA-7486
> URL: https://issues.apache.org/jira/browse/KAFKA-7486
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Jason Gustafson
>Priority: Major
>
> Starting to see more of this recently:
> {code}
> 10:06:28 kafka.admin.DeleteTopicTest > testAddPartitionDuringDeleteTopic 
> FAILED
> 10:06:28 kafka.admin.AdminOperationException: 
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = 
> NoNode for /brokers/topics/test
> 10:06:28 at 
> kafka.zk.AdminZkClient.writeTopicPartitionAssignment(AdminZkClient.scala:162)
> 10:06:28 at 
> kafka.zk.AdminZkClient.createOrUpdateTopicPartitionAssignmentPathInZK(AdminZkClient.scala:102)
> 10:06:28 at 
> kafka.zk.AdminZkClient.addPartitions(AdminZkClient.scala:229)
> 10:06:28 at 
> kafka.admin.DeleteTopicTest.testAddPartitionDuringDeleteTopic(DeleteTopicTest.scala:266)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-7648) Flaky test DeleteTopicsRequestTest.testValidDeleteTopicRequests

2018-11-16 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690140#comment-16690140
 ] 

Dong Lin edited comment on KAFKA-7648 at 11/16/18 11:47 PM:


Currently TestUtils.createTopic(...) will re-send the znode creation request to 
the ZooKeeper service if the previous response shows Code.CONNECTIONLOSS. See 
KafkaZkClient.retryRequestsUntilConnected() for the related logic.

This means that the test will fail if ZooKeeper created the znode upon the 
first request, the response to the first request is lost or times out, the 
second request is sent, and the response to the second request shows 
Code.NODEEXISTS.

In order to fix this flaky test, we should probably implement logic similar to 
KafkaZkClient.CheckedEphemeral() to check whether the znode was created within 
the same session after receiving Code.NODEEXISTS.

Given the above understanding, and the fact that the test passes with high 
probability, this flaky test does not indicate a bug and should not be a blocking 
issue for the 2.1.0 release.


was (Author: lindong):
Currently TestUtils.createTopic(...) will re-send znode creation request to 
zookeeper service if the previous response shows Code.CONNECTIONLOSS. See 
KafkaZkClient.retryRequestsUntilConnected() for related logic.

This means that the test will fail if the zookeeper has created znode upon the 
first request, the response to the first request is lost or timed-out, the 
second request is sent, and the response of the second request shows 
Code.NODEEXISTS.

In order to fix this flaky test, we probably should implement some logic 
similar to KafkaZkClient.CheckedEphemeral() to check whether the znode has been 
created in the with the same session id after receiving Code.NODEEXISTS.



> Flaky test DeleteTopicsRequestTest.testValidDeleteTopicRequests
> ---
>
> Key: KAFKA-7648
> URL: https://issues.apache.org/jira/browse/KAFKA-7648
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Dong Lin
>Priority: Major
>
> Observed in 
> [https://builds.apache.org/job/kafka-2.1-jdk8/52/testReport/junit/kafka.server/DeleteTopicsRequestTest/testValidDeleteTopicRequests/]
>  
> {code}
> Error Message
> org.apache.kafka.common.errors.TopicExistsException: Topic 'topic-4' already 
> exists.
> h3. Stacktrace
> org.apache.kafka.common.errors.TopicExistsException: Topic 'topic-4' already 
> exists.
> h3. Standard Output
> [2018-11-07 17:53:10,812] ERROR [ReplicaFetcher replicaId=2, leaderId=0, 
> fetcherId=0] Error for partition topic-3-3 at offset 0 
> (kafka.server.ReplicaFetcherThread:76) 
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition. [2018-11-07 17:53:10,812] ERROR 
> [ReplicaFetcher replicaId=1, leaderId=0, fetcherId=0] Error for partition 
> topic-3-0 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition. [2018-11-07 17:53:14,805] WARN Client 
> session timed out, have not heard from server in 4000ms for sessionid 
> 0x10051eebf480003 (org.apache.zookeeper.ClientCnxn:1112) [2018-11-07 
> 17:53:14,806] WARN Unable to read additional data from client sessionid 
> 0x10051eebf480003, likely client has closed socket 
> (org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:14,807] 
> WARN Client session timed out, have not heard from server in 4002ms for 
> sessionid 0x10051eebf480002 (org.apache.zookeeper.ClientCnxn:1112) 
> [2018-11-07 17:53:14,807] WARN Unable to read additional data from client 
> sessionid 0x10051eebf480002, likely client has closed socket 
> (org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:14,823] 
> WARN Client session timed out, have not heard from server in 4002ms for 
> sessionid 0x10051eebf480001 (org.apache.zookeeper.ClientCnxn:1112) 
> [2018-11-07 17:53:14,824] WARN Unable to read additional data from client 
> sessionid 0x10051eebf480001, likely client has closed socket 
> (org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:15,423] 
> WARN Client session timed out, have not heard from server in 4002ms for 
> sessionid 0x10051eebf48 (org.apache.zookeeper.ClientCnxn:1112) 
> [2018-11-07 17:53:15,423] WARN Unable to read additional data from client 
> sessionid 0x10051eebf48, likely client has closed socket 
> (org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:15,879] 
> WARN fsync-ing the write ahead log in SyncThread:0 took 4456ms which will 
> adversely effect operation latency. See the ZooKeeper troubleshooting guide 
> (org.apache.zookeeper.server.persistence.FileTxnLog:338) [2018-11-07 
> 17:53:16,831] ERROR [ReplicaFetcher replicaId=0, leaderId

[jira] [Commented] (KAFKA-7648) Flaky test DeleteTopicsRequestTest.testValidDeleteTopicRequests

2018-11-16 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690140#comment-16690140
 ] 

Dong Lin commented on KAFKA-7648:
-

Currently TestUtils.createTopic(...) will re-send the znode creation request to 
the ZooKeeper service if the previous response shows Code.CONNECTIONLOSS. See 
KafkaZkClient.retryRequestsUntilConnected() for the related logic.

This means that the test will fail if ZooKeeper created the znode upon the 
first request, the response to the first request is lost or times out, the 
second request is sent, and the response to the second request shows 
Code.NODEEXISTS.

In order to fix this flaky test, we should probably implement logic similar to 
KafkaZkClient.CheckedEphemeral() to check whether the znode was created within 
the same session after receiving Code.NODEEXISTS.
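
A rough sketch of the "retry, then verify on NODEEXISTS" idea, using a plain ZooKeeper 
handle. Persistent topic znodes have no ephemeral owner to compare, so this sketch falls 
back to comparing the payload as a proxy for "our earlier create actually succeeded"; it 
is only illustrative and is not the proposed fix.

{code}
import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}
import org.apache.zookeeper.KeeperException.{ConnectionLossException, NodeExistsException}

// Create `path` idempotently: naively retry on connection loss (as the test client
// effectively does today) and, if the retry reports NODEEXISTS, read the node back
// and accept it when its data matches what we tried to write.
def createIdempotently(zk: ZooKeeper, path: String, data: Array[Byte]): Unit =
  try {
    zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
  } catch {
    case _: ConnectionLossException =>
      createIdempotently(zk, path, data)
    case _: NodeExistsException =>
      val existing = zk.getData(path, false, null)
      if (!java.util.Arrays.equals(existing, data))
        throw new IllegalStateException(s"$path already exists with different data")
      // otherwise our first attempt created the node, so NODEEXISTS is not an error
  }
{code}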



> Flaky test DeleteTopicsRequestTest.testValidDeleteTopicRequests
> ---
>
> Key: KAFKA-7648
> URL: https://issues.apache.org/jira/browse/KAFKA-7648
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Dong Lin
>Priority: Major
>
> Observed in 
> [https://builds.apache.org/job/kafka-2.1-jdk8/52/testReport/junit/kafka.server/DeleteTopicsRequestTest/testValidDeleteTopicRequests/]
>  
> {code}
> Error Message
> org.apache.kafka.common.errors.TopicExistsException: Topic 'topic-4' already 
> exists.
> h3. Stacktrace
> org.apache.kafka.common.errors.TopicExistsException: Topic 'topic-4' already 
> exists.
> h3. Standard Output
> [2018-11-07 17:53:10,812] ERROR [ReplicaFetcher replicaId=2, leaderId=0, 
> fetcherId=0] Error for partition topic-3-3 at offset 0 
> (kafka.server.ReplicaFetcherThread:76) 
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition. [2018-11-07 17:53:10,812] ERROR 
> [ReplicaFetcher replicaId=1, leaderId=0, fetcherId=0] Error for partition 
> topic-3-0 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition. [2018-11-07 17:53:14,805] WARN Client 
> session timed out, have not heard from server in 4000ms for sessionid 
> 0x10051eebf480003 (org.apache.zookeeper.ClientCnxn:1112) [2018-11-07 
> 17:53:14,806] WARN Unable to read additional data from client sessionid 
> 0x10051eebf480003, likely client has closed socket 
> (org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:14,807] 
> WARN Client session timed out, have not heard from server in 4002ms for 
> sessionid 0x10051eebf480002 (org.apache.zookeeper.ClientCnxn:1112) 
> [2018-11-07 17:53:14,807] WARN Unable to read additional data from client 
> sessionid 0x10051eebf480002, likely client has closed socket 
> (org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:14,823] 
> WARN Client session timed out, have not heard from server in 4002ms for 
> sessionid 0x10051eebf480001 (org.apache.zookeeper.ClientCnxn:1112) 
> [2018-11-07 17:53:14,824] WARN Unable to read additional data from client 
> sessionid 0x10051eebf480001, likely client has closed socket 
> (org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:15,423] 
> WARN Client session timed out, have not heard from server in 4002ms for 
> sessionid 0x10051eebf48 (org.apache.zookeeper.ClientCnxn:1112) 
> [2018-11-07 17:53:15,423] WARN Unable to read additional data from client 
> sessionid 0x10051eebf48, likely client has closed socket 
> (org.apache.zookeeper.server.NIOServerCnxn:376) [2018-11-07 17:53:15,879] 
> WARN fsync-ing the write ahead log in SyncThread:0 took 4456ms which will 
> adversely effect operation latency. See the ZooKeeper troubleshooting guide 
> (org.apache.zookeeper.server.persistence.FileTxnLog:338) [2018-11-07 
> 17:53:16,831] ERROR [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] 
> Error for partition topic-4-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76) 
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition. [2018-11-07 17:53:23,087] ERROR 
> [ReplicaFetcher replicaId=1, leaderId=0, fetcherId=0] Error for partition 
> invalid-timeout-1 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition. [2018-11-07 17:53:23,088] ERROR 
> [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=0] Error for partition 
> invalid-timeout-3 at offset 0 (kafka.server.ReplicaFetcherThread:76) 
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition. [2018-11-07 17:53:23,137] ERROR 
> [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error for partition 
> invalid-timeo

[jira] [Commented] (KAFKA-7312) Transient failure in kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts

2018-11-16 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690150#comment-16690150
 ] 

Dong Lin commented on KAFKA-7312:
-

Here is another stacktrace:

{code}
Error Message
java.lang.AssertionError: Expected an exception of type 
org.apache.kafka.common.errors.TimeoutException; got type 
org.apache.kafka.common.errors.SslAuthenticationException
Stacktrace
java.lang.AssertionError: Expected an exception of type 
org.apache.kafka.common.errors.TimeoutException; got type 
org.apache.kafka.common.errors.SslAuthenticationException
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at 
kafka.utils.TestUtils$.assertFutureExceptionTypeEquals(TestUtils.scala:1404)
at 
kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts(AdminClientIntegrationTest.scala:1080)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
{code}


> Transient failure in 
> kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts
> 
>
> Key: KAFKA-7312
> URL: https://issues.apache.org/jira/browse/KAFKA-7312
> Project: Kafka
>  Issue Type: Sub-task
>  Components: unit tests
>Reporter: Guozhang Wang
>Priority: Major
>
> {code}
> Error Message
> org.junit.runners.model.TestTimedOutException: test timed out after 12 
> milliseconds
> Stacktrace
> org.junit.runners.model.TestTimedOutException: test timed out after 12 
> milliseconds
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Object.wait(Object.java:502)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:92)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:262)
>   at 
> kafka.utils.TestUtils$.assertFutureExceptionTypeEquals(TestUtils.scala:1345)
>   at 
> kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts(AdminClientIntegrationTest.scala:1080)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7649) Flaky test SslEndToEndAuthorizationTest.testNoProduceWithDescribeAcl

2018-11-16 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690168#comment-16690168
 ] 

Dong Lin commented on KAFKA-7649:
-

There is an error in the log that says "No JAAS configuration section named 
'Client' was found in specified JAAS configuration file: 
'/tmp/kafka7346315539242944484.tmp'. Will continue connection to Zookeeper 
server without SASL authentication, if Zookeeper server allows it. 
(org.apache.zookeeper.ClientCnxn:1011)". The source code confirms that the broker 
will fail to start with "java.lang.SecurityException: zookeeper.set.acl is 
true, but the verification of the JAAS login file failed." if the broker cannot 
find the JAAS configuration file.

So the question is why the broker fails to find the JAAS configuration file even 
though "startSasl(jaasSections(kafkaServerSaslMechanisms, 
Option(kafkaClientSaslMechanism), Both))" in 
SaslEndToEndAuthorizationTest.setUp() should have created it. I could not find 
the root cause yet.

Since this happens rarely in the integration test, and the failure depends on the 
existence of a configuration file during broker initialization, my guess is that 
the bug lies in the test setup, or perhaps the temporary file 
`/tmp/kafka7346315539242944484.tmp` is somehow cleaned up by the test machine. 
Though I am not 100% sure, my opinion is that this is not a blocking issue for 
the 2.1.0 release.
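
For reference, the 'Client' login context that the ZooKeeper client looks for in the 
JAAS file typically has the following shape; the login module and credentials below are 
placeholders, not the values the test harness generates.

{code}
Client {
  org.apache.zookeeper.server.auth.DigestLoginModule required
  username="zkclient"
  password="zkclient-secret";
};
{code}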






> Flaky test SslEndToEndAuthorizationTest.testNoProduceWithDescribeAcl
> 
>
> Key: KAFKA-7649
> URL: https://issues.apache.org/jira/browse/KAFKA-7649
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Dong Lin
>Priority: Major
>
> Observed in 
> https://builds.apache.org/job/kafka-2.1-jdk8/49/testReport/junit/kafka.api/SslEndToEndAuthorizationTest/testNoProduceWithDescribeAcl/
> {code}
> Error Message
> java.lang.SecurityException: zookeeper.set.acl is true, but the verification 
> of the JAAS login file failed.
> Stacktrace
> java.lang.SecurityException: zookeeper.set.acl is true, but the verification 
> of the JAAS login file failed.
>   at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:361)
>   at kafka.server.KafkaServer.startup(KafkaServer.scala:202)
>   at kafka.utils.TestUtils$.createServer(TestUtils.scala:135)
>   at 
> kafka.integration.KafkaServerTestHarness$$anonfun$setUp$1.apply(KafkaServerTestHarness.scala:101)
>   at 
> kafka.integration.KafkaServerTestHarness$$anonfun$setUp$1.apply(KafkaServerTestHarness.scala:100)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> kafka.integration.KafkaServerTestHarness.setUp(KafkaServerTestHarness.scala:100)
>   at 
> kafka.api.IntegrationTestHarness.doSetup(IntegrationTestHarness.scala:81)
>   at 
> kafka.api.IntegrationTestHarness.setUp(IntegrationTestHarness.scala:73)
>   at 
> kafka.api.EndToEndAuthorizationTest.setUp(EndToEndAuthorizationTest.scala:180)
>   at 
> kafka.api.SslEndToEndAuthorizationTest.setUp(SslEndToEndAuthorizationTest.scala:72)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(R

[jira] [Comment Edited] (KAFKA-7312) Transient failure in kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts

2018-11-16 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690150#comment-16690150
 ] 

Dong Lin edited comment on KAFKA-7312 at 11/17/18 12:36 AM:


Here is another stacktrace from 
https://builds.apache.org/job/kafka-2.1-jdk8/51/testReport/junit/kafka.api/SaslSslAdminClientIntegrationTest/testMinimumRequestTimeouts/

{code}
Error Message
java.lang.AssertionError: Expected an exception of type 
org.apache.kafka.common.errors.TimeoutException; got type 
org.apache.kafka.common.errors.SslAuthenticationException
Stacktrace
java.lang.AssertionError: Expected an exception of type 
org.apache.kafka.common.errors.TimeoutException; got type 
org.apache.kafka.common.errors.SslAuthenticationException
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at 
kafka.utils.TestUtils$.assertFutureExceptionTypeEquals(TestUtils.scala:1404)
at 
kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts(AdminClientIntegrationTest.scala:1080)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
{code}



was (Author: lindong):
Here is another stacktrace:

{code}
Error Message
java.lang.AssertionError: Expected an exception of type 
org.apache.kafka.common.errors.TimeoutException; got type 
org.apache.kafka.common.errors.SslAuthenticationException
Stacktrace
java.lang.AssertionError: Expected an exception of type 
org.apache.kafka.common.errors.TimeoutException; got type 
org.apache.kafka.common.errors.SslAuthenticationException
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at 
kafka.utils.TestUtils$.assertFutureExceptionTypeEquals(TestUtils.scala:1404)
at 
kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts(AdminClientIntegrationTest.scala:1080)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
{code}


> Transient failure in 
> kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts
> 
>
> Key: KAFKA-7312
> URL: https://issues.apache.org/jira/browse/KAFKA-7312
> Project: Kafka
>  Issue Type: Sub-task
>  Components: unit tests
>Reporter: Guozhang Wang
>Priority: Major
>
> {code}
> Error Message
> org.junit.runners.model.TestTimedOutException: test timed out after 12 
> milliseconds
> Stacktrace
> org.junit.runners.model.TestTimedOutException: test timed out afte

[jira] [Created] (KAFKA-7651) Flaky test SaslSslAdminClientIntegrationTest.testMinimumRequestTimeouts

2018-11-16 Thread Dong Lin (JIRA)
Dong Lin created KAFKA-7651:
---

 Summary: Flaky test 
SaslSslAdminClientIntegrationTest.testMinimumRequestTimeouts
 Key: KAFKA-7651
 URL: https://issues.apache.org/jira/browse/KAFKA-7651
 Project: Kafka
  Issue Type: Task
Reporter: Dong Lin


Here is the stacktrace from 
https://builds.apache.org/job/kafka-2.1-jdk8/51/testReport/junit/kafka.api/SaslSslAdminClientIntegrationTest/testMinimumRequestTimeouts/

{code}
Error Message
java.lang.AssertionError: Expected an exception of type 
org.apache.kafka.common.errors.TimeoutException; got type 
org.apache.kafka.common.errors.SslAuthenticationException
Stacktrace
java.lang.AssertionError: Expected an exception of type 
org.apache.kafka.common.errors.TimeoutException; got type 
org.apache.kafka.common.errors.SslAuthenticationException
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at 
kafka.utils.TestUtils$.assertFutureExceptionTypeEquals(TestUtils.scala:1404)
at 
kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts(AdminClientIntegrationTest.scala:1080)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7312) Transient failure in kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts

2018-11-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7312:

Issue Type: Bug  (was: Sub-task)
Parent: (was: KAFKA-7645)

> Transient failure in 
> kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts
> 
>
> Key: KAFKA-7312
> URL: https://issues.apache.org/jira/browse/KAFKA-7312
> Project: Kafka
>  Issue Type: Bug
>  Components: unit tests
>Reporter: Guozhang Wang
>Priority: Major
>
> {code}
> Error Message
> org.junit.runners.model.TestTimedOutException: test timed out after 12 
> milliseconds
> Stacktrace
> org.junit.runners.model.TestTimedOutException: test timed out after 12 
> milliseconds
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Object.wait(Object.java:502)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:92)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:262)
>   at 
> kafka.utils.TestUtils$.assertFutureExceptionTypeEquals(TestUtils.scala:1345)
>   at 
> kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts(AdminClientIntegrationTest.scala:1080)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (KAFKA-7312) Transient failure in kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts

2018-11-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7312:

Comment: was deleted

(was: Here is another stacktrace from 
https://builds.apache.org/job/kafka-2.1-jdk8/51/testReport/junit/kafka.api/SaslSslAdminClientIntegrationTest/testMinimumRequestTimeouts/

{code}
Error Message
java.lang.AssertionError: Expected an exception of type 
org.apache.kafka.common.errors.TimeoutException; got type 
org.apache.kafka.common.errors.SslAuthenticationException
Stacktrace
java.lang.AssertionError: Expected an exception of type 
org.apache.kafka.common.errors.TimeoutException; got type 
org.apache.kafka.common.errors.SslAuthenticationException
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at 
kafka.utils.TestUtils$.assertFutureExceptionTypeEquals(TestUtils.scala:1404)
at 
kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts(AdminClientIntegrationTest.scala:1080)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
{code}
)

> Transient failure in 
> kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts
> 
>
> Key: KAFKA-7312
> URL: https://issues.apache.org/jira/browse/KAFKA-7312
> Project: Kafka
>  Issue Type: Bug
>  Components: unit tests
>Reporter: Guozhang Wang
>Priority: Major
>
> {code}
> Error Message
> org.junit.runners.model.TestTimedOutException: test timed out after 12 
> milliseconds
> Stacktrace
> org.junit.runners.model.TestTimedOutException: test timed out after 12 
> milliseconds
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Object.wait(Object.java:502)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:92)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:262)
>   at 
> kafka.utils.TestUtils$.assertFutureExceptionTypeEquals(TestUtils.scala:1345)
>   at 
> kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts(AdminClientIntegrationTest.scala:1080)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7651) Flaky test SaslSslAdminClientIntegrationTest.testMinimumRequestTimeouts

2018-11-16 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690173#comment-16690173
 ] 

Dong Lin commented on KAFKA-7651:
-

The test failure is related to the SSL handshake. In general the SSL handshake is a 
stateful operation without a timeout, so it is understandable that the SSL logic in 
the test may be flaky if a GC pause is long or there is an ephemeral network 
issue during the test.

The same test failure occurred on the 2.0 branch in 
https://builds.apache.org/job/kafka-2.0-jdk8/183/testReport/junit/kafka.api/SaslSslAdminClientIntegrationTest/testMinimumRequestTimeouts.

Since users have been running Kafka 2.0.0 without major issues, the test 
failure here should not be a blocking issue for the 2.1.0 release.

> Flaky test SaslSslAdminClientIntegrationTest.testMinimumRequestTimeouts
> ---
>
> Key: KAFKA-7651
> URL: https://issues.apache.org/jira/browse/KAFKA-7651
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Dong Lin
>Priority: Major
>
> Here is stacktrace from 
> https://builds.apache.org/job/kafka-2.1-jdk8/51/testReport/junit/kafka.api/SaslSslAdminClientIntegrationTest/testMinimumRequestTimeouts/
> {code}
> Error Message
> java.lang.AssertionError: Expected an exception of type 
> org.apache.kafka.common.errors.TimeoutException; got type 
> org.apache.kafka.common.errors.SslAuthenticationException
> Stacktrace
> java.lang.AssertionError: Expected an exception of type 
> org.apache.kafka.common.errors.TimeoutException; got type 
> org.apache.kafka.common.errors.SslAuthenticationException
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at 
> kafka.utils.TestUtils$.assertFutureExceptionTypeEquals(TestUtils.scala:1404)
>   at 
> kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts(AdminClientIntegrationTest.scala:1080)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7651) Flaky test SaslSslAdminClientIntegrationTest.testMinimumRequestTimeouts

2018-11-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7651:

Issue Type: Sub-task  (was: Task)
Parent: KAFKA-7645

> Flaky test SaslSslAdminClientIntegrationTest.testMinimumRequestTimeouts
> ---
>
> Key: KAFKA-7651
> URL: https://issues.apache.org/jira/browse/KAFKA-7651
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Dong Lin
>Priority: Major
>
> Here is stacktrace from 
> https://builds.apache.org/job/kafka-2.1-jdk8/51/testReport/junit/kafka.api/SaslSslAdminClientIntegrationTest/testMinimumRequestTimeouts/
> {code}
> Error Message
> java.lang.AssertionError: Expected an exception of type 
> org.apache.kafka.common.errors.TimeoutException; got type 
> org.apache.kafka.common.errors.SslAuthenticationException
> Stacktrace
> java.lang.AssertionError: Expected an exception of type 
> org.apache.kafka.common.errors.TimeoutException; got type 
> org.apache.kafka.common.errors.SslAuthenticationException
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at 
> kafka.utils.TestUtils$.assertFutureExceptionTypeEquals(TestUtils.scala:1404)
>   at 
> kafka.api.AdminClientIntegrationTest.testMinimumRequestTimeouts(AdminClientIntegrationTest.scala:1080)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7646) Flaky test SaslOAuthBearerSslEndToEndAuthorizationTest.testNoConsumeWithDescribeAclViaSubscribe

2018-11-16 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690174#comment-16690174
 ] 

Dong Lin commented on KAFKA-7646:
-

This issue is not captured by the automatic test runs in 
https://builds.apache.org/job/kafka-2.0-jdk8. It is hard to debug without a 
stacktrace.

Given that the test is related to SASL and it passes most of the time, for the 
same reason as explained in https://issues.apache.org/jira/browse/KAFKA-7651, 
this does not appear to be a blocking issue for the 2.1.0 release.

> Flaky test 
> SaslOAuthBearerSslEndToEndAuthorizationTest.testNoConsumeWithDescribeAclViaSubscribe
> ---
>
> Key: KAFKA-7646
> URL: https://issues.apache.org/jira/browse/KAFKA-7646
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Dong Lin
>Priority: Major
>
> This test was reported as failing by [~enothereska] as part of the 2.1.0 RC0 
> release certification.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7647) Flaky test LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic

2018-11-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7647:

Description: 

{code}
kafka.log.LogCleanerParameterizedIntegrationTest >
testCleansCombinedCompactAndDeleteTopic[3] FAILED
    java.lang.AssertionError: Contents of the map shouldn't change
expected:<Map(0 -> (340,340), 5 -> (345,345), 10 -> (350,350), 14 ->
(354,354), 1 -> (341,341), 6 -> (346,346), 9 -> (349,349), 13 -> (353,353),
2 -> (342,342), 17 -> (357,357), 12 -> (352,352), 7 -> (347,347), 3 ->
(343,343), 18 -> (358,358), 16 -> (356,356), 11 -> (351,351), 8 ->
(348,348), 19 -> (359,359), 4 -> (344,344), 15 -> (355,355))> but
was:<Map(0 -> (340,340), 5 -> (345,345), 10 -> (350,350), 14 -> (354,354),
1 -> (341,341), 6 -> (346,346), 9 -> (349,349), 13 -> (353,353), 2 ->
(342,342), 17 -> (357,357), 12 -> (352,352), 7 -> (347,347), 3 ->
(343,343), 18 -> (358,358), 16 -> (356,356), 11 -> (351,351), 99 ->
(299,299), 8 -> (348,348), 19 -> (359,359), 4 -> (344,344), 15 ->
(355,355))>
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:834)
        at org.junit.Assert.assertEquals(Assert.java:118)
        at
kafka.log.LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic(LogCleanerParameterizedIntegrationTest.scala:129)
{code}

  was:
kafka.log.LogCleanerParameterizedIntegrationTest >
testCleansCombinedCompactAndDeleteTopic[3] FAILED
    java.lang.AssertionError: Contents of the map shouldn't change
expected:<Map(0 -> (340,340), 5 -> (345,345), 10 -> (350,350), 14 ->
(354,354), 1 -> (341,341), 6 -> (346,346), 9 -> (349,349), 13 -> (353,353),
2 -> (342,342), 17 -> (357,357), 12 -> (352,352), 7 -> (347,347), 3 ->
(343,343), 18 -> (358,358), 16 -> (356,356), 11 -> (351,351), 8 ->
(348,348), 19 -> (359,359), 4 -> (344,344), 15 -> (355,355))> but
was:<Map(0 -> (340,340), 5 -> (345,345), 10 -> (350,350), 14 -> (354,354),
1 -> (341,341), 6 -> (346,346), 9 -> (349,349), 13 -> (353,353), 2 ->
(342,342), 17 -> (357,357), 12 -> (352,352), 7 -> (347,347), 3 ->
(343,343), 18 -> (358,358), 16 -> (356,356), 11 -> (351,351), 99 ->
(299,299), 8 -> (348,348), 19 -> (359,359), 4 -> (344,344), 15 ->
(355,355))>
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:834)
        at org.junit.Assert.assertEquals(Assert.java:118)
        at
kafka.log.LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic(LogCleanerParameterizedIntegrationTest.scala:129)


> Flaky test 
> LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic
> -
>
> Key: KAFKA-7647
> URL: https://issues.apache.org/jira/browse/KAFKA-7647
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Dong Lin
>Priority: Major
>
> {code}
> kafka.log.LogCleanerParameterizedIntegrationTest >
> testCleansCombinedCompactAndDeleteTopic[3] FAILED
>     java.lang.AssertionError: Contents of the map shouldn't change
> expected:<Map(0 -> (340,340), 5 -> (345,345), 10 -> (350,350), 14 ->
> (354,354), 1 -> (341,341), 6 -> (346,346), 9 -> (349,349), 13 -> (353,353),
> 2 -> (342,342), 17 -> (357,357), 12 -> (352,352), 7 -> (347,347), 3 ->
> (343,343), 18 -> (358,358), 16 -> (356,356), 11 -> (351,351), 8 ->
> (348,348), 19 -> (359,359), 4 -> (344,344), 15 -> (355,355))> but
> was:<Map(0 -> (340,340), 5 -> (345,345), 10 -> (350,350), 14 -> (354,354),
> 1 -> (341,341), 6 -> (346,346), 9 -> (349,349), 13 -> (353,353), 2 ->
> (342,342), 17 -> (357,357), 12 -> (352,352), 7 -> (347,347), 3 ->
> (343,343), 18 -> (358,358), 16 -> (356,356), 11 -> (351,351), 99 ->
> (299,299), 8 -> (348,348), 19 -> (359,359), 4 -> (344,344), 15 ->
> (355,355))>
>         at org.junit.Assert.fail(Assert.java:88)
>         at org.junit.Assert.failNotEquals(Assert.java:834)
>         at org.junit.Assert.assertEquals(Assert.java:118)
>         at
> kafka.log.LogCleanerParameterizedIntegrationTest.testCleansCombinedCompactAndDeleteTopic(LogCleanerParameterizedIntegrationTest.scala:129)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-5335) Controller should batch updatePartitionReassignmentData() operation

2018-12-13 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin resolved KAFKA-5335.
-
Resolution: Won't Do

> Controller should batch updatePartitionReassignmentData() operation
> ---
>
> Key: KAFKA-5335
> URL: https://issues.apache.org/jira/browse/KAFKA-5335
> Project: Kafka
>  Issue Type: Bug
>Reporter: Dong Lin
>Assignee: Dong Lin
>Priority: Major
>
> Currently the controller updates the partition reassignment data every time a 
> partition in the reassignment is completed. It means that if a user specifies a 
> huge reassignment znode of size 1 MB to move 10K partitions, the controller 
> will need to write roughly 0.5 MB * 10K = 5 GB of data to ZooKeeper in order to 
> complete this reassignment (on average about half of the 1 MB znode remains for 
> each of the 10K writes). This is because the controller rewrites the remaining 
> partitions to the znode every time a partition is completely moved.
> This is problematic because such a huge reassignment may greatly slow down the 
> Kafka controller. Note that partition reassignment doesn't necessarily cause 
> data movement between brokers, because we may use it only to reorder the 
> replica list of partitions to evenly distribute preferred leaders.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7297) Both read/write access to Log.segments should be protected by lock

2018-12-13 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7297:

Reporter: Zhanxiang (Patrick) Huang  (was: Dong Lin)

> Both read/write access to Log.segments should be protected by lock
> --
>
> Key: KAFKA-7297
> URL: https://issues.apache.org/jira/browse/KAFKA-7297
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Zhanxiang (Patrick) Huang
>Assignee: Dong Lin
>Priority: Major
>
> Log.replaceSegments() updates segments in two steps: it first adds the new 
> segments and then removes the old segments. Though this operation is protected 
> by a lock, read access such as Log.logSegments does not grab the lock, and thus 
> these methods may return an inconsistent view of the segments.
> As an example, say Log.replaceSegments() intends to replace segments [0, 100), 
> [100, 200) with a new segment [0, 200). If Log.logSegments is called right 
> after the new segment is added, the method may return segments [0, 200), 
> [100, 200), and messages in the range [100, 200) may be duplicated if the 
> caller chooses to enumerate all messages in all segments returned by the 
> method.
> The solution is probably to protect read/write access to Log.segments with a 
> read/write lock.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7297) Both read/write access to Log.segments should be protected by lock

2018-12-13 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7297:

Reporter: Dong Lin  (was: Zhanxiang (Patrick) Huang)

> Both read/write access to Log.segments should be protected by lock
> --
>
> Key: KAFKA-7297
> URL: https://issues.apache.org/jira/browse/KAFKA-7297
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dong Lin
>Assignee: Dong Lin
>Priority: Major
>
> Log.replaceSegments() updates segments in two steps: it first adds the new 
> segments and then removes the old segments. Though this operation is protected 
> by a lock, read access such as Log.logSegments does not grab the lock, and thus 
> these methods may return an inconsistent view of the segments.
> As an example, say Log.replaceSegments() intends to replace segments [0, 100), 
> [100, 200) with a new segment [0, 200). If Log.logSegments is called right 
> after the new segment is added, the method may return segments [0, 200), 
> [100, 200), and messages in the range [100, 200) may be duplicated if the 
> caller chooses to enumerate all messages in all segments returned by the 
> method.
> The solution is probably to protect read/write access to Log.segments with a 
> read/write lock.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-7297) Both read/write access to Log.segments should be protected by lock

2018-12-13 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin reassigned KAFKA-7297:
---

Assignee: Zhanxiang (Patrick) Huang  (was: Dong Lin)

> Both read/write access to Log.segments should be protected by lock
> --
>
> Key: KAFKA-7297
> URL: https://issues.apache.org/jira/browse/KAFKA-7297
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dong Lin
>Assignee: Zhanxiang (Patrick) Huang
>Priority: Major
>
> Log.replaceSegments() updates segments in two steps: it first adds the new 
> segments and then removes the old segments. Though this operation is protected 
> by a lock, read access such as Log.logSegments does not grab the lock, and thus 
> these methods may return an inconsistent view of the segments.
> As an example, say Log.replaceSegments() intends to replace segments [0, 100), 
> [100, 200) with a new segment [0, 200). If Log.logSegments is called right 
> after the new segment is added, the method may return segments [0, 200), 
> [100, 200), and messages in the range [100, 200) may be duplicated if the 
> caller chooses to enumerate all messages in all segments returned by the 
> method.
> The solution is probably to protect read/write access to Log.segments with a 
> read/write lock.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7297) Both read/write access to Log.segments should be protected by lock

2018-12-13 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720876#comment-16720876
 ] 

Dong Lin commented on KAFKA-7297:
-

[~junrao] Ah, I will not be able to work on this. Is this an urgent issue? 
[~hzxa21] is interested in working on this issue.

> Both read/write access to Log.segments should be protected by lock
> --
>
> Key: KAFKA-7297
> URL: https://issues.apache.org/jira/browse/KAFKA-7297
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dong Lin
>Assignee: Dong Lin
>Priority: Major
>
> Log.replaceSegments() updates segments in two steps: it first adds the new 
> segments and then removes the old segments. Though this operation is protected 
> by a lock, read access such as Log.logSegments does not grab the lock, and thus 
> these methods may return an inconsistent view of the segments.
> As an example, say Log.replaceSegments() intends to replace segments [0, 100), 
> [100, 200) with a new segment [0, 200). If Log.logSegments is called right 
> after the new segment is added, the method may return segments [0, 200), 
> [100, 200), and messages in the range [100, 200) may be duplicated if the 
> caller chooses to enumerate all messages in all segments returned by the 
> method.
> The solution is probably to protect read/write access to Log.segments with a 
> read/write lock.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7297) Both read/write access to Log.segments should be protected by lock

2018-12-14 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722044#comment-16722044
 ] 

Dong Lin commented on KAFKA-7297:
-

[~junrao] After going through the code and the Javadoc, it appears that 
currently the only "correctness" concerns due to this issue are 1) 
Log.deleteRetentionSizeBreachedSegments() may delete a segment that otherwise 
would not be deleted, and 2) ReplicaManager.describeLogDirs(), and thus the size 
reported in the DescribeLogDirsResponse, may be larger than the actual value, 
which causes overestimation on the user side.

The probability of this happening is very low and the impact seems minor. I 
agree it may not be causing problems now.

Instead of documenting that the iterator may return overlapping segments, would 
it be better to fix the issue described in this JIRA so that we will not see 
overlapping segments in the iterator? As you mentioned in the earlier comment, 
since all callers of Log.logSegments seem to either hold the lock already or be 
infrequent, the overhead seems OK, right?
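For reference, here is a minimal sketch of the read/write-lock idea (placeholder 
names, not the actual Log.scala code): mutations such as replaceSegments() run 
under the write lock, while the logSegments read path takes the read lock, so a 
reader can never observe the intermediate state where both the new and the old 
segments are present.

{code:scala}
import java.util.concurrent.ConcurrentSkipListMap
import java.util.concurrent.locks.ReentrantReadWriteLock
import scala.collection.JavaConverters._

// Illustrative sketch only: a segment map guarded by a read/write lock.
class SegmentMapSketch[S] {
  private val segments = new ConcurrentSkipListMap[java.lang.Long, S]()
  private val lock = new ReentrantReadWriteLock()

  private def inWriteLock[T](f: => T): T = {
    lock.writeLock().lock()
    try f finally lock.writeLock().unlock()
  }

  private def inReadLock[T](f: => T): T = {
    lock.readLock().lock()
    try f finally lock.readLock().unlock()
  }

  // Mirrors the two-step replaceSegments(): add the new segment, then drop the
  // old ones, all under the write lock so no reader sees the overlapping state.
  def replace(newBaseOffset: Long, newSegment: S, oldBaseOffsets: Seq[Long]): Unit =
    inWriteLock {
      segments.put(newBaseOffset, newSegment)
      oldBaseOffsets.filter(_ != newBaseOffset).foreach(o => segments.remove(o))
    }

  // Mirrors logSegments: the snapshot is taken under the read lock, so it is
  // always consistent with respect to any in-flight replace().
  def logSegments: Seq[S] = inReadLock {
    segments.values.asScala.toSeq
  }
}
{code}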




> Both read/write access to Log.segments should be protected by lock
> --
>
> Key: KAFKA-7297
> URL: https://issues.apache.org/jira/browse/KAFKA-7297
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dong Lin
>Assignee: Zhanxiang (Patrick) Huang
>Priority: Major
>
> Log.replaceSegments() updates segments in two steps: it first adds the new 
> segments and then removes the old segments. Though this operation is protected 
> by a lock, read access such as Log.logSegments does not grab the lock, and thus 
> these methods may return an inconsistent view of the segments.
> As an example, say Log.replaceSegments() intends to replace segments [0, 100), 
> [100, 200) with a new segment [0, 200). If Log.logSegments is called right 
> after the new segment is added, the method may return segments [0, 200), 
> [100, 200), and messages in the range [100, 200) may be duplicated if the 
> caller chooses to enumerate all messages in all segments returned by the 
> method.
> The solution is probably to protect read/write access to Log.segments with a 
> read/write lock.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7297) Both read/write access to Log.segments should be protected by lock

2018-12-15 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722331#comment-16722331
 ] 

Dong Lin commented on KAFKA-7297:
-

[~junrao] Yeah, I agree. [~hzxa21], does this solution sound reasonable?

> Both read/write access to Log.segments should be protected by lock
> --
>
> Key: KAFKA-7297
> URL: https://issues.apache.org/jira/browse/KAFKA-7297
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dong Lin
>Assignee: Zhanxiang (Patrick) Huang
>Priority: Major
>
> Log.replaceSegments() updates segments in two steps: it first adds the new 
> segments and then removes the old segments. Though this operation is protected 
> by a lock, read access such as Log.logSegments does not grab the lock, and thus 
> these methods may return an inconsistent view of the segments.
> As an example, say Log.replaceSegments() intends to replace segments [0, 100), 
> [100, 200) with a new segment [0, 200). If Log.logSegments is called right 
> after the new segment is added, the method may return segments [0, 200), 
> [100, 200), and messages in the range [100, 200) may be duplicated if the 
> caller chooses to enumerate all messages in all segments returned by the 
> method.
> The solution is probably to protect read/write access to Log.segments with a 
> read/write lock.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7297) Both read/write access to Log.segments should be protected by lock

2018-12-15 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722347#comment-16722347
 ] 

Dong Lin commented on KAFKA-7297:
-

[~ijuma] Atomic update is another good alternative to the read/write lock 
discussed above. Note that atomic update still requires a lock to avoid race 
conditions between concurrent mutations. If mutations are rare compared to 
reads, then both solutions should have low performance overhead. Atomic update 
has a bit of extra memory overhead (due to copying the segments), whereas the 
read/write lock solution has a bit of extra lock overhead (because a read must 
wait for a concurrent mutation). Could you give a bit more detail on why atomic 
update is better in this case?
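For comparison, a minimal sketch of the copy-on-write ("atomic update") 
alternative, again with placeholder names rather than the actual implementation: 
mutations copy the map under a plain lock and publish the copy atomically, while 
reads dereference the current snapshot without any locking.

{code:scala}
import java.util.concurrent.atomic.AtomicReference

// Illustrative sketch only: a copy-on-write segment map.
class CopyOnWriteSegmentsSketch[S] {
  private val snapshot = new AtomicReference[Map[Long, S]](Map.empty[Long, S])
  private val mutationLock = new Object // serializes the (rare) mutations

  // Remove the old segments and add the new one in a single atomic publish,
  // so a reader never observes the new and the old segments together.
  def replace(newBaseOffset: Long, newSegment: S, oldBaseOffsets: Seq[Long]): Unit =
    mutationLock.synchronized {
      val updated = (snapshot.get() -- oldBaseOffsets) + (newBaseOffset -> newSegment)
      snapshot.set(updated)
    }

  // Lock-free read path: whichever snapshot is published is internally consistent.
  def logSegments: Iterable[S] = snapshot.get().values
}
{code}

This is where the trade-off above shows up: the copy-on-write version pays extra 
memory and CPU on each (rare) mutation to copy the map, while the read/write-lock 
version pays a small locking cost on every read.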

> Both read/write access to Log.segments should be protected by lock
> --
>
> Key: KAFKA-7297
> URL: https://issues.apache.org/jira/browse/KAFKA-7297
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dong Lin
>Assignee: Zhanxiang (Patrick) Huang
>Priority: Major
>
> Log.replaceSegments() updates segments in two steps: it first adds the new 
> segments and then removes the old segments. Though this operation is protected 
> by a lock, read access such as Log.logSegments does not grab the lock, and thus 
> these methods may return an inconsistent view of the segments.
> As an example, say Log.replaceSegments() intends to replace segments [0, 100), 
> [100, 200) with a new segment [0, 200). If Log.logSegments is called right 
> after the new segment is added, the method may return segments [0, 200), 
> [100, 200), and messages in the range [100, 200) may be duplicated if the 
> caller chooses to enumerate all messages in all segments returned by the 
> method.
> The solution is probably to protect read/write access to Log.segments with a 
> read/write lock.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7297) Both read/write access to Log.segments should be protected by lock

2018-12-17 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723236#comment-16723236
 ] 

Dong Lin commented on KAFKA-7297:
-

[~ijuma] Hmm.. I think we may have different understandings. Jun originally did 
mention using a materialized view (which involves copying) to handle the fact 
that the underlying map may be changed. In the latest discussion, we agreed that 
this is OK for the underlying view because ConcurrentSkipListMap supports weak 
consistency.

So what we are trying to address in this ticket is the issue in the JIRA 
description, i.e. an additional entry may be returned by `Log.logSegments`. 
There are two solutions to this problem. One is the atomic update, which 
involves an extra copy for the write operation. The other is a read/write lock, 
which requires a lock for the read operation but needs no copy for either read 
or write. Is this understanding correct?




> Both read/write access to Log.segments should be protected by lock
> --
>
> Key: KAFKA-7297
> URL: https://issues.apache.org/jira/browse/KAFKA-7297
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dong Lin
>Assignee: Zhanxiang (Patrick) Huang
>Priority: Major
>
> Log.replaceSegments() updates segments in two steps: it first adds the new 
> segments and then removes the old segments. Though this operation is protected 
> by a lock, read access such as Log.logSegments does not grab the lock, and thus 
> these methods may return an inconsistent view of the segments.
> As an example, say Log.replaceSegments() intends to replace segments [0, 100), 
> [100, 200) with a new segment [0, 200). If Log.logSegments is called right 
> after the new segment is added, the method may return segments [0, 200), 
> [100, 200), and messages in the range [100, 200) may be duplicated if the 
> caller chooses to enumerate all messages in all segments returned by the 
> method.
> The solution is probably to protect read/write access to Log.segments with a 
> read/write lock.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-7829) Inaccurate description for --reassignment-json-file option in ReassignPartitionsCommand

2019-01-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin reassigned KAFKA-7829:
---

Assignee: Dong Lin

> Inaccurate description for --reassignment-json-file option in 
> ReassignPartitionsCommand
> ---
>
> Key: KAFKA-7829
> URL: https://issues.apache.org/jira/browse/KAFKA-7829
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jun Rao
>Assignee: Dong Lin
>Priority: Major
>
> In ReassignPartitionsCommand, the --reassignment-json-file option has the 
> following. This seems inaccurate since we support moving existing replicas to 
> new log dirs.
> If absolute log directory path is specified, it is currently required that " +
> "the replica has not already been created on that broker. The replica will 
> then be created in the specified log directory on the broker later



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7829) Inaccurate description for --reassignment-json-file option in ReassignPartitionsCommand

2019-01-16 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744321#comment-16744321
 ] 

Dong Lin commented on KAFKA-7829:
-

[~junrao] Yeah, this is inaccurate now that 
https://github.com/apache/kafka/pull/3874 has been merged. I will submit a PR to 
fix this.

> Inaccurate description for --reassignment-json-file option in 
> ReassignPartitionsCommand
> ---
>
> Key: KAFKA-7829
> URL: https://issues.apache.org/jira/browse/KAFKA-7829
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jun Rao
>Priority: Major
>
> In ReassignPartitionsCommand, the --reassignment-json-file option has the 
> following. This seems inaccurate since we support moving existing replicas to 
> new log dirs.
> If absolute log directory path is specified, it is currently required that " +
> "the replica has not already been created on that broker. The replica will 
> then be created in the specified log directory on the broker later



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7829) Javadoc should show that alterReplicaLogDirs() is supported in Kafka 1.1.0 or later

2019-01-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7829:

Summary: Javadoc should show that alterReplicaLogDirs() is supported in 
Kafka 1.1.0 or later  (was: Inaccurate description for --reassignment-json-file 
option in ReassignPartitionsCommand)

> Javadoc should show that alterReplicaLogDirs() is supported in Kafka 1.1.0 or 
> later
> ---
>
> Key: KAFKA-7829
> URL: https://issues.apache.org/jira/browse/KAFKA-7829
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jun Rao
>Assignee: Dong Lin
>Priority: Major
>
> In ReassignPartitionsCommand, the --reassignment-json-file option has the 
> following. This seems inaccurate since we support moving existing replicas to 
> new log dirs.
> If absolute log directory path is specified, it is currently required that " +
> "the replica has not already been created on that broker. The replica will 
> then be created in the specified log directory on the broker later



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7829) Javadoc should show that alterReplicaLogDirs() is supported in Kafka 1.1.0 or later

2019-01-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7829:

Description: 
In ReassignPartitionsCommand, the --reassignment-json-file option says "...If 
absolute log directory path is specified, it is currently required that the 
replica has not already been created on that broker...". This is inaccurate 
since we support moving existing replicas to new log dirs.

In addition, the Javadoc of AdminClient.alterReplicaLogDirs(...) should show 
the API is supported by brokers with version 1.1.0 or later.

  was:
In ReassignPartitionsCommand, the --reassignment-json-file option has the 
following. This seems inaccurate since we support moving existing replicas to 
new log dirs.

If absolute log directory path is specified, it is currently required that " +
"the replica has not already been created on that broker. The replica will then 
be created in the specified log directory on the broker later


> Javadoc should show that alterReplicaLogDirs() is supported in Kafka 1.1.0 or 
> later
> ---
>
> Key: KAFKA-7829
> URL: https://issues.apache.org/jira/browse/KAFKA-7829
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jun Rao
>Assignee: Dong Lin
>Priority: Major
>
> In ReassignPartitionsCommand, the --reassignment-json-file option says "...If 
> absolute log directory path is specified, it is currently required that the 
> replica has not already been created on that broker...". This is inaccurate 
> since we support moving existing replicas to new log dirs.
> In addition, the Javadoc of AdminClient.alterReplicaLogDirs(...) should show 
> the API is supported by brokers with version 1.1.0 or later.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7829) Javadoc should show that AdminClient.alterReplicaLogDirs() is supported in Kafka 1.1.0 or later

2019-01-16 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7829:

Summary: Javadoc should show that AdminClient.alterReplicaLogDirs() is 
supported in Kafka 1.1.0 or later  (was: Javadoc should show that 
alterReplicaLogDirs() is supported in Kafka 1.1.0 or later)

> Javadoc should show that AdminClient.alterReplicaLogDirs() is supported in 
> Kafka 1.1.0 or later
> ---
>
> Key: KAFKA-7829
> URL: https://issues.apache.org/jira/browse/KAFKA-7829
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jun Rao
>Assignee: Dong Lin
>Priority: Major
>
> In ReassignPartitionsCommand, the --reassignment-json-file option says "...If 
> absolute log directory path is specified, it is currently required that the 
> replica has not already been created on that broker...". This is inaccurate 
> since we support moving existing replicas to new log dirs.
> In addition, the Javadoc of AdminClient.alterReplicaLogDirs(...) should show 
> the API is supported by brokers with version 1.1.0 or later.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7836) The propagation of log dir failure can be delayed due to slowness in closing the file handles

2019-01-17 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745816#comment-16745816
 ] 

Dong Lin commented on KAFKA-7836:
-

[~junrao] This solution sounds good to me.

> The propagation of log dir failure can be delayed due to slowness in closing 
> the file handles
> -
>
> Key: KAFKA-7836
> URL: https://issues.apache.org/jira/browse/KAFKA-7836
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jun Rao
>Priority: Major
>
> In ReplicaManager.handleLogDirFailure(), we call 
> zkClient.propagateLogDirEvent after  logManager.handleLogDirFailure. The 
> latter closes the file handles of the offline replicas, which could take time 
> when the disk is bad. This will delay the new leader election by the 
> controller. In one incident, we have seen the closing of file handles of 
> multiple replicas taking more than 20 seconds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7837) maybeShrinkIsr may not reflect OfflinePartitions immediately

2019-01-17 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745832#comment-16745832
 ] 

Dong Lin commented on KAFKA-7837:
-

[~junrao] Currently `partition.maybeShrinkIsr(...)` reads 
`leaderReplicaIfLocal` to determine whether the partition is still the leader. 
When there is a disk failure, we can also set `partition.leaderReplicaIfLocal = 
None` in `replicaManager.handleLogDirFailure(...)` for every partition on the 
offline disk, so that the broker will no longer shrink the ISR for these 
partitions. I personally feel this approach is probably simpler than accessing 
ReplicaManager.allPartitions().
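A minimal sketch of that suggestion (simplified placeholder types, not the 
actual ReplicaManager/Partition code): clearing the local leader reference on 
log dir failure makes the periodic ISR-shrink pass skip those partitions, even 
if its iteration snapshot was taken before the failure.

{code:scala}
object LogDirFailureSketch {
  case class Replica(brokerId: Int)

  class PartitionSketch(val name: String) {
    @volatile var leaderReplicaIfLocal: Option[Replica] = Some(Replica(1))

    // Mirrors maybeShrinkIsr(): only act if this broker still considers itself the leader.
    def maybeShrinkIsr(): Unit = leaderReplicaIfLocal match {
      case Some(_) => () // check lagging followers and possibly shrink the ISR
      case None    => () // not the local leader (e.g. its log dir is offline): skip
    }
  }

  // Mirrors the suggestion for handleLogDirFailure(): clear the local leader
  // reference for every partition on the failed disk, so a concurrent or later
  // maybeShrinkIsr() pass leaves their ISRs untouched.
  def handleLogDirFailure(partitionsOnOfflineDir: Seq[PartitionSketch]): Unit =
    partitionsOnOfflineDir.foreach(_.leaderReplicaIfLocal = None)
}
{code}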

> maybeShrinkIsr may not reflect OfflinePartitions immediately
> 
>
> Key: KAFKA-7837
> URL: https://issues.apache.org/jira/browse/KAFKA-7837
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jun Rao
>Priority: Major
>
> When a partition is marked offline due to a failed disk, the leader is 
> supposed to not shrink its ISR any more. In ReplicaManager.maybeShrinkIsr(), 
> we iterate through all non-offline partitions to shrink the ISR. If an ISR 
> needs to shrink, we need to write the new ISR to ZK, which can take a bit of 
> time. In this window, some partitions could now be marked as offline, but may 
> not be picked up by the iterator since it only reflects the state at that 
> point. This can cause all in-sync followers to be dropped out of ISR 
> unnecessarily and prevents a clean leader election.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6806) Unable to validate sink connectors without "topics" component which is not required

2018-06-11 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509256#comment-16509256
 ] 

Dong Lin commented on KAFKA-6806:
-

Hey [~rhauch], this issue has been marked as a blocker for 1.1.1, but no one 
seems to be actively working on it. Do you know who will be working on this 
issue, or do we actually need to block the 1.1.1 release until it is fixed?

 

> Unable to validate sink connectors without "topics" component which is not 
> required
> ---
>
> Key: KAFKA-6806
> URL: https://issues.apache.org/jira/browse/KAFKA-6806
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 1.1.0
> Environment: CP4.1., Centos7
>Reporter: Ivan Majnarić
>Priority: Blocker
> Fix For: 2.0.0, 1.1.1
>
>
> The bug happens when you try to create a new connector through, for example, 
> kafka-connect-ui.
> Previously, both source and sink connectors could be validated through REST 
> without "topics" as an add-on to "connector.class", like this:
> {code:java}
> PUT / 
> http://connect-url:8083/connector-plugins/com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector/config/validate
> {
>     "connector.class": 
> "com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector",
> }{code}
> In the new version, CP4.1, you can still validate *source connectors* but not 
> *sink connectors*. If you want to validate sink connectors, you need to add 
> the "topics" config to the request, like:
> {code:java}
> PUT / 
> http://connect-url:8083/connector-plugins/com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector/config/validate
> {
>     "connector.class": 
> "com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector",
> "topics": "test-topic"
> }{code}
> So there is a little mismatch in the ways connectors are validated, which I 
> think happened accidentally.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7051) Improve the efficiency of the ReplicaManager when there are many partitions

2018-06-12 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7051:

Fix Version/s: 2.0.0

> Improve the efficiency of the ReplicaManager when there are many partitions
> ---
>
> Key: KAFKA-7051
> URL: https://issues.apache.org/jira/browse/KAFKA-7051
> Project: Kafka
>  Issue Type: Bug
>  Components: replication
>Affects Versions: 0.8.0
>Reporter: Colin P. McCabe
>Assignee: Colin P. McCabe
>Priority: Major
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7019) Reduction the contention between metadata update and metadata read operation

2018-06-13 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7019:

Description: 
Currently MetadataCache.updateCache() grabs a write lock in order to process 
the UpdateMetadataRequest from the controller, and a read lock is needed in 
order to handle the MetadataRequest from clients. Thus the handling of 
MetadataRequest and UpdateMetadataRequest block each other, and the broker can 
only process one such request at a time even if there are multiple request 
handler threads. Note that the broker cannot process MetadataRequest in parallel 
if there is an UpdateMetadataRequest waiting for the write lock, even though 
MetadataRequest only requires the read lock to be processed.

For a large cluster which has tens of thousands of partitions, it can take e.g. 
200 ms to process an UpdateMetadataRequest or a MetadataRequest from large 
clients (e.g. MM). During the period when a user is rebalancing the cluster, the 
leadership changes will cause both UpdateMetadataRequests from the controller 
and MetadataRequests from clients. If a broker receives 10 MetadataRequests per 
second and 2 UpdateMetadataRequests per second on average, since these requests 
are processed one at a time, this can reduce the request handler thread idle 
ratio to 0, which makes the broker unavailable to users.

We can address this problem by removing the read lock in MetadataCache. The idea 
is that MetadataCache.updateCache() can instantiate a new copy of the cache as a 
method-local variable while it is processing the UpdateMetadataRequest, and 
replace the class-private variable with the newly instantiated method-local 
variable at the end of MetadataCache.updateCache(). The handling of 
MetadataRequest then only requires access to the read-only class-private 
variable.
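A minimal sketch of this copy-on-write idea (hypothetical field and method 
names, not the actual MetadataCache code): updateCache() builds a new immutable 
snapshot as a local value and publishes it with a single volatile write, so 
MetadataRequest handling never takes a lock.

{code:scala}
// Illustrative sketch only.
class MetadataCacheSketch {
  private case class Snapshot(leaderByPartition: Map[String, Int])

  @volatile private var snapshot = Snapshot(Map.empty)
  private val updateLock = new Object // only UpdateMetadataRequest handling needs this

  // Handles UpdateMetadataRequest: compute the new cache as a local value,
  // then swap it in at the very end with one volatile write.
  def updateCache(updatedLeaders: Map[String, Int]): Unit = updateLock.synchronized {
    val next = Snapshot(snapshot.leaderByPartition ++ updatedLeaders)
    snapshot = next
  }

  // Handles MetadataRequest: reads the current immutable snapshot without locking.
  def leaderFor(partition: String): Option[Int] =
    snapshot.leaderByPartition.get(partition)
}
{code}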

  was:
Currently MetadataCache.updateCache() grabs a write lock in order to process 
the UpdateMetadataRequest from the controller, and a read lock is needed in 
order to handle the MetadataRequest from clients. Thus the handling of 
MetadataRequest and UpdateMetadataRequest block each other, and the broker can 
only process one such request at a time even if there are multiple request 
handler threads. Note that the broker cannot process MetadataRequest in parallel 
if there is an UpdateMetadataRequest waiting for the write lock, even though 
MetadataRequest only requires the read lock to be processed.

For a large cluster which has tens of thousands of partitions, it can take e.g. 
200 ms to process an UpdateMetadataRequest or a MetadataRequest from large 
clients (e.g. MM). During the period when a user is rebalancing the cluster, the 
leadership changes will cause both UpdateMetadataRequests from the controller 
and MetadataRequests from clients. If a broker receives 10 MetadataRequests per 
second and 2 UpdateMetadataRequests per second on average, since these requests 
are processed one at a time, this can reduce the request handler thread idle 
ratio to 0, which makes the broker unavailable to users.

We can address this problem by removing the read/write lock in MetadataCache. 
The idea is that MetadataCache.updateCache() can instantiate a new copy of the 
cache as a method-local variable while it is processing the 
UpdateMetadataRequest, and replace the class-private variable with the newly 
instantiated method-local variable at the end of MetadataCache.updateCache(). 
All these can be done without grabbing any lock. The handling of MetadataRequest 
only requires access to the read-only class-private variable.

 


> Reduction the contention between metadata update and metadata read operation
> 
>
> Key: KAFKA-7019
> URL: https://issues.apache.org/jira/browse/KAFKA-7019
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Reporter: Dong Lin
>Assignee: Radai Rosenblatt
>Priority: Major
> Fix For: 2.1.0
>
>
> Currently MetadataCache.updateCache() grabs a write lock in order to process 
> the UpdateMetadataRequest from the controller, and a read lock is needed in 
> order to handle the MetadataRequest from clients. Thus the handling of 
> MetadataRequest and UpdateMetadataRequest block each other, and the broker 
> can only process one such request at a time even if there are multiple request 
> handler threads. Note that the broker cannot process MetadataRequest in 
> parallel if there is an UpdateMetadataRequest waiting for the write lock, even 
> though MetadataRequest only requires the read lock to be processed.
> For a large cluster which has tens of thousands of partitions, it can take 
> e.g. 200 ms to process an UpdateMetadataRequest or a MetadataRequest from 
> large clients (e.g. MM). During the period when a user is rebalancing the 
> cluster, the leadership changes will cause both UpdateMetadataRequests from the controller and
> also MetadataReq

[jira] [Updated] (KAFKA-6809) connections-created metric does not behave as expected

2018-06-19 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-6809:

Fix Version/s: (was: 1.1.1)
   1.1.2

> connections-created metric does not behave as expected
> --
>
> Key: KAFKA-6809
> URL: https://issues.apache.org/jira/browse/KAFKA-6809
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 1.1.0, 1.0.1
>Reporter: Anna Povzner
>Priority: Major
> Fix For: 2.1.0, 1.1.2
>
>
> "connections-created" sensor is described as "new connections established". 
> It currently records only connections that the broker/client creates, but 
> does not count connections received. Seems like we should also count 
> connections received – either include them into this metric (and also clarify 
> the description) or add a new metric (separately counting two types of 
> connections). I am not sure how useful is to separate them, so I think we 
> should do the first approach.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6809) connections-created metric does not behave as expected

2018-06-19 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517621#comment-16517621
 ] 

Dong Lin commented on KAFKA-6809:
-

Moving from 1.1.1 to 1.1.2 as there is no open PR for 1.1.1.

 

> connections-created metric does not behave as expected
> --
>
> Key: KAFKA-6809
> URL: https://issues.apache.org/jira/browse/KAFKA-6809
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 1.1.0, 1.0.1
>Reporter: Anna Povzner
>Priority: Major
> Fix For: 2.1.0, 1.1.2
>
>
> "connections-created" sensor is described as "new connections established". 
> It currently records only connections that the broker/client creates, but 
> does not count connections received. Seems like we should also count 
> connections received – either include them into this metric (and also clarify 
> the description) or add a new metric (separately counting two types of 
> connections). I am not sure how useful is to separate them, so I think we 
> should do the first approach.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6836) Upgrade metrics library

2018-06-19 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517619#comment-16517619
 ] 

Dong Lin commented on KAFKA-6836:
-

Moving from 1.1.1 to 1.1.2 as there is no open PR for 1.1.1.

> Upgrade metrics library
> ---
>
> Key: KAFKA-6836
> URL: https://issues.apache.org/jira/browse/KAFKA-6836
> Project: Kafka
>  Issue Type: Improvement
>  Components: metrics
>Affects Versions: 1.1.0
>Reporter: Mario Molina
>Assignee: Mario Molina
>Priority: Major
> Fix For: 2.1.0, 1.1.2
>
>
> The metrics library which Kafka is currently using is pretty old (version 
> 2.2.0 from Yammer, while Dropwizard is now at 4.x).
> The latest versions of the Dropwizard library (which descends from the 
> now-deprecated Yammer library) include a lot of bugfixes and new features that 
> could be interesting for the metrics (e.g. Reservoirs, JDK 9 support, etc).
> This patch upgrades to this new version of the library so that we can add new 
> features to Kafka metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-6836) Upgrade metrics library

2018-06-19 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-6836:

Fix Version/s: (was: 1.1.1)
   1.1.2

> Upgrade metrics library
> ---
>
> Key: KAFKA-6836
> URL: https://issues.apache.org/jira/browse/KAFKA-6836
> Project: Kafka
>  Issue Type: Improvement
>  Components: metrics
>Affects Versions: 1.1.0
>Reporter: Mario Molina
>Assignee: Mario Molina
>Priority: Major
> Fix For: 2.1.0, 1.1.2
>
>
> The metrics library which Kafka is currently using is pretty old (version 
> 2.2.0 from Yammer, while Dropwizard is now at 4.x).
> The latest versions of the Dropwizard library (which descends from the 
> now-deprecated Yammer library) include a lot of bugfixes and new features that 
> could be interesting for the metrics (e.g. Reservoirs, JDK 9 support, etc).
> This patch upgrades to this new version of the library so that we can add new 
> features to Kafka metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-5975) No response when deleting topics and delete.topic.enable=false

2018-06-19 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517623#comment-16517623
 ] 

Dong Lin edited comment on KAFKA-5975 at 6/19/18 11:23 PM:
---

Moving from 1.1.1 to 1.1.2 since it is marked as Priority Minor and the PR 
still needs review.


was (Author: lindong):
Moving from 1.1.1 to 1.1.2 since it is marked as Priority Minor and the PR 
still needs review

> No response when deleting topics and delete.topic.enable=false
> --
>
> Key: KAFKA-5975
> URL: https://issues.apache.org/jira/browse/KAFKA-5975
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.11.0.0
>Reporter: Mario Molina
>Assignee: Mario Molina
>Priority: Minor
> Fix For: 1.1.2
>
>
> When trying to delete topics using the KafkaAdminClient while the server 
> config flag is set to 'delete.topic.enable=false', the client cannot get a 
> response and fails with a timeout error. This is because the 
> DelayedCreatePartitions object cannot complete the operation.
> This bug fix modifies the handling of the KafkaApis key DELETE_TOPICS to take 
> into account that the flag can be disabled and to swallow the error to the 
> client, that is, the topic is never removed and no error is returned to the 
> client.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-5975) No response when deleting topics and delete.topic.enable=false

2018-06-19 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517623#comment-16517623
 ] 

Dong Lin commented on KAFKA-5975:
-

Moving from 1.1.1 to 1.1.2 since it is marked as Priority Minor and the PR 
still needs review

> No response when deleting topics and delete.topic.enable=false
> --
>
> Key: KAFKA-5975
> URL: https://issues.apache.org/jira/browse/KAFKA-5975
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.11.0.0
>Reporter: Mario Molina
>Assignee: Mario Molina
>Priority: Minor
> Fix For: 1.1.2
>
>
> When trying to delete topics using the KafkaAdminClient while the server 
> config flag is set to 'delete.topic.enable=false', the client cannot get a 
> response and fails with a timeout error. This is because the 
> DelayedCreatePartitions object cannot complete the operation.
> This bug fix modifies the handling of the KafkaApis key DELETE_TOPICS to take 
> into account that the flag can be disabled and to swallow the error to the 
> client, that is, the topic is never removed and no error is returned to the 
> client.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5975) No response when deleting topics and delete.topic.enable=false

2018-06-19 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5975:

Fix Version/s: (was: 1.1.1)
   1.1.2

> No response when deleting topics and delete.topic.enable=false
> --
>
> Key: KAFKA-5975
> URL: https://issues.apache.org/jira/browse/KAFKA-5975
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.11.0.0
>Reporter: Mario Molina
>Assignee: Mario Molina
>Priority: Minor
> Fix For: 1.1.2
>
>
> When trying to delete topics using the KafkaAdminClient while the server 
> config flag is set to 'delete.topic.enable=false', the client cannot get a 
> response and fails with a timeout error. This is because the 
> DelayedCreatePartitions object cannot complete the operation.
> This bug fix modifies the handling of the KafkaApis key DELETE_TOPICS to take 
> into account that the flag can be disabled and to swallow the error to the 
> client, that is, the topic is never removed and no error is returned to the 
> client.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7012) Performance issue upgrading to kafka 1.0.1 or 1.1

2018-06-20 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7012:

Fix Version/s: (was: 1.1.1)
   1.1.2

> Performance issue upgrading to kafka 1.0.1 or 1.1
> -
>
> Key: KAFKA-7012
> URL: https://issues.apache.org/jira/browse/KAFKA-7012
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 1.1.0, 1.0.1
>Reporter: rajadayalan perumalsamy
>Assignee: Rajini Sivaram
>Priority: Critical
>  Labels: regression
> Fix For: 2.0.0, 1.0.2, 1.1.2
>
> Attachments: Commit-47ee8e954-0607-bufferkeys-nopoll-profile.png, 
> Commit-47ee8e954-0607-memory.png, Commit-47ee8e954-0607-profile.png, 
> Commit-47ee8e954-profile.png, Commit-47ee8e954-profile2.png, 
> Commit-f15cdbc91b-profile.png, Commit-f15cdbc91b-profile2.png
>
>
> We are trying to upgrade kafka cluster from Kafka 0.11.0.1 to Kafka 1.0.1. 
> After upgrading 1 node on the cluster, we notice that network threads use 
> most of the cpu. It is a 3 node cluster with 15k messages/sec on each node. 
> With Kafka 0.11.0.1 typical usage of the servers is around 50 to 60% 
> vcpu(using less than 1 vcpu). After upgrade we are noticing that cpu usage is 
> high depending on the number of network threads used. If network threads are 
> set to 8, then the cpu usage is around 850%(9 vcpus) and if it is set to 4 
> then the cpu usage is around 450%(5 vcpus). Using the same kafka 
> server.properties for both.
> Did further analysis with git bisect, couple of build and deploys, traced the 
> issue to commit 47ee8e954df62b9a79099e944ec4be29afe046f6. CPU usage is fine 
> for commit f15cdbc91b240e656d9a2aeb6877e94624b21f8d. But with commit 
> 47ee8e954df62b9a79099e944ec4be29afe046f6 cpu usage has increased. Have 
> attached screenshots of profiling done with both the commits. Screenshot 
> Commit-f15cdbc91b-profile shows less cpu usage by network threads and 
> Screenshots Commit-47ee8e954-profile and Commit-47ee8e954-profile2 show 
> higher cpu usage(almost entire cpu usage) by network threads. Also noticed 
> that kafka.network.Processor.poll() method is invoked 10 times more with 
> commit 47ee8e954df62b9a79099e944ec4be29afe046f6.
> We need the issue to be resolved to upgrade the cluster. Please let me know 
> if you need any additional information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7012) Performance issue upgrading to kafka 1.0.1 or 1.1

2018-06-20 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16518751#comment-16518751
 ] 

Dong Lin commented on KAFKA-7012:
-

[~mjsax] It is not included in RC0 for 1.1.1. I just updated the "fix version" 
to include 1.1.2.

For cases like this, do we expect to include the patch in RC1 for 1.1.1?

> Performance issue upgrading to kafka 1.0.1 or 1.1
> -
>
> Key: KAFKA-7012
> URL: https://issues.apache.org/jira/browse/KAFKA-7012
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 1.1.0, 1.0.1
>Reporter: rajadayalan perumalsamy
>Assignee: Rajini Sivaram
>Priority: Critical
>  Labels: regression
> Fix For: 2.0.0, 1.0.2, 1.1.2
>
> Attachments: Commit-47ee8e954-0607-bufferkeys-nopoll-profile.png, 
> Commit-47ee8e954-0607-memory.png, Commit-47ee8e954-0607-profile.png, 
> Commit-47ee8e954-profile.png, Commit-47ee8e954-profile2.png, 
> Commit-f15cdbc91b-profile.png, Commit-f15cdbc91b-profile2.png
>
>
> We are trying to upgrade kafka cluster from Kafka 0.11.0.1 to Kafka 1.0.1. 
> After upgrading 1 node on the cluster, we notice that network threads use 
> most of the cpu. It is a 3 node cluster with 15k messages/sec on each node. 
> With Kafka 0.11.0.1 typical usage of the servers is around 50 to 60% 
> vcpu(using less than 1 vcpu). After upgrade we are noticing that cpu usage is 
> high depending on the number of network threads used. If network threads are 
> set to 8, then the cpu usage is around 850%(9 vcpus) and if it is set to 4 
> then the cpu usage is around 450%(5 vcpus). Using the same kafka 
> server.properties for both.
> Did further analysis with git bisect, couple of build and deploys, traced the 
> issue to commit 47ee8e954df62b9a79099e944ec4be29afe046f6. CPU usage is fine 
> for commit f15cdbc91b240e656d9a2aeb6877e94624b21f8d. But with commit 
> 47ee8e954df62b9a79099e944ec4be29afe046f6 cpu usage has increased. Have 
> attached screenshots of profiling done with both the commits. Screenshot 
> Commit-f15cdbc91b-profile shows less cpu usage by network threads and 
> Screenshots Commit-47ee8e954-profile and Commit-47ee8e954-profile2 show 
> higher cpu usage(almost entire cpu usage) by network threads. Also noticed 
> that kafka.network.Processor.poll() method is invoked 10 times more with 
> commit 47ee8e954df62b9a79099e944ec4be29afe046f6.
> We need the issue to be resolved to upgrade the cluster. Please let me know 
> if you need any additional information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

