[jira] [Created] (KAFKA-15701) Allow use of user policy in CreateTopicPolicy
Jiao Zhang created KAFKA-15701:
-----------------------------------

             Summary: Allow use of user policy in CreateTopicPolicy
                 Key: KAFKA-15701
                 URL: https://issues.apache.org/jira/browse/KAFKA-15701
             Project: Kafka
          Issue Type: Improvement
            Reporter: Jiao Zhang


One use case of CreateTopicPolicy we have experienced is allowing or rejecting topic creation based on the requesting user. In particular, on secured clusters we add ACLs for specific users to allow topic creation, and at the same time we need to design customized create-topic policies for different users. For example, for user A, topic creation is allowed only when the partition count is within a limit, while for user B topic creation is allowed without any check. From the point of view of the Kafka service provider, user A represents a regular user of the Kafka service and user B represents an internal user for cluster management.

For this need, we patched our local fork of Kafka to pass the user principal through KafkaApis. One place that needs to be revised is here: https://github.com/apache/kafka/blob/834f72b03de40fb47caaad1397ed061de57c2509/core/src/main/scala/kafka/server/KafkaApis.scala#L1980

Since it seems natural to support this kind of usage upstream as well, I raised this Jira to ask for the community's ideas.


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
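To make the use case concrete, below is a minimal sketch of such a per-user policy. It assumes the requesting principal is somehow made available to the policy (for example through a hypothetical accessor populated from KafkaApis); the current upstream CreateTopicPolicy.RequestMetadata does not expose the principal, which is exactly the gap this ticket is about. The user names and the partition limit are illustrative only.

{code:scala}
import java.util
import org.apache.kafka.common.errors.PolicyViolationException
import org.apache.kafka.server.policy.CreateTopicPolicy
import org.apache.kafka.server.policy.CreateTopicPolicy.RequestMetadata

class PerUserCreateTopicPolicy extends CreateTopicPolicy {
  private val maxPartitionsForRegularUsers = 64                 // assumed limit for regular users
  private val unrestrictedUsers = Set("User:internal-admin")    // assumed internal management user

  override def configure(configs: util.Map[String, _]): Unit = ()

  override def validate(metadata: RequestMetadata): Unit = {
    val principal = currentPrincipal()                          // hypothetical accessor, see note above
    if (!unrestrictedUsers.contains(principal)) {
      // numPartitions() may be null when replica assignments are given; default defensively.
      val partitions = Option(metadata.numPartitions).map(_.intValue).getOrElse(1)
      if (partitions > maxPartitionsForRegularUsers)
        throw new PolicyViolationException(
          s"$principal may not create topics with more than $maxPartitionsForRegularUsers partitions")
    }
  }

  // Placeholder for however the principal would be propagated from KafkaApis.
  private def currentPrincipal(): String = "User:unknown"

  override def close(): Unit = ()
}
{code}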
[jira] [Created] (KAFKA-9729) Shrink inWriteLock time in SimpleAuthorizer
Jiao Zhang created KAFKA-9729:
-------------------------------

             Summary: Shrink inWriteLock time in SimpleAuthorizer
                 Key: KAFKA-9729
                 URL: https://issues.apache.org/jira/browse/KAFKA-9729
             Project: Kafka
          Issue Type: Improvement
          Components: security
    Affects Versions: 1.1.0
            Reporter: Jiao Zhang


The current SimpleAuthorizer takes 'inWriteLock' when processing add/remove ACL requests, while getAcls in authorize() takes 'inReadLock'. That means handling add/remove ACL requests blocks all other requests, for example produce and fetch requests. When processing add/remove ACLs, updateResourceAcls() accesses ZooKeeper to update the ACLs, which can take a long time in cases such as a network glitch.

We simulated ZooKeeper delays. With a 100 ms delay added on the ZooKeeper side, 'inWriteLock' in addAcls()/removeAcls() is held for 400 ms~500 ms. With a 500 ms delay, it is held for 2000 ms~2500 ms. Blocking produce/fetch requests for 2 s causes an apparent performance degradation for the whole cluster. So we are considering whether it is possible to take 'inWriteLock' only inside updateCache.

{code:scala}
private def updateCache(resource: Resource, versionedAcls: VersionedAcls) {
  if (versionedAcls.acls.nonEmpty) {
    aclCache.put(resource, versionedAcls)
  } else {
    aclCache.remove(resource)
  }
}
{code}

If we do this, the blocking time is only the time needed to update the local cache, which is not influenced by network glitches. But we don't know whether there were special reasons for the current strict write lock, and we are not sure whether there are side effects if only updateCache is locked.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
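A self-contained sketch of the locking pattern being proposed (not the actual SimpleAuthorizer code; resource/ACL types and the ZooKeeper call are simplified stand-ins): the slow ZooKeeper write stays outside the write lock, and the lock is held only while the in-memory ACL cache is mutated, so concurrent authorize() readers are blocked only for the cache update itself.

{code:scala}
import java.util.concurrent.locks.ReentrantReadWriteLock
import scala.collection.mutable

object NarrowWriteLockSketch {
  private val lock = new ReentrantReadWriteLock()
  private val aclCache = mutable.Map.empty[String, Set[String]]   // resource -> acls (simplified types)

  // Stand-in for updateResourceAcls(): a potentially slow ZooKeeper round trip.
  private def writeAclsToZk(resource: String, acls: Set[String]): Set[String] = {
    Thread.sleep(100)  // simulate network latency
    acls
  }

  def addAcls(resource: String, acls: Set[String]): Unit = {
    val updated = writeAclsToZk(resource, acls)   // slow part: no lock held
    val w = lock.writeLock()
    w.lock()
    try {
      // Only the local cache mutation is serialized against readers.
      if (updated.nonEmpty) aclCache.put(resource, updated) else aclCache.remove(resource)
    } finally w.unlock()
  }

  def getAcls(resource: String): Set[String] = {
    val r = lock.readLock()
    r.lock()
    try aclCache.getOrElse(resource, Set.empty)
    finally r.unlock()
  }
}
{code}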
[jira] [Created] (KAFKA-9685) Solve Set concatenation perf issue in AclAuthorizer
Jiao Zhang created KAFKA-9685:
-------------------------------

             Summary: Solve Set concatenation perf issue in AclAuthorizer
                 Key: KAFKA-9685
                 URL: https://issues.apache.org/jira/browse/KAFKA-9685
             Project: Kafka
          Issue Type: Improvement
          Components: security
    Affects Versions: 1.1.0
            Reporter: Jiao Zhang


In version 1.1, https://github.com/apache/kafka/blob/71b1e19fc60b5e1f9bba33025737ec2b7fb1c2aa/core/src/main/scala/kafka/security/auth/SimpleAclAuthorizer.scala#L110 the logic for checking ACLs prepares a merged ACL Set with 'acls = getAcls(new Resource(resource.resourceType, Resource.WildCardResource)) ++ getAcls(resource);' and then passes it to aclMatch. We found that Scala's Set '++' operation is very slow when, for example, the Set on the right-hand side of '++' has more than 100 entries. The bad performance of '++' comes from iterating every entry of the right-hand Set, where the hash-code calculation appears to be heavy. The performance of 'authorize' matters because every request delivered to the broker goes through this logic; that is why we can't leave it as-is even though the change proposed here looks trivial.

Here is the approach. We propose to solve this issue by introducing a new class 'AclSets', which takes multiple Sets as parameters and runs 'find' against them one by one:

```
class AclSets(sets: Set[Acl]*) {
  def find(p: Acl => Boolean): Option[Acl] = sets.flatMap(_.find(p)).headOption
  def isEmpty: Boolean = !sets.exists(_.nonEmpty)
}
```

This approach avoids the Set '++' operation and clearly outperforms the old '++' logic. A benchmark (run against Kafka 1.1) shows a notable difference under these conditions:
1. the set on the left has 60 entries
2. the set on the right has 30 entries
3. the lookup is for an absent entry (so all entries are iterated)

Benchmark results:
                                Mode  Cnt    Score     Error  Units
ScalaSetConcatination.Set      thrpt    3  281.974 ± 140.029  ops/ms
ScalaSetConcatination.AclSets  thrpt    3  887.426 ±  40.261  ops/ms

As upstream also uses a similar '++' operation, https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/security/authorizer/AclAuthorizer.scala#L360 we think it's necessary to fix this issue.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
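For illustration, here is a runnable sketch of how the authorize path could consult the wildcard-resource ACLs and the literal-resource ACLs through AclSets instead of merging them with '++'. The Acl type below is a simplified stand-in, not the real kafka.security.auth.Acl, and the lookup values are made up.

{code:scala}
// Simplified stand-in for kafka.security.auth.Acl, just for this example.
case class Acl(principal: String, operation: String)

// The proposed class from this ticket, unchanged.
class AclSets(sets: Set[Acl]*) {
  def find(p: Acl => Boolean): Option[Acl] = sets.flatMap(_.find(p)).headOption
  def isEmpty: Boolean = !sets.exists(_.nonEmpty)
}

object AclSetsExample {
  def main(args: Array[String]): Unit = {
    val wildcardAcls = Set(Acl("User:admin", "All"))
    val literalAcls  = Set(Acl("User:alice", "Read"), Acl("User:bob", "Write"))

    // Before: val acls = wildcardAcls ++ literalAcls; acls.find(...)
    // After: no Set concatenation; the two sets are searched in place.
    val acls = new AclSets(wildcardAcls, literalAcls)
    val matched = acls.find(a => a.principal == "User:alice" && a.operation == "Read")
    println(s"authorized = ${matched.nonEmpty}")
  }
}
{code}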
[jira] [Resolved] (KAFKA-9372) Add producer config to make topicExpiry configurable
     [ https://issues.apache.org/jira/browse/KAFKA-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiao Zhang resolved KAFKA-9372.
-------------------------------
    Resolution: Duplicate

Let me close it, as this issue is covered by https://issues.apache.org/jira/browse/KAFKA-8904

> Add producer config to make topicExpiry configurable
> -----------------------------------------------------
>
>                 Key: KAFKA-9372
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9372
>             Project: Kafka
>          Issue Type: Improvement
>          Components: producer
>    Affects Versions: 1.1.0
>            Reporter: Jiao Zhang
>            Assignee: Brian Byrne
>            Priority: Minor
>
> Sometimes we got the error "org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 1000 ms" on the producer side. We investigated and found:
> # our producer produced messages at a really low rate, with an interval of more than 10 minutes
> # by default, the producer expires topics after TOPIC_EXPIRY_MS; after a topic expires, if no data is produced before the next metadata update (triggered automatically by metadata.max.age.ms), the partitions entry for the topic disappears from the Metadata cache
> As a result, almost every produce call needs to fetch metadata first, which can end with a timeout. To solve this, we propose adding a new producer config, metadata.topic.expiry, to make topicExpiry configurable. Topic expiry is useful only when the producer is long-lived and produces to a varying set of topics. But when producers are bound to a single topic or a few fixed topics, there is no need to expire topics at all.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (KAFKA-9616) Add new metrics to get total response time with throttle time subtracted
Jiao Zhang created KAFKA-9616:
-------------------------------

             Summary: Add new metrics to get total response time with throttle time subtracted
                 Key: KAFKA-9616
                 URL: https://issues.apache.org/jira/browse/KAFKA-9616
             Project: Kafka
          Issue Type: Improvement
          Components: core
    Affects Versions: 1.1.0
            Reporter: Jiao Zhang


We are using these RequestMetrics for our cluster monitoring https://github.com/apache/kafka/blob/fb5bd9eb7cdfdae8ed1ea8f68e9be5687f610b28/core/src/main/scala/kafka/network/RequestChannel.scala#L364 and configure our AlertManager to fire alerts when the 99th-percentile value of 'TotalTimeMs' exceeds a threshold. This alert is very important as it notifies cluster administrators of bad situations, for example when one server has been bailed out of the cluster or has lost leadership. But we sometimes suffer from false alerts. This is the case: we set quotas like 'producer_byte_rate' for some clients, so when requests from these clients are throttled, 'ThrottleTimeMs' is long, and sometimes 'TotalTimeMs' exceeds the threshold purely because of throttling, which triggers an alert. As a result, we also have to spend time checking the details of these false alerts.

So this ticket proposes adding a new metric, 'ProcessTimeMs', whose value is the total response time with the throttle time subtracted. This metric is more accurate and would let us notice only the truly unexpected situations. By the way, we tried to achieve this with PromQL against the existing metrics, like Total - Throttle, but it does not work because the two metrics appear to be inconsistent in time. So it would be better to expose a new metric from the broker side.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
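A minimal, self-contained sketch of the proposed derivation (not the actual RequestChannel/RequestMetrics code; the field names are assumptions): the broker subtracts the throttle time from the total response time of the same request, so the two values cannot drift apart the way two separately scraped Prometheus series can.

{code:scala}
object ProcessTimeMetricSketch {

  // Illustrative per-request timing snapshot; field names are assumptions,
  // not the real RequestChannel fields.
  final case class RequestTimings(totalTimeMs: Double, throttleTimeMs: Double)

  // Proposed ProcessTimeMs value: total response time minus throttle time.
  def processTimeMs(t: RequestTimings): Double =
    math.max(0.0, t.totalTimeMs - t.throttleTimeMs)

  def main(args: Array[String]): Unit = {
    val timings = Seq(
      RequestTimings(totalTimeMs = 950.0, throttleTimeMs = 900.0),  // long only because of throttling
      RequestTimings(totalTimeMs = 120.0, throttleTimeMs = 0.0)     // genuinely slow processing
    )
    timings.foreach { t =>
      println(f"total=${t.totalTimeMs}%.1fms throttle=${t.throttleTimeMs}%.1fms process=${processTimeMs(t)}%.1fms")
    }
  }
}
{code}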
[jira] [Created] (KAFKA-9372) Add producer config to make topicExpiry configurable
Jiao Zhang created KAFKA-9372:
-------------------------------

             Summary: Add producer config to make topicExpiry configurable
                 Key: KAFKA-9372
                 URL: https://issues.apache.org/jira/browse/KAFKA-9372
             Project: Kafka
          Issue Type: Improvement
          Components: producer
    Affects Versions: 1.1.0
            Reporter: Jiao Zhang


Sometimes we got the error "org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 1000 ms" on the producer side. We investigated and found:
# our producer produced messages at a really low rate, with an interval of more than 10 minutes
# by default, the producer expires topics after TOPIC_EXPIRY_MS; after a topic expires, if no data is produced before the next metadata update (triggered automatically by metadata.max.age.ms), the partitions entry for the topic disappears from the Metadata cache

As a result, almost every produce call needs to fetch metadata first, which can end with a timeout. To solve this, we propose adding a new producer config, metadata.topic.expiry, to make topicExpiry configurable. Topic expiry is useful only when the producer is long-lived and produces to a varying set of topics. But when producers are bound to a single topic or a few fixed topics, there is no need to expire topics at all. A usage sketch of the proposed config follows below.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
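A sketch of how the proposed config might be used once available; 'metadata.topic.expiry' is the name proposed in this ticket and does not exist in upstream Kafka (current clients would only log an "unused config" warning for it). A very large value would effectively disable topic expiry for producers bound to a fixed set of topics.

{code:scala}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object ProducerTopicExpirySketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)
    // Proposed config from this ticket (hypothetical; not recognized by current clients).
    props.put("metadata.topic.expiry", Long.MaxValue.toString)

    val producer = new KafkaProducer[String, String](props)
    try producer.send(new ProducerRecord("fixed-topic", "key", "value")).get()
    finally producer.close()
  }
}
{code}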