[jira] [Created] (KAFKA-15701) Allow use of user policy in CreateTopicPolicy

2023-10-26 Thread Jiao Zhang (Jira)
Jiao Zhang created KAFKA-15701:
--

 Summary: Allow use of user policy in CreateTopicPolicy 
 Key: KAFKA-15701
 URL: https://issues.apache.org/jira/browse/KAFKA-15701
 Project: Kafka
  Issue Type: Improvement
Reporter: Jiao Zhang


One use case for CreateTopicPolicy that we have run into is allowing or
rejecting topic creation based on the requesting user.

Especially on secured clusters, we add ACLs for specific users to allow topic
creation. At the same time, we need customized create-topic policies for
different users. For example, for user A, topic creation is allowed only when
the partition count is within a limit. For user B, we allow topic creation
without any check. From our position as a Kafka service provider, user A
represents an arbitrary user of the Kafka service and user B represents an
internal user for cluster management.
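As a rough illustration of the per-user rule described above, here is a minimal, self-contained sketch. It does not use Kafka's actual CreateTopicPolicy interface: TopicCreationRequest and PerUserTopicPolicy are hypothetical names, and the principal field assumes the broker would pass the authenticated user to the policy, which upstream does not do today.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for the metadata a policy would see if the broker also
// passed the authenticated principal. This is not Kafka's RequestMetadata class.
final class TopicCreationRequest {
    final String principal;
    final String topic;
    final int numPartitions;

    TopicCreationRequest(String principal, String topic, int numPartitions) {
        this.principal = principal;
        this.topic = topic;
        this.numPartitions = numPartitions;
    }
}

// Illustrative per-user policy: internal principals skip the check, everyone
// else is limited to maxPartitions.
final class PerUserTopicPolicy {
    private final Set<String> internalPrincipals;
    private final int maxPartitions;

    PerUserTopicPolicy(Set<String> internalPrincipals, int maxPartitions) {
        this.internalPrincipals = internalPrincipals;
        this.maxPartitions = maxPartitions;
    }

    // Returns a rejection reason, or Optional.empty() when creation is allowed.
    Optional<String> violation(TopicCreationRequest request) {
        if (internalPrincipals.contains(request.principal)) {
            return Optional.empty();
        }
        if (request.numPartitions > maxPartitions) {
            return Optional.of("Topic " + request.topic + " requests "
                    + request.numPartitions + " partitions; limit is " + maxPartitions);
        }
        return Optional.empty();
    }
}
```

In a real policy the rejection would presumably be raised as an exception from the policy's validate method rather than returned as a value; the Optional here only keeps the sketch free of Kafka dependencies.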

To support this, we patched our local fork of Kafka to pass the user principal
through KafkaApis.

One place that needs to be revised is here:
[https://github.com/apache/kafka/blob/834f72b03de40fb47caaad1397ed061de57c2509/core/src/main/scala/kafka/server/KafkaApis.scala#L1980]

Since it seems natural to support this kind of usage upstream as well, I
raised this Jira to ask for the community's thoughts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-9729) Shrink inWriteLock time in SimpleAuthorizer

2020-03-17 Thread Jiao Zhang (Jira)
Jiao Zhang created KAFKA-9729:
-

 Summary: Shrink inWriteLock time in SimpleAuthorizer
 Key: KAFKA-9729
 URL: https://issues.apache.org/jira/browse/KAFKA-9729
 Project: Kafka
  Issue Type: Improvement
  Components: security
Affects Versions: 1.1.0
Reporter: Jiao Zhang


The current SimpleAuthorizer takes 'inWriteLock' when processing add/remove
ACL requests, while getAcls in authorize() takes 'inReadLock'.
This means that handling an add/remove ACL request blocks all other requests,
for example produce and fetch requests.
When processing add/remove ACLs, updateResourceAcls() accesses zk to update the
ACLs, which can take a long time in cases such as a network glitch.
We simulated zk delay.
With a 100ms delay added on the zk side, 'inWriteLock' in
addAcls()/removeAcls() is held for 400ms~500ms.
With a 500ms delay added on the zk side, 'inWriteLock' in
addAcls()/removeAcls() is held for 2000ms~2500ms.

Blocking produce/fetch requests for 2s would cause noticeable performance
degradation for the whole cluster.
So we are considering whether it is possible to take 'inWriteLock' only inside
updateCache.
{code:java}
// Keep the local cache in sync: store non-empty ACLs, drop the entry otherwise.
private def updateCache(resource: Resource, versionedAcls: VersionedAcls) {
  if (versionedAcls.acls.nonEmpty) {
    aclCache.put(resource, versionedAcls)
  } else {
    aclCache.remove(resource)
  }
}
{code}
If we do this, the blocking time is only the time needed to update the local
cache, which is not affected by network glitches. However, we don't know
whether there were specific reasons for the current strict write lock, and we
are not sure whether there are side effects if the lock only covers
updateCache.
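To make the proposed lock scope concrete, here is a minimal, self-contained sketch of the general technique, not the SimpleAuthorizer code itself. NarrowLockAclCache and its remoteStore callback (a stand-in for the slow zk write) are hypothetical names introduced for illustration only.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.BiConsumer;

// Illustrative cache that holds the write lock only while mutating local state.
final class NarrowLockAclCache {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, Set<String>> cache = new HashMap<>();
    // Hypothetical stand-in for the slow remote (ZooKeeper) update.
    private final BiConsumer<String, Set<String>> remoteStore;

    NarrowLockAclCache(BiConsumer<String, Set<String>> remoteStore) {
        this.remoteStore = remoteStore;
    }

    void update(String resource, Set<String> acls) {
        // Slow remote write happens outside the lock, so readers are not blocked.
        remoteStore.accept(resource, acls);
        lock.writeLock().lock();
        try {
            // Only the in-memory update is guarded by the write lock.
            if (acls.isEmpty()) {
                cache.remove(resource);
            } else {
                cache.put(resource, acls);
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    Set<String> get(String resource) {
        lock.readLock().lock();
        try {
            return cache.getOrDefault(resource, Collections.emptySet());
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

One possible side effect is visible even in this sketch: two concurrent update() calls for the same resource may reach the cache in a different order than they reached the remote store, which a lock covering both steps would prevent. Whether that matters for the real authorizer depends on how it versions and reconciles ACL changes.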



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9685) Solve Set concatenation perf issue in AclAuthorizer

2020-03-08 Thread Jiao Zhang (Jira)
Jiao Zhang created KAFKA-9685:
-

 Summary: Solve Set concatenation perf issue in AclAuthorizer
 Key: KAFKA-9685
 URL: https://issues.apache.org/jira/browse/KAFKA-9685
 Project: Kafka
  Issue Type: Improvement
  Components: security
Affects Versions: 1.1.0
Reporter: Jiao Zhang


In version 1.1, 
https://github.com/apache/kafka/blob/71b1e19fc60b5e1f9bba33025737ec2b7fb1c2aa/core/src/main/scala/kafka/security/auth/SimpleAclAuthorizer.scala#L110
the ACL check prepares a merged ACL Set with
'acls = getAcls(new Resource(resource.resourceType, Resource.WildCardResource)) 
++ getAcls(resource);' and then passes it as a parameter to aclMatch.
We found that Scala's Set '++' operation is very slow, for example when the
Set on the right-hand side of '++' has more than 100 entries.
The poor performance of '++' comes from iterating over every entry of the
right-hand Set, where the hashCode calculation seems to be heavy.
The performance of 'authorize' matters because every request delivered to the
broker goes through this logic, which is why we can't leave it as-is even
though the change in this proposal seems trivial.

Here is the approach. We propose to solve this issue by introducing a new class
'AclSets', which takes multiple Sets as parameters and runs 'find' against them
one by one.
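```
class AclSets(sets: Set[Acl]*) {
  // Search each Set in turn instead of merging them with '++' first.
  def find(p: Acl => Boolean): Option[Acl] = sets.flatMap(_.find(p)).headOption
  // True only when every underlying Set is empty.
  def isEmpty: Boolean = !sets.exists(_.nonEmpty)
}
```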
This approach avoids the Set '++' operation and therefore performs much better
than the old '++' logic.

The benchmark results (we tested with Kafka version 1.1) show a notable
difference under the following conditions:
1. the set on the left consists of 60 entries
2. the set on the right consists of 30 entries
3. the search is for an absent entry (so that all entries are iterated)

The benchmark results are as follows.

Benchmark                      Mode   Cnt    Score     Error   Units
ScalaSetConcatination.Set      thrpt    3  281.974 ± 140.029  ops/ms
ScalaSetConcatination.AclSets  thrpt    3  887.426 ±  40.261  ops/ms

Since upstream also uses a similar '++' operation, 
https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/security/authorizer/AclAuthorizer.scala#L360
we think it is necessary to fix this issue.





[jira] [Resolved] (KAFKA-9372) Add producer config to make topicExpiry configurable

2020-02-26 Thread Jiao Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiao Zhang resolved KAFKA-9372.
---
Resolution: Duplicate

Let me close this issue, as it could be covered by 
https://issues.apache.org/jira/browse/KAFKA-8904

> Add producer config to make topicExpiry configurable
> 
>
> Key: KAFKA-9372
> URL: https://issues.apache.org/jira/browse/KAFKA-9372
> Project: Kafka
>  Issue Type: Improvement
>  Components: producer 
>Affects Versions: 1.1.0
>Reporter: Jiao Zhang
>Assignee: Brian Byrne
>Priority: Minor
>
> Sometimes we got the error "org.apache.kafka.common.errors.TimeoutException: 
> Failed to update metadata after 1000 ms" on the producer side. We 
> investigated and found:
>  # our producer produces messages at a really low rate; the interval between 
> messages is more than 10 minutes
>  # by default, the producer expires topics after TOPIC_EXPIRY_MS; once a 
> topic has expired, if no data is produced before the next metadata update 
> (automatically triggered by metadata.max.age.ms), the partition entries for 
> the topic disappear from the Metadata cache
> As a result, for almost every produce call, the producer needs to fetch 
> metadata, which can end in a timeout.
> To solve this, we propose adding a new config, metadata.topic.expiry, for 
> the producer to make topicExpiry configurable. Topic expiry is useful only 
> when the producer is long-lived and produces to a varying set of topics. 
> But when producers are bound to a single topic or a few fixed topics, 
> there is no need to expire topics at all.





[jira] [Created] (KAFKA-9616) Add new metrics to get total response time with throttle time subtracted

2020-02-26 Thread Jiao Zhang (Jira)
Jiao Zhang created KAFKA-9616:
-

 Summary: Add new metrics to get total response time with throttle 
time subtracted
 Key: KAFKA-9616
 URL: https://issues.apache.org/jira/browse/KAFKA-9616
 Project: Kafka
  Issue Type: Improvement
  Components: core
Affects Versions: 1.1.0
Reporter: Jiao Zhang


We use these RequestMetrics for our cluster monitoring 
[https://github.com/apache/kafka/blob/fb5bd9eb7cdfdae8ed1ea8f68e9be5687f610b28/core/src/main/scala/kafka/network/RequestChannel.scala#L364]

and configure our AlertManager to fire alerts if the 99th percentile of
'TotalTimeMs' exceeds a threshold. This alert is very important because it
notifies cluster administrators of bad situations, for example when a server
drops out of the cluster or loses leadership.

But we sometimes suffer from false alerts. Here is the case: we set quotas such
as 'producer_byte_rate' for some clients, so when requests from these clients
are throttled, 'ThrottleTimeMs' is long, and sometimes throttling pushes
'TotalTimeMs' over the threshold and the alert is triggered. As a result, we
have to spend time checking the details of false alerts as well.

So this ticket proposes adding a new metric, 'ProcessTimeMs', whose value is
the total response time with throttle time subtracted. This metric is more
accurate and would help us notice only the truly unexpected situations.

By the way, we tried to achieve this with PromQL against the existing metrics,
as Total - Throttle. But it does not work, as these two metrics seem to be
inconsistent in time. So it is better to expose a new metric from the broker
side.
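A small, self-contained sketch may help show why the subtraction is better done per request on the broker than on aggregated percentiles. ProcessTimeSketch and its methods are hypothetical names and the sample values are made up for illustration; this is not the RequestMetrics code.

```java
import java.util.Arrays;

// Illustrative helper comparing per-request subtraction with subtracting
// already-aggregated percentiles.
final class ProcessTimeSketch {
    // Nearest-rank percentile over a copy of the samples (p in (0, 100]).
    static long percentile(long[] values, double p) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Per-request process time: total time minus throttle time of the same request.
    static long[] processTimes(long[] totalMs, long[] throttleMs) {
        long[] out = new long[totalMs.length];
        for (int i = 0; i < totalMs.length; i++) {
            out[i] = totalMs[i] - throttleMs[i];
        }
        return out;
    }
}
```

For example, with totals of {10, 12, 11, 510} ms and throttle times of {0, 0, 0, 500} ms, the 99th percentile of the per-request differences is 12 ms, while the difference of the two 99th percentiles is 510 - 500 = 10 ms. The two numbers need not agree, because the percentile of a difference is not the difference of percentiles, which is one reason a dedicated broker-side metric is more reliable than a query over the two existing ones.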





[jira] [Created] (KAFKA-9372) Add producer config to make topicExpiry configurable

2020-01-06 Thread Jiao Zhang (Jira)
Jiao Zhang created KAFKA-9372:
-

 Summary: Add producer config to make topicExpiry configurable
 Key: KAFKA-9372
 URL: https://issues.apache.org/jira/browse/KAFKA-9372
 Project: Kafka
  Issue Type: Improvement
  Components: producer 
Affects Versions: 1.1.0
Reporter: Jiao Zhang


Sometimes we got the error "org.apache.kafka.common.errors.TimeoutException:
Failed to update metadata after 1000 ms" on the producer side. We investigated
and found:
 # our producer produces messages at a really low rate; the interval between
messages is more than 10 minutes
 # by default, the producer expires topics after TOPIC_EXPIRY_MS; once a topic
has expired, if no data is produced before the next metadata update
(automatically triggered by metadata.max.age.ms), the partition entries for the
topic disappear from the Metadata cache

As a result, for almost every produce call, the producer needs to fetch
metadata, which can end in a timeout.

To solve this, we propose adding a new config, metadata.topic.expiry, for the
producer to make topicExpiry configurable. Topic expiry is useful only when the
producer is long-lived and produces to a varying set of topics. But when
producers are bound to a single topic or a few fixed topics, there is no need
to expire topics at all.
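The effect of such a config can be sketched with a small, self-contained cache. TopicExpiryCache, its NEVER_EXPIRE sentinel, and the idea that a negative value disables expiry are all assumptions made for illustration; they are not the producer's Metadata class or an existing Kafka setting.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative topic cache with a configurable expiry interval.
final class TopicExpiryCache {
    // Hypothetical sentinel meaning "never expire topics".
    static final long NEVER_EXPIRE = -1L;

    private final long expiryMs;
    private final Map<String, Long> lastUsedMs = new HashMap<>();

    TopicExpiryCache(long expiryMs) {
        this.expiryMs = expiryMs;
    }

    // Called whenever the producer sends to a topic.
    void recordUse(String topic, long nowMs) {
        lastUsedMs.put(topic, nowMs);
    }

    // Called on each metadata update: drops topics idle for longer than expiryMs.
    void expireStale(long nowMs) {
        if (expiryMs == NEVER_EXPIRE) {
            return;
        }
        lastUsedMs.values().removeIf(last -> nowMs - last > expiryMs);
    }

    boolean contains(String topic) {
        return lastUsedMs.containsKey(topic);
    }
}
```

With a finite expiry, a topic that is written to less often than the expiry interval is dropped between sends and must be fetched again, which matches the behaviour described above; with expiry disabled, a producer bound to a few fixed topics keeps them cached.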


