[jira] [Created] (KAFKA-10258) Get rid of use_zk_connection flag in kafka.py public methods
Vinoth Chandar created KAFKA-10258: -- Summary: Get rid of use_zk_connection flag in kafka.py public methods Key: KAFKA-10258 URL: https://issues.apache.org/jira/browse/KAFKA-10258 Project: Kafka Issue Type: Sub-task Reporter: Vinoth Chandar -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-10213) Prefer --bootstrap-server in ducktape tests for Kafka clients
Vinoth Chandar created KAFKA-10213: -- Summary: Prefer --bootstrap-server in ducktape tests for Kafka clients Key: KAFKA-10213 URL: https://issues.apache.org/jira/browse/KAFKA-10213 Project: Kafka Issue Type: Sub-task Reporter: Vinoth Chandar Assignee: Vinoth Chandar -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-10138) Prefer --bootstrap-server for reassign_partitions command in ducktape tests
[ https://issues.apache.org/jira/browse/KAFKA-10138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar resolved KAFKA-10138. Resolution: Fixed > Prefer --bootstrap-server for reassign_partitions command in ducktape tests > --- > > Key: KAFKA-10138 > URL: https://issues.apache.org/jira/browse/KAFKA-10138 > Project: Kafka > Issue Type: Sub-task >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-10174) Prefer --bootstrap-server ducktape tests using kafka_configs.sh
Vinoth Chandar created KAFKA-10174: -- Summary: Prefer --bootstrap-server ducktape tests using kafka_configs.sh Key: KAFKA-10174 URL: https://issues.apache.org/jira/browse/KAFKA-10174 Project: Kafka Issue Type: Sub-task Reporter: Vinoth Chandar Assignee: Vinoth Chandar -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-10138) Prefer --bootstrap-server for reassign_partitions command in ducktape tests
Vinoth Chandar created KAFKA-10138: -- Summary: Prefer --bootstrap-server for reassign_partitions command in ducktape tests Key: KAFKA-10138 URL: https://issues.apache.org/jira/browse/KAFKA-10138 Project: Kafka Issue Type: Sub-task Reporter: Vinoth Chandar Assignee: Vinoth Chandar -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-10131) Minimize use of --zookeeper flag in ducktape tests
Vinoth Chandar created KAFKA-10131: -- Summary: Minimize use of --zookeeper flag in ducktape tests Key: KAFKA-10131 URL: https://issues.apache.org/jira/browse/KAFKA-10131 Project: Kafka Issue Type: Improvement Components: system tests Reporter: Vinoth Chandar Assignee: Vinoth Chandar Get the ducktape tests working without the --zookeeper flag (except for scram). (Note: When doing compat testing we'll still use the old flags.) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-10071) TopicCommand tool should make more efficient metadata calls to Kafka Servers
Vinoth Chandar created KAFKA-10071: -- Summary: TopicCommand tool should make more efficient metadata calls to Kafka Servers Key: KAFKA-10071 URL: https://issues.apache.org/jira/browse/KAFKA-10071 Project: Kafka Issue Type: Improvement Reporter: Vinoth Chandar Assignee: Vinoth Chandar This is a follow-up from the discussion of KAFKA-9945 [https://github.com/apache/kafka/pull/8737]. Today, alter, describe, and delete all pull down the entire topic list in order to support regex matching. We need to make these commands much more efficient. (There is also the issue that the regex support includes the period character, so we may need two different switches going forward.) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-9512) Flaky Test LagFetchIntegrationTest.shouldFetchLagsDuringRestoration
[ https://issues.apache.org/jira/browse/KAFKA-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar resolved KAFKA-9512. --- Resolution: Fixed Closing since the PR is now landed > Flaky Test LagFetchIntegrationTest.shouldFetchLagsDuringRestoration > --- > > Key: KAFKA-9512 > URL: https://issues.apache.org/jira/browse/KAFKA-9512 > Project: Kafka > Issue Type: Bug > Components: streams, unit tests >Affects Versions: 2.5.0 >Reporter: Matthias J. Sax >Assignee: Vinoth Chandar >Priority: Critical > Labels: flaky-test > Fix For: 2.5.0 > > > [https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/497/testReport/junit/org.apache.kafka.streams.integration/LagFetchIntegrationTest/shouldFetchLagsDuringRestoration/] > {quote}java.lang.NullPointerException at > org.apache.kafka.streams.integration.LagFetchIntegrationTest.shouldFetchLagsDuringRestoration(LagFetchIntegrationTest.java:306){quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9431) Expose API in KafkaStreams to fetch all local offset lags
Vinoth Chandar created KAFKA-9431: - Summary: Expose API in KafkaStreams to fetch all local offset lags Key: KAFKA-9431 URL: https://issues.apache.org/jira/browse/KAFKA-9431 Project: Kafka Issue Type: Sub-task Components: streams Reporter: Vinoth Chandar Assignee: Vinoth Chandar Fix For: 2.5.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9430) Tighten up lag estimates when source topic optimization is on
Vinoth Chandar created KAFKA-9430: - Summary: Tighten up lag estimates when source topic optimization is on Key: KAFKA-9430 URL: https://issues.apache.org/jira/browse/KAFKA-9430 Project: Kafka Issue Type: Sub-task Reporter: Vinoth Chandar Assignee: Vinoth Chandar Right now, we use _endOffsets_ of the source topic for the computation. Since the source topics can also have user event produces, this is an over-estimate. From John: For "optimized" changelogs, this will be wrong, strictly speaking, but it's an over-estimate (which seems better than an under-estimate), and it's also still an apples-to-apples comparison, since all replicas would use the same upper bound to compute their lags, so the "pick the freshest" replica logic is still going to pick the right one. We can add a new 2.5 blocker ticket to really fix it, and not worry about it until after this KSQL stuff is done. For active tasks: we need to use consumed offsets and not the end of the source topic. -- This message was sent by Atlassian Jira (v8.3.4#803005)
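The two lag bounds discussed above can be sketched as pure functions (hypothetical names, not Kafka's actual internals). For standbys, measuring against the source-topic end offset over-estimates lag when the topic also receives user produces, but since every replica uses the same bound, "pick the freshest" still selects the right replica:

```java
final class LagEstimates {
    // Upper bound: distance from the restored position to the topic end.
    static long upperBoundLag(long endOffset, long restoredOffset) {
        return Math.max(0L, endOffset - restoredOffset);
    }
    // Tighter bound for actives: measure against the committed consumed
    // offset rather than the end of the (possibly shared) source topic.
    static long activeLag(long committedConsumedOffset, long restoredOffset) {
        return Math.max(0L, committedConsumedOffset - restoredOffset);
    }
}
```

When the source topic has extra user-produced records past the committed consumed position, `upperBoundLag` exceeds `activeLag`, which is exactly the over-estimate the ticket describes.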
[jira] [Created] (KAFKA-9429) Allow ability to control whether stale reads out of state stores are desirable
Vinoth Chandar created KAFKA-9429: - Summary: Allow ability to control whether stale reads out of state stores are desirable Key: KAFKA-9429 URL: https://issues.apache.org/jira/browse/KAFKA-9429 Project: Kafka Issue Type: Sub-task Reporter: Vinoth Chandar Assignee: Vinoth Chandar Fix For: 2.5.0 From John: I also meant to talk with you about the change to allow querying recovering stores. I think you might have already talked with Matthias a little about this in the scope of KIP-216, but it's probably not ok to just change the default from only allowing queries while running, since there are actually people depending on full-consistency queries for correctness right now. What we can do is add an overload {{KafkaStreams.store(name, QueriableStoreType, QueriableStoreOptions)}}, with one option: {{queryStaleState(true/false)}} (your preference on the name, I just made that up right now). The default would be false, and KSQL would set it to true. While false, it would not allow querying recovering stores OR standbys. This basically allows a single switch to preserve existing behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005)
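An illustrative mock of the single-switch behavior proposed above. The names (QueriableStoreOptions, queryStaleState) come from the quoted proposal; the classes here are a standalone sketch, not the actual Streams API:

```java
final class QueriableStoreOptions {
    private final boolean queryStaleState;
    private QueriableStoreOptions(boolean queryStaleState) {
        this.queryStaleState = queryStaleState;
    }
    static QueriableStoreOptions defaults() {
        return new QueriableStoreOptions(false); // preserve existing behavior
    }
    QueriableStoreOptions queryStaleState(boolean enabled) {
        return new QueriableStoreOptions(enabled); // KSQL would pass true
    }
    boolean allowsStale() {
        return queryStaleState;
    }
}

enum StoreState { RUNNING, RESTORING, STANDBY }

final class StoreGate {
    // While the option is false, neither recovering stores nor standbys
    // may serve queries; only a RUNNING active store is queryable.
    static boolean queryable(StoreState state, QueriableStoreOptions options) {
        return state == StoreState.RUNNING || options.allowsStale();
    }
}
```

The point of the design is that the default is full consistency, so existing applications see no behavior change unless they opt in.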
[jira] [Created] (KAFKA-9428) Expose standby information in KafkaStreams via queryMetadataForKey API
Vinoth Chandar created KAFKA-9428: - Summary: Expose standby information in KafkaStreams via queryMetadataForKey API Key: KAFKA-9428 URL: https://issues.apache.org/jira/browse/KAFKA-9428 Project: Kafka Issue Type: Sub-task Reporter: Vinoth Chandar Assignee: Vinoth Chandar Fix For: 2.5.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-8994) Streams should expose standby replication information & allow stale reads of state store
[ https://issues.apache.org/jira/browse/KAFKA-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar resolved KAFKA-8994. --- Resolution: Duplicate > Streams should expose standby replication information & allow stale reads of > state store > > Key: KAFKA-8994 > URL: https://issues.apache.org/jira/browse/KAFKA-8994 > Project: Kafka > Issue Type: New Feature > Components: streams > Reporter: Vinoth Chandar > Assignee: Vinoth Chandar > Priority: Major > Labels: needs-kip > > Currently Streams interactive queries (IQ) fail during the time period where > there is a rebalance in progress. > Consider the following scenario in a three-node Streams cluster with node A, > node B and node R, executing a stateful sub-topology/topic group with 1 > partition and `_num.standby.replicas=1_` > * *t0*: A is the active instance owning the partition, B is the standby that > keeps replicating A's state into its local disk, R just routes streams > IQs to the active instance using StreamsMetadata > * *t1*: IQs pick node R as router, R forwards the query to A, A responds back to > R, which reverse-forwards back the results. > * *t2*: Active instance A is killed and rebalance begins. IQs to A start failing > * *t3*: Rebalance assignment happens and standby B is now promoted as the active > instance. IQs continue to fail > * *t4*: B fully catches up to the changelog tail and rewinds offsets to A's last > commit position, IQs continue to fail > * *t5*: IQs to R get routed to B, which is now ready to serve results. IQs > start succeeding again > > Depending on Kafka consumer group session/heartbeat timeouts, steps t2, t3 can > take a few seconds (~10 seconds based on default values). Depending on how > laggy the standby B was prior to A being killed, t4 can take a few > seconds to minutes.
> While this behavior favors consistency over availability at all times, the > long unavailability window might be undesirable for certain classes of > applications (e.g. simple caches or dashboards). > This issue aims to also expose information about standby B to R during each > rebalance, such that the queries can be routed by an application to a standby > to serve stale reads, choosing availability over consistency. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (KAFKA-8994) Streams should expose standby replication information & allow stale reads of state store
[ https://issues.apache.org/jira/browse/KAFKA-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reopened KAFKA-8994: --- > Streams should expose standby replication information & allow stale reads of > state store > > Key: KAFKA-8994 > URL: https://issues.apache.org/jira/browse/KAFKA-8994 > Project: Kafka > Issue Type: New Feature > Components: streams > Reporter: Vinoth Chandar > Assignee: Vinoth Chandar > Priority: Major > Labels: needs-kip -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-8994) Streams should expose standby replication information & allow stale reads of state store
[ https://issues.apache.org/jira/browse/KAFKA-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar resolved KAFKA-8994. --- Resolution: Fixed > Streams should expose standby replication information & allow stale reads of > state store > > Key: KAFKA-8994 > URL: https://issues.apache.org/jira/browse/KAFKA-8994 > Project: Kafka > Issue Type: New Feature > Components: streams > Reporter: Vinoth Chandar > Assignee: Vinoth Chandar > Priority: Major > Labels: needs-kip -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-8994) Streams should expose standby replication information & allow stale reads of state store
Vinoth Chandar created KAFKA-8994: - Summary: Streams should expose standby replication information & allow stale reads of state store Key: KAFKA-8994 URL: https://issues.apache.org/jira/browse/KAFKA-8994 Project: Kafka Issue Type: New Feature Components: streams Reporter: Vinoth Chandar Currently Streams interactive queries (IQ) fail during the time period where there is a rebalance in progress. Consider the following scenario in a three-node Streams cluster with node A, node B and node R, executing a stateful sub-topology/topic group with 1 partition and `_num.standby.replicas=1_` * *t0*: A is the active instance owning the partition, B is the standby that keeps replicating A's state into its local disk, R just routes streams IQs to the active instance using StreamsMetadata * *t1*: IQs pick node R as router, R forwards the query to A, A responds back to R, which reverse-forwards back the results. * *t2*: Active instance A is killed and rebalance begins. IQs to A start failing * *t3*: Rebalance assignment happens and standby B is now promoted as the active instance. IQs continue to fail * *t4*: B fully catches up to the changelog tail and rewinds offsets to A's last commit position, IQs continue to fail * *t5*: IQs to R get routed to B, which is now ready to serve results. IQs start succeeding again Depending on Kafka consumer group session/heartbeat timeouts, steps t2, t3 can take a few seconds (~10 seconds based on default values). Depending on how laggy the standby B was prior to A being killed, t4 can take a few seconds to minutes. While this behavior favors consistency over availability at all times, the long unavailability window might be undesirable for certain classes of applications (e.g. simple caches or dashboards). This issue aims to also expose information about standby B to R during each rebalance, such that the queries can be routed by an application to a standby to serve stale reads, choosing availability over consistency.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
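The t0-t5 scenario above boils down to a routing decision at R. A minimal sketch with hypothetical types: prefer a ready active instance, and only when stale reads are acceptable fall back to a ready standby such as B:

```java
import java.util.List;
import java.util.Optional;

record Replica(String host, boolean active, boolean ready) {}

final class IqRouter {
    static Optional<Replica> pick(List<Replica> replicas, boolean allowStale) {
        Optional<Replica> activeUp = replicas.stream()
                .filter(r -> r.active() && r.ready())
                .findFirst();
        if (activeUp.isPresent() || !allowStale) {
            return activeUp; // consistency-first: active or nothing
        }
        // Availability-first: any caught-up replica, possibly a standby.
        return replicas.stream().filter(Replica::ready).findFirst();
    }
}
```

With `allowStale=false` this reproduces today's behavior (queries fail from t2 until t5); with `allowStale=true` the query is served, staleness and all, from the standby during the rebalance window.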
[jira] [Resolved] (KAFKA-8839) Improve logging in Kafka Streams around debugging task lifecycle
[ https://issues.apache.org/jira/browse/KAFKA-8839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar resolved KAFKA-8839. --- Resolution: Fixed Closing since the PR has been merged > Improve logging in Kafka Streams around debugging task lifecycle > - > > Key: KAFKA-8839 > URL: https://issues.apache.org/jira/browse/KAFKA-8839 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 2.4.0 > > > As a follow up to KAFKA-8831, this Jira will track efforts around improving > logging/docs around > > * Being able to follow state of tasks from assignment to restoration > * Better detection of misconfigured state store dir > * Docs giving guidance for rebalance time and state store config -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-8913) Document topic based configs & ISR settings for Streams apps
[ https://issues.apache.org/jira/browse/KAFKA-8913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar resolved KAFKA-8913. --- Resolution: Fixed > Document topic based configs & ISR settings for Streams apps > > > Key: KAFKA-8913 > URL: https://issues.apache.org/jira/browse/KAFKA-8913 > Project: Kafka > Issue Type: Improvement > Components: documentation, streams > Reporter: Vinoth Chandar > Assignee: Vinoth Chandar > Priority: Major > > Noticed that it was not clear how to configure the internal topics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-8913) Document topic based configs & ISR settings for Streams apps
Vinoth Chandar created KAFKA-8913: - Summary: Document topic based configs & ISR settings for Streams apps Key: KAFKA-8913 URL: https://issues.apache.org/jira/browse/KAFKA-8913 Project: Kafka Issue Type: Improvement Components: documentation, streams Reporter: Vinoth Chandar Assignee: Vinoth Chandar Noticed that it was not clear how to configure the internal topics. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (KAFKA-8870) Prevent dirty reads of Streams state store from Interactive queries
Vinoth Chandar created KAFKA-8870: - Summary: Prevent dirty reads of Streams state store from Interactive queries Key: KAFKA-8870 URL: https://issues.apache.org/jira/browse/KAFKA-8870 Project: Kafka Issue Type: Improvement Components: streams Reporter: Vinoth Chandar Today, Interactive Queries (IQ) against a Streams state store could see uncommitted data, even with EOS processing guarantees (these are actually orthogonal, but clarifying since EOS may give the impression that everything is dandy). This is caused primarily because state updates in RocksDB are visible even before the Kafka transaction is committed. Thus, if the instance fails, the failed-over instance will redo the uncommitted old transaction, and the following could be possible during recovery: the value for key K can go from *V0 → V1 → V2* on active instance A; IQ reads V1; instance A fails; and any failure/rebalancing will leave the standby instance B rewinding offsets and reprocessing, during which time IQ can again see V0 or V1 or any number of previous values for the same key. In this issue, we will plan work towards providing consistency for IQ, for a single row in a single state store, i.e., once a query sees V1, it can only see either V1 or V2. -- This message was sent by Atlassian Jira (v8.3.2#803003)
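The single-row guarantee stated above ("once a query sees V1, it can only see V1 or V2") is a monotonic-read property per key. A hypothetical helper, not part of the Streams API, makes the check concrete:

```java
import java.util.HashMap;
import java.util.Map;

final class MonotonicReadGuard {
    private final Map<String, Long> lastSeen = new HashMap<>();

    // True if serving `version` for `key` preserves monotonic reads;
    // a rewound standby replaying older values would be rejected.
    boolean admissible(String key, long version) {
        long prev = lastSeen.getOrDefault(key, Long.MIN_VALUE);
        if (version < prev) {
            return false;
        }
        lastSeen.put(key, version);
        return true;
    }
}
```

In the ticket's scenario, after IQ has observed V1, a failover to a standby that has rewound to V0 would fail this check rather than expose the dirty read.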
[jira] [Created] (KAFKA-8839) Improve logging in Kafka Streams around debugging task lifecycle
Vinoth Chandar created KAFKA-8839: - Summary: Improve logging in Kafka Streams around debugging task lifecycle Key: KAFKA-8839 URL: https://issues.apache.org/jira/browse/KAFKA-8839 Project: Kafka Issue Type: Improvement Components: streams Reporter: Vinoth Chandar As a follow up to KAFKA-8831, this Jira will track efforts around improving logging/docs around * Being able to follow state of tasks from assignment to restoration * Better detection of misconfigured state store dir * Docs giving guidance for rebalance time and state store config -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (KAFKA-8831) Joining a new instance sometimes does not cause rebalancing
[ https://issues.apache.org/jira/browse/KAFKA-8831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar resolved KAFKA-8831. --- Resolution: Not A Problem I will think about better logging and put up a patch. Closing this issue > Joining a new instance sometimes does not cause rebalancing > --- > > Key: KAFKA-8831 > URL: https://issues.apache.org/jira/browse/KAFKA-8831 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Chris Pettitt >Assignee: Chris Pettitt >Priority: Major > Attachments: StandbyTaskTest.java, fail.log > > > See attached log. The application is in a REBALANCING state. The second > instance joins a bit after the first instance (~250ms). The group coordinator > says it is going to rebalance but nothing happens. The first instance gets > all partitions (2). The application transitions to RUNNING. > See attached test, which starts one client and then starts another about > 250ms later. This seems to consistently repro the issue for me. > This is blocking my work on KAFKA-8755, so I'm inclined to pick it up -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (KAFKA-8810) Add mechanism to detect topology mismatch between streams instances
Vinoth Chandar created KAFKA-8810: - Summary: Add mechanism to detect topology mismatch between streams instances Key: KAFKA-8810 URL: https://issues.apache.org/jira/browse/KAFKA-8810 Project: Kafka Issue Type: Improvement Components: streams Reporter: Vinoth Chandar Noticed this while reading through the StreamsPartitionAssignor related code. If a user accidentally deploys a different topology on one of the instances, there is no mechanism to detect this and refuse assignment/take action. Given Kafka Streams is designed as an embeddable library, I feel this is rather an important scenario to handle. For example, Kafka Streams is embedded into a web front-end tier, and operators deploy a hot fix for a site issue to a few instances that are leaking memory, which accidentally also deploys some topology changes with it. Please feel free to close the issue if it's a duplicate. (Could not find a ticket for this.) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
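One hypothetical detection mechanism for the mismatch described above: each member sends a fingerprint of its topology with its subscription, and the leader refuses assignment on mismatch. The fingerprint source (e.g. the text of a topology description) and the refusal policy are assumptions for illustration, not an existing Kafka API:

```java
final class TopologyFingerprint {
    // Stable across JVMs, since String.hashCode is specified by the JLS.
    static int fingerprint(String topologyDescription) {
        return topologyDescription.hashCode();
    }

    // The leader would compare its own fingerprint against each member's
    // and exclude (or fail) members whose topology differs.
    static boolean compatible(int leaderFingerprint, int memberFingerprint) {
        return leaderFingerprint == memberFingerprint;
    }
}
```

A hash keeps the subscription payload small; a production design would likely use a stronger digest to avoid collisions.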
[jira] [Created] (KAFKA-8799) Support ability to pass global user data to consumers during Assignment
Vinoth Chandar created KAFKA-8799: - Summary: Support ability to pass global user data to consumers during Assignment Key: KAFKA-8799 URL: https://issues.apache.org/jira/browse/KAFKA-8799 Project: Kafka Issue Type: Improvement Components: streams Reporter: Vinoth Chandar This is a follow-up from KAFKA-7149. *Background:* Although we reduced the size of the AssignmentInfo object sent during each rebalance from leader to all followers in KAFKA-7149, we still repeat the same _partitionsByHost_ map for each consumer (all this when interactive queries are enabled) and thus still end up sending redundant bytes to the broker and also logging a large Kafka message. With 100s of Streams instances, this overhead can grow into tens of megabytes easily. *Proposal:* Extend the group assignment protocol to be able to support passing an additional byte[], which can now contain the HostInfo -> partitions/partitionsByHost data just one time.
{code}
final class GroupAssignment {
    private final Map assignments; // bytes sent to each consumer from leader
    private final byte[] globalUserData;
    ...
}
{code}
This can generally be handy for any other application like Streams that does some stateful processing or lightweight cluster management. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
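Rough arithmetic for the redundancy the proposal removes (illustrative numbers, not measurements): today the same partitionsByHost map is serialized once per consumer, versus once overall with a shared global user-data blob:

```java
final class AssignmentSize {
    // Current protocol: every member's assignment embeds the same map.
    static long repeatedBytes(int members, long partitionsByHostBytes) {
        return (long) members * partitionsByHostBytes;
    }
    // Proposed: the map rides once in the global byte[] field.
    static long sharedBytes(long partitionsByHostBytes) {
        return partitionsByHostBytes;
    }
}
```

For, say, 300 instances and a 100 KB map, that is roughly 30 MB of assignment payload today versus 100 KB with the shared field, which matches the "tens of megabytes" figure in the ticket.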
[jira] [Commented] (KAFKA-2580) Kafka Broker keeps file handles open for all log files (even if its not written to/read from)
[ https://issues.apache.org/jira/browse/KAFKA-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985707#comment-14985707 ] Vinoth Chandar commented on KAFKA-2580: --- [~jkreps] ah ok. point taken :) > Kafka Broker keeps file handles open for all log files (even if its not > written to/read from) > - > > Key: KAFKA-2580 > URL: https://issues.apache.org/jira/browse/KAFKA-2580 > Project: Kafka > Issue Type: Bug > Components: core > Affects Versions: 0.8.2.1 > Reporter: Vinoth Chandar > Assignee: Grant Henke > > We noticed this in one of our clusters where we stage logs for a longer > amount of time. It appears that the Kafka broker keeps file handles open even > for non-active (not written to or read from) files. (In fact, there are some > threads going back to 2013: > http://grokbase.com/t/kafka/users/132p65qwcn/keeping-logs-forever) > Needless to say, this is a problem and forces us to either artificially bump > up ulimit (it's already at 100K) or expand the cluster (even if we have > sufficient IO and everything). > Filing this ticket, since I could not find anything similar. Very interested to > know if there are plans to address this (given how Samza's changelog topic is > meant to be a persistent large state use case). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-2580) Kafka Broker keeps file handles open for all log files (even if its not written to/read from)
[ https://issues.apache.org/jira/browse/KAFKA-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984243#comment-14984243 ] Vinoth Chandar commented on KAFKA-2580: --- Based on this, looks like we can close this? >> So a lot of this comes down to the implementation. A naive 10k item LRU >> cache could easily be far more memory hungry than having 50k open FDs, plus >> being in heap this would add a huge number of objects to manage. [~jkreps] I am a little confused. What I meant by LRU cache was simply limiting the number of "java.io.File" objects (or equivalent in the Kafka codebase) that represent the handle to the segment. So, if there are 10K such objects in a (properly sized) ConcurrentHashMap, how would that add to the memory overhead so much, compared to holding 50K/200K objects anyway? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-2580) Kafka Broker keeps file handles open for all log files (even if its not written to/read from)
[ https://issues.apache.org/jira/browse/KAFKA-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963574#comment-14963574 ] Vinoth Chandar commented on KAFKA-2580: --- [~jjkoshy] Good point. If I understand correctly, even if say all consumers start bootstrapping with startTime=earliest, which can just force opening of all file handles, an LRU-based scheme would keep closing the file handles internally from oldest to latest file, which is still good behaviour. We could lessen the impact of fs.close() on old files by delegating the close to a background thread, and the cache would take a config that caps the number of items in the file handle cache. I like the cache approach better since it will be one place through which all access goes, so future features transparently play nicely with overall system limits. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
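The capped cache with background close described in the comment above can be sketched with an access-order LinkedHashMap (an assumed design for illustration, not Kafka's implementation; a real version would also have to guard against closing a handle a reader still holds, e.g. via refcounting):

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class FileHandleCache<K, V extends Closeable> {
    private final ExecutorService closer = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "fd-closer");
        t.setDaemon(true); // don't keep the JVM alive for background closes
        return t;
    });
    private final Map<K, V> cache;

    FileHandleCache(final int maxOpenHandles) {
        // Access-order LinkedHashMap gives LRU eviction; removeEldestEntry
        // fires on insert once the cap is exceeded.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() > maxOpenHandles) {
                    closer.submit(() -> {
                        try {
                            eldest.getValue().close(); // close off the hot path
                        } catch (IOException ignored) {
                        }
                    });
                    return true;
                }
                return false;
            }
        };
    }

    synchronized V get(K key) { return cache.get(key); }
    synchronized void put(K key, V handle) { cache.put(key, handle); }
    synchronized int openCount() { return cache.size(); }
}
```

This matches the behaviour argued for above: a bootstrap that touches every segment cycles handles oldest-to-newest through the cap rather than holding them all open.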
[jira] [Commented] (KAFKA-2580) Kafka Broker keeps file handles open for all log files (even if its not written to/read from)
[ https://issues.apache.org/jira/browse/KAFKA-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14941910#comment-14941910 ] Vinoth Chandar commented on KAFKA-2580: --- An LRU file handle cache is something very commonly employed in databases, and it works pretty well in practice (considering that it involves random access). So +1 on that path. [~granthenke] would you have cycles for this? If no one is working on this currently, we (Uber) can take a stab at this later this quarter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-2580) Kafka Broker keeps file handles open for all log files (even if its not written to/read from)
[ https://issues.apache.org/jira/browse/KAFKA-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908298#comment-14908298 ] Vinoth Chandar commented on KAFKA-2580:
---
More context on how we determined this:
{code}
vinoth@kafka-agg:~$ sudo ls -l /proc//fd | wc -l
50820
vinoth@kafka-agg:~$ ls -R /var/kafka-spool/data | grep -e ".log" -e ".index" | wc -l
97242
vinoth@kafka-agg:~$ ls -R /var/kafka-spool/data | grep -e ".index" | wc -l
48456
vinoth@kafka-agg:~$ ls -R /var/kafka-spool/data | grep -e ".log" | wc -l
48788
vinoth@kafka-changelog-cluster:~$ sudo ls -l /proc//fd | wc -l
59128
vinoth@kafka-changelog-cluster:~$ ls -R /var/kafka-spool/data | grep -e ".log" -e ".index" | wc -l
117548
vinoth@kafka-changelog-cluster:~$ ls -R /var/kafka-spool/data | grep -e ".index" | wc -l
58774
vinoth@kafka-changelog-cluster:~$ ls -R /var/kafka-spool/data | grep -e ".log" | wc -l
58774
{code}
[jira] [Commented] (KAFKA-2580) Kafka Broker keeps file handles open for all log files (even if its not written to/read from)
[ https://issues.apache.org/jira/browse/KAFKA-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908910#comment-14908910 ] Vinoth Chandar commented on KAFKA-2580:
---
Thanks for jumping in, [~guozhang]. We have 256MB segment sizes and 100K descriptors.
>> Adding a feature that closes inactive segments' open file handlers and re-open them upon being read / written again is possible, but would be tricky.
Can you please elaborate? It looks straightforward to me from the outside :)
[jira] [Created] (KAFKA-2580) Kafka Broker keeps file handles open for all log files (even if its not written to/read from)
Vinoth Chandar created KAFKA-2580:
-
Summary: Kafka Broker keeps file handles open for all log files (even if its not written to/read from)
Key: KAFKA-2580
URL: https://issues.apache.org/jira/browse/KAFKA-2580
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 0.8.2.1
Reporter: Vinoth Chandar

We noticed this in one of our clusters where we stage logs for a longer amount of time. It appears that the Kafka broker keeps file handles open even for non-active (not written to or read from) files. (In fact, there are some threads going back to 2013: http://grokbase.com/t/kafka/users/132p65qwcn/keeping-logs-forever)
Needless to say, this is a problem and forces us to either artificially bump up ulimit (it's already at 100K) or expand the cluster (even if we have sufficient IO and everything).
Filing this ticket, since I couldn't find anything similar. Very interested to know if there are plans to address this (given how Samza's changelog topic is meant to be a persistent large-state use case).