aweiri1 opened a new issue, #24819: URL: https://github.com/apache/pulsar/issues/24819
### Search before reporting

- [x] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.

### Read release policy

- [x] I understand that [unsupported versions](https://pulsar.apache.org/contribute/release-policy/#supported-versions) don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

### User environment

Pulsar version: 4.04
Helm chart version: 4.0.1

Running two Kubernetes Pulsar clusters: an OpenShift cluster and a Talos cluster.

Linux pulsar-talos-toolset-0 6.12.25-talos #1 SMP Mon Apr 28 10:05:42 UTC 2025 x86_64 GNU/Linux
Linux pulsar-okd1-toolset-0 6.3.12-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul 6 04:05:18 UTC 2023 x86_64 GNU/Linux

Using the Python client for the test producer/consumer: Python 3.10.12

### Issue Description

I am trying to configure geo-replication for a two-cluster Pulsar setup deployed on Kubernetes. I have an okd1 Pulsar cluster running on an OpenShift Kubernetes cluster, a talos Pulsar cluster running on a Talos Kubernetes cluster, and a global config/metadata store running as a ZooKeeper-only Pulsar cluster on a Talos Kubernetes cluster. Each cluster uses a proxy service with an external load balancer IP, which I use for configuration.

When I run a producer on cluster A for the first time (relying on auto topic creation; I did not create the topic manually), the producer runs successfully and sends the message through the cluster A service URL. Before starting the consumer on cluster B on that same topic, I wanted to check the topic stats on cluster B, but it says the topic doesn't exist. Once I start the consumer on the cluster B service URL, it receives the messages successfully. But I thought topics were supposed to be replicated across clusters?
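The check described above (does the topic exist on cluster B before any consumer attaches?) can be scripted. The following is a minimal sketch, not part of the original report. It assumes the admin REST API of each cluster is reachable over plain HTTP without authentication, that a missing topic is answered with HTTP 404, and that the topic is non-partitioned. The helper names `stats_url`, `fetch_stats`, and `stalled_replicators` are hypothetical; the `replication`, `connected`, and `replicationBacklog` field names are taken from the topic stats quoted under "Error messages".

```python
import json
import urllib.error
import urllib.request
from urllib.parse import quote


def stats_url(admin_url: str, tenant: str, namespace: str, topic: str) -> str:
    """Build the admin REST URL for the stats of a non-partitioned persistent topic."""
    path = "/".join(quote(part, safe="") for part in (tenant, namespace, topic))
    return f"{admin_url.rstrip('/')}/admin/v2/persistent/{path}/stats"


def fetch_stats(admin_url: str, tenant: str, namespace: str, topic: str,
                timeout: float = 10.0):
    """Return the topic stats as a dict, or None if the cluster answers 404.

    Assumption: a topic that does not exist on the cluster yields HTTP 404.
    Any other HTTP error is re-raised so it is not mistaken for "missing".
    """
    url = stats_url(admin_url, tenant, namespace, topic)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def stalled_replicators(stats: dict) -> list[str]:
    """List remote clusters whose replicator is disconnected with a backlog."""
    stalled = []
    for cluster, rep in stats.get("replication", {}).items():
        if not rep.get("connected", False) and rep.get("replicationBacklog", 0) > 0:
            stalled.append(cluster)
    return stalled
```

With a placeholder admin address for cluster B, `fetch_stats("http://<cluster-b-proxy>", "geo-replication", "testing", "test")` returning `None` before the consumer starts would correspond to the behavior reported here, and `stalled_replicators` applied to the cluster A stats would list `pulsar-okd1`.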
I have waited over 30 minutes to see if the topic shows up on cluster B, to rule out any timing issues, but it never showed up until the consumer was started on cluster B. When I connect the cluster B consumer, it receives all of the messages from the cluster A producer. Until I start that cluster B consumer, the topic does not exist on cluster B and none of the messages exist on cluster B.

This only happens when the topic is first auto-created by the cluster A producer. Once that consumer runs for the first time on cluster B (which creates the topic on cluster B), I don't hit this issue on this topic again, since it has already been created. From that point on, the consumer on cluster B does not need to be running for me to see messages sitting in the backlog.

### Error messages

```text
an error we get in the cluster A broker logs on immediate producer send is:

2025-09-09T22:20:32,547+0000 [broker-client-shared-scheduled-executor-7-1] WARN org.apache.pulsar.client.impl.PulsarClientImpl - [topic: persistent://geo-replication-2/testing/__change_events] Could not get connection while getPartitionedTopicMetadata -- Will try again in 754 ms
pulsar-talos-broker 2025-09-09T22:20:32,551+0000 [pulsar-io-3-15] ERROR org.apache.pulsar.client.impl.ClientCnx - [id: 0x714b44aa, L:/ - R:] Close connection because received internal-server error {"errorMsg":"","reqId":1946041700241505712, "remote":"pulsar-okd1-broker.pulsar.svc.cluster.local/, "local":"/"}
pulsar-talos-broker 2025-09-09T22:20:32,552+0000 [pulsar-io-3-15] WARN org.apache.pulsar.client.impl.BinaryProtoLookupService - [persistent://geo-replication-2/testing/__change_events] failed to get Partitioned metadata : {"errorMsg":"{"errorMsg":"","reqId":1946041700241505712, "remote":"pulsar-okd1-broker.pulsar.svc.cluster.local/", "local":"/"}","reqId":1229027975051488438, "remote":"", "local":"/"}
pulsar-talos-broker org.apache.pulsar.client.api.PulsarClientException$LookupException: {"errorMsg":"{"errorMsg":"","reqId":1946041700241505712, "remote":"pulsar-okd1-broker.pulsar.svc.cluster.local/1"local":"/"}","reqId":1229027975051488438, "remote":"", "local":"/"}

I assumed this could also be a timing issue because it immediately tries to find the topic on cluster B (okd1) and it does not exist.

topic stats on cluster A show the following in the replication field:

"replication" : {
  "pulsar-okd1" : {
    "msgRateIn" : 0.0,
    "msgInCount" : 0,
    "msgThroughputIn" : 0.0,
    "bytesInCount" : 0,
    "msgRateOut" : 0.0,
    "msgOutCount" : 0,
    "msgThroughputOut" : 0.0,
    "bytesOutCount" : 0,
    "msgRateExpired" : 0.0,
    "replicationBacklog" : 100,
    "connected" : false,
    "replicationDelayInSeconds" : 0,
    "msgExpiredCount" : 0
  }

I did enable debug on both clusters. there are no create topic logs, but the debug logs on okd1 show the metadata lookup for the topic created on talos cluster:

2025-10-02T23:16:33,265+0000 [pulsar-io-3-5] DEBUG org.apache.pulsar.broker.service.BrokerService - No autoTopicCreateOverride policy found for persistent://geo-replication/testing/test
2025-10-02T23:16:33,471+0000 [pulsar-io-3-8] DEBUG org.apache.pulsar.broker.service.ServerCnx - [persistent://geo-replication/testing/test] Received PartitionMetadataLookup from /10.128.2.45:43198 for 770175375804561621
2025-10-02T23:16:33,471+0000 [pulsar-io-3-8] DEBUG
```

### Reproducing the issue

I am using 3 Kubernetes clusters. One of them is the ZooKeeper-only cluster, which is the global metadata store for the okd1 and talos clusters. The other two are full Pulsar clusters that use a proxy. The proxy is a Kubernetes service backed by a load balancer with an external IP.
That external IP is what I use as my Pulsar service URL.

On the talos cluster I run the following to enable geo-replication:

bin/pulsar-admin tenants create geo-replication --allowed-clusters pulsar-okd1,pulsar-talos
bin/pulsar-admin namespaces create geo-replication/testing
bin/pulsar-admin namespaces set-clusters geo-replication/testing --clusters pulsar-talos,pulsar-okd1

Since we're using the global config store, the cluster and tenant already exist on the okd1 cluster. All I did was set the clusters on the tenant/namespace.

I did pass in the right service URL for okd1 by doing a clusters update on the okd1 cluster and updating the URLs to use the load balancer IP address. Then I restarted the brokers. (I did this for both clusters.)

I am not using any authentication credentials for either of the clusters. I do have permissions to create a topic on that namespace; I verified this by running a topic create command on the namespace.

### Additional information

After more discussion with David K in the Slack channel, he concluded:

It sounds like the metadata on cluster B doesn’t get updated until a consumer attaches to the replicated topic even though the underlying topic data is there. This behavior is wrong, and the topic should exist in the target cluster’s metadata.

### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
