utahkay opened a new issue #14736:
URL: https://github.com/apache/pulsar/issues/14736


   **Describe the bug**
   Pulsar 2.10 introduces the zookeeperSessionExpiredPolicy=reconnect option. 
If the zookeeper session is disconnected, the brokers will continue running 
with as much functionality as possible, and reconnect to zookeeper when it is 
available again.
   
   The problem happens when the brokers reconnect to zookeeper. They seem to be 
having locks invalidated continuously, so the broker revalidates the lock, then 
it's invalidated, then the broker again revalidates, then it's invalidated, and 
so on. The CPU and memory usage gradually increase over the next 10-15 minutes, 
slowing down message delivery. Eventually, the brokers run out of heap space 
and die due to OOM. At which point Kubernetes recreates them, and they work 
fine at that point, without experiencing the lock invalidation problem.
   
   Cycling the brokers at any time prior to them dying also fixes the problem.
   
   **To Reproduce**
   Steps to reproduce the behavior:
   1. I'm running Pulsar 2.10.0.0-rc12 (StreamNative) in StreamNative Cloud 
Hosted environment.
   2. I have a couple clients putting light load on the Pulsar cluster, 
publishing and consuming.
   3. I kill the zookeeper cluster entirely.
   4. I wait for 5 minutes.
   5. I bring the zookeeper cluster back up.
   
   **Expected behavior**
   1. Expect the brokers to reconnect to zookeeper - This succeeds.
   2. Expect the brokers to work normally - Instead, they spend much of their 
time revalidating locks that are immediately invalidated.
   
   **Screenshots**
   A portion of the logs:
   ```
   "Mar 17, 2022 @ 
14:32:27.642","test-2-10-broker-0","2022-03-17T20:32:27,642+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully 
revalidated the lock on 
/namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 
14:32:27.642","test-2-10-broker-0","2022-03-17T20:32:27,641+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully 
re-acquired lock at 
/namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 
14:32:27.642","test-2-10-broker-0","2022-03-17T20:32:27,641+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Acquired 
resource lock on 
/namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 
14:32:27.642","test-2-10-broker-0","2022-03-17T20:32:27,642+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully 
re-acquired lock at 
/namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 
14:32:27.642","test-2-10-broker-0","2022-03-17T20:32:27,641+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully 
revalidated the lock on 
/namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 
14:32:27.652","test-2-10-broker-0","2022-03-17T20:32:27,652+0000 
[metadata-store-9-1] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on 
resource 
/namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff
 was invalidated"
   "Mar 17, 2022 @ 
14:32:27.652","test-2-10-broker-0","2022-03-17T20:32:27,651+0000 
[metadata-store-9-1] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on 
resource 
/namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff
 was invalidated"
   "Mar 17, 2022 @ 
14:32:27.652","test-2-10-broker-0","2022-03-17T20:32:27,652+0000 
[metadata-store-9-1] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on 
resource 
/namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff
 was invalidated"
   "Mar 17, 2022 @ 
14:32:27.652","test-2-10-broker-0","2022-03-17T20:32:27,652+0000 
[metadata-store-9-1] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on 
resource 
/namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff
 was invalidated"
   "Mar 17, 2022 @ 
14:32:27.959","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully 
revalidated the lock on 
/namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 
14:32:27.959","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Acquired 
resource lock on 
/namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 
14:32:27.959","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully 
re-acquired lock at 
/namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 
14:32:27.959","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully 
re-acquired lock at 
/namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 
14:32:27.959","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Acquired 
resource lock on 
/namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 
14:32:27.960","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully 
revalidated the lock on 
/namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 
14:32:27.967","test-2-10-broker-0","2022-03-17T20:32:27,967+0000 
[metadata-store-9-1] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on 
resource 
/namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff
 was invalidated"
   "Mar 17, 2022 @ 
14:32:27.967","test-2-10-broker-0","2022-03-17T20:32:27,967+0000 
[metadata-store-9-1] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on 
resource 
/namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff
 was invalidated"
   "Mar 17, 2022 @ 
14:32:27.968","test-2-10-broker-0","2022-03-17T20:32:27,967+0000 
[metadata-store-9-1] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on 
resource 
/namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff
 was invalidated"
   "Mar 17, 2022 @ 
14:32:27.968","test-2-10-broker-0","2022-03-17T20:32:27,967+0000 
[metadata-store-9-1] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on 
resource 
/namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff
 was invalidated"
   "Mar 17, 2022 @ 
14:32:28.294","test-2-10-broker-0","2022-03-17T20:32:28,294+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully 
re-acquired lock at 
/namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 
14:32:28.294","test-2-10-broker-0","2022-03-17T20:32:28,294+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully 
revalidated the lock on 
/namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 
14:32:28.294","test-2-10-broker-0","2022-03-17T20:32:28,293+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Acquired 
resource lock on 
/namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 
14:32:28.294","test-2-10-broker-0","2022-03-17T20:32:28,294+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully 
revalidated the lock on 
/namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 
14:32:28.294","test-2-10-broker-0","2022-03-17T20:32:28,293+0000 
[main-EventThread] INFO  
org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully 
re-acquired lock at 
/namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   
   ....
   
   "Mar 17, 2022 @ 14:36:31.082","test-2-10-broker-0","Terminating due to 
java.lang.OutOfMemoryError: Java heap space"
   ```
   
   ![CPU and memory 
consumption](https://user-images.githubusercontent.com/41346/158894058-9c597727-03f3-4c6d-9720-c71964ead353.png)
   
   ![Message latency gradually increases up to 5 seconds, until brokers 
restart](https://user-images.githubusercontent.com/41346/158895650-d856e981-d6a9-4193-ad04-17816552ac9e.png)
   
   
   **Desktop (please complete the following information):**
    - StreamNative Cloud Hosted (Ubuntu)
   
   **Additional context**
   I notice in the logs, after a few minutes, there start to be some zookeeper 
and MetadataStore disconnects / reconnects. 
   
   More logs:
   
[test-2-10-broker-0.log.csv](https://github.com/apache/pulsar/files/8292566/test-2-10-broker-0.log.csv)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to