devinbost commented on issue #6054: URL: https://github.com/apache/pulsar/issues/6054#issuecomment-801518480
**I found a huge clue on this issue** (in Pulsar 2.6.3). Several hours after we collected those stats, I took a heap dump of the function that was consuming from the dead topic, and its permits were actually positive. (In the heap dump of `third-function`, `availablePermits` in `ConsumerImpl` was positive.)

![image](https://user-images.githubusercontent.com/7418031/111550705-44feb080-8744-11eb-97f5-f8dd0ecfc722.png)

This morning, I checked the stats on `third-topic` again. The permits had gone back to positive, and its backlogs were all gone.

I found in the Pulsar source code that the consumer is supposed to periodically report its available permits to the broker (https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java#L112). It seems the function eventually reported its permits to the broker, which brought the permits back into the positive, so the broker resumed pushing messages to the consumer. It's possible that it locked up again after accumulating more negative permits, cleared after a while, and repeated this cycle until the whole backlog drained.

To confirm whether the issue is just the broker getting out of sync with the consumer's `availablePermits`, I need to reproduce the issue and take a function heap dump while the topic permits are negative, to check whether the function's permits are positive at that point or also negative. If they're also negative, then something in the consumer is making them positive after a while.

It looks like this PR may fix some of the permit issues: https://github.com/apache/pulsar/pull/7266 However, we're not doing any explicit batching with Pulsar Functions. Do they batch by default? Maybe @jerrypeng or someone can provide some guidance here.
If there isn't batching taking place, then that PR won't completely solve this issue, because these findings suggest there's another reason (other than the incorrect batch reporting of permits mentioned in that PR) that the permits are getting out of sync between the broker and the function.
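For the reproduction, the hard part is catching the moment when the broker-side permits are non-positive so the heap dump can be taken right then. Below is a minimal sketch of a check that could be run against the JSON from `pulsar-admin topics stats`. It assumes the stats use the field names `subscriptions`, `msgBacklog`, `consumers`, `consumerName`, and `availablePermits`; I haven't verified those against every version, so treat them as assumptions. The function name `find_stalled_consumers`, the subscription name, and the sample values are hypothetical and only there for illustration.

```python
from typing import Any, Dict, List


def find_stalled_consumers(topic_stats: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return consumers with non-positive broker-side permits on a backlogged subscription.

    `topic_stats` is the parsed JSON of a topic's stats. The field names used
    here are assumptions about that JSON layout.
    """
    stalled = []
    for sub_name, sub in topic_stats.get("subscriptions", {}).items():
        backlog = sub.get("msgBacklog", 0)
        if backlog <= 0:
            # Nothing is waiting, so non-positive permits are not blocking anything.
            continue
        for consumer in sub.get("consumers", []):
            permits = consumer.get("availablePermits", 0)
            if permits <= 0:
                stalled.append(
                    {
                        "subscription": sub_name,
                        "consumer": consumer.get("consumerName", "<unnamed>"),
                        "availablePermits": permits,
                        "msgBacklog": backlog,
                    }
                )
    return stalled


# Hypothetical stats snapshot, shaped like the assumed layout above.
sample_stats = {
    "subscriptions": {
        "third-function-sub": {
            "msgBacklog": 1200,
            "consumers": [
                {"consumerName": "third-function-0", "availablePermits": -37},
            ],
        }
    }
}

for entry in find_stalled_consumers(sample_stats):
    # When this prints, that is the moment to take the function heap dump
    # and compare ConsumerImpl's availablePermits with the broker's value.
    print(entry)
```

If the heap dump taken at that moment shows positive `availablePermits` in `ConsumerImpl` while this check still reports a non-positive value, that would point to the broker and consumer counts drifting apart rather than the consumer itself running out of permits.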