devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-801518480


   **I found a huge clue on this issue** (in Pulsar 2.6.3).
   Several hours after we collected those stats, I took a heap dump of the 
function that was consuming from the dead topic, and its permits were actually 
positive. 
   (In the heap dump of `third-function`, in `ConsumerImpl`, the function's 
`availablePermits` were positive.)
   
![image](https://user-images.githubusercontent.com/7418031/111550705-44feb080-8744-11eb-97f5-f8dd0ecfc722.png)
   
   This morning, I checked the stats on `third-topic` again, and the 
permits had gone back to positive, and its backlog had cleared. 
   
   I found in the Pulsar source code that the consumer is supposed to 
periodically report to the broker what its available permits are 
(https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java#L112).
   It seems like the function must have eventually reported its permits to the 
broker, which brought the broker's permits back into the positive, so the broker 
resumed pushing messages to the consumer. It's possible that it then locked up 
again after accumulating more negative permits, cleared after a while, and 
repeated this cycle until the backlog was fully drained. 
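   
   To make the suspected stall/resume cycle concrete, here is a toy model of 
permit-based flow control. It is only a sketch, not Pulsar's actual code: the 
class names (`BrokerView`, `ConsumerView`), the refill threshold of half the 
receiver queue, and the starting deficit of -30 are all assumptions made for 
illustration.

```java
/** Toy model of permit-based flow control; NOT Pulsar's real implementation. */
public class PermitFlowSketch {

    /** Broker-side view: how many messages it believes it may still push. */
    static final class BrokerView {
        int permits;   // may drift negative in this model
        int backlog;   // messages waiting to be dispatched

        BrokerView(int backlog) { this.backlog = backlog; }

        /** The consumer grants more permits (a "flow" report in this model). */
        void onFlow(int additionalPermits) { permits += additionalPermits; }

        /** Pushes at most `permits` messages; returns how many were dispatched. */
        int dispatch() {
            int n = Math.max(0, Math.min(permits, backlog));
            permits -= n;
            backlog -= n;
            return n;
        }
    }

    /** Consumer-side view: permits accumulated locally but not yet reported. */
    static final class ConsumerView {
        final int refillThreshold; // assumed: half the receiver queue size
        int availablePermits;

        ConsumerView(int receiverQueueSize) {
            this.refillThreshold = Math.max(1, receiverQueueSize / 2);
        }

        /** Called per processed message; returns permits to report, or 0. */
        int onProcessed() {
            availablePermits++;
            if (availablePermits >= refillThreshold) {
                int toReport = availablePermits;
                availablePermits = 0;
                return toReport;
            }
            return 0;
        }
    }

    public static void main(String[] args) {
        BrokerView broker = new BrokerView(100);
        broker.permits = -30;                       // assumed drift on the broker
        ConsumerView consumer = new ConsumerView(100);

        for (int i = 0; i < 49; i++) consumer.onProcessed();
        // Consumer holds positive permits, but the broker still sees a deficit.
        System.out.println("consumer permits: " + consumer.availablePermits); // 49
        System.out.println("dispatched: " + broker.dispatch());               // 0

        int flow = consumer.onProcessed();          // crosses the threshold: 50
        broker.onFlow(flow);
        System.out.println("dispatched after flow: " + broker.dispatch());    // 20
    }
}
```

   In this model the broker stays stalled while the consumer's local count is 
positive but unreported, and dispatch resumes only once a report is large 
enough to cover the broker's deficit. That matches the shape of what the heap 
dump and stats showed, but it does not explain how the deficit arises.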
   
   To confirm whether the issue is just the broker getting out of sync with the 
consumer's `availablePermits`, I need to reproduce the issue and take a heap 
dump of the function while the topic permits are negative, to check whether the 
function's permits are positive at that point or also negative. If they're also 
negative, then something in the consumer is making them positive after a 
while.
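   
   The decision rule for that experiment can be written down as a small 
sketch. The names here (`PermitDiagnosis`, `Finding`, `classify`) are 
hypothetical, and treating a consumer value of exactly zero as inconclusive is 
my own assumption; the two inputs would be read by hand from the topic stats 
and the heap dump.

```java
/** Hypothetical helper encoding how to read the planned experiment. */
public class PermitDiagnosis {

    enum Finding {
        NOT_REPRODUCED,        // broker permits are not negative, nothing to compare
        BROKER_OUT_OF_SYNC,    // broker negative while the consumer is positive
        CONSUMER_SIDE_CHANGE,  // both negative: the consumer later turns positive
        INCONCLUSIVE           // consumer at exactly zero (assumed edge case)
    }

    /**
     * @param brokerPermits   permits shown for the consumer in the topic stats
     * @param consumerPermits availablePermits read from the function heap dump
     */
    static Finding classify(int brokerPermits, int consumerPermits) {
        if (brokerPermits >= 0) {
            return Finding.NOT_REPRODUCED;
        }
        if (consumerPermits > 0) {
            return Finding.BROKER_OUT_OF_SYNC;
        }
        if (consumerPermits < 0) {
            return Finding.CONSUMER_SIDE_CHANGE;
        }
        return Finding.INCONCLUSIVE;
    }

    public static void main(String[] args) {
        // Example values only; they are not measurements from this issue.
        System.out.println(classify(-30, 49));
        System.out.println(classify(-30, -5));
    }
}
```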
   
   It looks like this PR may fix some of the permit issues: 
   https://github.com/apache/pulsar/pull/7266
   However, we're not doing any explicit batching with the Pulsar functions. Do 
they batch by default? Maybe @jerrypeng or someone can provide some guidance 
here. 
   If there isn't batching taking place, then that PR won't completely solve 
this issue, because these findings would point to a cause other than the 
incorrect batch reporting of permits described in that PR for the permits 
getting out of sync between the broker and the function. 



