siddharthteotia commented on PR #11496:
URL: https://github.com/apache/pinot/pull/11496#issuecomment-1716427968

   > My concern is that we are trying to prove that the fix is working using 
tests/heap dump, etc. vs the restart will just work. 
   
   Let me elaborate a bit on the nature of the problem we saw in production.
   
   We have a cluster with several thousand tables served by a handful of brokers.
   
   A really bad query that was fetching around 150MB of data from 160 servers 
(fan out was 160) caused a direct memory OOM on the broker. Note that this was a 
soft OOM (the broker didn't crash, unlike with a Java heap space OOM).
   
   The problem is not just the OOM itself. It is the cascading impact of this 
OOM on the overall stability / availability of the system.
   
   Concurrent queries around the same time, and subsequent ones, also failed 
because:
   
   - All the direct buffer references were still being held. Our heap dump 
confirmed this.
   - Netty threads were relentlessly trying to allocate memory -- multiple 
times per channel, per server. So the same failure was happening repeatedly. 
Netty called into NIO to allocate direct memory. NIO first tried to GC, but GC 
didn't help because the corresponding direct buffer references were still in 
scope. 
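   
   To make the two points above concrete, here is a toy model of a fixed 
direct-memory budget. It is only a sketch and is not Netty, NIO, or Pinot code: 
the class `DirectBudgetSketch`, its methods, and the byte counts are all 
hypothetical and chosen for illustration. It shows that while one query's 
buffers are still referenced, every retry by another query fails the same way 
without crashing the process, and that the budget only recovers once those 
references are released (modeled here as closing the channels).

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a fixed direct-memory budget. Hypothetical; not Netty or Pinot code. */
public class DirectBudgetSketch {
  private final long capacityBytes;
  private long usedBytes = 0;
  // Bytes still referenced per channel. While an entry exists, those bytes cannot be reclaimed.
  private final Map<String, Long> heldByChannel = new HashMap<>();

  public DirectBudgetSketch(long capacityBytes) {
    this.capacityBytes = capacityBytes;
  }

  /** Returns false (a "soft" failure) when the budget is exhausted; nothing crashes. */
  public boolean tryReserve(String channel, long bytes) {
    if (usedBytes + bytes > capacityBytes) {
      return false;
    }
    usedBytes += bytes;
    heldByChannel.merge(channel, bytes, Long::sum);
    return true;
  }

  /** Models closing a channel: its buffers become unreferenced and the bytes are returned. */
  public void releaseChannel(String channel) {
    Long held = heldByChannel.remove(channel);
    if (held != null) {
      usedBytes -= held;
    }
  }

  public long usedBytes() {
    return usedBytes;
  }

  public static void main(String[] args) {
    DirectBudgetSketch budget = new DirectBudgetSketch(1000);

    // The expensive query fills the whole budget across its server channels.
    for (int i = 0; i < 10; i++) {
      budget.tryReserve("server-" + i, 100);
    }

    // A concurrent query retries; each attempt fails because nothing was released.
    int failures = 0;
    for (int attempt = 0; attempt < 3; attempt++) {
      if (!budget.tryReserve("other-query", 50)) {
        failures++;
      }
    }
    System.out.println("failed retries: " + failures);

    // Releasing the held references is what lets later allocations succeed.
    for (int i = 0; i < 10; i++) {
      budget.releaseChannel("server-" + i);
    }
    System.out.println("after release: " + budget.tryReserve("other-query", 50));
  }
}
```

   In this model, retrying does nothing until the references are dropped, which 
mirrors why GC alone could not help in the incident described above.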
   
   Collectively, this destabilized the cluster and reduced availability. We did 
restart as an initial mitigation once we detected this, but by then it had 
already hurt our critical production use case, which missed its SLA because of 
the cascading impact on the concurrent / subsequent queries. 
   
   I understand and agree that a restart is simpler, but the detailed RCA showed 
there are definitely opportunities in the code that could have prevented this or 
at least reduced the impact. @jasperjiaguo's fix is aimed at that, which is why 
we also shared numbers from the memory overhead reduction testing and showed 
subsequent queries working fine after the short recovery.
   
   I agree that shutting down channels will cause the other queries to fail, but 
that particular impact may not be worse than the real-life worst-case impact I 
described above, which, without manual intervention or other tooling, will 
continue to cause problems on the cluster IMHO.
   
   @soumitra-st @ege-st - I hope this gives some insight into where we are 
coming from. We can also chat offline and align if need be
   
   cc @jasperjiaguo @vvivekiyer 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

