siddharthteotia commented on PR #11496: URL: https://github.com/apache/pinot/pull/11496#issuecomment-1716427968
> My concern is that we are trying to prove that the fix is working using tests/heap dump, etc. vs the restart will just work.

Let me elaborate a bit on the nature of the problem we saw in production. We have a cluster with several thousand tables served by a handful of brokers. A really bad query that was fetching around 150MB of data from 160 servers (fan-out was 160) caused a direct memory OOM on the broker. Note that this was a soft OOM (the broker didn't crash, unlike with a Java heap space OOM).

The problem is not just the OOM. It is the cascading impact of this OOM on the overall stability / availability of the system. Concurrent queries around the same time, and subsequent ones, also failed because:

- All the direct buffer references were held up. Our heap dump confirmed it.
- Netty threads were relentlessly trying to allocate memory, multiple times per channel per server, so the same failure was happening repeatedly. Netty called into NIO to allocate direct memory. NIO first tried to GC. GC didn't help because the corresponding direct buffer reference was still in scope.

So this collectively destabilized the system and reduced its availability.

We also restarted initially when we detected this, to mitigate, but by that time it had already negatively impacted our critical production use case, which missed its SLA because of the cascading impact on the concurrent / subsequent queries.

I understand and agree that a restart is simpler, but from the detailed RCA there are definitely opportunities in the code that could have prevented this or at least reduced the impact. @jasperjiaguo's fix is aimed at that, and that's why we also shared numbers on memory overhead reduction testing + subsequent queries working fine after the short recovery.
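
For readers following the thread, the retry loop described above can be sketched with a toy model. This is not Pinot, Netty, or JDK code: `DirectMemoryBudgetSketch`, `tryAllocate`, and `dropAllReferences` are hypothetical names, and the byte counts are arbitrary. It only illustrates the claim that a "GC first, then retry" allocation keeps failing while the existing buffers are still referenced, and succeeds once those references are released.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a fixed direct-memory budget; hypothetical, for illustration only. */
final class DirectMemoryBudgetSketch {
    private final long capacityBytes;
    private long reservedBytes;
    // Sizes of buffers that are still referenced; a collection pass cannot reclaim them.
    private final List<Long> liveBuffers = new ArrayList<>();
    // Sizes of buffers whose references were dropped; a collection pass can reclaim them.
    private final List<Long> unreachableBuffers = new ArrayList<>();

    DirectMemoryBudgetSketch(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /** Tries to reserve memory, running a simulated collection first if the budget is exceeded. */
    boolean tryAllocate(long bytes) {
        if (reservedBytes + bytes > capacityBytes) {
            collectUnreachable();
        }
        if (reservedBytes + bytes > capacityBytes) {
            return false; // stands in for the direct memory OOM
        }
        reservedBytes += bytes;
        liveBuffers.add(bytes);
        return true;
    }

    /** Stands in for whatever releases the held buffers (an assumption about the mechanism). */
    void dropAllReferences() {
        unreachableBuffers.addAll(liveBuffers);
        liveBuffers.clear();
    }

    // Simulated GC: only buffers that are no longer referenced give their memory back.
    private void collectUnreachable() {
        for (long size : unreachableBuffers) {
            reservedBytes -= size;
        }
        unreachableBuffers.clear();
    }

    public static void main(String[] args) {
        DirectMemoryBudgetSketch budget = new DirectMemoryBudgetSketch(100);
        System.out.println(budget.tryAllocate(60)); // first response buffer fits
        System.out.println(budget.tryAllocate(60)); // over budget; collection frees nothing
        System.out.println(budget.tryAllocate(60)); // retrying alone does not help
        budget.dropAllReferences();
        System.out.println(budget.tryAllocate(60)); // succeeds once references are gone
    }
}
```

In this model every retry repeats the same failed collection pass, which is the repeated-failure pattern described in the list above. Only dropping the references changes the outcome.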

I agree that shutting down channels will cause the other queries to fail, but that particular impact may not be worse than the real-life worst-case impact I described above, which, without manual intervention or other tooling, will continue to cause problems on the cluster, IMHO.

@soumitra-st @ege-st - I hope this gives some insight into where we are coming from. We can also chat offline and align if need be.

cc @jasperjiaguo @vvivekiyer
