Recently I worked on some performance-related issues and noticed a pattern in the code that led to increased latency for some APIs in a scaled-up environment (> 10K user VMs, > 10K hosts). The pattern is like this:

    List<Host> hosts = listHosts(); // based on some filter
    for (Host h : hosts) {
        // do some processing
    }

You can replace Host with other entities such as user VMs. Functionally there is nothing wrong with this, and it works perfectly fine in smaller environments. But as the size of the deployment grows, the effect becomes much more visible. Think about the entities being operated upon and how their number grows as the environment grows. In these scenarios the looping should be avoided as far as possible by offloading the computation to the database. If required, modify the database schema to handle such scenarios.

Another aspect is the various synchronisation constructs present in the code. It is best if these can be avoided, but there are situations where they need to be used. They should be used carefully, and only the minimal amount of code should be guarded by them; otherwise they can kill performance in a scaled-up environment. For example, I came across a scenario like the one below in the code:

    lock() {
        // read something from db
        // check the state and based on that do some update
        // update such that it is synchronised
    }

In the above logic all threads wait on the lock, irrespective of whether an update is needed or not. It can be optimised as below:

    // read from db
    // check the state
    if (updateRequired) {
        lock() {
            // read again to ensure the state has not changed since the last read
            if (updateRequired) {
                // do actual update
            }
        }
    }

These are simple things to watch out for while adding new code or working on bugs. Also, feel free to raise bugs or fix them if you come across code that can cause latency.

Thanks,
Koushik
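
Addendum: a minimal Java sketch of the check, lock, re-check pattern described above, for anyone who wants something concrete to compare against. Everything named here is hypothetical and not taken from the actual code base: the class CheckThenLock, the method markReadyIfNeeded, and the in-memory "state" field that stands in for the row read from the db. A ReentrantLock is assumed in place of whatever lock() is in the real code.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical example class; names are illustrative only.
class CheckThenLock {
    // Stand-in for the value read from the db. Volatile so the
    // unlocked read below sees the latest write from other threads.
    private volatile String state = "PENDING";
    private final ReentrantLock lock = new ReentrantLock();
    // Counts how many times the guarded update actually ran.
    private final AtomicInteger updates = new AtomicInteger();

    // Stand-in for "read from db and check the state".
    private boolean updateRequired() {
        return "PENDING".equals(state);
    }

    // Returns true only if this call performed the update.
    boolean markReadyIfNeeded() {
        // First check without the lock: callers that have nothing
        // to update return here and never wait on the lock.
        if (!updateRequired()) {
            return false;
        }
        lock.lock();
        try {
            // Read again: another thread may have done the update
            // between the first check and acquiring the lock.
            if (!updateRequired()) {
                return false;
            }
            state = "READY"; // the actual update
            updates.incrementAndGet();
            return true;
        } finally {
            lock.unlock();
        }
    }

    int updateCount() {
        return updates.get();
    }

    String state() {
        return state;
    }
}
```

The first call to markReadyIfNeeded() takes the lock and performs the update; later calls return at the unlocked check. In the real code the re-read inside the lock would go back to the database, and the lock would have to be one that every writer of that state honours.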