Recently I worked on some performance-related issues and noticed a pattern in 
the code that led to increased latency for some APIs in a scaled-up 
environment (> 10K user VMs, > 10K hosts). The pattern looks like this:

List<Host> hosts = listHosts(); // based on some filter
for (Host h : hosts) {
    // do some processing
}

You can replace Host with other entities such as user VMs. Functionally there 
is nothing wrong with this, and it works perfectly fine in smaller 
environments. But the list being iterated grows with the size of the 
deployment, so the cost of loading every entity and looping over it grows with 
it. In these scenarios the looping should be avoided as far as possible by 
offloading the computation to the database, for example by filtering or 
aggregating in the query itself. If required, modify the database schema to 
handle such scenarios.
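
As a rough sketch of what offloading to the database can look like, assume a 
hypothetical host table with cluster_id, state and cpu_cores columns. The 
table, the column names and the HostCapacityQueries class are made up for 
illustration and are not the actual schema or DAO layer. Instead of listing the 
hosts and summing in a loop, a single aggregate query lets the database do the 
work and returns one row:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class HostCapacityQueries {
    // Hypothetical schema: host(id, cluster_id, state, cpu_cores).
    // The filter and the sum both run inside the database.
    static final String TOTAL_CORES_SQL =
        "SELECT COALESCE(SUM(cpu_cores), 0) FROM host "
      + "WHERE cluster_id = ? AND state = ?";

    private HostCapacityQueries() {}

    // Returns the total cores of matching hosts without loading any Host objects.
    static long totalCores(Connection conn, long clusterId, String state)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(TOTAL_CORES_SQL)) {
            ps.setLong(1, clusterId);
            ps.setString(2, state);
            try (ResultSet rs = ps.executeQuery()) {
                // An aggregate without GROUP BY yields exactly one row.
                return rs.next() ? rs.getLong(1) : 0L;
            }
        }
    }
}
```

With 10K hosts this transfers a single number instead of 10K rows, and an index 
on (cluster_id, state) would be the kind of schema change that keeps the query 
cheap.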


Another aspect is the various synchronisation constructs present in the code. 
It is best if these can be avoided, but there are situations where they need to 
be used. In that case they should be used carefully, and only the minimal 
amount of code should be guarded by them; otherwise they can kill performance 
in a scaled-up environment. For example, I came across a scenario like the one 
below in the code:

lock() {

    // read something from db
    // check the state and based on that do some update
    // update such that it is synchronised
}

In the above logic all threads wait on the lock irrespective of whether an 
update is needed or not. It can be optimised as below, so that the lock is 
taken only when an update appears to be necessary:

// read from db
// check the state
if (updateRequired) {
    lock() {
        // again read to ensure state not changed since last read
        if (updateRequired) {
            // do actual update
        }
    }
}
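
A minimal Java sketch of the check, lock, re-check pattern above. The class 
name, the "Stale"/"Fresh" states and the in-memory field standing in for the 
database row are all assumptions for illustration; in real code the reads and 
the update would go to the database, and the lock would have to be one that 
every writer honours.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

final class CheckThenLockUpdater {
    private final ReentrantLock lock = new ReentrantLock();
    // Stand-in for a database row; volatile so the unlocked read sees recent writes.
    private volatile String state = "Stale";
    // Counts how many times the guarded update actually ran.
    final AtomicInteger updates = new AtomicInteger();

    private String readState() {
        return state; // hypothetical "read from db"
    }

    private boolean updateRequired(String current) {
        return "Stale".equals(current);
    }

    // Returns true only if this call performed the update.
    boolean refreshIfNeeded() {
        // Fast path: most callers return here without touching the lock.
        if (!updateRequired(readState())) {
            return false;
        }
        lock.lock();
        try {
            // Read again: another thread may have updated while we waited.
            if (!updateRequired(readState())) {
                return false;
            }
            state = "Fresh"; // hypothetical "do actual update"
            updates.incrementAndGet();
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

Threads that find nothing to update never contend on the lock, and the second 
check inside the lock keeps the update from running twice.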


These are simple things to watch out for while adding new code or working on 
bugs. Also feel free to raise bugs or fix them if you come across code that 
can cause latency.

Thanks,
Koushik
