The important thing to recognize is that it's the *caller* that increments/decrements. This means you can elide the inc/dec on an object wherever you already have a guarantee that its reference count will stay high enough.
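
To make the elision concrete, here is a minimal C++ sketch. It assumes an intrusive atomic counter; `Node`, `retain`, `release`, `read_value` and `sum_three_reads` are hypothetical names made up for illustration, not taken from any particular runtime.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical intrusive ref-counted object; all names are illustrative.
struct Node {
    std::atomic<std::size_t> refs{1};  // the creator holds the first reference
    int value = 0;
};

inline void retain(Node* n) {
    n->refs.fetch_add(1, std::memory_order_relaxed);
}

// Returns true when the last reference was dropped and the node was freed.
inline bool release(Node* n) {
    if (n->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete n;
        return true;
    }
    return false;
}

// The callee does no counting: the caller guarantees the node is alive.
inline int read_value(const Node* n) {
    assert(n->refs.load(std::memory_order_relaxed) > 0);
    return n->value;
}

// The caller owns one reference for the whole function, so none of the
// three calls below needs its own retain/release pair.
inline int sum_three_reads(Node* owned) {
    return read_value(owned) + read_value(owned) + read_value(owned);
}
```

If the callee did the counting instead, each `read_value` call would pay for one atomic increment and one atomic decrement even though the count can never reach zero while the caller holds its reference.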
That won't help you if you iterate over an array, though, so you need a mutex on the array in order to avoid an inc/dec for every single object you inspect.
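
A rough sketch of the array case, again in C++ and under stated assumptions: `Item` and `ItemArray` are hypothetical, the array itself holds one reference per element, and every removal goes through the same mutex as the iteration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical ref-counted element; the array holds one reference to it.
struct Item {
    std::atomic<std::size_t> refs{1};
    int value = 0;
};

class ItemArray {
public:
    ~ItemArray() {
        for (Item* it : items_) drop(it);
    }

    void push(int value) {
        Item* it = new Item;
        it->value = value;
        std::lock_guard<std::mutex> guard(mutex_);
        items_.push_back(it);
    }

    // Removal takes the same lock, so it cannot run while sum() iterates.
    void pop() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty()) return;
        drop(items_.back());
        items_.pop_back();
    }

    // One lock acquisition for the whole pass; no per-element inc/dec.
    long sum() const {
        std::lock_guard<std::mutex> guard(mutex_);
        long total = 0;
        for (const Item* it : items_) total += it->value;
        return total;
    }

private:
    // Gives up the array's reference and frees the item if it was the last.
    static void drop(Item* it) {
        assert(it != nullptr);
        if (it->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) delete it;
    }

    mutable std::mutex mutex_;
    std::vector<Item*> items_;
};
```

While the lock is held the array's own reference keeps every element alive, so the loop can read each one without touching its counter. Taking the mutex typically involves an atomic operation too, but it is paid once per pass rather than twice per element.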
An inc/dec with a lock prefix could easily cost you 150-200 cycles. Ola.