On Tue, Jul 29, 2014 at 4:56 PM, Bela Ban <b...@redhat.com> wrote:
> Hi guys,
>
> sorry for the long post, but I do think I ran into an important problem
> and we need to fix it ... :-)
>
> I've spent the last couple of days running the IspnPerfTest [1] perftest
> on Google Compute Engine (GCE), and I've run into a problem with
> Infinispan. It is a design problem and can be mitigated by sizing thread
> pools correctly, but cannot be eliminated entirely.
>
>
> Symptom:
> --------
> IspnPerfTest has every node in a cluster perform 20'000 requests on keys
> in range [1..20000].
>
> 80% of the requests are reads and 20% writes.
>
> By default, we have 25 requester threads per node and 100 nodes in a
> cluster, so a total of 2500 requester threads.
>
> The cache used is NON-TRANSACTIONAL / dist-sync / 2 owners:
>
> <namedCache name="clusteredCache">
>     <clustering mode="distribution">
>         <stateTransfer awaitInitialTransfer="true"/>
>         <hash numOwners="2"/>
>         <sync replTimeout="20000"/>
>     </clustering>
>
>     <transaction transactionMode="NON_TRANSACTIONAL"
>                  useEagerLocking="true"
>                  eagerLockSingleNode="true" />
>     <locking lockAcquisitionTimeout="5000" concurrencyLevel="1000"
>              isolationLevel="READ_COMMITTED" useLockStriping="false" />
> </namedCache>
>
> It has 2 owners, a lock acquisition timeout of 5s and a repl timeout of
> 20s. Lock striping is off, so we have 1 lock per key.
>
> When I run the test, I always get errors like those below:
>
> org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock
> after [10 seconds] on key [19386] for requestor [Thread[invoker-3,5,main]]!
> Lock held by [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]
>
> and
>
> org.infinispan.util.concurrent.TimeoutException: Node m8.1 timed out
>
>
> Investigation:
> --------------
> When I looked at UNICAST3, I saw a lot of missing messages on the
> receive side and unacked messages on the send side. This caused me to
> look into the (mainly OOB) thread pools and - voila - maxed out!
>
> I learned from Pedro that the Infinispan internal thread pool (with a
> default of 32 threads) can be configured, so I increased it to 300 and
> increased the OOB pools as well.
>
> This mitigated the problem somewhat, but when I increased the requester
> threads to 100, I had the same problem again. Apparently, the Infinispan
> internal thread pool uses a rejection policy of "run" and thus uses the
> JGroups (OOB) thread when exhausted.
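If I understand correctly, the "run" policy behaves like
java.util.concurrent's CallerRunsPolicy: when the pool is exhausted, the
task executes on the thread that submitted it, which here is the JGroups
OOB thread that delivered the message, so that thread stops draining
messages. A minimal standalone sketch of the effect (not Infinispan code,
just plain java.util.concurrent):

import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CallerRunsDemo {
    public static void main(String[] args) throws InterruptedException {
        // A 1-thread pool with no queue capacity, using the "run"
        // (CallerRunsPolicy) rejection handler.
        ThreadPoolExecutor internalPool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.SECONDS,
                new SynchronousQueue<>(),
                new ThreadPoolExecutor.CallerRunsPolicy());

        Runnable task = () -> {
            try { Thread.sleep(100); } catch (InterruptedException ignored) {}
            System.out.println("ran on " + Thread.currentThread().getName());
        };

        internalPool.execute(task); // taken by the pool thread
        internalPool.execute(task); // pool exhausted -> runs on "main" and
                                    // blocks it, just like a rejected task
                                    // blocks the submitting OOB thread
        internalPool.shutdown();
        internalPool.awaitTermination(5, TimeUnit.SECONDS);
    }
}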
We can't use another rejection policy in the remote executor because the
message won't be re-delivered by JGroups, and we can't use a queue either.
Pedro is working on ISPN-2849, which should help with the remote/OOB
thread pool exhaustion. It is a bit tricky, though, because our
interceptors assume they will be able to access stack variables after
replication.

> I learned (from Pedro and Mircea) that GETs and PUTs work as follows in
> dist-sync / 2 owners:
> - GETs are sent to the primary and backup owners and the first response
> received is returned to the caller. No locks are acquired, so GETs
> shouldn't cause problems.
>
> - A PUT(K) is sent to the primary owner of K
> - The primary owner
> (1) locks K
> (2) updates the backup owner synchronously *while holding the lock*
> (3) releases the lock
>
>
> Hypothesis
> ----------
> (2) above is done while holding the lock. The sync update of the backup
> owner is done with the lock held to guarantee that the primary and
> backup owner of K have the same values for K.

And something else: if the primary owner reports that a write was
successful and then dies, a read should find the updated value on the
backup owner(s).

> However, the sync update *inside the lock scope* slows things down (can
> it also lead to deadlocks?); there's the risk that the request is
> dropped due to a full incoming thread pool, or that the response is not
> received because of the same, or that the locking at the backup owner
> blocks for some time.

There is no locking on the backup owner, so there are no deadlocks. There
is indeed a risk of the OOB/remote thread pools being full.

> If we have many threads modifying the same key, then we have a backlog
> of locking work against that key. Say we have 100 requester threads and
> a 100 node cluster. This means that we have 10'000 threads accessing
> keys; with 2'000 writers there's a big chance that some writers pick the
> same key at the same time.
>
> For example, if we have 100 threads accessing key K and it takes 3ms to
> replicate K to the backup owner, then the last of the 100 threads waits
> ~300ms before it gets a chance to lock K on the primary owner and
> replicate it as well.
>
> Any small hiccup in sending the PUT to the primary owner, sending the
> modification to the backup owner, waiting for the response, or a GC
> pause, and the delay quickly becomes bigger.
>
>
> Verification
> ------------
> To verify the above, I set numOwners to 1. This means that the primary
> owner of K does *not* send the modification to the backup owner, it only
> locks K, modifies K and unlocks K again.
>
> I ran the IspnPerfTest again on 100 nodes, with 25 requesters, and NO
> PROBLEM!
>
> I then increased the requesters to 100, 150 and 200 and the test
> completed flawlessly! Performance was around *40'000 requests per node
> per sec* on 4-core boxes!
>
>
> Root cause
> ----------
> *******************
> The root cause is the sync RPC of K to the backup owner(s) of K while
> the primary owner holds the lock for K.
> *******************
>
> This causes a backlog of threads waiting for the lock, and that backlog
> can grow to exhaust the thread pools: first the Infinispan internal
> thread pool, then the JGroups OOB thread pool. The latter causes
> retransmissions to get dropped, which compounds the problem...
>
>
> Goal
> ----
> The goal is to make sure that primary and backup owner(s) of K have the
> same value for K.
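To make the contended write path concrete, here is a minimal sketch of it
as I understand it (hypothetical names, not actual Infinispan code). The
point is that the backup RPC in step (2) happens while the key lock is
held, so every other writer to K serializes behind it:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

class PrimaryOwner<K, V> {
    private final ConcurrentMap<K, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final ConcurrentMap<K, V> data = new ConcurrentHashMap<>();
    private final BackupRpc<K, V> backup; // stand-in for the sync RPC to the backup owner

    PrimaryOwner(BackupRpc<K, V> backup) { this.backup = backup; }

    void put(K key, V value) throws InterruptedException {
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();                      // (1) lock K
        try {
            data.put(key, value);         // modify K
            backup.replicate(key, value); // (2) sync RPC *while holding the lock*:
                                          // at ~3ms per RPC, 100 queued writers mean
                                          // the last one waits ~300ms just for the lock
        } finally {
            lock.unlock();                // (3) release the lock
        }
    }
}

// Hypothetical stand-in for the synchronous replication call.
interface BackupRpc<K, V> {
    void replicate(K key, V value) throws InterruptedException; // blocks until the backup acks
}

The sketch makes the backlog arithmetic visible: the lock hold time is
dominated by the RPC round trip, not by the local update.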
> Simply sending the modification to the backup owner(s) asynchronously
> won't guarantee this, as modification messages might get processed out
> of order as they're OOB!
>
>
> Suggested solution
> ------------------
> The modification RPC needs to be invoked *outside of the lock scope*:
> - lock K
> - modify K
> - unlock K
> - send modification to backup owner(s) // outside the lock scope
>
> The primary owner puts the modification of K into a queue from where a
> separate thread/task removes it. The thread then invokes the PUT(K) on
> the backup owner(s).

Does the replication thread execute the PUT(K) synchronously, or
asynchronously? I assume asynchronously, otherwise the replication thread
wouldn't be able to keep up with the writers.

> The queue has the modified keys in FIFO order, so the modifications
> arrive at the backup owner(s) in the right order.

Sending the RPC to the backup owners asynchronously, while holding the
key lock, would do the same thing.

> This requires that the way GET is implemented changes slightly: instead
> of invoking a GET on all owners of K, we only invoke it on the primary
> owner, then the next-in-line etc.

What's the next-in-line owner? A backup won't have the last version of
the data.

> The reason for this is that the backup owner(s) may not yet have
> received the modification of K.

OTOH, if the primary owner dies, we have to ask a backup, and we can lose
the modifications not yet replicated by the primary.

> This is a better impl anyway (we discussed this before) because it
> generates less traffic; in the normal case, all but one of the GET
> requests are unnecessary.

I have a WIP branch for this and it seemed to work fine. Test suite speed
seemed about the same, but I didn't get to do a real performance test.

> Improvement
> -----------
> The above solution can be simplified and even made more efficient.
> Re-using concepts from IRAC [2], we can simply store the modified *keys*
> in the modification queue. The modification replication thread removes
> the key, gets the current value and invokes a PUT/REMOVE on the backup
> owner(s).
>
> Even better: a key is only ever added *once*, so if we have [5,2,17,3],
> adding key 2 is a no-op because the processing of key 2 (in second
> position in the queue) will fetch the up-to-date value anyway!
>
>
> Misc
> ----
> - Could we possibly use total order (TO) to send the updates? TBD
> (Pedro?)
>
>
> Thoughts?
>
>
> [1] https://github.com/belaban/IspnPerfTest
> [2] https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering
>
> --
> Bela Ban, JGroups lead (http://www.jgroups.org)
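For what it's worth, here is a rough sketch of how I read the improved
design (my names, not an actual Infinispan API; it reuses the hypothetical
BackupRpc interface from the sketch above). The queue holds only keys,
deduplicated and in FIFO order; a single replication thread drains it,
reads the current value and pushes it to the backup owner(s):

import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class BackupReplicator<K, V> {
    private final LinkedHashSet<K> queue = new LinkedHashSet<>(); // FIFO + "added only once"
    private final Map<K, V> data = new ConcurrentHashMap<>();     // the primary's local store
    private final BackupRpc<K, V> backup;

    BackupReplicator(BackupRpc<K, V> backup) {
        this.backup = backup;
        Thread drainer = new Thread(this::drain, "backup-replicator");
        drainer.setDaemon(true);
        drainer.start();
    }

    // Write path on the primary owner: modify under the key lock, then
    // enqueue only the *key* after the lock is released.
    void onWrite(K key, V value) {
        data.put(key, value);
        synchronized (queue) {
            if (queue.add(key))   // false if already queued: [5,2,17,3] + key 2 is a no-op
                queue.notifyAll();
        }
    }

    private void drain() {
        try {
            while (true) {
                K key;
                synchronized (queue) {
                    while (queue.isEmpty())
                        queue.wait();
                    Iterator<K> it = queue.iterator();
                    key = it.next(); // oldest queued key (FIFO)
                    it.remove();
                }
                // Read the *current* value and push it; a null value would
                // translate into a REMOVE on the backup owner(s).
                backup.replicate(key, data.get(key));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // shut down the drainer
        }
    }
}

A LinkedHashSet gives both properties in one structure: iteration order is
insertion order (FIFO), and add() of a queued key is a no-op, matching the
"a key is only ever added once" optimization.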