On Fri, 21 Aug 2020 at 11:01, Michael Lewis <mle...@entrata.com> wrote:
>
> On Thu, Aug 20, 2020 at 4:40 PM Jim Jarvie <j...@talentstack.to> wrote:
>>
>> On 20-Aug.-2020 17:42, Michael Lewis wrote:
>>
>> Can you share an explain analyze for the query that does the select for
>> update? I wouldn't assume that partition pruning is possible at all with
>> hash, and it would be interesting to see how it is finding those rows.
>>
>> Well, this got interesting - the already-moved error showed up. Note:
>> the actual process partitions are regular table partitions; these are
>> not hashed. Only the incoming and completed partitions are hashed, due
>> to the row counts at either end of the processing; in-flight (where the
>> issue shows up) is quite small:
>>
>> [queuedb] # explain analyze select queueid,txobject,objectid,state from
>> mq.queue where (state = 'tx_active' or state='tx_fail_retryable') and
>> txobject = 'ticket' limit 250 for update skip locked;
>> ERROR: 40001: tuple to be locked was already moved to another partition
>> due to concurrent update
>> LOCATION: heapam_tuple_lock, heapam_handler.c:405
>> Time: 579.131 ms
>
> That is super curious. I hope that someone will jump in with an
> explanation or theory on this.
>
> I still wonder why the move between partitions is needed, though, if the
> work is either done (failed or successful) or not done... not started,
> retry needed or in progress... it doesn't matter. It needs to get picked
> up by the next process if it isn't already row locked.
I may be heading off in the wrong direction, as I'm not fully sure I
understand what the complaint is about, but isn't the executor just hitting
dead rows in one of the active or failed partitions - rows that have since
been moved off to some other partition?

When updates occur in a non-partitioned table, we can follow the item
pointer chains to find the live row and check if the WHERE clause still
matches, to determine whether the row should be updated - or, in this case,
just locked, since it's a SELECT FOR UPDATE. However, with a partitioned
table, a concurrent UPDATE may have moved the row off to another partition,
in which case the old tuple's item pointer cannot point to it: we don't have
enough address space, since a TID is only 6 bytes. To get around the fact
that we can't follow these update chains, we just throw the serialization
error, which is what you're getting. Ideally, we'd figure out where the live
version of the tuple is, check if it matches the WHERE clause, and lock it
if it does, but we've no means to do that with the current design.

If the complaint is that you're getting the error at all and you think you
shouldn't be because you said "SKIP LOCKED", then I'm not really sure that
saying "SKIP LOCKED" gives us the right to ignore this case. The user only
gave us the go-ahead to skip locked tuples, not to skip tuples whose item
pointer chains we simply failed to follow. It might be okay to do this for
rows that have since been marked as complete, since they no longer match
your WHERE clause. However, if the row has gone from the queue_tx_active
partition into the queue_tx_fail_retryable partition, then I don't see why
we'd have the right to skip the tuple: your query says you want tuples that
need to be retried, and we can't go skipping them.

So isn't the fix just to code the application to retry on 40001 errors?

David
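
A minimal sketch of that application-side retry might look like the
following. It assumes Python with psycopg2; the connection string, the retry
count and backoff, and the claim_batch() helper name are illustrative
choices rather than anything from this thread - only the query itself is
taken from the message above.

    # Sketch of retrying the claim query on SQLSTATE 40001 (serialization
    # failure), as suggested above. The DSN, retry count, backoff and the
    # claim_batch() helper are assumptions for illustration.
    import time

    import psycopg2

    CLAIM_SQL = """
        SELECT queueid, txobject, objectid, state
          FROM mq.queue
         WHERE (state = 'tx_active' OR state = 'tx_fail_retryable')
           AND txobject = 'ticket'
         LIMIT 250
           FOR UPDATE SKIP LOCKED
    """

    def claim_batch(conn, max_attempts=5):
        """Claim up to 250 queue rows, retrying when a concurrent update has
        moved a candidate row to another partition (SQLSTATE 40001)."""
        for attempt in range(1, max_attempts + 1):
            try:
                with conn.cursor() as cur:
                    cur.execute(CLAIM_SQL)
                    return cur.fetchall()   # row locks last until commit/rollback
            except psycopg2.Error as exc:
                conn.rollback()             # the error aborts the transaction
                if exc.pgcode != "40001" or attempt == max_attempts:
                    raise
                time.sleep(0.05 * attempt)  # brief backoff before retrying

    conn = psycopg2.connect("dbname=queuedb")   # illustrative connection string
    rows = claim_batch(conn)
    # ... process the claimed rows, update their state, then conn.commit() ...

Since the FOR UPDATE row locks only last until the end of the transaction,
the caller keeps the transaction open while it processes the claimed rows
and issues the commit itself once their state has been updated.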