On Sat, Jun 4, 2016 at 8:43 AM, Robert Haas <robertmh...@gmail.com> wrote: > > On Thu, May 26, 2016 at 5:57 AM, Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > I am able to reproduce the assertion (it occurs once in two to three times, > > but always at same place) you have reported upthread with the above query. > > It seems to me, issue here is that while workers are writing tuples in the > > tuple queue, the master backend has detached from the queues. The reason > > why master backend has detached from tuple queues is because of Limit > > clause, basically after processing required tuples as specified by Limit > > clause, it calls shutdown of nodes in below part of code: > > I can't reproduce this assertion failure on master. I tried running > 'make installcheck' and then running this query repeatedly in the > regression database with and without > parallel_setup_cost=parallel_tuple_cost=0, and got nowhere. Does that > work for you, or do you have some other set of steps? >
I have tried by populating pg_statistic table after running make installcheck. The way to populate pg_statistic is to create lot of tables and insert few rows in each table as mentioned in the end of mail upthread. https://www.postgresql.org/message-id/CAA4eK1KOKGqmz9bGu%2BZ42qhRwMbm4R5rfnqsLCNqFs9j14jzEA%40mail.gmail.com Today, again I tried reproducing it using same steps, but could not reproduce it. This is a timing issue and difficult to reproduce, last time also I have spent quite some time to reproduce it. I think we can fix the issue as per analysis done by me last time and then let Andreas run his tests to see if he could see the issue again. > > I think the workers should stop processing tuples after the tuple queues got > > detached. This will not only handle above situation gracefully, but will > > allow to speed up the queries where Limit clause is present on top of Gather > > node. Patch for the same is attached with mail (this was part of original > > parallel seq scan patch, but was not applied and the reason as far as I > > remember was we thought such an optimization might not be required for > > initial version). > > This is very likely a good idea, but... > > > Another approach to fix this issue could be to reset mqh_partial_bytes and > > mqh_length_word_complete in shm_mq_sendv in case of SHM_MQ_DETACHED. These > > are currently reset only incase of success. > > ...I think we should do this too, because it's intended that calling > shm_mq_sendv again after it previously returned SHM_MQ_DETACHED should > again return SHM_MQ_DETACHED, not fail an assertion. > res = shm_mq_send_bytes(mqh, j, tmpbuf, nowait, &bytes_written); mqh->mqh_partial_bytes += bytes_written; + if (res == SHM_MQ_DETACHED) + { + mqh->mqh_partial_bytes = 0; + return res; + } In the above change, you are first adding bytes_written and then doing the SHM_MQ_DETACHED check, whereas other place the check is done first which seems to be right. I think it doesn't matter either way, but it is better to be consistent. Also isn't it better to set mqh_length_word_complete as false as next time this API is called, it should first try to write length into buffer. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com