Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-08-10 Thread Noah Misch

On Thu, Aug 03, 2017 at 10:45:50AM -0400, Robert Haas wrote: > On Wed, Aug 2, 2017 at 11:47 PM, Noah Misch wrote: > > postmaster algorithms rely on the PG_SETMASK() calls preventing that. > > Without > > such protection, duplicate bgworkers are an understandable result. I caught > > several oth

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-08-03 Thread Robert Haas

On Wed, Aug 2, 2017 at 11:47 PM, Noah Misch wrote: > postmaster algorithms rely on the PG_SETMASK() calls preventing that. Without > such protection, duplicate bgworkers are an understandable result. I caught > several other assertions; the PMChildFlags failure is another case of > duplicate pos

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-08-02 Thread Noah Misch

On Wed, Jun 21, 2017 at 06:44:09PM -0400, Tom Lane wrote: > Today, lorikeet failed with a new variant on the bgworker start crash: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10 > > This one is even more exciting than the last one, because it sur

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Amit Kapila

On Mon, Jun 26, 2017 at 8:09 PM, Andrew Dunstan wrote: > > > On 06/26/2017 10:36 AM, Amit Kapila wrote: >> On Fri, Jun 23, 2017 at 9:12 AM, Andrew Dunstan >> wrote: >>> >>> On 06/22/2017 10:24 AM, Tom Lane wrote: Andrew Dunstan writes: > Please let me know if there are tests I can run.

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Andrew Dunstan

On 06/26/2017 10:45 AM, Tom Lane wrote: > Andrew Dunstan writes: >> On 06/23/2017 07:47 AM, Andrew Dunstan wrote: >>> Rerunning with some different settings to see if I can get separate cores. >> Numerous attempts to get core dumps following methods suggested in >> Google searches have failed. T

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Tom Lane

Andrew Dunstan writes: > On 06/23/2017 07:47 AM, Andrew Dunstan wrote: >> Rerunning with some different settings to see if I can get separate cores. > Numerous attempts to get core dumps following methods suggested in > Google searches have failed. The latest one is just hanging. Well, if it's h

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Andrew Dunstan

On 06/26/2017 10:36 AM, Amit Kapila wrote: > On Fri, Jun 23, 2017 at 9:12 AM, Andrew Dunstan > wrote: >> >> On 06/22/2017 10:24 AM, Tom Lane wrote: >>> Andrew Dunstan writes: Please let me know if there are tests I can run. I missed your earlier request in this thread, sorry about th

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Amit Kapila

On Fri, Jun 23, 2017 at 9:12 AM, Andrew Dunstan wrote: > > > On 06/22/2017 10:24 AM, Tom Lane wrote: >> Andrew Dunstan writes: >>> Please let me know if there are tests I can run. I missed your earlier >>> request in this thread, sorry about that. >> That earlier request is still valid. Also, i

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Andrew Dunstan

On 06/23/2017 07:47 AM, Andrew Dunstan wrote: > > On 06/23/2017 12:11 AM, Tom Lane wrote: >> Andrew Dunstan writes: >>> On 06/22/2017 10:24 AM, Tom Lane wrote: That earlier request is still valid. Also, if you can reproduce the symptom that lorikeet just showed and get a stack trace f

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-23 Thread Andrew Dunstan

On 06/23/2017 12:11 AM, Tom Lane wrote: > Andrew Dunstan writes: >> On 06/22/2017 10:24 AM, Tom Lane wrote: >>> That earlier request is still valid. Also, if you can reproduce the >>> symptom that lorikeet just showed and get a stack trace from the >>> (hypothetical) postmaster core dump, that

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-22 Thread Tom Lane

Andrew Dunstan writes: > On 06/22/2017 10:24 AM, Tom Lane wrote: >> That earlier request is still valid. Also, if you can reproduce the >> symptom that lorikeet just showed and get a stack trace from the >> (hypothetical) postmaster core dump, that would be hugely valuable. > See attached log an

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-22 Thread Andrew Dunstan

On 06/22/2017 10:24 AM, Tom Lane wrote: > Andrew Dunstan writes: >> Please let me know if there are tests I can run. I missed your earlier >> request in this thread, sorry about that. > That earlier request is still valid. Also, if you can reproduce the > symptom that lorikeet just showed and

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-22 Thread Amit Kapila

On Thu, Jun 22, 2017 at 7:54 PM, Tom Lane wrote: > Andrew Dunstan writes: >> Please let me know if there are tests I can run. I missed your earlier >> request in this thread, sorry about that. > > That earlier request is still valid. > Yeah, that makes sense and also maybe we can try to print dsm_seg

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-22 Thread Tom Lane

Andrew Dunstan writes: > Please let me know if there are tests I can run. I missed your earlier > request in this thread, sorry about that. That earlier request is still valid. Also, if you can reproduce the symptom that lorikeet just showed and get a stack trace from the (hypothetical) postmas

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-22 Thread Andrew Dunstan

On 06/21/2017 06:44 PM, Tom Lane wrote: > Today, lorikeet failed with a new variant on the bgworker start crash: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10 > > This one is even more exciting than the last one, because it sure looks > like the

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-21 Thread Tom Lane

Today, lorikeet failed with a new variant on the bgworker start crash: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10 This one is even more exciting than the last one, because it sure looks like the crashing bgworker took the postmaster down with it.

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas

On Thu, Jun 15, 2017 at 5:16 PM, Tom Lane wrote: > Robert Haas writes: >> On Thu, Jun 15, 2017 at 5:06 PM, Tom Lane wrote: >>> ... nodeGather cannot deem the query done until it's seen EOF on >>> each tuple queue, which it cannot see until each worker has attached >>> to and then detached from t

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane

Robert Haas writes: > On Thu, Jun 15, 2017 at 5:06 PM, Tom Lane wrote: >> ... nodeGather cannot deem the query done until it's seen EOF on >> each tuple queue, which it cannot see until each worker has attached >> to and then detached from the associated shm_mq. > Oh. That's sad. It definitely

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas

On Thu, Jun 15, 2017 at 5:06 PM, Tom Lane wrote: > I wrote: >> Robert Haas writes: >>> I think you're right. So here's a theory: > >>> 1. The ERROR mapping the DSM segment is just a case of the worker the >>> losing a race, and isn't a bug. > >> I concur that this is a possibility, > > Actually,

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane

I wrote: > Robert Haas writes: >> I think you're right. So here's a theory: >> 1. The ERROR mapping the DSM segment is just a case of the worker the >> losing a race, and isn't a bug. > I concur that this is a possibility, Actually, no, it isn't. I tried to reproduce the problem by inserting

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane

Robert Haas writes: > I think you're right. So here's a theory: > 1. The ERROR mapping the DSM segment is just a case of the worker the > losing a race, and isn't a bug. I concur that this is a possibility, but if we expect this to happen, seems like there should be other occurrences in the bui

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas

On Thu, Jun 15, 2017 at 10:21 AM, Amit Kapila wrote: > Yes, I think it is for next query. If you refer the log below from lorikeet: > > 2017-06-13 16:44:57.179 EDT [59404ec6.2758:63] LOG: statement: > EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM > tenk1; > 2017-06-13 16:44

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane

Robert Haas writes: > On Thu, Jun 15, 2017 at 10:38 AM, Tom Lane wrote: >> ... er, -ENOCAFFEINE. Nonetheless, there are no checks of >> EXEC_FLAG_EXPLAIN_ONLY in any parallel-query code, so I think >> a bet is being missed somewhere. > ExecGather() is where workers get launched, and that ain't

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas

On Thu, Jun 15, 2017 at 10:38 AM, Tom Lane wrote: > Robert Haas writes: >> On Thu, Jun 15, 2017 at 10:32 AM, Tom Lane wrote: >>> It's fairly hard to read this other than as telling us that the worker was >>> launched for the EXPLAIN (although really? why aren't we skipping that if >>> EXEC_FLAG_

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane

Robert Haas writes: > On Thu, Jun 15, 2017 at 10:32 AM, Tom Lane wrote: >> It's fairly hard to read this other than as telling us that the worker was >> launched for the EXPLAIN (although really? why aren't we skipping that if >> EXEC_FLAG_EXPLAIN_ONLY?), ... > Uh, because ANALYZE was used? ...

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas

On Thu, Jun 15, 2017 at 10:32 AM, Tom Lane wrote: > Robert Haas writes: >> On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane wrote: >>> But we know, from the subsequent failed assertion, that the leader was >>> still trying to launch parallel workers. So that particular theory >>> doesn't hold water.

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane

Robert Haas writes: > On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane wrote: >> But we know, from the subsequent failed assertion, that the leader was >> still trying to launch parallel workers. So that particular theory >> doesn't hold water. > Is there any chance that it's already trying to launch

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Amit Kapila

On Thu, Jun 15, 2017 at 7:42 PM, Robert Haas wrote: > On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane wrote: >>> Well, as Amit points out, there are entirely legitimate ways for that >>> to happen. If the leader finishes the whole query itself before the >>> worker reaches the dsm_attach() call, it w

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas

On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane wrote: >> Well, as Amit points out, there are entirely legitimate ways for that >> to happen. If the leader finishes the whole query itself before the >> worker reaches the dsm_attach() call, it will call dsm_detach(), >> destroying the segment, and the

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane

Robert Haas writes: > On Wed, Jun 14, 2017 at 6:01 PM, Tom Lane wrote: >> The lack of any other message before the 'could not map' failure must, >> then, mean that dsm_attach() couldn't find an entry in shared memory >> that it wanted to attach to. But how could that happen? > Well, as Amit poi

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas

On Wed, Jun 14, 2017 at 6:01 PM, Tom Lane wrote: > I wrote: >> But surely the silent treatment should only apply to DSM_OP_CREATE? > > Oh ... scratch that, it *does* only apply to DSM_OP_CREATE. > > The lack of any other message before the 'could not map' failure must, > then, mean that dsm_attach

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Amit Kapila

On Thu, Jun 15, 2017 at 3:31 AM, Tom Lane wrote: > I wrote: >> But surely the silent treatment should only apply to DSM_OP_CREATE? > > Oh ... scratch that, it *does* only apply to DSM_OP_CREATE. > > The lack of any other message before the 'could not map' failure must, > then, mean that dsm_attach

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-14 Thread Tom Lane

I wrote: > But surely the silent treatment should only apply to DSM_OP_CREATE? Oh ... scratch that, it *does* only apply to DSM_OP_CREATE. The lack of any other message before the 'could not map' failure must, then, mean that dsm_attach() couldn't find an entry in shared memory that it wanted to

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-14 Thread Tom Lane

Robert Haas writes: > On Wed, Jun 14, 2017 at 3:33 PM, Tom Lane wrote: >> So the first problem here is the lack of supporting information for the >> 'could not map' failure. > Hmm. I think I believed at the time I wrote dsm_attach() that > somebody might want to try to soldier on after failing

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-14 Thread Robert Haas

On Wed, Jun 14, 2017 at 3:33 PM, Tom Lane wrote: > So the first problem here is the lack of supporting information for the > 'could not map' failure. Hmm. I think I believed at the time I wrote dsm_attach() that somebody might want to try to soldier on after failing to map a DSM, but that doesn'

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-14 Thread Tom Lane

Yesterday lorikeet failed the select_parallel test in a new way: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-13%2020%3A28%3A33 2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR: could not map dynamic shared memory segment 2017-06-13 16:44:57.248 EDT [59404dec.2

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-07 Thread Tom Lane

Robert Haas writes: > On Wed, Jun 7, 2017 at 6:36 AM, Amit Kapila wrote: >> I don't think so because this problem has been reported previously as >> well [1][2] even before the commit in question. >> >> [1] - >> https://www.postgresql.org/message-id/1ce5a19f-3b1d-bb1c-4561-0158176f65f1%40dunsla

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-07 Thread Robert Haas

On Wed, Jun 7, 2017 at 6:36 AM, Amit Kapila wrote: > I don't think so because this problem has been reported previously as > well [1][2] even before the commit in question. > > [1] - > https://www.postgresql.org/message-id/1ce5a19f-3b1d-bb1c-4561-0158176f65f1%40dunslane.net > [2] - https://www.po

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-07 Thread Amit Kapila

On Wed, Jun 7, 2017 at 12:37 AM, Robert Haas wrote: > On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane wrote: >>> One thought is that the only places where shm_mq_set_sender() should >>> be getting invoked during the main regression tests are >>> ParallelWorkerMain() and ExecParallelGetReceiver, and both

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-06 Thread Robert Haas

On Tue, Jun 6, 2017 at 4:25 PM, Tom Lane wrote: > (I'm tempted to add something like this permanently, at DEBUG1 or DEBUG2 > or so.) I don't mind adding it permanently, but I think that's too high. Somebody running a lot of parallel queries could easily get enough messages to drown out the stuff

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-06 Thread Tom Lane

Robert Haas writes: > On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane wrote: >> Hmm. With some generous assumptions it'd be possible to think that >> aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this. That commit was >> present in 20 successful lorikeet runs before the first of these failures, >

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-06 Thread Robert Haas

On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane wrote: >> One thought is that the only places where shm_mq_set_sender() should >> be getting invoked during the main regression tests are >> ParallelWorkerMain() and ExecParallelGetReceiver, and both of those >> places using ParallelWorkerNumber to figure o

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-06 Thread Tom Lane

Robert Haas writes: > On Mon, Jun 5, 2017 at 10:40 AM, Andrew Dunstan > wrote: >> Buildfarm member lorikeet is failing occasionally with a failed >> assertion during the select_parallel regression tests like this: > I don't *think* we've made any relevant code changes lately. The only > thing t

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-06 Thread Robert Haas

On Mon, Jun 5, 2017 at 10:40 AM, Andrew Dunstan wrote: > Buildfarm member lorikeet is failing occasionally with a failed > assertion during the select_parallel regression tests like this: > > > 2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG: statement: select > count(*) from tenk1, tenk2

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-05 Thread Tom Lane

Andrew Dunstan writes: > Buildfarm member lorikeet is failing occasionally with a failed > assertion during the select_parallel regression tests like this: > 2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG: statement: select > count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.

[HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-05 Thread Andrew Dunstan

Buildfarm member lorikeet is failing occasionally with a failed assertion during the select_parallel regression tests like this: 2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG: statement: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0; TRAP: FailedAs
