On Thu, Aug 03, 2017 at 10:45:50AM -0400, Robert Haas wrote:
> On Wed, Aug 2, 2017 at 11:47 PM, Noah Misch wrote:
> > postmaster algorithms rely on the PG_SETMASK() calls preventing that. Without
> > such protection, duplicate bgworkers are an understandable result. I caught
> > several oth
On Wed, Aug 2, 2017 at 11:47 PM, Noah Misch wrote:
> postmaster algorithms rely on the PG_SETMASK() calls preventing that. Without
> such protection, duplicate bgworkers are an understandable result. I caught
> several other assertions; the PMChildFlags failure is another case of
> duplicate pos
On Wed, Jun 21, 2017 at 06:44:09PM -0400, Tom Lane wrote:
> Today, lorikeet failed with a new variant on the bgworker start crash:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10
>
> This one is even more exciting than the last one, because it sur
On Mon, Jun 26, 2017 at 8:09 PM, Andrew Dunstan wrote:
>
>
> On 06/26/2017 10:36 AM, Amit Kapila wrote:
>> On Fri, Jun 23, 2017 at 9:12 AM, Andrew Dunstan wrote:
>>>
>>> On 06/22/2017 10:24 AM, Tom Lane wrote:
Andrew Dunstan writes:
> Please let me know if there are tests I can run.
On 06/26/2017 10:45 AM, Tom Lane wrote:
> Andrew Dunstan writes:
>> On 06/23/2017 07:47 AM, Andrew Dunstan wrote:
>>> Rerunning with some different settings to see if I can get separate cores.
>> Numerous attempts to get core dumps following methods suggested in
>> Google searches have failed. T
Andrew Dunstan writes:
> On 06/23/2017 07:47 AM, Andrew Dunstan wrote:
>> Rerunning with some different settings to see if I can get separate cores.
> Numerous attempts to get core dumps following methods suggested in
> Google searches have failed. The latest one is just hanging.
Well, if it's h
On 06/26/2017 10:36 AM, Amit Kapila wrote:
> On Fri, Jun 23, 2017 at 9:12 AM, Andrew Dunstan wrote:
>>
>> On 06/22/2017 10:24 AM, Tom Lane wrote:
>>> Andrew Dunstan writes:
Please let me know if there are tests I can run. I missed your earlier
request in this thread, sorry about th
On Fri, Jun 23, 2017 at 9:12 AM, Andrew Dunstan wrote:
>
>
> On 06/22/2017 10:24 AM, Tom Lane wrote:
>> Andrew Dunstan writes:
>>> Please let me know if there are tests I can run. I missed your earlier
>>> request in this thread, sorry about that.
>> That earlier request is still valid. Also, i
On 06/23/2017 07:47 AM, Andrew Dunstan wrote:
>
> On 06/23/2017 12:11 AM, Tom Lane wrote:
>> Andrew Dunstan writes:
>>> On 06/22/2017 10:24 AM, Tom Lane wrote:
That earlier request is still valid. Also, if you can reproduce the
symptom that lorikeet just showed and get a stack trace f
On 06/23/2017 12:11 AM, Tom Lane wrote:
> Andrew Dunstan writes:
>> On 06/22/2017 10:24 AM, Tom Lane wrote:
>>> That earlier request is still valid. Also, if you can reproduce the
>>> symptom that lorikeet just showed and get a stack trace from the
>>> (hypothetical) postmaster core dump, that
Andrew Dunstan writes:
> On 06/22/2017 10:24 AM, Tom Lane wrote:
>> That earlier request is still valid. Also, if you can reproduce the
>> symptom that lorikeet just showed and get a stack trace from the
>> (hypothetical) postmaster core dump, that would be hugely valuable.
> See attached log an
On 06/22/2017 10:24 AM, Tom Lane wrote:
> Andrew Dunstan writes:
>> Please let me know if there are tests I can run. I missed your earlier
>> request in this thread, sorry about that.
> That earlier request is still valid. Also, if you can reproduce the
> symptom that lorikeet just showed and
On Thu, Jun 22, 2017 at 7:54 PM, Tom Lane wrote:
> Andrew Dunstan writes:
>> Please let me know if there are tests I can run. I missed your earlier
>> request in this thread, sorry about that.
>
> That earlier request is still valid.
>
Yeah, that makes sense, and also maybe we can try to print dsm_seg
Andrew Dunstan writes:
> Please let me know if there are tests I can run. I missed your earlier
> request in this thread, sorry about that.
That earlier request is still valid. Also, if you can reproduce the
symptom that lorikeet just showed and get a stack trace from the
(hypothetical) postmas
On 06/21/2017 06:44 PM, Tom Lane wrote:
> Today, lorikeet failed with a new variant on the bgworker start crash:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10
>
> This one is even more exciting than the last one, because it sure looks
> like the
Today, lorikeet failed with a new variant on the bgworker start crash:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10
This one is even more exciting than the last one, because it sure looks
like the crashing bgworker took the postmaster down with it.
On Thu, Jun 15, 2017 at 5:16 PM, Tom Lane wrote:
> Robert Haas writes:
>> On Thu, Jun 15, 2017 at 5:06 PM, Tom Lane wrote:
>>> ... nodeGather cannot deem the query done until it's seen EOF on
>>> each tuple queue, which it cannot see until each worker has attached
>>> to and then detached from t
Robert Haas writes:
> On Thu, Jun 15, 2017 at 5:06 PM, Tom Lane wrote:
>> ... nodeGather cannot deem the query done until it's seen EOF on
>> each tuple queue, which it cannot see until each worker has attached
>> to and then detached from the associated shm_mq.
> Oh. That's sad. It definitely
On Thu, Jun 15, 2017 at 5:06 PM, Tom Lane wrote:
> I wrote:
>> Robert Haas writes:
>>> I think you're right. So here's a theory:
>
>>> 1. The ERROR mapping the DSM segment is just a case of the worker
>>> losing a race, and isn't a bug.
>
>> I concur that this is a possibility,
>
> Actually,
I wrote:
> Robert Haas writes:
>> I think you're right. So here's a theory:
>> 1. The ERROR mapping the DSM segment is just a case of the worker
>> losing a race, and isn't a bug.
> I concur that this is a possibility,
Actually, no, it isn't. I tried to reproduce the problem by inserting
Robert Haas writes:
> I think you're right. So here's a theory:
> 1. The ERROR mapping the DSM segment is just a case of the worker
> losing a race, and isn't a bug.
I concur that this is a possibility, but if we expect this to happen,
seems like there should be other occurrences in the bui
On Thu, Jun 15, 2017 at 10:21 AM, Amit Kapila wrote:
> Yes, I think it is for the next query. If you refer to the log below from lorikeet:
>
> 2017-06-13 16:44:57.179 EDT [59404ec6.2758:63] LOG: statement:
> EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM
> tenk1;
> 2017-06-13 16:44
Robert Haas writes:
> On Thu, Jun 15, 2017 at 10:38 AM, Tom Lane wrote:
>> ... er, -ENOCAFFEINE. Nonetheless, there are no checks of
>> EXEC_FLAG_EXPLAIN_ONLY in any parallel-query code, so I think
>> a bet is being missed somewhere.
> ExecGather() is where workers get launched, and that ain't
On Thu, Jun 15, 2017 at 10:38 AM, Tom Lane wrote:
> Robert Haas writes:
>> On Thu, Jun 15, 2017 at 10:32 AM, Tom Lane wrote:
>>> It's fairly hard to read this other than as telling us that the worker was
>>> launched for the EXPLAIN (although really? why aren't we skipping that if
>>> EXEC_FLAG_
Robert Haas writes:
> On Thu, Jun 15, 2017 at 10:32 AM, Tom Lane wrote:
>> It's fairly hard to read this other than as telling us that the worker was
>> launched for the EXPLAIN (although really? why aren't we skipping that if
>> EXEC_FLAG_EXPLAIN_ONLY?), ...
> Uh, because ANALYZE was used?
...
On Thu, Jun 15, 2017 at 10:32 AM, Tom Lane wrote:
> Robert Haas writes:
>> On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane wrote:
>>> But we know, from the subsequent failed assertion, that the leader was
>>> still trying to launch parallel workers. So that particular theory
>>> doesn't hold water.
Robert Haas writes:
> On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane wrote:
>> But we know, from the subsequent failed assertion, that the leader was
>> still trying to launch parallel workers. So that particular theory
>> doesn't hold water.
> Is there any chance that it's already trying to launch
On Thu, Jun 15, 2017 at 7:42 PM, Robert Haas wrote:
> On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane wrote:
>>> Well, as Amit points out, there are entirely legitimate ways for that
>>> to happen. If the leader finishes the whole query itself before the
>>> worker reaches the dsm_attach() call, it w
On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane wrote:
>> Well, as Amit points out, there are entirely legitimate ways for that
>> to happen. If the leader finishes the whole query itself before the
>> worker reaches the dsm_attach() call, it will call dsm_detach(),
>> destroying the segment, and the
Robert Haas writes:
> On Wed, Jun 14, 2017 at 6:01 PM, Tom Lane wrote:
>> The lack of any other message before the 'could not map' failure must,
>> then, mean that dsm_attach() couldn't find an entry in shared memory
>> that it wanted to attach to. But how could that happen?
> Well, as Amit poi
On Wed, Jun 14, 2017 at 6:01 PM, Tom Lane wrote:
> I wrote:
>> But surely the silent treatment should only apply to DSM_OP_CREATE?
>
> Oh ... scratch that, it *does* only apply to DSM_OP_CREATE.
>
> The lack of any other message before the 'could not map' failure must,
> then, mean that dsm_attach
On Thu, Jun 15, 2017 at 3:31 AM, Tom Lane wrote:
> I wrote:
>> But surely the silent treatment should only apply to DSM_OP_CREATE?
>
> Oh ... scratch that, it *does* only apply to DSM_OP_CREATE.
>
> The lack of any other message before the 'could not map' failure must,
> then, mean that dsm_attach
I wrote:
> But surely the silent treatment should only apply to DSM_OP_CREATE?
Oh ... scratch that, it *does* only apply to DSM_OP_CREATE.
The lack of any other message before the 'could not map' failure must,
then, mean that dsm_attach() couldn't find an entry in shared memory
that it wanted to
Robert Haas writes:
> On Wed, Jun 14, 2017 at 3:33 PM, Tom Lane wrote:
>> So the first problem here is the lack of supporting information for the
>> 'could not map' failure.
> Hmm. I think I believed at the time I wrote dsm_attach() that
> somebody might want to try to soldier on after failing
On Wed, Jun 14, 2017 at 3:33 PM, Tom Lane wrote:
> So the first problem here is the lack of supporting information for the
> 'could not map' failure.
Hmm. I think I believed at the time I wrote dsm_attach() that
somebody might want to try to soldier on after failing to map a DSM,
but that doesn'
Yesterday lorikeet failed the select_parallel test in a new way:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-13%2020%3A28%3A33
2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR: could not map dynamic
shared memory segment
2017-06-13 16:44:57.248 EDT [59404dec.2
Robert Haas writes:
> On Wed, Jun 7, 2017 at 6:36 AM, Amit Kapila wrote:
>> I don't think so because this problem has been reported previously as
>> well [1][2] even before the commit in question.
>>
>> [1] -
>> https://www.postgresql.org/message-id/1ce5a19f-3b1d-bb1c-4561-0158176f65f1%40dunsla
On Wed, Jun 7, 2017 at 6:36 AM, Amit Kapila wrote:
> I don't think so because this problem has been reported previously as
> well [1][2] even before the commit in question.
>
> [1] -
> https://www.postgresql.org/message-id/1ce5a19f-3b1d-bb1c-4561-0158176f65f1%40dunslane.net
> [2] - https://www.po
On Wed, Jun 7, 2017 at 12:37 AM, Robert Haas wrote:
> On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane wrote:
>>> One thought is that the only places where shm_mq_set_sender() should
>>> be getting invoked during the main regression tests are
>>> ParallelWorkerMain() and ExecParallelGetReceiver, and both
On Tue, Jun 6, 2017 at 4:25 PM, Tom Lane wrote:
> (I'm tempted to add something like this permanently, at DEBUG1 or DEBUG2
> or so.)
I don't mind adding it permanently, but I think that's too high.
Somebody running a lot of parallel queries could easily get enough
messages to drown out the stuff
Robert Haas writes:
> On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane wrote:
>> Hmm. With some generous assumptions it'd be possible to think that
>> aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this. That commit was
>> present in 20 successful lorikeet runs before the first of these failures,
>
On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane wrote:
>> One thought is that the only places where shm_mq_set_sender() should
>> be getting invoked during the main regression tests are
>> ParallelWorkerMain() and ExecParallelGetReceiver, and both of those
>> places using ParallelWorkerNumber to figure o
Robert Haas writes:
> On Mon, Jun 5, 2017 at 10:40 AM, Andrew Dunstan wrote:
>> Buildfarm member lorikeet is failing occasionally with a failed
>> assertion during the select_parallel regression tests like this:
> I don't *think* we've made any relevant code changes lately. The only
> thing t
On Mon, Jun 5, 2017 at 10:40 AM, Andrew Dunstan wrote:
> Buildfarm member lorikeet is failing occasionally with a failed
> assertion during the select_parallel regression tests like this:
>
>
> 2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG: statement: select
> count(*) from tenk1, tenk2
Andrew Dunstan writes:
> Buildfarm member lorikeet is failing occasionally with a failed
> assertion during the select_parallel regression tests like this:
> 2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG: statement: select
> count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.
Buildfarm member lorikeet is failing occasionally with a failed
assertion during the select_parallel regression tests like this:
2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG: statement: select
count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
TRAP: FailedAs