On 19.2.2013 11:27, Tom Lane wrote:
> Tomas Vondra <t...@fuzzy.cz> writes:
>> Dne 19.02.2013 05:46, Alvaro Herrera napsal:
>>> Mastodon failed:
>>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mastodon&dt=2013-02-19%2000%3A00%3A01
>>>
>>> probably worth investigating a bit; we might have broken something.
> 
>> Hmmm, interesting. A single Windows machine, while the other Windows 
>> machines seem to work fine (although some of them were not built for a 
>> few weeks).
> 
> Could be random chance --- we've seen the same failure before, eg
> 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mastodon&dt=2012-11-25%2006%3A00%3A00
> 
>                       regards, tom lane

I'm looking at that test, and I'm not really sure about a few details.

First, this function seems pretty useless to me:

=======================================================================
create function wait_for_stats() returns void as $$
  declare
    start_time timestamptz := clock_timestamp();
    updated bool;
  begin
    -- we don't want to wait forever; loop will exit after 30 seconds
    for i in 1 .. 300 loop

      -- check to see if indexscan has been sensed
      SELECT (st.idx_scan >= pr.idx_scan + 1) INTO updated
        FROM pg_stat_user_tables AS st, pg_class AS cl, prevstats AS pr
       WHERE st.relname='tenk2' AND cl.relname='tenk2';

      exit when updated;

      -- wait a little
      perform pg_sleep(0.1);

      -- reset stats snapshot so we can test again
      perform pg_stat_clear_snapshot();

    end loop;

    -- report time waited in postmaster log (where it won't change test
output)
    raise log 'wait_for_stats delayed % seconds',
      extract(epoch from clock_timestamp() - start_time);
  end
  $$ language plpgsql;
=======================================================================

AFAIK the stats remain the same within a transaction, and as a function
runs within a transaction, it will either get new data on the first
iteration, or it will run all 300 of them. I've checked several
buildfarm members and I'm yet to see a single duration between 12ms and
30sec.

So IMHO we can replace the function call with pg_sleep(30) and we'll get
about the same effect.

But this obviously does not answer the question why it failed, although
on both occasions there's this log message:

[50b1b7fa.0568:14] LOG:  wait_for_stats delayed 34.75 seconds

which essentialy means the stats were not updated before the call to
wait_for_stats().

Anyway, there are these two failing queries:

=======================================================================
-- check effects
SELECT st.seq_scan >= pr.seq_scan + 1,
       st.seq_tup_read >= pr.seq_tup_read + cl.reltuples,
       st.idx_scan >= pr.idx_scan + 1,
       st.idx_tup_fetch >= pr.idx_tup_fetch + 1
  FROM pg_stat_user_tables AS st, pg_class AS cl, prevstats AS pr
 WHERE st.relname='tenk2' AND cl.relname='tenk2';
 ?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
 t        | t        | t        | t
(1 row)

SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
       st.idx_blks_read + st.idx_blks_hit >= pr.idx_blks + 1
  FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
 WHERE st.relname='tenk2' AND cl.relname='tenk2';
 ?column? | ?column?
----------+----------
 t        | t
(1 row)
=======================================================================

The first one returns just falses, the second one retuns either (t,f) or
(f,f) - for the two failures posted by Alvaro and TL earlier today.

I'm really wondering how that could happen. The only thing that I can
think of is some strange timing issue, causing lost requests to write
the stats or maybe some of the stats updates. Hmmm, IIRC the stats are
sent over UDP - couldn't that be related?

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to