Re: [GENERAL] ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

2017-08-09 Thread Adam Sjøgren
On 2017-06-21 Adam Sjøgren <a...@novozymes.com> wrote:

> Adam Sjøgren <a...@novozymes.com> wrote:

>> Meanwhile, I can report that I have upgraded from 9.3.14 to 9.3.17 and
>> the errors keep appearing the log.

Just to close this, for the record: We haven't seen the errors since
2017-06-30. We upgraded to 9.3.17 (latest 9.3 point-release at the time
of writing) on 2017-06-10.

Whether this means that the affected rows gradually got overwritten
after switching to .17 and thus got fixed, or if something subtle in our
workflow changed, so we aren't hitting this anymore, or something else
entirely is the answer, we're not sure.

We didn't get to trying Alvaro Herrera's suggestion of removing
6c243f90ab6904f27fa990f1f3261e1d09a11853 before the errors stopped
appearing "by themselves".


  Best regards,

Adam

-- 
 "My Dear Babbage. I am in much dismay at having  Adam Sjøgren
  got into so amazing a quagmire & botheration with a...@novozymes.com
  these Numbers, that I cannot possibly get the
  thing done today."


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

2017-06-21 Thread Adam Sjøgren
Adam Sjøgren <a...@novozymes.com> wrote:

> Meanwhile, I can report that I have upgraded from 9.3.14 to 9.3.17 and
> the errors keep appearing the log.

Just a quick update with more observations:

All the errors in the postgres.log from one of the tables are triggered
by a stored procedure that gathers data to put in a field used for full
text search - this stored procedure is called by a before update trigger
on the table. We have only seen it in the log, but not been able to
reproduce it.

We have, however, now got a row in the other big table where we can get
the error just by running a SELECT * on the row, in psql:

  user@server db=# select * from ourschema.table_a where id = 6121931;
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 339846807 in 
pg_toast_10919630
  user@server db=# 

Which is both nice - we can show the error on demand - but also more
worrying, I guess, because that means the problem is "on disk".

Running this in a stored procedure over the record in question:

>   SELECT *
>   INTO rec
>   FROM table_a where id = badid;
>   detoast := substr(rec.fts::text,1,2000);
>   exception
>   when others then
>   raise notice 'data for table_a id: % is corrupt', badid;
>   continue;

also shows the error:

  user@server db=# SELECT ourschema.check_sequence(6121931, 6121931);
  NOTICE:  data for table_a id: 6121931 is corrupt
   check_sequence 
  

  (1 row)

We are running this over the entire (160M+ row) table now, to see if any
other rows are affected.

So, we can reproduce the error message, but we can't reproduce the
problem from scratch.

Any ideas on what to look at, given a non-transient problem-row?

Our next step will be to try to switch to 9.3.17 with
6c243f90ab6904f27fa990f1f3261e1d09a11853 reverted as suggested by Alvaro
Herrera last week.


  Best regards,

Adam

-- 
 "Lägg ditt liv i min handAdam Sjøgren
  Sälj din själ till ett band"  a...@novozymes.com



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

2017-06-12 Thread Adam Sjøgren
Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:

> ADSJ (Adam Sjøgren) wrote:
>
>> Our database has started reporting errors like this:
>> 
>>   2017-05-31 13:48:10 CEST ERROR:  unexpected chunk number 0 (expected 1) 
>> for toast value 14242189 in pg_toast_10919630

> Does the problem still reproduce if you revert commit
> 6c243f90ab6904f27fa990f1f3261e1d09a11853?

I will try and get back to you with the results (building new .deb
packages as I type this).


Meanwhile, I can report that I have upgraded from 9.3.14 to 9.3.17 and
the errors keep appearing the log.

We have tried running a function similar to the one described in
http://www.databasesoup.com/2013/10/de-corrupting-toast-tables.html as
suggested by Adrian Klaver, but we haven't been able to get any errors
from that.

This is the function we have run over our two tables:

  CREATE OR REPLACE FUNCTION check_table_a(from_id int, to_id int)
  RETURNS VOID LANGUAGE PLPGSQL AS
  $f$
  declare
  curid INT := 0;
  rec RECORD;
  badid INT;
  detoast TEXT;
  begin
  FOR badid IN SELECT id FROM table_a where id >= from_id and id <= to_id LOOP
  curid = curid + 1;
  if curid % 1 = 0 then
  raise notice '% rows inspected (%, %,%)', curid, badid, from_id, 
to_id;
  end if;
  begin
  SELECT *
  INTO rec
  FROM table_a where id = badid;
  detoast := substr(rec.fts::text,1,2000);
  exception
  when others then
  raise notice 'data for table_a id: % is corrupt', badid;
  continue;
  end;
  end loop;
  end;
  $f$;

  -- The other function has:
  --
  detoast := substr(vcontent.document,1,2000);
  --
  -- and is otherwise identical.

But no 'data for table... is corrupt' is printed.

We are only substr()'ing one field (which we know is big) for each row.
Should we do so for _all_ fields? Is there an elegant way to do so?


  Best regards,

Adam

-- 
 "Lägg ditt liv i min handAdam Sjøgren
  Sälj din själ till ett band"  a...@novozymes.com


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

2017-06-08 Thread Adam Sjøgren
Harry writes:

> The second vacuum causes an ERROR identical to that you are reporting
> below (unexpected chunk number n (expected n) for toast value...).
> However it may take up to ten attempts to replicate it.

Interesting.

> Out of interest, are you using any tablespaces other than pg_default?
> I can only replicate the issue when using separately mounted
> tablespaces.

No, we are using pg_default only.

I hope your finding can be reproduced, it would be really interesting to
see.


  Best regards,

Adam

-- 
 "Lägg ditt liv i min hand        Adam Sjøgren
  Sälj din själ till ett band"  a...@novozymes.com


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

2017-06-08 Thread Adam Sjøgren
Achilleas writes:

> First try to find which tables those toast relations refer to :
> select 10919630::regclass , 10920100::regclass ;
> Are those critical tables? Can you restore them somehow?

They are our two big tables, containing the bulk of our data (one with
168M rows, the other with 320M rows).

They are constantly being updated, but if I can identify the affected
rows, I can restore a backup on another machine and cherry pick them
from there.

> Also you may consider
> REINDEX TABLE pg_toast.pg_toast_10920100;
> REINDEX TABLE pg_toast.pg_toast_10919630;
> REINDEX TABLE ;
> REINDEX TABLE ;
>
> also VACUUM the above tables.

Yes, but I'd like to know find out why it happens, because cleaning up
and having the corruption reoccur is not so fun.

> You might want to write a function which iterates over the damaged
> table's rows in order to identify the damaged row(s). And then do some
> good update to create a new version.

Yes - we started by doing a quick pg_dump, but I guess we should switch
to something that can tell us exactly what rows hit the problem.

Anyone has a handy little script lying around?


  Thanks for the response!

Adam

-- 
 "Lägg ditt liv i min handAdam Sjøgren
  Sälj din själ till ett band"  a...@novozymes.com



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

2017-06-07 Thread Adam Sjøgren
Our database has started reporting errors like this:

  2017-05-31 13:48:10 CEST ERROR:  unexpected chunk number 0 (expected 1) for 
toast value 14242189 in pg_toast_10919630
  ...
  2017-06-01 11:06:56 CEST ERROR:  unexpected chunk number 0 (expected 1) for 
toast value 19573520 in pg_toast_10919630

(157 times, for different toast values, same pg_toast_nnn). pg_toast_10919630
corresponds to a table with around 168 million rows.

These went away, but the next day we got similar errors from another
table:

  2017-06-02 05:59:50 CEST ERROR:  unexpected chunk number 0 (expected 1) for 
toast value 47060150 in pg_toast_10920100
  ...
  2017-06-02 06:14:54 CEST ERROR:  unexpected chunk number 0 (expected 1) for 
toast value 47226455 in pg_toast_10920100

(Only 4 this time) pg_toast_10920100 corresponds to a table with holds
around 320 million rows (these are our two large tables).

The next day we got 6 such errors and the day after 10 such errors. On
June 5th we got 94, yesterday we got 111, of which one looked a little
different:

  2017-06-06 17:32:21 CEST ERROR:  unexpected chunk size 1996 (expected 1585) 
in final chunk 0 for toast value 114925100 in pg_toast_10920100

and today the logs have 65 lines, ending with these:

  2017-06-07 14:49:53 CEST ERROR:  unexpected chunk number 0 (expected 1) for 
toast value 131114834 in pg_toast_10920100
  2017-06-07 14:53:41 CEST ERROR:  unexpected chunk number 0 (expected 1) for 
toast value 131149566 in pg_toast_10920100

The database is 10 TB on disk (SSDs) and runs on a 48 core server with 3
TB RAM on Ubuntu 14.04 (Linux 3.18.13).

We are updating rows in the database a lot/continuously.

There are no apparent indications of hardware errors (like ECC) in
dmesg, nor any error messages logged by the LSI MegaRAID controller, as
far as I can tell.

We are running PostgreSQL 9.3.14 currently.

The only thing I could see in the release notes since 9.3.14 that might
be related is this:

 "* Avoid very-low-probability data corruption due to testing tuple
visibility without holding buffer lock (Thomas Munro, Peter Geoghegan,
Tom Lane)"

Although reading more about it, it doesn't sound like it would exhibit
the symptoms we see?

We have recently increased the load (to around twice the number of
cores), though, which made me think we could be triggering corner cases
we haven't hit before.

We will be upgrading to PostgreSQL 9.3.17 during the weekend, but I'd like to 
hear
if anyone has seen something like this, or have some ideas of how to
investigate/what the cause might be.


  Best regards,

Adam

-- 
 "Lägg ditt liv i min hand        Adam Sjøgren
  Sälj din själ till ett band"  a...@novozymes.com


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general