Re: [GENERAL] Upgrade from 9.1 to 9.2 fails due to unlogged table?

2013-04-05 Thread Gordon Shannon
I repeated the entire process, and I have a few clarifications.  

When I said the db seemed fine after the restore, I was wrong.  I could do a
\d on an unlogged table, but when I selected count(*) from any of them, that
resulted in an error like: could not open file "base/16388/15963587": No
such file or directory.  So the database was definitely not happy after the
restore.

Also, I was wrong when I said I couldn't go back after the failed upgrade.
I read the output again, and yes, I was able to restart 9.1 after renaming
pg_control.old back to pg_control.

Next, I tried dropping and recreating all my UNLOGGED tables, as logged
tables.  On my first try, I forgot a few tables, and the pg_upgrade error
this time was: 

Linking user relation files
  /var/lib/pgsql/9.1/data/base/16389/29807980 
error while creating link for relation pg_toast.pg_toast_15949256_index
(/var/lib/pgsql/9.1/data/base/16389/29807980 to
/var/lib/pgsql/9.2/data/base/16434/15949261): No such file or directory
Failure, exiting

I realized my mistake, and recreated the remaining unlogged tables.  After
that the pg_upgrade worked perfectly.
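
For anyone else in this spot: 9.1 has no ALTER TABLE ... SET LOGGED, so
"recreating as logged" for each table amounted to something like the sketch
below (table name is just an example; I didn't need to preserve the data, and
LIKE does not copy foreign keys or grants, so those have to be redone by hand):

begin;
create table stage.classification_new (like stage.classification including all);
drop table stage.classification;
alter table stage.classification_new rename to classification;
commit;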






[GENERAL] Upgrade from 9.1 to 9.2 fails due to unlogged table?

2013-04-05 Thread Gordon Shannon
I have a 9.1.3 instance (Redhat 5) with some unlogged tables.  I did the
following steps:
1. pg_basebackup to create a base.tar
2. Used the base.tar plus the WALs required by the backup to restore the db
to another 9.1.3 server.  This went fine, except that at the end of the recovery
I got this error ("File exists"):

7740  2013-03-28 09:44:16 MDT [2013-03-28 09:38:43 MDT] [29] LOG:  archive
recovery complete
7752  2013-03-28 09:44:16 MDT [2013-03-28 09:38:48 MDT] [1] LOG: 
checkpoint starting: end-of-recovery immediate wait
7752  2013-03-28 09:44:25 MDT [2013-03-28 09:38:48 MDT] [2] LOG: 
checkpoint complete: wrote 8213 buffers (1.6%); 0 transaction log file(s)
added, 0 removed, 0 recycled; write=7.361 s, sync=1.563 s, total=8.992 s;
sync files=2492, longest=0.069 s, average=0.000 s
7740  2013-03-28 09:44:25 MDT [2013-03-28 09:38:43 MDT] [30] FATAL: 
could not create file "base/886355/29806058": File exists
7737  2013-03-28 09:44:25 MDT [2013-03-28 09:38:42 MDT] [1] LOG:  startup
process (PID 7740) exited with exit code 1
7737  2013-03-28 09:44:25 MDT [2013-03-28 09:38:42 MDT] [2] LOG: 
terminating any other active server processes

A minute later I started it up again; it came up without errors, and we ran
for a couple of days with no errors noted.

3. I then did a pg_upgrade to 9.2.3.  All went well, until...

Linking user relation files
  /var/lib/pgsql/9.1/data/base/16388/15963579 
error while creating link for relation stage.classification
(/var/lib/pgsql/9.1/data/base/16388/15963579 to
/var/lib/pgsql/9.2/data/base/16433/15963579): No such file or directory
Failure, exiting

I checked, and it's true: the file does not exist.  Now I can't proceed, and
the 9.1 data is not usable.  I looked at the original database from which
the copy was made, and the relation in question is an UNLOGGED table.  This
suggests that unlogged tables need some special consideration during binary
recovery.  I don't care about the unlogged table data, but I do need the
table definitions.  We use a few dozen unlogged tables.  Any ideas?
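
In case it helps, this is how I'm listing the unlogged tables that would need
to be dealt with (a simple query against pg_class.relpersistence, which 9.1
has):

select n.nspname, c.relname
from pg_class c
join pg_namespace n on n.oid = c.relnamespace
where c.relkind = 'r' and c.relpersistence = 'u'
order by 1, 2;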

Thanks,
Gordon





Re: [GENERAL] could not access status of transaction 1118722281

2011-04-09 Thread Gordon Shannon
Turns out this was most likely the pg_upgrade bug.  In our case, I was able
to dump and recreate the table in question.  Since then, this has been made
public:  http://wiki.postgresql.org/wiki/20110408pg_upgrade_fix



[GENERAL] could not access status of transaction 1118722281

2011-04-04 Thread Gordon Shannon
Running 9.0.2 on Centos.

I just discovered this in my production error log.  Starting about 45
minutes ago, I got 70 of these, within 2 seconds:

28652   2011-04-04 21:47:29 EDT [33]WARNING:  PD_ALL_VISIBLE flag was
incorrectly set in relation pg_toast_49338181 page 16820

These warnings were immediately followed by this, which has repeated
continuously every 15 seconds since then:

8895   2011-04-04 22:15:28 EDT [1]ERROR:  could not access status of
transaction 1118722281
8895   2011-04-04 22:15:28 EDT [2]DETAIL:  Could not open file
pg_clog/042A: No such file or directory.
8895   2011-04-04 22:15:28 EDT [3]CONTEXT:  automatic vacuum of table
mcore.pg_toast.pg_toast_48975830

I checked, and the pg_clog files start at 04BF and run through 0A57 (1,433
files).
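
As a sanity check on where that segment name comes from (assuming the default
8 kB block size, each pg_clog segment holds 32 pages x 32,768 transaction
status bits = 1,048,576 transactions):

select to_hex(1118722281 / 1048576);   -- 42a, i.e. the missing segment

So the segment that transaction needs (042A) is older than the oldest file
still present (04BF).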

Any help would be greatly appreciated.

Gordon



Re: [GENERAL] walreceiver getting bad data?

2011-01-06 Thread Gordon Shannon

It's 9.0.2 on Centos


[GENERAL] walreceiver getting bad data?

2011-01-05 Thread Gordon Shannon

I'm getting an error like this in the log a couple of times a day on my hot
standby server.  Any suggestions?

23964   2011-01-04 05:23:00 EST [47]LOG:  invalid record length at
6E53/46E8A010
23535   2011-01-04 05:23:00 EST [2]FATAL:  terminating walreceiver process
due to administrator command
cp: cannot stat `/data23/wal_sync/00016E530046': No such file or
directory
23964   2011-01-04 05:23:00 EST [48]LOG:  invalid record length at
6E53/46E8A010
cp: cannot stat `/data23/wal_sync/00016E530046': No such file or
directory
2486   2011-01-04 05:23:00 EST [1]LOG:  streaming replication successfully
connected to primary

Thanks,
Gordon


Re: [GENERAL] seg fault crashed the postmaster

2011-01-04 Thread Gordon Shannon

I'm putting this on this thread, since it could be related to the issue.

I'm now seeing this in the log on the HSB/SR server.  It's happened about 4
times in the past 2 days.

23964   2011-01-04 05:23:00 EST [47]LOG:  invalid record length at
6E53/46E8A010
23535   2011-01-04 05:23:00 EST [2]FATAL:  terminating walreceiver process
due to administrator command
cp: cannot stat `/data23/wal_sync/00016E530046': No such file or
directory
23964   2011-01-04 05:23:00 EST [48]LOG:  invalid record length at
6E53/46E8A010
cp: cannot stat `/data23/wal_sync/00016E530046': No such file or
directory
2486   2011-01-04 05:23:00 EST [1]LOG:  streaming replication successfully
connected to primary
23964   2011-01-04 15:47:59 EST [49]LOG:  invalid record length at
6E6B/F8222010
2486   2011-01-04 15:47:59 EST [2]FATAL:  terminating walreceiver process
due to administrator command
cp: cannot stat `/data23/wal_sync/00016E6B00F8': No such file or
directory
23964   2011-01-04 15:48:00 EST [50]LOG:  invalid record length at
6E6B/F8222010
cp: cannot stat `/data23/wal_sync/00016E6B00F8': No such file or
directory


The word FATAL sounds very ominous. 



Re: [GENERAL] seg fault crashed the postmaster

2010-12-31 Thread Gordon Shannon

Interesting. That's exactly what we have been doing -- trying to update the
same rows in multiple txns. For us to proceed in production, I will take
steps to ensure we stop doing that, as it's just an app bug really.

The table in question -- v_messages -- is an empty base table with 76
partitions, with a total of 2.8 billion rows.
Let me summarize what I see as the key facts here:

(All problems have come from the UPDATE query, all identical except for
different author_id values.)

1. We did a link upgrade Wed night, from 8.4.4 to 9.0.2, so the upgrade
happened in place; no data files were copied.
2. The 1st error was "compressed data is corrupt" at 18:16.
3. We got 2 seg fault crashes before turning on cores and getting a 3rd
crash with the stack trace.
4. We then got an "invalid memory alloc request size 18446744073449177092"
at 23:50.  This was an ERROR, not a crash.

At this point, is it your suspicion that there is a code bug in 9.0.2,
rather than corrupt data?

I will post the schema and then work on a test case.

-gordon

On Fri, Dec 31, 2010 at 8:34 AM, Tom Lane-2 [via PostgreSQL] wrote:


 Hmm.  This suggests that there's something wrong in the EvalPlanQual
 code, which gets invoked when there are concurrent updates to the same
 row (ie, the row this UPDATE is trying to change is one that was changed
 by some other transaction since the query started).  That stuff got
 rewritten rather thoroughly for 9.0, so the idea of a new bug there
 isn't exactly surprising.  But it's going to be hard to find without
 a test case.  Can you show us the full schema for this table and all
 the queries that execute against it up till the point of the failure?
 (Turning on log_statement across all sessions would help collect that
 info, if you don't have it already.)
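
To make the quoted scenario concrete, the pattern our app creates is
essentially two sessions doing this at roughly the same time (id value made
up):

-- session A
begin;
update v_messages set status = 'S', updated_on = now() where id = 12345;
-- session B runs the same update on the same id before A commits; it blocks
-- on A's row lock, and when A commits the executor re-fetches the new row
-- version and re-checks the WHERE clause. That re-check is the EvalPlanQual
-- path described above.
commit;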





Re: [GENERAL] seg fault crashed the postmaster

2010-12-31 Thread Gordon Shannon

Here is the ddl for the tables in question.  There are foreign keys to other
tables that I omitted.

http://postgresql.1045698.n5.nabble.com/file/n3323804/parts.sql parts.sql 


Re: [GENERAL] seg fault crashed the postmaster

2010-12-31 Thread Gordon Shannon

Sorry, I left that out.  Yeah, I wondered that too, since these tables do
not use toast.

CREATE TYPE message_status_enum AS ENUM ( 'V', 'X', 'S', 'R', 'U', 'D' );


On Fri, Dec 31, 2010 at 12:38 PM, Tom Lane-2 [via PostgreSQL] wrote:

 Hmmm ... what is message_status_enum?  Is that an actual enum type, or
 some kind of domain over text or varchar?  If it's an enum, I'm a bit
 baffled where the corrupt compressed data message could have come from.

 regards, tom lane

 --




Re: [GENERAL] seg fault crashed the postmaster

2010-12-31 Thread Gordon Shannon

Maybe it doesn't work from gmail.  I'll try uploading from here.

http://postgresql.1045698.n5.nabble.com/file/n3323933/plan.txt plan.txt 


Re: [GENERAL] seg fault crashed the postmaster

2010-12-31 Thread Gordon Shannon

The number of matching rows on these queries is anything from 0 to 1.  I
don't think I can tell how many would have matched on the ones that
crashed.  Although I suspect it would have been toward the 1 end.   I've
been trying to get a reproducible test case with no luck so far.

I assume you can now see the plan?  I uploaded it twice, once via gmail and
once on Nabble.

Here are all the non-default settings:

 listen_addresses = '*'
 max_connections = 450
 authentication_timeout = 20s
 shared_buffers = 18GB
 work_mem = 200MB
 maintenance_work_mem = 8GB
 shared_preload_libraries = 'auto_explain'
 wal_level = hot_standby
 synchronous_commit = off
 wal_buffers = 8MB
 commit_siblings = 1
 checkpoint_segments = 256
 checkpoint_warning = 5min
 archive_mode = on
 archive_command = 'cp %p /var/lib/pgsql/wal/%f.wrk; mv
/var/lib/pgsql/wal/%f.wrk /var/lib/pgsql/wal/%f'
 max_wal_senders = 1
 cpu_tuple_cost = 0.003
 cpu_index_tuple_cost = 0.001
 cpu_operator_cost = 0.0005
 effective_cache_size = 52GB
 default_statistics_target = 200
 log_directory = '/var/log/postgres'
 log_filename = 'pg%d.log'
 log_min_duration_statement = 7min
 log_line_prefix = '%p %u %r %t [%l]'
 log_lock_waits = on
 log_temp_files = 0
 track_functions = pl   # none, pl, all
 log_autovacuum_min_duration = 5min
 autovacuum_vacuum_scale_factor = 0.1
 autovacuum_analyze_scale_factor = 0.03
 autovacuum_freeze_max_age = 1500000000   # 1,500,000,000
 autovacuum_vacuum_cost_delay = -1
 temp_tablespaces =
'ts03,ts04,ts05,ts06,ts07,ts08,ts09,ts10,ts11,ts12,ts13,ts14,ts15,ts16,ts17,ts18,ts19,ts20,ts21,ts22,ts23,ts24,ts25,ts26,ts27,ts28,ts29,ts30,ts31,ts32,ts33,ts34,ts35,ts36,ts37,ts38'
 vacuum_freeze_min_age = 500000000     # 500,000,000
 vacuum_freeze_table_age = 1300000000  # 1,300,000,000
 bytea_output = 'escape'
 deadlock_timeout = 5s
 standard_conforming_strings = on
 custom_variable_classes = 'auto_explain'
 auto_explain.log_min_duration = -1  # Remember this means for everybody!
 auto_explain.log_analyze = off  ## DANGER! Don't set log_analyze to
true unless you know what you're doing!
 auto_explain.log_verbose = off
 auto_explain.log_nested_statements = on


On Fri, Dec 31, 2010 at 2:49 PM, Tom Lane-2 [via PostgreSQL] wrote:


 So I'm pretty sure that what we're dealing with is a case of the plan
 freeing a transient tuple datum sooner than it should.  But the toy case
 I tried here didn't fail, so evidently I'm not close enough to the plan
 you're actually using.  Still need to see that EXPLAIN output.  It'd be
 helpful also to know what nondefault configuration settings you're
 using, especially work_mem and planner-related settings.  Also, do you
 have an idea of how many rows might've matched the WHERE conditions?






Re: [GENERAL] seg fault crashed the postmaster

2010-12-31 Thread Gordon Shannon

Unfortunately it's now impossible to say how many were updated, as they get
deleted by another process later. I may be able to restore part of a dump
from 2 days ago on another machine, and get some counts from that, assuming
I have the disk space.  I'll work on that.

I do not believe there could have been more than 2 concurrent updates
because the app is limited to 2 threads, but I will verify that with the
developer.

-gordon

On Fri, Dec 31, 2010 at 3:43 PM, Tom Lane-2 [via PostgreSQL] wrote:



 No luck here either.  A couple further questions --- do you think it's
 likely that you could have been over 1 matches, ie, the LIMIT
 actually kicked in?  Also, could there have been more than two
 concurrent updates acting on the same rows?





[GENERAL] seg fault crashed the postmaster

2010-12-30 Thread Gordon Shannon

Hi,

Running Centos, just upgraded our production db from 8.4.4 to 9.0.2 last
night.  About 20 hours later, an update statement seg faulted and crashed
the server. This is a typical update that has worked fine for a long time.

20898 datafeed (58628) 2010-12-30 19:28:14 EST [103]LOG:  process 20898
acquired ShareLock on transaction 1286738762 after 90174.969 ms
20898 datafeed (58628) 2010-12-30 19:28:14 EST [104]STATEMENT:  update
v_messages set status = 'S', updated_on = now() where id in (select id from
v_messages where author_id = 34409854 and status != 'S' limit 1)
5802   2010-12-30 19:28:14 EST [4]LOG:  server process (PID 20898) was
terminated by signal 11: Segmentation fault
5802   2010-12-30 19:28:14 EST [5]LOG:  terminating any other active server
processes
15426 pipeline (36834) 2010-12-30 19:28:14 EST [1]WARNING:  terminating
connection because of crash of another server process
15426 pipeline 10.10.11.54(36834) 2010-12-30 19:28:14 EST [2]DETAIL:  The postmaster has
commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted
shared memory.

One hint that something might be wrong was this:

18235 datafeed (44228) 2010-12-30 18:16:37 EST [11]LOG:  process 18235
acquired ShareLock on transaction 1285952031 after 29966.703 ms
18235 datafeed (44228) 2010-12-30 18:16:37 EST [12]STATEMENT:  update
v_messages set status = 'S', updated_on = now() where id in (select id from
v_messages where author_id = 25301995 and status != 'S' limit 1)
18235 datafeed (44228) 2010-12-30 18:16:43 EST [13]ERROR:  compressed data
is corrupt
18235 datafeed (44228) 2010-12-30 18:16:43 EST [14]STATEMENT:  update
v_messages set status = 'S', updated_on = now() where id in (select id from
v_messages where author_id = 25301995 and status != 'S' limit 1)

How concerned should I be?  Thanks!

Gordon



Re: [GENERAL] seg fault crashed the postmaster

2010-12-30 Thread Gordon Shannon

I'd love to send you a stack trace. Any suggestions on how to get one?  It
has since happened again, on the same update command, so I'm guessing I can
repeat it.

On Thu, Dec 30, 2010 at 6:52 PM, Tom Lane-2 [via PostgreSQL] wrote:

 Gordon Shannon writes:
  Running Centos, just upgraded our production db from 8.4.4 to 9.0.2 last
  night.  About 20 hours later, an update statement seg faulted and crashed

  the server. This is a typical update that has worked fine for a long
 time.

 Could we see a stack trace from that?  Or at least a self-contained
 test case?

 regards, tom lane






Re: [GENERAL] seg fault crashed the postmaster

2010-12-30 Thread Gordon Shannon

Stack trace:

#0  0x0031a147c15c in memcpy () from /lib64/libc.so.6
#1  0x00450cb8 in __memcpy_ichk (tuple=0x7fffb29ac900) at
/usr/include/bits/string3.h:51
#2  heap_copytuple (tuple=0x7fffb29ac900) at heaptuple.c:592
#3  0x00543d4c in EvalPlanQualFetchRowMarks (epqstate=0x3cd85ab8) at
execMain.c:1794
#4  0x005440db in EvalPlanQual (estate=0x1e0db420,
epqstate=0x3cd85ab8, relation=<value optimized out>, rti=4,
tid=0x7fffb29aca20, priorXmax=<value optimized out>) at execMain.c:1401
#5  0x005592eb in ExecUpdate (node=0x3cd85a28) at
nodeModifyTable.c:527
#6  ExecModifyTable (node=0x3cd85a28) at nodeModifyTable.c:748
#7  0x00545953 in ExecProcNode (node=0x3cd85a28) at
execProcnode.c:359
#8  0x00544881 in ExecutePlan (queryDesc=0x1e727990,
direction=1039265768, count=0) at execMain.c:1190
#9  standard_ExecutorRun (queryDesc=0x1e727990, direction=1039265768,
count=0) at execMain.c:280
#10 0x2ab002c0f2b5 in explain_ExecutorRun (queryDesc=0x1e727990,
direction=ForwardScanDirection, count=0) at auto_explain.c:203
#11 0x005f9c81 in ProcessQuery (plan=0x2112ad60,
sourceText=0x1b3e59e0 "update v_messages set status = 'S', updated_on =
now() where id in (select id from v_messages where author_id = 33138761 and
status != 'S' limit 1)",
params=<value optimized out>, dest=0x2112ae40,
completionTag=0x7fffb29ace20 "") at pquery.c:197
#12 0x005f9e99 in PortalRunMulti (portal=0x1b32aed0, isTopLevel=1
'\001', dest=0x2112ae40, altdest=0x2112ae40, completionTag=0x7fffb29ace20
"") at pquery.c:1268
#13 0x005fa965 in PortalRun (portal=0x1b32aed0,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x2112ae40,
altdest=0x2112ae40, completionTag=0x7fffb29ace20 "") at pquery.c:822
#14 0x005f7455 in exec_simple_query (
query_string=0x1b3e59e0 "update v_messages set status = 'S', updated_on
= now() where id in (select id from v_messages where author_id = 33138761
and status != 'S' limit 1)")
at postgres.c:1058
#15 0x005f7d14 in PostgresMain (argc=<value optimized out>,
argv=<value optimized out>, username=<value optimized out>) at
postgres.c:3929
#16 0x005c7ce5 in ServerLoop () at postmaster.c:3555
#17 0x005c89ec in PostmasterMain (argc=5, argv=0x1b31ea00) at
postmaster.c:1092
#18 0x005725fe in main (argc=5, argv=<value optimized out>) at
main.c:188


On Thu, Dec 30, 2010 at 7:32 PM, Tom Lane-2 [via PostgreSQL] wrote:

 Gordon Shannon writes:
  I'd love to send you a stack trace. Any suggestions on how to get one?
  It
  has since happened again, on the same update command, so I'm guessing I
 can
  repeat it.


 http://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

 regards, tom lane






-- 
If I had more time, I could have written you a shorter letter.  (Blaise
Pascal)



Re: [GENERAL] Need help understanding vacuum verbose output

2010-08-06 Thread Gordon Shannon

OK, so if it knew that all vacuumable tuples could be found in 492 pages, and
it scanned only those pages, then how could it be that it reports 16558
removable tuples from those 492 pages, when it has already reported earlier
that it removed 45878 tuples -- a number we know in fact to be correct?  How
could both statements be correct?


Re: [GENERAL] Need help understanding vacuum verbose output

2010-08-06 Thread Gordon Shannon

Yes, and also from the original post:

 3  INFO:  scanned index authors_archive_pkey to remove 45878 row
 versions
 4  DETAIL:  CPU 0.05s/0.34u sec elapsed 0.41 sec.
 5  INFO:  authors_archive: removed 45878 row versions in 396 pages
 6  DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.

Line 5 is talking about the relation, not the index, right?  Anyway, I know
it did remove 45878 row versions, cuz that's how many I deleted, and the
stats now say 0 dead tuples.
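
By "the stats" I mean pg_stat_user_tables, roughly:

select n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
from pg_stat_user_tables
where relname = 'authors_archive';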


Re: [GENERAL] Need help understanding vacuum verbose output

2010-08-06 Thread Gordon Shannon

 That last message prints tups_vacuumed, but those other ones are counting
 all the removed item pointers.  So apparently Gordon had a whole lot of
 pre-existing DEAD item pointers.  I wonder why ...

Perhaps this will help.  Here's the entire test.

Start with a newly loaded table with 5,063,463 rows. 
1. delete 467 rows
2. delete 4,220 rows
3. vacuum
4. delete 5,091 rows
5. delete 3,832 rows
6. delete 10,168 rows
7. delete 10,469 rows
8. delete 16,318 rows
9. vacuum (output shown in original post)

 13 There were 4687 unused item pointers. 

This corresponds to the 1st 2 steps, I presume, where the 4687 unused
pointers were created by the 1st vacuum.
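
The arithmetic does line up, for what it's worth: 467 + 4,220 = 4,687,
matching the unused item pointers, and 5,091 + 3,832 + 10,168 + 10,469 +
16,318 = 45,878, matching the row versions removed by the second vacuum.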






[GENERAL] Need help understanding vacuum verbose output

2010-08-05 Thread Gordon Shannon

Hi,  Running 8.4.4 on Centos.  A couple of these numbers don't make sense to
me.
(I added line numbers for reference)

1  vacuum verbose authors_archive;
2  INFO:  vacuuming public.authors_archive
3  INFO:  scanned index authors_archive_pkey to remove 45878 row versions
4  DETAIL:  CPU 0.05s/0.34u sec elapsed 0.41 sec.
5  INFO:  authors_archive: removed 45878 row versions in 396 pages
6  DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
7  INFO:  index authors_archive_pkey now contains 5012898 row versions in
13886 pages
8  DETAIL:  45878 index row versions were removed.
9  138 index pages have been deleted, 12 are currently reusable.
10 CPU 0.00s/0.00u sec elapsed 0.00 sec.
11 INFO:  authors_archive: found 16558 removable, 7300 nonremovable row
versions in 492 out of 51958 pages
12 DETAIL:  0 dead row versions cannot be removed yet.
13 There were 4687 unused item pointers.
14 0 pages are entirely empty.

I was the only one touching this table. Since the previous vacuum, I deleted
45,878 rows.
Line 5 says it removed all 45,878 tuples, that's fine.
Line 11 now says it found only 16,558 removable tuples. What does this mean?
Line 11 also says there are 7,300 non-removable tuples.  I expected 0. What
does this mean?
Line 12 says that 0 dead tuples cannot be removed, which seems correct to
me, but seems to contradict Line 11.

Thanks.




[GENERAL] Help writing a query to predict auto analyze

2010-05-19 Thread Gordon Shannon

In an effort to fine-tune my table storage parameters so tables are analyzed
at the optimal time, I have written a query to show how soon my tables will
be auto-analyzed.  But my results do not jibe with what I see autovacuum
doing, i.e. there are tables that are millions of rows past the threshold
that haven't been analyzed for 2 weeks, and autovacuum does not want to auto
analyze them, so I conclude that my query must be wrong. 

The docs say that tables will be auto analyzed when (paraphrasing) the
number of tuples inserted or updated since the last analyze exceeds the
analyze base threshold plus the product of the analyze scale factor and the
number of tuples.

After a reading of the some of the code in autovacuum.c, it appears the
formula used is something like this.  (Note that I'm not quite sure how
number of tuples inserted or updated since the last analyze translates in
code as n_live_tup + n_dead_tup - [tuples at last analyze], but I'll trust
the code is correct, and I need only understand how to get the values of the
variables.)

T = bt + (sf * rt)
X = lt + dt - at
will analyze when X > T

T: Threshold
X: test value
bt: base threshold
sf: scale factor
rt: rel tuples
lt: live tuples
dt: dead tuples
at: last analysis tuples

The next step was to get these values from the system tables. Here's what I
used:

bt: base threshold = current_setting('autovacuum_analyze_threshold') or
table override setting
sf: scale factor = current_setting('autovacuum_analyze_scale_factor') or
table override setting
rt: rel tuples = pg_class.reltuples
lt: live tuples = pg_stat_user_tables.n_live_tup
dt: dead tuples = pg_stat_user_tables.n_dead_tup
at: last analysis tuples = pg_class.reltuples 

I'm the least confident about the last one -- tuples as of last analyze.
Can anyone confirm or correct these?
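
For what it's worth, here is how I'm assembling those pieces into one query
(a sketch: it ignores per-table overrides, and whether reltuples is the right
stand-in for "tuples at last analyze" is exactly the part I'm unsure about):

select s.relname,
       s.n_live_tup + s.n_dead_tup - c.reltuples as test_value,  -- X = lt + dt - at
       current_setting('autovacuum_analyze_threshold')::float8
         + current_setting('autovacuum_analyze_scale_factor')::float8
           * c.reltuples as threshold                            -- T = bt + (sf * rt)
from pg_stat_user_tables s
join pg_class c on c.oid = s.relid
order by 2 desc;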

Version:  PostgreSQL 8.4.3 on x86_64-redhat-linux-gnu, compiled by GCC gcc
(GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit


Thanks!
--gordon



Re: [GENERAL] Help writing a query to predict auto analyze

2010-05-19 Thread Gordon Shannon



alvherre wrote:
 
 Excerpts from Gordon Shannon's message of mié may 19 11:49:45 -0400 2010:
 
 at: last analysis tuples = pg_class.reltuples 
 
 I'm the least confident about the last one -- tuples as of last analyze.
 Can anyone confirm or correct these?
 
 In 8.4 it's the number of dead + live tuples that there were in the previous
 analyze.  See pgstat_recv_analyze in src/backend/postmaster/pgstat.c.
 (In 9.0 it's been reworked a bit.)
 
 

I'm sorry, I'm not following you.  Are you saying that last analysis
tuples is the number of dead + live tuples from the previous analyze?  If so,
that would really confuse me because X would always be 0:

X = lt + dt - at
X = pg_stat_user_tables.n_live_tup + n_dead_tup - (n_live_tup + n_dead_tup)
X = 0

or is there something else wrong with the formula?

--gordon



Re: [GENERAL] Help writing a query to predict auto analyze

2010-05-19 Thread Gordon Shannon



alvherre wrote:
 
 n_live_tup and n_dead_tup corresponds to the current numbers,
 whereas last analysis tuples are the values from back when the
 previous analyze ran.  These counters keep moving per updates, deletes,
 inserts, they are not static.
 
 

OK.  Do you know how I can get the values from back when the previous
analyze ran?


[GENERAL] Crazy looking actual row count from explain analyze

2010-05-10 Thread Gordon Shannon

Running 8.4.3, I have a table with 43 million rows. Two of the columns are
(topic_id int not null) and (status message_status_enum not null), where
message_status_enum is defined as
CREATE TYPE message_status_enum AS ENUM ( 'V', 'X', 'S', 'R', 'U', 'D' ); 

Among the indexes there is this:
m_20100201_topic_multi btree (topic_id, status, source_category_id,
alg_ci_rank_rollup)

..see that topic_id and status are the leading edge of the index.
Fact: there are no rows with status 'S' or 'X'
Fact: there are no rows with topic_id = 1

Consider, then...

explain analyze select count(*) from m_20100201 where status in ('S','X');
                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=987810.75..987810.76 rows=1 width=0) (actual
time=2340.193..2340.194 rows=1 loops=1)
   ->  Bitmap Heap Scan on m_20100201  (cost=987806.75..987810.75 rows=1
width=0) (actual time=2340.191..2340.191 rows=0 loops=1)
         Recheck Cond: (status = ANY ('{S,X}'::message_status_enum[]))
         ->  Bitmap Index Scan on m_20100201_topic_multi
(cost=0.00..987806.75 rows=1 width=0) (actual time=2334.371..2334.371
rows=126336 loops=1)
               Index Cond: (status = ANY ('{S,X}'::message_status_enum[]))

What I don't understand is the actual rows of 126,336 in the bitmap index
scan.  I would expect it to have to scan every index entry, but doesn't this
output mean that it's *returning* 126K rows from that scan?  Whereas I think
it should return zero.

I have already fixed this query by adding a better index.  But the point of
this post is simply to understand this explain analyze output.  Thanks!

--gordon 




Re: [GENERAL] Crazy looking actual row count from explain analyze

2010-05-10 Thread Gordon Shannon


Tom Lane-2 wrote:
 
 My first suspicion
 is that those are unvacuumed dead rows ... what's your vacuuming policy
 on this database?
 

Ah, I didn't know that number included dead tuples.  That probably explains
it.  pg_stat_user_tables says the table has 370,269 dead tuples.  On this
table, I have autovacuum_vacuum_scale_factor set to 0.02, so I believe the
table will have to have 869K dead tuples before vacuum will kick in.
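
(That 869K is just the scale factor applied to the table's reltuples: 0.02
times roughly 43.4 million rows, plus the autovacuum_vacuum_threshold of 50,
comes out to about 869,000 dead tuples.)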

 I have already fixed this query by adding a better index.


Tom Lane-2 wrote:
 
 I think the new index might have fixed things largely by not bothering
 to index already-dead rows.
 

Actually, I put a partial index on status, where != 'V'.  That fits our
usage pattern of 99% of the records being 'V', so it's a tiny index and
satisfies this type of query very quickly.
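
Concretely, it was something along these lines (index name made up for the
example):

create index m_20100201_status_not_v on m_20100201 (status)
where status <> 'V';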

Thanks,

--gordon



Re: [GENERAL] Tracking down log segment corruption

2010-05-02 Thread Gordon Shannon

I just got ran into the same problem.  Both servers are running 8.4.3, and
the standby server had been running for 2 days, processing many thousands of
logs successfully.  Here's my error:

4158   2010-05-02 11:12:09 EDT [26445]LOG:  restored log file
00013C7700C3 from archive
4158   2010-05-02 11:12:09 EDT [26446]LOG:  restored log file
00013C7700C4 from archive
4158   2010-05-02 11:12:09 EDT [26447]WARNING:  specified item offset is too
large
4158   2010-05-02 11:12:09 EDT [26448]CONTEXT:  xlog redo insert: rel
48777166/22362/48778276; tid 2/2
4158   2010-05-02 11:12:09 EDT [26449]PANIC:  btree_insert_redo: failed to
add item
4158   2010-05-02 11:12:09 EDT [26450]CONTEXT:  xlog redo insert: rel
48777166/22362/48778276; tid 2/2
4151   2010-05-02 11:12:09 EDT [1]LOG:  startup process (PID 4158) was
terminated by signal 6: Aborted
4151   2010-05-02 11:12:09 EDT [2]LOG:  terminating any other active server
processes

Any suggestions?


Charles Duffy-5 wrote:
 
 
 [14-1] LOG:  restored log file 000100140010 from archive
 [15-1] WARNING:  specified item offset is too large
 [15-2] CONTEXT:  xlog redo insert_upper: rel 1663/16384/17763; tid 2960/89
 [16-1] PANIC:  btree_insert_redo: failed to add item
 [16-2] CONTEXT:  xlog redo insert_upper: rel 1663/16384/17763; tid 2960/89
 [1-1] LOG:  startup process (PID 17310) was terminated by signal 6:
 Aborted
 [2-1] LOG:  aborting startup due to startup process failure
 
 



Re: [GENERAL] Tracking down log segment corruption

2010-05-02 Thread Gordon Shannon
On Sun, May 2, 2010 at 11:02 AM, Tom Lane t...@sss.pgh.pa.us wrote:



 Hmm ... AFAICS the only way to get that message when the incoming TID's
 offsetNumber is only 2 is for the index page to be completely empty
 (not zeroes, else PageAddItem's sanity check would have triggered,
 but valid and empty).  What that smells like is a software bug, like
 failing to emit a WAL record in a case where it was necessary.  Can you
 identify which index this was?  (Look for relfilenode 48778276 in the
 database with OID 22362.)  If so, can you give us any hints about
 unusual things that might have been done with that index?


Interesting. There is no pg_class entry for 22362.  There is, however, an
entry for that filenode. It's an index I created Sat AM, about 6AM.

select oid, * from pg_class where relfilenode=48778276;
-[ RECORD 1 ]--+---
oid| 48777488
relname| cts_20100501_topic_date_nk
relnamespace   | 2200
reltype| 0
relowner   | 16412
relam  | 403
relfilenode| 48778276
reltablespace  | 48777166
relpages   | 2476
reltuples  | 58879
reltoastrelid  | 0
reltoastidxid  | 0
relhasindex| f
relisshared| f
relistemp  | f
relkind| i
relnatts   | 2
relchecks  | 0
relhasoids | f
relhaspkey | f
relhasrules| f
relhastriggers | f
relhassubclass | f
relfrozenxid   | 0
relacl | null
reloptions | null

Possibly relevant facts:

- The WSB server went active on Friday around 3:30PM
- On Friday evening, I added about 11 tablespaces.  I noted the new files on
the WSB, no problems.
- On Sat morning, I created a partitioned table cts_20100501 (inherits from
another table) and 4 indexes.
- This morning, I was doing some table maintenance on the master and
discovered I had created this table and its indexes in the wrong tablespace.
I wanted the table in ts29, but had it in ts30.  Vice versa for the
indexes.  So I moved them. This is from my command history:

alter index cts_20100501_natural_uk set tablespace ts30;
alter index cts_20100501_pkey set tablespace ts30;
alter index cts_20100501_topic_date_nk set tablespace ts30;
alter index cts_20100501_updated_nk set tablespace ts30;
alter table cts_20100501 set tablespace ts29;

These commands worked fine on the master, yet this seems suspiciously
relevant.


  Any suggestions?

 As far as recovering goes, there's probably not much you can do except
 resync the standby from scratch.  But it would be nice to get to the
 bottom of the problem, so that we can fix the bug.  Have you got an
 archive of this xlog segment and the ones before it, and would you be
 willing to let a developer look at them?


Before I received your reply, I had already started the re-sync, and
unfortunately already deleted the wal logs in question.
If it happens again, I will certainly keep them, and would be happy to share
them.

Regards,
Gordon


Re: [GENERAL] Tracking down log segment corruption

2010-05-02 Thread Gordon Shannon
On Sun, May 2, 2010 at 12:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 No, this would be a pg_database row with that OID.  But it looks like
 you found the relevant index anyway.

 Yup, realized that on second reading.


  These commands worked fine on the master, yet this seems suspiciously
  relevant.


 Yeah, perhaps so.  What time did the failure on the standby occur (how
 long after you did those moves)?  Is it reasonable to assume that this
 was the first subsequent access to the index?


Bingo.  Yes it is reasonable.  It was 25 seconds between my altering the
index in question and the server crash.

My local commands (in MDT, plus my machine is 15 sec ahead of the server):

09:10:52 alter index cts_20100501_natural_uk set tablespace ts30;
ALTER INDEX
Time: 787.790 ms
09:11:41 alter index cts_20100501_pkey set tablespace ts30;
ALTER INDEX
Time: 468.526 ms
09:11:51 alter index cts_20100501_topic_date_nk set tablespace ts30;
ALTER INDEX
Time: 385.322 ms
09:11:59 alter index cts_20100501_updated_nk set tablespace ts30;
ALTER INDEX
Time: 963.150 ms
09:12:10 alter table cts_20100501 set tablespace ts29;
ALTER TABLE

And from the wsb log (times in EDT):

4158   2010-05-02 11:12:09 EDT [26446]LOG:  restored log file
00013C7700C4 from archive
4158   2010-05-02 11:12:09 EDT [26447]WARNING:  specified item offset is too
large
4158   2010-05-02 11:12:09 EDT [26448]CONTEXT:  xlog redo insert: rel
48777166/22362/48778276; tid 2/2
4158   2010-05-02 11:12:09 EDT [26449]PANIC:  btree_insert_redo: failed to
add item


Re: [GENERAL] Tracking down log segment corruption

2010-05-02 Thread Gordon Shannon
On Sun, May 2, 2010 at 12:52 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Gordon Shannon gordo...@gmail.com writes:
  Bingo.  Yes it is reasonable.  It was 25 seconds between my altering the
  index in question and the server crash.

 Sounds like we have a smoking gun.  Could you show all your non-default
 postgresql.conf settings on the master?  I'm wondering about
 full_page_writes in particular, but might as well gather all the
 relevant data while we're at it.  Easiest way is:

 select name,setting from pg_settings where source not in
 ('default','override');

regards, tom lane


 archive_command | cp %p /var/lib/pgsql/wal/%f.wrk; mv
/var/lib/pgsql/wal/%f.wrk /var/lib/pgsql/wal/%f
 archive_mode| on
 authentication_timeout  | 20
 autovacuum_analyze_scale_factor | 0.05
 autovacuum_freeze_max_age   | 1500000000
 autovacuum_vacuum_cost_delay| -1
 autovacuum_vacuum_scale_factor  | 0.1
 checkpoint_segments | 128
 checkpoint_warning  | 300
 commit_siblings | 1
 cpu_index_tuple_cost| 0.001
 cpu_operator_cost   | 0.0005
 cpu_tuple_cost  | 0.003
 DateStyle   | ISO, MDY
 deadlock_timeout| 5000
 default_text_search_config  | pg_catalog.english
 effective_cache_size| 6291456
 lc_messages | en_US.UTF-8
 lc_monetary | en_US.UTF-8
 lc_numeric  | en_US.UTF-8
 lc_time | en_US.UTF-8
 listen_addresses| *
 log_autovacuum_min_duration | 0
 log_destination | stderr
 log_directory   | /var/log/postgres
 log_filename| pg%d.log
 log_line_prefix | %p %u %r %t [%l]
 log_min_duration_statement  | 18
 log_rotation_age| 1440
 log_rotation_size   | 0
 log_temp_files  | 0
 log_timezone| US/Eastern
 log_truncate_on_rotation| on
 logging_collector   | on
 maintenance_work_mem| 8388608
 max_connections | 500
 max_stack_depth | 2048
 port| 5432
 search_path | public, archive, _slony_cluster
 shared_buffers  | 2359296
 standard_conforming_strings | on
 synchronous_commit  | off
 temp_tablespaces|
ts27,ts28,ts29,ts30,ts31,ts32,ts33,ts34,ts35,ts36,ts37
 TimeZone| US/Eastern
 timezone_abbreviations  | Default
 track_functions | pl
 vacuum_freeze_min_age   | 500000000
 vacuum_freeze_table_age | 1300000000
 wal_buffers | 1024
 work_mem| 204800


Re: [GENERAL] Tracking down log segment corruption

2010-05-02 Thread Gordon Shannon
Sounds like you're on it.  Just wanted to share one additional piece, in
case it helps.

Just before the ALTER INDEX SET TABLESPACE was issued, there were some
writes to the table in question inside a serializable transaction. The
transaction committed at 11:11:58 EDT, and consisted of, among a couple
thousand writes to sibling tables, 4 writes (unknown combination of inserts
and updates) to cts_20100501, which definitely affected the index in
question.

In any case, I will cease and desist from ALTER SET TABLESPACE for a while!

Thanks!
Gordon

Between 11:11:56 and 11:11:58 EDT (11 sec before the crash), there were

On Sun, May 2, 2010 at 3:16 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Found it, I think.  ATExecSetTableSpace transfers the copied data to the
 slave by means of XLOG_HEAP_NEWPAGE WAL records.  The replay function
 for this (heap_xlog_newpage) is failing to pay any attention to the
 forkNum field of the WAL record.  This means it will happily write FSM
 and visibility-map pages into the main fork of the relation.  So if the
 index had any such pages on the master, it would immediately become
 corrupted on the slave.  Now indexes don't have a visibility-map fork,
 but they could have FSM pages.  And an FSM page would have the right
 header information to look like an empty index page.  So dropping an
 index FSM page into the main fork of the index would produce the
 observed symptom.

 I'm not 100% sure that this is what bit you, but it's clearly a bug and
 AFAICS it could produce the observed symptoms.

 This is a seriously, seriously nasty data corruption bug.  The only bit
 of good news is that ALTER SET TABLESPACE seems to be the only operation
 that can emit XLOG_HEAP_NEWPAGE records with forkNum different from
 MAIN_FORKNUM, so that's the only operation that's at risk.  But if you
 do do that, not only are standby slaves going to get clobbered, but the
 master could get corrupted too if you were unlucky enough to have a
 crash and replay from WAL shortly after completing the ALTER.  And it's
 not only indexes that are at risk --- tables could get clobbered the
 same way.

 My crystal ball says there will be update releases in the very near
 future.

regards, tom lane



Re: [GENERAL] unexplained autovacuum to prevent wraparound

2010-03-14 Thread Gordon Shannon
That looks like the fix for this, thanks!  I will try to upgrade soon.

-- Gordon

On Sun, Mar 14, 2010 at 7:43 AM, Alvaro Herrera
alvhe...@commandprompt.com wrote:

 Gordon Shannon escribió:
  Ah, now I see what you meant.  Forgive me, I thought you were referring
 to
  the pg_autovacuum table in 8.3 where you have to specify something for
 each
  column, and -1 says use the default.  It appears in 8.4.0 I have to
  explicitly set ALL (?) other storage parameters to -1 to get the default,
  otherwise I am getting zero for each value??  I don't believe the
  documentation mentions this rather important detail:
 
 http://www.postgresql.org/docs/8.4/static/sql-createtable.html#SQL-CREATETABLE-STORAGE-PARAMETERS
 .
  Did I miss it somewhere?

 Err, no, that would be a bug.  Please update to 8.4.2, I think we fixed
 it there.

 revision 1.28.2.1
 date: 2009-08-27 13:19:31 -0400;  author: alvherre;  state: Exp;  lines:
 +10 -10;
 Fix handling of autovacuum reloptions.

 In the original coding, setting a single reloption would cause default
 values to be used for all the other reloptions.  This is a problem
 particularly for autovacuum reloptions.

 Itagaki Takahiro


 --
 Alvaro Herrera
 http://www.CommandPrompt.com/
 PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [GENERAL] unexplained autovacuum to prevent wraparound

2010-03-13 Thread Gordon Shannon
Ah, now I see what you meant.  Forgive me, I thought you were referring to
the pg_autovacuum table in 8.3 where you have to specifiy something for each
column, and -1 says use the default.  It appears in 8.4.0 I have to
explicitly set ALL (?) other storage parameters to -1 to get the default,
otherwise I am getting zero for each value??  I don't believe the
documentation mentions this rather important detail:
http://www.postgresql.org/docs/8.4/static/sql-createtable.html#SQL-CREATETABLE-STORAGE-PARAMETERS.
Did I miss it somewhere?
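
An easy way to see what actually got stored (and to spot the unintended
zeroes) is to look at reloptions directly:

select relname, reloptions from pg_class where relname = 'foo';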

Thanks!
Gordon

On Fri, Mar 12, 2010 at 4:45 PM, Gordon Shannon gordo...@gmail.com wrote:

 This is 8.4, there is no pg_autovacuum table.  I set it like this:

 alter table foo set (autovacuum_analyze_scale_factor=0.01);




 On Fri, Mar 12, 2010 at 4:31 PM, Alvaro Herrera 
 alvhe...@commandprompt.com wrote:

 Gordon Shannon escribió:

  One possibly interesting thing is that this seems to have started just
 after
  I set foo's autovacuum_analyze_scale_factor to 0.01, since I wanted more
  frequent analyze runs.  I wonder if that could be related.

 You probably set the other values to 0, which includes the freeze age.
 You need to set it (and all other values) to -1 instead.

 --
 Alvaro Herrera
 http://www.CommandPrompt.com/
 PostgreSQL Replication, Consulting, Custom Development, 24x7 support






-- 
If I had more time, I could have written you a shorter letter.  (Blaise
Pascal)


[GENERAL] unexplained autovacuum to prevent wraparound

2010-03-12 Thread Gordon Shannon

It appears to me that in my 8.4.0 system, autovacuum is running to prevent
wraparound contrary to the documentation.  I have it set so that a table's
relfrozenxid has to reach an age of 1.5 billion before that kicks in:

 show autovacuum_freeze_max_age;
 1500000000

 show vacuum_freeze_table_age;
 1300000000

 show vacuum_freeze_min_age;
 1000000000

Table foo has an age just over 1 billion, still well under the
freeze_table_age:

 select age(relfrozenxid) from pg_class where relname='foo';
age 

 1055823634

yet, I see this in pg_stat_activity:

autovacuum: VACUUM public.foo (to prevent wraparound)

One possibly interesting thing is that this seems to have started just after
I set foo's autovacuum_analyze_scale_factor to 0.01, since I wanted more
frequent analyze runs.  I wonder if that could be related.

Any ideas?  These wraparound autovacuums are taking up my vacuum workers so
no analyze workers can run.

Thanks!
Gordon




Re: [GENERAL] unexplained autovacuum to prevent wraparound

2010-03-12 Thread Gordon Shannon
This is 8.4, there is no pg_autovacuum table.  I set it like this:

alter table foo set (autovacuum_analyze_scale_factor=0.01);



On Fri, Mar 12, 2010 at 4:31 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:

 Gordon Shannon escribió:

  One possibly interesting thing is that this seems to have started just
 after
  I set foo's autovacuum_analyze_scale_factor to 0.01, since I wanted more
  frequent analyze runs.  I wonder if that could be related.

 You probably set the other values to 0, which includes the freeze age.
 You need to set it (and all other values) to -1 instead.

 --
 Alvaro Herrera
 http://www.CommandPrompt.com/
 PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [GENERAL] unexplained autovacuum to prevent wraparound

2010-03-12 Thread Gordon Shannon
Thanks, but I do want 1%.

 On Fri, Mar 12, 2010 at 5:19 PM, Joshua D. Drake j...@commandprompt.com wrote:

 On Fri, 2010-03-12 at 16:45 -0700, Gordon Shannon wrote:
  This is 8.4, there is no pg_autovacuum table.  I set it like this:
 
  alter table foo set (autovacuum_analyze_scale_factor=0.01);

 That is 1% changes. I think you want .10

 Sincerely,

 Joshua D. Drake


 


 --
 PostgreSQL.org Major Contributor
 Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
 Consulting, Training, Support, Custom Development, Engineering
 Respect is earned, not gained through arbitrary and repetitive use or Mr.
 or Sir.




-- 
If I had more time, I could have written you a shorter letter.  (Blaise
Pascal)


[GENERAL] Getting insufficient data left in message on copy with binary

2009-09-03 Thread Gordon Shannon

Hello,

I'm running 8.4 on Linux/Centos.  I am doing a copy (select ) to
'/absolute/path/to/file.dat' with binary. That works fine.  But when I load
that file into a table...

  copy mytable (id, mlid, parent_mlid, author_id, date_id, time_id,
content_type_id, provider_id,
is_duplicate, is_spam, language_code, profanity, tonality, sentiment,
created_time, updated_at)
  from '/absolute/path/to/file.dat' with binary;

  ERROR:  insufficient data left in message
  CONTEXT:  COPY mytable, line 1, column provider_id

The same data works fine without the with binary.  Also, the column it's
complaining about, provider_id, is a NOT NULL column, and the data is
definitely there -- i.e. not a NULL in data file.

I have searched for this message and mostly I see issues related to JDBC
drivers, so that doesn't appear relevant.  And they all talk about nul bytes
(0x00), but again, how can that be relevant when I'm in binary mode?  Seems
like it should understand null bytes here, if that's what this is about.

Anybody seen this?

Thanks,
Gordon



Re: [GENERAL] Getting insufficient data left in message on copy with binary

2009-09-03 Thread Gordon Shannon



Tom Lane-2 wrote:
 
 Gordon Shannon gordo...@gmail.com writes:
 
   ERROR:  insufficient data left in message
   CONTEXT:  COPY mytable, line 1, column provider_id
 
 Anybody seen this?
 
 No.  Can you extract a self-contained test case?
 


Got it.  The problem was a combination of 2 mis-matched data types. 
Consider this test case:

----- begin -----

drop table if exists bar;
drop table if exists foo;

create table foo (
system_id smallint,
credibility real not null
);

insert into foo ( system_id, credibility) values (1,1);

copy foo to '/tmp/repeat.dat' with binary;

create table bar (
system_id int,
credibility numeric(10,9) not null
);

copy bar from '/tmp/repeat.dat' with binary;

copy bar from '/var/lib/pgsql/backups/repeat.dat' with binary;
psql:repeat:19: ERROR:  insufficient data left in message
CONTEXT:  COPY bar, line 1, column system_id

----- end -----

It's interesting to note that I get this error only when there are 2 bad
fields.
If I fix only the numeric field, I get "incorrect binary data format" on the
int field.
If I fix only the smallint field, I get "invalid length in external
numeric value" on the real field.

So, my fault, and the fix is obvious.  But it does seem like a less than
ideal error message.

Also, maybe a sentence like this would be helpful on the COPY page: Be
careful that the data types match from 'copy to' to 'copy from'.  There is
no implicit conversion done in binary mode... or some such?
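
For completeness, the load works once the target column types match the file
exactly, e.g.:

create table bar (
system_id smallint,
credibility real not null
);

copy bar from '/tmp/repeat.dat' with binary;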

Cheers



[GENERAL] Got could not truncate directory pg_multixact/offsets: apparent wraparound

2009-09-03 Thread Gordon Shannon

Hello,  running 8.4 on Centos.  Been running production for 6 months.  Never
saw this message until tonight:

  LOG: could not truncate directory pg_multixact/offsets: apparent
wraparound

In case it helps...

Output of pg_controldata:
Latest checkpoint's NextMultiXactId:  145725622
Latest checkpoint's NextMultiOffset:  394849408

Contents of pg_multixact/offsets:  08B0
Contents of pg_multixact/members: 178B

In conf file:
vacuum_freeze_min_age = 1000000000  # (1 billion)
autovacuum_freeze_max_age = 1500000000  # (1.5 billion)

Oldest txn in cluster (pg_database.datfrozenxid) : 648

Should I be concerned? 

Thanks
Gordon




Re: [GENERAL] How to capture an interactive psql session in a log file?

2009-04-03 Thread Gordon Shannon

That does the trick, awesome!

I do think it would be great if psql had a stderr capture in addition to
stdout.

Thanks


hubert depesz lubaczewski-2 wrote:
 
 On Thu, Apr 02, 2009 at 10:55:10PM -0700, Gordon Shannon wrote:
 Has anyone solved this issue before?
 
 have you seen program script?
 
 usage:
 
 just run script
 you will get shell. now run your command you want to capture everything
 from - it will work as usual.
 after you finish - exit the script-shell, and check the generated
 typescript file
 
 Best regards,
 
 depesz
 
 -- 
 Linkedin: http://www.linkedin.com/in/depesz  /  blog:
 http://www.depesz.com/
 jid/gtalk: dep...@depesz.com / aim:depeszhdl / skype:depesz_hdl /
 gg:6749007
 


[GENERAL] How to capture an interactive psql session in a log file?

2009-04-02 Thread Gordon Shannon

What I'm trying to do doesn't seem like it should be that difficult or
unusual, but I can't seem to find the right combination of commands to make
it happen.  I want to have a log file that captures everything from an
interactive psql session.  Running 8.3.7 with bash shell on Linux.  

If I use \o file or \o |tee file, it doesn't capture my entered commands,
or any error text.

I can use some fancy bash to capture stderr in the file too, and then error
text goes there, but no matter what, neither the commands I enter nor the
psql prompt will be captured in the log file.  

This is even when using -a, or \set ECHO all.  It's as if my entered
commands neither go to stdout or stderr.

Has anyone solved this issue before?

Thanks in advance
Gordon