Re: [GENERAL] Call for design: PostgreSQL mugs

2013-09-09 Thread Quentin Hartman
On Mon, Sep 9, 2013 at 10:09 AM, Mike Christensen wrote:

> How about something incredibly cheesy like
>
> SELECT * FROM Mug;
>

I dig this. Black mug with this in white on one side, and a postgres logo
in white on the other side. I'd buy one...

 One thing I have not seen discussed here is "Who is the audience?" Is this
something that is intended to be primarily a promotional tool? If so, then
slogans and "we're awesome" things make sense. If it's something for the
community to be proud of and have fun with, then little punny things like
this are better. Also, I would keep words to a minimum for i18n. Really
should just keep it to code, "PostgreSQL". and images.

QH


Re: [GENERAL] Weird error when setting up streaming replication

2013-08-09 Thread Quentin Hartman
OK, figured this out. I had it start copying the pg_xlog directory as well
when doing the initial sync. I realized this is also the first time I've
setup replication from scratch using 9.2. All my other 9.2 pairs were setup
on either 9.0 or 9.1, and have been upgraded from there with replication
already in place. Previously, and still according to that article in the
wiki, the pg_xlog directory was specifically excluded. Does anyone know why
this behavior may have changed?


On Fri, Aug 9, 2013 at 9:33 AM, Quentin Hartman <
qhart...@direwolfdigital.com> wrote:

> This pair of servers aren't replacing anything, they are new, empty
> servers. Before starting the slave at all, I'm copying the entire data
> filestructure over to it via rsync. I'm doing almost exactly what is
> described here:
> http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial#Binary_Replication_in_6_Steps.
>  The only different is that I've tweaked the paths on the rsync to be
> appropriate to my system layout. I've even gone so far as to delete
> everything in the data dir except for the pg_xlog directory before syncing
> everything over to make sure it wasn't caused by something not getting
> overwritten when it was supposed to.
>
>
> On Thu, Aug 8, 2013 at 6:23 PM, Michael Paquier  > wrote:
>
>> On Fri, Aug 9, 2013 at 8:55 AM, Quentin Hartman
>>  wrote:
>> > 2013-08-08 23:47:30 GMT LOG:  WAL file is from different database system
>> > 2013-08-08 23:47:30 GMT DETAIL:  WAL file database system identifier is
>> > 5909892614333033983, pg_control database system identifier is
>> > 5909892824786287231.
>> It looks that you are not able to detect valid checkpoint records when
>> replaying WAL because your new system has been initialized with a
>> fresh initdb, symbolized by the errors above. You should build your
>> new node using a base backup or a snapshot of the data folder of the
>> node you are trying to replace.
>> --
>> Michael
>>
>
>


Re: [GENERAL] How To Install Extension Via Script File?

2013-08-09 Thread Quentin Hartman
A bit of an aside, but you also might want to change that "CREATE EXTENSION
ltree;" to

"CREATE EXTENSION IF NOT EXISTS ltree;"

That way your script won't error out if the extension is already enabled.


On Fri, Aug 9, 2013 at 3:57 AM, Don Parris  wrote:

> On Thu, Aug 8, 2013 at 8:42 PM, Tom Lane  wrote:
>
>> Don Parris  writes:
>> > When I try a simple psql -U postgres -W - just to initiate the psql
>> > session, I get:
>> > psql: FATAL:  Peer authentication failed for user "postgres"
>>
>> > It's like my regular user cannot connect as the postgres user.
>>
>> You're right, it can't, if you've selected peer authentication in
>> pg_hba.conf.  You'd need to use some other auth method, perhaps
>> password-based auth, if you want this to work.  Read up on auth methods
>> in the fine manual.
>>
>> > However, this works (with me just typing my password for sudo):
>> > donp@wiesbaden:~$ sudo -u postgres psql -U postgres
>> > [sudo] password for donp:
>>
>> Sure, because then psql is launched as the postgres OS user, and peer auth
>> will let that user connect as the postgres DB user.
>>
>> regards, tom lane
>>
>
> A...  now I understand.  Thanks Tom!  That should really help!  I'll
> check that out.
>
> --
> D.C. Parris, FMP, Linux+, ESL Certificate
> Minister, Security/FM Coordinator, Free Software Advocate
> http://dcparris.net/ 
> 
> GPG Key ID: F5E179BE
>


Re: [GENERAL] Weird error when setting up streaming replication

2013-08-09 Thread Quentin Hartman
This pair of servers aren't replacing anything, they are new, empty
servers. Before starting the slave at all, I'm copying the entire data
filestructure over to it via rsync. I'm doing almost exactly what is
described here:
http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial#Binary_Replication_in_6_Steps.
The only different is that I've tweaked the paths on the rsync to be
appropriate to my system layout. I've even gone so far as to delete
everything in the data dir except for the pg_xlog directory before syncing
everything over to make sure it wasn't caused by something not getting
overwritten when it was supposed to.


On Thu, Aug 8, 2013 at 6:23 PM, Michael Paquier
wrote:

> On Fri, Aug 9, 2013 at 8:55 AM, Quentin Hartman
>  wrote:
> > 2013-08-08 23:47:30 GMT LOG:  WAL file is from different database system
> > 2013-08-08 23:47:30 GMT DETAIL:  WAL file database system identifier is
> > 5909892614333033983, pg_control database system identifier is
> > 5909892824786287231.
> It looks that you are not able to detect valid checkpoint records when
> replaying WAL because your new system has been initialized with a
> fresh initdb, symbolized by the errors above. You should build your
> new node using a base backup or a snapshot of the data folder of the
> node you are trying to replace.
> --
> Michael
>


[GENERAL] Weird error when setting up streaming replication

2013-08-08 Thread Quentin Hartman
I'm going through all my usual steps for setting up streaming replication
on a new pair of servers. Modify configs as appropriate, rsync data from
master to slave, etc. I have this all automated with chef, and it has been
pretty bulletproof for awhile. However, today, I ran into this when
starting the slave on this new pair:

 * Starting PostgreSQL 9.2 database
server
* The PostgreSQL server failed to start. Please check the log output:
2013-08-08 23:47:30 GMT LOG:  database system was interrupted; last known
up at 2013-08-08 23:22:40 GMT
2013-08-08 23:47:30 GMT LOG:  entering standby mode
2013-08-08 23:47:30 GMT LOG:  WAL file is from different database system
2013-08-08 23:47:30 GMT DETAIL:  WAL file database system identifier is
5909892614333033983, pg_control database system identifier is
5909892824786287231.
2013-08-08 23:47:30 GMT LOG:  invalid primary checkpoint record
2013-08-08 23:47:30 GMT LOG:  invalid secondary checkpoint record
2013-08-08 23:47:30 GMT PANIC:  could not locate a valid checkpoint record
2013-08-08 23:47:30 GMT LOG:  startup process (PID 10600) was terminated by
signal 6: Aborted
2013-08-08 23:47:30 GMT LOG:  aborting startup due to startup process
failure


And I've been stumped. I've completely nuked my data dirs and started over
and gotten the same result, but with different identifier numbers (as I
would expect).

Any Ideas?

Thanks!

QH


Re: [GENERAL] Negative replication lag?

2013-04-23 Thread Quentin Hartman
Ah, that makes sense. I think I'll add some logic to the script that has it
get new data points if it comes up with a negative value.

Thanks for the insight.

QH


On Mon, Apr 22, 2013 at 5:11 PM, Andres Freund wrote:

> On 2013-04-22 16:36:38 -0600, Quentin Hartman wrote:
> > I'm using this script to check my replication lag on my streaming
> > replication pairs with Nagios:
> >
> > https://gist.github.com/jacobian/743942
> >
> > It generally works fine, but will occasionally return a negative lag
> value
> > (-37kb for example) which of course causes it to throw an alarm, but is
> > total nonsense. I've been working on the assumption that it is some sort
> of
> > bug in the script, but in taking a quick look at it nothing jumps out at
> me.
> >
> > Is there something in Postgres itself that could cause this to happen
> once
> > in awhile? Is it something to be concerned about? Is there a better way
> to
> > monitor this state?
>
> Well, between the time pg_current_xlog_location() is run on the primary
> and pg_last_xlog_replay_location() on the standby some time passes, so
> its not all that unlikely that wal has been generated, streamed *and*
> applied in that time. Given the short timeframe it only happens every
> now and then.
>
> Did you check the pg_stat_replication view on the primary?
>
> Greetings,
>
> Andres Freund
>
> --
>  Andres Freund http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


[GENERAL] Negative replication lag?

2013-04-22 Thread Quentin Hartman
I'm using this script to check my replication lag on my streaming
replication pairs with Nagios:

https://gist.github.com/jacobian/743942

It generally works fine, but will occasionally return a negative lag value
(-37kb for example) which of course causes it to throw an alarm, but is
total nonsense. I've been working on the assumption that it is some sort of
bug in the script, but in taking a quick look at it nothing jumps out at me.

Is there something in Postgres itself that could cause this to happen once
in awhile? Is it something to be concerned about? Is there a better way to
monitor this state?

Thanks!

QH


Re: [GENERAL] AWS and postgres issues

2013-04-08 Thread Quentin Hartman
What version of pgpool are you using?

Are there other commands you have a problem with? I would suspect that the
restart is causing the postgres server to go away, pgpool decides to
disconnect, and then it has to be manually added back to the cluster.
Unless of course you've got automatic failback setup, but even then I would
expect that command to do weird things when issued through middleware like
pgpool, regardless of what sort of infrastructure you are running on.

QH


Re: [GENERAL] PostgreSQL Backup Booklet

2013-04-03 Thread Quentin Hartman
Also, Packt has a deal on their site where if you buy 2 ebooks, you get
them both for half off. This one seems to qualify, so I Was able to get it
for $3, while also getting another book I've been wanting for half what I
was expecting to pay.

Hooray!

QH


Re: [GENERAL] Streaming replication slave crash

2013-03-29 Thread Quentin Hartman
On Fri, Mar 29, 2013 at 10:50 AM, Tom Lane  wrote:

> Quentin Hartman  writes:
> > On Fri, Mar 29, 2013 at 10:37 AM, Tom Lane  wrote:
> >> What process did you use for setting up the slave?
>
> > I used an rsync from the master while both were stopped.
>
> If the master was shut down cleanly (not -m immediate) then the bug fix
> I was thinking about wouldn't explain this.  The fact that the panic
> didn't recur after restarting seems to void that theory as well.  I'm
> not sure what to make of that angle.
>

Yes, it was shut down cleanly. A good thought, but I don't think it's
relevant in this case.


> Can you determine which table is being complained of in the failure
> message, ie, what has relfilenode 63370 in database 63229?  If so it
> would be interesting to know what was being done to that table on the
> master.
>

Good point! Looking deeper into that, it's actually one of our smaller
tables, and it doesn't seem to have any corruption, on either server. I was
able to select all the records from it and the content seems sane. The only
thing that would have been happening on that table is an INSERT or UPDATE.

I think I'm going to run with the spurious EC2 hiccup explanation. I'm
comfortable with that given the extra due diligence I've done with your
(and Lonni's) guidance.

Thanks!

QH


Re: [GENERAL] Streaming replication slave crash

2013-03-29 Thread Quentin Hartman
On Fri, Mar 29, 2013 at 10:37 AM, Tom Lane  wrote:

> Quentin Hartman  writes:
> > Yesterday morning, one of my streaming replication slaves running 9.2.3
> > crashed with the following in the log file:
>
> What process did you use for setting up the slave?
>

I used an rsync from the master while both were stopped.


> This theory would be more probable if it's a relatively new slave, since
> any corruption would have been there in the slave's initial state, just
> waiting for the replay to run into it.
>

It's newish. I upgraded this pair from 9.1.x to 9.2.3 a little over a week
ago, so did a dump/reload and resynced then. It was running happily in
between now and then.

Thanks!

QH


Re: [GENERAL] Streaming replication slave crash

2013-03-29 Thread Quentin Hartman
On Fri, Mar 29, 2013 at 10:23 AM, Lonni J Friedman wrote:

> Looks like you've got some form of coruption:
> page 1441792 of relation base/63229/63370 does not exist
>

Thanks for the insight. I thought that might be it, but never having seen
this before I'm glad to have some confirmation.

The question is whether it was corrupted on the master and then
> replicated to the slave, or if it was corrupted on the slave.  I'd
> guess that the pg_dump tried to read from that page and barfed.  It
> would be interesting to try re-running the pg_dump again to see if
> this crash can be replicated.  If so, does it also replicate if you
> run pg_dump against the master?  If not, then the corruption is
> isolated to the slave, and you might have a hardware problem which is
> causing the data to get corrupted.
>

Yes, we've gotten several clean dumps form the slave since then w/o
crashing. We're running these machines on EC2 so we sadly have no control
over the hardware. With your confirmation, and an apparently clean state
now, I'm inclined to chalk this up to an EC2 hiccup getting caught by
Postgres and get on with life.

Thanks!

QH


[GENERAL] Streaming replication slave crash

2013-03-29 Thread Quentin Hartman
Yesterday morning, one of my streaming replication slaves running 9.2.3
crashed with the following in the log file:

2013-03-28 12:49:30 GMT WARNING:  page 1441792 of relation base/63229/63370
does not exist
2013-03-28 12:49:30 GMT CONTEXT:  xlog redo delete: index
1663/63229/109956; iblk 303, heap 1663/63229/63370;
2013-03-28 12:49:30 GMT PANIC:  WAL contains references to invalid pages
2013-03-28 12:49:30 GMT CONTEXT:  xlog redo delete: index
1663/63229/109956; iblk 303, heap 1663/63229/63370;
2013-03-28 12:49:31 GMT LOG:  startup process (PID 22941) was terminated by
signal 6: Aborted
2013-03-28 12:49:31 GMT LOG:  terminating any other active server processes
2013-03-28 12:49:31 GMT WARNING:  terminating connection because of crash
of another server process
2013-03-28 12:49:31 GMT DETAIL:  The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
2013-03-28 12:49:31 GMT HINT:  In a moment you should be able to reconnect
to the database and repeat your command.
2013-03-28 12:57:44 GMT LOG:  database system was interrupted while in
recovery at log time 2013-03-28 12:37:42 GMT
2013-03-28 12:57:44 GMT HINT:  If this has occurred more than once some
data might be corrupted and you might need to choose an earlier recovery
target.
2013-03-28 12:57:44 GMT LOG:  entering standby mode
2013-03-28 12:57:44 GMT LOG:  redo starts at 19/2367CE30
2013-03-28 12:57:44 GMT LOG:  incomplete startup packet
2013-03-28 12:57:44 GMT LOG:  consistent recovery state reached at
19/241835B0
2013-03-28 12:57:44 GMT LOG:  database system is ready to accept read only
connections
2013-03-28 12:57:44 GMT LOG:  invalid record length at 19/2419EE38
2013-03-28 12:57:44 GMT LOG:  streaming replication successfully connected
to primary

As you can see I was able to restart it and it picked up and synchronized
right away, but this crash still concerns me.

The DB has about 75GB of data in it, and it is almost entirely write
traffic. It's essentially a log aggregator. I believe it was doing a
pg_dump backup at the time of the crash. It has hot_standby_feedback on to
allow that process to complete.

Any insights into this, or advice on figuring out the root of it would be
appreciated. So far all the things I've found like this are bugs that
should be fixed in this version, or the internet equivalent of a shrug.

Thanks!

QH