Re: [GENERAL] window function ordering not working as expected

2015-02-17 Thread Lonni J Friedman
On Tue, Feb 17, 2015 at 4:18 PM, Tom Lane  wrote:
> Lonni J Friedman  writes:
>> I'm interested in seeing:
>> * the date for the most recent result
>> * test name (identifier)
>> * most recent result (decimal value)
>> * the worst (lowest decimal value) test result from the past 21 days
>> * the date which corresponds with the worst test result from the past 21 days
>> * the 2nd worst (2nd lowest decimal value) test result
>> ...
>> The problem that I'm seeing is in the prv_score column. It should show
>> a value of 0.6, which corresponds with 2015-02-13, however instead it's
>> returning 0.7. I thought by ordering by metrics->>'PT TWBR' I'd always
>> be sorting by the scores, and as a result, the lead(metrics->>'PT
>> TWBR', 1) would give me the next greatest value of the score. Thus my
>> confusion as to why ORDER BY metrics->>'PT TWBR' isn't working as
>> expected.
>
> lead() and lag() retrieve values from rows that are N away from the
> current row in the specified ordering.  That isn't what you want here
> AFAICS.
>
> I think the worst test result would be obtained with
> nth_value(metrics->>'PT TWBR', 1)
> which is equivalent to what you used,
> first_value(metrics->>'PT TWBR')
> while the 2nd worst result would be obtained with
> nth_value(metrics->>'PT TWBR', 2)
>
> However, "worst" and "2nd worst" with this implementation would mean
> "worst and 2nd worst within the partition", which isn't the stated
> goal either, at least not with the partition definition you're using.
>
> What you really want for the "worst in last 21 days" is something like
>
> min(metrics->>'PT TWBR') OVER (
>PARTITION BY ... that same mess you used ...
>ORDER BY tstamp
>RANGE BETWEEN '21 days'::interval PRECEDING AND CURRENT ROW)
>
> However Postgres doesn't implement RANGE x PRECEDING yet.  You could
> get "worst in last 21 observations" easily:
>
> min(metrics->>'PT TWBR') OVER (
>PARTITION BY ... that mess ...
>ORDER BY tstamp
>ROWS BETWEEN 20 PRECEDING AND CURRENT ROW)
>
> and maybe that's close enough.
>
> I do not know an easy way to get "second worst" :-(.  You could build a
> user-defined aggregate to produce "second smallest value among the inputs"
> and then apply it in the same way as I used min() here.

Thanks Tom, I much appreciate the fast reply.  I'll chew this over and
see if I have any other questions.
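
For reference, the "second smallest value among the inputs" aggregate
Tom suggests might be sketched like this (untested; the second_min*
names and the numeric cast are my assumptions, not anything from the
thread):

```sql
-- State is an array holding the two smallest values seen so far.
CREATE OR REPLACE FUNCTION second_min_sfunc(state numeric[], val numeric)
RETURNS numeric[] LANGUAGE sql IMMUTABLE STRICT AS $$
  SELECT (array_agg(v ORDER BY v))[1:2]
  FROM unnest(state || val) AS t(v)
$$;

CREATE OR REPLACE FUNCTION second_min_final(state numeric[])
RETURNS numeric LANGUAGE sql IMMUTABLE STRICT AS $$
  SELECT state[2]   -- NULL if fewer than two inputs were seen
$$;

CREATE AGGREGATE second_min(numeric) (
  SFUNC     = second_min_sfunc,
  STYPE     = numeric[],
  INITCOND  = '{}',
  FINALFUNC = second_min_final
);
```

It would then be applied the same way as min() in Tom's example:

```sql
second_min((metrics->>'PT TWBR')::numeric) OVER (
    PARTITION BY ...  -- that same mess ...
    ORDER BY tstamp
    ROWS BETWEEN 20 PRECEDING AND CURRENT ROW)
```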


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] window function ordering not working as expected

2015-02-17 Thread Lonni J Friedman
Greetings,
I have a postgresql-9.3.x database with a table with a variety of date
stamped test results, some of which are stored in json format
(natively in the database). I'm attempting to use some window
functions to pull out specific data from the test results over a
time window, but part of the results are not making sense. Some tests
run every day, others less frequently. For each unique test's results,
I'm interested in seeing:

* the date for the most recent result
* test name (identifier)
* most recent result (decimal value)
* the worst (lowest decimal value) test result from the past 21 days
* the date which corresponds with the worst test result from the past 21 days
* the 2nd worst (2nd lowest decimal value) test result

Here's a sample of the data and resulting score for one test (tname)
from the past few weeks:

  tstamp    |      tname      | score
------------+-----------------+-------
2015-02-17 | dfw001.ix-cr-02 | 0.7
2015-02-15 | dfw001.ix-cr-02 | 0.6
2015-02-14 | dfw001.ix-cr-02 | 0.6
2015-02-14 | dfw001.ix-cr-02 | 0.7
2015-02-13 | dfw001.ix-cr-02 | 0.6
2015-02-12 | dfw001.ix-cr-02 | 0.7
2015-02-11 | dfw001.ix-cr-02 | 0.7
2015-02-10 | dfw001.ix-cr-02 | 0.7
2015-02-09 | dfw001.ix-cr-02 | 0.7
2015-02-08 | dfw001.ix-cr-02 | 0.7
2015-02-08 | dfw001.ix-cr-02 | 0.5
2015-02-07 | dfw001.ix-cr-02 | 0.7
2015-02-07 | dfw001.ix-cr-02 | 0.5
2015-02-06 | dfw001.ix-cr-02 | 0.7
2015-02-05 | dfw001.ix-cr-02 | 0.7
2015-02-04 | dfw001.ix-cr-02 | 0.7
2015-01-30 | dfw001.ix-cr-02 | 0.7

Here's the SQL query that I'm running:

SELECT * FROM
(SELECT tstamp,
concat_ws('/',attrs->>'RCluster ID',
regexp_replace(replace(replace(attrs->>'ASN HTML','',''),'http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(&d=5d''
target=''_blank''>)','')) AS tname ,
metrics->>'PT TWBR' AS score,
first_value(metrics->>'PT TWBR') OVER
(PARTITION BY concat_ws('/',attrs->>'Route Cluster
ID', regexp_replace(replace(replace(attrs->>'ASN HTML','',''),'http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(&d=5d''
target=''_blank''>)',''))
ORDER BY metrics->>'PT TWBR') AS worst_score,
first_value(tstamp) OVER
(PARTITION BY concat_ws('/',attrs->>'Route Cluster
ID', regexp_replace(replace(replace(attrs->>'ASN HTML','',''),'http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(&d=5d''
target=''_blank''>)',''))
ORDER BY metrics->>'PT TWBR') AS worst_date,
lead(metrics->>'PT TWBR', 1) OVER
(PARTITION BY concat_ws('/',attrs->>'Route Cluster
ID', regexp_replace(replace(replace(attrs->>'ASN HTML','',''),'http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(&d=5d''
target=''_blank''>)',''))
ORDER BY metrics->>'PT TWBR') AS prv_score
FROM btworks
WHERE
age(now(),tstamp) < '21 days'
ORDER BY tstamp DESC, rank
) AS stuff
WHERE
tstamp = '2015-02-17';

Here's the data from the above query as it pertains to the data
(tname='dfw001.ix-cr-02') set that I posted above:

   tstamp   |      tname      | score | worst_score | worst_date | prv_score
------------+-----------------+-------+-------------+------------+-----------
 2015-02-17 | dfw001.ix-cr-02 | 0.7   | 0.5         | 2015-02-08 | 0.7

The problem that I'm seeing is in the prv_score column. It should show
a value of 0.6, which corresponds with 2015-02-13, however instead it's
returning 0.7. I thought by ordering by metrics->>'PT TWBR' I'd always
be sorting by the scores, and as a result, the lead(metrics->>'PT
TWBR', 1) would give me the next greatest value of the score. Thus my
confusion as to why ORDER BY metrics->>'PT TWBR' isn't working as
expected.


thanks in advance for any pointers.




[GENERAL] pg_basebackup: ERROR: could not find any WAL files (9.3)

2013-09-26 Thread Lonni J Friedman
Greetings,
I've recently pushed a new postgres-9.3 (Linux-x86_64/RHEL6) cluster
into production, with one master, and two hot standby streaming
replication slaves.  Everything seems to be working ok, however
roughly half of my pg_basebackup attempts are failing at the very end
with the error:

pg_basebackup: could not get transaction log end position from server:
ERROR:  could not find any WAL files

I should note that I'm running pg_basebackup on one of the two slaves,
and not the master.  However, I've got an older, separate 9.3 cluster
with the same setup, and pg_basebackup never fails there.

I thought that the WAL files in question were coming from the pg_xlog
subdirectory.  But I don't see any lack of files there on the server
running pg_basebackup.  They are being generated continuously (as
expected), before, during & after the pg_basebackup.  I scanned the
source ( http://doxygen.postgresql.org/basebackup_8c_source.html ),
and it seems to back up my understanding of the expected behavior:

306  /*
307  * There must be at least one xlog file in the pg_xlog directory,
308  * since we are doing backup-including-xlog.
309  */
310  if (nWalFiles < 1)
311  ereport(ERROR,
312  (errmsg("could not find any WAL files")));

However, what I see on the server conflicts with the error.
pg_basebackup was invoked on Thu Sep 26 01:00:01 PDT 2013, and failed
on Thu Sep 26 02:09:12 PDT 2013.  In the pg_xlog subdirectory, I see
lots of WAL files present, before, during & after pg_basebackup was
run:
-rw--- 1 postgres postgres 16777216 Sep 26 00:38 0001208A00E3
-rw--- 1 postgres postgres 16777216 Sep 26 00:43 0001208A00E4
-rw--- 1 postgres postgres 16777216 Sep 26 00:48 0001208A00E5
-rw--- 1 postgres postgres 16777216 Sep 26 00:53 0001208A00E6
-rw--- 1 postgres postgres 16777216 Sep 26 00:58 0001208A00E7
-rw--- 1 postgres postgres 16777216 Sep 26 01:03 0001208A00E8
-rw--- 1 postgres postgres 16777216 Sep 26 01:08 0001208A00E9
-rw--- 1 postgres postgres 16777216 Sep 26 01:14 0001208A00EA
-rw--- 1 postgres postgres 16777216 Sep 26 01:19 0001208A00EB
-rw--- 1 postgres postgres 16777216 Sep 26 01:24 0001208A00EC
-rw--- 1 postgres postgres 16777216 Sep 26 01:29 0001208A00ED
-rw--- 1 postgres postgres 16777216 Sep 26 01:34 0001208A00EE
-rw--- 1 postgres postgres 16777216 Sep 26 01:38 0001208A00EF
-rw--- 1 postgres postgres 16777216 Sep 26 01:43 0001208A00F0
-rw--- 1 postgres postgres 16777216 Sep 26 01:48 0001208A00F1
-rw--- 1 postgres postgres 16777216 Sep 26 01:53 0001208A00F2
-rw--- 1 postgres postgres 16777216 Sep 26 01:58 0001208A00F3
-rw--- 1 postgres postgres 16777216 Sep 26 02:03 0001208A00F4
-rw--- 1 postgres postgres 16777216 Sep 26 02:08 0001208A00F5
-rw--- 1 postgres postgres 16777216 Sep 26 02:14 0001208A00F6
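
If the failure turns out to be needed segments getting recycled before
the end-of-backup WAL fetch, one workaround might be streaming the WAL
during the backup instead (host, user, and path below are placeholders):

```shell
# -X stream opens a second connection and streams WAL while the base
# backup runs, rather than collecting the files at the end, so the
# required segments cannot disappear out from under the backup.
pg_basebackup -h standby-host -U replication_user \
    -D /path/to/backup -X stream -P
```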


Thanks in advance for any pointers.




Re: [GENERAL] partitioned table + postgres_FDW not working in 9.3

2013-09-26 Thread Lonni J Friedman
On Thu, Sep 26, 2013 at 8:52 AM, Tom Lane  wrote:
> Lonni J Friedman  writes:
>> Thanks for your reply.  This sounds like a relatively simple
>> workaround, so I'll give it a try.  Is the search_path of the remote
>> session that postgres_fdw forces considered to be intentional,
>> expected behavior, or is it a bug?
>
> It's intentional.
>
> Possibly more to the point, don't you think your trigger function is
> rather fragile if it assumes the caller has provided a particular
> search path setting?

To be honest, I don't have much experience with functions, and was
using the trigger function from the official documentation:
http://www.postgresql.org/docs/9.3/static/ddl-partitioning.html




Re: [GENERAL] partitioned table + postgres_FDW not working in 9.3

2013-09-26 Thread Lonni J Friedman
Hi Shigeru,
Thanks for your reply.  This sounds like a relatively simple
workaround, so I'll give it a try.  Is the search_path of the remote
session that postgres_fdw forces considered to be intentional,
expected behavior, or is it a bug?

thanks!

On Wed, Sep 25, 2013 at 7:13 PM, Shigeru Hanada
 wrote:
> Hi Lonni,
>
> 2013/9/25 Lonni J Friedman :
>> The problem that I'm experiencing is if I attempt to perform an INSERT
>> on the foreign nppsmoke table on cluster a, it fails claiming that the
>> table partition which should hold the data in the INSERT does not
>> exist:
>>
>> ERROR:  relation "nppsmoke_2013_09" does not exist
>> CONTEXT:  Remote SQL command: INSERT INTO public.nppsmoke(id,
>> date_created, last_update, build_type, current_status, info, cudacode,
>> gpu, subtest, os, osversion, arch, cl, dispvers, branch, pass, fail,
>> oldfail, newfail, failureslog, totdriver, ddcl, buildid, testdcmd,
>> pclog, filtercount, filterlog, error) VALUES ($1, $2, $3, $4, $5, $6,
>> $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20,
>> $21, $22, $23, $24, $25, $26, $27, $28)
>> PL/pgSQL function public.nppsmoke_insert_trigger() line 30 at SQL statement
>
> I could reproduce the problem.
>
>> If I run the same exact SQL INSERT on cluster b (not using the foreign
>> table), then it works.  So whatever is going wrong seems to be related
>> to the foreign table.  Initially I thought that perhaps the problem
>> was that I needed to create all of the partitions as foreign tables on
>> cluster a, but that doesn't help.
>>
>> Am I hitting some kind of foreign data wrapper limitation, or am I
>> doing something wrong?
>
> The cause of the problem is search_path setting of remote session.
> For some reasons, postgres_fdw forces the search_path on the remote
> side to be 'pg_catalog', so all objects used in the session
> established by postgres_fdw have to be schema-qualified.  Trigger
> function is executed in such context, so you need to qualify all
> objects in your trigger function with schema name, like
> 'public.nppsmoke_2013_09'.
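
Concretely, the fix Shigeru describes amounts to something like the
following sketch of the docs' partitioning trigger, with the target
schema-qualified (the partition key and bounds are assumed from the
error message above):

```sql
CREATE OR REPLACE FUNCTION public.nppsmoke_insert_trigger()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    IF NEW.date_created >= DATE '2013-09-01' AND
       NEW.date_created <  DATE '2013-10-01' THEN
        -- Schema-qualified, so the name still resolves when
        -- postgres_fdw pins the remote search_path to 'pg_catalog'.
        INSERT INTO public.nppsmoke_2013_09 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'nppsmoke_insert_trigger: date out of range';
    END IF;
    RETURN NULL;
END;
$$;
```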




Re: [GENERAL] postgres FDW doesn't support sequences?

2013-09-25 Thread Lonni J Friedman
On Wed, Sep 25, 2013 at 2:47 PM, Tom Lane  wrote:
> Lonni J Friedman  writes:
>> If I INSERT a new row into the local table (not the foreign table
>> version), without specifying the 'id' column explicitly, it
>> automatically is assigned the nextval in the sequence counter.
>> However, if I attempt to run the same INSERT using the foreign table,
>> it always fails complaining that null value in column "id" violates
>> not-null constraint.  It seems like the FDW is somehow ignoring the
>> existence of the sequence default value, and rewriting the SQL query
>> to explicitly attempt to insert a NULL value.
>
> Yeah, there was quite a bit of discussion about that back in February or
> so.  The short of it is that column default values that are defined on the
> foreign server are not respected by operations on a foreign table; rather,
> you have to attach a DEFAULT specification to the foreign table definition
> if you want inserts into the foreign table to use that default.
>
> The default expression is executed locally, too, which means that if you'd
> like it to read like "nextval('some_seq')" then some_seq has to be a local
> sequence, not one on the foreign server.

Is there some elegant mechanism for keeping the local & foreign
sequences in sync?

>
> I realize that this isn't ideal for serial-like columns, but honoring
> default expressions that would execute on the foreign server turned out
> to be a huge can of worms.  We might figure out how to fix that some day;
> but if we'd insisted on a solution now, there wouldn't be writable foreign
> tables at all in 9.3.

Understood.  Other than reading the code, is there somewhere that
these limitations are documented that I overlooked?
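
A minimal sketch of the local-default approach Tom describes, assuming a
bigint id column (the local sequence name is hypothetical, and nothing
keeps it in step with nppsmoke_id_seq1 on the remote side automatically):

```sql
-- On the local cluster, next to the foreign table definition.
CREATE SEQUENCE nppsmoke_id_seq_local;

ALTER FOREIGN TABLE nppsmoke
    ALTER COLUMN id SET DEFAULT nextval('nppsmoke_id_seq_local');

-- One crude way to avoid id collisions between the two clusters is to
-- give each side a disjoint range, e.g.:
-- SELECT setval('nppsmoke_id_seq_local', 1000000000);
```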




[GENERAL] postgres FDW doesn't support sequences?

2013-09-25 Thread Lonni J Friedman
I've got two 9.3 clusters, with a postgres foreign data wrapper (FDW)
setup to point from one cluster to the other.  One of the (foreign)
tables associated with the foreign server has a bigint sequence for
its primary key, defined as:

id | bigint  | not null default
nextval('nppsmoke_id_seq1'::regclass)


If I INSERT a new row into the local table (not the foreign table
version), without specifying the 'id' column explicitly, it
automatically is assigned the nextval in the sequence counter.
However, if I attempt to run the same INSERT using the foreign table,
it always fails complaining that null value in column "id" violates
not-null constraint.  It seems like the FDW is somehow ignoring the
existence of the sequence default value, and rewriting the SQL query
to explicitly attempt to insert a NULL value.  Here's the full query &
resulting error output:

nightly=# INSERT into nppsmoke
(date_created,last_update,build_type,current_status,info,cudacode,gpu,subtest,os,arch,cl,dispvers,branch,totdriver,ddcl,testdcmd,osversion)
VALUES 
((date_trunc('second',now())),(date_trunc('second',now())),'release','Building','npp-release-gpu-buildCUDA-2013-09-24-1380041350.log','2013-09-24.cuda-linux64-test42.release.gpu','380','CUDA
build','Linux','x86_64','16935289','CBS_cuda_a_2013-09-24_16935289','cuda_a','1','16935289','./npp-smoke.sh
--testtype release --amodel f --vc g --drvpath
/home/lfriedman/cuda-stuff/sw/dev/gpu_drv/cuda_a/drivers/gpgpu
--cudaroot /home/lfriedman/cuda-stuff/sw/gpgpu --totdriver t  --email
lfriedman','2.6.32-358.el6.x86_64');
ERROR:  null value in column "id" violates not-null constraint
DETAIL:  Failing row contains (null, 2013-09-25 08:00:46, 2013-09-25
08:00:46, release, Building,
npp-release-gpu-buildCUDA-2013-09-24-1380041350.log,
2013-09-24.cuda-linux64-test42.release.gpu, 380, CUDA build, Linux,
2.6.32-358.el6.x86_64, x86_64, 16935289,
CBS_cuda_a_2013-09-24_16935289, cuda_a, null, null, null, null, null,
t, 16935289, null, ./npp-smoke.sh --testtype release --amodel f --vc g
--drvpath /h..., null, null, null, null, g).
CONTEXT:  Remote SQL command: INSERT INTO public.nppsmoke(id,
date_created, last_update, build_type, current_status, info, cudacode,
gpu, subtest, os, osversion, arch, cl, dispvers, branch, pass, fail,
oldfail, newfail, failureslog, totdriver, ddcl, buildid, testdcmd,
pclog, filtercount, filterlog, error) VALUES ($1, $2, $3, $4, $5, $6,
$7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20,
$21, $22, $23, $24, $25, $26, $27, $28)

I tried to recreate the foreign table definition with a primary key,
and that failed:
ERROR:  constraints are not supported on foreign tables

Are sequences supported with the postgres FDW?  If not, is there any
workaround for inserting into a foreign table that doesn't require me
to explicitly specify a value for the primary key sequence column in
my INSERT statements?

thanks!




[GENERAL] partitioned table + postgres_FDW not working in 9.3

2013-09-24 Thread Lonni J Friedman
Greetings,
I've got two different 9.3 clusters setup, a & b (on Linux if that
matters).  On cluster b, I have a table (nppsmoke) that is partitioned
by date (month), which uses a function which is called by a trigger to
manage INSERTS (exactly as documented in the official documentation
for partitioning of tables).  I've setup a postgres foreign data
wrapper server on cluster a which points to cluster b, and then setup
a foreign table (nppsmoke) on cluster a which points to the actual
partitioned (nppsmoke) table on cluster b.  The partitions on cluster
b use the naming scheme "nppsmoke_$YYYY_$MM" (where YYYY = the 4-digit
year, and MM = the 2-digit month).  For example, the current month's partition is
named nppsmoke_2013_09 .

The problem that I'm experiencing is if I attempt to perform an INSERT
on the foreign nppsmoke table on cluster a, it fails claiming that the
table partition which should hold the data in the INSERT does not
exist:

ERROR:  relation "nppsmoke_2013_09" does not exist
CONTEXT:  Remote SQL command: INSERT INTO public.nppsmoke(id,
date_created, last_update, build_type, current_status, info, cudacode,
gpu, subtest, os, osversion, arch, cl, dispvers, branch, pass, fail,
oldfail, newfail, failureslog, totdriver, ddcl, buildid, testdcmd,
pclog, filtercount, filterlog, error) VALUES ($1, $2, $3, $4, $5, $6,
$7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20,
$21, $22, $23, $24, $25, $26, $27, $28)
PL/pgSQL function public.nppsmoke_insert_trigger() line 30 at SQL statement

If I run the same exact SQL INSERT on cluster b (not using the foreign
table), then it works.  So whatever is going wrong seems to be related
to the foreign table.  Initially I thought that perhaps the problem
was that I needed to create all of the partitions as foreign tables on
cluster a, but that doesn't help.

Am I hitting some kind of foreign data wrapper limitation, or am I
doing something wrong?

thanks




Re: [GENERAL] upgrade from 9.2.x to 9.3 causes significant performance degradation

2013-09-18 Thread Lonni J Friedman
On Wed, Sep 18, 2013 at 2:02 AM, Kevin Grittner  wrote:
> Lonni J Friedman  wrote:
>
>> top shows over 90% of the load is in sys space.  vmstat output
>> seems to suggest that its CPU bound (or bouncing back & forth):
>
> Can you run `perf top` during an episode and see what kernel
> functions are using all that CPU?

I take back what I said earlier.  While the master is currently back
to normal performance, the two hot standby slaves are still churning
something awful.

If I run 'perf top' on either slave, after a few seconds, these are
consistently the top three in the list:
 84.57%  [kernel]   [k] _spin_lock_irqsave
  6.21%  [unknown]  [.] 0x00659f60
  4.69%  [kernel]   [k] compaction_alloc

>
> This looks similar to cases I've seen of THP defrag going wild.
> Did the OS version or configuration change?  Did the PostgreSQL
> memory settings (like shared_buffers) change?

I think you're onto something here with respect to THP defrag going
wild.  I set /sys/kernel/mm/transparent_hugepage/defrag to 'never' and
immediately the load dropped on both slaves from over 5.00 to under
1.00.
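
For reference, the change amounts to the following (not persistent
across reboots, so it would need to go in an init script; the sysfs path
can differ between kernels, e.g. some RHEL6 kernels use
redhat_transparent_hugepage instead):

```shell
# Disable THP defragmentation at runtime; takes effect immediately.
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Verify -- the bracketed value is the active setting.
cat /sys/kernel/mm/transparent_hugepage/defrag
```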

So this raises the question, is this a kernel bug, or is there some
other solution to the problem?
Also, it seems weird that the problem didn't happen until I switched from
9.2 to 9.3.  Is it possible this is somehow related to the change from
using SysV shared memory to using Posix shared memory and mmap for
memory management?




Re: [GENERAL] upgrade from 9.2.x to 9.3 causes significant performance degradation

2013-09-18 Thread Lonni J Friedman
On Wed, Sep 18, 2013 at 2:02 AM, Kevin Grittner  wrote:
> Lonni J Friedman  wrote:
>
>> top shows over 90% of the load is in sys space.  vmstat output
>> seems to suggest that its CPU bound (or bouncing back & forth):
>
> Can you run `perf top` during an episode and see what kernel
> functions are using all that CPU?

Oddly, the problem went away on its own yesterday just after 4PM, and
performance has remained 'normal' since that time.  I changed
absolutely nothing.  If/when it returns, I'll certainly capture that
output.

>
> This looks similar to cases I've seen of THP defrag going wild.
> Did the OS version or configuration change?  Did the PostgreSQL
> memory settings (like shared_buffers) change?

Nothing changed other than the version of postgres.  I re-used the
same postgresql.conf that was in place when running 9.2.x.

Anyway, here are the current THP related settings on the server:
[root@cuda-db7 ~]# grep AnonHugePages /proc/meminfo
AnonHugePages:    548864 kB
[root@cuda-db7 ~]# egrep 'trans|thp' /proc/vmstat
nr_anon_transparent_hugepages 272
thp_fault_alloc 129173889
thp_fault_fallback 17462551
thp_collapse_alloc 148437
thp_collapse_alloc_failed 15143
thp_split 242




Re: [GENERAL] upgrade from 9.2.x to 9.3 causes significant performance degradation

2013-09-17 Thread Lonni J Friedman
On Tue, Sep 17, 2013 at 3:47 PM, Andres Freund  wrote:
> Hi,
>
> On 2013-09-17 09:19:29 -0700, Lonni J Friedman wrote:
>> I'm running a PostgreSQL 9.3.0 cluster (1 master with two streaming
>> replication hot standby slaves) on RHEL6-x86_64.  Yesterday I upgraded
>> from 9.2.4 to 9.3.0, and since the upgrade I'm seeing a significant
>> performance degradation.  PostgreSQL simply feels slower.  Nothing
>> other than the version of PostgreSQL changed yesterday.  I used
>> pg_upgrade to perform the upgrade, and ran the generated
>> analyze_new_cluster.sh immediately afterwards, which completed
>> successfully.
>
> Where did you get 9.3.0 from? Compiled it yourself? Any chance you
> compile with --enable-cassert or somesuch?

Directly from http://yum.postgresql.org.  So unless the RPMs on there
are built weird/wrong, I don't think that's the problem.




Re: [GENERAL] upgrade from 9.2.x to 9.3 causes significant performance degradation

2013-09-17 Thread Lonni J Friedman
Thanks for your reply.  Comments/answers inline below

On Tue, Sep 17, 2013 at 11:28 AM, Jeff Janes  wrote:
> On Tue, Sep 17, 2013 at 11:22 AM, Lonni J Friedman 
> wrote:
>>
>>
>> > c) What does logs say?
>>
>> The postgres server logs look perfectly normal, minus a non-trivial
>> slower run time for most queries.  There's nothing unusual in any of
>> the OS level logs (/var/log/messages, etc) or dmesg.
>
>
> Are you generally CPU limited or IO limited?

top shows over 90% of the load is in sys space.  vmstat output seems
to suggest that its CPU bound (or bouncing back & forth):

procs ---memory-- ---swap-- -io --system-- -cpu-
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa st
 1  0  17308 852016 141104 12707419200101800
6  4 90  0  0
 0  0  17308 872316 141104 12707420000 0   988  940  564
1  0 99  0  0
 0  0  17308 884288 141104 12707420800 0  1921 1202 2132
1  0 99  0  0
 0  0  17308 898728 141104 12707420800 0 0 1064  577
1  0 99  0  0
 2  0  17308 914920 141104 12707422400 044  820  427
1  0 99  0  0
 0  0  17308 926524 141104 12707427200 048 1173  585
1  0 99  0  0
108  1  17308 753648 141104 12707422400 0   236 9825 3901
12  5 83  0  0
50  0  17308 723156 141104 12707440000 0   144 43481 9105
20 79  1  0  0
45  0  17308 722860 141104 12707441600 0 8 32969 1998
1 97  2  0  0
47  0  17308 738996 141104 12707441600 0 0 34099 1739
1 99  0  0  0
101  0  17308 770220 141104 12707448000 032 38550 5998
 7 93  0  0  0
101  0  17308 775732 141104 12707451200 0   156 33889 5809
 4 96  0  0  0
99  0  17308 791232 141104 12707454400 0 0 32385 4981
0 100  0  0  0
96  0  17308 803156 141104 12707454400 024 32413 4824
0 100  0  0  0
87  0  17308 811624 141104 12707454400 0 0 32438 4470
0 100  0  0  0
83  0  17308 815500 141104 12707454400 0 0 32489 4159
0 100  0  0  0
80  0  17308 826572 141104 12707455200 033 32582 3948
0 100  0  0  0
73  0  17308 853264 141108 12707455200 052 32833 3840
0 100  0  0  0
73  0  17308 882240 141108 12707456000 0 4 32820 3594
0 100  0  0  0
72  0  17308 892256 141108 12707456000 0 0 32368 3516
0 100  0  0  0
###

iostat consistently shows %util under 1.00 which also suggests that
disk IO is not the bottleneck:
#
iostat -dx /dev/sdb 5
Linux 2.6.32-358.6.2.el6.x86_64 (cuda-db7) 09/17/2013 _x86_64_
   (32 CPU)

Device: rrqm/s   wrqm/s r/s w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sdb   0.02 0.216.91   31.33   651.60  1121.85
46.38 0.092.25   0.08   0.31

Device: rrqm/s   wrqm/s r/s w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sdb   0.00 0.000.008.00 0.0093.00
11.62 0.000.28   0.20   0.16

Device: rrqm/s   wrqm/s r/s w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sdb   0.00 0.000.00   11.00 0.00   125.40
11.40 0.000.16   0.16   0.18

Device: rrqm/s   wrqm/s r/s w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sdb   0.00 0.000.00  105.00 0.00  3380.40
32.19 0.292.76   0.03   0.34

Device: rrqm/s   wrqm/s r/s w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sdb   0.00 0.000.00   14.80 0.00  2430.60
164.23 0.000.12   0.09   0.14

Device: rrqm/s   wrqm/s r/s w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sdb   0.00 1.200.00   41.60 0.00  1819.40
43.74 0.020.45   0.05   0.20

Device: rrqm/s   wrqm/s r/s w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sdb   0.00 0.000.002.80 0.0032.00
11.43 0.000.00   0.00   0.00
#

mpstat also shows a virtually 0 iowait, with a ton of sys (CPU) time:
#
mpstat 2 10
Linux 2.6.32-358.6.2.el6.x86_64 (cuda-db7) 09/17/2013 _x86_64_
   (32 CPU)

12:53:19 PM  CPU%usr   %nice%sys %iowait%irq   %soft
%steal  %guest   %idle
12:53:21 PM  all7.360.00   92.580.000.000.03
0.000.000.03
12:53:23 PM  all6.350.00   90.430.000.000.03
0.000.003.19
12:53:25 PM  all3.130.00   68.200.000.000.02
0.000.00   28.66
12:53:27 PM  all6.070.00   68.460.000.000.03
0.000.00   25.44
12:53:29 PM  all5.830.00   94.140.000.000.03
0.000.000.00
12:53:31 PM  all5.750.00   94.140.000.000.11
0.000.00

Re: [GENERAL] upgrade from 9.2.x to 9.3 causes significant performance degradation

2013-09-17 Thread Lonni J Friedman
On Tue, Sep 17, 2013 at 9:54 AM, Eduardo Morras  wrote:
> On Tue, 17 Sep 2013 09:19:29 -0700
> Lonni J Friedman  wrote:
>
>> Greetings,
>> I'm running a PostgreSQL 9.3.0 cluster (1 master with two streaming
>> replication hot standby slaves) on RHEL6-x86_64.  Yesterday I upgraded
>> from 9.2.4 to 9.3.0, and since the upgrade I'm seeing a significant
>> performance degradation.  PostgreSQL simply feels slower.  Nothing
>> other than the version of PostgreSQL changed yesterday.  I used
>> pg_upgrade to perform the upgrade, and ran the generated
>> analyze_new_cluster.sh immediately afterwards, which completed
>> successfully.
>>
>> Prior to the upgrade, I'd generally expect a load average of less than
>> 2.00 on the master, and less than 1.00 on each of the slaves.  Since
>> the upgrade, the load average on the master has been in double digits
>> (hitting 100.00 for a few minutes), and the slaves are consistently
>> above 5.00.
>>
>> There are a few things that are jumping out at me as behaving
>> differently since the upgrade.  vmstat processes waiting for runtime
>> counts have increased dramatically.  Prior to the upgrade the process
>> count would be consistently less than 10, however since upgrading it
>> hovers between 40 & 60 at all times.  /proc/interrupts "Local timer
>> interrupts" has increased dramatically as well.  It used to hover
>> around 6000 and is now over 20k much of the time.  However, I'm
>> starting to suspect that they are both symptoms of the problem rather
>> than the cause.
>>
>> At this point, I'm looking for guidance on how to debug this problem
>> more effectively.
>
> Don't know what happens but:
>
> a) Does analyze_new_cluster.sh include a reindex? If not, indexs are useless 
> because analyze statistics says so.

No, it doesn't include a reindex.  It merely invokes "vacuumdb --all
--analyze-only" several times, with options like "-c
default_statistics_target=1 -c vacuum_cost_delay=0" passed through PGOPTIONS.

According to the documentation for pg_upgrade, post-upgrade scripts to
rebuild tables and indexes will be generated automatically.  Nothing
was generated for this purpose, at least not in any obvious place.
The analyze_new_cluster.sh script is the only one that was
automatically generated as far as I can tell.

> b) Did you configure postgresql.conf on 9.3.0 for your server/load? Perhaps 
> it has default install values.

Yes, I'm using the same postgresql.conf as I was using when running
9.2.4.  Its definitely not running with default install values.

> c) What does logs say?

The postgres server logs look perfectly normal, minus a non-trivial
slower run time for most queries.  There's nothing unusual in any of
the OS level logs (/var/log/messages, etc) or dmesg.



Do you have any other suggestions?




[GENERAL] upgrade from 9.2.x to 9.3 causes significant performance degradation

2013-09-17 Thread Lonni J Friedman
Greetings,
I'm running a PostgreSQL 9.3.0 cluster (1 master with two streaming
replication hot standby slaves) on RHEL6-x86_64.  Yesterday I upgraded
from 9.2.4 to 9.3.0, and since the upgrade I'm seeing a significant
performance degradation.  PostgreSQL simply feels slower.  Nothing
other than the version of PostgreSQL changed yesterday.  I used
pg_upgrade to perform the upgrade, and ran the generated
analyze_new_cluster.sh immediately afterwards, which completed
successfully.

Prior to the upgrade, I'd generally expect a load average of less than
2.00 on the master, and less than 1.00 on each of the slaves.  Since
the upgrade, the load average on the master has been in double digits
(hitting 100.00 for a few minutes), and the slaves are consistently
above 5.00.

There are a few things that are jumping out at me as behaving
differently since the upgrade.  vmstat processes waiting for runtime
counts have increased dramatically.  Prior to the upgrade the process
count would be consistently less than 10, however since upgrading it
hovers between 40 & 60 at all times.  /proc/interrupts "Local timer
interrupts" has increased dramatically as well.  It used to hover
around 6000 and is now over 20k much of the time.  However, I'm
starting to suspect that they are both symptoms of the problem rather
than the cause.

At this point, I'm looking for guidance on how to debug this problem
more effectively.

thanks




Re: [GENERAL] WAL Replication Working but Not Working

2013-08-21 Thread Lonni J Friedman
The first thing to do is look at your server logs around the time when
it stopped working.

On Wed, Aug 21, 2013 at 7:08 AM, Joseph Marlin  wrote:
> We're having an issue with our warm standby server. About 9:30 last night, it 
> stopped applying changes it received in WAL files that are shipped over to it 
> as they are created. It is still reading WAL files as they delivered, as the 
> startup_log.txt shows, but the changes in the primary database aren't 
> actually being made to the standby, and haven't been since last night.
>
> Is there any way we can figure out what is going on here? We'd like to 
> recover somehow without having to restore from a base backup, and we'd like 
> to figure out what is happening so we can prevent it in the future.
>
> Thanks!




Re: [GENERAL] Streaming Replication Randomly Locking Up

2013-08-15 Thread Lonni J Friedman
I'd suggest enhancing your logging to include time/datestamps for
every entry, and also the client hostname.  That will help to rule
in/out those 'unexpected EOF' errors.
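For reference, adding both to postgresql.conf looks something like this (a sketch; %t is a timestamp, %p the backend PID, and %h the connecting host):

```
log_line_prefix = '%t [%p] %h '
```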

On Thu, Aug 15, 2013 at 12:22 PM, Andrew Berman  wrote:
> The only thing I see that is a possibility for the issue is in the slave
> log:
>
> LOG:  unexpected EOF on client connection
> LOG:  could not receive data from client: Connection reset by peer
>
> I don't know if that's related or not as it could just be somebody running a
> query.  The log file does seem to be riddled with these but the replication
> failures don't happen constantly.
>
> As far as I know I'm not swallowing any errors.  The logging is all set as
> the default:
>
> log_destination = 'stderr'
> logging_collector = on
> #client_min_messages = notice
> #log_min_messages = warning
> #log_min_error_statement = error
> #log_min_duration_statement = -1
> #log_checkpoints = off
> #log_connections = off
> #log_disconnections = off
> #log_error_verbosity = default
>
> I'm going to have a look at the NICs to make sure there's no issue there.
>
> Thanks again for your help!
>
>
> On Thu, Aug 15, 2013 at 11:51 AM, Lonni J Friedman 
> wrote:
>>
>> Are you certain that there are no relevant errors in the database logs
>> (on both master & slave)?  Also, are you sure that you didn't
>> misconfigure logging such that errors wouldn't appear?
>>
>> On Thu, Aug 15, 2013 at 11:45 AM, Andrew Berman  wrote:
>> > Hi Lonni,
>> >
>> > Yes, I am using PG 9.1.9.
>> > Yes, 1 slave syncing from the master
>> > CentOS 6.4
>> > I don't see any network or hardware issues (e.g. NIC) but will look more
>> > into this.  They are communicating on a private network and switch.
>> >
>> > I forgot to mention that after I restart the slave, everything syncs
>> > right
>> > back up and all is working again so if it is a network issue, the
>> > replication is just stopping after some hiccup instead of retrying and
>> > resuming when things are back up.
>> >
>> > Thanks!
>> >
>> >
>> >
>> > On Thu, Aug 15, 2013 at 11:32 AM, Lonni J Friedman 
>> > wrote:
>> >>
>> >> I've never seen this happen.  Looks like you might be using 9.1?  Are
>> >> you up to date on all the 9.1.x releases?
>> >>
>> >> Do you have just 1 slave syncing from the master?
>> >> Which OS are you using?
>> >> Did you verify that there aren't any network problems between the
>> >> slave & master?
>> >> Or hardware problems (like the NIC dying, or dropping packets)?
>> >>
>> >>
>> >> On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman 
>> >> wrote:
>> >> > Hello,
>> >> >
>> >> > I'm having an issue where streaming replication just randomly stops
>> >> > working.
>> >> > I haven't been able to find anything in the logs which point to an
>> >> > issue,
>> >> > but the Postgres process shows a "waiting" status on the slave:
>> >> >
>> >> > postgres  5639  0.1 24.3 3428264 2970236 ? Ss   Aug14   1:54
>> >> > postgres:
>> >> > startup process   recovering 0001053D003F waiting
>> >> > postgres  5642  0.0 21.4 3428356 2613252 ? Ss   Aug14   0:30
>> >> > postgres:
>> >> > writer process
>> >> > postgres  5659  0.0  0.0 177524   788 ?Ss   Aug14   0:03
>> >> > postgres:
>> >> > stats collector process
>> >> > postgres  7159  1.2  0.1 3451360 18352 ?   Ss   Aug14  17:31
>> >> > postgres:
>> >> > wal receiver process   streaming 549/216B3730
>> >> >
>> >> > The replication works great for days, but randomly seems to lock up
>> >> > and
>> >> > replication halts.  I verified that the two databases were out of
>> >> > sync
>> >> > with
>> >> > a query on both of them.  Has anyone experienced this issue before?
>> >> >
>> >> > Here are some relevant config settings:
>> >> >
>> >> > Master:
>> >> >
>> >> > wal_level = hot_standby
>> >> > checkpoint_segments = 32
>> >> > checkpoint_completion_target = 0.9
>> >> > archive_mode = on
> >> > archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f'
> >> > max_wal_senders = 2
>> >> > wal_keep_segments = 32
>> >> >
>> >> > Slave:
>> >> >
>> >> > wal_level = hot_standby
>> >> > checkpoint_segments = 32
>> >> > #checkpoint_completion_target = 0.5
>> >> > hot_standby = on
>> >> > max_standby_archive_delay = -1
>> >> > max_standby_streaming_delay = -1
>> >> > #wal_receiver_status_interval = 10s
>> >> > #hot_standby_feedback = off
>> >> >
>> >> > Thank you for any help you can provide!
>> >> >
>> >> > Andrew
>> >> >
>
>



-- 
~
L. Friedman                             netll...@gmail.com
LlamaLand                               https://netllama.linux-sxs.org




Re: [GENERAL] Streaming Replication Randomly Locking Up

2013-08-15 Thread Lonni J Friedman
Are you certain that there are no relevant errors in the database logs
(on both master & slave)?  Also, are you sure that you didn't
misconfigure logging such that errors wouldn't appear?

On Thu, Aug 15, 2013 at 11:45 AM, Andrew Berman  wrote:
> Hi Lonni,
>
> Yes, I am using PG 9.1.9.
> Yes, 1 slave syncing from the master
> CentOS 6.4
> I don't see any network or hardware issues (e.g. NIC) but will look more
> into this.  They are communicating on a private network and switch.
>
> I forgot to mention that after I restart the slave, everything syncs right
> back up and all is working again so if it is a network issue, the
> replication is just stopping after some hiccup instead of retrying and
> resuming when things are back up.
>
> Thanks!
>
>
>
> On Thu, Aug 15, 2013 at 11:32 AM, Lonni J Friedman 
> wrote:
>>
>> I've never seen this happen.  Looks like you might be using 9.1?  Are
>> you up to date on all the 9.1.x releases?
>>
>> Do you have just 1 slave syncing from the master?
>> Which OS are you using?
>> Did you verify that there aren't any network problems between the
>> slave & master?
>> Or hardware problems (like the NIC dying, or dropping packets)?
>>
>>
>> On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman  wrote:
>> > Hello,
>> >
>> > I'm having an issue where streaming replication just randomly stops
>> > working.
>> > I haven't been able to find anything in the logs which point to an
>> > issue,
>> > but the Postgres process shows a "waiting" status on the slave:
>> >
>> > postgres  5639  0.1 24.3 3428264 2970236 ? Ss   Aug14   1:54
>> > postgres:
>> > startup process   recovering 0001053D003F waiting
>> > postgres  5642  0.0 21.4 3428356 2613252 ? Ss   Aug14   0:30
>> > postgres:
>> > writer process
>> > postgres  5659  0.0  0.0 177524   788 ?Ss   Aug14   0:03
>> > postgres:
>> > stats collector process
>> > postgres  7159  1.2  0.1 3451360 18352 ?   Ss   Aug14  17:31
>> > postgres:
>> > wal receiver process   streaming 549/216B3730
>> >
>> > The replication works great for days, but randomly seems to lock up and
>> > replication halts.  I verified that the two databases were out of sync
>> > with
>> > a query on both of them.  Has anyone experienced this issue before?
>> >
>> > Here are some relevant config settings:
>> >
>> > Master:
>> >
>> > wal_level = hot_standby
>> > checkpoint_segments = 32
>> > checkpoint_completion_target = 0.9
>> > archive_mode = on
> > archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f'
> > max_wal_senders = 2
>> > wal_keep_segments = 32
>> >
>> > Slave:
>> >
>> > wal_level = hot_standby
>> > checkpoint_segments = 32
>> > #checkpoint_completion_target = 0.5
>> > hot_standby = on
>> > max_standby_archive_delay = -1
>> > max_standby_streaming_delay = -1
>> > #wal_receiver_status_interval = 10s
>> > #hot_standby_feedback = off
>> >
>> > Thank you for any help you can provide!
>> >
>> > Andrew
>> >




Re: [GENERAL] Streaming Replication Randomly Locking Up

2013-08-15 Thread Lonni J Friedman
I've never seen this happen.  Looks like you might be using 9.1?  Are
you up to date on all the 9.1.x releases?

Do you have just 1 slave syncing from the master?
Which OS are you using?
Did you verify that there aren't any network problems between the
slave & master?
Or hardware problems (like the NIC dying, or dropping packets)?


On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman  wrote:
> Hello,
>
> I'm having an issue where streaming replication just randomly stops working.
> I haven't been able to find anything in the logs which point to an issue,
> but the Postgres process shows a "waiting" status on the slave:
>
> postgres  5639  0.1 24.3 3428264 2970236 ? Ss   Aug14   1:54 postgres:
> startup process   recovering 0001053D003F waiting
> postgres  5642  0.0 21.4 3428356 2613252 ? Ss   Aug14   0:30 postgres:
> writer process
> postgres  5659  0.0  0.0 177524   788 ?Ss   Aug14   0:03 postgres:
> stats collector process
> postgres  7159  1.2  0.1 3451360 18352 ?   Ss   Aug14  17:31 postgres:
> wal receiver process   streaming 549/216B3730
>
> The replication works great for days, but randomly seems to lock up and
> replication halts.  I verified that the two databases were out of sync with
> a query on both of them.  Has anyone experienced this issue before?
>
> Here are some relevant config settings:
>
> Master:
>
> wal_level = hot_standby
> checkpoint_segments = 32
> checkpoint_completion_target = 0.9
> archive_mode = on
> archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f'
> max_wal_senders = 2
> wal_keep_segments = 32
>
> Slave:
>
> wal_level = hot_standby
> checkpoint_segments = 32
> #checkpoint_completion_target = 0.5
> hot_standby = on
> max_standby_archive_delay = -1
> max_standby_streaming_delay = -1
> #wal_receiver_status_interval = 10s
> #hot_standby_feedback = off
>
> Thank you for any help you can provide!
>
> Andrew
>







Re: [GENERAL] postgres FDW cost estimation options unrecognized in 9.3-beta1

2013-07-26 Thread Lonni J Friedman
On Fri, Jul 26, 2013 at 3:28 PM, Tom Lane  wrote:
> Lonni J Friedman  writes:
>> nightly=# ALTER SERVER cuda_db10 OPTIONS (SET use_remote_estimate 'true') ;
>> ERROR:  option "use_remote_estimate" not found
>
>> Am I doing something wrong, or is this a bug?
>
> [ experiments... ]  You need to say ADD, not SET, to add a new option to
> the list.  SET might more appropriately be spelled REPLACE, because it
> requires that the object already have a defined value for the option,
> which will be replaced.
>
> Our documentation appears not to disclose this fine point, but a look
> at the SQL-MED standard says it's operating per spec.  The standard also
> says that ADD is an error if the option is already defined, which is a
> bit more defensible, but still not exactly what I'd call user-friendly.
> And the error we issue for that case is pretty misleading too:
>
> regression=# ALTER SERVER cuda_db10 OPTIONS (use_remote_estimate 'true') ;
> ALTER SERVER
> regression=# ALTER SERVER cuda_db10 OPTIONS (use_remote_estimate 'false') ;
> ERROR:  option "use_remote_estimate" provided more than once
>
> I think we could do with both more documentation, and better error
> messages for these cases.  In the SET-where-you-should-use-ADD case,
> perhaps
>
> ERROR:  option "use_remote_estimate" has not been set
> HINT: Use ADD not SET to define an option that wasn't already set.
>
> In the ADD-where-you-should-use-SET case, perhaps
>
> ERROR:  option "use_remote_estimate" is already set
> HINT: Use SET not ADD to change an option's value.
>
> The "provided more than once" wording would be appropriate if the same
> option is specified more than once in the command text, but I'm not sure
> that it's worth the trouble to detect that case.
>
> Thoughts, better wordings?

Thanks Tom, I've confirmed that using ADD was the solution.  I think
your suggested updated ERROR & HINT text is an excellent improvement.
It definitely would have given me the clue I was missing earlier.
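For the archives, the working sequence per the above looks like this (run against the same server as in my earlier message):

```sql
-- Defining the option the first time requires ADD (the bare form means ADD):
ALTER SERVER cuda_db10 OPTIONS (ADD use_remote_estimate 'true');
-- Changing an option that is already defined requires SET:
ALTER SERVER cuda_db10 OPTIONS (SET use_remote_estimate 'false');
-- And DROP removes it again:
ALTER SERVER cuda_db10 OPTIONS (DROP use_remote_estimate);
```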




[GENERAL] postgres FDW cost estimation options unrecognized in 9.3-beta1

2013-07-26 Thread Lonni J Friedman
Greetings,
I have a postgresql-9.3-beta1 cluster setup (from the
yum.postgresql.org RPMs), where I'm experimenting with the postgres
FDW extension.  The documentation (
http://www.postgresql.org/docs/9.3/static/postgres-fdw.html )
references three Cost Estimation Options which can be set for a
foreign table or a foreign server.  However when I attempt to set
them, I always get an error that the option is not found:
###
nightly=# show SERVER_VERSION ;
 server_version
----------------
 9.3beta1
(1 row)
nightly=# \des+
                                                      List of foreign servers
   Name    |   Owner   | Foreign-data wrapper | Access privileges | Type | Version |                    FDW Options                    | Description
-----------+-----------+----------------------+-------------------+------+---------+---------------------------------------------------+-------------
 cuda_db10 | lfriedman | postgres_fdw         |                   |      |         | (host 'cuda-db10', dbname 'nightly', port '5432') |
(1 row)

nightly=# ALTER SERVER cuda_db10 OPTIONS (SET use_remote_estimate 'true') ;
ERROR:  option "use_remote_estimate" not found
###

Am I doing something wrong, or is this a bug?

thanks




Re: [GENERAL] upgrading from 9.3-beta1 to 9.3-beta2 requires dump & reload?

2013-07-24 Thread Lonni J Friedman
On Wed, Jul 24, 2013 at 2:05 PM, Tom Lane  wrote:
> Alvaro Herrera  writes:
>> Lonni J Friedman escribió:
>>> I'm using the RPMs from yum.postgresql.org on RHEL6.  Is this
>>> expected, intentional behavior?  Do I really need to dump & reload to
>>> upgrade between beta releases of 9.3, or is there some more efficient
>>> way?
>
>> We try to avoid forcing initdb between beta versions, but it's not
>> guaranteed.  You should be able to use pg_upgrade, also.
>
> Unfortunately, the RPMs probably won't be very helpful for using
> pg_upgrade, since there's no convenient way to get beta1 and beta2
> postmaster executables installed at the same time (unless Devrim
> foresaw this case and packaged things differently than I did for
> Red Hat ;-)).

Sounds like I'm out of luck.  Thanks anyway.




[GENERAL] upgrading from 9.3-beta1 to 9.3-beta2 requires dump & reload?

2013-07-24 Thread Lonni J Friedman
Greetings,
I just got around to upgrading from 9.3-beta1 to 9.3-beta2, and was
surprised to see that the server was refusing to start. In the log,
I'm seeing:

2013-07-24 13:41:47 PDT [7083]: [1-1] db=,user= FATAL:  database files
are incompatible with server
2013-07-24 13:41:47 PDT [7083]: [2-1] db=,user= DETAIL:  The database
cluster was initialized with CATALOG_VERSION_NO 201305061, but the
server was compiled with CATALOG_VERSION_NO 201306121.
2013-07-24 13:41:47 PDT [7083]: [3-1] db=,user= HINT:  It looks like
you need to initdb.


I'm using the RPMs from yum.postgresql.org on RHEL6.  Is this
expected, intentional behavior?  Do I really need to dump & reload to
upgrade between beta releases of 9.3, or is there some more efficient
way?

thanks




Re: [GENERAL] Standby stopped working after PANIC: WAL contains references to invalid pages

2013-06-22 Thread Lonni J Friedman
Assuming that you still have $PGDATA from the broken instance (such
that you can reproduce the crash again), there might be a way to debug
it further.  I'd guess that something like bad RAM or storage could
cause an index to get corrupted in this fashion, but the fact that
you're using AWS makes that less likely.  Someone far more
knowledgeable than I will need to provide guidance on how to debug
this though.
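If it does recur, one quick first step is mapping the filenode from the error message back to a relation name — a sketch using the numbers from your log (connect to the database with OID 16448 first):

```sql
-- 39154429 is the filenode from the CONTEXT lines in the log above
SELECT relname, relkind FROM pg_class WHERE relfilenode = 39154429;
```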

On Sat, Jun 22, 2013 at 4:17 PM, Dan Kogan  wrote:
> Re-seeding the standby with a full base backup does seem to make the error go 
> away.
> The standby started, caught up and has been working for about 2 hours.
>
> The file in the error message was an index.  We rebuilt it just in case.
> Is there any way to debug the issue at this point?
>
>
>
> -----Original Message-----
> From: Lonni J Friedman [mailto:netll...@gmail.com]
> Sent: Saturday, June 22, 2013 4:11 PM
> To: Dan Kogan
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Standby stopped working after PANIC: WAL contains 
> references to invalid pages
>
> Looks like some kind of data corruption.  Question is whether it came from 
> the master, or was created by the standby.  If you re-seed the standby with a 
> full (base) backup, does the problem go away?
>
> On Sat, Jun 22, 2013 at 12:43 PM, Dan Kogan  wrote:
>> Hello,
>>
>>
>>
>> Today our standby instance stopped working with this error in the log:
>>
>>
>>
>> 2013-06-22 16:27:32 UTC [8367]: [247-1] [] WARNING:  page 158130 of
>> relation
>> pg_tblspc/16447/PG_9.2_201204301/16448/39154429 is uninitialized
>>
>> 2013-06-22 16:27:32 UTC [8367]: [248-1] [] CONTEXT:  xlog redo vacuum:
>> rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
>>
>> 2013-06-22 16:27:32 UTC [8367]: [249-1] [] PANIC:  WAL contains
>> references to invalid pages
>>
>> 2013-06-22 16:27:32 UTC [8367]: [250-1] [] CONTEXT:  xlog redo vacuum:
>> rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
>>
>> 2013-06-22 16:27:32 UTC [8366]: [3-1] [] LOG:  startup process (PID
>> 8367) was terminated by signal 6: Aborted
>>
>> 2013-06-22 16:27:32 UTC [8366]: [4-1] [] LOG:  terminating any other
>> active server processes
>>
>>
>>
>> After re-start the same exact error occurred.
>>
>>
>>
>> We thought that maybe we hit this bug -
>> http://postgresql.1045698.n5.nabble.com/Completely-broken-replica-after-PANIC-WAL-contains-references-to-invalid-pages-td5750072.html.
>>
>> However, there is nothing in our log about sub-transactions, so it
>> didn't seem the same to us.
>>
>>
>>
>> Any advice on how to further debug this so we can avoid this in the
>> future is appreciated.
>>
>>
>>
>> Environment:
>>
>>
>>
>> AWS, High I/O instance (hi1.4xlarge), 60GB RAM
>>
>>
>>
>> Software and settings:
>>
>>
>>
>> PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc
>> (Ubuntu/Linaro
>> 4.5.2-8ubuntu4) 4.5.2, 64-bit
>>
>>
>>
>> archive_command  rsync -a %p
>> slave:/var/lib/postgresql/replication_load/%f
>>
>> archive_mode   on
>>
>> autovacuum_freeze_max_age 10
>>
>> autovacuum_max_workers6
>>
>> checkpoint_completion_target 0.9
>>
>> checkpoint_segments   128
>>
>> checkpoint_timeout   30min
>>
>> default_text_search_config   pg_catalog.english
>>
>> hot_standby  on
>>
>> lc_messages  en_US.UTF-8
>>




Re: [GENERAL] Standby stopped working after PANIC: WAL contains references to invalid pages

2013-06-22 Thread Lonni J Friedman
Looks like some kind of data corruption.  Question is whether it came
from the master, or was created by the standby.  If you re-seed the
standby with a full (base) backup, does the problem go away?

On Sat, Jun 22, 2013 at 12:43 PM, Dan Kogan  wrote:
> Hello,
>
>
>
> Today our standby instance stopped working with this error in the log:
>
>
>
> 2013-06-22 16:27:32 UTC [8367]: [247-1] [] WARNING:  page 158130 of relation
> pg_tblspc/16447/PG_9.2_201204301/16448/39154429 is uninitialized
>
> 2013-06-22 16:27:32 UTC [8367]: [248-1] [] CONTEXT:  xlog redo vacuum: rel
> 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
>
> 2013-06-22 16:27:32 UTC [8367]: [249-1] [] PANIC:  WAL contains references
> to invalid pages
>
> 2013-06-22 16:27:32 UTC [8367]: [250-1] [] CONTEXT:  xlog redo vacuum: rel
> 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
>
> 2013-06-22 16:27:32 UTC [8366]: [3-1] [] LOG:  startup process (PID 8367)
> was terminated by signal 6: Aborted
>
> 2013-06-22 16:27:32 UTC [8366]: [4-1] [] LOG:  terminating any other active
> server processes
>
>
>
> After re-start the same exact error occurred.
>
>
>
> We thought that maybe we hit this bug -
> http://postgresql.1045698.n5.nabble.com/Completely-broken-replica-after-PANIC-WAL-contains-references-to-invalid-pages-td5750072.html.
>
> However, there is nothing in our log about sub-transactions, so it didn't
> seem the same to us.
>
>
>
> Any advice on how to further debug this so we can avoid this in the future
> is appreciated.
>
>
>
> Environment:
>
>
>
> AWS, High I/O instance (hi1.4xlarge), 60GB RAM
>
>
>
> Software and settings:
>
>
>
> PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro
> 4.5.2-8ubuntu4) 4.5.2, 64-bit
>
>
>
> archive_command  rsync -a %p
> slave:/var/lib/postgresql/replication_load/%f
>
> archive_mode   on
>
> autovacuum_freeze_max_age 10
>
> autovacuum_max_workers6
>
> checkpoint_completion_target 0.9
>
> checkpoint_segments   128
>
> checkpoint_timeout   30min
>
> default_text_search_config   pg_catalog.english
>
> hot_standby  on
>
> lc_messages  en_US.UTF-8
>
> lc_monetary  en_US.UTF-8
>
> lc_numeric  en_US.UTF-8
>
> lc_time en_US.UTF-8
>
> listen_addresses  *
>
> log_checkpoints   on
>
> log_destinationstderr
>
> log_line_prefix %t [%p]: [%l-1] [%h]
>
> log_min_duration_statement-1
>
> log_min_error_statement   error
>
> log_min_messages error
>
> log_timezoneUTC
>
> maintenance_work_mem   1GB
>
> max_connections1200
>
> max_standby_streaming_delay90s
>
> max_wal_senders   5
>
> port   5432
>
> random_page_cost2
>
> seq_page_cost 1
>
> shared_buffers4GB
>
> ssl   off
>
> ssl_cert_file   /etc/ssl/certs/ssl-cert-snakeoil.pem
>
> ssl_key_file/etc/ssl/private/ssl-cert-snakeoil.key
>
> synchronous_commitoff
>
> TimeZoneUTC
>
> wal_keep_segments 128
>
> wal_level hot_standby
>
> work_mem8MB
>
>
>
> root@ip-10-148-131-236:~# /usr/local/pgsql/bin/pg_controldata
> /usr/local/pgsql/data
>
> pg_control version number:922
>
> Catalog version number:   201204301
>
> Database system identifier:   5838668587531239413
>
> Database cluster state:   in archive recovery
>
> pg_control last modified: Sat 22 Jun 2013 06:13:07 PM UTC
>
> Latest checkpoint location:   2250/18CA0790
>
> Prior checkpoint location:2250/18CA0790
>
> Latest checkpoint's REDO location:224F/E127B078
>
> Latest checkpoint's TimeLineID:   2
>
> Latest checkpoint's full_page_writes: on
>
> Latest checkpoint's NextXID:  1/2018629527
>
> Latest checkpoint's NextOID:  43086248
>
> Latest checkpoint's NextMultiXactId:  7088726
>
> Latest checkpoint's NextMultiOffset:  20617234
>
> Latest checkpoint's oldestXID:1690316999
>
> Latest checkpoint's oldestXID's DB:   16448
>
> Latest checkpoint's oldestActiveXID:  2018629527
>
> Time of latest checkpoint:Sat 22 Jun 2013 03:24:05 PM UTC
>
> Minimum recovery ending location: 2251/5EA631F0
>
> Backup start location:0/0
>
> Backup end location:  0/0
>
> End-of-backup record required:no
>
> Current wal_level setting:hot_standby
>
> Current max_connections setting:  1200
>
> Current max_prepared_xacts setting:   0
>
> Current max_locks_per_xact setting:   64
>
> Maximum data alignment:   8
>
> Database block size:  8192
>
> Blocks per segment of large relation: 131072
>
> WAL block size:   8192
>
> Bytes per WAL segment:16777216
>
> Maximum length of identifiers:64
>
> Maximum columns in an index:  32
>
> Maximum size of a TOAST chunk:1996
>
> Date/time type storage:   64-bit in

Re: [GENERAL] 9.3-beta postgres-fdw COPY error

2013-06-21 Thread Lonni J Friedman
I was afraid someone would say that.  Is this a limitation that might
be removed in the future (like 9.4), or is there a technical reason
why it's not possible to do a COPY against a foreign table?
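In the meantime, a workaround that seems to behave the same for my case is to wrap the foreign table in a query, since COPY (SELECT ...) TO doesn't carry the same table-type restriction:

```sql
COPY (SELECT id, testname FROM my_foreigntbl)
  TO '/tmp/testlist_aid' (DELIMITER ',');
```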

On Fri, Jun 21, 2013 at 10:52 AM, Adrian Klaver  wrote:
> On 06/21/2013 10:39 AM, Lonni J Friedman wrote:
>>
>> Greetings,
>> I'm trying to test out the new postgres-fdw support in postgresql-9.3
>> (beta) in preparation for an upgrade from 9.2 later this year.  So
>> far, everything is working ok, however one problem I'm encountering is
>> with the COPY command. When I run it against a foreign table (which is
>> also in a 9.3 instance), it fails:
>>
>> COPY my_foreigntbl (id,testname) TO '/tmp/testlist_aid' (DELIMITER ',');
>> ERROR:  cannot copy from foreign table "my_foreigntbl"
>>
>>
>> Is this expected behavior or a bug?
>
>
> Expected I believe:
>
> http://www.postgresql.org/docs/9.3/static/postgres-fdw.html
>
> "Now you need only SELECT from a foreign table to access the data stored in
> its underlying remote table. You can also modify the remote table using
> INSERT, UPDATE, or DELETE. "
>
>>
>> thanks!
>>
>>
>
>
> --
> Adrian Klaver
> adrian.kla...@gmail.com







[GENERAL] 9.3-beta postgres-fdw COPY error

2013-06-21 Thread Lonni J Friedman
Greetings,
I'm trying to test out the new postgres-fdw support in postgresql-9.3
(beta) in preparation for an upgrade from 9.2 later this year.  So
far, everything is working ok, however one problem I'm encountering is
with the COPY command. When I run it against a foreign table (which is
also in a 9.3 instance), it fails:

COPY my_foreigntbl (id,testname) TO '/tmp/testlist_aid' (DELIMITER ',');
ERROR:  cannot copy from foreign table "my_foreigntbl"


Is this expected behavior or a bug?

thanks!




[GENERAL] how to reference variables in pgbench custom scripts?

2013-06-18 Thread Lonni J Friedman
I'm attempting to write a custom pgbench script (called via the -f
option), with a variable set at the top with:
\setrandom aid 100 50875000

However, I can't quite figure out how to reference the new aid
variable.  The documentation simply states that a variable is
referenced with a colon in front of its name (:aid ).  However, I
can't figure out how to make this work if the variable is embedded
between other characters.  For example, I want to do something like
this:

CREATE INDEX nppsmoke_tmp_:aid_idx ON nppsmoke_tmp_:aid (testname);

However, when I run pgbench, it fails:
Client 0 aborted in state 4: ERROR:  syntax error at or near ":"
LINE 1: CREATE INDEX nppsmoke_tmp_:aid_idx ON nppsmoke_tmp_29954053 ...

Is there a default delimiter, such as bash's curly brackets (${aid}) ?
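For what it's worth, the failure above is consistent with the substitution rule being "colon plus the longest following run of identifier characters", so :aid_idx is read as an (undefined) variable named aid_idx, the text is left alone, and the server then trips on the ":".  A toy sketch of that tokenization (my assumption — this is not pgbench's actual code):

```shell
# Extract the first pgbench-style variable name from a line: everything
# after the first ":" up to the first non-identifier character.
extract_var() {
  printf '%s\n' "$1" | sed -n 's/[^:]*:\([A-Za-z0-9_]*\).*/\1/p'
}
extract_var 'CREATE INDEX nppsmoke_tmp_:aid_idx ON tbl (testname);'   # aid_idx
extract_var 'ON nppsmoke_tmp_:aid (testname)'                         # aid
```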

thanks




Re: [GENERAL] pg_upgrade link mode

2013-05-16 Thread Lonni J Friedman
On Thu, May 16, 2013 at 11:03 AM, Igor Neyman  wrote:
>
> From: pgsql-general-ow...@postgresql.org 
> [mailto:pgsql-general-ow...@postgresql.org] On Behalf Of AI Rumman
> Sent: Thursday, May 16, 2013 1:56 PM
> To: Fabio Rueda Carrascosa
> Cc: pgsql-general
> Subject: Re: [GENERAL] pg_upgrade link mode
>
> I always think it's a bit risky to use link instead of copying. However, I'd 
> suggest trying pg_upgrade's --check option first.
>
> --
>
> Why?
> Do you have specific experience, when link mode caused any problems?
> Could you share?

I assume what he's referring to is if the upgrade gets partially
completed and fails for any reason, then you have a broken mess, with
no simple rollback path.  Since your database is only about 1GB in
size, it shouldn't take very long to run a base backup before doing
the upgrade.  You can send that backup over the network to a remote
system, so that you have a fallback solution if the upgrade fails.
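Concretely, something along these lines before attempting the upgrade (a sketch only — the host, paths, and version numbers are placeholders, not taken from this thread):

```shell
# Stream a full base backup to a remote system as the fallback copy
pg_basebackup -h dbserver -U replication -D /backups/pre_upgrade -Ft -z -P
# Dry-run the upgrade checks before doing it for real in link mode
pg_upgrade --check --link \
    -b /usr/pgsql-9.2/bin -B /usr/pgsql-9.3/bin \
    -d /var/lib/pgsql/9.2/data -D /var/lib/pgsql/9.3/data
```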




Re: [GENERAL] Deploying PostgreSQL on CentOS with SSD and Hardware RAID

2013-05-10 Thread Lonni J Friedman
On Fri, May 10, 2013 at 11:23 AM, Steven Schlansker  wrote:
>
> On May 10, 2013, at 7:14 AM, Matt Brock  wrote:
>
>> Hello.
>>
>> We're intending to deploy PostgreSQL on Linux with SSD drives which would be 
>> in a RAID 1 configuration with Hardware RAID.
>>
>> My first question is essentially: are there any issues we need to be aware 
>> of when running PostgreSQL 9 on CentOS 6 on a server with SSD drives in a 
>> Hardware RAID 1 configuration? Will there be any compatibility problems 
>> (seems unlikely)? Should we consider alternative configurations as being 
>> more effective for getting better use out of the hardware?
>>
>> The second question is: are there any SSD-specific issues to be aware of 
>> when tuning PostgreSQL to make the best use of this hardware and software?
>>
>
> A couple of things I noticed with a similar-ish setup:
>
> * Some forms of RAID / LVM break the kernel's automatic disk tuning 
> mechanism.  In particular, there is a "rotational" tunable that often does 
> not get set right.  You might end up tweaking read ahead and friends as well.
> http://www.mjmwired.net/kernel/Documentation/block/queue-sysfs.txt#112
>
> * The default Postgres configuration is awful for a SSD backed database.  You 
> really need to futz with checkpoints to get acceptable throughput.
> The "PostgreSQL 9.0 High Performance" book is fantastic and is what I used to 
> great success.
>
> * The default Linux virtual memory configuration is awful for this 
> configuration.  Briefly, it will accept a ton of incoming data, and then go 
> through an awful stall as soon as it calls fsync() to write all that data to 
> disk.  We had multi-second delays all the way through to the application 
> because of this.  We had to change the zone_reclaim_mode and the dirty buffer 
> limits.
> http://www.postgresql.org/message-id/500616cb.3070...@2ndquadrant.com
>
>
>
> I am not sure that these numbers will end up being anywhere near what works 
> for you, but these are my notes from tuning a 4xMLC SSD RAID-10.  I haven't 
> proven that this is optimal, but it was way better than the defaults.  We 
> ended up with the following list of changes:
>
> * Change IO scheduler to "noop"
> * Mount DB volume with nobarrier, noatime
> * Turn blockdev readahead to 16MiB
> * Turn sdb's "rotational" tuneable to 0
>
> PostgreSQL configuration changes:
> synchronous_commit = off
> effective_io_concurrency = 4
> checkpoint_segments = 1024
> checkpoint_timeout = 10min
> checkpoint_warning = 8min
> shared_buffers = 32gb
> temp_buffers = 128mb
> work_mem = 512mb
> maintenance_work_mem = 1gb
>
> Linux sysctls:
> vm.swappiness = 0
> vm.zone_reclaim_mode = 0
> vm.dirty_bytes = 134217728
> vm.dirty_background_bytes = 1048576

Can you provide more details about your setup, including:
* What kind of filesystem are you using?
* Linux distro and/or kernel version
* hardware (RAM, CPU cores etc)
* database usage patterns (% writes, growth, etc)

thanks





Re: [GENERAL] Deploying PostgreSQL on CentOS with SSD and Hardware RAID

2013-05-10 Thread Lonni J Friedman
On Fri, May 10, 2013 at 10:20 AM, Merlin Moncure  wrote:
> On Fri, May 10, 2013 at 12:03 PM, David Boreham  
> wrote:
>> On 5/10/2013 10:21 AM, Merlin Moncure wrote:
>>>
>>> As it turns out, the list of flash drives that are suitable for database use is
>>> surprisingly small. The s3700 I noted upthread seems to be specifically
>>> built with databases in mind and is likely the best choice for new
>>> deployments. The older Intel 320 is also a good choice. I think that's
>>> pretty much it until you get into expensive pci-e based gear.
>>
>>
>> This may have been a typo : did you mean Intel 710 series rather than 320 ?
>>
>> While the 320 has the supercap, it isn't specified for high write endurance.
>> Definitely usable for a database, and a better choice than most of the
>> alternatives, but I'd have listed the 710 ahead of the 320.
>
> It wasn't a typo.  The 320 is perfectly fine, although it will
> wear out faster -- so it fills a niche for low-write-intensity
> applications.  I find the s3700 to be superior to the 710 in just
> about every way (although you're right -- it is suitable for database
> use).

There's also the 520 series, which has better performance than the 320
series (which is EOL now).




Re: [GENERAL] pg_basebackup, requested WAL has already been removed

2013-05-10 Thread Lonni J Friedman
That's a good point.  Then I dunno; perhaps it is a bug, but I'd be
surprised if this wasn't working, as it's not really a corner case that
could be missed in testing, as long as all the options were exercised.
Hopefully someone else can weigh in.

On Fri, May 10, 2013 at 10:00 AM, Sergey Koposov  wrote:
>
> On Fri, 10 May 2013, Lonni J Friedman wrote:
>
>> It's definitely not a bug.  You need to set/increase wal_keep_segments
>> to a value that ensures that they aren't recycled faster than the time
>> required to complete the base backup (plus some buffer).
>
>
> But I thought that wal_keep_segments is not needed for the streaming regime
> ( "--xlog-method=stream")  And the documentation
> http://www.postgresql.org/docs/9.2/static/app-pgbasebackup.html
> only mentions wal_keep_segments when talking about --xlog-method=fetch.
>
>
>>
>> On Fri, May 10, 2013 at 9:48 AM, Sergey Koposov  wrote:
>>> Hi,
>>>
>>>
>>> I've recently started to use pg_basebackup --xlog-method=stream to backup
>>> my
>>> multi-Tb database.
>>> Before I did the backup when there was not much activity in the DB and it
>>> went perfectly fine, but today, I've started the backup and it failed
>>> twice
>>> almost at the same time as the
>>> CREATE INDEX  (and another time CLUSTER) commands were finished.
>>>
>>> Here:
>>>
>>> postgres@cappc118:/mnt/backup/wsdb_130510$ pg_basebackup
>>> --xlog-method=stream --progress --verbose --pg
>>> transaction log start point: 23AE/BD003E70
>>> pg_basebackup: starting background WAL receiver
>>> pg_basebackup: unexpected termination of replication stream: FATAL:
>>> requested WAL segment 000123B100FE has already been removed
>>> 4819820/16816887078 kB (4%), 0/1 tablespace
>>> (/mnt/backup/wsdb_130510/base/1)
>>>
>>> And the logs from around that time contained:
>>>
>>> some_user:wsdb:2013-05-10 14:35:41 BST:10587LOG:  duration: 40128.163 ms
>>> statement: CREATE INDEX usno_clean_q3c_idx ON usno_clean (q3c_ang2ipix(ra,dec));
>>> ::2013-05-10 14:35:43 BST:25529LOG:  checkpoints are occurring too
>>> frequently (8 seconds apart)
>>> ::2013-05-10 14:35:43 BST:25529HINT:  Consider increasing the
>>> configuration
>>> parameter "checkpoint_segments".
>>> ::2013-05-10 14:35:51 BST:25529LOG:  checkpoints are occurring too
>>> frequently (8 seconds apart)
>>> ::2013-05-10 14:35:51 BST:25529HINT:  Consider increasing the
>>> configuration
>>> parameter "checkpoint_segments".
>>> postgres:[unknown]:2013-05-10 14:35:55 BST:8177FATAL:  requested WAL
>>> segment
>>> 000123B100FE has already been removed
>>> some_user:wsdb:2013-05-10 14:36:59 BST:10599LOG:  duration: 78378.194 ms
>>> statement: CLUSTER usno_clean_q3c_idx ON usno_clean;
>>>
>>> On the previous occasion when it happened, the CREATE INDEX was being
>>> executed:
>>>
>>> some_user:wsdb:2013-05-10 09:17:20 BST:3300LOG:  duration: 67.680 ms
>>> statement: SELECT name FROM  (SELECT pg_catalog.lower(name) AS name FROM
>>> pg_catalog.pg_settings   UNION ALL SELECT 'session authorization'   UNION
>>> ALL SELECT 'all') ss  WHERE substring(name,1,4)='rand'
>>> LIMIT 1000
>>> ::2013-05-10 09:22:47 BST:25529LOG:  checkpoints are occurring too
>>> frequently (18 seconds apart)
>>> ::2013-05-10 09:22:47 BST:25529HINT:  Consider increasing the
>>> configuration
>>> parameter "checkpoint_segments".
>>> postgres:[unknown]:2013-05-10 09:22:49 BST:27659FATAL:  requested WAL
>>> segment 000123990040 has already been removed
>>> some_user:wsdb:2013-05-10 09:22:57 BST:3236LOG:  duration: 542955.262 ms
>>> statement: CREATE INDEX xmatch_temp_usnoid_idx ON xmatch_temp (usno_id);
>>>
>>> The configuration:
>>> PG 9.2.4, Debian 7.0, amd64
>>>
>>> shared_buffers = 10GB
>>> work_mem = 1GB
>>> maintenance_work_mem = 1GB
>>> effective_io_concurrency = 5
>>> synchronous_commit = off
>>> checkpoint_segments = 32
>>> max_wal_senders = 2
>>> effective_cache_size = 30GB
>>> autovacuum_max_workers = 3
>>> wal_level=archive
>>> archive_mode = off
>>>
>>> Does it look like a bug or am I missing something ?
>>>
>>> Thanks,
>>> Sergey
>>>
>>>
>>
>>
>>
>>
>> --
>> ~
>> L. Friedman   netll...@gmail.com
>> LlamaLand   https://netllama.linux-sxs.org
>>
>



-- 
~
L. Friedman   netll...@gmail.com
LlamaLand   https://netllama.linux-sxs.org




Re: [GENERAL] pg_basebackup, requested WAL has already been removed

2013-05-10 Thread Lonni J Friedman
It's definitely not a bug.  You need to set/increase wal_keep_segments
to a value that ensures that they aren't recycled faster than the time
required to complete the base backup (plus some buffer).
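As a rough illustration (the numbers here are placeholders, not a recommendation): with the default 16MB segment size, wal_keep_segments bounds how much WAL the master retains for standbys and base backups:

```conf
# postgresql.conf -- illustrative only; size this from your WAL generation
# rate and how long a base backup takes.  512 segments * 16MB ~= 8GB of
# retained WAL on the master.
wal_keep_segments = 512
```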

On Fri, May 10, 2013 at 9:48 AM, Sergey Koposov  wrote:
> Hi,
>
> I've recently started to use pg_basebackup --xlog-method=stream to backup my
> multi-Tb database.
> Before I did the backup when there was not much activity in the DB and it
> went perfectly fine, but today, I've started the backup and it failed twice
> almost at the same time as the
> CREATE INDEX  (and another time CLUSTER) commands were finished.
>
> Here:
>
> postgres@cappc118:/mnt/backup/wsdb_130510$ pg_basebackup
> --xlog-method=stream --progress --verbose --pg
> transaction log start point: 23AE/BD003E70
> pg_basebackup: starting background WAL receiver
> pg_basebackup: unexpected termination of replication stream: FATAL:
> requested WAL segment 000123B100FE has already been removed
> 4819820/16816887078 kB (4%), 0/1 tablespace
> (/mnt/backup/wsdb_130510/base/1)
>
> And the logs from around that time contained:
>
> some_user:wsdb:2013-05-10 14:35:41 BST:10587LOG:  duration: 40128.163 ms
> statement: CREATE INDEX usno_clean_q3c_idx ON usno_clean (q3c_ang2ipix(ra,dec));
> ::2013-05-10 14:35:43 BST:25529LOG:  checkpoints are occurring too
> frequently (8 seconds apart)
> ::2013-05-10 14:35:43 BST:25529HINT:  Consider increasing the configuration
> parameter "checkpoint_segments".
> ::2013-05-10 14:35:51 BST:25529LOG:  checkpoints are occurring too
> frequently (8 seconds apart)
> ::2013-05-10 14:35:51 BST:25529HINT:  Consider increasing the configuration
> parameter "checkpoint_segments".
> postgres:[unknown]:2013-05-10 14:35:55 BST:8177FATAL:  requested WAL segment
> 000123B100FE has already been removed
> some_user:wsdb:2013-05-10 14:36:59 BST:10599LOG:  duration: 78378.194 ms
> statement: CLUSTER usno_clean_q3c_idx ON usno_clean;
>
> On the previous occasion when it happened, the CREATE INDEX was being
> executed:
>
> some_user:wsdb:2013-05-10 09:17:20 BST:3300LOG:  duration: 67.680 ms
> statement: SELECT name FROM  (SELECT pg_catalog.lower(name) AS name FROM
> pg_catalog.pg_settings   UNION ALL SELECT 'session authorization'   UNION
> ALL SELECT 'all') ss  WHERE substring(name,1,4)='rand'
> LIMIT 1000
> ::2013-05-10 09:22:47 BST:25529LOG:  checkpoints are occurring too
> frequently (18 seconds apart)
> ::2013-05-10 09:22:47 BST:25529HINT:  Consider increasing the configuration
> parameter "checkpoint_segments".
> postgres:[unknown]:2013-05-10 09:22:49 BST:27659FATAL:  requested WAL
> segment 000123990040 has already been removed
> some_user:wsdb:2013-05-10 09:22:57 BST:3236LOG:  duration: 542955.262 ms
> statement: CREATE INDEX xmatch_temp_usnoid_idx ON xmatch_temp (usno_id);
>
> The configuration:
> PG 9.2.4, Debian 7.0, amd64
>
> shared_buffers = 10GB
> work_mem = 1GB
> maintenance_work_mem = 1GB
> effective_io_concurrency = 5
> synchronous_commit = off
> checkpoint_segments = 32
> max_wal_senders = 2
> effective_cache_size = 30GB
> autovacuum_max_workers = 3
> wal_level=archive
> archive_mode = off
>
> Does it look like a bug or am I missing something ?
>
> Thanks,
> Sergey
>
>







Re: [GENERAL] Replication terminated due to PANIC

2013-04-25 Thread Lonni J Friedman
If it's really index corruption, then you should be able to fix it by
reindexing.  However, that doesn't explain what caused the corruption.
Perhaps your hardware is bad in some way?
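Assuming the damage really is confined to an index, a rebuild sketch might look like this (the index and database names are placeholders):

```sql
-- Rebuild just the suspect index:
REINDEX INDEX my_suspect_index;

-- Or, more heavy-handed, rebuild every index in the database:
REINDEX DATABASE mydb;
```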

On Wed, Apr 24, 2013 at 10:46 PM, Adarsh Sharma  wrote:
> Thanks Sergey for such a quick response, but I don't think this is the
> patch problem, because we have other DB servers also running fine on the
> same version, and the message is also different:
>
> host= PANIC: _bt_restore_page: cannot add item to page
>
> And the whole day replication works fine, but at midnight when the log
> rotates it shows the below messages:
>
> 2013-04-24 00:00:00 UTC [26989]: [4945032-1] user= db= host= LOG:
> checkpoint starting: time
> 2013-04-24 00:00:00 UTC [26989]: [4945033-1] user= db= host= ERROR:  could
> not open file "global/14078": No such file or directory
>
> 2013-04-24 00:00:00 UTC [26989]: [4945034-1] user= db= host= CONTEXT:
> writing block 0 of relation global/14078
> 2013-04-24 00:00:00 UTC [26989]: [4945035-1] user= db= host= WARNING:  could
> not write block 0 of global/14078
>
> 2013-04-24 00:00:00 UTC [26989]: [4945036-1] user= db= host= DETAIL:
> Multiple failures --- write error might be permanent.
>
> Looks like some index corruption.
>
>
> Thanks
>
>
>
>
>
>
> On Thu, Apr 25, 2013 at 8:14 AM, Sergey Konoplev  wrote:
>>
>> On Wed, Apr 24, 2013 at 5:05 PM, Adarsh Sharma 
>> wrote:
>> > I have a Postgresql 9.2 instance running on a CentOS6.3 box.Yesterday i
>> > setup a hot standby by using pgbasebackup. Today i got the below  alert
>> > from
>> > standby box :
>> >
>> > [1] (from line 412,723)
>> > 2013-04-24 23:07:18 UTC [13445]: [6-1] user= db= host= PANIC:
>> > _bt_restore_page: cannot add item to page
>> >
>> > When i check, the replication is terminated due to slave DB shutdown.
>> > From
>> > the logs i can see below messages :-
>>
>> I am not sure that it is your situation but take a look at this thread:
>>
>>
>> http://www.postgresql.org/message-id/CAL_0b1t=WuM6roO8dki=w8dhh8p8whhohbpjreymmqurocn...@mail.gmail.com
>>
>> There is a patch by Andres Freund in the end of the discussion. Three
>> weeks have passed after I installed the patched version and it looks
>> like the patch fixed my issue.
>>
>> >
>> > 2013-04-24 23:17:16 UTC [26989]: [5360083-1] user= db= host= ERROR:
>> > could
>> > not open file "global/14078": No such file or directory
>> > 2013-04-24 23:17:16 UTC [26989]: [5360084-1] user= db= host= CONTEXT:
>> > writing block 0 of relation global/14078
>> > 2013-04-24 23:17:16 UTC [26989]: [5360085-1] user= db= host= WARNING:
>> > could
>> > not write block 0 of global/14078
>> > 2013-04-24 23:17:16 UTC [26989]: [5360086-1] user= db= host= DETAIL:
>> > Multiple failures --- write error might be permanent.
>> >
>> > I checked in global directory of master, the directory 14078 doesn't
>> > exist.
>> >
>> > Anyone has faced above issue ?
>> >
>> > Thanks
>>
>>
>>
>> --
>> Kind regards,
>> Sergey Konoplev
>> Database and Software Consultant
>>
>> Profile: http://www.linkedin.com/in/grayhemp
>> Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
>> Skype: gray-hemp
>> Jabber: gray...@gmail.com
>
>







Re: [GENERAL] corrupted item pointer in streaming based replication

2013-04-03 Thread Lonni J Friedman
You should figure out what base/16384/114846.39 corresponds to inside
the database.  If you're super lucky it's something unimportant and/or
something that can be recreated easily (like an index).  If it's
something important, then your only option is to try to drop the
object and restore it from the last known good backup.
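For reference, that path encodes database OID 16384 and relation filenode 114846 (the .39 suffix is the 39th 1GB segment of the file).  A query along these lines, run while connected to that database, can usually map the filenode back to a relation; this is a sketch only, since relfilenode can diverge from the table OID after a rewrite:

```sql
-- Run inside the database whose OID is 16384.
SELECT oid, relname, relkind
FROM pg_class
WHERE relfilenode = 114846;
```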

On Wed, Apr 3, 2013 at 1:02 PM, Jigar Shah  wrote:
> Hi,
>
> Postgres version = 9.1.2
> OS = debian(6.0.7)
> fsync = on
> full_page_writes = on
> Setup = Primary and streaming replication based secondary
>
> A few days ago we had a situation where our Primary started to throw the
> error messages below indicating corruption in the database. It crashed
> sometimes and showed a panic message in the logs
>
> 2013-03-25 07:30:39.545 PDT PANIC:  corrupted item pointer: offset = 0, size
> = 0
> 2013-03-25 07:30:39.704 PDT LOG:  server process (PID 8715) was terminated
> by signal 6: Aborted
> 2013-03-25 07:30:39.704 PDT LOG:  terminating any other active server
> processes
>
> Days before it started to crash it showed the below error messages in the
> logs.
>
> [d: u:postgres p:2498 7] ERROR: could not access status of transaction
> 837550133
> DETAIL: Could not open file "pg_clog/031E": No such file or directory.
> [u:postgres p:2498 9]
>
> [d: u:radio p:31917 242] ERROR: could not open file "base/16384/114846.39"
> (target block 360448000): No such file or directory [d: u:radio p:31917 243]
>
> On top of that, our secondaries are now crashed and would not startup and
> showed the error messages below in pg logs.
>
> 2013-03-27 11:00:47.281 PDT LOG:  recovery restart point at 161A/17108AA8
> 2013-03-27 11:00:47.281 PDT DETAIL:  last completed transaction was at log
> time 2013-03-27 11:00:47.241236-07
> 2013-03-27 11:00:47.520 PDT LOG:  restartpoint starting: xlog
>
> 2013-03-27 11:07:51.348 PDT FATAL:  corrupted item pointer: offset = 0, size
> = 0
> 2013-03-27 11:07:51.348 PDT CONTEXT:  xlog redo split_l: rel
> 1663/16384/115085 left 4256959, right 5861610, next 5044459, level 0,
> firstright 192
> 2013-03-27 11:07:51.716 PDT LOG:  startup process (PID 5959) exited with
> exit code 1
> 2013-03-27 11:07:51.716 PDT LOG:  terminating any other active server
> processes
>
> At this point we have a running but corrupt primary and crashed secondary
> that wont startup.
>
> I am wondering what are our options at this point. Can we do something to
> fix this? How can we recover from corruption.
>
> Thanks for help in advance.
>
> Regards
> Jigar
>
>







Re: [GENERAL] Streaming replication slave crash

2013-03-29 Thread Lonni J Friedman
Looks like you've got some form of corruption:
page 1441792 of relation base/63229/63370 does not exist


The question is whether it was corrupted on the master and then
replicated to the slave, or if it was corrupted on the slave.  I'd
guess that the pg_dump tried to read from that page and barfed.  It
would be interesting to re-run the pg_dump to see if
this crash can be replicated.  If so, does it also replicate if you
run pg_dump against the master?  If not, then the corruption is
isolated to the slave, and you might have a hardware problem which is
causing the data to get corrupted.
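A quick way to localize it, sketched with placeholder host and database names, is to run the same dump against each node and compare results:

```shell
# Illustrative only: if the dump fails on the slave but succeeds on the
# master, the corruption is local to the slave.
pg_dump -h slave.example.com  -U postgres mydb > /dev/null; echo "slave exit: $?"
pg_dump -h master.example.com -U postgres mydb > /dev/null; echo "master exit: $?"
```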

On Fri, Mar 29, 2013 at 9:19 AM, Quentin Hartman
 wrote:
> Yesterday morning, one of my streaming replication slaves running 9.2.3
> crashed with the following in the log file:
>
> 2013-03-28 12:49:30 GMT WARNING:  page 1441792 of relation base/63229/63370
> does not exist
> 2013-03-28 12:49:30 GMT CONTEXT:  xlog redo delete: index 1663/63229/109956;
> iblk 303, heap 1663/63229/63370;
> 2013-03-28 12:49:30 GMT PANIC:  WAL contains references to invalid pages
> 2013-03-28 12:49:30 GMT CONTEXT:  xlog redo delete: index 1663/63229/109956;
> iblk 303, heap 1663/63229/63370;
> 2013-03-28 12:49:31 GMT LOG:  startup process (PID 22941) was terminated by
> signal 6: Aborted
> 2013-03-28 12:49:31 GMT LOG:  terminating any other active server processes
> 2013-03-28 12:49:31 GMT WARNING:  terminating connection because of crash of
> another server process
> 2013-03-28 12:49:31 GMT DETAIL:  The postmaster has commanded this server
> process to roll back the current transaction and exit, because another
> server process exited abnormally and possibly corrupted shared memory.
> 2013-03-28 12:49:31 GMT HINT:  In a moment you should be able to reconnect
> to the database and repeat your command.
> 2013-03-28 12:57:44 GMT LOG:  database system was interrupted while in
> recovery at log time 2013-03-28 12:37:42 GMT
> 2013-03-28 12:57:44 GMT HINT:  If this has occurred more than once some data
> might be corrupted and you might need to choose an earlier recovery target.
> 2013-03-28 12:57:44 GMT LOG:  entering standby mode
> 2013-03-28 12:57:44 GMT LOG:  redo starts at 19/2367CE30
> 2013-03-28 12:57:44 GMT LOG:  incomplete startup packet
> 2013-03-28 12:57:44 GMT LOG:  consistent recovery state reached at
> 19/241835B0
> 2013-03-28 12:57:44 GMT LOG:  database system is ready to accept read only
> connections
> 2013-03-28 12:57:44 GMT LOG:  invalid record length at 19/2419EE38
> 2013-03-28 12:57:44 GMT LOG:  streaming replication successfully connected
> to primary
>
> As you can see I was able to restart it and it picked up and synchronized
> right away, but this crash still concerns me.
>
> The DB has about 75GB of data in it, and it is almost entirely write
> traffic. It's essentially a log aggregator. I believe it was doing a pg_dump
> backup at the time of the crash. It has hot_standby_feedback on to allow
> that process to complete.
>
> Any insights into this, or advice on figuring out the root of it would be
> appreciated. So far all the things I've found like this are bugs that should
> be fixed in this version, or the internet equivalent of a shrug.
>
> Thanks!
>
> QH







Re: [GENERAL] UNLOGGED TEMPORARY tables?

2013-03-25 Thread Lonni J Friedman
On Mon, Mar 25, 2013 at 4:49 PM, Michael Paquier
 wrote:
>
>
> On Tue, Mar 26, 2013 at 8:26 AM, Lonni J Friedman 
> wrote:
>>
>> I'm pretty sure that unlogged tables and temp tables are two separate
>> & distinct features, with no overlap in functionality.  It would be
>> nice if it was possible to create an unlogged temp table.
>
> Temporary tables are a subtype of unlogged tables, as temporary tables are
> not WAL-logged.
> This article from Robert Haas will give a good summary of such differences:
> http://rhaas.blogspot.jp/2010/05/global-temporary-and-unlogged-tables.html


Thanks, that's good to know.  The official docs don't really make it
clear that temp tables are unlogged.
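For anyone comparing the two, a minimal illustration of the distinction (table names are arbitrary):

```sql
-- Session-local, not WAL-logged, dropped at session exit:
CREATE TEMPORARY TABLE scratch_temp (id int, payload text);

-- Visible to all sessions, not WAL-logged, but truncated after a crash
-- and not replicated to standbys:
CREATE UNLOGGED TABLE scratch_unlogged (id int, payload text);
```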




Re: [GENERAL] UNLOGGED TEMPORARY tables?

2013-03-25 Thread Lonni J Friedman
I'm pretty sure that unlogged tables and temp tables are two separate
& distinct features, with no overlap in functionality.  It would be
nice if it was possible to create an unlogged temp table.

On Sun, Mar 24, 2013 at 1:32 PM, aasat  wrote:
> I tested write speed to temporary and unlogged tables and noticed that
> unlogged tables were much faster
>
> Postgres 9.2.2
>
> Write speed
>
> Temporary 14.5k/s
> UNLOGGED 50k/s
>
> Before the test I was convinced that temporary tables in postgres >= 9.1
> were unlogged
>
>
>
>




Re: [GENERAL] replication behind high lag

2013-03-25 Thread Lonni J Friedman
On Mon, Mar 25, 2013 at 1:23 PM, AI Rumman  wrote:
>
>
> On Mon, Mar 25, 2013 at 4:03 PM, AI Rumman  wrote:
>>
>>
>>
>> On Mon, Mar 25, 2013 at 4:00 PM, Lonni J Friedman 
>> wrote:
>>>
>>> On Mon, Mar 25, 2013 at 12:55 PM, AI Rumman  wrote:
>>> >
>>> >
>>> > On Mon, Mar 25, 2013 at 3:52 PM, Lonni J Friedman 
>>> > wrote:
>>> >>
>>> >> On Mon, Mar 25, 2013 at 12:43 PM, AI Rumman 
>>> >> wrote:
>>> >> >
>>> >> >
>>> >> > On Mon, Mar 25, 2013 at 3:40 PM, Lonni J Friedman
>>> >> > 
>>> >> > wrote:
>>> >> >>
>>> >> >> On Mon, Mar 25, 2013 at 12:37 PM, AI Rumman 
>>> >> >> wrote:
>>> >> >> > Hi,
>>> >> >> >
>>> >> >> > I have two 9.2 databases running with hot_standby replication.
>>> >> >> > Today
>>> >> >> > when I
>>> >> >> > was checking, I found that replication has not been working since
>>> >> >> > Mar
>>> >> >> > 1st.
>>> >> >> > There was a large database restored in master on that day and I
>>> >> >> > believe
>>> >> >> > after that the lag went higher.
>>> >> >> >
>>> >> >> > SELECT pg_xlog_location_diff(pg_current_xlog_location(), '0/0')
>>> >> >> > AS
>>> >> >> > offset
>>> >> >> >
>>> >> >> > 431326108320
>>> >> >> >
>>> >> >> > SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(),
>>> >> >> > '0/0')
>>> >> >> > AS
>>> >> >> > receive,
>>> >> >> > pg_xlog_location_diff(pg_last_xlog_replay_location(),
>>> >> >> > '0/0')
>>> >> >> > AS replay
>>> >> >> >
>>> >> >> >receive|replay
>>> >> >> > --+--
>>> >> >> >  245987541312 | 245987534032
>>> >> >> > (1 row)
>>> >> >> >
>>> >> >> > I checked the pg_xlog in both the server. In Slave the last xlog
>>> >> >> > file
>>> >> >> > -rw--- 1 postgres postgres 16777216 Mar  1 06:02
>>> >> >> > 00010039007F
>>> >> >> >
>>> >> >> > In Master, the first xlog file is
>>> >> >> > -rw--- 1 postgres postgres 16777216 Mar  1 04:45
>>> >> >> > 00010039005E
>>> >> >> >
>>> >> >> >
>>> >> >> > Is there any way I could sync the slave in quick process?
>>> >> >>
>>> >> >> generate a new base backup, and seed the slave with it.
>>> >> >
>>> >> >
>>> >> > OK. I am getting these error in slave:
>>> >> > LOG:  invalid contrecord length 284 in log file 57, segment 127,
>>> >> > offset
>>> >> > 0
>>> >> >
>>> >> > What is the actual reason?
>>> >>
>>> >> Corruption?  What were you doing when you saw the error?
>>> >
>>> >
>>> > I did not have enough idea about these stuffs. I got the database now
>>> > and
>>> > saw the error.
>>> > Is there any way to recover from this state. The master database is a
>>> > large
>>> > database of 500 GB.
>>>
>>> generate a new base backup, and seed the slave with it.  if the error
>>> persists, then i'd guess that your master is corrupted, and then
>>> you've got huge problems.
>>
>>
>> Master is running fine right now showing only a warning:
>> WARNING:  archive_mode enabled, yet archive_command is not set
>>
>> Do you think the master could be corrupted?
>>
>
> Hi,
>
> I got the info that there was a master db restart on Feb 27th. Could this be
> a reason of this error?
>

Restarting the database cleanly should never cause corruption.  Again,
you need to create a new base backup, and seed the slave with it.  If
the problem persists, then the master is likely corrupted.
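A reseed along these lines (paths, host, and user are placeholders; adapt to your layout) is the usual recovery path:

```shell
# Sketch only.  On the standby, with the server stopped:
pg_ctl -D /var/lib/postgresql/9.2/standby stop

rm -rf /var/lib/postgresql/9.2/standby/*

# -x includes the WAL needed for a self-consistent copy (9.2-era syntax).
pg_basebackup -h master.example.com -U replicator \
    -D /var/lib/postgresql/9.2/standby -x -P

# Recreate recovery.conf, then start the standby again.
pg_ctl -D /var/lib/postgresql/9.2/standby start
```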




Re: [GENERAL] replication behind high lag

2013-03-25 Thread Lonni J Friedman
On Mon, Mar 25, 2013 at 12:55 PM, AI Rumman  wrote:
>
>
> On Mon, Mar 25, 2013 at 3:52 PM, Lonni J Friedman 
> wrote:
>>
>> On Mon, Mar 25, 2013 at 12:43 PM, AI Rumman  wrote:
>> >
>> >
>> > On Mon, Mar 25, 2013 at 3:40 PM, Lonni J Friedman 
>> > wrote:
>> >>
>> >> On Mon, Mar 25, 2013 at 12:37 PM, AI Rumman 
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I have two 9.2 databases running with hot_standby replication. Today
>> >> > when I
>> >> > was checking, I found that replication has not been working since Mar
>> >> > 1st.
>> >> > There was a large database restored in master on that day and I
>> >> > believe
>> >> > after that the lag went higher.
>> >> >
>> >> > SELECT pg_xlog_location_diff(pg_current_xlog_location(), '0/0') AS
>> >> > offset
>> >> >
>> >> > 431326108320
>> >> >
>> >> > SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), '0/0')
>> >> > AS
>> >> > receive,   pg_xlog_location_diff(pg_last_xlog_replay_location(),
>> >> > '0/0')
>> >> > AS replay
>> >> >
>> >> >receive|replay
>> >> > --+--
>> >> >  245987541312 | 245987534032
>> >> > (1 row)
>> >> >
>> >> > I checked the pg_xlog in both the server. In Slave the last xlog file
>> >> > -rw--- 1 postgres postgres 16777216 Mar  1 06:02
>> >> > 00010039007F
>> >> >
>> >> > In Master, the first xlog file is
>> >> > -rw--- 1 postgres postgres 16777216 Mar  1 04:45
>> >> > 00010039005E
>> >> >
>> >> >
>> >> > Is there any way I could sync the slave in quick process?
>> >>
>> >> generate a new base backup, and seed the slave with it.
>> >
>> >
>> > OK. I am getting these error in slave:
>> > LOG:  invalid contrecord length 284 in log file 57, segment 127, offset
>> > 0
>> >
>> > What is the actual reason?
>>
>> Corruption?  What were you doing when you saw the error?
>
>
> I did not have enough idea about these stuffs. I got the database now and
> saw the error.
> Is there any way to recover from this state. The master database is a large
> database of 500 GB.

Generate a new base backup, and seed the slave with it.  If the error
persists, then I'd guess that your master is corrupted, and then
you've got huge problems.




Re: [GENERAL] replication behind high lag

2013-03-25 Thread Lonni J Friedman
On Mon, Mar 25, 2013 at 12:43 PM, AI Rumman  wrote:
>
>
> On Mon, Mar 25, 2013 at 3:40 PM, Lonni J Friedman 
> wrote:
>>
>> On Mon, Mar 25, 2013 at 12:37 PM, AI Rumman  wrote:
>> > Hi,
>> >
>> > I have two 9.2 databases running with hot_standby replication. Today
>> > when I
>> > was checking, I found that replication has not been working since Mar
>> > 1st.
>> > There was a large database restored in master on that day and I believe
>> > after that the lag went higher.
>> >
>> > SELECT pg_xlog_location_diff(pg_current_xlog_location(), '0/0') AS
>> > offset
>> >
>> > 431326108320
>> >
>> > SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), '0/0') AS
>> > receive,   pg_xlog_location_diff(pg_last_xlog_replay_location(),
>> > '0/0')
>> > AS replay
>> >
>> >receive|replay
>> > --+--
>> >  245987541312 | 245987534032
>> > (1 row)
>> >
>> > I checked the pg_xlog in both the server. In Slave the last xlog file
>> > -rw--- 1 postgres postgres 16777216 Mar  1 06:02
>> > 00010039007F
>> >
>> > In Master, the first xlog file is
>> > -rw--- 1 postgres postgres 16777216 Mar  1 04:45
>> > 00010039005E
>> >
>> >
>> > Is there any way I could sync the slave in quick process?
>>
>> generate a new base backup, and seed the slave with it.
>
>
> OK. I am getting these error in slave:
> LOG:  invalid contrecord length 284 in log file 57, segment 127, offset 0
>
> What is the actual reason?

Corruption?  What were you doing when you saw the error?




Re: [GENERAL] replication behind high lag

2013-03-25 Thread Lonni J Friedman
On Mon, Mar 25, 2013 at 12:37 PM, AI Rumman  wrote:
> Hi,
>
> I have two 9.2 databases running with hot_standby replication. Today when I
> was checking, I found that replication has not been working since Mar 1st.
> There was a large database restored in master on that day and I believe
> after that the lag went higher.
>
> SELECT pg_xlog_location_diff(pg_current_xlog_location(), '0/0') AS offset
>
> 431326108320
>
> SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), '0/0') AS
> receive,   pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/0')
> AS replay
>
>receive|replay
> --+--
>  245987541312 | 245987534032
> (1 row)
>
> I checked the pg_xlog in both the server. In Slave the last xlog file
> -rw--- 1 postgres postgres 16777216 Mar  1 06:02
> 00010039007F
>
> In Master, the first xlog file is
> -rw--- 1 postgres postgres 16777216 Mar  1 04:45
> 00010039005E
>
>
> Is there any way I could sync the slave in quick process?

Generate a new base backup, and seed the slave with it.
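For monitoring the catch-up afterwards, the xlog locations in the queries above can also be compared client-side.  A small sketch, assuming the plain 64-bit LSN convention where '19/2367CE30' is a hex high word and low word of a byte position (pre-9.3 servers used a slightly different internal layout, so treat this as illustrative only):

```python
# Convert PostgreSQL LSN strings (e.g. '19/2367CE30') to byte offsets and
# compute replication lag.  Assumes the plain 64-bit LSN convention.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default 16MB WAL segments

def lsn_to_bytes(lsn: str) -> int:
    """Parse the 'hi/lo' hex parts into a single byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def lag_bytes(upstream: str, downstream: str) -> int:
    """How far 'downstream' trails 'upstream', in bytes."""
    return lsn_to_bytes(upstream) - lsn_to_bytes(downstream)

if __name__ == "__main__":
    lag = lag_bytes("19/241835B0", "19/2367CE30")
    print(lag, "bytes, roughly", lag // WAL_SEGMENT_SIZE, "segments")
```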




Re: [GENERAL] postgres 9.0.2 replicated database is crashing

2013-03-09 Thread Lonni J Friedman
On Sat, Mar 9, 2013 at 4:05 PM, akp geek  wrote:
> Appreciate your findings.  Taking your points and doing things now.  We can
> upgrade. Which version is more stable, 9.2 or 9.1?

They're two entirely different branches.  They should both be equally
stable.  However, if you're looking for the most straightforward path
I'd recommend going to 9.0.12.  Also be sure to read the release notes
first.

>
> We use GIST indexes quite a bit, and we use GIS as well.
>
> I recently compiled postgres 9.2 ..
>
> Regards
>
>
>
> On Sat, Mar 9, 2013 at 5:09 PM, Lonni J Friedman  wrote:
>>
>> On Sat, Mar 9, 2013 at 1:51 PM, akp geek  wrote:
>> > Thank you. As you mentioned, I understand that I am starting the
>> > streaming from scratch, which is not what I wanted to do.
>> >
>> > Here is what I am planning to do:
>> >
>> > Our replication process has been down since March 5th.
>> >
>> > 1. Is it OK to get all WALs from March 5th till now into the standby pg_xlog?
>> > 2. take pg_basebackup
>> > 3. export the data dir
>> > 4. stop backup
>> > 5. restart the standby.
>> >
>> > Based on my understanding it should work, because when the standby
>> > started, it was not able to find the files, as I had deleted them from
>> > the slave when I tried to fix the issue.
>>
>> Clearly it didn't work because you've been having issues for the past
>> 4 days, and you've already tried this approach unsuccessfully.  Based
>> on the log snippets below, it looks like you have multiple problems.
>> First, your slave is definitely missing WAL segments.  You should
>> increase the number of WAL segments that are archived on the master
>> and ensure that your base backup includes the WAL segments (by
>> generating it with the -x option).  However, that's the least of your
>> problems at the moment, because it looks like the master is what is
>> crashing.
>>
>> >
>> >
>> >   0  2013-03-09 04:56:08 GMT LOG:  entering standby mode
>> > cp: cannot access /backup/9.0.4/archive/0001010600E1
>> >   0  2013-03-09 04:56:08 GMT LOG:  redo starts at 106/E120
>> > cp: cannot access /backup/9.0.4/archive/0001010600E2
>> > cp: cannot access /backup/9.0.4/archive/0001010600E3
>> > cp: cannot access /backup/9.0.4/archive/0001010600E4
>> >   0  2013-03-09 04:56:10 GMT LOG:  consistent recovery state reached
>> > at
>> > 106/E45AD4A8
>> >   0  2013-03-09 04:56:10 GMT LOG:  database system is ready to
>> > accept
>> > read only connections
>> > cp: cannot access /backup/9.0.4/archive/0001010600E5
>> > cp: cannot access /backup/9.0.4/archive/0001010600E6
>> > cp: cannot access /backup/9.0.4/archive/0001010600E7
>> > cp: cannot access /backup/9.0.4/archive/0001010600E8
>> > cp: cannot access /backup/9.0.4/archive/0001010600E9
>> > cp: cannot access /backup/9.0.4/archive/0001010600EA
>> >   0  2013-03-09 04:56:12 GMT LOG:  invalid record length at
>> > 106/EA10B8C0
>> > cp: cannot access /backup/9.0.4/archive/0001010600EA
>> >   0  2013-03-09 04:56:12 GMT LOG:  streaming replication
>> > successfully
>> > connected to primary
>> > [unknown]  0  2013-03-09 04:57:00 GMT [unknown]LOG:  connection
>> > received: host=[local]
>> > [unknown]  0  2013-03-09 04:57:03 GMT [unknown]LOG:  connection
>> > received: host=[local]
>> >
>> > [unknown] 10.155.253.43(51257) 0 SELECT 2013-03-09 07:07:18 GMT
>> > prodLog:
>> > duration: 6316.649 ms
>> > [unknown] 10.155.253.43(51257) 0 idle 2013-03-09 07:47:53 GMT
>> > prodLog:
>> > disconnection: session time: 0:41:06.529 user=postgres database=fprod
>> > host=
>> > 10.155.253.43 port=51257
>> >   0  2013-03-09 07:55:48 GMT LOG:  restartpoint starting: time
>> >   0  2013-03-09 08:25:47 GMT LOG:  restartpoint complete: wrote
>> > 19419
>> > buffers (7.4%); write=1799.792 s, sync=0.066 s, total=1799.867 s
>> >   0  2013-03-09 08:25:47 GMT LOG:  recovery restart point at
>> > 107/FB01B238
>> >   0  2013-03-09 08:25:47 GMT DETAIL:  last completed transaction was
>> > at
>> > log time 2013-03-09 08:25:41.85776+00
>> >   0  2013-03-09 08:55:48 GMT LOG:  restartpoint starting: time
>> > psql [local] 08P01 idle 2013-03-09 09:1

Re: [GENERAL] postgres 9.0.2 replicated database is crashing

2013-03-09 Thread Lonni J Friedman
On Sat, Mar 9, 2013 at 1:51 PM, akp geek  wrote:
> Thank you. As you mentioned, I understand that I am starting the streaming
> from scratch, which is not what I wanted to do.
>
> Here is what I am planning to do:
>
> Our replication process has been down since March 5th.
>
> 1. Is it OK to get all WALs from March 5th till now into the standby pg_xlog?
> 2. take pg_basebackup
> 3. export the data dir
> 4. stop backup
> 5. restart the standby.
>
> Based on my understanding it should work, because when the standby started,
> it was not able to find the files, as I had deleted them from the slave when
> I tried to fix the issue.

Clearly it didn't work because you've been having issues for the past
4 days, and you've already tried this approach unsuccessfully.  Based
on the log snippets below, it looks like you have multiple problems.
First, your slave is definitely missing WAL segments.  You should
increase the number of WAL segments that are archived on the master
and ensure that your base backup includes the WAL segments (by
generating it with the -x option).  However, that's the least of your
problems at the moment, because it looks like the master is what is
crashing.
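
A sketch of that advice (hostnames and paths are invented; pg_basebackup and its -x flag exist from 9.1 on, while on 9.0 the equivalent is the pg_start_backup()/rsync procedure quoted elsewhere in this thread):

```shell
#!/bin/bash
# On the master, retain more WAL so a catching-up standby can find what
# it needs (postgresql.conf, then reload):
#   wal_keep_segments = 512

# Take a base backup that bundles the WAL required to make it consistent.
pg_basebackup -h master.example.com -U replication \
  -D /var/lib/pgsql/standby-data -x -P

# Add recovery.conf (standby_mode, primary_conninfo) to the new data
# directory, then start the standby against it.
```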

>
>
>   0  2013-03-09 04:56:08 GMT LOG:  entering standby mode
> cp: cannot access /backup/9.0.4/archive/0001010600E1
>   0  2013-03-09 04:56:08 GMT LOG:  redo starts at 106/E120
> cp: cannot access /backup/9.0.4/archive/0001010600E2
> cp: cannot access /backup/9.0.4/archive/0001010600E3
> cp: cannot access /backup/9.0.4/archive/0001010600E4
>   0  2013-03-09 04:56:10 GMT LOG:  consistent recovery state reached at
> 106/E45AD4A8
>   0  2013-03-09 04:56:10 GMT LOG:  database system is ready to accept
> read only connections
> cp: cannot access /backup/9.0.4/archive/0001010600E5
> cp: cannot access /backup/9.0.4/archive/0001010600E6
> cp: cannot access /backup/9.0.4/archive/0001010600E7
> cp: cannot access /backup/9.0.4/archive/0001010600E8
> cp: cannot access /backup/9.0.4/archive/0001010600E9
> cp: cannot access /backup/9.0.4/archive/0001010600EA
>   0  2013-03-09 04:56:12 GMT LOG:  invalid record length at 106/EA10B8C0
> cp: cannot access /backup/9.0.4/archive/0001010600EA
>   0  2013-03-09 04:56:12 GMT LOG:  streaming replication successfully
> connected to primary
> [unknown]  0  2013-03-09 04:57:00 GMT [unknown]LOG:  connection
> received: host=[local]
> [unknown]  0  2013-03-09 04:57:03 GMT [unknown]LOG:  connection
> received: host=[local]
>
> [unknown] 10.155.253.43(51257) 0 SELECT 2013-03-09 07:07:18 GMT prodLog:
> duration: 6316.649 ms
> [unknown] 10.155.253.43(51257) 0 idle 2013-03-09 07:47:53 GMT prodLog:
> disconnection: session time: 0:41:06.529 user=postgres database=fprod host=
> 10.155.253.43 port=51257
>   0  2013-03-09 07:55:48 GMT LOG:  restartpoint starting: time
>   0  2013-03-09 08:25:47 GMT LOG:  restartpoint complete: wrote 19419
> buffers (7.4%); write=1799.792 s, sync=0.066 s, total=1799.867 s
>   0  2013-03-09 08:25:47 GMT LOG:  recovery restart point at
> 107/FB01B238
>   0  2013-03-09 08:25:47 GMT DETAIL:  last completed transaction was at
> log time 2013-03-09 08:25:41.85776+00
>   0  2013-03-09 08:55:48 GMT LOG:  restartpoint starting: time
> psql [local] 08P01 idle 2013-03-09 09:10:52 GMT prodLog:  unexpected EOF on
> client connection
> psql [local] 0 idle 2013-03-09 09:10:52 GMT prodLog:  disconnection:
> session time: 2:15:06.351 user=postgres database=fprod host=[local]
>   XX000  2013-03-09 09:23:46 GMT FATAL:  failed to add item to index page in
> 100924/100937/1225845
>   XX000  2013-03-09 09:23:46 GMT CONTEXT:  xlog redo Insert item, node:
> 100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata: F isleaf F
> isdelete
> F updateBlkno:36483
>   0  2013-03-09 09:23:46 GMT LOG:  startup process (PID 3880) exited
> with exit code 1
>   0  2013-03-09 09:23:46 GMT LOG:  terminating any other active server
> processes

The real problem is here:
  XX000  2013-03-09 09:23:46 GMT FATAL:  failed to add item to index
page in 100924/100937/1225845
  XX000  2013-03-09 09:23:46 GMT CONTEXT:  xlog redo Insert item,
node: 100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata:
F isleaf F isdelete F updateBlkno:36483

That looks like something is badly broken (maybe data corruption)?
Most of the Google hits on that error are associated with GIST
indexes.  Are you using GIST indexes?

Are you really (still) using 9.0.4 ?  I hope you realize that there
are known data corruption bugs in that version, and that version is
super old at this point.  You really need to update to 9.0.12.




Re: [GENERAL] postgres 9.0.2 replicated database is crashing

2013-03-09 Thread Lonni J Friedman
That process merely sets up a new server; it doesn't start streaming
unless the server has been configured correctly.  You state that the
slave crashed after two hours.  How did you determine that?

All you seem to be doing is setting up the slave from scratch
repeatedly, and assuming that it will magically just work, rather than
understanding and debugging why it's not working.  Where is the log
output from your servers that shows what is really transpiring?

On Sat, Mar 9, 2013 at 6:51 AM, akp geek  wrote:
> Hi all -
>
>  I am in desperate need of your help. The replication/streaming
> stopped working on March 5th. I used the following procedure to restart
> the streaming. After running it for couple of hours , the database is
> crashing on the slave.  This is on our production server.  Thanks for your
> help.
>
> Steps that I followed
>
> 1.  SELECT pg_start_backup('label', true);
> 2.  I created tar of data.
> 3.  un tar the data on the slave.
> 4. deleted the pid file on the slave
> 5. created recovery.conf
> 6. ran SELECT pg_stop_backup(); on the primary
> 7. started postgres on the slave
> 8. after 2 hours, the slave server crashed
>
> Please help.
>
> Regards
>




Re: [GENERAL] Replication stopped on 9.0.2 after making change to conf file

2013-03-09 Thread Lonni J Friedman
It sounds like all you did was setup the slave from scratch with a
fresh base backup, without understanding or debugging what caused
everything to break.  Clearly whatever was wrong on March 5 is still
wrong, and nothing has been fixed.  The first step in debugging this
problem is to look at and/or post the log content (from both the
master & slave) from the time when this stopped working (March 5).
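
To pull just that day out of a server log, a trivial filter is enough given the timestamped log lines shown in this thread (the sample entries below are invented for illustration):

```shell
#!/bin/bash
# Print only the log entries stamped with a given date (YYYY-MM-DD).
# Usage: day_of_log 2013-03-05 < postgresql.log
day_of_log() {
  grep -F -- "$1"
}

# Demo on a few invented lines:
day_of_log 2013-03-05 <<'EOF'
2013-03-04 23:59:59 GMT LOG:  checkpoint complete
2013-03-05 00:00:12 GMT FATAL:  could not connect to the primary server
2013-03-06 08:00:00 GMT LOG:  database system is ready
EOF
```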

On Fri, Mar 8, 2013 at 11:06 PM, akp geek  wrote:
> What I noticed is...
>
> our streaming stopped on March 5th.  I restarted the process today.
> The new data is now syncing correctly, but the data for the dates between
> the 5th and today is not.  Is there something wrong that I did?
>
> here is what I did.
>
> 1. created a base backup
> 2.  Took the data directory on to the slave
> 3.  stopped the base backup on master
> 4. started the db on slave
>
> Appreciate your help.
>
> Regards
>
>
>
> On Sat, Mar 9, 2013 at 12:52 AM, akp geek  wrote:
>>
>> Thanks a lot. I started the replication, but it is very slow. It is taking a
>> long time to sync the master's data onto the slave. Is there a way to find what's
>> causing the issue?
>>
>> Regards
>>
>>
>>
>> On Fri, Mar 8, 2013 at 12:06 PM, John Laing  wrote:
>>>
>>> I'm not sure about the existence of any standard scripts, but we have a
>>> pair of checks running periodically on the backup server.
>>>
>>> This shouldn't return anything:
>>> tail -3 /var/log/postgresql/postgresql-9.1-main.log | grep FATAL
>>>
>>> And this should return something:
>>> ps -u postgres -o cmd | grep "postgres: wal receiver process   streaming"
>>>
>>> These have worked very reliably for many months.
>>>
>>> -John
>>>
>>>
>>> On Fri, Mar 8, 2013 at 11:53 AM, akp geek  wrote:

 I got it fixed.

 What I did was

 $ psql -c "SELECT pg_start_backup('label', true)"
 $ rsync -a ${PGDATA}/ standby:/srv/pgsql/standby/ --exclude
 postmaster.pid
 $ psql -c "SELECT pg_stop_backup()"

 It took a while a to catch up the data.


 One question I have: are there any scripts to monitor the status of the
 replication, so that I can be a little proactive?


 Regards



 On Thu, Mar 7, 2013 at 9:25 PM, Scott Marlowe 
 wrote:
>
> On Thu, Mar 7, 2013 at 5:28 PM, akp geek  wrote:
> > Hi all -
> >
> >Recently made change on our primary database
> >
> >   default_text_search_config = 'pg_catalog.simple' .
> > After that
> > the replication stopped. Can you please help me figure out how to fix the
> > issue? I
> > am sure I made the change on the slave also.
> >
> > How can I start the replication and catch up the data. Thanks for
> > your time.
>
> What are you seeing in your slony and / or postgresql logs, if
> anything?


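
On the monitoring question raised upthread, a sketch of a proactive check that asks the standby directly instead of grepping logs (hostname and threshold are placeholders; pg_last_xact_replay_timestamp() exists from 9.1 on):

```shell
#!/bin/bash
# Warn if the standby has left recovery or if replay has fallen behind.
STANDBY=standby.example.com
MAX_LAG_SECONDS=300

in_recovery=$(psql -h "$STANDBY" -U postgres -Atc 'SELECT pg_is_in_recovery()')
if [ "$in_recovery" != "t" ]; then
  echo "CRITICAL: standby is not in recovery"
  exit 2
fi

lag=$(psql -h "$STANDBY" -U postgres -Atc \
  "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)::int")
if [ "$lag" -gt "$MAX_LAG_SECONDS" ]; then
  echo "WARNING: replay lag is ${lag}s"
  exit 1
fi
echo "OK: replay lag is ${lag}s"
```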


Re: [GENERAL] broke postgres, how to fix??

2013-02-26 Thread Lonni J Friedman
On Tue, Feb 26, 2013 at 4:10 PM, JD Wong  wrote:
> Hi Adrian,
>
>> That's guaranteed to break everything badly.
>
>
> Even if I "read only style" copied the files? Do you mind elaborating on why
> this happens?  ( or point me to relevant documentation )

What is "read only style", and how does postgres know about this?

http://www.postgresql.org/docs/9.2/static/backup-file.html


>
> Thanks,
> -JD
>
> On Tue, Feb 26, 2013 at 7:04 PM, Lonni J Friedman 
> wrote:
>>
>> On Tue, Feb 26, 2013 at 4:02 PM, JD Wong  wrote:
>> > Hi Adrian, yes I completely copied the config-file and data directories
>> > over.
>> >
>> > Lonnie, I don't remember.  I might not have shut down the "old"
>> > postgres,
>> > yes I set PGDATA accordingly.
>
>
>>
>>
>>
>> That's guaranteed to break everything badly.
>
>



-- 
~
L. Friedmannetll...@gmail.com
LlamaLand   https://netllama.linux-sxs.org




Re: [GENERAL] broke postgres, how to fix??

2013-02-26 Thread Lonni J Friedman
On Tue, Feb 26, 2013 at 4:02 PM, JD Wong  wrote:
> Hi Adrian, yes I completely copied the config-file and data directories
> over.
>
> Lonnie, I don't remember.  I might not have shut down the "old" postgres,
> yes I set PGDATA accordingly.

That's guaranteed to break everything badly.




Re: [GENERAL] broke postgres, how to fix??

2013-02-26 Thread Lonni J Friedman
Did you shut down the 'old' postgres before copying these files?
Did you (re)configure the 'new' postgres to set its $PGDATA directory
to the location of the 'new' files?
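
For reference, a sketch of what a safe copy of an installation looks like (paths invented); the key points are exactly the two questions above: stop the old server first, and point the new one at its own copy:

```shell
#!/bin/bash
OLD_PGDATA=/var/lib/pgsql/old/data
NEW_PGDATA=/var/lib/pgsql/new/data

# 1. Stop the old server so the files on disk are consistent.
pg_ctl -D "$OLD_PGDATA" stop -m fast

# 2. Copy the entire data directory, preserving ownership and permissions.
rsync -a "$OLD_PGDATA/" "$NEW_PGDATA/"

# 3. Start the new server against its own copy.
pg_ctl -D "$NEW_PGDATA" start

# Afterwards the two clusters share no state; dropping databases in one
# cannot break the other unless both were ever started against the same
# directory, or tablespace symlinks still point at the old location.
```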

On Fri, Feb 22, 2013 at 3:46 PM, JD Wong  wrote:
> I tried copying postgres over to a new directory.  It was working until I
> deleted a bunch of databases from the "old" postgres.  Lo and behold this
> somehow broke the "new" copy too.
>
> Now when I start with pg_ctl
> 2013-02-22 18:36:13 EST DETAIL:  The database subdirectory "base/1066060" is
> missing.
> 2013-02-22 18:36:25 EST FATAL:  database "wormmine" does not exist
> 2013-02-22 18:36:25 EST DETAIL:  The database subdirectory "base/1027296" is
> missing.
> 2013-02-22 18:37:13 EST FATAL:  database "wormmine-dev" does not exist
>
> and it won't start...
>
> How can I fix this?  re-creating these databases with the old postgres
> didn't work...
>
> Also, why did this happen?  I created a new data directory for the new
> postgres, this should be impossible
>
> I have no ideas left, can anyone help?
>
> Thanks in advance,




[GENERAL] special procedure required when running pg_basebackup from a standby?

2013-01-13 Thread Lonni J Friedman
Greetings,
I'm running postgres-9.2.2 in a Linux-x86_64 cluster with 1 master and
several hot standby servers.  Since upgrading to 9.2.2 from 9.1.x a
few months ago, I switched from generating a base backup on the
master, to generating it on a dedicated slave/standby (to reduce the
load on the master).  The command that I've always used to generate
the base backup is:
pg_basebackup -v -D /tmp/bb0 -x -Ft -U postgres

However, I've noticed that whenever I use the base backup generated
from the standby to create a new standby server, many of the indexes
are corrupted.  This was never the case when I was generating the
basebackup directly from the master.  Now, I see errors similar to the
following when running queries against the tables that own the
indexes:
INDEX "debugger_2013_01_dacode_idx" contains unexpected zero page at block 12
HINT:  Please REINDEX it.
INDEX "smoke32on64tests_2013_01_suiteid_idx" contains unexpected zero
page at block 111
HINT:  Please REINDEX it.

I've confirmed that the errors/corruption doesn't exist on the server
that is generating the base backup (I can run the same SQL query which
fails on the new standby, successfully).  Also reindexing the index
does fix the problem.  So it seems that I'm potentially
misunderstanding some part of the process.  My setup process is to
simply untar the basebackup in the $PGDATA directory, and copy over
all the WAL logs into $PGDATA/pg_xlog.

thanks for any pointers.
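
The setup process described above can be sketched roughly as follows (paths invented; with -Ft the WAL captured by -x is already inside base.tar's pg_xlog, so the separate copy step may be redundant):

```shell
#!/bin/bash
PGDATA=/var/lib/pgsql/data

# Unpack the tar-format base backup (taken with: pg_basebackup -D /tmp/bb0 -x -Ft).
tar -xf /tmp/bb0/base.tar -C "$PGDATA"

# Tell the new standby where to stream from.
cat > "$PGDATA/recovery.conf" <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=master.example.com user=replication'
EOF

pg_ctl -D "$PGDATA" start
```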




[GENERAL] data corruption when using base backups generated from hot standby

2013-01-10 Thread Lonni J Friedman
Greetings,
I'm running postgres-9.2.2 in a Linux-x86_64 cluster with 1 master and
several hot standby servers.  Since upgrading to 9.2.2 from 9.1.x a
few months ago, I switched from generating a base backup on the
master, to generating it on a dedicated slave/standby (to reduce the
load on the master).  The command that I've always used to generate
the base backup is:
pg_basebackup -v -D /tmp/bb0 -x -Ft -U postgres

However, I've noticed that whenever I use the base backup generated
from the standby to create a new standby server, many of the indexes
are corrupted.  This was never the case when I was generating the
basebackup directly from the master.  Now, I see errors similar to the
following when running queries against the tables that own the
indexes:
INDEX "debugger_2013_01_dacode_idx" contains unexpected zero page at block 12
HINT:  Please REINDEX it.
INDEX "smoke32on64tests_2013_01_suiteid_idx" contains unexpected zero
page at block 111
HINT:  Please REINDEX it.

I've confirmed that the errors/corruption doesn't exist on the server
that is generating the base backup (I can run the same SQL query which
fails on the new standby, successfully).  Also reindexing the index
does fix the problem.  So it seems that I'm potentially
misunderstanding some part of the process.  My setup process is to
simply untar the basebackup in the $PGDATA directory, and copy over
all the WAL logs into $PGDATA/pg_xlog.

thanks for any pointers.




Re: [GENERAL] pgpool2 load balancing not working

2013-01-04 Thread Lonni J Friedman
On Fri, Jan 4, 2013 at 3:42 PM, Greg Donald  wrote:
> Sorry if this is the wrong list, but I've been stuck for a couple days
> now.  I tried pgpool-general but that list appears to not like me.
> I'm not getting any posts and my post hasn't shown up in the archives.

Specifically which address are you sending to?  I'm on the official
list, and it seems to work fine for me.

>
> I have a Python/Django app that will require database load balancing
> at some point in the near future.  In the meantime I'm trying to learn
> to implement pgpool on a local virtual machine setup.
>
> I have 4 Ubuntu 12.04 VMs:
>
> 192.168.1.80 <- pool, pgppool2 installed and accessible
> 192.168.1.81 <- db1 master
> 192.168.1.82 <- db2 slave
> 192.168.1.83 <- db3 slave
>
> I have pgpool-II version 3.1.1 and my database servers are running
> PostgreSQL 9.1.
>
> I have my app's db connection pointed to 192.168.1.80: and it works fine.
>
> The problem is when I use Apache ab to throw some load at it, none of
> SELECT queries appear to be balanced.  All the load goes to my db1
> master.  Also, very concerning is the load on the pool server itself,
> it is really high compared to db1, maybe an average of 8-10 times
> higher.  Meanwhile my db2 and db3 servers have a load of nearly zero,
> they appear to only be replicating from db1, which isn't very load
> intensive for my tests with ab.
>
> ab -n 300 -c 4 -C 'sessionid=80a5fd3b6bb59051515e734326735f80'
> http://192.168.1.17/contacts/
>
> That drives the load on my pool server up to about 2.3.  Load on db1
> is about 0.4 and load on db2 and db3 is nearly zero.
>
> Can someone take a look at my pgpool.conf and see if what I'm doing wrong?
>
> http://pastebin.com/raw.php?i=wzBc0aSp

Nothing is jumping out at me as blatantly wrong, although it seems
odd that each of your database servers is listening on a
different port number.

>
>
> I'm starting to think maybe it has something to do with Django
> wrapping every request in a transaction by default, but when the
> transaction only has SELECTs, shouldn't that be load balanced just
> fine?  Makes my stomach hurt to think I may have to turn off
> auto-commit and manually commit transactions all throughout my code :(
>  Still hoping it's a pgpool setup issue, since it's my first time
> setting it up and all.

I've never done anything with Django, but it seems quite possible that
the transactions are causing pgpool to get confused and assume that
every query requires write access.

What might be more useful is for you to post your actual pgpool log
somewhere, as that might contain a clue of what is going wrong.
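
One way to dig further (a sketch; pgpool-II 3.x): ask pgpool itself how it sees the backends, and turn on per-node statement logging to watch its routing decisions.

```shell
#!/bin/bash
# "SHOW pool_nodes" is intercepted by pgpool-II and reports each backend's
# status and load-balance weight; connect through the pgpool port.
psql -h 192.168.1.80 -p 9999 -U postgres -c 'SHOW pool_nodes;'

# In pgpool.conf, log every statement along with the node it was sent to,
# then reload pgpool and replay the ab test:
#   log_per_node_statement = on
```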




Re: [GENERAL] Table with million rows - and PostgreSQL 9.1 is not using the index

2012-12-04 Thread Lonni J Friedman
I'm no expert on this, but it will likely be more helpful to others if
you include the table description with all the indices.

On Tue, Dec 4, 2012 at 8:44 PM, Edson Richter  wrote:
> I have a table with over a million rows of streets.
> I'm making a partial search using zip code, and PostgreSQL is ignoring my
> ZIP index.
> I'm sure I'm making some mistake, but I can't see where.
> The query is:
>
> SELECT t2.ID, t2.CEP, t2.COMPLEMENTO, t2.NOME, t2.NOMESEMACENTOS, t2.TIPO,
> t2.BAIRRO_ID
>   FROM LOCALIDADE t0, LOGRADOURO t2, BAIRRO t1
>  WHERE t2.CEP LIKE '81630160%' AND ((t1.ID = t2.BAIRRO_ID) AND (t0.ID =
> t1.LOCALIDADE_ID)) ORDER BY t0.NOME;
>
> (for reference, BAIRRO = town, LOCALIDADE = city, LOGRADOURO = street)
>
> Here is the result of explain analyze:
>
> Sort  (cost=11938.72..11938.74 rows=91 width=93)
>   Sort Key: t0.nome
>   ->  Nested Loop  (cost=0.00..11938.42 rows=91 width=93)
> ->  Nested Loop  (cost=0.00..11935.19 rows=91 width=85)
>   ->  Seq Scan on logradouro t2  (cost=0.00..11634.42 rows=91
> width=81)
> Filter: ((cep)::text ~~ '81630160%'::text)
>   ->  Index Scan using pkbairro on bairro t1 (cost=0.00..3.30
> rows=1 width=8)
> Index Cond: (id = t2.bairro_id)
> ->  Index Scan using pklocalidade on localidade t0 (cost=0.00..0.03
> rows=1 width=16)
>   Index Cond: ((id)::text = (t1.localidade_id)::text)
>
> I've few tweaks in postgresql.conf:
> shared_buffers = 2GB
> temp_buffers = 32MB
> max_prepared_transactions = 50
> work_mem = 32MB
> maintenance_work_mem = 16MB
> max_stack_depth = 4MB
> max_files_per_process = 15000
> random_page_cost = 2.0
> cpu_tuple_cost = 0.001
> cpu_index_tuple_cost = 0.0005
> cpu_operator_cost = 0.00025
> effective_cache_size = 512MB
>
> Everything else is default configuration.
>
> This machine is Intel Quad 3.1Ghz, with 8 threads, 8Gig of RAM, 8Gig of
> Swap, running CentOS 6.3 64bit.
> Machine is free almost all the time.
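
The table and index definitions being asked for can be captured with psql's describe command (database name is a placeholder; table names taken from the query in the thread):

```shell
#!/bin/bash
# \d+ prints columns, indexes, and storage details for each table.
psql -d mydb -c '\d+ logradouro'
psql -d mydb -c '\d+ bairro'
psql -d mydb -c '\d+ localidade'
```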




Re: [GENERAL] Postgresql logfilename and times in GMT - not EST

2012-12-04 Thread Lonni J Friedman
On Tue, Dec 4, 2012 at 2:42 PM, Tom Lane  wrote:
> Lonni J Friedman  writes:
>> On Tue, Dec 4, 2012 at 1:59 PM, Bryan Montgomery  wrote:
>>> I changed postgres.conf to have timezone = 'EST' and restarted postgres.
>>> However the log file is still 5 hours ahead. What gives? Not the end of the
>>> world but a bit annoying.
>
>> you need to set log_timezone.  This is a new 'feature' in 9.2 that
>> annoyed me as well.  I assume that there was a good use case for this.
>
> "New"?  log_timezone has been around since 8.3, and it seems like a good
> idea to me --- what if you have N sessions each with its own active
> timezone setting?  Timestamps in the log would be an unreadable mismash
> if there weren't a separate log_timezone setting.
>
> What did change in 9.2 is that initdb sets values for timezone and
> log_timezone in postgresql.conf, so it's the initdb environment that
> will determine what you get in the absence of any manual action.
> Before that it was the postmaster's environment.

Sorry, I meant 'new' in the sense that its impact changed in 9.2: it now
needs to be set explicitly to avoid getting UTC by default, whereas in
the past that wasn't required.




Re: [GENERAL] Postgresql logfilename and times in GMT - not EST

2012-12-04 Thread Lonni J Friedman
On Tue, Dec 4, 2012 at 1:59 PM, Bryan Montgomery  wrote:
> We have a test 9.2.0 db running on openSuse 12.2. When I select now() I get
> the correct timezone and date back (-5 hours).
> When I do date at the os prompt, I get the right timezone back.
>
> I changed postgres.conf to have timezone = 'EST' and restarted postgres.
> However the log file is still 5 hours ahead. What gives? Not the end of the
> world but a bit annoying.

you need to set log_timezone.  This is a new 'feature' in 9.2 that
annoyed me as well.  I assume that there was a good use case for this.
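
Concretely, the fix is one extra line next to the timezone setting already changed (a sketch; 'EST' matches the thread, though a full zone name such as 'America/New_York' also works and tracks DST):

```shell
#!/bin/bash
# Append both settings to postgresql.conf, then reload (log_timezone is
# reloadable without a full restart).
cat >> /var/lib/pgsql/data/postgresql.conf <<'EOF'
timezone = 'EST'
log_timezone = 'EST'
EOF
pg_ctl -D /var/lib/pgsql/data reload
```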




Re: [GENERAL] Hot Standby Not So Hot Anymore

2012-11-05 Thread Lonni J Friedman
On Mon, Nov 5, 2012 at 8:31 PM, Ian Harding  wrote:
>
>
>
> On Mon, Nov 5, 2012 at 8:15 PM, Lonni J Friedman  wrote:
>>
>> On Mon, Nov 5, 2012 at 8:13 PM, Ian Harding  wrote:
>> >
>> >
>> >
>> > On Mon, Nov 5, 2012 at 7:57 PM, Lonni J Friedman 
>> > wrote:
>> >>
>> >> On Mon, Nov 5, 2012 at 7:49 PM, Ian Harding 
>> >> wrote:
>> >> >
>> >> >
>> >> > On Mon, Nov 5, 2012 at 7:46 PM, Lonni J Friedman 
>> >> > wrote:
>> >> >>
>> >> >> On Mon, Nov 5, 2012 at 7:40 PM, Ian Harding 
>> >> >> wrote:
>> >> >> > I had a 9.0.8 hot standby setup, one master, two slaves, working
>> >> >> > great.
>> >> >> > Then, I tried to re-initialize by making a base backup, the way
>> >> >> > I've
>> >> >> > done it
>> >> >> > many times before, but for some reason I can't get the standby to
>> >> >> > accept
>> >> >> > connections.  I copied the postgresql.conf and recovery.conf out
>> >> >> > of
>> >> >> > the
>> >> >> > way, cleaned the data directory and extracted the backup files,
>> >> >> > then
>> >> >> > replaced the conf files.  Everything works perfectly, but I keep
>> >> >> > getting
>> >> >> > :FATAL:  the database system is starting up
>> >> >> >
>> >> >> > I know I'm an idiot and that I did something wrong but I can't
>> >> >> > find
>> >> >> > it.
>> >> >> >
>> >> >> > [root@db03 data]# grep standby postgresql.conf
>> >> >> > wal_level = hot_standby# minimal, archive, or
>> >> >> > hot_standby
>> >> >> > hot_standby = on# "on" allows queries during recovery
>> >> >> > max_standby_archive_delay = -1# max delay before canceling
>> >> >> > queries
>> >> >> > max_standby_streaming_delay = -1# max delay before canceling
>> >> >> > queries
>> >> >> >
>> >> >> > [root@db03 data]# grep standby recovery.conf
>> >> >> > # Specifies whether to start the server as a standby. In streaming
>> >> >> > replication,
>> >> >> > standby_mode  = 'on'
>> >> >> > # Specifies a connection string which is used for the standby
>> >> >> > server
>> >> >> > to
>> >> >> > connect
>> >> >> >
>> >> >> > [root@db03 data]# ps aux | grep stream
>> >> >> > postgres 62127  0.1  0.0 34554356 3724 ?   Ss   16:22   0:16
>> >> >> > postgres:
>> >> >> > wal receiver process   streaming BA7/12B87818
>> >> >> >
>> >> >> >
>> >> >> > [root@db03 data]# ps aux | grep startup
>> >> >> > postgres 62122  0.5  0.9 34545900 1223708 ?Ss   16:22   1:03
>> >> >> > postgres:
>> >> >> > startup process   recovering 00010BA70012
>> >> >> >
>> >> >> > Any suggestions appreciated!!!
>> >> >>
>> >> >> What's your log have to say?
>> >> >
>> >> >
>> >> > It says everything is happy as normal...
>> >> >
>> >> > 2012-11-05 16:22:38.744 PST -  :LOG:  database system was shut down
>> >> > in
>> >> > recovery at 2012-11-05 16:22:33 PST
>> >> > 2012-11-05 16:22:38.745 PST -  :LOG:  entering standby mode
>> >> > 2012-11-05 16:22:38.746 PST -  :LOG:  redo starts at BA5/F96F86A8
>> >> > 2012-11-05 16:22:38.762 PST - postgres :FATAL:  the database system
>> >> > is
>> >> > starting up
>> >> > 2012-11-05 16:22:39.764 PST - postgres :FATAL:  the database system
>> >> > is
>> >> > starting up
>> >> > 2012-11-05 16:22:40.766 PST - postgres :FATAL:  the database system
>> >> > is
>> >> > starting up
>> >> > 2012-11-05 16:22:41.200 PST -  :LOG:  invalid record length at
>> >> > BA6/6DCBA48
>> >> > 2012-11-05 16:22:41.206 PST -  :LOG:  streaming replication
>> >> > successfully
>> >> > connected to primary
>> >> >
>> >> >
>> >> > This is after I stopped and restarted... after that just more failed
>> >> > connection attempts.
>> >>
>> >> Are you sure that its not still recovering?  In other words, when you
>> >> look at the 'ps' output do the 'wal receiver process' and 'recovering'
>> >> entries change?
>> >
>> >
>> > Yeah, it's still recovering.  In normal operation it keeps right on
>> > recovering, it's just that it used to accept read-only connections...
>>
>> Is the 2nd slave also experiencing this problem?  If not, is it
>> further ahead than the bad one?
>
>
> Everything looks fine (forgot to cc the list...)
>
>
> [root@db0 ~]# ps aux | grep stream
> postgres 45267  0.0  0.0 34546456 3036 ?   Ss   16:22   0:14 postgres:
> wal sender process postgres 192.168.4.4(55925) streaming BA7/5FC1BFD8
> postgres 54247  0.0  0.0 34546456 3036 ?   Ss   18:15   0:07 postgres:
> wal sender process postgres 192.168.4.3(57482) streaming BA7/5FC1BFD8
>

Right, but what I meant was, are both slaves experiencing the same
problem, or just one?  Did you use the same base backup on both of them,
or did you generate a different one for each slave?




Re: [GENERAL] Hot Standby Not So Hot Anymore

2012-11-05 Thread Lonni J Friedman
On Mon, Nov 5, 2012 at 8:13 PM, Ian Harding  wrote:
>
>
>
> On Mon, Nov 5, 2012 at 7:57 PM, Lonni J Friedman  wrote:
>>
>> On Mon, Nov 5, 2012 at 7:49 PM, Ian Harding  wrote:
>> >
>> >
>> > On Mon, Nov 5, 2012 at 7:46 PM, Lonni J Friedman 
>> > wrote:
>> >>
>> >> On Mon, Nov 5, 2012 at 7:40 PM, Ian Harding 
>> >> wrote:
>> >> > I had a 9.0.8 hot standby setup, one master, two slaves, working
>> >> > great.
>> >> > Then, I tried to re-initialize by making a base backup, the way I've
>> >> > done it
>> >> > many times before, but for some reason I can't get the standby to
>> >> > accept
>> >> > connections.  I copied the postgresql.conf and recovery.conf out of
>> >> > the
>> >> > way, cleaned the data directory and extracted the backup files, then
>> >> > replaced the conf files.  Everything works perfectly, but I keep
>> >> > getting
>> >> > :FATAL:  the database system is starting up
>> >> >
>> >> > I know I'm an idiot and that I did something wrong but I can't find
>> >> > it.
>> >> >
>> >> > [root@db03 data]# grep standby postgresql.conf
>> >> > wal_level = hot_standby# minimal, archive, or hot_standby
>> >> > hot_standby = on# "on" allows queries during recovery
>> >> > max_standby_archive_delay = -1# max delay before canceling
>> >> > queries
>> >> > max_standby_streaming_delay = -1# max delay before canceling
>> >> > queries
>> >> >
>> >> > [root@db03 data]# grep standby recovery.conf
>> >> > # Specifies whether to start the server as a standby. In streaming
>> >> > replication,
>> >> > standby_mode  = 'on'
>> >> > # Specifies a connection string which is used for the standby server
>> >> > to
>> >> > connect
>> >> >
>> >> > [root@db03 data]# ps aux | grep stream
>> >> > postgres 62127  0.1  0.0 34554356 3724 ?   Ss   16:22   0:16
>> >> > postgres:
>> >> > wal receiver process   streaming BA7/12B87818
>> >> >
>> >> >
>> >> > [root@db03 data]# ps aux | grep startup
>> >> > postgres 62122  0.5  0.9 34545900 1223708 ?Ss   16:22   1:03
>> >> > postgres:
>> >> > startup process   recovering 00010BA70012
>> >> >
>> >> > Any suggestions appreciated!!!
>> >>
>> >> What's your log have to say?
>> >
>> >
>> > It says everything is happy as normal...
>> >
>> > 2012-11-05 16:22:38.744 PST -  :LOG:  database system was shut down in
>> > recovery at 2012-11-05 16:22:33 PST
>> > 2012-11-05 16:22:38.745 PST -  :LOG:  entering standby mode
>> > 2012-11-05 16:22:38.746 PST -  :LOG:  redo starts at BA5/F96F86A8
>> > 2012-11-05 16:22:38.762 PST - postgres :FATAL:  the database system is
>> > starting up
>> > 2012-11-05 16:22:39.764 PST - postgres :FATAL:  the database system is
>> > starting up
>> > 2012-11-05 16:22:40.766 PST - postgres :FATAL:  the database system is
>> > starting up
>> > 2012-11-05 16:22:41.200 PST -  :LOG:  invalid record length at
>> > BA6/6DCBA48
>> > 2012-11-05 16:22:41.206 PST -  :LOG:  streaming replication successfully
>> > connected to primary
>> >
>> >
>> > This is after I stopped and restarted... after that just more failed
>> > connection attempts.
>>
>> Are you sure that it's not still recovering?  In other words, when you
>> look at the 'ps' output do the 'wal receiver process' and 'recovering'
>> entries change?
>
>
> Yeah, it's still recovering.  In normal operation it keeps right on
> recovering, it's just that it used to accept read-only connections...

Is the 2nd slave also experiencing this problem?  If not, is it
further ahead than the bad one?




Re: [GENERAL] Hot Standby Not So Hot Anymore

2012-11-05 Thread Lonni J Friedman
On Mon, Nov 5, 2012 at 7:49 PM, Ian Harding  wrote:
>
>
> On Mon, Nov 5, 2012 at 7:46 PM, Lonni J Friedman  wrote:
>>
>> On Mon, Nov 5, 2012 at 7:40 PM, Ian Harding  wrote:
>> > I had a 9.0.8 hot standby setup, one master, two slaves, working great.
>> > Then, I tried to re-initialize by making a base backup, the way I've
>> > done it
>> > many times before, but for some reason I can't get the standby to accept
>> > connections.  I copied the postgresql.conf and recovery.conf out of the
>> > way, cleaned the data directory and extracted the backup files, then
>> > replaced the conf files.  Everything works perfectly, but I keep getting
>> > :FATAL:  the database system is starting up
>> >
>> > I know I'm an idiot and that I did something wrong but I can't find it.
>> >
>> > [root@db03 data]# grep standby postgresql.conf
>> > wal_level = hot_standby# minimal, archive, or hot_standby
>> > hot_standby = on# "on" allows queries during recovery
>> > max_standby_archive_delay = -1# max delay before canceling queries
>> > max_standby_streaming_delay = -1# max delay before canceling queries
>> >
>> > [root@db03 data]# grep standby recovery.conf
>> > # Specifies whether to start the server as a standby. In streaming
>> > replication,
>> > standby_mode  = 'on'
>> > # Specifies a connection string which is used for the standby server to
>> > connect
>> >
>> > [root@db03 data]# ps aux | grep stream
>> > postgres 62127  0.1  0.0 34554356 3724 ?   Ss   16:22   0:16
>> > postgres:
>> > wal receiver process   streaming BA7/12B87818
>> >
>> >
>> > [root@db03 data]# ps aux | grep startup
>> > postgres 62122  0.5  0.9 34545900 1223708 ?Ss   16:22   1:03
>> > postgres:
>> > startup process   recovering 00010BA70012
>> >
>> > Any suggestions appreciated!!!
>>
>> What's your log have to say?
>
>
> It says everything is happy as normal...
>
> 2012-11-05 16:22:38.744 PST -  :LOG:  database system was shut down in
> recovery at 2012-11-05 16:22:33 PST
> 2012-11-05 16:22:38.745 PST -  :LOG:  entering standby mode
> 2012-11-05 16:22:38.746 PST -  :LOG:  redo starts at BA5/F96F86A8
> 2012-11-05 16:22:38.762 PST - postgres :FATAL:  the database system is
> starting up
> 2012-11-05 16:22:39.764 PST - postgres :FATAL:  the database system is
> starting up
> 2012-11-05 16:22:40.766 PST - postgres :FATAL:  the database system is
> starting up
> 2012-11-05 16:22:41.200 PST -  :LOG:  invalid record length at BA6/6DCBA48
> 2012-11-05 16:22:41.206 PST -  :LOG:  streaming replication successfully
> connected to primary
>
>
> This is after I stopped and restarted... after that just more failed
> connection attempts.

Are you sure that it's not still recovering?  In other words, when you
look at the 'ps' output do the 'wal receiver process' and 'recovering'
entries change?




Re: [GENERAL] Hot Standby Not So Hot Anymore

2012-11-05 Thread Lonni J Friedman
On Mon, Nov 5, 2012 at 7:40 PM, Ian Harding  wrote:
> I had a 9.0.8 hot standby setup, one master, two slaves, working great.
> Then, I tried to re-initialize by making a base backup, the way I've done it
> many times before, but for some reason I can't get the standby to accept
> connections.  I copied the postgresql.conf and recovery.conf out of the
> way, cleaned the data directory and extracted the backup files, then
> replaced the conf files.  Everything works perfectly, but I keep getting
> :FATAL:  the database system is starting up
>
> I know I'm an idiot and that I did something wrong but I can't find it.
>
> [root@db03 data]# grep standby postgresql.conf
> wal_level = hot_standby# minimal, archive, or hot_standby
> hot_standby = on# "on" allows queries during recovery
> max_standby_archive_delay = -1# max delay before canceling queries
> max_standby_streaming_delay = -1# max delay before canceling queries
>
> [root@db03 data]# grep standby recovery.conf
> # Specifies whether to start the server as a standby. In streaming
> replication,
> standby_mode  = 'on'
> # Specifies a connection string which is used for the standby server to
> connect
>
> [root@db03 data]# ps aux | grep stream
> postgres 62127  0.1  0.0 34554356 3724 ?   Ss   16:22   0:16 postgres:
> wal receiver process   streaming BA7/12B87818
>
>
> [root@db03 data]# ps aux | grep startup
> postgres 62122  0.5  0.9 34545900 1223708 ?Ss   16:22   1:03 postgres:
> startup process   recovering 00010BA70012
>
> Any suggestions appreciated!!!

What's your log have to say?




Re: [GENERAL] Quick estimate of num of rows & table size

2012-11-05 Thread Lonni J Friedman
On Mon, Nov 5, 2012 at 3:56 PM, Thalis Kalfigkopoulos
 wrote:
>
> On Mon, Nov 5, 2012 at 7:14 PM, Lonni J Friedman  wrote:
>>
>> On Mon, Nov 5, 2012 at 2:02 PM, Thalis Kalfigkopoulos
>>  wrote:
>> > Hi all,
>> >
>> > I read somewhere that the following query gives a quick estimate of the
>> > # of
>> > rows in a table regardless of the table's size (which would matter in a
>> > simple SELECT count(*)?):
>> >
>> > SELECT (CASE WHEN reltuples > 0 THEN
>> > pg_relation_size('mytable')/(8192*relpages/reltuples)
>> > ELSE 0
>> > END)::bigint AS estimated_row_count
>> > FROM pg_class
>> > WHERE oid = 'mytable'::regclass;
>> >
>> > If relpages & reltuples are recorded accurately each time VACUUM is run,
>> > wouldn't it be the same to just grab directly the value of reltuples
>> > like:
>> >
>> > SELECT reltuples FROM pg_class WHERE oid='mytable'::regclass;
>> >
>> > In the same manner, are pg_relation_size('mytable') and 8192*relpages
>> > the
>> > same?
>> >
>> > I ran both assumptions against a freshly VACUUMed table and they seem
>> > correct.
>>
>> This doesn't seem to work for me.  I get an estimated row_count of 0
>> on a table that I know has millions of rows.
>
>
> Which one doesn't work exactly? The larger query? Are you on a 9.x?

doh, sorry.  The first/larger doesn't work.  As it turns out the 2nd
actually does work well.  I'm on 9.1.x.
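For reference, the zero result from the larger query is explained by its CASE arm: pg_class.reltuples is 0 for a table that has never been VACUUMed or ANALYZEd, and the query deliberately returns 0 rather than divide by zero.  A minimal sketch of the arithmetic (Python, illustrative only):

```python
# Sketch of the arithmetic behind the pg_class row estimate:
#   estimated_rows = on_disk_size / (block_size * relpages / reltuples)
# i.e. the current relation size divided by the average bytes-per-row
# observed at the last VACUUM/ANALYZE.

BLOCK_SIZE = 8192  # PostgreSQL's default page size

def estimated_row_count(relation_size: int, relpages: int, reltuples: float) -> int:
    """Mirror of the SQL CASE expression: 0 when the stats say 'no rows seen'."""
    if reltuples <= 0:
        return 0  # no statistics yet -> the query reports 0, not an error
    bytes_per_row = BLOCK_SIZE * relpages / reltuples
    return int(relation_size / bytes_per_row)

# If the table has not grown since the last VACUUM, the estimate simply
# reproduces reltuples; if it has doubled in size, the estimate doubles.
```

So a table reporting an estimate of 0 despite holding millions of rows most likely just has stale or absent statistics; running ANALYZE on it should fix the estimate.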




Re: [GENERAL] Quick estimate of num of rows & table size

2012-11-05 Thread Lonni J Friedman
On Mon, Nov 5, 2012 at 2:02 PM, Thalis Kalfigkopoulos
 wrote:
> Hi all,
>
> I read somewhere that the following query gives a quick estimate of the # of
> rows in a table regardless of the table's size (which would matter in a
> simple SELECT count(*)?):
>
> SELECT (CASE WHEN reltuples > 0 THEN
> pg_relation_size('mytable')/(8192*relpages/reltuples)
> ELSE 0
> END)::bigint AS estimated_row_count
> FROM pg_class
> WHERE oid = 'mytable'::regclass;
>
> If relpages & reltuples are recorded accurately each time VACUUM is run,
> wouldn't it be the same to just grab directly the value of reltuples like:
>
> SELECT reltuples FROM pg_class WHERE oid='mytable'::regclass;
>
> In the same manner, are pg_relation_size('mytable') and 8192*relpages the
> same?
>
> I ran both assumptions against a freshly VACUUMed table and they seem
> correct.

This doesn't seem to work for me.  I get an estimated row_count of 0
on a table that I know has millions of rows.




Re: [GENERAL] dropdb breaks replication?

2012-10-31 Thread Lonni J Friedman
On Wed, Oct 31, 2012 at 11:01 AM, Edson Richter
 wrote:
> Em 31/10/2012 15:39, Lonni J Friedman escreveu:
>>
>> On Wed, Oct 31, 2012 at 10:32 AM, Edson Richter
>>  wrote:
>>>
>>> I've two PostgreSQL 9.1.6 running on Linux CentOS 5.8 64bit.
>>> They are replicated asynchronously.
>>>
>>> Yesterday, I've dropped a database of 20Gb, and then replication has
>>> broken,
>>> requiring me to manually synchronize both servers again.
>>>
>>> It is expected that dropdb (or, perhaps, createdb) break existing
>>> replication between servers?
>>
>> How did you determine that replication was broken, and how did you
>> manually synchronize the servers?  Are you certain that replication
>> was working prior to dropping the database?
>>
>>
> I'm sure replication was running.
> I usually keep two windows open in both servers, running
>
> In master:
>
> watch -n 2 "ps aux | egrep sender"
>
> In slave:
>
> watch -n 2 "ps aux | egrep receiver"
>
>
> At the point the dropdb command has been executed, both disappeared from my
> "radar".
> Also, in the log there is the following error:
>
> LOG:  replicação em fluxo conectou-se com sucesso ao servidor principal
> FATAL:  não pôde receber dados do fluxo do WAL: FATAL:  segmento do WAL
> solicitado 0001000100BE já foi removido
>
>
> Could the cause be not having enough WAL segments (currently 80) for the dropdb command?
> Is dropdb logged in transaction log page-by-page excluded?

I can't read Portuguese(?), but I think the gist of the error is that
the WAL segment was already removed before the slave could consume it.
 I'm guessing that you aren't keeping enough of them, and dropping the
database generated a huge volume which flushed out the old ones before
they could get consumed by your slave.
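If that is the cause, the usual 9.1-era mitigations are keeping more WAL segments on the master, or archiving completed WAL somewhere a lagging slave can replay it from.  A sketch, with illustrative values and a hypothetical archive path:

```
# postgresql.conf on the master -- values are illustrative, not a recommendation
wal_keep_segments = 512     # keep ~8 GB of WAL (512 x 16 MB segments)

# or, more robustly, archive completed segments so a lagging slave can
# fetch them even after the master has recycled its pg_xlog copies:
archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f'   # hypothetical destination
```

The slave's recovery.conf would then need a matching restore_command pointing at the same archive.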




Re: [GENERAL] dropdb breaks replication?

2012-10-31 Thread Lonni J Friedman
On Wed, Oct 31, 2012 at 10:32 AM, Edson Richter
 wrote:
> I've two PostgreSQL 9.1.6 running on Linux CentOS 5.8 64bit.
> They are replicated asynchronously.
>
> Yesterday, I've dropped a database of 20Gb, and then replication has broken,
> requiring me to manually synchronize both servers again.
>
> It is expected that dropdb (or, perhaps, createdb) break existing
> replication between servers?

How did you determine that replication was broken, and how did you
manually synchronize the servers?  Are you certain that replication
was working prior to dropping the database?




Re: [GENERAL] 9.1 to 9.2 requires a dump/reload?

2012-10-22 Thread Lonni J Friedman
pg_upgrade has worked fine for several releases.  I believe that the
only time when pg_upgrade isn't a viable option is for some types of
GIST indices.

On Mon, Oct 22, 2012 at 2:55 PM, Nikolas Everett  wrote:
> I was just looking at
> http://www.postgresql.org/docs/devel/static/release-9-2.html and it
> mentioned that a dump/reload cycle was required to upgrade from a previous
> release.  I just got done telling some of my coworkers that PG had been
> bitten by this enough times that they were done with it.  Am I wrong?  Is
> this normal?
>
> I see that pg_upgrade is an option.  Having never used how long should I
> expect pg_upgrade to take?  Obviously we'll measure it in our environment,
> but it'd be nice to have a ballpark figure.
>
> Nik



-- 
~
L. Friedman    netll...@gmail.com
LlamaLand   https://netllama.linux-sxs.org




Re: [GENERAL] Strategies/Best Practises Handling Large Tables

2012-10-12 Thread Lonni J Friedman
On Fri, Oct 12, 2012 at 7:44 AM, Chitra Creta  wrote:
> Hi,
>
> I currently have a table that is growing very quickly - i.e 7 million
> records in 5 days. This table acts as a placeholder for statistics, and
> hence the records are merely inserted and never updated or deleted.
>
> Many queries are run on this table to obtain trend analysis. However, these
> queries are now starting to take a very long time (hours) to execute due to
> the size of the table.
>
> I have put indexes on this table, to no significant benefit.  Some of the
> other strategies I have thought of:
> 1. Purge old data
> 2. Reindex
> 3. Partition
> 4. Creation of daily, monthly, yearly summary tables that contains
> aggregated data specific to the statistics required
>
> Does anyone know what is the best practice to handle this situation?
>
> I would appreciate knowledge sharing on the pros and cons of the above, or
> if there are any other strategies that I could put in place.

Partitioning is probably your best solution.  Options 3 & 4 sound like
variations on the same thing.  Before you go that route, you should
make sure that your bottleneck is really a result of the massive
amount of data, and not some other problem.  Are you sure that the
indices you created are being used, and that you have all the indices
that you need for your queries?  Look at the query plan output from
EXPLAIN, and/or post here if you're unsure.

Reindexing shouldn't make a difference unless something is wrong with
the indices that you already have in place.

Purging old data is only a good solution if you do not need the data,
and never will need the data.
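For reference, on 9.x "partitioning" means table inheritance plus CHECK constraints (declarative partitioning did not exist yet).  A minimal monthly sketch with hypothetical table and column names:

```sql
-- Hypothetical stats table partitioned by month via inheritance (9.x style).
CREATE TABLE stats (
    recorded_at timestamptz NOT NULL,
    metric      text,
    value       numeric
);

CREATE TABLE stats_2012_10 (
    CHECK (recorded_at >= '2012-10-01' AND recorded_at < '2012-11-01')
) INHERITS (stats);

CREATE INDEX stats_2012_10_recorded_at_idx ON stats_2012_10 (recorded_at);

-- Inserts are routed to the right child with a trigger (or done directly
-- against the child table).  With constraint_exclusion = partition, a query
-- constrained on recorded_at scans only the children whose CHECK can match,
-- and old months can be dropped instantly with DROP TABLE instead of DELETE.
```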




Re: [GENERAL] strange hot_standby behaviour

2012-10-01 Thread Lonni J Friedman
On Mon, Oct 1, 2012 at 7:28 AM, pfote  wrote:
> Hi,
>
> I had a very strange effect on the weekend that smells like a bug, so i'd
> like so share it.
>
> Setup:
> machine A: 16 CPU Cores (modern), 128GB RAM, nice 6-drive SAS Raid-10
> machines B, C: 8 Cores (substantially older than A), 48GB Ram, some scsi
> Raid, substantially slower than A
>
> The workload is about 80% - 90% SELECTs with heavy sorting and grouping, the
> remaining are INSERTs/UPDATEs/DELETEs.
> So In the original setup A  is the master, B and C are hot standby's that
> process some of the SELECTs, but by far the most processing is done on the
> master (A). pg version is 9.0.6. CPU utilization is about 80% on the master
> and between 90-100% in the standby's, so it's decided to upgrade to the
> latest 9.2 to profit from the latest performance enhancements.
>
> So B gets upgraded to 9.2.1-1.pgdg60+1 (from pgapt.debian.org) and becomes
> master, then A becomes a hot_standby slave that takes all the SELECTs (and C
> becomes another hot_standby). In the beginning everything works as expected,
> CPU utilization drops from 80% to about 50-60%, selects run faster,
> everything looks smoother (some queries drop from >5s to <1s due to 9.2s
> index-only-scan feature). Its friday, everyone is happy.
>
> About 16 hours later, saturday morning around 6:00, A suddenly goes wild and
> has a CPU utilization of 100% without a change in the workload, out of the
> blue. Queries that used to take <1s suddenly take 5-10s, "explain analyze"
> plans of these queries haven't changed a bit though. Switching the workload
> off causes the server to become idle. (while I'm writing this I realize we
> haven't tried to restart A). Instead, $boss decides to switch back to the
> original setup, so B gets dropped, A becomes master and gets 100% of the
> workload (all SELECTs/INSERTs/UPDATEs/DELETEs), and everything becomes just
> like Friday, CPU usage drops to 50-60%, everything runs smoothly.
>
> I'm not sure yet if this is replication related or a 9.2.1 problem. Any
> Ideas?

This could be just about anything.  Which OS are you running?  Did you
check any logs when everything went crazy?




Re: [GENERAL] initial sync of multiple streaming slaves simultaneously

2012-09-19 Thread Lonni J Friedman
Just curious, is there a reason why you can't use pg_basebackup ?

On Wed, Sep 19, 2012 at 12:27 PM, Mike Roest  wrote:
>
>> Is there any hidden issue with this that we haven't seen.  Or does anyone
>> have suggestions as to an alternate procedure that will allow 2 slaves to
>> sync concurrently.
>>
> With some more testing I've done today I seem to have found an issue with
> this procedure.
> When the slave starts up after the sync It reaches what it thinks is a
> consistent recovery point very fast based on the pg_stop_backup
>
> eg:
> (from the recover script)
> 2012-09-19 12:15:02: pgsql_start start
> 2012-09-19 12:15:31: pg_start_backup
> 2012-09-19 12:15:31: --------------
> 2012-09-19 12:15:31: 61/3020
> 2012-09-19 12:15:31: (1 row)
> 2012-09-19 12:15:31:
> 2012-09-19 12:15:32: NOTICE:  pg_stop_backup complete, all required WAL
> segments have been archived
> 2012-09-19 12:15:32: pg_stop_backup
> 2012-09-19 12:15:32: --------------
> 2012-09-19 12:15:32: 61/30D8
> 2012-09-19 12:15:32: (1 row)
> 2012-09-19 12:15:32:
>
> While the sync was running (but after the pg_stop_backup) I pushed a bunch
> of traffic against the master server.  Which got me to a current xlog
> location of
> postgres=# select pg_current_xlog_location();
>  pg_current_xlog_location
> --
>  61/6834C450
> (1 row)
>
> The startup of the slave after the sync completed:
> 2012-09-19 12:42:49.976 MDT [18791]: [1-1] LOG:  database system was
> interrupted; last known up at 2012-09-19 12:15:31 MDT
> 2012-09-19 12:42:49.976 MDT [18791]: [2-1] LOG:  creating missing WAL
> directory "pg_xlog/archive_status"
> 2012-09-19 12:42:50.143 MDT [18791]: [3-1] LOG:  entering standby mode
> 2012-09-19 12:42:50.173 MDT [18792]: [1-1] LOG:  streaming replication
> successfully connected to primary
> 2012-09-19 12:42:50.487 MDT [18791]: [4-1] LOG:  redo starts at 61/3020
> 2012-09-19 12:42:50.495 MDT [18791]: [5-1] LOG:  consistent recovery state
> reached at 61/3100
> 2012-09-19 12:42:50.495 MDT [18767]: [2-1] LOG:  database system is ready to
> accept read only connections
>
> It shows the DB reached a consistent state as of 61/3100 which is well
> behind the current location of the master (and the data files that were
> synced over to the slave).  And monitoring the server showed the expected
> slave delay that disappeared as the slave pulled and recovered from the WAL
> files that go generated after the pg_stop_backup.
>
> But based on this it looks like this procedure would end up with an
> indeterminate amount of time (based on how much traffic the master processed
> while the slave was syncing) that the slave couldn't be trusted for fail
> over or querying as the server is up and running but is not actually in a
> consistent state.
>
> Thinking it through the more complicated script version of the 2 server
> recovery (where first past the post to run start_backup or stop_backup)
> would also have this issue (although our failover slave would always be the
> one running stop backup as it syncs faster so at least it would be always
> consistent but the DR would still have the problem)




Re: [GENERAL] initial sync of multiple streaming slaves simultaneously

2012-09-19 Thread Lonni J Friedman
On Wed, Sep 19, 2012 at 8:59 AM, Mike Roest  wrote:
> Hey Everyone,
> We currently have a 9.1.5 postgres cluster running using streaming
> replication.  We have 3 nodes right now
>
> 2 - local that are setup with pacemaker for a HA master/slave set failover
> cluster
> 1 - remote as a DR.
>
> Currently we're syncing with the pretty standard routine
>
> clear local datadir
> pg_start_backup
> sync datadir with fast-archiver (https://github.com/replicon/fast-archiver)
> pg_stop_backup
> start slave
>
> We use the streaming replication with wal_keep_segments set to 1000 to get
> the required WAL files to the slaves.
>
> With this procedure we can currently only sync one of the slaves at a time
> if we fail over, because when the second machine goes to start the sync it
> errors out since running pg_start_backup fails.

Specifically what is the error?




Re: [GENERAL] File system level backup

2012-07-26 Thread Lonni J Friedman
On Thu, Jul 26, 2012 at 3:39 AM, Manoj Agarwal  wrote:
> Hi,
>
>
>
> I have two virtual machines with two different versions of Postgresql.  One
> machine contains Postgres 7.4.19 and another has Postgres 8.4.3.  I also
> have other instances of these two virtual machines.  I need to transfer the
> database from one machine to other.  Both machines contain a database with
> the same name, for example: testdb, but with different data/values, but the
> structure is exactly same.
>
>
>
> I want to do the following:
>
> · Take a file-system-level backup from the first machine for the
> Postgres database ‘testdb’ and restore it in another instance that has SAME
> version of Postgres.  i.e. backup the database ‘testdb’ from Postgres 7.4.19
> and restore it on another virtual machine with same Postgres version 7.4.19.
>
> · Take a file-system-level backup from the first machine for the
> Postgres database ‘testdb’ and restore it in another instance that has
> DIFFERENT version of Postgres.  i.e. backup the database testdb from
> Postgres 7.4.19 and restore it to another virtual machine with different
> Postgres version 8.4.3.
>
>
>
> I can achieve it with pg_dump and pg_restore, but not with file level
> backups.  The data directory is /var/lib/pgsql/data that contains base
> directory amongst others, that contains directories for each database
> referred to by their OIDs.  If I replace the complete data directory of one
> machine from the instance of another machine (with same or different
> Postgres version), It is failing to identify OIDs for that database.
>
>
>
> It should be possible in Postgres to swap two data directories in two
> different virtual machines without requiring pg_dump and pg_restore.  With
> me, it doesn’t work in either of the cases mentioned above.  In the first
> case, it gives an error about a missing OID for the database.  In the second
> case, it gives a version incompatibility error.
>
>
>
> Is there a way in Postgres to do file system level backup?  The objective is
> to push /var/lib/pgsql/data directory across different virtual machines,
> without the need to backup and restore Postgres database with ‘pg_dump’ and
> ‘pg_restore’ commands.  Any help will be highly appreciated.

You definitely cannot take a filesystem level backup from one version
and throw it into a different version (perhaps minor versions, but not
7.x vs. 8.x).  This is basically what pg_upgrade was created to solve
(however with a different purpose).  But pg_upgrade definitely doesn't
support 7.x, and I'm not even sure that it supports 8.x.  In fact, I
don't even know that 7.x is a supported version of postgresql in any
context any longer.

As for the issue of moving a filesystem level backup between identical
versions, I believe that should work (although I have no clue if there
were bugs preventing this in a version as old as 7.x).  Can you
provide exact details & commands of what you're trying to do, and the
exact errors you're seeing?



Re: [GENERAL] insert binary data into a table column with psql

2012-07-24 Thread Lonni J Friedman
On Tue, Jul 24, 2012 at 7:16 PM, Tom Lane  wrote:
> jtkells  writes:
>> Thanks much for your reply, that does the trick quite nicely. But, I just
>> came to the realization that this only works if your are running the
>> client and the file both resides  on the database server.  I thought that
>> I would be able to do this from a remote server where the client was
>> running, picking up a local file and sending it into a remote database
>> table.  Unless I am missing something, I cant. I can create a temp table
>> on this server, upload the file to a similar table then replicate it to
>> the targeted server..  Lots of work that could be easily done with a
>> programming language (as someone else posted) but thats what I have to
>> work with for now and I have no direct access (I.e. sftp ) to the
>> database server
>
> Maybe use psql's \lo_import command to suck the data into a "large
> object" on the server, and then use loread() to insert it into the
> target table?  (And don't forget to drop the large object after.)
> Pretty grotty but I think it might be the only solution with the
> currently available tools.
>
> btw, does "bytea_import" actually exist?  It's not in the core
> server for sure.

Doh.  No, it's not standard; I found it here:
http://dba.stackexchange.com/questions/1742/how-to-insert-file-data-into-a-postgresql-bytea-column/2962#2962



Re: [GENERAL] insert binary data into a table column with psql

2012-07-24 Thread Lonni J Friedman
On Tue, Jul 24, 2012 at 2:22 PM, John R Pierce  wrote:
> On 07/24/12 1:28 PM, jkells wrote:
>>
>> from psql
>> I have tried several ways including creating a function to read a file
>> without any success but basically I want to do something like the
>> following from a bash shell
>>
>> psql  -c "insert into x (ID, load_date, image)
>> values ($PID,clock_timestamp()::timestamp(0), copy from '/tmp/$FN' with
>> binary);"
>>
>> Any help would be greatly appreciated
>
>
> use a proper programming language that can read files and insert BYTEA data.
> shell + psql just won't cut it.   I'd suggest perl or python or java or

Actually, that's not true.  It's definitely possible to INSERT data
into bytea using just psql.  The trick is to sub- 'select' the data
with the bytea_import function in the INSERT.  So something like this:
insert into x (ID, load_date, image)
values ($PID, clock_timestamp()::timestamp(0), (SELECT bytea_import('/tmp/$FN')));"
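For context, bytea_import is not a built-in function; the commonly circulated definition (adapted from the dba.stackexchange answer linked earlier in this archive) bounces the file through a server-side large object, roughly like this -- a sketch, not guaranteed verbatim:

```sql
-- Hypothetical helper: read a server-side file into a bytea value.
CREATE OR REPLACE FUNCTION bytea_import(p_path text, p_result OUT bytea)
LANGUAGE plpgsql AS $$
DECLARE
    l_oid oid;
    r record;
BEGIN
    p_result := '';
    SELECT lo_import(p_path) INTO l_oid;          -- file -> large object
    FOR r IN SELECT data FROM pg_largeobject
             WHERE loid = l_oid ORDER BY pageno LOOP
        p_result := p_result || r.data;           -- concatenate the pages
    END LOOP;
    PERFORM lo_unlink(l_oid);                     -- drop the temporary object
END;
$$;
```

Note that lo_import runs on the server, so this still requires the file to be readable by the backend -- exactly the limitation discussed elsewhere in this thread.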



Re: [GENERAL] big database resulting in small dump

2012-07-20 Thread Lonni J Friedman
On Fri, Jul 20, 2012 at 11:23 AM, Tom Lane  wrote:
> Lonni J Friedman  writes:
>> On Fri, Jul 20, 2012 at 11:05 AM, Ilya Ivanov  wrote:
>>> I have a 8.4 database (installed on ubuntu 10.04 x86_64). It holds Zabbix
>>> database. The database on disk takes 10Gb. SQL dump takes only 2Gb.
>
>> It's not entirely clear what behavior you expect here.  Assuming that
>> you're referring to running pg_dump, then you should just about never
>> expect the size of the resulting dump to be equal to the amount of
>> disk space the database server files consume on disk.  For example,
>> when I pg_dump a database that consumes about 290GB of disk, the
>> resulting dump is about 1.3GB.  This is normal & expected behavior.
>
> The fine manual says someplace that databases are commonly about 5X the
> size of a plain-text dump, which is right in line with Ilya's results.
> Lonni's DB sounds a bit bloated :-(, though maybe he's got an atypically
> large set of indexes.

I do have a lot of indices.  Also, I'm using a lot of partitions, so
there are a relatively large number of tables.



Re: [GENERAL] big database resulting in small dump

2012-07-20 Thread Lonni J Friedman
On Fri, Jul 20, 2012 at 11:05 AM, Ilya Ivanov  wrote:
> I have a 8.4 database (installed on ubuntu 10.04 x86_64). It holds Zabbix
> database. The database on disk takes 10Gb. SQL dump takes only 2Gb. I've
> gone through
> http://archives.postgresql.org/pgsql-general/2008-08/msg00316.php and got
> some hints. Naturally, the biggest table is history (the second biggest is
> history_uint. Together they make about 95% of total size). I've tried to
> perform CLUSTER on it, but seemed to be taking forever (3 hours and still
> not completed). So I cancelled it and went with database drop and restore.
> It resulted in database taking up 6.4Gb instead of 10Gb. This is a good
> improvement, but still isn't quite what I expect. I would appreciate some
> clarification.

It's not entirely clear what behavior you expect here.  Assuming that
you're referring to running pg_dump, then you should just about never
expect the size of the resulting dump to be equal to the amount of
disk space the database server files consume on disk.  For example,
when I pg_dump a database that consumes about 290GB of disk, the
resulting dump is about 1.3GB.  This is normal & expected behavior.



Re: [GENERAL] efficiency of wildcards at both ends

2012-06-20 Thread Lonni J Friedman
On Wed, Jun 20, 2012 at 10:10 AM, Sam Z J  wrote:
> Hi all
>
> I'm curious how wildcards at both ends are implemented, e.g. LIKE '%str%'.
> How efficient is it if that's the only search criterion against a large
> table? How much does indexing the column help, and roughly how much more
> space is needed for the index?
>
> If the answers are too long, please point me to the relevant text =D

My limited understanding is that any time you need to resort to using
wildcards at both ends, ordinary indices are never used, and you're
falling back to an inefficient sequential scan.
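(That is true for plain btree indexes; for what it's worth, as of 9.1 the
pg_trgm contrib module can service double-ended wildcards with a GIN index.
A sketch, with hypothetical table and column names:)

```sql
-- Sketch: trigram index so LIKE '%str%' can use an index scan
-- (requires the pg_trgm contrib module; 'docs'/'body' are placeholders).
CREATE EXTENSION pg_trgm;
CREATE INDEX docs_body_trgm_idx ON docs USING gin (body gin_trgm_ops);

-- EXPLAIN should now show a bitmap index scan for double-ended wildcards
-- on a sufficiently large table:
EXPLAIN SELECT * FROM docs WHERE body LIKE '%str%';
```

The trade-off is that trigram GIN indexes are considerably larger than btree
indexes on the same column and slow down writes.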



Re: [GENERAL] pg_basebackup blocking all queries

2012-06-06 Thread Lonni J Friedman
I'm still plagued by this.  Immediately before starting a basebackup
the load on my server is 1.00 or less.  Within a few minutes of
starting the basebackup, the load climbs steadily to 30+ and anything
trying to write to the database just sits for minutes at a time, with
overall performance on any query (read or write) being horrible
(seconds to minutes).  As soon as the basebackup completes, performance
returns to normal (and the load drops back down to 1.00 or less).

How can I debug what's wrong?
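One low-impact starting point is to watch pg_stat_activity while the
basebackup is running — a sketch, using the 9.1 column names:

```sql
-- Sketch: active backends during the basebackup, oldest first, flagging
-- any that are blocked on a lock (waiting = true in 9.1's view).
SELECT procpid,
       waiting,
       now() - query_start AS runtime,
       current_query
FROM pg_stat_activity
WHERE current_query <> '<IDLE>'
ORDER BY query_start;
```

If queries pile up with waiting = false but huge runtimes, the problem is
more likely I/O saturation from the backup than lock contention.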

On Tue, May 22, 2012 at 3:20 PM, Lonni J Friedman  wrote:
> Thanks for your reply.  Unfortunately, those queries don't shed any
> light on the problem.  The first two return 0 rows, and the third just
> returns 12 rows all associated with the query itself, rather than
> anything else.
>
> Any other suggestions?
>
>
> On Tue, May 22, 2012 at 2:56 PM, Scott Marlowe  
> wrote:
>> Do the queries here help?
>>
>> http://wiki.postgresql.org/wiki/Lock_Monitoring
>>
>> On Tue, May 22, 2012 at 12:42 PM, Lonni J Friedman  
>> wrote:
>>> Greetings,
>>> I have a 4 server postgresql-9.1.3 cluster (one master doing streaming
>>> replication to 3 hot standby servers).  All of them are running
>>> Fedora-16-x86_64.  Last Friday I upgraded the entire cluster from
>>> Fedora-15 with postgresql-9.0.6 to Fedora-16 with postgresql-9.1.3.
>>>
>>> I'm finding that I cannot run pg_basebackup at all, or it blocks all
>>> SQL queries from running until pg_basebackup has completed (and the
>>> load on the box just takes off to over 75.00).  By "blocks" I mean
>>> that any query that is submitted just hangs and does not return at all
>>> until pg_basebackup has stopped.   I'm assuming that this isn't
>>> expected behavior, so I'm rather confused on what is going on.  The
>>> command that I'm issuing is:
>>> pg_basebackup -v -D /mnt/backups/backups/tmp0 -x -Ft -U postgres
>>>
>>> Can someone provide some guidance on how to debug this?
>>>
>>> thanks!



Re: [GENERAL] autovacuum running for a long time on a new table with 1 row

2012-06-01 Thread Lonni J Friedman
On Fri, Jun 1, 2012 at 10:54 AM, Tom Lane  wrote:
> Lonni J Friedman  writes:
>> On Fri, Jun 1, 2012 at 10:34 AM, Tom Lane  wrote:
>>> This seems to have been noticed and fixed in HEAD:
>>> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=b4e0741727685443657b55932da0c06f028fbc00
>>> I wonder whether that should've been back-patched.
>
>> Thanks for your reply.  I won't even pretend to understand what that
>> fix does.  Is this behavior something that is blatantly broken, or
>> harmless, or somewhere in between?  Should I expect autovacuum to
>> eventually complete successfully when it stumbles into this scenario?
>
> Well, the problem with the original code was that it would recheck the
> visibility map's file size anytime somebody tried to check a bit beyond
> the end of the map.  If the map isn't there (which is not an error case)
> this would result in a useless open() attempt for each table page
> scanned by vacuum.  So ordinarily I would say that yes you could expect
> autovac to complete eventually.  However ...
>
>> Before dropping & recreating the table, yes it had millions of rows,
>> and millions of updates.  But since then, all I did was insert a
>> single row, and watched autovacuum wedge itself in that seemingly
>> infinite loop.  I ended up doing a 'kill -2' on the autovacuum PID
>> that was misbehaving, disabled autovacuuming the table, and went about
>> what I needed to get done as an interim solution.
>
> ... if you really did drop and recreate the table, then at this point
> it should only have a single page, I would think.  It might be worth
> checking the actual file size.  pg_relation_size('tablename') is
> probably the quickest way.

Unfortunately, I've since inserted several million rows into this
table, so I'm guessing it's too late now to check the size?
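For reference, the check Tom suggests is a one-liner in psql ('tablename'
here is a placeholder):

```sql
-- Sketch: on-disk size of the table's main fork, raw and human-readable.
SELECT pg_relation_size('tablename')                AS bytes,
       pg_size_pretty(pg_relation_size('tablename')) AS pretty;
```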



Re: [GENERAL] autovacuum running for a long time on a new table with 1 row

2012-06-01 Thread Lonni J Friedman
On Fri, Jun 1, 2012 at 10:34 AM, Tom Lane  wrote:
> Lonni J Friedman  writes:
>> Running 9.1.3 on Linux-x86_64.  I'm seeing autovacuum running for the
>> past 6 hours on a newly created table that only has 1 row of data in
>> it.  This table did exist previously, but was dropped & recreated.
>> I'm not sure if that might explain this behavior.  When I strace the
>> autovacuum process, I see the following scrolling by non-stop (with no
>> changes to the file referenced):
>> select(0, NULL, NULL, NULL, {0, 21000}) = 0 (Timeout)
>> open("base/16412/214803_vm", O_RDWR)    = -1 ENOENT (No such file or directory)
>> open("base/16412/214803_vm", O_RDWR)    = -1 ENOENT (No such file or directory)
>> open("base/16412/214803_vm", O_RDWR)    = -1 ENOENT (No such file or directory)
>> open("base/16412/214803_vm", O_RDWR)    = -1 ENOENT (No such file or directory)
>> open("base/16412/214803_vm", O_RDWR)    = -1 ENOENT (No such file or directory)
>> open("base/16412/214803_vm", O_RDWR)    = -1 ENOENT (No such file or directory)
>
> This seems to have been noticed and fixed in HEAD:
> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=b4e0741727685443657b55932da0c06f028fbc00
> I wonder whether that should've been back-patched.

Thanks for your reply.  I won't even pretend to understand what that
fix does.  Is this behavior something that is blatantly broken, or
harmless, or somewhere in between?  Should I expect autovacuum to
eventually complete successfully when it stumbles into this scenario?

>
> In the meantime, though, it sure looks like you've got a lot more than
> one row in there.  Perhaps you did umpteen zillion updates on that one
> row?

Before dropping & recreating the table, yes it had millions of rows,
and millions of updates.  But since then, all I did was insert a
single row, and watched autovacuum wedge itself in that seemingly
infinite loop.  I ended up doing a 'kill -2' on the autovacuum PID
that was misbehaving, disabled autovacuuming the table, and went about
what I needed to get done as an interim solution.



[GENERAL] autovacuum running for a long time on a new table with 1 row

2012-05-31 Thread Lonni J Friedman
Running 9.1.3 on Linux-x86_64.  I'm seeing autovacuum running for the
past 6 hours on a newly created table that only has 1 row of data in
it.  This table did exist previously, but was dropped & recreated.
I'm not sure if that might explain this behavior.  When I strace the
autovacuum process, I see the following scrolling by non-stop (with no
changes to the file referenced):
select(0, NULL, NULL, NULL, {0, 21000}) = 0 (Timeout)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
select(0, NULL, NULL, NULL, {0, 21000}) = 0 (Timeout)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR)= -1 ENOENT (No such file or directory)

Is this normal/expected?

thanks



Re: [GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1

2012-05-25 Thread Lonni J Friedman
On Thu, May 24, 2012 at 12:57 PM, Tom Lane  wrote:
> Lonni J Friedman  writes:
>> On Thu, May 24, 2012 at 12:34 PM, Tom Lane  wrote:
>>> Can you correlate the performance hit with any specific part of
>>> autovacuum? In particular, I'm wondering if it matters whether vacuum
>>> is cleaning tables or indexes --- it alternates between the two, and the
>>> access patterns are a bit different. You could probably watch what the
>>> autovac process is doing with strace to see what it's accessing.
>
>> Is there something specific I should be looking for in the strace
>> output, or is this just a matter of correlating PID and FD to
>> pg_class.relfilenode ?
>
> Nah, just match up the files it touches with pg_class.relfilenode.

Seems to be slower across the board regardless of what is being
vacuumed (tables or indices).

For example, I have a relatively small table (currently 395 rows, which
rarely has any inserts or deletes).  Vacuuming just the table (not its
indices) takes on average 4s with the khugepaged defragmenter on, and
less than 1s with it off.  Vacuuming just the indices takes on average
2s with the defragmenter on, and less than 1s with it off.

With the much larger tables (thousands to millions of rows), I see a
similar pattern, although they take longer in absolute terms both with
and without the khugepaged defragmenter enabled.
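The timings above can be reproduced from a psql session with \timing, which
prints the elapsed time of each statement — a minimal sketch ('some_table'
is a placeholder):

```sql
-- Sketch: time a manual vacuum of one table's heap, with per-statement
-- timing enabled.  Repeat with the kernel defragmenter on and off to get
-- the comparison described above.
\timing
VACUUM VERBOSE some_table;
```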



Re: [GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1

2012-05-24 Thread Lonni J Friedman
On Thu, May 24, 2012 at 12:34 PM, Tom Lane  wrote:
> Lonni J Friedman  writes:
>> No, not lots of subqueries or ORDERing, and most queries only touch a
>> single table.  However, I'm honestly not sure that I'm following where
>> you're going with this.   The problem isn't triggered by explicit
>> queries.  I can disable all external access, and simply wait for
>> autovacuum to kick off, and the box starts to die.
>
> Can you correlate the performance hit with any specific part of
> autovacuum?  In particular, I'm wondering if it matters whether vacuum
> is cleaning tables or indexes --- it alternates between the two, and the
> access patterns are a bit different.  You could probably watch what the
> autovac process is doing with strace to see what it's accessing.

Is there something specific I should be looking for in the strace
output, or is this just a matter of correlating PID and FD to
pg_class.relfilenode ?



Re: [GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1

2012-05-24 Thread Lonni J Friedman
On Wed, May 23, 2012 at 2:45 PM, Gavin Flower
 wrote:
> On 24/05/12 08:18, Lonni J Friedman wrote:
>
> On Wed, May 23, 2012 at 12:36 PM, Gavin Flower
>  wrote:
>
> On 24/05/12 05:09, Lonni J Friedman wrote:
>
> On Wed, May 23, 2012 at 9:37 AM, Tom Lane  wrote:
>
> Lonni J Friedman  writes:
>
> After banging my head on the wall for  a long time, I happened to
> notice that khugepaged was consuming 100% CPU every time autovacuum
> was running.  I did:
> echo "madvise" > /sys/kernel/mm/transparent_hugepage/defrag
> and immediately the entire problem went away.
>
> Fascinating.
>
> In hindsight, sure.  Before that, it was 2 days of horror.
>
> So this looks like a nasty Fedora16 kernel bug to me, or maybe
> postgresql & Fedora16's default kernel settings are just not
> compatible?
>
> I agree, kernel bug.  What kernel version are you using exactly?
>
> I'm using the stock 3.3.5-2.fc16.x86_64 kernel that is in Fedora updates.
>
> Is anyone else using Fedora16 & PostgreSQL-9.1 ?
>
> I use an F16 box daily, but can't claim to have done major performance
> testing with it.  Can you put together a summary of your nondefault
> Postgres settings?  I wonder whether it only kicks in for a certain
> size of shared memory for instance.
>
> Oh yea, I'm quite certain that this is somehow related to my setup,
> and not a generic problem with all F16/pgsql systems.  For starters,
> this problem isn't happening on any of the 3 standby systems, which
> are all otherwise identical to the master in every respect.  Also when
> we had done some testing (prior to the upgrades), we never ran into
> any of these problems.  However our test environment was on smaller
> scale hardware, with a much smaller number of clients (and overall
> load).
>
> Here are the non default settings in postgresql.conf :
> wal_level = hot_standby
> archive_mode = on
> archive_timeout = 61
> max_wal_senders = 10
> wal_keep_segments = 5000
> hot_standby = on
> log_autovacuum_min_duration = 2500
> autovacuum_max_workers = 4
> maintenance_work_mem = 1GB
> checkpoint_completion_target = 0.7
> effective_cache_size = 88GB
> work_mem = 576MB
> wal_buffers = 16MB
> checkpoint_segments = 64
> shared_buffers = 8GB
> max_connections = 350
>
> Let me know if you have any other questions.  I'd be happy to provide
> as much information as possible if it can aid in fixing this bug.
>
> I think they will need details of things like: RAM, number/type of
> processors, number & type of disks, disk controllers & any other hardware
> specs that might be relevant, etc. - at the very least: total RAM & number
> of spindles
>
> 16 core Xeon X5550 2.67GHz
> 128GB RAM
> $PGDATA sits on a RAID5 array composed of 3 SATA disks.  It's Linux
> md software RAID.
>
> How does this compare to your other machines running the same, or similar,
> databases?
> However, you do say that the other machines are identical - but are the
> other machines different in any aspect that might prove significant?
>
>
>
> Also anything else running on the box.
>
> Nothing else.  It's dedicated exclusively to postgresql.
>
> Plus transaction load pattern - over time and read/write ratios.
>
> I'm not sure how I'd obtain this data.  However, the patterns didn't
> change since the upgrade.  If someone can point me in the right
> direction, I can at least obtain this data as it's generated currently.
>
> type/nature of queries
>
> I need some clarification on specifically what you're asking for here.
>
> The complexity, structure, and features of the queries. Do you have lots of
> sub queries,
> and ORDER BY's? Also the number of tables accessed in a query. This is
> heading into the
> territory where others will be better placed to advise you as to what might
> be relevant!

No, not lots of subqueries or ORDERing, and most queries only touch a
single table.  However, I'm honestly not sure that I'm following where
you're going with this.   The problem isn't triggered by explicit
queries.  I can disable all external access, and simply wait for
autovacuum to kick off, and the box starts to die.



Re: [GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1

2012-05-23 Thread Lonni J Friedman
On Wed, May 23, 2012 at 3:33 PM, Tom Lane  wrote:
> Gavin Flower  writes:
>>> 16 core Xeon X5550 2.67GHz
>>> 128GB RAM
>>> $PGDATA sits on a RAID5 array composed of 3 SATA disks.  It's Linux
>>> md software RAID.
>
>> How does this compare to your other machines running the same, or
>> similar, databases?
>> However, you do say that the other machines are identical - but are the
>> other machines different in any aspect that might prove significant?
>
> I think Lonnie said that the other machines are just running standby
> clusters, which would mean they aren't running autovacuum as such,
> merely applying any WAL it produces.  So that could be plenty enough
> to explain a difference in kernel-visible behavior.

Yes, that is correct.



Re: [GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1

2012-05-23 Thread Lonni J Friedman
On Wed, May 23, 2012 at 12:36 PM, Gavin Flower
 wrote:
> On 24/05/12 05:09, Lonni J Friedman wrote:
>
> On Wed, May 23, 2012 at 9:37 AM, Tom Lane  wrote:
>
> Lonni J Friedman  writes:
>
> After banging my head on the wall for  a long time, I happened to
> notice that khugepaged was consuming 100% CPU every time autovacuum
> was running.  I did:
> echo "madvise" > /sys/kernel/mm/transparent_hugepage/defrag
> and immediately the entire problem went away.
>
> Fascinating.
>
> In hindsight, sure.  Before that, it was 2 days of horror.
>
> So this looks like a nasty Fedora16 kernel bug to me, or maybe
> postgresql & Fedora16's default kernel settings are just not
> compatible?
>
> I agree, kernel bug.  What kernel version are you using exactly?
>
> I'm using the stock 3.3.5-2.fc16.x86_64 kernel that is in Fedora updates.
>
> Is anyone else using Fedora16 & PostgreSQL-9.1 ?
>
> I use an F16 box daily, but can't claim to have done major performance
> testing with it.  Can you put together a summary of your nondefault
> Postgres settings?  I wonder whether it only kicks in for a certain
> size of shared memory for instance.
>
> Oh yea, I'm quite certain that this is somehow related to my setup,
> and not a generic problem with all F16/pgsql systems.  For starters,
> this problem isn't happening on any of the 3 standby systems, which
> are all otherwise identical to the master in every respect.  Also when
> we had done some testing (prior to the upgrades), we never ran into
> any of these problems.  However our test environment was on smaller
> scale hardware, with a much smaller number of clients (and overall
> load).
>
> Here are the non default settings in postgresql.conf :
> wal_level = hot_standby
> archive_mode = on
> archive_timeout = 61
> max_wal_senders = 10
> wal_keep_segments = 5000
> hot_standby = on
> log_autovacuum_min_duration = 2500
> autovacuum_max_workers = 4
> maintenance_work_mem = 1GB
> checkpoint_completion_target = 0.7
> effective_cache_size = 88GB
> work_mem = 576MB
> wal_buffers = 16MB
> checkpoint_segments = 64
> shared_buffers = 8GB
> max_connections = 350
>
> Let me know if you have any other questions.  I'd be happy to provide
> as much information as possible if it can aid in fixing this bug.
>
> I think they will need details of things like: RAM, number/type of
> processors, number & type of disks, disk controllers & any other hardware
> specs that might be relevant, etc. - at the very least: total RAM & number
> of spindles

16 core Xeon X5550 2.67GHz
128GB RAM
$PGDATA sits on a RAID5 array composed of 3 SATA disks.  It's Linux
md software RAID.

>
> Also anything else running on the box.

Nothing else.  It's dedicated exclusively to postgresql.

>
> Plus transaction load pattern - over time and read/write ratios.

I'm not sure how I'd obtain this data.  However, the patterns didn't
change since the upgrade.  If someone can point me in the right
direction, I can at least obtain this data as it's generated currently.
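If it helps, the statistics collector already tracks rough per-table
read/write counts; a sketch of pulling a ratio from it:

```sql
-- Sketch: approximate read/write activity per table since the stats were
-- last reset, busiest writers first.
SELECT relname,
       seq_tup_read + idx_tup_fetch      AS rows_read,
       n_tup_ins + n_tup_upd + n_tup_del AS rows_written
FROM pg_stat_user_tables
ORDER BY rows_written DESC
LIMIT 10;
```

Sampling this query periodically (e.g. from cron) would give the load
pattern over time that Gavin is asking about.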

>
> type/nature of queries

I need some clarification on specifically what you're asking for here.

>
> size of heavily accessed tables and their indexes

There are several rather large tables (90 million+ rows), but most
others are under 1M rows.  However, most tables are accessed & written
to with equal frequency.



Re: [GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1

2012-05-23 Thread Lonni J Friedman
On Wed, May 23, 2012 at 9:37 AM, Tom Lane  wrote:
> Lonni J Friedman  writes:
>> After banging my head on the wall for  a long time, I happened to
>> notice that khugepaged was consuming 100% CPU every time autovacuum
>> was running.  I did:
>> echo "madvise" > /sys/kernel/mm/transparent_hugepage/defrag
>> and immediately the entire problem went away.
>
> Fascinating.

In hindsight, sure.  Before that, it was 2 days of horror.

>
>> So this looks like a nasty Fedora16 kernel bug to me, or maybe
>> postgresql & Fedora16's default kernel settings are just not
>> compatible?
>
> I agree, kernel bug.  What kernel version are you using exactly?

I'm using the stock 3.3.5-2.fc16.x86_64 kernel that is in Fedora updates.

>
>> Is anyone else using Fedora16 & PostgreSQL-9.1 ?
>
> I use an F16 box daily, but can't claim to have done major performance
> testing with it.  Can you put together a summary of your nondefault
> Postgres settings?  I wonder whether it only kicks in for a certain
> size of shared memory for instance.

Oh yea, I'm quite certain that this is somehow related to my setup,
and not a generic problem with all F16/pgsql systems.  For starters,
this problem isn't happening on any of the 3 standby systems, which
are all otherwise identical to the master in every respect.  Also when
we had done some testing (prior to the upgrades), we never ran into
any of these problems.  However our test environment was on smaller
scale hardware, with a much smaller number of clients (and overall
load).

Here are the non default settings in postgresql.conf :
wal_level = hot_standby
archive_mode = on
archive_timeout = 61
max_wal_senders = 10
wal_keep_segments = 5000
hot_standby = on
log_autovacuum_min_duration = 2500
autovacuum_max_workers = 4
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.7
effective_cache_size = 88GB
work_mem = 576MB
wal_buffers = 16MB
checkpoint_segments = 64
shared_buffers = 8GB
max_connections = 350

Let me know if you have any other questions.  I'd be happy to provide
as much information as possible if it can aid in fixing this bug.



Re: [GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1

2012-05-23 Thread Lonni J Friedman
Thanks for your reply.

On Tue, May 22, 2012 at 7:19 PM, Andy Colson  wrote:
>  On Mon, May 21, 2012 at 2:05 PM, Lonni J Friedman
>  wrote:
>>>
>>> Greetings,
>>>
>>> When I got in this morning, I found
>>> an autovacuum process that had been running since just before the load
>>> spiked,
>
>
> Autovacuum might need to set the freeze bit the very first time it runs.  I
> recall hearing advice about running a 'vacuum freeze' after you insert a
> huge amount of data.  And I recall pg_upgrade doesn't write stats, so did
> you analyze your database?

Yes, I ran a 'vacuum analyze' for all databases & tables immediately
following completion of pg_upgrade.

>
> Or, maybe it's not vacuum... maybe some of your sql statements are planning
> differently and running really badly.  Can you check some?  Can you log slow
> queries?
>
> Have you checked the status of your raid?  Maybe you lost a drive and it's in
> recovery and you have very slow IO?

I checked that initially, but the array is fine.

After banging my head on the wall for a long time, I happened to
notice that khugepaged was consuming 100% CPU every time autovacuum
was running.  I did:
echo "madvise" > /sys/kernel/mm/transparent_hugepage/defrag

and immediately the entire problem went away.  Load dropped within
minutes from 35.00 to 1.00, and has remained under 4.00 for the past
18 hours.  Prior to disabling defrag, I never saw the load below 10.00
for more than a few seconds at a time.

So this looks like a nasty Fedora16 kernel bug to me, or maybe
postgresql & Fedora16's default kernel settings are just not
compatible?

Is anyone else using Fedora16 & PostgreSQL-9.1 ?



Re: [GENERAL] pg_basebackup blocking all queries

2012-05-22 Thread Lonni J Friedman
Thanks for your reply.  Unfortunately, those queries don't shed any
light on the problem.  The first two return 0 rows, and the third just
returns 12 rows all associated with the query itself, rather than
anything else.

Any other suggestions?


On Tue, May 22, 2012 at 2:56 PM, Scott Marlowe  wrote:
> Do the queries here help?
>
> http://wiki.postgresql.org/wiki/Lock_Monitoring
>
> On Tue, May 22, 2012 at 12:42 PM, Lonni J Friedman  wrote:
>> Greetings,
>> I have a 4 server postgresql-9.1.3 cluster (one master doing streaming
>> replication to 3 hot standby servers).  All of them are running
>> Fedora-16-x86_64.  Last Friday I upgraded the entire cluster from
>> Fedora-15 with postgresql-9.0.6 to Fedora-16 with postgresql-9.1.3.
>>
>> I'm finding that I cannot run pg_basebackup at all, or it blocks all
>> SQL queries from running until pg_basebackup has completed (and the
>> load on the box just takes off to over 75.00).  By "blocks" I mean
>> that any query that is submitted just hangs and does not return at all
>> until pg_basebackup has stopped.   I'm assuming that this isn't
>> expected behavior, so I'm rather confused on what is going on.  The
>> command that I'm issuing is:
>> pg_basebackup -v -D /mnt/backups/backups/tmp0 -x -Ft -U postgres
>>
>> Can someone provide some guidance on how to debug this?
>>
>> thanks!



[GENERAL] pg_basebackup blocking all queries

2012-05-22 Thread Lonni J Friedman
Greetings,
I have a 4 server postgresql-9.1.3 cluster (one master doing streaming
replication to 3 hot standby servers).  All of them are running
Fedora-16-x86_64.  Last Friday I upgraded the entire cluster from
Fedora-15 with postgresql-9.0.6 to Fedora-16 with postgresql-9.1.3.

I'm finding that I cannot run pg_basebackup at all, or it blocks all
SQL queries from running until pg_basebackup has completed (and the
load on the box just takes off to over 75.00).  By "blocks" I mean
that any query that is submitted just hangs and does not return at all
until pg_basebackup has stopped.   I'm assuming that this isn't
expected behavior, so I'm rather confused on what is going on.  The
command that I'm issuing is:
pg_basebackup -v -D /mnt/backups/backups/tmp0 -x -Ft -U postgres

Can someone provide some guidance on how to debug this?

thanks!



[GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1

2012-05-22 Thread Lonni J Friedman
No one has any ideas or suggestions, or even questions?  If someone
needs more information, I'd be happy to provide it.

This problem is absolutely killing me.

On Mon, May 21, 2012 at 2:05 PM, Lonni J Friedman  wrote:
> Greetings,
> I have a 4 server postgresql-9.1.3 cluster (one master doing streaming
> replication to 3 hot standby servers).  All of them are running
> Fedora-16-x86_64.  Last Friday I upgraded the entire cluster from
> Fedora-15 with postgresql-9.0.6 to Fedora-16 with postgresql-9.1.3.  I
> made no changes to postgresql.conf following the upgrade.  I used
> pg_upgrade on the master to upgrade it, followed by blowing away
> $PGDATA on all the standbys and rsyncing them fresh from the master.
> All of the servers have 128GB RAM, and at least 16 CPU cores.
>
> Everything appeared to be working fine until last night when the load
> on the master suddenly took off, and hovered at around 30.00 ever
> since.  Prior to the load spike, the load was hovering around 2.00
> (which is actually lower than it was averaging prior to the upgrade
> when it was often around 4.00).  When I got in this morning, I found
> an autovacuum process that had been running since just before the load
> spiked, and the pg_dump cronjob that started shortly after the load
> spike (and normally completes in about 20 minutes for all the
> databases) was still running, and hadn't finished the first of the 6
> databases.  I ended up killing the pg_dump process altogether in the
> hope that it might unblock whatever was causing the high load.
> Unfortunately that didn't help, and the load continued to run high.
>
> I proceeded to check dmesg, /var/log/messages and the postgresql
> server log (all on the master), but I didn't spot anything out of the
> ordinary, definitely nothing that pointed to a potential explanation
> for all of the high load.
>
> I inspected what the autovacuum process was doing, and determined that
> it was chewing away on the largest table (nearly 98 million rows) in
> the largest database.  It was making very slow progress, at least I
> believe that was the case, as when I attached strace to the process,
> the seek addresses were changing in a random fashion.
>
> Here are the current autovacuum settings:
> autovacuum                      | on
> autovacuum_analyze_scale_factor | 0.1
> autovacuum_analyze_threshold    | 50
> autovacuum_freeze_max_age       | 2
> autovacuum_max_workers          | 4
> autovacuum_naptime              | 1min
> autovacuum_vacuum_cost_delay    | 20ms
> autovacuum_vacuum_cost_limit    | -1
> autovacuum_vacuum_scale_factor  | 0.2
> autovacuum_vacuum_threshold     | 50
>
> Did something significant change in 9.1 that would impact autovacuum
> behavior?  I'm at a complete loss on how to debug this, since I'm
> using the exact same settings now as prior to the upgrade.
>
> thanks

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1

2012-05-21 Thread Lonni J Friedman
Greetings,
I have a 4 server postgresql-9.1.3 cluster (one master doing streaming
replication to 3 hot standby servers).  All of them are running
Fedora-16-x86_64.  Last Friday I upgraded the entire cluster from
Fedora-15 with postgresql-9.0.6 to Fedora-16 with postgresql-9.1.3.  I
made no changes to postgresql.conf following the upgrade.  I used
pg_upgrade on the master to upgrade it, followed by blowing away
$PGDATA on all the standbys and rsyncing them fresh from the master.
All of the servers have 128GB RAM, and at least 16 CPU cores.

Everything appeared to be working fine until last night when the load
on the master suddenly took off, and hovered at around 30.00 ever
since.  Prior to the load spike, the load was hovering around 2.00
(which is actually lower than it was averaging prior to the upgrade
when it was often around 4.00).  When I got in this morning, I found
an autovacuum process that had been running since just before the load
spiked, and the pg_dump cronjob that started shortly after the load
spike (and normally completes in about 20 minutes for all the
databases) was still running, and hadn't finished the first of the 6
databases.  I ended up killing the pg_dump process altogether in the
hope that it might unblock whatever was causing the high load.
Unfortunately that didn't help, and the load continued to run high.

I proceeded to check dmesg, /var/log/messages and the postgresql
server log (all on the master), but I didn't spot anything out of the
ordinary, definitely nothing that pointed to a potential explanation
for all of the high load.

I inspected what the autovacuum process was doing, and determined that
it was chewing away on the largest table (nearly 98 million rows) in
the largest database.  It was making very slow progress, at least I
believe that was the case, as when I attached strace to the process,
the seek addresses were changing in a random fashion.

Here are the current autovacuum settings:
autovacuum                      | on
autovacuum_analyze_scale_factor | 0.1
autovacuum_analyze_threshold    | 50
autovacuum_freeze_max_age       | 2
autovacuum_max_workers          | 4
autovacuum_naptime              | 1min
autovacuum_vacuum_cost_delay    | 20ms
autovacuum_vacuum_cost_limit    | -1
autovacuum_vacuum_scale_factor  | 0.2
autovacuum_vacuum_threshold     | 50
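For what it's worth, a couple of catalog queries along these lines (a sketch, using the 9.1 column names; pg_stat_activity's procpid and current_query were renamed in 9.2) can show what autovacuum is doing and which tables are accumulating dead rows:

```sql
-- active autovacuum workers and when they started
SELECT procpid, query_start, current_query
FROM pg_stat_activity
WHERE current_query LIKE 'autovacuum:%';

-- tables with the most dead tuples, and when autovacuum last touched them
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```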

Did something significant change in 9.1 that would impact autovacuum
behavior?  I'm at a complete loss on how to debug this, since I'm
using the exact same settings now as prior to the upgrade.

thanks



[GENERAL] problems after restoring from a pg_basebackup

2012-04-27 Thread Lonni J Friedman
Greetings,
I'm running postgresql-9.1.3 on a Linux-x86_64 (Fedora16, if it
matters) system.  I noticed the existence of pg_basebackup starting in
9.1, and figured I'd try it out and see if it would simplify our
backup & management processes.

$ pg_basebackup -P -v -D /tmp/backup -x -Ft -z -U postgres
xlog start point: C6/6420
135733616/135733616 kB (100%), 1/1 tablespace
xlog end point: C6/64A0
pg_basebackup: base backup completed

So after running through this, I tried to use (restore) the backup
that was generated.  While everything appears to be working ok from a
functional perspective, in the server log I saw the following:
##
LOG:  creating missing WAL directory "pg_xlog/archive_status"
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at C6/6678
LOG:  could not open file "pg_xlog/000100C60067" (log file
198, segment 103): No such file or directory
LOG:  redo done at C6/66A0
FATAL:  the database system is starting up
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
#

Just to be clear, here's what I did after pg_basebackup had completed
successfully:
0) shutdown postgresql gracefully, and verified that it was fully shutdown
1) moved $PGDATA to $PGDATA.old
2) created $PGDATA as postgres user
3) extracted the basebackup tarball as postgres user
cd $PGDATA && tar xzvpf /tmp/backup/base.tar.gz
4) started postgresql up

I would have expected that I wouldn't have gotten the 'not properly
shutdown' warning, or the 'could not open file' warning by following
this process.  Am I doing something wrong?

thanks



Re: [GENERAL] pg_basebackup issues

2012-04-24 Thread Lonni J Friedman
On Fri, Apr 20, 2012 at 12:31 PM, Magnus Hagander  wrote:
> On Fri, Apr 20, 2012 at 19:51, Lonni J Friedman  wrote:
>> Anyway, lesson learned, I need to either invoke pg_basebackup as the
>> same user that runs the database (or is specified with the -U
>> parameter ?), or write the backup somewhere outside of the directory
>> structure that is being backed up.
>>
>> I eventually also found the following entries in the postgresql server log:
>> FATAL:  could not open directory "./backups": Permission denied
>> FATAL:  archive member "backups/base.tar.gz" too large for tar format
>>
>> What concerns me is the 2nd fatal error.  The tarball that
>> pg_basebackup created before erroring out is about 12GB:
>> 12393094165  base.tar.gz
>
> Are you actually storing your backup files *inside* the data
> directory? You really shouldn't do that, you're creating a cyclic
> dependency where each new backup will include the old one inside it...
> You should store the resulting backup file somewhere outside the data
> directory.

Yea, in hindsight that was silly.  I just saw a convenient 'backups'
subdirectory and figured that'd be a nice logical place.

>
>> I wasn't aware of any 12GB file size limit for tar, so this is a bit
>> of a mystery to me.  Regardless, I'd be happy to try some other
>> archiving strategy, but the man page for pg_basebackup suggests that
>> there are only two formats, tar and basically just copying the
>> filesystem.  If I copied the filesystem, I'd still have to find some
>> way to archive them for easy management (copying elsewhere, etc).  Has
>> anyone come up with a good strategy on how to deal with it?
>
> The max file size of a single file inside a standard tar file is 8GB,
> see e.g. http://en.wikipedia.org/wiki/Tar_(file_format).
>
> I think there are extensions that let you store bigger files, but
> since PostgreSQL will never create files that big, it's not
> implemented in the basebackup system.  Because again, the root of your
> problem seems to be that you are trying to store the resulting backup
> inside the data directory.
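A quick sanity check of that limit: the ustar "size" header field holds 11 octal digits, so the cap works out to 8^11 - 1 bytes, which the 12GB tarball above exceeds. For example:

```sql
-- ustar size field: 11 octal digits -> 8^11 - 1 = 8589934591 bytes (just under 8GiB)
SELECT 8::numeric ^ 11 - 1               AS ustar_max_bytes,
       12393094165 > 8::numeric ^ 11 - 1 AS backup_too_big;
```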

You're right, that was indeed the issue.  I've redone the process
using a location external to $PGDATA, and it completed successfully:
$ pg_basebackup -P -v -D /tmp/backup -x -Ft -z -U postgres
xlog start point: C6/6420
135733616/135733616 kB (100%), 1/1 tablespace
xlog end point: C6/64A0
pg_basebackup: base backup completed

So after running through this, I tried to use (restore) the backup
that was generated.  While everything appears to be working ok from a
functional perspective, in the server log I saw the following:
##
LOG:  creating missing WAL directory "pg_xlog/archive_status"
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at C6/6678
LOG:  could not open file "pg_xlog/000100C60067" (log file
198, segment 103): No such file or directory
LOG:  redo done at C6/66A0
FATAL:  the database system is starting up
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
#

Just to be clear, here's what I did after pg_basebackup had completed
successfully:
0) shutdown postgresql gracefully, and verified that it was fully shutdown
1) moved $PGDATA to $PGDATA.old
2) created $PGDATA as postgres user
3) extracted the basebackup tarball as postgres user
cd $PGDATA && tar xzvpf /tmp/backup/base.tar.gz
4) started postgresql up

I would have expected that I wouldn't have gotten the 'not properly
shutdown' warning, or the 'could not open file' warning by following
this process.  Am I doing something wrong?

thanks



[GENERAL] pg_basebackup issues

2012-04-20 Thread Lonni J Friedman
Greetings,
I'm running postgresql-9.1.3 on a Linux-x86_64 (Fedora16, if it
matters) system.  I noticed the existence of pg_basebackup starting in
9.1, and figured I'd try it out and see if it would simplify our
backup & management processes.  I setup a test system (same OS &
postgresql version as production) with a fairly recent snapshot of our
production database, invoked it, and saw the following output:
##
# pg_basebackup -P -v -D backups -Ft -z -U postgres
135717206/135717230 kB (100%), 1/1 tablespace
pg_basebackup: could not get WAL end position from server
##

I wasn't sure what that error meant, so after googling a bit, turns
out that it really means that there were one or more files not owned
by the postgres user (see
http://serverfault.com/questions/312205/pg-basebackup-could-not-get-wal-end-position-from-server
).  Sure enough, the file that wasn't owned by the postgres user was
the backup tarball that pg_basebackup was creating, since I had been
running it as root.  That error is rather cryptic, and it would be
helpful if it was improved to suggest the real cause of the failure.
Anyway, lesson learned, I need to either invoke pg_basebackup as the
same user that runs the database (or is specified with the -U
parameter ?), or write the backup somewhere outside of the directory
structure that is being backed up.

I eventually also found the following entries in the postgresql server log:
FATAL:  could not open directory "./backups": Permission denied
FATAL:  archive member "backups/base.tar.gz" too large for tar format

What concerns me is the 2nd fatal error.  The tarball that
pg_basebackup created before erroring out is about 12GB:
12393094165  base.tar.gz

I wasn't aware of any 12GB file size limit for tar, so this is a bit
of a mystery to me.  Regardless, I'd be happy to try some other
archiving strategy, but the man page for pg_basebackup suggests that
there are only two formats, tar and basically just copying the
filesystem.  If I copied the filesystem, I'd still have to find some
way to archive them for easy management (copying elsewhere, etc).  Has
anyone come up with a good strategy on how to deal with it?

thanks



Re: [GENERAL] pg_upgrade + streaming replication ?

2012-03-20 Thread Lonni J Friedman
On Tue, Mar 20, 2012 at 11:46 AM, Bruce Momjian  wrote:
> On Mon, Mar 19, 2012 at 03:07:02PM -0700, Jeff Davis wrote:
>> On Mon, 2012-03-19 at 15:30 -0400, Bruce Momjian wrote:
>> > On Thu, Mar 01, 2012 at 02:01:31PM -0800, Lonni J Friedman wrote:
>> > > I've got a 3 node cluster (1 master/2 slaves) running 9.0.x with
>> > > streaming replication.  I'm in the planning stages of upgrading to
>> > > 9.1.x, and am looking into the most efficient way to do the upgrade
>> > > with the goal of minimizing downtime & risk.  After googling, the only
>> > > discussion that I've found of using pg_upgrade with a streaming
>> > > replication setup seems to be this (nearly) year old thread:
>> > > http://web.archiveorange.com/archive/v/9FNVlDWGQtpyWVL54jlK
>> > >
>> > > In summary, there is no way to use both pg_upgrade and streaming
>> > > replication simultaneously.  I'd have to either use pg_upgrade and
>> > > then effectively rebuild/redeploy the slaves, or not use pg_upgrade,
>> > > and reimport all of the data.  Is that still the latest status, or are
>> > > there other options?
>> >
>> > You can shut down all three servers, run pg_upgrade on all of them, then
>> > restart them as 9.1 servers.
>>
>> After running pg_upgrade on each server individually, they will have
>> different system IDs, and potentially different on-disk representation
>> of the catalogs, right?
>>
>> So how can you resume streaming without rebuilding the slaves?
>
> Oh, wow, I never thought of the fact that the system tables will be
> different.  I guess you could assume the pg_dump restore is going to
> create things exactly the same on all the systems, but I never tested
> that.  Do the system id's have to match?  That would be a problem
> because you are initdb'ing on each server.  OK, crazy idea, but I
> wonder if you could initdb on the master, then copy that to the slaves,
> then run pg_upgrade on each of them.  Obviously this needs some testing.

Wouldn't it be easier to just pg_upgrade the master, then setup the
slaves from scratch (with rsync, etc)?  It certainly wouldn't be any
more work to do it that way (although still a lot more work than
simply running pg_upgrade on all servers).



Re: [GENERAL] pg_upgrade + streaming replication ?

2012-03-19 Thread Lonni J Friedman
On Mon, Mar 19, 2012 at 12:30 PM, Bruce Momjian  wrote:
> On Thu, Mar 01, 2012 at 02:01:31PM -0800, Lonni J Friedman wrote:
>> I've got a 3 node cluster (1 master/2 slaves) running 9.0.x with
>> streaming replication.  I'm in the planning stages of upgrading to
>> 9.1.x, and am looking into the most efficient way to do the upgrade
>> with the goal of minimizing downtime & risk.  After googling, the only
>> discussion that I've found of using pg_upgrade with a streaming
>> replication setup seems to be this (nearly) year old thread:
>> http://web.archiveorange.com/archive/v/9FNVlDWGQtpyWVL54jlK
>>
>> In summary, there is no way to use both pg_upgrade and streaming
>> replication simultaneously.  I'd have to either use pg_upgrade and
>> then effectively rebuild/redeploy the slaves, or not use pg_upgrade,
>> and reimport all of the data.  Is that still the latest status, or are
>> there other options?
>
> You can shut down all three servers, run pg_upgrade on all of them, then
> restart them as 9.1 servers.
>

Thanks for your reply.  This is very good news.



Re: [GENERAL] how to measure wal_buffer usage

2012-03-16 Thread Lonni J Friedman
On Fri, Mar 16, 2012 at 2:45 AM, Albe Laurenz  wrote:
> Lonni J Friedman wrote:
>> After reading this interesting article on shared_buffers and wal_buffers:
>> http://rhaas.blogspot.com/2012/03/tuning-sharedbuffers-and-walbuffers.html
>>
>> it got me wondering if my settings were ideal.  Is there some way to
>> measure wal_buffer usage in real time, so that I could simply monitor
>> it for some period of time, and then come up with a way of determining
>> if the current setting is sufficient?
>>
>> I tried googling, but every reference that I've found simply defaults
>> to the "trial & error" approach to performance tuning.
>
> You can use the contrib module pg_buffercache to inspect the shared buffers.
> If almost all your shared buffers have high use count (4 or 5),
> shared_buffers may be too small.  If not, consider reducing shared_buffers.
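For reference, the usage-count inspection Laurenz suggests might look like this (a sketch; assumes the pg_buffercache contrib module is available on your installation):

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;  -- contrib module

-- how many shared buffers sit at each usage count; mostly 4s and 5s
-- suggests shared_buffers may be too small
SELECT usagecount, count(*) AS buffers
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;
```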

pg_buffercache only reports on the buffer_cache, it does not report
any data on the wal_cache.

>
> It's probably better to start with a moderate value and tune upwards.
>
> You can also look at pg_statio_all_tables and pg_statio_all_indexes and
> calculate the buffer hit ratio.  If that is low, that's also an indication
> that shared_buffers is too small.
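One way to compute the hit ratio from those views (a sketch; heap and index blocks separately, since index caching usually matters more):

```sql
-- shared-buffer hit ratio for user tables and their indexes
SELECT sum(heap_blks_hit)::numeric
         / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS table_hit_ratio,
       sum(idx_blks_hit)::numeric
         / nullif(sum(idx_blks_hit) + sum(idx_blks_read), 0)   AS index_hit_ratio
FROM pg_statio_user_tables;
```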

Isn't this also specific to the buffer_cache rather than the wal_cache?

>
> You should distinguish between tables and indexes:
> it is usually more important that indexes are cached.
>
> Try to observe these things over time, for example by taking
> snapshots every n minutes and storing the results in a table.
>
> Yours,
> Laurenz Albe



-- 
~
L. Friedman                                    netll...@gmail.com
LlamaLand                       https://netllama.linux-sxs.org



[GENERAL] how to measure wal_buffer usage

2012-03-15 Thread Lonni J Friedman
After reading this interesting article on shared_buffers and wal_buffers:
http://rhaas.blogspot.com/2012/03/tuning-sharedbuffers-and-walbuffers.html

it got me wondering if my settings were ideal.  Is there some way to
measure wal_buffer usage in real time, so that I could simply monitor
it for some period of time, and then come up with a way of determining
if the current setting is sufficient?

I tried googling, but every reference that I've found simply defaults
to the "trial & error" approach to performance tuning.
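One directly relevant counter does exist, though only in much later releases (the pg_stat_wal view is new in PostgreSQL 14, long after this thread):

```sql
-- PostgreSQL 14+: number of times WAL data was written to disk because
-- the WAL buffers were full; a steadily climbing value suggests
-- raising wal_buffers
SELECT wal_buffers_full FROM pg_stat_wal;
```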



[GENERAL] pg_upgrade + streaming replication ?

2012-03-01 Thread Lonni J Friedman
I've got a 3 node cluster (1 master/2 slaves) running 9.0.x with
streaming replication.  I'm in the planning stages of upgrading to
9.1.x, and am looking into the most efficient way to do the upgrade
with the goal of minimizing downtime & risk.  After googling, the only
discussion that I've found of using pg_upgrade with a streaming
replication setup seems to be this (nearly) year old thread:
http://web.archiveorange.com/archive/v/9FNVlDWGQtpyWVL54jlK

In summary, there is no way to use both pg_upgrade and streaming
replication simultaneously.  I'd have to either use pg_upgrade and
then effectively rebuild/redeploy the slaves, or not use pg_upgrade,
and reimport all of the data.  Is that still the latest status, or are
there other options?

thanks



Re: [GENERAL] returning rows from an implicit JOIN where results either exist in both tables OR only one table

2011-12-01 Thread Lonni J Friedman
On Thu, Dec 1, 2011 at 1:57 PM, David Johnston  wrote:
> -Original Message-
> From: pgsql-general-ow...@postgresql.org 
> [mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Lonni J Friedman
> Sent: Thursday, December 01, 2011 4:13 PM
> To: pgsql-general
> Subject: [GENERAL] returning rows from an implicit JOIN where results either 
> exist in both tables OR only one table
>
>
> I'm stumbling over how to integrate those two tables with static data into 
> the query.  The following query works fine as long as there's at least one 
> row in the 'results' table that corresponds to each row in the pending table 
> (however, it doesn't return anything for rows that only exist in 'pending' 
> yet not yet in 'results'):
>
> -
>
> Implicit JOINs are ALWAYS INNER JOINs
>
> Since you want to use an OUTER JOIN you must be explicit.
>
> I'm not going to try and figure out specifically what you need but from your 
> quick description (all pending and results where available) you need to do 
> something like
>
> "pending" LEFT OUTER JOIN "results" ON ("pending".active = "results".hostname 
> AND "pending".submittracker = "results".submittracker AND "pending".cl = 
> "results".cl)
>
> Then, for conditions dependent upon the "results" (or NULL-able) relation, 
> you need to make sure you explicitly allow for the missing rows:
>
> ( "results".current_status IS NULL  OR ( your existing "results" conditions ) 
> )
>
> http://www.postgresql.org/docs/9.0/interactive/queries-table-expressions.html#QUERIES-FROM

Thanks for your reply and input.  I ended up putting together the
following query which does what I need:

SELECT  pending.cl,
pending.id,
pending.buildid,
pending.build_type,
pending.active,
pending.submittracker,
pending.os,
pending.arch,
pending.osversion,
pending.branch,
pending.comment,
osversmap.osname,
buildlist.buildname,
results.logurl
FROM pending
JOIN osversmap
ON ( pending.os = osversmap.os
AND pending.osversion = osversmap.osversion )
JOIN buildlist
ON ( pending.buildid = buildlist.id )
LEFT OUTER JOIN results
ON ( pending.active = results.hostname
AND pending.submittracker = results.submittracker
AND pending.cl = results.cl
AND results.current_status != 'PASSED'
AND results.current_status NOT LIKE '%FAILED'
)
WHERE pending.owner = '$owner'
AND pending.completed = 'f'
ORDER BY pending.submittracker,
pending.branch,
pending.os,
pending.arch



[GENERAL] returning rows from an implicit JOIN where results either exist in both tables OR only one table

2011-12-01 Thread Lonni J Friedman
Greetings,
I've got a PostgreSQL-9.0.x database that manages an automated testing
environment.  There are a bunch of tables that contain assorted static
data (OS versions, test names, etc) named 'buildlist' & 'osversmap'.
However, there are also two tables which contain data which changes
often.  The first is a 'pending' table which is effectively a test
queue where pending tests are self-selected by the test systems, and
then deleted when the test run has completed.  The second is a
'results' table which contains the test results as they are produced
(in progress and completed).

The records in the pending table have a one to many relationship with
the records in the results table (each row in pending can have 0 or
more rows in results).  For example, if no test systems have
self-assigned a pending row, then there will be zero associated rows
in results, and then once a pending row is assigned, the number of
rows in results will increase for each pending row.  An added catch is
that I always want only the newest results table row associated with
each pending table row.  What I need to do is query the 'pending'
table for pending tests, and then also get a 'logurl' from the results
table that corresponds to each pending table row.

All of this is rather similar to this problem, except that I have the
added burden of the two additional tables with the static data
(buildlist & osversmap):
http://stackoverflow.com/questions/3343857/php-sql-using-only-one-query-select-rows-from-two-tables-if-data-is-in-both-ta

I'm stumbling over how to integrate those two tables with static data
into the query.  The following query works fine as long as there's at
least one row in the 'results' table that corresponds to each row in
the pending table (however, it doesn't return anything for rows that
only exist in 'pending' yet not yet in 'results'):

SELECT
pending.cl,
pending.id,
pending.buildid,
pending.build_type,
pending.active,
pending.submittracker,
pending.os,
pending.arch,
pending.osversion,
pending.branch,
pending.comment,
osversmap.osname,
buildlist.buildname,
results.logurl
FROM pending, osversmap, buildlist, results
WHERE
pending.buildid=buildlist.id
AND pending.os=osversmap.os
AND pending.osversion=osversmap.osversion
AND pending.owner='$owner'
AND pending.completed='f'
AND results.hostname=pending.active
AND results.submittracker=pending.submittracker
AND pending.cl=results.cl
AND results.current_status!='PASSED'
AND results.current_status NOT LIKE '%FAILED'
ORDER BY pending.submittracker,pending.branch,pending.os,pending.arch


thanks in advance!



Re: [GENERAL] autovacuum stuck on a table for 18+ hours, consuming lots of CPU time

2011-11-22 Thread Lonni J Friedman
On Tue, Nov 22, 2011 at 7:49 PM, Tom Lane  wrote:
> Lonni J Friedman  writes:
>> I suspect you're right.  I just ran strace against that PID again, and
>> now all the lseek & read FD's are referrring to a different number
>> (115), so that means its moved onto something new since I looked a few
>> hours ago?
>
>> Anyway, I think this is what you were referring to:
>> /proc/30188/fd/115 ->   /var/lib/pgsql/data/base/64793/72633.10
>
>> How do I correlate that file to an actual database object?
>
> 64793 is the pg_database.oid of the database, and 72633 is the
> pg_class.relfilenode value of the table/index.
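Spelled out as queries, that lookup goes something like this (the second query has to be run while connected to the database in question, since relfilenode is per-database):

```sql
-- which database is directory 64793?
SELECT datname FROM pg_database WHERE oid = 64793;

-- then, connected to that database: which relation is file 72633?
-- (relkind 'r' = table, 'i' = index)
SELECT relname, relkind
FROM pg_class
WHERE relfilenode = 72633;
```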

It's definitely an index.  Thanks for your help; I just need to be
patient now that I understand how to better monitor this.



Re: [GENERAL] autovacuum stuck on a table for 18+ hours, consuming lots of CPU time

2011-11-22 Thread Lonni J Friedman
On Tue, Nov 22, 2011 at 7:19 PM, Tom Lane  wrote:
> Lonni J Friedman  writes:
>> Thanks for your prompt reply.  I was pretty sure that I was using the
>> default, but just to confirm, I just ran:
>> 'SHOW vacuum_cost_delay;'
>
> What about autovacuum_vacuum_cost_delay?  The selects seem to be
> delaying for 32msec, which is not the default for anything.

20ms is what it returns.  I grepped for '32' in postgresql.conf  and
other than the '5432' port, nothing was returned.

I'm using the official postgres RPMs that shipped with
Fedora15-x86_64, so unless something weird got compiled in, I have no
idea where that 32ms is coming from.

>
>> Is there no way to safely kill off this autovacuum process that
>> doesn't involve shutting down & restarting the database?
>
> Sending it a SIGINT ought to be safe enough, though I don't think that
> is necessarily advisable, because the next autovacuum will probably take
> just as long.  Killing this one will mainly mean you've wasted (much of)
> the work it did so far.

ok, then I guess I'll wait longer.

>
> Before getting hasty I'd suggest identifying what table (or index) it's
> working on --- lsof on the process to see what FD 95 is connected to
> would be the first step.  I'm thinking it's an index since the seek
> addresses don't seem to be consecutive.  And it might be worth watching
> the seek addresses for awhile to see if you can prove that it's looping
> --- if it is, that might be an indication of a corrupt index.  If it
> isn't, but is just steadily working through the index, you'd be best
> advised to have patience.

I suspect you're right.  I just ran strace against that PID again, and
now all the lseek & read FD's are referrring to a different number
(115), so that means its moved onto something new since I looked a few
hours ago?

Anyway, I think this is what you were referring to:
/proc/30188/fd/115 ->   /var/lib/pgsql/data/base/64793/72633.10

How do I correlate that file to an actual database object?

thanks



  1   2   3   >