Re: [GENERAL] Anonymized database dumps

2012-03-19 Thread Janning Vygen

pgcrypto does not work for this scenario as far as I know.

pgcrypto enables me to encrypt my data and let only a user with the 
right password (or key or whatever) decrypt it, right? So if I run it in 
a test environment without this password, the application is broken.


I still want to use these table columns in my test environment, but 
instead of real email addresses I want addresses like 
random_num...@example.org.


You might be right that it is a good idea to additionally encrypt this data.

regards
Janning

On 19.03.2012 06:24, Kiriakos Georgiou wrote:

I would store sensitive data encrypted in the database.  Check the pgcrypto 
module.

Kiriakos


On Mar 18, 2012, at 1:00 PM, Janning Vygen wrote:


Hi,

I am working on postgresql 9.1 and loving it!

Sometimes we need a full database dump to test some performance issues with 
real data.

Of course we don't like to have sensitive data like bunches of e-mail addresses 
on our development machines, as they are of no interest to developers and 
should be kept secure.

So we need an anonymized database dump. I thought about a few ways to achieve 
this.

1. The best solution would be a special db user and some rules which fire on 
reading some tables and replace private data with some random data. Then doing a 
dump as this special user doesn't even copy the sensitive data at all. The user 
just has a different view of this database, even when he calls pg_dump.

But as rules are not fired on COPY it can't work, right?

2. The other solution I can think of is something like

pg_dump | sed > pgdump_anon

where 'sed' does a lot of magical replace operations on the content of the 
dump. I don't think this is going to work reliably.

3. More reliable would be to dump the database, restore it on a different 
server, run some SQL script which randomizes some data, and dump it again. Hmm, 
that seems to be the only reliable way so far. But it is no fun when dumping and 
restoring takes an hour.

Does anybody have a better idea how to achieve an anonymized database dump?

regards
Janning









--
Kicktipp GmbH

Venloer Straße 8, 40477 Düsseldorf
Sitz der Gesellschaft: Düsseldorf
Geschäftsführung: Janning Vygen
Handelsregister Düsseldorf: HRB 55639

http://www.kicktipp.de/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] WHERE IN (subselect) versus WHERE IN (1,2,3,)

2012-03-19 Thread Albe Laurenz
Kevin Goess wrote:
 We have a table contexts with 1.6 million rows, and a table
articles with 1.4 million rows, where
 an article is a particular kind of context.  We want to select
from a join on those two tables
 like this
 
 SELECT COUNT(*)
 FROM contexts
 JOIN articles ON (articles.context_key=contexts.context_key)
 WHERE contexts.context_key IN (...);
 /* and some combination of columns from articles and contexts */
 
 If IN(...) is a query, then this guy does a seq scan on the contexts
table, even if the subquery is
 select col_a from kgtest where kgtest has one row.  If however I
read the ids beforehand and write
 them into the query, a la IN (111,222,333...), then everything
is happy, up to at least 20,000
 values written into the sql, at which point smaller machines will take
2-5 minutes to parse the query.
 
 I can certainly write the ids inline into the SQL, but when I do that
I get the distinct impression
 that I'm Doing It Wrong.  Is this expected behavior?  It seems
surprising to me.
 
 
 To demonstrate:
 
 /* nothing up my sleeve */
 # select * from kgtest;
    cola
 ---------
  1652729
 (1 row)

[...]

   /* subselect, query plan does seq scan on contexts */
[...]
 ->  Seq Scan on kgtest  (cost=0.00..34.00 rows=2400 width=4)
(actual time=0.048..0.050 rows
[...]

There is something missing in this line, but according to what you wrote
it must be "actual [...] rows=1". And yet the planner assumes that the
scan will return 2400 rows.
That means that your statistics are not accurate.

As a first measure, you should ANALYZE the tables involved and see if
the problem persists.  If yes, post the new plans.
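For example, against the tables in your query:

ANALYZE contexts;
ANALYZE articles;
ANALYZE kgtest;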

Yours,
Laurenz Albe

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Multi server query

2012-03-19 Thread Sergey Konoplev
Hi,

On Mon, Mar 19, 2012 at 12:12 AM, Florent THOMAS mailingl...@tdeo.fr wrote:
 How do you query multiple servers and multiple databases on PostgreSQL?


Look at this http://wiki.postgresql.org/wiki/PL/Proxy
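A minimal PL/Proxy sketch, assuming the extension is installed and a cluster
named 'usercluster' has already been configured (the function, cluster and
parameter names here are made up for illustration):

CREATE FUNCTION get_user_email(i_username text)
RETURNS SETOF text AS $$
    CLUSTER 'usercluster';
    RUN ON hashtext(i_username);
$$ LANGUAGE plproxy;

Calling get_user_email() on the proxy database then runs a function of the
same name and signature on whichever partition hashtext(i_username) selects.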


-- 
Sergey Konoplev

Blog: http://gray-hemp.blogspot.com
LinkedIn: http://ru.linkedin.com/in/grayhemp
JID/GTalk: gray...@gmail.com Skype: gray-hemp

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Anonymized database dumps

2012-03-19 Thread hari . fuchs
Janning Vygen vy...@kicktipp.de writes:

 pgcrypto does not work for this scenario as far as I know.

 pgcrypto enables me to encrypt my data and let only a user with the
 right password (or key or whatever) decrypt it, right? So if I run it
 in a test environment without this password, the application is broken.

 I still want to use these table columns in my test environment but
 instead of real email addresses I want addresses like
 random_num...@example.org.

 You might be right that it is a good idea to additionally encrypt this data.

Maybe you could change your application so that it doesn't access the
critical tables directly and instead define views for them which, based
on current_user, either do decryption or return random strings.
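A rough sketch of that idea, assuming pgcrypto and entirely made-up table,
column and role names:

CREATE VIEW members AS
SELECT member_id,
       CASE WHEN current_user = 'app_prod'
            THEN pgp_sym_decrypt(email_enc, 'production-secret')
            ELSE 'user_' || member_id || '@example.org'
       END AS email
FROM members_private;

The application keeps selecting from members; only the production role sees
real addresses.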


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Multi server query

2012-03-19 Thread John R Pierce

On 03/18/12 1:12 PM, Florent THOMAS wrote:

How do you query _multiple servers_ and multiple databases on PostgreSQL?


Other than plproxy, there's dblink, in the contrib collection. This 
lets you query another database or server from within a SQL query.
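A minimal dblink sketch (the connection string and remote table are invented;
dblink has to be installed in the local database):

SELECT *
FROM dblink('host=otherhost dbname=otherdb user=report password=secret',
            'SELECT id, name FROM remote_table')
     AS t(id integer, name text);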






--
john r pierceN 37, W 122
santa cruz ca mid-left coast


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Anonymized database dumps

2012-03-19 Thread Marko Kreen
On Mon, Mar 19, 2012 at 10:12:01AM +0100, hari.fu...@gmail.com wrote:
 Janning Vygen vy...@kicktipp.de writes:
  pgcrypto does not work for this scenario as far as I know.
 
  pgcrypto enables me to encrypt my data and let only a user with the
  right password (or key or whatever) decrypt it, right? So if I run it
  in a test environment without this password, the application is broken.
 
  I still want to use these table columns in my test environment but
  instead of real email addresses I want addresses like
  random_num...@example.org.
 
  You might be right that it is a good idea to additionally encrypt this data.
 
 Maybe you could change your application so that it doesn't access the
 critical tables directly and instead define views for them which, based
 on current_user, either do decryption or return random strings.

Encryption is the wrong tool for anonymization.

The right tool is hmac(), which gives you a one-way hash that
is protected by a key, which means the other side can't even
calculate the hashes unless they have the same key.

You can calculate it with pgcrypto when dumping,
or later when post-processing the dumps.

But it produces random-looking values; if you need something
realistic-looking you need custom mapping logic.
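A small sketch of that, with an invented key and an assumed members(email) table:

SELECT encode(hmac(email, 'anonymization-key', 'sha256'), 'hex')
       || '@example.org' AS anon_email
FROM members;

Without the key nobody can recompute which hash belongs to which real address,
yet the same input always maps to the same pseudonym, so joins on the column
keep working.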

-- 
marko


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Multi server query

2012-03-19 Thread Florent THOMAS

Great, thanks to all of you. Regards

On 19/03/2012 09:58, Sergey Konoplev wrote:

Hi,

On Mon, Mar 19, 2012 at 12:12 AM, Florent THOMAS mailingl...@tdeo.fr wrote:

How do you query multiple servers and multiple databases on PostgreSQL?


Look at this http://wiki.postgresql.org/wiki/PL/Proxy




Re: [GENERAL] WHERE IN (subselect) versus WHERE IN (1,2,3,)

2012-03-19 Thread Kevin Goess
On Mon, Mar 19, 2012 at 9:24 AM, Albe Laurenz laurenz.a...@wien.gv.at wrote:

 That means that your statistics are not accurate.

 As a first measure, you should ANALYZE the tables involved and see if
 the problem persists.  If yes, post the new plans.


Aha, thanks, that explains why my test table with one row was so bad.  But
even with all freshly ANALYZE'd tables, I still see the query reverting to
a sequential scan on that big contexts table once the number of rows in the
subselect goes over 199.  Here's a simplified version that demonstrates the
problem.

production= explain (analyze, buffers) SELECT contexts.context_key FROM
contexts JOIN articles ON (articles.context_key=contexts.context_key) WHERE
contexts.context_key IN (SELECT context_key FROM virtual_ancestors limit
200) AND articles.indexed;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=7086.13..219322.15 rows=411736 width=4) (actual time=50.118..1213.046 rows=35 loops=1)
   Hash Cond: (contexts.context_key = articles.context_key)
   Buffers: shared hit=72539 read=100104
   ->  Seq Scan on contexts  (cost=0.00..190285.83 rows=1783283 width=4) (actual time=0.040..769.891 rows=1786074 loops=1)
         Buffers: shared hit=72399 read=100054
   ->  Hash  (cost=1939.43..1939.43 rows=411736 width=8) (actual time=3.510..3.510 rows=35 loops=1)
         Buckets: 65536  Batches: 1  Memory Usage: 2kB
         Buffers: shared hit=140 read=50
         ->  Nested Loop  (cost=6.18..1939.43 rows=411736 width=8) (actual time=0.203..3.487 rows=35 loops=1)
               Buffers: shared hit=140 read=50
               ->  HashAggregate  (cost=6.18..8.18 rows=200 width=4) (actual time=0.174..0.198 rows=48 loops=1)
                     Buffers: shared read=2
                     ->  Limit  (cost=0.00..3.68 rows=200 width=4) (actual time=0.015..0.108 rows=200 loops=1)
                           Buffers: shared read=2
                           ->  Seq Scan on virtual_ancestors  (cost=0.00..87676.17 rows=4759617 width=4) (actual time=0.015..0.075 rows=200 loops=1)
                                 Buffers: shared read=2
               ->  Index Scan using articles_pkey on articles  (cost=0.00..9.64 rows=1 width=4) (actual time=0.015..0.068 rows=1 loops=48)
                     Index Cond: (articles.context_key = virtual_ancestors.context_key)
                     Filter: articles.indexed
                     Buffers: shared hit=140 read=48
 Total runtime: 1213.138 ms
(21 rows)


But if I write the keys in the subquery inline, I get a very nice execution
plan, all the way up to a tested maximum of about 50,000 keys:


production= explain (analyze, buffers) SELECT contexts.context_key FROM
contexts JOIN articles ON (articles.context_key=contexts.context_key) WHERE
contexts.context_key IN (2482612,2482612,...) AND articles.indexed;

                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=758.71..3418.40 rows=200 width=4) (actual time=0.621..1.089 rows=35 loops=1)
   Buffers: shared hit=826 read=1
   ->  Bitmap Heap Scan on contexts  (cost=752.58..1487.55 rows=200 width=4) (actual time=0.604..0.699 rows=48 loops=1)
         Recheck Cond: (context_key = ANY ('{2482612,2482612,...}'::integer[]))
         Buffers: shared hit=639
         ->  Bitmap Index Scan on contexts_pkey  (cost=0.00..752.53 rows=200 width=0) (actual time=0.591..0.591 rows=200 loops=1)
               Index Cond: (context_key = ANY ('{2482612,2482612,...}'::integer[]))
               Buffers: shared hit=600
   ->  Bitmap Heap Scan on articles  (cost=6.13..9.64 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=48)
         Recheck Cond: (articles.context_key = contexts.context_key)
         Filter: articles.indexed
         Buffers: shared hit=187 read=1
         ->  Bitmap Index Scan on articles_pkey  (cost=0.00..6.13 rows=1 width=0) (actual time=0.005..0.005 rows=1 loops=48)
               Index Cond: (articles.context_key = contexts.context_key)
               Buffers: shared hit=148
 Total runtime: 1.147 ms

Is this expected behavior, that writing the ids inline does much better
than the subquery?  I've been told that it's not, but this isn't the first
time I've seen this, so I feel like I'm not understanding something.


Re: [GENERAL] nice'ing the postgres COPY backend process to make pg_dumps run more softly

2012-03-19 Thread Kiriakos Georgiou
Try this:   http://klicman.org/throttle/

Kiriakos


On Mar 19, 2012, at 12:06 AM, Aleksey Tsalolikhin wrote:

 Hi.  When pg_dump runs, our application becomes inoperative (too
 slow).  I was going to ask if nice'ing the postgres backend process
 that handles the COPY would help but I just realized probably the
 pg_dump takes out locks when it runs and nice'ing it would just make
 it run longer...
 
 However the man page says pg_dump does not block other users
 accessing the database  (readers  or writers).  But if we run a
 pg_dump, the phone starts ringing, users are complaining that the web
 app is not working.
 
 Would appreciate some pointer to help me reconcile these two
 apparently contradictory facts.
 
 Best,
 -at
 


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Anonymized database dumps

2012-03-19 Thread Bill Moran
In response to Janning Vygen vy...@kicktipp.de:
 
 I am working on postgresql 9.1 and loving it!
 
 Sometimes we need a full database dump to test some performance issues 
 with real data.
 
 Of course we don't like to have sensitive data like bunches of e-mail 
 addresses on our development machines, as they are of no interest to 
 developers and should be kept secure.
 
 So we need an anonymized database dump. I thought about a few ways to 
 achieve this.
 
 1. The best solution would be a special db user and some rules which fire on 
 reading some tables and replace private data with some random data. Then 
 doing a dump as this special user doesn't even copy the sensitive data at 
 all. The user just has a different view of this database, even when he 
 calls pg_dump.
 
 But as rules are not fired on COPY it can't work, right?
 
 2. The other solution I can think of is something like
 
 pg_dump | sed > pgdump_anon
 
 where 'sed' does a lot of magical replace operations on the content of 
 the dump. I don't think this is going to work reliably.
 
 3. More reliable would be to dump the database, restore it on a 
 different server, run some SQL script which randomizes some data, and 
 dump it again. Hmm, that seems to be the only reliable way so far. But it is 
 no fun when dumping and restoring takes an hour.
 
 Does anybody have a better idea how to achieve an anonymized database dump?

I highly recommend #3.  It's how we do it where I work.

At first it seems like a big, slow, complicated monster, but once you've
built the tools and have it running reliably it's very nice.  Our system
does the dumps overnight via cron (we have over 100 production databases),
then changes the sensitive data, as well as changing all the passwords to
'password' so developers can easily log in as any account.  During the
day, the developers have access to all the sanitized dump files and can
use them to make as many testing databases as they need.  Yes, the data
gets up to 24 hours out of date, but it's never been a problem for us.
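A fragment of such a sanitizing script might look like this (the table, column
names and password handling are purely illustrative):

BEGIN;
UPDATE members
   SET email    = 'user_' || member_id || '@example.org',
       password = md5('password');
COMMIT;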

-- 
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] WHERE IN (subselect) versus WHERE IN (1,2,3,)

2012-03-19 Thread Tom Lane
Kevin Goess kgo...@bepress.com writes:
 On Mon, Mar 19, 2012 at 9:24 AM, Albe Laurenz laurenz.a...@wien.gv.at wrote:
 That means that your statistics are not accurate.

 Aha, thanks, that explains why my test table with one row was so bad.  But
 even with all freshly ANALYZE'd tables, I still see the query reverting to
 a sequential scan on that big contexts table once the number of rows in the
 subselect goes over 199.  Here's a simplified version that demonstrates the
 problem.

You've still got a nasty join-size estimation error:

  ->  Nested Loop  (cost=6.18..1939.43 rows=411736 width=8) (actual
 time=0.203..3.487 rows=35 loops=1)

It's not apparent why that's so far off ...

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Anonymized database dumps

2012-03-19 Thread Janning Vygen
On 19.03.2012 at 13:22, Bill Moran wmo...@potentialtech.com wrote:

 In response to Janning Vygen vy...@kicktipp.de:
 
 I am working on postgresql 9.1 and loving it!
 
 Sometimes we need a full database dump to test some performance issues 
 with real data.
 
 Of course we don't like to have sensitive data like bunches of e-mail 
 addresses on our development machines, as they are of no interest to 
 developers and should be kept secure.
 
 So we need an anonymized database dump. I thought about a few ways to 
 achieve this.
 
 1. The best solution would be a special db user and some rules which fire on 
 reading some tables and replace private data with some random data. Then 
 doing a dump as this special user doesn't even copy the sensitive data at 
 all. The user just has a different view of this database, even when he 
 calls pg_dump.
 
 But as rules are not fired on COPY it can't work, right?
 
 2. The other solution I can think of is something like
 
 pg_dump | sed > pgdump_anon
 
 where 'sed' does a lot of magical replace operations on the content of 
 the dump. I don't think this is going to work reliably.
 
 3. More reliable would be to dump the database, restore it on a 
 different server, run some SQL script which randomizes some data, and 
 dump it again. Hmm, that seems to be the only reliable way so far. But it is 
 no fun when dumping and restoring takes an hour.
 
 Does anybody have a better idea how to achieve an anonymized database dump?
 
 I highly recommend #3.  It's how we do it where I work.
 
 At first it seems like a big, slow, complicated monster, but once you've
 built the tools and have it running reliably it's very nice.  Our system
 does the dumps overnight via cron (we have over 100 production databases),
 then changes the sensitive data, as well as changing all the passwords to
 'password' so developers can easily log in as any account.  During the
 day, the developers have access to all the sanitized dump files and can
 use them to make as many testing databases as they need.  Yes, the data
 gets up to 24 hours out of date, but it's never been a problem for us.

Thanks for your response and your insights to your process. Sounds reasonable. 

Regards
Janning
-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Conditionnal validation for transaction

2012-03-19 Thread Florent THOMAS

Hi all of you,

1 - Is there a way to have conditions for committing transactions, like 
in Oracle: 
http://www.scribd.com/doc/42831667/47/Validation-conditionnelle-de-transaction-62


2 - Is there a way to declare local variables for a SQL statement 
without being in a function?


regards


Re: [GENERAL] Conditionnal validation for transaction

2012-03-19 Thread mgould
2.  The short answer is no.  I've got the same issue.  I come from a
different SQL dialect that had a CREATE VARIABLE which was good for the session.
With Postgres, I've created a sessionsettings table and a bunch of
functions to look up a variable and use its value.  My perceived downside
is that this causes a lot of calls to be made to get the data instead of
setting them one time (for most items).  I've been told that the table
will probably be cached, so it will cost very little.
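A bare-bones sketch of that pattern (the table and function names are just
examples):

CREATE TABLE sessionsettings (
    name  text PRIMARY KEY,
    value text
);

CREATE FUNCTION get_setting(p_name text) RETURNS text AS $$
    SELECT value FROM sessionsettings WHERE name = p_name;
$$ LANGUAGE sql STABLE;

-- used inside a statement, e.g.:
-- SELECT * FROM orders WHERE region = get_setting('current_region');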

Michael Gould
Intermodal Software Solutions, LLC
904-226-0978



 Original Message 
Subject: [GENERAL] Conditionnal validation for transaction
From: Florent THOMAS mailingl...@tdeo.fr
Date: Mon, March 19, 2012 8:28 am
To: pgsql-general@postgresql.org

 Hi all of you,
 
 1 - Is there a way to have conditions for committing transactions, like
in Oracle:
http://www.scribd.com/doc/42831667/47/Validation-conditionnelle-de-transaction-62
 
 2 - Is there a way to declare local variables for a SQL statement
without being in a function?
 
 regards


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Conditionnal validation for transaction

2012-03-19 Thread Simon Riggs
On Mon, Mar 19, 2012 at 3:28 PM, Florent THOMAS mailingl...@tdeo.fr wrote:
 Hi all of you,

 1 - Is there a way to have conditions for committing transactions, like in
 Oracle:
 http://www.scribd.com/doc/42831667/47/Validation-conditionnelle-de-transaction-62

 2 - Is there a way to declare local variables for a SQL statement without
 being in a function?

PostgreSQL follows the SQL standard, which does not allow anything like that.

Later versions do allow anonymous blocks, also known as DO statements,
which let you execute some code to make decisions like
that. So the Oracle example translates into very similar code in PostgreSQL, except
that you can't issue ROLLBACK and COMMIT. But then you don't need to,
because you can raise a conditional error or drop through to a commit.
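A minimal sketch of that pattern (the table and the check are invented here
purely for illustration):

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;

DO $$
BEGIN
    IF (SELECT balance FROM accounts WHERE id = 1) < 0 THEN
        RAISE EXCEPTION 'balance would go negative, aborting';
    END IF;
END
$$;

COMMIT;  -- if the DO block raised, the transaction is already aborted and nothing is committed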

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Slow information_schema.views

2012-03-19 Thread Oliver Kohll - Mailing Lists
Hello,

I'm doing some SELECTs from information_schema.views to find views with 
dependencies on other views, i.e.

SELECT table_name FROM information_schema.views WHERE view_definition ILIKE 
'%myviewname%';

and each is taking about 1/2 a second, which is getting a bit slow for my use. 
There are 1213 views listed in information_schema.views.

Doing an explain analyze, it looks like the issue is likely to be the 
pg_get_viewdef function or one of the privilege check functions. I'm not 
worried about privilege checks and I don't need a nicely formatted definition. 
Is there a way of finding out how pg_get_viewdef works so I can perhaps do a 
lower level query?

I've previously used pg_catalog.pg_views which performs similarly.

Or is there a better way of finding view dependencies? I see there's a 
pg_catalog entry for tables that a view depends on but that's not what I'm 
after.

Regards
Oliver Kohll
www.agilebase.co.uk
-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] current thinking on Amazon EC2?

2012-03-19 Thread Ben Chobot
On Mar 19, 2012, at 10:59 AM, Welty, Richard wrote:

 i just finished this thread from May of last year, and am wondering if this 
 still represents consensus thinking about postgresql deployments in the EC2 
 cloud:
 
 http://postgresql.1045698.n5.nabble.com/amazon-ec2-td4368036.html
 

Yes, I believe that still sums up the situation pretty well. 

Re: [GENERAL] pg_upgrade + streaming replication ?

2012-03-19 Thread Bruce Momjian
On Thu, Mar 01, 2012 at 02:01:31PM -0800, Lonni J Friedman wrote:
 I've got a 3 node cluster (1 master/2 slaves) running 9.0.x with
 streaming replication.  I'm in the planning stages of upgrading to
 9.1.x, and am looking into the most efficient way to do the upgrade
 with the goal of minimizing downtime & risk.  After googling, the only
 discussion that I've found of using pg_upgrade with a streaming
 replication setup seems to be this (nearly) year old thread:
 http://web.archiveorange.com/archive/v/9FNVlDWGQtpyWVL54jlK
 
 In summary, there is no way to use both pg_upgrade and streaming
 replication simultaneously.  I'd have to either use pg_upgrade and
 then effectively rebuild/redeploy the slaves, or not use pg_upgrade,
 and reimport all of the data.  Is that still the latest status, or are
 there other options?

You can shut down all three servers, run pg_upgrade on all of them, then
restart them as 9.1 servers.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] pg_upgrade + streaming replication ?

2012-03-19 Thread Lonni J Friedman
On Mon, Mar 19, 2012 at 12:30 PM, Bruce Momjian br...@momjian.us wrote:
 On Thu, Mar 01, 2012 at 02:01:31PM -0800, Lonni J Friedman wrote:
 I've got a 3 node cluster (1 master/2 slaves) running 9.0.x with
 streaming replication.  I'm in the planning stages of upgrading to
 9.1.x, and am looking into the most efficient way to do the upgrade
 with the goal of minimizing downtime & risk.  After googling, the only
 discussion that I've found of using pg_upgrade with a streaming
 replication setup seems to be this (nearly) year old thread:
 http://web.archiveorange.com/archive/v/9FNVlDWGQtpyWVL54jlK

 In summary, there is no way to use both pg_upgrade and streaming
 replication simultaneously.  I'd have to either use pg_upgrade and
 then effectively rebuild/redeploy the slaves, or not use pg_upgrade,
 and reimport all of the data.  Is that still the latest status, or are
 there other options?

 You can shut down all three servers, run pg_upgrade on all of them, then
 restart them as 9.1 servers.


Thanks for your reply.  This is very good news.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] usage of pg_get_functiondef() -- SQL state 42809

2012-03-19 Thread david.sahagian
-- This works.

select
  TRG.tgname,  TFX.proname,  pg_get_functiondef(TFX.oid) as fdef
from
  pg_trigger TRG
  inner join pg_proc TFX on TFX.oid = TRG.tgfoid
where
  TRG.tgisinternal = true


-- This blows up.  -- SQL state: 42809 -- ERROR: array_agg is an aggregate 
function

select
  TRG.tgname,  TFX.proname,  pg_get_functiondef(TFX.oid) as fdef
from
  pg_trigger TRG
  inner join pg_proc TFX on TFX.oid = TRG.tgfoid
where
  TRG.tgisinternal = true
  and
  pg_get_functiondef(TFX.oid) = 'whatever'


Can you help me understand why this blows up ?
I am running PostgreSQL 9.1.2, compiled by Visual C++ build 1500, 64-bit

-dvs-


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] current thinking on Amazon EC2?

2012-03-19 Thread Mike Christensen
On Mon, Mar 19, 2012 at 11:16 AM, Ben Chobot be...@silentmedia.com wrote:
 On Mar 19, 2012, at 10:59 AM, Welty, Richard wrote:

 i just finished this thread from May of last year, and am wondering if this
 still represents consensus thinking about postgresql deployments in the EC2
 cloud:

 http://postgresql.1045698.n5.nabble.com/amazon-ec2-td4368036.html


 Yes, I believe that still sums up the situation pretty well.

I've been running my site on RackSpace CloudServers (similar to EC2)
and have been getting pretty good performance, though I don't have
huge amounts of database load.

One advantage, though, is RackSpace allows for hybrid solutions so I
could potentially lease a dedicated server and continue to host my web
frontend servers on the cloud.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] usage of pg_get_functiondef() -- SQL state 42809

2012-03-19 Thread Tom Lane
david.sahag...@emc.com writes:
 -- This blows up.  -- SQL state: 42809 -- ERROR: array_agg is an aggregate 
 function

 select
   TRG.tgname,  TFX.proname,  pg_get_functiondef(TFX.oid) as fdef
 from
   pg_trigger TRG
   inner join pg_proc TFX on TFX.oid = TRG.tgfoid
 where
   TRG.tgisinternal = true
   and
   pg_get_functiondef(TFX.oid) = 'whatever'


 Can you help me understand why this blows up ?

The second part of the WHERE clause can be evaluated against pg_proc
rows for which pg_get_functiondef() will fail.

An easy workaround would be to use TRG.tgfoid instead, so that the WHERE
clause gets pushed down to the other table.  There probably shouldn't be
any entries in pg_trigger for which pg_get_functiondef() will fail.
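For illustration, the query above rewritten along those lines:

select
  TRG.tgname,  TFX.proname,  pg_get_functiondef(TRG.tgfoid) as fdef
from
  pg_trigger TRG
  inner join pg_proc TFX on TFX.oid = TRG.tgfoid
where
  TRG.tgisinternal = true
  and
  pg_get_functiondef(TRG.tgfoid) = 'whatever'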

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] current thinking on Amazon EC2?

2012-03-19 Thread Welty, Richard



On Mon 3/19/2012 4:30 PM Mike Christensen writes:

I've been running my site on RackSpace CloudServers (similar to EC2)
and have been getting pretty good performance, though I don't have
huge amounts of database load.

One advantage, though, is RackSpace allows for hybrid solutions so I
could potentially lease a dedicated server and continue to host my web
frontend servers on the cloud.

that's good to know, although for the project i'm working on, EC2 is
what we have to work with, good parts and bad parts and all.

richard


Re: [GENERAL] current thinking on Amazon EC2?

2012-03-19 Thread Mike Christensen
 On Mon 3/19/2012 4:30 PM Mike Christensen writes:

I've been running my site on RackSpace CloudServers (similar to EC2)
and have been getting pretty good performance, though I don't have
huge amounts of database load.

One advantage, though, is RackSpace allows for hybrid solutions so I
could potentially lease a dedicated server and continue to host my web
frontend servers on the cloud.

 that's good to know, although for the project i'm working on, EC2 is
 what we have to work with, good parts and bad parts and all.

I know Heroku is built on EC2 and runs Postgres, so I would assume
they've got it set up to get pretty good performance..

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] nice'ing the postgres COPY backend process to make pg_dumps run more softly

2012-03-19 Thread Guillaume Lelarge
On Sun, 2012-03-18 at 21:06 -0700, Aleksey Tsalolikhin wrote:
 Hi.  When pg_dump runs, our application becomes inoperative (too
 slow).  I was going to ask if nice'ing the postgres backend process
 that handles the COPY would help but I just realized probably the
 pg_dump takes out locks when it runs and nice'ing it would just make
 it run longer...
 
 However the man page says pg_dump does not block other users
 accessing the database  (readers  or writers).  But if we run a
 pg_dump, the phone starts ringing, users are complaining that the web
 app is not working.
 
 Would appreciate some pointer to help me reconcile these two
 apparently contradictory facts.
 

Depends on what your app is doing. It doesn't block any usual use of the
database: DML are all accepted. But you cannot drop a table that pg_dump
must save, you cannot change its definition. So there are some DDL
commands you cannot use during a dump.

Other than this, your users shouldn't be blocked. If it happens again,
you should look at pg_locks and pg_stat_activity to understand what's
going on.
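For example, a minimal check along those lines (column names as of 9.0/9.1):

SELECT a.procpid, a.waiting, a.current_query,
       l.locktype, l.relation::regclass, l.mode
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.procpid
WHERE NOT l.granted;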


-- 
Guillaume
http://blog.guillaume.lelarge.info
http://www.dalibo.com


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Conditionnal validation for transaction

2012-03-19 Thread Florent THOMAS



On 19/03/2012 16:57, Simon Riggs wrote:

On Mon, Mar 19, 2012 at 3:28 PM, Florent THOMAS mailingl...@tdeo.fr wrote:

Hi all of you,

1 - Is there a way to have conditions for committing transactions, like in
Oracle:
http://www.scribd.com/doc/42831667/47/Validation-conditionnelle-de-transaction-62

2 - Is there a way to declare local variables for a SQL statement without
being in a function?

PostgreSQL follows the SQL standard which does not allow anything like that.

Later versions do allow anonymous blocks, also known as DO statements
that allow you to execute some code to allow decision making like
that. So the Oracle example is very similar code in PostgreSQL, except
that you can't issue ROLLBACK and COMMIT.

Thanks. Could you clarify the sentence below:

  But then you don't need to
because you *can do a conditional error or drop through to a commit*.

How do you do that?

Regards


Re: [GENERAL] nice'ing the postgres COPY backend process to make pg_dumps run more softly

2012-03-19 Thread Steve Crawford

On 03/19/2012 01:51 PM, Guillaume Lelarge wrote:

On Sun, 2012-03-18 at 21:06 -0700, Aleksey Tsalolikhin wrote:

Hi.  When pg_dump runs, our application becomes inoperative (too
slow)

Depends on what your app is doing. It doesn't block any usual use of the
database: DML are all accepted. But you cannot drop a table that pg_dump
must save, you cannot change its definition. So there are some DDL
commands you cannot use during a dump

Dumping may not technically block access but it *does*, of course, 
consume resources.


Most obvious is that it requires reading all table data in entirety. 
This will cause competition for disk access and may cause active data to 
be temporarily pushed out of cache.


You also have to write the data somewhere. If it is on the same drive as 
your database, you will have write competition. If it is on another 
machine it will use network resources.


If you are compressing the data either externally or using a compressed 
dump format, you will need more CPU to handle the compression on 
whatever machine is doing the actual compression.


To assist, we need more info. Tell us the database size, some details 
about your dump process (same or different machine, compression, etc.), 
how long your dumps take to run, how many backends are typically running 
and how many you reach during a dump, whether or not any web processes 
alter tables and other info you think may be of use.


Cheers,
Steve




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Anonymized database dumps

2012-03-19 Thread Kiriakos Georgiou
The data anonymizer process is flawed because you are one misstep away from 
data spillage.  Sensitive data should be stored encrypted to begin with.  For test 
databases you or your developers can invoke a process that replaces the real 
encrypted data with fake encrypted data (for which everybody has the 
key/password).  Or if the overhead is too much (i.e. billions of rows), you can 
have different decrypt() routines on your test databases that return fake data 
without touching the real encrypted columns.
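For instance, a test-only stand-in for a decryption routine could look like
this (the function and column names are hypothetical):

CREATE OR REPLACE FUNCTION decrypt_email(enc bytea) RETURNS text AS $$
    SELECT 'user_' || md5(enc) || '@example.org';
$$ LANGUAGE sql IMMUTABLE;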

Kiriakos

On Mar 19, 2012, at 8:22 AM, Bill Moran wrote:

 In response to Janning Vygen vy...@kicktipp.de:
 
 I am working on postgresql 9.1 and loving it!
 
 Sometimes we need a full database dump to test some performance issues 
 with real data.
 
 Of course we don't like to have sensitive data like bunches of e-mail 
 addresses on our development machines, as they are of no interest to 
 developers and should be kept secure.
 
 So we need an anonymized database dump. I thought about a few ways to 
 achieve this.
 
 1. The best solution would be a special db user and some rules which fire on 
 reading some tables and replace private data with some random data. Then 
 doing a dump as this special user doesn't even copy the sensitive data at 
 all. The user just has a different view of this database, even when he 
 calls pg_dump.
 
 But as rules are not fired on COPY it can't work, right?
 
 2. The other solution I can think of is something like
 
 pg_dump | sed > pgdump_anon
 
 where 'sed' does a lot of magical replace operations on the content of 
 the dump. I don't think this is going to work reliably.
 
 3. More reliable would be to dump the database, restore it on a 
 different server, run some SQL script which randomizes some data, and 
 dump it again. Hmm, that seems to be the only reliable way so far. But it is 
 no fun when dumping and restoring takes an hour.
 
 Does anybody have a better idea how to achieve an anonymized database dump?
 
 I highly recommend #3.  It's how we do it where I work.
 
 At first it seems like a big, slow, complicated monster, but once you've
 built the tools and have it running reliably it's very nice.  Our system
 does the dumps overnight via cron (we have over 100 production databases),
 then changes the sensitive data, as well as changing all the passwords to
 'password' so developers can easily log in as any account.  During the
 day, the developers have access to all the sanitized dump files and can
 use them to make as many testing databases as they need.  Yes, the data
 gets up to 24 hours out of date, but it's never been a problem for us.
 
 -- 
 Bill Moran
 http://www.potentialtech.com
 http://people.collaborativefusion.com/~wmoran/
 


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Anonymized database dumps

2012-03-19 Thread Bill Moran
In response to Kiriakos Georgiou kg.postgre...@olympiakos.com:

 The data anonymizer process is flawed because you are one misstep away from 
 data spillage.

In our case, it's only one layer.

Other layers that exist:
* The systems where this test data is instantiated can't send email
* The systems where this exist have limited access (i.e., not all
  developers can access it, and it's not used for typical testing --
  only for specific testing that requires production-like data)

You are correct, however, in that there's always the danger of
spillage if new sensitive data is added and the sanitation script
is not properly updated.  It's part of the ongoing overhead of
maintaining such a system.

 Sensitive data should be stored encrypted to begin with.  For test databases you 
 or your developers can invoke a process that replaces the real encrypted data 
 with fake encrypted data (for which everybody has the key/password).  Or if 
 the overhead is too much (i.e. billions of rows), you can have different 
 decrypt() routines on your test databases that return fake data without 
 touching the real encrypted columns.

The thing is, this process has the same potential data spillage
issues as sanitizing the data.  I find it intriguing, however, and
I'm going to see if there are places where this approach might
have advantages over our current one.

Since much of our sensitive data is already de-identified, it
provides an additional level of protection on that level as well.

-- 
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] pg_upgrade + streaming replication ?

2012-03-19 Thread Jeff Davis
On Mon, 2012-03-19 at 15:30 -0400, Bruce Momjian wrote:
 On Thu, Mar 01, 2012 at 02:01:31PM -0800, Lonni J Friedman wrote:
  I've got a 3 node cluster (1 master/2 slaves) running 9.0.x with
  streaming replication.  I'm in the planning stages of upgrading to
  9.1.x, and am looking into the most efficient way to do the upgrade
  with the goal of minimizing downtime & risk.  After googling, the only
  discussion that I've found of using pg_upgrade with a streaming
  replication setup seems to be this (nearly) year old thread:
  http://web.archiveorange.com/archive/v/9FNVlDWGQtpyWVL54jlK
  
  In summary, there is no way to use both pg_upgrade and streaming
  replication simultaneously.  I'd have to either use pg_upgrade and
  then effectively rebuild/redeploy the slaves, or not use pg_upgrade,
  and reimport all of the data.  Is that still the latest status, or are
  there other options?
 
 You can shut down all three servers, run pg_upgrade on all of them, then
 restart them as 9.1 servers.

After running pg_upgrade on each server individually, they will have
different system IDs, and potentially different on-disk representation
of the catalogs, right?

So how can you resume streaming without rebuilding the slaves?

Regards,
Jeff Davis


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] nice'ing the postgres COPY backend process to make pg_dumps run more softly

2012-03-19 Thread Merlin Moncure
On Mon, Mar 19, 2012 at 3:51 PM, Guillaume Lelarge
guilla...@lelarge.info wrote:
 On Sun, 2012-03-18 at 21:06 -0700, Aleksey Tsalolikhin wrote:
 Hi.  When pg_dump runs, our application becomes inoperative (too
 slow).  I was going to ask if nice'ing the postgres backend process
 that handles the COPY would help but I just realized probably the
 pg_dump takes out locks when it runs and nice'ing it would just make
 it run longer...

 However the man page says pg_dump does not block other users
 accessing the database  (readers  or writers).  But if we run a
 pg_dump, the phone starts ringing, users are complaining that the web
 app is not working.

 Would appreciate some pointer to help me reconcile these two
 apparently contradictory facts.


 Depends on what your app is doing. It doesn't block any usual use of the
 database: DML are all accepted. But you cannot drop a table that pg_dump
 must save, you cannot change its definition. So there are some DDL
 commands you cannot use during a dump.

 Other than this, your users shouldn't be blocked. If it happens again,
 you should look at pg_locks and pg_stat_activity to understand what's
 going on.

This.  pg_dump essentially runs a read-only, database-wide transaction
that touches all objects.  This will allow all updates, selects,
deletes, etc., but will block, say, a CREATE INDEX or ALTER TABLE.  Now
suppose some other transaction fires off one of those guys
and has in turn already touched a customer record... bam.  Classic
priority inversion -- slowing down pg_dump will only make the problem
worse.

So the very first thing to determine is whether you are seeing generalized
load issues (we don't have the info for that) or locking issues,
because the solution will be entirely different depending on what
you're seeing.  The most likely case is that you're having iowait issues and
moving the dump to another machine will fix the problem.  The next most
likely case is cpu (courtesy of zlib) -- also easily fixed.  But you have
to know before changing things.  So:

What's the general load (from top) during the pg_dump bad times?
pg_locks, etc. as Guillaume asked
iowait

etc.

merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Anonymized database dumps

2012-03-19 Thread Kiriakos Georgiou
On Mar 19, 2012, at 5:55 PM, Bill Moran wrote:

 
 Sensitive data should be stored encrypted to begin with.  For test databases you 
 or your developers can invoke a process that replaces the real encrypted 
 data with fake encrypted data (for which everybody has the key/password).  
 Or if the overhead is too much (i.e. billions of rows), you can have different 
 decrypt() routines on your test databases that return fake data without 
 touching the real encrypted columns.
 
 The thing is, this process has the same potential data spillage
 issues as sanitizing the data.  


Not really; in the modality I describe, the sensitive data is always encrypted 
in the database and useless, because nobody will have the private key or know 
the password that protects it other than the ops subsystems that require access.
So even if you take an ops dump, load it onto a test box, and walk away, you are 
good.  If your developers/testers want to play with the data they will be 
forced to overwrite and stage test encrypted data they can decrypt, or call 
a fake decrypt() that gives them test data (e.g. joins to a test data table).

Kiriakos
-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] WHERE IN (subselect) versus WHERE IN (1,2,3,)

2012-03-19 Thread Tom Lane
I wrote:
 You've still got a nasty join-size estimation error:

  ->  Nested Loop  (cost=6.18..1939.43 rows=411736 width=8) (actual
 time=0.203..3.487 rows=35 loops=1)

 It's not apparent why that's so far off ...

What PG version is this, anyway?  It strikes me that this estimation
error might have something to do with the eqjoinsel bugs that we repaired
in 9.0.5.  I'm not having any luck reproducing such a bogus estimate
with current code, either, though that may just mean you've omitted
some critical info about how the tables are set up.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general