Re: [GENERAL] WHERE IN (subselect) versus WHERE IN (1,2,3,)
I wrote: > You've still got a nasty join-size estimation error: >> -> Nested Loop (cost=6.18..1939.43 rows=411736 width=8) (actual >> time=0.203..3.487 rows=35 loops=1) > It's not apparent why that's so far off ... What PG version is this, anyway? It strikes me that this estimation error might have something to do with the eqjoinsel bugs that we repaired in 9.0.5. I'm not having any luck reproducing such a bogus estimate with current code, either, though that may just mean you've omitted some critical info about how the tables are set up. regards, tom lane -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Anonymized database dumps
On Mar 19, 2012, at 5:55 PM, Bill Moran wrote: > >> Sensitive data should be stored encrypted to begin. For test databases you >> or your developers can invoke a process that replaces the real encrypted >> data with fake encrypted data (for which everybody has the key/password.) >> Or if the overhead is too much (ie billions of rows), you can have different >> decrypt() routines on your test databases that return fake data without >> touching the real encrypted columns. > > The thing is, this process has the same potential data spillage > issues as sanitizing the data. Not really, in the modality I describe the sensitive data is always encrypted in the database and "useless" because nobody will have the private key or know the password that protects it other than the ops subsystems that require access. So even if you take an ops dump, load it to a test box, and walk away, you are good. If your developers/testers want to play with the data they will be forced to over-write and "stage" test encrypted data they can decrypt, or call a "fake" decrypt() that gives them test data (eg: joins to a test data table.) Kiriakos -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
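For illustration, a minimal sketch of the "fake decrypt()" idea described above might look like this on a test box (the table, column, and function names are hypothetical, and the production decrypt() would wrap pgcrypto or an external key service instead):

    -- staged fake data that developers are allowed to see
    create table test_email_data (
        user_id    integer primary key,
        fake_email text not null
    );

    -- test-only stand-in: ignores the real ciphertext entirely
    create or replace function decrypt_email(integer, bytea) returns text
    language sql stable as $$
        select coalesce(
            (select t.fake_email from test_email_data t where t.user_id = $1),
            'user' || $1 || '@example.org')
    $$;

    -- application code keeps calling the same thing it calls in production, e.g.
    -- select u.user_id, decrypt_email(u.user_id, u.email_enc) from users u;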
Re: [GENERAL] nice'ing the postgres COPY backend process to make pg_dumps run more "softly"
On Mon, Mar 19, 2012 at 3:51 PM, Guillaume Lelarge wrote: > On Sun, 2012-03-18 at 21:06 -0700, Aleksey Tsalolikhin wrote: >> Hi. When pg_dump runs, our application becomes inoperative (too >> slow). I was going to ask if nice'ing the postgres backend process >> that handles the COPY would help but I just realized probably the >> pg_dump takes out locks when it runs and nice'ing it would just make >> it run longer... >> >> However the man page says "pg_dump does not block other users >> accessing the database (readers or writers)." But if we run a >> pg_dump, the phone starts ringing, users are complaining that the web >> app is not working. >> >> Would appreciate some pointer to help me reconcile these two >> apparently contradictory facts. >> > > Depends on what your app is doing. It doesn't block any usual use of the > database: DML are all accepted. But you cannot drop a table that pg_dump > must save, you cannot change its definition. So there are some DDL > commands you cannot use during a dump. > > Other than this, your users shouldn't be blocked. If it happens again, > you should look at pg_locks and pg_stat_activity to understand what's > going on. This. pg_dump essentially runs a read-only, database-wide transaction that touches all objects. This will allow all updates, selects, deletes, etc., but will block, say, a CREATE INDEX or ALTER TABLE. Let's just say some other transaction fires off one of these and has in turn already touched a customer record... bam. Classic priority inversion -- slowing down pg_dump will only make the problem worse. So the very first thing to determine is whether you are seeing generalized load issues (we don't have the info for that) or locking issues, because the solution will be entirely different depending on what you're seeing. Most likely case is you're having iowait issues and moving the dump to another machine will fix the problem. Next most likely case is cpu (courtesy zlib) -- also easily fixed. But you have to know before changing things. So: what's the general load (from top) during pg_dump "bad" times? pg_locks etc., as Guillaume asked; iowait, etc. merlin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
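To make the "look at pg_locks and pg_stat_activity" step concrete, a rough starting point on 9.0/9.1 (where pg_stat_activity still has procpid and current_query) could be something like:

    -- sessions that are waiting on a lock, plus every lock they hold or want
    SELECT a.procpid,
           a.waiting,
           a.current_query,
           l.relation::regclass,
           l.mode,
           l.granted
    FROM pg_stat_activity a
    JOIN pg_locks l ON l.pid = a.procpid
    WHERE a.waiting OR NOT l.granted
    ORDER BY a.query_start;

If the web app's queries show up here waiting on a lock held by the pg_dump transaction (or by DDL queued behind it), it is a locking problem; if nothing is waiting, look at iowait and CPU instead.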
Re: [GENERAL] pg_upgrade + streaming replication ?
On Mon, 2012-03-19 at 15:30 -0400, Bruce Momjian wrote: > On Thu, Mar 01, 2012 at 02:01:31PM -0800, Lonni J Friedman wrote: > > I've got a 3 node cluster (1 master/2 slaves) running 9.0.x with > > streaming replication. I'm in the planning stages of upgrading to > > 9.1.x, and am looking into the most efficient way to do the upgrade > > with the goal of minimizing downtime & risk. After googling, the only > > discussion that I've found of using pg_upgrade with a streaming > > replication setup seems to be this (nearly) year old thread: > > http://web.archiveorange.com/archive/v/9FNVlDWGQtpyWVL54jlK > > > > In summary, there is no way to use both pg_upgrade and streaming > > replication simultaneously. I'd have to either use pg_upgrade and > > then effectively rebuild/redeploy the slaves, or not use pg_upgrade, > > and reimport all of the data. Is that still the latest status, or are > > there other options? > > You can shut down all three servers, run pg_upgrade on all of them, then > restart them as 9.1 servers. After running pg_upgrade on each server individually, they will have different system IDs, and potentially different on-disk representation of the catalogs, right? So how can you resume streaming without rebuilding the slaves? Regards, Jeff Davis -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Anonymized database dumps
In response to Kiriakos Georgiou : > The data anonymizer process is flawed because you are one misstep away from > data spillage. In our case, it's only one layer. Other layers that exist: * The systems where this test data is instantiated can't send email * The systems where this exists have limited access (i.e., not all developers can access it, and it's not used for typical testing -- only for specific testing that requires production-like data) You are correct, however, in that there's always the danger of spillage if new sensitive data is added and the sanitization script is not properly updated. It's part of the ongoing overhead of maintaining such a system. > Sensitive data should be stored encrypted to begin. For test databases you > or your developers can invoke a process that replaces the real encrypted data > with fake encrypted data (for which everybody has the key/password.) Or if > the overhead is too much (ie billions of rows), you can have different > decrypt() routines on your test databases that return fake data without > touching the real encrypted columns. The thing is, this process has the same potential data spillage issues as sanitizing the data. I find it intriguing, however, and I'm going to see if there are places where this approach might have advantages over our current one. Since much of our sensitive data is already de-identified, it provides an additional level of protection there as well. -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/ -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Anonymized database dumps
The data anonymizer process is flawed because you are one misstep away from data spillage. Sensitive data should be stored encrypted to begin. For test databases you or your developers can invoke a process that replaces the real encrypted data with fake encrypted data (for which everybody has the key/password.) Or if the overhead is too much (ie billions of rows), you can have different decrypt() routines on your test databases that return fake data without touching the real encrypted columns. Kiriakos On Mar 19, 2012, at 8:22 AM, Bill Moran wrote: > In response to Janning Vygen : >> >> I am working on postgresql 9.1 and loving it! >> >> Sometimes we need a full database dump to test some performance issues >> with real data. >> >> Of course we don't like to have sensible data like bunches of e-mail >> addresses on our development machines as they are of no interest for >> developers and should be kept secure. >> >> So we need an anonymized database dump. I thought about a few ways to >> achieve this. >> >> 1. Best solution would be a special db user and some rules which fire on >> reading some tables and replace privacy data with some random data. Now >> doing a dump as this special user doesn't even copy the sensible data at >> all. The user just has a different view on this database even when he >> calls pg_dump. >> >> But as rules are not fired on COPY it can't work, right? >> >> 2. The other solution I can think of is something like >> >> pg_dump | sed > pgdump_anon >> >> where 'sed' does a lot of magical replace operations on the content of >> the dump. I don't think this is going to work reliable. >> >> 3. More reliable would be to dump the database, restore it on a >> different server, run some sql script which randomize some data, and >> dump it again. hmm, seems to be the only reliable way so far. But it is >> no fun when dumping and restoring takes an hour. >> >> Does anybody has a better idea how to achieve an anonymized database dump? > > I highly recommend #3. It's how we do it where I work. > > At first it seems like a big, slow, complicated monster, but once you've > built the tools and have it running reliably it's very nice. Our system > does the dumps overnight via cron (we have over 100 production databases) > then changes the sensitive data, as well changing all the passwords to > "password" so developers can easily log in as any account. During the > day, the developers have access to all the sanitized dump files and can > use them to make as many testing databases as they need. Yes, the data > gets up to 24 hours out of date, but it's never been a problem for us. > > -- > Bill Moran > http://www.potentialtech.com > http://people.collaborativefusion.com/~wmoran/ > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] nice'ing the postgres COPY backend process to make pg_dumps run more "softly"
On 03/19/2012 01:51 PM, Guillaume Lelarge wrote: > On Sun, 2012-03-18 at 21:06 -0700, Aleksey Tsalolikhin wrote: >> Hi. When pg_dump runs, our application becomes inoperative (too slow) > Depends on what your app is doing. It doesn't block any usual use of the database: DML are all accepted. But you cannot drop a table that pg_dump must save, you cannot change its definition. So there are some DDL commands you cannot use during a dump. Dumping may not technically block access but it *does*, of course, consume resources. Most obvious is that it requires reading all table data in entirety. This will cause competition for disk access and may cause active data to be temporarily pushed out of cache. You also have to write the data somewhere. If it is on the same drive as your database, you will have write competition. If it is on another machine it will use network resources. If you are compressing the data either externally or using a compressed dump format, you will need more CPU to handle the compression on whatever machine is doing the actual compression. To assist, we need more info. Tell us the database size, some details about your dump process (same or different machine, compression, etc.), how long your dumps take to run, how many backends are typically running and how many you reach during a dump, whether or not any web processes alter tables and other info you think may be of use. Cheers, Steve -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
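As a concrete example of taking that load off the database host, the dump can be run from a different machine so the compression CPU and the output I/O land there instead (host, user, and paths below are made up):

    pg_dump -h db.example.com -U backup -Fc -Z 6 -f /backups/appdb.dump appdb

-Fc produces the compressed custom format and -Z sets the compression level; the database server then only pays for the reads and the network transfer.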
Re: [GENERAL] Conditionnal validation for transaction
On 19/03/2012 16:57, Simon Riggs wrote: > On Mon, Mar 19, 2012 at 3:28 PM, Florent THOMAS wrote: >> Hi all of you, 1 - Is there a way to have conditions for committing transactions like in Oracle: http://www.scribd.com/doc/42831667/47/Validation-conditionnelle-de-transaction-62 2 - Is there a way to declare local variables for a SQL statement without being in a function? > PostgreSQL follows the SQL standard which does not allow anything like that. Later versions do allow anonymous blocks, also known as DO statements that allow you to execute some code to allow decision making like that. So the Oracle example is very similar code in PostgreSQL, except that you can't issue ROLLBACK and COMMIT. Thanks. Could you clarify the sentence below? > But then you don't need to because you *can do a conditional error or drop through to a commit*. How do you do that? Regards
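A minimal sketch of what Simon describes, i.e. raising a conditional error inside the transaction so that it either aborts or drops through to the commit (the accounts table and the business rule are invented for the example):

    BEGIN;

    UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;

    DO $$
    BEGIN
        -- conditional "validation": abort the whole transaction if the rule fails
        IF EXISTS (SELECT 1 FROM accounts WHERE balance < 0) THEN
            RAISE EXCEPTION 'negative balance, refusing to commit';
        END IF;
    END
    $$;

    COMMIT;   -- effective only if the DO block did not raise

If the RAISE fires, the transaction is already aborted, so the final COMMIT ends it as a rollback; there is no explicit COMMIT or ROLLBACK inside the DO block itself.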
Re: [GENERAL] nice'ing the postgres COPY backend process to make pg_dumps run more "softly"
On Sun, 2012-03-18 at 21:06 -0700, Aleksey Tsalolikhin wrote: > Hi. When pg_dump runs, our application becomes inoperative (too > slow). I was going to ask if nice'ing the postgres backend process > that handles the COPY would help but I just realized probably the > pg_dump takes out locks when it runs and nice'ing it would just make > it run longer... > > However the man page says "pg_dump does not block other users > accessing the database (readers or writers)." But if we run a > pg_dump, the phone starts ringing, users are complaining that the web > app is not working. > > Would appreciate some pointer to help me reconcile these two > apparently contradictory facts. > Depends on what your app is doing. It doesn't block any usual use of the database: DML are all accepted. But you cannot drop a table that pg_dump must save, you cannot change its definition. So there are some DDL commands you cannot use during a dump. Other than this, your users shouldn't be blocked. If it happens again, you should look at pg_locks and pg_stat_activity to understand what's going on. -- Guillaume http://blog.guillaume.lelarge.info http://www.dalibo.com -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] current thinking on Amazon EC2?
> On Mon 3/19/2012 4:30 PM Mike Christensen writes: > >>I've been running my site on RackSpace CloudServers (similar to EC2) >>and have been getting pretty good performance, though I don't have >>huge amounts of database load. > >>One advantage, though, is RackSpace allows for hybrid solutions so I >>could potentially lease a dedicated server and continue to host my web >>frontend servers on the cloud. > > that's good to know, although for the project i'm working on, EC2 is > what we have to work with, good parts and bad parts and all. I know Heroku is built on EC2 and runs Postgres, so I would assume they've got it set up to get pretty good performance.. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] current thinking on Amazon EC2?
On Mon 3/19/2012 4:30 PM Mike Christensen writes: >I've been running my site on RackSpace CloudServers (similar to EC2) >and have been getting pretty good performance, though I don't have >huge amounts of database load. >One advantage, though, is RackSpace allows for hybrid solutions so I >could potentially lease a dedicated server and continue to host my web >frontend servers on the cloud. that's good to know, although for the project i'm working on, EC2 is what we have to work with, good parts and bad parts and all. richard
Re: [GENERAL] usage of pg_get_functiondef() -- SQL state 42809
writes: > -- This blows up. -- SQL state: 42809 -- ERROR: "array_agg" is an aggregate > function > select > TRG.tgname, TFX.proname, pg_get_functiondef(TFX.oid) as fdef > from > pg_trigger TRG > inner join pg_proc TFX on TFX.oid = TRG.tgfoid > where > TRG.tgisinternal = true > and > pg_get_functiondef(TFX.oid) = 'whatever' > Can you help me understand why this blows up ? The second part of the WHERE clause can be evaluated against pg_proc rows for which pg_get_functiondef() will fail. An easy workaround would be to use TRG.tgfoid instead, so that the WHERE clause gets pushed down to the other table. There probably shouldn't be any entries in pg_trigger for which pg_get_functiondef() will fail. regards, tom lane -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
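A sketch of the rewrite Tom suggests, applying pg_get_functiondef() to the trigger's own function OID so the comparison is only ever evaluated against rows coming from pg_trigger:

    select TRG.tgname, TFX.proname, pg_get_functiondef(TFX.oid) as fdef
    from pg_trigger TRG
    inner join pg_proc TFX on TFX.oid = TRG.tgfoid
    where TRG.tgisinternal = true
      and pg_get_functiondef(TRG.tgfoid) = 'whatever';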
Re: [GENERAL] current thinking on Amazon EC2?
On Mon, Mar 19, 2012 at 11:16 AM, Ben Chobot wrote: > On Mar 19, 2012, at 10:59 AM, Welty, Richard wrote: > > i just finished this thread from May of last year, and am wondering if this > still represents consensus thinking about postgresql deployments in the EC2 > cloud: > > http://postgresql.1045698.n5.nabble.com/amazon-ec2-td4368036.html > > > Yes, I believe that still sums up the situation pretty well. I've been running my site on RackSpace CloudServers (similar to EC2) and have been getting pretty good performance, though I don't have huge amounts of database load. One advantage, though, is RackSpace allows for hybrid solutions so I could potentially lease a dedicated server and continue to host my web frontend servers on the cloud. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
[GENERAL] usage of pg_get_functiondef() -- SQL state 42809
-- This works. select TRG.tgname, TFX.proname, pg_get_functiondef(TFX.oid) as fdef from pg_trigger TRG inner join pg_proc TFX on TFX.oid = TRG.tgfoid where TRG.tgisinternal = true -- This blows up. -- SQL state: 42809 -- ERROR: "array_agg" is an aggregate function select TRG.tgname, TFX.proname, pg_get_functiondef(TFX.oid) as fdef from pg_trigger TRG inner join pg_proc TFX on TFX.oid = TRG.tgfoid where TRG.tgisinternal = true and pg_get_functiondef(TFX.oid) = 'whatever' Can you help me understand why this blows up ? I am running "PostgreSQL 9.1.2, compiled by Visual C++ build 1500, 64-bit" -dvs- -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] pg_upgrade + streaming replication ?
On Mon, Mar 19, 2012 at 12:30 PM, Bruce Momjian wrote: > On Thu, Mar 01, 2012 at 02:01:31PM -0800, Lonni J Friedman wrote: >> I've got a 3 node cluster (1 master/2 slaves) running 9.0.x with >> streaming replication. I'm in the planning stages of upgrading to >> 9.1.x, and am looking into the most efficient way to do the upgrade >> with the goal of minimizing downtime & risk. After googling, the only >> discussion that I've found of using pg_upgrade with a streaming >> replication setup seems to be this (nearly) year old thread: >> http://web.archiveorange.com/archive/v/9FNVlDWGQtpyWVL54jlK >> >> In summary, there is no way to use both pg_upgrade and streaming >> replication simultaneously. I'd have to either use pg_upgrade and >> then effectively rebuild/redeploy the slaves, or not use pg_upgrade, >> and reimport all of the data. Is that still the latest status, or are >> there other options? > > You can shut down all three servers, run pg_upgrade on all of them, then > restart them as 9.1 servers. > Thanks for your reply. This is very good news. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] pg_upgrade + streaming replication ?
On Thu, Mar 01, 2012 at 02:01:31PM -0800, Lonni J Friedman wrote: > I've got a 3 node cluster (1 master/2 slaves) running 9.0.x with > streaming replication. I'm in the planning stages of upgrading to > 9.1.x, and am looking into the most efficient way to do the upgrade > with the goal of minimizing downtime & risk. After googling, the only > discussion that I've found of using pg_upgrade with a streaming > replication setup seems to be this (nearly) year old thread: > http://web.archiveorange.com/archive/v/9FNVlDWGQtpyWVL54jlK > > In summary, there is no way to use both pg_upgrade and streaming > replication simultaneously. I'd have to either use pg_upgrade and > then effectively rebuild/redeploy the slaves, or not use pg_upgrade, > and reimport all of the data. Is that still the latest status, or are > there other options? You can shut down all three servers, run pg_upgrade on all of them, then restart them as 9.1 servers. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
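For reference, the basic per-node invocation Bruce is referring to looks roughly like this with both sets of binaries installed (the directories are examples and vary by packaging):

    pg_upgrade -b /usr/pgsql-9.0/bin -B /usr/pgsql-9.1/bin \
               -d /var/lib/pgsql/9.0/data -D /var/lib/pgsql/9.1/data

Whether the slaves can keep streaming afterwards is exactly the question Jeff Davis raises in his follow-up, so treat this as the mechanical upgrade step only.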
Re: [GENERAL] current thinking on Amazon EC2?
On Mar 19, 2012, at 10:59 AM, Welty, Richard wrote: > i just finished this thread from May of last year, and am wondering if this > still represents consensus thinking about postgresql deployments in the EC2 > cloud: > > http://postgresql.1045698.n5.nabble.com/amazon-ec2-td4368036.html > Yes, I believe that still sums up the situation pretty well.
[GENERAL] Slow information_schema.views
Hello, I'm doing some SELECTs from information_schema.views to find views with dependencies on other views, i.e. SELECT table_name FROM information_schema.views WHERE view_definition ILIKE '%myviewname%'; and each is taking about 1/2 a second, which is getting a bit slow for my use. There are 1213 views listed in information_schema.views Doing an explain analyze, it looks like the issue is likely to be the pg_get_viewdef function or one of the privilege check functions. I'm not worried about privilege checks and I don't need a nicely formatted definition. Is there a way of finding out how pg_get_viewdef works so I can perhaps do a lower level query? I've previously used pg_catalog.pg_views which performs similarly. Or is there a better way of finding view dependencies? I see there's a pg_catalog entry for tables that a view depends on but that's not what I'm after. Regards Oliver Kohll www.agilebase.co.uk -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
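For the dependency question specifically, the information is in pg_depend: each view's internal _RETURN rule (pg_rewrite) depends on the relations it references, so a sketch along these lines should list views that reference a given view without parsing any view definitions (schema-qualify the name as needed):

    SELECT DISTINCT dependent.relname AS dependent_view
    FROM pg_depend d
    JOIN pg_rewrite r       ON r.oid = d.objid
    JOIN pg_class dependent ON dependent.oid = r.ev_class
    WHERE d.classid    = 'pg_rewrite'::regclass
      AND d.refclassid = 'pg_class'::regclass
      AND d.refobjid   = 'myviewname'::regclass
      AND dependent.oid <> 'myviewname'::regclass;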
Re: [GENERAL] Conditionnal validation for transaction
On Mon, Mar 19, 2012 at 3:28 PM, Florent THOMAS wrote: > Hy all of you, > > 1 - Is there a way to have conditions for committing transactions like in > oracle : > http://www.scribd.com/doc/42831667/47/Validation-conditionnelle-de-transaction-62 > > 2 - Is there a way to declare local variables for a SQL statement without > beiing in a function? PostgreSQL follows the SQL standard which does not allow anything like that. Later versions do allow anonymous blocks, also known as DO statements that allow you to execute some code to allow decision making like that. So the Oracle example is very similar code in PostgreSQL, except that you can't issue ROLLBACK and COMMIT. But then you don't need to because you can do a conditional error or drop through to a commit. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Conditionnal validation for transaction
2. The short answer is No. I've got the same issue. I come from a different SQL database that had a CREATE VARIABLE which was good for the session. With Postgres, I've created a sessionsettings table and a bunch of functions to fetch a variable by name and use its value. My perceived downside is that this causes a lot of calls to be made to get the data instead of setting them one time (for most items). I've been told that the table will probably be cached so it will cost very little. Michael Gould Intermodal Software Solutions, LLC 904-226-0978 Original Message Subject: [GENERAL] Conditionnal validation for transaction From: Florent THOMAS Date: Mon, March 19, 2012 8:28 am To: pgsql-general@postgresql.org Hi all of you, 1 - Is there a way to have conditions for committing transactions like in Oracle: http://www.scribd.com/doc/42831667/47/Validation-conditionnelle-de-transaction-62 2 - Is there a way to declare local variables for a SQL statement without being in a function? regards -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
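A rough sketch of the sessionsettings approach described above (names are invented; truly per-session scoping, e.g. a temp table or a session key column, is left out for brevity):

    CREATE TABLE sessionsettings (
        name  text PRIMARY KEY,
        value text NOT NULL
    );

    CREATE OR REPLACE FUNCTION get_setting(text) RETURNS text
    LANGUAGE sql STABLE AS $$
        SELECT value FROM sessionsettings WHERE name = $1
    $$;

    -- usage: SELECT * FROM orders WHERE branch_id = get_setting('current_branch')::int;

There is also the custom GUC route (set_config()/current_setting()), though on 9.1 that requires listing the prefix in custom_variable_classes.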
[GENERAL] Conditionnal validation for transaction
Hi all of you, 1 - Is there a way to have conditions for committing transactions like in Oracle: http://www.scribd.com/doc/42831667/47/Validation-conditionnelle-de-transaction-62 2 - Is there a way to declare local variables for a SQL statement without being in a function? regards
Re: [GENERAL] Anonymized database dumps
Am 19.03.2012 um 13:22 schrieb Bill Moran : > In response to Janning Vygen : >> >> I am working on postgresql 9.1 and loving it! >> >> Sometimes we need a full database dump to test some performance issues >> with real data. >> >> Of course we don't like to have sensible data like bunches of e-mail >> addresses on our development machines as they are of no interest for >> developers and should be kept secure. >> >> So we need an anonymized database dump. I thought about a few ways to >> achieve this. >> >> 1. Best solution would be a special db user and some rules which fire on >> reading some tables and replace privacy data with some random data. Now >> doing a dump as this special user doesn't even copy the sensible data at >> all. The user just has a different view on this database even when he >> calls pg_dump. >> >> But as rules are not fired on COPY it can't work, right? >> >> 2. The other solution I can think of is something like >> >> pg_dump | sed > pgdump_anon >> >> where 'sed' does a lot of magical replace operations on the content of >> the dump. I don't think this is going to work reliable. >> >> 3. More reliable would be to dump the database, restore it on a >> different server, run some sql script which randomize some data, and >> dump it again. hmm, seems to be the only reliable way so far. But it is >> no fun when dumping and restoring takes an hour. >> >> Does anybody has a better idea how to achieve an anonymized database dump? > > I highly recommend #3. It's how we do it where I work. > > At first it seems like a big, slow, complicated monster, but once you've > built the tools and have it running reliably it's very nice. Our system > does the dumps overnight via cron (we have over 100 production databases) > then changes the sensitive data, as well changing all the passwords to > "password" so developers can easily log in as any account. During the > day, the developers have access to all the sanitized dump files and can > use them to make as many testing databases as they need. Yes, the data > gets up to 24 hours out of date, but it's never been a problem for us. Thanks for your response and your insights to your process. Sounds reasonable. Regards Janning -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] WHERE IN (subselect) versus WHERE IN (1,2,3,)
Kevin Goess writes: > On Mon, Mar 19, 2012 at 9:24 AM, Albe Laurenz wrote: >> That means that your statistics are not accurate. > Aha, thanks, that explains why my test table with one row was so bad. But > even with all freshly ANALYZE'd tables, I still see the query reverting to > a sequential scan on that big contexts table once the number of rows in the > subselect goes over 199. Here's a simplified version that demonstrates the > problem. You've still got a nasty join-size estimation error: > -> Nested Loop (cost=6.18..1939.43 rows=411736 width=8) (actual > time=0.203..3.487 rows=35 loops=1) It's not apparent why that's so far off ... regards, tom lane -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Anonymized database dumps
In response to Janning Vygen : > > I am working on postgresql 9.1 and loving it! > > Sometimes we need a full database dump to test some performance issues > with real data. > > Of course we don't like to have sensible data like bunches of e-mail > addresses on our development machines as they are of no interest for > developers and should be kept secure. > > So we need an anonymized database dump. I thought about a few ways to > achieve this. > > 1. Best solution would be a special db user and some rules which fire on > reading some tables and replace privacy data with some random data. Now > doing a dump as this special user doesn't even copy the sensible data at > all. The user just has a different view on this database even when he > calls pg_dump. > > But as rules are not fired on COPY it can't work, right? > > 2. The other solution I can think of is something like > > pg_dump | sed > pgdump_anon > > where 'sed' does a lot of magical replace operations on the content of > the dump. I don't think this is going to work reliable. > > 3. More reliable would be to dump the database, restore it on a > different server, run some sql script which randomize some data, and > dump it again. hmm, seems to be the only reliable way so far. But it is > no fun when dumping and restoring takes an hour. > > Does anybody has a better idea how to achieve an anonymized database dump? I highly recommend #3. It's how we do it where I work. At first it seems like a big, slow, complicated monster, but once you've built the tools and have it running reliably it's very nice. Our system does the dumps overnight via cron (we have over 100 production databases) then changes the sensitive data, as well changing all the passwords to "password" so developers can easily log in as any account. During the day, the developers have access to all the sanitized dump files and can use them to make as many testing databases as they need. Yes, the data gets up to 24 hours out of date, but it's never been a problem for us. -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/ -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
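For what it's worth, the "change the sensitive data" step in option #3 usually boils down to a short SQL script run against the restored copy, along these lines (table and column names are invented, and the password-hashing call depends on how the application stores passwords):

    BEGIN;

    -- replace real addresses with unique but meaningless ones
    UPDATE users
       SET email = 'user' || user_id || '@example.org';

    -- let developers log in to any account with the same known password
    UPDATE users
       SET password_hash = md5('password');

    COMMIT;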
Re: [GENERAL] nice'ing the postgres COPY backend process to make pg_dumps run more "softly"
Try this: http://klicman.org/throttle/ Kiriakos On Mar 19, 2012, at 12:06 AM, Aleksey Tsalolikhin wrote: > Hi. When pg_dump runs, our application becomes inoperative (too > slow). I was going to ask if nice'ing the postgres backend process > that handles the COPY would help but I just realized probably the > pg_dump takes out locks when it runs and nice'ing it would just make > it run longer... > > However the man page says "pg_dump does not block other users > accessing the database (readers or writers)." But if we run a > pg_dump, the phone starts ringing, users are complaining that the web > app is not working. > > Would appreciate some pointer to help me reconcile these two > apparently contradictory facts. > > Best, > -at > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] WHERE IN (subselect) versus WHERE IN (1,2,3,)
On Mon, Mar 19, 2012 at 9:24 AM, Albe Laurenz wrote: > That means that your statistics are not accurate. > > As a first measure, you should ANALYZE the tables involved and see if > the problem persists. If yes, post the new plans. > Aha, thanks, that explains why my test table with one row was so bad. But even with all freshly ANALYZE'd tables, I still see the query reverting to a sequential scan on that big contexts table once the number of rows in the subselect goes over 199. Here's a simplified version that demonstrates the problem. production=> explain (analyze, buffers) SELECT contexts.context_key FROM contexts JOIN articles ON (articles.context_key=contexts.context_key) WHERE contexts.context_key IN (SELECT context_key FROM virtual_ancestors limit 200) AND articles.indexed; QUERY PLAN - Hash Join (cost=7086.13..219322.15 rows=411736 width=4) (actual time=50.118..1213.046 rows=35 loops=1) Hash Cond: (contexts.context_key = articles.context_key) Buffers: shared hit=72539 read=100104 -> Seq Scan on contexts (cost=0.00..190285.83 rows=1783283 width=4) (actual time=0.040..769.891 rows=1786074 loops=1) Buffers: shared hit=72399 read=100054 -> Hash (cost=1939.43..1939.43 rows=411736 width=8) (actual time=3.510..3.510 rows=35 loops=1) Buckets: 65536 Batches: 1 Memory Usage: 2kB Buffers: shared hit=140 read=50 -> Nested Loop (cost=6.18..1939.43 rows=411736 width=8) (actual time=0.203..3.487 rows=35 loops=1) Buffers: shared hit=140 read=50 -> HashAggregate (cost=6.18..8.18 rows=200 width=4) (actual time=0.174..0.198 rows=48 loops=1) Buffers: shared read=2 -> Limit (cost=0.00..3.68 rows=200 width=4) (actual time=0.015..0.108 rows=200 loops=1) Buffers: shared read=2 -> Seq Scan on virtual_ancestors (cost=0.00..87676.17 rows=4759617 width=4) (actual time=0.015..0.075 rows=200 loops=1) Buffers: shared read=2 -> Index Scan using articles_pkey on articles (cost=0.00..9.64 rows=1 width=4) (actual time=0.015..0.068 rows=1 loops=48) Index Cond: (articles.context_key = virtual_ancestors.context_key) Filter: articles.indexed Buffers: shared hit=140 read=48 Total runtime: 1213.138 ms (21 rows) But if I write the keys in the subquery inline, I get a very nice execution plan, all the way up to a tested maximum of about 50,000 keys: production=> explain (analyze, buffers) SELECT contexts.context_key FROM contexts JOIN articles ON (articles.context_key=contexts.context_key) WHERE contexts.context_key IN (2482612,2482612,...) 
AND articles.indexed; QUERY PLAN - Nested Loop (cost=758.71..3418.40 rows=200 width=4) (actual time=0.621..1.089 rows=35 loops=1) Buffers: shared hit=826 read=1 -> Bitmap Heap Scan on contexts (cost=752.58..1487.55 rows=200 width=4) (actual time=0.604..0.699 rows=48 loops=1) Recheck Cond: (context_key = ANY ('{2482612,2482612,...}'::integer[])) Buffers: shared hit=639 -> Bitmap Index Scan on contexts_pkey (cost=0.00..752.53 rows=200 width=0) (actual time=0.591..0.591 rows=200 loops=1) Index Cond: (context_key = ANY ('{2482612,2482612,...}'::integer[])) Buffers: shared hit=600 -> Bitmap Heap Scan on articles (cost=6.13..9.64 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=48) Recheck Cond: (articles.context_key = contexts.context_key) Filter: articles.indexed Buffers: shared hit=187 read=1 -> Bitmap Index Scan on articles_pkey (cost=0.00..6.13 rows=1 width=0) (actual time=0.005..0.005 rows=1 loops=48) Index Cond: (articles.context_key = contexts.context_key) Buffers: shared hit=148 Total runtime: 1.147 ms Is this expected behavior, that writing the ids inline does much better than the subquery? I've been told that it's not, but this isn't the first time I've seen this, so I feel like I'm not understanding something.
Re: [GENERAL] Multi server query
Many thanks to all of you. Regards On 19/03/2012 09:58, Sergey Konoplev wrote: > Hi, On Mon, Mar 19, 2012 at 12:12 AM, Florent THOMAS wrote: >> How do you query multi servers and multi databases on postgresql? > Look at this http://wiki.postgresql.org/wiki/PL/Proxy
Re: [GENERAL] Anonymized database dumps
On Mon, Mar 19, 2012 at 10:12:01AM +0100, hari.fu...@gmail.com wrote: > Janning Vygen writes: > > pgcrypto does not work for this scenario as far as I know. > > > > pgcrypto enables me to encrypt my data and let only a user with the > > right password (or key or whatever) decrypt it, right? So if I run it > > in a test environment without this password the application is broken. > > > > I still want to use these table columns in my test environment but > > instead of real email addresses I want addresses like > > random_num...@example.org. > > > > You might be right that it is a good idea to additionally encrypt this data. > > Maybe you could change your application so that it doesn't access the > critical tables directly and instead define views for them which, based > on current_user, either do decryption or return random strings. Encryption is the wrong tool for "anonymization". The right tool is hmac(), which gives you a one-way hash that is protected by a key, which means the other side can't even calculate the hashes unless they have the same key. You can calculate it with pgcrypto when dumping, or later when post-processing the dumps. But it produces random-looking values; if you need something realistic-looking you need custom mapping logic. -- marko -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
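As a sketch of what that looks like with pgcrypto (the key and the address format are just examples), the keyed hash replaces the real value while staying deterministic, so identical inputs still map to identical outputs and joins and uniqueness survive:

    CREATE EXTENSION pgcrypto;

    -- pseudonymize addresses with a keyed one-way hash
    UPDATE users
       SET email = 'u' || encode(hmac(email, 'dump-secret-key', 'sha256'), 'hex')
                       || '@example.org';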
Re: [GENERAL] Multi server query
On 03/18/12 1:12 PM, Florent THOMAS wrote: > How do you query _multi servers_ and multi databases on postgresql? other than plproxy, there's dblink, in the contrib collection. this lets you query another database or server from within a SQL query -- john r pierce N 37, W 122 santa cruz ca mid-left coast -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
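A minimal dblink example (the connection parameters are placeholders); the column definition list is required because dblink() returns a generic record set:

    CREATE EXTENSION dblink;

    SELECT *
    FROM dblink('host=otherhost dbname=otherdb user=report password=secret',
                'SELECT id, name FROM customers')
         AS remote(id integer, name text);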
Re: [GENERAL] Anonymized database dumps
Janning Vygen writes: > pgcrypto does not work for this scenario as far as I know. > > pgcrypto enables me to encrypt my data and let only a user with the > right password (or key or whatever) decrypt it, right? So if I run it > in a test environment without this password the application is broken. > > I still want to use these table columns in my test environment but > instead of real email addresses I want addresses like > random_num...@example.org. > > You might be right that it is a good idea to additionally encrypt this data. Maybe you could change your application so that it doesn't access the critical tables directly and instead define views for them which, based on current_user, either do decryption or return random strings. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
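A rough sketch of that view idea (names are invented, and the key handling is deliberately glossed over; hardcoding a key in a view definition would of course defeat the purpose):

    CREATE VIEW users_v AS
    SELECT u.user_id,
           CASE WHEN current_user = 'app_prod'
                THEN pgp_sym_decrypt(u.email_enc, 'prod-key')   -- real decryption via pgcrypto
                ELSE 'user' || u.user_id || '@example.org'      -- fake value for everyone else
           END AS email
    FROM users u;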
Re: [GENERAL] Multi server query
Hi, On Mon, Mar 19, 2012 at 12:12 AM, Florent THOMAS wrote: > How do you query multi servers and multi databases on postgresql? Look at this http://wiki.postgresql.org/wiki/PL/Proxy -- Sergey Konoplev Blog: http://gray-hemp.blogspot.com LinkedIn: http://ru.linkedin.com/in/grayhemp JID/GTalk: gray...@gmail.com Skype: gray-hemp -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] WHERE IN (subselect) versus WHERE IN (1,2,3,)
Kevin Goess wrote: > We have a table "contexts" with 1.6 million rows, and a table "articles" with 1.4 million rows, where > an "article" is a particular kind of "context". We want to select from a join on those two tables > like this > > SELECT COUNT(*) > FROM contexts > JOIN articles ON (articles.context_key=contexts.context_key) > WHERE contexts.context_key IN (...); > /* and some combination of columns from articles and contexts */ > > If "IN(...)" is a query, then this guy does a seq scan on the contexts table, even if the subquery is > "select col_a from kgtest" where kgtest has one row. If however I read the ids beforehand and write > them into the query, a la "IN (111,222,333...)", then the everything is happy, up to at least 20,000 > values written into the sql, at which point smaller machines will take 2-5 minutes to parse the query. > > I can certainly write the ids inline into the SQL, but when I do that I get the distinct impression > that I'm Doing It Wrong. Is this expected behavior? It seems surprising to me. > > > To demonstrate: > > /* nothing up my sleeve */ > # select * from kgtest; > cola > - > 1652729 > (1 row) [...] > /* subselect, query plan does seq scan on contexts */ [...] > -> Seq Scan on kgtest (cost=0.00..34.00 rows=2400 width=4) (actual time=0.048..0.050 rows [...] There is something missing in this line, but according to what you wrote it must be "actual [...] rows=1", And yet the planner assumes that the scan will return 2400 rows. That means that your statistics are not accurate. As a first measure, you should ANALYZE the tables involved and see if the problem persists. If yes, post the new plans. Yours, Laurenz Albe -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Anonymized database dumps
pgcrypto does not work for this scenario as far as I know. pgcrypto enables me to encrypt my data and let only a user with the right password (or key or whatever) decrypt it, right? So if I run it in a test environment without this password the application is broken. I still want to use these table columns in my test environment but instead of real email addresses I want addresses like random_num...@example.org. You might be right that it is a good idea to additionally encrypt this data. regards Janning On 19.03.2012 at 06:24, Kiriakos Georgiou wrote: I would store sensitive data encrypted in the database. Check the pgcrypto module. Kiriakos On Mar 18, 2012, at 1:00 PM, Janning Vygen wrote: Hi, I am working on postgresql 9.1 and loving it! Sometimes we need a full database dump to test some performance issues with real data. Of course we don't like to have sensible data like bunches of e-mail addresses on our development machines as they are of no interest for developers and should be kept secure. So we need an anonymized database dump. I thought about a few ways to achieve this. 1. Best solution would be a special db user and some rules which fire on reading some tables and replace privacy data with some random data. Now doing a dump as this special user doesn't even copy the sensible data at all. The user just has a different view on this database even when he calls pg_dump. But as rules are not fired on COPY it can't work, right? 2. The other solution I can think of is something like pg_dump | sed > pgdump_anon where 'sed' does a lot of magical replace operations on the content of the dump. I don't think this is going to work reliable. 3. More reliable would be to dump the database, restore it on a different server, run some sql script which randomize some data, and dump it again. hmm, seems to be the only reliable way so far. But it is no fun when dumping and restoring takes an hour. Does anybody has a better idea how to achieve an anonymized database dump? regards Janning -- Kicktipp GmbH Venloer Straße 8, 40477 Düsseldorf Sitz der Gesellschaft: Düsseldorf Geschäftsführung: Janning Vygen Handelsregister Düsseldorf: HRB 55639 http://www.kicktipp.de/ -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general -- Kicktipp GmbH Venloer Straße 8, 40477 Düsseldorf Sitz der Gesellschaft: Düsseldorf Geschäftsführung: Janning Vygen Handelsregister Düsseldorf: HRB 55639 http://www.kicktipp.de/ -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general