[GENERAL] help optimizing query
Hi all, I'm looking for help optimizing a query. It currently requires two passes over the data per query, and I'd like to rewrite it to need only a single pass.

At a high level, it parses flow-level network traffic and uses four fields: src_ip, dst_ip, src_packets, dst_packets. In words: 'src_ip' sent 'src_packets' packets to 'dst_ip', and 'dst_ip' sent 'dst_packets' packets to 'src_ip'.

For every IP address, I want to count how many packets were sent to it. That can happen one of two ways: the IP is the source of the flow, in which case it received dst_packets, or the IP is the destination of the flow, in which case it received src_packets. My current method splits this into two queries and takes the union. I was wondering, is it possible to do this in one pass through the data?

SELECT ip, sum(dst_packets)
FROM (
  (SELECT dst_ip AS ip, sum(src_packets) AS dst_packets
   FROM flows
   WHERE interval = '2005-02-01 00:00:00'
   GROUP BY dst_ip)
  UNION ALL
  (SELECT src_ip AS ip, sum(dst_packets) AS dst_packets
   FROM flows
   WHERE interval = '2005-02-01 00:00:00'
   GROUP BY src_ip)
) AS stats
GROUP BY stats.ip
HAVING sum(dst_packets) > 0
ORDER BY sum(dst_packets) DESC;

- George

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
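One way to get a true single pass, assuming a PostgreSQL version with LATERAL support (9.3 or later, so well after the version this thread dates from), is to unpivot each flow row into two (ip, packets) rows and aggregate once. A sketch using the column names from the query above:

```sql
-- Each flow contributes two "packets received" facts: one for src_ip,
-- one for dst_ip. Unpivot with a VALUES list and aggregate in one scan.
SELECT v.ip, sum(v.pkts) AS packets_received
FROM flows f,
     LATERAL (VALUES (f.src_ip, f.dst_packets),
                     (f.dst_ip, f.src_packets)) AS v(ip, pkts)
WHERE f.interval = '2005-02-01 00:00:00'
GROUP BY v.ip
HAVING sum(v.pkts) > 0
ORDER BY packets_received DESC;
```

On versions without LATERAL, the UNION ALL form still scans flows twice, though dropping the inner aggregations and letting the outer GROUP BY do all the summing simplifies the plan.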
[GENERAL] foreign key constraints with inheritance, hack suggestions?
Hi,

First, yes, I have read section 5.8.1 (Caveats), which says this support does not exist, and I agree with the documentation that this is a serious limitation of the inheritance feature. Has there been any effort to support this in upcoming PostgreSQL versions? I searched the mailing lists and didn't find a straight answer, only that there will be some sort of stronger inheritance support in 8.2.

Second, I was wondering if anyone has any hack suggestions for doing some integrity checks. Basically, I have a master 'flows' table which is partitioned and has non-overlapping CHECK constraints on each partition. Each record in the partitions has a unique pair of attributes: interval, flow_id. When inserting into another table, 'flow_labels', these two attributes are given, and I want to ensure that a single flow exists in some partition of the 'flows' table with these two values.

I'm no DB expert, but could some sort of rule or trigger on INSERT into 'flow_labels' do a count() matching these two values, to ensure exactly one such flow exists? The problem is that my 'flows' table is on the order of billions of flows, each partition holding hundreds of thousands.

Suggestions? Thanks!

George

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
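A trigger along these lines is one common workaround. The sketch below is hypothetical (function and trigger names are made up) and uses an EXISTS probe rather than count(), since EXISTS can stop at the first matching row; with constraint_exclusion enabled and an index on (interval, flow_id) in each partition, the probe should touch only one partition:

```sql
CREATE OR REPLACE FUNCTION flow_labels_check() RETURNS trigger AS $$
BEGIN
  -- EXISTS stops at the first match; no need to count billions of rows.
  IF NOT EXISTS (SELECT 1 FROM flows
                 WHERE interval = NEW.interval
                   AND flow_id  = NEW.flow_id) THEN
    RAISE EXCEPTION 'no flow with interval % and flow_id %',
                    NEW.interval, NEW.flow_id;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER flow_labels_fk
BEFORE INSERT OR UPDATE ON flow_labels
FOR EACH ROW EXECUTE PROCEDURE flow_labels_check();
```

Note this checks existence at insert time only; nothing prevents the referenced flow from being deleted later, so it is an integrity smoke test, not a real foreign key.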
[GENERAL] creating a function with a variable table name
Hey all,

I'm trying to create a function in which the table a query runs on is variable, but I guess this is not as easy as I thought:

BEGIN;
CREATE FUNCTION stats_addr_dst(date, text)
RETURNS setof addr_count
AS 'SELECT ip, sum(dst_packets)
    FROM (
      (SELECT dst_ip AS ip, sum(src_packets) AS dst_packets
       FROM $2
       WHERE interval = $1
       GROUP BY dst_ip)
      UNION ALL
      (SELECT src_ip AS ip, sum(dst_packets) AS dst_packets
       FROM $2
       WHERE interval = $1
       GROUP BY src_ip)
    ) AS topk
    GROUP BY topk.ip
    HAVING sum(dst_packets) > 0
    ORDER BY sum(dst_packets) DESC;'
LANGUAGE SQL;

ERROR:  syntax error at or near "$2" at character 179
LINE 6:        FROM $2
                    ^

How can I pass the table name? Thanks!

George
[GENERAL] get username of user calling function?
Hi,

Is it possible to get the username of the user calling a function? Just as a test, a function which returns the caller's own username.

Thanks!
George
Re: [GENERAL] creating a function with a variable table name
Do I need to use PREPARE with it also?

A. Kretschmer wrote:
> On Thu, 01 Mar 2007 11:17:46 -0500, George Nychis wrote:
>> Hey all, I'm trying to create a function in which the table a query is run on is variable, but I guess this is not as easy as I thought.
>> BEGIN dp= CREATE FUNCTION stats_addr_dst(date,text) ... dp' FROM $2 ^
>> How can I pass the table name?
>
> Build a string with your SQL and EXECUTE this string.
>
> Andreas
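A sketch of the EXECUTE approach in PL/pgSQL (identifiers cannot be parameters in a plain SQL function, so the query text has to be built as a string). This assumes RETURN QUERY EXECUTE ... USING, which was added in 8.4; on the 8.x versions current when this thread was written you would instead loop with FOR r IN EXECUTE and RETURN NEXT:

```sql
CREATE OR REPLACE FUNCTION stats_addr_dst(d date, tbl text)
RETURNS SETOF addr_count AS $$
BEGIN
  -- quote_ident() quotes the table name so an odd or malicious value
  -- cannot inject SQL; $1 is bound safely via USING.
  RETURN QUERY EXECUTE
    'SELECT ip, sum(dst_packets) FROM (' ||
    ' (SELECT dst_ip AS ip, sum(src_packets) AS dst_packets FROM ' ||
        quote_ident(tbl) || ' WHERE interval = $1 GROUP BY dst_ip)' ||
    ' UNION ALL ' ||
    ' (SELECT src_ip, sum(dst_packets) FROM ' ||
        quote_ident(tbl) || ' WHERE interval = $1 GROUP BY src_ip)' ||
    ') AS topk GROUP BY ip HAVING sum(dst_packets) > 0' ||
    ' ORDER BY 2 DESC'
  USING d;
END;
$$ LANGUAGE plpgsql;
```

No PREPARE is needed; EXECUTE in PL/pgSQL plans the string each time it runs, which is exactly what a variable table name requires.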
Re: [GENERAL] get username of user calling function?
A. Kretschmer wrote:
> You can use the current_user variable: SELECT current_user;

I'm trying to create a function in which users can only kill their own processes. It works perfectly if I hardcode a username, as in:

CREATE FUNCTION kill_process(integer) RETURNS boolean
AS 'SELECT pg_cancel_backend(procpid)
    FROM (SELECT procpid FROM pg_stat_activity
          WHERE procpid = $1 AND usename = ''gnychis'') AS kill;'
LANGUAGE SQL SECURITY DEFINER;

But if I try to replace usename = ''gnychis'' with usename = current_user, it no longer works. Any ideas?

- George
Re: [GENERAL] get username of user calling function?
David Legault wrote:
> See the EXECUTE function in the PL/pgSQL language in the docs for dynamic queries.

It turns out that in a SECURITY DEFINER function, current_user is the owner of the function. I had to use session_user instead, and it works now. :)

- George
[GENERAL] giving a user permission to kill their processes only
Hey all,

The pg_cancel_backend() function is by default only available to superusers, so I decided to write a wrapper function around it, mark it SECURITY DEFINER, and GRANT my user the privilege to use the wrapper:

BEGIN;
CREATE FUNCTION kill_process(integer) RETURNS boolean
AS 'SELECT pg_cancel_backend($1);'
LANGUAGE SQL SECURITY DEFINER;
REVOKE EXECUTE ON FUNCTION kill_process(integer) FROM PUBLIC;
COMMIT;
GRANT EXECUTE ON FUNCTION kill_process(integer) TO gnychis;

The problem is that I can now kill other users' PostgreSQL processes. Does anyone know a way to check that the process being killed is running a query for the calling user, so users can't kill queries in backends started by others?

Thanks!
George
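As worked out later in this thread, one way to restrict the wrapper to the caller's own backends is to filter pg_stat_activity by session_user (inside a SECURITY DEFINER function, current_user reports the function's owner, not the caller). A sketch against the 8.x-era catalog:

```sql
-- procpid was renamed to pid in PostgreSQL 9.2; adjust accordingly.
CREATE OR REPLACE FUNCTION kill_process(integer) RETURNS boolean AS $$
  SELECT pg_cancel_backend(procpid)
  FROM pg_stat_activity
  WHERE procpid = $1
    AND usename = session_user  -- the caller's login role, not the definer
$$ LANGUAGE SQL SECURITY DEFINER;
```

If the target backend belongs to someone else, the WHERE clause matches no rows and the function simply returns NULL instead of cancelling anything.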
Re: [GENERAL] dropping a master table and all of its partitions?
Tom Lane wrote:
> George Nychis [EMAIL PROTECTED] writes:
>> Here is an exact script which generates this every single time... After you're done running the ruby script: DROP TABLE testflows CASCADE;
>
> I tweaked the ruby script to emit the SQL commands into a script file, which proved to issue 1765 CREATE TABLE commands (one parent and 1764 children) and 1764 CREATE RULE commands (one per child table). On my test installation the creation script runs about 6m15s, while DROP TABLE testflows CASCADE runs about 3m4s. While neither number is exactly awe-inspiring, I'm not seeing why you think the DROP is particularly broken?
>
> regards, tom lane

Then maybe it's a bug in my version of PostgreSQL; what version are you using? On my side, DROP TABLE testflows CASCADE; runs for ~5 minutes and then dies. It never finishes.

- George
[GENERAL] performance of partitioning?
Hey all,

I have a master table called flows and 400 partitions named flows_*, where * is an epoch timestamp. Each partition contains ~700,000 rows and has a CHECK constraint pinning one field to a single value:

flows_1107246900_interval_check CHECK (interval = '2005-02-01 03:35:00'::timestamp without time zone)

Each partition has a different, non-overlapping check.

This query takes about 5 seconds to execute:

dp=> select count(*) from flows_1107246900;
 count
--------
 696836
(1 row)

This query has been running for 10 minutes now and hasn't stopped:

dp=> select count(*) from flows where interval = '2005-02-01 03:35:00';

Isn't partitioning supposed to make the second query almost as fast as the first? My WHERE clause is exactly the partitioning constraint, so it should only need to go to one partition and execute the query there. Why would it take orders of magnitude longer to run? Am I misunderstanding something?

Thanks!
George
Re: [GENERAL] performance of partitioning?
cedric wrote:
> On Tuesday 27 February 2007 15:00, George Nychis wrote:
>> [...] Isn't partitioning supposed to make the second query almost as fast? My WHERE is exactly the partitioning constraint, therefore it only needs to go to 1 partition and execute the query. Why would it take magnitudes longer to run? [...]
>
> perhaps you should consider constraint_exclusion
> http://www.postgresql.org/docs/current/static/ddl-partitioning.html
> http://www.postgresql.org/docs/current/static/runtime-config-query.html#GUC-CONSTRAINT-EXCLUSION

That sounds like what I'm looking for, thanks. I'll give it a try and report back.

- George
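For reference, the setting cedric points to can be enabled per session or in postgresql.conf, and EXPLAIN shows whether pruning actually happened. A minimal sketch:

```sql
SET constraint_exclusion = on;

-- With exclusion on, the plan should scan only flows_1107246900,
-- not all 400 children:
EXPLAIN SELECT count(*) FROM flows
WHERE interval = '2005-02-01 03:35:00';
```

Exclusion only works when the WHERE clause compares against constants that the planner can prove contradictory with each partition's CHECK constraint.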
Re: [GENERAL] performance of partitioning?
George Nychis wrote:
> cedric wrote:
>> perhaps you should consider constraint_exclusion
>> http://www.postgresql.org/docs/current/static/ddl-partitioning.html
>> http://www.postgresql.org/docs/current/static/runtime-config-query.html#GUC-CONSTRAINT-EXCLUSION
>
> That sounds like what I'm looking for, thanks. I'll give it a try and report back.

Worked perfectly; the two queries have nearly identical execution times now. Thank you!

- George
[GENERAL] dropping a master table and all of its partitions?
Hey everyone,

I created a master table and ~2000 partitions for it. *No* data is in any of these partitions. I am trying to drop the master and all of the partitions with a cascade:

DROP TABLE master CASCADE;

Except after about 30 seconds my memory usage (4GB) jumps to 99%, and after about 10 minutes the backend keels over and drops my connection. How do you delete a master table and all of its partitions?

Thanks!
George
Re: [GENERAL] dropping a master table and all of its partitions?
Sure, I can do that, but why is this happening? Is this normal behavior?

- George

Erik Jones wrote:
> Did you use some kind of sensible naming convention for the child tables? If so, couldn't you write a script to loop through and drop them one at a time?
>
> On Feb 26, 2007, at 6:42 PM, George Nychis wrote:
>> Hey everyone, I created a master table, and created ~2000 partitions for it. [...] How do you delete a master and all of its partitions?
>
> erik jones [EMAIL PROTECTED]
> software developer
> 615-296-0838
> emma(r)
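If dropping the children one at a time, the list doesn't have to come from a naming convention; the pg_inherits catalog records the parent/child links. A sketch that generates the DROP statements (run its output back through psql), assuming the parent is named 'master':

```sql
-- Emit one DROP per child table of 'master'; feed the output to psql.
SELECT 'DROP TABLE ' || quote_ident(n.nspname) || '.' ||
       quote_ident(c.relname) || ';'
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_class p ON p.oid = i.inhparent
WHERE p.relname = 'master';
```

Once the children are gone, a plain DROP TABLE master; finishes immediately, since there is nothing left to cascade to.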
Re: [GENERAL] dropping a master table and all of its partitions?
I tend to forget the important details ;)

[EMAIL PROTECTED]:~$ psql --version
psql (PostgreSQL) 8.1.8
contains support for command-line editing

[EMAIL PROTECTED]:~$ uname -a
Linux sn001 2.6.17-10-server #2 SMP Tue Dec 5 21:17:26 UTC 2006 x86_64 GNU/Linux

- George

Joshua D. Drake wrote:
> George Nychis wrote:
>> Sure I can do that, but why is this happening? Is this normal behavior?
>
> Well, that is the better question. If it is indeed doing what you say it is doing, I would say it is a bug. However, you have not mentioned several important items, like what PostgreSQL version you are running.
>
> Joshua D. Drake
[GENERAL] insert only unique values in to a table, ignore rest?
Hi,

I have approximately 2 billion data entries that I would like to insert into a database. Each entry consists of: INT BOOLEAN INT BOOLEAN.

I want to populate a table so that it contains only the unique rows; all other data should be thrown out. I would say a significant fraction of the insertions will fail due to unique constraints. The unique constraint is on the two integers, not on the booleans.

Using MySQL, I was able to do this with the following command, for all data files (25574 data files total):

mysql -e "use connectivity; LOAD DATA INFILE '/path/to/file' IGNORE INTO TABLE ipconn FIELDS TERMINATED BY ' ';"

What I *think* MySQL did was sort each data file and do a kind of merge sort between the data being inserted and the data in the database. It would insert the first unique instance of a row it saw and, because of the IGNORE, reject all other insertions that violated the unique constraint.

From what I understand, this functionality is not in PostgreSQL. Fine, I certainly can't change that, but I am looking for a comparable solution at the scale of my data. One solution is to have a temporary table, insert all 2 billion rows, and then copy the distinct entries to another table. This would be like one massive sort? Is this the only/best solution using PostgreSQL?

Thanks!
George
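A sketch of that temp-table approach in PostgreSQL, with hypothetical column names (the real ipconn definition isn't shown in the thread). COPY does the bulk load in a single transaction, and DISTINCT ON keeps one row per integer pair:

```sql
-- Stage everything with no constraints, then keep one row per key pair.
CREATE TEMP TABLE ipconn_raw (a int, a_flag boolean, b int, b_flag boolean);

COPY ipconn_raw FROM '/path/to/file' DELIMITER ' ';

INSERT INTO ipconn
SELECT DISTINCT ON (a, b) a, a_flag, b, b_flag
FROM ipconn_raw;
```

DISTINCT ON is a PostgreSQL extension; without an ORDER BY it keeps an arbitrary row per (a, b), so it matches MySQL's "first instance wins" behavior only loosely.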
Re: [GENERAL] insert only unique values in to a table, ignore rest?
Scott Marlowe wrote:
> On Mon, 2007-01-08 at 14:58, George Nychis wrote:
>> I have approximately 2 billion data entries that I would like to insert into a database. [...] Using mysql, I was able to do this with the following query [...]
>
> A quick question: could you run selects or other inserts on that table while the LOAD DATA INFILE was running? Because I'm guessing that it basically locked the whole table while running.

What does this have to do with my question? I don't need to run selects or inserts on the table while the load data is running...

>> What I *think* mysql did was sort each data file and do a sort of merge sort between the data I was inserting and the data in the database. [...]
>
> Me too. Which would require one big lock on the table, which would mean no parallel access.

That's fine; it doesn't matter.

> It's also likely that it used a temp table, which doubled the size of the database while you were inserting.
>
>> From what I understand, this functionality is not in postgresql. [...] Is this the only/best solution using postgresql?
>
> TANSTAAFL. PostgreSQL is designed so that you can run an import process on that table while 100 other users still access it at the same time. Because of that, you don't get to do dirty, nasty things under the sheets that allow for super easy data loading and merging like you got with MySQL. Apples and oranges.
>
> Assuming you're loading into an empty table, the load-to-temp, SELECT DISTINCT out into the final table approach seems reasonable and should run reasonably fast. If you need to load into an existing table, it might get a little more complex.

The goal is not to run queries while the data is being inserted. I am wondering whether the PostgreSQL method I mentioned, inserting everything and then keeping only the distinct values, is the most optimal way to produce the same results as the MySQL method I explained.
Re: [GENERAL] insert only unique values in to a table, ignore rest?
Scott Marlowe wrote:
> On Mon, 2007-01-08 at 15:52, George Nychis wrote:
>> [earlier exchange trimmed]
>> The goal is not to run queries while the data is being inserted. I am wondering if the postgresql method I have mentioned to actually insert and get only distinct values is most optimal [...]
>
> Did I fail to answer your question? Sorry if I gave you more information than you needed. Please feel free to ask someone else next time.

Ahhh, I missed your last paragraph... so much text. Actually, yes, that answers my question, thank you. I guess it's more a single run through the data than the MySQL method, which was piecewise. Thanks for the help/response.

- George
Re: [GENERAL] insert only unique values in to a table, ignore rest?
Jeremy Haile wrote:
>> Note that things will go faster if you do your initial bulk data load using COPY FROM STDIN. Individual inserts in PostgreSQL are quite costly compared to MySQL; it's the transactional overhead. By grouping them together you can make things much faster, and COPY FROM STDIN does all the inserts in one big transaction.
>
> You could do COPY from a file as well, right? (No performance difference compared to COPY FROM STDIN.) I do this all the time.
>
> Also, maybe I misunderstand something, but why does PostgreSQL's implementation prohibit it from ignoring insert errors during a COPY? If you added a unique constraint to the table before copying, PostgreSQL would generate errors due to the unique constraint violation, so I don't think any additional locking would be required for it to simply say "if there is an error while copying in, ignore it and continue inserting other rows". PostgreSQL's COPY command doesn't currently support this, so the temp table followed by a DISTINCT select is the way to go. But I didn't follow all of the talk about it requiring table locking and being inherently impossible for PostgreSQL to support.
>
> I've wanted a similar feature. I select rows into a table on a regular basis, and I'd like to be able to overlap old values and have PostgreSQL ignore the failed inserts. SQL Server offers a flag that allows you to ignore inserts whose primary key already exists in the table. The only solution in PostgreSQL is to run a query to manually delete the duplicate rows from a temp table before inserting, which takes much more time.

I would also like this feature... :) (obviously) I also didn't exactly follow the locking; I don't need it as far as I know.

- George
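The ignore-duplicates behavior wished for here was eventually added to PostgreSQL in version 9.5 (long after this thread) as ON CONFLICT. A sketch, using the ipconn table from the thread and hypothetical integer columns a and b carrying the unique constraint:

```sql
-- Requires PostgreSQL 9.5+; rows whose (a, b) key already exists in
-- ipconn are silently skipped instead of raising a unique violation.
INSERT INTO ipconn (a, a_flag, b, b_flag)
VALUES (1, true, 2, false)
ON CONFLICT (a, b) DO NOTHING;
```

This works for INSERT ... SELECT from a staging table too, which covers the bulk-load-with-IGNORE use case discussed above.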