Re: [SQL] Multiple DB join
On Mon, Aug 14, 2006 at 05:26:10PM -0400, Sumeet Ambre wrote:
> Hi All,
>
> I have a database which consists of 20 million records and I've split up
> the db into 6-7 dbs.

You can do this (as someone suggested with dblink), but I'm wondering
why the split? 20 million records isn't very big.

A

--
Andrew Sullivan | [EMAIL PROTECTED]
Unfortunately reformatting the Internet is a little more painful than
reformatting your hard drive when it gets out of whack.
        --Scott Morris
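For reference, a minimal sketch of the dblink approach mentioned above, assuming contrib/dblink is installed and that one of the split databases is named docs_part1 (the database name and the column list are hypothetical):

    -- Query one of the split databases from the base database via dblink;
    -- the result must be aliased with a column definition list.
    SELECT t.doc_id, t.title
    FROM dblink('dbname=docs_part1',
                'SELECT doc_id, title FROM base WHERE pub_year = 2005')
         AS t(doc_id integer, title text);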
Re: [SQL] Multiple DB join
On Tue, Aug 15, 2006 at 06:36:24AM -0400, Andrew Sullivan wrote:
> On Mon, Aug 14, 2006 at 05:26:10PM -0400, Sumeet Ambre wrote:
> > I have a database which consists of 20 million records and I've split up
> > the db into 6-7 dbs.
>
> You can do this (as someone suggested with dblink), but I'm wondering
> why the split? 20 million records isn't very big.

And why multiple databases instead of multiple schemas within the
same database? Or even all data in the same schema? Is there a
reason for the segregation?

--
Michael Fuhr
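For comparison, a minimal sketch of the multiple-schemas-in-one-database layout being suggested; the schema and table names are hypothetical:

    -- Keep each dataset in its own schema inside one database, so
    -- cross-dataset queries are ordinary joins with no dblink overhead.
    CREATE SCHEMA medline;
    CREATE SCHEMA patents;

    CREATE TABLE medline.base (doc_id integer PRIMARY KEY, title text);
    CREATE TABLE patents.base (doc_id integer PRIMARY KEY, title text);

    -- Query across datasets directly:
    SELECT m.doc_id, m.title
    FROM medline.base m
    JOIN patents.base p USING (doc_id);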
Re: [SQL] Multiple DB join
Michael Fuhr <[EMAIL PROTECTED]> writes:

> And why multiple databases instead of multiple schemas within the
> same database? Or even all data in the same schema? Is there a
> reason for the segregation?

Spreading processing requirements could be one reason, and distributing
load another. Disk space can be solved with new disks and tablespaces,
but we can't yet distribute the load across several servers without
partitioning.

--
Jorge Godoy <[EMAIL PROTECTED]>
Re: [SQL] Multiple DB join
On Tue, Aug 15, 2006 at 09:01:03AM -0300, Jorge Godoy wrote:
> Spreading processing requirements could be one reason, and distributing
> load another. Disk space can be solved with new disks and tablespaces,
> but we can't yet distribute the load across several servers without
> partitioning.

The cost of inter-node communication isn't nothing, though. It
strikes me as at least possible that the overhead of dblink is going
to be larger than whatever gains one makes from adding a new server.
For only 20M rows, I find it pretty hard to believe the gain is going
to be demonstrable.

A

--
Andrew Sullivan | [EMAIL PROTECTED]
Everything that happens in the world happens at some place.
        --Jane Jacobs
Re: [SQL] Multiple DB join
Andrew Sullivan <[EMAIL PROTECTED]> writes:

> On Tue, Aug 15, 2006 at 09:01:03AM -0300, Jorge Godoy wrote:
>> Spreading processing requirements could be one reason, and distributing
>> load another. Disk space can be solved with new disks and tablespaces,
>> but we can't yet distribute the load across several servers without
>> partitioning.
>
> The cost of inter-node communication isn't nothing, though. It
> strikes me as at least possible that the overhead of dblink is going
> to be larger than whatever gains one makes from adding a new server.
> For only 20M rows, I find it pretty hard to believe the gain is going
> to be demonstrable.

I totally agree with you. But when your base grows, it might start
getting interesting. I'd like to see some sort of data partitioning
in PostgreSQL.

--
Jorge Godoy <[EMAIL PROTECTED]>
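The partitioning wished for here is cross-server; within a single server, the PostgreSQL releases of that era (8.1 and later) already offer table partitioning through inheritance and constraint exclusion. A minimal sketch, with a hypothetical docs table partitioned by year:

    -- Parent table plus one child per year; constraint exclusion lets the
    -- planner skip children whose CHECK constraint rules them out.
    CREATE TABLE docs (
        doc_id   integer,
        pub_year integer NOT NULL,
        title    text
    );

    CREATE TABLE docs_2005 (CHECK (pub_year = 2005)) INHERITS (docs);
    CREATE TABLE docs_2006 (CHECK (pub_year = 2006)) INHERITS (docs);

    -- Rows are inserted directly into the matching child (or via a rule/trigger).
    SET constraint_exclusion = on;

    -- Only docs_2006 (and the empty parent) should be scanned here.
    SELECT count(*) FROM docs WHERE pub_year = 2006;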
Re: [SQL] Query response time
On Fri, 2006-08-11 at 08:58, Jonathan Sinclair wrote:
> Hi all. Thanks for your help so far. However, after configuring my system
> I am still getting major lag times with a couple of queries. The first,
> which suffers from the '538/539' bug (see old email included below), is
> caused by running the following statement:
>
> SELECT t1.col1, SUM(test) test_col, SUM(col2)
> FROM table1 t1, table2 t2
> WHERE t1.date BETWEEN '01/10/2004' AND '01/10/2004'
>   AND t1.col3 = t2.col1
>   AND t1.col5 = t2.col2
>   AND t2.col3 BETWEEN 50.00 AND 200.00
> GROUP BY t1.col1
> HAVING SUM(test) BETWEEN 95.00 AND 100.00
> ORDER BY 2 DESC, t1.col1;
>
> I would like to know if anyone has any ideas why this problem arises.
> (It's not due to the date format being ambiguous; I have set it to
> European standard.)

Have you looked at the plan for this query?

    explain select ...

If I remember correctly, the planner has a bug where if you do a BETWEEN
with the same date, it does a seq scan instead of using an index. But my
memory on that's a bit rough. How does it work if you change the WHERE
clause to be "t1.date = '01/10/2004'"?

P.S. This is more of a performance question than a SQL question. Not that
that's a big deal or anything; the SQL list is kinda slow and can likely
use the traffic. :)
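A sketch of the comparison being suggested, reusing the placeholder names from the quoted query:

    -- Compare the plan for the same-date BETWEEN with the equivalent equality test.
    EXPLAIN
    SELECT t1.col1
    FROM table1 t1
    WHERE t1.date BETWEEN '01/10/2004' AND '01/10/2004';

    EXPLAIN
    SELECT t1.col1
    FROM table1 t1
    WHERE t1.date = '01/10/2004';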
Re: [SQL] Multiple DB join
Andrew Sullivan wrote:

> On Mon, Aug 14, 2006 at 05:26:10PM -0400, Sumeet Ambre wrote:
>> Hi All,
>>
>> I have a database which consists of 20 million records and I've split up
>> the db into 6-7 dbs.
>
> You can do this (as someone suggested with dblink), but I'm wondering
> why the split? 20 million records isn't very big.
>
> A

The design of the database is this way because our organization wants to
split up different datasets into different entities, and there is a
possibility that they'll run a different instance of postgres for each
dataset.

The problem is that one of the databases consists of 14 million records,
and when I query the base database, which consists of 20 million records,
the query runs damn slow. Below is the sample schema for my base table and
the query I try to run on it; it takes me more than 2-3 minutes to run a
query. Is there any way I could speed this up?

Sample schema for my base table:

    doc_id | title | authors | pub_year | abs | db_name

In the above schema the field db_name is the name of the other database
which contains the whole record. I'm trying to run a query which searches
on title, author name, pub_year or abstract simultaneously, e.g.:

    SELECT * FROM base
    WHERE ((title ILIKE '%something%')
       AND (authors ~* '.*something.*')
       AND (db_name = 'something'));

Any suggestions to speed up this query?

Thanks,
Sumeet.
Re: [SQL] Breaking up a query
On Thu, 2006-08-10 at 17:53, Saad Anis wrote:
> Hi Guys,
>
> A fellow at work has written the SQL below to retrieve some data from
> multiple tables. Obviously it is inefficient and unnecessarily complex, and
> I am trying to break it into 2 or more queries so as to enhance performance.

Nope, that's not true in PostgreSQL. It is for some databases with
relatively simplistic query planners, but not PostgreSQL.

I'd check that you have indexes where you need them (generally when you
see a seq scan on a small set), including, especially, the foreign key
columns (i.e. the ones pointing to another table's primary keys).

On to your explain analyze: I noticed a lot of lines like this:

    Index Scan using positions_pkey on positions p
        (cost=0.00..32.00 rows=1000 width=28)
        (actual time=0.019..90.920 rows=13958 loop

Seeing as how the statistical default for a new, unanalyzed table is 1000,
and you've got 1000 peppered all through your explain analyze, I'd guess
you've not analyzed your database. Which means you've likely not read the
admin docs, which means you've likely not vacuumed the database.

Read the admin docs (they're not that thick, and there's lots of good info
in there) and apply things like vacuum and analyze, and get back to us on
how things are doing then.
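A minimal sketch of the maintenance being recommended, using the positions table from the quoted plan (the final query and its id column are hypothetical):

    -- Reclaim dead tuples and collect planner statistics, then re-check the plan.
    VACUUM ANALYZE positions;

    -- Row estimates in the plan should now reflect the real table size
    -- instead of the 1000-row default.
    EXPLAIN ANALYZE
    SELECT * FROM positions p WHERE p.id = 42;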
Re: [SQL] Multiple DB join
On Tue, Aug 15, 2006 at 10:11:41AM -0400, Sumeet Ambre wrote:
> The design of the database is this way because our organization wants to
> split up different datasets into different entities, and there is a
> possibility that they'll run a different instance of postgres for each
> dataset.

It's this "different instances" thing that I'm having a tough time
with. Is this because they want to be able to query local things
when disconnected or something? I can think of applications for
this, for sure; I'm just suggesting that you make sure you're not
building an optimisation that is (1) premature and (2) possibly a
pessimal operation.

> records, the query runs damn slow. Below is the sample schema for
> my base table and the query I try to run on it; it takes me more than
> 2-3 minutes to run a query. Is there any way I could speed this up?

The problem is not your design, nor even the size of the data
exactly, but the query:

> e.g. SELECT * FROM base WHERE ((title ILIKE '%something%') AND (authors
> ~* '.*something.*') AND (db_name='something'));

You have two initially-unbound search terms there: ILIKE '%' and ~*
'.*' are automatically seqscans, because you have nowhere in the
index to start. If you really want to do this kind of unbound-string
query, you need to look into full text search. The above approach is
never going to be fast.

--
Andrew Sullivan | [EMAIL PROTECTED]
This work was visionary and imaginative, and goes to show that
visionary and imaginative work need not end up well.
        --Dennis Ritchie
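A minimal sketch of the full text search approach being pointed to, written against the contrib/tsearch2 module of that era; the fti column name, the 'default' configuration, and the combined title/authors vector are assumptions:

    -- Add a searchable tsvector column covering title and authors,
    -- index it with GiST, and query it with @@ instead of ILIKE '%...%'.
    ALTER TABLE base ADD COLUMN fti tsvector;

    UPDATE base
    SET fti = to_tsvector('default',
                          coalesce(title, '') || ' ' || coalesce(authors, ''));

    CREATE INDEX i_base_fti ON base USING gist (fti);

    SELECT doc_id, title
    FROM base
    WHERE fti @@ to_tsquery('default', 'something')
      AND db_name = 'something';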
Re: [SQL] Multiple DB join
On Tue, Aug 15, 2006 at 10:25:00AM -0300, Jorge Godoy wrote:
> I'd like to see some sort of data partitioning in PostgreSQL.

Sure, I think everybody would. I think it's among the more
interesting problems we have.

A

--
Andrew Sullivan | [EMAIL PROTECTED]
The plural of anecdote is not data.
        --Roger Brinner
Re: [SQL] Multiple DB join
On 8/15/06, Andrew Sullivan <[EMAIL PROTECTED]> wrote:
> On Tue, Aug 15, 2006 at 10:11:41AM -0400, Sumeet Ambre wrote:
>> The design of the database is this way because our organization wants to
>> split up different datasets into different entities, and there is a
>> possibility that they'll run a different instance of postgres for each
>> dataset.
>
> It's this "different instances" thing that I'm having a tough time
> with. Is this because they want to be able to query local things
> when disconnected or something? I can think of applications for
> this, for sure; I'm just suggesting that you make sure you're not
> building an optimisation that is (1) premature and (2) possibly a
> pessimal operation.

The reason for splitting up the dbs into different instances is that in
case one of the postgres instances on the server goes down for some
reason, it doesn't affect the other instances running on the same server.
Even I don't know the reason for this kind of approach, but I have to
deal with it.

>> records, the query runs damn slow. Below is the sample schema for
>> my base table and the query I try to run on it; it takes me more than
>> 2-3 minutes to run a query. Is there any way I could speed this up?
>
> The problem is not your design, nor even the size of the data
> exactly, but the query:
>
>> e.g. SELECT * FROM base WHERE ((title ILIKE '%something%') AND (authors
>> ~* '.*something.*') AND (db_name='something'));
>
> You have two initially-unbound search terms there: ILIKE '%' and ~*
> '.*' are automatically seqscans, because you have nowhere in the
> index to start. If you really want to do this kind of unbound-string
> query, you need to look into full text search. The above approach is
> never going to be fast.

I previously thought of doing the full text search indexing, but I had an
intuition that full text search is for fields which have very large
strings; in my case the strings are not above 200 chars in length. So
would full text search give me the performance we need? Also, I tried
combined indexes on title and authors in my base table; would indexes of
any kind help me in this case?

Thanks for your prompt replies,
Sumeet.

--
Thanks,
Sumeet.
Re: [SQL] Multiple DB join
On Tue, Aug 15, 2006 at 10:35:36AM -0400, Sumeet wrote:
> The reason for splitting up the dbs into different instances is that in
> case one of the postgres instances on the server goes down for some
> reason, it doesn't affect the other instances running on the same server.
> Even I don't know the reason for this kind of approach, but I have to
> deal with it.

I certainly sympathise with the situation where managers decide to
solve problems that never happen. Given that these are on the same
server, it is _way_ more likely that you'll introduce a problem due
to running several instances of the back end than that one instance
of Postgres will magically die (especially in such a way that other
instances will continue to work). But if Some Important Person
decreed it, you have my sympathy.

> I previously thought of doing the full text search indexing, but I had an
> intuition that full text search is for fields which have very large
> strings; in my case the strings are not above 200 chars in length.

No, it's for data where you're going to be searching in random pieces
of the text. I think you should look at tsearch2, probably. If you
had any way to enforce bounded searches, it'd be a different matter:
strings with initial matches but an unbound end are fast. (You can do
it the other way, too, by some tricks with reversing the strings.)

A

--
Andrew Sullivan | [EMAIL PROTECTED]
The fact that technology doesn't work is no bar to success in the
marketplace.
        --Philip Greenspun
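A minimal sketch of the string-reversal trick mentioned in passing. Since reverse() was not built in at the time, the reverse_text() helper below is a hypothetical addition, and the index and query are illustrative:

    -- A simple reverse helper, marked IMMUTABLE so it can be used in an
    -- expression index and constant-folded at plan time.
    CREATE OR REPLACE FUNCTION reverse_text(t text) RETURNS text AS $$
    DECLARE
        result text := '';
        i integer;
    BEGIN
        FOR i IN 1 .. length(t) LOOP
            result := substr(t, i, 1) || result;
        END LOOP;
        RETURN result;
    END
    $$ LANGUAGE plpgsql IMMUTABLE STRICT;

    CREATE INDEX i_base_title_rev ON base (reverse_text(title) text_pattern_ops);

    -- A suffix search such as title LIKE '%virus' becomes a prefix search
    -- on the reversed string, which can use the index:
    SELECT doc_id, title
    FROM base
    WHERE reverse_text(title) LIKE reverse_text('virus') || '%';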
Re: [SQL] Undo an update
On Friday, 11 August 2006 at 19:26, Judith wrote:
> Is there a way to undo an update???

First of all: I speak little English, so I'm sorry. If I understand your
problem correctly, maybe I can help you... Maybe...

So, when I'd like to save the inserted, deleted and/or updated rows, I
create an extra table and some triggers...

    -- Function: generate_log_table()

    CREATE OR REPLACE FUNCTION generate_log_table() RETURNS "trigger" AS
    $BODY$
    DECLARE
        query text;
    BEGIN
        IF (TG_OP = 'INSERT') THEN
            query := 'INSERT INTO data_table VALUES(''' || NEW.a || ''');';
        ELSIF (TG_OP = 'UPDATE') THEN
            query := 'UPDATE data_table SET a = ''' || NEW.a
                     || ''' WHERE id_table = ' || NEW.id_table || ';';
        ELSIF (TG_OP = 'DELETE') THEN
            query := 'DELETE FROM data_table WHERE id_table = '
                     || OLD.id_table || ';';
        END IF;

        INSERT INTO log_table (fecha, instruction) VALUES (now(), query);

        -- The return value is ignored for AFTER triggers.
        RETURN NULL;
    END
    $BODY$
    LANGUAGE 'plpgsql' VOLATILE;

    -- Table: data_table

    CREATE TABLE data_table (
        a text,
        id_table serial NOT NULL,
        CONSTRAINT table_pkey PRIMARY KEY (id_table)
    ) WITH OIDS;

    -- Trigger: generate_log_table on data_table

    CREATE TRIGGER generate_log_table
        AFTER INSERT OR UPDATE OR DELETE ON data_table
        FOR EACH ROW EXECUTE PROCEDURE generate_log_table();

    -- Table: log_table

    CREATE TABLE log_table (
        fecha timestamp,
        instruction text,
        id_table_log serial NOT NULL,
        CONSTRAINT log_table_pkey PRIMARY KEY (id_table_log)
    ) WITH OIDS;

So, if you send every insert, update, delete command to a logger table,
then you can undo anything. I think...

Bye,
kjt
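A short usage sketch of the logging setup above; the id_table = 1 value assumes the serial sequence starts at 1:

    -- Exercise the trigger, then inspect what was logged.
    INSERT INTO data_table (a) VALUES ('original');
    UPDATE data_table SET a = 'changed' WHERE id_table = 1;
    DELETE FROM data_table WHERE id_table = 1;

    -- Each statement issued against data_table now has a timestamped entry,
    -- which is the raw material for reconstructing or reversing a change by hand.
    SELECT fecha, instruction
    FROM log_table
    ORDER BY id_table_log;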
Re: [SQL] The Right Way to manage schemas in SCM systems
On 11 Aug 2006 08:05:34 -0700, Andrew Hammond <[EMAIL PROTECTED]> wrote:
> I've been trying to figure out a good way to manage schema change
> control for a while now. Since I have a development background, I
> really want to find some way to check the schema into a SCM system like
> CVS (for example). Just using a pg_dump doesn't work very well because
> there's no guarantee of consistent ordering. One of the things I find
> most useful with SCM systems is the ability to quickly spot changes. A
> re-ordering within a single file makes for a lot of noise in the deltas.
>
> So far, the best idea I've come up with is the "file tree dump"
> approach: get a list of every object in the database and then dump it
> into a suitably named file. Finally, create a master file which
> consists of only include lines. There are a couple of problems I can
> see with this.
>
> 1) How to calculate the dependency graph (or worse, dealing with RI
> loops) to determine the right order to load things in isn't stupidly
> obvious.
> 2) If I just script this in bash, without making any changes to
> pg_dump, it doesn't happen in a single transaction.
> 3) No clever solution for relative vs absolute filenames (unless all
> this goes into a tar file, but that format already exists and doesn't
> solve my problem).
>
> So my question is, does anyone have a better way of managing schemas in
> SCM systems? And if not, is there a perceived value in the community
> for a pg_dump --format scm_friendly option? (In which case I'll take
> this thread over to hackers.)

The approach you are taking here is difficult to implement but makes life
easy on developers who like to make changes to the database whenever
needed. It can be great for rapid development, but I have chosen to take a
different approach, one that requires a little more discipline up front
but is better for controlling your testing and production environments.

For every project, I first create DDL scripts for my database. I have a
code builder that does this for me, but you could also just create the
database and use pg_dump to get your initial DDL. Next I save this DDL in
CVS. I break each schema into a separate script. If you are worried about
the order, do as pg_dump does: create the tables first, then add the
constraints like foreign keys after the structure is there. This is great
for the initial setup.

To deal with change, I have a manual process in place and use an auditing
process to check that everything is done correctly. Here is how it works:

Development:
1. Copy the production database into the development database (we do this
   at least once a week).
2. Make the change in development. The nice thing here is I use EMS
   Manager, and whenever I make a change it provides the SQL required to
   make the change.
3. Copy the SQL for the change (from EMS Manager) and put it into a
   migration script (this will be executed in production when we deploy
   the application). Order is very important here; a sketch of such a
   script follows this message.
4. Make the change to the initial DDL and check that change into CVS. Our
   code builder helps out a lot with this when the changes are large.
5. Update my application code and test with the new changes in
   development. Again, our code builder does a lot of this for us.
6. Drop the development database, refresh it from production, run the
   migration script and test the new code. If all goes well it is ready
   for production.
7. Deploy to production. We never allow developers to directly make
   changes to production. Only our administrators have that capability.

Audit:
1. Get a copy of production and put it into development.
2. Run the DDL from CVS and put it into an audit database.
3. Run a database diff. We use DB Comparer (same company as EMS Manager).
4. Reconcile differences and put them into the DDL.

We also will, from time to time, test the application against the DDL
rather than against a copy of production.

Currently we are working on ways to automate this mostly manual process.
The nice thing is, the tools we use allow us to do the work fairly
quickly. I like a little manual work in the whole process though, as it
keeps the developers better attuned to the database structure and the
changes that are being made to it.

If you were looking for a way to just back up the structure in CVS every
day, you may consider writing a script (Perl would be an excellent choice
for this) that reads the pg_dump output and splits it out into separate
files for each schema/table.

==
Aaron Bono
Aranya Software Technologies, Inc.
http://www.aranya.com
http://codeelixir.com
==
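A minimal sketch of the kind of migration script described in step 3 of the development process above; the table, column, and constraint names are hypothetical, and wrapping the script in a single transaction is a design choice so a failed deploy leaves production unchanged:

    -- migration_2006_08_15.sql: applied once to production at deploy time.
    BEGIN;

    ALTER TABLE customer ADD COLUMN referral_code text;

    ALTER TABLE orders
        ADD CONSTRAINT orders_customer_fk
        FOREIGN KEY (customer_id) REFERENCES customer (id);

    UPDATE customer SET referral_code = 'unknown' WHERE referral_code IS NULL;

    COMMIT;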
Re: [SQL] Using bitmap index scans-more efficient
Tom Lane wrote:

> Kyle Bateman <[EMAIL PROTECTED]> writes:
>> But I'm assuming that using an interval-encoded project tree, I would
>> have to do something like the following to get a progeny group:
>>
>> select * from ledger l, proj p
>>     where p.proj_id = l.proj
>>     and p.left > 1234 and p.right < 2345;
>
> btree has no idea about the constraint (that I imagine exists) that
> left <= right. If you're just doing a simple index on (left, right)
> then the above query requires scanning all index entries with
> left > 1234. It would probably help to say
>
> select * from ledger l, proj p
>     where p.proj_id = l.proj
>     and p.left > 1234 and p.left < 2345 and p.right < 2345;
>
> so that you can constrain the range of "left" values scanned.

Thanks for the replies, Tom and Florian.

My problem is not that it is difficult (or costly) to determine the
progeny of a given project. I can determine this in about 90 msec
regardless of whether I use an adjacency model, interval encoding, or a
materialized path (the current implementation). The problem is, when I
try to extract the ledger entries belonging to that progeny from a set of
a million records, it seems to need to process all million records rather
than being able to index right into the ones I want.

I'm not very good at reading explain output, but I tried to set up the
query Tom suggests by creating an interval-encoded project table
(proj_int) and then joining it to my ledger like so:

    select l.* from ledger l, proj_int i
        where l.proj = i.proj_id
        and i.lft >= 5283 and i.lft < 5300 and i.rgt <= 5300;

On my mini-test-ledger of 100,000 entries, this takes the longest time
(5 seconds) with the following explain output:

    Hash Join  (cost=19018.46..23411.52 rows=14 width=85)
      Hash Cond: ("outer".proj = "inner".proj_id)
      ->  Nested Loop Left Join  (cost=18994.38..23378.41 rows=1700 width=85)
            ->  Hash Join  (cost=18990.84..23340.87 rows=1700 width=81)
                  Hash Cond: ("outer".vendid = "inner".org_id)
                  ->  Merge Join  (cost=18935.35..23255.64 rows=1700 width=63)
                        Merge Cond: (("outer".vendid = "inner".vendid) AND (("outer".invnum)::text = "inner"."?column10?"))
                        ->  Index Scan using apinv_hdr_pkey on apinv_hdr h  (cost=0.00..3148.16 rows=51016 width=21)
                        ->  Sort  (cost=18935.35..19235.45 rows=120041 width=55)
                              Sort Key: i.vendid, (i.invnum)::text
                              ->  Seq Scan on apinv_items i  (cost=0.00..4152.99 rows=120041 width=55)
                                    Filter: ((status = 'en'::bpchar) OR (status = 'cl'::bpchar) OR (status = 'pd'::bpchar))
                  ->  Hash  (cost=50.99..50.99 rows=1799 width=26)
                        ->  Seq Scan on vend_org v  (cost=0.00..50.99 rows=1799 width=26)
            ->  Materialize  (cost=3.54..3.55 rows=1 width=4)
                  ->  Seq Scan on acct a  (cost=0.00..3.54 rows=1 width=4)
                        Filter: ((code)::text = 'ap'::text)
      ->  Hash  (cost=24.06..24.06 rows=10 width=4)
            ->  Bitmap Heap Scan on proj_int i  (cost=2.26..24.06 rows=10 width=4)
                  Recheck Cond: ((lft >= 5283) AND (lft < 5300) AND (rgt <= 5300))
                  ->  Bitmap Index Scan on i_proj_int_lft_rgt  (cost=0.00..2.26 rows=10 width=0)
                        Index Cond: ((lft >= 5283) AND (lft < 5300) AND (rgt <= 5300))

That is roughly equivalent to my materialized path method:

    select l.* from ledger l
        where l.proj in (select proj_id from proj_v where 4737 = any(ppath));

And is quite slow compared to 150 msec when enumerating the progeny
projects like so:

    select l.* from ledger l
        where l.proj in (4737,4789,4892,4893,4894,4895,4933,4934,4935);

    Nested Loop Left Join  (cost=19.73..4164.10 rows=7 width=85)
      ->  Nested Loop  (cost=19.73..4139.08 rows=7 width=81)
            ->  Nested Loop  (cost=19.73..4100.07 rows=7 width=63)
                  ->  Bitmap Heap Scan on apinv_items i  (cost=19.73..1185.71 rows=487 width=55)
                        Recheck Cond: ((proj = 4737) OR (proj = 4789) OR (proj = 4892) OR (proj = 4893) OR (proj = 4894) OR (proj = 4895) OR (proj = 4933) OR (proj = 4934) OR (proj = 4935))
                        Filter: ((status = 'en'::bpchar) OR (status = 'cl'::bpchar) OR (status = 'pd'::bpchar))
                        ->  BitmapOr  (cost=19.73..19.73 rows=495 width=0)
                              ->  Bitmap Index Scan on i_apinv_items_proj  (cost=0.00..2.19 rows=55 width=0)
                                    Index Cond: (proj = 4737)
                              ->  Bitmap Index Scan on i_apinv_items_proj  (cost=0.00..2.19 rows=55 width=0)
                                    Index Cond: (proj = 4789)
                              ->  Bitmap Index Scan on i_apinv_items_proj  (cost=0.00..2.19 rows=55 width=0)
                                    Index Cond: (proj = 4892)
                              ->  Bitmap Index Scan on i_apinv_items_proj  (cost=0.00..2.19 rows=55 width=0)
Re: [SQL] Using bitmap index scans-more efficient
Tom Lane wrote:

> Kyle Bateman <[EMAIL PROTECTED]> writes:
>> But I'm assuming that using an interval-encoded project tree, I would
>> have to do something like the following to get a progeny group:
>>
>> select * from ledger l, proj p
>>     where p.proj_id = l.proj
>>     and p.left > 1234 and p.right < 2345;

Here's an interesting result:

I created a function proj_left(int4) that returns the left interval
number for a given project. Then I created an index on the underlying
table for the ledger view (which took forever to build) like so:

    create index i_test on apinv_items (proj_left(proj));

Now my query:

    select * from ledger
        where proj_left(dr_proj) >= 5283 and proj_left(dr_proj) < 5300;

is very speedy. The problem is, I had to mark the function proj_left()
as immutable, which it cannot be, since the left and right values for a
given project will change any time a project is added, removed, or moved
around the hierarchy. :(

So is there any good way to tell the planner to do several individual
index scans for the projects involved in the desired progeny, OR the
results together, and return the result? This is what it seems to be
choosing in the case of the query:

    select * from ledger
        where proj in (4737,4789,4892,4893,4894,4895,4933,4934,4935);