Re: [SQL] Extracting hostname from URI column
Ah, I figured out what to look for and found my `uniq -c' solution:

  SELECT substring(href FROM '.*://([^/]*)') AS hostname,
         count(substring(href FROM '.*://([^/]*)'))
  FROM url
  WHERE id < 10
  GROUP BY hostname
  ORDER BY count DESC;

           hostname          | count
  ---------------------------+-------
   texturizer.net            |     2
   www.google.com            |     2
   dictionary.reference.com  |     1
   www.mozillazine.org       |     1
   devedge.netscape.com      |     1
   groups.google.com         |     1
   forums.mozillazine.org    |     1

Thanks for the quick help with the substring function, people!

Otis

----- Original Message ----
From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
To: pgsql-sql@postgresql.org
Sent: Tuesday, September 11, 2007 11:16:15 PM
Subject: Re: [SQL] Extracting hostname from URI column

> And what I'd like is something that would give me the counts for the
> number of occurrences of each unique hostname. Something much like
> `uniq -c'. Can anyone tell me how that's done or where I should look
> for info?

---(end of broadcast)---
TIP 6: explain analyze is your friend
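A note on the WARNING discussed in this thread: in the pattern '.*://\([^/]*)' the backslash is consumed by PostgreSQL's (nonstandard) string-literal escape processing before the regex engine ever sees it, which is what triggers the warning. The backslash is unnecessary anyway, since a bare parenthesis already opens a capture group in a POSIX regex, so the warning-free form is simply:

```sql
-- Same extraction without the backslash that triggered the WARNING;
-- '(' already starts a capture group in a POSIX regex.
SELECT substring(href FROM '.*://([^/]*)') AS hostname
FROM url
WHERE id < 10;
```

If a literal backslash were actually wanted in a pattern, the HINT's escape-string syntax (e.g. E'\\(' for a literal parenthesis) would be the way to write it.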
Re: [SQL] Extracting hostname from URI column
Hi,

Thanks, perfect! (though I'll have to look into the regex warning):

  => SELECT substring(href FROM '.*://\([^/]*)') AS hostname
     FROM url WHERE id < 10;
  WARNING:  nonstandard use of escape in a string literal at character 29
  HINT:  Use the escape string syntax for escapes, e.g., E'\r\n'.

So now I have this:

          hostname
  --------------------------
   texturizer.net
   texturizer.net
   forums.mozillazine.org
   www.mozillazine.org
   devedge.netscape.com
   www.google.com
   groups.google.com
   www.google.com
   dictionary.reference.com

And what I'd like is something that would give me the counts for the number of occurrences of each unique hostname. Something much like `uniq -c'. Can anyone tell me how that's done or where I should look for info? (I'm not sure what to look for, that's the problem.)

Thanks,
Otis

----- Original Message ----
From: chester c young <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: sql pgsql
Sent: Tuesday, September 11, 2007 8:42:46 PM
Subject: Re: [SQL] Extracting hostname from URI column

> I'm trying to use substr() and position() functions to extract the
> full host name (and later a domain) from a column that holds URLs.

substring( href from '.*://\([^/]*)' );
[SQL] Extracting hostname from URI column
Hi,

I'm trying to use substr() and position() functions to extract the full host name (and later a domain) from a column that holds URLs. This is what I'm trying, but it clearly doesn't do the job:

  => SELECT substr(href,
                   position('://' in href) + 3,
                   position('://' in href) + 3 + position('/' in href)),
            href
     FROM url WHERE id < 10;

       substr     |                           href
  ----------------+----------------------------------------------------------
   texturizer.net | http://texturizer.net/firebird/extensions/
   texturizer.net | http://texturizer.net/firebird/themes/
   forums.mozilla | http://forums.mozillazine.org/index.php?c=4
   www.mozillazin | http://www.mozillazine.org/
   devedge.netsca | http://devedge.netscape.com/viewsource/2002/bookmarks/
   www.google.com | http://www.google.com/search?&q=%s
   groups.google. | http://groups.google.com/groups?scoring=d&q=%s
   www.google.com | http://www.google.com/search?q=%s&btnI=I'm+Feeling+Lucky
   dictionary.ref | http://dictionary.reference.com/search?q=%s

The 3rd param to the substr function is clearly wrong. Is it even doable without writing a procedure? Finally, is this the fastest way to get this data, or is there a regex-based function that might be faster?

Thanks,
Otis
Re: [SQL] Help with a seq scan on multi-million row table
Hello Andrew, Markus, and Tom - thanks for all the help! You've just helped a large Simpy community! :) I'll try to post some performance charts to http://blog.simpy.com/ shortly. In short, this immediately dropped the load from 2-3-4-5-6-7+ to circa 0.25.

Thanks!
Otis

----- Original Message ----
From: Andrew Sullivan <[EMAIL PROTECTED]>
To: pgsql-sql@postgresql.org
Sent: Thursday, May 11, 2006 1:18:08 PM
Subject: Re: [SQL] Help with a seq scan on multi-million row table

On Thu, May 11, 2006 at 10:09:44AM -0700, [EMAIL PROTECTED] wrote:
> Hi Markus & Tom,
>
> Higher statistics for this column... hm, I'd love to try changing
> it to see how that changes things, but I'm afraid I don't know how
> to do that. How can I change the statistics target value for this
> column?
>
> Ah, I think I found the place:

No. Just ALTER TABLE [name] ALTER [column] SET STATISTICS. See
http://www.postgresql.org/docs/8.1/interactive/sql-altertable.html
for more. You'll need to ANALYSE afterwards.

A

--
Andrew Sullivan | [EMAIL PROTECTED]
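Andrew's advice in concrete form, using the table and column from this thread (the target of 100 matches Markus's suggestion elsewhere in the thread; it is a starting point for experimentation, not a tuned value):

```sql
-- Raise the per-column statistics target, then re-gather statistics:
ALTER TABLE user_url_tag ALTER user_url_id SET STATISTICS 100;
ANALYZE user_url_tag;
```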
Re: [SQL] Help with a seq scan on multi-million row table
Hi Markus & Tom,

Higher statistics for this column... hm, I'd love to try changing it to see how that changes things, but I'm afraid I don't know how to do that. How can I change the statistics target value for this column?

Ah, I think I found the place:

  => SELECT * FROM pg_attribute WHERE attname='user_url_id';

   attrelid |   attname   | atttypid | attstattarget | attlen | attnum | attndims | attcacheoff | atttypmod | attbyval | attstorage | attalign | attnotnull | atthasdef | attisdropped | attislocal | attinhcount
  ----------+-------------+----------+---------------+--------+--------+----------+-------------+-----------+----------+------------+----------+------------+-----------+--------------+------------+-------------
    6124839 | user_url_id |       23 |            -1 |      4 |      1 |        0 |          -1 |        -1 | t        | p          | i        | f          | f         | f            | t          |           0
    1646081 | user_url_id |       23 |            -1 |      4 |      2 |        0 |          -1 |        -1 | t        | p          | i        | f          | f         | f            | t          |           0
   10048109 | user_url_id |       23 |            -1 |      4 |      3 |        0 |          -1 |        -1 | t        | p          | i        | f          | f         | f            | t          |           0
   10048123 | user_url_id |       23 |            -1 |      4 |      2 |        0 |          -1 |        -1 | t        | p          | i        | f          | f         | f            | t          |           0

Hm, 4 rows. I need to change the value of the 'attstattarget' column, but for which of these rows? Only attrelid is different. I tried looking at pg_class, but didn't find anything with the above attrelid's. I used:

  => SELECT * FROM pg_class WHERE relname LIKE 'user_url%';

Tom: you asked about distinct values. pg_stats shows circa 60K distinct values, but the real number is:

  SELECT count(DISTINCT user_url_id) FROM user_url_tag;

    count
  ---------
   1505933

This number grows daily by... not sure how much, probably 5K a day currently.

Thanks,
Otis

----- Original Message ----
From: Markus Schaber <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: pgsql-sql@postgresql.org
Sent: Thursday, May 11, 2006 6:33:55 AM
Subject: Re: [SQL] Help with a seq scan on multi-million row table

Hi, Otis,

[EMAIL PROTECTED] wrote:
> I'm not sure which numbers you are referring to when you said the estimate is
> off, but here are some numbers:
> The whole table has 6-7 M rows.
> That query matches about 2500 rows.
>
> If there are other things I can play with and help narrow this down, please
> let me know.

Did you try to set higher statistics targets for this column? For experimenting, I'd try to set it to 100 or even higher, then ANALYZE the table, and then retest the query.

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
Re: [SQL] Help with a seq scan on multi-million row table
Not sure if I'm showing you what you asked for, but here it is:

  => SELECT * FROM pg_stats WHERE tablename='user_url_tag' AND attname='user_url_id';

  schemaname        | public
  tablename         | user_url_tag
  attname           | user_url_id
  null_frac         | 0
  avg_width         | 4
  n_distinct        | 60825
  most_common_vals  | {458321,1485346,16304,68027,125417,153465,182503,201175,202973,218423}
  most_common_freqs | {0.0013,0.001,0.00067,0.00067,0.00067,0.00067,0.00067,0.00067,0.00067,0.00067}
  histogram_bounds  | {195,195993,325311,480323,647778,782598,1014527,1201726,1424822,1614712,1853719}
  correlation       | 0.795521

You asked if the table has been analyzed recently. I think so - I run ANALYZE on the whole DB every night, like this:

  $ psql -U me -c "ANALYZE;" mydb

For good measure, I just analyzed the table now:

  $ psql -U me -c "ANALYZE user_url_tag;" mydb

Then I set enable_hashjoin back to ON and re-ran the EXPLAIN ANALYZE. I still get the sequential scan, even after analyzing the table :(

I'm not sure which numbers you are referring to when you said the estimate is off, but here are some numbers:
The whole table has 6-7 M rows.
That query matches about 2500 rows.

If there are other things I can play with and help narrow this down, please let me know.

Thanks,
Otis

----- Original Message ----
From: Tom Lane <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: pgsql-sql@postgresql.org
Sent: Wednesday, May 10, 2006 9:53:49 PM
Subject: Re: [SQL] Help with a seq scan on multi-million row table

<[EMAIL PROTECTED]> writes:
> Aha! set hashjoin=off did the trick.
>   ->  Index Scan using ix_user_url_tag_user_url_id on
> user_url_tag userurltag0_  (cost=0.00..157.34 rows=103 width=14) (actual
> time=1.223..1.281 rows=5 loops=1666)
>         Index Cond: (userurltag0_.user_url_id = "outer".id)

This seems to be the problem right here: the estimate of matching rows is off by a factor of 20, and that inflates the overall cost estimate for this plan about the same, causing the planner to think the other way is cheaper.

What does the pg_stats row for user_url_tag.user_url_id contain? Have you analyzed that table recently?

			regards, tom lane
Re: [SQL] Help with a seq scan on multi-million row table
Aha! set enable_hashjoin=off did the trick. The PG version is: 8.0.3

NB: I removed that redundant "DISTINCT" after the SELECT.

  EXPLAIN ANALYZE
  SELECT userurltag0_.tag AS x0_0_, COUNT(*) AS x1_0_
  FROM user_url_tag userurltag0_, user_url userurl1_
  WHERE (((userurl1_.user_id=1 )) AND ((userurltag0_.user_url_id=userurl1_.id )))
  GROUP BY userurltag0_.tag
  ORDER BY count(*) DESC;

                                                                QUERY PLAN
  -----------------------------------------------------------------------------------------------------------------------------------------
   Sort  (cost=155766.79..155774.81 rows=3207 width=10) (actual time=2387.756..2396.578 rows=2546 loops=1)
     Sort Key: count(*)
     ->  HashAggregate  (cost=155572.02..155580.03 rows=3207 width=10) (actual time=2365.643..2376.626 rows=2546 loops=1)
           ->  Nested Loop  (cost=0.00..12.68 rows=3867 width=10) (actual time=0.135...028 rows=8544 loops=1)
                 ->  Index Scan using ix_user_url_user_id_url_id on user_url userurl1_  (cost=0.00..2798.12 rows=963 width=4) (actual time=0.067..9.744 rows=1666 loops=1)
                       Index Cond: (user_id = 1)
                 ->  Index Scan using ix_user_url_tag_user_url_id on user_url_tag userurltag0_  (cost=0.00..157.34 rows=103 width=14) (actual time=1.223..1.281 rows=5 loops=1666)
                       Index Cond: (userurltag0_.user_url_id = "outer".id)
   Total runtime: 2405.691 ms
  (9 rows)

Are you still interested in the EXPLAIN ANALYZE output for "its second-choice join type"? If you are, please tell me what join types those are; this is a bit beyond me. :(

Is there a way to force PG to use the index automatically? This query is executed from something called Hibernate, and I'm not sure if that will let me set enable_hashjoin=off through its API...

Thanks,
Otis

----- Original Message ----
From: Tom Lane <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: pgsql-sql@postgresql.org
Sent: Wednesday, May 10, 2006 8:27:01 PM
Subject: Re: [SQL] Help with a seq scan on multi-million row table

<[EMAIL PROTECTED]> writes:
>   ->  Hash Join  (cost=2797.65..140758.50 rows=3790 width=10)
> (actual time=248.530..380635.132 rows=8544 loops=1)
>         Hash Cond: ("outer".user_url_id = "inner".id)
>         ->  Seq Scan on user_url_tag userurltag0_
> (cost=0.00..106650.30 rows=6254530 width=14) (actual time=0.017..212256.630
> rows=6259553 loops=1)
>         ->  Hash  (cost=2795.24..2795.24 rows=962 width=4)
> (actual time=199.840..199.840 rows=0 loops=1)
>               ->  Index Scan using ix_user_url_user_id_url_id on
> user_url userurl1_  (cost=0.00..2795.24 rows=962 width=4) (actual
> time=0.048..193.707 rows=1666 loops=1)
>                     Index Cond: (user_id = 1)

Hm, I'm not sure why it's choosing that join plan. A nestloop indexscan wouldn't be terribly cheap, but just counting on my fingers it seems like it ought to come in at less than 10 cost units. What do you get if you set enable_hashjoin off? (Then try disabling its second-choice join type too --- I'm interested to see EXPLAIN ANALYZE output for all three join types.) What PG version is this exactly?

			regards, tom lane
Re: [SQL] Help with a seq scan on multi-million row table
Hi,

Thanks Bruno. That was indeed a redundant DISTINCT. It did reduce the amount of work, but as you said, it doesn't get rid of the sequential scan, which is the real problem with this query.

Otis

----- Original Message ----
From: Bruno Wolff III <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: pgsql-sql@postgresql.org
Sent: Wednesday, May 10, 2006 3:23:29 PM
Subject: Re: [SQL] Help with a seq scan on multi-million row table

On Wed, May 10, 2006 at 13:13:59 -0500, [EMAIL PROTECTED] wrote:
> Hello,
>
> I have a little 2-table JOIN, GROUP BY, ORDER BY query that does a sequential
> scan on a multi-million row table. I _thought_ I had all the appropriate
> indices, but apparently I do not. I was wondering if anyone can spot a way I
> can speed up this query.
> The query currently takes... *gulp*: 381119.201 ms :(
>
> There are only 2 tables in the game: user_url and user_url_tag. The latter
> has FKs pointing to the former. The sequential scan happens on the latter -
> user_url_tag:
>
> EXPLAIN ANALYZE select DISTINCT userurltag0_.tag as x0_0_, COUNT(*) as x1_0_
> from user_url_tag userurltag0_, user_url userurl1_ WHERE
> (((userurl1_.user_id=1 )) AND ((userurltag0_.user_url_id=userurl1_.id )))
> GROUP BY userurltag0_.tag ORDER BY count(*) DESC;

While this isn't a big issue, it looks like DISTINCT is redundant in your query and seems to be adding some extra work.
[SQL] Help with a seq scan on multi-million row table
Hello,

I have a little 2-table JOIN, GROUP BY, ORDER BY query that does a sequential scan on a multi-million row table. I _thought_ I had all the appropriate indices, but apparently I do not. I was wondering if anyone can spot a way I can speed up this query. The query currently takes... *gulp*: 381119.201 ms :(

There are only 2 tables in the game: user_url and user_url_tag. The latter has FKs pointing to the former. The sequential scan happens on the latter - user_url_tag:

  EXPLAIN ANALYZE
  select DISTINCT userurltag0_.tag as x0_0_, COUNT(*) as x1_0_
  from user_url_tag userurltag0_, user_url userurl1_
  WHERE (((userurl1_.user_id=1 )) AND ((userurltag0_.user_url_id=userurl1_.id )))
  GROUP BY userurltag0_.tag
  ORDER BY count(*) DESC;

                                                                QUERY PLAN
  -----------------------------------------------------------------------------------------------------------------------------------------
   Unique  (cost=140972.22..140996.28 rows=3207 width=10) (actual time=381082.868..381110.094 rows=2546 loops=1)
     ->  Sort  (cost=140972.22..140980.24 rows=3207 width=10) (actual time=381082.858..381091.733 rows=2546 loops=1)
           Sort Key: count(*), userurltag0_.tag
           ->  HashAggregate  (cost=140777.45..140785.46 rows=3207 width=10) (actual time=381032.844..381064.068 rows=2546 loops=1)
                 ->  Hash Join  (cost=2797.65..140758.50 rows=3790 width=10) (actual time=248.530..380635.132 rows=8544 loops=1)
                       Hash Cond: ("outer".user_url_id = "inner".id)
                       ->  Seq Scan on user_url_tag userurltag0_  (cost=0.00..106650.30 rows=6254530 width=14) (actual time=0.017..212256.630 rows=6259553 loops=1)
                       ->  Hash  (cost=2795.24..2795.24 rows=962 width=4) (actual time=199.840..199.840 rows=0 loops=1)
                             ->  Index Scan using ix_user_url_user_id_url_id on user_url userurl1_  (cost=0.00..2795.24 rows=962 width=4) (actual time=0.048..193.707 rows=1666 loops=1)
                                   Index Cond: (user_id = 1)
   Total runtime: 381119.201 ms
  (11 rows)

This is what the two tables look like (extra columns removed):

         Table "public.user_url_tag"
     Column    |         Type          |                          Modifiers
  -------------+-----------------------+---------------------------------------------------------------
   id          | integer               | not null default nextval('public.user_url_tag_id_seq'::text)
   user_url_id | integer               |
   tag         | character varying(64) |
  Indexes:
      "pk_user_url_tag_id" PRIMARY KEY, btree (id)
      "ix_user_url_tag_tag" btree (tag)
      "ix_user_url_tag_user_url_id" btree (user_url_id)
  Foreign-key constraints:
      "fk_user_url_tag_user_url_id" FOREIGN KEY (user_url_id) REFERENCES user_url(id)

          Table "public.user_url"
   Column  |  Type   |                        Modifiers
  ---------+---------+----------------------------------------------------------
   id      | integer | not null default nextval('public.user_url_id_seq'::text)
   user_id | integer |
   url_id  | integer |
  Indexes:
      "pk_user_url_id" PRIMARY KEY, btree (id)
      "ix_user_url_url_id_user_id" UNIQUE, btree (url_id, user_id)
      "ix_user_url_user_id_url_id" UNIQUE, btree (user_id, url_id)

Does anyone see a way to speed up this s-l-o-w query? I cache DB results, but I'd love to get rid of that sequential scan.

Thanks,
Otis
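One quick diagnostic for a plan like this — the same experiment Tom Lane suggests elsewhere in this thread — is to disable the chosen join method for the session and see whether the planner's second choice is actually faster. This is a session-local experiment, not a production setting:

```sql
-- enable_hashjoin is a real planner GUC; turning it off only for this
-- session forces the planner to cost a different join method.
SET enable_hashjoin = off;
EXPLAIN ANALYZE
SELECT userurltag0_.tag, count(*)
FROM user_url_tag userurltag0_, user_url userurl1_
WHERE userurl1_.user_id = 1
  AND userurltag0_.user_url_id = userurl1_.id
GROUP BY userurltag0_.tag
ORDER BY count(*) DESC;
SET enable_hashjoin = on;  -- restore the default afterwards
```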
[SQL] Multi-column index not used, new flipped column index is
Hi,

I'm using PG 8.0.3 and recently spotted a query that was not using a multi-column index I had created. The index looks like:

  CREATE INDEX ... ON FooTable(fkColumnOne, fkColumnTwo);

The query that was not using the index was using:

  SELECT <a bunch of columns and joins> WHERE FooTable.fkColumnTwo=1000;

So I added another index where the indexed columns are flipped:

  CREATE INDEX ... ON FooTable(fkColumnTwo, fkColumnOne);

Now the query started to use that index -- good! But now I have 2 indices that are nearly the same, and that means overhead during INSERTs/DELETEs. :(

Is there a way to get this to use that first index, so I don't have to have this additional index? If I had PG 8.1.3, would PG know how to use that first index? I seem to recall something about this somewhere... but I don't see it at http://www.postgresql.org/docs/whatsnew .

Thanks,
Otis
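A sketch of the two-index situation described above (index names are placeholders). As far as I recall — worth verifying against the 8.1 release notes — the 8.1 planner can use a btree index even when only a non-leading column is constrained, but it must then scan the entire index, so the flipped index usually remains the efficient choice:

```sql
-- Original index: leading column is fkColumnOne, so a predicate on
-- fkColumnTwo alone cannot descend the btree efficiently.
CREATE INDEX ix_footable_one_two ON FooTable (fkColumnOne, fkColumnTwo);

-- Flipped index: the constrained column leads, so
--   WHERE fkColumnTwo = 1000
-- can use a normal index scan.
CREATE INDEX ix_footable_two_one ON FooTable (fkColumnTwo, fkColumnOne);
```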
Re: [SQL] Referential integrity broken (8.0.3), sub-select help
Hi,

I mistakenly swapped the tables in my email. Here they are, corrected:

  Table "url":
    id SERIAL
       CONSTRAINT pk_url_id PRIMARY KEY

  Table "bookmark":
    url_id INTEGER
           CONSTRAINT fk_url_id REFERENCES url(id)

I see my questions got chopped off from this email below, so let me restate them:

Problem #1: Strange that PG allowed this to happen. Maybe my DDL above allows this to happen and needs to be tightened? I thought the above would ensure referential integrity, but maybe I need to specify something else?

Problem #2: I'd like to find all rows in B that point to non-existent rows in U. I can do it with the following sub-select, I believe, but it's rather inefficient (EXPLAIN shows both tables would be sequentially scanned):

  SELECT * FROM bookmark WHERE id NOT IN
    (SELECT b.id FROM bookmark b, url u WHERE b.url_id=u.id);

Is there a more efficient way to get the rows from "bookmark"?

Thanks,
Otis

----- Original Message ----
From: Stephan Szabo <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: pgsql-sql@postgresql.org
Sent: Tuesday, March 21, 2006 10:08:38 AM
Subject: Re: [SQL] Referential integrity broken (8.0.3), sub-select help

On Tue, 21 Mar 2006 [EMAIL PROTECTED] wrote:

> I've got 2 tables, "url" (U), and "bookmark" (B), with "bookmark" pointing to
> "url" via FK.

That's not what your schema below has. Your fragment below has URL pointing to bookmark.

> Somehow I ended up with some rows in B referencing non-existent rows in U.

With the below, this is entirely possible, since you're only guaranteeing that URLs have valid bookmarks, not the other way around. Are you sure the below is actually what you have?

> This sounds super strange and dangerous to me, and it's not clear to me
> how/why PG let this happen.
> I'm using 8.0.3.
>
> Here are the table references I just mentioned:
>
> Table "bookmark":
>   id SERIAL
>      CONSTRAINT pk_bookmark_id PRIMARY KEY
>
> Table "url":
>   url_id INTEGER
>          CONSTRAINT fk_bookmark_id REFERENCES bookmark(id)
[SQL] Referential integrity broken (8.0.3), sub-select help
Hello,

I've got 2 tables, "url" (U) and "bookmark" (B), with "bookmark" pointing to "url" via FK. Somehow I ended up with some rows in B referencing non-existent rows in U. This sounds super strange and dangerous to me, and it's not clear to me how/why PG let this happen. I'm using 8.0.3.

Here are the table references I just mentioned:

  Table "bookmark":
    id SERIAL
       CONSTRAINT pk_bookmark_id PRIMARY KEY

  Table "url":
    url_id INTEGER
           CONSTRAINT fk_bookmark_id REFERENCES bookmark(id)

Problem #1: Strange that PG allowed this to happen. Maybe my DDL above allows this to happen and needs to be tightened? I thought the above would ensure referential integrity, but maybe I need to specify something else?

Problem #2: I'd like to find all rows in B that point to non-existent rows in U. I can do it with the following sub-select, I believe, but it's rather inefficient (EXPLAIN shows both tables would be sequentially scanned):

  SELECT * FROM bookmark WHERE id NOT IN
    (SELECT b.id FROM bookmark b, url u WHERE b.url_id=u.id);

Is there a more efficient way to get the rows from "bookmark"?

Thanks,
Otis
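For Problem #2, a common alternative to the NOT IN sub-select is an outer-join anti-join, which gives the planner a chance to use the index on url's primary key. A sketch using the corrected table layout from this thread (bookmark.url_id referencing url.id):

```sql
-- Bookmarks whose url_id matches no row in url: the LEFT JOIN keeps
-- every bookmark, and the IS NULL test keeps only the orphans.
SELECT b.*
FROM bookmark b
LEFT JOIN url u ON u.id = b.url_id
WHERE u.id IS NULL;
```

A NOT EXISTS sub-select is another equivalent formulation.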
[SQL] READ COMMITTED without START TRANSACTION?
Hello,

I need to run some SELECT queries that take a while (10+ minutes) to complete, and I'm wondering about the isolation of the results I get. More precisely, while my SELECT is running, the DB is being updated by another application, and I am wondering which, if any, data changes my SELECT will see.

Example: If I start my SELECT at 10:00 and it finishes at 10:10, will my results include data that was inserted between 10:00 and 10:10? Similarly, will my results include data that was updated between 10:00 and 10:10? The same question for data that was deleted during that period.

If it matters, my SELECT runs from a psql client, while concurrent inserts, updates, and deletes are executed from a separate application (webapp).

For my purposes in this case I need the SELECT to get the results that represent the data right at the beginning of the query - a snapshot. I read this: http://www.postgresql.org/docs/8.1/static/transaction-iso.html and it looks like this is the default PG behaviour (READ COMMITTED).

Question: If I do not explicitly START TRANSACTION before the SELECT, will this READ COMMITTED behaviour still be in effect?

Thanks,
Otis
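A hedged note on the question above: under READ COMMITTED, each individual statement sees a snapshot taken when that statement starts, so a single long SELECT already behaves as a point-in-time snapshot even without an explicit transaction. A snapshot shared across several statements needs an explicit transaction at a higher isolation level, e.g.:

```sql
-- SERIALIZABLE takes one snapshot at the first query and holds it for
-- the whole transaction (some_report_table and its column are
-- placeholder names, not from the original post):
START TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM some_report_table;
SELECT sum(amount) FROM some_report_table;  -- sees the same snapshot
COMMIT;
```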
[SQL] Need help: Find dirty rows, Update, Delete SQL
Hello,

I need a bit of help with some SQL. I have two tables, call them Page and Bookmark. Each row in Page can have many Bookmarks pointing to it, and they are joined via a FK (Page.id = Bookmark.page_id). Page has a 'url' column: Page.url, which has a unique index on it.

My Page.url column got a little dirty, and I need to clean it up, and that's what I need help with. Here is an example of dirtiness:

  Page:
    id=1  url = 'http://example.com/'
    id=2  url = 'http://example.com/#'     -- dirty
    id=3  url = 'http://example.com/#foo'  -- dirty

The last two rows are dirty. Normally I normalize URLs before inserting them, but these got in, and now I need to clean them. The problem is that rows in the Bookmark table may point to dirty rows in Page, so I can't just remove the dirty rows, and I can't just update the 'url' column in Page to 'http://example.com/', because that column is unique.

Is there some fancy SQL that I can use to find the dirty rows in Page (... where url like '%#%'), then find rows in the Bookmark table that point to them, then point those rows to good rows in Page (e.g. the id=1 row above), and finally remove the dirty rows from Page?

Any help would be greatly appreciated. I'm using Pg 8.0.3.

Thanks,
Otis
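One possible shape for the cleanup, assuming every dirty URL's normalized form (everything before the '#') already exists as its own row in Page. This is an untested sketch against the schema described above:

```sql
-- 1. Repoint bookmarks from each dirty page row to its clean twin.
--    substring(... from '^[^#]*') keeps the part of the URL before '#'.
UPDATE bookmark
SET page_id = clean.id
FROM page dirty, page clean
WHERE bookmark.page_id = dirty.id
  AND dirty.url LIKE '%#%'
  AND clean.url = substring(dirty.url FROM '^[^#]*');

-- 2. Delete the dirty rows, which nothing references any more.
DELETE FROM page WHERE url LIKE '%#%';
```

If some dirty URLs have no clean counterpart yet, those would need to be INSERTed (or the dirty row itself UPDATEd) first.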
[SQL] 3 tables, slow count(*), order by Seq Scan in Query Plan
Hello,

I have 3 tables (2 tables + 1 lookup table that ties them), and running a straightforward aggregate count(*) query with a couple of joins takes about 10 seconds (and I need it to be sub-second or so). Also, I am wondering if this approach is scalable with my row counts and my hardware (below).

My slow query is this:

  SELECT keyword.name, count(*)
  FROM user_data, user_data_keyword, keyword
  WHERE (user_data.user_id = 1)
    AND (user_data.id = user_data_keyword.user_data_id)
    AND (user_data_keyword.keyword_id = keyword.id)
  GROUP BY keyword.name
  ORDER BY COUNT(*) DESC
  LIMIT 10;

QUERY PLAN from EXPLAIN ANALYZE:

   Limit  (cost=27820.00..27820.03 rows=10 width=114) (actual time=9971.322..9971.401 rows=10 loops=1)
     ->  Sort  (cost=27820.00..27822.50 rows=1000 width=114) (actual time=9971.312..9971.338 rows=10 loops=1)
           Sort Key: count(*)
           ->  HashAggregate  (cost=27767.67..27770.17 rows=1000 width=114) (actual time=9955.457..9963.051 rows=2005 loops=1)
                 ->  Hash Join  (cost=4459.64..27738.80 rows=5774 width=114) (actual time=1140.776..9919.852 rows=5516 loops=1)
                       Hash Cond: ("outer".keyword_id = "inner".id)
                       ->  Hash Join  (cost=4437.14..27600.81 rows=5774 width=4) (actual time=21.781..7804.329 rows=5516 loops=1)
                             Hash Cond: ("outer".user_data_id = "inner".id)
                             ->  Seq Scan on user_data_keyword  (cost=0.00..17332.29 rows=1154729 width=8) (actual time=2.717..3834.186 rows=1154729 loops=1)
                             ->  Hash  (cost=4433.94..4433.94 rows=1278 width=4) (actual time=18.862..18.862 rows=0 loops=1)
                                   ->  Index Scan using ix_user_data_user_id_data_id on user_data  (cost=0.00..4433.94 rows=1278 width=4) (actual time=0.234..13.454 rows=1149 loops=1)
                                         Index Cond: (user_id = 1)
                       ->  Hash  (cost=20.00..20.00 rows=1000 width=118) (actual time=1118.616..1118.616 rows=0 loops=1)
                             ->  Seq Scan on keyword  (cost=0.00..20.00 rows=1000 width=118) (actual time=1.140..609.577 rows=105440 loops=1)
   Total runtime: 9972.704 ms
  (15 rows)

Ouch :)

I'm trying to analyze the query plan (I'm not very good at it, obviously), and I see 2 Sequential Scans, one on the _big_ "user_data_keyword" table with about 60% of the total cost, and one on "keyword". I also see a HashAggregate with a high cost and a long actual time.

I'm not sure what to do, which indices I need to add, as the "user_data_keyword" and "keyword" tables already have PK-based btree indices:

  "user_data_keyword" has: ... btree (user_data_id, keyword_id)
  "keyword" has:           ... btree (id)

Here are my 3 tables:

  user_data (about 300K rows currently, will grow to 10M+)
  --------------------------------------------------------
  id (PK), user_id (FK)
  ... other columns ...

  user_data_keyword (lookup table - size always 4 x user_data)
  ------------------------------------------------------------
  user_data_id (FK)
  keyword_id (FK)
  PK(user_data_id, keyword_id)

  keyword (size about 10 x smaller than user_data_keyword)
  --------------------------------------------------------
  id (PK)
  name VARCHAR(64) NOT NULL UNIQUE
  add_date TIMESTAMP

Is there any way of speeding up my query?

Also, given the number of rows expected in those tables:

  user_data:         10M
  user_data_keyword: 40M
  keyword:            4M

any ideas how a query like this will scale when I hit those numbers? These particular query speed numbers are from 7.4.6 on a 1.5GHz laptop, but the production server is running PG 8.0.3 on a 3.0GHz P4 with 2 SATA disks in RAID1 config and 1GB RAM. How realistic is it to get sub-second responses on such hardware, given the above numbers?

Thanks,
Otis
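One detail worth noticing in the plan above: the seq scan on "keyword" is estimated at rows=1000 with cost=0.00..20.00 — the planner's defaults for a table it has no statistics for — while the actual row count is 105440. That suggests the table may never have been analyzed, so a cheap first step before adding indexes is:

```sql
-- Refresh planner statistics, then re-run the EXPLAIN ANALYZE above:
ANALYZE keyword;
ANALYZE user_data_keyword;
ANALYZE user_data;
```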
[SQL] Help: Function for splitting VARCHAR column and migrating its data to 2 new tables
I am restructuring my DB schema and need help migrating data from one column of an existing table to two new tables. I have some Java code that can do this for me, but it's very slow, and I am now hoping I can migrate this data with some clever SQL instead.

Here are my 3 tables:

  user_data (existing, old table)
  -------------------------------
  id (PK), user_id (FK)
  keywords VARCHAR(256)  -- this contains comma-separated keywords
                         -- e.g. "new york,san francisco, dallas, food"
                         -- also "keywords without commas are really just 1 keyword"
  add_date TIMESTAMP

So now I'm trying to migrate this "keywords" VARCHAR column to a more normalized schema:

  user_data_keyword (new lookup table to populate)
  ------------------------------------------------
  id (PK)  -- I may change PK to PK(user_data_id, keyword_id)
  user_data_id (FK)
  keyword_id (FK)

  keyword (new table to populate)
  -------------------------------
  id (PK)
  name VARCHAR(64) NOT NULL UNIQUE
  add_date TIMESTAMP

I just found http://www.postgresql.org/docs/current/static/functions-string.html , but if anyone could lend me a hand by getting me started with writing a function for this, I'd really appreciate it.

Thanks,
Otis
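A hedged, untested PL/pgSQL sketch of the migration (the sequence name and exact table definitions are assumptions based on the post; unnest() does not exist yet in 8.0, so the array from string_to_array() is walked with an index loop):

```sql
CREATE OR REPLACE FUNCTION migrate_keywords() RETURNS void AS $$
DECLARE
    rec RECORD;
    kw  TEXT[];
    k   TEXT;
    kid INTEGER;
    i   INTEGER;
BEGIN
    FOR rec IN SELECT id, keywords, add_date FROM user_data LOOP
        kw := string_to_array(rec.keywords, ',');
        FOR i IN 1 .. coalesce(array_upper(kw, 1), 0) LOOP
            k := trim(both ' ' FROM kw[i]);
            IF k <> '' THEN
                SELECT id INTO kid FROM keyword WHERE name = k;
                IF NOT FOUND THEN
                    INSERT INTO keyword (name, add_date)
                    VALUES (k, rec.add_date);
                    -- sequence name is a guess based on the usual default
                    kid := currval('keyword_id_seq');
                END IF;
                INSERT INTO user_data_keyword (user_data_id, keyword_id)
                VALUES (rec.id, kid);
            END IF;
        END LOOP;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT migrate_keywords();
```

Note that duplicate keywords within a single row would produce duplicate (user_data_id, keyword_id) pairs; with the composite-PK variant mentioned above, those inserts would need a guard.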
[SQL] Trimming the cost of ORDER BY in a simple query
Hello,

I have a simple query with a pretty high cost (EXPLAIN ...), and I'm wondering if I can somehow trim it. Query (shows the last 7 dates):

  => SELECT DISTINCT date_part('year', uu.add_date),
                     date_part('month', uu.add_date),
                     date_part('day', uu.add_date)
     FROM user_url uu
     WHERE uu.user_id=1
     ORDER BY date_part('year', uu.add_date) DESC,
              date_part('month', uu.add_date) DESC,
              date_part('day', uu.add_date) DESC
     LIMIT 7;

QUERY PLAN:

   Limit  (cost=4510.14..4522.93 rows=2 width=8) (actual time=19.924..20.160 rows=7 loops=1)
     ->  Unique  (cost=4510.14..4522.93 rows=2 width=8) (actual time=19.919..20.139 rows=7 loops=1)
           ->  Sort  (cost=4510.14..4513.34 rows=1279 width=8) (actual time=19.915..20.004 rows=78 loops=1)
                 Sort Key: date_part('year'::text, add_date), date_part('month'::text, add_date), date_part('day'::text, add_date)
                 ->  Index Scan using foo on user_url uu  (cost=0.00...14 rows=1279 width=8) (actual time=0.095..14.761 rows=1225 loops=1)
                       Index Cond: (user_id = 1)
   Total runtime: 20.313 ms
  (7 rows)

It looks like the cost is all in the ORDER BY, and if I remove the ORDER BY, the execution time goes from 20-90 ms to less than 1 ms. I do need the 7 most recent add_dates. Is there a more efficient way of grabbing them?

Thanks,
Otis
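If there is (or could be) an index on (user_id, add_date), one hedged alternative is to deduplicate by day with DISTINCT ON and sort on a single expression instead of three; whether it actually beats the current runtime is something EXPLAIN ANALYZE would have to confirm:

```sql
-- DISTINCT ON keeps the first row per truncated day; ordering by the
-- same expression DESC yields the 7 most recent distinct dates.
SELECT DISTINCT ON (date_trunc('day', add_date))
       date_part('year',  add_date) AS y,
       date_part('month', add_date) AS m,
       date_part('day',   add_date) AS d
FROM user_url
WHERE user_id = 1
ORDER BY date_trunc('day', add_date) DESC
LIMIT 7;
```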
[SQL] copy old column's values to new column
Hello,

I need some basic SQL help. I added a new column to an existing table, and now I need to copy values from one of the old columns to this new column. I need something like this:

  FOR pvId IN SELECT id FROM preference_value LOOP
    UPDATE preference_value
    SET display_value = (SELECT value FROM preference_value WHERE id=pvId)
    WHERE id=pvId;
  END LOOP;

I tried running this from psql, but it didn't work (I suspect FOR can only be used in functions, which I don't know how to write in PG, yet). Is there a simple way to do this?

Thanks,
Otis
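Since the old and new columns live in the same row, the loop sketched above collapses to a single statement — no function needed (column names taken from the post):

```sql
-- Copy each row's old value into its new column in one pass:
UPDATE preference_value SET display_value = value;
```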
Re: [SQL] Order by YYYY MM DD in reverse chrono order trouble
Hello,

Thank you for all your help, Stijn. date_part is a standard PG function. While not the most elegant, the DESC, DESC, DESC solution suggested the other day works okay for me, so I think I'll use that for now.

Thanks again!
Otis

--- Stijn Vanroye <[EMAIL PROTECTED]> wrote:
> Indeed, it seems that I get the same result for a similar query.
> I'm running version 7.3.4 on a rh 9 server.
>
> Also: is the function date_part a function you wrote yourself? I get
> an error stating that the function date_part("Unknown",date) is not
> recognized.
>
> It may not be a solution to the actual problem, but you could try
> this: save the date and the time in two separate fields. I use a
> similar construction for convenience.
>
> Regards,
>
> Stijn.
>
> > Hello,
> >
> > Hm, doesn't work for me:
> >
> > [EMAIL PROTECTED] mydb=> select distinct date_part('year', uu.add_date),
> > date_part('month', uu.add_date), date_part('day', uu.add_date)
> > from uus inner join ui on uus.user_id=ui.id inner join
> > uu on ui.id=uu.user_id where uus.subscriber_user_id=1
> > order by uu.add_date desc;
> >
> > ERROR: For SELECT DISTINCT, ORDER BY expressions must appear
> > in target list
> >
> > I have this version of PostgreSQL installed: postgresql-7.3.4-3.rhl9
> >
> > Thanks,
> > Otis
> >
> > --- Tom Lane <[EMAIL PROTECTED]> wrote:
> > > <[EMAIL PROTECTED]> writes:
> > > > I'd love to be able to do that, but I cannot just ORDER BY
> > > > uu.add_date, because I do not have uu.add_date in the SELECT
> > > > part of the statement.
> > >
> > > Sure you can. Back around SQL89 there was a restriction that
> > > ORDER BY values had to appear in the SELECT list as well, but no
> > > modern database has such a restriction anymore ...
> > >
> > > regards, tom lane
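[A quick check, for anyone hitting the same confusion as Stijn: date_part is built into PostgreSQL, and the "Unknown" in his error message is most likely the server's name for an untyped literal, not a missing function. The first argument must be a string in single quotes; double quotes denote identifiers:

```sql
-- date_part is a standard PostgreSQL function; extract the
-- year, month, and day fields from a timestamp literal.
SELECT date_part('year',  TIMESTAMP '2004-04-20'),  -- 2004
       date_part('month', TIMESTAMP '2004-04-20'),  -- 4
       date_part('day',   TIMESTAMP '2004-04-20');  -- 20
```
]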
Re: [SQL] Order by YYYY MM DD in reverse chrono order trouble
Hello,

Hm, doesn't work for me:

[EMAIL PROTECTED] mydb=> select distinct date_part('year', uu.add_date),
date_part('month', uu.add_date), date_part('day', uu.add_date)
from uus inner join ui on uus.user_id=ui.id inner join uu on
ui.id=uu.user_id where uus.subscriber_user_id=1
order by uu.add_date desc;

ERROR: For SELECT DISTINCT, ORDER BY expressions must appear in target list

I have this version of PostgreSQL installed: postgresql-7.3.4-3.rhl9

Thanks,
Otis

--- Tom Lane <[EMAIL PROTECTED]> wrote:
> <[EMAIL PROTECTED]> writes:
> > I'd love to be able to do that, but I cannot just ORDER BY
> > uu.add_date, because I do not have uu.add_date in the SELECT part
> > of the statement.
>
> Sure you can. Back around SQL89 there was a restriction that ORDER BY
> values had to appear in the SELECT list as well, but no modern
> database has such a restriction anymore ...
>
> regards, tom lane
Re: [SQL] Order by YYYY MM DD in reverse chrono order trouble
Hello,

I'd love to be able to do that, but I cannot just ORDER BY uu.add_date, because I do not have uu.add_date in the SELECT part of the statement. The reason I don't have it there is that I need distinct YYYY MM DD values back. Is there a trick I could use to make this more elegant?

Thanks,
Otis

--- Edmund Bacon <[EMAIL PROTECTED]> wrote:
> Is there some reason you can't do this:
>
> SELECT DISTINCT
>   date_part('year', uu.add_date), date_part('month', uu.add_date),
>   date_part('day', uu.add_date)
> FROM uus INNER JOIN ui ON uus.user_id=ui.id INNER JOIN uu ON
> ui.id=uu.user_id
> WHERE uus.x_id=1
> ORDER BY uu.add_date DESC;
>
> This might be faster, as you only have to sort on one field, and I
> think it should give the desired results.
>
> [EMAIL PROTECTED] wrote:
> >Hello,
> >
> >I am trying to select distinct dates and order them in reverse
> >chronological order. Although the column type is TIMESTAMP, in this
> >case I want only YYYY, MM, and DD back.
> >
> >I am using the following query, but it's not returning dates back in
> >reverse chronological order:
> >
> >SELECT DISTINCT
> >  date_part('year', uu.add_date), date_part('month', uu.add_date),
> >  date_part('day', uu.add_date)
> >FROM uus INNER JOIN ui ON uus.user_id=ui.id INNER JOIN uu ON
> >ui.id=uu.user_id
> >WHERE uus.x_id=1
> >ORDER BY
> >  date_part('year', uu.add_date), date_part('month', uu.add_date),
> >  date_part('day', uu.add_date) DESC;
> >
> >This is what the above query returns:
> >
> > date_part | date_part | date_part
> >-----------+-----------+-----------
> >      2004 |         2 |         6
> >      2004 |         4 |        20
> >(2 rows)
> >
> >I am trying to get back something like this:
> >2004 4 20
> >2004 4 19
> >2004 2 6
> >...
> >
> >My query is obviously wrong, but I can't see the mistake. I was
> >wondering if anyone else can see it. Just changing DESC to ASC did
> >not work.
> >
> >Thank you!
> >Otis

--
Edmund Bacon <[EMAIL PROTECTED]>
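[One trick that keeps the distinct year/month/day output and still sorts it newest-first (a sketch, reusing the table and column names from the thread) is to do the DISTINCT in a subselect and sort the result in an outer query:

```sql
-- Sketch: compute the distinct (year, month, day) tuples first,
-- then sort them in the outer query with an explicit DESC per key.
SELECT * FROM (
    SELECT DISTINCT
           date_part('year',  uu.add_date) AS yyyy,
           date_part('month', uu.add_date) AS mm,
           date_part('day',   uu.add_date) AS dd
    FROM uus
    INNER JOIN ui ON uus.user_id = ui.id
    INNER JOIN uu ON ui.id = uu.user_id
    WHERE uus.x_id = 1
) AS d
ORDER BY yyyy DESC, mm DESC, dd DESC;
```

Since the outer ORDER BY references only columns of the subselect, the "ORDER BY expressions must appear in target list" restriction no longer applies.]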
Re: [SQL] Order by YYYY MM DD in reverse chrono order trouble
Thank you, and Denis ([EMAIL PROTECTED]) - that was it. I needed an explicit DESC for each ORDER BY criterion.

Otis

--- Stijn Vanroye <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > I am trying to select distinct dates and order them in reverse
> > chronological order. Although the column type is TIMESTAMP, in this
> > case I want only YYYY, MM, and DD back.
> >
> > I am using the following query, but it's not returning dates back
> > in reverse chronological order:
> >
> > SELECT DISTINCT
> >   date_part('year', uu.add_date), date_part('month', uu.add_date),
> >   date_part('day', uu.add_date)
> > FROM uus INNER JOIN ui ON uus.user_id=ui.id INNER JOIN uu ON
> > ui.id=uu.user_id
> > WHERE uus.x_id=1
> > ORDER BY
> >   date_part('year', uu.add_date), date_part('month', uu.add_date),
> >   date_part('day', uu.add_date) DESC;
> >
> > This is what the above query returns:
> >
> >  date_part | date_part | date_part
> > -----------+-----------+-----------
> >       2004 |         2 |         6
> >       2004 |         4 |        20
> > (2 rows)
> >
> > I am trying to get back something like this:
> > 2004 4 20
> > 2004 4 19
> > 2004 2 6
> > ...
> >
> > My query is obviously wrong, but I can't see the mistake. I was
> > wondering if anyone else can see it. Just changing DESC to ASC did
> > not work.
> >
> > Thank you!
> > Otis
>
> What you could try in your ORDER BY clause is the following:
>
> ORDER BY
>   date_part('year', uu.add_date) DESC,
>   date_part('month', uu.add_date) DESC,
>   date_part('day', uu.add_date) DESC;
>
> That way you are sure each of the fields is sorted DESC. If you don't
> specify a direction in your ORDER BY clause, Postgres takes ASC as
> the default, so I think it does "ASC, ASC, DESC" instead. I'm not
> sure whether it applies the DESC to all specified fields in the ORDER
> BY if you declare it only once.
> Regards,
>
> Stijn Vanroye
[SQL] Order by YYYY MM DD in reverse chrono order trouble
Hello,

I am trying to select distinct dates and order them in reverse chronological order. Although the column type is TIMESTAMP, in this case I want only YYYY, MM, and DD back.

I am using the following query, but it's not returning dates back in reverse chronological order:

SELECT DISTINCT
  date_part('year', uu.add_date), date_part('month', uu.add_date),
  date_part('day', uu.add_date)
FROM uus INNER JOIN ui ON uus.user_id=ui.id INNER JOIN uu ON
ui.id=uu.user_id
WHERE uus.x_id=1
ORDER BY
  date_part('year', uu.add_date), date_part('month', uu.add_date),
  date_part('day', uu.add_date) DESC;

This is what the above query returns:

 date_part | date_part | date_part
-----------+-----------+-----------
      2004 |         2 |         6
      2004 |         4 |        20
(2 rows)

I am trying to get back something like this:
2004 4 20
2004 4 19
2004 2 6
...

My query is obviously wrong, but I can't see the mistake. I was wondering if anyone else can see it. Just changing DESC to ASC did not work.

Thank you!
Otis