Re: [SQL] Extracting hostname from URI column

2007-09-11 Thread ogjunk-pgjedan
Ah, I figured out what to look for and found my uniq -c solution:

select substring( href from '.*://([^/]*)' ) as hostname,
       count( substring( href from '.*://([^/]*)' ) )
from url
where id<10
group by hostname
order by count desc;
 hostname                 | count
--------------------------+-------
 texturizer.net   | 2
 www.google.com   | 2
 dictionary.reference.com | 1
 www.mozillazine.org  | 1
 devedge.netscape.com | 1
 groups.google.com| 1
 forums.mozillazine.org   | 1
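(For the record, the repeated substring() expression can be avoided; a sketch, relying on PostgreSQL's ability to GROUP BY an output-column alias and on count(*) counting rows per group:)

```sql
-- Equivalent query without repeating the substring() call:
-- grouping on the alias and counting rows gives the same result.
select substring( href from '.*://([^/]*)' ) as hostname,
       count(*) as cnt
from url
where id < 10
group by hostname
order by cnt desc;
```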

Thanks for the quick help with substring func, people!

Otis


- Original Message 
From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
To: pgsql-sql@postgresql.org
Sent: Tuesday, September 11, 2007 11:16:15 PM
Subject: Re: [SQL] Extracting hostname from URI column

Hi,

Thanks, perfect! (though I'll have to look into the regex warning):

=> select substring( href from '.*://\([^/]*)' ) as hostname from url where 
id<10;
WARNING:  nonstandard use of escape in a string literal at character 29
HINT:  Use the escape string syntax for escapes, e.g., E'\r\n'.
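(The warning comes from the backslash in the pattern: '\(' is a nonstandard escape in a plain string literal. Since '(' needs no escaping in a POSIX regex, simply dropping the backslash silences the warning without changing the match; a sketch:)

```sql
-- The backslash before '(' is unnecessary and triggers the
-- "nonstandard use of escape" warning; the capture group works without it.
select substring( href from '.*://([^/]*)' ) as hostname
from url
where id < 10;
```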

So now I have this:
 hostname
--------------------------
 texturizer.net
 texturizer.net
 forums.mozillazine.org
 www.mozillazine.org
 devedge.netscape.com
 www.google.com
 groups.google.com
 www.google.com
 dictionary.reference.com

And what I'd like is something that would give me the counts for the number of 
occurrences of each unique hostname.  Something much like `uniq -c'.  Can 
anyone tell me how that's done or where I should look for info? (I'm not sure 
what to look for, that's the problem).

Thanks,
Otis


- Original Message 
From: chester c young <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: sql pgsql 
Sent: Tuesday, September 11, 2007 8:42:46 PM
Subject: Re: [SQL] Extracting hostname from URI column

> I'm trying to use substr() and position() functions to extract the
> full host name (and later a domain) from a column that holds URLs.

substring( href from '.*://\([^/]*)' );



   





---(end of broadcast)---
TIP 6: explain analyze is your friend







[SQL] Extracting hostname from URI column

2007-09-11 Thread ogjunk-pgjedan
Hi,

I'm trying to use substr() and position() functions to extract the full host 
name (and later a domain) from a column that holds URLs.
This is what I'm trying, but it clearly doesn't do the job.

=> select substr(href, position('://' in href)+3, position('://' in 
href)+3+position('/' in href)), href from url where id <10;
 substr |   href
----------------+----------------------------------------------------------
 texturizer.net | http://texturizer.net/firebird/extensions/
 texturizer.net | http://texturizer.net/firebird/themes/
 forums.mozilla | http://forums.mozillazine.org/index.php?c=4
 www.mozillazin | http://www.mozillazine.org/
 devedge.netsca | http://devedge.netscape.com/viewsource/2002/bookmarks/
 www.google.com | http://www.google.com/search?&q=%s
 groups.google. | http://groups.google.com/groups?scoring=d&q=%s
 www.google.com | http://www.google.com/search?q=%s&btnI=I'm+Feeling+Lucky
 dictionary.ref | http://dictionary.reference.com/search?q=%s

The 3rd param to the substr function is clearly wrong.  Is it even doable 
without writing a procedure?

Finally, is this the fastest way to get this data, or is there a regex-based 
function that might be faster?

Thanks,
Otis



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [SQL] Help with a seq scan on multi-million row table

2006-05-11 Thread ogjunk-pgjedan
Hello Andrew, Markus, and Tom - thanks for all the help!  You've just helped a 
large Simpy community! :) I'll try to post some performance charts to 
http://blog.simpy.com/ shortly.  In short, this immediately dropped the load 
from 2-3-4-5-6-7+ to circa 0.25.

Thanks!
Otis

- Original Message 
From: Andrew Sullivan <[EMAIL PROTECTED]>
To: pgsql-sql@postgresql.org
Sent: Thursday, May 11, 2006 1:18:08 PM
Subject: Re: [SQL] Help with a seq scan on multi-million row table

On Thu, May 11, 2006 at 10:09:44AM -0700, [EMAIL PROTECTED] wrote:
> Hi Markus & Tom,
> 
> Higher statistics for this column... hm, I'd love to try changing
> it to see how that changes things, but I'm afraid I don't know how
> to do that.  How can I change the statistics target value for this
> column?
> 
> Ah, I think I found the place:

No.  Just ALTER TABLE [name] ALTER [column] SET STATISTICS.  See
http://www.postgresql.org/docs/8.1/interactive/sql-altertable.html
for more.  You'll need to ANALYSE afterwards.
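Concretely, for the table and column in this thread that would be something like (the target of 100 follows Markus's earlier suggestion; pick whatever value you want to experiment with):

```sql
-- Raise the statistics target for the misestimated column,
-- then re-collect statistics so the planner sees the change.
ALTER TABLE user_url_tag ALTER user_url_id SET STATISTICS 100;
ANALYZE user_url_tag;
```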

A
-- 
Andrew Sullivan  | [EMAIL PROTECTED]





---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [SQL] Help with a seq scan on multi-million row table

2006-05-11 Thread ogjunk-pgjedan
Hi Markus & Tom,

Higher statistics for this column... hm, I'd love to try changing it to see 
how that changes things, but I'm afraid I don't know how to do that.  How can I 
change the statistics target value for this column?

Ah, I think I found the place:

  => select * from pg_attribute where attname='user_url_id';
 attrelid |   attname   | atttypid | attstattarget | attlen | attnum | attndims 
| attcacheoff | atttypmod | attbyval | attstorage | attalign | attnotnull | 
atthasdef | attisdropped | attislocal | attinhcount
--+-+--+---+++--+-+---+--++--++---+--++-
  6124839 | user_url_id |   23 |-1 |  4 |  1 |0 
|  -1 |-1 | t| p  | i| f  | f   
  | f| t  |   0
  1646081 | user_url_id |   23 |-1 |  4 |  2 |0 
|  -1 |-1 | t| p  | i| f  | f   
  | f| t  |   0
 10048109 | user_url_id |   23 |-1 |  4 |  3 |0 
|  -1 |-1 | t| p  | i| f  | f   
  | f| t  |   0
 10048123 | user_url_id |   23 |-1 |  4 |  2 |0 
|  -1 |-1 | t| p  | i| f  | f   
  | f| t  |   0

Hm, 4 rows.  I need to change the value of the 'attstattarget' column, but for 
which of these rows?  Only attrelid is different.
I tried looking at pg_class, but didn't find anything with the above 
attrelid's.  I used:

  => select * from pg_class where relname like 'user_url%';
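(It turns out attrelid can be resolved by joining pg_class on its oid rather than on relname; a sketch, showing which relation each pg_attribute row belongs to:)

```sql
-- attrelid is the oid of the owning relation in pg_class, so this
-- identifies the table behind each of the four rows above.
-- Note: ALTER TABLE ... SET STATISTICS is the supported way to change
-- the target, rather than updating pg_attribute directly.
SELECT a.attrelid, c.relname, a.attname, a.attstattarget
FROM pg_attribute a
JOIN pg_class c ON c.oid = a.attrelid
WHERE a.attname = 'user_url_id';
```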

Tom: you asked about distinct values.  pg_stats shows circa 60K distinct values, 
but the real number is:
select count(distinct user_url_id) from user_url_tag;
  count
-
 1505933

This number grows daily by... not sure how much, probably 5k a day currently.

Thanks,
Otis

- Original Message 
From: Markus Schaber <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: pgsql-sql@postgresql.org
Sent: Thursday, May 11, 2006 6:33:55 AM
Subject: Re: [SQL] Help with a seq scan on multi-million row table

Hi, Otis,

[EMAIL PROTECTED] wrote:

> I'm not sure which numbers you are referring to when you said the estimate is 
> off, but here are some numbers:
>   The whole table has 6-7 M rows.
>   That query matches about 2500 rows.
> 
> If there are other things I can play with and help narrow this down, please 
> let me know.

Did you try to set higher statistics targets for this columns?

For experimenting, I'd try to set it to 100 or even higher, then ANALYZE
the table, and then retest the query.

HTH,
Markus

-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org




---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [SQL] Help with a seq scan on multi-million row table

2006-05-10 Thread ogjunk-pgjedan
Not sure if I'm showing you what you asked for, but here it is:

select * from pg_stats  where tablename='user_url_tag' and 
attname='user_url_id';
 schemaname |  tablename   |   attname   | null_frac | avg_width | n_distinct | 
   most_common_vals   | 
most_common_freqs   |   
  histogram_bounds 
| correlation
+--+-+---+---++++--+-
 public | user_url_tag | user_url_id | 0 | 4 |  60825 | 
{458321,1485346,16304,68027,125417,153465,182503,201175,202973,218423} | 
{0.0013,0.001,0.00067,0.00067,0.00067,0.00067,0.00067,0.00067,0.00067,0.00067}
 | 
{195,195993,325311,480323,647778,782598,1014527,1201726,1424822,1614712,1853719}
 |0.795521

You asked if the table has been analyzed recently.
I think so - I run ANALYZE on the whole DB every night, like this:

$ psql -U me -c "ANALYZE;" mydb

For good measure, I just analyzed the table now: $ psql -U me -c "ANALYZE 
user_url_tag;" mydb
Then I set enable_hashjoin back to ON and re-ran the EXPLAIN ANALYZE.
I still get the sequential scan, even after analyzing the table :(

I'm not sure which numbers you are referring to when you said the estimate is 
off, but here are some numbers:
  The whole table has 6-7 M rows.
  That query matches about 2500 rows.

If there are other things I can play with and help narrow this down, please let 
me know.

Thanks,
Otis


- Original Message 
From: Tom Lane <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: pgsql-sql@postgresql.org
Sent: Wednesday, May 10, 2006 9:53:49 PM
Subject: Re: [SQL] Help with a seq scan on multi-million row table 

<[EMAIL PROTECTED]> writes:
> Aha!  set enable_hashjoin=off did the trick.

>->  Index Scan using ix_user_url_tag_user_url_id on 
> user_url_tag userurltag0_  (cost=0.00..157.34 rows=103 width=14) (actual 
> time=1.223..1.281 rows=5 loops=1666)
>  Index Cond: (userurltag0_.user_url_id = "outer".id)

This seems to be the problem right here: the estimate of matching rows
is off by a factor of 20, and that inflates the overall cost estimate
for this plan about the same, causing the planner to think the other way
is cheaper.

What does the pg_stats row for user_url_tag.user_url_id contain?
Have you analyzed that table recently?

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend






Re: [SQL] Help with a seq scan on multi-million row table

2006-05-10 Thread ogjunk-pgjedan
Aha!  set enable_hashjoin=off did the trick.
The PG version is: 8.0.3

NB: I removed that redundant "DISTINCT" after the SELECT.

EXPLAIN ANALYZE select userurltag0_.tag as x0_0_, COUNT(*) as x1_0_ from 
user_url_tag userurltag0_, user_url userurl1_ where (((userurl1_.user_id=1 
))AND((userurltag0_.user_url_id=userurl1_.id ))) group by  userurltag0_.tag 
order by  count(*)DESC;

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Sort  (cost=155766.79..155774.81 rows=3207 width=10) (actual 
time=2387.756..2396.578 rows=2546 loops=1)
   Sort Key: count(*)
   ->  HashAggregate  (cost=155572.02..155580.03 rows=3207 width=10) (actual 
time=2365.643..2376.626 rows=2546 loops=1)
 ->  Nested Loop  (cost=0.00..12.68 rows=3867 width=10) (actual 
time=0.135...028 rows=8544 loops=1)
   ->  Index Scan using ix_user_url_user_id_url_id on user_url 
userurl1_  (cost=0.00..2798.12 rows=963 width=4) (actual time=0.067..9.744 
rows=1666 loops=1)
 Index Cond: (user_id = 1)
   ->  Index Scan using ix_user_url_tag_user_url_id on user_url_tag 
userurltag0_  (cost=0.00..157.34 rows=103 width=14) (actual time=1.223..1.281 
rows=5 loops=1666)
 Index Cond: (userurltag0_.user_url_id = "outer".id)
 Total runtime: 2405.691 ms
(9 rows)


Are you still interested in the other, "second-choice" join types?  If you are, 
please tell me what join types those are; this is a bit beyond me. :(

Is there a way to force PG to use the index automatically?  This query is 
executed from something called Hibernate, and I'm not sure if that will let me 
set enable_hashjoin=off through its API...
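(One workaround, if issuing SET per-query through Hibernate is awkward, is to change the setting for the database role the application connects as; a sketch, assuming a hypothetical role named "webapp" -- though disabling a planner method globally is a blunt instrument and fixing the row estimate is the better long-term fix:)

```sql
-- Applies to every new session opened by this role, so the
-- application code never needs to issue SET itself.
ALTER USER webapp SET enable_hashjoin = off;
```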

Thanks,
Otis


- Original Message 
From: Tom Lane <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: pgsql-sql@postgresql.org
Sent: Wednesday, May 10, 2006 8:27:01 PM
Subject: Re: [SQL] Help with a seq scan on multi-million row table 

<[EMAIL PROTECTED]> writes:
>->  Hash Join  (cost=2797.65..140758.50 rows=3790 width=10) 
> (actual time=248.530..380635.132 rows=8544 loops=1)  
>  Hash Cond: ("outer".user_url_id = "inner".id)  
>  ->  Seq Scan on user_url_tag userurltag0_  
> (cost=0.00..106650.30 rows=6254530 width=14) (actual time=0.017..212256.630 
> rows=6259553 loops=1)  
>  ->  Hash  (cost=2795.24..2795.24 rows=962 width=4) 
> (actual time=199.840..199.840 rows=0 loops=1)  
>->  Index Scan using ix_user_url_user_id_url_id on 
> user_url userurl1_  (cost=0.00..2795.24 rows=962 width=4) (actual 
> time=0.048..193.707 rows=1666 loops=1)  
>  Index Cond: (user_id = 1)  

Hm, I'm not sure why it's choosing that join plan.  A nestloop indexscan
wouldn't be terribly cheap, but just counting on my fingers it seems
like it ought to come in at less than 10 cost units.  What do you
get if you set enable_hashjoin off?  (Then try disabling its
second-choice join type too --- I'm interested to see EXPLAIN ANALYZE
output for all three join types.)

What PG version is this exactly?

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend






Re: [SQL] Help with a seq scan on multi-million row table

2006-05-10 Thread ogjunk-pgjedan
Hi,

Thanks Bruno.  That was indeed a redundant DISTINCT.  It did reduce the amount 
of work, but as you said it doesn't get rid of the sequential scan, which is 
the real problem with this query.

Otis

- Original Message 
From: Bruno Wolff III <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: pgsql-sql@postgresql.org
Sent: Wednesday, May 10, 2006 3:23:29 PM
Subject: Re: [SQL] Help with a seq scan on multi-million row table

On Wed, May 10, 2006 at 13:13:59 -0500,
  [EMAIL PROTECTED] wrote:
>   Hello,
> 
> I have a little 2-table JOIN, GROUP BY, ORDER BY query that does a sequential 
> scan on a multi-million row table.  I _thought_ I had all the appropriate 
> indices, but apparently I do not.  I was wondering if anyone can spot a way I 
> can speed up this query.
> The query currently takes... *gulp*: 381119.201 ms :(
>   
> There are only 2 tables in the game: user_url and user_url_tag.  The latter 
> has FKs pointing to the former.  The sequential scan happens on the latter - 
> user_url_tag:
> 
> EXPLAIN ANALYZE select DISTINCT userurltag0_.tag as x0_0_, COUNT(*) as x1_0_ 
> from user_url_tag userurltag0_, user_url userurl1_ WHERE 
> (((userurl1_.user_id=1 )) AND ((userurltag0_.user_url_id=userurl1_.id ))) 
> GROUP BY userurltag0_.tag ORDER BY  count(*) DESC;  

While this isn't a big issue, it looks like DISTINCT is redundant in your
query and seems to be adding some extra work.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq






[SQL] Help with a seq scan on multi-million row table

2006-05-10 Thread ogjunk-pgjedan
  Hello,

I have a little 2-table JOIN, GROUP BY, ORDER BY query that does a sequential 
scan on a multi-million row table.  I _thought_ I had all the appropriate 
indices, but apparently I do not.  I was wondering if anyone can spot a way I 
can speed up this query.
The query currently takes... *gulp*: 381119.201 ms :(
  
There are only 2 tables in the game: user_url and user_url_tag.  The latter has 
FKs pointing to the former.  The sequential scan happens on the latter - 
user_url_tag:

EXPLAIN ANALYZE select DISTINCT userurltag0_.tag as x0_0_, COUNT(*) as x1_0_ 
from user_url_tag userurltag0_, user_url userurl1_ WHERE (((userurl1_.user_id=1 
)) AND ((userurltag0_.user_url_id=userurl1_.id ))) GROUP BY userurltag0_.tag 
ORDER BY  count(*) DESC;  
  

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Unique  (cost=140972.22..140996.28 rows=3207 width=10) (actual 
time=381082.868..381110.094 rows=2546 loops=1)  
   ->  Sort  (cost=140972.22..140980.24 rows=3207 width=10) (actual 
time=381082.858..381091.733 rows=2546 loops=1)  
 Sort Key: count(*), userurltag0_.tag  
 ->  HashAggregate  (cost=140777.45..140785.46 rows=3207 width=10) 
(actual time=381032.844..381064.068 rows=2546 loops=1)  
   ->  Hash Join  (cost=2797.65..140758.50 rows=3790 width=10) 
(actual time=248.530..380635.132 rows=8544 loops=1)  
 Hash Cond: ("outer".user_url_id = "inner".id)  
 ->  Seq Scan on user_url_tag userurltag0_  
(cost=0.00..106650.30 rows=6254530 width=14) (actual time=0.017..212256.630 
rows=6259553 loops=1)  
 ->  Hash  (cost=2795.24..2795.24 rows=962 width=4) (actual 
time=199.840..199.840 rows=0 loops=1)  
   ->  Index Scan using ix_user_url_user_id_url_id on 
user_url userurl1_  (cost=0.00..2795.24 rows=962 width=4) (actual 
time=0.048..193.707 rows=1666 loops=1)  
 Index Cond: (user_id = 1)  
 Total runtime: 381119.201 ms  
(11 rows)  
  
  
  This is what the two tables look like (extra colums removed):

Table "public.user_url_tag"
   Column    |         Type          |                           Modifiers
-------------+-----------------------+---------------------------------------------------------------
 id          | integer               | not null default nextval('public.user_url_tag_id_seq'::text)
 user_url_id | integer               |
 tag         | character varying(64) |
Indexes:
    "pk_user_url_tag_id" PRIMARY KEY, btree (id)
    "ix_user_url_tag_tag" btree (tag)
    "ix_user_url_tag_user_url_id" btree (user_url_id)
Foreign-key constraints:
    "fk_user_url_tag_user_url_id" FOREIGN KEY (user_url_id) REFERENCES user_url(id)
  
  Table "public.user_url"
  Column  |  Type   |                        Modifiers
----------+---------+----------------------------------------------------------
 id       | integer | not null default nextval('public.user_url_id_seq'::text)
 user_id  | integer |
 url_id   | integer |
Indexes:
    "pk_user_url_id" PRIMARY KEY, btree (id)
    "ix_user_url_url_id_user_id" UNIQUE, btree (url_id, user_id)
    "ix_user_url_user_id_url_id" UNIQUE, btree (user_id, url_id)
  
  
  Does anyone see a way to speed up this s-l-o-w query?
I cache DB results, but I'd love to get rid of that sequential scan.

Thanks,
Otis




---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[SQL] Multi-column index not used, new flipped column index is

2006-05-10 Thread ogjunk-pgjedan
Hi,

I'm using PG 8.0.3 and recently spotted a query that was not using a 
multi-column index I had created.
The index looks like:

  CREATE INDEX . ON FooTable(fkColumnOne, fkColumnTwo);

The query that was not using the index was using:

  SELECT  a bunch of columns and joins  WHERE FooTable.fkColumnTwo=1000;

So I added another index where the indexed columns are flipped:

  CREATE INDEX . ON FooTable(fkColumnTwo, fkColumnOne);
 
Now the query started to use that index -- good!
But now I have 2 indices that are nearly the same, and that means overhead 
during INSERTs/DELETEs. :(

Is there a way to get this to use that first index, so I don't have to have 
this additional index?
If I had PG 8.1.3, would PG know how to use that first index?  I seem to recall 
something about this somewhere... but I don't see it at 
http://www.postgresql.org/docs/whatsnew .

Thanks,
Otis
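(A follow-up thought: if the second index exists only to serve the fkColumnTwo predicate, a single-column index is a smaller alternative; a sketch, with a made-up index name:)

```sql
-- A single-column index can satisfy WHERE fkColumnTwo = 1000 without
-- duplicating fkColumnOne, so it costs less on INSERT/DELETE than a
-- second two-column index.
CREATE INDEX ix_footable_fkcolumntwo ON FooTable (fkColumnTwo);
```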




---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [SQL] Referential integrity broken (8.0.3), sub-select help

2006-03-21 Thread ogjunk-pgjedan
Hi,

I mistakenly swapped the tables in my email.  Here they are, corrected:

 Table "url":
 id  SERIAL
  CONSTRAINT pk_url_id PRIMARY KEY
  
Table "bookmark":
 url_id  INTEGER
   CONSTRAINT fk_url_id REFERENCES url(id)
 
I see my questions got chopped off from this email below, so let me restate 
them:


Problem #1: Strange that PG allowed this to happen.  Maybe my DDL above allows 
this to happen and needs to be tightened?  I thought the above would ensure 
referential integrity, but maybe I need to specify something else?

Problem #2: I'd like to find all rows in B that point to non-existent rows in 
U.  I can do it with the following sub-select, I believe, but it's rather 
inefficient (EXPLAIN shows both tables would be sequentially scanned):

  SELECT * FROM bookmark WHERE id NOT IN (SELECT b.id FROM bookmark b, url u 
WHERE b.url_id=u.id);

Is there a more efficient way to get the rows from "bookmark"?
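(An anti-join avoids scanning "bookmark" twice; one common formulation, sketched with the corrected column names above:)

```sql
-- Rows in bookmark whose url_id has no matching row in url.
-- NOT EXISTS lets the planner probe an index on url(id) instead of
-- materializing the whole join.
SELECT b.*
FROM bookmark b
WHERE NOT EXISTS (SELECT 1 FROM url u WHERE u.id = b.url_id);
```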

Thanks,
Otis


- Original Message 
From: Stephan Szabo <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: pgsql-sql@postgresql.org
Sent: Tuesday, March 21, 2006 10:08:38 AM
Subject: Re: [SQL] Referential integrity broken (8.0.3), sub-select help

On Tue, 21 Mar 2006 [EMAIL PROTECTED] wrote:

> I've got 2 tables, "url" (U), and "bookmark" (B), with "bookmark" pointing to 
> "url" via FK.

That's not what your schema below has.  Your fragment below has URL
pointing to bookmark.

> Somehow I ended up with some rows in B referencing non-existent rows in U.

With the below, this is entirely possible, since you're only guaranteeing
that URLs have valid bookmarks not the other way around.  Are you sure the
below is actually what you have?

> This sounds super strange and dangerous to me, and it's not clear to me 
> how/why PG let this happen.
> I'm using 8.0.3.
>
> Here are the table references I just mentioned:
>
> Table "bookmark":
>  id  SERIAL
>  CONSTRAINT pk_bookmark_id PRIMARY KEY
>
>  Table "url":
> url_id  INTEGER
>  CONSTRAINT fk_bookmark_id REFERENCES bookmark(id)





---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[SQL] Referential integrity broken (8.0.3), sub-select help

2006-03-21 Thread ogjunk-pgjedan
Hello,

I've got 2 tables, "url" (U), and "bookmark" (B), with "bookmark" pointing to 
"url" via FK.
Somehow I ended up with some rows in B referencing non-existent rows in U.
This sounds super strange and dangerous to me, and it's not clear to me how/why 
PG let this happen.
I'm using 8.0.3.

Here are the table references I just mentioned:

Table "bookmark":
 id  SERIAL
 CONSTRAINT pk_bookmark_id PRIMARY KEY
 
 Table "url":
url_id  INTEGER
 CONSTRAINT fk_bookmark_id REFERENCES bookmark(id)


Problem #1: Strange that PG allowed this to happen.  Maybe my DDL above allows 
this to happen and needs to be tightened?  I thought the above would ensure 
referential integrity, but maybe I need to specify something else?

Problem #2: I'd like to find all rows in B that point to non-existent rows in 
U.  I can do it with the following sub-select, I believe, but it's rather 
inefficient (EXPLAIN shows both tables would be sequentially scanned):

  SELECT * FROM bookmark WHERE id NOT IN (SELECT b.id FROM bookmark b, url u 
WHERE b.url_id=u.id);

Is there a more efficient way to get the rows from "bookmark"?

Thanks,
Otis


---(end of broadcast)---
TIP 6: explain analyze is your friend


[SQL] READ COMMITTE without START TRANSACTION?

2006-03-10 Thread ogjunk-pgjedan
Hello,

I need to run some SELECT queries that take a while (10+ minutes) to complete, 
and I'm wondering about the isolation about the results I get.  More precisely, 
while my SELECT is running, the DB is being updated by another application, and 
I am wondering which, if any, data changes my SELECT will see.

Example:
If I start my SELECT at 10:00, and it finishes at 10:10, will my results 
include data that was inserted between 10:00 and 10:10?
Similarly, will my result include data that was updated between 10:00 and 10:10?
The same question for data that was deleted during that period.

If it matters, my SELECT runs from psql client, while concurrent inserts, 
updates, and deletes are executed from a separate application (webapp).

For my purposes in this case I need the SELECT to get the results that 
represent data right at the beginning of the query - a snapshot.  I read this: 
http://www.postgresql.org/docs/8.1/static/transaction-iso.html  and it looks 
like this is the default PG behaviour (READ COMMITTED).

Question:
If I do not explicitly START TRANSACTION before the SELECT, will this READ 
COMMITTED XA behaviour still be in effect?

Thanks,
Otis
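(Answering the question for the archive: yes. In READ COMMITTED each statement sees a snapshot taken when that statement starts, so even without an explicit transaction a single long SELECT will not see rows inserted, updated, or deleted after it began. An explicit transaction only matters when you need one snapshot to span several statements; a sketch, noting that in the 8.x line SERIALIZABLE provided snapshot isolation:)

```sql
-- Both SELECTs see the same snapshot, taken at the first statement
-- of the transaction; concurrent commits are invisible until COMMIT.
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM bookmark;
SELECT count(*) FROM bookmark;  -- same snapshot as the first SELECT
COMMIT;
```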




---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


[SQL] Need help: Find dirty rows, Update, Delete SQL

2006-02-18 Thread ogjunk-pgjedan
Hello,

I need a bit of help with some SQL.
I have two tables, call them Page and Bookmark.
Each row in Page can have many Bookmarks pointing to it, and
they are joined via a FK (Page.id = Bookmark.page_id).

Page has a 'url' column: Page.url, which has a unique index on it.

My Page.url column got a little dirty, and I need to clean it up,
and that's what I need help with.

Here is an example of dirtiness:

Page:

id=1 url = 'http://example.com/'
id=2 url = 'http://example.com/#' -- dirty
id=3 url = 'http://example.com/#foo'  -- dirty

The last two rows are dirty.  Normally I normalize URLs before
inserting them, but these got in, and now I need to clean them.

The problem is that rows in Bookmark table may point to dirty
rows in Page, so I can't just remove the dirty rows, and I can't
just update 'url' column in Page to 'http://example.com/',
because that column is unique.

Is there some fancy SQL that I can use them to find the dirty
rows in page (... where url like '%#%') and then find rows in
Bookmark table that point to them, then point those rows to
good rows in Page (e.g. id=1 row above), and finally remove the
dirty rows from Page?

Any help would be greatly appreciated.

I'm using Pg 8.0.3

Thanks,
Otis
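(One possible shape for the cleanup, assuming every dirty URL's clean counterpart -- the URL with the '#...' fragment stripped -- already exists in Page; a sketch, untested, best run inside a transaction so the FK catches any dirty row without a clean counterpart before the DELETE commits:)

```sql
-- 1) Repoint bookmarks from each dirty page to the clean page whose
--    url equals the dirty url with the '#...' fragment removed.
UPDATE bookmark
SET page_id = clean.id
FROM page dirty, page clean
WHERE bookmark.page_id = dirty.id
  AND dirty.url LIKE '%#%'
  AND clean.url = substring(dirty.url from '^[^#]*');

-- 2) Remove the now-unreferenced dirty rows; the FK from bookmark
--    will raise an error if any dirty row is still referenced.
DELETE FROM page WHERE url LIKE '%#%';
```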




---(end of broadcast)---
TIP 6: explain analyze is your friend


[SQL] 3 tables, slow count(*), order by Seq Scan in Query Plan

2005-05-26 Thread ogjunk-pgjedan
Hello,

I have 3 tables (2 tables + 1 lookup table that ties them) and running
a straight-forward aggregate count(*) query with a couple of joins
takes about 10 seconds (and I need it to be sub-second or so).
Also, I am wondering if this approach is scalable with my row-counts
and my hardware (below).

My slow query is this:
--
SELECT keyword.name, count(*)
FROM user_data, user_data_keyword, keyword
WHERE (user_data.user_id = 1)
  AND (user_data.id = user_data_keyword.user_data_id)
  AND (user_data_keyword.keyword_id = keyword.id)
  GROUP BY keyword.name
  ORDER BY COUNT(*) DESC LIMIT 10;


QUERY PLAN from EXPLAIN ANALYZE:

 Limit  (cost=27820.00..27820.03 rows=10 width=114) (actual
time=9971.322..9971.401 rows=10 loops=1)
   ->  Sort  (cost=27820.00..27822.50 rows=1000 width=114) (actual
time=9971.312..9971.338 rows=10 loops=1)
 Sort Key: count(*)
 ->  HashAggregate  (cost=27767.67..27770.17 rows=1000
width=114) (actual time=9955.457..9963.051 rows=2005 loops=1)
   ->  Hash Join  (cost=4459.64..27738.80 rows=5774
width=114) (actual time=1140.776..9919.852 rows=5516 loops=1)  
  Hash Cond: ("outer".keyword_id = "inner".id)
 ->  Hash Join  (cost=4437.14..27600.81 rows=5774
width=4) (actual time=21.781..7804.329 rows=5516 loops=1)
   Hash Cond: ("outer".user_data_id =
"inner".id)
   ->  Seq Scan on user_data_keyword 
(cost=0.00..17332.29 rows=1154729 width=8) (actual time=2.717..3834.186
rows=1154729 loops=1)
   ->  Hash  (cost=4433.94..4433.94 rows=1278
width=4) (actual time=18.862..18.862 rows=0 loops=1)
 ->  Index Scan using
ix_user_data_user_id_data_id on user_data  (cost=0.00..4433.94
rows=1278 width=4) (actual time=0.234..13.454 rows=1149 loops=1)
   Index Cond: (user_id = 1)
 ->  Hash  (cost=20.00..20.00 rows=1000 width=118)
(actual time=1118.616..1118.616 rows=0 loops=1)
   ->  Seq Scan on keyword  (cost=0.00..20.00
rows=1000 width=118) (actual time=1.140..609.577 rows=105440 loops=1)
 Total runtime: 9972.704 ms
(15 rows)


Ouch :)

I'm trying to analyze the query plan (I'm not very good at it,
obviously), and I see 2 Sequential Scans, one on the _big_
"user_data_keyword" table with about 60% of the total cost, and one of
the "keyword".
I also see HashAggregate with a high cost and a long actual time.

I'm not sure what to do, which indices I need to add, as the
"user_data_keyword" and "keyword" tables already have PK-based btree
indices:
  "user_data_keyword" has: ... btree (user_data_id, keyword_id)
  "keyword"   has: ... btree (id)


Here are my 3 tables:

user_data (about 300K rows currently, will grow to 10M+)
-
  id   (PK),
  user_id  (FK)
  ... other columns ...

user_data_keyword (lookup table - size always 4 x user_data)
-
  user_data_id (FK)
  keyword_id   (FK)
  PK(user_data_id, keyword_id)

keyword (size about 10 x smaller than user_data_keyword)
---
  id   (PK)
  name VARCHAR(64) NOT NULL UNIQUE
  add_date TIMESTAMP



Is there any way of speeding up my query?

Also, given the number of rows expected in those tables:
user_data: 10M
user_data_keyword: 40M
keyword:4M

Any ideas how a query like this will scale when i hit those numbers?

This particular query speed numbers are from 7.4.6 on a 1.5GHz laptop,
but the production server is running PG 8.0.3 on a 3.0GHz P4 with 2
SATA disks in RAID1 config and 1GB RAM.  How realistic is it to get
sub-second responses on such hardware given the above numbers?

Thanks,
Otis
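(One detail stands out in the plan above: the seq scan on "keyword" is estimated at 1000 rows but returns 105440, and cost=0.00..20.00 rows=1000 is the default estimate for a never-analyzed table. A first step worth trying before restructuring anything:)

```sql
-- Refresh planner statistics for the misestimated table; with accurate
-- row counts the planner may pick index-based joins on its own.
ANALYZE keyword;
```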


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[SQL] Help: Function for splitting VARCHAR column and migrating its data to 2 new tables

2005-05-23 Thread ogjunk-pgjedan
I am restructuring my DB schema and need help migrating data from 1
column of an existing table to two new tables.  I have some Java code
that can do this for me, but it's very slow, and I am now hoping I can
migrate this data with some clever SQL instead.

Here are my 3 tables:

user_data (existing, old table)
-
  id   (PK),
  user_id  (FK)
  keywords VARCHAR(256)
-- this contains comma separated keywords
-- e.g. "new york,san francisco, dallas, food"
-- also "keywords without commas are really just 1 keyword"
  add_date TIMESTAMP


So now I'm trying to migrate this "keywords" VARCHAR column to a more
normalized schema:

user_data_keyword (new lookup table to populate)
-
  id   (PK) -- I may change PK to PK(user_data_id, keyword_id)
  user_data_id (FK)
  keyword_id   (FK)


keyword (new table to populate)
---
  id   (PK)
  name VARCHAR(64) NOT NULL UNIQUE
  add_date TIMESTAMP


I just found
http://www.postgresql.org/docs/current/static/functions-string.html ,
but if anyone could lend me a hand by getting me started with writing a
function for this, I'd really appreciate it.

Thanks,
Otis
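(A sketch of such a migration function, untested, written for the PL/pgSQL of the 8.0 era -- string_to_array() and dollar quoting are assumed available, and the function name is made up. It inserts each distinct trimmed keyword once and links it to its source row:)

```sql
CREATE OR REPLACE FUNCTION migrate_keywords() RETURNS void AS $$
DECLARE
    rec  RECORD;
    kws  TEXT[];
    kw   TEXT;
    kwid INTEGER;
    i    INTEGER;
BEGIN
    FOR rec IN SELECT id, keywords FROM user_data LOOP
        -- split the comma-separated column into an array of keywords
        kws := string_to_array(rec.keywords, ',');
        FOR i IN 1 .. coalesce(array_upper(kws, 1), 0) LOOP
            kw := trim(kws[i]);
            IF kw <> '' THEN
                -- insert the keyword if new, then fetch its id
                SELECT id INTO kwid FROM keyword WHERE name = kw;
                IF NOT FOUND THEN
                    INSERT INTO keyword (name, add_date) VALUES (kw, now());
                    SELECT id INTO kwid FROM keyword WHERE name = kw;
                END IF;
                INSERT INTO user_data_keyword (user_data_id, keyword_id)
                VALUES (rec.id, kwid);
            END IF;
        END LOOP;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```

Run it once with SELECT migrate_keywords(); this will be far faster than issuing the same statements row-by-row from Java, since everything stays inside the server.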


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


[SQL] Trimming the cost of ORDER BY in a simple query

2005-05-02 Thread ogjunk-pgjedan
Hello,

I have a simple query with a pretty high cost (EXPLAIN ...), and I'm
wondering if I can somehow trim it.

Query (shows the last 7 dates):

=> SELECT DISTINCT date_part('year',  uu.add_date),  date_part('month',
uu.add_date),  date_part('day',   uu.add_date)  FROM user_url uu  WHERE
uu.user_id=1 ORDER BY  date_part('year',  uu.add_date) DESC, 
date_part('month', uu.add_date) DESC,  date_part('day',   uu.add_date)
DESC  LIMIT 7;

QUERY PLAN:
-
 Limit  (cost=4510.14..4522.93 rows=2 width=8) (actual
time=19.924..20.160 rows=7 loops=1)
   ->  Unique  (cost=4510.14..4522.93 rows=2 width=8) (actual
time=19.919..20.139 rows=7 loops=1)
 ->  Sort  (cost=4510.14..4513.34 rows=1279 width=8) (actual
time=19.915..20.004 rows=78 loops=1)
   Sort Key: date_part('year'::text, add_date),
date_part('month'::text, add_date), date_part('day'::text, add_date)
   ->  Index Scan using foo on user_url uu 
(cost=0.00...14 rows=1279 width=8) (actual time=0.095..14.761
rows=1225 loops=1)
 Index Cond: (user_id = 1)
 Total runtime: 20.313 ms
(7 rows)


It looks like the cost is all in ORDER BY, and if I remove ORDER BY the
execution time goes from 20-90 ms to less than 1 ms.

I do need the 7 most recent add_dates.  Is there a more efficient way
of grabbing them?
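One possible rewrite (a sketch, not tested against this schema): collapse the three date_part() calls into a single date_trunc(), so only one expression has to be made distinct and sorted:

```sql
SELECT DISTINCT date_trunc('day', add_date) AS day
FROM user_url
WHERE user_id = 1
ORDER BY day DESC
LIMIT 7;

-- Hypothetical expression index that could hand the planner the days
-- pre-sorted (DESC index columns need PostgreSQL 8.3+, and date_trunc
-- is only indexable on a timezone-free TIMESTAMP):
CREATE INDEX user_url_user_day_idx
    ON user_url (user_id, date_trunc('day', add_date) DESC);
```

The index name and the claim that the planner will use it are assumptions; EXPLAIN ANALYZE would confirm either way.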

Thanks,
Otis



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


[SQL] copy old column's values to new column

2004-09-02 Thread ogjunk-pgjedan
Hello,

I need some basic SQL help.  I added a new column to an existing table,
and now I need to copy values from one of the old columns to this new
column.

I need something like this:

FOR pvId IN SELECT id FROM preference_value LOOP
update preference_value SET display_value = (select value from
preference_value where id=pvId) where id=pvId;
END LOOP;

I tried running this from psql, but it didn't work (I suspect FOR can
only be used inside functions, which I don't know how to write in PG yet).

Is there a simple way to do this?
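For the archive: no loop is needed here at all.  A single set-based UPDATE (column names taken from the message above) does the whole copy:

```sql
-- Each row's display_value is set from the same row's value,
-- which is exactly what the FOR loop above was emulating.
UPDATE preference_value
SET display_value = value;
```

In SQL, column references on the right-hand side of SET refer to the row being updated, so the per-row subselect is unnecessary.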

Thanks,
Otis


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [SQL] Order by YYYY MM DD in reverse chrono order trouble

2004-04-23 Thread ogjunk-pgjedan
Hello,

Thank you for all your help, Stijn.
date_part is a standard PG function.
While not the most elegant, the DESC, DESC, DESC solution suggested the
other day works okay for me, so I think I'll use that for now.

Thanks again!
Otis

--- Stijn Vanroye <[EMAIL PROTECTED]> wrote:
> Indeed, it seems that I get the same result for a similar query.
> I'm running version 7.3.4 on a rh 9 server.
> 
> Also: is the function date_part a function you wrote yourself? I get
> an error stating that the function date_part("Unknown",date) is not
> recognized.
> 
> It maybe not a solution to the actual problem but you could try this:
> save the date and the time in two separate fields. I use a similar
> construction for convenience.
> 
> Regards,
> 
> Stijn.
>  
> > Hello,
> > 
> > Hm, doesn't work for me:
> > 
> > [EMAIL PROTECTED] mydb=> select distinct date_part('year',  uu.add_date), 
> > date_part('month', uu.add_date),  date_part('day',
> > uu.add_date)  from uus  inner join ui on uus.user_id=ui.id  
> > inner join 
> > uu on ui.id=uu.user_id where uus.subscriber_user_id=1 order by
> > uu.add_date desc;
> > 
> > ERROR:  For SELECT DISTINCT, ORDER BY expressions must appear 
> > in target
> > list
> > 
> > I have this version of PostgreSQL installed:
> postgresql-7.3.4-3.rhl9
> > 
> > Thanks,
> > Otis
> > 
> > 
> > --- Tom Lane <[EMAIL PROTECTED]> wrote:
> > > <[EMAIL PROTECTED]> writes:
> > > > I'd love to be able to do that, but I cannot just ORDER BY
> > > uu.add_date,
> > > > because I do not have uu.add_date in the SELECT part of the
> > > statement. 
> > > 
> > > Sure you can.  Back around SQL89 there was a restriction that
> ORDER
> > > BY
> > > values had to appear in the SELECT list as well, but no modern
> > > database
> > > has such a restriction anymore ...
> > > 
> > >   regards, tom lane
> > 
> > 


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [SQL] Order by YYYY MM DD in reverse chrono order trouble

2004-04-22 Thread ogjunk-pgjedan
Hello,

Hm, doesn't work for me:

[EMAIL PROTECTED] mydb=> select distinct date_part('year',  uu.add_date), 
date_part('month', uu.add_date),  date_part('day',
uu.add_date)  from uus  inner join ui on uus.user_id=ui.id  inner join 
uu on ui.id=uu.user_id where uus.subscriber_user_id=1 order by
uu.add_date desc;

ERROR:  For SELECT DISTINCT, ORDER BY expressions must appear in target
list

I have this version of PostgreSQL installed: postgresql-7.3.4-3.rhl9

Thanks,
Otis


--- Tom Lane <[EMAIL PROTECTED]> wrote:
> <[EMAIL PROTECTED]> writes:
> > I'd love to be able to do that, but I cannot just ORDER BY
> uu.add_date,
> > because I do not have uu.add_date in the SELECT part of the
> statement. 
> 
> Sure you can.  Back around SQL89 there was a restriction that ORDER
> BY
> values had to appear in the SELECT list as well, but no modern
> database
> has such a restriction anymore ...
> 
>   regards, tom lane


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [SQL] Order by YYYY MM DD in reverse chrono order trouble

2004-04-22 Thread ogjunk-pgjedan
Hello,

I'd love to be able to do that, but I cannot just ORDER BY uu.add_date,
because I do not have uu.add_date in the SELECT part of the statement. 
The reason I don't have it there is because I need distinct YYYY MM DD
values back.
Is there a trick that I could use to make this more elegant?
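One trick (a sketch only, reusing the join from the quoted query below): truncate the timestamp to the day, so a single expression can sit in both the DISTINCT list and the ORDER BY:

```sql
SELECT DISTINCT date_trunc('day', uu.add_date) AS day
FROM uus
JOIN ui ON uus.user_id = ui.id
JOIN uu ON ui.id = uu.user_id
WHERE uus.x_id = 1
ORDER BY day DESC;
```

Because the ORDER BY expression now appears in the select list, the "must appear in target list" restriction on SELECT DISTINCT no longer bites, and year/month/day can still be split out with date_part() afterwards if needed.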

Thanks,
Otis


--- Edmund Bacon <[EMAIL PROTECTED]> wrote:
> Is there some reason you can't do this:
> 
> SELECT DISTINCT
>   date_part('year', uu.add_date),  date_part('month', uu.add_date), 
>   date_part('day', uu.add_date)
> 
>   FROM uus INNER JOIN ui ON uus.user_id=ui.id INNER JOIN uu ON
>   ui.id=uu.user_id
>   WHERE uus.x_id=1
> 
>   ORDER BY 
>   uu.add_date DESC;
> 
> This might be faster, as you only have to sort on one field, and I
> think it should give the desired results
> 
> [EMAIL PROTECTED] wrote:
> 
> >Hello,
> >
> >I am trying to select distinct dates and order them in the reverse
> >chronological order.  Although the column type is TIMESTAMP, in this
> >case I want only YYYY, MM, and DD back.
> >
> >I am using the following query, but it's not returning dates back in
> >the reverse chronological order:
> >
> >SELECT DISTINCT
> >  date_part('year', uu.add_date),  date_part('month', uu.add_date), 
> >  date_part('day', uu.add_date)
> >
> >FROM uus INNER JOIN ui ON uus.user_id=ui.id INNER JOIN uu ON
> >ui.id=uu.user_id
> >WHERE uus.x_id=1
> >
> >ORDER BY
> >date_part('year', uu.add_date), date_part('month', uu.add_date), 
> >date_part('day',  uu.add_date) DESC;
> >
> >
> >This is what the above query returns:
> >
> > date_part | date_part | date_part
> >---+---+---
> >  2004 | 2 | 6
> >  2004 | 4 |20
> >(2 rows)
> >
> >
> >I am trying to get back something like this:
> >2004 4 20
> >2004 4 19
> >2004 2 6
> >...
> >
> >My query is obviously wrong, but I can't see the mistake.  I was
> >wondering if anyone else can see it.  Just changing DESC to ASC, did
> >not work.
> >
> >Thank you!
> >Otis
> >
> >
> 
> -- 
> Edmund Bacon <[EMAIL PROTECTED]>
> 


---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [SQL] Order by YYYY MM DD in reverse chrono order trouble

2004-04-21 Thread ogjunk-pgjedan
Thank you and Denis ([EMAIL PROTECTED]) - that was it.  I needed
explicit DESC for each ORDER BY criterion.

Otis

--- Stijn Vanroye <[EMAIL PROTECTED]> wrote:
> > Hello,
> > 
> > I am trying to select distinct dates and order them in the reverse
> > chronological order.  Although the column type is TIMESTAMP, in
> this
> > case I want only YYYY, MM, and DD back.
> > 
> > I am using the following query, but it's not returning dates back
> in
> > the reverse chronological order:
> > 
> > SELECT DISTINCT
> >   date_part('year', uu.add_date),  date_part('month', uu.add_date),
> 
> >   date_part('day', uu.add_date)
> > 
> > FROM uus INNER JOIN ui ON uus.user_id=ui.id INNER JOIN uu ON
> > ui.id=uu.user_id
> > WHERE uus.x_id=1
> > 
> > ORDER BY
> > date_part('year', uu.add_date), date_part('month', uu.add_date), 
> > date_part('day',  uu.add_date) DESC;
> > 
> > 
> > This is what the above query returns:
> > 
> >  date_part | date_part | date_part
> > ---+---+---
> >   2004 | 2 | 6
> >   2004 | 4 |20
> > (2 rows)
> > 
> > 
> > I am trying to get back something like this:
> > 2004 4 20
> > 2004 4 19
> > 2004 2 6
> > ...
> > 
> > My query is obviously wrong, but I can't see the mistake.  I was
> > wondering if anyone else can see it.  Just changing DESC to ASC,
> did
> > not work.
> > 
> > Thank you!
> > Otis
> What you could try to do in your order by clause is the following:
> ORDER BY
> date_part('year', uu.add_date) DESC,
> date_part('month', uu.add_date) DESC, 
> date_part('day',  uu.add_date) DESC;
> That way you are sure each of the fields is sorted DESC. if you don't
> specify a direction in your order by clause postgres will take ASC as
> the default. I think it does "ASC, ASC, DESC" instead. I'm not
> sure whether a single DESC applies to all the fields in the ORDER
> BY when you declare it only once.
> 
> 
> Regards,
> 
> Stijn Vanroye
> 


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


[SQL] Order by YYYY MM DD in reverse chrono order trouble

2004-04-21 Thread ogjunk-pgjedan
Hello,

I am trying to select distinct dates and order them in the reverse
chronological order.  Although the column type is TIMESTAMP, in this
case I want only YYYY, MM, and DD back.

I am using the following query, but it's not returning dates back in
the reverse chronological order:

SELECT DISTINCT
  date_part('year', uu.add_date),  date_part('month', uu.add_date), 
  date_part('day', uu.add_date)

FROM uus INNER JOIN ui ON uus.user_id=ui.id INNER JOIN uu ON
ui.id=uu.user_id
WHERE uus.x_id=1

ORDER BY
date_part('year', uu.add_date), date_part('month', uu.add_date), 
date_part('day',  uu.add_date) DESC;


This is what the above query returns:

 date_part | date_part | date_part
---+---+---
  2004 | 2 | 6
  2004 | 4 |20
(2 rows)


I am trying to get back something like this:
2004 4 20
2004 4 19
2004 2 6
...

My query is obviously wrong, but I can't see the mistake.  I was
wondering if anyone else can see it.  Just changing DESC to ASC, did
not work.
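For readers who reach this first message of the thread: per the replies above, a trailing DESC binds only to the last ORDER BY expression, so it must be repeated on each one.  The corrected query:

```sql
SELECT DISTINCT
  date_part('year',  uu.add_date),
  date_part('month', uu.add_date),
  date_part('day',   uu.add_date)
FROM uus
JOIN ui ON uus.user_id = ui.id
JOIN uu ON ui.id = uu.user_id
WHERE uus.x_id = 1
ORDER BY
  date_part('year',  uu.add_date) DESC,  -- DESC repeated per expression:
  date_part('month', uu.add_date) DESC,  -- otherwise the first two sort
  date_part('day',   uu.add_date) DESC;  -- ascending by default
```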

Thank you!
Otis


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])