[SQL] Partitioning by letter question

2010-01-29 Thread John Lister
Hi, I was wondering if this was possible. I'm trying to partition a 
table, which is straightforward enough thanks to the great 
documentation, but I have a question:


If I partition using something like a product_id, for example, and have 
check constraints such as (id >= 1000 and id < 2000), then everything is fine 
and the planner correctly uses the right subset of the tables. However, I 
would like to partition by the first letter, and something like 
substr(word,1,1)='a' is ignored by the planner. From reading the 
docs I understand that complicated check constraints are ignored, but 
this doesn't seem overly complicated.


Am I doing something wrong, or is there another, better way to do this?

Thanks

John

--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] Partitioning by letter question

2010-01-30 Thread John Lister

 wrote:
>> Hi, I was wondering if this was possible. I'm trying to partition a
>> table, which is straightforward enough thanks to the great
>> documentation, but I have a question:
>>
>> If I partition using something like a product_id for example and have
>> check constraints such as (id>=1000 and id<2000) then everything is fine
>> and the planner correctly uses the right subset of the tables. However I
>> would like to partition by the first letter and using something like this
>> substr(word,1,1)='a' is ignored by the planner. From reading the docs I
>> understand that complicated check constraints are ignored, but this
>> doesn't seem overly complicated.
>>
>> Am I doing something wrong or is there another better way to do this
>
> Have you tried:
>
> (word >= 'a' and word < 'b')


Cheers, I had my programming head on. One question:

any ideas about what to put for the last partition in the list?

I thought something like (word >= 'z' and word < '{'), which is based on the 
ASCII ordering, but my db is using UTF8.
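(For reference, a minimal sketch of the inheritance-style partitioning under discussion; the table and column names follow the thread, and the open-ended final check is one way to sidestep guessing what sorts after 'z':)

```sql
-- Parent table; the children hold the actual rows.
CREATE TABLE words (word text NOT NULL);

-- One child per initial letter, with plain range checks that
-- constraint exclusion can use.
CREATE TABLE words_a (CHECK (word >= 'a' AND word < 'b')) INHERITS (words);
CREATE TABLE words_b (CHECK (word >= 'b' AND word < 'c')) INHERITS (words);
-- ...
-- Final partition: no upper bound needed.
CREATE TABLE words_z (CHECK (word >= 'z')) INHERITS (words);
```

With constraint_exclusion enabled, a query such as `select * from words where word >= 'a' and word < 'b'` should then touch only words_a.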


I tried to check this by doing

select * from words where word > '' order by word limit 10;

which returns '.' as the first result (OK, not a word, but that is a 
different issue), but if I do

select * from words where word < '.' order by word desc limit 10;

I get '/...' as the first result, where I would expect ''. This doesn't seem 
consistent.

I'm obviously missing some inherent sorting behaviour here, but I'm not sure 
what.

Thanks

John


--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


[SQL] Partitioning improvements query

2010-02-04 Thread John Lister
Hi all, I was just wondering if any progress has been made on improving 
partitioning, particularly performance-wise. I've found a few documents 
on the web, for example 
http://wiki.postgresql.org/wiki/Table_partitioning, 
http://wiki.postgresql.org/wiki/Image:Partitioning_Requirements.pdf and 
http://wiki.postgresql.org/wiki/PgCon_2008_Developer_Meeting#Partitioning_Roadmap, 
which mention improvements to partitioning, but I can't find any info on 
whether these have been acted on.

Just curious, as things like pushing limits down to the subqueries would 
be a great feature.


Cheers

John



--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


[SQL] obtaining difference between minimum value and next in size

2010-11-17 Thread John Lister
Hi, I was wondering if it is possible to do this with a single query rather 
than iterate over all of the rows in an application:

I have a table which for brevity looks like:

create table offers (
  id integer,
  product_id integer,
  price double precision
);

where for each product there are a number of offers in this table. Now my 
question: is it possible to obtain the difference between just the minimum 
price and the next one up per product? So say I have the following data:
id, product_id, price
123, 2, 10.01
125, 2, 10.05
128, 2, 11.30
134, 3, 9.45
147, 3, 11.42
157, 3, 12.08
167, 3, 12.09

then I would like the following returned
product_id, difference
2, .04   (10.05-10.01)
3, 1.97 (11.42-9.45)

,etc


Any ideas?

Thanks

John
--

Got needs? Get Goblin'! - http://www.pricegoblin.co.uk/


Re: [SQL] obtaining difference between minimum value and next in size

2010-11-17 Thread John Lister
Cheers Oliveiros and Tom for your speedy replies. Unfortunately I'm using v8.3, 
so the new functions are out. A big credit to Oliveiros for his SQL fu; that 
seems to do exactly what I want, and I think I pretty much understand the 
query. I always forget the comparison on the rows when thinking about groups.
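(For later readers: the "new functions" are presumably the window functions added in 8.4, which make this a single-pass query. A sketch, assuming the offers table from the original post; note it uses the next row rather than the next distinct price, so duplicate minimum offers would give a difference of 0:)

```sql
SELECT product_id, next_price - price AS difference
FROM (
    SELECT product_id,
           price,
           lead(price) OVER w AS next_price,   -- price of the next offer up
           row_number() OVER w AS rn           -- 1 = cheapest offer
    FROM offers
    WINDOW w AS (PARTITION BY product_id ORDER BY price)
) s
WHERE rn = 1;
```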

John
  - Original Message - 
  From: Oliveiros d'Azevedo Cristina 
  To: John Lister ; pgsql-sql@postgresql.org 
  Sent: Wednesday, November 17, 2010 4:09 PM
  Subject: Re: [SQL] obtaining difference between minimum value and next in size


  Hi, John.

  I am not familiar with the functions Tom indicated, and I'm sure they 
constitute a much more straightforward way to solve your problem.

  Meanwhile, if you'd like to solve it with just SQL give this a try and see if 
it gives you the result you want

  Best,
  Oliveiros

  SELECT product_id, MIN(pv2) - pv1
  FROM ((SELECT product_id, MIN(price) AS pv1
         FROM offers
         GROUP BY product_id) firstSubQuery
        NATURAL JOIN
        (SELECT product_id, price AS pv2
         FROM offers) secondSubQuery) total
  WHERE pv1 <> pv2
  GROUP BY product_id, pv1


[SQL] Full text search ordering question

2008-11-25 Thread John Lister
Hi, is it possible to order the results of a full text search using another 
field?


for example with the following table:

CREATE TABLE breadcrumbs (
 node_id integer NOT NULL,
 breadcrumb character varying,
 textsearchable tsvector,
 views integer,
 CONSTRAINT pk_breadcrumbs PRIMARY KEY (node_id)
)

I'd like to do something like this

select node_id, views from breadcrumbs  where textsearchable @@ 
to_tsquery('word') order by views desc limit 100;


As such I'd like to create a fts index on the textsearchable field and views 
field such that it orders the results by the views column.


atm, this table has over 3M rows (and is likely to be orders of magnitude 
bigger) and some words match hundreds of thousands of rows. The best I've got 
so far is to create an fts index, which is used, and then the resulting rows 
are sorted in memory. Unfortunately, because of the number of rows returned, 
this takes a few seconds.


With a btree index I could index on the 2 columns and it would only hit the 
index and take a fraction of a second.


I've tried the btree_gist module, but it doesn't make any difference (except 
in letting me use an int in the gist index)


Any ideas or is this simply not possible?

Thanks


--

Got needs? Get Goblin'! - http://www.pricegoblin.co.uk/ 



--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] Full text search ordering question

2008-11-25 Thread John Lister

Thanks for the reply, and apologies, it was my first post.

I'm running PG 8.3.3 (Ubuntu); I haven't tried GIN as that doesn't 
support multi-column indexes. I haven't used 8.4 as it is still in development.


a sample query is as follows

select node_id from breadcrumbs where textsearchable @@ to_tsquery('book') 
order by views desc limit 100;


explain analyze results in this:

Limit  (cost=10300.58..10300.83 rows=100 width=381) (actual time=69887.851..69887.880 rows=100 loops=1)
  ->  Sort  (cost=10300.58..10307.61 rows=2812 width=381) (actual time=69887.849..69887.862 rows=100 loops=1)
        Sort Key: views
        Sort Method:  top-N heapsort  Memory: 84kB
        ->  Bitmap Heap Scan on breadcrumbs  (cost=171.49..10193.10 rows=2812 width=381) (actual time=60311.197..69574.742 rows=569519 loops=1)
              Filter: (textsearchable @@ to_tsquery('book'::text))
              ->  Bitmap Index Scan on idx_breadcr  (cost=0.00..170.79 rows=2812 width=0) (actual time=60261.959..60261.959 rows=569519 loops=1)
                    Index Cond: (textsearchable @@ to_tsquery('book'::text))
Total runtime: 69896.896 ms

As you can see it sorts the full result set from the search. Ideally I'd 
like to use an index on views.

How stable is 8.4? Is it worth trying that, or is the multi-column GIN likely 
to be back-ported?


Thanks



John,

it's a good tradition to include query and their EXPLAIN ANALYZE. Pg 
version is also useful.

Did you try GIN index ?
In 8.4 you can use gin index on (views,tsvector)
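(Presumably via the btree_gin contrib module, which is what allows the integer column in a GIN index; a sketch, with a made-up index name:)

```sql
-- On 8.4 install the btree_gin contrib script first
-- (on 9.1+ this would be CREATE EXTENSION btree_gin).
CREATE INDEX idx_breadcrumbs_views_fts
    ON breadcrumbs USING gin (views, textsearchable);
```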

Oleg




 Regards,
 Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83




--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] adding "order by" to a "group by" query

2008-12-06 Thread John Lister

>(still curious about the "must be used in an aggregate function" error
>though... because I do use it in an aggregate)


Your original query grouped on the person id and name, therefore you 
can only return (and order by) those columns or the result of an 
aggregate function on other columns (such as the array_accum function).

I'm no expert, but I think the error is slightly misleading; normally 
you would order by the result of an aggregate function, but maybe the 
parser does this implicitly for you sometimes. Does


select p.id_person, person_name(p), array_accum(distinct pt.type_fr)
   from person p
   left join person_to_event x using (id_person)
   left join person_type pt using (id_person_type)
   where person_name(p) ilike '%will%'
   group by p.id_person, person_name(p)
   order by 3;


work for you?


--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] Best way to "and" from a one-to-many joined table?

2008-12-08 Thread John Lister
I guess it depends on the optimiser and how clever it is. With the 
former, the db will probably generate 2 sets of ids for the 2 joined 
tables (a, b) which only contain the values you require; these lists are 
probably much smaller than the total number of rows in the table, 
therefore any merges and sorts on them operate on fewer rows and 
will be quicker. With the latter query it has to fetch all the rows 
regardless of the attribute and then do the restriction at the end, 
which results in more rows, bigger merges and sorts, and takes longer.
Obviously postgres may be clever enough to realise what you want and 
rearrange the query internally to a more efficient form.


Generally, to find out what it is doing, stick "EXPLAIN (ANALYZE)" in 
front. This will show you the steps the db is taking to perform the 
query and in what order.
If you include ANALYZE then the db actually does the query (throwing 
away the results) and gives you accurate values, etc.; otherwise it shows 
you estimated values based on the various stats collected for the table.
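(For example, a minimal sketch against the tables from the quoted post; 'x' is just a placeholder value:)

```sql
-- Estimates only; the query is planned but not run:
EXPLAIN
SELECT person_name
FROM test_people p
JOIN test_attributes a ON (a.people_id = p.people_id AND a.attribute = 'x');

-- Actually executes the query and reports real row counts and timings:
EXPLAIN ANALYZE
SELECT person_name
FROM test_people p
JOIN test_attributes a ON (a.people_id = p.people_id AND a.attribute = 'x');
```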



SELECT person_name
FROM test_people p
JOIN test_attributes a
ON ((a.people_id = p.people_id) AND (a."attribute" = @firstAttr))
JOIN test_attributes b
ON ((b."people_id" = p."people_id") AND (b."attribute" = @secondAttr));


Hi,

I saw a few people post answers to this question and it raised another 
related question for me.


What are the differences between the above query and this one. Are 
they semantically/functionally identical but might differ in 
performance? Or would they be optimized down to an identical query? Or 
am I misreading them and they are actually different?


SELECT person_name
FROM test_people p
JOIN test_attributes a
ON (a.people_id = p.people_id)
JOIN test_attributes b
ON (b."people_id" = p."people_id")
WHERE
  (a."attribute" = @firstAttr)
  AND (b."attribute" = @secondAttr);

Also, any suggestions about how to figure out this on my own without 
bugging the list in the future would be great. Thanks for any insight!


Steve

p.s. I'm posting in the same thread, but if you think I should have 
started a new thread let me know for the future.




--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


[SQL] temp tables versus normal tables

2009-02-16 Thread John Lister
I've got a process that every minute or so selects some data from a number of 
tables. At the minute I dump this into a normal table, where I do some more 
processing with it before truncating the table and starting again. I don't 
have any indexes on the temporary table, but have thought about adding some 
for the processing stage.

My question is, would a proper temp table be any faster/better? Are they stored 
entirely in memory or written to disk? (I don't care if the data is lost on 
server restart/crash.)
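(For what it's worth, a sketch of the temporary-table variant of that workflow; the table and column names here are hypothetical. Temp tables live in session-local buffers and can still spill to disk, but they skip WAL:)

```sql
-- Session-local, not WAL-logged, emptied at every commit.
CREATE TEMPORARY TABLE staging (
    id    integer,
    value text
) ON COMMIT DELETE ROWS;

BEGIN;
INSERT INTO staging
SELECT id, value FROM source_table;  -- the per-minute extract
-- ... further processing against staging ...
COMMIT;  -- rows are cleared, ready for the next cycle
```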

I suppose I could also use a view for the initial "insert into" and bypass the 
table altogether, but I'm guessing this would be slower, as the data would need 
to be looked up each time the view is used for the subsequent processing steps.

Any thoughts?

Thanks

JOHN

--

Got needs? Get Goblin'! - http://www.pricegoblin.co.uk/


[SQL] Query planning question

2009-05-11 Thread John Lister
Doing the following query:

select distinct m.id, m.name
from manufacturer_manufacturer m
join product_product p on (p.manufacturer_id = m.id)
join retailer_offer o on (o.product_id = p.id)
where o.retailer_id = XXX and o.active

results in one of 2 query plans depending upon the value of XXX.
The first ignores the index on products and does a hash join, which is very 
slow; the second uses the index and does a nested loop, which is fast.

Am I right in assuming the planner thinks a sequential scan is quicker than 10k 
index hits? Would tweaking the costs fix this, or would I be better off updating 
the stats for the product_id and manufacturer_id fields?
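(For reference, the two knobs being asked about could be tried like this; the values are illustrative, not recommendations:)

```sql
-- Gather finer-grained stats for the join columns, then re-analyze:
ALTER TABLE product_product ALTER COLUMN manufacturer_id SET STATISTICS 500;
ALTER TABLE retailer_offer ALTER COLUMN product_id SET STATISTICS 500;
ANALYZE product_product;
ANALYZE retailer_offer;

-- Or make index access look cheaper relative to sequential scans,
-- which is reasonable when most of the working set is cached:
SET random_page_cost = 2.0;
```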

"Unique  (cost=318308.62..321110.94 rows=1029 width=13) (actual 
time=5057.271..5296.973 rows=699 loops=1)"
"  ->  Sort  (cost=318308.62..319242.73 rows=373642 width=13) (actual 
time=5057.270..5196.780 rows=455733 loops=1)"
"Sort Key: m.id, m.name"
"Sort Method:  external merge  Disk: 11032kB"
"->  Hash Join  (cost=110196.74..283725.63 rows=373642 width=13) 
(actual time=1706.287..3451.352 rows=455733 loops=1)"
"  Hash Cond: (p.manufacturer_id = m.id)"
"  ->  Hash Join  (cost=110163.59..278554.90 rows=373642 width=4) 
(actual time=1705.652..3230.879 rows=455733 loops=1)"
"Hash Cond: (o.product_id = p.id)"
"->  Bitmap Heap Scan on retailer_offer o  
(cost=9418.68..157960.21 rows=373642 width=4) (actual time=120.277..382.208 
rows=455733 loops=1)"
"  Recheck Cond: ((retailer_id = 1347) AND active)"
"  ->  Bitmap Index Scan on 
idx_retaileroffer_retailerid  (cost=0.00..9325.27 rows=373642 width=0) (actual 
time=79.503..79.503 rows=455829 loops=1)"
"Index Cond: (retailer_id = 1347)"
"->  Hash  (cost=59067.07..59067.07 rows=2540307 width=8) 
(actual time=1584.994..1584.994 rows=2540324 loops=1)"
"  ->  Seq Scan on product_product p  
(cost=0.00..59067.07 rows=2540307 width=8) (actual time=0.008..698.313 
rows=2540324 loops=1)"
"  ->  Hash  (cost=20.29..20.29 rows=1029 width=13) (actual 
time=0.627..0.627 rows=1029 loops=1)"
"->  Seq Scan on manufacturer_manufacturer m  
(cost=0.00..20.29 rows=1029 width=13) (actual time=0.007..0.278 rows=1029 
loops=1)"
"Total runtime: 5310.663 ms"


"Unique  (cost=43237.52..43266.80 rows=1029 width=13) (actual 
time=190.978..196.625 rows=276 loops=1)"
"  ->  Sort  (cost=43237.52..43247.28 rows=3903 width=13) (actual 
time=190.977..192.431 rows=11298 loops=1)"
"Sort Key: m.id, m.name"
"Sort Method:  quicksort  Memory: 1037kB"
"->  Hash Join  (cost=134.14..43004.70 rows=3903 width=13) (actual 
time=5.006..155.188 rows=11298 loops=1)"
"  Hash Cond: (p.manufacturer_id = m.id)"
"  ->  Nested Loop  (cost=100.99..42917.88 rows=3903 width=4) 
(actual time=4.363..146.421 rows=11298 loops=1)"
"->  Bitmap Heap Scan on retailer_offer o  
(cost=100.99..13663.63 rows=3903 width=4) (actual time=4.345..29.871 rows=11298 
loops=1)"
"  Recheck Cond: ((retailer_id = 1710) AND active)"
"  ->  Bitmap Index Scan on 
idx_retaileroffer_retailerid  (cost=0.00..100.02 rows=3903 width=0) (actual 
time=2.368..2.368 rows=11380 loops=1)"
"Index Cond: (retailer_id = 1710)"
"->  Index Scan using product_product_pkey on 
product_product p  (cost=0.00..7.48 rows=1 width=8) (actual time=0.008..0.009 
rows=1 loops=11298)"
"  Index Cond: (p.id = o.product_id)"
"  ->  Hash  (cost=20.29..20.29 rows=1029 width=13) (actual 
time=0.634..0.634 rows=1029 loops=1)"
"->  Seq Scan on manufacturer_manufacturer m  
(cost=0.00..20.29 rows=1029 width=13) (actual time=0.009..0.275 rows=1029 
loops=1)"
"Total runtime: 196.716 ms"



Thanks


--

Got needs? Get Goblin'! - http://www.pricegoblin.co.uk/



Re: [SQL] Query planning question

2009-05-11 Thread John Lister

"John Lister"  writes:
Am I right in assuming the planner thinks a sequential scan is quicker 
than 10k index hits, would tweaking the costs fix this or would i be 
better updating the stats for the product_id and manufacturer_id fields?


AFAICT the planner did exactly the right things here.  Your first
example is fetching 40 times as many rows from retailer_offer as
the second one is.  If the planner had stuck with the nestloop plan,
it would've taken about 40x as long, and been significantly slower
than the hash join.


Cheers for the quick reply. Maybe not the best values; see the following 2 
plans with approximately the same number of product rows but different results 
and times. I forgot to mention that the product table has 2.5M rows, although 
this is apparent from the plans:


with hash join:

"Unique  (cost=199627.47..199900.51 rows=1029 width=13) (actual 
time=2226.358..2238.255 rows=49 loops=1)"
"  ->  Sort  (cost=199627.47..199718.48 rows=36406 width=13) (actual 
time=2226.356..2230.342 rows=37086 loops=1)"

"Sort Key: m.name, m.id"
"Sort Method:  quicksort  Memory: 3276kB"
"->  Hash Join  (cost=101700.78..196869.37 rows=36406 width=13) 
(actual time=1759.983..2193.453 rows=37086 loops=1)"

"  Hash Cond: (p.manufacturer_id = m.id)"
"  ->  Hash Join  (cost=101667.62..196335.64 rows=36406 width=4) 
(actual time=1759.338..2174.826 rows=37086 loops=1)"

"Hash Cond: (o.product_id = p.id)"
"->  Bitmap Heap Scan on retailer_offer o 
(cost=921.66..84697.06 rows=36406 width=4) (actual time=12.168..49.759 
rows=37086 loops=1)"

"  Recheck Cond: ((retailer_id = 5149) AND active)"
"  ->  Bitmap Index Scan on 
idx_retaileroffer_retailerid  (cost=0.00..912.56 rows=36406 width=0) (actual 
time=7.136..7.136 rows=37089 loops=1)"

"Index Cond: (retailer_id = 5149)"
"->  Hash  (cost=59067.54..59067.54 rows=2540354 
width=8) (actual time=1746.670..1746.670 rows=2540383 loops=1)"
"  ->  Seq Scan on product_product p 
(cost=0.00..59067.54 rows=2540354 width=8) (actual time=0.012..787.095 
rows=2540383 loops=1)"
"  ->  Hash  (cost=20.29..20.29 rows=1029 width=13) (actual 
time=0.635..0.635 rows=1029 loops=1)"
"->  Seq Scan on manufacturer_manufacturer m 
(cost=0.00..20.29 rows=1029 width=13) (actual time=0.009..0.296 rows=1029 
loops=1)"

"Total runtime: 2244.036 ms"

and without:

"Unique  (cost=43237.53..43266.80 rows=1029 width=13) (actual 
time=410.191..421.953 rows=332 loops=1)"
"  ->  Sort  (cost=43237.53..43247.29 rows=3903 width=13) (actual 
time=410.189..414.351 rows=32959 loops=1)"

"Sort Key: m.name, m.id"
"Sort Method:  quicksort  Memory: 3384kB"
"->  Hash Join  (cost=134.15..43004.71 rows=3903 width=13) (actual 
time=16.356..328.938 rows=32959 loops=1)"

"  Hash Cond: (p.manufacturer_id = m.id)"
"  ->  Nested Loop  (cost=100.99..42917.89 rows=3903 width=4) 
(actual time=15.716..308.037 rows=32959 loops=1)"
"->  Bitmap Heap Scan on retailer_offer o 
(cost=100.99..13663.64 rows=3903 width=4) (actual time=15.693..67.479 
rows=32959 loops=1)"

"  Recheck Cond: ((retailer_id = 2016) AND active)"
"  ->  Bitmap Index Scan on 
idx_retaileroffer_retailerid  (cost=0.00..100.02 rows=3903 width=0) (actual 
time=7.863..7.863 rows=33369 loops=1)"

"Index Cond: (retailer_id = 2016)"
"->  Index Scan using product_product_pkey on 
product_product p  (cost=0.00..7.48 rows=1 width=8) (actual 
time=0.006..0.006 rows=1 loops=32959)"

"  Index Cond: (p.id = o.product_id)"
"  ->  Hash  (cost=20.29..20.29 rows=1029 width=13) (actual 
time=0.627..0.627 rows=1029 loops=1)"
"->  Seq Scan on manufacturer_manufacturer m 
(cost=0.00..20.29 rows=1029 width=13) (actual time=0.009..0.270 rows=1029 
loops=1)"

"Total runtime: 422.058 ms"

You can see that the sequential scan is significantly slower than the index 
scan (I've tried to mitigate any caching by the OS with these results).
PostgreSQL 8.3.5 running on a quad-core Xeon 2GHz with 12GB RAM. All costs 
set to defaults, shared_buffers = 4.2GB and effective_cache_size = 6GB.
I thought with the later versions more shared_buffers was better; is this 
too much?


Thanks

JOHN



--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql