[PERFORM] full outer performance problem

2005-05-10 Thread Kim Bisgaard




Hi,

I'm having problems with the query optimizer and FULL OUTER JOIN on
PostgreSQL 7.4: I cannot get it to use my indexes with full outer
joins. I may be naive, but I think it should be possible?

I have two BIG tables (virtually identical) with 3 NOT NULL columns
Station_id, TimeObs, Temp_, with indexes on (Station_id, TimeObs)
and up-to-date ANALYZE statistics (SET STATISTICS 100). I want to join
the two tables with a FULL OUTER JOIN.
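
For reference, a minimal sketch of that statistics setup (the choice of
columns here is an assumption; adjust to whichever columns were actually
bumped to 100):

-- Raise the per-column statistics target, then refresh the statistics.
ALTER TABLE temp_dry_at_2m ALTER COLUMN timeobs SET STATISTICS 100;
ALTER TABLE temp_max_60min ALTER COLUMN timeobs SET STATISTICS 100;
ANALYZE temp_dry_at_2m;
ANALYZE temp_max_60min;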

When I specify the query as:
select temp_max_60min,temp_dry_at_2m
from station s natural join
temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs)
where s.wmo_id=6065 
and timeobs='2004-1-1 0:0:0'
and '2004-1-1 0:0:0' between s.startdate and s.enddate;
I get the correct results, BUT LOUSY performance, and the following
explain:
 Nested Loop Left Join  (cost=5.84..163484.08 rows=1349 width=12) (actual time=66146.815..119005.381 rows=1 loops=1)
   Filter: (COALESCE("outer".timeobs, "inner".timeobs) = '2004-01-01 00:00:00'::timestamp without time zone)
   ->  Hash Join  (cost=5.84..155420.24 rows=1349 width=16) (actual time=8644.449..110836.038 rows=109826 loops=1)
 Hash Cond: ("outer".station_id = "inner".station_id)
 ->  Seq Scan on temp_dry_at_2m a  (cost=0.00..120615.94 rows=6956994 width=16) (actual time=0.024..104548.515 rows=6956994 loops=1)
 ->  Hash  (cost=5.84..5.84 rows=1 width=4) (actual time=0.114..0.114 rows=0 loops=1)
   ->  Index Scan using wmo_idx on station  (cost=0.00..5.84 rows=1 width=4) (actual time=0.105..0.108 rows=1 loops=1)
 Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-01 00:00:00'::timestamp without time zone <= enddate))
   ->  Index Scan using temp_max_60min_idx on temp_max_60min b  (cost=0.00..5.96 rows=1 width=20) (actual time=0.071..0.071 rows=0 loops=109826)
 Index Cond: (("outer".station_id = b.station_id) AND ("outer".timeobs = b.timeobs))
 Total runtime: 119005.499 ms
(11 rows)

If I change the query to the following (which effectively negates the full outer join):
select temp_max_60min,temp_dry_at_2m
from station s natural join
temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs)
where s.wmo_id=6065 
and a.timeobs='2004-1-1 0:0:0' and b.timeobs='2004-1-1 0:0:0' 
and '2004-1-1 0:0:0' between s.startdate and s.enddate;


I get wrong results (in the case where one of the records is missing in
one of the tables), BUT GOOD performance, and this query plan:
 Nested Loop  (cost=0.00..17.83 rows=1 width=12) (actual time=79.221..79.236 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..11.82 rows=1 width=24) (actual time=65.517..65.526 rows=1 loops=1)
 ->  Index Scan using wmo_idx on station  (cost=0.00..5.83 rows=1 width=4) (actual time=0.022..0.026 rows=1 loops=1)
   Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-01 00:00:00'::timestamp without time zone <= enddate))
 ->  Index Scan using temp_max_60min_idx on temp_max_60min b  (cost=0.00..5.97 rows=1 width=20) (actual time=65.483..65.486 rows=1 loops=1)
   Index Cond: (("outer".station_id = b.station_id) AND (b.timeobs = '2004-01-01 00:00:00'::timestamp without time zone))
   ->  Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m a  (cost=0.00..6.00 rows=1 width=16) (actual time=13.694..13.698 rows=1 loops=1)
 Index Cond: (("outer".station_id = a.station_id) AND (a.timeobs = '2004-01-01 00:00:00'::timestamp without time zone))
 Total runtime: 79.340 ms
(9 rows)


If further info like EXPLAIN VERBOSE is useful please say so and I will
provide it.

Thanks in advance!
Kim Bisgaard.





[PERFORM] Partitioning / Clustering

2005-05-10 Thread Alex Stapleton
What is the status of Postgres support for any sort of multi-machine
scaling? What are you meant to do once you've upgraded your box and
tuned the conf files as much as you can, but your query load is still
too high for a single machine?

Upgrading stock Dell boxes (I know we could be using better machines,
but I am trying to tackle the real issue) is not a hugely price-efficient
way of getting extra performance, nor particularly scalable in the
long term.

So, when/is PG meant to be getting a decent partitioning system?
MySQL is getting one (eventually) which is apparently meant to be
similar to Oracle's, according to the docs. Clusgres does not appear
to be widely (or at all) used, and info on it seems pretty thin on the
ground, so I am not too keen on going with that. Is the real solution to
multi-machine partitioning (as in, not like MySQL's MERGE tables) on
PostgreSQL actually doing it in our application API? This seems like
a less than perfect solution once we want to add redundancy and
things into the mix.



[PERFORM] Prefetch

2005-05-10 Thread Matt Olson
I wanted to get some opinions about row prefetching.  AFAIK, there is no
prefetching done by PostgreSQL; all prefetching is delegated to the operating
system.

The hardware (can't say enough good things about it):

Athlon 64, dual channel
4GB ram
240GB usable 4 disk raid5 (ATA133)
Fedora Core 3
PostgreSQL 7.4.7

I have what is essentially a data warehouse of stock data.  Each day has
around 30,000 records (tickers).  A typical operation is to get the 200-day
simple moving average (of price) for each ticker and write the result to a
summary table.  In running this process (Perl/DBI), it is typical to see
70-80% I/O wait time with postgres running at about 8-9%.  If I run the next
day's date, the postgres cache and file cache are now populated with 199 days
of the needed data, postgres runs at 80-90% of CPU, and total run time is greatly
reduced.  My conclusion is that this is a high cache hit rate in action.
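
For concreteness, a rough sketch of that per-day batch as a single SQL
statement (table and column names are made up; the real job runs through
Perl/DBI, and 200 calendar days stands in for 200 trading days):

-- Hypothetical schema: daily_prices(ticker, trade_date, close_price),
-- sma_summary(ticker, as_of, sma_200).
INSERT INTO sma_summary (ticker, as_of, sma_200)
SELECT ticker, DATE '2005-05-09' AS as_of, avg(close_price)
FROM daily_prices
WHERE trade_date >  DATE '2005-05-09' - 200
  AND trade_date <= DATE '2005-05-09'
GROUP BY ticker;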

I've done other things that make sense, like using indexes, playing with the
planner constants and turning up the postgres cache buffers.

Even playing with extreme hdparm read-ahead numbers (e.g., 64738) yields no
apparent difference in database performance.  The random nature of the I/O
drops disk reads down to about 1MB/sec for the array.  A linear table scan
can easily yield 70-80MB/sec on this system.  Total table size is usually
around 1GB and with indexes should be able to fit completely in main memory.

Other databases like Oracle and DB2 implement some sort of row prefetch.  Has
there been serious consideration of implementing something like a prefetch
subsystem?  Does anyone have any opinions as to why this would be a bad idea
for postgres?

Postgres is great for a multiuser environment and OLTP applications.  However,
in this setup, a data warehouse, the observed performance is not what I
would hope for.

Regards,

Matt Olson
Ocean Consulting
http://www.oceanconsulting.com/




Re: [PERFORM] Prefetch

2005-05-10 Thread Tom Lane
Matt Olson <[EMAIL PROTECTED]> writes:
> Other databases like Oracle and DB2 implement some sort of row prefetch.  Has
> there been serious consideration of implementing something like a prefetch
> subsystem?

No.

> Does anyone have any opinions as to why this would be a bad idea for
> postgres? 

We know even less than the OS does about disk layout, and not a lot more
than it about what our next request will be.  (If we're doing a seqscan,
then of course that's not true, but I would expect the OS to be able to
figure that one out and do readahead.)

You haven't shown us your problem queries, but I think that conventional
query tuning would be a more appropriate answer.  In particular I wonder
whether you shouldn't be looking at ways to calculate multiple
aggregates in parallel.
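
For example, one pass over the table can compute several aggregates at
once instead of issuing one query per aggregate (a sketch only, reusing
hypothetical price-table names based on the description in the original
post):

SELECT ticker,
       avg(CASE WHEN trade_date > DATE '2005-05-09' - 50  THEN close_price END) AS sma_50,
       avg(CASE WHEN trade_date > DATE '2005-05-09' - 200 THEN close_price END) AS sma_200
FROM daily_prices
WHERE trade_date >  DATE '2005-05-09' - 200
  AND trade_date <= DATE '2005-05-09'
GROUP BY ticker;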

regards, tom lane



Re: [PERFORM] full outer performance problem

2005-05-10 Thread John A Meinel
Kim Bisgaard wrote:
Hi,
I'm having problems with the query optimizer and FULL OUTER JOIN on
PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins.
I might be naive, but I think that it should be possible?
I have two BIG tables (virtually identical) with 3 NOT NULL columns
Station_id, TimeObs, Temp_, with indexes on (Station_id, TimeObs)
and valid ANALYSE (set statistics=100). I want to join the two tables
with a FULL OUTER JOIN.
When I specify the query as:
select temp_max_60min,temp_dry_at_2m
from station s natural join
temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs)
where s.wmo_id=6065
and timeobs='2004-1-1 0:0:0'
and '2004-1-1 0:0:0' between s.startdate and s.enddate;
I get the correct results, BUT LOUSY performance, and the following explain:
 Nested Loop Left Join  (cost=5.84..163484.08 rows=1349 width=12) (actual time=66146.815..119005.381 rows=1 loops=1)
   Filter: (COALESCE("outer".timeobs, "inner".timeobs) = '2004-01-01 00:00:00'::timestamp without time zone)
   ->  Hash Join  (cost=5.84..155420.24 rows=1349 width=16) (actual time=8644.449..110836.038 rows=109826 loops=1)
Well, the estimate here is quite a bit off. It thinks you will be
getting 1349 (which is probably why it picked a nested loop plan), but
then it is getting 109826 rows.
I'm guessing it is misunderstanding the selectivity of the timeobs column.
 Hash Cond: ("outer".station_id = "inner".station_id)
 ->  Seq Scan on temp_dry_at_2m a  (cost=0.00..120615.94 rows=6956994 width=16) (actual time=0.024..104548.515 rows=6956994 loops=1)
 ->  Hash  (cost=5.84..5.84 rows=1 width=4) (actual time=0.114..0.114 rows=0 loops=1)
   ->  Index Scan using wmo_idx on station  (cost=0.00..5.84 rows=1 width=4) (actual time=0.105..0.108 rows=1 loops=1)
 Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-01 00:00:00'::timestamp without time zone <= enddate))
   ->  Index Scan using temp_max_60min_idx on temp_max_60min b  (cost=0.00..5.96 rows=1 width=20) (actual time=0.071..0.071 rows=0 loops=109826)
 Index Cond: (("outer".station_id = b.station_id) AND ("outer".timeobs = b.timeobs))
 Total runtime: 119005.499 ms
(11 rows)
I think the bigger problem is that a full outer join says to grab all rows
from both tables, even where there is no match on the other side.
What about this query:
SELECT temp_max_60min, temp_dry_at_2m
  FROM station s LEFT JOIN temp_dry_at_2m a USING (station_id, timeobs)
                 LEFT JOIN temp_max_60min b USING (station_id, timeobs)
 WHERE s.wmo_id = 6065
   AND timeobs = '2004-1-1 0:0:0'
   AND '2004-1-1 0:0:0' BETWEEN s.startdate AND s.enddate;
After that, you should probably have a multi-column index on
(station_id, timeobs), which lets postgres use just that index for the
lookup, rather than using an index and then a filter. (Looking at your
next query you might already have that index).
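A sketch of the same LEFT JOIN approach with the constants pushed into the
join conditions, so each big table can be probed through its (station_id,
timeobs) index. It is untested against the real schema, and unlike the
full outer join it still returns a station row with NULL temperatures when
both measurements are missing:

SELECT b.temp_max_60min, a.temp_dry_at_2m
  FROM station s
  LEFT JOIN temp_dry_at_2m a
         ON a.station_id = s.station_id
        AND a.timeobs = '2004-01-01 00:00:00'
  LEFT JOIN temp_max_60min b
         ON b.station_id = s.station_id
        AND b.timeobs = '2004-01-01 00:00:00'
 WHERE s.wmo_id = 6065
   AND '2004-01-01 00:00:00' BETWEEN s.startdate AND s.enddate;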
If I change the query to (and thus negates the full outer join):
This is the same query; I think you messed up your copy and paste.
select temp_max_60min,temp_dry_at_2m
from station s natural join
temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs)
where s.wmo_id=6065
and a.timeobs='2004-1-1 0:0:0' and b.timeobs='2004-1-1 0:0:0'
and '2004-1-1 0:0:0' between s.startdate and s.enddate;
I get wrong results (In the case where one of the records is missing in
one of the tables), BUT GOOD performance, and this query plan:
 Nested Loop  (cost=0.00..17.83 rows=1 width=12) (actual time=79.221..79.236 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..11.82 rows=1 width=24) (actual time=65.517..65.526 rows=1 loops=1)
 ->  Index Scan using wmo_idx on station  (cost=0.00..5.83 rows=1 width=4) (actual time=0.022..0.026 rows=1 loops=1)
   Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-01 00:00:00'::timestamp without time zone <= enddate))
 ->  Index Scan using temp_max_60min_idx on temp_max_60min b  (cost=0.00..5.97 rows=1 width=20) (actual time=65.483..65.486 rows=1 loops=1)
   Index Cond: (("outer".station_id = b.station_id) AND (b.timeobs = '2004-01-01 00:00:00'::timestamp without time zone))
   ->  Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m a  (cost=0.00..6.00 rows=1 width=16) (actual time=13.694..13.698 rows=1 loops=1)
 Index Cond: (("outer".station_id = a.station_id) AND (a.timeobs = '2004-01-01 00:00:00'::timestamp without time zone))
 Total runtime: 79.340 ms
(9 rows)
If further info like EXPLAIN VERBOSE is useful please say so and I will
provide it.
Thanks in advance!
Kim Bisgaard.
I still feel like you will have a problem with an outer join in this
circumstance, because it will have to scan all of both tables.
I think what you are wanting is "give me everything where station_id =
X, and there is a row in either a or b".
I think my LEFT JOIN example does that, but I a

Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread John A Meinel
Alex Stapleton wrote:
What is the status of Postgres support for any sort of multi-machine
scaling support? What are you meant to do once you've upgraded your  box
and tuned the conf files as much as you can? But your query load  is
just too high for a single machine?
Upgrading stock Dell boxes (I know we could be using better machines,
but I am trying to tackle the real issue) is not a hugely price
efficient way of getting extra performance, nor particularly scalable
in the long term.
Switch from Dell Xeon boxes, and go to Opterons. :) Seriously, Dell is
far away from Big Iron. I don't know what performance you are looking
for, but you can easily get into inserting 10M rows/day with quality
hardware.
But actually, is it your SELECT load that is too high, or your INSERT
load, or something in between?
Because Slony is around if it is a SELECT problem.
http://gborg.postgresql.org/project/slony1/projdisplay.php
Basically, Slony is a Master/Slave replication system. So if you have
INSERTs going into the Master, you can have as many replicated slaves as
you need to handle your SELECT load.
Slony is an asynchronous replicator, so there is a time delay from the
INSERT until it will show up on a slave, but that time could be pretty
small.
This would require some application-level support, since an INSERT goes
to a different place than a SELECT. But there has been some discussion
about pg_pool being able to spread the query load, and having it be
aware of the difference between a SELECT and an INSERT so it can route
the query to the correct host. The biggest problem is that functions
could cause a SELECT func() to actually insert a row, which pg_pool
wouldn't know about. There are two possible solutions: a) don't do that
when you are using this system, or b) add some sort of comment hint so
that pg_pool can understand that the select is actually an INSERT and
needs to be done on the master.
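Purely as an illustration of that hint idea (the marker convention and
the function are both invented for the example, not an existing pg_pool
feature as far as I know):

-- A balancer that recognized a marker like this could send the statement
-- to the master even though it parses as a plain SELECT:
/* WRITES */ SELECT record_pageview(42, 'http://example.com/');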
So, when/is PG meant to be getting a decent partitioning system?  MySQL
is getting one (eventually) which is apparently meant to be  similiar to
Oracle's according to the docs. Clusgres does not appear  to be
widely/or at all used, and info on it seems pretty thin on the  ground,
so I am
not too keen on going with that. Is the real solution to multi- machine
partitioning (as in, not like MySQLs MERGE tables) on  PostgreSQL
actually doing it in our application API? This seems like  a less than
perfect solution once we want to add redundancy and  things into the mix.
There is also PGCluster
http://pgfoundry.org/projects/pgcluster/
which is trying to be more of a synchronous multi-master system. I
haven't heard of Clusgres, so I'm guessing it is an older attempt, which
has been overtaken by PGCluster.
Just realize that clusters don't necessarily scale like you would want
them to. Because at some point you have to insert into the same table,
which means you need to hold a lock which prevents the other machine
from doing anything. And with synchronous replication, you have to wait
for all of the machines to get a copy of the data before you can say it
has been committed, which does *not* scale well with the number of machines.
If you can make it work, I think having a powerful master server, which
can finish an INSERT quickly, and then having a bunch of Slony slaves
with a middleman (like pg_pool) to do load balancing among them, is the
best way to scale up. There are still some requirements, like not having
to see the results of an INSERT instantly (though if you are using
hinting to pg_pool, you could hint that this query must be done on the
master, realizing that the more you do it, the more you slow everything
down).
John
=:->
PS> I don't know what functionality has been actually implemented in
pg_pool, just that it was discussed in the past. Slony-II is also in the
works.




Re: [PERFORM] full outer performance problem

2005-05-10 Thread Tom Lane
Kim Bisgaard <[EMAIL PROTECTED]> writes:
> I have two BIG tables (virtually identical) with 3 NOT NULL columns 
> Station_id, TimeObs, Temp_, with indexes on (Station_id, TimeObs) 
> and valid ANALYSE (set statistics=100). I want to join the two tables 
> with a FULL OUTER JOIN.

I'm confused.  If the columns are NOT NULL, why isn't this a valid
transformation of your original query?

> select temp_max_60min,temp_dry_at_2m
> from station s natural join
> temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs)
> where s.wmo_id=6065 
> and a.timeobs='2004-1-1 0:0:0' and b.timeobs='2004-1-1 0:0:0'
> and '2004-1-1 0:0:0' between s.startdate and s.enddate;

Seems like it's not eliminating any rows that would otherwise succeed.

regards, tom lane



Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-10 Thread Tom Lane
Greg Stark <[EMAIL PROTECTED]> writes:
> Tom Lane <[EMAIL PROTECTED]> writes:
>> I think that efficient implementation of this would require explicitly
>> storing the hash code for each index entry,

> It seems that means doubling the size of the hash index. That's a pretty big
> i/o to cpu tradeoff. 

Hardly.  The minimum possible size of a hash entry today is 8 bytes
header plus 4 bytes datum, plus there's a 4-byte line pointer to factor
in.  So under the most pessimistic assumptions, storing the hash code
would add 25% to the size.  (On MAXALIGN=8 hardware, it might cost you
nothing at all.)

> What if the hash index stored *only* the hash code? That could be useful for
> indexing large datatypes that would otherwise create large indexes.

Hmm, that could be a thought.

regards, tom lane



Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Adam Haberlach

I think that perhaps he was trying to avoid having to buy "Big Iron" at all.

With all the Opteron v. Xeon around here, and talk of $30,000 machines,
perhaps it would be worth exploring the option of buying 10 cheapass
machines for $300 each.  At the moment, that $300 buys you, from Dell, a
2.5GHz Pentium 4 w/ 256MB of RAM, a 40GB hard drive, and gigabit Ethernet.
The aggregate CPU and bandwidth is pretty stupendous, but not as easy to
harness as a single machine.

For those of us looking at batch and data warehousing applications, it would
be really handy to be able to partition databases, tables, and processing
load across banks of cheap hardware.

Yes, clustering solutions can distribute the data, and can even do it on a
per-table basis in some cases.  This still leaves it up to the application's
logic to handle reunification of the data.

Ideas:
1. Create a table/storage type that consists of a select statement
on another machine.  While I don't think the current executor is capable of
working on multiple nodes of an execution tree at the same time, it would be
great if it could offload a select of tuples from a remote table to an
entirely different server and merge the resulting data into the current
execution.  I believe MySQL has this, and Oracle may implement it in another
way.

2. There is no #2 at this time, but I'm sure one can be
hypothesized.

...Google and other companies have definitely proved that one can harness
huge clusters of cheap hardware.  It can't be _that_ hard, can it?  :)


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of John A Meinel
Sent: Tuesday, May 10, 2005 7:41 AM
To: Alex Stapleton
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Partitioning / Clustering

Alex Stapleton wrote:
> What is the status of Postgres support for any sort of multi-machine 
> scaling support? What are you meant to do once you've upgraded your  
> box and tuned the conf files as much as you can? But your query load  
> is just too high for a single machine?
>
> Upgrading stock Dell boxes (I know we could be using better machines, 
> but I am trying to tackle the real issue) is not a hugely price 
> efficient way of getting extra performance, nor particularly scalable 
> in the long term.

Switch from Dell Xeon boxes, and go to Opterons. :) Seriously, Dell is far
away from Big Iron. I don't know what performance you are looking for, but
you can easily get into inserting 10M rows/day with quality hardware.

But actually is it your SELECT load that is too high, or your INSERT load,
or something inbetween.

Because Slony is around if it is a SELECT problem.
http://gborg.postgresql.org/project/slony1/projdisplay.php

Basically, Slony is a Master/Slave replication system. So if you have INSERT
going into the Master, you can have as many replicated slaves, which can
handle your SELECT load.
Slony is an asynchronous replicator, so there is a time delay from the
INSERT until it will show up on a slave, but that time could be pretty
small.

This would require some application level support, since an INSERT goes to a
different place than a SELECT. But there has been some discussion about
pg_pool being able to spread the query load, and having it be aware of the
difference between a SELECT and an INSERT and have it route the query to the
correct host. The biggest problem being that functions could cause a SELECT
func() to actually insert a row, which pg_pool wouldn't know about. There
are 2 possible solutions, a) don't do that when you are using this system,
b) add some sort of comment hint so that pg_pool can understand that the
select is actually an INSERT, and needs to be done on the master.

>
> So, when/is PG meant to be getting a decent partitioning system?  
> MySQL is getting one (eventually) which is apparently meant to be  
> similiar to Oracle's according to the docs. Clusgres does not appear  
> to be widely/or at all used, and info on it seems pretty thin on the  
> ground, so I am not too keen on going with that. Is the real solution 
> to multi- machine partitioning (as in, not like MySQLs MERGE tables) 
> on  PostgreSQL actually doing it in our application API? This seems 
> like  a less than perfect solution once we want to add redundancy and  
> things into the mix.

There is also PGCluster
http://pgfoundry.org/projects/pgcluster/

Which is trying to be more of a Synchronous multi-master system. I haven't
heard of Clusgres, so I'm guessing it is an older attempt, which has been
overtaken by pgcluster.

Just realize that clusters don't necessarily scale like you would want them
too. Because at some point you have to insert into the same table, which
means you need to hold a lock which prevents the other machine from doing
anything. And with synchronous replication, you have to wait for all of the
machines to get a copy of the data before you can say it has been committed,
which does *not* scale well with the number of machi

Re: [PERFORM] Prefetch

2005-05-10 Thread Rod Taylor
> I've done other things that make sense, like using indexes, playing with the
> planner constants and turning up the postgres cache buffers.

After you load the new day's data, try running CLUSTER on the structure
using a key of (stockID, date) -- probably your primary key.

This should significantly reduce the amount of IO required for your
calculations involving a few stocks over a period of time.
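
A sketch of what that might look like (index and table names are
placeholders; note that CLUSTER rewrites the table, so it takes an
exclusive lock and needs scratch disk space):

-- One-time: build an index matching the access pattern, then order the
-- heap by it (7.4 syntax).
CREATE INDEX price_history_stock_date_idx ON price_history (stock_id, date);
CLUSTER price_history_stock_date_idx ON price_history;
ANALYZE price_history;

-- After each daily load the physical ordering degrades, so re-run:
CLUSTER price_history;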

-- 




Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Alex Stapleton
On 10 May 2005, at 15:41, John A Meinel wrote:
Alex Stapleton wrote:
What is the status of Postgres support for any sort of multi-machine
scaling support? What are you meant to do once you've upgraded  
your  box
and tuned the conf files as much as you can? But your query load  is
just too high for a single machine?

Upgrading stock Dell boxes (I know we could be using better machines,
but I am trying to tackle the real issue) is not a hugely price
efficient way of getting extra performance, nor particularly scalable
in the long term.
Switch from Dell Xeon boxes, and go to Opterons. :) Seriously, Dell is
far away from Big Iron. I don't know what performance you are looking
for, but you can easily get into inserting 10M rows/day with quality
hardware.
Better hardware = More Efficient != More Scalable
But actually is it your SELECT load that is too high, or your INSERT
load, or something inbetween.
Because Slony is around if it is a SELECT problem.
http://gborg.postgresql.org/project/slony1/projdisplay.php
Basically, Slony is a Master/Slave replication system. So if you have
INSERT going into the Master, you can have as many replicated slaves,
which can handle your SELECT load.
Slony is an asynchronous replicator, so there is a time delay from the
INSERT until it will show up on a slave, but that time could be pretty
small.


So, when/is PG meant to be getting a decent partitioning system?   
MySQL
is getting one (eventually) which is apparently meant to be   
similiar to
Oracle's according to the docs. Clusgres does not appear  to be
widely/or at all used, and info on it seems pretty thin on the   
ground,
so I am
not too keen on going with that. Is the real solution to multi-  
machine
partitioning (as in, not like MySQLs MERGE tables) on  PostgreSQL
actually doing it in our application API? This seems like  a less  
than
perfect solution once we want to add redundancy and  things into  
the mix.

There is also PGCluster
http://pgfoundry.org/projects/pgcluster/
Which is trying to be more of a Synchronous multi-master system. I
haven't heard of Clusgres, so I'm guessing it is an older attempt,  
which
has been overtaken by pgcluster.

Just realize that clusters don't necessarily scale like you would want
them too. Because at some point you have to insert into the same  
table,
which means you need to hold a lock which prevents the other machine
from doing anything. And with synchronous replication, you have to  
wait
for all of the machines to get a copy of the data before you can  
say it
has been committed, which does *not* scale well with the number of  
machines.
This is why I mention partitioning. It solves this issue by storing
different data sets on different machines under the same schema. These
separate chunks of the table can then be replicated as well, for data
redundancy and so on. MySQL are working on these things, but PG just has
a bunch of third-party extensions; I wonder why these are not being
integrated into the main trunk :/ Thanks for pointing me to PGCluster
though. It looks like it should be better than Slony at least.



Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Alex Stapleton
On 10 May 2005, at 16:02, Adam Haberlach wrote:
I think that perhaps he was trying to avoid having to buy "Big  
Iron" at all.
You would be right. Although we are not against paying a bit more  
than $300 for a server ;)

With all the Opteron v. Xeon around here, and talk of $30,000  
machines,
perhaps it would be worth exploring the option of buying 10 cheapass
machines for $300 each.  At the moment, that $300 buys you, from  
Dell, a
2.5Ghz Pentium 4 w/ 256mb of RAM and a 40Gb hard drive and gigabit  
ethernet.
The aggregate CPU and bandwidth is pretty stupendous, but not as  
easy to
harness as a single machine.

Yes, clustering solutions can distribute the data, and can even do  
it on a
per-table basis in some cases.  This still leaves it up to the  
application's
logic to handle reunification of the data.
If you're going to be programming that sort of logic into your API in
the beginning, it's not too much more work to add basic replication,
load balancing and partitioning into it either. But the DB should be
able to do it for you; adding that stuff in later is often more
difficult and less likely to get done.

Ideas:
1. Create a table/storage type that consists of a select statement
on another machine.  While I don't think the current executor is  
capable of
working on multiple nodes of an execution tree at the same time, it  
would be
great if it could offload a select of tuples from a remote table to an
entirely different server and merge the resulting data into the  
current
execution.  I believe MySQL has this, and Oracle may implement it  
in another
way.
MySQL sort of has this, though it's not as good as Oracle's. Apparently
there is a much better version of it in 5.1, which should make it to
stable sometime next year, I imagine.

2. There is no #2 at this time, but I'm sure one can be
hypothesized.
I would have thought a particularly smart version of pg_pool could do
it. It could partition data to different servers if it knew which
columns to key by on each table.

...Google and other companies have definitely proved that one can  
harness
huge clusters of cheap hardware.  It can't be _that_ hard, can it.  :)
I shudder to think how much the "Big Iron" equivalent of a Google
data-center would cost.

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of John A  
Meinel
Sent: Tuesday, May 10, 2005 7:41 AM
To: Alex Stapleton
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Partitioning / Clustering

Alex Stapleton wrote:
What is the status of Postgres support for any sort of multi-machine
scaling support? What are you meant to do once you've upgraded your
box and tuned the conf files as much as you can? But your query load
is just too high for a single machine?
Upgrading stock Dell boxes (I know we could be using better machines,
but I am trying to tackle the real issue) is not a hugely price
efficient way of getting extra performance, nor particularly scalable
in the long term.
Switch from Dell Xeon boxes, and go to Opterons. :) Seriously, Dell  
is far
away from Big Iron. I don't know what performance you are looking  
for, but
you can easily get into inserting 10M rows/day with quality hardware.

But actually is it your SELECT load that is too high, or your  
INSERT load,
or something inbetween.

Because Slony is around if it is a SELECT problem.
http://gborg.postgresql.org/project/slony1/projdisplay.php
Basically, Slony is a Master/Slave replication system. So if you  
have INSERT
going into the Master, you can have as many replicated slaves,  
which can
handle your SELECT load.
Slony is an asynchronous replicator, so there is a time delay from the
INSERT until it will show up on a slave, but that time could be pretty
small.

This would require some application level support, since an INSERT  
goes to a
different place than a SELECT. But there has been some discussion  
about
pg_pool being able to spread the query load, and having it be aware  
of the
difference between a SELECT and an INSERT and have it route the  
query to the
correct host. The biggest problem being that functions could cause  
a SELECT
func() to actually insert a row, which pg_pool wouldn't know about.  
There
are 2 possible solutions, a) don't do that when you are using this  
system,
b) add some sort of comment hint so that pg_pool can understand  
that the
select is actually an INSERT, and needs to be done on the master.


So, when/is PG meant to be getting a decent partitioning system?
MySQL is getting one (eventually) which is apparently meant to be
similiar to Oracle's according to the docs. Clusgres does not appear
to be widely/or at all used, and info on it seems pretty thin on the
ground, so I am not too keen on going with that. Is the real solution
to multi- machine partitioning (as in, not like MySQLs MERGE tables)
on  PostgreSQL actually doing it in our application API? This seems
like  a less than perfect solution once we want to add re

Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Richard_D_Levine
> exploring the option of buying 10 cheapass
> machines for $300 each.  At the moment, that $300 buys you, from Dell, a
> 2.5Ghz Pentium 4

Buy cheaper ass Dells with an AMD 64 3000+.  Beats the crap out of the 2.5
GHz Pentium, especially for PostgreSQL.

See the thread "Whence the Opterons" for more.

Rick

[EMAIL PROTECTED] wrote on 05/10/2005 10:02:50 AM:

>
> I think that perhaps he was trying to avoid having to buy "Big Iron" at
all.
>
> With all the Opteron v. Xeon around here, and talk of $30,000 machines,
> perhaps it would be worth exploring the option of buying 10 cheapass
> machines for $300 each.  At the moment, that $300 buys you, from Dell, a
> 2.5Ghz Pentium 4 w/ 256mb of RAM and a 40Gb hard drive and gigabit
ethernet.
> The aggregate CPU and bandwidth is pretty stupendous, but not as easy to
> harness as a single machine.
>
> For those of us looking at batch and data warehousing applications, it
would
> be really handy to be able to partition databases, tables, and processing
> load across banks of cheap hardware.
>
> Yes, clustering solutions can distribute the data, and can even do it on
a
> per-table basis in some cases.  This still leaves it up to the
application's
> logic to handle reunification of the data.
>
> Ideas:
>1. Create a table/storage type that consists of a select statement
> on another machine.  While I don't think the current executor is capable
of
> working on multiple nodes of an execution tree at the same time, it would
be
> great if it could offload a select of tuples from a remote table to an
> entirely different server and merge the resulting data into the current
> execution.  I believe MySQL has this, and Oracle may implement it in
another
> way.
>
>2. There is no #2 at this time, but I'm sure one can be
> hypothesized.
>
> ...Google and other companies have definitely proved that one can harness
> huge clusters of cheap hardware.  It can't be _that_ hard, can it.  :)
>
>
> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of John A
Meinel
> Sent: Tuesday, May 10, 2005 7:41 AM
> To: Alex Stapleton
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Partitioning / Clustering
>
> Alex Stapleton wrote:
> > What is the status of Postgres support for any sort of multi-machine
> > scaling support? What are you meant to do once you've upgraded your
> > box and tuned the conf files as much as you can? But your query load
> > is just too high for a single machine?
> >
> > Upgrading stock Dell boxes (I know we could be using better machines,
> > but I am trying to tackle the real issue) is not a hugely price
> > efficient way of getting extra performance, nor particularly scalable
> > in the long term.
>
> Switch from Dell Xeon boxes, and go to Opterons. :) Seriously, Dell is
far
> away from Big Iron. I don't know what performance you are looking for,
but
> you can easily get into inserting 10M rows/day with quality hardware.
>
> But actually is it your SELECT load that is too high, or your INSERT
load,
> or something inbetween.
>
> Because Slony is around if it is a SELECT problem.
> http://gborg.postgresql.org/project/slony1/projdisplay.php
>
> Basically, Slony is a Master/Slave replication system. So if you have
INSERT
> going into the Master, you can have as many replicated slaves, which can
> handle your SELECT load.
> Slony is an asynchronous replicator, so there is a time delay from the
> INSERT until it will show up on a slave, but that time could be pretty
> small.
>
> This would require some application level support, since an INSERT goes
to a
> different place than a SELECT. But there has been some discussion about
> pg_pool being able to spread the query load, and having it be aware of
the
> difference between a SELECT and an INSERT and have it route the query to
the
> correct host. The biggest problem being that functions could cause a
SELECT
> func() to actually insert a row, which pg_pool wouldn't know about. There
> are 2 possible solutions, a) don't do that when you are using this
system,
> b) add some sort of comment hint so that pg_pool can understand that the
> select is actually an INSERT, and needs to be done on the master.
>
> >
> > So, when/is PG meant to be getting a decent partitioning system?
> > MySQL is getting one (eventually) which is apparently meant to be
> > similiar to Oracle's according to the docs. Clusgres does not appear
> > to be widely/or at all used, and info on it seems pretty thin on the
> > ground, so I am not too keen on going with that. Is the real solution
> > to multi- machine partitioning (as in, not like MySQLs MERGE tables)
> > on  PostgreSQL actually doing it in our application API? This seems
> > like  a less than perfect solution once we want to add redundancy and
> > things into the mix.
>
> There is also PGCluster
> http://pgfoundry.org/projects/pgcluster/
>
> Which is trying to be more of a Synchronous multi-master system. I
haven't
> heard of Clusgres, 

Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread John A Meinel
Adam Haberlach wrote:
I think that perhaps he was trying to avoid having to buy "Big Iron" at all.
With all the Opteron v. Xeon around here, and talk of $30,000 machines,
perhaps it would be worth exploring the option of buying 10 cheapass
machines for $300 each.  At the moment, that $300 buys you, from Dell, a
2.5Ghz Pentium 4 w/ 256mb of RAM and a 40Gb hard drive and gigabit ethernet.
The aggregate CPU and bandwidth is pretty stupendous, but not as easy to
harness as a single machine.
For those of us looking at batch and data warehousing applications, it would
be really handy to be able to partition databases, tables, and processing
load across banks of cheap hardware.
Yes, clustering solutions can distribute the data, and can even do it on a
per-table basis in some cases.  This still leaves it up to the application's
logic to handle reunification of the data.
Sure. A lot of this is application dependent, though. For instance,
foreign key constraints: in a general cluster solution, you would allow
foreign keys across partitions. I have a feeling this would be extra
slow, and hard to do correctly. Multi-machine transactions are also a
difficulty, since the WAL now has to take into account all machines, and
you have to wait for fsync on all of them.
I'm not sure how Oracle does it, but these things seem like they prevent
clustering from really scaling very well.
Ideas:
1. Create a table/storage type that consists of a select statement
on another machine.  While I don't think the current executor is capable of
working on multiple nodes of an execution tree at the same time, it would be
great if it could offload a select of tuples from a remote table to an
entirely different server and merge the resulting data into the current
execution.  I believe MySQL has this, and Oracle may implement it in another
way.
2. There is no #2 at this time, but I'm sure one can be
hypothesized.
...Google and other companies have definitely proved that one can harness
huge clusters of cheap hardware.  It can't be _that_ hard, can it.  :)
Again, it depends on the application. A generic database with lots of
cross reference integrity checking does not work on a cluster very well.
A very distributed db where you don't worry about cross references does
scale. Google made a point of making their application work in a
distributed manner.
In the other post he mentions that pg_pool could naturally split out the
rows into different machines based on partitioning, etc. I would argue
that it is more of a custom pool daemon based on the overall
application. Because you have to start dealing with things like
cross-machine joins. Who handles that? The pool daemon has to, since it
is the only thing that talks to both tables. I think you could certainly
write a reasonably simple application specific daemon where all of the
clients send their queries to, and it figures out where they need to go,
and aggregates them as necessary. But a fully generic one is *not*
simple at all, and I think is far out of the scope of something like
pg_pool.
I'm guessing that PGCluster is looking at working on that, and it might
be true that pg_pool is thinking about it. But just thinking about the
very simple query:
SELECT row1, row2 FROM table1_on_machine_a NATURAL JOIN table2_on_machine_b
WHERE restrict_table_1 AND restrict_table_2
AND restrict_1_based_on_2;
This needs to be broken into something like:
SELECT row1 FROM table1_on_machine_a
WHERE restrict_table_1
ORDER BY join_column;
SELECT row2 FROM table2_on_machine_b
WHERE restrict_table_2
ORDER BY join_column;
Then these rows need to be merge_joined, and the restrict_1_based_on_2
needs to be applied.
This is in no way trivial, and I think it is outside the scope of
pg_pool. Now maybe if you restrict yourself so that each query stays
within one machine you can make it work. Or write your own app to handle
some of this transparently for the clients. But I would expect that, to
make the problem feasible, it would not be a generic solution.
Maybe I'm off base; I don't really keep track of pg_pool/PGCluster/etc.
But I can see that the problem is very difficult. And note that, at the
very least, this is a simple query, and it doesn't even get into
optimizations. You might
actually prefer the above to be done with a Nestloop style, where
table_1 is selected, and then for each row you do a single index select
on table_2. But how is your app going to know that? It has to have the
statistics from the backend databases. And if it has to place an extra
query to get those statistics, you just hurt your scalability even more.
Whereas big-iron already has all the statistics, and can optimize the
query plan.
Perhaps pg_cluster will handle this, by maintaining full statistics
across the cluster on each machine, so that more optimal queries can be
performed. I don't really know.
John
=:->




Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-10 Thread Jim C. Nasby
On Tue, May 10, 2005 at 10:14:11AM +1000, Neil Conway wrote:
> Jim C. Nasby wrote:
> >> No, hash joins and hash indexes are unrelated.
> >I know they are now, but does that have to be the case?
> 
> I mean, the algorithms are fundamentally unrelated. They share a bit of 
> code such as the hash functions themselves, but they are really solving 
> two different problems (disk based indexing with (hopefully) good 
> concurrency and WAL logging vs. in-memory joins via hashing with spill 
> to disk if needed).

Well, in a hash-join right now you normally end up feeding at least one
side of the join with a seqscan. Wouldn't it speed things up
considerably if you could look up hashes in the hash index instead? That
way you can eliminate going to the heap for any hashes that match. Of
course, if limited tuple visibility info was added to hash indexes
(similar to what I think is currently happening to B-trees), many of
the heap scans could be eliminated as well. A similar method could also
be used for hash aggregates, assuming they use the same hash.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"



Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-10 Thread Jim C. Nasby
On Tue, May 10, 2005 at 12:10:57AM -0400, Tom Lane wrote:
> be responsive to your search.)  (This also brings up the thought that
> it might be interesting to support hash buckets smaller than a page ...
> but I don't know how to make that work in an adaptive fashion.)

IIRC, other databases that support hash indexes also allow you to define
the bucket size, so it might be a good start to allow for that. DBA's
usually have a pretty good idea of what a table will look like in
production, so if there's clear documentation on the effect of bucket
size a good DBA should be able to make a good decision.

What's the challenge in making it adaptive: coming up with an algorithm
that gives you the optimal bucket size (which I would think there's
research on...), or allowing the index to accommodate different bucket
sizes existing in the index at once? (Presumably you don't want to
re-write the entire index every time it looks like a different bucket
size would help.)
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"



Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-10 Thread Tom Lane
"Jim C. Nasby" <[EMAIL PROTECTED]> writes:
> What's the challange to making it adaptive, comming up with an algorithm
> that gives you the optimal bucket size (which I would think there's
> research on...) or allowing the index to accommodate different bucket
> sizes existing in the index at once? (Presumably you don't want to
> re-write the entire index every time it looks like a different bucket
> size would help.)

Exactly.  That's (a) expensive and (b) really hard to fit into the WAL
paradigm --- I think we could only handle it as a REINDEX.  So if it
were adaptive at all I think we'd have to support multiple bucket sizes
existing simultaneously in the index, and I do not see a good way to do
that.

Allowing a bucket size to be specified at CREATE INDEX doesn't seem out
of line though.  We'd have to think up a scheme for index-AM-specific
index parameters ...

regards, tom lane



Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread PFC

SELECT row1, row2 FROM table1_on_machine_a NATURAL JOIN  
table2_on_machine_b
WHERE restrict_table_1 AND restrict_table_2
AND restrict_1_based_on_2;
	I don't think that's ever going to be efficient...
	What would be efficient would be, for instance, a join of part of a
table against part of another table, where both parts happen to be on the
same machine because the partitioning was done with this in mind (i.e.
partitioning on client_id and keeping the information for each client on
the same machine).

	You could build your smart pool daemon in PL/pgSQL and use dblink! At
least you have the query parser built in.
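
A very rough sketch of that approach with contrib/dblink (the connection
string, tables and columns are all invented for the example):

-- Pull the already-restricted rows for one client from the other node and
-- join them locally; with client_id-based partitioning the remote half is
-- usually on the same machine anyway.
SELECT t1.row1, t2.row2
  FROM table1 t1
  JOIN dblink('host=node_b dbname=warehouse',
              'SELECT join_col, row2 FROM table2 WHERE client_id = 42')
       AS t2(join_col integer, row2 integer)
    ON t1.join_col = t2.join_col
 WHERE t1.client_id = 42;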

I wonder how Oracle does it ;)


Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Josh Berkus
Alex,

> This is why I mention partitioning. It solves this issue by storing  
> different data sets on different machines under the same schema.  

That's clustering, actually.  Partitioning is simply dividing up a table into 
chunks and using the chunks intelligently.  Putting those chunks on separate 
machines is another thing entirely.

We're working on partitioning through the Bizgres sub-project:
www.bizgres.org  / http://pgfoundry.org/projects/bizgres/
... and will be pushing it to the main PostgreSQL when we have something.

I invite you to join the mailing list.

> These seperate chunks of the table can then be replicated as well for  
> data redundancy and so on. MySQL are working on these things, 

Don't hold your breath.   MySQL, to judge by their first "clustering" 
implementation, has a *long* way to go before they have anything usable.  In 
fact, at OSCON their engineers were asking Jan Wieck for advice.

If you have $$$ to shell out, my employer (GreenPlum) has a multi-machine 
distributed version of PostgreSQL.  It's proprietary, though.  
www.greenplum.com.

If you have more time than money, I understand that Stanford is working on 
this problem:
http://www-db.stanford.edu/~bawa/

But, overall, some people on this list are very mistaken in thinking it's an 
easy problem.   GP has devoted something like 5 engineers for 3 years to 
develop their system.  Oracle spent over $100 million to develop RAC.  

> but PG   
> just has a bunch of third party extensions, I wonder why these are  
> not being integrated into the main trunk :/ 

Because it represents a host of complex functionality which is not applicable 
to most users?  Because there are 4 types of replication and 3 kinds of 
clustering, and not all users want the same kind?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-10 Thread Jim C. Nasby
On Tue, May 10, 2005 at 11:49:50AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <[EMAIL PROTECTED]> writes:
> > What's the challange to making it adaptive, comming up with an algorithm
> > that gives you the optimal bucket size (which I would think there's
> > research on...) or allowing the index to accommodate different bucket
> > sizes existing in the index at once? (Presumably you don't want to
> > re-write the entire index every time it looks like a different bucket
> > size would help.)
> 
> Exactly.  That's (a) expensive and (b) really hard to fit into the WAL
> paradigm --- I think we could only handle it as a REINDEX.  So if it
> were adaptive at all I think we'd have to support multiple bucket sizes
> existing simultaneously in the index, and I do not see a good way to do
> that.

I'm not really familiar enough with hash indexes to know if this would
work, but if the maximum bucket size was known you could use that to
determine a maximum range of buckets to look at. In some cases, that
range would include only one bucket, otherwise it would be a set of
buckets. If you found a set of buckets, I think you could then just go
to the specific one you need.

If we assume that the maximum bucket size is one page it becomes more
realistic to take an existing large bucket and split it into several
smaller ones. This could be done on an update to the index page, or a
background process could handle it.

In any case, should this go on the TODO list?

> Allowing a bucket size to be specified at CREATE INDEX doesn't seem out
> of line though.  We'd have to think up a scheme for index-AM-specific
> index parameters ...
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"



Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-10 Thread Greg Stark
Tom Lane <[EMAIL PROTECTED]> writes:

> > What if the hash index stored *only* the hash code? That could be useful for
> > indexing large datatypes that would otherwise create large indexes.
> 
> Hmm, that could be a thought.

Hm, if you go this route of having hash indexes store tuples ordered by hash
code and storing the hash code in the index, then it seems hash indexes become
just a macro for a btree index of HASH(index columns). 

I'm not saying that to criticize this plan. In fact I think that captures most
(though not all) of what a hash index should be. 

It would be pretty useful. In fact, if that isn't how hash indexes are
implemented, then it might be useful to provide a user-visible hash(ROW)
function that allows creating such indexes as functional indexes, though
hiding it would make the SQL simpler.
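
Something close to that is already possible with an expression index; a
sketch using md5() purely as a stand-in hash function (table and column
names are invented):

-- Index the hash of a large text column instead of the column itself.
CREATE INDEX docs_body_md5_idx ON docs (md5(body));

-- Probe via the hash, then recheck the real value to weed out collisions.
SELECT doc_id
  FROM docs
 WHERE md5(body) = md5('... some large value ...')
   AND body = '... some large value ...';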

-- 
greg




Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-10 Thread Tom Lane
"Jim C. Nasby" <[EMAIL PROTECTED]> writes:
> Well, in a hash-join right now you normally end up feeding at least one
> side of the join with a seqscan. Wouldn't it speed things up
> considerably if you could look up hashes in the hash index instead?

That's called a "nestloop with inner index scan", not a hash join.

regards, tom lane



Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-10 Thread Mischa Sandberg
Quoting "Jim C. Nasby" <[EMAIL PROTECTED]>:

> I'm not really familiar enough with hash indexes to know if this
> would
> work, but if the maximum bucket size was known you could use that to
> determine a maximum range of buckets to look at. In some cases, that
> range would include only one bucket, otherwise it would be a set of
> buckets. If you found a set of buckets, I think you could then just
> go
> to the specific one you need.
> 
> If we assume that the maximum bucket size is one page it becomes
> more
> realistic to take an existing large bucket and split it into several
> smaller ones. This could be done on an update to the index page, or
> a
> background process could handle it.
> 
> In any case, should this go on the TODO list?
> 
> > Allowing a bucket size to be specified at CREATE INDEX doesn't seem
> out
> > of line though.  We'd have to think up a scheme for
> index-AM-specific
> > index parameters ...
> -- 
> Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
> Give your computer some brain candy! www.distributed.net Team #1828

Google "dynamic hash" or "linear hash". It takes care of not needing to
have varying bucket sizes.

Hash indexes are useful if you ALWAYS require disk access; they behave
like worst-case random cache-thrash tests. That's probably why dbs have
gravitated toward tree indexes instead. On the other hand, there's more
(good) to hashing than initially meets the eye.

Dynamic multiway hashing has come a long way from just splicing the bits
together from multiple columns' hash values. If you can lay your hands
on Tim Merrett's old text "Relational Information Systems", it's an
eye-opener. Picture an efficient terabyte spreadsheet.

For one thing, unlike a btree, a multicolumn hash is symmetric: it
doesn't matter which column(s) you do not specify in a partial match.

For another, a multiway hash is useful for much lower selectivity than a
btree. I built such indexes for OLAP cubes, and some dimensions were
only 10 elements wide. At the point where btree indexing becomes worse
than seqscan, a multiway hash tells you which 10% of the disk to scan.




Re: [PERFORM] Prefetch

2005-05-10 Thread Greg Stark

Matt Olson <[EMAIL PROTECTED]> writes:

> I've done other things that make sense, like using indexes, playing with the
> planner constants and turning up the postgres cache buffers.
> 
> Even playing with extream hdparm read-ahead numbers (i.e. 64738) yields no
> apparent difference in database performance.  The random nature of the I/O
> drops disk reads down to about 1MB/sec for the array.  A linear table scan
> can easily yield 70-80MB/sec on this system.  Total table size is usually
> around 1GB and with indexes should be able to fit completely in main memory.

Actually forcing things to use indexes is the wrong direction to go if you're
trying to process lots of data and want to stream it off disk as rapidly as
possible. I would think about whether you can structure your data such that
you can use sequential scans. That might mean partitioning your raw data into
separate tables and then accessing only the partitions that are relevant to
the query.

In your application that might be hard. It sounds like you would need more or
less one table per stock ticker which would really be hard to manage.

One thing you might look into is using the CLUSTER command. But postgres
doesn't maintain the cluster ordering so it would require periodically
rerunning it.
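
(A hedged sketch of the CLUSTER idea, reusing the invented tick table from the
sketch above; 7.4/8.0-era syntax, rerun after each batch load since the
ordering is not maintained.)

  CREATE INDEX ticks_symbol_ts_idx ON ticks (symbol, ts);
  CLUSTER ticks_symbol_ts_idx ON ticks;   -- rewrites the table in index order
  ANALYZE ticks;                          -- refresh statistics after the rewrite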

I'm a bit surprised by your 1MB/s rate. I would expect to see about 10MB/s
even for completely random reads. Is it possible you're seeing something else
interfering? Do you have INSERT/UPDATE/DELETE transactions happening
concurrently with this select scan? If so you should strongly look into
separating the transaction log from the data files.

-- 
greg


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


RE: [PERFORM] Partitioning / Clustering

2005-05-10 Thread tdrayton

Hi Alex,

Actually, our product can partition data among several clustered nodes
running PostgreSQL, if that is what you are looking for. Data is
distributed based on a designated column. Other tables can be
replicated to all nodes.

For SELECTs, it also knows when it can join locally or it needs to ship
rows as part of the query plan. For FK constraints (discussed here), it
also knows when it can enforce them locally or not.

Please let me know if you would like some more information.

Regards,

Tom Drayton
ExtenDB
http://www.extendb.com



> This is why I mention partitioning. It solves this issue by storing  
> different data sets on different machines under the same schema.  
> These seperate chunks of the table can then be replicated as 
> well for  
> data redundancy and so on. MySQL are working on these things, but PG  
> just has a bunch of third party extensions, I wonder why these are  
> not being integrated into the main trunk :/ Thanks for 
> pointing me to  
> PGCluster though. It looks like it should be better than 
> Slony at least.
>

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Prefetch

2005-05-10 Thread Matt Olson
My postgres binaries and WAL are on a separate disk from the raid array.  The 
table I'm doing the selects from is probably about 4GB in size and 18-20 
million records.  No concurrent or dependent inserts or deletes are going on.

Tom's point and your points about optimizing the application are well taken.  
I know my approach is suboptimal and prone to getting caught by latency 
issues (seek times, cache hit rates, etc.).  However, the question of 
prefetch in my mind is all about eliminating latencies, so I thought my 
problem would be good for the sake of discussing prefetching.

The two approaches I'm in the process of testing are Rod and Greg's suggestion 
of using 'CLUSTER', and the parallel query approach which Tom suggested. For 
the sake of not letting a good idea get away, I'll probably spend time on the 
latter as well.  

I'll report back to the list what I find and maybe do some _rough_ 
benchmarking.  This is a production app, so I can't get too much in the way 
of the daily batches.  

-- 
Matt Olson
Ocean Consulting
http://www.oceanconsulting.com/

On Tuesday 10 May 2005 11:13 am, Greg Stark wrote:
> Matt Olson writes:
> > I've done other things that make sense, like using indexes, playing with
> > the planner constants and turning up the postgres cache buffers.
> >
> > Even playing with extreme hdparm read-ahead numbers (i.e. 64738) yields
> > no apparent difference in database performance.  The random nature of the
> > I/O drops disk reads down to about 1MB/sec for the array.  A linear table
> > scan can easily yield 70-80MB/sec on this system.  Total table size is
> > usually around 1GB and with indexes should be able to fit completely in
> > main memory.
>
> Actually forcing things to use indexes is the wrong direction to go if
> you're trying to process lots of data and want to stream it off disk as
> rapidly as possible. I would think about whether you can structure your
> data such that you can use sequential scans. That might mean partitioning
> your raw data into separate tables and then accessing only the partitions
> that are relevant to the query.
>
> In your application that might be hard. It sounds like you would need more
> or less one table per stock ticker which would really be hard to manage.
>
> One thing you might look into is using the CLUSTER command. But postgres
> doesn't maintain the cluster ordering so it would require periodically
> rerunning it.
>
> I'm a bit surprised by your 1MB/s rate. I would expect to see about 10MB/s
> even for completely random reads. Is it possible you're seeing something
> else interfering? Do you have INSERT/UPDATE/DELETE transactions happening
> concurrently with this select scan? If so you should strongly look into
> separating the transaction log from the data files.



---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Prefetch

2005-05-10 Thread Tom Lane
Greg Stark <[EMAIL PROTECTED]> writes:
> Actually forcing things to use indexes is the wrong direction to go if you're
> trying to process lots of data and want to stream it off disk as rapidly as
> possible. I would think about whether you can structure your data such that
> you can use sequential scans.

Agreed.

> In your application that might be hard. It sounds like you would need more or
> less one table per stock ticker which would really be hard to manage.

Actually, in a previous lifetime I used to do pretty much the same stuff
Matt is working on.  The reason I suggested parallelizing is that what
you want is usually not so much the 200day moving average of FOO, as the
200day moving averages of a whole bunch of things.  If your input table
contains time-ordered data for all those things, then a seqscan works
out pretty well.
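
(For illustration, a minimal sketch of that shape of query; the daily_bars
table and its columns are invented. One sequential pass over the time-ordered
data produces the trailing average for every symbol at once.)

  -- 200-day average as of today, for the whole universe of symbols, in one scan:
  SELECT symbol, avg(close_price) AS ma_200d
  FROM daily_bars
  WHERE trade_date > current_date - 200
  GROUP BY symbol;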

> One thing you might look into is using the CLUSTER command. But postgres
> doesn't maintain the cluster ordering so it would require periodically
> rerunning it.

If the desired sort order is time-based, it falls out pretty much for
free in this application, because historical data doesn't change -- you
are only interested in appending at the right.

In said previous lifetime, we used Postgres for tracking our actual
transactions, but we built a custom file format for storing the
individual tick data.  That's not stuff you need transactional semantics
for; the historical data is what it is.  Besides, you need to compress
it as much as you can because there's always too much of it.  Machines
are faster and disk space cheaper than they were at the time, but I'd
still question the wisdom of using a Postgres row for each daily bar,
let alone finer-grain data.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


[PERFORM] Prefetch - OffTopic

2005-05-10 Thread Mohan, Ross
for time-series and "insane fast", nothing beats kdB, I believe

www.kx.com

Not trying to Quisling-out PG here, just hoping to respond to Mr. Olson





-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane
Sent: Tuesday, May 10, 2005 2:54 PM
To: Greg Stark
Cc: [EMAIL PROTECTED]; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Prefetch 


Greg Stark <[EMAIL PROTECTED]> writes:
> Actually forcing things to use indexes is the wrong direction to go if 
> you're trying to process lots of data and want to stream it off disk 
> as rapidly as possible. I would think about whether you can structure 
> your data such that you can use sequential scans.

Agreed.

> In your application that might be hard. It sounds like you would need 
> more or less one table per stock ticker which would really be hard to 
> manage.

Actually, in a previous lifetime I used to do pretty much the same stuff Matt 
is working on.  The reason I suggested parallelizing is that what you want is 
usually not so much the 200day moving average of FOO, as the 200day moving 
averages of a whole bunch of things.  If your input table contains time-ordered 
data for all those things, then a seqscan works out pretty well.

> One thing you might look into is using the CLUSTER command. But 
> postgres doesn't maintain the cluster ordering so it would require 
> periodically rerunning it.

If the desired sort order is time-based, it falls out pretty much for free in 
this application, because historical data doesn't change -- you are only 
interested in appending at the right.

In said previous lifetime, we used Postgres for tracking our actual 
transactions, but we built a custom file format for storing the individual tick 
data.  That's not stuff you need transactional semantics for; the historical 
data is what it is.  Besides, you need to compress it as much as you can 
because there's always too much of it.  Machines are faster and disk space 
cheaper than they were at the time, but I'd still question the wisdom of using 
a Postgres row for each daily bar, let alone finer-grain data.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-10 Thread Tom Lane
Greg Stark <[EMAIL PROTECTED]> writes:
>>> What if the hash index stored *only* the hash code? That could be useful for
>>> indexing large datatypes that would otherwise create large indexes.
>> 
>> Hmm, that could be a thought.

> Hm, if you go this route of having hash indexes store tuples ordered by hash
> code and storing the hash code in the index, then it seems hash indexes become
> just a macro for a btree index of HASH(index columns). 

No, not at all, because searching such an index will require a tree
descent, thus negating the one true advantage of hash indexes.  I see
the potential value of sorting by hashcode within an individual page,
but that doesn't mean we should do the same across the whole index.
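
(To make the "btree of HASH(columns)" idea concrete: a hedged sketch using an
ordinary expression index, with md5() standing in for whatever hash you like
and invented table names. The point above is that even this pays for a tree
descent on every lookup.)

  CREATE TABLE docs (id serial PRIMARY KEY, body text);
  CREATE INDEX docs_body_md5_idx ON docs (md5(body));   -- small index over large values

  -- equality lookup goes through the hashed expression, then rechecks the real value:
  SELECT id FROM docs
  WHERE md5(body) = md5('...the large value...')
    AND body = '...the large value...';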

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Mischa Sandberg
Quoting [EMAIL PROTECTED]:

> > exploring the option of buying 10 cheapass
> > machines for $300 each.  At the moment, that $300 buys you, from Dell, a
> > 2.5Ghz Pentium 4
> 
> Buy cheaper ass Dells with an AMD 64 3000+.  Beats the crap out of the 2.5
> GHz Pentium, especially for PostgreSQL.

Whence "Dells with an AMD 64" ?? Perhaps you skimmed:

  http://www.thestreet.com/tech/kcswanson/10150604.html
or
  http://www.eweek.com/article2/0,1759,1553822,00.asp





---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Prefetch - OffTopic

2005-05-10 Thread Chris Browne
[EMAIL PROTECTED] ("Mohan, Ross") writes:
> for time-series and "insane fast", nothing beats kdB, I believe
>
> www.kx.com

... Which is well and fine if you're prepared to require that all of
the staff that interact with data are skilled APL hackers.  Skilled
enough that they're all ready to leap into Whitney's ASCII-based
variant, K.
-- 
(format nil "[EMAIL PROTECTED]" "cbbrowne" "acm.org")
http://www.ntlug.org/~cbbrowne/functional.html
Rules of the Evil Overlord #78.  "I will not tell my Legions of Terror
"And he must  be taken alive!" The command will be:  ``And try to take
him alive if it is reasonably practical.''"


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] Prefetch - OffTopic

2005-05-10 Thread Mohan, Ross
Yes, that would be a sufficient (although not necessary) condition for being 
well and fine with kdB. 
Last time I used APL was... pre-Gregorian, so yea, that's scary to me, too. 

( Of course, one can use C/ODBC or Java/JDBC to reach kdB; once there, one uses 
SQL92, or
proprietary kSQL. )



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Chris Browne
Sent: Tuesday, May 10, 2005 4:14 PM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Prefetch - OffTopic


[EMAIL PROTECTED] ("Mohan, Ross") writes:
> for time-series and "insane fast", nothing beats kdB, I believe
>
> www.kx.com

... Which is well and fine if you're prepared to require that all of the staff 
that interact with data are skilled APL hackers.  Skilled enough that they're 
all ready to leap into Whitney's ASCII-based variant, K.
-- 
(format nil "[EMAIL PROTECTED]" "cbbrowne" "acm.org") 
http://www.ntlug.org/~cbbrowne/functional.html
Rules of the Evil Overlord #78.  "I will not tell my Legions of Terror "And he 
must  be taken alive!" The command will be:  ``And try to take him alive if it 
is reasonably practical.''" 

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] PGSQL Capacity

2005-05-10 Thread Jim C. Nasby
http://stats.distributed.net has a table that's 130M rows.
http://stats.distributed.net/participant/phistory.php?project_id=8&id=39622
is a page that hits that table, and as you can see it's quite fast. This
is on a dual opteron with 4G of memory.

Unless you're looking for sub millisecond response times, 50k rows is
nothing.

On Mon, May 09, 2005 at 09:32:18PM +0200, Steinar H. Gunderson wrote:
> On Mon, May 09, 2005 at 09:22:40PM +0200, [EMAIL PROTECTED] wrote:
> > How can I know the capacity of a pg database?
> > How many records can my table have?
> > I saw in a message that someone has 50 000 records; is that possible in a
> > table?
> > (My table has 8 string fields (length 32 chars)).
> > Thanks for your response.
> 
> You can have several million records in a table easily -- I've done 10
> million personally, but you can find people doing that many records a _day_.
> Hitting 1 billion records should probably not be impossible either -- it all
> depends on your hardware, and perhaps more importantly, what kind of queries
> you're running against it. 50 000 is absolutely no problem at all.
> 
> /* Steinar */
> -- 
> Homepage: http://www.sesse.net/
> 
> ---(end of broadcast)---
> TIP 7: don't forget to increase your free space map settings
> 

-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-10 Thread Mischa Sandberg
Quoting "Jim C. Nasby" <[EMAIL PROTECTED]>:

> Well, in a hash-join right now you normally end up feeding at least one
> side of the join with a seqscan. Wouldn't it speed things up
> considerably if you could look up hashes in the hash index instead?

You might want to google on "grace hash" and "hybrid hash".

The PG hash join is the simplest possible: build a hash table in memory,
and match an input stream against it.

*Hybrid hash* is where you spill the hash to disk in a well-designed
way. Instead of thinking of it as building a hash table in memory, think
of it as partitioning one input; if some or all of it fits in memory,
all the better. The boundary condition is the same. 

The real wizard of hybrid hash has to be Goetz Graefe, who sadly has now
joined the MS Borg. He demonstrated that for entire-table joins, hybrid
hash completely dominates sort-merge. MSSQL now uses what he developed
as an academic, but I don't know what the patent state is.

"Grace hash" is the original implementation of hybrid hash:
  Kitsuregawa, M., Tanaka, H., and Moto-oka, T. (1984).
  Architecture and Performance of Relational Algebra Machine Grace. 



---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Configing 8 gig box.

2005-05-10 Thread Jim C. Nasby
On Mon, May 09, 2005 at 04:55:53PM -0400, Joel Fradkin wrote:
> Seems to be only using like 360 meg out of 7 gig free (odd thing is I did
> see some used swap 4k out of 1.9) with a bunch of users (this may be normal,
> but it is not going overly fast so thought I would ask).

This is perfectly normal. Each postgresql backend will only report
memory usage roughly equal to shared_buffers plus the size of the code
(16M or so?). If it's in the middle of a sort or vacuum, it will use
more memory.

It's not uncommon for modern OS's to swap out stuff that's not being
used. They would rather have the memory available for disk caching,
which is normally a good trade-off.

For reference, on a 4G box running FreeBSD, there's currently 18M of
swap used. Postgresql processes typically show 53M of total VM, with
~22M resident. This is with shared buffers of 2000.
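
(As a quick sanity check of the numbers quoted below, a sketch assuming the
default 8 kB block size: the shared_buffers setting works out to roughly
192 MB, which fits with seeing only a few hundred MB reported as used.)

  SELECT 24576 * 8 / 1024 AS shared_buffers_mb;   -- 24576 buffers * 8 kB = 192 MB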

> Items I modified per commandprompt.com and watching this list etc.
> 
>  
> 
> shared_buffers = 24576
> 
> work_mem = 32768
> 
> max_fsm_pages = 10
> 
> max_fsm_relations = 1500
> 
> fsync = true
> 
> wal_sync_method = open_sync
> 
> wal_buffers = 2048
> 
> checkpoint_segments = 100 
> 
> effective_cache_size = 524288
> 
> default_statistics_target = 250
> 
>  
> 
> Any help is appreciated.
> 
>  
> 
>  
> 
>  
> 
> Joel Fradkin
> 
>  
> 
> Wazagua, Inc.
> 2520 Trailmate Dr
> Sarasota, Florida 34243
> Tel.  941-753-7111 ext 305
> 
>  
> 
> [EMAIL PROTECTED]
> www.wazagua.com
> Powered by Wazagua
> Providing you with the latest Web-based technology & advanced tools.
> C 2004. WAZAGUA, Inc. All rights reserved. WAZAGUA, Inc
>  This email message is for the use of the intended recipient(s) and may
> contain confidential and privileged information.  Any unauthorized review,
> use, disclosure or distribution is prohibited.  If you are not the intended
> recipient, please contact the sender by reply email and delete and destroy
> all copies of the original message, including attachments.
> 
>  
> 
> 
>  
> 
>  
> 

-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Jim C. Nasby
On Tue, May 10, 2005 at 07:29:59PM +0200, PFC wrote:
>   I wonder how Oracle does it ;)

Oracle *clustering* demands shared storage. So you've shifted your money
from big-iron CPUs to big-iron disk arrays.

Oracle replication works similarly to Slony, though it supports a lot more
modes (ie: synchronous).
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Mischa Sandberg
Quoting Alex Stapleton <[EMAIL PROTECTED]>:

> This is why I mention partitioning. It solves this issue by storing 
> different data sets on different machines under the same schema.  
> These seperate chunks of the table can then be replicated as well for
> data redundancy and so on. MySQL are working on these things, but PG 
> just has a bunch of third party extensions, I wonder why these are  
> not being integrated into the main trunk :/ Thanks for pointing me to
> PGCluster though. It looks like it should be better than Slony at
> least.

Across a decade or two of projects, including creating a federated
database engine for Simba, I've become rather dubious of horizontal
partitions (across disks or servers), either to improve performance, or
just to scale up and not lose performance. [[The one exception is for
 non-time-critical read-only systems, with
Slony-style replication.]]

The most successful high-volume systems I've seen have broken up
databases functionally, like a pipeline, where different applications
use different sections of the pipe. 

The highest-volume system I've worked on is Acxiom's gigantic
data-cleansing system. This is the central clearinghouse for every scrap
of demographic data that can be associated with some North American,
somewhere. Think of D&B for 300M people (some dead). The volumes are
just beyond belief, for both updates and queries. At Acxiom, the
datasets are so large, even after partitioning, that they just
constantly cycle them through memory, and commands are executed in
convoys --- sort of like riding a paternoster.
..
Anybody been tracking on what Mr Stonebraker's been up to, lately?
Datastream management. Check it out. Like replication, everybody
hand-rolled their own datastream systems until finally somebody else
generalized it well enough that it didn't have to be built from scratch
every time.

Datastream systems require practically no locking, let alone distributed
transactions. They give you some really strong guarantees on transaction
elapsed-time and throughput. 
...
Where is this all leading? Well, for scaling data like this, the one
feature that you need is the ability of procedures/rules on one server
to perform queries/procedures on another. MSSQL has linked servers and
(blech) OpenQuery. This lets you do reasonably-efficient work when you
only deal with one table at a time. Do NOT try anything fancy with
multi-table joins; timeouts are unavoidable, and painful.

Postgres has a natural advantage in such a distributed server system:
all table/index stats are openly available through the SQL interface,
for one server to make rational query plans involving another server's
resources. God! I would have killed for that when I was writing a
federated SQL engine; the kluges you need to do this at arms-length from
that information are true pain.

So where should I go look, to see what's been done so far, on a Postgres
that can treat another PG server as a new table type?



---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Jim C. Nasby
On Tue, May 10, 2005 at 02:55:55PM -0700, Mischa Sandberg wrote:
> just beyond belief, for both updates and queries. At Acxiom, the
> datasets are so large, even after partitioning, that they just
> constantly cycle them through memory, and commands are executed in
> convoys --- sort of like riding a paternoster.

Speaking of which... what's the status of the patch that allows seqscans
to piggyback on already running seqscans on the same table?

> So where should I go look, to see what's been done so far, on a Postgres
> that can treat another PG server as a new table type?

To the best of my knowledge no such work has been done. There is a
project (whose name escapes me) that lets you run queries against a
remote postgresql server from a postgresql connection to a different
server, which could serve as the basis for what you're proposing.
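
(For illustration of that sort of cross-server query, and not necessarily the
project meant above: contrib/dblink lets one backend run a statement on
another server and read the result as a rowset. The connection details and
names below are invented.)

  SELECT *
  FROM dblink('host=otherbox dbname=stats user=report',
              'SELECT id, val FROM some_remote_table LIMIT 10')
       AS remote(id int, val text);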

BTW, given your experience, you might want to check out Bizgres.
(http://pgfoundry.org/projects/bizgres/) I'm sure your insights would be
most welcome.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-10 Thread Mark Lewis
If the original paper was published in 1984, then it's been more than 20
years.  Any potential patents would already have expired, no?

-- Mark Lewis

On Tue, 2005-05-10 at 14:35, Mischa Sandberg wrote:
> Quoting "Jim C. Nasby" <[EMAIL PROTECTED]>:
> 
> > Well, in a hash-join right now you normally end up feeding at least
> > one
> > side of the join with a seqscan. Wouldn't it speed things up
> > considerably if you could look up hashes in the hash index instead?
> 
> You might want to google on "grace hash" and "hybrid hash".
> 
> The PG hash join is the simplest possible: build a hash table in memory,
> and match an input stream against it.
> 
> *Hybrid hash* is where you spill the hash to disk in a well-designed
> way. Instead of thinking of it as building a hash table in memory, think
> of it as partitioning one input; if some or all of it fits in memory,
> all the better. The boundary condition is the same. 
> 
> The real wizard of hybrid hash has to be Goetz Graefe, who sadly has now
> joined the MS Borg. He demonstrated that for entire-table joins, hybrid
> hash completely dominates sort-merge. MSSQL now uses what he developed
> as an academic, but I don't know what the patent state is.
> 
> "Grace hash" is the original implementation of hybrid hash:
>   Kitsuregawa, M., Tanaka, H., and Moto-oka, T. (1984).
>   Architecture and Performance of Relational Algebra Machine Grace. 
> 
> 
> 
> ---(end of broadcast)---
> TIP 3: if posting/reading through Usenet, please send an appropriate
>   subscribe-nomail command to [EMAIL PROTECTED] so that your
>   message can get through to the mailing list cleanly


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-10 Thread Tom Lane
Mischa Sandberg <[EMAIL PROTECTED]> writes:
> The PG hash join is the simplest possible: build a hash table in memory,
> and match an input stream against it.

> *Hybrid hash* is where you spill the hash to disk in a well-designed
> way. Instead of thinking of it as building a hash table in memory, think
> of it as partitioning one input; if some or all of it fits in memory,
> all the better. The boundary condition is the same. 

[ raised eyebrow... ]  Apparently you've not read the code.  It's been
hybrid hashjoin since we got it from Berkeley.  Probably not the best
possible implementation of the concept, but we do understand about spill
to disk.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-10 Thread Mischa Sandberg
Quoting Tom Lane <[EMAIL PROTECTED]>:

> Mischa Sandberg <[EMAIL PROTECTED]> writes:
> > The PG hash join is the simplest possible: build a hash table in
> memory,  and match an input stream against it.
> 
> [ raised eyebrow... ]  Apparently you've not read the code.  It's
> been hybrid hashjoin since we got it from Berkeley.  Probably not the
> best possible implementation of the concept, but we do 
> understand about spill to disk.

Apologies. I stopped reading around line 750 (PG 8.0.1) in
src/backend/executor/nodeHashjoin.c

if (!node->hj_hashdone)
{

/*
 * execute the Hash node, to build the hash table
 */
hashNode->hashtable = hashtable;
(void) ExecProcNode((PlanState *) hashNode);
...

and missed the comment:
/*
 * Open temp files for outer batches,
 */

Will quietly go and read twice, talk once. 


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-10 Thread Greg Stark

Tom Lane <[EMAIL PROTECTED]> writes:

> No, not at all, because searching such an index will require a tree
> descent, thus negating the one true advantage of hash indexes.  

The hash index still has to do a tree descent, it just has a larger branching
factor than the btree index.

btree indexes could have a special case hack to optionally use a large
branching factor for the root node, effectively turning them into hash
indexes. That would be useful for cases where you know the values will be very
evenly distributed and won't need to scan ranges, ie, when you're indexing a
hash function.

-- 
greg


---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Christopher Kings-Lynne
> This is why I mention partitioning. It solves this issue by storing
> different data sets on different machines under the same schema.  These
> separate chunks of the table can then be replicated as well for data
> redundancy and so on. MySQL are working on these things

*laff*
Yeah, like they've been working on views for the last 5 years, and still
haven't released them :D :D :D

Chris
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Mischa Sandberg
Quoting Christopher Kings-Lynne <[EMAIL PROTECTED]>:

> > This is why I mention partitioning. It solves this issue by storing
> > different data sets on different machines under the same schema. 
> > These seperate chunks of the table can then be replicated as well for 
> > data redundancy and so on. MySQL are working on these things
> *laff*
> Yeah, like they've been working on views for the last 5 years, and
> still haven't released them :D :D :D

? 
http://dev.mysql.com/doc/mysql/en/create-view.html
...for MySQL 5.0.1+ ?


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Joshua D. Drake
Mischa Sandberg wrote:
> Quoting Christopher Kings-Lynne <[EMAIL PROTECTED]>:
> 
> > > This is why I mention partitioning. It solves this issue by storing
> > > different data sets on different machines under the same schema.
> > > These separate chunks of the table can then be replicated as well for
> > > data redundancy and so on. MySQL are working on these things
> > *laff*
> > Yeah, like they've been working on views for the last 5 years, and
> > still haven't released them :D :D :D
> 
> ?
> http://dev.mysql.com/doc/mysql/en/create-view.html
> ...for MySQL 5.0.1+ ?

Yes but MySQL 5 isn't out yet (considered stable).

Sincerely,
Joshua D. Drake


---(end of broadcast)---
TIP 6: Have you searched our list archives?
   http://archives.postgresql.org



Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Christopher Kings-Lynne

> > *laff*
> > Yeah, like they've been working on views for the last 5 years, and
> > still haven't released them :D :D :D
> 
> ?
> http://dev.mysql.com/doc/mysql/en/create-view.html
> ...for MySQL 5.0.1+ ?

Give me a call when it's RELEASED.
Chris
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Mischa Sandberg
Quoting Christopher Kings-Lynne <[EMAIL PROTECTED]>:

> 
> >>*laff*
> >>Yeah, like they've been working on views for the last 5 years, and
> >>still haven't released them :D :D :D
> > 
> > ? 
> > http://dev.mysql.com/doc/mysql/en/create-view.html
> > ...for MySQL 5.0.1+ ?
> 
> Give me a call when it's RELEASED.


:-) Touche'



---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Mischa Sandberg
Quoting "Jim C. Nasby" <[EMAIL PROTECTED]>:

> To the best of my knowledge no such work has been done. There is a
> project (who's name escapes me) that lets you run queries against a
> remote postgresql server from a postgresql connection to a different
> server, which could serve as the basis for what you're proposing.

Okay, if the following looks right to the powers that be, I'd like to start
a project. Here's the proposition:

"servername.dbname.schema.object" would change RangeVar, which would
affect much code. "dbname.schema.object" itself is not implemented in
8.0. So, simplicity dictates something like:

table pg_remote(schemaname text, connectby text, remoteschema text)

The pg_statistic info from a remote server cannot be cached in local
pg_statistic, without inventing pseudo reloids as well as a
pseudoschema. Probably cleaner to cache it somewhere else. I'm still
reading down the path that puts pg_statistic data where costsize can get
at it.

First step: find out whether one can link libpq.so to postmaster :-)
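
(A sketch of what the proposed mapping might look like as DDL, purely
restating the proposal above; nothing like this exists yet, and every name
here is hypothetical.)

  CREATE TABLE pg_remote (
      schemaname   text,   -- local schema name acting as the alias
      connectby    text,   -- libpq connection string for the other server
      remoteschema text    -- schema to expose from that server
  );

  INSERT INTO pg_remote VALUES ('acct_remote', 'host=acct-db dbname=ledger', 'public');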


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-10 Thread Tom Lane
Greg Stark <[EMAIL PROTECTED]> writes:
> Tom Lane <[EMAIL PROTECTED]> writes:
>> No, not at all, because searching such an index will require a tree
>> descent, thus negating the one true advantage of hash indexes.  

> The hash index still has to do a tree descent, it just has a larger branching
> factor than the btree index.

There is *no* tree descent in a hash index: you go directly to the
bucket you want.

If the bucket spans more than one page, you pay something, but this
strikes me as being equivalent to the case of multiple equal keys
spanning multiple pages in a btree.  It works, but it's not the design
center.

> btree indexes could have a special case hack to optionally use a large
> branching factor for the root node, effectively turning them into hash
> indexes.

No, because you'd still have to fetch and search the root node.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Neil Conway
Josh Berkus wrote:
> Don't hold your breath.   MySQL, to judge by their first "clustering"
> implementation, has a *long* way to go before they have anything usable.

Oh? What's wrong with MySQL's clustering implementation?

-Neil
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Joshua D. Drake
Neil Conway wrote:
> Josh Berkus wrote:
> > Don't hold your breath.   MySQL, to judge by their first "clustering"
> > implementation, has a *long* way to go before they have anything usable.
> 
> Oh? What's wrong with MySQL's clustering implementation?
> 
> -Neil

Ram only tables :)

---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Bruno Wolff III
On Tue, May 10, 2005 at 08:02:50 -0700,
  Adam Haberlach <[EMAIL PROTECTED]> wrote:
> 
> 
> With all the Opteron v. Xeon around here, and talk of $30,000 machines,
> perhaps it would be worth exploring the option of buying 10 cheapass
> machines for $300 each.  At the moment, that $300 buys you, from Dell, a
> 2.5Ghz Pentium 4 w/ 256mb of RAM and a 40Gb hard drive and gigabit ethernet.
> The aggregate CPU and bandwidth is pretty stupendous, but not as easy to
> harness as a single machine.

That isn't going to be ECC ram. I don't think you really want to use
non-ECC ram in a critical database.

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Configing 8 gig box.

2005-05-10 Thread Tom Lane
"Jim C. Nasby" <[EMAIL PROTECTED]> writes:
> On Mon, May 09, 2005 at 04:55:53PM -0400, Joel Fradkin wrote:
>> Seems to be only using like 360 meg out of 7 gig free (odd thing is I did
>> see some used swap 4k out of 1.9) with a bunch of users (this may be normal,
>> but it is not going overly fast so thought I would ask).

> This is perfectly normal. Each postgresql backend will only report
> memory usage roughly equal to shared_buffers plus the size of the code
> (16M or so?). If it's in the middle of a sort or vacuum, it will use
> more memory.

One thing to note is that depending on which Unix variant you are using,
top may claim that any particular backend process is using the portion
of shared memory that it's actually physically touched.  This means that
the claimed size of a backend process will grow as it runs (and randomly
needs to touch pages that are in different slots of the shared-memory
buffers) regardless of any actual objective growth in memory needs.

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Prefetch

2005-05-10 Thread Sam Vilain
Matt Olson wrote:
> Other databases like Oracle and DB2 implement some sort of row prefetch.  Has
> there been serious consideration of implementing something like a prefetch
> subsystem?  Does anyone have any opinions as to why this would be a bad idea
> for postgres?
> 
> Postgres is great for a multiuser environment and OLTP applications.  However,
> in this set up, a data warehouse, the observed performance is not what I
> would hope for.
Oracle doesn't pre-fetch data to get its fast results in this case;
pre-fetching doesn't give you the 100-times speed increases.
Bitmap indexes are very important for data mining.  You might want to see
   http://www.it.iitb.ac.in/~rvijay/dbms/proj/
I have no idea how well developed this is, but this is often the biggest
win with Data Warehousing.  If it works, you'll get results back in seconds,
if it doesn't you'll have plenty of time while your queries run to reflect on
the possibility that commercial databases might actually have important features
that haven't even penetrated the awareness of most free database developers.
Another trick you can use with large data sets like this when you want results
back in seconds is to have regularly updated tables that aggregate the data
along each column normally aggregated against the main data set.
Of course, Pg doesn't have the nice features that make this just work and make
queries against your data source faster (called "OLAP Query rewrite" in
Oracle), so you'll have to put a lot of work into changing your application to
figure out when to use the summary tables.  As far as I know it doesn't have
materialized views, either, so updating these summary tables is also a more
complex task than just a single REFRESH command.
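
(A hedged sketch of the summary-table approach with invented names; without
materialized views, the "refresh" is something you schedule yourself, e.g. at
the end of the nightly load.)

  CREATE TABLE sales_by_region_day AS
  SELECT region, sale_date, sum(amount) AS total_amount, count(*) AS n_rows
  FROM sales
  GROUP BY region, sale_date;

  -- periodic refresh: rebuild the whole thing (or delete/insert just the recent slice)
  TRUNCATE sales_by_region_day;
  INSERT INTO sales_by_region_day
  SELECT region, sale_date, sum(amount), count(*)
  FROM sales
  GROUP BY region, sale_date;
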
Maybe some bright person will prove me wrong by posting some working
information about how to get these apparently absent features working.
You might also want to consider ditching RAID 5 and switching to plain
mirroring.  RAID 5 is a helluva performance penalty (by design).  This is
why they said RAID - fast, cheap, reliable - pick any two.  RAID 5 ain't
fast.  But that's probably not your main problem.
Sam.
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Neil Conway
Joshua D. Drake wrote:
> Neil Conway wrote:
> > Oh? What's wrong with MySQL's clustering implementation?
> Ram only tables :)

Sure, but that hardly makes it not "usable". Considering the price of
RAM these days, having enough RAM to hold the database (distributed over
the entire cluster) is perfectly acceptable for quite a few people.

(Another deficiency is in 4.0, predicates in queries would not be pushed 
down to storage nodes -- so you had to stream the *entire* table over 
the network, and then apply the WHERE clause at the frontend query node. 
That is fixed in 5.0, though.)

-Neil
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faq


Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Josh Berkus
Neil,

> Sure, but that hardly makes it not "usable". Considering the price of
> RAM these days, having enough RAM to hold the database (distributed over
> the entire cluster) is perfectly acceptable for quite a few people.

The other problem, as I was told it at OSCON, was that these were not 
high-availability clusters; it's impossible to add a server to an existing 
cluster, and a server going down is liable to take the whole cluster down.  
Mind you, I've not tried that aspect of it myself; once I saw the ram-only 
rule, we switched to something else.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Prefetch

2005-05-10 Thread Christopher Kings-Lynne
> Another trick you can use with large data sets like this when you want
> results back in seconds is to have regularly updated tables that aggregate
> the data along each column normally aggregated against the main data set.
> 
> Maybe some bright person will prove me wrong by posting some working
> information about how to get these apparently absent features working.
Most people just use simple triggers to maintain aggregate summary tables...
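
(A hedged sketch of that trigger approach; all names are invented, plpgsql is
assumed to be installed, and concurrency corner cases are ignored.)

  CREATE TABLE sales (region int, sale_date date, amount numeric);
  CREATE TABLE sales_daily (region int, sale_date date, total numeric,
                            PRIMARY KEY (region, sale_date));

  CREATE FUNCTION sales_daily_upd() RETURNS trigger AS '
  BEGIN
      -- bump the running total for this day/region, or start a new row
      UPDATE sales_daily SET total = total + NEW.amount
       WHERE region = NEW.region AND sale_date = NEW.sale_date;
      IF NOT FOUND THEN
          INSERT INTO sales_daily VALUES (NEW.region, NEW.sale_date, NEW.amount);
      END IF;
      RETURN NEW;
  END;
  ' LANGUAGE plpgsql;

  CREATE TRIGGER sales_daily_trig AFTER INSERT ON sales
      FOR EACH ROW EXECUTE PROCEDURE sales_daily_upd();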
Chris
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]