Re: [GENERAL] The REAL cost of joins

2010-03-04 Thread Richard Huxton

On 03/03/10 21:59, Marcin Krol wrote:

What do you think of this? And in general: when (if?) should one
denormalize data?


As a last resort. No sooner.

The support costs of denormalising your database is such that if you can 
reasonably just buy more hardware / add caching / etc, do so.


Oh, and I'm afraid your tests are probably meaningless.
1. The times are too small to be accurate.
2. You have the overhead of starting psql and connecting to the 
database, starting a new backend etc.

3. You have the overhead of parsing the query
4. You're fetching all rows (which presumably aren't many) - not a 
terribly useful scenario.


If you wanted to measure actual join costs, you'd need to repeat the 
tests (say) 100-1000 times in a loop, optionally with prepared plans. 
Varying WHERE clauses might be useful too, if that's how your real 
application will work.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] The REAL cost of joins

2010-03-04 Thread Grzegorz Jaśkiewicz
the reason you are using joins, most often is because your schema is
normalized. One way or another, de-normalisation + queries will cost you
more, than normalised tables and joins.
That's at least the short answer.


Re: [GENERAL] The REAL cost of joins

2010-03-04 Thread Craig Ringer

On 4/03/2010 4:47 PM, Richard Huxton wrote:

On 03/03/10 21:59, Marcin Krol wrote:

What do you think of this? And in general: when (if?) should one
denormalize data?


As a last resort. No sooner.

The support costs of denormalising your database is such that if you can
reasonably just buy more hardware / add caching / etc, do so.


Well, materialized views are a denormalization of sorts, and a really 
useful one - but they're not the authorative store of the data.


Things like views, materialized views, trigger-maintained side tables 
etc are what you should look at before even considering denormalizing 
your main data storage.


--
Craig Ringer

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] The REAL cost of joins

2010-03-04 Thread Merlin Moncure
On Wed, Mar 3, 2010 at 4:59 PM, Marcin Krol mrk...@gmail.com wrote:
 Hello everyone,

 I have inadvertently set off a furball on an unrelated ng on what is the
 actual cost of SQL joins. But there's no reliable conclusion. I would like
 to actually know that, that is, are JOINs truly expensive?

There's a lot of variables that go in to join performance.  Are the
necessary columns indexed, what is the comparison operator, size of
the index, etc?  Complicating the calculation, postgres has access to
different algorithms it chooses depending on circumstances.  In
general, joins cost less than a lot of novice programmers think
because they are simply unaware of the number of highly efficient ways
to attack the problem of doing set operations on lists of sorted
items, or perhaps have had bad experiences on databases that had a
lousy selection of algorithms or poor (if any) statistics.

The three biggest factors on join performance are 1: having index in
appropriate places,  2: if/when there is a full cache fault and the
database has to get information from disk when executing the join.
And 3: there being enough working memory to do things like 'in query'
sorts. Point 2 is where denormalization can reap real benefits because
it can reduce the number of disk seeks to get data (compared to the
cpu and memory, disk seeks take eons).

Databases have a characteristic that reminds me of some of the
challenges that aircraft engineers face.  Jets that go supersonic have
an entirely different set of operating principles because above the
speed of sound all the rules changes in terms of thermals, vibration,
stresses, etc.  Similarly in the database world performance tend to
lurch in a very unpleasant direction when you active working set of
data is not able to be properly served in RAM.   The 'sonic boom' you
hear is your disk drives grinding as the clock ticks away...for many
DBAs hearing this sound the first time is a life changing experience.
This is when those little things you never bothered to think about
like having the records that tend to be pulled up together by grouped
together on disk become suddenly very important...you discover the
CLUSTER command and become best friends :-).

merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] The REAL cost of joins

2010-03-04 Thread Simon Riggs
On Wed, 2010-03-03 at 22:59 +0100, Marcin Krol wrote:

 What do you think of this? And in general: when (if?) should one 
 denormalize data?

Your example shows that its quicker to put the ingredients on the table
than it is to bake a cake. I'm not sure that's an argument against
baking, nor is it an argument in favour of baking the night before.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] The REAL cost of joins

2010-03-03 Thread Marcin Krol

Hello everyone,

I have inadvertently set off a furball on an unrelated ng on what is the 
actual cost of SQL joins. But there's no reliable conclusion. I would 
like to actually know that, that is, are JOINs truly expensive?


As they say, one measurement is worth a thousand opinions, so I've done 
measurement on my PG app:


$ time echo \c hrs;
SELECT hosts.ip, reservation.start_date, architecture.architecture, 
os_kind.os_kind, os_rel.os_rel, os_version.os_version, project.project, 
email.email FROM hosts

 INNER JOIN project ON project.id = hosts.project_id
 INNER JOIN architecture ON hosts.architecture_id = architecture.id
 INNER JOIN os_kind ON os_kind.id = hosts.os_kind_id
 INNER JOIN os_rel ON hosts.os_rel_id = os_rel.id
 INNER JOIN os_version ON hosts.os_version_id = os_version.id
 INNER JOIN reservation_hosts ON hosts.id = reservation_hosts.host_id
 INNER JOIN reservation on reservation.id = 
reservation_hosts.reservation_id

 INNER JOIN email ON reservation.email_id = email.id

; | psql  /dev/null

real0m0.099s
user0m0.015s
sys 0m0.005s



$ time echo \c hrs;
 SELECT hosts.ip FROM hosts;
 SELECT reservation.start_date FROM reservation;
 SELECT architecture.architecture FROM architecture;
 SELECT os_rel.os_rel FROM os_rel;
 SELECT os_version.os_version FROM os_version;
 SELECT project.project FROM project;
 SELECT email.email FROM email;
  | psql  /dev/null

real0m0.046s
user0m0.008s
sys 0m0.004s

Note: I've created indexes on those tables, both on data columns like 
hosts.ip and on .id columns.


What do you think of this? And in general: when (if?) should one 
denormalize data?


Regards,
mk


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] The REAL cost of joins

2010-03-03 Thread Peter Hunsberger
On Wed, Mar 3, 2010 at 3:59 PM, Marcin Krol mrk...@gmail.com wrote:

 Hello everyone,

 I have inadvertently set off a furball on an unrelated ng on what is the 
 actual cost of SQL joins. But there's no reliable conclusion. I would like to 
 actually know that, that is, are JOINs truly expensive?

 As they say, one measurement is worth a thousand opinions, so I've done 
 measurement on my PG app:

 $ time echo \c hrs;
 SELECT hosts.ip, reservation.start_date, architecture.architecture, 
 os_kind.os_kind, os_rel.os_rel, os_version.os_version, project.project, 
 email.email FROM hosts
  INNER JOIN project ON project.id = hosts.project_id
  INNER JOIN architecture ON hosts.architecture_id = architecture.id
  INNER JOIN os_kind ON os_kind.id = hosts.os_kind_id
  INNER JOIN os_rel ON hosts.os_rel_id = os_rel.id
  INNER JOIN os_version ON hosts.os_version_id = os_version.id
  INNER JOIN reservation_hosts ON hosts.id = reservation_hosts.host_id
  INNER JOIN reservation on reservation.id = reservation_hosts.reservation_id
  INNER JOIN email ON reservation.email_id = email.id

 ; | psql  /dev/null

 real    0m0.099s
 user    0m0.015s
 sys     0m0.005s



 $ time echo \c hrs;
  SELECT hosts.ip FROM hosts;
  SELECT reservation.start_date FROM reservation;
  SELECT architecture.architecture FROM architecture;
  SELECT os_rel.os_rel FROM os_rel;
  SELECT os_version.os_version FROM os_version;
  SELECT project.project FROM project;
  SELECT email.email FROM email;
   | psql  /dev/null

 real    0m0.046s
 user    0m0.008s
 sys     0m0.004s

 Note: I've created indexes on those tables, both on data columns like 
 hosts.ip and on .id columns.

 What do you think of this? And in general: when (if?) should one denormalize 
 data?

I don't get what these two examples are supposed to show?  They do
not, in general, create equivalent output data. If you had to write
procedural logic to do the same correlation between the individual
result sets as the joins are performing then you'd be back to at least
the same performance and probably worse, so what exactly is being
compared here?

As to your second question; well, outside of not doing premature
optimization I'll observe that for operational databases,
denormalization rarely gives me enough savings to justify the
resultant extra data management required to maintain it.  If you're
talking about a data warehouse (and given the way you've worded the
question I supsect you are not?) then that's a different question, in
that case I prefer to generate the data from the operational side of
the house so you're no longer paying an operational cost to maintain
the denormalized data (you've got a whole different set of costs
instead).

--
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general