[HACKERS] Looking for a tool to * pg tables as ERDs

2006-02-23 Thread Ron Peacetree
Where * == 
{print | save to PDF | save to mumble format | display on screen}

Anyone know of one?

TiA
Ron



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Ron Peacetree
I've now gotten verification from multiple working DBAs that DB2, Oracle, and
SQL Server can achieve ~250MBps ASTR (with as much as ~500MBps ASTR in
setups akin to Oracle RAC) when attached to a decent (not outrageous, but
decent) HD subsystem...

I've not yet had any RW DBA verify Jeff Baker's supposition that ~1GBps ASTR is
attainable.  Cache based bursts that high, yes.  ASTR, no.

The DBAs in question run RW installations that include Solaris, M$, and Linux
OSes for companies that just about everyone on these lists is likely to recognize.

Also, the implication of these pg IO limits is that money spent on even
moderately priced 300MBps SATA II based RAID HW is wasted $'s.

In total, this situation is a recipe for driving potential pg users to other
DBMSs.

25MBps in and 15MBps out is =BAD=.

Have we instrumented the code in enough detail that we can tell _exactly_ where
the performance drainage is?

We have to fix this.
Ron  


-Original Message-
From: Luke Lonergan [EMAIL PROTECTED]
Sent: Oct 5, 2005 11:24 AM
To: Michael Stone [EMAIL PROTECTED], Martijn van Oosterhout 
kleptog@svana.org
Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

Nope - it would be disk wait.

COPY is CPU bound on I/O subsystems faster than 50 MB/s on COPY (in) and about 
15 MB/s (out).

- Luke

 -Original Message-
From:   Michael Stone [mailto:[EMAIL PROTECTED]
Sent:   Wed Oct 05 09:58:41 2005
To: Martijn van Oosterhout
Cc: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org
Subject:Re: [HACKERS] [PERFORM] A Better External Sort?

On Sat, Oct 01, 2005 at 06:19:41PM +0200, Martijn van Oosterhout wrote:
COPY TO /dev/null WITH binary
13MB/s    55% user 45% system  (ergo, CPU bound)
[snip]
the most expensive. But it does point out that the whole process is
probably CPU bound more than anything else.

Note that 45% of that cpu usage is system--which is where IO overhead
would end up being counted. Until you profile where your system time is
going it's premature to say it isn't an IO problem.

Mike Stone









Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Ron Peacetree
First I wanted to verify that pg's IO rates were inferior to The Competition.
Now there's at least an indication that someone else has solved similar
problems.  Existence proofs make some things easier ;-)

Is there any detailed programmer level architectural doc set for pg?  I know
the best doc is the code, but the code in isolation is often the Slow Path to
understanding with systems as complex as a DBMS IO layer.

Ron
 

-Original Message-
From: Joshua D. Drake [EMAIL PROTECTED]
Sent: Oct 5, 2005 1:18 PM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?


The source is freely available for your perusal. Please feel free to
point us in specific directions in the code where you may see some
benefit. I am positive that all of us who can would put resources into
fixing the issue if we had a specific direction to attack.

Sincerely,

Joshua D. Drake



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Ron Peacetree
I'm putting in as much time as I can afford thinking about pg related
performance issues.  I'm doing it because of a sincere desire to help
understand and solve them, not to annoy people.

If I didn't believe in pg, I wouldn't be posting thoughts about how to
make it better.  

It's probably worth some review (suggestions marked with a +):

+I came to the table with a possibly better way to deal with external
sorts (that now has branched into 2 efforts: short term improvements
to the existing code, and the original from-the-ground-up idea).  That
suggestion was based on a great deal of prior thought and research,
despite what some others might think.

Then we were told that our IO limit was lower than I thought.

+I suggested that as a Quick Fix we try making sure we do IO
transfers in large enough chunks, based on the average access time
of the physical device in question, so as to achieve the device's
ASTR (i.e., at least 600KB per access for a 50MBps ASTR device with
a 12ms average access time) whenever circumstances allow.
As far as I know, this experiment hasn't been tried yet.

I asked some questions about physical layout and format translation
overhead being possibly suboptimal that seemed to be agreed to, but
specifics as to where we are taking the hit don't seem to have been
made explicit yet.

+I made the from-left-field suggestion that perhaps a pg native fs
format would be worth consideration.  This is a major project, so
the suggestion was to at least some extent tongue-in-cheek.

+I then made some suggestions about better code instrumentation
so that we can more accurately characterize where the bottlenecks are.

We were also told that evidently we are CPU bound far before one
would naively expect to be based on the performance specifications
of the components involved.

Double checking among the pg developer community led to some
differing opinions as to what the actual figures were and under what
circumstances they were achieved.  Further discussion seems to have
converged on both accurate values and a better understanding as to
the HW and SW  needed; _and_ we've gotten some RW confirmation
as to what current reasonable expectations are within this problem
domain from outside the pg community.

+Others have made some good suggestions in this thread as well.
Since I seem to need to defend my tone here, I'm not detailing them
here.  That should not be construed as a lack of appreciation of them.

Now I've asked for the quickest path to detailed understanding of the
pg IO subsystem.  The goal being to get more up to speed on its
coding details.  Certainly not to annoy you or anyone else.

At least from my perspective, this for the most part seems to have
been a useful and reasonable engineering discussion that has
exposed a number of important things.
  
Regards,
Ron



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-04 Thread Ron Peacetree
The constants related to inlining involve pcode, not actual assembly 
instructions, and are compiler version dependent as well as subject to change 
without notice by the GNU folks...

from:
http://gcc.gnu.org/onlinedocs/gcc-3.3.5/gcc/Optimize-Options.html#Optimize-Options

-finline-limit=n
By default, gcc limits the size of functions that can be inlined. This flag 
allows the control of this limit for functions that are explicitly marked as 
inline (i.e., marked with the inline keyword or defined within the class 
definition in c++). n is the size of functions that can be inlined in number of 
pseudo instructions (not counting parameter handling). The default value of n 
is 600. Increasing this value can result in more inlined code at the cost of 
compilation time and memory consumption. Decreasing usually makes the 
compilation faster and less code will be inlined (which presumably means slower 
programs). This option is particularly useful for programs that use inlining 
heavily such as those based on recursive templates with C++.

Inlining is actually controlled by a number of parameters, which may be 
specified individually by using --param name=value. The -finline-limit=n option 
sets some of these parameters as follows:

max-inline-insns
is set to n.
max-inline-insns-single
is set to n/2.
max-inline-insns-auto
is set to n/2.
min-inline-insns
is set to 130 or n/4, whichever is smaller.
max-inline-insns-rtl
is set to n. 

Using -finline-limit=600 thus results in the default settings for these 
parameters. See below for a documentation of the individual parameters 
controlling inlining.

Note: pseudo instruction represents, in this particular context, an abstract 
measurement of function's size. In no way, it represents a count of assembly 
instructions and as such its exact meaning might change from one release to an 
another. 

Further Down It Says...

--param name=value
In some places, GCC uses various constants to control the amount of 
optimization that is done. For example, GCC will not inline functions that 
contain more that a certain number of instructions. You can control some of 
these constants on the command-line using the --param option.

The names of specific parameters, and the meaning of the values, are tied to 
the internals of the compiler, and are subject to change without notice in 
future releases.

In each case, the value is an integer. The allowable choices for name are given 
in the following table:

snip

max-inline-insns-single
Several parameters control the tree inliner used in gcc. This number sets the 
maximum number of instructions (counted in gcc's internal representation) in a 
single function that the tree inliner will consider for inlining. This only 
affects functions declared inline and methods implemented in a class 
declaration (C++). The default value is 300.

max-inline-insns-auto
When you use -finline-functions (included in -O3), a lot of functions that 
would otherwise not be considered for inlining by the compiler will be 
investigated. To those functions, a different (more restrictive) limit compared 
to functions declared inline can be applied. The default value is 300.

max-inline-insns
The tree inliner does decrease the allowable size for single functions to be 
inlined after we already inlined the number of instructions given here by 
repeated inlining. This number should be a factor of two or more larger than 
the single function limit. Higher numbers result in better runtime performance, 
but incur higher compile-time resource (CPU time, memory) requirements and 
result in larger binaries. Very high values are not advisable, as too large 
binaries may adversely affect runtime performance. The default value is 600.

max-inline-slope
After exceeding the maximum number of inlined instructions by repeated 
inlining, a linear function is used to decrease the allowable size for single 
functions. The slope of that function is the negative reciprocal of the number 
specified here. The default value is 32.

min-inline-insns
The repeated inlining is throttled more and more by the linear function after 
exceeding the limit. To avoid too much throttling, a minimum for this function 
is specified here to allow repeated inlining for very small functions even when 
a lot of repeated inlining already has been done. The default value is 130.

max-inline-insns-rtl
For languages that use the RTL inliner (this happens at a later stage than tree 
inlining), you can set the maximum allowable size (counted in RTL instructions) 
for the RTL inliner with this parameter. The default value is 600.


-Original Message-
From: Martijn van Oosterhout kleptog@svana.org
Sent: Oct 4, 2005 8:24 AM
To: Simon Riggs [EMAIL PROTECTED]
Cc: Tom Lane [EMAIL PROTECTED], Ron Peacetree [EMAIL PROTECTED], 
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort

Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Ron Peacetree
Jeff, are those _burst_ rates from HD buffer or _sustained_ rates from
actual HD media?  Rates from IO subsystem buffer or cache are
usually considerably higher than Average Sustained Transfer Rate.

Also, are you measuring _raw_ HD IO (bits straight off the platters, no
FS or other overhead) or _cooked_ HD IO (actual FS or pg IO)?

BTW, it would seem Useful to measure all of raw HD IO, FS HD IO,
and pg HD IO as this would give us an idea of just how much overhead
each layer is imposing on the process.

We may be able to get better IO than we currently are for things like
sorts by the simple expedient of making sure we read enough data per
seek.

For instance, a HD with a 12ms average access time and an ASTR of
50MBps should always read _at least_ 600KB/access or it is impossible
for it to achieve its rated ASTR.

This number will vary according to the average access time and the
ASTR of your physical IO subsystem, but the concept is valid for _any_
physical IO subsystem.
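
To make that arithmetic concrete, here's a toy stand-alone sketch in C
(invented numbers, no real IO is performed):

/* Effective throughput of a drive as a function of bytes read per
 * access: 12ms average access time, 50MBps ASTR.  Pure arithmetic. */
#include <stdio.h>

int main(void)
{
    const double astr   = 50e6;     /* bytes/sec sustained off the platters */
    const double access = 0.012;    /* seconds per average seek + rotate    */
    const double chunks[] = { 64e3, 600e3, 6e6, 60e6 };

    for (int i = 0; i < 4; i++)
    {
        double xfer = chunks[i] / astr;             /* time spent transferring */
        double rate = chunks[i] / (access + xfer);  /* bytes/sec actually seen */

        printf("%8.0f KB/access -> %4.1f MB/s effective\n",
               chunks[i] / 1e3, rate / 1e6);
    }
    return 0;
}

Note how 600KB/access still only gets you to about half the rated ASTR;
multi-MB transfers per access are needed to get close to it, which is why
600KB is the floor, not the target.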
 

-Original Message-
From: Jeffrey W. Baker [EMAIL PROTECTED]
Sent: Oct 3, 2005 4:42 PM
To: josh@agliodbs.com
Cc: 
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On Mon, 2005-10-03 at 13:34 -0700, Josh Berkus wrote:
 Michael,
 
  Realistically, you can't do better than about 25MB/s on a
   single-threaded I/O on current Linux machines,
 
  What on earth gives you that idea? Did you drop a zero?
 
 Nope, LOTS of testing, at OSDL, GreenPlum and Sun.   For comparison, A 
 Big-Name Proprietary Database doesn't get much more than that either.

I find this claim very suspicious.  I get single-threaded reads in
excess of 1GB/sec with XFS and  250MB/sec with ext3.  

-jwb





Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Ron Peacetree
Let's pretend we get a 24HD HW RAID solution like the one J Baker
says he has access to and set it up as a RAID 10.  Assuming
it uses two 64b 133MHz PCI-X busses and has the fastest HDs
available on it,  Jeff says he can hit ~1GBps of XFS FS IO rate
with that set up (12*83.3MBps= 1GBps).

Josh says that pg can't do more than 25MBps of DB level IO
regardless of how fast the physical IO subsystem is because at
25MBps, pg is CPU bound.  

Just how bad is this CPU bound condition?  How powerful a CPU is
needed to attain a DB IO rate of 25MBps?
 
If we replace said CPU with one 2x, 10x, etc faster than that, do we
see any performance increase?

If a modest CPU can drive a DB IO rate of 25MBps, but that rate
does not go up regardless of how much extra CPU we throw at
it...

Ron 

-Original Message-
From: Josh Berkus josh@agliodbs.com
Sent: Oct 3, 2005 6:03 PM
To: Jeffrey W. Baker [EMAIL PROTECTED]
Cc: 
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

Jeffrey,

 I guess database reads are different, but I remain unconvinced that they
 are *fundamentally* different.  After all, a tab-delimited file (my sort
 workload) is a kind of database.

Unfortunately, they are ... because of CPU overheads.   I'm basing what's 
reasonable for data writes on the rates which other high-end DBs can 
make.   From that, 25mb/s or even 40mb/s for sorts should be achievable 
but doing 120mb/s would require some kind of breakthrough.

 On a single disk you wouldn't notice, but XFS scales much better when
 you throw disks at it.  I get a 50MB/sec boost from the 24th disk,
 whereas ext3 stops scaling after 16 disks.  For writes both XFS and ext3
 top out around 8 disks, but in this case XFS tops out at 500MB/sec while
 ext3 can't break 350MB/sec.

That would explain it.  I seldom get more than 6 disks (and 2 channels) to 
test with.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco





Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Ron Peacetree
OK, change "performance" to "single thread performance" and we
still have a valid starting point for a discussion.

Ron


-Original Message-
From: Gregory Maxwell [EMAIL PROTECTED]
Sent: Oct 3, 2005 8:19 PM
To: Ron Peacetree [EMAIL PROTECTED]
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On 10/3/05, Ron Peacetree [EMAIL PROTECTED] wrote:
[snip]
 Just how bad is this CPU bound condition?  How powerful a CPU is
 needed to attain a DB IO rate of 25MBps?

 If we replace said CPU with one 2x, 10x, etc faster than that, do we
 see any performance increase?

 If a modest CPU can drive a DB IO rate of 25MBps, but that rate
 does not go up regardless of how much extra CPU we throw at
 it...

Single threaded was mentioned.
Plus even if it's purely cpu bound, it's seldom as trivial as throwing
CPU at it, consider the locking in both the application, in the
filesystem, and elsewhere in the kernel.




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Ron Peacetree
*blink* Tapes?!  I thought that was a typo...
If our sort code is based on sorting tapes, we've made a mistake.  HDs
are not tapes, and Polyphase Merge Sort and its brethren are not the
best choices for HD based sorts.

Useful references to this point:
Knuth, Vol 3 section 5.4.9, (starts p356 of 2ed)   
Tharp, ISBN 0-471-60521-2, starting p352
Folk, Zoellick, and Riccardi, ISBN 0-201-87401-6, chapter 8 (starts p289)

The winners of the Daytona version of Jim Gray's sorting contest, for
general purpose external sorting algorithms that are of high enough quality
to be offered commercially, also demonstrate a number of better ways to
attack external sorting using HDs.

The big takeaways from all this are:
1= As in Polyphase Merge Sort, optimum External HD Merge Sort
performance is obtained by using Replacement Selection and creating
initial runs of different lengths for later merging (see the toy sketch
after this list).  The optimal parameter values differ for HDs, though.

2= Using multiple HDs split into different functions, IOW _not_ simply
as RAIDs, is a big win.
A big enough win that we should probably consider having a config option
to pg that allows the use of HD(s) or RAID set(s) dedicated as temporary
work area(s).
 
3= If the Key is small compared to the record size, Radix or Distribution
Counting based algorithms are worth considering.
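
To make item 1 concrete, here's a minimal toy sketch of replacement
selection run formation over an integer key stream (toy sizes and
made-up names; this is not the pg tuplesort code):

/* Replacement selection: emit the smallest key still eligible for the
 * current run; keys smaller than the last emitted one are deferred to
 * the next run.  Runs come out presorted and of varying lengths. */
#include <stdio.h>

#define MEMSLOTS 4              /* stands in for "work_mem worth of tuples" */

static int cur[MEMSLOTS], ncur; /* min-heap: keys eligible for the current run */
static int nxt[MEMSLOTS], nnxt; /* keys deferred to the next run               */

static void heap_up(int *h, int i)
{
    while (i > 0 && h[i] < h[(i - 1) / 2])
    {
        int t = h[i]; h[i] = h[(i - 1) / 2]; h[(i - 1) / 2] = t;
        i = (i - 1) / 2;
    }
}

static void heap_down(int *h, int n, int i)
{
    for (;;)
    {
        int l = 2 * i + 1, m = i, t;

        if (l < n && h[l] < h[m]) m = l;
        if (l + 1 < n && h[l + 1] < h[m]) m = l + 1;
        if (m == i) return;
        t = h[i]; h[i] = h[m]; h[m] = t;
        i = m;
    }
}

int main(void)
{
    int input[] = { 7, 3, 9, 1, 8, 2, 12, 5, 4, 11, 6, 10 };
    int n = sizeof input / sizeof input[0];
    int run = 1, last = 0, i;

    for (i = 0; i < MEMSLOTS && i < n; i++)     /* prime the heap */
    { cur[ncur] = input[i]; heap_up(cur, ncur++); }

    printf("run %d:", run);
    for (;;)
    {
        if (ncur == 0)                          /* current run is done */
        {
            if (nnxt == 0) break;
            while (nnxt > 0)                    /* start the next run  */
            { cur[ncur] = nxt[--nnxt]; heap_up(cur, ncur++); }
            printf("\nrun %d:", ++run);
        }

        last = cur[0];                          /* emit smallest eligible key */
        printf(" %d", last);
        cur[0] = cur[--ncur];
        heap_down(cur, ncur, 0);

        if (i < n)                              /* refill from the input      */
        {
            if (input[i] >= last)               /* still fits the current run */
            { cur[ncur] = input[i]; heap_up(cur, ncur++); }
            else                                /* too small: defer it        */
                nxt[nnxt++] = input[i];
            i++;
        }
    }
    printf("\n");
    return 0;
}

With MEMSLOTS=4 and that input the sketch emits two sorted runs, which is
exactly the kind of output the later merge phase has to cope with.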

The good news is all this means it's easy to demonstrate that we can
improve the performance of our sorting functionality.

Assuming we get the abysmal physical IO performance fixed...
(because until we do, _nothing_ is going to help us as much)

Ron 


-Original Message-
From: Tom Lane [EMAIL PROTECTED]
Sent: Oct 1, 2005 2:01 AM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort? 

Jeffrey W. Baker [EMAIL PROTECTED] writes:
 I think the largest speedup will be to dump the multiphase merge and
 merge all tapes in one pass, no matter how large M.  Currently M is
 capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
 the tape.  It could be done in a single pass heap merge with N*log(M)
 comparisons, and, more importantly, far less input and output.

I had more or less despaired of this thread yielding any usable ideas
:-( but I think you have one here.  The reason the current code uses a
six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
edition) shows that there's not much incremental gain from using more
tapes ... if you are in the regime where number of runs is much greater
than number of tape drives.  But if you can stay in the regime where
only one merge pass is needed, that is obviously a win.

I don't believe we can simply legislate that there be only one merge
pass.  That would mean that, if we end up with N runs after the initial
run-forming phase, we need to fit N tuples in memory --- no matter how
large N is, or how small work_mem is.  But it seems like a good idea to
try to use an N-way merge where N is as large as work_mem will allow.
We'd not have to decide on the value of N until after we've completed
the run-forming phase, at which time we've already seen every tuple
once, and so we can compute a safe value for N as work_mem divided by
largest_tuple_size.  (Tape I/O buffers would have to be counted too
of course.)
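
For illustration, with purely invented numbers, that calculation might
look roughly like this (not actual sort-code variables):

/* Toy arithmetic for choosing the merge order N after run formation;
 * every constant here is made up purely for illustration. */
#include <stdio.h>

int main(void)
{
    long work_mem      = 1024L * 1024 * 1024;  /* 1GB of sort memory            */
    long largest_tuple = 2048;                 /* biggest tuple seen in phase 1 */
    long tape_buf      = 256 * 1024;           /* per-input-tape I/O buffer     */
    long runs_formed   = 60;                   /* runs left after phase 1       */

    /* each input run being merged needs one tuple slot plus its I/O buffer */
    long max_order = work_mem / (largest_tuple + tape_buf);

    if (runs_formed <= max_order)
        printf("all %ld runs can be merged in a single %ld-way pass\n",
               runs_formed, runs_formed);
    else
        printf("work_mem only supports a %ld-way merge; multiple passes needed\n",
               max_order);
    return 0;
}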

It's been a good while since I looked at the sort code, and so I don't
recall if there are any fundamental reasons for having a compile-time-
constant value of the merge order rather than choosing it at runtime.
My guess is that any inefficiencies added by making it variable would
be well repaid by the potential savings in I/O.



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Ron Peacetree
As I posted earlier, I'm looking for code to base a prototype on now.
I'll test it outside pg to make sure it is bug free and performs as
promised before I hand it off to the core pg developers.

Someone else is going to have to merge it into the pg code base
since I don't know the code intimately enough to make changes this
deep in the core functionality, nor is there enough time for me to
do so if we are going to be timely enough to get this into 8.2
(and no, I can't devote 24x7 to doing pg development unless
someone is going to replace my current ways of paying my bills so
that I can.)

Ron
 

-Original Message-
From: Andrew Dunstan [EMAIL PROTECTED]
Sent: Oct 1, 2005 11:19 AM
To: Ron Peacetree [EMAIL PROTECTED]
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?



Ron Peacetree wrote:

The good news is all this means it's easy to demonstrate that we can
improve the performance of our sorting functionality.

Assuming we get the abysmal physical IO performance fixed...
(because until we do, _nothing_ is going to help us as much)

  


I for one would be paying more attention if such a demonstration were 
forthcoming, in the form of a viable patch and some benchmark results.

cheers

andrew




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Ron Peacetree
You have not said anything about what HW, OS version, and pg version
were used here, but even at that, can't you see that something Smells Wrong?

The most common CPUs currently shipping have clock rates of ~2-3GHz
and have 8B-16B internal pathways.  SPARCs and other like CPUs are
clocked slower but have 16B-32B internal pathways.  In short, these
CPUs have an internal bandwidth of 16+ GBps.

The most common currently shipping mainboards have 6.4GBps RAM
subsystems.  ITRW, their peak is ~80% of that, or ~5.1GBps.

In contrast, the absolute peak bandwidth of a 133MHz 8B PCI-X bus is
1GBps, and ITRW it peaks at ~800-850MBps.  Should anyone ever build
a RAID system that can saturate a PCI-Ex16 bus, that system will be
maxing ITRW at ~3.2GBps.

CPUs should NEVER be 100% utilized during copy IO.  They should be
idling impatiently, waiting for the next piece of data to arrive,
even when the RAM IO subsystem is pegged; and they
definitely should be IO starved rather than CPU bound when doing
HD IO.

Those IO rates are also alarming in all but possibly the first case.  A
single ~50MBps HD doing 21MBps isn't bad, but for even a single
~80MBps HD it starts to be of concern.  If any of these IO rates came
from any reasonable 300+MBps RAID array, then they are BAD.

What your simple experiment really does is prove We Have A
Problem (tm) with our IO code at either or both of the OS or the pg
level(s).

Ron

 
-Original Message-
From: Martijn van Oosterhout kleptog@svana.org
Sent: Oct 1, 2005 12:19 PM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On Sat, Oct 01, 2005 at 10:22:40AM -0400, Ron Peacetree wrote:
Assuming we get the abysmal physical IO performance fixed...
 (because until we do, _nothing_ is going to help us as much)

I'm still not convinced this is the major problem. For example, in my
totally unscientific tests on an oldish machine I have here:

Direct filesystem copy to /dev/null
21MB/s    10% user 50% system  (dual cpu, so the system is using a whole CPU)

COPY TO /dev/null WITH binary
13MB/s    55% user 45% system  (ergo, CPU bound)

COPY TO /dev/null
4.4MB/s   60% user 40% system

\copy to /dev/null in psql
6.5MB/s   60% user 40% system

This machine is a bit strange setup, not sure why fs copy is so slow.
As to why \copy is faster than COPY, I have no idea, but it is
repeatable. And actually turning the tuples into a printable format is
the most expensive. But it does point out that the whole process is
probably CPU bound more than anything else.

So, I don't think physical I/O is the problem. It's something further
up the call tree. I wouldn't be surprised at all if it had to do with
the creation and destruction of tuples. The cost of comparing tuples
should not be underestimated.



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Ron Peacetree
From: Pailloncy Jean-Gerard [EMAIL PROTECTED]
Sent: Sep 29, 2005 7:11 AM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
Jeff Baker:
Your main example seems to focus on a large table where a key  
column has constrained values.  This case is interesting in
proportion to the number of possible values.  If I have billions
of rows, each having one of only two values, I can think of a
trivial and very fast method of returning the table sorted by
that key: make two sequential passes, returning the first value
on the first pass and the second value on the second pass.
 This will be faster than the method you propose.

Ron Peacetree:
1= No that was not my main example.  It was the simplest example  
used to frame the later more complicated examples.  Please don't
get hung up on it.

2= You are incorrect.  Since IO is the most expensive operation we  
can do, any method that makes two passes through the data at top
scanning speed will take at least 2x as long as any method that only
takes one such pass.

You do not get the point.
At the time you get the sorted references to the tuples, you need to
fetch the tuples themselves, check their visibility, etc., and return
them to the client.

As PFC correctly points out elsewhere in this thread, =maybe= you
have to do all that.  The vast majority of the time people are not
going to want to look at a detailed record by record output of that
much data.

The most common usage is to calculate or summarize some quality
or quantity of the data and display that instead or to use the tuples
or some quality of the tuples found as an intermediate step in a
longer query process such as a join.

Sometimes there's a need to see _some_ of the detailed records; a
random sample, a region in a random part of the table, etc.
It's rare that there is a RW need to actually list every record in a
table of significant size.

On the rare occasions where one does have to return or display all
records in such large table, network IO and/or display IO speeds
are the primary performance bottleneck.  Not HD IO.

Nonetheless, if there _is_ such a need, there's nothing stopping us
from rearranging the records in RAM into sorted order in one pass
through RAM (using at most space for one extra record) after
constructing the cache conscious Btree index.  Then the sorted
records can be written to HD in RAM buffer sized chunks very
efficiently.  
Repeating this process until we have stepped through the entire
data set will take no more HD IO than one HD scan of the data
and leave us with a permanent result that can be reused for
multiple purposes.  If the sorted records are written in large
enough chunks, rereading them at any later time can be done
at maximum HD throughput.

In a total of two HD scans (one to read the original data, one
to write out the sorted data) we can make a permanent
rearrangement of the data.  We've essentially created a 
clustered index version of the data.


So, if there are only 2 values in the column of a big table that is larger
than available RAM, two seq scans of the table without any sorting
is the fastest solution.

If you only need to do this once, yes this wins.  OTOH, if you have
to do this sort even twice, my method is better.

regards,
Ron  



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Ron Peacetree
From: Zeugswetter Andreas DAZ SD [EMAIL PROTECTED]
Sent: Sep 29, 2005 9:28 AM
Subject: RE: [HACKERS] [PERFORM] A Better External Sort?

In my original example, a sequential scan of the 1TB of 2KB 
or 4KB records, = 250M or 500M records of data, being sorted 
on a binary value key will take ~1000x more time than reading 
in the ~1GB Btree I described that used a Key+RID (plus node 
pointers) representation of the data.

Imho you seem to ignore the final step your algorithm needs of
collecting the data rows. After you sorted the keys the collect
step will effectively access the tuples in random order (given a 
sufficiently large key range).

Collecting the data rows can be done for each RAM buffer full of
data in one pass through RAM after we've built the Btree.  Then
if desired those data rows can be read out to HD in sorted order
in essentially one streaming burst.  This combination of index build
+ RAM buffer rearrangement + write results to HD can be repeated
as often as needed until we end up with an overall Btree index and
a set of sorted sublists on HD.  Overall HD IO for the process is only
two effectively sequential passes through the data.

Subsequent retrieval of the sorted information from HD can be
done at full HD streaming speed and whatever we've decided to
save to HD can be reused later if we desire.

Hope this helps,
Ron



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Ron Peacetree
From: Josh Berkus josh@agliodbs.com
Sent: Sep 29, 2005 12:54 PM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

The biggest single area where I see PostgreSQL external
sort sucking is on index creation on large tables.   For
example, for free version of TPCH, it takes only 1.5 hours to
load a 60GB Lineitem table on OSDL's hardware, but over 3
hours to create each index on that table.  This means that
overall, our load into TPCH takes 4 times as long to create 
the indexes as it did to bulk load the data.

Hmmm.
60GB/5400secs= 11MBps.  That's ssllooww.  So the first
problem is evidently our physical layout and/or HD IO layer
sucks.

Creating the table and then creating the indexes on the table
is going to require more physical IO than if we created the
table and the indexes concurrently in chunks and then
combined the indexes on the chunks into the overall indexes
for the whole table, so there's a potential speed-up.

The method I've been talking about is basically a recipe for
creating indexes as fast as possible with as few IO operations,
HD or RAM, as possible and nearly no random ones, so it
could help as well.

OTOH, HD IO rate is the fundamental performance metric.
As long as our HD IO rate is pessimal, so will the performance
of everything else be.   Why can't we load a table at closer to
the peak IO rate of the HDs?   


Anyone restoring a large database from pg_dump is in the
same situation.  Even worse, if you have to create a new
index on a large table on a production database in use,
because the I/O from the index creation swamps everything.

Fix for this in the works ;-)


Following an index creation, we see that 95% of the time
required is the external sort, which averages 2mb/s.

Assuming decent HD HW, this is HORRIBLE.

What kind of instrumenting and profiling has been done of
the code involved?


This is with separate drives for the WAL, the pg_tmp, the table
and the index.  I've confirmed that increasing work_mem 
beyond a small minimum (around 128mb) had no benefit on
the overall index creation speed.

No surprise.  The process is severely limited by the abysmally
slow HD IO.

Ron



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Ron Peacetree
That 11MBps was your =bulk load= speed.  If just loading a table
is this slow, then there are issues with basic physical IO, not just
IO during sort operations.

As I said, the obvious candidates are inefficient physical layout
and/or flawed IO code.

Until the basic IO issues are addressed, we could replace the
present sorting code with infinitely fast sorting code and we'd
still be scrod performance wise.

So why does basic IO suck so badly?

Ron  


-Original Message-
From: Josh Berkus josh@agliodbs.com
Sent: Sep 30, 2005 1:23 PM
To: Ron Peacetree [EMAIL PROTECTED]
Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

Ron,

 Hmmm.
 60GB/5400secs= 11MBps.  That's ssllooww.  So the first
 problem is evidently our physical layout and/or HD IO layer
 sucks.

Actually, it's much worse than that, because the sort is only dealing 
with one column.  As I said, monitoring the iostat our top speed was 
2.2mb/s.

--Josh




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Ron Peacetree
25MBps should not be a CPU bound limit for IO, nor should it be
an OS limit.  It should be something ~100x (Single channel RAM)
to ~200x (dual channel RAM) that.

For an IO rate of 25MBps to be pegging the CPU at 100%, the CPU
is suffering some combination of
A= lots of cache misses (cache thrash),
B= lots of random rather than sequential IO (like pointer chasing)
C= lots of wasteful copying
D= lots of wasteful calculations

In fact, this is crappy enough performance that the whole IO layer
should be rethought and perhaps reimplemented from scratch.
Optimization of the present code is unlikely to yield a 100-200x
improvement.

On the HD side, the first thing that comes to mind is that DBs are
-NOT- like ordinary filesystems in a few ways:
1= the minimum HD IO is a record that is likely to be larger than
a HD sector.  Therefore, the FS we use should be laid out with
physical segments of max(HD sector size, record size)

2= DB files (tables) are usually considerably larger than any other
kind of files stored.  Therefore the FS we use should be laid
out using LARGE physical pages.  64KB-256KB at a _minimum_.

3= The whole 2GB striping of files idea needs to be rethought.
Our tables are significantly different in internal structure from the
usual FS entity.

4= I'm sure we are paying all sorts of nasty overhead for essentially
emulating the pg filesystem inside another filesystem.  That means
~2x as much overhead to access a particular piece of data.   

The simplest solution is for us to implement a new VFS compatible
filesystem tuned to exactly our needs: pgfs.

We may be able to avoid that by some amount of hacking or
modifying of the current FSs we use, but I suspect it would be more
work for less ROI.

Ron 


-Original Message-
From: Josh Berkus josh@agliodbs.com
Sent: Sep 30, 2005 4:41 PM
To: Ron Peacetree [EMAIL PROTECTED]
Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

Ron,

 That 11MBps was your =bulk load= speed.  If just loading a table
 is this slow, then there are issues with basic physical IO, not just
 IO during sort operations.

Oh, yeah.  Well, that's separate from sort.  See multiple posts on this 
list from the GreenPlum team, the COPY patch for 8.1, etc.  We've been 
concerned about I/O for a while.  

Realistically, you can't do better than about 25MB/s on a single-threaded 
I/O on current Linux machines, because your bottleneck isn't the actual 
disk I/O.   It's CPU.   Databases which go faster than this are all, to 
my knowledge, using multi-threaded disk I/O.

(and I'd be thrilled to get a consistent 25mb/s on PostgreSQL, but that's 
another thread ... )

 As I said, the obvious candidates are inefficient physical layout
 and/or flawed IO code.

Yeah, that's what I thought too.   But try sorting an 10GB table, and 
you'll see: disk I/O is practically idle, while CPU averages 90%+.   We're 
CPU-bound, because sort is being really inefficient about something. I 
just don't know what yet.

If we move that CPU-binding to a higher level of performance, then we can 
start looking at things like async I/O, O_Direct, pre-allocation etc. that 
will give us incremental improvements.   But what we need now is a 5-10x 
improvement and that's somewhere in the algorithms or the code.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Ron Peacetree
If I've done this correctly, there should not be anywhere near
the number of context switches we currently see while sorting.

Each unscheduled context switch represents something unexpected
occurring or things not being where they are needed when they are
needed.  Reducing such circumstances to the absolute minimum 
was one of the design goals.

Reducing the total amount of IO to the absolute minimum should
help as well. 

Ron


-Original Message-
From: Kevin Grittner [EMAIL PROTECTED]
Sent: Sep 27, 2005 11:21 AM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

I can't help wondering how a couple thousand context switches per
second would affect the attempt to load disk info into the L1 and
L2 caches.  That's pretty much the low end of what I see when the
server is under any significant load.






Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Ron Peacetree
From: Jeffrey W. Baker [EMAIL PROTECTED]
Sent: Sep 29, 2005 12:27 AM
To: Ron Peacetree [EMAIL PROTECTED]
Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

You are engaging in a lengthy and verbose exercise in mental
masturbation, because you have not yet given a concrete example of a
query where this stuff would come in handy.  A common, general-purpose
case would be the best.
 
??? I posted =3= specific classes of common, general-purpose query
operations where OES and the OES Btrees look like they should be
superior to current methods:
1= when splitting sorting or other operations across multiple CPUs
2= when doing joins of different tables by doing the join on these Btrees
rather than the original tables.
3= when the opportunity arises to reuse OES Btree results of previous
sorts for different keys in the same table.  Now we can combine the
existing Btrees to obtain the new order based on the composite key
without ever manipulating the original, much larger, table.  

In what way are these examples not concrete?


We can all see that the method you describe might be a good way to sort
a very large dataset with some known properties, which would be fine if
you are trying to break the terasort benchmark.  But that's not what
we're doing here.  We are designing and operating relational databases.
So please explain the application.

This is a GENERAL method.  It's based on CPU cache efficient Btrees that
use variable length prefix keys and RIDs.
It assumes NOTHING about the data or the system in order to work.
I gave some concrete examples for the sake of easing explanation, NOT
as an indication of assumptions or limitations of the method.  I've even
gone out of my way to prove that no such assumptions or limitations exist.
Where in the world are you getting such impressions?
  

Your main example seems to focus on a large table where a key column has
constrained values.  This case is interesting in proportion to the
number of possible values.  If I have billions of rows, each having one
of only two values, I can think of a trivial and very fast method of
returning the table sorted by that key: make two sequential passes,
returning the first value on the first pass and the second value on the
second pass.  This will be faster than the method you propose.

1= No that was not my main example.  It was the simplest example used to
frame the later more complicated examples.  Please don't get hung up on it.

2= You are incorrect.  Since IO is the most expensive operation we can do,
any method that makes two passes through the data at top scanning speed
will take at least 2x as long as any method that only takes one such pass.


I think an important aspect you have failed to address is how much of
the heap you must visit after the sort is complete.  If you are
returning every tuple in the heap then the optimal plan will be very
different from the case when you needn't.  

Hmmm.  Not sure which heap you are referring to, but the OES Btree
index is provably the lowest (in terms of tree height) and smallest
possible CPU cache efficient data structure that one can make and still
have all of the traditional benefits associated with a Btree representation
of a data set.

Nonetheless, returning a RID, or all RIDs with(out) the same Key, or all
RIDs (not) within a range of Keys, or simply all RIDs in sorted order is
efficient.  Just as it should be for a Btree (actually it's a B+ tree variant, to
use Knuth's nomenclature).  I'm sure someone posting from acm.org
recognizes how each of these Btree operations maps to various SQL
features...  

I haven't been talking about query plans because they are orthogonal to
the issue under discussion.  If we use a layered model for PostgreSQL's
architecture, this functionality is more primal than that of a query
planner.  ALL query plans that currently involve sorts will benefit from a
more efficient way to do, or avoid, sorts.


PS: Whatever mailer you use doesn't understand or respect threading nor
attribution.  Out of respect for the list's readers, please try a mailer
that supports these 30-year-old fundamentals of electronic mail.

That is an issue of infrastructure on the receiving side, not on the sending
(my) side since even my web mailer seems appropriately RFC conformant.
Everything seems to be going in the correct places and being properly 
organized on archival.postgres.org ...

Ron



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Ron Peacetree
From: Jeffrey W. Baker [EMAIL PROTECTED]
Sent: Sep 27, 2005 1:26 PM
To: Ron Peacetree [EMAIL PROTECTED]
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:

That Btree can be used to generate a physical reordering of the data
in one pass, but that's the weakest use for it.  The more powerful
uses involve allowing the Btree to persist and using it for more
efficient re-searches or combining it with other such Btrees (either as
a step in task distribution across multiple CPUs or as a more efficient
way to do things like joins by manipulating these Btrees rather than
the actual records.)

Maybe you could describe some concrete use cases.  I can see what
you are getting at, and I can imagine some advantageous uses, but
I'd like to know what you are thinking.

1= In a 4P box, we split the data in RAM into 4 regions and create
a CPU cache friendly Btree using the method I described for each CPU.
The 4 Btrees can be merged in a more time and space efficient manner
than the original records to form a Btree that represents the sorted order
of the entire data set (see the merge sketch after this list).  Any of
these Btrees can be allowed to persist to
lower the cost of doing similar operations in the future (Updating the
Btrees during inserts and deletes is cheaper than updating the original
data files and then redoing the same sort from scratch in the future.)
Both the original sort and future such sorts are made more efficient
than current methods.

2= We use my method to sort two different tables.  We now have these
very efficient representations of a specific ordering on these tables.  A
join operation can now be done using these Btrees rather than the
original data tables that involves less overhead than many current
methods.

3=  We have multiple such Btrees for the same data set representing
sorts done using different fields (and therefore different Keys).
Calculating a sorted order for the data based on a composition of
those Keys is now cheaper than doing the sort based on the composite
Key from scratch.  When some of the Btrees exist and some of them
do not, there is a tradeoff calculation to be made.  Sometimes it will be
cheaper to do the sort from scratch using the composite Key.   
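
To make the k-way merge in item 1 concrete, here's a toy sketch with
plain sorted int arrays standing in for the per-CPU Btrees (illustrative
only; not pg code):

/* Merge k presorted streams (here: 4 sorted int arrays) into one
 * sorted output stream. */
#include <stdio.h>

#define NCPUS  4
#define RUNLEN 4

int main(void)
{
    int runs[NCPUS][RUNLEN] = {
        { 1,  5,  9, 13 },
        { 2,  6, 10, 14 },
        { 3,  7, 11, 15 },
        { 4,  8, 12, 16 }
    };
    int pos[NCPUS] = { 0, 0, 0, 0 };

    for (;;)
    {
        int best = -1;

        /* pick the smallest current head among the k runs; a real
         * implementation would use a heap for large k */
        for (int i = 0; i < NCPUS; i++)
            if (pos[i] < RUNLEN &&
                (best < 0 || runs[i][pos[i]] < runs[best][pos[best]]))
                best = i;

        if (best < 0)
            break;                      /* all runs exhausted */

        printf("%d ", runs[best][pos[best]++]);
    }
    printf("\n");
    return 0;
}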


Specifically I'd like to see some cases where this would beat sequential
scan.  I'm thinking that in your example of a terabyte table with a
column having only two values, all the queries I can think of would be
better served with a sequential scan.

In my original example, a sequential scan of the 1TB of 2KB or 4KB
records, = 250M or 500M records of data, being sorted on a binary
value key will take ~1000x more time than reading in the ~1GB Btree
I described that used a Key+RID (plus node pointers) representation
of the data.
 
Just to clarify the point further,
1TB of 1B records = 2^40 records of at most 256 distinct values.
1TB of 2B records = 2^39 records of at most 2^16 distinct values.
1TB of 4B records = 2^38 records of at most 2^32 distinct values.
1TB of 5B records = 200B records of at most 200B distinct values.
From here on, the number of possible distinct values is limited by the
number of records.
100B records are used in the Indy version of Jim Gray's sorting 
contests, so 1TB = 10B records.
2KB-4KB is the most common record size I've seen in enterprise
class DBMS (so I used this value to make my initial example more
realistic).

Therefore the vast majority of the time representing a data set by Key
will use less space that the original record.  Less space used means
less IO to scan the data set, which means faster scan times.

This is why index files work in the first place, right?


Perhaps I believe this because you can now buy as much sequential I/O
as you want.  Random I/O is the only real savings.

1= No, you can not buy as much sequential IO as you want.  Even
with an infinite budget, there are physical and engineering limits.  Long
before you reach those limits, you will pay exponentially increasing costs
for linearly increasing performance gains.  So even if you _can_ buy a
certain level of sequential IO, it may not be the most efficient way to
spend money.

2= Most RW IT professionals have far from an infinite budget.  Just traffic
on these lists shows how severe the typical cost constraints usually are.
OTOH, if you have an infinite IT budget, care to help a few less fortunate
than yourself?  After all, even a large constant subtracted from infinity
is still infinity... ;-)

3= No matter how fast you can do IO, IO remains the most expensive
part of the performance equation.  The fastest and cheapest IO you can
do is _no_ IO.  As long as we trade cheaper RAM and even cheaper CPU
operations for IO correctly, more space efficient data representations will
always be a Win because of this.


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Ron Peacetree
In the interest of efficiency and not reinventing the wheel, does anyone know
where I can find C or C++ source code for a Btree variant with the following
properties (a rough sketch of the node layout I have in mind follows the list):

A= Data elements (RIDs) are only stored in the leaves, Keys (actually
KeyPrefixes; see D below) and Node pointers are only stored in the internal
nodes of the Btree.

B= Element redistribution is done as an alternative to node splitting in 
overflow
conditions during Inserts whenever possible.

C= Variable length Keys are supported.

D= Node buffering with a reasonable replacement policy is supported.

E= Since we will know beforehand exactly how many RIDs will be stored, we
will know a priori how much space will be needed for leaves, and will know the
worst case for how much space will be required for the Btree internal nodes
as well.  This implies that we may be able to use an array, rather than linked
list, implementation of the Btree.  Less pointer chasing at the expense of more
CPU calculations, but that's a trade-off in the correct direction. 
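
For concreteness, the kind of node layout I have in mind looks roughly
like this in C (all names and sizes here are hypothetical, not taken
from any existing library):

/* Array-based layout implied by A-E above. */
#define MAX_FANOUT    64        /* keys per internal node */
#define LEAF_CAPACITY 256       /* RIDs per leaf          */

typedef unsigned long long RID; /* row identifier; stand-in type */

typedef struct InternalNode
{
    unsigned short nkeys;
    unsigned short prefix_len[MAX_FANOUT]; /* variable-length KeyPrefixes (C)   */
    unsigned int   prefix_off[MAX_FANOUT]; /* offsets into a shared prefix pool */
    unsigned int   child[MAX_FANOUT + 1];  /* array indices, not pointers,
                                            * since sizes are known up front (E) */
} InternalNode;

typedef struct LeafNode
{
    unsigned int nrids;
    RID          rids[LEAF_CAPACITY];      /* data elements only in leaves (A)  */
} LeafNode;

/* B (redistribute before splitting) and D (node buffering with a
 * replacement policy) are insertion/paging policies layered on top of
 * this layout rather than part of it. */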

Such source would be a big help in getting a prototype together.

Thanks in advance for any pointers or source,
Ron



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-28 Thread Ron Peacetree
From: Josh Berkus josh@agliodbs.com
Sent: Sep 27, 2005 12:15 PM
To: Ron Peacetree [EMAIL PROTECTED] 
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

I've somehow missed part of this thread, which is a shame since this is 
an area of primary concern for me.

Your suggested algorithm seems to be designed to relieve I/O load by 
making more use of the CPU.   (if I followed it correctly).

The goal is to minimize all IO load.  Not just HD IO load, but also RAM
IO load.  Particularly random access IO load of any type (for instance:
the pointer chasing problem).

In addition, the design replaces explicit data or explicit key manipulation
with the creation of a smaller, far more CPU and IO efficient data
structure (essentially a CPU cache friendly Btree index) of the sorted
order of the data.

That Btree can be used to generate a physical reordering of the data
in one pass, but that's the weakest use for it.  The more powerful
uses involve allowing the Btree to persist and using it for more
efficient re-searches or combining it with other such Btrees (either as
a step in task distribution across multiple CPUs or as a more efficient
way to do things like joins by manipulating these Btrees rather than
the actual records.)


However, that's not PostgreSQL's problem; currently for us external
sort is a *CPU-bound* operation, half of which is value comparisons.
(oprofiles available if anyone cares)

So we need to look, instead, at algorithms which make better use of 
work_mem to lower CPU activity, possibly even at the expense of I/O.

I suspect that even the highly efficient sorting code we have is
suffering more pessimal CPU IO behavior than what I'm presenting.
Jim Gray's external sorting contest web site points out that memory IO
has become a serious problem for most of the contest entries.

Also, I'll bet the current code manipulates more data.

Finally, there's the possibilty of reusing the product of this work to a
degree and in ways that we can't with our current sorting code.


Now all we need is resources and time to create a prototype.
Since I'm not likely to have either any time soon, I'm hoping that
I'll be able to explain this well enough that others can test it.

*sigh* I _never_ have enough time or resources any more...
Ron
 
  



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-27 Thread Ron Peacetree
From: Dann Corbit [EMAIL PROTECTED]
Sent: Sep 26, 2005 5:13 PM
To: Ron Peacetree [EMAIL PROTECTED], pgsql-hackers@postgresql.org, 
   pgsql-performance@postgresql.org
Subject: RE: [HACKERS] [PERFORM] A Better External Sort?

I think that the btrees are going to be O(n*log(n)) in construction of
the indexes in disk access unless you memory map them [which means you
would need stupendous memory volume] and so I cannot say that I really
understand your idea yet.

Traditional algorithms for the construction of Btree variants (B, B+, B*, ...)
don't require O(nlgn) HD accesses.  These shouldn't either.

Let's start by assuming that an element is <= a cache line in size and a
node fits into L1 DCache.  To make the discussion more concrete, I'll use a
64KB L1 cache + a 1MB L2 cache only as an example.

Simplest case: the Key has few enough distinct values that all Keys or
KeyPrefixes fit into L1 DCache (for a 64KB cache with 64B lines, that's
<= 1000 different values.  More if we can fit more than 1 element into
each cache line.).

As we scan the data set coming in from HD, we compare the Key or KeyPrefix
to the sorted list of Key values in the node.  This can be done in O(lgn) using
Binary Search or O(lglgn) using a variation of Interpolation Search.  
If the Key value exists, we append this RID to the list of RIDs having the
same Key:
  If the RAM buffer of this list of RIDs is full we append it and the current
  RID to the HD list of these RIDs.
Else we insert this new Key value into its proper place in the sorted list of
Key values in the node and start a new RID list for this Key value.

We allocate room for a CPU write buffer so we can schedule RAM writes to
the RAM lists of RIDs so as to minimize the randomness of them.

When we are finished scanning the data set from HD, the sorted node with
RID lists for each Key value contains the sort order for the whole data set.

Notice that almost all of the random data access is occurring within the CPU
rather than in RAM or HD, and that we are accessing RAM or HD only when
absolutely needed.
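
A minimal toy sketch of that single-node inner loop (names and sizes are
illustrative only; the real thing would keep actual RID lists, spill them
to HD, and handle node overflow; this is not pg code):

/* Binary-search the node's sorted Key array; if found, append the RID
 * to that Key's list (just counted here); otherwise insert the new Key
 * in sorted order. */
#include <stdio.h>
#include <string.h>

#define NODE_KEYS 1000                  /* roughly what fits in a 64KB L1 */

typedef unsigned long long RID;

static int      keys[NODE_KEYS];        /* sorted distinct Key values         */
static unsigned nrids[NODE_KEYS];       /* stand-in for the per-Key RID lists */
static int      nkeys;

static void add_tuple(int key, RID rid)
{
    int lo = 0, hi = nkeys - 1, pos;

    while (lo <= hi)                    /* O(lg n) search, all in-cache */
    {
        int mid = lo + (hi - lo) / 2;

        if (keys[mid] == key)
        {
            nrids[mid]++;               /* real version: append rid to this
                                         * Key's RAM RID buffer and spill the
                                         * buffer to HD when it fills */
            (void) rid;
            return;
        }
        if (keys[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }

    if (nkeys >= NODE_KEYS)
        return;                         /* "simplest case": everything fits   */
    pos = lo;                           /* Key absent: insert it in order     */
    memmove(&keys[pos + 1], &keys[pos], (nkeys - pos) * sizeof keys[0]);
    memmove(&nrids[pos + 1], &nrids[pos], (nkeys - pos) * sizeof nrids[0]);
    keys[pos] = key;
    nrids[pos] = 1;
    nkeys++;
}

int main(void)
{
    int sample[] = { 42, 7, 42, 19, 7, 42 };
    int i;

    for (i = 0; i < 6; i++)
        add_tuple(sample[i], (RID) i);
    for (i = 0; i < nkeys; i++)
        printf("key %d: %u RIDs\n", keys[i], nrids[i]);
    return 0;
}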

Next simplest case: Multiple nodes, but they all fit in the CPU cache(s).
In the given example CPU, we will be able to fit at least 1000 elements per
node and 2^20/2^16= up to 16 such nodes in this CPU.  We use a node's
worth of space as a RAM write buffer, so we end up with room for 15 such
nodes in this CPU.  This is enough for a 2 level index to at least 15,000
distinct Key value lists.

All of the traditional tricks for splitting a Btree node and redistributing
elements within them during insertion or splitting for maximum node
utilization can be used here.

The most general case: There are too many nodes to fit within the CPU
cache(s).  The root node now points to at least 1000 nodes
since each element in the root node points to another node.  A full 2 level
index is now enough to point to at least 10^6 distinct Key value lists, and
3 levels will index more distinct Key values than is possible in our 1TB, 
500M record example.

We can use some sort of node use prediction algorithm like LFU to decide
which node should be moved out of CPU when we have to replace one of
the nodes in the CPU.  The nodes in RAM or on HD can be arranged to
maximize streaming IO behavior and minimize random access IO
behavior.

As you can see, both the RAM and HD IO are as minimized as possible,
and what such IO there is has been optimized for streaming behavior.

 
Can you draw a picture of it for me?  (I am dyslexic and understand things
far better when I can visualize it).

Not much for pictures.  Hopefully the explanation helps?

Ron



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-27 Thread Ron Peacetree
SECOND ATTEMPT AT POST.  Web mailer appears to have
eaten first one.  I apologize in advance if anyone gets two
versions of this post.
=r

From: Tom Lane [EMAIL PROTECTED]
Sent: Sep 26, 2005 9:42 PM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort? 

So far, you've blithely assumed that you know the size of a cache line,
the sizes of L1 and L2 cache,

NO.  I used exact values only as examples.  Realistic examples drawn
from an extensive survey of past, present, and what I could find out
about future systems; but only examples nonetheless.  For instance,
Hennessy and Patterson 3ed points out that 64B cache lines are
optimally performing for caches between 16KB and 256KB.  The same
source as well as sources specifically on CPU memory hierarchy
design points out that we are not likely to see L1 caches larger than
256KB in the forseeable future.

The important point was the idea of an efficient Key, rather than
Record, sort using a CPU cache friendly data structure with provably
good space and IO characteristics based on a reasonable model of
current and likely future single box computer architecture (although
it would be fairly easy to extend it to include the effects of
networking.)

No apriori exact or known values are required for the method to work.


and that you are working with sort keys that you can efficiently pack
into cache lines.

Not pack: map.  n items cannot take on more than n values, and n values
can be represented in lg(n) bits.  Less efficient mappings can also work.
Either way, I demonstrated that we have plenty of space in a likely and
common cache line size.  Creating a mapping function to represent m values
in lg(m) bits is a well-known hack, and if we keep track of minimum and
maximum field values during insert and delete operations, we can even
create such mapping functions fairly easily.
(IIRC, Oracle does keep track of minimum and maximum field
values.)
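
For illustration, a minimal sketch of such a min/max based mapping; the
names and types are mine, not pg code:

#include <stdint.h>

/* Sketch of a min/max based mapping: values in the closed range [min, max]
 * map onto 0 .. (max - min), which need only enough bits to cover the span,
 * so keys pack far more densely than the raw field values while preserving
 * sort order.  E.g. a date column spanning ~10 years of days needs only
 * 12 bits per mapped key. */

static unsigned
bits_needed(uint64_t min, uint64_t max)
{
    uint64_t span = max - min;  /* number of representable values, minus one */
    unsigned bits = 0;

    while (span > 0)
    {
        bits++;
        span >>= 1;
    }
    return bits ? bits : 1;     /* even a single-valued field gets 1 bit */
}

static uint64_t
map_key(uint64_t value, uint64_t min)
{
    return value - min;         /* the packed, order-preserving key */
}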


And that you know the relative access speeds of the caches and
memory so that you can schedule transfers,

Again, no.  I created a reasonable model of a computer system that
holds remarkably well over a _very_ wide range of examples.  I
don't need the numbers to be exactly right to justify my approach
to this problem or understand why other approaches may have
downsides.  I just have to get the relative performance of the
system components and the relative performance gap between them
reasonably correct.  The stated model does that very well.

Please don't take my word for it.  Go grab some random box:
laptop, desktop, unix server, etc and try it for yourself.  Part of the
reason I published the model was so that others could examine it.
 

and that the hardware lets you get at that transfer timing.

Never said anything about this, and in fact I do not need any such.


And that the number of distinct key values isn't very large.

Quite the opposite in fact.  I went out of my way to show that the
method still works well even if every Key is distinct.  It is _more
efficient_ when the number of distinct keys is small compared to
the number of data items, but it works as well as any other Btree
would when all n of the Keys are distinct.  This is just a more CPU cache
friendly and IO friendly Btree, not some magical and unheard of
technique.  It's just as general purpose as Btrees usually are.

I'm simply looking at the current and likely future state of computer
systems architecture and coming up with a slight twist on how to use
already well known and characterized techniques, not trying to start
a revolution.


I'm trying very hard NOT to waste anyone's time around here.
Including my own
Ron 

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[HACKERS] [PERFORM] A Better External Sort?

2005-09-26 Thread Ron Peacetree
From: Ron Peacetree [EMAIL PROTECTED]
Sent: Sep 24, 2005 6:30 AM
Subject: Re: [HACKERS] [PERFORM] Releasing memory during External sorting?

... the amount of IO done is the most
important of the things that you should be optimizing for in
choosing an external sorting algorithm.

 snip

Since sorting is a fundamental operation in many parts of a DBMS,
this is a Big Deal.
   
This discussion has gotten my creative juices flowing.  I'll post
some Straw Man algorithm sketches after I've done some more
thought.

As a thought experiment, I've been considering the best way to sort 1TB
(2^40B) of 2-4KB (2^11-2^12B) records.  That's 2^28-2^29 records.

Part I: A Model of the System
The performance of such external sorts is limited by HD IO, then
memory IO, and finally CPU throughput.

On commodity HW, single HD IO is ~1/2048 (realistic single HD worst
case) to ~1/128 (single HD best case: no more than one seek every
~14.7ms for a ~50MB/s 7200rpm SATA II HD) the throughput of RAM.

RAID HD IO will be in the range from as low as a single HD (RAID 1) to
~1/8 (a RAID system saturating the external IO bus) the throughput of
RAM.

RAM is ~1/8-1/16 the throughput and ~128x the latency of the data
pathways internal to the CPU.

This model suggests that HD IO will greatly dominate every other
factor, particularly if we are talking about a single HD rather than a
peripheral bus saturating RAID subsystem.  If at all possible, we want
to access the HD subsystem only once for each data item, and we want
to avoid seeking more than the critical number of seeks implied above
when doing it.  It also suggests that at a minimum, it's worth it to
spend ~8 memory operations or ~64 CPU operations to avoid a HD access.
Far more than that if we are talking about a single random access.

It's worth spending ~128 CPU operations to avoid a single random RAM
access, and literally 10's or even 100's of thousands of CPU operations to
avoid a random HD access.  In addition, there are many indications in
current ECE and IT literature that the performance gaps between these
pieces of computer systems are increasing and expected to continue to do
so for the foreseeable future.  In short, _internal_ sorts have some, and are
going to increasingly have more, of the same IO problems usually
associated with external sorts.


Part II: a Suggested Algorithm
The simplest case is one where we have to order the data using a key that
only has two values.

Given 2^40B of data using 2KB or 4KB per record, the most compact
representation we can make of such a data set is to assign a 32b= 4B RID
or Rptr for location + a 1b key for each record.  Just the RIDs would take up
1.25GB (250M records) or 2.5GB (500M records).  That's enough space that even
an implied ordering of the records may not fit into RAM.

Still, sorting 1.25GB or 2.5GB of RIDs is considerably less expensive in terms
of IO operations than sorting the actual 1TB of data.

That IO cost can be lowered even further if, instead of actually physically
sorting the RIDs, we assign each RID to the appropriate category inside the CPU
as we scan the data set and append the entries in a category from CPU cache
to a RAM file in one IO burst whenever that category gets full inside the CPU.
We can do the same from either RAM file to HD whenever it gets full.  The
sorted order of the data is found by concatenating the appropriate files at the
end of the process.
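
A minimal sketch of that two-category pass, with made-up buffer sizes and
file handling (illustration only, not pg code):

#include <stdio.h>
#include <stdint.h>

/* Sketch of the two-value-key case: one streaming pass assigns each RID to a
 * category inside the CPU, and a full category buffer is appended to its
 * file in one sequential burst. */

#define CAT_ENTRIES 8192              /* RIDs buffered per category */

typedef struct
{
    uint32_t buf[CAT_ENTRIES];
    int      n;
    FILE    *out;                     /* RAM file / HD spill for this category */
} Category;

static void
category_add(Category *c, uint32_t rid)
{
    c->buf[c->n++] = rid;
    if (c->n == CAT_ENTRIES)
    {
        fwrite(c->buf, sizeof(uint32_t), c->n, c->out);   /* one IO burst */
        c->n = 0;
    }
}

/* Caller: for each record scanned, category_add(&cat[key_bit], rid).
 * At the end, flush both buffers; concatenating cat[0]'s file followed by
 * cat[1]'s file yields the sorted order of RIDs, and the method is stable. */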

As simple as this example is, it has many of the characteristics we are
looking for:
A= We access each piece of data once on HD and in RAM.
B= We do the minimum amount of RAM and HD IO, and almost no random IO in
either case.
C= We do as much work as possible within the CPU.
D= This process is stable.  Equal keys stay in the original order they are 
encountered.

To generalize this method, we first need our 1b Key to become a Key or
KeyPrefix large enough to be useful, yet not so big as to be CPU cache
unfriendly.

Cache lines (also sometimes called blocks) are usually 64B = 512b in size.
Therefore our RID+Key or KeyPrefix should never be larger than this.  For a
2^40B data set, a 5B RID leaves us with potentially as much as 59B of Key or
KeyPrefix.  Since the data can't take on more than 40b worth of different
values (actually 500M = 29b for our example), we have more than adequate space
for the Key or KeyPrefix.  We just have to figure out how to use it
effectively.
A typical CPU L2 cache can hold 10's or 100's of thousands of such cache lines.
That's enough that we should be able to do a significant amount of useful work
within the CPU w/o having to go off-die.
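
For illustration, one such cache-line-sized element could be laid out like
this; the 5B/59B split is just this example's numbers, not anything mandated:

#include <stdint.h>
#include <string.h>

/* One cache-line-sized sort element for the 2^40B example: a 5B RID plus up
 * to 59B of Key or KeyPrefix.  Real code would size the fields per data set. */
typedef struct SortElement
{
    uint8_t rid[5];         /* 40b record locator */
    uint8_t keypfx[59];     /* order-preserving key prefix, zero padded */
} SortElement;              /* sizeof(SortElement) == 64, i.e. one cache line */

/* With a big-endian, order-preserving prefix, comparing two elements never
 * leaves their cache lines:  memcmp(a->keypfx, b->keypfx, sizeof a->keypfx) */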

The data structure we are using to represent the sorted data also needs to be
generalized.  We want a space efficient DS that allows us to find any given
element in as few accesses as possible, and that allows us to insert new
elements or reorganize the DS as efficiently as possible.  This being a DB
discussion list, a B+ tree seems like a fairly obvious suggestion ;-)

A B+ tree where each

Re: [HACKERS] [PERFORM] Releasing memory during External sorting?

2005-09-24 Thread Ron Peacetree
From: Dann Corbit [EMAIL PROTECTED]
Sent: Sep 23, 2005 5:38 PM
Subject: RE: [HACKERS] [PERFORM] Releasing memory during External sorting?

_C Unleashed_ also explains how to use a callback function to perform
arbitrary radix sorts (you simply need a method that returns the
[bucketsize] most significant bits for a given data type, for the length
of the key).

So you can sort fairly arbitrary data in linear time (of course if the
key is long then O(n*log(n)) will be better anyway.)

But in any case, if we are talking about external sorting, then disk
time will be so totally dominant that the choice of algorithm is
practically irrelevant.

Horsefeathers.  Jim Gray's sorting contest site:
http://research.microsoft.com/barc/SortBenchmark/

proves that the choice of algorithm can have a profound effect on
performance.  After all, the amount of IO done is the most
important of the things that you should be optimizing for in
choosing an external sorting algorithm.

Clearly, if we know or can assume the range of the data in question
the theoretical minimum amount of IO is one pass through all of the
data (otherwise, we are back in O(lg(n!)) land ).  Equally clearly, for
HD's that one pass should involve as few seeks as possible.

In fact, such a principle can be applied to _all_ forms of IO:  HD,
RAM, and CPU cache.  The absolute best that any sort can
possibly do is to make one pass through the data and deduce the
proper ordering of the data during that one pass.

It's usually also important that our algorithm be Stable, preferably
Wholly Stable.

Let's call such a sort Optimal External Sort (OES).  Just how much
faster would it be than current practice?

The short answer is the difference between how long it currently
takes to sort a file vs how long it would take to cat the contents
of the same file to a RAM buffer (_without_ displaying it). IOW, 
there's SIGNIFICANT room for improvement over current
standard practice in terms of sorting performance, particularly
external sorting performance.

Since sorting is a fundamental operation in many parts of a DBMS,
this is a Big Deal.
   
This discussion has gotten my creative juices flowing.  I'll post
some Straw Man algorithm sketches after I've done some more
thought.

Ron

 -Original Message-
 From: Dann Corbit [EMAIL PROTECTED]
 Sent: Friday, September 23, 2005 2:21 PM
 Subject: Re: [HACKERS] [PERFORM] Releasing memory during ...
 
For the subfiles, load the top element of each subfile into a priority
queue.  Extract the min element and write it to disk.  If the next
value is the same, then the queue does not need to be adjusted.
If the next value in the subfile changes, then adjust it.
 
Then, when the lowest element in the priority queue changes, adjust
the queue.
 
Keep doing that until the queue is empty.
 
You can create all the subfiles in one pass over the data.
 
You can read all the subfiles, merge them, and write them out in a
second pass (no matter how many of them there are).
 
The Gotcha with Priority Queues is that their performance depends
entirely on the implementation.  In naive implementations either Enqueue()
or Dequeue() takes O(n) time, which degrades sorting to O(n^2) time.

The best implementations I know of need O(lglgn) time for those
operations, allowing sorting to be done in O(nlglgn) time.
Unfortunately, there's a lot of data manipulation going on in the 
process and two IO passes are required to sort any given file.
Priority Queues do not appear to be very IO friendly.

I know of no sorting performance benchmark contest winner based on
Priority Queues.
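
For reference, here is a minimal sketch of the k-way merge described above,
implemented with a plain binary min-heap (O(lgk) per operation rather than
the O(lglgn) structures mentioned above).  Types, file format, and IO
handling are illustrative only:

#include <stdio.h>
#include <stdlib.h>

/* Sketch of the k-way merge: a binary min-heap keyed on the current head of
 * each subfile.  The subfiles are assumed to hold sorted 32-bit ints in
 * binary form. */

typedef struct
{
    int   key;      /* smallest unconsumed value in this run */
    FILE *run;      /* the subfile it came from */
} HeapItem;

static void
sift_down(HeapItem *h, int n, int i)
{
    for (;;)
    {
        int smallest = i, l = 2 * i + 1, r = 2 * i + 2;

        if (l < n && h[l].key < h[smallest].key) smallest = l;
        if (r < n && h[r].key < h[smallest].key) smallest = r;
        if (smallest == i) break;
        HeapItem tmp = h[i]; h[i] = h[smallest]; h[smallest] = tmp;
        i = smallest;
    }
}

/* One sequential read pass over every run, one sequential write of the result. */
static void
merge_runs(FILE **runs, int k, FILE *out)
{
    HeapItem *heap = malloc(k * sizeof(HeapItem));
    int n = 0;

    for (int i = 0; i < k; i++)                  /* load the head of each run */
        if (fread(&heap[n].key, sizeof(int), 1, runs[i]) == 1)
            heap[n++].run = runs[i];
    for (int i = n / 2 - 1; i >= 0; i--)         /* heapify */
        sift_down(heap, n, i);

    while (n > 0)
    {
        fwrite(&heap[0].key, sizeof(int), 1, out);        /* emit the minimum */
        if (fread(&heap[0].key, sizeof(int), 1, heap[0].run) != 1)
            heap[0] = heap[--n];                 /* that run is exhausted */
        sift_down(heap, n, 0);
    }
    free(heap);
}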


Replacement selection is not a good idea any more, since obvious
better ideas should take over.  Longer runs are of no value if you do not
have to do multiple merge passes.
 
Judging from the literature and the contest winners, Replacement
Selection is still a viable and important technique.  Besides Priority
Queues, what obvious better ideas have you heard of?


I have explained this general technique in the book C Unleashed,
chapter 13.
 
Sample code is available on the book's home page.

URL please?  

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PERFORM] Releasing memory during External sorting?

2005-09-23 Thread Ron Peacetree
From: Tom Lane [EMAIL PROTECTED]
Sent: Sep 23, 2005 2:15 PM
Subject: Re: [PERFORM] Releasing memory during External sorting? 

Mark Lewis [EMAIL PROTECTED] writes:
 operations != passes.  If you were clever, you could probably write a
 modified bubble-sort algorithm that only made 2 passes.  A pass is a
 disk scan, operations are then performed (hopefully in memory) on what
 you read from the disk.  So there's no theoretical log N lower-bound on
 the number of disk passes.

Given infinite memory that might be true, but I don't think I believe it
for limited memory.  If you have room for K tuples in memory then it's
impossible to perform more than K*N useful comparisons per pass (ie, as
each tuple comes off the disk you can compare it to all the ones
currently in memory; anything more is certainly redundant work).  So if
K < logN it's clearly not gonna work.

Actually, it's far better than that.  I recall a paper I saw in one of the
algorithms journals 15+ years ago that proved that if you knew the range
of the data, regardless of what that range was, and had n^2 space, you
could sort n items in O(n) time.

Turns out that with very modest constraints on the range of the data and
substantially less extra space (about the same as you'd need for
Replacement Selection + External Merge Sort), you can _still_ sort in
O(n) time.
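
As a concrete, simplified illustration of the known-range case, here is a
Distribution Counting sort over keys assumed to lie in [0, range); it uses
O(range) extra space and two linear passes.  Names and types are mine, just
for illustration:

#include <stdlib.h>
#include <stdint.h>

/* Distribution (counting) sort sketch: keys are assumed to lie in the known
 * range [0, range).  Two linear passes over the data, no comparisons between
 * keys, and the backward placement pass keeps it stable. */
static void
distribution_sort(const uint32_t *in, uint32_t *out, size_t n, uint32_t range)
{
    size_t *count = calloc(range, sizeof(size_t));

    for (size_t i = 0; i < n; i++)          /* pass 1: histogram the keys */
        count[in[i]]++;

    for (uint32_t k = 1; k < range; k++)    /* prefix sums -> end offsets */
        count[k] += count[k - 1];

    for (size_t i = n; i-- > 0; )           /* pass 2: place, back to front */
        out[--count[in[i]]] = in[i];

    free(count);
}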


It's possible that you could design an algorithm that works in a fixed
number of passes if you are allowed to assume you can hold O(log N)
tuples in memory --- and in practice that would probably work fine,
if the constant factor implied by the O() isn't too big.  But it's not
really solving the general external-sort problem.

If you know nothing about the data to be sorted and must guard against
the worst possible edge cases, AKA the classic definition of the general
external sorting problem,  then one can't do better than some variant
of Replacement Selection + Unbalanced Multiway Merge.

OTOH, ITRW things are _not_ like that.  We know the range of the data
in our DB fields or we can safely assume it to be relatively constrained.
This allows us access to much better external sorting algorithms.

For example Postman Sort (the 2005 winner of the PennySort benchmark)
is basically an IO optimized version of an external Radix Sort.


Ron

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PERFORM] Releasing memory during External sorting?

2005-09-23 Thread Ron Peacetree
From: Simon Riggs [EMAIL PROTECTED]
Sent: Sep 23, 2005 5:37 AM
Subject: [PERFORM] Releasing memory during External sorting?

I have concerns about whether we are overallocating memory for use in
external sorts. (All code relating to this is in tuplesort.c)

A decent external sorting algorithm, say a Merge Sort + Radix (or
Distribution Counting) hybrid with appropriate optimizations for small sub-
files, should become more effective / efficient the more RAM you give it. 


The external sort algorithm benefits from some memory but not much.

That's probably an artifact of the pg external sorting code and _not_
due to some fundamental external sorting issue.


Knuth says that the amount of memory required is very low, with a value
typically less than 1 kB.

Required means the external sort can operate on that little memory.  How
much memory is needed for optimal performance is another matter.


I/O overheads mean that there is benefit from having longer sequential
writes, so the optimum is much larger than that. I've not seen any data
that indicates that a setting higher than 16 MB adds any value at all to a 
large external sort.

It should.  A first pass upper bound would be the amount of RAM needed for
Replacement Selection to create a run (ie sort) of the whole file.  That should
be ~ the amount of RAM to hold 1/2 the file in a Replacement Selection pass.

At the simplest, for any file over 32MB the optimum should be more than 16MB.
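
A minimal sketch of classic Replacement Selection shows why run length
scales with workspace: on random input the runs average about twice the
in-memory workspace, so one run covering the whole file needs roughly half
the file's worth of RAM.  This is illustration only, with a deliberately
tiny workspace and a linear scan where a real implementation would use a
heap:

#include <stdio.h>
#include <stdbool.h>

#define SLOTS 4     /* deliberately tiny workspace, just for illustration */

static void
replacement_selection(FILE *in, FILE *out)
{
    int  mem[SLOTS];
    bool next_run[SLOTS] = { false };   /* true: element held for the next run */
    int  n = 0, last = 0;

    while (n < SLOTS && fscanf(in, "%d", &mem[n]) == 1)   /* prime the workspace */
        n++;

    while (n > 0)
    {
        /* smallest current-run element still in memory */
        int best = -1;
        for (int i = 0; i < n; i++)
            if (!next_run[i] && (best < 0 || mem[i] < mem[best]))
                best = i;

        if (best < 0)                       /* current run exhausted: new run */
        {
            fprintf(out, "--- run boundary ---\n");
            for (int i = 0; i < n; i++)
                next_run[i] = false;
            continue;
        }

        fprintf(out, "%d\n", mem[best]);    /* emit into the current run */
        last = mem[best];

        if (fscanf(in, "%d", &mem[best]) == 1)      /* refill the freed slot */
            next_run[best] = (mem[best] < last);    /* too small for this run */
        else
        {
            mem[best] = mem[--n];                   /* input exhausted: compact */
            next_run[best] = next_run[n];
        }
    }
}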


 I have some indications from private tests that very high memory settings
may actually hinder performance of the sorts, though I cannot explain that
and wonder whether it is the performance tests themselves that have issues.

Hmmm.  Are you talking about amounts so high that you are throwing the OS
into paging and swapping thrash behavior?  If not, then the above is weird.


Does anyone have any clear data that shows the value of large settings
of work_mem when the data to be sorted is much larger than memory? (I am
well aware of the value of setting work_mem higher for smaller sorts, so
any performance data needs to reflect only very large sorts). 

This is not PostgreSQL specific, but it does prove the point that the
performance of external sorts benefits greatly from large amounts of RAM
being available:

http://research.microsoft.com/barc/SortBenchmark/

Looking at the particulars of the algorithms listed there should shed a lot
of light on what a good external sorting algorithm looks like:
1= HD IO matters the most.
 1a= Seeking behavior is the largest factor in poor performance.
2= No optimal external sorting algorithm should use more than 2 passes.
3= Optimal external sorting algorithms should use 1 pass if at all possible.
4= Use as much RAM as possible, and use it as efficiently as possible.
5= The amount of RAM needed to hide the latency of a HD subsystem goes up as
the _square_ of the difference between the bandwidth of the HD subsystem and
that of memory.
6= Be cache friendly.
7= For large numbers of records whose sorting key is substantially smaller than
the record itself, use a pointer + compressed key representation and write the
data to HD in sorted order (replace HD seeks with RAM seeks; minimize RAM seeks).
8= Since your performance will be constrained by HD IO first and RAM IO second,
up to a point it is worth it to spend more CPU cycles to save on IO.

Given the large and growing gap between CPU IO, RAM IO, and HD IO, these issues
are becoming more important for _internal_ sorts as well.  


Feedback, please.

Best Regards, Simon Riggs

Hope this is useful,
Ron

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PERFORM] Releasing memory during External sorting?

2005-09-23 Thread Ron Peacetree
Yep.  Also, bear in mind that the lg(n!) =~ nlgn - n lower bound on
the number of comparisons:
a= says nothing about the amount of data movement used.
b= only holds for generic comparison based sorting algorithms.

As Knuth says (vol 3, p180), Distribution Counting sorts without
ever comparing elements to each other at all, and so does Radix
Sort.  Similar comments can be found in many algorithms texts.
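
For completeness, a minimal LSD Radix Sort sketch over 32b keys: four stable
counting passes, one per byte, and no comparisons between keys at all
(illustration only, not pg code):

#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* LSD Radix Sort sketch over 32b keys, built on stable counting passes. */
static void
radix_sort_u32(uint32_t *a, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof(uint32_t));

    for (int shift = 0; shift < 32; shift += 8)
    {
        size_t count[256] = { 0 };

        for (size_t i = 0; i < n; i++)              /* histogram this byte */
            count[(a[i] >> shift) & 0xFF]++;
        for (int b = 1; b < 256; b++)               /* prefix sums */
            count[b] += count[b - 1];
        for (size_t i = n; i-- > 0; )               /* stable placement */
            tmp[--count[(a[i] >> shift) & 0xFF]] = a[i];
        memcpy(a, tmp, n * sizeof(uint32_t));
    }
    free(tmp);
}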

Any time we know that the range of the data to be sorted is substantially
restricted compared to the number of items to be sorted, we can sort in
less than O(lg(n!)) time.  DB fields tend to take on few values and are
therefore substantially restricted.

Given the proper resources and algorithms, O(n) sorts are very plausible
when sorting DB records.

All of the fastest external sorts of the last decade or so take advantage of
this.  Check out that URL I posted.

Ron


-Original Message-
From: Mark Lewis [EMAIL PROTECTED]
Sent: Sep 23, 2005 1:43 PM
To: Tom Lane [EMAIL PROTECTED]
Subject: Re: [PERFORM] Releasing memory during External sorting?

operations != passes.  If you were clever, you could probably write a
modified bubble-sort algorithm that only made 2 passes.  A pass is a
disk scan, operations are then performed (hopefully in memory) on what
you read from the disk.  So there's no theoretical log N lower-bound on
the number of disk passes.

Not that I have anything else useful to add to this discussion, just a
tidbit I remembered from my CS classes back in college :)

-- Mark

On Fri, 2005-09-23 at 13:17 -0400, Tom Lane wrote:
 Ron Peacetree [EMAIL PROTECTED] writes:
  2= No optimal external sorting algorithm should use more than 2 passes.
  3= Optimal external sorting algorithms should use 1 pass if at all possible.
 
 A comparison-based sort must use at least N log N operations, so it
 would appear to me that if you haven't got approximately log N passes
 then your algorithm doesn't work.
 
   regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly