subject:"\[HACKERS\] Memory usage during sorting"

Re: [HACKERS] Memory usage during sorting

2012-05-01 Thread Jim Nasby


On 4/17/12 7:19 AM, Greg Stark wrote:

On Mon, Apr 16, 2012 at 10:42 PM, Peter Geogheganpe...@2ndquadrant.com  wrote:

  All but 4 regression tests pass, but they don't really count
  as failures, since they're down to an assumption in the tests that the
  order certain tuples appear should be the same as our current
  quicksort implementation returns them, even though, in these
  problematic cases, that is partially dictated by implementation - our
  quicksort isn't stable, but timsort is.

This is an interesting point. If we use a stable sort we'll probably
be stuck with stable sorts indefinitely. People will start depending
on the stability and then we'll break their apps if we find a faster
sort that isn't stable.


I have often wished that I could inject entropy into a test database to ferret 
out these kinds of issues. In particular I worry about things like users 
depending on specific values for serial types or depending on the order of data 
in the heap.

I would find it useful if Postgres had an option to intentionally inject more 
randomness in areas at the cost of some performance. IE: have nextval() burn 
through a small, random number of values before returning one, and have scan 
operators do some re-ordering of tuples where appropriate.

If we had such an option and encouraged users to use it in testing, it would 
reduce the risk of people depending on behavior that they shouldn't be.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-04-17 Thread Greg Stark

On Mon, Apr 16, 2012 at 10:42 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 All but 4 regression tests pass, but they don't really count
 as failures, since they're down to an assumption in the tests that the
 order certain tuples appear should be the same as our current
 quicksort implementation returns them, even though, in these
 problematic cases, that is partially dictated by implementation - our
 quicksort isn't stable, but timsort is.

This is an interesting point. If we use a stable sort we'll probably
be stuck with stable sorts indefinitely. People will start depending
on the stability and then we'll break their apps if we find a faster
sort that isn't stable.

Notably though tapesort is not stable (because heapsort is not stable
so neither the runs nor the merge steps are stable). So people's apps
would appear to work when they're in memory and fail only on large
data sets. It's easily possible for a user's query to never need to go
to tape though.

We don't have the luxury of having a separate sort and stable_sort
though due to the ORDER BY clause.

All in all I think it's handier to have a stable ORDER BY sort than an
unstable one though. So I'm not necessarily opposed to it even if it
means we're stuck using a stable sort indefinitely.
-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-04-16 Thread Peter Geoghegan

On 14 April 2012 14:34, Peter Geoghegan pe...@2ndquadrant.com wrote:
 FWIW, I started playing with adding timsort to Postgres last night:

 https://github.com/Peter2ndQuadrant/postgres/tree/timsort

I've fixed this feature-branch so that every qsort_arg call site
(including the tuplesort specialisations thereof) call timsort_arg
instead. All but 4 regression tests pass, but they don't really count
as failures, since they're down to an assumption in the tests that the
order certain tuples appear should be the same as our current
quicksort implementation returns them, even though, in these
problematic cases, that is partially dictated by implementation - our
quicksort isn't stable, but timsort is. I think that the original
tests are faulty, but I understand that they're only expected to
provide smoke testing.

I did have to fix one bug in the implementation that I found, which
was an assumption that comparators would only return one of {-1, 0,
1}. The C standard requires only that they return negative, zero or
positive values, as required. I also polished the code a bit, and
added more comments.

I haven't actually quantified the benefits yet, and have no numbers to
show just yet, but I suspect it will prove to be a fairly compelling
win in certain situations.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-04-14 Thread Greg Stark

On Fri, Apr 13, 2012 at 7:01 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 Well, timsort is specifically designed to take advantage of pre-sorted
 data. It does appear to have a lot of traction, as wikipedia points
 out:

I hadn't heard of it. But reading up on it it does seem like a good
fit for us. It trades some additional storage -- an array of pointers
into the sort array where in our case the pointers would be much
smaller than a whole SortTuple -- for fewer comparisons -- which in
our case are often much slower than a simple integer comparison.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-04-14 Thread Peter Geoghegan

On 14 April 2012 13:32, Greg Stark st...@mit.edu wrote:
 On Fri, Apr 13, 2012 at 7:01 PM, Peter Geoghegan pe...@2ndquadrant.com 
 wrote:
 Well, timsort is specifically designed to take advantage of pre-sorted
 data. It does appear to have a lot of traction, as wikipedia points
 out:

 I hadn't heard of it. But reading up on it it does seem like a good
 fit for us. It trades some additional storage -- an array of pointers
 into the sort array where in our case the pointers would be much
 smaller than a whole SortTuple -- for fewer comparisons -- which in
 our case are often much slower than a simple integer comparison.

I wouldn't go so far as to suggest getting rid of quicksort, of
course. Quicksort is generally faster than other average case O(n log
n) algorithms in practice, for various reasons, principal among them
that it takes advantage of memory locality so well. I don't think that
it's a coincidence that timsort is popular in Python and Java land.
Even the Python C implementation has to sort integers through all that
PyObject reference indirection, I think. I'd now speculate that an
appropriate use of this algorithm might be to simply use it with types
that don't have a SortSupport function, that are largely passed by
reference, and have expensive comparisons. FWIW, I started playing
with adding timsort to Postgres last night:

https://github.com/Peter2ndQuadrant/postgres/tree/timsort

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-04-13 Thread Tom Lane

Jeff Davis pg...@j-davis.com writes:
 On Sun, 2012-03-18 at 11:25 -0400, Tom Lane wrote:
 Yeah, that was me, and it came out of actual user complaints ten or more
 years back.  (It's actually not 2X growth but more like 4X growth
 according to the comments in logtape.c, though I no longer remember the
 exact reasons why.)  We knew when we put in the logtape logic that we
 were trading off speed for space, and we accepted that.

 I skimmed through TAOCP, and I didn't find the 4X number you are
 referring to, and I can't think what would cause that, either. The exact
 wording in the comment in logtape.c is 4X the actual data volume, so
 maybe that's just referring to per-tuple overhead?

My recollection is that that was an empirical measurement using the
previous generation of code.  It's got nothing to do with per-tuple
overhead IIRC, but with the fact that the same tuple can be sitting on
multiple tapes during a polyphase merge, because some of the tapes can
be lying fallow waiting for future use --- but data on them is still
taking up space, if you do nothing to recycle it.  The argument in the
comment shows why 2X is the minimum space growth for a plain merge
algorithm, but that's only a minimum.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-04-13 Thread Robert Haas

On Wed, Mar 21, 2012 at 8:57 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Mon, Mar 19, 2012 at 12:23 PM, Robert Haas robertmh...@gmail.com wrote:
 ...

 One thing that seems inefficient to me about our current algorithm is
 the use of the run number as a leading column in the sort key.
 There's no real reason why the tuples destined for the next run need
 to be maintained in heap order; we could just store them unordered and
 heapify the whole lot of them when it's time to start the next run.
 That ought to save comparison cycles for the current run, since the
 heap will be less deep, and there are other possible savings as well -
 in particular, an unordered array of tuples can be heapify'd in linear
 time, whereas our current algorithm is O(n lg n).  However, when I
 actually implemented this, I found that doing it this way was a loser.
  In fact, excluding the next-run tuples from the heap for the current
 run was a BIG loser even before you add in the time required to
 re-heapify.  This confused the daylights out of me for a while,
 because it's hard to understand how insert and siftup can be slower on
 a small heap than a large one.

 My working theory is that, even though we must necessarily do more
 comparisons when manipulating a larger heap, many of those comparisons
 are resolved by comparing run numbers, and that's a lot cheaper than
 invoking the real comparator.  For example, if we insert into a heap
 whose rightmost child is in the next run, and the new tuple goes into
 the current run, and the current size of the heap is such that the
 initial position of the new element is a descendent of the right node,
 it will very quickly crash through all of the next-run tuples and only
 need one REAL comparison - against the root.  Either the new element
 ends up as the root, or it ends up as the direct child of the root;
 now we remove the root and, perhaps, replace it with its rightmost
 child, meaning that the next element we read in can do the exact same
 thing all over again.  If we exclude the next-run elements from the
 heap, then the average depth of the heap when inserting a new element
 is smaller, but all the elements are in the same-run, and we have to
 invoke the real comparator every time.  In other words, those next-run
 tuples act as dummies which effectively create a heap of uneven depth,
 and the average depth encountered when inserting tuples turns out to
 be less than what we get by pulling out the dummies and making the
 depth uniform.

 This makes me kind of nervous, because I don't understand why things
 should work out so well as all that.  If throwing some dummy elements
 into the heap makes things work better, then maybe we should throw in
 some more.  Or maybe it would work better to take some but not all of
 them out.  There certainly doesn't seem to be any indication in the
 code that this is an anticipated effect, and it seems an awfully
 providential accident.

 Is there a patch or a git repo available for this change?  I'd like to
 see if I can analyze that if I get some spare time.

Sorry for the slow response to this.  Patch is attached
(crummy-external-sort.patch).

 Clever.  Rather than doing a binary search of the path, it might make
 sense to do a reverse search of it.  The tuple is likely to send up
 somewhere at the bottom of the heap, both because that is where most
 of the data is, and because the tuple being reinserted just came from
 the bottom so it is likely to be biased that way.

Patches for these approaches also attached (siftup-using-bsearch.patch
and siftup-reverse-linear.patch).  Neither approach seemed terribly
promising in my testing, IIRC.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


crummy-external-sort.patch
Description: Binary data


siftup-reverse-linear.patch
Description: Binary data


siftup-using-bsearch.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-04-13 Thread Greg Stark

On Mon, Mar 19, 2012 at 7:23 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sun, Mar 18, 2012 at 11:25 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 So has somebody found a hole in the n log n lower bound on the cost of
 comparison-based sorting?  I thought that had been proven pretty
 rigorously.

 There's not much danger of anyone finding a way around that bound
 since the proof is quite trivial, but keep in mind that it's a
 worst-case bound.

Fwiw it also only holds for comparison sorts. If you can sort your
items without actually comparing each item with the others then you
aren't bound by it at all. Notably algorithms like radix sort and
direct sort aren't bound by it and are O(n). I'm hoping to look at
trying some of these for integer sorts when they apply using the new
sort specialization code Peter added.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-04-13 Thread Peter Geoghegan

On 13 April 2012 16:03, Greg Stark st...@mit.edu wrote:
 Fwiw it also only holds for comparison sorts. If you can sort your
 items without actually comparing each item with the others then you
 aren't bound by it at all. Notably algorithms like radix sort and
 direct sort aren't bound by it and are O(n). I'm hoping to look at
 trying some of these for integer sorts when they apply using the new
 sort specialization code Peter added.

Actually, though a later revision of the patch did nominally allow for
user-defined sorting functions through the SortSupport infrastructure
(I didn't intend that it would be complete/useful enough to be really
worth documenting), the version that was finally committed did not.
However, there is a fairly obvious entry point for a radix sort, which
is here:

/*
 * We were able to accumulate all the tuples within the 
allowed
 * amount of memory.  Just qsort 'em and we're done.
 */
if (state-memtupcount  1)
{
/* Can we use the single-key sort function? */
if (state-onlyKey != NULL)
qsort_ssup(state-memtuples, 
state-memtupcount,
   state-onlyKey);
else
qsort_tuple(state-memtuples,

state-memtupcount,

state-comparetup,
state);
}

The only specialisation that occurs here is between tuple sorts of a
single key and multiple (type-specific specialisations were ultimately
not judged to be worth the increase in binary size relative to their
diminishing benefits).  You'd probably set-up a type-specific
positional notation machinery within the state's SortSupport struct
(the type-specific abstraction through which we elide the SQL function
call machinery for types that support it).

One insight that I had at the time was that text comparisons where the
c locale isn't used are really rather expensive, and I doubt that
there is too much that can be done to address that directly.  However,
if we were to support timsort, that could substantially cut down on
the number of comparisons for text columns, which could be a real win.
Maybe that would happen through some explicit mechanism, or maybe the
planner could actually infer that it was likely to be optimal to use
timsort based on a high statistical correlation between physical row
ordering and logical ordering of a key column's values.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-04-13 Thread Peter Geoghegan

On 13 April 2012 17:42, Peter Geoghegan pe...@2ndquadrant.com wrote:
 One insight that I had at the time was that text comparisons where the
 c locale isn't used are really rather expensive, and I doubt that
 there is too much that can be done to address that directly.  However,
 if we were to support timsort, that could substantially cut down on
 the number of comparisons for text columns, which could be a real win.
 Maybe that would happen through some explicit mechanism, or maybe the
 planner could actually infer that it was likely to be optimal to use
 timsort based on a high statistical correlation between physical row
 ordering and logical ordering of a key column's values.

Further thoughts:

At the time we committed our own quicksort implementation, based on
the NetBSD one, we eschewed the optimisation of using insertion sort
when n is fairly low. This happens to be a very common optimisation,
so I'm not really super-confident that that was a good decision.
However, we also added our own optimisation, which is to attempt,
regardless of the size of n, to ascertain that the array is
pre-sorted, in which case we don't quicksort at all.

So if we attempt to quicksort an array which is almost pre-sorted, but
say only has its very last element out-of-order, we'll do fairly
horribly, because we waste the effort of almost an entire bubble sort
iteration. So almost-sorted data would seem to be a compelling thing
to optimise for that reason, as well as for the simple observation
that it isn't exactly uncommon in a relational database.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-04-13 Thread Robert Haas

On Fri, Apr 13, 2012 at 1:15 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 13 April 2012 17:42, Peter Geoghegan pe...@2ndquadrant.com wrote:
 One insight that I had at the time was that text comparisons where the
 c locale isn't used are really rather expensive, and I doubt that
 there is too much that can be done to address that directly.  However,
 if we were to support timsort, that could substantially cut down on
 the number of comparisons for text columns, which could be a real win.
 Maybe that would happen through some explicit mechanism, or maybe the
 planner could actually infer that it was likely to be optimal to use
 timsort based on a high statistical correlation between physical row
 ordering and logical ordering of a key column's values.

 Further thoughts:

 At the time we committed our own quicksort implementation, based on
 the NetBSD one, we eschewed the optimisation of using insertion sort
 when n is fairly low. This happens to be a very common optimisation,
 so I'm not really super-confident that that was a good decision.
 However, we also added our own optimisation, which is to attempt,
 regardless of the size of n, to ascertain that the array is
 pre-sorted, in which case we don't quicksort at all.

We do use insertion sort for partitions of size  7.  I assume you are
referring to something else.

One thing that has struck me as odd is that we check for presorted
arrays at every level of recursion.  That is, if we start out with 100
elements, we'll check whether the input is presorted.  Then, if we
partition it into a block of 60 elements and a block of 40 elements,
we'll check each of those partitions over again to see whether they
are presorted.  There is definitely some potential upside here,
because if the input is mostly sorted it may contain runs where
everything is in order, and we'll detect that and avoid some work.
But it's also kind of expensive; I've wondered whether we should, for
example, move the test for presorted input down a bit, so that it only
happens in the n  40 case.  Another idea is to arrange things so that
we remember the offset at which the last presort check failed.  When
we tail-recurse into the left half of the array, there's no need to
recheck - if the presort check fell out after our partition boundary,
it's presorted; if it fell out before the partition boundary, it
isn't.

 So if we attempt to quicksort an array which is almost pre-sorted, but
 say only has its very last element out-of-order, we'll do fairly
 horribly, because we waste the effort of almost an entire bubble sort
 iteration. So almost-sorted data would seem to be a compelling thing
 to optimise for that reason, as well as for the simple observation
 that it isn't exactly uncommon in a relational database.

Heap-sorting could also benefit from some optimization in this area.
Right now, if you heap-sort data that's already in order, it gets
progressively slower as you crank up work_mem, because the heap gets
deeper and so extraction and siftup get more expensive, for no real
benefit.  We could do something like check, every time we use up our
available memory, whether the heap is already entirely in order.  If
so, we could dump all but one tuple to the current run and the start
reading tuples again.  Or maybe dump some percentage of the array,
though that might require a bit more bookkeeping.  Or maybe there's a
smarter algorithm that would also cater to mostly-sorted data.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-04-13 Thread Peter Geoghegan

On 13 April 2012 18:33, Robert Haas robertmh...@gmail.com wrote:
 We do use insertion sort for partitions of size  7.  I assume you are
 referring to something else.

I stand corrected. My confusion came from the removal of a later
*switch* to insertion sort in
a3f0b3d68f9a5357a3f72b40a45bcc714a9e0649, which also added the
pre-sorted check where you'd expect to see the insertion switch. Of
course, the n  7 insertion sort thing is right beside the added
check. So another, redundant copy of insertion sort was removed, and
not the one that almost every real-world quicksort implementation has.

 Heap-sorting could also benefit from some optimization in this area.
 Right now, if you heap-sort data that's already in order, it gets
 progressively slower as you crank up work_mem, because the heap gets
 deeper and so extraction and siftup get more expensive, for no real
 benefit.  We could do something like check, every time we use up our
 available memory, whether the heap is already entirely in order.  If
 so, we could dump all but one tuple to the current run and the start
 reading tuples again.  Or maybe dump some percentage of the array,
 though that might require a bit more bookkeeping.  Or maybe there's a
 smarter algorithm that would also cater to mostly-sorted data.

Well, timsort is specifically designed to take advantage of pre-sorted
data. It does appear to have a lot of traction, as wikipedia points
out:

Timsort has been Python's standard sorting algorithm since version
2.3. It is now also used to sort arrays in Java SE 7,[2] and on the
Android platform.[3] 

Somebody has produced a BSD-licensed C implementation that draws
heavily on the Python/Java one here:

http://code.google.com/p/timsort/source/browse/trunk/timSort.c?spec=svn17r=17

It looks like it has exactly the same interface as a c stdlib qsort.
So it'd be fairly easy to produce a timsort_arg() based on this, if
anyone cares to.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-04-12 Thread Jeff Davis

On Sun, 2012-03-18 at 11:25 -0400, Tom Lane wrote:
 Yeah, that was me, and it came out of actual user complaints ten or more
 years back.  (It's actually not 2X growth but more like 4X growth
 according to the comments in logtape.c, though I no longer remember the
 exact reasons why.)  We knew when we put in the logtape logic that we
 were trading off speed for space, and we accepted that.

I skimmed through TAOCP, and I didn't find the 4X number you are
referring to, and I can't think what would cause that, either. The exact
wording in the comment in logtape.c is 4X the actual data volume, so
maybe that's just referring to per-tuple overhead?

However, I also noticed that section 5.4.4 (Vol 3 p299) starts
discussing the idea of running the tapes backwards and forwards. That
doesn't directly apply, because a disk seek is cheaper than rewinding a
tape, but perhaps it could be adapted to get the block-freeing behavior
we want. The comments in logtape.c say:

  Few OSes allow arbitrary parts of a file to be released back to the
OS, so we have to implement this space-recycling ourselves within a
single logical file.

But if we alternate between reading in forward and reverse order, we can
make all of the deallocations at the end of the file, and then just
truncate to free space. I would think that the OS could be more
intelligent about block allocations and deallocations to avoid too much
fragmentation, and it would probably be a net reduction in code
complexity. Again, the comments in logtape.c have something to say about
it:

   ...but the seeking involved should be comparable to what would
happen if we kept each logical tape in a separate file

But I don't understand why that is the case.

On another topic, quite a while ago Len Shapiro (PSU professor)
suggested to me that we implement Algorithm F (Vol 3 p321). Right now,
as tuplesort.c says:

   In the current code we determine the number of tapes M on the basis
of workMem: we want workMem/M to be large enough that we read a fair
amount of data each time we preread from a tape

But with forcasting, we can be a little smarter about which tapes we
preread from if the data in the runs is not random. That means we could
potentially merge more runs at once with the same work_mem without
sacrificing adequate buffers for prefetching. I'm not sure whether this
is a practical problem today, and I'm also not sure what to do if we
start merging a lot more runs and then determine that forcasting doesn't
work as well as we'd hoped (e.g. that the data in the runs really is
random). But I thought it was worth mentioning.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-21 Thread Robert Haas

On Tue, Mar 20, 2012 at 6:16 PM, Greg Stark st...@mit.edu wrote:
 On Tue, Mar 20, 2012 at 8:00 PM, Robert Haas robertmh...@gmail.com wrote:
 Frankly that analysis didn't make any sense to me at the time.
 Comparing integers is fast, sure, but it's still slower than not
 having to do any comparison at all.

 I think you're underestimating how much it costs to call the
 datatype-specific comparator.  My belief is that it's wicked
 expensive.

 I'm totally with you on the datatype-specific comparator being expensive.

 But we must be talking about two different scenarios. I don't see why
 Tom's algorithm was slower than Knuth's unless there was a bug. It
 seems to me it should perform exactly as many comparator calls but
 save the integer comparisons and the extra space for them.

It seemed that way to me, too, but it wasn't.

 In the current algorithm, Knuth's, it compares the new tuple against
 the most recently emitted tuple to set the run number then adds it to
 the bottom of the heap and sifts up. If it's from the current run it
 does a bunch of integer comparisons and skips past the next run
 quickly. If it's from the next run it sifts up to the right spot in
 the next run and if it hits the top of the next run it does a quick
 integer comparison and stops.

Check.

 In Tom's algorithm it would perform the same comparison against the
 recently emitted tuple to set the run number and either add it to the
 unsorted list or the bottom of the heap. If it's added to the unsorted
 list we're done,

Check.

 if it's added to the bottom of the heap it performs
 the same siftup it would have done above except that it skips the
 bottom few levels of the heap -- all of which were fast integer
 comparisons.

I think this is where the wheels come off.  It's excessively
optimistic to suppose that we're going to skip the bottom few levels
of the heap.  Half of the elements in the heap have no children at
all; and half of the remainder are parents but not grand-parents.  If
you assume, at a given point in time, that half of the tuples we're
holding onto are for the next run, then the heap is ONE level
shallower than it would otherwise have been, and you figure to save
ONE integer comparison.  If we're relatively early in the run and only
one-tenth of the tuples we're holding onto are for the next run, then
heap is barely shallower at all and we're scarcely avoiding anything.

But having even a small number of next-run tuples scattered through
the heap gives us the opportunity to get lucky.  If the heap size is
N, even lg N next-run tuples in the heap is potentially enough for us
to avoid most of the calls to a real comparator, if it so happens that
all of those tuples are our ancestors.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-21 Thread Jeff Janes

On Mon, Mar 19, 2012 at 12:23 PM, Robert Haas robertmh...@gmail.com wrote:
...

 One thing that seems inefficient to me about our current algorithm is
 the use of the run number as a leading column in the sort key.
 There's no real reason why the tuples destined for the next run need
 to be maintained in heap order; we could just store them unordered and
 heapify the whole lot of them when it's time to start the next run.
 That ought to save comparison cycles for the current run, since the
 heap will be less deep, and there are other possible savings as well -
 in particular, an unordered array of tuples can be heapify'd in linear
 time, whereas our current algorithm is O(n lg n).  However, when I
 actually implemented this, I found that doing it this way was a loser.
  In fact, excluding the next-run tuples from the heap for the current
 run was a BIG loser even before you add in the time required to
 re-heapify.  This confused the daylights out of me for a while,
 because it's hard to understand how insert and siftup can be slower on
 a small heap than a large one.

 My working theory is that, even though we must necessarily do more
 comparisons when manipulating a larger heap, many of those comparisons
 are resolved by comparing run numbers, and that's a lot cheaper than
 invoking the real comparator.  For example, if we insert into a heap
 whose rightmost child is in the next run, and the new tuple goes into
 the current run, and the current size of the heap is such that the
 initial position of the new element is a descendent of the right node,
 it will very quickly crash through all of the next-run tuples and only
 need one REAL comparison - against the root.  Either the new element
 ends up as the root, or it ends up as the direct child of the root;
 now we remove the root and, perhaps, replace it with its rightmost
 child, meaning that the next element we read in can do the exact same
 thing all over again.  If we exclude the next-run elements from the
 heap, then the average depth of the heap when inserting a new element
 is smaller, but all the elements are in the same-run, and we have to
 invoke the real comparator every time.  In other words, those next-run
 tuples act as dummies which effectively create a heap of uneven depth,
 and the average depth encountered when inserting tuples turns out to
 be less than what we get by pulling out the dummies and making the
 depth uniform.

 This makes me kind of nervous, because I don't understand why things
 should work out so well as all that.  If throwing some dummy elements
 into the heap makes things work better, then maybe we should throw in
 some more.  Or maybe it would work better to take some but not all of
 them out.  There certainly doesn't seem to be any indication in the
 code that this is an anticipated effect, and it seems an awfully
 providential accident.

Is there a patch or a git repo available for this change?  I'd like to
see if I can analyze that if I get some spare time.

...

 It turns out that it's possible to reduce the number of comparisons
 required to reinsert the last heap element from 2*heap_depth to
 heap_depth+lg(heap_depth).  At each node, compare the two children,
 and then let the current node be the lesser of those.  When you reach
 a leaf node, you've walked a path of length heap_depth which is known
 to be sorted (since we've got a heap), so you can find the correct
 reinsertion position by binary searching that path.  In the case of
 the sorted data mentioned above, this reduces the time to 19.45
 seconds with work_mem = 128MB and 16.12 seconds with work_mem = 1MB.
 However, in other cases, it seems not to help at all, or even be a
 slight regression.

Clever.  Rather than doing a binary search of the path, it might make
sense to do a reverse search of it.  The tuple is likely to send up
somewhere at the bottom of the heap, both because that is where most
of the data is, and because the tuple being reinserted just came from
the bottom so it is likely to be biased that way.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-20 Thread Greg Stark

On Tue, Mar 20, 2012 at 1:57 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 That was a long time ago, of course, but I have some vague recollection
 that keeping next-run tuples in the current heap achieves a net savings
 in the total number of comparisons needed to heapify both runs.

Offhand I wonder if this is all because we don't have the O(n) heapify
implemented.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-20 Thread Tom Lane

Greg Stark st...@mit.edu writes:
 Offhand I wonder if this is all because we don't have the O(n) heapify
 implemented.

Robert muttered something about that before, but is it real?  If you
could do that, I'd think you'd have a less-than-n-log-n sorting
solution.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-20 Thread Greg Stark

http://en.wikipedia.org/wiki/Binary_heap#Building_a_heap

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-20 Thread Jeff Janes

On Tue, Mar 20, 2012 at 6:31 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Greg Stark st...@mit.edu writes:
 Offhand I wonder if this is all because we don't have the O(n) heapify
 implemented.

I think we do already have it implemented.  1/2 the time the tuple
stays where it is after one comparison, 1/4 it moves up one level with
two comparisons, 1/8 it moves up two levels with 3 comparisons, etc.
That series sums up to a constant.  Maybe there is a worst-case that
makes this fall apart, though.  Heapifying something which is already
reverse sorted, maybe?

 Robert muttered something about that before, but is it real?  If you
 could do that, I'd think you'd have a less-than-n-log-n sorting
 solution.

Turning random tuples into heap can be linear.  Extracting them while
maintaining the heap is NlogN, though.  You can't sort without the
extraction step, so the law is preserved.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-20 Thread Tom Lane

Greg Stark st...@mit.edu writes:
 http://en.wikipedia.org/wiki/Binary_heap#Building_a_heap

Interesting.  I'm pretty sure that idea appears nowhere in Knuth
(which might mean it's new enough to have a live patent on it ...
anybody know who invented this?).  But it seems like that should buy
back enough comparisons to justify leaving the next-run tuples out of
the heap (unordered) until the heap becomes empty.  You still want to
insert new tuples into the heap if they can go to the current run, of
course.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-20 Thread Robert Haas

On Tue, Mar 20, 2012 at 7:44 AM, Greg Stark st...@mit.edu wrote:
 On Tue, Mar 20, 2012 at 1:57 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 That was a long time ago, of course, but I have some vague recollection
 that keeping next-run tuples in the current heap achieves a net savings
 in the total number of comparisons needed to heapify both runs.

 Offhand I wonder if this is all because we don't have the O(n) heapify
 implemented.

I'm pretty sure that's not the problem.  Even though our heapify is
not as efficient as it could be, it's plenty fast enough.  I thought
about writing a patch to implement the better algorithm, but it seems
like a distraction at this point because the heapify step is such a
small contributor to overall sort time.  What's taking all the time is
the repeated siftup operations as we pop things out of the heap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-20 Thread Robert Haas

On Tue, Mar 20, 2012 at 12:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Greg Stark st...@mit.edu writes:
 http://en.wikipedia.org/wiki/Binary_heap#Building_a_heap

 Interesting.  I'm pretty sure that idea appears nowhere in Knuth
 (which might mean it's new enough to have a live patent on it ...
 anybody know who invented this?).

It's in every introductory algorithms textbook; I'd be shocked if
anyone could make an IP claim on it.

 But it seems like that should buy
 back enough comparisons to justify leaving the next-run tuples out of
 the heap (unordered) until the heap becomes empty.  You still want to
 insert new tuples into the heap if they can go to the current run, of
 course.

It seems like it should, but if you read (or reread) my long boring
analysis upthread, you'll learn that it doesn't.  It's slower even if
the cost of building a heap is zero.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-20 Thread Tom Lane

Robert Haas robertmh...@gmail.com writes:
 On Tue, Mar 20, 2012 at 7:44 AM, Greg Stark st...@mit.edu wrote:
 Offhand I wonder if this is all because we don't have the O(n) heapify
 implemented.

 I'm pretty sure that's not the problem.  Even though our heapify is
 not as efficient as it could be, it's plenty fast enough.  I thought
 about writing a patch to implement the better algorithm, but it seems
 like a distraction at this point because the heapify step is such a
 small contributor to overall sort time.  What's taking all the time is
 the repeated siftup operations as we pop things out of the heap.

Right, but wouldn't getting rid of the run-number comparisons provide
some marginal improvement in the speed of tuplesort_heap_siftup?

BTW, there's a link at the bottom of the wikipedia page to a very
interesting ACM Queue article, which argues that the binary-tree
data structure isn't terribly well suited to virtual memory because
it touches random locations in succession.  I'm not sure I believe
his particular solution, but I'm wondering about B+ trees, ie more
than 2 children per node.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-20 Thread Robert Haas

On Tue, Mar 20, 2012 at 12:33 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Mar 20, 2012 at 7:44 AM, Greg Stark st...@mit.edu wrote:
 Offhand I wonder if this is all because we don't have the O(n) heapify
 implemented.

 I'm pretty sure that's not the problem.  Even though our heapify is
 not as efficient as it could be, it's plenty fast enough.  I thought
 about writing a patch to implement the better algorithm, but it seems
 like a distraction at this point because the heapify step is such a
 small contributor to overall sort time.  What's taking all the time is
 the repeated siftup operations as we pop things out of the heap.

 Right, but wouldn't getting rid of the run-number comparisons provide
 some marginal improvement in the speed of tuplesort_heap_siftup?

No.  It does the opposite: it slows it down.  This is a highly
surprising result but it's quite repeatable: removing comparisons
makes it slower.  As previously pontificated, I think this is probably
because the heap can fill up with next-run tuples that are cheap to
compare against, and that spares us having to do real comparisons
involving the actual datatype comparators.

 BTW, there's a link at the bottom of the wikipedia page to a very
 interesting ACM Queue article, which argues that the binary-tree
 data structure isn't terribly well suited to virtual memory because
 it touches random locations in succession.  I'm not sure I believe
 his particular solution, but I'm wondering about B+ trees, ie more
 than 2 children per node.

I don't think virtual memory locality is the problem.  I read
somewhere that a ternary heap is supposed to be about one-eighth
faster than a binary heap, but that's because picking the smallest of
three tuples requires two comparisons, whereas picking the smallest of
four tuples requires three comparisons, which is better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-20 Thread Greg Stark

On Tue, Mar 20, 2012 at 5:04 PM, Robert Haas robertmh...@gmail.com wrote:
 No.  It does the opposite: it slows it down.  This is a highly
 surprising result but it's quite repeatable: removing comparisons
 makes it slower.  As previously pontificated, I think this is probably
 because the heap can fill up with next-run tuples that are cheap to
 compare against, and that spares us having to do real comparisons
 involving the actual datatype comparators.

Frankly that analysis didn't make any sense to me at the time.
Comparing integers is fast, sure, but it's still slower than not
having to do any comparison at all.

It's just barely conceivable that it's a cache effect on the *code*
but even that seems pretty darned unlikely to me. My money currently
is on a measurement error but that's just a guess since I haven't
tried to replicate any of your results.


 BTW, there's a link at the bottom of the wikipedia page to a very
 interesting ACM Queue article, which argues that the binary-tree
 data structure isn't terribly well suited to virtual memory because
 it touches random locations in succession.  I'm not sure I believe
 his particular solution, but I'm wondering about B+ trees, ie more
 than 2 children per node.

 I don't think virtual memory locality is the problem.  I read
 somewhere that a ternary heap is supposed to be about one-eighth
 faster than a binary heap, but that's because picking the smallest of
 three tuples requires two comparisons, whereas picking the smallest of
 four tuples requires three comparisons, which is better.

The things I was reading suggested 4-heap was more cache efficient
which is a heap with 4-children per node and still representable as a
flat array. There's also Fibonacci heap which makes insertions really
fast but since we're doing exactly as many insertions as deletions the
question is how much work it adds to deletions. It's still O(logn)
amortized but it may be 2x the constant factor.

Fwiw I think more interesting than improving tapesort would be
implementing wholly different algorithms like radix sort or the
repeated quicksort. Being able to handle different algorithms that
require a different API would be the first step to being able to
handle parallel algorithms using threads or GPUs.

I also wonder if disabling the space reuse and just generating
separate files for each run might not be a win on exotic filesystems
like ZFS or WAFL where overwriting blocks is more expensive than
allocating new blocks.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-20 Thread Robert Haas

On Tue, Mar 20, 2012 at 3:41 PM, Greg Stark st...@mit.edu wrote:
 On Tue, Mar 20, 2012 at 5:04 PM, Robert Haas robertmh...@gmail.com wrote:
 No.  It does the opposite: it slows it down.  This is a highly
 surprising result but it's quite repeatable: removing comparisons
 makes it slower.  As previously pontificated, I think this is probably
 because the heap can fill up with next-run tuples that are cheap to
 compare against, and that spares us having to do real comparisons
 involving the actual datatype comparators.

 Frankly that analysis didn't make any sense to me at the time.
 Comparing integers is fast, sure, but it's still slower than not
 having to do any comparison at all.

I think you're underestimating how much it costs to call the
datatype-specific comparator.  My belief is that it's wicked
expensive.   The COMPARETUP() macro extracts a function pointer from
the Tuplesortstate and calls it; we end up comparetup_heap, which
calls ApplySortComparator(), which pulls the comparator function out
of the state and then calls that.  Since I was sorting strings, which
have no sortsupport, we then end up in comparison_shim(), which uses
the FunctionCallInvoke method to extract the actual function pointer
and jump into bttextcmp(), which unpacks its arguments and then calls
text_cmp(), which unpacks its arguments some more and then calls
varstr_cmp() where the actual work happens.  That's not trivial either
- we have to call lc_collate_is_c() and then memcmp().  I have no
problem believing that 6 levels of function calls, 3 of which involve
jumps through function pointers, followed by lc_collate_is_c() and
memcmp() is 100x or more as expensive than the lone integer comparison
that happens when the tupindex values don't match - that's a single
instruction.

 Fwiw I think more interesting than improving tapesort would be
 implementing wholly different algorithms like radix sort or the
 repeated quicksort. Being able to handle different algorithms that
 require a different API would be the first step to being able to
 handle parallel algorithms using threads or GPUs.

Yeah, I think I'm going to try implementing
quicksort-the-whole-batch-and-dump-it-out-as-a-run algorithm, just to
see how good or bad that is compared to what we have now.  We may not
end up doing anything that remotely resembles that, in the end, but I
want to see the numbers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-20 Thread Jim Nasby


On 3/18/12 10:25 AM, Tom Lane wrote:

Jeff Janesjeff.ja...@gmail.com  writes:

  On Wed, Mar 7, 2012 at 11:55 AM, Robert Haasrobertmh...@gmail.com  wrote:

  On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janesjeff.ja...@gmail.com  wrote:

  Anyway, I think the logtape could use redoing.

  The problem there is that none of the files can be deleted until it
  was entirely read, so you end up with all the data on disk twice.  I
  don't know how often people run their databases so close to the edge
  on disk space that this matters, but someone felt that that extra
  storage was worth avoiding.

Yeah, that was me, and it came out of actual user complaints ten or more
years back.  (It's actually not 2X growth but more like 4X growth
according to the comments in logtape.c, though I no longer remember the
exact reasons why.)  We knew when we put in the logtape logic that we
were trading off speed for space, and we accepted that.  It's possible
that with the growth of hard drive sizes, real-world applications would
no longer care that much about whether the space required to sort is 4X
data size rather than 1X.  Or then again, maybe their data has grown
just as fast and they still care.



I believe the case of tape sorts that fit entirely in filesystem cache is a big one as 
well... doubling or worse the amount of data that needed to live on disk at 
once would likely suck in that case.

Also, it's not uncommon to be IO-bound on a database server... so even if we're 
not worried about storing everything 2 or more times from a disk space 
standpoint, we should be concerned about the IO bandwidth.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-20 Thread Tom Lane

Robert Haas robertmh...@gmail.com writes:
 Yeah, I think I'm going to try implementing
 quicksort-the-whole-batch-and-dump-it-out-as-a-run algorithm, just to
 see how good or bad that is compared to what we have now.  We may not
 end up doing anything that remotely resembles that, in the end, but I
 want to see the numbers.

The reason replacement selection is attractive is that the initial runs
tend to be about twice as long as you can get if you just sort
memory-loads independently.  (And that's for random input, the win can
be a lot more on partially ordered data.)  It seems unlikely that you
will get enough win from quicksorting to overcome that; especially not
if your thesis is correct that all the cost is coming from comparison
infrastructure.

But don't let me discourage you from measuring it ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-20 Thread Greg Stark

On Tue, Mar 20, 2012 at 8:00 PM, Robert Haas robertmh...@gmail.com wrote:
 Frankly that analysis didn't make any sense to me at the time.
 Comparing integers is fast, sure, but it's still slower than not
 having to do any comparison at all.

 I think you're underestimating how much it costs to call the
 datatype-specific comparator.  My belief is that it's wicked
 expensive.

I'm totally with you on the datatype-specific comparator being expensive.

But we must be talking about two different scenarios. I don't see why
Tom's algorithm was slower than Knuth's unless there was a bug. It
seems to me it should perform exactly as many comparator calls but
save the integer comparisons and the extra space for them.

In the current algorithm, Knuth's, it compares the new tuple against
the most recently emitted tuple to set the run number then adds it to
the bottom of the heap and sifts up. If it's from the current run it
does a bunch of integer comparisons and skips past the next run
quickly. If it's from the next run it sifts up to the right spot in
the next run and if it hits the top of the next run it does a quick
integer comparison and stops.

In Tom's algorithm it would perform the same comparison against the
recently emitted tuple to set the run number and either add it to the
unsorted list or the bottom of the heap. If it's added to the unsorted
list we're done, if it's added to the bottom of the heap it performs
the same siftup it would have done above except that it skips the
bottom few levels of the heap -- all of which were fast integer
comparisons. They might be fast but they can't be faster than doing
nothing at all. When the heap is empty then we do a heapify which
currently would be exactly the same O(nlogn) comparisons that we do
maintaining the bottom of the heap but with a O(n) heapify would be
fewer.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-19 Thread Robert Haas

On Sun, Mar 18, 2012 at 11:25 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 So has somebody found a hole in the n log n lower bound on the cost of
 comparison-based sorting?  I thought that had been proven pretty
 rigorously.

There's not much danger of anyone finding a way around that bound
since the proof is quite trivial, but keep in mind that it's a
worst-case bound.  For example, if you have reason to believe that the
input is likely to already be sorted, you can check for that case in
using just n-1 comparisons.  If it turns out you're right, you can
sort the data in linear time.  Heck, you can also check for
reverse-sorted input at the same time, if you like, and handle that
special case too.  Of course, when the data is unordered, such
special-case checks hurt rather than help, even though it's still O(n
lg n) overall; those pesky constant factors do matter quite a bit.
Still, sometimes it's worth it: our quicksort algorithm checks for
sorted input on each recursive invocation, since quicksort degrades
rather badly in that scenario; but our heapsort doesn't contain a
similar check, probably because heapsort typically works fine in that
scenario.

One thing that seems inefficient to me about our current algorithm is
the use of the run number as a leading column in the sort key.
There's no real reason why the tuples destined for the next run need
to be maintained in heap order; we could just store them unordered and
heapify the whole lot of them when it's time to start the next run.
That ought to save comparison cycles for the current run, since the
heap will be less deep, and there are other possible savings as well -
in particular, an unordered array of tuples can be heapify'd in linear
time, whereas our current algorithm is O(n lg n).  However, when I
actually implemented this, I found that doing it this way was a loser.
 In fact, excluding the next-run tuples from the heap for the current
run was a BIG loser even before you add in the time required to
re-heapify.  This confused the daylights out of me for a while,
because it's hard to understand how insert and siftup can be slower on
a small heap than a large one.

My working theory is that, even though we must necessarily do more
comparisons when manipulating a larger heap, many of those comparisons
are resolved by comparing run numbers, and that's a lot cheaper than
invoking the real comparator.  For example, if we insert into a heap
whose rightmost child is in the next run, and the new tuple goes into
the current run, and the current size of the heap is such that the
initial position of the new element is a descendent of the right node,
it will very quickly crash through all of the next-run tuples and only
need one REAL comparison - against the root.  Either the new element
ends up as the root, or it ends up as the direct child of the root;
now we remove the root and, perhaps, replace it with its rightmost
child, meaning that the next element we read in can do the exact same
thing all over again.  If we exclude the next-run elements from the
heap, then the average depth of the heap when inserting a new element
is smaller, but all the elements are in the same-run, and we have to
invoke the real comparator every time.  In other words, those next-run
tuples act as dummies which effectively create a heap of uneven depth,
and the average depth encountered when inserting tuples turns out to
be less than what we get by pulling out the dummies and making the
depth uniform.

This makes me kind of nervous, because I don't understand why things
should work out so well as all that.  If throwing some dummy elements
into the heap makes things work better, then maybe we should throw in
some more.  Or maybe it would work better to take some but not all of
them out.  There certainly doesn't seem to be any indication in the
code that this is an anticipated effect, and it seems an awfully
providential accident.

It may (or may not) be related to the effect that Greg Stark mentions
in a nearby post: increasing work_mem makes the heap bigger, so heap
maintenance gets more expensive, sometimes without any corresponding
benefit.  For example, if the input happens to already be sorted, then
inserting into the heap is cheap, because each new element is already
greater than its parent, and life is good.  But removing from the heap
is painfully expensive, because to remove element from the heap we
stick the last element in the heap into the whole left by the
departing element and then sift it down.  However, that requires
2*heap_depth comparisons, all of which involve really invoking the
comparison function since everything's going to end up going into a
single run, and that gets more and more expensive as work_mem
increases.  So a heap sort of already-sorted data seems to get slower
and slower the more memory you give it to work with.  For example, a
sort of 5 million random strings of length 30, all in shared_buffers,
takes 22.03 seconds on my MacBook Pro with work_mem =

Re: [HACKERS] Memory usage during sorting

2012-03-19 Thread Greg Stark

On Mon, Mar 19, 2012 at 7:23 PM, Robert Haas robertmh...@gmail.com wrote:
 There's no real reason why the tuples destined for the next run need
 to be maintained in heap order; we could just store them unordered and
 heapify the whole lot of them when it's time to start the next run.

This sounded familiar

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=cf627ab41ab9f6038a29ddd04dd0ff0ccdca714e

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-19 Thread Tom Lane

Greg Stark st...@mit.edu writes:
 On Mon, Mar 19, 2012 at 7:23 PM, Robert Haas robertmh...@gmail.com wrote:
 There's no real reason why the tuples destined for the next run need
 to be maintained in heap order; we could just store them unordered and
 heapify the whole lot of them when it's time to start the next run.

 This sounded familiar
 http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=cf627ab41ab9f6038a29ddd04dd0ff0ccdca714e

Yeah, see also the pgsql-hackers thread starting here:
http://archives.postgresql.org/pgsql-hackers/1999-10/msg00384.php

That was a long time ago, of course, but I have some vague recollection
that keeping next-run tuples in the current heap achieves a net savings
in the total number of comparisons needed to heapify both runs.
Robert's point about integer comparisons being faster than data
comparisons may or may not be relevant.  Certainly they are faster, but
there are never very many run numbers in the heap at once (possibly no
more than 2, I forget; and in any case often only 1).  So I'd expect
most tuple comparisons to end up having to do a data comparison anyway.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-18 Thread Tom Lane

Jeff Janes jeff.ja...@gmail.com writes:
 On Wed, Mar 7, 2012 at 11:55 AM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Anyway, I think the logtape could use redoing.

 The problem there is that none of the files can be deleted until it
 was entirely read, so you end up with all the data on disk twice.  I
 don't know how often people run their databases so close to the edge
 on disk space that this matters, but someone felt that that extra
 storage was worth avoiding.

Yeah, that was me, and it came out of actual user complaints ten or more
years back.  (It's actually not 2X growth but more like 4X growth
according to the comments in logtape.c, though I no longer remember the
exact reasons why.)  We knew when we put in the logtape logic that we
were trading off speed for space, and we accepted that.  It's possible
that with the growth of hard drive sizes, real-world applications would
no longer care that much about whether the space required to sort is 4X
data size rather than 1X.  Or then again, maybe their data has grown
just as fast and they still care.

 As a desirable side effect, I think it would mean
 that we could dispense with retail palloc and pfree altogether.  We
 could just allocate a big chunk of memory, copy tuples into it until
 it's full, using a pointer to keep track of the next unused byte, and
 then, after writing the run, reset the allocation pointer back to the
 beginning of the buffer.  That would not only avoid the cost of going
 through palloc/pfree, but also the memory overhead imposed by
 bookkeeping and power-of-two rounding.

That would be worthwhile, probably.  The power-of-2 rounding in palloc
is not nice at all for tuplesort's purposes.  We've occasionally talked
about inventing a different memory context type that is less willing to
waste space ... but on the other hand, such an allocator would run
slower, so you're right back to making a speed for space tradeoff.
If the tuples could be deallocated in bulk then that catch-22 goes away.

 No, it's about reducing the number of comparisons needed to maintain
 the heap property.

 That sounds very interesting.  I didn't know it was even theoretically
 possible to do that.

So has somebody found a hole in the n log n lower bound on the cost of
comparison-based sorting?  I thought that had been proven pretty
rigorously.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-18 Thread Jeremy Harris


On 2012-03-18 15:25, Tom Lane wrote:

Jeff Janesjeff.ja...@gmail.com  writes:

On Wed, Mar 7, 2012 at 11:55 AM, Robert Haasrobertmh...@gmail.com  wrote:
The problem there is that none of the files can be deleted until it
was entirely read, so you end up with all the data on disk twice.  I
don't know how often people run their databases so close to the edge
on disk space that this matters, but someone felt that that extra
storage was worth avoiding.


Yeah, that was me, and it came out of actual user complaints ten or more
years back.  (It's actually not 2X growth but more like 4X growth
according to the comments in logtape.c, though I no longer remember the
exact reasons why.)  We knew when we put in the logtape logic that we
were trading off speed for space, and we accepted that.


How about a runtime check of disk-free?
--
Jeremy


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-18 Thread Tom Lane

Jeremy Harris j...@wizmail.org writes:
 On 2012-03-18 15:25, Tom Lane wrote:
 Yeah, that was me, and it came out of actual user complaints ten or more
 years back.  (It's actually not 2X growth but more like 4X growth
 according to the comments in logtape.c, though I no longer remember the
 exact reasons why.)  We knew when we put in the logtape logic that we
 were trading off speed for space, and we accepted that.

 How about a runtime check of disk-free?

Not very workable, the foremost reason why not being that it assumes
that the current sort is the only thing consuming disk space.  I'm not
sure there is any portable method of finding out how much disk space
is available, anyway.  (Think user quotas and such before supposing
you know how to do that.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-18 Thread Greg Stark

On Sun, Mar 18, 2012 at 3:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 No, it's about reducing the number of comparisons needed to maintain
 the heap property.

 That sounds very interesting.  I didn't know it was even theoretically
 possible to do that.

 So has somebody found a hole in the n log n lower bound on the cost of
 comparison-based sorting?  I thought that had been proven pretty
 rigorously.

I think what he's describing is, for example, say you have a heap of
1,000 elements and that lets you do the entire sort in a single pass
-- i.e. it generates a single sorted run after the first pass.
Increasing the heap size to 2,000 will just mean doing an extra
comparison for each tuple to sift up. And increasing the heap size
further will just do more and more work for nothing. It's still nlogn
but the constant factor is slightly higher. Of course to optimize this
you would need to be able to see the future.

I always thought the same behaviour held for if the heap was larger
than necessary to do N merge passes. that is, making the heap larger
might generate fewer tapes but the still enough to require the same
number of passes. However trying to work out an example I'm not too
sure. If you generate fewer tapes then the heap you use to do the
merge is smaller so it might work out the same (or more likely, you
might come out ahead).

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-18 Thread Jeff Janes

On Sun, Mar 18, 2012 at 8:56 AM, Greg Stark st...@mit.edu wrote:
 On Sun, Mar 18, 2012 at 3:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 No, it's about reducing the number of comparisons needed to maintain
 the heap property.

 That sounds very interesting.  I didn't know it was even theoretically
 possible to do that.

 So has somebody found a hole in the n log n lower bound on the cost of
 comparison-based sorting?  I thought that had been proven pretty
 rigorously.

 I think what he's describing is, for example, say you have a heap of
 1,000 elements and that lets you do the entire sort in a single pass
 -- i.e. it generates a single sorted run after the first pass.
 Increasing the heap size to 2,000 will just mean doing an extra
 comparison for each tuple to sift up. And increasing the heap size
 further will just do more and more work for nothing. It's still nlogn
 but the constant factor is slightly higher. Of course to optimize this
 you would need to be able to see the future.

 I always thought the same behaviour held for if the heap was larger
 than necessary to do N merge passes. that is, making the heap larger
 might generate fewer tapes but the still enough to require the same
 number of passes. However trying to work out an example I'm not too
 sure. If you generate fewer tapes then the heap you use to do the
 merge is smaller so it might work out the same (or more likely, you
 might come out ahead).

Except for rounding effects, it does come out the same.  Every extra
layer on the tuple-heap during the initial run building causes a
reduced layer on the tape-heap during the merge.  So they wash.  I
haven't analyzed the exact rounding effects in detail, but by just
instrumenting the code I found a huge tuple-heap with a two-tape merge
at the end used less than one percent more comparisons, but was 40%
slower, then a lower-memory sort which used a modest sized tuple-heap
followed by a modest-size tape-heap merge.My conclusion is that it
isn't the number of comparison that drove the difference, but the
number of on-chip cache misses.  Two modest sized heaps are more cache
friendly than one giant heap and one tiny heap.

Of course if the data is partially sorted in a way that is apparent
over large ranges but not short ranges, the larger initial heap will
capture that during the initial run construction while the smaller one
will not.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-17 Thread Greg Stark

On Wed, Mar 7, 2012 at 7:55 PM, Robert Haas robertmh...@gmail.com wrote:
 But it would mean we have about 1.7x  more runs that need to be merged
 (for initially random data).  Considering the minimum merge order is
 6, that increase in runs is likely not to lead to an additional level
 of merging, in which case the extra speed of building the runs would
 definitely win.  But if it does cause an additional level of merge, it
 could end up being a loss.

 That's true, but the real hit to the run size should be quite a bit
 less than 1.7x, because we'd also be using memory more efficiently,
 and therefore more tuples should fit.

I'm not sure I believe the 1.7x.  Keep in mind that even after
starting a new tape we can continue to read in new tuples that belong
on the first tape. So even if you have tuples that are more than N
positions out of place (where N is the size of your heap) as long as
you don't have very many you can keep writing out the first tape for
quite a while.

I suspect analyzing truly random inputs is also a bit like saying no
compression algorithm can work on average. Partly sorted data is quite
common and the tapesort algorithm would be able to do a lot of cases
in a single merge that the quicksort and merge would generate a large
number of merges for.

All that said, quicksort and merge would always do no more i/o in
cases where the total number of tuples to be sorted is significantly
less than N^2 since that would be guaranteed to be possible to process
with a single merge pass. (Where significantly less has something to
do with how many tuples you have to read in one i/o to be efficient).
That probably covers a lot of cases, and Postgres already has the
stats to predict when it would happen, more or less.

Fwiw when I was doing the top-k sorting I did a bunch of experiements
and came up with a rule-of-thumb that our quicksort was about twice as
fast as our heapsort. I'm not sure whether that's a big deal or not in
this case. Keep in mind that the only win would be reducing the cpu
time on a sort where every tuple was being written to disk and read
back. For most users that won't run noticeably faster, just reduce cpu
time consumption.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-17 Thread Jeff Janes

On Sat, Mar 17, 2012 at 10:47 AM, Greg Stark st...@mit.edu wrote:
 On Wed, Mar 7, 2012 at 7:55 PM, Robert Haas robertmh...@gmail.com wrote:
 But it would mean we have about 1.7x  more runs that need to be merged
 (for initially random data).  Considering the minimum merge order is
 6, that increase in runs is likely not to lead to an additional level
 of merging, in which case the extra speed of building the runs would
 definitely win.  But if it does cause an additional level of merge, it
 could end up being a loss.

 That's true, but the real hit to the run size should be quite a bit
 less than 1.7x, because we'd also be using memory more efficiently,
 and therefore more tuples should fit.

 I'm not sure I believe the 1.7x.  Keep in mind that even after
 starting a new tape we can continue to read in new tuples that belong
 on the first tape. So even if you have tuples that are more than N
 positions out of place (where N is the size of your heap) as long as
 you don't have very many you can keep writing out the first tape for
 quite a while.

 I suspect analyzing truly random inputs is also a bit like saying no
 compression algorithm can work on average.

I'm not sure what you mean here.  The also suggests the previous
discussion was about something other than the assumption of
randomness, but that discussion doesn't make much sense unless both
paragraphs are about that.

Anyway I think keeping best/worst cases in mind is certainly a good
thing to do.  I've certainly seen train wrecks unfold when people
assumed anything could be compressed without verifying it.

 Partly sorted data is quite
 common and the tapesort algorithm would be able to do a lot of cases
 in a single merge that the quicksort and merge would generate a large
 number of merges for.

This is where some common and credible corpus would be invaluable.

 All that said, quicksort and merge would always do no more i/o in
 cases where the total number of tuples to be sorted is significantly
 less than N^2 since that would be guaranteed to be possible to process
 with a single merge pass. (Where significantly less has something to
 do with how many tuples you have to read in one i/o to be efficient).

Unless the tuples are quite large, the number needed for efficient i/o
is quite high.

 That probably covers a lot of cases, and Postgres already has the
 stats to predict when it would happen, more or less.

Currently sort makes no attempt to use the planner stats.  Would that
be a good thing to work on?

 Fwiw when I was doing the top-k sorting I did a bunch of experiements
 and came up with a rule-of-thumb that our quicksort was about twice as
 fast as our heapsort. I'm not sure whether that's a big deal or not in
 this case.

I've been seeing around 3 fold, and that might go up more with some of
the work being done that speeds up qsort but not tape sort.


 Keep in mind that the only win would be reducing the cpu
 time on a sort where every tuple was being written to disk and read
 back. For most users that won't run noticeably faster, just reduce cpu
 time consumption.

But in many (perhaps most) tape sorts CPU time is most of it.   A lot
of tape sorts are entirely memory backed.  The tuples either never
reach disk, or are partially written but never read, instead being
served back from the file-systems cache.  By changing to a tape sort,
you are buying insurance against a bunch of other sorts also running
simultaneously, but often the insurance doesn't pay off because those
numerous other sorts don't exist and the kernel manages to keep the
few ones that do in RAM anyway.

One avenue for major surgery on the sorts would be for an entry
admission where jobs can negotiable over the memory.  Something like
allocation by halves, but with a possibility that a job could decide
it would rather wait for another sort to finish and its memory to be
freed up, than to make do with what is currently available.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-17 Thread Jeff Janes

On Wed, Mar 7, 2012 at 11:55 AM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 Anyway, I think the logtape could use redoing.  When your tapes are
 actually physically tape drives, it is necessary to build up runs one
 after the other on physical tapes, because un-mounting a tape from a
 tape drive and remounting another tape is not very feasible at scale.
 Which then means you need to place your runs carefully, because if the
 final merge finds that two runs it needs are back-to-back on one tape,
 that is bad.  But with disks pretending to be tapes, you could
 re-arrange the runs with just some book keeping.  Maintaining the
 distinction between tapes and runs is pointless, which means the
 Fibonacci placement algorithm is pointless as well.

 I think you're right.  It seems to me that we could just write out an
 arbitrary number of runs, one per file, ending up with files number
 1..N.

The problem there is that none of the files can be deleted until it
was entirely read, so you end up with all the data on disk twice.  I
don't know how often people run their databases so close to the edge
on disk space that this matters, but someone felt that that extra
storage was worth avoiding.  We could still get rid of the run/tape
distinction but keep the block recycling method--they are conceptually
distinct things.   (But hopefully improve the block recycling so the
locality doesn't degrade as much as it seems to now)

 If we can do a final merge of N runs without overrunning
 work_mem, fine.  If not, we merge the first K runs (for the largest
 possible K) and write out a new run N+1.  The range of valid run
 number is now K+1..N+1.  If those can be merged in a single pass, we
 do so; otherwise we again merge the first K runs (K+1 through 2K) to
 create a new run N+2.  And so on.

 I am not clear, however, whether this would be any faster.  It may not
 be worth tinkering with just for the reduction in code complexity.

Yeah, I was thinking it would only be worth redoing if the current
implementation interferes with other improvements--which I think is
likely, but don't have any concrete ideas yet where it would.

...

 As a desirable side effect, I think it would mean
 that we could dispense with retail palloc and pfree altogether.  We
 could just allocate a big chunk of memory, copy tuples into it until
 it's full, using a pointer to keep track of the next unused byte, and
 then, after writing the run, reset the allocation pointer back to the
 beginning of the buffer.  That would not only avoid the cost of going
 through palloc/pfree, but also the memory overhead imposed by
 bookkeeping and power-of-two rounding.

 Wouldn't we still need an array of pointers to the start of every
 tuple's location in the buffer?  Or else, how would qsort know where
 to find them?

 Yes, we'd still need that.  I don't see any way to get around that; I
 just don't like the expense of palloc-ing so many little chunks in
 addition to that array.

Right, OK, I had thought you meant the power of two rounding of
mem_tuples, I overlooked the power of two rounding of palloc.



 Also, to do this we would need to get around the 1GB allocation limit.
  It is bad enough that memtuples is limited to 1GB, it would be much
 worse if the entire arena was limited to that amount.

 Well, we could always allocate multiple 1GB chunks and peel off pieces
 from each one in turn.

I thought of that with mem_tuples itself, but I think it would be more
difficult to do that than just fixing the allocation limit would be.
Using multiple chunks might be easier with the buffer space as you
don't need to do heap arithmetic on those.

 If we do want to stick with the current algorithm, there seem to be
 some newer techniques for cutting down on the heap maintenance
 overhead.  Heikki's been investigating that a bit.

 Interesting.  Is that investigation around the poor L1/L2 caching
 properties of large heaps?  I was wondering if there might be a way to
 give tuplesort an initial estimate of how much data there was to sort,
 so that it could use a smaller amount of memory than the max if it
 decided that that would lead to better caching effects.  Once you know
 you can't do an in-memory sort, then there is no reason to use more
 memory than the amount that lets you merge all the runs in one go.

 No, it's about reducing the number of comparisons needed to maintain
 the heap property.

That sounds very interesting.  I didn't know it was even theoretically
possible to do that.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-03-07 Thread Robert Haas

On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 The better solution would be to reduce the overhead in the first
 place.  While building the initial runs, there is no reason to have 3
 blocks worth of overhead for each tape, when only one tape is ever
 being used at a time.  But that change seems much tougher to
 implement.

Hmm, yeah.

 Anyway, I think the logtape could use redoing.  When your tapes are
 actually physically tape drives, it is necessary to build up runs one
 after the other on physical tapes, because un-mounting a tape from a
 tape drive and remounting another tape is not very feasible at scale.
 Which then means you need to place your runs carefully, because if the
 final merge finds that two runs it needs are back-to-back on one tape,
 that is bad.  But with disks pretending to be tapes, you could
 re-arrange the runs with just some book keeping.  Maintaining the
 distinction between tapes and runs is pointless, which means the
 Fibonacci placement algorithm is pointless as well.

I think you're right.  It seems to me that we could just write out an
arbitrary number of runs, one per file, ending up with files number
1..N.  If we can do a final merge of N runs without overrunning
work_mem, fine.  If not, we merge the first K runs (for the largest
possible K) and write out a new run N+1.  The range of valid run
number is now K+1..N+1.  If those can be merged in a single pass, we
do so; otherwise we again merge the first K runs (K+1 through 2K) to
create a new run N+2.  And so on.

I am not clear, however, whether this would be any faster.  It may not
be worth tinkering with just for the reduction in code complexity.

 But it would mean we have about 1.7x  more runs that need to be merged
 (for initially random data).  Considering the minimum merge order is
 6, that increase in runs is likely not to lead to an additional level
 of merging, in which case the extra speed of building the runs would
 definitely win.  But if it does cause an additional level of merge, it
 could end up being a loss.

That's true, but the real hit to the run size should be quite a bit
less than 1.7x, because we'd also be using memory more efficiently,
and therefore more tuples should fit. A little debugging code suggests
that on my MacBook Pro, things come out like this:

- Allocations of 8 bytes of less use 24 bytes.
- Allocations of 9-16 bytes use 32 bytes.
- Allocations of 17-32 bytes use 48 bytes.
- Allocations of 33-64 bytes use 80 bytes.
- Allocations of 65-128 bytes use 144 bytes.
- Allocations of 129-256 bytes use 272 bytes.

In other words, the palloc overhead seems to be: round up to the next
power of two, and add 16 bytes.  This means that if, say, all the rows
are exactly 100 bytes, we're using 44% more memory than we would under
this scheme, which means that the runs would be 44% longer than
otherwise.  Of course that's just an example - if you are sorting 132
byte rows, you'll be able to squeeze in 106% more of them, and if
you're sorting 128 byte rows, you'll only be able to squeeze in 12.5%
more of them.  So there are certainly cases where the runs would get
shorter, especially if you happened to have mostly-sorted data rather
than randomly-ordered data, but not in every case.  Still, it may be a
terrible idea; I only mention it because I did find a lot of
references to people using quicksort to build up the initial runs
rather than our heapsort-based approach.

 Is there some broad corpus of sorting benchmarks which changes could
 be tested against?  I usually end up testing just simple columns of
 integers or small strings, because they are easy to set up.  That is
 not ideal.

I don't know of anything more formal, unfortunately.

 As a desirable side effect, I think it would mean
 that we could dispense with retail palloc and pfree altogether.  We
 could just allocate a big chunk of memory, copy tuples into it until
 it's full, using a pointer to keep track of the next unused byte, and
 then, after writing the run, reset the allocation pointer back to the
 beginning of the buffer.  That would not only avoid the cost of going
 through palloc/pfree, but also the memory overhead imposed by
 bookkeeping and power-of-two rounding.

 Wouldn't we still need an array of pointers to the start of every
 tuple's location in the buffer?  Or else, how would qsort know where
 to find them?

Yes, we'd still need that.  I don't see any way to get around that; I
just don't like the expense of palloc-ing so many little chunks in
addition to that array.

 Also, to do this we would need to get around the 1GB allocation limit.
  It is bad enough that memtuples is limited to 1GB, it would be much
 worse if the entire arena was limited to that amount.

Well, we could always allocate multiple 1GB chunks and peel off pieces
from each one in turn.

 If we do want to stick with the current algorithm, there seem to be
 some newer techniques for cutting down on the heap maintenance
 overhead.

Re: [HACKERS] Memory usage during sorting

2012-03-03 Thread Jeff Janes

On Sun, Feb 26, 2012 at 7:20 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Feb 25, 2012 at 4:31 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 I'm not sure about the conclusion, but given this discussion, I'm
 inclined to mark this Returned with Feedback.

 OK, thanks.  Does anyone have additional feed-back on how tightly we
 wish to manage memory usage?  Is trying to make us use as much memory
 as we are allowed to without going over a worthwhile endeavor at all,
 or is it just academic nitpicking?

 I'm not sure, either.  It strikes me that, in general, it's hard to
 avoid a little bit of work_mem overrun, since we can't know whether
 the next tuple will fit until we've read it, and then if it turns out
 to be big, well, the best thing we can do is free it, but perhaps
 that's closing the barn door after the horse has gotten out already.

That type of overrun doesn't bother me much, because the size of the
next tuple some else feeds us is mostly outside of this modules
control.  And also be cause the memory that is overrun should be
reusable once the offending tuple is written out to tape.  The type of
overrun I'm more concerned with is that which is purely under this
modules control, and which is then not re-used productively.

The better solution would be to reduce the overhead in the first
place.  While building the initial runs, there is no reason to have 3
blocks worth of overhead for each tape, when only one tape is ever
being used at a time.  But that change seems much tougher to
implement.

 Having recently spent quite a bit of time looking at tuplesort.c as a
 result of Peter Geoghegan's work and some other concerns, I'm inclined
 to think that it needs more than minor surgery.  That file is peppered
 with numerous references to Knuth which serve the dual function of
 impressing the reader with the idea that the code must be very good
 (since Knuth is a smart guy) and rendering it almost completely
 impenetrable (since the design is justified by reference to a textbook
 most of us probably do not have copies of).

Yes, I agree with that analysis.  And getting the volume I want by
inter-library-loan has been challenging--I keep getting the volume
before or after the one I want.  Maybe Knuth starts counting volumes
at 0 and libraries start counting at 1 :)

Anyway, I think the logtape could use redoing.  When your tapes are
actually physically tape drives, it is necessary to build up runs one
after the other on physical tapes, because un-mounting a tape from a
tape drive and remounting another tape is not very feasible at scale.
Which then means you need to place your runs carefully, because if the
final merge finds that two runs it needs are back-to-back on one tape,
that is bad.  But with disks pretending to be tapes, you could
re-arrange the runs with just some book keeping.  Maintaining the
distinction between tapes and runs is pointless, which means the
Fibonacci placement algorithm is pointless as well.

 A quick Google search for external sorting algorithms suggest that the
 typical way of doing an external sort is to read data until you fill
 your in-memory buffer, quicksort it, and dump it out as a run.  Repeat
 until end-of-data; then, merge the runs (either in a single pass, or
 if there are too many, in multiple passes).  I'm not sure whether that
 would be better than what we're doing now, but there seem to be enough
 other people doing it that we might want to try it out.  Our current
 algorithm is to build a heap, bounded in size by work_mem, and dribble
 tuples in and out, but maintaining that heap is pretty expensive;
 there's a reason people use quicksort rather than heapsort for
 in-memory sorting.

But it would mean we have about 1.7x  more runs that need to be merged
(for initially random data).  Considering the minimum merge order is
6, that increase in runs is likely not to lead to an additional level
of merging, in which case the extra speed of building the runs would
definitely win.  But if it does cause an additional level of merge, it
could end up being a loss.

Is there some broad corpus of sorting benchmarks which changes could
be tested against?  I usually end up testing just simple columns of
integers or small strings, because they are easy to set up.  That is
not ideal.

 As a desirable side effect, I think it would mean
 that we could dispense with retail palloc and pfree altogether.  We
 could just allocate a big chunk of memory, copy tuples into it until
 it's full, using a pointer to keep track of the next unused byte, and
 then, after writing the run, reset the allocation pointer back to the
 beginning of the buffer.  That would not only avoid the cost of going
 through palloc/pfree, but also the memory overhead imposed by
 bookkeeping and power-of-two rounding.

Wouldn't we still need an array of pointers to the start of every
tuple's location in the buffer?  Or else, how would qsort know where
to find them?

Also, to do this we would need to get around

Re: [HACKERS] Memory usage during sorting

2012-02-26 Thread Robert Haas

On Sat, Feb 25, 2012 at 4:31 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 I'm not sure about the conclusion, but given this discussion, I'm
 inclined to mark this Returned with Feedback.

 OK, thanks.  Does anyone have additional feed-back on how tightly we
 wish to manage memory usage?  Is trying to make us use as much memory
 as we are allowed to without going over a worthwhile endeavor at all,
 or is it just academic nitpicking?

I'm not sure, either.  It strikes me that, in general, it's hard to
avoid a little bit of work_mem overrun, since we can't know whether
the next tuple will fit until we've read it, and then if it turns out
to be big, well, the best thing we can do is free it, but perhaps
that's closing the barn door after the horse has gotten out already.

Having recently spent quite a bit of time looking at tuplesort.c as a
result of Peter Geoghegan's work and some other concerns, I'm inclined
to think that it needs more than minor surgery.  That file is peppered
with numerous references to Knuth which serve the dual function of
impressing the reader with the idea that the code must be very good
(since Knuth is a smart guy) and rendering it almost completely
impenetrable (since the design is justified by reference to a textbook
most of us probably do not have copies of).

A quick Google search for external sorting algorithms suggest that the
typical way of doing an external sort is to read data until you fill
your in-memory buffer, quicksort it, and dump it out as a run.  Repeat
until end-of-data; then, merge the runs (either in a single pass, or
if there are too many, in multiple passes).  I'm not sure whether that
would be better than what we're doing now, but there seem to be enough
other people doing it that we might want to try it out.  Our current
algorithm is to build a heap, bounded in size by work_mem, and dribble
tuples in and out, but maintaining that heap is pretty expensive;
there's a reason people use quicksort rather than heapsort for
in-memory sorting.  As a desirable side effect, I think it would mean
that we could dispense with retail palloc and pfree altogether.  We
could just allocate a big chunk of memory, copy tuples into it until
it's full, using a pointer to keep track of the next unused byte, and
then, after writing the run, reset the allocation pointer back to the
beginning of the buffer.  That would not only avoid the cost of going
through palloc/pfree, but also the memory overhead imposed by
bookkeeping and power-of-two rounding.

If we do want to stick with the current algorithm, there seem to be
some newer techniques for cutting down on the heap maintenance
overhead.  Heikki's been investigating that a bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-02-26 Thread Tom Lane

Robert Haas robertmh...@gmail.com writes:
 A quick Google search for external sorting algorithms suggest that the
 typical way of doing an external sort is to read data until you fill
 your in-memory buffer, quicksort it, and dump it out as a run.  Repeat
 until end-of-data; then, merge the runs (either in a single pass, or
 if there are too many, in multiple passes).  I'm not sure whether that
 would be better than what we're doing now, but there seem to be enough
 other people doing it that we might want to try it out.  Our current
 algorithm is to build a heap, bounded in size by work_mem, and dribble
 tuples in and out, but maintaining that heap is pretty expensive;
 there's a reason people use quicksort rather than heapsort for
 in-memory sorting.

Well, the reason the heapsort approach is desirable here is that you end
up with about half as many runs to merge, because the typical run length
is significantly more than what will fit in work_mem.  Don't get too
excited about micro-optimization if you haven't understood the larger
context.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-02-25 Thread Jeff Janes

On Tue, Feb 14, 2012 at 1:44 AM, Hitoshi Harada umi.tan...@gmail.com wrote:
 On Sat, Feb 11, 2012 at 11:34 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada umi.tan...@gmail.com wrote:
 On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 The attached patch allows it to reuse that memory.  On my meager
 system it reduced the building of an index on an integer column in a
 skinny 200 million row totally randomly ordered table by about 3% from
 a baseline of 25 minutes.


 Just to give a standard review, this patch is one line change and
 applies cleanly, builds ok.

 I'm not pretty sure what exactly you're trying to accomplish, but it
 seems to me that it's avoiding the first dumptuples cycle by forcing
 availMem = 0 even if it's negative.

 Yes.  Currently when it switches to the TSS_BUILDRUNS part of a
 tape-sort, it starts by calling WRITETUP a large number of time
 consecutively, to work off the memory deficit incurred by the 3 blocks
 per tape of tape overhead, and then after that calls WRITETUP about
 once per puttuple..   Under my patch, it would only call WRITETUP
 about once per puttuple, right from the beginning.

 I read your comments as it'd be
 avoiding to alternate reading/writing back and force with scattered
 memory failing memory cache much during merge phase, but actually it
 doesn't affect merge phase but only init-dump phase, does it?

 It effects the building of the runs.  But this building of the runs is
 not a simple dump, it is itself a mini merge phase, in which it merges
 the existing in-memory priority queue against the still-incoming
 tuples from the node which invoked the sort.  By using less memory
 than it could, this means that the resulting runs are smaller than
 they could be, and so will sometimes necessitate an additional layer
 of merging later on.   (This effect is particularly large for the very
 first run being built.  Generally by merging incoming tuples into the
 memory-tuples, you can create runs that are 1.7 times the size of fits
 in memory.  By wasting some memory, we are getting 1.7 the size of a
 smaller starting point.  But for the first run, it is worse than that.
  Most of the benefit that leads to that 1.7 multiplier comes at the
 very early stage of each run-build.  But by initially using the full
 memory, then writing out a bunch of tuples without doing any merge of
 the incoming, we have truncated the part that gives the most benefit.)

 My analysis that the freed memory is never reused (because we refuse
 to reuse it ourselves and it is too fragmented to be reused by anyone
 else, like the palloc or VM system) only applies to the run-building
 phase.  So never was a bit of an overstatement.  By the time the last
 initial run is completely written out to tape, the heap used for the
 priority queue should be totally empty.  So at this point the
 allocator would have the chance to congeal all of the fragmented
 memory back into larger chunks, or maybe it parcels the allocations
 back out again in an order so that the unused space is contiguous and
 could be meaningfully paged out.

 But, it is it worth worrying about how much we fragment memory and if
 we overshoot our promises by 10 or 20%?

 If so,
 I'm not so convinced your benchmark gave 3 % gain by this change.
 Correct me as I'm probably wrong.

 I've now done more complete testing.  Building an index on an
 200,000,000 row table with an integer column populated in random order
 with integers from 1..500,000,000, non-unique, on a machine with 2GB
 of RAM and 600MB of shared_buffers.

 It improves things by 6-7 percent at the end of working mem size, the
 rest are in the noise except at 77936 KB, where it reproducibly makes
 things 4% worse, for reasons I haven't figured out.  So maybe the best
 thing to do is, rather than micromanaging memory usage, simply don't
 set maintenance_work_mem way to low.  (But, it is the default).

 I've tested here with only a million rows mix of integer/text (table
 size is 80MB or so) with default setting, running CREATE INDEX
 continuously, but couldn't find performance improvement.  The number
 varies from -2% to +2%, which I think is just error.

 While I agree with your insist that avoiding the first dump would make
 sense, I guess it depends on situations; if the dump goes with lots of
 tuples (which should happen when availMem is big), writing tuples a
 lot at a time will be faster than writing little by little later.

 I'm not sure about the conclusion, but given this discussion, I'm
 inclined to mark this Returned with Feedback.

OK, thanks.  Does anyone have additional feed-back on how tightly we
wish to manage memory usage?  Is trying to make us use as much memory
as we are allowed to without going over a worthwhile endeavor at all,
or is it just academic nitpicking?

Also, since the default value of work_mem is quite low, should the
docs be more aggressive in suggesting that people using any

Re: [HACKERS] Memory usage during sorting

2012-02-14 Thread Hitoshi Harada

On Sat, Feb 11, 2012 at 11:34 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada umi.tan...@gmail.com wrote:
 On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 The attached patch allows it to reuse that memory.  On my meager
 system it reduced the building of an index on an integer column in a
 skinny 200 million row totally randomly ordered table by about 3% from
 a baseline of 25 minutes.


 Just to give a standard review, this patch is one line change and
 applies cleanly, builds ok.

 I'm not pretty sure what exactly you're trying to accomplish, but it
 seems to me that it's avoiding the first dumptuples cycle by forcing
 availMem = 0 even if it's negative.

 Yes.  Currently when it switches to the TSS_BUILDRUNS part of a
 tape-sort, it starts by calling WRITETUP a large number of time
 consecutively, to work off the memory deficit incurred by the 3 blocks
 per tape of tape overhead, and then after that calls WRITETUP about
 once per puttuple..   Under my patch, it would only call WRITETUP
 about once per puttuple, right from the beginning.

 I read your comments as it'd be
 avoiding to alternate reading/writing back and force with scattered
 memory failing memory cache much during merge phase, but actually it
 doesn't affect merge phase but only init-dump phase, does it?

 It effects the building of the runs.  But this building of the runs is
 not a simple dump, it is itself a mini merge phase, in which it merges
 the existing in-memory priority queue against the still-incoming
 tuples from the node which invoked the sort.  By using less memory
 than it could, this means that the resulting runs are smaller than
 they could be, and so will sometimes necessitate an additional layer
 of merging later on.   (This effect is particularly large for the very
 first run being built.  Generally by merging incoming tuples into the
 memory-tuples, you can create runs that are 1.7 times the size of fits
 in memory.  By wasting some memory, we are getting 1.7 the size of a
 smaller starting point.  But for the first run, it is worse than that.
  Most of the benefit that leads to that 1.7 multiplier comes at the
 very early stage of each run-build.  But by initially using the full
 memory, then writing out a bunch of tuples without doing any merge of
 the incoming, we have truncated the part that gives the most benefit.)

 My analysis that the freed memory is never reused (because we refuse
 to reuse it ourselves and it is too fragmented to be reused by anyone
 else, like the palloc or VM system) only applies to the run-building
 phase.  So never was a bit of an overstatement.  By the time the last
 initial run is completely written out to tape, the heap used for the
 priority queue should be totally empty.  So at this point the
 allocator would have the chance to congeal all of the fragmented
 memory back into larger chunks, or maybe it parcels the allocations
 back out again in an order so that the unused space is contiguous and
 could be meaningfully paged out.

 But, it is it worth worrying about how much we fragment memory and if
 we overshoot our promises by 10 or 20%?

 If so,
 I'm not so convinced your benchmark gave 3 % gain by this change.
 Correct me as I'm probably wrong.

 I've now done more complete testing.  Building an index on an
 200,000,000 row table with an integer column populated in random order
 with integers from 1..500,000,000, non-unique, on a machine with 2GB
 of RAM and 600MB of shared_buffers.

 It improves things by 6-7 percent at the end of working mem size, the
 rest are in the noise except at 77936 KB, where it reproducibly makes
 things 4% worse, for reasons I haven't figured out.  So maybe the best
 thing to do is, rather than micromanaging memory usage, simply don't
 set maintenance_work_mem way to low.  (But, it is the default).

I've tested here with only a million rows mix of integer/text (table
size is 80MB or so) with default setting, running CREATE INDEX
continuously, but couldn't find performance improvement.  The number
varies from -2% to +2%, which I think is just error.

While I agree with your insist that avoiding the first dump would make
sense, I guess it depends on situations; if the dump goes with lots of
tuples (which should happen when availMem is big), writing tuples a
lot at a time will be faster than writing little by little later.

I'm not sure about the conclusion, but given this discussion, I'm
inclined to mark this Returned with Feedback.


Thanks,
-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-02-11 Thread Jeff Janes

On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada umi.tan...@gmail.com wrote:
 On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 The attached patch allows it to reuse that memory.  On my meager
 system it reduced the building of an index on an integer column in a
 skinny 200 million row totally randomly ordered table by about 3% from
 a baseline of 25 minutes.


 Just to give a standard review, this patch is one line change and
 applies cleanly, builds ok.

 I'm not pretty sure what exactly you're trying to accomplish, but it
 seems to me that it's avoiding the first dumptuples cycle by forcing
 availMem = 0 even if it's negative.

Yes.  Currently when it switches to the TSS_BUILDRUNS part of a
tape-sort, it starts by calling WRITETUP a large number of time
consecutively, to work off the memory deficit incurred by the 3 blocks
per tape of tape overhead, and then after that calls WRITETUP about
once per puttuple..   Under my patch, it would only call WRITETUP
about once per puttuple, right from the beginning.

 I read your comments as it'd be
 avoiding to alternate reading/writing back and force with scattered
 memory failing memory cache much during merge phase, but actually it
 doesn't affect merge phase but only init-dump phase, does it?

It effects the building of the runs.  But this building of the runs is
not a simple dump, it is itself a mini merge phase, in which it merges
the existing in-memory priority queue against the still-incoming
tuples from the node which invoked the sort.  By using less memory
than it could, this means that the resulting runs are smaller than
they could be, and so will sometimes necessitate an additional layer
of merging later on.   (This effect is particularly large for the very
first run being built.  Generally by merging incoming tuples into the
memory-tuples, you can create runs that are 1.7 times the size of fits
in memory.  By wasting some memory, we are getting 1.7 the size of a
smaller starting point.  But for the first run, it is worse than that.
 Most of the benefit that leads to that 1.7 multiplier comes at the
very early stage of each run-build.  But by initially using the full
memory, then writing out a bunch of tuples without doing any merge of
the incoming, we have truncated the part that gives the most benefit.)

My analysis that the freed memory is never reused (because we refuse
to reuse it ourselves and it is too fragmented to be reused by anyone
else, like the palloc or VM system) only applies to the run-building
phase.  So never was a bit of an overstatement.  By the time the last
initial run is completely written out to tape, the heap used for the
priority queue should be totally empty.  So at this point the
allocator would have the chance to congeal all of the fragmented
memory back into larger chunks, or maybe it parcels the allocations
back out again in an order so that the unused space is contiguous and
could be meaningfully paged out.

But, it is it worth worrying about how much we fragment memory and if
we overshoot our promises by 10 or 20%?

 If so,
 I'm not so convinced your benchmark gave 3 % gain by this change.
 Correct me as I'm probably wrong.

I've now done more complete testing.  Building an index on an
200,000,000 row table with an integer column populated in random order
with integers from 1..500,000,000, non-unique, on a machine with 2GB
of RAM and 600MB of shared_buffers.

It improves things by 6-7 percent at the end of working mem size, the
rest are in the noise except at 77936 KB, where it reproducibly makes
things 4% worse, for reasons I haven't figured out.  So maybe the best
thing to do is, rather than micromanaging memory usage, simply don't
set maintenance_work_mem way to low.  (But, it is the default).

maintenance_work_mem%Change with patch
16384   6.2
19484   7.8
23170   -0.1
27554   0.1
32768   1.9
38968   -1.9
46341   0.4
55109   0.4
65536   0.6
77936   -4.3
92682   -0.3
110218  0.1
131072  1.7


 Anyway, it's nice to modify the comment just above the change, since
 it's now true with it.

Much of that comment is referring to other things (some of which I
don't understand).  How about an additional comment just before my
code to the gist of:

/* If we have already used, and thus fragmented, the memory then we
 * might as well continue to make use of it as no one else can.
 */

(That might not actually be true, if the tuples we are sorting are
very large, or perhaps if they arrive in already reverse sorted order)

Thanks,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-02-08 Thread Hitoshi Harada

On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 The attached patch allows it to reuse that memory.  On my meager
 system it reduced the building of an index on an integer column in a
 skinny 200 million row totally randomly ordered table by about 3% from
 a baseline of 25 minutes.


Just to give a standard review, this patch is one line change and
applies cleanly, builds ok.

I'm not pretty sure what exactly you're trying to accomplish, but it
seems to me that it's avoiding the first dumptuples cycle by forcing
availMem = 0 even if it's negative. I read your comments as it'd be
avoiding to alternate reading/writing back and force with scattered
memory failing memory cache much during merge phase, but actually it
doesn't affect merge phase but only init-dump phase, does it? If so,
I'm not so convinced your benchmark gave 3 % gain by this change.
Correct me as I'm probably wrong.

Anyway, it's nice to modify the comment just above the change, since
it's now true with it.

Thanks,
-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-02-08 Thread Hitoshi Harada

On Sat, Feb 4, 2012 at 10:06 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 The worst thing about the current memory usage is probably that big
 machines can't qsort more than 16,777,216 tuples no matter how much
 memory they have, because memtuples has to live entirely within a
 single allocation, which is currently limited to 1GB.  It is probably
 too late to fix this problem for 9.2. I wish I had started there
 rather than here.

 This 16,777,216 tuple limitation will get even more unfortunate if the
 specializations that speed up qsort but not external sort get
 accepted.


I think it's a fair ask to extend our palloc limitation of 1GB to
64bit space. I see there are a lot of applications that want more
memory by one palloc call in user-defined functions, aggregates, etc.
As you may notice, however, the area in postgres to accomplish it
needs to be investigated deeply. I don't know where it's safe to allow
it and where not. varlena types is unsafe, but it might be possible to
extend varlena header to 64 bit in major release somehow.

 Attached is a completely uncommitable proof of concept/work in
 progress patch to get around the limitation.  It shows a 2 fold
 improvement when indexing an integer column on a 50,000,000 row
 randomly ordered table.

In any case, we do need bird-eye sketch to attack it but I guess it's
worth and at some future point we definitely must do, though I don't
know if it's the next release or third next release from now.


Thanks,
-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-02-04 Thread Jeff Janes

On Sat, Jan 21, 2012 at 4:51 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 16 January 2012 00:59, Jeff Janes jeff.ja...@gmail.com wrote:
 I think it would be better to pre-deduct the tape overhead amount we
 will need if we decide to switch to tape sort from the availMem before
 we even start reading (and then add it back if we do indeed make that
 switch).  That way we wouldn't over-run the memory in the first place.
  However, that would cause apparent regressions in which sorts that
 previously fit into maintenance_work_mem no longer do.  Boosting
 maintenance_work_mem to a level that was actually being used
 previously would fix those regressions, but pointing out that the
 previous behavior was not optimal doesn't change the fact that people
 are used to it and perhaps tuned to it.  So the attached patch seems
 more backwards-friendly.

 Hmm. Are people really setting maintenance_work_mem such that it is
 exactly large enough to quicksort when building an index in one case
 but not another?

It could also apply to work_mem in other situations.  I'm just using
maintenance_work_mem in my example because index creation is the thing
that lead me into this little diversion and so that is what my
test-bed is set up to use.

Some people do have very stable ritualized work-loads and so could
have tuned exactly to them.  I don't know how common that would be.
More common might be synthetic benchmarks which had been tuned, either
intentionally or accidentally, to just barely fit in the (overshot)
memory, so it would be a perception problem that there was a
regression when in fact the tuning merely became out of date.

 Is the difference large enough to warrant avoiding
 pre-deduction?

Switching to an external sort when you could have done it all by quick
sort (by cheating just a little bit on memory usage) makes a pretty
big difference, over 2 fold in my tests.  If the fast-path
optimization for qsort goes in, the difference might get even bigger.
Of course, there will always be some point at which that switch over
must occur, so the real question is do we need to keep that switch
point historically consistent to avoid nasty surprises for people with
excessively fine-tuned *work_mem settings based on old behavior?  And
do such people even exist outside of my imagination?

But a bigger question I now have is if any of this matters.  Since
there is currently no way to know how many connections might be
running how many concurrent sorts, there is really no way to tune
work_mem with much precision.  So, does it matter if we slightly lie
about how much we will actually use?  And is it worth worrying about
whether every last drop of memory is used efficiently?

I was trying to compare the performance I was getting with a
theoretical model of what I should get, and it was confusing to me
that we were using a originally defined size of memory for the
priority heap, and then later reducing that amount of memory.  That is
annoying because it complicates the theoretical model, and even more
annoying when I realized that the freed memory that results from
doing this is too fragmented to even be used for other purposes.  But
theoretical annoyances aside, it is hardly the worst thing about
memory usage in sorting.  It is just one that is standing in the way
of my further study.

So I don't think this patch I proposed is particularly important in
its own right.  I want to see if anyone will point out that I am all
wet because my analysis failed to take foo into account.   I
probably should have explained this when I submitted the patch.


The worst thing about the current memory usage is probably that big
machines can't qsort more than 16,777,216 tuples no matter how much
memory they have, because memtuples has to live entirely within a
single allocation, which is currently limited to 1GB.  It is probably
too late to fix this problem for 9.2. I wish I had started there
rather than here.

This 16,777,216 tuple limitation will get even more unfortunate if the
specializations that speed up qsort but not external sort get
accepted.

Attached is a completely uncommitable proof of concept/work in
progress patch to get around the limitation.  It shows a 2 fold
improvement when indexing an integer column on a 50,000,000 row
randomly ordered table.

As for what to do to make it commitable, it seems like maybe there
should be two different MaxAllocSize.  One determined by the aset.c
assumes they can compute twice an allocation's size without overflow.
 I would think that simply testing size_t at compile time would be
enough.  Allowing more than 1 GB allocations would probably be
pointless on 32 bit machines, and 64 bit should be able to compute
twice of a much larger value than 1GB without overflow.

For the varlena stuff, I don't know what to do.  Is the I swear to
God I won't pass this pointer to one of those functions sufficient?
Or does each function need to test it?

And I'm pretty sure that palloc, not just repalloc,

Re: [HACKERS] Memory usage during sorting

2012-02-04 Thread Jeff Janes

On Sat, Feb 4, 2012 at 10:06 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 Attached is a completely uncommitable proof of concept/work in
 progress patch to get around the limitation.  It shows a 2 fold
 improvement when indexing an integer column on a 50,000,000 row
 randomly ordered table.

Oops, forgot to say that is with
set maintenance_work_mem=4194304;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Memory usage during sorting

2012-01-21 Thread Peter Geoghegan

On 16 January 2012 00:59, Jeff Janes jeff.ja...@gmail.com wrote:
 I think it would be better to pre-deduct the tape overhead amount we
 will need if we decide to switch to tape sort from the availMem before
 we even start reading (and then add it back if we do indeed make that
 switch).  That way we wouldn't over-run the memory in the first place.
  However, that would cause apparent regressions in which sorts that
 previously fit into maintenance_work_mem no longer do.  Boosting
 maintenance_work_mem to a level that was actually being used
 previously would fix those regressions, but pointing out that the
 previous behavior was not optimal doesn't change the fact that people
 are used to it and perhaps tuned to it.  So the attached patch seems
 more backwards-friendly.

Hmm. Are people really setting maintenance_work_mem such that it is
exactly large enough to quicksort when building an index in one case
but not another? Is the difference large enough to warrant avoiding
pre-deduction?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Memory usage during sorting

2012-01-15 Thread Jeff Janes

In tuplesort.c, it initially reads tuples into memory until availMem
is exhausted.

It then switches to the tape sort algorithm, and allocates buffer
space for each tape it will use.  This substantially over-runs the
allowedMem, and drives availMem negative.

It works off this deficit by writing tuples to tape, and pfree-ing
their spot in the heap, which pfreed space is more or less randomly
scattered.  Once this is done it proceeds to alternate between freeing
more space in the heap and adding things to the heap (in a nearly
strict 1+1 alternation if the tuples are constant size).

The space freed up by the initial round of pfree where it is working
off the space deficit from inittapes is never re-used.  It also cannot
be paged out by the VM system, because it is scattered among actively
used memory.

I don't think that small chunks can be reused from one memory context
to another, but I haven't checked.  Even if it can be, during a big
sort like an index build the backend process doing the build may have
no other contexts which need to use it.

So having over-ran workMem and stomped all over it to ensure no one
else can re-use it, we then scrupulously refuse to benefit from that
over-run amount ourselves.

The attached patch allows it to reuse that memory.  On my meager
system it reduced the building of an index on an integer column in a
skinny 200 million row totally randomly ordered table by about 3% from
a baseline of 25 minutes.

I think it would be better to pre-deduct the tape overhead amount we
will need if we decide to switch to tape sort from the availMem before
we even start reading (and then add it back if we do indeed make that
switch).  That way we wouldn't over-run the memory in the first place.
 However, that would cause apparent regressions in which sorts that
previously fit into maintenance_work_mem no longer do.  Boosting
maintenance_work_mem to a level that was actually being used
previously would fix those regressions, but pointing out that the
previous behavior was not optimal doesn't change the fact that people
are used to it and perhaps tuned to it.  So the attached patch seems
more backwards-friendly.


Cheers,

Jeff


sort_mem_usage_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

53 matches

Mail list logo