Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-07 Thread Eliot Gable
Thanks again for all the input and suggestions from people. I have this
sorting algorithm re-implemented in C now, and it takes somewhere around 2 ms
to run; though it is difficult to get a more accurate measure. There may be
some additional optimizations I can come up with, but for now, this will
work very well compared to the alternative methods.

On Tue, Jul 6, 2010 at 6:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Eliot Gable <egable+pgsql-performa...@gmail.com> writes:
  Do I need to somehow force the server to unload and then re-load this .so
  file each time I build a new version of it? If so, how do I do that?

 Start a new database session.

regards, tom lane




-- 
Eliot Gable

We do not inherit the Earth from our ancestors: we borrow it from our
children. ~David Brower

I decided the words were too conservative for me. We're not borrowing from
our children, we're stealing from them--and it's not even considered to be a
crime. ~David Brower

Esse oportet ut vivas, non vivere ut edas. (Thou shouldst eat to live; not
live to eat.) ~Marcus Tullius Cicero


Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-07 Thread Kenneth Marshall
Hi Eliot,

Would you mind posting your code for reference? It is nice to
have working examples when trying to figure out how it all fits
together.

Regards,
Ken

On Wed, Jul 07, 2010 at 03:23:12PM -0400, Eliot Gable wrote:
 Thanks again for all the input and suggestions from people. I have this
 sorting algorithm re-implemented in C now, and it takes somewhere around 2 ms
 to run; though it is difficult to get a more accurate measure. There may be
 some additional optimizations I can come up with, but for now, this will
 work very well compared to the alternative methods.
 
 On Tue, Jul 6, 2010 at 6:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 
  Eliot Gable <egable+pgsql-performa...@gmail.com> writes:
   Do I need to somehow force the server to unload and then re-load this .so
   file each time I build a new version of it? If so, how do I do that?
 
  Start a new database session.
 
 regards, tom lane
 
 
 
 
 -- 
 Eliot Gable
 
 We do not inherit the Earth from our ancestors: we borrow it from our
 children. ~David Brower
 
 I decided the words were too conservative for me. We're not borrowing from
 our children, we're stealing from them--and it's not even considered to be a
 crime. ~David Brower
 
 Esse oportet ut vivas, non vivere ut edas. (Thou shouldst eat to live; not
 live to eat.) ~Marcus Tullius Cicero

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-06 Thread Robert Haas
On Sat, Jul 3, 2010 at 4:17 PM, Eliot Gable
egable+pgsql-performa...@gmail.com wrote:
 Read RFC 2782 on random weighted load balancing of SRV records inside DNS.

It may be asking a bit much to expect people here to read an RFC to
figure out how to help you solve this problem, but...

 I've looked through the documentation on how to re-write this in C, but I
 cannot seem to find good documentation on working with the input array
 (which is an array of a complex type). I also don't see good documentation
 for working with the complex type. I found stuff that talks about
 constructing a complex type in C and returning it. However, I'm not sure how
 to take an input complex type and deconstruct it into something I can work
 with in C. Also, the memory context management stuff is not entirely clear.

...there's no question that writing things in C is a lot more work,
and takes some getting used to.  Still, it's fast, so maybe worth it,
especially since you already know C++, and will therefore mostly just
need to learn the PostgreSQL coding conventions.  The best thing to do
is probably to look at some of the existing examples within the
backend code.  Most of the datatype code is in src/backend/utils/adt.
You might want to look at arrayfuncs.c (perhaps array_ref() or
array_map()); and also rowtypes.c (perhaps record_cmp()).

 Specifically, how do I go about preserving the pointers to the data that I
 allocate in multi-call memory context so that they still point to the data
 on the next call to the function for the next result row? Am I supposed to
 set up some global variables to do that, or am I supposed to take a
 different approach? If I need to use global variables, then how do I deal
 with concurrency?

Global variables would be a bad idea, not so much because of
concurrency as because they won't get cleaned up properly.  Again, the
best thing to do is to look at existing examples, like array_unnest()
in src/backend/utils/adt/arrayfuncs.c; the short answer is that you
probably want to compute all your results on the first call and stash
them in the FuncCallContext (funcctx->user_fctx); and then on
subsequent calls just return one row per call.
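For anyone following along, the overall shape Robert is describing looks roughly like the skeleton below, modeled on array_unnest(). It is only a sketch: build_results(), have_more(), and next_row() are placeholder names standing in for your own code, and the fragment will not compile outside a server extension build.

```c
#include "postgres.h"
#include "funcapi.h"

PG_FUNCTION_INFO_V1(grouped_random_weighted_lb);

Datum
grouped_random_weighted_lb(PG_FUNCTION_ARGS)
{
    FuncCallContext *funcctx;

    if (SRF_IS_FIRSTCALL())
    {
        MemoryContext oldcontext;

        funcctx = SRF_FIRSTCALL_INIT();
        oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

        /* Compute the entire sorted result set once, up front, and
         * stash it where the later per-row calls can find it. */
        funcctx->user_fctx = build_results(fcinfo);   /* placeholder */

        MemoryContextSwitchTo(oldcontext);
    }

    funcctx = SRF_PERCALL_SETUP();

    if (have_more(funcctx->user_fctx))                /* placeholder */
        SRF_RETURN_NEXT(funcctx, next_row(funcctx->user_fctx));
    else
        SRF_RETURN_DONE(funcctx);
}
```

Everything allocated in multi_call_memory_ctx survives across calls, which is what makes the stashed pointer safe to dereference on the second and later invocations.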

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-06 Thread Eliot Gable
On Tue, Jul 6, 2010 at 3:01 PM, Robert Haas robertmh...@gmail.com wrote:

 On Sat, Jul 3, 2010 at 4:17 PM, Eliot Gable
 egable+pgsql-performa...@gmail.com wrote:
  Read RFC 2782 on random weighted load balancing of SRV records inside
 DNS.

 It may be asking a bit much to expect people here to read an RFC to
 figure out how to help you solve this problem, but...


Yeah, I was not actually expecting them to read the whole RFC. The section
on random weighted load balancing is only a few paragraphs and describes
just the algorithm I am trying to implement as efficiently as possible:

   Priority
The priority of this target host.  A client MUST attempt to
contact the target host with the lowest-numbered priority it can
reach; target hosts with the same priority SHOULD be tried in an
order defined by the weight field.  The range is 0-65535.  This
is a 16 bit unsigned integer in network byte order.

   Weight
A server selection mechanism.  The weight field specifies a
relative weight for entries with the same priority. Larger
weights SHOULD be given a proportionately higher probability of
being selected. The range of this number is 0-65535.  This is a
16 bit unsigned integer in network byte order.  Domain
administrators SHOULD use Weight 0 when there isn't any server
selection to do, to make the RR easier to read for humans (less
noisy).  In the presence of records containing weights greater
than 0, records with weight 0 should have a very small chance of
being selected.

In the absence of a protocol whose specification calls for the
use of other weighting information, a client arranges the SRV
RRs of the same Priority in the order in which target hosts,
specified by the SRV RRs, will be contacted. The following
algorithm SHOULD be used to order the SRV RRs of the same
priority:

To select a target to be contacted next, arrange all SRV RRs
(that have not been ordered yet) in any order, except that all
those with weight 0 are placed at the beginning of the list.

Compute the sum of the weights of those RRs, and with each RR
associate the running sum in the selected order. Then choose a
uniform random number between 0 and the sum computed
(inclusive), and select the RR whose running sum value is the
first in the selected order which is greater than or equal to
the random number selected. The target host specified in the
selected SRV RR is the next one to be contacted by the client.
Remove this SRV RR from the set of the unordered SRV RRs and
apply the described algorithm to the unordered SRV RRs to select
the next target host.  Continue the ordering process until there
are no unordered SRV RRs.  This process is repeated for each
Priority.

The difference between this description and my implementation is that I have
added a "group" field to the mix, so that this algorithm is applied to each
group independently of the others. Also, my input data has an "id" field
which must be present on the same rows of the output and is used to map the
output back to my original input.


 ...there's no question that writing things in C is a lot more work,
 and takes some getting used to.  Still, it's fast, so maybe worth it,
 especially since you already know C++, and will therefore mostly just
 need to learn the PostgreSQL coding conventions.  The best thing to do
 is probably to look at some of the existing examples within the
 backend code.  Most of the datatype code is in src/backend/utils/adt.
 You might want to look at arrayfuncs.c (perhaps array_ref() or
 array_map()); and also rowtypes.c (perhaps record_cmp()).


I did actually find the arrayfuncs.c file and start looking through it for
examples. I'm just not entirely clear on what is going on in some of those
functions -- what is necessary to keep in order to extract my data and get
it represented in C structures and what I can remove. I was hoping there was
some sort of documentation on how to work with input arrays for extracting
the data and getting it converted. In any event, I have spent several hours
reverse engineering how that stuff works, and I think I am pretty close to
being able to get my data into a C structure that I can work with.


  Specifically, how do I go about preserving the pointers to the data that
 I
  allocate in multi-call memory context so that they still point to the
 data
  on the next call to the function for the next result row? Am I supposed
 to
  set up some global variables to do that, or am I supposed to take a
  different approach? If I need to use global variables, then how do I deal
  with concurrency?

 Global variables would be a bad idea, not so much because of
 concurrency as because they won't get cleaned up properly.

Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-06 Thread Joe Conway
On 07/06/2010 12:42 PM, Eliot Gable wrote:
 Thanks for suggesting array_unnest(). I think that will actually prove
 more useful to me than the other example I'm using for extracting my
 data from an array. I was actually planning on computing the order on
 the first call and storing it in a linked list which gets returned one
 item at a time until all rows have been returned. Also, I found a code
 example using Google that showed someone storing data across function
 calls using that pointer. I used their example to produce this:
 
 <snip>
 if(SRF_IS_FIRSTCALL()) {
 funcctx = SRF_FIRSTCALL_INIT();
 
 /* This is where we stick our sorted data for returning later */
 funcctx->user_fctx =
 MemoryContextAlloc(funcctx->multi_call_memory_ctx, sizeof(sort_data));
 oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
 data = (sort_data*) funcctx->user_fctx;
 </snip>
 
 I have a structure set up that is typedef'd to sort_data which stores
 pointers to various things that I need to survive across the calls.
 Since this seems to be what you are suggesting, I assume this is the
 correct approach.

This approach works, but you could also use the SFRM_Materialize mode
and calculate the entire result set in one go. That tends to be simpler.
See, for example crosstab_hash() in contrib/tablefunc for an example.

FWIW, there are also some good examples of array handling in PL/R, e.g.
pg_array_get_r() in pg_conversion.c

HTH,

Joe

-- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting,  Support





Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-06 Thread Eliot Gable
On Tue, Jul 6, 2010 at 4:00 PM, Joe Conway m...@joeconway.com wrote:



 This approach works, but you could also use the SFRM_Materialize mode
 and calculate the entire result set in one go. That tends to be simpler.
 See, for example crosstab_hash() in contrib/tablefunc for an example.

 FWIW, there are also some good examples of array handling in PL/R, e.g.
 pg_array_get_r() in pg_conversion.c


 Thanks. That looks like less code and probably will be slightly more
efficient.


Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-06 Thread Eliot Gable
On Tue, Jul 6, 2010 at 4:17 PM, Eliot Gable 
egable+pgsql-performa...@gmail.com wrote:


 On Tue, Jul 6, 2010 at 4:00 PM, Joe Conway m...@joeconway.com wrote:



 This approach works, but you could also use the SFRM_Materialize mode
 and calculate the entire result set in one go. That tends to be simpler.
 See, for example crosstab_hash() in contrib/tablefunc for an example.

 FWIW, there are also some good examples of array handling in PL/R, e.g.
 pg_array_get_r() in pg_conversion.c


  Thanks. That looks like less code and probably will be slightly more
 efficient.


I just got my first test of the new C-based function compiled and loaded
into the server. The first time it is called, I see it correctly print the
priority of each of the five rows of the array that I passed to it:

Got priority 1.
Got priority 1.
Got priority 1.
Got priority 1.
Got priority 1.
CONTEXT: ERROR
CODE: XX000
MESSAGE: cache lookup failed for type 7602245
---

I assume this cache lookup error is because I am not actually returning
any results (or even NULL) at the end of the function call. If it means
something else, please let me know.

Do I need to somehow force the server to unload and then re-load this .so
file each time I build a new version of it? If so, how do I do that? Can I
just re-run the "CREATE OR REPLACE FUNCTION" SQL code again to make that
happen? In every other system I have dealt with where I build a module, I
have some way to unload the module and force it to load again; but I don't
see a mention of that in the PostgreSQL documentation.

Thanks again to everyone who has provided feedback.

-- 
Eliot Gable


Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-06 Thread Tom Lane
Eliot Gable egable+pgsql-performa...@gmail.com writes:
 Do I need to somehow force the server to unload and then re-load this .so
 file each time I build a new version of it? If so, how do I do that?

Start a new database session.

regards, tom lane



Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-03 Thread Merlin Moncure
On Fri, Jul 2, 2010 at 11:17 PM, Eliot Gable
egable+pgsql-performa...@gmail.com wrote:
 Well, I re-wrote the algorithm in Perl. However, it did not solve the speed
 issue. Running time now is a whopping 240+ ms instead of the 31.8ms I was
 getting before (15.2 of which is sorting). Here is the Perl code on the
 sorting. I won't post the pl/pgsql code, because this is far more clear (in
 my opinion) on what the algorithm does:
 DROP TYPE IF EXISTS glbtype CASCADE;
 CREATE TYPE glbtype AS (
 id INTEGER,
 "group" TEXT,
 priority INTEGER,
 weight INTEGER
 );
 DROP TYPE IF EXISTS glbtype_result CASCADE;
 CREATE TYPE glbtype_result AS (
 id INTEGER,
 priority INTEGER,
 weight INTEGER,
 "order" BIGINT
 );
 CREATE OR REPLACE FUNCTION GroupedRandomWeightedLB(glbtype[]) RETURNS SETOF
 glbtype_result AS

ok, I didn't take the time to read your implementation and completely
understand it, but it looks like you're looking at a N^2 sorting at
best.

You probably want to do something like this (it might not be quite
right, you need to explain what each of your input array fields is
supposed to represent):
CREATE OR REPLACE FUNCTION GroupedRandomWeightedLB(glbtype[]) RETURNS SETOF
glbtype_result AS
$$
  with g as (select unnest($1) as t)
  select ((t).id, (t).priority, (t).weight, 0)::glbtype_result
  from g order by (t)."group", (t).priority, random() * (t).weight;
$$ language sql;

(not sure what "order" is, is that the rownum, can't that just be
inferred from the array position?)

merlin



Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-03 Thread Eliot Gable
Read RFC 2782 on random weighted load balancing of SRV records inside DNS.
That algorithm is what I need implemented, but with an extension. I have
groups of records I need to have the algorithm applied to where each group
is treated separately from the others. I understand the operational
complexity of what I'm doing. It is more like N^3, or more precisely G*P*W
where G is the number of groups, P the number of priorities per group, and W
the number of different weights per priority. But, the complexity of the
algorithm means nothing in terms of performance or run time because it will
only ever deal with very small sets of records (maybe 20 rows of data,
tops). Even if the algorithm were N^4, it wouldn't matter with that few
records. But, more importantly, there are constraints in how the data is
sub-divided. Primarily, G < P < W. Further, G and P are groupings which
subdivide the entire set of data and the groups do not have overlapping
data. So, maybe it's more like N^2.2 or something. But, regardless, we're
only talking about 20 rows, tops.

The issue is how efficiently the languages can deal with arrays. In Perl, I
have to parse a string into an array of data, then break it up into sub
arrays inside associative arrays just to work with the input. I also have to
splice the array to remove elements, which I don't think is very efficient.
Any way I could come up with of removing elements involved rebuilding the
entire array. The same thing goes for pl/pgsql. Dealing with arrays there is
also not very efficient. I do a lot of constructing of arrays from sets of
data using myvar = array(select blah);. While pl/pgsql was considerably
faster than Perl, it cannot come close to what I did in C++ using a hash of
a hash of a linked list. The two hash tables provide my groupings and the
linked list gives me something that is highly efficient for removing
elements as I pick them.

I've looked through the documentation on how to re-write this in C, but I
cannot seem to find good documentation on working with the input array
(which is an array of a complex type). I also don't see good documentation
for working with the complex type. I found stuff that talks about
constructing a complex type in C and returning it. However, I'm not sure how
to take an input complex type and deconstruct it into something I can work
with in C. Also, the memory context management stuff is not entirely clear.
Specifically, how do I go about preserving the pointers to the data that I
allocate in multi-call memory context so that they still point to the data
on the next call to the function for the next result row? Am I supposed to
set up some global variables to do that, or am I supposed to take a
different approach? If I need to use global variables, then how do I deal
with concurrency?

On Sat, Jul 3, 2010 at 2:08 PM, Merlin Moncure mmonc...@gmail.com wrote:

 On Fri, Jul 2, 2010 at 11:17 PM, Eliot Gable
 egable+pgsql-performa...@gmail.com wrote:
  Well, I re-wrote the algorithm in Perl. However, it did not solve the
 speed
  issue. Running time now is a whopping 240+ ms instead of the 31.8ms I was
  getting before (15.2 of which is sorting). Here is the Perl code on the
  sorting. I won't post the pl/pgsql code, because this is far more clear
 (in
  my opinion) on what the algorithm does:
  DROP TYPE IF EXISTS glbtype CASCADE;
  CREATE TYPE glbtype AS (
  id INTEGER,
  "group" TEXT,
  priority INTEGER,
  weight INTEGER
  );
  DROP TYPE IF EXISTS glbtype_result CASCADE;
  CREATE TYPE glbtype_result AS (
  id INTEGER,
  priority INTEGER,
  weight INTEGER,
  "order" BIGINT
  );
  CREATE OR REPLACE FUNCTION GroupedRandomWeightedLB(glbtype[]) RETURNS
 SETOF
  glbtype_result AS

 ok, I didn't take the time to read your implementation and completely
 understand it, but it looks like you're looking at a N^2 sorting at
 best.

 You probably want to do something like this (it might not be quite
 right, you need to explain what each of your input array fields is
 supposed to represent):
 CREATE OR REPLACE FUNCTION GroupedRandomWeightedLB(glbtype[]) RETURNS SETOF
 glbtype_result AS
 $$
  with g as (select unnest($1) as t)
  select ((t).id, (t).priority, (t).weight, 0)::glbtype_result
  from g order by (t)."group", (t).priority, random() * (t).weight;
 $$ language sql;

 (not sure what "order" is, is that the rownum, can't that just be
 inferred from the array position?)

 merlin




-- 
Eliot Gable

We do not inherit the Earth from our ancestors: we borrow it from our
children. ~David Brower

I decided the words were too conservative for me. We're not borrowing from
our children, we're stealing from them--and it's not even considered to be a
crime. ~David Brower

Esse oportet ut vivas, non vivere ut edas. (Thou shouldst eat to live; not
live to eat.) ~Marcus Tullius Cicero


Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-03 Thread Merlin Moncure
On Sat, Jul 3, 2010 at 4:17 PM, Eliot Gable
egable+pgsql-performa...@gmail.com wrote:
 Read RFC 2782 on random weighted load balancing of SRV records inside DNS.
 That algorithm is what I need implemented, but with an extension. I have
 groups of records I need to have the algorithm applied to where each group
 is treated separately from the others. I understand the operational
 complexity of what I'm doing. It is more like N^3, or more precisely G*P*W
 where G is the number of groups, P the number of priorities per group, and W
 the number of different weights per priority. But, the complexity of the
 algorithm means nothing in terms of performance or run time because it will
 only ever deal with very small sets of records (maybe 20 rows of data,
 tops). Even if the algorithm were N^4, it wouldn't matter with that few
 records. But, more importantly, there are constraints in how the data is
 sub-divided. Primarily, G < P < W. Further, G and P are groupings which
 subdivide the entire set of data and the groups do not have overlapping
 data. So, maybe it's more like N^2.2 or something. But, regardless, we're
 only talking about 20 rows, tops.
 The issue is how efficiently the languages can deal with arrays. In Perl, I
 have to parse a string into an array of data, then break it up into sub
 arrays inside associative arrays just to work with the input. I also have to
 splice the array to remove elements, which I don't think is very efficient.
 Any way I could come up with of removing elements involved rebuilding the
 entire array. The same thing goes for pl/pgsql. Dealing with arrays there is
 also not very efficient. I do a lot of constructing of arrays from sets of
 data using myvar = array(select blah);. While pl/pgsql was considerably
 faster than Perl, it cannot come close to what I did in C++ using a hash of
 a hash of a linked list. The two hash tables provide my groupings and the
 linked list gives me something that is highly efficient for removing
 elements as I pick them.
 I've looked through the documentation on how to re-write this in C, but I
 cannot seem to find good documentation on working with the input array
 (which is an array of a complex type). I also don't see good documentation
 for working with the complex type. I found stuff that talks about
 constructing a complex type in C and returning it. However, I'm not sure how
 to take an input complex type and deconstruct it into something I can work
 with in C. Also, the memory context management stuff is not entirely clear.
 Specifically, how do I go about preserving the pointers to the data that I
 allocate in multi-call memory context so that they still point to the data
 on the next call to the function for the next result row? Am I supposed to
 set up some global variables to do that, or am I supposed to take a
 different approach? If I need to use global variables, then how do I deal
 with concurrency?

please stop top posting.

What about my suggestion doesn't work for your requirements?  (btw,
let me disagree with my peers and state pl/perl is lousy for this type
of job, only SQL and pl/pgsql can interact with postgresql variables
natively for the most part).

merlin



Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-03 Thread Alvaro Herrera
Excerpts from Merlin Moncure's message of Sat Jul 03 18:53:46 -0400 2010:

 What about my suggestion doesn't work for your requirements?  (btw,
 let me disagree with my peers and state pl/perl is lousy for this type
 of job, only SQL and pl/pgsql can interact with postgresql variables
 natively for the most part).

IIRC the other reason pl/perl sucks for this kind of thing is that it
forces a subtransaction to be created before the function call, which is
expensive.  (I might be misremembering and what actually causes a
subtransaction is a SPI call inside a PL/Perl function, which wouldn't
apply here.)



Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-02 Thread Merlin Moncure
On Thu, Jul 1, 2010 at 8:46 PM, Eliot Gable
egable+pgsql-performa...@gmail.com wrote:
 I have a long stored procedure (over 3,000 lines). Originally, it would take
 about 44ms to run the whole query. After lots and lots of tweaking, Postgres
 now runs the entire thing and gathers my results in just 15.2ms, which is
 very impressive given the hardware this is running on. Now, I used to return
 the results unsorted to my C++ backend and then sort them there using my
 custom sort order which provides prioritized, weighted random ordering with
 4 different priority fields and 3 different weighting fields within 3 of
 those 4 priority fields. Needless to say, the sorting is quite complex. I
 wanted to cut down on the amount of data being delivered to my C++ backend,
 so I am using the stored procedure to insert a summary of my results
 directly into the database, which is far more efficient than dumping it all
 to the C++ backend (including stuff that isn't really needed there) and then
 dumping it all back to Postgres via INSERTS later in the execution path. The
 problem is that I want the results sorted in this custom order before they
 are stored in the database. (By sorted, I mean I want to include a field
 that lists a numerical order for the set of results.) Thus, I used to dump
 everything to the C++ program, perform the sorting, then INSERT back to
 Postgres. This was obviously not all that efficient. Now, the sorting in C++
 took 1ms to accomplish. When I re-wrote the sorting in pl/pgsql using a
 couple of additional stored procedures, I discovered it is taking 15.2ms to
 perform the sort of the records within Postgres. This almost cancels out all
 of the prior optimizations I previously performed:
 T:20100702001841+0903010 TID:0x43945940 INFO:NOTICE:  Sorting group ...
 <snip>
 ...
 </snip>

what are you sorting and how are you sorting it?

merlin



Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-02 Thread Eliot Gable
Yes, I have two pl/pgsql functions. They take a prepared set of data (just
the row id of the original results, plus the particular priority and weight
fields) and they return the same set of data with an extra field called
"order", which contains a numerical order to apply when sorting the rows. One
function uses the priority information to break everything into priority
groups, then calls the other function for each priority group. Each time it
gets results back from the inner function, it returns that set of results.
When it has looped through all priority groups, then it returns the full
built-up set of results back to the calling function.

The pl/pgsql functions implementing the sort are as optimized as they are
likely to get. I don't want to waste my time trying to further optimize
pl/pgsql functions that are never going to be as fast and efficient as I
need. I would rather spend that time re-writing it in C and get sorting back
to 1ms.

I guess the real question is, is a generic C sorting function my only real
alternative? Is there anything else that would allow me to sort things
faster than pl/pgsql functions? For example, if I used pl/perl, would I be
able to expect considerably better performance for sorting than using
pl/pgsql? What about other supported languages? If I can get close to 1ms
sorting performance without resorting to C, it would save me much time and
frustration.

On Fri, Jul 2, 2010 at 12:08 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Craig Ringer cr...@postnewspapers.com.au writes:
  On 02/07/10 08:46, Eliot Gable wrote:
  So, the bottom line is, I need a faster way to do this sorting.

  You haven't showed us how you're doing it at the moment, so it's awfully
  hard to comment usefully on possible approaches.

 I'm guessing from tea leaves, but the impression I got from Eliot's
 description is that he's using plpgsql functions as sort comparators.
 It's not surprising that that sucks performance-wise compared to having
 the equivalent logic in C/C++ functions used as comparators on the
 client side.  plpgsql is no speed demon.  Best fix might be to code the
 comparators as C functions on the server side.

regards, tom lane




-- 
Eliot Gable

We do not inherit the Earth from our ancestors: we borrow it from our
children. ~David Brower

I decided the words were too conservative for me. We're not borrowing from
our children, we're stealing from them--and it's not even considered to be a
crime. ~David Brower

Esse oportet ut vivas, non vivere ut edas. (Thou shouldst eat to live; not
live to eat.) ~Marcus Tullius Cicero


Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-02 Thread Matthew Wakeling

On Fri, Jul 2, 2010 at 12:08 AM, Tom Lane t...@sss.pgh.pa.us wrote:

I'm guessing from tea leaves, but the impression I got from Eliot's
description is that he's using plpgsql functions as sort comparators.
It's not surprising that that sucks performance-wise compared to having
the equivalent logic in C/C++ functions used as comparators on the
client side.  plpgsql is no speed demon.  Best fix might be to code the
comparators as C functions on the server side.


On Fri, 2 Jul 2010, Eliot Gable wrote:

I guess the real question is, is a generic C sorting function my only real
alternative?


Sounds to me like you are not really listening. You don't need to code an 
entire sorting algorithm in C, as Postgres already has a pretty good one 
of those. All you need to do is implement a comparator of some kind. 
Inserting C functions into Postgres is pretty easy, especially on the 
level of comparators.


Matthew

--
For those of you who are into writing programs that are as obscure and
complicated as possible, there are opportunities for... real fun here
   -- Computer Science Lecturer



Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-02 Thread Merlin Moncure
On Fri, Jul 2, 2010 at 10:50 AM, Matthew Wakeling matt...@flymine.org wrote:
 On Fri, Jul 2, 2010 at 12:08 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 I'm guessing from tea leaves, but the impression I got from Eliot's
 description is that he's using plpgsql functions as sort comparators.
 It's not surprising that that sucks performance-wise compared to having
 the equivalent logic in C/C++ functions used as comparators on the
 client side.  plpgsql is no speed demon.  Best fix might be to code the
 comparators as C functions on the server side.

 On Fri, 2 Jul 2010, Eliot Gable wrote:

 I guess the real question is, is a generic C sorting function my only real
 alternative?

 Sounds to me like you are not really listening. You don't need to code an
 entire sorting algorithm in C, as Postgres already has a pretty good one of
 those. All you need to do is implement a comparator of some kind. Inserting
 C functions into Postgres is pretty easy, especially on the level of
 comparators.

in recent versions of postgres you rarely if ever even have to do that
-- row types are comparable w/o any extra work, as are arrays.  If
Eliot would just give a little more detail of WHAT he is trying to sort
and HOW he is currently doing it, I suspect his problem will be
trivially solved :-).

merlin



Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-02 Thread Craig James

On 7/2/10 6:59 AM, Eliot Gable wrote:

Yes, I have two pl/pgsql functions. They take a prepared set of data
(just the row id of the original results, plus the particular priority
and weight fields) and they return the same set of data with an extra
field called order which contains a numerical order to apply when
sorting the rows. One function uses the priority information to break
everything into priority groups, then calls the other function for each
priority group. Each time it gets results back from the inner function,
it returns that set of results. When it has looped through all priority
groups, then it returns the full built-up set of results back to the
calling function.

The pl/pgsql functions implementing the sort are as optimized as they
are likely to get. I don't want to waste my time trying to further
optimize pl/pgsql functions that are never going to be as fast and
efficient as I need. I would rather spend that time re-writing it in C
and get sorting back to 1ms.

I guess the real question is, is a generic C sorting function my only
real alternative? Is there anything else that would allow me to sort
things faster than pl/pgsql functions? For example, if I used pl/perl,
would I be able to expect considerably better performance for sorting
than using pl/pgsql? What about other supported languages? If I can get
close to 1ms sorting performance without resorting to C, it would save
me much time and frustration.


Try coding it in perl on the server.  It is MUCH easier to code, and you don't 
have to link anything or learn the esoteric details of the Postgres/C API.

Perl itself is written in C, and some of its operations are extremely fast.
Depending on the size and complexity of your data structures, Perl code may be
just as fast as code you could write in C.

Even if it turns out to be slower than you like, it will give you a way to 
package up your sort functionality into a function call, so if you later find 
you need to replace the Perl function with a C function, the rest of your 
application won't change.

Craig



Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-02 Thread Craig Ringer
On 03/07/10 00:36, Craig James wrote:

 Perl itself is written in C, and some of its operations are extremely
 fast.

The same is true of PL/PgSQL, though ;-)

The main advantage of PL/Perl is that it doesn't rely on the SPI to do
everything. It's interpreted not compiled, but it has a much faster
approach to interpretation than PL/PgSQL.

Really, the right choice depends on exactly what the OP is doing and
how, which they're not saying.

Where's the code?

--
Craig Ringer



Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-02 Thread Eliot Gable
Well, I re-wrote the algorithm in Perl. However, it did not solve the speed
issue. Running time is now a whopping 240+ ms instead of the 31.8ms I was
getting before (15.2ms of which was sorting). Here is the Perl code for the
sorting. I won't post the pl/pgsql code, because this is far clearer (in
my opinion) about what the algorithm does:

DROP TYPE IF EXISTS glbtype CASCADE;
CREATE TYPE glbtype AS (
    id INTEGER,
    "group" TEXT,       -- reserved word, must be quoted
    priority INTEGER,
    weight INTEGER
);

DROP TYPE IF EXISTS glbtype_result CASCADE;
CREATE TYPE glbtype_result AS (
    id INTEGER,
    priority INTEGER,
    weight INTEGER,
    "order" BIGINT      -- reserved word, must be quoted
);

CREATE OR REPLACE FUNCTION GroupedRandomWeightedLB(glbtype[]) RETURNS SETOF
glbtype_result AS
$BODY$
# Input is an array of a composite type, passed in as its string literal
my ($input) = @_;
my %groups;
$input =~ s/^{|}$//g;
$input =~ s/[)(]//g;
my @rows;
my $count = 0;
# Peel off four fields at a time until the string is consumed
while ($input) {
    my ($id, $group, $prio, $weight, @rest) = split(/,/, $input);
    push(@rows, {id => $id, group => $group, priority => $prio,
                 weight => $weight});
    $count++;
    $input = join(',', @rest);
}

if(scalar @rows < 1) {
    elog(NOTICE, '  No rows sent for sorting.');
    return undef;
} else {
    elog(NOTICE, '  '.(scalar @rows).' rows sent for sorting.');
}

foreach my $rw (@rows) {
    if(defined $rw->{group} && defined $rw->{priority} && defined $rw->{weight}) {
        push( @{ $groups{$rw->{group}}{$rw->{priority}} }, $rw);
        elog(NOTICE, '  Pushing '.$rw->{group}.' with prio ('.$rw->{priority}.'),
weight ('.$rw->{weight}.') onto array.');
    } else {
        elog(NOTICE, '  Invalid sort row: Group ('.$rw->{group}.'), Prio
('.$rw->{priority}.'), Weight ('.$rw->{weight}.')');
    }
}

foreach my $group (keys %groups) {
    elog(NOTICE, '  Sorting group '.$group.'...');
    # Process priorities in ascending numeric order
    foreach my $prio (sort { $a <=> $b } keys %{$groups{$group}}) {
        my @rows = @{ $groups{$group}{$prio} };
        elog(NOTICE, 'Sorting '.(scalar @rows).' rows in priority
'.$prio.'...');
        my @zeros;
        my @nonzeros;
        my $total_weight = 0;
        my $row_order = 1;
        for(my $row_id = 0; $row_id < scalar @rows; $row_id++) {
            my $row = $rows[$row_id];
            $total_weight += $row->{weight};
            elog(NOTICE, 'Total Weight ('.$total_weight.')');
            if($row->{weight} == 0) {
                push(@zeros, $row);
            } else {
                push(@nonzeros, $row);
            }
        }
        my @first_order = (@zeros, @nonzeros);
        undef(@zeros);
        undef(@nonzeros);
        while(scalar @first_order) {
            elog(NOTICE, '  '.(scalar @first_order).' items remaining ...');
            my $rand = int(rand($total_weight));
            elog(NOTICE, '  Random weight ('.$rand.')');
            my $running_weight = 0;
            for(my $row_id = 0; $row_id < scalar @first_order; $row_id++) {
                my $row = $first_order[$row_id];
                $running_weight += $row->{weight};
                elog(NOTICE, '  Running weight ('.$running_weight.') Current Weight
('.$row->{weight}.')');
                if($running_weight >= $rand) {
                    elog(NOTICE, ': Priority ('.($row->{priority}).') Weight
('.($row->{weight}).')');
                    return_next(
                        { id => int($row->{id}),
                          priority => int($row->{priority}),
                          weight => int($row->{weight}),
                          order => int($row_order) }
                    );
                    $row_order++;
                    splice(@first_order, $row_id, 1);
                    # Recalculate total weight over the remaining rows
                    $total_weight = 0;
                    foreach my $row (@first_order) {
                        $total_weight += $row->{weight};
                    }
                    elog(NOTICE, ': Remaining Weight ('.$total_weight.')');
                    last;   # 'break' is not Perl; 'last' exits the loop
                }
            }
        }
    }
}
return undef;
$BODY$
LANGUAGE plperl VOLATILE;

5 rows sent for sorting.
Pushing GROUP_7 with prio (1), weight (0) onto array.
Pushing GROUP_7 with prio (1), weight (5) onto array.
Pushing GROUP_8 with prio (1), weight (1) onto array.
Pushing GROUP_8 with prio (1), weight (5) onto array.
Pushing GROUP_8 with prio (1), weight (5) onto array.
Sorting group GROUP_7...
Sorting 2 rows in priority 1...
Total Weight (0)
Total Weight (5)
2 items remaining ...
Random weight (0)
Running weight (0) Current Weight (0)
: Priority (1) Weight (0)
: Remaining Weight (5)
1 items remaining ...
Random weight (0)
Running weight (5) Current Weight (5)
: Priority (1) Weight (5)
: Remaining Weight (0)
Sorting group GROUP_8...
Sorting 3 rows in priority 1...
Total Weight (1)
Total Weight (6)
Total Weight (11)
3 items remaining ...
Random weight (8)
Running weight (1) Current Weight (1)
Running weight (6) Current Weight (5)
Running weight (11) Current Weight (5)
: Priority (1) Weight (5)
: Remaining Weight (6)
2 items remaining ...
Random weight (2)
Running weight (1) Current Weight (1)
Running weight (6) Current Weight (5)
: Priority (1) Weight (5)
: Remaining Weight (1)
1 items remaining ...
Random weight (0)
Running weight (1) Current Weight (1)
: Priority (1) Weight (1)
: Remaining Weight (0)

2 rows sent for sorting.
Pushing GROUP_1 with prio (1), weight (0) onto array.
Pushing GROUP_1 with prio (2), weight (4) onto array.
Sorting group GROUP_1...
Sorting 1 rows in priority 1...
Total Weight (0)
1 items remaining ...
Random weight (0)
Running weight (0) Current Weight (0)
: Priority (1) Weight (0)
: Remaining Weight (0)
Sorting 1 rows in priority 2...
Total Weight (4)
1 items remaining ...
Random weight (2)
Running weight (4) Current Weight (4)
: Priority (2) Weight (4)
: Remaining Weight (0)

Total runtime: 244.101 ms


On Fri, Jul 2, 2010 at 9:44 PM, Craig Ringer 

[PERFORM] Highly Efficient Custom Sorting

2010-07-01 Thread Eliot Gable
I have a long stored procedure (over 3,000 lines). Originally, it would take
about 44ms to run the whole query. After lots and lots of tweaking, Postgres
now runs the entire thing and gathers my results in just 15.2ms, which is
very impressive given the hardware this is running on. I used to return the
results unsorted to my C++ backend and then sort them there using my custom
sort order, which provides prioritized, weighted random ordering with 4
different priority fields and 3 different weighting fields within 3 of
those 4 priority fields. Needless to say, the sorting is quite complex.

I wanted to cut down on the amount of data being delivered to my C++ backend,
so I am using the stored procedure to insert a summary of my results
directly into the database, which is far more efficient than dumping it all
to the C++ backend (including stuff that isn't really needed there) and then
dumping it all back to Postgres via INSERTs later in the execution path. The
problem is that I want the results sorted in this custom order before they
are stored in the database. (By sorted, I mean I want to include a field
that lists a numerical order to apply when sorting the rows.) Previously, I
dumped everything to the C++ program, performed the sorting, then INSERTed
back to Postgres, which was obviously not all that efficient.

The sorting in C++ took 1ms to accomplish. When I re-wrote the sorting in
pl/pgsql using a couple of additional stored procedures, I discovered it
takes 15.2ms to perform the sort of the records within Postgres. This almost
cancels out all of my prior optimizations:

T:20100702001841+0903010 TID:0x43945940 INFO:NOTICE:  Sorting group ...
<snip>
...
</snip>
T:20100702001841+0917683 TID:0x43945940 INFO:NOTICE:  Sorting 1 rows in
priority 1... -- Last sort item
T:20100702001841+0918292 TID:0x43945940 INFO:NOTICE:

918,292 - 903,010 = 15,282 us = 15.282 ms

So, the bottom line is, I need a faster way to do this sorting.

What options are at my disposal for improving the speed and efficiency of
this sorting? Which is the easiest to implement? What are the drawbacks of
each different method?


Thanks in advance for your insights.




Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-01 Thread Craig Ringer
On 02/07/10 08:46, Eliot Gable wrote:

 So, the bottom line is, I need a faster way to do this sorting. 

You haven't showed us how you're doing it at the moment, so it's awfully
hard to comment usefully on possible approaches.

-- 
Craig Ringer

Tech-related writing: http://soapyfrogs.blogspot.com/



Re: [PERFORM] Highly Efficient Custom Sorting

2010-07-01 Thread Tom Lane
Craig Ringer cr...@postnewspapers.com.au writes:
 On 02/07/10 08:46, Eliot Gable wrote:
 So, the bottom line is, I need a faster way to do this sorting. 

 You haven't showed us how you're doing it at the moment, so it's awfully
 hard to comment usefully on possible approaches.

I'm guessing from tea leaves, but the impression I got from Eliot's
description is that he's using plpgsql functions as sort comparators.
It's not surprising that that sucks performance-wise compared to having
the equivalent logic in C/C++ functions used as comparators on the
client side.  plpgsql is no speed demon.  Best fix might be to code the
comparators as C functions on the server side.

regards, tom lane
