Re: [HACKERS] More thoughts on sorting

2009-08-01 Thread PFC

PFC li...@peufeu.com writes:

- for short strings (average 12 bytes), sort is CPU-bound in strcoll()
- for longer strings (average 120 bytes), sort is even more CPU-bound in
strcoll()


No news there.  If you are limited by the speed of text comparisons,
consider using C locale.

regards, tom lane



	Actually, I think (see the bottom of my last email) that this would be a  
good argument for the per-column COLLATE patch...


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] More thoughts on sorting

2009-08-01 Thread Martijn van Oosterhout
On Sat, Aug 01, 2009 at 09:37:11AM +0200, PFC wrote:
 PFC li...@peufeu.com writes:
 - for short strings (average 12 bytes), sort is CPU-bound in strcoll()
 - for longer strings (average 120 bytes), sort is even more CPU-bound in
 strcoll()

 No news there.  If you are limited by the speed of text comparisons,
 consider using C locale.

   Actually, I think (see the bottom of my last email) that this would be 
 a 
 good argument for the per-column COLLATE patch...

Standard SQL COLLATE support is per column anyway, so just implementing
that will solve all the problems anyway.

Have a nice day,

-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] More thoughts on sorting

2009-08-01 Thread Greg Stark
On Sat, Aug 1, 2009 at 6:17 PM, Martijn van Oosterhoutklep...@svana.org wrote:
 On Sat, Aug 01, 2009 at 09:37:11AM +0200, PFC wrote:
       Actually, I think (see the bottom of my last email) that this would be 
 a
 good argument for the per-column COLLATE patch...

 Standard SQL COLLATE support is per column anyway, so just implementing
 that will solve all the problems anyway.

Well it's more flexible than that. You can specify the collation to
use in your order by clause or comparison operator. So even for the
same column it can be different for different queries.


-- 
greg
http://mit.edu/~gsstark/resume.pdf

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] More thoughts on sorting

2009-07-31 Thread PFC


There was a thread some time ago about sorting... it kind of died...
I did some tests on a desktop (Postgres 8.3.7, kubuntu, Core 2 dual core,  
4GB RAM, RAID1 of 2 SATA disks)


Quick conclusions :

- grabbing the stuff to sort can be IO bound of course (not here)
- for short strings (average 12 bytes), sort is CPU-bound in strcoll()
- for longer strings (average 120 bytes), sort is even more CPU-bound in  
strcoll()
- strcoll() time seems to depend on the length of the strings, not the  
place where a difference occurs (grokking glibc source code confirms)


See detailed test procedure below.

Locale is fr_FR.UTF-8 and database is UNICODE
All strings are ASCII, they are mostly alphanumeric references.
There are 391469 strings.
min length 6 chars
max length 80 chars
avg length 11.82 chars

We have a table test with (id INTEGER PRIMARY KEY, TEXT, BYTEA ), and  
contents of TEXT and BYTEA are identical.
We have a table test2 which contains the same thing as test, except the id  
and a 100-character constant are appended to the strings to make them  
longer.



Test Procedure :

Grab test data from :
http://home.peufeu.com/pg/test_data_references.copy.gz

 Sorting with Python

Sorting all string converted to unicode (from utf8) using strcoll() and  
correct locale

= 5.8 s

With longer strings (as in table test2 below )
= 8 s


 Postgres

To get query timings, I use \t and SELECT * FROM test ORDER BY id OFFSET
391468; which avoids EXPLAIN ANALYZE overhead, it just prints the last  
row from the results. Timings are a bit shorter than EXPLAIN ANALYZE  
gives, and I checked the plans, they are all sorts.


-- Create test table and load it
BEGIN;
CREATE TABLE test1 (t TEXT NOT NULL);
\copy test1 FROM test_data_references.copy
CREATE TABLE test (id SERIAL PRIMARY KEY, t TEXT NOT NULL, b BYTEA NOT
NULL );
INSERT INTO test (t,b) SELECT t,t::BYTEA FROM test1;
DROP TABLE test1;
ALTER TABLE test DROP CONSTRAINT test_pkey;
CREATE TABLE test2 (id INTEGER NOT NULL, t TEXT NOT NULL, b BYTEA NOT NULL
);
INSERT INTO test2 SELECT id,
(t || id || 'This is a dummy text of length 100  
bytes') AS t,
(t || id || 'This is a dummy text of length 100  
bytes')::BYTEA

AS b
 FROM test;
COMMIT;

\d test

SHOW work_mem;
-- 16MB
SHOW maintenance_work_mem;
-- 512MB

\timing
-- cache it really well
SELECT count(*) FROM test;
SELECT count(*) FROM test;
SELECT count(*) FROM test;
-- 391469
-- Temps : 87,033 ms

SELECT * FROM test ORDER BY id OFFSET 391468;
-- Temps : 918,893 ms

SELECT id FROM test ORDER BY id OFFSET 391468;
-- Temps : 948,015 ms

Interpretation :
- Time for hauling around extra data (SELECT * instead of SELECT id) is  
not significant.
- Sorting by integers is quite fast (not THAT fast though, MySQL below is  
3x faster when selecting just 'id' and 2x slower when SELECT *, hum.)


SELECT * FROM test ORDER BY b OFFSET 391468;
-- Temps : 2145,555 ms

SELECT id FROM test ORDER BY b OFFSET 391468;
-- Temps : 2152,273 ms

Interpretation :
- Time for hauling around extra data (SELECT * instead of SELECT id) is  
not significant.
- Sorting by BYTEA is just a memcmp(), it is strange that is it 2x slower  
than ints. Probably the varlena stuff, I guess.

- See ridiculous MySQL results using a BLOB below which are 10x slower

SELECT * FROM test ORDER BY t OFFSET 391468;
-- Temps : 7305,373 ms

SELECT id FROM test ORDER BY t OFFSET 391468;
-- Temps : 7345,234 ms

Interpretation :
- Time for hauling around extra data (SELECT * instead of SELECT id) is  
not significant.

- Sorting localized TEXT really is SLOW !
- The little test above calling strcoll() from Python confirms the  
slowness is in strcoll()
- MySQL (see below) seems to be much faster (about equal to postgres) on  
VARCHAR, and 2x slower on TEXT (hum...)


BEGIN;
CREATE INDEX test_id ON test( id );
-- Temps : 555,718 ms

CREATE INDEX test_b ON test( b );
-- Temps : 1762,263 ms

CREATE INDEX test_t ON test( t );
-- Temps : 6274,624 ms

ROLLBACK;

Interpretation :
- maintenance_work_mem is much higher than work_mem so the sorts are  
faster, but the slowness in localized text sorting subsists...



SELECT count(*) FROM test2;
-- 391469
-- Temps : 114,669 ms

SELECT * FROM test2 ORDER BY id OFFSET 391468;
-- Temps : 1788,246 ms

SELECT id FROM test2 ORDER BY id OFFSET 391468;
-- Temps : 989,238 ms

Interpretation :
- Time for hauling around extra data (SELECT * instead of SELECT id) IS  
significant this time due to the extra string lengths.


SELECT * FROM test2 ORDER BY b OFFSET 391468;
-- Temps : 2906,108 ms

SELECT id FROM test2 ORDER BY b OFFSET 391468;
-- Temps : 2554,931 ms

SELECT * FROM test2 ORDER BY t OFFSET 391468;
-- Temps : 10637,649 ms

SELECT id FROM test2 ORDER BY t OFFSET 391468;
-- Temps : 10322,480 ms

Interpretation :
- Note : the strings are longer, however they are sortable only by looking  
at 

Re: [HACKERS] More thoughts on sorting

2009-07-31 Thread Tom Lane
PFC li...@peufeu.com writes:
 - for short strings (average 12 bytes), sort is CPU-bound in strcoll()
 - for longer strings (average 120 bytes), sort is even more CPU-bound in  
 strcoll()

No news there.  If you are limited by the speed of text comparisons,
consider using C locale.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers