There was a thread some time ago about sorting... it kind of died...
I did some tests on a desktop (Postgres 8.3.7, kubuntu, Core 2 dual core,
4GB RAM, RAID1 of 2 SATA disks)
Quick conclusions :
- grabbing the stuff to sort can be IO bound of course (not here)
- for short strings (average 12 bytes), sort is CPU-bound in strcoll()
- for longer strings (average 120 bytes), sort is even more CPU-bound in
strcoll()
- strcoll() time seems to depend on the length of the strings, not the
place where a difference occurs (grokking glibc source code confirms)
See detailed test procedure below.
Locale is fr_FR.UTF-8 and database is UNICODE
All strings are ASCII, they are mostly alphanumeric references.
There are 391469 strings.
min length 6 chars
max length 80 chars
avg length 11.82 chars
We have a table test with (id INTEGER PRIMARY KEY, TEXT, BYTEA ), and
contents of TEXT and BYTEA are identical.
We have a table test2 which contains the same thing as test, except the id
and a 100-character constant are appended to the strings to make them
longer.
Test Procedure :
Grab test data from :
http://home.peufeu.com/pg/test_data_references.copy.gz
Sorting with Python
Sorting all string converted to unicode (from utf8) using strcoll() and
correct locale
= 5.8 s
With longer strings (as in table test2 below )
= 8 s
Postgres
To get query timings, I use \t and SELECT * FROM test ORDER BY id OFFSET
391468; which avoids EXPLAIN ANALYZE overhead, it just prints the last
row from the results. Timings are a bit shorter than EXPLAIN ANALYZE
gives, and I checked the plans, they are all sorts.
-- Create test table and load it
BEGIN;
CREATE TABLE test1 (t TEXT NOT NULL);
\copy test1 FROM test_data_references.copy
CREATE TABLE test (id SERIAL PRIMARY KEY, t TEXT NOT NULL, b BYTEA NOT
NULL );
INSERT INTO test (t,b) SELECT t,t::BYTEA FROM test1;
DROP TABLE test1;
ALTER TABLE test DROP CONSTRAINT test_pkey;
CREATE TABLE test2 (id INTEGER NOT NULL, t TEXT NOT NULL, b BYTEA NOT NULL
);
INSERT INTO test2 SELECT id,
(t || id || 'This is a dummy text of length 100
bytes') AS t,
(t || id || 'This is a dummy text of length 100
bytes')::BYTEA
AS b
FROM test;
COMMIT;
\d test
SHOW work_mem;
-- 16MB
SHOW maintenance_work_mem;
-- 512MB
\timing
-- cache it really well
SELECT count(*) FROM test;
SELECT count(*) FROM test;
SELECT count(*) FROM test;
-- 391469
-- Temps : 87,033 ms
SELECT * FROM test ORDER BY id OFFSET 391468;
-- Temps : 918,893 ms
SELECT id FROM test ORDER BY id OFFSET 391468;
-- Temps : 948,015 ms
Interpretation :
- Time for hauling around extra data (SELECT * instead of SELECT id) is
not significant.
- Sorting by integers is quite fast (not THAT fast though, MySQL below is
3x faster when selecting just 'id' and 2x slower when SELECT *, hum.)
SELECT * FROM test ORDER BY b OFFSET 391468;
-- Temps : 2145,555 ms
SELECT id FROM test ORDER BY b OFFSET 391468;
-- Temps : 2152,273 ms
Interpretation :
- Time for hauling around extra data (SELECT * instead of SELECT id) is
not significant.
- Sorting by BYTEA is just a memcmp(), it is strange that is it 2x slower
than ints. Probably the varlena stuff, I guess.
- See ridiculous MySQL results using a BLOB below which are 10x slower
SELECT * FROM test ORDER BY t OFFSET 391468;
-- Temps : 7305,373 ms
SELECT id FROM test ORDER BY t OFFSET 391468;
-- Temps : 7345,234 ms
Interpretation :
- Time for hauling around extra data (SELECT * instead of SELECT id) is
not significant.
- Sorting localized TEXT really is SLOW !
- The little test above calling strcoll() from Python confirms the
slowness is in strcoll()
- MySQL (see below) seems to be much faster (about equal to postgres) on
VARCHAR, and 2x slower on TEXT (hum...)
BEGIN;
CREATE INDEX test_id ON test( id );
-- Temps : 555,718 ms
CREATE INDEX test_b ON test( b );
-- Temps : 1762,263 ms
CREATE INDEX test_t ON test( t );
-- Temps : 6274,624 ms
ROLLBACK;
Interpretation :
- maintenance_work_mem is much higher than work_mem so the sorts are
faster, but the slowness in localized text sorting subsists...
SELECT count(*) FROM test2;
-- 391469
-- Temps : 114,669 ms
SELECT * FROM test2 ORDER BY id OFFSET 391468;
-- Temps : 1788,246 ms
SELECT id FROM test2 ORDER BY id OFFSET 391468;
-- Temps : 989,238 ms
Interpretation :
- Time for hauling around extra data (SELECT * instead of SELECT id) IS
significant this time due to the extra string lengths.
SELECT * FROM test2 ORDER BY b OFFSET 391468;
-- Temps : 2906,108 ms
SELECT id FROM test2 ORDER BY b OFFSET 391468;
-- Temps : 2554,931 ms
SELECT * FROM test2 ORDER BY t OFFSET 391468;
-- Temps : 10637,649 ms
SELECT id FROM test2 ORDER BY t OFFSET 391468;
-- Temps : 10322,480 ms
Interpretation :
- Note : the strings are longer, however they are sortable only by looking
at