Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-08 Thread Leonardo Francalanci
 Applied with some  significant editorialization.  The biggest problem I
 found was that the  code for expression indexes didn't really work, and
 would leak memory like  there's no tomorrow even when it did work.


Sorry I couldn't write the way it was supposed to... I'll look at the
differences to see what I did wrong... (so maybe next time I'll do
better!)


Thank you

Leonardo






Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-07 Thread Tom Lane
Josh Kupershmidt schmi...@gmail.com writes:
 So I think there are definitely cases where this patch helps, but it
 looks like a seq. scan is being chosen in some cases where it doesn't
 help.

I've been poking through this patch, and have found two different ways
in which it underestimates the cost of the seqscan case:

* it's not setting rel->width, resulting in an underestimate of the
amount of disk space needed for a sort; this would get worse for wider
tables.

* it's not allowing for the cost of recomputing index expression values
during comparisons.  That doesn't matter of course if you're not testing
the index-expression case (which other infelicities suggest hasn't
exactly been stressed yet).

I suspect the first of these might have something to do with your
observation.  AFAIR the width value isn't used in estimating indexscan
cost, so this omission would bias it in favor of seqscans, as soon as
the data volume exceeded maintenance_work_mem.
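
(As a rough illustration with invented row sizes: 5M rows of width 4 plus the
per-tuple sort overhead come to a few tens of megabytes of on-disk sort data --
the EXPLAIN quoted upthread showed about 68 MB for 5M integers -- whereas
200-byte rows would push that well past 1 GB. With rel->width left at zero,
the planner prices the external sort as nearly free, so the seqscan+sort plan
looks cheaper than it really is.)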

regards, tom lane



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-07 Thread Tom Lane
Itagaki Takahiro itagaki.takah...@gmail.com writes:
 I re-ordered some description in the doc. Does it look better?
 Comments and suggestions welcome.

Applied with some significant editorialization.  The biggest problem I
found was that the code for expression indexes didn't really work, and
would leak memory like there's no tomorrow even when it did work.
I fixed that, but I think the performance is still going to be pretty
undesirable.  We have to re-evaluate the index expressions for each
tuple each time we do a comparison, which means it's going to be really
really slow unless the index expressions are *very* cheap.  But perhaps
the use-case for clustering on expression indexes is small enough that
this isn't worth worrying about.

I considered computing the index expressions just once as the data is
being fed in, and including their values in the tuples-to-be-sorted;
that would cut the number of times the values have to be computed by
a factor of about log N.  But it'd also bloat the on-disk sort data,
which could possibly cost more in I/O than we save.  So it's not real
clear what to do anyway.
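
(To make the trade-off concrete with a generic C sketch -- plain qsort() with a
made-up expensive key function standing in for an index expression, not
tuplesort code -- the first comparator pays for the expression on every
comparison, roughly N log N evaluations, while the second pays once per row but
has to carry a wider entry through the sort:)

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for an expensive index expression. */
static long expensive_key(const char *row)
{
    long    k = 0;
    for (const char *p = row; *p; p++)
        k = k * 31 + *p;
    return k;
}

/* Variant 1: recompute the key on every comparison (~N log N evaluations). */
static int cmp_recompute(const void *a, const void *b)
{
    long    ka = expensive_key(*(const char *const *) a);
    long    kb = expensive_key(*(const char *const *) b);
    return (ka > kb) - (ka < kb);
}

/* Variant 2: compute the key once per row and sort wider (key, row) pairs. */
typedef struct { long key; const char *row; } KeyedRow;

static int cmp_precomputed(const void *a, const void *b)
{
    const KeyedRow *x = a, *y = b;
    return (x->key > y->key) - (x->key < y->key);
}

int main(void)
{
    const char *rows[] = { "delta", "alpha", "charlie", "bravo" };
    int         n = 4;
    KeyedRow    keyed[4];

    /* Variant 1: sort the raw rows; expensive_key() runs on every comparison. */
    qsort(rows, n, sizeof(rows[0]), cmp_recompute);

    /* Variant 2: compute each key once, then sort the wider entries. */
    for (int i = 0; i < n; i++)
        keyed[i] = (KeyedRow) { expensive_key(rows[i]), rows[i] };
    qsort(keyed, n, sizeof(keyed[0]), cmp_precomputed);

    for (int i = 0; i < n; i++)
        printf("%s\n", keyed[i].row);
    return 0;
}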

regards, tom lane



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-07 Thread Tom Lane
Takahiro Itagaki itagaki.takah...@oss.ntt.co.jp writes:
 BTW, we could have LogicalTapeReadExact() as an alias of
 LogicalTapeRead() and checking the result because we have
 many duplicated codes for unexpected end of data errors.

Good idea, done.

regards, tom lane



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-07 Thread Tom Lane
Itagaki Takahiro itagaki.takah...@gmail.com writes:
 I wrote a patch to improve CLUSTER VERBOSE (and VACUUM FULL VERBOSE).
 The patch should be applied after sorted_cluster-20100721.patch .

Applied with minor fixes; in particular, I think you got the effects of
rewrite_heap_dead_tuple backwards.  When a tuple is removed from
unresolved_tups, that amounts to changing its status from "recently
dead" to "dead", and it will not be written out.

regards, tom lane



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-07 Thread Itagaki Takahiro
On Fri, Oct 8, 2010 at 10:49 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Itagaki Takahiro itagaki.takah...@gmail.com writes:
 I wrote a patch to improve CLUSTER VERBOSE (and VACUUM FULL VERBOSE).
 The patch should be applied after sorted_cluster-20100721.patch .

 Applied with minor fixes; in particular, I think you got the effects of
 rewrite_heap_dead_tuple backwards.  When a tuple is removed from
 unresolved_tups, that amounts to changing its status from "recently
 dead" to "dead", and it will not be written out.

Ah, yes. I misunderstood the behavior. Thanks for the fix!

-- 
Itagaki Takahiro



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-04 Thread Robert Haas
On Oct 1, 2010, at 8:36 PM, Josh Kupershmidt schmi...@gmail.com wrote:
 On Fri, Oct 1, 2010 at 4:33 AM, Leonardo Francalanci m_li...@yahoo.it wrote:
 I ran a few more performance tests on this patch. Here's what  I got
 for the tests Leonardo posted originally:
* 2M  rows:  22 seconds for seq. scan, 24 seconds for index scan
* 5M  rows:  139 seconds for seq. scan, 97 seconds for index scan
*  10M rows: 256 seconds seq. scan, 611 seconds for index scan
 
 I don't have time right now to run more tests, I'll try to make some by
 next week.
 
 Would it mean that doing:
 
 create table p as select * from atable order by akey
 
 (where akey is random distributed)
 with 5M rows is faster with enable_seqscan=0 and
 enable_indexscan=1??? That would be weird, especially on a
 laptop hard drive! (assuming there's a reasonable amount of
 memory set in work_mem/maintenance_work_mem)
 
 Hrm, this is interesting. I set up a test table with 5M rows like so:
 
 CREATE TABLE atable (
   akey   int
 );
 INSERT INTO atable (akey)
 SELECT (RANDOM() * 10)::int FROM generate_series(1,5000000);
 CREATE INDEX akey_idx ON atable(akey);
 ANALYZE atable;
 
 And then I tested table creation times. First, using a normal:
 
 BEGIN;
   SET enable_seqscan = on;
   SET enable_indexscan = on;
   EXPLAIN ANALYZE CREATE TABLE idxscanned AS SELECT * FROM atable
 ORDER BY akey;
 ROLLBACK;
 
 and I get:
   Index Scan using akey_idx on atable
 (cost=0.00..218347.89 rows=5000000 width=4)
 (actual time=0.058..23612.020 rows=5000000 loops=1)
  Total runtime: 33029.884 ms
 
 Then, I tried forcing a sequential scan by changing "SET
 enable_indexscan = off;", and it's significantly faster, as I would
 expect:
 
 Sort  (cost=696823.42..709323.42 rows=5000000 width=4)
  (actual time=8664.699..13533.131 rows=5000000 loops=1)
   Sort Key: akey
   Sort Method:  external merge  Disk: 68304kB
   ->  Seq Scan on atable  (cost=0.00..72124.00 rows=5000000 width=4)
  (actual time=0.012..838.092 rows=5000000 loops=1)
 Total runtime: 21015.501 ms
 
 I've run both of these several times, and get 30-32 seconds for the
 index scan and 20-21 seconds for the seq. scan each time.
 
 My seq_page_cost and random_page_cost were left at the defaults for
 these tests. Oddly, I tried turning seq_page_cost down to 0.01 and
 EXPLAIN ANALYZE told me that an index scan was still being chosen. Is
 there maybe some other setting I'm forgetting?

Did you also adjust random_page_cost?

...Robert


Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-04 Thread Leonardo Francalanci
 It sounds like the costing model might need a bit more work before  we commit 
this.


I tried again the simple sql tests I posted a while ago, and I still get the 
same ratios.
I've tested the applied patch on a dual opteron + disk array Solaris machine.

I really don't get how a laptop hard drive can be faster at reading data using
random seeks (required by the original cluster method) than seq scan + sort for
the 5M rows test case.
Same thing for the cluster vs bloat test: the seq scan + sort is faster on my
machine.

I've just noticed that Josh used shared_buffers = 16MB for the cluster vs bloat
test: I'm using a much higher shared_buffers (I think something like 200MB),
since if you're working with tables this big I thought it could be a more
appropriate value.
Maybe that's the thing that makes the difference???

Can someone else test the patch?

And: I don't have that deep knowledge of how postgresql deletes rows; but I
thought that something like:

DELETE FROM mybloat WHERE RANDOM() < 0.9;

would only delete data, not indexes; so the patch should perform even better in
this case (as it does, in fact, on my test machine), as:

- the original cluster method would read the whole index, and fetch only the
  still alive rows
- the new method would read the table using a seq scan, and sort in memory the
  few rows still alive

But, as I said, maybe I'm getting this part wrong...







Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-04 Thread Josh Kupershmidt
On Mon, Oct 4, 2010 at 8:42 AM, Robert Haas robertmh...@gmail.com wrote:

 Did you also adjust random_page_cost?

No, my seq_page_cost (1) and random_page_cost (4) are from the
defaults. Here are all my non-default settings:

test=# SELECT name, unit, setting FROM pg_settings WHERE source != 'default';
            name            | unit |                  setting
----------------------------+------+-------------------------------------------
 application_name   |  | psql
 config_file|  | /Users/josh/runtime/data/postgresql.conf
 data_directory |  | /Users/josh/runtime/data
 DateStyle  |  | ISO, MDY
 default_text_search_config |  | pg_catalog.english
 effective_cache_size   | 8kB  | 131072
 hba_file   |  | /Users/josh/runtime/data/pg_hba.conf
 ident_file |  | /Users/josh/runtime/data/pg_ident.conf
 lc_collate |  | en_US.UTF-8
 lc_ctype   |  | en_US.UTF-8
 lc_messages|  | C
 lc_monetary|  | C
 lc_numeric |  | C
 lc_time|  | C
 listen_addresses   |  | *
 log_directory  |  | /Users/josh/runtime/logs
 log_line_prefix|  | %t %u %h %d
 log_min_duration_statement | ms   | 0
 log_statement  |  | all
 log_timezone   |  | US/Eastern
 logging_collector  |  | on
 maintenance_work_mem   | kB   | 65536
 max_connections|  | 7
 max_stack_depth| kB   | 2048
 server_encoding|  | UTF8
 shared_buffers | 8kB  | 16384
 TimeZone   |  | US/Eastern
 timezone_abbreviations |  | Default
 transaction_isolation  |  | read committed
 transaction_read_only  |  | off
 wal_buffers| 8kB  | 2048
 work_mem   | kB   | 16384
(32 rows)



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-04 Thread Josh Kupershmidt
On Mon, Oct 4, 2010 at 4:47 PM, Leonardo Francalanci m_li...@yahoo.it wrote:
 It sounds like the costing model might need a bit more work before  we commit
this.


 I tried again the simple sql tests I posted a while ago, and I still get the
 same ratios.
 I've tested the applied patch on a dual opteron + disk array Solaris machine.

 I really don't get how a laptop hard drive can be faster at reading data using
 random seeks (required by the original cluster method) than seq scan + sort
 for the 5M rows test case.
 Same thing for the cluster vs bloat test: the seq scan + sort is faster on my
 machine.

Well, my last tests showed that the planner was choosing an index scan
for queries like:

  SELECT * FROM atable ORDER BY akey;

but forcing a seqscan + sort made this faster, as you expect. So I was
thinking my cost settings (posted upthread) probably need some
tweaking, unless it's a problem with the optimizer. But all of this is
unrelated to the patch.

[... pokes a bit more ...] Sigh, now I'm finding it impossible to
reproduce my own results, particularly the earlier cluster_vs_bloat.sql
test of:

 * 10M rows: 84 seconds for seq. scan, 44 seconds for index scan

I'm getting about 5 seconds now for the cluster, both with and without
the patch. effective_cache_size doesn't seem to impact this much. I'll
have another look when I have some more time.

Josh



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-03 Thread Itagaki Takahiro
On Sat, Oct 2, 2010 at 9:36 AM, Josh Kupershmidt schmi...@gmail.com wrote:
 Hrm, this is interesting. I set up a test table with 5M rows like so:

Such discussions are for the planner itself, right? The sorted cluster
patch uses the existing planner's costing model, so we can discuss the
clustering feature and the improvement of planner in different patches.

 My seq_page_cost and random_page_cost were left at the defaults for
 these tests. Oddly, I tried turning seq_page_cost down to 0.01 and
 EXPLAIN ANALYZE told me that an index scan was still being chosen. Is
 there maybe some other setting I'm forgetting?

It might come from effective_cache_size. We consider the value only
in the index scans. We can also use the effective cache in addition
to work_mem for tapes used by disk sorting, but we don't consider
the effective cache for now.

-- 
Itagaki Takahiro



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-01 Thread Leonardo Francalanci
 I ran a few more performance tests on this patch. Here's what  I got
 for the tests Leonardo posted originally:
* 2M  rows:  22 seconds for seq. scan, 24 seconds for index scan
* 5M  rows:  139 seconds for seq. scan, 97 seconds for index scan
*  10M rows: 256 seconds seq. scan, 611 seconds for index scan

I don't have time right now to run more tests, I'll try to make some by
next week.

Would it mean that doing:

create table p as select * from atable order by akey

(where akey is random distributed)
with 5M rows is faster with enable_seqscan=0 and
enable_indexscan=1??? That would be weird, especially on a
laptop hard drive! (assuming there's a reasonable amount of
memory set in work_mem/maintenance_work_mem)
 
 I tried a few more tests of creating a table with  either 10M or 50M
 rows, then deleting 90% of the rows and running a cluster.  The patch
 didn't fare so well here:

  * 10M rows: 84 seconds for seq.  scan, 44 seconds for index scan

[...]
 So I think there are  definitely cases where this patch helps, but it
 looks like a seq. scan is  being chosen in some cases where it doesn't
 help.
 
 Test machine:  MacBook Pro laptop, C2D 2.53 GHz, 4GB RAM.


Again: what would the planner choose in that case for a:

create table p as select * from mybloat order by myid

???

I guess that if the planner makes a wrong choice in this case (that is,
seq scan + sort instead of index scan) there's no way for cluster to
behave in a different way. If, on the contrary, the create table... uses
the right plan, and cluster doesn't, we have a problem in the patch.  
Am I right?






Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-01 Thread Robert Haas
On Fri, Oct 1, 2010 at 4:33 AM, Leonardo Francalanci m_li...@yahoo.it wrote:
 I guess that if the planner makes a wrong choice in this case (that is,
 seq scan + sort instead of index scan) there's no way for cluster to
 behave in a different way. If, on the contrary, the create table... uses
 the right plan, and cluster doesn't, we have a problem in the patch.
 Am I right?

Good point.  I think you're right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-01 Thread Josh Kupershmidt
On Fri, Oct 1, 2010 at 4:33 AM, Leonardo Francalanci m_li...@yahoo.it wrote:
 I ran a few more performance tests on this patch. Here's what  I got
 for the tests Leonardo posted originally:
    * 2M  rows:  22 seconds for seq. scan, 24 seconds for index scan
    * 5M  rows:  139 seconds for seq. scan, 97 seconds for index scan
    *  10M rows: 256 seconds seq. scan, 611 seconds for index scan

 I don't have time right now to run more tests, I'll try to make some by
 next week.

 Would it mean that doing:

 create table p as select * from atable order by akey

 (where akey is random distributed)
 with 5M rows is faster with enable_seqscan=0 and
 enable_indexscan=1??? That would be weird, especially on a
 laptop hard drive! (assuming there's a reasonable amount of
 memory set in work_mem/maintenance_work_mem)

Hrm, this is interesting. I set up a test table with 5M rows like so:

CREATE TABLE atable (
   akey   int
);
INSERT INTO atable (akey)
  SELECT (RANDOM() * 10)::int FROM generate_series(1,5000000);
CREATE INDEX akey_idx ON atable(akey);
ANALYZE atable;

And then I tested table creation times. First, using a normal:

BEGIN;
   SET enable_seqscan = on;
   SET enable_indexscan = on;
   EXPLAIN ANALYZE CREATE TABLE idxscanned AS SELECT * FROM atable
ORDER BY akey;
ROLLBACK;

and I get:
   Index Scan using akey_idx on atable
  (cost=0.00..218347.89 rows=5000000 width=4)
  (actual time=0.058..23612.020 rows=5000000 loops=1)
  Total runtime: 33029.884 ms

Then, I tried forcing a sequential scan by changing "SET
enable_indexscan = off;", and it's significantly faster, as I would
expect:

 Sort  (cost=696823.42..709323.42 rows=5000000 width=4)
   (actual time=8664.699..13533.131 rows=5000000 loops=1)
    Sort Key: akey
    Sort Method:  external merge  Disk: 68304kB
    ->  Seq Scan on atable  (cost=0.00..72124.00 rows=5000000 width=4)
   (actual time=0.012..838.092 rows=5000000 loops=1)
 Total runtime: 21015.501 ms

I've run both of these several times, and get 30-32 seconds for the
index scan and 20-21 seconds for the seq. scan each time.

My seq_page_cost and random_page_cost were left at the defaults for
these tests. Oddly, I tried turning seq_page_cost down to 0.01 and
EXPLAIN ANALYZE told me that an index scan was still being chosen. Is
there maybe some other setting I'm forgetting?

Josh



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-30 Thread Simon Riggs
On Wed, 2010-09-29 at 08:44 -0400, Alvaro Herrera wrote:

  So, I think twice disk space of the sum of table and indexes would be
  the simplest explanation for safe margin.
 
 Agreed.

Surely the peak space is x3?

Old space + sort space + new space.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-30 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Wed, 2010-09-29 at 08:44 -0400, Alvaro Herrera wrote:
 So, I think twice disk space of the sum of table and indexes would be
 the simplest explanation for safe margin.
 
 Agreed.

 Surely the peak space is x3?
 Old space + sort space + new space.

The wording should be something like "CLUSTER requires transient disk
space equal to about twice the size of the table plus its indexes".

regards, tom lane



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-30 Thread Itagaki Takahiro
Hi, Leonardo-san,

On Fri, Oct 1, 2010 at 4:04 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 The wording should be something like "CLUSTER requires transient disk
 space equal to about twice the size of the table plus its indexes".

Could you merge those discussions into the final patch?
Also, please check whether my modification broke your patch.
Thank you.

-- 
Itagaki Takahiro



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-30 Thread Robert Haas
On Sep 30, 2010, at 9:07 PM, Itagaki Takahiro itagaki.takah...@gmail.com 
wrote:
 Hi, Leonardo-san,
 
 On Fri, Oct 1, 2010 at 4:04 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 The wording should be something like "CLUSTER requires transient disk
 space equal to about twice the size of the table plus its indexes".
 
 Could you merge those discussions into the final patch?
 Also, please check whether my modification broke your patch.
 Thank you.

It sounds like the costing model might need a bit more work before we commit 
this.

...Robert


Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-29 Thread Leonardo Francalanci
  10% is nothing.  I was expecting this  patch would give an order of
  magnitude of improvement or something like that in the worst cases of
  the current code (highly unsorted  input)
 
 Yes. It should be x10 faster than ordinary method in the worst  cases.


Here's my post with a (very simple) performance test:

http://archives.postgresql.org/pgsql-hackers/2010-02/msg00766.php


The test I used wasn't a worst case scenario, since it is based on
random data, not wrong-ordered data. Obviously, the real difference
can be seen on large tables (5M+ rows), and/or slow disks.


Leonardo






Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-29 Thread Alvaro Herrera
Excerpts from Leonardo Francalanci's message of mié sep 29 03:17:07 -0400 2010:

 Here's my post with a (very simple) performance test:
 
 http://archives.postgresql.org/pgsql-hackers/2010-02/msg00766.php

I think the 10M rows test is more in line with what we want (83s vs. 646).

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-29 Thread Alvaro Herrera
Excerpts from Itagaki Takahiro's message of mié sep 29 01:25:38 -0400 2010:

 To be exact, It's very complex.
 During reconstructing tables, it requires about twice disk space of
 the old table (for sort tapes and the new table).
 After sorting the table, CLUSTER performs REINDEX. We need
 {same size of the new table} + {twice disk space of the new indexes}.
 Also, new relations will be the same sizes of old relations if they
 have no free spaces.

Aren't the old table and indexes kept until transaction commit, though?
So during the reindex you still keep the old copy of the table and
indexes around, thus double space.  The only space difference would be
extra free space in the old table, tuples older than OldestXmin, and
fillfactor changes.

 So, I think twice disk space of the sum of table and indexes would be
 the simplest explanation for safe margin.

Agreed.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-29 Thread Robert Haas
On Wed, Sep 29, 2010 at 1:12 AM, Itagaki Takahiro
itagaki.takah...@gmail.com wrote:
 On Wed, Sep 29, 2010 at 1:27 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 I see a consistent
 ~10% advantage for the sequential scan clusters.

 10% is nothing.  I was expecting this patch would give an order of
 magnitude of improvement or something like that in the worst cases of
 the current code (highly unsorted input)

 Yes. It should be x10 faster than ordinary method in the worst cases.

 I have a performance result of pg_reorg, that performs as same as
 the patch. It shows 16 times faster than the old CLUSTER. In addition,
 it was slow if not fragmented. (So, it should not be consistent.)
 http://reorg.projects.postgresql.org/

Can you reproduce that with this patch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-29 Thread Itagaki Takahiro
On Wed, Sep 29, 2010 at 10:14 PM, Robert Haas robertmh...@gmail.com wrote:
 http://reorg.projects.postgresql.org/

 Can you reproduce that with this patch?

No, I can't use the machine anymore.

-- 
Itagaki Takahiro



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-29 Thread Leonardo Francalanci
  Here's my post with a (very simple) performance test:
  http://archives.postgresql.org/pgsql-hackers/2010-02/msg00766.php
 I  think the 10M rows test is more in line with what we want (83s vs.  646).


Can someone else test the patch to see if what I found is still valid?
I don't think it makes much sense if I'm the only one that says
this is faster :)






Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-29 Thread Josh Kupershmidt
On Wed, Sep 29, 2010 at 11:55 AM, Leonardo Francalanci m_li...@yahoo.it wrote:
 Can someone else test the patch to see if what I found is still valid?
 I don't think it makes much sense if I'm the only one that says
 this is faster :)

I ran a few more performance tests on this patch. Here's what I got
for the tests Leonardo posted originally:

   * 2M rows:  22 seconds for seq. scan, 24 seconds for index scan
   * 5M rows:  139 seconds for seq. scan, 97 seconds for index scan
   * 10M rows: 256 seconds seq. scan, 611 seconds for index scan

(times are for the cluster operation only, not for the table
creations, etc. which took most of the time)

I tried a few more tests of creating a table with either 10M or 50M
rows, then deleting 90% of the rows and running a cluster. The patch
didn't fare so well here:

 * 10M rows: 84 seconds for seq. scan, 44 seconds for index scan

The seq. scan results here were obtained with the patch applied, and
without using planner hints (enable_seqscan or enable_indexscan). I
added in an ereport() call to check that use_sort was actually true.
The index scan results were obtained without the patch applied. The
SQL file I used is attached.

So I think there are definitely cases where this patch helps, but it
looks like a seq. scan is being chosen in some cases where it doesn't
help.

Test machine: MacBook Pro laptop, C2D 2.53 GHz, 4GB RAM.
Settings: shared_buffers = 16MB, work_mem and maintenance_work_mem set
from the SQL scripts.

Josh
\timing on

BEGIN;

set work_mem='100MB';
set maintenance_work_mem='100MB';

CREATE TABLE mybloat (
  myid   int,
  name   text
);

INSERT INTO mybloat (myid, name)
   SELECT a, a::text FROM generate_series(1,10000000) AS a;

ALTER TABLE mybloat ADD CONSTRAINT myid_pkey PRIMARY KEY (myid);
DELETE FROM mybloat WHERE RANDOM() < 0.9;

ANALYZE mybloat;

select attname, correlation from pg_stats where tablename='mybloat';

cluster verbose mybloat using myid_pkey;

drop table mybloat;

COMMIT;



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-28 Thread Josh Kupershmidt
On Mon, Sep 27, 2010 at 10:05 PM, Itagaki Takahiro
itagaki.takah...@gmail.com wrote:
 I re-ordered some description in the doc. Does it look better?
 Comments and suggestions welcome.

I thought this paragraph was a little confusing:

! In the second case, a full table scan is followed by a sort operation.
! The method is faster than the first one when the table is highly fragmented.
! You need twice disk space of the sum in the case. In addition to the free
! space needed by the previous case, this approach may also need a temporary
! disk sort file which can be as big as the original table.

I think the worst-case disk space could be made a little more clear
here, and maybe some general wordsmithing as well. I wasn't sure what
"twice disk space of the sum" was in this description -- sum of what
(table and all indexes?). And does "twice disk space" include the
temporary disk sort file? Here's an idea of how I think this paragraph
could be cleaned up a bit, if my understanding of the disk space
required is about right:

! In the second case, a full table scan is followed by a sort operation.
! This method is faster than when the table is highly fragmented.
! However, <command>CLUSTER</command> may require available disk space of
! up to twice the sum of the size of the table and its indexes, if it uses
! a temporary disk sort file, which can be as big as the original table.

Also, AIUI, this second clustering method is similar to the older
idiom of CREATE TABLE new AS SELECT * FROM old ORDER BY col; Since the
paragraph describing this older idiom is being removed, perhaps a
brief mention in the documentation could be made of this similarity.

Some more wordsmithing: change
!  The planner tries to choose a faster method in them base on the information
to:
!  The planner tries to choose the fastest method based on the information

I started looking at the performance impact of this patch based on
Leonardo's SQL file. On the 2 million row table, I see a consistent
~10% advantage for the sequential scan clusters. I'm going to try to
run the bigger tests a few times and post results from there when I
get a chance.

Josh



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-28 Thread Alvaro Herrera
Excerpts from Josh Kupershmidt's message of mar sep 28 23:53:33 -0400 2010:

 I started looking at the performance impact of this patch based on
 Leonardo's SQL file. On the 2 million row table, I see a consistent
 ~10% advantage for the sequential scan clusters. I'm going to try to
 run the bigger tests a few times and post results from there when I
 get a chance.

10% is nothing.  I was expecting this patch would give an order of
magnitude of improvement or something like that in the worst cases of
the current code (highly unsorted input)

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-28 Thread Itagaki Takahiro
On Wed, Sep 29, 2010 at 1:27 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 I see a consistent
 ~10% advantage for the sequential scan clusters.

 10% is nothing.  I was expecting this patch would give an order of
 magnitude of improvement or something like that in the worst cases of
 the current code (highly unsorted input)

Yes. It should be x10 faster than ordinary method in the worst cases.

I have a performance result of pg_reorg, which performs the same as
the patch. It shows 16 times faster than the old CLUSTER. In addition,
it was slow if not fragmented. (So, it should not be consistent.)
http://reorg.projects.postgresql.org/

-- 
Itagaki Takahiro



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-28 Thread Itagaki Takahiro
On Wed, Sep 29, 2010 at 12:53 PM, Josh Kupershmidt schmi...@gmail.com wrote:
 I thought this paragraph was a little confusing:

Thanks for checking.

 !     In the second case, a full table scan is followed by a sort operation.
 !     The method is faster than the first one when the table is highly fragmented.
 !     You need twice disk space of the sum in the case. In addition to the free
 !     space needed by the previous case, this approach may also need a temporary
 !     disk sort file which can be as big as the original table.

 I think the worst-case disk space could be made a little more clear
 here, and maybe some general wordsmithing as well. I wasn't sure what
 "twice disk space of the sum" was in this description -- sum of what
 (table and all indexes?).

To be exact, it's very complex.
While reconstructing the table, it requires about twice the disk space of
the old table (for the sort tapes and the new table).
After sorting the table, CLUSTER performs REINDEX. We need
{same size as the new table} + {twice the disk space of the new indexes}.
Also, the new relations will be the same sizes as the old relations if they
have no free space.

So, I think "twice the disk space of the sum of the table and indexes" would be
the simplest explanation, as a safe margin.
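
(A rough worked example with invented sizes: for a 10 GB table with 2 GB of
indexes, the old heap and old indexes stay on disk until commit; on top of
that, the sort tapes can take up to ~10 GB, the new heap another ~10 GB, and
rebuilding the indexes a further ~2 GB plus their own sort space -- so the
transient extra space is on the order of 22 GB, and "twice the size of the
table plus its indexes", 24 GB here, is the simple safe figure to quote.)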

 Also, AIUI, this second clustering method is similar to the older
 idiom of CREATE TABLE new AS SELECT * FROM old ORDER BY col; Since the
 paragraph describing this older idiom is being removed, perhaps a
 brief mention in the documentation could be made of this similarity.

Good idea.

  Some more wordsmithing: change
  !      The planner tries to choose a faster method in them base on the information
  to:
  !      The planner tries to choose the fastest method based on the information

Thanks.

-- 
Itagaki Takahiro



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-27 Thread Itagaki Takahiro
On Tue, Aug 31, 2010 at 8:04 PM, Itagaki Takahiro
itagaki.takah...@gmail.com wrote:
 I think the patch is almost ready to commit, but still
 have some comments for the usability and documentations.
 I hope native English speakers would help improving docs.

I'm checking the latest patch for applying.
I found we actually use maintenance_work_mem for the sort in the seqscan+sort
case, but the cost was estimated based on work_mem in the patch. I added an
internal cost_sort_with_mem() to costsize.c.

 * Documentation could be a bit more simplified, like
   "CLUSTER requires twice the disk space of your original table".
   The added description seems too difficult for standard users.

I re-ordered some description in the doc. Does it look better?
Comments and suggestions welcome.

-- 
Itagaki Takahiro


sorted_cluster-20100928.patch
Description: Binary data



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-01 Thread Itagaki Takahiro
On Tue, Aug 31, 2010 at 8:04 PM, Itagaki Takahiro
itagaki.takah...@gmail.com wrote:
 * How to determine which method was used?
  We can get some information from trace_sort logs,
  but CLUSTER VERBOSE would be better to log
  which CLUSTER method was used.

I wrote a patch to improve CLUSTER VERBOSE (and VACUUM FULL VERBOSE).
The patch should be applied after sorted_cluster-20100721.patch .

* clustering "schema.table" by index scan on "index"
* clustering "schema.table" by sequential scan and sort

It also adds VACUUM VERBOSE-like logs:
INFO:  table: found 1 removable, 11 nonremovable row versions in
1655 pages
DETAIL:  1 dead row versions cannot be removed yet.
CPU 0.03s/0.06u sec elapsed 0.21 sec.

Note that the patch says nothing about reindexing. But if
required, I'm willing to add some VERBOSE messages for
indexes (ex. REINDEX VERBOSE)

-- 
Itagaki Takahiro


cluster_verbose-20100901.patch
Description: Binary data



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-08-31 Thread Itagaki Takahiro
Sorry for the very delayed review.

On Wed, Jul 21, 2010 at 11:15 PM, Leonardo Francalanci m_li...@yahoo.it wrote:
 2) what other areas can I comment more?

I think the patch is almost ready to commit, but still
have some comments for the usability and documentations.
I hope native English speakers would help improving docs.

* Documentation could be a bit more simplified, like
  "CLUSTER requires twice the disk space of your original table".
  The added description seems too difficult for standard users.

* How to determine which method was used?
  We can get some information from trace_sort logs,
  but CLUSTER VERBOSE would be better to log
  which CLUSTER method was used.

* How to control which method will be used?
  It might be good to explain we can control the method
  with enable_seqscan/indexscan.

-- 
Itagaki Takahiro



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-07-21 Thread Leonardo Francalanci
 I think writetup_rawheap() and readtup_rawheap() are a little complex,
 but should work as long as there is no padding between t_len and t_self
 in the HeapTupleData struct.

 - It might be cleaner if you write the total item length
   and tuple data separately.
 - "(char *) tuple + sizeof(tuplen)" might be more robust
   than "&tuple->t_self".


- I used your functions 
- changed the docs for CLUSTER (I don't know if they make sense/are enough)
- added a minor comment


2 questions:
 
1) about the "copy & paste from FormIndexDatum" comment: how can I improve it?
The idea is that we could have a faster call, but it would mean copying and
pasting a lot of code from FormIndexDatum.

2) what other areas can I comment more?


  

sorted_cluster-20100721.patch
Description: Binary data



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-07-07 Thread Takahiro Itagaki

Leonardo F m_li...@yahoo.it wrote:

 I saw that you also changed the writing:
(snip)
 Are we sure it's 100% equivalent?

I think writetup_rawheap() and readtup_rawheap() are a little complex,
but should work as long as there is no padding between t_len and t_self
in the HeapTupleData struct.

- It might be cleaner if you write the total item length
  and tuple data separately.
- "(char *) tuple + sizeof(tuplen)" might be more robust
  than "&tuple->t_self".

Here is a sample code. writetup() and readtup() will be alike.

BTW, we could have LogicalTapeReadExact() as an alias of
LogicalTapeRead() that checks the result, because we have
many duplicated codes for "unexpected end of data" errors.


static void
writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
{
    HeapTuple   tuple = (HeapTuple) stup->tuple;
    int         tuplen = tuple->t_len + HEAPTUPLESIZE;

    LogicalTapeWrite(state->tapeset, tapenum,
                     &tuplen, sizeof(tuplen));
    LogicalTapeWrite(state->tapeset, tapenum,
                     (char *) tuple + sizeof(tuplen),
                     HEAPTUPLESIZE - sizeof(tuplen));
    LogicalTapeWrite(state->tapeset, tapenum, tuple->t_data, tuple->t_len);
    if (state->randomAccess)    /* need trailing length word? */
        LogicalTapeWrite(state->tapeset, tapenum, &tuplen,
                         sizeof(tuplen));

    FREEMEM(state, GetMemoryChunkSpace(tuple));
    heap_freetuple(tuple);
}

static void
readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
                int tapenum, unsigned int tuplen)
{
    HeapTuple   tuple = (HeapTuple) palloc(tuplen);

    USEMEM(state, GetMemoryChunkSpace(tuple));

    tuple->t_len = tuplen - HEAPTUPLESIZE;
    if (LogicalTapeRead(state->tapeset, tapenum,
                        (char *) tuple + sizeof(tuplen),
                        HEAPTUPLESIZE - sizeof(tuplen)) != HEAPTUPLESIZE - sizeof(tuplen))
        elog(ERROR, "unexpected end of data");
    tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
    if (LogicalTapeRead(state->tapeset, tapenum,
                        tuple->t_data, tuple->t_len) != tuple->t_len)
        elog(ERROR, "unexpected end of data");
    if (state->randomAccess)    /* need trailing length word? */
        if (LogicalTapeRead(state->tapeset, tapenum, &tuplen,
                            sizeof(tuplen)) != sizeof(tuplen))
            elog(ERROR, "unexpected end of data");


Regards,
---
Takahiro Itagaki
NTT Open Source Software Center





Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-07-07 Thread Alvaro Herrera
Excerpts from Takahiro Itagaki's message of mié jul 07 04:39:38 -0400 2010:

 BTW, we could have LogicalTapeReadExact() as an alias of
 LogicalTapeRead() that checks the result, because we have
 many duplicated codes for "unexpected end of data" errors.

I'd just add a boolean exact required to the existing functions.
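
Either way, a minimal sketch of the helper being discussed -- assuming
LogicalTapeRead(LogicalTapeSet *, int tapenum, void *ptr, size_t size) returns
the number of bytes actually read, and not claiming this is what will get
committed -- could be as simple as:

static void
LogicalTapeReadExact(LogicalTapeSet *tapeset, int tapenum,
                     void *ptr, size_t len)
{
    /* read exactly len bytes or error out, replacing the repeated checks */
    if (LogicalTapeRead(tapeset, tapenum, ptr, len) != len)
        elog(ERROR, "unexpected end of data");
}

so that each caller in the readtup_* functions shrinks to a single line such as
LogicalTapeReadExact(state->tapeset, tapenum, &tuplen, sizeof(tuplen)).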



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-07-06 Thread Takahiro Itagaki

Leonardo F m_li...@yahoo.it wrote:

 Attached the updated patch (should solve a bug) and a script.

I reviewed your patch. It seems to be in good shape, and worked as
expected. I suppressed a compiler warning in the patch and cleaned up
whitespaces in it. Patch attached.

I think we need some documentation for the change. The only downside
of the feature is that sorted cluster requires twice the disk space of
the target table (temp space for the disk sort and the result table).
Could I ask you to write documentation about the new behavior?
Also, code comments can be improved; especially we need a better
description than "copy & paste from FormIndexDatum".

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center



sorted_cluster-20100706.patch
Description: Binary data



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-07-06 Thread Leonardo F
 I reviewed your patch. It seems to be in good shape, and worked as
 expected. I suppressed a compiler warning in the patch and cleaned up
 whitespaces in it. Patch attached.


Thanks for the review!

I saw that you also changed the writing:

 LogicalTapeWrite(state->tapeset, tapenum,
                  tuple, HEAPTUPLESIZE);
 LogicalTapeWrite(state->tapeset, tapenum, tuple->t_data,
                  tuple->t_len - HEAPTUPLESIZE);


and the reading:

 tuple->t_len = tuplen - HEAPTUPLESIZE;
 if (LogicalTapeRead(state->tapeset, tapenum, &tuple->t_self,
                     tuplen - sizeof(tuplen)) != tuplen - sizeof(tuplen))
     elog(ERROR, "unexpected end of data");



into (writing):

 LogicalTapeWrite(state->tapeset, tapenum,
                  tuple, tuple->t_len);


(reading):

if (LogicalTapeRead(state->tapeset, tapenum, &tuple->t_self,
                    tuplen - sizeof(tuplen)) != tuplen - sizeof(tuplen))



Are we sure it's 100% equivalent?

I remember I had issues with the fact that tuple->t_data wasn't
at HEAPTUPLESIZE distance from tuple:

http://osdir.com/ml/pgsql-hackers/2010-02/msg00744.html


 I think we need some documentation for the change. The only downside
 of the feature is that sorted cluster requires twice the disk space of
 the target table (temp space for the disk sort and the result table).
 Could I ask you to write documentation about the new behavior?
 Also, code comments can be improved; especially we need a better
 description than "copy & paste from FormIndexDatum".


I'll try to improve the comments and add doc changes (but my English
will have to be double checked...)







Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-02-10 Thread Leonardo F
Hi all,


while testing the seq scan + sort CLUSTER implementation, I've found a
bug in write/readtup_rawheap.

The functions are needed by the sort part.
The code in 

http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php

didn't copy the whole tuple, but only the HeapTuple header: the t_data
part wasn't copied (at least, that's my understanding). The original code
is at the bottom of the email. It didn't work (the table wasn't fully clustered
at the end of the CLUSTER command).

So I modified write/readtup_rawheap: they are supposed to store/retrieve
both the fixed part of HeapTupleData, plus the variable part t_data.
But I get a lot of:
WARNING:  problem in alloc set TupleSort: detected write past chunk end
  in block 0x96853f0, chunk 0x968723c
WARNING:  problem in alloc set TupleSort: detected write past chunk end 
  in block 0x96853f0, chunk 0x96872c8
warnings when calling tuplesort_end  and some of the data gets garbled
after the CLUSTER command.


What am I doing wrong? Using the debugger, data coming out of
readtup_rawheap seems fine...

I *really* need your help here...


static void
writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
{
    HeapTuple   tuple = (HeapTuple) stup->tuple;
    tuple->t_len += sizeof(HeapTupleData); /* write out the header as well */

    LogicalTapeWrite(state->tapeset, tapenum,
                     tuple, sizeof(HeapTupleData));
    LogicalTapeWrite(state->tapeset, tapenum, tuple->t_data,
                     tuple->t_len - sizeof(HeapTupleData));
    if (state->randomAccess)    /* need trailing length word? */
        LogicalTapeWrite(state->tapeset, tapenum,
                         tuple, sizeof(tuple->t_len));

    FREEMEM(state, GetMemoryChunkSpace(tuple));
    heap_freetuple(tuple);
}

static void
readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
                int tapenum, unsigned int tuplen)
{
    HeapTuple   tuple = (HeapTuple) palloc(sizeof(HeapTupleData));
    tuple->t_data = (HeapTupleHeader) palloc(tuplen - sizeof(HeapTupleData));

    USEMEM(state, GetMemoryChunkSpace(tuple));

    tuple->t_len = tuplen - sizeof(HeapTupleData);
    if (LogicalTapeRead(state->tapeset, tapenum, &tuple->t_self,
                        sizeof(HeapTupleData) - sizeof(tuplen))
            != sizeof(HeapTupleData) - sizeof(tuplen))
        elog(ERROR, "unexpected end of data");
    if (LogicalTapeRead(state->tapeset, tapenum, tuple->t_data,
                        tuple->t_len) != tuple->t_len)
        elog(ERROR, "unexpected end of data");
    if (state->randomAccess)    /* need trailing length word? */
        if (LogicalTapeRead(state->tapeset, tapenum, &tuplen,
                            sizeof(tuplen)) != sizeof(tuplen))
            elog(ERROR, "unexpected end of data");

    stup->tuple = tuple;
    /* set up first-column key value */
    if (state->indexInfo->ii_Expressions == NULL)
    {
        /* no expression index, just get the key datum value */
        stup->datum1 = heap_getattr((HeapTuple) stup->tuple,
                                    state->indexInfo->ii_KeyAttrNumbers[0],
                                    state->tupDesc,
                                    &stup->isnull1);
    }
    else
    {
        [...] expression index part, removed for clarity
    }
}




Original code:

static void
writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
{
    HeapTuple   tuple = (HeapTuple) stup->tuple;
    tuple->t_len += HEAPTUPLESIZE; /* write out the header as well */

    LogicalTapeWrite(state->tapeset, tapenum,
                     tuple, tuple->t_len);

    if (state->randomAccess)    /* need trailing length word? */
        LogicalTapeWrite(state->tapeset, tapenum,
                         tuple, sizeof(tuple->t_len));

    FREEMEM(state, GetMemoryChunkSpace(tuple));
    heap_freetuple(tuple);
}

static void
readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
                int tapenum, unsigned int tuplen)
{
    HeapTuple   tuple = (HeapTuple) palloc(tuplen);

    USEMEM(state, GetMemoryChunkSpace(tuple));

    tuple->t_len = tuplen - HEAPTUPLESIZE;
    if (LogicalTapeRead(state->tapeset, tapenum, &tuple->t_self,
                        tuplen - sizeof(tuplen)) != tuplen - sizeof(tuplen))
        elog(ERROR, "unexpected end of data");
    if (state->randomAccess)    /* need trailing length word? */
        if (LogicalTapeRead(state->tapeset, tapenum, &tuplen,
                            sizeof(tuplen)) != sizeof(tuplen))
            elog(ERROR, "unexpected end of data");

    stup->tuple = tuple;
    /* set up first-column key value */
    stup->datum1 = heap_getattr(tuple,
                                state->scanKeys[0].sk_attno,
                                state->tupDesc,
                                &stup->isnull1);
}







Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-02-10 Thread Heikki Linnakangas
Leonardo F wrote:
 static void
 writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
 {
 HeapTuple   tuple = (HeapTuple) stup->tuple;

I think you're confusing HeapTuple and HeapTupleHeader. The SortTuple->tuple
field should point to a HeapTupleHeader, not a HeapTuple.

The right mental model is that HeapTupleHeader is a physical tuple, one
that you store on disk for example. A HeapTuple is an in-memory wrapper,
or pointer if you will, to a HeapTupleHeader, and holds some additional
information on the tuple that's useful when operating on it.

To add to the confusion, MinimalTuple is a shortened counterpart of
HeapTuple*Header*, not HeapTuple. And SortTuple is an in-memory wrapper
similar to HeapTuple, containing additional information on the tuple
that helps with sorting.

I didn't look at the rest of the code in detail, but I think that's
where your problems are stemming from.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-02-10 Thread Leonardo F
 I think you're confusing HeapTuple and HeapTupleHeader. The SortTuple->tuple
 field should point to a HeapTupleHeader, not a HeapTuple.


Mmh, I don't get it: that would mean I might be using more info than required,
but I still don't understand why it's not working... in the end, CLUSTER is
going to need full HeapTuple(s) (if I'm not mistaken).
SortTuple->tuple can be any of MinimalTuple, IndexTuple, even a Datum.
So I added my own read/write_rawtuple (I'm not that good, they were in the
original patch and I copied & pasted them).

I can write and read those tuples (using some Asserts I think everything goes
as expected in write/readtup_rawheap). But when calling tuplesort_getrawtuple,
after some correct tuples, the call to tuplesort_gettuple_common fails because
the lines:

/* pull next preread tuple from list, insert in heap */
newtup = state->memtuples[tupIndex];

give a wrongly initialized newtup (in other words, newtup is random memory).
Since write/readtup_rawheap seems fine, I can't understand what's going on...

I write the HeapTuples with:

(in writetup_rawheap):
HeapTuple   tuple = (HeapTuple) stup->tuple;
tuple->t_len += HEAPTUPLESIZE; /* write out the header as well */

LogicalTapeWrite(state->tapeset, tapenum,
                 tuple, HEAPTUPLESIZE);
LogicalTapeWrite(state->tapeset, tapenum, tuple->t_data,
                 tuple->t_len - HEAPTUPLESIZE);



then I read them with:
(in readtup_rawheap):
HeapTuple   tuple = (HeapTuple) palloc(tuplen);
USEMEM(state, GetMemoryChunkSpace(tuple));

tuple->t_len = tuplen - HEAPTUPLESIZE;
if (LogicalTapeRead(state->tapeset, tapenum, &tuple->t_self,
                    HEAPTUPLESIZE - sizeof(tuplen)) != HEAPTUPLESIZE - sizeof(tuplen))
    elog(ERROR, "unexpected end of data");
if (LogicalTapeRead(state->tapeset, tapenum, tuple->t_data,
                    tuple->t_len) != tuple->t_len)
    elog(ERROR, "unexpected end of data");






Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-02-10 Thread Leonardo F
I think I've found the problem:

tuple->t_data wasn't at HEAPTUPLESIZE distance from tuple.
I guess the code makes that assumption somewhere, so I had
to do:

tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);

Now that test works! Hope I don't find any more problems...
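
For reference, the layout being relied on here is the usual single-palloc
convention that heap_copytuple() uses: header and body in one chunk, with
t_data pointing HEAPTUPLESIZE bytes past the start of the chunk. A sketch,
with a hypothetical helper name:

HeapTuple
alloc_contiguous_tuple(uint32 datalen)
{
    /* one chunk: HeapTupleData header immediately followed by the tuple body */
    HeapTuple   tuple = (HeapTuple) palloc(HEAPTUPLESIZE + datalen);

    tuple->t_len = datalen;
    tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
    return tuple;
}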



Leonardo







Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-02-10 Thread Leonardo F
Perhaps you could supply a .sql file containing a testcase 
 illustrating the performance benefits you tested with your patch

Sure.


Attached the updated patch (should solve a bug) and a script.
The sql script generates a 2M rows table (orig); then the
table is copied and the copy clustered using seq + sort (since 
set enable_seqscan=false;).
Then the table orig is copied again, and the copy clustered
using regular index scan (set enable_indexscan=true; set 
enable_seqscan=false).
Then the same thing is done on a 5M rows table, and on a 10M
rows table.

On my system (Sol10 on a dual Opteron 2.8) single disc:


2M:  seq+sort 11secs; regular index scan: 33secs
5M:  seq+sort 39secs; regular index scan: 105secs
10M: seq+sort 83secs; regular index scan: 646secs


Maybe someone could suggest a better/different test?


Leonardo



  

sorted_cluster20100210.patch
Description: Binary data


cluster_tests.sql
Description: Binary data



I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-02-09 Thread Leonardo F
Not even a comment? As I said, performance results on my system
were very good

 I know you're all very busy getting 9.0 out, but I think the results in
 heap scanning + sort instead of index scanning for CLUSTER are
 very good... I would like to know if I did something wrong/I should
 improve something in the patch... I haven't tested it with index
 expressions yet (but the tests in make check work).
 
 Thanks
 
 Leonardo
 
 
  Hi all,
  
  attached a patch to do seq scan + sorting instead of index scan 
  
  on CLUSTER (when that's supposed to be faster).
  
  As I've already said, the patch is based on: 
  http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php
  
  Of course, the code isn't supposed to be ready to be merged: I
  would like to write more comments and add some test cases to
  cluster.sql (plus change all the things you are going to tell me I
  have to change...)
  
  I would like your opinions on code correctness and the decisions
  I took, especially:
  
  1) function names (cost_index_scan_vs_seqscansort I guess
  it's awful...)
  
  2) the fact that I put in Tuplesortstate an EState variable, so that 
  MakeSingleTupleTableSlot wouldn't have to be called for every
  row in the expression indexes case
  
  3) the expression index case is not optimized: I preferred to
  call FormIndexDatum once for the first key value in 
  copytup_rawheap and another time to get all the remaining values
  in comparetup_rawheap. I liked the idea of re-using
  FormIndexDatum in that case, instead of copying & pasting only
  the relevant code: but FormIndexDatum returns all the values,
  
  even when I might need only the first one
  
  
  4) I refactored the code to deform and rewrite tuple into the function
  deform_and_rewrite_tuple, because now that part can be called
  by the regular index scan or by the new seq-scan + sort (but I
  could copy & paste those lines instead of refactoring them into a new
  function)
  
  Suggestions and comments are not just welcome, but needed!



  

sorted_cluster.patch
Description: Binary data



Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-02-09 Thread Josh Kupershmidt
On Tue, Feb 9, 2010 at 5:49 AM, Leonardo F m_li...@yahoo.it wrote:

 Not even a comment? As I said, performance results on my system
 were very good


Hi Leonardo,
Perhaps you could supply a .sql file containing a testcase illustrating the
performance benefits you tested with your patch -- as I understand, the sole
purpose of this patch is to speed up CLUSTER -- along with your results? The
existing src/test/regress/sql/cluster.sql looks like it only has some basic
sanity checks in it, and not performance tests. An idea of what hardware
you're testing on and any tweaks to postgresql.conf might be useful as well.

I was able to apply your patch cleanly and run CLUSTER with it, and make
check passed for me on OS X as well.

Hope this helps, and sorry I'm not able to offer more technical advice on
your patch.
Josh