Re: [HACKERS] Core team statement on replication in PostgreSQL

2008-06-04 Thread Markus Schiltknecht

Hello Andrew,

Andrew Sullivan wrote:

Yes.  And silent as ever. :-)


Are the slides of your PgCon talk available for download somewhere?

BTW: up until recently, there was yet another mailing list: 
[EMAIL PROTECTED] It was less focused on hooks 
and got at least some traffic. :-) Are those mails still archived somewhere?


Regards

Markus






Re: [HACKERS] Declarative partitioning grammar

2008-01-15 Thread Markus Schiltknecht

Hi Jeff,

Jeff Cohen wrote:
If you don't define a default partition to handle outliers,  the 
insert should fail with an error.


IMO, you should always have a default partition, then, so as not to 
violate the constraints (by rejecting tuples which are correct according 
to the constraints).


With the generic approach, you start with a single table, and start by 
splitting it into two six-month partitions:


ALTER TABLE sales
  SPLIT where sales_date < date '2007-06-01'
   INTO
(
 PARTITION first_half
 PARTITION second_half
 );

We could implement this approach using check constraints and table 
inheritance: the partition second_half is a child table where sales_date 
>= date '2007-06-01', and the partition first_half has the complementary 
constraint NOT(sales_date >= date '2007-06-01').
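
Spelled out in today's DDL, that mapping might look roughly like this (the
column list is made up for illustration; the mechanism itself - inheritance
plus CHECK constraints plus constraint exclusion - is what already exists):

CREATE TABLE sales (
    sales_id   integer,
    sales_date date
);

CREATE TABLE sales_first_half (
    CHECK (sales_date < date '2007-06-01')
) INHERITS (sales);

CREATE TABLE sales_second_half (
    CHECK (sales_date >= date '2007-06-01')
) INHERITS (sales);

-- let the planner skip children whose CHECK contradicts the query
SET constraint_exclusion = on;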


Next, you split each partition:

ALTER TABLE sales
  SPLIT PARTITION first_half where sales_date < date '2007-03-01'
   INTO
(
 PARTITION first_quarter
 PARTITION second_quarter
 );

So now the child table for first_half itself has two children.  As you 
continue this process you construct a binary tree of table inheritance 
using 12 ALTER statements.


<nitpicking>There are just 11 splits between 12 months, otherwise 
correct, yes.</nitpicking>


In the long grammar you can create and partition the table in one 
statement:


CREATE TABLE sales
...
PARTITION BY sales_date
(
start (date '2007-01-01') end (date '2008-01-01')
every (interval '1 month')
);


To be fair, you should add the 12 partition names here as well.
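
Just to illustrate, with the names spelled out the same definition might end
up looking something like this (purely hypothetical syntax, extrapolated from
the proposed grammar):

CREATE TABLE sales
...
PARTITION BY sales_date
(
  PARTITION sales_2007_jan start (date '2007-01-01') end (date '2007-02-01'),
  PARTITION sales_2007_feb start (date '2007-02-01') end (date '2007-03-01'),
  -- ...ten more of those...
  PARTITION sales_2007_dec start (date '2007-12-01') end (date '2008-01-01')
);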

I can certainly see merit in letting the database system handle the 
binary tree.


Thanks for your feedback.  Partitioning the table using a series of splits 
is a clever solution for situations where the partitioning operation 
cannot be described using simple equality (like list, hash) or ordered 
comparison (range).  But for many common business cases, the long 
grammar is easier to specify.


Easier to specify initially, maybe, yes. But how about managing it 
afterwards? Having seen all the different options for merging, 
splitting, exchanging, coalescing and adding, all of them with subtle 
differences for hash, range and list partitioning - let alone 
sub-partitioning - the proposed grammar doesn't look particularly 
easy to me.


Let's at least drop the differences between list, hash and range 
partitioning; those are pretty much unneeded, IMO.


Regards

Markus




Re: [HACKERS] Declarative partitioning grammar

2008-01-15 Thread Markus Schiltknecht

Hi,

Hannu Krosing wrote:

I guess it would go to some default partition ?


Which doesn't have a name so far, which prevents you from addressing that 
partition. Nor is it particularly well defined; it's just the rest.



sure, but this can become really tedious for 1024 partitions,


Well, managing 1024 partitions manually is a tedious job, no matter what 
grammar you take: You'll have to deal with 1024 different partition names.


What do you need so many partitions for?


not to
mention hard for optimiser.


It's pretty much the same for the optimizer: a binary tree.  Granted, 
the RDBMS had better keep that binary tree balanced.


Regards

Markus



Re: [HACKERS] Declarative partitioning grammar

2008-01-15 Thread Markus Schiltknecht

Hi,

Hans-Juergen Schoenig wrote:

What do you need so many partitions for?


having so many tables is not funny but it can be the only reasonable choice.


Well, what do you do with all those partitions? Most of them will end up 
on the same storage subsystem. So, if you don't partition to spread your 
data across storage with different characteristics, why do you need 
partitioning at all? Isn't an index better in most cases?


Or are you using it as a form of CLUSTERing? Where you expect to reduce 
time for sequential scans over a range? Simon's Segment Exclusion 
proposal looks like a much better fit to that purpose, IMO. It would 
prevent you from having to handle all those partitions manually.


Regards

Markus



Re: [HACKERS] Declarative partitioning grammar

2008-01-15 Thread Markus Schiltknecht

Hi,

Tom Lane wrote:

I don't agree with that at all.  I can imagine plenty of situations
where a tuple falling outside the range of available partitions *should*
be treated as an error.  For instance, consider timestamped observations
--- data in the future is certainly bogus, and data further back than
you want to deal with must be an entry error as well.


Isn't it better to have these constraints as table constraints, instead 
of burying them in the partitioning definition? Mixing those two 
concepts seems very weird to me.


Or am I missing any benefit of mixing them?
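
For Tom's timestamped-observations example, a plain table constraint already
expresses the acceptable window, independently of how (or whether) the table
is partitioned - a sketch with invented names and a fixed window:

CREATE TABLE observations (
    obs_time timestamp,
    reading  numeric,
    CONSTRAINT obs_time_window
        CHECK (obs_time >= date '2000-01-01' AND obs_time < date '2010-01-01')
);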

Regards

Markus



Re: [HACKERS] Declarative partitioning grammar

2008-01-15 Thread Markus Schiltknecht

Hi,

Gregory Stark wrote:

In a previous life I had a database which had daily partitions. I assure you
it was unquestionably the right decision. Each day's data wasn't just
distinguished by the timestamp but actually was entirely separate from the
previous day's data. Both the archiving strategy and the many reports which
were run all depended specifically on the day the data was collected on.


Wouldn't Segment Exclusion (maybe together with a specialized form of 
CLUSTERing) handle that case much better than partitioning? Without the 
need to name all those thousands of partitions and manage them manually.


What I would want in such a case, is exactly not manual management of 
partitions, but rather a performance optimization for scanning a range 
of rows, which is something in between indexes (for very few rows) and a 
seq scan (for almost all rows of a table).


I know, this now sounds like I've switched sides to Simon's proposal. And 
yes, in a way, that's true. I certainly see merit for Segment Exclusion, 
more and more. OTOH I'm still skeptical about it replacing declarative 
partitioning entirely. But declarative partitioning only really makes 
sense if you partition into different storage subsystems, IMO.


Everything happening on the same storage subsystem shouldn't need manual 
partitioning, but should be optimized pretty automatically. As Simon 
proposed, that's well possible in many cases.


Regards

Markus



Re: [HACKERS] Declarative partitioning grammar

2008-01-15 Thread Markus Schiltknecht

Hi,

Tom Lane wrote:

DBAs tend to be belt *and* suspenders guys, no?


I rather know those admins with puzzled faces who are wondering 
why their transactions fail. Often enough, that can have a lot of 
different reasons. Extending the set of possible traps doesn't seem like 
a clever idea for those admins.



I'd think a lot of them
would want a table constraint, plus a partitioning rule that rejects
anything outside the intended partitions.


I'm rather a fan of the DRY principle (don't repeat yourself), because 
having to maintain redundant constraints smells suspiciously like a 
maintenance nightmare.


And where's the real use of making the database system check twice? Want 
to protect against memory corruption in between the two checks, eh? :-)


Regards

Markus




Re: [HACKERS] Declarative partitioning grammar

2008-01-15 Thread Markus Schiltknecht

Hi,

Zeugswetter Andreas ADI SD wrote:

Yes, but the problem with the timestamp partitioned tables is, that the
window is sliding. Thus you would need two alter tables for each new
period. One that changes the constraint + one that creates the new
partition. So it seems natural to join the two concepts for such a
partitioning syntax.


If you think in terms of split points, having to alter two tables is not 
true. It's better



Personally I find the automatic partition idea intriguing, where you
only have to choose an expression that equates to one value (value
group) per partition (and possibly a way to derive a partition name).
Then a partition is automatically created when a new row arrives for a
new value. That does not however address Tom's concern of rejecting data
that is outside the acceptable window, but maybe that is better dealt
with in the application anyways.

Andreas



Re: [HACKERS] Declarative partitioning grammar

2008-01-15 Thread Markus Schiltknecht

Hi,

(sorry for the previous one, if delivered, that went off too early...)

Zeugswetter Andreas ADI SD wrote:

Yes, but the problem with the timestamp partitioned tables is, that the
window is sliding. Thus you would need two alter tables for each new
period. One that changes the constraint + one that creates the new
partition. So it seems natural to join the two concepts for such a
partitioning syntax.


If you think in terms of split points, having to alter two partitions 
isn't true, you just add a split point.


Of course, that also alters the constraints of the partitions, but I 
think we all agree that the system should maintain those constraints 
automatically, anyway. As such, they don't even have to be visible to 
the DBA.



Personally I find the automatic partition idea intriguing, where you
only have to choose an expression that equates to one value (value
group) per partition (and possibly a way to derive a partition name).


IMO, better to go right for a fully automated approach. Or why would you 
need partition names in such a case?


Regards

Markus



Re: [HACKERS] Declarative partitioning grammar

2008-01-14 Thread Markus Schiltknecht

Hi,

Jeff Cohen wrote:
We did look at allowing general functions for partitioning and this was 
one concern.  The other is that we want to enforce that a row only gets 
inserted into a single partition, so we wanted a declarative syntax 
where it was relatively easy to check that range and list specifications 
don't overlap.


Why do you need to define a split point so ambiguously at all? Why not 
just give the DBA exactly *one* place to define the split point?


I don't think the separation into list, hash and range partitioning is 
adequate. What is the system supposed to do, if you try to insert a row 
which doesn't fit any of the values in your list or doesn't fit any of 
the ranges you defined?


I prefer a partitioning grammar which doesn't interfere with 
constraints. We all know how to define constraints. Please don't 
introduce a new, ambiguous way. A partitioning definition should be able 
to tell the target partition for *every* row which satisfies the 
constraints (the real ones, not ambiguous ones).


IMO, a single DDL command should only touch a single split point, i.e. 
split a table into two partitions, move the split point or remove the 
split point (joining the partitions again). Those are the only basic 
commands you need to be able to handle partitioning.


Sorry, but for my taste, the proposed grammar is too long per command, 
not flexible enough, and ambiguous for split points as well as 
for constraints. To me it looks like repeating the mistakes of others.


Regards

Markus




Re: [HACKERS] Some ideas about Vacuum

2008-01-12 Thread Markus Schiltknecht

Hi,

Gokulakannan Somasundaram wrote:

I'm also not sure it really buys us anything over having a second
dead-space-map data structure. The WAL is much larger and serves other
purposes which would limit what we can do with it.

Ok. One obvious advantage is that it saves the contention over DSM for 
the DML operations and Vacuum process.


Do you have evidence of that contention being so severe that it 
justifies the additional WAL reading from disk? (Assuming no WAL archiving.)


IMO we can get about any granularity we want for DSM update locking, 
depending on how we arrange the DSM bits.


Since Vacuum process is going to 
have much more information on what has happened in the database,


Why should that be? IMO, collecting the information at transaction time 
can give you exactly the same information, if not more or better 
information.


it is 
possible for some new structures. For example i have been thinking of 
changing our current index structure in such a way, it won't hold any 
duplicate tuples for different versions of data. Whenever there is a 
update, only the indexes relevant to the columns changed will get 
updated. The Vacuum has to play the role of changing the tid, the index 
tuple points to, whenever it vacuums a older version.


Huh? The index would then point to the old tuple only, until a VACUUM 
comes by, right? How are subsequent transactions expected to find the new 
tuple before that VACUUMing?


Regards

Markus



Re: [HACKERS] Some ideas about Vacuum

2008-01-12 Thread Markus Schiltknecht

Hi,

Tom Lane wrote:

Well, one of the principal arguments for having VACUUM at all is that it
off-loads required maintenance effort from foreground transaction code
paths.


Off-loading doesn't mean we don't have to do the work, so it obviously 
is a compromise.


AFAICT, having to write some DSM blocks from foreground transaction code 
paths may well be worth it overall, if it saves VACUUM from doing much 
more I/O.


Especially if the bgwriter can defer the I/O to after commit time (which 
I'm thinking of as another form of off-loading work from foreground 
transaction code).


Regards

Markus



Re: [HACKERS] Some ideas about Vacuum

2008-01-10 Thread Markus Schiltknecht

Hi,

Gokulakannan Somasundaram wrote:
But i am just thinking of creating the DSM 
by reading through the WAL Logs, instead of asking the Inserts, updates 
and deletes to do the DSM creation.


What's the advantage of that? What's wrong with collecting the 
information for DSM at transaction processing time? The overhead is 
certainly smaller than the overhead for doing it later on.


I think what Gregory is coming at is, if we schedule the Vacuum 
after 20% of table changes, then we essentially say we need 120% of the 
disk space and hence our select operations might end up doing more I/Os.


Well, full sequential scans end up doing more I/O, but not index scans 
typical for OLTP. So if autovacuum is the only thing doing full 
sequential scans, you'd better reduce the number of full scans, instead 
of saving only some percentage per scan, no?


Of course, depending on how much of your table fits in ram, you also 
need to consider the space savings in RAM...  However, I'm assuming a 
reasonably low ratio of RAM size vs table size.


Regards

Markus




Re: [HACKERS] Named vs Unnamed Partitions

2008-01-10 Thread Markus Schiltknecht

Hi,

Simon Riggs wrote:

On Wed, 2008-01-09 at 18:04 +0100, Markus Schiltknecht wrote:
What do you think about letting the database system know the split point 
vs it having to find optimal split points automatically?


For me, managing the table's files can be separate from the chunking
that allows partition exclusion.


Agreed.

So in your terms, my question is: who does the chunking? What you are 
proposing with SE is that the database does the chunking automatically 
(based on spontaneous ordering). That's a novel and interesting idea to me.



Managing the table's files must be a manual operation. We can't infer
the presence of a new tablespace etc.. 


Sure.

However, letting the database do the chunking (i.e. define split 
points), but leaving it up to the user to decide where to put those 
chunks certainly doesn't work.


So, there are only two options: either we let the user choose split 
points manually, or else we tell the database those missing pieces of 
information and rely on automation.


If I understand correctly, you are stating, that in the case of 
read-only vs read-writable, the database has enough information to make 
good decisions.



Those files would need less than
10 zones or chunks, usually just one.


Agreed.


The WHERE clause approach might easily allow more than 2 chunks and they
need not be logically contiguous. So the phrase split point doesn't
really fit because it implies a one dimensional viewpoint, but I'm happy
for you to give it a name.


I consider read-only vs. read-writable to be pretty one dimensional. And 
the storage is logically organized in contiguous blocks. So there are 
split points between segments with differing read-only property, 
according to my definition.


Regards

Markus




Re: [HACKERS] Some ideas about Vacuum

2008-01-10 Thread Markus Schiltknecht

Hi,

Gokulakannan Somasundaram wrote:
because of the contention. Am i missing something 
here? While Vacuum is reading the DSM, operations may not be able to 
update the bits. We need to put the DSM in shared memory, if all the 
processes are going to update it, whereas if Vacuum is going to form the 
DSM, then it might well be in the process local memory.  I can think of 
things like False sharing which might be avoided. But i think the main 
stuff is contention.


Ah, I begin to understand where you are coming from now, yes. However, 
(ab-)using the WAL and archiver still doesn't look like a good idea to me.


Even in indexes, we might end up reading dead tuples. We would mark it 
with LP_DEAD. So the overhead is less, but its there.


That's a good point, yes.

Ofcourse its 
natural to think of some background jobs during OLTP, and they will be 
affected


Agreed.

Regards

Markus




Re: [HACKERS] Some ideas about Vacuum

2008-01-09 Thread Markus Schiltknecht

Hi,

Gokulakannan Somasundaram wrote:
If we can ask the Vacuum process to scan 
the WAL log, it can get all the relevant details on where it needs to 
go.


You seem to be assuming that only a few tuples have changed between 
vacuums, so that the WAL could quickly guide the VACUUM processes to the 
areas where cleaning is necessary.


Let's drop that assumption, because by default, autovacuum_scale_factor 
is 20%, so a VACUUM process normally kicks in after 20% of tuples 
changed (disk space is cheap, I/O isn't). Additionally, there's a 
default nap time of one minute - and VACUUM is forced to take at least 
that much of a nap.
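
(For reference, the exact knobs I mean are these; the comments show the
defaults I'm referring to:)

SHOW autovacuum_vacuum_scale_factor;  -- 0.2, i.e. 20% of the table
SHOW autovacuum_naptime;              -- 1min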


So it's easily possible to have more dead tuples than live ones. In such 
cases, scanning the WAL can easily take *longer* than scanning the 
table, because the amount of WAL to read would be bigger.


One main restriction it places on the WAL Logs is that the WAL Log 
needs to be archived only after all the transactions in it completes. In 
other words, WAL logs need to be given enough space, to survive the 
longest transaction of the database. It is possible to avoid this 
situation by asking the Vacuum process to take the necessary information 
out of WAL log and store it somewhere and wait for the long running 
transaction to complete.


That would result in even more I/O...

The information of interest in WAL is only the table 
inserts/updates/deletes. So if everyone accepts that this is a good 
idea, till this point, there is a point in reading further.


Well, that's the information of interest; the question is where to store 
it. Maintaining a dead space map looks a lot cheaper to 
me than relying on the WAL to store that information.


Ultimately, what has been achieved till now is that we have made the 
sequential scans made by the Vacuum process on each table into a few 
random i/os. Of course there are optimizations possible to group the 
random i/os and find some sequential i/o out of it. But still we need to 
do a full index scan for all those indexes out there. HOT might have 
saved some work over there. But i am pessimistic here and wondering how 
it could have been improved. So it just strikes me, we can do the same 
thing which we did just with the tables. Convert a seq scan of the 
entire table into a random scan of few blocks. We can read the necessary 
tuple information from the tuples, group them and hit at the index in 
just those blocks and clean it up.


Sorry, I don't quite get what you are talking about here. What do 
indexes have to do with dead space? Why not just keep acting on the 
block level?


   I can already hear people, saying that it is not always possible to 
go back to index from table. There is this culprit called unstable 
function based indexes.


No, there's no such thing. Citing [1]: "All functions and operators used 
in an index definition must be immutable, that is, their results must 
depend only on their arguments and never on any outside influence."


Of course, you can mark any function IMMUTABLE and get unstable function 
based indexes, but that turns into a giant foot gun very quickly.
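
To make that concrete - the first index below is fine, the second one is
rejected outright because now() isn't immutable (table and index names are
invented):

CREATE TABLE readings (
    station  text,
    obs_time timestamp
);

-- accepted: lower() is immutable
CREATE INDEX readings_station_idx ON readings (lower(station));

-- rejected, because now() is not immutable:
-- CREATE INDEX readings_age_idx ON readings ((now() - obs_time));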


P.S.:  Let the objections/opposing views have a subtle reduction in its 
harshness.


I'm just pointing at things that are in conflict with my knowledge, 
assumptions and beliefs, all of which might be erroneous, plain wrong or 
completely mad. ;-)


Regards

Markus

[1]: the Very Fine Postgres Manual on CREATE INDEX:
http://www.postgresql.org/docs/8.3/static/sql-createindex.html



[HACKERS] LD_LIBRARY_PATH not honored on Debian unstable

2008-01-09 Thread Markus Schiltknecht

Hi,

I'm trying to run 'make check' on a 64bit Debian unstable. That aborts 
after 60 seconds due to not being able to connect to the postmaster.


I figured that there's nothing wrong with the postmaster; rather, psql 
can't start up, because it gets linked against an older libpq.so.5. It 
looks like for some reason it doesn't respect the LD_LIBRARY_PATH env 
variable, see ldd output:


[EMAIL PROTECTED]:/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress# 
LD_LIBRARY_PATH=/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress/./tmp_check/install//usr/lib 
ldd -r -v tmp_check/install/usr/bin/psql

linux-vdso.so.1 =>  (0x7fffc2bfe000)
libpq.so.5 => /usr/lib/libpq.so.5 (0x2ac8e81ba000)
libz.so.1 => /usr/lib/libz.so.1 (0x2ac8e83db000)
libreadline.so.5 => /lib/libreadline.so.5 (0x2ac8e8606000)
libcrypt.so.1 => /lib/libcrypt.so.1 (0x2ac8e8846000)
libdl.so.2 => /lib/libdl.so.2 (0x2ac8e8a7e000)
libm.so.6 => /lib/libm.so.6 (0x2ac8e8c82000)
libc.so.6 => /lib/libc.so.6 (0x2ac8e8f04000)
libssl.so.0.9.8 => /usr/lib/libssl.so.0.9.8 (0x2ac8e9262000)
libcrypto.so.0.9.8 => /usr/lib/libcrypto.so.0.9.8 (0x2ac8e94ae000)
libkrb5.so.3 => /usr/lib/libkrb5.so.3 (0x2ac8e983c000)
libcom_err.so.2 => /lib/libcom_err.so.2 (0x2ac8e9ad7000)
libpthread.so.0 => /lib/libpthread.so.0 (0x2ac8e9cd9000)
libncurses.so.5 => /usr/lib/libncurses.so.5 (0x2ac8e9ef5000)
/lib64/ld-linux-x86-64.so.2 (0x2ac8e7f9c000)
libk5crypto.so.3 => /usr/lib/libk5crypto.so.3 (0x2ac8ea132000)
libkrb5support.so.0 => /usr/lib/libkrb5support.so.0 (0x2ac8ea357000)
libkeyutils.so.1 => /lib/libkeyutils.so.1 (0x2ac8ea55f000)
libresolv.so.2 => /lib/libresolv.so.2 (0x2ac8ea761000)
undefined symbol: pg_valid_server_encoding_id 
(tmp_check/install/usr/bin/psql)

undefined symbol: PQconnectionNeedsPassword (tmp_check/install/usr/bin/psql)


Giving it an additional LD_PRELOAD for the newish libpq.5.1 helps:

[EMAIL PROTECTED]:/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress# 

LD_PRELOAD=/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress/tmp_check/install/usr/lib/libpq.so.5.1 
LD_LIBRARY_PATH=/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress/./tmp_check/install//usr/lib 
ldd -r -v tmp_check/install/usr/bin/psql

linux-vdso.so.1 =>  (0x7fffe97fe000)
/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress/tmp_check/install/usr/lib/libpq.so.5.1 (0x2b69c1547000)

libz.so.1 => /usr/lib/libz.so.1 (0x2b69c1765000)
libreadline.so.5 => /lib/libreadline.so.5 (0x2b69c199)
libcrypt.so.1 => /lib/libcrypt.so.1 (0x2b69c1bd)
libdl.so.2 => /lib/libdl.so.2 (0x2b69c1e08000)
libm.so.6 => /lib/libm.so.6 (0x2b69c200c000)
libc.so.6 => /lib/libc.so.6 (0x2b69c228e000)
libncurses.so.5 => /usr/lib/libncurses.so.5 (0x2b69c25ec000)
/lib64/ld-linux-x86-64.so.2 (0x2b69c1329000)


Somebody have an idea on what's wrong here? Thanks.

Regards

Markus




Re: [HACKERS] LD_LIBRARY_PATH not honored on Debian unstable

2008-01-09 Thread Markus Schiltknecht

Andrew Dunstan wrote:
Smells suspiciously like an rpath problem to me. What are your configure 
settings?


Ah, yeah, I see. Using something other than --prefix=/usr helped.

Thanks for the hint!

Regards

Markus




Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-09 Thread Markus Schiltknecht

Simon Riggs wrote:

Hmmm. I think it fits rather neatly with BitmapIndexScans. It would be
easy to apply the index condition and/or filters to see which segments
are excluded and then turn off bits in the bitmap appropriately.


Yeah, good point.


Not fully sure about IndexScans yet. I don't think it would be worth
trying to apply SE until we estimated we would return say 100 rows. It
needs to be able to work without slowing down the common path.


Yup.

Or 
put it another way: SE is an optimization for sequential scans. For 
tables where it works well, it could possibly replace the index entirely.


True

Without the index, you would rely on SE to always be able to exclude 
enough segments, so that the seq scan is less expensive than an index 
scan with the following table lookups.


It would have to be a very fat index scan for so large a table...


..for SE to be faster than an index scan, you mean? Yes.

Regards

Markus






Re: [HACKERS] Named vs Unnamed Partitions

2008-01-09 Thread Markus Schiltknecht

Simon Riggs wrote:

I have to admit I always found it kludgy to have objects named
invoices_2000_JAN and invoices_2000_FEB and so on. It's kind of a meta
denormalization. But so is specifying where clauses repeatedly.


The idea for using the WHERE clauses was to specifically avoid naming.


I understand, and I'm all for avoiding needless, kludgy names.

As I pointed out, knowledge of split points might be important for the 
database system. Maybe we can store the split point without the need for 
names? Dunno.



If you guys really want names, we can have names, but I think I want to
see a case where the storage characteristics of the table are so complex
we can only make sense of it by naming particular chunks.


Well, assuming you only have to deal with one split point, that's 
certainly true. However, there are people using more than two table 
spaces, thus obviously needing more split points.


Can we name the split points, rather than the partitions?

Regards

Markus




Re: [HACKERS] Some ideas about Vacuum

2008-01-09 Thread Markus Schiltknecht

Hi,

Gregory Stark wrote:

That's an interesting thought. I think your caveats are right but with some
more work it might be possible to work it out. For example if a background
process processed the WAL and accumulated an array of possibly-dead tuples to
process in batch. It would wait whenever it sees an xid which isn't yet past
globalxmin, and keep accumulating until it has enough to make it worthwhile
doing a pass.


I don't understand why one would want to go via the WAL; that only 
creates needless I/O. Better to accumulate the data right away, during the 
inserts, updates and deletes.  Spilling the accumulated data to disk, if 
absolutely required, would presumably still result in less I/O.



I think a bigger issue with this approach is that it ties all your tables
together. You can't process one table frequently while some other table has
some long-lived deleted tuples.


Don't use the WAL as the source of that information and that's issue's gone.


I'm also not sure it really buys us anything over having a second
dead-space-map data structure. The WAL is much larger and serves other
purposes which would limit what we can do with it.


Exactly.


You seem to be assuming that only few tuples have changed between vacuums, so
that WAL could quickly guide the VACUUM processes to the areas where cleaning
is necessary.

Let's drop that assumption, because by default, autovacuum_scale_factor is 20%,
so a VACUUM process normally kicks in after 20% of tuples changed (disk space
is cheap, I/O isn't). Additionally, there's a default nap time of one minute -
and VACUUM is forced to take at least that much of a nap.


I think this is exactly backwards. The goal should be to improve vacuum, then
adjust the autovacuum_scale_factor as low as we can. As vacuum gets cheaper
the scale factor can go lower and lower.


But you can't lower it endlessly, it's still a compromise, because it 
also means reducing the amount of tuples being cleaned per scan, which 
is against the goal of minimizing overall I/O cost of vacuuming.



We shouldn't allow the existing
autovacuum behaviour to control the way vacuum works.


That's a point.


As a side point, "disk is cheap, I/O isn't" is a weird statement. The more
disk you use the more I/O you'll have to do to work with the data.


That's only true, as long as you need *all* your data to work with it.


I still
maintain the default autovacuum_scale_factor is *far* too liberal. If I had my
druthers it would be 5%. But that's mostly informed by TPCC experience, in
real life the actual value will vary depending on the width of your records
and the relative length of your transactions versus transaction rate. The TPCC
experience is with ~ 400 byte records and many short transactions.


Hm.. 5% vs 20% would mean 4x as many vacuum scans, but only a 15% growth 
in size (105% vs 120%), right? Granted, those 15% are also taken from 
memory and caches, resulting in additional I/O...  Still, these numbers 
surprise me. Or am I missing something?


Regards

Markus




Re: [HACKERS] Named vs Unnamed Partitions

2008-01-09 Thread Markus Schiltknecht

Hi,

Simon Riggs wrote:

With that in mind, can I clarify what you're thinking, please?


Sure, I can try to clarify:


2) the things you've been discussing are essential requirements of
partitioning and we could never consider it complete until they are also
included and we must therefore talk about them now to check that its all
possible before we do anything on SE


I thought so, but am slowly dropping that point of view. In favor of 
something like: hey, if you manage to do it all automatically, cool, go 
for it!



3) doing SE first is right, I'm just thinking ahead


Yes, SE certainly has merit. Combine it with some sort of maintained 
CLUSTERing order and it's worth doing, IMO.


I'm not convinced about dynamic partitioning being able to generally 
replace explicit partitioning anytime soon.



Sorry if that seems blunt, I'm just not clear where we're going.


Well, implicit or automatic partitioning is still a pretty new concept 
to me, but I'm slowly beginning to like it. Thank you for pointing me at it.


Regards

Markus



Re: [HACKERS] Named vs Unnamed Partitions

2008-01-09 Thread Markus Schiltknecht

Hi,

Simon Riggs wrote:

When I delete all rows WHERE some_date < 'cut-off date' on a segment
boundary value that would delete all segments that met the criteria. The
following VACUUM will then return those segments to be read-write, where
they can then be refilled with new incoming data. The only command we
would have to run is the DELETE, everything else is automatic.


Agreed, that would be very nice.
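
In other words, under that proposal the whole maintenance cycle would boil
down to something like (table and column names made up):

DELETE FROM measurements WHERE some_date < date '2007-01-01';
VACUUM measurements;  -- per the proposal, this would hand the emptied
                      -- segments back as read-write for new data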


So not convinced of the need for named sections of tables yet. It all
seems like detail, rather than actually what we want for managing large
tables.


What do you think about letting the database system know the split point 
vs it having to find optimal split points automatically?


Read-write vs. read-only is a good start, but can that concept be 
expanded to automatically choosing hash partitioning between storage 
systems, for example? Or more generally: can the database system gather 
enough information about the storage systems to make a decision as good 
as or better than the DBA?


Regards

Markus




Re: [HACKERS] Proposal - libpq Type System beta-0.8a (was PGparam)

2008-01-08 Thread Markus Schiltknecht

Hi,

Andrew Chernow wrote:
It might be something with the 
attachment, who knows.


Most probably that was the case, yes. The -hackers list is limited, 
please use -patches to send patches. ;-)


Regards

Markus




[HACKERS] Named vs Unnamed Partitions

2008-01-08 Thread Markus Schiltknecht

Hi,

IMO, the lengthy discussion about Segment Exclusion and Segment 
Visibility Maps has long turned into a discussion about partitioning in 
general. I'm thankful for all the new insights it has brought me and I 
want to continue sharing my view on things. What's following is highly 
theoretical and has brainstorming characteristics. You've been warned.


There are two very distinct ways to handle partitioning. For now, I'm 
calling them named and unnamed partitioning. Let's have a closer look at 
both options from a user's point of view. I'm using Andrew's pseudo DDL 
example from the above-mentioned thread:


ALTER TABLE foo
  SET read_only='t'
  WHERE created_on < '2007-01-01';

Given all tuples were read-writeable before, that implicitly created two 
partitions. Giving them names could look like that:


ALTER TABLE foo
  SPLIT INTO old_foos AND new_foos
  AT created_on < '2007-01-01';
ALTER PARTITION old_foos
  SET READ ONLY;


Instead of only setting the read-only property, one could also set an 
alternative table space for the partition:


ALTER TABLE foo
  SET TABLE SPACE large_but_slow_storage
  WHERE created_on < '2007-01-01';

vs:

ALTER PARTITION old_foos
  SET TABLE SPACE large_but_slow_storage;


Please also note, that neither variant is limited to range partitioning. 
You can theoretically partition by pretty much anything, for example 
with a WHERE clause like:


..WHERE (id % 5) < 2

The primary difference I see between these two ways to declare 
partitions is that the former only modifies tuple properties 
(read-only, storage location), while the latter also tells the database 
*why* it has to modify them.


That has several different effects. First, newly inserted tuples are 
treated differently. For unnamed partitions, there must be defaults, 
like read-writable and a default table space. With named partitions, you 
define split points, so I guess one expects newly inserted tuples to end 
up in the right partition automatically. Unnamed partitioning could be 
equally automatic by letting a function decide where to insert the 
new tuple.


Second, repartitioning must be treated differently. With unnamed 
partitioning, the admin must first adjust the defaults (if required) and 
then move the existing tuple properties accordingly. With named 
partitions, the admin only needs to adjust the split point and the 
database system knows what it has to do.


And third, but IMO most importantly: to be able to optimize queries, the 
database system has to know split points, so it can exclude partitions 
or segments from scanning. Obviously, with named partitions, it always 
knows them. Otherwise, you'll have to maintain some information about 
the tuples in your partitions, as Simon does with the min/max tuples. As 
soon as required, it could also maintain additional min/max values, i.e. 
for (id % 5) for the above example.



I hope to have shown the most relevant aspects. To conclude, I'd say 
that named partitioning is closer to manually managed partitioning, as 
already known and often used, while unnamed partitioning is closer to 
automated partitioning, where the DBA does *not need* to have names for 
partitions - which is a pretty new and interesting idea to me.


Regards

Markus




Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-07 Thread Markus Schiltknecht

Hi Csaba,

Csaba Nagy wrote:

One additional thought: what about a kind of segment fill factor ?
Meaning: each segment has some free space reserved for future
updates/inserts of records in the same range of its partitioning
constraint. And when inserting/updating you put the new record into the
corresponding segment... just like a very coarse clustering.


Hm.. yeah. That way, a few writes to a read-optimized segment could be 
accepted, without having to drop the optimization immediately. And the 
other way around: generally prevent having to drop the optimization by 
forcing tuples to be written to a segment with matching min/max tuples. 
Although that's not exactly trivial, I think.
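
The closest existing knob is the per-table fillfactor, which reserves free
space in every heap page; a per-segment variant would presumably be set in a
similar way (sketch, table name invented):

CREATE TABLE measurements (
    some_date date,
    reading   numeric
) WITH (fillfactor = 70);  -- keep ~30% of each heap page free for updates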


However, for tables which don't fit the use case of SE, people certainly 
don't want such a fill factor to bloat their tables.


Regards

Markus




Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-07 Thread Markus Schiltknecht

Hi,

Csaba Nagy wrote:

Sure, but it could be configurable and should only be enabled if the
table is marked as partitioned on some condition... 


As I'm regarding SE as an optimization, I disagree here. Like all 
optimizations, SE should conceptually be reasonably close to cost-less 
when unneeded.



I think it would be
a bad idea anyway if the DB would start partitioning on some arbitrary
criteria based on analyzing the table, so the DBA should be the one to
decide on what criteria to partition.


I absolutely agree for real partitioning, which targets manageability of 
table partitions.



In particular it could be a bad
idea on occasions to partition on the clustering criteria for tables
which were clustered once.


Why is that? AFAIUI, Segment Exclusion combines perfectly well with 
clustering. Or even better, with an upcoming feature to maintain 
clustered ordering. Where do you see disadvantages of such an optimization 
for sequential scans?
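
For reference, with today's commands that combination would be something like
this (assuming a measurements table with a some_date column; names invented):

CREATE INDEX measurements_date_idx ON measurements (some_date);
CLUSTER measurements USING measurements_date_idx;
-- a one-off reordering by some_date; the per-segment min/max ranges then
-- stay tight only as long as new rows keep arriving in roughly that order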


Regards

Markus




Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-07 Thread Markus Schiltknecht

Hi,

Andrew Sullivan wrote:

On Sat, Jan 05, 2008 at 08:02:41PM +0100, Markus Schiltknecht wrote:
Well, management of relations is easy enough, known to the DBA and most 
importantly: it already exists. Having to set up something which is 
*not* tied to a relation complicates things just because it's an 
additional concept.


But we're already dealing with some complicated concepts.


Possibly, yes, but that's hardly a reason to introduce even more 
complicated concepts...


Does anything speak against letting the DBA handle partitions as relations?


There isn't anything that will prevent current-style partitioning strategies
from continuing to work in the face of Simon's proposal.


Agreed. Nor will Simon's proposal completely replace them.


Without even trying, I can think of a dozen examples in the past 5 years
where I could have used that sort of functionality.  Because the cost of
data retrieval was high enough, we had to decide that the question wasn't
worth answering.  Some of those answers might have been quite valuable
indeed to the Internet community, to be frank; but because I had to pay the
cost without getting much direct benefit, it just wasn't worth the effort.


Sure, there's value in Simon's proposal. But it has pretty strict 
requirements. IMO, it's pretty hard to say whether it would have helped at 
all in your cases. Are any of them still available to check?


Remember the requirements: no single tuple in the segment may be 
significantly out of the average bounds. Otherwise, the min/max gets 
pretty useless and the segment can never be excluded.


As said before, combination with CLUSTERing might help, yes. But if you 
need to maintain CLUSTERed ordering, aren't there better ways? For 
example, you could use binary searching on the relation directly, much 
like with indices, instead of sequentially scanning on the CLUSTERed 
relation. That would even give us some sort of indices with visibility.


Agreed. I'd say that's why the DBA needs to be able to define the split 
point between partitions: only he knows the meaning of the data.


I think this is only partly true.  A casual glance at the -general list will
reveal all manner of false assumptions on the parts of administrators about
how their data is structured.  My experience is that, given that the
computer has way more information about the data than I do, it is more
likely to make the right choice.  To the extent it doesn't do so, that's a
problem in the planning (or whatever) algorithms, and it ought to be fixed
there.


Well, Postgres doesn't automatically create indices, as a counterexample.

With regard to partitioning over multiple table spaces, I think the DBA 
definitely has more information available than the computer. A DBA 
(hopefully) knows future plans and emergency strategies for the storage 
system, for example. Lacking such information, the database system will 
have a pretty hard time making a good decision on how to partition 
between table spaces, IMO.


Regards

Markus



Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-07 Thread Markus Schiltknecht

Hi,

Andrew Sullivan wrote:

On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote:

Does anything speak against letting the DBA handle partitions as relations?


Yes: it doesn't solve the problem I have, which is that I don't want to have
to manage a whole bunch of tables.  I want one table, and I want to be able
to say, "That section is closed."


That's a fair requirement. Wanting to manage partitions manually is 
another fair requirement, IMO. They can well coexist.


Remember the requirements: no single tuple in the segment may be 
significantly out of the average bounds. Otherwise, the min/max gets 
pretty useless and the segment can never be excluded.


The segment can never be excluded in a search on that key, yes.  But
consider the likely cases we're looking at: 


Uh, which key are you talking about? AFAIU Simon's proposal, he suggests 
maintaining min/max values for all columns of the table.



WHERE some_date >= '1999-01-01' AND some_date < '2001-01-01';


Yeah, and if only *one* tuple in the 1G segment has:

  some_date = '1998-12-31' OR some_date = '2001-01-01'

Segment Exclusion can't exclude that segment. That's all I'm saying.


That said, I am not trying to argue it should be adopted without poking at
its weaknesses.  I just think it unfair to ask the proposal to address
problems it's not really aimed at.


Huh? I'm certainly not the one asking for it. Quite the opposite, I'm 
warning against over-estimating the use of SE.


In his proposal, Simon was explicitly comparing to declarative 
partitioning, pointing out lots of advantages and implicitly stating 
that SE could cover most, if not all, needs of what's commonly understood 
as partitioning. That's where I disagree.


But certainly, SE and SVM have some merit for their use case. And I'm 
looking forward to testing patches ;-)


Regards

Markus




Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-06 Thread Markus Schiltknecht

Hi,

Gokulakannan Somasundaram wrote:



On Jan 5, 2008 6:15 PM, [EMAIL PROTECTED] wrote:


One thought I had back then, with partitioned tables was gee -- B-tree
index is already doing a partition; why do a manual partition on top of
that?"

Can you please explain more on what you are trying to say here?


I think this has to do with SE not being of much use for index scans. Or 
put it another way: SE is an optimization for sequential scans. For 
tables where it works well, it could possibly replace the index entirely.


Without the index, you would rely on SE to always be able to exclude 
enough segments, so that the seq scan is less expensive than an index 
scan with the following table lookups.


With an index, the planner gets a hard time deciding between the index 
scan and the (possibly SE optimized) seq scan.
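
Today, the closest one can get to seeing that trade-off is to force the
planner's hand and compare the plans, e.g. (assuming a sales table with a
sales_date column):

EXPLAIN SELECT * FROM sales WHERE sales_date >= date '2007-06-01';

SET enable_indexscan = off;
SET enable_bitmapscan = off;  -- leave only the seq-scan plan to look at
EXPLAIN SELECT * FROM sales WHERE sales_date >= date '2007-06-01';
RESET enable_indexscan;
RESET enable_bitmapscan;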


Regards

Markus




Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-06 Thread Markus Schiltknecht

Hi,

Robert Treat wrote:

On Saturday 05 January 2008 14:02, Markus Schiltknecht wrote:

To satisfy all the different requirements of partitioning with segments
based partitioning, we'd have to allow a table to span multiple table
spaces. I'm not very keen on going that way.

Why?

Uh.. if a table (RELKIND_RELATION) can only span one table space, as it
is now, all of its segments are in the same table space. I don't quite
call that partitioning. Well, sure, you could call it so, but then, each
and every Postgres table is already partitioned in 1G segments.

It all depends on the definitions, but in my world, horizontal
partitioning for databases involves multiple table spaces (and is quite
useless without that). Calling anything else partitioning is confusing,
IMO.


I'm not following this.  If we can work out a scheme, I see no reason not to 
allow a single table to span multiple tablespaces.  Do you see a problem with 
that?


Uhm... well, no. I was just pointing out that it's a requirement. It 
depends on how you define things, but I'm seeing it that way:


table -- 1:n -- partition -- 1:1 -- table space -- 1:n -- segments

What I'm advocating is making partitions available to the DBA as some 
kind of relation, which she can query separately and move around between 
table spaces.



Why should that not be possible with other schemes? Moving the split
point between two partitions involves moving tuples around, no matter if
you are going to move them between segments or between relations
building the partitions.


The difference is that, if I currently have a table split by month, I 
can re-partition it into weekly segments, and only shuffle one month's data 
at a time to minimize impact on the system while I shuffle it. This can even be 
used to do dynamic management, where data from the current month is archived 
by day, data from the past year by week, and data beyond that done monthly.


This should be possible with both schemes; I see no connection to what 
we've discussed. SE doesn't magically give you the level of control you 
are requesting here. Quite the opposite: referring to CLUSTERing 
makes me wonder whether that's not going to shuffle way too many tuples around.


What I'm saying is, that SE doesn't partition the segments into 
different table spaces. Thus I don't consider it database partitioning 
in the first place. As I currently understand it, it's:


table -- 1:1 -- table space -- 1:n -- partitions -- 1:n -- segments

On many other databases, if you change the partition scheme, it requires 
exclusive locks and a shuffling of all of the data, even data whose 
partitions aren't being redefined.  Even worse are systems like mysql, where 
you need to rewrite the indexes as well.  To me, these requirements always 
seem like show stoppers; I generally can't afford to lock a table while the 
database rewrites a billion rows of data. 


I fully agree here. How do you plan to solve that problem on top of SE?

In a more general sense, a global index is an index that spans multiple 
partitions, as opposed to a local index, which is an index on specific 
partitions; postgresql current supports the latter, not the former.


In any case, my thinking is if we had the segment exclusion technique, I could 
convert that partitioned table into a regular table again,


... on a single table space ...

use segment 
exclusion to handle what is currently handled by partitions,


... except, that there is no partitioning (!?!) (between table spaces)

and create 
a global index across all the other data for that other, currently killer, 
query. 


I thought the table you are referring to is bigger than your fastest 
table space? That would even make it impossible.


See where I'm coming from? And why I'm stating that SE is an 
optimization (for seq scans), but not partitioning?


Regards

Markus




Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-05 Thread Markus Schiltknecht

Andrew Sullivan wrote:

On Fri, Jan 04, 2008 at 10:26:54PM +0100, Markus Schiltknecht wrote:
I'm still puzzled about how a DBA is expected to figure out which 
segments to mark. 


I think that part might be hand-wavy still.  But once this facility is
there, what's to prevent the current active segment (and the rest) from also
getting this mark, which would mean the table is read only?  


Well, sure, marking *all* segments read-only is pretty easy. But that 
was not quite my point.


Regards

Markus




Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-05 Thread Markus Schiltknecht

Hi,

[EMAIL PROTECTED] wrote:

The main proposal deliberately has few, if any, knobs and dials. That's
a point of philosophy that I've had views on previously: my normal
stance is that we need some knobs to allow the database to be tuned to
individual circumstances. 


One thought I had back then, with partitioned tables was gee -- B-tree
index is already doing a partition; why do a manual partition on top of
that?"


Well, that's why I'm so focused on manageability: I think the primary 
purpose of partitioning is manageability (of the underlying storage for 
a table).


Regards

Markus




Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-05 Thread Markus Schiltknecht

Hi,

Simon Riggs wrote:
 On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote:

 I'm still puzzled about how a DBA is expected to figure out which
 segments to mark. Simon, are you assuming we are going to pass on
 segment numbers to the DBA one day?

 No Way!

Ah, I'm glad ;-)

Simon Riggs wrote:

Skepticism is OK, but I'd like to get your detailed thoughts on this.
I've been an advocate of the multi-tables approach now for many years,
so I don't expect everybody to switch their beliefs on my say-so
overnight. Let me make a few more comments in this area:


I've so far always thought about some sort of multi-relations approach 
for partitioning, yes. Let's see if I can get my mind around 
single-table partitioning.



The main proposal deliberately has few, if any, knobs and dials. That's
a point of philosophy that I've had views on previously: my normal
stance is that we need some knobs to allow the database to be tuned to
individual circumstances.

In this case, partitioning is way too complex to administer effectively
and requires application changes that make it impossible to use for
packaged applications. The latest Oracle TPC-H benchmark uses 10 pages
of DDL to set it up and if I can find a way to avoid that, I'd recommend
it to all. I do still want some knobs and dials, just not 10 pages
worth, though I'd like yours and others' guidance on what those should
be. Oracle have been responding to feedback with their new interval
partitioning, but its still a multi-table approach in essence.


I can absolutely support your efforts to minimize knobs and 
configuration DDL. However, my current feeling is that segment-based 
partitioning complicates things, because the DBA doesn't have tools and 
commands to handle segments.


To satisfy all the different requirements of partitioning with segment-based 
partitioning, we'd have to allow a table to span multiple table 
spaces. I'm not very keen on going that way.


However, what I certainly like is the automated split point definition. 
Instead of having to create tables by hand and linking them via 
inheritance and constraint exclusion, I have something very similar in 
mind, like what you proposed for marking read-only segments. Something like:


  SPLIT TABLE customers AT cust_name < 'n';

or:

  SPLIT TABLE inventory AT inv_id % 4 = 2;

In my imagination, this should automatically create the underlying 
relations, i.e.:


  NOTICE: relations inventory__l and inventory__r have been created.

That way, the DBA could then handle those like normal relations, 
querying them or moving them to different table spaces like all other 
normal relations.
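
For instance, the usual relation-level commands would then just work on them
(existing syntax; the relation names are the ones from the hypothetical
NOTICE above, the table space name is invented):

SELECT count(*) FROM inventory__l;

ALTER TABLE inventory__l SET TABLESPACE slow_storage;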


In a way, that's not so different from possible extensions on top of 
Segment Exclusion, except that the DBA additionally gets a relation name 
to be able to address the set of segments which form a partition. Or put 
it the other way around: go for Segment Exclusion, but add some sort of 
a sentinel relation for each set of segments, to make them reachable for 
the DBA.



My observation of partitioned databases is that they all work
beautifully at the design stage, but problems emerge over time. A
time-based range partitioned table can often have different numbers of
rows per partition, giving inconsistent response times. A
height-balanced approach where we make the partitions all the same size,
yet vary the data value boundaries will give much more consistent query
times and can be completely automated much more easily.


Uh.. well, consistent query time isn't the first thing I'm expecting 
from partitioning by time ranges. If I wanted consistent query times, I'd 
rather use hash partitioning or something, no?


I'd even state that one *wants* inconsistent response times when using 
time-based range partitioning: moving old, seldom-used data to slower 
storage and keeping only a small amount of often-used tuples on the 
faster disks, for example.



The SVM concept doesn't cover everything that you can do with
partitioning, but my feeling is it covers the main use cases well.


As I regard manageability as the main advantage of partitioning, 
which you've intentionally left out for now, I disagree here.


How could SVM or Segment Exclusion potentially cover what hash 
partitioning does? Maybe together with the ability to store different 
segments of a table on different table spaces. That could be considered 
an approach to range partitioning. But then, that would be the 
partitioning, and not SVM or Segment Exclusion. To me, both SVM and 
SE look much more like an optimization for certain special cases and 
don't have much to do with partitioning.


Regards

Markus




Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-05 Thread Markus Schiltknecht

Hi,

Robert Treat wrote:
Personally I can't say it complicates things, because it isn't clear how it 
will be managed. :-)


Well, management of relations is easy enough, known to the DBA and most 
importantly: it already exists. Having to set up something which is 
*not* tied to a relation complicates things just because it's an 
additional concept.


But as I've pointed out, maybe what we have in mind isn't that different 
at all. Just have a sentinel relation mean a set of segments, i.e. all 
read-only segments of a table. Then again, a table - in a way - is not 
much else than a set of segments. So where's the real difference?



To satisfy all the different requirements of partitioning with a
segment-based approach, we'd have to allow a table to span multiple table
spaces. I'm not very keen on going that way.



Why?


Uh.. if a table (RELKIND_RELATION) can only span one table space, as it 
is now, all of its segments are in the same table space. I don't quite 
call that partitioning. Well, sure, you could call it so, but then, each 
and every Postgres table is already partitioned into 1GB segments.


It all depends on the definitions, but in my world, horizontal 
partitioning for databases involves multiple table spaces (and is quite 
useless without that). Calling anything else partitioning is confusing, IMO.


So the one thing that always scares me about these "define it all and let the 
database sort it out" methods is they seem to lead to cases where the system 
ends up rewriting the data to fit into some new partition layout.


That holds true no matter if you shuffle between segments or relations. 
To be able to let the DBA define an exact split point, the database 
*will* have to shuffle tuples around. Why does that scare you? It's a 
regular database system's maintenance procedure.


One thing 
that is nice about the current partitioning scheme is you can control the 
impact of this behavior in these scenarios, by moving around small portions 
of the table at a time.  


Uh.. I'm not quite following. What current partitioning scheme are you 
referring to?


Why should that not be possible with other schemes? Moving the split 
point between two partitions involves moving tuples around, no matter if 
you are going to move them between segments or between relations 
building the partitions.
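
To make that concrete with today's relation-based partitions (a sketch; 
table and column names are made up), moving a split point essentially 
boils down to something like:

  BEGIN;
  -- adjust the CHECK constraints on both partitions first, then:
  INSERT INTO sales_2007h2
    SELECT * FROM sales_2007h1 WHERE sales_date >= date '2007-06-01';
  DELETE FROM sales_2007h1
    WHERE sales_date >= date '2007-06-01';
  COMMIT;

A segment-based scheme would have to shuffle the very same tuples, just 
between segments instead of between relations.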


More to the point (I think) is that people define access to the data based on 
the meaning of the data, not how it is stored on disk.  For example, in some 
tables we only need to be active on 1 month's worth of data... how that is 
laid out on disk (# partitions, which tablespaces) is a means to the end of 
working actively on 1 month's worth of data. I can't think of many cases where 
people would actually say they want to work actively on the most recent GB of 
data.  


Agreed. I'd say that's why the DBA needs to be able to define the split 
point between partitions: only he knows the meaning of the data.



To me, both of SVM and
SE look much more like an optimization for certain special cases and
don't have much to do with partitioning.


Even if this were true, it might still be a useful optimization.


Possibly, yes. To me, the use case seems pretty narrow, though. For 
example it doesn't affect index scans much.


One table I 
am thinking of in particular in my system has one query we need to run across 
partitions, which ends up doing a slew of bitmap index scans for all the 
partitions. If using segment exclusion on it meant that I could get a global 
index to help that query, I'd be happy. 


As proposed, Segment Exclusion works only on exactly one table. Thus, if 
you already have your data partitioned into multiple relations, it most 
probably won't affect your setup much. It certainly has nothing to do 
with what I understand by 'global index' (that's an index spanning 
multiple tables, right?).


Regards

Markus




Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-04 Thread Markus Schiltknecht

Hello Simon,

Simon Riggs wrote:

I've come
up with an alternative concept to allow us to discuss the particular
merits of each. ISTM that this new proposal has considerable potential.


Hm.. interesting idea.


If we were able to keep track of which sections of a table are now
read-only then we would be able to record information about the data in
that section and use it to help solve queries. This is turning the
current thinking on its head: we could derive the constraints from the
data, rather than placing the data according to the constraints. That
would be much more natural: load data into the table and have the system
work out the rest.


Yeah, but that's also the most limiting factor of your approach: it 
covers only horizontal partitioning by time (or to be more precise, by 
columns which are very likely to increase or decrease with time). All 
other columns will very likely contain values from the full range of 
possible values.


As you have pointed out, that might be a very frequent use case. I can't 
argue about that, however, I think it's important to be well aware of 
that limitation.



Other scans types might also use segment exclusion, though this would
only be useful for scans retrieving many rows, otherwise the overhead of
segment exclusion might not be worthwhile.


Uh.. the overhead of checking against min/max values doesn't seem that 
big to me.


I rather think the gain for index scans would be prohibitively small, 
because (given frequent enough vacuuming) an index scan shouldn't return 
many pointers to tuples in segments which could be optimized away by 
segment exclusion.



If we collect data for all columns then many of our implicit constraints
would be useless. e.g. if a column only has a few values and these are
present in all segments. Matching our predicate against all constraints
would be expensive, so we must weed out poor constraints. We would do
this by removing any constraint that overlapped more than 10% of other
segments. Various heuristics would likely need to be discovered during
development to make this work without resorting to manual commands.


Uh, well, that's about the limitation I've pointed out above. But is it 
really worth maintaining statistics about overlapping values and 
removing min/max checks for certain columns?


It would save you the min/max check per segment and scan, but cost 
maintaining the statistics and checking against the statistics once per 
scan. AFAICS the block with the min/max tuple per segment will often 
remain cached anyway... dunno.



Noting which segments are read-only
---

Everything so far has relied upon our ability to note which segments of
a table are read-only. We could do this in two ways

1) have the system automatically keep track of non-changing data
2) have the DBA issue a command to mark a segment as read-only now

Probably a combination of the two is better, so we have three states for
segments
- READ_WRITE_ALLOWED
- EFFECTIVE_READ_ONLY (set by 1 only)
- MARKED_READ_ONLY (set by 2 only)

Having said that I want to concentrate on (1), though consider the other
one also in case requested by reviewers.


Hm.. AFAICT, horizontal partitioning often serves manageability, which 
is quite limited when all data stays in one table (you can't move a single 
segment to a different tablespace). Thus I think option 2 is pretty 
constrained in usability. What would the DBA gain by setting a segment 
to read only? How does the DBA figure out in which segment a tuple is 
stored (so she can decide to mark it read only)?



The user may also wish to clear down very old data, so allowing DELETEs
can ensure the user can still remove old data from the table. By
carefully choosing the values to be deleted, a whole segment can be
deleted and then returned to the FSM. 


Oh, yeah, that sounds like a good optimization. Bulk deletes, yummie!


This proposal offers many of the advantages of the earlier Visibility
Map proposal, yet without major changes to heap structure. Since the
segment-level visibility map is more granular it would only be 80% as
effective as the more detailed block-level map. Having said that, the
bookkeeping overheads will also be considerably reduced, so it does seem
a good joint approach. It also allows freezing to be handled fully,
which was a problem with the original visibility map proposal. WAL
logging visibility map changes is also now much easier.


I generally agree, although I'm somewhat dubious about the 80% factor.


My thoughts have been targeted directly at partitioning, yet I have to
admit that this idea overlaps, and in my world view, replaces the
Visibility Map proposal. I very much like the name Visibility Map
though.


I would even say that partitioning is somewhat of a misleading term 
here, because it normally allows the DBA to decide where to split.


Given that we are operating on segments here, to which the DBA has very 
limited information and access, I 

Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-04 Thread Markus Schiltknecht

Hi,

Simon Riggs wrote:

- any Fact table where measurements/observations/events are accumulated
e.g.
Web Hits (any Internet events)
Call Detail Records
Sales
Security Events
Scientific Measurements
Process Control

- any Major Entity where new entities are created from a sequence
e.g.
Orders, OrderItems
Invoices
Shipments, Returns
most SCM/DCM events


...and are only very seldom changed after a while, I would add. Because 
changing an old tuple would invalidate the optimization for the affected 
segment.


That's why this optimization can't help for inventory tables, where an 
id might correlate with time and storage location, but writing access 
doesn't correlate with storage location (segment number) and time.


Regards

Markus



Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-04 Thread Markus Schiltknecht

Hi,

Simon Riggs wrote:

The smaller the partition size the greater the overhead of managing it.
Also I've been looking at read-only tables and compression, as you may
know. My idea was that in the future we could mark segments as either
- read-only 
- compressed

- able to be shipped off to hierarchical storage

Those ideas work best if the partitioning is based around the physical
file sizes we use for segments.


As much as I'd like this simplification, I'm still thinking of 
these segments as an implementation detail of Postgres, and not 
something the user should have to deal with.


Allowing the DBA to move segments to a different table space and giving 
him the possibility to check which tuples are in which segment seems 
awkward from a user's perspective, IMO.


Regards

Markus



Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-04 Thread Markus Schiltknecht

Hi,

Simon Riggs wrote:

On Fri, 2008-01-04 at 13:06 -0500, Andrew Sullivan wrote:

On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote:
Agreed. Just a minor note: I find "marked read-only" too strong, as it 
implies an impossibility to write. I propose speaking about "mostly-read" 
segments, or "optimized for reading" or similar.


Hm.. yeah, after rereading, I realize that I've mixed up states no. 2 
and 3 here, sorry.



I do want some segments to be _marked_ read-only: I want attempted writes to
them to _fail_.


Well, I can see use cases for marking tuples or complete relations as 
read only. But segments?


I'm still puzzled about how a DBA is expected to figure out which 
segments to mark. Simon, are you assuming we are going to pass on 
segment numbers to the DBA one day?


If not, a more user-friendly command like "MARK READ ONLY WHERE date = 
2006" would have to move tuples around between segments, so as to be 
able to satisfy the split point exactly, right?



I think Markus thought that we would mark them read only automatically,
which was not my intention. I believe its possible to have this in a way
that will make you both happy. Some more explanation:

There would be three different states for a segment:
1. read write
2. optimized for reading, as Markus says it
3. marked read only by explicit command


Thanks for clarification.

Regards

Markus



Re: [HACKERS] [GENERAL] Slow PITR restore

2007-12-14 Thread Markus Schiltknecht

Hi,

Alvaro Herrera wrote:

Simon Riggs wrote:


ISTM its just autovacuum launcher + Hot Standby mixed.


I don't think you need a launcher at all.  Just get the postmaster to
start a configurable number of wal-replay processes (currently the
number is hardcoded to 1).


I also see similarity to what I do for Postgres-R: a manager and helper 
backends which can be started upon request. Such a scheme is currently 
used for autovacuum; I'm using it for replication; it could help to 
parallelize recovery; and it certainly helps to parallelize queries, 
as discussed in another thread.


Maybe it's worth considering a general framework for such a manager or 
auto launcher, as well as for helper backends. It certainly depends on the 
complexity of that manager, but it should probably be an external 
process.


What all of the helper backends have in common, AFAICT:

 - a connection to a database
 - no client connection
 - superuser privileges

(For parallelized queries, superuser privileges might appear wrong, but 
I'm arguing that parallelizing the rights checking isn't worth the 
trouble, so the initiating worker backend should do that and only 
delegate safe jobs to helper backends. Or is that a serious limitation 
in a way?)


Most code for that already exists, as we already have various helpers. 
What's missing, IMO, is a communication channel between the worker and 
helper backends as well as between the backends and the manager. That's 
needed e.g. for worker backends to be able to request helper backends 
and feed them with their wishes.


Unix pipes can only be set up between the parent and the child of a 
fork; they eat file descriptors, need to copy data to the kernel and 
back, and IIRC there were portability issues. That's why I've written 
the internal message passing (IMessage) stuff, see -patches [1].


I'm all for unifying such a manager process and generalizing the 
requesting and launching of helpers as well as the management of their state 
(handling dead helper processes, keeping a pool of idle helpers which 
are already connected to a database, etc.). Most of that already exists 
in my Postgres-R code, so maybe I can derive a general-purpose patch to 
start contributing code from Postgres-R?


Comments? Use cases I'm missing?

Regards

Markus

[1]: last time I published IMessage stuff on -patches, WIP:
http://archives.postgresql.org/pgsql-patches/2007-01/msg00578.php



Re: [HACKERS] [GENERAL] Slow PITR restore

2007-12-14 Thread Markus Schiltknecht

Hannu Krosing wrote:

until N fubbers used


..whatever a fubber is :-)

Nice typo!

Markus




Re: [HACKERS] [GENERAL] Slow PITR restore

2007-12-14 Thread Markus Schiltknecht

Hello Hannu,

Hannu Krosing wrote:
(For parallelized queries, superuser privileges might appear wrong, but 
I'm arguing that parallelizing the rights checking isn't worth the 
trouble, so the initiating worker backend should do that and only 
delegate safe jobs to helper backends. Or is that a serious limitation 
in a way?)


at least functions defined with SECURITY DEFINER may be a problem


Uhm.. what I had in mind was parallelizing sequential scans, index scans, 
joins and such - database internal stuff.


Parallelizing user defined functions (or what did you have in mind?) is 
more difficult and sometimes impossible, because the planner cannot know 
ahead of time what the function is going to do.


However, thinking about it, maybe one could and should try to 
parallelize computationally intensive IMMUTABLE functions. But already 
with STABLE ones I'm getting suspicious. It would require users to write 
truly thread-safe (well, multi-process-safe) functions, which I doubt 
somewhat. Granted, they theoretically *should* be safe, but...


Anyway, if that's the only show stopper, one could certainly tell helper 
backends to substitute their superuser privileges with the invoker's 
privileges. Not sure if that's worth the trouble, though.


Regards

Markus




Re: [HACKERS] VLDB Features

2007-12-13 Thread Markus Schiltknecht

Hello Gregory,

Gregory Stark wrote:

Oracle is using Direct I/O so they need the reader and writer threads to avoid
blocking on i/o all the time. We count on the OS doing readahead and buffering
our writes so we don't have to. Direct I/O and needing some way to do
asynchronous writes and reads are directly tied.


Yeah, except in cases where we can predict non-sequential reads ahead of 
time. Which admittedly doesn't come up too frequently and can probably be 
handled with posix_fadvise - as you are currently testing.



Where Parallel query is useful is when you have queries that involve a
substantial amount of cpu resources, especially if you have a very fast I/O
system which can saturate the bandwidth to a single cpu.


Full ACK, the very same applies to parallel querying on shared-nothing 
clusters. Those can help if the bandwidth to all processing cores 
together becomes the bottleneck (and the resulting data is relatively 
small compared to the input data).


For example, Sun's UltraSparc T2 features only 8 PCIe lanes for those 8 
cores, so you end up with 250 MiB/sec per core or about 32 MiB/sec per 
thread on average. To be fair: their 10 Gig Ethernet ports don't go via 
PCIe, so you get an additional 2x 1 GiB/sec for the complete chip. And 
memory bandwidth looks a lot better: Sun claims 60+ GiB/sec, leaving 
almost 8 GiB/sec per core or 1 GiB/sec per thread.


If my calculations for Intel are correct, a Quad Xeon with a 1.33 GHz 
FSB has around 21 GiB/sec throughput to main memory, giving 5 GiB/sec 
per core. (Why are these numbers so hard to find? It looks like Intel 
deliberately obfuscates them with FSB MHz or Giga-transactions per sec 
and the like.)
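
For reference, the back-of-the-envelope conversion I used goes roughly 
like this (the assumption of two independent 8-byte-wide buses at 1333 
MT/s is mine, reverse-engineered from the ~21 GiB/s figure, not taken 
from an Intel datasheet):

  2 buses x 1333 MT/s x 8 bytes ~= 21.3 GB/s to main memory
  21.3 GB/s / 4 cores           ~= 5.3 GB/s per core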


Regards

Markus




Re: [HACKERS] VLDB Features

2007-12-12 Thread Markus Schiltknecht

Hi,

Josh Berkus wrote:

Here's the other VLDB features we're missing:

Parallel Query


Uh.. this only makes sense in a distributed database, no? I've thought 
about parallel querying on top of Postgres-R. Does it make sense to 
implement some form of parallel querying apart from the distribution 
or replication engine?



Windowing Functions


Isn't Gavin Sherry working on this? Haven't read anything from him lately...

Parallel Index Build (not sure how this works exactly, but it speeds Oracle 
up considerably)


Sounds interesting *turns-away-to-google*


On-disk Bitmap Index (anyone game to finish GP patch?)


Does anybody have an idea of what's missing there (besides good use cases, 
which some people doubt)? Again: Gavin?



Simon, we should start a VLDB-Postgres developer wiki page.


Thanks, Simon, wiki page looks good!

Regards

Markus




Re: [HACKERS] VLDB Features

2007-12-12 Thread Markus Schiltknecht

Hi Josh,

Josh Berkus wrote:
Sure.  Imagine you have a 5TB database on a machine with 8 cores and only one 
concurrent user.  You'd like to have 1 core doing I/O, and say 4-5 cores 
dividing the scan and join processing into 4-5 chunks.


Ah, right, thanks for the enlightenment. Heck, I'm definitely too focused on 
replication and distributed databases :-)


However, there's certainly a great deal of intersection between 
parallel processing on different machines and parallel processing on 
multiple CPUs - especially considering NUMA architectures. 
*comes-to-think-again*...



Isn't Gavin Sherry working on this? Haven't read anything from him
lately...


Me neither.  Swallowed by Greenplum and France.


Hm.. good for him, I guess!

Regards

Markus



Re: [HACKERS] Ordered Append Node

2007-11-23 Thread Markus Schiltknecht

Hello Gregory,

Gregory Stark wrote:

It is kind of like a merge join but not quite. It's interleaving rows rather
than matching them up. It's more like the final merge of a sort which also
uses a heap to efficiently find the next value from the source tapes.


Well, maybe my point here is: why do you need the heap to sort? The 
inputs you've got are already sorted, you only need to merge them. To me 
this sounds very much like the final stage of a merge sort, which would 
not require much memory.


IMO, a merge sort could more easily be implemented by a binary-tree zipper 
node, as I had in mind, leading to a plan like this (well, hey, this is 
all made up):


Zipper  (cost..., sort by public.t.i)
  ->  Zipper  (cost .., sort by public.t.i)
        ->  Zipper  (cost .., sort by public.t.i)
              ->  Index Scan using ti1 on t1
              ->  Index Scan using ti2 on t2
        ->  Index Scan using ti3 on t3
  ->  Zipper  (cost .., sort by public.t.i)
        ->  Index Scan using ti4 on t4
        ->  Index Scan using ti5 on t5

Every zipper node would simply have to keep the two top tuples from its 
inputs in memory, compare them and return the best.


But maybe that's exactly how Knuth's polyphase merge sort internally 
merges its inputs (or runs) as well. And perhaps it makes sense to show 
the user just one simple append node instead of throwing a tree of 
Zipper nodes at her. ;-)



Not necessarily but it is something Postgres supports and I don't think we
want to break it. Actually it's useful for partitioned tables if you build the
new partition in a separate table and then add it to the partitioned table. In
that case you may have gone through several steps of adding columns and
dropping them to get the structure to line up.


Agreed, especially because lining up the columns isn't that hard after all.

OTOH I think Postgres is way too flexible in how it allows partitioning 
to be done and thus it often can't optimize it properly. I'd very much 
like to teach it a stricter and simpler to use partitioning scheme than 
what we have with constraint exclusion.


But that's pipe dreaming, and your improvement to the append node is 
certainly a good step in the right direction, keep up the good work!


Regards

Markus




Re: [HACKERS] Ordered Append Node

2007-11-23 Thread Markus Schiltknecht

Hi,

Florian Weimer wrote:

Florian Weimer wrote:

I think you need it because there are potentially many input types.


Eh, tapes.


Aha..


Given the partitioning case, I'd expect all rows to have an equal
tuple descriptor. Maybe this is a matter of what to optimize, then?

Could you elaborate on what use case you have in mind?


You need a priority queue to figure out from which tape (partition)
you need to remove the next tuple.


And why do you need lots of heap memory to do that? Anything wrong with 
the zipper approach I've outlined upthread?


Am I missing something?

Regards

Markus






Re: [HACKERS] Ordered Append Node

2007-11-23 Thread Markus Schiltknecht

Hi,

Florian Weimer wrote:

I think you need it because there are potentially many input types.


Given the partitioning case, I'd expect all rows to have an equal tuple 
descriptor. Maybe this is a matter of what to optimize, then?


Could you elaborate on what use case you have in mind?

Regards

Markus



Re: [HACKERS] Ordered Append Node

2007-11-23 Thread Markus Schiltknecht

Gregory Stark wrote:

Not quite the same since the Executor-based implementation would have a static
tree structure based on the partitions. Even if the partitions are all empty
except for one or two you would still have to push the result records through
all the nodes for the empty partitions.

A heap only has the next record from each input. If an input has reached eof
then the heap doesn't have an entry for that input. So if one of the inputs is
empty (often the parent of the inheritance tree) it doesn't require a test
anywhere to propagate the record up past it.


Ah, so the binary tree (binary heap?) gets adjusted dynamically. Very 
clever! (OTOH probably a micro-optimization, but as the code is already 
there, use it, yeah!)



I also did an optimization similar to the bounded-sort case where we check if
the next tuple from the same input which last contributed the result record
comes before the top element of the heap. That avoids having to do an insert
and siftup only to pull out the same record you just inserted. In theory this
is not an optimization but in practice I think partitioned tables will often
contain contiguous blocks of key values and queries will often be joining
against that key and therefore often want to order by it.


Hm... that assumes range partitioning, no? If you partition among three 
partitions by id modulo 3, tuples are most probably coming from one 
partition after the other, i.e.:

  1 2 3 1 2 3 1 2 3 ...

And with hash partitioning, you're completely unable to predict the 
ordering.



Ideally we would also be able to do this in the planner. If the planner could
determine from the constraints that all the key values from each partition are
non-overlapping and order them properly then it could generate a regular
append node with a path order without the overhead of the run-time
comparisons.


Agreed.


But that requires a) dealing with the problem of the parent table which has no
constraints and b) an efficient way to prove that constraints don't overlap
and order them properly. The latter looks like an O(n^2) problem to me, though
it's a planning problem which might be worth making slow in exchange for even
a small speedup at run-time.


Well, I think someday, Postgres needs better support for vertical data 
partitioning in general. Dealing with constraints and inheritance is way 
too flexible and prone to error. I'll shortly start a new thread about 
that, to outline my current thoughts about that topic.


Regards

Markus



Re: [HACKERS] Ordered Append Node

2007-11-23 Thread Markus Schiltknecht

Hi,

Heikki Linnakangas wrote:
AFAICT it's roughly the same data structure as the zipper tree you 
envisioned, but not implemented with separate executor nodes for each 
level.


Aha, that sounds good to me. ;-)

As I've already mentioned, I think it's even better to simply show the 
user a single append node, instead of lots of small nodes from a binary 
tree merger.


Regards

Markus



Re: [HACKERS] Ordered Append Node

2007-11-22 Thread Markus Schiltknecht

Hello Gregory,

Gregory Stark wrote:

I've been hacking on the idea of an Append node which maintains the ordering
of its subtables merging their records in order. This is important for
partitioned tables since otherwise a lot of plans are not available such as
merge joins.


Cool!

Some time ago, I tried something very similar, but didn't get 
very far. I'd like to share my thoughts anyway.


First off, I envisioned that append node to be applicable also to 
unordered reading from multiple partitions, simply interleaving between 
the partitions.



The logic I've followed is to do as follows:

1) Go through all subrels asking for any interesting pathkey lists. Gather up
   the union of all of these.


I also tried to modify the Append node first, then figured that it might 
be better to base it on the merge join node instead. While this seems 
farther away, I had the hope that a binary tree of such 'plain merge' 
nodes would require fewer comparisons in total.


Plus, I found it a lot simpler to cope with exactly two input relations 
instead of n, as with the append node. :-)



2) Go back through all the subrels and for each accumulated pathkey list ask
   for the best path for that subrel for that pathkey list. 


3) Generate an two append paths for each of these pathkey lists, one using the
   best startup_cost subpath and one with the best total_cost subpath
   available for the pathkey list for each child rel. If there is none
   available take the best unordered path and put a sort node above it.

4) Additionally advertise the traditional unordered append node which our
   parent could choose to put a sort node above same as ever.

5) Append plan nodes look a lot like Sort plan nodes glued onto an Append plan
   node, with sort function oids and so on.

6) Append exec state nodes look a lot like a (a few fields from)
   tuplesortstate glued onto an Append node. They have the ScanKey array and
   the fmgr sort functions. They also have an array of TupleTableSlot and a
   heap of indexes into that array.

8) ExecAppend does something similar to tuplesort's bounded sort (I fear I'm
   going to get typecasted) or more to the point, similar to the final merge
   of a merge sort. It directly returns the slot the child rel last returned
   to it.


Uh.. well, yeah. I guess you have a better understanding of the planner 
and executor than I do.



Open questions:

1) I still haven't completely figured out what to do with equivalence classes.
   My hack of just stuffing all the append subrel vars into there seems to
   work fine. I need to understand what's going on to see if there's really a
   problem with it or not.

2) I'm not sure this code will work when the append rel is a target (ie UPDATE
   and DELETE stmts).

3) It does seem to work when the columns in the subrels don't line up but I
   didn't do anything special to handle this case.


Uh.. is there any use case for that? WRT partitioning certainly not, is 
there?


I'll have a look at your patch.

Regards

Markus




Re: [HACKERS] Raw device I/O for large objects

2007-09-18 Thread Markus Schiltknecht

Hi,

Georgi Chulkov wrote:

Please allow me to ask then:
1. In your opinion, would the above scenario indeed benefit from a raw-device 
interface for large objects?


No, because file systems also try to do what you outline above. They 
certainly don't split sequential data up into blocks and distribute them 
randomly over the device, at least not without having a pretty good 
reason to do so (with which you'd also have to fight).


The possible gain achievable is pretty minimal, especially in 
conjunction with a (hopefully battery backed) write cache.


2. How feasible it is to decouple general table storage from large object 
storage?


I think that would be the easiest part. I would go for a pluggable 
storage implementation, selectable per tablespace. But then again, I 
wouldn't do it at all. After all, this is what MySQL is doing. And we 
certainly don't want to repeat their mistakes! Or do you know anybody 
who goes like: "Yippee, multiple storage engines to choose from for my 
(un)valuable data, let's put some here and others there"...


Let's optimize the *one* storage engine we have and try to make that 
work well together with the various filesystems it uses. Because 
filesystems are already very good at what they are used for. (And we are 
glad we can use a filesystem and don't need to implement one ourselves.)


Regards

Markus




Re: [HACKERS] [COMMITTERS] pgsql: Fix up text concatenation so that it accepts all the reasonable

2007-09-17 Thread Markus Schiltknecht

Hi,

Gregory Stark wrote:

Perhaps if you're doing some form of replication between different
architectures you might want to use binary representation for your transfers.
Or if you're doing something in a PL language like compressing or bundling up
multiple data in a container format or something.


I'm not quite sure I understand where you're coming from, but isn't that 
what the send and recv functions for most data types are for? Don't 
those provide a cross-platform compatible binary representation?


Regards

Markus



Re: [HACKERS] [COMMITTERS] pgsql: Fix up text concatenation so that it accepts all the reasonable

2007-09-17 Thread Markus Schiltknecht

Hello Tom,

Tom Lane wrote:

I think you'd be nuts to bet your data on the binary representations
really being cross-platform compatible.


Can you elaborate on this? AFAICT the send/recv functions use network 
byte ordering. What are the other problems between different architectures?



There might be some excuse for
doing this within a single architecture, but I can't get very excited
about it ...


Is the textual representation (i.e. OidOutputFunctionCall) more cross 
platform compatible?


Gregory Stark wrote:
 Well they're not very useful for their intended purpose of
 client-server communication if they're not.

Agreed. Up until now, I'd have considered it a bug if send/recv on 
different platforms did not lead to the very same result.


Regards

Markus




Re: [HACKERS] [COMMITTERS] pgsql: Fix up text concatenation so that it accepts all the reasonable

2007-09-17 Thread Markus Schiltknecht

Hi,

Tom Lane wrote:

Well, you're probably fine with integers and text, but beyond that it
gets a bit more dicey.  I wouldn't want to assume that floats are
compatible across any random pair of architectures, and in the more
complex datatypes (such as arrays or geometric types) there might be
some impact from alignment rules.  (Or not, I'm too lazy to go look at
the moment.)


Okay, thank you for the hints, I'll go figure it out.

Regards

Markus



Re: [HACKERS] terms for database replication: synchronous vs eager

2007-09-14 Thread Markus Schiltknecht

Hello Jan,

thank you for your feedback.

Jan Wieck wrote:

On 9/7/2007 11:01 AM, Markus Schiltknecht wrote:
This violates the common understanding of synchrony, because you can't 
commit on a node A and then query another node B and expect it to be 
coherent immediately.


That's right. And there is no guarantee about the lag at all. So you can 
find old data on node B long after you committed a change to node A.


I'm in doubt about the "long after". In practice you'll mostly have 
nodes which perform about equally fast. And as the origin node has to do 
more processing than a node which solely replays a transaction, it's 
trivial to balance the load.


Additionally, a node which lags behind is unable to commit any 
(conflicting) local transactions before having caught up (due to the GCS 
total ordering). So this is even somewhat self regulating.


Postgres-R is an asynchronous replication system by all means. It only 
makes sure that the workset data (that's what Postgres-R calls the 
replication log for one transaction)


It's most often referred to as the writeset.

has been received by a group 
communication system supporting total order and that the group 
communication system decided it to be the transaction that (logically) 
happened before any possibly conflicting concurrent transaction.


Correct. That's as far as the Postgres-R algorithm goes.

I should have been more precise about what I'm talking about, as I'm 
continuing to develop Postgres-R (the software). That might be another 
area where a new name should be introduced, to differentiate between 
Postgres-R the original algorithm and my continued work on the 
software implementing the algorithm.


This is the wonderful idea how Postgres-R will have a failsafe conflict 
resolution mechanism in an asynchronous system.


I don't know what you associate with the word eager.


I'm speaking of the property that a transaction is replicated before 
commit, so as to avoid later conflicts. IMO, this is the only real 
requirement people have when requesting synchronous replication: most 
people don't need synchrony, but they need reliable commit guarantees.


I've noticed that you are simply speaking of a failsafe conflict 
resolution mechanism. I dislike that description, because it does not 
say anything about *when* the conflict resolution happens WRT commit. 
And there may well be lazy failsafe conflict resolution mechanisms 
(e.g. for a counter), which reconcile after commit.


I'd like to have a simple term, so that we could say: you probably don't 
need fully synchronous replication, but eager replication may already 
serve you well.


All I see is that 
Postgres-R makes sure that some other process, which might still reside 
on the same hardware as the DB, is now in charge of delivery. 


..and Postgres-R waits until that other process confirms the delivery, 
whatever exactly that means. See below.


This delay before commit is important. It is what makes Postgres-R 
eager, according to my definition of it. I'm open for better terms.


Nobody 
said that the GC implementation cannot have made the decision about the 
total order of two workset messages and already reported that to the 
local client application before those messages ever got transmitted over 
the wire.


While this is certainly true in theory, it does not make sense in 
practice. It would mean letting the GCS decide on a message ordering 
without having delivered the messages to be ordered. That would be 
troublesome for the GCS, because it could lose an already ordered 
message. Most GCSs start their ordering algorithm by sending out the 
message to be ordered.


Anyway, as I've described on -hackers before, I'm intending to decouple 
replication from log writing. Thus not requiring the GCS to provide any 
delivery guarantees at all (GCSs are complicated enough already!). That 
would allow the user to decouple transaction processing nodes from log 
writing nodes. Those tasks have different I/O requirements anyway. And 
what would more than two or three replicas of the transaction logs be good
good for anyway? Think of them as an efficient backup - you won't need 
it until your complete cluster goes down.


Regards

Markus




Re: [HACKERS] terms for database replication: synchronous vs eager

2007-09-14 Thread Markus Schiltknecht

Hi,

Chris Browne wrote:

The approach that was going to be taken, in Slony-II, to apply locks
as early as possible so as to find conflicts as soon as possible,
rather than waiting, seems eager to me.


Agreed. WRT locking, one might also call it pessimistic, but that 
sounds so... negative.


I find the "as soon as possible" bit rather weak; instead, it's exactly 
before the origin node confirms the commit. Of course, only conflicts which 
could possibly lead to an abort of the transaction in question are taken 
into account. A possible definition may be:


  Eager replication systems only confirm the commit of a transaction
   after they have checked for cross-node conflicts which could require
   the transaction to abort.  (Lazy systems, in contrast, may confirm the
   commit before doing so.)

Note how much less restrictive that definition is than that of a fully 
synchronous system.


 But I'm not sure to what extent that notion has been drawn into the
 Postgres-R work...

My current variant of Postgres-R goes the very same path, using MVCC 
instead of locking wherever possible (with the very same effect, but 
allowing more concurrency :-) ).


Regards

Markus




Re: [HACKERS] WIP patch for latestCompletedXid method of computing snapshot xmax

2007-09-08 Thread Markus Schiltknecht

Hello Tom,

Tom Lane wrote:

So on the strength of that, I'm going to go ahead and commit the patch,
but I'd be interested to see benchmarks from people with access to
better hardware.


I've just completed two dbt2 test runs on a mid-level system with 4GB 
RAM and a 7-disk SATA RAID 1+0 w/ BBU: once with code as of 2007/09/05 
18:00 (which is *before* the first lazy xid commit) and once with CVS 
HEAD (2007/09/08) + latestCompletedXid.patch.


Here are the results from the first test run (test run 33, without lazy 
xid):


$ cat 33/driver/results.out 
                                 Response Time (s)
 Transaction      %    Average :    90th %        Total   Rollbacks      %
------------  -----  ---------------------  -----------  ----------  -----
    Delivery   3.97      3.745 :     7.844        11844           0   0.00
   New Order  45.35      3.844 :     7.692       135192        1352   1.01
Order Status   3.95      2.728 :     6.371        11764           0   0.00
     Payment  42.74      2.649 :     6.349       127415           0   0.00
 Stock Level   4.00      2.172 :     5.634        11915           0   0.00
------------  -----  ---------------------  -----------  ----------  -----

1103.45 new-order transactions per minute (NOTPM)
120.1 minute duration
0 total unknown errors
1003 second(s) ramping up


And that's with HEAD +latestCompletedXid.patch (test run 34):

$ cat 34/driver/results.out 
                                 Response Time (s)
 Transaction      %    Average :    90th %        Total   Rollbacks      %
------------  -----  ---------------------  -----------  ----------  -----
    Delivery   3.96      3.843 :     8.223        11760           0   0.00
   New Order  45.28      4.049 :     8.451       134398        1300   0.98
Order Status   3.97      2.877 :     6.815        11777           0   0.00
     Payment  42.80      2.745 :     6.718       127027           0   0.00
 Stock Level   4.00      2.280 :     6.129        11859           0   0.00
------------  -----  ---------------------  -----------  ----------  -----

1097.71 new-order transactions per minute (NOTPM)
120.1 minute duration
0 total unknown errors
1003 second(s) ramping up


Both tests ran for two hours, with 100 warehouses and 50 connections. 
shared_buffers was set to 1024MB and effective_cache_size to 3800MB; all 
other settings were left at their defaults.
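
In postgresql.conf terms, the only non-default settings were thus (just a 
restatement of the above, nothing else was touched):

  shared_buffers       = 1024MB
  effective_cache_size = 3800MB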


Regards

Markus



Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread Markus Schiltknecht

Hi,

apoc9009 wrote:

Thadt is Replication NOT Backup


I've now read all of your messages in this thread, but I simply fail to 
understand why you are so opposed to the term 'replication'. I 
think the only thing which comes anywhere close to what you're looking for is 
replication (in particular, eager multi-master replication).


I'd recommend you familiarize yourself with the world of database 
replication. You already know the important chapter from our manual, 
learn that by heart. Then read [1] and [2]. :-)


Regards

Markus

[1]: Postgres advocacy wiki:
http://developer.postgresql.org/index.php/Replication%2C_Clustering%2C_and_Connection_Pooling

[2]: Terms and Definitions of Database Replication
http://www.postgres-r.org/documentation/terms




Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread Markus Schiltknecht

Hi,

apoc9009 wrote:

Translation for you:
A Backup is a File or Set of Files thadt contains the Data of your 
Business critical Informations.
It should not be Archived on the same place, the same House or the same 
Room.


I disagree, a backup does not necessarily have to be a single file or a 
set of files. Wikipedia has this definition:


  backup refers to making copies of data so that these additional
   copies may be used to restore the original after a data loss event.

While for replica, it states:

  replica is a copy that is relatively indistinguishable from the
   original

Thus a backup can very well be thought of as a replica, and vice versa.

A Replication Database has nothing to do with a Backup, it works only 
for Failover if the Primary

Database has a Mailfunction.


That's certainly plain wrong, see multi-master replication where 
failover doesn't make any sense. Wikipedia again (although that's 
unfair, as I've contributed to that definition myself) [1]:


  Replication is the process of sharing information so as to ensure
   consistency between redundant resources

..for example a master database and a backup.


Keep in Mind: Backup is NOT Replication!
Write it down 100 times and maybe you understand


A backup IS a replica. A backup IS a replica. A backup IS a replica. A 
backup IS a replica...


Regards

Markus

[1]: http://en.wikipedia.org/wiki/Replication_%28computer_science%29



[HACKERS] terms for database replication: synchronous vs eager

2007-09-07 Thread Markus Schiltknecht

Hi,

I'm asking for advice and hints regarding terms in database replication, 
especially WRT Postgres-R. (Sorry for crossposting, but I fear not 
reaching enough people on the Postgres-R ML alone)


I'm struggling with how to classify the Postgres-R algorithm. Up until 
recently, most people thought of it as synchronous replication, but it's 
not synchronous in the strong (and very common) sense. I.e. after a node 
confirms to have committed a transaction, other nodes haven't necessarily 
committed yet. (They only promise that they *will* commit without 
conflicts.)


This violates the common understanding of synchrony, because you can't 
commit on a node A and then query another node B and expect it to be 
coherent immediately.


Nonetheless, Postgres-R is eager (or pessimistic?) in the sense that 
it replicates *before* committing, so as to avoid divergence. In [1] 
I've tried to make that distinction clear, and I'm currently advocating 
for using synchronous only in the very strong (and commonly used) sense. 
I've chosen the word 'eager' to mean 'replicates before committing'.


According to these definitions, Postgres-R is async but eager.

Do these definitions violate any common meaning? Maybe in other areas 
like distributed storage or lock managers?


Regards

Markus

[1]: Terms and Definitions of Database Replication
http://www.postgres-r.org/documentation/terms



Re: [HACKERS] PG-MQ?

2007-06-20 Thread Markus Schiltknecht

Hi Chris,

Chris Browne wrote:

I'm seeing some applications where it appears that there would be
value in introducing asynchronous messaging, ala message queueing.
http://en.wikipedia.org/wiki/Message_queue


ISTM that 'message queue' is way too general a term. There are hundreds 
of different queues at different levels on a standard server. So I'm 
somewhat unsure about what problem you want to solve.



c) There are lesser names, like isectd http://isectd.sf.net and the
(infamous?) Spread Toolkit which both implement memory-based messaging
systems.


If a GCS is about what you're looking for, then you also might want to 
consider these: ensemble, appia or jGroups. There's a Java layer called 
jGCS, which supports even more, similar systems.


Another commonly used term is 'reliable multicast', which guarantees 
that messages are delivered to a group of recipients. These algorithms 
often are the basis for a GCS.



My bias would be to have something that can basically run as a thin
set of stored procedures atop PostgreSQL :-).  It would be trivial to
extend that to support SOAP/XML-RPC, if desired.


Hm.. in Postgres-R I currently have (partial) support for ensemble and 
spread. Exporting that interface via stored procedures could be done, 
but you would probably need a manager process, as you certainly want 
your connections to persist across transactions (or not?).


Together with that process, we already have half of what Postgres-R is: 
an additional process which connects to the GCS. Thus I'm questioning 
whether there's value in exporting the interface. Can you think of use 
cases other than database replication? Why do you want to do that via the 
database, then, and not directly with the GCS?



It would be nice to achieve 'higher availability' by having queues
where you might replicate the contents (probably using the MQ system
itself ;-)) to other servers.


Uhm.. sorry, but I fail to see the big news here. Which replication 
solution does *not* work that way?


Regards

Markus




Re: [HACKERS] COPYable logs status

2007-06-09 Thread Markus Schiltknecht

Hi,

Tom Lane wrote:

We *have* a log-writing process.  The problem is in getting the data to it.


Remember the imessages approach I'm using for Postgres-R? It passes 
messages around using shared memory and signals the receiver on incoming 
data. It's not perfect, sure, but it's a general solution to a common 
problem.


Maybe it's worth a thought, instead of fiddling with signals, special 
shmem areas and possible races every time the 'getting data to another 
process' problem comes up?


Regards

Markus




Re: [HACKERS] COPYable logs status

2007-06-09 Thread Markus Schiltknecht

Tom Lane wrote:

Markus Schiltknecht [EMAIL PROTECTED] writes:

Tom Lane wrote:

We *have* a log-writing process.  The problem is in getting the data to it.


Remember the imessages approach I'm using for Postgres-R? It passes 
messages around using shared memory and signals the receiver on incoming 
data. It's not perfect, sure, but it's a general solution to a common 
problem.


Uh-huh.  And how will you get libc's dynamic-link code to buy into
issuing link error messages this way?  Not to mention every other bit
of code that might get linked into the backend?


I was referring to the 'getting data to another process' problem. If 
that's the problem (as you said upthread), imessages might be a solution.



Trapping what comes out of stderr is simply too useful a behavior to lose.


Sure. I've never said anything against that.

Regards

Markus



Re: [HACKERS] Hacking on PostgreSQL via GIT

2007-04-19 Thread Markus Schiltknecht

Hi,

Alvaro Herrera wrote:

Which is not always what happens in reality.  Consider for example that
we borrowed some files from NetBSD, OpenBSD, Tcl, zic and others.  It
would be nice to know exactly at what point we borrowed the file, so we
can go to the upstream repo and check if there's any bug fix that we
should also apply to our local copy.  And we _also_ modify locally the
file of course, so just digesting the file we have to get a SHA1 (or
whatever) identifier is not an option.


I consider such information (i.e. 'where is this file coming from') to 
be historical information. As such, this information clearly belongs to 
the VCS sphere and should be tracked and presented by the VCS.


Advanced VCSes can import files from other projects and properly track 
those files or propagate changes on request. Even subversion can do that to 
some extent. My point here is: given a decent VCS, you don't need such 
historical information as often as you do with CVS. You can sit back and 
let the VCS do the job. (Like looking up when the last 'import' of the 
file from the external project happened, seeing what changed, and merging 
those changes back into your (locally modified variant of the) file.) And if 
you really want to dig into the history of your project, you can ask the 
VCS, which you are going to need anyway for other historical information.


Regards

Markus



Re: [HACKERS] Hacking on PostgreSQL via GIT

2007-04-19 Thread Markus Schiltknecht

Hi

Jim C. Nasby wrote:

I understand the argument about metadata and all, and largely agree with
it. But on the other hand I think a version identifier is a critical
piece of information; it's just as critical as the file name when it
comes to identifying the information contained in the file.


If you really want the files in your releases to carry a version 
identifier, you should let your release process handle that. But often 
enough, people can't even tell the exact PostgreSQL version they are 
running. How do you expect them to be able to tell you what version a 
single file has?


For the developers: they have all the history the VCS offers them. There 
are tags to associate a release with a revision in your repository. And 
because a decent VCS can handle all the diff'ing, patching and merging 
you normally need, you shouldn't ever have to process files outside of 
your repository.


So what exactly is the purpose of a version identifier within the file's 
contents? Who would such a thing be good for?


Regards

Markus



Re: [HACKERS] Auto Partitioning

2007-04-06 Thread Markus Schiltknecht

Simon Riggs wrote:

i.e. if we have partitions for each year (2001, 2002, 2003, 2004, 2005,
2006, 2007) AND we have already proved that 2005 is excluded when we
have a WHERE clause saying year = 2006, then we should be able to use
the ordering to prove that partitions for 2004 and before are also
automatically excluded.


Provided you've set up the right constraints, the current 
constraint_exclusion feature does exactly that, no?
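
A minimal sketch of what I mean, using the yearly example (table and 
column names are made up):

  CREATE TABLE sales (year int, amount numeric);
  CREATE TABLE sales_2005 (CHECK (year = 2005)) INHERITS (sales);
  CREATE TABLE sales_2006 (CHECK (year = 2006)) INHERITS (sales);

  SET constraint_exclusion = on;
  -- the plan only touches sales and sales_2006; sales_2005 is
  -- proven irrelevant by its CHECK constraint and dropped:
  EXPLAIN SELECT * FROM sales WHERE year = 2006;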



I'll think some more about the Merge node, but not right now.


I've looked at nodeAppend.c and nodeMergeJoin.c. Probably we can use 
much of nodeMergeJoin, just without the join? Instead returning the 
tuples as they are, but in the correct order. The nodeMergeJoin code can 
only handle two inputs (a left and a right node), but it might be 
beneficial to structure multiple merge nodes into a binary tree layout 
anyway. (I'm guessing that might reduce the number of comparisons needed).


What do you think?

Regards

Markus



Re: [HACKERS] Auto Partitioning

2007-04-05 Thread Markus Schiltknecht

Hi,

Zeugswetter Andreas ADI SD wrote:



CREATE INDEX x ON test(a, b, c);

isn't the same as

CREATE INDEX x ON test(c, b, a);


That is only a problem if you also want to avoid a sort (e.g. for an
order by),


..or if you want to use that index for 'WHERE a = 5'. The first one is 
probably helping you, the second isn't.



(an example would be a query where c=5 and b between 0 and 20
and two partitions, one for 0 <= b < 10 and a second for 10 <= b)


Hm.. in that case, an index on (a, b, c) wouldn't help. An index on (c, 
b, a) would be just perfect, agreed?


Now, for the partitioning: you simply have to scan two partitions in 
that case, no matter how you arrange your indexes. And this is where we 
need some sort of multi-table index scan functionality. (I'm not saying 
a multi-table index. Such a thing would be too large on disk. That 
functionality should probably better be realized by using the underlying 
per-table indexes).


That's why I'd say, the first columns of an index would have 
to be equal to all of the columns used in the partitioning key.


I correct my own statement somewhat, here: only in that case, a single 
table index can satisfy your request. For other cases, you'd have to 
query more than one partition's indexes and mix them correctly to 
maintain the right order, if required.



No. It may change performance in some situations, but it is not needed
for unique constraints.


Agreed, for unique constraints. But indexes are used for some more 
things than just unique constraints checking. ;-)


Regards

Markus




Re: [HACKERS] Auto Partitioning

2007-04-05 Thread Markus Schiltknecht

Hi,

Martijn van Oosterhout wrote:
The executor would have to be clever enough to not do a single index 
scan, but possibly scan through multiple indexes when asking for 
uniqueness, depending on the partitioning rule set.


But it's not the executor that checks uniqueness, it's built into the
btre code.


Well, it's the executor calling into the btree code. Couldn't the 
executor choose which (btree-) indexes to query for uniqueness?



If someone manages to crack uniqueness for GiST indexes, we'll have our
answer, since it has exactly the same problem but on a different scale.
(Or vice-versa, if some gets uniqueness for multiple indexes, we can do
it for GiST also).


Uh.. can you elaborate on that? AFAICS, you would simply have to query 
multiple btree indexes and make sure none of them is violated. How would 
that help making unique GiST indexes possible? What's the problem there?


Regards

Markus



Re: [HACKERS] Auto Partitioning

2007-04-04 Thread Markus Schiltknecht

Hi,

NikhilS wrote:

The following things are TODOs:

iv) Auto generate rules using the checks mentioned for the partitions, to
handle INSERTs/DELETEs/UPDATEs to navigate them to the appropriate child.
Note that checks specified directly on the master table will get 
inherited

automatically.


Am planning to do the above by using the check constraint specified for 
each
partition. This constraint's raw_expr field ends up becoming the 
whereClause

for the rule specific to that partition.
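
For reference, a hand-written sketch of the kind of rule I understand 
would get auto-generated from such a check constraint (table and 
partition names invented):

  CREATE TABLE measurement (city_id int, logdate date, peaktemp int);
  CREATE TABLE measurement_2007m01
      (CHECK (logdate >= DATE '2007-01-01' AND logdate < DATE '2007-02-01'))
      INHERITS (measurement);

  -- The partition's CHECK expression reappears as the rule's WHERE clause,
  -- routing matching INSERTs on the master to the child table.
  CREATE RULE measurement_insert_2007m01 AS
      ON INSERT TO measurement
      WHERE (logdate >= DATE '2007-01-01' AND logdate < DATE '2007-02-01')
      DO INSTEAD
      INSERT INTO measurement_2007m01 VALUES (NEW.*);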


I appreciate your efforts, but I'm not sure if this has been discussed 
enough. There seem to be two ideas floating around:


 - you are heading for automating the current kludge, which involves
   creating partitions and constraints by hand. AFAICT, you want to
   support list and range partitioning.

 - Simon Riggs has proposed partitioning functions, which could easily
   handle any type of partitioning (hash, list, range and any mix of
   those).

Neither proposal has much to do with the missing multi-table 
indices. It's clear to me that we have to implement those someday, anyway.


AFAICT, the first proposal does not ease the task of writing correct 
constraints, so that we are sure that each row ends up in only exactly 
one partition. The second would.


But the second proposal makes it hard for the planner to choose the 
right partitions, i.e. if you request a range of ids, the planner would 
have to query the partitioning function for every possible value. The 
first variant could use constraint exclusion for that.


Neither of the two has gone as far as thinking about switching from one 
partitioning rule set to another. That gets especially hard if you 
consider database restarts during re-partitioning.



Here are some thought I have come up with recently. This is all about 
how to partition and not about how to implement multi-table indices. 
Sorry if this got somewhat longish. And no, this is certainly not for 
8.3 ;-)


I don't like partitioning rules, which leave open questions, i.e. when 
there are values for which the system does not have an answer (and would 
have to fall back to a default) or even worse, where it could give 
multiple correct answers. Given that premise, I see only two basic 
partitioning types:


 - splits: those can be used for what's commonly known as list and range
   partitioning. If you want customers A-M to end up on partition 1 and
   customers N-Z on partition 2 you would split between M and N. (That
   way, the system would still know what to do with a customer name
   beginning with an @ sign, for example. The only requirement for a
   split is that the underlying data type supports comparison
   operators.)

 - modulo: I think this is commonly known as hash partitioning. It
   requires an integer input, possibly by hashing, and calculates the
   remainder of a division by n. That should give an equal distribution
   among n partitions.

Besides the expression to work on, a split always needs one argument, 
the split point, and divides into two buckets. A modulo splits into two 
or more buckets and needs the divisor as an argument.


Of course, these two types can be combined. I like to think of these 
combinations as trees. Let me give you a simple example:


  table customers
        |
        |
  split @ name = 'N'
       /     \
      /       \
  part1      part2



A combination of the two would look like:

  table invoices
        |
        |
   split @ id = 5
       /     \
      /       \
  hash(id) modulo 3    part4
    /    |    \
   /     |     \
 part1  part2  part3
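
Just to make that structure tangible, a rough sketch of how this 
particular tree could be emulated today with inheritance and CHECK 
constraints (names invented; ids assumed non-negative so a plain modulo 
can stand in for the hash):

  CREATE TABLE invoices (id int, total numeric);

  -- first level: the split @ id = 5
  CREATE TABLE invoices_low   (CHECK (id < 5))  INHERITS (invoices);
  CREATE TABLE invoices_part4 (CHECK (id >= 5)) INHERITS (invoices);

  -- second level: the modulo-3 node under the low branch; the id < 5
  -- check is inherited from invoices_low, only the remainder is added.
  CREATE TABLE invoices_part1 (CHECK (id % 3 = 0)) INHERITS (invoices_low);
  CREATE TABLE invoices_part2 (CHECK (id % 3 = 1)) INHERITS (invoices_low);
  CREATE TABLE invoices_part3 (CHECK (id % 3 = 2)) INHERITS (invoices_low);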


Knowledge of these trees would allow the planner to choose more wisely, 
i.e. given a comparative condition (WHERE id < 10) it could check 
the splits in the partitioning tree and only scan the partitions 
necessary. Likewise with an equality condition (WHERE id = 1234).


As it's a better definition of the partitioning rules, the planner would 
not have to check constraints of all partitions, as the current 
constraint exclusion feature does. It might even be likely that querying 
this partitioning tree and then scanning the single-table index will be 
faster than an index scan on a multi-table index. At least, I cannot see 
why it should be any slower.


Such partitioning rule sets would allow us to re-partition by adding a 
split node on top of the tree. The split point would have to increment 
together with the progress of moving around the rows among the 
partitions, so that the database would always be in a consistent state 
regarding partitioning.


Additionally, it's easy to figure out, when no or only few moving 

Re: [HACKERS] Auto Partitioning

2007-04-04 Thread Markus Schiltknecht

Hi,

NikhilS wrote:

Our current partitioning solution is based on inheritance. With that in
mind, for 8.3 I thought an implementation based on auto rules creation 
would be the way to go.


That's completely reasonable. And as I've said, it's probably even a 
step towards what I've outlined (automation of creation of partitions).


Regards

Markus



Re: [HACKERS] Auto Partitioning

2007-04-04 Thread Markus Schiltknecht

Hi,

Simon Riggs wrote:

I agree with much of your post, though this particular point caught my
eye. If you'll forgive me for jumping on an isolated point in your post:


No problem.


Multi-table indexes sound like a good solution until you consider how
big they would be. The reason we need a multi-table index is because
we are using partitioning, which we wouldn't be doing unless the data
was fairly large. So the index is going to be (Num partitions *
fairly-large) in size, which means its absolutely enormous. Adding and
dropping partitions also becomes a management nightmare, so overall
multi-table indexes look unusable to me. Multi-table indexes also remove
the possibility of loading data quickly, then building an index on the
data, then adding the table as a partition - both the COPY and the
CREATE INDEX would be slower with a pre-existing multi-table index.


I agree. (And thanks to TOAST, we never have very wide tables with 
relatively few rows, right? I mean, something like pictures stored in 
bytea columns or some such.)



My hope is to have a mechanism to partition indexes or recognise that
they are partitioned, so that a set of provably-distinct unique indexes
can provide the exact same functionlity as a single large unique index,
just without the management nightmare.


Uhm... I don't quite get what you mean by provably-distinct unique 
indexes.


As long as the first columns of an index are equal to all columns of the 
partitioning columns, there is no problem. You could easily reduce to 
simple per-table indexes and using the partitioning rule set to decide 
which index to query.


But how to create an (unique) index which is completely different from 
the partitioning key?


Regards

Markus



Re: [HACKERS] Auto Partitioning

2007-04-04 Thread Markus Schiltknecht

Hi,

Gregory Stark wrote:

Put another way, multi-table indexes defeat the whole purpose of having
partitioned the table in the first place. If you could have managed a single
massive index then you wouldn't have bothered partitioning.


That depends very much on the implementation of the multi-table index, 
as you describe below. I think the major missing part is not *how* such 
a meta-index should work - it's easily understandable, that one could 
use the per-table indices - but a programming interface, similar to the 
current index scan or sequential scan facility, which could return a 
table and tuple pointer, no?



However there is a use case that can be handled by a kind of compromise index.
Indexes that have leading columns which restrict all subtrees under that point
to a single partition can be handled by a kind of meta-index. So you have one
index which just points you to the right partition and corresponding index.


Yeah.


That lets you enforce unique constraints as long as the partition key is part
of the unique constraint.


Is that already sufficient? That would alter the ordering of the columns 
in the index, no? I mean:


CREATE INDEX x ON test(a, b, c);

isn't the same as

CREATE INDEX x ON test(c, b, a);

That's why I'd say, the first column of an index would have to be equal 
to all of the columns used in the partitioning key.


Regards

Markus




Re: [HACKERS] Auto Partitioning

2007-04-04 Thread Markus Schiltknecht

Hi,

Simon Riggs wrote:

Most high volume tables are Fact tables with potentially more than 1 row
per Object/Dimension, so the unique index isn't appropriate in those
cases.

When partitioning a Major Entity its much easier to regard the PK as the
partitioning key + unique key, which is frequently possible, even if it
does break the exhortation against intelligent keys.


Okay, so you are saying that a general purpose multi-table index isn't 
needed, but instead something based on the partitioning rule set and the 
per table indexes should be sufficient for the vast majority of cases?


Regards

Markus



Re: [HACKERS] Auto Partitioning

2007-04-04 Thread Markus Schiltknecht

Hi,

Gregory Stark wrote:

However there are also cases such as where you have a=0..99 in one partition
and a=100..199 in partition two, etc. It could still automatically build
indexes on (a,b,c) on each partition and somehow note that the unique
constraint is guaranteed across the whole partitioned table.


Uhm... yes, because 'a' is the partitioning key.

According to my outline for partitioning rule sets, you would have a 
split @ a = 100. Probably another one @ a = 200, etc... but none the 
less, 'a' is the only column needed to decide what partition a row has 
to end up in, so 'a' is the only column in the partitioning key.


What I'm saying is, that given your example, it's not easily possible to 
have an index on (b,a) even if 'a' is also in the partitioning key. It's 
very well possible to emulate a multi-table index on (a,b), though.



Brainstorming about this somewhat more: how about having multiple 
columns in the partitioning key, i.e. 'a' and 'b', and the following 
rule set (which admittedly is somewhat special):


  table sample
        |
        |
  split @ a = 100
       /     \
      /       \
  split @ b = 100    part3
      /   \
     /     \
  part1   part2


An index on (a, b) could easily be 'emulated' by having such an index on 
all the partitions, but can we have an index on (b, a) like that? 
Probably not, because at the first split, we would have to duplicate. 
I.e. for an index scan on 'b = 22', we would have to scan the index on 
part3 as well as part1.


Thus one can say, that an multi-table index can only easily be 
'emulated', if it has the same columns as the partitioning key, in the 
same order. For the above example, these ones would be possible:


 (a)
 (a,b)
 (a,b,...)


Yet another thought: the emulation of multi-table indexes, in this case, 
is like concatenating the indexes of the partitions in the right order. 
Asking for an index scan for 'WHERE a >= 95 AND a <= 105' when having a 
split at a = 100, you would have to start on the index in the left 
bucket (with a < 100) and return everything until the end of the index, 
then continue on the index in the right bucket (with a >= 100). So you 
also have to be able to determine an order, which is easily possible for 
splits, but not so simple for modulos (hash partitioning).



For such a modulo node, the executor would have to start multiple index 
scans, i.e.:


   table sample
         |
         |
   'id' modulo 4
    /   |   |   \
   /    |   |    \
 part1 part2 part3 part4

When scanning for a range (i.e. 'WHERE id >= 5 AND id <= 17'), the 
planner would have to request an index scan on each of the partitions, 
joining the results in the right order.


So, why not completely emulate all multi-table index scans? The above 
restriction would disappear, if we could teach the planner and executor 
how to join multiple index scan results, no?



Questioning the other way around: do we need any sort of multi-table 
indexes at all, or isn't it enough to teach the planner and executor how 
to intelligently scan through (possibly) multiple indexes to get what is 
requested?


Regards

Markus



Re: [HACKERS] Auto Partitioning

2007-04-04 Thread Markus Schiltknecht

Hi,

Joshua D. Drake wrote:
If we don't have multi-table indexes how do we enforce a primary key 
against a partitioned set?


The executor would have to be clever enough to not do a single index 
scan, but possibly scan through multiple indexes when asking for 
uniqueness, depending on the partitioning rule set.


Regards

Markus




Re: [HACKERS] Auto Partitioning

2007-04-04 Thread Markus Schiltknecht

Simon Riggs wrote:

The planner already uses the Append node to put together multiple plans.
The great thing is it will put together IndexScans and SeqScans as
applicable. No need for multi-scans as a special node type.


Yes... only that mixing 'concurrent' index scans in the right order 
would probably save us an extra sort step in some cases. Consider this 
with hash partitioning on (id):


  SELECT * FROM test WHERE id > 1 AND id < 999 ORDER BY id;

Every partition should have an index on (id), so we already have pretty 
well sorted data, we just need to mix the results of the index scan in 
the correct order, no?
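
A small sketch of the setup I have in mind (hash partitioning 
approximated with a modulo CHECK, since that's what we can express today):

  CREATE TABLE test (id int, payload text);
  CREATE TABLE test_p0 (CHECK (id % 4 = 0)) INHERITS (test);
  CREATE TABLE test_p1 (CHECK (id % 4 = 1)) INHERITS (test);
  CREATE TABLE test_p2 (CHECK (id % 4 = 2)) INHERITS (test);
  CREATE TABLE test_p3 (CHECK (id % 4 = 3)) INHERITS (test);
  CREATE INDEX test_p0_id ON test_p0 (id);
  CREATE INDEX test_p1_id ON test_p1 (id);
  CREATE INDEX test_p2_id ON test_p2 (id);
  CREATE INDEX test_p3_id ON test_p3 (id);

  -- Today the ORDER BY above ends up as an Append of the per-partition
  -- scans followed by an explicit Sort; a merge-style node could instead
  -- interleave the already-sorted per-partition index scan outputs.
  EXPLAIN SELECT * FROM test WHERE id > 1 AND id < 999 ORDER BY id;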


Regards

Markus



Re: [HACKERS] Auto Partitioning

2007-04-04 Thread Markus Schiltknecht

Andrew Dunstan wrote:

David Fetter wrote:

That would be where the provably-distinct part comes in, so yes.


That assumes you can provide some provably distinct test. In the general 
case I have in mind that isn't so.


Could you please give a somewhat more concrete example, I'm not 
following here.


Thanks

Markus




Re: [HACKERS] Auto Partitioning

2007-04-04 Thread Markus Schiltknecht

Hi,

Andrew Dunstan wrote:
 I guess my point was really that multi-table indexes might have uses
 beyond partitioning.

Aha, now I understand. Thanks for the clarification.

Say I have two tables, each with a field FKed to a field in a third 
table. I'd like to create the values to be unique across the referring 
tables. Now, there are various tricks that can be played either with 
custom triggers or redundant data to do this, but there's no easy way. 
However, a multi-table unique index would do it for me quite nicely, if 
we could create such a thing.


Maybe go in a similar direction and rather think of it as a 
multi-table uniqueness constraint, which internally uses multiple 
single-table indexes?
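
As a crude illustration of the trigger trick Andrew mentions (which is 
roughly what such a constraint would have to do internally, on top of 
the two single-table unique indexes), a sketch with invented tables; 
note it is not safe against concurrent inserts without extra locking:

  CREATE TABLE master (id int PRIMARY KEY);
  CREATE TABLE ref_a (ref int UNIQUE REFERENCES master (id));
  CREATE TABLE ref_b (ref int UNIQUE REFERENCES master (id));

  -- Each table keeps its own single-table unique index; the trigger
  -- adds the cross-table check.
  CREATE FUNCTION check_cross_unique() RETURNS trigger AS $$
  BEGIN
      IF TG_TABLE_NAME = 'ref_a' THEN
          PERFORM 1 FROM ref_b WHERE ref = NEW.ref;
      ELSE
          PERFORM 1 FROM ref_a WHERE ref = NEW.ref;
      END IF;
      IF FOUND THEN
          RAISE EXCEPTION 'value % is already used in the other table', NEW.ref;
      END IF;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER ref_a_cross_unique BEFORE INSERT OR UPDATE ON ref_a
      FOR EACH ROW EXECUTE PROCEDURE check_cross_unique();
  CREATE TRIGGER ref_b_cross_unique BEFORE INSERT OR UPDATE ON ref_b
      FOR EACH ROW EXECUTE PROCEDURE check_cross_unique();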


Regards

Markus



Re: [HACKERS] SCMS question

2007-02-26 Thread Markus Schiltknecht

Hi,

Andrew Dunstan wrote:

O.k. everyone pay attention, I am about to agree with Greg! ;)

Greg are their tools to migrate CVS to monotone or whatever your
favorite is? The reason I ask is that I migrate the CVS to SVN every 4
hours I think it is and it isn't perfect.


monotone ships its own cvs_import. I'm about to improve that, using 
roughly the same algorithm as cvs2svn 2.0. The cvs2svn people are very 
well aware that their version 1.x is good, but has problems in some areas.


There is a generic conversion tool called Tailor that might help you: 
http://www.darcs.net/DarcsWiki/Tailor


Tailor does not have support for branches, but can only convert one 
branch at a time. And it is not half as clever as cvs2svn regarding 
clever reconstruction of broken or manually tweaked CVS repositories.


Regards

Markus




Re: [HACKERS] SCMS question

2007-02-26 Thread Markus Schiltknecht

Hi,

Matthew D. Fuller wrote:

I would say that a far greater contributor in practice would simply be
frequency.  If you diverge on your significant feature for 6 months,
then try to merge in upstream changes from the main dev, you will be
in hell no matter what merge algorithm you use.


Do you have experience with automatic merging tools to back up this 
assertion? If so, I'd be curious to know.


I've been maintaining the Postgres-R branch since about late PostgreSQL 7.4, 
using monotone's cvs_import and propagating from the original PostgreSQL 
repository to my Postgres-R branch. And even though I propagate quite 
rarely, automatic merge tools (i.e. monotone) helped me a *great 
deal(tm)*.  (What's still awkward is the lacking cvs_import.)



If you merge in
upstream changes every few days, however, you will have many fewer and
much simplier conflicts to deal with.


A VCS as good as monotone gives you the option to merge arbitrary revisions 
in between. Thus, if you didn't merge for six months, you can easily do 
incremental merges, e.g. six merges of one month's worth of mainline 
code each.


Of course, that still seems like more work, than if you do it frequently 
:-)  But the VCS should give you the option and let *you* choose, 
instead of enforcing whatever it thinks should be good for you.



A VCS that makes frequent merges easy results in easier conflict
handling, not by some magical auto-resolution, but just by letting you
do it in ongoing regular and small bites.


I disagree, by experience. And please note, that it's not magic, but (in 
case of monotone) a provably correct (and simple to understand, I might 
add) algorithm which merges cleanly whenever possible.


Regards

Markus




Re: [HACKERS] SCMS question

2007-02-26 Thread Markus Schiltknecht

Hi,

Tom Lane wrote:

Yah know, the one bit of these pitches that always sounds like pure
snake oil is the claim that they offer some kind of mechanical solution
to merge conflicts.  AFAICS that has nothing to do with the SCMS in use
and everything to do with whether your diff command is AI-complete.


You should have said merge command. Ever tried that? Or kdiff3? Try 
it, you will be surprised!


Regards

Markus




Re: [HACKERS] [Monotone-devel] Re: SCMS question

2007-02-26 Thread Markus Schiltknecht

Hi,

Warren Turkal wrote:
Cvs2svn seems to make as much sense of CVS data as possible. The only real 
problem I have seen is with regard to the malformed files I mentioned 
earlier.


cvs2svn (1.x) still heavily relies on timestamps, which is certainly 
correct in most cases. But they are switching to a graph based approach 
for cvs2svn 2.0. I'm basing the reworked cvs_import feature of monotone 
on that same algorithm.


Regards

Markus



Re: [HACKERS] SCMS question

2007-02-26 Thread Markus Schiltknecht

Hi,

[EMAIL PROTECTED] wrote:

I'll have to try kdiff3 - but the merge command, although it often works,
I strongly dislike when it marks up the lines as there was a conflict here
and gives you three files in the directory to choose to start from. This is
far too manual, which invites mistakes.


Agreed that this is somewhat annoying, but hey, it's a command line 
tool. How else would you solve displaying conflicts?



If kdiff3 is more like the ClearCase
graphical merge utility, I would far prefer that. Can you say I want change
2 followed by change 3 with checkboxes, a live final version to view, and
the ability to manually type or adjust lines in the final version to view?


Yup. That's possible. And much much more... ;-)  (I don't know the 
ClearCase tool, so I can't really offer a comparison, sorry.)


Others you might want to try:

 - meld (in python, IMO worse than kdiff3)
 - xxdiff (I've never really used that one, but other monotone hackers 
seem to like it as well)


Regards

Markus



Re: [HACKERS] tsearch in core patch, for inclusion

2007-02-22 Thread Markus Schiltknecht

Hi,

Peter Eisentraut wrote:

Oleg Bartunov wrote:

It's not so big addition to the gram.y, see a list of commands
http://mira.sai.msu.su/~megera/pgsql/ftsdoc/sql-commands.html.


As we still to still discuss the syntax: is there a proposal for how a 
function based syntax would look like?


CREATE FULLTEXT CONFIGURATION myfts LIKE template_cfg AS DEFAULT;

just seems so much more SQL-like than:

SELECT add_fulltext_config('myfts', 'template_cfg', True);

I admit, that's a very simple and not thought through example. But as 
long as those who prefer not to extend the grammar don't come up with a 
better alternative syntax, one easily gets the impression that extending 
the grammar in general is evil.


In that proposed syntax, I would drop all =, ,, (, and ).  They 
don't seem necessary and they are untypical for SQL commands.  I'd 
compare with CREATE FUNCTION or CREATE SEQUENCE for SQL commands that 
do similar things.


Yup, I'd second that.

Regards

Markus




Re: [HACKERS] SCMS question

2007-02-22 Thread Markus Schiltknecht

Hi,

Andrew Dunstan wrote:
1. The buildfarm is very heavily dependent on CVS, and any change to 
anything else will be quite painful. There is no guarantee that all the 
members even have SVN installed,


But you can guarantee they have CVS or even cvsup installed? That seems 
dubious to me.


let alone anything else. And someone 
would have to code and test significant client changes. That said, a lot 
of the tortuous logic could be removed - change detection would almost 
just resolve to comparing two tree numbers with SVN, for example.


..and a *real* VCS (as in monotone :-) ) would provide not only that, 
but give you correctness guarantees, built-in certification of revisions 
(i.e. each buildfarm member could issue a cert on successful testing) 
and lightweight branches, so you could much more easily test experimental 
patches from different authors. Just to name a few additional advantages.


2. Many people (and some buildfarm members) operate against mirrors of 
the main repo which are created with rsync or CVSup. I am not aware of 
any way to do the equivalent with SVN -  any info would be gratefully 
received. 


You might want to have a look at svk. It can do exactly that. And the 
Blog of Thomas explains how, see [1].


Of course, SVN is better at disconnected operation than CVS, 


Really? I've dropped subversion exactly because it sucks big time when 
disconnected. But again, I'm probably not comparing against CVS...


I have no doubt we'll change someday to something better. I don't know 
what it is and I don't think we need to be in any hurry. This space is 
still very fluid.


Yup. Good to hear you see it that way. As I understand it, you have good 
reasons to still be using CVS, but are open to good suggestions. That's 
a very good thing, but it easily slips by when reading all the criticism 
and pro-CVS statements. ;-)


Regards

Markus

[1]: Remote Backup Of A Subversion Repository
http://moelhave.dk/2006/07/remote-mirroring-a-subversion-svn-repository/




Re: [HACKERS] SCMS question

2007-02-22 Thread Markus Schiltknecht

Hi,

[ I've CCed the monotone-devel list, as I'm sure those people are 
interested, too. ]


Stefan Kaltenbrunner wrote:
Beside that - are all of the currently supported Platforms officially 
supported by the proposed SCMSes ?


I can only speak for monotone. We have (had) buildbots for x86 (linux, 
netbsd, freebsd, win32), amd64 (linux), ppc (osx) and one sparc (osol). 
So far all gcc compiled, AFAIK.


We are very interested in increasing portability of monotone. If you 
could give me (or other monotone devels) ssh access to some of the more 
obscure boxes, that would help a lot. Please contact me privately.


most of the issues with CVS in that regard have already been worked 
around (and are therefore solved).


Huh? How do you guarantee the correctness of a local checkout? At best, 
you can check an md5 sum of a tar archive, but CVS itself does almost no 
integrity checking. Does the buildfarm code check somehow? Against what? 
(Note that we've already had quite some disk failures uncovered by 
monotone, which does extensive integrity checking. But I'm sure database 
people know how important that is, don't you?)


Or quickly test experimental patches? Is that solved?

Or merging between branches, to add another major annoyance of CVS (and 
subversion, for that matter).


I currently fetch the whole PostgreSQL repository via cvsup and then 
import it into monotone to be able to do serious work. Of course that's 
possible, and you can work around all the other limitations of CVS 
somehow, but it's annoying.


But I agree that for developers especially those that are doing large 
patches over a long period of time might gain something from another 
SCMS, but it is not really clear what that SCMS should be or if it 
warrants the imho enormous switching costs (and the potential disruption 
 in development until that switch is done which might take days if not 
weeks).


I certainly agree that switching to another VCS is a major undertaking. 
And I'm working on easing migration to monotone. And I'll quite 
certainly try to convince you again, *at some point in the 
future*. I would not vote for switching the PostgreSQL repository to 
monotone, yet. (As if I had a vote...;-) )


Regards

Markus




Re: [HACKERS] SCMS question

2007-02-22 Thread Markus Schiltknecht

Hi,

Andrew Dunstan wrote:
CVSup is not required, and is absent from most existing clients. I don't 
use it any more since the Fedora project stopped supporting it.


..which is quite understandable, concerning the PITA compiling modula-3 
gives you (or at least has given me, it still hurts).


The point you are missing is that, while we know existing buildfarm 
members all have CVS installed, we don't know that they have SVN or 
whatever, and requiring them to install it will involve significant 
distributed pain. 


Okay, I certainly agree that CVS is much more widespread and available 
than most (if not all) other VCSes. Let's change that ;-)


It will also involve some considerable localised pain 
(probably on my part) in rewriting the client. Right now I'm thinking it 
might make some sense to future-proof buildfarm by creating some sort of 
snapshot server. OTOH, if we avoid use of whatever SCM system that the 
project uses, we aren't testing that part of the process.


Did I mention that monotone is a very good snapshot server? *duck*

You probably don't want to reinvent the wheel, as 'snapshot serving' is 
exactly what a VCS should do (among other things).



You're making Tom's point again :-)


Yeah, sorry, couldn't resist :-)

IIRC you don't need to be connected to the repo to run svn diff, 
whereas you do to run cvs diff.


Yes, in the simplest case of comparing against the immediate successor 
revision. But certainly not for: svn diff -r${FURTHER_IN_THE_PAST}, as 
subversion does not have that data available (nor does CVS, for that 
matter).


We know the warts. If this were a green fields project there is no doubt 
we would not use CVS. But many proponents of other systems ignore the 
downside of changing.


Well, I guess many advocates for other VCSes (like myself) simply don't 
particularly like to talk about the downsides...  But they are probably 
more aware of them than most other people.


One thing I want to know is that whatever we change to will still be 
there, maintained and in widespread use, many years down the track. So 
far I am not sure about that for any possible replacement, with the 
possible exception of SVN.


That's certainly a valid concern, too. Probably *the* one where monotone 
is weaker compared to git and mercurial. :-(  We are working on that 
issue, though.


Regards

Markus



Re: [HACKERS] tsearch in core patch, for inclusion

2007-02-22 Thread Markus Schiltknecht

Hi,

Andrew Dunstan wrote:
If we are worried about the size of the transition table and keeping it 
in cache (see remarks from Tom upthread) then adding more keywords seems 
a bad idea, as it will surely expand the table.  OTOH, I'd hate to make 
that a design criterion. 


Yeah, me too. Especially because it's an implementation concern weighed against 
ease of use. (Or can somebody convince me that functions would provide a 
simple interface?)


My main worry has been that the grammar would 
be stable.


You mean stability of the grammar for the new additions or for all the 
grammar? Why are you worried about that?


Just to quantify all this, I did a quick check on the grammar using 
bison -v - we appear to have 473 terminal symbols, and 420 non-terminal 
sybols in 1749 rules, generating 3142 states. The biggest tables 
generated are yytable and yycheck, each about 90kb on my machine.


That already sounds somewhat better than Tom's 300 kb. And considering 
that these caches most probably grow faster than our grammar...


Regards

Markus



Re: [Monotone-devel] Re: [HACKERS] SCMS question

2007-02-22 Thread Markus Schiltknecht

Hello Richard,

you should probably have read the thread on the PostgreSQL -hackers 
mailing list I've linked to... at least you didn't make Tom's point ;-)


Richard Levitte - VMS Whacker wrote:

  1. Do you want to stay with CVS or do you want to move to something
 else?


Most PostgreSQL developers currently want to stay with CVS. Only some 
desperate souls including myself are fiddling with other VCSes.



  3. What would you want a replacement to be able to do?


That's being debated, with many voices saying: CVS (plus our own 
hackery) provides all we need.  (And be warned again: as soon as you 
point out an advantage of your favourite VCS, you're making Tom's point. 
;-) )



So far, I'm getting the sense that there are a lot of opinions on what
replacement system to use, a bit carelessly before having answered the
above questions thoroughly.


How did you get that impression? I'm currently *using* monotone for 
Postgres-R development, doing cvs_import and propagating to my branch. 
And I know others did the same already, too.


Regards

Markus




Re: [HACKERS] tsearch in core patch, for inclusion

2007-02-22 Thread Markus Schiltknecht

Hi,

Pavel Stehule wrote:

Functions maybe doesn't see efective, but user's cannot learn new syntax.


Are you serious? That argument speaks exactly *for* extending the 
grammar. From other databases, users are used to:


CREATE TABLE ... (SQL)
CREATE INDEX ... (SQL)
CREATE FULLTEXT INDEX ... (Transact-SQL)
CREATE TABLE (... FULLTEXT ...) (MySQL)
CREATE INDEX ... INDEXTYPE IS ctxsys.context PARAMETERS ... (Oracle Text)

And users are constantly complaining that PostgreSQL doesn't have 
fulltext indexing capabilities (if they don't know about tsearch2) or 
about how hard it is to use tsearch2.


SELECT create_fulltext_mapping(cfgname, ARRAY['lex..','..'], 
ARRAY['...']) is readable.


Hardly. Because it's not like SQL:

 - it's counter-intuitive to have to SELECT, when you want to CREATE 
something.

 - it's confusing to have two actions (select create)
 - why do I have to write ARRAYs to list parameters?
 - it's not obvious what you're selecting (return value?)
 - you have to keep track of the brackets, which can easily get messed 
up with two levels of them. Especially if the command gets multiple 
lines long.



I agree, so enhancing the parser with non-standard constructs isn't good.


Generally? Wow! This would mean PostgreSQL would always lag behind 
other RDBMSes regarding ease of use. Please don't do that!


Regards

Markus




Re: [Monotone-devel] Re: [HACKERS] SCMS question

2007-02-22 Thread Markus Schiltknecht

Hi,

Richard Levitte - VMS Whacker wrote:

In message [EMAIL PROTECTED] on Thu, 22 Feb 2007 17:38:26 +0100, Markus 
Schiltknecht [EMAIL PROTECTED] said:

markus  So far, I'm getting the sense that there are a lot of
markus  opinions on what replacement system to use, a bit carelessly
markus  before having answered the above questions thoroughly.
markus 
markus How did you get that impression?


You said it yourself: Most PostgreSQL developers currently want to
stay with CVS.


Uh, yah. But I was referring to the lots of opinions on what replacement 
system to use. This has not much to do with the want or need (for lack 
of a better alternative) to stay with CVS, IMO.



Unless there's a majority that wants to move on, I doubt there will be
a move.  In the end, it has to be a group effort, or it will simply
not happen.


I absolutely agree. And I'm quite sure most PostgreSQL developers also 
know that, thus I don't see much point in warning them - they are 
resistant enough ;-)


As you might have noticed, I myself don't consider monotone ready for 
use by the PostgreSQL project, yet. And I've never advocated for 
switching *now*. I only made Tom's point...


Regards

Markus




Re: [HACKERS] tsearch in core patch, for inclusion

2007-02-21 Thread Markus Schiltknecht

Hi,

Tom Lane wrote:

You mean four different object types.  I'm not totally clear on bison's
scaling behavior relative to the number of productions


You really want to trade parser performance (which is *very* 
implementation specific) for ease of use?


Bison generates a LALR [1] parser, which depends quite a bit on the 
number of productions. But AFAIK the dependency is mostly on memory 
consumption for the internal symbol sets, not so much on runtime 
complexity. I didn't find hard facts about runtime complexity of LALR, 
though (pointers are very welcome).


Are there any ongoing efforts to rewrite the parser (i.e. using another 
algorithm, like a recursive descent parser)?


Regards

Markus

[1]: Wikipedia on the LALR parsing algorithm:
http://en.wikipedia.org/wiki/LALR_parser



Re: [HACKERS] tsearch in core patch, for inclusion

2007-02-21 Thread Markus Schiltknecht

Hi,

Florian G. Pflug wrote:

According to http://en.wikipedia.org/wiki/LR_parser processing one
token in any LR(1) parser in the worst case needs to
 a) Do a lookup in the action table with the current (state, token) pair
 b) Do a lookup in the goto table with a (state, rule) pair.
 c) Push one state onto the stack, and pop n states with
n being the number of symbols (tokens or other rules) on the right
hand side of a rule.

a) and b) should be O(1). Processing one token pushes at most one state
onto the stack, so overall no more than N stats can be popped off again,
making the whole algorithm O(N) with N being the number of tokens of the
input stream.


Looks correct, thanks. What exactly is Tom worried about, then?

Are there any ongoing efforts to rewrite the parser (i.e. using 
another algorithm, like a recursive descent parser)?

Why would you want to do that?


I recall having read something about rewriting the parser. Together with 
Tom being worried about parser performance and knowing GCC has switched 
to a hand written parser some time ago, I suspected bison to be slow. 
That's why I've asked.


Regards

Markus




[HACKERS] compilation of pg_config fails

2007-02-08 Thread Markus Schiltknecht

Hi,

since str(n?)cat got replaced with strlcat, I fail to build PostgreSQL 
(current CVS HEAD). HAVING_DECL_STRLCAT is not set, so AFAIK, the 
strlcat() function from src/port should be used. However, I've read the 
README there, but still don't quite know what's wrong.


The linker throws:

gcc -O1 -Wall -pg -Wall -Wmissing-prototypes -Wpointer-arith -Winline 
-Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -g 
pg_config.o  -L../../../src/port  -Wl,-rpath,'/usr/local/pgsql/lib' 
-lpgport -lz -lreadline -lcrypt -ldl -lm  -o pg_config

pg_config.o: In function `show_pgxs':
/home/markus/projects/pgsql/sources/trunk/src/bin/pg_config/pg_config.c:216: 


undefined reference to `strlcat'

Even if objdump confirms that strlcat.o is compiled in 
../../../src/port/libpgport.o:


In archive ../../../src/port/libpgport.a:

strlcpy.o: file format elf32-i386

SYMBOL TABLE:
 ldf *ABS*   strlcpy.c
 ld  .text   .text
 ld  .data   .data
 ld  .bss    .bss
 ld  .debug_abbrev   .debug_abbrev
 ld  .debug_info .debug_info
 ld  .debug_line .debug_line
 ld  .debug_frame    .debug_frame
 ld  .debug_loc  .debug_loc
 ld  .debug_pubnames .debug_pubnames
 ld  .debug_aranges  .debug_aranges
 ld  .debug_str  .debug_str
 ld  .note.GNU-stack .note.GNU-stack
 ld  .comment    .comment
 g F .text  0045 strlcpy
 *UND*   mcount

...

# uname -a:

Linux grml 2.6.20 #1 SMP PREEMPT Tue Feb 6 14:48:26 PST 2007 i686 GNU/Linux

Any idea?

Regards

Markus



Re: [HACKERS] Proposal: Commit timestamp

2007-02-07 Thread Markus Schiltknecht

Hi,

Jan Wieck wrote:

Are we still discussing if the Postgres backend may provide support for
a commit timestamp, that follows the rules for Lamport timestamps in a
multi-node cluster?


No. And I think you know my opinion about that by now. ;-)


It seems more like we are drifting into what type of
replication system I should design to please most people.


Nobody is telling you what you should do. You're free to do whatever you 
want to.


I'm only trying to get a discussion going, because a) I'm interested in 
how you plan to solve these problems and b) in the past, most people 
were complaining that all the different replication efforts didn't try 
to work together. I'm slowly trying to open up and discuss what I'm 
doing with Postgres-R on the lists.


Just yesterday at the SFPUG meeting, I've experienced how confusing it 
is for the users to have such a broad variety of (existing and upcoming) 
replication solutions. And I'm all for working together and probably 
even for merging different replication solutions.


Regards

Markus




Re: [HACKERS] Proposal: Commit timestamp

2007-02-07 Thread Markus Schiltknecht

Hi,

Jan Wieck wrote:

Then let me give you a little puzzle just for the fun of it.

A database containing customer contact information (among other things) 
is a two node multimaster system. One is serving the customer web 
portal, the other is used by the company staff including the call 
center. At 13:45 the two servers lose connectivity to each other, yet 
the internal staff can access the internal server while the web portal 
is accessible from the outside. At 13:50 customer A updates their credit 
card information through the web portal, while customer B does the same 
through the call center. At 13:55 both customers change their mind to 
use yet another credit card, now customer A phones the call center while 
customer B does it via the internet.


Phew, a mind twister... one customer would already be enough to trigger 
that sort of conflict...


At 14:00 the two servers reconnect and go through the conflict 
resolution. How do you intend to solve both conflicts without using any 
clock, because that seems to be a stopword causing instant rejection 
of whatever you propose. Needless to say, both customers will be 
dissatisfied if you charge the wrong credit card during your next 
billing cycle.


Correct. But do these cases justify storing timestamps for each and every 
transaction you do? That's what I doubt, not the usefulness of time 
based conflict resolution for certain cases.


You can always add time based conflict resolution by adding a 
timestamp column and deciding upon that one. I'd guess that the overall 
costs are lower that way.


But you've withdrawn that proposal already, so...

Which is a good discussion because one of the reasons why I stopped 
looking into Postgres-R is the fact that is based on the idea to push 
all the replication information through a system that generates a global 
serialized message queue. That by itself isn't the problem, but the fact 
that implementing a global serialized message queue has serious 
throughput issues that are (among other details) linked to the speed of 
light.


Agreed. Nevertheless, there are use cases for such systems, because they 
put fewer limitations on the application. One could even argue that your 
above example would be one ;-)


I am trying to start with a system, that doesn't rely on such a 
mechanism for everything. I do intend to add an option later, that 
allows to declare a UNIQUE NOT NULL constraint to be synchronous. What 
that means is, that any INSERT, UPDATE, DELETE and SELECT FOR UPDATE 
will require the node to currently be a member of the (quorum or 
priority defined) majority of the cluster.


Sounds reasonable.

An advisory lock system, 
based on a total order group communication, will grant the lock to the 
unique key values on a first come, first serve base. Every node in the 
cluster will keep those keys as locked until the asynchronous 
replication stream reports the locking transaction as ended. If another 
remote transaction in the meantime requires updating such key, the 
incoming stream from that node will be on hold until the lock is 
cleared. This is to protect against node B replicating a transaction from 
node A and a later update on node B arrives on C before C got the first 
event from A. A node that got disconnected from the cluster must rebuild 
the current advisory lock list upon reconnecting to the cluster.


Yeah, this is a convenient way to replicate sequences via a GCS.

I think that this will be a way to overcome Postgres-R's communication 
bottleneck, as well as allowing limited update activity even during a 
completely disconnected state of a node. Synchronous or group 
communication messages are reduced to the cases, where the application 
cannot be implemented in a conflict free way, like allocating a natural 
primary key. There is absolutely no need to synchronize for example 
creating a sales order. 


Agreed, such cases can easily be optimized. But you have to be aware of 
the limitations these optimizations cause. Postgres-R is much more 
targeted at very general use cases.


An application can use global unique ID's for 
the order number. And everything possibly referenced by an order (items, 
customers, ...) is stored in a way that the references are never 
updated. Deletes to those possibly referenced objects are implemented in 
a two step process, where they are first marked obsolete, and later on 
things that have been marked obsolete for X long are deleted. A REPLICA 
TRIGGER on inserting an order will simply reset the obsolete flag of 
referenced objects. If a node is disconnected longer than X, you have a 
problem - hunt down the guy who defined X.


Yeah, that's another very nice optimization. Again, as long as you know 
the limitations, that's all well and fine.


Merging certain ideas to come up with an async/sync hybrid? Seems to me 
we have similar enough ideas to need conflict resolution, because we had 
them simultaneously but communicate them asynchronously.


Huh? 

Re: [HACKERS] Proposal: Commit timestamp

2007-02-06 Thread Markus Schiltknecht

Hi,

Zeugswetter Andreas ADI SD wrote:

And time based
is surely one of the important conflict resolution methods for async MM
replication.


That's what I'm questioning. Wouldn't any other deterministic, but 
seemingly random abort decision be as clever as time based conflict 
resolution? It would then be clear to the user that it's random and not 
some "in most cases time based, but not in others and only if..." thing.



Sure there are others, like rule based priority based but I think
you don't need additional backend functionality for those.


Got the point, yes. I'm impatient, sorry.

Nevertheless, I'm questioning whether it is worth adding backend 
functionality for that. And given this probably is the most wanted 
resolution method, this question might be heretical. You could also 
see it as sort of a user education question: don't favor time based 
resolution if that's the one resolution method with the most traps.


Regards

Markus



Re: [HACKERS] Proposal: Commit timestamp

2007-02-06 Thread Markus Schiltknecht

Hi,

Jan Wieck wrote:
Whatever strategy one will use, in an async multimaster there are always 
cases that can be resolved by rules (last update being one of them), and 
some that I can't even imagine solving so far. I guess some of the cases 
will simply boil down to the application has to make sure that ... 
never occurs. Think of a multi-item order, created on one node, while 
another node is deleting the long unused item (which would have to be 
backordered). Now while those two nodes figure out what to do to make 
this consistent again, a third node does a partial shipment of that 
order.


It helps to categorize these conflict types. There basically are:

* data conflicts: simple row data, i.e. update - update conflicts.

* uniqueness conflicts: two rows conflict because they'd violate a 
uniqueness constraint, i.e. insert - insert, update - insert or update - 
update.


* visibility conflicts: basically the remaining update - delete and 
delete - delete cases. But also SELECT FOR UPDATE candidates, etc... 
Everything having to do with rows not yet or no longer being visible 
to a transaction.


Your example certainly involves a visibility conflict (update - delete). 
Not even (sync) Postgres-R can guarantee consistency on the visibility 
level, i.e. a first transaction's SELECT FOR UPDATE might not see some 
just recently committed transaction's newly inserted rows (because that 
one isn't replayed yet on the node, thus the transaction is working on 
an 'old' snapshot of the database state). Another simpler example: 
Postgres-R doesn't raise a serialization error on delete-delete 
conflicts, it simply deletes the row once, even if two transactions 
confirmed to have committed a transaction which deleted a row.


Luckily, most applications don't need that anyway, though.

The solution is simple, reinsert the deleted item ... 


..at which point timestamps certainly won't help :-)   Sorry, couldn't 
resist...


only that 
there were rather nasty ON DELETE CASCADE's on that item that removed 
all the consumer reviews, product descriptions, data sheets and what 
not. It's going to be an awful lot of undo.


Huh? Are you planning on aborting *parts* of a transaction? I didn't 
think about that, but my gut feeling is that you don't want to do that.


I haven't really made up my mind about a user defined rule based 
conflict resolution interface yet. I do plan to have a unique and 
foreign key constraint based, synchronous advisory locking system on top 
of my system in a later version (advisory key locks would stay in place 
until the transaction, that placed them, replicates).


You'd have to elaborate on that...

I guess you see by now why I wanted to keep the discussion about the 
individual, rather generic support features in the backend separate from 
the particular features I plan to implement in the replication system.


Sure. I know, discussions about replication can get endless, probably 
even are so by definition ;-)  But hey, they're fun!


Everyone has different needs and consequently an async multi-master 
must do a whole range of mutually exclusive things altogether ... 
because Postgres can never accept a partial solution. We want the egg 
laying milk-wool-pig or nothing.


Like the one which would result from a merge of such an async 
replication with a sync one? Imagine being able to choose between sync 
and async per transaction...


Regards

Markus




Re: [HACKERS] Proposal: Commit timestamp

2007-02-05 Thread Markus Schiltknecht

Hi,

Theo Schlossnagle wrote:

On Feb 4, 2007, at 1:36 PM, Jan Wieck wrote:
Obviously the counters will immediately drift apart based on the 
transaction load of the nodes as soon as the network goes down. And in 
order to avoid this clock confusion and wrong expectation, you'd 
rather have a system with such a simple, non-clock based counter and 
accept that it starts behaving totally wonky when the cluster 
reconnects after a network outage? I rather confuse a few people than 
having a last update wins conflict resolution that basically rolls 
dice to determine last.


If your cluster partition and you have hours of independent action and 
upon merge you apply a conflict resolution algorithm that has enormous 
effect undoing portions of the last several hours of work on the nodes, 
you wouldn't call that wonky?


You are talking about different things. Async replication, as Jan is 
planning to do, is per se wonky, because you have to cope with 
conflicts by definition. And you have to resolve them by late-aborting a 
transaction (i.e. after a commit). Or put it another way: async MM 
replication means continuing in disconnected mode (w/o quorum or some 
such) and trying to reconcile later on. It should not matter if the 
delay is just some milliseconds of network latency or three days (except 
of course that you probably have more data to reconcile).


For sane disconnected (or more generally, partitioned) operation in 
multi-master environments, a quorum for the dataset must be 
established.  Now, one can consider the database to be the dataset.  
So, on network partitions those in the quorum are allowed to progress 
with data modification and others only read.


You can do this to *prevent* conflicts, but that clearly belongs to the 
world of sync replication. I'm doing this in Postgres-R: in case of 
network partitioning, only a primary partition may continue to process 
writing transactions. For async replication, it does not make sense to 
prevent conflicts when disconnected. Async is meant to cope with 
conflicts. So as to be independent of network latency.


However, there is no 
reason why the dataset _must_ be the database and that multiple datasets 
_must_ share the same quorum algorithm.  You could easily classify 
certain tables or schema or partitions into a specific dataset and apply 
a suitable quorum algorithm to that and a different quorum algorithm to 
other disjoint data sets.


I call that partitioning (among nodes). And it's applicable to sync as 
well as async replication, while it makes more sense in sync replication.


What I'm more concerned about, with Jan's proposal, is the assumption 
that you always want to resolve conflicts by time (except for balances, 
for which we don't have much information, yet). I'd rather say that time 
does not matter much if your nodes are disconnected. And (especially in 
async replication) you should prevent your clients from committing to 
one node and then reading from another, expecting to find your data 
there. So why resolve by time? It only makes the user think you could 
guarantee that order, but you certainly cannot.


Regards

Markus



