Re: [HACKERS] UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

2008-02-28 Thread Zeugswetter Andreas ADI SD

  Well, I guess the question is: if we don't offer some builtin way to render
  non-standard formats built into company products, will those companies fix
  their format or just not use PostgreSQL?

 Well, there is an advantage that Postgres has that some others don't: you
 can extend Postgres pretty easily.  That suggests to me a reason to be
 conservative in what we build in.  This is consistent with the principle,
 "Be conservative in what you send, and liberal in what you accept."

Well, then the uuid input function should most likely disregard all '-'
characters and accept the 4x-, 8x- formats and the like on input.
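
A minimal sketch of that kind of liberal parsing (an illustration only, not
the actual uuid_in code): ignore '-' wherever it appears and require exactly
32 hex digits.

#include <ctype.h>
#include <stdbool.h>

static bool
parse_uuid_liberal(const char *src, unsigned char out[16])
{
    int nibbles = 0;

    for (; *src; src++)
    {
        int v;

        if (*src == '-')
            continue;           /* ignore separators wherever they occur */
        if (!isxdigit((unsigned char) *src) || nibbles >= 32)
            return false;       /* non-hex character, or too many digits */
        v = isdigit((unsigned char) *src) ? *src - '0'
            : tolower((unsigned char) *src) - 'a' + 10;
        if (nibbles % 2 == 0)
            out[nibbles / 2] = (unsigned char) (v << 4);
        else
            out[nibbles / 2] |= (unsigned char) v;
        nibbles++;
    }
    return nibbles == 32;       /* reject short or overlong input */
}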

Andreas




Re: [HACKERS] pg_dump additional options for performance

2008-02-26 Thread Zeugswetter Andreas ADI SD

  As a user I'd really prefer all of this to be much more transparent, and
  could well imagine the -Fc format to be some kind of TOC + zip of table
  data + post-load instructions (organized per table), or something like this.
  In fact just what you described, all embedded in a single file.

 If it's in a single file then it won't perform as well as if it's separate
 files. We can put separate files on separate drives. We can begin
 reloading one table while another is still unloading. The OS will
 perform readahead for us on single files whereas on one file it will
 look like random I/O. etc.

Well, nowadays this is not generally true. You would most likely be
using stripesets, so you don't need more than one file to be reading
from more than one spindle in parallel.
Also, different threads or processes would be used, so readahead would be
done in parallel at different offsets in the file.
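
To illustrate the point (a sketch only: the file name, offsets and the COPY
plumbing are made up, and error handling is omitted), two workers can each
read their own byte range of the same archive file with pread(), and the
kernel keeps readahead going for both streams independently:

#include <fcntl.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

struct chunk { int fd; off_t start; off_t len; };

static void *
restore_range(void *arg)
{
    struct chunk *c = arg;
    char    buf[65536];
    off_t   pos = c->start;

    while (pos < c->start + c->len)
    {
        ssize_t n = pread(c->fd, buf, sizeof(buf), pos);

        if (n <= 0)
            break;
        /* ... feed buf into the COPY for this worker's table ... */
        pos += n;
    }
    return NULL;
}

int
main(void)
{
    int         fd = open("dump.archive", O_RDONLY);
    struct chunk c1 = {fd, 0, (off_t) 1 << 30};          /* offsets made up */
    struct chunk c2 = {fd, (off_t) 1 << 30, (off_t) 1 << 30};
    pthread_t   t1, t2;

    pthread_create(&t1, NULL, restore_range, &c1);
    pthread_create(&t2, NULL, restore_range, &c2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    close(fd);
    return 0;
}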

Andreas



Re: [HACKERS] configurability of OOM killer

2008-02-08 Thread Zeugswetter Andreas ADI SD

  while we are at it -- one feature would be great for 8.4, an
  ability to change shared buffers size on the fly.  I expect
  it is not trivial, but would help fine-tuning a running database.
  I think the DBA would need to set a maximum shared buffers size
  along with the normal setting.

 Shared memory segments can't be resized... There's not even a kernel
 API to do it.

Yes, but the typical way around that is to allocate additional segments.
You would want a configurable size and a limit, though.
I just wanted to air this possibility, because it seems nobody here is
aware of it.
It does cause all sorts of issues, but it's not like there is no way to
increase shared memory.

The DBA would then reconfigure and restart at a convenient time to reduce
the number of segments, because that is typically more performant.
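
A rough sketch of what I mean (not actual Postgres code; key/size handling
and error checks are omitted): grow the buffer pool by attaching an
additional SysV segment rather than resizing the original one.

#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

void *
attach_extra_segment(key_t key, size_t bytes)
{
    int     shmid = shmget(key, bytes, IPC_CREAT | 0600);
    void   *addr;

    if (shmid < 0)
        return NULL;
    addr = shmat(shmid, NULL, 0);   /* new region at a separate address */
    if (addr == (void *) -1)
        return NULL;
    /* the caller would register this region as an extra pool of buffers */
    return addr;
}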

Andreas



Re: [HACKERS] pg_dump additional options for performance

2008-02-06 Thread Zeugswetter Andreas ADI SD
Simon wrote:
 My proposal is to provide two additional modes:
 --schema-pre-load corresponding to (1) above
 --schema-post-load corresponding to (3) above

Sounds nice.
For a large schema we might rather want one switch that dumps 2 files, no?
Probably also better from an MVCC perspective.

Andreas



Re: [HACKERS] Oops - BF:Mastodon just died

2008-01-31 Thread Zeugswetter Andreas ADI SD

  http://msdn2.microsoft.com/en-us/library/b842y285(VS.71).aspx appears to
  suggest that the size of the field is fixed.

 That would imply that dumpbin fails at 4096 symbols per file.  While I
 surely wouldn't put it past M$ to have put in such a limitation, I think
 it's more likely that the documentation is badly written.

Yes, it starts with 3 digits and goes to 4 digits above FFF.

Andreas



Re: [HACKERS] [PATCHES] Proposed patch: synchronized_scanning GUCvariable

2008-01-30 Thread Zeugswetter Andreas ADI SD

  The plural seems better to me; there's no such thing as a solitary
  synchronized scan, no?  The whole point of the feature is to affect
  the behavior of multiple scans.
 
 +1. The plural is important IMHO.

OK, good.

 As I stated earlier, I don't really like this argument (we already
 broke badly designed applications a few times in the past) but we
 really need a way to guarantee that the execution of a query is stable
 and doesn't depend on external factors. And the original problem was
 to guarantee that pg_dump builds a dump as identical as possible to
 the existing data by ignoring external factors. It's now the case with
 your patch.
 The fact that it allows us not to break existing applications relying
 too much on physical ordering is a nice side effect though :).

One more question: it would be possible for a session that has turned off
synchronized_seqscans to still be a pack leader for other, later sessions.
Do/should we consider that?

The procedure would be:
start from page 0;
if (and only if) no other pack is present, fill in the current scan
position for others.

Andreas



Re: [HACKERS] [PATCHES] Proposed patch: synchronized_scanning GUCvariable

2008-01-29 Thread Zeugswetter Andreas ADI SD

 It's a good point that we don't want pg_dump to screw up the cluster 
 order, but that's the only use case I've seen this far for disabling 
 sync scans. Even that wouldn't matter much if our estimate for 
 clusteredness didn't get screwed up by a table that looks 
 like this: 
 5 6 7 8 9 1 2 3 4

I do think the GUC to turn it off is useful; only, I don't understand the
reasoning that pg_dump needs it to maintain the basic clustered property.

Sorry, but I don't grok this at all.
Why the heck would we care if we have 2 parts of the table perfectly
clustered, because we started in the middle? Surely our stats collector
should recognize such a table as perfectly clustered. Does it not?
We are talking about one breakage in the readahead logic here; this should
only bring the clustered property from 100% down to some 99.99%, depending
on table size vs. readahead window.

Andreas



Re: [HACKERS] [PATCHES] Proposed patch: synchronized_scanning GUCvariable

2008-01-29 Thread Zeugswetter Andreas ADI SD

  +1. If we go with 'enable_sync_seqscans' for 8.3, and in a future release
  cycle we do test the cases Simon described above and we agree we need to
  do a fine tune to benefit from this feature, we will need to deprecate
  'enable_sync_seqscans' and invent another one (sync_seqscans_threshold).
  Looking at this perspective, IMHO we should go with the number (0.25)
  instead of the boolean.

 Surely the risk-of-needing-to-deprecate argument applies ten times more
 strongly to a number than a boolean.

Yes, I would expect the tuning to be more system- than user-specific.
So imho a boolean userset would couple well with a tuning GUC that may
usefully not be userset (if we later discover a need for tuning at all).

So +1 for the bool.

synchronize[d]_seqscan sounds a bit better to my ears than the plural
synchronize_seqscans.
To me the latter somehow suggests influence on the whole cluster; probably
not worth further discussion though, so if someone says no, OK.

Andreas




Re: [HACKERS] [PATCHES] Proposed patch: synchronized_scanning GUC variable

2008-01-28 Thread Zeugswetter Andreas ADI SD

  I liked the synchronized_sequential_scans idea myself.
 
  I think that's a bit too long. How about synchronized_scans, or
  synchronized_seqscans?
 
 We have enable_seqscan already, so that last choice seems to fit in.

Yes, looks good; how about synchronized_seqscan without the plural?

Andreas




Re: [HACKERS] Proposal: Integrity check

2008-01-25 Thread Zeugswetter Andreas ADI SD

  This seems like a pretty horrid idea.  Bad pages shouldn't be allowed to
  get into shared buffers in the first place.  Why not have the checking
  logic operate outside shared buffers?

  It currently works outside the shared buffers, but I'm afraid about
  collisions due to parallel read and write access on one block. I'm not
  sure if parallel write(8k) and read(8k) is synchronized by the kernel/fs
  or not. If not, it could generate false positive results. If yes, then
  I'm happy :-) with outside processing.

 We're already assuming that; otherwise base backups for PITR don't work.

I think we could, but iirc we did not. We do not need that assumption if
you don't turn off full-page writes. All pages that could potentially be
changed during the backup exist in the WAL full pages that have to be
replayed.
Don't we even now always write full pages to WAL during backup exactly
because we did not confirm that assumption?

I think I recall times when some OS had trouble with this when you mixed
O_DIRECT (or was it also O_DATASYNC) and normal IO.

Andreas




Re: [HACKERS] Declarative partitioning grammar

2008-01-16 Thread Zeugswetter Andreas ADI SD
 
  Personally I find the automatic partition idea intriguing, where you
  only have to choose an expression that equates to one value (value
  group) per partition (and possibly a way to derive a partition name).

 IMO, better go right to a fully automated approach. Or why would you
 need partition names in such a case?

Yes, I tend to agree on that, with the exception that I think the
partition borders should be declarative expressions (equal in syntax to
a GROUP BY clause).

Names are only for a short and crisp way to identify the partition for
the following operations:
- drop/detach data in a partition (as opposed to a normal delete)
- move to/create in another tablespace (I don't value that one, but others
  have)
- reorg, create index

The drop can probably be done straight from a normal delete (a problem
is RI and triggers, though).
The normal delete would need to have the smarts that a delete covers a
whole partition, and thus mark the whole partition dead instead of its
individual rows.

A detach would need some extra syntax, but could also be based on a
where clause that specifies which partitions are to be detached. I am
not so sure about how to usefully do the reorg part with where clauses.

Andreas



Re: [HACKERS] Declarative partitioning grammar

2008-01-15 Thread Zeugswetter Andreas ADI SD

  I don't agree with that at all.  I can imagine plenty of situations
  where a tuple falling outside the range of available partitions *should*
  be treated as an error.  For instance, consider timestamped observations
  --- data in the future is certainly bogus, and data further back than
  you want to deal with must be an entry error as well.

 Isn't it better to have these constraints as table constraints, instead
 of burying them in the partitioning definition? Mixing those two
 concepts seems very weird to me.

Yes, but the problem with timestamp-partitioned tables is that the
window is sliding. Thus you would need two ALTER TABLEs for each new
period: one that changes the constraint plus one that creates the new
partition. So it seems natural to join the two concepts for such a
partitioning syntax.

Personally I find the automatic partition idea intriguing, where you
only have to choose an expression that equates to one value (value
group) per partition (and possibly a way to derive a partition name).
Then a partition is automatically created when a new row arrives for a
new value. That does not, however, address Tom's concern of rejecting data
that is outside the acceptable window, but maybe that is better dealt
with in the application anyway.

Andreas



Re: [HACKERS] Postgresql Materialized views

2008-01-14 Thread Zeugswetter Andreas ADI SD

  But you had to modify your queries.  I would think that a materialized
  views implementation worth its salt would put the view to work on the
  original, unmodified queries.

 
 I might be slow today (everyday? :-) ) - but what do you mean by this? 
 The only difference between *_table and *_view is that *_table is the 
 summary table and *_view is the view. The triggers on the tables the 
 view is derived from select from *_view and update *_table. The queries 
 remain unchanged except for deciding whether to use *_table or *_view.

Traditionally, materialized views exist so that you do not need to code
differently.
Your queries still run on the detail table, but are silently answered
by a suitable MV. The MV might have a count + other aggregated columns
grouped by some columns, and thus be able to, e.g., short-circuit a
select count(*) from atab. The MV should be MVCC-aware (have different
values for different snapshots) and not substantially reduce the possible
concurrency of updates to the base table.
 
 For some further background - the base tables are a mirror of accpac 
 tables (augh!) from mssql. The view and summary table gathers 
 information from 5 or so of these tables including aggregates, 
 conditionals, sub-selects (different queries to the same base tables) 
 and deep joins. Perhaps my imagination is too limited - but I 
 don't see 
 how it would be easy to make syntactical sugar for this and still 
 maintain the performance I describe above. For about 30 lines of 
 pl/pgsql and some application-side updates (again from the 
 view to the 
 summary table) in the synchronization script it seems acceptable.

As long as you can formulate a normal view on the above statement,
you should be able to tell the db to materialize that.

A good MV feature would be able to use that MV regardless of whether
you select from the view or use a statement that the view is a
generalization of.

I think MVs were originally invented to boost benchmark results
and thus had to operate on given SQL against the base tables.

Andreas



Re: [HACKERS] Postgresql Materialized views

2008-01-14 Thread Zeugswetter Andreas ADI SD

  Traditionally, materialized views exist so that you do not need to code
  differently.
  Your queries still run on the detail table, but are silently answered
  by a suitable MV. The MV might have a count + other aggregated columns
  grouped by some columns, and thus be able to, e.g., short-circuit a
  select count(*) from atab. The MV should be MVCC-aware (have different
  values for different snapshots) and not substantially reduce the possible
  concurrency of updates to the base table.
 
 Note that you just raised the minimum bar for implementation of the
 feature by a couple orders of magnitude.  We cannot automatically
 substitute an MV into queries unless this is guaranteed not to change
 the results.  No lazy updates, MVCC transparency required, etc.

Yes, unfortunately. But don't you also think that this is what makes it
a worthwhile feature?

I mean, we do have the do-it-yourself triggered summary table approach,
which is not overly difficult to set up. It needs some thought and possibly
design by the user to solve the most obvious concurrency issues, but it is
doable.
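
For reference, a minimal sketch of that do-it-yourself approach (table and
column names are made up, and it deliberately ignores the concurrency
issues mentioned above, e.g. two sessions inserting the first row for the
same group at once):

CREATE TABLE detail (grp int, val numeric);
CREATE TABLE detail_summary (grp int PRIMARY KEY, n bigint);

CREATE FUNCTION detail_summary_trg() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE detail_summary SET n = n + 1 WHERE grp = NEW.grp;
        IF NOT FOUND THEN
            INSERT INTO detail_summary VALUES (NEW.grp, 1);
        END IF;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE detail_summary SET n = n - 1 WHERE grp = OLD.grp;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER detail_summary_upd AFTER INSERT OR DELETE ON detail
    FOR EACH ROW EXECUTE PROCEDURE detail_summary_trg();

Queries then read detail_summary instead of computing the aggregate over
detail -- which is exactly the "code differently" part an MV would hide.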

Imho MV could be separated into 2 parts:
1. materialized and MVCC-aware views (only used explicitly)
2. adding the smarts to rewrite SQL

Part 1 is already useful by itself, since it provides a generic and easy
solution to concurrency for the user. (Probably nice and mind-boggling how
to best implement that, though :-)

The lazy-update and non-MVCC approach imho sounds too much like your
"you can make it arbitrarily fast if it does not need to be correct" :-)
Andreas



Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-11 Thread Zeugswetter Andreas ADI SD

 I've kept a list of requests for improvement that I can share with you;
 I've always been loathe to publish a list of bad points.

I think it would help understanding of the proposal if you also presented
the shortcomings.
When you presented the positive and negative points, the negative list
did look intentionally short :-)
This imho provokes negative replies; the average hackers that reply are
no dummies either.

Some of the issues I see with the proposal made so far:
- segment boundaries will imho sometimes be too fuzzy for a reasonably
  short segment-describing where clause
- fault isolation
- huge global indexes (index create/reorg needs to repeatedly sort data
  that has long been static)
- unpredictability
- when you delete large parts of the table you want that to be
  instantaneous and cause little work on the db side
- partitioning that also partitions the indexes; this is imho a must-have
  for huge non-static tables
- when the user tweaks the segment size, this is already declarative
- some indexes are only needed for recent data (a partial index would need
  to slide == recreate)

The whole subject is imho very interesting, and I expect more feedback
after 8.3 is out.

I am also in the declarative camp. In my projects the partitioning is the
developer/designer's responsibility, and thus all add/drop partition tasks
are automated, no DBA.
Needless to say, all existing declarative implementations lack some of
the essential features on the implementation side, e.g.:
- use the btree order of partitions in plans that need more than one
  partition
- a useful syntax that allows automatic creation/exclusion of partitions
  (e.g. each month of a timestamp in one partition),
  e.g. partition 'hugetable_'||to_char(ts, 'MM') with
  extend(ts, year to month)
- unique indexes, equally partitioned like the table, that don't contain
  the partitioning column[s]
- some lack expression-based partitioning
- some lack instantaneous attach using a prefilled, preindexed table
- some lack detach
- some need a separate tablespace per partition

Andreas



Re: [HACKERS] pgwin32_open returning EINVAL

2007-12-19 Thread Zeugswetter Andreas ADI SD

  Yeah, I think it would be useful to log one message if after (say) 5
  seconds we still haven't been able to open the file.

 Either that, or on the first run.

Imho 1-5 s is better, so that would be after the 10th-50th try.

 loop. It's supposed to loop 300 times.

Yes.

  (Are we OK with the idea of sleeping 1 second each time?)

 I think not. 0.1 seconds is better. We don't want to delay a full second
 if it's just a transient thing.

Yes, 0.1 s is imho good. Btw., M$ is talking about milliseconds
(http://support.microsoft.com/kb/316609).
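
A sketch of the behaviour discussed above (not the actual pgwin32_open
code; treating EACCES as the transient sharing-violation case is just an
assumption for illustration):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
open_with_retry(const char *path, int flags)
{
    int attempt;

    for (attempt = 0; attempt < 300; attempt++)   /* 300 * 0.1 s = 30 s */
    {
        int fd = open(path, flags);

        if (fd >= 0 || errno != EACCES)
            return fd;              /* success, or a non-transient error */
        if (attempt == 50)          /* roughly 5 seconds in */
            fprintf(stderr, "still cannot open \"%s\", retrying\n", path);
        usleep(100000);             /* 0.1 s between tries */
    }
    return -1;
}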

Andreas



Re: [HACKERS] [GENERAL] Slow PITR restore

2007-12-14 Thread Zeugswetter Andreas ADI SD

  Note that even though the processor is 99% in wait state the drive is
  only handling about 3 MB/s. That translates into a seek time of 2.2ms
  which is actually pretty fast... But note that if this were a raid array
  Postgres wouldn't be getting any better results. A raid array wouldn't
  improve i/o latency at all and since it's already 99% waiting for i/o
  Postgres is not going to be able to issue any more.

 If it's a straight stupid RAID array, sure.  But when you introduce a good
 write caching controller into the mix, that can batch multiple writes,
 take advantage of more elevator sorting, and get more writes/seek
 accomplished.  Combine that improvement with having multiple drives as
 well and the PITR performance situation becomes very different; you really
 can get more than one drive in the array busy at a time. It's also true
 that you won't see everything that's happening with vmstat because the
 controller is doing the low-level dispatching.

I don't follow. The problem is not writes but reads. And if the reads are
random enough, no cache controller will help.

The basic message is that for modern IO systems you need to make sure that
enough parallel read requests are outstanding. Write requests are not an
issue, because battery-backed controllers can take care of that.
Andreas



Re: [HACKERS] WORM and Read Only Tables (v0.1)

2007-12-12 Thread Zeugswetter Andreas ADI SD

 There are a number of nasty limitations for partitions currently (not the
 least of which is that real uniqueness guarantees are impractical),

Just to add another opinion to this statement, because it imho sounds
overly pessimistic:

Uniqueness is currently perfectly practical when the unique index contains
the column[s] that is/are used in a non-overlapping partitioning scheme.

If you cannot create separate unique indexes on each partition that
guarantee global uniqueness because of the chosen partitioning scheme, you
can often reconsider your scheme (e.g. use natural keys instead of
serials).
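
A minimal example of the practical case (made-up names): partition by
region with inheritance; because each child's unique index contains the
partitioning column, the per-child indexes together guarantee global
uniqueness.

CREATE TABLE orders (region text NOT NULL, order_no int NOT NULL,
                     amount numeric);

CREATE TABLE orders_eu (CHECK (region = 'EU')) INHERITS (orders);
CREATE TABLE orders_us (CHECK (region = 'US')) INHERITS (orders);

CREATE UNIQUE INDEX orders_eu_uniq ON orders_eu (region, order_no);
CREATE UNIQUE INDEX orders_us_uniq ON orders_us (region, order_no);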

Other db software allows creating global indexes, or indexes with separate
partitioning schemes, but this is then often a pain. When you
drop/attach/detach a partition, such an index needs to be recreated or
reorganized. This then makes a large, slow transaction out of attach/detach
partition.
If you don't need to attach/detach, there is still one other argument
against the huge global index, which is fault isolation.

There is imho large room to make it better than others :-)
And I think we should not regard them as positive examples,
because that narrows the view.

Andreas



Re: [HACKERS] WORM and Read Only Tables (v0.1)

2007-12-12 Thread Zeugswetter Andreas ADI SD
 Getting partitioning/read-only right will allow 70+TB of that to be on
 tape or similar, which with compression can be reduced to maybe 20TB? I
 don't want to promise any particular compression ratio, but it will make
 a substantial difference, as I'm sure you realise.

Wouldn't one very substantial requirement of such storage be to have it
independent of db version, or even db product? Keeping old hardware and
software around can be quite expensive.

So, wouldn't a virtual table interface be a better match for such a
problem? Such a virtual table should be allowed to be part of a
partitioning scheme, have native or virtual indexes, ...
Andreas



Re: [HACKERS] Problem of a server gettext message.

2007-12-11 Thread Zeugswetter Andreas ADI SD

  GetText is converting po(EUC_JP) to SJIS.

Yes.

 Are you sure about that?  Why would gettext be converting to SJIS, when
 SJIS is nowhere in the environment it can see?

gettext is using GetACP() on Windows, wherever that gets its info from ...
chcp did change the GetACP codepage in Hiroshi's example, but chcp
does not reflect in LC_*.

Seems we may want to use bind_textdomain_codeset.
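
Something along these lines, I think (a sketch only; using "postgres" as
the text domain and passing the encoding name straight through are
assumptions for illustration):

#include <libintl.h>

void
fix_message_codeset(const char *server_encoding_name)
{
    /* tell gettext which codeset to return messages in,
     * instead of whatever GetACP() reports */
    bind_textdomain_codeset("postgres", server_encoding_name);
}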

Andreas



Re: [HACKERS] Ordered Append Node

2007-11-23 Thread Zeugswetter Andreas ADI SD

  But that requires a) dealing with the problem of the parent table, which
  has no constraints, and ...

Imho we should provide a mechanism that forces the parent to be empty
and lets the planner know, e.g. a false constraint on the parent ONLY.

Andreas




Re: [HACKERS] Postgres 8.3 archive_command

2007-11-22 Thread Zeugswetter Andreas ADI SD
   Perhaps we should move the successful archived message to DEBUG1 now,
   except for the first message after the archiver starts or when the
   archive_command changes, plus one message every 255 segments?
   That would reduce the log volume in the normal case without endangering
   our ability to see what is happening.

  Wouldn't it be more useful to increase the WAL segment size on such
  installations that switch WAL files so frequently that it is a problem
  for the log?

  This currently needs a recompile. I have wondered for some time now
  whether 16 MB isn't too low for current hw. Maybe it is time for making
  the WAL segment size changeable in the conf with a clean shutdown.

 I think it's too late in the release cycle to fully consider all the
 implications of that. 16MB is hardcoded in lots of places. The
 performance advantages of that have been mostly removed in 8.3, you
 should note.

Oh sorry, this was definitely not meant for 8.3. And here I didn't mean
the performance-of-the-db issue, but an issue for archiving the WAL files.
I think most archiving systems are not too happy with extremely frequent
backup calls. Also the overall handling of too many WAL files is imho
not handy.

Andreas




Re: [HACKERS] Postgres 8.3 archive_command

2007-11-22 Thread Zeugswetter Andreas ADI SD

 I don't think that should even be a TODO item --- it seems far more
 likely to provide a foot-gun than useful capability.

On further reflection I think that initdb time is probably sufficient.
Do you think that would be a reasonable TODO?

 Whether 16MB is still a reasonable default segment size is worth
 questioning, though I don't think that increasing it is an open-and-shut
 proposition.  Larger segments mean more overhead in configurations that
 force frequent segment switches, for instance.

Yes, imho there is no one-size-fits-all (if there were, it would probably
be between 32 and 64 MB).
But there are installations where even 16 MB is too much, e.g. for an
embedded device.

Andreas



Re: [HACKERS] Postgres 8.3 archive_command

2007-11-21 Thread Zeugswetter Andreas ADI SD

 Perhaps we should move the successful archived message to DEBUG1 now,
 except for the first message after the archiver starts or when the
 archive_command changes, plus one message every 255 segments?
 That would reduce the log volume in the normal case without endangering
 our ability to see what is happening.

Wouldn't it be more useful to increase the WAL segment size on such
installations that switch WAL files so frequently that it is a problem for
the log?

This currently needs a recompile. I have wondered for some time now whether
16 MB isn't too low for current hardware. Maybe it is time for making the
WAL segment size changeable in the conf with a clean shutdown.

Andreas



Re: [HACKERS] How to keep a table in memory?

2007-11-14 Thread Zeugswetter Andreas ADI SD
Kevin Grittner wrote:
  . . . the abuse of such hints in applications I have seen is so rampant
  as to make me doubt the utility of adding them anyway.  It's true that
  by adding hints, you give a facility to a good, competent designer who
  has a really

 I have trouble not seeing the point of any posts in this thread.
 Under our old, commercial database product, we had performance
 problems we addressed with a named caches feature -- you could
 declare a named cache of a particular size, and tweak some
 characteristics of it, then bind objects to it.  We came up with

Seems you simply fall into the competent category :-)

I know that another commercial product had introduced a "pin table into
memory" feature for a few years, but dropped it again in the current
release. It seems the potential for wrongdoing is significant :-(
At least a "lock this table into memory" must be accompanied by an
"allow a max percentage of buffercache" and something that loads the
table on startup. But what do you do if it does not fit?
Caching only parts of the table is useless for the mentioned use case.

One aspect that has not been addressed is whether there is a way to
cluster/partition the table in a way that reduces/clusters the number of
pages that need to be fetched by these not-frequent-enough but
performance-critical queries?

This may solve the problem with a different approach.

Andreas



Re: [HACKERS] Proposal: real procedures again (8.4)

2007-10-30 Thread Zeugswetter Andreas ADI SD

  I think the cool thing that Josh Berkus wants is
 
  return query select a, b, c from foo;
  return query select d, e, f from bar;
 
 
 maybe better
 
 SELECT a,b FROM foo;
 SELECT d, e FROM bar;
 
 procedure doesn't need return statement usually

The background for Quel was that when selecting all fields from an
inheritance hierarchy you got the additional fields of each child.

Thus the field count and types could vary within one cursor.
Like if you would allow the following:
select a, b::int from foo
union all
select a, c::varchar, d, e from bar

I don't think anybody would want to transfer that idea to SQL clients.
In SQL the first statement would define field count, name/alias and type.
The second statement would need to implicitly cast, or fail if it does
not match.

Andreas



Re: [HACKERS] Getting to 8.3 beta1

2007-09-28 Thread Zeugswetter Andreas ADI SD

  * Do we bump the .so major version number for libpq?  I think we should
  because there are two new exported functions since 8.2, and on at least
  some platforms there's nothing else than the major number to disambiguate
  whether a client needs these or not.  Comments?

-1. You don't bump the major if the old API can be used 1:1 with the new
lib.
New functions are not a reason for a major bump.
The major version business is to protect you from ruining currently
running (old) programs, not from using a too-old lib with newly compiled
programs.

Andreas



Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (MaybeOFFTOPIC)

2007-09-28 Thread Zeugswetter Andreas ADI SD

 A nice improvement on that would be to have a rearchive_command to
 allow syncing the new bytes written since a previous archive_command (so
 it needs a new placeholder "start from this byte").  This allows writing
 dd seek=%s skip=%s count=%b bs=1

But after a log switch nothing is filling that rest anymore.
Maybe this goes too much in the direction of a streaming-the-log
implementation, which is imho better suited to ship transactions somewhere
else as soon as possible.

Andreas



Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (MaybeOFFTOPIC)

2007-09-28 Thread Zeugswetter Andreas ADI SD

  The probably useful next step would be to pass the current length to the
  archive_command, so it can write the filled part of the file without the
  need for a filter.

 I can see that helping a lot, but not by writing onto the file on disk.
 If the file is nearly empty, that would be a lot of disk I/O which
 doesn't need to happen.

I think you misunderstood what I meant.
The actual archive command is constructed by expanding certain
placeholders.
I am suggesting to add such a placeholder for the size of the filled part
of the log.

A hypothetical example (note the suggested %b placeholder for the size in
bytes):
archive_command = 'dd if=%p of=/backup/WAL/%f bs=1 count=%b'

This allows avoiding unnecessary IO for the backup of partially filled
logs.

Andreas



Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (MaybeOFFTOPIC)

2007-09-27 Thread Zeugswetter Andreas ADI SD

  Attached is a modified version to implement both of these.  I also bailed
  out if there was surplus input.  I tried an optimization of allocating a
  separate buffer for outputting the zeros, to avoid repeated memset calls.
  It didn't seem to make a very big difference; do you think it's worth
  cluttering the code with that?

 Would it work to just ftruncate the file?

We would need to teach recovery to accept a short file if the last record
is a valid switch-log XLOG record. RestoreArchivedFile currently bails out
if the file size is not XLogSegSize.

We need to make exact checks though, or this would reduce reliability
(e.g. a short file must have records up to the very end).

The probably useful next step would be to pass the current length to the
archive_command, so it can write the filled part of the file without the
need for a filter.

Andreas



Re: [HACKERS] HOT line pointer bloat and PageRepairFragmentation

2007-09-13 Thread Zeugswetter Andreas ADI SD

 CREATE TABLE test (a int, b char(200));
 CREATE UNIQUE INDEX testindx ON test(a);
 INSERT INTO test VALUES (1, 'foo');
 
 Now, if we repeatedly update the tuple so that each update is a
 COLD update, we would bloat the page with redirect-dead line pointers.

Um, sorry for not understanding, but why would a COLD update produce a
redirect-dead line pointer (and not two LP_NORMAL ones)?

Andreas



Re: [HACKERS] HOT line pointer bloat and PageRepairFragmentation

2007-09-13 Thread Zeugswetter Andreas ADI SD

 The COLD updated (old) tuple would be pruned to dead line pointer
 once the tuple becomes DEAD. Normally that would let us reuse the
 tuple storage for other purposes. We do the same for DELETEd tuples.

Oh, I thought only pruned tuples from HOT chains could produce a
redirect-dead line pointer.

This looks like a problem, since we might end up with a page filled with
LP_DEAD slots that all have no visibility info and can thus not be cleaned
by vacuum.

Maybe PageRepairFragmentation, when called from HOT, should prune less
aggressively, e.g. prune until at most 1/2 the available slots are LP_DEAD,
and not prune the rest.

Andreas



Re: [HACKERS] SQL feature requests

2007-08-25 Thread Zeugswetter Andreas ADI SD

  I still don't see it as terrifically useful functionality, given that
  it's just saving you 4 keystrokes ...

 Less than that, because the AS is optional.  The only argument I've
 heard that carries much weight with me is that it eases porting from
 other DBMS's that allow this.  Are there any others besides Oracle?

FWIW, IBM implemented this syntax in Informix 11.1 Cheetah, released in
July 2007.
They also allow omitting the alias, and it imho makes sense.

Andreas



Re: [HACKERS] SQL feature requests

2007-08-23 Thread Zeugswetter Andreas ADI SD

  If your implementation accepts:

    group by case when true then 'foo' end

I think he meant: group by case when true then foo end

 What would that mean? Regardless of whether or not it's accepted, it
 should have *some* meaning.

 It's not equivalent to GROUP BY foo

Yea, but only because 'foo' is an identifier, and not a string constant.

 test=# select record_id as foo
             , count(observation_id) as bar
          from observation
         group by case when true
                       then 'foo'
                  end;
 ERROR:  column "observation.record_id" must appear in the GROUP BY
 clause or be used in an aggregate function

I think your example would be easier to understand if you removed the
quotes.
We don't detect the correctness of the above query. You can hardly say
that this is a feature, but I am not inclined to see it as a troublesome
bug either.

Andreas



Re: [HACKERS] SQL feature requests

2007-08-23 Thread Zeugswetter Andreas ADI SD

  how much harder can it be to accept:

    group by 'foo'

Presumably you meant "group by foo".

Imho pg should accept "group by foo". It could be part of a constant
removal that also takes burden off the sort.

e.g. in "select x, count(*) from bar where x=5 group by x", x could be
removed since it is constant.

 This is not about hardness of the implementation, but rather about
 non-confusing behaviour I think.

 AFAIK, "group by 1" means "group by the first selected column", not
 "group all rows together". But "group by 'foo'" would carry the second
 meaning - group all rows together.

Yes. I don't see the issue. 1 is imho sufficiently different even from
'1'.
Pg is not alone in allowing a column number in group by.

Andreas



Re: [HACKERS] Naming of the prefab snowball stemmer dictionaries

2007-08-22 Thread Zeugswetter Andreas ADI SD

Sounds reasonable, but why exactly did we spell out "english" instead of
"en"? Seems the abbreviation is much easier to extract from LANG or browser
prefs ...

Andreas  

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of Tom Lane
Sent: Wednesday, 22 August 2007 17:11
To: Oleg Bartunov; Teodor Sigaev
Cc: pgsql-hackers@postgreSQL.org
Subject: [HACKERS] Naming of the prefab snowball stemmer dictionaries
Importance: Low

I notice that the existing tsearch documentation that we've imported fairly 
consistently refers to Snowball dictionaries with names like en_stem, 
ru_stem, etc.  However, CVS HEAD is set up to create them with names 
english, russian, etc.  As I've been absorbing more of the docs I'm 
starting to wonder whether this is a good idea.  ISTM that these names 
encourage a novice to think that the one dictionary is all you could need for a 
given language; and there are enough examples of more-complex setups in the 
docs to make it clear that in fact Snowball is not the be-all and end-all of 
dictionaries.

I'm thinking that going back to the old naming convention (or something like it 
--- maybe english_stem, russian_stem, etc) would be better.
It'd help to give the right impression, namely that these dictionaries are a 
component of a solution but not necessarily all you need.

Thoughts?

regards, tom lane




Re: [HACKERS] 2D partitioning of VLDB - sane or not?

2007-08-14 Thread Zeugswetter Andreas ADI SD

 Which brings us back to the original issue. If I decide to stick with
 the current implementation and not improve our existing partitioning
 mechanisms to scale to 100,000 partitions, I could do something like:

There is a point where you can leave the selection of the correct rows
to normal btree indexes.
I'd say that that point currently is well below 2000 partitions for all
common db systems.
I'd argue that more partitions will only be useful for very limited use
cases, and would be very interested in hearing of a practical partitioning
scheme where more partitions still show a performance advantage (on any db
system).

Looks like in your case a partitioning scheme with 256 partitions on one
of the 2 dimensions sounds reasonable.
Or 32 instead of 256 bins for each dim, if your searches are bidirectional.

Andreas



Re: [HACKERS] HOT pgbench results

2007-08-07 Thread Zeugswetter Andreas ADI SD

                unpatched   HOT
  autovacuums         116    43
  autoanalyzes        139    60

  HOT greatly reduces the number of vacuums needed. That's good, that's
  where the gains in throughput in longer I/O bound runs come from.

 But surely failing to auto-analyze after a HOT update is a bad thing.

Well, the definition is that no indexed columns changed, so this seems
debatable.
It seems for OLTP you should not need an analyze, but for DSS filtering or
joining on non-indexed columns you would. And that would also only be
relevant if you created custom statistics on non-indexed columns.

Andreas



Re: [HACKERS] Straightforward changes for increased SMP scalability

2007-07-17 Thread Zeugswetter Andreas ADI SD

 The NUM_BUFFER_PARTITIONS patch is fairly simple. We've
 noticed gains with NUM_BUFFER_PARTITIONS set between 256 and
 2048, but little to no gain after 2048, although this might
 depend on the benchmark and platform being used. We've

Might this also be a padding issue, because 2048 partitions seems mighty
high?
Other dbs seem to cope well with a max of 64 partitions.

Andreas



Re: [HACKERS] write_pipe_chunks patch messes up early error message output

2007-07-16 Thread Zeugswetter Andreas ADI SD

  Is there any reason we can't just use a check on whether SysLoggerPID
  is not 0?

 (a) that really shouldn't be exported out of postmaster.c,
 and (b) it is not readily available to child backends, is it?

Should there be child backends when the logger did not start?
I'd think startup would be aborted if that happened?

Andreas



Re: [HACKERS] msvc and vista fun

2007-06-25 Thread Zeugswetter Andreas ADI SD

 user). I can build as the admin user but when I come to run
 initdb it fails, complaining that it can't find the postgres
 executable.

FYI, this happens on my Win 2000 also.
Maybe a problem with mixed / \ path separators after RestrictExec.

Andreas



Re: [HACKERS] Winner of naming discussions: Synchronous Commit

2007-06-25 Thread Zeugswetter Andreas ADI SD

 synchronous_commit
 Idea: Greg Stark
 Supporters: Simon, Josh, Tom, Bruce, Florian

There was one more:
asynchronous_commit
Idea: Florian G. Pflug
Supporters: none

But if you are calling the feature that (which imho is good), the GUC
might as well get that name.
Andreas



Re: [HACKERS] Sorted writes in checkpoint

2007-06-15 Thread Zeugswetter Andreas ADI SD

  tests                      | pgbench | DBT-2 response time (avg/90%/max)
  ---------------------------+---------+-----------------------------------
   LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
   + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
   + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s

  (*) Don't write buffers that were dirtied after starting the checkpoint.

  machine : 2GB-ram, SCSI*4 RAID-5
  pgbench : -s400 -t4 -c10  (about 5GB of database)
  DBT-2   : 60WH (about 6GB of database)
 
 I'm very surprised by the BM_CHECKPOINT_NEEDED results. What
 percentage of writes has been saved by doing that? We would
 expect a small percentage of blocks only and so that
 shouldn't make a significant difference. I thought we

Wouldn't pages that are dirtied during the checkpoint also usually be
rather hot?
Thus if we lock one of those for writing, the chances are high that a
client needs to wait for the lock?
A write OS call should usually be very fast, but when the IO gets
bottlenecked it might easily become slower.

Probably the recent result, that it saves ~53% of the writes, is
sufficient explanation though.

Very nice results :-) Looks like we want all of it, including the sort.

Andreas



Re: [HACKERS] Selecting a constant question: A summary

2007-06-13 Thread Zeugswetter Andreas ADI SD

 For some Unicode character sets, element_width can be as much as 4

In UTF8 one char can be up to 6 bytes, so 4 is not correct in general.

Andreas



Re: [HACKERS] Selecting a constant question

2007-06-12 Thread Zeugswetter Andreas ADI SD

  Again, *all* other major relational databases do this ...
  even blob fields have a maximum length reported from the database.

 So what are you doing with the max length? Not all data types
 and values have a meaningful max length, so you have to be
 able to deal with variable length data anyway.

Imho it has a lot to do with optimizing the interface.
If you know that the max length is e.g. 16 bytes in UTF-8 for the 3 chars,
you will probably not want any on-the-fly allocation smarts and will
preallocate and bind those 16 bytes. When the max length value gets
larger, and it is a variable-length type, the overhead of varlen
allocation smarts starts to pay off.

A generic interface should keep the SQL parsing smarts to a minimum,
thus it cannot know that a returned column is actually a text constant.

Imho the request for a max length is very reasonable, but it has no value
once it exceeds a certain size, e.g. 64k.
Andreas



Re: [HACKERS] Selecting a constant question

2007-06-12 Thread Zeugswetter Andreas ADI SD

 That's exactly the point. Consider
   select mytext from mytable;

 How can PostgreSQL possibly know the maximum length of the
 returned values *before* it has scanned the whole table?

I think this focuses too much on those cases where it is not possible.
When it is not feasible, like with a text column, clients deal with it
already (obviously some better than others).
It is for those cases where it would be feasible, like constants (or
concatenated columns), where the max length, if properly returned, could
be used to improve performance.

Andreas



Re: [HACKERS] Truncate Permission

2007-06-11 Thread Zeugswetter Andreas ADI SD

  Wouldn't it be far more logical to decide that if a user has the
  permissions to do a DELETE FROM table; then they have permission to do
  a TRUNCATE? Why make an additional permission?

 Truncate doesn't fire ON DELETE triggers.

Yes, but it would imho be OK if there aren't any ON DELETE triggers on
the table.

Andreas



Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately

2007-06-08 Thread Zeugswetter Andreas ADI SD

   The launcher is set up to wake up in autovacuum_naptime seconds at most.

Imho the fix is usually to have a sleep loop.

Andreas



Re: [HACKERS] TOAST usage setting

2007-06-08 Thread Zeugswetter Andreas ADI SD

 My next suggestion would be to leave EXTERN_TUPLES_PER_PAGE as is, but:
 split data wider than a page into page-sized chunks as long as they fill
 whole pages.
 Split the rest with EXTERN_TUPLES_PER_PAGE (4) as now.
 This would not waste more space than currently, but improve performance
 for very wide columns.

 I can try to do a patch if you think that is a good idea, can't do a lot
 of testing though.

I have a PoC patch running, but it is larger than expected because of the
size checks during read
(toast_fetch_datum_slice not done, but would be straightforward).
Also the pg_control variable toast_max_chunk_size would need to be renamed
to reflect the EXTERN_TUPLES_PER_PAGE (4) number and the fact that
full-page chunks are used
(else the chunk size checks and slicing could not work like now).

Should I pursue it, keep it for 8.4, or dump it?

The downside of this concept is that chunks smaller than a full page still
get split into the smaller pieces.
And the <~8k chunks may well outnumber the >~8k ones on real data.
The upside is that I do not see a better solution that would keep slicing
cheap and still lower the overhead even for pathological cases.

Andreas



Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately

2007-06-08 Thread Zeugswetter Andreas ADI SD

   The launcher is set up to wake up in autovacuum_naptime seconds at most.

  Imho the fix is usually to have a sleep loop.

 This is what we have.  The sleep time depends on the schedule
 of the next vacuum for the closest database in time.  If naptime
 is high, the sleep time will be high (depending on the number of
 databases needing attention).

No, I meant a "while (sleep 1 (or 10) and counter < longtime) check for
exit" instead of "sleep longtime".
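
I.e. something like this pattern (a sketch only; the exit check is passed
in by the caller, it is not a real API):

#include <unistd.h>

void
interruptible_nap(int naptime_secs, int (*should_exit)(void))
{
    int elapsed;

    /* sleep in one-second slices so the process can go away promptly */
    for (elapsed = 0; elapsed < naptime_secs; elapsed++)
    {
        if (should_exit())
            return;             /* e.g. the postmaster went away */
        sleep(1);
    }
}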

Andreas



Re: [HACKERS] Implicit casts with generic arrays

2007-06-06 Thread Zeugswetter Andreas ADI SD

 For example in 8.2 this is mapped to array_prepend:
 
 regression=# select 'x'::text || array['aa','bb','cc'];
?column?
 --
  {x,aa,bb,cc}
 (1 row)
 
 but with the experimental code you get textcat:
 
 catany=# select 'x'::text || array['aa','bb','cc'];
   ?column?
 -
  x{aa,bb,cc}
 (1 row)

This is what I would have expected || to give, and not what 8.2 does.
So, disregarding the rest of the argument, I think that array_[pre|ap]pend
should have other operators.

Andreas



Re: [HACKERS] TOAST usage setting

2007-06-06 Thread Zeugswetter Andreas ADI SD

  No, you misunderstood. Bruce was suggesting changing the target to 512.
  That means if a row is wider than ~2k, the toaster will try to toast
  until the base row is ~512 bytes. I would not do that part for 8.3.

 OK, what do you suggest for 8.3?  Attached are my suggestions
 to use 512 and a 4k chunk size, which I think means that 2.7k
 is the worst value that has a loss of around 25%.

Oh, so I misunderstood you also. You are suggesting two changes:
TOAST_TUPLES_PER_PAGE   16
EXTERN_TUPLES_PER_PAGE  2

While I agree that 2 might be a good compromise with low risk for now, I
think that toasting all rows down to ~512 bytes is too narrowly targeted
at not reading wider columns.

When suggesting a new target, I interpreted you like so:
#define TOAST_TUPLES_PER_PAGE   4
#define TOAST_TUPLE_TARGET  \
    MAXALIGN_DOWN((BLCKSZ - \
                   MAXALIGN(sizeof(PageHeaderData) + \
                            (TOAST_TUPLES_PER_PAGE - 1) * sizeof(ItemIdData))) \
                  / 16)
So we would only toast rows wider than 2k, but once toasting, toast the
base row down to 512.

My suggestion would be to leave TOAST_TUPLES_PER_PAGE as is, because all
else would need extensive performance testing:
#define TOAST_TUPLES_PER_PAGE   4

My next suggestion would be to leave EXTERN_TUPLES_PER_PAGE as is, but:
split data wider than a page into page-sized chunks as long as they fill
whole pages.
Split the rest with EXTERN_TUPLES_PER_PAGE (4) as now.
This would not waste more space than currently, but improve performance
for very wide columns.
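
To make the chunking rule concrete, a sketch (constants simplified; the
real code has to subtract page and tuple header overhead):

#define FULL_PAGE_CHUNK   8192                       /* stand-in for the per-page payload */
#define SMALL_CHUNK       (FULL_PAGE_CHUNK / 4)      /* EXTERN_TUPLES_PER_PAGE = 4 */

static int
next_chunk_size(int bytes_left)
{
    if (bytes_left >= FULL_PAGE_CHUNK)
        return FULL_PAGE_CHUNK;         /* emit a chunk that fills a whole page */
    if (bytes_left > SMALL_CHUNK)
        return SMALL_CHUNK;             /* split the remainder as today */
    return bytes_left;                  /* final short chunk */
}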

I can try to do a patch if you think that is a good idea; can't do a lot
of testing though.

Andreas



Re: [HACKERS] TOAST usage setting

2007-06-06 Thread Zeugswetter Andreas ADI SD

  While I agree that 2 might be a good compromise with low risk for now,
  I think that toasting all rows down to ~512 bytes is too narrowly
  targeted at not reading wider columns.

 Well, it is summarized here:

   http://momjian.us/expire/TOAST/SUMMARY.html

 It made non-TOAST access 2x faster, but TOAST 7x slower, and
 that seemed like a good compromise.

Yes, my argument was that I expect that in the up-to-1-2k range more use
cases will suffer the 7x slowdown than see the speedup. But the ratio
certainly is hard to judge, and you may well be right.

e.g. for me TOAST_TUPLES_PER_PAGE 8 would be OK; I have base row widths
of ~700 in 2 tables that would suffer if further toasted, but none in
the 1k - 2k range.

I wonder whether this threshold isn't rather absolute, thus the 32k
pagesize users should probably use 32.

  When suggesting a new target, I interpreted you like so:
  #define TOAST_TUPLES_PER_PAGE   4
  #define TOAST_TUPLE_TARGET  \
      MAXALIGN_DOWN((BLCKSZ - \
                     MAXALIGN(sizeof(PageHeaderData) + \
                              (TOAST_TUPLES_PER_PAGE - 1) * sizeof(ItemIdData))) \
                    / 16)
  So we would only toast rows wider than 2k, but once toasting, toast
  the base row down to 512.

 That is certainly not my intent, and I don't see how you
 would get the 2k number from that macro.  I think you are

The ~2k comes from TOAST_TUPLE_THRESHOLD.

 looking at 8.2 and not CVS HEAD.  CVS HEAD has:

   #define TOAST_TUPLE_TARGET  TOAST_TUPLE_THRESHOLD

Nope, I meant what I said. To only change the target you would replace
the above macro for TOAST_TUPLE_TARGET.
But I also don't see how this would be good.

Andreas



Re: [HACKERS] TOAST usage setting

2007-06-05 Thread Zeugswetter Andreas ADI SD

  The big question is do we want to drop the target tuple size down to
  512, and increase the chunk size to 8k for 8.3?  Dropping the tuple
  size down to 512 is going to give us some smaller TOAST values to fill
  in free space created by the 8k chunk size, assuming you have both
  types of values in the table.  Do we want to increase the access time
  of long TOAST by 6% if it means having more wasted space for lots of
  4.1k values?

 If we do that people could see their disk space usage increase by up to
 16x: currently 513 bytes fits in heap and takes (roughly) 513 bytes;

No, you misunderstood. Bruce was suggesting changing the target to 512.
That means if a row is wider than ~2k, the toaster will try to toast until
the base row is ~512 bytes. I would not do that part for 8.3.

 if we make that change it would then get toasted and
 take 8K. I don't think we want to do that. Disk space aside,
 it's almost certain to seriously hurt performance as soon as
 you don't fit entirely in memory.

No, allowing one toast chunk to fill a page does not mean that every
chunk uses a whole page.

Andreas



Re: [HACKERS] What is happening on buildfarm member baiji?

2007-06-04 Thread Zeugswetter Andreas ADI SD

   Given this, I propose we simply #ifdef out the SO_REUSEADDR on
win32.

I agree that this is what we should do.

   (A fairly good reference to read up on the options is at 
   http://msdn2.microsoft.com/en-us/library/ms740621.aspx
  
  Hmm ... if accurate, that page says in words barely longer than one 
  syllable that Microsoft entirely misunderstands the intended meaning

  of SO_REUSEADDR.
 
 Yes, that's how I read it as well.
 
  It looks like SO_EXCLUSIVEADDRUSE might be a bit closer to the 
  standard semantics; should we use that instead on Windoze?
 
 I think you're reading something wrong. The way I read it, 
 SO_EXCLUSIVEADDRUSE gives us pretty much the same behavior we have on
Unix
 *without* SO_REUSEADDR. There's a paragraph specifically 
 talking about the problem of restarting a server having to 
 wait for a timeout when using this switch.

Yup, that switch is no good either.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Constraint exclusion oddity with composite index

2007-06-04 Thread Zeugswetter Andreas ADI SD

  Assume the following:
  index on: (id, adate)
  constraint CHECK(adate > '01-01-2007' AND adate < '04-01-2007');
 

Um, the subject is CE, but the question is about an index ? Those are
separate issues.
 
  The planner will not use the index listed above.

  For what?
 
 select adate from parent where adate = '01-25-2007'

A possibly cheaper plan would be a self-join that produces all possible
id's and then probes the index for each (id, adate) pair.
Note that you would not need to check visibility of the id's you produce
(index-only access).
Is that what you were expecting? This is not implemented.
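
To sketch the plan shape I have in mind (hand-written SQL only; this is
not something the planner produces, and it assumes the parent table and
column names from your example):

  select p.adate
  from (select distinct id from parent) ids
  join parent p on p.id = ids.id and p.adate = '01-25-2007';

Each distinct id would then cost one extra descent of the (id, adate)
index instead of a full scan.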

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Ye olde drop-the-database-you-just-left problem

2007-05-31 Thread Zeugswetter Andreas ADI SD
   However, it suddenly struck me that we could 
 probably make most of the problem go away if we put that same wait
into 
 DROP DATABASE itself --- that is, if we see other backends in the 
 target DB, sleep for a second or two and then recheck before erroring
out.

Yup, waiting in drop database up to 10-30 secs would imho be fine.

Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] TOAST usage setting

2007-05-31 Thread Zeugswetter Andreas ADI SD

 I tested EXTERN_TUPLES_PER_PAGE for values 4(default), 2, and 1:
 
   4   15.596
   2   15.197
   1   14.6
 
 which is basically a 3% decrease from 4-2 and 2-1.  The 
 test script and result are here:
 
   http://momjian.us/expire/TOAST2/
 
 shared_buffers again was 32MB so all the data was in memory.

Thanks for the test. (The test is for 1 row that is 100k wide.)

It is good. It shows that we even see a small advantage in the
everything-cached case.

What we don't have yet are numbers for whether EXTERN_TUPLES_PER_PAGE=1
substantially increases the toast table size in real-life scenarios, what
happens in the worst case (~48% wastage compared to the previous 12%),
and whether 1 row per page works well with autovacuum.
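
To get real-life numbers, something like the following could be run on an
existing database (just a sketch, assuming a server that already has the
built-in pg_relation_size/pg_size_pretty functions):

  SELECT c.relname,
         pg_size_pretty(pg_relation_size(c.oid))           AS heap_size,
         pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
  FROM pg_class c
  WHERE c.relkind = 'r' AND c.reltoastrelid <> 0;

Comparing toast_size before and after the change would show the wastage
on real data.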

The bad case (with EXTERN_TUPLES_PER_PAGE=1) is when most toast tuples
have a size over TOAST_MAX_CHUNK_SIZE_for_2+1 but are enough smaller than
a page that we care about the wastage. Maybe we can special-case that
range.
Maybe determine (and lock) the free space of any cheap-to-get-at non-empty
page (e.g. the current insert target page) and split the toast data there.

Andreas

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] TOAST usage setting

2007-05-30 Thread Zeugswetter Andreas ADI SD

 I reran the tests with hashtext(), and created a SUMMARY.HTML chart:
 
   http://momjian.us/expire/TOAST/
 
 What you will see is that pushing TEXT to a TOAST column 
 allows quick access to non-TOAST values and single-row TOAST 
 values, but accessing all TOAST columns is slower than 
 accessing them in the heap, by a factor of 3-18x.

Um, my understanding of this measurement would be to tweak
TOAST_TUPLE_THRESHOLD, and use a fixed TOAST_MAX_CHUNK_SIZE ?
Why would you want to increase the number of toast tuples required for
one column value ?

My expectation would be that we want to allow a toast tuple to fill a
whole page (TOAST_TUPLES_PER_PAGE = 1),
but keep or reduce the threshold. Maybe we need other logic to find and
use free space for that, though (8.4 material).

Andreas

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] TOAST usage setting

2007-05-30 Thread Zeugswetter Andreas ADI SD

  Whereas if you set toast_tuples_per_page to 8k then the only option 
  for Postgres will be to put each datum in its own page and 
 waste 1-3k 
  on every page.
 
 No, because actually the code is designed to make the toast 
 chunk size just enough less than 8K that the tuples fit.

He mentioned toasted values of 6-7k each. If all values are of that size
(like payment slip TIFFs) there is nothing we could do with the
remaining 1-2k on each page. But that disadvantage disappears as soon as
you have just a little more variability in the length. Still, it might
be enough to opt for some free-space reuse smarts if we can think of a
cheap heuristic. But the cost to fetch such a distributed tuple would
be so huge that I doubt there is anything to win but disk space.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] like/ilike improvements

2007-05-25 Thread Zeugswetter Andreas ADI SD

  However, I have just about convinced myself that we don't need 
  IsFirstByte for matching _ for UTF8, either preceded by % or
not, 
  as it should always be true. Can anyone come up with a counter
example?
 
 You have to be on a first byte before you can meaningfully 
 apply NextChar, and you have to use NextChar or else you 
 don't count characters correctly (eg __ must match 2 chars 
 not 2 bytes).

Well, for UTF8 NextChar could advance to the next char even if the current
byte position is in the middle of a multibyte char (skip over all 10xxxxxx
continuation bytes).

(Assuming UTF16 surrogate pairs are not encoded as 2 x 3 bytes, which is
not valid UTF8 anyway.)

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Seq scans roadmap

2007-05-16 Thread Zeugswetter Andreas ADI SD

32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 
cache effect.

I'd say in a scenario where 32k pages are indicated you will also want
larger than average L2 caches.


How about using 256/blocksize?

The read-ahead uses 1/4 of the ring size. To the best of our knowledge,
this 1/4 needs to be 128k for reading.
So I'd say we need 512/blocksize (i.e. a 512 kB ring).

Andreas

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] What is happening on buildfarm member baiji?

2007-05-14 Thread Zeugswetter Andreas ADI SD

 Close. There was an Msys build from the 9th running on port 5432.

 2) VC++ and Msys builds will both happily start on the same 
 port at the same time. The first one to start listens on 5432 
 until it shuts down, at which point the second server takes 
 over seamlessly! It doesn't matter which is started first - 
 it's as if Windows is queuing up the listens on the port.

Um, we explicitly set SO_REUSEADDR. So the port will happily allow a
second bind.

http://support.microsoft.com/kb/307175 quote:
If you use SO_REUSEADDR to bind multiple servers to the same port at the
same time, only one random listening socket accepts a connection
request.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Seq scans roadmap

2007-05-11 Thread Zeugswetter Andreas ADI SD

 Sorry, 16x8K page ring is too small indeed. The reason we 
 selected 16 is because greenplum db runs on 32K page size, so 
 we are indeed reading 128K at a time. The #pages in the ring 
 should be made relative to the page size, so you achieve 128K 
 per read.

Ah, ok. New disks here also have a peak at 128k with no other concurrent
IO.
Writes benefit from larger blocksizes though, 512k and more.
Reads with other concurrent IO might also benefit from larger
blocksizes.

Comment to all: to test optimal blocksizes make sure you have other
concurrent IO on the disk.

 Also agree that KillAndReadBuffer could be split into a 
 KillPinDontRead(), and ReadThesePinnedPages() functions. 
 However, we are thinking of AIO and would rather see a 
 ReadNPagesAsync() function.

Yes, you could start the aio and return an already read buffer to allow
concurrent cpu work.
However, you would still want to do blocked aio_readv calls to make sure
the physical read uses the large blocksize.
So I'd say aio would benefit from the same split.

In another posting you wrote:
 The patch has no effect on scans that do updates. 
 The KillAndReadBuffer routine does not force out a buffer if 
 the dirty bit is set. So updated pages revert to the current 
 performance characteristics.

Yes I see, the ring slot is replaced by a standard ReadBuffer in that
case, looks good.

I still think it would be better to write out the buffers and keep them
in the ring when possible, but that seems to need locks and some sort of
synchronization with the new walwriter, so looks like a nice project for
after 8.3.

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Seq scans roadmap

2007-05-10 Thread Zeugswetter Andreas ADI SD

 In reference to the seq scans roadmap, I have just submitted 
 a patch that addresses some of the concerns.
 
 The patch does this:
 
 1. for small relation (smaller than 60% of bufferpool), use 
 the current logic 2. for big relation:
   - use a ring buffer in heap scan
   - pin first 12 pages when scan starts
   - on consumption of every 4-page, read and pin the next 4-page
   - invalidate used pages of in the scan so they do not 
 force out other useful pages

A few comments regarding the effects:

I do not see how this speedup could be caused by readahead, so what are
the effects ?
(It should make no difference to do the CPU work for count(*) in between
reading each block when the pages are not dirtied.)
Is the improvement solely reduced CPU, because no search for a free
buffer is needed, and/or L2 cache locality ?

What effect does the advance pinning have, avoiding vacuum ?

A 16 x 8k page ring is too small to allow the needed IO blocksize of
256k.
The readahead is done 4 x one page at a time (= 32k).
What is the reasoning behind 1/4 of the ring for readahead (why not 1/2)?
Is 3/4 the trail for followers and the bgwriter ?

I think, in anticipation of doing a single IO call for more than one
page, the KillAndReadBuffer function should be split into two parts: one
that does the killing for n pages, and one that does the reading for n
pages.
Killing n before reading n would also have the positive effect of
grouping any needed writes (not interleaving them with the reads).

I think the 60% Nbuffers is a very good starting point. I would only
introduce a GUC when we see evidence that it is needed (I agree with
Simon's partitioning comments, but I'd still wait and see). 

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Seq scans roadmap

2007-05-10 Thread Zeugswetter Andreas ADI SD

 Also, that patch doesn't address the VACUUM issue at all. And 
 using a small fixed size ring with scans that do updates can 
 be devastating. I'm experimenting with different ring sizes 
 for COPY at the moment. Too small ring leads to a lot of WAL 
 flushes, it's basically the same problem we have with VACUUM 
 in CVS HEAD.

My first take on that would be to simply abandon any dirty (and actually
also any still pinned) buffer from the ring and replace the ring slot
with a buffer from the freelist.
If the freelist is empty and LSN allows writing the buffer, write it
(and maybe try to group these).
If the LSN does not allow the write, replace the slot with a buffer from
LRU.

Andreas 

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Seq scans roadmap

2007-05-09 Thread Zeugswetter Andreas ADI SD

  Are you filling multiple buffers in the buffer cache with a single 
  read-call?
 
  yes, needs vector or ScatterGather IO.
 
 I would expect that to get only moderate improvement.

The vast improvement comes from 256k blocksize.

 To get 
 the full benefit I would think you would want to either fire 
 off a separate thread to do the read-ahead, use libaio, or 
 funnel the read-ahead requests to a separate thread like our 
 bgwriter only it would be a bgreader or something like that.

I like bgreader :-) But that looks even more difficult than grabbing 32
[scattered or contiguous] buffers at once.
Especially in a situation where there is no concurrent load it would be
nice to do CPU work while waiting for the next read-ahead IO. If there
is enough parallel CPU load it is actually not so important. So I'd argue
that on a high-load server you get nearly all of the benefit without any
sort of AIO.

  The OS should be doing readahead for us anyway, so I don't see how 
  just issuing multiple ReadBuffers one after each other helps.
 
  Last time I looked OS readahead was only comparable to 32k blocked
reads.
  256k blocked reads still perform way better. Also when the OS is 
  confronted with an IO storm the 256k reads perform way better than
OS readahead.
 
 Well that's going to depend on the OS. Last I checked Linux's 
 readahead logic is pretty straightforward and doesn't try to 
 do any better than 32k readahead and is easily fooled. 
 However I wouldn't be surprised if that's changed. 

My test was on AIX; 32 or 64k seems quite common, at least as a default
setting.
Also, on some OSes (like HP-UX) the OS readahead and write-behind strategy
changes with large IO blocksizes, imho beneficially.

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Seq scans roadmap

2007-05-08 Thread Zeugswetter Andreas ADI SD

 Nope - just reading N buffers ahead for seqscans.  Subsequent 
 calls use previously read pages.  The objective is to issue 
 contiguous reads to the OS in sizes greater than the PG page 
 size (which is much smaller than what is needed for fast 
 sequential I/O).

The problem here is that either you issue the large read into throwaway
private memory and hope that when you later read 8k you get the page from
the OS buffer cache, or you need ScatterGather IO and a way to grab 32
buffers at once.

 Yes, I think the ring buffer strategy should be used when the 
 table size is  1 x bufcache and the ring buffer should be of 
 a fixed size smaller than L2 cache (32KB - 128KB seems to work well).

How do you merge those two objectives? It seems the ring buffer needs to
be at least as large as the contiguous read size.
Thus you would need at least a 256k ring buffer. Better yet, have twice
the IO size as the ring buffer size, so two sessions can alternately take
the lead while the other session still holds a previous page. Modern L2
cache is 8 MB, so 512k seems no problem ?

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Seq scans roadmap

2007-05-08 Thread Zeugswetter Andreas ADI SD

  What do you mean with using readahead inside the heapscan? 
  Starting an async read request?
  
  Nope - just reading N buffers ahead for seqscans.  Subsequent calls 
  use previously read pages.  The objective is to issue 
 contiguous reads 
  to the OS in sizes greater than the PG page size (which is much 
  smaller than what is needed for fast sequential I/O).
 
 Are you filling multiple buffers in the buffer cache with a 
 single read-call?

yes, needs vector or ScatterGather IO.
 
 The OS should be doing readahead for us 
 anyway, so I don't see how just issuing multiple ReadBuffers 
 one after each other helps.

Last time I looked, OS readahead was only comparable to 32k blocked reads.
256k blocked reads still perform way better. Also, when the OS is
confronted with an IO storm, the 256k reads perform way better than OS
readahead.

Andreas

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Heap page diagnostic functions

2007-05-02 Thread Zeugswetter Andreas ADI SD

  Any suggestions? pgdiagnostics?
 
 Yes, I like diagnostics, or internals.  I just think 
 forensics isn't going to be understood by the average native 
 English speaker, let alone non-English speakers.

I think forensics is ok. The world is currently being swamped with
related TV shows :-)

Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] too much WAL volume

2007-04-26 Thread Zeugswetter Andreas ADI SD

  Writing to a different area was considered in pg, but there were
more 
  negative issues than positive.
  So imho pg_compresslog is the correct path forward. The current
  discussion is only about whether we want a more complex
pg_compresslog 
  and no change to current WAL, or an increased WAL size for a less 
  complex implementation.
  Both would be able to compress the WAL to the same archive log
size.
 
 Huh?  As conceived, pg_compresslog does nothing to lower log 
 volume for general purposes, just on-disk storage size for 
 archiving.  It doesn't help us at all with the tremendous 
 amount of log we put out for an OLTP server, for example.

Ok, that is not related to the original discussion though.
I have thus changed the subject, and removed [PATCHES].

You cannot directly compare the pg WAL size with other DBs' since they
write parts to other areas (e.g. the physical log in Informix). You would
need to include those writes in a fair comparison.
It is definitely not true that writing to a different area has only
advantages. The consensus was that writing the page images to the WAL
has more pros. We could revisit the pros and cons though.

Other options involve special OS and hardware (we already have that), or
accepting a high risk of needing a restore after a power outage (we don't
have that, because we use no mechanism to detect such a failure).

I am not sure that shrinking the per-WAL-record size (other than the full
page images), e.g. by only logging changed bytes and not whole tuples,
would have a huge impact on OLTP tx/sec, since the limiting factor is
IOs per second and not MB per second. Recent developments like HOT seem
a lot more promising in this regard since they avoid IO.

Andreas

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-26 Thread Zeugswetter Andreas ADI SD

 So what happens if a backend is running with full_page_writes 
 = off, someone edits postgresql.conf to turns it on and 
 forgets to reload/ restart, and then we crash? You'll come up 
 in recovery mode thinking that f_p_w was turned on, when in 
 fact it wasn't.
 
 ISTM that we need to somehow log what the status of 
 full_page_writes is, if it's going to affect how recovery works.

Optimally, recovery should do this only when confronted with a full page
image. The full page is in the same WAL record that first touches a page,
so this should not need to depend on a setting.

Andreas

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-25 Thread Zeugswetter Andreas ADI SD

  1) To deal with partial/inconsisitent write to the data file at
crash 
  recovery, we need full page writes at the first modification to
pages
  after each checkpoint.   It consumes much of WAL space.
 
 We need to find a way around this someday.  Other DBs don't 
 do this; it may be becuase they're less durable, or because 
 they fixed the problem.

They either can only detect a failure later (this may be a very long
time, depending on access and verify runs) or they also write page
images. Those that write page images usually write before-images to a
different area that is cleared periodically (e.g. during checkpoint).

Writing to a different area was considered in pg, but there were more
negative issues than positive.
So imho pg_compresslog is the correct path forward. The current
discussion is only about whether we want a more complex pg_compresslog
and no change to current WAL, or an increased WAL size for a less
complex implementation.
Both would be able to compress the WAL to the same archive log size.

Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-24 Thread Zeugswetter Andreas ADI SD

 3) To maintain crash recovery chance and reduce the amount of 
 archive log, removal of  unnecessary full page writes from 
 archive logs is a good choice.

Definitely, yes. pg_compresslog could even move the full pages written
during backup out of WAL and put them in a different file that needs to
be applied before replay of the corresponding WAL after a physical
restore. This would further help reduce log shipping volume.

 To do this, we need both logical log and full page writes in WAL.

This is only true in the sense that it allows a less complex
implementation of pg_compresslog.

Basically a WAL record consists of info about what happened and,
currently, either per-tuple new data or a full page image. The info about
what happened together with the full page image is sufficient to
reconstruct the per-tuple new data. There might be a few WAL record
types (e.g. in btree split ?) where this is not so, but we could either
fix those or not compress those.

This is why I don't like Josh's suggested name of wal_compressable
either.
WAL is compressible either way; only pg_compresslog would need to be
more complex if you don't turn off the full-page optimization. I think a
good name would convey that you are turning off an optimization
(thus my wal_fullpage_optimization on/off).

Andreas


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [Fwd: PGBuildfarm member narwhal Branch HEAD Statuschanged from OK to InstallCheck failure]

2007-04-23 Thread Zeugswetter Andreas ADI SD

 Hmm, I'll give it a go when I'm back in the office, but bear 
 in mind this is a Mingw build on which debugging is nigh-on 
 impossible.

I use the Snapshot
http://prdownloads.sf.net/mingw/gdb-6.3-2.exe?download from sf.net.
It has some issues, but it is definitely useable.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-23 Thread Zeugswetter Andreas ADI SD

 I don't insist the name and the default of the GUC parameter. 
  I'm afraid wal_fullpage_optimization = on (default) makes 
 some confusion because the default behavior becomes a bit 
 different on WAL itself.

Seems my wal_fullpage_optimization is not a good name if it caused
misinterpretation already :-(

  Amount of WAL after 60min. run of DBT-2 benchmark 
  wal_add_optimization_info = off (default) 3.13GB
  
  how about wal_fullpage_optimization = on (default)

The meaning of wal_fullpage_optimization = on (default)
would be the same as your wal_add_optimization_info = off (default).
(Reversed name, reversed meaning of the boolean value.)

It would be there to *turn off* the (default) WAL full-page optimization.
For your pg_compresslog it would need to be set to off.
add_optimization_info sounded like added info about/for some optimization,
which it is not. We turn off an optimization with the flag for the benefit
of an easier pg_compresslog implementation.

As already said, I would decouple this setting from the part that sets
the removable full-page flag in WAL and from making recovery able to
skip dummy records. That part I would do unconditionally.

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-20 Thread Zeugswetter Andreas ADI SD

 With DBT-2 benchmark, I've already compared the amount of WAL.   The 
 result was as follows:
 
 Amount of WAL after 60min. run of DBT-2 benchmark 
 wal_add_optimization_info = off (default) 3.13GB

how about wal_fullpage_optimization = on (default)
 
 wal_add_optimization_info = on (new case) 3.17GB - can be 
 optimized to 0.31GB by pg_compresslog.
 
 So the difference will be around a couple of percents.   I think this
is 
 very good figure.
 
 For information,
 DB Size: 12.35GB (120WH)
 Checkpoint timeout: 60min.  Checkpoint occured only once in the run.

Unfortunately I think DBT-2 is not a good benchmark to test the disabled
wal optimization.
The test should contain some larger rows (maybe some updates on large
toasted values), and maybe more frequent checkpoints. Actually the poor
ratio between full pages and normal WAL content in this benchmark is
strange to begin with.
Tom fixed a bug recently, and it would be nice to see the new ratio. 

Have you read Tom's comment on not really having to be able to
reconstruct all record types from the full page image ? I think that
sounded very promising (e.g. start out with only heap insert/update). 

Then:
- we would not need the WAL optimization switch (the full-page flag
would always be added, depending only on backup)
- pg_compresslog would only remove those full page images from which it
knows how to reconstruct a normal WAL record
- with time and effort pg_compresslog would be able to compress [nearly]
all record types' full page images (no change in the backend)

 I don't think replacing LSN works fine.  For full recovery to 
 the current time, we need both archive log and WAL.  
 Replacing LSN will make archive log LSN inconsistent with 
 WAL's LSN and the recovery will not work.

WAL recovery would have had to be modified (decouple LSN from WAL
position during recovery).
An archive log would have been a valid WAL (with appropriate LSN
advance records). 
 
 Reconstruction to regular WAL is proposed as 
 pg_decompresslog.  We should be careful enough not to make 
 redo routines confused with the dummy full page writes, as 
 Simon suggested.  So far, it works fine.

Yes, Tom didn't like LSN replacing either. I withdraw my concern
regarding pg_decompresslog.

Your work in this area is extremely valuable and I hope my comments are
not discouraging.

Thank you
Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Eliminating unnecessary left joins

2007-04-13 Thread Zeugswetter Andreas ADI SD

 I have this exact problem a lot.  There are actually cases 
 where you can eliminate regular joins, not just left joins.  
 For example:
 
 CREATE TABLE partner (
   id  serial,
   namevarchar(40) not null,
   primary key (id)
 );
 
 CREATE TABLE project (
   id  serial,
   namevarchar(40) not null,
   partner_id  integer not null references project (id)

^^^ -- I assume typo, should be partner
 );
 
 CREATE VIEW project_view AS
 SELECT p.id, p.name, p.partner_id, pp.name AS partner FROM 
 project p, partner pp WHERE p.partner_id = pp.id;

Same advice to you:

1. add not null to your id's
2. CREATE VIEW project_view AS
   SELECT p.id, p.name, p.partner_id, pp.name AS partner FROM 
   project p left outer join partner pp ON p.partner_id = pp.id;
3. wait (or implement :-) the left join optimization in pg

Andreas

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-13 Thread Zeugswetter Andreas ADI SD

  Yup, this is a good summary.
  
  You say you need to remove the optimization that avoids the logging
of 
  a new tuple because the full page image exists.
  I think we must already have the info in WAL which tuple inside the 
  full page image is new (the one for which we avoided the WAL entry 
  for).
  
  How about this:
  Leave current WAL as it is and only add the not removeable flag to 
  full pages.
  pg_compresslog then replaces the full page image with a record for
the 
  one tuple that is changed.
  I tend to think it is not worth the increased complexity only to
save 
  bytes in the uncompressed WAL though.
 
 It is essentially what my patch proposes.  My patch includes 
 flag to full page writes which can be removed.

Ok, a flag that marks full page images that can be removed is perfect.

But you also turn off the optimization that avoids writing regular
WAL records when the info is already contained in a full-page image
(increasing the uncompressed size of WAL).
It was that part I questioned. As already stated, maybe I should not have,
because it would be too complex to reconstruct a regular WAL record from
the full-page image.
But that code would also be needed for WAL-based partial replication, so
if it were too complicated we would eventually want a switch to turn off
the optimization anyway (at least for heap page changes).

  Another point about pg_decompresslog:
  
  Why do you need a pg_decompresslog ? Imho pg_compresslog should 
  already do the replacing of the full_page with the dummy entry. Then

  pg_decompresslog could be a simple gunzip, or whatever compression
was 
  used, but no logic.
 
 Just removing full page writes does not work.   If we shift the rest
of 
 the WAL, then LSN becomes inconsistent in compressed archive logs
which 
 pg_compresslog produces.   For recovery, we have to restore LSN as the

 original WAL.   Pg_decompresslog restores removed full page writes as
a 
 dumm records so that recovery redo functions won't be confused.

Ah sorry, I needed some pgsql/src/backend/access/transam/README reading.

LSN is the physical position of records in WAL. Thus your dummy record
size is equal to what you cut out of the original record.
What about disconnecting WAL LSN from physical WAL record position
during replay ?
Add simple short WAL records in pg_compresslog like: advance LSN by 8192
bytes.

Andreas

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] [PATCHES] Reviewers Guide to Deferred Transactions/Transaction Guarantee

2007-04-13 Thread Zeugswetter Andreas ADI SD

I agree with Tom's reasoning about the suggested simplifications, sorry.

  3. Should the WALWriter also do the wal_buffers half-full write at
the 
  start of XLogInsert() ?
 
 That should go away entirely; to me the main point of the 
 separate wal-writer process is to take over responsibility 
 for not letting too many dirty wal buffers accumulate.

That also sounds a lot simpler, but I think Bruce wanted to be able to
give some time guarantee to the transactions that are not waiting for
fsync.
When a commit only half-filled the page and no more WAL comes in for
a long time, there is only the WALWriter to do the IO.
The WALWriter would need to flush a half-full page only after a timeout,
and only if it contains a commit record.

One more question on autocommit:
Do we wait for a flush for an autocommitted DML ?
Seems we generally should not.

Andreas


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Eliminating unnecessary left joins

2007-04-12 Thread Zeugswetter Andreas ADI SD

 Maybe odd, but simpler to optimize this way.
 
 Your idea would be also a very good optimization, there was 
 already a discussion about that here:
 http://archives.postgresql.org/pgsql-performance/2006-01/msg00
 151.php, but that time Tom refused it because it was too 
 expensive and rare. Maybe now he has a different opinion.
 However, left join optimization is lot simpler and cheaper, 
 and can be useful not only for O/R mappers, but for efficient 
 vertical partitioning as Simon mentioned.

For the views use case there is a simple solution without the expensive
optimization:
If you have a PK-FK relationship, simply rewrite the view to use a left
join instead of an inner join. Since there is always exactly one row on
the outer (PK) side, it makes no difference to the result set.
And then the left join optimization can be used.
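
A concrete sketch (hypothetical tables; the point is that the FK column is
NOT NULL and references the PK, so the left join returns exactly the same
rows as the inner join):

  CREATE TABLE pk_side (id integer PRIMARY KEY, name text);
  CREATE TABLE fk_side (id integer PRIMARY KEY,
                        pk_id integer NOT NULL REFERENCES pk_side (id));

  CREATE VIEW fk_view AS
  SELECT f.id, f.pk_id, p.name
  FROM fk_side f
  LEFT JOIN pk_side p ON f.pk_id = p.id;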

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-12 Thread Zeugswetter Andreas ADI SD

 I don't fully understand what transaction log means.   If it means 
 archived WAL, the current (8.2) code handle WAL as follows:

Probably we can define transaction log to be the part of WAL that is
not full pages.

 1) If full_page_writes=off, then no full page writes will be 
 written to WAL, except for those during onlie backup (between 
 pg_start_backup and 
 pg_stop_backup).   The WAL size will be considerably small 
 but it cannot 
 recover from partial/inconsistent write to the database 
 files.  We have to go back to the online backup and apply all 
 the archive log.
 
 2) If full_page_writes=on, then full page writes will be 
 written at the first update of a page after each checkpoint, 
 plus full page writes at
 1).   Because we have no means (in 8.2) to optimize the WAL 
 so far, what 
 we can do is to copy WAL or gzip it at archive time.
 
 If we'd like to keep good chance of recovery after the crash, 
 8.2 provides only the method 2), leaving archive log size 
 considerably large.  My proposal maintains the chance of 
 crash recovery the same as in the case of full_page_writes=on 
 and reduces the size of archived log as in the case of 
 full_page_writes=off.

Yup, this is a good summary.

You say you need to remove the optimization that avoids
the logging of a new tuple because the full page image exists.
I think we must already have the info in WAL about which tuple inside the
full page image is new (the one for which we avoided the WAL entry).

How about this:
Leave the current WAL as it is and only add the "not removable" flag to
full page images.
pg_compresslog then replaces the full page image with a record for the
one tuple that is changed.
I tend to think it is not worth the increased complexity only to save
bytes in the uncompressed WAL though.

Another point about pg_decompresslog:

Why do you need a pg_decompresslog ? Imho pg_compresslog should already
do the replacing of the
full_page with the dummy entry. Then pg_decompresslog could be a simple
gunzip, or whatever compression was used,
but no logic.

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


[HACKERS] conflicting gettimeofday with MinGW

2007-04-11 Thread Zeugswetter Andreas ADI SD

mingw-runtime-3.10 introduced a gettimeofday declaration in sys/time.h
that is not compatible with port.h.
(current is mingw-runtime-3.12)

int __cdecl gettimeofday(struct timeval *__restrict__,
 void *__restrict__  /* tzp (unused) */);

The problem was already reported by Marc Lepage on June 26, 2006, shortly
after the mingw-runtime release. (He reported back that it was a MinGW
installation issue, but it is really a version issue.)

Now, we could probably make a case that MinGW needs to use struct
timezone instead of void,
but that change alone still does not make pg compile, because of the
#define TIMEZONE_GLOBAL timezone
in port.h.

Any opinions on how to proceed ?

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] conflicting gettimeofday with MinGW

2007-04-11 Thread Zeugswetter Andreas ADI SD

That page is ages out of date. The intended sync is apparently broken.
The current download area is on sourceforge
http://sf.net/project/showfiles.php?group_id=2435

Andreas 

  mingw-runtime-3.10 introduced a gettimeofday declaration in 
 sys/time.h 
  that is not compatible with port.h.
  (current is mingw-runtime-3.12)
 
  int __cdecl gettimeofday(struct timeval *__restrict__,
   void *__restrict__  /* tzp (unused) */);
 
  The problem was already reported by Marc Lepage on June 26, 2006 
  shortly after the mingw-runtime release, (He reported back 
 that it was 
  a MinGW installation issue. But it is a version issue.).
 
  Now, we could probably make a case that MinGW needs to use struct 
  timezone instead of void, but that change alone still does 
 not make 
  pg compile, because of the #define TIMEZONE_GLOBAL timezone
  in port.h.
 
  Any opinions on how to proceed ?
 
 

 
 I don't see 3.10 on the download page at 
 http://www.mingw.org/download.shtml

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] conflicting gettimeofday with MinGW

2007-04-11 Thread Zeugswetter Andreas ADI SD

  That page is ages out of date. The intended sync is 
 apparently broken.
  The current download area is on sourceforge
  http://sf.net/project/showfiles.php?group_id=2435
 

 
 *sigh*
 
 And what is in 3.12, which is apparently the current version?

Sorry that was implied. sys/time.h did not change between 3.10 and 3.12.
There is no #define to remove the declaration.

In cvs the file is marked as dead and gives a reference to cygwin's
w32api.
Maybe the answer is to simply note that the mingw-runtime binary from
sourceforge above 3.9 has a broken sys/time.h (it is not the file from
cygwin but the dead one from cvs). 

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Group Commit

2007-04-10 Thread Zeugswetter Andreas ADI SD

   I've been working on the patch to enhance our group commit
behavior. 
   The patch is a dirty hack at the moment, but I'm settled on the 
   algorithm I'm going to use and I know the issues involved.
  
  One question that just came to mind is whether Simon's
no-commit-wait 
  patch doesn't fundamentally alter the context of discussion for
this.
  Aside from the prospect that people won't really care about group 
  commit if they can just use the periodic-WAL-sync approach, ISTM
that 
  one way to get group commit is to just make everybody wait for the 
  dedicated WAL writer to write their commit record.

Yes, good catch; I think we will want to merge the two.
But you won't want to wait indefinitely, since imho the dedicated WAL
writer will primarily want to write/flush only full WAL pages. Maybe
flush half-full WAL pages only after some longer timeout. But basically
this timeout should be longer than an individual backend is willing to
delay its commit.

   With a 
  sufficiently short delay between write/fsync attempts in the 
  background process, won't that net out at about the same place as a 
  complicated group-commit patch?

I don't think we want the delay that short, or we won't get any grouped
writes.

I think what we could do is wait up to commit_delay for the
dedicated WAL writer to do its work. If it didn't do it before the
timeout, let the backend do the flushing itself.

 I think the big question is whether commit_delay is ever going to be
generally useful.

It is designed to allow a higher transactions-per-second rate on a
constantly WAL-bottlenecked system, so I think it still has a use case. I
think you should not compare it to no-commit-wait from the feature side
(only on the implementation side).

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Auto Partitioning

2007-04-06 Thread Zeugswetter Andreas ADI SD

  (But that sounds rather like pie in the sky, actually.  Which other 
  databases can do that, and how do they do it?)
 
 Oracle does it, by building a big index. Few people use it.

And others allow a different partitioning strategy for each index,
but that has the same problem of how to remove partitions without
a huge amount of index reorganization.

 There are significant problems with this idea that I have already
 raised: 
 - how big would the index be? 
 - how would you add and remove partitions with any kind of 
 performance?
 If we partitioned on date range, that will surely increase over time.
 - the index could almost certainly never be REINDEXed because 
 of space requirements and time considerations.
 - if the indexed values were monotonically increasing the RHS 
 of the index would become a significant hotspot in load 
 performance, assuming high volume inserts into a large table

yes

 My argument is that there are significant real-world 
 disadvantages to having this feature, yet there exists a 
 reasonable workaround to avoid ever needing it.

I'd say a workaround can mostly be found, but not always.
But I agree that the downsides of one large global index are
substantial enough to not make this path attractive.

 Why would we spend time building and supporting it?

What I think we would like to have is a way to put the append nodes into
an order that allows removing the sort node whenever that can be done,
and maybe a merge node (replacing the append and sort nodes) that can
merge presorted partitions.

I have one real example where I currently need one large non-unique
index in Informix.
It is a journal table that is partitioned by client timestamp,
but I need a select first 1000 (of possibly many million rows) ordered
by server_timestamp over a range that naturally sometimes needs more than
one partition, because client and server timestamps diverge.
Here the merge facility would allow me to not use the global index and
still avoid sorting millions of rows (which would not finish in time).
The problem with the global index is that I have to delete all rows from
the oldest partition before removing it, to avoid rebuilding the global
index.
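
In PostgreSQL terms the query looks roughly like this (a sketch with
hypothetical column names; "first 1000" is the Informix spelling of LIMIT):

  SELECT *
  FROM journal
  WHERE server_timestamp >= '2007-04-01' AND server_timestamp < '2007-04-02'
  ORDER BY server_timestamp
  LIMIT 1000;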

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Auto Partitioning

2007-04-05 Thread Zeugswetter Andreas ADI SD

  That lets you enforce unique constraints as long as the partition
key 
  is part of the unique constraint.
 
 Is that already sufficient?

yes

 That would alter the ordering of 
 the columns in the index, no? I mean:

It produces ordered blocks of append nodes for range queries that span
multiple partitions, but one unique key can still only be in exactly one
of the partitions.
E.g. if you range-partition by b, only one partition is applicable
regardless of the position of b in the index. This is sufficient for a
working unique constraint with current pg versions.

 CREATE INDEX x ON test(a, b, c);
 
 isn't the same as
 
 CRETAE INDEX x ON test(c, b, a);

That is only a problem if you also want to avoid a sort (e.g. for an
ORDER BY); it is not an issue for filtering rows.
And in some cases the sort could still be avoided with some range-proving
logic, if you can bring the append nodes of the partitions into an order
that represents the ORDER BY.
(An example would be a query where c=5 and b between 0 and 20,
and two partitions, one for 0 <= b < 10 and a second for 10 <= b.)
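
To make that example concrete (a sketch with hypothetical names; the
comment describes the optimization discussed here, not what the planner
does today):

  CREATE TABLE t      (a int, b int NOT NULL, c int);
  CREATE TABLE t_low  (CHECK (b >= 0 AND b < 10)) INHERITS (t);
  CREATE TABLE t_high (CHECK (b >= 10))           INHERITS (t);

  CREATE INDEX t_low_b  ON t_low  (b);
  CREATE INDEX t_high_b ON t_high (b);

  -- every b in t_low is below every b in t_high, so appending the two
  -- partitions in that order (each read in b order via its index) would
  -- satisfy the ORDER BY without a sort across partitions
  SELECT * FROM t WHERE c = 5 AND b BETWEEN 0 AND 20 ORDER BY b;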

 That's why I'd say, the first columns of an index would have 
 to be equal to all of the columns used in the partitioning key.

No. It may change performance in some situations, but it is not needed
for unique constraints.

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Zeugswetter Andreas ADI SD

 What do others think? Should the argument to CHR() be a 
 Unicode code point or the numeric representation of the 
 database encoding?

When the database uses a single-byte encoding, the chr function takes
the binary byte representation as an integer number between 0 and 255
(e.g. the ASCII code).
When the database encoding is one of the Unicode encodings it takes a
Unicode code point.
This is also what Oracle does.

Not sure what to do with other multibyte encodings.
Oracle only states that the numeric argument must resolve to one entire
code point, whatever that is.
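
Under the semantics described above (a sketch of the proposal, not a claim
about current behaviour) one would expect, e.g.:

  -- UTF8 database: the argument is a Unicode code point
  SELECT chr(8364);   -- 0x20AC, the Euro sign

  -- single-byte LATIN1 database: the argument is the byte value
  SELECT chr(233);    -- byte 0xE9, 'é' in LATIN1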

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Zeugswetter Andreas ADI SD

  When the database uses a single byte encoding, the chr function
takes 
  the binary byte representation as an integer number between 0 and
255 
  (e.g. ascii code).
  When the database encoding is one of the unicode encodings it takes
a 
  unicode code point.
  This is also what Oracle does.
 
 Sorry, but this is *NOT* what Oracle does.
 At least if we can agree that the code point for the Euro 
 sign is 0x20AC.

yes

 
 SQL SELECT ASCII('EUR') AS DEC,
   2 TO_CHAR(ASCII('EUR'), 'XX') AS HEX
   3  FROM DUAL;
 
DEC HEX
 -- 
   14844588  E282AC
 
 The encoding in this example is AL32UTF8, which corresponds 
 to our UTF8.

You are right, I am sorry. My test was broken.

To get the Euro symbol in Oracle with an AL32UTF8 encoding you use
chr(14844588).

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Modifying TOAST thresholds

2007-04-03 Thread Zeugswetter Andreas ADI SD

  ... should we revel
  in configurability, and allow CREATE TABLE/ALTER TABLE behavior to 
  vary depending on the current threshold setting?  We'd have to fix
the 
  toaster routines to not try to push stuff out-of-line when there is
no 
  out-of-line to push to ... but I think we probably had better do
that 
  anyway for robustness, if we're allowing any variability at all in 
  these numbers.
 
 Actually, upon looking closely at the toast code, it already 
 does the right thing when there's no toast table.  Good on 
 someone for getting that right.  But we still need to think 
 about whether it's sane for CREATE/ALTER TABLE to condition 
 the create-a-toast-table decision on a parameter that may now 
 be changeable.

I think it is ok to decide during creation with the current settings.
When a user wants a toast table that has not been created, we can direct
them to use some dummy ALTER TABLE ... SET STORAGE ... and create a toast
table if it does not exist (and the new settings call for one).
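
Something along these lines, i.e. a dummy storage change on an existing
column (table and column names are hypothetical):

  ALTER TABLE t ALTER COLUMN payload SET STORAGE EXTERNAL;
  -- per the suggestion above, this would be the point where the server
  -- creates the missing toast table if the current settings call for one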

And a new threshold has immediate consequences for inline compression,
so a change is not ignored. 

Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-03-30 Thread Zeugswetter Andreas ADI SD

 Without a switch, because both full page writes and 
 corresponding logical log is included in WAL, this will 
 increase WAL size slightly 
 (maybe about five percent or so).   If everybody is happy 
 with this, we 
 don't need a switch.

Sorry, I still don't understand that. What is the corresponding logical
log ?
It seems to me that a full-page WAL record has enough info to produce a
dummy LSN WAL entry. So instead of just cutting out the full-page WAL
record you could replace it with an LSN WAL entry when archiving the log.

Then all that is needed is the one flag, no extra space ?

Andreas


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-03-30 Thread Zeugswetter Andreas ADI SD

 Archive recovery needs the 
 normal xlog record, which in some cases has been optimised 
 away because the backup block is present, since the full 
 block already contains the changes.

Aah, I didn't know that optimization exists.
I agree that removing that optimization is good/ok.

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Patch queue concern

2007-03-29 Thread Zeugswetter Andreas ADI SD

  My feeling is we should have more regular sync points where the
patch 
  queue is emptied and everything committed or rejected.
 
 No doubt, but the real problem here is that 
 reviewing/committing other people's patches is not fun, it's 
 just work :-(.  So it's no surprise that it tends to get put 
 off.  Not sure what to do about that.

In my experience it mostly pays to keep people directly responsible for
their own work.
Every intermediate tester/reviewer/coordinator tends to reduce the
submitter's feeling of responsibility.
So I could imagine a modus operandi where a submitter states:
I feel confident that you can commit without review and I will be
available for fixes/additional work required.
Maybe we have that already, in the form of committers that commit their
own work.

But I do feel that some patches Bruce is talking about need agreement
and help, not only review.

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Reduction in WAL for UPDATEs

2007-03-28 Thread Zeugswetter Andreas ADI SD

  I agree that these values need a second look. I think a 
  TOAST_TUPLE_THRESHOLD well smaller than the current value would
still 
  easily pay its way. With a little caution to avoid wasting too much 
  effort on the last few bytes I suspect even as low as 
 400-500 bytes is probably worthwhile.

But a seq scan (or non-cached access) would suddenly mutate into multiple
random accesses, so this is not a win-win situation.

Btw: do we consider the existence of toasted columns in the seq-scan
cost estimation ?

Andreas 

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Synchronized Scan update

2007-03-14 Thread Zeugswetter Andreas ADI SD

  The advantage of sync_scan_offset is that, in some situations, a 
  second scan can actually finish faster than if it were the only
query 
  executing, because a previous scan has already caused some blocks to

  be cached. However, 16 is a small number because that benefit would 
  only be once per scan, and sync scans are only helpful on large
tables.

Agreed.

 Alright, understood. That last part is actually something I 
 now want to avoid because it's using the current 
 cache-spoiling behaviour of seqscans to advantage. I'd like 
 to remove that behaviour, but it sounds like we can have both
 - SeqScans that don't spoil cache
 - Synch scans by setting sync_scan_offset to zero.
 
I like the idea of reducing tuning parameters, but we should, at
a 
minimum, still allow an on/off button for sync scans. My tests 
revealed that the wrong combination of 
OS/FS/IO-Scheduler/Controller could result in bad I/O behavior.
   
   Agreed
   
  
  Do you have an opinion about sync_scan_threshold versus a simple 
  sync_scan_enable?
 
 enable_sync_scan?

Seems the suggested GUCs are very related. IIRC the agreed suggestion
was to use NBuffers (or a percentage thereof ?) to decide whether to
spoil the buffer cache for a seq scan. It seems this same metric should
be used to decide whether to sync a scan when sync scan is enabled. So
when the table size is below NBuffers (or a percentage thereof) you
neither recycle buffers nor sync the seq scans.

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Auto creation of Partitions

2007-03-09 Thread Zeugswetter Andreas ADI SD

  Since partition is inheritance-based, a simple DROP or  NO 
 INHERIT 
  will do the job to deal with the partition. Do we want to reinvent 
  additional syntax when these are around and are documented?
 
 Well, if the syntax for adding a new partition eventually 
 ends up as ALTER TABLE ADD PARTITION, then it would make more 
 sense that you remove a partition via ALTER TABLE DROP PARTITION.

But DROP PARTITION usually moves the data from this partition to other
partitions,
so it is something different.
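
For reference, the existing inheritance-based operations mentioned above
look like this (hypothetical partition names):

  ALTER TABLE journal_2007_01 NO INHERIT journal;  -- detach, keep the data
  DROP TABLE journal_2007_01;                      -- or discard it entirely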

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Zeugswetter Andreas ADI SD

  Is there a particular reason why CLUSTER isn't MVCC-safe? It seems
to 
  me that it would be trivial to fix, by using SnapshotAny instead of 
  SnapshotNow, and not overwriting the xmin/xmax with the xid of the 
  cluster command.
 
 It's trivial to fix now in this way, but it would break HOT, 
 since an indexscan only returns one row per index entry.

Well, with SnapshotAny HOT should probably return all possibly visible
tuples with an index scan. (Btw, does CLUSTER really do an index scan ?
It seems that for reading a whole table a seq scan plus sort is usually
cheaper, at least when the clustering is so bad that a CLUSTER is
needed.)
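
A manual equivalent of the seq-scan-plus-sort approach, as a rough sketch
with hypothetical names:

  CREATE TABLE t_sorted AS SELECT * FROM t ORDER BY cluster_col;
  -- then swap the tables and recreate the indexes and constraints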

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Auto creation of Partitions

2007-03-08 Thread Zeugswetter Andreas ADI SD
  The hard part there is checking that the partition constraints are 
  distinct. If the partition constraints are added one at a time, you 
  can use the predicate testing logic to compare the to-be-added 
  partition's constraint against each of the already added
constraints. 
  That becomes an O(N) problem.

Yes, we could pre-evaluate that check (per index) in the DDL phase and
keep the info in a flag.
Also interesting is whether there is an order the partitions can be read
in to satisfy a particular ORDER BY.

  What is really needed is a data structure that allows range
partitions 
  to be accessed more efficiently. This could make adding partitions
and 
  deciding in which partition a specific value goes an O(logN) 
  operation.

I do not really see a problem with O(N), since typical N currently range
from 10 to 200.
N = 1000 is already good for a >= 10 TB table.
If a 10 GB partition were too large we should imho invest more in the
advanced indexing methods that are currently being developed.

 Directing data to child tables with triggers pretty much 
 necessitates having some way to codify what partition a 
 particular row belongs in.  
 IE: for partitioning by month, you'll see things like naming 
 the partition tables parent_table_name_$YEAR_$MONTH, so the 
 'partitioning function' takes a date or timestamp and then 
 returns what partition it belongs to. Perhaps there is some 
 way to use that mapping to drive the selection of what 
 partitions could contain a given value?

You put it in the first partition that has matching constraints.
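
A sketch of that mapping, written with rules rather than triggers (the
logic is the same either way; all names are hypothetical):

  CREATE RULE journal_insert_2007_03 AS
  ON INSERT TO journal
  WHERE NEW.logdate >= DATE '2007-03-01' AND NEW.logdate < DATE '2007-04-01'
  DO INSTEAD INSERT INTO journal_2007_03 VALUES (NEW.*);

One such rule (or trigger branch) per partition, each matching that
partition's CHECK constraint.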

 One possibility would be to require 3 functions for a partitioned
 table: one accepts the partitioning key and tells you what 
 partition it's in, one that tells you what the minimum 
 partitioning key for a partition would be, and one that tells 
 you what the maximum would be.  
 If the user supplied those 3 functions, I think it would be 
 possibly to automatically generate code for the triggers and 
 check constraints. The min/max partition key functions might 
 allow you to more efficiently do partition elimination, too.

I can see this as a good optional addition, but it can only be optional,
else it would pretty much limit the methods that can be used for
partitioning; e.g. hash and modulo do not have a min/max per partition.

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Auto creation of Partitions

2007-03-07 Thread Zeugswetter Andreas ADI SD

  iv) Based on the PRIMARY, UNIQUE, REFERENCES information specified, 
  pass it on to the children tables.
 
 How will you maintain a primary key in such a table, 
 considering that indexes can't span multiple tables?

Many partitioning schemes have (or can be made to have) only one
possible target partition for the primary key. Thus if you create
separate unique indexes on each partition, the problem is solved.
For a first version I'd argue that it is sufficient to disallow creation
of a unique index on the master when the constraints (on the columns of
this index) do not nail down a specific partition for each row (e.g. a
hash, or a range on one of the index columns that does not overlap).
Later, global indexes, or indexes with separate partitioning rules, can
be implemented to cover the other cases.
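
A sketch of the case that works today (hypothetical names; the
partitioning column b is part of the unique key and the ranges do not
overlap, so any given key can only live in one partition):

  CREATE TABLE t    (id int NOT NULL, b int NOT NULL, payload text);
  CREATE TABLE t_p1 (CHECK (b >= 0 AND b < 100)) INHERITS (t);
  CREATE TABLE t_p2 (CHECK (b >= 100))           INHERITS (t);

  -- one unique index per partition is enough to make (b, id) unique
  -- across the whole partitioned table
  CREATE UNIQUE INDEX t_p1_uk ON t_p1 (b, id);
  CREATE UNIQUE INDEX t_p2_uk ON t_p2 (b, id);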

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

