[HACKERS] questionable item in HISTORY
The following item in HISTORY:

    * Add support for 3 and 4-byte UTF8 characters (John Hansen)
      Previously only one and two-byte UTF8 characters were supported.
      This is particularly important for support for some Chinese
      characters.

is wrong, since 3-byte UTF-8 characters have been supported ever since UTF-8 support was added to PostgreSQL. A correct description would be:

    * Add support for 4-byte UTF8 characters (John Hansen)
      Previously only up to three-byte UTF8 characters were supported.
      This is particularly important for support for some Chinese
      characters.

In the meantime I wonder if we need to update the UTF-8-to-locale encoding maps. The author of the patches stated that "This is particularly important for support for some Chinese characters." I have no idea what encoding he is referring to, but I wonder if the latest Chinese encoding standard, GB18030, needs 4-byte UTF-8 mappings. If yes, we surely need to update utf8_to_gb18030.map. Anybody familiar with GB18030/UTF-8?
--
SRA OSS, Inc. Japan
Tatsuo Ishii
Re: [HACKERS] stack depth limit exceeded problem.
Oliver Jowett wrote:
> Thomas Hallgren wrote:
>> PL/Java runs a JVM. Since a JVM is multi-threaded, PL/Java goes to
>> fairly extreme measures to ensure that only one thread at a time can
>> access the backend. So far, this has worked well, but there is one
>> small problem. [...]
>
> I assume this means you have a single lock serializing requests to
> the backend?

Yes, of course. I also make sure that the main thread cannot return until another thread that is servicing a backend request has completed. There's absolutely no way two threads can execute backend code simultaneously.

> If you can't solve the depth checking problem (Tom doesn't seem to
> like the idea of multiple threads calling into the backend..), what
> about turning the original thread (i.e. the main backend thread) into
> a backend interface thread that does nothing but feed callbacks into
> the backend on request? Then run all the user code in a separate
> thread that passes backend requests to the interface thread rather
> than directly executing them. If it starts extra threads which make
> DB requests, the mechanism stays the same..

I thought about that. The drawback is that each and every call must be dispatched to another thread, no matter how trivial that call might be. If you do a select from a table with 10,000 records and execute a function for each record, you get 20,000 context switches. Avoiding that kind of overhead is one of the motivating factors for keeping the VM in-process.

I don't rule out such a solution, but I'd like to have a discussion with Tom and iron out what the problems are when one thread at a time is allowed to execute. Perhaps I can solve them.

Regards,
Thomas Hallgren
Re: [HACKERS] stack depth limit exceeded problem.
On Sat, Sep 24, 2005 at 10:34:42AM +0200, Thomas Hallgren wrote:
> Oliver Jowett wrote:
>> I assume this means you have a single lock serializing requests to
>> the backend?
>
> Yes, of course. I also make sure that the main thread cannot return
> until another thread that is servicing a backend request has
> completed. There's absolutely no way two threads can execute backend
> code simultaneously.

Ok, I have a question. PostgreSQL uses sigsetjmp/siglongjmp to handle errors in the backend. If you're changing the stack, how do you avoid the siglongjmp jumping back to a different stack? Or do you somehow avoid this problem altogether?

> I thought about that. The drawback is that each and every call must
> be dispatched to another thread, no matter how trivial that call
> might be. If you do a select from a table with 10,000 records and
> execute a function for each record, you get 20,000 context switches.
> Avoiding that kind of overhead is one of the motivating factors for
> keeping the VM in-process.

Well, on Linux at least, context switches are quite cheap. However, how does Java handle the possibility that functions never return? Do you wrap each call in a PG_TRY/PG_CATCH to propagate errors?

Tricky issues...
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is
a tool for doing 5% of the work and then sitting around waiting for
someone else to do the other 95% so you can sue them.
Re: [HACKERS] stack depth limit exceeded problem.
Martijn van Oosterhout wrote:
> On Sat, Sep 24, 2005 at 10:34:42AM +0200, Thomas Hallgren wrote:
>> Oliver Jowett wrote:
>>> I assume this means you have a single lock serializing requests to
>>> the backend?
>>
>> Yes, of course. I also make sure that the main thread cannot return
>> until another thread that is servicing a backend request has
>> completed. There's absolutely no way two threads can execute backend
>> code simultaneously.
>
> Ok, I have a question. PostgreSQL uses sigsetjmp/siglongjmp to handle
> errors in the backend. If you're changing the stack, how do you avoid
> the siglongjmp jumping back to a different stack? Or do you somehow
> avoid this problem altogether?

All calls use a PG_TRY/PG_CATCH. So yes, I think I avoid that problem altogether.

>> I thought about that. The drawback is that each and every call must
>> be dispatched to another thread, no matter how trivial that call
>> might be. If you do a select from a table with 10,000 records and
>> execute a function for each record, you get 20,000 context switches.
>> Avoiding that kind of overhead is one of the motivating factors for
>> keeping the VM in-process.
>
> Well, on Linux at least, context switches are quite cheap.

I know. And as I said, I don't rule out such a solution. But however cheap, there's still a performance penalty and added complexity. I'd rather avoid both if I can, at least until I know what the real problem is with the solution that I propose.

> However, how does Java handle the possibility that functions never
> return? Do you wrap each call in a PG_TRY/PG_CATCH to propagate
> errors?

Yes. All backend exceptions are caught in a PG_CATCH and then propagated to Java as a ServerException. If there's no catch in the Java code, they are rethrown by the java_call_handler, this time with the jump buffer that was set up by the backend when it invoked the call_handler. There's also a barrier that will prevent any further calls from the Java code once an exception has been thrown by the backend, unless that call was wrapped in a savepoint construct. A savepoint rollback will unlock the barrier (this is not related to the thread issue, of course).

Regards,
Thomas Hallgren
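P.S. For the archives, a minimal sketch of what such a call wrapper can look like. This is not the actual PL/Java source: the callBackend() helper and the Java class path are made up; only PG_TRY/PG_CATCH/PG_END_TRY, CopyErrorData() and FlushErrorState() are real backend facilities.

    #include "postgres.h"
    #include "utils/memutils.h"
    #include <jni.h>

    extern void callBackend(void);      /* hypothetical backend request */

    static void
    call_backend_and_propagate(JNIEnv *env)
    {
        MemoryContext oldcontext = CurrentMemoryContext;

        PG_TRY();
        {
            callBackend();
        }
        PG_CATCH();
        {
            ErrorData  *edata;
            jclass      cls;

            /* copy the error data out of ErrorContext, then reset it */
            MemoryContextSwitchTo(oldcontext);
            edata = CopyErrorData();
            FlushErrorState();

            /* surface the backend error to Java code as an exception */
            cls = (*env)->FindClass(env, "org/example/ServerException");
            (*env)->ThrowNew(env, cls, edata->message);
        }
        PG_END_TRY();
    }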
Re: [HACKERS] stack depth limit exceeded problem.
On Sat, Sep 24, 2005 at 12:26:58PM +0200, Thomas Hallgren wrote:
> Yes. All backend exceptions are caught in a PG_CATCH and then
> propagated to Java as a ServerException. If there's no catch in the
> Java code, they are rethrown by the java_call_handler, this time with
> the jump buffer that was set up by the backend when it invoked the
> call_handler. There's also a barrier that will prevent any further
> calls from the Java code once an exception has been thrown by the
> backend, unless that call was wrapped in a savepoint construct. A
> savepoint rollback will unlock the barrier (this is not related to
> the thread issue, of course).

Well, you seem to have dealt with the obvious issues I can see. I imagine you also need to worry about things like signal handling. Is there no way to reserve a stack just for PostgreSQL and switch to that stack, rather than switch threads? (Although the stack is really the only thing that differentiates threads anyway...)

Linux has sigaltstack, so you can catch the stack overflow signal (and other signals obviously, but that's its main use), but it's not terribly portable.

What you really need to do is set the stack_base_ptr every time you execute postgres with a new stack; that preserves existing semantics. Signals are the only way the kernel can pass control unexpectedly, so if you handle those, postgres would never know it's threaded. I do wonder if there are any other assumptions made...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is
a tool for doing 5% of the work and then sitting around waiting for
someone else to do the other 95% so you can sue them.
Re: [HACKERS] stack depth limit exceeded problem.
Martijn van Oosterhout wrote:
> Linux has sigaltstack, so you can catch the stack overflow signal
> (and other signals obviously, but that's its main use), but it's not
> terribly portable.

I rely on the signal handler that the JVM uses for page faults (which a stack overflow generally amounts to) and FPE exceptions, so I know that they will generate Java exceptions in a controlled way (which I in turn translate to elog(ERROR) on the main thread).

> What you really need to do is set the stack_base_ptr every time you
> execute postgres with a new stack; that preserves existing semantics.

Exactly! What I'd really like to do in threads other than main is:

    void *currentBase = switchStackBase(stackBaseOfMyThread);
    PG_TRY
    {
        /* service the call here */
        switchStackBase(currentBase);
    }
    PG_CATCH
    {
        switchStackBase(currentBase);
        /* generate Java exception as usual */
    }

> Signals are the only way the kernel can pass control unexpectedly, so
> if you handle those, postgres would never know it's threaded. I do
> wonder if there are any other assumptions made...
>
> Have a nice day,

You too. And thanks for all your input.

Regards,
Thomas Hallgren
Re: [HACKERS] stack depth limit exceeded problem.
On Sat, Sep 24, 2005 at 02:38:35PM +0200, Thomas Hallgren wrote:
> Martijn van Oosterhout wrote:
>> Linux has sigaltstack, so you can catch the stack overflow signal
>> (and other signals obviously, but that's its main use), but it's not
>> terribly portable.
>
> I rely on the signal handler that the JVM uses for page faults (which
> a stack overflow generally amounts to) and FPE exceptions, so I know
> that they will generate Java exceptions in a controlled way (which I
> in turn translate to elog(ERROR) on the main thread).

Well, actually, what I was thinking is: if someone sends a -INT or -TERM to the backend, which thread will catch it? You have to block it in every thread except the one you want to catch it in if you want to control it. This means that for any signal handler that PostgreSQL installs, you need to intercept it with a wrapper function to make sure it runs on the right stack.

Actually, while running backend code you're probably fine, since the elog stuff will handle it. But if a signal is received while the JVM is running, the signal handler will get the stack of the JVM. Now, PostgreSQL's signal handlers tend not to do much, so you may be safe. They tend not to throw errors, but who knows...

Still, this is all solvable, I think...
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is
a tool for doing 5% of the work and then sitting around waiting for
someone else to do the other 95% so you can sue them.
Re: [HACKERS] 64-bit API for large objects
Jeremy Drake [EMAIL PROTECTED] writes:
> On Fri, 23 Sep 2005, Tom Lane wrote:
>> postgres_fe.h defines a ton of stuff that has no business being
>> visible to libpq's client applications. It's designed to be used by
>> our *own* client-side code (psql and the like), but we have not made
>> any attempt to keep it from defining stuff that would likely break
>> other peoples' code.
>
> So does this mean that there is a different, more advanced and more
> likely to break random other code, client library where this call
> would fit better?

I've been thinking more about this and come to these conclusions:

1. libpq-fe.h definitely cannot include postgres_fe.h; in fact, it has no business even defining a type named int64. That is way too likely to collide with symbols coming from elsewhere in a client compilation. I think what we need is to declare a type named pg_int64 and use that in the externally visible declarations. The most reasonable place to put the typedef is postgres_ext.h. This will mean making configure generate postgres_ext.h from a template postgres_ext.h.in, but that's no big deal.

2. We need a strategy for what to do when configure doesn't find a working int64 type. My inclination is to just not export the functions in that case. So normally, postgres_ext.h would contain something like

    #define HAVE_PG_INT64 1
    typedef long long int pg_int64;

but neither of these would appear if configure couldn't find a working type. In libpq-fe.h, we'd have

    #ifdef HAVE_PG_INT64
    extern pg_int64 lo_lseek64(PGconn *conn, int fd, pg_int64 offset, int whence);
    extern pg_int64 lo_tell64(PGconn *conn, int fd);
    #endif

and similarly for all the code inside libpq. The reason this seems like a good idea is that client code could key off "#ifdef HAVE_PG_INT64" to detect whether the lo64 functions are available, which is useful even if you don't care about machines without int64, because you still need to think about machines with pre-8.2 PG installations.

3. This is still not 100% bulletproof, as it doesn't address situations like building PG with gcc and then trying to compile client apps with a vendor cc that doesn't understand "long long int". The compile would choke on the typedef even if you weren't trying to use large objects at all. I don't see any very nice way around that. It might be worth doing this in postgres_ext.h:

    #ifndef NO_PG_INT64
    #define HAVE_PG_INT64 1
    typedef long long int pg_int64;
    #endif

which would at least provide an escape hatch for such situations: define NO_PG_INT64 before trying to build.

			regards, tom lane
Re: [HACKERS] 64-bit API for large objects
Jeremy Drake [EMAIL PROTECTED] writes:
> In any case, are there any comments on the changes below libpq (the
> functions visible to queries on down)?

Within the backend, I don't see the point in maintaining a distinction between 32- and 64-bit APIs for inv_api.c: you should just switch 'em to use int64. You did it that way for inv_getsize but then proceeded to add separate inv_seek64/tell64 functions, which is inconsistent. The submitted version of lo_tell64 isn't even calling the right one :-(

The 32-bit version of lo_tell will need to check and error out if the value it'd need to return doesn't fit in int32. (Come to think of it, so should the 32-bit version of lo_lseek.)

All of the LO code needs to be eyeballed to make sure it still behaves sanely if int64 is really only 32 bits.

It would probably be a good idea also to introduce some overflow checks to detect cases where the current LO offset would overflow int64 after a read, write, or seek. This is missing from the existing code :-( It is possible to code overflow checks that work regardless of the size of int64; see int8.c for some inspiration.

I'd suggest also that the offset be defined as signed not unsigned (int64 not uint64), as this will simplify the overflow checks and eliminate the prospect of scenarios where lo_tell64 cannot return a correct value.

			regards, tom lane
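P.S. A sketch of the kind of check I have in mind, borrowing the SAMESIGN trick from int8.c (the function name and error message are illustrative only, not part of any actual patch):

    #include "postgres.h"

    #define SAMESIGN(a,b)   (((a) < 0) == ((b) < 0))

    /*
     * Advance a large-object offset, erroring out on overflow.  Because
     * the check relies only on sign comparisons, it works even when
     * int64 is really only 32 bits wide.
     */
    static int64
    lo_offset_add(int64 cur, int64 delta)
    {
        int64   result = cur + delta;

        /*
         * If the inputs have different signs, the sum cannot overflow;
         * if they have the same sign, the result must have it too.
         */
        if (SAMESIGN(cur, delta) && !SAMESIGN(result, cur))
            ereport(ERROR,
                    (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                     errmsg("large object offset out of range")));
        return result;
    }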
Re: [HACKERS] 64-bit API for large objects
On Fri, 23 Sep 2005, Tom Lane wrote:
> Jeremy Drake [EMAIL PROTECTED] writes:
>> 2) The lo_*64, in order to be convenient from the client end, have
>> functions added to libpq as the existing lo_* functions. The client
>> side of libpq did not previously know anything about int64 or how to
>> send/receive them. I added an include of postgres_fe.h (which
>> according to the comment in that file looks like it should go there)
>> so int64 would be defined,
>
> Unfortunately that's completely unacceptable from a
> namespace-pollution point of view.

I don't quite understand. Allow me to cite the source, so we both are referring to the same thing here...

[EMAIL PROTECTED] postgresql-8.0.3 $ head -n17 src/include/postgres_fe.h

    /*-------------------------------------------------------------------------
     *
     * postgres_fe.h
     *	  Primary include file for PostgreSQL client-side .c files
     *
     * This should be the first file included by PostgreSQL client libraries and
     * application programs --- but not by backend modules, which should include
     * postgres.h.
     *
     *
     * Portions Copyright (c) 1996-2005, PostgreSQL Global Development Group
     * Portions Copyright (c) 1995, Regents of the University of California
     *
     * $PostgreSQL: pgsql/src/include/postgres_fe.h,v 1.10 2004/12/31 22:03:19 pgsql Exp $
     *
     *-------------------------------------------------------------------------
     */

Now I may not completely understand the term "client", but I think libpq is a client library and anything which may use it would be an application program. So it seems it was an oversight on the part of libpq to not include it. Does the term "client" not mean what I thought it did (anything which connects to a postgresql server)?

> The real problem here is that int64 isn't a well-defined portable
> datatype, and so it's going to be very hard to export these functions
> in a way that won't break on different platforms, applications
> compiled with a different compiler than libpq was, etc.

Umm, what wouldn't break if you switched compilers in a way that redefined sizeof(things)? I happen to know that even using the same compiler, just changing a compile flag (-m64) which changes the sizes of integral types (sizeof(long) from 32 to 64 bits) will make such things stop working on one of my tested platforms. It sucks, and I am not fond of it because I tend not to have every library on my box built for both, but it is the way life is. I do not know of a platform where the size of an integral type can change and still be able to link against existing libraries. And if the size of some type is not changing, then things should already be correctly set for the platform. But I admit I have not met every platform in existence. Do you happen to be able to cite a platform where this is the case?

> For that matter, we can't even guarantee that they work at all: not
> all platforms even *have* int64 types. We have so far avoided putting
> any fundamental dependencies on int64 arithmetic into the system, and
> I'm a bit worried that this patch will break LO support entirely on
> platforms that don't have working int64 arithmetic.

They should in fact break gracefully on such platforms, or at least as gracefully as any other int64-using code might. I did check a couple of places for #ifdef INT64_BROKEN (or whatever it was called) to make sure that on those platforms something at least somewhat sane would happen (they use 32 bits instead). Also, on those platforms, you could always use the non-64 versions if you were concerned about that. The patches would allow seeking past the old limit using the 32-bit function in stages (seek 2G, seek 2G, seek 2G would put you at 6G), if you do not mind weird return values and tell not working.
And if you use a platform which does not support 64-bit integral types, then you cannot reasonably expect those functions to work correctly anyway. But they should compile at least.

-- 
I don't wanna argue, and I don't wanna fight,
But there will definitely be a party tonight...
Re: [HACKERS] [PERFORM] Releasing memory during External sorting?
> From: Dann Corbit [EMAIL PROTECTED]
> Sent: Sep 23, 2005 5:38 PM
> Subject: RE: [HACKERS] [PERFORM] Releasing memory during External sorting?
>
> _C Unleashed_ also explains how to use a callback function to perform
> arbitrary radix sorts (you simply need a method that returns the
> [bucketsize] most significant bits for a given data type, for the
> length of the key). So you can sort fairly arbitrary data in linear
> time (of course if the key is long then O(n*log(n)) will be better
> anyway.) But in any case, if we are talking about external sorting,
> then disk time will be so totally dominant that the choice of
> algorithm is practically irrelevant.

Horsefeathers. Jim Gray's sorting contest site,
http://research.microsoft.com/barc/SortBenchmark/
proves that the choice of algorithm can have a profound effect on performance. After all, the amount of IO done is the most important of the things that you should be optimizing for in choosing an external sorting algorithm.

Clearly, if we know or can assume the range of the data in question, the theoretical minimum amount of IO is one pass through all of the data (otherwise, we are back in O(lg(n!)) land). Equally clearly, for HDs that one pass should involve as few seeks as possible. In fact, such a principle can be applied to _all_ forms of IO: HD, RAM, and CPU cache.

The absolute best that any sort can possibly do is to make one pass through the data and deduce the proper ordering of the data during that one pass. It's usually also important that our algorithm be Stable, preferably Wholly Stable.

Let's call such a sort an Optimal External Sort (OES). Just how much faster would it be than current practice? The short answer is the difference between how long it currently takes to sort a file vs how long it would take to "cat" the contents of the same file to a RAM buffer (_without_ displaying it). IOW, there's SIGNIFICANT room for improvement over current standard practice in terms of sorting performance, particularly external sorting performance. Since sorting is a fundamental operation in many parts of a DBMS, this is a Big Deal.

This discussion has gotten my creative juices flowing. I'll post some Straw Man algorithm sketches after I've done some more thought.

Ron

> -----Original Message-----
> From: Dann Corbit [EMAIL PROTECTED]
> Sent: Friday, September 23, 2005 2:21 PM
> Subject: Re: [HACKERS] [PERFORM] Releasing memory during ...
>
> For the subfiles, load the top element of each subfile into a
> priority queue. Extract the min element and write it to disk. If the
> next value is the same, then the queue does not need to be adjusted.
> If the next value in the subfile changes, then adjust it. Then, when
> the lowest element in the priority queue changes, adjust the queue.
> Keep doing that until the queue is empty. You can create all the
> subfiles in one pass over the data. You can read all the subfiles,
> merge them, and write them out in a second pass (no matter how many
> of them there are).

The Gotcha with Priority Queues is that their performance depends entirely on implementation. In naive implementations either Enqueue() or Dequeue() takes O(n) time, which reduces sorting time to O(n^2). The best implementations I know of need O(lg lg n) time for those operations, allowing sorting to be done in O(n lg lg n) time. Unfortunately, there's a lot of data manipulation going on in the process, and two IO passes are required to sort any given file. Priority Queues do not appear to be very IO friendly. I know of no sorting performance benchmark contest winner based on Priority Queues.
> Replacement selection is not a good idea any more, since obvious
> better ideas should take over. Longer runs are of no value if you do
> not have to do multiple merge passes.

Judging from the literature and the contest winners, Replacement Selection is still a viable and important technique. Besides Priority Queues, what "obvious better ideas" have you heard of?

> I have explained this general technique in the book C Unleashed,
> chapter 13. Sample code is available on the book's home page.

URL please?
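P.S. For concreteness, here's a toy version of the k-way merge step being discussed: the head element of each sorted run sits in a binary-heap priority queue, so each of the n output elements costs O(lg K) comparisons, O(n lg K) overall. The runs are in-memory arrays purely for illustration; a real external sort would refill from block reads of the subfiles instead.

    #include <stdio.h>

    #define K 3

    typedef struct
    {
        const int  *run;    /* a sorted run */
        int         len;    /* number of elements in it */
        int         pos;    /* next unread element */
    } Source;

    static Source heap[K];
    static int    heap_n = 0;

    /* restore the min-heap property, keyed on each run's head element */
    static void
    sift_down(int i)
    {
        for (;;)
        {
            int     l = 2 * i + 1, r = l + 1, min = i;
            Source  tmp;

            if (l < heap_n && heap[l].run[heap[l].pos] < heap[min].run[heap[min].pos])
                min = l;
            if (r < heap_n && heap[r].run[heap[r].pos] < heap[min].run[heap[min].pos])
                min = r;
            if (min == i)
                return;
            tmp = heap[i]; heap[i] = heap[min]; heap[min] = tmp;
            i = min;
        }
    }

    int
    main(void)
    {
        static const int r1[] = {1, 5, 9};
        static const int r2[] = {2, 6, 10};
        static const int r3[] = {3, 4, 11};
        int     i;

        heap[0].run = r1; heap[0].len = 3; heap[0].pos = 0;
        heap[1].run = r2; heap[1].len = 3; heap[1].pos = 0;
        heap[2].run = r3; heap[2].len = 3; heap[2].pos = 0;
        heap_n = K;
        for (i = heap_n / 2 - 1; i >= 0; i--)
            sift_down(i);

        while (heap_n > 0)
        {
            printf("%d ", heap[0].run[heap[0].pos++]);  /* emit the minimum */
            if (heap[0].pos == heap[0].len)
                heap[0] = heap[--heap_n];               /* run exhausted */
            sift_down(0);                               /* re-heapify */
        }
        putchar('\n');                                  /* 1 2 3 4 5 6 9 10 11 */
        return 0;
    }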
Re: [HACKERS] 64-bit API for large objects
David Fetter [EMAIL PROTECTED] writes:
> On Fri, Sep 23, 2005 at 05:40:09PM -0400, Tom Lane wrote:
>> For that matter, we can't even guarantee that they work at all: not
>> all platforms even *have* int64 types.
>
> What platforms that PG supports don't have int64 arithmetic?

We claim to build with any ANSI C compiler, and there is no requirement for a 64-bit type in ANSI C. The historical project policy is that we should still build without such a type, and everything should still work except that the effective bounds of bigint data correspond to int32 instead of int64 limits. I see no reason to back off that policy. It's not very much harder to do it right.

			regards, tom lane
Re: [HACKERS] 64-bit API for large objects
Hey,

While you guys are hacking at the LO code, it would be nice to consider the suggestions outlined here:

http://archives.postgresql.org/pgsql-bugs/2004-07/msg00143.php

-- 
Alvaro Herrera   http://www.amazon.com/gp/registry/CTMLCN8V17R4
"And a voice from the chaos spoke to me and said: 'Smile and be happy,
it could be worse.' And I smiled. And I was happy. And it got worse."
Re: [HACKERS] Proposed patch to clean up signed-ness warnings
Tatsuo Ishii [EMAIL PROTECTED] writes:
>> Are you proposing that we change all the char * to unsigned char *?
>
> No, I suggest we change char * to unsigned char * only where it
> points to a string which could hold non-ASCII character strings.

Which is pretty nearly all of them...

> To support multiple charsets/collations, I think we need to change
> the way we represent character strings from the unstructured char *
> to a more intelligent structure (I know it's hard to implement that
> without significant performance loss, but I know we should do it in
> the future).

Yeah, it's still not clear where we are going to end up, but in the meantime we've got a lot of warnings cluttering the code and making it hard to spot real problems.

> So unsigned char * is not enough for the goal anyway. I'm not against
> your patches.

OK. No one else objected, so I'll go ahead and apply before the code drifts to the point of breaking the patch.

			regards, tom lane
Re: [HACKERS] 2 forks for md5?
Bruce Momjian wrote:
> Tom Lane wrote:
>> I'm coming to agree with Andrew that a documentation patch might be
>> the best answer. But where to put it ... under the description of
>> the log_connections GUC var?
>
> I am thinking we should wait for someone else to notice the double
> log entries before mentioning it in the docs.

If I had a more Machiavellian bent I would make sure that happened ;-)

How about this note under log_connections?

    Some clients (notably psql) sometimes try to connect without a
    password before trying with a password. This behaviour will
    generate two log lines if log_connections is turned on, even
    though to the user it appears that only one connection has
    occurred.

cheers

andrew
Re: [HACKERS] 64-bit API for large objects
On Sat, 24 Sep 2005, Tom Lane wrote:
> Within the backend, I don't see the point in maintaining a
> distinction between 32- and 64-bit APIs for inv_api.c: you should
> just switch 'em to use int64. You did it that way for inv_getsize but
> then proceeded to add separate inv_seek64/tell64 functions, which is
> inconsistent.

Right. I did it the way you describe in my first cut (I did this several times and changed my mind and started over). I was concerned (perhaps needlessly) about breaking the signatures of the inv_* functions which are visible outside of the .c file. This is why I did the getsize differently - it is static. But it sounds like there is no concern about changing the signatures of these functions, so I will change my patch to not maintain the separate code paths in inv_api.c.

> The submitted version of lo_tell64 isn't even calling the right
> one :-(

Oops. That's what I get for lots of copy/paste and doing it multiple times... Bonehead mistake, thanks for catching it.

> The 32-bit version of lo_tell will need to check and error out if the
> value it'd need to return doesn't fit in int32. (Come to think of it,
> so should the 32-bit version of lo_lseek.)

That makes sense. Or it could return some value (INT_MAX?) which could mean that it is outside the range, so someone could still get at the data even if they are using a backwards client box? I don't know if that makes sense at all; it sounds pretty wacky, since these clients would have no way of knowing where they are in the file. Erroring would probably be best.

> All of the LO code needs to be eyeballed to make sure it still
> behaves sanely if int64 is really only 32 bits.

Of course.

> It would probably be a good idea also to introduce some overflow
> checks to detect cases where the current LO offset would overflow
> int64 after a read, write, or seek. This is missing from the existing
> code :-( It is possible to code overflow checks that work regardless
> of the size of int64; see int8.c for some inspiration.

Yes. That would be good. These would be in the inv_* functions, correct?

> I'd suggest also that the offset be defined as signed not unsigned
> (int64 not uint64), as this will simplify the overflow checks and
> eliminate the prospect of scenarios where lo_tell64 cannot return a
> correct value.

I intended to do that. I think the only place I used uint64 vs int64 was the getsize function, and that could be int64 also. I will look at that code and make sure I am not mixing them in ways that are not necessary and useful.

I will take these suggestions and make another revision of the patch shortly.

Also, I was considering exposing an lo_getsize or lo_stat function which would tell you how big a large object is without having to seek to the end and look at the return value, which requires you to have the large object open at the time, loses your old seek position (unless you do a tell first), and requires several more server round-trips (if not open, it would involve open/seek/close; if open, it could require tell/seek/seek).
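P.S. A sketch of what the int32 range check could look like in the 32-bit server-side lo_tell path (the lo_tell64_internal() helper and the message wording are illustrative only, not the actual patch):

    Datum
    lo_tell(PG_FUNCTION_ARGS)
    {
        int32   fd = PG_GETARG_INT32(0);
        int64   offset = lo_tell64_internal(fd);    /* hypothetical helper */

        /* error out rather than silently truncate a >2GB position */
        if (offset != (int64) (int32) offset)
            ereport(ERROR,
                    (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                     errmsg("large object position does not fit in an integer")));

        PG_RETURN_INT32((int32) offset);
    }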
Re: [HACKERS] 64-bit API for large objects
On Sat, 24 Sep 2005, Alvaro Herrera wrote:
> Hey,
>
> While you guys are hacking at the LO code, it would be nice to
> consider the suggestions outlined here:
>
> http://archives.postgresql.org/pgsql-bugs/2004-07/msg00143.php

Included from that message for easier reference:

> 0) In Oid lo_creat(PGconn *conn, int mode), why is there a mode on
> lo_create? The mode is determined when the object is lo_open()ed,
> right?

I think the docs basically said it is a vestigial feature; it used to be useful, but the code evolved in such a way that it ceased being useful. It is probably still there to allow old code to continue to compile against newer servers without being recompiled.

> 1) There is no lo_truncate(PGconn *conn, int fd, off_t len).

Did not notice that one. That is a good one to add if adding functions is in the cards. I bet when the person/people who are attempting to write to this API get far enough, they would have noticed that too ;)

> 2) There is no lo_length(PGconn *conn, int fd).

We did notice this one, however. There is also no lo_stat(PGconn *conn, Oid lobjId). I have been thinking about implementing these two. I think I will make a revision of the patch at some point with these. The size ones will be extremely easy; the functionality is already there, it is just a matter of exposing it. The truncate is not too difficult, but actually requires me to think a little more ;)

-- 
When does summertime come to Minnesota, you ask?
Well, last year, I think it was a Tuesday.
[HACKERS] gcc4's uninitialized-variable warnings
I asked some gcc experts at Red Hat about the new variable-may-be-used-uninitialized warnings that gcc 4.x reports. These occur in cases like

    int i, j;
    ...
    foo(&i, &j);    // use i and j

I had thought that gcc was being stricter about the possibility that the called function might not set its output parameters, but the true story is entirely different. There's been no change in the strictness of the check for external function calls. What is happening is that if foo() is static and gcc chooses to inline it into the calling function, you will now see a warning if the transformed code fails the check. In essence this means that there is a code path through foo() that doesn't set the output parameter.

Armed with that knowledge, we can fix these warnings by ensuring the callee sets the output parameters in all code paths, which is often cleaner than having the caller initialize the variables before the call, as I was afraid we'd have to do.

I'll work on cleaning these up.

			regards, tom lane
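P.S. A self-contained illustration of the behavior described above (hypothetical names; compile with gcc 4.x and -O2 -Wall). Once maybe_get() is inlined into caller(), the compiler can see the code path that leaves *out unset and warns about j:

    #include <stdio.h>

    static int
    maybe_get(int flag, int *out)
    {
        if (flag)
        {
            *out = 42;
            return 1;
        }
        return 0;           /* this path never sets *out */
    }

    void
    caller(int flag)
    {
        int     j;

        maybe_get(flag, &j);
        printf("%d\n", j);  /* warning: 'j' may be used uninitialized */
    }

The fix in the spirit of the above is to make maybe_get() store something into *out on the early-return path as well, rather than initializing j at its declaration in every caller.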
Re: [HACKERS] 64-bit API for large objects
Jeremy Drake [EMAIL PROTECTED] writes:
>> 0) In Oid lo_creat(PGconn *conn, int mode), why is there a mode on
>> lo_create? The mode is determined when the object is lo_open()ed,
>> right?
>
> I think the docs basically said it is a vestigial feature; it used to
> be useful, but the code evolved in such a way that it ceased being
> useful. It is probably still there to allow old code to continue to
> compile against newer servers without being recompiled.

Yeah. There were once multiple types of large objects, and I suppose the mode argument told lo_creat which kind to create. I have no idea how the read/write bits got included into that --- it doesn't make any sense. As of PG 8.1, lo_creat just ignores the mode argument. We can't delete the argument, though, without causing a lot of compatibility headaches.

			regards, tom lane
[HACKERS] \x output blowing up
Hi,

On CVS tip, whenever I do \x output, it dies with an internal error in glibc on free(). If you run it under valgrind, it complains about these lines of code:

    700    {
    701        char *my_cell = pg_local_malloc(cell_w[i] + 1);
    702
    703        strcpy(my_cell, *ptr);               <== [Invalid write of 1 byte]
    704        if (opt_align[i % col_count] == 'r' && opt_numeric_locale)
    705            format_numeric_locale(my_cell);
    706        if (opt_border < 2)
    707            fprintf(fout, "%s\n", my_cell);
    708        else
    709            fprintf(fout, "%-s%*s |\n", my_cell,
                           dwidth - cell_w[i], ""); <== [Invalid read of 1 byte]
    710        free(my_cell);
    711    }

Now, apart from the fact that the cell width != strlen in multibyte encodings, there must be something else, because this is just "select * from pg_proc" and there are no multibyte characters there AFAIK. I can't see it though.

Hope this helps,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is
a tool for doing 5% of the work and then sitting around waiting for
someone else to do the other 95% so you can sue them.
Re: [HACKERS] Start translating
Tom Lane wrote:
> One thing I had wanted to do before issuing the call for translations
> was to make a pass over the recent GiST changes, looking at
> elog-vs-ereport decisions and message wording.

OK, can you tell me where those are and I can look at it.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/
Re: [HACKERS] \x output blowing up
On Sat, Sep 24, 2005 at 11:45:08PM +0200, Martijn van Oosterhout wrote:
> Hi,
>
> On CVS tip, whenever I do \x output, it dies with an internal error
> in glibc on free(). If you run it under valgrind, it complains about
> these lines of code:

<snip>

Ok, I worked out the direct cause: pg_wcswidth only returns the length up to the first newline, and the line it breaks on is the multiline definition in _pg_expandarray.

The quick fix should be to only allocate memory if it's going to call format_numeric_locale(), since then you know it's a number. It makes the code slightly more convoluted, but it should also be slightly more efficient.

I actually have a working psql that handles and displays newlines properly, but it's too late for 8.1. It fixes all these issues properly.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is
a tool for doing 5% of the work and then sitting around waiting for
someone else to do the other 95% so you can sue them.
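P.S. Concretely, the quick fix would be something along these lines (an untested sketch against the print.c excerpt I posted earlier; the extra slack in the malloc is an assumption to leave room for the separators format_numeric_locale() inserts):

    const char *cell = *ptr;
    char       *my_cell = NULL;

    if (opt_align[i % col_count] == 'r' && opt_numeric_locale)
    {
        /* size by byte length, not display width, and leave room
         * for inserted thousands separators */
        my_cell = pg_local_malloc(strlen(*ptr) * 2 + 2);
        strcpy(my_cell, *ptr);
        format_numeric_locale(my_cell);
        cell = my_cell;
    }
    if (opt_border < 2)
        fprintf(fout, "%s\n", cell);
    else
        fprintf(fout, "%-s%*s |\n", cell, dwidth - cell_w[i], "");
    if (my_cell)
        free(my_cell);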
Re: [HACKERS] Start translating
Peter Eisentraut [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> One thing I had wanted to do before issuing the call for
>> translations was to make a pass over the recent GiST changes,
>> looking at elog-vs-ereport decisions and message wording.
>
> OK, can you tell me where those are and I can look at it.

Done already, but thanks for offering.

			regards, tom lane
Re: [HACKERS] questionable item in HISTORY
Tatsuo Ishii wrote:
> The following item in HISTORY:
>
>     * Add support for 3 and 4-byte UTF8 characters (John Hansen)
>       Previously only one and two-byte UTF8 characters were supported.
>       This is particularly important for support for some Chinese
>       characters.
>
> is wrong, since 3-byte UTF-8 characters have been supported ever
> since UTF-8 support was added to PostgreSQL. A correct description
> would be:
>
>     * Add support for 4-byte UTF8 characters (John Hansen)
>       Previously only up to three-byte UTF8 characters were supported.
>       This is particularly important for support for some Chinese
>       characters.

Release notes updated.

> In the meantime I wonder if we need to update the UTF-8-to-locale
> encoding maps. The author of the patches stated that "This is
> particularly important for support for some Chinese characters." I
> have no idea what encoding he is referring to, but I wonder if the
> latest Chinese encoding standard, GB18030, needs 4-byte UTF-8
> mappings. If yes, we surely need to update utf8_to_gb18030.map.
> Anybody familiar with GB18030/UTF-8?

Good question. The report we got in the past was that some UTF characters were being rejected even though they were valid UTF characters, mostly Chinese. I have no idea how they map to GB* character sets.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
[HACKERS] Discarding relations from FSM
Per http://lnk.nu/developer.postgresql.org/43b.c:

    * The number of distinct relations tracked is limited by a configuration
    * variable (MaxFSMRelations).  When this would be exceeded, we discard the
    * least recently used relation.  A doubly-linked list with move-to-front
    * behavior keeps track of which relation is least recently used.

Rather than keeping track of which relation was last used (presumably every time something hits the FSM), wouldn't it make more sense to just drop the relation with the smallest amount of free space in the map?

Of course, maybe a better question is why we even limit based on the number of relations...

-- 
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf     cell: 512-569-9461
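P.S. For reference, the move-to-front bookkeeping that comment describes amounts to something like the following (an illustrative toy, not the actual freespace.c code): every FSM hit unlinks the relation's entry and splices it in at the head, so the tail is always the least recently used entry and therefore the eviction victim.

    typedef struct FsmRel
    {
        struct FsmRel *prev;
        struct FsmRel *next;
        /* ... per-relation free-space data ... */
    } FsmRel;

    static FsmRel *head;    /* most recently used */
    static FsmRel *tail;    /* least recently used: the eviction victim */

    static void
    unlink_rel(FsmRel *rel)
    {
        if (rel->prev) rel->prev->next = rel->next; else head = rel->next;
        if (rel->next) rel->next->prev = rel->prev; else tail = rel->prev;
    }

    static void
    touch_rel(FsmRel *rel)  /* called on every FSM hit */
    {
        unlink_rel(rel);
        rel->prev = NULL;   /* splice in at the front */
        rel->next = head;
        if (head) head->prev = rel;
        head = rel;
        if (tail == NULL) tail = rel;
    }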
Re: [HACKERS] 2 forks for md5?
Andrew Dunstan wrote:
> Bruce Momjian wrote:
>> Tom Lane wrote:
>>> I'm coming to agree with Andrew that a documentation patch might be
>>> the best answer. But where to put it ... under the description of
>>> the log_connections GUC var?
>>
>> I am thinking we should wait for someone else to notice the double
>> log entries before mentioning it in the docs.
>
> If I had a more Machiavellian bent I would make sure that happened ;-)
>
> How about this note under log_connections?
>
>     Some clients (notably psql) sometimes try to connect without a
>     password before trying with a password. This behaviour will
>     generate two log lines if log_connections is turned on, even
>     though to the user it appears that only one connection has
>     occurred.

OK, you wore me down. :-)  New text:

    This outputs a line to the server log detailing each successful
    connection. This is off by default, although it is probably very
    useful. Some client programs, like <application>psql</application>,
    attempt to connect twice while determining if a password is
    required, so duplicate <literal>connection received</literal>
    messages are not a sign of a problem. This option can only be set
    at server start or in the <filename>postgresql.conf</filename>
    configuration file.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Re: [HACKERS] gcc4's uninitialized-variable warnings
Tom Lane wrote:
> I asked some gcc experts at Red Hat about the new
> variable-may-be-used-uninitialized warnings that gcc 4.x reports.
> [...]
> What is happening is that if foo() is static and gcc chooses to
> inline it into the calling function, you will now see a warning if
> the transformed code fails the check. In essence this means that
> there is a code path through foo() that doesn't set the output
> parameter.
>
> Armed with that knowledge, we can fix these warnings by ensuring the
> callee sets the output parameters in all code paths, which is often
> cleaner than having the caller initialize the variables before the
> call, as I was afraid we'd have to do.
>
> I'll work on cleaning these up.

Wow, that is a nifty compiler check.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Re: [HACKERS] Discarding relations from FSM
Jim C. Nasby wrote:
> Per http://lnk.nu/developer.postgresql.org/43b.c:
>
>     * The number of distinct relations tracked is limited by a
>     * configuration variable (MaxFSMRelations).  When this would be
>     * exceeded, we discard the least recently used relation.  A
>     * doubly-linked list with move-to-front behavior keeps track of
>     * which relation is least recently used.
>
> Rather than keeping track of which relation was last used (presumably
> every time something hits the FSM), wouldn't it make more sense to
> just drop the relation with the smallest amount of free space in the
> map?

The one with the smallest amount of free space might be the one we want to vacuum next, so perhaps it will be added to very soon. The _usage_ of the free space is a much better indicator.

> Of course, maybe a better question is why we even limit based on the
> number of relations...

Limited shared memory.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Re: [HACKERS] \x output blowing up
Well, it seems we are going to have to fix it somehow for 8.1. It is not crashing here, so I can't work up a patch. Can you submit a minimal fix for 8.1? Thanks.

---------------------------------------------------------------------------

Martijn van Oosterhout wrote:
> On Sat, Sep 24, 2005 at 11:45:08PM +0200, Martijn van Oosterhout wrote:
>> Hi,
>>
>> On CVS tip, whenever I do \x output, it dies with an internal error
>> in glibc on free(). If you run it under valgrind, it complains about
>> these lines of code:
>
> <snip>
>
> Ok, I worked out the direct cause: pg_wcswidth only returns the
> length up to the first newline, and the line it breaks on is the
> multiline definition in _pg_expandarray.
>
> The quick fix should be to only allocate memory if it's going to call
> format_numeric_locale(), since then you know it's a number. It makes
> the code slightly more convoluted, but it should also be slightly
> more efficient.
>
> I actually have a working psql that handles and displays newlines
> properly, but it's too late for 8.1. It fixes all these issues
> properly.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Re: [HACKERS] Discarding relations from FSM
Jim C. Nasby [EMAIL PROTECTED] writes:
> Rather than keeping track of which relation was last used (presumably
> every time something hits the FSM), wouldn't it make more sense to
> just drop the relation with the smallest amount of free space in the
> map?

Why? That certainly wouldn't be cheaper to track.

> Of course, maybe a better question is why we even limit based on the
> number of relations...

Shared memory is fixed-size.

			regards, tom lane
[HACKERS] \d on database with a lot of tables is slow
I have a client with a database that contains 4000 relations according to vacuum verbose, and \d in psql is painfully slow. In particular...

    ->  Seq Scan on pg_class c  (cost=0.00..2343.09 rows=6124 width=73)
                                (actual time=0.325..22100.840 rows=16856 loops=1)
          Filter: (((relkind = 'r'::char) OR (relkind = 'v'::char) OR
                    (relkind = 'S'::char) OR (relkind = ''::char)) AND
                   pg_table_is_visible(oid))

That's off my laptop, but they're seeing similar issues on an 8-way Opteron as well... I've messed around with adding indexes to a copy of pg_class to no avail. Any ideas on how to improve the performance?

Also, not sure if this matters, but they're occasionally getting errors like "Did not find relation named <table that exists>" (where <table that exists> is the name of some table that is in the catalog) from \d. Does anyone know what causes that?

-- 
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf     cell: 512-569-9461
[HACKERS] Vacuum questions...
Would it be difficult to vacuum as part of a dump? The reasoning behind this is that you have to read the table to do the dump anyway, so it would be a good time to piggy-back other operations that need to read the entire table. I know vacuuming of indexes complicates this, so it's probably not as simple as just firing off a vacuum and a copy at the same time (although that idea is probably worth testing, since it might still be a win).

When dropping a table or index, is its space immediately released in the FSM?

Also, would it be possible to add some means to check the status of a running vacuum? Even with vacuum verbose, once it starts in on a large table you have no way to know how far along it is.

Finally, if vacuum_delay is enabled, does vacuum_cost_page_miss consider a miss as not in the database buffer, or not in the kernel buffer? I remember discussions about trying to track IO request times to try and determine if something came out of kernel buffers or not, but AFAIK that's all vaporware right now...

-- 
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf     cell: 512-569-9461
Re: [HACKERS] \d on database with a lot of tables is slow
On Sat, 2005-09-24 at 18:59 -0500, Jim C. Nasby wrote:
> I have a client with a database that contains 4000 relations
> according to vacuum verbose, and \d in psql is painfully slow. In
> particular...
>
>     ->  Seq Scan on pg_class c  (cost=0.00..2343.09 rows=6124 width=73)
>                                 (actual time=0.325..22100.840 rows=16856 loops=1)
>           Filter: (((relkind = 'r'::char) OR (relkind = 'v'::char) OR
>                     (relkind = 'S'::char) OR (relkind = ''::char)) AND
>                    pg_table_is_visible(oid))
>
> That's off my laptop, but they're seeing similar issues on an 8-way
> Opteron as well... I've messed around with adding indexes to a copy
> of pg_class to no avail. Any ideas on how to improve the performance?

It is probably the visibility checks. Is \d fast if you include the full name (schema.table)?

I brought this up a while ago, and Tom has since rearranged some of the psql queries to move the visibility check to come after the other where clause segments.

It would be nice if the cost of the function could be added somehow -- even if it was just a low, medium or high setting. This would allow the planner to shuffle the where clause execution ordering around in a reasonable manner.
Re: [HACKERS] Discarding relations from FSM
On Sat, Sep 24, 2005 at 07:21:19PM -0400, Tom Lane wrote:
>> Of course, maybe a better question is why we even limit based on the
>> number of relations...
>
> Shared memory is fixed-size.

True, but can't the fixed memory required per relation just be shared with the fixed memory used to store free pages? Though the idea mentioned recently of just using one shared memory segment for everything and allocating dynamically within that probably makes more sense...

-- 
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf     cell: 512-569-9461
Re: [HACKERS] \d on database with a lot of tables is slow
On Sat, Sep 24, 2005 at 08:20:47PM -0400, Rod Taylor wrote:
> It is probably the visibility checks. Is \d fast if you include the
> full name (schema.table)?
>
> I brought this up a while ago, and Tom has since rearranged some of
> the psql queries to move the visibility check to come after the other
> where clause segments.
>
> It would be nice if the cost of the function could be added somehow
> -- even if it was just a low, medium or high setting. This would
> allow the planner to shuffle the where clause execution ordering
> around in a reasonable manner.

\d tablename is fast, yes. Maybe instead of re-arranging the query it would make more sense to roll the visibility check into the query itself (probably using a new system view).

-- 
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf     cell: 512-569-9461
Re: [HACKERS] Discarding relations from FSM
Jim C. Nasby wrote:
> On Sat, Sep 24, 2005 at 07:21:19PM -0400, Tom Lane wrote:
>>> Of course, maybe a better question is why we even limit based on
>>> the number of relations...
>>
>> Shared memory is fixed-size.
>
> True, but can't the fixed memory required per relation just be shared
> with the fixed memory used to store free pages?

The assumption is that the admin wants to control the allotment of memory, and doesn't want it to shift based on (perhaps temporary) load.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Re: [HACKERS] Vacuum questions...
Jim C. Nasby wrote:
> Would it be difficult to vacuum as part of a dump? The reasoning
> behind this is that you have to read the table to do the dump anyway,
> so it would be a good time to piggy-back other operations that need
> to read the entire table. I know vacuuming of indexes complicates
> this, so it's probably not as simple as just firing off a vacuum and
> a copy at the same time (although that idea is probably worth
> testing, since it might still be a win).

This would be a nightmare on a large database. Think of how long it takes to dump 20 gig, now add how long it is going to take to vacuum a DB of that size, now think about a 500 gig database. Actually, this also probably would not gain you much in 8.1, as vacuum in theory already takes care of itself.

> When dropping a table or index, is its space immediately released in
> the FSM?

I would have to double-check, but I believe you would have to vacuum to reclaim the space to the FSM, because the relation is still there, just like when you delete (but not truncate).

> Also, would it be possible to add some means to check the status of a
> running vacuum? Even with vacuum verbose, once it starts in on a
> large table you have no way to know how far along it is.

That is an interesting thought... Perhaps a quick scan of the table to see how many dead rows there are? Then check back every n/10? Hmmm... I am not a C guy, so I don't know if that is technically feasible (although probably possible), but it is interesting from a DBA perspective. Although that could be an issue on a large table as well, I think.

> Finally, if vacuum_delay is enabled, does vacuum_cost_page_miss
> consider a miss as not in the database buffer, or not in the kernel
> buffer? I remember discussions about trying to track IO request times
> to try and determine if something came out of kernel buffers or not,
> but AFAIK that's all vaporware right now...

Good question, anyone else?

Sincerely,

Joshua D. Drake

-- 
Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/
Re: [HACKERS] Vacuum questions...
On Sat, Sep 24, 2005 at 07:17:38PM -0500, Jim C. Nasby wrote:
> Finally, if vacuum_delay is enabled, does vacuum_cost_page_miss
> consider a miss as not in the database buffer, or not in the kernel
> buffer?

The database buffer.

> I remember discussions about trying to track IO request times to try
> and determine if something came out of kernel buffers or not, but
> AFAIK that's all vaporware right now...

I don't remember the discussion, but it certainly hasn't been implemented.

-- 
Alvaro Herrera   http://www.PlanetPostgreSQL.org
"Hear and you will forget; see and you will remember; do and you will
understand" (Confucius)
Re: [HACKERS] Releasing memory during External sorting?
Generally, when you read from a set of subfiles, the OS will cache the reads to some degree, so the disk-seek jitter is not always that bad. On a highly fragmented disk drive, you might also jump all over the place reading serially from a single subfile. Of course, every situation is different. At any rate, I would recommend benchmarking many different approaches. It is also rather important how much memory is available to perform sorts and reads. But with memory becoming larger and cheaper, it becomes less and less of a worry.

> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Meir Maor
> Sent: Friday, September 23, 2005 10:24 PM
> To: Simon Riggs
> Cc: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org
> Subject: Re: [HACKERS] Releasing memory during External sorting?
>
> Calculating the optimal memory for a disk-based sort is based only on
> minimizing IO. A previous post stated we can merge as many subfiles
> as we want in a single pass; this is not accurate, as we want to
> eliminate disk seeks in the merge phase as well. The merging should
> be done by reading blocks of data from each subfile. If we have data
> of size N and memory M, then we will have K = N/M subfiles to merge
> after sorting each. In the merge operation, if we want to merge all
> blocks in one pass, we will read M/K data from each subfile into
> memory and begin merging; we will read another M/K block when the
> buffer from a subfile is empty. We would like disk seek time to be
> irrelevant when compared to sequential IO time. Notice that we are
> performing IO in blocks of N/K^2 = N/(N/M)^2 = M^2/N.
>
> Let us assume that sequential IO is done at 100MB/s and that a random
> seek requires ~15ms, and we want seek time to be irrelevant by one
> order of magnitude. We get that in the time of one random seek we can
> read 1.5MB of data, and we would get optimal performance if we
> perform IO in blocks of 15MB. Since in the merge algorithm shown
> above we perform IO in blocks of M/K, we would like M > K*15MB, which
> results in a very large memory requirement:
>
>     M^2 > N*15MB  =>  M > sqrt(N*15MB)
>
> For example, for sorting 10GB of data, we would like M > 380MB for
> optimal performance.
>
> Alternatively, we can choose a different algorithm in which we merge
> only a constant number of subfiles together at a time, but then we
> will require multiple passes to merge the entire file. We will
> require log(K) passes over the entire data, and this approach
> obviously improves with an increase of memory. The first approach
> requires 2 passes over the entire data and K^2+K random seeks; the
> second approach (when merging l blocks at a time) requires log_l(K)
> passes over the data and K*l+K random seeks.
>
> On 9/23/05, Simon Riggs [EMAIL PROTECTED] wrote:
>> I have concerns about whether we are overallocating memory for use
>> in external sorts. (All code relating to this is in tuplesort.c)
>>
>> When we begin a sort we allocate (work_mem | maintenance_work_mem)
>> and attempt to do the sort in memory. If the sort set is too big to
>> fit in memory we then write to disk and begin an external sort. The
>> same memory allocation is used for both types of sort, AFAICS.
>>
>> The external sort algorithm benefits from some memory but not much.
>> Knuth says that the amount of memory required is very low, with a
>> value typically less than 1 kB. I/O overheads mean that there is
>> benefit from having longer sequential writes, so the optimum is much
>> larger than that. I've not seen any data that indicates that a
>> setting higher than 16 MB adds any value at all to a large external
>> sort.
>> I have some indications from private tests that very high memory
>> settings may actually hinder performance of the sorts, though I
>> cannot explain that and wonder whether it is the performance tests
>> themselves that have issues.
>>
>> Does anyone have any clear data that shows the value of large
>> settings of work_mem when the data to be sorted is much larger than
>> memory? (I am well aware of the value of setting work_mem higher for
>> smaller sorts, so any performance data needs to reflect only very
>> large sorts).
>>
>> If not, I would propose that when we move from qsort to tapesort
>> mode we free the larger work_mem setting (if one exists) and
>> allocate only a lower, though still optimal, setting for the
>> tapesort. That way the memory can be freed for use by other users or
>> the OS while the tapesort proceeds (which is usually quite a
>> while...).
>>
>> Feedback, please.
>>
>> Best Regards, Simon Riggs