[GENERAL] postgresql +AMD64 +big address spaces - does it work?
If I bought one of these box/OS combos as a PostgreSQL database server, would PostgreSQL be able to make the best use of it with a huge (e.g. 40GB) database?

Box: HP ProLiant DL585, with 4 AMD64 CPUs and 64GB of RAM (other vendor options also exist).
OS: SUSE Enterprise 8 Linux for AMD64 (links to product info at bottom).

e.g. could PostgreSQL make use of most of this address space to map a huge database (say 40GB) all paged into RAM at once? According to SUSE, a process running on such a setup can individually address 0.5 terabytes of address space, and at any one point, with the hardware above, up to 64GB (minus a bit!) of this address space could be paged into physical memory at any one time.

According to the docs I could find, I just need to do the following:

1. Set postgresql.conf -> shared_buffers to a value like (40GB / 8KB).
2. Check that cat /proc/sys/kernel/shmmax is big enough, or successfully increase it to > 40GB.

From experience, that's a long way from knowing it will work. Does anyone have any experience at all with such big memory configurations?

Many thanks,
Andy
__
My references:
SUSE: www.suse.com/amd64 and www.suse.com/en/business/products/server/sles/sles8_amd64.pdf
HP: http://h18004.www1.hp.com/products/servers/proliantdl585/index.html

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
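[Editor's note: a back-of-envelope check of steps 1 and 2 above. This is arithmetic only; whether a 40GB shared_buffers is actually wise is the question of the thread.]

```python
# Sanity arithmetic for the two settings described above.
BLCKSZ = 8 * 1024                       # PostgreSQL's default page size (8KB)
db_size = 40 * 1024**3                  # the 40GB database

shared_buffers = db_size // BLCKSZ      # candidate postgresql.conf value
print(shared_buffers)                   # 5242880 buffers

# shmmax must cover the whole shared memory segment (overhead ignored here):
print(shared_buffers * BLCKSZ)          # 42949672960 bytes, i.e. the full 40GB
```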
[GENERAL] Enough RAM for entire Database.. cost aside, is this going to be fastest?
Hello,

Sorry for this newbish question. Briefly, my problem:

I expect the database I'm working on to reach something in the order of 12-16 gigabytes, and I am interested in understanding as much as I can about how to make this go as fast as possible on a Linux system. I haven't run such a large database before. The nature of the database is such that successive queries are very likely to lead to poor cache performance. I have lots of experience with disks, operating systems, caching, virtual memory, RAM etc. - just none running gargantuan databases very quickly!

I've read all the performance tuning and configuration stuff I can, but there is one basic question I can't get an answer to:

My question: if I can afford to have the entire database residing in RAM (within the PostgreSQL shared buffer cache, swapped into real memory) without impacting other services on the machine, is there any reason why I shouldn't do it, other than cost? (!)

Basically, I'm finding it almost impossible to predict 'how much RAM is right'. I know I don't need the *entire* database to sit in RAM, and a lot of this answer depends on a lot of things - the speed of IO, the nature of the queries etc. But when you get to a certain amount of RAM (specifically, the amount where nothing needs to be swapped out), surely things get a bit more certain... or do they?

So could I, for example, set up PostgreSQL with a 16GB shared buffer cache and expect the backend processes to fly like the wind (CPU, RAM and disk write speed permitting)? I understand that writes can delay the progression of updates if set up in a certain way, and that's OK - I'm really just wondering if there are some *other* boundaries that will get in the way. I've read that I should be able to configure a Linux box (with the right processors) to address up to 64GB of RAM, and I'm aware of more esoteric boxes like the SGI Altix 3000 which can go far higher, but maybe that's overkill..
If there are any resources out there that point to the experiences of others trying to coerce a huge database to run largely from RAM, I'd be grateful for the links.

Many thanks
Andy
Re: [GENERAL] postgresql +AMD64 +big address spaces - does it work?
Ok - just to end this thread, I think I understand what I was missing. I'll stop this thread, and just comment on my first thread. Thank you everyone who helped
Re: [GENERAL] Enough RAM for entire Database.. cost aside, is this
> There's a good bit of depth in the archives of this list. I would start
> searching back for discussions of effective_cache_size, as that is involved
> in *costing* the caching job that the OS is doing.

Thanks - that's just what I need to sink my teeth into. I'll have a trawl and get back later.

Regards
Andy
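[Editor's note: for reference, effective_cache_size in the PostgreSQL versions of this era is expressed in 8KB disk pages, and is only a *hint* to the planner about how much the OS is probably caching - it allocates nothing. A rough calculation, with a hypothetical figure for the RAM left over to the OS cache:]

```python
# effective_cache_size is a planner hint, measured in 8KB pages.
# The 12GB below is a hypothetical estimate of RAM left to the OS cache.
BLCKSZ = 8 * 1024
os_cache_bytes = 12 * 1024**3

print(os_cache_bytes // BLCKSZ)   # 1572864 -> effective_cache_size = 1572864
```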
Re: [GENERAL] Enough RAM for entire Database.. cost aside, is this
Hello Shridhar,

Thanks for the reply.

> There is no reason why you should not do it. How remains to be a point of
> disagreement though. You don't allocate 16GB of shared buffers to postgresql.
> That won't give you the performance you need.

I think in the other thread, Tom was alluding to this too. What is it about the shared buffer cache's behaviour that makes it inefficient when it is very large (assuming that the address space it occupies is allocated to RAM pages)? Is there a good place I could look for some in-depth details of its behaviour?

Many thanks,
Andy
Re: [GENERAL] postgresql +AMD64 +big address spaces - does it work?
> I get the feeling that, regardless of 64-bit support or not, the
> *concept* of a database which just happens to all easily fit within RAM
> isn't one that gets the thumbs down...

Oops, I meant to say '*is*' one that gets the thumbs down...
Re: [GENERAL] Enough RAM for entire Database.. cost aside, is this
Hello again Mike,

Thanks for the replies! Here's my next salvo!

> Perhaps I'm a bit naive about complex data structure caching strategies, but
> it seems to me that the overhead of tracking tuples (which is what you
> would want if you are going to manage your own cache, as opposed to simply
> caching disk pages as the OS does) would be memory-prohibitive except in
> the most extreme cases of RAM > DB SIZE.

But doesn't PostgreSQL already make a decision about what needs to be in its workspace prior to executing a plan? I'm not actually suggesting it do anything different. In fact, the only thing I'm suggesting is that the workspace need not be limited in size. Remember - I'm only talking about the scenario where the entire database - every single tuple - can reside in the workspace (the shared buffer cache) at the same time. There is no extra 'managing your own cache', as you put it. Just let PostgreSQL's postmaster do what it does (assuming it doesn't kick stuff out of its shared buffer cache unless it needs the space for something else... OK, that may be an incorrect assumption - I need to find out for sure).

But I think the overall effort is less, not more. Once in the shared buffer cache, a tuple never needs to be kicked out again to make way for another. That's what is so nice. Everything is just there... no need to calculate a least-damaging cache replacement strategy, no need to wait for blocks of data to work their way from the disk or the Linux buffer cache to the PostgreSQL workspace - it all ends up there, and stays there, with only syncing activity needing to take place when updates happen.

> I find that experience does not bear this out. There is a saying a coworker
> of mine has about apps that try to solve problems, in this case caching,
> that are well understood and generally handled well at other levels of the
> "software stack"... he calls them "too smart by half" :)

Well, I guess I'd agree and disagree with your friend. What a political animal I am!
Here's why. Imagine this scenario. Your application uses 5 bits of data - A, B, C, D and E. Only 3 can sit in memory at once. You want to do two transactions. The first combines A, B and C, so you load those, in order. Once you're finished with that first transaction, you then need to combine A, C and D. Now, an application-driven cache will see immediately that the optimal cache replacement is to load D into the slot taken by B. With the OS in charge, you'll probably end up with a cache thrash. This is simplistic, but it is just this sort of smart caching that can lead to dramatic speedups, and I've got quite a bit of experience of seeing just that. I don't want to devalue the Linux cache strategy - it is very appropriate for almost all situations. I'm talking about a very specific scenario here, though. Does that make sense?

> The problem with this assumption, as Tom Lane has said, is that generally
> speaking, a kernel is much faster with respect to cache *lookup* than
> postgres is. Postgres assumes that only a subset of the data stored in a
> cluster will be needed at any one time. Under this assumption (that it
> will need to store and shovel through thousands, instead of millions or
> more, tuples) some optimizations have been made that would cause
> performance degradation if used over the entire (potentially enormous)
> dataset.

Again, in my scenario, postgres is *relieved* of the task of having to decide what data to choose to be in the workspace, since ALL of the data is accessible there. The point about postgres having to shovel through millions of tuples is a different point. If I do a query which can only be executed by looking at a million tuples, then that is what has to happen, no matter what caching strategy you employ. Obviously, if you're shovelling data in and out of memory, then the *way* you traverse those tuples becomes important to efficiency, as loading sequential stuff off a disk is always faster.
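[Editor's note: the thrash described above is easiest to reproduce with a repeated scan just larger than the cache - the classic LRU worst case. A toy simulation, with hypothetical page names; 'optimal' here is Belady's clairvoyant replacement, i.e. the application-driven cache that knows what comes next:]

```python
def simulate_lru(trace, capacity):
    """Hit count for a least-recently-used cache of `capacity` slots."""
    cache, hits = [], 0
    for page in trace:
        if page in cache:
            hits += 1
            cache.remove(page)
            cache.append(page)          # move to most-recently-used end
        else:
            if len(cache) >= capacity:
                cache.pop(0)            # evict the least-recently-used page
            cache.append(page)
    return hits

def simulate_optimal(trace, capacity):
    """Hit count for Belady's optimal replacement (knows the future)."""
    cache, hits = set(), 0
    for i, page in enumerate(trace):
        if page in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            def next_use(p):            # distance to the page's next access
                try:
                    return trace.index(p, i + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(page)
    return hits

trace = ["A", "B", "C", "D"] * 3        # repeated scan, one page too large
print(simulate_lru(trace, 3))           # 0 - LRU thrashes on every access
print(simulate_optimal(trace, 3))       # 6 - half the accesses hit
```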
If it's *all* in the workspace, well, it doesn't matter at all - the planner can forget about disks and choose the fastest algorithmic option. Again, this is a simplification, but I hope you can see what I'm getting at.

> Come now, Nov. 2001? Most of the issues brought up here have been solved
> *in the kernel* since then. (In the general case, with the bevy of IO
> schedulers, and the interruptible kernel.)

The point is that if your data is sitting in two parts of memory at the same time, then that's a waste of memory, and it's extra time copying all that stuff around. The date of the article isn't relevant to this, I don't think. They make the point that no matter how efficient the OS kernel is, in the sort of scenario I am interested in, it just ends up being an unnecessary step.

> Well, Oracle isn't PG, and I can't speak to their buffering and caching
> implementation.

I'm just saying that these issues are generic issues for any high-performance database server, regardless
Re: [GENERAL] Enough RAM for entire Database.. cost aside, is this
> If all the data fits into memory, then this cache thrashing won't occur, yes?

No - it *can* occur in a two-tier cache strategy. The critical question here is: *if* the data PostgreSQL needs is in the Linux buffer cache, what (if anything) does the OS have to do to make it available to the postmaster? The worst case would be an actual memory copy of the data from the Linux cache to where PostgreSQL asked for the file data to be loaded. I'll need to find out more about the Linux buffer cache to answer this question for myself.

If a memory copy is required, then this is a good way of visualising it: I live in London, but have to work on 5th Avenue, NYC for a week. My company stupidly sets me up in a smelly hotel by JFK airport.

London is 'the disk system'.
5th Avenue is 'the PostgreSQL workspace'.
My smelly hotel at JFK is 'the Linux buffer cache'.

Clearly, it's far quicker to commute to work each day from JFK than it is to commute from London each day. But wouldn't it be faster still if I had a room in the swanky hotel on 5th Avenue above the office? Yes, I'd still have to arrive at JFK, but then I could stay on 5th Avenue all week until I'm finished. And as far as direct IO is concerned, it gets better: that would be like taking a plane straight from London to 5th Avenue, avoiding the tedious immigration line at JFK.

Having said all of that, I'll go and find out more about the Linux buffer cache before I pester the PostgreSQL cache experts.

Thanks for that test plan... I'm in the same position as you w.r.t. machinery - hence my thread (I was hoping someone had tried this!). But I will get my hands on something quite soon, and I'll do your test and post the results. I knew I'd have no option but to become very familiar with the behaviour of whatever DB I chose, so I'm looking forward to it (sort of).

> That is true, and will be until PG implements its own cache memory management AND on-disk storage.
I'd be willing to bet that any patches that dramatically increase speed for the majority of use cases would be considered strongly. I have to find out for sure, but since Linux is being used in lots of HPC clusters now, and for some real-time applications, I'd be very surprised if there wasn't already a direct IO strategy in the kernel - that's one of the things I'm looking at now. That being the case, and assuming that the PostgreSQL cache operates on arbitrarily sized pages (it seems set at 8KB pages by default - which, depending on the platform, may be good, assuming also that the address space is appropriately machine-page-aligned), then it might be 'very easy indeed' to upgrade PostgreSQL's cache manager to fire the necessary IO calls at the OS. Again, this is really something I need to talk to a pg cache expert about, but I've got a lot of catching up to do first.

By the way, direct IO doesn't mean the application has to know about the disk configuration or anything messy like that. It is still a block IO operation directed through the kernel. From the programmer's point of view the changes are trivial. It just bypasses the OS's own generic file cache and does a super-efficient single DMA memory transfer between the IO system and the host RAM - all of this has to be managed by the OS, or you'll end up with a platform-specific implementation that is of no real use. I would run a mile if this work required PostgreSQL to know about actual storage.

> Finding target data (complex lookup in the PG shared buffers) will take longer than transferring disk pages from the OS cache, based on index information, and then doing a simple, fast scan. This is the current working assumption for postgres.

This is what scares me a little. Why would the PG cache lookup be complex? That's a question for me to find the answer to... I'll wade through the developer resources and source code etc. before bothering the experts.
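[Editor's note: on the lookup question - a shared-buffer probe can be modelled as a hash-table lookup keyed on (relation, block number). A toy sketch with hypothetical names, not PG's actual structures; the real costs in practice are locking and replacement policy, not the probe itself:]

```python
# Toy buffer-pool lookup: a dict keyed by (relation, block number).
# No eviction or locking here - just the shape of the probe.
buffer_pool = {}                        # (rel, blockno) -> buffer slot

def read_buffer(rel, blockno):
    key = (rel, blockno)
    if key in buffer_pool:
        return buffer_pool[key], True   # hit: no OS involvement at all
    slot = len(buffer_pool)             # miss: would read the 8KB page here
    buffer_pool[key] = slot
    return slot, False

print(read_buffer("accounts", 42))      # (0, False) - first touch misses
print(read_buffer("accounts", 42))      # (0, True)  - second touch hits
```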
> Your contention, correct me if I've got this wrong, is that transferring it from the OS cache will be slower than finding it using a more complex lookup, but without moving the data from the OS cache to the PG workspace. In both scenarios all data is in RAM.

Well, if the PG cache is slower at dealing with large data, then yes, I concede that this changes the picture. But at this point in time, my gut reaction would be to assume that the PG cache is blazingly efficient. It is, after all, the heart of the DB, and all the DBs I've seen have a similar architecture.

> I see what you mean about a waste of memory, but if the costing parameters are tuned well for the use of the specific DB, then current evidence with postgres shows that the former strategy is faster. I would be interested to see if the latter scenario could be made to work with the current PG codebase, or even if there is a simple patch that could speed up the shared_buffers case.

Well, it's still not clear that the memory does get wasted, in which case this 'problem' doesn't exist (see my comment about
Re: [GENERAL] Enough RAM for entire Database.. cost aside, is this
> It's not that making the cache bigger is inefficient, it's that the cache is
> not used the way you are thinking.

OK, I think I've got it now. The missing piece of the puzzle was the existence of the Linux buffer cache. So that's what the effective_cache_size value is for(!)

I read Shridhar Daithankar and Josh Berkus's 'Tuning PostgreSQL for performance' document, and some things strike me (though only with respect to my RAM-much-bigger-than-database scenario). I think I'm a bit more depressed than I was earlier today!

1. PostgreSQL is a two-tiered cache mechanism. The first tier - the PostgreSQL shared buffer cache - sits on the second, larger tier, the Linux buffer cache. So bits of the same data end up being in memory... twice, and two cache mechanisms operate at the same time. (That's how I understand it.)

2. Even if the Linux buffer cache contains all the data required for an execution of a plan, there is still a load of memory copying to do between these two tiers. Though memory copying is faster than disk access, it is still an overhead, and isn't there a real danger of thrashing between these two tiers if the plan can't fit all the data into the top tier, even if the thrashing is restricted to the memory system?

3. The OS will implement a 'least recently used' cache replacement strategy on the data in its cache. This isn't the optimal cache replacement strategy for the database. If PostgreSQL were in charge of all its RAM-resident data, it could make more intelligent decisions about which stuff isn't needed once used.

This still leads me to think that the best thing to do in my 'much bigger RAM than database size' scenario would be for PostgreSQL to allocate a shared buffer cache big enough for all the data + a bit. By doing this, there would be some performance-enhancing gains, including:

1. If the planner *knew* (rather than guessed) that all the data was effectively 'there' in RAM, in the right place (i.e.
its workspace), wouldn't it make choosing the optimal plan easier? (I see that effective_cache_size is a stab in that direction, but just because PostgreSQL guesses the data will have been cached by Linux doesn't mean it actually is - surely an important distinction.)

2. You'd avoid a whole layer of caching, along with the not-necessarily-aligned memory copies and other overheads that this introduces. Even in the optimal case where all the data needed does reside in RAM, it's not in the right bit of RAM. (I may have misunderstood this relationship between the shared buffer cache and the Linux buffer cache - if no memory copying actually occurs, then I'll go away for a bit!)

Two interesting things I dug up today:

www.linuxjournal.com/article.php?sid=5482 (to do with a high-performance DB living in an OS-controlled environment)

http://www.ixora.com.au/tips/avoid_buffered_io.htm (to do with Oracle tuning)

The way things are, I see that PostgreSQL will 'gracefully degrade' in the very common scenario where the database shares a machine with limited resources and lots of processes, all of them hammering the machine. But in my 'optimal RAM-optimised DB server' scenario, the scheme seems actually to reduce the potential for a blistering system.

So given all of that, I can't help swinging back to my original question about whether it's a bad thing to make the shared buffer cache huge, so long as you have much more RAM than the size you choose. It'll still require all that nasty Linux cache <-> PostgreSQL cache memory copying, but it will at least reduce PostgreSQL's exposure to the 'not particularly clever' least-recently-used cache replacement strategy employed by the OS.

Am I getting any closer to understanding the way things are?

Thanks for your tolerance,
Andy

p.s.
(It would be great if you could just load the data directly into the shared buffer cache with direct IO, since that OS file buffer -> shared buffer cache transfer is just a waste of effort in this scenario. Direct IO also allows spectacular IO gains when you're working with certain RAID configurations connected with a suitably fat wire.)

Ok - I'll shut up now
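[Editor's note: a rough worst-case figure for the duplication worried about in this post, using the poster's own numbers (64GB machine, 16GB database, 20GB proposed shared buffer cache). Hedged: the OS evicts under memory pressure, so real duplication is usually less than this.]

```python
# Worst-case double buffering: database pages held in PG's shared
# buffers AND in the Linux page cache at the same time.
GB = 1024**3
ram            = 64 * GB     # the machine
db_size        = 16 * GB     # the database
shared_buffers = 20 * GB     # proposed cache (database + headroom)

duplicated = min(shared_buffers, db_size)        # at most the DB itself
print(duplicated // GB)                          # 16 GB potentially held twice
print((ram - shared_buffers - duplicated) // GB) # 28 GB left for all else
```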
Re: [GENERAL] postgresql +AMD64 +big address spaces - does it work?
Hello again, and thanks to everyone for the replies so far.

Tom, and all, I hear what you are saying, and furthermore, in cases where the amount of RAM is much smaller than the database size, I agree totally. However, I'm *only* talking about a particular scenario which, till now, has really only been true of small databases, and that is: RAM - (total size of database) = (easily enough RAM for everything else to run smoothly on the server).

I get the feeling that, regardless of 64-bit support or not, the *concept* of a database which just happens to all easily fit within RAM isn't one that gets the thumbs down...

...and I think I must be missing something! If you bear with me, could I just give you a better idea of where I'm coming from, because I think I may have missed an important concept, or might be able to sway you. (I don't mind which way this goes, other than that my server dreams will be in tatters if you're right and I'm confused, so be gentle with me!)

Before I start, just to repeat that though I have only been using PostgreSQL for 18 months and haven't run a huge database before, I do have a large amount of experience of designing memory-intensive multi-threaded real-time applications running in a Unix virtual memory environment (specifically SGI IRIX boxes), and I am very comfortable with the PostgreSQL implementation model. (And I did choose it over MySQL, so I am a fan.)

Ok! --- Where I'm coming from ---

I am developing a PostgreSQL-based national website, which hopefully will have a million or two users if successful (though hopefully not all at once :) ). The details aren't important, other than to say that the functionality I'm planning is quite advanced - it's more of a web application than a website as such, and the database is going to get an absolute hammering. Most importantly, a large percentage of the data within the database is going to be 'active' (i.e. the target of queries) on a minute-by-minute basis.
Luckily for me, it isn't due to go live just yet :)

There is now 'proper' 64-bit support in Linux for both Intel and AMD, and we've already seen a few Linux boxes with half a terabyte of RAM - shareable as a single address space (e.g. boxes like the Altix 3000 from SGI). Even commodity server boxes (such as the HP DL585) now provide the environment for multiple threads to share a single huge address space backed up by up to 64GB of RAM, accessible by 4 CPUs. I'm not very up to speed on the hardware side, but this is only going one way: up!

So, with this new hardware and 64-bit OSs, it is theoretically possible to dramatically reduce the IO bottlenecks on a database server by simply sticking in enough RAM so that the database ends up in RAM, leaving most of the IO being 'O' for syncing updates. We're talking about a high-performance database server. Surely this is a very desirable setup for someone like me? Here's how it could work:

1. The PostgreSQL postmaster (is that right?) - the process responsible for setting up the shared buffer cache - does what it normally does, except that it works with 64-bit addressing, so your cache can be 40MB or 200GB. (Currently limited to 64GB by the CPU in most 64-bit processor systems, but this will change.)

2. You set up the buffer cache size so that it can easily hold *the entire database* (or the active part... the decision of how big is subject to all the current guidelines; the only difference is that there is no 2GB limit). E.g. on my 64GB system, my entire database is 16GB, and I instruct PostgreSQL to set up a 20GB shared buffer cache, in order to allow headroom for updates.

3. At the PostgreSQL level, database pages ('buffers'?) get cached, and stay in the buffer cache, because there is enough space for everything to reside in the PostgreSQL cache and more.

4.
Additionally, at the virtual memory level, Linux isn't going to kick pages of this huge shared buffer cache out to disk, because there's plenty of RAM kicking around for everything else; in fact, the target for such a machine is to run at close to zero swap activity. We're talking about a high-performance system, after all, and swap activity is the last thing you want in a high-performance setup. Note that Linux is still managing *all the memory* - I'm not interfering at all with virtual memory. (Though IRIX's mpin() is very handy... I must find out if Linux has a similar call.)

5. Assuming your database *does* easily fit in the shared buffer cache, queries will run like the wind, limited only by CPU availability and the performance of the front-end web servers + the connections to deliver queries and accept the results.

6. You get more bang for your buck, especially in systems with a heavy load of concurrent PostgreSQL backend activity, which would normally put enormous strain on the IO system.

7. An extra benefit is that the database a
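[Editor's note: on the mpin() question in step 4 - the nearest Linux analogue is mlock()/mlockall(). A hedged ctypes sketch; it is Linux-only, typically needs root or a raised RLIMIT_MEMLOCK to succeed, and whether pinning a whole postmaster is sane is a separate question.]

```python
# Sketch: lock current and future pages of this process into RAM,
# the rough Linux equivalent of IRIX's mpin(). Returns False if the
# platform isn't Linux or the call is refused (e.g. RLIMIT_MEMLOCK).
import ctypes
import ctypes.util
import sys

MCL_CURRENT, MCL_FUTURE = 1, 2          # flag values from <sys/mman.h> on Linux

def lock_all_memory():
    if not sys.platform.startswith("linux"):
        return False                    # mlockall flags are Linux-specific here
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    return libc.mlockall(MCL_CURRENT | MCL_FUTURE) == 0
```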