Re: [PERFORM] Final decision
-----Original Message-----
From: Josh Berkus [mailto:[EMAIL PROTECTED]]
Sent: 28 April 2005 04:09
To: Dave Page
Cc: Joshua D. Drake; Joel Fradkin; PostgreSQL Perform
Subject: Re: [PERFORM] Final decision

> Dave, folks,
>
>> Err, yes. But that's not quite the same as core telling us the current
>> driver is being replaced.
>
> Sorry, I spoke off the cuff. I also was unaware that work on the
> current driver had renewed. Us Core people are not omniscient, believe
> it or don't.

I was under the impression that you and Bruce negotiated the developer
time! Certainly you and I chatted about it on IRC once... Ahh, well.
Never mind.

> Mind you, having 2 different teams working on two different ODBC
> drivers is a problem for another list ...

Absolutely.

Regards, Dave.

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] Why is this system swapping?
On Apr 27, 2005, at 7:46 PM, Greg Stark wrote:

> In fact I think it's generally superior to having a layer like pgpool
> having to hand off all your database communication. Having to do an
> extra context switch to handle every database communication is crazy.

I suppose this depends on how many machines / how much traffic you have.
In one setup I run here I get away with 32 * 4 db connections instead of
500 * 4. Pretty simple to see the savings on the db machine.

(Yes, it is a bad design, as you said, where static and dynamic content
are served from the same box. However, it also saves money, since I
don't need machines sitting around serving up pixel.gif vs.
myBigApplication.cgi.)

--
Jeff Trout [EMAIL PROTECTED]
http://www.jefftrout.com/
http://www.stuarthamm.net/
Re: [PERFORM] Suggestions for a data-warehouse migration routine
Quoting Richard Rowell [EMAIL PROTECTED]:

> I've ported enough of my company's database to Postgres to make
> warehousing on PG a real possibility. I thought I would toss my data
> migration architecture ideas out for the list to shoot apart..
> [...]

Not much feedback required. Yes, dropping the entire database is faster
and simpler. If your database is small enough that you can rebuild it
from scratch every time, go for it.

Yes, VACUUM ANALYZE is required; creating indexes alone does not create
statistics.

I'd also dump an extract of pg_stat[io_]user_(tables|indexes)
periodically, to see how index usage and table load change over time.

--
"Dreams come true, not free." -- S. Sondheim, ITW
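What I have in mind for the extract is roughly this (a sketch assuming the standard statistics views with the stats collector enabled; adjust the columns to taste):

```sql
-- Periodic snapshot of table and index activity, to watch how index
-- usage and table load change between warehouse rebuilds.
SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
  FROM pg_stat_user_tables
 ORDER BY seq_scan DESC;

SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
  FROM pg_stat_user_indexes
 ORDER BY idx_scan;
```

Dumped to a file after each load, the deltas between snapshots tell you which indexes are earning their keep.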
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Quoting Josh Berkus josh@agliodbs.com:

>> Perhaps I can save you some time (yes, I have a degree in Math). If I
>> understand correctly, you're trying to extrapolate from the
>> correlation between a tiny sample and a larger sample. Introducing the
>> tiny sample into any decision can only produce a less accurate result
>> than just taking the larger sample on its own; GIGO. Whether they are
>> consistent with one another has no relationship to whether the larger
>> sample correlates with the whole population. You can think of the tiny
>> sample like anecdotal evidence for wonder-drugs.
>
> Actually, it's more to characterize how large of a sample we need. For
> example, if we sample 0.005 of disk pages, and get an estimate, and
> then sample another 0.005 of disk pages and get an estimate which is
> not even close to the first estimate, then we have an idea that this is
> a table which defies analysis based on small samples. Whereas if the
> two estimates are 1.0 stddev apart, we can have good confidence that
> the table is easily estimated. Note that this doesn't require
> progressively larger samples; any two samples would work.

We're sort of wandering away from the area where words are a good way to
describe the problem. Lacking a common scratchpad to work with, could I
suggest you talk to someone you consider has a background in stats, and
have them draw for you why this doesn't work?

About all you can get out of it is: if the two samples are disjunct by a
stddev, yes, you've demonstrated that the union of the two populations
has a larger stddev than either of them; but your two stddevs are less
info than the stddev of the whole. Breaking your sample into two (or
three, or four, ...) arbitrary pieces and looking at their stddevs just
doesn't tell you any more than what you started with.

--
"Dreams come true, not free." -- S. Sondheim, ITW
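Mischa's point shows up readily in a quick simulation: on a skewed column, two small samples can agree closely with each other and still both be wildly wrong about the number of distinct values. A minimal sketch (the column contents, sample sizes, and seed are all invented for illustration):

```python
import random

# Hypothetical skewed "column": 99% of rows share one value, the rest
# spread over 1,000 rare values -- the kind of table that defies
# n_distinct estimation from small samples.
population = [0] * 99_000 + [i % 1_000 + 1 for i in range(1_000)]
true_distinct = len(set(population))  # 1,001 distinct values

def sample_distinct(pop, n, rng):
    # Number of distinct values seen in a random sample of n rows.
    return len(set(rng.sample(pop, n)))

rng = random.Random(1)
d1 = sample_distinct(population, 1_000, rng)  # a 1% sample
d2 = sample_distinct(population, 1_000, rng)  # another 1% sample

# d1 and d2 typically land close to each other, yet both are roughly two
# orders of magnitude below the true 1,001 -- agreement between the two
# samples says nothing about accuracy against the whole population.
```

That is the GIGO point in miniature: the two small samples are mutually consistent but jointly uninformative.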
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
First I will comment on my original idea. Second, I will give another,
improved suggestion (an idea). I hope they will be useful for you. (I
don't know whether the first one was useful at all, because it showed
that I, and some others of us, are not very good with statistics. :( )

I haven't looked at the PostgreSQL code, so I don't know what is
possible now and what is not. I do know, even without looking at the
code, that a full table scan followed by incremental statistics updates
would be a very big change.

My original idea was the following: compare two equal-sized samples,
then redo the same thing with double-sized samples -- so, do lots of
unnecessary work -- and check the correlation of the two samples to try
to guess the distribution. I was trying to give you an idea, not a full
answer to the whole problem.

I did read some parts of the attached PDFs. They convinced me that the
heuristics for the hard cases would actually read almost the whole table
in many cases. I did cover the too-small-sample problem by stating that
the user should be able to give the minimum sample size; this way you
would avoid the too-small-sampling problem. My purpose was not to
achieve at most 5% wrong estimates, but to decrease the 2000% wrong
estimates that are sometimes seen now.

Conclusions:
- No heuristics or similar treatment of small samples will give
  excellent results.
- If you need excellent estimates, you need to process the whole table!
- Some special cases, like primary keys, unique indexes, and special
  column types, give easy ways to make estimates. For example, whether a
  boolean column has zero, one or two distinct values does not matter so
  much. Hashing seems the right choice for all of them.

If I have understood correctly, full table scans are out of the question
for large tables at this time. The percentage idea of taking 10% samples
seems good. So here is another suggestion:

1. Do a full percentage scan, starting at an arbitrary position. If the
user's data is not homogeneous, this hurts accuracy, but it is faster.
During that scan, try to figure out all those columns that have at most
100 distinct values. Of course, with this you can't reach 100% accuracy,
but if a full table scan is out of the question now, it is better if the
estimate is, for example, at most ten times wrong. You could also
improve accuracy by doing 20 pieces of 0.5% partial table scans instead
of one 10% partial table scan: this would improve accuracy a bit, but
keep the speed almost the same.

Here are questions for the statisticians, for the distinct-values
calculation:
- If we want at most 1000% tolerance, how big a percentage of a table's
  column must be processed?
- If we want at most 500% tolerance, how big a percentage must be
  processed?
- If we want at most 250% tolerance, how big a percentage must be
  processed?
It is fine to assume that there are at most 100 distinct values in the
table, if that helps the calculations.

If we try to get as much as possible from one discontinuous partial
table scan (a 0.1-10% sample), here is the information we can gather:

1. A histogram of at most 100 distinct values for every column.
2. The variance, the average, and the number of rows for those 100
   distinct values.
3. The number of rows that didn't match those 100 distinct values and
   were left out of the histogram.
4. A minimum and a maximum value for each column.

=> We can get exact information about the sample with one 0.1-10% pass
for many columns. What can you statisticians deduce from these values?

My idea is programmatic, combined with statistics:

+ Performance: scan, for example, 100 blocks each of size 100 MB,
  because disk I/O is much faster this way.
+ Enables a larger table percentage. I hope this helps with the
  statistics formula.
+ Required for more robust statistics: take those blocks at random
  (non-overlapping) places, to decrease the effect of hitting
  statistically bad parts of the table.
+ Fewer table-scan passes: scan all columns with limited hashing in the
  first pass.
+ All easy columns are handled with one pass.
+- Harder columns each need a pass of their own, but after the first
   pass we have some preliminary knowledge of them (minimum and maximum
   values and the histogram of the 100 distinct values).

Marko Ristola

Greg Stark wrote:

> Dave Held [EMAIL PROTECTED] writes:
>
>> Actually, it's more to characterize how large of a sample we need. For
>> example, if we sample 0.005 of disk pages, and get an estimate, and
>> then sample another 0.005 of disk pages and get an estimate which is
>> not even close to the first estimate, then we have an idea that this
>> is a table which defies analysis based on small samples.
>
> I buy that. Better yet is to use the entire sample you've gathered of
> .01 and [...]
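Marko's one-pass block-sampling idea above can be sketched in a few lines. This is only an illustration of the bookkeeping, with the post's parameters shrunk to row granularity (the block size, block count, and 100-value cap are placeholders):

```python
import random

BLOCK_ROWS = 100    # rows per contiguous block (stand-in for a 100 MB chunk)
N_BLOCKS = 20       # number of randomly placed, non-overlapping blocks
MAX_TRACKED = 100   # stop exact counting beyond this many distinct values

def block_sample_stats(column, rng):
    """One pass over randomly placed contiguous blocks: keep an exact
    histogram of up to MAX_TRACKED distinct values, count the rows that
    overflow it, and record the column's sampled min/max."""
    n_blocks_total = len(column) // BLOCK_ROWS
    starts = [b * BLOCK_ROWS                      # non-overlapping blocks
              for b in sorted(rng.sample(range(n_blocks_total), N_BLOCKS))]
    counts, overflow, lo, hi = {}, 0, None, None
    for s in starts:
        for v in column[s:s + BLOCK_ROWS]:
            lo = v if lo is None or v < lo else lo
            hi = v if hi is None or v > hi else hi
            if v in counts:
                counts[v] += 1
            elif len(counts) < MAX_TRACKED:
                counts[v] = 1
            else:
                overflow += 1  # row didn't fit the bounded histogram
    return counts, overflow, lo, hi

# An "easy" boolean-like column is fully characterized in the one pass:
column = [i % 2 for i in range(10_000)]
counts, overflow, lo, hi = block_sample_stats(column, random.Random(7))
```

A column is "easy" exactly when `overflow` stays at zero; a large overflow count is the signal that the column needs a pass of its own.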
[PERFORM] index on different types
Hi folks,

there's often some talk that indices cannot be used if datatypes don't
match.

On a larger (and long-grown) application I tend to use OID for
references on new tables, while old stuff is using integer. Is the
planner smart enough to see both as compatible datatypes, or is manual
casting required?

thx
--
Enrico Weigelt == metux IT service
phone: +49 36207 519931  www: http://www.metux.de/
fax: +49 36207 519932  email: [EMAIL PROTECTED]
Realtime Forex/Stock Exchange trading powered by PostgreSQL :))
http://www.fxignal.net/
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Mischa Sandberg wrote:

> Perhaps I can save you some time (yes, I have a degree in Math). If I
> understand correctly, you're trying to extrapolate from the correlation
> between a tiny sample and a larger sample. Introducing the tiny sample
> into any decision can only produce a less accurate result than just
> taking the larger sample on its own; GIGO. Whether they are consistent
> with one another has no relationship to whether the larger sample
> correlates with the whole population. You can think of the tiny sample
> like anecdotal evidence for wonder-drugs.

Ok, good point.

I'm with Tom, though, in being very wary of solutions that require even
one-off whole-table scans. Maybe we need an additional per-table
statistics setting which could specify the sample size, either as an
absolute number or as a percentage of the table. It certainly seems that
where D/N ~ 0.3, the estimates on very large tables at least are way,
way out.

Or maybe we need to support more than one estimation method. Or both ;-)

cheers

andrew
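A per-column, rather than per-table, version of that knob already exists: the statistics target, which scales the sample ANALYZE takes for that column. A hypothetical example (table and column names invented):

```sql
-- Raise the statistics target for one troublesome column (default is 10
-- in 8.0; larger targets mean larger ANALYZE samples for that column).
ALTER TABLE bigtable ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE bigtable;
```

The missing piece Andrew describes is being able to state the sample size directly, as rows or a percentage, instead of through the indirect target number.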
Re: [PERFORM] index on different types
On Fri, Apr 29, 2005 at 04:35:13AM +0200, Enrico Weigelt wrote:

> there's often some talk about indices cannot be used if datatypes dont
> match.

PostgreSQL 8.0 is smarter than previous versions in this respect: it'll
use an index if possible even when the types don't match.

> On a larger (and long time growed) application I tend to use OID for
> references on new tables while old stuff is using integer.

If you're using OIDs as primary keys then you might wish to reconsider.
See the caveats in the documentation and in the FAQ:

http://www.postgresql.org/docs/8.0/interactive/datatype-oid.html
http://www.postgresql.org/docs/faqs.FAQ.html#4.12

> Is the planner smart enough to see both as compatible datatype or is
> manual casting required ?

You can use EXPLAIN to see what the planner will do, but be aware that
the planner won't always use an index even if it could: if it thinks a
sequential scan would be faster then it won't use an index. To see if
using an index is possible, you could set enable_seqscan to off before
executing EXPLAIN.

In any case, a foreign key column probably ought to have the same type
as the column it references -- is there a reason for making them
different?

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/
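Concretely, the test Michael describes might look like this in psql (the table and column names are invented; `old_stuff.id` is assumed to be integer, with the literal cast to oid to mimic the cross-type comparison):

```sql
SET enable_seqscan = off;  -- make the planner prefer an index if it can
EXPLAIN SELECT * FROM old_stuff WHERE id = 42::oid;
SET enable_seqscan = on;
```

If the plan still shows a sequential scan with seqscans disabled, the index genuinely cannot be used for that comparison.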
Re: [PERFORM] Distinct-Sampling (Gibbons paper) for Postgres
Well, this guy has it nailed. He cites Flajolet and Martin, which was (I
thought) as good as you could get with only a reasonable amount of
memory per statistic. Unfortunately, their hash table is a one-shot
deal; there's no way to maintain it once the table changes. His
incremental update doesn't degrade as the table changes.

If there isn't the same patent wrangle as with the ARC algorithm, and if
the existing stats-collector process can stand the extra traffic, then
this one is a winner.

Many thanks to the person who posted this reference in the first place;
so sorry I canned your posting and can't recall your name.

Now, if we can come up with something better than the ARC algorithm ...
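For reference, the Flajolet-Martin baseline mentioned above estimates a distinct count from the maximum number of trailing zero bits among hashed values. A toy sketch of that one-shot scheme, not of Gibbons' maintainable variant (the hash choice and averaging scheme are illustrative):

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin bias-correction constant

def trailing_zeros(x):
    # Number of trailing zero bits in a 32-bit value (32 for x == 0).
    return 32 if x == 0 else (x & -x).bit_length() - 1

def fm_estimate(values, num_hashes=32):
    """Estimate the number of distinct values: for each of several salted
    hashes, remember the largest trailing-zero count seen, then average
    those exponents and apply the bias correction."""
    maxima = [0] * num_hashes
    for v in values:
        for i in range(num_hashes):
            digest = hashlib.md5(f"{i}:{v}".encode()).digest()
            h32 = int.from_bytes(digest[:4], "little")
            maxima[i] = max(maxima[i], trailing_zeros(h32))
    mean_r = sum(maxima) / num_hashes
    return (2 ** mean_r) / PHI

# Duplicates don't change the sketch, so the estimate depends only on the
# 1,000 distinct inputs here, not on the 3,000 rows fed in.
est = fm_estimate(list(range(1000)) * 3)
```

The memory cost is just `num_hashes` small integers per column, which is why it was the reasonable-memory benchmark; the one-shot limitation is exactly what the Gibbons paper's incremental update fixes.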
Re: [PERFORM] Distinct-Sampling (Gibbons paper) for Postgres
> Now, if we can come up with something better than the ARC algorithm ...

Tom already did. His clock-sweep patch is already in the 8.1 source.

--
Josh Berkus
Aglio Database Solutions
San Francisco