Extended stats - value not in MCV list
Hi there,

I've just started using extended stats because the planner was giving me terrible estimates for a certain table. MCV extended stats solved my problem when the values are in the extended MCV list, but estimates are still terrible when they are not. Limited as my knowledge of database internals is, I dug into the source code and found an important difference in how these non-MCV values are handled in the single-column and multi-column cases.

For single columns, the estimate is calculated as:

    selectivity = (1 - sum(MCV_frequencies)) / (distinct_values - count(MCV))

which seems to assume a uniform distribution of non-MCV values and looks like a sensible guess, at least to my amateur eyes.

For multi-column statistics, the estimate seems to be calculated instead as:

    selectivity = 1 - sum(MCV_frequencies)

which assumes the value could potentially be present in all the rows not covered by the MCV list. That is an adequate upper bound, but it is also quite conservative compared to the single-column estimate. In my specific case it yields a selectivity even higher than some of the least frequent MCV values, a condition the single-column estimate actually checks for and compensates. I have MCV and ndistinct extended stats, so I know the distinct_values statistic is available.

So I hope my question is clear from the above: how come the estimates are calculated with such different approaches? To be clear, I have no experience with database/query-planner development, so excuse me if I am overlooking some obvious conceptual difference between single-column and multi-column stats. The single-column estimate is described in the documentation, but the multi-column estimate is not; if there is indeed a good reason for this difference, I think it should be documented.

Thanks,
Pedro
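(For reference, the kind of setup being described looks roughly like this; the table, column, and statistics names are invented purely for illustration:

    -- two correlated columns, so per-column stats mislead the planner
    CREATE TABLE orders (region text, status text);

    -- extended statistics with both an MCV list and ndistinct counts
    CREATE STATISTICS orders_region_status (mcv, ndistinct)
        ON region, status FROM orders;

    ANALYZE orders;

    -- inspect the multivariate MCV list the planner will consult
    SELECT m.*
    FROM pg_statistic_ext s
         JOIN pg_statistic_ext_data d ON d.stxoid = s.oid,
         pg_mcv_list_items(d.stxdmcv) m
    WHERE s.stxname = 'orders_region_status';

The question is about estimates for values that do not appear in that MCV list.)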
Re: Access a newer Version of PGDB (v13) with an older libpq (v10 x86)
On 5/1/21 3:59 AM, Wolfgang Rißler wrote:
> This is my problem: I completely don't know how to start compiling my own current 32-bit libpq on Windows (and I would like to do it with VS 2019). For libpqxx there have been some hints on how to do so in the past, and now there is a complete project that deals with compiling it on Windows with VS and CMake. But I didn't find such hints for libpq or the whole PostgreSQL DB.

Have you looked at the instructions below?

https://www.postgresql.org/docs/current/install-windows.html

Thank you

--
Adrian Klaver
adrian.kla...@aklaver.com
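(The short version of that page for a 32-bit build, assuming the prerequisites it lists, such as ActiveState Perl, are installed; the source path and install directory here are only examples, so follow the page for the details:

    rem open the "x86 Native Tools Command Prompt for VS 2019" (x86 = 32-bit)
    cd postgresql-13.2\src\tools\msvc
    build
    install C:\pg13-x86

That builds the whole tree, libpq included, as 32-bit because of the x86 command prompt.)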
Re: Access a newer Version of PGDB (v13) with an older libpq (v10 x86)
On 30.04.2021 at 16:16, Tom Lane wrote:
> Wolfgang Rißler writes:
>> The problem is that our application (IDE MS Visual Studio, C++) has to be 32-bit, because of some old 32-bit DLLs which we can't kick out at the moment. So I compiled a libpqxx with the last 32-bit libpq (which is v10).
>
> Uh ... what's this about "last 32-bit libpq"?

OK, I meant the last ready-to-use packaged 32-bit libpq from EDB (~_0).

> I can believe that a particular packager (EDB, say) might not be shipping prebuilt 32-bit binaries anymore. But if you are in a position to compile your own libraries then you can certainly build any release you want as 32-bit.

This is my problem: I completely don't know how to start compiling my own current 32-bit libpq on Windows (and I would like to do it with VS 2019). For libpqxx there have been some hints on how to do so in the past, and now there is a complete project that deals with compiling it on Windows with VS and CMake. But I didn't find such hints for libpq or the whole PostgreSQL DB.

Or is there another provider who supplies v13 32-bit binary installers for Windows?

> I would recommend trying to use a reasonably late-vintage libpq; we do fix bugs in it on a regular basis. The common stumbling block for cross-version situations is that the client makes assumptions about system catalog contents that are not valid in some other server release. libpq proper doesn't really touch the catalogs, so it's mostly impervious to that problem; but you'll need to test your applications.

Of course we will. One thing is that we load and write byteas, and as I read there have been some changes. All other operations are less problematic.

Thank you

--
- may the source be with you -
Re: -1/0 virtualtransaction
In case this helps anyone else, I found a simple way to get a rough idea of what's going on, which is to run:

    select (select count(distinct virtualtransaction) from pg_locks) as tx_with_locks,
           (select count(*) from pg_stat_activity where state = 'active') as active_tx,
           (select count(*) from pg_locks where virtualtransaction = '-1/0') as summarized_locks;

I disabled the part of my application that seemed to be causing problems with too many writes (a background cleanup task) and then triggered it from a separate process. I can see the number of transactions with locks climbing when it hits a problematic item, while the number of active transactions (of course) stays low.

Mike

On Fri, Apr 30, 2021 at 4:53 PM Mike Beachy wrote:
> On Fri, Apr 30, 2021 at 7:12 AM Thomas Munro wrote:
> > But do you have lots of short overlapping transactions so that there
> > is never a moment where there are zero transactions running?
>
> Yeah, that almost certainly explains it.
>
> Thanks very much for the explanation about the summarized locks.
>
> > The number of SERIALIZABLEXACT objects is (max_connections +
> > max_prepared_transactions) * 10. So, you could try increasing
> > max_connections (without increasing the actual number of connections)
> > to see if you can get to a point where you don't see these invalid
> > virtual xids, and then maybe it'll be able to clean up locks more
> > aggressively.
>
> Aha! I hadn't considered that some parameter besides
> max_pred_locks_per_transaction would come into play. I'll give this a
> shot.
>
> Thanks,
> Mike
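(For anyone else poking at this, the settings Thomas mentioned and the amount of predicate-lock summarization can also be checked from SQL; a sketch:

    -- limits that size the internal SERIALIZABLEXACT / predicate-lock pools
    SHOW max_connections;
    SHOW max_prepared_transactions;
    SHOW max_pred_locks_per_transaction;

    -- SIRead (predicate) locks, and how many have already been
    -- summarized under the dummy -1/0 virtual transaction
    SELECT count(*) AS siread_locks,
           count(*) FILTER (WHERE virtualtransaction = '-1/0') AS summarized
    FROM pg_locks
    WHERE mode = 'SIReadLock';

If the summarized count keeps growing while connections stay modest, that matches the pool-exhaustion explanation above.)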
Re: database sorting algorithms.
Thanks a lot. I found this YouTube video (https://youtu.be/alJswNJ4P3U?t=1852), in case you guys are interested; it really clarifies the time complexity of merge sort.

On Sat, May 1, 2021 at 3:19 PM Gavan Schneider wrote:
> On 1 May 2021, at 17:06, Jian He wrote:
> > Been self-studying databases; from databases I deep-dived into sorting algorithms.
> > Databases can do an in-memory quicksort. They also have an on-disk merge sort.
> > For merge sort I followed this tutorial: https://youtu.be/6pV2IF0fgKY?t=1108 (around 1 minute only)
>
> Also check https://en.wikipedia.org/wiki/Merge_sort
>
> > But I am still not fully understanding n log n. I understand how many passes it will take, that is log n.
> > Yes, each pass will sort N elements.
> > But I still don't get what the N stands for in n*log n.
>
> So, answering the question…
> The ’n’ refers to the need to do something to each element at least once, so the sort time grows in simple proportion to the size of the list that needs to be sorted. Unfortunately that is not enough work to get the list sorted, so extra steps are needed. The log(n) indicates how many extra steps are needed. So the overall performance is proportional to the number of elements (N) multiplied by the log of the number of elements, viz. N * log(N)
>
> Regards
> Gavan Schneider
> ——
> Gavan Schneider, Sodwalls, NSW, Australia
> Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong.
> — H. L. Mencken, 1920
Re: database sorting algorithms.
On 1 May 2021, at 17:06, Jian He wrote:
> Been self-studying databases; from databases I deep-dived into sorting algorithms.
> Databases can do an in-memory quicksort. They also have an on-disk merge sort.
> For merge sort I followed this tutorial: https://youtu.be/6pV2IF0fgKY?t=1108 (around 1 minute only)

Also check https://en.wikipedia.org/wiki/Merge_sort

> But I am still not fully understanding n log n. I understand how many passes it will take, that is log n.
> Yes, each pass will sort N elements.
> But I still don't get what the N stands for in n*log n.

So, answering the question…
The ’n’ refers to the need to do something to each element at least once, so the sort time grows in simple proportion to the size of the list that needs to be sorted. Unfortunately that is not enough work to get the list sorted, so extra steps are needed. The log(n) indicates how many extra steps are needed. So the overall performance is proportional to the number of elements (N) multiplied by the log of the number of elements, viz. N * log(N)

Regards
Gavan Schneider
——
Gavan Schneider, Sodwalls, NSW, Australia
Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong.
— H. L. Mencken, 1920
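(A slightly more formal way to see the same thing is the standard merge sort recurrence: every level of the recursion touches all n elements once while merging, and halving down to single elements gives about log2(n) levels. With c as the per-element merge cost:

    T(n) = 2*T(n/2) + c*n,    T(1) = c
         = c*n + c*n + ... + c*n     (one c*n term per level, ~log2(n) levels)
         ≈ c * n * log2(n)           i.e. O(n log n)

As a concrete number: for n = 1,000,000 that is about 20 passes of a million element reads each, roughly 2*10^7 operations, versus the ~10^12 comparisons a quadratic sort would need.)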
Re: database sorting algorithms.
Jian He:

On Sat, May 1, 2021 at 9:07 AM Jian He wrote:
> Been self-studying databases; from databases I deep-dived into sorting algorithms.

Pick a good book, because that is a hairy topic with a lot of math and other prior knowledge required.

> Databases can do an in-memory quicksort. They also have an on-disk merge sort.

If you ignore top-N and other stuff and look at just full sorts, databases use two types of sorting: internal sorts, which are the ones used when your data fits in RAM, and external sorts, when they have to spill to tape/disk/whatever.

For internal sorts you can use quicksort, heapsort, merge sort, even insertion sort. Quicksort with a fallback to insertion sort is a popular one (quicksort and merge sort are divide-and-conquer algorithms, but once they have divided enough it is easier to fall back to a simpler algorithm).

When the data no longer fits in RAM you need to do an external sort. The classic way to do this is to sort chunks as big as will fit in RAM, write the presorted chunks to disk, and then merge those chunks. This is called merge sort too, and is similar to an in-memory merge sort, but not the same (when I've done it I used a two-way merge for in-memory merge sorts but an N-way merge for disk sorts; current memories are so big that I have never needed more than one merge pass since the tape era).

> For merge sort I followed this tutorial: https://youtu.be/6pV2IF0fgKY?t=1108 (around 1 minute only)
> But I am still not fully understanding n log n. I understand how many passes it will take, that is log n.
> Yes, each pass will sort N elements.
> But I still don't get what the N stands for in n*log n.

Basically you do log N passes over N elements.

> Why does each pass take n time to sort it?

I have no time to look at the video, but basic binary merge sort is N log N because you have log N passes and each one must merge N elements.

Imagine you do a pure merge sort of 8 elements. First you split them into 4+4 elements, recurse to sort them, and merge them. Merging is an N-complexity operation because you need to read all 8 elements. The 4-element halves are split into 2+2, same thing: each merge needs four reads, but you have two merges, so eight reads again. The 2-element pieces are split into 1+1, two reads per merge, but you have four chunks, so eight again. Total: 3 passes (log2(8)) of eight reads each. Real sorts are more complex, but the N is always present, as you always need to at least read the full input set to sort it.

Note that in an external sort you may have N elements, of which only M fit in RAM. You do C = N/M chunks; these need C * M*log(M) = N*log(M) time to be sorted and C * M = N time to be written to and read from disk. But to merge C chunks you basically do log(C) comparisons per element, so add N*log(C) = N*(log(N) - log(M)). If you add appropriate constants and sum it all, you'll find the final result is O(N log N).

Francisco Olarte.
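(You can actually watch PostgreSQL pick between the two strategies Francisco describes. A rough experiment; the table, row count and work_mem values are only illustrative:

    -- a throwaway table big enough to overflow a tiny work_mem
    CREATE TABLE sort_demo AS
        SELECT random() AS x FROM generate_series(1, 1000000);

    SET work_mem = '64kB';
    EXPLAIN (ANALYZE, COSTS OFF) SELECT x FROM sort_demo ORDER BY x;
    -- expect something like:  Sort Method: external merge  Disk: ...kB

    SET work_mem = '256MB';
    EXPLAIN (ANALYZE, COSTS OFF) SELECT x FROM sort_demo ORDER BY x;
    -- expect something like:  Sort Method: quicksort  Memory: ...kB

The "Sort Method" line in the ANALYZE output shows whether the sort stayed in memory (quicksort) or spilled and merged sorted runs on disk (external merge).)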
database sorting algorithms.
I've been self-studying databases, and from databases I deep-dived into sorting algorithms.

Databases can do an in-memory quicksort. They also have an on-disk merge sort.

For merge sort I followed this tutorial: https://youtu.be/6pV2IF0fgKY?t=1108 (around 1 minute only)

But I am still not fully understanding n log n. I understand how many passes it will take; that is the log n part. Yes, each pass will sort N elements. But I still don't get what the N stands for in n*log n. Why does each pass take n time to sort?