Re: [HACKERS] A worst case for qsort
On Thu, Aug 7, 2014 at 11:07 AM, Robert Haas robertmh...@gmail.com wrote:
> On Tue, Aug 5, 2014 at 8:15 PM, Peter Geoghegan p...@heroku.com wrote:
>> The adversarial method works for almost any polymorphic program
>> recognizable as quicksort. The subject quicksort may copy values at
>> will, or work with lists rather than arrays. It may even pick the
>> pivot at random. The quicksort will be vulnerable provided only that
>> it satisfies some mild assumptions that are met by every
>> implementation I have seen.
>>
>> IMHO, while worst case performance is a very useful tool for analyzing
>> algorithms (particularly their worst case time complexity), a worst
>> case should be put in its practical context. For example, if we had
>> reason to be concerned about *adversarial* inputs, I think that there
>> is a good chance that our qsort() actually would be problematic to the
>> point of driving us to prefer some generally slower alternative.

If one is concerned about adversarial inputs, then as mentioned earlier, two main steps need to be taken:

1. Don't permit potential adversaries to define the comparison function used by qsort().
2. Select the pivot at random, to defeat precomputed lists of data that invoke quadratic behavior in qsort().

Before arguing that a random pivot will be less efficient than the median-of-medians selection that's currently being used, you need to look at what is currently being done:

1. If a small partition is being sorted, the element at n/2 is selected as the pivot, with n being the size of the partition.
2. If a medium partition is being sorted, the pivot selected is the median of the elements at 0, n/2, and n.
3. And finally, if a large partition is being sorted, potential pivots are selected from 0, n/8, n/4, 3n/8, n/2, 5n/8, 3n/4, 7n/8, and n. Of those 9 candidates, the median of medians is selected as the pivot.

There is nothing in the world that would prevent the following:

1. Short partition - select a pivot at random.
2. Medium partition - select the median of 3 randomly selected candidates.
3. Large partition - select the median of medians of 9 randomly selected candidates.

Using the above random pivot selection methods, the performance of a hardened qsort should be close to that of an unhardened qsort. The only real difference is the cost of generating the random numbers used to select the candidates for the median tests. If a malicious compare function is permitted to be defined, the sort would still be vulnerable to that attack, but it would be pretty much immune to a precomputed-data attack.

Or perhaps it would be simpler to detect an attack and fall back to a sort algorithm that doesn't have quadratic behavior. What I mean by that is this: the qsort in "Engineering a Sort Function" performs approximately 1.1 n ln n comparisons. Suppose that upon starting, qsort computes an upper bound on the number of comparisons it's willing to perform - say 3 n ln n or so (1.386 n log n is the estimate for a randomly selected pivot). If the upper bound is reached, the qsort function backs out, leaving the array in partially sorted order, and calls a slower sort that doesn't exhibit quadratic behavior to actually finish the job.

The result of doing that would be most, if not all, sorts being handled by qsort in a timely fashion. But if bad luck or an adversary strikes, qsort bails out once things look bad and passes the task to a different sort. I'll be the first to admit that the hybrid approach would be slower in those bad cases than it would be if the alternate sort were selected at the beginning, but it would be faster than letting qsort continue with quadratic behavior.

-- 
There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies. — C.A.R. Hoare
Re: [HACKERS] A worst case for qsort
I just browsed the paper linked by Peter, and it looks like the attack has to be active against a currently executing qsort. In the paper, the comparison function is supplied by the attacker and effectively lies about the result of each comparison. It keeps the lies consistent in a very specific manner, so that eventually qsort returns its input in a properly sorted fashion.

So it seems to me that the vulnerability only exists if an attacker-supplied comparison function is permitted. For all other cases, assuming that only vetted comparison functions are permitted, the selection of a random pivot would make an attack on qsort using specially tailored input data improbable.
[HACKERS] Looked at TODO: Considering improving performance of computing CHAR() value lengths
Greetings,

I took a look at the TODO list and got interested in the possible optimization of the bcTruelen() function. I read the archived messages about that subject and decided to see what could be done. I tested the performance of 5 different versions of bcTruelen():

1. The code as it exists in postgresql currently.
2. The code with the patch described in the above-mentioned messages.
3. A modification of the code mentioned in 2 above, using the concept of a sentinel.
4. A modification of the code mentioned in 1 above, using a sentinel.
5. A modification of the code mentioned in 4, using knowledge of 1B and 4B headers.

First, let me describe a sentinel. The key thing about using a sentinel is that with one, you can guarantee loop termination without having to check an index. What you do is place the value that you're searching for one past the limit of the range you're searching. By doing this, you can optimize your loop to check just 1 condition instead of 2 (search value only, instead of search value and index limit). This means that your loop now needs to execute just 3 opcodes (pointer decrement, search compare, conditional jump) instead of 5 (pointer decrement, search compare, conditional jump, limit compare, conditional jump). In the case of bcTruelen(), the sentinel value is anything that's non-space.

I ran all 5 versions of bcTruelen() in a benchmark program that simply allocated a gig of memory, populated the memory with varlena structures of varying length, alignment, and trailing spaces, and then called the different versions of bcTruelen() in a pseudo-random sequence to try to prevent processor caching from affecting the results. Overall, the results were as follows:

1. For small numbers of trailing spaces, the 4-at-a-time approach was a consistent loser in terms of performance. If there's a large number of trailing spaces, however, the performance of the 4-at-a-time approach could be quite impressive.
2. The best typical performer was the single-byte-at-a-time approach using a sentinel and knowledge of the types of headers. For that routine, the performance compared against the current version of bcTruelen() was as follows:

   CHAR(1) with only spaces: current version wins; new version loses by about 20%.
   CHAR(1) with non-space: new version wins by about 10%.
   CHAR(2) with only spaces: new version loses by about 4%.
   CHAR(2) with one space: new version wins by about 15%.
   CHAR(2) with no spaces: new version wins by about 10%.

   Once CHAR(4) was reached, the new version was faster for all numbers of trailing spaces.

I suspect the reason the new routine is slower for short CHAR types with all spaces is that using the sentinel results in one extra loop iteration over the code that checks the limit each iteration. But once the number of iterations gets large enough, the shorter loop wins even though it iterates one extra time. The tests were performed on an Intel Core 2 Duo.

The code effectively has 3 identical loops with different surroundings:

1. If the arg is pointing to a 1B-style header, there is no need to worry about a sentinel. No 1B header matches a space character, whether the target system is big or little endian.
2. For those cases where the arg is pointing to a 4B-style header, you do have to worry about whether a good sentinel already exists. For a little-endian machine, a problem arises if you're dealing with CHAR(134217728) through CHAR(138412031) - unlikely, to be sure. But for a big-endian machine, CHAR(32), CHAR(288), and other values can cause problems. So for 4B-style headers, a check is made to see if the sentinel location already holds a ' '. If it does, it is replaced with a zero, the scan is performed, and the sentinel location is then restored to the original space.
It is of course possible to unconditionally save the value at the sentinel location, stuff in a zero, and restore the original value afterwards, but in doing so, more overhead is incurred on every call, slowing down the function.

If you wish me to submit a patch, I can do so easily. Code is as follows:

/*
 * True length (not counting trailing blanks) of a BpChar
 */
int
bcTruelen(BpChar *arg)
{
    char       *s;
    char       *p;

    if (VARATT_IS_1B(arg))
    {
        s = (void *) arg;
        p = s + VARSIZE_1B(arg);
        do
            --p;
        while (*p == ' ');
    }
    else
    {
        s = VARDATA_4B(arg);
        p = s + VARSIZE_4B(arg) - VARHDRSZ;
        --s;                    /* Point to where sentinel will be */
        if (*s == ' ')
        {
            *s = 0;             /* Stuff in a non-blank sentinel */
            do
                --p;
            while (*p == ' ');
            *s = ' ';           /* Restore the blank */
        }
        else
        {
            do
                --p;
            while (*p == ' ');
        }
    }
    return p - s;               /* Return length */
}
Re: [HACKERS] Proposal for updating src/timezone
Agreed. Right now I'm looking at updating zic.c to match the IANA code, combined with the modifications that postgres made to it. So far it doesn't look like many functional changes have been made, but due to the use of pgindent there are a LOT of cosmetic changes that add one heck of a lot of noise when determining the actual differences. After I get the code closely matching the IANA code, I'll submit a patch (unfortunately it will pretty much be an entire replacement of zic.c, due to the massive changes IANA made between 2010c and 2014e). I will request very strongly that pgindent not be run on the IANA code again, so that future maintainers can easily diff the IANA code against the postgres code to determine the actual differences. I'll then see about doing the same with the other source files in timezone.

On Fri, Jul 18, 2014 at 4:27 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> John Cochran j69coch...@gmail.com writes:
>> Did a diff between the 2010c version of localtime.c and the postgres
>> version and saw a lot more differences than what could have been
>> expected from simple reformatting and adaptation. So I installed gitk
>> and took a look at the change history of localtime.c. Well, looking at
>> the results, it pretty much put paid to the concept of ever importing
>> the IANA code unaltered into the postgres tree. In fact, it looks like
>> the original version of localtime.c was based upon one of the 2004a
>> thru 2004c versions of the IANA code and since then has been
>> independently maintained.
>
> That's certainly true, but I'm not sure that we should give up all that
> easily on getting closer to the IANA code again. For one thing, some of
> the thrashing had to do with the fact that we went over to 64-bit
> timestamps sooner than the IANA crew did. I'm assuming that they insist
> on 64-bit arithmetic now; we do too, so for instance some of the typedef
> hacking might no longer be necessary.
>
> AFAICT from a quick scan of the git logs, the main thing we've done to
> the API that's not in IANA is invent pg_next_dst_boundary(). Perhaps
> that could be pushed into a separate source file. Even if we only got
> to a point where the sources were the same except for a few narrow
> patches, it would make tracking their updates much easier.
>
> regards, tom lane
Re: [HACKERS] Proposal for updating src/timezone
On Sat, Jul 19, 2014 at 11:58 AM, Michael Banck mba...@gmx.net wrote:
> SNIP
> Maybe if you pgindent the IANA code as well, you can more easily diff
> the actual changes between the two, did you try that?
>
> Michael

Unfortunately, pgindent doesn't work well with the IANA code, as evidenced by some previous checkins. To quote a checkin made 2004-05-21 16:59:10 by Tom Lane:

  pgindent did a pretty awful job on the timezone code, particularly with
  respect to doubly-starred comment blocks. Do some manual cleanup.

As it is, I've finished checking the differences between the postgres and IANA code for zic.c, after editing both to eliminate non-functional style differences such as indentation, function prototypes, comparing strchr results against NULL or 0, etc. It looks like the only differences from the stock code are support for the -P option implemented by Tom Lane, the substitution of the postgres getopt for the unix default, a few minor changes in some defaults, and minor modifications to deal with Windows. Overall, rather small and trivial.

Additionally, I discovered a little piece of low-hanging fruit. The Makefile for the timezone code links together zic.o, ialloc.o, scheck.o, and localtime.o in order to make zic, and zic.c has a stub implementation of pg_open_tzfile() in order to resolve a linkage issue with the localtime.o module. Well, as it turns out, localtime.o doesn't supply ANY symbols that zic needs, and therefore it can be omitted entirely from the list of object files comprising zic. That in turn means the stub implementation of pg_open_tzfile() can be removed from the postgres version of zic.c. I'm not bothering to submit a patch for this, since such a patch would be quite short-lived given my objective of bringing the code up to date with the 2014e version of the IANA code. But I am submitting a bug report to IANA about the Makefile, since the unneeded linkage with localtime.o is still present in the 2014e code on their site.
[HACKERS] Proposal for updating src/timezone
Given my recent examination of the src/timezone subtree and the current code at IANA for timezone information and tools, I'm proposing the following changes and would like to first get group consensus prior to investing any major effort.

My proposal is to have the following directory structure:

  .../src/timezone          - would have only PostgreSQL-specific code
  .../src/timezone/tznames  - would be unaltered from current (optionally this could be removed)
  .../src/timezone/iana     - would contain unaltered code and data from IANA

The .../src/timezone/data directory would be removed. In doing this, any future updates to the IANA code should be trivial to incorporate into the postgres tree.

Reasons for the change:

1. The current timezone code is based upon, according to the comments, version 2010c from IANA. IANA's latest code is version 2014e, and there are substantial changes between those versions (as well as 35 intermediate versions ... yes, IANA changes and updates the code regularly and frequently).
2. The current timezone data has a few entries that are not properly handled by any pre-2013 code, so an update of some sort is needed anyway.
3. The latest code from IANA compiles with GCC using a kitchen-sink array of warning options without generating any warnings. This should alleviate any concerns about not accepting code unless it's warning-free. In case anyone is interested, the GCC options are:

  -Dlint -g3 -O3 -fno-common -fstrict-aliasing -Wall -Wextra
  -Wbad-function-cast -Wcast-align -Wcast-qual -Wformat=2 -Winit-self
  -Wmissing-declarations -Wmissing-noreturn -Wmissing-prototypes
  -Wnested-externs -Wno-address -Wno-cast-qual -Wno-format-nonliteral
  -Wno-sign-compare -Wno-sign-conversion -Wno-type-limits
  -Wno-unused-parameter -Woverlength-strings -Wpointer-arith -Wshadow
  -Wstrict-prototypes -Wsuggest-attribute=const
  -Wsuggest-attribute=noreturn -Wsuggest-attribute=pure -Wtrampolines
  -Wwrite-strings

Aspects of the change that I'm not happy with:

1. I would have liked to recommend 2 sub-directories underneath .../src/timezone/iana named code and data, but unfortunately have to suggest untarring both the code and data files directly into the iana directory. The reason for this is that the IANA code does not compile using the IANA Makefile unless the script yearistype.sh is present, and that script is currently in the data tar file, not the code tar file. That unfortunately means that splitting the IANA code and data into different directories would violate the objective of being able to deposit the files unaltered into postgres.
2. Depositing the IANA code unaltered would also deposit some junk files into the directory, mainly man pages and html files containing documentation for the timezone code. The extra files would consume approximately 500 kilobytes above what's actually needed, but otherwise wouldn't have any adverse effects.

Thank you for reading this proposal,
John Cochran
Re: [HACKERS] Proposal for updating src/timezone
On Fri, Jul 18, 2014 at 1:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> John Cochran j69coch...@gmail.com writes:
>> My proposal is the have the following directory structure ...
>> 1. I would have liked to recommend 2 sub-directories underneath ...
>
> I have exactly zero expectation of using their Makefile, so this is not
> a concern.
>
>> 2. Depositing the IANA code unaltered would also deposit some junk
>> files ...
>
> We're not doing that either, IMO. Why should we invest 500K of tarball
> space on that?

Understandable, and both issues are in the category of "things I didn't like."

Did a diff between the 2010c version of localtime.c and the postgres version and saw a lot more differences than could be expected from simple reformatting and adaptation. So I installed gitk and took a look at the change history of localtime.c. Well, looking at the results, it pretty much put paid to the concept of ever importing the IANA code unaltered into the postgres tree. In fact, it looks like the original version of localtime.c was based upon one of the 2004a thru 2004c versions of the IANA code and since then has been independently maintained.

In any case, I think my best path forward is to look at the change history of the source files under src/timezone and determine the actual origin of each file based upon the git history and the IANA archive, then see what's needed to bring the code up to date with the IANA code.

Due to the configure option --with-system-tzdata, I assume that postgres needs to remain compatible with the output of the IANA zic program. That, combined with the warning messages about some of the entries in the current timezone data not being compatible with pre-2013 code, worries me a bit, since the 2010c code from IANA does not support the output of the 2014e zic.
[HACKERS] Question about src/timezone/zic.c
I've recently found some time on my hands and have decided to see about contributing to the PostgreSQL project, so I've started browsing the code to get somewhat familiar with it. The file src/timezone/zic.c caused me to raise my eyebrows a bit, and I have a question to ask because of this.

I first noticed some loops dealing with time that I suspect could be replaced with straight-line code calling the julian date conversions. But in attempting to understand exactly what the loops were doing, I saw some calls to eitol(), which is the function that I'm really questioning. The code for eitol() is as follows:

static long
eitol(int i)
{
    long    l;

    l = i;
    if ((i < 0 && l >= 0) ||
        (i == 0 && l != 0) ||
        (i > 0 && l <= 0))
    {
        (void) fprintf(stderr,
            _("%s: %d did not sign extend correctly\n"),
            progname, i);
        exit(EXIT_FAILURE);
    }
    return l;
}

As you can see, all that function does is perform a sign extension of a signed int to a signed long. Additionally, it has a fair amount of code to verify that the sign extension was properly done. Since the sign extension is actually performed via the simple l = i; statement at the beginning of the function, it's rather obvious that we're relying upon the compiler to perform the sign extension. And since this function doesn't have any recovery path if an invalid extension is detected, it means that if a faulty compiler is encountered, PostgreSQL will always fail to operate the instant any timezone computation is performed.

This raises the question: does anyone on this mailing list know of any C compiler that fails to properly sign extend an assignment of an int to a long? Or to put it another way, are there any C compilers that fail to properly perform the integer conversion from int to long?

As things stand, it looks to me like the function eitol() can simply be deleted and the 22 calls to it removed. Shorter, simpler, faster code is always a good thing, after all.

Thank you for reading,
John Cochran
Re: [HACKERS] regression failure - horology
In article [EMAIL PROTECTED], Tom Lane [EMAIL PROTECTED] wrote:
> Joe Conway [EMAIL PROTECTED] writes:
>> Tom Lane wrote:
>>> Hm, I just had regression tests pass earlier this evening on RHL 8.0
>>> (also HPUX 10.20). Are you using default config, or
>>> --enable-integer-datetimes?
>> I'm using --enable-integer-datetimes on both.

It looks like the regression checks need to use two different formats depending on whether integer timestamps are being used. For the normal floating-point timestamps, the regression checks are correct, since all timestamps from 4713 BC to 5874897 AD are representable. However, using a 64-bit integer representing the number of microseconds since Jan 1, 2000 gives a reduced range of 4713 BC to about 294277 AD.

In order to represent the full range of timestamps for julian day counts from 0 to 2147483647 to the nearest microsecond, you require:

  31 bits for the day
  17 bits for the second
  20 bits for the microsecond

for a total of 68 bits needed to represent the full range of timestamps to the nearest microsecond.

Suggested fix would be one or more of the following:

1. Reduce the range tested within the regression checks to what is representable with integer timestamps.
2. Use two different regression checks depending on whether integer timestamps are in use.
3. Document the different ranges for timestamps depending on whether integer timestamps are in use.

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/users-lounge/docs/faq.html