Re: [HACKERS] A worst case for qsort

2014-08-07 Thread John Cochran
On Thu, Aug 7, 2014 at 11:07 AM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Aug 5, 2014 at 8:15 PM, Peter Geoghegan p...@heroku.com wrote:
  The adversarial method works for almost any polymorphic program
  recognizable as quicksort. The subject quicksort may copy values at
  will, or work with lists rather than arrays. It may even pick the
  pivot at random. The quicksort will be vulnerable provided only that
  it satisfies some mild assumptions that are met by every
  implementation I have seen.
 
  IMHO, while worst case performance is a very useful tool for analyzing
  algorithms (particularly their worst case time complexity), a worst
  case should be put in its practical context. For example, if we had
  reason to be concerned about *adversarial* inputs, I think that there
  is a good chance that our qsort() actually would be problematic to the
  point of driving us to prefer some generally slower alternative.


If one is concerned about adversarial inputs, then as mentioned earlier,
two main steps need to be taken.
1. Don't permit potential adversaries define the comparison function used
by qsort().
2. Perform random selection of pivot to prevent precomputed lists of data
that invoke quadratic behavior in qsort.

And before the argument that a random pivot will be less efficient than the
median of median that's currently being used, you need to look at what is
currently being used.
1. If a small partition is being sorted, the element at n/2 is selected
as the pivot with n being the size of the partition.
2. If a medium partition is being sorted, then pivot selected is the
median of the elements at 0, n/2, and n.
3. And finally, if a large partition is being sorted, potential pivots
are selected from 0, n/8, n/4, 3n/8, n/2, 5n/8, 3n/4, 7n/8, and n.  Of
those 9 candidates, the median of medians is selected as the pivot.

There is nothing in the world that would prevent the following.
1. Short partition - select a pivot at random.
2. Medium partition - select the median of 3 randomly selected candidates.
3. Large partition - select the median of medians of 9 randomly selected
candidates.

Using the above random pivot selection methods, the performance of a
hardened qsort should be close to that of an unhardened qsort. The only
real difference is the cost of generating random numbers to select the
candidates for the median tests. However, if an malicious compare function
is permitted to be defined, it would still be vulnerable to that attack,
but it would be pretty much immune to a precomputed data attack.

Or perhaps it just might be simpler to detect an attack and fall back to a
sort algorithm that doesn't have quadratic behavior.

What I mean by that is the qsort in the Engineering a Sort Function has a
big O of approximately 1.1 n ln n comparisons. If upon starting qsort, it
computes an upper bound of comparisons it's willing to perform. Say 3 n ln
n or so (1.386 n log n is the estimate for a randomly selected pivot). Then
if the upper bound is reached, the qsort function backs out leaving the
array in a partially sorted order, and then calls a slower sort that
doesn't exhibit quadratic behavior to actually finish the sort.

The result of doing that would be most, if not all, sorts being handled by
qsort in a timely fashion. But if bad luck or an adversary strikes, the
qsort bails out after things look bad and passes the task to a different
sort. I'll be the first to admit that the hybrid approach would be slower
in those bad cases than it would be if the alternate sort was selected at
the beginning, but it would be faster than letting qsort continue with
quadratic behavior.

-- 

There are two ways of constructing a software design: One way is to make it
so simple that there are obviously no deficiencies and the other way is to
make it so complicated that there are no obvious deficiencies. — C.A.R.
Hoare


Re: [HACKERS] A worst case for qsort

2014-08-06 Thread John Cochran
I just browsed the paper linked by Peter and it looks like the attack has
to be active against a currently executing qsort. In the paper, what
happens is the comparison function is supplied by the attacker and
effectively lies about the result of a comparison. It keeps the lies
consistent in a very specific manner so that eventually qsort returns its
input in a properly sorted fashion.

So it seems to me that the vulnerability only exits if an attacker supplied
comparison function is permitted. For all other cases, assuming that only
vetted comparison functions are permitted, the selection of a random pivot
would make an attack on qsort using specially tailored input data
improbable.


[HACKERS] Looked at TODO:Considering improving performance of computing CHAR() value lengths

2014-08-02 Thread John Cochran
Greetings,
I took at look at the TODO list and got interested in the possible
optimization of the bcTruelen() function. Read the archived messages about
that subject and decided to see what could be done.

I tested the performance of 5 different versions of bcTruelen().
1. The code as it exists in postgresql currently.
2. The code with the patch described in the above mentioned messages.
3. A modification of the code mentioned in 2 above using the concept of a
sentinel.
4. A modification of the code mentioned in 1 above using a sentinel.
5. A modification of the code mentioned in 4 using knowledge of 1B and 4B
headers.

First, let me describe a sentinel. The key thing about using a sentinel is
that with on, you can guarantee loop termination without having to check
for an index. What you do is place a value that you're searching for 1 past
the limit of the range you're searching for. By doing this, you can
optimize your loop to have just 1 condition to check for instead of 2
(search value and index limit). This mean that your loop now needs to just
execute 3 opcodes (pointer decrement, search compare, conditional jump)
instead of 5 (pointer decrement, search compare, conditional jump, limit
compare, conditional jump). In the case of bcTruelen(), the sentinel value
is anything that's non-space.

I ran all 5 different versions of bcTruelen() in a benchmark program that
simply allocated a gig of memory, populated the memory with varlena
structures of varying length, alignment, and trailing spaces. Then called
the different versions of bcTruelen() using a pseudo random sequence to try
and prevent processor caching from affecting the values.

Overall, the results were as follows:
1. For small numbers of trailing space, the 4 at a time approach was a
consistent loser in terms of performance. If there's a large number of
trailing spaces, the performance of the 4 at a time approach could be
quite impressive however.
2. The best typical performer was the single byte at a time approach using
a sentinel and knowledge of the types of headers. For that routine, the
performance compared against the current version of bcTruelen() was as
follows
CHAR(1) with only spaces. Current version wins. New version loses by about
20%
CHAR(1) with non-space. New version wins by about 10%.
CHAR(2) with only spaces. New version loses by about 4%.
CHAR(2) with one space. New version wins by about 15%.
CHAR(2) with no spaces. New version wins by about 10%

Once CHAR(4) was reached, the new version was faster for all numbers of
trailing spaces. I suspect the reason for the new routine being slower for
short CHAR types with all spaces is because using the sentinel results in
an extra loop iteration over the code that checks for limits each
iteration. But once the number of iterations gets large enough, the shorter
loop wins even though it iterated one extra time. The tests were performed
on an Intel Core2 duo.

The code effectively has 3 identical loops with different surroundings.
1. If the arg is pointing to a 1B style header, then there is no need to
worry about a sentinel. No 1B header matches a space character whether the
target system is big or little endian.
2. For those cases where the arg is pointing to 4B style header, you do
have to worry about a good sentinel existing. For a little endian machine,
such an issue happens if you're dealing with a CHAR(134217728) through
CHAR(138412031). Unlikely to be sure. But for a big endian machine,
CHAR(32), CHAR(288) and other values can cause problems. So for 4B style
headers, a check is made to see if the sentinel location has a value of '
'. If it does, it is replaced with a zero, the scan is then performed, and
then the sentinel is then restored with the original space. It is of course
possible to unconditionally save the value at the sentinel location, stuff
in a zero, then restore the original value, but in doing so, more overhead
in incurred every call, slowing down the function.

If you wish me to submit a patch, I can do so easily.

Code is as follows:
/*
 * True length (not counting trailing blanks) of a BpChar
 */
int
bcTruelen(BpChar *arg)
{
char*s;
char*p;

if (VARATT_IS_1B(arg))
{
s = (void *)arg;
p = s + VARSIZE_1B(arg);
do
--p;
while(*p == ' ');
}
else
{
s = VARDATA_4B(arg);
p = s + VARSIZE_4B(arg)-VARHDRSZ;
--s;/* Point to where sentinel will be */

if (*s == ' ') {
*s = 0; /* Stuff in a non-blank sentinel */
do
--p;
while(*p == ' ');
*s = ' ';   /* Replace blank */
} else {
do
--p;
while(*p == ' ');
}
}

return p-s; /* Return length */
}


-- 

There are two ways of constructing a software design: One way is to make it
so simple that there are obviously no deficiencies and the 

Re: [HACKERS] Proposal for updating src/timezone

2014-07-19 Thread John Cochran
Agreed. Right now, I'm seeing about updating zic.c to match the IANA code
combined with the modifications that postgres did to it. So far, it doesn't
look like many functional changes have been done, but due to the use of
pgindent, there's a LOT of cosmetic changes that add one heck of a lot of
noise in determining the actual differences. After I get the code closely
matching the IANA code, I'll submit a patch (unfortunately it will pretty
much be an entire replacement of zic.c due to the massive changes IANA did
between 2010c and 2014e). I will request very strongly that pgindent not be
used again on the IANA code so future maintainers can easily perform a diff
between the IANA code and the postgres code to determine the actual
differences. I'll then see about doing the same with the other source files
in timezone.




On Fri, Jul 18, 2014 at 4:27 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 John Cochran j69coch...@gmail.com writes:
  Did a diff between the 2010c version of localtime.c and the postgres
  version and saw a lot more differences than what could have been expected
  from simple reformatting and adaptation. So I installed gitk and took a
  look at the change history of localtime.c

  Well, looking at the results, it pretty much put paid to the concept of
  ever importing the IANA code unaltered into the postgres tree. In fact,
 it
  looks like the original version of localtime.c was based upon one of the
  2004a thru 2004c versions of the IANA code and since then has been
  independently maintained.

 That's certainly true, but I'm not sure that we should give up all that
 easily on getting closer to the IANA code again.  For one thing, some of
 the thrashing had to do with the fact that we went over to 64-bit
 timestamps sooner than the IANA crew did.  I'm assuming that they insist
 on 64-bit arithmetic now; we do too, so for instance some of the typedef
 hacking might no longer be necessary.

 AFAICT from a quick scan of the git logs, the main thing we've done to the
 API that's not in IANA is invent pg_next_dst_boundary().  Perhaps that
 could be pushed into a separate source file.

 Even if we only got to a point where the sources were the same except for
 a few narrow patches, it would make tracking their updates much easier.

 regards, tom lane




-- 

There are two ways of constructing a software design: One way is to make it
so simple that there are obviously no deficiencies and the other way is to
make it so complicated that there are no obvious deficiencies. — C.A.R.
Hoare


Re: [HACKERS] Proposal for updating src/timezone

2014-07-19 Thread John Cochran
On Sat, Jul 19, 2014 at 11:58 AM, Michael Banck mba...@gmx.net wrote:
SNIP

 Maybe if you pgindent the IANA code as well, you can more easily diff
 the actual changes between the two, did you try that?


 Michael


Unfortunately, pgindent doesn't work well with the IANA code as evident by
some previous checkins.

To quote a checkin done 2004-05-21 16:59:10 by Tom Lane
pgindent did a pretty awful job on the timezone code, particularly with
respect to doubly-starred comment blocks.  Do some manual cleanup.

As it is, I've finished checking the differences between the postgres and
IANA code for zic.c after editing both to eliminate non-functional style
differences such as indentation, function prototypes, comparing strchr
results against NULL or 0, etc. It looks like the only differences from the
stock code is support for the -P option implemented by Tom Lane and the
substitution of the postgres code for getopt instead of the unix default as
well as a few minor changes in some defaults and minor modifications to
deal with Windows. Overall, rather small and trivial.

Additionally, I discovered a little piece of low hanging fruit. The
Makefile for the timezone code links together zic.o ialloc.o scheck.o and
localtime.o in order to make zic.  Additionally, zic.c has a stub
implementation of pg_open_tzfile() in order to resolve a linkage issue with
the localtime.o module for zic.
Well, as it turns out, localtime.o doesn't supply ANY symbols that zic
needs and therefore can be omitted entirely from the list of object files
comprising zic. Which in turn means that the stub implementation of
pg_open_tzfile can be removed from the postgres version of zic.c.  I'm not
bothering to submit a patch involving this since that patch will be quite
short lived given my objective to bring the code up to date with the 2014e
version of the IANA code. But I am submitting a bug report to IANA on the
Makefile since the unneeded linkage with localtime.o is still in the 2014e
code on their site.

-- 

There are two ways of constructing a software design: One way is to make it
so simple that there are obviously no deficiencies and the other way is to
make it so complicated that there are no obvious deficiencies. — C.A.R.
Hoare


[HACKERS] Proposal for updating src/timezone

2014-07-18 Thread John Cochran
Given my recent examination of the src/timezone subtree and the current
code at IANA for timezone information and tools, I proposing the following
changes and would like to first get group consensus on the change prior to
investing any major effort.

My proposal is the have the following directory structure

.../src/timezone - Would have only PostgreSQL specific code
.../src/timezone/tznames - would be unaltered from current (optionally this
could be removed)
.../src/timezone/iana - would contain unaltered code and data from IANA

The .../src/timezone/data directory would be removed.

In doing this, any future updates to the IANA code should be trivial to
incorporate into the postgres tree.

Reasons for change.
1. The current timezone code is based upon, according to the comments,
version 2010c from IANA. IANA's latest code is version 2014e and there are
substantial changes between those versions (as well as 35 intermediate
versions ... Yes, IANA changes and updates the code regularly and
frequently)

2. The current timezone data has a few entries that are not properly
handled by any pre-2013 code so an update of some sort is needed anyway.

3. The latest code from IANA compiles with GCC using a kitchen sink array
of warning options without any generated warnings. This should alleviate
any  concerns about not accepting any code unless it's warning free. In
case anyone is interested, the GCC options are: -Dlint -g3 -O3 -fno-common
-fstrict-aliasing  -Wall -Wextra -Wbad-function-cast -Wcast-align
-Wcast-qual -Wformat=2 -Winit-self -Wmissing-declarations
-Wmissing-noreturn -Wmissing-prototypes -Wnested-externs -Wno-address
-Wno-cast-qual -Wno-format-nonliteral -Wno-sign-compare
-Wno-sign-conversion -Wno-type-limits -Wno-unused-parameter
-Woverlength-strings -Wpointer-arith -Wshadow -Wstrict-prototypes
-Wsuggest-attribute=const -Wsuggest-attribute=noreturn
-Wsuggest-attribute=pure -Wtrampolines -Wwrite-strings

Aspects of change that I'm not happy with.
1. I would have liked to recommend 2 sub-directories underneath
.../src/timezone/iana named code and data, but unfortunately have to
suggest untarring both the code and data files directly into the iana
directory. The reason for this is that the IANA code does not compile using
the IANA Makefile unless the script yearistype.sh is present and that
script is currently present in the data tar file, not the code tar file.
And that unfortunately means that splitting the IANA code and data into
different directories would violate the objective of being able to deposit
the files unaltered into postgres.

2. Depositing the IANA code unaltered would also deposit some junk files
into the directory. The files would mainly consist of man pages and html
files containing documentation for the timezone code. The extra files would
consume approximately 500 kilobytes above what's actually needed, but
otherwise wouldn't have any adverse effects.

Thank you for reading this proposal,
John Cochran
-- 

There are two ways of constructing a software design: One way is to make it
so simple that there are obviously no deficiencies and the other way is to
make it so complicated that there are no obvious deficiencies. — C.A.R.
Hoare


Re: [HACKERS] Proposal for updating src/timezone

2014-07-18 Thread John Cochran
On Fri, Jul 18, 2014 at 1:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 John Cochran j69coch...@gmail.com writes:
  My proposal is the have the following directory structure



 ...
  1. I would have liked to recommend 2 sub-directories underneath

...



I have exactly zero expectation of using their Makefile, so this is not a
 concern.

  2. Depositing the IANA code unaltered would also deposit some junk
 files

...

 We're not doing that either, IMO.  Why should we invest 500K of tarball
 space on that?



Understandable, and both issues are in the category of Things I didn't
like

Did a diff between the 2010c version of localtime.c and the postgres
version and saw a lot more differences than what could have been expected
from simple reformatting and adaptation. So I installed gitk and took a
look at the change history of localtime.c

Well, looking at the results, it pretty much put paid to the concept of
ever importing the IANA code unaltered into the postgres tree. In fact, it
looks like the original version of localtime.c was based upon one of the
2004a thru 2004c versions of the IANA code and since then has been
independently maintained.

In any case, I think my best path forward is to look at the change history
of the source files under src/timezone and determine the actual origin of
each file based upon the git history and the IANA archive. Then see what's
needed to bring the code up to date based upon the IANA code. Due to the
configure option --with-system-tzdata, I assume that postgres needs to
remain compatible with the output from the IANA zic program. That combined
with the warning messages about some of the entries in the current timezone
data not being compatible with pre-2013 code worries me a bit since the
2010c code from IANA does not support the output from the 2014e zic.
-- 

There are two ways of constructing a software design: One way is to make it
so simple that there are obviously no deficiencies and the other way is to
make it so complicated that there are no obvious deficiencies. — C.A.R.
Hoare


[HACKERS] Question about src/timezone/zic.c

2014-07-16 Thread John Cochran
I've recently found some time on my hands and have decided to see about
contributing to the PostgreSQL project, so I've started browsing the code
to get somewhat familiar with it.

The file src/timezone/zic.c caused me to raise my eyebrows a bit and I have
a question to ask because of this.

I first noticed some loops dealing with the time that I suspect could be
replaced with straight line code calling the julian date conversions. But
in attempting to understand exactly what the loops were doing, I saw some
calls to eitol() which is the function that I'm really questioning. The
code for eitol() is as follows:

static long
eitol(int i)
{
longl;

l = i;
if ((i  0  l = 0) || (i == 0  l != 0) || (i  0  l = 0))
{
(void) fprintf(stderr,
   _(%s: %d did not sign extend correctly\n),
   progname, i);
exit(EXIT_FAILURE);
}
return l;
}

As you can see, all that function does is perform a length extension of a
signed int to a signed long. Additionally, it has a fair amount of code to
verify that the sign extension was properly done. Since the sign extension
is actually performed via the simple l=i; statement towards the beginning
of the function, it's rather obvious that we're relying upon the compiler
to perform the sign extension. Additionally, since this function doesn't
have any recovery if an invalid extension is performed, it means that if a
faulty compiler is encountered, PostgreSQL will always fail to operate the
instant any timezone computation is performed.

This raises the question of Does anyone on this mailing list know of any C
compiler that fails to properly sign extend an assignment of an int to a
long? Or to put it another way, Are there any C compilers that fail to
properly perform integer promotion from int to long?

As things stand, it looks to me like that function eitol() can be simply
deleted and the 22 calls to that function also removed. Shorter,
simpler,faster code is always a good thing after all.

Thank you for reading,
John Cochran


Re: [HACKERS] regression failure - horology

2003-03-02 Thread John Cochran
In article [EMAIL PROTECTED],
Tom Lane [EMAIL PROTECTED] wrote:
Joe Conway [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Hm, I just had regression tests pass earlier this evening on RHL 8.0
 (also HPUX 10.20).  Are you using default config, or
 --enable-integer-datetimes?

 I'm using --enable-integer-datetimes on both.


It looks like the regression checks need to use two different formats
depending on if integer-timestamps is being used.

For the normal floating point timestamps, the regression checks are correct
since all timestamps between 4713BC to 5874897AD.

However using a 64 bit integer to represented as number of microseconds since
Jan 1, 2000 gives the following reduced range of 4713BC to about 294277AD.
In order to represent to full range of timestamps for julian day counts from 0
to 2147483647 to the nearest microsecond, you require:
   31 bits for the day
   17 bits for the second
   20 bits for the microsecond
for a total of 68 bits needed to represent the full range of timestamps to
the nearest microsecond.

Suggested fix would be one or more of the following:
   1. Reduce the range tested within the regression checks to what is
  capable for integer timestamps.
   2. Use two different regression checks depending on if integer timestamps
  are in use.
   3. Document the different ranges for timestamps depending on if integer
  timestamps are in use.

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html