Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-10-07 12:59:18 +0100, Andrew Gierth wrote: > > "Tom" == Tom Lane writes: > > Tom> Now, "shortest value that converts back exactly" is technically > Tom> cool, but I am not sure that it solves any real-world problem that > Tom> we have. > > Well, it seems to me that it is perfect for pg_dump. > > Also it's kind of a problem that our default float output is not > round-trip safe - people do keep wondering why they can select a row and > it'll show a certain value, but then doing WHERE col = 'xxx' on that > value does not find the row. Yes, testing equality of floats is bad, but > there's no reason to put in extra landmines. +1 > Tom> I'm also worried that introducing it would result in complaints like > Tom> > https://www.postgresql.org/message-id/CANaXbVjw3Y8VmapWuZahtcRhpE61hsSUcjquip3HuXeuN8y4sg%40mail.gmail.com > > Frankly for a >20x performance improvement in float8out I don't think > that's an especially big deal. +1. There's plenty complaints where we just say "sorry that it bothers you, but these larger concerns made us that way". > I do not see any obvious way to use this code to generate the same > output in the final digits that we currently do (in the sense of > overly-exact values like outputting 1.89991 for 1.9 when > extra_float_digits=3). But, why would that be required? Just to placate people wanting exactly the same output as before? I don't quite get how that'd be a useful requirement. Obviously we *do* need to support outputting non-exponent style output where appropriate, but that should mostly be different massaging of d2d()'s output, instead of calling to_chars() as the ryu upstream code does. ISTM we also need to support *reducing* the precision (for the case where people intentionally reduce extra_float_digits), but that similarly should be a SMOP, right?- - Andres
Re: Performance improvements for src/port/snprintf.c
> "Tom" == Tom Lane writes: Tom> However, it seems like it should still be on the table to look at Tom> other code that just does sprintf's job faster (such as the stb Tom> code Alexander mentioned). The stb sprintf is indeed a lot faster for floats than other sprintfs, but (a) it's still quite a lot slower than Ryu (COPY of my test table is 4.2 seconds with stb, vs 2.7 seconds with Ryu), and (b) it also produces changes in the insignificant digits, so while (it claims) the values are still round-trip convertible, they are neither the shortest representation nor the exact representation. For example, consider 1.9, which is 0x3FFE: exact value: 1.899911182158029987476766109466552734375 accepted input range: min: 1.89980015985556747182272374629974365234375 max: 1.90002220446049250313080847263336181640625 exact value rounded to 18 SF: 1.89991 Ryu output: 1.9E0 STB (%*.18g) output: 1.89992 sprintf (%*.18g) output: 1.89991 So while STB's output is in the acceptable range, it's not the result of rounding the exact value to 18 digits (as sprintf does on my system at least) and nor is it the shortest. Testing a bunch of random values it usually seems to be off from the rounded exact result by +/- 1 in the last digit. -- Andrew (irc:RhodiumToad)
Re: Performance improvements for src/port/snprintf.c
Andrew Gierth writes: > "Tom" == Tom Lane writes: > Tom> Now, "shortest value that converts back exactly" is technically > Tom> cool, but I am not sure that it solves any real-world problem that > Tom> we have. > Well, it seems to me that it is perfect for pg_dump. Perhaps. I was hoping for something we could slot into snprintf.c; not being able to select the number of digits to output is clearly a deal-breaker for that usage. But perhaps it's reasonable to allow "extra_float_digits = 3" to be redefined as meaning "use the shortest value that converts back exactly" in float[48]out. However, it seems like it should still be on the table to look at other code that just does sprintf's job faster (such as the stb code Alexander mentioned). If anything like that is acceptable for the general case, then we have to ask whether ryu is enough faster than *that* code, not faster than what we have now, to justify carrying another umpteen KB of independent code path for the dump-and-restore case. > Also it's kind of a problem that our default float output is not > round-trip safe - people do keep wondering why they can select a row and > it'll show a certain value, but then doing WHERE col = 'xxx' on that > value does not find the row. Unfortunately, I do not think it's going to be acceptable for default float output (as opposed to the dump/restore case) to become round-trip safe. The number of people complaining today would be dwarfed by the number of people complaining about extra garbage digits in their results. There isn't any compromise that will make things "just work" for people who are unaware of the subtleties of float arithmetic. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
> "Tom" == Tom Lane writes: Tom> Now, "shortest value that converts back exactly" is technically Tom> cool, but I am not sure that it solves any real-world problem that Tom> we have. Well, it seems to me that it is perfect for pg_dump. Also it's kind of a problem that our default float output is not round-trip safe - people do keep wondering why they can select a row and it'll show a certain value, but then doing WHERE col = 'xxx' on that value does not find the row. Yes, testing equality of floats is bad, but there's no reason to put in extra landmines. Tom> I'm also worried that introducing it would result in complaints like Tom> https://www.postgresql.org/message-id/CANaXbVjw3Y8VmapWuZahtcRhpE61hsSUcjquip3HuXeuN8y4sg%40mail.gmail.com Frankly for a >20x performance improvement in float8out I don't think that's an especially big deal. Tom> As for #2, my *very* short once-over of the code led me to think Tom> that the speed win comes mostly from use of wide integer Tom> arithmetic, Data point: forcing it to use 64-bit only (#define RYU_ONLY_64_BIT_OPS) makes negligible difference on my test setup. Tom> and maybe from throwing big lookup tables at the problem. If so, Tom> it's very likely possible that we could adopt those techniques Tom> without necessarily buying into the shortest-exact rule for how Tom> many digits to print. If you read the ACM paper (linked from the upstream github repo), it explains how the algorithm works by combining the radix conversion step with (the initial iterations of) the operation of finding the shortest representation. This allows limiting the number of bits needed for the intermediate results so that it can all be done in fixed-size integers, rather than using an arbitrary-precision approach. I do not see any obvious way to use this code to generate the same output in the final digits that we currently do (in the sense of overly-exact values like outputting 1.89991 for 1.9 when extra_float_digits=3). >> One option would be to stick with snprintf if extra_float_digits is >> less than 0 (or less than or equal to 0 and make the default 1) and >> use ryu otherwise, so that the option to get rounded floats is still >> there. (Apparently some people do use negative values of >> extra_float_digits.) Unlike other format-changing GUCs, this one >> already exists and is already used by people who want more or less >> precision, including by pg_dump where rount-trip conversion is the >> requirement. Tom> I wouldn't necessarily object to having some value of Tom> extra_float_digits that selects the shortest-exact rule, but I'm Tom> thinking maybe it should be a value we don't currently accept. Why would anyone currently set extra_float_digits > 0 if not to get round-trip-safe values? -- Andrew (irc:RhodiumToad)
Re: Performance improvements for src/port/snprintf.c
Andrew Gierth writes: > "Tom" == Tom Lane writes: > Tom> Yeah, one would hope. But I wonder whether it always produces the > Tom> same low-order digits, and if not, whether people will complain. > It won't produce the same low-order digits in general, since it has a > different objective: rather than outputting a decimal value which is the > true float value rounded to a fixed size by decimal rounding rules, it > produces the shortest decimal value which falls within the binary float > rounding interval of the true float value. i.e. the objective is to be > able to round-trip back to float and get the identical result. So I'm thinking that there are two, hopefully separable, issues here: 1. The algorithm for deciding how many digits to print. 2. The speed. Now, "shortest value that converts back exactly" is technically cool, but I am not sure that it solves any real-world problem that we have. I'm also worried that introducing it would result in complaints like https://www.postgresql.org/message-id/CANaXbVjw3Y8VmapWuZahtcRhpE61hsSUcjquip3HuXeuN8y4sg%40mail.gmail.com As for #2, my *very* short once-over of the code led me to think that the speed win comes mostly from use of wide integer arithmetic, and maybe from throwing big lookup tables at the problem. If so, it's very likely possible that we could adopt those techniques without necessarily buying into the shortest-exact rule for how many digits to print. > One option would be to stick with snprintf if extra_float_digits is less > than 0 (or less than or equal to 0 and make the default 1) and use ryu > otherwise, so that the option to get rounded floats is still there. > (Apparently some people do use negative values of extra_float_digits.) > Unlike other format-changing GUCs, this one already exists and is > already used by people who want more or less precision, including by > pg_dump where rount-trip conversion is the requirement. I wouldn't necessarily object to having some value of extra_float_digits that selects the shortest-exact rule, but I'm thinking maybe it should be a value we don't currently accept. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
> "Tom" == Tom Lane writes: Tom> Thanks. Just scanning through the code quickly, I note that it Tom> assumes IEEE float format, which is probably okay but I suppose we Tom> might want a configure switch to disable it (and revert to Tom> platform sprintf). Yeah, but even s390 these days supports IEEE floats in hardware so I'm not sure there are any platforms left that don't (that we care about). Tom> I couldn't immediately figure out if it's got endianness Tom> assumptions; but even if it does, that'd likely only affect the Tom> initial disassembly of the IEEE format, so probably not a huge Tom> deal. Upstream docs say it's fine with big-endian as long as the endianness of ints and floats is the same. Tom> I wonder which variant of the code you were testing (e.g. Tom> HAS_UINT128 or not). I was using clang 3.9.1 on FreeBSD amd64, and HAS_UINT128 ends up enabled by this test: #if defined(__SIZEOF_INT128__) && !defined(_MSC_VER) && !defined(RYU_ONLY_64_BIT_OPS) #define HAS_UINT128 ... >> The regression tests for float8 fail of course since Ryu's output >> format differs (it always includes an exponent, but the code for >> that part can be tweaked without touching the main algorithm). Tom> Yeah, one would hope. But I wonder whether it always produces the Tom> same low-order digits, and if not, whether people will complain. It won't produce the same low-order digits in general, since it has a different objective: rather than outputting a decimal value which is the true float value rounded to a fixed size by decimal rounding rules, it produces the shortest decimal value which falls within the binary float rounding interval of the true float value. i.e. the objective is to be able to round-trip back to float and get the identical result. One option would be to stick with snprintf if extra_float_digits is less than 0 (or less than or equal to 0 and make the default 1) and use ryu otherwise, so that the option to get rounded floats is still there. (Apparently some people do use negative values of extra_float_digits.) Unlike other format-changing GUCs, this one already exists and is already used by people who want more or less precision, including by pg_dump where rount-trip conversion is the requirement. Here are some examples of differences in digits, comparing ryu output with extra_float_digits=3: Pi: ryu 3.141592653589793E0 sprintf 3.14159265358979312 e: ryu 2.7182818284590455E0 sprintf 2.71828182845904553 1/10: ryu 1E-1 sprintf 0.16 1/3: ryu 3.333E-1 sprintf 0.15 2/3: ryu 6.666E-1 sprintf 0.3 -- Andrew.
Re: Performance improvements for src/port/snprintf.c
Andrew Gierth writes: > Tom> Oh yeah? Where's the code for this? > Upstream code is at https://github.com/ulfjack/ryu > ... > I attach the patch I've used for testing, which has these changes from > upstream Ryu: Thanks. Just scanning through the code quickly, I note that it assumes IEEE float format, which is probably okay but I suppose we might want a configure switch to disable it (and revert to platform sprintf). I couldn't immediately figure out if it's got endianness assumptions; but even if it does, that'd likely only affect the initial disassembly of the IEEE format, so probably not a huge deal. I wonder which variant of the code you were testing (e.g. HAS_UINT128 or not). There's a pretty large gap between this code and PG coding conventions, both as to layout and portability rules. I wonder if we'd be better off to implement the algorithm afresh instead of whacking this particular code past the point of unrecognizability. > The regression tests for > float8 fail of course since Ryu's output format differs (it always > includes an exponent, but the code for that part can be tweaked without > touching the main algorithm). Yeah, one would hope. But I wonder whether it always produces the same low-order digits, and if not, whether people will complain. We just had somebody griping about a change in insignificant zeroes in timestamp output :-(. Still, seems worth further investigation. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
> "Tom" == Tom Lane writes: >> Ryu is so blazing fast that with it, COPY of a table with 2million >> rows of 12 random float8 columns (plus id) becomes FASTER in text >> mode than in binary mode (rather than ~5x slower): Tom> Oh yeah? Where's the code for this? Upstream code is at https://github.com/ulfjack/ryu Most of that is benchmarking, java, and other stuff not interesting to us; the guts are under ryu/ and are dual-licensed under Boost 1.0 (which I think we can use, since the only difference from BSD seems to be a permissive one) as well as Apache 2.0 (which AFAIK we can't use). I attach the patch I've used for testing, which has these changes from upstream Ryu: - added ryu_ prefix to entry point functions - changed some #include file locations - added #define NDEBUG since there are a bunch of plain C assert()s but I didn't touch the formatting or style of the Ryu code so it's all C99 and // comments and OTB etc. For testing purposes what I did was to change float[48]out to use the Ryu code iff extra_float_digits > 0. This isn't likely what a final version should do, just a convenience flag. The regression tests for float8 fail of course since Ryu's output format differs (it always includes an exponent, but the code for that part can be tweaked without touching the main algorithm). -- Andrew (irc:RhodiumToad) diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c index df35557b73..b7e44b2eba 100644 --- a/src/backend/utils/adt/float.c +++ b/src/backend/utils/adt/float.c @@ -21,6 +21,7 @@ #include "catalog/pg_type.h" #include "common/int.h" +#include "common/ryu.h" #include "libpq/pqformat.h" #include "utils/array.h" #include "utils/float.h" @@ -245,6 +246,13 @@ float4out(PG_FUNCTION_ARGS) float4 num = PG_GETARG_FLOAT4(0); char *ascii; + if (extra_float_digits > 0) + { + ascii = (char *) palloc(24); + ryu_f2s_buffered(num, ascii); + PG_RETURN_CSTRING(ascii); + } + if (isnan(num)) PG_RETURN_CSTRING(pstrdup("NaN")); @@ -481,6 +489,13 @@ float8out_internal(double num) { char *ascii; + if (extra_float_digits > 0) + { + ascii = (char *) palloc(32); + ryu_d2s_buffered(num, ascii); + return ascii; + } + if (isnan(num)) return pstrdup("NaN"); diff --git a/src/common/Makefile b/src/common/Makefile index ec8139f014..ae89dd8c5e 100644 --- a/src/common/Makefile +++ b/src/common/Makefile @@ -44,8 +44,8 @@ override CPPFLAGS += -DVAL_LIBS="\"$(LIBS)\"" override CPPFLAGS := -DFRONTEND $(CPPFLAGS) LIBS += $(PTHREAD_LIBS) -OBJS_COMMON = base64.o config_info.o controldata_utils.o exec.o file_perm.o \ - ip.o keywords.o link-canary.o md5.o pg_lzcompress.o \ +OBJS_COMMON = base64.o config_info.o controldata_utils.o d2s.o exec.o f2s.o \ + file_perm.o ip.o keywords.o link-canary.o md5.o pg_lzcompress.o \ pgfnames.o psprintf.o relpath.o \ rmtree.o saslprep.o scram-common.o string.o unicode_norm.o \ username.o wait_error.o diff --git a/src/common/d2s.c b/src/common/d2s.c new file mode 100644 index 00..20fac52704 --- /dev/null +++ b/src/common/d2s.c @@ -0,0 +1,612 @@ +// Copyright 2018 Ulf Adams +// +// The contents of this file may be used under the terms of the Apache License, +// Version 2.0. +// +//(See accompanying file LICENSE-Apache or copy at +// http://www.apache.org/licenses/LICENSE-2.0) +// +// Alternatively, the contents of this file may be used under the terms of +// the Boost Software License, Version 1.0. +//(See accompanying file LICENSE-Boost or copy at +// https://www.boost.org/LICENSE_1_0.txt) +// +// Unless required by applicable law or agreed to in writing, this software +// is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. + +// Runtime compiler options: +// -DRYU_DEBUG Generate verbose debugging output to stdout. +// +// -DRYU_ONLY_64_BIT_OPS Avoid using uint128_t or 64-bit intrinsics. Slower, +// depending on your compiler. +// +// -DRYU_OPTIMIZE_SIZE Use smaller lookup tables. Instead of storing every +// required power of 5, only store every 26th entry, and compute +// intermediate values with a multiplication. This reduces the lookup table +// size by about 10x (only one case, and only double) at the cost of some +// performance. Currently requires MSVC intrinsics. + +#define NDEBUG + +#include "common/ryu.h" + +#include +#include +#include +#include +#include + +#ifdef RYU_DEBUG +#include +#include +#endif + +// ABSL avoids uint128_t on Win32 even if __SIZEOF_INT128__ is defined. +// Let's do the same for now. +#if defined(__SIZEOF_INT128__) && !defined(_MSC_VER) && !defined(RYU_ONLY_64_BIT_OPS) +#define HAS_UINT128 +#elif defined(_MSC_VER) && !defined(RYU_ONLY_64_BIT_OPS) && defined(_M_X64) \ + && !defined(__clang__) // https://bugs.llvm.org/show_bug.cgi?id=37755 +#define HAS_64_BIT_INTRINSICS +#endif + +#include "ryu_common.h" +#include "digit_table.h" +#include "d2s.h" +#include
Re: Performance improvements for src/port/snprintf.c
Andrew Gierth writes: > So here's a thing: I finally got to doing my performance tests for using > the Ryu float output code in float[48]out. > Ryu is so blazing fast that with it, COPY of a table with 2million rows > of 12 random float8 columns (plus id) becomes FASTER in text mode than > in binary mode (rather than ~5x slower): Oh yeah? Where's the code for this? > (And yes, I've double-checked the results and they look correct, other > than the formatting differences. COPY BINARY seems to have a bit more > overhead than text mode, even for just doing integers, I don't know > why.) The per-column overhead is more (length word vs delimiter) and I think the APIs for send/recv functions are potentially a bit less efficient too. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
> "Andres" == Andres Freund writes: Andres> I'm not convinced. Because of some hypothetical platform that Andres> may introduce strfromd() in a broken/slower manner, but where Andres> sprintf() is correct, we should not do the minimal work to Andres> alleviate an actual performance bottleneck in a trivial manner Andres> on linux? Our most widely used platform? If we find a platform Andres> where it's borked, we could just add a small hack into their Andres> platform template file. So here's a thing: I finally got to doing my performance tests for using the Ryu float output code in float[48]out. Ryu is so blazing fast that with it, COPY of a table with 2million rows of 12 random float8 columns (plus id) becomes FASTER in text mode than in binary mode (rather than ~5x slower): copy binary flttst to '/dev/null'; -- binary Time: 3222.444 ms (00:03.222) copy flttst to '/dev/null'; -- non-Ryu Time: 16416.161 ms (00:16.416) copy flttst to '/dev/null'; -- Ryu Time: 2691.642 ms (00:02.692) (And yes, I've double-checked the results and they look correct, other than the formatting differences. COPY BINARY seems to have a bit more overhead than text mode, even for just doing integers, I don't know why.) -- Andrew (irc:RhodiumToad)
Re: Performance improvements for src/port/snprintf.c
I stepped back a bit from the raw performance question and thought about what we actually want functionally in snprintf's float handling. There are a couple of points worth making: * The fact that float[48]out explicitly handle NaN and Inf cases is a leftover from when they had to cope with varying behavior of platform-specific snprintf implementations. Now that we've standardized on snprintf.c, it makes a lot more sense to enforce standardized printing of these values inside snprintf.c. That not only avoids repeated tests for these cases at different code levels, but ensures that the uniform behavior exists for all our invocations of *printf, not just float[48]out. * snprintf.c doesn't really work right for IEEE minus zero, as I recently noted in another thread (<23662.1538067...@sss.pgh.pa.us>). While this is not of significance for float[48]out, it might be a problem for other callers. Now that we've enforced usage of snprintf.c across-the-board, I think it's more important to worry about these corner cases. It's not that expensive to fix either; we can test for minus zero with something like this: static const double dzero = 0.0; if (value == 0.0 && memcmp(, , sizeof(double)) != 0) (ie, "it's equal to zero but not bitwise equal to zero"). While that looks like it might be expensive, I find that recent versions of both gcc and clang can optimize the memcmp call down to something like cmpq$0, 8(%rsp) so I think it's well worth the cost to get this right. The attached proposed patch addresses both of those points. Also, in service of the performance angle, I went ahead and made a roughly strfromd-like entry point in snprintf.c, but using an API that doesn't force textual conversion of the precision spec. As best I can tell, this patch puts the performance of float8out on par with what it was in v11, measuring using a tight loop like while (count-- > 0) { char *str = float8out_internal(val); pfree(str); CHECK_FOR_INTERRUPTS(); } For me, this is within a percent or two either way on a couple of different machines; that's within the noise level. regards, tom lane diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c index df35557..260377c 100644 *** a/src/backend/utils/adt/float.c --- b/src/backend/utils/adt/float.c *** Datum *** 243,272 float4out(PG_FUNCTION_ARGS) { float4 num = PG_GETARG_FLOAT4(0); ! char *ascii; ! ! if (isnan(num)) ! PG_RETURN_CSTRING(pstrdup("NaN")); ! ! switch (is_infinite(num)) ! { ! case 1: ! ascii = pstrdup("Infinity"); ! break; ! case -1: ! ascii = pstrdup("-Infinity"); ! break; ! default: ! { ! int ndig = FLT_DIG + extra_float_digits; ! ! if (ndig < 1) ! ndig = 1; ! ! ascii = psprintf("%.*g", ndig, num); ! } ! } PG_RETURN_CSTRING(ascii); } --- 243,252 float4out(PG_FUNCTION_ARGS) { float4 num = PG_GETARG_FLOAT4(0); ! char *ascii = (char *) palloc(32); ! int ndig = FLT_DIG + extra_float_digits; + (void) pg_strfromd(ascii, 32, ndig, num); PG_RETURN_CSTRING(ascii); } *** float8out(PG_FUNCTION_ARGS) *** 479,508 char * float8out_internal(double num) { ! char *ascii; ! ! if (isnan(num)) ! return pstrdup("NaN"); ! ! switch (is_infinite(num)) ! { ! case 1: ! ascii = pstrdup("Infinity"); ! break; ! case -1: ! ascii = pstrdup("-Infinity"); ! break; ! default: ! { ! int ndig = DBL_DIG + extra_float_digits; ! ! if (ndig < 1) ! ndig = 1; ! ! ascii = psprintf("%.*g", ndig, num); ! } ! } return ascii; } --- 459,468 char * float8out_internal(double num) { ! char *ascii = (char *) palloc(32); ! int ndig = DBL_DIG + extra_float_digits; + (void) pg_strfromd(ascii, 32, ndig, num); return ascii; } diff --git a/src/include/port.h b/src/include/port.h index e654d5c..0729c3f 100644 *** a/src/include/port.h --- b/src/include/port.h *** extern int pg_printf(const char *fmt,... *** 187,192 --- 187,195 #define fprintf pg_fprintf #define printf(...) pg_printf(__VA_ARGS__) + /* This is also provided by snprintf.c */ + extern int pg_strfromd(char *str, size_t count, int precision, double value); + /* Replace strerror() with our own, somewhat more robust wrapper */ extern char *pg_strerror(int errnum); #define strerror pg_strerror diff --git a/src/port/snprintf.c b/src/port/snprintf.c index ef496fa..897c683 100644 *** a/src/port/snprintf.c --- b/src/port/snprintf.c *** fmtfloat(double value, char type, int fo *** ,1120 int zeropadlen = 0; /* amount to pad with zeroes */ int padlen; /* amount to pad with spaces */ - /* Handle sign (NaNs have no sign) */ - if (!isnan(value) && adjust_sign((value <
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-10-05 11:54:59 -0400, Tom Lane wrote: >> I really think that what we ought to do is apply the float[48]out hack >> I showed in <30551.1538517...@sss.pgh.pa.us> and call it good, at least >> till such time as somebody wants to propose a full-on reimplementation of >> float output. I don't want to buy back into having platform dependencies >> in this area after having just expended a lot of sweat to get rid of them. > I'm not convinced. Because of some hypothetical platform that may > introduce strfromd() in a broken/slower manner, but where sprintf() is > correct, we should not do the minimal work to alleviate an actual > performance bottleneck in a trivial manner on linux? Our most widely > used platform? If we find a platform where it's borked, we could just > add a small hack into their platform template file. If it were a significant performance improvement, I'd be okay with that conclusion, but my measurements say that it's not. The extra complication is not free, and in my judgement it's not worth it. We certainly do need to buy back the performance we lost in float[48]out, but the hack I suggested does so --- on all platforms, not only Linux. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-10-05 11:54:59 -0400, Tom Lane wrote: > Andres Freund writes: > > [ let's use strfromd ] > > So I'm having second thoughts about this, based on the fact that > strfromd() in't strictly a glibc-ism but is defined in an ISO/IEC > standard. That means that we can expect to see it start showing up > on other platforms (though a quick search did not find any evidence > that it has yet). And that means that we'd better consider > quality-of-implementation issues. We know that glibc's version is > fractionally faster than using sprintf with "%.*g", but what are > the odds that that will be true universally? I don't have a warm > feeling about it, given that strfromd's API isn't a very good impedance > match to what we really need. > > I really think that what we ought to do is apply the float[48]out hack > I showed in <30551.1538517...@sss.pgh.pa.us> and call it good, at least > till such time as somebody wants to propose a full-on reimplementation of > float output. I don't want to buy back into having platform dependencies > in this area after having just expended a lot of sweat to get rid of them. I'm not convinced. Because of some hypothetical platform that may introduce strfromd() in a broken/slower manner, but where sprintf() is correct, we should not do the minimal work to alleviate an actual performance bottleneck in a trivial manner on linux? Our most widely used platform? If we find a platform where it's borked, we could just add a small hack into their platform template file. Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > [ let's use strfromd ] So I'm having second thoughts about this, based on the fact that strfromd() in't strictly a glibc-ism but is defined in an ISO/IEC standard. That means that we can expect to see it start showing up on other platforms (though a quick search did not find any evidence that it has yet). And that means that we'd better consider quality-of-implementation issues. We know that glibc's version is fractionally faster than using sprintf with "%.*g", but what are the odds that that will be true universally? I don't have a warm feeling about it, given that strfromd's API isn't a very good impedance match to what we really need. I really think that what we ought to do is apply the float[48]out hack I showed in <30551.1538517...@sss.pgh.pa.us> and call it good, at least till such time as somebody wants to propose a full-on reimplementation of float output. I don't want to buy back into having platform dependencies in this area after having just expended a lot of sweat to get rid of them. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-10-03 14:01:35 -0400, Tom Lane wrote: >> BTW, so far as I can tell on F28, strfromd isn't exposed without >> "-D__STDC_WANT_IEC_60559_BFP_EXT__", which seems fairly scary; >> what else does that affect? > So I don't think anything's needed to enable that in pg, given that we > define _GNU_SOURCE Ah, OK. I thought my test case had _GNU_SOURCE defined already, but it didn't. You might want to do something like what I stuck in for strchrnul, though: /* * glibc's declares strchrnul only if _GNU_SOURCE is defined. * While we typically use that on glibc platforms, configure will set * HAVE_STRCHRNUL whether it's used or not. Fill in the missing declaration * so that this file will compile cleanly with or without _GNU_SOURCE. */ #ifndef _GNU_SOURCE extern char *strchrnul(const char *s, int c); #endif regards, tom lane
Re: Performance improvements for src/port/snprintf.c
I wrote: >> Hm. I guess that'd be a bit better, but I'm not sure it's worth it. How >> about we simply add a static assert that long long isn't bigger than >> int64? > WFM, I'll make it happen. Actually, while writing a comment to go with that assertion, I decided this was dumb. If we're expecting the compiler to have "long long", and if we're convinced that no platforms define "long long" as wider than 64 bits, we may as well go with the s/int64/long long/g solution. That should result in no code change on any platform today. And it will still work correctly, if maybe a bit inefficiently, on some hypothetical future platform where long long is wider. We (or our successors) can worry about optimizing that when the time comes. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-10-03 14:01:35 -0400, Tom Lane wrote: > Andres Freund writes: > > So when using pg's snprintf() to print a single floating point number > > with precision, we get nearly a 10% boost. > > I just tested that using my little standalone testbed, and I failed > to replicate the result. I do see that strfromd is slightly faster, > but it's just a few percent measuring snprintf.c in isolation --- in > the overall context of COPY, I don't see how you get to 10% net savings. I just tested your patch, and I see (best of three): master: 16224.727 ms hack-use-of-strfromd.patch: 14944.927 ms So not quite 10%, but pretty close. CREATE TABLE somefloats(id serial, data1 float8, data2 float8, data3 float8); INSERT INTO somefloats(data1, data2, data3) SELECT random(), random(), random() FROM generate_series(1, 1000); VACUUM FREEZE somefloats; COPY somefloats TO '/dev/null'; What difference do you see? > So I continue to think there's something fishy about your test case. > > BTW, so far as I can tell on F28, strfromd isn't exposed without > "-D__STDC_WANT_IEC_60559_BFP_EXT__", which seems fairly scary; > what else does that affect? My copy says: #undef __GLIBC_USE_IEC_60559_BFP_EXT #if defined __USE_GNU || defined __STDC_WANT_IEC_60559_BFP_EXT__ # define __GLIBC_USE_IEC_60559_BFP_EXT 1 #else # define __GLIBC_USE_IEC_60559_BFP_EXT 0 #endif And __USE_GNU is enabled by #ifdef _GNU_SOURCE # define __USE_GNU 1 #endif So I don't think anything's needed to enable that in pg, given that we define _GNU_SOURCE Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
On 2018-Oct-03, Tom Lane wrote: > Andres Freund writes: > BTW, so far as I can tell on F28, strfromd isn't exposed without > "-D__STDC_WANT_IEC_60559_BFP_EXT__", which seems fairly scary; > what else does that affect? https://en.cppreference.com/w/c/experimental/fpext1 -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > So when using pg's snprintf() to print a single floating point number > with precision, we get nearly a 10% boost. I just tested that using my little standalone testbed, and I failed to replicate the result. I do see that strfromd is slightly faster, but it's just a few percent measuring snprintf.c in isolation --- in the overall context of COPY, I don't see how you get to 10% net savings. So I continue to think there's something fishy about your test case. BTW, so far as I can tell on F28, strfromd isn't exposed without "-D__STDC_WANT_IEC_60559_BFP_EXT__", which seems fairly scary; what else does that affect? regards, tom lane diff --git a/src/port/snprintf.c b/src/port/snprintf.c index b9b6add..f75369c 100644 --- a/src/port/snprintf.c +++ b/src/port/snprintf.c @@ -1137,17 +1137,19 @@ fmtfloat(double value, char type, int forcesign, int leftjust, zeropadlen = precision - prec; fmt[0] = '%'; fmt[1] = '.'; - fmt[2] = '*'; - fmt[3] = type; - fmt[4] = '\0'; - vallen = sprintf(convert, fmt, prec, value); + fmt[2] = (prec / 100) + '0'; + fmt[3] = ((prec % 100) / 10) + '0'; + fmt[4] = (prec % 10) + '0'; + fmt[5] = type; + fmt[6] = '\0'; + vallen = strfromd(convert, sizeof(convert), fmt, value); } else { fmt[0] = '%'; fmt[1] = type; fmt[2] = '\0'; - vallen = sprintf(convert, fmt, value); + vallen = strfromd(convert, sizeof(convert), fmt, value); } if (vallen < 0) goto fail;
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-10-03 13:18:35 -0400, Tom Lane wrote: >> Having said that, maybe there's a case for changing the type spec >> in only the va_arg() call, and leaving snprintf's related local >> variables as int64. (Is that what you actually meant?) Then, >> if long long really is different from int64, at least we have >> predictable truncation behavior after fetching the value, rather >> than undefined behavior while fetching it. > Hm. I guess that'd be a bit better, but I'm not sure it's worth it. How > about we simply add a static assert that long long isn't bigger than > int64? WFM, I'll make it happen. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
On 2018-10-03 13:40:03 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2018-10-03 13:31:09 -0400, Tom Lane wrote: > >> I do not see the point of messing with snprintf.c here. I doubt that > >> strfromd is faster than the existing sprintf call (because the latter > >> can use ".*" instead of serializing and deserializing the precision). > > > I'm confused, the numbers I posted clearly show that it's faster? > > Those were in the context of whether float8out went through snprintf.c > or directly to strfromd, no? I measured both, changing float8out directly, and just adapting snprintf.c: > snprintf using sprintf via pg_double_to_string: > 16195.787 > > snprintf using strfromd via pg_double_to_string: > 14856.974 ms > > float8out using sprintf via pg_double_to_string: > 16176.169 > > float8out using strfromd via pg_double_to_string: > 13532.698 So when using pg's snprintf() to print a single floating point number with precision, we get nearly a 10% boost. The win unsurprisingly is bigger when not going through snprintf.c. Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-10-03 13:18:35 -0400, Tom Lane wrote: > I wrote: > > Andres Freund writes: > >> - I know it's not new, but is it actually correct to use va_arg(args, > >> int64) > >> for ATYPE_LONGLONG? > > > Well, the problem with just doing s/int64/long long/g is that the > > code would then fail on compilers without a "long long" type. > > We could ifdef our way around that, but I don't think the code would > > end up prettier. > > I spent a bit more time thinking about that point. My complaint about > lack of long long should be moot given that we're now requiring C99. True, I didn't think of that. > So the two cases we need to worry about are (1) long long exists and > is 64 bits, and (2) long long exists and is wider than 64 bits. In > case (1) there's nothing actively wrong with the code as it stands. > In case (2), if we were to fix the problem by s/int64/long long/g, > the result would be that we'd be doing the arithmetic for all > integer-to-text conversions in 128 bits, which seems likely to be > pretty expensive. Yea, that seems quite undesirable. > So a "real" fix would probably require having separate versions of > fmtint for long and long long. I'm not terribly excited about > going there. I can see it happening some day when/if we need to > use 128-bit math more extensively than today, but I do not think > that day is close. Right, that seems a bit off. > (Are there *any* platforms where "long long" is 128 bits today?) Not that I'm aware off. > Having said that, maybe there's a case for changing the type spec > in only the va_arg() call, and leaving snprintf's related local > variables as int64. (Is that what you actually meant?) Then, > if long long really is different from int64, at least we have > predictable truncation behavior after fetching the value, rather > than undefined behavior while fetching it. Hm. I guess that'd be a bit better, but I'm not sure it's worth it. How about we simply add a static assert that long long isn't bigger than int64? Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-10-03 13:31:09 -0400, Tom Lane wrote: >> I do not see the point of messing with snprintf.c here. I doubt that >> strfromd is faster than the existing sprintf call (because the latter >> can use ".*" instead of serializing and deserializing the precision). > I'm confused, the numbers I posted clearly show that it's faster? Those were in the context of whether float8out went through snprintf.c or directly to strfromd, no? regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-10-03 13:31:09 -0400, Tom Lane wrote: > I do not see the point of messing with snprintf.c here. I doubt that > strfromd is faster than the existing sprintf call (because the latter > can use ".*" instead of serializing and deserializing the precision). I'm confused, the numbers I posted clearly show that it's faster? Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-10-03 12:54:52 -0400, Tom Lane wrote: >> (1) The need to use sprintf for portability means that we need very >> tight constraints on the precision spec *and* the buffer size *and* >> the format type (%f pretty much destroys certainty about how long the >> output string is). So this isn't going to be general purpose code. >> I think just writing it into float[48]out is sufficient. > Well, the numbers suggest it's also useful to do so from snprintf - it's > not that rare that we output floating point numbers from semi > performance critical code, even leaving aside float[48]out. So I'm not > convinced that we shouldn't do this from within snprintf.c too. Now we > could open-code it twice, but i'm not sure I see the point. I do not see the point of messing with snprintf.c here. I doubt that strfromd is faster than the existing sprintf call (because the latter can use ".*" instead of serializing and deserializing the precision). Even if it is, I do not want to expose an attractive-nuisance API in a header, and I think this would be exactly that. > If we just define the API as having to guarantee there's enough space > for the output format, I think it'll work well enough for now? No, because that's a recipe for buffer-overflow bugs. It's *hard* to be sure the buffer is big enough, and easy to make breakable assumptions. > snprintf.c already assumes everything floating point can be output in > 1024 chars, no? Indeed, and it's got hacks like a forced limit to precision 350 in order to make that safe. I don't want to be repeating the reasoning in fmtfloat() in a bunch of other places. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
I wrote: > Andres Freund writes: >> - I know it's not new, but is it actually correct to use va_arg(args, int64) >> for ATYPE_LONGLONG? > Well, the problem with just doing s/int64/long long/g is that the > code would then fail on compilers without a "long long" type. > We could ifdef our way around that, but I don't think the code would > end up prettier. I spent a bit more time thinking about that point. My complaint about lack of long long should be moot given that we're now requiring C99. So the two cases we need to worry about are (1) long long exists and is 64 bits, and (2) long long exists and is wider than 64 bits. In case (1) there's nothing actively wrong with the code as it stands. In case (2), if we were to fix the problem by s/int64/long long/g, the result would be that we'd be doing the arithmetic for all integer-to-text conversions in 128 bits, which seems likely to be pretty expensive. So a "real" fix would probably require having separate versions of fmtint for long and long long. I'm not terribly excited about going there. I can see it happening some day when/if we need to use 128-bit math more extensively than today, but I do not think that day is close. (Are there *any* platforms where "long long" is 128 bits today?) Having said that, maybe there's a case for changing the type spec in only the va_arg() call, and leaving snprintf's related local variables as int64. (Is that what you actually meant?) Then, if long long really is different from int64, at least we have predictable truncation behavior after fetching the value, rather than undefined behavior while fetching it. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-10-03 12:54:52 -0400, Tom Lane wrote: > Andres Freund writes: > > It seems the general "use strfromd if available" approach is generally > > useful, even if we need to serialize the precision. > > Agreed. > > > Putting it into an > > inline appears to be helpful, avoids some of the otherwise precision > > related branches. Do you have any feelings about which header to put > > the code in? I used common/string.h so far. > > I do not think it should be in a header, for two reasons: > > (1) The need to use sprintf for portability means that we need very > tight constraints on the precision spec *and* the buffer size *and* > the format type (%f pretty much destroys certainty about how long the > output string is). So this isn't going to be general purpose code. > I think just writing it into float[48]out is sufficient. Well, the numbers suggest it's also useful to do so from snprintf - it's not that rare that we output floating point numbers from semi performance critical code, even leaving aside float[48]out. So I'm not convinced that we shouldn't do this from within snprintf.c too. Now we could open-code it twice, but i'm not sure I see the point. If we just define the API as having to guarantee there's enough space for the output format, I think it'll work well enough for now? snprintf.c already assumes everything floating point can be output in 1024 chars, no? Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > It seems the general "use strfromd if available" approach is generally > useful, even if we need to serialize the precision. Agreed. > Putting it into an > inline appears to be helpful, avoids some of the otherwise precision > related branches. Do you have any feelings about which header to put > the code in? I used common/string.h so far. I do not think it should be in a header, for two reasons: (1) The need to use sprintf for portability means that we need very tight constraints on the precision spec *and* the buffer size *and* the format type (%f pretty much destroys certainty about how long the output string is). So this isn't going to be general purpose code. I think just writing it into float[48]out is sufficient. (2) It's already the case that most code trying to emit floats ought to go through float[48]out, in order to have standardized treatment of Inf and NaN. Providing some other API in a common header would just create a temptation to break that policy. Now, if we did write our own float output code then we would standardize Inf/NaN outputs inside that, and both of these issues would go away ... but I think what we'd do is provide something strfromd-like as an alternate API for that code, so we still won't need a wrapper. And anyway it doesn't sound like either of us care to jump that hurdle right now. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-10-03 11:59:27 -0400, Tom Lane wrote: >> I experimented with adding an initial check for "format is exactly %s" >> at the top of dopr(), and couldn't get excited about that. Instrumenting >> things showed that the optimization fired in only 1.8% of the calls >> during a run of our core regression tests. Now, that might not count >> as a really representative workload, but it doesn't make me think that >> the case is worth optimizing for us. > Seems right. I also have a hard time to believe that any of those "%s" > printfs are performance critical - we'd hopefully just have avoided the > sprintf in that case. Yup, that's probably a good chunk of the reason why there aren't very many. But we *do* use %s a lot to assemble multiple strings or combine them with fixed text, which is why the other form of the optimization is useful. >> But then it occurred to me that there's more than one way to skin this >> cat. We could, for an even cheaper extra test, detect that any one >> format specifier is just "%s", and use the same kind of fast-path >> within the loop. With the same sort of instrumentation, I found that >> a full 45% of the format specs executed in the core regression tests >> are just %s. That makes me think that a patch along the lines of the >> attached is a good win for our use-cases. Comparing to Fedora 28's >> glibc, this gets us to > Hm, especially if we special case the float->string conversions directly > at the hot callsites, that seems reasonable. I kinda wish we could just > easily move the format string processing to compile-time, but given > translatability that won't be widely possible even if it were otherwise > feasible. Yeah, there's a mighty big pile of infrastructure that depends on the way *printf works. I agree that one way or another we're going to be special-casing float8out and float4out. BTW, I poked around in the related glibc sources the other day, and it seemed like they are doing some sort of quasi-compilation of format strings. I couldn't figure out how they made it pay, though --- without some way to avoid re-compiling the same format string over and over, seems like it couldn't net out as a win. But if they are avoiding that, I didn't find where. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-10-03 12:22:13 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2018-10-03 12:07:32 -0400, Tom Lane wrote: > >> [ scratches head ... ] How would that work? Seems like it necessarily > >> adds a strlen() call to whatever we'd be doing otherwise. palloc isn't > >> going to be any faster just from asking it for slightly fewer bytes. > >> I think there might be something wrong with your test scenario ... > >> or there's more noise in the numbers than you thought. > > > I guess the difference is that we're more likely to find reusable chunks > > in aset.c and/or need fewer OS allocations. As the memory is going to > > be touched again very shortly afterwards, the cache effects probably are > > neglegible. > > > The strlen definitely shows up in profiles, it just seems to save at > > least as much as it costs. > > > Doesn't strike me as THAT odd? > > What it strikes me as is excessively dependent on one particular test > scenario. I don't mind optimizations that are tradeoffs between > well-understood costs, but this smells like handwaving that's going to > lose as much or more often than winning, once it hits the real world. I'm not particularly wedded to doing the allocation differently - I was just mildly wondering if the increased size of the allocations could be problematic. And that lead me to testing that. And reporting it. I don't think the real-world test differences are that large in this specific case, but whatever. It seems the general "use strfromd if available" approach is generally useful, even if we need to serialize the precision. Putting it into an inline appears to be helpful, avoids some of the otherwise precision related branches. Do you have any feelings about which header to put the code in? I used common/string.h so far. Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-10-03 11:59:27 -0400, Tom Lane wrote: > I wrote: > > ... However, I did add recent glibc (Fedora 28) > > to the mix, and I was interested to discover that they seem to have > > added a fast-path for format strings that are exactly "%s", just as > > NetBSD did. I wonder if we should reconsider our position on doing > > that. It'd be a simple enough addition... > > I experimented with adding an initial check for "format is exactly %s" > at the top of dopr(), and couldn't get excited about that. Instrumenting > things showed that the optimization fired in only 1.8% of the calls > during a run of our core regression tests. Now, that might not count > as a really representative workload, but it doesn't make me think that > the case is worth optimizing for us. Seems right. I also have a hard time to believe that any of those "%s" printfs are performance critical - we'd hopefully just have avoided the sprintf in that case. > But then it occurred to me that there's more than one way to skin this > cat. We could, for an even cheaper extra test, detect that any one > format specifier is just "%s", and use the same kind of fast-path > within the loop. With the same sort of instrumentation, I found that > a full 45% of the format specs executed in the core regression tests > are just %s. That makes me think that a patch along the lines of the > attached is a good win for our use-cases. Comparing to Fedora 28's > glibc, this gets us to Hm, especially if we special case the float->string conversions directly at the hot callsites, that seems reasonable. I kinda wish we could just easily move the format string processing to compile-time, but given translatability that won't be widely possible even if it were otherwise feasible. Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-10-03 12:07:32 -0400, Tom Lane wrote: >> [ scratches head ... ] How would that work? Seems like it necessarily >> adds a strlen() call to whatever we'd be doing otherwise. palloc isn't >> going to be any faster just from asking it for slightly fewer bytes. >> I think there might be something wrong with your test scenario ... >> or there's more noise in the numbers than you thought. > I guess the difference is that we're more likely to find reusable chunks > in aset.c and/or need fewer OS allocations. As the memory is going to > be touched again very shortly afterwards, the cache effects probably are > neglegible. > The strlen definitely shows up in profiles, it just seems to save at > least as much as it costs. > Doesn't strike me as THAT odd? What it strikes me as is excessively dependent on one particular test scenario. I don't mind optimizations that are tradeoffs between well-understood costs, but this smells like handwaving that's going to lose as much or more often than winning, once it hits the real world. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > FWIW, it seems that using a local buffer and than pstrdup'ing that in > float8out_internal is a bit faster, and would probably save a bit of > memory on average: > float8out using sprintf via pg_double_to_string, pstrdup: > 15370.774 > float8out using strfromd via pg_double_to_string, pstrdup: > 13498.331 [ scratches head ... ] How would that work? Seems like it necessarily adds a strlen() call to whatever we'd be doing otherwise. palloc isn't going to be any faster just from asking it for slightly fewer bytes. I think there might be something wrong with your test scenario ... or there's more noise in the numbers than you thought. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
I wrote: > ... However, I did add recent glibc (Fedora 28) > to the mix, and I was interested to discover that they seem to have > added a fast-path for format strings that are exactly "%s", just as > NetBSD did. I wonder if we should reconsider our position on doing > that. It'd be a simple enough addition... I experimented with adding an initial check for "format is exactly %s" at the top of dopr(), and couldn't get excited about that. Instrumenting things showed that the optimization fired in only 1.8% of the calls during a run of our core regression tests. Now, that might not count as a really representative workload, but it doesn't make me think that the case is worth optimizing for us. But then it occurred to me that there's more than one way to skin this cat. We could, for an even cheaper extra test, detect that any one format specifier is just "%s", and use the same kind of fast-path within the loop. With the same sort of instrumentation, I found that a full 45% of the format specs executed in the core regression tests are just %s. That makes me think that a patch along the lines of the attached is a good win for our use-cases. Comparing to Fedora 28's glibc, this gets us to Test case: %s snprintf time = 8.83615 ms total, 8.83615e-06 ms per iteration pg_snprintf time = 23.9372 ms total, 2.39372e-05 ms per iteration ratio = 2.709 Test case: %sx snprintf time = 59.4481 ms total, 5.94481e-05 ms per iteration pg_snprintf time = 29.8983 ms total, 2.98983e-05 ms per iteration ratio = 0.503 versus what we have as of this morning's commit: Test case: %s snprintf time = 7.7427 ms total, 7.7427e-06 ms per iteration pg_snprintf time = 26.2439 ms total, 2.62439e-05 ms per iteration ratio = 3.390 Test case: %sx snprintf time = 61.4452 ms total, 6.14452e-05 ms per iteration pg_snprintf time = 32.7516 ms total, 3.27516e-05 ms per iteration ratio = 0.533 The penalty for non-%s cases seems to be a percent or so, although it's barely above the noise floor in my tests. regards, tom lane diff --git a/src/port/snprintf.c b/src/port/snprintf.c index cad7345..b9b6add 100644 *** a/src/port/snprintf.c --- b/src/port/snprintf.c *** dopr(PrintfTarget *target, const char *f *** 431,436 --- 431,449 /* Process conversion spec starting at *format */ format++; + + /* Fast path for conversion spec that is exactly %s */ + if (*format == 's') + { + format++; + strvalue = va_arg(args, char *); + Assert(strvalue != NULL); + dostr(strvalue, strlen(strvalue), target); + if (target->failed) + break; + continue; + } + fieldwidth = precision = zpad = leftjust = forcesign = 0; longflag = longlongflag = pointflag = 0; fmtpos = accum = 0;
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-10-03 08:20:14 -0400, Tom Lane wrote: > Andres Freund writes: > >> While there might be value in implementing our own float printing code, > >> I have a pretty hard time getting excited about the cost/benefit ratio > >> of that. I think that what we probably really ought to do here is hack > >> float4out/float8out to bypass the extra overhead, as in the 0002 patch > >> below. > > > I'm thinking we should do a bit more than just that hack. I'm thinking > > of something (barely tested) like > > Meh. The trouble with that is that it relies on the platform's snprintf, > not sprintf, and that brings us right back into a world of portability > hurt. I don't feel that the move to C99 gets us out of worrying about > noncompliant snprintfs --- we're only requiring a C99 *compiler*, not > libc. See buildfarm member gharial for a counterexample. Oh, we could just use sprintf() and tell strfromd the buffer is large enough. I only used snprintf because it seemed more symmetric, and because I was at most 1/3 awake. > I'm happy to look into whether using strfromd when available buys us > anything over using sprintf. I'm not entirely convinced that it will, > because of the need to ASCII-ize and de-ASCII-ize the precision, but > it's worth checking. It's definitely faster. It's not a full-blown format parser, so I guess the cost of the conversion isn't too bad: https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/strfrom-skeleton.c;hb=HEAD#l68 CREATE TABLE somefloats(id serial, data1 float8, data2 float8, data3 float8); INSERT INTO somefloats(data1, data2, data3) SELECT random(), random(), random() FROM generate_series(1, 1000); VACUUM FREEZE somefloats; I'm comparing the times of: COPY somefloats TO '/dev/null'; master (including your commit): 16177.202 ms snprintf using sprintf via pg_double_to_string: 16195.787 snprintf using strfromd via pg_double_to_string: 14856.974 ms float8out using sprintf via pg_double_to_string: 16176.169 float8out using strfromd via pg_double_to_string: 13532.698 FWIW, it seems that using a local buffer and than pstrdup'ing that in float8out_internal is a bit faster, and would probably save a bit of memory on average: float8out using sprintf via pg_double_to_string, pstrdup: 15370.774 float8out using strfromd via pg_double_to_string, pstrdup: 13498.331 Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-10-02 17:54:31 -0400, Tom Lane wrote: >> Here's a version of this patch rebased over commit 625b38ea0. > Cool. Let's get that in... Cool, I'll push it shortly. >> While there might be value in implementing our own float printing code, >> I have a pretty hard time getting excited about the cost/benefit ratio >> of that. I think that what we probably really ought to do here is hack >> float4out/float8out to bypass the extra overhead, as in the 0002 patch >> below. > I'm thinking we should do a bit more than just that hack. I'm thinking > of something (barely tested) like Meh. The trouble with that is that it relies on the platform's snprintf, not sprintf, and that brings us right back into a world of portability hurt. I don't feel that the move to C99 gets us out of worrying about noncompliant snprintfs --- we're only requiring a C99 *compiler*, not libc. See buildfarm member gharial for a counterexample. I'm happy to look into whether using strfromd when available buys us anything over using sprintf. I'm not entirely convinced that it will, because of the need to ASCII-ize and de-ASCII-ize the precision, but it's worth checking. > FWIW, I think there's still a significant argument to be made that we > should work on our floating point IO performance. Both on the input and > output side. It's a significant practical problem. But both a fix like > you describe, and my proposal, should bring us to at least the previous > level of performance for the hot paths. So that'd then just be an > independent consideration. Well, an independent project anyway. I concur that it would have value; but whether it's worth the effort, and the possible behavioral changes, is not very clear to me. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
On 2018-10-02 17:54:31 -0400, Tom Lane wrote: > Here's a version of this patch rebased over commit 625b38ea0. > > That commit's fix for the possibly-expensive memset means that we need > to reconsider performance numbers for this patch. I re-ran my previous > tests, and it's still looking like this is a substantial win, as it makes > snprintf.c faster than the native snprintf for most non-float cases. > We're still stuck at something like 10% penalty for float cases. Cool. Let's get that in... > While there might be value in implementing our own float printing code, > I have a pretty hard time getting excited about the cost/benefit ratio > of that. I think that what we probably really ought to do here is hack > float4out/float8out to bypass the extra overhead, as in the 0002 patch > below. I'm thinking we should do a bit more than just that hack. I'm thinking of something (barely tested) like int pg_double_to_string(char *buf, size_t bufsize, char tp, int precision, double val) { charfmt[8]; #ifdef HAVE_STRFROMD if (precision != -1) { fmt[0] = '%'; fmt[1] = '.'; fmt[2] = '0' + precision / 10; fmt[3] = '0' + precision % 10; fmt[4] = tp; fmt[5] = '\0'; } else { fmt[0] = '%'; fmt[1] = tp; fmt[2] = '\0'; } return strfromd(buf, bufsize, fmt, val); #else if (precision != -1) { fmt[0] = '%'; fmt[1] = '.'; fmt[2] = '*'; fmt[3] = tp; fmt[4] = '\0'; } else { fmt[0] = '%'; fmt[1] = tp; fmt[2] = '\0'; } #undef snprintf return snprintf(buf, bufsize, fmt, precision, val); #define sprintf pg_snprintf #endif } and putting that in string.h or such. Then we'd likely be faster both when going through pg_sprintf etc when strfromd is available, and by using it directly in float8out etc, we'd be at least as fast as before. I can clean that up, just not tonight. FWIW, I think there's still a significant argument to be made that we should work on our floating point IO performance. Both on the input and output side. It's a significant practical problem. But both a fix like you describe, and my proposal, should bring us to at least the previous level of performance for the hot paths. So that'd then just be an independent consideration. Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > I've looked through the patch. Looks good to me. Some minor notes: [ didn't see this till after sending my previous ] > - How about adding our own strchrnul for the case where we don't > HAVE_STRCHRNUL? It's possible that other platforms have something > similar, and the code wouldlook more readable that way. Sure, we could just make a "static inline strchrnul()" for use when !HAVE_STRCHRNUL. No objection. > - I know it's not new, but is it actually correct to use va_arg(args, int64) > for ATYPE_LONGLONG? Well, the problem with just doing s/int64/long long/g is that the code would then fail on compilers without a "long long" type. We could ifdef our way around that, but I don't think the code would end up prettier. Given that we only ever use "ll" modifiers via INT64_FORMAT, and that that'll only be set to "ll" if int64 is indeed "long long", those code paths should be dead code in any situation where the type pun is wrong. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Here's a version of this patch rebased over commit 625b38ea0. That commit's fix for the possibly-expensive memset means that we need to reconsider performance numbers for this patch. I re-ran my previous tests, and it's still looking like this is a substantial win, as it makes snprintf.c faster than the native snprintf for most non-float cases. We're still stuck at something like 10% penalty for float cases. While there might be value in implementing our own float printing code, I have a pretty hard time getting excited about the cost/benefit ratio of that. I think that what we probably really ought to do here is hack float4out/float8out to bypass the extra overhead, as in the 0002 patch below. For reference, I attach the testbed I'm using now plus some results. I wasn't able to get my cranky NetBSD system up today, so I don't have results for that. However, I did add recent glibc (Fedora 28) to the mix, and I was interested to discover that they seem to have added a fast-path for format strings that are exactly "%s", just as NetBSD did. I wonder if we should reconsider our position on doing that. It'd be a simple enough addition... regards, tom lane diff --git a/configure b/configure index 6414ec1..0448c6b 100755 *** a/configure --- b/configure *** fi *** 15100,15106 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" --- 15100,15106 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" diff --git a/configure.in b/configure.in index 158d5a1..23b5bb8 100644 *** a/configure.in --- b/configure.in *** PGAC_FUNC_WCSTOMBS_L *** 1571,1577 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l]) AC_REPLACE_FUNCS(fseeko) case $host_os in --- 1571,1577 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l]) AC_REPLACE_FUNCS(fseeko) case $host_os in diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 90dda8e..7894caa 100644 *** a/src/include/pg_config.h.in --- b/src/include/pg_config.h.in *** *** 523,528 --- 523,531 /* Define to 1 if you have the header file. */ #undef HAVE_STDLIB_H + /* Define to 1 if you have the `strchrnul' function. */ + #undef HAVE_STRCHRNUL + /* Define to 1 if you have the `strerror_r' function. */ #undef HAVE_STRERROR_R diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32 index 93bb773..f7a051d 100644 *** a/src/include/pg_config.h.win32 --- b/src/include/pg_config.h.win32 *** *** 394,399 --- 394,402 /* Define to 1 if you have the header file. */ #define HAVE_STDLIB_H 1 + /* Define to 1 if you have the `strchrnul' function. */ + /* #undef HAVE_STRCHRNUL */ + /* Define to 1 if you have the `strerror_r' function. */ /* #undef HAVE_STRERROR_R */ diff --git a/src/port/snprintf.c b/src/port/snprintf.c index 1be5f70..3094ad8 100644 *** a/src/port/snprintf.c --- b/src/port/snprintf.c *** flushbuffer(PrintfTarget *target) *** 314,320 } ! static void fmtstr(char *value, int leftjust, int minlen, int maxwidth, int pointflag, PrintfTarget *target); static void fmtptr(void *value, PrintfTarget *target); static void fmtint(int64 value, char type, int forcesign, --- 314,322 } ! static bool find_arguments(const char *format, va_list args, ! PrintfArgValue *argvalues); ! static void fmtstr(const char *value, int leftjust, int minlen, int maxwidth, int pointflag,
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-09-26 21:30:25 -0400, Tom Lane wrote: > Here's a rebased version of <15785.1536776...@sss.pgh.pa.us>. > > I think we should try to get this reviewed and committed before > we worry more about the float business. It would be silly to > not be benchmarking any bigger changes against the low-hanging > fruit here. I've looked through the patch. Looks good to me. Some minor notes: - How about adding our own strchrnul for the case where we don't HAVE_STRCHRNUL? It's possible that other platforms have something similar, and the code wouldlook more readable that way. - I know it's not new, but is it actually correct to use va_arg(args, int64) for ATYPE_LONGLONG? Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
On September 26, 2018 9:04:08 PM PDT, Thomas Munro wrote: >On Thu, Sep 27, 2018 at 3:55 PM Andrew Gierth > wrote: >> > "Andres" == Andres Freund writes: >> Andres> Hm, stb's results just for floating point isn't bad. The >above >> Andres> numbers were for %f %f. But as the minimal usage would be >about >> Andres> the internal usage of dopr(), here's comparing %.*f: >> >> Andres> snprintf time = 1324.87 ms total, 0.000264975 ms per >iteration >> Andres> pg time = 1434.57 ms total, 0.000286915 ms per iteration >> Andres> stbsp time = 552.14 ms total, 0.000110428 ms per iteration >> >> Hmm. We had a case recently on IRC where the performance of float8out >> turned out to be the major bottleneck: a table of about 2.7 million >rows >> and ~70 float columns showed an overhead of ~66 seconds for doing >COPY >> as opposed to COPY BINARY (the actual problem report was that doing >> "select * from table" from R was taking a minute+ longer than >expected, >> we got comparative timings for COPY just to narrow down causes). >> >> That translates to approx. 0.00035 ms overhead (i.e. time(float8out) >- >> time(float8send)) per conversion (Linux server, hardware unknown). >> >> That 66 seconds was the difference between 18s and 1m24s, so it >wasn't a >> small factor but totally dominated the query time. > >For perfect and cheap round trip to ASCII, not for human consumption, >I wonder about the hexadecimal binary float literal format from C99 >(and showing up in other places too). I'm not quite sure how we realistically would migrate to that though. Clients and their users won't understand it, and the more knowledgeable ones will already use the binary protocol. Answers -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Performance improvements for src/port/snprintf.c
On September 26, 2018 8:53:27 PM PDT, Andrew Gierth wrote: >> "Andres" == Andres Freund writes: > > Andres> Hm, stb's results just for floating point isn't bad. The above >Andres> numbers were for %f %f. But as the minimal usage would be about > Andres> the internal usage of dopr(), here's comparing %.*f: > > Andres> snprintf time = 1324.87 ms total, 0.000264975 ms per iteration > Andres> pg time = 1434.57 ms total, 0.000286915 ms per iteration > Andres> stbsp time = 552.14 ms total, 0.000110428 ms per iteration > >Hmm. We had a case recently on IRC where the performance of float8out >turned out to be the major bottleneck: a table of about 2.7 million >rows >and ~70 float columns showed an overhead of ~66 seconds for doing COPY >as opposed to COPY BINARY (the actual problem report was that doing >"select * from table" from R was taking a minute+ longer than expected, >we got comparative timings for COPY just to narrow down causes). > >That translates to approx. 0.00035 ms overhead (i.e. time(float8out) - >time(float8send)) per conversion (Linux server, hardware unknown). Sounds like it could be pretty precisely be the cost measured above. My laptop's a bit faster than most server CPUs and the test has perfect branch prediction... >That 66 seconds was the difference between 18s and 1m24s, so it wasn't >a >small factor but totally dominated the query time. Ugh. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Performance improvements for src/port/snprintf.c
On Thu, Sep 27, 2018 at 3:55 PM Andrew Gierth wrote: > > "Andres" == Andres Freund writes: > Andres> Hm, stb's results just for floating point isn't bad. The above > Andres> numbers were for %f %f. But as the minimal usage would be about > Andres> the internal usage of dopr(), here's comparing %.*f: > > Andres> snprintf time = 1324.87 ms total, 0.000264975 ms per iteration > Andres> pg time = 1434.57 ms total, 0.000286915 ms per iteration > Andres> stbsp time = 552.14 ms total, 0.000110428 ms per iteration > > Hmm. We had a case recently on IRC where the performance of float8out > turned out to be the major bottleneck: a table of about 2.7 million rows > and ~70 float columns showed an overhead of ~66 seconds for doing COPY > as opposed to COPY BINARY (the actual problem report was that doing > "select * from table" from R was taking a minute+ longer than expected, > we got comparative timings for COPY just to narrow down causes). > > That translates to approx. 0.00035 ms overhead (i.e. time(float8out) - > time(float8send)) per conversion (Linux server, hardware unknown). > > That 66 seconds was the difference between 18s and 1m24s, so it wasn't a > small factor but totally dominated the query time. For perfect and cheap round trip to ASCII, not for human consumption, I wonder about the hexadecimal binary float literal format from C99 (and showing up in other places too). -- Thomas Munro http://www.enterprisedb.com
Re: Performance improvements for src/port/snprintf.c
> "Andres" == Andres Freund writes: Andres> Hm, stb's results just for floating point isn't bad. The above Andres> numbers were for %f %f. But as the minimal usage would be about Andres> the internal usage of dopr(), here's comparing %.*f: Andres> snprintf time = 1324.87 ms total, 0.000264975 ms per iteration Andres> pg time = 1434.57 ms total, 0.000286915 ms per iteration Andres> stbsp time = 552.14 ms total, 0.000110428 ms per iteration Hmm. We had a case recently on IRC where the performance of float8out turned out to be the major bottleneck: a table of about 2.7 million rows and ~70 float columns showed an overhead of ~66 seconds for doing COPY as opposed to COPY BINARY (the actual problem report was that doing "select * from table" from R was taking a minute+ longer than expected, we got comparative timings for COPY just to narrow down causes). That translates to approx. 0.00035 ms overhead (i.e. time(float8out) - time(float8send)) per conversion (Linux server, hardware unknown). That 66 seconds was the difference between 18s and 1m24s, so it wasn't a small factor but totally dominated the query time. -- Andrew (irc:RhodiumToad)
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-09-26 21:44:41 -0400, Tom Lane wrote: >> BTW, were you thinking of plugging in strfromd() inside snprintf.c, >> or just invoking it directly from float[48]out? The latter would >> presumably be cheaper, and it'd solve the most pressing performance >> problem, if not every problem. > I wasn't actually seriously suggesting we should use strfromd, but I > guess one way to deal with this would be to add a wrapper routine that > could directly be called from float[48]out *and* from fmtfloat(). Yeah, something along that line occurred to me a bit later. > Wonder > if it'd be worthwhile to *not* pass that wrapper a format string, but > instead pass the sprecision as an explicit argument. Right, getting rid of the round trip to text for the precision seems like a win. I'm surprised that strfromd is defined the way it is and not with something like (double val, char fmtcode, int precision, ...) regards, tom lane
Re: Performance improvements for src/port/snprintf.c
On 2018-09-26 21:44:41 -0400, Tom Lane wrote: > Andres Freund writes: > > Reading around the interwebz lead me to look at ryu > > > https://dl.acm.org/citation.cfm?id=3192369 > > https://github.com/ulfjack/ryu/tree/46f4c5572121a6f1428749fe3e24132c3626c946 > > > That's an algorithm that always generates the minimally sized > > roundtrip-safe string output for a floating point number. That makes it > > insuitable for the innards of printf, but it very well could be > > interesting for e.g. float8out, especially when we currently specify a > > "too high" precision to guarantee round-trip safeity. > > Yeah, the whole business of round-trip safety is a bit worrisome. Seems like using a better algorithm also has the potential to make the output a bit smaller / more readable than what we currently produce. > If we change printf, and it produces different low-order digits > than before, will floats still round-trip correctly? I think we > have to ensure that they do. Yea, I think that's an absolutely hard requirement. It'd possibly be a good idea to add an assert that enforce that, although I'm not sure it's worth the portability issues around crappy system libcs that do randomly different things. > BTW, were you thinking of plugging in strfromd() inside snprintf.c, > or just invoking it directly from float[48]out? The latter would > presumably be cheaper, and it'd solve the most pressing performance > problem, if not every problem. I wasn't actually seriously suggesting we should use strfromd, but I guess one way to deal with this would be to add a wrapper routine that could directly be called from float[48]out *and* from fmtfloat(). Wonder if it'd be worthwhile to *not* pass that wrapper a format string, but instead pass the sprecision as an explicit argument. Would make the use in snprintf.c a bit more annoying (due to fFeEgG support), but probably considerably simpler and faster if we ever reimplement that ourself. Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
On Thu, Sep 27, 2018 at 1:18 PM Andres Freund wrote: > On 2018-09-26 17:57:05 -0700, Andres Freund wrote: > > snprintf time = 1324.87 ms total, 0.000264975 ms per iteration > > pg time = 1434.57 ms total, 0.000286915 ms per iteration > > stbsp time = 552.14 ms total, 0.000110428 ms per iteration > > Reading around the interwebz lead me to look at ryu > > https://dl.acm.org/citation.cfm?id=3192369 > https://github.com/ulfjack/ryu/tree/46f4c5572121a6f1428749fe3e24132c3626c946 > > That's an algorithm that always generates the minimally sized > roundtrip-safe string output for a floating point number. That makes it > insuitable for the innards of printf, but it very well could be > interesting for e.g. float8out, especially when we currently specify a > "too high" precision to guarantee round-trip safeity. Wow. While all the algorithms have that round trip goal, they keep doing it faster. I was once interested in their speed for a work problem, and looked into the 30 year old dragon4 and 8 year old grisu3 algorithms. It's amazing to me that we have a new algorithm in 2018 for this ancient problem, and it claims to be 3 times faster than the competition. (Hah, I see that "ryū" is Japanese for dragon. "Grisù" is a dragon from an Italian TV series.) -- Thomas Munro http://www.enterprisedb.com
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-09-26 21:30:25 -0400, Tom Lane wrote: > Here's a rebased version of <15785.1536776...@sss.pgh.pa.us>. > > I think we should try to get this reviewed and committed before > we worry more about the float business. It would be silly to > not be benchmarking any bigger changes against the low-hanging > fruit here. Yea, no arguments there. I'll try to have a look tomorrow. Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > Reading around the interwebz lead me to look at ryu > https://dl.acm.org/citation.cfm?id=3192369 > https://github.com/ulfjack/ryu/tree/46f4c5572121a6f1428749fe3e24132c3626c946 > That's an algorithm that always generates the minimally sized > roundtrip-safe string output for a floating point number. That makes it > insuitable for the innards of printf, but it very well could be > interesting for e.g. float8out, especially when we currently specify a > "too high" precision to guarantee round-trip safeity. Yeah, the whole business of round-trip safety is a bit worrisome. If we change printf, and it produces different low-order digits than before, will floats still round-trip correctly? I think we have to ensure that they do. If we just use strfromd(), then it's libc's problem if the results change ... but if we stick in some code we got from elsewhere, it's our problem. BTW, were you thinking of plugging in strfromd() inside snprintf.c, or just invoking it directly from float[48]out? The latter would presumably be cheaper, and it'd solve the most pressing performance problem, if not every problem. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Here's a rebased version of <15785.1536776...@sss.pgh.pa.us>. I think we should try to get this reviewed and committed before we worry more about the float business. It would be silly to not be benchmarking any bigger changes against the low-hanging fruit here. regards, tom lane diff --git a/configure b/configure index 6414ec1..0448c6b 100755 *** a/configure --- b/configure *** fi *** 15100,15106 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" --- 15100,15106 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" diff --git a/configure.in b/configure.in index 158d5a1..23b5bb8 100644 *** a/configure.in --- b/configure.in *** PGAC_FUNC_WCSTOMBS_L *** 1571,1577 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l]) AC_REPLACE_FUNCS(fseeko) case $host_os in --- 1571,1577 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l]) AC_REPLACE_FUNCS(fseeko) case $host_os in diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 90dda8e..7894caa 100644 *** a/src/include/pg_config.h.in --- b/src/include/pg_config.h.in *** *** 523,528 --- 523,531 /* Define to 1 if you have the header file. */ #undef HAVE_STDLIB_H + /* Define to 1 if you have the `strchrnul' function. */ + #undef HAVE_STRCHRNUL + /* Define to 1 if you have the `strerror_r' function. */ #undef HAVE_STRERROR_R diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32 index 93bb773..f7a051d 100644 *** a/src/include/pg_config.h.win32 --- b/src/include/pg_config.h.win32 *** *** 394,399 --- 394,402 /* Define to 1 if you have the header file. */ #define HAVE_STDLIB_H 1 + /* Define to 1 if you have the `strchrnul' function. */ + /* #undef HAVE_STRCHRNUL */ + /* Define to 1 if you have the `strerror_r' function. */ /* #undef HAVE_STRERROR_R */ diff --git a/src/port/snprintf.c b/src/port/snprintf.c index 2c77eec..1469878 100644 *** a/src/port/snprintf.c --- b/src/port/snprintf.c *** flushbuffer(PrintfTarget *target) *** 310,316 } ! static void fmtstr(char *value, int leftjust, int minlen, int maxwidth, int pointflag, PrintfTarget *target); static void fmtptr(void *value, PrintfTarget *target); static void fmtint(int64 value, char type, int forcesign, --- 310,318 } ! static bool find_arguments(const char *format, va_list args, ! PrintfArgValue *argvalues); ! static void fmtstr(const char *value, int leftjust, int minlen, int maxwidth, int pointflag, PrintfTarget *target); static void fmtptr(void *value, PrintfTarget *target); static void fmtint(int64 value, char type, int forcesign, *** static void fmtfloat(double value, char *** 322,332 PrintfTarget *target); static void dostr(const char *str, int slen, PrintfTarget *target); static void dopr_outch(int c, PrintfTarget *target); static int adjust_sign(int is_negative, int forcesign, int *signvalue); ! static void adjust_padlen(int minlen, int vallen, int leftjust, int *padlen); ! static void leading_pad(int zpad, int *signvalue, int *padlen, PrintfTarget *target); ! static void trailing_pad(int *padlen, PrintfTarget *target); /* --- 324,335 PrintfTarget *target); static void dostr(const char *str, int slen, PrintfTarget *target); static void dopr_outch(int c, PrintfTarget *target); + static void dopr_outchmulti(int c, int slen,
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-09-26 17:57:05 -0700, Andres Freund wrote: > snprintf time = 1324.87 ms total, 0.000264975 ms per iteration > pg time = 1434.57 ms total, 0.000286915 ms per iteration > stbsp time = 552.14 ms total, 0.000110428 ms per iteration Reading around the interwebz lead me to look at ryu https://dl.acm.org/citation.cfm?id=3192369 https://github.com/ulfjack/ryu/tree/46f4c5572121a6f1428749fe3e24132c3626c946 That's an algorithm that always generates the minimally sized roundtrip-safe string output for a floating point number. That makes it insuitable for the innards of printf, but it very well could be interesting for e.g. float8out, especially when we currently specify a "too high" precision to guarantee round-trip safeity. Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
On 2018-09-26 17:40:22 -0700, Andres Freund wrote: > On 2018-09-26 20:25:44 -0400, Tom Lane wrote: > > Andres Freund writes: > > > On 2018-09-26 19:45:07 -0400, Tom Lane wrote: > > >> No, there are no additional layers that weren't there before --- > > >> snprintf.c's snprintf() slots in directly where the platform's did > > >> before. > > > > > Hm? What I mean is that we can't realistically be faster with the > > > current architecture, because for floating point we end up doing glibc > > > sprintf() in either case. > > > > Oh, you mean specifically for the float conversion case. I still say > > that I will *not* accept judging this code solely on the float case. > > Oh, it should definitely not be judged solely based on floating point, > we agree. > > > > The string and integer cases are at least as important if not more so. > > I think the integer stuff has become a *little* bit less important, > because we converted the hot cases over to pg_lto etc. > > > > >> Interesting. strfromd() is a glibc-ism, and a fairly recent one at > > >> that (my RHEL6 box doesn't seem to have it). > > > > > It's C99 afaict. > > > > It's not in POSIX 2008, and I don't see it in my admittedly-draft > > copy of C99 either. But that's not real relevant -- I don't see > > much reason not to use it if we want a quick and dirty answer for > > the platforms that have it. > > Right, I really just wanted some more baseline numbers. > > > > If we had more ambition, we might consider stealing the float > > conversion logic out of the "stb" library that Alexander pointed > > to upthread. It says it's public domain, so there's no license > > impediment to borrowing some code ... > > Yea, I started to play around with doing so with musl, but based on > early my benchmarks it's not fast enough to bother. I've not integrated > it into our code, but instead printed two floating point numbers with > your test: > > musl 500 iterations: > snprintf time = 3144.46 ms total, 0.000628892 ms per iteration > pg_snprintf time = 4215.1 ms total, 0.00084302 ms per iteration > ratio = 1.340 > > glibc 500 iterations: > snprintf time = 1680.82 ms total, 0.000336165 ms per iteration > pg_snprintf time = 2629.46 ms total, 0.000525892 ms per iteration > ratio = 1.564 > > So there's pretty clearly no point in even considering starting from > musl. Hm, stb's results just for floating point isn't bad. The above numbers were for %f %f. But as the minimal usage would be about the internal usage of dopr(), here's comparing %.*f: snprintf time = 1324.87 ms total, 0.000264975 ms per iteration pg time = 1434.57 ms total, 0.000286915 ms per iteration stbsp time = 552.14 ms total, 0.000110428 ms per iteration Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
On 2018-09-26 20:25:44 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2018-09-26 19:45:07 -0400, Tom Lane wrote: > >> No, there are no additional layers that weren't there before --- > >> snprintf.c's snprintf() slots in directly where the platform's did before. > > > Hm? What I mean is that we can't realistically be faster with the > > current architecture, because for floating point we end up doing glibc > > sprintf() in either case. > > Oh, you mean specifically for the float conversion case. I still say > that I will *not* accept judging this code solely on the float case. Oh, it should definitely not be judged solely based on floating point, we agree. > The string and integer cases are at least as important if not more so. I think the integer stuff has become a *little* bit less important, because we converted the hot cases over to pg_lto etc. > >> Interesting. strfromd() is a glibc-ism, and a fairly recent one at > >> that (my RHEL6 box doesn't seem to have it). > > > It's C99 afaict. > > It's not in POSIX 2008, and I don't see it in my admittedly-draft > copy of C99 either. But that's not real relevant -- I don't see > much reason not to use it if we want a quick and dirty answer for > the platforms that have it. Right, I really just wanted some more baseline numbers. > If we had more ambition, we might consider stealing the float > conversion logic out of the "stb" library that Alexander pointed > to upthread. It says it's public domain, so there's no license > impediment to borrowing some code ... Yea, I started to play around with doing so with musl, but based on early my benchmarks it's not fast enough to bother. I've not integrated it into our code, but instead printed two floating point numbers with your test: musl 500 iterations: snprintf time = 3144.46 ms total, 0.000628892 ms per iteration pg_snprintf time = 4215.1 ms total, 0.00084302 ms per iteration ratio = 1.340 glibc 500 iterations: snprintf time = 1680.82 ms total, 0.000336165 ms per iteration pg_snprintf time = 2629.46 ms total, 0.000525892 ms per iteration ratio = 1.564 So there's pretty clearly no point in even considering starting from musl. Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-09-26 19:45:07 -0400, Tom Lane wrote: >> No, there are no additional layers that weren't there before --- >> snprintf.c's snprintf() slots in directly where the platform's did before. > Hm? What I mean is that we can't realistically be faster with the > current architecture, because for floating point we end up doing glibc > sprintf() in either case. Oh, you mean specifically for the float conversion case. I still say that I will *not* accept judging this code solely on the float case. The string and integer cases are at least as important if not more so. >> Interesting. strfromd() is a glibc-ism, and a fairly recent one at >> that (my RHEL6 box doesn't seem to have it). > It's C99 afaict. It's not in POSIX 2008, and I don't see it in my admittedly-draft copy of C99 either. But that's not real relevant -- I don't see much reason not to use it if we want a quick and dirty answer for the platforms that have it. If we had more ambition, we might consider stealing the float conversion logic out of the "stb" library that Alexander pointed to upthread. It says it's public domain, so there's no license impediment to borrowing some code ... regards, tom lane
Re: Performance improvements for src/port/snprintf.c
On 2018-09-26 19:45:07 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2018-09-26 15:04:20 -0700, Andres Freund wrote: > >> I assume this partially is just the additional layers of function calls > >> (psprintf, pvsnprintf, pg_vsnprintf, dopr) that are now done, in > >> addition to pretty much the same work as before (i.e. sprintf("%.*f")). > > No, there are no additional layers that weren't there before --- > snprintf.c's snprintf() slots in directly where the platform's did before. Hm? What I mean is that we can't realistically be faster with the current architecture, because for floating point we end up doing glibc sprintf() in either case. And after the unconditional replacement, we're doing a bunch of *additional* work (at the very least we're parsing the format string twice). > Well, ok, dopr() wasn't there before, but I trust you're not claiming > that glibc's implementation of snprintf() is totally flat either. I don't even think it's all that fast... > > I'm *NOT* proposing that as the actual solution, but as a datapoint, it > > might be interesting that hardcoding the precision and thus allowing use > > ofusing strfromd() instead of sprintf yields a *better* runtime than > > master. > > Interesting. strfromd() is a glibc-ism, and a fairly recent one at > that (my RHEL6 box doesn't seem to have it). But we could use it where > available. And it doesn't seem unreasonable to have a fast path for > the specific precision value(s) that float4/8out will actually use. It's C99 afaict. What I did for my quick hack is to just hack the precision as characters into the format that dopr() uses... Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > I kinda wonder if we shouldn't replace the non pg_* functions in > snprintf.c with a more modern copy of a compatibly licensed libc. Looks > to me like our implementation has forked off some BSD a fair while ago. Maybe, but the benchmarking I was doing last month didn't convince me that the *BSD versions were remarkably fast. There are a lot of cases where our version is faster. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-09-26 15:04:20 -0700, Andres Freund wrote: >> I assume this partially is just the additional layers of function calls >> (psprintf, pvsnprintf, pg_vsnprintf, dopr) that are now done, in >> addition to pretty much the same work as before (i.e. sprintf("%.*f")). No, there are no additional layers that weren't there before --- snprintf.c's snprintf() slots in directly where the platform's did before. Well, ok, dopr() wasn't there before, but I trust you're not claiming that glibc's implementation of snprintf() is totally flat either. I think it's just that snprintf.c is a bit slower in this case. If you look at glibc's implementation, they've expended a heck of a lot of code and sweat on it. The only reason we could hope to beat it is that we're prepared to throw out some functionality, like LC_NUMERIC handling. > I'm *NOT* proposing that as the actual solution, but as a datapoint, it > might be interesting that hardcoding the precision and thus allowing use > ofusing strfromd() instead of sprintf yields a *better* runtime than > master. Interesting. strfromd() is a glibc-ism, and a fairly recent one at that (my RHEL6 box doesn't seem to have it). But we could use it where available. And it doesn't seem unreasonable to have a fast path for the specific precision value(s) that float4/8out will actually use. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
On 2018-08-17 14:32:59 -0400, Tom Lane wrote: > I've been looking into the possible performance consequences of that, > in particular comparing snprintf.c to the library versions on macOS, > FreeBSD, OpenBSD, and NetBSD. While it held up well in simpler cases, > I noted that it was significantly slower on long format strings, which > I traced to two separate problems: > Perhaps there's a way to improve that > without writing our own floating-point conversion code, but I'm not > seeing an easy way offhand. I don't think that's a showstopper though. > This code is now faster than the native code for very many other cases, > so on average it should cause no real performance problem. I kinda wonder if we shouldn't replace the non pg_* functions in snprintf.c with a more modern copy of a compatibly licensed libc. Looks to me like our implementation has forked off some BSD a fair while ago. There seems to be a few choices. Among others: - freebsd libc: https://github.com/freebsd/freebsd/blob/master/lib/libc/stdio/vfprintf.c (floating point stuff is elsewhere) - musl libc: https://git.musl-libc.org/cgit/musl/tree/src/stdio/vfprintf.c - stb (as Alexander referenced earlier) https://github.com/nothings/stb/blob/master/stb_sprintf.h I've not benchmarked any of these. Just by looking at the code, the musl one looks by far the most compact - looks like all the relevant code is in the one file referenced. Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
On 2018-09-26 15:04:20 -0700, Andres Freund wrote: > On 2018-09-12 14:14:15 -0400, Tom Lane wrote: > > Alexander Kuzmenkov writes: > > > I benchmarked this, using your testbed and comparing to libc sprintf > > > (Ubuntu GLIBC 2.27-0ubuntu3) and another implementation I know [1], all > > > compiled with gcc 5. > > > > Thanks for reviewing! > > > > The cfbot noticed that the recent dlopen patch conflicted with this in > > configure.in, so here's a rebased version. The code itself didn't change. > > Conflicts again, but not too hard to resolve. > > The mini benchmark from > http://archives.postgresql.org/message-id/20180926174645.nsyj77lx2mvtz4kx%40alap3.anarazel.de > is significantly improved by this patch. > > > 96bf88d52711ad3a0a4cc2d1d9cb0e2acab85e63: > > > > COPY somefloats TO '/dev/null'; > > COPY 1000 > > Time: 24575.770 ms (00:24.576) > > > > 96bf88d52711ad3a0a4cc2d1d9cb0e2acab85e63^: > > > > COPY somefloats TO '/dev/null'; > > COPY 1000 > > Time: 12877.037 ms (00:12.877) > > This patch: > > postgres[32704][1]=# ;SELECT pg_prewarm('somefloats');COPY somefloats TO > '/dev/null'; > Time: 0.269 ms > ┌┐ > │ pg_prewarm │ > ├┤ > │ 73530 │ > └┘ > (1 row) > > Time: 34.983 ms > COPY 1000 > Time: 15511.478 ms (00:15.511) > > > The profile from 96bf88d52711ad3a0a4cc2d1d9cb0e2acab85e63^ is: > + 38.15% postgres libc-2.27.so [.] __GI___printf_fp_l > + 13.98% postgres libc-2.27.so [.] hack_digit > +7.54% postgres libc-2.27.so [.] __mpn_mul_1 > +7.32% postgres postgres [.] CopyOneRowTo > +6.12% postgres libc-2.27.so [.] vfprintf > +3.14% postgres libc-2.27.so [.] __strlen_avx2 > +1.97% postgres postgres [.] heap_deform_tuple > +1.77% postgres postgres [.] AllocSetAlloc > +1.43% postgres postgres [.] psprintf > +1.25% postgres libc-2.27.so [.] _IO_str_init_static_internal > +1.09% postgres libc-2.27.so [.] _IO_vsnprintf > +1.09% postgres postgres [.] appendBinaryStringInfo > > The profile of master with this patch is: > > + 32.38% postgres libc-2.27.so [.] __GI___printf_fp_l > + 11.08% postgres libc-2.27.so [.] hack_digit > +9.55% postgres postgres [.] CopyOneRowTo > +6.24% postgres libc-2.27.so [.] __mpn_mul_1 > +5.01% postgres libc-2.27.so [.] vfprintf > +4.91% postgres postgres [.] dopr.constprop.4 > +3.53% postgres libc-2.27.so [.] __strlen_avx2 > +1.55% postgres libc-2.27.so [.] __strchrnul_avx2 > +1.49% postgres libc-2.27.so [.] __memmove_avx_unaligned_erms > +1.35% postgres postgres [.] AllocSetAlloc > +1.32% postgres libc-2.27.so [.] _IO_str_init_static_internal > +1.30% postgres postgres [.] FunctionCall1Coll > +1.27% postgres postgres [.] psprintf > +1.16% postgres postgres [.] appendBinaryStringInfo > +1.16% postgres libc-2.27.so [.] _IO_old_init > +1.06% postgres postgres [.] heap_deform_tuple > +1.02% postgres libc-2.27.so [.] sprintf > +1.02% postgres libc-2.27.so [.] _IO_vsprintf > > (all functions above 1%) > > > I assume this partially is just the additional layers of function calls > (psprintf, pvsnprintf, pg_vsnprintf, dopr) that are now done, in > addition to pretty much the same work as before (i.e. sprintf("%.*f")). I'm *NOT* proposing that as the actual solution, but as a datapoint, it might be interesting that hardcoding the precision and thus allowing use ofusing strfromd() instead of sprintf yields a *better* runtime than master. Time: 10255.134 ms (00:10.255) Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
On 2018-09-12 14:14:15 -0400, Tom Lane wrote: > Alexander Kuzmenkov writes: > > I benchmarked this, using your testbed and comparing to libc sprintf > > (Ubuntu GLIBC 2.27-0ubuntu3) and another implementation I know [1], all > > compiled with gcc 5. > > Thanks for reviewing! > > The cfbot noticed that the recent dlopen patch conflicted with this in > configure.in, so here's a rebased version. The code itself didn't change. Conflicts again, but not too hard to resolve. The mini benchmark from http://archives.postgresql.org/message-id/20180926174645.nsyj77lx2mvtz4kx%40alap3.anarazel.de is significantly improved by this patch. > 96bf88d52711ad3a0a4cc2d1d9cb0e2acab85e63: > > COPY somefloats TO '/dev/null'; > COPY 1000 > Time: 24575.770 ms (00:24.576) > > 96bf88d52711ad3a0a4cc2d1d9cb0e2acab85e63^: > > COPY somefloats TO '/dev/null'; > COPY 1000 > Time: 12877.037 ms (00:12.877) This patch: postgres[32704][1]=# ;SELECT pg_prewarm('somefloats');COPY somefloats TO '/dev/null'; Time: 0.269 ms ┌┐ │ pg_prewarm │ ├┤ │ 73530 │ └┘ (1 row) Time: 34.983 ms COPY 1000 Time: 15511.478 ms (00:15.511) The profile from 96bf88d52711ad3a0a4cc2d1d9cb0e2acab85e63^ is: + 38.15% postgres libc-2.27.so [.] __GI___printf_fp_l + 13.98% postgres libc-2.27.so [.] hack_digit +7.54% postgres libc-2.27.so [.] __mpn_mul_1 +7.32% postgres postgres [.] CopyOneRowTo +6.12% postgres libc-2.27.so [.] vfprintf +3.14% postgres libc-2.27.so [.] __strlen_avx2 +1.97% postgres postgres [.] heap_deform_tuple +1.77% postgres postgres [.] AllocSetAlloc +1.43% postgres postgres [.] psprintf +1.25% postgres libc-2.27.so [.] _IO_str_init_static_internal +1.09% postgres libc-2.27.so [.] _IO_vsnprintf +1.09% postgres postgres [.] appendBinaryStringInfo The profile of master with this patch is: + 32.38% postgres libc-2.27.so [.] __GI___printf_fp_l + 11.08% postgres libc-2.27.so [.] hack_digit +9.55% postgres postgres [.] CopyOneRowTo +6.24% postgres libc-2.27.so [.] __mpn_mul_1 +5.01% postgres libc-2.27.so [.] vfprintf +4.91% postgres postgres [.] dopr.constprop.4 +3.53% postgres libc-2.27.so [.] __strlen_avx2 +1.55% postgres libc-2.27.so [.] __strchrnul_avx2 +1.49% postgres libc-2.27.so [.] __memmove_avx_unaligned_erms +1.35% postgres postgres [.] AllocSetAlloc +1.32% postgres libc-2.27.so [.] _IO_str_init_static_internal +1.30% postgres postgres [.] FunctionCall1Coll +1.27% postgres postgres [.] psprintf +1.16% postgres postgres [.] appendBinaryStringInfo +1.16% postgres libc-2.27.so [.] _IO_old_init +1.06% postgres postgres [.] heap_deform_tuple +1.02% postgres libc-2.27.so [.] sprintf +1.02% postgres libc-2.27.so [.] _IO_vsprintf (all functions above 1%) I assume this partially is just the additional layers of function calls (psprintf, pvsnprintf, pg_vsnprintf, dopr) that are now done, in addition to pretty much the same work as before (i.e. sprintf("%.*f")). - Andres
Re: Performance improvements for src/port/snprintf.c
Alexander Kuzmenkov writes: > I benchmarked this, using your testbed and comparing to libc sprintf > (Ubuntu GLIBC 2.27-0ubuntu3) and another implementation I know [1], all > compiled with gcc 5. Thanks for reviewing! The cfbot noticed that the recent dlopen patch conflicted with this in configure.in, so here's a rebased version. The code itself didn't change. regards, tom lane diff --git a/configure b/configure index dd77742..5fa9396 100755 *** a/configure --- b/configure *** fi *** 15060,15066 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" --- 15060,15066 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" diff --git a/configure.in b/configure.in index 3ada48b..93e8556 100644 *** a/configure.in --- b/configure.in *** PGAC_FUNC_WCSTOMBS_L *** 1544,1550 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l]) AC_REPLACE_FUNCS(fseeko) case $host_os in --- 1544,1550 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l]) AC_REPLACE_FUNCS(fseeko) case $host_os in diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 4094e22..752a547 100644 *** a/src/include/pg_config.h.in --- b/src/include/pg_config.h.in *** *** 531,536 --- 531,539 /* Define to 1 if you have the header file. */ #undef HAVE_STDLIB_H + /* Define to 1 if you have the `strchrnul' function. */ + #undef HAVE_STRCHRNUL + /* Define to 1 if you have the `strerror' function. */ #undef HAVE_STRERROR diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32 index 6618b43..ea72c44 100644 *** a/src/include/pg_config.h.win32 --- b/src/include/pg_config.h.win32 *** *** 402,407 --- 402,410 /* Define to 1 if you have the header file. */ #define HAVE_STDLIB_H 1 + /* Define to 1 if you have the `strchrnul' function. */ + /* #undef HAVE_STRCHRNUL */ + /* Define to 1 if you have the `strerror' function. */ #ifndef HAVE_STRERROR #define HAVE_STRERROR 1 diff --git a/src/port/snprintf.c b/src/port/snprintf.c index 851e2ae..66151c2 100644 *** a/src/port/snprintf.c --- b/src/port/snprintf.c *** flushbuffer(PrintfTarget *target) *** 295,301 } ! static void fmtstr(char *value, int leftjust, int minlen, int maxwidth, int pointflag, PrintfTarget *target); static void fmtptr(void *value, PrintfTarget *target); static void fmtint(int64 value, char type, int forcesign, --- 295,303 } ! static bool find_arguments(const char *format, va_list args, ! PrintfArgValue *argvalues); ! static void fmtstr(const char *value, int leftjust, int minlen, int maxwidth, int pointflag, PrintfTarget *target); static void fmtptr(void *value, PrintfTarget *target); static void fmtint(int64 value, char type, int forcesign, *** static void fmtfloat(double value, char *** 307,317 PrintfTarget *target); static void dostr(const char *str, int slen, PrintfTarget *target); static void dopr_outch(int c, PrintfTarget *target); static int adjust_sign(int is_negative, int forcesign, int *signvalue); ! static void adjust_padlen(int minlen, int vallen, int leftjust, int *padlen); ! static void leading_pad(int zpad, int *signvalue, int *padlen, PrintfTarget *target); ! static void trailing_pad(int *padlen, PrintfTarget *target); /* --- 309,320 PrintfTarget *target); static void dostr(const char *str, int slen, PrintfTarget *target); static
Re: Performance improvements for src/port/snprintf.c
I benchmarked this, using your testbed and comparing to libc sprintf (Ubuntu GLIBC 2.27-0ubuntu3) and another implementation I know [1], all compiled with gcc 5.4.0 with -O2. I used bigger decimals in one of the formats, but otherwise they are the same as yours. Here is the table of conversion time relative to libc: format pg stb ("%2$.*3$f %1$d\n", 42, 123.456, 2) 1.03 - ("%.*g", 15, 123.456) 1.08 0.31 ("%10d", 15) 0.63 0.52 ("%s", "012345678900123456789001234 2.06 6.20 ("%d 012345678900123456789001234567 2.03 1.81 ("%1$d 0123456789001234567890012345 1.34 - ("%d %d", 845879348, 994502893) 1.97 0.59 Surprisingly, our implementation is twice faster than libc on "%10d". Stb is faster than we are with floats, but it uses its own algorithm for that. It is also faster with decimals, probably because it uses a two-digit lookup table, not one-digit like we do. Unfortunately it doesn't support dollars. 1. https://github.com/nothings/stb/blob/master/stb_sprintf.h -- Alexander Kuzmenkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Performance improvements for src/port/snprintf.c
I wrote: > [ snprintf-speedups-1.patch ] Here's a slightly improved version of that, with two changes: * Given the current state of the what-about-%m thread, it's no longer academic how well this performs relative to glibc's version. I poked at that and found that a lot of the discrepancy came from glibc using strchrnul() to find the next format specifier --- apparently, that function is a *lot* faster than the equivalent manual loop. So this version uses that if available. * I thought of a couple of easy wins for fmtfloat. We can pass the precision spec down to the platform's sprintf using "*" notation instead of converting it to text and back, and that also simplifies matters enough that we can avoid using an sprintf call to build the simplified format string. This seems to get us down to the vicinity of a 10% speed penalty on microbenchmarks of just float conversion, which is enough to satisfy me given the other advantages of switching to our own snprintf. regards, tom lane diff --git a/configure b/configure index 836d68d..dff9f0c 100755 *** a/configure --- b/configure *** fi *** 15032,15038 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" --- 15032,15038 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" diff --git a/configure.in b/configure.in index 6e14106..c00bb8f 100644 *** a/configure.in --- b/configure.in *** PGAC_FUNC_WCSTOMBS_L *** 1535,1541 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l]) AC_REPLACE_FUNCS(fseeko) case $host_os in --- 1535,1541 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l]) AC_REPLACE_FUNCS(fseeko) case $host_os in diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 827574e..da9cfa7 100644 *** a/src/include/pg_config.h.in --- b/src/include/pg_config.h.in *** *** 519,524 --- 519,527 /* Define to 1 if you have the header file. */ #undef HAVE_STDLIB_H + /* Define to 1 if you have the `strchrnul' function. */ + #undef HAVE_STRCHRNUL + /* Define to 1 if you have the `strerror' function. */ #undef HAVE_STRERROR diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32 index 46ce49d..73d7424 100644 *** a/src/include/pg_config.h.win32 --- b/src/include/pg_config.h.win32 *** *** 390,395 --- 390,398 /* Define to 1 if you have the header file. */ #define HAVE_STDLIB_H 1 + /* Define to 1 if you have the `strchrnul' function. */ + /* #undef HAVE_STRCHRNUL */ + /* Define to 1 if you have the `strerror' function. */ #ifndef HAVE_STRERROR #define HAVE_STRERROR 1 diff --git a/src/port/snprintf.c b/src/port/snprintf.c index 851e2ae..66151c2 100644 *** a/src/port/snprintf.c --- b/src/port/snprintf.c *** flushbuffer(PrintfTarget *target) *** 295,301 } ! static void fmtstr(char *value, int leftjust, int minlen, int maxwidth, int pointflag, PrintfTarget *target); static void fmtptr(void *value, PrintfTarget *target); static void fmtint(int64 value, char type, int forcesign, --- 295,303 } ! static bool find_arguments(const char *format, va_list args, ! PrintfArgValue *argvalues); ! static void fmtstr(const char *value, int leftjust, int minlen, int maxwidth, int pointflag, PrintfTarget *target); static void fmtptr(void *value, PrintfTarget *target); static void fmtint(int64 value, char type, int forcesign, *** static void
Performance improvements for src/port/snprintf.c
Over in the what-about-%m thread, we speculated about replacing the platform's *printf functions if they didn't support %m, which would basically mean using src/port/snprintf.c on all non-glibc platforms, rather than only on Windows as happens right now (ignoring some obsolete platforms with busted snprintf's). I've been looking into the possible performance consequences of that, in particular comparing snprintf.c to the library versions on macOS, FreeBSD, OpenBSD, and NetBSD. While it held up well in simpler cases, I noted that it was significantly slower on long format strings, which I traced to two separate problems: 1. Our implementation always scans the format string twice, so that it can sort out argument-ordering options (%n$). Everybody else is bright enough to do that only for formats that actually use %n$, and it turns out that it doesn't really cost anything extra to do so: you can just perform the extra scan when and if you first find a dollar specifier. (Perhaps there's an arguable downside for this, with invalid format strings that have non-dollar conversion specs followed by dollar ones: with this approach we might fetch some arguments before realizing that the format is broken. But a wrong format can cause indefinitely bad results already, so that seems like a pretty thin objection to me, especially if all other implementations share the same hazard.) 2. Our implementation is shoving simple data characters in the format out to the result buffer one at a time. More common is to skip to the next % as fast as possible, and then dump anything skipped over using the string-output code path, reducing the overhead of buffer overrun checking. The attached patch fixes both of those things, and also does some micro-optimization hacking to avoid loops around dopr_outch() as well as unnecessary use of pass-by-ref arguments. This version stacks up pretty well against all the libraries I compared it to. The remaining weak spot is that floating-point conversions are consistently 30%-50% slower than the native libraries, which is not terribly surprising considering that our implementation involves calling the native sprintf and then massaging the result. Perhaps there's a way to improve that without writing our own floating-point conversion code, but I'm not seeing an easy way offhand. I don't think that's a showstopper though. This code is now faster than the native code for very many other cases, so on average it should cause no real performance problem. I've attached both the patch and a simple performance testbed in case anybody wants to do their own measurements. For reference's sake, these are the specific test cases I looked at: snprintf(buffer, sizeof(buffer), "%2$.*3$f %1$d\n", 42, 123.456, 2); snprintf(buffer, sizeof(buffer), "%.*g", 15, 123.456); snprintf(buffer, sizeof(buffer), "%d %d", 15, 16); snprintf(buffer, sizeof(buffer), "%10d", 15); snprintf(buffer, sizeof(buffer), "%s", "0123456789001234567890012345678900123456789001234567890012345678900123456789001234567890012345678900123456789001234567890012345678900123456789001234567890012345678900123456789001234567890012345678900123456789001234567890"); snprintf(buffer, sizeof(buffer), "%d 0123456789001234567890012345678900123456789001234567890012345678900123456789001234567890012345678900123456789001234567890012345678900123456789001234567890012345678900123456789001234567890012345678900123456789001234567890", snprintf(buffer, sizeof(buffer), "%1$d 0123456789001234567890012345678900123456789001234567890012345678900123456789001234567890012345678900123456789001234567890012345678900123456789001234567890012345678900123456789001234567890012345678900123456789001234567890", 42); A couple of other notes of interest: * The skip-to-next-% searches could alternatively be implemented with strchr(), although then you need a strlen() call if there isn't another %. glibc's version of strchr() is fast enough to make that a win, but since we're not contemplating using this atop glibc, that's not a case we care about. On other platforms the manual loop mostly seems to be faster. * NetBSD seems to have a special fast path for the case that the format string is exactly "%s". I did not adopt that idea here, reasoning that checking for it would add overhead to all other cases, making it probably a net loss overall. I'm prepared to listen to arguments otherwise, though. It is a common case, I just doubt it's common enough (and other library authors seem to agree). I'll add this to the upcoming CF. regards, tom lane diff --git a/src/port/snprintf.c b/src/port/snprintf.c index 851e2ae..211ff1b 100644 *** a/src/port/snprintf.c --- b/src/port/snprintf.c *** flushbuffer(PrintfTarget