Re: [PATCH] Make integer output faster in libgfortran
Follow-up patch committed, after my use of the one-argument variant of static_assert() broke bootstrap on Solaris (sorry Rainer!). The one-arg form is new in C23, while Solaris only supports the two-arg form (C11). I have confirmed that other target libraries use the two-arg form, and bootstrapped the attached patch on x86_64-pc-linux-gnu.

FX

Attachment: static_assert.diff
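For readers who have not run into this difference, here is a minimal sketch of the two-argument form. It is not taken from static_assert.diff; the conditions and messages are illustrative assumptions only.

```c
#include <assert.h>   /* provides the static_assert macro in C11 */
#include <stdint.h>

/* Two-argument form: condition plus message. Valid in C11 and later. */
static_assert (sizeof (uint64_t) == 8, "uint64_t must be 8 bytes wide");

/* The underlying C11 keyword also takes both arguments. */
_Static_assert (sizeof (uintmax_t) >= sizeof (uint64_t),
                "uintmax_t must be at least as wide as uint64_t");

/* The one-argument form, static_assert (sizeof (uint64_t) == 8);
   is only accepted from C23 on, so an older C11-only environment
   rejects it. It is left in a comment so this file stays portable. */
```

Always passing a message string costs nothing and keeps the code buildable on systems that stop at C11.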
Re: [PATCH] Make integer output faster in libgfortran
Hi FX,

>> (We could also do something like that for a 32-bit system, but
>> that is another kettle of fish).
>
> We probably wouldn’t get a speed-up that big. Even on 32-bit targets (at least common ones), the 64-bit type and its operations (notably division) are implemented via CPU instructions, not library calls.

I'll look at this a bit more closely and report :-)

> At this point, the output of integers is probably bound by the many layers of indirection of libgfortran's I/O system (which are necessary because of the rich I/O formatting allowed by the standard).

There are a few things we could do. Getting an LTO-capable version of libgfortran would be a huge step, because the compiler could then strip out all of these layers.

The speed of

  character :: c
  write (*,'(A)', advance="no") c

could stand some improvement :-)

Regards

Thomas
Re: [PATCH] Make integer output faster in libgfortran
Hi,

> I tested this on x86_64-pc-linux-gnu with
> make -k -j8 check-fortran RUNTESTFLAGS="--target_board=unix'{-m32,-m64}'"
> and didn't see any problems.

Thanks Thomas! Pushed.

> (We could also do something like that for a 32-bit system, but
> that is another kettle of fish).

We probably wouldn’t get a speed-up that big. Even on 32-bit targets (at least common ones), the 64-bit type and its operations (notably division) are implemented via CPU instructions, not library calls.

At this point, the output of integers is probably bound by the many layers of indirection of libgfortran's I/O system (which are necessary because of the rich I/O formatting allowed by the standard).

Best,
FX
Re: [PATCH] Make integer output faster in libgfortran
Hi FX,

> right now I don’t have a Linux system with 32-bit support. I’ll see how I can connect to gcc45, but if someone who is already set up to do so can fire a quick regtest, that would be great ;)

I tested this on x86_64-pc-linux-gnu with

make -k -j8 check-fortran RUNTESTFLAGS="--target_board=unix'{-m32,-m64}'"

and didn't see any problems.

So, OK for trunk.

(We could also do something like that for a 32-bit system, but that is another kettle of fish).

Thanks for taking this up!

Best regards

Thomas
Re: [PATCH] Make integer output faster in libgfortran
Hi Thomas,

> There are two possibilities: Either use gcc45 on the compile farm, or
> run it with
> make -k -j8 check-fortran RUNTESTFLAGS="--target_board=unix'{-m32,-m64}'"

Thanks, right now I don’t have a Linux system with 32-bit support. I’ll see how I can connect to gcc45, but if someone who is already set up to do so can fire a quick regtest, that would be great ;)

FX
Re: [PATCH] Make integer output faster in libgfortran
Hi FX,

> The patch has been bootstrapped and regtested on two 64-bit targets: aarch64-apple-darwin21 (development branch) and x86_64-pc-linux-gnu. I would like it to be tested on a 32-bit target without 128-bit integer type. Does someone have access to that?

There are two possibilities: Either use gcc45 on the compile farm, or run it with

make -k -j8 check-fortran RUNTESTFLAGS="--target_board=unix'{-m32,-m64}'"

which is the magic incantation to also use -m32 binaries. You'll need the 32-bit support on your Linux system, of course (which you can check quickly with a "hello world" kind of program with -m32).

Regards

Thomas
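As an illustration of that quick check, here is a small sketch of a "hello world" that also reports what the target looks like. The function names are made up for this example; `__SIZEOF_INT128__` is the macro GCC predefines on targets where `__int128` exists.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

static_assert (CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

/* Width of a pointer in bits: 32 when built with -m32, 64 with -m64. */
static int
pointer_bits (void)
{
  return (int) (sizeof (void *) * CHAR_BIT);
}

/* 1 if the compiler provides a 128-bit integer type for this target. */
static int
has_int128 (void)
{
#ifdef __SIZEOF_INT128__
  return 1;
#else
  return 0;
#endif
}

/* Print a one-line summary of the target to the given stream. */
static void
report_target (FILE *out)
{
  fprintf (out, "hello from a %d-bit target, __int128 is %s\n",
           pointer_bits (), has_int128 () ? "available" : "not available");
}
```

Calling `report_target (stdout)` from `main` and building the file once with `gcc -m32` and once with `gcc -m64` shows whether the 32-bit multilib is usable: if the `-m32` build fails to compile or link, the 32-bit support packages are most likely missing.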
[PATCH] Make integer output faster in libgfortran
Hi,

Integer output in libgfortran is done by passing values as the largest integer type available. This is also what our gfc_itoa() function for conversion to decimal form uses, performing a series of divisions by 10. On targets with a 128-bit integer type (which is most targets, really, nowadays), that division is slow, because it is implemented in software and requires a call to a libgcc function.

We can speed this up in two easy ways:

- If the value fits into 64 bits, use a simple 64-bit itoa() function, which does the series of divisions by 10 in hardware. Most I/O will actually fall into that case in real life, unless you’re printing very big 128-bit integers.
- If the value does not fit into 64 bits, perform only one slow division, by 10^19, and use two calls to the 64-bit function to output each part (the low part needing zero-padding).

What is the speed-up? It really depends on the exact nature of the I/O done. For the most common case, list-directed I/O with no special format, the patch does not speed up (or slow down!) things for values up to HUGE(KIND=4), but speeds things up for larger values. For very large 128-bit values, it can cut the I/O time in half.

I attach my own timing code to this email.
Results before the patch (with previous itoa-patch applied, though):

Timing for INTEGER(KIND=1)
  Value 0, time:                0.191409990
  Value HUGE(KIND=1), time:     0.173687011
Timing for INTEGER(KIND=4)
  Value 0, time:                0.171809018
  Value 1049, time:             0.177439988
  Value HUGE(KIND=4), time:     0.217984974
Timing for INTEGER(KIND=8)
  Value 0, time:                0.178072989
  Value HUGE(KIND=4), time:     0.214841008
  Value HUGE(KIND=8), time:     0.276726007
Timing for INTEGER(KIND=16)
  Value 0, time:                0.175235987
  Value HUGE(KIND=4), time:     0.217689037
  Value HUGE(KIND=8), time:     0.280257106
  Value HUGE(KIND=16), time:    0.420036077

Results after the patch:

Timing for INTEGER(KIND=1)
  Value 0, time:                0.194633007
  Value HUGE(KIND=1), time:     0.172436997
Timing for INTEGER(KIND=4)
  Value 0, time:                0.167517006
  Value 1049, time:             0.176503003
  Value HUGE(KIND=4), time:     0.172892988
Timing for INTEGER(KIND=8)
  Value 0, time:                0.171101034
  Value HUGE(KIND=4), time:     0.174461007
  Value HUGE(KIND=8), time:     0.180289030
Timing for INTEGER(KIND=16)
  Value 0, time:                0.175765991
  Value HUGE(KIND=4), time:     0.181162953
  Value HUGE(KIND=8), time:     0.186082959
  Value HUGE(KIND=16), time:    0.207401991

Times are CPU times in seconds, for one million integer writes into a buffer string. With the patch, we see that integer decimal output is almost independent of the value written, meaning the I/O library overhead is dominant, not the decimal conversion. For this reason, I don’t think we really need a faster implementation of the 64-bit itoa, and can keep the current series-of-divisions-by-10 approach.

---

This patch applies on top of my previous itoa-related patch at https://gcc.gnu.org/pipermail/fortran/2021-December/057218.html

The patch has been bootstrapped and regtested on two 64-bit targets: aarch64-apple-darwin21 (development branch) and x86_64-pc-linux-gnu. I would like it to be tested on a 32-bit target without a 128-bit integer type. Does someone have access to that?

Once tested on a 32-bit target, OK to commit?
FX

Attachments: itoa-faster.patch, timing.f90
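Since the patch itself is only available as an attachment, here is a rough sketch of the two-step conversion described in the message above. It is not the code from itoa-faster.patch: the `sketch_` function names, the `u128` typedef and the buffer convention are assumptions made for this example, and it needs GCC or Clang on a target with `unsigned __int128`. It handles values up to 2^127 (enough for the magnitude of any signed 128-bit integer), because 2^127 / 10^19 is about 1.7 * 10^19 and so the high part still fits in 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the compiler provides a 128-bit integer type. */
typedef unsigned __int128 u128;

#define TEN19 UINT64_C (10000000000000000000)   /* 10^19 fits in 64 bits */

/* Write the decimal digits of n so that they end just before `end`;
   return a pointer to the first digit. Uses 64-bit divisions only. */
static char *
sketch_itoa64 (uint64_t n, char *end)
{
  char *p = end;
  do
    {
      *--p = (char) ('0' + n % 10);
      n /= 10;
    }
  while (n != 0);
  return p;
}

/* Same, but always produce exactly 19 digits, zero-padded on the left.
   n must be below 10^19. */
static char *
sketch_itoa64_pad19 (uint64_t n, char *end)
{
  char *start = end - 19;
  memset (start, '0', 19);
  for (char *p = end; n != 0; n /= 10)
    *--p = (char) ('0' + n % 10);
  return start;
}

/* Convert n (at most 2^127) to decimal at the end of buf[0..len).
   Returns a pointer to the first digit inside buf. */
static const char *
sketch_itoa128 (u128 n, char *buf, size_t len)
{
  assert (len >= 40);           /* 39 digits plus the terminating NUL */
  char *end = buf + len - 1;
  *end = '\0';

  /* Fast path: the value fits in 64 bits. */
  if (n <= UINT64_MAX)
    return sketch_itoa64 ((uint64_t) n, end);

  /* Slow path: a single 128-bit division by 10^19; the remainder is
     recovered with a multiply and subtract instead of a second division. */
  u128 hi = n / TEN19;
  uint64_t lo = (uint64_t) (n - hi * TEN19);
  return sketch_itoa64 ((uint64_t) hi, sketch_itoa64_pad19 (lo, end));
}
```

The low part has to be zero-padded because its leading zeros are significant once a high part is printed in front of it: for 3 * 10^19 + 7, the low part is 7 but must appear as nineteen digits.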