Re: [Patch, libfortran] Improve performance of byte swapped IO
On Wed, Jan 23, 2013 at 12:32 AM, Thomas Koenig wrote:
> Hi Janne,
>
>> PING**2
>
> this is OK. Thanks a lot for the work you put into this!

Thanks for the review; committed as r195413.

--
Janne Blomqvist
Re: [Patch, libfortran] Improve performance of byte swapped IO
Hi Janne,

> PING**2

this is OK. Thanks a lot for the work you put into this!

Thomas
Re: [Patch, libfortran] Improve performance of byte swapped IO
PING**2

On Mon, Jan 14, 2013 at 12:44 AM, Janne Blomqvist wrote:
> PING**1.2
>
> Yet another slightly updated patch attached. Compared to the previous
> version, now with specializations for size 12 and 16 as well.
> [...]
Re: [Patch, libfortran] Improve performance of byte swapped IO
PING**1.2

Yet another slightly updated patch attached. Compared to the previous
version, now with specializations for size 12 and 16 as well. For the
real(10) benchmark, with the previous v3 patch (please disregard the
absolute values in the post quoted below, they were wrong due to a
bug):

Unformatted sequential write/read performance test
Record size    Write MB/s            Read MB/s
==========================================================
          4    80.578833140738340    127.33074266188656
          8    137.61682156650559    184.49033790407984
         16    202.72871312800621    275.98801561061816
         32    275.33538767460863    413.43956672052303
         64    341.04488670485119    555.13744525826564
        128    384.77917051919820    671.44655208024699
        256    410.97208129045833    763.97660513918527
        512    425.76619227779878    826.41086693364593
       1024    430.77035999730009    840.30757120448550
       2048    438.30318459339475    885.50033810296600
       4096    455.79422809097599    919.78265920652086
       8192    465.74499205886326    959.06963983370918
      16384    472.48133493971142    991.11244162081744
      32768    471.00024619567603    1015.7428144049615
      65536    474.91235280949985    1021.2150519080892
     131072    475.18664487440901    1006.3701982554830
     262144    478.00435092846868    985.17141300594039
     524288    476.72837201590363    991.74226579987987

With the new v4 patch:

Unformatted sequential write/read performance test
Record size    Write MB/s            Read MB/s
==========================================================
          4    87.353141847504133    145.09410391177835
          8    166.95093628370549    223.60877830048437
         16    272.20937208187746    364.91673986840277
         32    415.26016354252715    599.41744252952310
         64    592.97676703528009    900.53345964312450
        128    748.27218547147686    1189.7131837787238
        256    874.83098506714384    1561.3649529261234
        512    935.69494481144284    1823.1760143164879
       1024    983.51689491813215    1931.8773088107300
       2048    1009.5491761651396    1971.6978586130062
       4096    1115.5862027658552    2119.4151169997808
       8192    1172.9400229568287    2184.1403983641089
      16384    1222.6659284153168    2258.5490449229878
      32768    1242.2417626697293    2251.8159046253918
      65536    1227.9967943962313.4106672387143
     131072    1204.4295656544052    2129.1309150039478
     262144    1135.7905614378458    2154.7146453789856
     524288    1075.5769074402640    2170.5151501933169

On Fri, Jan 11, 2013 at 10:41 PM, Janne Blomqvist wrote:
> PING.
>
> Slightly updated patch attached, which further improves the generic
> size fallback that is used when the element size is not 2/4/8 bytes.
> Changing the us_perf benchmark to use real(10), with the v2 patch the
> performance is:
> [...]
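The patch itself is an attachment and is not reproduced in this thread, so the following is only a guess at what a size-16 specialization could look like, not the committed code. It reverses each 16-byte element by byte-swapping its two 64-bit halves and exchanging them. The function name `swap16_elements` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not taken from the patch): reverse the byte order
   of each 16-byte element.  dest may be the same pointer as src, because
   both halves are loaded before anything is stored.  */
static void
swap16_elements (void *dest, const void *src, size_t nelems)
{
  unsigned char *d = dest;
  const unsigned char *s = src;

  for (size_t i = 0; i < nelems; i++, d += 16, s += 16)
    {
      uint64_t lo, hi;

      memcpy (&lo, s, 8);       /* bytes 0..7 of the element */
      memcpy (&hi, s + 8, 8);   /* bytes 8..15 of the element */
      lo = __builtin_bswap64 (lo);
      hi = __builtin_bswap64 (hi);
      memcpy (d, &hi, 8);       /* reversed upper half goes first */
      memcpy (d + 8, &lo, 8);   /* reversed lower half goes last */
    }
}
```

A size-12 variant could follow the same idea with three `__builtin_bswap32` calls, storing the three swapped words in reverse order.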
Re: [Patch, libfortran] Improve performance of byte swapped IO
PING.

Slightly updated patch attached, which further improves the generic
size fallback that is used when the element size is not 2/4/8 bytes.
Changing the us_perf benchmark to use real(10), with the v2 patch the
performance is:

Unformatted sequential write/read performance test
Record size    Write MB/s            Read MB/s
==========================================================
          4    59.028550429522085    86.019754350948787
          8    79.028327063130590    95.803502000733374
         16    99.980457395413296    138.68367462874946
         32    122.56886206338788    180.05609910155042
         64    152.00478266944486    212.69931319407567
        128    197.74137934940202    235.19728791956828
        256    155.36245780017779    244.60578379215929
        512    157.13385845966246    245.07467397691480
       1024    177.26553799130201    260.44908357795623
       2048    208.22852888945587    260.21587143113527
       4096    222.88410474980634    262.66162209490591
       8192    226.71167580652920    265.81191407123663
      16384    206.51818241747065    263.59395165591724
      32768    230.18707026455866    265.88990325026526
      65536    229.19783089391504    268.04485112932684
     131072    231.1221566209267.40543904427710
     262144    230.72012123598142    267.60086931504122
     524288    230.48959460456055    268.78750211303725

With the new v3 patch I get

Unformatted sequential write/read performance test
Record size    Write MB/s            Read MB/s
==========================================================
          4    59.779061121239941    92.777125264010024
          8    92.727504266051341    126.64775563782673
         16    128.94793911163904    184.69194300482837
         32    169.78916283536847    267.06752001266767
         64    209.50296476919556    341.60515130910238
        128    236.36709738360679    416.73212655882151
        256    251.79029695383340    465.46804746749740
        512    259.62269939828633    500.87346060356265
       1024    265.08842337586458    508.95530627428275
       2048    268.71795530051884    532.12211365683640
       4096    280.86546884821030    546.88907054369884
       8192    286.96049684823578    569.60958187426183
      16384    292.04368984868103    608.11503416324865
      32768    292.96677387959392    629.80651297065833
      65536    291.69098580137114    624.27103478079641
     131072    292.75666234956418    605.99766136491496
     262144    291.35520038228975    611.59061455535834
     524288    292.15446100501691    623.76232623081580

On Sat, Jan 5, 2013 at 11:13 PM, Janne Blomqvist wrote:
> On Sat, Jan 5, 2013 at 5:35 PM, Richard Biener wrote:
>> On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab wrote:
>>> Janne Blomqvist writes:
>>>
>>>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>>>> index c8ecc3a..bf2250a 100644
>>>> --- a/libgfortran/io/file_pos.c
>>>> +++ b/libgfortran/io/file_pos.c
>>>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
>>>>      }
>>>>    else
>>>>      {
>>>> +      uint32_t u32;
>>>> +      uint64_t u64;
>>>>        switch (length)
>>>>          {
>>>>          case sizeof(GFC_INTEGER_4):
>>>> -          reverse_memcpy (&m4, p, sizeof (m4));
>>>> +          memcpy (&u32, p, sizeof (u32));
>>>> +          u32 = __builtin_bswap32 (u32);
>>>> +          m4 = *(GFC_INTEGER_4*)&u32;
>>>
>>> Isn't that an aliasing violation?
>>
>> It looks like one. Why not simply do
>>
>>    m4 = (GFC_INTEGER_4) u32;
>>
>> ? I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?
>
> Yes, GFC_INTEGER_4 is a typedef for int32_t. As for why I didn't do
> the above, C99 6.3.1.3(3) says that if the unsigned value is outside
> the range of the signed variable, the result is
> implementation-defined. Though I suppose the sensible
> "implementation-defined behavior" in this case on a two's complement
> target is to just do a bitwise copy.
>
> Anyway, to be really safe one could use memcpy instead; the compiler
> optimizes small fixed size memcpy's just fine. Updated patch attached.
>
>
> --
> Janne Blomqvist

--
Janne Blomqvist

bswap3.diff
Description: Binary data
Re: [Patch, libfortran] Improve performance of byte swapped IO
On Sat, Jan 5, 2013 at 10:13 PM, Janne Blomqvist wrote:
> On Sat, Jan 5, 2013 at 5:35 PM, Richard Biener wrote:
>> On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab wrote:
>>> Janne Blomqvist writes:
>>>
>>>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>>>> index c8ecc3a..bf2250a 100644
>>>> --- a/libgfortran/io/file_pos.c
>>>> +++ b/libgfortran/io/file_pos.c
>>>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
>>>>      }
>>>>    else
>>>>      {
>>>> +      uint32_t u32;
>>>> +      uint64_t u64;
>>>>        switch (length)
>>>>          {
>>>>          case sizeof(GFC_INTEGER_4):
>>>> -          reverse_memcpy (&m4, p, sizeof (m4));
>>>> +          memcpy (&u32, p, sizeof (u32));
>>>> +          u32 = __builtin_bswap32 (u32);
>>>> +          m4 = *(GFC_INTEGER_4*)&u32;
>>>
>>> Isn't that an aliasing violation?
>>
>> It looks like one. Why not simply do
>>
>>    m4 = (GFC_INTEGER_4) u32;
>>
>> ? I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?
>
> Yes, GFC_INTEGER_4 is a typedef for int32_t. As for why I didn't do
> the above, C99 6.3.1.3(3) says that if the unsigned value is outside
> the range of the signed variable, the result is
> implementation-defined. Though I suppose the sensible
> "implementation-defined behavior" in this case on a two's complement
> target is to just do a bitwise copy.

As libgfortran is a target library and thus always compiled by GCC, you
can rely on GCC's documented implementation-defined behavior here (which
is to do bitwise re-interpretation). No need to obfuscate the code more
than necessary.

Richard.

> Anyway, to be really safe one could use memcpy instead; the compiler
> optimizes small fixed size memcpy's just fine. Updated patch attached.
>
>
> --
> Janne Blomqvist
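For illustration, here is a minimal sketch of the plain-cast variant discussed above. It assumes the code is built with GCC, whose manual documents out-of-range conversion to a signed type as reduction modulo 2^N, so the bit pattern is kept. The helper name `load_swapped_marker` is hypothetical, and the `GFC_INTEGER_4` typedef is a stand-in for the libgfortran one.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for libgfortran's GFC_INTEGER_4, which the thread states is
   a typedef for int32_t.  */
typedef int32_t GFC_INTEGER_4;

/* Hypothetical helper: read a byte-swapped 4-byte record marker from p.  */
static GFC_INTEGER_4
load_swapped_marker (const void *p)
{
  uint32_t u32;

  /* memcpy avoids any alignment or aliasing assumptions about p.  */
  memcpy (&u32, p, sizeof (u32));
  u32 = __builtin_bswap32 (u32);

  /* Implementation-defined per C99 6.3.1.3(3) when u32 > INT32_MAX.
     GCC documents reduction modulo 2^32, so the bits are preserved.  */
  return (GFC_INTEGER_4) u32;
}
```

Unlike `*(GFC_INTEGER_4*)&u32`, the cast converts a value rather than reinterpreting an object through a pointer, which is the distinction the review comments turn on.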
Re: [Patch, libfortran] Improve performance of byte swapped IO
On Sat, Jan 5, 2013 at 5:35 PM, Richard Biener wrote:
> On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab wrote:
>> Janne Blomqvist writes:
>>
>>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>>> index c8ecc3a..bf2250a 100644
>>> --- a/libgfortran/io/file_pos.c
>>> +++ b/libgfortran/io/file_pos.c
>>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
>>>      }
>>>    else
>>>      {
>>> +      uint32_t u32;
>>> +      uint64_t u64;
>>>        switch (length)
>>>          {
>>>          case sizeof(GFC_INTEGER_4):
>>> -          reverse_memcpy (&m4, p, sizeof (m4));
>>> +          memcpy (&u32, p, sizeof (u32));
>>> +          u32 = __builtin_bswap32 (u32);
>>> +          m4 = *(GFC_INTEGER_4*)&u32;
>>
>> Isn't that an aliasing violation?
>
> It looks like one. Why not simply do
>
>    m4 = (GFC_INTEGER_4) u32;
>
> ? I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?

Yes, GFC_INTEGER_4 is a typedef for int32_t. As for why I didn't do
the above, C99 6.3.1.3(3) says that if the unsigned value is outside
the range of the signed variable, the result is
implementation-defined. Though I suppose the sensible
"implementation-defined behavior" in this case on a two's complement
target is to just do a bitwise copy.

Anyway, to be really safe one could use memcpy instead; the compiler
optimizes small fixed size memcpy's just fine. Updated patch attached.

--
Janne Blomqvist

bswap2.diff
Description: Binary data
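The memcpy alternative mentioned above might look roughly like the sketch below. The updated patch is an attachment not shown in the thread, so this is an assumption about its general shape rather than a copy of it. The helper name `load_swapped_marker_memcpy` is hypothetical and the typedef is a stand-in.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for libgfortran's GFC_INTEGER_4 (a typedef for int32_t
   according to the thread).  */
typedef int32_t GFC_INTEGER_4;

/* Hypothetical helper: read a byte-swapped 4-byte marker from p without
   a pointer cast and without a value conversion.  */
static GFC_INTEGER_4
load_swapped_marker_memcpy (const void *p)
{
  uint32_t u32;
  GFC_INTEGER_4 m4;

  memcpy (&u32, p, sizeof (u32));
  u32 = __builtin_bswap32 (u32);

  /* Copy the object representation; both types are 4 bytes wide, so
     this moves the bit pattern unchanged.  */
  memcpy (&m4, &u32, sizeof (m4));
  return m4;
}
```

With a fixed size of 4, compilers typically turn both memcpy calls into plain register moves, which is the point made about small fixed-size copies.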
Re: [Patch, libfortran] Improve performance of byte swapped IO
On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab wrote:
> Janne Blomqvist writes:
>
>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>> index c8ecc3a..bf2250a 100644
>> --- a/libgfortran/io/file_pos.c
>> +++ b/libgfortran/io/file_pos.c
>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
>>      }
>>    else
>>      {
>> +      uint32_t u32;
>> +      uint64_t u64;
>>        switch (length)
>>          {
>>          case sizeof(GFC_INTEGER_4):
>> -          reverse_memcpy (&m4, p, sizeof (m4));
>> +          memcpy (&u32, p, sizeof (u32));
>> +          u32 = __builtin_bswap32 (u32);
>> +          m4 = *(GFC_INTEGER_4*)&u32;
>
> Isn't that an aliasing violation?

It looks like one. Why not simply do

   m4 = (GFC_INTEGER_4) u32;

? I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?

Richard.

>
> Andreas.
>
> --
> Andreas Schwab, sch...@linux-m68k.org
> GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
> "And now for something completely different."
Re: [Patch, libfortran] Improve performance of byte swapped IO
Janne Blomqvist writes:

> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
> index c8ecc3a..bf2250a 100644
> --- a/libgfortran/io/file_pos.c
> +++ b/libgfortran/io/file_pos.c
> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, gfc_unit *u)
>      }
>    else
>      {
> +      uint32_t u32;
> +      uint64_t u64;
>        switch (length)
>          {
>          case sizeof(GFC_INTEGER_4):
> -          reverse_memcpy (&m4, p, sizeof (m4));
> +          memcpy (&u32, p, sizeof (u32));
> +          u32 = __builtin_bswap32 (u32);
> +          m4 = *(GFC_INTEGER_4*)&u32;

Isn't that an aliasing violation?

Andreas.

--
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
[Patch, libfortran] Improve performance of byte swapped IO
Hi,

currently byte swapped unformatted IO can be quite slow compared to the
same code with no byte swapping. There are two major reasons for this:

1) The byte swapping code path resorts to transferring data element by
element, leading to a lot of overhead in the IO library.

2) The function used for the actual byte swapping, reverse_memcpy,
while able to handle general element sizes, is not particularly fast,
especially considering that many CPUs have fast byte swapping
instructions (e.g. BSWAP on x86). In order to access these fast byte
swapping instructions, gcc provides the __builtin_bswap{16,32,64}
builtins, falling back to libgcc code for targets that lack support.

The attached patch fixes these issues. For issue (1), the read path
uses in-place byte swapping of the data that has been read into the
user buffer, while the write path uses a larger temporary buffer (since
we are not allowed to modify the user supplied data in this case). For
issue (2), the patch uses __builtin_bswap{16,32,64} where appropriate,
only falling back to reverse_memcpy for other sizes.

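The two ideas described above can be sketched as follows. This is an illustration under stated assumptions, not the attached patch: `swap_elements` and `write_swapped` are hypothetical names, the 4096-byte chunk size is arbitrary, and stdio stands in for libgfortran's own stream layer.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Byte-swap nelems elements of `size` bytes each from src into dest.
   dest may be the same pointer as src (in-place, as on the read path);
   partially overlapping buffers are not supported.  */
static void
swap_elements (void *dest, const void *src, size_t size, size_t nelems)
{
  unsigned char *d = dest;
  const unsigned char *s = src;
  size_t i, j;

  switch (size)
    {
    case 2:
      for (i = 0; i < nelems; i++, d += 2, s += 2)
        {
          uint16_t v;
          memcpy (&v, s, sizeof v);
          v = __builtin_bswap16 (v);
          memcpy (d, &v, sizeof v);
        }
      break;
    case 4:
      for (i = 0; i < nelems; i++, d += 4, s += 4)
        {
          uint32_t v;
          memcpy (&v, s, sizeof v);
          v = __builtin_bswap32 (v);
          memcpy (d, &v, sizeof v);
        }
      break;
    case 8:
      for (i = 0; i < nelems; i++, d += 8, s += 8)
        {
          uint64_t v;
          memcpy (&v, s, sizeof v);
          v = __builtin_bswap64 (v);
          memcpy (d, &v, sizeof v);
        }
      break;
    default:
      /* Generic fallback, playing the role reverse_memcpy has in the
         description above: exchange bytes from both ends.  */
      for (i = 0; i < nelems; i++, d += size, s += size)
        for (j = 0; j < (size + 1) / 2; j++)
          {
            unsigned char a = s[j], b = s[size - 1 - j];
            d[j] = b;
            d[size - 1 - j] = a;
          }
      break;
    }
}

/* Write path: the caller's data must stay untouched, so swap through a
   bounded temporary buffer one chunk at a time.  Returns 0 on success,
   -1 on error or if a single element does not fit in the buffer.  */
static int
write_swapped (FILE *fp, const void *data, size_t size, size_t nelems)
{
  unsigned char tmp[4096];
  const unsigned char *s = data;
  size_t per_chunk;

  if (size == 0 || size > sizeof tmp)
    return -1;
  per_chunk = sizeof tmp / size;

  while (nelems > 0)
    {
      size_t n = nelems < per_chunk ? nelems : per_chunk;
      swap_elements (tmp, s, size, n);
      if (fwrite (tmp, size, n, fp) != n)
        return -1;
      s += n * size;
      nelems -= n;
    }
  return 0;
}
```

On the read path one would read a whole block into the user buffer and then call `swap_elements (buf, buf, size, nelems)` once, instead of issuing one transfer per element.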
With the attached test program run on a tmpfs filesystem to avoid doing
actual disk IO, I get the following:

- With no byte swapping:

Unformatted sequential write/read performance test
Record size    Write MB/s            Read MB/s
==========================================================
          4    52.723842817422202    72.721158943820441
          8    77.508296890856386    97.237815640377221
         16    110.26209495334321    143.80831184546381
         32    173.94872143231535    221.89704881197937
         64    282.19818562682684    373.77854583735541
        128    442.22084579742244    628.80041029142183
        256    636.69620860705299    966.37723642576316
        512    826.05968840738080    1380.8835166612221
       1024    987.18686465197561    1763.5990036057208
       2048    1047.6721544191710    2058.0875622043550
       4096    1115.5817147134801    2251.8731832850176
       8192    1191.5021150996590    2283.8893409728184
      16384    1417.6110909519391    2441.0530373866482
      32768    1570.4413479046018    2543.0836384048471
      65536    1673.0378706502966    2651.2182395008308
     131072    1697.4944246188445    2688.2398923155783
     262144    1669.6329862145872    2735.668973292
     524288    1594.4669935231552    2697.7208298823243

- Before patch, with byte swapping:

Unformatted sequential write/read performance test
Record size    Write MB/s            Read MB/s
==========================================================
          4    50.572812893689793    68.858701306591627
          8    58.688513300690317    81.591733130441327
         16    73.551188480607820    96.638995590227665
         32    91.593767813989018    116.65817140076214
         64    107.41379323761915    128.32512066346368
        128    121.33499652432221    147.80777892360237
        256    128.99627771476628    155.91619889220266
        512    135.02742063670030    161.30042382365372
       1024    137.02276709585524    164.11267056940963
       2048    138.62774254302394    165.22456826188971
       4096    139.27695763341924    166.34707691429571
       8192    147.64584950575932    166.59526981475742
      16384    147.91235479266419    166.77890398940283
      32768    150.77029430529927    166.90834867503827
      65536    151.59474472614465    166.84075600288520
     131072    155.75202672623249    166.96550283835097
     262144    155.36506626794849    166.78075976148853
     524288    155.64305086921487    167.44468828946083

- After patch, with byte swapping:

Unformatted sequential write/read performance test
Record size    Write MB/s            Read MB/s
==========================================================
          4    49.414771776821361    70.808060042286343
          8    72.918156402459772    93.234093684373946
         16    102.72461544178078    136.21700026949074
         32    160.57240200649090    205.97612602315186
         64    249.32082957447636    331.85515010907363
        128    385.71299236810387    522.06354804855266
        256    535.40608912076459    766.59668706247294
        512    669.47864120368524    1006.4275938227961
       1024    742.90538895500265    1187.9846039167674
       2048    789.71340557340523    1333.8411634622269
       4096    826.44253204731683    1395.5536995933605
       8192    832.93540316116662    1361.4621716558986
      16384    897.95081977010113    1469.0940087507722
      32768    961.18736308033317    1533.7736812111871
      65536    989.41384908496832    1564.7013916917260
     131072    1003.6113762068040    1597.4063253370084
     262144    980.03067664324396    1602.3188995993287