Re: [Patch, libfortran] Improve performance of byte swapped IO

2013-01-23 Thread Janne Blomqvist
On Wed, Jan 23, 2013 at 12:32 AM, Thomas Koenig  wrote:
> Hi Janne,
>
>> PING**2
>
>
> this is OK.  Thanks a lot for the work you put into this!

Thanks for the review; committed as r195413.

-- 
Janne Blomqvist


Re: [Patch, libfortran] Improve performance of byte swapped IO

2013-01-22 Thread Thomas Koenig

Hi Janne,


> PING**2


this is OK.  Thanks a lot for the work you put into this!

Thomas




Re: [Patch, libfortran] Improve performance of byte swapped IO

2013-01-18 Thread Janne Blomqvist
PING**2


Re: [Patch, libfortran] Improve performance of byte swapped IO

2013-01-13 Thread Janne Blomqvist
PING**1.2

Yet another slightly updated patch attached. Compared to the previous
version, now with specializations for size 12 and 16 as well. For the
real(10) benchmark, with the previous v3 patch (please disregard the
absolute values in the post quoted below; they were wrong due to a
bug):

 Unformatted sequential write/read performance test
 Record size   Write MB/s           Read MB/s
 =====================================================
           4   80.578833140738340   127.33074266188656
           8   137.61682156650559   184.49033790407984
          16   202.72871312800621   275.98801561061816
          32   275.33538767460863   413.43956672052303
          64   341.04488670485119   555.13744525826564
         128   384.77917051919820   671.44655208024699
         256   410.97208129045833   763.97660513918527
         512   425.76619227779878   826.41086693364593
        1024   430.77035999730009   840.30757120448550
        2048   438.30318459339475   885.50033810296600
        4096   455.79422809097599   919.78265920652086
        8192   465.74499205886326   959.06963983370918
       16384   472.48133493971142   991.11244162081744
       32768   471.00024619567603   1015.7428144049615
       65536   474.91235280949985   1021.2150519080892
      131072   475.18664487440901   1006.3701982554830
      262144   478.00435092846868   985.17141300594039
      524288   476.72837201590363   991.74226579987987

With the new v4 patch:

 Unformatted sequential write/read performance test
 Record size   Write MB/s           Read MB/s
 =====================================================
           4   87.353141847504133   145.09410391177835
           8   166.95093628370549   223.60877830048437
          16   272.20937208187746   364.91673986840277
          32   415.26016354252715   599.41744252952310
          64   592.97676703528009   900.53345964312450
         128   748.27218547147686   1189.7131837787238
         256   874.83098506714384   1561.3649529261234
         512   935.69494481144284   1823.1760143164879
        1024   983.51689491813215   1931.8773088107300
        2048   1009.5491761651396   1971.6978586130062
        4096   1115.5862027658552   2119.4151169997808
        8192   1172.9400229568287   2184.1403983641089
       16384   1222.6659284153168   2258.5490449229878
       32768   1242.2417626697293   2251.8159046253918
       65536   1227.996794396       2313.4106672387143
      131072   1204.4295656544052   2129.1309150039478
      262144   1135.7905614378458   2154.7146453789856
      524288   1075.5769074402640   2170.5151501933169
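
For readers wondering what a size-16 specialization can look like, here is a minimal sketch. It is not taken from the patch: the function name bswap16byte_inplace is hypothetical, and it assumes that reversing a 16-byte element is done by byte swapping both 64-bit halves and exchanging them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reverse the byte order of NELEMS elements of
   16 bytes each, in place.  Each element is handled as two 64-bit
   halves: both halves are byte swapped and then stored in exchanged
   positions, which reverses all 16 bytes.  */
static void
bswap16byte_inplace (void *data, size_t nelems)
{
  unsigned char *p = data;

  assert (data != NULL || nelems == 0);
  for (size_t i = 0; i < nelems; i++, p += 16)
    {
      uint64_t lo, hi;

      /* memcpy avoids alignment and aliasing problems; the compiler
         is expected to turn these into plain loads and stores.  */
      memcpy (&lo, p, sizeof lo);
      memcpy (&hi, p + 8, sizeof hi);
      lo = __builtin_bswap64 (lo);
      hi = __builtin_bswap64 (hi);
      memcpy (p, &hi, sizeof hi);
      memcpy (p + 8, &lo, sizeof lo);
    }
}
```

Calling bswap16byte_inplace (buf, n) on a buffer of n 16-byte elements reverses each element independently; a 12-byte variant could presumably be built the same way from three 32-bit words.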


On Fri, Jan 11, 2013 at 10:41 PM, Janne Blomqvist
 wrote:
> PING.
>
> Slightly updated patch attached, which further improves the generic
> size fallback that is used when the element size is not 2/4/8 bytes.
> Changing the us_perf benchmark to use real(10), with the v2 patch the
> performance is:
>
>  Unformatted sequential write/read performance test
>  Record size   Write MB/s           Read MB/s
>  =====================================================
>            4   59.028550429522085   86.019754350948787
>            8   79.028327063130590   95.803502000733374
>           16   99.980457395413296   138.68367462874946
>           32   122.56886206338788   180.05609910155042
>           64   152.00478266944486   212.69931319407567
>          128   197.74137934940202   235.19728791956828
>          256   155.36245780017779   244.60578379215929
>          512   157.13385845966246   245.07467397691480
>         1024   177.26553799130201   260.44908357795623
>         2048   208.22852888945587   260.21587143113527
>         4096   222.88410474980634   262.66162209490591
>         8192   226.71167580652920   265.81191407123663
>        16384   206.51818241747065   263.59395165591724
>        32768   230.18707026455866   265.88990325026526
>        65536   229.19783089391504   268.04485112932684
>       131072   231.1221566209       267.40543904427710
>       262144   230.72012123598142   267.60086931504122
>       524288   230.48959460456055   268.78750211303725
>
> With the new v3 patch I get
>
>  Unformatted sequential write/read performance test
>  Record size   Write MB/s           Read MB/s
>  =====================================================
>            4   59.779061121239941   92.777125264010024
>            8   92.727504266051341   126.64775563782673
>           16   128.94793911163904   184.69194300482837
>           32   169.78916283536847   267.06752001266767
>           64   209.50296476919556   341.60515130910238
>          128   236.36709738360679   416.73212655

Re: [Patch, libfortran] Improve performance of byte swapped IO

2013-01-11 Thread Janne Blomqvist
PING.

Slightly updated patch attached, which further improves the generic
size fallback that is used when the element size is not 2/4/8 bytes.
Changing the us_perf benchmark to use real(10), with the v2 patch the
performance is:

 Unformatted sequential write/read performance test
 Record size   Write MB/s           Read MB/s
 =====================================================
           4   59.028550429522085   86.019754350948787
           8   79.028327063130590   95.803502000733374
          16   99.980457395413296   138.68367462874946
          32   122.56886206338788   180.05609910155042
          64   152.00478266944486   212.69931319407567
         128   197.74137934940202   235.19728791956828
         256   155.36245780017779   244.60578379215929
         512   157.13385845966246   245.07467397691480
        1024   177.26553799130201   260.44908357795623
        2048   208.22852888945587   260.21587143113527
        4096   222.88410474980634   262.66162209490591
        8192   226.71167580652920   265.81191407123663
       16384   206.51818241747065   263.59395165591724
       32768   230.18707026455866   265.88990325026526
       65536   229.19783089391504   268.04485112932684
      131072   231.1221566209       267.40543904427710
      262144   230.72012123598142   267.60086931504122
      524288   230.48959460456055   268.78750211303725

With the new v3 patch I get

 Unformatted sequential write/read performance test
 Record size   Write MB/s           Read MB/s
 =====================================================
           4   59.779061121239941   92.777125264010024
           8   92.727504266051341   126.64775563782673
          16   128.94793911163904   184.69194300482837
          32   169.78916283536847   267.06752001266767
          64   209.50296476919556   341.60515130910238
         128   236.36709738360679   416.73212655882151
         256   251.79029695383340   465.46804746749740
         512   259.62269939828633   500.87346060356265
        1024   265.08842337586458   508.95530627428275
        2048   268.71795530051884   532.12211365683640
        4096   280.86546884821030   546.88907054369884
        8192   286.96049684823578   569.60958187426183
       16384   292.04368984868103   608.11503416324865
       32768   292.96677387959392   629.80651297065833
       65536   291.69098580137114   624.27103478079641
      131072   292.75666234956418   605.99766136491496
      262144   291.35520038228975   611.59061455535834
      524288   292.15446100501691   623.76232623081580


On Sat, Jan 5, 2013 at 11:13 PM, Janne Blomqvist
 wrote:
> On Sat, Jan 5, 2013 at 5:35 PM, Richard Biener
>  wrote:
>> On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab  
>> wrote:
>>> Janne Blomqvist  writes:
>>>
>>>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>>>> index c8ecc3a..bf2250a 100644
>>>> --- a/libgfortran/io/file_pos.c
>>>> +++ b/libgfortran/io/file_pos.c
>>>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp,
>>>> gfc_unit *u)
>>>>   }
>>>>    else
>>>>   {
>>>> +   uint32_t u32;
>>>> +   uint64_t u64;
>>>> switch (length)
>>>>   {
>>>>   case sizeof(GFC_INTEGER_4):
>>>> -   reverse_memcpy (&m4, p, sizeof (m4));
>>>> +   memcpy (&u32, p, sizeof (u32));
>>>> +   u32 = __builtin_bswap32 (u32);
>>>> +   m4 = *(GFC_INTEGER_4*)&u32;
>>>
>>> Isn't that an aliasing violation?
>>
>> It looks like one.  Why not simply do
>>
>>    m4 = (GFC_INTEGER_4) u32;
>>
>> ?  I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?
>
> Yes, GFC_INTEGER_4 is a typedef for int32_t. As for why I didn't do
> the above, C99 6.3.1.3(3) says that if the unsigned value is outside
> the range of the signed variable, the result is
> implementation-defined. Though I suppose the sensible
> "implementation-defined behavior" in this case on a two's complement
> target is to just do a bitwise copy.
>
> Anyway, to be really safe one could use memcpy instead; the compiler
> optimizes small fixed size memcpy's just fine. Updated patch attached.
>
>
> --
> Janne Blomqvist



-- 
Janne Blomqvist


bswap3.diff
Description: Binary data


Re: [Patch, libfortran] Improve performance of byte swapped IO

2013-01-06 Thread Richard Biener
On Sat, Jan 5, 2013 at 10:13 PM, Janne Blomqvist
 wrote:
> On Sat, Jan 5, 2013 at 5:35 PM, Richard Biener
>  wrote:
>> On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab  
>> wrote:
>>> Janne Blomqvist  writes:
>>>
>>>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>>>> index c8ecc3a..bf2250a 100644
>>>> --- a/libgfortran/io/file_pos.c
>>>> +++ b/libgfortran/io/file_pos.c
>>>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp,
>>>> gfc_unit *u)
>>>>   }
>>>>    else
>>>>   {
>>>> +   uint32_t u32;
>>>> +   uint64_t u64;
>>>> switch (length)
>>>>   {
>>>>   case sizeof(GFC_INTEGER_4):
>>>> -   reverse_memcpy (&m4, p, sizeof (m4));
>>>> +   memcpy (&u32, p, sizeof (u32));
>>>> +   u32 = __builtin_bswap32 (u32);
>>>> +   m4 = *(GFC_INTEGER_4*)&u32;
>>>
>>> Isn't that an aliasing violation?
>>
>> It looks like one.  Why not simply do
>>
>>    m4 = (GFC_INTEGER_4) u32;
>>
>> ?  I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?
>
> Yes, GFC_INTEGER_4 is a typedef for int32_t. As for why I didn't do
> the above, C99 6.3.1.3(3) says that if the unsigned value is outside
> the range of the signed variable, the result is
> implementation-defined. Though I suppose the sensible
> "implementation-defined behavior" in this case on a two's complement
> target is to just do a bitwise copy.

As libgfortran is a target library and thus always compiled by GCC, you
can rely on GCC's documented implementation-defined behavior here
(which is to do bitwise re-interpretation).  No need to obfuscate the
code more than necessary.
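
A minimal sketch of the cast variant suggested here. The helper name read_swapped_i32_cast is made up for illustration and does not appear in the patch. It relies on GCC's documented behavior for out-of-range unsigned-to-signed conversion (the value is reduced modulo 2^N), so it is only meant for code that is always built with GCC:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load a byte swapped 32-bit signed value from P.
   The final cast is implementation-defined in ISO C when the value
   exceeds INT32_MAX; GCC documents it as reduction modulo 2^32, i.e.
   the bit pattern is preserved.  */
static int32_t
read_swapped_i32_cast (const void *p)
{
  uint32_t u32;

  assert (p != NULL);
  memcpy (&u32, p, sizeof u32);   /* safe for unaligned P, no aliasing issue */
  u32 = __builtin_bswap32 (u32);
  return (int32_t) u32;
}
```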

Richard.

> Anyway, to be really safe one could use memcpy instead; the compiler
> optimizes small fixed size memcpy's just fine. Updated patch attached.
>
>
> --
> Janne Blomqvist


Re: [Patch, libfortran] Improve performance of byte swapped IO

2013-01-05 Thread Janne Blomqvist
On Sat, Jan 5, 2013 at 5:35 PM, Richard Biener
 wrote:
> On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab  wrote:
>> Janne Blomqvist  writes:
>>
>>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>>> index c8ecc3a..bf2250a 100644
>>> --- a/libgfortran/io/file_pos.c
>>> +++ b/libgfortran/io/file_pos.c
>>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, 
>>> gfc_unit *u)
>>>   }
>>>else
>>>   {
>>> +   uint32_t u32;
>>> +   uint64_t u64;
>>> switch (length)
>>>   {
>>>   case sizeof(GFC_INTEGER_4):
>>> -   reverse_memcpy (&m4, p, sizeof (m4));
>>> +   memcpy (&u32, p, sizeof (u32));
>>> +   u32 = __builtin_bswap32 (u32);
>>> +   m4 = *(GFC_INTEGER_4*)&u32;
>>
>> Isn't that an aliasing violation?
>
> It looks like one.  Why not simply do
>
>    m4 = (GFC_INTEGER_4) u32;
>
> ?  I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?

Yes, GFC_INTEGER_4 is a typedef for int32_t. As for why I didn't do
the above, C99 6.3.1.3(3) says that if the unsigned value is outside
the range of the signed variable, the result is
implementation-defined. Though I suppose the sensible
"implementation-defined behavior" in this case on a two's complement
target is to just do a bitwise copy.

Anyway, to be really safe one could use memcpy instead; the compiler
optimizes small fixed-size memcpy calls just fine. Updated patch attached.
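
As a minimal sketch of the memcpy variant (the helper name read_swapped_i32 is hypothetical and only for illustration; the attached patch may be structured differently), the swapped bits are copied into the signed variable rather than converted or accessed through a cast pointer:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load a byte swapped 32-bit signed value from P
   without pointer casts (no aliasing violation) and without an
   unsigned-to-signed conversion (no implementation-defined result).  */
static int32_t
read_swapped_i32 (const void *p)
{
  uint32_t u32;
  int32_t m4;

  assert (p != NULL);
  memcpy (&u32, p, sizeof u32);    /* raw bytes -> unsigned */
  u32 = __builtin_bswap32 (u32);   /* reverse the byte order */
  memcpy (&m4, &u32, sizeof m4);   /* bitwise copy into the signed type */
  return m4;
}
```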


-- 
Janne Blomqvist


bswap2.diff
Description: Binary data


Re: [Patch, libfortran] Improve performance of byte swapped IO

2013-01-05 Thread Richard Biener
On Fri, Jan 4, 2013 at 11:35 PM, Andreas Schwab  wrote:
> Janne Blomqvist  writes:
>
>> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
>> index c8ecc3a..bf2250a 100644
>> --- a/libgfortran/io/file_pos.c
>> +++ b/libgfortran/io/file_pos.c
>> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, 
>> gfc_unit *u)
>>   }
>>else
>>   {
>> +   uint32_t u32;
>> +   uint64_t u64;
>> switch (length)
>>   {
>>   case sizeof(GFC_INTEGER_4):
>> -   reverse_memcpy (&m4, p, sizeof (m4));
>> +   memcpy (&u32, p, sizeof (u32));
>> +   u32 = __builtin_bswap32 (u32);
>> +   m4 = *(GFC_INTEGER_4*)&u32;
>
> Isn't that an aliasing violation?

It looks like one.  Why not simply do

   m4 = (GFC_INTEGER_4) u32;

?  I suppose GFC_INTEGER_4 is always the same size as uint32_t but signed?

Richard.

>
> Andreas.
>
> --
> Andreas Schwab, sch...@linux-m68k.org
> GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
> "And now for something completely different."


Re: [Patch, libfortran] Improve performance of byte swapped IO

2013-01-04 Thread Andreas Schwab
Janne Blomqvist  writes:

> diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
> index c8ecc3a..bf2250a 100644
> --- a/libgfortran/io/file_pos.c
> +++ b/libgfortran/io/file_pos.c
> @@ -140,15 +140,21 @@ unformatted_backspace (st_parameter_filepos *fpp, 
> gfc_unit *u)
>   }
>else
>   {
> +   uint32_t u32;
> +   uint64_t u64;
> switch (length)
>   {
>   case sizeof(GFC_INTEGER_4):
> -   reverse_memcpy (&m4, p, sizeof (m4));
> +   memcpy (&u32, p, sizeof (u32));
> +   u32 = __builtin_bswap32 (u32);
> +   m4 = *(GFC_INTEGER_4*)&u32;

Isn't that an aliasing violation?

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


[Patch, libfortran] Improve performance of byte swapped IO

2013-01-04 Thread Janne Blomqvist
Hi,

currently byte swapped unformatted IO can be quite slow compared to
the same code with no byte swapping. There are two major reasons for
this:

1) The byte swapping code path resorts to transferring data element by
element, leading to a lot of overhead in the IO library.

2) The function used for the actual byte swapping, reverse_memcpy,
while able to handle general element sizes, is not particularly fast,
especially considering that many CPUs have fast byte swapping
instructions (e.g. BSWAP on x86). In order to access these fast byte
swapping instructions, gcc provides the __builtin_bswap{16,32,64}
builtins, falling back to libgcc code for targets that lack support.
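
To illustrate the two points above, here is a rough sketch of swapping a whole array in one call, using the builtins for 2/4/8-byte elements and a byte-by-byte fallback otherwise. The names bswap_array_inplace and swap_generic are made up for this example and are not the functions in libgfortran:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Generic fallback: reverse the bytes of each SIZE-byte element.  */
static void
swap_generic (unsigned char *buf, size_t size, size_t nelems)
{
  for (size_t i = 0; i < nelems; i++, buf += size)
    for (size_t lo = 0, hi = size - 1; lo < hi; lo++, hi--)
      {
        unsigned char tmp = buf[lo];
        buf[lo] = buf[hi];
        buf[hi] = tmp;
      }
}

/* Byte swap NELEMS elements of SIZE bytes each, in place.  One call
   handles the whole array, so per-element call overhead is avoided.  */
static void
bswap_array_inplace (void *data, size_t size, size_t nelems)
{
  unsigned char *p = data;

  assert (size > 0);
  switch (size)
    {
    case 2:
      for (size_t i = 0; i < nelems; i++, p += 2)
        {
          uint16_t v;
          memcpy (&v, p, sizeof v);
          v = __builtin_bswap16 (v);
          memcpy (p, &v, sizeof v);
        }
      break;
    case 4:
      for (size_t i = 0; i < nelems; i++, p += 4)
        {
          uint32_t v;
          memcpy (&v, p, sizeof v);
          v = __builtin_bswap32 (v);
          memcpy (p, &v, sizeof v);
        }
      break;
    case 8:
      for (size_t i = 0; i < nelems; i++, p += 8)
        {
          uint64_t v;
          memcpy (&v, p, sizeof v);
          v = __builtin_bswap64 (v);
          memcpy (p, &v, sizeof v);
        }
      break;
    default:
      swap_generic (p, size, nelems);
      break;
    }
}
```

For example, bswap_array_inplace (buf, 4, n) converts n 32-bit values between big and little endian; an odd size such as 10 takes the generic path.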

The attached patch fixes these issues. For issue (1), the read path
uses in-place byte swapping of the data that has been read into the
user buffer, while the write path uses a larger temporary buffer
(since we are not allowed to modify the user supplied data in this
case). For issue (2), the patch uses __builtin_bswap{16,32,64} where
appropriate, only falling back to reverse_memcpy for other sizes.
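
The write-path idea can be sketched as follows, using plain stdio in place of the libgfortran stream layer. Everything here is an assumption made for illustration: the function name write_swapped_u32, the CHUNK_ELEMS buffer size, and the restriction to 4-byte elements are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical chunk size for the temporary buffer.  */
#define CHUNK_ELEMS 1024

/* Write NELEMS 32-bit elements from SRC to FP with their bytes swapped.
   SRC is never modified: each chunk is swapped into a temporary buffer
   and written with a single fwrite call.  Returns the number of
   elements actually written.  */
static size_t
write_swapped_u32 (FILE *fp, const uint32_t *src, size_t nelems)
{
  uint32_t tmp[CHUNK_ELEMS];
  size_t done = 0;

  assert (fp != NULL);
  while (done < nelems)
    {
      size_t n = nelems - done;
      if (n > CHUNK_ELEMS)
        n = CHUNK_ELEMS;

      /* Swap into the temporary buffer, leaving the user data intact.  */
      for (size_t i = 0; i < n; i++)
        tmp[i] = __builtin_bswap32 (src[done + i]);

      size_t written = fwrite (tmp, sizeof tmp[0], n, fp);
      done += written;
      if (written != n)
        break;                  /* short write: report what was written */
    }
  return done;
}
```

The read direction needs no temporary buffer, since the data already sits in the user's buffer and can be swapped where it is.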

With the attached test program run on a tmpfs filesystem to avoid
doing actual disk IO, I get the following:

- With no byte swapping:

 Unformatted sequential write/read performance test
 Record size   Write MB/s           Read MB/s
 =====================================================
           4   52.723842817422202   72.721158943820441
           8   77.508296890856386   97.237815640377221
          16   110.26209495334321   143.80831184546381
          32   173.94872143231535   221.89704881197937
          64   282.19818562682684   373.77854583735541
         128   442.22084579742244   628.80041029142183
         256   636.69620860705299   966.37723642576316
         512   826.05968840738080   1380.8835166612221
        1024   987.18686465197561   1763.5990036057208
        2048   1047.6721544191710   2058.0875622043550
        4096   1115.5817147134801   2251.8731832850176
        8192   1191.5021150996590   2283.8893409728184
       16384   1417.6110909519391   2441.0530373866482
       32768   1570.4413479046018   2543.0836384048471
       65536   1673.0378706502966   2651.2182395008308
      131072   1697.4944246188445   2688.2398923155783
      262144   1669.6329862145872   2735.668973292
      524288   1594.4669935231552   2697.7208298823243

- Before patch, with byte swapping:

 Unformatted sequential write/read performance test
 Record size   Write MB/s           Read MB/s
 =====================================================
           4   50.572812893689793   68.858701306591627
           8   58.688513300690317   81.591733130441327
          16   73.551188480607820   96.638995590227665
          32   91.593767813989018   116.65817140076214
          64   107.41379323761915   128.32512066346368
         128   121.33499652432221   147.80777892360237
         256   128.99627771476628   155.91619889220266
         512   135.02742063670030   161.30042382365372
        1024   137.02276709585524   164.11267056940963
        2048   138.62774254302394   165.22456826188971
        4096   139.27695763341924   166.34707691429571
        8192   147.64584950575932   166.59526981475742
       16384   147.91235479266419   166.77890398940283
       32768   150.77029430529927   166.90834867503827
       65536   151.59474472614465   166.84075600288520
      131072   155.75202672623249   166.96550283835097
      262144   155.36506626794849   166.78075976148853
      524288   155.64305086921487   167.44468828946083

- After patch, with byte swapping:

 Unformatted sequential write/read performance test
 Record size   Write MB/s           Read MB/s
 =====================================================
           4   49.414771776821361   70.808060042286343
           8   72.918156402459772   93.234093684373946
          16   102.72461544178078   136.21700026949074
          32   160.57240200649090   205.97612602315186
          64   249.32082957447636   331.85515010907363
         128   385.71299236810387   522.06354804855266
         256   535.40608912076459   766.59668706247294
         512   669.47864120368524   1006.4275938227961
        1024   742.90538895500265   1187.9846039167674
        2048   789.71340557340523   1333.8411634622269
        4096   826.44253204731683   1395.5536995933605
        8192   832.93540316116662   1361.4621716558986
       16384   897.95081977010113   1469.0940087507722
       32768   961.18736308033317   1533.7736812111871
       65536   989.41384908496832   1564.7013916917260
      131072   1003.6113762068040   1597.4063253370084
      262144   980.03067664324396   1602.3188995993287