Re: [PATCH v2 0/4] Update memcpy, memset etc. for M7/M8 architectures

2017-08-11 Thread Babu Moger

David,  Thanks for applying.

On 8/10/2017 4:38 PM, David Miller wrote:
> From: Babu Moger
> Date: Mon,  7 Aug 2017 17:52:48 -0600
>
>> This series of patches updates the memcpy, memset, copy_to_user,
>> copy_from_user etc for SPARC M7/M8 architecture.
>
> This doesn't build, you cannot assume the existence of "%ncc", it is a
> recent addition.
>
> Furthermore there is no need to ever use %ncc in v9 targeted code
> anyways.
>
> I'll fix that up, but this was a really disappointing build failure
> to hit.

Thank you.

> Meanwhile, two questions:
>
> 1) Is this also faster on T4 as well?  If it is, we can just get rid
>    of the T4 routines and use this on those chips as well.

At the time of this work, our focus was mostly on T7 and T8. We did not
test this code on T4. For T4 and other older configs we used the NG4
versions. I would think it would require some changes to make it work
on T4.

> 2) There has been a lot of discussion and consideration put into how
>    a memcpy/memset routine might be really great for the local cpu
>    but overall pessimize performance for other cpus either locally
>    on the same core (contention for physical resources such as
>    ports to the store buffer and/or L3 cache) or on other cores.
>
>    Has any such study been done into these issues wrt. this new code?

No, we have not done this kind of study.



Re: [PATCH v2 0/4] Update memcpy, memset etc. for M7/M8 architectures

2017-08-10 Thread David Miller
From: Babu Moger 
Date: Mon,  7 Aug 2017 17:52:48 -0600

> This series of patches updates the memcpy, memset, copy_to_user,
> copy_from_user etc for SPARC M7/M8 architecture.

This doesn't build, you cannot assume the existence of "%ncc", it is a
recent addition.

Furthermore there is no need to ever use %ncc in v9 targeted code
anyways.

I'll fix that up, but this was a really disappointing build failure
to hit.

Meanwhile, two questions:

1) Is this also faster on T4 as well?  If it is, we can just get rid
   of the T4 routines and use this on those chips as well.

2) There has been a lot of discussion and consideration put into how
   a memcpy/memset routine might be really great for the local cpu
   but overall pessimize performance for other cpus either locally
   on the same core (contention for physical resources such as
   ports to the store buffer and/or L3 cache) or on other cores.

   Has any such study been done into these issues wrt. this new code?


[PATCH v2 0/4] Update memcpy, memset etc. for M7/M8 architectures

2017-08-07 Thread Babu Moger
This series of patches updates the memcpy, memset, copy_to_user, copy_from_user
etc for SPARC M7/M8 architecture.

The new algorithm takes advantage of the M7/M8 block init store ASIs to
improve performance. More details are in the code comments.

Tested and compared the latency measured in ticks (NG4memcpy vs. new M7memcpy).

1. Memset numbers (aligned memset)

No. of bytes  NG4memset      M7memset       Delta ((B-A)/A)*100
              (Avg.Ticks A)  (Avg.Ticks B)  (latency reduction)
3             77             25             -67.53
7             43             33             -23.25
32            72             68             -5.55
128           164            44             -73.17
256           335            68             -79.70
512           511            220            -56.94
1024          1552           627            -59.60
2048          3515           1322           -62.38
4096          6303           2472           -60.78
8192          13118          4867           -62.89
16384         26206          10371          -60.42
32768         52501          18569          -64.63
65536         100219         35899          -64.17
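
For reference, the Delta column follows directly from the two tick counts.
The short Python sketch below is only an illustration and is not part of the
patch series: the helper name delta_percent is made up here, and the sample
rows are copied from the aligned memset table above. The posted figures
appear to be truncated rather than rounded, so the last digit can differ.

```python
def delta_percent(ticks_a, ticks_b):
    """Relative change of B against baseline A in percent.

    A negative value means B (the new routine) needed fewer ticks than A.
    """
    return (ticks_b - ticks_a) / ticks_a * 100.0


# (bytes, NG4memset avg ticks A, M7memset avg ticks B), taken from the table
sample_rows = [(3, 77, 25), (256, 335, 68), (65536, 100219, 35899)]

for size, ticks_a, ticks_b in sample_rows:
    print(f"{size:>6} {ticks_a:>7} {ticks_b:>7} {delta_percent(ticks_a, ticks_b):8.2f}")
```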


2. Memcpy numbers (aligned memcpy)

No. of bytes  NG4memcpy      M7memcpy       Delta ((B-A)/A)*100
              (Avg.Ticks A)  (Avg.Ticks B)  (latency reduction)
3             20             19             -5
7             29             27             -6.89
32            30             28             -6.66
128           89             69             -22.47
256           142            143            0.70
512           341            283            -17.00
1024          1588           655            -58.75
2048          3553           1357           -61.80
4096          7218           2590           -64.11
8192          13701          5231           -61.82
16384         28304          10716          -62.13
32768         56516          22995          -59.31
65536         115443         50840          -55.96

3. Memset numbers (un-aligned memset)

No. of bytes  NG4memset      M7memset       Delta ((B-A)/A)*100
              (Avg.Ticks A)  (Avg.Ticks B)  (latency reduction)
3             40             31             -22.5
7             52             29             -44.2307692308
32            89             86             -3.3707865169
128           201            74             -63.184079602
256           340            154            -54.7058823529
512           961            335            -65.1404786681
1024          1799           686            -61.8677042802
2048          3575           1260           -64.7552447552
4096          6560           2627           -59.9542682927
8192          13161          6018           -54.273991338
16384         26465          10439          -60.5554505951
32768         52119          18649          -64.2184232238
65536         101593         35724          -64.8361599717

4. Memcpy numbers (un-aligned memcpy)

No. of bytes  NG4memcpy      M7memcpy       Delta ((B-A)/A)*100
              (Avg.Ticks A)  (Avg.Ticks B)  (latency reduction)
3             26             19             -26.9230769231
7             48             45             -6.25
32            52             49             -5.7692307692
128           284            334            17.6056338028
256           430            482            12.0930232558
512           646            690            6.8111455108
1024          1051           1016           -3.3301617507
2048          1787           1818           1.7347509793
4096          3309           3376           2.0247809006
8192          8151           7444           -8.673782358
16384         34222          34556          0.9759803635
32768         87851          95044          8.1877269468
65536         158331         159572         0.7838010244

There is not much difference in the numbers for un-aligned copies
between NG4memcpy and M7memcpy because they both mostly use the
same algorithms.
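
To see at which sizes the un-aligned memcpy numbers move in each direction,
the rows can be split by the sign of the difference. This is a small Python
sketch for illustration only; the name split_by_sign is hypothetical, and the
tick counts are copied from the un-aligned memcpy table above.

```python
# (bytes, NG4memcpy avg ticks A, M7memcpy avg ticks B), from the table above
UNALIGNED_MEMCPY = [
    (3, 26, 19), (7, 48, 45), (32, 52, 49), (128, 284, 334),
    (256, 430, 482), (512, 646, 690), (1024, 1051, 1016),
    (2048, 1787, 1818), (4096, 3309, 3376), (8192, 8151, 7444),
    (16384, 34222, 34556), (32768, 87851, 95044), (65536, 158331, 159572),
]


def split_by_sign(rows):
    """Return (sizes where B took fewer ticks, sizes where B took more ticks)."""
    faster = [size for size, ticks_a, ticks_b in rows if ticks_b < ticks_a]
    slower = [size for size, ticks_a, ticks_b in rows if ticks_b > ticks_a]
    return faster, slower


faster, slower = split_by_sign(UNALIGNED_MEMCPY)
print("M7memcpy took fewer ticks at:", faster)
print("M7memcpy took more ticks at: ", slower)
```

These are single averaged measurements, so small differences in either
direction may be within measurement noise.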

v2:
 1. Fixed indentation issues found by David Miller
 2. Used ENTRY and ENDPROC for the labels in M7patch.S as suggested by
    David Miller
 3. M8 will now also use M7memcpy. Also tested on an M8 config.
 4. These patches are created on top of the M8 patches below:
    https://patchwork.ozlabs.org/patch/792661/
    https://patchwork.ozlabs.org/patch/792662/
    However, I did not see these patches in the sparc-next tree. They may
    be in the queue now. It is possible that these patches cause some
    build problems. That will resolve once all M8 patches are in the
    sparc-next tree.

v0: Initial version

Babu Moger (4):
  arch/sparc: Separate the exception handlers from NG4memcpy
  arch/sparc: Rename exception handlers
  arch/sparc: Optimized memcpy, memset, copy_to_user, copy_from_user for M7/M8
  arch/sparc: Add accurate exception reporting 
