Re: [PATCH v2 0/4] Update memcpy, memset etc. for M7/M8 architectures
David,
Thanks for applying.

On 8/10/2017 4:38 PM, David Miller wrote:
> From: Babu Moger
> Date: Mon, 7 Aug 2017 17:52:48 -0600
>
>> This series of patches updates the memcpy, memset, copy_to_user,
>> copy_from_user etc for SPARC M7/M8 architecture.
>
> This doesn't build, you cannot assume the existence of "%ncc", it is a
> recent addition. Furthermore there is no need to ever use %ncc in v9
> targeted code anyways.
>
> I'll fix that up, but this was a really disappointing build failure to
> hit.

Thank you.

> Meanwhile, two questions:
>
> 1) Is this also faster on T4 as well? If it is, we can just get rid
>    of the T4 routines and use this on those chips as well.

At the time of this work, our focus was mostly on T7 and T8. We did not
test this code on T4. For T4 and other older configs we used the NG4
versions. I would think it would require some changes to make it work
on T4.

> 2) There has been a lot of discussion and consideration put into how
>    a memcpy/memset routine might be really great for the local cpu
>    but overall pessimize performance for other cpus either locally on
>    the same core (contention for physical resources such as ports to
>    the store buffer and/or L3 cache) or on other cores.
>
>    Has any such study been done into these issues wrt. this new code?

No, we have not done this kind of study.
Re: [PATCH v2 0/4] Update memcpy, memset etc. for M7/M8 architectures
From: Babu Moger
Date: Mon, 7 Aug 2017 17:52:48 -0600

> This series of patches updates the memcpy, memset, copy_to_user,
> copy_from_user etc for SPARC M7/M8 architecture.

This doesn't build, you cannot assume the existence of "%ncc", it is a
recent addition. Furthermore there is no need to ever use %ncc in v9
targeted code anyways.

I'll fix that up, but this was a really disappointing build failure to
hit.

Meanwhile, two questions:

1) Is this also faster on T4 as well? If it is, we can just get rid
   of the T4 routines and use this on those chips as well.

2) There has been a lot of discussion and consideration put into how
   a memcpy/memset routine might be really great for the local cpu
   but overall pessimize performance for other cpus either locally on
   the same core (contention for physical resources such as ports to
   the store buffer and/or L3 cache) or on other cores.

   Has any such study been done into these issues wrt. this new code?
[PATCH v2 0/4] Update memcpy, memset etc. for M7/M8 architectures
This series of patches updates the memcpy, memset, copy_to_user and
copy_from_user routines etc. for the SPARC M7/M8 architecture. The new
algorithm takes advantage of the M7/M8 block init store ASIs to improve
performance. More details are in the code comments.

Tested and compared the latency measured in ticks (NG4memcpy vs new
M7memcpy).

1. Memset numbers (aligned memset)

No. of bytes  NG4memset       M7memset        Delta ((B-A)/A)*100
              (Avg. Ticks A)  (Avg. Ticks B)  (latency reduction)
3             77              25              -67.53
7             43              33              -23.25
32            72              68              -5.55
128           164             44              -73.17
256           335             68              -79.70
512           511             220             -56.94
1024          1552            627             -59.60
2048          3515            1322            -62.38
4096          6303            2472            -60.78
8192          13118           4867            -62.89
16384         26206           10371           -60.42
32768         52501           18569           -64.63
65536         100219          35899           -64.17

2. Memcpy numbers (aligned memcpy)

No. of bytes  NG4memcpy       M7memcpy        Delta ((B-A)/A)*100
              (Avg. Ticks A)  (Avg. Ticks B)  (latency reduction)
3             20              19              -5
7             29              27              -6.89
32            30              28              -6.66
128           89              69              -22.47
256           142             143             0.70
512           341             283             -17.00
1024          1588            655             -58.75
2048          3553            1357            -61.80
4096          7218            2590            -64.11
8192          13701           5231            -61.82
16384         28304           10716           -62.13
32768         56516           22995           -59.31
65536         115443          50840           -55.96

3. Memset numbers (un-aligned memset)

No. of bytes  NG4memset       M7memset        Delta ((B-A)/A)*100
              (Avg. Ticks A)  (Avg. Ticks B)  (latency reduction)
3             40              31              -22.5
7             52              29              -44.2307692308
32            89              86              -3.3707865169
128           201             74              -63.184079602
256           340             154             -54.7058823529
512           961             335             -65.1404786681
1024          1799            686             -61.8677042802
2048          3575            1260            -64.7552447552
4096          6560            2627            -59.9542682927
8192          13161           6018            -54.273991338
16384         26465           10439           -60.5554505951
32768         52119           18649           -64.2184232238
65536         101593          35724           -64.8361599717

4. Memcpy numbers (un-aligned memcpy)

No. of bytes  NG4memcpy       M7memcpy        Delta ((B-A)/A)*100
              (Avg. Ticks A)  (Avg. Ticks B)  (latency reduction)
3             26              19              -26.9230769231
7             48              45              -6.25
32            52              49              -5.7692307692
128           284             334             17.6056338028
256           430             482             12.0930232558
512           646             690             6.8111455108
1024          1051            1016            -3.3301617507
2048          1787            1818            1.7347509793
4096          3309            3376            2.0247809006
8192          8151            7444            -8.673782358
16384         34222           34556           0.9759803635
32768         87851           95044           8.1877269468
65536         158331          159572          0.7838010244

There is not much difference in the numbers for un-aligned copies
between NG4memcpy and M7memcpy because they both mostly use the same
algorithms.

v2:
1. Fixed indentation issues found by David Miller
2. Used ENTRY and ENDPROC for the labels in M7patch.S as suggested by
   David Miller
3. Now M8 also will use M7memcpy. Also tested on M8 config.
4. These patches are created on top of the M8 patches below:
   https://patchwork.ozlabs.org/patch/792661/
   https://patchwork.ozlabs.org/patch/792662/
   However, I did not see these patches in the sparc-next tree. They
   may be in the queue now. It is possible these patches might cause
   some build problems. That will resolve once all M8 patches are in
   the sparc-next tree.

v0: Initial version

Babu Moger (4):
  arch/sparc: Separate the exception handlers from NG4memcpy
  arch/sparc: Rename exception handlers
  arch/sparc: Optimized memcpy, memset, copy_to_user, copy_from_user for M7/M8
  arch/sparc: Add accurate exception reporting
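
For readers checking the tables, the Delta column uses the formula in the
column header, ((B-A)/A)*100, where A is the NG4 average tick count and B
is the M7 average. A negative value means the M7 routine took fewer ticks.
The snippet below is only an illustrative sketch for reproducing that
column; the helper name delta_percent is hypothetical and is not part of
the patch series, and the sample rows are copied from the tables above.

```python
# Hypothetical helper for post-processing the benchmark numbers;
# it is not part of the patches themselves.
def delta_percent(ticks_a, ticks_b):
    """Relative change of B against baseline A: ((B - A) / A) * 100."""
    return (ticks_b - ticks_a) / ticks_a * 100.0


# (label, bytes, NG4 avg ticks A, M7 avg ticks B), copied from the tables.
samples = [
    ("aligned memset", 128, 164, 44),
    ("aligned memset", 1024, 1552, 627),
    ("un-aligned memcpy", 128, 284, 334),
]

for label, size, ng4, m7 in samples:
    delta = delta_percent(ng4, m7)
    # Negative delta: M7 version is faster; positive: it is slower.
    print(f"{label:<18} {size:>6} {ng4:>6} {m7:>6} {delta:9.2f}")
```

The third sample row is one of the un-aligned memcpy sizes where the M7
version comes out slower, so its delta is positive.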