RE: [TCWG CI] 456.hmmer slowed down by 5% after llvm: Revert "Allow rematerialization of virtual reg uses"
[AMD Official Use Only]

I have tried a lot of different options, but I do not recall them now. Anyway, the patch has been reverted, and I do not seem to have the resources to pursue it further.

Stas

-----Original Message-----
From: Maxim Kuvyrkov
Sent: Friday, October 1, 2021 11:16
To: Mekhanoshin, Stanislav
Cc: linaro-toolchain@lists.linaro.org
Subject: Re: [TCWG CI] 456.hmmer slowed down by 5% after llvm: Revert "Allow rematerialization of virtual reg uses"
RE: [TCWG CI] 456.hmmer slowed down by 5% after llvm: Revert "Allow rematerialization of virtual reg uses"
> You mentioned that you saw different results for another ARM target. Could
> you elaborate, please?

When I was trying to reproduce the hmmer asm, I tried several different ARM targets. Apparently I never picked the one you were using, but I got very different results with different targets.

Stas

-----Original Message-----
From: Maxim Kuvyrkov
Sent: Friday, October 1, 2021 3:05
To: Mekhanoshin, Stanislav
Cc: linaro-toolchain@lists.linaro.org
Subject: Re: [TCWG CI] 456.hmmer slowed down by 5% after llvm: Revert "Allow rematerialization of virtual reg uses"
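One way to check whether the choice of ARM target alone changes the generated code is to compile the same file with and without an explicit `-mcpu=` and diff the assembly. The sketch below only builds the clang command lines. The helper name, the placeholder input file and the `-mcpu` values are illustrative assumptions. `-flto` is left out because, as we understand it, combining it with `-S` makes clang emit LLVM IR rather than ARM assembly.

```python
import shlex

# Triple used by the Linaro benchmarking CI, as quoted in this thread.
TARGET = "armv7a-linux-gnueabihf"


def clang_asm_cmd(src, out, cpu=None, flags=("-O2", "-marm")):
    """Build a clang argv that writes target assembly for `src` to `out`.

    cpu=None adds no -mcpu= flag, matching the CI's default tuning.
    -flto is deliberately omitted: with -S it would produce LLVM IR.
    """
    cmd = ["clang", f"--target={TARGET}", *flags, "-S", src, "-o", out]
    if cpu is not None:
        cmd.insert(2, f"-mcpu={cpu}")
    return cmd


if __name__ == "__main__":
    # The CPU list is arbitrary; "input.c" is a placeholder file name.
    for cpu in (None, "cortex-a15", "cortex-a9"):
        tag = cpu or "default"
        print(shlex.join(clang_asm_cmd("input.c", f"input.{tag}.s", cpu)))
```

Running each printed command and diffing the resulting `.s` files shows whether tuning alone changes register allocation for a given source file.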
Re: [TCWG CI] 456.hmmer slowed down by 5% after llvm: Revert "Allow rematerialization of virtual reg uses"
> On 1 Oct 2021, at 21:06, Mekhanoshin, Stanislav wrote:
>
> When I was trying to reproduce the hmmer asm, I tried several different ARM
> targets. Apparently I never picked the one you were using, but I got very
> different results with different targets.

Our benchmarking CI uses the default armhf target (--target=armv7a-linux-gnueabihf) with no additional -mcpu=/-march= tuning flags. Is it the same in your testing? If so, Clang should generate exactly the same assembly in both cases, with the same extra reloads in 456.hmmer.

The hardware used in benchmarking is Cortex-A15, which is still one of the most popular cores. Which one did you use in your experiments?

Thanks,

--
Maxim Kuvyrkov
https://www.linaro.org
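To check whether two builds really differ in "extra reloads", the first_bad and last_good save-temps assembly can be compared function by function. This sketch assumes verbose assembly in which LLVM tags stack reloads with comments such as `@ 4-byte Reload` (`@` is the ARM assembler comment character). Check a few lines of your own `.s` files before trusting the counts. The function and regex names are hypothetical.

```python
import re
import sys

# Assumed LLVM verbose-asm annotation, e.g. "@ 4-byte Reload" or
# "@ 8-byte Folded Reload". Verify against real output first.
RELOAD_RE = re.compile(r"@\s*\d+-byte (?:Folded )?Reload\b")
# A label at column 0; labels starting with "." (e.g. .LBB0_1) are local.
LABEL_RE = re.compile(r"^([A-Za-z_.$][\w.$]*):")


def reloads_per_function(asm_text):
    """Count annotated reloads under each global (non-dot) label."""
    counts = {}
    current = None
    for line in asm_text.splitlines():
        m = LABEL_RE.match(line)
        if m and not m.group(1).startswith("."):
            current = m.group(1)
            counts.setdefault(current, 0)
        elif current is not None and RELOAD_RE.search(line):
            counts[current] += 1
    return counts


def diff_reloads(before_asm, after_asm):
    """Per-function change in reload count (after minus before), nonzero only."""
    a = reloads_per_function(before_asm)
    b = reloads_per_function(after_asm)
    return {f: b.get(f, 0) - a.get(f, 0)
            for f in sorted(set(a) | set(b)) if b.get(f, 0) != a.get(f, 0)}


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python reloads.py last_good.s first_bad.s
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        for func, delta in diff_reloads(f1.read(), f2.read()).items():
            print(f"{func}: {delta:+d} reloads")
```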
RE: [TCWG CI] 456.hmmer slowed down by 5% after llvm: Revert "Allow rematerialization of virtual reg uses"
Maxim,

It is really difficult for me to work on this, as I do not have access to the various targets and hardware affected. I am sure there were quite a lot of progressions, but as I said in the beginning, regressions are also inevitable, as they are every time a heuristic is involved. For the hmmer case I was getting quite different results just by selecting a different ARM target. So without a good way to measure it, and given the heuristic approach, I cannot satisfy all the requests from multiple parties. Our target (AMDGPU) has done this for a long time, and I believe it is beneficial overall. It is a pity I cannot make this a universal optimization, but I am also time-constrained, as there is other work to do too.

Stas

-----Original Message-----
From: Maxim Kuvyrkov
Sent: Wednesday, September 29, 2021 4:17
To: Mekhanoshin, Stanislav
Cc: linaro-toolchain@lists.linaro.org
Subject: Re: [TCWG CI] 456.hmmer slowed down by 5% after llvm: Revert "Allow rematerialization of virtual reg uses"
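Some background on the heuristic being discussed. Rematerialization means recomputing a value where it is needed instead of reloading it from a stack slot. Going by the patch title, the change let instructions that read virtual registers be recomputed as well. That is only worthwhile when those input registers still hold their values at the point of use. The toy model below illustrates that availability check. It is a deliberately simplified sketch with hypothetical names, not LLVM's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Def:
    """Hypothetical defining instruction: an opcode and the virtual regs it reads."""
    opcode: str
    uses: tuple = ()


def can_rematerialize(defn, available_at_use):
    """Toy check: recompute only if every input register is still available.

    An instruction with no register inputs (e.g. a constant move) always
    passes; under our assumption this is roughly the only case allowed
    before the change.
    """
    return all(reg in available_at_use for reg in defn.uses)


def remat_or_reload(defn, available_at_use):
    return "remat" if can_rematerialize(defn, available_at_use) else "reload"


if __name__ == "__main__":
    const = Def("movw")
    add = Def("add", ("%1", "%2"))
    print(remat_or_reload(const, set()))        # remat
    print(remat_or_reload(add, {"%1"}))         # reload: %2 is not available
    print(remat_or_reload(add, {"%1", "%2"}))   # remat
```

A real allocator also weighs the cost of recomputing against the cost of a reload. That trade-off is where the target-dependent heuristics come in, and this sketch ignores it.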
Re: [TCWG CI] 456.hmmer slowed down by 5% after llvm: Revert "Allow rematerialization of virtual reg uses"
Hi Stanislav,

I fully understand the challenges of compiler optimizations and the fact that a generally good optimisation can slow down a small number of benchmarks.

Still, benchmarking your original patch (commit 92c1fd19abb15bc68b1127a26137a69e033cdb39) on arm-linux-gnueabihf shows an overall runtime slow-down across the C/C++ subset of SPEC CPU2006:
- 0.25% runtime geomean increase at -O2
- 0.37% runtime geomean increase at -O3

See [1] for the numbers.

You mentioned that you saw different results for another ARM target. Could you elaborate, please?

[1] https://docs.google.com/spreadsheets/d/1USWty9Vdx6JLo7TGddbkoKVUCiC4wtneOhhbHf5WXfc/edit?usp=sharing

Regards,

--
Maxim Kuvyrkov
https://www.linaro.org

> On 29 Sep 2021, at 20:13, Mekhanoshin, Stanislav wrote:
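A runtime geomean like the one quoted above combines per-benchmark ratios, so each benchmark counts equally regardless of its absolute run time. Here is a minimal sketch of that calculation. The run times are made up and are not the spreadsheet data, and the function name is hypothetical.

```python
import math


def geomean_change_pct(before, after):
    """Percent change of the geometric mean of per-benchmark ratios after/before.

    For run times, a positive result means the new build is slower overall.
    """
    if before.keys() != after.keys():
        raise ValueError("both runs must cover the same benchmarks")
    logs = [math.log(after[name] / before[name]) for name in before]
    return (math.exp(sum(logs) / len(logs)) - 1.0) * 100.0


if __name__ == "__main__":
    # Made-up run times in seconds, NOT the spreadsheet data.
    before = {"bench_a": 100.0, "bench_b": 200.0, "bench_c": 400.0}
    after = {"bench_a": 101.0, "bench_b": 200.0, "bench_c": 400.0}
    print(f"{geomean_change_pct(before, after):+.3f}%")
```

Here a 1% slow-down in one of three benchmarks moves the geomean by about 0.33%. This is why a 5% swing in a single benchmark such as 456.hmmer can dominate a suite-wide figure of a few tenths of a percent.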
Re: [TCWG CI] 456.hmmer slowed down by 5% after llvm: Revert "Allow rematerialization of virtual reg uses"
Hi Stanislav,

Just FYI. Your original patch improved 456.hmmer by 5%, that’s a nice speed up!

--
Maxim Kuvyrkov
https://www.linaro.org

> On 28 Sep 2021, at 08:21, ci_not...@linaro.org wrote:
[TCWG CI] 456.hmmer slowed down by 5% after llvm: Revert "Allow rematerialization of virtual reg uses"
After llvm commit 08d7eec06e8cf5c15a96ce11f311f1480291a441 Author: Stanislav Mekhanoshin Revert "Allow rematerialization of virtual reg uses" the following benchmarks slowed down by more than 2%: - 456.hmmer slowed down by 5% from 7649 to 8028 perf samples Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI. For your convenience, we have uploaded tarballs with pre-processed source and assembly files at: - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-arm-spec2k6-O2_LTO/16/artifact/artifacts/build-08d7eec06e8cf5c15a96ce11f311f1480291a441/save-temps/ - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-arm-spec2k6-O2_LTO/16/artifact/artifacts/build-e8e2edd8ca88f8b0a7dba141349b2aa83284f3af/save-temps/ - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-arm-spec2k6-O2_LTO/16/artifact/artifacts/build-baseline/save-temps/ Configuration: - Benchmark: SPEC CPU2006 - Toolchain: Clang + Glibc + LLVM Linker - Version: all components were built from their tip of trunk - Target: arm-linux-gnueabihf - Compiler flags: -O2 -flto -marm - Hardware: NVidia TK1 4x Cortex-A15 This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports. THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. 
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tk1/llvm-master-arm-spec2k6-O2_LTO

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-arm-spec2k6-O2_LTO/16/artifact/artifacts/build-08d7eec06e8cf5c15a96ce11f311f1480291a441/
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-arm-spec2k6-O2_LTO/16/artifact/artifacts/build-e8e2edd8ca88f8b0a7dba141349b2aa83284f3af/
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-arm-spec2k6-O2_LTO/16/artifact/artifacts/build-baseline/
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-arm-spec2k6-O2_LTO/16/artifact/artifacts/

Reproduce builds:

mkdir investigate-llvm-08d7eec06e8cf5c15a96ce11f311f1480291a441
cd investigate-llvm-08d7eec06e8cf5c15a96ce11f311f1480291a441

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-arm-spec2k6-O2_LTO/16/artifact/artifacts/manifests/build-baseline.sh --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-arm-spec2k6-O2_LTO/16/artifact/artifacts/manifests/build-parameters.sh --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-arm-spec2k6-O2_LTO/16/artifact/artifacts/test.sh --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/

cd llvm

# Reproduce first_bad build
git checkout --detach 08d7eec06e8cf5c15a96ce11f311f1480291a441
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach e8e2edd8ca88f8b0a7dba141349b2aa83284f3af
../artifacts/test.sh

cd ..

Full commit (up to 1000 lines):
commit 08d7eec06e8cf5c15a96ce11f311f1480291a441
Author: Stanislav Mekhanoshin
Date:   Fri Sep 24 09:53:51 2021 -0700

    Revert "Allow rematerialization of virtual reg uses"

    Reverted due to two distcint performance regression reports.

    This reverts commit 92c1fd19abb15bc68b1127a26137a69e033cdb39.
---
 llvm/include/llvm/CodeGen/TargetInstrInfo.h        |  12 +-
 llvm/lib/CodeGen/TargetInstrInfo.cpp               |   9 +-
 llvm/test/CodeGen/AMDGPU/remat-sop.mir             |  60 -
 llvm/test/CodeGen/ARM/arm-shrink-wrapping-linux.ll |  28 +-
 llvm/test/CodeGen/ARM/funnel-shift-rot.ll          |  32 +-
 llvm/test/CodeGen/ARM/funnel-shift.ll              |  30 +-
 .../test/CodeGen/ARM/illegal-bitfield-loadstore.ll |  30 +-
 llvm/test/CodeGen/ARM/neon-copy.ll                 |  10 +-
 llvm/test/CodeGen/Mips/llvm-ir/ashr.ll             | 227 +-
 llvm/test/CodeGen/Mips/llvm-ir/lshr.ll             | 206 +-
 llvm/test/Cod
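One rough way to compare the first_bad and last_good assembly from the save-temps tarballs is to count stack-pointer-relative loads, which include register-allocator reloads. The sketch below assumes the ARM-mode (-marm) assembly (.s) files have been extracted into one directory per build; that directory layout and the helper names `count_sp_loads` and `total_sp_loads` are hypothetical. Not every `[sp, ...]` load is a spill reload, and VFP/NEON reloads (`vldr`) and `ldrd` are not counted, so treat the totals as a hint about where to look, not as a measurement.

```shell
#!/bin/sh
# Count stack-pointer-relative loads such as "ldr r3, [sp, #12]" in one
# ARM-mode assembly file. Spill reloads show up this way, but so do ordinary
# stack-slot accesses, so this is only a rough proxy.
count_sp_loads() {
  # grep -c prints 0 (and exits 1) when nothing matches; the output is kept.
  grep -cE '^[[:space:]]*ldr[[:space:]]+r[0-9]+,[[:space:]]*\[sp' "$1"
}

# Sum count_sp_loads over every .s file in a directory.
# Hypothetical layout: one directory per build, e.g. first_bad/ and last_good/.
total_sp_loads() {
  total=0
  for f in "$1"/*.s; do
    # Skip the unexpanded glob pattern when the directory has no .s files.
    [ -f "$f" ] || continue
    n=$(count_sp_loads "$f")
    total=$((total + n))
  done
  echo "$total"
}
```

For example, running `total_sp_loads first_bad` and `total_sp_loads last_good` (hypothetical directory names) gives two totals to compare; if one build is noticeably higher, the per-file numbers from `count_sp_loads` narrow down which hmmer functions to inspect.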
Re: [TCWG CI] 456.hmmer slowed down by 5% after llvm: Revert "Allow rematerialization of virtual reg uses"
I thought the speed-up and slow-down from "Allow rematerialization of virtual reg uses" were for different benchmarks, but they are for the same benchmark (456.hmmer), just with different compilation flags:
- At -O2 the patch slows down 456.hmmer by 5% from 751s to 771s.
- At -O2 -flto the patch speeds up 456.hmmer by 5% from 803s to 765s.

Two observations from this:
1. 456.hmmer is very sensitive to this optimisation.
2. LTO screws up on 456.hmmer.

--
Maxim Kuvyrkov
https://www.linaro.org

> On 29 Sep 2021, at 14:06, Maxim Kuvyrkov wrote:
>
> Hi Stanislav,
>
> Just FYI. Your original patch improved 456.hmmer by 5%, that’s a nice speed
> up!
>
> --
> Maxim Kuvyrkov
> https://www.linaro.org
>
>> On 28 Sep 2021, at 08:21, ci_not...@linaro.org wrote:
>>
>> After llvm commit 08d7eec06e8cf5c15a96ce11f311f1480291a441
>> Author: Stanislav Mekhanoshin
>>
>> Revert "Allow rematerialization of virtual reg uses"
>>
>> the following benchmarks slowed down by more than 2%:
>> - 456.hmmer slowed down by 5% from 7649 to 8028 perf samples
>>
>> Below reproducer instructions can be used to re-build both "first_bad" and
>> "last_good" cross-toolchains used in this bisection. Naturally, the scripts
>> will fail when triggering benchmarking jobs if you don't have access to
>> Linaro TCWG CI.