Bug#949638: tesseract: uses -march=native
On Sun, May 24, 2020 at 10:14:49PM +0200, Stefan Weil wrote:
> Adrian, I am afraid that there is a misunderstanding.
>
> The code part which is compiled with -march=native is never executed by
> default.

I get that point.

> There is a command line option which allows users to select the code
> which is used for certain time critical calculations (dot product). A
> wrong choice is not a security problem

You misunderstand the part about the security update, security updates are just the most common reason why a package gets updated (and therefore rebuilt) in a stable distribution.

Example:
Debian 11 will be released in summer 2021.
In autumn 2021 a user sets up a new system and selects "native" for an important production setup with an Intel CPU.
In spring 2022 a (security or other) update for Tesseract happens in Debian 11, built on a buildd with the latest AMD CPU.
The working production setup suddenly always crashes.

> That's quite common for other packages including the standard C
> library and scientific libraries, too. They all contain optimized
> functions which require certain hardware and which crash otherwise.

With proper runtime autodetection of the hardware, if you manage to get a crash it is a bug in these packages.

It is quite rare that packages offer manual selection in addition to autodetection.

> but simply will crash the
> application, no matter whether the user selected "native", "avx" or
> "neon".

Even when built on the same computer I would have doubts whether automatic vectorization[1] of the trivial C code really beats the hand-written AVX2 code, but when the code is not even built for the computer in question what's the point?

A "native" option meaning "some random buildd somewhere" is just confusing, it doesn't make sense for distributions.

> Regards
>
> Stefan

cu
Adrian

[1] if it happens at all, the Debian package build currently overwrites the -O3 with a subsequent -O2
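The point about manual selection combined with autodetection can be illustrated with a small sketch. It is not Tesseract's code: the names DotFn, DotGeneric, DotAvx2, Avx2IfSupported and ResolveDotProduct, and the option values "auto", "generic" and "avx2", are all hypothetical. It assumes GCC or Clang on x86_64 for the AVX2 path and falls back to the generic loop elsewhere. The idea is only that an explicit user choice is checked against the running CPU and rejected with an error, instead of being trusted and crashing later with an illegal instruction.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

using DotFn = double (*)(const double*, const double*, std::size_t);

// Portable variant: runs on every CPU the package supports.
static double DotGeneric(const double* u, const double* v, std::size_t n) {
  double total = 0.0;
  for (std::size_t i = 0; i < n; ++i) total += u[i] * v[i];
  return total;
}

#if defined(__GNUC__) && defined(__x86_64__)
// Same loop, but only this function may be compiled to AVX2 instructions.
__attribute__((target("avx2")))
static double DotAvx2(const double* u, const double* v, std::size_t n) {
  double total = 0.0;
  for (std::size_t i = 0; i < n; ++i) total += u[i] * v[i];
  return total;
}
#endif

// Returns the AVX2 variant, or nullptr when the running CPU (or the build
// target) has no AVX2. The check uses the CPU the program runs on.
static DotFn Avx2IfSupported() {
#if defined(__GNUC__) && defined(__x86_64__)
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) return DotAvx2;
#endif
  return nullptr;
}

// Resolves a user-supplied variant name (hypothetical option values).
// "auto" never fails; an explicit but unsupported choice is an error.
DotFn ResolveDotProduct(const std::string& name) {
  if (name == "generic") return DotGeneric;
  if (name == "auto") {
    DotFn fast = Avx2IfSupported();
    return fast ? fast : DotGeneric;
  }
  if (name == "avx2") {
    if (DotFn fast = Avx2IfSupported()) return fast;
    throw std::runtime_error("avx2 requested but not supported by this CPU");
  }
  throw std::invalid_argument("unknown dot product variant: " + name);
}
```

With this shape there is no variant whose meaning depends on the build machine, so a rebuild on a different buildd cannot change which CPUs a given option value works on.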
Bug#949638: tesseract: uses -march=native
Adrian, I am afraid that there is a misunderstanding.

The code part which is compiled with -march=native is never executed by default.

There is a command line option which allows users to select the code which is used for certain time critical calculations (dot product). A wrong choice is not a security problem but simply will crash the application, no matter whether the user selected "native", "avx" or "neon".

That's quite common for other packages including the standard C library and scientific libraries, too. They all contain optimized functions which require certain hardware and which crash otherwise.

Regards

Stefan
Bug#949638: tesseract: uses -march=native
Control: severity -1 serious

On Fri, Jan 24, 2020 at 10:28:36PM +0100, Stefan Weil wrote:
> On 24.01.20 at 21:53, peter green wrote:
>
> > I still don't think -march=native is appropriate for a binary
> > distribution though. If you want to offer different versions of the
> > code built with different CPU requirements, that is fine, but please
> > don't let them depend on what CPU happens to be in the autobuilder.
>
> Better ideas are welcome.
>
> Tesseract is used for mass processing of books which can take many weeks
> or even months. Therefore it is very important that the time critical
> code (dot product calculation) runs as fast as possible.
>
> For x86_64 we know the available SIMD instructions (SSE, AVX, ...) which
> can be used, add code for all variants and check at runtime what is
> supported by the CPU.
>
> For all other architectures (including ARM) there is currently no such
> special code, and the default code is rather slow. By using
> -march=native for the alternate code, hopefully the compiler will
> produce code which runs faster on any machine which is similar to the
> build machine.

And which will crash on any machine that is not compatible.

If Debian gets an AMD Threadripper buildd, then -march=native code built on that buildd will not run on any CPU from Intel.

The same is true for other architectures like ARM - if a buildd happens to be the latest server hardware and uses some uncommon CPU extension, it is not compatible with many machines.

> Users who build Tesseract on the machine which is also
> used for the mass production will get the best result like that. Users
> using a distribution can try the "native" option and either crash the
> program or get a possibly faster result.
>
> I see the problem of builds which depend on an autobuilder which may be
> different for each build.

"each build" might be a security update in stable on a much newer buildd, and the tested setup running in production might then just crash.

As a user I can handle a quirk or two when setting up a machine, but anything that breaks out of the void is a huge pain.

If Tesseract is the main usage of the machine and every percentage point of performance matters, I would likely build it myself instead of using the distribution package.

> What would be the best solution for
> distributions? Suppress the code using a new configure option or some
> magic which detects that the build is for a Debian distribution? Choose
> compiler flags manually for the "native" option (that is already
> possible, see my previous answer)? Other solutions?

The important point is that manual installations and distributions have different needs.

When a user is manually compiling and installing software on a machine, -march=native can be used for everything, not just the most time critical part - it is faster with no real downside.

For a distribution you want several variants like what you already have for x86_64, and no -march=native ever.

If 32bit ARM still matters for your users they might want to use NEON; many Debian buildds have CPUs that do not support NEON.

For 64bit ARM they might want armv8.2-a+dotprod; I do not think any current Debian buildd supports that.

You should know best which extensions actually make a difference, and when the fastest option is autodetected at runtime it is most likely to benefit the user of a distribution package.

> Stefan

cu
Adrian
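As a rough illustration of the runtime autodetection suggested for ARM, the following sketch asks the Linux kernel which extensions the running CPU has. It assumes Linux with glibc (getauxval from <sys/auxv.h> and the HWCAP_* bits from <asm/hwcap.h>); the function name BestArmVariant and the returned variant names are hypothetical, not Tesseract options. On any other platform it simply reports the generic variant.

```cpp
#include <string>

#if defined(__linux__) && (defined(__aarch64__) || defined(__arm__))
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_* feature bits
#endif

// Returns the name of the fastest variant the running CPU can execute.
// The decision depends on the machine running the code, never on the
// machine that built it. Variant names are illustrative only.
std::string BestArmVariant() {
#if defined(__linux__) && defined(__aarch64__)
  const unsigned long hwcap = getauxval(AT_HWCAP);
#ifdef HWCAP_ASIMDDP
  // Advanced SIMD dot product instructions (the "+dotprod" extension).
  if (hwcap & HWCAP_ASIMDDP) return "asimd-dotprod";
#endif
  if (hwcap & HWCAP_ASIMD) return "asimd";
  return "generic";
#elif defined(__linux__) && defined(__arm__)
  // 32-bit ARM: NEON is optional, so it has to be checked.
  if (getauxval(AT_HWCAP) & HWCAP_NEON) return "neon";
  return "generic";
#else
  return "generic";
#endif
}
```

Each variant would be compiled in its own source file with a fixed flag such as -march=armv8.2-a+dotprod rather than -march=native, so the result is the same on every buildd.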
Bug#949638: tesseract: uses -march=native
> For example, default compiler flags in Debian unstable now: > $ dpkg-buildflags --get CXXFLAGS > -g -O2 -fdebug-prefix-map=/home/boris=. -fstack-protector-strong -Wformat > -Werror=format-security > $ dpkg-buildflags --get CFLAGS > -g -O2 -fdebug-prefix-map=/home/boris=. -fstack-protector-strong -Wformat > -Werror=format-security > $ dpkg-buildflags --get CPPFLAGS > -Wdate-time -D_FORTIFY_SOURCE=2 > $ dpkg-buildflags --get LDFLAGS > -Wl,-z,relro This is on amd64 of course. Sorry for extra message. -- Boris
Bug#949638: tesseract: uses -march=native
Hi,

> I see the problem of builds which depend on an autobuilder which may be
> different for each build. What would be the best solution for
> distributions?

1) Special configuration option, which disables all CPU specific optimizations in compiler flags.

or

2) Special configuration option, which disables all additional compiler flags which tesseract developers tend to add. Only compiler flags from system environment will be used in this case.

For example, default compiler flags in Debian unstable now:

$ dpkg-buildflags --get CXXFLAGS
-g -O2 -fdebug-prefix-map=/home/boris=. -fstack-protector-strong -Wformat -Werror=format-security
$ dpkg-buildflags --get CFLAGS
-g -O2 -fdebug-prefix-map=/home/boris=. -fstack-protector-strong -Wformat -Werror=format-security
$ dpkg-buildflags --get CPPFLAGS
-Wdate-time -D_FORTIFY_SOURCE=2
$ dpkg-buildflags --get LDFLAGS
-Wl,-z,relro

But many (if not most) packages are built with additional flags now, see:
https://wiki.debian.org/Hardening

> Suppress the code using a new configure option or some
> magic which detects that the build is for a Debian distribution?

This is never an option for Debian.

Also do not forget about other GNU/Linux and *BSD distributions...

Hope this helps.

Best regards,
Boris
Bug#949638: tesseract: uses -march=native
On 24.01.20 at 21:53, peter green wrote:
> I still don't think -march=native is appropriate for a binary
> distribution though. If you want to offer different versions of the
> code built with different CPU requirements, that is fine, but please
> don't let them depend on what CPU happens to be in the autobuilder.

Better ideas are welcome.

Tesseract is used for mass processing of books which can take many weeks or even months. Therefore it is very important that the time critical code (dot product calculation) runs as fast as possible.

For x86_64 we know the available SIMD instructions (SSE, AVX, ...) which can be used, add code for all variants and check at runtime what is supported by the CPU.

For all other architectures (including ARM) there is currently no such special code, and the default code is rather slow. By using -march=native for the alternate code, hopefully the compiler will produce code which runs faster on any machine which is similar to the build machine. Users who build Tesseract on the machine which is also used for the mass production will get the best result like that. Users using a distribution can try the "native" option and either crash the program or get a possibly faster result.

I see the problem of builds which depend on an autobuilder which may be different for each build. What would be the best solution for distributions? Suppress the code using a new configure option or some magic which detects that the build is for a Debian distribution? Choose compiler flags manually for the "native" option (that is already possible, see my previous answer)? Other solutions?

Stefan
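The x86_64 approach described in this message (build every variant, check at runtime what the CPU supports) can be sketched as follows. This is not Tesseract's actual implementation: DotProductGeneric, DotProductAVX2, DotProductFn and SelectDotProduct are names made up for the example, and the AVX2 variant relies on the GCC/Clang target attribute and __builtin_cpu_supports instead of hand-written intrinsics.

```cpp
#include <cstddef>

// Portable fallback: plain C++ that runs on every CPU the package supports.
static double DotProductGeneric(const double* u, const double* v,
                                std::size_t n) {
  double total = 0.0;
  for (std::size_t i = 0; i < n; ++i) total += u[i] * v[i];
  return total;
}

#if defined(__GNUC__) && defined(__x86_64__)
// Same loop, but the compiler may emit AVX2 instructions for this one
// function. The requirement is fixed in the source, not taken from the
// build machine.
__attribute__((target("avx2")))
static double DotProductAVX2(const double* u, const double* v,
                             std::size_t n) {
  double total = 0.0;
  for (std::size_t i = 0; i < n; ++i) total += u[i] * v[i];
  return total;
}
#endif

using DotProductFn = double (*)(const double*, const double*, std::size_t);

// Picks an implementation from what the running CPU reports.
DotProductFn SelectDotProduct() {
#if defined(__GNUC__) && defined(__x86_64__)
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) return DotProductAVX2;
#endif
  return DotProductGeneric;
}
```

A caller would typically run SelectDotProduct() once at startup and keep the returned pointer. Extending the same pattern to other architectures needs one explicitly targeted variant per extension, which is what avoids any dependence on the autobuilder's CPU.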
Bug#949638: tesseract: uses -march=native
Severity 949638 normal
Thanks

On 24/01/2020 19:16, Stefan Weil wrote:
> As far as I know all Linux distributions use the autoconf based build,

Debian certainly does appear to be using the autoconf based build.

> The default autoconf build uses -march=native only if it is supported by
> the compiler

Which, of course it is.

> and only for a single file, but not for the rest of the code. The code
> from that single file is not executed by default, but only if an advanced
> user runs Tesseract with a special command line option
> (-c dotproduct=native).

Ok, that dramatically reduces the impact of this issue. Downgrading the bug to normal.

I still don't think -march=native is appropriate for a binary distribution though. If you want to offer different versions of the code built with different CPU requirements, that is fine, but please don't let them depend on what CPU happens to be in the autobuilder.
Bug#949638: tesseract: uses -march=native
It is not necessary to patch Tesseract code if for whatever reason -march=native is completely unwanted. `make libtesseract_native_la_CXXFLAGS=` will override the extra compiler flags which are used to produce the native code, so only the default flags which don't include -march=native will be used. Stefan
Bug#949638: tesseract: uses -march=native
> The URL for the patch is 404. s/tessarect/tesseract/ The fixed URL is https://debdiffs.raspbian.org/main/t/tesseract/. Stefan
Bug#949638: tesseract: uses -march=native
On 24.01.20 at 19:55, Jeff Breidenbach wrote:
>
> Regarding: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=949638
>
> Thank you, Peter.
>
> 1. The URL for the patch is 404.
>
> 2. There may be some subtlety with -march=native, specifically related to
> detection of SIMD instructions like AVX2. There's been an enormous
> amount of back & forth on this topic in upstream over the years, so
> I'd like to take this bug there and let them weigh in.
>
> Jeff

That might be a false alarm.

Tesseract supports two different build systems, one based on cmake, one based on autoconf.

As far as I know all Linux distributions use the autoconf based build, so they should not be affected by the existing problems from the cmake build.

The default autoconf build uses -march=native only if it is supported by the compiler and only for a single file, but not for the rest of the code. The code from that single file is not executed by default, but only if an advanced user runs Tesseract with a special command line option (-c dotproduct=native).

Stefan
Bug#949638: tesseract: uses -march=native
BCC: Stefan Weil since I don't know if he wants his email posted in bugs.debian.org

Regarding: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=949638

Thank you, Peter.

1. The URL for the patch is 404.

2. There may be some subtlety with -march=native, specifically related to detection of SIMD instructions like AVX2. There's been an enormous amount of back & forth on this topic in upstream over the years, so I'd like to take this bug there and let them weigh in.

Jeff
Bug#949638: tesseract: uses -march=native
Package: tesseract
Version: 4.1.0-1
Severity: serious
Tags: patch

I recently discovered that tesseract 4.1.1-1 failed the armv7 contamination check we run in raspbian. Investigating shows that since version 4.1.0-1 tesseract started using -march=native. This compiler option is totally inappropriate for a binary distribution like Debian or Raspbian, because it means that the minimum CPU requirements of the resulting binaries will depend on what CPU the buildbox happens to have.

4.1.0-1 was never built in raspbian, I am not sure why 4.1.0-2 passed the contamination check in raspbian. My best guess is that -march=native on arm is poorly implemented and does not recognise the CPUs on some of our buildboxes.

Anyway I whipped up a fix and uploaded it to raspbian. A debdiff should appear soon at https://debdiffs.raspbian.org/main/t/tessarect/