Bug#949638: tesseract: uses -march=native

2020-05-24 Thread Adrian Bunk
On Sun, May 24, 2020 at 10:14:49PM +0200, Stefan Weil wrote:
> Adrian, I am afraid that there is a misunderstanding.
> 
> The code part which is compiled with -march=native is never executed by
> default.

I get that point.

> There is a command line option which allows users to select the code
> which is used for certain time critical calculations (dot product). A
> wrong choice is not a security problem

You misunderstand the part about the security update.
Security updates are just the most common reason why
a package gets updated (and therefore rebuilt) in a
stable distribution.

Example:
Debian 11 will be released in summer 2021.
In autumn 2021 a user sets up a new system and selects "native"
for an important production setup with an Intel CPU.
In spring 2022 a (security or other) update for Tesseract happens
in Debian 11, built on a buildd with the latest AMD CPU.
The working production setup suddenly crashes on every run.

> That's quite common for other packages including the standard C
> library and scientific libraries, too. They all contain optimized
> functions which require certain hardware and which crash otherwise.

With proper runtime autodetection of the hardware, if you manage to get 
a crash it is a bug in these packages. It is quite rare that packages 
offer manual selection in addition to autodetection.
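
A minimal sketch of such runtime autodetection, assuming GCC or Clang on
x86_64. This is not Tesseract's actual code: the function names and the
HAVE_AVX2_VARIANT macro are made up for illustration. The AVX2 variant is
the same plain loop, compiled with a per-function target attribute instead
of -march=native, and it is only called after the running CPU has been
checked.

```cpp
#include <cstddef>

// Baseline variant: plain C++ that runs on every CPU of the target architecture.
static double DotProductGeneric(const double* u, const double* v, std::size_t n) {
  double total = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    total += u[i] * v[i];
  }
  return total;
}

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#define HAVE_AVX2_VARIANT 1  // hypothetical macro, used only in this sketch
// Same loop, but the compiler is allowed to emit AVX2 instructions for this
// one function only. The rest of the program keeps the baseline flags.
__attribute__((target("avx2")))
static double DotProductAVX2(const double* u, const double* v, std::size_t n) {
  double total = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    total += u[i] * v[i];
  }
  return total;
}
#endif

using DotProductFunction = double (*)(const double*, const double*, std::size_t);

// Chooses a variant from the CPU the program is running on, so the choice
// does not depend on the CPU of the machine that built the binary.
static DotProductFunction SelectDotProduct() {
#ifdef HAVE_AVX2_VARIANT
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) {
    return DotProductAVX2;
  }
#endif
  return DotProductGeneric;
}

// Public entry point: the selection runs once, on first use.
double DotProduct(const double* u, const double* v, std::size_t n) {
  static const DotProductFunction selected = SelectDotProduct();
  return selected(u, v, n);
}
```

A caller only ever uses DotProduct(u, v, n). On a CPU without AVX2, or on
another architecture, the generic variant is selected, so a rebuild on a
different buildd cannot change which machines the binary runs on.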

> but simply will crash the
> application, no matter whether the user selected "native", "avx" or
> "neon".

Even when built on the same computer I would have doubts whether
automatic vectorization[1] of the trivial C code really beats the 
hand-written AVX2 code, but when the code is not even built for
the computer in question what's the point?

A "native" option meaning "some random buildd somewhere" is just
confusing; it doesn't make sense for distributions.

> Regards
> 
> Stefan

cu
Adrian

[1] if it happens at all: the Debian package build currently overrides
the -O3 with a subsequent -O2, and GCC currently enables its loop
vectorizer at -O3 but not at -O2



Bug#949638: tesseract: uses -march=native

2020-05-24 Thread Stefan Weil
Adrian, I am afraid that there is a misunderstanding.

The code part which is compiled with -march=native is never executed by
default.

There is a command line option which allows users to select the code
which is used for certain time critical calculations (dot product). A
wrong choice is not a security problem but simply will crash the
application, no matter whether the user selected "native", "avx" or
"neon". That's quite common for other packages including the standard C
library and scientific libraries, too. They all contain optimized
functions which require certain hardware and which crash otherwise.

Regards

Stefan



Bug#949638: tesseract: uses -march=native

2020-05-24 Thread Adrian Bunk
Control: severity -1 serious

On Fri, Jan 24, 2020 at 10:28:36PM +0100, Stefan Weil wrote:
> On 24.01.20 at 21:53, peter green wrote:
> 
> > I still don't think -march=native is appropriate for a binary
> > distribution though. If you want to offer different versions of the
> > code built with different CPU requirements, that is fine, but please
> > don't let them depend on what CPU happens to be in the autobuilder.
> 
> Better ideas are welcome.
> 
> Tesseract is used for mass processing of books which can take many weeks
> or even months. Therefore it is very important that the time critical
> code (dot product calculation) runs as fast as possible.
> 
> For x86_64 we know the available SIMD instructions (SSE, AVX, ...) which
> can be used, add code for all variants and check at runtime what is
> supported by the CPU.
> 
> For all other architectures (including ARM) there is currently no such
> special code, and the default code is rather slow. By using
> -march=native for the alternate code, hopefully the compiler will
> produce code which runs faster on any machine which is similar to the
> build machine.

And which will crash on any machine that is not compatible.

If Debian gets an AMD Threadripper buildd, then -march=native code built 
on that buildd will not run on any CPU from Intel.

The same is true for other architectures like ARM - if a buildd happens 
to be the latest server hardware and uses some uncommon CPU extension, 
it is not compatible with many machines.

> Users who build Tesseract on the machine which is also
> used for the mass production will get the best result like that. Users
> using a distribution can try the "native" option and either crash the
> program or get a possibly faster result.
> 
> I see the problem of builds which depend on an autobuilder which may be
> different for each build.

"each build" might be a security update in stable on a much newer buildd,
and the tested setup running in production might then just crash.

As a user I can handle a quirk or two when setting up a machine,
but anything that breaks out of nowhere is a huge pain.

If Tesseract is the main use of the machine and every percentage point 
of performance matters, I would likely build it myself instead of using 
the distribution package.

> What would be the best solution for
> distributions? Suppress the code using a new configure option or some
> magic which detects that the build is for a Debian distribution? Choose
> compiler flags manually for the "native" option (that is already
> possible, see my previous answer)? Other solutions?

The important point is that manual installations and distributions have
different needs.

When a user is manually compiling and installing software on a machine, 
-march=native can be used for everything, not just the most time 
critical part. It is faster with no real downside.

For a distribution you want several variants like what you already
have for x86_64, and no -march=native ever.

If 32-bit ARM still matters for your users, they might want to use NEON,
but many Debian buildds have CPUs that do not support NEON.

For 64-bit ARM they might want armv8.2-a+dotprod,
and I do not think any current Debian buildd supports that.

You should know best which extensions actually make a difference,
and when the fastest option is autodetected at runtime it is most
likely to benefit the user of a distribution package.
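
A hedged sketch of what that autodetection could look like on ARM Linux,
using getauxval() and the HWCAP_NEON / HWCAP_ASIMDDP bits from the kernel
headers. The function name and the variant labels are made up for
illustration and are not taken from Tesseract. On any other platform the
sketch simply reports the generic variant.

```cpp
#include <string>

#if defined(__linux__) && (defined(__aarch64__) || defined(__arm__))
#include <asm/hwcap.h>  // HWCAP_* bit masks from the Linux kernel headers
#include <sys/auxv.h>   // getauxval()
#endif

// Returns a label for the fastest variant the running CPU supports.
// The labels are hypothetical names used only in this sketch.
std::string DetectArmVariant() {
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_ASIMDDP)
  // 64-bit ARM: this bit is set when the CPU has the dot product extension.
  if (getauxval(AT_HWCAP) & HWCAP_ASIMDDP) {
    return "dotprod";
  }
#elif defined(__linux__) && defined(__arm__) && defined(HWCAP_NEON)
  // 32-bit ARM: NEON is optional, so it has to be checked at runtime.
  if (getauxval(AT_HWCAP) & HWCAP_NEON) {
    return "neon";
  }
#endif
  // Fallback that is safe on every CPU of the architecture.
  return "generic";
}
```

The matching variants would then be built with explicit flags for the one
file that needs them (for example armv8.2-a+dotprod), never with
-march=native, and selected by the returned label at startup.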

> Stefan

cu
Adrian



Bug#949638: tesseract: uses -march=native

2020-01-27 Thread Boris Pek
> For example, default compiler flags in Debian unstable now:
> $ dpkg-buildflags --get CXXFLAGS
> -g -O2 -fdebug-prefix-map=/home/boris=. -fstack-protector-strong -Wformat 
> -Werror=format-security
> $ dpkg-buildflags --get CFLAGS
> -g -O2 -fdebug-prefix-map=/home/boris=. -fstack-protector-strong -Wformat 
> -Werror=format-security
> $ dpkg-buildflags --get CPPFLAGS
> -Wdate-time -D_FORTIFY_SOURCE=2
> $ dpkg-buildflags --get LDFLAGS
> -Wl,-z,relro

This is on amd64 of course.

Sorry for the extra message.

-- 
Boris



Bug#949638: tesseract: uses -march=native

2020-01-27 Thread Boris Pek
Hi,

> I see the problem of builds which depend on an autobuilder which may be
> different for each build. What would be the best solution for
> distributions?

1) Special configuration option, which disables all CPU specific optimizations
   in compiler flags.

or

2) Special configuration option, which disables all additional compiler flags
   which tesseract developers tend to add. Only compiler flags from the system
   environment will be used in this case.

For example, default compiler flags in Debian unstable now:
$ dpkg-buildflags --get CXXFLAGS
-g -O2 -fdebug-prefix-map=/home/boris=. -fstack-protector-strong -Wformat 
-Werror=format-security
$ dpkg-buildflags --get CFLAGS
-g -O2 -fdebug-prefix-map=/home/boris=. -fstack-protector-strong -Wformat 
-Werror=format-security
$ dpkg-buildflags --get CPPFLAGS
-Wdate-time -D_FORTIFY_SOURCE=2
$ dpkg-buildflags --get LDFLAGS
-Wl,-z,relro

But many (if not most) packages are built with additional flags now; see:
https://wiki.debian.org/Hardening

> Suppress the code using a new configure option or some
> magic which detects that the build is for a Debian distribution?

This is never an option for Debian. Also do not forget about other GNU/Linux
and *BSD distributions...

Hope this helps.

Best regards,
Boris



Bug#949638: tesseract: uses -march=native

2020-01-24 Thread Stefan Weil
On 24.01.20 at 21:53, peter green wrote:

> I still don't think -march=native is appropriate for a binary
> distribution though. If you want to offer different versions of the
> code built with different CPU requirements, that is fine, but please
> don't let them depend on what CPU happens to be in the autobuilder.


Better ideas are welcome.

Tesseract is used for mass processing of books which can take many weeks
or even months. Therefore it is very important that the time critical
code (dot product calculation) runs as fast as possible.

For x86_64 we know the available SIMD instructions (SSE, AVX, ...) which
can be used, add code for all variants and check at runtime what is
supported by the CPU.

For all other architectures (including ARM) there is currently no such
special code, and the default code is rather slow. By using
-march=native for the alternate code, hopefully the compiler will
produce code which runs faster on any machine which is similar to the
build machine. Users who build Tesseract on the machine which is also
used for the mass production will get the best result like that. Users
using a distribution can try the "native" option and either crash the
program or get a possibly faster result.

I see the problem of builds which depend on an autobuilder which may be
different for each build. What would be the best solution for
distributions? Suppress the code using a new configure option or some
magic which detects that the build is for a Debian distribution? Choose
compiler flags manually for the "native" option (that is already
possible, see my previous answer)? Other solutions?

Stefan



Bug#949638: tesseract: uses -march=native

2020-01-24 Thread peter green

Severity 949638 normal
Thanks

On 24/01/2020 19:16, Stefan Weil wrote:

> As far as I know all Linux distributions use the autoconf based build,

Debian certainly does appear to be using the autoconf based build.

> The default autoconf build uses -march=native only if it is supported by
> the compiler

Which, of course it is.

> and only for a single file, but not for the rest of the
> code. The code from that single file is not executed by default, but
> only if an advanced user runs Tesseract with a special command line
> option (-c dotproduct=native).

Ok, that dramatically reduces the impact of this issue. Downgrading the bug to 
normal.

I still don't think -march=native is appropriate for a binary distribution 
though. If you want to offer different versions of the code built with 
different CPU requirements, that is fine, but please don't let them depend on 
what CPU happens to be in the autobuilder.



Bug#949638: tesseract: uses -march=native

2020-01-24 Thread Stefan Weil
It is not necessary to patch Tesseract code if for whatever reason
-march=native is completely unwanted.

`make libtesseract_native_la_CXXFLAGS=` will override the extra compiler
flags which are used to produce the native code, so only the default
flags which don't include -march=native will be used.

Stefan



Bug#949638: tesseract: uses -march=native

2020-01-24 Thread Stefan Weil
> The URL for the patch is 404.

s/tessarect/tesseract/

The fixed URL is https://debdiffs.raspbian.org/main/t/tesseract/.

Stefan



Bug#949638: tesseract: uses -march=native

2020-01-24 Thread Stefan Weil
On 24.01.20 at 19:55, Jeff Breidenbach wrote:

>
> Regarding: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=949638
>
> Thank you, Peter.
>
> 1. The URL for the patch is 404.
>
> 2. There may be some subtlety with -march=native, specifically related to
> detection of SIMD instructions like AVX2. There's been an enormous
> amount of back & forth on this topic in upstream over the years, so I'd like
> to take this bug there and let them weigh in.
>
> Jeff


That might be a false alarm.

Tesseract supports two different build systems, one based on cmake, one
based on autoconf.

As far as I know all Linux distributions use the autoconf based build,
so they should not be affected by the existing problems from the cmake
build.

The default autoconf build uses -march=native only if it is supported by
the compiler and only for a single file, but not for the rest of the
code. The code from that single file is not executed by default, but
only if an advanced user runs Tesseract with a special command line
option (-c dotproduct=native).

Stefan



Bug#949638: tesseract: uses -march=native

2020-01-24 Thread Jeff Breidenbach
BCC: Stefan Weil since I don't know if he wants his email posted in
bugs.debian.org

Regarding: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=949638

Thank you, Peter.

1. The URL for the patch is 404.

2. There may be some subtlety with -march=native, specifically related to
detection of  SIMD instructions like AVX2. There's been an enormous
amount of back & forth on this topic in upstream over the years, so I'd like
to take this bug there and let them weigh in.

Jeff


Bug#949638: tesseract: uses -march=native

2020-01-22 Thread peter green

Package: tesseract
Version: 4.1.0-1
Severity: serious
Tags: patch

I recently discovered that tesseract 4.1.1-1 failed the armv7 contamination 
check we run in Raspbian.

Investigation shows that tesseract started using -march=native in version 
4.1.0-1. This compiler option is totally inappropriate for a binary 
distribution like Debian or Raspbian, because it means that the minimum CPU 
requirements of the resulting binaries will depend on what CPU the buildbox 
happens to have.

4.1.0-1 was never built in Raspbian, and I am not sure why 4.1.0-2 passed the 
contamination check there. My best guess is that -march=native on arm is 
poorly implemented and does not recognise the CPUs on some of our buildboxes.

Anyway I whipped up a fix and uploaded it to raspbian. A debdiff should appear 
soon at https://debdiffs.raspbian.org/main/t/tessarect/