Re: [gentoo-user] Re: CFLAGs for kernel compilation

2015-05-02 Thread Volker Armin Hemmann
On 02.05.2015 at 07:04, Nikos Chantziaras wrote:
 On 01/05/15 10:44, Andrew Savchenko wrote:
 On Fri, 1 May 2015 05:09:51 + (UTC) Martin Vaeth wrote:
 Andrew Savchenko birc...@gentoo.org wrote:

 That's why the kernel uses CFLAGS to make sure that no floating point
 instructions sneak in; you can see a lot of -mno-${instruction_set}
 flags when running make V=1.

 So it should be sufficient that the kernel does not use float
 or double, shouldn't it?

 No. Optimizer paths may be far from obvious, e.g. I would not be
 surprised if under some conditions the vectorizer used float
 instructions for int code.

 The kernel uses -O2 and several -march variants (e.g. -march=core2).
 Several other options are used to prevent GCC from generating
 unsuitable code.

 Specifying another -march variant does not affect the optimizer
 though. It only affects the code generator. If you don't modify the
 other CFLAGS and only change -march, you will not get FP instructions
 unless you use FP in the code.

 Also, I'd be very interested to see *any* optimization that would
 somehow transform integer code to FP code (note that SIMD is not FP
 and is perfectly fine in the kernel.) In fact, optimizers tend to
 transform FP into SIMD, at least on x86 (and other architectures that
 have fast SIMD instructions.) If I inspect the generated assembly from
 GCC or Clang, I cannot find FP anywhere, even for code using float
 and double operations. They get converted to SIMD on modern CPUs
 (unless you specify a compiler flag that tells it to use the FPU, for
 example if you need 80-bit extended precision, which is supported by
 the x86 FPU.)




http://www.agner.org/optimize/calling_conventions.pdf

Device drivers under Linux
Linux systems use lazy saving of floating point registers and vector
registers. This means that these registers are not saved and restored
on every task switch. Instead they are saved/restored on the first
access after a task switch. This method saves time in case no more than
one thread uses these registers. The lazy saving scheme is not supported
in kernel mode. Any device driver that attempts to use these registers
improperly will cause an exception that will probably make the system
crash. A device driver that needs to use vector registers must first
save these registers by calling the function kernel_fpu_begin() and
restore the registers by calling kernel_fpu_end() before returning or
sleeping. These functions also prevent pre-emptive interruption of the
device driver which could otherwise mess up the registers.
kernel_fpu_begin() saves all floating point registers and vector
registers if available.
There is no red zone in 64-bit Linux kernel mode.
The programmer should be aware of these restrictions if calling any
other library than the system kernel libraries from a device driver.
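
The pattern described in the quoted passage could be sketched roughly as
follows. This is a user-space mock-up, not driver code: the functions named
kernel_fpu_begin() and kernel_fpu_end() below are local stand-ins (an
assumption made so the example compiles and runs outside the kernel), and
fpu_guard_depth and scaled_sum() are hypothetical names. In a real driver
the two calls come from the kernel headers, and the code between them must
not sleep.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the kernel functions of the same name.
 * They only count nesting so the guarded region can be checked. */
static int fpu_guard_depth = 0;

static void kernel_fpu_begin(void) { fpu_guard_depth++; }
static void kernel_fpu_end(void)   { fpu_guard_depth--; }

/* Hypothetical helper: sums n samples scaled by 1.5 and truncates the
 * result to an integer. All floating point work happens between the
 * begin/end calls, and only an integer leaves the guarded region. */
static long scaled_sum(const int *samples, size_t n)
{
    double acc = 0.0;
    long result;

    kernel_fpu_begin();
    assert(fpu_guard_depth == 1);   /* we are inside the guarded region */
    for (size_t i = 0; i < n; i++)
        acc += samples[i] * 1.5;
    result = (long)acc;
    kernel_fpu_end();

    return result;
}
```

The point of the shape is that the begin/end pair brackets every use of the
FPU or vector registers, with no early return in between.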




[gentoo-user] Re: CFLAGs for kernel compilation

2015-05-02 Thread Nikos Chantziaras

On 02/05/15 14:19, Volker Armin Hemmann wrote:

On 02.05.2015 at 07:04, Nikos Chantziaras wrote:

On 01/05/15 10:44, Andrew Savchenko wrote:

On Fri, 1 May 2015 05:09:51 + (UTC) Martin Vaeth wrote:

Andrew Savchenko birc...@gentoo.org wrote:


That's why the kernel uses CFLAGS to make sure that no floating point
instructions sneak in; you can see a lot of -mno-${instruction_set}
flags when running make V=1.


So it should be sufficient that the kernel does not use float
or double, shouldn't it?


No. Optimizer paths may be far from obvious, e.g. I would not be
surprised if under some conditions the vectorizer used float
instructions for int code.


The kernel uses -O2 and several -march variants (e.g. -march=core2).
Several other options are used to prevent GCC from generating
unsuitable code.

Specifying another -march variant does not affect the optimizer
though. It only affects the code generator. If you don't modify the
other CFLAGS and only change -march, you will not get FP instructions
unless you use FP in the code.

Also, I'd be very interested to see *any* optimization that would
somehow transform integer code to FP code (note that SIMD is not FP
and is perfectly fine in the kernel.) In fact, optimizers tend to
transform FP into SIMD, at least on x86 (and other architectures that
have fast SIMD instructions.) If I inspect the generated assembly from
GCC or Clang, I cannot find FP anywhere, even for code using float
and double operations. They get converted to SIMD on modern CPUs
(unless you specify a compiler flag that tells it to use the FPU, for
example if you need 80-bit extended precision, which is supported by
the x86 FPU.)





http://www.agner.org/optimize/calling_conventions.pdf


Not sure what you're trying to say.




Re: [gentoo-user] Re: CFLAGs for kernel compilation

2015-05-02 Thread Volker Armin Hemmann
On 02.05.2015 at 14:06, Nikos Chantziaras wrote:
 On 02/05/15 14:37, Volker Armin Hemmann wrote:
 On 02.05.2015 at 13:25, Nikos Chantziaras wrote:

 The kernel uses -O2 and several -march variants (e.g. -march=core2).
 Several other options are used to prevent GCC from generating
 unsuitable code.

 Specifying another -march variant does not affect the optimizer
 though. It only affects the code generator. If you don't modify the
 other CFLAGS and only change -march, you will not get FP instructions
 unless you use FP in the code.

 http://www.agner.org/optimize/calling_conventions.pdf

 Not sure what you're trying to say.


 that SIMD is not safe in the kernel unless it is carefully guarded.

 Really people, just don't fuck around with the cflags.

 I still fail to see the relevance. Unless you mean using a different
 -O level. In that case, yes. You shouldn't. But I was talking about
 -march.


you said this


 (note that SIMD is not FP and is perfectly fine in the kernel.)

and I have shown you that you are wrong.






[gentoo-user] Re: CFLAGs for kernel compilation

2015-05-02 Thread Nikos Chantziaras

On 02/05/15 15:10, Volker Armin Hemmann wrote:

On 02.05.2015 at 14:06, Nikos Chantziaras wrote:

On 02/05/15 14:37, Volker Armin Hemmann wrote:

On 02.05.2015 at 13:25, Nikos Chantziaras wrote:


The kernel uses -O2 and several -march variants (e.g. -march=core2).
Several other options are used to prevent GCC from generating
unsuitable code.

Specifying another -march variant does not affect the optimizer
though. It only affects the code generator. If you don't modify the
other CFLAGS and only change -march, you will not get FP instructions
unless you use FP in the code.


http://www.agner.org/optimize/calling_conventions.pdf


Not sure what you're trying to say.



that SIMD is not safe in the kernel unless it is carefully guarded.

Really people, just don't fuck around with the cflags.


I still fail to see the relevance. Unless you mean using a different
-O level. In that case, yes. You shouldn't. But I was talking about
-march.



you said this



(note that SIMD is not FP and is perfectly fine in the kernel.)


and I have shown you that you are wrong.


Not sure why you think that. The kernel crypto routines are full of SIMD 
code (like SSE and AVX.) Automatic vectorization wouldn't work. But 
-march is not going to introduce that.





[gentoo-user] Re: CFLAGs for kernel compilation

2015-05-02 Thread Nikos Chantziaras

On 02/05/15 14:37, Volker Armin Hemmann wrote:

On 02.05.2015 at 13:25, Nikos Chantziaras wrote:


The kernel uses -O2 and several -march variants (e.g. -march=core2).
Several other options are used to prevent GCC from generating
unsuitable code.

Specifying another -march variant does not affect the optimizer
though. It only affects the code generator. If you don't modify the
other CFLAGS and only change -march, you will not get FP instructions
unless you use FP in the code.


http://www.agner.org/optimize/calling_conventions.pdf


Not sure what you're trying to say.



that SIMD is not safe in the kernel unless it is carefully guarded.

Really people, just don't fuck around with the cflags.


I still fail to see the relevance. Unless you mean using a different -O 
level. In that case, yes. You shouldn't. But I was talking about -march.





Re: [gentoo-user] Re: CFLAGs for kernel compilation

2015-05-02 Thread Volker Armin Hemmann
On 02.05.2015 at 13:25, Nikos Chantziaras wrote:
 On 02/05/15 14:19, Volker Armin Hemmann wrote:
 On 02.05.2015 at 07:04, Nikos Chantziaras wrote:
 On 01/05/15 10:44, Andrew Savchenko wrote:
 On Fri, 1 May 2015 05:09:51 + (UTC) Martin Vaeth wrote:
 Andrew Savchenko birc...@gentoo.org wrote:

 That's why the kernel uses CFLAGS to make sure that no floating point
 instructions sneak in; you can see a lot of -mno-${instruction_set}
 flags when running make V=1.

 So it should be sufficient that the kernel does not use float
 or double, shouldn't it?

 No. Optimizer paths may be far from obvious, e.g. I would not be
 surprised if under some conditions the vectorizer used float
 instructions for int code.

 The kernel uses -O2 and several -march variants (e.g. -march=core2).
 Several other options are used to prevent GCC from generating
 unsuitable code.

 Specifying another -march variant does not affect the optimizer
 though. It only affects the code generator. If you don't modify the
 other CFLAGS and only change -march, you will not get FP instructions
 unless you use FP in the code.

 Also, I'd be very interested to see *any* optimization that would
 somehow transform integer code to FP code (note that SIMD is not FP
 and is perfectly fine in the kernel.) In fact, optimizers tend to
 transform FP into SIMD, at least on x86 (and other architectures that
 have fast SIMD instructions.) If I inspect the generated assembly from
 GCC or Clang, I cannot find FP anywhere, even for code using float
 and double operations. They get converted to SIMD on modern CPUs
 (unless you specify a compiler flag that tells it to use the FPU, for
 example if you need 80-bit extended precision, which is supported by
 the x86 FPU.)




 http://www.agner.org/optimize/calling_conventions.pdf

 Not sure what you're trying to say.




that SIMD is not safe in the kernel unless it is carefully guarded.

Really people, just don't fuck around with the cflags.



Re: [gentoo-user] Re: CFLAGs for kernel compilation

2015-05-02 Thread Volker Armin Hemmann
On 02.05.2015 at 14:38, Nikos Chantziaras wrote:
 On 02/05/15 15:10, Volker Armin Hemmann wrote:
 On 02.05.2015 at 14:06, Nikos Chantziaras wrote:
 On 02/05/15 14:37, Volker Armin Hemmann wrote:
 On 02.05.2015 at 13:25, Nikos Chantziaras wrote:

 The kernel uses -O2 and several -march variants (e.g. -march=core2).
 Several other options are used to prevent GCC from generating
 unsuitable code.

 Specifying another -march variant does not affect the optimizer
 though. It only affects the code generator. If you don't modify the
 other CFLAGS and only change -march, you will not get FP instructions
 unless you use FP in the code.

 http://www.agner.org/optimize/calling_conventions.pdf

 Not sure what you're trying to say.


 that SIMD is not safe in the kernel unless it is carefully guarded.

 Really people, just don't fuck around with the cflags.

 I still fail to see the relevance. Unless you mean using a different
 -O level. In that case, yes. You shouldn't. But I was talking about
 -march.


 you said this


 (note that SIMD is not FP and is perfectly fine in the kernel.)

 and I have shown you that you are wrong.

 Not sure why you think that. The kernel crypto routines are full of
 SIMD code (like SSE and AVX.) Automatic vectorization wouldn't work.
 But -march is not going to introduce that

and it is never used in interrupt context, and it is carefully guarded.
You act like 'oh, you can use SIMD instructions without any
consideration' and that is just not true.
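
The guarding argued about here is often structured as a check followed by
a scalar fallback. The sketch below mocks that shape in user space, so
treat it as an illustration and not as kernel code: irq_fpu_usable(),
kernel_fpu_begin() and kernel_fpu_end() are local stubs standing in for
the kernel functions of the same names, and sum_bytes(), sum_bytes_simd(),
sum_bytes_scalar(), fake_in_interrupt and simd_calls are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Test knobs for the mock: pretend to be in a context where the
 * FPU/SIMD state may not be touched, and count SIMD-path calls. */
static bool fake_in_interrupt = false;
static int simd_calls = 0;

/* User-space stubs standing in for the kernel functions. */
static bool irq_fpu_usable(void)   { return !fake_in_interrupt; }
static void kernel_fpu_begin(void) { }
static void kernel_fpu_end(void)   { }

/* Plain integer implementation, usable in any context. */
static unsigned sum_bytes_scalar(const unsigned char *p, size_t n)
{
    unsigned s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Placeholder for a hand-written SSE/AVX routine. Here it simply
 * reuses the scalar loop and records that this path was taken. */
static unsigned sum_bytes_simd(const unsigned char *p, size_t n)
{
    simd_calls++;
    return sum_bytes_scalar(p, n);
}

/* Dispatcher: take the SIMD path only when it is allowed, and only
 * between the begin/end calls. Otherwise fall back to integer code. */
static unsigned sum_bytes(const unsigned char *p, size_t n)
{
    unsigned s;

    if (!irq_fpu_usable())
        return sum_bytes_scalar(p, n);

    kernel_fpu_begin();
    s = sum_bytes_simd(p, n);
    kernel_fpu_end();
    return s;
}
```

Both paths return the same result; only the choice of code path depends
on the calling context.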



[gentoo-user] Re: CFLAGs for kernel compilation

2015-05-02 Thread James
Volker Armin Hemmann volkerarmin at googlemail.com writes:


  http://www.agner.org/optimize/calling_conventions.pdf
 
  Not sure what you're trying to say.
 
 
  that SIMD is not safe in the kernel unless it is carefully guarded.
 
  Really people, just don't fuck around with the cflags.
 
  I still fail to see the relevance. Unless you mean using a different
  -O level. In that case, yes. You shouldn't. But I was talking about
  -march.
 
 
  you said this
 
 
  (note that SIMD is not FP and is perfectly fine in the kernel.)
 
  and I have shown you that you are wrong.
 
  Not sure why you think that. The kernel crypto routines are full of
  SIMD code (like SSE and AVX.) Automatic vectorization wouldn't work.
  But -march is not going to introduce that
 
 and never used in interrupt context and carefully guarded. You act like
 'oh, you can use simd instructions without any consideration' and that
 is just not true.


Volker,
Historically, you are correct. Looking forward, GCC-5.x will (can?) change
this as the SIMD and other hardware, including (DDR_5) memory, all become
available for (compiler) usage. For the longest time, we the FOSS
communities have at best been given access to low level APIs for access to
some of these hardware resources. All processor architectures are at war.
Intel (the bastards) have had FPGAs and tools to reconfigure the amount and
types of hardware in some of their processors for quite some time.

The Arm64 cores have SIMD (GPU if you like) centric cores on the same SoC as
the licensed 64-bit Arm CPU cores. The new GPU has already been integrated
into the processor cores (same substrate), just as the i387 FPU was some
decades ago. So Arm is providing 'bare metal' access to various customers
and compilers. Since there are thousands of vendors building up customer
arm64 SoCs, there is no way for Arm to constrict, like Intel, Nvidia and AMD
have historically done. Game_set_match.

Even though the GPU cores available via arm64 are very weak compared to
Nvidia and AMD, bare metal access to those (GPU) resources is far superior
to what Intel (dragging their feet), Nvidia or AMD are offering. Just look
at how AMD's Mantle has stalled for the FOSS communities. AMD, via
competition from a myriad of Arm SoC vendors, is being forced to roll out
64-bit Arm server chips just to stay relevant. Both of you guys are looking
at this issue through historically tinted sunglasses. Change is here; get
on board with helping the masses help themselves to the feeding (coding) frenzy.


What a pair of really smart guys like you (2) should be doing is setting up
a gentoo wiki page listing and demonstrating for others how to profile low
level codes, both kernel and system level, so these other gentoo folks *can
learn* about what you are saying by example, running tools such as
kernelshark and other performance/profiling types of analysis. Providing
seamless and generic access to the GPU resources will go a long way towards
revitalizing FOSS cryptographic dominance; and that is a very good thing. ymmv.


For the record, most SIMD hardware really sucks for dense_matrix
requirements. Most SIMD hardware only really works for sparse matrix
apps, like x.264, because the overlying (embedded) algorithms used are
intentionally poorly documented by the hardware vendors. I do not have direct
proof, but I strongly suspect this is the case because the SIMD pipelined
memory that these low level APIs give to the FOSS community is memory
constricted by design.


peace,
James







[gentoo-user] Re: CFLAGs for kernel compilation

2015-05-01 Thread Nikos Chantziaras

On 01/05/15 10:44, Andrew Savchenko wrote:

On Fri, 1 May 2015 05:09:51 + (UTC) Martin Vaeth wrote:

Andrew Savchenko birc...@gentoo.org wrote:


That's why the kernel uses CFLAGS to make sure that no floating point
instructions sneak in; you can see a lot of -mno-${instruction_set}
flags when running make V=1.


So it should be sufficient that the kernel does not use float
or double, shouldn't it?


No. Optimizer paths may be far from obvious, e.g. I would not be
surprised if under some conditions the vectorizer used float
instructions for int code.


The kernel uses -O2 and several -march variants (e.g. -march=core2). 
Several other options are used to prevent GCC from generating unsuitable 
code.


Specifying another -march variant does not affect the optimizer though. 
It only affects the code generator. If you don't modify the other CFLAGS 
and only change -march, you will not get FP instructions unless you use 
FP in the code.


Also, I'd be very interested to see *any* optimization that would 
somehow transform integer code to FP code (note that SIMD is not FP and 
is perfectly fine in the kernel.) In fact, optimizers tend to transform 
FP into SIMD, at least on x86 (and other architectures that have fast 
SIMD instructions.) If I inspect the generated assembly from GCC or 
Clang, I cannot find FP anywhere, even for code using float and 
double operations. They get converted to SIMD on modern CPUs (unless 
you specify a compiler flag that tells it to use the FPU, for example if 
you need 80-bit extended precision, which is supported by the x86 FPU.)
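
One way to check the claim about generated assembly is to compile a tiny
function and read the output. The file below is a hypothetical test case
(scale() and bump() are made-up names). On x86-64, compiling it with
something like gcc -O2 -S would be expected to use scalar SSE instructions
on xmm registers for scale() and not x87 ones, while adding -mfpmath=387
asks GCC for the x87 FPU instead; bump() should need neither. The exact
instructions depend on compiler version and target, so treat this as
something to try, not as a guaranteed listing.

```c
/* Floating point arithmetic: look at which registers and instructions
 * the compiler picks for the multiplication. */
double scale(double x)
{
    return x * 1.5;
}

/* Integer-only arithmetic: no floating point or vector registers
 * should be needed for this. */
int bump(int x)
{
    return x * 3 + 1;
}
```

Either way, the xmm registers belong to the state that, according to the
document quoted elsewhere in this thread, kernel_fpu_begin() has to save,
which is the part the rest of the discussion turns on.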





[gentoo-user] Re: CFLAGs for kernel compilation

2015-05-01 Thread James
Andrew Savchenko bircoph at gentoo.org writes:


  I can hardly imagine that otherwise the compiler converts integer
  or pointer arithmetic into floating point arithmetics, or is
  this really the case for certain flags?  If yes, why should these
  flags *ever* be useful?
  I mean: The context switching happens for non-kernel code as well,
  doesn't it?


First off, reading this thread, I cannot really tell what the intended use
of the highly tuned kernels is to be. For almost all workstation
and server purposes, what has been previously stated is mostly correct. If
you really want to test these waters, do it on a system that is not in your
critical path. If you tune and experiment, you are going to bork your box.
Water coolers on the CPUs are a good idea when taxing the FPU and other SIMD
hardware on the CPU, imho. sys-power/powertop is your friend.


 Yes, context switching happens for all code and has its costs. But
 for userspace code context switching happens for many other
 reasons, e.g. on each syscall (userspace <-> kernelspace switching).
 Also some user applications may need high precision, or context
 switching pays off due to massively parallel data processing, e.g. SIMD
 instructions in scientific or multimedia applications.

Hear, hear, I knew we had an LU expert in the crowd. Most scientific
or highly parallelized number crunching does benefit from experimenting
with settings and *profiling* the results. trace-cmd and kernelshark
are in portage and are very useful for analysis of hardware timings,
context switching and a myriad of other issues. Be careful, you can
sink a lifetime into such efforts with little to show for it.
The best thing is to read up on specific optimizations for specific
codes as vetted by the specific hardware in your processors. Tuning for
one need will most likely degrade other types of performance; that is
why, before you delve into these waters, you really need to learn about
profiling both target (application) and kernel codes, *BEFORE* randomly
tuning the advanced numerical intricacies of your hardware resources.
Start with memory and cgroups before worrying about the hardware inside
your processors (CPU and GPU).


 But outside the special conditions mentioned above, fixed point is still
 faster in userspace; some ffmpeg codecs have both fixed and floating
 point implementations, so you may compare them. Programming in fixed point
 is much harder, so most people avoid it unless they have a very
 good reason to use it. And don't forget that the kernel is
 performance critical, unlike most userspace applications.

Video (MPEG, H.264 and such) massively benefits from the enhanced matrix
abilities of the SIMD hardware in your video card's GPU. These bare metal
resources are being integrated into gcc-5.1+ for experimentation. But
it is likely going to take a year or so before ordinary users of linux
resources see these performance gains. I would encourage you
to experiment, but *never on your main workstation*. I'm purchasing
a new nvidia video card just to benchmark and tune some numerically
intensive codes that use sci-libs/magma. Although this will be my
fastest video card at the moment, it will sit in a box that is not used
for visual eye candy (gaming, anime, ray traces etc).


The mesos clustering codes (shark, storm, tachyon etc) and MP(I) codes are
going to fundamentally change the numerical processing landscape for even
small linux clusters. An excellent bit of code to get your feet wet is
sys-apps/hwloc. More than the FPU, MP(I) {sys-cluster/openmpi} and other
clustering codes are going to allow you to use the DDR(4|5) memory found in
many video cards (GPU) via *RDMA*. The world is rapidly changing and many
old fixed point integer folks do not see the tsunami that is just
offshore. Many computationally expensive codes have development projects to
move to an in-memory [1] environment where HD resources are avoided as
much as possible in a cluster environment. Clustered resources tuned for
such things as a video rendering farm will have very differently optimized
kernels than your KDE(G*) workstation or web server. media-gfx/blender is
another excellent collection of codes that benefits from all sorts of tuning
on a special-purpose system.

So do you really have a valid need to tune FPU performance for a
numerically demanding application?   YMMV

 Best regards,
 Andrew Savchenko


hth,
James

[1] https://amplab.cs.berkeley.edu/





Re: [gentoo-user] Re: CFLAGs for kernel compilation

2015-05-01 Thread Andrew Savchenko
On Fri, 1 May 2015 05:09:51 + (UTC) Martin Vaeth wrote:
 Andrew Savchenko birc...@gentoo.org wrote:
 
  That's why the kernel uses CFLAGS to make sure that no floating point
  instructions sneak in; you can see a lot of -mno-${instruction_set}
  flags when running make V=1.
 
 So it should be sufficient that the kernel does not use float
 or double, shouldn't it?

No. Optimizer paths may be far from obvious, e.g. I would not be
surprised if under some conditions the vectorizer used float
instructions for int code.

 I can hardly imagine that otherwise the compiler converts integer
 or pointer arithmetic into floating point arithmetics, or is
 this really the case for certain flags?  If yes, why should these
 flags *ever* be useful?
 I mean: The context switching happens for non-kernel code as well,
 doesn't it?

Yes, context switching happens for all code and has its costs. But
for userspace code context switching happens for many other
reasons, e.g. on each syscall (userspace <-> kernelspace switching).
Also some user applications may need high precision, or context
switching pays off due to massively parallel data processing, e.g. SIMD
instructions in scientific or multimedia applications. But outside the
special conditions mentioned above, fixed point is still faster in
userspace; some ffmpeg codecs have both fixed and floating point
implementations, so you may compare them. Programming in fixed point
is much harder, so most people avoid it unless they have a very
good reason to use it. And don't forget that the kernel is
performance critical, unlike most userspace applications.
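
For readers who have not met fixed point before, here is a minimal sketch
of the idea in C, assuming a Q16.16 layout (16 integer bits, 16 fractional
bits). The type and function names are illustrative and are not taken from
ffmpeg or the kernel. Everything is done with integer instructions, which
is why this style can be used where floating point cannot.

```c
#include <stdint.h>

/* Q16.16 fixed point: the stored integer is the real value times 65536. */
typedef int32_t q16_16;

#define Q16_ONE ((q16_16)1 << 16)   /* the value 1.0 */

/* Convert a small integer to fixed point (no overflow checking). */
static q16_16 q16_from_int(int v)
{
    return (q16_16)v * Q16_ONE;
}

/* Multiply two fixed point values. The product is formed in 64 bits
 * and scaled back down; the result is truncated toward zero. */
static q16_16 q16_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) / Q16_ONE);
}

/* Drop the fractional part (truncates toward zero). */
static int q16_to_int(q16_16 v)
{
    return v / Q16_ONE;
}
```

For instance, q16_mul(q16_from_int(3), Q16_ONE / 2) represents 3 * 0.5.
The extra care about scaling and overflow is a large part of why fixed
point is harder to program.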

Best regards,
Andrew Savchenko




[gentoo-user] Re: CFLAGs for kernel compilation

2015-04-30 Thread Martin Vaeth
Andrew Savchenko birc...@gentoo.org wrote:

 That's why the kernel uses CFLAGS to make sure that no floating point
 instructions sneak in; you can see a lot of -mno-${instruction_set}
 flags when running make V=1.

So it should be sufficient that the kernel does not use float
or double, shouldn't it?
I can hardly imagine that otherwise the compiler converts integer
or pointer arithmetic into floating point arithmetic, or is
this really the case for certain flags?  If yes, why should these
flags *ever* be useful?
I mean: The context switching happens for non-kernel code as well,
doesn't it?




[gentoo-user] Re: CFLAGs for kernel compilation

2015-04-29 Thread Nikos Chantziaras

On 29/04/15 16:35, Holger Hoffstätte wrote:

On Wed, 29 Apr 2015 15:18:23 +0200, Ralf wrote:


Damn, you're absolutely right.

I just tested it using make V=1.
kernel make does override CFLAGs from the outside.

But that's interesting: my processor supports -march=core-avx2 and none
of the linux kernel processor family uses this flag...


https://github.com/graysky2/kernel_gcc_patch


This is already applied when enabling the experimental USE flag. At 
least, that's what the docs claim:


  $ equery uses gentoo-sources




[gentoo-user] Re: CFLAGs for kernel compilation

2015-04-29 Thread Nikos Chantziaras

On 30/04/15 02:52, Nikos Chantziaras wrote:

On 29/04/15 16:35, Holger Hoffstätte wrote:

On Wed, 29 Apr 2015 15:18:23 +0200, Ralf wrote:


Damn, you're absolutely right.

I just tested it using make V=1.
kernel make does override CFLAGs from the outside.

But that's interesting: my processor supports -march=core-avx2 and none
of the linux kernel processor family uses this flag...


https://github.com/graysky2/kernel_gcc_patch


This is already applied when enabling the experimental USE flag. At
least, that's what the docs claim:

   $ equery uses gentoo-sources


However, I just checked and it's not being applied. So either the 
documentation is wrong, or the ebuild/eclass has a bug.






[gentoo-user] Re: CFLAGs for kernel compilation

2015-04-29 Thread Holger Hoffstätte
On Wed, 29 Apr 2015 15:18:23 +0200, Ralf wrote:

 Damn, you're absolutely right.
 
 I just tested it using make V=1.
 kernel make does override CFLAGs from the outside.
 
 But that's interesting: my processor supports -march=core-avx2 and none
 of the linux kernel processor family uses this flag...

https://github.com/graysky2/kernel_gcc_patch

-h
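
A quick way to see what a given -march value turns on is to test the
compiler's predefined macros. The sketch below relies on __SSE2__, __AVX__
and __AVX2__, which GCC and Clang define on x86 when the corresponding
instruction sets are enabled; the function names are hypothetical.

```c
#include <stdio.h>

/* Each helper reports whether the compiler was allowed to use the
 * instruction set when this file was built. */
static int have_sse2(void)
{
#ifdef __SSE2__
    return 1;
#else
    return 0;
#endif
}

static int have_avx(void)
{
#ifdef __AVX__
    return 1;
#else
    return 0;
#endif
}

static int have_avx2(void)
{
#ifdef __AVX2__
    return 1;
#else
    return 0;
#endif
}

/* Print one line summarising the build-time settings. */
static void print_isa(void)
{
    printf("SSE2=%d AVX=%d AVX2=%d\n", have_sse2(), have_avx(), have_avx2());
}
```

Calling print_isa() from a program built with -march=core-avx2 should
report all three as 1, whereas a build that adds -mno-sse2 and -mno-avx,
in the style of the flags the kernel build passes, should report 0 for
them.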