Re: [maemo-developers] Improving Cairo performance on the N800

2007-01-20 Thread Daniel Amelang

(resending this now that the mailing list is back up)

On 1/16/07, Fernando Herrera [EMAIL PROTECTED] wrote:

> On Tue, 16-01-2007 at 12:20 +0200, ext Daniel Stone wrote:
> > We don't currently use the MBX block at all: there's no driver or
> > anything to hook into.
>
> There was a Linux driver for PowerVR from Imagination Technologies for
> 2.4 kernels, but I think it is not open source :(


Yeah, it's not. While I was in the Linux driver section of the PowerVR
website, I also saw this:

"We have currently no plans of providing drivers supporting updated kernels."

Where "updated kernels" refers to > 2.4. Ouch. Too bad there isn't a
large enough developer community to make an open-source driver
feasible.

Maybe if Nokia pressured TI who then pressured PowerVR...I know, I
know, not gonna happen.

Dan Amelang
___
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers


Re: [maemo-developers] Improving Cairo performance on the N800

2007-01-20 Thread Daniel Amelang

(resending this now that the mailing list is back up)

On 1/16/07, Zeeshan Ali [EMAIL PROTECTED] wrote:

> Hello!
>
> > Now, the recently announced Nokia N800 is different from the 770 in
> > various ways that are interesting for Cairo performance. I've got my
> > eye on the ARMv6 SIMD instructions and the PowerVR MBX accelerator.
>
> Yeah! Me too. The combined power of these two can make it possible
> to optimize a lot of nice free software out there for the N800 device.
> However, while the former is fully documented and the documentation is
> available to the general public, it doesn't have a lot to offer. ARMv6
> SIMD only operates on 32-bit words, and hence I find it unlikely that it
> can be used to optimize double fp emulation, in contrast to the Intel
> Wireless MMX, which provides a big bunch of 128-bit (CORRECTME: or
> was it 64-bit?) SIMD instructions. OTOH, these few SIMD instructions
> can still be used to optimize a lot of code, but would it be a good
> idea for cairo if you need to convert the operand values to ints and
> the result(s) back to float?


No int <-> float conversion necessary. At this level, cairo uses ints
exclusively. To clarify, the part of cairo I'm thinking could use the
ARM SIMD is the pixman library which is almost an exact client-side
mirror (copy, really) of the fb section of the X server. It's the part
that implements the Porter-Duff operators in software. Floats are long
out of the picture at this point.
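
To make the integer-only point concrete, here is a minimal sketch of one
Porter-Duff operator (OVER) on premultiplied ARGB32 pixels. It is not
pixman's code: the names mul_div_255 and over_argb32 are made up for
illustration, and the pixel layout (alpha in the top byte) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded x * a / 255 using only integer operations (no division). */
static uint8_t mul_div_255(uint8_t x, uint8_t a)
{
    uint32_t t = (uint32_t)x * a + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Hypothetical helper: Porter-Duff OVER for one premultiplied ARGB32 pixel.
 * Per channel: result = src + dst * (255 - src_alpha) / 255. */
static uint32_t over_argb32(uint32_t src, uint32_t dst)
{
    uint32_t alpha = src >> 24;
    uint8_t inv_alpha = (uint8_t)(255 - alpha);
    uint32_t result = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xff;
        uint32_t d = (dst >> shift) & 0xff;

        /* Premultiplied input: no channel may exceed the alpha value,
         * which also guarantees the sum below fits in 8 bits. */
        assert(s <= alpha);
        result |= (s + mul_div_255((uint8_t)d, inv_alpha)) << shift;
    }
    return result;
}
```

Every step is a shift, mask, multiply or add on small integers, which is the
kind of work packed SIMD instructions are good at.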

This misunderstanding is common due to widespread confusion regarding
what role floating point plays in cairo's internals. Most floats that
arrive via an API call are converted into an integer type (e.g.
fixed-point) early on. Cairo uses integer arithmetic for most of its
internal computation. With that clarification, it should be no
surprise that many of the recent FP optimizations in cairo were just a
matter of speeding up conversions from floating point to an integer
type.
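
For readers unfamiliar with fixed-point, here is a minimal sketch of the
kind of conversion meant above. It assumes a 16.16 layout and uses
hypothetical names (fixed_16_16, fixed_from_double, fixed_mul); it is the
plain multiply-and-round approach, not cairo's actual conversion routine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16.16 fixed-point type: 16 integer bits, 16 fraction bits. */
typedef int32_t fixed_16_16;

#define FIXED_ONE 65536

/* Convert a double to 16.16, rounding to the nearest representable value. */
static fixed_16_16 fixed_from_double(double d)
{
    /* The value must fit in the 16 integer bits. */
    assert(d > -32768.0 && d < 32767.0);
    return (fixed_16_16)(d * FIXED_ONE + (d >= 0 ? 0.5 : -0.5));
}

/* Multiply two 16.16 values with integer arithmetic only.
 * The 64-bit intermediate keeps the full product before rescaling. */
static fixed_16_16 fixed_mul(fixed_16_16 a, fixed_16_16 b)
{
    return (fixed_16_16)(((int64_t)a * b) / FIXED_ONE);
}
```

Once a coordinate has gone through something like fixed_from_double, the
rest of the pipeline never touches the FPU (or the FP emulation) again.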

Anyway, I think the 32-bit ARM SIMD could possibly get us some speedup
similar to how the existing MMX/SSE code has helped for x86 (for the
curious ones, see fbmmx.c in cairo or xserver). And since the MMX/SSE
code hasn't needed to drop down to raw assembly to get a nice
speedup (it uses intrinsics), your ARM SIMD intrinsics code is much
appreciated.
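
As an illustration of where a 32-bit SIMD instruction fits, the sketch below
is a portable C reference for a per-byte unsigned saturating add, which is
what the ARMv6 UQADD8 instruction computes on one 32-bit word, applied as a
Porter-Duff ADD over a row of 32-bit pixels. The function names are
hypothetical and this is not code from pixman or fbmmx.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add the four bytes of a and b independently, clamping each sum to 255.
 * On ARMv6 this whole loop corresponds to a single UQADD8. */
static uint32_t sat_add_4x8(uint32_t a, uint32_t b)
{
    uint32_t result = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t sum = ((a >> shift) & 0xff) + ((b >> shift) & 0xff);
        if (sum > 0xff)
            sum = 0xff;
        result |= sum << shift;
    }
    return result;
}

/* Porter-Duff ADD over n pixels: dst = saturate(dst + src), per channel. */
static void add_row(uint32_t *dst, const uint32_t *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = sat_add_4x8(dst[i], src[i]);
}
```

With the SIMD instruction, one pixel (four channels) is handled per
operation instead of one channel at a time.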

Dan Amelang


Re: [maemo-developers] Improving Cairo performance on the N800

2007-01-17 Thread Siarhei Siamashka
On Tuesday 16 January 2007 12:08, Zeeshan Ali wrote:

> > Now, the recently announced Nokia N800 is different from the 770 in
> > various ways that are interesting for Cairo performance. I've got my
> > eye on the ARMv6 SIMD instructions and the PowerVR MBX accelerator.
>
> Yeah! Me too. The combined power of these two can make it possible
> to optimize a lot of nice free software out there for the N800 device.
> However, while the former is fully documented and the documentation is
> available to the general public, it doesn't have a lot to offer. ARMv6
> SIMD only operates on 32-bit words, and hence I find it unlikely that it
> can be used to optimize double fp emulation, in contrast to the Intel
> Wireless MMX, which provides a big bunch of 128-bit (CORRECTME: or
> was it 64-bit?) SIMD instructions. OTOH, these few SIMD instructions
> can still be used to optimize a lot of code, but would it be a good
> idea for cairo if you need to convert the operand values to ints and
> the result(s) back to float?

Well, OMAP2420 seems to support floating point in hardware, so all this stuff
is probably not needed anymore :)

> I have already been thinking about utilizing ARMv6 since before the N800
> was released to the public. My proposed plan of attack for the community
> (and also the Nokia employees) is simply the following:
>
> 1. Patch GCC to provide ARMv6 intrinsics. (1 MM at most)
> 2. Patch liboil [1] to utilize these intrinsics when compiled for
> ARMv6 target (1-3 MM)
> 3. Make all the software utilize liboil wherever appropriate or ARMv6
> intrinsics directly if needed.
>
> The 3rd step would ensure that you are optimizing your software for
> all the platforms for which liboil provides optimizations. OTOH, one
> can skip step #1 and write the liboil implementations in assembly.
>
> I have already made a little progress on this, and the result is two
> header files which provide inline functions abstracting the assembly
> instructions. One of my friends was supposed to convert them to gcc
> intrinsics and patch gcc, but I never got around to finishing them.
> However, I am attaching the headers so anyone can use them as a starting
> point if he/she likes.

According to my tests, the performance improvement from using such header
files is minimal. They are easy to use, but the gain is generally small.

When I benchmarked idct performance, I also tested a C implementation with
some macros for fast armv5te 16-bit multiplication, out of curiosity. The
performance improvement was only about 5%, while handcrafted code improves
performance by as much as 50% (and still has potential for more
optimizations):
http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2006-September/045837.html

A very similar, minimal effect is obtained from using such macros in the
ffmpeg mp3 decoder.
 
The explanation is simple. The compiler is not able to schedule instructions
as well as a human, especially if it has some 'alien' parts of code inserted
into the flow of its instructions via inline asm. For example, this multiply
instruction takes 1 cycle to execute, but the result has 1 extra cycle of
latency (on ARM9; it is even higher on ARM11, where it equals 2 cycles), so
you can't use it immediately in the next instruction. As gcc does not know
about the scheduling of such instructions when using just macros, it may try
to use the result immediately and suffer a penalty of 1 or more cycles
because of a pipeline interlock.
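
A rough sketch of the ordering issue, in plain C with hypothetical function
names: the first version consumes every product in the very next operation,
while the second starts an independent multiply before using the first
result, which is the shape hand-scheduled code tends to take. An optimizing
compiler is free to reorder plain C like this, so treat it as an
illustration of the idea rather than a guaranteed speedup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each multiply result is needed immediately by the following add. */
static int32_t dot_chained(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;

    assert(n == 0 || (a != NULL && b != NULL));
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Two independent multiplies are issued before either result is used,
 * leaving room to hide the result latency of the first one. */
static int32_t dot_interleaved(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc0 = 0, acc1 = 0;
    size_t i = 0;

    assert(n == 0 || (a != NULL && b != NULL));
    for (; i + 1 < n; i += 2) {
        int32_t p0 = (int32_t)a[i] * b[i];
        int32_t p1 = (int32_t)a[i + 1] * b[i + 1];
        acc0 += p0;
        acc1 += p1;
    }
    if (i < n)                      /* odd element count: one left over */
        acc0 += (int32_t)a[i] * b[i];
    return acc0 + acc1;
}
```

Both functions return the same value; only the dependency pattern between
consecutive operations differs.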

So if really good performance is required, nothing can beat handcrafted
assembly yet. Of course, it makes sense to profile the code and optimize only
the time-critical, relatively small leaf functions.

By the way, free software is really poorly optimized for ARM right now. For
example, SDL is not optimized for ARM, xserver is probably not optimized
either, and a lot of performance-critical code in various software is
still implemented only in C for ARM, while it has had x86 assembly
optimizations for a long time. Considering that Internet Tablets might face
tight competition from x86 UMPC devices in the near future, ARM-powered
devices are at some disadvantage now. Is this something that we should try to
change? :-)


Re: [maemo-developers] Improving Cairo performance on the N800

2007-01-16 Thread Zeeshan Ali

Hello!


> Now, the recently announced Nokia N800 is different from the 770 in
> various ways that are interesting for Cairo performance. I've got my
> eye on the ARMv6 SIMD instructions and the PowerVR MBX accelerator.


  Yeah! Me too. The combined power of these two can make it possible
to optimize a lot of nice free software out there for the N800 device.
However, while the former is fully documented and the documentation is
available to the general public, it doesn't have a lot to offer. ARMv6
SIMD only operates on 32-bit words, and hence I find it unlikely that it
can be used to optimize double fp emulation, in contrast to the Intel
Wireless MMX, which provides a big bunch of 128-bit (CORRECTME: or
was it 64-bit?) SIMD instructions. OTOH, these few SIMD instructions
can still be used to optimize a lot of code, but would it be a good
idea for cairo if you need to convert the operand values to ints and
the result(s) back to float?

  I have already been thinking about utilizing ARMv6 since before the N800
was released to the public. My proposed plan of attack for the community
(and also the Nokia employees) is simply the following:

1. Patch GCC to provide ARMv6 intrinsics. (1 MM at most)
2. Patch liboil [1] to utilize these intrinsics when compiled for
ARMv6 target (1-3 MM)
3. Make all the software utilize liboil wherever appropriate or ARMv6
intrinsics directly if needed.

  The 3rd step would ensure that you are optimizing your software for
all the platforms for which liboil provides optimizations. OTOH, one
can skip step #1 and write the liboil implementations in assembly.
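
  The idea behind steps 2 and 3 can be sketched as follows: callers go
through one entry point, and the build picks either a plain C fallback or an
ARMv6 version. This is only an illustration and does not use liboil's actual
API; HAVE_ARMV6_SIMD is a hypothetical build flag and all function names are
made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*sat_add_u8_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Portable fallback: dst[i] = min(dst[i] + src[i], 255). */
static void sat_add_u8_c(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)dst[i] + src[i];
        dst[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}

#ifdef HAVE_ARMV6_SIMD   /* hypothetical flag, set only for ARMv6 builds */
/* ARMv6 version: four bytes per UQADD8, C fallback for the tail. */
static void sat_add_u8_armv6(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        uint32_t a, b, d;
        memcpy(&a, dst + i, 4);
        memcpy(&b, src + i, 4);
        __asm__("uqadd8 %0, %1, %2" : "=r" (d) : "r" (a), "r" (b));
        memcpy(dst + i, &d, 4);
    }
    sat_add_u8_c(dst + i, src + i, n - i);
}
#endif

/* Single entry point: applications never name an implementation directly. */
static sat_add_u8_fn select_sat_add_u8(void)
{
#ifdef HAVE_ARMV6_SIMD
    return sat_add_u8_armv6;
#else
    return sat_add_u8_c;
#endif
}
```

  Software written against select_sat_add_u8() gets the fast path on the
N800 and still builds and runs unchanged everywhere else.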

  I have already made a little progress on this, and the result is two
header files which provide inline functions abstracting the assembly
instructions. One of my friends was supposed to convert them to gcc
intrinsics and patch gcc, but I never got around to finishing them.
However, I am attaching the headers so anyone can use them as a starting
point if he/she likes.

  Using the PowerVR MBX accelerator is a completely different story.
Although it has a lot to offer, I failed to find any documentation
on it. There was plenty of documentation on how to use the OpenGL ES
implementation on top of MBX, though. If you come across any documentation
on the MBX itself, please let me know.

[1] http://liboil.freedesktop.org/

--
Regards,

Zeeshan Ali
Design Engineer, SW
Open Source Software Operations
Nokia Multimedia
#ifndef __ARMV6_ARITHMETIC__
#define __ARMV6_ARITHMETIC__

/** 8-bit SIMD operations */


/* Signed 8-bit SIMD add */
static __inline unsigned long sadd8(unsigned long n, unsigned long m)
{
    unsigned long d;

    __asm__ __volatile__(
        "sadd8 %0, %1, %2\n"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");

    return d;
}

/* Signed 8-bit SIMD subtraction */
static __inline unsigned long ssub8(unsigned long n, unsigned long m)
{
    unsigned long d;

    __asm__ __volatile__(
        "ssub8 %0, %1, %2\n"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");

    return d;
}

/* Unsigned 8-bit SIMD addition */
static __inline unsigned long uadd8(unsigned long n, unsigned long m)
{
    unsigned long d;

    __asm__ __volatile__(
        "uadd8 %0, %1, %2\n"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");

    return d;
}

/* Unsigned 8-bit SIMD subtraction */
static __inline unsigned long usub8(unsigned long n, unsigned long m)
{
    unsigned long d;

    __asm__ __volatile__("usub8 %0, %1, %2"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");

    return d;
}

/* Signed saturating 8-bit SIMD addition */
static __inline unsigned long qadd8(unsigned long n, unsigned long m)
{
    unsigned long d;

    __asm__ __volatile__("qadd8 %0, %1, %2"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");

    return d;
}

/* Signed saturating 8-bit SIMD subtraction */
static __inline unsigned long qsub8(unsigned long n, unsigned long m)
{
    unsigned long d;

    __asm__ __volatile__("qsub8 %0, %1, %2"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");

    return d;
}

/* Unsigned saturating 8-bit SIMD addition */
static __inline unsigned long uqadd8(unsigned long n, unsigned long m)
{
    unsigned long d;

    __asm__ __volatile__("uqadd8 %0, %1, %2"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");

    return d;
}

/* Unsigned saturating 8-bit SIMD subtraction */
static __inline unsigned long uqsub8(unsigned long n, unsigned long m)
{
    unsigned long d;

    /* Operands in the same order as the other helpers: d = n - m */
    __asm__ __volatile__("uqsub8 %0, %1, %2"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");

    return d;
}

/** 16-bit SIMD operations */

/* Signed 16-bit SIMD add */
static __inline unsigned long sadd16(unsigned long n, unsigned long m)
{
    unsigned long d;

    __asm__ __volatile__("sadd16 %0, %1, %2"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");

    return d;
}

/* Signed 16-bit SIMD subtraction */
static __inline unsigned long ssub16(unsigned long n, unsigned long m)
{
    unsigned long d;

    __asm__ __volatile__("ssub16 %0, %1, %2"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");

    return d;
}

#endif /* __ARMV6_ARITHMETIC__ */


Re: [maemo-developers] Improving Cairo performance on the N800

2007-01-16 Thread Daniel Stone
Hi,

On Mon, Jan 15, 2007 at 09:48:35PM -0800, ext Daniel Amelang wrote:
> - Write a new Cairo backend that targets OpenVG, since the PowerVR MBX
> has fully-accelerated OpenVG rendering. I haven't found anything about
> OpenVG + Maemo 3.0, so maybe the software infrastructure isn't there
> yet to do this.
>
> - Something involving the OpenGL capabilities of the MBX. It doesn't
> support shaders, so it would be pretty limited. It does support
> multitexturing, so maybe a poor man's glitz is feasible.

We don't currently use the MBX block at all: there's no driver or
anything to hook into.

Cheers,
Daniel


Re: [maemo-developers] Improving Cairo performance on the N800

2007-01-16 Thread Fernando Herrera
On Tue, 16-01-2007 at 12:20 +0200, ext Daniel Stone wrote:
> We don't currently use the MBX block at all: there's no driver or
> anything to hook into.

There was a Linux driver for PowerVR from Imagination Technologies for
2.4 kernels, but I think it is not open source :(

Regards
