Re: [maemo-developers] Improving Cairo performance on the N800

2007-01-20 Thread Daniel Amelang

(resending this now that the mailing list is back up)

On 1/16/07, Zeeshan Ali <[EMAIL PROTECTED]> wrote:

> Hello!
>
> > Now, the recently announced Nokia N800 is different from the 770 in
> > various ways that are interesting for Cairo performance. I've got my
> > eye on the ARMv6 SIMD instructions and the PowerVR MBX accelerator.
>
>    Yeah! Me too. The combined power of these two can make it possible
> to optimize a lot of nice free software out there for the N800 device.
> However, while the former is fully documented and the documentation is
> available to the general public, it doesn't have a lot to offer. ARMv6
> SIMD only operates on 32-bit words, and hence I find it unlikely that
> it can be used to optimize double FP emulation, in contrast to Intel
> Wireless MMX, which provides a big bunch of 128-bit (CORRECTME: or
> was it 64-bit?) SIMD instructions. OTOH, these few SIMD instructions
> can still be used to optimize a lot of code, but would it be a good
> idea for cairo if you need to convert the operand values to ints and
> the result(s) back to float?


No int <-> float conversion necessary. At this level, cairo uses ints
exclusively. To clarify, the part of cairo I'm thinking could use the
ARM SIMD is the pixman library which is almost an exact client-side
mirror (copy, really) of the fb section of the X server. It's the part
that implements the Porter-Duff operators in software. Floats are long
out of the picture at this point.
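
As a rough illustration of why no floats are involved at this level, a
Porter-Duff OVER on premultiplied ARGB32 pixels can be written with
integer arithmetic only. This is a minimal sketch, not pixman's actual
code; the names mul_un8 and over_argb32 are made up for the example.

```c
#include <stdint.h>

/* Multiply an 8-bit channel by an 8-bit alpha with rounding,
 * i.e. round(c * a / 255), using only integer operations. */
static uint8_t mul_un8(uint8_t c, uint8_t a)
{
    uint32_t t = (uint32_t)c * a + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Porter-Duff OVER for premultiplied ARGB32:
 * result = src + dst * (1 - src_alpha), computed per 8-bit channel. */
static uint32_t over_argb32(uint32_t src, uint32_t dst)
{
    uint8_t inv_alpha = (uint8_t)(255 - (src >> 24));
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xff;
        uint32_t d = mul_un8((uint8_t)(dst >> shift), inv_alpha);
        uint32_t sum = s + d;

        if (sum > 255)
            sum = 255;  /* saturate, in case the input is not properly premultiplied */
        result |= sum << shift;
    }
    return result;
}
```

The final per-channel saturating add is the kind of step that a single
ARMv6 uqadd8 could presumably perform on all four channels at once.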

This misunderstanding is common due to widespread confusion regarding
what role floating point plays in cairo's internals. Most floats that
arrive via an API call are converted into an integer type (e.g.
fixed-point) early on. Cairo uses integer arithmetic for most of its
internal computation. With that clarification, it should be no
surprise that much of the recent FP optimization work in cairo was
just a matter of speeding up conversions from floating point to an
integer type.
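
For readers unfamiliar with fixed-point, here is a minimal sketch of
that kind of conversion, assuming a 16.16 format (16 integer bits, 16
fractional bits). The type and function names are hypothetical and are
not cairo's API; the sketch only shows that everything after the
initial conversion is plain integer arithmetic.

```c
#include <stdint.h>

/* Hypothetical 16.16 fixed-point type: value = raw / 65536. */
typedef int32_t fixed16_16_t;

#define FIXED_ONE 65536

/* Convert a double to 16.16, rounding half away from zero.
 * This is the only place where floating point is touched. */
static fixed16_16_t fixed_from_double(double d)
{
    double scaled = d * (double)FIXED_ONE;
    return (fixed16_16_t)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
}

/* Multiply two 16.16 values using a 64-bit intermediate.
 * Assumes an arithmetic right shift for negative values (true for GCC). */
static fixed16_16_t fixed_mul(fixed16_16_t a, fixed16_16_t b)
{
    return (fixed16_16_t)(((int64_t)a * b) >> 16);
}

/* Integer part of a 16.16 value, rounded toward negative infinity
 * under the same arithmetic-shift assumption. */
static int fixed_floor_to_int(fixed16_16_t f)
{
    return (int)(f >> 16);
}
```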

Anyway, I think the 32-bit ARM SIMD could possibly get us some speedup
similar to how the existing MMX/SSE code has helped for x86 (for the
curious ones, see fbmmx.c in cairo or xserver). And since the MMX/SSE
code hasn't needed to drop down to raw assembly to get a nice
speedup (it uses intrinsics), your ARM SIMD intrinsics code is much
appreciated.

Dan Amelang
___
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers


Re: [maemo-developers] Improving Cairo performance on the N800

2007-01-20 Thread Daniel Amelang

(resending this now that the mailing list is back up)

On 1/16/07, Fernando Herrera <[EMAIL PROTECTED]> wrote:

> On Tue, 16-01-2007 at 12:20 +0200, ext Daniel Stone wrote:
> > We don't currently use the MBX block at all: there's no driver or
> > anything to hook into.
>
> There was a Linux driver for PowerVR from Imagination Technologies for
> 2.4 kernels, but I think it is not open source :(


Yea, it's not. While I was in the Linux driver section of the PowerVR
website, I also saw this:

"We have currently no plans of providing drivers supporting updated kernels."

Where "updated kernels" refers to > 2.4. Ouch. Too bad there isn't a
large enough developer community to make an open-source driver
feasible.

Maybe if Nokia pressured TI who then pressured PowerVR...I know, I
know, not gonna happen.

Dan Amelang


Re: [maemo-developers] Improving Cairo performance on the N800

2007-01-17 Thread Siarhei Siamashka
On Tuesday 16 January 2007 12:08, Zeeshan Ali wrote:

> > Now, the recently announced Nokia N800 is different from the 770 in
> > various ways that are interesting for Cairo performance. I've got my
> > eye on the ARMv6 SIMD instructions and the PowerVR MBX accelerator.
>
>    Yeah! Me too. The combined power of these two can make it possible
> to optimize a lot of nice free software out there for the N800 device.
> However, while the former is fully documented and the documentation is
> available to the general public, it doesn't have a lot to offer. ARMv6
> SIMD only operates on 32-bit words, and hence I find it unlikely that
> it can be used to optimize double FP emulation, in contrast to Intel
> Wireless MMX, which provides a big bunch of 128-bit (CORRECTME: or
> was it 64-bit?) SIMD instructions. OTOH, these few SIMD instructions
> can still be used to optimize a lot of code, but would it be a good
> idea for cairo if you need to convert the operand values to ints and
> the result(s) back to float?

Well, OMAP2420 seems to support floating point in hardware, so all this stuff
is probably not needed anymore :)

>    I had already been thinking about utilizing ARMv6 before the N800
> was released to the public. My proposed plan of attack for the
> community (and also the Nokia employees) is simply the following:
>
> 1. Patch GCC to provide ARMv6 intrinsics. (1 MM at most)
> 2. Patch liboil [1] to utilize these intrinsics when compiled for
> an ARMv6 target (1-3 MM)
> 3. Make all the software utilize liboil wherever appropriate, or ARMv6
> intrinsics directly if needed.
>
>    The 3rd step would ensure that you are optimizing your software for
> all the platforms for which liboil provides optimizations. OTOH, one
> can skip step #1 and write the liboil implementations in assembly.
>
>    I have already made a little progress on this, and the result is two
> header files which provide inline functions abstracting the assembly
> instructions. One of my friends was supposed to convert them to gcc
> intrinsics and patch gcc, but I never got around to finishing them.
> However, I am attaching the headers so anyone can use them as a
> starter if he/she likes.

According to my tests, such header files are easy to use, but the
performance improvement from using them is generally minimal.

When I benchmarked IDCT performance, out of curiosity I also tested a C
implementation with some macros for fast ARMv5TE 16-bit multiplication.
The performance improvement was only about 5%, whereas handcrafted code
improves performance by as much as 50% (and still has potential for more
optimizations):
http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2006-September/045837.html

A very similar minimal effect is obtained from using such macros in the
ffmpeg MP3 decoder.

The explanation is simple. The compiler is not able to schedule
instructions as well as a human can, especially if it has some 'alien'
parts of code inserted into the flow of its instructions via inline asm.
For example, this multiply instruction takes 1 cycle to execute, but the
result has 1 extra cycle of latency (on ARM9; on ARM11 it is even higher,
2 cycles), so you can't use it immediately in the next instruction. As
gcc does not know about the scheduling of such instructions when using
just macros, it may try to use the result immediately and suffer a
penalty of 1 or more cycles because of a pipeline interlock.
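
The dependency issue above can be sketched in C. This is only an
illustration under stated assumptions: mul16 and dot4 are hypothetical
names, the ARM branch assumes a GCC-compatible compiler that defines
__ARM_FEATURE_DSP, and every other target falls back to plain C. The
compiler has no latency information for the asm statement and remains
free to reorder these statements, which is exactly why such macros do
not guarantee a good schedule.

```c
#include <stdint.h>

#if defined(__arm__) && defined(__ARM_FEATURE_DSP)
/* ARMv5TE signed 16x16 -> 32 multiply of the bottom halfwords. */
static inline int32_t mul16(int16_t a, int16_t b)
{
    int32_t d;
    __asm__("smulbb %0, %1, %2" : "=r" (d) : "r" (a), "r" (b));
    return d;
}
#else
/* Portable fallback with the same result. */
static inline int32_t mul16(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}
#endif

/* Four-element dot product written so that no multiply result is
 * needed by the very next operation: the four products are mutually
 * independent, and the additions only start once all multiplies have
 * been issued. A hand-written assembly version would fix this order;
 * in C it is merely a hint. */
static int32_t dot4(const int16_t *x, const int16_t *y)
{
    int32_t p0 = mul16(x[0], y[0]);
    int32_t p1 = mul16(x[1], y[1]);
    int32_t p2 = mul16(x[2], y[2]);
    int32_t p3 = mul16(x[3], y[3]);

    return (p0 + p1) + (p2 + p3);
}
```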

So if really good performance is required, nothing can beat handcrafted
assembly yet. Of course, it makes sense to profile the code and optimize
only the time-critical, relatively small leaf functions.

By the way, free software is really poorly optimized for ARM right now.
For example, SDL is not optimized for ARM, the xserver is probably not
optimized either, and a lot of performance-critical code in various
software is still implemented only in C for ARM, while it has had x86
assembly optimizations for a long time. Considering that Internet
Tablets might face tight competition from x86 UMPC devices in the near
future, ARM-powered devices are at some disadvantage now. Is this
something that we should try to change? :-)


Re: [maemo-developers] Improving Cairo performance on the N800

2007-01-16 Thread Fernando Herrera
On Tue, 16-01-2007 at 12:20 +0200, ext Daniel Stone wrote:
> We don't currently use the MBX block at all: there's no driver or
> anything to hook into.

There was a Linux driver for PowerVR from Imagination Technologies for
2.4 kernels, but I think it is not open source :(

Regards



Re: [maemo-developers] Improving Cairo performance on the N800

2007-01-16 Thread Daniel Stone
Hi,

On Mon, Jan 15, 2007 at 09:48:35PM -0800, ext Daniel Amelang wrote:
> - Write a new Cairo backend that targets OpenVG, since the PowerVR MBX
> has fully-accelerated OpenVG rendering. I haven't found anything about
> OpenVG + Maemo 3.0, so maybe the software infrastructure isn't there
> yet to do this.
> 
> - Something involving the OpenGL capabilities of the MBX. It doesn't
> support shaders, so it would be pretty limited. It does support
> multitexturing, so maybe a poor man's glitz is feasible.

We don't currently use the MBX block at all: there's no driver or
anything to hook into.

Cheers,
Daniel


Re: [maemo-developers] Improving Cairo performance on the N800

2007-01-16 Thread Zeeshan Ali

Hello!


> Now, the recently announced Nokia N800 is different from the 770 in
> various ways that are interesting for Cairo performance. I've got my
> eye on the ARMv6 SIMD instructions and the PowerVR MBX accelerator.


  Yeah! Me too. The combined power of these two can make it possible
to optimize a lot of nice free software out there for the N800 device.
However, while the former is fully documented and the documentation is
available to the general public, it doesn't have a lot to offer. ARMv6
SIMD only operates on 32-bit words, and hence I find it unlikely that
it can be used to optimize double FP emulation, in contrast to Intel
Wireless MMX, which provides a big bunch of 128-bit (CORRECTME: or
was it 64-bit?) SIMD instructions. OTOH, these few SIMD instructions
can still be used to optimize a lot of code, but would it be a good
idea for cairo if you need to convert the operand values to ints and
the result(s) back to float?

  I had already been thinking about utilizing ARMv6 before the N800
was released to the public. My proposed plan of attack for the
community (and also the Nokia employees) is simply the following:

1. Patch GCC to provide ARMv6 intrinsics. (1 MM at most)
2. Patch liboil [1] to utilize these intrinsics when compiled for
an ARMv6 target (1-3 MM)
3. Make all the software utilize liboil wherever appropriate, or ARMv6
intrinsics directly if needed.

  The 3rd step would ensure that you are optimizing your software for
all the platforms for which liboil provides optimizations. OTOH, one
can skip step #1 and write the liboil implementations in assembly.
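
  The idea behind step 3 (one entry point, several interchangeable
implementations) can be sketched as follows. This is a hypothetical
illustration and not liboil's actual API: the names sum_u8, sum_u8_ref,
sum_u8_unrolled and select_sum_u8 are made up, and the "fast" variant is
just an unrolled C loop standing in for an ARMv6 implementation.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*sum_u8_fn)(const uint8_t *src, size_t n);

/* Reference implementation: simple and obviously correct. */
static uint32_t sum_u8_ref(const uint8_t *src, size_t n)
{
    uint32_t total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total += src[i];
    return total;
}

/* Stand-in for a platform-specific implementation (for example one
 * written with ARMv6 intrinsics or assembly); here merely unrolled. */
static uint32_t sum_u8_unrolled(const uint8_t *src, size_t n)
{
    uint32_t total = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4)
        total += (uint32_t)src[i] + src[i + 1] + src[i + 2] + src[i + 3];
    for (; i < n; i++)
        total += src[i];
    return total;
}

/* Callers always go through this pointer. */
static sum_u8_fn sum_u8 = sum_u8_ref;

/* Pick an implementation once, e.g. after probing the CPU at startup. */
static void select_sum_u8(int have_fast_path)
{
    sum_u8 = have_fast_path ? sum_u8_unrolled : sum_u8_ref;
}
```

Application code that only calls sum_u8() picks up whatever optimized
variant exists for the platform without being changed itself.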

  I have already made a little progress on this, and the result is two
header files which provide inline functions abstracting the assembly
instructions. One of my friends was supposed to convert them to gcc
intrinsics and patch gcc, but I never got around to finishing them.
However, I am attaching the headers so anyone can use them as a
starter if he/she likes.
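
  For anyone who wants to check results without ARMv6 hardware, here is
a minimal portable model of what two of the wrapped instructions
compute on the four byte lanes of a 32-bit word. The names uadd8_model
and uqadd8_model are made up for this sketch, and the GE flags that the
real uadd8 also sets are ignored.

```c
#include <stdint.h>

/* Model of uadd8: four independent 8-bit additions, each wrapping
 * modulo 256, with no carry between lanes. */
static uint32_t uadd8_model(uint32_t n, uint32_t m)
{
    uint32_t d = 0;
    int lane;

    for (lane = 0; lane < 4; lane++) {
        uint32_t sum = ((n >> (8 * lane)) & 0xff) + ((m >> (8 * lane)) & 0xff);
        d |= (sum & 0xff) << (8 * lane);
    }
    return d;
}

/* Model of uqadd8: the same four additions, but each lane saturates
 * at 255 instead of wrapping. */
static uint32_t uqadd8_model(uint32_t n, uint32_t m)
{
    uint32_t d = 0;
    int lane;

    for (lane = 0; lane < 4; lane++) {
        uint32_t sum = ((n >> (8 * lane)) & 0xff) + ((m >> (8 * lane)) & 0xff);
        if (sum > 0xff)
            sum = 0xff;
        d |= sum << (8 * lane);
    }
    return d;
}
```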

  Using the PowerVR MBX accelerator is a completely different story.
Although it has a lot to offer, I failed to find any documentation
on it. All I found was tons of documentation on how to use the OpenGL
ES implementation on top of the MBX. If you come across any
documentation on the MBX itself, please let me know.

[1] http://liboil.freedesktop.org/

--
Regards,

Zeeshan Ali
Design Engineer, SW
Open Source Software Operations
Nokia Multimedia
#ifndef __ARMV6_ARITHMETIC__
#define __ARMV6_ARITHMETIC__

/** 8-bit SIMD operations */


/* Signed 8-bit SIMD add */
static __inline unsigned long sadd8(unsigned long n, unsigned long m)
{
unsigned long d;

__asm__ __volatile__(
	"sadd8 %0, %1, %2\n"
	: "=r" (d)
	: "r" (n), "r" (m)
	: "cc");

return d;
}

/* Signed 8-bit SIMD subtraction */
static __inline unsigned long ssub8(unsigned long n, unsigned long m)
{
unsigned long d;

__asm__ __volatile__(
	"ssub8 %0, %1, %2\n"
	: "=r" (d)
	: "r" (n), "r" (m)
	: "cc");

return d;
}

/* Unsigned 8-bit SIMD addition */
static __inline unsigned long uadd8(unsigned long n, unsigned long m)
{
unsigned long d;

__asm__ __volatile__(
	"uadd8 %0, %1, %2\n"
	: "=r" (d)
	: "r" (n), "r" (m)
	: "cc");

return d;
}

/* Unsigned 8-bit SIMD subtraction */
static __inline unsigned long usub8(unsigned long n, unsigned long m)
{
unsigned long d;

__asm__ __volatile__("usub8 %0, %1, %2"
	: "=r" (d)
	: "r" (n), "r" (m)
	: "cc");

return d;
}

/* Signed saturating 8-bit SIMD addition */
static __inline unsigned long qadd8(unsigned long n, unsigned long m)
{
unsigned long d;

__asm__ __volatile__("qadd8 %0, %1, %2"
	: "=r" (d)
	: "r" (n), "r" (m)
	: "cc");

return d;
}

/* Signed saturating 8-bit SIMD subtraction */
static __inline unsigned long qsub8(unsigned long n, unsigned long m)
{
unsigned long d;

__asm__ __volatile__("qsub8 %0, %1, %2"
	: "=r" (d)
	: "r" (n), "r" (m)
	: "cc");

return d;
}

/* Unsigned saturating 8-bit SIMD addition */
static __inline unsigned long uqadd8(unsigned long n, unsigned long m)
{
unsigned long d;

__asm__ __volatile__("uqadd8 %0, %1, %2"
	: "=r" (d)
	: "r" (n), "r" (m)
	: "cc");

return d;
}

/* Unsigned saturating 8-bit SIMD subtraction */
static __inline unsigned long uqsub8(unsigned long n, unsigned long m)
{
unsigned long d;

__asm__ __volatile__("uqsub8 %0, %1, %2"
	: "=r" (d)
	: "r" (n), "r" (m)
	: "cc");

return d;
}

/** 16-bit SIMD operations */

/* Signed 16-bit SIMD add */
static __inline unsigned long sadd16(unsigned long n, unsigned long m)
{
unsigned long d;

__asm__ __volatile__("sadd16 %0, %1, %2"
	: "=r" (d)
	: "r" (n), "r" (m)
	: "cc");

return d;
}

/* Signed 16-bit SIMD subtraction */
static __inline unsigned long ssub16(unsigned long n, unsigned long m)
{
unsigned long d;

_

[maemo-developers] Improving Cairo performance on the N800

2007-01-15 Thread Daniel Amelang

(Double posting here, apologies for any overlap)

Now that Cairo on the 770 is performing pretty well, I hope the way is
cleared for Maemo to switch over to Cairo (and a more recent GTK).
Several of us have put a lot of effort into speeding it up, so it
would be nice to see the fruits of our labors on the Nokia devices. If
there are any more outstanding performance issues, let us know on the
Cairo list. FYI, Carl has projected that Cairo 1.4 (the first stable
release with the new optimizations) will be out in the next month or
so.

Now, the recently announced Nokia N800 is different from the 770 in
various ways that are interesting for Cairo performance. I've got my
eye on the ARMv6 SIMD instructions and the PowerVR MBX accelerator.

In other news, I'm looking for a class project for an embedded
software course I'm taking, so I'm thinking I can kill two birds with
one stone if I can turn some Cairo on OMAP 2420 optimizations into
something my professor will give me a grade for. So, I'm looking for
some feedback on the following ideas:

- Write some ARMv6 SIMD assembly for Cairo's image backend (pixman).
If this turns out to be feasible and advantageous, the resulting code
could also be incorporated into the fb part of the X server.

- Write a new Cairo backend that targets OpenVG, since the PowerVR MBX
has fully-accelerated OpenVG rendering. I haven't found anything about
OpenVG + Maemo 3.0, so maybe the software infrastructure isn't there
yet to do this.

- Something involving the OpenGL capabilities of the MBX. It doesn't
support shaders, so it would be pretty limited. It does support
multitexturing, so maybe a poor man's glitz is feasible.

Any ideas?

Dan Amelang