Re: [maemo-developers] Improving Cairo performance on the N800
(resending this now that the mailing list is back up)

On 1/16/07, Zeeshan Ali <[EMAIL PROTECTED]> wrote:
> Hello!
>
> > Now, the recently announced Nokia N800 is different from the 770 in
> > various ways that are interesting for Cairo performance. I've got my
> > eye on the ARMv6 SIMD instructions and the PowerVR MBX accelerator.
>
> Yeah! Me too. The combined power of these two can make it possible to
> optimize a lot of nice free software out there for the N800 device.
> However, while the former is fully documented and the documentation is
> available to the general public, it doesn't have a lot to offer. ARMv6
> SIMD only operates on 32-bit words, and hence I find it unlikely that it
> can be used to optimize double fp emulation, in contrast to the Intel
> Wireless MMX, which provides a big bunch of 128-bit (CORRECTME: or
> was it 64-bit?) SIMD instructions. OTOH, these few SIMD instructions
> can still be used to optimize a lot of code, but would it be a good
> idea for cairo if you need to convert the operand values to ints and
> the result(s) back to float?

No int <-> float conversion necessary. At this level, cairo uses ints exclusively.

To clarify, the part of cairo I'm thinking could use the ARM SIMD is the pixman library, which is almost an exact client-side mirror (copy, really) of the fb section of the X server. It's the part that implements the Porter-Duff operators in software. Floats are long out of the picture at this point.

This misunderstanding is common due to widespread confusion about the role floating point plays in cairo's internals. Most floats that arrive via an API call are converted into an integer type (e.g. fixed-point) early on. Cairo uses integer arithmetic for most of its internal computation.

With that clarification, it should be no surprise that many of the recent FP optimizations in cairo were just a matter of speeding up conversions from floating point to an integer type.
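For anyone who wants a concrete picture of that float-to-integer step, here is a minimal sketch of converting a double to a fixed-point integer once and then working in integers only. The 16.16 layout and every name in it (`fixed16_16`, `fixed_from_double`, `fixed_floor`, `fixed_mul`) are assumptions made up for this illustration, not cairo's actual API.

```c
#include <stdint.h>

/* Hypothetical 16.16 fixed-point type: 16 integer bits, 16 fraction bits.
 * The layout and all names here are illustrative assumptions. */
typedef int32_t fixed16_16;

#define FIXED_ONE 65536 /* the value 1.0 in 16.16 */

/* Convert a double to 16.16, rounding to nearest. The caller must keep
 * the input inside the representable range (roughly -32768 .. 32767). */
static fixed16_16 fixed_from_double(double d)
{
    double scaled = d * FIXED_ONE;
    return (fixed16_16)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}

/* Integer part, rounded toward negative infinity. Clearing the 16
 * fraction bits first makes the division exact. */
static int32_t fixed_floor(fixed16_16 f)
{
    return (f - (f & 0xffff)) / FIXED_ONE;
}

/* Multiply two 16.16 values through a 64-bit intermediate.
 * The final division truncates toward zero. */
static fixed16_16 fixed_mul(fixed16_16 a, fixed16_16 b)
{
    return (fixed16_16)(((int64_t)a * b) / FIXED_ONE);
}
```

Once a coordinate has gone through something like `fixed_from_double`, everything downstream is plain integer work, which is why ARM SIMD on 32-bit words is still interesting here.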
Anyway, I think the 32-bit ARM SIMD could possibly get us some speedup similar to how the existing MMX/SSE code has helped for x86 (for the curious ones, see fbmmx.c in cairo or xserver). And since the MMX/SSE code hasn't needed to drop down to raw assembly to get a nice speedup (it uses intrinsics), your ARM SIMD intrinsics code is much appreciated.

Dan Amelang
___
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers
Re: [maemo-developers] Improving Cairo performance on the N800
(resending this now that the mailing list is back up)

On 1/16/07, Fernando Herrera <[EMAIL PROTECTED]> wrote:
> On Tue, 16-01-2007 at 12:20 +0200, ext Daniel Stone wrote:
> > We don't currently use the MBX block at all: there's no driver or
> > anything to hook into.
>
> There was a Linux driver for PowerVR from Imagination Technologies for
> 2.4 kernels, but I think it is not open source :(

Yea, it's not. While I was in the Linux driver section of the PowerVR website, I also saw this:

"We have currently no plans of providing drivers supporting updated kernels."

Where "updated kernels" refers to > 2.4. Ouch.

Too bad there isn't a large enough developer community to make an open-source driver feasible. Maybe if Nokia pressured TI who then pressured PowerVR...I know, I know, not gonna happen.

Dan Amelang
Re: [maemo-developers] Improving Cairo performance on the N800
On Tuesday 16 January 2007 12:08, Zeeshan Ali wrote:
> > Now, the recently announced Nokia N800 is different from the 770 in
> > various ways that are interesting for Cairo performance. I've got my
> > eye on the ARMv6 SIMD instructions and the PowerVR MBX accelerator.
>
> Yeah! Me too. The combined power of these two can make it possible
> to optimize a lot of nice free software out there for the N800 device.
> However, while the former is fully documented and the documentation is
> available to the general public, it doesn't have a lot to offer. ARMv6
> SIMD only operates on 32-bit words, and hence I find it unlikely that it
> can be used to optimize double fp emulation, in contrast to the Intel
> Wireless MMX, which provides a big bunch of 128-bit (CORRECTME: or
> was it 64-bit?) SIMD instructions. OTOH, these few SIMD instructions
> can still be used to optimize a lot of code, but would it be a good
> idea for cairo if you need to convert the operand values to ints and
> the result(s) back to float?

Well, the OMAP2420 seems to support floating point in hardware, so all this stuff is probably not needed anymore :)

> I had already been thinking about utilizing ARMv6 before the N800 was
> released to the public. My proposed plan of attack for the community (and
> also the Nokia employees) is simply the following:
>
> 1. Patch GCC to provide ARMv6 intrinsics. (1 MM at most)
> 2. Patch liboil [1] to utilize these intrinsics when compiled for
>    an ARMv6 target (1-3 MM)
> 3. Make all the software utilize liboil wherever appropriate, or ARMv6
>    intrinsics directly if needed.
>
> The 3rd step would ensure that you are optimizing your software for
> all the platforms for which liboil provides optimizations. OTOH, one
> can skip step #1 and write the liboil implementations in assembly.
>
> I have already made a little progress on this, and the result is two
> header files which provide inline functions abstracting the assembly
> instructions. I am attaching the headers.
> One of my friends was supposed to convert them to gcc intrinsics and
> patch gcc, but I never got around to finishing them. However, I am
> attaching the headers so anyone can use them as a starter if he/she likes.

According to my tests, the performance improvement from using such header files is minimal. They are easy to use, but the improvement is generally not very good. When I benchmarked idct performance, I also tested a C implementation with some macros for fast armv5te 16-bit multiplication out of curiosity. The performance improvement was only about 5%, while handcrafted code improves performance by as much as 50% (and still has potential for more optimizations):
http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2006-September/045837.html

A similarly minimal effect was obtained from using such macros in the ffmpeg mp3 decoder.

The explanation is simple. The compiler is not able to schedule instructions as well as a human, especially if it has some 'alien' parts of code inserted into the flow of its instructions via inline asm. For example, this multiply instruction takes 1 cycle to execute, but the result has 1 extra cycle of latency (on ARM9; on ARM11 it is even higher, 2 cycles), so you can't use it immediately in the next instruction. As gcc does not know about the scheduling of such instructions when using just macros, it may try to use the result immediately and suffer a penalty of 1 or more cycles because of a pipeline interlock.

So if really good performance is required, nothing can beat handcrafted assembly yet. Of course it makes sense to profile the code and optimize only time-critical, relatively small leaf functions.

By the way, free software is really poorly optimized for ARM right now. For example, SDL is not optimized for ARM, xserver is probably not optimized either, and a lot of performance-critical parts of code in various software are still only implemented in C for ARM, while they have had x86 assembly optimizations for a long time. Considering that Internet Tablets might face tight competition from x86 UMPC devices in the near future, ARM-powered devices are at some disadvantage now. Is this something that we should try to change? :-)
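To make the latency point a bit more tangible, here is a rough sketch in portable C of what a signed 16x16 -> 32 multiply of the low halfwords computes (this is my reading of the ARMv5TE SMULBB instruction), plus a loop that issues two independent multiplies before consuming either result. The function names are made up for the illustration, and whether a given compiler keeps this ordering in the generated code is not guaranteed; it only shows the kind of interleaving a hand-scheduled routine relies on.

```c
#include <stddef.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of a 32-bit word. */
static int32_t low_half_signed(uint32_t w)
{
    int32_t lo = (int32_t)(w & 0xffffu);
    return lo >= 0x8000 ? lo - 0x10000 : lo;
}

/* Portable model of a signed 16x16 -> 32 multiply of the low halfwords.
 * Hypothetical helper, modelled on the ARMv5TE SMULBB instruction. */
static int32_t mul16_low(uint32_t a, uint32_t b)
{
    return low_half_signed(a) * low_half_signed(b);
}

/* Dot product of the low halfwords of x[] and y[].
 * Two independent products are computed before either is accumulated,
 * so no multiply result is needed by the very next operation.
 * The caller is responsible for keeping the sum within int32_t range. */
static int32_t dot16(const uint32_t *x, const uint32_t *y, size_t n)
{
    int32_t acc0 = 0, acc1 = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        int32_t p0 = mul16_low(x[i], y[i]);
        int32_t p1 = mul16_low(x[i + 1], y[i + 1]);
        acc0 += p0;
        acc1 += p1;
    }
    if (i < n) /* odd element count: handle the last pair of inputs */
        acc0 += mul16_low(x[i], y[i]);

    return acc0 + acc1;
}
```

A model like this is also handy for checking a handcrafted assembly routine against known-good results off-target.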
Re: [maemo-developers] Improving Cairo performance on the N800
On Tue, 16-01-2007 at 12:20 +0200, ext Daniel Stone wrote:
> We don't currently use the MBX block at all: there's no driver or
> anything to hook into.

There was a Linux driver for PowerVR from Imagination Technologies for 2.4 kernels, but I think it is not open source :(

Regards
Re: [maemo-developers] Improving Cairo performance on the N800
Hi,

On Mon, Jan 15, 2007 at 09:48:35PM -0800, ext Daniel Amelang wrote:
> - Write a new Cairo backend that targets OpenVG, since the PowerVR MBX
>   has fully-accelerated OpenVG rendering. I haven't found anything about
>   OpenVG + Maemo 3.0, so maybe the software infrastructure isn't there
>   yet to do this.
>
> - Something involving the OpenGL capabilities of the MBX. It doesn't
>   support shaders, so it would be pretty limited. It does support
>   multitexturing, so maybe a poor man's glitz is feasible.

We don't currently use the MBX block at all: there's no driver or anything to hook into.

Cheers,
Daniel
Re: [maemo-developers] Improving Cairo performance on the N800
Hello!

> Now, the recently announced Nokia N800 is different from the 770 in
> various ways that are interesting for Cairo performance. I've got my
> eye on the ARMv6 SIMD instructions and the PowerVR MBX accelerator.

Yeah! Me too. The combined power of these two can make it possible to optimize a lot of nice free software out there for the N800 device. However, while the former is fully documented and the documentation is available to the general public, it doesn't have a lot to offer. ARMv6 SIMD only operates on 32-bit words, and hence I find it unlikely that it can be used to optimize double fp emulation, in contrast to the Intel Wireless MMX, which provides a big bunch of 128-bit (CORRECTME: or was it 64-bit?) SIMD instructions. OTOH, these few SIMD instructions can still be used to optimize a lot of code, but would it be a good idea for cairo if you need to convert the operand values to ints and the result(s) back to float?

I had already been thinking about utilizing ARMv6 before the N800 was released to the public. My proposed plan of attack for the community (and also the Nokia employees) is simply the following:

1. Patch GCC to provide ARMv6 intrinsics. (1 MM at most)
2. Patch liboil [1] to utilize these intrinsics when compiled for an ARMv6 target (1-3 MM)
3. Make all the software utilize liboil wherever appropriate, or ARMv6 intrinsics directly if needed.

The 3rd step would ensure that you are optimizing your software for all the platforms for which liboil provides optimizations. OTOH, one can skip step #1 and write the liboil implementations in assembly.

I have already made a little progress on this, and the result is two header files which provide inline functions abstracting the assembly instructions. I am attaching the headers. One of my friends was supposed to convert them to gcc intrinsics and patch gcc, but I never got around to finishing them. However, I am attaching the headers so anyone can use them as a starter if he/she likes.
Using the PowerVR MBX accelerator is a completely different story. Although it has a lot to offer, I failed to find any documentation on it. There was plenty of documentation on how to use the OpenGL ES implemented on top of MBX. If you come across any documentation on the MBX itself, please let me know.

[1] http://liboil.freedesktop.org/

--
Regards,

Zeeshan Ali
Design Engineer, SW
Open Source Software Operations
Nokia Multimedia

#ifndef __ARMV6_ARITHMETIC__
#define __ARMV6_ARITHMETIC__

/** 8-bit SIMD operations */

/* Signed 8-bit SIMD add */
static __inline unsigned long sadd8(unsigned long n, unsigned long m)
{
    unsigned long d;
    __asm__ __volatile__(
        "sadd8 %0, %1, %2\n"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");
    return d;
}

/* Signed 8-bit SIMD subtraction */
static __inline unsigned long ssub8(unsigned long n, unsigned long m)
{
    unsigned long d;
    __asm__ __volatile__(
        "ssub8 %0, %1, %2\n"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");
    return d;
}

/* Unsigned 8-bit SIMD addition */
static __inline unsigned long uadd8(unsigned long n, unsigned long m)
{
    unsigned long d;
    __asm__ __volatile__(
        "uadd8 %0, %1, %2\n"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");
    return d;
}

/* Unsigned 8-bit SIMD subtraction */
static __inline unsigned long usub8(unsigned long n, unsigned long m)
{
    unsigned long d;
    __asm__ __volatile__("usub8 %0, %1, %2"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");
    return d;
}

/* Signed saturating 8-bit SIMD addition */
static __inline unsigned long qadd8(unsigned long n, unsigned long m)
{
    unsigned long d;
    __asm__ __volatile__("qadd8 %0, %1, %2"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");
    return d;
}

/* Signed saturating 8-bit SIMD subtraction */
static __inline unsigned long qsub8(unsigned long n, unsigned long m)
{
    unsigned long d;
    __asm__ __volatile__("qsub8 %0, %1, %2"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");
    return d;
}

/* Unsigned saturating 8-bit SIMD addition */
static __inline unsigned long uqadd8(unsigned long n, unsigned long m)
{
    unsigned long d;
    __asm__ __volatile__("uqadd8 %0, %1, %2"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");
    return d;
}

/* Unsigned saturating 8-bit SIMD subtraction (computes n - m per byte) */
static __inline unsigned long uqsub8(unsigned long n, unsigned long m)
{
    unsigned long d;
    __asm__ __volatile__("uqsub8 %0, %1, %2"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");
    return d;
}

/** 16-bit SIMD operations */

/* Signed 16-bit SIMD add */
static __inline unsigned long sadd16(unsigned long n, unsigned long m)
{
    unsigned long d;
    __asm__ __volatile__("sadd16 %0, %1, %2"
        : "=r" (d)
        : "r" (n), "r" (m)
        : "cc");
    return d;
}

/* Signed 16-bit SIMD subtraction */
static __inline unsigned long ssub16(unsigned long n, unsigned long m)
{
    unsigned long d;
    _
[maemo-developers] Improving Cairo performance on the N800
(Double posting here, apologies for any overlap)

Now that Cairo on the 770 is performing pretty well, I hope the way is cleared for Maemo to switch over to Cairo (and a more recent GTK). Several of us have put a lot of effort into speeding it up, so it would be nice to see the fruits of our labors on the Nokia devices. If there are any more outstanding performance issues, let us know on the Cairo list. FYI, Carl has projected that Cairo 1.4 (the first stable release with the new optimizations) will be out in the next month or so.

Now, the recently announced Nokia N800 is different from the 770 in various ways that are interesting for Cairo performance. I've got my eye on the ARMv6 SIMD instructions and the PowerVR MBX accelerator.

In other news, I'm looking for a class project for an embedded software course I'm taking, so I'm thinking I can kill two birds with one stone if I can turn some Cairo on OMAP 2420 optimizations into something my professor will give me a grade for.

So, I'm looking for some feedback on the following ideas:

- Write some ARMv6 SIMD assembly for Cairo's image backend (pixman). If this turns out to be feasible and advantageous, the resulting code could also be incorporated into the fb part of the X server.

- Write a new Cairo backend that targets OpenVG, since the PowerVR MBX has fully-accelerated OpenVG rendering. I haven't found anything about OpenVG + Maemo 3.0, so maybe the software infrastructure isn't there yet to do this.

- Something involving the OpenGL capabilities of the MBX. It doesn't support shaders, so it would be pretty limited. It does support multitexturing, so maybe a poor man's glitz is feasible.

Any ideas?

Dan Amelang
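For reference, here is a rough sketch of the kind of pixman-style inner operation the first idea is about: Porter-Duff OVER on premultiplied ARGB32 pixels, using integer arithmetic only and handling two 8-bit channels per 32-bit multiply. It is written from scratch for illustration and is not copied from pixman; the function names (`mul_channels`, `add_saturate_bytes`, `over`) are hypothetical. The per-byte saturating add at the end is a portable model of what a single ARMv6 UQADD8 instruction does on a 32-bit word.

```c
#include <stdint.h>

/* Multiply each of the four 8-bit channels of x by a/255, rounded to
 * nearest. Two channels share one 32-bit multiply; a must be 0..255. */
static uint32_t mul_channels(uint32_t x, uint32_t a)
{
    /* Channels in bits 0-7 and 16-23. */
    uint32_t rb = (x & 0x00ff00ffu) * a + 0x00800080u;
    rb = ((rb + ((rb >> 8) & 0x00ff00ffu)) >> 8) & 0x00ff00ffu;

    /* Channels in bits 8-15 and 24-31, shifted down first. */
    uint32_t ag = ((x >> 8) & 0x00ff00ffu) * a + 0x00800080u;
    ag = (ag + ((ag >> 8) & 0x00ff00ffu)) & 0xff00ff00u;

    return rb | ag;
}

/* Per-byte unsigned saturating add of two packed words.
 * Portable model of the ARMv6 UQADD8 instruction. */
static uint32_t add_saturate_bytes(uint32_t x, uint32_t y)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = ((x >> shift) & 0xffu) + ((y >> shift) & 0xffu);
        if (s > 0xffu)
            s = 0xffu;
        out |= s << shift;
    }
    return out;
}

/* OVER for premultiplied ARGB32: result = src + dst * (255 - src_alpha) / 255. */
static uint32_t over(uint32_t src, uint32_t dst)
{
    uint32_t inv_alpha = 255u - (src >> 24);
    return add_saturate_bytes(src, mul_channels(dst, inv_alpha));
}
```

On an ARMv6 target the loop in `add_saturate_bytes` is the obvious candidate to collapse into one instruction, while the multiply step already works on packed channels in plain C.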