Hi Denis,

I've finally started wrapping my head around this. It's a nice new take on reorganizing the work, and I'm intrigued that you said it's getting close to beating Ductus.

Some comments:

- Side note: I could have repurposed the previous incarnation of the file for FX rendering with just a few minor modifications. It looks like those days are over now, though, and I'll have to massage it quite a bit more to share the code... (Frowny face for me.)

- I'd like to see the Renderer cache implemented in the RenderingEngine code, since caching is outside the scope of the work that Renderer does. Also, your table could fill up with partially used renderers if there is ever a bug that causes an exception to be thrown before giveBack() is called. I'd like to see some code that tries to make sure renderers are always given back even if errors happen (or, at least, removed from the data structures if we feel that they aren't appropriate to reuse after the error).
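
  A minimal sketch of the kind of guard I mean (the pool class and
  names here are invented for illustration, not the actual webrev API):

    import java.util.concurrent.ConcurrentLinkedQueue;

    // Illustrative pool: a renderer only re-enters the cache if its
    // task completed normally; a partially used one is dropped.
    final class RendererPool {
        private final ConcurrentLinkedQueue<Renderer> cache =
            new ConcurrentLinkedQueue<>();

        Renderer getRenderer() {
            Renderer r = cache.poll();
            return (r != null) ? r : new Renderer();
        }

        void giveBack(Renderer r) {
            cache.offer(r);
        }

        void render(RenderTask task) {
            Renderer r = getRenderer();
            boolean completed = false;
            try {
                task.run(r);
                completed = true;
            } finally {
                if (completed) {
                    giveBack(r);  // normal path: recycle
                }
                // on an exception we simply drop r, so a half-used
                // renderer never re-enters the cache
            }
        }

        interface RenderTask { void run(Renderer r); }
        static final class Renderer { }
    }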

- Renderer, line 158 - "while (--toadd >= 0)" allows the compiler to use the condition codes from the decrement to satisfy the test.
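
  Just to spell out the idiom (toy code, not the actual Renderer loop):

    // process(...) is a stand-in for whatever the loop body does
    static void process(int crossing) { /* ... */ }

    static void processAll(int[] crossings, int toadd) {
        // the decrement sets the condition codes and the ">= 0" test
        // reuses them; an up-counting loop would need a separate
        // compare against the bound on every iteration
        while (--toadd >= 0) {
            process(crossings[toadd]);
        }
    }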

- Why not just have one tracker that manages TILE_SIZE<<SUB_POS values instead of TILE_SIZE trackers, each of which manages SUB_POS values? Unfortunately that means that EvenOdd would have to keep an array (because 32*8 is more bits than any single primitive type we have), but since we already need to have an array of the EvenOdd trackers, we aren't any worse off - in fact we come out ahead on storage, because currently you have 32*32 bits per set of trackers (plus overhead for 32 tracker objects).
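
  A sketch of what a consolidated tracker could look like, assuming
  TILE_SIZE = 32 and 8 sub-positions (the class and method names are
  invented):

    // One even-odd parity bitset over all 32<<3 = 256 sub-positions
    // of a tile row: four longs, versus 32 separate 32-bit trackers.
    final class EvenOddTracker {
        static final int TILE_SIZE = 32;
        static final int SUB_POS_LG = 3;                 // 8 sub-positions
        static final int BITS = TILE_SIZE << SUB_POS_LG; // 256

        private final long[] parity = new long[BITS >> 6]; // 4 longs

        // flip the parity at (pixel x, sub-position s)
        void toggle(int x, int s) {
            int bit = (x << SUB_POS_LG) + s;
            parity[bit >>> 6] ^= 1L << (bit & 63);
        }

        boolean inside(int x, int s) {
            int bit = (x << SUB_POS_LG) + s;
            return ((parity[bit >>> 6] >>> (bit & 63)) & 1L) != 0;
        }

        void clear() {
            java.util.Arrays.fill(parity, 0L);
        }
    }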

Some suggestions for future work:
(Don't need to deal with these now, but I'll throw the ideas on the table so they don't get lost)

- 'or' and 'line' don't really need a full 32 bits. They could share storage, they could be stored in byte[] arrays, or they could even share a single byte[] array.
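
  For example (illustrative only, assuming the orientation is +/-1 and
  the line value fits in 3 bits):

    // pack orientation (+1/-1) and sub-pixel line (0..7) into a byte
    static byte pack(int or, int line) {
        return (byte) (((or < 0) ? 0x80 : 0x00) | (line & 0x07));
    }

    static int unpackOr(byte b)   { return ((b & 0x80) != 0) ? -1 : 1; }
    static int unpackLine(byte b) { return b & 0x07; }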

- if you store last_subpixy instead of numys then you wouldn't have to keep storing the new number back in the arrays - it would become a read-only value.
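
  Roughly the difference (array and method names invented):

    final class EdgeRows {
        int[] numys;         // rows remaining per edge (current)
        int[] last_subpixy;  // last sub-pixel row per edge (proposed)

        // current scheme: decrement and store back on every row
        boolean aliveCounting(int e) {
            return --numys[e] >= 0;        // read-modify-write
        }

        // proposed scheme: compare against a value written once
        boolean aliveReadOnly(int e, int cury) {
            return cury < last_subpixy[e]; // the array stays read-only
        }
    }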

- I think it would be worth it to amortize the removal of dead pointers. All it would take would be to store a null in the spot and set a boolean; then they could all be collapsed out in one pass at the end. To get even more creative, if you sort them the other way around and then scan from count down to 0, you could just copy the value "ptrs[i] = ptrs[--count]" and let the insertion sort at the start of the next pixel row deal with it.
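
  Both variants in sketch form (ptrs/count are invented names):

    // (a) dead entries were nulled as we went; collapse them out in
    // one pass at the end of the pixel row:
    static int compact(Object[] ptrs, int count) {
        int out = 0;
        for (int i = 0; i < count; i++) {
            if (ptrs[i] != null) {
                ptrs[out++] = ptrs[i];   // no per-removal shifting
            }
        }
        return out;                      // the new count
    }

    // (b) with the opposite sort order, scanning from count down to
    // 0, a removal is just a swap with the last live entry, and the
    // insertion sort at the start of the next pixel row repairs any
    // disorder:
    //
    //     ptrs[i] = ptrs[--count];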

- Another way to manage the merge sort would be to have a temp array of SUB_PIX_Y values, quickly store all of the crossings into it either forwards or backwards in a pair of tight loops (one for slope > 0 and one for slope < 0), and then do a regular merge in a second tight loop. If you use the values at [ci, ci+toadd) as the temp array, then one test could skip the merge step entirely because the values would already be sorted (and this is likely to happen). My thought is that two tighter loops might be faster than one more complicated loop.
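
  Here is a per-edge sketch of the idea (names and the arithmetic are
  invented; real code would be in fixed point, and the merge here folds
  one edge's sorted run into the crossings accumulated so far):

    static int addEdge(int[] crossings, int ci, int[] temp,
                       int x0, int slope, int rows) {
        int x = x0;
        if (slope > 0) {                 // x increases: fill forwards
            for (int i = 0; i < rows; i++) {
                temp[i] = x;
                x += slope;
            }
        } else {                         // x decreases: fill backwards
            for (int i = rows - 1; i >= 0; i--) {
                temp[i] = x;
                x += slope;
            }
        }
        // one test skips the merge entirely when the new run already
        // belongs at the end (the likely case):
        if (ci == 0 || crossings[ci - 1] <= temp[0]) {
            System.arraycopy(temp, 0, crossings, ci, rows);
            return ci + rows;
        }
        // otherwise merge the sorted run in from the top down
        int src = ci - 1, t = rows - 1, dst = ci + rows - 1;
        while (t >= 0) {
            if (src >= 0 && crossings[src] > temp[t]) {
                crossings[dst--] = crossings[src--];
            } else {
                crossings[dst--] = temp[t--];
            }
        }
        return ci + rows;
    }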

I'm going to send this now before I get too bogged down in figuring out the new design of the TileGenerator...

                        ...jim

On 7/19/2011 2:01 PM, Denis Lila wrote:
Hi Jim.

We spoke about this a while ago and I started working
on it. This is what I have so far:
http://icedtea.classpath.org/~dlila/webrevs/RendererPerf/

My main intention was to remove some stages
from the renderer (like you suggested) because what
we were doing was:
1. Transform curves to lines and put the lines in a buffer.
2. Iterate through scanlines, get crossings with the lines,
    compute alphas, and store them into an array.
3. Compress the alpha array using RLE.
4. When asked, use the RLE encoding to generate alpha tiles.

and I was hoping to be able to go directly from curves to
crossings, and then generate the tiles from the crossing
data. This would reduce the intermediate storage and would
move around much less data.

At first, I implemented this with an int[][] where each inner
array represented the crossings in one pixel row, and each
crossing packed 3 chunks of data into one int: the x coordinate
of the crossing, the orientation, and the scanline within the
pixel row where the crossing was located (a number in
[0, 1<<SUBPIXEL_LG_POSITIONS_Y)).
The crossings needed to be sorted, and I did this with Arrays.sort.
This was fast in small tests, but for any shape with more than 4 edges
per pixel row it started having huge scalability problems
(for larger shapes we were spending 90+% of the time in
quicksort).
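
For concreteness, a packing along these lines (the actual bit widths
aren't specified above, so these are guesses; a single Arrays.sort
works because the major key sits in the high bits):

    // one guess at the layout: sub-pixel y in the high bits so one
    // sort groups crossings by sub-scanline and orders each group
    // by x (assumes x is nonnegative and fits in 27 bits, and that
    // SUBPIXEL_LG_POSITIONS_Y = 3)
    static int makeCrossing(int suby, int x, int or) {
        return (suby << 28) | (x << 1) | (or > 0 ? 1 : 0);
    }

    static int crossingSubY(int c) { return c >>> 28; }
    static int crossingX(int c)    { return (c >>> 1) & ((1 << 27) - 1); }
    static int crossingOr(int c)   { return ((c & 1) != 0) ? 1 : -1; }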

The version in the webrev is the fastest yet, and it uses
a class similar to the ScanlineIterator for the sorting
(but now it iterates through pixel rows, not scanlines).
Unfortunately this means that there still are two levels
of intermediate storage (edges and crossings, analogous
to edges and RLE elements in the old version), but the
only way I can think of to get rid of one of them
and still use an average linear-time sort is if we
start outputting alphas pixel row by pixel row, rather than
in 32x32 tiles.

Anyway, it's much faster than it used to be, so we should
probably push it (after I implement a cache eviction
policy).

Any suggestions/criticisms are always welcome.

Thank you,
Denis.
