Hi Denis,

I've finally started wrapping my head around this. It's a nice new take on reorganizing the work, and I'm intrigued that you said it's getting close to beating Ductus.

Some comments:

- Side note: I could have repurposed the previous incarnation of the file for FX rendering with just a few minor modifications. It looks like those days are over now, though, and I'll have to massage it quite a bit more to share the code... (Frowny face for me.)

- I'd like to see the Renderer cache implemented in the RenderingEngine code, since caching is outside the scope of the work that Renderer does. Also, your table could fill up with partially used renderers if there is ever a bug that causes an exception to be thrown before giveBack() is called. I'd like to see some code that tries to make sure renderers are always given back even if errors happen (or, at least, removed from the data structures if we feel that they aren't appropriate to reuse after the error).
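
  A minimal sketch of the kind of guard I mean (the pool class and
  names here are invented for illustration, not the actual webrev API):

    import java.util.concurrent.ConcurrentLinkedQueue;

    // Illustrative pool: a renderer only re-enters the cache if its
    // task completed normally; a partially used one is dropped.
    final class RendererPool {
        private final ConcurrentLinkedQueue<Renderer> cache =
            new ConcurrentLinkedQueue<>();

        Renderer getRenderer() {
            Renderer r = cache.poll();
            return (r != null) ? r : new Renderer();
        }

        void giveBack(Renderer r) {
            cache.offer(r);
        }

        void render(RenderTask task) {
            Renderer r = getRenderer();
            boolean completed = false;
            try {
                task.run(r);
                completed = true;
            } finally {
                if (completed) {
                    giveBack(r);  // normal path: recycle
                }
                // on an exception we simply drop r, so a half-used
                // renderer never re-enters the cache
            }
        }

        interface RenderTask { void run(Renderer r); }
        static final class Renderer { }
    }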

- Renderer, line 158 - "while (--toadd >= 0)" allows the compiler to use the condition codes from the decrement to satisfy the test.
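
  Just to spell out the idiom (toy code, not the actual Renderer loop):

    // process(...) is a stand-in for whatever the loop body does
    static void process(int crossing) { /* ... */ }

    static void processAll(int[] crossings, int toadd) {
        // the decrement sets the condition codes and the ">= 0" test
        // reuses them; an up-counting loop would need a separate
        // compare against the bound on every iteration
        while (--toadd >= 0) {
            process(crossings[toadd]);
        }
    }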

- Why not just have one tracker that manages TILE_SIZE<<SUB_POS values instead of TILE_SIZE trackers, each of which manages SUB_POS values? Unfortunately that means that EvenOdd would have to keep an array (because 32*8 is more bits than any single primitive type we have), but since we already need to have an array of the EvenOdd trackers, we aren't any worse off - in fact we come out ahead on storage, because currently you have 32*32 bits per set of trackers (plus overhead for 32 tracker objects).
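
  A sketch of what a consolidated tracker could look like, assuming
  TILE_SIZE = 32 and 8 sub-positions (the class and method names are
  invented):

    // One even-odd parity bitset over all 32<<3 = 256 sub-positions
    // of a tile row: four longs, versus 32 separate 32-bit trackers.
    final class EvenOddTracker {
        static final int TILE_SIZE = 32;
        static final int SUB_POS_LG = 3;                 // 8 sub-positions
        static final int BITS = TILE_SIZE << SUB_POS_LG; // 256

        private final long[] parity = new long[BITS >> 6]; // 4 longs

        // flip the parity at (pixel x, sub-position s)
        void toggle(int x, int s) {
            int bit = (x << SUB_POS_LG) + s;
            parity[bit >>> 6] ^= 1L << (bit & 63);
        }

        boolean inside(int x, int s) {
            int bit = (x << SUB_POS_LG) + s;
            return ((parity[bit >>> 6] >>> (bit & 63)) & 1L) != 0;
        }

        void clear() {
            java.util.Arrays.fill(parity, 0L);
        }
    }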

Some suggestions for future work:
(Don't need to deal with these now, but I'll throw the ideas on the table so they don't get lost)

- 'or' and 'line' don't really need a full 32 bits. They could share storage, they could be stored in byte[] arrays, or they could even share a single byte[] array.
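
  For example (illustrative only, assuming the orientation is +/-1 and
  the line value fits in 3 bits):

    // pack orientation (+1/-1) and sub-pixel line (0..7) into a byte
    static byte pack(int or, int line) {
        return (byte) (((or < 0) ? 0x80 : 0x00) | (line & 0x07));
    }

    static int unpackOr(byte b)   { return ((b & 0x80) != 0) ? -1 : 1; }
    static int unpackLine(byte b) { return b & 0x07; }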

- if you store last_subpixy instead of numys then you wouldn't have to keep storing the new number back in the arrays - it would become a read-only value.
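
  Roughly the difference (array and method names invented):

    final class EdgeRows {
        int[] numys;         // rows remaining per edge (current)
        int[] last_subpixy;  // last sub-pixel row per edge (proposed)

        // current scheme: decrement and store back on every row
        boolean aliveCounting(int e) {
            return --numys[e] >= 0;        // read-modify-write
        }

        // proposed scheme: compare against a value written once
        boolean aliveReadOnly(int e, int cury) {
            return cury < last_subpixy[e]; // the array stays read-only
        }
    }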

- I think it would be worth it to amortize the removal of dead pointers. All it would take would be to store a null in the spot and set a boolean; then they could all be collapsed out in one pass at the end. To get even more creative, if you sort them the other way around and then scan from count down to 0, you could just copy the value "ptrs[i] = ptrs[--count]" and let the insertion sort at the start of the next pixel row deal with it.
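
  Both variants in sketch form (ptrs/count are invented names):

    // (a) dead entries were nulled as we went; collapse them out in
    // one pass at the end of the pixel row:
    static int compact(Object[] ptrs, int count) {
        int out = 0;
        for (int i = 0; i < count; i++) {
            if (ptrs[i] != null) {
                ptrs[out++] = ptrs[i];   // no per-removal shifting
            }
        }
        return out;                      // the new count
    }

    // (b) with the opposite sort order, scanning from count down to
    // 0, a removal is just a swap with the last live entry, and the
    // insertion sort at the start of the next pixel row repairs any
    // disorder:
    //
    //     ptrs[i] = ptrs[--count];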

- Another way to manage the merge sort would be to have a temp array of SUB_PIX_Y values, quickly store all of the crossings into it either forwards or backwards in a pair of tight loops (one for slope > 0 and one for slope < 0), and then do a regular merge in a second tight loop. If you use the values at [ci, ci+toadd) as the temp array, then one test could skip the merge step entirely because the values would already be sorted (and this is likely to happen). My thought is that two tighter loops might be faster than one more complicated loop.
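
  Here is a per-edge sketch of the idea (names and the arithmetic are
  invented; real code would be in fixed point, and the merge here folds
  one edge's sorted run into the crossings accumulated so far):

    static int addEdge(int[] crossings, int ci, int[] temp,
                       int x0, int slope, int rows) {
        int x = x0;
        if (slope > 0) {                 // x increases: fill forwards
            for (int i = 0; i < rows; i++) {
                temp[i] = x;
                x += slope;
            }
        } else {                         // x decreases: fill backwards
            for (int i = rows - 1; i >= 0; i--) {
                temp[i] = x;
                x += slope;
            }
        }
        // one test skips the merge entirely when the new run already
        // belongs at the end (the likely case):
        if (ci == 0 || crossings[ci - 1] <= temp[0]) {
            System.arraycopy(temp, 0, crossings, ci, rows);
            return ci + rows;
        }
        // otherwise merge the sorted run in from the top down
        int src = ci - 1, t = rows - 1, dst = ci + rows - 1;
        while (t >= 0) {
            if (src >= 0 && crossings[src] > temp[t]) {
                crossings[dst--] = crossings[src--];
            } else {
                crossings[dst--] = temp[t--];
            }
        }
        return ci + rows;
    }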

I'm going to send this now before I get too bogged down in figuring out the new design of the TileGenerator...

                        ...jim

On 7/19/2011 2:01 PM, Denis Lila wrote:
Hi Jim.

We spoke about this a while ago and I started working
on it. This is what I have so far:
http://icedtea.classpath.org/~dlila/webrevs/RendererPerf/

My main intention was to remove some stages
from the renderer (like you suggested) because what
we were doing was:
1. Transform curves to lines and put the lines in a buffer.
2. Iterate through scanlines, get crossings with the lines,
    compute alphas, and store them into an array.
3. Compress the alpha array using RLE.
4. When asked, use the RLE encoding to generate alpha tiles.

and I was hoping to be able to go directly from curves to
crossings, and then generate the tiles from the crossing
data. This would reduce the intermediate storage and would
move around much less data.

At first, I implemented this with an int[][] where each inner
array represented the crossings in one pixel row, and each
crossing packed 3 chunks of data into one int: the x coordinate
of the crossing, the orientation, and the scanline within the
pixel row where the crossing was located (a number in
[0, 1<<SUBPIXEL_LG_POSITIONS_Y)).
The crossings needed to be sorted, and I did this with Arrays.sort.
This was fast in small tests, but for any shape with more than 4 edges
per pixel row it started having huge scalability problems
(for larger shapes we were spending 90+% of the time in
quicksort).
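
For concreteness, a packing along these lines (the actual bit widths
aren't specified above, so these are guesses; a single Arrays.sort
works because the major key sits in the high bits):

    // one guess at the layout: sub-pixel y in the high bits so one
    // sort groups crossings by sub-scanline and orders each group
    // by x (assumes x is nonnegative and fits in 27 bits, and that
    // SUBPIXEL_LG_POSITIONS_Y = 3)
    static int makeCrossing(int suby, int x, int or) {
        return (suby << 28) | (x << 1) | (or > 0 ? 1 : 0);
    }

    static int crossingSubY(int c) { return c >>> 28; }
    static int crossingX(int c)    { return (c >>> 1) & ((1 << 27) - 1); }
    static int crossingOr(int c)   { return ((c & 1) != 0) ? 1 : -1; }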

The version in the webrev is the fastest yet, and it uses
a class similar to the ScanlineIterator for the sorting
(but now it iterates through pixel rows, not scanlines).
Unfortunately this means that there still are two levels
of intermediate storage (edges and crossings, analogous
to edges and RLE elements in the old version), but the
only way I can think of to get rid of one of them
and still use an average linear-time sort is if we
start outputting alphas pixel row by pixel row, rather than
in 32x32 tiles.

Anyway, it's much faster than it used to be, so we should
probably push it (after I implement a cache eviction
policy).

Any suggestions/criticisms are always welcome.

Thank you,
Denis.
