Hi Jim.

> - Create a curve class and store an array of those so you don't have
> to iterate 3 different arrays of values and use array accesses to grab
> the data (each array access is checked for index OOB exceptions).

I actually implemented something like this in my first draft of the current
version of renderer. I didn't stick with it because it was a slower than
even what we have in openjdk6. Granted, my implementation was a bit more
high level than simply making 3 classes to represent lines quads and cubics,
but it seemed pretty hopeless, so I didn't spend any time figuring out
exactly what it was that made it slower.

> - Or perform AFD on curves as they come into Renderer, but only save 
> line segment edges in the edges array.  This would use more storage,
> but the iterations of the AFD would happen in a tight loop as the data
> comes in rather than having to store all of their AFD data back in the quads

I considered this before abandoning the high level version I mention above.
I didn't go with it because, while I am not worried about the higher memory
consumption, I was afraid of the impact that having this many edges would
have on the qsort call and on lines 99-113 and 140-148 in next().
How about this: we change the format of the edges array to be an array of
sequences of edges. So, right now the edges array has this format:
E* where E represents 6 consecutive floats {ymin,ymax,curx,cury,or,slope}.
I am proposing we change it to {n,or}({ymin,ymax,curx,cury,slope})^n.
lineTo's would add an edge sequence with n=1 to the edges array. If a call
to quadTo or curveTo produced N curves, it would simply put N,or at the
end of the edges array, and then append the data for the N produced edges.
I think this would give us the best of both worlds, in that we can do all
the AFD iterations in a tight loop close to the input methods and it
doesn't present any problems with respect to sorting or managing the
active list. It can probably be implemented completely transparently with
respect to ScanlineIterator. The details of
the implementation involve an interesting speed/memory trade off, but we
can discuss that later.

> - Convert to native.  Note that we use a native version of the pisces
> that you started with to do some rendering in FX.  I tried porting to
> use your new (Java) renderer in FX and performance went down even
> though you show it to be faster than what was there before.  So, your Java 
> renderer compares favorably to the old Java pisces, but both compare 
> unfavorably to the old native pisces.  Maybe we should convert your
> code to native and see if that gives us a performance boost.  It's nice to
> use pure Java, but there is a lot of "shoehorning" of data going on
> here that could be done much more easily and naturally in native code.

I've been wanting to do this for a very long time. C and C++ are more
convenient for this type of work. I didn't because I've never used the JNI
before. I guess this is as good a time to learn as any. I'd still like to
keep the debugging of native code to a minimum, so we should implement
as much as possible in java before starting on this.


I still need to think some more about your other suggestion. I'll reply
to it tomorrow.

> How is that for "food for thought"?
Delicious :)

Regards,
Denis.

----- "Jim Graham" <james.gra...@oracle.com> wrote:

> Hi Denis,
> 
> On 10/25/2010 7:34 AM, Denis Lila wrote:
> >> (and I have some ideas on further optimizations to consider if you
> are still
> >> game after this goes in)...
> >
> > I'd love to hear what they are.
> 
> Here are my thoughts:
> 
> - Currently Renderer has more stages than we probably should have:
>      for (each full pixel row) {
>          for (each subpixel row) {
>              for (each curve on subpixel row) {
>                  convert curve into crossing
>                      crossing is (subpixelx:31 + dir:1)
>              }
>              sort crossings for subpixel row
>              apply wind rule and convert crossings into alpha deltas
>          }
>          convert alpha deltas into run lengths
>      }
>      for (each tile) {
>          convert run lengths into alphas
>      }
> I'm thinking we should be able to get rid of a couple of those
> stages...
> 
> - One alternate process would be:
>      (all work is done in the tail end of the cache itself)
>      for (each full pixel row) {
>          for (each curve on full pixel row) {
>              convert curve into N subpixel crossings
>                  (subpixelx:28 + subpixelfracty:3 + dir:1)
>          }
>      }
>      sort all crossings for entire pixel row
>      maybe collapse raw crossings into modified accumulated crossings
>      record start of row in index array
>      for (each tile) {
>          convert raw or collapsed crossings directly into alphas
>      }
> Note that we could simply leave the raw crossings in the cache and
> then 
> process them in the tile generator, but that would require more
> storage 
> in the cache since a given path would tend to have 8 different entries
> 
> as it crossed every subpixel row.  If we collapse them, then we can
> sum 
> the entries for a given x coordinate into a single entry and store:
>      (pixelx:25 + coveragedelta:7)
> where the coverage delta is a signed quantity up to N_SUBPIX^2 so it 
> uses the same storage we needed for the subpixel parts of both X and Y
> 
> plus the direction bit.  It likely needs a paired value in the next x
> 
> pixel location just like our current "alpha deltas" needs as well.  If
> 
> we wanted to optimize that then we could devote one more bit to "the 
> next pixel will consume all of the remainder of the N^2 coverage", but
> 
> there would be cases where that would not be true (such as if the
> pixel 
> row has only partial vertical coverage by the path).  It's probably 
> simpler to just have deltas for every pixel.
> 
> The storage here would likely be similar to the storage used for the 
> current cache since the current RLE cache uses 2 full ints to store a
> 
> coverage and a count.  And in cases where we have one pixel taking up
> 
> partial coverage and the following pixel taking up the remainder of
> the 
> full coverage then we have 4 ints, but the "crossing delta" system
> would 
> only have 2 ints.
> 
> Other thoughts...
> 
> and curves arrays and then reload the data for every sub-pixel step. 
> Renderer still takes curves, it just breaks them down immediately
> rather 
> than on the fly.  If there are only a small number of edges per curve
> 
> then the storage might not be that much worse because the quad and
> curve 
> arrays already store more values than the edge array.
> 
>                       ...jim

Reply via email to