Hi Jim.

I have a new webrev: 
http://icedtea.classpath.org/~dlila/webrevs/perfWebrev/webrev/

> How about looking more at the stroking end of the process and I'll dig
> a little more into optimal rasterization code.  I have a lot of
> experience with optimizing rasterizer code (and JNI if it comes to that), but
> very little with the curve manipulations involved in stroking (math is so 
> *hard* at my age ;-)...

I tried to do this. I used the NetBeans profiler, and it said that a large
chunk of the time (about 12%) is spent in curveTo. curveTo does almost
nothing: it just packs up the curve array and delegates to somethingTo.
This makes me think that there's not a whole lot that can be done to
improve Stroker's performance (I'm OK with this, since J2DBench and
Java2DDemo animation frame rates both say that non-antialiased, non-dashed
performance is better than even closed source Java). I did make one small
change, though: I inlined the calls to dxat, dyat, ddxat, and ddyat in
ROCsq, because the profiler said that a lot of time was spent there. This
made a surprisingly large difference (though still not that great overall).
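
A minimal sketch of the inlined ROCsq, assuming the cubic is stored as
polynomial coefficients (the field names here are hypothetical; the real
names in Stroker may differ):

    // x(t) = ax*t^3 + bx*t^2 + cx*t + dx, and likewise for y(t).
    private double ax, bx, cx, ay, by, cy;

    // ROC(t) = (x'^2 + y'^2)^(3/2) / |x'*y'' - y'*x''|, so its square is:
    private double ROCsq(final double t) {
        // first derivatives, inlined instead of calling dxat(t)/dyat(t)
        final double dx = t * (t * 3.0 * ax + 2.0 * bx) + cx;
        final double dy = t * (t * 3.0 * ay + 2.0 * by) + cy;
        // second derivatives, inlined instead of calling ddxat(t)/ddyat(t)
        final double ddx = t * 6.0 * ax + 2.0 * bx;
        final double ddy = t * 6.0 * ay + 2.0 * by;
        final double dx2dy2 = dx * dx + dy * dy;
        final double cross  = dx * ddy - dy * ddx;
        return (dx2dy2 * dx2dy2 * dx2dy2) / (cross * cross);
    }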

I also fixed the dashing performance degradation. I removed the binary
search and am now using getCubicRoots to find the t at which to split.
Another hugely significant change was using Math.sqrt instead of
Math.hypot in the implementation of Helpers.linelen. I had been meaning
to do this for a while, since sqrt is about 100 times faster on my
machine than hypot, but I didn't realize it would have such a large
impact on dashing. Anyway, dashing is now much, much faster than before.
It is even faster than the flattening version we used to use. The
precision might not be as good as the current, slow implementation, but
the difference is only noticeable for curves with a lot of acceleration,
and even then only in a side by side comparison of the two
implementations. The benchmarks show a 200%-500% improvement, so I think
it is well worth it. Unfortunately, curve dashing is still a bit slower
than its closed source counterpart, but not by much.
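
The linelen change itself is tiny; roughly (a sketch, the exact signature
in Helpers may differ):

    // Math.hypot guards against overflow/underflow when squaring dx and
    // dy, but for coordinates already in device space that protection
    // isn't needed, and sqrt is far cheaper.
    static float linelen(final float x0, final float y0,
                         final float x1, final float y1) {
        final float dx = x1 - x0;
        final float dy = y1 - y0;
        return (float) Math.sqrt(dx * dx + dy * dy); // was Math.hypot(dx, dy)
    }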

I also tweaked the AFD (adaptive forward differencing) algorithm for quadratic curves. It's a bit faster now.
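
For anyone unfamiliar with the technique, the core of forward differencing
for one coordinate of a quadratic looks roughly like this (a sketch; the
names and the emit() consumer are made up, and I'm omitting the adaptive
part, which halves or doubles the step and rescales the differences when
an increment gets too large or too small):

    // x(t) = a*t*t + b*t + c, sampled at t = 0, h, 2h, ...  For a
    // quadratic the second forward difference is constant, so each step
    // is just two additions.
    static void forwardDiffQuad(double a, double b, double c, int n) {
        final double h = 1.0 / n;           // n steps across t in [0, 1]
        double x  = c;                      // x(0)
        double dx = a * h * h + b * h;      // x(h) - x(0)
        final double ddx = 2.0 * a * h * h; // x(t+2h) - 2*x(t+h) + x(t)
        for (int i = 0; i <= n; i++) {
            emit(x);
            x  += dx;
            dx += ddx;
        }
    }

    static void emit(double x) { /* hypothetical consumer of samples */ }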

A while ago you made a suggestion on how to improve anti-aliasing performance:

> Here are my thoughts:
> 
>       - Currently Renderer has more stages than we probably should have:
>            for (each full pixel row) {
>                for (each subpixel row) {
>                    for (each curve on subpixel row) {
>                        convert curve into crossing
>                            crossing is (subpixelx:31 + dir:1)
>                    }
>                    sort crossings for subpixel row
>                    apply wind rule and convert crossings into alpha deltas
>                }
>                convert alpha deltas into run lengths
>            }
>            for (each tile) {
>                convert run lengths into alphas
>            }
>       I'm thinking we should be able to get rid of a couple of those stages...
> 
>       - One alternate process would be:
>            (all work is done in the tail end of the cache itself)
>            for (each full pixel row) {
>                for (each curve on full pixel row) {
>                    convert curve into N subpixel crossings
>                        (subpixelx:28 + subpixelfracty:3 + dir:1)
>                }
>            }
>            sort all crossings for entire pixel row
>            maybe collapse raw crossings into modified accumulated crossings
>            record start of row in index array
>            for (each tile) {
>                convert raw or collapsed crossings directly into alphas
>            }
>       Note that we could simply leave the raw crossings in the cache and then
>       process them in the tile generator, but that would require more storage
>       in the cache since a given path would tend to have 8 different entries
>       as it crossed every subpixel row.  If we collapse them, then we can sum
>       the entries for a given x coordinate into a single entry and store:
>            (pixelx:25 + coveragedelta:7)
>       where the coverage delta is a signed quantity up to N_SUBPIX^2 so it
>       uses the same storage we needed for the subpixel parts of both X and Y
>       plus the direction bit.  It likely needs a paired value in the next x
>       pixel location just like our current "alpha deltas" needs as well.  If
>       we wanted to optimize that then we could devote one more bit to "the
>       next pixel will consume all of the remainder of the N^2 coverage", but
>       there would be cases where that would not be true (such as if the pixel
>       row has only partial vertical coverage by the path).  It's probably
>       simpler to just have deltas for every pixel.
> 
>       The storage here would likely be similar to the storage used for the
>       current cache since the current RLE cache uses 2 full ints to store a
>       coverage and a count.  And in cases where we have one pixel taking up
>       partial coverage and the following pixel taking up the remainder of the
>       full coverage then we have 4 ints, but the "crossing delta" system would
>       only have 2 ints.
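
If I read the (pixelx:25 + coveragedelta:7) format right, the packing
would look something like the following (my sketch of the suggestion,
not actual Renderer code):

    // High 25 bits: pixel x.  Low 7 bits: signed coverage delta.
    static final int DELTA_BITS = 7;
    static final int DELTA_MASK = (1 << DELTA_BITS) - 1;

    static int pack(int pixelx, int delta) {
        return (pixelx << DELTA_BITS) | (delta & DELTA_MASK);
    }
    static int pixelxOf(int packed) {
        return packed >>> DELTA_BITS;   // pixel x is non-negative
    }
    static int deltaOf(int packed) {
        // sign-extend the low 7 bits back to a signed int
        return (packed << (32 - DELTA_BITS)) >> (32 - DELTA_BITS);
    }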

I tried to implement something like this. What I did was: I reduced the
length of the buckets array to have only one bucket per pixel row, and I
removed the array that kept track of how many lines were in each bucket.
In next(), I iterated through the bucket for the current pixel row. For
each line, I iterated through all the scanlines in the current pixel row
that it crossed, and for each of those I added a crossing. (I still kept
the crossings sorted using insertion sort, although this might have hurt
me; simply quicksorting all the crossings just before returning from
next() might have been better.) The format of the crossings was as
follows: the SUBPIXEL_Y_POSITIONS least significant bits were reserved
for the y coordinate of the crossing within the current pixel row, the
next bit was for the orientation, and the remaining bits were for the
actual crossing value.
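
In code, that format would look roughly like this (the constants are
illustrative; with 8 subpixel rows per pixel row the y field needs 3 bits):

    // low bits: subpixel y within the pixel row; next bit: orientation;
    // remaining high bits: the actual crossing value.
    static final int Y_BITS = 3;
    static final int Y_MASK = (1 << Y_BITS) - 1;
    static final int OR_BIT = 1 << Y_BITS;

    static int packCrossing(int x, int suby, boolean upward) {
        return (x << (Y_BITS + 1)) | (upward ? OR_BIT : 0) | suby;
    }
    static int     xOf(int cr)      { return cr >> (Y_BITS + 1); }
    static int     subyOf(int cr)   { return cr & Y_MASK; }
    static boolean upwardOf(int cr) { return (cr & OR_BIT) != 0; }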

I'm not sure if this is what you had in mind, but I abandoned it because
the benchmarks showed a slowdown.

Regards,
Denis.

----- "Jim Graham" <james.gra...@oracle.com> wrote:

> Hi Denis,
> 
> Also, I've gotten another 20% improvement out of the design with a few
> more tweaks.  (Though I measured the 20% in the stripped down version
> that I'm prototyping with FX so I'm not sure how much of that 20% would
> show up through the layers of the 2D code.  Overall, I've about doubled
> the frame rates of the prototype since your first drop that you checked
> in to the OpenJDK repository.)
> 
>                       ...jim
> 
> On 11/4/2010 9:20 AM, Clemens Eisserer wrote:
> > Hi Denis,
> >
> >> It's not obvious to me why this happened, so I think now I will
> >> put this type of optimization aside and convert to JNI,
> >
> > I've only followed your discussion with Jim but skipped all the
> > in-depth discussion.
> > From my prior experience, JNI is usually not worth the trouble. If
> > you don't have a serious reason why using native code would have
> > advantages (like the possibility of using SIMD, or when value types
> > would be a huge benefit), it has its own performance pitfalls,
> > especially if the workload is small and things like Get*ArrayCritical
> > cause scalability problems because they have to lock the GC.
> >
> >> where profiling
> >> will be easier (for me - I haven't been able to make OProfile work
> >> for java yet).
> >
> > Have you had a look at the NetBeans profiler? It supports
> > sampling-based profiling to keep the influence of the profiler at a
> > minimum.
> >
> > Thanks for working on this!
> >
> > - Clemens
