On Thu, 13 Nov 2008 18:20:31 +0000 Chris Lord <[EMAIL PROTECTED]> babbled:


> This clearly isn't a fair test - it's apples to apples, but Clutter
> isn't an apple, it's more like a 10-course banquet. Also, comparing a
> dual-core 3ghz machine vs 8600GT probably isn't of the same order as,
> say, an Atom 1.6ghz vs. a GM965-based GPU either. You don't mention
> anything about RAM or video RAM either, which are two important factors,
> given that a lot of your tests will probably be restricted by how fast
> memory can be read/written.

clutter doesn't have a comparison you can make - not until you write a
properly optimised software rasteriser for it to compare against. evas has a
software engine as well as gl etc. etc. and a 3ghz core 2 desktop cpu vs an
nv8600gt desktop gpu is very fair - same ballpark in power consumption and
target market. right now atom is a very immature cpu.

> Also, Matthew's '100x' comment was clearly meant to just give the
> general impression 'much faster'. Even if it were just, say, 10x faster,
> that still means it can do something 10 times that you could do only
> once in software in the same time, and it can do it without lumbering
> the CPU too.

sure - it is faster - or can be, but you then pay a price by setting a hardware
bar below which the software just won't work at all (it is entirely unusable).
that is what some of the initial comments that started this thread were about -
people would like something that works on their non-gpu hardware. i don't
much care what intel pushes/uses - you make your own hardware, but if you want
moblin to be something more general that works across a wider range of hardware,
then you'll want to take the non-gpu world seriously. as i said - i have
clients who just this past week are pushing their way into a lower and lower
performance envelope where the only thing they have is a dumb framebuffer and a
cpu. they are cutting costs and silicon space. they want something that works on
their high-end gpu-carrying socs as well as their low end, and if they can get
bling on both they are most happy. but as i said - this is simply the world
that clutter doesn't care about. gpu or nothing. that's a design choice and is
fair enough. what is not fair is that you characterise software rendering as so
incredibly slow compared to "hardware" that it's unusable, which is entirely NOT
the case. empirically it is otherwise - numbers and actual usage show it.

> That aside, I'm a little suspicious of the numbers too;
> 
> > let me quote some "interesting results".
> > 
> > 1. alpha blending a whole bunch of non-scaled images (1:1) gl ONLY managed
> > to be 4x faster than software... not 100's of times.
> 
> I outright just don't believe this. Alpha-blending is inherently a slow
> operation and you'd definitely see a larger speed up doing this in
> hardware, unless your test is very limited, or you're taking some
> quality-affecting or constraining short-cuts.

incorrect. in fact the alpha blending runs at FSB speed. the software
engine is "SLI" (it splits rendering between as many cpus+cores as you have). i
have sat down and benchmarked this in gory detail. a dual-thread blender routine
does not beat a single-threaded one - i was surprised, until i did the math. it
was FSB limited - the cpu couldn't read/write memory any faster. believe it or
not, i've learnt to think about my routines and write them with some level of
optimisation in mind. i care about my cycle usage. the test is apples vs
apples. it's a 1:1 scaled image, so neither gpu nor cpu needs to do any scaling
calculations. on the gl side it's an ARGB texture mapped onto a quad and drawn
128 times per frame in various positions into the backbuffer, then swapped to
the front each frame. in software it's the exact same test with the exact same
pixel output - ARGB32 pixels (same image) alpha blended onto a temporary ARGB32
destination buffer and finally copied to the screen each frame. image size is
the same, image pixel positions are the same, colorspace is the same (both use
premultiplied ARGB32). both use non-alpha destination buffers and both target
an x window for output. fyi the alpha blender is mmx/sse and, as i mentioned
above, is limited by memory access speed :)
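
to make that concrete, the hot loop on the software side is conceptually
something like this (a minimal plain-c sketch of a premultiplied-ARGB32 blend;
the real evas blender is hand-written mmx/sse, and the names here are invented
for illustration):

#include <stdint.h>

/* blend one span of premultiplied ARGB32 source pixels onto an ARGB32
 * destination. with premultiplied alpha the blend per channel is just
 * dst = src + dst * (256 - src_alpha) / 256. note the memory traffic:
 * one read of src, one read + one write of dst per pixel - which is why
 * the loop ends up bound by the FSB, not the alu. */
static void
blend_span_premul(const uint32_t *src, uint32_t *dst, int len)
{
    for (int i = 0; i < len; i++) {
        uint32_t s = src[i];
        uint32_t a = 256 - (s >> 24);  /* inverse alpha: 1..256 */
        uint32_t d = dst[i];
        /* multiply two channels at a time: the RB pair and the AG pair */
        uint32_t rb = (((d & 0x00ff00ff) * a) >> 8) & 0x00ff00ff;
        uint32_t ag = (((d >> 8) & 0x00ff00ff) * a) & 0xff00ff00;
        dst[i] = s + (rb | ag);
    }
}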

> Could you elaborate on what this test actually does? Perhaps you only
> see a 4x speed up if you blit alpha-blended rectangles, one at a time,
> per rectangle... Try alpha-blending 20 reasonable-sized (say 256x256)
> textures, on a reasonable-sized (say 1024x768) buffer, with varying
> alpha over a background image. I'd be hard-pushed to believe, even with
> your skewed setup, that you'd only see a 4x speed-up.

images are 120x160 and i draw 128 of them per frame. the output window is
720x420, so it's reasonable. the images have varying alpha (it's the E logo
with a soft shadow, rendered in 3d with anti-aliasing).

it's not skewed. it's very much fair.
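
as a rough sanity check on the "FSB limited" claim (back-of-envelope
arithmetic, not a measured number): one 120x160 ARGB32 blit reads ~75KB of
source and reads+writes ~150KB of destination, so 128 blits touch roughly 28MB
per frame. at 60fps that is around 1.7GB/s, which is right in the ballpark of
real-world FSB-era memory bandwidth.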

> > 2. at BEST when doing "smooth scaling" (that means GL_LINEAR for gl vs full
> > super/sub sampling in software, which is much higher quality especially on
> > down-scale) gl manages at BEST to be 30x faster than software. different
> > algorithms here so software is at a major disadvantage due to its higher
> > quality.
> 
> Well, 30x is quite a lot, I'm not sure why you say it as if it isn't a
> big deal - if you're doing basic UI things, why do you even need
> something better than bi-linear filtering? Saying software is at a
> disadvantage because it's higher quality is silly. Either compare it
> with anisotropic filtering, or write a fast bilinear software filter. I
> think you'll find that the hardware does this a lot better.

it's not an apples to apples comparison. when downscaling, gl just samples
points - software literally reads the entire source region and computes with
it. example:

i scale an image that is 1000x1000 down to 10x10. for gl with GL_LINEAR, at
most it will choose 4 points per output pixel (there are only 100 output
pixels) and interpolate. software will sample a 100x100 region from the source,
compute a sum, average it and use that. the output quality is massively
different. it's very noticeable when scaling down icons and other detailed
images. the software scaler could be simplified and get a significant speedup
(several times) if such super-sampling were dropped. imagine i threw
anisotropic sampling at gl at, let's say, a level of 100 (which i haven't even
seen supported anywhere) - then you'd have something more comparable. gl gets a
speedup here due to a quality drop. i know the quality difference - it is
noticeable and i prefer the full sampling.
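
here is roughly what each side computes per output pixel (a simplified
single-channel sketch with invented names and integer scale factors - the real
evas scaler works in fixed point and handles fractional footprints):

#include <stdint.h>

/* full super-sampling (the software path): every output pixel averages
 * its ENTIRE source footprint. for 1000x1000 -> 10x10 that is a
 * 100x100 = 10000-texel read per output pixel. */
static uint8_t
box_sample(const uint8_t *src, int src_w,
           int ox, int oy, int fx, int fy)
{
    uint32_t sum = 0;
    for (int y = 0; y < fy; y++)
        for (int x = 0; x < fx; x++)
            sum += src[(oy * fy + y) * src_w + (ox * fx + x)];
    return (uint8_t)(sum / (fx * fy));
}

/* GL_LINEAR, by contrast, reads at most 4 texels per output pixel and
 * interpolates, no matter how large the footprint is - hence the speedup,
 * and hence the quality drop. */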

> > 3. apples vs apples... scaling an image with "GL_NEAREST" in software and
> > gl... gl is 5.8 times faster.
> 
> Yes, but scaling a 2d, axis-aligned texture using nearest-neighbour
> scaling looks AWFUL and you'd never do this. This test is pointless.

same as above with downscaling, but there you seem to not care about it being
awful. :) also, depending on your source data, this can be perfectly nice. in
some cases it's BETTER than linear/bi-linear etc. interpolation, as the results
are visually crisper (pixel art at integer scale factors, for example). i use
it in specific cases as a speedup when i know the image data will not suffer
from being scaled this way.
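
and nearest-neighbour is about the cheapest thing either renderer can do - one
read and one write per output pixel, no arithmetic on the pixel values at all
(again a simplified sketch with invented names; a real scaler precomputes the
column offsets instead of dividing per pixel):

#include <stdint.h>

/* nearest-neighbour scale: each output pixel copies exactly one source
 * texel. */
static void
scale_nearest(const uint32_t *src, int sw, int sh,
              uint32_t *dst, int dw, int dh)
{
    for (int y = 0; y < dh; y++) {
        const uint32_t *row = src + ((y * sh) / dh) * sw;
        for (int x = 0; x < dw; x++)
            dst[y * dw + x] = row[(x * sw) / dw];
    }
}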

> > 4. software is actually FASTER for alpha blending a whole bunch of
> > rectangles - 25% faster. gl uses just a quad. software is at an advantage -
> > it's smarter and can calculate deltas better
> 
> I can believe this, for your setup, but only because you have a very
> powerful CPU and probably have very fast memory. Do you still get a 25%
> faster result if you count the time it takes to copy the buffer from
> system memory to video memory though? Also, 25% faster than what, a
> microsecond? I'd contend that alpha blending rectangles isn't that
> common an operation, vs. say scaling, rotation, or alpha-blending a
> texture, and it happens so fast using either method, that quoting a 25%
> speed gain isn't actually as big a deal as it sounds.

yes, ALL the above tests INCLUDE display - they include getting pixels to the
window so they're visible, on both sides. gl has the major advantage here, as
its renderings are already in video ram, so display for it is a much faster
path than the copy from the system-ram ARGB buffer up to video ram. despite
this, software wins :)
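
for reference, the software engine's "display" step is essentially a
shared-memory image put to the x server, along these lines (a minimal sketch
using the MIT-SHM extension; creating and attaching the shared XImage with
XShmCreateImage/XShmAttach is assumed done elsewhere, and the names are
invented):

#include <X11/Xlib.h>
#include <X11/extensions/XShm.h>

/* push a software-rendered ARGB buffer (living in a shared-memory XImage)
 * to a window. this copy to the screen is the extra cost the software path
 * pays that gl does not, since gl already rendered into video ram. */
static void
present_buffer(Display *dpy, Window win, GC gc, XImage *img, int w, int h)
{
    XShmPutImage(dpy, win, gc, img, 0, 0, 0, 0, w, h, False);
    XSync(dpy, False); /* block so the timing includes the blit */
}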

> > 5. gl shows overhead on having to upload ARGB pixel textures - software
> > manages to do the argb data change test about 25% faster than gl (that
> > includes also drawing the image to the screen after upload of new data).
> 
> I'm not sure what this test is? If you mean it takes longer to upload
> ARGB data to video memory than it does to system memory, I'm surprised
> the result is only that much faster. I guess that's a good thing though,
> it's always good to find out that system->video memory bandwidth isn't
> as big an issue as you think it is :)

both tests involve writing new ARGB pixel data (simple fixed values in a loop),
which is of course done in software. the software engine avoids an "upload to
texture" step, as you write directly to the image data itself. the gl engine
needs to upload the texture data. then this data is rendered on both sides (gl
vs software) and displayed. so in fact in both cases data needs an upload - in
fact the SAME data. software has to upload its resultant rendered buffer; gl
has to upload the texture source data. it just happens at different pipeline
stages (one at the start, the other at the end). despite the fact that both
have to upload - in this case, the exact same amount of data - software wins.
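
the gl side's per-frame cost is the usual texture respecify, roughly (a sketch
with invented names; assumes desktop gl 1.2+ and an already-created texture of
matching size):

#include <GL/gl.h>

/* re-upload freshly written pixels into an existing texture. GL_BGRA +
 * GL_UNSIGNED_INT_8_8_8_8_REV is the in-memory layout matching 0xAARRGGBB
 * words, so the driver can take the data without swizzling. */
static void
upload_argb(GLuint tex, const void *pixels, int w, int h)
{
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
}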

> > i'm just saying this above as i believe that there are bogus facts being
> > floated about (efl starves your cpu - incorrect. it uses just as
> > little/much as clutter would.. if you use the gl engine or any other
> > accelerated back end - xrender (and the acceleration is working) etc.) and
> > the numbers on "gl is 100x faster" i would say is entirely bogus. i am
> > pitting a very very very fast gpu against a very fast laptop cpu. and the
> > cpu is doing a stellar job of keeping up... considering. and both engines
> > can be optimised - i know just what i could do to improve them, and i don't
> > see that your numbers will change so drastically if i did it on both
> > ends. :)
> 
> Yes, if you use EFL's GL engine you might get the same performance as
> Clutter, at the expense of an API that doesn't integrate as well with
> the rest of the Moblin libraries, and is arguably not as nice to use
> (but then, I would say this being a GTK and Clutter hacker, so take this
> with a bowl of salt).

you are right - and i'd never argue there. if you want to live in a gtk and
gnome world, clutter integrates nicely. it's "g-compliant". in our world we
have simply created the technology we need when comparable technology just
doesn't exist, is inadequate, or comes with baggage we don't want. we have
recycled where appropriate and useful. yes, evas is more limited in
functionality - though as with all things, that is not static. it has served us
quite well so far and will expand - don't worry. it's not a fixed "will forever
do only what it currently does" thing.

don't get me wrong. i am not bagging clutter, nor do i want to change any
decision to use clutter in moblin. i just want to correct false and misleading
statements of "fact" that, for example, say "efl cpu-starves you" when that is
not true (if you use the gl engine). if clutter is on non-gpu hardware, clutter
will not just cpu-starve you compared to efl - it will be like watching
continental drift. so you need to be factual and accurate. the above tests are
apples vs apples. i am not trying to make gl look bad! i use it for my media
center, for example - i get slightly better framerates for hdtv (1920x1080)
video than with software :) i know what it can do and how it does it. but the
speedup numbers compared to well engineered software are not so great. they are
better, but you are not talking such massive factors.

of course there is a major shift happening in the gpu world these days - just
look at larrabee. that's not a gpu! it's a multi-core cpu with some extra
gpu-helpful instructions. something like evas's software engine would actually
run nicely on it, as it's already multi-core (as above - if it saw 16 or 32
cores, it'd split the rendering between them just like your gpu splits your
texel calculation/output between its multiple pipelines). inside, the future of
gpus is software - it just has more direct access to pixel input and output
data, as well as a lot of cores.
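
the "split the rendering between cores" part is plain band parallelism -
conceptually something like this (a minimal pthreads sketch with invented
names, not the actual evas thread pool):

#include <pthread.h>
#include <stdint.h>

struct band { uint32_t *dst; int w, y0, y1; };

/* each thread rasterises only scanlines [y0, y1) of the frame, so the
 * threads never touch the same memory and need no locking. */
static void *
render_band(void *arg)
{
    struct band *b = arg;
    for (int y = b->y0; y < b->y1; y++)
        ; /* draw_scanline(b->dst + y * b->w, y); - hypothetical renderer */
    return NULL;
}

static void
render_frame(uint32_t *dst, int w, int h, int ncores)
{
    pthread_t tid[ncores];
    struct band bands[ncores];
    for (int i = 0; i < ncores; i++) {
        bands[i] = (struct band){ dst, w, (h * i) / ncores,
                                  (h * (i + 1)) / ncores };
        pthread_create(&tid[i], NULL, render_band, &bands[i]);
    }
    for (int i = 0; i < ncores; i++)
        pthread_join(tid[i], NULL);
}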

-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    [EMAIL PROTECTED]
