Re: Opportunity for speedup
On Wed, Mar 11, 2009 at 4:13 PM, Daniel Drake wrote:
> 2009/3/1 Bobby Powers :
> > I can't seem to get ul-warning to come up properly, so if anyone can
> > tell me what I'm doing wrong that would be great. I've got it to work
> > by manually placing some symlinks in /etc/rc0.d and /etc/rc6.d, but
> > neither Scott's nor my chkconfig comments seem to work.
>
> Here's a fixed ul-warning initscript.

Thanks, the fix is pushed.

___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel
Re: Opportunity for speedup
2009/3/1 Bobby Powers :
> I can't seem to get ul-warning to come up properly, so if anyone can
> tell me what I'm doing wrong that would be great. I've got it to work
> by manually placing some symlinks in /etc/rc0.d and /etc/rc6.d, but
> neither Scott's nor my chkconfig comments seem to work.

Here's a fixed ul-warning initscript.

[Attachment: ul-warning (binary data)]
Re: Opportunity for speedup
Hi Bobby, On 1 Mar 2009, at 21:44, Bobby Powers wrote: > I've fixed a few issues, packaged up bootanim-2.3-1, and (finally) > actually ran some benchmarks. Results (all times in seconds): > > fresh os801, from pressing the power button to appearance of sugar's > prompt for name screen > 80 > 79 > 78 > > with rhgb-client renamed so that init can't find it: > 69 > 68 > > and with bootanim-2.(1-3) rpm installed: > 67 > 67 > 67 > 68 > 67 > > If anyone is unconvinced, I could run more tests, but this seems > pretty good to me. Its a 15% overall speedup in the boot process. I've just run a test here with candidate 801; average over 5 runs; starting on button press, stopping when XO first appears in users colours: Before bootanim-2.3-1.i386.rpm: 85.9 seconds After patching: 74.6 seconds Booting in ugly text mode (includes the 3 sec ok wait): 72.2 seconds So, if this 10 sec boot saving gets accepted in a future build, you've just gained the world 1,400 extra hours of XO usage from the time this patch lands, and for every day thereafter (assumes a conservative 500K kids boot their XO just once a day on average). Fantastic work, what an impressive butterfly effect!! :-) --Gary > Interesting notes: > chkconfig doesn't like binary services - it parses services in > /etc/init.d to look for metadata in comments, and the mechanism to > override this data (sticking a file with the same name in > /etc/chkconfig.d with appropriate comments) doesn't seem to work if > the original script can't be parsed. So I had to make small wrappers > for ul-warning, boot-anim-start and boot-anim-stop. This doesn't seem > to affect performance. > > I can't seem to get ul-warning to come up properly, so if anyone can > tell me what I'm doing wrong that would be great. I've got it to work > by manually placing some symlinks in /etc/rc0.d and /etc/rc6.d, but > neither Scott's nor my chkconfig comments seem to work. 
> > source: > http://dev.laptop.org/git?p=users/bobbyp/bootanim > koji-built rpms: > http://dev.laptop.org/~bobbyp/bootanim/ > (koji task https://koji.fedoraproject.org/koji/taskinfo? > taskID=1211738 ) > > I don't know if this could make it into 8.2.1, or what the process > would be toward getting it at least in the Rawhide/SOAS images, but it > seems pretty low risk (assuming someone can tell me what I'm doing > wrong w.r.t. ul-warning). > > yours, > Bobby > > On Thu, Feb 19, 2009 at 3:03 AM, Mitch Bradley wrote: >> Cool! >> >> Bobby Powers wrote: >>> >>> On Wed, Feb 11, 2009 at 2:01 AM, Mitch Bradley >>> wrote: >>> I just measured the time taken by the boot animation by the simple technique of renaming /usr/bin/rhgb-client so the initscripts can't find it. >>> >>> how did you measure exactly? stopwatch? I'd like to recreate the >>> tests. It sounds like you did this on a freshly flashed system? >>> >> >> Yes on both counts. Stopwatch on freshly-flashed os7.img . >> >> >>> With boot animation, OS build 7 (an older 8.2.1 candidate) takes 60 seconds from first dot (indicating OFW transfer to Linux) to Sugar "prompt for your name". Without it, 53 seconds. I repeated the test several times with consistent results. Clearly, it should be possible to display that amount of information in much less than 7 seconds. The boot animation code is in the OLPC domain, not the upstream domain, so replacing it should be relatively free of upstream politics. So if anybody is interested in implementing a relatively simple boot-time speedup, I offer this as low-hanging fruit. I suggest 1 second (differential time between animation and no- animation cases) as a reasonable target goal, assuming images of the complexity of the current ones. Arbitrary full-screen graphics might require more time, but speeding up the baseline case is a good starting point. Go wild. 
>>> >>> So I've taken a first cut at this, implemented with the following >>> design considerations (mostly from a conversation with Mitch) >>> - the Python client/server was reimplemented as several standalone C >>> programs (boot-anim-start, boot-anim-client, and some cleanup in >>> boot-anim-stop) >>> - a client and server was used before because there is state >>> information that needs to be saved: we need to keep track of where >>> in >>> the animation we are. We can keep track of this by using offscreen >>> memory in the framebuffer (its 16MB in size, and only the first 2ish >>> MB is used for the onscreen graphics (my terminology might be off >>> here)). For state we really only need to keep track of 2 integers, >>> one for the current frame number and another to store the offset of >>> the next diff to apply. >>> - on startup we load an initial image into the framebuffer (the >>> first >>> 1200*900*2 bytes, since we use 2 bytes per pixel
Re: Opportunity for speedup
On Sun, Mar 01, 2009 at 04:44:01PM -0500, Bobby Powers wrote:
> I can't seem to get ul-warning to come up properly, so if anyone can
> tell me what I'm doing wrong that would be great.

What actually goes wrong? Is ul-warning executed?

--
James Cameron
mailto:qu...@us.netrek.org
http://quozl.netrek.org/
Re: Opportunity for speedup
On 1 Mar 2009, at 21:44, Bobby Powers wrote: > I've fixed a few issues, packaged up bootanim-2.3-1, and (finally) > actually ran some benchmarks. Results (all times in seconds): > > fresh os801, from pressing the power button to appearance of sugar's > prompt for name screen > 80 > 79 > 78 > > with rhgb-client renamed so that init can't find it: > 69 > 68 > > and with bootanim-2.(1-3) rpm installed: > 67 > 67 > 67 > 68 > 67 > > If anyone is unconvinced, I could run more tests, but this seems > pretty good to me. Its a 15% overall speedup in the boot process. Hey Bobby, that sounds great, many thanks for putting the effort in! I'll try your rpm on one of the XOs here and ping back with some additional measurements. Regards, --Gary > Interesting notes: > chkconfig doesn't like binary services - it parses services in > /etc/init.d to look for metadata in comments, and the mechanism to > override this data (sticking a file with the same name in > /etc/chkconfig.d with appropriate comments) doesn't seem to work if > the original script can't be parsed. So I had to make small wrappers > for ul-warning, boot-anim-start and boot-anim-stop. This doesn't seem > to affect performance. > > I can't seem to get ul-warning to come up properly, so if anyone can > tell me what I'm doing wrong that would be great. I've got it to work > by manually placing some symlinks in /etc/rc0.d and /etc/rc6.d, but > neither Scott's nor my chkconfig comments seem to work. > > source: > http://dev.laptop.org/git?p=users/bobbyp/bootanim > koji-built rpms: > http://dev.laptop.org/~bobbyp/bootanim/ > (koji task https://koji.fedoraproject.org/koji/taskinfo? > taskID=1211738 ) > > I don't know if this could make it into 8.2.1, or what the process > would be toward getting it at least in the Rawhide/SOAS images, but it > seems pretty low risk (assuming someone can tell me what I'm doing > wrong w.r.t. ul-warning). > > yours, > Bobby > > On Thu, Feb 19, 2009 at 3:03 AM, Mitch Bradley wrote: >> Cool! 
>> >> Bobby Powers wrote: >>> >>> On Wed, Feb 11, 2009 at 2:01 AM, Mitch Bradley >>> wrote: >>> I just measured the time taken by the boot animation by the simple technique of renaming /usr/bin/rhgb-client so the initscripts can't find it. >>> >>> how did you measure exactly? stopwatch? I'd like to recreate the >>> tests. It sounds like you did this on a freshly flashed system? >>> >> >> Yes on both counts. Stopwatch on freshly-flashed os7.img . >> >> >>> With boot animation, OS build 7 (an older 8.2.1 candidate) takes 60 seconds from first dot (indicating OFW transfer to Linux) to Sugar "prompt for your name". Without it, 53 seconds. I repeated the test several times with consistent results. Clearly, it should be possible to display that amount of information in much less than 7 seconds. The boot animation code is in the OLPC domain, not the upstream domain, so replacing it should be relatively free of upstream politics. So if anybody is interested in implementing a relatively simple boot-time speedup, I offer this as low-hanging fruit. I suggest 1 second (differential time between animation and no- animation cases) as a reasonable target goal, assuming images of the complexity of the current ones. Arbitrary full-screen graphics might require more time, but speeding up the baseline case is a good starting point. Go wild. >>> >>> So I've taken a first cut at this, implemented with the following >>> design considerations (mostly from a conversation with Mitch) >>> - the Python client/server was reimplemented as several standalone C >>> programs (boot-anim-start, boot-anim-client, and some cleanup in >>> boot-anim-stop) >>> - a client and server was used before because there is state >>> information that needs to be saved: we need to keep track of where >>> in >>> the animation we are. 
We can keep track of this by using offscreen >>> memory in the framebuffer (its 16MB in size, and only the first 2ish >>> MB is used for the onscreen graphics (my terminology might be off >>> here)). For state we really only need to keep track of 2 integers, >>> one for the current frame number and another to store the offset of >>> the next diff to apply. >>> - on startup we load an initial image into the framebuffer (the >>> first >>> 1200*900*2 bytes, since we use 2 bytes per pixel for color >>> information), and then load in a series of changes to the >>> framebuffer >>> image (<300KB). This takes the form of a series of diffs >>> - for each update (a valid call to boot-anim-client) we apply the >>> next >>> diff in the series to the onscreen image and update our state >>> information >>> - after applying the last diff we have (the end in the animation >>> series), freeze the DCON (when I first attempted to freeze the DCON >>> when z-boot-anim-stop was
Re: Opportunity for speedup
I've fixed a few issues, packaged up bootanim-2.3-1, and (finally)
actually ran some benchmarks. Results (all times in seconds):

fresh os801, from pressing the power button to appearance of Sugar's
"prompt for name" screen:
80
79
78

with rhgb-client renamed so that init can't find it:
69
68

and with bootanim-2.(1-3) rpm installed:
67
67
67
68
67

If anyone is unconvinced, I could run more tests, but this seems
pretty good to me. It's a 15% overall speedup in the boot process.

Interesting notes:
chkconfig doesn't like binary services - it parses services in
/etc/init.d to look for metadata in comments, and the mechanism to
override this data (sticking a file with the same name in
/etc/chkconfig.d with appropriate comments) doesn't seem to work if
the original script can't be parsed. So I had to make small wrappers
for ul-warning, boot-anim-start and boot-anim-stop. This doesn't seem
to affect performance.

I can't seem to get ul-warning to come up properly, so if anyone can
tell me what I'm doing wrong that would be great. I've got it to work
by manually placing some symlinks in /etc/rc0.d and /etc/rc6.d, but
neither Scott's nor my chkconfig comments seem to work.

source:
http://dev.laptop.org/git?p=users/bobbyp/bootanim
koji-built rpms:
http://dev.laptop.org/~bobbyp/bootanim/
(koji task https://koji.fedoraproject.org/koji/taskinfo?taskID=1211738 )

I don't know if this could make it into 8.2.1, or what the process
would be toward getting it at least in the Rawhide/SOAS images, but it
seems pretty low risk (assuming someone can tell me what I'm doing
wrong w.r.t. ul-warning).

yours,
Bobby

On Thu, Feb 19, 2009 at 3:03 AM, Mitch Bradley wrote:
> Cool!
>
> Bobby Powers wrote:
>>
>> On Wed, Feb 11, 2009 at 2:01 AM, Mitch Bradley wrote:
>>
>>> I just measured the time taken by the boot animation by the simple
>>> technique of renaming /usr/bin/rhgb-client so the initscripts can't find
>>> it.
>>
>> How did you measure exactly? Stopwatch? I'd like to recreate the
>> tests. It sounds like you did this on a freshly flashed system?
>
> Yes on both counts. Stopwatch on freshly-flashed os7.img .
>
>>> With boot animation, OS build 7 (an older 8.2.1 candidate) takes 60
>>> seconds from first dot (indicating OFW transfer to Linux) to Sugar
>>> "prompt for your name". Without it, 53 seconds. I repeated the test
>>> several times with consistent results.
>>>
>>> Clearly, it should be possible to display that amount of information in
>>> much less than 7 seconds.
>>>
>>> The boot animation code is in the OLPC domain, not the upstream domain,
>>> so replacing it should be relatively free of upstream politics.
>>>
>>> So if anybody is interested in implementing a relatively simple
>>> boot-time speedup, I offer this as low-hanging fruit.
>>>
>>> I suggest 1 second (differential time between animation and no-animation
>>> cases) as a reasonable target goal, assuming images of the complexity of
>>> the current ones. Arbitrary full-screen graphics might require more
>>> time, but speeding up the baseline case is a good starting point.
>>>
>>> Go wild.
>>
>> So I've taken a first cut at this, implemented with the following
>> design considerations (mostly from a conversation with Mitch):
>> - the Python client/server was reimplemented as several standalone C
>> programs (boot-anim-start, boot-anim-client, and some cleanup in
>> boot-anim-stop)
>> - a client and server were used before because there is state
>> information that needs to be saved: we need to keep track of where in
>> the animation we are. We can keep track of this by using offscreen
>> memory in the framebuffer (it's 16MB in size, and only the first 2ish
>> MB is used for the onscreen graphics (my terminology might be off
>> here)). For state we really only need to keep track of 2 integers,
>> one for the current frame number and another to store the offset of
>> the next diff to apply.
>> - on startup we load an initial image into the framebuffer (the first
>> 1200*900*2 bytes, since we use 2 bytes per pixel for color
>> information), and then load in a series of changes to the framebuffer
>> image (<300KB). This takes the form of a series of diffs
>> - for each update (a valid call to boot-anim-client) we apply the next
>> diff in the series to the onscreen image and update our state
>> information
>> - after applying the last diff we have (the end of the animation
>> series), freeze the DCON (when I first attempted to freeze the DCON
>> when z-boot-anim-stop was called it left the screen in an inconsistent
>> state, I believe because of X startup)
>> - it's designed to be as light as possible, using syscalls instead of
>> libc functions as much as possible (the only thing we use libc for is
>> string comparison, which could be replaced with a local function).
>> While it's written like this, I haven't worked on cutting down the
>> linking (I need some guidance for that)
>
> To reduce the execution footprint, you could tr
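The 15% figure in Bobby's benchmark results at the top of this message can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper names `mean` and `percent_saved` are made up here and are not part of bootanim. With the stopwatch times quoted in the message, the stock runs average 79 s and the bootanim runs 67.2 s, a saving of about 14.9%; renaming rhgb-client averages 68.5 s, about 13.3%.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples (hypothetical helper, for illustration). */
double mean(const double *v, size_t n)
{
    double sum = 0.0;
    assert(v != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Percentage of the original boot time saved by going from
   `before` seconds to `after` seconds. */
double percent_saved(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

Calling `percent_saved(mean(stock, 3), mean(patched, 5))` with the two sets of timings from the message gives the figure quoted above. With only two to five stopwatch samples per configuration, the result is indicative rather than precise.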
Re: Opportunity for speedup
da...@lang.hm wrote:
>
> d) compile the delta set into the client program.

That works, but

1) It requires more work from the VM system on each invocation of the
   client program, which is now 1.x MB instead of 4K.

2) If a deployment wants to change the image set, it needs a compiler
   toolchain instead of a (small) delta-encoding program.

Speed-wise, (d) might be a wash, or perhaps even a slight win. It
depends on how efficient the VM system is, and the effectiveness of the
filesystem buffer cache at preventing re-reads of the client process
image (paging directly from JFFS2 is not possible).

The framebuffer hack avoids numerous assumptions about the effectiveness
of clever but complex subsystems (e.g. the VM system, the filesystem
buffer cache, the shared library mechanisms, zlib, JFFS2 compression, ...).
Re: Opportunity for speedup
On Thu, 19 Feb 2009, Mitch Bradley wrote:
> da...@lang.hm wrote:
>>
>> right, but why read the current framebuffer? you don't touch most of it,
>> you aren't going to do anything different based on what's there (you are
>> just going to overlay your new info there) so all you really need to do is
>> to write the parts that need to change.
>
> You don't read the on-screen part of the framebuffer. You copy delta data
> from off-screen framebuffer memory to portions of the on-screen framebuffer
> memory.
>
> On-screen vs. off-screen is irrelevant to the speed - read access to the
> memory that is reserved for display controller use is similarly "slow" in
> both cases. But considering that the delta data is small compared to the
> full images, it's worth it to store the deltas there, thus avoiding the
> overhead of the other alternatives for maintaining the context from one call
> to the next.
>
> Those alternatives are:
>
> a) Server process maintains context on behalf of repeatedly-executed client
> process. This incurs the complexity of client-server architectures -
> setup/teardown, library overhead, interprocess communication, scheduling.
>
> b) Client program reads new delta data from a file on each invocation. This
> incurs the filesystem overhead of opening a file on each invocation (in
> comparison, the off-screen framebuffer solution requires only a single open()
> and a single read() on the first invocation).
>
> c) Client program reads delta set into a shared memory segment and then
> reattaches to that segment on subsequent invocations. This is similar to the
> framebuffer approach except that it uses faster memory for the persistent
> storage. It might be a win from a speed perspective, but it is a bit more
> complex, requiring the program to deal with two memory objects instead of
> just one. The total amount of time that it could possibly save is about 50
> ms, since that is the time it takes to read the delta set from the off-screen
> framebuffer. And if we use the RLE encoding suggested by Wade, the amount of
> off-screen data is halved, so the best-case savings are reduced to 25 ms
> total.

d) Compile the delta set into the client program.

Does this really need to be a general-purpose solution here? Or is this
really only used for this specific purpose?

David Lang
Re: Opportunity for speedup
da...@lang.hm wrote:
>
> right, but why read the current framebuffer? you don't touch most of
> it, you aren't going to do anything different based on what's there
> (you are just going to overlay your new info there) so all you really
> need to do is to write the parts that need to change.

You don't read the on-screen part of the framebuffer. You copy delta
data from off-screen framebuffer memory to portions of the on-screen
framebuffer memory.

On-screen vs. off-screen is irrelevant to the speed - read access to the
memory that is reserved for display controller use is similarly "slow"
in both cases. But considering that the delta data is small compared to
the full images, it's worth it to store the deltas there, thus avoiding
the overhead of the other alternatives for maintaining the context from
one call to the next.

Those alternatives are:

a) Server process maintains context on behalf of repeatedly-executed
client process. This incurs the complexity of client-server
architectures - setup/teardown, library overhead, interprocess
communication, scheduling.

b) Client program reads new delta data from a file on each invocation.
This incurs the filesystem overhead of opening a file on each invocation
(in comparison, the off-screen framebuffer solution requires only a
single open() and a single read() on the first invocation).

c) Client program reads delta set into a shared memory segment and then
reattaches to that segment on subsequent invocations. This is similar
to the framebuffer approach except that it uses faster memory for the
persistent storage. It might be a win from a speed perspective, but it
is a bit more complex, requiring the program to deal with two memory
objects instead of just one. The total amount of time that it could
possibly save is about 50 ms, since that is the time it takes to read
the delta set from the off-screen framebuffer. And if we use the RLE
encoding suggested by Wade, the amount of off-screen data is halved, so
the best-case savings are reduced to 25 ms total.
Re: Opportunity for speedup
On Thu, 19 Feb 2009, Mitch Bradley wrote:
> da...@lang.hm wrote:
>>
>> if you have the diff of the images, do you need to read from the
>> framebuffer at all? since you know what you put there, and know what you
>> want to change, can't you just write your changed information to the right
>> place?
>
> The framebuffer in this case is serving as persistent shared memory, thus
> avoiding the extra complexity of a client/server architecture to maintain the
> sequencing state.
>
> The extremely-tiny (4K - 1 memory page) client program initially reads the
> first frame into the on-screen framebuf and the delta set into off-screen
> framebuffer memory. On subsequent invocations, the client copies another
> delta into the on-screen framebuf.
>
> If it is statically linked and uses only direct syscalls, the exec() overhead
> is minimal - no shell process instantiation, no script startup, no ld.so
> invocations, no mapping in shared libraries, no relocation.

Right, but why read the current framebuffer? You don't touch most of it,
and you aren't going to do anything different based on what's there (you
are just going to overlay your new info there), so all you really need to
do is to write the parts that need to change.

David Lang
Re: Opportunity for speedup
da...@lang.hm wrote:
>
> if you have the diff of the images, do you need to read from the
> framebuffer at all? since you know what you put there, and know what
> you want to change, can't you just write your changed information to
> the right place?

The framebuffer in this case is serving as persistent shared memory,
thus avoiding the extra complexity of a client/server architecture to
maintain the sequencing state.

The extremely-tiny (4K - 1 memory page) client program initially reads
the first frame into the on-screen framebuf and the delta set into
off-screen framebuffer memory. On subsequent invocations, the client
copies another delta into the on-screen framebuf.

If it is statically linked and uses only direct syscalls, the exec()
overhead is minimal - no shell process instantiation, no script startup,
no ld.so invocations, no mapping in shared libraries, no relocation.
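The "framebuffer as persistent shared memory" idea can be sketched roughly as follows. This is not the bootanim source: the record layout, the `struct anim_state` and `struct fb_layout` types, and the function `anim_step` are all assumptions made for illustration, and the real D565 format may look quite different. The sketch assumes the framebuffer is already mapped as an array of 16-bit pixels (on the XO that would presumably be an mmap of the framebuffer device), that two integers of state live at a fixed off-screen offset, and that each frame's diff is a list of hypothetical records `dest_lo, dest_hi, count, pixels...` ended by a record with `count == 0`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical state record kept in off-screen framebuffer memory. */
struct anim_state {
    uint32_t frame;     /* number of diffs applied so far */
    uint32_t next_off;  /* word offset of the next diff to apply */
};

/* Hypothetical layout of the mapped framebuffer, in 16-bit words. */
struct fb_layout {
    size_t onscreen;    /* visible pixels, e.g. 1200 * 900 */
    size_t state_at;    /* where struct anim_state lives (off-screen) */
    size_t delta_end;   /* one past the last word of the delta set */
};

/* Apply the next frame's diff. Returns 1 if a frame was applied,
   0 if the animation is finished, -1 if the delta data is malformed. */
int anim_step(uint16_t *fb, const struct fb_layout *l)
{
    struct anim_state st;
    size_t p;

    assert(fb != NULL && l != NULL);
    memcpy(&st, fb + l->state_at, sizeof st);   /* state survives between runs */
    if (st.next_off >= l->delta_end)
        return 0;

    p = st.next_off;
    for (;;) {
        size_t dest, count;
        if (p + 3 > l->delta_end)
            return -1;
        dest = fb[p] | ((size_t)fb[p + 1] << 16);
        count = fb[p + 2];
        p += 3;
        if (count == 0)                         /* end-of-frame marker */
            break;
        if (dest + count > l->onscreen || p + count > l->delta_end)
            return -1;
        /* off-screen delta pixels -> on-screen pixels */
        memmove(fb + dest, fb + p, count * sizeof *fb);
        p += count;
    }

    st.frame++;
    st.next_off = (uint32_t)p;
    memcpy(fb + l->state_at, &st, sizeof st);
    return 1;
}
```

Each invocation of a client built this way would map the framebuffer, call `anim_step` once, and exit; no server process or extra file is needed to remember where the animation is, which is the point Mitch makes above.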
Re: Opportunity for speedup
On Thu, 19 Feb 2009, Bobby Powers wrote:
> On Thu, Feb 19, 2009 at 1:22 PM, C. Scott Ananian wrote:
>> I'd suggest just uncompressing the various image files and re-timing
>> as a start. The initial implementation was uncompressed, but people
>> complained about space usage on the emulator images (which are
>> uncompressed). The current code supports both uncompressed and
>> compressed image formats. For uncompressed images, putting the bits
>> on the screen is an mmap and memcpy, so I can't imagine any
>> implementation being faster than that (it's possible, of course, that
>> what's stealing CPU is the shell's invocation of the client program;
>> recoding just that little part in C should be trivial, since it does
>> nothing but write to a socket IIRC.)
>>
>> Anyway, further benchmarking of the current implementation is probably
>> worthwhile before a complete reimplementation is called for. But if
>> you want to reimplement it from scratch, go nuts.
>> --scott
>
> I already re-implemented it - it was a fun optimization project and
> introduction to lower level systems programming. Using Mitch's D565
> format to keep track of only the parts of the image that change cut
> down the implementation size significantly. It's now only 2
> uncompressed images (frame00.565 and ul-warning.565), and <300KB of
> differences for the animation sequence. I understand reads from video
> memory (which I think is what the framebuffer is?) can be extremely
> slow, so it could turn out faster to open a D565 file, mmap it and
> memcpy the several tens of kilobytes of differences to the framebuffer
> than it is to read those differences from one part of video memory to
> another.
>
> This is where benchmarking should give some clearer answers.

If you have the diff of the images, do you need to read from the
framebuffer at all? Since you know what you put there, and know what you
want to change, can't you just write your changed information to the
right place?

David Lang
Re: Opportunity for speedup
Bobby Powers wrote:
> On Thu, Feb 19, 2009 at 1:56 PM, Wade Brainerd wrote:
>
>> RLE (run length encoding) compresses sequences of identical pixels ("runs")
>> as count/value pairs.
>> So abbccc would be stored as 1a 2b 3c.
>> The decompressor looks like:
>>
>>     while (cur < end)
>>     {
>>         unsigned short count = *cur++;
>>         unsigned short value = *cur++;
>>         while (count--)
>>             *dest++ = value;
>>     }
>>
>> This can be faster than memcpy because you are reading significantly less
>> memory than you would with memcpy, thus fewer cache misses are incurred.
>> Because the startup images are mostly spans of solid colors, this kind of
>> compression works very well. If that were not the case, say if there were a
>> left-to-right gradient in the background, RLE would probably make things
>> worse, thus you have to be careful when choosing it.
>> But the smaller size on disk and in memory would probably improve
>> performance in other ways as well.
>> Best,
>> Wade
>
> thanks, that makes sense

We are already getting some portion of the possible compression by doing
the "iframe style" delta encoding of the second and subsequent frames,
but RLE is still of some use. It does a good job of shrinking the first
frame, and it halves the size of the delta wad. The first-frame shrink
could also be accomplished by the trick of assuming an initial solid
background and representing the first frame as a delta from that.

In either case, it looks like RLE decoding might be a nice addition, as
it reduces the size of the frames on disk from 1.2 MB to about 140 KB.
Re: Opportunity for speedup
On Thu, Feb 19, 2009 at 1:56 PM, Wade Brainerd wrote:
> RLE (run length encoding) compresses sequences of identical pixels ("runs")
> as count/value pairs.
> So abbccc would be stored as 1a 2b 3c.
> The decompressor looks like:
>
>     while (cur < end)
>     {
>         unsigned short count = *cur++;
>         unsigned short value = *cur++;
>         while (count--)
>             *dest++ = value;
>     }
>
> This can be faster than memcpy because you are reading significantly less
> memory than you would with memcpy, thus fewer cache misses are incurred.
> Because the startup images are mostly spans of solid colors, this kind of
> compression works very well. If that were not the case, say if there were a
> left-to-right gradient in the background, RLE would probably make things
> worse, thus you have to be careful when choosing it.
> But the smaller size on disk and in memory would probably improve
> performance in other ways as well.
> Best,
> Wade

thanks, that makes sense
Re: Opportunity for speedup
Oh, and you can feed one of the 565 files through my 'rle.c' program to
see the compression ratio firsthand.

On Thu, Feb 19, 2009 at 1:56 PM, Wade Brainerd wrote:
> RLE (run length encoding) compresses sequences of identical pixels ("runs")
> as count/value pairs.
> So abbccc would be stored as 1a 2b 3c.
>
> The decompressor looks like:
>
>     while (cur < end)
>     {
>         unsigned short count = *cur++;
>         unsigned short value = *cur++;
>         while (count--)
>             *dest++ = value;
>     }
>
> This can be faster than memcpy because you are reading significantly less
> memory than you would with memcpy, thus fewer cache misses are incurred.
>
> Because the startup images are mostly spans of solid colors, this kind of
> compression works very well. If that were not the case, say if there were a
> left-to-right gradient in the background, RLE would probably make things
> worse, thus you have to be careful when choosing it.
>
> But the smaller size on disk and in memory would probably improve
> performance in other ways as well.
>
> Best,
> Wade
>
> On Thu, Feb 19, 2009 at 1:49 PM, Bobby Powers wrote:
>
>> 2009/2/19 Wade Brainerd :
>> > On Thu, Feb 19, 2009 at 1:22 PM, C. Scott Ananian wrote:
>> >>
>> >> I'd suggest just uncompressing the various image files and re-timing
>> >> as a start. The initial implementation was uncompressed, but people
>> >> complained about space usage on the emulator images (which are
>> >> uncompressed). The current code supports both uncompressed and
>> >> compressed image formats. For uncompressed images, putting the bits
>> >> on the screen is an mmap and memcpy, so I can't imagine any
>> >> implementation being faster than that (it's possible, of course, that
>> >> what's stealing CPU is the shell's invocation of the client program;
>> >> recoding just that little part in C should be trivial, since it does
>> >> nothing but write to a socket IIRC.)
>> >
>> > I implemented an RLE compressor specifically for these 16bit image files
>> > the last time this question came up. This can certainly be faster than
>> > memcpy since we are talking memory performance.
>>
>> Can you explain this? I don't think I have enough knowledge to
>> evaluate your claim.
>>
>> bobby
Re: Opportunity for speedup
RLE (run length encoding) compresses sequences of identical pixels ("runs") as count/value pairs. So abbccc would be stored as 1a 2b 3c.

The decompressor looks like:

    while (cur < end) {
        unsigned short count = *cur++;
        unsigned short value = *cur++;
        while (count--)
            *dest++ = value;
    }

This can be faster than memcpy because you are reading significantly less memory than you would with memcpy, thus fewer cache misses are incurred.

Because the startup images are mostly spans of solid colors, this kind of compression works very well. If that were not the case, say if there were a left-to-right gradient in the background, RLE would probably make things worse, thus you have to be careful when choosing it. But the smaller size on disk and in memory would probably improve performance in other ways as well.

Best,
Wade

On Thu, Feb 19, 2009 at 1:49 PM, Bobby Powers wrote:
> 2009/2/19 Wade Brainerd:
> > On Thu, Feb 19, 2009 at 1:22 PM, C. Scott Ananian wrote:
> >>
> >> I'd suggest just uncompressing the various image files and re-timing
> >> as a start. The initial implementation was uncompressed, but people
> >> complained about space usage on the emulator images (which are
> >> uncompressed). The current code supports both uncompressed and
> >> compressed image formats. For uncompressed images, putting the bits
> >> on the screen is an mmap and memcpy, so I can't imagine any
> >> implementation being faster than that (it's possible, of course, that
> >> what's stealing CPU is the shell's invocation of the client program;
> >> recoding just that little part in C should be trivial, since it does
> >> nothing but write to a socket IIRC.)
> >
> > I implemented an RLE compressor specifically for these 16-bit image files
> > the last time this question came up. This can certainly be faster than
> > memcpy since we are talking memory performance.
>
> Can you explain this? I don't think I have enough knowledge to
> evaluate your claim.
>
> bobby
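For anyone who wants to experiment with the idea, here is a minimal sketch of an encoder and a matching decoder for 16-bit pixels, built around the same decompressor loop. It is not the code from rle.c or unrle.c: the function names rle_encode and rle_decode, the count-then-value word order, and the cap of 65535 pixels per run are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode n 16-bit pixels as (count, value) pairs.
 * out must have room for 2 * n words (worst case: no two neighbours equal).
 * Returns the number of 16-bit words written. */
size_t rle_encode(const uint16_t *src, size_t n, uint16_t *out)
{
    size_t written = 0;
    size_t i = 0;

    while (i < n) {
        uint16_t value = src[i];
        uint16_t count = 1;

        /* Extend the run, capping it so the count fits in 16 bits. */
        while (i + count < n && src[i + count] == value && count < UINT16_MAX)
            count++;

        out[written++] = count;
        out[written++] = value;
        i += count;
    }
    return written;
}

/* Decode (count, value) pairs produced by rle_encode.
 * words must be even; dest must have room for the decoded pixels.
 * Returns the number of pixels written. */
size_t rle_decode(const uint16_t *cur, size_t words, uint16_t *dest)
{
    const uint16_t *end = cur + words;
    uint16_t *start = dest;

    assert(words % 2 == 0);
    while (cur < end) {
        uint16_t count = *cur++;
        uint16_t value = *cur++;
        while (count--)
            *dest++ = value;
    }
    return (size_t)(dest - start);
}
```

With this layout, abbccc encodes to six words (no saving at all), while a solid 1200-pixel scan line encodes to two. An image in which every pixel differs from its neighbour doubles in size, which is the gradient case the message warns about.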
Re: Opportunity for speedup
Bobby Powers wrote:
> On Thu, Feb 19, 2009 at 1:22 PM, C. Scott Ananian wrote:
>
>> I'd suggest just uncompressing the various image files and re-timing
>> as a start. The initial implementation was uncompressed, but people
>> complained about space usage on the emulator images (which are
>> uncompressed). The current code supports both uncompressed and
>> compressed image formats. For uncompressed images, putting the bits
>> on the screen is an mmap and memcpy, so I can't imagine any
>> implementation being faster than that (it's possible, of course, that
>> what's stealing CPU is the shell's invocation of the client program;
>> recoding just that little part in C should be trivial, since it does
>> nothing but write to a socket IIRC.)
>>
>> Anyway, further benchmarking of the current implementation is probably
>> worthwhile before a complete reimplementation is called for. But if
>> you want to reimplement it from scratch, go nuts.
>> --scott
>>
>
> I already re-implemented it - it was a fun optimization project and
> introduction to lower level systems programming. Using Mitch's D565
> format to keep track of only the parts of the image that change cut
> down the implementation size significantly. It's now only 2
> uncompressed images (frame00.565 and ul-warning.565), and <300KB of
> differences for the animation sequence. I understand reads from video
> memory (which I think is what the framebuffer is?) can be extremely
> slow, so it could turn out faster to open a D565 file, mmap it and
> memcpy the several tens of kilobytes of differences to the framebuffer
> than it is to read those differences from one part of video memory to
> another.
>

It is easy to measure just how "slow" video memory reads are. Let's test 256K (0x40000):

    ok screen-ih iselect
    ok t( frame-buffer-adr frame-buffer-adr 4.0000 + 4.0000 move )t
    56,272 uS

Conversely, for memory to frame buffer:

    ok t( load-base frame-buffer-adr 4.0000 + 4.0000 move )t
    05,407 uS

So a copy that reads from the frame buffer is roughly ten times slower than one that only writes to it.

But the total amount of time that we have "wasted" is 50 milliseconds over the whole procedure. I suspect that it will be difficult to come up with a way to save those 50 mS that doesn't cost a similar amount of time in setup.

For ongoing stuff like run-time graphics operations, it's clearly important to avoid "slow" operations, but in this case, we are trading off slow FB accesses against the complexity of maintaining persistent state in main memory.

> This is where benchmarking should give some clearer answers.
>
> yours,
> Bobby
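The same kind of measurement can be repeated from Linux userspace. The sketch below times a single memcpy with clock_gettime and converts the result to a throughput figure. The function names time_copy_us and throughput_mb_per_s are assumptions, and the buffers are whatever the caller passes in: ordinary RAM, or an mmap of the framebuffer device if the goal is to reproduce the read-versus-write comparison.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Copy len bytes from src to dst and return the elapsed wall-clock
 * time in microseconds, measured with the monotonic clock. */
int64_t time_copy_us(void *dst, const void *src, size_t len)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000
         + (t1.tv_nsec - t0.tv_nsec) / 1000;
}

/* Convert a byte count and a duration in microseconds to MB/s
 * (bytes per microsecond equals 10^6 bytes per second). */
double throughput_mb_per_s(size_t bytes, int64_t usec)
{
    return (double)bytes / (double)usec;
}
```

Taking 256K as 262144 bytes, the two timings quoted above work out to about 4.7 MB/s for the copy that reads from the frame buffer and about 48 MB/s for the copy from main memory into it.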
Re: Opportunity for speedup
2009/2/19 Wade Brainerd:
> On Thu, Feb 19, 2009 at 1:22 PM, C. Scott Ananian wrote:
>>
>> I'd suggest just uncompressing the various image files and re-timing
>> as a start. The initial implementation was uncompressed, but people
>> complained about space usage on the emulator images (which are
>> uncompressed). The current code supports both uncompressed and
>> compressed image formats. For uncompressed images, putting the bits
>> on the screen is an mmap and memcpy, so I can't imagine any
>> implementation being faster than that (it's possible, of course, that
>> what's stealing CPU is the shell's invocation of the client program;
>> recoding just that little part in C should be trivial, since it does
>> nothing but write to a socket IIRC.)
>
> I implemented an RLE compressor specifically for these 16-bit image files the
> last time this question came up. This can certainly be faster than memcpy
> since we are talking memory performance.

Can you explain this? I don't think I have enough knowledge to evaluate your claim.

bobby
Re: Opportunity for speedup
On Thu, Feb 19, 2009 at 1:22 PM, C. Scott Ananian wrote:
> I'd suggest just uncompressing the various image files and re-timing
> as a start. The initial implementation was uncompressed, but people
> complained about space usage on the emulator images (which are
> uncompressed). The current code supports both uncompressed and
> compressed image formats. For uncompressed images, putting the bits
> on the screen is an mmap and memcpy, so I can't imagine any
> implementation being faster than that (it's possible, of course, that
> what's stealing CPU is the shell's invocation of the client program;
> recoding just that little part in C should be trivial, since it does
> nothing but write to a socket IIRC.)
>
> Anyway, further benchmarking of the current implementation is probably
> worthwhile before a complete reimplementation is called for. But if
> you want to reimplement it from scratch, go nuts.
> --scott

I already re-implemented it - it was a fun optimization project and introduction to lower level systems programming. Using Mitch's D565 format to keep track of only the parts of the image that change cut down the implementation size significantly. It's now only 2 uncompressed images (frame00.565 and ul-warning.565), and <300KB of differences for the animation sequence.

I understand reads from video memory (which I think is what the framebuffer is?) can be extremely slow, so it could turn out faster to open a D565 file, mmap it and memcpy the several tens of kilobytes of differences to the framebuffer than it is to read those differences from one part of video memory to another.

This is where benchmarking should give some clearer answers.

yours,
Bobby
Re: Opportunity for speedup
On Thu, Feb 19, 2009 at 1:22 PM, C. Scott Ananian wrote:
> I'd suggest just uncompressing the various image files and re-timing
> as a start. The initial implementation was uncompressed, but people
> complained about space usage on the emulator images (which are
> uncompressed). The current code supports both uncompressed and
> compressed image formats. For uncompressed images, putting the bits
> on the screen is an mmap and memcpy, so I can't imagine any
> implementation being faster than that (it's possible, of course, that
> what's stealing CPU is the shell's invocation of the client program;
> recoding just that little part in C should be trivial, since it does
> nothing but write to a socket IIRC.)

I implemented an RLE compressor specifically for these 16-bit image files the last time this question came up. This can certainly be faster than memcpy since we are talking memory performance. GZip+RLE also beats plain GZip on size, again due to the contents of the images.

http://wadeb.com/rle.c
http://wadeb.com/unrle.c

-Wade
Re: Opportunity for speedup
C. Scott Ananian wrote:
> I'd suggest just uncompressing the various image files and re-timing
> as a start. The initial implementation was uncompressed, but people
> complained about space usage on the emulator images (which are
> uncompressed). The current code supports both uncompressed and
> compressed image formats. For uncompressed images, putting the bits
> on the screen is an mmap and memcpy, so I can't imagine any
> implementation being faster than that (it's possible, of course, that
> what's stealing CPU is the shell's invocation of the client program;
> recoding just that little part in C should be trivial, since it does
> nothing but write to a socket IIRC.)
>
> Anyway, further benchmarking of the current implementation is probably
> worthwhile before a complete reimplementation is called for. But if
> you want to reimplement it from scratch, go nuts.
> --scott
>

It has already been reimplemented.

The "disk" I/O time for 26 full-screen images is several seconds.
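A back-of-the-envelope check on why 26 full-screen images are expensive to read. The sketch assumes the 1200x900 display at 2 bytes per pixel mentioned elsewhere in this thread, and assumes every frame is stored uncompressed, which may not match the files actually shipped. The constant and function names are made up for illustration.

```c
#include <stdint.h>

enum { XO_WIDTH = 1200, XO_HEIGHT = 900, BYTES_PER_PIXEL = 2 };

/* Size in bytes of one uncompressed full-screen 16-bit frame. */
uint32_t frame_bytes(void)
{
    return (uint32_t)XO_WIDTH * XO_HEIGHT * BYTES_PER_PIXEL;
}

/* Total bytes read if each of `frames` images is stored uncompressed. */
uint64_t animation_bytes(unsigned frames)
{
    return (uint64_t)frames * frame_bytes();
}
```

One frame is 2,160,000 bytes, so 26 of them come to 56,160,000 bytes. Two full frames plus under 300KB of differences, as in the reimplementation, is about 4.6 MB.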
Re: Opportunity for speedup
I'd suggest just uncompressing the various image files and re-timing as a start. The initial implementation was uncompressed, but people complained about space usage on the emulator images (which are uncompressed). The current code supports both uncompressed and compressed image formats. For uncompressed images, putting the bits on the screen is an mmap and memcpy, so I can't imagine any implementation being faster than that (it's possible, of course, that what's stealing CPU is the shell's invocation of the client program; recoding just that little part in C should be trivial, since it does nothing but write to a socket IIRC.)

Anyway, further benchmarking of the current implementation is probably worthwhile before a complete reimplementation is called for. But if you want to reimplement it from scratch, go nuts.
 --scott

--
 ( http://cscott.net/ )
Re: Opportunity for speedup
>> I just measured the time taken by the boot animation by the simple
>> technique of renaming /usr/bin/rhgb-client so the initscripts can't find it.
>
> how did you measure exactly? stopwatch? I'd like to recreate the
> tests. It sounds like you did this on a freshly flashed system?

There were a number of tools used by some of the Fedora devs to measure boot speed when developing plymouth to replace the old RHGB system. It would be interesting to try plymouth here (both text and graphical) to see what the comparison is like. It might be possible to get a lot of the wins that Fedora got with very little work, as plymouth has a full plugin system, so it shouldn't be hard to add the OLPC boot logos.

Peter
Re: Opportunity for speedup
mitch wrote:
> Bobby Powers wrote:
> > - it's designed to be as light as possible, using syscalls instead of
> > libc functions as much as possible (the only thing we use libc for is
> > string comparison, which could be replaced with a local function).
> > while it's written like this, I haven't worked on cutting down the
> > linking (I need some guidance for that)

great stuff bobby -- i'm happy to help with any remaining details if you like.

> >
>
> To reduce the execution footprint, you could try linking it against
> dietlibc, http://www.fefe.de/dietlibc/
>
> I'm not sure just how much time that would save; maybe it wouldn't be
> significant. But it's worth a try.

my gut says that using the already present glibc shared lib will be cheaper than introducing a new library, even if it's small and static. but you're right, it's worth a try.

> > and source is avail at
> > http://dev.laptop.org/git?p=users/bobbyp/bootanim

i took a very brief look. as a favor to future maintainers, i think you could either
 a) merge boot-anim-start/client/stop and ul-warning into a single executable (much of the code is the same) or
 b) extract the common parts (e.g. initial_setup(), and the code that mmaps the framebuffer) into a boot-anim-utils.c or something like that.

(and while i'm all for reducing dependencies, the XO has so much else going on that i don't think linking against string libraries or even stdio will affect things much in the greater scheme of things. so i'd have used fputs rather than write(2,...) for errors. but i understand the intent.)

paul
=-
 paul fox, p...@laptop.org
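Option (b) might look something like the sketch below: one helper, shared by all the binaries, that opens and maps the framebuffer. Only the file name boot-anim-utils.c comes from the message above. The function name map_framebuffer, its signature, and the error strings are assumptions for illustration, not code from the bootanim repository.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical boot-anim-utils.c helper: map `len` bytes of the device
 * (or file) at `path` read/write and shared. Returns NULL on failure. */
void *map_framebuffer(const char *path, size_t len)
{
    void *mem;
    int fd = open(path, O_RDWR);

    if (fd < 0) {
        fputs("boot-anim: cannot open framebuffer\n", stderr);
        return NULL;
    }

    mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);              /* the mapping stays valid after close */

    if (mem == MAP_FAILED) {
        fputs("boot-anim: cannot mmap framebuffer\n", stderr);
        return NULL;
    }
    return mem;
}
```

Each program would then start with a call such as map_framebuffer("/dev/fb0", 1200 * 900 * 2), where /dev/fb0 is the usual Linux framebuffer device node and may differ on a given system. The error path uses fputs, as suggested above, instead of write(2, ...).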
Re: Opportunity for speedup
Cool!

Bobby Powers wrote:
> On Wed, Feb 11, 2009 at 2:01 AM, Mitch Bradley wrote:
>
>> I just measured the time taken by the boot animation by the simple
>> technique of renaming /usr/bin/rhgb-client so the initscripts can't find it.
>>
>
> how did you measure exactly? stopwatch? I'd like to recreate the
> tests. It sounds like you did this on a freshly flashed system?
>

Yes on both counts. Stopwatch on freshly-flashed os7.img.

>
>> With boot animation, OS build 7 (an older 8.2.1 candidate) takes 60
>> seconds from first dot (indicating OFW transfer to Linux) to Sugar
>> "prompt for your name". Without it, 53 seconds. I repeated the test
>> several times with consistent results.
>>
>> Clearly, it should be possible to display that amount of information in
>> much less than 7 seconds.
>>
>> The boot animation code is in the OLPC domain, not the upstream domain,
>> so replacing it should be relatively free of upstream politics.
>>
>> So if anybody is interested in implementing a relatively simple
>> boot-time speedup, I offer this as low-hanging fruit.
>>
>> I suggest 1 second (differential time between animation and no-animation
>> cases) as a reasonable target goal, assuming images of the complexity of
>> the current ones. Arbitrary full-screen graphics might require more
>> time, but speeding up the baseline case is a good starting point.
>>
>> Go wild.
>>
>
> So I've taken a first cut at this, implemented with the following
> design considerations (mostly from a conversation with Mitch)
> - the Python client/server was reimplemented as several standalone C
> programs (boot-anim-start, boot-anim-client, and some cleanup in
> boot-anim-stop)
> - a client and server was used before because there is state
> information that needs to be saved: we need to keep track of where in
> the animation we are. We can keep track of this by using offscreen
> memory in the framebuffer (it's 16MB in size, and only the first 2ish
> MB is used for the onscreen graphics (my terminology might be off
> here)). For state we really only need to keep track of 2 integers,
> one for the current frame number and another to store the offset of
> the next diff to apply.
> - on startup we load an initial image into the framebuffer (the first
> 1200*900*2 bytes, since we use 2 bytes per pixel for color
> information), and then load in a series of changes to the framebuffer
> image (<300KB). This takes the form of a series of diffs
> - for each update (a valid call to boot-anim-client) we apply the next
> diff in the series to the onscreen image and update our state
> information
> - after applying the last diff we have (the end of the animation
> series), freeze the DCON (when I first attempted to freeze the DCON
> when z-boot-anim-stop was called it left the screen in an inconsistent
> state, I believe because of X startup)
> - it's designed to be as light as possible, using syscalls instead of
> libc functions as much as possible (the only thing we use libc for is
> string comparison, which could be replaced with a local function).
> while it's written like this, I haven't worked on cutting down the
> linking (I need some guidance for that)
>

To reduce the execution footprint, you could try linking it against dietlibc, http://www.fefe.de/dietlibc/

I'm not sure just how much time that would save; maybe it wouldn't be significant. But it's worth a try.

> comments and suggestions welcome :)
>
> I'd appreciate any testing as well as any code review. (the shutdown
> image appears to be broken, FYI. i haven't looked at that in depth,
> it's probably a one line fix.)
> rpms (built with mock) are available at
> http://dev.laptop.org/~bobbyp/bootanim/
> and source is avail at
> http://dev.laptop.org/git?p=users/bobbyp/bootanim
>
> -Bobby
Re: Opportunity for speedup
On Wed, Feb 11, 2009 at 2:01 AM, Mitch Bradley wrote:
> I just measured the time taken by the boot animation by the simple
> technique of renaming /usr/bin/rhgb-client so the initscripts can't find it.

how did you measure exactly? stopwatch? I'd like to recreate the tests. It sounds like you did this on a freshly flashed system?

> With boot animation, OS build 7 (an older 8.2.1 candidate) takes 60
> seconds from first dot (indicating OFW transfer to Linux) to Sugar
> "prompt for your name". Without it, 53 seconds. I repeated the test
> several times with consistent results.
>
> Clearly, it should be possible to display that amount of information in
> much less than 7 seconds.
>
> The boot animation code is in the OLPC domain, not the upstream domain,
> so replacing it should be relatively free of upstream politics.
>
> So if anybody is interested in implementing a relatively simple
> boot-time speedup, I offer this as low-hanging fruit.
>
> I suggest 1 second (differential time between animation and no-animation
> cases) as a reasonable target goal, assuming images of the complexity of
> the current ones. Arbitrary full-screen graphics might require more
> time, but speeding up the baseline case is a good starting point.
>
> Go wild.

So I've taken a first cut at this, implemented with the following design considerations (mostly from a conversation with Mitch):

- the Python client/server was reimplemented as several standalone C programs (boot-anim-start, boot-anim-client, and some cleanup in boot-anim-stop)
- a client and server were used before because there is state information that needs to be saved: we need to keep track of where in the animation we are. We can keep track of this by using offscreen memory in the framebuffer (it's 16MB in size, and only the first 2ish MB is used for the onscreen graphics (my terminology might be off here)). For state we really only need to keep track of 2 integers, one for the current frame number and another to store the offset of the next diff to apply.
- on startup we load an initial image into the framebuffer (the first 1200*900*2 bytes, since we use 2 bytes per pixel for color information), and then load in a series of changes to the framebuffer image (<300KB). This takes the form of a series of diffs
- for each update (a valid call to boot-anim-client) we apply the next diff in the series to the onscreen image and update our state information
- after applying the last diff we have (the end of the animation series), freeze the DCON (when I first attempted to freeze the DCON when z-boot-anim-stop was called it left the screen in an inconsistent state, I believe because of X startup)
- it's designed to be as light as possible, using syscalls instead of libc functions as much as possible (the only thing we use libc for is string comparison, which could be replaced with a local function). while it's written like this, I haven't worked on cutting down the linking (I need some guidance for that)

comments and suggestions welcome :)

I'd appreciate any testing as well as any code review. (the shutdown image appears to be broken, FYI. i haven't looked at that in depth, it's probably a one line fix.)

rpms (built with mock) are available at
http://dev.laptop.org/~bobbyp/bootanim/
and source is avail at
http://dev.laptop.org/git?p=users/bobbyp/bootanim

-Bobby
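The state-keeping scheme described in the list above can be sketched as follows. Everything about the layout here is an assumption made for illustration: struct anim_state, the diff_run record (pixel offset, length, then the pixels themselves) and the function names are hypothetical, and are neither the D565 format nor the actual bootanim code.

```c
#include <stdint.h>
#include <string.h>

#define FB_WIDTH   1200
#define FB_HEIGHT  900
#define VISIBLE_BYTES ((size_t)FB_WIDTH * FB_HEIGHT * 2)  /* 2 bytes per pixel */

/* Hypothetical state record kept just past the visible image, in the
 * offscreen part of framebuffer memory. */
struct anim_state {
    uint32_t frame;        /* number of the frame currently on screen */
    uint32_t next_diff;    /* byte offset of the next diff in the diff data */
};

/* Hypothetical diff layout: a run header followed by `count` pixels. */
struct diff_run {
    uint32_t pixel_offset; /* first pixel to overwrite */
    uint32_t count;        /* number of pixels that follow the header */
};

static struct anim_state *state_in(uint8_t *fb)
{
    return (struct anim_state *)(fb + VISIBLE_BYTES);
}

/* Apply `runs` consecutive runs starting at st->next_diff, then advance
 * the saved state so the next invocation continues where this one ended. */
void apply_next_diff(uint8_t *fb, const uint8_t *diffs, uint32_t runs)
{
    struct anim_state *st = state_in(fb);
    const uint8_t *p = diffs + st->next_diff;

    while (runs--) {
        struct diff_run run;

        memcpy(&run, p, sizeof run);          /* header may be unaligned */
        p += sizeof run;
        memcpy(fb + (size_t)run.pixel_offset * 2, p, (size_t)run.count * 2);
        p += (size_t)run.count * 2;
    }

    st->frame++;
    st->next_diff = (uint32_t)(p - diffs);
}
```

Because the two integers live in framebuffer memory rather than in a process, each boot-anim-client invocation can map the framebuffer, apply one diff, and exit, with no server left running between updates.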
Opportunity for speedup
I just measured the time taken by the boot animation by the simple technique of renaming /usr/bin/rhgb-client so the initscripts can't find it.

With boot animation, OS build 7 (an older 8.2.1 candidate) takes 60 seconds from first dot (indicating OFW transfer to Linux) to Sugar "prompt for your name". Without it, 53 seconds. I repeated the test several times with consistent results.

Clearly, it should be possible to display that amount of information in much less than 7 seconds.

The boot animation code is in the OLPC domain, not the upstream domain, so replacing it should be relatively free of upstream politics.

So if anybody is interested in implementing a relatively simple boot-time speedup, I offer this as low-hanging fruit.

I suggest 1 second (differential time between animation and no-animation cases) as a reasonable target goal, assuming images of the complexity of the current ones. Arbitrary full-screen graphics might require more time, but speeding up the baseline case is a good starting point.

Go wild.