Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
In floating point, all you have to do is flip a sign bit. But with integers, it's not so easy. There is no instruction for absolute value in MMX; you have to use a four-instruction sequence and two registers. Slower than squaring a value, which only takes two instructions.

I finally found an MMX2 reference, and you're right about that. MMX2 added psadbw, packed sum of absolute differences. If you have 8-bit unsigned data it makes computing SAD pretty darn easy: you can find the SAD of 8 pixels in one instruction. There are really very, very few MMX-only current CPUs. Basically everything after the K6 with MMX supports the psadbw instruction.

Though why did mpeg2enc use variance in the first place? Maybe it's a better estimator than SAD for motion compensation fit?

It's just not that simple. It uses SAD for 'coarse' motion estimation and switches to variance for the final selection of the particular motion estimation mode. This combination provides a good speed/quality trade-off. Experiments with 'only variance' were unimpressive in their quality improvements, and 'only SAD' costs quite a lot of quality for a modest speed gain. All the 'low hanging fruit' in the current motion estimation algorithm has long since been picked...

The level of AltiVec optimization in ffmpeg vs mpeg2enc is probably an important factor in any speed difference, and one that wouldn't matter for other CPUs, just as the level of MMX/MMX2/SSE optimization makes a large difference on x86. You have to be very careful to compare like coding profiles, motion search radii and suchlike too. It's easy to be twice as fast if you're simply trying half as hard to find a good encoding. However, there *are* bottlenecks in mpeg2enc. For a really speedy mode, predictive motion estimation algorithms would need to be used.

Andrew

--- This SF.net email is sponsored by: IBM Linux Tutorials. Become an expert in LINUX or just sharpen your skills. Sign up for IBM's Free Linux Tutorials. Learn everything from the bash shell to sys admin.
Click now! http://ads.osdn.com/?ad_id=1278alloc_id=3371op=click ___ Mjpeg-users mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/mjpeg-users
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On 19 Dec 2003, Florin Andrei wrote: On Fri, 2003-12-19 at 01:49, Steven M. Schultz wrote: At any rate I checked out ffmpeg's MPEG-2 encoding vs mpeg2enc on my G4 Powerbook. Yes, ffmpeg has a big speed advantage (~2x) but

The difference is even larger than I thought... ffmpeg was decoding the DV file and encoding the audio at the same time, but I had mpeg2enc reading a pre-staged .y4m file.

Any chance of repeating that on an Intel or AMD processor?

ffmpeg -i input.dv -vcodec mpeg2video -f mpeg -b 5000 -g 15 foo.mpg

then compare to

mpeg2enc -R 0 -b 5000 -4 3 -2 2 -o foo1.mpg input.y4m

Have fun :)

At the moment my AMD system's booked solid for other encoding jobs; it'll be a couple of days before I can run some tests on it. Hmmm, time to fire up the dual P4 system and get it sync'd up on all the projects. Maybe over the weekend.

Cheers, Steven Schultz
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tuesday 16 December 2003 23:35, Richard Ellis wrote: Hi Richard,

In that case it will kill the majority of the performance benefit provided by the caches, because there's very little locality of reference for the cache to compensate for. It moves through at least 512k for pass one, then through the same 512k again for pass two, but the data in the cache is from the end of the frame, and we are starting over at the beginning of the frame. Massive cache thrash in that case. Memory bandwidth becomes a much more limiting factor.

Exactly what I thought when I restructured encoding to a per-macroblock basis a few months back. The performance gain was not measurable. The key 'thinko' here is that most of the time goes into motion estimation, and in motion estimation the search windows of neighbouring macroblocks overlap 90%. Cache locality is pretty good. Playing around with prefetch (etc. etc.) has never brought measurable gains.

The main bottleneck in the current encoder (for modern CPUs) is the first phase (4x4 subsampling) of the subsampling motion estimation hierarchy. For speed this would need to be replaced with a predictive estimator. The next bottlenecks would be the run-length coding and the use of variance instead of SAD in motion compensation mode and DCT mode selection. Sadly there's not much that can be done easily about the former, and the latter cannot be removed without a noticeable reduction in encoding quality (I tried it :-(). However, I have some ideas to try when I get back in the new year!

Andrew
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Fri, 19 Dec 2003, Andrew Stevens wrote: The next bottlenecks would be the run-length coding and the use of variance instead of SAD in motion compensation mode and DCT mode selection. Sadly

Is SAD really any faster to calculate than variance? SAD uses an absolute-value-add operation while variance is multiply-add. Multiply-add is usually the most heavily optimized operation a CPU can perform.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Fri, 19 Dec 2003, Steven M. Schultz wrote: On Fri, 19 Dec 2003, Trent Piepho wrote: On Fri, 19 Dec 2003, Andrew Stevens wrote:

Is SAD really any faster to calculate than variance? SAD uses an absolute-value-add operation while variance is multiply-add. Multiply-add is usually the most heavily optimized operation a CPU can perform.

Au contraire. Multiply is a lot slower than abs(). All abs() has to do is flip a sign bit (effectively) and that's going to be a lot

In floating point, all you have to do is flip a sign bit. But with integers, it's not so easy. There is no instruction for absolute value in MMX; you have to use a four-instruction sequence and two registers. Slower than squaring a value, which only takes two instructions. Though you can cleverly combine an unsigned subtraction and absolute value operation into four instructions total, and perform it on eight unsigned bytes at a time. So you can compute an absolute value of differences quite a bit faster under MMX than I was thinking you could. Clearly SAD would be faster than variance.

Another advantage of SAD is that you can find an intermediate result more easily than with variance. That way you can short-circuit the SAD calculation once the running sum exceeds the best SAD already found.

faster than any multiply. And aren't there MMX2/SSE abs+add instructions - that would make abs/add quite fast.

I finally found an MMX2 reference, and you're right about that. MMX2 added psadbw, packed sum of absolute differences. If you have 8-bit unsigned data it makes computing SAD pretty darn easy: you can find the SAD of 8 pixels in one instruction.

Though why did mpeg2enc use variance in the first place? Maybe it's a better estimator than SAD for motion compensation fit?

At any rate I checked out ffmpeg's MPEG-2 encoding vs mpeg2enc on my G4 Powerbook.
Yes, ffmpeg has a big speed advantage (~2x) but

The level of AltiVec optimization in ffmpeg vs mpeg2enc is probably an important factor in any speed difference, and one that wouldn't matter for other CPUs, just as the level of MMX/MMX2/SSE optimization makes a large difference on x86.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Fri, Dec 19, 2003 at 01:34:38AM -0800, Trent Piepho wrote: On Fri, 19 Dec 2003, Andrew Stevens wrote: The next bottlenecks would be the run-length coding and the use of variance instead of SAD in motion compensation mode and DCT mode selection. Sadly

Is SAD really any faster to calculate than variance? SAD uses an absolute-value-add operation while variance is multiply-add. Multiply-add is usually the most heavily optimized operation a CPU can perform.

You are thinking of DSP chips, not general-purpose CPUs. For DSPs, yes, multiply-add is very heavily optimized, but for general-purpose CPUs it's often not quite so heavily optimized. Additionally, if you've got an SSE2-capable x86 chip, it's got parallel SAD operations in the SSE2 instruction set. There isn't an SSE2 mul-add operation yet.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Fri, 2003-12-19 at 01:49, Steven M. Schultz wrote: At any rate I checked out ffmpeg's MPEG-2 encoding vs mpeg2enc on my G4 Powerbook. Yes, ffmpeg has a big speed advantage (~2x) but the resulting output is 'grainy' (same bitrate, no B frames) (and the rate control is, well, almost non-existent - ~2x spikes that'd drive a hardware player nuts).

Any chance of repeating that on an Intel or AMD processor?

-- Florin Andrei http://florin.myip.org/
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 2003-12-16 at 12:33, Andrew Stevens wrote: Hi all, First off a bit of background to the multi-threading in the current stable branch:

- Parallelism is primarily frame-by-frame. This means that the final phases of the encoding lock on completion of the reference frame (prediction and DCT transform) and the predecessor (bit allocation). If you have a really fast CPU that motion-estimates and DCTs very fast you will get lower parallelisation. If you use -R 0 you will get very little parallelism *at all*. Certainly not enough to make -M 3 sensible.

Yet again, good to know. This line (generally, a triple loop over -M 0-3, -I 0-1 and -R 0-2): Produces this (approximately 1010 frames). Encoding times are real time / user time, which gives a bit of a view as to how busy the CPUs were during the real time (optimal should be 1m real time, 2m user time, right?); average system time was 3.0s, +/- 0.2s for all tests. (Options on each call were: -f 8 -g 9 -G 18 -v 0 -E -10 -K kvcd -4 2 -2 1 -F 1 rawstream.yuv)

-M 0 -I 0 -R 0: 1m 6.082s 0m 50.050s baselines
-M 0 -I 0 -R 1: 1m 16.545s 0m 58.980s ..
-M 0 -I 0 -R 2: 1m 34.511s 1m 17.045s ..
-M 0 -I 1 -R 0: 2m 7.344s 1m 49.495s ..
-M 0 -I 1 -R 1: 1m 59.665s 1m 42.215s ..
-M 0 -I 1 -R 2: 2m 30.990s 2m 30.990s ..
-M 1 -I 0 -R 0: 1m 5.713s 0m 49.800s -0.35s
-M 1 -I 0 -R 1: 1m 15.305s 0m 58.975s -1.2s
-M 1 -I 0 -R 2: 1m 34.057s 1m 17.090s -0.5s
-M 1 -I 1 -R 0: 2m 5.928s 1m 49.700s -1.3s
-M 1 -I 1 -R 1: 1m 59.019s 1m 41.955s -0.6s
-M 1 -I 1 -R 2: 2m 49.149s 2m 31.440s +19.2s
-M 2 -I 0 -R 0: 1m 0.503s 0m 25.930s -5.5s
-M 2 -I 0 -R 1: 0m 53.418s 0m 58.950s -23s
-M 2 -I 0 -R 2: 1m 7.418s 1m 18.145s -27s
-M 2 -I 1 -R 0: 1m 54.534s 1m 50.060s -13s
-M 2 -I 1 -R 1: 1m 15.489s 0m 1.040s -- uhm...?
-M 2 -I 1 -R 2: 1m 54.720s 1m 16.720s -36s
-M 3 -I 0 -R 0: 0m 57.533s 0m 50.610s -8.5s
-M 3 -I 0 -R 1: 0m 51.541s 0m 40.265s -25s
-M 3 -I 0 -R 2: 1m 5.996s 0m 54.325s -29s
-M 3 -I 1 -R 0: 1m 50.570s 1m 49.715s -17s
-M 3 -I 1 -R 1: 1m 14.462s 1m 8.530s -45s
-M 3 -I 1 -R 2: 1m 36.192s 0m 52.145s -54s

Interestingly, and I think this has to do with the I/O buffering, -M 0 is slower than -M 1 by a small fraction in all tests. And as Steven Schultz had suggested, -I 1 is a bad, bad idea. It never improved performance, and in fact made it quite a bit worse (the man page is right :). (Of course, -M 1 will be at least two processes, and since I have a real dual system, it makes sense, and may not hold true for a single CPU.) Also, encoding with one B frame is a touch faster in -I 1 mode than encoding without them, but it is slower when you encode two B frames instead of just one. I find this interesting: I would have expected a single B frame to take a bit longer than none at all, and that is the case when -I 0 is on, but not when it's -I 1. Any ideas on that one?

In the end -M 3 is not reasonably faster at -I 0 -R 0, but flies along at -I 0 -R 2 compared to baseline, and gets fair gains at -I 0 -R 1, while dropping encoding time by another 14 seconds for the same frameset. So, does this boil down to the fastest being -M 3 -I 0 -R 1? The numbers on -M 3 -I 1 -R 2 show a 54-second improvement over the tests with -M 0, but it takes almost 50% longer than -M 3 -I 0 -R 1. The file size of 3-1-2 is 13,807,067 and the file size of 3-0-1 is 13,402,673. The file is smaller, and is encoded faster, and viewing them now, the quality is at least on par (3-0-1 looked a tad better).

- There is also a parallel read-ahead thread but this rarely soaks much CPU on modern CPUs. The MPEG_DEVEL branch encoder stripes all encoding phases to allow much more scalable parallelisation. You might want to give it a go - I'd be interested in the results!

I'd love to, but I couldn't find it in CVS.
I found everything else in the SF CVS branch, but not mjpegtools itself.

N.b. in a 'realistic' scenario you're running the multiplexer and audio encoding in parallel with the encoder and video filters, communicating via pipes and named FIFOs. This setup usually saturates a modern dual machine.

No multiplexing and no audio encoding here (AC3 pass-through and multiplexing of DVD streams is done after completion of the video encoding). There is the overhead of decoding the original MPEG-2 stream into YUV, but that's about all else that transcode (which I'm using) is dumping into the pipe. I avoided any of that on this run by just dumping the file in an already-decoded format (pgmtoy4m output).

cheers, Andrew

PS I'm away on vacation for a couple of weeks from Friday, so there'll be a bit of a pause in answering emails / posts from then ;-)
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 2003-12-16 at 13:15, Richard Ellis wrote: On Tue, Dec 16, 2003 at 12:33:52AM -0700, Slepp Lukwai wrote: .. It's a dual Athlon, which inherently means 266FSB (DDR 266), though the memory is actually Hynix PC3200 w/ timings set as low as they go on this board (2-2-2), which gives me about 550MB/s memory bandwidth according to memtest, with a 13GB/s L1 and something like 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1. At 550MB/s, it SHOULD be able to push enough to keep the frames encoding at 100% CPU, in theory.

Yes, but just one 720x480 DVD-quality frame is larger than 256k in size, so a 256k cache per CPU isn't helping too much overall, considering how many frames there are in a typical video to be encoded. Plus, my experience with Athlons is that they are actually faster at mpeg2enc encoding than Intel chips of equivalent speed ratings (the Athlon's 3DNow!/MMX implementation is faster), and so they put a heavier stress on one's memory bandwidth than an equivalent-speed Intel chip would. It's possible that 275MB/s per CPU just isn't fast enough to keep up with the rate at which mpeg2enc can consume data on an Athlon.

Yes, I expect the cache to only be able to fit the mpeg2enc code sections, not any of the data it uses. If the code keeps getting bumped out, then that's a problem. And 275MB/s may not be enough, true... It's too bad the Athlon dual chipset (AMD 768MPX) can't do above about 140 MHz bus speeds, to see how much memory speed affects it.

Of course, Andrew would be much better suited to discuss mpeg2enc's memory access patterns during encoding, which depending on how it goes about accessing memory can better make use of the 256k of cache, or cause the 256k of cache to be constantly thrashed in and out.

It could be interesting to use cachegrind on mpeg2enc and see what it declares for cache hit/miss, but I find cachegrind tends to make a 1-minute runtime hit 10 minutes, so I may not bother..

Now that's just silly.
Why would you hurt the CPUs by running such bloat as Mozilla? I can't think of how many times Mozilla has gone nuts on me and used 100% CPU without reason, and you can't kill it any normal UI way.. Good ol' killall. However, I love it. It's a great browser. Just rather hungry at times. I suppose there's a reason the logo is a dinosaur. :

Hmm... Interesting. I've had it sometimes just stop, but never go nuts with 100% CPU, and although I usually do CLI-kill it if need be, FVWM2's destroy-window command has never failed to get rid of it if I don't bother to go to the CLI to do so. In fact, FVWM2's destroy has never failed to get rid of anything that went wonky. It's the X Windows equivalent of a kill -9 from the CLI.

I've had it lock up so that X becomes unresponsive, since it's in a loop doing some expensive operation of some sort. It's strange. I don't see it nearly as often with the newer Mozillas as I did the old ones (in fact, haven't seen it in over a month).
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 2003-12-16 at 23:17, Bernhard Praschinger wrote:

-M 0: 2m 11.9s
-M 1: 2m 10.6s, -1.3s
-M 2: 1m 27.7s, -44.2s
-M 3: 1m 26.5s, -45.4s

Those values look much better. :-) Now you have seen that mpeg2enc can go faster. It's like it used to be. :

I'm going to try it on a full video, with a few options. I figure I'll let it run through 24 hours of encoding time (about 6 different trials) and see how each result turns out, and so on. I'll let you all know when it's done. :

I have tried the command you used on my machine, and I have seen the same problem. Also 3 processes, each only 33%. (time lav2yuv n1000.eli | mpeg2enc -I 0 -f 8 -b 9800 -p -a 3 -o test.m2v -S -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0)

Yes.. So it's definitely the -R 0, but -R 1 is faster than the default of -R 2 (I think that's the default?). Note that I responded in an earlier message with a total of 24 timings across -M 0-3, -I 0-1 and -R 0-2 settings, which turned up the interesting result that -M 3 -I 0 -R 1 was fastest of all of them (same source material I used for the above, and it took 51 seconds). So, I think the -I 1 is on, which makes a huge boost in -M ratings from 0 to 3, but it is still quite a bit slower than -I 0 (which I use since the input is progressive 23.976fps).

That's strange. It makes a mild bit of sense.. But just a little.

I'm just running some encodings to see which option causes the problem. On my machine the -R 0 caused the problem. If I used -R 1/2 or no -R option, I got 3 processes each using about 45-50%. Which should total about 150% CPU instead of the 99% that it uses with -R 0.

My brain had given up by the time I started my computer that evening ;)

Mine usually does that at about 8am. :

Just as you enter work? ;)

Self-employed, thereby just as I crawl out of bed, and my brain stays broken until about noon. That's what I get for staying up till 4am playing with mpeg2enc. :

Encoding without the -R 0 seems to solve the problem, by now.
I'm going to see what speeds I get about halfway through a video, when nothing from disk is in cache anymore, the encoders/decoders are in full swing, and everything sort of settles down.. Should be interesting.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, Dec 16, 2003 at 06:54:22PM -0700, Slepp Lukwai wrote: As a side note, I'm also using a 200Hz timer, instead of the standard 100Hz. Though I don't see this doing anything but making it quicker, as it reduces latency on scheduling, while slightly increasing scheduler overhead and context switching (or is an SSE/3DNow! context switch really expensive, anyone know?).

A 200Hz timer will have only one effect on batch-type processes: slowing them down. And mpeg2enc is essentially a batch-type process. Why? Because of the increased scheduler overhead. Now, you may be hard put to measure the slowdown because so many other effects will swamp it (one HD seek that takes a few ms would swamp a large part of the scheduler overhead) but it's still there. The only thing that's quicker with a 200Hz timer is interactive response, where you want to see your X cursor move the instant you touch the mouse.

Yes, context switching (at least for SSE) is more expensive, because the 8 128-bit SSE registers may need to be saved. I don't know off the top of my head if Intel implemented lazy context saves for SSE like with the x86 FPU stack. If they did, then not all context swaps incur the SSE save overhead, but when one does, there is more data to save.

I wonder if it comes back to the increased timing of the scheduler? (Though it's using a supposed O(1) scheduler, which should offset that.)

The O(1) scheduler does not change the context switch overhead timing. The O(1) scheduler simply says that no matter how many processes are waiting to run, it takes a constant time to find the next one when we do need to context switch. But a 200Hz timer will still use up 2x as much CPU time running the scheduler as a 100Hz timer will.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Mon, 2003-12-15 at 21:08, Richard Ellis wrote: What program are you using to monitor CPU usage while mpeg2enc runs? Some versions of top (if you are using top) report percentages as a roll-up of the whole SMP machine, so that 3x33% usage really means 99% utilization of the machine, where "the machine" means both processors combined. Other versions report a per-CPU percentage instead of rolling everything together.

I hate the combined ratings, so I already set up top to report per-CPU usage, so I can see 200% usage instead of it showing 50% as 100% on one CPU (it's misleading when you deal with single CPUs almost all day for work).

Additionally, what kind of memory do you have attached to the CPUs? MPEG encoding is very memory-bandwidth hungry to begin with, and with two CPUs trying to eat at the same trough, a not-quite-as-fast-as-it-should-be memory subsystem can produce results like what you are seeing. It's because with the two CPUs trying to run mpeg2enc, they together oversaturate the memory bus, causing both to wait. But with only one mpeg2enc thread running, the entire memory bus bandwidth is available to that one CPU alone.

I've noticed. I never really saw how much memory it used until I used the buffer program with -t. It was moving gigs of data for a short period of frames (perhaps 10,000 frames). It's a dual Athlon, which inherently means 266FSB (DDR 266), though the memory is actually Hynix PC3200 w/ timings set as low as they go on this board (2-2-2), which gives me about 550MB/s memory bandwidth according to memtest, with a 13GB/s L1 and something like 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1. At 550MB/s, it SHOULD be able to push enough to keep the frames encoding at 100% CPU, in theory. I don't think there's enough overhead on this machine to qualify as keeping it even half saturated. This is why I want the Corsair XMS Pro memory with load meters on them. (Per-bank load meters, even.)
FWIW, when my desktop machine was a dual PII-400MHz box, I almost always had two mpeg2enc threads eating up 97-98% CPU on both PII chips. The few times both CPUs were not fully saturated at mpeg encoding was when I'd bother them with something silly like browsing the web with mozilla. :)

Now that's just silly. Why would you hurt the CPUs by running such bloat as Mozilla? I can't think of how many times Mozilla has gone nuts on me and used 100% CPU without reason, and you can't kill it any normal UI way.. Good ol' killall. However, I love it. It's a great browser. Just rather hungry at times. I suppose there's a reason the logo is a dinosaur. :
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Mon, 2003-12-15 at 20:27, Steven M. Schultz wrote: On Mon, 15 Dec 2003, Slepp Lukwai wrote: faster to begin with. However, in both cases, after multiple tests and trying different things, I can't get the SMP modes to be fast at all. In fact, they're slower than the non-SMP modes.

I think I see what you're doing that could cause that. I've never seen the problem - using -M 2 is not going to be 2x as fast, though, if that was the expectation. ~40% speedup or so is what I see (from about 10fps to 14fps) typically.

Tried it without any options, same effect. I'm definitely seeing nowhere near 40% speedup, which is what boggles me. I expected at least reasonable gains of 25%. When encoding with -M 0 with .92, I get around 19fps. When I use -M

That's full-sized (720x480), is it? Sounds more like an SVCD or perhaps 1/2 D1 (bit of a misnomer - D1 is actually a digital video tape deck) at 352x480. At 1/2 size, yes, around 20fps or a bit more I've seen. But I'm usually tossing in a bit of filtering so the process is a bit slower.

Sorry, upon further testing, I actually average around 14fps at DVD quality (720x480, 9800kbit/s). (See all the details of my command lines in the post I sent in response to Bernhard.)

I installed 'buffer', set it up with a 32MB buffer and put it in the

10MB is about all I use - it's just a cushion to prevent the encoder from having to wait (-M 1 is the default - there's I/O readahead going on) for input.

Yeh, I tried 20 first, then 32, but in the end it made no difference at all.

Has anyone found a way around this, or is it time to look at the source and see what's up? And for reference, it's a dual Athlon MP 2100+, which is below the '2600' that the Howto references as fast.

I'm using dual 2800s, and around 14-15fps for DVD encoding is what I usually get.

It's interesting that I'm faster with dual 2100s than the dual 2800s (or at least on par).
I suppose it really comes down to command line options, but you would need to compare those yourself (since I haven't seen yours). The actual command line is: mpeg2enc -v 0 -I 0 -f 8 -b 9800 -F 1 -n n -p -a 3 -o test.m2v -S -M 3 -4 2 -2 1 -r 32 -q 5 -Q 3.0 -K kvcd

You have progressive non-interlaced source? If not, then -I 0 is not the right option.

According to the docs, -I 1 turns on interlacing support, and causes unneeded overhead if it is known progressive material. Hence the -I 0 (plus transcode sets that, though I could override it).

The speed-up from multiple processors comes, I believe (but if I'm wrong I'm sure someone will tactfully point that out ;)), from the motion estimation of the 2 fields/frame being done in parallel.

Oh. Son of a... If that's all it is...

Try -I 1 (or just leave out the -I and let it default). Oh, and there's no real benefit from going above -M 2. I had a 4-CPU box and tried -M 4 and saw no gain over -M 3 (which in turn was a very minimal increase over -M 2).

I've never even bothered with -M 4 (well, not for a real run, anyway, just as a quick test).

If you want to speed things up by a good percentage, try encoding without B frames. Those are computationally a lot more expensive than I or P frames. -R 0 will disable B frames.

I just enabled that, and that's how I'm hitting 15fps instead of 8, and the quality is good and the size is just fine.

And do you realize that increasing the search radius (-r) slows things down? Leave the -r value defaulted to 16 and you should see encoding speed up.

Yup, entirely aware. I do like the minor difference it makes, though. I'm not in it for speed, really; I just want to see both CPUs get used to their potential and give me the equivalent of a 4200+ ;) If it takes 6 hours to transcode a movie because I set -r 32 (I noticed a larger difference with the -4/-2 options, btw, than -r 16 vs -r 32), that's fine, but I feel it could be faster.
All in all - the defaults are fairly sane, so if you're not certain about an option, well, let it default. And drop the -Q unless you want artifacting - especially values over 2. Under some conditions (it's partly material-dependent) the -Q can generate really obnoxious color blocks and similar artifacts. Much better results (especially with clean source material) can be obtained with -E -8 or perhaps -E -10.

Until I upgraded to .92, I didn't have those options. I'm using them now, in combination with -Q, but I find the artifacts are almost never there (I used to do -q 4 and -Q 4.0, and it looked about the same as the 5/3.0). Of course the -M 3 changes to 2 and 0 in testing. I also tested it with and without the buffer program in the list. Another notable thing is that with
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Mon, 2003-12-15 at 22:44, Bernhard Praschinger wrote: Hallo

I was doing some testing of both the older version (1.6.1.90) and the newer version of mpeg2enc (1.6.1.92). First off, the .92 was somewhat faster to begin with. However, in both cases, after multiple tests and trying different things, I can't get the SMP modes to be fast at all. In fact, they're slower than the non-SMP modes.

By slower, I hope you mean mpeg2enc needs more CPU time to encode the movie, and not more wall-clock time.

Slower by wall clock. It took less time to re-encode the entire thing with -M 0 than when I used -M 3. (I didn't let it run through 2, since it takes over 4 hours as is.) (OK, after all these tests, the dual stuff is running faster, but not fast enough over a full movie to even warrant the extra threads.)

Top output of the 3 running mpeg2enc threads, with mjpegtools 1.6.1.92 on the dual Athlon MP 2100+. That's with -M 3. Top's own usage is 2% and the decoder is only about 10%, intermittent, so I'm neglecting those for the moment. I'm using transcode, by the way (though I found the same results when not using transcode and doing a straight pipe from decoded MPEG-2 frames). Note the top dumps below ignore the memory usage (there is approximately 640MB of free RAM - really free, not cache or anything; it's a clean boot, 127 processes running in all cases).
Cpu0 : 50.0% user, 8.6% system, 0.0% nice, 41.4% idle
Cpu1 : 53.4% user, 4.3% system, 0.0% nice, 42.2% idle

  PID USER   PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
11234 slepp  16  0 43436  42m 968 S 38.2  4.2 0:16.96 mpeg2enc
12422 slepp  16  0 43436  42m 968 S 34.5  4.2 0:16.86 mpeg2enc
  623 slepp  16  0 43436  42m 968 R 33.6  4.2 0:17.14 mpeg2enc

Command line:

time /usr/bin/transcode -u 120,2 -M 0 -V -q 1 -f 24,1 --color 1 -x mpeg2,null -y mpeg2enc,null -e 48000,16 -A -N 0x2000 -F 8,'-S -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0' --pulldown -w 9800 -i 28DaysLater.m2v -o test3 --print_status 50 -c 0-1000

Results:

[import_mpeg2.so] tcextract -x mpeg2 -i 28DaysLater.m2v -d 1 | tcdecode -x mpeg2 -d 1 -y yv12
[export_mpeg2enc.so] *** init-v *** !
[export_mpeg2enc.so] cmd=mpeg2enc -v 0 -I 0 -f 8 -b 9800 -F 1 -n n -p -a 3 -o test3.m2v -S -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0
++ WARN: [mpeg2enc] 3:2 movie pulldown with frame rate set to decode rate not display rate
++ WARN: [mpeg2enc] 3:2 Setting frame rate code to display rate = 4 (29.970 fps)
encoding frame [950], 14.93 fps, 95.2%, ETA: 0:00:03, ( 0| 0|116)
clean up | frame threads | unload modules | cancel signal | internal threads | done
[transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s
73.56user 7.76system 1:09.29elapsed 117%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2055major+31007minor)pagefaults 0swaps

(I can't find how to turn off line wrap. Sorry...)

Note I used 120 incoming frame buffers with 2 threads decoding the video. The buffer usage of transcode never dropped below 90 frames buffered, so the buffering was keeping pace.
Here's the identical command, the only thing changed is -M 3 to -M 2 (this time I included a snapshot of tcdecode, but note that it isn't always in the top 3 of the list, it comes and goes quite frequently, and the transcode buffers stay right around 110 to 116 frames):

Cpu0 : 61.8% user, 7.3% system, 0.0% nice, 30.9% idle
Cpu1 : 50.5% user, 12.8% system, 0.0% nice, 36.7% idle

  PID USER   PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
20631 slepp  19  0 39824  38m 984 R 51.1  3.9 0:03.79 mpeg2enc
14434 slepp  17  0 39824  38m 984 R 45.7  3.9 0:03.94 mpeg2enc
29969 slepp  16  0  2644 2644 668 S 13.7  0.3 0:01.95 tcdecode

And the output of time (and the end of transcode):

encoding frame [950], 14.33 fps, 95.2%, ETA: 0:00:03, ( 0| 0|116)
clean up | frame threads | unload modules | cancel signal | internal threads | done
[transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s
74.89user 7.68system 1:11.95elapsed 114%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1979major+26920minor)pagefaults 0swaps

And with -M 1 instead of -M 2:

Cpu0 : 87.0% user, 13.0% system, 0.0% nice, 0.0% idle
Cpu1 : 22.2% user, 5.6% system, 0.0% nice, 72.2% idle

  PID USER   PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
31916 slepp  25  0 36192  35m 984 R 90.3  3.5 0:07.58 mpeg2enc
 3690 slepp  16  0  2644 2644 668 S 14.7  0.3 0:01.91 tcdecode

Note that it's now using an entire CPU (other processes keep sharing, but it's still using a full CPU). And the transcode/time results:

encoding frame [950], 14.19 fps, 95.2%, ETA: 0:00:03, ( 0| 0|117)
clean up | frame threads | unload modules | cancel signal | internal threads | done
[transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s
73.98user 7.51system 1:12.42elapsed 112%CPU
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 16 Dec 2003, Slepp Lukwai wrote: Tried it without any options, same effect. I'm definitely seeing nowhere near 40% speedup, which is what boggles me. I expected at least reasonable gains of 25%. I think that has to do with the -I setting... Sorry, upon further testing, I actually average around 14fps at DVD quality (720x480, 9800kbit/s). (see all the details of my command lines Ah, that's more like it then. It's interesting that I'm faster with dual 2100s than the dual 2800 (or at least on par). I suppose it really comes down to command line options, but you would need to compare those yourself (since I haven't Friend of mine has dual 2400s and my setup is ~10-15% faster as I recall - he's getting around 11fps as a rule where I see 14 or so. I'm usually adding a bit of overhead with the chroma conversion. I build smilutils with ffmpeg/libavcodec (to use ffmpeg's DV codec) and then run the data thru something like: smil2yuv -i 2 file.dv | filters | y4mscaler -O chromass=420_MPEG2 |... Produces better output than the default which uses libdv but does cost a bit in cpu use. According to the docs -I 1 turns on interlacing support, and causes unneeded overhead if it is known progressive material. Hence the -I 0 (plus transcode sets that, though I could override it). But unless you have the raw 23.976fps progressive data (with the 3:2 pulldown undone) then I think '-I 1' is the option to use. But then I might be confused (wouldn't be the first time ;)). That would explain why the encoding rate I see is lower since I'm using -I 1. wrong I'm sure someone will tactfully point that out ;)) the speedup comes from the motion estimation of the 2 fields/frame being done in parallel. Oh. Son of a... If that's all it is... Yep - I'm fairly sure that is why you're not seeing any improvement when using -M 2. without B frames. Those are computationally a lot more expensive than I or P frames. -R 0 will disable B frames.
I just enabled that, and that's how I'm hitting 15fps instead of 8, and the quality is good and the size is just fine. Great! It takes, from what I've seen, extraordinarily clean sources before -R 0 has little or no effect. to their potentials and give me the equivalent of a 4200+ ; If it takes 6 hours to transcode a movie because I set -r32 (I noticed a larger difference with -4 -2 options, btw, than -r16 vs -r32), that's fine, but Yep - -4 1 will close to double the time over -4 2 and the difference in bitrate/filesize is measured in tenths of a percent. Hardly worth it. Not all that much difference between -4 2 and -4 3 though. better results (especially with clean source material) can be obtained with -E -8 or perhaps -E -10. Until I upgraded to .92, I didn't have those options. I'm using them On noisy source material the -E option has almost no effect but the cleaner the input the more effect even modest values of -E have. now, in combination with -Q, but I find the artifacts are almost never there (I used to do -q 4 and -Q 4.0, and it looked about the same as the 5/3.0). Perhaps Richard Ellis could chime in with his experiences with -Q ;) Right, with -I 0 the cpus take turns but there's little parallelism. And again, son of I didn't realize the parallelization was done based on interlacing settings. Looking back on it that makes sense though. A P frame depends on the preceding P frame - rather sequential in nature since you can't move on to the next one without completing the first one... The MPEG decoding doesn't take much, and the pipe overhead is negligible, Pipe overhead sneaks up on you though. One pipe? Not a real problem, two? Begins to be noticed but isn't too bad. Four or five? Yeah, it starts to take a hit on the overall speed of the system - the data has to go up/down thru the kernel all those times and that's not free. (As I write this, I'm still waiting for the -M 2 run to finish, so it'll arrive before the test results to Bernhard make it out).
You might try, for timing purposes, without -I 0 and see what, if any effect that has. Might be a useful data point. Cheers, Steven Schultz --- This SF.net email is sponsored by: IBM Linux Tutorials. Become an expert in LINUX or just sharpen your skills. Sign up for IBM's Free Linux Tutorials. Learn everything from the bash shell to sys admin. Click now! http://ads.osdn.com/?ad_id=1278alloc_id=3371op=click ___ Mjpeg-users mailing list [EMAIL PROTECTED]
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
Hi all, First off, a bit of background to the multi-threading in the current stable branch:

- Parallelism is primarily frame-by-frame. This means that the final phases of the encoding lock on completion of the reference frame (prediction and DCT transform) and the predecessor (bit allocation). If you have a really fast CPU that motion estimates and DCT's very fast you will get lower parallelisation. If you use -R 0 you will get very little parallelism *at all*. Certainly not enough to make -M 3 sensible.

- There is also a parallel read-ahead thread but this rarely soaks much CPU on modern CPUs.

The MPEG_DEVEL branch encoder stripes all encoding phases to allow much more scalable parallelisation. You might want to give it a go - I'd be interested in the results! N.b. in a 'realistic' scenario you're running the multiplexer and audio encoding in parallel with the encoder and video filters communicating via pipes and named FIFOs. This setup usually saturates a modern dual machine. cheers, Andrew PS I'm away on vacation for a couple of weeks from friday so there'll be a bit of pause in answering emails / posts from then ;-)
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
Hallo Top output of the 3 running mpeg2enc with mjpegtools 1.6.1.92 on the Dual Athlon MP 2100+. That's with -M3. [...] Command line: time /usr/bin/transcode -u 120,2 -M 0 -V -q 1 -f 24,1 --color 1 -x mpeg2,null -y mpeg2enc,null -e 48000,16 -A -N 0x2000 -F 8,'-S -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0' --pulldown -w 9800 -i 28DaysLater.m2v -o test3 --print_status 50 -c 0-1000 Could you run a few tests (please). Get some frames (100-1000) in yuv format. I guess that should be possible even with transcode. ;) (I do not use transcode so I can't help, or get the test streams on mjpeg.sf.net) And do afterwards something like that: cat stream.yuv | mpeg2enc -f8 -M 0-3 -o test.m2v or lav2yuv stream.avi | mpeg2enc -f 8 -M 0-3 -o test.m2v So you can be sure that nothing else makes any trouble. And check how it is going. That should not take too long. Then you can add the options you used, to see if anything there causes the problem of the non-increasing framerate. I use the 2.6.0-test8 kernel. Maybe that changes the situation. I used to be using 2.5.63 or similar, but have rebuilt the machine with 2.4.20 with scheduling optimizations and other goodies (gentoo).
I noticed a number of speed ups in most other parallel processes (cinelerra, MPI povray, gcc). Of course, most of the patches in the gentoo 2.4.20 kernel are stock in 2.5+ (I also used 2.6.0-test8, but this Asus board doesn't behave under that kernel, and it crashed whenever I'd load the CPUs or IDE buses :) Bad. Which board do you have? (Mine is a Tyan Tiger MPX) Sorry if the mail is a bit confusing, [...] Hopefully this one didn't ramble on TOO long. My brain had given up by the time I started my computer that evening ;) But I'm not really knowing why the situation is that bad. auf hoffentlich bald, Berni the Chaos of Woodquarter Email: [EMAIL PROTECTED] www: http://www.lysator.liu.se/~gz/bernhard
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, Dec 16, 2003 at 09:27:53AM -0800, Steven M. Schultz wrote: Perhaps Richard Ellis could chime in with his experiences with -Q ;) It seems that with the right set of options, and the right set of input data, -Q can help to create some really nasty looking artifacts. And again, son of I didn't realize the parallelization was done based on interlacing settings. Looking back on it that makes sense though. A P frame depends on the preceding P frame - rather sequential in nature since you can't move on to the next one without completing the first one... The P frame dependency chain is how the artifacts come about, based on Andrew's explanation. It's accumulated round-off error in the iDCT routines. Made worse by -Q as well as -R 0 and a few other options.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, Dec 16, 2003 at 12:33:52AM -0700, Slepp Lukwai wrote: On Mon, 2003-12-15 at 21:08, Richard Ellis wrote: Additionally, what kind of memory do you have attached to the cpus? Mpeg encoding is very memory bandwidth hungry to begin with, and with two cpus trying to eat at the same trough, a not quite as fast as it should be memory subsystem can produce results like what you are seeing. ... ... It's a dual Athlon, which inherently means 266FSB (DDR 266), though the memory is actually Hynix PC3200 w/ timings set as low as they go on this board (2-2-2), which gives me about 550MB/s memory bandwidth according to memtest, with a 13GB/s L1 and something like 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1. At 550MB/s, it SHOULD be able to push enough to keep the frames encoding at 100% CPU, in theory. Yes, but just one 720x480 DVD quality frame is larger than 256k in size, so a 256k cache per CPU isn't helping too much overall considering how many frames there are in a typical video to be encoded. Plus, my experience with Athlons is that they are actually faster at mpeg2enc encoding than Intel chips of equivalent speed ratings (the Athlon's 3dnow/mmx implementation is faster) and so they put a heavier stress on one's memory bandwidth than an equivalent speed Intel chip would. It's possible that 275MB/s per CPU just isn't fast enough to keep up with the rate that mpeg2enc can consume data on an Athlon. Of course, Andrew would be much better suited to discuss mpeg2enc's memory access patterns during encoding, which depending on how it does go about accessing memory can better make use of the 256k of cache, or cause the 256k of cache to be constantly thrashed in and out. FWIW, when my desktop machine was a dual PII-400Mhz box, I almost always had two mpeg2enc threads eating up 97-98% cpu on both PII chips. The few times both cpus were not fully saturated at mpeg encoding was when I'd bother them with something silly like browsing the web with mozilla.
:) Now that's just silly. Why would you hurt the CPUs by running such bloat as Mozilla? I can't think of how many times Mozilla has gone nuts on me and used 100% CPU without reason, and you can't kill it any normal UI way.. Good ol' killall. However, I love it. It's a great browser. Just rather hungry at times. I suppose there's a reason the logo is a dinosaur. :) Hmm... Interesting. I've had it sometimes just stop but never go nuts with 100% CPU, and although I usually do CLI kill it if need be, FVWM2's destroy window command has never failed to get rid of it if I don't bother to go CLI to do so. In fact, FVWM2's destroy has never failed to get rid of anything that went wonky. It's the X windows equivalent to a kill -9 from the CLI.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 16 Dec 2003, Andrew Stevens wrote: Hi all, First off a bit of background to the multi-threading in the current stable branch. First off: - Parallelism is primarily frame-by-frame. This means that the final phases of the encoding lock on completion of the reference frame (prediction and DCT If one were using closed and fixed length GOPs would it make sense to parallelize the encoding of complete GOPs? Each cpu could be dispatched a set of N frames that comprise a closed GOP and a master thread could write the GOPs out in the correct order. But as Andrew mentioned - by the time filters and other processing are added, a dual cpu system's pretty well saturated. Quad cpu systems are very much a niche (and expensive) item (not to mention the noise they make ;)) Cheers, Steven Schultz
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 16 Dec 2003, Richard Ellis wrote: 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1. At 550MB/s, it SHOULD be able to push enough to keep the frames encoding at 100% CPU, in theory. Yes, but just one 720x480 DVD quality frame is larger than 256k in size, so a 256k cache per CPU isn't helping too much overall considering how many frames there are in a typical video to be A 720x480 4:2:0 frame is about 512KB, at 550MB/sec there is enough memory bandwidth to encode at about 1000 frames/sec if all you had to do was read the data. Obviously the encoder runs somewhat slower than that, so each byte of data must be accessed multiple times. That's where the cache helps. Of course, Andrew would be much better suited to discuss mpeg2enc's memory access patterns during encoding, which depending on how it does go about accessing memory can better make use of the 256k of cache, or cause the 256k of cache to be constantly thrashed in and out. I seem to recall that one of the biggest performance bottlenecks of mpeg2enc is the way it accesses memory. It runs each step of the encoding process on an entire frame at a time. It's much more cache friendly to run every stage of the encoding process on a single macroblock before moving on to the next macroblock.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 16 Dec 2003, Steven M. Schultz wrote: First off a bit of background to the multi-threading in the current stable branch. First off: - Parallelism is primarily frame-by-frame. This means that the final phases of the encoding lock on completion of the reference frame (prediction and DCT If one were using closed and fixed length GOPs would it make sense to parallelize the encoding of complete GOPs? Each cpu could be dispatched a set of N frames that comprise a closed GOP and a master thread could write the GOPs out in the correct order. But what about bit allocation? You need to know how big the last GOP was to figure out how many bits you can use for the next GOP.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
Hi Steven, Trent, But what about bit allocation? You need to know how big the last GOP was to figure out how many bits you can use for the next GOP. Actually, this is not such a big deal provided the GOPs are well separated. Simplifying a little, you just need to ensure that you have >= the assumed amount of decoder buffer full at the end of each 'chunk' as you assumed when starting to encode its successor. However, this idea came to mind more as a sneaky way of doing accurately sized single-pass encoding: work on multiple 'segments' spread across the video sequence so you get a good statistical sample of how your total bit-consumption is going relative to your target. This is rotten for parallelism though because you have two more or less totally uncorrelated memory footprints. For DVD, 'segments' would kind of naturally correlate with 'chapters' at the authoring level. In the MPEG_DEVEL branch encoding of each frame (apart from the bit-packed coding and bit allocation which is only a small fraction of the CPU load) is simply striped across the available CPUs. This has a nice side effect of reducing each CPU's working set too as it only deals with a fraction of a frame. Having said all that I'll probably simply do a simple two-pass encoding mode first (much simpler frame feeding!). Of course, Andrew would be much better suited to discuss mpeg2enc's memory access patterns during encoding, which depending on how it does go about accessing memory can better make use of the 256k of cache, or cause the 256k of cache to be constantly thrashed in and out. I seem to recall that one of the biggest performance bottlenecks of mpeg2enc is the way it accesses memory. It runs each step of the encoding process on an entire frame at a time. It's much more cache friendly to run every stage of the encoding process on a single macroblock before moving on to the next macroblock. The single-macroblock approach has been implemented for quite some time now (since the move to C++ roughly).
In rather basic English, speed improved by... bugger all. I was *most* surprised, it could well be that the story is rather different on multi-CPU machines. At least I like to hope the work wasn't wasted ;-) Actually, the memory footprint of encoding is much larger than you'd think. Remember each 16x16 int16_t difference macroblock gets generated from nastily unaligned 16x16 or 16x8 uint8_t predictors and a 16x16 uint8_t picture macroblock. The difference is then DCT-ed in place into 4 8x8 int16_t DCT blocks which are then quantised into 4 8x8 int16_t quantised DCT blocks. Where mpeg2enc could speed up is:

- DCT blocks are in 'correct' and not transposed form. This is simply a waste as by transposing quantiser matrices and the scan sequence you can simply skip this.
- Each quantised DCT block is separately stored. Nice for debugging, poor for memory performance ;-)
- DCT is not combined with quantisation when this is possible.
- Motion estimation (probably wastefully) computes a lot of variances that could probably better be replaced by SAD for fast encoding modes.
- The current GOP sizing approach is wasteful. Frame type should only be decided once the best encoding mode (Intra, various inter motion prediction modes) is known. Basically, you turn a B/P frame into an I frame if you've reached your GOP length limit or it has enough Intra coded blocks that it is more compact that way. Unfortunately, the current allocation algorithm still has a few 'left over' elements that need to know GOP size in advance that need to be replaced before this can be fixed.

I'm currently working on bit-allocation (basically, a two-pass / look-ahead mode plus the above improvement). A similar approach can be used for deciding B/P frame selection but this is expensive in CPU as you basically have to encode each potential B frame's reference frame twice.
I'm playing around with ideas for trying B frames out and if they don't seem worthwhile turning them off and then periodically checking if it might make sense to turn them on again. Andrew
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, Dec 16, 2003 at 12:45:48PM -0800, Trent Piepho wrote: On Tue, 16 Dec 2003, Richard Ellis wrote: 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1. At 550MB/s, it SHOULD be able to push enough to keep the frames encoding at 100% CPU, in theory. Yes, but just one 720x480 DVD quality frame is larger than 256k in size, so a 256k cache per CPU isn't helping too much overall considering how many frames there are in a typical video to be A 720x480 4:2:0 frame is about 512KB, at 550MB/sec there is enough memory bandwidth to encode at about 1000 frames/sec if all you had to do was read the data. Obviously the encoder runs somewhat slower than that, so each byte of data must be accessed multiple times. That's where the cache helps. With motion estimation each byte would end up being accessed more than once for each new radius that was examined. Plus motion estimation is between at least two frames, so we are dealing with at least about 1M of data to be accessed eventually in the course of encoding one frame. Of course, Andrew would be much better suited to discuss mpeg2enc's memory access patterns during encoding, which depending on how it does go about accessing memory can better make use of the 256k of cache, or cause the 256k of cache to be constantly thrashed in and out. I seem to recall that one of the biggest performance bottlenecks of mpeg2enc is the way it accesses memory. It runs each step of the encoding process on an entire frame at a time. It's much more cache friendly to run every stage of the encoding process on a single macroblock before moving on to the next macroblock. In that case it will kill the majority of the performance benefit provided by the caches, because there's very little locality of reference for the cache to compensate for. It moves through at least 512k for pass one, then through the same 512k again for pass two, but the data in the cache is from the end of the frame, and we are starting over at the beginning of the frame.
Massive cache thrash in that case. Memory bandwidth becomes a much more limiting factor.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
Hallo On Tue, 2003-12-16 at 12:57, Bernhard Praschinger wrote: Could you run a few tests (please). Get some frames (100-1000) in yuv format. I guess that should be possible even with transcode. ;) (I do not use transcode so I can't help, or get the test streams on mjpeg.sf.net) With about 1010 frames of YUV using to dump it in (instead of cat), I get these:

-M 0: 2m 11.9s
-M 1: 2m 10.6s, -1.3s
-M 2: 1m 27.7s, -44.2s
-M 3: 1m 26.5s, -45.4s

Those values look much better. :-) Now you have seen that mpeg2enc can go faster. I have tried the command you used on my machine, and I have seen the same problem. Also 3 processes and each only 33%. (time lav2yuv n1000.eli | mpeg2enc -I 0 -f 8 -b 9800 -p -a 3 -o test.m2v -S -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0) Note that I responded in an earlier message with a total of 24 timings across -M 0-3 -I 0-1 -R 0-2 settings, which turned up some interesting results: -M 3 -I 0 -R 1 worked fastest of all of them (same source material I used for the above, and it took 51 seconds). So, I think the -I 1 is on, which makes a huge boost in -M ratings from 0 to 3, but it is still quite a bit slower than -I 0 (which I use since the input is Progressive 23.976fps) That's strange. And do afterwards something like that: cat stream.yuv | mpeg2enc -f8 -M 0-3 -o test.m2v or lav2yuv stream.avi | mpeg2enc -f 8 -M 0-3 -o test.m2v So you can be sure that nothing else makes any trouble. And check how it is going. That should not take too long. Then you can add the options you used, to see if anything there causes the problem of the non-increasing framerate. Compared to the run with my long options line, these are I'm just running some encodings to see which option causes the problem. On my machine the -R 0 caused the problem. If I used -R 1/2 or no -R option, I got 3 processes each using about 45-50%. Bad. Which board do you have? (Mine is a Tyan Tiger MPX) Nice board, that one. Asus A7M-266D..
I should've grabbed the MSI K7D Master for the same price, I hear much nicer things about it. My brain had given up by the time I started my computer that evening ;) Mine usually does that at about 8am. :) Just as you enter work? ;) But I'm not really knowing why the situation is that bad. I'm just not seeing the dual CPU usage that would warrant even running in multiple threads, when I could instead transcode two entirely separate items as though I had two machines, which makes some sense (I did that the other day, worked rather well). But, if I can make a single copy work by flooding both CPUs with activity, then I'll be happier, since it should take quite a bit less time to encode a full movie. Encoding without the -R 0 seems to solve the problem, for now. auf hoffentlich bald, Berni the Chaos of Woodquarter Email: [EMAIL PROTECTED] www: http://www.lysator.liu.se/~gz/bernhard
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Mon, 15 Dec 2003, Slepp Lukwai wrote: faster to begin with. However, in both cases, after multiple tests and trying different things, I can't get the SMP modes to be fast at all. In fact, they're slower than the non-SMP modes. I think I see what you're doing that could cause that. I've never seen the problem - using -M 2 is not going to be 2x as fast though if that was the expectation. ~40% speedup or so is what I see (from about 10fps to 14fps) typically. When encoding with the -M 0 with .92, I get around 19fps. When I use -M That's full sized (720x480) is it? Sounds more like a SVCD or perhaps 1/2 D1 (bit of a misnomer - D1 is actually a digital video tape deck) at 352x480. At 1/2 size yes, around 20fps or a bit more I've seen. But I'm usually tossing in a bit of filtering so the process is a bit slower. I installed 'buffer', set it up with a 32MB buffer and put it in the 10MB is about all I use - it's just a cushion to prevent the encoder from having to wait (-M 1 is the default - there's I/O readahead going on) for input. Has anyone found a way around this, or is it time to look at the source and see what's up? And for reference, it's a dual Athlon MP 2100+, which is below the '2600' that the Howto references as fast. I'm using dual 2800s and around 14-15fps for DVD encoding is what I usually get. The actual command line is: mpeg2enc -v 0 -I 0 -f 8 -b 9800 -F 1 -n n -p -a 3 -o test.m2v -S -M 3 -4 2 -2 1 -r 32 -q 5 -Q 3.0 -K kvcd You have progressive non-interlaced source? If not then -I 0 is not the right option. The speedup from multiple processors comes, I believe (but if I'm wrong I'm sure someone will tactfully point that out ;)), from the motion estimation of the 2 fields/frame being done in parallel. Try -I 1 (or just leave out the '-I' and let it default). Oh, and there's no real benefit from going above -M 2. I had a 4 cpu box and tried -M 4 and saw no gain over -M 3 (which in turn was a very minimal increase over -M 2).
If you want to speed things up by a good percentage, try encoding without B frames. Those are computationally a lot more expensive than I or P frames; -R 0 will disable B frames. And do you realize that increasing the search radius (-r) slows things down? Leave the -r value at its default of 16 and you should see encoding speed up. All in all, the defaults are fairly sane, so if you're not certain about an option, well, let it default.

And drop the -Q unless you want artifacting - especially values over 2. Under some conditions (it's partly material dependent) the -Q can generate really obnoxious color blocks and similar artifacts. Much better results (especially with clean source material) can be obtained with -E -8 or perhaps -E -10.

> Of course the -M 3 changes to 2 and 0 in testing. I also tested it with
> and without the buffer program in the list. Another notable thing is
> that with the newest version .92, -M 3 causes three 33% usage processes

Right, with -I 0 the cpus take turns but there's little parallelism.

> to exist (leaving an entire CPU idle), while -M 2 causes two 60%
> processes to exist. With .90, -Mx causes 2 50-70% processes and the
> rest never do

Hmmm, I see 100% use on the two 2800s - but some of that would be the DV decoding and pipe overhead, of course. The first thing I'd try is lowering -r to 24 at most, or just defaulting it.

Cheers,
Steven Schultz

---
This SF.net email is sponsored by: IBM Linux Tutorials. Become an expert in LINUX or just sharpen your skills. Sign up for IBM's Free Linux Tutorials. Learn everything from the bash shell to sys admin. Click now! http://ads.osdn.com/?ad_id=1278alloc_id=3371op=click
___
Mjpeg-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/mjpeg-users
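Putting the advice in this message together, the poster's command line might be trimmed to something like this (a sketch only - the bitrate, quantisation, and -E values are illustrative, not a tested recipe):

```shell
# Sketch based on the advice above: -I is left out (so it defaults, which is
# what lets -M 2 parallelise the per-field motion estimation), -r is left at
# its default of 16, -Q is dropped, and -E -8 is used for quality instead.
# The remaining values are carried over from the original command line.
mpeg2enc -v 0 -f 8 -b 9800 -F 1 -n n -p -a 3 \
         -M 2 -4 2 -2 1 -q 5 -E -8 \
         -o test.m2v
```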
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
Hello,

> I was doing some testing of both the older version (1.6.1.90) and the
> newer version of mpeg2enc (1.6.1.92). First off, the .92 was somewhat
> faster to begin with. However, in both cases, after multiple tests and
> trying different things, I can't get the SMP modes to be fast at all.
> In fact, they're slower than the non-SMP modes.

By "slower", I hope you mean mpeg2enc needs more real (wall-clock) time to encode the movie, and not that the encoding uses more total CPU time.

> When encoding with the -M 0 with .92, I get around 19fps. When I use -M
> 2 or -M 3, I get around 14fps. The CPU utilization sits at about 60 to
> 70% across both CPUs, but hits 99.9% when using just one.

That's really strange. Which program did you use for monitoring your CPU utilisation? top and/or xosview? If you used 'time' to find out how much time was used, the important value for you is the 'real' line, not the 'user' line. The 'user' line reports the CPU time the command needed across both CPUs. On a dual machine that has nothing else to do, the real time is lower than the user time. The overhead of running 2 threads increases the user time a little, but lowers the real time.

> I installed 'buffer', set it up with a 32MB buffer and put it in the
> stream, and it didn't make any difference at all. It would be nice to
> use mpeg2enc on two CPUs to its full speed, which would net me faster
> than real-time, but thus far I haven't been able to.

What was your full command? When I use 'lav2yuv files | mpeg2enc -f 8 -o test.m2v' on my system (the 2600 Athlon MP I mentioned in the HOWTO), mpeg2enc needs nearly 100% of one CPU and lav2yuv needs another 5-10%. Encoding 1000 frames takes this amount of time: 2m16.944s. When I add -M 2 the speedup is nice: mpeg2enc has two threads, each needing about 65-70%, and lav2yuv needs about 15%. Encoding 1000 frames then takes: 1m37.881s. Adding buffer to a simple command line does not speed up anything.
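The real-vs-user distinction is easy to see for yourself; a minimal sketch using the shell's time (the timings shown are approximate):

```shell
# 'sleep' spends wall-clock (real) time but almost no CPU (user) time:
time sleep 1
# -> real is about 1s, user is close to 0s

# A two-threaded encoder is the opposite case: user time (summed over both
# CPUs) can exceed real time. So when timing -M 0 against -M 2, compare
# the 'real' lines, not the 'user' lines.
```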
buffer helps if you have a pipeline with several stages like: lav2yuv | yuvdenoise | yuvscaler | mpeg2enc

> Has anyone found a way around this, or is it time to look at the source
> and see what's up?

I have no need to, because I think it works properly.

> And for reference, it's a dual Athlon MP 2100+, which is below the
> '2600' that the Howto references as fast. Of course the -M 3 changes to
> 2 and 0 in testing. I also tested it with and without the buffer program
> in the list. Another notable thing is that with the newest version .92,
> -M 3 causes three 33% usage processes to exist (leaving an entire CPU
> idle), while -M 2 causes two 60% processes to exist. With .90, -Mx
> causes 2 50-70% processes and the rest never do anything.

Just for fun, I have tested it with -M 3, and then I saw 3 mpeg2enc threads each using about 45-50%; that improved the needed time compared to -M 2 by another 10 seconds. -M 4 didn't change much at all, only a 4th process needing about 10%. I use the 2.6.0-test8 kernel; maybe that changes the situation.

The percent numbers reported by top have to be read carefully. At least my top reports them relative to a single CPU, so you can have processes using up to 200% when both CPUs have full load. But in the task/CPU stats line, 100% utilisation stands for both CPUs.

Sorry if the mail is a bit confusing. Hopefully see you soon,

Berni the Chaos of Woodquarter
Email: [EMAIL PROTECTED]
www: http://www.lysator.liu.se/~gz/bernhard