Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

Slepp Lukwai Tue, 16 Dec 2003 02:50:04 -0800

On Mon, 2003-12-15 at 22:44, Bernhard Praschinger wrote:
> Hallo
> 
> > I was doing some testing of both the older version (1.6.1.90) and the
> > newer version of mpeg2enc (1.6.1.92). First off, the .92 was somewhat
> > faster to begin with. However, in both cases, after multiple tests and
> > trying different things, I can't get the SMP modes to be fast at all. In
> > fact, they're slower than the non-SMP modes.
> By slower, I hope you mean "mpeg2enc needs more time to encode the
> movie", 
> and not the time the encoding needs in "realtime".


Slower by wall clock. It took less time to re-encode the entire
thing with -M 0 than when I used -M 3. (I didn't let it run through -M 2,
since it takes over 4 hours as is.) (OK, after all these tests, the dual
stuff is running faster, but not enough faster over a full movie to even
warrant the extra threads.)

Here is the top output for the 3 running mpeg2enc threads with mjpegtools
1.6.1.92 on the dual Athlon MP 2100+. That's with -M 3. top itself uses
about 2% and the decoder only about 10%, intermittently, so I'm neglecting
those for the moment. I'm using transcode, by the way (though I found the
same results when not using transcode and doing a straight pipe from
decoded MPEG2 frames). Note that the top dumps below leave out the memory
summary; the machine has approximately 640MB of free RAM (really free, not
cache or anything; it's a clean boot, with 127 processes running in all
cases).

 Cpu0 :  50.0% user,   8.6% system,   0.0% nice,  41.4% idle
 Cpu1 :  53.4% user,   4.3% system,   0.0% nice,  42.2% idle
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11234 slepp     16   0 43436  42m  968 S 38.2  4.2   0:16.96 mpeg2enc
12422 slepp     16   0 43436  42m  968 S 34.5  4.2   0:16.86 mpeg2enc
  623 slepp     16   0 43436  42m  968 R 33.6  4.2   0:17.14 mpeg2enc

Command line:
time /usr/bin/transcode -u 120,2 -M 0 -V -q 1 -f 24,1 --color 1 -x
mpeg2,null -y mpeg2enc,null -e 48000,16 -A -N 0x2000 -F 8,'-S 9999 -M 3
-g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0' --pulldown -w 9800
-i 28DaysLater.m2v -o test3 --print_status 50 -c 0-1000

Results:
[import_mpeg2.so] tcextract -x mpeg2 -i "28DaysLater.m2v" -d 1 |
tcdecode -x mpeg2 -d 1 -y yv12
[export_mpeg2enc.so] *** init-v *** !
[export_mpeg2enc.so] cmd=mpeg2enc -v 0 -I 0 -f 8 -b 9800 -F 1 -n n -p -a
3 -o "test3".m2v -S 9999 -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K
kvcd -R 0
++ WARN: [mpeg2enc] 3:2 movie pulldown with frame rate set to decode
rate not display rate
++ WARN: [mpeg2enc] 3:2 Setting frame rate code to display rate = 4
(29.970 fps)
encoding frame [950],  14.93 fps, 95.2%, ETA: 0:00:03, ( 0| 0|116)
clean up | frame threads | unload modules | cancel signal | internal
threads | done
[transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s

73.56user 7.76system 1:09.29elapsed 117%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (2055major+31007minor)pagefaults 0swaps

(I can't find how to turn off line wrap. Sorry...)

Note I used 120 incoming frame buffers with 2 threads decoding the
video. The buffer usage of transcode never dropped below 90 frames
buffered, so the buffering was keeping pace.

Here's the identical command; the only change is -M 3 to -M 2
(this time I included a snapshot of tcdecode, but note that it isn't
always in the top 3 of the list; it comes and goes quite frequently, and
the transcode buffers stay right around 110 to 116 frames):

 Cpu0 :  61.8% user,   7.3% system,   0.0% nice,  30.9% idle
 Cpu1 :  50.5% user,  12.8% system,   0.0% nice,  36.7% idle
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20631 slepp     19   0 39824  38m  984 R 51.1  3.9   0:03.79 mpeg2enc
14434 slepp     17   0 39824  38m  984 R 45.7  3.9   0:03.94 mpeg2enc
29969 slepp     16   0  2644 2644  668 S 13.7  0.3   0:01.95 tcdecode

And the output of time (and the end of transcode):
encoding frame [950],  14.33 fps, 95.2%, ETA: 0:00:03, ( 0| 0|116)
clean up | frame threads | unload modules | cancel signal | internal
threads | done
[transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s

74.89user 7.68system 1:11.95elapsed 114%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (1979major+26920minor)pagefaults 0swaps


And with -M 1 instead of -M 2:

 Cpu0 :  87.0% user,  13.0% system,   0.0% nice,   0.0% idle
 Cpu1 :  22.2% user,   5.6% system,   0.0% nice,  72.2% idle
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31916 slepp     25   0 36192  35m  984 R 90.3  3.5   0:07.58 mpeg2enc
 3690 slepp     16   0  2644 2644  668 S 14.7  0.3   0:01.91 tcdecode

Note that it's now using an entire CPU (other processes keep sharing,
but it's still using a full CPU).

And the transcode/time results:

encoding frame [950],  14.19 fps, 95.2%, ETA: 0:00:03, ( 0| 0|117)
clean up | frame threads | unload modules | cancel signal | internal
threads | done
[transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s

73.98user 7.51system 1:12.42elapsed 112%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (1979major+25821minor)pagefaults 0swaps

We're up by 3 seconds over the initial -M 3 run.

And we get to -M 0:
 Cpu0 :  17.6% user,   7.4% system,   0.0% nice,  75.0% idle
 Cpu1 :  82.7% user,  16.4% system,   0.0% nice,   0.9% idle
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15460 slepp     25   0 36180  35m  980 R 95.9  3.5   0:05.50 mpeg2enc
27772 slepp     16   0  2644 2644  668 S 10.1  0.3   0:01.74 tcdecode

And the results of time:

encoding frame [950],  14.17 fps, 95.2%, ETA: 0:00:03, ( 0| 0|116)
clean up | frame threads | unload modules | cancel signal | internal
threads | done
[transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s
74.67user 7.93system 1:12.58elapsed 113%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (1978major+25818minor)pagefaults 0swaps

Now, we're still at 1:12.58, we started at 1:09.29. That's a mere 3.29
seconds difference on 1000 frames. However, it's just getting into the
swing of things. Here's -M 3, -M 2, -M 1 and -M 0 timings on 10,000
frames, same parameters otherwise (I ran the first timing through twice
in order to fully cache all input data that the processes were going to
use, to negate disk usage timings):

-M 0: 11:21.50 realtime, 690.97 user, 74.27 system, 112% CPU
  transcode says: 14.68 fps
-M 1: 11:20.07 realtime, 699.68 user, 69.98 system, 111% CPU
  transcode says: 14.71 fps
-M 2: 11:04.27 realtime, 689.30 user, 70.94 system, 114% CPU
  transcode says: 15.06 fps
-M 3: --incomplete--
  transcode says:

Note on this, now... When I run two separate transcodes to each do half
the movie, I get 27fps (100% on each CPU) combined rate. And updatedb
kicked in halfway through the -M 3 run, so I didn't let it finish. It
dropped to 8fps after the disks got pegged.
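
For anyone who wants to try the split-in-two approach, here is a minimal sketch of dividing the movie's frame count into contiguous chunks, one per process. The helper name split_frames is made up for illustration, and the ranges are half-open (start included, end excluded); how they map onto transcode's -c option is an assumption that should be checked against its documentation.

```python
def split_frames(total_frames, parts):
    """Split [0, total_frames) into `parts` contiguous half-open ranges."""
    base, extra = divmod(total_frames, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # Spread any remainder over the first `extra` chunks.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


# Frame count of this specific movie, taken from the message.
chunks = split_frames(198_890, 2)
for start, end in chunks:
    print(f"frames {start} to {end} ({end - start} frames)")
```

Each chunk could then be handed to its own encoder process, one per CPU.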

If we take the difference over 10,000 frames and put it into the frame
count of this specific movie (198,890), we get:

-M 0: 13,554 seconds
-M 2: 13,211 seconds

That's a difference of about 343 seconds (roughly 5.7 minutes). Yet if I
run a separate process on each chunk, I would see more like an 80%
improvement, instead of 2.5%. So I suppose I jumped the gun on it being
slower; however, it isn't reasonably faster for the supposed extra CPU usage.
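
The extrapolation can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it scales the 10,000-frame wall-clock times linearly to the full movie, which assumes the encoding rate stays constant over the whole run.

```python
def elapsed_to_seconds(elapsed):
    """Convert a time(1) style 'M:SS.ss' or 'H:MM:SS.ss' string to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


SAMPLE_FRAMES = 10_000   # frames encoded in each timed run
MOVIE_FRAMES = 198_890   # frame count of this specific movie

# Wall-clock ("elapsed") times from the 10,000-frame runs.
runs = {"-M 0": "11:21.50", "-M 2": "11:04.27"}

# Linear scaling: assumes the rate over 10,000 frames holds for the movie.
projected = {
    mode: elapsed_to_seconds(t) * MOVIE_FRAMES / SAMPLE_FRAMES
    for mode, t in runs.items()
}

saved = projected["-M 0"] - projected["-M 2"]
fraction = saved / projected["-M 0"]

for mode, secs in projected.items():
    print(f"{mode}: {secs:.1f} seconds")
print(f"saved: {saved:.1f} seconds ({fraction:.1%} of the -M 0 time)")
```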

However, the speed increase is showing, isn't it... But it's not much.
Not even close to the rates you presented. Based on Steven's suggestion
that -I 1 is required to get good parallel response, here's a result on
that (-M 3):

-M 3 -I 1: 0:02.05 realtime, 2.00 user, 0.05 system, 198% CPU
  transcode says: 4878.04 fps

Sorry, that was on the 256 processor cray........ Back to reality.. 

Okay. I made that up. The real results:

-M 3 -I 1: -- cancelled after 10 minutes, since it was only 40% done.
(It took 2:39 to do 1000 frames).
  transcode says: 6.76 fps

CPU usage was even lower (at about 100% of one CPU).

(That's the last time I trust transcode's fps ratings, btw, they seem to
be a bit skewed overall, as the timings are showing a difference in
encoding speed. But overall CPU usage is still within 2% of each other,
as shown by the time command).
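
One way to cross-check transcode's reported rate is to divide the frame count by the wall-clock time from the time command. A rough sketch, using the 10,000-frame runs; the wall-clock figure includes startup and teardown, so the two numbers need not match exactly.

```python
FRAMES = 10_000

# mode -> ((minutes, seconds) of wall-clock time, fps reported by transcode)
runs = {
    "-M 0": ((11, 21.50), 14.68),
    "-M 1": ((11, 20.07), 14.71),
    "-M 2": ((11, 4.27), 15.06),
}

measured = {}
for mode, ((minutes, seconds), reported) in runs.items():
    elapsed = minutes * 60 + seconds
    measured[mode] = FRAMES / elapsed
    print(f"{mode}: {measured[mode]:.2f} fps by wall clock, "
          f"{reported:.2f} fps reported")
```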

> > When encoding with the -M 0 with .92, I get around 19fps. When I use -M
> > 2 or -M 3, I get around 14fps. The CPU utilization sits at about 60 to
> > 70% across both CPUs, but hits 99.9% when using just one.
> That's really strange. 
> Which program did you use for monitoring your CPU utilisation?
> top and/or xosview ?

top, as well as a little dock application in KDE that shows CPU usage.
It never peaks to 100% (2 second updates).

> If you used time to find out how much time was used, the important
> value for you is the "real" line, and not the "user" line.
> The user line reports the time the command needed on both CPUs. On a
> dual machine that has nothing else to do, the real time is lower
> than the user time. The "overhead" you need for 2 threads increases the
> user time a little, but lowers the real time. 

Yes, I know. :> It's the realtime that is bothersome. Frame rate
consistently drops in all tests over a frame range. It starts out about
the same pace. Once you get a few thousand frames into it, it all levels
out and the timing goes way off.
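
To make the real/user relationship concrete, here is a small sketch of how the CPU percentage in the time output relates to the other three numbers, using the -M 3 run on 1000 frames. The function name cpu_percent is just for illustration.

```python
def cpu_percent(user, system, elapsed):
    """CPU time (user + system) as a percentage of wall-clock time."""
    return 100.0 * (user + system) / elapsed


# Figures from the -M 3 run: 73.56user 7.76system 1:09.29elapsed
elapsed = 1 * 60 + 9.29
pct = cpu_percent(73.56, 7.76, elapsed)

# On a two-CPU machine the ceiling is 200%, so this is the average
# number of CPUs kept busy over the whole run.
busy_cpus = pct / 100.0

print(f"{pct:.1f}% CPU, about {busy_cpus:.2f} of 2 CPUs busy on average")
```

A value near 200% would mean both CPUs were kept busy for the whole run.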

> > I installed 'buffer', set it up with a 32MB buffer and put it in the
> > stream, and it didn't make any difference at all. It would be nice to
> > use mpeg2enc on two CPUs to its full speed, which would net me faster
> > than real-time, but thus far I haven't been able to.
> What was your full command?

I inserted buffer into the transcode command list, basically. (Replaced
mpeg2enc with a C program that executed buffer as a replacement to the C
process and then called mpeg2enc-real as a child).

> When I use lav2yuv files | mpeg2enc -f 8 -o test.m2v. 
> On my system (the 2600 Athlon MP I mentioned in the howto) mpeg2enc needs
> nearly 100% of one CPU and lav2yuv needs another 5-10%. 
> Encoding of 1000 frames takes this amount of time: 2m16.944s
> 
> When I add -M 2 
> The speedup is nice, mpeg2enc has two threads each needing about 65-70%,
> lav2yuv needs about 15%.
> Encoding of 1000 frames takes this amount of time: 1m37.881s

I see nothing near that improvement. That's nearly 29% less wall-clock
time, though I'm finding it getting worse on long runs.
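
For reference, the quoted timings work out as follows. This is a sketch of the arithmetic only, taking the two "1000 frames" figures above at face value.

```python
def to_seconds(minutes, seconds):
    """Convert a shell `time` style 'XmY.YYYs' reading to seconds."""
    return minutes * 60 + seconds


single = to_seconds(2, 16.944)    # without -M: 2m16.944s
threaded = to_seconds(1, 37.881)  # with -M 2:  1m37.881s

speedup = single / threaded            # how many times faster
reduction = 1.0 - threaded / single    # fraction of wall-clock time saved

print(f"speedup: {speedup:.2f}x, time saved: {reduction:.1%}")
```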

> Adding buffer to a simple command line does not speed up anything.
> buffer helps if you have a pipeline with several stages like: lav2yuv |
> yuvdenoise | yuvscaler | mpeg2enc

Well that explains why it made no difference. :>

> > Has anyone found a way around this, or is it time to look at the source
> > and see what's up?
> I have no need, because I think it works properly. 

It used to. The older rips I did worked at a whopping 4fps, but I only
got 2fps if I didn't use -M2 or -M3. (I had really high quality
settings, since they were SVCD targets).

> > And for reference, it's a dual Athlon MP 2100+, which is below the
> > '2600' that the Howto references as fast.
> >
> > Of course the -M 3 changes to 2 and 0 in testing. I also tested it with
> > and without the buffer program in the list. Another notable thing, is
> > that with the newest version .92, -M3 causes three 33% usage processes
> > to exist (leaving an entire CPU idle), while M2 causes two 60% processes
> > to exist. With .90, -Mx causes 2 50-70% processes and the rest never do
> > anything.
> Just for fun, I have tested it with -M 3, and then I saw 3 mpeg2enc
> threads each using about 45-50%, which improved the needed time compared
> to -M 2 by another 10 seconds. -M 4 didn't change much at all, only a 4th
> process needing about 10%.

A note on that, as was stated in the first message. When using -M3 on .90,
one process remains idle almost the entire time. For example, if PID 1
and 2 used 10 minutes of CPU time, PID 3 used 30 seconds. With .92, all
processes share equal CPU time.

> I use the 2.6.0-test8 kernel. Maybe that changes the situation. 

I used to run 2.5.63 or similar, but have rebuilt the machine with
2.4.20 with scheduling optimizations and other goodies (gentoo). I
noticed a number of speedups in most other parallel processes
(cinelerra, MPI povray, gcc). Of course, most of the patches in the
gentoo 2.4.20 kernel are stock in 2.5+ (I also used 2.6.0-test8, but
this Asus board doesn't behave under that kernel, and it crashed
whenever I'd load the CPUs or IDE buses :<)

> The percent numbers reported by top have to be read carefully. At least
> my top reports them for a single CPU, so you can have processes using up
> to 200% and then both CPUs have full load. 
> But in the task/cpu stats line 100% utilisation is for both CPUs !!!!

Not in Irix mode. Nor in dual view mode. I checked all that. They are
reporting per CPU load, so I can have two running at 100% and showing
both CPUs completely busy.

> Sorry if the mail is a bit confusing,

Not at all. :> My response took nearly an hour to write, though, since
it took so long to transcode 10,000 frames. As an interesting tidbit,
however, the Athlon XP 3200+ I have sitting beside me, which I use as a
webserver and such, only manages about a 10% speed improvement over the
dual Athlon 2100, which I found rather amusing. However, that is
comparing 1.6.1.92 to 1.6.1.90, as well as some customized gcc flags to
the .92 build (the .90 was fairly stock standard build, though I imagine
I can squeeze quite a bit more speed out of the .90 if I were to
recompile it with better targets).

Hopefully this one didn't ramble on TOO long.



