Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2004-01-04 Thread Andrew Stevens

 In floating point, all you have to do is flip a sign bit.  But with
 integers, it's not so easy.  There is no instruction for absolute value in
 MMX, you have to use a four instruction sequence and two registers.  Slower
 than squaring a value, which only takes two instructions.

 I finally found a mmx2 reference, and you're right about that.  MMX2 added
 psadbw, packed sum of absolute differences.  If you have 8-bit unsigned
 data it makes computing SAD pretty darn easy, you can find the SAD of 8
 pixels in one instruction.

There are really very few MMX-only CPUs still in current use.   Basically everything 
after the K6 with plain MMX supports the psadbw instruction.
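As an illustration of what that buys you, here is a minimal C sketch of a
psadbw-style SAD using the SSE2 intrinsic _mm_sad_epu8 (the 128-bit form of
the same instruction); the function name and stride parameters are assumptions
for the example, not mpeg2enc code:

#include <emmintrin.h>   /* SSE2 intrinsics: _mm_sad_epu8 is the xmm form of psadbw */
#include <stdint.h>

/* SAD of a 16x16 block of 8-bit pixels against a candidate predictor.
 * Each _mm_sad_epu8 call produces two 64-bit partial sums of absolute
 * differences, which are accumulated and folded at the end. */
static uint32_t sad_16x16(const uint8_t *blk, int blk_stride,
                          const uint8_t *ref, int ref_stride)
{
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < 16; y++) {
        __m128i a = _mm_loadu_si128((const __m128i *)(blk + y * blk_stride));
        __m128i b = _mm_loadu_si128((const __m128i *)(ref + y * ref_stride));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
    }
    uint64_t sums[2];
    _mm_storeu_si128((__m128i *)sums, acc);   /* fold the two partial sums */
    return (uint32_t)(sums[0] + sums[1]);
}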

 Though why did mpeg2enc use variance in the first place?  Maybe it's a
 better estimator than SAD for motion compensation fit?

It's just not that simple.  It uses SAD for 'coarse' motion estimation and 
switches to variance for the final selection of the particular motion 
estimation mode.  This combination provides a good speed/quality trade-off.
Experiments with 'only variance' were unimpressive in their quality 
improvements and 'only SAD' costs quite a lot of quality for modest speed 
gain.   All the 'low hanging fruit' in the current motion estimation 
algorithm has long since been picked...

 The level of AltiVec optimization in ffmpeg vs mpeg2enc is probably an important
 factor in any speed difference, and one that wouldn't matter for other
 CPUs, where the level of MMX/MMX2/SSE optimization makes a large
 difference.

You have to be very careful to compare like coding profiles, motion search 
radii and suchlike too.  

It's easy to be twice as fast if you're simply trying half as hard to find a good 
encoding.   However, there *are* bottlenecks in mpeg2enc.   For a really speedy 
mode, predictive motion estimation algorithms would need to be used.

Andrew





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-20 Thread Steven M. Schultz

On 19 Dec 2003, Florin Andrei wrote:

 On Fri, 2003-12-19 at 01:49, Steven M. Schultz wrote:
 
  At any rate I checked out ffmpeg's mpeg2 encoding vs mpeg2enc on
  my G4 Powerbook.  Yes, ffmpeg has a big speed advantage (~2x) but

The difference is even larger than I thought...   ffmpeg was decoding
the DV file and encoding the audio at the same time but I had mpeg2enc
reading a pre-staged .y4m file.

 Any chance repeating that on an Intel or AMD processor?

ffmpeg -i input.dv -vcodec mpeg2video -f mpeg -b 5000 -g 15 foo.mpg

then compare to 

mpeg2enc -R 0 -b 5000 -4 3 -2 2 -o foo1.mpg  input.y4m

Have fun :)

At the moment my AMD system's booked solid for other encoding jobs, so it'll
be a couple of days before I can run some tests on it.    Hmmm, time to
fire up the dual P4 system and get it synced up on all the projects.
Maybe over the weekend.

Cheers,
Steven Schultz





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-19 Thread Andrew Stevens
On Tuesday 16 December 2003 23:35, Richard Ellis wrote:
Hi Richard,

 In that case it will kill the majority of the performance benefit
 provided by the caches, because there's very little locality of
 reference for the cache to compensate for.  It moves through at least
 512k for pass one, then through the same 512k again for pass two, but
 the data in the cache is from the end of the frame, and we are
 starting over at the beginning of the frame.  Massive cache thrash in
 that case.  Memory bandwidth becomes a much more limiting factor.

Exactly what I thought when I restructured encoding to a per-macroblock basis a 
few months back.  The performance gain was not measurable.

The key 'thinko' here is that most of the time goes into motion estimation, 
and in motion estimation the search windows of neighbouring macroblocks 
overlap by 90% or more.  Cache locality is pretty good.  Playing around with 
prefetch (etc.) has never brought measurable gains.

The main bottleneck in the current encoder (for modern CPUs) is the first 
phase (4X4 subsampling) of the subsampling motion estimation hierarchy.   For 
speed this would need to be replaced with a predictive estimator.

The next bottlenecks would be the run-length coding and the use of variance 
instead of SAD in motion compensation mode and DCT mode selection.  Sadly 
there's not too much that can be done easily about the former, and the latter 
cannot be removed without a noticeable reduction in encoding quality (I tried 
it :-().

However, I have some ideas to try when I get back in the new year!

Andrew






Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-19 Thread Trent Piepho
On Fri, 19 Dec 2003, Andrew Stevens wrote:
 The next bottlenecks would be the run-length coding and the use of variance 
 instead of SAD in motion compensation mode and DCT mode selection.  Sadly 

Is SAD really any faster to calculate than variance?  SAD uses an absolute
value-add operation while variance is multiply-add.  Multiply-add is usually
the most heavily optimized operation a cpu can perform.
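For reference, a plain-C sketch of the two metrics as they are usually written
(block size and prototypes are assumptions for illustration, not mpeg2enc's
code): SAD needs one subtract and one absolute value per pixel, while the
variance of the prediction error needs a multiply per pixel plus a
squared-mean correction at the end.

#include <stdint.h>
#include <stdlib.h>   /* abs() */

/* Sum of absolute differences over an n x n block: one subtract and one
 * absolute value per pixel. */
static unsigned block_sad(const uint8_t *a, const uint8_t *b, int stride, int n)
{
    unsigned sad = 0;
    for (int y = 0; y < n; y++, a += stride, b += stride)
        for (int x = 0; x < n; x++)
            sad += (unsigned)abs(a[x] - b[x]);
    return sad;
}

/* Returns N times the variance of the prediction error:
 * sum(d^2) - (sum d)^2 / N.  One multiply per pixel plus the correction. */
static int64_t block_variance(const uint8_t *a, const uint8_t *b, int stride, int n)
{
    int64_t sum = 0, sum_sq = 0;
    for (int y = 0; y < n; y++, a += stride, b += stride)
        for (int x = 0; x < n; x++) {
            int d = a[x] - b[x];
            sum    += d;
            sum_sq += d * d;
        }
    return sum_sq - (sum * sum) / (int64_t)(n * n);
}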






Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-19 Thread Trent Piepho
On Fri, 19 Dec 2003, Steven M. Schultz wrote:
 On Fri, 19 Dec 2003, Trent Piepho wrote:
 
  On Fri, 19 Dec 2003, Andrew Stevens wrote:
  
  Is SAD really any faster to calculate than variance?  SAD uses an absolute
  value-add operation while variance is multiply-add.  Multiply-add is usually
  the most heavily optimized operation a cpu can perform.
 
   Au contraire.   Multiply is a lot slower than abs().  All abs() has
   to do is flip a sign bit (effectively) and that's going to be a lot

In floating point, all you have to do is flip a sign bit.  But with integers,
it's not so easy.  There is no instruction for absolute value in MMX; you have
to use a four-instruction sequence and two registers.  That's slower than
squaring a value, which takes only two instructions. 

Though you can cleverly combine an unsigned subtraction and an absolute-value
operation into four instructions total, and perform it on eight unsigned bytes
at a time.  So you can compute the absolute value of differences quite a bit
faster under MMX than I thought you could.  Clearly SAD would be faster
than variance.

Another advantage of SAD is that you can examine an intermediate result more
easily than with variance.  That way you can short-circuit the SAD calculation
once the running sum exceeds the best SAD already found.
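A minimal C sketch of that short-circuit idea (illustrative only; the name and
interface are made up, not mpeg2enc's routine):

#include <stdint.h>
#include <stdlib.h>

/* SAD of a 16x16 block with early termination: once the running sum exceeds
 * the best SAD already found, the candidate cannot win, so the remaining
 * rows are skipped. */
static unsigned sad_16x16_limit(const uint8_t *blk, const uint8_t *ref,
                                int stride, unsigned best_so_far)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; y++, blk += stride, ref += stride) {
        for (int x = 0; x < 16; x++)
            sad += (unsigned)abs(blk[x] - ref[x]);
        if (sad >= best_so_far)
            return sad;              /* caller rejects this candidate */
    }
    return sad;
}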

   faster than any multiply.   And aren't there MMX2/SSE abs+add 
   instructions - that would make abs/add quite fast.

I finally found an MMX2 reference, and you're right about that.  MMX2 added
psadbw, packed sum of absolute differences.  If you have 8-bit unsigned data
it makes computing SAD pretty darn easy; you can find the SAD of 8 pixels in
one instruction. 

Though why did mpeg2enc use variance in the first place?  Maybe it's a better
estimator than SAD for motion compensation fit?

   At any rate I checked out ffmpeg's mpeg2 encoding vs mpeg2enc on
   my G4 Powerbook.  Yes, ffmpeg has a big speed advantage (~2x) but

The level of AltiVec optimization in ffmpeg vs mpeg2enc is probably an important
factor in any speed difference, and one that wouldn't matter for other CPUs,
where the level of MMX/MMX2/SSE optimization makes a large difference.





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-19 Thread Richard Ellis
On Fri, Dec 19, 2003 at 01:34:38AM -0800, Trent Piepho wrote:
 On Fri, 19 Dec 2003, Andrew Stevens wrote:
  The next bottlenecks would be the run-length coding and the use
  of variance instead of SAD in motion compensation mode and DCT
  mode selection.  Sadly
 
 Is SAD really any faster to calculate than variance?  SAD uses an
 absolute value-add operation while variance is multiply-add. 
 Multiply-add is usually the most heavily optimized operation a cpu
 can perform.

You are thinking of DSP chips, not general-purpose CPUs.  For DSPs,
yes, multiply-add is very heavily optimized, but for general-purpose
CPUs it's often not quite so heavily optimized.

Additionally, if you've got an SSE2 capable x86 chip, it's got
parallel SAD operations in the SSE2 instruction set.  There isn't an
SSE2 mul-add operation yet.





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-19 Thread Florin Andrei
On Fri, 2003-12-19 at 01:49, Steven M. Schultz wrote:

   At any rate I checked out ffmpeg's mpeg2 encoding vs mpeg2enc on
   my G4 Powerbook.  Yes, ffmpeg has a big speed advantage (~2x) but
   the resulting output is 'grainy' (same bitrate, no B frames) (and the
   rate control is, well, almost non existent - ~2x spikes that'd drive
   a hardware player nuts).   

Any chance repeating that on an Intel or AMD processor?

-- 
Florin Andrei

http://florin.myip.org/





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-17 Thread Slepp Lukwai
On Tue, 2003-12-16 at 12:33, Andrew Stevens wrote:
 Hi all,
 
 First off a bit of background to the multi-threading in the current stable 
 branch.  First off:
 
 - Parallelism is primarily frame-by-frame.  This means that the final phases 
 of the encoding lock on completion of the reference frame (prediction and DCT 
 transform) and the predecessor (bit allocation).   If you have a really fast 
 CPU that motion estimates and DCT's very fast you will get lower 
 parallelisation.  If you use -R 0 you will get very litte parallelism *at 
 all*.   Certainly not enough to make -M 3 sensible.

Yet again, good to know.

This command line (run in a triple loop over -M 0-3, -I 0-1 and -R 0-2):

 -f 8 -g 9 -G 18 -v 0 -E -10 -K kvcd -4 2 -2 1 -F 1  rawstream.yuv

produces the following encoding times for approximately 1010 frames.  Times
are real time / user time, which gives a bit of a view of how busy the CPUs
were during the real time (optimal should be 1m real time to 2m user time,
right?).  Average system time was 3.0s, +/- 0.2s, for all tests:

-M 0 -I 0 -R 0: 1m  6.082s  0m 50.050s  baselines
-M 0 -I 0 -R 1: 1m 16.545s  0m 58.980s  ..
-M 0 -I 0 -R 2: 1m 34.511s  1m 17.045s  ..
-M 0 -I 1 -R 0: 2m  7.344s  1m 49.495s  ..
-M 0 -I 1 -R 1: 1m 59.665s  1m 42.215s  ..
-M 0 -I 1 -R 2: 2m 30.990s  2m 30.990s  ..

-M 1 -I 0 -R 0: 1m  5.713s  0m 49.800s  -0.35s
-M 1 -I 0 -R 1: 1m 15.305s  0m 58.975s  -1.2s
-M 1 -I 0 -R 2: 1m 34.057s  1m 17.090s  -0.5s
-M 1 -I 1 -R 0: 2m  5.928s  1m 49.700s  -1.3s
-M 1 -I 1 -R 1: 1m 59.019s  1m 41.955s  -0.6s
-M 1 -I 1 -R 2: 2m 49.149s  2m 31.440s  +19.2s

-M 2 -I 0 -R 0: 1m  0.503s  0m 25.930s  -5.5s
-M 2 -I 0 -R 1: 0m 53.418s  0m 58.950s  -23s
-M 2 -I 0 -R 2: 1m  7.418s  1m 18.145s  -27s
-M 2 -I 1 -R 0: 1m 54.534s  1m 50.060s  -13s
-M 2 -I 1 -R 1: 1m 15.489s  0m  1.040s -- uhm...?
-M 2 -I 1 -R 2: 1m 54.720s  1m 16.720s  -36s

-M 3 -I 0 -R 0: 0m 57.533s  0m 50.610s  -8.5s
-M 3 -I 0 -R 1: 0m 51.541s  0m 40.265s  -25s
-M 3 -I 0 -R 2: 1m  5.996s  0m 54.325s  -29s
-M 3 -I 1 -R 0: 1m 50.570s  1m 49.715s  -17s
-M 3 -I 1 -R 1: 1m 14.462s  1m  8.530s  -45s
-M 3 -I 1 -R 2: 1m 36.192s  0m 52.145s  -54s

Interestingly, and I think this has to do with the I/O buffering, -M 0
is slower than -M 1 by a small fraction in all tests. And as Steven
Schultz had suggested, -I 1 is a bad, bad idea. It never improved
performance, and in fact made it quite a bit worse (the man page is
right :). (Of course, -M 1 will be at least two processes, and since I
have a real dual system, it makes sense, and may not hold true for a
single CPU.)

Also, encoding with one B frame is a touch faster in -I 1 mode than
encoding without them, but it is slower when you encode two B frames
instead of just one. I find this interesting.. I would have expected a
single B frame to take a bit longer than none at all, and that is the
case when -I 0 is on, but not when it's -I 1. Any ideas on that one?

In the end -M 3 is not appreciably faster with -I 0 -R 0, but flies along at
-I 0 -R 2 compared to baseline, and gets fair gains at -I 0 -R 1, while
dropping encoding time by another 14 seconds for the same frameset. So,
does this boil down to -M 3 -I 0 -R 1 being the fastest?

The numbers on -M 3 -I 1 -R 2 show a 54 second improvement over the
tests with -M 0, but it takes almost 50% longer than -M 3 -I 0 -R 1. The
file size of 3-1-2 is 13,807,067 and the file size of 3-0-1 is
13,402,673. The file is smaller, and is encoded faster, and viewing them
now, the quality is at least on par (3-0-1 looked a tad better).

 - There is also a parallel read-ahead thread but this rarely soaks much CPU on 
 modern CPUs.
 
 The MPEG_DEVEL branch encoder stripes all encoding phases to allow much more 
 scalable parallelisation.  You might want to give it a go - I'd be interested 
 in the results!

I'd love to, but I couldn't find it in CVS. I found everything else in
the SF CVS branch, but not mjpegtools itself.

 N.b. in a 'realistic' scenario you're running the multiplexer and audio 
 encoding in parallel with the encoder and video filters communicating via 
 pipes and named FIFO's.   This setup usually saturate a modern dual machine 

No multiplexing and no audio encoding (AC3 pass-through and multiplexing
of DVD streams is done after completion of the video encoding). There is
the overhead of decoding the original MPEG2 stream into YUV, but that's
about all that transcode (which I'm using) is dumping into the
pipe. I avoided any of that on this run by just dumping the file in an
already-decoded format (pgmtoy4m output).

 cheers,
 
   Andrew
 PS
 I'm away on vacation for a couple of weeks from friday so there'll be a bit of 
 pause in answering emails / posts from then ;-)





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-17 Thread Slepp Lukwai
On Tue, 2003-12-16 at 13:15, Richard Ellis wrote:
 On Tue, Dec 16, 2003 at 12:33:52AM -0700, Slepp Lukwai wrote:
 .. It's a dual Athlon, which inherently means 266FSB (DDR 266),
  though the memory is actually Hynix PC3200 w/ timings set as low as
  they go on this board (2-2-2), which gives me about 550MB/s memory
  bandwidth according to memtest, with a 13GB/s L1 and something like
  6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1.  At 550MB/s, it
  SHOULD be able to push enough to keep the frames encoding at 100%
  CPU, in theory.
 
 Yes, but just one 720x480 DVD quality frame is larger than 256k in
 size, so a 256k cache per CPU isn't helping too much overall
 considering how many frames there are in a typical video to be
 encoded.  Plus, my experience with Athlon's is that they are actually
 faster at mpeg2enc encoding that Intel chips of equivalent speed
 ratings (the Athlon's 3dnow/mmx implimentation is faster) and so they
 put a heavier stress on one's memory bandwidth than an equivalent
 speed Intel chip would.  It's possible that 275MB/s per CPU just
 isn't fast enough to keep up with the rate that mpeg2enc can consume
 data on an Athlon.

Yes, I expect the cache to only be able to fit the mpeg2enc code
sections, not any of the data it uses. If the code keeps getting bumped
out, then that's a problem. And 275MB/s may not be enough, true... It's
too bad the Athlon dual chipset (AMD 768MPX) can't do above about 140
MHz bus speeds to see how much memory speed affects it.

 Of course, Andrew would be much better suited to discuss mpeg2enc's
 memory access patterns during encoding, which depending on how it
 does go about accessing memory can better make use of the 256k of
 cache, or cause the 256k of cache to be constantly thrashed in and
 out.

It could be interesting to use cachegrind on mpeg2enc and see what it
declares for cache hit/miss, but I find cachegrind tends to make a 1
minute runtime hit 10 minutes, so I may not bother..

  Now that's just silly. Why would you hurt the CPUs by running such bloat
  as Mozilla? I can't think of how many times Mozilla has gone nuts on me
  and used 100% CPU without reason, and you can't kill it any normal UI
  way.. Good ol' killall. However, I love it. It's a great browser. Just
  rather hungry at times. I suppose there's a reason the logo is a
  dinosaur. :
 
 Hmm... Interesting.  I've had it sometimes just stop but never go
 nuts with 100% CPU, and although I usually do CLI kill it if need be,
 FVWM2's destroy window command has never failed to get rid of it if
 I don't bother to go CLI to do so.  In fact, FVWM2's destroy has
 never failed to get rid of anything that went wonky.  It's the X
 windows equivalent to a kill -9 from the CLI.

I've had it lock up and X becomes unresponsive since it's in a loop
doing some expensive operation of some sort. It's strange. I don't see
it nearly as often with the newer Mozillas as I did the old ones (in
fact, haven't seen it in over a month).





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-17 Thread Slepp Lukwai
On Tue, 2003-12-16 at 23:17, Bernhard Praschinger wrote:
  -M 0: 2m 11.9s
  -M 1: 2m 10.6s, -1.3s
  -M 2: 1m 27.7s, -44.2s
  -M 3: 1m 26.5s, -45.4s
 That values look much better.  :-)
 Now you have seen the mpeg2enc can go faster.

It's like it used to be. : I'm going to try it on a full video, with a
few options. I figure I'll let it run through 24 hours of encoding time
(about 6 different trials) and see how each result turns out, and so on.
I'll let you all know when it's done. :

 I have tried the command you used on my machine, and I have seen the 
 same problem. Also 3 processes and each only 33% .
 
 (time lav2yuv n1000.eli | mpeg2enc -I 0 -f 8 -b 9800 -p -a 3 -o test.m2v
 -S  -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0)

Yes... So it's definitely the -R 0, but -R 1 is faster than the default
of -R 2 (I think that's the default?)

  Note that I responded in an earlier message with a total of 24 timings
  across -M 0-3 -I 0-1 -R 0-2 settings, which turned up some interesting
  results that -M 3 -I 0 -R 1 worked fastest of all of them (same source
  material I used for the above, and it took 51 seconds). So, I think the
  -I 1 is on, which makes a huge boost in -M ratings from 0 to 3, but it
  is still quite a bit slower than -I 0 (which I use since the input is
  Progressive 23.976fps)
 Thats strange.

It makes a mild bit of sense.. But just a little.

 I'm just running some encodings to see which option causes the problem. 
 
 On my machine the -R 0 caused the problem. If I used -R 1/2 or or R
 option, I got 3 processes each using about 45-50%. 

Which should total about 150% CPU instead of the 99% that it uses with -R 0.

   My brain had given up the time I started my computer that evening ;)
  Mine usually does that at about 8am. :
 Just as you enter work ? ;)

Self-employed, so just as I crawl out of bed, and my brain stays
broken until about noon. That's what I get for staying up till 4am
playing with mpeg2enc. :

 Encoding without the -R 0 seems to solve the problem, by now.

I'm going to see what speeds I get about halfway through a video, when
nothing from disk is in cache anymore, the encoders/decoders are in full
swing, and everything sort of settles down.. Should be interesting.






Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-17 Thread Richard Ellis
On Tue, Dec 16, 2003 at 06:54:22PM -0700, Slepp Lukwai wrote:
 As a side note, I'm also using a 200Hz timer, instead of the standard
 100Hz. Though I don't see this doing anything but making it quicker, as
 it reduces latency on scheduling, while slightly increasing scheduler
 overhead and context switching (or is an SSE/3Dnow! CS really expensive,
 anyone know?).

A 200Hz timer will have only one effect on batch-type processes:
slowing them down.  And mpeg2enc is essentially a batch-type process. 
Why?  Because of the increased scheduler overhead.  Now, you may be
hard put to measure the slowdown because so many other effects will
swamp it (one HD seek that takes a few ms would swamp a large part of
the scheduler overhead), but it's still there.

The only thing that's quicker with a 200Hz timer is interactive
response, where you want to see your X cursor move the instant you
touch the mouse.

Yes, context switching (at least for SSE) is more expensive, because
the eight 128-bit SSE registers may need to be saved.  I don't know off
the top of my head if Intel implemented lazy context saves for SSE
like with the x86 FPU stack.  If they did, then not all context swaps
incur the SSE save overhead, but when one does, there is more data to
save.

 I wonder if it comes back to the increased timing of the scheduler?
 (Though it's using a supposed O(1) scheduler, which should offset
 that).

The O(1) scheduler does not change the context switch overhead
timing.  The O(1) scheduler simply says that no matter how many
processes are waiting to run, it takes constant time to find the next
one when we do need to context switch.  But a 200Hz timer will still
use up twice as much CPU time running the scheduler as a 100Hz timer will.





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Slepp Lukwai
On Mon, 2003-12-15 at 21:08, Richard Ellis wrote:
 What program are you using to monitor CPU usage while mpeg2enc runs? 
 Some versions of top (if you are using top) report percentages as a
 roll-up of the whole SMP machine, so that 3x33% usage really means
 99% utilization of the machine, where the machine means both
 processors combined.  Other versions report a per-cpu percentage
 instead of rolling everything together.

I hate the combined ratings, so I already set up top to report per-CPU
usage, so I can see 200% usage instead of it showing 50% as 100% on one
CPU (it's misleading when you deal with single CPUs almost all day for
work).

 Additionally, what kind of memory do you have attached to the cpu's? 
 Mpeg encoding is very memory bandwidth hungry to begin with, and with
 two cpu's trying to eat at the same trough, a not quite as fast as it
 should be memory subsystem can produce results like what you are
 seeing.  It's because with the two cpu's trying to run mpeg2enc, they
 together oversaturate the memory bus, causing both to wait.  But with
 only one mpeg2enc thread running, the entire memory bus bandwidth is
 available to that one cpu alone.

I've noticed. I never really saw how much memory it used until I used the
buffer program with -t. It was moving gigs of data for a short run of
frames (perhaps 10,000 frames). It's a dual Athlon, which inherently
means 266FSB (DDR 266), though the memory is actually Hynix PC3200 w/
timings set as low as they go on this board (2-2-2), which gives me
about 550MB/s memory bandwidth according to memtest, with a 13GB/s L1
and something like 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1.

At 550MB/s, it SHOULD be able to push enough to keep the frames encoding
at 100% CPU, in theory. I don't think there's enough overhead on this
machine to qualify as keeping it even half saturated. This is why I want
the Corsair XMS Pro memory with load meters on them. (Per bank load
meters, even).

 FWIW, when my desktop machine was a dual PII-400Mhz box, I almost
 always had two mpeg2enc threads eating up 97-98%cpu on both PII
 chips.  The few times both cpu's were not fully saturated at mpeg
 encoding was when I'd bother them with something silly like browsing
 the web with mozilla. :)

Now that's just silly. Why would you hurt the CPUs by running such bloat
as Mozilla? I can't think of how many times Mozilla has gone nuts on me
and used 100% CPU without reason, and you can't kill it any normal UI
way.. Good ol' killall. However, I love it. It's a great browser. Just
rather hungry at times. I suppose there's a reason the logo is a
dinosaur. :





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Slepp Lukwai
On Mon, 2003-12-15 at 20:27, Steven M. Schultz wrote:
 On Mon, 15 Dec 2003, Slepp Lukwai wrote:
 
  faster to begin with. However, in both cases, after multiple tests and
  trying different things, I can't get the SMP modes to be fast at all. In
  fact, they're slower than the non-SMP modes.
 
   I think I see what you're doing that could cause that.   I've never
   seen the problem - using -M 2 is not going to be 2x as fast though
   if that was the expectation.   ~40% speedup or so is what I see
   (from about 10fps to 14fps) typically.

Tried it without any options, same effect. I'm definitely seeing nowhere
near 40% speedup, which is what boggles me. I expected at least
reasonable gains of 25%.

  When encoding with the -M 0 with .92, I get around 19fps. When I use -M
 
   That's full sized (720x480) is it?   Sounds more like a SVCD 
   or perhaps 1/2 D1 (bit of a misnomer - D1 is actually a digital
   video tape deck) at 352x480.At 1/2 size yes, around 20fps or a bit
   more I've seen.   But I'm usually tossing in a bit of filtering so
   the process is a slower.

Sorry, upon further testing, I actually average around 14fps at DVD
quality (720x480, 9800kbit/s). (See all the details of my command lines
in the post I sent in response to Bernhard.)

  I installed 'buffer', set it up with a 32MB buffer and put it in the
 
   10MB is about all I use - it's just a cushion to prevent the encoder
   from having to wait (-M 1 is the default - there's I/O readahead
   going on) for input.

Yeh, I tried 20 first, then 32, but in the end, it made no difference at
all.

  Has anyone found a way around this, or is it time to look at the source
  and see what's up?
   
  And for reference, it's a dual Athlon MP 2100+, which is below the
  '2600' that the Howto references as fast.
   
   I'm using dual 2800s and around 14-15fps for DVD encoding is what I
   usually get.

It's interesting that I'm faster with dual 2100s than the dual 2800 (or
at least on par). I suppose it really comes down to command line
options, but you would need to compare those yourself (since I haven't
seen yours).

  The actual command line is:
  mpeg2enc -v 0 -I 0 -f 8 -b 9800 -F 1 -n n -p -a 3 -o test.m2v -S  -M
  3 -4 2 -2 1 -r 32 -q 5 -Q 3.0 -K kvcd
 
   You have progressive non-interlaced source?   If not then -I 0 is
   not the right option. 

According to the docs -I 1 turns on interlacing support, and causes
unneeded overhead if the material is known to be progressive. Hence the -I 0
(plus transcode sets that, though I could override it).

   The speed up from multiple processors comes, I believe (but if I'm
   wrong I'm sure someone will tactfully point that out ;)) the speedup
   comes from the motion estimation of the 2 fields/frame being done in
   parallel.

Oh. Son of a... If that's all it is...

   Try -I 1 (or just leave out the '-I and let it default.
 
   Oh, and there's no real benefit from going above -M 2.   I had a 4
   cpu box and tried -M 4 and saw no gain over -M 3 (which in turn
   was a very minimal increase over -M 2).

I've never even bothered with -M 4 (well, not for a real run, anyway,
just as a quick test).

   If you want to speed things up by a good percentage try encoding
   without B frames.   Those are computationally a lot more expensive
   than I or P frames.   -R 0 will disable B frames.

I just enabled that, and that's how I'm hitting 15fps instead of 8, and
the quality is good and the size is just fine.

   And do you realize that increasing the search radius (-r) slows
   things down?Leave the -r value defaulted to 16 and you should
   see encoding speed up.

Yup, entirely aware. I do like the minor difference it makes, though.
I'm not in it for speed, really, I just want to see both CPUs get used
to their potentials and give me the equivalent of a 4200+ ; If it takes
6 hours to transcode a movie because I set -r32 (I noticed a larger
difference with -4 -2 options, btw, than -r16 vs -r32), that's fine, but
I feel it could be faster.

   All in all - the defaults are fairly sane so if you're not certain
   about an option, well, let it default.
 
   And drop the -Q unless you want artifacting - especially values over 2.   
   Under some conditions (it's partly material dependent) the -Q can
   generate really obnoxious color blocks and similar artifacts.Much
   better results (especially with clean source material) can be obtained
   with -E -8 or perhaps -E -10.

Until I upgraded to .92, I didn't have those options. I'm using them
now, in combination with -Q, but I find the artifacts are almost never
there (I used to do -q 4 and -Q 4.0, and it looked about the same as the
5/3.0).

  Of course the -M 3 changes to 2 and 0 in testing. I also tested it with
  and without the buffer program in the list. Another notable thing, is
  that with 

Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Slepp Lukwai
On Mon, 2003-12-15 at 22:44, Bernhard Praschinger wrote:
 Hallo
 
  I was doing some testing of both the older version (1.6.1.90) and the
  newer version of mpeg2enc (1.6.1.92). First off, the .92 was somewhat
  faster to begin with. However, in both cases, after multiple tests and
  trying different things, I can't get the SMP modes to be fast at all. In
  fact, they're slower than the non-SMP modes.
 With slower, I hope you mean mpeg2enc needs more time to encode the
 movie. 
 And not the time the encoding need in the realtime. 

Slower as in wallclock slower. It took less time to re-encode the entire
thing with -M 0 than when I used -M 3. (I didn't let it run through 2,
since it takes over 4 hours as is.) (OK, after all these tests, the dual
stuff is running faster, but not fast enough over a full movie to even
warrant the extra threads.)

Top output of the 3 running mpeg2enc processes with mjpegtools 1.6.1.92 on the
dual Athlon MP 2100+. That's with -M 3. top's own usage is 2% and the decoder
is only about 10%, intermittently, so I'm neglecting those for the moment.
I'm using transcode, by the way (though I found the same results when
not using transcode and doing a straight pipe from decoded MPEG2
frames). Note the top dumps below ignore memory usage (the machine has
approximately 640MB of free RAM, really free, not cache or anything;
it's a clean boot, with 127 processes running in all cases).

 Cpu0 :  50.0% user,   8.6% system,   0.0% nice,  41.4% idle
 Cpu1 :  53.4% user,   4.3% system,   0.0% nice,  42.2% idle
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
11234 slepp 16   0 43436  42m  968 S 38.2  4.2   0:16.96 mpeg2enc
12422 slepp 16   0 43436  42m  968 S 34.5  4.2   0:16.86 mpeg2enc
  623 slepp 16   0 43436  42m  968 R 33.6  4.2   0:17.14 mpeg2enc

Command line:
time /usr/bin/transcode -u 120,2 -M 0 -V -q 1 -f 24,1 --color 1 -x
mpeg2,null -y mpeg2enc,null -e 48000,16 -A -N 0x2000 -F 8,'-S  -M 3
-g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0' --pulldown -w 9800
-i 28DaysLater.m2v -o test3 --print_status 50 -c 0-1000

Results:[import_mpeg2.so] tcextract -x mpeg2 -i 28DaysLater.m2v -d 1 |
tcdecode -x mpeg2 -d 1 -y yv12
[export_mpeg2enc.so] *** init-v *** !
[export_mpeg2enc.so] cmd=mpeg2enc -v 0 -I 0 -f 8 -b 9800 -F 1 -n n -p -a
3 -o test3.m2v -S  -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K
kvcd -R 0
++ WARN: [mpeg2enc] 3:2 movie pulldown with frame rate set to decode
rate not display rate
++ WARN: [mpeg2enc] 3:2 Setting frame rate code to display rate = 4
(29.970 fps)
encoding frame [950],  14.93 fps, 95.2%, ETA: 0:00:03, ( 0| 0|116)
clean up | frame threads | unload modules | cancel signal | internal
threads | done
[transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s

73.56user 7.76system 1:09.29elapsed 117%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (2055major+31007minor)pagefaults 0swaps

(I can't find how to turn off line wrap. Sorry...)

Note I used 120 incoming frame buffers with 2 threads decoding the
video. The buffer usage of transcode never dropped below 90 frames
buffered, so the buffering was keeping pace.

Here's the identical command, the only thing changed is -M 3 to -M 2
(this time I included a snapshot of tcdecode, but note that it isn't
always in the top 3 of the list, it comes and goes quite frequently, and
the transcode buffers stay right around 110 to 116 frames):

 Cpu0 :  61.8% user,   7.3% system,   0.0% nice,  30.9% idle
 Cpu1 :  50.5% user,  12.8% system,   0.0% nice,  36.7% idle
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
20631 slepp 19   0 39824  38m  984 R 51.1  3.9   0:03.79 mpeg2enc
14434 slepp 17   0 39824  38m  984 R 45.7  3.9   0:03.94 mpeg2enc
29969 slepp 16   0  2644 2644  668 S 13.7  0.3   0:01.95 tcdecode

And the output of time (and the end of transcode):
encoding frame [950],  14.33 fps, 95.2%, ETA: 0:00:03, ( 0| 0|116)
clean up | frame threads | unload modules | cancel signal | internal
threads | done
[transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s

74.89user 7.68system 1:11.95elapsed 114%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (1979major+26920minor)pagefaults 0swaps


And with -M 1 instead of -M 2:

 Cpu0 :  87.0% user,  13.0% system,   0.0% nice,   0.0% idle
 Cpu1 :  22.2% user,   5.6% system,   0.0% nice,  72.2% idle
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
31916 slepp 25   0 36192  35m  984 R 90.3  3.5   0:07.58 mpeg2enc
 3690 slepp 16   0  2644 2644  668 S 14.7  0.3   0:01.91 tcdecode

Note that it's now using an entire CPU (other processes keep sharing,
but it's still using a full CPU).

And the transcode/time results:

encoding frame [950],  14.19 fps, 95.2%, ETA: 0:00:03, ( 0| 0|117)
clean up | frame threads | unload modules | cancel signal | internal
threads | done
[transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s

73.98user 7.51system 1:12.42elapsed 112%CPU 

Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Steven M. Schultz

On Tue, 16 Dec 2003, Slepp Lukwai wrote:

 Tried it without any options, same effect. I'm definitely seeing nowhere
 near 40% speedup, which is what boggles me. I expected at least
 reasonable gains of 25%.

I think that has to do with the -I setting...

 Sorry, upon further testing, I actually average around 14fps at DVD
 quality (720x480, 9800kbit/s). (see all the details of my command lines

Ah, that's more like it then.   

 It's interesting that I'm faster with dual 2100s than the dual 2800 (or
 at least on par). I suppose it really comes down to command line
 options, but you would need to compare those yourself (since I haven't

Friend of mine has dual 2400s and my setup is ~10-15% faster as I
recall - he's getting around 11fps as a rule where I see 14 or so.

I'm usually adding a bit of overhead with the chroma conversion.  I
build smilutils with ffmpeg/libavcodec (to use ffmpeg's DV codec)
and then run the data thru something like:
smil2yuv -i 2 file.dv | filters | y4mscaler -O chromass=420_MPEG2 |...

That produces better output than the default, which uses libdv, but does
cost a bit in CPU use.

 According to the docs -I 1 turns on interlacing support, and causes
 un-needed overhead if it is known progressive material. Hence the -I 0
 (plus transcode sets that, though I could override it).

But unless you have the raw 23.976fps progressive data (with the 3:2
pulldown undone) then I think '-I 1' is the option to use.   But then 
I might be confused (wouldn't be the first time ;)).

That would explain why the encoding rate I see is lower since I'm
using -I 1.

  wrong I'm sure someone will tactfully point that out ;)) the speedup
  comes from the motion estimation of the 2 fields/frame being done in
  parallel.
 
 Oh. Son of a... If that's all it is...

Yep - I'm fairly sure that is why you're not seeing any improvement
when using -M 2.

  without B frames.   Those are computationally a lot more expensive
  than I or P frames.   -R 0 will disable B frames.
 
 I just enabled that, and that's how I'm hitting 15fps instead of 8, and
 the quality is good and the size is just fine.

Great!   It takes, from what I've seen, extraordinarily clean sources
before -R 0 has little or no effect.

 to their potentials and give me the equivalent of a 4200+ ; If it takes
 6 hours to transcode a movie because I set -r32 (I noticed a larger
 difference with -4 -2 options, btw, than -r16 vs -r32), that's fine, but

Yep - -4 1 will close to double the time over -4 2 and the 
difference in bitrate/filesize is measured in tenths of a percent. 
Hardly worth it.   Not all that much difference between -4 2 and
-4 3 though.

  better results (especially with clean source material) can be obtained
  with -E -8 or perhaps -E -10.
 
 Until I upgraded to .92, I didn't have those options. I'm using them

On noisy source material the -E option has almost no effect, but the
cleaner the input the more effect even modest values of -E have.

 now, in combination with -Q, but I find the artifacts are almost never
 there (I used to do -q 4 and -Q 4.0, and it looked about the same as the
 5/3.0).

Perhaps Richard Ellis could chime in with his experiences with -Q ;)

  Right, with -I 0 the cpus take turns but there's little parallelism.
 
 And again, son of I didn't realize the parallelization was done
 based on interlacing settings.

Looking back on it that makes sense though.   A P frame depends on the
preceding P frame - rather sequential in nature, since you can't
move on to the next one without completing the first one...

 The MPEG decoding doesn't take much, and the pipe overhead is negligible,

Pipe overhead sneaks up on you though.   One pipe?  Not a real problem.
Two?  Begins to be noticed but isn't too bad.   Four or five?   Yeah,
it starts to take a hit on the overall speed of the system - the data
has to go up and down through the kernel all those times and that's not free.

 (As I write this, I'm still waiting for the -M 2 run to finish, so it'll
 arrive before the tests results to Bernhard make it out).

You might try, for timing purposes, without -I 0 and see what effect,
if any, that has.   Might be a useful data point.

Cheers,
Steven Schultz




Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Andrew Stevens
Hi all,

First off, a bit of background on the multi-threading in the current stable 
branch:

- Parallelism is primarily frame-by-frame.  This means that the final phases 
of the encoding lock on completion of the reference frame (prediction and DCT 
transform) and the predecessor (bit allocation).   If you have a really fast 
CPU that motion estimates and DCTs very fast you will get lower 
parallelisation.  If you use -R 0 you will get very little parallelism *at 
all*.   Certainly not enough to make -M 3 sensible.

- There is also a parallel read-ahead thread but this rarely soaks much CPU on 
modern CPUs.

The MPEG_DEVEL branch encoder stripes all encoding phases to allow much more 
scalable parallelisation.  You might want to give it a go - I'd be interested 
in the results!

N.B. in a 'realistic' scenario you're running the multiplexer and audio 
encoding in parallel with the encoder and video filters, communicating via 
pipes and named FIFOs.   This setup usually saturates a modern dual machine.

cheers,

Andrew
PS
I'm away on vacation for a couple of weeks from Friday so there'll be a bit of a 
pause in answering emails / posts from then ;-)







Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Bernhard Praschinger
Hallo

 Top output of the 3 running mpeg2enc with mjpegtools 1.6.1.92 on the
 Dual Athlon MP 2100+. That's with -M3. Top usage is 2% and the decoder
 is only about 10% intermittent. So, I'm neglecting those for the moment.
 I'm using transcode, by the way (though I found the same results when
 not using transcode and doing a straight pipe from decoded MPEG2
 frames). Note the top dumps below ignore the memory usage (which has
 approximately 640MB of free RAM (really free, not cache or anything,
 it's a clean boot, 127 processes running in all cases)).
 
  Cpu0 :  50.0% user,   8.6% system,   0.0% nice,  41.4% idle
  Cpu1 :  53.4% user,   4.3% system,   0.0% nice,  42.2% idle
   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 11234 slepp 16   0 43436  42m  968 S 38.2  4.2   0:16.96 mpeg2enc
 12422 slepp 16   0 43436  42m  968 S 34.5  4.2   0:16.86 mpeg2enc
   623 slepp 16   0 43436  42m  968 R 33.6  4.2   0:17.14 mpeg2enc
 
 Command line:
 time /usr/bin/transcode -u 120,2 -M 0 -V -q 1 -f 24,1 --color 1 -x
 mpeg2,null -y mpeg2enc,null -e 48000,16 -A -N 0x2000 -F 8,'-S  -M 3
 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0' --pulldown -w 9800
 -i 28DaysLater.m2v -o test3 --print_status 50 -c 0-1000
Could you run a few tests (please)?  Get some frames (100-1000) in yuv
format. I guess that should be possible even with transcode. ;)
(I do not use transcode so I can't help; or get the test streams from
mjpeg.sf.net)

And afterwards do something like this:
cat stream.yuv | mpeg2enc -f8 -M 0-3 -o test.m2v 
or 
lav2yuv stream.avi | mpeg2enc -f 8 -M 0-3 -o test.m2v

So you can be sure that nothing else makes any trouble.  And check
how it is going.  That should not take too long.  Then you can add
the options you used, to see if anything there causes the problem of the
framerate not increasing. 

  I use the 2.6.0-test8 kernel. Maybe that changes the situation.
 I used to be using 2.5.63 or similar, but have rebuilt the machine with
 2.4.20 with scheduling optimizations and other goodies (gentoo). I
 noticed a number of speed ups in most other parallel processes
 (cinelerra, MPI povray, gcc). Of course, most of the patches in the
 gentoo 2.4.20 kernel are stock in 2.5+ (I also used 2.6.0-test8, but
 this Asus board doesn't behave under that kernel, and it crashed
 whenever i'd load the CPUs or IDE buses :)
Bad. Which board do you have? (Mine is a Tyan Tiger MPX.) 

  Sorry if the mail is a bit confusing,
[...]
 Hopefully this one didn't ramble on TOO long.
My brain had given up by the time I started my computer that evening ;)

But I don't really know why the situation is that bad.


hopefully see you soon,

Berni the Chaos of Woodquarter

Email: [EMAIL PROTECTED]
www: http://www.lysator.liu.se/~gz/bernhard




Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Richard Ellis
On Tue, Dec 16, 2003 at 09:27:53AM -0800, Steven M. Schultz wrote:
 
 Perhaps Richard Ellis could chime in with his experiences with -Q
 ;)

It seems that with the right set of options, and the right set of
input data, -Q can help to create some really nasty looking
artifacts.  

  And again, son of I didn't realize the parallelization was
  done based on interlacing settings.
   
 Looking back on it that makes sense though.   A P frame depends on
 the preceeding P frame - rather sequential in nature since you
 can't move on to the next one without completing the first one...

The P-frame dependency chain is how the artifacts come about, based on
Andrew's explanation.  It's accumulated round-off error in the iDCT
routines, made worse by -Q as well as -R 0 and a few other options.





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Richard Ellis
On Tue, Dec 16, 2003 at 12:33:52AM -0700, Slepp Lukwai wrote:
 On Mon, 2003-12-15 at 21:08, Richard Ellis wrote:
  Additionally, what kind of memory do you have attached to the cpu's? 
  Mpeg encoding is very memory bandwidth hungry to begin with, and with
  two cpu's trying to eat at the same trough, a not quite as fast as it
  should be memory subsystem can produce results like what you are
  seeing. ...

 ... It's a dual Athlon, which inherently means 266FSB (DDR 266),
 though the memory is actually Hynix PC3200 w/ timings set as low as
 they go on this board (2-2-2), which gives me about 550MB/s memory
 bandwidth according to memtest, with a 13GB/s L1 and something like
 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1.  At 550MB/s, it
 SHOULD be able to push enough to keep the frames encoding at 100%
 CPU, in theory.

Yes, but just one 720x480 DVD-quality frame is larger than 256k in
size, so a 256k cache per CPU isn't helping too much overall
considering how many frames there are in a typical video to be
encoded.  Plus, my experience with Athlons is that they are actually
faster at mpeg2enc encoding than Intel chips of equivalent speed
ratings (the Athlon's 3DNow!/MMX implementation is faster) and so they
put a heavier stress on one's memory bandwidth than an equivalent
speed Intel chip would.  It's possible that 275MB/s per CPU just
isn't fast enough to keep up with the rate that mpeg2enc can consume
data on an Athlon.

Of course, Andrew would be much better suited to discuss mpeg2enc's
memory access patterns during encoding, which, depending on how it
goes about accessing memory, can make better use of the 256k of
cache, or cause the 256k of cache to be constantly thrashed in and
out.

  FWIW, when my desktop machine was a dual PII-400Mhz box, I almost
  always had two mpeg2enc threads eating up 97-98%cpu on both PII
  chips.  The few times both cpu's were not fully saturated at mpeg
  encoding was when I'd bother them with something silly like browsing
  the web with mozilla. :)
 
 Now that's just silly. Why would you hurt the CPUs by running such bloat
 as Mozilla? I can't think of how many times Mozilla has gone nuts on me
 and used 100% CPU without reason, and you can't kill it any normal UI
 way.. Good ol' killall. However, I love it. It's a great browser. Just
 rather hungry at times. I suppose there's a reason the logo is a
 dinosaur. :

Hmm... Interesting.  I've had it sometimes just stop but never go
nuts with 100% CPU, and although I usually kill it from the CLI if need be,
FVWM2's destroy window command has never failed to get rid of it if
I don't bother to go to the CLI to do so.  In fact, FVWM2's destroy has
never failed to get rid of anything that went wonky.  It's the X
Windows equivalent of a kill -9 from the CLI.





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Steven M. Schultz

On Tue, 16 Dec 2003, Andrew Stevens wrote:

 Hi all,
 
 First off a bit of background to the multi-threading in the current stable 
 branch.  First off:
 
 - Parallelism is primarily frame-by-frame.  This means that the final phases 
 of the encoding lock on completion of the reference frame (prediction and DCT 

If one were using closed and fixed-length GOPs, would it make
sense to parallelize the encoding of complete GOPs?   Each CPU
could be dispatched a set of N frames that comprise a closed GOP, and
a master thread could write the GOPs out in the correct order.
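A toy sketch of that suggestion using pthreads (purely illustrative;
encode_gop() is a stand-in for the real per-GOP work, and none of this is
mpeg2enc code): one worker per closed GOP, with the master joining and
writing the results in display order.

#include <pthread.h>
#include <stdio.h>

/* One job per closed, fixed-length GOP. */
typedef struct {
    int first_frame;
    int num_frames;
    /* ...encoded bitstream for this GOP would be stored here... */
} GopJob;

/* Worker: encode one closed GOP independently of the others. */
static void *encode_gop(void *arg)
{
    GopJob *job = arg;
    /* ...motion estimation, DCT, quantisation for job->num_frames frames... */
    (void)job;
    return NULL;
}

int main(void)
{
    enum { NUM_GOPS = 4, GOP_LEN = 18 };   /* e.g. -g 18 -G 18 closed GOPs */
    pthread_t worker[NUM_GOPS];
    GopJob    job[NUM_GOPS];

    for (int i = 0; i < NUM_GOPS; i++) {
        job[i].first_frame = i * GOP_LEN;
        job[i].num_frames  = GOP_LEN;
        pthread_create(&worker[i], NULL, encode_gop, &job[i]);
    }

    /* Master thread: join (and write) the GOPs in display order so the
     * output stream stays correctly sequenced. */
    for (int i = 0; i < NUM_GOPS; i++) {
        pthread_join(worker[i], NULL);
        printf("writing GOP starting at frame %d\n", job[i].first_frame);
    }
    return 0;
}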

But as Andrew mentioned, by the time filters and other processing
are added in, a dual-CPU system's pretty well saturated.   Quad-CPU
systems are very much a niche (and expensive) item (not to mention 
the noise they make ;))

Cheers,
Steven Schultz





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Trent Piepho
On Tue, 16 Dec 2003, Richard Ellis wrote:
  6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1.  At 550MB/s, it
  SHOULD be able to push enough to keep the frames encoding at 100%
  CPU, in theory.
 
 Yes, but just one 720x480 DVD quality frame is larger than 256k in
 size, so a 256k cache per CPU isn't helping too much overall
 considering how many frames there are in a typical video to be

A 720x480 4:2:0 frame is about 512KB; at 550MB/sec there is enough memory
bandwidth to encode at about 1000 frames/sec if all you had to do was read the
data.  Obviously the encoder runs somewhat slower than that, so each byte of
data must be accessed multiple times.  That's where the cache helps.
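The arithmetic behind that estimate, as a quick check (treating MB as 10^6
bytes; the numbers come from the figures quoted above):

#include <stdio.h>

int main(void)
{
    /* 4:2:0 = full-resolution luma plus two quarter-resolution chroma
     * planes, i.e. 1.5 bytes per pixel. */
    double frame_bytes = 720.0 * 480.0 * 1.5;   /* 518400 bytes, ~0.49 MiB */
    double bandwidth   = 550e6;                 /* ~550 MB/s measured by memtest */

    printf("frame size     : %.0f bytes\n", frame_bytes);
    printf("read-only limit: %.0f frames/s\n", bandwidth / frame_bytes); /* ~1060 */
    return 0;
}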

 Of course, Andrew would be much better suited to discuss mpeg2enc's
 memory access patterns during encoding, which depending on how it
 does go about accessing memory can better make use of the 256k of
 cache, or cause the 256k of cache to be constantly thrashed in and
 out.

I seem to recall that one of the biggest performance bottlenecks of mpeg2enc
is the way it accesses memory.  It runs each step of the encoding process
on an entire frame at a time.  It's much more cache-friendly to run every stage
of the encoding process on a single macroblock before moving on to the next
macroblock.
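The difference in access pattern, sketched in C (the types and stage names
here are hypothetical, purely to illustrate the two loop orderings; see
Andrew's later reply for how this actually worked out in practice):

/* Illustrative only: hypothetical types and stage names, not mpeg2enc's API. */
typedef struct { int num_macroblocks; /* ...pixel planes, MV arrays... */ } Frame;

static void motion_estimate(Frame *f, int mb) { (void)f; (void)mb; }
static void predict_and_dct(Frame *f, int mb) { (void)f; (void)mb; }
static void quantise(Frame *f, int mb)        { (void)f; (void)mb; }

/* Pass-at-a-time: each pass streams the whole ~0.5MB frame, so the data the
 * next pass needs has already been pushed out of a 256KB cache. */
static void encode_by_passes(Frame *f)
{
    for (int mb = 0; mb < f->num_macroblocks; mb++) motion_estimate(f, mb);
    for (int mb = 0; mb < f->num_macroblocks; mb++) predict_and_dct(f, mb);
    for (int mb = 0; mb < f->num_macroblocks; mb++) quantise(f, mb);
}

/* Macroblock-at-a-time: every stage touches one macroblock while it is
 * still cache-hot before moving on to the next. */
static void encode_by_macroblocks(Frame *f)
{
    for (int mb = 0; mb < f->num_macroblocks; mb++) {
        motion_estimate(f, mb);
        predict_and_dct(f, mb);
        quantise(f, mb);
    }
}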






Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Trent Piepho
On Tue, 16 Dec 2003, Steven M. Schultz wrote:
  First off a bit of background to the multi-threading in the current stable 
  branch.  First off:
  
  - Parallelism is primarily frame-by-frame.  This means that the final phases 
  of the encoding lock on completion of the reference frame (prediction and DCT 
 
   If one were using closed and fixed length GOPs would it make
   sense to parallelize the encoding of complete GOPs?   Each cpu
   could be dispatched a set of N frames that comprise a closed GOP and
   a master thread could write the GOPs out in the correct order.

But what about bit allocation?  You need to know how big the last GOP was to
figure out how many bits you can use for the next GOP.






Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Andrew Stevens
Hi Steven,  Trent,

 But what about bit allocation?  You need to know how big the last GOP was
 to figure out how many bits you can use for the next GOP.

Actually, this is not such a big deal provided the GOPs are well separated.  
Simplifying a little, you just need to ensure that you have at least the assumed 
amount of decoder buffer full at the end of each 'chunk' that you assumed when 
starting to encode its successor.

However, this idea came to mind more as a sneaky way of doing accurately sized 
single-pass encoding: work on multiple 'segments' spread across the video 
sequence so you get a good statistical sample of how your total 
bit-consumption is going relative to your target.  This is rotten for 
parallelism though, because you have two more or less totally uncorrelated 
memory footprints.  For DVD, 'segments' would kind of naturally correlate with 
'chapters' at the authoring level.

In the MPEG_DEVEL branch the encoding of each frame (apart from the bit-packed
coding and bit allocation, which is only a small fraction of the CPU load) is
simply striped across the available CPUs.  This has the nice side effect of
reducing each CPU's working set too, as it only deals with a fraction of a
frame.
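
In outline, something like the following sketch -- written under the assumption
that the striping is by macroblock row, with placeholder function names rather
than mpeg2enc code:

#include <pthread.h>
#include <stdio.h>

enum { MB_ROWS = 480 / 16, NUM_CPUS = 2 };

typedef struct { int first_row, last_row; } stripe_t;

/* Placeholder: prediction + DCT + quantisation for one row of macroblocks. */
static void encode_mb_row(int row) { (void)row; }

static void *worker(void *arg)
{
    const stripe_t *s = arg;
    for (int row = s->first_row; row < s->last_row; row++)
        encode_mb_row(row);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_CPUS];
    stripe_t  stripes[NUM_CPUS];
    int rows_per_cpu = (MB_ROWS + NUM_CPUS - 1) / NUM_CPUS;

    /* Each worker gets a contiguous band of macroblock rows from the same frame. */
    for (int i = 0; i < NUM_CPUS; i++) {
        stripes[i].first_row = i * rows_per_cpu;
        stripes[i].last_row  = (i + 1) * rows_per_cpu < MB_ROWS
                             ? (i + 1) * rows_per_cpu : MB_ROWS;
        pthread_create(&tid[i], NULL, worker, &stripes[i]);
    }
    for (int i = 0; i < NUM_CPUS; i++)
        pthread_join(tid[i], NULL);

    printf("encoded %d macroblock rows on %d CPUs\n", MB_ROWS, NUM_CPUS);
    return 0;
}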

Having said all that, I'll probably do a simple two-pass encoding mode
first (much simpler frame feeding!).


  Of course, Andrew would be much better suited to discuss mpeg2enc's
  memory access patterns during encoding, which depending on how it
  does go about accessing memory can better make use of the 256k of
  cache, or cause the 256k of cache to be constantly thrashed in and
  out.

 I seem to recall that one of the biggest performance bottlenecks of
 mpeg2enc is the way it accesses memory.  It runs each step of the encoding
 process on an entire frame at a time.  It's much more cache friendly to run
 every stage of the encoding process on a single macroblock before moving on
 to the next macroblock.

The single-macroblock approach has been implemented for quite some time now 
(since the move to C++, roughly).  In rather basic English, speed improved 
by... bugger all.  I was *most* surprised; it could well be that the story is 
rather different on multi-CPU machines.  At least I like to hope the work 
wasn't wasted ;-)

Actually, the memory footprint of encoding is much larger than you'd think.  
Remember each 16x16 int16_t difference macroblock gets generated from nastily 
unaligned 16x16 or 16x8 uint8_t predictors and a 16x16 uint8_t picture 
macroblock.  The difference is then DCT-ed in place into 4 8x8 int16_t DCT 
blocks, which are then quantised into 4 8x8 int16_t quantised DCT blocks.
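
Adding that up per macroblock -- a rough sketch based on the sizes just
described, not mpeg2enc's actual buffer layout:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    size_t picture   = 16 * 16 * sizeof(uint8_t);    /* 256 B source macroblock            */
    size_t predictor = 16 * 16 * sizeof(uint8_t);    /* 256 B (a 16x8 predictor is 128 B)  */
    size_t diff_dct  = 16 * 16 * sizeof(int16_t);    /* 512 B difference, DCT-ed in place  */
    size_t quantised = 4 * 8 * 8 * sizeof(int16_t);  /* 512 B quantised coefficients       */

    size_t per_mb = picture + predictor + diff_dct + quantised;
    size_t mbs    = (720 / 16) * (480 / 16);          /* 1350 macroblocks per 720x480 frame */

    printf("per macroblock: %zu bytes\n", per_mb);             /* ~1.5 KB          */
    printf("per frame     : %zu KB\n", per_mb * mbs / 1024);   /* roughly 2 MB     */
    return 0;
}

That already dwarfs a 256KB L2, even before the reference frames used for
prediction are counted.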

Where mpeg2enc could speed up is:

- DCT blocks are in 'correct' and not transposed form.  This is a waste, as 
by transposing the quantiser matrices and the scan sequence you can skip this 
step entirely.

- Each quantised DCT block is separately stored.  Nice for debugging, poor for 
memory performance ;-)

- DCT is not combined with quantisation even where this would be possible.

- Motion estimation (probably wastefully) computes a lot of variances that 
could probably better be replaced by SAD for fast encoding modes (a plain-C 
sketch of both metrics appears below).

- The current GOP sizing approach is wasteful.   Frame type should only be 
decided once the best encoding mode (Intra, or one of the various inter motion 
prediction modes) is known.  Basically, you turn a B/P frame into an I frame 
if you've reached your GOP length limit or it has enough Intra coded blocks 
that it is more compact that way.   Unfortunately, the current allocation 
algorithm still has a few 'left over' elements that need to know the GOP size 
in advance, and these need to be replaced before this can be fixed.   I'm 
currently working on bit-allocation (basically, a two-pass / look-ahead mode 
plus the above improvement).

A similar approach can be used for deciding B/P frame selection, but this is 
expensive in CPU as you basically have to encode each potential B frame's 
reference frame twice.  I'm playing around with ideas for trying B frames out 
and, if they don't seem worthwhile, turning them off and then periodically 
checking whether it might make sense to turn them on again.
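
For reference, the two block-matching metrics mentioned above, in plain C.
This is a sketch for illustration only: mpeg2enc's real versions are the
hand-optimised MMX/SSE routines discussed earlier in the thread, and the
integer variance here is just an approximation of what the encoder computes.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Sum of absolute differences over a 16x16 block. */
unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += abs((int)cur[y * stride + x] - (int)ref[y * stride + x]);
    return sad;
}

/* Variance of the 16x16 difference block, E[d^2] - E[d]^2 with integer
   truncation.  The multiply per pixel is exactly the extra cost SAD avoids. */
unsigned var_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    long sum = 0, sumsq = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++) {
            int d = (int)cur[y * stride + x] - (int)ref[y * stride + x];
            sum   += d;
            sumsq += d * d;
        }
    return (unsigned)(sumsq / 256 - (sum / 256) * (sum / 256));
}

int main(void)
{
    uint8_t cur[16 * 16], ref[16 * 16];
    for (int i = 0; i < 256; i++) { cur[i] = (uint8_t)i; ref[i] = (uint8_t)(i + 3); }
    printf("SAD=%u VAR=%u\n", sad_16x16(cur, ref, 16), var_16x16(cur, ref, 16));
    return 0;
}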


Andrew





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Richard Ellis
On Tue, Dec 16, 2003 at 12:45:48PM -0800, Trent Piepho wrote:
 On Tue, 16 Dec 2003, Richard Ellis wrote:
   6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1.  At 550MB/s,
   it SHOULD be able to push enough to keep the frames encoding at
   100% CPU, in theory.
  
  Yes, but just one 720x480 DVD quality frame is larger than 256k
  in size, so a 256k cache per CPU isn't helping too much overall
  considering how many frames there are in a typical video to be
 
 A 720x480 4:2:0 frame is about 512KB, at 550MB/sec there is enough
 memory bandwidth to encode at about 1000 frames/sec if all you had
 to do was read the data.  Obviously the encoder runs somewhat
 slower than that, so each byte of data must be accessed multiple
 times.  That's where the cache helps.

With motion estimation each byte would end up being accessed more
than once for each new radius that was examined.  Plus motion
estimation is between at least two frames, so we are dealing with at
least about 1M of data to be accessed eventually in the course of
encoding one frame.
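
To put a rough number on that, here is a worst-case sketch of a brute-force
full search.  mpeg2enc's actual search is coarse-then-fine and touches far
less than this, but the repeated reuse of overlapping reference windows is the
same reason cache locality matters so much here:

#include <stdio.h>

int main(void)
{
    int  radius = 16;                      /* mpeg2enc's default -r value         */
    long candidates = (2L * radius + 1) * (2L * radius + 1);
    long bytes_per_candidate = 16 * 16;    /* one 16x16 block comparison          */
    long per_mb = candidates * bytes_per_candidate;
    long per_frame = per_mb * (720 / 16) * (480 / 16);

    printf("candidate positions per MB : %ld\n", candidates);
    printf("bytes compared per MB      : %ld\n", per_mb);
    printf("bytes compared per frame   : %ld MB\n", per_frame / (1024 * 1024));
    return 0;
}

Even the real, much cheaper search walks the same reference bytes many times,
which is why the 256k caches matter despite the healthy raw memory bandwidth.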

  Of course, Andrew would be much better suited to discuss
  mpeg2enc's memory access patterns during encoding, which
  depending on how it does go about accessing memory can better
  make use of the 256k of cache, or cause the 256k of cache to be
  constantly thrashed in and out.
 
 I seem to recall that one of the biggest performance bottlenecks of
 mpeg2enc is the way it accesses memory.  It runs each step of the
 encoding process on an entire frame at a time.  It's much more
 cache friendly to run every stage of the encoding process on a single
 macroblock before moving on to the next macroblock.

In that case it will kill the majority of the performance benefit
provided by the caches, because there's very little locality of
reference for the cache to exploit.  It moves through at least
512k for pass one, then through the same 512k again for pass two, but
the data in the cache is from the end of the frame, and we are
starting over at the beginning of the frame.  Massive cache thrash in
that case.  Memory bandwidth becomes a much more limiting factor.





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Bernhard Praschinger
Hello

 On Tue, 2003-12-16 at 12:57, Bernhard Praschinger wrote:
  Could you run a few tests (please).  Get some frames (100-1000) in yuv
  format. I guess that should be possible even with transcode. ;)
  (I do not use transcode so I can't help, or get the test streams on
  mjpeg.sf.net)
 
 With about 1010 frames of YUV using  to dump it in (instead of cat), I
 get these:
 
 -M 0: 2m 11.9s
 -M 1: 2m 10.6s, -1.3s
 -M 2: 1m 27.7s, -44.2s
 -M 3: 1m 26.5s, -45.4s
Those values look much better.  :-)
Now you have seen that mpeg2enc can go faster.

I have tried the command you used on my machine, and I have seen the 
same problem: also 3 processes, each using only about 33%.

(time lav2yuv n1000.eli | mpeg2enc -I 0 -f 8 -b 9800 -p -a 3 -o test.m2v
-S  -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0)


 Note that I responded in an earlier message with a total of 24 timings
 across -M 0-3 -I 0-1 -R 0-2 settings, which turned up some interesting
 results that -M 3 -I 0 -R 1 worked fastest of all of them (same source
 material I used for the above, and it took 51 seconds). So, I think the
 -I 1 is on, which makes a huge boost in -M ratings from 0 to 3, but it
 is still quite a bit slower than -I 0 (which I use since the input is
 Progressive 23.976fps)
That's strange.

  And do afterwards something like that:
  cat stream.yuv | mpeg2enc -f8 -M 0-3 -o test.m2v
  or
  lav2yuv stream.avi | mpeg2enc -f 8 -M 0-3 -o test.m2v
 
  So you can be sure that nothing else makes any trouble. And check
  how it is going. That should not take too long. Then you can add
  the options you used, to see if anything there causes the problem of the
  non-increasing framerate.
 Compared to the run with my long options line, these are 
I'm just running some encodings to see which option causes the problem. 

On my machine the -R 0 caused the problem. If I used -R 1 or 2, or no -R
option at all, I got 3 processes each using about 45-50%. 

  Bad. Which board do you have? (Mine is a Tyan Tiger MPX)
 Nice board, that one. Asus A7M-266D.. I should've grabbed the MSI K7D
 Master for the same price, I hear much nicer things about it.


  My brain had given up the time I started my computer that evening ;)
 Mine usually does that at about 8am. :
Just as you enter work ? ;)

  But I'm not really knowing why the situation is that bad.
 
 I'm just not seeing the dual CPU usage that would warrant even running
 in multiple threads, when I could instead transcode two entirely
 separate items as though I had two machines, which makes some sense (I
 did that the other day, worked rather well). But, if I can make a single
 copy work by flooding both CPUs with activity, then I'll be happier,
 since it should take quite a bit less time to encode a full movie.
Encoding without the -R 0 seems to solve the problem, for now.


hopefully talk to you soon,

Berni the Chaos of Woodquarter

Email: [EMAIL PROTECTED]
www: http://www.lysator.liu.se/~gz/bernhard




Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-15 Thread Steven M. Schultz

On Mon, 15 Dec 2003, Slepp Lukwai wrote:

 faster to begin with. However, in both cases, after multiple tests and
 trying different things, I can't get the SMP modes to be fast at all. In
 fact, they're slower than the non-SMP modes.

I think I see what you're doing that could cause that.   I've never
seen the problem - using -M 2 is not going to be 2x as fast though
if that was the expectation.   ~40% speedup or so is what I see
(from about 10fps to 14fps) typically.

 When encoding with the -M 0 with .92, I get around 19fps. When I use -M

That's full sized (720x480), is it?   Sounds more like a SVCD 
or perhaps 1/2 D1 (bit of a misnomer - D1 is actually a digital
video tape deck) at 352x480.   At 1/2 size yes, around 20fps or a bit
more I've seen.   But I'm usually tossing in a bit of filtering so
the process is a bit slower.

 I installed 'buffer', set it up with a 32MB buffer and put it in the

10MB is about all I use - it's just a cushion to prevent the encoder
from having to wait (-M 1 is the default - there's I/O readahead
going on) for input.

 Has anyone found a way around this, or is it time to look at the source
 and see what's up?

 And for reference, it's a dual Athlon MP 2100+, which is below the
 '2600' that the Howto references as fast.

I'm using dual 2800s and around 14-15fps for DVD encoding is what I
usually get.

 The actual command line is:
 mpeg2enc -v 0 -I 0 -f 8 -b 9800 -F 1 -n n -p -a 3 -o test.m2v -S  -M
 3 -4 2 -2 1 -r 32 -q 5 -Q 3.0 -K kvcd

You have progressive non-interlaced source?   If not then -I 0 is
not the right option. 

The speedup from multiple processors comes, I believe (but if I'm
wrong I'm sure someone will tactfully point that out ;)), from the
motion estimation of the 2 fields/frame being done in
parallel.

Try -I 1 (or just leave out the -I and let it default).

Oh, and there's no real benefit from going above -M 2.   I had a 4
cpu box and tried -M 4 and saw no gain over -M 3 (which in turn
was a very minimal increase over -M 2).

If you want to speed things up by a good percentage try encoding
without B frames.   Those are computationally a lot more expensive
than I or P frames.   -R 0 will disable B frames.

And do you realize that increasing the search radius (-r) slows
things down?Leave the -r value defaulted to 16 and you should
see encoding speed up.

All in all - the defaults are fairly sane so if you're not certain
about an option, well, let it default.

And drop the -Q unless you want artifacting - especially values over 2.   
Under some conditions (it's partly material dependent) the -Q can
generate really obnoxious color blocks and similar artifacts.Much
better results (especially with clean source material) can be obtained
with -E -8 or perhaps -E -10.
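
Putting those suggestions together into a single command line (using only the
options already discussed in this thread; the bitrate, input and output names
are just examples to adapt):

cat input.y4m | mpeg2enc -f 8 -b 9800 -M 2 -E -8 -o test.m2v

Add -R 0 on top of that if dropping B frames is an acceptable trade for the
extra speed.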

 Of course the -M 3 changes to 2 and 0 in testing. I also tested it with
 and without the buffer program in the list. Another notable thing, is
 that with the newest version .92, -M3 causes three 33% usage processes

Right, with -I 0 the cpus take turns but there's little parallelism.

 to exist (leaving an entire CPU idle), while M2 causes two 60% processes
 to exist. With .90, -Mx causes 2 50-70% processes and the rest never do

Hmmm, I see 100% use on the two 2800s - but some of that would be
the DV decoding and pipe overhead of course.

First thing I'd try is lowering -r to 24 at most or just defaulting it.

Cheers,
Steven Schultz





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-15 Thread Bernhard Praschinger
Hello

 I was doing some testing of both the older version (1.6.1.90) and the
 newer version of mpeg2enc (1.6.1.92). First off, the .92 was somewhat
 faster to begin with. However, in both cases, after multiple tests and
 trying different things, I can't get the SMP modes to be fast at all. In
 fact, they're slower than the non-SMP modes.
By slower, I hope you mean that mpeg2enc needs more wall-clock (real) time to
encode the movie, and not just more accumulated CPU (user) time.

 When encoding with the -M 0 with .92, I get around 19fps. When I use -M
 2 or -M 3, I get around 14fps. The CPU utilization sits at about 60 to
 70% across both CPUs, but hits 99.9% when using just one.
That's really strange. 
Which program did you use for monitoring your CPU utilisation?
top and/or xosview?

If you used time to measure how long the encoding takes, the important
value for you is the real line, not the user line.
The user line reports the CPU time the command used across both CPUs. On a
dual machine that has nothing else to do, the real time is lower
than the user time. The overhead of running 2 threads increases the
user time a little, but lowers the real time. 

 I installed 'buffer', set it up with a 32MB buffer and put it in the
 stream, and it didn't make any difference at all. It would be nice to
 use mpeg2enc on two CPUs to it's full speed, which would net me faster
 than real-time, but thus far I haven't been able to.
What was your full command?

When I use lav2yuv files | mpeg2enc -f 8 -o test.m2v on
my system (the 2600 Athlon MP I mentioned in the HOWTO), mpeg2enc needs
nearly 100% of one CPU and lav2yuv needs another 5-10%. 
Encoding 1000 frames takes this amount of time: 2m16.944s

When I add -M 2, 
the speedup is nice: mpeg2enc has two threads, each needing about 65-70%,
and lav2yuv needs about 15%.
Encoding 1000 frames takes this amount of time: 1m37.881s

Adding buffer to a simple command line does not speed up anything.
buffer helps if you have a pipeline with several stages like: lav2yuv |
yuvdenoise | yuvscaler | mpeg2enc

 Has anyone found a way around this, or is it time to look at the source
 and see what's up?
I have no need, because I think it works properly. 
 
 And for reference, it's a dual Athlon MP 2100+, which is below the
 '2600' that the Howto references as fast.

 Of course the -M 3 changes to 2 and 0 in testing. I also tested it with
 and without the buffer program in the list. Another notable thing, is
 that with the newest version .92, -M3 causes three 33% usage processes
 to exist (leaving an entire CPU idle), while M2 causes two 60% processes
 to exist. With .90, -Mx causes 2 50-70% processes and the rest never do
 anything.
Just for fun, I tested it with -M 3, and then I saw 3 mpeg2enc
threads, each using about 45-50%; that improved the needed time compared
to -M 2 by another 10 seconds. -M 4 didn't change much at all, only a 4th
process needing about 10%.

I use the 2.6.0-test8 kernel. Maybe that changes the situation. 

The percent numbers reported by top have to be read carefully. At least
my top reports them relative to a single CPU, so a process can use up
to 200%, and only then are both CPUs under full load. 
But in the task/CPU stats line, 100% utilisation means both CPUs are busy. 

Sorry if the mail is a bit confusing,
hopefully talk to you soon,

Berni the Chaos of Woodquarter

Email: [EMAIL PROTECTED]
www: http://www.lysator.liu.se/~gz/bernhard

