> Does anyone have a theory why syscalls are so expensive in FreeBSD? Here
> are the results of unixbench 4.1 on two machines. First is the machine
> running FreeBSD HEAD (debugging disabled) on a dual-core Athlon 64 (i386
> mode), 2 GHz:
I ran the syscall benchmark from UnixBench on the same hardware.
> is bigger better or worse?
For sysbench bigger is better (more transactions per second). For
ffmpeg, lower is better: less time to transcode the first 120
seconds of the selected video means it ran faster.
Josh
___
freebsd-performance@freeb
> I just ran through some of my benchmarks on a kernel build from
> sources as of today, and I've noticed an improvement for the ffmpeg
> workload. Here's a comparison of 4bsd, ule (BETA1) and ule (BETA3).
> This is vanilla source with no patches applied:
Sorry, the ministat output was mangled. I'm resending it:
Jeff,
I just ran through some of my benchmarks on a kernel build from
sources as of today, and I've noticed an improvement for the ffmpeg
workload. Here's a comparison of 4bsd, ule (BETA1) and ule (BETA3).
This is vanilla source with no patches applied:
x 4bsd
+ ule
* uleb3
> These ministat results show that the latest patch (alone) results in
> slightly worse performance for ffmpeg and buildworld, but slightly
> better results for sysbench.
Please disregard those conclusions, I misread the ffmpeg results and
didn't look at all thread counts for the sysbench runs.
> Try the /usr/src/tools/tools/ministat utility for a simple and effective
> way to compare these kinds of noisy measurements and extract reliable
> comparisons.
Thanks again for the suggestion, Kris!
I compiled results for 5 runs of the three benchmarks I've been using
(ffmpeg, sysbench (mysql), and buildworld).
> BTW, it doesn't make much sense to be measuring to millisecond precision
> a quantity which has variation that is unknown but probably much larger
> :) When trying to make comparisons to identify performance changes, a
> careful statistical approach is necessary.
>
> Try the /usr/src/tools/tools
> ffmpeg: 1:38.885
>
> sysbench (4, 8, 12, 16 threads respectively):
> 2221.93
> 2327.87
> 2292.49
> 2269.29
>
> And buildworld: 13m47.052s
I ran these with this patch applied, after also changing the slice value to 7.
ffmpeg: 1:38.547
sysbench:
2236.55
2321.02
2271.76
2254.85
> Josh, I had an interesting thought today. What if the reason 4BSD is
> faster is because it distributes load more evenly across all packages
> because it distributes randomly? ULE distributed across cores evenly but
> not packages. Can you try the attached patch? This also turns the
> defau
> That's expected due to the fuzzy rounding of 128 / 10, etc. Can you set
> slice_min and slice both equal to 7 and see if the numbers come out
> better than without the patch but with a slice value of 7? Basically I'm
> trying to isolate the effects of the different slice handling in this
> patch.
> Sysbench results:
> # threads  slice=7   slice=13  slice_min=4  slice_min=2
>  4         2265.67   2250.36   2261.71      2297.08
>  8         2300.25   2310.02   2306.79      2313.61
> 12         2269.54   2304.04   2296.54      2279.73
>
> Turns out the last patch I posted had a small compile error because I
> edited it by hand to remove one section. Here's an updated patch that
> fixes that and changes the min/max slice values to something more
> reasonable. Slice min should be around 4 with a max of 12.
>
> Also looks like 4BSD
> Josh, I included one too many changes in the diff and it made the results
> ambiguous. I've scaled it back slightly by removing the changes to
> sched_pickcpu() and included the patch in this email again. Can you run
> through your tests once more? I'd like to commit this part soon as it
> helps.
> Josh, thanks for your help so far. This has been very useful.
You're welcome, glad to help! Thanks for the effort and the patch.
> Any testing you can run this through is appreciated. Anyone else lurking
> in this thread who would like to is also welcome to report back findings.
Here are a f
> What would be interesting to know is if the sum of the temperatures is any
> different. 4BSD gets a much more random distribution of load because a
> thread is run on whatever cpu context switches next. ULE will have
> specific load patterns since it scans lists of cpus in a fixed order to
> as
> What was the -j value and number of processors?
-j 8.
I did the following (one warm up, 3 times in a row after that, averaged):
cd /usr/src
rm -rf /usr/obj/*
make clean
time make -j8 -DNOCLEAN buildworld
The system is a Q6600, so 4 cores.
Thanks,
Josh
> buildworld isn't cooperating for me, but once I iron that out, I'll
> post some results there as well :)
I was able to get buildworld compiling ok and here are the results:
4BSD      ULE.13    ULE.7
13:24.73  13:44.28  13:38.85
Only a 1.75% difference when the slice value is set to 7.
> Thank you, that was very useful. I may have something to test very soon.
Sounds great Jeff, just say the word when you need someone to do the
testing. I'll be glad to help!
> Could you try spot checking a couple of tests with kern.sched.slice set to
> half its present value? 4BSD on average will use half the slice that ULE
> will by default.
The initial value was 13, and I changed it to 7. Here is the time
result for the ffmpeg run:
13: 1:39.09
7:  1:37.01
I al
> I'm confident that we can improve things. It will probably not make the
> cut for 7.0 since it will be too disruptive. I'm sure it can be
> backported before 7.1 when ULE is likely to become the default.
That sounds great! I figured it was something that would have to wait
until 7.0 released.
> Your tests with ffmpeg threads vs processes probably is triggering more
> context switches due to lock contention in the kernel in the threads case.
> This is also likely the problem with some super-smack tests. On each
> context switch 4BSD has an opportunity to perfectly balance the CPUs. ULE
> Yes, that's the proper default. You could try setting steal_thresh to 1. I
> noticed a problem with building ports on an 8 core Xeon system while 8
> distributed.net crunchers were running. The port build would proceed
> incredibly slowly, steal_thresh=1 helped a little bit. It might not make up
> kern.sched.steal_thresh is/was one of the more effective tuning sysctls. rev
> 1.205 of sched_ule had a change that was supposed to automatically adjust it
> based on the number of cores. Is this the same 8 core system as the
> other thread? In that case the commit dictates steal_thresh should be
> 5-6% is a lot. ULE has some tuning for makeworld in -current, which
> for me reduced it to less than 1% slower than 4BSD (down from 5-10%
> slower), for the case of makeworld -j4 over nfs on a 2-CPU system with
> the sources pre-cached on the server and objects on a local file system,
> and exte
> We can not ignore this performance bug, also I had found that ULE is
> slower than 4BSD when testing super-smack's update benchmark on my
> dual-core machine.
I actually saw improved performance with ULE over 4BSD for
super-smack. What were the parameters you used for your testing? These
were mi
I decided to do some testing of concurrent processes (rather than a
single process that's multi-threaded). Specifically, I ran 4 ffmpeg
(without the -threads option) commands at the same time. The
difference was less than a percent:
4bsd: 439.92 real 1755.91 user 1.08 sys
ule:
> My next step is to run some transcodes with mencoder to see if it has
> similar performance between the two schedulers. When I have those
> results, I'll post them to this thread.
mencoder is linked against the same libx264 library that ffmpeg uses
for h.264 encoding, so I was expecting similar performance.
> Just curious, but are these results obtained while you are
> overclocking your 2.4ghz CPU to 3.4ghz? That might be a useful
> datapoint.
Yes they are with the CPU overclocked. I have verified the results
when not overclocked as well (running at stock).
> It also might be useful to know what s
> ULE is tuned towards providing cpu affinity compilation and evidently
> encoding are workloads that do not benefit from affinity. Before we
> conclude that it is slower, try building with -j5, -j6, -j7.
Here are the results of running ffmpeg with 4 through 8 threads on
both schedulers:
4 threads
Hello,
I posted this to the stable mailing list, as I thought it was
pertinent there, but I think it will get better attention here. So I
apologize in advance for cross-posting if this is a faux pas. :)
Anyway, in summary, ULE is about 5-6% slower than 4BSD for two
workloads that I am sensitive to.
Hello,
After reading man tuning, I began poking around at my IDE drives to
see how their performance was in FreeBSD. I noticed that writes are
quite slow (on the order of 15MB/s) compared to reads (55MB/s).
In some initial googling, I saw a thread from early 2005 about 5.3 and
performance problems.