26.04.2014 18:19, Alexander E. Patrakov wrote:
24.04.2014 22:09, Peter Meerwald wrote:
From: Peter Meerwald <p.meerw...@bct-electronic.com>

The generic matrix remapping is rather inefficient; special-case code
improves performance by 3x easily.

I have looked at this and the 10th patch. For 10/11, I have no
objections. 11/11 definitely works and improves things, but...

+static void remap_stereo_to_mono_s16ne_c(pa_remap_t *m, int16_t *dst, const int16_t *src, unsigned n) {
+    unsigned i;
+
+    for (i = n >> 2; i > 0; i--) {
+        dst[0] = (src[0] + src[1])/2;
+        dst[1] = (src[2] + src[3])/2;
+        dst[2] = (src[4] + src[5])/2;
+        dst[3] = (src[6] + src[7])/2;
+        src += 8;
+        dst += 4;
+    }
+    for (i = n & 3; i; i--) {
+        dst[0] = (src[0] + src[1])/2;
+        src += 2;
+        dst += 1;
+    }
+}

Why are we doing the compiler's job here? Yes, I understand that there
are precedents of manually unrolling the loop here, but this actually
slows things down with -O3 on gcc-4.8.2! Here are my results regarding
stereo to mono s16ne conversions with different CFLAGS on an amd64
machine (Intel(R) Core(TM) i7-4770S forced to 3.9 GHz by Intel Turbo
Boost).

The tests below are with the cpu-test rework patches applied (but not
reviewed).

With -O2 -pipe, and your code, I get:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 62098 usec (avg: 620.98, min = 612, max = 764, stddev = 20.9442).
orig: 125770 usec (avg: 1257.7, min = 1247, max = 1392, stddev = 24.9169).

With -O3 -pipe, and your code, I get:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 120105 usec (avg: 1201.05, min = 1157, max = 1472, stddev = 50.5987).
orig: 127543 usec (avg: 1275.43, min = 1234, max = 1682, stddev = 56.4764).

Now let's test this:

static void remap_stereo_to_mono_s16ne_c(pa_remap_t *m, int16_t *dst, const int16_t *src, unsigned n) {
    while (n--) {
        dst[0] = (src[0] + src[1])/2;
        src += 2;
        dst += 1;
    }
}

With -O2 -pipe:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 82468 usec (avg: 824.68, min = 814, max = 984, stddev = 23.8113).
orig: 126014 usec (avg: 1260.14, min = 1248, max = 1429, stddev = 27.8855).

With -O3 -pipe:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 57797 usec (avg: 577.97, min = 567, max = 687, stddev = 18.9386).
orig: 123601 usec (avg: 1236.01, min = 1219, max = 1377, stddev = 30.3412).

I.e. with gcc, -O3 and the simplest possible implementation slightly beats
your hand-optimized loop here, probably because the compiler was smart
enough to insert some SSE2 stuff automatically.

The above should not be counted as an objection to your patch. We can
always clean up this and the existing hand-rolled code later.
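
For anyone who wants to confirm that the two loops are interchangeable
before such a cleanup, here is a minimal standalone sketch. It is not
part of the patch or of cpu-test: the function names are made up for
illustration, the pa_remap_t argument is dropped so that it builds
outside the PulseAudio tree, and the test pattern is arbitrary.

```c
#include <stdint.h>

/* Hand-unrolled variant, following the loop structure of the patch.
 * Assumption: the pa_remap_t argument is omitted for standalone use. */
static void stereo_to_mono_unrolled(int16_t *dst, const int16_t *src, unsigned n) {
    unsigned i;

    /* Four output frames per iteration. */
    for (i = n >> 2; i > 0; i--) {
        dst[0] = (int16_t) ((src[0] + src[1]) / 2);
        dst[1] = (int16_t) ((src[2] + src[3]) / 2);
        dst[2] = (int16_t) ((src[4] + src[5]) / 2);
        dst[3] = (int16_t) ((src[6] + src[7]) / 2);
        src += 8;
        dst += 4;
    }
    /* Remaining 0..3 frames. */
    for (i = n & 3; i; i--) {
        dst[0] = (int16_t) ((src[0] + src[1]) / 2);
        src += 2;
        dst += 1;
    }
}

/* Simple variant: one output frame per iteration. */
static void stereo_to_mono_simple(int16_t *dst, const int16_t *src, unsigned n) {
    while (n--) {
        dst[0] = (int16_t) ((src[0] + src[1]) / 2);
        src += 2;
        dst += 1;
    }
}

/* Hypothetical helper: returns 1 if both variants produce identical
 * output for n frames of a deterministic test pattern, 0 otherwise
 * (also 0 if n exceeds the fixed buffer size). */
static int remap_variants_agree(unsigned n) {
    enum { MAX_FRAMES = 1024 };
    int16_t src[2 * MAX_FRAMES], a[MAX_FRAMES], b[MAX_FRAMES];
    unsigned i;

    if (n > MAX_FRAMES)
        return 0;

    /* Spread the samples over the full int16_t range, including negatives. */
    for (i = 0; i < 2 * n; i++)
        src[i] = (int16_t) ((int) ((i * 7919u) % 65536u) - 32768);

    stereo_to_mono_unrolled(a, src, n);
    stereo_to_mono_simple(b, src, n);

    for (i = 0; i < n; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Calling remap_variants_agree() with frame counts that are not a multiple
of 4 (e.g. 3 or 1023) exercises the tail loop of the unrolled variant as
well as its main loop.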

Now waiting while clang-3.4 compiles...


Simple code, clang, -O2 -pipe:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 82375 usec (avg: 823.75, min = 794, max = 1387, stddev = 90.7334).
orig: 134835 usec (avg: 1348.35, min = 1263, max = 2151, stddev = 110.471).

Simple code, clang, -O3 -pipe:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 80987 usec (avg: 809.87, min = 794, max = 1016, stddev = 30.6149).
orig: 130819 usec (avg: 1308.19, min = 1287, max = 1507, stddev = 38.0144).

Your code, clang, -O2 -pipe:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 63764 usec (avg: 637.64, min = 615, max = 946, stddev = 39.2402).
orig: 132069 usec (avg: 1320.69, min = 1302, max = 1658, stddev = 45.6867).

Your code, clang, -O3 -pipe:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 61143 usec (avg: 611.43, min = 598, max = 801, stddev = 32.9057).
orig: 130071 usec (avg: 1300.71, min = 1286, max = 1641, stddev = 43.0877).

OK, so on clang your code has its benefits. Keep it.

--
Alexander E. Patrakov