Hi Jorn,

Yes, that is correct. I think, however, that the virtual loudspeaker stage is 
unnecessary. It is equivalent to expand the left and right HRTFs into 
spherical harmonics and multiply their coefficients (in the frequency domain) 
directly with the coefficients of the sound scene (which in the first-order 
case is the B-format recording). This is simpler and more elegant, I think. 
Taking the IFFT of each coefficient of the HRTFs, you end up with an FIR 
filter that maps the respective HOA signal to its binaural output, hence, as 
you said, it's always 2*(HOA channels) no matter what. Arbitrary rotations can 
be applied to the HOA signals before the HOA-to-binaural filters, so 
head-tracking is perfectly possible.
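
As a rough Python/numpy sketch (names are made up; hrir_sh_l and hrir_sh_r 
are assumed to already hold the left and right HRTF sets expanded into 
spherical harmonics and brought back to the time domain, i.e. one FIR 
filter per HOA channel and ear):

import numpy as np
from scipy.signal import fftconvolve

# hoa:        (n_channels, n_samples) HOA signals (the B-format recording
#             in the first-order case)
# hrir_sh_l,
# hrir_sh_r:  (n_channels, fir_len) HRTFs expanded into spherical
#             harmonics, IFFT'd to FIRs -- 2*(HOA channels) in total
def hoa_to_binaural(hoa, hrir_sh_l, hrir_sh_r):
    left = sum(fftconvolve(hoa[c], hrir_sh_l[c]) for c in range(hoa.shape[0]))
    right = sum(fftconvolve(hoa[c], hrir_sh_r[c]) for c in range(hoa.shape[0]))
    return np.stack([left, right])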

Best,
Archontis

________________________________________
From: Sursound [sursound-boun...@music.vt.edu] on behalf of Jörn Nettingsmeier 
[netti...@stackingdwarves.net]
Sent: 26 January 2016 22:52
To: sursound@music.vt.edu
Subject: [Sursound] Never do math in public, or my take on explaining B-format 
to binaural

I think the 8 impulses are used differently. I'm scared of trying to
explain something of which my own understanding is somewhat hazy, but
here goes: please correct me ruthlessly. Even if in the end I wish
I'd never been born, there might be something to learn from the
resulting discussion :)

W goes to loudspeaker LS1, LS2, ..., LSn.
Same for X, Y, and Z.

Each LSn then goes both to left ear and right ear.

So you start with a 4-to-n matrix feeding into an n-to-2 matrix. The
component-to-speaker convolutions and the speaker-to-ear convolutions
(the HRTFs) are constant.
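
As a rough Python/numpy sketch of that two-stage picture (names invented; 
dec would be the n-by-4 decoding matrix, hrir_l/hrir_r one HRIR per 
virtual loudspeaker and ear):

import numpy as np
from scipy.signal import fftconvolve

# bformat:        (4, n_samples) = W, X, Y, Z
# dec:            (n_speakers, 4) decoding matrix (plain gains here)
# hrir_l, hrir_r: (n_speakers, fir_len) HRIRs of the virtual speakers
def render_via_virtual_speakers(bformat, dec, hrir_l, hrir_r):
    feeds = dec @ bformat                                  # 4 to n
    left = sum(fftconvolve(feeds[m], hrir_l[m]) for m in range(len(dec)))
    right = sum(fftconvolve(feeds[m], hrir_r[m]) for m in range(len(dec)))
    return np.stack([left, right])                         # n to 2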

Convolution and mixing are both linear, time-invariant operations. That
means they can be performed in any order and the result will be
identical. In math terms, convolution is distributive over addition and
associative, so that (a # X) + (b # X) is the same as (a + b) # X, and
a # b # c is the same as a # (b # c), where "#" means convolution.
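
A quick numerical sanity check of those two identities (numpy sketch):

import numpy as np

rng = np.random.default_rng(0)
a, b, c, x = (rng.standard_normal(64) for _ in range(4))

# (a # x) + (b # x) equals (a + b) # x  -- distributive over addition
print(np.allclose(np.convolve(a, x) + np.convolve(b, x),
                  np.convolve(a + b, x)))                  # True

# a # b # c equals a # (b # c)          -- associative
print(np.allclose(np.convolve(np.convolve(a, b), c),
                  np.convolve(a, np.convolve(b, c))))      # True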

So the convolution steps can be pre-computed as follows, where DEC(N,m)
is the decoding coefficient of component N to loudspeaker m, expressed
as convolution with a Dirac pulse of the appropriate value:

L = W # DEC(W,LS1) # HRTF(L,LS1) + ... + W # DEC(W,LSn) # HRTF(L,LSn)
   + X # DEC(X,LS1) # HRTF(L,LS1) + ... + X # DEC(X,LSn) # HRTF(L,LSn)
   + Y # ...
   + Z # ...

(same for R)

which can be expressed as

L = W # ( DEC(W,LS1) # HRTF(L,LS1) + ... + DEC(W,LSn) # HRTF(L,LSn) )
   + X # ...
   + Y # ...
   + Z # ...

(same for R).

Note that everything in brackets is now constant and can be folded into
a single convolution kernel.

That means you can, for first order, reduce the problem to 8
convolutions, going from {WXYZ} to {LR} directly. The complexity is
constant no matter how many virtual loudspeakers you use.
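
As a sketch of that folding (same invented names as before; since each
DEC(N,m) is just a gain, the sum over virtual speakers reduces to a
matrix product with the HRIR set):

import numpy as np
from scipy.signal import fftconvolve

# dec:            (n_speakers, 4) decoding matrix
# hrir_l, hrir_r: (n_speakers, fir_len) HRIRs of the virtual speakers
def fold_decoder_into_hrirs(dec, hrir_l, hrir_r):
    # one kernel per {component, ear}: 8 FIRs for first order
    kernels_l = dec.T @ hrir_l   # (4, fir_len): sum_m DEC(N,m) * HRIR_L(m)
    kernels_r = dec.T @ hrir_r
    return kernels_l, kernels_r

def render_direct(bformat, kernels_l, kernels_r):
    left = sum(fftconvolve(bformat[n], kernels_l[n]) for n in range(4))
    right = sum(fftconvolve(bformat[n], kernels_r[n]) for n in range(4))
    return np.stack([left, right])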

Of course, that does not take into account dual-band decoding. But if we
express the crossover filters as another convolution and split the
decoding matrix into an hf and an lf part, we can also fold both halves
of the decoder together and do everything in one go.
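
As a sketch of that dual-band folding (invented names; dec_lf/dec_hf are
the two halves of the decoder, lp_fir/hp_fir a matched FIR crossover
pair of equal length):

import numpy as np
from scipy.signal import fftconvolve

def fold_dual_band(dec_lf, dec_hf, lp_fir, hp_fir, hrir_l, hrir_r):
    # fold each decoder half into the HRIRs, then fold in the crossover
    # and add the two bands -> still one kernel per {component, ear}
    k_lf_l, k_lf_r = dec_lf.T @ hrir_l, dec_lf.T @ hrir_r
    k_hf_l, k_hf_r = dec_hf.T @ hrir_l, dec_hf.T @ hrir_r
    kern_l = np.stack([fftconvolve(k_lf_l[n], lp_fir) +
                       fftconvolve(k_hf_l[n], hp_fir) for n in range(4)])
    kern_r = np.stack([fftconvolve(k_lf_r[n], lp_fir) +
                       fftconvolve(k_hf_r[n], hp_fir) for n in range(4)])
    return kern_l, kern_r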

For nth order, you have 2 * (n+1)² convolutions to handle (one per HOA
channel and ear).

For head-tracking, the virtual loudspeakers would move with the head (so
that we don't have to swap HRTFs), and the Ambisonic signal would be
counter-rotated accordingly. Of course that gets the torso reflections
slightly wrong as it assumes the whole upper body moves, rather than
just the neck, but I guess it's a start.
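
For first order and a pure yaw, that counter-rotation is just a small
matrix applied to X and Y before the fixed filters (sketch; the sign
convention depends on how the tracker reports yaw):

import numpy as np

def counter_rotate_yaw(bformat, yaw):
    # rotate the scene opposite to the head; W and Z are unaffected by yaw
    w, x, y, z = bformat
    c, s = np.cos(yaw), np.sin(yaw)
    return np.stack([w, c * x + s * y, -s * x + c * y, z])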



--
Jörn Nettingsmeier
Lortzingstr. 11, 45128 Essen, Tel. +49 177 7937487

Meister für Veranstaltungstechnik (Bühne/Studio)
Tonmeister VDT

http://stackingdwarves.net
