Hi Jörn, yes, that is correct. I think, however, that the virtual loudspeaker stage is unnecessary. It is equivalent if you expand the left and right HRTFs into spherical harmonics and multiply their coefficients (in the frequency domain) directly with the coefficients of the sound scene (which in the 1st-order case is the B-format recording). This is simpler and more elegant, I think.

Taking the IFFT of each HRTF coefficient, you end up with an FIR filter that maps the respective HOA signal to its binaural output, hence, as you said, it is always 2 * (number of HOA channels), no matter what. Arbitrary rotations can be applied to the HOA signals before the HOA-to-binaural filters, so head-tracking is perfectly possible.

Best,
Archontis
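A minimal sketch of this HOA-to-binaural filtering (names are hypothetical; numpy and scipy are assumed, and the per-channel SH-domain HRTF filters are taken as already computed, e.g. by expanding a measured HRTF set into spherical harmonics and IFFT-ing each coefficient):

  import numpy as np
  from scipy.signal import fftconvolve

  def hoa_to_binaural(hoa, fir_left, fir_right):
      # hoa       : (n_channels, n_samples) Ambisonic signals (WXYZ for 1st order)
      # fir_left  : (n_channels, n_taps) SH-domain HRTF filters for the left ear
      # fir_right : (n_channels, n_taps) SH-domain HRTF filters for the right ear
      # One convolution per HOA channel and ear, i.e. 2 * n_channels in total,
      # regardless of any virtual loudspeaker layout.
      left  = sum(fftconvolve(sig, fir) for sig, fir in zip(hoa, fir_left))
      right = sum(fftconvolve(sig, fir) for sig, fir in zip(hoa, fir_right))
      return left, right

With head-tracking, the rotation matrix is applied to hoa before calling this; the two filter banks themselves stay fixed.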
________________________________________
From: Sursound [sursound-boun...@music.vt.edu] on behalf of Jörn Nettingsmeier [netti...@stackingdwarves.net]
Sent: 26 January 2016 22:52
To: sursound@music.vt.edu
Subject: [Sursound] Never do math in public, or my take on explaining B-format to binaural

I think the 8 impulses are used differently. I'm scared of trying to explain something of which my own understanding is somewhat hazy, but here goes: please correct me ruthlessly. Even if in the end I wish I'd never been born, there might be something to learn from the resulting discussion :)

W goes to loudspeakers LS1, LS2, ..., LSn. Same for X, Y, and Z. Each loudspeaker signal then goes to both the left ear and the right ear. So you start with a 4-to-n matrix feeding into an n-to-2 matrix.

The component-to-speaker convolutions and the speaker-to-ear convolutions (the HRTFs) are constant. Convolution and mixing are both linear, time-invariant operations, so they can be performed in any order and the result will be identical. I guess in math terms convolution is distributive over addition and associative, so that (a # X) + (b # X) is the same as (a + b) # X, and a # b # c is the same as a # (b # c), where "#" means convolution.

So the convolution steps can be pre-computed as follows, where DEC(N,LSm) is the decoding coefficient of component N to loudspeaker LSm, expressed as convolution with a Dirac pulse of the appropriate value:

  L =   W # DEC(W,LS1) # HRTF(L,LS1) + ... + W # DEC(W,LSn) # HRTF(L,LSn)
      + X # DEC(X,LS1) # HRTF(L,LS1) + ... + X # DEC(X,LSn) # HRTF(L,LSn)
      + Y # ...
      + Z # ...

  (same for R)

which can be regrouped as

  L =   W # ( DEC(W,LS1) # HRTF(L,LS1) + ... + DEC(W,LSn) # HRTF(L,LSn) )
      + X # ( ... )
      + Y # ( ... )
      + Z # ( ... )

  (same for R).

Note that everything in brackets is now constant and can be folded into a single convolution kernel. That means you can, for first order, reduce the problem to 8 convolutions, going from {W,X,Y,Z} to {L,R} directly. The complexity is constant no matter how many virtual loudspeakers you use.

Of course, that does not take dual-band decoding into account. But if we express the crossover filters as further convolutions and split the decoding matrix into an HF and an LF part, we can throw both halves of the decoder together and do everything in one go.

For nth order, you have (n+1)² * 2 convolutions to handle (which gives the 8 above for first order).

For head-tracking, the virtual loudspeakers would move with the head (so that we don't have to swap HRTFs), and the Ambisonic signal would be counter-rotated accordingly. Of course that gets the torso reflections slightly wrong, as it assumes the whole upper body moves rather than just the neck, but I guess it's a start.

--
Jörn Nettingsmeier
Lortzingstr. 11, 45128 Essen, Tel. +49 177 7937487
Meister für Veranstaltungstechnik (Bühne/Studio)
Tonmeister VDT
http://stackingdwarves.net
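A small numerical check of the folding argument in Jörn's message above (names are hypothetical stand-ins for DEC(N,LSm) and HRTF(L/R,LSm), the data is random placeholder material, and numpy/scipy are assumed):

  import numpy as np
  from scipy.signal import fftconvolve

  rng = np.random.default_rng(0)
  n_ch, n_spk, n_sig, n_taps = 4, 8, 4800, 256        # WXYZ, 8 virtual loudspeakers

  b_format = rng.standard_normal((n_ch, n_sig))        # W, X, Y, Z signals
  dec      = rng.standard_normal((n_spk, n_ch))        # decoding gains (scaled Dirac pulses)
  hrir_l   = rng.standard_normal((n_spk, n_taps))      # per-loudspeaker HRIRs, left ear
  hrir_r   = rng.standard_normal((n_spk, n_taps))      # per-loudspeaker HRIRs, right ear

  # Path 1: decode to virtual loudspeaker feeds, then convolve each feed with its HRIR.
  feeds   = dec @ b_format
  left_1  = sum(fftconvolve(feeds[m], hrir_l[m]) for m in range(n_spk))
  right_1 = sum(fftconvolve(feeds[m], hrir_r[m]) for m in range(n_spk))

  # Path 2: fold decoder and HRIRs into 2 * n_ch fixed kernels ("everything in brackets"),
  # then convolve the Ambisonic channels with those directly.
  kern_l  = np.einsum('mc,mt->ct', dec, hrir_l)
  kern_r  = np.einsum('mc,mt->ct', dec, hrir_r)
  left_2  = sum(fftconvolve(b_format[c], kern_l[c]) for c in range(n_ch))
  right_2 = sum(fftconvolve(b_format[c], kern_r[c]) for c in range(n_ch))

  print(np.allclose(left_1, left_2), np.allclose(right_1, right_2))     # True True

The print should report True for both ears, which is just the distributivity/associativity argument in numerical form. A dual-band decoder folds in the same way, since the crossover filters are simply more constant terms inside the brackets.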