On 2022-09-13, Fons Adriaensen wrote:

> Even in that case it isn't as simple as you seem to think.

Obviously I simplify. I know a thing or two already.

> Any set of measured HRIR will need some non trivial preprocessing before it can be used. One reason is low-frequency errors. Accurate IR measurements below say 200 Hz are difficult (unless you have a very big and good anechoic room). OTOH we know that HRIR in that frequency range are very low order and can be synthesised quite easily.

As such, we put a priori knowledge into the model, and/or somehow repeat the measurement, coherently adding the resulting signals so as to bring down the noise floor.
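
A back-of-the-envelope numpy sketch of that averaging idea (names and numbers are mine, purely illustrative): averaging N time-aligned repeats leaves the response intact, while the uncorrelated noise power drops by a factor of N.

    import numpy as np

    def average_irs(irs):
        """Coherently average repeated, time-aligned IR measurements.

        irs: (N, L) array -- N repeats of an L-sample response.
        The identical response adds coherently; uncorrelated noise
        power drops by ~10*log10(N) dB.
        """
        return np.asarray(irs, dtype=float).mean(axis=0)

    # Toy check: a decaying "response" plus noise, repeated 16 times.
    rng = np.random.default_rng(0)
    true_ir = np.exp(-np.arange(1024) / 100.0)
    repeats = true_ir + 0.1 * rng.standard_normal((16, 1024))
    est = average_irs(repeats)
    print(np.var(est - true_ir) / np.var(repeats[0] - true_ir))  # ~1/16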

> Another reason is that you can't reduce a set of HRIR to low order (the order of the content you want to render) without introducing significant new errors.

I believe the systematic way to talk about this is to note that reducing directional sources to a central frame is a Fourier-Bessel decomposition, which doesn't map easily onto the rectilinear Fourier decomposition. Even their truncation orders aren't comparable: a field of low, finite order in the one frame in general has an infinite-order decomposition in the other.
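
The textbook illustration: a single plane wave, the simplest finite-order object of the rectilinear decomposition, needs every spherical harmonic order at once,

    e^{i\mathbf{k}\cdot\mathbf{r}}
      = 4\pi \sum_{n=0}^{\infty} \sum_{m=-n}^{n}
        i^{n}\, j_{n}(kr)\, Y_{nm}^{*}(\hat{\mathbf{k}})\, Y_{nm}(\hat{\mathbf{r}}),

though the j_n(kr) terms decay rapidly once n > kr, which is what makes finite-order truncation work at all within a small enough sweet spot.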

But they work pretty well against each other for most outside-the-rig sources. The higher-order cross-terms in the transform fall off fast, so you can approximate pretty well in either direction. That's where the Daniel, Nicol & Moreau NFC-HOA paper comes from. (They also did Hankel functions, for outward-going energy transfer. Their solution is exact, and they've discussed the connection to rectilinear WFS, but even they didn't quantify it all fully.)
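
From memory, and modulo Daniel's exact conventions: the order-n near-field compensation filter ends up as a ratio of spherical Hankel terms evaluated at the source radius r_s and the loudspeaker radius r_l, up to an overall gain and delay,

    F_n(\omega) \;\propto\; \frac{h_n^{(2)}(\omega r_s / c)}{h_n^{(2)}(\omega r_l / c)},

which is exact for outgoing spherical waves, and also explains the notorious low-frequency bass boost that grows with order.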

> One way to reduce these is to reduce or even fully remove ITD at mid and high frequencies, again depending on the order the renderer is supposed to support.

This is all implicit in the order of reconstruction. ITD is just a derivative of the soundfield across your ears. Of course Gerzon started from Makita's theory, but that in turn is derivable first from the acoustic wave equation, and then from its ecologically minded, reduced psychoacoustics. Once you go to third-order ambisonics or beyond, no psychoacoustics are necessary.
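
In the crudest sketch, ignoring head diffraction entirely: two pressure samples a distance d apart, plane wave from azimuth \theta,

    \Delta\varphi(\omega) = \frac{\omega d}{c}\,\sin\theta
    \qquad\Longrightarrow\qquad
    \mathrm{ITD} = \frac{d}{c}\,\sin\theta,

which is precisely the information the first-order gradient components already carry; the higher orders then just refine the diffraction detail.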

> Getting the magnitudes (and hence ILD) accurate requires much lower order than if you also want to keep the delays.

My point is that in binaural work, especially if head tracked, you can easily get to order twenty or so. No ITD/ILD analysis needed, because it'll mimic physical reality.

If we can just do it right, that is. How do we, from a sparse and anisotropic measurement set?

> Compared to these and some other issues, not having a set on a regular grid (e.g. t-design or Lebedev) is the least of problems you will encounter.

Tell me more? I don't recognize these ones just yet.

> There are other considerations. For best results you need head tracking and a plausible room sound (even if the content already includes its own).

On plausible room reverberation, I might have some ideas as well. :)

> The practical solutions do not depend on such concepts and are much more ad-hoc. Some members of my team and myself worked on them for the last three years. Most of the results are confidential, although others (e.g. IEM) have arrived at some similar results and published them.

IEM is serious shit. If the results are confidential, so be it, but at the same time, if I had something to contribute, I'd happily sign an NDA. Just to be in the know, and to contribute; I've done it before, and would do it again.

It'd just be fun to actually solve, or at least quantify the limits of, this particular mathematical problem. How well can you actually go from, say, a KEMAR set to an ambisonic-to-binaural rendition? Isotropically? If you don't really have the downward part of the set? How do you interpolate, and what's your overall error metric? And so on?!? My problem.

> Another question is if for high quality binaural rendering, starting from Ambisonic content is a good idea at all.

Obviously it isn't, in all cases. For example, if your content is such that the listener mostly looks forward, say towards a movie screen, a fully isotropic ambisonic signal of any order wastes bandwidth/entropy. A lot of it. Even at first order, probably something like 2+epsilon channels' worth. If they used just pantophonic ambisonics, they'd get more typical theatrical sound. In fact, if they really optimized the thing for frontal sound, plus maybe a supplemental, theatrically minded surround track, as in Dolby Surround, it might need even less.

But the thing is, ambisonics has always been an exercise in generality, regularity, and thus virtual auditory reality with mathematical certainty, above immediate efficiency or cheapness. It's never been about what is easy, but about being able to look all around and perceive the same isotropic soundfield, even if you look up or down. The auditory idea of what we now call "immersive VR".

Sure, there have been many compatibility formats along the way, to ease "The Transition". But quite surely the whole of e.g. Gerzon's vision has been for all of us to move into something like full holophony.

> Simple fact is that if you want really good results you need very high order,

Yes, though you can do better in limited circumstances. There's lots of work in this vein in the early pantophonic ambisonic literature, and beyond. TriField was one example, I think, as was Gerzon's work on compatible frontal stereo.

> 1. such content isn't available from direct recordings (we don't have even 10th order microphones), so it has to be synthetic,

True. Which is probably why we have things like Dolby Atmos.

But at the same time, all of these formats nowadays include at least first-order ambisonics, and often right up to third order. Because it's very difficult to make synthetic sound out of a recording of real, live sound. The only real, systematic way of recording full 3D sound, even now, is via ambisonic principles.

> 2. rendering it from an Ambisonic format would be very inefficient. For example for order 20 you'd need 441 convolutions if you assume L/R head symmetry, twice that number if you don't.

I am not too certain that is true. If you wanted to implement that efficiently, you would do a reduction to 3rd order, or at most 4th-5th order, isotropic. Because 3rd order already sounds pretty damn exact. (I've had the distinct privilege of going into an anechoic room to listen to just that, thanks to Ville Pulkki and Archontis Politis, at the Aalto University Lab of Acoustics and Signal Processing. With their tweaks, no less.)
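
Just to make those channel counts concrete:

    (N+1)^2\big|_{N=20} = 441, \qquad
    (N+1)^2\big|_{N=5}  = 36,  \qquad
    (N+1)^2\big|_{N=3}  = 16,

so a reduction to 3rd-5th order already cuts the per-ear convolution count by an order of magnitude.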

I'm thoroughly sure that even in the LTI regime you can reduce those convolutions to a small fraction of the original while retaining full perceptual quality. And then, as I once surmised, you can also, within the ambisonic framework, exchange the order of the convolutions and the matrixing, for less work at lower directional order. (And, for gaming and VR, for zero processing latency, at a cost.)
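
A minimal sketch of that exchange, with made-up names: instead of decoding to K virtual speakers and running 2K HRIR convolutions, project the measured HRIRs onto the spherical harmonics once, offline, and then it's one convolution per ambisonic channel per ear.

    import numpy as np
    from scipy.signal import fftconvolve

    def sh_domain_binaural(amb, hrir_sh_l, hrir_sh_r):
        """Binaural render straight from ambisonic channels.

        amb:        (C, T) ambisonic signal, C = (N+1)**2 channels.
        hrir_sh_*:  (C, L) HRIRs projected onto the same C harmonics,
                    computed offline from the measured set.
        One convolution per channel per ear, independent of the
        number of virtual loudspeakers.
        """
        left = sum(fftconvolve(a, h) for a, h in zip(amb, hrir_sh_l))
        right = sum(fftconvolve(a, h) for a, h in zip(amb, hrir_sh_r))
        return np.stack([left, right])

With L/R head symmetry and the usual real-SH conventions you'd store only one of the two filter sets and flip the signs of the negative-m channels for the other ear, which is where Fons's factor of two comes from.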

The fun thing here is that I come at this as a signal processing fiend who has never actually implemented a single algorithm, but who has read it all. Starting from things like the Gardner convolution algorithm, which is how we'd implement zero-delay convolution, even over multiple channels/dimensions. Or the Karhunen-Loeve transform, as the optimum energy-compacting one, above even the second-best Discrete Cosine Transform, which is what we'd probably use here, over both channels and time.
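
The core of the Gardner idea fits in a few lines. Offline the identity is trivial; the real-time point is that the short head can run as a direct, zero-latency FIR, while the tail's output isn't needed until head_len samples later -- exactly the slack that lets it run in big, cheap FFT partitions (recursively so, in Gardner's actual scheme). A sketch, names mine:

    import numpy as np
    from scipy.signal import fftconvolve

    def split_convolution(x, h, head_len=64):
        """conv(x, h) = conv(x, head) + head_len-delayed conv(x, tail)."""
        head, tail = h[:head_len], h[head_len:]
        y = np.zeros(len(x) + len(h) - 1)
        y[:len(x) + len(head) - 1] += fftconvolve(x, head)
        y[head_len:] += fftconvolve(x, tail)  # the delayed, cheap part
        return y

    rng = np.random.default_rng(1)
    x, h = rng.standard_normal(4800), rng.standard_normal(1024)
    assert np.allclose(split_convolution(x, h), fftconvolve(x, h))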

I have in fact been thinking, for the longest time, about how to pack ambisonic signals optimally. None of the current systems work, because they are aimed at intensity stereo; the precise thing ambisonics originally stood against.
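
For what I mean by packing: a Karhunen-Loeve transform across the ambisonic channels decorrelates them and packs the energy into the first few rows, which is the property any channel coder would then exploit; the DCT approximates the same basis without having to transmit it. A sketch, untested beyond the toy case:

    import numpy as np

    def klt(channels):
        """KLT across channels. channels: (C, T) array.

        Returns (decorrelated, basis): decorrelated = basis.T @ x has
        a diagonal covariance, strongest component first.
        """
        x = channels - channels.mean(axis=1, keepdims=True)
        cov = x @ x.T / x.shape[1]
        w, v = np.linalg.eigh(cov)          # ascending eigenvalues
        basis = v[:, np.argsort(w)[::-1]]   # strongest first
        return basis.T @ x, basis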

> Compare this to rendering from object encoded content (i.e. mono signals plus directional metadata). You need only two convolutions per object.

Atmos, yeah. Here we go again. :D

If you encode those in a principled fashion, as part of a soundfield, 1) you don't have to encode their direction any more accurately than your hearing requires, and 2) their statistical similarity, especially if they are close to each other, will lead to denser coding in toto. Both in analogue and especially in digital encoding. You can fit more directional audio into less bandwidth, whichever way.
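
Up to normalisation convention, the encoding is a pure projection: a mono object s(t) from direction (\theta, \phi) becomes

    b_{nm}(t) = s(t)\, Y_{nm}(\theta, \phi), \qquad 0 \le n \le N,\ -n \le m \le n,

and two nearby objects yield nearly proportional coefficient vectors -- exactly the inter-channel redundancy a decent joint coder can squeeze out.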

> Starting from a sufficiently dense HRIR set, you can easily generate a
> new set on a regular grid with a few thousand points, and interpolate
> them (VBAP style) in real time.

That's the point: "sufficiently dense". What if it isn't dense in some parts of the sphere, such as below? How do you interpolate for your integration *there*? I mean, very few HRIR/HRTF sets, the well-known KEMAR set included, have adequate coverage towards straight down.

That makes it very difficult to extrapolate towards a whole-sphere solution. It makes the whole-sphere solution irregular, and often downright ill-posed.
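
The VBAP-style interpolation itself is the easy part; a sketch with invented names, assuming you've already triangulated the measured directions:

    import numpy as np

    def vbap_gains(triangle, direction):
        """Gains for blending three measured HRIRs.

        triangle:  (3, 3) unit vectors of the enclosing measured
                   directions, one per row.
        direction: (3,) unit vector of the target direction.
        Solves direction = triangle.T @ g, then normalises; the same
        gains weight the three HRIRs.
        """
        g = np.linalg.solve(triangle.T, direction)
        return g / np.linalg.norm(g)

    tri = np.eye(3)  # toy triangle: the three coordinate axes
    print(vbap_gains(tri, np.ones(3) / np.sqrt(3.0)))

The failure mode is right there: below the head there may be no enclosing triangle to solve for at all, and a long, thin one gives wildly unbalanced gains.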

> This can give you the same resolution as e.g. order 40 Ambisonics at a fraction of the complexity.

No, it can't. It seemingly can, but if you go through the theory that led to the idea of perfect quadrature, in the Fourier domain you will *actually* be introducing quite a number of aliasing artifacts in space/direction. They will also be difficult to control or bound, even if we know they will be minor as such. For instance, their interference products might sometimes be arbitrarily large.
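
Roughly, the constraint is this: a quadrature with nodes u_k and weights w_k reproduces the continuous orthonormality of the harmonics,

    \sum_k w_k\, Y_{n'm'}^{*}(\mathbf{u}_k)\, Y_{nm}(\mathbf{u}_k)
      = \delta_{nn'}\,\delta_{mm'},

only while the product degree n + n' stays within what the node set integrates exactly (t >= n + n' for a spherical t-design). Everything above that folds back down as directional aliasing, and on a sparse, irregular set those folded terms are hard to even bound.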

> Ciao,

Bye, my friend. :)
--
Sampo Syreeni, aka decoy - de...@iki.fi, http://decoy.iki.fi/front
+358-40-3751464, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2