On 2022-09-13, Fons Adriaensen wrote:

> Even in that case it isn't as simple as you seem to think.

Obviously I simplify. I know a thing or two already.

> Any set of measured HRIR will need some non trivial preprocessing before it can be used. One reason is low-frequency errors. Accurate IR measurements below say 200 Hz are difficult (unless you have a very big and good anechoic room). OTOH we know that HRIR in that frequency range are very low order and can be synthesised quite easily.

As such, we put a priori knowledge into the model, and/or somehow repeat the measurement, coherently adding the resulting signals so as to bring down the noise floor.
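
A back-of-the-envelope numpy sketch of that averaging idea (names and numbers are mine, purely illustrative): averaging N time-aligned repeats leaves the response intact, while the uncorrelated noise power drops by a factor of N.

    import numpy as np

    def average_irs(irs):
        """Coherently average repeated, time-aligned IR measurements.

        irs: (N, L) array -- N repeats of an L-sample response.
        The identical response adds coherently; uncorrelated noise
        power drops by ~10*log10(N) dB.
        """
        return np.asarray(irs, dtype=float).mean(axis=0)

    # Toy check: a decaying "response" plus noise, repeated 16 times.
    rng = np.random.default_rng(0)
    true_ir = np.exp(-np.arange(1024) / 100.0)
    repeats = true_ir + 0.1 * rng.standard_normal((16, 1024))
    est = average_irs(repeats)
    print(np.var(est - true_ir) / np.var(repeats[0] - true_ir))  # ~1/16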

> Another reason is that you can't reduce a set of HRIR to low order (the order of the content you want to render) without introducing significant new errors.

I believe the systematic way to talk about this is to note that reducing directional sources to a central frame is a Fourier-Bessel decomposition, which doesn't map easily onto the rectilinear Fourier decomposition. Even their truncation orders aren't comparable: a field of low, finite order in the one frame in general has an infinite-order decomposition in the other.
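
The textbook illustration: a single plane wave, the simplest finite-order object of the rectilinear decomposition, needs every spherical harmonic order at once,

    e^{i\mathbf{k}\cdot\mathbf{r}}
      = 4\pi \sum_{n=0}^{\infty} \sum_{m=-n}^{n}
        i^{n}\, j_{n}(kr)\, Y_{nm}^{*}(\hat{\mathbf{k}})\, Y_{nm}(\hat{\mathbf{r}}),

though the j_n(kr) terms decay rapidly once n > kr, which is what makes finite-order truncation work at all within a small enough sweet spot.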

But they work pretty well against each other for most outside-the-rig sources. The higher-order cross-terms in the transform fall off fast, so you can approximate pretty well in either direction. That's where the Daniel, Nicol & Moreau NFC-HOA paper comes from. (They also did Hankel functions, for outward-going energy transfer. Their solution is exact, and they've discussed the connection to rectilinear WFS, but even they didn't quantify it all fully.)
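
From memory, and modulo Daniel's exact conventions: the order-n near-field compensation filter ends up as a ratio of spherical Hankel terms evaluated at the source radius r_s and the loudspeaker radius r_l, up to an overall gain and delay,

    F_n(\omega) \;\propto\; \frac{h_n^{(2)}(\omega r_s / c)}{h_n^{(2)}(\omega r_l / c)},

which is exact for outgoing spherical waves, and also explains the notorious low-frequency bass boost that grows with order.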

> One way to reduce these is to reduce or even fully remove ITD at mid and high frequencies, again depending on the order the renderer is supposed to support.

This is all implicit in the order of reconstruction. ITD is just a derivative of the soundfield across your ears. Of course Gerzon started from Makita's theory, but that in turn is derivable first from the acoustic wave equation, and then from its ecologically minded, reduced psychoacoustics. Once you go to third-order ambisonics or beyond, no psychoacoustics are necessary.
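
In the crudest sketch, ignoring head diffraction entirely: two pressure samples a distance d apart, plane wave from azimuth \theta,

    \Delta\varphi(\omega) = \frac{\omega d}{c}\,\sin\theta
    \qquad\Longrightarrow\qquad
    \mathrm{ITD} = \frac{d}{c}\,\sin\theta,

which is precisely the information the first-order gradient components already carry; the higher orders then just refine the diffraction detail.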

> Getting the magnitudes (and hence ILD) accurate requires much lower order than if you also want to keep the delays.

My point is that in binaural work, especially if head tracked, you can easily get to order twenty or so. No ITD/ILD analysis needed, because it'll mimic physical reality.

If we can just do it right, that is. How do we, from a sparse and anisotropic measurement set?

> Compared to these and some other issues, not having a set on a regular grid (e.g. t-design or Lebedev) is the least of problems you will encounter.

Tell me more? I don't recognize these ones just yet.

> There are other considerations. For best results you need head tracking and a plausible room sound (even if the content already includes its own).

On plausible room reverberation, I might have some ideas as well. :)

> The practical solutions do not depend on such concepts and are much more ad-hoc. Some members of my team and myself worked on them for the last three years. Most of the results are confidential, although others (e.g. IEM) have arrived at some similar results and published them.

IEM is serious shit. If the results are confidential, so be it, but at the same time, if I had something to contribute, I'd happily sign an NDA. Just to be in the know, and to contribute; I've done it before, and would do it again.

It'd just be fun to actually solve, or at least quantify the limits of, this particular mathematical problem. How well can you actually go from, say, a KEMAR set to an ambisonic-to-binaural rendition? Isotropically? If you don't really have the downward part of the set? How do you interpolate, and what's your overall error metric? And so on?!? My problem.

> Another question is if for high quality binaural rendering, starting from Ambisonic content is a good idea at all.

Obviously it isn't, in all cases. For example, if your content is such that the listener mostly looks forward, say towards a movie screen, a fully isotropic ambisonic signal of any order wastes bandwidth/entropy. A lot of it. Even at first order, probably something like 2+epsilon channels' worth. If they used just pantophonic ambisonics, they'd get more typical theatrical sound. In fact, if they really optimized the thing for frontal sound, plus maybe a supplemental, theatrically minded surround track, as in Dolby Surround, it might need even less.

But the thing is, ambisonics has always been an exercise in generality, regularity, and thus virtual auditory reality with mathematical certainty, above immediate efficiency or cheapness. It's never been about what is easy, but about being able to look all around and perceive the same isotropic soundfield, even if you look up or down. The auditory idea of what we now call "immersive VR".

Sure, there have been many compatibility formats along the way, to ease "The Transition". But quite surely the whole of e.g. Gerzon's vision has been for all of us to move into something like full holophony.

> Simple fact is that if you want really good results you need very high order,

Yes, though you can do better in limited circumstances. There's lots of work in this vein in the early pantophonic ambisonic literature, and beyond. TriField was one example, I think, as was Gerzon's work on compatible frontal stereo.

> 1. such content isn't available from direct recordings (we don't have even 10th order microphones), so it has to be synthetic,

True. Which is probably why we have things like Dolby Atmos.

But at the same time, all of these formats nowadays include at least first-order ambisonics, and often right up to third order. Because it's very difficult to make synthetic sound out of a recording of real, live sound. The only real, systematic way of recording full 3D sound, even now, is via ambisonic principles.

> 2. rendering it from an Ambisonic format would be very inefficient. For example for order 20 you'd need 441 convolutions if you assume L/R head symmetry, twice that number if you don't.

I am not too certain that is true. If you wanted to implement that efficiently, you would do a reduction to 3rd order, or at most 4th-5th order, isotropic. Because 3rd order already sounds pretty damn exact. (I've had the distinct privilege of going into an anechoic room to listen to just that, thanks to Ville Pulkki and Archontis Politis, at the Aalto University Lab of Acoustics and Signal Processing. With their tweaks, no less.)
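
Just to make those channel counts concrete:

    (N+1)^2\big|_{N=20} = 441, \qquad
    (N+1)^2\big|_{N=5}  = 36,  \qquad
    (N+1)^2\big|_{N=3}  = 16,

so a reduction to 3rd-5th order already cuts the per-ear convolution count by an order of magnitude.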

I'm thoroughly sure that even in the LTI regime you can reduce those convolutions to a small fraction of the original while retaining full perceptual quality. And then, as I once surmised, you can also, within the ambisonic framework, exchange the order of the convolutions and the matrixing, for less work at lower directional order. (And, for gaming and VR, for zero processing latency, at a cost.)
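
A minimal sketch of that exchange, with made-up names: instead of decoding to K virtual speakers and running 2K HRIR convolutions, project the measured HRIRs onto the spherical harmonics once, offline, and then it's one convolution per ambisonic channel per ear.

    import numpy as np
    from scipy.signal import fftconvolve

    def sh_domain_binaural(amb, hrir_sh_l, hrir_sh_r):
        """Binaural render straight from ambisonic channels.

        amb:        (C, T) ambisonic signal, C = (N+1)**2 channels.
        hrir_sh_*:  (C, L) HRIRs projected onto the same C harmonics,
                    computed offline from the measured set.
        One convolution per channel per ear, independent of the
        number of virtual loudspeakers.
        """
        left = sum(fftconvolve(a, h) for a, h in zip(amb, hrir_sh_l))
        right = sum(fftconvolve(a, h) for a, h in zip(amb, hrir_sh_r))
        return np.stack([left, right])

With L/R head symmetry and the usual real-SH conventions you'd store only one of the two filter sets and flip the signs of the negative-m channels for the other ear, which is where Fons's factor of two comes from.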

The fun thing here is that I come at this as a signal processing fiend who has never actually implemented a single algorithm, but who has read it all. Starting from things like the Gardner convolution algorithm, which is how we'd implement zero-delay convolution, even over multiple channels/dimensions. Or the Karhunen-Loeve transform, as the optimum energy-compacting one, above even the second-best Discrete Cosine Transform, which is what we'd probably use here, over both channels and time.
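
The core of the Gardner idea fits in a few lines. Offline the identity is trivial; the real-time point is that the short head can run as a direct, zero-latency FIR, while the tail's output isn't needed until head_len samples later -- exactly the slack that lets it run in big, cheap FFT partitions (recursively so, in Gardner's actual scheme). A sketch, names mine:

    import numpy as np
    from scipy.signal import fftconvolve

    def split_convolution(x, h, head_len=64):
        """conv(x, h) = conv(x, head) + head_len-delayed conv(x, tail)."""
        head, tail = h[:head_len], h[head_len:]
        y = np.zeros(len(x) + len(h) - 1)
        y[:len(x) + len(head) - 1] += fftconvolve(x, head)
        y[head_len:] += fftconvolve(x, tail)  # the delayed, cheap part
        return y

    rng = np.random.default_rng(1)
    x, h = rng.standard_normal(4800), rng.standard_normal(1024)
    assert np.allclose(split_convolution(x, h), fftconvolve(x, h))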

I have in fact been thinking, for the longest time, about how to pack ambisonic signals optimally. None of the current systems work, because they are aimed at intensity stereo; the precise thing ambisonics originally stood against.
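
For what I mean by packing: a Karhunen-Loeve transform across the ambisonic channels decorrelates them and packs the energy into the first few rows, which is the property any channel coder would then exploit; the DCT approximates the same basis without having to transmit it. A sketch, untested beyond the toy case:

    import numpy as np

    def klt(channels):
        """KLT across channels. channels: (C, T) array.

        Returns (decorrelated, basis): decorrelated = basis.T @ x has
        a diagonal covariance, strongest component first.
        """
        x = channels - channels.mean(axis=1, keepdims=True)
        cov = x @ x.T / x.shape[1]
        w, v = np.linalg.eigh(cov)          # ascending eigenvalues
        basis = v[:, np.argsort(w)[::-1]]   # strongest first
        return basis.T @ x, basis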

> Compare this to rendering from object encoded content (i.e. mono signals plus directional metadata). You need only two convolutions per object.

Atmos, yeah. Here we go again. :D

If you encode those in a principled fashion, as part of a soundfield, 1) you don't have to encode their direction any more accurately than your hearing requires, and 2) their statistical similarity, especially if they are close to each other, will lead to denser coding in toto. Both in analogue and especially in digital encoding. You can fit more directional audio into less bandwidth, whichever way.
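
Up to normalisation convention, the encoding is a pure projection: a mono object s(t) from direction (\theta, \phi) becomes

    b_{nm}(t) = s(t)\, Y_{nm}(\theta, \phi), \qquad 0 \le n \le N,\ -n \le m \le n,

and two nearby objects yield nearly proportional coefficient vectors -- exactly the inter-channel redundancy a decent joint coder can squeeze out.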

> Starting from a sufficiently dense HRIR set, you can easily generate a
> new set on a regular grid with a few thousand points, and interpolate
> them (VBAP style) in real time.

That's the point: "sufficiently dense". What if it isn't dense in some parts of the sphere, such as below? How do you interpolate for your integration *there*? I mean, very few HRIR/HRTF sets, the well-known KEMAR set included, have adequate coverage towards straight down.

That makes it very difficult to extrapolate towards a whole-sphere solution. It makes the whole-sphere solution irregular, and often downright ill-posed.
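
The VBAP-style interpolation itself is the easy part; a sketch with invented names, assuming you've already triangulated the measured directions:

    import numpy as np

    def vbap_gains(triangle, direction):
        """Gains for blending three measured HRIRs.

        triangle:  (3, 3) unit vectors of the enclosing measured
                   directions, one per row.
        direction: (3,) unit vector of the target direction.
        Solves direction = triangle.T @ g, then normalises; the same
        gains weight the three HRIRs.
        """
        g = np.linalg.solve(triangle.T, direction)
        return g / np.linalg.norm(g)

    tri = np.eye(3)  # toy triangle: the three coordinate axes
    print(vbap_gains(tri, np.ones(3) / np.sqrt(3.0)))

The failure mode is right there: below the head there may be no enclosing triangle to solve for at all, and a long, thin one gives wildly unbalanced gains.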

> This can give you the same resolution as e.g. order 40 Ambisonics at a fraction of the complexity.

No, it can't. It seemingly can, but if you go through the theory that led to the idea of perfect quadrature, in the Fourier domain you will *actually* be introducing quite a number of aliasing artifacts in space/direction. They will also be difficult to control or bound, even if we know they will be minor as such. For instance, their interference products might sometimes be arbitrarily large.
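
Roughly, the constraint is this: a quadrature with nodes u_k and weights w_k reproduces the continuous orthonormality of the harmonics,

    \sum_k w_k\, Y_{n'm'}^{*}(\mathbf{u}_k)\, Y_{nm}(\mathbf{u}_k)
      = \delta_{nn'}\,\delta_{mm'},

only while the product degree n + n' stays within what the node set integrates exactly (t >= n + n' for a spherical t-design). Everything above that folds back down as directional aliasing, and on a sparse, irregular set those folded terms are hard to even bound.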

> Ciao,

Bye, my friend. :)
--
Sampo Syreeni, aka decoy - de...@iki.fi, http://decoy.iki.fi/front
+358-40-3751464, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2