On 2022-05-25, Fons Adriaensen wrote:

I am still not too sure if even now things are standardised enough for anyone to write a fresh file format for it that pleases everyone.

It doesn't have to please anybody at all. It just needs to work, and to be implemented in a widely known FLOSS library. Repetto's libsndfile always comes to mind here, from discussions on the music-dsp list.

Suffice it to say, such things can't just be written; they have to be implemented and tested, e.g. on vast tiered speaker layouts which very few people have access to.

I'd argue that, library-wise, even a very basic decoder would do most of the heavy lifting. Once the foot is in the door, developers would jump at the opportunity to better their wares.

So what's missing from the base ambisonic toolchain? Not the file format, because AmbiX/CAF does just fine, and its extended format *more* than fine, as Fons just laid out.

What's missing is plug-and-play integration with libsndfile and plain binary, a basic shelving nth-order decoder, and nth-order streaming primitives for processing, as an integrated and optimized FLOSS package, written in C. Ported to the main architectures used out there for sound playback and recording, so x86/64 and ARM, and maybe Verilog/VHDL for FPGA/ASIC work. If you were to be funny about it, an MPI implementation for massively parallel research work.
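To make that concrete, here is a minimal sketch of the streaming end of such a package, in C, assuming nothing but libsndfile's stock API; the decoder is a dummy stand-in (it only folds W out to a fixed, assumed speaker count), since the real shelving decoder is exactly the piece that's missing:

/* Sketch: stream an AmbiX-style multichannel file through a decoder,
 * using libsndfile.  The decoder below is a placeholder, not the real
 * nth-order shelving decoder this thread is asking for. */
#include <sndfile.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_FRAMES 1024

/* Stand-in decoder: copies W (ACN 0) to every speaker, so this runs. */
static void decode_block(const float *in, int channels,
                         float *out, int speakers, sf_count_t frames)
{
    for (sf_count_t f = 0; f < frames; f++)
        for (int s = 0; s < speakers; s++)
            out[f * speakers + s] = in[f * channels + 0];
}

int main(int argc, char **argv)
{
    if (argc < 2) return 1;

    SF_INFO info = {0};
    SNDFILE *sf = sf_open(argv[1], SFM_READ, &info);
    if (!sf) { fprintf(stderr, "%s\n", sf_strerror(NULL)); return 1; }

    const int speakers = 8;   /* assumption: a fixed octagon rig */
    float *in  = malloc(sizeof *in  * BLOCK_FRAMES * info.channels);
    float *out = malloc(sizeof *out * BLOCK_FRAMES * speakers);

    sf_count_t got;
    while ((got = sf_readf_float(sf, in, BLOCK_FRAMES)) > 0) {
        decode_block(in, info.channels, out, speakers, got);
        /* ...hand 'out' to whatever audio backend you like here... */
    }

    free(in); free(out);
    sf_close(sf);
    return 0;
}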

And of course it will need to use a 64-bit-friendly file format too...

Which CAF is. RIFF can be so as well, but not *nearly* as neatly.

Ambix seems to be the de facto standard today, except that instead of the required CAF container most people seem to use the same channel order and gains in a WAVEX file. Which in turn means that the 'extended' format is not available, as it depends on CAF's UUID chunk.

Obviously RIFF, as in WAVEX, is a chunked format as well. It's well capable of carrying a UUID-style optional chunk. In fact, IFF/RIFF are the progenitors of CAF/MPEG-4 BMFF as well as of WAVEX/RF64. The nested TLV structure, akin even to ASN.1, is fully retained.
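For the record, the two chunk headers are near-identical TLV; the real difference is the size field, which is also the whole of the 64-bit argument above. The layouts below are paraphrased from memory of the RIFF and CAF specs, so check them before relying on the details:

/* Rough sketch of the two on-disk chunk headers, to show the common
 * TLV shape.  Not safe to fread() directly: the compiler may insert
 * padding (notably before the 64-bit CAF size), and byte order is the
 * reader's problem. */
#include <stdint.h>

struct riff_chunk_header {         /* RIFF / WAVEX */
    char     id[4];                /* FourCC tag, e.g. "fmt ", "data"     */
    uint32_t size;                 /* payload size, little-endian, 32 bit */
    /* payload follows, padded to an even byte count */
};

struct caf_chunk_header {          /* CAF */
    char     type[4];              /* FourCC tag, e.g. "desc", "uuid"     */
    int64_t  size;                 /* payload size, big-endian, 64 bit    */
    /* payload follows; a 'uuid' chunk's payload starts with a 16-byte UUID */
};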

A problem with the Ambix spec is that the internet never forgets, and if you search for it you could easily find the now-invalid original document, or a version of the same with 'corrections' that make it hard to read and interpret.

None of that matters if you just go and implement the thing and spread your code around. "Code is law." Not the spec.

What is IMHO missing in the Ambix format is an optional UUID chunk that would contain the same info as a broadcast WAV's bext chunk, and in exactly the same format.

Ought to be simple enough to add as side metadata, right? Also usable by any part of the toolchain I outlined above.
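As a sketch of how simple, assuming CAF's rule that a 'uuid' chunk's payload begins with the 16-byte UUID itself: just wrap the existing bext blob as-is. The UUID value below is a made-up placeholder; a real extension would register a fixed one.

/* Sketch: wrap an existing Broadcast WAV 'bext' payload, untouched, in
 * a CAF 'uuid' chunk.  The UUID is a dummy placeholder. */
#include <stdio.h>
#include <stdint.h>

static const uint8_t BEXT_UUID[16] = {
    0xde, 0xad, 0xbe, 0xef, 0x00, 0x01, 0x02, 0x03,
    0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b
};

static void write_be64(FILE *f, int64_t v)   /* big-endian, as CAF wants */
{
    for (int i = 7; i >= 0; i--)
        fputc((unsigned)(((uint64_t)v >> (8 * i)) & 0xff), f);
}

int write_bext_uuid_chunk(FILE *f, const void *bext, size_t bext_len)
{
    if (fwrite("uuid", 1, 4, f) != 4) return -1;          /* chunk type */
    write_be64(f, (int64_t)(sizeof BEXT_UUID + bext_len)); /* chunk size */
    if (fwrite(BEXT_UUID, 1, sizeof BEXT_UUID, f) != sizeof BEXT_UUID)
        return -1;                                        /* 16-byte UUID */
    if (fwrite(bext, 1, bext_len, f) != bext_len) return -1; /* payload  */
    return 0;
}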

What I can't advocate for are tons of different options for ordering, interleaving and bit widths. Those are impossible to implement efficiently, especially in hardware. What you need is a fixed global channel order, from which channels can be subtracted. You can't do that by including null channels, because they would consume memory bandwidth as well. What you can do is specify a fixed order and then mask channels out of it; this is what was done in the USB specification, and it's even more doable within some derivative of AmbiX.
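A sketch of what I mean, assuming ACN as the fixed global order; a 64-bit mask covers ACN 0..63, i.e. full 3-D order 7, which is plenty for illustration:

/* "Fixed global order + mask": channels are always stored in ascending
 * ACN order, and a bitmask says which of them are actually present.
 * No null channels are ever stored. */
#include <stdint.h>

/* Map the i-th *stored* channel to its ACN index, or -1 if out of range. */
static int stored_to_acn(uint64_t mask, int i)
{
    for (int acn = 0; acn < 64; acn++) {
        if (mask & ((uint64_t)1 << acn)) {
            if (i == 0)
                return acn;
            i--;
        }
    }
    return -1;
}

/* Example: horizontal-only (sectorial) components up to order 2, i.e.
 * ACN 0 (W), 1 (Y), 3 (X), 4 (V), 8 (U)  ->  mask 0x11B. */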

It's also what Martin Leese and I tried to do when we specified the (defunct from the beginning) Ogg Channel mapping machinery ( https://wiki.xiph.org/OggPCM ). That piece of work tried to order all of the possible multichannel types by way of the existing bitmasks for individual channels, so basically by their quantized directional angle.

...but not quite so in toto, because there are plenty of channel definitions in use with extra semantics. E.g. the THX standard wants side/back channels which are front-back dipoles, with no direct sound towards the central listener. This sort of thing is almost impossible to describe within the ambisonic framework without including a high-order room model. And besides, semantically, the minimum model is already encapsulated by the *intention* of the emitter being there: it's a left-back/right-back emitted field with no directed sound, intended to excite diffuse room modes from there, regardless of the size of the auditorium. It's rather difficult to express something like that precisely in the ambisonic framework, yet it's still something you have to handle in cinematic audio, so we/I just left it in there to be worked out by the decoder.

Such metadata is really essential for some users. It could easily be added without breaking anything.

So why not just copy the format and throw it into a chunk? That's what TLV formats like CAF are meant to do. Obviously, since it's metadata, it should go before the massive data proper. But otherwise, why not just throw it in there, as an option? It costs basically nothing gauged against the gigabytes or even terabytes of multichannel audio, it works just fine with the IFF-style TLV extensibility machinery, and the standard even says the reader ought to disregard any chunk whose tag it doesn't recognize. So sayeth IFF/RIFF/CAF/BMFF, all of them.
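And that "disregard what you don't recognize" rule is a one-screen reader loop. A sketch for the RIFF flavour, over an in-memory buffer and assuming a little-endian host; the handler callback and the list of "known" tags are purely illustrative:

/* Walk a RIFF-style chunk list: handle the tags you know, silently
 * skip everything else, exactly as the spec says. */
#include <stdint.h>
#include <string.h>
#include <stddef.h>

typedef void (*chunk_handler)(const char tag[4],
                              const uint8_t *data, uint32_t size);

static void walk_riff_chunks(const uint8_t *p, size_t len, chunk_handler handle)
{
    while (len >= 8) {
        char tag[4];
        uint32_t size;
        memcpy(tag, p, 4);
        memcpy(&size, p + 4, 4);            /* assumes little-endian host */
        if ((size_t)size > len - 8)
            break;                          /* truncated chunk, stop      */
        if (!memcmp(tag, "fmt ", 4) || !memcmp(tag, "data", 4) ||
            !memcmp(tag, "bext", 4))
            handle(tag, p + 8, size);
        /* anything else: skipped without complaint                       */
        size_t adv = 8 + (size_t)size + (size & 1);   /* RIFF pads to even */
        if (adv > len)
            break;
        p   += adv;
        len -= adv;
    }
}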

[1] Except for the channel description chunk, which is the usual mishmash of everything the authors could imagine, and still not capable of describing arbitrary channel uses or of just saying 'undefined'. Luckily it is not required for the Ambix format.

In that OggPCM work of mine and Leese's, I actually thought of doing the mapping, so to speak, 'right': as in fully general for ambisonic and even WFS work.

The trouble is that it would have been almost unimplementable. Just think about it: first you want to do compatibility coding from L/R to M/S and back, because you want to hold your original signal set as-is for preservation purposes. Then you want to do the same for 4.0, 5.1, whatever intensity-panned sources you might have. Then you want to do a static POA folddown to stereo, or maybe to a slant octagon. You want to do all of that adaptivity using current, minimal audio hardware, which knows pretty much nothing at all about the ambisonic framework, while staying forward compatible. You want to support *all* of the extant intensity-panning frameworks, as the reigning paradigm, while being at least somewhat compatible with ambisonics, esp. pre-encoded C-formats like BHJ and G, plus mixed order, so that you can in general do pantophony as well, cheaply.
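Just the first leg of that, L/R to M/S and back, is a trivial 2x2 matrix; the 0.5 scaling below is one common convention among several (1/sqrt(2) is another), so treat it as an assumption:

/* L/R <-> M/S compatibility matrixing, the simplest link in that chain. */
static inline void lr_to_ms(float l, float r, float *m, float *s)
{
    *m = 0.5f * (l + r);
    *s = 0.5f * (l - r);
}

static inline void ms_to_lr(float m, float s, float *l, float *r)
{
    *l = m + s;
    *r = m - s;
}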

Pretty much the only way to get even into the ballpark is to have the kind of sparse decoding matrix I described, and to leave it at real, fixed-point, 16 bits. That's the only format you can universally work with, given current CPUs, DSPs, slicers, network controllers, and the rest of the hardware/software. Do anything else and not all of the I/O combinations are (as) workable.
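By which I mean something like this, as a sketch: coefficients in Q15, only the non-zero entries stored, listed per output channel. The layout and the Q15 choice are illustrative, not a spec:

/* Sparse, 16-bit fixed-point decoding matrix, one row per output. */
#include <stdint.h>

struct sparse_row {
    const int16_t *coef;   /* Q15 coefficients, one per referenced input */
    const uint8_t *in;     /* which input channel each coefficient taps  */
    int            n;      /* number of non-zero entries in this row     */
};

/* One frame: out[o] = sum_k coef[k] * in[in[k]], accumulated in 32 bits. */
static void decode_frame(const struct sparse_row *rows, int n_out,
                         const int16_t *in, int16_t *out)
{
    for (int o = 0; o < n_out; o++) {
        int32_t acc = 0;
        for (int k = 0; k < rows[o].n; k++)
            acc += (int32_t)rows[o].coef[k] * in[rows[o].in[k]];
        acc >>= 15;                           /* back from Q15 */
        if (acc >  32767) acc =  32767;       /* saturate      */
        if (acc < -32768) acc = -32768;
        out[o] = (int16_t)acc;
    }
}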

Obviously what you'd *really* want in the metadata is a fully time- and frequency-agile matrix, which incorporates both the central soundfield *and* the whole of the surroundings. I certainly know how to encode such a thing: just take your favourite ambisonic central decomposition, sparsify it, and emit the metadata. Do your favourite sparsification as you wish, frame by frame, feeding out metadata for decoding as you go.

Then model your environs the same way. Just as you model the incoming waves on the Bessel side, model the outgoing ones via Hankel functions, to whatever accuracy you want. Because then your decoder can compute what comes back from your now arbitrarily scaled environs, by reflection and diffraction. Say, like what happens when you go, in an RPG, from a room to an arena; and then what happens when the arena has a big, flat wall, with a specular reflection to the right, at worst falling back down on you because of the mortar shot you just fired. So the overall system isn't LTI, but just multichannel LI, while you're moving against the shockwave you just put out, against the reflector.
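In the usual notation those are the two halves of one and the same expansion: the regular (Bessel) part for the central, incoming field, and the radiating (Hankel) part for what the environs throw back; whether you write h^(1) or h^(2) depends on your time convention. Roughly:

    p(r, \theta, \phi; k) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n}
        \left[ A_{nm}(k)\, j_n(kr) + B_{nm}(k)\, h_n^{(2)}(kr) \right]
        Y_{nm}(\theta, \phi)

Here the A_{nm} are the familiar ambisonic interior coefficients, and the B_{nm} describe the outgoing, reflected and diffracted field.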

All of this is easily describable. In an audio format. It just needs lots of coefficients. The trouble is that it needs *lots* of coefficients, and they need to *stream* in order to properly describe where you are in the game landscape. Quite probably you need more coefficients for your filters than you actually have audio data, even in high-order ambisonics.

That's why you need to find useful, common, a priori bases in which to express your soundfields, and not just the most general spherical harmonic decomposition. You need to be able to do both, like Dolby Atmos and MPEG-4 do, in both time and space. You probably need spaces in your overall acoustic model which encode directional reverberation statistically, even, without modelling the precise wave propagation via spherical Bessel or Hankel functions. You need to be able to simplify, and *not* pass on all of those various coefficients, even to decode your signals properly.

Finally, how would I encode the true, lossless matrix into the OggPCM stream? Well...

As a multidimensional wavelet tree, zero-tree encoded, using a sufficiently high order Daubechies mother-wavelet.
--
Sampo Syreeni, aka decoy - de...@iki.fi, http://decoy.iki.fi/front
+358-40-3751464, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2