On 2022-05-25, Fons Adriaensen wrote:

I am still not too sure if even now things are standardised enough for anyone to write a fresh file format for it that pleases everyone.

It doesn't have to please anybody at all. It just needs to work, and to be implemented in a widely known FLOSS library. Repetto's libsndfile always comes to mind here, from discussions on the music-dsp list.

Suffice it to say, such things can't just be written; they have to be implemented and tested, e.g. on vast tiered speaker layouts which very few people have access to.

I'd argue that, library-wise, even a very basic decoder would do most of the heavy lifting. Once the foot is in the door, developers would jump at the opportunity to better their wares.

So what's missing from the base ambisonic toolchain? Not the file format, because AmbiX/CAF does just fine, and its extended format *more* than fine, as Fons just laid out.

What's missing is plug-and-play integration with libsndfile and plain binary, a basic shelving nth-order decoder, and nth-order streaming primitives for processing, as an integrated and optimized FLOSS package, written in C. Ported to the main architectures used out there for sound playback and recording, so x86/64 and ARM, and maybe Verilog/VHDL for FPGA/ASIC work. If you were to be funny about it, an MPI implementation for massively parallel research work.
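To make that concrete, here is a minimal sketch of the streaming end of such a package, in C, assuming nothing but libsndfile's stock API; the decoder is a dummy stand-in (it only folds W out to a fixed, assumed speaker count), since the real shelving decoder is exactly the piece that's missing:

/* Sketch: stream an AmbiX-style multichannel file through a decoder,
 * using libsndfile.  The decoder below is a placeholder, not the real
 * nth-order shelving decoder this thread is asking for. */
#include <sndfile.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_FRAMES 1024

/* Stand-in decoder: copies W (ACN 0) to every speaker, so this runs. */
static void decode_block(const float *in, int channels,
                         float *out, int speakers, sf_count_t frames)
{
    for (sf_count_t f = 0; f < frames; f++)
        for (int s = 0; s < speakers; s++)
            out[f * speakers + s] = in[f * channels + 0];
}

int main(int argc, char **argv)
{
    if (argc < 2) return 1;

    SF_INFO info = {0};
    SNDFILE *sf = sf_open(argv[1], SFM_READ, &info);
    if (!sf) { fprintf(stderr, "%s\n", sf_strerror(NULL)); return 1; }

    const int speakers = 8;   /* assumption: a fixed octagon rig */
    float *in  = malloc(sizeof *in  * BLOCK_FRAMES * info.channels);
    float *out = malloc(sizeof *out * BLOCK_FRAMES * speakers);

    sf_count_t got;
    while ((got = sf_readf_float(sf, in, BLOCK_FRAMES)) > 0) {
        decode_block(in, info.channels, out, speakers, got);
        /* ...hand 'out' to whatever audio backend you like here... */
    }

    free(in); free(out);
    sf_close(sf);
    return 0;
}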

And of course it will need to use a 64-bit-friendly file format too...

Which CAF is. RIFF can be so as well, but not *nearly* as neatly.

Ambix seems to be the de facto standard today, except that instead of the required CAF container most people seem to use the same channel order and gains in a WAVEX file. Which in turn means that the 'extended' format is not available, as it depends on CAF's UUID chunk.

Obviously RIFF, as in WAVEX, is a chunked format as well. It's well capable of carrying a UUID-style optional chunk. In fact, IFF/RIFF are the progenitors of CAF/MPEG-4 BMFF as well as of WAVEX/RF64. The nested TLV structure, akin even to ASN.1, is fully retained.
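For the record, the two chunk headers are near-identical TLV; the real difference is the size field, which is also the whole of the 64-bit argument above. The layouts below are paraphrased from memory of the RIFF and CAF specs, so check them before relying on the details:

/* Rough sketch of the two on-disk chunk headers, to show the common
 * TLV shape.  Not safe to fread() directly: the compiler may insert
 * padding (notably before the 64-bit CAF size), and byte order is the
 * reader's problem. */
#include <stdint.h>

struct riff_chunk_header {         /* RIFF / WAVEX */
    char     id[4];                /* FourCC tag, e.g. "fmt ", "data"     */
    uint32_t size;                 /* payload size, little-endian, 32 bit */
    /* payload follows, padded to an even byte count */
};

struct caf_chunk_header {          /* CAF */
    char     type[4];              /* FourCC tag, e.g. "desc", "uuid"     */
    int64_t  size;                 /* payload size, big-endian, 64 bit    */
    /* payload follows; a 'uuid' chunk's payload starts with a 16-byte UUID */
};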

A problem with the Ambix spec is that the internet never forgets, and if you search for it you could easily find the now-invalid original document, or a version of the same with 'corrections' that make it hard to read and interpret.

None of that matters if you just go and implement the thing and spread your code around. "Code is law." Not the spec.

What is IMHO missing in the Ambix format is an optional UUID chunk that would contain the same info as a broadcast WAV's bext chunk, and in exactly the same format.

Ought to be simple enough to add as side metadata, right? Also usable by any part of the toolchain I outlined above.
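As a sketch of how simple, assuming CAF's rule that a 'uuid' chunk's payload begins with the 16-byte UUID itself: just wrap the existing bext blob as-is. The UUID value below is a made-up placeholder; a real extension would register a fixed one.

/* Sketch: wrap an existing Broadcast WAV 'bext' payload, untouched, in
 * a CAF 'uuid' chunk.  The UUID is a dummy placeholder. */
#include <stdio.h>
#include <stdint.h>

static const uint8_t BEXT_UUID[16] = {
    0xde, 0xad, 0xbe, 0xef, 0x00, 0x01, 0x02, 0x03,
    0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b
};

static void write_be64(FILE *f, int64_t v)   /* big-endian, as CAF wants */
{
    for (int i = 7; i >= 0; i--)
        fputc((unsigned)(((uint64_t)v >> (8 * i)) & 0xff), f);
}

int write_bext_uuid_chunk(FILE *f, const void *bext, size_t bext_len)
{
    if (fwrite("uuid", 1, 4, f) != 4) return -1;          /* chunk type */
    write_be64(f, (int64_t)(sizeof BEXT_UUID + bext_len)); /* chunk size */
    if (fwrite(BEXT_UUID, 1, sizeof BEXT_UUID, f) != sizeof BEXT_UUID)
        return -1;                                        /* 16-byte UUID */
    if (fwrite(bext, 1, bext_len, f) != bext_len) return -1; /* payload  */
    return 0;
}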

What I can't advocate for are tons of different options for ordering, interleaving and bit widths. Those are impossible to implement efficiently, especially in hardware. What you need is a fixed global channel order, from which channels can be subtracted. You can't do that by including null channels, because they would consume memory bandwidth as well. What you can do is specify a fixed order and then mask channels out of it; this is what was done in the USB specification, and it's even more doable within some derivative of AmbiX.
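A sketch of what I mean, assuming ACN as the fixed global order; a 64-bit mask covers ACN 0..63, i.e. full 3-D order 7, which is plenty for illustration:

/* "Fixed global order + mask": channels are always stored in ascending
 * ACN order, and a bitmask says which of them are actually present.
 * No null channels are ever stored. */
#include <stdint.h>

/* Map the i-th *stored* channel to its ACN index, or -1 if out of range. */
static int stored_to_acn(uint64_t mask, int i)
{
    for (int acn = 0; acn < 64; acn++) {
        if (mask & ((uint64_t)1 << acn)) {
            if (i == 0)
                return acn;
            i--;
        }
    }
    return -1;
}

/* Example: horizontal-only (sectorial) components up to order 2, i.e.
 * ACN 0 (W), 1 (Y), 3 (X), 4 (V), 8 (U)  ->  mask 0x11B. */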

It's also what Martin Leese and I tried to do when we specified the (defunct from the beginning) Ogg Channel mapping machinery ( https://wiki.xiph.org/OggPCM ). That piece of work tried to order all of the possible multichannel types by way of the existing bitmasks for individual channels, so basically by their quantized directional angle.

...but not quite so in toto, because there are plenty of channel definitions in use with extra semantics. E.g. the THX standard wants side/back channels which are front-back dipoles, with no direct sound towards the central listener. This sort of thing is almost impossible to describe within the ambisonic framework without including a high-order room model. And besides, semantically, the minimum model is already encapsulated by the *intention* of the emitter being there: it's a left-back/right-back emitted field with no directed sound, intended to excite diffuse room modes from there, regardless of the size of the auditorium. It's rather difficult to express something like that precisely in the ambisonic framework, yet it's still something you have to handle in cinematic audio, so we/I just left it in there to be worked out by the decoder.

Such metadata is really essential for some users. It could easily be added without breaking anything.

So why not just copy the format and throw it into a chunk? That's what TLV formats like CAF are meant to do. Obviously, since it's metadata, it should go before the massive data proper. But otherwise, why not just throw it in there, as an option? It costs basically nothing gauged against the gigabytes or even terabytes of multichannel audio, it works just fine with the IFF-style TLV extensibility machinery, and the standard even says the reader ought to disregard any chunk whose tag it doesn't recognize. So sayeth IFF/RIFF/CAF/BMFF, all of them.
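And that "disregard what you don't recognize" rule is a one-screen reader loop. A sketch for the RIFF flavour, over an in-memory buffer and assuming a little-endian host; the handler callback and the list of "known" tags are purely illustrative:

/* Walk a RIFF-style chunk list: handle the tags you know, silently
 * skip everything else, exactly as the spec says. */
#include <stdint.h>
#include <string.h>
#include <stddef.h>

typedef void (*chunk_handler)(const char tag[4],
                              const uint8_t *data, uint32_t size);

static void walk_riff_chunks(const uint8_t *p, size_t len, chunk_handler handle)
{
    while (len >= 8) {
        char tag[4];
        uint32_t size;
        memcpy(tag, p, 4);
        memcpy(&size, p + 4, 4);            /* assumes little-endian host */
        if ((size_t)size > len - 8)
            break;                          /* truncated chunk, stop      */
        if (!memcmp(tag, "fmt ", 4) || !memcmp(tag, "data", 4) ||
            !memcmp(tag, "bext", 4))
            handle(tag, p + 8, size);
        /* anything else: skipped without complaint                       */
        size_t adv = 8 + (size_t)size + (size & 1);   /* RIFF pads to even */
        if (adv > len)
            break;
        p   += adv;
        len -= adv;
    }
}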

[1] Except for the channel description chunk, which is the usual mishmash of everything the authors could imagine, and still not capable of describing arbitrary channel uses or of just saying 'undefined'. Luckily it is not required for the Ambix format.

In that OggPCM work of mine and Leese's, I actually thought of doing the mapping, so to speak, 'right': as in fully general for ambisonic and even WFS work.

The trouble is that it would have been almost unimplementable. Just think about it: first you want to do compatibility coding from L/R to M/S and back, because you want to hold your original signal set as-is for preservation purposes. Then you want to do the same for 4.0, 5.1, whatever intensity-panned sources you might have. Then you want to do a static POA folddown to stereo, or maybe to a slant octagon. You want to do all of that adaptivity using current, minimal audio hardware, which knows pretty much nothing at all about the ambisonic framework, while staying forward compatible. You want to support *all* of the extant intensity-panning frameworks, as the reigning paradigm, while being at least somewhat compatible with ambisonics, esp. pre-encoded C-formats like BHJ and G, plus mixed order, so that you can in general do pantophony as well, cheaply.
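Just the first leg of that, L/R to M/S and back, is a trivial 2x2 matrix; the 0.5 scaling below is one common convention among several (1/sqrt(2) is another), so treat it as an assumption:

/* L/R <-> M/S compatibility matrixing, the simplest link in that chain. */
static inline void lr_to_ms(float l, float r, float *m, float *s)
{
    *m = 0.5f * (l + r);
    *s = 0.5f * (l - r);
}

static inline void ms_to_lr(float m, float s, float *l, float *r)
{
    *l = m + s;
    *r = m - s;
}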

Pretty much the only way to get even into the ballpark is to have the kind of sparse decoding matrix I described, and to leave it at real, fixed-point, 16 bits. That's the only format you can universally work with, given current CPUs, DSPs, slicers, network controllers, and the rest of the hardware/software. Do anything else and not all of the I/O combinations are (as) workable.
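By which I mean something like this, as a sketch: coefficients in Q15, only the non-zero entries stored, listed per output channel. The layout and the Q15 choice are illustrative, not a spec:

/* Sparse, 16-bit fixed-point decoding matrix, one row per output. */
#include <stdint.h>

struct sparse_row {
    const int16_t *coef;   /* Q15 coefficients, one per referenced input */
    const uint8_t *in;     /* which input channel each coefficient taps  */
    int            n;      /* number of non-zero entries in this row     */
};

/* One frame: out[o] = sum_k coef[k] * in[in[k]], accumulated in 32 bits. */
static void decode_frame(const struct sparse_row *rows, int n_out,
                         const int16_t *in, int16_t *out)
{
    for (int o = 0; o < n_out; o++) {
        int32_t acc = 0;
        for (int k = 0; k < rows[o].n; k++)
            acc += (int32_t)rows[o].coef[k] * in[rows[o].in[k]];
        acc >>= 15;                           /* back from Q15 */
        if (acc >  32767) acc =  32767;       /* saturate      */
        if (acc < -32768) acc = -32768;
        out[o] = (int16_t)acc;
    }
}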

Obviously what you'd *really* want in the metadata is a fully time- and frequency-agile matrix, which incorporates both the central soundfield *and* the whole of the surroundings. I certainly know how to encode such a thing: just take your favourite ambisonic central decomposition, sparsify it, and emit the metadata. Do your favourite sparsification as you wish, frame by frame, feeding out metadata for decoding as you go.

Then model your environs the same way. Just as you model the incoming waves on the Bessel side, model the outgoing ones via Hankel functions, to whatever accuracy you want. Because then your decoder can compute what comes back from your now arbitrarily scaled environs, by reflection and diffraction. Say, like what happens when you go, in an RPG, from a room to an arena; and then what happens when the arena has a big, flat wall, with a specular reflection to the right, at worst falling back down on you because of the mortar shot you just fired. So the overall system isn't LTI, but just multichannel LI, while you're moving against the shockwave you just put out, against the reflector.
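In the usual notation those are the two halves of one and the same expansion: the regular (Bessel) part for the central, incoming field, and the radiating (Hankel) part for what the environs throw back; whether you write h^(1) or h^(2) depends on your time convention. Roughly:

    p(r, \theta, \phi; k) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n}
        \left[ A_{nm}(k)\, j_n(kr) + B_{nm}(k)\, h_n^{(2)}(kr) \right]
        Y_{nm}(\theta, \phi)

Here the A_{nm} are the familiar ambisonic interior coefficients, and the B_{nm} describe the outgoing, reflected and diffracted field.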

All of this is easily describable. In an audio format. It just needs lots of coefficients. The trouble is that it needs *lots* of coefficients, and they need to *stream* in order to properly describe where you are in the game landscape. Quite probably you need more coefficients for your filters than you actually have audio data, even in high-order ambisonics.

That's why you need to find useful, common, a priori bases in which to express your soundfields, and not just the most general spherical harmonic decomposition. You need to be able to do both, like Dolby Atmos and MPEG-4 do, in both time and space. You probably need spaces in your overall acoustic model which encode directional reverberation statistically, even, without modelling the precise wave propagation via spherical Bessel or Hankel functions. You need to be able to simplify, and *not* pass on all of those various coefficients, even to decode your signals properly.

Finally, how would I encode the true, lossless matrix into the OggPCM stream? Well...

As a multidimensional wavelet tree, zero-tree encoded, using a sufficiently high order Daubechies mother-wavelet.
--
Sampo Syreeni, aka decoy - de...@iki.fi, http://decoy.iki.fi/front
+358-40-3751464, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2