On 2021-11-20, Andy Farnell wrote:

Now maybe your problem looks like this (unless I have gravely misunderstood it): you want to find a set of known codes with the best statistical properties for signal dithering, which can be expressed generatively (perhaps as a set of keys for generators of keys for generators... chained thusly) and can be compactly sent or pre-known by the receiver. Searching (resyncing) is now measured in milliseconds.

This is not too far from what I'm doing at the moment, actually. But there are those pesky little things from audio and statistics which get in the way, and which have led to my current (unworkable; why I'm asking) architecture.

Basically, if we had reliable metadata telling us the system was or was not used, I could probably just use a linear feedback shift register of long enough period, using a random seed conveyed in the metadata, and be done with it. That's a problem solved on the crypto side, for some decades now. However, there are at least four problems with that.
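To make the baseline concrete, here's a minimal sketch of that solved-for-decades piece, assuming a Galois-form LFSR; the 32-bit tap mask 0x80200003 (polynomial x^32 + x^22 + x^2 + x + 1) is one known maximal-length choice, and the function names are illustrative:

```python
def lfsr32_step(state):
    """One step of a 32-bit Galois LFSR.  With a maximal-length tap
    mask, any nonzero seed yields a period of 2**32 - 1."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0x80200003  # taps 32, 22, 2, 1
    return state

def dither_bits(seed, n):
    """Draw n pseudorandom bits from the LFSR, seeded per stream
    (e.g. from a seed conveyed in metadata)."""
    state = seed
    out = []
    for _ in range(n):
        state = lfsr32_step(state)
        out.append(state & 1)
    return out
```

With a seed passed in metadata, the decoder regenerates the identical bit sequence and subtracts it coherently; everything below is about what breaks when you can't rely on that.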

First, we don't often have a side channel available on which to reliably convey whether the scheme was applied. So, if you subtract out something which wasn't there to begin with, you add another RPDF/TPDF worth of noise instead.

Second, the synchronization: only coherent subtraction works, so if there's a timing glitch of even a single sample period, even with an encoded bitstream you'll immediately go from coherent to incoherent subtraction for the rest of the stream, adding noise instead of subtracting it.
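A toy model shows how abruptly that fails: below, the encoder adds dither value d[i] and the decoder subtracts d[i + offset], so the residual power goes from exactly zero at offset 0 to twice the dither power at any other offset (the function and its parameters are illustrative, not the codec itself):

```python
import random

def residual_power(offset, n=10000, seed=7):
    """Mean-square residual after the decoder subtracts a dither
    stream misaligned by `offset` samples.  Dither is RPDF, uniform
    in (-0.5, 0.5)."""
    rng = random.Random(seed)
    d = [rng.random() - 0.5 for _ in range(n + 1)]
    return sum((d[i] - d[i + offset]) ** 2 for i in range(n)) / n
```

residual_power(0) is exactly 0; residual_power(1) comes out near 1/6, i.e. double the ~1/12 variance of the dither itself, which is the "adding instead of subtracting" case.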

Third, if you can't reliably pass a random seed to the decoder every time, you have to start from a known initial condition for your PRNG. That's not much of a problem with one-shot audio signals, but you're definitely going to run into trouble with ensembles of signals, because the dither signal is then always the same and piles up in ensemble averages. That might seem like a small deal, but consider what it would do to, e.g., the training step of deep-learning-based speech recognition: the learner could quite easily latch onto the constant, recurring dither signal. Applied over a corpus of millions of training signals, the dither might easily amplify into a feature stronger than any of the features you actually want extracted.

And fourth, LFSRs *actually* don't survive too many tests of indistinguishability from true randomness. That's usually more of an issue in cryptographic work and in things like Monte Carlo methods in computational physics. But it *does* show up in audio work as well, from time to time: e.g., fast shift-and-add algorithms for impulse response measurement infamously react *really* badly to any nonlinearity when the probe noise is derived from an LFSR. That raises a spectre of further concerns, which I'd like to engineer through once and for all, instead of leaving them hanging. I'd like to make my scheme the no-nonsense go-to for any digitized signal, after all; the silver bullet, built into anything and everything.

So, let me recount my current solution in a bit more historical detail, wrt how it tries to address the above concerns.

My first iteration was just subtractive dither using an LFSR. Synchrony and metadata were assumed. Of course it can't work out in the wild, but that's what you build as a proof of concept. It used the orthodox 1-bit peak-to-peak RPDF as the dither. That makes even 4-bit audio surprisingly pleasant to listen to, because you can suddenly hear well below the noise floor; compared to additive TPDF, the drop in noise floor is highly noticeable.
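The subtractive round trip itself is only a few lines. A sketch, assuming a floor quantizer and a shared generator standing in for the LFSR (names and the use of Python's PRNG are illustrative):

```python
import math
import random

def encode(x, bits, rng):
    """Encoder: add RPDF dither of 1 LSB peak-to-peak to x (in [-1, 1)),
    then quantize to `bits` bits with a floor quantizer."""
    q = 2.0 ** -(bits - 1)        # quantization step (LSB)
    d = (rng.random() - 0.5) * q  # RPDF dither, 1 LSB peak-to-peak
    return math.floor((x + d) / q) * q, d

def decode(y, d):
    """Decoder: coherently subtract the very same dither value."""
    return y - d
```

The point of the subtraction is that the total error after decode is bounded by a single quantization step and is independent of the signal, instead of the dither riding on top of the quantization noise as in the additive case.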

Of course, next I added a second LFSR with a period coprime to the first one, in order to arrive at TPDF. In subtractive mode there is no difference, but in additive mode TPDF works better. Next I also implemented the oldest psychoacoustic dither trick of them all: differencing two subsequent dither values, which amounts to a first-order highpass filter. Even better quality, but not something I'm aiming at in the end.
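The two tricks compose like this; a sketch, with Python's PRNG standing in for the two coprime-period LFSRs and the seeds purely illustrative:

```python
import random

def hp_tpdf_dither(n, seed1=11, seed2=23):
    """Sum two independent uniform generators for TPDF dither, then
    first-difference against the previous value to tilt the spectrum
    high-pass (a first-order 6 dB/octave shaping)."""
    g1, g2 = random.Random(seed1), random.Random(seed2)
    prev, out = 0.0, []
    for _ in range(n):
        t = (g1.random() - 0.5) + (g2.random() - 0.5)  # TPDF in (-1, 1)
        out.append(t - prev)                           # highpass shaping
        prev = t
    return out
```

The differencing gives the sequence a lag-1 autocorrelation of -0.5, which is exactly the high-frequency tilt that pushes the noise up where the ear is less sensitive.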

Now, the blind synchronization problem. If I want the technique to be broadly applicable, it needs to detect -- even without metadata -- whether it is being used. For the reasons above, it also needs to be able to key itself randomly, to avoid leaving repetitive traces of itself on running data. And it needs to do all of this without relying on fully correlative machinery, because that sort of thing is too expensive to implement on every PCM I/O line out there. Compatibility between different word widths is an issue too, because as we know from audio practice, 16-bit signals are often passed over 24-bit interfaces, simply keeping the lower-order (maybe even some of the higher-order) bits zero.

So, I do that parity trick over every word received. It's not sensitive to the relative alignment of the utility data within a sample word. At the same time it's an optimal collector of entropy over the bits: any entropy in any bit of the word is reflected into the parity, so that even if just one of them holds evidence of true physical noise, the parity will be at least as noisy. This is a trick borrowed from entropy-pool work on the crypto side. Additionally, the trick is cheap in both hardware and software; on the CPU side it often comes as a side effect of any arithmetic operation, and sometimes even a load. It's inelegant to implement in plain C, but still quite doable and reasonably fast.
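The standard XOR-fold version of that trick, for words up to 32 bits (the function name is mine):

```python
def parity(word):
    """XOR-fold a sample word down to its parity bit.  Any bit with
    real entropy anywhere in the word flips the result, so the parity
    is at least as unpredictable as the noisiest single bit; and since
    shifting a word doesn't change its bit count, the fold is blind to
    where in a wider slot the payload sits."""
    word ^= word >> 16
    word ^= word >> 8
    word ^= word >> 4
    word ^= word >> 2
    word ^= word >> 1
    return word & 1
```

Note the word-width compatibility falls out for free: a 16-bit sample left-justified into a 24-bit slot, i.e. parity(x << 8), has the same parity as x itself.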

Next, I maintain a small deterministic automaton which decodes a comma-free code from the bitstream assembled above. The idea here is that *if* the bitstream gathers enough entropy from the noise inherent in digitized analogue signals, it'll be statistically more or less white and independently distributed.

If you did synchronization the normal way, you'd correlate some number of successive bits against a known sync word and go from there. However, correlation is expensive, so we do *stochastic*, *pseudorandom* correlation instead. This is a trick I learnt partly from rsync, which first does a fast, cheap checksum and only then goes into further validation, and partly from rejection-sampling methods: if you want circularly uniform noise on the 2D plane, often the best method is not to derive it directly, but to draw from a rectangular distribution, test whether the point falls within the circle, and drop it and repeat if it doesn't.
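That square-to-circle draw is the textbook rejection sampler, for reference:

```python
import random

def disc_sample(rng):
    """Rejection sampling for a uniform point on the unit disc: draw
    from the enclosing square, keep the draw only if it lands inside
    the circle, otherwise drop it and try again.  Accepted points are
    exactly uniform over the disc."""
    while True:
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        if x * x + y * y <= 1.0:
            return x, y
```

The analogy to the synch problem is the accept/reject structure: cheap draws, a cheap test, and you only pay for the expensive step on a hit.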

Thus what I do is gather a low-rate bitstream from the comma-free decoding, latching that compressed parity bit into a 32-bit shift register every so often. The idea is that we slowly gain entropy from the incoming stream, sampling it stochastically, and that the sampling interval -- i.e. when we advance the long shift register -- does something like a stochastic exponential backoff into the past.

That's because I'm aiming for very little hardware and memory, so that this kind of thing could become universal. In a cheap device you don't have the luxury of much memory, even if you process tons of data per microsecond. So if you want to rekey your dither generator reliably, without adding perceptible patterns to the signal while rekeying, you have to be capable of scrambling your PRNG using noise from rather a long way back in the signal. I mean, it's not uncommon for entire seconds of digital zero to come in, with no entropy in them, and so on. If you relied on straight correlative machinery, you'd have to remember everything that has gone before, and nobody can afford that kind of machinery for what is essentially a finishing touch, not the real meat.

So instead we sample every now and then. The sampling algorithm proceeds by comparing a fixed number of bits in the long shift register against a constant. If the bits in the register match, a new bit is shifted in. Under the theory that the XOR trick is a good entropy extractor, what is being tested is a fully white sequence, so it does not matter which value we test against. Then, since the testing and latching interval itself depends on a hit, the intervals relevant to resynch/latching will themselves spread out (stochastically) exponentially into the past -- with very little circuitry or computational load.
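A sketch of one step of that rule as I've described it; the particular k, pattern, and names are illustrative, and on white input a hit fires with probability 2**-k regardless of which pattern you pick:

```python
def latch_step(reg, parity_bit, k=3, pattern=0b000):
    """Stochastic latching: test the k low-order bits of the 32-bit
    shift register against a fixed pattern; only on a hit is the new
    parity bit shifted in.  No hit, no state change -- so the register
    advances only sporadically, at essentially zero cost per sample."""
    if reg & ((1 << k) - 1) == pattern:
        reg = ((reg << 1) | (parity_bit & 1)) & 0xFFFFFFFF
    return reg
```

Since whether the register advances depends on its own (pseudorandomly filled) contents, the admission times thin out stochastically rather than at a fixed clock.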

Penultimately, the problem itself, and then two things I still want to implement. The problem is that since self-synchronizing gizmos like mine need to read their own output on the encoder side, so that the decoder can stay on board, my current implementation breaks down. It often produces very short-period resynch cycles, so that the codec, as it stands, does more damage to the signal than good. I'm very certain the basic *approach* is sound, but since the implementation is plenty complex, I'm having trouble pinpointing what precisely leads to this higher-than-expected reset rate. And I certainly don't know how to analyze the total dynamics of the feedback loop so as to guarantee the nice statistical bounds I'd like, to begin with.

In the implementation, two things are also lacking as of now. First, positive, correlative latch-on. Even now, my code does latch on; but for high-fidelity open-ended work, I'd like it to confirm before it latches. In my architecture that's easy if you know which S/N ratio or bit width you're working at, because after a latch you know your generated dither sequence and can correlate it against your samples. It would be efficient, because you don't need a sliding *correlation*, just a single vector product. The trouble is that doing that optimally still depends on relative levels and the like.
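The single-vector-product confirmation would look something like this; the normalization and the threshold are illustrative stand-ins for exactly the level- and word-width-dependent decision I don't yet know how to make optimal:

```python
def confirm_latch(samples, dither, threshold=0.25):
    """Confirm a tentative latch with one normalized inner product
    between the received samples and the locally regenerated dither
    sequence -- no sliding correlation, just a single dot product at
    the hypothesized alignment."""
    num = sum(s * d for s, d in zip(samples, dither))
    den = (sum(s * s for s in samples) * sum(d * d for d in dither)) ** 0.5
    return den > 0.0 and num / den > threshold
```

Because the alignment hypothesis comes from the latch itself, the cost is O(n) in the confirmation window length rather than O(n^2) for a search.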

Finally, we'd also like to solve the dual problem of unlatching from a stream after an error. While the machinery used to solve it is pretty much the same as for positive synch, really disengaging this kind of active processing from the signal chain might have to happen much more rapidly and reliably, in order not to degrade valuable data from, say, a multi-billion-dollar space probe. And in those kinds of measurement/telemetry applications, it *would* have to be possible to later unprocess from the signal even coding faults which would be of no significance in usual audio work.

Seriously, I've thought about this stuff for quite a while now, and would like ideas. (Some of which I've already gotten; thank y'all!)
--
Sampo Syreeni, aka decoy - [email protected], http://decoy.iki.fi/front
+358-40-3751464, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
