>Thanks Christopher. Could you say a little about what makes long
>duration sibilants difficult?

The compression system works, essentially, by recognising that a strong 
signal component will mask (i.e. render inaudible to the human auditory 
system) lower-level components that are nearby in frequency and (to a 
lesser extent) time. Thus, if there are a few strong components surrounded 
by low-level noise (a reasonable model of music), the noise is masked, and 
all the available bits in the bitstream can be devoted to coding the 
relatively small amount of information contained in the tones. Most music 
fits this model pretty well, and with the allocation of bits in the coder 
optimised for this type of signal (tones in low-level noise) the result is 
excellent perceived fidelity at compression ratios of around 5:1 for ATRAC.
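The idea can be sketched in a few lines. This is a toy illustration, not 
the real ATRAC algorithm: the "masking floor" rule (40 dB below the 
strongest component) and the bit budget are assumptions for the sake of 
the example.

```python
import numpy as np

# Build a "tones in low-level noise" signal: two strong sinusoids plus
# weak broadband noise (a crude stand-in for typical music).
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
tones = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1320 * t)
signal = tones + 0.01 * rng.standard_normal(fs)

spectrum = np.abs(np.fft.rfft(signal))
# Crude masking floor (assumed): anything 40 dB below the strongest
# component is treated as masked and gets no bits at all.
floor = spectrum.max() / 100.0
audible = spectrum > floor

# With only a handful of audible bins, the whole bit budget covers them
# generously -- this is why the "tones in noise" model compresses so well.
budget_bits = 512
bits_per_bin = budget_bits / audible.sum()
print(f"{audible.sum()} audible bins, {bits_per_bin:.1f} bits each")
```

With this signal only the two tone bins rise above the floor, so each one 
gets a very generous share of the budget.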

A sibilant, however, does not fit this model well. A sibilant consists of 
broadband noise at a more or less constant amplitude across a relatively 
wide range of frequencies. Since no component is much stronger than the 
others, little masking occurs, and so the algorithm is forced to start 
throwing away material that is audible, so as to stay within the maximum 
allowable bitrate. Since the coder is optimised for tones, throwing away 
bits of the noise signal tends to make it start to sound slightly tonal 
(hence the "swishing" quality). If the event is very short you don't 
really notice, partly because of the temporal masking noted above. As the 
duration lengthens your ear/brain has longer to notice that something's not 
quite right - you start to hear compression artefacts.
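Continuing the toy sketch above (same assumed masking-floor rule, not any 
real codec): with flat broadband noise the floor discards almost nothing, 
and if the coder is forced to keep only the strongest bins anyway, the 
survivors behave like a handful of tones.

```python
import numpy as np

# A sibilant is roughly flat broadband noise; white noise is a stand-in.
rng = np.random.default_rng(1)
n = 4096
sibilant = rng.standard_normal(n)

spectrum = np.abs(np.fft.rfft(sibilant))
floor = spectrum.max() / 100.0        # same 40 dB rule as for the tones
audible = spectrum > floor
print(f"{audible.sum()} of {spectrum.size} bins exceed the floor")

# Forced bit starvation: keep only the 20 strongest bins and resynthesise.
# The result is no longer noise-like -- it is a sparse set of sinusoids,
# which is the tonal "swishing" character described above.
coeffs = np.fft.rfft(sibilant)
keep = np.argsort(np.abs(coeffs))[-20:]
starved = np.zeros_like(coeffs)
starved[keep] = coeffs[keep]
tonal_residue = np.fft.irfft(starved, n)
```

Essentially every bin exceeds the floor, so masking buys nothing, and the 
starved reconstruction is a few whistly components rather than noise.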

Low bitrate speech coders (e.g. those used for digital cellphones and 
military comms) get around this problem by having more than one model - a 
tonal one for voiced components like vowels (similar in essence to the 
ATRAC model, but differing in implementation), and a noise-like one for 
"sh" sounds and the like. The coder can switch, on a frame-by-frame basis, 
between these alternative models. The absolute fidelity of the noise-like 
models tends not to be good, but the output definitely does still sound 
like noise. It is difficult to make these noise-like models behave well 
enough to be useful for high quality music coding.
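The frame-by-frame switch can be sketched as below. The classifier here 
(zero-crossing rate with an assumed 0.3 threshold) is just one classic 
cheap voiced/unvoiced cue chosen for illustration - real codecs use 
richer feature sets and actual coding models behind the switch.

```python
import numpy as np

def classify_frames(x, frame_len=160):
    """Label each 20 ms frame (at 8 kHz) as 'tonal' or 'noise'."""
    labels = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        # Zero-crossing rate: fraction of adjacent samples with a sign flip.
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
        # Voiced speech (vowels) crosses zero slowly; "sh"-like broadband
        # noise crosses on roughly half of all sample pairs.
        labels.append("noise" if zcr > 0.3 else "tonal")
    return labels

fs = 8000
t = np.arange(fs) / fs
vowel = np.sin(2 * np.pi * 150 * t[:4000])    # low-pitched voiced stand-in
rng = np.random.default_rng(2)
sh = rng.standard_normal(4000)                # broadband "sh" stand-in
labels = classify_frames(np.concatenate([vowel, sh]))
```

In a real coder each "tonal" frame would then go to the voiced model and 
each "noise" frame to the noise-like one.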

Speech coders are getting very good these days. Digital telephone exchanges 
operate on uncompressed (non-linear) PCM at 64 kbits/sec. GSM (the European 
cellular standard) compresses speech to 9600 bits/sec, whilst retaining an 
entirely usable level of intelligibility and speaker recognition (the 
ability to recognise the voice of the person at the far end, not just what 
they're saying). I have heard (just about) intelligible speech coded at as 
low as 100 bits per second. This is remarkable given that coding the 64 
phonemes of western speech at 10 phonemes per second (a typical rate for 
western speech) requires an absolute minimum of 60 bits/second.
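The arithmetic behind that lower bound is worth spelling out:

```python
import math

# 64 distinct phonemes need log2(64) = 6 bits each to tell apart,
# and at 10 phonemes per second the information-theoretic floor is
# 6 * 10 = 60 bits/second.
bits_per_phoneme = math.log2(64)
min_bitrate = bits_per_phoneme * 10
print(min_bitrate)
```

So a 100 bit/sec coder is operating within a factor of two of the bare 
minimum needed just to name the phonemes, never mind pitch or timbre.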

Christopher Hicks

