Re: [Freetel-codec2] Clarifications on quantization

2023-09-20 Thread david
Hi Robin,

> Regarding the quantization of sinusoidal magnitudes/amplitudes, you
> write in a
> blog post (https://www.rowetel.com/?p=130) that the "red line" Am is
> quantized.
> This is not the plain frequency curve (the green one Sw). How exactly
> do you
> derive Am from Sw?

By sampling the LPC synthesis filter Pw=1/|A(e^jw)|^2 at each harmonic.
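
If it helps, here's a rough C sketch of that sampling step (not the actual
codec2 source; the function name and constants are just placeholders):

/* Rough sketch only, not the codec2 source.  Sample the LPC magnitude
 * spectrum sqrt(1/|A(e^jw)|^2) at each harmonic w = m*Wo. */
#include <math.h>

#define LPC_ORDER 10

/* a[0..LPC_ORDER] are the LPC coefficients with a[0] = 1.0,
 * Wo is the fundamental in radians/sample, L the number of harmonics,
 * Am[1..L] receives the harmonic amplitudes. */
void sample_lpc_at_harmonics(const float a[LPC_ORDER + 1], float Wo,
                             int L, float Am[])
{
    for (int m = 1; m <= L; m++) {
        float w = m * Wo;
        float re = 0.0f, im = 0.0f;

        /* A(e^jw) = sum_k a[k] * e^(-jkw) */
        for (int k = 0; k <= LPC_ORDER; k++) {
            re += a[k] * cosf(k * w);
            im -= a[k] * sinf(k * w);
        }
        Am[m] = sqrtf(1.0f / (re * re + im * im + 1e-12f));
    }
}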

> But in the Harmonic Sinusoidal Model, you need to have all L
> amplitudes
> available to synthesize the speech signal. How is that achieved? Are
> you simply
> synthesizing 10 harmonics with an appropriately scaled Wo no matter
> what?
> 

The LSPs are converted back to LPC coefficients {ak}, which are used to
create an LPC synthesis filter, which we sample.  Well, actually we take
the RMS value of the spectrum in that band rather than sampling at the
harmonic centre.  The blog post you linked to explains that a little
further down, and I think it's in the thesis too.
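
Something like this for the RMS-per-band version (again just an
illustrative sketch, not the real code; the grid size is made up):

/* Instead of taking 1/|A(e^jw)|^2 exactly at m*Wo, average the power over
 * the band harmonic m "owns", roughly (m-0.5)Wo .. (m+0.5)Wo, and take the
 * square root, i.e. an RMS amplitude per band. */
#include <math.h>

#define LPC_ORDER 10
#define NGRID     512                /* dense grid over 0..pi */
#define PI        3.14159265f

void lpc_band_rms(const float a[LPC_ORDER + 1], float Wo, int L, float Am[])
{
    float Pw[NGRID];

    /* dense LPC power spectrum Pw[i] at w = i*pi/NGRID */
    for (int i = 0; i < NGRID; i++) {
        float w = i * PI / NGRID;
        float re = 0.0f, im = 0.0f;
        for (int k = 0; k <= LPC_ORDER; k++) {
            re += a[k] * cosf(k * w);
            im -= a[k] * sinf(k * w);
        }
        Pw[i] = 1.0f / (re * re + im * im + 1e-12f);
    }

    for (int m = 1; m <= L; m++) {
        int lo = (int)((m - 0.5f) * Wo * NGRID / PI);
        int hi = (int)((m + 0.5f) * Wo * NGRID / PI);
        if (hi > NGRID) hi = NGRID;
        if (hi <= lo)   hi = lo + 1;
        float sum = 0.0f;
        for (int i = lo; i < hi; i++)
            sum += Pw[i];
        Am[m] = sqrtf(sum / (hi - lo));   /* RMS amplitude for band m */
    }
}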

> The fundamental frequency is determined by trying a number of
> frequencies
> between 50-500 Hz, determining the sinusoidal amplitudes, decoding
> that data and
> comparing it with the original signal? The fundamental frequency will
> be the one
> where that comparison yields the smallest error. This is the
> algorithm described
> in chapter 3.4 of your PhD thesis.
> 
We use the non-linear pitch estimation algorithm (in the thesis); the
MBE pitch estimator (which you outlined above) is used for refinement
of the pitch estimate.
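
If it helps to see the shape of it, here's a toy C sketch of the
refinement idea (definitely not the codec2 implementation; SwMag[], the
FFT size and the +/-5% search range are made-up placeholders):

/* Around the coarse Wo, pick the candidate whose harmonics m*Wo line up
 * with the most energy in the DFT of the windowed speech. */
#include <math.h>

#define FFT_SIZE 512
#define PI       3.14159265f

float refine_Wo(const float SwMag[FFT_SIZE / 2], float Wo_coarse)
{
    float best_Wo = Wo_coarse;
    float best_e  = -1.0f;

    for (float Wo = 0.95f * Wo_coarse; Wo <= 1.05f * Wo_coarse;
         Wo += 0.001f * Wo_coarse) {
        int   L = (int)(PI / Wo);     /* harmonics below Fs/2 */
        float e = 0.0f;

        for (int m = 1; m <= L; m++) {
            int bin = (int)(m * Wo * FFT_SIZE / (2.0f * PI) + 0.5f);
            if (bin < FFT_SIZE / 2)
                e += SwMag[bin] * SwMag[bin];
        }
        if (e > best_e) {
            best_e  = e;
            best_Wo = Wo;
        }
    }
    return best_Wo;
}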

> What's the algorithm you are using to estimate voicing?

The MBE algorithm, but the voicing of all bands is averaged to get a
single metric which we compare to a threshold.
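
Roughly along these lines (illustrative only; the 6 dB threshold and the
array layout are my assumptions here, not the exact codec2 values):

/* Given a per-band measure of how well a sinusoid at each harmonic
 * explains the spectrum, collapse it to one SNR in dB and threshold it. */
#include <math.h>

/* band_signal[m], band_error[m] for m = 0..L-1: energy in band m and the
 * energy left over after fitting the best sinusoid to that band. */
int estimate_voicing(const float band_signal[], const float band_error[], int L)
{
    float sig = 1e-6f, err = 1e-6f;

    for (int m = 0; m < L; m++) {
        sig += band_signal[m];
        err += band_error[m];
    }
    float snr_dB = 10.0f * log10f(sig / err);
    return snr_dB > 6.0f;            /* 1 = voiced, 0 = unvoiced */
}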

> Furthermore, LPC analysis is performed directly on the speech samples
> (time
> domain) according to the block diagram. How does that fit together
> with using Am
> which is obviously a feature in the frequency domain?

The Am are extracted using frequency domain techniques for the purpose
of estimating voicing.  In the LPC quantised modes, the Am are then
discarded and the time domain LPCs are transformed to LSPs and sent to
the decoder, where the Am are extracted again.
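
In sketch form the decoder side looks something like this (placeholder
names, not the real codec2 API; lsp_to_lpc() stands in for a real LSP to
LPC conversion and sample_lpc_at_harmonics() is the sketch from earlier
in this mail):

#define LPC_ORDER 10
#define PI        3.14159265f

/* assumed helpers, declared only to show the flow */
void lsp_to_lpc(const float lsp[LPC_ORDER], float a[LPC_ORDER + 1]);
void sample_lpc_at_harmonics(const float a[LPC_ORDER + 1], float Wo,
                             int L, float Am[]);

void decode_amplitudes(const float lsp_hat[LPC_ORDER], float Wo, float Am[])
{
    float a_hat[LPC_ORDER + 1];
    int   L = (int)(PI / Wo);            /* number of harmonics below Fs/2 */

    lsp_to_lpc(lsp_hat, a_hat);          /* rebuild the synthesis filter */
    sample_lpc_at_harmonics(a_hat, Wo, L, Am);
}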
 
> I do have a little bit of experience in signal/audio processing, but
> still find
> it hard to understand all of it. Okay I admit, I get terribly
> confused.

Yes, we realise there is a gap here.  We plan to write a complete
algorithm description to provide a reference in one place.

Cheers,
David R




[Freetel-codec2] Clarifications on quantization

2023-09-20 Thread Robin Haberkorn via Freetel-codec2
Greetings, Dr. Rowe!

Thanks for your excellent work and for publishing it as Open Source for everyone
to study, which is what I am trying to do.
I am going to give a report on Codec 2 at university and would like to
clarify a few aspects I find hard to grasp.

Regarding the quantization of sinusoidal magnitudes/amplitudes, you write in a
blog post (https://www.rowetel.com/?p=130) that the "red line" Am is quantized.
This is not the plain frequency curve (the green one Sw). How exactly do you
derive Am from Sw?
Okay, so you use that as the input to LPC analysis (p=10) and the 10 LPC
coefficients are transformed to LSPs to decrease the influence of transmission
errors. Also, there will always be 10 values, no matter how many harmonics are
in the signal. You can of course determine the number of harmonics in the
decoder just given the fundamental frequency.
But in the Harmonic Sinusoidal Model, you need to have all L amplitudes
available to synthesize the speech signal. How is that achieved? Are you simply
synthesizing 10 harmonics with an appropriately scaled Wo no matter what?

The fundamental frequency is determined by trying a number of frequencies
between 50-500 Hz, determining the sinusoidal amplitudes, decoding that data and
comparing it with the original signal? The fundamental frequency will be the one
where that comparison yields the smallest error. This is the algorithm described
in chapter 3.4 of your PhD thesis.

What's the algorithm you are using to estimate voicing? Do I understand
correctly that you determine the voicing (1 bit: voiced or unvoiced) every 10 ms
(putting two results into every frame that gets output every 20 ms)?
The purpose of this information is to assist the reconstruction of phase data in
the decoder, especially to get unvoiced sounds right?
At the same time, the block diagram (https://www.rowetel.com/?page_id=452)
mentions an MBE voicing estimation. How many frequency bands for MBE can there
be if the result is a single bit per 10ms chunk?

Furthermore, LPC analysis is performed directly on the speech samples (time
domain) according to the block diagram. How does that fit together with using Am
which is obviously a feature in the frequency domain?

I do have a little bit of experience in signal/audio processing, but still find
it hard to understand all of it. Okay I admit, I get terribly confused.

Best regards,
Robin Haberkorn

