Greg Maxwell wrote:

> People encode more than songs. Consider a good 'book on cd' recording at
> 64Kbit/s. I haven't done any tests (have to dig out a book on CD disk), but
> I imagine that the human speech duty cycle is pretty low and dropping to
> 32Kbit/s would be a good savings. Though I wouldn't want to trust the VBR
> model not to drop to 32Kbit in the middle of a word and cause audible
> distortion.

I actually tried encoding something similar:  Pimsleur Comprehensive
French II.  This consists of 16 audio CDs for a total of 16 hours of
English and French conversation.  My goal was to encode all 16 hours
into the minimal amount of space on my PC, without noticeable loss of
sound quality or annoying artifacts.  After some fiddling, I was able to
find the magic sequence of programs and flags to compress it at 24kbps
(no VBR), for a total of 150MB for all 16 CDs!  Here's what I did:

* First, I realized that the recording was mono.  I used cdda2wav (all
of this was done under Linux) to rip a mono WAV instead of a stereo one.

* Then, I tried both FhG mp3enc V3.1 and LAME 3.13.  LAME produced some
nasty high-frequency buzzing artifacts, while FhG's output was a bit
better, but still had a few artifacts at 32kbps.

* As an aside, I also attempted to use the free libgsm, but because it
starts with an 8kHz, 13-bit input file (which, even uncompressed, sounds
noticeably worse than 11kHz), the quality just couldn't cut it.  I also
wanted to stay with MP3 for compatibility with the huge number of
players out there.  If anyone knows of any other alternative voice
codecs that work with Linux and Windows, please let me know; however MP3
was more than good enough for this test.

* Because I was compressing the human voice, I realized that I didn't
want the encoder to even attempt to encode high frequencies.  I was able
to significantly reduce artifacts, and produce a very nice 24kbps
encoding by having cdda2wav downsample to a mono 11kHz WAV instead of
44.1kHz (flags: "-m -a 4").  The flags for mp3enc31 are "-br 24000 -esr
11025".  Without the -esr flag, mp3enc will upsample or downsample the
WAV to 16kHz, which doesn't sound as good as forcing it to 11kHz.
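
As a rough check on the sample rates mentioned above, here is a small Python sketch.  It assumes that cdda2wav's -a option divides the 44.1kHz CD rate by the given number, and it uses the Nyquist limit (half the sample rate) as the highest frequency each format can carry.  The function names are illustrative only and are not part of any of the tools.

```python
CD_RATE_HZ = 44100  # standard audio CD sample rate


def divided_rate(divisor):
    """Sample rate after dividing the CD rate (assumed meaning of 'cdda2wav -a N')."""
    return CD_RATE_HZ / divisor


def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at a given sample rate."""
    return sample_rate_hz / 2


rates = [
    ("cdda2wav -a 4 / mp3enc -esr 11025", divided_rate(4)),
    ("mp3enc default resampling", 16000),
    ("libgsm input", 8000),
]

for label, rate in rates:
    print(f"{label}: {rate:.0f} Hz sampling, up to {nyquist_hz(rate):.1f} Hz audio")
```

At 11025Hz the encoder never sees anything above roughly 5.5kHz, so it cannot spend bits on high frequencies at all.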

I believe the low-bitrate, low-samplerate MP3s are actually an
extension to the standard (MPEG 2.5), but mpg123, xmms, and newer
versions of WinAmp play them perfectly.  Unfortunately, LAME does not
support the 11kHz sample rate, and had far more severe artifacts than
mp3enc at the higher sample rates, so here is one area where LAME could
potentially be improved.

At the very least, I have a feeling that implementing a simple low-pass
filter around 5kHz could remove the annoying high-frequency artifacts I
heard, which were 99% of the problem with the LAME encoder for this task.
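
To make the low-pass idea concrete, here is a minimal pure-Python sketch of a windowed-sinc FIR filter with a 5kHz cutoff.  This is not how LAME works internally, and nobody has tested it against the encoder; it is a generic textbook filter, assuming 44.1kHz input and an arbitrary choice of 101 taps with a Hamming window.  All names are made up for the example.

```python
import math


def lowpass_taps(cutoff_hz, sample_rate_hz, num_taps=101):
    """Windowed-sinc (Hamming) low-pass FIR taps, normalised to unity gain at DC."""
    fc = cutoff_hz / sample_rate_hz  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response, with the x == 0 limit handled separately.
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def fir_filter(samples, taps):
    """Direct-form convolution; output has the same length as the input."""
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            if i - k >= 0:
                acc += t * samples[i - k]
        out.append(acc)
    return out


def tone(freq_hz, sample_rate_hz, count):
    """Unit-amplitude sine wave test signal."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz) for n in range(count)]


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


RATE_HZ = 44100
TAPS = lowpass_taps(5000, RATE_HZ)

for freq in (1000, 10000):
    filtered = fir_filter(tone(freq, RATE_HZ, 2000), TAPS)
    # Skip the first len(TAPS) samples, where the filter is still filling up.
    print(freq, "Hz -> RMS after filtering:", rms(filtered[len(TAPS):]))
```

A 1kHz tone (well inside the voice band) should come through essentially unchanged, while a 10kHz tone should be almost entirely removed before the encoder ever sees it.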

Another area of exploration would be with VBR.  mp3enc31 doesn't support
VBR, but for this task in particular, it could have been very useful,
due to the frequent 5+ second pauses between narration of phrases
(during which the listener is supposed to repeat them out loud to get
practice with pronunciation).  I'm certain that an encoder with the
quality of mp3enc and decent VBR support could get down to 16kbps with
no trouble.  However, I believe mp3enc31 does save up some bits from the
quiet sections, because the sound quality seems to improve a bit
starting around 5 seconds into the playback.
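
Here is a back-of-the-envelope Python sketch of that 16kbps guess.  It assumes a hypothetical VBR encoder that spends 24kbps on speech and falls to 8kbps (the lowest Layer III bitrate at these sample rates) during pauses.  The pause fractions are not measured from the recording; they are only there to show how the average moves.

```python
def average_bitrate_kbps(speech_kbps, pause_kbps, pause_fraction):
    """Time-weighted average bitrate for a two-state (speech/pause) stream."""
    return speech_kbps * (1 - pause_fraction) + pause_kbps * pause_fraction


def required_pause_fraction(speech_kbps, pause_kbps, target_kbps):
    """Share of playing time that must be pauses to reach the target average."""
    return (speech_kbps - target_kbps) / (speech_kbps - pause_kbps)


SPEECH_KBPS = 24  # the rate that sounded good for speech in this test
PAUSE_KBPS = 8    # assumed floor during silence (hypothetical encoder behaviour)

for fraction in (0.25, 0.5):  # hypothetical pause fractions, not measurements
    print(fraction, average_bitrate_kbps(SPEECH_KBPS, PAUSE_KBPS, fraction))

print(required_pause_fraction(SPEECH_KBPS, PAUSE_KBPS, 16))
```

Under these assumptions, pauses would have to make up about half of the playing time for the average to reach 16kbps.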

However, all things considered, 150MB for 16 hours of speech is pretty
respectable!  Too bad I had to use a commercial encoder to get that
quality, but hopefully this message will give you some ideas of
potential areas to improve LAME.  If anyone would like to work with the
original audio files I used for this test, I can make a 30-second
snippet and upload it to my website or mail it to Mark to put on the
LAME page.
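
For anyone who wants to check the size figures, here is a short Python sketch of the constant-bitrate arithmetic.  It treats "16 hours" as a round number and ignores any file overhead.  At exactly 16 hours, 24kbps works out to about 173MB, so the 150MB total suggests the discs hold roughly 14 hours of actual audio (that last part is an inference, not something I measured).

```python
def encoded_size_bytes(bitrate_bps, duration_s):
    """Size of a constant-bitrate stream, ignoring tags and other overhead."""
    return bitrate_bps * duration_s / 8


def duration_hours(size_bytes, bitrate_bps):
    """Playing time that a constant-bitrate file of the given size can hold."""
    return size_bytes * 8 / bitrate_bps / 3600


BITRATE_BPS = 24000          # from the encoder flags: -br 24000
NOMINAL_SECONDS = 16 * 3600  # "16 hours", taken as a round figure

# Size if the course really ran a full 16 hours, in millions of bytes.
print(encoded_size_bytes(BITRATE_BPS, NOMINAL_SECONDS) / 1e6)

# Playing time implied by 150MB, for both common meanings of "MB".
print(duration_hours(150e6, BITRATE_BPS))
print(duration_hours(150 * 2**20, BITRATE_BPS))
```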

-Jake
--
MP3 ENCODER mailing list ( http://geek.rcc.se/mp3encoder/ )
