Bug#980839: rnnoise model training data

2024-04-15 Thread Petter Reinholdtsen
According to https://github.com/xiph/rnnoise/releases/tag/v0.2,
the newly released version has been trained only on publicly
available data.  Time to see if rnnoise can go into Debian?

-- 
Happy hacking
Petter Reinholdtsen



Bug#980839: rnnoise model training data

2021-02-01 Thread Ralph Giles

On Mon, 01 Feb 2021 16:19:03 +0800 Paul Wise wrote:

> It has been made clear in this Hacker News subthread that the
> RNNoise model has been trained in part using proprietary data:

This is correct. There's been some discussion about this on IRC with
respect to that thread and the Debian Machine Learning and Software
Freedom policy proposal:

 
https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.rst

There was some confusion because Mozilla collected public data
submissions for the project. According to author Jean-Marc Valin, these
data were *not used* to train the currently-published rnnoise model,
which was instead trained on other free and non-free data sets.

The crowdsourced data set was published under a CC0 license and is
available for further work, but it needs cleaning and characterization
before it can be directly useful for training.

Data download: https://media.xiph.org/rnnoise/
Original click-through soliciting a license agreement from submitters:
https://web.archive.org/web/20171003052023/https://people.xiph.org/~jm/demo/rnnoise/donate.html

So, rnnoise falls under the "toxic candy" model classification in the
policy proposal.

It's good to have names for these situations, and definitely good to
ask for public data for training models, but I don't think it would be
reasonable to block packaging rnnoise based on this criterion.
Compression technologies, whether for voice, music, images, or video,
have all been tested and tuned against source data that is not all
publicly redistributable. For example, the codebooks of the speex
codec, part of Debian since 2002, were trained on some of the same
proprietary datasets as the default rnnoise model. Even the Linux
kernel is tuned using proprietary workloads.

Recent interest in machine learning has made better tools for model
training available, bringing us closer to applying the modification
aspect of software freedom to parameter sets used to configure
software. That's a step forward. Deciding that new models must reach a
higher bar than established code would be a step back.
