Hi,

On Tue, May 21, 2019 at 12:11:14AM -0700, Mo Zhou wrote:
> Hi people,

I see your good intention, but this basically changes the status quo for the main requirement.

> https://salsa.debian.org/lumin/deeplearning-policy
> (issue tracker is enabled)

I read it ;-)

> This draft is conservative and overkilling, and currently
> only focus on software freedom. That's exactly where we
> start, right?

OK, but it can't be where we end up. Before scientific "deep learning" data, we already have practical "deep learning" data in our archive.

Please note that mozc, one of the most popular Japanese input methods, will be kicked out of main for a start if we begin enforcing this new guideline.

> Specifically, I defined 3 types of pre-trained machine
> learning models / deep learning models:
>
>   Free Model, ToxicCandy Model. Non-free Model
>
> Developers who'd like to touch DL software should be
> cautious to the "ToxicCandy" models. Details can be
> found in my draft.

A label like "ToxicCandy Model" makes a bad impression on people, and I am afraid people may not be able to make a rational decision. Is this characterization correct and sane? At least, it looks to me like this severely changes the status quo of our policy and practice. So it is worth evaluating the idea without the labeling.

As long as the "data" comes in a form which allows us to modify it and re-train it to make it better with a set of free software tools, we shouldn't make it non-free, for sure. That is my position, and I think this is how we have operated as a project. We never asked how the data was originally made. The touchy question is how easy it should be to modify and re-train, etc.

Let's list analogous cases.

We allow a photo of something in our archive as wallpaper etc. We don't ask that the subject of the photo or the tool used to make it be FREE. The Debian logo is one example, which was created with Photoshop as I understand it.

Another analogy to consider is how we allow an independent copyright and license for dictionary-like data, which must have been produced by processing previously copyrighted (possibly non-free) texts with the human brain and maybe some scripts. Packages such as opendict, *spell-*, dict-freedict-all, ... are in main.

I agree it is nice to have the base data in the package. If you can, please include the training data if it is a FREE set. But it may be unrealistic for Debian to get into the business of distributing many GB of training data for this purpose. You may be talking about data sizes of over tens of GB. This is another thing you should realize. Mandating its inclusion is impractical, since it is not the focal point on which Debian needs to spend its resources.

Let's talk about actual cases in main.

"mecab" is a free tool for Japanese text morphological analysis which can create CRF optimized parameters from marked-up training data. (This is also the base of mozc, which uses such data to create desirable typing output in normal Japanese text input from the keyboard.)

One of the dictionaries for mecab is an 800MB compressed deb in main: unidic-mecab, which is 2.2GB of data in text format containing CRF optimized parameters and other text data obtained by training. These text and parameters are triple licensed BSD/LGPL/GPL. Re-training this is a very straightforward application of the mecab tool with additional data only. So this is as FREE as it can be in current practice, and we have it in main.

  https://unidic.ninjal.ac.jp/

When these CRF parameters were initially made, non-free data was used (Japanese Government funded), available on multiple DVDs with a hefty price and restrictions on use and redistribution. This base data for training is as NON-FREE as it can be, so we don't distribute it.

  https://pj.ninjal.ac.jp/corpus_center/bccwj/dvd-index.html

In the case of MOZC, the original training data is only available within Google and is not published by them.
Actually, tweaking the data is possible, but consistently retraining this data in MOZC may not be a trivial application of the mecab tool. We are placing this in main now anyway, since its data (CRF optimized parameters and other text data) are licensed under BSD-3-clause, and we have MOZC in main.

Regards,

Osamu