Hi folks,

I seem to be good at starting lengthy mailing list threads. That said, I encountered some fun facts while reading papers today. I think this is also a good chance for me to write down a batch of other relevant thoughts.

TL;DR: well, I wrote too much again. I fed my mail to ChatGPT for a brief summary. The following are two versions. They are plain copies without my edits.

<<< begin TLDR version 1 generated by chatgpt
The email consists of various interesting points related to AI and licensing issues in software. The first point highlights how some state-of-the-art LLMs avoid using GPL code. The author discusses the potential audiences for a license revision or a brand new license. The second point mentions the importance of reproducibility of LLMs. The third point explains how AI's impact on the software ecosystem is increasing, which is inevitable, and it is challenging to enforce the declaration of AI usage. Finally, the author discusses some ongoing and future works on the Debian side related to AI and licensing issues.
>>> end TLDR version 1 generated by chatgpt

<<< begin TLDR version 2 generated by chatgpt
The email discusses some fun facts and relevant thoughts related to licensing issues and the impact of AI in the software ecosystem. The author mentions that the licensing issue of trained deep neural networks and their outputs is complicated, and some AI software projects may avoid GPL code usage in their training data to prevent potential licensing issues. The author also discusses the increasing impact of AI in the software ecosystem, and its potential use in generating code, images, and texts. The email concludes by mentioning ongoing and future works on the Debian side.
>>> end TLDR version 2 generated by chatgpt

--[[ Fun Fact 1: GPL code usage may be avoided in state-of-the-art LLM [2]

LLaMA [3] is one of the state-of-the-art LLMs that you can download and deploy on a local machine. Its training data includes code from GitHub, but the authors only used software projects licensed under Apache-2, BSD, and MIT.

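For illustration only, the kind of license allowlist described above can be sketched in a few lines of Python. This is not taken from the LLaMA paper or its tooling. The repo records, the `license` field, the function names, and the exact set of SPDX identifiers (in particular which BSD variants count) are all my assumptions.

```python
from typing import Optional

# Assumed allowlist: SPDX identifiers standing in for "Apache-2, BSD, and MIT".
# Which BSD variants to accept is a guess made for this sketch.
PERMISSIVE_LICENSES = {"apache-2.0", "bsd-2-clause", "bsd-3-clause", "mit"}


def is_permissive(license_id: Optional[str]) -> bool:
    """Return True if the (possibly missing) license id is on the allowlist."""
    if not license_id:
        return False
    return license_id.strip().lower() in PERMISSIVE_LICENSES


def filter_repos(repos):
    """Keep only the repo records whose declared license is on the allowlist."""
    return [repo for repo in repos if is_permissive(repo.get("license"))]


if __name__ == "__main__":
    # Hypothetical metadata records; real crawls carry far messier license data.
    sample = [
        {"name": "alpha", "license": "MIT"},
        {"name": "beta", "license": "GPL-3.0-or-later"},
        {"name": "gamma", "license": "BSD-3-Clause"},
        {"name": "delta"},  # no license declared, so it is dropped
    ]
    for repo in filter_repos(sample):
        print(repo["name"], repo["license"])
```

Note that anything without a recognizable license is dropped as well, so a filter like this excludes much more than just GPL code.
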
The licensing of trained deep neural networks, as well as of their outputs (generated texts, generated code, generated images, etc.), is already a mess. That said, at least a part of the research community surely knows the complicated implications of using GPL code for training. Otherwise they would not have had to avoid a pile of high-quality code.

People mentioned some potential licensing work in the previous related thread [1]. But I don't see a clear and practical goal for the free software community to reach. There are two types of potential audiences for a license revision or a brand new license.

(1) The first type is the free software authors. If the authors do not want their code to become a part of the super AI that will destroy the world someday [4], some special licenses or special clauses can be used to prevent its use in AI training datasets. But isn't it funny that "training a neural network" would be excluded from software freedom? Meanwhile, excluding this code from the training datasets won't hurt the LLM trainers, because a large portion of differently licensed projects are still usable.

(2) The second type of potential audience is the AI software upstreams. In my opinion, there is almost nothing the free software communities can do here. If we write license terms that look funny to the AI software upstreams, they will simply not play with these licenses.

--[[ Fun Fact 2: Reproducibility of LLMs

The LLaMA paper [3] emphasizes that the training sets of these models only involve publicly available datasets (no proprietary hidden datasets, no undocumented datasets). I can see that the research community will complain about reproducibility far in advance of the downstream software communities.

--[[ Recall 1: ML-Policy

If I had to trim the ML-Policy into one single sentence, it would be the definition of "toxic candy" -- a pre-trained neural network that is somehow (very likely incorrectly) licensed under an open source software license is still very likely problematic.

This will become more and more useful as more software projects try to integrate neural networks for interesting applications. It works as a warning when you see a giant binary blob (sometimes the network can be small... only several megabytes or so) in the upstream source, regardless of its license.

--[[ Fun Fact 3: AI's impact on the software ecosystem is increasing

Even though the copyright and licensing of neural networks is still a mess, the trend is unstoppable. If you keep an eye on the GitHub trending list, you will see the ratio of AI software climbing.

Even if we hesitate to introduce some AI software into our archive, the impact of AI will gradually flow into our free archive, inevitably:

(1) A code snippet might be generated by AI, and modified by the upstream author without declaring the participation of AI.
(2) Documentation texts might be generated by AI. With a state-of-the-art LLM, you can simply throw in your undocumented code snippet and let it explain what the piece of code does.
(3) Pictures, icons, and SVGs might be generated by AI.
(4) ...

It is impossible to enforce the declaration of AI usage everywhere it applies. Even worse, detecting AI-generated results is largely a dead end: the goal of generative AI is exactly to produce indistinguishable results. As long as the AI is strong enough, detecting it will be nearly impossible. There are some papers about detection, but I will refrain from expanding on this.

--[[ Recall 2: SIMDebian

This is a deprecated attempt to bump the ISA baseline in order to use modern CPU intrinsics.
One of my motivations for proposing this was that neural network computation can be brutal. Bumping the ISA baseline helps significantly if you run it on a CPU. Just for reference. However, as long as the user has a GPU, running a neural network on the CPU is little more than a waste of time.

--[[ Recall 3: Debian User Package Repository

This is a deprecated attempt to create an ebuild-like source-based distribution for .deb packages. One of the motivations for proposing this was that redistributing AI software with neural networks through the archive is problematic ... but it is OK if the neural network is downloaded locally by the user through a script, and the package is built locally by the end user. As for the problematic licensing issue... anyway, the software works, and the components in question are not distributed by us.

Just for reference. This is not important now. Surely there are too many non-standard ways to install software.

--[[ Some ongoing and future work on the Debian side

Debian always provides a solid base system [5], upon which upper-layer application collections like PyPI, Anaconda, and Docker work very well. For many intricate reasons, such as the clearly limited volunteer bandwidth, the Debian archive is not suitable as an alternative to these ecosystems. I'll refrain from expanding on this to avoid going off topic. Please ask if you want to read more about it.

That said, we can still incorporate some of the most important software infrastructure in our archive, such as deep learning frameworks and neural network acceleration libraries. The upper-layer applications are not discussed here.

PyTorch is currently the most prevalent deep learning framework. It is in good shape in our archive as well. A random trending AI project on GitHub will most likely be based on PyTorch nowadays. I have recently uploaded the CUDA version of PyTorch to the NEW queue.
While I can still handle this package on my own, its compilation and testing are brutal [7]. You are welcome to join me in the maintenance if you are interested...

In my opinion, TensorFlow will gradually fade away in favor of JAX [6]. I really don't suggest that anyone pursue TensorFlow packaging as of 2023. I have already orphaned the whole TensorFlow dependency tree under my name. (I acknowledge that I'm a PyTorch user and that I am biased against TensorFlow's obscure API and terrible documentation.)

See below if you want to get involved.

--[[ Team Advertisement

Debian Deep Learning Team <debian...@lists.debian.org> welcomes new contributors. The mailing list is currently abused for general discussion and two tracks of development work:

1. https://salsa.debian.org/deeplearning-team
   Deep learning frameworks.
2. https://salsa.debian.org/rocm-team
   ROCm is AMD's free software counterpart to Nvidia's proprietary CUDA. (I wouldn't have bothered to create this team if it were non-free.)

There could be an Intel/SYCL team in the future. But Intel is not yet ready to upstream its SYCL implementation into LLVM. I'll only try this by myself when PyTorch starts to support Intel/SYCL.

Thanks for reading the long mail. I hope you find some interesting topics and inspiration here.

[1] https://lists.debian.org/debian-project/2023/02/msg00017.html
[2] LLM = Large Language Model, such as GPT-3, GPT-4, etc.
[3] https://arxiv.org/pdf/2302.13971.pdf
[4] Yes, please write as many bugs as possible in your code. Your bugs could be heroic if they choke a super AI trying to destroy the world. (I'm not serious)
[5] IIRC, one tenet of the UNIX philosophy goes, "do one thing, and do it well".
[6] https://github.com/google/jax
[7] Debomatic-amd64 has a Xeon E5-2697v3 (IIRC). It takes ~3 hours for a full build plus checks of the CPU version of PyTorch without ccache. The CUDA version will only take longer.