On Sun, 26 Feb 2023 at 21:47, Russ Allbery <r...@debian.org> wrote:
> I am not a lawyer, let alone a copyright lawyer, and have only an amateur
> Internet understanding of the nature of compilation copyrights (and they
> may well also vary by jurisdiction), but my understanding (possibly
> incorrect) of the law in the US is that holding copyright on a member of a
> collection does not give you any copyright ownership of the collection as
> a whole. To gain copyright ownership of the collection, you have to
> exercise some sort of creative control over the collection itself, such as
> by using human creativity to select its membership, choosing some elements
> and discarding others.

Or by creating a tool that does that for you, following your criteria,
and helps you with jobs such as applying labels. Labels (like every other
input structure) can be applied manually by humans (art-sculpture) or by
rules that humans define, etc. Otherwise, using SQL to work with a
database instead of editing every field by hand would completely wipe out
every right on the database itself. A database created with SQL is a
protectable work, and so is a structured ML/AI input collection.

> The person distributing the collection has to
> comply with copyright law with respect to the material included that you
> hold a copyright on (either satisfying your license or following the rules
> of fair use), but if you're not involved in creating the collection, you
> don't get any separate rights over the collection itself and cannot assert
> a license on it.

A totally automatic procedure like web crawling and web indexing fits
your example perfectly. However, the input collection that an ML/AI
training system needs is a protectable work, because the data has to be
structured, selected and properly labeled, even if these activities are
carried out through rules, as happens when using SQL for databases. Thus,
if this protectable collection is an enlargement of a previous protected
collection, then copyright law applies.

However, statistics about word sequences can be the product of a
completely automatic process. So web indexing and statistics are built
over input collections that are *not* creative works, and these tools
access every copyrighted work under fair use as long as they respect the
robots:no meta-tag when it is applied to a copyrighted work.

Training an ML/AI system, instead, is another story entirely, and its
input collections are protectable collections under copyright law. This
was explained in one of my first e-mails on this subject, here:

- https://lists.debian.org/debian-project/2023/02/msg00020.html

Which, after all, is the reason why data scientists are complaining about
not getting back the "AI input collection" even when it is created using
their copyrighted works.

One day, an AI able to learn on its own without any human action, and to
collect data by itself, will be here reading this e-mail, but today is
not that day, yet. :-)

Best regards, R-