On Sat, 25 Jan 2025 at 16:24, M. Zhou <[email protected]> wrote:
>
> On Sat, 2025-01-25 at 12:09 +0000, Sean Whitton wrote:
> >
> > Wondered if you'd had another chance to look at this.
>
> Ummm... You know what may happen when there is no deadline.

The best time to do this was last year around the OSAID 1.0 release. The next best time is now. Do you need our help?

I'm working on an article about how the chickens have come home to roost with the VLC demo at CES 2025. With VLC advertising and users now expecting real-time AI subtitling that "appears to be built directly into the VLC app"[1], we have a situation where VLC is considered Open Source by the OSD, but NOT according to the OSAID and OSI leadership[2] because of Whisper being embedded.

With more and more software being written by and incorporating AI, this situation is untenable. Distros like Debian would have to lobotomise popular apps like VLC, or accept more binary blobs.

The OSI also just released a whitepaper[3] that further deliberately obfuscates the issue, prompting me to post this:

The Open Source Initiative (OSI) goes to the effort of defining four classes of data *source* (hence the term!) in their Open Source AI Definition (OSAID) FAQ and again in the Open Future Foundation’s name in this new paper, only to then accept ANY of them… or NONE at all:

- OPEN data under open licenses, which is the ONLY class that has any role in Open Source AI
- PUBLIC data like Common Crawl Foundation dumps of the Internet, which are routinely ab/used without creators’ consent
- OBTAINABLE data “including for a fee” like The New York Times articles and Adobe/Getty Images stock photos, which are guaranteed to get end users (but not necessarily vendors given limited liability clauses) sued
- UNSHAREABLE NONPUBLIC data that obviously has no place in Open Source, like Facebook & Instagram feeds

With the meaning of Open Source AI being defined solely by the LOWEST bar (no data delivered at all, which is allowed under the OSAID), why bother with the smokescreen if not to deliberately deceive us users?

An honest FAQ entry would have read like this:

What kind of data should be required in the Open Source AI Definition?
None.

1. https://hackaday.com/2025/01/15/floss-weekly-episode-816-open-source-ai/
2. https://www.theverge.com/2025/1/9/24339817/vlc-player-automatic-ai-subtitling-translation
3. https://openfuture.eu/publication/data-governance-in-open-source-ai/

