Thanks for the quick response, Jilayne, and apologies for my verbosity here. I don't know what you know.

> what dataset are you referring to

There are actually many different datasets published by Fedora, and much of the task ahead is to bring them under one roof where they can be cross-referenced. But most of them have very similar data, and the same general privacy questions apply.

To answer your question, "Datanommer" is the biggest of these. This is a record of all the messages sent between various Fedora systems. For example, let's say that system A stores code. When user "bob" changes code for project "foo" which is stored in system A, system A sends a message to all other Fedora systems saying, "user bob made this change to foo's code." Depending on the message, any number of other systems might respond. For example, system B might rebuild the code for project "foo" and release a new version of package "foo" to Fedora end users. Meanwhile, system C might be tracking how many code changes were made by bob this month for the purpose of issuing recognition rewards to incentivize contributions. These systems might then also issue their own additional messages. Going back to system B, once it has released the new version of package "foo", it might send a message saying that it has done so successfully, and another system responds by announcing the new version somewhere.

In general, this is the "event driven" fabric by which most of our applications communicate with each other. Every message ever sent on this fabric back to 2012 is currently stored in the Datanommer database and is published by Fedora in several places, including on Datagrepper which Neil shared with you.

As I've illustrated, these messages include records of user activity tied to a username and a timestamp. Depending on the type of message, additional information is also included, such as "this is the code that bob changed" in my example. The dataset is enormous and difficult to review. There are over 28,000 different types of messages in the system currently, which we refer to as "topics".
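
In case it helps to see the flow concretely, here is a toy sketch of the system A / B / C example above. This is not Fedora's actual messaging code: the `Bus` and `Message` classes, the topic names ("code.changed", "build.released"), and the field names are all hypothetical, made up for illustration. The only point is that every message carries a topic, a username, and a timestamp, and that an archive like Datanommer keeps all of them.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Message:
    """Hypothetical message shape: topic, username, payload, timestamp."""
    topic: str
    username: str
    body: dict
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class Bus:
    """Toy in-memory message fabric with an archive of every message."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks
        self.archive = []  # stands in for a Datanommer-style record

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, message):
        # Every message is archived before any system reacts to it.
        self.archive.append(message)
        for callback in list(self.subscribers[message.topic]):
            callback(message)


bus = Bus()
commit_counts = defaultdict(int)


def rebuild_package(msg):
    # "System B": rebuilds the project, then announces the release.
    bus.publish(
        Message("build.released", msg.username, {"package": msg.body["project"]})
    )


def count_commit(msg):
    # "System C": tallies code changes per user.
    commit_counts[msg.username] += 1


bus.subscribe("code.changed", rebuild_package)
bus.subscribe("code.changed", count_commit)

# "System A": bob changes the code for project foo.
bus.publish(Message("code.changed", "bob", {"project": "foo"}))

for msg in bus.archive:
    print(msg.timestamp.isoformat(), msg.topic, msg.username, msg.body)
```

Note that the follow-up "build.released" message still carries bob's username and its own timestamp, which is how a single user action can leave several activity records in the archive.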

The consensus seems to be that this data does not include the IP addresses of users or any other data that could be used to track a user's location or discover their identity beyond username. However, it's worth noting that there also seems to be no single party responsible for ensuring that this is true and that it remains true. In fact, I'm a bit concerned that, given the vast quantity and variety of data here, some accidental PII may be discovered under the increased community scrutiny that I anticipate once Hatlas makes this data easier to access. If I'm wrong and a responsible party has been assigned, it would ease some concerns to publish that point of contact somewhere in our documentation.

> what is the intent of using this consolidated data?

Our previous Fedora Project Leader established a goal of doubling our number of contributors by 2028, aka "Strategy 2028". However, today we have no way of measuring how many contributors we have, let alone whether that number is growing, by how much, which of our efforts are most successful in growing it, and where we ought to invest.

This very simple question of how many contributors we have is complex to answer because there are many types of contributions. For example, someone who hosts a regional Fedora conference is certainly a contributor by Strategy 2028 standards, but it's not likely that those activities will require new code, so measuring contributors as "those who contribute code" will miss this type of person and many others. Additionally, there is a lot of nuance in many cases. Someone who makes a single code contribution of 10,000 lines of code representing hundreds of hours of work is likely more engaged with the project than someone who corrected two separate typos in our documentation. We need to be able to combine these many sources of data and all of the nuances within them to ascertain the size and "health" of our community, for the purpose of improving it.

I've developed Hatlas as a downstream product of Fedora data mainly because that was the most expedient way for me to facilitate these analyses. My long-term plan is to bring this project fully into Fedora, running under Fedora's governance and on Fedora's servers. However, the fact is that anyone could build a similar system with no special access or permissions required, and use this data for very different purposes such as publishing what time of day a specific user is active in order to encourage harassment, or to see which admins change their password the least frequently. (To be clear, the password change example is hyperbole, but data of similar sensitivity could be inadvertently present in such a vast dataset. In fact, the task of discovering what data we have is the first step of building our analyses, and will likely be an ongoing activity.)

I've already received comments of concern along these lines and I anticipate many more after publicly announcing Hatlas, which is why I'm asking for clarity -- to be able to definitively say, "the Fedora data policies do permit collecting and publishing this data, and it is compliant with all applicable laws." (Which may result in the community asking for a policy change.) Or, alternatively, to figure out what might need to change from the current state to get us to that point.

Thanks again,
Michael Winters

On November 11, 2025 6:07:00 PM CST, Jilayne Lovejoy via legal <[email protected]> wrote:
>Hi Michael,
>
>Thanks for raising this. Having looked to your site, I'm a bit unclear as to
>what dataset you are referring to (what is datanommer, is that an existing set
>of data or a name you made?) and what exactly is published publicly already?
>
>It sounds like you are potentially pulling data from a variety of different
>sources, is that right? If so, what are these sources and what is the intent
>of using this consolidated data?
>
>Thanks,
>Jilayne
>
>On 11/11/25 10:27 AM, Michael Winters via legal wrote:
>> Hello Legal,
>>
>> My name is Michael Winters, typically known here as @mwinters. I have some
>> questions about Fedora's data privacy policies, which I'll provide a bit of
>> context to first.
>>
>> There has been a long-standing desire within Fedora for better tools with
>> which to analyze our user data and understand our community so that we can
>> improve it. To this end, I have recently created a "Data Lakehouse" proof
>> of concept known as "Hatlas", available at https://hatlas.mwinters.net .
>> This technology consolidates data from existing public Fedora datasets and
>> provides simplified tools to facilitate public access and analysis.
>>
>> Since these datasets were previously quite difficult to access, I believe
>> that most people are unaware of what data exists about them within Fedora
>> and/or the fact that it's being published publicly. I expect that the
>> announcement of easy access to this data will raise some community concerns
>> about data privacy, so this email is in anticipation of those concerns. I
>> wish to have clear resources to refer people to, and current resources such
>> as https://docs.fedoraproject.org/en-US/legal/privacy/ have left some
>> questions open.
>>
>> In particular, many of these datasets include usernames and records of user
>> activity tied to those usernames, e.g. the contents and exact timing of
>> forum posts, git commits, group membership changes, etc. My current
>> questions are:
>>
>> 1) Does an arbitrary username (not necessarily tied to a real name)
>> constitute PII which must be protected / anonymized? It is not currently
>> anonymized in Fedora datasets.
>>
>> 2) Do current Fedora policies permit collecting user activity tied to
>> usernames? This is not explicitly stated under "Information We Collect",
>> though it is mentioned later under "Using (Processing) Your Personal Data."
>>
>> 3) Do current Fedora policies permit publishing user activity tied to
>> usernames? Section "Sharing Your Personal Data" does mention "For research
>> activities", but it does not specify that data must be shared *only* in
>> aggregate.
>>
>> 4) How does GDPR view downstream users of public data sources, i.e. Hatlas?
>> Is Hatlas a "data processor"? Must Hatlas integrate with Fedora's Personal
>> Data Removal process? We intend to do so, but there seems to be no
>> obligation for either party.
>>
>> 5) Are there any data licenses applicable to downstream users such as
>> Hatlas? I intend to apply one restricting the use of Hatlas data to
>> non-commercial purposes, but there seem to be no restrictions coming from
>> Fedora.
>>
>> Thanks in advance!
>>
>> Michael Winters
--
_______________________________________________
legal mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/[email protected]
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue
