[Fedora-legal-list] Re: Privacy policy - request for clarification (Hatlas)

Michael Winters via legal Tue, 11 Nov 2025 20:43:45 -0800

Thanks for the quick response, Jilayne, and apologies for my verbosity here. I 
don't know how much of this you already know.

> what dataset are you referring to

There are actually many different datasets published by Fedora, and much of the 
task ahead is to bring them under one roof where they can be cross-referenced. 
But most of them have very similar data, and the same general privacy questions 
apply.

To answer your question, "Datanommer" is the biggest of these. This is a record 
of all the messages sent between various Fedora systems.

For example, let's say that system A stores code. When user "bob" changes code 
for project "foo" which is stored in system A, system A sends a message to all 
other Fedora systems saying, "user bob made this change to foo's code." 
Depending on the message, any number of other systems might respond. For 
example, system B might rebuild the code for project "foo" and release a new 
version of package "foo" to Fedora end users. Meanwhile, system C might be 
tracking how many code changes were made by bob this month for the purpose of 
issuing recognition rewards to incentivize contributions.

These systems might then also issue their own additional messages. Going back 
to system B, once it has released the new version of package "foo", it might 
send a message saying that it has done so successfully, and another system 
responds by announcing the new version somewhere. In general, this is the 
"event driven" fabric by which most of our applications communicate with each 
other.
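
To make that flow concrete, here is a minimal Python sketch of the example 
above. It is a toy in-memory model, not Fedora's actual messaging code: the 
`Bus` class, the topic names ("code.change", "build.complete") and the handler 
functions are all hypothetical names made up for illustration. The `archive` 
list stands in for the role Datanommer plays.

```python
from collections import defaultdict
from datetime import datetime, timezone


class Bus:
    """Toy in-memory message bus; a stand-in for the real fabric."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of handlers
        self.archive = []                     # every message ever published

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, username, body):
        message = {
            "topic": topic,
            "username": username,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "body": body,
        }
        self.archive.append(message)  # archived before any system reacts
        for handler in list(self.subscribers[topic]):
            handler(message)


bus = Bus()
commits_this_month = defaultdict(int)
announcements = []


def system_b_rebuild(message):
    # System B "rebuilds" the package, then reports success on the same bus.
    project = message["body"]["project"]
    bus.publish("build.complete", message["username"], {"package": project})


def system_c_count(message):
    # System C tallies code changes per user.
    commits_this_month[message["username"]] += 1


bus.subscribe("code.change", system_b_rebuild)
bus.subscribe("code.change", system_c_count)
bus.subscribe("build.complete",
              lambda m: announcements.append(m["body"]["package"]))

# System A: user "bob" changes the code for project "foo".
bus.publish("code.change", "bob", {"project": "foo"})
```

One change by bob leaves two archived messages, each carrying his username and 
a timestamp, which is the property the privacy questions hinge on.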

Every message ever sent on this fabric back to 2012 is currently stored in the 
Datanommer database and is published by Fedora in several places, including on 
Datagrepper, which Neil shared with you. As I've illustrated, these messages 
include records of user activity tied to a username and a timestamp. Depending 
on the type of message, additional information is also included, such as "this 
is the code that bob changed" in my example.

The dataset is enormous and difficult to review. There are over 28,000 
different types of messages in the system currently, which we refer to as 
"topics". The consensus seems to be that this data does not include the IP 
addresses of users or any other data that could be used to track a user's 
location or discover their identity beyond username. However, it's worth noting 
that there also seems to be no single party responsible for ensuring that this 
is true and that it remains true. In fact, I'm a bit concerned that, given the 
vast quantity and variety of data here, some accidental PII may be discovered 
once the easier access that Hatlas provides draws the closer community scrutiny 
that I'm anticipating.

If I'm wrong and a responsible party has in fact been assigned, it would ease 
some concerns to publish that point of contact somewhere in our documentation.
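
As a rough idea of what an automated first pass over the messages could look 
like, here is a small Python sketch that walks a message payload and flags 
strings that look like email addresses or IPv4 addresses. The message layout 
and the function names are assumptions made for illustration, not the real 
Datanommer schema, and the patterns are deliberately simple: they will miss 
things (IPv6, for example) and will flag false positives such as four-part 
version numbers, so this would support a human review rather than replace it.

```python
import ipaddress
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_CANDIDATE_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def iter_strings(value):
    """Yield every string found anywhere inside a nested dict/list payload."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from iter_strings(key)
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_strings(item)


def find_possible_pii(message):
    """Return (kind, text) pairs for strings that merit a manual look."""
    findings = []
    for text in iter_strings(message):
        for match in EMAIL_RE.findall(text):
            findings.append(("email", match))
        for candidate in IPV4_CANDIDATE_RE.findall(text):
            try:
                ipaddress.IPv4Address(candidate)
            except ValueError:
                continue  # looks dotted-quad but is not a valid address
            findings.append(("ipv4", candidate))
    return findings
```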

> what is the intent of using this consolidated data?

Our previous Fedora Project Leader established a goal of doubling our number of 
contributors by 2028, aka "Strategy 2028". However, today we have no way of 
measuring how many contributors we have, let alone whether that number is 
growing, by how much, which of our efforts are most successful in growing it, 
or where we ought to invest.

This very simple question of how many contributors we have is complex to answer 
because there are many types of contributions. For example, someone who hosts a 
regional Fedora conference is certainly a contributor by Strategy 2028 
standards, but it's not likely that those activities will require new code, so 
measuring contributors as "those who contribute code" will miss this type of 
person and many others.

Additionally, there is a lot of nuance in many cases. Someone who makes a 
single code contribution of 10,000 lines of code representing hundreds of hours 
of work is likely more engaged with the project than someone who corrected 2 
separate typos in our documentation. We need to be able to combine these many 
sources of data and all of the nuances within them to ascertain the size and 
"health" of our community, for the purpose of improving it.
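
A small Python sketch of why the definition matters follows. The activity 
records, the activity kinds and especially the weights are invented for 
illustration; picking real weights is a community decision, not something this 
code settles.

```python
from collections import defaultdict

# Hypothetical weights per unit of activity; real values are a policy choice.
WEIGHTS = {"code_lines": 1, "doc_fix": 50, "event_hosted": 5000}


def count_contributors(activities, kinds=None):
    """Count distinct usernames, optionally restricted to some activity kinds."""
    return len({
        username
        for username, kind, _ in activities
        if kinds is None or kind in kinds
    })


def engagement_scores(activities):
    """Sum weighted activity per username."""
    scores = defaultdict(int)
    for username, kind, amount in activities:
        scores[username] += WEIGHTS.get(kind, 0) * amount
    return dict(scores)


# (username, activity kind, amount) records, made up for this example.
activities = [
    ("alice", "code_lines", 10000),  # one large code contribution
    ("bob", "doc_fix", 1),           # first typo fix
    ("bob", "doc_fix", 1),           # second typo fix
    ("carol", "event_hosted", 1),    # hosted a regional conference
]

code_only = count_contributors(activities, kinds={"code_lines"})
everyone = count_contributors(activities)
scores = engagement_scores(activities)
```

Counting only code contributors finds one person here, while counting every 
activity kind finds three, and the weighted scores separate alice's single 
large change from bob's two small ones.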

I've developed Hatlas as a downstream product of Fedora data mainly because 
that was the most expedient way for me to facilitate these analyses. My 
long-term plan is to bring this project fully into Fedora, running under 
Fedora's governance and on Fedora's servers.

However, the fact is that anyone could build a similar system with no special 
access or permissions required, and use this data for very different purposes 
such as publishing what time of day a specific user is active in order to 
encourage harassment, or to see which admins change their password the least 
frequently. (To be clear, the password change example is hyperbole, but data of 
similar sensitivity could be inadvertently present in such a vast dataset. In 
fact, the task of discovering what data we have is the first step of building 
our analyses, and will likely be an ongoing activity.)

I've already received comments of concern along these lines and I anticipate 
many more after publicly announcing Hatlas, which is why I'm asking for clarity 
-- to be able to definitively say, "the Fedora data policies do 
permit collecting and publishing this data, and it is compliant with all 
applicable laws." (Which may result in the community asking for a policy 
change.) Or, alternatively, to figure out what might need to change from the 
current state to get us to that point.

Thanks again,

Michael Winters

On November 11, 2025 6:07:00 PM CST, Jilayne Lovejoy via legal 
<[email protected]> wrote:
>Hi Michael,
>
>Thanks for raising this.  Having looked to your site, I'm a bit unclear as to 
>what dataset you are referring to (what is datanommer, is that an existing set 
>of data or a name you made?) and what exactly is published publicly already?
>
>It sounds like you are potentially pulling data from a variety of different 
>sources, is that right? If so, what are these sources and what is the intent 
>of using this consolidated data?
>
>Thanks,
>Jilayne
>
>On 11/11/25 10:27 AM, Michael Winters via legal wrote:
>> Hello Legal,
>> 
>> My name is Michael Winters, typically known here as @mwinters.  I have some 
>> questions about Fedora's data privacy policies, which I'll provide a bit of 
>> context to first.
>> 
>> There has been a long-standing desire within Fedora for better tools with 
>> which to analyze our user data and understand our community so that we can 
>> improve it.  To this end, I have recently created a "Data Lakehouse" proof 
>> of concept known as "Hatlas", available at https://hatlas.mwinters.net .  
>> This technology consolidates data from existing public Fedora datasets and 
>> provides simplified tools to facilitate public access and analysis.
>> 
>> Since these datasets were previously quite difficult to access, I believe 
>> that most people are unaware of what data exists about them within Fedora 
>> and/or the fact that it's being published publicly.  I expect that the 
>> announcement of easy access to this data will raise some community concerns 
>> about data privacy, so this email is in anticipation of those concerns.  I 
>> wish to have clear resources to refer people to, and current resources such 
>> as https://docs.fedoraproject.org/en-US/legal/privacy/ have left some 
>> questions open.
>> 
>> In particular, many of these datasets include usernames and records of user 
>> activity tied to those usernames, e.g. the contents and exact timing of 
>> forum posts, git commits, group membership changes, etc.  My current 
>> questions are:
>> 
>> 1) Does an arbitrary username (not necessarily tied to a real name) 
>> constitute PII which must be protected / anonymized?  It is not currently 
>> anonymized in Fedora datasets.
>> 
>> 2) Do current Fedora policies permit collecting user activity tied to 
>> usernames?  This is not explicitly stated under "Information We Collect", 
>> though it is mentioned later under "Using (Processing) Your Personal Data."
>> 
>> 3) Do current Fedora policies permit publishing user activity tied to 
>> usernames?  Section "Sharing Your Personal Data" does mention "For research 
>> activities", but it does not specify that data must be shared *only* in 
>> aggregate.
>> 
>> 4) How does GDPR view downstream users of public data sources, i.e. Hatlas?  
>> Is Hatlas a "data processor"?  Must Hatlas integrate with Fedora's Personal 
>> Data Removal process?  We intend to do so, but there seems to be no 
>> obligation for either party.
>> 
>> 5) Are there any data licenses applicable to downstream users such as 
>> Hatlas?  I intend to apply one restricting the use of Hatlas data to 
>> non-commercial purposes, but there seem to be no restrictions coming from 
>> Fedora.
>> 
>> Thanks in advance!
>> 
>> Michael Winters
