+1. All of that sounds reasonable, there are precedents, and the ASF officially supports Scarf. It would be great to have access to such telemetry data.
On Sat, Mar 30, 2024 at 1:18 AM Kaxil Naik <kaxiln...@apache.org> wrote:
> Hi all,
>
> I want to propose gathering telemetry for Airflow installations. As the
> Airflow community, we have been relying heavily on the yearly Airflow
> Survey and anecdotes to answer a few key questions about Airflow usage,
> such as:
>
> - Which versions of Airflow are people installing/using now (i.e.
>   whether people have primarily made the jump from version X to version Y)?
> - Which DB is used as the metadata DB, and which version (e.g. Postgres 14)?
> - Which Python version is being used?
> - Which Executor is being used?
> - Approximately how many people out there in the world are installing
>   Airflow?
>
> There is a solution that should help answer these questions: Scarf [1].
> The ASF already approves Scarf [2][3], and it is already used by other ASF
> projects (Superset [4], Dolphin Scheduler [5], Dubbo Kubernetes, DevLake,
> SkyWalking), as it follows GDPR and other regulations.
>
> Similar to Superset, we could probably use it as follows:
>
> 1. Install the `scarf-js` npm package and bundle it in the Webserver.
>    When the package is downloaded and the Airflow webserver is opened,
>    metadata is recorded to the Scarf dashboard.
> 2. Utilize the Scarf Gateway [6], which we can put in front of Docker
>    containers. While it is possible for people to go around this gateway,
>    we can probably configure and encourage most traffic to go through it.
>
> While Scarf does not store any personally identifying information from SDK
> telemetry data, it does send various bits of IP-derived information, as
> outlined here [7]. This data should be made as transparent as possible by
> granting dashboard access to the Airflow PMC and through any other relevant
> means of sharing/surfacing it that we encounter (Town Hall, Slack,
> Newsletter, etc.).
>
> The following case studies are worth reading:
>
> 1. https://about.scarf.sh/post/scarf-case-study-apache-superset (from
>    Maxime)
> 2. https://about.scarf.sh/post/haskell-org-bridging-the-gap-between-language-innovation-and-community-understanding
>
> Similar to them, this could help us in the various ways that come with
> using data for decision-making. With clear guidelines on "how to opt out"
> [8][9][10] and "what data is being collected" on the Airflow website, this
> can benefit the entire community, as we would be making more informed
> decisions.
>
> Regards,
> Kaxil
>
> [1] https://about.scarf.sh/
> [2] https://privacy.apache.org/policies/privacy-policy-public.html
> [3] https://privacy.apache.org/faq/committers.html
> [4] https://github.com/apache/superset/issues/25639
> [5] https://github.com/search?q=repo%3Aapache%2Fdolphinscheduler%20scarf.sh&type=code
> [6] https://about.scarf.sh/scarf-gateway
> [7] https://about.scarf.sh/privacy-policy
> [8] https://superset.apache.org/docs/frequently-asked-questions/#does-superset-collect-any-telemetry-data
> [9] https://superset.apache.org/docs/installation/installing-superset-using-docker-compose
> [10] https://docs.scarf.sh/package-analytics/#as-a-user-of-a-package-using-scarf-js-how-can-i-opt-out-of-analytics
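For anyone curious about the opt-out path referenced in [8][9][10] above: Scarf's package-analytics docs describe disabling scarf-js telemetry via an environment variable. A minimal sketch, assuming the `SCARF_ANALYTICS` variable those docs document (check the linked pages for the current mechanism):

```shell
# Opt out of scarf-js telemetry before installing or running software
# that bundles it. Per Scarf's docs ([10] above), setting SCARF_ANALYTICS
# to "false" makes the scarf-js hook skip its analytics ping.
export SCARF_ANALYTICS=false

# Confirm the variable is set for subsequent commands (e.g. npm install).
echo "SCARF_ANALYTICS=$SCARF_ANALYTICS"
```

Superset's docker-compose docs ([9]) take the same approach, exporting the variable in the container environment so the opt-out applies to every process in it.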