Hi all,

I want to propose gathering telemetry for Airflow installations. As the
Airflow community, we have been relying heavily on the yearly Airflow
Survey and anecdotes to answer a few key questions about Airflow usage.
Questions like the following:


   - Which versions of Airflow are people installing/using now (i.e.
   whether people have primarily made the jump from version X to version Y)?
   - Which database is used as the metadata DB, and which version (e.g.
   Postgres 14)?
   - Which Python version is being used?
   - Which Executor is being used?
   - Approximately how many people out there in the world are installing
   Airflow?


There is a solution that should help answer these questions: Scarf [1].
Scarf is already approved by the ASF [2][3], as it complies with GDPR and
related regulations, and it is already used by other ASF projects:
Superset [4], DolphinScheduler [5], Dubbo Kubernetes, DevLake, and
SkyWalking.

Similar to Superset, we could probably use it as follows:


   1. Install the `scarf-js` npm package and bundle it into the Webserver.
   When the package is downloaded and the Airflow webserver is opened,
   metadata is recorded to the Scarf dashboard.
   2. Use the Scarf Gateway [6] in front of our Docker containers. While
   it's possible for people to go around the gateway, we can probably
   configure and encourage most traffic to go through it.
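For illustration, step 1 might amount to something like the following
package.json fragment, assuming we follow Superset's approach of depending
on Scarf's npm package (the package name shown, `@scarf/scarf`, and the
unpinned version are illustrative, not a final decision):

```json
{
  "dependencies": {
    "@scarf/scarf": "*"
  }
}
```

With a dependency like this in the webserver's frontend build, scarf-js
would report basic, non-identifying environment metadata to the Scarf
dashboard at install time, without any runtime code changes on our side.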

While Scarf does not store any personally identifying information from SDK
telemetry data, it does send various bits of IP-derived information, as
outlined here [7]. This data should be made as transparent as possible:
granting dashboard access to the Airflow PMC and sharing/surfacing it
through any other relevant channels we find (Town Hall, Slack, newsletter,
etc.).

The following case studies are worth reading:

   1. https://about.scarf.sh/post/scarf-case-study-apache-superset (from
   Maxime)
   2. https://about.scarf.sh/post/haskell-org-bridging-the-gap-between-language-innovation-and-community-understanding

As in those cases, this could help us in the many ways that come with using
data for decision-making. With clear guidelines on "how to opt out"
[8][9][10] and "what data is being collected" published on the Airflow
website, this can benefit the entire community, as we would be making more
informed decisions.
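As a sketch of what opting out could look like, based on the mechanisms
scarf-js documents today [10] (the exact mechanism for Airflow would be
settled and documented as part of this proposal):

```shell
# Either variable, set in the environment before `npm install` (or before
# starting the webserver), disables scarf-js reporting entirely.
export SCARF_ANALYTICS=false   # Scarf-specific opt-out
export DO_NOT_TRACK=1          # generic Do Not Track convention, also honored
```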

Regards,
Kaxil


[1] https://about.scarf.sh/
[2] https://privacy.apache.org/policies/privacy-policy-public.html
[3] https://privacy.apache.org/faq/committers.html
[4] https://github.com/apache/superset/issues/25639
[5] https://github.com/search?q=repo%3Aapache%2Fdolphinscheduler%20scarf.sh&type=code
[6] https://about.scarf.sh/scarf-gateway
[7] https://about.scarf.sh/privacy-policy
[8] https://superset.apache.org/docs/frequently-asked-questions/#does-superset-collect-any-telemetry-data
[9] https://superset.apache.org/docs/installation/installing-superset-using-docker-compose
[10] https://docs.scarf.sh/package-analytics/#as-a-user-of-a-package-using-scarf-js-how-can-i-opt-out-of-analytics
