Hello everyone,

> it has to be:
>
> 1. Opt-in by default to not trigger security guys about new unplanned
> activity after regular upgrade.

That's a very good point about triggering the security guys, Alexander, but
I am not so sure it means that we "have to" do opt-in. There are other ways
of communicating with the "deployment managers" who install and upgrade
Airflow - e.g. release notes, blogs, our social media, Slack announcements,
etc. We have plenty of channels we can use to communicate the change.

I think we have a very good blueprint to follow, including at least 5 other
ASF projects that also passed the review of privacy@asf. And while I
understand (and concur with) the urge for opt-in by default coming from the
consumer market (where it makes perfect sense), Airflow is not consumer
software - it is used in a "corporate environment", which has somewhat
different expectations and a broad assumption that the company can make
decisions on such telemetry on behalf of the employees using it.

We should assume that those who deploy and upgrade Airflow actually read
and take into account what is written in the release notes - especially if
they have security guys breathing down their necks - just as we have to
assume they follow CVE announcements about the security issues we fix. If
we are very straightforward and upfront about the change and explain very
clearly how to opt out, I don't see a big problem with opt-out.

We should of course check with privacy@a.o (but I've spent a good deal of
time reading the Superset and other use cases and explanations in detail to
make a better informed decision) - and it looks like they also went the
opt-out way and got cleared by privacy@a.o. And if we cannot reach
consensus, we should - as usual - make a voting decision on it (because
yes, it is an important decision), but - after reading and understanding
why others also did it - for me personally, opt-out is a good path.

Also, opt-out will likely increase the amount of data we gather, and in our
case - counter-intuitively - that is actually better for privacy and
corporate anonymity, because the more data we get, the more difficult it
becomes to derive any non-statistical/non-aggregated insight from it.
Imagine if only a few corporate users consciously enabled it - then we
could draw far more conclusions about them once we figured out who they
are than if everyone had it enabled by default.
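
A toy illustration of that point (made-up numbers, purely to show the
intuition - not real data):

    from collections import Counter

    few_optins = ["2.7.3 + Postgres 14 + KubernetesExecutor"] * 3
    everyone = (["2.7.3 + Postgres 14 + KubernetesExecutor"] * 12000
                + ["2.6.3 + MySQL 8 + CeleryExecutor"] * 9000)

    for label, sample in [("opt-in only", few_optins),
                          ("opt-out default", everyone)]:
        shares = {combo: f"{count / len(sample):.0%}"
                  for combo, count in Counter(sample).items()}
        # 3 records are 3 guessable deployments; 21000 records are just a
        # usage statistic.
        print(label, shares)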

That's my take on it - but again, it's up to us to vote. For me, opt-in is
not a "has to", and I am rather for opt-out.

J.

> Hi all,
>
>
> > I want to propose gathering telemetry for Airflow installations. As the
> > Airflow community, we have been relying heavily on the yearly Airflow
> > Survey and anecdotes to answer a few key questions about Airflow usage.
> > Questions like the following:
> >
> >
> >    - Which versions of Airflow are people installing/using now (i.e.
> >    whether people have primarily made the jump from version X to version Y)
> >    - Which DB is used as the Metadata DB and which version e.g Pg 14?
> >    - What Python version is being used?
> >    - Which Executor is being used?
> >    - Approximately how many people out there in the world are installing
> >    Airflow
> >
> >
> > There is a solution that should help answer these questions: Scarf [1].
> > The ASF already approves Scarf [2][3] and is already used by other ASF
> > projects: Superset [4], Dolphin Scheduler [5], Dubbo Kubernetes, DevLake,
> > Skywalking as it follows GDPR and other regulations.
> >
> > Similar to Superset, we probably can use it as follows:
> >
> >
> >    1. Install the `scarf js` npm package and bundle it in the Webserver.
> >    When the package is downloaded & Airflow webserver is opened, metadata
> >    is recorded to the Scarf dashboard.
> >    2. Utilize the Scarf Gateway [6], which we can use in front of docker
> >    containers. While it’s possible people go around this gateway, we can
> >    probably configure and encourage most traffic to go through these
> >    gateways.
> >
> > While Scarf does not store any personally identifying information from SDK
> > telemetry data, it does send various bits of IP-derived information as
> > outlined here [7]. This data should be made as transparent as possible by
> > granting dashboard access to the Airflow PMC and any other relevant means
> > of sharing/surfacing it that we encounter (Town Hall, Slack, Newsletter
> > etc).
> >
> > The following case studies are worth reading:
> >
> >    1. https://about.scarf.sh/post/scarf-case-study-apache-superset
> >    (From Maxime)
> >    2. https://about.scarf.sh/post/haskell-org-bridging-the-gap-between-language-innovation-and-community-understanding
> >
> > Similar to them, this could help in various ways that come with using data
> > for decision-making. With clear guidelines on "how to opt-out" [8][9][10] &
> > "what data is being collected" on the Airflow website, this can be
> > beneficial to the entire community as we would be making more informed
> > decisions.
> >
> > Regards,
> > Kaxil
> >
> >
> > [1] https://about.scarf.sh/
> > [2] https://privacy.apache.org/policies/privacy-policy-public.html
> > [3] https://privacy.apache.org/faq/committers.html
> > [4] https://github.com/apache/superset/issues/25639
> > [5] https://github.com/search?q=repo%3Aapache%2Fdolphinscheduler%20scarf.sh&type=code
> > [6] https://about.scarf.sh/scarf-gateway
> > [7] https://about.scarf.sh/privacy-policy
> > [8] https://superset.apache.org/docs/frequently-asked-questions/#does-superset-collect-any-telemetry-data
> > [9] https://superset.apache.org/docs/installation/installing-superset-using-docker-compose
> > [10] https://docs.scarf.sh/package-analytics/#as-a-user-of-a-package-using-scarf-js-how-can-i-opt-out-of-analytics
> >
>
