Proposal for Telemetry via Scarf

Evan Rusackas Fri, 16 Jun 2023 13:51:25 -0700

Hi all,

I wanted to float a new proposal regarding gathering Superset telemetry. The 
Superset dev community is constantly asking a few key questions about the 
installations of Superset in the wild, and we’ve been looking for a 
low-cost/low-effort opportunity to gather this telemetry data. We seek to 
better address questions such as:


• Which versions of Superset are people installing/using now (i.e. whether 
people have largely made the jump from version X to version Y)
• How many people are running off of various SHAs from the repo, rather than 
official Apache releases
• How much fallout security issues may be causing in older installations of 
Superset
• Approximately how many people out there in the world are installing Superset, 
and other related metadata.

In order to address these and other questions, we’ve been looking at Scarf [1] 
as a potential solution. Scarf actually provides many ways to get these sorts of 
telemetry metrics, but in the case of Superset, we’re hoping to start with a 
couple of small moves to make the least objectionable low-impact implementation 
possible.

Namely, we’d like to do two things:

1) Add the `scarf-js` npm package as a dependency in our package.json files 
(e.g. Superset, and all of the superset-ui packages). When the package is 
downloaded, metadata is recorded to the Scarf dashboard.

2) Utilize the Scarf Gateway, which we can put in front of PyPI and in front of 
any Docker containers. While it’s possible for people to go around this gateway, 
we can probably configure and encourage most traffic to go through it.

While Scarf does not store any personally identifying information from SDK 
telemetry data, it does send various bits of IP-derived information as outlined 
here [2]. This data should be made as transparent as possible by granting 
dashboard access to the Superset PMC and any other relevant means of 
sharing/surfacing it that we encounter (Town Hall, Slack, etc). For what it’s 
worth, this is not the first Apache use of Scarf - Apache SkyWalking and 
DolphinScheduler have been using Scarf, and DevLake and APISIX are getting 
started here as well.

If all this proves positive, there are also additional inroads we can take 
toward further telemetry collection, including utilizing Scarf’s API to 
manually send telemetry information (at install, for example), as well as 
the possibility of adding “pixels” in docs/websites. But first, let’s tackle 
these two initial changes, and see how they pan out.

In an attempt to be abundantly transparent, the PR will also include:
• Documentation on the Superset website about the packages' and code's 
existence and purpose
• Instructions/documentation to opt out of it by either configuration or 
package removal [3]
• Means to view the collected data wherever possible (most likely for PMC 
members to start, since it can’t be wide-open public)
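
To illustrate the configuration-based opt-out, here is a minimal sketch. It 
assumes the `scarfSettings.enabled` key in package.json described in the 
scarf-js docs [3]; the helper name `disable_scarf` and the example package 
metadata are hypothetical and not part of Superset or Scarf.

```python
import copy
import json


def disable_scarf(package_json: dict) -> dict:
    """Return a copy of a parsed package.json with scarf-js analytics disabled."""
    updated = copy.deepcopy(package_json)
    # Assumption: scarf-js honors an opt-out flag at "scarfSettings.enabled".
    updated.setdefault("scarfSettings", {})["enabled"] = False
    return updated


# Hypothetical package metadata, used only for illustration.
example = {"name": "my-superset-build", "version": "0.0.0"}
print(json.dumps(disable_scarf(example), indent=2))
```

The same docs also describe an environment-variable opt-out 
(`SCARF_ANALYTICS=false`), which may be simpler for users who don't want to 
edit package.json.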

I’m officially seeking lazy consensus to install and test the Scarf npm module 
on `master` and begin implementing/trialing the Scarf gateway 
opportunistically, in order to start testing (and sharing) the telemetry Scarf 
gathers.

PRs for both the npm approach and gateway approach are already open for your 
consideration/reference [4, 5].

Thanks,

Evan Rusackas
Preset | preset.io
Apache Superset PMC

[1] https://about.scarf.sh/
[2] https://about.scarf.sh/package-sdks
[3] https://docs.scarf.sh/package-analytics/#as-a-user-of-a-package-using-scarf-js-how-can-i-opt-out-of-analytics
[4] https://github.com/apache/superset/pull/24433
[5] https://github.com/apache/superset/pull/24432
