Hi all, I wanted to float a new proposal regarding gathering Superset telemetry. The Superset dev community is constantly asking a few key questions about the installations of Superset in the wild, and we’ve been looking for a low-cost/low-effort opportunity to gather this telemetry data. We seek to better address questions such as:
• Which versions of Superset are people installing/using now (i.e., whether people have largely made the jump from version X to version Y)
• How many people are running off of various SHAs from the repo, rather than official Apache releases
• What the potential fallout might be from security issues that may or may not be occurring in older installations of Superset
• Approximately how many people out there in the world are installing Superset, and other related metadata

In order to address these and other questions, we’ve been looking at Scarf [1] as a potential solution. Scarf actually provides many ways to get these sorts of telemetry metrics, but in the case of Superset, we’re hoping to start with a couple of small moves to make the least objectionable, low-impact implementation possible. Namely, we’d like to do two things:

1) Install the `scarf-js` npm package as a dependency in our package.json files (e.g. Superset, and all of the superset-ui packages). When the package is downloaded, metadata is recorded to the Scarf dashboard.
2) Utilize the Scarf Gateway, which we can use in front of PyPI and in front of any Docker containers. While it’s possible for people to go around this gateway, we can probably configure and encourage most traffic to go through it.

While Scarf does not store any personally identifying information from SDK telemetry data, it does send various bits of IP-derived information as outlined here [2]. This data should be made as transparent as possible by granting dashboard access to the Superset PMC and through any other relevant means of sharing/surfacing it that we encounter (Town Hall, Slack, etc.).

For what it’s worth, this is not the first Apache use of Scarf: Apache SkyWalking and DolphinScheduler have been using Scarf, and DevLake and APISIX are getting started here as well.

If all this proves positive, there are also additional inroads we can take toward further telemetry collection, including utilizing Scarf’s API to manually send telemetry information (at install, for example), as well as the possibility of adding “pixels” to docs/websites. But first, let’s tackle those first updates and see how it pans out.

In an attempt to be abundantly transparent, the PR will also include:

• Documentation on the Superset website about the packages and code's existence and purpose
• Instructions/documentation to opt out of it by either configuration or package removal [3]
• Means to view the collected data wherever possible (most likely for PMC members to start, since it can’t be wide-open public)

I’m officially seeking lazy consensus to install and test the Scarf npm module on `master` and begin implementing/trialing the Scarf gateway opportunistically, in order to start testing (and sharing) the telemetry Scarf gathers. PRs for both the npm approach and gateway approach are already open for your consideration/reference [4, 5].

Thanks,

Evan Rusackas
Preset | preset.io
Apache Superset PMC

[1] https://about.scarf.sh/
[2] https://about.scarf.sh/package-sdks
[3] https://docs.scarf.sh/package-analytics/#as-a-user-of-a-package-using-scarf-js-how-can-i-opt-out-of-analytics
[4] https://github.com/apache/superset/pull/24433
[5] https://github.com/apache/superset/pull/24432
