When discussing the 2.0.x branch in another thread, it came up that we don’t have a good way to understand the version skew of HBase across the user base. Metrics gathering can be tricky. You don’t want to capture personally identifiable information (PII), and you need to be transparent about what you gather, for what purpose, how long the data will be retained, and so on. The data can also be sensitive, for instance if it shows a large number of installations running a version with a known CVE or other vulnerability. If you gather metrics, collection really needs to be opt-out rather than opt-in so that you actually get a reasonable amount of data. You also need to stand up some kind of metrics-gathering service, run it somewhere, and build some kind of reporting / visualization tooling. The flip side of all these difficulties is real benefit: a more intelligent way to decide when to retire a branch, when to communicate more broadly / loudly asking people in a certain version stream to upgrade, and where to concentrate our efforts.
I’m not sticking my hand up to implement such a monster. I only wanted to open a discussion and see what y’all think. It seems to me that a few must-haves are:

- Transparency: Release notes, logging the status of metrics gathering (on or off) at master or RegionServer start-up, and logging exactly when metrics are sent and what they contain.
- Low frequency: Would we really need to wake up and send metrics more often than weekly?
- Conservative approach: Only collect what we can find useful today; don’t collect the world.
- Minimize PII: This probably means not trying to group together time-series results for a given server or cluster at all, though that could make the data look like there are a lot more clusters running in the world than there really are.
- Access: Who has access to the data? Do we make it public or limit access to the PMC? Making it public would bolster our discipline about transparency and minimizing PII.

I’m sure I’m missing a ton, so I leave the discussion to y’all.
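To make the "conservative approach" and "minimize PII" points concrete, here is a minimal sketch of what a weekly payload might contain. Everything here is hypothetical: `PhoneHomePayload`, its field names, and the bucketing scheme are illustrative only, not an existing HBase API or a proposal for specific field names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a minimal, anonymous metrics payload.
// No hostnames, IPs, or cluster identifiers -- only coarse,
// non-identifying fields that answer "what versions are out there?"
public class PhoneHomePayload {

    static Map<String, String> build(String hbaseVersion, int regionServers) {
        Map<String, String> payload = new LinkedHashMap<>();
        payload.put("hbase.version", hbaseVersion);
        payload.put("java.version",
            System.getProperty("java.specification.version"));
        payload.put("os.name", System.getProperty("os.name"));
        // Bucket the cluster size rather than reporting an exact count,
        // to reduce the chance of fingerprinting a specific deployment.
        payload.put("cluster.size.bucket", bucket(regionServers));
        return payload;
    }

    // Coarse size buckets; exact counts could identify unusual clusters.
    static String bucket(int n) {
        if (n <= 10) return "1-10";
        if (n <= 100) return "11-100";
        if (n <= 1000) return "101-1000";
        return "1000+";
    }

    public static void main(String[] args) {
        // Example: a 42-node cluster running a hypothetical 2.0.x release.
        System.out.println(build("2.0.5", 42));
    }
}
```

Logging this exact map at send time, at INFO level, would also serve the transparency goal: operators can see precisely what left their cluster and when.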