When discussing the 2.0.x branch in another thread, it came up that we
don’t have a good way to understand the version skew of HBase across the
user base. Metrics gathering can be tricky. You don’t want to capture
personally identifiable information (PII) and you need to be transparent
about what you gather, for what purpose, how long the data will be
retained, etc. The data can also be sensitive, for instance if a large
number of installations are running a version with a CVE or known
vulnerability against it. If you gather metrics, it really needs to be
opt-out rather than opt-in so that you actually get a reasonable amount of
data. You also need to stand up some kind of metrics-gathering service,
run it somewhere, and build some kind of reporting / visualization
tooling. The flip side of all these difficulties is being able to decide
more intelligently when to retire a branch, when to communicate more
broadly (and loudly) to ask people on a certain version stream to
upgrade, and where to concentrate our efforts.

I’m not sticking my hand up to implement such a monster. I only wanted to
open a discussion and see what y’all think. It seems to me that a few
must-haves are:

- Transparency: Release notes, logging about the status of
metrics-gathering (on or off) at master or RS start-up, logging about
exactly when and what metrics are sent
- Low frequency: Would we really need to wake up and send metrics more
often than weekly?
- Conservative approach: Only collect what we can find useful today, don’t
collect the world.
- Minimize PII: This probably means not trying to group together
time-series results for a given server or cluster at all, which could
make the data look like there are a lot more clusters running in the
world than there really are.
- Who has access to the data? Do we make it public or limit access to the
PMC? Making it public would bolster our discipline about transparency and
minimizing PII.
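
To make the must-haves a bit more concrete, here is a rough sketch of
what an opt-out reporter obeying them might look like. Everything here is
hypothetical: the configuration key, the class name, and the payload
shape are invented for illustration and do not exist in HBase today.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of an opt-out usage-metrics reporter. The conf key,
 * class name, and payload contents are illustrative only.
 */
public class UsageMetricsReporter {
    // Opt-out: reporting is on unless the operator explicitly disables it.
    static final String CONF_KEY = "hbase.usage.metrics.enabled";
    // Low frequency: weekly is plenty for version-skew data.
    static final long INTERVAL_MS = TimeUnit.DAYS.toMillis(7);

    private final boolean enabled;

    public UsageMetricsReporter(Map<String, String> conf) {
        this.enabled = Boolean.parseBoolean(conf.getOrDefault(CONF_KEY, "true"));
        // Transparency: state the status plainly at master/RS start-up.
        System.out.println("Usage metrics reporting is " + (enabled ? "ON" : "OFF")
            + " (weekly; set " + CONF_KEY + "=false to opt out)");
    }

    public boolean isEnabled() {
        return enabled;
    }

    /**
     * Conservative, PII-minimizing payload: the version string only.
     * No hostnames, IPs, cluster IDs, or anything that would let reports
     * from the same cluster be grouped into a time series.
     */
    public Map<String, String> buildPayload(String hbaseVersion) {
        Map<String, String> payload = new LinkedHashMap<>();
        payload.put("hbase.version", hbaseVersion);
        return payload;
    }

    public static void main(String[] args) {
        UsageMetricsReporter reporter =
            new UsageMetricsReporter(new LinkedHashMap<>());
        // Transparency: log exactly what would be sent before sending it.
        System.out.println("Would send: " + reporter.buildPayload("2.0.6"));
    }
}
```

The deliberate choice here is that the payload carries no stable
identifier at all, which is exactly the trade-off noted above: maximal
PII minimization at the cost of inflating the apparent cluster count.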

I’m sure I’m missing a ton so I leave the discussion to y’all.
