While it seems you are proposing some kind of autonomous, ongoing
usage-metrics collection, note that I ran an anonymous version-usage survey
via SurveyMonkey for 1.x last year. It was opt-in, and by its nature there
were no PII concerns. All of the issues around data collection, storage,
and processing were also handled (by SurveyMonkey). Unfortunately, I
recently cancelled my account.

For occasional surveys, something like that might work. Otherwise there are
a ton of questions: How do we generate the data? How do we get per-site
opt-in permission? How do we collect the data? Store it? Process it? Audit
it? It seems like more trouble than it's worth, and it would require
ongoing volunteer hosting and effort to maintain.


On Wed, Nov 14, 2018 at 11:47 AM Misty Linville <mi...@apache.org> wrote:

> When discussing the 2.0.x branch in another thread, it came up that we
> don’t have a good way to understand the version skew of HBase across the
> user base. Metrics gathering can be tricky. You don’t want to capture
> personally identifiable information (PII) and you need to be transparent
> about what you gather, for what purpose, how long the data will be
> retained, etc. The data can also be sensitive, for instance if a large
> number of installations are running a version with a known CVE or other
> vulnerability. If you gather metrics, it really needs to be
> opt-out rather than opt-in so that you actually get a reasonable amount of
> data. You also need to stand up some kind of metrics-gathering service and
> run it somewhere, and some kind of reporting / visualization tooling. The
> flip side of all these difficulties is a more intelligent way to decide
> when to retire a branch or when to communicate more broadly / loudly asking
> people in a certain version stream to upgrade, as well as where to
> concentrate our efforts.
>
> I’m not sticking my hand up to implement such a monster. I only wanted to
> open a discussion and see what y’all think. It seems to me that a few
> must-haves are:
>
> - Transparency: Release notes, logging about the status of
> metrics-gathering (on or off) at master or RS start-up, logging about
> exactly when and what metrics are sent
> - Low frequency: Would we really need to wake up and send metrics more
> often than weekly?
> - Conservative approach: Only collect what we can find useful today, don’t
> collect the world.
> - Minimize PII: This probably means not trying to group together
> time-series results for a given server or cluster at all, but could make
> the data look like there were a lot more clusters running in the world than
> really are.
> - Who has access to the data? Do we make it public or limit access to the
> PMC? Making it public would bolster our discipline about transparency and
> minimizing PII.
>
> I’m sure I’m missing a ton so I leave the discussion to y’all.
>
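The "minimize PII" trade-off Misty raises can be sketched in a few lines.
This is a hypothetical simulation, not HBase code (the field names and the
`simulate_reports` helper are made up for illustration): if each weekly
report carries a fresh random ID instead of a stable cluster identifier,
nothing links reports from the same cluster, so the raw data over-counts
how many clusters exist.

```python
import uuid

def simulate_reports(n_clusters, n_weeks):
    """Simulate weekly version reports where every report carries a fresh
    random ID rather than a stable cluster identifier (no PII, no grouping)."""
    reports = []
    for week in range(n_weeks):
        for cluster in range(n_clusters):
            # A fresh UUID per report: reports from the same cluster
            # are indistinguishable from reports from different clusters.
            reports.append({"id": uuid.uuid4().hex,
                            "version": "2.0.x",
                            "week": week})
    return reports

reports = simulate_reports(n_clusters=10, n_weeks=4)
apparent_clusters = len({r["id"] for r in reports})
print(apparent_clusters)  # 40 "clusters" observed, from only 10 real ones
```

So any reporting on top of such data would have to present installation
counts as report counts, not cluster counts, or accept the inflation.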


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk
