Could we get metrics on downloads from the Apache archives? I'm not sure how that is all set up, but it might be a low-cost way to get some numbers.
On Wed, Nov 14, 2018, 12:12 PM Andrew Purtell <apurt...@apache.org> wrote:

> While it seems you are proposing some kind of autonomous, ongoing usage
> metrics collection, please note I ran an anonymous version usage survey
> via SurveyMonkey for 1.x last year. It was opt-in and there were no PII
> concerns by its nature. All of the issues around data collection,
> storage, and processing were also handled (by SurveyMonkey).
> Unfortunately I recently cancelled my account.
>
> For occasional surveys something like that might work. Otherwise there
> are a ton of questions: How do we generate the data? How do we get
> per-site opt-in permission? How do we collect the data? Store it?
> Process it? Audit it? It seems like more trouble than it's worth, and it
> requires ongoing volunteer hosting and effort to maintain.
>
> On Wed, Nov 14, 2018 at 11:47 AM Misty Linville <mi...@apache.org> wrote:
>
> > When discussing the 2.0.x branch in another thread, it came up that we
> > don’t have a good way to understand the version skew of HBase across
> > the user base. Metrics gathering can be tricky. You don’t want to
> > capture personally identifiable information (PII), and you need to be
> > transparent about what you gather, for what purpose, how long the data
> > will be retained, etc. The data can also be sensitive, for instance if
> > a large number of installations are running a version with a CVE or
> > known vulnerability against it. If you gather metrics, it really needs
> > to be opt-out rather than opt-in so that you actually get a reasonable
> > amount of data. You also need to stand up some kind of
> > metrics-gathering service and run it somewhere, plus some kind of
> > reporting/visualization tooling. The flip side of all these
> > difficulties is a more intelligent way to decide when to retire a
> > branch, when to communicate more broadly/loudly asking people in a
> > certain version stream to upgrade, and where to concentrate our
> > efforts.
> >
> > I’m not sticking my hand up to implement such a monster. I only wanted
> > to open a discussion and see what y’all think. It seems to me that a
> > few must-haves are:
> >
> > - Transparency: Release notes, logging about the status of
> >   metrics-gathering (on or off) at master or RS start-up, and logging
> >   about exactly when and what metrics are sent.
> > - Low frequency: Would we really need to wake up and send metrics more
> >   often than weekly?
> > - Conservative approach: Only collect what we can find useful today;
> >   don’t collect the world.
> > - Minimize PII: This probably means not trying to group together
> >   time-series results for a given server or cluster at all, though
> >   that could make the data look like there are a lot more clusters
> >   running in the world than there really are.
> > - Who has access to the data? Do we make it public or limit access to
> >   the PMC? Making it public would bolster our discipline about
> >   transparency and minimizing PII.
> >
> > I’m sure I’m missing a ton, so I leave the discussion to y’all.
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>    - A23, Crosstalk
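To make the must-haves concrete, here is a minimal sketch of what such a report could look like. This is purely illustrative: the class name, the `hbase.usage.report.disabled` property, and the payload shape are all invented for this example; nothing like this exists in HBase today. The key ideas from the list above are an opt-out flag whose status is logged at start-up, a payload that carries only the version string, and a report ID that is regenerated on every send so reports cannot be correlated into a per-cluster time series.

```java
import java.util.UUID;

/**
 * Hypothetical sketch of an opt-out, PII-minimizing usage report.
 * All names here (class, system property, payload fields) are
 * invented for illustration.
 */
public class UsageReportSketch {

    /**
     * Conservative payload: the version string only, plus a UUID that
     * is freshly generated per report, so successive reports from the
     * same cluster cannot be linked together.
     */
    static String buildReport(String hbaseVersion) {
        String reportId = UUID.randomUUID().toString();
        return "{\"reportId\":\"" + reportId + "\","
             + "\"version\":\"" + hbaseVersion + "\"}";
    }

    public static void main(String[] args) {
        // Opt-out rather than opt-in; hypothetical property name.
        boolean optedOut = Boolean.getBoolean("hbase.usage.report.disabled");

        // Transparency: log the gathering status at start-up...
        System.out.println("usage reporting: " + (optedOut ? "off" : "on"));

        // ...and log exactly what would be sent, before sending it.
        if (!optedOut) {
            System.out.println("weekly payload: " + buildReport("2.0.2"));
        }
    }
}
```

Because the report ID is not stable, two reports with identical versions still differ, which is exactly the "minimize PII" trade-off Misty notes: the dataset overcounts clusters but cannot track any one of them.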