Re: [DISCUSS] Gathering metrics on HBase versions in use

2018-11-15 Thread Tamas Penzes
Hi All,

Just to add my 2 cents.
I have to agree with the people before me that it's nearly impossible to get
runtime data from HBase clusters, so we have to look for workarounds.
Collecting download data and aggregated visitor information for wiki pages or
reference guides is also useful, but still doesn't provide enough information.

In my last job we faced the same issue and ended up with a solution embedded
in the application, which gave sysadmins a way to export the needed data for
us in a processable format (JSON).
So I would suggest adding functionality to HBase that collects *aggregated
and anonymised* data about the HBase cluster setup (no DNS names, nothing the
users might want to keep secret): only version numbers, number of servers,
etc. (the exact content could be parameterized).
This functionality would be invoked *manually*, and the result could be
uploaded to a place provided by the HBase community for analysis.
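
A minimal sketch of what such a manually invoked collector could look like,
assuming the HBase 2.x client API (Admin#getClusterMetrics); the class name,
the report fields, and the JSON shape are illustrative only, not an existing
HBase feature:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterMetrics;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Hypothetical, manually invoked reporting tool: prints an aggregated,
// anonymised summary (version and live server count only, no hostnames).
public class ClusterReport {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      ClusterMetrics metrics = admin.getClusterMetrics();
      System.out.printf("{\"hbaseVersion\":\"%s\",\"liveRegionServers\":%d}%n",
          metrics.getHBaseVersion(),
          metrics.getLiveServerMetrics().size());
    }
  }
}

Printing to stdout keeps the verification step in the user's hands: the
sysadmin can read exactly what would be shared before uploading it anywhere.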

This approach still has a problem: it depends on manual interaction. But
since users can verify the data before sending it, they are more likely to
actually upload it.

I don't think we can collect data the way Apache Ambari or Cloudera Manager
do, but we could still get more insight this way.

Regards, Tamas

Re: [DISCUSS] Gathering metrics on HBase versions in use

2018-11-15 Thread Reid Chan
What about the following: collection still happens on the download page, but
there we can attach a link to a survey as Andrew did, describe the
community's intention, and emphasize that it is voluntary and anonymous.

Inspired by Peter's point that there are a lot of learning/test cases, I come
up with two questions:
profession: engineer/researcher/student/other
version: for engineers, ask which version(s) are in use in production; for
the rest, ask which version they are going to download.

It doesn't have to be a third-party survey link; it (e.g. a web application)
can just generate an e-mail and send it to the PMC's mailing list.
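
A tiny sketch of what the download page could generate for this, assuming the
two answers are sent as a plain-text mail body; the target address, subject,
and field names are assumptions on my part:

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds a pre-filled mailto: link for the two survey questions.
// Target address, subject, and field names are illustrative only.
public class SurveyLink {
  public static void main(String[] args) {
    String body = "profession: engineer\nversion(s): 1.4.8\n";
    String link = "mailto:dev@hbase.apache.org"
        + "?subject=" + encode("[SURVEY] HBase version in use")
        + "&body=" + encode(body);
    System.out.println(link);
  }

  // URLEncoder.encode(String, Charset) requires Java 10+.
  private static String encode(String s) {
    return URLEncoder.encode(s, StandardCharsets.UTF_8);
  }
}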


--

Best regards,
R.C

Re: [DISCUSS] Gathering metrics on HBase versions in use

2018-11-15 Thread Peter Somogyi
I like the idea to have some sort of metrics from the users.

I agree with Allan that in many cases the HBase cluster is on an internal
network, making data collection difficult or even impossible. It could give
us a skewed view if these generally bigger clusters do not appear in the
metrics while the numbers are full of stats from standalone test environments
that were started once and never again.

Collecting download information could give us a better picture, but in those
statistics the latest version might be overrepresented, and we won't know
which releases are currently used in the field.

What do you think about collecting page views of the Reference Guide tied to
specific releases? Someone searching the 1.4 Ref Guide is probably using
HBase 1.4 or in the process of setting it up.
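
A rough sketch of how such counting could work, assuming we can get the
hbase.apache.org HTTP access logs and that versioned Ref Guides are served
under paths like /1.4/book.html (both assumptions):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Counts Ref Guide page views per release line from a web access log.
public class RefGuideViews {
  public static void main(String[] args) throws IOException {
    Pattern versionedBook = Pattern.compile("GET /(\\d+\\.\\d+)/book\\.html");
    try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
      Map<String, Long> views = lines
          .map(versionedBook::matcher)
          .filter(Matcher::find)
          .collect(Collectors.groupingBy(m -> m.group(1), Collectors.counting()));
      views.forEach((version, count) -> System.out.println(version + "\t" + count));
    }
  }
}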

Thanks,
Peter


Re: [DISCUSS] Gathering metrics on HBase versions in use

2018-11-14 Thread Duo Zhang
+1 on collecting the download information.

And collecting data at startup is a bit dangerous, I'd say, both technically
and legally...

Maybe a possible way is to add a link on the Master status page, or some
ASCII art in the Master startup log, to guide people to our survey?



Re: [DISCUSS] Gathering metrics on HBase versions in use

2018-11-14 Thread Allan Yang
I also think gathering metrics about the downloads from Apache/archives is a
doable action. Most HBase clusters run on users' intranets with no public
access, so sending anonymous data from them may not be possible. And we also
need to find a way to obtain the users' authorization, I think...
Best Regards
Allan Yang



Re: [DISCUSS] Gathering metrics on HBase versions in use

2018-11-14 Thread Zach York
Can we have metrics around the downloads from Apache/archives? I'm not sure
how that is all set up, but it might be a low-cost way to get some metrics.
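
If the HTTP logs for the release artifacts are available, a first cut could
be as simple as counting tarball fetches per version; the log source and the
artifact naming pattern below are assumptions:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Counts downloads per HBase release from an HTTP access log, keying on
// artifact names like hbase-2.1.1-bin.tar.gz.
public class DownloadCounts {
  public static void main(String[] args) throws IOException {
    Pattern tarball = Pattern.compile("hbase-([\\d.]+)-bin\\.tar\\.gz");
    Map<String, Long> downloads = new TreeMap<>();
    try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
      lines.forEach(line -> {
        Matcher m = tarball.matcher(line);
        if (m.find()) {
          downloads.merge(m.group(1), 1L, Long::sum);
        }
      });
    }
    downloads.forEach((version, count) ->
        System.out.println(version + "\t" + count));
  }
}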



Re: [DISCUSS] Gathering metrics on HBase versions in use

2018-11-14 Thread Andrew Purtell
While it seems you are proposing some kind of autonomous ongoing usage
metrics collection, please note I ran an anonymous version usage survey via
surveymonkey for 1.x last year. It was opt-in and there were no PII
concerns by its nature. All of the issues around data collection, storage,
and processing were also handled (by surveymonkey). Unfortunately I
recently cancelled my account.

For occasional surveys something like that might work. Otherwise there are
a ton of questions: How do we generate the data? How do we get per-site
opt-in permission? How do we collect the data? Store it? Process it? Audit
it? Seems more trouble than it's worth and requires ongoing volunteer
hosting and effort to maintain.


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk


[DISCUSS] Gathering metrics on HBase versions in use

2018-11-14 Thread Misty Linville
When discussing the 2.0.x branch in another thread, it came up that we
don’t have a good way to understand the version skew of HBase across the
user base. Metrics gathering can be tricky. You don’t want to capture
personally identifiable information (PII) and you need to be transparent
about what you gather, for what purpose, how long the data will be
retained, etc. The data can also be sensitive, for instance if a large
number of installations are running a version with a CVE or known
vulnerability against it. If you gather metrics, it really needs to be
opt-out rather than opt-in so that you actually get a reasonable amount of
data. You also need to stand up some kind of metrics-gathering service and
run it somewhere, and some kind of reporting / visualization tooling. The
flip side of all these difficulties is a more intelligent way to decide
when to retire a branch or when to communicate more broadly / loudly asking
people in a certain version stream to upgrade, as well as where to
concentrate our efforts.

I’m not sticking my hand up to implement such a monster. I only wanted to
open a discussion and see what y’all think. It seems to me that a few
must-haves are:

- Transparency: Release notes, logging about the status of
metrics-gathering (on or off) at master or RS start-up, logging about
exactly when and what metrics are sent
- Low frequency: Would we really need to wake up and send metrics more
often than weekly?
- Conservative approach: Only collect what we can find useful today, don’t
collect the world.
- Minimize PII: This probably means not trying to group together
time-series results for a given server or cluster at all, but could make
the data look like there were a lot more clusters running in the world than
really are.
- Who has access to the data? Do we make it public or limit access to the
PMC? Making it public would bolster our discipline about transparency and
minimizing PII.

I’m sure I’m missing a ton so I leave the discussion to y’all.
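
To make the transparency and opt-out must-haves concrete, here is a
hypothetical sketch; the configuration key, class name, and log wording are
all invented for illustration, nothing like this exists in HBase today:

import org.apache.hadoop.conf.Configuration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical startup hook: states whether usage reporting is on, how to
// turn it off, how often it runs, and exactly which fields are sent.
public class UsageReporting {
  private static final Logger LOG = LoggerFactory.getLogger(UsageReporting.class);
  // Invented key; opt-out means the default is true.
  static final String ENABLED_KEY = "hbase.usage.reporting.enabled";

  static void logStatusAtStartup(Configuration conf) {
    boolean enabled = conf.getBoolean(ENABLED_KEY, true);
    LOG.info("Anonymous usage reporting is {}. Set {}=false to disable. "
            + "When enabled, only the HBase version and region server count "
            + "are reported, at most weekly.",
        enabled ? "ON" : "OFF", ENABLED_KEY);
  }
}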