Hi All,

Just to add my 2 cents.
I have to agree with the people before me that it's practically impossible
to get runtime data from HBase clusters, so we have to look for workarounds.
Collecting download statistics and aggregated visitor information for wiki
pages or reference guides is also useful, but it still doesn't provide
enough information.

In my last job we faced the same issue and ended up with a solution
embedded in the application, which gave sysadmins a way to export the
needed data for us in a processable format (JSON).
So I would suggest adding functionality to HBase which would collect
*aggregated and anonymised* data about the HBase cluster setup (no DNS
names, nothing the users might want to keep secret): only version numbers,
number of servers, etc. (the exact content could be parameterized).
This functionality could be invoked *manually*, and the result could be
uploaded to a place provided by the HBase community for analysis.
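
To make this concrete, here is a rough sketch of what such an export tool
could look like. This is only an illustration, not a patch: it assumes the
HBase 2.x client API (Admin#getClusterMetrics), and the class name and the
JSON field names are made up by me.

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterMetrics;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Hypothetical manually-invoked reporter: prints aggregated, anonymised
// cluster facts as JSON so the operator can review them before uploading.
public class UsageReport {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      ClusterMetrics metrics = admin.getClusterMetrics();
      Map<String, Object> report = new LinkedHashMap<>();
      // Only non-identifying aggregates: no host names, no cluster id.
      report.put("hbaseVersion", metrics.getHBaseVersion());
      report.put("regionServers", metrics.getLiveServerMetrics().size());
      report.put("regions", metrics.getRegionCount());
      System.out.println(toJson(report));
    }
  }

  // Tiny hand-rolled JSON writer to keep the sketch dependency-free.
  private static String toJson(Map<String, Object> map) {
    StringBuilder sb = new StringBuilder("{");
    String sep = "";
    for (Map.Entry<String, Object> e : map.entrySet()) {
      sb.append(sep).append('"').append(e.getKey()).append("\":");
      Object v = e.getValue();
      sb.append(v instanceof Number ? v.toString() : "\"" + v + "\"");
      sep = ",";
    }
    return sb.append('}').toString();
  }
}

Running it against a cluster would print something like
{"hbaseVersion":"2.1.1","regionServers":5,"regions":120}, which the user
can eyeball and then upload if they are comfortable with it.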

This approach still has a problem: it depends on manual interaction. But
since users can verify the data before sending it, that should make the
uploads themselves more likely.

I don't think we can collect data the way Apache Ambari or Cloudera
Manager do, but we could still get more insight this way.

Regards, Tamaas

On Thu, Nov 15, 2018 at 10:40 AM Reid Chan <reidddc...@outlook.com> wrote:

> What about the following: collection is still on the download page, but
> there we can attach a link to a survey as Andrew did, and describe the
> community's intention and emphasize voluntariness and anonymity.
>
> Inspired by Peter's point that there are a lot of learning/test cases,
> here I come up with two questions:
>     profession: engineer/researcher/student/other
>     version: for engineers, ask which version(s) are in use in
> production; for the rest, ask which version they are going to download.
>
> It doesn't have to be a third-party survey link; it (e.g. a web
> application) can just generate an e-mail and send it to the PMC's mailing
> list.
>
>
> --------------------------
>
> Best regards,
> R.C
>
>
>
> ________________________________________
> From: Peter Somogyi <psomo...@apache.org>
> Sent: 15 November 2018 16:48
> To: dev@hbase.apache.org
> Subject: Re: [DISCUSS] Gathering metrics on HBase versions in use
>
> I like the idea of having some sort of metrics from the users.
>
> I agree with Allan that in many cases the HBase cluster is on an internal
> network, making data collection difficult or not even possible. It could
> lead us to an incorrect view if these generally bigger clusters do not
> appear in the metrics, which would instead be full of stats from
> standalone test environments that were started once and never again.
>
> Collecting download information could give us a better picture, but in
> these statistics the latest version might be overrepresented, and we
> won't know which releases are currently used in the field.
>
> What do you think about collecting page views of the Reference Guide tied
> to specific releases? Someone searching in the 1.4 Ref Guide is probably
> using HBase 1.4 or in the process of setting it up.
>
> Thanks,
> Peter
>
> On Thu, Nov 15, 2018 at 4:56 AM 张铎(Duo Zhang) <palomino...@gmail.com>
> wrote:
>
> > +1 on collecting the download information.
> >
> > And collecting data at startup is a bit dangerous, I'd say, both
> > technically and legally...
> >
> > Maybe a possible way is to add a link on the master status page, or
> > some ASCII art in the master startup log, to guide people to our survey?
> >
> > On Thu, Nov 15, 2018 at 11:23 AM Allan Yang <allan...@apache.org> wrote:
> >
> > > I also think gathering metrics about the downloads from
> > > Apache/archives is a doable action. Most HBase clusters are running
> > > on users' intranets with no public access, so sending anonymous data
> > > from them may not be possible. And we also need to find a way to
> > > obtain their authorization, I think...
> > > Best Regards
> > > Allan Yang
> > >
> > > On Thu, Nov 15, 2018 at 5:35 AM Zach York <zyork.contribut...@gmail.com> wrote:
> > >
> > > > Can we have metrics around the downloads from Apache/archives? I'm
> > > > not sure how that is all set up, but it might be a low-cost way to
> > > > get some metrics.
> > > >
> > > > On Wed, Nov 14, 2018, 12:12 PM Andrew Purtell <apurt...@apache.org> wrote:
> > > >
> > > > > While it seems you are proposing some kind of autonomous ongoing
> > > > > usage metrics collection, please note I ran an anonymous version
> > > > > usage survey via surveymonkey for 1.x last year. It was opt-in and
> > > > > there were no PII concerns by its nature. All of the issues around
> > > > > data collection, storage, and processing were also handled (by
> > > > > surveymonkey). Unfortunately I recently cancelled my account.
> > > > >
> > > > > For occasional surveys something like that might work. Otherwise
> > > > > there are a ton of questions: How do we generate the data? How do
> > > > > we get per-site opt-in permission? How do we collect the data?
> > > > > Store it? Process it? Audit it? Seems more trouble than it's worth
> > > > > and requires ongoing volunteer hosting and effort to maintain.
> > > > >
> > > > >
> > > > > On Wed, Nov 14, 2018 at 11:47 AM Misty Linville <mi...@apache.org> wrote:
> > > > >
> > > > > > When discussing the 2.0.x branch in another thread, it came up
> > > > > > that we don’t have a good way to understand the version skew of
> > > > > > HBase across the user base. Metrics gathering can be tricky. You
> > > > > > don’t want to capture personally identifiable information (PII)
> > > > > > and you need to be transparent about what you gather, for what
> > > > > > purpose, how long the data will be retained, etc. The data can
> > > > > > also be sensitive, for instance if a large number of
> > > > > > installations are running a version with a CVE or known
> > > > > > vulnerability against it. If you gather metrics, it really needs
> > > > > > to be opt-out rather than opt-in so that you actually get a
> > > > > > reasonable amount of data. You also need to stand up some kind
> > > > > > of metrics-gathering service and run it somewhere, and some kind
> > > > > > of reporting / visualization tooling. The flip side of all these
> > > > > > difficulties is a more intelligent way to decide when to retire
> > > > > > a branch or when to communicate more broadly / loudly asking
> > > > > > people in a certain version stream to upgrade, as well as where
> > > > > > to concentrate our efforts.
> > > > > >
> > > > > > I’m not sticking my hand up to implement such a monster. I only
> > > > > > wanted to open a discussion and see what y’all think. It seems
> > > > > > to me that a few must-haves are:
> > > > > >
> > > > > > - Transparency: Release notes, logging about the status of
> > > > > > metrics-gathering (on or off) at master or RS start-up, logging
> > > > > > about exactly when and what metrics are sent
> > > > > > - Low frequency: Would we really need to wake up and send
> > > > > > metrics more often than weekly?
> > > > > > - Conservative approach: Only collect what we can find useful
> > > > > > today, don’t collect the world.
> > > > > > - Minimize PII: This probably means not trying to group together
> > > > > > time-series results for a given server or cluster at all, but
> > > > > > could make the data look like there were a lot more clusters
> > > > > > running in the world than really are.
> > > > > > - Who has access to the data? Do we make it public or limit
> > > > > > access to the PMC? Making it public would bolster our discipline
> > > > > > about transparency and minimizing PII.
> > > > > >
> > > > > > I’m sure I’m missing a ton so I leave the discussion to y’all.
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Andrew
> > > > >
> > > > > Words like orphans lost among the crosstalk, meaning torn from
> > truth's
> > > > > decrepit hands
> > > > >    - A23, Crosstalk
> > > > >
> > > >
> > >
> >
>
