Could we get metrics on downloads from the Apache archives? I'm not sure how that is all set up, but it might be a low-cost way to get some numbers.
On Wed, Nov 14, 2018, 12:12 PM Andrew Purtell <apurt...@apache.org> wrote:

> While it seems you are proposing some kind of autonomous, ongoing usage
> metrics collection, please note I ran an anonymous version usage survey
> via SurveyMonkey for 1.x last year. It was opt-in and there were no PII
> concerns by its nature. All of the issues around data collection,
> storage, and processing were also handled (by SurveyMonkey).
> Unfortunately I recently cancelled my account.
>
> For occasional surveys something like that might work. Otherwise there
> are a ton of questions: How do we generate the data? How do we get
> per-site opt-in permission? How do we collect the data? Store it?
> Process it? Audit it? It seems like more trouble than it's worth, and it
> requires ongoing volunteer hosting and effort to maintain.
>
> On Wed, Nov 14, 2018 at 11:47 AM Misty Linville <mi...@apache.org> wrote:
>
> > When discussing the 2.0.x branch in another thread, it came up that we
> > don’t have a good way to understand the version skew of HBase across
> > the user base. Metrics gathering can be tricky. You don’t want to
> > capture personally identifiable information (PII), and you need to be
> > transparent about what you gather, for what purpose, how long the data
> > will be retained, etc. The data can also be sensitive, for instance if
> > a large number of installations are running a version with a CVE or
> > known vulnerability against it. If you gather metrics, it really needs
> > to be opt-out rather than opt-in so that you actually get a reasonable
> > amount of data. You also need to stand up some kind of
> > metrics-gathering service and run it somewhere, plus some kind of
> > reporting/visualization tooling. The flip side of all these
> > difficulties is a more intelligent way to decide when to retire a
> > branch, when to communicate more broadly/loudly asking people in a
> > certain version stream to upgrade, and where to concentrate our
> > efforts.
> >
> > I’m not sticking my hand up to implement such a monster. I only wanted
> > to open a discussion and see what y’all think. It seems to me that a
> > few must-haves are:
> >
> > - Transparency: Release notes, logging about the status of
> >   metrics-gathering (on or off) at master or RS start-up, and logging
> >   about exactly when and what metrics are sent.
> > - Low frequency: Would we really need to wake up and send metrics more
> >   often than weekly?
> > - Conservative approach: Only collect what we can find useful today;
> >   don’t collect the world.
> > - Minimize PII: This probably means not trying to group together
> >   time-series results for a given server or cluster at all, though
> >   that could make the data look like there are a lot more clusters
> >   running in the world than there really are.
> > - Who has access to the data? Do we make it public or limit access to
> >   the PMC? Making it public would bolster our discipline about
> >   transparency and minimizing PII.
> >
> > I’m sure I’m missing a ton, so I leave the discussion to y’all.
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>    - A23, Crosstalk
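To make the must-haves concrete, here is a minimal sketch of what such a report could look like. This is purely illustrative: the class name, the `hbase.usage.report.disabled` property, and the payload shape are all invented for this example; nothing like this exists in HBase today. The key ideas from the list above are an opt-out flag whose status is logged at start-up, a payload that carries only the version string, and a report ID that is regenerated on every send so reports cannot be correlated into a per-cluster time series.

```java
import java.util.UUID;

/**
 * Hypothetical sketch of an opt-out, PII-minimizing usage report.
 * All names here (class, system property, payload fields) are
 * invented for illustration.
 */
public class UsageReportSketch {

    /**
     * Conservative payload: the version string only, plus a UUID that
     * is freshly generated per report, so successive reports from the
     * same cluster cannot be linked together.
     */
    static String buildReport(String hbaseVersion) {
        String reportId = UUID.randomUUID().toString();
        return "{\"reportId\":\"" + reportId + "\","
             + "\"version\":\"" + hbaseVersion + "\"}";
    }

    public static void main(String[] args) {
        // Opt-out rather than opt-in; hypothetical property name.
        boolean optedOut = Boolean.getBoolean("hbase.usage.report.disabled");

        // Transparency: log the gathering status at start-up...
        System.out.println("usage reporting: " + (optedOut ? "off" : "on"));

        // ...and log exactly what would be sent, before sending it.
        if (!optedOut) {
            System.out.println("weekly payload: " + buildReport("2.0.2"));
        }
    }
}
```

Because the report ID is not stable, two reports with identical versions still differ, which is exactly the "minimize PII" trade-off Misty notes: the dataset overcounts clusters but cannot track any one of them.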