Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-12 Thread Junio C Hamano
Jeff King  writes:

> In my experience the maintenance burden is not in the "connect to a
> socket" part, but the fact that you have to sprinkle the entry points
> throughout the code (e.g., "set 'cloning' flag to 1" or "I'm entering
> the pack generation phase of the fetch"). So I'd love to see us do that
> sprinkling _once_ where everyone can benefit, whether they want better
> tracing for debugging, telemetry across their installed base, or
> whatever.

I tend to agree.  Transport (or file) can stay outside the core of
this "telemetry" thing---agreeing on what and when to trace, and how
the trace is represented, and having an API and solid guideline,
would allow us to annotate the code just once and get useful data in
a consistent way.

> The mechanism to handle those calls is much easier to plug in and out,
> then. And I don't have a huge preference for compile-time versus
> run-time, or whether bits are in-tree or out-of-tree. Obviously we'd
> have some basic "write to stderr or a file" consumer in-tree.

If it makes readers fearful of big brother feel safer, I think we
probably can add code that is runtime controllable that is compiled
in conditionally ;-)

I do not quite buy "Big brothers who want this for their own in
house use case can afford to patch" argument, by the way.  If
anything, having a solid tracing mechanism (to which existing
GIT_TRACE support _should_ be able to serve as a starting point)
would mean small budget guys do not have to do their own investing
to collect useful data over their customer base (I am seeing an
analog in "popcon", opt-in stats of installed package in a distro,
which we can view as "telemetry data for distro installer program").



Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-11 Thread Jeff King
On Mon, Jun 11, 2018 at 02:14:41AM -0400, Jeff King wrote:

> On Sun, Jun 10, 2018 at 12:00:57AM +, brian m. carlson wrote:
> 
> > I also have to look at this issue from the interests of what is best for
> > the FLOSS community and for users as a whole.  Adding in functionality
> > that sends off usage data from a command-line tool, especially one that
> > is as widely used as Git is, is not in the interests of users as a
> > whole, nor is it common practice in FLOSS tools.
> 
> I'm not sure if this last statement is true. For instance, many tools
> have crash-reporting functionality, and some contain more advanced
> telemetry. E.g.:
> 
>   https://wiki.mozilla.org/Telemetry
> 
> Personally I do not like this sort of data collection at all, and never
> enable it. And I would be highly suspicious of any FLOSS tool that
> shipped with it enabled by default. But it seems like at least some
> projects _do_ find it useful, and enough people enable it to make it
> worth carrying as a feature.

Re-reading what I wrote, I really want to emphasize something so that it
isn't missed. I think the important feature here is user consent, and
that seems to be the line that projects like Mozilla use (default to
off, and require explicit consent before sending anything off-host).

We don't have to use that line. I think "do not collect even locally" is
a reasonable line, because even collecting has performance and storage
management implications. And I don't think anybody is proposing to even
collect local telemetry without the user turning it on.

The line under discussion seems to be "implement code to collect
anything at all". That's a valid line, too, but IMHO it ends up blocking
useful features (and I think we already violate it with things like
GIT_TRACE!).

-Peff


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-11 Thread Jeff King
On Sun, Jun 10, 2018 at 12:00:57AM +, brian m. carlson wrote:

> I also have to look at this issue from the interests of what is best for
> the FLOSS community and for users as a whole.  Adding in functionality
> that sends off usage data from a command-line tool, especially one that
> is as widely used as Git is, is not in the interests of users as a
> whole, nor is it common practice in FLOSS tools.

I'm not sure if this last statement is true. For instance, many tools
have crash-reporting functionality, and some contain more advanced
telemetry. E.g.:

  https://wiki.mozilla.org/Telemetry

Personally I do not like this sort of data collection at all, and never
enable it. And I would be highly suspicious of any FLOSS tool that
shipped with it enabled by default. But it seems like at least some
projects _do_ find it useful, and enough people enable it to make it
worth carrying as a feature.

-Peff


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-11 Thread Jeff King
On Sun, Jun 10, 2018 at 12:44:25AM +0200, Johannes Sixt wrote:

> > I agree with Peff: this is something you as a user need to be aware of,
> > and need to make sure you configure your Git just like you want. As long
> > as this is a purely opt-in feature, it is useful and helpful.
> 
> The problem with this feature is not so much that it enables someone to do
> bad things, but that it is specifically targeted at recording *how users use
> Git*.

I think one issue here is that we are not looking at concrete patches.
So for instance, I've seen a claim that Git should never have a way to
turn on tracing all the time. But at GitHub we have found it useful to
have a config option to log the sha1 of every object that is dropped by
git-prune or by "repack -ad". It's helped both as a developer (tracking
down races or bugs in our code) and as an administrator (figuring out
where a corruption was introduced). It needs to be on all the time to be
useful, since the point is to have an audit trail to look at _after_ a
bad thing happens.

That's something we do completely on the server side; I don't think
there are any privacy or "spying" issues there. And I don't think it's a
huge maintenance burden. Inside the existing code, it's literally a
one-line "log this" (the log code itself is a hundred or so lines in its
own file).

Now most users probably don't care that much about this use case. And
I'm OK to apply it as a custom patch. But doesn't it seem like that's
something other people hosting Git repos might want? Or that the concept
might extend to other loggable items that _are_ interesting on the
client side?

That's why I think it is worth taking this step-by-step. Let's log more
things. Let's make enabling tracing more flexible. Those are hopefully
uncontentious and universally useful. If you want to draw the line on
"spying", then I think the right place to draw it is when somebody wants
to ship code to actually move those logs out of the user's control.

-Peff


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-10 Thread Jeff King
On Sat, Jun 09, 2018 at 10:05:49PM +0200, Johannes Schindelin wrote:

> > E.g., could we have a flag or environment variable to have the existing
> > traces output JSON? I guess right now they're inherently free-form via
> > trace_printf, so it would involve adding some structured interface
> > calls. Which is more or less what I guess JeffH's proposed feature to
> > look like.
> 
> I think that is a much larger project than what JeffHost proposed, and
> would unfortunately put too much of a brake on his project.

I definitely don't want to stall somebody else's momentum with a bunch
of what-if's. But I also don't want to end up down the road with two
nearly-identical systems for tracing information. That's confusing to
users, and to developers who must choose which system to use for any new
tracing information they add.

So I think it's worth at least giving a little thought to how we might
leverage similarities between the trace system and this. Even if we
don't implement it now, it would be nice to have a vague sense of how
they could grow together in the long run.

-Peff


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread brian m. carlson
On Sat, Jun 09, 2018 at 08:56:00AM +0200, Johannes Sixt wrote:
> Am 09.06.2018 um 00:20 schrieb Ævar Arnfjörð Bjarmason:
> > 
> > On Fri, Jun 08 2018, Johannes Sixt wrote:
> > Can you elaborate on how someone who can maintain inject malicious code
> > into your git package + config would be thwarted by this being some
> > compile-time option, wouldn't they just compile it in?
> 
> Of course they can. But would we, the Git community do that?
> 
> From the design document:
> 
> > The goal of the telemetry feature is to be able to gather usage data
> > across a group of production users to identify real-world performance
> > problems in production.  Additionally, it might help identify common
> > user errors and guide future user training.
> 
> The goal to gather usage data may be valid for a small subset of Git
> installations. But it is wrong to put this into the software itself, in
> particular when the implementations includes scary things like loading
> unspecified dynamic libraries:
> 
> > If the config setting "telemetry.plugin" contains the pathname to a
> > shared library, the library will be dynamically loaded during start up
> > and events will be sent to it using the plugin API.
> 
> When you want usage data, ask your users for feedback. Look over their
> shoulders. But do not ask the software itself to gather usage data. It will
> be abused.
> 
> Do not offer open source software that has a "call-home" method built-in.
> 
> If you want to peek into the workplaces of YOUR users, then monkey-patch
> survaillance into YOUR version of Git. But please do not burden the rest of
> us.

I understand there's an interest in supporting the most people with the
fewest amount of staff.  I'm certainly in the situation where I, with
only minimal assistance, support every Git user in my division of the
company, regardless of technical ability, and I know how overwhelming
that can be.  (Burnout, I can tell you, is a thing.)

I also have to look at this issue from the interests of what is best for
the FLOSS community and for users as a whole.  Adding in functionality
that sends off usage data from a command-line tool, especially one that
is as widely used as Git is, is not in the interests of users as a
whole, nor is it common practice in FLOSS tools.

As a highly capable and technical user, I would find it very undesirable
to have my development tools reporting data like this, even if it is to
make my experience better.

The ability to load arbitrary libraries makes me concerned about people
using this to spirit away personal or company data or to subtly steal
data in a rootkit-like situation.  These are real threats in the kinds
of environments I distribute to in my work role.

I agree with Duy's point of view that GIT_TRACE-level output to a file
descriptor or file is fine, but a persistently enabled feature is not.

I expect this feature, if implemented, would be patched out of Debian's
Git, and it would be patched out of any Git I would distribute in my
work role for legal and ethical reasons.

As developers, we have a duty to be mindful of how our software can be
misused and abused and try to avoid that when possible.  I don't think
this feature is on the right side of that balance.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204


signature.asc
Description: PGP signature


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Johannes Sixt

Am 09.06.2018 um 22:43 schrieb Johannes Schindelin:

On Sat, 9 Jun 2018, Johannes Sixt wrote:

When you want usage data, ask your users for feedback. Look over their
shoulders. But do not ask the software itself to gather usage data. It will be
abused.

Do not offer open source software that has a "call-home" method built-in.

If you want to peek into the workplaces of YOUR users, then monkey-patch
survaillance into YOUR version of Git. But please do not burden the rest of
us.


We already offer hooks. You can do anything with those hooks. Even, if you
do not pay close attention, to transfer all your bitcoin to a specific
account.

I agree with Peff: this is something you as a user need to be aware of,
and need to make sure you configure your Git just like you want. As long
as this is a purely opt-in feature, it is useful and helpful.


The problem with this feature is not so much that it enables someone to 
do bad things, but that it is specifically targeted at recording *how 
users use Git*.



We do need it in-house, for the many thousands of Git users we try to
support with a relatively small team of Git developers.


Then please build your private version and sell it to your thousands of 
Git users. "in-house" == Microsoft? Small team of Git developers? So, 
instead of shelling out the bucks for this extra burden, they put the 
burden on the Community?


-- Hannes


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Johannes Schindelin
Hi Hannes,

On Sat, 9 Jun 2018, Johannes Sixt wrote:

> Am 09.06.2018 um 00:20 schrieb Ævar Arnfjörð Bjarmason:
> > 
> > On Fri, Jun 08 2018, Johannes Sixt wrote:
> > 
> > > Am 08.06.2018 um 18:00 schrieb Thomas Braun:
> > > > I for my part would much rather prefer that to be a compile time
> > > > option so that I don't need to check on every git update on windows
> > > > if  this is now enabled or not.
> > >
> > > This exactly my concern, too! A compile-time option may make it a good
> > > deal less worrisome.
> > 
> > Can you elaborate on how someone who can maintain inject malicious code
> > into your git package + config would be thwarted by this being some
> > compile-time option, wouldn't they just compile it in?
> 
> Of course they can. But would we, the Git community do that?
> 
> From the design document:
> 
> > The goal of the telemetry feature is to be able to gather usage data
> > across a group of production users to identify real-world performance
> > problems in production.  Additionally, it might help identify common
> > user errors and guide future user training.
> 
> The goal to gather usage data may be valid for a small subset of Git
> installations. But it is wrong to put this into the software itself, in
> particular when the implementations includes scary things like loading
> unspecified dynamic libraries:
> 
> > If the config setting "telemetry.plugin" contains the pathname to a
> > shared library, the library will be dynamically loaded during start up
> > and events will be sent to it using the plugin API.
> 
> When you want usage data, ask your users for feedback. Look over their
> shoulders. But do not ask the software itself to gather usage data. It will be
> abused.
> 
> Do not offer open source software that has a "call-home" method built-in.
> 
> If you want to peek into the workplaces of YOUR users, then monkey-patch
> survaillance into YOUR version of Git. But please do not burden the rest of
> us.

We already offer hooks. You can do anything with those hooks. Even, if you
do not pay close attention, to transfer all your bitcoin to a specific
account.

I agree with Peff: this is something you as a user need to be aware of,
and need to make sure you configure your Git just like you want. As long
as this is a purely opt-in feature, it is useful and helpful.

We do need it in-house, for the many thousands of Git users we try to
support with a relatively small team of Git developers.

Ciao,
Dscho

Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Johannes Schindelin
Hi Peff,


On Sat, 9 Jun 2018, Jeff King wrote:

> On Sat, Jun 09, 2018 at 08:31:58AM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> > 1) I really don't see the basis for this argument that they'd need to
> >patch it out, they're not patching out e.g. GIT_TRACE now, which has
> >all the same sort of concerns, it's just a format that's more of a
> >hassle to parse than this proposed format.
> >
> > 2) I think you and Johannes are just seeing the "telemetry" part of
> >this, but if you look past that all this *really* is is "GIT_TRACE
> >facility that doesn't suck to parse".
> 
> So that actually takes me to another question, which is: instead of
> introducing a new "telemetry" feature, could we make GIT_TRACE suck less
> to parse?

That's a different question, and even a different concern.

Please understand that this feature is needed for our own in-house use, to
be able to help our users pro-actively. We really want to be able to
contact a user when they struggle with the performance of any given Git
command, before they have to think of reaching out for assistance.

We discussed this plan at the contributor summit at GitMerge, and IIRC at
least Autodesk and DropBox were interested (but of course, the changes of
both Lars' and Alex' roles might make them less interested now), hence
JeffHost wanted to contribute this, for the benefit of the larger Git
community.

> E.g., could we have a flag or environment variable to have the existing
> traces output JSON? I guess right now they're inherently free-form via
> trace_printf, so it would involve adding some structured interface
> calls. Which is more or less what I guess JeffH's proposed feature to
> look like.

I think that is a much larger project than what JeffHost proposed, and
would unfortunately put too much of a brake on his project.

Ciao,
Dscho

Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Duy Nguyen
On Sat, Jun 9, 2018 at 8:32 AM Ævar Arnfjörð Bjarmason  wrote:
> Let's say you're in a corporate environment with Linux, OSX and Windows
> boxes, but all of whom have some shared mounts provisioned & ability to
> ship an /etc/gitconfig (wherever that lives on Windows).
>
> It's much easier to just do that than figure out how to build a custom
> Git on all three platforms.

Let's say I don't care about making it easier on corporate
environment. You have resources to do that.

> > Not making it a compile-time option could force [1] linux distro
> > to carry this function to everybody even if they don't use it (and
> > it's kinda dangerous to misuse if you don't anonymize the data
> > properly). I also prefer this a compile time option.
>
> Setting GIT_TRACE to a filename that you published is also similarly
> dangerous, so would setting up a trivial 4-line shell alias to wrap
> "git" and log what it's doing.
>
> > [1] Of course many distros can choose to patch it out. But it's the
> > same argument as bringing this option in in the first place: you guys
> > already have that code in private and now want to put it in stock git
> > to reduce maintenance cost, why add extra cost on linux distro
> > maintenance?
>
> Because:
>
> 1) I really don't see the basis for this argument that they'd need to
>patch it out, they're not patching out e.g. GIT_TRACE now, which has
>all the same sort of concerns, it's just a format that's more of a
>hassle to parse than this proposed format.
>
> 2) I think you and Johannes are just seeing the "telemetry" part of
>this, but if you look past that all this *really* is is "GIT_TRACE
>facility that doesn't suck to parse".

If it's simply an improvement over GIT_TRACE, then I have no problem
with that. That is:

- there is no new, permanent way to turn this one like config keys

- there's no plugin support or whatever for keeping output. Keep it to
either a file or a file descriptor like we do with GIT_TRACE.

>There's a lot of use-cases for that which have nothing to do with
>what this facility is originally written for, for example, a novice
>git user could turn it on and have it log in ~ somewhere, and then
>run some contrib script which analyzes his git usage and spews out
>suggestions ("you use this command/option, but not this related
>useful command/option").
>
>Users asking for help on the #git IRC channel or on this mailing list
>could turn this on if they have a problem, and paste it into some
>tool they could show to others to see exactly what they're doing /
>where it went wrong.

And they can use GIT_TRACE now. If GIT_TRACE does not give enough info
(e.g. too few trace points), add them. This new proposal is more about
automation. With GIT_TRACE, which is more human friendly, I could skim
over the output and remove sensitive info before I send it out for
help. Machine-friendly format (even json) tends to be a lot more
complex and harder to read/filter this way.
-- 
Duy


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Jeff King
On Sat, Jun 09, 2018 at 09:04:16AM +0200, Johannes Sixt wrote:

> > AFAICT this telemetry data is the same thing, but for performance. When
> > somebody says "why does this command take so long", wouldn't it be nice
> > for us to be able to tell them to flip a switch that will collect data
> > on which operations are taking a long time?
> 
> Why do we need long-term survaillance to answer this question and why can it
> not be made a mode of GIT_TRACE?

I guess I don't see how this isn't simply a mode of GIT_TRACE.

-Peff


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Johannes Sixt

Am 09.06.2018 um 08:51 schrieb Jeff King:

I actually think this could be useful for normal users, too. We have
GIT_TRACE for dumping debug output, and we sometimes ask users to turn
it on to help diagnose a problem (not to mention that we use it
ourselves).


The BIG difference is in "we ask the users". Asking for a trace is a 
one-shot thing; this telemetry thing is obviously intended for long-term 
survaillance.



AFAICT this telemetry data is the same thing, but for performance. When
somebody says "why does this command take so long", wouldn't it be nice
for us to be able to tell them to flip a switch that will collect data
on which operations are taking a long time?


Why do we need long-term survaillance to answer this question and why 
can it not be made a mode of GIT_TRACE?


-- Hannes


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Jeff King
On Sat, Jun 09, 2018 at 08:31:58AM +0200, Ævar Arnfjörð Bjarmason wrote:

> 1) I really don't see the basis for this argument that they'd need to
>patch it out, they're not patching out e.g. GIT_TRACE now, which has
>all the same sort of concerns, it's just a format that's more of a
>hassle to parse than this proposed format.
> 
> 2) I think you and Johannes are just seeing the "telemetry" part of
>this, but if you look past that all this *really* is is "GIT_TRACE
>facility that doesn't suck to parse".

So that actually takes me to another question, which is: instead of
introducing a new "telemetry" feature, could we make GIT_TRACE suck less
to parse?

E.g., could we have a flag or environment variable to have the existing
traces output JSON? I guess right now they're inherently free-form via
trace_printf, so it would involve adding some structured interface
calls. Which is more or less what I guess JeffH's proposed feature to
look like.

-Peff


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Johannes Sixt

Am 09.06.2018 um 00:20 schrieb Ævar Arnfjörð Bjarmason:


On Fri, Jun 08 2018, Johannes Sixt wrote:


Am 08.06.2018 um 18:00 schrieb Thomas Braun:

I for my part would much rather prefer that to be a compile time
option so that I don't need to check on every git update on windows
if  this is now enabled or not.


This exactly my concern, too! A compile-time option may make it a good
deal less worrisome.


Can you elaborate on how someone who can maintain inject malicious code
into your git package + config would be thwarted by this being some
compile-time option, wouldn't they just compile it in?


Of course they can. But would we, the Git community do that?

From the design document:

> The goal of the telemetry feature is to be able to gather usage data
> across a group of production users to identify real-world performance
> problems in production.  Additionally, it might help identify common
> user errors and guide future user training.

The goal to gather usage data may be valid for a small subset of Git 
installations. But it is wrong to put this into the software itself, in 
particular when the implementations includes scary things like loading 
unspecified dynamic libraries:


> If the config setting "telemetry.plugin" contains the pathname to a
> shared library, the library will be dynamically loaded during start up
> and events will be sent to it using the plugin API.

When you want usage data, ask your users for feedback. Look over their 
shoulders. But do not ask the software itself to gather usage data. It 
will be abused.


Do not offer open source software that has a "call-home" method built-in.

If you want to peek into the workplaces of YOUR users, then monkey-patch 
survaillance into YOUR version of Git. But please do not burden the rest 
of us.


-- Hannes


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Jeff King
On Sat, Jun 09, 2018 at 07:03:53AM +0200, Duy Nguyen wrote:

> > > Am 08.06.2018 um 18:00 schrieb Thomas Braun:
> > >> I for my part would much rather prefer that to be a compile time
> > >> option so that I don't need to check on every git update on windows
> > >> if  this is now enabled or not.
> > >
> > > This exactly my concern, too! A compile-time option may make it a good
> > > deal less worrisome.
> >
> > Can you elaborate on how someone who can maintain inject malicious code
> > into your git package + config would be thwarted by this being some
> > compile-time option, wouldn't they just compile it in?
> 
> Look at this from a different angle. This is driven by the needs to
> collect telemetry in _controlled_ environment (mostly server side, I
> guess) and it should be no problem to make custom builds there for
> you. Not making it a compile-time option could force [1] linux distro
> to carry this function to everybody even if they don't use it (and
> it's kinda dangerous to misuse if you don't anonymize the data
> properly). I also prefer this a compile time option.

I actually think this could be useful for normal users, too. We have
GIT_TRACE for dumping debug output, and we sometimes ask users to turn
it on to help diagnose a problem (not to mention that we use it
ourselves).

AFAICT this telemetry data is the same thing, but for performance. When
somebody says "why does this command take so long", wouldn't it be nice
for us to be able to tell them to flip a switch that will collect data
on which operations are taking a long time?

> [1] Of course many distros can choose to patch it out. But it's the
> same argument as bringing this option in in the first place: you guys
> already have that code in private and now want to put it in stock git
> to reduce maintenance cost, why add extra cost on linux distro
> maintenance?

We don't do performance telemetry like this at GitHub, but we do collect
a few variables that we dump to a custom over a Unix socket (e.g., our
upload-pack records clone vs fetch and shallow vs non-shallow for each
request).

In my experience the maintenance burden is not in the "connect to a
socket" part, but the fact that you have to sprinkle the entry points
throughout the code (e.g., "set 'cloning' flag to 1" or "I'm entering
the pack generation phase of the fetch"). So I'd love to see us do that
sprinkling _once_ where everyone can benefit, whether they want better
tracing for debugging, telemetry across their installed base, or
whatever.

The mechanism to handle those calls is much easier to plug in and out,
then. And I don't have a huge preference for compile-time versus
run-time, or whether bits are in-tree or out-of-tree. Obviously we'd
have some basic "write to stderr or a file" consumer in-tree.

For myself, I'm happy with compile-time (I'm instrumenting the server
side, so I really only care about my specific build) and out-of-tree
(the protocol to our custom daemon consumer is not something anybody
else would care about anyway).

For people collecting from clients, I imagine it's more convenient to be
able to let them use official builds and just tweak a knob, even if they
_could_ build a custom Git and push it out to everybody. I don't know
anything about Windows event tracing, but if it's a standard facility
then I doubt we're talking about a huge maintenance burden to have it
in-tree as a configurable option.

So IMHO that really leaves the "is it too scary to have a config knob
that lets tracing go to this event facility versus to a file"? I guess I
just don't see it, as long as the user has to enable it explicitly. That
seems like complaining that GIT_TRACE could go to syslog. If you don't
want to store it, then don't turn on the feature.

-Peff


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Ævar Arnfjörð Bjarmason


On Sat, Jun 09 2018, Duy Nguyen wrote:

> On Sat, Jun 9, 2018 at 12:22 AM Ævar Arnfjörð Bjarmason
>  wrote:
>>
>>
>> On Fri, Jun 08 2018, Johannes Sixt wrote:
>>
>> > Am 08.06.2018 um 18:00 schrieb Thomas Braun:
>> >> I for my part would much rather prefer that to be a compile time
>> >> option so that I don't need to check on every git update on windows
>> >> if  this is now enabled or not.
>> >
>> > This exactly my concern, too! A compile-time option may make it a good
>> > deal less worrisome.
>>
>> Can you elaborate on how someone who can maintain inject malicious code
>> into your git package + config would be thwarted by this being some
>> compile-time option, wouldn't they just compile it in?
>
>
> Look at this from a different angle. This is driven by the needs to
> collect telemetry in _controlled_ environment (mostly server side, I
> guess) and it should be no problem to make custom builds there for
> you.

Let's say you're in a corporate environment with Linux, OSX and Windows
boxes, but all of whom have some shared mounts provisioned & ability to
ship an /etc/gitconfig (wherever that lives on Windows).

It's much easier to just do that than figure out how to build a custom
Git on all three platforms.

I guess you might make the argument that "that's good", because in
practice that'll mean that it's such a hassle that fewer administrators
will turn this on.

But I think that would be a loss, because that's taking the default view
that people with the rights (i.e. managed config access) to turn on
something like this by default have nefarious motives, and we should do
what we can to stop them.

I don't think that's true, e.g. what I intend to use this for is:

 a) Getting aggregate data on what commands/switches are used, for
purposes of training and prioritizing my upstream contributions.

 b) Aggregate performance data to figure out what hotspots to tackle in
the code.

That's things that'll both benefit the users I'm responsible for, and
the wider git community.

> Not making it a compile-time option could force [1] linux distro
> to carry this function to everybody even if they don't use it (and
> it's kinda dangerous to misuse if you don't anonymize the data
> properly). I also prefer this a compile time option.

Setting GIT_TRACE to a filename that you published is also similarly
dangerous, so would setting up a trivial 4-line shell alias to wrap
"git" and log what it's doing.

> [1] Of course many distros can choose to patch it out. But it's the
> same argument as bringing this option in in the first place: you guys
> already have that code in private and now want to put it in stock git
> to reduce maintenance cost, why add extra cost on linux distro
> maintenance?

Because:

1) I really don't see the basis for this argument that they'd need to
   patch it out, they're not patching out e.g. GIT_TRACE now, which has
   all the same sort of concerns, it's just a format that's more of a
   hassle to parse than this proposed format.

2) I think you and Johannes are just seeing the "telemetry" part of
   this, but if you look past that all this *really* is is "GIT_TRACE
   facility that doesn't suck to parse".

   There's a lot of use-cases for that which have nothing to do with
   what this facility is originally written for, for example, a novice
   git user could turn it on and have it log in ~ somewhere, and then
   run some contrib script which analyzes his git usage and spews out
   suggestions ("you use this command/option, but not this related
   useful command/option").

   Users asking for help on the #git IRC channel or on this mailing list
   could turn this on if they have a problem, and paste it into some
   tool they could show to others to see exactly what they're doing /
   where it went wrong.


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-08 Thread Duy Nguyen
On Sat, Jun 9, 2018 at 12:22 AM Ævar Arnfjörð Bjarmason
 wrote:
>
>
> On Fri, Jun 08 2018, Johannes Sixt wrote:
>
> > Am 08.06.2018 um 18:00 schrieb Thomas Braun:
> >> I for my part would much rather prefer that to be a compile time
> >> option so that I don't need to check on every git update on windows
> >> if  this is now enabled or not.
> >
> > This exactly my concern, too! A compile-time option may make it a good
> > deal less worrisome.
>
> Can you elaborate on how someone who can maintain inject malicious code
> into your git package + config would be thwarted by this being some
> compile-time option, wouldn't they just compile it in?


Look at this from a different angle. This is driven by the needs to
collect telemetry in _controlled_ environment (mostly server side, I
guess) and it should be no problem to make custom builds there for
you. Not making it a compile-time option could force [1] linux distro
to carry this function to everybody even if they don't use it (and
it's kinda dangerous to misuse if you don't anonymize the data
properly). I also prefer this a compile time option.

[1] Of course many distros can choose to patch it out. But it's the
same argument as bringing this option in in the first place: you guys
already have that code in private and now want to put it in stock git
to reduce maintenance cost, why add extra cost on linux distro
maintenance?
-- 
Duy


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-08 Thread Ævar Arnfjörð Bjarmason


On Fri, Jun 08 2018, Johannes Sixt wrote:

> Am 08.06.2018 um 18:00 schrieb Thomas Braun:
>> I for my part would much rather prefer that to be a compile time
>> option so that I don't need to check on every git update on windows
>> if  this is now enabled or not.
>
> This exactly my concern, too! A compile-time option may make it a good
> deal less worrisome.

Can you elaborate on how someone who can maintain inject malicious code
into your git package + config would be thwarted by this being some
compile-time option, wouldn't they just compile it in?


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-08 Thread Johannes Sixt

Am 08.06.2018 um 18:00 schrieb Thomas Braun:

I for my part would much rather prefer that to be a compile time
option so that I don't need to check on every git update on windows
if  this is now enabled or not.


This exactly my concern, too! A compile-time option may make it a good 
deal less worrisome.


-- Hannes


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-08 Thread Thomas Braun
Am 08.06.2018 um 11:07 schrieb Jeff King:
> On Thu, Jun 07, 2018 at 11:10:52PM +0200, Johannes Sixt wrote:
> 
>> Am 07.06.2018 um 16:53 schrieb g...@jeffhostetler.com:
>>> From: Jeff Hostetler 
>>>
>>> I've been working to add code to Git to optionally collect telemetry data.
>>> The goal is to be able to collect performance data from Git commands and
>>> allow it to be aggregated over a user community to find "slow commands".
>>
>> Seriously? "add code to collect telemetry data" said by somebody whose email
>> address ends with @microsoft.com is very irritating. I really don't want to
>> have yet another switch that I must check after every update that it is
>> still off.
> 
> If you look at the design document, it's off by default and would write
> to a file on the filesystem. That doesn't seem all that different from
> GIT_TRACE.

The patch also includes the following part

+telemetry.plugin
+
+
+If the config setting "telemetry.plugin" contains the pathname to a shared
+library, the library will be dynamically loaded during start up and events
+will be sent to it using the plugin API.
+
+This plugin model allows an organization to define a custom or private
+telemetry solution while using a stock version of Git.
+
+For example, on Windows, it allows telemetry events to go directly to the
+kernel via the plugin using the high performance Event Tracing for Windows
+(ETW) facility.
+
+The contrib/telemetry-plugin-examples directory contains two example
+plugins:
+ * A trivial log to stderr
+ * A trivial ETW writer

which is not a file but, if enabled, some windows internal thingie where the 
data is gone/duplicated/sent out/whatever.

I for my part would much rather prefer that to be a compile time option so that 
I don't need to check on every git update on windows if this is now enabled or 
not.


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-08 Thread Duy Nguyen
On Fri, Jun 8, 2018 at 11:40 AM, Ævar Arnfjörð Bjarmason
 wrote:
>
> On Thu, Jun 07 2018, Johannes Sixt wrote:
>
>> Am 07.06.2018 um 16:53 schrieb g...@jeffhostetler.com:
>>> From: Jeff Hostetler 
>>>
>>> I've been working to add code to Git to optionally collect telemetry data.
>>> The goal is to be able to collect performance data from Git commands and
>>> allow it to be aggregated over a user community to find "slow commands".
>>
>> Seriously? "add code to collect telemetry data" said by somebody whose
>> email address ends with @microsoft.com is very irritating. I really
>> don't want to have yet another switch that I must check after every
>> update that it is still off.
>
> To elaborate a bit on Jeff's reply (since this was discussed in more
> detail at Git Merge this year), the point of this feature is not to ship
> git.git with some default telemetry, but so that in-house users of git
> like Microsoft, Dropbox, Booking.com etc. can build & configure their
> internal versions of git to turn on telemetry for their own users.
>
> There's numerous in-house monkeypatches to git on various sites (at
> least Microsoft & Dropbox reported having internal patches already).
> Something like this facility would allow us to agree on some
> implementation that could be shipped by default (but would be off by
> default), those of us who'd make use of this feature already have "root"
> on those user's machines, and control what git binary they use etc,
> their /etc/gitconfig and so on.

This elaboration should have been in the commit message of the first
patch (and I hope it will be in v2). You guys knew because it was
discussed at Git Merge but the rest of us may not.. It would be much
more convincing to have something like this spelled out.
-- 
Duy


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-08 Thread Ævar Arnfjörð Bjarmason


On Thu, Jun 07 2018, Johannes Sixt wrote:

> Am 07.06.2018 um 16:53 schrieb g...@jeffhostetler.com:
>> From: Jeff Hostetler 
>>
>> I've been working to add code to Git to optionally collect telemetry data.
>> The goal is to be able to collect performance data from Git commands and
>> allow it to be aggregated over a user community to find "slow commands".
>
> Seriously? "add code to collect telemetry data" said by somebody whose
> email address ends with @microsoft.com is very irritating. I really
> don't want to have yet another switch that I must check after every
> update that it is still off.

To elaborate a bit on Jeff's reply (since this was discussed in more
detail at Git Merge this year), the point of this feature is not to ship
git.git with some default telemetry, but so that in-house users of git
like Microsoft, Dropbox, Booking.com etc. can build & configure their
internal versions of git to turn on telemetry for their own users.

There's numerous in-house monkeypatches to git on various sites (at
least Microsoft & Dropbox reported having internal patches already).
Something like this facility would allow us to agree on some
implementation that could be shipped by default (but would be off by
default), those of us who'd make use of this feature already have "root"
on those user's machines, and control what git binary they use etc,
their /etc/gitconfig and so on.


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-08 Thread Jeff King
On Thu, Jun 07, 2018 at 11:10:52PM +0200, Johannes Sixt wrote:

> Am 07.06.2018 um 16:53 schrieb g...@jeffhostetler.com:
> > From: Jeff Hostetler 
> > 
> > I've been working to add code to Git to optionally collect telemetry data.
> > The goal is to be able to collect performance data from Git commands and
> > allow it to be aggregated over a user community to find "slow commands".
> 
> Seriously? "add code to collect telemetry data" said by somebody whose email
> address ends with @microsoft.com is very irritating. I really don't want to
> have yet another switch that I must check after every update that it is
> still off.

If you look at the design document, it's off by default and would write
to a file on the filesystem. That doesn't seem all that different from
GIT_TRACE.

-Peff


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-07 Thread Johannes Sixt

Am 07.06.2018 um 16:53 schrieb g...@jeffhostetler.com:

From: Jeff Hostetler 

I've been working to add code to Git to optionally collect telemetry data.
The goal is to be able to collect performance data from Git commands and
allow it to be aggregated over a user community to find "slow commands".


Seriously? "add code to collect telemetry data" said by somebody whose 
email address ends with @microsoft.com is very irritating. I really 
don't want to have yet another switch that I must check after every 
update that it is still off.


-- Hannes


[RFC PATCH v1] telemetry design overview (part 1)

2018-06-07 Thread git
From: Jeff Hostetler 

I've been working to add code to Git to optionally collect telemetry data.
The goal is to be able to collect performance data from Git commands and
allow it to be aggregated over a user community to find "slow commands".

I'm going to break this up into several parts rather than sending one large
patch series.  I think it is easier to review in pieces and in stages.

Part 1 contains the overall design documentation.
Part 2 will contain the basic telemetry event mechanism and the cmd_start
and cmd_exit events.

I'll post part 2 shortly -- as soon as I can convert my tests from python
to perl.  :-)

Jeff Hostetler (1):
  telemetry: design documenation

 Documentation/technical/telemetry.txt | 475 ++
 1 file changed, 475 insertions(+)
 create mode 100644 Documentation/technical/telemetry.txt

-- 
2.9.3