Re: [PATCH 10/10] add UNLEAK annotation for reducing leak false positives

2017-09-13 Thread Kaartic Sivaraam

On Tuesday 12 September 2017 08:35 PM, Jeff King wrote:
But theta-well isn't a pun. :P 


:)


It is true that prepending to a linked list is also Θ(1), but I'm not
sure if it's carelessness that causes many programmers to use big-O.
It's that what we care about is worst-case performance. So knowing that
we have a lower bound isn't usually that interesting. What we want to
know is whether it will blow up in our face as "n" gets large.


This seems quite acceptable.


Plus typing non-ascii characters is annoying. :)


I expected this to be a reason. :)

---
Kaartic


Re: [PATCH 10/10] add UNLEAK annotation for reducing leak false positives

2017-09-12 Thread Kaartic Sivaraam

On Tuesday 12 September 2017 08:59 PM, Jeff King wrote:

Like all good writing rules, I think it's important to know when to
break them. :)


That's right. "Have guidelines but 'Be bold' enough to break them when
they seem to be inducing counter productivity."


Writing in the imperative is _most_ important in the subject. You're
likely to see a lot of subjects in a list, and it makes the list easier
to read if they all match. It also tends to be shorter, which is good
for subjects.

For short commit messages, I think the imperative also keeps things
tight and to the point: describe the problem and then say how to fix it.
The recent 0db3dc75f is a good example (which I picked by skimming
recent "git log" output). But saying "this patch" is IMHO not that big a
problem there, as long as it isn't done excessively.

When you the explanation is longer or more complicated, the imperative
can actually be a bit _too_ terse. In longer text it helps to guide
readers in the direction you want their thoughts to take. Having a
three-paragraph explanation of the problem or current state of things
and then jumping right into "Do this. Do that." lacks context. A marker
like "this patch" helps the reader know that you're switching gears to
talking about the solution.

I also think that "I" is useful in avoiding the passive voice.  It can
certainly be used gratuitously and make things less clear, but in most
cases I'd rather see something like "I tested performance under these
conditions" than "Performance was tested under these conditions". I also
often use the "academic we" here even when I worked on something myself.


Thanks for taking the time to give the detailed and clear explanation.

---
Kaartic


Re: [PATCH 10/10] add UNLEAK annotation for reducing leak false positives

2017-09-12 Thread Jeff King
On Tue, Sep 12, 2017 at 07:11:38PM +0530, Kaartic Sivaraam wrote:

> On Tue, 2017-09-05 at 09:05 -0400, Jeff King wrote:
> > This patch introduces an UNLEAK() macro that lets us do so.
> > To understand its design, let's first look at some of the
> > alternatives.
> > 
> 
> > This patch adds the UNLEAK() macro and enables it
> > automatically when Git is compiled with SANITIZE=leak.
> > It adds some UNLEAK() annotations to show off how the
> > feature works. On top of other recent leak fixes, these are
> > enough to get t and t0001 to pass when compiled with
> > LSAN.
> 
> My nit of the day ;-)
> 
> The above paragraphs seems to going against the following guideline of
> Documentation/SubmittingPatches,
> 
> Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
> instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
> to do frotz", as if you are giving orders to the codebase to change
> its behavior.  Try to make sure your explanation can be understood

Like all good writing rules, I think it's important to know when to
break them. :)

Writing in the imperative is _most_ important in the subject. You're
likely to see a lot of subjects in a list, and it makes the list easier
to read if they all match. It also tends to be shorter, which is good
for subjects.

For short commit messages, I think the imperative also keeps things
tight and to the point: describe the problem and then say how to fix it.
The recent 0db3dc75f is a good example (which I picked by skimming
recent "git log" output). But saying "this patch" is IMHO not that big a
problem there, as long as it isn't done excessively.

When you the explanation is longer or more complicated, the imperative
can actually be a bit _too_ terse. In longer text it helps to guide
readers in the direction you want their thoughts to take. Having a
three-paragraph explanation of the problem or current state of things
and then jumping right into "Do this. Do that." lacks context. A marker
like "this patch" helps the reader know that you're switching gears to
talking about the solution.

I also think that "I" is useful in avoiding the passive voice.  It can
certainly be used gratuitously and make things less clear, but in most
cases I'd rather see something like "I tested performance under these
conditions" than "Performance was tested under these conditions". I also
often use the "academic we" here even when I worked on something myself.

-Peff


Re: [PATCH 10/10] add UNLEAK annotation for reducing leak false positives

2017-09-12 Thread Jeff King
On Tue, Sep 12, 2017 at 08:04:52PM +0530, Kaartic Sivaraam wrote:

> > On Tue, 2017-09-05 at 15:05 -0700, Stefan Beller wrote:
> > 
> > After having a sneak peak at the implementation
> > it is O(1) in runtime for each added element, and the
> > space complexity is O(well).
> > 
> 
> Incidentally I was reading about "complexity of algorithms" and there
> was the following paragraph in the book,
> 
> 
> Unfortunately, as Knuth observed, big-O notation is often used by 
> careless writers and
> speakers as if it had the same meaning as big-Theta notation. Keep this 
> in mind when you see
> big-O notation used. The recent trend has been to use big-Theta notation 
> whenever both upper
> and lower bounds on the size of a function are needed.
> 
> So, if my interpretation is correct the above usage of O(1) and O(well)
> should have been Θ(1) and Θ(well).

But theta-well isn't a pun. :P

It is true that prepending to a linked list is also Θ(1), but I'm not
sure if it's carelessness that causes many programmers to use big-O.
It's that what we care about is worst-case performance. So knowing that
we have a lower bound isn't usually that interesting. What we want to
know is whether it will blow up in our face as "n" gets large.

Plus typing non-ascii characters is annoying. :)

If you want to talk about sloppy analysis, the two most common problems
I see are:

  1. People talk about big-O complexity without discussing constants.
 For reasonable values of "n", the constants often matter (they're
 not wrong about big-O, but they are wrong about what will run fast
 in practice).

  2. Glossing over things like amortized costs. Hash tables aren't
 really O(1) because they eventually fill up and get collisions. You
 have to talk about load factor, resizing, etc.

I'm sure I'm guilty of doing those things sometimes, too.

-Peff


Re: [PATCH 10/10] add UNLEAK annotation for reducing leak false positives

2017-09-12 Thread Kaartic Sivaraam
> On Tue, 2017-09-05 at 15:05 -0700, Stefan Beller wrote:
> 
> After having a sneak peak at the implementation
> it is O(1) in runtime for each added element, and the
> space complexity is O(well).
> 

Incidentally I was reading about "complexity of algorithms" and there
was the following paragraph in the book,


Unfortunately, as Knuth observed, big-O notation is often used by careless 
writers and
speakers as if it had the same meaning as big-Theta notation. Keep this in 
mind when you see
big-O notation used. The recent trend has been to use big-Theta notation 
whenever both upper
and lower bounds on the size of a function are needed.


So, if my interpretation is correct the above usage of O(1) and O(well)
should have been Θ(1) and Θ(well).

-- 
Kaartic


Re: [PATCH 10/10] add UNLEAK annotation for reducing leak false positives

2017-09-12 Thread Kaartic Sivaraam
On Tue, 2017-09-05 at 09:05 -0400, Jeff King wrote:
> This patch introduces an UNLEAK() macro that lets us do so.
> To understand its design, let's first look at some of the
> alternatives.
> 

> This patch adds the UNLEAK() macro and enables it
> automatically when Git is compiled with SANITIZE=leak.
> It adds some UNLEAK() annotations to show off how the
> feature works. On top of other recent leak fixes, these are
> enough to get t and t0001 to pass when compiled with
> LSAN.

My nit of the day ;-)

The above paragraphs seems to going against the following guideline of
Documentation/SubmittingPatches,

Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
to do frotz", as if you are giving orders to the codebase to change
its behavior.  Try to make sure your explanation can be understood



-- 
Kaartic


Re: [PATCH 10/10] add UNLEAK annotation for reducing leak false positives

2017-09-07 Thread Stefan Beller
On Thu, Sep 7, 2017 at 2:17 AM, Jeff King  wrote:
>> After having a sneak peak at the implementation
>> it is O(1) in runtime for each added element, and the
>> space complexity is O(well).
>
> I'm not sure if your "well" is "this does well" or "well, it could be
> quite a lot". :)

Both actually. When I wrote it I thought the phonetic interpretation
was way too funny, but nobody can hear subtle humor on mailing
lists. :)

If UNLEAK is used correctly, then it sounds more like
"this does well (and we cannot do better anyway)".

> It certainly has the potential to grow the heap without bound (since
> after all, it's whole point is to make a giant list of variables that
> are going out of scope). But in practice we'd sprinkle this over a
> handful of variables just before program exit (and remember that it's
> copying only what's on the stack already; so pointers get copied, not
> whole heap-allocated blocks).
>
> Plus it does nothing at all when not compiled with leak-checking. So I'm
> not too worried about the extra memory usage or performance.

me neither.

Thanks for starting this series (I am really happy about this solution)!
Stefan


Re: [PATCH 10/10] add UNLEAK annotation for reducing leak false positives

2017-09-07 Thread Jeff King
On Tue, Sep 05, 2017 at 03:05:12PM -0700, Stefan Beller wrote:

> On Tue, Sep 5, 2017 at 6:05 AM, Jeff King  wrote:
> 
> >   int main(void)
> 
> nit of the day:
>   s/void/int argc, char *argv/ or in case we do not
>   want to emphasize the argument list s/void//
>   as that adds no uninteresting things.

That really is a nit. I chose not to provide argv because it's longer
than "void" and I wasn't going to use the arguments. And I chose not to
use an empty argument list because it violates our style (as well as
arguably the C standard, though it leaves room for implementations to
take other forms of main).

> > In other words, you can do:
> >
> >   int main(void)
> >   {
> > char *p = some_function();
> > printf("%s", p);
> > UNLEAK(p);
> > return 0;
> >   }
> >
> > to annotate "p" and suppress the leak report.
> 
> This sounds really cool so far.
> 
> After having a sneak peak at the implementation
> it is O(1) in runtime for each added element, and the
> space complexity is O(well).

I'm not sure if your "well" is "this does well" or "well, it could be
quite a lot". :)

It certainly has the potential to grow the heap without bound (since
after all, it's whole point is to make a giant list of variables that
are going out of scope). But in practice we'd sprinkle this over a
handful of variables just before program exit (and remember that it's
copying only what's on the stack already; so pointers get copied, not
whole heap-allocated blocks).

Plus it does nothing at all when not compiled with leak-checking. So I'm
not too worried about the extra memory usage or performance.

> >   1. It can be compiled conditionally. There's no need in
> >  normal runs to do this free(), and it just wastes time.
> >  By using a macro, we can get the benefit for leak-check
> >  builds with zero cost for normal builds (this patch
> >  uses a compile-time check, though we could clearly also
> >  make it a run-time check at very low cost).
> >
> >  Of course one could also hide free() behind a macro, so
> >  this is really just arguing for having UNLEAK(), not
> >  for its particular implementation.
> 
> This is only a real argument in combination with (2), or in other
> words you seem to hint at situations like these:

Well, the numbered list was meant to be a set of arguments, each of
which contributes to the overall conclusion. :) I agree that (1) is the
weakest. Since both you and Martin seemed to get hung up on it, I'll
re-organize it a bit for the re-roll.

>   5. It's not just about worrying if we can call UNLEAK
>   once (in 4), but we also do not have to worry about
>   calling it twice, or recursively. (This argument can be bad
>   for cargo cult programmers, but we don't have these ;-)

True. I didn't come across that case in any of the ones I converted. As
a more general rule, UNLEAK() doesn't access any pointed-to memory at
all. So it's fine with already-freed or even uninitialized memory (which
of course is technically wrong according to the standard, but in
practice would be fine, as we'd copy garbage that does not match a heap
block).

> > +#ifdef SUPPRESS_ANNOTATED_LEAKS
> > +extern void unleak_memory(const void *ptr, size_t len);
> > +#define UNLEAK(var) unleak_memory(&(var), sizeof(var));
> 
> As always with macros we have to be careful about its arguments.
> 
>   UNLEAK(a++)
>   UNLEAK(baz())
> 
> won't work as intended.

Yes, I intended this to be used only for actual variables. I couldn't
think of a way to enforce that at compile time with some kind of
BUILD_ASSERT (even requiring an lvalue isn't quite strict enough).

-Peff


Re: [PATCH 10/10] add UNLEAK annotation for reducing leak false positives

2017-09-07 Thread Jeff King
On Wed, Sep 06, 2017 at 07:16:00PM +0200, Martin Ågren wrote:

> > diff --git a/builtin/commit.c b/builtin/commit.c
> > index b3b04f5dd3..de775d906c 100644
> > --- a/builtin/commit.c
> > +++ b/builtin/commit.c
> > @@ -1819,5 +1819,6 @@ int cmd_commit(int argc, const char **argv, const 
> > char *prefix)
> > print_summary(prefix, &oid, !current_head);
> >
> > strbuf_release(&err);
> > +   UNLEAK(sb);
> > return 0;
> >  }
> 
> These are both strbufs, so this ends up being a bit inconsistent. What
> would be the ideal end state for these two and all other such
> structures? My guess is "always UNLEAK", as opposed to carefully judging
> whether foo_release() would/could add any significant overhead.
> 
> In other words, it would be ok/wanted with changes such as "let's UNLEAK
> bar, because ..., and while at it, convert the existing foo_release to
> UNLEAK for consistency" (or per policy, for smaller binary, whatever).
> Or "if it ain't broken, don't fix it"? Did you think about this, or was
> it more a random choice?

To be honest, I didn't really think that deeply about it. I had a hammer
in my hand, and LSAN kept showing me nails to pound.

I agree that these two strbufs should probably be treated the same.

In general, I think I prefer using UNLEAK() because it's hard to get it
wrong (i.e., you don't have to care about double-frees or uninitialized
pointers). For strbufs, though, that's less of an issue because they are
always maintained in a consistent state.

As an aside, I'm pretty sure that "err" can never have been allocated
here, and this release is always a noop. It's filled in only when we get
an error from the ref update, which also causes us to die(). But in
general I'd prefer the code that causes readers to think the least
(i.e., just calling free or UNLEAK here rather than forcing the reader
to figure out whether it's possible to leak).

-Peff


Re: [PATCH 10/10] add UNLEAK annotation for reducing leak false positives

2017-09-06 Thread Martin Ågren
On 5 September 2017 at 15:05, Jeff King  wrote:
>   1. It can be compiled conditionally. There's no need in
>  normal runs to do this free(), and it just wastes time.
>  By using a macro, we can get the benefit for leak-check
>  builds with zero cost for normal builds (this patch
>  uses a compile-time check, though we could clearly also
>  make it a run-time check at very low cost).
>
>  Of course one could also hide free() behind a macro, so
>  this is really just arguing for having UNLEAK(), not
>  for its particular implementation.

Like Stefan, I didn't quite follow 1. until after I had read the points
below it. But it's still a very good commit message (as always).

> diff --git a/builtin/commit.c b/builtin/commit.c
> index b3b04f5dd3..de775d906c 100644
> --- a/builtin/commit.c
> +++ b/builtin/commit.c
> @@ -1819,5 +1819,6 @@ int cmd_commit(int argc, const char **argv, const char 
> *prefix)
> print_summary(prefix, &oid, !current_head);
>
> strbuf_release(&err);
> +   UNLEAK(sb);
> return 0;
>  }

These are both strbufs, so this ends up being a bit inconsistent. What
would be the ideal end state for these two and all other such
structures? My guess is "always UNLEAK", as opposed to carefully judging
whether foo_release() would/could add any significant overhead.

In other words, it would be ok/wanted with changes such as "let's UNLEAK
bar, because ..., and while at it, convert the existing foo_release to
UNLEAK for consistency" (or per policy, for smaller binary, whatever).
Or "if it ain't broken, don't fix it"? Did you think about this, or was
it more a random choice?

Martin


Re: [PATCH 10/10] add UNLEAK annotation for reducing leak false positives

2017-09-05 Thread Stefan Beller
On Tue, Sep 5, 2017 at 6:05 AM, Jeff King  wrote:

>   int main(void)

nit of the day:
  s/void/int argc, char *argv/ or in case we do not
  want to emphasize the argument list s/void//
  as that adds no uninteresting things.


>
> In other words, you can do:
>
>   int main(void)
>   {
> char *p = some_function();
> printf("%s", p);
> UNLEAK(p);
> return 0;
>   }
>
> to annotate "p" and suppress the leak report.

This sounds really cool so far.

After having a sneak peak at the implementation
it is O(1) in runtime for each added element, and the
space complexity is O(well).

> But wait, couldn't we just say "free(p)"? In this toy
> example, yes. But using UNLEAK() has several advantages over
> actually freeing the memory:

This is indeed the big question, that I have had.

>
>   1. It can be compiled conditionally. There's no need in
>  normal runs to do this free(), and it just wastes time.
>  By using a macro, we can get the benefit for leak-check
>  builds with zero cost for normal builds (this patch
>  uses a compile-time check, though we could clearly also
>  make it a run-time check at very low cost).
>
>  Of course one could also hide free() behind a macro, so
>  this is really just arguing for having UNLEAK(), not
>  for its particular implementation.

This is only a real argument in combination with (2), or in other
words you seem to hint at situations like these:

  struct *foo = obtain_new_foo();
  ...
  #if FREE_ANNOTATED_LEAKS
/* special free() */
release_foo(foo);
  #endif

With UNLEAK this situation works out nicely as we just
copy over all memory, ignoring elements allocated inside
foo, but for free() we'd have issues combining the preprocessor
magic with the special free implementation.

So how would we use syntactic sugar to made this
more comfortable? Roughly like

MAYBE(release_foo(foo))

  #if (FREE_ANNOTATED_LEAKS)
  /* we rely on strict text substitution */
  /* as the function signature may change */
  #define MAYBE(fn) fn;
  #else
  #define MAYBE(fn)
  #endif

Me regurgitating this first argument is just a long way of saying
that it put me off even more after reading only the first argument.
Maybe reorder this argument to show up after the current second
argument, so the reader is guided better?

>   2. It's recursive across structures. In many cases our "p"
>  is not just a pointer, but a complex struct whose
>  fields may have been allocated by a sub-function. And
>  in some cases (e.g., dir_struct) we don't even have a
>  function which knows how to free all of the struct
>  members.
>
>  By marking the struct itself as reachable, that confers
>  reachability on any pointers it contains (including those
>  found in embedded structs, or reachable by walking
>  heap blocks recursively.
>
>   3. It works on cases where we're not sure if the value is
>  allocated or not. For example:
>
>char *p = argc > 1 ? argv[1] : some_function();
>
>  It's safe to use UNLEAK(p) here, because it's not
>  freeing any memory. In the case that we're pointing to
>  argv here, the reachability checker will just ignore
>  our bytes.

This argument demonstrates why the MAYBE above is
inferior.

>
>   4. Because it's not actually freeing memory, you can
>  UNLEAK() before we are finished accessing the variable.
>  This is helpful in cases like this:
>
>char *p = some_function();
>return another_function(p);
>
>  Writing this with free() requires:
>
>int ret;
>char *p = some_function();
>ret = another_function(p);
>free(p);
>return ret;
>
>  But with unleak we can just write:
>
>char *p = some_function();
>UNLEAK(p);
>return another_function(p);

  5. It's not just about worrying if we can call UNLEAK
  once (in 4), but we also do not have to worry about
  calling it twice, or recursively. (This argument can be bad
  for cargo cult programmers, but we don't have these ;-)



> +#ifdef SUPPRESS_ANNOTATED_LEAKS
> +extern void unleak_memory(const void *ptr, size_t len);
> +#define UNLEAK(var) unleak_memory(&(var), sizeof(var));

As always with macros we have to be careful about its arguments.

  UNLEAK(a++)
  UNLEAK(baz())

won't work as intended.


[PATCH 10/10] add UNLEAK annotation for reducing leak false positives

2017-09-05 Thread Jeff King
It's a common pattern in git commands to allocate some
memory that should last for the lifetime of the program and
then not bother to free it, relying on the OS to throw it
away.

This keeps the code simple, and it's fast (we don't waste
time traversing structures or calling free at the end of the
program). But it also triggers warnings from memory-leak
checkers like valgrind or LSAN. They know that the memory
was still allocated at program exit, but they don't know
_when_ the leaked memory stopped being useful. If it was
early in the program, then it's probably a real and
important leak. But if it was used right up until program
exit, it's not an interesting leak and we'd like to suppress
it so that we can see the real leaks.

This patch introduces an UNLEAK() macro that lets us do so.
To understand its design, let's first look at some of the
alternatives.

Unfortunately the suppression systems offered by
leak-checking tools don't quite do what we want. A
leak-checker basically knows two things:

  1. Which blocks were allocated via malloc, and the
 callstack during the allocation.

  2. Which blocks were left un-freed at the end of the
 program (and which are unreachable, but more on that
 later).

Their suppressions work by mentioning the function or
callstack of a particular allocation, and marking it as OK
to leak.  So imagine you have code like this:

  int main(void)
  {
/* this allocates some memory */
char *p = some_function();
printf("%s", p);
return 0;
  }

You can say "ignore allocations from some_function(),
they're not leaks". But that's not right. That function may
be called elsewhere, too, and we would potentially want to
know about those leaks.

So you can say "ignore the callstack when main calls
some_function".  That works, but your annotations are
brittle. In this case it's only two functions, but you can
imagine that the actual allocation is much deeper. If any of
the intermediate code changes, you have to update the
suppression.

What we _really_ want to say is that "the value assigned to
p at the end of the function is not a real leak". But
leak-checkers can't understand that; they don't know about
"p" in the first place.

However, we can do something a little bit tricky if we make
some assumptions about how leak-checkers work. They
generally don't just report all un-freed blocks. That would
report even globals which are still accessible when the
leak-check is run.  Instead they take some set of memory
(like BSS) as a root and mark it as "reachable". Then they
scan the reachable blocks for anything that looks like a
pointer to a malloc'd block, and consider that block
reachable. And then they scan those blocks, and so on,
transitively marking anything reachable from a global as
"not leaked" (or at least leaked in a different category).

So we can mark the value of "p" as reachable by putting it
into a variable with program lifetime. One way to do that is
to just mark "p" as static. But that actually affects the
run-time behavior if the function is called twice (you
aren't likely to call main() twice, but some of our cmd_*()
functions are called from other commands).

Instead, we can trick the leak-checker by putting the value
into _any_ reachable bytes. This patch keeps a global
linked-list of bytes copied from "unleaked" variables. That
list is reachable even at program exit, which confers
recursive reachability on whatever values we unleak.

In other words, you can do:

  int main(void)
  {
char *p = some_function();
printf("%s", p);
UNLEAK(p);
return 0;
  }

to annotate "p" and suppress the leak report.

But wait, couldn't we just say "free(p)"? In this toy
example, yes. But using UNLEAK() has several advantages over
actually freeing the memory:

  1. It can be compiled conditionally. There's no need in
 normal runs to do this free(), and it just wastes time.
 By using a macro, we can get the benefit for leak-check
 builds with zero cost for normal builds (this patch
 uses a compile-time check, though we could clearly also
 make it a run-time check at very low cost).

 Of course one could also hide free() behind a macro, so
 this is really just arguing for having UNLEAK(), not
 for its particular implementation.

  2. It's recursive across structures. In many cases our "p"
 is not just a pointer, but a complex struct whose
 fields may have been allocated by a sub-function. And
 in some cases (e.g., dir_struct) we don't even have a
 function which knows how to free all of the struct
 members.

 By marking the struct itself as reachable, that confers
 reachability on any pointers it contains (including those
 found in embedded structs, or reachable by walking
 heap blocks recursively.

  3. It works on cases where we're not sure if the value is
 allocated or not. For example:

   char *p = argc > 1 ? argv[1] : some_function();