Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-25 Thread Michael Haggerty
On 03/24/2015 05:49 PM, René Scharfe wrote:
 Am 24.03.2015 um 17:06 schrieb Michael Haggerty:
 Parsing numbers is not rocket science, but there are a lot of pitfalls,
 especially around overflow. It's even harder to write such code via
 macros and the result is less readable.

 This patch series is mostly about finding a reasonable API and whipping
 the callers into shape. That seems ambitious enough for me. I'd rather
 stick with boring wrappers for now and lean on strtol()/strtoul() to do
 the dirty work. It will be easy for a motivated person to change the
 implementation later.
 
 The OpenBSD developers invented strtonum for that.  Are you aware of it?
  Would it fit?  This discussion may be useful:
 
 http://www.tedunangst.com/flak/post/the-design-of-strtonum

I wasn't aware of strtonum; thanks for the reference. It has an
untraditional interface. Their willingness to sacrifice half of the
unsigned range and requirement that the user specify minval and maxval
have the nice effect of permitting one function to be used for all
integer types.

I think git will need more flexibility, for example to support other
radixes and to allow trailing characters. So personally I don't think we
should use (or imitate) strtonum().

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-25 Thread Junio C Hamano
Michael Haggerty mhag...@alum.mit.edu writes:

[jc: dropped a non-working address from Cc list]

 I wasn't aware of strtonum; thanks for the reference. It has an
 untraditional interface. Their willingness to sacrifice half of the
 unsigned range and requirement that the user specify minval and maxval
 have the nice effect of permitting one function to be used for all
 integer types.
 ...
 I think git will need more flexibility, for example to support other
 radixes and to allow trailing characters. So personally I don't think we
 should use (or imitate) strtonum().

I had the same impression.  An earlier suggestion by Peff about
uintmax_t may be something we might want to keep in mind; we might
want to use strtoll/strtoull as the underlying implementation if we
wanted to avoid rolling our own.

Thanks.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-24 Thread Junio C Hamano
Michael Haggerty mhag...@alum.mit.edu writes:

 It is easy to allow --abbrev=+7; I would just need to add NUM_PLUS to
 those call sites. Should I do so?

The more relevant question to ask from my point of view is why you
need to add NUM_PLUS to enable it.  What valid reason do you
have to forbid it anywhere?  Only because you do not accept it by
default, you need to add to enable.

 Why is it a problem to allow git cmd --hexval=0x1234, even if git
 cmd --hexval=1234 would suffice?

 In some cases we would like to allow that flexibility; in some cases
 not. But the strtol()/strtoul() functions *always* allow it.

The same issue.  Whare are these some cases?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-24 Thread Michael Haggerty
On 03/19/2015 06:26 AM, Jeff King wrote:
 On Tue, Mar 17, 2015 at 05:00:02PM +0100, Michael Haggerty wrote:
 
 My main questions:

 * Do people like the API? My main goal was to make these functions as
   painless as possible to use correctly, because there are so many
   call sites.

 * Is it too gimmicky to encode the base together with other options in
   `flags`? (I think it is worth it to avoid the need for another
   parameter, which callers could easily put in the wrong order.)
 
 I definitely like the overall direction of this. My first thought was
 that it does seem like there are a lot of possible options to the
 functions (and OR-ing the flags with the base does seem weird, though I
 can't think of a plausible case where it would actually cause errors).
 Many of those options don't seem used in the example conversions (I'm
 not clear who would want NUM_SATURATE, for example).

There are a lot of options, but so far only few of them have been used.
As the call sites are rewritten we will see which of these features are
useful, and which can be dispensed with. If groups of options tend to be
used together, we can define constants for them like I've done with
NUM_SLOPPY and NUM_SIGN.

Regarding NUM_SATURATE: I'm not sure who would want it either, but I
thought there might be places where the user wants to specify
(effectively) infinity, and it might be convenient to let him specify
something easy to type like --max= rather than
--max=2147483647.

 I wondered if we could do away with the radix entirely. Wouldn't we be
 asking for base 10 most of the time? Of course, your first few patches
 show octal parsing, but I wonder if we should actually have a separate
 parse_mode() for that, since that seems to be the general reason for
 doing octal parsing. 10644 does not overflow an int, but it is
 hardly a reasonable mode.

Again, as a first pass I wanted to just have a really flexible API so
that call sites can be rewritten without a lot of extra thought. If
somebody wants to add a parse_mode() function, it will be easy to build
on top of convert_ui(). But that change can be done after this one.

 I also wondered if we could get rid of NUM_SIGN in favor of just having
 the type imply it (e.g., convert_l would always allow negative numbers,
 whereas convert_ul would not). But I suppose there are times when we end
 up using an int to store an unsigned value for a good reason (e.g.,
 -1 is a sentinel value, but we expect only positive values from the
 user). So that might be a bad idea.

Yes, as I was rewriting call sites, I found many that used the unsigned
variants of the parsing functions but stored the result in an int.
Probably some of these use -1 to denote unset; it might be that there
are other cases where the variable could actually be declared to be
unsigned int.

Prohibiting signs when parsing signed quantities isn't really elegant
from an API purity point of view, but it sure is handy!

 I notice that you go up to unsigned long here for sizes. If we want to
 use this everywhere, we'll need something larger for parsing off_t
 values on 32-bit machines (though I would not at all be surprised with
 the current code if 32-bit machines have a hard time configuring a
 pack.packSizeLimit above 4G).

Yes, probably. I haven't run into such call sites yet.

 I wonder how much of the boilerplate in the parse_* functions could be
 factored out to use a uintmax_t, with the caller just providing the
 range. That would make it easier to add new types like off_t, and
 possibly even constrained types (e.g., an integer from 0 to 100). On the
 other hand, you mentioned to me elsewhere that there may be some bugs in
 the range-checks of config.c's integer parsing. I suspect they are
 related to exactly this kind of refactoring, so perhaps writing things
 out is best.

It's not a lot of code yet. If we find out we need variants for size_t
and off_t and uintmax_t and intmax_t then such a refactoring would
definitely be worth considering.

 * Am I making callers too strict? In many cases where a positive
   integer is expected (e.g., --abbrev=num), I have been replacing
   code like
 [...]
 
 IMHO most of the tightening happening here is a good thing, and means we
 are more likely to notice mistakes rather than silently doing something
 stupid.
 
 For sites that currently allow it, I could imagine people using hex
 notation for some values, though, depending on the context. It looks
 there aren't many of them ((it is just when the radix is 0, right?).
 Some of them look to be accidental (does anybody really ask for
 --threads=0x10?), but others might not be (the pack index-version
 contains an offset field that might be quite big).

Yes, we can even debate whether we want to implement a general policy
that user-entered integers can be specified in any radix. Probably
nobody will ever specify --threads=0x10, but is there harm in allowing it?

Against the gain in flexibility, I see the following 

Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-24 Thread Michael Haggerty
On 03/19/2015 08:32 AM, Junio C Hamano wrote:
 Jeff King p...@peff.net writes:
 
 I wonder how much of the boilerplate in the parse_* functions could be
 factored out to use a uintmax_t, with the caller just providing the
 range. That would make it easier to add new types like off_t, and
 possibly even constrained types (e.g., an integer from 0 to 100). On the
 other hand, you mentioned to me elsewhere that there may be some bugs in
 the range-checks of config.c's integer parsing. I suspect they are
 related to exactly this kind of refactoring, so perhaps writing things
 out is best.
 
 I like this idea very well.  I wonder if we can implement the family
 of
 
 parse_{type}(const char *, unsigned int flags,
const char **endptr, {type} *result)
 
 functions by calls a helper that internally deals with the numbers
 in uintmax_t, and then checks if the value fits within the possible
 range of the *result before returning.
 
 int parse_i(const char *str, unsigned flags,
   const char **endptr, int *result) {
   uintmax_t val;
 int sign = 1, status;
 if (*str == '-') {
   sign = -1; 
 str++;
   }
 status = parse_num(str, flags, endptr, val, INT_MAX);
 if (status  0)
   return status;
   *result = sign  0 ? -val : val;
 return 0;
 }
 
 (assuming the last parameter to parse_num is used to check if the
 result fits within that range).  Or that may be easier and cleaner
 to be done in the caller with or something like that:
 
   status = parse_num(str, flags, endptr, val);
 if (status  0)
   return status;
   if (INT_MAX = val * sign || val * sign = INT_MIN) {
   errno = ERANGE;
 return -1;
   }
 
 If we wanted to, we may even be able to avoid duplication of
 boilerplate by wrapping the above pattern within a macro,
 parameterizing the TYPE_{MIN,MAX} and using token pasting, to
 expand to four necessary result types.
 
 There is no reason for the implementation of the parse_num() helper
 to be using strtoul(3) or strtoull(3); its behaviour will be under
 our total control.  It can become fancier by enriching the flags
 bits (e.g. allow scaling factor, etc.) only once and all variants
 for various result types will get the same feature.

Parsing numbers is not rocket science, but there are a lot of pitfalls,
especially around overflow. It's even harder to write such code via
macros and the result is less readable.

This patch series is mostly about finding a reasonable API and whipping
the callers into shape. That seems ambitious enough for me. I'd rather
stick with boring wrappers for now and lean on strtol()/strtoul() to do
the dirty work. It will be easy for a motivated person to change the
implementation later.

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-24 Thread René Scharfe

Am 24.03.2015 um 17:06 schrieb Michael Haggerty:

Parsing numbers is not rocket science, but there are a lot of pitfalls,
especially around overflow. It's even harder to write such code via
macros and the result is less readable.

This patch series is mostly about finding a reasonable API and whipping
the callers into shape. That seems ambitious enough for me. I'd rather
stick with boring wrappers for now and lean on strtol()/strtoul() to do
the dirty work. It will be easy for a motivated person to change the
implementation later.


The OpenBSD developers invented strtonum for that.  Are you aware of it? 
 Would it fit?  This discussion may be useful:


http://www.tedunangst.com/flak/post/the-design-of-strtonum

René

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-24 Thread Michael Haggerty
On 03/24/2015 04:58 PM, Junio C Hamano wrote:
 Michael Haggerty mhag...@alum.mit.edu writes:
 
 It is easy to allow --abbrev=+7; I would just need to add NUM_PLUS to
 those call sites. Should I do so?
 
 The more relevant question to ask from my point of view is why you
 need to add NUM_PLUS to enable it.  What valid reason do you
 have to forbid it anywhere?  Only because you do not accept it by
 default, you need to add to enable.

I want to be able to plunge into this project without first auditing all
call sites to see which features will turn out to be needed. So I'm
erring on the side of flexibility. For now, I want to be able to
prohibit '+' signs.

Currently all of the flags cause additional features to be enabled. My
guess was that most callers *won't* need most features, so it seemed
easiest and most consistent to have all features be turned off by
default and let the caller add the features that it wants to allow.

Regarding specifically allowing/disallowing a leading '+': I saw a
couple of callsites that explicitly check that the first character is a
digit before calling strtol(). I assumed that is to disallow sign
characters [1]. For example,

diff.c: optarg()
builtin/apply.c: parse_num()
maybe date.c: match_multi_number()

There are other callsites that call strtoul(), but store the result in a
signed variable. Those would presumably not want to allow leading '-',
but I'm not sure.

I also imagine that there are places in protocols or file formats where
signs should not be allowed (e.g., timestamps in commits?).

 Why is it a problem to allow git cmd --hexval=0x1234, even if git
 cmd --hexval=1234 would suffice?

 In some cases we would like to allow that flexibility; in some cases
 not. But the strtol()/strtoul() functions *always* allow it.
 
 The same issue.  Whare are these some cases?

I admit I'm not sure there are such places for hexadecimal numbers.

I'm coming around to an alternate plan:

Step 1: write a NUM_DEFAULT combination-of-options that gives the new
functions behavior very like strtol()/strtoul() but without their insane
features.

Step 2: rewrite all callers to use that option (and usually endptr=NULL,
meaning no trailing characters) unless it is manifestly clear that they
are already trying to forbid some other features. This will already
produce the largest benefit: avoiding overflows, missing error checking,
etc.

Steps 3 through ∞: as time allows, rewrite individual callsites to be
stricter where appropriate.

Hopefully steps 1 and 2 will not be too controversial.

Michael

[1] That assumption is based on a rather quick look over the code,
because with well over 100 callsites, it is not practical to study each
callsite carefully.

-- 
Michael Haggerty
mhag...@alum.mit.edu

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-24 Thread Junio C Hamano
Junio C Hamano gits...@pobox.com writes:

 Michael Haggerty mhag...@alum.mit.edu writes:

 It is easy to allow --abbrev=+7; I would just need to add NUM_PLUS to
 those call sites. Should I do so?

 The more relevant question to ask from my point of view is why you
 need to add NUM_PLUS to enable it.  What valid reason do you
 have to forbid it anywhere?  Only because you do not accept it by
 default, you need to add to enable.

 Why is it a problem to allow git cmd --hexval=0x1234, even if git
 cmd --hexval=1234 would suffice?

 In some cases we would like to allow that flexibility; in some cases
 not. But the strtol()/strtoul() functions *always* allow it.

 The same issue.  Whare are these some cases?

And the same issue appears in the leading whitespace thing I did
not mention in the earlier part of your message I responded to. I
also notice you answered yourself that there may not be a valid
reason to forbid end-user supplied 0x prefix to arguments we
expect an integer for in your other message.

In short, if it is not a clearly bogus input that indicates a typo
or something (e.g.  --size=48l? did the user meant 48, 48k, or
48m?), and if it is clear we can tell the user meant what the code
would naturally interpret as (e.g. --hexval=0x1234), why forbid
it?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-24 Thread Michael Haggerty
On 03/19/2015 07:22 AM, Junio C Hamano wrote:
 Michael Haggerty mhag...@alum.mit.edu writes:
 
 * It allows leading whitespace.
 
 This might be blessing in disguise.  Our friends on MacOS may be
 relying on that
 
 git cmd --abbrev=$(wc -c foo)
 
 to work as expected, even though their wc gives leading spaces,
 for example.

Yuck. If you'd like, I can make sure to continue allowing leading
whitespace everywhere that it is allowed now. Alternatively, we could
make the parsing tighter and see if anybody squeals. What do you think?

 * It allows arbitrary trailing characters.
 
 Which is why we have introduced strtoul_ui() and such.
 
 * It allows a leading sign character ('+' or '-'), even though the
   result is unsigned.
 
 I do not particularly see it a bad thing to accept --abbrev=+7.
 Using unsigned type to accept a width and parsing --abbrev=-7 into
 a large positive integer _is_ a problem, and we may want to fix it,
 but I think that is still within the scope of the original better
 strtol/strtoul, and I do not see it as a justification to introduce
 a set of functions with completely unfamiliar name.

It is easy to allow --abbrev=+7; I would just need to add NUM_PLUS to
those call sites. Should I do so?

It would even be possible to allow --abbrev=-0, which is also a
non-negative integer. But it's more work and it seems pretty silly to me.

 * If the string doesn't contain a number at all, it sets its endptr
   argument to point at the start of the string but doesn't set errno.
 
 Why is that a problem?  A caller that cares is supposed to check
 endptr and the string it gave, no?  Now, if strtoul_ui() and friends
 do not do so, I again think that is still within the scope of the
 original better strtol/strtoul.

The text you are quoting was my argument for why we need wrappers around
strtol() and strtoul(), not for how the wrappers should be named.

The endptr convention for detecting errors is fine in theory, but in
practice a large majority of callers didn't check for errors correctly.
This is partly because the endptr convention is so awkward.

The existing strtoul_ui() and strtol_i() did check endptr correctly, but
there were only int-sized variants of the functions, and they didn't
give the caller the chance to capture endptr if the caller wanted to
allow trailing characters.

Why did I rename the wrapper functions?

* The old names, I think, emphasize that they take the long-sized
results of strtou?l() and convert them to integer size, which I think is
an implementation detail.
* The new functions will also have long versions. The obvious way to
generalize the existing function names for long variants (srtoul_ul()
and strtol_l()) seem kindof redundant.
* I wanted to change the signature and behavior of the functions, so
knowledge of the existing functions wouldn't be super helpful in
understanding the new ones.

 * If the value is larger than fits in an unsigned long, it returns the
   value clamped to the range 0..ULONG_MAX (setting errno to ERANGE).
 
 Ditto.
 
 * If the value is between -ULONG_MAX and 0, it returns the positive
   integer with the same bit pattern, without setting errno(!) (I can
   hardly think of a scenario where this would be useful.)
 
 Ditto.
 
 * For base=0 (autodetect base), it allows a base specifier prefix 0x
   or 0 and parses the number accordingly. For base=16 it also allows
   a 0x prefix.
 
 Why is it a problem to allow git cmd --hexval=0x1234, even if git
 cmd --hexval=1234 would suffice?

In some cases we would like to allow that flexibility; in some cases
not. But the strtol()/strtoul() functions *always* allow it.

 When I looked around, I found scores of sites that call atoi(),
 strtoul(), and strtol() carelessly. And it's understandable. Calling
 any of these functions safely is so much work as to be completely
 impractical in day-to-day code.
 
 Yes, these burdens on the caller were exactly why strtoul_ui()
 etc. wanted to reduce---and it will be a welcome change to do
 necessary checks that are currently not done.
 
 Please see the docstrings in numparse.h in the first commit for
 detailed API docs. Briefly, there are eight new functions:

 parse_{l,ul,i,ui}(const char *s, unsigned int flags,
   T *result, char **endptr);
 convert_{l,ul,i,ui}(const char *s, unsigned int flags, T *result);

 The parse_*() functions are for parsing a number from the start of a
 string; the convert_*() functions are for parsing a string that
 consists of a single number.
 
 I am not sure if I follow.  Why not unify them into one 4-function
 set, with optional endptr that could be NULL?

In the next version of this patch series I plan to include only four
functions, str_to_{i,ui,l,ul}(), named that way to make them more
reminiscent of the functions that they replace. They will take an entptr
argument, but if that argument is set to NULL, then the function will
error out if there are any trailing characters.

 While we are on the topic of 

Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-24 Thread Junio C Hamano
Michael Haggerty mhag...@alum.mit.edu writes:

 Regarding specifically allowing/disallowing a leading '+': I saw a
 couple of callsites that explicitly check that the first character is a
 digit before calling strtol(). I assumed that is to disallow sign
 characters [1]. For example,

 diff.c: optarg()

This one I know is from if not a digit we know it is not a number;
it is not an attempt to say we must forbid numbers to be spelled
with '+', but more about we do not need it and this is easier to
code without a full fledged str-to-num helper API sloppiness.

 builtin/apply.c: parse_num()

This parses @@ -l,k +m,n @@@ after stripping the punctuation
around the numbers, so this is a valid reason why you would want an
optional feature CANNOT_HAVE_SIGN in the str-to-num parser and use
it.

 maybe date.c: match_multi_number()

The approxidate callers parse random garbage input and attempt to
make best sense out of it, so they tend to try limitting the damage
caused by incorrect guesses by insisting we do not consider +0 to
be zero and such.  So this is a valid reason why you would want an
optional feature CANNOT_HAVE_SIGN in the str-to-num parser and use
it.

 I'm coming around to an alternate plan:

 Step 1: write a NUM_DEFAULT combination-of-options that gives the new
 functions behavior very like strtol()/strtoul() but without their insane
 features.

 Step 2: rewrite all callers to use that option (and usually endptr=NULL,
 meaning no trailing characters) unless it is manifestly clear that they
 are already trying to forbid some other features. This will already
 produce the largest benefit: avoiding overflows, missing error checking,
 etc.

 Steps 3 through ∞: as time allows, rewrite individual callsites to be
 stricter where appropriate.

 Hopefully steps 1 and 2 will not be too controversial.

All sounds sensible to me.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-19 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 I wonder how much of the boilerplate in the parse_* functions could be
 factored out to use a uintmax_t, with the caller just providing the
 range. That would make it easier to add new types like off_t, and
 possibly even constrained types (e.g., an integer from 0 to 100). On the
 other hand, you mentioned to me elsewhere that there may be some bugs in
 the range-checks of config.c's integer parsing. I suspect they are
 related to exactly this kind of refactoring, so perhaps writing things
 out is best.

I like this idea very well.  I wonder if we can implement the family
of

parse_{type}(const char *, unsigned int flags,
 const char **endptr, {type} *result)

functions by calls a helper that internally deals with the numbers
in uintmax_t, and then checks if the value fits within the possible
range of the *result before returning.

int parse_i(const char *str, unsigned flags,
const char **endptr, int *result) {
uintmax_t val;
int sign = 1, status;
if (*str == '-') {
sign = -1; 
str++;
}
status = parse_num(str, flags, endptr, val, INT_MAX);
if (status  0)
return status;
*result = sign  0 ? -val : val;
return 0;
}

(assuming the last parameter to parse_num is used to check if the
result fits within that range).  Or that may be easier and cleaner
to be done in the caller with or something like that:

status = parse_num(str, flags, endptr, val);
if (status  0)
return status;
if (INT_MAX = val * sign || val * sign = INT_MIN) {
errno = ERANGE;
return -1;
}

If we wanted to, we may even be able to avoid duplication of
boilerplate by wrapping the above pattern within a macro,
parameterizing the TYPE_{MIN,MAX} and using token pasting, to
expand to four necessary result types.

There is no reason for the implementation of the parse_num() helper
to be using strtoul(3) or strtoull(3); its behaviour will be under
our total control.  It can become fancier by enriching the flags
bits (e.g. allow scaling factor, etc.) only once and all variants
for various result types will get the same feature.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-19 Thread Junio C Hamano
Michael Haggerty mhag...@alum.mit.edu writes:

 Here were my thoughts:

 * I wanted to change the interface to look less like
   strtol()/strtoul(), so it seemed appropriate to make the names
   more dissimilar.

One reason I prefer the names in the compat-util is that it makes it
clear that what is converted into what (e.g. str to ul) and what
type the result is stored (i or ui), and as an added bonus, with
a short-and-sweet to it is already so clear we are converting
that we do not have to use verbs like parse_ or convert_.  From
parse_i() and convert_i(), I cannot tell which one lacks the endptr
and parses to the end and which one takes the endptr and tells the
caller where the conversion ended (and it also does not tell us that
they are converting from a string).

 * The functions seemed long enough that they shouldn't be inline,
   and I wanted to put them in their own module rather than put
   them in git-compat-util.h.

I actually agree on this, but that is not a justification why they
have to be named very differently.

Having said that, I would say that it will be futile to discuss this
further, as we seem to agree to disagree.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-19 Thread Junio C Hamano
Michael Haggerty mhag...@alum.mit.edu writes:

 * It allows leading whitespace.

This might be blessing in disguise.  Our friends on MacOS may be
relying on that

git cmd --abbrev=$(wc -c foo)

to work as expected, even though their wc gives leading spaces,
for example.

 * It allows arbitrary trailing characters.

Which is why we have introduced strtoul_ui() and such.

 * It allows a leading sign character ('+' or '-'), even though the
   result is unsigned.

I do not particularly see it a bad thing to accept --abbrev=+7.
Using unsigned type to accept a width and parsing --abbrev=-7 into
a large positive integer _is_ a problem, and we may want to fix it,
but I think that is still within the scope of the original better
strtol/strtoul, and I do not see it as a justification to introduce
a set of functions with completely unfamiliar name.

 * If the string doesn't contain a number at all, it sets its endptr
   argument to point at the start of the string but doesn't set errno.

Why is that a problem?  A caller that cares is supposed to check
endptr and the string it gave, no?  Now, if strtoul_ui() and friends
do not do so, I again think that is still within the scope of the
original better strtol/strtoul.

 * If the value is larger than fits in an unsigned long, it returns the
   value clamped to the range 0..ULONG_MAX (setting errno to ERANGE).

Ditto.

 * If the value is between -ULONG_MAX and 0, it returns the positive
   integer with the same bit pattern, without setting errno(!) (I can
   hardly think of a scenario where this would be useful.)

Ditto.

 * For base=0 (autodetect base), it allows a base specifier prefix 0x
   or 0 and parses the number accordingly. For base=16 it also allows
   a 0x prefix.

Why is it a problem to allow git cmd --hexval=0x1234, even if git
cmd --hexval=1234 would suffice?

 When I looked around, I found scores of sites that call atoi(),
 strtoul(), and strtol() carelessly. And it's understandable. Calling
 any of these functions safely is so much work as to be completely
 impractical in day-to-day code.

Yes, these burdens on the caller were exactly why strtoul_ui()
etc. wanted to reduce---and it will be a welcome change to do
necessary checks that are currently not done.

 Please see the docstrings in numparse.h in the first commit for
 detailed API docs. Briefly, there are eight new functions:

 parse_{l,ul,i,ui}(const char *s, unsigned int flags,
   T *result, char **endptr);
 convert_{l,ul,i,ui}(const char *s, unsigned int flags, T *result);

 The parse_*() functions are for parsing a number from the start of a
 string; the convert_*() functions are for parsing a string that
 consists of a single number.

I am not sure if I follow.  Why not unify them into one 4-function
set, with optional endptr that could be NULL?

While we are on the topic of improving number parsing, one thing
that I found somewhat frustrating is that config.c layer knows the
scaling suffix but option parsing layer does not.  These functions
should offer an option (via their flags) to say I allow scaled
numbers like 2k, 4M, etc..

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-19 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 I wondered if we could do away with the radix entirely. Wouldn't we be
 asking for base 10 most of the time? Of course, your first few patches
 show octal parsing, but I wonder if we should actually have a separate
 parse_mode() for that, since that seems to be the general reason for
 doing octal parsing. 10644 does not overflow an int, but it is
 hardly a reasonable mode.

True.

 I also wondered if we could get rid of NUM_SIGN in favor of just having
 the type imply it (e.g., convert_l would always allow negative numbers,
 whereas convert_ul would not). But I suppose there are times when we end
 up using an int to store an unsigned value for a good reason (e.g.,
 -1 is a sentinel value, but we expect only positive values from the
 user). So that might be a bad idea.

Yeah, there is a reason why ssize_t exists.

 IMHO most of the tightening happening here is a good thing, and means we
 are more likely to notice mistakes rather than silently doing something
 stupid.

I do like the general idea of making sure we avoid use of bare
atoi() and unchecked use of strtol() and such, and have a set of
well defined helper functions to perform the common checks, exactly
for the reason you said.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-18 Thread Duy Nguyen
On Wed, Mar 18, 2015 at 4:47 PM, Michael Haggerty mhag...@alum.mit.edu wrote:
 Thank you for doing it. I was about to write another number parser and
 you did it :D Maybe you can add another patch to convert the only
 strtol in upload-pack.c to parse_ui. This place should accept positive
 number in base 10, plus sign is not accepted.

 If the general direction of this patch series is accepted, I'll
 gradually try to go through the codebase, replacing *all* integer
 parsing with these functions. So there's no need to request particular
 callers of strtol()/strtoul() to be converted; I'll get to them all
 sooner or later (I hope).

Good to know. No there's no hurry in converting this particular place.
It's just tightening things up, not bug fixing or anything.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-18 Thread Michael Haggerty
On 03/18/2015 11:03 AM, Jeff King wrote:
 On Wed, Mar 18, 2015 at 10:47:40AM +0100, Michael Haggerty wrote:
 
 But in case you have some reason that you want upload-pack.c to be
 converted right away, I just pushed that change (plus some related
 cleanups) to my GitHub repo [1]. The branch depends only on the first
 patch of the numparse patch series.

 By the way, some other packet line parsing code in that file doesn't
 verify that there are no trailing characters on the lines that they
 process. That might be another thing that should be tightened up.
 
 Do you mean that upload-pack gets a pkt-line of length N that contains a
 line of length M, and then doesn't check that M==N?  We use the space
 between M and N for passing capabilities and other metadata around.
 
 Or do you mean that we see lines like:
 
   want [0-9a-f]{40} ...\n
 
 and do not bother looking at the ... that comes after the data we
 expect? That I can believe, and I don't think it would hurt to tighten
 up (we shouldn't need it for extensibility, as anybody trying to stick
 extra data there should do so only after using a capability flag earlier
 in the protocol).

The latter. For example here [1], the have command and its SHA-1 are
read from the line, but I don't see a check that there are no characters
after the SHA-1. The same here [2].

Michael

[1]
https://github.com/gitster/git/blob/9ab698f4000a736864c41f57fbae1e021ac27799/upload-pack.c#L404-L429
[2]
https://github.com/gitster/git/blob/9ab698f4000a736864c41f57fbae1e021ac27799/upload-pack.c#L550-L565

-- 
Michael Haggerty
mhag...@alum.mit.edu

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-18 Thread Michael Haggerty
On 03/18/2015 12:05 AM, Duy Nguyen wrote:
 On Tue, Mar 17, 2015 at 11:00 PM, Michael Haggerty mhag...@alum.mit.edu 
 wrote:
 Michael Haggerty (14):
   numparse: new module for parsing integral numbers
   cacheinfo_callback(): use convert_ui() when handling --cacheinfo
   write_subdirectory(): use convert_ui() for parsing mode
   handle_revision_opt(): use skip_prefix() in many places
   handle_revision_opt(): use convert_i() when handling -digit
   strtoul_ui(), strtol_i(): remove functions
   handle_revision_opt(): use convert_ui() when handling --abbrev=
   builtin_diff(): detect errors when parsing --unified argument
   opt_arg(): val is always non-NULL
   opt_arg(): use convert_i() in implementation
   opt_arg(): report errors parsing option values
   opt_arg(): simplify pointer handling
   diff_opt_parse(): use convert_i() when handling -lnum
   diff_opt_parse(): use convert_i() when handling --abbrev=num
 
 Thank you for doing it. I was about to write another number parser and
 you did it :D Maybe you can add another patch to convert the only
 strtol in upload-pack.c to parse_ui. This place should accept positive
 number in base 10, plus sign is not accepted.

If the general direction of this patch series is accepted, I'll
gradually try to go through the codebase, replacing *all* integer
parsing with these functions. So there's no need to request particular
callers of strtol()/strtoul() to be converted; I'll get to them all
sooner or later (I hope).

But in case you have some reason that you want upload-pack.c to be
converted right away, I just pushed that change (plus some related
cleanups) to my GitHub repo [1]. The branch depends only on the first
patch of the numparse patch series.

By the way, some other packet line parsing code in that file doesn't
verify that there are no trailing characters on the lines that they
process. That might be another thing that should be tightened up.

Michael

[1] https://github.com/mhagger/git.git, branch upload-pack-numparse

-- 
Michael Haggerty
mhag...@alum.mit.edu

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-18 Thread Jeff King
On Wed, Mar 18, 2015 at 10:47:40AM +0100, Michael Haggerty wrote:

 But in case you have some reason that you want upload-pack.c to be
 converted right away, I just pushed that change (plus some related
 cleanups) to my GitHub repo [1]. The branch depends only on the first
 patch of the numparse patch series.
 
 By the way, some other packet line parsing code in that file doesn't
 verify that there are no trailing characters on the lines that they
 process. That might be another thing that should be tightened up.

Do you mean that upload-pack gets a pkt-line of length N that contains a
line of length M, and then doesn't check that M==N?  We use the space
between M and N for passing capabilities and other metadata around.

Or do you mean that we see lines like:

  want [0-9a-f]{40} ...\n

and do not bother looking at the ... that comes after the data we
expect? That I can believe, and I don't think it would hurt to tighten
up (we shouldn't need it for extensibility, as anybody trying to stick
extra data there should do so only after using a capability flag earlier
in the protocol).

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-18 Thread Jeff King
On Tue, Mar 17, 2015 at 05:00:02PM +0100, Michael Haggerty wrote:

 My main questions:
 
 * Do people like the API? My main goal was to make these functions as
   painless as possible to use correctly, because there are so many
   call sites.
 
 * Is it too gimmicky to encode the base together with other options in
   `flags`? (I think it is worth it to avoid the need for another
   parameter, which callers could easily put in the wrong order.)

I definitely like the overall direction of this. My first thought was
that it does seem like there are a lot of possible options to the
functions (and OR-ing the flags with the base does seem weird, though I
can't think of a plausible case where it would actually cause errors).
Many of those options don't seem used in the example conversions (I'm
not clear who would want NUM_SATURATE, for example).

I wondered if we could do away with the radix entirely. Wouldn't we be
asking for base 10 most of the time? Of course, your first few patches
show octal parsing, but I wonder if we should actually have a separate
parse_mode() for that, since that seems to be the general reason for
doing octal parsing. 10644 does not overflow an int, but it is
hardly a reasonable mode.

I also wondered if we could get rid of NUM_SIGN in favor of just having
the type imply it (e.g., convert_l would always allow negative numbers,
whereas convert_ul would not). But I suppose there are times when we end
up using an int to store an unsigned value for a good reason (e.g.,
-1 is a sentinel value, but we expect only positive values from the
user). So that might be a bad idea.

I notice that you go up to unsigned long here for sizes. If we want to
use this everywhere, we'll need something larger for parsing off_t
values on 32-bit machines (though I would not at all be surprised with
the current code if 32-bit machines have a hard time configuring a
pack.packSizeLimit above 4G).

I wonder how much of the boilerplate in the parse_* functions could be
factored out to use a uintmax_t, with the caller just providing the
range. That would make it easier to add new types like off_t, and
possibly even constrained types (e.g., an integer from 0 to 100). On the
other hand, you mentioned to me elsewhere that there may be some bugs in
the range-checks of config.c's integer parsing. I suspect they are
related to exactly this kind of refactoring, so perhaps writing things
out is best.

 * Am I making callers too strict? In many cases where a positive
   integer is expected (e.g., --abbrev=num), I have been replacing
   code like
 [...]

IMHO most of the tightening happening here is a good thing, and means we
are more likely to notice mistakes rather than silently doing something
stupid.

For sites that currently allow it, I could imagine people using hex
notation for some values, though, depending on the context. It looks
there aren't many of them ((it is just when the radix is 0, right?).
Some of them look to be accidental (does anybody really ask for
--threads=0x10?), but others might not be (the pack index-version
contains an offset field that might be quite big).


Feel free to ignore any or all of that. It is not so much criticism as
a dump of thoughts I had while reading the patches. Perhaps you can pick
something useful out of that, and perhaps not. :)

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-17 Thread Junio C Hamano
Michael Haggerty mhag...@alum.mit.edu writes:

 When I looked around, I found scores of sites that call atoi(),
 strtoul(), and strtol() carelessly. And it's understandable. Calling
 any of these functions safely is so much work as to be completely
 impractical in day-to-day code.

 git-compat-util.h has two functions, strtoul_ui() and strtol_i(), that
 try to make parsing integers a little bit easier.

Exactly.  They were introduced to prevent sloppy callers from
passing NULL to the end parameter to underlying strtoul(3).  The
first thing that came to my mind while reading your message up to
this point was why not use them more, tighten them, adding
different variants if necessary, instead of introducing an entirely
new set of functions in a new namespace?

For example:

 * Am I making callers too strict? In many cases where a positive
   integer is expected (e.g., --abbrev=num), I have been replacing
   code like

   result = strtoul(s, NULL, 10);

   with

   if (convert_i(s, 10, result))
   die(...);

I think strictness would be good and those who did --abbrev='  20'
deserve what they get from the additional strictness, but 

if (strtol_i(s, 10, result))

would have been more familiar.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-17 Thread Duy Nguyen
On Tue, Mar 17, 2015 at 11:00 PM, Michael Haggerty mhag...@alum.mit.edu wrote:
 Michael Haggerty (14):
   numparse: new module for parsing integral numbers
   cacheinfo_callback(): use convert_ui() when handling --cacheinfo
   write_subdirectory(): use convert_ui() for parsing mode
   handle_revision_opt(): use skip_prefix() in many places
   handle_revision_opt(): use convert_i() when handling -digit
   strtoul_ui(), strtol_i(): remove functions
   handle_revision_opt(): use convert_ui() when handling --abbrev=
   builtin_diff(): detect errors when parsing --unified argument
   opt_arg(): val is always non-NULL
   opt_arg(): use convert_i() in implementation
   opt_arg(): report errors parsing option values
   opt_arg(): simplify pointer handling
   diff_opt_parse(): use convert_i() when handling -lnum
   diff_opt_parse(): use convert_i() when handling --abbrev=num

Thank you for doing it. I was about to write another number parser and
you did it :D Maybe you can add another patch to convert the only
strtol in upload-pack.c to parse_ui. This place should accept positive
number in base 10, plus sign is not accepted.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-17 Thread Michael Haggerty
On 03/17/2015 07:48 PM, Junio C Hamano wrote:
 Michael Haggerty mhag...@alum.mit.edu writes:
 
 When I looked around, I found scores of sites that call atoi(),
 strtoul(), and strtol() carelessly. And it's understandable. Calling
 any of these functions safely is so much work as to be completely
 impractical in day-to-day code.

 git-compat-util.h has two functions, strtoul_ui() and strtol_i(), that
 try to make parsing integers a little bit easier.
 
 Exactly.  They were introduced to prevent sloppy callers from
 passing NULL to the end parameter to underlying strtoul(3).  The
 first thing that came to my mind while reading your message up to
 this point was why not use them more, tighten them, adding
 different variants if necessary, instead of introducing an entirely
 new set of functions in a new namespace?

Here were my thoughts:

* I wanted to change the interface to look less like
  strtol()/strtoul(), so it seemed appropriate to make the names
  more dissimilar.

* The functions seemed long enough that they shouldn't be inline,
  and I wanted to put them in their own module rather than put
  them in git-compat-util.h.

* It wasn't obvious to me how to generalize the names, strtoul_ui()
  and strtol_i(), to the eight functions that I wanted.

That being said, I'm not married to the names. Suggestions are
welcome--but we would need names for 8 functions, not 4 [1].

Michael

 For example:
 
 * Am I making callers too strict? In many cases where a positive
   integer is expected (e.g., --abbrev=num), I have been replacing
   code like

   result = strtoul(s, NULL, 10);

   with

   if (convert_i(s, 10, result))
   die(...);
 
 I think strictness would be good and those who did --abbrev='  20'
 deserve what they get from the additional strictness, but 
 
   if (strtol_i(s, 10, result))
 
 would have been more familiar.

[1] It could be that when we're done, it will turn out that some of the
eight variants are not needed. If so, we can delete them then.

-- 
Michael Haggerty
mhag...@alum.mit.edu

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/14] numparse module: systematically tighten up integer parsing

2015-03-17 Thread Michael Haggerty
Sorry for the long email. Feel free to skip over the background info
and continue reading at My solution below.

This odyssey started with a typo:

$ git shortlog -snl v2.2.0..v2.3.0
 14795  Junio C Hamano
  1658  Jeff King
  1400  Shawn O. Pearce
  1109  Linus Torvalds
   760  Jonathan Nieder
[...]

It took me a minute to realize why so many commits are listed. It
turns out that -l, which I added by accident, requires an integer
argument, but it happily swallows v2.2.0..v2.3.0 without error. This
leaves no range argument, so the command traverses the entire history
of HEAD.

The relevant code is

else if ((argcount = short_opt('l', av, optarg))) {
options-rename_limit = strtoul(optarg, NULL, 10);
return argcount;
}

, which is broken in many ways. First of all, strtoul() is way too
permissive for simple tasks like this:

* It allows leading whitespace.

* It allows arbitrary trailing characters.

* It allows a leading sign character ('+' or '-'), even though the
  result is unsigned.

* If the string doesn't contain a number at all, it sets its endptr
  argument to point at the start of the string but doesn't set errno.

* If the value is larger than fits in an unsigned long, it returns the
  value clamped to the range 0..ULONG_MAX (setting errno to ERANGE).

* If the value is between -ULONG_MAX and 0, it returns the positive
  integer with the same bit pattern, without setting errno(!) (I can
  hardly think of a scenario where this would be useful.)

* For base=0 (autodetect base), it allows a base specifier prefix 0x
  or 0 and parses the number accordingly. For base=16 it also allows
  a 0x prefix.

strtol() has similar foibles.

The caller compounds the permissivity of strtoul() with further sins:

* It does absolutely no error detection.

* It assigns the return value, which is an unsigned long, to an int
  value. This could truncate the result, perhaps even resulting in
  rename_limit being set to a negative value.

When I looked around, I found scores of sites that call atoi(),
strtoul(), and strtol() carelessly. And it's understandable. Calling
any of these functions safely is so much work as to be completely
impractical in day-to-day code.

git-compat-util.h has two functions, strtoul_ui() and strtol_i(), that
try to make parsing integers a little bit easier. But they are only
used in a few places, they hardly restrict the pathological
flexibility of strtoul()/strtol(), strtoul_ui() doesn't implement
clamping consistently when converting from unsigned long to unsigned
int, and neither function can be used when the integer value *is*
followed by other characters.


My solution: the numparse module

So I hereby propose a new module, numparse, to make it easier to parse
integral values from strings in a convenient, safe, and flexible way.
The first commit introduces the new module, and subsequent commits
change a sample (a small fraction!) of call sites to use the new
functions. Consider it a proof-of-concept. If people are OK with this
approach, I will continue sending patches to fix other call sites. (I
already have another two dozen such patches in my repo).

Please see the docstrings in numparse.h in the first commit for
detailed API docs. Briefly, there are eight new functions:

parse_{l,ul,i,ui}(const char *s, unsigned int flags,
  T *result, char **endptr);
convert_{l,ul,i,ui}(const char *s, unsigned int flags, T *result);

The parse_*() functions are for parsing a number from the start of a
string; the convert_*() functions are for parsing a string that
consists of a single number. The flags argument selects not only the
base of the number, but also which of strtol()/strtoul()'s many
features should be allowed. The *_i() and *.ui() functions are for
parsing int and unsigned int values; they are careful about how they
truncate the corresponding long values. The functions all return 0 on
success and a negative number on error.

Here are a few examples of how these functions can be used (taken from
the header of numparse.h):

* Convert hexadecimal string s into an unsigned int. Die if there are
  any characters in s besides hexadecimal digits, or if the result
  exceeds the range of an unsigned int:

if (convert_ui(s, 16, result))
die(...);

* Read a base-ten long number from the front of a string, allowing
  sign characters and setting endptr to point at any trailing
  characters:

if (parse_l(s, 10 | NUM_SIGN | NUM_TRAILING, result, endptr))
die(...);

* Convert decimal string s into a signed int, but not allowing the
  string to contain a '+' or '-' prefix (and thereby indirectly
  ensuring that the result will be non-negative):

if (convert_i(s, 10, result))
die(...);

* Convert s into a signed int, interpreting prefix 0x to mean
  hexadecimal and 0 to mean octal. If the value doesn't fit in an
  unsigned int, set result