Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
On 03/24/2015 05:49 PM, René Scharfe wrote: Am 24.03.2015 um 17:06 schrieb Michael Haggerty: Parsing numbers is not rocket science, but there are a lot of pitfalls, especially around overflow. It's even harder to write such code via macros and the result is less readable. This patch series is mostly about finding a reasonable API and whipping the callers into shape. That seems ambitious enough for me. I'd rather stick with boring wrappers for now and lean on strtol()/strtoul() to do the dirty work. It will be easy for a motivated person to change the implementation later. The OpenBSD developers invented strtonum for that. Are you aware of it? Would it fit? This discussion may be useful: http://www.tedunangst.com/flak/post/the-design-of-strtonum I wasn't aware of strtonum; thanks for the reference. It has an untraditional interface. Their willingness to sacrifice half of the unsigned range and requirement that the user specify minval and maxval have the nice effect of permitting one function to be used for all integer types. I think git will need more flexibility, for example to support other radixes and to allow trailing characters. So personally I don't think we should use (or imitate) strtonum(). Michael -- Michael Haggerty mhag...@alum.mit.edu -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
Michael Haggerty mhag...@alum.mit.edu writes: [jc: dropped a non-working address from Cc list] I wasn't aware of strtonum; thanks for the reference. It has an untraditional interface. Their willingness to sacrifice half of the unsigned range and requirement that the user specify minval and maxval have the nice effect of permitting one function to be used for all integer types. ... I think git will need more flexibility, for example to support other radixes and to allow trailing characters. So personally I don't think we should use (or imitate) strtonum(). I had the same impression. An earlier suggestion by Peff about uintmax_t may be something we might want to keep in mind; we might want to use strtoll/strtoull as the underlying implementation if we wanted to avoid rolling our own. Thanks. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
Michael Haggerty mhag...@alum.mit.edu writes: It is easy to allow --abbrev=+7; I would just need to add NUM_PLUS to those call sites. Should I do so? The more relevant question to ask from my point of view is why you need to add NUM_PLUS to enable it. What valid reason do you have to forbid it anywhere? Only because you do not accept it by default, you need to add to enable. Why is it a problem to allow git cmd --hexval=0x1234, even if git cmd --hexval=1234 would suffice? In some cases we would like to allow that flexibility; in some cases not. But the strtol()/strtoul() functions *always* allow it. The same issue. Whare are these some cases? -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
On 03/19/2015 06:26 AM, Jeff King wrote: On Tue, Mar 17, 2015 at 05:00:02PM +0100, Michael Haggerty wrote: My main questions: * Do people like the API? My main goal was to make these functions as painless as possible to use correctly, because there are so many call sites. * Is it too gimmicky to encode the base together with other options in `flags`? (I think it is worth it to avoid the need for another parameter, which callers could easily put in the wrong order.) I definitely like the overall direction of this. My first thought was that it does seem like there are a lot of possible options to the functions (and OR-ing the flags with the base does seem weird, though I can't think of a plausible case where it would actually cause errors). Many of those options don't seem used in the example conversions (I'm not clear who would want NUM_SATURATE, for example). There are a lot of options, but so far only few of them have been used. As the call sites are rewritten we will see which of these features are useful, and which can be dispensed with. If groups of options tend to be used together, we can define constants for them like I've done with NUM_SLOPPY and NUM_SIGN. Regarding NUM_SATURATE: I'm not sure who would want it either, but I thought there might be places where the user wants to specify (effectively) infinity, and it might be convenient to let him specify something easy to type like --max= rather than --max=2147483647. I wondered if we could do away with the radix entirely. Wouldn't we be asking for base 10 most of the time? Of course, your first few patches show octal parsing, but I wonder if we should actually have a separate parse_mode() for that, since that seems to be the general reason for doing octal parsing. 10644 does not overflow an int, but it is hardly a reasonable mode. Again, as a first pass I wanted to just have a really flexible API so that call sites can be rewritten without a lot of extra thought. If somebody wants to add a parse_mode() function, it will be easy to build on top of convert_ui(). But that change can be done after this one. I also wondered if we could get rid of NUM_SIGN in favor of just having the type imply it (e.g., convert_l would always allow negative numbers, whereas convert_ul would not). But I suppose there are times when we end up using an int to store an unsigned value for a good reason (e.g., -1 is a sentinel value, but we expect only positive values from the user). So that might be a bad idea. Yes, as I was rewriting call sites, I found many that used the unsigned variants of the parsing functions but stored the result in an int. Probably some of these use -1 to denote unset; it might be that there are other cases where the variable could actually be declared to be unsigned int. Prohibiting signs when parsing signed quantities isn't really elegant from an API purity point of view, but it sure is handy! I notice that you go up to unsigned long here for sizes. If we want to use this everywhere, we'll need something larger for parsing off_t values on 32-bit machines (though I would not at all be surprised with the current code if 32-bit machines have a hard time configuring a pack.packSizeLimit above 4G). Yes, probably. I haven't run into such call sites yet. I wonder how much of the boilerplate in the parse_* functions could be factored out to use a uintmax_t, with the caller just providing the range. That would make it easier to add new types like off_t, and possibly even constrained types (e.g., an integer from 0 to 100). On the other hand, you mentioned to me elsewhere that there may be some bugs in the range-checks of config.c's integer parsing. I suspect they are related to exactly this kind of refactoring, so perhaps writing things out is best. It's not a lot of code yet. If we find out we need variants for size_t and off_t and uintmax_t and intmax_t then such a refactoring would definitely be worth considering. * Am I making callers too strict? In many cases where a positive integer is expected (e.g., --abbrev=num), I have been replacing code like [...] IMHO most of the tightening happening here is a good thing, and means we are more likely to notice mistakes rather than silently doing something stupid. For sites that currently allow it, I could imagine people using hex notation for some values, though, depending on the context. It looks there aren't many of them ((it is just when the radix is 0, right?). Some of them look to be accidental (does anybody really ask for --threads=0x10?), but others might not be (the pack index-version contains an offset field that might be quite big). Yes, we can even debate whether we want to implement a general policy that user-entered integers can be specified in any radix. Probably nobody will ever specify --threads=0x10, but is there harm in allowing it? Against the gain in flexibility, I see the following
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
On 03/19/2015 08:32 AM, Junio C Hamano wrote: Jeff King p...@peff.net writes: I wonder how much of the boilerplate in the parse_* functions could be factored out to use a uintmax_t, with the caller just providing the range. That would make it easier to add new types like off_t, and possibly even constrained types (e.g., an integer from 0 to 100). On the other hand, you mentioned to me elsewhere that there may be some bugs in the range-checks of config.c's integer parsing. I suspect they are related to exactly this kind of refactoring, so perhaps writing things out is best. I like this idea very well. I wonder if we can implement the family of parse_{type}(const char *, unsigned int flags, const char **endptr, {type} *result) functions by calls a helper that internally deals with the numbers in uintmax_t, and then checks if the value fits within the possible range of the *result before returning. int parse_i(const char *str, unsigned flags, const char **endptr, int *result) { uintmax_t val; int sign = 1, status; if (*str == '-') { sign = -1; str++; } status = parse_num(str, flags, endptr, val, INT_MAX); if (status 0) return status; *result = sign 0 ? -val : val; return 0; } (assuming the last parameter to parse_num is used to check if the result fits within that range). Or that may be easier and cleaner to be done in the caller with or something like that: status = parse_num(str, flags, endptr, val); if (status 0) return status; if (INT_MAX = val * sign || val * sign = INT_MIN) { errno = ERANGE; return -1; } If we wanted to, we may even be able to avoid duplication of boilerplate by wrapping the above pattern within a macro, parameterizing the TYPE_{MIN,MAX} and using token pasting, to expand to four necessary result types. There is no reason for the implementation of the parse_num() helper to be using strtoul(3) or strtoull(3); its behaviour will be under our total control. It can become fancier by enriching the flags bits (e.g. allow scaling factor, etc.) only once and all variants for various result types will get the same feature. Parsing numbers is not rocket science, but there are a lot of pitfalls, especially around overflow. It's even harder to write such code via macros and the result is less readable. This patch series is mostly about finding a reasonable API and whipping the callers into shape. That seems ambitious enough for me. I'd rather stick with boring wrappers for now and lean on strtol()/strtoul() to do the dirty work. It will be easy for a motivated person to change the implementation later. Michael -- Michael Haggerty mhag...@alum.mit.edu -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
Am 24.03.2015 um 17:06 schrieb Michael Haggerty: Parsing numbers is not rocket science, but there are a lot of pitfalls, especially around overflow. It's even harder to write such code via macros and the result is less readable. This patch series is mostly about finding a reasonable API and whipping the callers into shape. That seems ambitious enough for me. I'd rather stick with boring wrappers for now and lean on strtol()/strtoul() to do the dirty work. It will be easy for a motivated person to change the implementation later. The OpenBSD developers invented strtonum for that. Are you aware of it? Would it fit? This discussion may be useful: http://www.tedunangst.com/flak/post/the-design-of-strtonum René -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
On 03/24/2015 04:58 PM, Junio C Hamano wrote: Michael Haggerty mhag...@alum.mit.edu writes: It is easy to allow --abbrev=+7; I would just need to add NUM_PLUS to those call sites. Should I do so? The more relevant question to ask from my point of view is why you need to add NUM_PLUS to enable it. What valid reason do you have to forbid it anywhere? Only because you do not accept it by default, you need to add to enable. I want to be able to plunge into this project without first auditing all call sites to see which features will turn out to be needed. So I'm erring on the side of flexibility. For now, I want to be able to prohibit '+' signs. Currently all of the flags cause additional features to be enabled. My guess was that most callers *won't* need most features, so it seemed easiest and most consistent to have all features be turned off by default and let the caller add the features that it wants to allow. Regarding specifically allowing/disallowing a leading '+': I saw a couple of callsites that explicitly check that the first character is a digit before calling strtol(). I assumed that is to disallow sign characters [1]. For example, diff.c: optarg() builtin/apply.c: parse_num() maybe date.c: match_multi_number() There are other callsites that call strtoul(), but store the result in a signed variable. Those would presumably not want to allow leading '-', but I'm not sure. I also imagine that there are places in protocols or file formats where signs should not be allowed (e.g., timestamps in commits?). Why is it a problem to allow git cmd --hexval=0x1234, even if git cmd --hexval=1234 would suffice? In some cases we would like to allow that flexibility; in some cases not. But the strtol()/strtoul() functions *always* allow it. The same issue. Whare are these some cases? I admit I'm not sure there are such places for hexadecimal numbers. I'm coming around to an alternate plan: Step 1: write a NUM_DEFAULT combination-of-options that gives the new functions behavior very like strtol()/strtoul() but without their insane features. Step 2: rewrite all callers to use that option (and usually endptr=NULL, meaning no trailing characters) unless it is manifestly clear that they are already trying to forbid some other features. This will already produce the largest benefit: avoiding overflows, missing error checking, etc. Steps 3 through ∞: as time allows, rewrite individual callsites to be stricter where appropriate. Hopefully steps 1 and 2 will not be too controversial. Michael [1] That assumption is based on a rather quick look over the code, because with well over 100 callsites, it is not practical to study each callsite carefully. -- Michael Haggerty mhag...@alum.mit.edu -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
Junio C Hamano gits...@pobox.com writes: Michael Haggerty mhag...@alum.mit.edu writes: It is easy to allow --abbrev=+7; I would just need to add NUM_PLUS to those call sites. Should I do so? The more relevant question to ask from my point of view is why you need to add NUM_PLUS to enable it. What valid reason do you have to forbid it anywhere? Only because you do not accept it by default, you need to add to enable. Why is it a problem to allow git cmd --hexval=0x1234, even if git cmd --hexval=1234 would suffice? In some cases we would like to allow that flexibility; in some cases not. But the strtol()/strtoul() functions *always* allow it. The same issue. Whare are these some cases? And the same issue appears in the leading whitespace thing I did not mention in the earlier part of your message I responded to. I also notice you answered yourself that there may not be a valid reason to forbid end-user supplied 0x prefix to arguments we expect an integer for in your other message. In short, if it is not a clearly bogus input that indicates a typo or something (e.g. --size=48l? did the user meant 48, 48k, or 48m?), and if it is clear we can tell the user meant what the code would naturally interpret as (e.g. --hexval=0x1234), why forbid it? -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
On 03/19/2015 07:22 AM, Junio C Hamano wrote: Michael Haggerty mhag...@alum.mit.edu writes: * It allows leading whitespace. This might be blessing in disguise. Our friends on MacOS may be relying on that git cmd --abbrev=$(wc -c foo) to work as expected, even though their wc gives leading spaces, for example. Yuck. If you'd like, I can make sure to continue allowing leading whitespace everywhere that it is allowed now. Alternatively, we could make the parsing tighter and see if anybody squeals. What do you think? * It allows arbitrary trailing characters. Which is why we have introduced strtoul_ui() and such. * It allows a leading sign character ('+' or '-'), even though the result is unsigned. I do not particularly see it a bad thing to accept --abbrev=+7. Using unsigned type to accept a width and parsing --abbrev=-7 into a large positive integer _is_ a problem, and we may want to fix it, but I think that is still within the scope of the original better strtol/strtoul, and I do not see it as a justification to introduce a set of functions with completely unfamiliar name. It is easy to allow --abbrev=+7; I would just need to add NUM_PLUS to those call sites. Should I do so? It would even be possible to allow --abbrev=-0, which is also a non-negative integer. But it's more work and it seems pretty silly to me. * If the string doesn't contain a number at all, it sets its endptr argument to point at the start of the string but doesn't set errno. Why is that a problem? A caller that cares is supposed to check endptr and the string it gave, no? Now, if strtoul_ui() and friends do not do so, I again think that is still within the scope of the original better strtol/strtoul. The text you are quoting was my argument for why we need wrappers around strtol() and strtoul(), not for how the wrappers should be named. The endptr convention for detecting errors is fine in theory, but in practice a large majority of callers didn't check for errors correctly. This is partly because the endptr convention is so awkward. The existing strtoul_ui() and strtol_i() did check endptr correctly, but there were only int-sized variants of the functions, and they didn't give the caller the chance to capture endptr if the caller wanted to allow trailing characters. Why did I rename the wrapper functions? * The old names, I think, emphasize that they take the long-sized results of strtou?l() and convert them to integer size, which I think is an implementation detail. * The new functions will also have long versions. The obvious way to generalize the existing function names for long variants (srtoul_ul() and strtol_l()) seem kindof redundant. * I wanted to change the signature and behavior of the functions, so knowledge of the existing functions wouldn't be super helpful in understanding the new ones. * If the value is larger than fits in an unsigned long, it returns the value clamped to the range 0..ULONG_MAX (setting errno to ERANGE). Ditto. * If the value is between -ULONG_MAX and 0, it returns the positive integer with the same bit pattern, without setting errno(!) (I can hardly think of a scenario where this would be useful.) Ditto. * For base=0 (autodetect base), it allows a base specifier prefix 0x or 0 and parses the number accordingly. For base=16 it also allows a 0x prefix. Why is it a problem to allow git cmd --hexval=0x1234, even if git cmd --hexval=1234 would suffice? In some cases we would like to allow that flexibility; in some cases not. But the strtol()/strtoul() functions *always* allow it. When I looked around, I found scores of sites that call atoi(), strtoul(), and strtol() carelessly. And it's understandable. Calling any of these functions safely is so much work as to be completely impractical in day-to-day code. Yes, these burdens on the caller were exactly why strtoul_ui() etc. wanted to reduce---and it will be a welcome change to do necessary checks that are currently not done. Please see the docstrings in numparse.h in the first commit for detailed API docs. Briefly, there are eight new functions: parse_{l,ul,i,ui}(const char *s, unsigned int flags, T *result, char **endptr); convert_{l,ul,i,ui}(const char *s, unsigned int flags, T *result); The parse_*() functions are for parsing a number from the start of a string; the convert_*() functions are for parsing a string that consists of a single number. I am not sure if I follow. Why not unify them into one 4-function set, with optional endptr that could be NULL? In the next version of this patch series I plan to include only four functions, str_to_{i,ui,l,ul}(), named that way to make them more reminiscent of the functions that they replace. They will take an entptr argument, but if that argument is set to NULL, then the function will error out if there are any trailing characters. While we are on the topic of
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
Michael Haggerty mhag...@alum.mit.edu writes: Regarding specifically allowing/disallowing a leading '+': I saw a couple of callsites that explicitly check that the first character is a digit before calling strtol(). I assumed that is to disallow sign characters [1]. For example, diff.c: optarg() This one I know is from if not a digit we know it is not a number; it is not an attempt to say we must forbid numbers to be spelled with '+', but more about we do not need it and this is easier to code without a full fledged str-to-num helper API sloppiness. builtin/apply.c: parse_num() This parses @@ -l,k +m,n @@@ after stripping the punctuation around the numbers, so this is a valid reason why you would want an optional feature CANNOT_HAVE_SIGN in the str-to-num parser and use it. maybe date.c: match_multi_number() The approxidate callers parse random garbage input and attempt to make best sense out of it, so they tend to try limitting the damage caused by incorrect guesses by insisting we do not consider +0 to be zero and such. So this is a valid reason why you would want an optional feature CANNOT_HAVE_SIGN in the str-to-num parser and use it. I'm coming around to an alternate plan: Step 1: write a NUM_DEFAULT combination-of-options that gives the new functions behavior very like strtol()/strtoul() but without their insane features. Step 2: rewrite all callers to use that option (and usually endptr=NULL, meaning no trailing characters) unless it is manifestly clear that they are already trying to forbid some other features. This will already produce the largest benefit: avoiding overflows, missing error checking, etc. Steps 3 through ∞: as time allows, rewrite individual callsites to be stricter where appropriate. Hopefully steps 1 and 2 will not be too controversial. All sounds sensible to me. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
Jeff King p...@peff.net writes: I wonder how much of the boilerplate in the parse_* functions could be factored out to use a uintmax_t, with the caller just providing the range. That would make it easier to add new types like off_t, and possibly even constrained types (e.g., an integer from 0 to 100). On the other hand, you mentioned to me elsewhere that there may be some bugs in the range-checks of config.c's integer parsing. I suspect they are related to exactly this kind of refactoring, so perhaps writing things out is best. I like this idea very well. I wonder if we can implement the family of parse_{type}(const char *, unsigned int flags, const char **endptr, {type} *result) functions by calls a helper that internally deals with the numbers in uintmax_t, and then checks if the value fits within the possible range of the *result before returning. int parse_i(const char *str, unsigned flags, const char **endptr, int *result) { uintmax_t val; int sign = 1, status; if (*str == '-') { sign = -1; str++; } status = parse_num(str, flags, endptr, val, INT_MAX); if (status 0) return status; *result = sign 0 ? -val : val; return 0; } (assuming the last parameter to parse_num is used to check if the result fits within that range). Or that may be easier and cleaner to be done in the caller with or something like that: status = parse_num(str, flags, endptr, val); if (status 0) return status; if (INT_MAX = val * sign || val * sign = INT_MIN) { errno = ERANGE; return -1; } If we wanted to, we may even be able to avoid duplication of boilerplate by wrapping the above pattern within a macro, parameterizing the TYPE_{MIN,MAX} and using token pasting, to expand to four necessary result types. There is no reason for the implementation of the parse_num() helper to be using strtoul(3) or strtoull(3); its behaviour will be under our total control. It can become fancier by enriching the flags bits (e.g. allow scaling factor, etc.) only once and all variants for various result types will get the same feature. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
Michael Haggerty mhag...@alum.mit.edu writes: Here were my thoughts: * I wanted to change the interface to look less like strtol()/strtoul(), so it seemed appropriate to make the names more dissimilar. One reason I prefer the names in the compat-util is that it makes it clear that what is converted into what (e.g. str to ul) and what type the result is stored (i or ui), and as an added bonus, with a short-and-sweet to it is already so clear we are converting that we do not have to use verbs like parse_ or convert_. From parse_i() and convert_i(), I cannot tell which one lacks the endptr and parses to the end and which one takes the endptr and tells the caller where the conversion ended (and it also does not tell us that they are converting from a string). * The functions seemed long enough that they shouldn't be inline, and I wanted to put them in their own module rather than put them in git-compat-util.h. I actually agree on this, but that is not a justification why they have to be named very differently. Having said that, I would say that it will be futile to discuss this further, as we seem to agree to disagree. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
Michael Haggerty mhag...@alum.mit.edu writes: * It allows leading whitespace. This might be blessing in disguise. Our friends on MacOS may be relying on that git cmd --abbrev=$(wc -c foo) to work as expected, even though their wc gives leading spaces, for example. * It allows arbitrary trailing characters. Which is why we have introduced strtoul_ui() and such. * It allows a leading sign character ('+' or '-'), even though the result is unsigned. I do not particularly see it a bad thing to accept --abbrev=+7. Using unsigned type to accept a width and parsing --abbrev=-7 into a large positive integer _is_ a problem, and we may want to fix it, but I think that is still within the scope of the original better strtol/strtoul, and I do not see it as a justification to introduce a set of functions with completely unfamiliar name. * If the string doesn't contain a number at all, it sets its endptr argument to point at the start of the string but doesn't set errno. Why is that a problem? A caller that cares is supposed to check endptr and the string it gave, no? Now, if strtoul_ui() and friends do not do so, I again think that is still within the scope of the original better strtol/strtoul. * If the value is larger than fits in an unsigned long, it returns the value clamped to the range 0..ULONG_MAX (setting errno to ERANGE). Ditto. * If the value is between -ULONG_MAX and 0, it returns the positive integer with the same bit pattern, without setting errno(!) (I can hardly think of a scenario where this would be useful.) Ditto. * For base=0 (autodetect base), it allows a base specifier prefix 0x or 0 and parses the number accordingly. For base=16 it also allows a 0x prefix. Why is it a problem to allow git cmd --hexval=0x1234, even if git cmd --hexval=1234 would suffice? When I looked around, I found scores of sites that call atoi(), strtoul(), and strtol() carelessly. And it's understandable. Calling any of these functions safely is so much work as to be completely impractical in day-to-day code. Yes, these burdens on the caller were exactly why strtoul_ui() etc. wanted to reduce---and it will be a welcome change to do necessary checks that are currently not done. Please see the docstrings in numparse.h in the first commit for detailed API docs. Briefly, there are eight new functions: parse_{l,ul,i,ui}(const char *s, unsigned int flags, T *result, char **endptr); convert_{l,ul,i,ui}(const char *s, unsigned int flags, T *result); The parse_*() functions are for parsing a number from the start of a string; the convert_*() functions are for parsing a string that consists of a single number. I am not sure if I follow. Why not unify them into one 4-function set, with optional endptr that could be NULL? While we are on the topic of improving number parsing, one thing that I found somewhat frustrating is that config.c layer knows the scaling suffix but option parsing layer does not. These functions should offer an option (via their flags) to say I allow scaled numbers like 2k, 4M, etc.. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
Jeff King p...@peff.net writes: I wondered if we could do away with the radix entirely. Wouldn't we be asking for base 10 most of the time? Of course, your first few patches show octal parsing, but I wonder if we should actually have a separate parse_mode() for that, since that seems to be the general reason for doing octal parsing. 10644 does not overflow an int, but it is hardly a reasonable mode. True. I also wondered if we could get rid of NUM_SIGN in favor of just having the type imply it (e.g., convert_l would always allow negative numbers, whereas convert_ul would not). But I suppose there are times when we end up using an int to store an unsigned value for a good reason (e.g., -1 is a sentinel value, but we expect only positive values from the user). So that might be a bad idea. Yeah, there is a reason why ssize_t exists. IMHO most of the tightening happening here is a good thing, and means we are more likely to notice mistakes rather than silently doing something stupid. I do like the general idea of making sure we avoid use of bare atoi() and unchecked use of strtol() and such, and have a set of well defined helper functions to perform the common checks, exactly for the reason you said. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
On Wed, Mar 18, 2015 at 4:47 PM, Michael Haggerty mhag...@alum.mit.edu wrote: Thank you for doing it. I was about to write another number parser and you did it :D Maybe you can add another patch to convert the only strtol in upload-pack.c to parse_ui. This place should accept positive number in base 10, plus sign is not accepted. If the general direction of this patch series is accepted, I'll gradually try to go through the codebase, replacing *all* integer parsing with these functions. So there's no need to request particular callers of strtol()/strtoul() to be converted; I'll get to them all sooner or later (I hope). Good to know. No there's no hurry in converting this particular place. It's just tightening things up, not bug fixing or anything. -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
On 03/18/2015 11:03 AM, Jeff King wrote: On Wed, Mar 18, 2015 at 10:47:40AM +0100, Michael Haggerty wrote: But in case you have some reason that you want upload-pack.c to be converted right away, I just pushed that change (plus some related cleanups) to my GitHub repo [1]. The branch depends only on the first patch of the numparse patch series. By the way, some other packet line parsing code in that file doesn't verify that there are no trailing characters on the lines that they process. That might be another thing that should be tightened up. Do you mean that upload-pack gets a pkt-line of length N that contains a line of length M, and then doesn't check that M==N? We use the space between M and N for passing capabilities and other metadata around. Or do you mean that we see lines like: want [0-9a-f]{40} ...\n and do not bother looking at the ... that comes after the data we expect? That I can believe, and I don't think it would hurt to tighten up (we shouldn't need it for extensibility, as anybody trying to stick extra data there should do so only after using a capability flag earlier in the protocol). The latter. For example here [1], the have command and its SHA-1 are read from the line, but I don't see a check that there are no characters after the SHA-1. The same here [2]. Michael [1] https://github.com/gitster/git/blob/9ab698f4000a736864c41f57fbae1e021ac27799/upload-pack.c#L404-L429 [2] https://github.com/gitster/git/blob/9ab698f4000a736864c41f57fbae1e021ac27799/upload-pack.c#L550-L565 -- Michael Haggerty mhag...@alum.mit.edu -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
On 03/18/2015 12:05 AM, Duy Nguyen wrote: On Tue, Mar 17, 2015 at 11:00 PM, Michael Haggerty mhag...@alum.mit.edu wrote: Michael Haggerty (14): numparse: new module for parsing integral numbers cacheinfo_callback(): use convert_ui() when handling --cacheinfo write_subdirectory(): use convert_ui() for parsing mode handle_revision_opt(): use skip_prefix() in many places handle_revision_opt(): use convert_i() when handling -digit strtoul_ui(), strtol_i(): remove functions handle_revision_opt(): use convert_ui() when handling --abbrev= builtin_diff(): detect errors when parsing --unified argument opt_arg(): val is always non-NULL opt_arg(): use convert_i() in implementation opt_arg(): report errors parsing option values opt_arg(): simplify pointer handling diff_opt_parse(): use convert_i() when handling -lnum diff_opt_parse(): use convert_i() when handling --abbrev=num Thank you for doing it. I was about to write another number parser and you did it :D Maybe you can add another patch to convert the only strtol in upload-pack.c to parse_ui. This place should accept positive number in base 10, plus sign is not accepted. If the general direction of this patch series is accepted, I'll gradually try to go through the codebase, replacing *all* integer parsing with these functions. So there's no need to request particular callers of strtol()/strtoul() to be converted; I'll get to them all sooner or later (I hope). But in case you have some reason that you want upload-pack.c to be converted right away, I just pushed that change (plus some related cleanups) to my GitHub repo [1]. The branch depends only on the first patch of the numparse patch series. By the way, some other packet line parsing code in that file doesn't verify that there are no trailing characters on the lines that they process. That might be another thing that should be tightened up. Michael [1] https://github.com/mhagger/git.git, branch upload-pack-numparse -- Michael Haggerty mhag...@alum.mit.edu -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
On Wed, Mar 18, 2015 at 10:47:40AM +0100, Michael Haggerty wrote: But in case you have some reason that you want upload-pack.c to be converted right away, I just pushed that change (plus some related cleanups) to my GitHub repo [1]. The branch depends only on the first patch of the numparse patch series. By the way, some other packet line parsing code in that file doesn't verify that there are no trailing characters on the lines that they process. That might be another thing that should be tightened up. Do you mean that upload-pack gets a pkt-line of length N that contains a line of length M, and then doesn't check that M==N? We use the space between M and N for passing capabilities and other metadata around. Or do you mean that we see lines like: want [0-9a-f]{40} ...\n and do not bother looking at the ... that comes after the data we expect? That I can believe, and I don't think it would hurt to tighten up (we shouldn't need it for extensibility, as anybody trying to stick extra data there should do so only after using a capability flag earlier in the protocol). -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
On Tue, Mar 17, 2015 at 05:00:02PM +0100, Michael Haggerty wrote: My main questions: * Do people like the API? My main goal was to make these functions as painless as possible to use correctly, because there are so many call sites. * Is it too gimmicky to encode the base together with other options in `flags`? (I think it is worth it to avoid the need for another parameter, which callers could easily put in the wrong order.) I definitely like the overall direction of this. My first thought was that it does seem like there are a lot of possible options to the functions (and OR-ing the flags with the base does seem weird, though I can't think of a plausible case where it would actually cause errors). Many of those options don't seem used in the example conversions (I'm not clear who would want NUM_SATURATE, for example). I wondered if we could do away with the radix entirely. Wouldn't we be asking for base 10 most of the time? Of course, your first few patches show octal parsing, but I wonder if we should actually have a separate parse_mode() for that, since that seems to be the general reason for doing octal parsing. 10644 does not overflow an int, but it is hardly a reasonable mode. I also wondered if we could get rid of NUM_SIGN in favor of just having the type imply it (e.g., convert_l would always allow negative numbers, whereas convert_ul would not). But I suppose there are times when we end up using an int to store an unsigned value for a good reason (e.g., -1 is a sentinel value, but we expect only positive values from the user). So that might be a bad idea. I notice that you go up to unsigned long here for sizes. If we want to use this everywhere, we'll need something larger for parsing off_t values on 32-bit machines (though I would not at all be surprised with the current code if 32-bit machines have a hard time configuring a pack.packSizeLimit above 4G). I wonder how much of the boilerplate in the parse_* functions could be factored out to use a uintmax_t, with the caller just providing the range. That would make it easier to add new types like off_t, and possibly even constrained types (e.g., an integer from 0 to 100). On the other hand, you mentioned to me elsewhere that there may be some bugs in the range-checks of config.c's integer parsing. I suspect they are related to exactly this kind of refactoring, so perhaps writing things out is best. * Am I making callers too strict? In many cases where a positive integer is expected (e.g., --abbrev=num), I have been replacing code like [...] IMHO most of the tightening happening here is a good thing, and means we are more likely to notice mistakes rather than silently doing something stupid. For sites that currently allow it, I could imagine people using hex notation for some values, though, depending on the context. It looks there aren't many of them ((it is just when the radix is 0, right?). Some of them look to be accidental (does anybody really ask for --threads=0x10?), but others might not be (the pack index-version contains an offset field that might be quite big). Feel free to ignore any or all of that. It is not so much criticism as a dump of thoughts I had while reading the patches. Perhaps you can pick something useful out of that, and perhaps not. :) -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
Michael Haggerty mhag...@alum.mit.edu writes: When I looked around, I found scores of sites that call atoi(), strtoul(), and strtol() carelessly. And it's understandable. Calling any of these functions safely is so much work as to be completely impractical in day-to-day code. git-compat-util.h has two functions, strtoul_ui() and strtol_i(), that try to make parsing integers a little bit easier. Exactly. They were introduced to prevent sloppy callers from passing NULL to the end parameter to underlying strtoul(3). The first thing that came to my mind while reading your message up to this point was why not use them more, tighten them, adding different variants if necessary, instead of introducing an entirely new set of functions in a new namespace? For example: * Am I making callers too strict? In many cases where a positive integer is expected (e.g., --abbrev=num), I have been replacing code like result = strtoul(s, NULL, 10); with if (convert_i(s, 10, result)) die(...); I think strictness would be good and those who did --abbrev=' 20' deserve what they get from the additional strictness, but if (strtol_i(s, 10, result)) would have been more familiar. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
On Tue, Mar 17, 2015 at 11:00 PM, Michael Haggerty mhag...@alum.mit.edu wrote: Michael Haggerty (14): numparse: new module for parsing integral numbers cacheinfo_callback(): use convert_ui() when handling --cacheinfo write_subdirectory(): use convert_ui() for parsing mode handle_revision_opt(): use skip_prefix() in many places handle_revision_opt(): use convert_i() when handling -digit strtoul_ui(), strtol_i(): remove functions handle_revision_opt(): use convert_ui() when handling --abbrev= builtin_diff(): detect errors when parsing --unified argument opt_arg(): val is always non-NULL opt_arg(): use convert_i() in implementation opt_arg(): report errors parsing option values opt_arg(): simplify pointer handling diff_opt_parse(): use convert_i() when handling -lnum diff_opt_parse(): use convert_i() when handling --abbrev=num Thank you for doing it. I was about to write another number parser and you did it :D Maybe you can add another patch to convert the only strtol in upload-pack.c to parse_ui. This place should accept positive number in base 10, plus sign is not accepted. -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
On 03/17/2015 07:48 PM, Junio C Hamano wrote: Michael Haggerty mhag...@alum.mit.edu writes: When I looked around, I found scores of sites that call atoi(), strtoul(), and strtol() carelessly. And it's understandable. Calling any of these functions safely is so much work as to be completely impractical in day-to-day code. git-compat-util.h has two functions, strtoul_ui() and strtol_i(), that try to make parsing integers a little bit easier. Exactly. They were introduced to prevent sloppy callers from passing NULL to the end parameter to underlying strtoul(3). The first thing that came to my mind while reading your message up to this point was why not use them more, tighten them, adding different variants if necessary, instead of introducing an entirely new set of functions in a new namespace? Here were my thoughts: * I wanted to change the interface to look less like strtol()/strtoul(), so it seemed appropriate to make the names more dissimilar. * The functions seemed long enough that they shouldn't be inline, and I wanted to put them in their own module rather than put them in git-compat-util.h. * It wasn't obvious to me how to generalize the names, strtoul_ui() and strtol_i(), to the eight functions that I wanted. That being said, I'm not married to the names. Suggestions are welcome--but we would need names for 8 functions, not 4 [1]. Michael For example: * Am I making callers too strict? In many cases where a positive integer is expected (e.g., --abbrev=num), I have been replacing code like result = strtoul(s, NULL, 10); with if (convert_i(s, 10, result)) die(...); I think strictness would be good and those who did --abbrev=' 20' deserve what they get from the additional strictness, but if (strtol_i(s, 10, result)) would have been more familiar. [1] It could be that when we're done, it will turn out that some of the eight variants are not needed. If so, we can delete them then. -- Michael Haggerty mhag...@alum.mit.edu -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 00/14] numparse module: systematically tighten up integer parsing
Sorry for the long email. Feel free to skip over the background info and continue reading at My solution below. This odyssey started with a typo: $ git shortlog -snl v2.2.0..v2.3.0 14795 Junio C Hamano 1658 Jeff King 1400 Shawn O. Pearce 1109 Linus Torvalds 760 Jonathan Nieder [...] It took me a minute to realize why so many commits are listed. It turns out that -l, which I added by accident, requires an integer argument, but it happily swallows v2.2.0..v2.3.0 without error. This leaves no range argument, so the command traverses the entire history of HEAD. The relevant code is else if ((argcount = short_opt('l', av, optarg))) { options-rename_limit = strtoul(optarg, NULL, 10); return argcount; } , which is broken in many ways. First of all, strtoul() is way too permissive for simple tasks like this: * It allows leading whitespace. * It allows arbitrary trailing characters. * It allows a leading sign character ('+' or '-'), even though the result is unsigned. * If the string doesn't contain a number at all, it sets its endptr argument to point at the start of the string but doesn't set errno. * If the value is larger than fits in an unsigned long, it returns the value clamped to the range 0..ULONG_MAX (setting errno to ERANGE). * If the value is between -ULONG_MAX and 0, it returns the positive integer with the same bit pattern, without setting errno(!) (I can hardly think of a scenario where this would be useful.) * For base=0 (autodetect base), it allows a base specifier prefix 0x or 0 and parses the number accordingly. For base=16 it also allows a 0x prefix. strtol() has similar foibles. The caller compounds the permissivity of strtoul() with further sins: * It does absolutely no error detection. * It assigns the return value, which is an unsigned long, to an int value. This could truncate the result, perhaps even resulting in rename_limit being set to a negative value. When I looked around, I found scores of sites that call atoi(), strtoul(), and strtol() carelessly. And it's understandable. Calling any of these functions safely is so much work as to be completely impractical in day-to-day code. git-compat-util.h has two functions, strtoul_ui() and strtol_i(), that try to make parsing integers a little bit easier. But they are only used in a few places, they hardly restrict the pathological flexibility of strtoul()/strtol(), strtoul_ui() doesn't implement clamping consistently when converting from unsigned long to unsigned int, and neither function can be used when the integer value *is* followed by other characters. My solution: the numparse module So I hereby propose a new module, numparse, to make it easier to parse integral values from strings in a convenient, safe, and flexible way. The first commit introduces the new module, and subsequent commits change a sample (a small fraction!) of call sites to use the new functions. Consider it a proof-of-concept. If people are OK with this approach, I will continue sending patches to fix other call sites. (I already have another two dozen such patches in my repo). Please see the docstrings in numparse.h in the first commit for detailed API docs. Briefly, there are eight new functions: parse_{l,ul,i,ui}(const char *s, unsigned int flags, T *result, char **endptr); convert_{l,ul,i,ui}(const char *s, unsigned int flags, T *result); The parse_*() functions are for parsing a number from the start of a string; the convert_*() functions are for parsing a string that consists of a single number. The flags argument selects not only the base of the number, but also which of strtol()/strtoul()'s many features should be allowed. The *_i() and *.ui() functions are for parsing int and unsigned int values; they are careful about how they truncate the corresponding long values. The functions all return 0 on success and a negative number on error. Here are a few examples of how these functions can be used (taken from the header of numparse.h): * Convert hexadecimal string s into an unsigned int. Die if there are any characters in s besides hexadecimal digits, or if the result exceeds the range of an unsigned int: if (convert_ui(s, 16, result)) die(...); * Read a base-ten long number from the front of a string, allowing sign characters and setting endptr to point at any trailing characters: if (parse_l(s, 10 | NUM_SIGN | NUM_TRAILING, result, endptr)) die(...); * Convert decimal string s into a signed int, but not allowing the string to contain a '+' or '-' prefix (and thereby indirectly ensuring that the result will be non-negative): if (convert_i(s, 10, result)) die(...); * Convert s into a signed int, interpreting prefix 0x to mean hexadecimal and 0 to mean octal. If the value doesn't fit in an unsigned int, set result