Re: string types
On 1/6/20 5:08 PM, Tim Rühsen wrote:
>> * What about reusing the complete vasnprintf.c for (B), rather than
>>   adding another, limited printf-like implementation?
>
> Yes, would be nice. At least for appending we need an extra
> malloc/malloc/memcpy/free.
>
> vasnprintf reallocates RESULTBUF that means we can't use a stack
> allocated buffer - thus we lose the performance advantage. Or should we
> try snprintf first and fallback to vasnprintf in case of truncation ?
>
> We want another module e.g. buffer-printf to not pull in vasnprintf when
> not needing printf-like buffer functions.
>
> Once the buffer module is done, we could think of amending vasnprintf to
> better play with the buffer type.

Just made a speed comparison between vasnprintf and wget_buffer_printf,
10m times executed, within a stack-only scenario (no reallocations),
gcc 9.2.1, -O1 -g.

  asnprintf(sbuf, &size, "%d", i);   takes 0m2.727s
  wget_buffer_printf(, "%d", i);     takes 0m0.226s

  char s[]="123";
  asnprintf(sbuf, &size, "%s", s);   takes 0m2.282s
  wget_buffer_printf(, "%s", s);     takes 0m0.212s

It tells me that vasnprintf has a huge startup overhead. Perhaps we can
tweak that a little bit.

The vasnprintf program that I ran for %d:

  #include "vasnprintf.h"   /* gnulib header that declares asnprintf */

  int main(void)
  {
    char sbuf[256];
    size_t size = sizeof(sbuf);

    for (int i=0;i<100;i++) {
      asnprintf(sbuf, &size, "%d", i);
      asnprintf(sbuf, &size, "%d", i);
      asnprintf(sbuf, &size, "%d", i);
      asnprintf(sbuf, &size, "%d", i);
      asnprintf(sbuf, &size, "%d", i);
      asnprintf(sbuf, &size, "%d", i);
      asnprintf(sbuf, &size, "%d", i);
      asnprintf(sbuf, &size, "%d", i);
      asnprintf(sbuf, &size, "%d", i);
      asnprintf(sbuf, &size, "%d", i);
    }
    return 0;
  }
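The "try snprintf first, fall back on truncation" idea raised in the quoted text can be sketched roughly as follows. This is only an illustration under stated assumptions: format_stack_first is a hypothetical name that exists neither in gnulib nor in wget2, and the fallback here is a plain malloc plus a second vsnprintf pass rather than a call into vasnprintf.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not a gnulib or wget2 function).
   Formats into the caller's buffer SBUF of SBUF_SIZE bytes if the result
   fits; otherwise formats into a malloc()ed buffer.  Returns NULL on
   error.  The caller frees the result only if it differs from SBUF.  */
static char *
format_stack_first (char *sbuf, size_t sbuf_size, const char *fmt, ...)
{
  va_list ap;

  assert (fmt != NULL);

  /* First pass: try the caller's buffer.  vsnprintf returns the length
     the complete output needs, excluding the terminating NUL.  */
  va_start (ap, fmt);
  int needed = vsnprintf (sbuf, sbuf_size, fmt, ap);
  va_end (ap);

  if (needed < 0)
    return NULL;                 /* output or encoding error */
  if ((size_t) needed < sbuf_size)
    return sbuf;                 /* fast path: no allocation at all */

  /* Slow path: the output was truncated, so format again on the heap.
     The argument list has to be restarted for the second pass.  */
  char *heap = malloc ((size_t) needed + 1);
  if (heap == NULL)
    return NULL;

  va_start (ap, fmt);
  int written = vsnprintf (heap, (size_t) needed + 1, fmt, ap);
  va_end (ap);

  if (written != needed)
    {
      free (heap);
      return NULL;
    }
  return heap;
}
```

The fast path costs one vsnprintf call and no allocation; the format string is parsed twice only when the stack buffer turns out to be too small.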
Re: string types
Hi Bruno,

On 1/6/20 1:46 PM, Bruno Haible wrote:
>    - providing primitives for string allocation reduces the amount of
>      buffer overflow bugs that otherwise occur in this area. [1]
>    [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html
>>> ...
>> We created a 'catch them all' string/buffer type plus API. It is a good
>> compromise for all kinds of situations, works like a memory buffer but
>> is guaranteed 0-terminated, allows custom stack buffers with fallback to
>> heap if too small.
>>
>>   https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer.c
>>
>> There also is a sprintf functionality (glibc compatible) using these
>> buffers - and the operation is well faster than glibc's sprintf-like
>> functions for all format strings tested (tested back a few years). The
>> code is also much smaller (380 C code lines), the return values are
>> size_t. It doesn't support float/double.
>>
>>   https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer_printf.c
>>
>> If there is serious interest, I could prepare modules for gnulib.
>
> It is interesting that your solution does not only cover the simple cases
> (string concatenation, etc.), but also the more complex one, possibly
> with if()s in the generation logic, and all this without blatant potential
> for buffer overflow bugs.
>
> So, the solution would consist of the following parts:
>   (A) A growable buffer type, with up to N (128 or 1024 or so) bytes on
>       the stack.

Preferably, the initial size and whether to start with a heap or a stack
buffer should be (runtime) configurable:
  - initial size, because it allows fine-tuning to better avoid
    reallocations
  - initial stack, if used as a local / temporary buffer
  - initial heap, when you already know that the resulting string has to
    persist beyond the function return

Currently there are two init functions (I leave away the wget namespace):

  int buffer_init(buffer *buf, char *data, size_t size);
  buffer *buffer_alloc(size_t size);

buffer_alloc creates a buffer instance on the heap and initializes it
with a heap buffer of size.

buffer_init(buf, data, data_size) initializes 'buf' with the given data
and data_size. data will not be free'd, so stack data can be used here.

buffer_init(buf, NULL, data_size) initializes 'buf' with freshly
allocated heap data of size 'data_size'.

buffer_init(buf, NULL, 0) initializes 'buf' with freshly allocated heap
data of size 128. We could leave this out - it's a currently unused
special case to avoid error handling.

Then there is

  int buffer_ensure_capacity(buffer *buf, size_t size);

>   (B) A set of functions for appending to such a growable buffer.

To copy a number of bytes to the beginning (effectively dropping the
previous content):
  size_t buffer_memcpy(buffer *buf, const void *data, size_t length);

To append a number of bytes:
  size_t buffer_memcat(buffer *buf, const void *data, size_t length);

To copy a string to the beginning (effectively dropping the previous
content):
  size_t buffer_strcpy(buffer *buf, const char *s);

To append a string:
  size_t buffer_strcat(buffer *buf, const char *s);

To set a number of bytes at the beginning (effectively dropping the
previous content):
  size_t buffer_memset(buffer *buf, char c, size_t length);

To append a number of the same bytes:
  size_t buffer_memset_append(buffer *buf, char c, size_t length);

>   (C) A function for creating a heap-allocated 'char *' from a growable
>       buffer.

Currently we do:

  buffer buf;
  buffer_init(&buf, NULL, data_size); // allocate buf.data on heap
  ... add stuff to buf ...
  mydata = buf.data;
  buf.data = NULL;
  buffer_deinit(&buf);

We could add a (static inline) function for this, named

  void *buffer_deinit_transfer(buffer *buf);

This function could also call realloc() to shrink 'data' to its
occupied length.

>   (D) Short-hand functions for the simple cases (like string concatenation).

See above, e.g.

  buffer_strcpy(buf, scheme);
  buffer_strcat(buf, "://");
  buffer_strcat(buf, domain);
  buffer_memcat(buf, ":", 1);
  buffer_strcat(buf, port_s);
  buffer_memcat(buf, "/", 1);
  buffer_strcat(buf, path);

But I prefer the slightly slower but more readable form

  buffer_printf("%s://%s:%d/%s", scheme, domain, port, path);

Since our printf-like functions write directly into a buffer, there is
no overhead for copying data.

> It would be good to have all these well integrated (in terms of function
> names and calling conventions). So far, in gnulib, we have only pieces of
> it:
>   - Module 'scratch_buffer' is (A) without (B), (C), (D).
>   - Modules 'vasnprintf', 'asprintf' are (B), (C), (D) but without (A).
>
> Before you start writing the code, it's worth looking at the following
> questions:
>   * Should the module 'scratch_buffer' be reused for (A)? Or is this
>     not possible? If not, can it still have a memory-leak prevention
>     like described in lib/malloc/scratch_buffer.h?

I don't see the advantage of the described memory-leak prevention. On memory error
Re: string types
Hi Tim,

> >>> - providing primitives for string allocation reduces the amount of
> >>>   buffer overflow bugs that otherwise occur in this area. [1]
> >>> [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html
> > ...
> We created a 'catch them all' string/buffer type plus API. It is a good
> compromise for all kinds of situations, works like a memory buffer but
> is guaranteed 0-terminated, allows custom stack buffers with fallback to
> heap if too small.
>
>   https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer.c
>
> There also is a sprintf functionality (glibc compatible) using these
> buffers - and the operation is well faster than glibc's sprintf-like
> functions for all format strings tested (tested back a few years). The
> code is also much smaller (380 C code lines), the return values are
> size_t. It doesn't support float/double.
>
>   https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer_printf.c
>
> If there is serious interest, I could prepare modules for gnulib.

It is interesting that your solution does not only cover the simple cases
(string concatenation, etc.), but also the more complex one, possibly
with if()s in the generation logic, and all this without blatant potential
for buffer overflow bugs.

So, the solution would consist of the following parts:
  (A) A growable buffer type, with up to N (128 or 1024 or so) bytes on
      the stack.
  (B) A set of functions for appending to such a growable buffer.
  (C) A function for creating a heap-allocated 'char *' from a growable
      buffer.
  (D) Short-hand functions for the simple cases (like string concatenation).

It would be good to have all these well integrated (in terms of function
names and calling conventions). So far, in gnulib, we have only pieces of
it:
  - Module 'scratch_buffer' is (A) without (B), (C), (D).
  - Modules 'vasnprintf', 'asprintf' are (B), (C), (D) but without (A).

Before you start writing the code, it's worth looking at the following
questions:
  * Should the module 'scratch_buffer' be reused for (A)? Or is this
    not possible? If not, can it still have a memory-leak prevention
    like described in lib/malloc/scratch_buffer.h?
  * What about reusing the complete vasnprintf.c for (B), rather than
    adding another, limited printf-like implementation?
  * Is it best to implement (D) based on (A), (B), (C), or directly
    from scratch?

Bruno
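To make parts (A), (B) and (C) concrete, here is a minimal sketch of how they could fit together. It is written under assumptions: the type and function names (struct sbuf, sbuf_init, sbuf_append, sbuf_detach) are hypothetical and belong neither to gnulib's scratch_buffer nor to wget2's buffer API, and error handling is reduced to a single -1 / NULL result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer; all names here are illustrative only.  */
struct sbuf
{
  char *data;       /* always NUL-terminated */
  size_t length;    /* bytes in use, excluding the NUL */
  size_t capacity;  /* usable bytes, excluding room for the NUL */
  bool on_heap;     /* true once DATA comes from malloc/realloc */
};

/* (A) Start with caller-provided storage, typically a stack array.  */
static void
sbuf_init (struct sbuf *b, char *storage, size_t size)
{
  assert (storage != NULL && size >= 1);
  b->data = storage;
  b->length = 0;
  b->capacity = size - 1;
  b->on_heap = false;
  b->data[0] = '\0';
}

/* (B) Append LEN bytes.  Returns 0 on success, -1 if memory runs out
   or the total size would overflow; the buffer is then unchanged.  */
static int
sbuf_append (struct sbuf *b, const void *src, size_t len)
{
  if (len > b->capacity - b->length)
    {
      size_t max = (size_t) -1 - 1;          /* keep room for the NUL */
      if (len > max - b->length)
        return -1;
      size_t need = b->length + len;
      size_t new_cap = b->capacity <= max / 2 ? b->capacity * 2 : max;
      if (new_cap < need)
        new_cap = need;

      char *p = (b->on_heap
                 ? realloc (b->data, new_cap + 1)
                 : malloc (new_cap + 1));
      if (p == NULL)
        return -1;
      if (!b->on_heap)
        memcpy (p, b->data, b->length + 1);  /* move off the stack */
      b->data = p;
      b->capacity = new_cap;
      b->on_heap = true;
    }
  if (len > 0)
    memcpy (b->data + b->length, src, len);
  b->length += len;
  b->data[b->length] = '\0';
  return 0;
}

/* (C) Hand the contents over as a heap-allocated string and reset B.
   Returns NULL if a copy was needed and malloc failed.  */
static char *
sbuf_detach (struct sbuf *b)
{
  char *result;
  if (b->on_heap)
    result = b->data;
  else
    {
      result = malloc (b->length + 1);
      if (result != NULL)
        memcpy (result, b->data, b->length + 1);
    }
  b->data = NULL;
  b->length = b->capacity = 0;
  b->on_heap = false;
  return result;
}
```

Part (D) would then be thin wrappers that call sbuf_append repeatedly and finish with sbuf_detach. Whether scratch_buffer could serve as the storage layer underneath remains the open question from the list above.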
Re: string types
On 12/31/19 10:53 AM, Bruno Haible wrote:
> Hi Tim,
>
>>>   - providing primitives for string allocation reduces the amount of buffer
>>>     overflow bugs that otherwise occur in this area. [1]
>>> [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html
>
>> here is a string concatenation function without ellipsis, analogue to
>> writev() and struct iovec - just a suggestion. Instead of 'struct
>> strvec' a new string_t type would be handy.
>>
>> #include <stddef.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <string.h>
>>
>> struct strvec {
>>   char *strv_base;
>>   size_t strv_len;
>> };
>>
>> __attribute__ ((nonnull (1)))
>> char *concat_stringv(const struct strvec *strv)
>> {
>>   const struct strvec *str;
>>   size_t len = 0;
>>   char *buf;
>>
>>   for (str = strv; str->strv_base; str++)
>>     len += str->strv_len;
>>
>>   if (!(buf = malloc(len + 1)))
>>     return buf;
>>
>>   len = 0;
>>   for (str = strv; str->strv_base; len += str->strv_len, str++)
>>     memcpy(buf + len, str->strv_base, str->strv_len);
>>
>>   buf[len] = 0;
>>
>>   return buf;
>> }
>>
>> void main(void)
>> {
>>   char *s = concat_stringv((struct strvec []) {
>>     { "a", 1 },
>>     { "b", 1 },
>>     { NULL }
>>   });
>
> This looks good. It brings us one step closer to the stated goal [1].
>
> Would you like to contribute such a 'string-alloc' module that, together with
> 'strdup' and 'asprintf', removes most needs to create a string's contents
> "by hand"?

When time allows, I would like to make up a module.

Though IMO the design of the function doesn't allow reusing an existing
buffer (e.g. a scratch buffer on the stack). Since malloc() etc. are
pretty costly, you often want to avoid them as much as possible.

Like e.g.

  /* Use given stack buffer, fallback to malloc() if too short */
  char sbuf[256];
  char *s = concat_stringv_stack(sbuf, sizeof (sbuf), (struct strvec []) {
    { "a", 1 },
    { "b", 1 },
    { NULL }
  });

  ... do things with s ...

  if (s != sbuf)
    free (s);

Sometimes you want to reuse an existing malloc'ed buffer:

  /* Use existing heap buffer, use realloc() if too short */
  char *buf = malloc(N);
  buf = concat_stringv_reuse(buf, N, (struct strvec []) {
    { "a", 1 },
    { "b", 1 },
    { NULL }
  });

  ... do things with buf ...

  free (buf);

You might also be interested in the size of the created string to avoid
a superfluous strlen(). So the need for more specialized functions makes
it all more and more complex.

During the development of Libwget/Wget2 we needed all of the above (and
more) and finally came up with a good compromise (well, good for us).

We created a 'catch them all' string/buffer type plus API. It is a good
compromise for all kinds of situations, works like a memory buffer but
is guaranteed 0-terminated, allows custom stack buffers with fallback to
heap if too small.

  $ cloc buffer.c
  Language    files    blank    comment    code
  ---
  C               1       49        327     195

  https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer.c

There also is a sprintf functionality (glibc compatible) using these
buffers - and the operation is well faster than glibc's sprintf-like
functions for all format strings tested (tested back a few years). The
code is also much smaller (380 C code lines), the return values are
size_t. It doesn't support float/double.

  $ cloc buffer_printf.c
  Language    files    blank    comment    code
  ---
  C               1       74        120     380

  https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer_printf.c

If there is serious interest, I could prepare modules for gnulib.

Regards, Tim
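The concat_stringv_stack() call used above is only named in the message, not implemented. One possible body is sketched below, purely as an assumption about what such a function could look like: it reuses the struct strvec layout from the thread (with a const pointer) and adds a size-overflow check that the original concat_stringv does not have.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct strvec
{
  const char *strv_base;
  size_t strv_len;
};

/* Hypothetical implementation of the concat_stringv_stack() call
   sketched in the message above.  Concatenates the NULL-terminated
   vector STRV into SBUF if the result (plus NUL) fits into SBUF_SIZE
   bytes, otherwise into a malloc()ed buffer.  Returns NULL on
   allocation failure or size overflow.  The caller frees the result
   only if it differs from SBUF.  */
static char *
concat_stringv_stack (char *sbuf, size_t sbuf_size,
                      const struct strvec *strv)
{
  const struct strvec *str;
  size_t len = 0;

  assert (strv != NULL);

  /* Pass 1: total length, refusing sums that would overflow size_t
     once the terminating NUL is added.  */
  for (str = strv; str->strv_base; str++)
    {
      if (str->strv_len >= (size_t) -1 - len)
        return NULL;
      len += str->strv_len;
    }

  /* Prefer the caller's buffer; fall back to the heap if too short.  */
  char *buf = (sbuf != NULL && len < sbuf_size) ? sbuf : malloc (len + 1);
  if (buf == NULL)
    return NULL;

  /* Pass 2: copy the pieces and terminate.  */
  size_t pos = 0;
  for (str = strv; str->strv_base; str++)
    {
      memcpy (buf + pos, str->strv_base, str->strv_len);
      pos += str->strv_len;
    }
  buf[pos] = '\0';

  return buf;
}
```

Returning the length as well (for example through an extra size_t * parameter) would avoid the superfluous strlen() mentioned above, at the price of yet another variant of the function.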
Re: string types
Hi Tim,

> >   - providing primitives for string allocation reduces the amount of buffer
> >     overflow bugs that otherwise occur in this area. [1]
> > [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html

> here is a string concatenation function without ellipsis, analogue to
> writev() and struct iovec - just a suggestion. Instead of 'struct
> strvec' a new string_t type would be handy.
>
> #include <stddef.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> struct strvec {
>   char *strv_base;
>   size_t strv_len;
> };
>
> __attribute__ ((nonnull (1)))
> char *concat_stringv(const struct strvec *strv)
> {
>   const struct strvec *str;
>   size_t len = 0;
>   char *buf;
>
>   for (str = strv; str->strv_base; str++)
>     len += str->strv_len;
>
>   if (!(buf = malloc(len + 1)))
>     return buf;
>
>   len = 0;
>   for (str = strv; str->strv_base; len += str->strv_len, str++)
>     memcpy(buf + len, str->strv_base, str->strv_len);
>
>   buf[len] = 0;
>
>   return buf;
> }
>
> void main(void)
> {
>   char *s = concat_stringv((struct strvec []) {
>     { "a", 1 },
>     { "b", 1 },
>     { NULL }
>   });

This looks good. It brings us one step closer to the stated goal [1].

Would you like to contribute such a 'string-alloc' module that, together with
'strdup' and 'asprintf', removes most needs to create a string's contents
"by hand"?

Regarding the type name: There can't be a 'string_t' in C, I would say,
because you will always have the NUL-terminated strings on one side and
what you call a 'wget_string' on the other side, and there can't be a clear
winner between both.

Bruno
Re: string types
On 27.12.19 11:51, Bruno Haible wrote:
> Aga wrote:
>> I do not know if you can (or if it is possible, how it can be done),
>> extract with a way a specific a functionality from gnulib, with the
>> absolute necessary code and only that.
>
> gnulib-tool does this. With its --avoid option, the developer can even
> customize their notion of "absolutely necessary".
>
>> In a myriad of codebases a string type is implemented at least as:
>>   size_t mem_size;
>>   size_t num_bytes;
>>   char *bytes;
>
> This is actually a string-buffer type. A string type does not need two size_t
> members. Long-term experience has shown that using different types for string
> and string-buffer is a win, because
>   - a string can be put in a read-only virtual memory area, thus enforcing
>     immutability (-> reducing multithread problems),
>   - providing primitives for string allocation reduces the amount of buffer
>     overflow bugs that otherwise occur in this area. [1]
> [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html

Just FYI, here is a string concatenation function without ellipsis,
analogue to writev() and struct iovec - just a suggestion. Instead of
'struct strvec' a new string_t type would be handy.

  #include <stddef.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  struct strvec {
    char *strv_base;
    size_t strv_len;
  };

  __attribute__ ((nonnull (1)))
  char *concat_stringv(const struct strvec *strv)
  {
    const struct strvec *str;
    size_t len = 0;
    char *buf;

    /* First pass: sum up the lengths (the vector ends at a NULL base) */
    for (str = strv; str->strv_base; str++)
      len += str->strv_len;

    if (!(buf = malloc(len + 1)))
      return buf;

    /* Second pass: copy the pieces back to back */
    len = 0;
    for (str = strv; str->strv_base; len += str->strv_len, str++)
      memcpy(buf + len, str->strv_base, str->strv_len);

    buf[len] = 0;

    return buf;
  }

  int main(void)
  {
    char *s = concat_stringv((struct strvec []) {
      { "a", 1 },
      { "b", 1 },
      { NULL }
    });

    if (!s)
      return 1;

    puts(s);
    free(s);
    return 0;
  }

In GNU Wget2 we already have a type similar to string_t. It is just used
in cases where we need pointer + len of URLs inside const HTML/XML/CSS
data.

  typedef struct {
    const char *p; //!< pointer to memory region
    size_t len;    //!< length of memory region
  } wget_string;

So maybe we need a string_t and a const_string_t type !?
(to avoid casting from const char *)

Regards, Tim
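As an illustration of the "pointer + len inside const data" use case, the sketch below returns a read-only view into a larger text without copying or modifying it. Both const_string_t (modeled on the wget_string typedef above) and the first_quoted helper are hypothetical names chosen for this example; Wget2's real parsers are considerably more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Read-only view, modeled on the wget_string typedef shown above.  */
typedef struct
{
  const char *p;  /* start of the region; in general NOT NUL-terminated */
  size_t len;     /* number of bytes in the region */
} const_string_t;

/* Hypothetical helper: returns a view of the first double-quoted value
   in TEXT, or { NULL, 0 } if there is none.  TEXT is neither modified
   nor copied, so it may live in read-only memory.  */
static const_string_t
first_quoted (const char *text)
{
  const_string_t view = { NULL, 0 };

  assert (text != NULL);

  const char *start = strchr (text, '"');
  if (start != NULL)
    {
      start++;                               /* skip the opening quote */
      const char *end = strchr (start, '"');
      if (end != NULL)
        {
          view.p = start;
          view.len = (size_t) (end - start);
        }
    }
  return view;
}
```

Because the view carries a const pointer, no cast is needed when the source is const data. A separate mutable string_t would be wanted only where the region may be written through the view.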
Re: string types
On Sun, Dec 29, at 10:19 Bruno Haible wrote:
> Aga wrote:
> > - the returned value of the *printf family of functions dictates their
> >   limits/range, as they return an int, this can be as INT_MAX mostly
>
> Yes, we need new implementations of the *asprintf functions that are not
> limited to returning strings of maximum length INT_MAX.

There is also the question of how the current functions behave with
buffers larger than INT_MAX, and of what to do with such large buffers
if stdio cannot handle them reliably. And what does POSIX say about
this, if it says anything at all?
Re: string types
On Sun, Dec 29, at 10:19 Bruno Haible wrote:
> I agree with the goal. How to do it precisely, is an art however.

Ok, let's see what we have so far.

First, the base (easy): the maximum size that malloc can request from
the kernel, and that is PTRDIFF_MAX. Here we also have to forget
SIZE_MAX, as it is not guaranteed that PTRDIFF_MAX equals SIZE_MAX.

Second, the requirement for the value returned by the functions (easy):
a signed type. There is an agreement that newly introduced functions
should return -1 on error, else the interface becomes complicated, and
we do not want complication. So ptrdiff_t is adequate, since ptrdiff_t
is in standard C and comes with stddef.h.

The rest:

Catching out-of-bounds conditions (rather easy and already implemented
in snprintf): after the destination argument follows an argument with
the allocated destination size (from the stack or from the heap). Now,
snprintf uses size_t here, but (question) isn't this in contradiction
with the above? Probably not, but it's better to ask and de-confuse
things (as clarity is a requirement: the semantics should be
understandable by mere humans).

Another concern: what if the destination is NULL? Should the internal
functions proceed by allocating a buffer of the requested size? What
will they do if the requested size is <= 0? There are precedents here,
like realpath(), which allocates a buffer and leaves it up to the user
to free it.

Also: internal variables declared static are considered harmful. But
sometimes it is desirable to have some data in a private place,
protected or handy to work with, without side effects. This is solved,
however, with the new im(mutable) module.

Catching truncation (first priority maybe):

There is the choice to complicate the interface a bit and return more
values than -1, but this was rejected on the perfectly legitimate
assumption that humans are lazy, probably because they have been
exposed to try/catch (not bad if you ask me, but inappropriate for C).

The other thing that could be done is to return -1 and set errno
according to the error. But does such an error exist or not? So ETRUNC
would have to be introduced. Few programmers will take the risk of
making their program depend on something that is not standard, but
perhaps they will (doubtful though at this stage).

The other thing that is left is to check the returned value. Now, in
snprintf(3) there are notes about this and a method to calculate
truncation (misty though):

    The functions snprintf() and vsnprintf() do not write more than
    size bytes (including the terminating null byte ('\0')). If the
    output was truncated due to this limit, then the return value is
    the number of characters (excluding the terminating null byte)
    which would have been written to the final string if enough space
    had been available. Thus, a return value of size or more means
    that the output was truncated. (See also below under NOTES.)

"which would have been written?" Why not always the bytes that have
been written? Ok, i got it after a break; still difficult to parse
though, and for what?

We have to admit that this is a programmer error. [Sh|H]e should know
her strings. But we still want to help here. How? Three choices come to
mind.

1. Use a bitmap flag argument to control the function behavior. This
   adds verbosity but at the same time allows extensibility. Which
   conditions could be covered with that? Perhaps returning an error if
   the destination is NULL and the flag directs the function to return
   in this condition. Same with the source. Very convenient but still
   verbose, as you have to learn another set of FLAGS.

2. Introduce wrappers. Actually, wrappers will maybe be used either
   way. Or introduce a complete set of the same functions, post-fixed
   with _un (to denote unsafety, if _s (not sure) means safe).

3. The programmer knows best. Based on that, either continue with the
   implementation as it is, or (where appropriate) use a fourth
   argument for the requested number of bytes to be written. And sleep
   in full conscience that you did the best you could. He should do the
   same.

Now, what concerns me most is userspace and all those functions that
take a variable number of arguments and a format string. I was fighting
in my code to know in a reliable way the actual number of bytes
produced by the sum of those arguments (as it can be really difficult
to catch some of the conditions described above).

You also said at one point that no one who does system programming will
use this set of functions (because of the overhead). We could go
further and say: no one sane (sorry) would want to format big strings.
Such functions are very prone to errors, but they are easy to work
with. So what should we do with them?

There is a method to calculate the size beforehand (meaning before the
declaration), and it is given in the printf(3) Linux man page:

  va_start(ap, fmt);
  size = vsnprintf(p, size, fmt, ap);
  va_end(ap);

So it parses the varargs twice. Plus a compiler version (not 9*),
Re: string types
Aga wrote:
> - the returned value of the *printf family of functions dictates their
>   limits/range, as they return an int, this can be as INT_MAX mostly

Yes, we need new implementations of the *asprintf functions that are not
limited to returning strings of maximum length INT_MAX.

> - as since there is a "risk"¹ that someone has to take at some point
>   (either the programmer or the underlying library code (as strdup()
>   does)), the designed interface should lower those risks

I agree with the goal. How to do it precisely, is an art however.

> In the case of an error, returns < 0 which is either:
>   #define EDSTPAR -1   /* Error : bad dst parameters */
>   #define ESRCPAR -2   /* Error : bad src parameters */
>   #define EMODPAR -3   /* Error : bad mode parameter */
>   #define ETRUNC  -4   /* Error : not enough space to copy/concatenate
>                           and truncation not allowed */

I don't think an interface for string concatenation with that many error
cases will be successful. Programmers are lazy, therefore
  - some will not check the errors at all,
  - some will only check for the fourth one (because "I'm not passing
    invalid arguments, after all"),
  - among those few that implement all 4 checks, half will get it wrong
    (that's my experience with similarly complex functions like mbrtowc()
    or iconv()).

For an interface to be successful, it needs to be simpler than that.

Bruno
Re: string types
On 12/28/19 12:44 PM, ag wrote:
> is your opinion that this is adequate?
>
>   typedef ptrdiff_t msize_t (m for memory here)

Yes, something like that. dfa.c calls this type 'idx_t', which is a
couple of characters shorter.
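A signed size type makes the "-1 on error" convention discussed in this thread straightforward to express. The sketch below is an assumption-laden illustration, not a proposed gnulib interface: str_copy_checked is a hypothetical name, and since no ETRUNC value exists, it reports "does not fit" through the existing ERANGE and bad arguments through EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

typedef ptrdiff_t idx_t;   /* the name dfa.c uses, per the message above */

/* Hypothetical function: copies the NUL-terminated string SRC into
   DEST, which has room for DEST_SIZE bytes including the NUL.
   Returns the number of bytes copied (excluding the NUL), or -1 with
   errno set: EINVAL for bad arguments, ERANGE if SRC does not fit.
   On ERANGE nothing is truncated; DEST becomes the empty string.  */
static idx_t
str_copy_checked (char *dest, idx_t dest_size, const char *src)
{
  if (dest == NULL || src == NULL || dest_size <= 0)
    {
      errno = EINVAL;
      return -1;
    }

  size_t len = strlen (src);
  if (len >= (size_t) dest_size)
    {
      dest[0] = '\0';
      errno = ERANGE;
      return -1;
    }

  memcpy (dest, src, len + 1);   /* includes the terminating NUL */
  return (idx_t) len;
}
```

A caller only has to test for a negative result. Distinguishing the two errno values is optional, which keeps the interface closer to the "simpler than that" requirement voiced earlier in the thread.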
Re: string types
Hi Paul,

On Sat, Dec 28, at 10:28 Paul Eggert wrote:
> > Based on the above assumptions this can be extended. First instead of
> > size_t to return ssize_t, so functions can return -1 and set errno
> > accordingly.
>
> It's better to use ptrdiff_t for this sort of thing, since it's hardwired into
> the C language (you can't do any better than ptrdiff_t anyway, if you use
> pointer subtraction), whereas ssize_t is merely in POSIX and is narrower than
> ptrdiff_t on some (obsolete?) platforms.

So, let's say we designed this thing without any obligation to the past,
thinking of the next hundred years (of course with the current knowledge
and the lessons from the past), and wanted to make it work with malloc
and the string type functions as well as it can be done, without worries
about overflows and unsigned divisions and all these confusing things
that haunt us all after so many years, when things should have been
settled by now... is your opinion that this is adequate?

  typedef ptrdiff_t msize_t (m for memory here)

> > #define MUL_NO_OVERFLOW ((size_t) 1 << (sizeof (size_t) * 4))
> > #define MEM_IS_INT_OVERFLOW(nmemb, ssize) \
> >   (((nmemb) >= MUL_NO_OVERFLOW || (ssize) >= MUL_NO_OVERFLOW) && \
> >    (nmemb) > 0 && SIZE_MAX / (nmemb) < (ssize))
>
> Ouch. That code is not good. An unsigned division at runtime to do memory
> allocation? Gnulib does better than that already. Also, Glibc has some code in
> this area that we could migrate into Gnulib, that could be better yet.

Sorry, i don't have time to do it right now - as i just escaped from a
snow-storm - but i will check this, at least so as not to spread
misleading information (it is quite possibly my fault here), so thanks
for your comment.

By the way Paul, since i'm a self-taught, learn-by-practical-experience
kind of human being, i was joking with zoi here and said that at least
my teacher is a hall of famer in computing history. Isn't this life
great! So true, this is also a school for free after all.

My Honor, Αγαθοκλής
Re: string types
On 12/28/19 5:14 AM, ag wrote:
> - PTRDIFF_MAX is at least INT_MAX and at most SIZE_MAX
>   (PTRDIFF_MAX is INT_MAX in 32bit)

PTRDIFF_MAX can exceed SIZE_MAX, in the sense that POSIX and C allow it
and it could be useful on 32-bit platforms for size_t to be 32 bits and
ptrdiff_t to be 64 bits. Although I don't know of any platforms doing
things that way, I prefer not to assume that PTRDIFF_MAX <= SIZE_MAX so
as to allow for the possibility.

> - SIZE_MAX as (size_t) (-1)
>
> - ssize_t (s means signed?) can be as big as SIZE_MAX? and SSIZE_MAX equals
>   to SIZE_MAX?

ssize_t can be either narrower or wider than size_t, according to POSIX.
Historically ssize_t was 32 bits and size_t 64 bits on some platforms,
and though I don't know of any current platforms doing that it's easy to
not make assumptions here.

> Based on the above assumptions this can be extended. First instead of size_t
> to return ssize_t, so functions can return -1 and set errno accordingly.

It's better to use ptrdiff_t for this sort of thing, since it's
hardwired into the C language (you can't do any better than ptrdiff_t
anyway, if you use pointer subtraction), whereas ssize_t is merely in
POSIX and is narrower than ptrdiff_t on some (obsolete?) platforms.

> In my humble opinion there is also the choise to choose reallocarray() from
> OpenBSD, which always checks for integer overflows with the following way:
>
> #define MUL_NO_OVERFLOW ((size_t) 1 << (sizeof (size_t) * 4))
> #define MEM_IS_INT_OVERFLOW(nmemb, ssize) \
>   (((nmemb) >= MUL_NO_OVERFLOW || (ssize) >= MUL_NO_OVERFLOW) && \
>    (nmemb) > 0 && SIZE_MAX / (nmemb) < (ssize))

Ouch. That code is not good. An unsigned division at runtime to do
memory allocation? Gnulib does better than that already. Also, Glibc has
some code in this area that we could migrate into Gnulib, that could be
better yet.
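One way to check the multiplication without any division is a compiler-provided overflow builtin. The sketch below assumes GCC 5 or later, or Clang, for __builtin_mul_overflow; it is not a copy of what Gnulib or Glibc actually do, and alloc_array is a hypothetical name. It also caps the total at PTRDIFF_MAX, in line with the limit discussed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocates room for NMEMB objects of SIZE bytes
   each.  Fails with errno = ENOMEM, without calling malloc, if the
   product overflows size_t or exceeds PTRDIFF_MAX.
   Assumption: the compiler provides __builtin_mul_overflow (GCC >= 5
   or Clang), which typically compiles to a multiply plus a flag test
   rather than a division.  */
static void *
alloc_array (size_t nmemb, size_t size)
{
  size_t total;

  assert (size > 0);   /* a zero element size is a caller bug here */

  if (__builtin_mul_overflow (nmemb, size, &total)
      || total > PTRDIFF_MAX)
    {
      errno = ENOMEM;
      return NULL;
    }

  /* Ask for at least one byte so that success never yields NULL.  */
  return malloc (total != 0 ? total : 1);
}
```

On compilers without the builtin, the C23 header <stdckdint.h> offers ckd_mul with the same intent.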
Re: string types
Hi,

On Fri, Dec 27, at 11:51 Bruno Haible wrote:
>   - providing primitives for string allocation reduces the amount of buffer
>     overflow bugs that otherwise occur in this area. [1]

[1] Re: string allocation
https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html

Thanks, i remember this thread, though at the time i couldn't understand
some bits.

>> ag wrote:
> > ... to the actual algorithm (usually conditions that can or can't be met).

> That is the idea behind the container types (list, map) in gnulib. However, I
> don't see how to reasonably transpose this principle to string types.

Ok, let us try, so allow me to summarize with some of (my unqualified)
assumptions (please correct):

  - glibc malloc can request at most PTRDIFF_MAX

  - PTRDIFF_MAX is at least INT_MAX and at most SIZE_MAX
    (PTRDIFF_MAX is INT_MAX in 32bit)

  - SIZE_MAX as (size_t) (-1)

  - ssize_t (s means signed?) can be as big as SIZE_MAX? and SSIZE_MAX
    equals to SIZE_MAX?

  - the returned value of the *printf family of functions dictates
    their limits/range, as they return an int, this can be as INT_MAX
    mostly

Some concerns:

  - truncation errors should be caught

  - memory checkers should catch overflows

  - since there is a "risk"¹ that someone has to take at some point
    (either the programmer or the underlying library code (as strdup()
    does)), the designed interface should lower those risks

There is a proposal from Eric Sanchis to the Austin group, dated 9 Jun
2016, for a string copy/concatenation interface, in which his functions
take both the allocated size and the number of bytes to be written as
arguments (i will inline some of it here, since i was unable to find
his mail in the POSIX mailing list archives).

I used this as a basis (as it was rather intuitive and perfectly suited
for C) to implement my own str_cp, which goes like this:

  /* Copy at most nelem bytes from src, stopping at its NUL;
     return the number of bytes written to dest. */
  size_t byte_cp (char *dest, const char *src, size_t nelem) {
    const char *sp = src;
    size_t len = 0;

    while (len < nelem && *sp) {
      dest[len] = *sp++;
      len++;
    }

    return len;
  }

  size_t str_cp (char *dest, size_t dest_len, const char *src, size_t nelem) {
    if (dest_len == 0)  /* no room even for the NUL */
      return 0;

    size_t num = (nelem > (dest_len - 1) ? dest_len - 1 : nelem);
    size_t len = (NULL == src ? 0 : byte_cp (dest, src, num));
    dest[len] = '\0';
    return len;
  }

Of course it can be done better, but here we have a low level function
(byte_cp) that does only the required checks and returns the actual
number of bytes written to `dest', while str_cp checks if `src' is NULL
and if `nelem' is bigger than `dest_len' - 1 (if it is, then it copies
at most `dest_len' - 1 bytes). It returns 0 or the actual number of
bytes written.

Since this returns the actual number of bytes written, it is up to the
programmer to check whether truncation happened, but there is no
possibility of copying more than `dest_len' - 1 bytes.

Based on the above assumptions this can be extended. First, return
ssize_t instead of size_t, so functions can return -1 and set errno
accordingly.

Eric Sanchis does it a bit differently in his proposal, because his
functions take an extra size_t argument that controls the behavior of
the function (what it will do in the case that the destination length
is less than the source length).

He uses an int as the returned value, which on a successful operation is
either 0 or 1, as follows:

  #define OKNOTRUNC 0   /* copy/concatenation performed without truncation */
  #define OKTRUNC   1   /* copy/concatenation performed with truncation */

And below is the extra information passed as the fifth argument:

  #define TRUNC     0   /* truncation allowed */
  #define NOTRUNC   1   /* truncation not allowed */

In the case of an error, it returns < 0, which is one of:

  #define EDSTPAR  -1   /* Error : bad dst parameters */
  #define ESRCPAR  -2   /* Error : bad src parameters */
  #define EMODPAR  -3   /* Error : bad mode parameter */
  #define ETRUNC   -4   /* Error : not enough space to copy/concatenate
                           and truncation not allowed */

Now, combining all this and if the assumptions are correct, gnulib can
return ssize_t and use this to make its functions work up to SIZE_MAX,
and either use Eric's interface or set errno accordingly.

But to me a function call like:

  str_cp (dest, memsize_of_dest, src, memsize_of_dest - 1)

is quite a common C way to do things, plus we have a way to catch
truncation and not go out of bounds at the same time.

Of course such operations are tied to malloc(). I read the gnulib
document yesterday and i saw that gnulib wraps malloc() with a function
that (quite logically) aborts execution and even allows setting a
callback function.

In my humble opinion there is also the choice to choose reallocarray()
from OpenBSD, which always checks for integer overflows in the following
way:

  #define MUL_NO_OVERFLOW ((size_t) 1 << (sizeof (size_t) * 4))
  #define
Re: string types
Aga wrote:
> I do not know if you can (or if it is possible, how it can be done),
> extract with a way a specific a functionality from gnulib, with the
> absolute necessary code and only that.

gnulib-tool does this. With its --avoid option, the developer can even
customize their notion of "absolutely necessary".

> In a myriad of codebases a string type is implemented at least as:
>   size_t mem_size;
>   size_t num_bytes;
>   char *bytes;

This is actually a string-buffer type. A string type does not need two
size_t members. Long-term experience has shown that using different
types for string and string-buffer is a win, because
  - a string can be put in a read-only virtual memory area, thus
    enforcing immutability (-> reducing multithread problems),
  - providing primitives for string allocation reduces the amount of
    buffer overflow bugs that otherwise occur in this area. [1]

Unfortunately, the common string type in C is 'char *' with NUL
termination, and a different type is hard to establish
  - because developers already know how to use 'char *',
  - because existing functions like printf consume 'char *' strings,
  - because few programs have had the need to correctly handle strings
    with embedded NULs.

> An extended ustring (unicode|utf8) type can include information for its bytes
> with character semantics, like:
>   (utf8 typedef'ed as signed int)
>   utf8 code;    // the integer representation
>   int len;      // the number of the needed bytes
>   int width;    // the number of the occupied cells
>   char buf[5];  // and probably the character representation

Such a type would have a niche use, IMO, because
  - 99% of the processing would not need to access the width (screen
    columns) - so why spend CPU time and RAM to store it and keep it
    up-to-date?
  - 80% of the processing does not care about the Unicode code points
    either, and libraries like libunistring can do the Unicode-aware
    processing.

> But the programmer mind would be probably best if could concentrate to how
> to express the thought (with whatever meaning of what we are calling
> "thought") and follow this flow, or if could concentrate the energy to
> understand the intentions (while reading) of the code (instead of wasting
> self with the "details" of the code) and finally to the actual algorithm
> (usually conditions that can or can't be met).

That is the idea behind the container types (list, map) in gnulib.
However, I don't see how to reasonably transpose this principle to
string types.

Bruno

[1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html