Re: immutable string type
On 12/29/19 11:03 AM, Tim Rühsen wrote:
> A quick web search didn't give me any complaints about immutable in D.
> Do you have some examples or can you elaborate a bit ?

Sorry, I don't recall the details. But the basic problem I had was that the const/mutable/in trichotomy is kinda complicated. Part of the issue is that in D one cannot have a const pointer to a mutable int (the D folks think of this limitation as a feature).
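For comparison, here is a minimal C sketch of the combination mentioned above, a const pointer to a mutable int, next to its opposite. The function names `bump` and `peek` are hypothetical and exist only for illustration.

```c
/* 'slot' is a const pointer to a mutable int: the pointer cannot be
   re-seated inside the function, but the int it points to can change. */
void
bump (int *const slot)
{
  *slot += 1;          /* allowed: the pointed-to int is mutable */
  /* slot = 0; */      /* would not compile: the pointer itself is const */
}

/* The opposite combination: a mutable pointer to a const int. */
int
peek (const int *value)
{
  /* *value = 0; */    /* would not compile: the pointed-to int is const */
  return *value;
}
```

In C the two `const` positions are independent, which is the flexibility the message above says D gives up.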
Re: string types
On 27.12.19 11:51, Bruno Haible wrote:
> Aga wrote:
>> I do not know if you can (or if it is possible, how it can be done)
>> extract a specific functionality from gnulib, with the absolutely
>> necessary code and only that.
>
> gnulib-tool does this. With its --avoid option, the developer can even
> customize their notion of "absolutely necessary".
>
>> In a myriad of codebases a string type is implemented at least as:
>> size_t mem_size;
>> size_t num_bytes;
>> char *bytes;
>
> This is actually a string-buffer type. A string type does not need two size_t
> members. Long-term experience has shown that using different types for string
> and string-buffer is a win, because
> - a string can be put in a read-only virtual memory area, thus enforcing
>   immutability (-> reducing multithread problems),
> - providing primitives for string allocation reduces the amount of buffer
>   overflow bugs that otherwise occur in this area. [1]
> [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html
>

Just FYI, here is a string concatenation function without ellipsis, analogous to writev() and struct iovec - just a suggestion. Instead of 'struct strvec' a new string_t type would be handy.

  #include <stddef.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  struct strvec {
    char *strv_base;
    size_t strv_len;
  };

  /* Concatenate all entries of strv, up to the entry whose strv_base is NULL.
     Returns a malloc'ed string, or NULL if allocation fails. */
  __attribute__ ((nonnull (1)))
  char *concat_stringv(const struct strvec *strv)
  {
    const struct strvec *str;
    size_t len = 0;
    char *buf;

    /* First pass: sum up the lengths. */
    for (str = strv; str->strv_base; str++)
      len += str->strv_len;

    if (!(buf = malloc(len + 1)))
      return buf;

    /* Second pass: copy the pieces one after another. */
    len = 0;
    for (str = strv; str->strv_base; len += str->strv_len, str++)
      memcpy(buf + len, str->strv_base, str->strv_len);

    buf[len] = 0;
    return buf;
  }

  int main(void)
  {
    char *s = concat_stringv((struct strvec []) { { "a", 1 }, { "b", 1 }, { NULL } });

    if (!s)
      return 1;
    puts(s);
    free(s);
    return 0;
  }

In GNU Wget2 we already have a type similar to string_t. It is just used in cases where we need pointer + len of URLs inside const HTML/XML/CSS data.

  typedef struct {
    const char *p; //!< pointer to memory region
    size_t len;    //!< length of memory region
  } wget_string;

So maybe we need a string_t and a const_string_t type !? (to avoid casting from const char *)

Regards, Tim
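A minimal sketch of what such a pair of types might look like, assuming the names `string_t` and `const_string_t` from the suggestion above. The helper functions `string_as_const` and `const_string_equal` are hypothetical and are not part of gnulib or Wget2.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pointer + length views, one mutable and one read-only. */
typedef struct { char *p; size_t len; } string_t;
typedef struct { const char *p; size_t len; } const_string_t;

/* Going from the mutable view to the const view needs no cast. */
const_string_t
string_as_const (string_t s)
{
  const_string_t c = { s.p, s.len };
  return c;
}

/* Compare two views by length and bytes.  The regions need not be
   NUL-terminated, so strcmp cannot be used here. */
int
const_string_equal (const_string_t a, const_string_t b)
{
  return a.len == b.len && (a.len == 0 || memcmp (a.p, b.p, a.len) == 0);
}
```

With two distinct types, a function that only reads can take a `const_string_t` pointing into const HTML/XML/CSS data, while the reverse direction (const view to mutable view) would still require an explicit and visible cast.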
Re: string types
On Sun, Dec 29, at 10:19 Bruno Haible wrote:
> Aga wrote:
> > - the returned value of the *printf family of functions dictates their
> >   limits/range, as they return an int, this can be as INT_MAX mostly
>
> Yes, we need new implementations of the *asprintf functions that are not
> limited to returning strings of maximum length INT_MAX.

There is also the question of how current functions behave with buffers over INT_MAX, and what to do with such large buffers if stdio cannot handle them reliably. And what does POSIX say about this, if it says anything at all?
Re: immutable string type
On 29.12.19 19:49, Paul Eggert wrote:
> On 12/29/19 4:07 AM, Tim Rühsen wrote:
>> Introducing a stronger 'const' could be helpful in some situations
>
> D has 'immutable' for that. It doesn't work as well as one might think.

It all depends on the implementation / definition of 'immutable' and how it is used. Not sure what the actual problems are within D, though.

A quick web search didn't give me any complaints about immutable in D. Do you have some examples or can you elaborate a bit ?

Regards, Tim
Re: immutable string type
On 12/29/19 4:07 AM, Tim Rühsen wrote:
> Introducing a stronger 'const' could be helpful in some situations

D has 'immutable' for that. It doesn't work as well as one might think.
Re: string types
On Sun, Dec 29, at 10:19 Bruno Haible wrote:
> I agree with the goal. How to do it precisely, is an art however.

Ok, let's see what we have so far.

First, the base (easy): that is the maximum size malloc can request from the kernel, and that is PTRDIFF_MAX. Here we also have to forget SIZE_MAX, as it is not guaranteed that PTRDIFF_MAX equals SIZE_MAX.

Second, the requirement on the function return value (easy): a signed type. There is an agreement that the introduced functions should return -1 on error, else the interface will be complicated, and we do not want complication. So ptrdiff_t is adequate, since ptrdiff_t is in standard C and comes with stddef.h.

The rest:

Catching out-of-bounds conditions (rather easy, and already implemented in snprintf): after the destination argument follows an argument with the allocated destination size (from the stack or from the heap). Now, snprintf uses size_t here, but (question) isn't this a contradiction with the above? Probably not, but it's better to ask and de-confuse things (as clarity is a requirement: semantics should be understandable by mere humans).

Another concern: what if the destination is NULL? Should the internal functions proceed by allocating a buffer of the requested size? What will they do if the requested size is <= 0? There are precedents here, like realpath(), which allocates a buffer that the user has to free.

Also: static internal variables are considered harmful. But sometimes it is desirable to have some data in a private place, protected or handy to work with without side effects. This is solved, however, with the new im(mutable) module.

Catching truncation (first priority maybe): There is a choice to complicate the interface a bit and return more values than -1, but this is rejected by the perfectly legitimate assumption that humans are lazy, probably because they have been exposed to try/catch (not bad if you ask, but inappropriate for C).
The other thing that could be done is to return -1 and set errno according to the error. But does such an error code exist or not? If not, ETRUNC should be introduced. Few programmers will take the risk of making their program depend on something that is not standard, but perhaps they will (doubtful though at this stage).

The other thing that is left is to check the returned value. Now, in snprintf(3) there are notes about this and a method to calculate truncation (misty though):

  The functions snprintf() and vsnprintf() do not write more than size bytes (including the terminating null byte ('\0')). If the output was truncated due to this limit, then the return value is the number of characters (excluding the terminating null byte) which would have been written to the final string if enough space had been available. Thus, a return value of size or more means that the output was truncated. (See also below under NOTES.)

"which would have been written?" Why not always the bytes that have been written? Ok, I got it after a break; still difficult to parse though, and for what?

We have to admit that this is a programmer error. [Sh|H]e should know her strings. But we still want to help here. How? Three choices come to mind.

1. Use a bit-map flag argument to control the function behavior. This adds verbosity, but at the same time allows extensibility. Which conditions could be covered with that? Perhaps returning an error if the destination is NULL and the flag directs the function to return in this condition. Same with the source. Very convenient, but still verbose, as you have to learn another set of FLAGS.

2. Introduce wrappers. Actually, wrappers will maybe be used either way. Or introduce a complete set of the same functions, post-fixed with _un (to denote unsafety, if _s (not sure) means safe).

3. The programmer knows best. Based on that, either continue with the implementation as it is, or (where appropriate) use a fourth argument for the requested bytes to be written.
And sleep with a clear conscience, knowing that you did the best you could. He should do the same.

Now, what concerns me most is the userspace and all these functions that take a variable number of arguments and a format string. I was fighting in my code to find a reliable way to know the actual number of bytes produced by the sum of those arguments (as it can be really difficult to catch some of the conditions described above).

You also said at one point that no one who does system programming will use this set of functions (because of the overhead). We could go further and say: no one sane (sorry) would want to format big strings. Such functions are very prone to errors, but they are easy to work with. So what should we do with them?

There is a method to calculate the size beforehand (means before the declaration), given in the printf(3) Linux man page:

  va_start(ap, fmt);
  size = vsnprintf(p, size, fmt, ap);
  va_end(ap);

So it parses the varargs twice. Plus a compiler version (not 9*),
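The size-first method referred to above can be sketched as follows. This is a minimal illustration modelled on the approach in the printf(3) man page, not a proposed gnulib interface; the name `format_alloc` is hypothetical.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format into a freshly allocated buffer.
   Returns NULL on a formatting or allocation failure; the caller
   frees the result. */
char *
format_alloc (const char *fmt, ...)
{
  va_list ap;

  /* Pass 1: measure.  With a size of 0, vsnprintf writes nothing and
     returns the length the output would have (excluding the NUL). */
  va_start (ap, fmt);
  int n = vsnprintf (NULL, 0, fmt, ap);
  va_end (ap);
  if (n < 0)
    return NULL;

  size_t size = (size_t) n + 1;
  char *p = malloc (size);
  if (p == NULL)
    return NULL;

  /* Pass 2: format for real.  The argument list is walked a second time. */
  va_start (ap, fmt);
  n = vsnprintf (p, size, fmt, ap);
  va_end (ap);
  if (n < 0)
    {
      free (p);
      return NULL;
    }
  return p;
}
```

Because the length travels through an `int`, this sketch inherits the INT_MAX limit discussed earlier in the thread; it only removes the need to guess the buffer size.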
Re: immutable string type
On 29.12.19 10:45, Bruno Haible wrote:
> Tim Rühsen wrote:
>> the use cases are mostly in the testing area (especially fuzzing).
>
> Indeed. During fuzzing, you want to check against any kind of buggy/undefined
> behaviour, and writing into arbitrary memory is one of these kinds.
>
> This brings up the question: Should such a facility be in a Sanitizer and
> not in a library? I think the answer is "no", because
>
> - Writing into a string is not invalid in C. Even casting a 'const char *'
>   to 'char *' and then writing into it is valid. The reason is that the
>   C standard only makes statements about a program as a whole and therefore
>   cannot express constraints such as "function A is allowed to write into
>   the memory object M but function B is not".

I agree that such a thing doesn't make sense in a sanitizer, as only the application/programmer knows about the semantics.

>
> - Integer overflow checking, for example, is available in both the Sanitizers
>   and library code. Apparently it is useful enough that some applications
>   want to have it enabled in production code. I believe the same will be
>   true for immutable strings or memory regions.

Makes sense, especially as lots of code is never fuzzed to a full degree.

>
>> As a more general approach, a function that switches already allocated
>> memory into read-only memory would be handy. Like in
>> - m = malloc()
>> - initialize m with some data
>> - if in debug mode: call memmap_readonly(m) - from this point on 'm' is
>>   read-only and a write leads to a segmentation fault.
>> - ...
>> - free(m)
>
> Hardware has write barriers only on the page level. You can't easily request
> a write barrier for a sequence of, say, 30 bytes. To accommodate this, the
> API needs to have a certain shape. Paul wrote:

True, and it means that immalloc() always allocates multiples of the page size (is a page 4096 bytes on x86_64 ?). How do you plan to optimize memory usage here ?
>
>> p = immalloc (sizeof *p);
>> p->x = whatever; p->y = something; ...
>> imfreeze (p, sizeof *p);
>> [no changes to *p allowed here]
>> imfree (p);
>
> The third line needs to be something like
>
>   p = imfreeze (p, sizeof *p);
>
> because the "writable p" and the "read-only p" will be at different virtual
> addresses.

Ah, now I get it - you have two virtual memory addresses pointing to the same physical memory area.

Just throwing in another thought as an addition to immalloc etc:

Introducing a stronger 'const' could be helpful in some situations - the compiler could find possible violations during the build phase. Stronger in the sense of "not allowed to be cast to non-const, but allowed to be cast to const". This can be achieved by either extending the C standard, by adding a new option to gcc or by adding a new __attribute__ for gcc.

Example:

  #define imconst __attribute__ ((immutable))
  #define transfer_ptr(dst) ({ typeof(dst) _t=(dst); (dst)=NULL; _t; })

  char *tmp = malloc() ... initialize tmp...
  imconst char *m = transfer_ptr(tmp); // now m is safe against writing if the architecture has an MMU
  char *m2 = strdup(m); // allowed
  memcpy(m, m2, ...); // compiler error (implicit cast 'imconst char *' to 'void *' not allowed)
  memcpy((void *) m, m2, ...); // same compiler error (without 'implicit')

Maybe we can think this through to either drop it or make a suggestion to the gcc folks.

Regards, Tim
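The "two virtual addresses for the same physical memory" idea discussed above can be sketched with POSIX calls. This is only an illustration under assumptions: the names `dual_map`, `dual_map_create`, `dual_map_freeze` and `dual_map_destroy` are hypothetical, a temporary file is used as backing store, and it is not how the proposed immalloc()/imfreeze() are implemented.

```c
#ifndef _POSIX_C_SOURCE
# define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical pair of views onto the same pages. */
struct dual_map
{
  void *rw;        /* writable view, used while initializing */
  const void *ro;  /* read-only view, handed out afterwards */
  size_t size;     /* mapped size, a multiple of the page size */
};

/* Map SIZE bytes (rounded up to whole pages) twice: once writable and
   once read-only.  Returns 0 on success, -1 on failure. */
int
dual_map_create (struct dual_map *m, size_t size)
{
  long page = sysconf (_SC_PAGESIZE);
  if (page <= 0 || size == 0 || size > (size_t) -1 - (size_t) page)
    return -1;
  size_t rounded = (size + (size_t) page - 1) / (size_t) page * (size_t) page;

  FILE *fp = tmpfile ();            /* unnamed backing file */
  if (fp == NULL)
    return -1;
  int fd = fileno (fp);
  if (ftruncate (fd, (off_t) rounded) != 0)
    {
      fclose (fp);
      return -1;
    }

  void *rw = mmap (NULL, rounded, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  void *ro = mmap (NULL, rounded, PROT_READ, MAP_SHARED, fd, 0);
  fclose (fp);                      /* the mappings keep the pages alive */
  if (rw == MAP_FAILED || ro == MAP_FAILED)
    {
      if (rw != MAP_FAILED)
        munmap (rw, rounded);
      if (ro != MAP_FAILED)
        munmap (ro, rounded);
      return -1;
    }
  m->rw = rw;
  m->ro = ro;
  m->size = rounded;
  return 0;
}

/* "Freeze": drop the writable view so only the read-only address remains. */
void
dual_map_freeze (struct dual_map *m)
{
  munmap (m->rw, m->size);
  m->rw = NULL;
}

/* Release whatever views are still mapped. */
void
dual_map_destroy (struct dual_map *m)
{
  if (m->rw != NULL)
    munmap (m->rw, m->size);
  munmap ((void *) m->ro, m->size);
  m->rw = NULL;
  m->ro = NULL;
  m->size = 0;
}
```

A caller would initialize the object through `rw`, call `dual_map_freeze`, and from then on use only `ro`, which is why the freeze step in the quoted API has to return a new pointer.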
Re: immutable string type
Tim Rühsen wrote:
> the use cases are mostly in the testing area (especially fuzzing).

Indeed. During fuzzing, you want to check against any kind of buggy/undefined behaviour, and writing into arbitrary memory is one of these kinds.

This brings up the question: Should such a facility be in a Sanitizer and not in a library? I think the answer is "no", because

- Writing into a string is not invalid in C. Even casting a 'const char *' to 'char *' and then writing into it is valid. The reason is that the C standard only makes statements about a program as a whole and therefore cannot express constraints such as "function A is allowed to write into the memory object M but function B is not".

- Integer overflow checking, for example, is available in both the Sanitizers and library code. Apparently it is useful enough that some applications want to have it enabled in production code. I believe the same will be true for immutable strings or memory regions.

> As a more general approach, a function that switches already allocated
> memory into read-only memory would be handy. Like in
> - m = malloc()
> - initialize m with some data
> - if in debug mode: call memmap_readonly(m) - from this point on 'm' is
>   read-only and a write leads to a segmentation fault.
> - ...
> - free(m)

Hardware has write barriers only on the page level. You can't easily request a write barrier for a sequence of, say, 30 bytes. To accommodate this, the API needs to have a certain shape. Paul wrote:

> p = immalloc (sizeof *p);
> p->x = whatever; p->y = something; ...
> imfreeze (p, sizeof *p);
> [no changes to *p allowed here]
> imfree (p);

The third line needs to be something like

  p = imfreeze (p, sizeof *p);

because the "writable p" and the "read-only p" will be at different virtual addresses.

Bruno
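The page-granularity point above can be illustrated with mmap() and mprotect(). This is a minimal sketch assuming a POSIX system that provides MAP_ANONYMOUS; the helper name `frozen_copy` is hypothetical, and the single-address approach shown here is the simpler "switch the pages to read-only" variant, not the two-address scheme.

```c
#ifndef _DEFAULT_SOURCE
# define _DEFAULT_SOURCE 1
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: copy LEN bytes from SRC into fresh pages, then
   make those pages read-only.  Returns NULL on failure.  *MAPPED
   receives the real, page-rounded size, which munmap needs later. */
const void *
frozen_copy (const void *src, size_t len, size_t *mapped)
{
  long page = sysconf (_SC_PAGESIZE);
  if (page <= 0 || len == 0 || len > (size_t) -1 - (size_t) page)
    return NULL;
  size_t rounded = (len + (size_t) page - 1) / (size_t) page * (size_t) page;

  void *p = mmap (NULL, rounded, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return NULL;
  memcpy (p, src, len);

  /* Protection applies to whole pages, so even a 30-byte object
     occupies (and protects) at least one full page. */
  if (mprotect (p, rounded, PROT_READ) != 0)
    {
      munmap (p, rounded);
      return NULL;
    }
  *mapped = rounded;
  return p;
}
```

After `frozen_copy` returns, reads through the pointer work as usual, while a write through a cast-away-const pointer would fault instead of silently corrupting the data.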
Re: immutable string type
Ben Pfaff wrote:
> This sort of thing won't work on systems with virtually indexed caches,
> at least not without inserting explicit flushes.

Good point. Yes, explicit data cache flushing instructions or - in the worst case - system calls are necessary.

> I don't know whether
> virtually indexed caches still exist in the wild.

Yes, data caches are indexed by virtual address, not by physical address, on many platforms. That makes the common case of a cache lookup faster.

Bruno
Re: string types
Aga wrote:
> - the returned value of the *printf family of functions dictates their
>   limits/range, as they return an int, this can be as INT_MAX mostly

Yes, we need new implementations of the *asprintf functions that are not limited to returning strings of maximum length INT_MAX.

> - as since there is a "risk"¹ that someone has to take at some point
>   (either the programmer or the underlying library code (as strdup() does)),
>   the designed interface should lower those risks

I agree with the goal. How to do it precisely, is an art however.

> In the case of an error, returns > 0 which is either:
> #define EDSTPAR -1  /* Error : bad dst parameters */
> #define ESRCPAR -2  /* Error : bad src parameters */
> #define EMODPAR -3  /* Error : bad mode parameter */
> #define ETRUNC  -4  /* Error : not enough space to copy/concatenate
>                        and truncation not allowed */

I don't think an interface for string concatenation with that many error cases will be successful. Programmers are lazy, therefore
- some will not check the errors at all,
- some will only check for the fourth one (because "I'm not passing invalid arguments, after all"),
- among those few that implement all 4 checks, half will get it wrong (that's my experience with similarly complex functions like mbrtowc() or iconv()).

For an interface to be successful, it needs to be simpler than that.

Bruno
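One shape a simpler interface could take is sketched below: a bounded append with a single failure value. This is an assumption-laden illustration, not an interface proposed by either poster; the name `str_append` is hypothetical, and the ptrdiff_t return type follows the suggestion made elsewhere in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical function: append SRC to the NUL-terminated string in
   DST, whose total capacity is DST_SIZE bytes.  Returns the new length
   on success.  Returns -1 if DST holds no NUL within DST_SIZE bytes or
   if the result would not fit; in both cases DST is left unchanged. */
ptrdiff_t
str_append (char *dst, size_t dst_size, const char *src)
{
  const char *nul = memchr (dst, '\0', dst_size);
  if (nul == NULL)
    return -1;

  size_t used = (size_t) (nul - dst);
  size_t add = strlen (src);

  /* Need used + add + 1 <= dst_size, written so it cannot overflow. */
  if (add >= dst_size - used)
    return -1;

  memcpy (dst + used, src, add + 1);  /* copies the terminating NUL too */
  return (ptrdiff_t) (used + add);
}
```

The caller has exactly one condition to check (`< 0`), and a failed call never leaves a truncated string behind, at the cost of not telling the caller which of the two failure causes occurred.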