Re: immutable string type

2019-12-29 Thread Paul Eggert
On 12/29/19 11:03 AM, Tim Rühsen wrote:
> A quick web search didn't give me any complaints about immutable in D.
> Do you have some examples or can you elaborate a bit ?

Sorry, I don't recall the details. But the basic problem I had was that the
const/mutable/in trichotomy is kinda complicated. Part of the issue is that in D
one cannot have a const pointer to a mutable int (the D folks think of this
limitation as a feature).



Re: string types

2019-12-29 Thread Tim Rühsen
On 27.12.19 11:51, Bruno Haible wrote:
> Aga wrote:
>> I do not know if
>> you can (or if it is possible, how it can be done), extract with a way a specific
>> a functionality from gnulib, with the absolute necessary code and only that.
> 
> gnulib-tool does this. With its --avoid option, the developer can even customize
> their notion of "absolutely necessary".
> 
>> In a myriad of codebases a string type is implemented at least as:
>>   size_t mem_size;
>>   size_t num_bytes;
>>   char *bytes;
> 
> This is actually a string-buffer type. A string type does not need two size_t
> members. Long-term experience has shown that using different types for string
> and string-buffer is a win, because
>   - a string can be put in a read-only virtual memory area, thus enforcing
> immutability (-> reducing multithread problems),
>   - providing primitives for string allocation reduces the amount of buffer
> overflow bugs that otherwise occur in this area. [1]
> [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html
>

Just FYI,

here is a string concatenation function without ellipsis, analogous to
writev() and struct iovec - just a suggestion. Instead of 'struct
strvec' a new string_t type would be handy.

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct strvec {
  char *strv_base;   /* NULL in the last element terminates the array */
  size_t strv_len;
};

__attribute__ ((nonnull (1)))
char *concat_stringv(const struct strvec *strv)
{
  const struct strvec *str;
  size_t len = 0;
  char *buf;

  /* First pass: sum up the lengths. */
  for (str = strv; str->strv_base; str++)
    len += str->strv_len;

  if (!(buf = malloc(len + 1)))
    return buf;

  /* Second pass: copy the pieces back to back. */
  len = 0;
  for (str = strv; str->strv_base; len += str->strv_len, str++)
    memcpy(buf + len, str->strv_base, str->strv_len);

  buf[len] = 0;

  return buf;
}

int main(void)
{
  char *s = concat_stringv((struct strvec []) {
    { "a", 1 },
    { "b", 1 },
    { NULL }
  });

  if (!s)
    return 1;

  puts(s);

  free(s);
  return 0;
}


In GNU Wget2 we already have a type similar to string_t. It is only used in
cases where we need pointer + len of URLs inside const HTML/XML/CSS data.

typedef struct {
  const char *p; //!< pointer to memory region
  size_t len;    //!< length of memory region
} wget_string;


So maybe we need a string_t and a const_string_t type !? (to avoid
casting from const char *)
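
To make the idea a bit more concrete, here is a minimal sketch of what such a pair of types could look like. The names string_t, const_string_t and all helper functions are hypothetical; this is not an existing gnulib or Wget2 API, just an illustration of how the const view can be obtained without a cast.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view types: pointer plus length, no NUL terminator needed. */
typedef struct {
  char *p;        /* start of a writable memory region */
  size_t len;     /* length of the region in bytes */
} string_t;

typedef struct {
  const char *p;  /* start of a read-only memory region */
  size_t len;     /* length of the region in bytes */
} const_string_t;

/* Going from the writable view to the const view needs no cast. */
static inline const_string_t string_as_const(string_t s)
{
  return (const_string_t) { s.p, s.len };
}

/* Build a const view of a NUL-terminated C string. */
static inline const_string_t const_string_from_cstr(const char *s)
{
  assert(s != NULL);
  return (const_string_t) { s, strlen(s) };
}

/* Return 1 if both views cover the same bytes, 0 otherwise. */
static inline int const_string_equal(const_string_t a, const_string_t b)
{
  return a.len == b.len && (a.len == 0 || memcmp(a.p, b.p, a.len) == 0);
}
```

The other direction (const_string_t to string_t) is deliberately missing, so code that only receives a const view has no helper for getting a writable pointer back.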

Regards, Tim





Re: string types

2019-12-29 Thread ag
On Sun, Dec 29, at 10:19 Bruno Haible wrote:
> Aga wrote:
> >   - the returned value of the *printf family of functions dictates their
> > limits/range, as they return an int, this can be as INT_MAX mostly
> 
> Yes, we need new implementations of the *asprintf functions that are not
> limited to returning strings of maximum length INT_MAX.

There is also the question of how current functions behave with buffers over INT_MAX,
and what to do with such large buffers if stdio cannot handle them reliably.
And what does POSIX say about this, if anything?



Re: immutable string type

2019-12-29 Thread Tim Rühsen
On 29.12.19 19:49, Paul Eggert wrote:
> On 12/29/19 4:07 AM, Tim Rühsen wrote:
>> Introducing a stronger 'const' could be helpful in some situations
> 
> D has 'immutable' for that. It doesn't work as well as one might think.

It all depends on the implementation / definition of 'immutable' and how
it is used. Not sure what the actual problems are within D, though.

A quick web search didn't give me any complaints about immutable in D.
Do you have some examples or can you elaborate a bit ?

Regards, Tim





Re: immutable string type

2019-12-29 Thread Paul Eggert
On 12/29/19 4:07 AM, Tim Rühsen wrote:
> Introducing a stronger 'const' could be helpful in some situations

D has 'immutable' for that. It doesn't work as well as one might think.



Re: string types

2019-12-29 Thread ag
On Sun, Dec 29, at 10:19 Bruno Haible wrote:
> I agree with the goal. How to do it precisely, is an art however.

Ok, let's see what we have so far.

First, the base (easy): that is the maximum size malloc can request from the kernel,
and that is PTRDIFF_MAX. Here we also have to forget SIZE_MAX, as it is not
guaranteed that PTRDIFF_MAX equals SIZE_MAX.

Second, the requirement for the function return value (easy): a signed type.
There is agreement that the introduced functions should return -1 on error,
otherwise the interface gets complicated, and we do not want complication.
So ptrdiff_t is adequate, since ptrdiff_t is in standard C and comes
with stddef.h.

The rest:

Catching out-of-bounds conditions (rather easy, and already implemented in
snprintf): the destination argument is followed by an argument with the
allocated destination size (from the stack or from the heap). Now, snprintf
uses size_t here, but (question) isn't this in contradiction with the above?
Probably not, but it's better to ask and de-confuse things (as clarity
is a requirement: the semantics should be understandable by mere humans).

Another concern: what if the destination is NULL? Should the internal functions
proceed by allocating a buffer of the requested size? What will they do
if the requested size is <= 0?
There are precedents here, like realpath(), which allocates a buffer that
is up to the user to free.

Also, internal variables declared as static are considered harmful. But sometimes
it is desirable to have some data in a private place, protected and handy to work
with, without side effects. This is solved, however, by the new im(mutable) module.

Catching truncation (first priority, maybe): there is the choice to complicate
the interface a bit and return more values than -1, but this is rejected by the
perfectly legitimate assumption that humans are lazy, probably because they have been
exposed to try/catch (not bad if you ask me, but inappropriate for C).
The other thing that could be done is to return -1 and set errno according to
the error. But does such an error code exist? If not, ETRUNC would have to be
introduced. Few programmers will take the risk of making their program depend
on something that is not standard, but perhaps they will (doubtful, though, at
this stage).
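
As a rough sketch of the "return -1 and set errno" variant: the function name sketch_strcpy and the macro SKETCH_ETRUNC are hypothetical, and since ETRUNC does not exist as a standard errno value, EOVERFLOW is used here purely as a stand-in. The assumption is that the function refuses to truncate rather than truncating silently.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ETRUNC is not a standard errno value; EOVERFLOW stands in for it here. */
#define SKETCH_ETRUNC EOVERFLOW

/* Copy src into dst, whose capacity is dst_size bytes including the NUL.
   Returns the number of bytes copied (excluding the NUL), or -1 with errno
   set if an argument is invalid or the result would not fit.  On failure
   dst is left untouched, so nothing is truncated silently. */
ptrdiff_t sketch_strcpy(char *dst, size_t dst_size, const char *src)
{
  if (dst == NULL || src == NULL
      || dst_size == 0 || dst_size > (size_t) PTRDIFF_MAX) {
    errno = EINVAL;
    return -1;
  }

  size_t len = strlen(src);
  if (len >= dst_size) {
    errno = SKETCH_ETRUNC;
    return -1;
  }

  memcpy(dst, src, len + 1);   /* the + 1 copies the terminating NUL */
  assert(dst[len] == '\0');
  return (ptrdiff_t) len;
}
```

A caller only has to test for -1; looking at errno is optional, for those who want to tell a bad argument from a too small destination.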

The other thing that is left is to check the returned value. Now, in snprintf(3)
there are notes about this and a method to detect truncation (misty, though).

   The functions snprintf() and vsnprintf() do not write more than size
   bytes (including the terminating null byte ('\0')).  If the output was
   truncated due to this limit, then the return value is the number of
   characters (excluding the terminating null byte) which would have been
   written to the final string if enough space had been available.  Thus,
   a return value of size or more means that the output was truncated.
   (See also below under NOTES.)
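
The rule from the man page excerpt above can be checked as in the following small sketch. The function name sketch_format_pair is hypothetical. Because the return value is the length the complete output needs, comparing it with the buffer size is enough to detect truncation, and the same number tells the caller how big a retry buffer must be.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format "<name>=<value>" into dst.  Returns 1 if the whole string fit,
   0 if it was truncated, -1 on an output error. */
int sketch_format_pair(char *dst, size_t size, const char *name, int value)
{
  assert(dst != NULL && size > 0);

  int n = snprintf(dst, size, "%s=%d", name, value);
  if (n < 0)
    return -1;

  /* n counts the bytes the full output needs, not the bytes stored. */
  return (size_t) n < size;
}
```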

"which would have been written?" why not always the bytes that had been written?

Ok, I got it after a break; it is still difficult to parse, though, and for what? We
have to admit that this is a programmer error. [Sh|H]e should know her strings.
But we still want to help here. How? Three choices come to mind.

1.
Use a bitmap flag argument to control the function behavior. This adds
verbosity, but at the same time allows extensibility. Which conditions could
be covered with that? Perhaps returning an error if the destination is NULL and
the flag directs the function to return in this condition. Same with
the source. Very convenient, but still verbose, as you have to learn another
set of FLAGS.

2.
Introduce wrappers. Actually, wrappers may be used either way.
Or introduce a complete set of the same functions, suffixed with _un (to
denote unsafety, if _s (not sure) means safe).

3.
The programmer knows best. Based on that, either continue with the
implementation as it is, or (where appropriate) use a fourth argument
for the requested number of bytes to be written. And sleep with a clear
conscience, knowing that you did the best you could. He should do the same.

Now, what concerns me most is userspace and all those functions that
take a variable number of arguments and a format string. I was fighting
in my code to find a reliable way to know the actual number of bytes produced
by the sum of those arguments (as it can be really difficult to catch some of
the conditions described above). You also said at one point that no one
who does system programming will use this set of functions (because of the
overhead). We could go further and say: no one sane (sorry) would want
to format big strings. Such functions are very prone to errors, but they are
easy to work with. So what should we do with them? There is a method
to calculate the size beforehand (that is, before the declaration), and it is
given in the printf(3) Linux man page.

  va_start(ap, fmt);
  size = vsnprintf(p, size, fmt, ap);
  va_end(ap);
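
For reference, a self-contained sketch of that two-pass technique could look like this. The function name sketch_make_message is hypothetical; the calls themselves (vsnprintf with a size of 0, va_copy) are standard C99. Note that the length still travels through an int, so the INT_MAX limit discussed in this thread applies here too.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Return a freshly malloc'ed formatted string, or NULL on failure.
   The caller frees the result. */
char *sketch_make_message(const char *fmt, ...)
{
  va_list ap, ap2;
  assert(fmt != NULL);

  va_start(ap, fmt);
  va_copy(ap2, ap);   /* a va_list may only be traversed once */

  /* First pass: size 0, nothing is stored, only the length is computed. */
  int n = vsnprintf(NULL, 0, fmt, ap);
  va_end(ap);

  char *p = NULL;
  if (n >= 0 && (p = malloc((size_t) n + 1)) != NULL) {
    /* Second pass: format for real into the exactly sized buffer. */
    if (vsnprintf(p, (size_t) n + 1, fmt, ap2) != n) {
      free(p);
      p = NULL;
    }
  }
  va_end(ap2);
  return p;
}
```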

So it parses the varargs twice. Plus a compiler version (not 9*), 

Re: immutable string type

2019-12-29 Thread Tim Rühsen
On 29.12.19 10:45, Bruno Haible wrote:
> Tim Rühsen wrote:
>> the use cases are mostly in the testing area (especially fuzzing).
> 
> Indeed. During fuzzing, you want to check against any kind of buggy/undefined
> behaviour, and writing into arbitrary memory is one of these kinds.
> 
> This brings up the question: Should such a facility be in a Sanitizer and
> not in a library? I think the answer is "no", because
> 
>   - Writing into a string is not invalid in C. Even casting a 'const char *'
> to 'char *' and then writing into it is valid. The reason is that the
> C standard only makes statements about a program as a whole and therefore
> cannot express constraints such as "function A is allowed to write into
> the memory object M but function B is not".

I agree that such a thing doesn't make sense in a sanitizer, as only the
application/programmer knows about the semantics.

> 
>   - Integer overflow checking, for example, is available in both the Sanitizers
> and library code. Apparently it is useful enough that some applications
> want to have it enabled in production code. I believe the same will be
> true for immutable strings or memory regions.

Makes sense, especially as lots of code is never being fuzzed to a full
degree.

> 
>> As a more general approach, a function that switches already allocated
>> memory into read-only memory would be handy. Like in
>>  - m = malloc()
>>  - initialize m with some data
>>  - if in debug mode: call memmap_readonly(m) - from this point on 'm' is
>> read-only and a write leads to a segmentation fault.
>>  - ...
>>  - free(m)
> 
> Hardware has write barriers only on the page level. You can't easily request
> a write barrier for a sequence of, say, 30 bytes. To accommodate this, the
> API needs to have a certain shape. Paul wrote:

True, and it means that immalloc() always allocates multiples of the page
size (page is 4096 bytes on x86_64 ?). How do you plan to optimize
memory usage here ?
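
To illustrate the page granularity, here is a minimal sketch of the simpler "freeze in place" idea (my memmap_readonly() suggestion), not of the proposed immalloc()/imfreeze(). All sketch_* names are hypothetical, and the code assumes a POSIX system that provides mmap() with MAP_ANONYMOUS and mprotect() (Linux, the BSDs, macOS).

```c
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS with strict -std= on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round size up to a whole number of pages; returns 0 on failure. */
size_t sketch_round_to_pages(size_t size)
{
  long page = sysconf(_SC_PAGESIZE);
  if (page <= 0)
    return 0;
  size_t ps = (size_t) page;
  size_t pages = size / ps + (size % ps != 0);
  if (pages == 0)
    pages = 1;                    /* even 0 bytes cost one page */
  if (pages > (size_t) -1 / ps)
    return 0;                     /* the multiplication would overflow */
  return pages * ps;
}

/* Allocate writable, page-aligned memory; returns NULL on failure. */
void *sketch_alloc_pages(size_t size)
{
  size_t len = sketch_round_to_pages(size);
  if (len == 0)
    return NULL;
  void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? NULL : p;
}

/* Make the pages read-only; a later write raises SIGSEGV. */
int sketch_freeze_pages(void *p, size_t size)
{
  assert(p != NULL);
  return mprotect(p, sketch_round_to_pages(size), PROT_READ);
}

/* Release the pages again. */
int sketch_free_pages(void *p, size_t size)
{
  return munmap(p, sketch_round_to_pages(size));
}
```

With this naive scheme a 30 byte object occupies a whole page, which is exactly the memory usage concern above.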

> 
>>  p = immalloc (sizeof *p);
>>  p->x = whatever; p->y = something; ...
>>  imfreeze (p, sizeof *p);
>>  [no changes to *p allowed here]
>>  imfree (p);
> 
> The third line needs to be something like
> 
>    p = imfreeze (p, sizeof *p);
> 
> because the "writable p" and the "read-only p" will be at different virtual
> addresses.

Ah, now I get it - you have two virtual memory addresses pointing to the
same physical memory area.


Just throwing in another thought as an addition to immalloc etc:

Introducing a stronger 'const' could be helpful in some situations - the
compiler could find possible violations during the build phase. Stronger
in the sense of "not allowed to be cast to non-const, but allowed to be
cast to const". This can be achieved by either extending the C standard,
by adding a new option to gcc or by adding a new __attribute__ for gcc.

Example:
#define imconst __attribute__ ((immutable))
#define transfer_ptr(dst) ({ typeof(dst) _t=(dst); (dst)=NULL; _t; })

char *tmp = malloc();
... initialize tmp ...
imconst char *m = transfer_ptr(tmp);
// now m is safe against writing if the architecture has an MMU

char *m2 = strdup(m); // allowed
memcpy(m, m2, ...); // compiler error (implicit cast 'imconst char *' to
                    // 'void *' not allowed)
memcpy((void *) m, m2, ...); // same compiler error (without 'implicit')

Maybe we can think this through to either drop it or make a suggestion
to the gcc folks.

Regards, Tim





Re: immutable string type

2019-12-29 Thread Bruno Haible
Tim Rühsen wrote:
> the use cases are mostly in the testing area (especially fuzzing).

Indeed. During fuzzing, you want to check against any kind of buggy/undefined
behaviour, and writing into arbitrary memory is one of these kinds.

This brings up the question: Should such a facility be in a Sanitizer and
not in a library? I think the answer is "no", because

  - Writing into a string is not invalid in C. Even casting a 'const char *'
to 'char *' and then writing into it is valid. The reason is that the
C standard only makes statements about a program as a whole and therefore
cannot express constraints such as "function A is allowed to write into
the memory object M but function B is not".

  - Integer overflow checking, for example, is available in both the Sanitizers
and library code. Apparently it is useful enough that some applications
want to have it enabled in production code. I believe the same will be
true for immutable strings or memory regions.

> As a more general approach, a function that switches already allocated
> memory into read-only memory would be handy. Like in
>  - m = malloc()
>  - initialize m with some data
>  - if in debug mode: call memmap_readonly(m) - from this point on 'm' is
> read-only and a write leads to a segmentation fault.
>  - ...
>  - free(m)

Hardware has write barriers only on the page level. You can't easily request
a write barrier for a sequence of, say, 30 bytes. To accommodate this, the
API needs to have a certain shape. Paul wrote:

>  p = immalloc (sizeof *p);
>  p->x = whatever; p->y = something; ...
>  imfreeze (p, sizeof *p);
>  [no changes to *p allowed here]
>  imfree (p);

The third line needs to be something like

   p = imfreeze (p, sizeof *p);

because the "writable p" and the "read-only p" will be at different virtual
addresses.
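
One possible way to get two such addresses, sketched under assumptions: the same temporary file is mapped twice with MAP_SHARED, once writable and once read-only, and "freezing" simply drops the writable view. The sketch_dual_map names are hypothetical and this is not meant as the actual immalloc()/imfreeze() implementation; it only uses POSIX tmpfile(), ftruncate(), mmap() and munmap().

```c
#define _DEFAULT_SOURCE 1   /* for fileno() and ftruncate() with strict -std= */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

struct sketch_dual_map {
  void *rw;         /* writable view, NULL once frozen */
  const void *ro;   /* read-only view of the same physical pages */
  size_t len;
};

/* Map len bytes twice.  Returns 0 on success, -1 on failure. */
int sketch_dual_map_create(struct sketch_dual_map *m, size_t len)
{
  assert(m != NULL && len > 0);
  FILE *f = tmpfile();
  if (f == NULL)
    return -1;

  int fd = fileno(f);
  int ok = -1;
  if (ftruncate(fd, (off_t) len) == 0) {
    void *rw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *ro = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (rw != MAP_FAILED && ro != MAP_FAILED) {
      m->rw = rw;
      m->ro = ro;
      m->len = len;
      ok = 0;
    } else {
      if (rw != MAP_FAILED) munmap(rw, len);
      if (ro != MAP_FAILED) munmap(ro, len);
    }
  }
  fclose(f);   /* the mappings stay valid after the descriptor is closed */
  return ok;
}

/* Drop the writable view and hand out the read-only address. */
const void *sketch_dual_map_freeze(struct sketch_dual_map *m)
{
  if (m->rw != NULL) {
    munmap(m->rw, m->len);
    m->rw = NULL;
  }
  return m->ro;
}

/* Unmap whatever views are still present. */
void sketch_dual_map_destroy(struct sketch_dual_map *m)
{
  if (m->rw != NULL)
    munmap(m->rw, m->len);
  munmap((void *) m->ro, m->len);
}
```

This shows why the frozen pointer has to be returned: the address that stays valid after freezing is the read-only one, not the one the data was written through.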

Bruno




Re: immutable string type

2019-12-29 Thread Bruno Haible
Ben Pfaff wrote:
> This sort of thing won't work on systems with virtually indexed caches,
> at least not without inserting explicit flushes.

Good point. Yes, explicit data cache flushing instructions or - in the worst
case - system calls are necessary.

> I don't know whether
> virtually indexed caches still exist in the wild.

Yes, data caches are indexed by virtual address, not by physical address, on
many platforms. That makes the common case of a cache lookup faster.

Bruno





Re: string types

2019-12-29 Thread Bruno Haible
Aga wrote:
>   - the returned value of the *printf family of functions dictates their
> limits/range, as they return an int, this can be as INT_MAX mostly

Yes, we need new implementations of the *asprintf functions that are not
limited to returning strings of maximum length INT_MAX.

>   - as since there is a "risk"¹ that someone has to take at some point (either the
> programmer or the underlying library code (as strdup() does)), the designed
> interface should lower those risks

I agree with the goal. How to do it precisely, is an art however.

> In the case of an error, returns > 0 which is either:
> #define   EDSTPAR   -1    /* Error : bad dst parameters */
> #define   ESRCPAR   -2    /* Error : bad src parameters */
> #define   EMODPAR   -3    /* Error : bad mode parameter */
> #define   ETRUNC    -4    /* Error : not enough space to copy/concatenate
>                              and truncation not allowed */

I don't think an interface for string concatenation with that many error
cases will be successful. Programmers are lazy, therefore
  - some will not check the errors at all,
  - some will only check for the fourth one (because "I'm not passing invalid
arguments, after all"),
  - among those few that implement all 4 checks, half will get it wrong
(that's my experience with similarly complex functions like mbrtowc() or
iconv()).

For an interface to be successful, it needs to be simpler than that.

Bruno