Re: string types

2020-01-06 Thread Tim Rühsen
On 1/6/20 5:08 PM, Tim Rühsen wrote:
>>   * What about reusing the complete vasnprintf.c for (B), rather than
>> adding another, limited printf-like implementation?
> 
> Yes, would be nice. At least for appending we need an extra
> malloc/malloc/memcpy/free.
> 
> vasnprintf reallocates RESULTBUF that means we can't use a stack
> allocated buffer - thus we lose the performance advantage. Or should we
> try snprintf first and fallback to vasnprintf in case of truncation ?
> 
> We want another module e.g. buffer-printf to not pull in vasnprintf when
> not needing printf-like buffer functions.
> 
> Once the buffer module is done, we could think of amending vasnprintf to
> better play with the buffer type.
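
The "snprintf first, fall back on truncation" idea from the quoted text can be sketched with standard C only. This is a minimal sketch, not vasnprintf or wget_buffer_printf: the name format_stack_first and its calling convention are assumptions made for illustration.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format into the caller's buffer if the result
   fits, otherwise into freshly malloc'ed memory.  Returns the buffer
   holding the result ('stackbuf' or heap memory that the caller must
   free), or NULL on error.  '*lengthp' receives the string length. */
static char *
format_stack_first (char *stackbuf, size_t stacksize, size_t *lengthp,
                    const char *format, ...)
{
  va_list args;
  int n;

  assert (stackbuf != NULL && stacksize > 0 && lengthp != NULL);

  /* Fast path: try the caller-provided buffer. */
  va_start (args, format);
  n = vsnprintf (stackbuf, stacksize, format, args);
  va_end (args);
  if (n < 0)
    return NULL;                /* output error */
  if ((size_t) n < stacksize)
    {
      *lengthp = (size_t) n;
      return stackbuf;
    }

  /* Slow path: the output was truncated; n is the length needed. */
  size_t needed = (size_t) n + 1;
  char *heapbuf = malloc (needed);
  if (heapbuf == NULL)
    return NULL;
  va_start (args, format);
  n = vsnprintf (heapbuf, needed, format, args);
  va_end (args);
  if (n < 0)
    {
      free (heapbuf);
      return NULL;
    }
  *lengthp = (size_t) n;
  return heapbuf;
}
```

The caller compares the result with its own buffer and calls free() only if they differ. The price of the fallback is that the format string is parsed twice whenever the stack buffer turns out to be too small.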

I just ran a speed comparison between vasnprintf and wget_buffer_printf,
each executed 10 million times in a stack-only scenario (no reallocations),
gcc 9.2.1, -O1 -g.

asnprintf(sbuf, &size, "%d", i);
takes 0m2.727s

wget_buffer_printf(, "%d", i);
takes 0m0.226s

char s[]="123";

asnprintf(sbuf, &size, "%s", s);
takes 0m2.282s

wget_buffer_printf(, "%s", s);
takes 0m0.212s

It tells me that vasnprintf has a huge startup overhead. Perhaps we can
tweak that a little bit.


The vasnprintf program that I ran for %d:

#include <vasnprintf.h> /* gnulib header that declares asnprintf */

int main(void)
{
  char sbuf[256];
  size_t size = sizeof(sbuf);

  for (int i=0;i<100;i++) {
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
  }

  return 0;
}





Re: string types

2020-01-06 Thread Tim Rühsen
Hi Bruno,

On 1/6/20 1:46 PM, Bruno Haible wrote:
>   - providing primitives for string allocation reduces the amount of buffer
> overflow bugs that otherwise occur in this area. [1]
> [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html
>>> ...
>> We created a 'catch them all' string/buffer type plus API. It is a good
>> compromise for all kinds of situations, works like a memory buffer but
>> is guaranteed 0-terminated, allows custom stack buffers with fallback to
>> heap if too small.
>>
>> https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer.c
>>
>>
>> There also is a sprintf functionality (glibc compatible) using these
>> buffers - and the operation is well faster than glibc's sprintf-like
>> functions for all format strings tested (tested back a few years). The
>> code is also much smaller (380 C code lines), the return values are
>> size_t. It doesn't support float/double.
>>
>> https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer_printf.c
>>
>> If there is serious interest, I could prepare modules for gnulib.
> 
> It is interesting that your solution does not only cover the simple cases
> (string concatenation, etc.), but also the more complex one, possibly
> with if()s in the generation logic, and all this without blatant potential
> for buffer overflow bugs.
> 
> So, the solution would consist of the following parts:
>   (A) A growable buffer type, with up to N (128 or 1024 or so) bytes on
>   the stack.

Preferably, the initial size and whether to start with a heap or a stack
buffer should be configurable at runtime:
- initial size, because it allows fine-tuning to better avoid reallocations
- initial stack buffer, if used as a local / temporary buffer
- initial heap buffer, when you already know that the resulting string has to
outlive the function return

Currently there are two init functions (I leave out the wget namespace):
int buffer_init(buffer *buf, char *data, size_t size);
buffer *buffer_alloc(size_t size);

buffer_alloc creates a buffer instance on the heap and initializes it
with a heap buffer of size.

buffer_init(buf, data, data_size) initializes 'buf' with the given data
and data_size. data will not be free'd, so stack data can be used here.

buffer_init(buf, NULL, data_size) initializes 'buf' with freshly
allocated heap data of size 'data_size'.

buffer_init(buf, NULL, 0) initializes 'buf' with freshly allocated heap
data of size 128. We could leave this out - it's a currently unused
special case to avoid error handling.

Then there is
int buffer_ensure_capacity(buffer *buf, size_t size);
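
To make the stack-first, heap-fallback behaviour described above concrete, here is a minimal sketch. It is not the libwget code: the sketch_buffer type, its field names and the doubling growth policy are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer type; all names are assumptions of this sketch. */
typedef struct {
  char *data;         /* always 0-terminated */
  size_t length;      /* bytes in use, not counting the terminating 0 */
  size_t size;        /* capacity, not counting the terminating 0 */
  bool release_data;  /* true if 'data' was allocated on the heap */
} sketch_buffer;

/* Returns 0 on success, -1 if the heap allocation failed. */
static int
sketch_buffer_init (sketch_buffer *buf, char *data, size_t size)
{
  assert (buf != NULL);
  if (data != NULL && size > 0)
    {
      /* Caller-provided storage (e.g. on the stack); never freed here. */
      buf->data = data;
      buf->size = size - 1;     /* reserve one byte for the terminator */
      buf->release_data = false;
    }
  else
    {
      if (size == 0)
        size = 128;             /* default size mentioned in the text */
      if (size == SIZE_MAX || (buf->data = malloc (size + 1)) == NULL)
        return -1;
      buf->size = size;
      buf->release_data = true;
    }
  buf->length = 0;
  buf->data[0] = '\0';
  return 0;
}

/* Make room for at least 'size' bytes plus the terminator. */
static int
sketch_buffer_ensure_capacity (sketch_buffer *buf, size_t size)
{
  char *p;

  if (size <= buf->size)
    return 0;
  if (size == SIZE_MAX)
    return -1;
  if (buf->release_data)
    p = realloc (buf->data, size + 1);
  else if ((p = malloc (size + 1)) != NULL)
    memcpy (p, buf->data, buf->length + 1);  /* leave the stack buffer */
  if (p == NULL)
    return -1;
  buf->data = p;
  buf->size = size;
  buf->release_data = true;
  return 0;
}

/* Append 'length' bytes.  Returns the resulting length, which is
   unchanged if the buffer could not be grown. */
static size_t
sketch_buffer_memcat (sketch_buffer *buf, const void *data, size_t length)
{
  if (length == 0 || length > SIZE_MAX / 2 - buf->length)
    return buf->length;
  if (buf->length + length > buf->size
      && sketch_buffer_ensure_capacity (buf, 2 * (buf->length + length)) != 0)
    return buf->length;
  memcpy (buf->data + buf->length, data, length);
  buf->length += length;
  buf->data[buf->length] = '\0';
  return buf->length;
}

static void
sketch_buffer_deinit (sketch_buffer *buf)
{
  if (buf->release_data)
    free (buf->data);
  buf->data = NULL;
}
```

A caller would typically pass a local char array to sketch_buffer_init, append to it, and call sketch_buffer_deinit at the end. No heap allocation happens as long as the content fits into the array.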


>   (B) A set of functions for appending to such a growable buffer.

To copy a number of bytes to the beginning (effectively dropping the
previous content):
size_t buffer_memcpy(buffer *buf, const void *data, size_t length);

To append a number of bytes:
size_t buffer_memcat(buffer *buf, const void *data, size_t length);

To copy a string to the beginning (effectively dropping the previous
content):
size_t buffer_strcpy(buffer *buf, const char *s);

To append a string:
size_t buffer_strcat(buffer *buf, const char *s);

To set a number of bytes at the beginning (effectively dropping the
previous content):
size_t buffer_memset(buffer *buf, char c, size_t length);

To append a number of the same bytes:
size_t buffer_memset_append(buffer *buf, char c, size_t length);


>   (C) A function for creating a heap-allocated 'char *' from a growable
>   buffer.

Currently we do:
buffer buf;
buffer_init(&buf, NULL, data_size); // allocate buf.data on heap
... add stuff to buf ...
mydata = buf.data; buf.data = NULL;
buffer_deinit(&buf);

We could add a (static inline) function for this, named
void *buffer_deinit_transfer(buffer *buf);

This function could also call realloc() to shrink 'data' to its
occupied length.
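
A possible shape for such a transfer function, as a sketch only. The struct below is a hypothetical stand-in for the real buffer type, and returning char * instead of void * is a choice made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the buffer type; names are assumptions. */
typedef struct {
  char *data;         /* 0-terminated content */
  size_t length;      /* bytes in use, not counting the terminating 0 */
  size_t size;        /* capacity, not counting the terminating 0 */
  bool release_data;  /* true if 'data' was allocated on the heap */
} sketch_buffer;

/* Hand the content over as a malloc'ed, 0-terminated string and leave
   'buf' empty.  Returns NULL if a needed allocation failed; 'buf' is
   unchanged in that case. */
static char *
sketch_buffer_deinit_transfer (sketch_buffer *buf)
{
  char *result;

  assert (buf != NULL && buf->data != NULL);
  if (buf->release_data)
    {
      /* Heap data: shrink to the occupied length.  If realloc cannot
         shrink the block, the old block stays valid and is used as is. */
      result = realloc (buf->data, buf->length + 1);
      if (result == NULL)
        result = buf->data;
    }
  else
    {
      /* Stack data does not outlive the caller, so it must be copied. */
      result = malloc (buf->length + 1);
      if (result == NULL)
        return NULL;
      memcpy (result, buf->data, buf->length + 1);
    }
  buf->data = NULL;
  buf->length = buf->size = 0;
  buf->release_data = false;
  return result;
}
```

The stack case is the reason why a plain pointer hand-over is not enough in general: the manual "mydata = buf.data" idiom only works if the buffer was started on the heap.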

>   (D) Short-hand functions for the simple cases (like string concatenation).

See above, e.g.
buffer_strcpy(buf, scheme);
buffer_strcat(buf, "://");
buffer_strcat(buf, domain);
buffer_memcat(buf, ":", 1);
buffer_strcat(buf, port_s);
buffer_memcat(buf, "/", 1);
buffer_strcat(buf, path);

But I prefer the slightly slower but more readable form
buffer_printf(buf, "%s://%s:%d/%s", scheme, domain, port, path);

Since our printf-like functions directly write into a buffer, there is
no overhead for copying data.

> It would be good to have all these well integrated (in terms of function
> names and calling conventions). So far, in gnulib, we have only pieces of
> it:
>   - Module 'scratch_buffer' is (A) without (B), (C), (D).
>   - Modules 'vasnprintf', 'asprintf' are (B), (C), (D) but without (A).
> 
> Before you start writing the code, it's worth looking at the following
> questions:
>   * Should the module 'scratch_buffer' be reused for (A)? Or is this
> not possible? If not, can it still have a memory-leak prevention
> like described in lib/malloc/scratch_buffer.h?

I don't see the advantage of the described memory-leak prevention. On
memory error 

Re: string types

2020-01-06 Thread Bruno Haible
Hi Tim,

> >>>   - providing primitives for string allocation reduces the amount of buffer
> >>> overflow bugs that otherwise occur in this area. [1]
> >>> [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html
> > ...
> We created a 'catch them all' string/buffer type plus API. It is a good
> compromise for all kinds of situations, works like a memory buffer but
> is guaranteed 0-terminated, allows custom stack buffers with fallback to
> heap if too small.
> 
> https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer.c
> 
> 
> There also is a sprintf functionality (glibc compatible) using these
> buffers - and the operation is well faster than glibc's sprintf-like
> functions for all format strings tested (tested back a few years). The
> code is also much smaller (380 C code lines), the return values are
> size_t. It doesn't support float/double.
> 
> https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer_printf.c
> 
> If there is serious interest, I could prepare modules for gnulib.

It is interesting that your solution does not only cover the simple cases
(string concatenation, etc.), but also the more complex one, possibly
with if()s in the generation logic, and all this without blatant potential
for buffer overflow bugs.

So, the solution would consist of the following parts:
  (A) A growable buffer type, with up to N (128 or 1024 or so) bytes on
  the stack.
  (B) A set of functions for appending to such a growable buffer.
  (C) A function for creating a heap-allocated 'char *' from a growable
  buffer.
  (D) Short-hand functions for the simple cases (like string concatenation).

It would be good to have all these well integrated (in terms of function
names and calling conventions). So far, in gnulib, we have only pieces of
it:
  - Module 'scratch_buffer' is (A) without (B), (C), (D).
  - Modules 'vasnprintf', 'asprintf' are (B), (C), (D) but without (A).

Before you start writing the code, it's worth looking at the following
questions:
  * Should the module 'scratch_buffer' be reused for (A)? Or is this
not possible? If not, can it still have a memory-leak prevention
like described in lib/malloc/scratch_buffer.h?
  * What about reusing the complete vasnprintf.c for (B), rather than
adding another, limited printf-like implementation?
  * Is it best to implement (D) based on (A), (B), (C), or directly
from scratch?

Bruno




Re: string types

2020-01-06 Thread Tim Rühsen


On 12/31/19 10:53 AM, Bruno Haible wrote:
> Hi Tim,
> 
>>>   - providing primitives for string allocation reduces the amount of buffer
>>> overflow bugs that otherwise occur in this area. [1]
>>> [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html
> 
>> here is a string concatenation function without ellipsis, analogous to
>> writev() and struct iovec - just a suggestion. Instead of 'struct
>> strvec' a new string_t type would be handy.
>>
>> #include <stddef.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <string.h>
>>
>> struct strvec {
>>   char *strv_base;
>>   size_t strv_len;
>> };
>>
>> __attribute__ ((nonnull (1)))
>> char *concat_stringv(const struct strvec *strv)
>> {
>>   const struct strvec *str;
>>   size_t len = 0;
>>   char *buf;
>>
>>   for (str = strv; str->strv_base; str++)
>> len += str->strv_len;
>>
>>   if (!(buf = malloc(len + 1)))
>> return buf;
>>
>>   len = 0;
>>   for (str = strv; str->strv_base; len += str->strv_len, str++)
>> memcpy(buf + len, str->strv_base, str->strv_len);
>>
>>   buf[len] = 0;
>>
>>   return buf;
>> }
>>
>> void main(void)
>> {
>>   char *s = concat_stringv((struct strvec []) {
>> { "a", 1 },
>> { "b", 1 },
>> { NULL }
>>   });
> 
> This looks good. It brings us one step closer to the stated goal [1].
> 
> Would you like to contribute such a 'string-alloc' module that, together with
> 'strdup' and 'asprintf', removes most needs to create a string's contents
> "by hand"?

When time allows, I would like to make up a module.

Though IMO the design of the function doesn't allow reusing an existing
buffer (e.g. a scratch buffer on the stack). Since malloc() etc. are
pretty costly, you often want to avoid them as much as possible.

Like e.g.

/* Use given stack buffer, fallback to malloc() if too short */
char sbuf[256];
char *s = concat_stringv_stack(sbuf, sizeof (sbuf), (struct strvec []) {
    { "a", 1 },
    { "b", 1 },
    { NULL }
  });

... do things with s ...

if (s != sbuf)
  free (s);
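
concat_stringv_stack is only named above, not implemented. One way it could look, as a sketch under the assumption that the result is written to the given buffer whenever it fits including the terminating 0 (strv_base is made const here, which is also an assumption):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct strvec {
  const char *strv_base;
  size_t strv_len;
};

/* Concatenate into 'sbuf' if the result fits, else into malloc'ed
   memory.  Returns NULL if malloc fails.  The caller must free the
   result if and only if it differs from 'sbuf'. */
static char *
concat_stringv_stack (char *sbuf, size_t sbuf_size,
                      const struct strvec *strv)
{
  const struct strvec *str;
  size_t len = 0;
  char *buf;

  assert (strv != NULL);

  /* First pass: sum up the lengths. */
  for (str = strv; str->strv_base; str++)
    len += str->strv_len;

  /* Use the caller's buffer only if the terminating 0 fits as well. */
  if (sbuf != NULL && len < sbuf_size)
    buf = sbuf;
  else if (!(buf = malloc (len + 1)))
    return NULL;

  /* Second pass: copy the pieces. */
  len = 0;
  for (str = strv; str->strv_base; len += str->strv_len, str++)
    memcpy (buf + len, str->strv_base, str->strv_len);
  buf[len] = 0;

  return buf;
}
```

The length is still computed with a full first pass, so the sketch saves the malloc() but not the double traversal of the vector.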

Sometimes you want to reuse an existing malloc'ed buffer:

/* Use existing heap buffer, use realloc() if too short */
char *buf = malloc(N);
buf = concat_stringv_reuse(buf, N, (struct strvec []) {
    { "a", 1 },
    { "b", 1 },
    { NULL }
  });

... do things with buf ...

free (buf);

You might also be interested in the size of the created string to avoid
a superfluous strlen(). So the need for more specialized functions makes
it all more and more complex.

During the development of Libwget/Wget2 we needed all of the above (and
more) and finally came up with a good compromise (well, good for us).

We created a 'catch them all' string/buffer type plus API. It is a good
compromise for all kinds of situations, works like a memory buffer but
is guaranteed 0-terminated, allows custom stack buffers with fallback to
heap if too small.

$ cloc buffer.c
Language    files    blank    comment    code
---------------------------------------------
C               1       49        327     195

https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer.c


There also is a sprintf functionality (glibc compatible) using these
buffers - and the operation is well faster than glibc's sprintf-like
functions for all format strings tested (tested back a few years). The
code is also much smaller (380 C code lines), the return values are
size_t. It doesn't support float/double.

$ cloc buffer_printf.c
Language    files    blank    comment    code
---------------------------------------------
C               1       74        120     380

https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer_printf.c

If there is serious interest, I could prepare modules for gnulib.


Regards, Tim





Re: string types

2019-12-31 Thread Bruno Haible
Hi Tim,

> >   - providing primitives for string allocation reduces the amount of buffer
> > overflow bugs that otherwise occur in this area. [1]
> > [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html

> here is a string concatenation function without ellipsis, analogous to
> writev() and struct iovec - just a suggestion. Instead of 'struct
> strvec' a new string_t type would be handy.
> 
> #include <stddef.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> 
> struct strvec {
>   char *strv_base;
>   size_t strv_len;
> };
> 
> __attribute__ ((nonnull (1)))
> char *concat_stringv(const struct strvec *strv)
> {
>   const struct strvec *str;
>   size_t len = 0;
>   char *buf;
> 
>   for (str = strv; str->strv_base; str++)
> len += str->strv_len;
> 
>   if (!(buf = malloc(len + 1)))
> return buf;
> 
>   len = 0;
>   for (str = strv; str->strv_base; len += str->strv_len, str++)
> memcpy(buf + len, str->strv_base, str->strv_len);
> 
>   buf[len] = 0;
> 
>   return buf;
> }
> 
> void main(void)
> {
>   char *s = concat_stringv((struct strvec []) {
> { "a", 1 },
> { "b", 1 },
> { NULL }
>   });

This looks good. It brings us one step closer to the stated goal [1].

Would you like to contribute such a 'string-alloc' module that, together with
'strdup' and 'asprintf', removes most needs to create a string's contents
"by hand"?

Regarding the type name: There can't be a 'string_t' in C, I would say,
because you will always have the NUL-terminated strings on one side and what
you call a 'wget_string' on the other side, and there can't be a clear winner
between both.

Bruno




Re: string types

2019-12-29 Thread Tim Rühsen
On 27.12.19 11:51, Bruno Haible wrote:
> Aga wrote:
>> I do not know if
>> you can (or if it is possible, how it can be done) extract a specific
>> functionality from gnulib, with the absolutely necessary code and only that.
> 
> gnulib-tool does this. With its --avoid option, the developer can even customize
> their notion of "absolutely necessary".
> 
>> In a myriad of codebases a string type is implemented at least as:
>>   size_t mem_size;
>>   size_t num_bytes;
>>   char *bytes;
> 
> This is actually a string-buffer type. A string type does not need two size_t
> members. Long-term experience has shown that using different types for string
> and string-buffer is a win, because
>   - a string can be put in a read-only virtual memory area, thus enforcing
> immutability (-> reducing multithread problems),
>   - providing primitives for string allocation reduces the amount of buffer
> overflow bugs that otherwise occur in this area. [1]
> [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html
>

Just FYI,

here is a string concatenation function without ellipsis, analogous to
writev() and struct iovec - just a suggestion. Instead of 'struct
strvec' a new string_t type would be handy.

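#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct strvec {
  char *strv_base;
  size_t strv_len;
};

/* Concatenate all strings of the vector (terminated by an entry with
   strv_base == NULL) into a new malloc'ed, 0-terminated string.
   Returns NULL if malloc fails. */
__attribute__ ((nonnull (1)))
char *concat_stringv(const struct strvec *strv)
{
  const struct strvec *str;
  size_t len = 0;
  char *buf;

  /* First pass: sum up the lengths. */
  for (str = strv; str->strv_base; str++)
    len += str->strv_len;

  if (!(buf = malloc(len + 1)))
    return buf;

  /* Second pass: copy the pieces. */
  len = 0;
  for (str = strv; str->strv_base; len += str->strv_len, str++)
    memcpy(buf + len, str->strv_base, str->strv_len);

  buf[len] = 0;

  return buf;
}

int main(void)
{
  char *s = concat_stringv((struct strvec []) {
    { "a", 1 },
    { "b", 1 },
    { NULL }
  });

  if (!s)
    return 1;

  puts(s);

  free(s);
  return 0;
}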


In GNU Wget2 we already have a type similar to string_t. It is only used in
cases where we need pointer + len of URLs inside const HTML/XML/CSS data.

typedef struct {
  const char *p; //!< pointer to memory region
  size_t len;    //!< length of memory region
} wget_string;


So maybe we need a string_t and a const_string_t type !? (to avoid
casting from const char *)

Regards, Tim





Re: string types

2019-12-29 Thread ag
On Sun, Dec 29, at 10:19 Bruno Haible wrote:
> Aga wrote:
> >   - the returned value of the *printf family of functions dictates their
> > limits/range, as they return an int, this can be as INT_MAX mostly
> 
> Yes, we need new implementations of the *asprintf functions that are not
> limited to returning strings of maximum length INT_MAX.

There is also the question of how the current functions behave with buffers over
INT_MAX, and what to do with such large buffers if stdio cannot handle them reliably.
And what does POSIX say about this, if it says anything at all?



Re: string types

2019-12-29 Thread ag
On Sun, Dec 29, at 10:19 Bruno Haible wrote:
> I agree with the goal. How to do it precisely, is an art however.

Ok, let's see what we have so far.

First the Base: (easy) that is the maximum size malloc can request from the kernel,
and that is PTRDIFF_MAX. Here we also have to forget SIZE_MAX, as it is not
guaranteed that PTRDIFF_MAX equals SIZE_MAX.

Second the (function returned value) Requirement: (easy) a signed type.
There is an agreement that the introduced functions should return -1 on error,
else the interface will be complicated and we do not want complication.
So ptrdiff_t is adequate, since ptrdiff_t is in standard C and comes
with stddef.h.

The rest:

Catching out of bounds conditions: (rather easy and already implemented in
snprintf) the destination argument is followed by an argument with the
allocated destination size (from the stack or from the heap). Now, snprintf
uses size_t here, but (question) isn't this a contradiction with the above
or not? Probably not, but it's better to ask to de-confuse things (as clarity
is a requirement (semantics should be understandable by mere humans)).

Another concern. What if the destination is NULL? Should the internal functions
proceed by allocating a buffer of the requested size? What will they do
if the requested size is <= 0?
There are precedents here, like realpath(), which allocates a buffer and
leaves it up to the user to free it.

Also. Variables declared as static internal are considered harmful. But sometimes
it is desirable to have some data in a private place, protected or handy to work
with without side effects. This is solved however with the new im(mutable) module.

Catching truncation (first priority maybe): There is a choice to complicate
the interface a bit to return more values than -1, but this was rejected on the
perfectly legitimate assumption that humans are lazy, probably because they have been
exposed to try/catch (not bad if you ask, but inappropriate for C).
The other thing that could be done is to return -1 and set errno according to
the error. But does such an error exist or not? If not, ETRUNC would have to be
introduced. Few programmers will take the risk of making their program depend
on something that is not standard, but perhaps they will (doubtful though at
this stage).

The other thing that is left is to check the returned value. Now. In snprintf(3)
there are notes about this and a method to detect truncation (misty though).

   The functions snprintf() and vsnprintf() do not write more than size
   bytes (including the terminating null byte ('\0')).  If the output was
   truncated due to this limit, then the return value is the number of
   characters (excluding the terminating null byte) which would have been
   written to the final string if enough space had been available.  Thus,
   a return value of size or more means that the output was truncated.
   (See also below under NOTES.)
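
The return-value rule quoted from the man page can be wrapped once, so that callers only have to test for -1. A minimal sketch; the name snprintf_notrunc is an assumption made for this example and not an existing function.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical wrapper: like snprintf, but returns the number of bytes
   actually stored (excluding the terminating 0), and -1 on an output
   error or if the output was truncated. */
static int
snprintf_notrunc (char *dest, size_t size, const char *format, ...)
{
  va_list args;
  int n;

  assert (dest != NULL && size > 0);

  va_start (args, format);
  n = vsnprintf (dest, size, format, args);
  va_end (args);

  if (n < 0)
    return -1;                  /* output error */
  if ((size_t) n >= size)
    return -1;                  /* truncated: n is the length that
                                   would have been needed */
  return n;
}
```

In the truncation case dest still holds a 0-terminated prefix of the output, so the caller can decide whether to use it or to retry with a larger buffer.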

"which would have been written?" why not always the bytes that had been written?

Ok, I got it after a break; still difficult to parse though, and for what? We
have to admit that this is a programmer error. [Sh|H]e should know her strings.
But we still want to help here. How? Three choices come to mind.

1.
Use a bitmap flag argument to control the function behavior. This adds
verbosity, but at the same time allows extensibility. Which conditions could
be covered with that? Perhaps returning an error if the destination is NULL and
the flag directs the function to return in this condition. Same with
the source. Very convenient but still verbose, as you have to learn another
set of FLAGS.

2.
Introduce wrappers. Actually wrappers will maybe be used either way.
Or introduce a complete set of the same functions, post-fixed with _un (to
denote unsafety, if _s (not sure) means safe).

3. The programmer knows best. Based on that, either continue with the
implementation as it is, or (where appropriate) use a fourth argument
for the requested bytes to be written. And sleep with a clear conscience, that
you did the best you could. He should do the same.

Now. What concerns me most is userspace and all these functions that
take a variable number of arguments and a format string. I was fighting
in my code to know in a reliable way the actual bytes produced by the
sum of those arguments (as it can be really difficult to catch some of
the conditions described above). You also said at one point that no one
who does system programming will use this set of functions (because of
the overhead). We could go further and say: no one sane (sorry) would want
to format big strings. Such functions are very prone to errors, but are
easy to work with. So what should we do with them? There is a method
to calculate the size beforehand (that is, before the declaration) and it is
given in the printf(3) Linux man page.

  va_start(ap, fmt);
  size = vsnprintf(p, size, fmt, ap);
  va_end(ap);
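
Spelled out in full, the measure-then-format method looks roughly like this. It is a sketch in the spirit of the man page example, not a copy of it; the name format_alloc is an assumption.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'ed, 0-terminated string with
   the formatted output, or NULL on error. */
static char *
format_alloc (const char *format, ...)
{
  va_list args;
  int n;
  char *p;

  assert (format != NULL);

  /* Pass 1: a size of 0 makes vsnprintf only compute the length. */
  va_start (args, format);
  n = vsnprintf (NULL, 0, format, args);
  va_end (args);
  if (n < 0)
    return NULL;

  size_t needed = (size_t) n + 1;   /* one extra byte for the 0 */
  p = malloc (needed);
  if (p == NULL)
    return NULL;

  /* Pass 2: format for real; the argument list must be restarted. */
  va_start (args, format);
  n = vsnprintf (p, needed, format, args);
  va_end (args);
  if (n < 0)
    {
      free (p);
      return NULL;
    }
  return p;
}
```

Since the length comes back as an int, this approach inherits the INT_MAX limit of the *printf family.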

So it parses the varargs twice. Plus a compiler version (not 9*), 

Re: string types

2019-12-29 Thread Bruno Haible
Aga wrote:
>   - the returned value of the *printf family of functions dictates their
> limits/range, as they return an int, this can be as INT_MAX mostly

Yes, we need new implementations of the *asprintf functions that are not
limited to returning strings of maximum length INT_MAX.

>   - since there is a "risk"¹ that someone has to take at some point (either the
>     programmer or the underlying library code (as strdup() does)), the designed
>     interface should lower those risks

I agree with the goal. How to do it precisely, is an art however.

> In the case of an error, returns < 0 which is either:
> #define   EDSTPAR   -1  /* Error : bad dst parameters */
> #define   ESRCPAR   -2  /* Error : bad src parameters */
> #define   EMODPAR   -3  /* Error : bad mode parameter */
> #define   ETRUNC    -4  /* Error : not enough space to copy/concatenate
>                            and truncation not allowed */

I don't think an interface for string concatenation with that many error
cases will be successful. Programmers are lazy, therefore
  - some will not check the errors at all,
  - some will only check for the fourth one (because "I'm not passing invalid
arguments, after all"),
  - among those few that implement all 4 checks, half will get it wrong
(that's my experience with similarly complex functions like mbrtowc() or
iconv()).

For an interface to be successful, it needs to be simpler than that.

Bruno




Re: string types

2019-12-28 Thread Paul Eggert
On 12/28/19 12:44 PM, ag wrote:
> is your opinion that this is adequate?
> 
> typedef ptrdiff_t msize_t (m for memory here)

Yes, something like that. dfa.c calls this type 'idx_t', which is a couple of
characters shorter.



Re: string types

2019-12-28 Thread ag
Hi Paul,

On Sat, Dec 28, at 10:28 Paul Eggert wrote:
> > Based on the above assumptions this can be extended. First instead of size_t to
> > return ssize_t, so functions can return -1 and set errno accordingly.
> 
> It's better to use ptrdiff_t for this sort of thing, since it's hardwired into
> the C language (you can't do any better than ptrdiff_t anyway, if you use
> pointer subtraction), whereas ssize_t is merely in POSIX and is narrower than
> ptrdiff_t on some (obsolete?) platforms.

So, let's say we designed this thing without obligations to the past and thinking of
the next hundred years (of course with the current knowledge and the lessons from the
past), and wanted to make it work with malloc and string type functions as best it
can be done, without worries about overflows and unsigned divisions and all this
kind of confusing things that haunt us all after so many years, when things
should have been settled by now... is your opinion that this is adequate?

typedef ptrdiff_t msize_t (m for memory here)

> > #define MUL_NO_OVERFLOW ((size_t) 1 << (sizeof (size_t) * 4))
> > #define MEM_IS_INT_OVERFLOW(nmemb, ssize) \
> >  (((nmemb) >= MUL_NO_OVERFLOW || (ssize) >= MUL_NO_OVERFLOW) &&   \
> >   (nmemb) > 0 && SIZE_MAX / (nmemb) < (ssize))
> 
> Ouch. That code is not good. An unsigned division at runtime to do memory
> allocation? Gnulib does better than that already. Also, Glibc has some code in
> this area that we could migrate into Gnulib, that could be better yet.

Sorry, I don't have time to do it right now - as I just escaped from a snow-storm -
but I will check this, at least so as not to spread misleading information (it is quite
possibly my fault here), so thanks for your comment.

By the way Paul, since I'm a self-taught-by-practical-experience kind of human
being, and joking with zoi here, I said that at least my teacher is a hall of famer
in computing history. Isn't this life great!
So true, this is also a school for free after all.

My Honor,
 Αγαθοκλής



Re: string types

2019-12-28 Thread Paul Eggert
On 12/28/19 5:14 AM, ag wrote:

>   - PTRDIFF_MAX is at least INT_MAX and at most SIZE_MAX
> (PTRDIFF_MAX is INT_MAX in 32bit)

PTRDIFF_MAX can exceed SIZE_MAX, in the sense that POSIX and C allows it and it
could be useful on 32-bit platforms for size_t to be 32 bits and ptrdiff_t to be
64 bits. Although I don't know of any platforms doing things that way, I prefer
not to assume that PTRDIFF_MAX <= SIZE_MAX so as to allow for the possibility.

>   - SIZE_MAX as (size_t) (-1)
> 
>   - ssize_t (s means signed?) can be as big as SIZE_MAX? and SSIZE_MAX equals to
> SIZE_MAX?

ssize_t can be either narrower or wider than size_t, according to POSIX.
Historically ssize_t was 32 bits and size_t 64 bits on some platforms, and
though I don't know of any current platforms doing that it's easy to not make
assumptions here.

> Based on the above assumptions this can be extended. First instead of size_t to
> return ssize_t, so functions can return -1 and set errno accordingly.

It's better to use ptrdiff_t for this sort of thing, since it's hardwired into
the C language (you can't do any better than ptrdiff_t anyway, if you use
pointer subtraction), whereas ssize_t is merely in POSIX and is narrower than
ptrdiff_t on some (obsolete?) platforms.

> In my humble opinion there is also the option to choose reallocarray() from OpenBSD,
> which always checks for integer overflows in the following way:
> 
> #define MUL_NO_OVERFLOW ((size_t) 1 << (sizeof (size_t) * 4))
> #define MEM_IS_INT_OVERFLOW(nmemb, ssize) \
>  (((nmemb) >= MUL_NO_OVERFLOW || (ssize) >= MUL_NO_OVERFLOW) &&   \
>   (nmemb) > 0 && SIZE_MAX / (nmemb) < (ssize))

Ouch. That code is not good. An unsigned division at runtime to do memory
allocation? Gnulib does better than that already. Also, Glibc has some code in
this area that we could migrate into Gnulib, that could be better yet.
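
For illustration, one division-free way to do the overflow check. This is a sketch and not the Gnulib or Glibc code referred to above; it assumes a compiler that provides __builtin_mul_overflow (GCC 5 or newer, or Clang), and the name sketch_reallocarray is hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reallocarray-like function: fails with ENOMEM if
   nmemb * size overflows size_t or exceeds PTRDIFF_MAX. */
static void *
sketch_reallocarray (void *ptr, size_t nmemb, size_t size)
{
  size_t nbytes;

  /* The builtin multiplies and reports overflow without a division;
     it is a compiler extension, not standard C. */
  if (__builtin_mul_overflow (nmemb, size, &nbytes)
      || nbytes > PTRDIFF_MAX)
    {
      errno = ENOMEM;
      return NULL;
    }
  return realloc (ptr, nbytes);
}
```

The PTRDIFF_MAX test reflects the point made earlier in the thread that objects larger than PTRDIFF_MAX cannot be handled safely with pointer subtraction. The behaviour for a total size of 0 is whatever the platform's realloc() does.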



Re: string types

2019-12-28 Thread ag
Hi,

On Fri, Dec 27, at 11:51 Bruno Haible wrote:
>  - providing primitives for string allocation reduces the amount of buffer
>overflow bugs that otherwise occur in this area. [1]

[1] Re: string allocation
https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html

Thanks, I remember this thread, though at the time I couldn't understand some bits.

>> ag wrote:
> > ... to the actual algorithm (usually conditions that can or can't be met).

> That is the idea behind the container types (list, map) in gnulib. However, I don't
> see how to reasonably transpose this principle to string types.

Ok, let us try, so allow me to summarize with some of my (unqualified) assumptions
(please correct):

  - glibc malloc can request at most PTRDIFF_MAX

  - PTRDIFF_MAX is at least INT_MAX and at most SIZE_MAX
    (PTRDIFF_MAX is INT_MAX in 32bit)

  - SIZE_MAX is (size_t) (-1)

  - ssize_t (s means signed?) can be as big as SIZE_MAX? And SSIZE_MAX equals
    SIZE_MAX?

  - the return value of the *printf family of functions dictates their
    limits/range; as they return an int, this can be at most INT_MAX

Some concerns:

  - truncation errors should be caught

  - memory checkers should catch overflows

  - since there is a "risk"¹ that someone has to take at some point (either the
    programmer or the underlying library code (as strdup() does)), the designed
    interface should lower those risks

There is a proposal from Eric Sanchis to the Austin Group, dated 9 Jun 2016, for a string
copy/concatenation interface, in which the functions take both the allocated size and
the number of bytes to be written as arguments (I will inline some of them here, since
I was unable to find his mail in the POSIX mailing list archives).

I used this as a basis (as it was rather intuitive and perfectly suited for C) to
implement my own str_cp, which goes like this:

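#include <stddef.h>

/* Copy at most nelem bytes from src to dest, stopping at src's NUL.
   Returns the number of bytes written; does not terminate dest. */
size_t byte_cp (char *dest, const char *src, size_t nelem) {
  const char *sp = src;
  size_t len = 0;

  while (len < nelem && *sp) {
    dest[len] = *sp++;
    len++;
  }

  return len;
}

/* Copy at most nelem bytes, never more than dest_len - 1, and always
   NUL-terminate dest.  Returns the number of bytes written. */
size_t str_cp (char *dest, size_t dest_len, const char *src, size_t nelem) {
  if (dest_len == 0)
    return 0;   /* no room even for the terminating NUL */
  size_t num = (nelem > (dest_len - 1) ? dest_len - 1 : nelem);
  size_t len = (NULL == src ? 0 : byte_cp (dest, src, num));
  dest[len] = '\0';
  return len;
}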

Of course it can be done better, but here we have a low level function (byte_cp),
that does only the required checks and returns the actual bytes written to
`dest', while str_cp checks if `src' is NULL and if `nelem' is bigger than
`dest_len' - 1 (if it is, then it copies at most `dest_len' - 1). It returns 0 or
the actual written bytes.

Since this returns the actual bytes written, it is up to the programmer to check
if truncation happened, but there is no possibility to copy more than `dest_len' - 1.

Based on the above assumptions this can be extended. First, return ssize_t instead
of size_t, so functions can return -1 and set errno accordingly.

Eric Sanchis does it a bit differently in his proposal, because his functions add
an extra argument of type size_t that controls the behavior of the function
(what it will do in the case that the destination length is less than the source length).

He uses an int as the return value, which on successful operation is either 0 or 1,
as follows:
#define   OKNOTRUNC  0  /* copy/concatenation performed without truncation */
#define   OKTRUNC    1  /* copy/concatenation performed with truncation */

And below is the extra information passed as fifth argument:
#define   TRUNC      0  /* truncation allowed */
#define   NOTRUNC    1  /* truncation not allowed */

In the case of an error, it returns < 0, which is one of:
#define   EDSTPAR   -1  /* Error : bad dst parameters */
#define   ESRCPAR   -2  /* Error : bad src parameters */
#define   EMODPAR   -3  /* Error : bad mode parameter */
#define   ETRUNC    -4  /* Error : not enough space to copy/concatenate
                           and truncation not allowed */

Now combining all this, and if the assumptions are correct, gnulib could return
ssize_t, use this to make its functions work up to SIZE_MAX, and use
either Eric's interface or set errno accordingly.

But to me a function call like:
  str_cp (dest, memsize_of_dest, src, memsize_of_dest - 1)
is quite a common C way to do things, plus we have a way to catch truncation and
not to go out of bounds at the same time.
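
How the truncation check could look at the call site, as a sketch. str_cp is folded into a single function here and gets a guard for dest_len == 0, and the wrapper str_cp_fits is a hypothetical name; neither is part of any existing library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy at most nelem bytes of src, never more than dest_len - 1, and
   always 0-terminate dest.  Returns the number of bytes written. */
static size_t
str_cp (char *dest, size_t dest_len, const char *src, size_t nelem)
{
  size_t len = 0;

  if (dest == NULL || dest_len == 0)
    return 0;
  if (src != NULL)
    while (len < nelem && len < dest_len - 1 && src[len] != '\0')
      {
        dest[len] = src[len];
        len++;
      }
  dest[len] = '\0';
  return len;
}

/* Hypothetical wrapper: copy all of src and report whether it fit.
   Truncation happened exactly when fewer bytes were written than
   src contains. */
static bool
str_cp_fits (char *dest, size_t dest_len, const char *src)
{
  size_t src_len = strlen (src);

  return str_cp (dest, dest_len, src, src_len) == src_len;
}
```

The check costs one strlen() on the source, which is the price for keeping the return value a plain byte count instead of an error code.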

Of course such operations are tied to malloc().
I read the gnulib documentation yesterday and I saw that gnulib wraps malloc() with a
function that (quite logically) aborts execution and even allows setting a callback
function.

In my humble opinion there is also the option to choose reallocarray() from OpenBSD,
which always checks for integer overflows in the following way:

#define MUL_NO_OVERFLOW ((size_t) 1 << (sizeof (size_t) * 4))
#define 

Re: string types

2019-12-27 Thread Bruno Haible
Aga wrote:
> I do not know if
> you can (or if it is possible, how it can be done) extract a specific
> functionality from gnulib, with the absolutely necessary code and only that.

gnulib-tool does this. With its --avoid option, the developer can even customize
their notion of "absolutely necessary".

> In a myriad of codebases a string type is implemented at least as:
>   size_t mem_size;
>   size_t num_bytes;
>   char *bytes;

This is actually a string-buffer type. A string type does not need two size_t
members. Long-term experience has shown that using different types for string
and string-buffer is a win, because
  - a string can be put in a read-only virtual memory area, thus enforcing
immutability (-> reducing multithread problems),
  - providing primitives for string allocation reduces the amount of buffer
overflow bugs that otherwise occur in this area. [1]

Unfortunately, the common string type in C is 'char *' with NUL termination,
and a different type is hard to establish
  - because developers already know how to use 'char *',
  - because existing functions like printf consume 'char *' strings,
  - because few programs have had the need to correctly handle strings with
    embedded NULs.

> An extended ustring (unicode|utf8) type can include information for its bytes with
> character semantics, like:
>  (utf8 typedef'ed as signed int)
>   utf8 code;   // the integer representation
>   int len; // the number of the needed bytes
>   int width;   // the number of the occupied cells
>   char buf[5]; // and probably the character representation

Such a type would have a niche use, IMO, because
  - 99% of the processing would not need to access the width (screen columns) - so
    why spend CPU time and RAM to store it and keep it up-to-date?
  - 80% of the processing does not care about the Unicode code points either,
    and libraries like libunistring can do the Unicode-aware processing.

> But the programmer's mind would probably be best off
> if it could concentrate on how to express the thought (with whatever meaning of what we
> are calling "thought") and follow this flow, or if it could concentrate its energy on
> understanding the intentions (while reading) of the code (instead of wasting itself on
> the "details" of the code) and finally on the actual algorithm (usually conditions
> that can or can't be met).

That is the idea behind the container types (list, map) in gnulib. However, I don't
see how to reasonably transpose this principle to string types.

Bruno

[1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html