On 08/20/18 17:59, Bernd Edlinger wrote: > On 08/20/18 15:19, Richard Biener wrote: >> On Mon, 20 Aug 2018, Bernd Edlinger wrote: >> >>> On 08/20/18 13:01, Richard Biener wrote: >>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlin...@hotmail.de> >>>> wrote: >>>>> >>>>> >>>>> >>>>> On 08/01/18 11:29, Richard Biener wrote: >>>>>> >>>>>> Hmm. I think it would be nice if TREE_STRING_LENGTH would >>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient >>>>>> for your check above. Because the '\0' doesn't belong to the >>>>>> string. Then build_string internally appends a '\0' outside >>>>>> of TREE_STRING_LENGTH. >>>>>> >>>>> >>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide >>>>> character. >>>> >>>> That could be fixed though (a wide 0 is just N 0s). Add a elsz = 1 >>>> parameter to build_string and allocate as many extra 0s as needed. >>>> >>>> There are STRING_CSTs which are not string literals, >>>>> for instance attribute tags, Pragmas, asm constrants, etc. >>>>> They use the '\0' outside, and have probably no TREE_TYPE. >>>>> >>>>>> >>>>>>> So I would like to be able to assume that the STRING_CST objects >>>>>>> are internally always generated properly by the front end. >>>>>> >>>>>> Yeah, I guess we need to define what "properly" is ;) >>>>>> >>>>> Yes. >>>>> >>>>>>> And that the ARRAY_TYPE of the string literal either has the >>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter, >>>>>>> this is always exactly one (wide) character size less than >>>>>>> TREE_STRING_LENGTH >>>>>> >>>>>> I think it should be always the same... >>>>>> >>>>> >>>>> One could not differentiate between "\0" without zero-termination >>>>> and "" with zero-termination, theoretically. >>>> >>>> Is that important? Doesn't the C standard say how to parse string >>>> literals? >>>> >>>>> We also have char x[100] = "ab"; >>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100. >>>>> Of course one could create it with a TREE_STRING_LENGTH = 100, >>>>> but imagine char x[100000000000] = "ab" >>>> >>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I >>>> hope matches "ab" and not 'x'. If it matches 'x' then I'd rather have it >>>> unconditionally be [], thus incomplete (because the literals "size" depends >>>> on the context/LHS it is used on). >>>> >>> >>> Sorry, but I must say, it is not at all like that. >>> >>> If I compile x.c: >>> const char x[100] = "ab"; >>> >>> and set a breakpoint at output_constant: >>> >>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256, >>> reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821 >>> 4821 if (size == 0 || flag_syntax_only) >>> (gdb) p size >>> $1 = 100 >>> (gdb) call debug(exp) >>> "ab" >>> (gdb) p *exp >>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1, >>> (gdb) p exp->typed.type->type_common.size_unit >>> $5 = (tree) 0x7ffff6ff9d80 >>> (gdb) call debug(exp->typed.type->type_common.size_unit) >>> 100 >>> (gdb) p exp->string.length >>> $6 = 3 >>> (gdb) p exp->string.str[0] >>> $8 = 97 'a' >>> (gdb) p exp->string.str[1] >>> $9 = 98 'b' >>> (gdb) p exp->string.str[2] >>> $10 = 0 '\000' >>> (gdb) p exp->string.str[3] >>> $11 = 0 '\000' >>> >>> >>> This is an important property of string_cst objects, that is used in >>> c_strlen: >>> >>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond >>> TREE_STRING_LENGTH >>> is guaranteed to be zero up to the type size. >>> >>> I would not have spent one thought on implementing an optimization like >>> that, >>> but that's how it is right now. >> >> Huh. So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka >> they have zero-padding up to its type size. I don't see this documented >> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0" >> with appropriate TYPE_SIZE. >> >> This is also a relatively new thing on trunk (coming out of the added >> mem_size parameter of string_constant). That this looks at the STRING_CST >> type like >> >> if (TREE_CODE (array) == STRING_CST) >> { >> *ptr_offset = fold_convert (sizetype, offset); >> if (mem_size) >> *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); >> return array; >> >> I'd call simply a bug. As said, interpretation of STRING_CSTs should >> depend on the context. And for >> > > This use of the TYPE_SIZE_UNIT was pre-existing to r263607 > before that (but not in the gcc-8-branch) we had this in c_strlen: > > HOST_WIDE_INT maxelts = strelts; > tree type = TREE_TYPE (src); > if (tree size = TYPE_SIZE_UNIT (type)) > if (tree_fits_shwi_p (size)) > { > maxelts = tree_to_uhwi (size); > maxelts = maxelts / eltsize - 1; > } > > which I moved to string_constant and return the result through memsize. > > It seems to be working somehow, and I'd bet removing that will immediately > break twenty or thirty of the strlenopt tests. > > Obviously the tree string objects allow way too much variations, > and it has to be restricted in some way or another. > > In the moment my approach is: look at what most string constants do > and add assertions to ensure that there are no exceptions. > >
I found some more places where TYPE_SIZE_UNIT (TREE_TYPE (str)) and TREE_STRING_LENGTH (str) are used together, in some very old code and clearly Martin is not to blame for them: varasm.c: static HOST_WIDE_INT get_constant_size (tree exp) { HOST_WIDE_INT size; size = int_size_in_bytes (TREE_TYPE (exp)); if (TREE_CODE (exp) == STRING_CST) size = MAX (TREE_STRING_LENGTH (exp), size); return size; } mergeable_string_section (tree decl ATTRIBUTE_UNUSED, unsigned HOST_WIDE_INT align ATTRIBUTE_UNUSED, unsigned int flags ATTRIBUTE_UNUSED) { HOST_WIDE_INT len; if (HAVE_GAS_SHF_MERGE && flag_merge_constants && TREE_CODE (decl) == STRING_CST && TREE_CODE (TREE_TYPE (decl)) == ARRAY_TYPE && align <= 256 && (len = int_size_in_bytes (TREE_TYPE (decl))) > 0 && TREE_STRING_LENGTH (decl) >= len) So, no sorry, I don't see how to change the STRING_CST from using the TREE_TYPE in that way. Bernd. >> char a[4] = "abc"; >> char b[5] = "abc"; >> >> we should better be able to share STRING_CSTs [you can see LTO >> sharing the nodes if you do b[4] but not when b[5] I suppose]. >> >>> All I want to do, is make sure that all string constants have the same look >>> and feel >>> in the middle-end, and restrict the variations that are allowed by the >>> current >>> implementation. >> >> Sure, I understand that. But I'd like to simplify things and not add >> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide >> whether sth is 0-terminated. >> > > This is not only about 0-terminated. (*) > > It is also about when you get a STRING_CST with a TREE_STRING_LENGTH, > there are places in the middle-end that assume that the object contains > _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT. > Those I want to protect. > > Bernd. > > > *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero, > the string is only returned when it is properly zero terminated.", > but maybe I should have written: > "If MEM_SIZE is zero, the string is only returned when the storage size > of the string object is at least TREE_STRING_LENGTH." > That's at least exactly what the check does. > >> Richard. >> >>> >>> Bernd. >>> >>> >>>>>>> The idea is to use this property of string literals where needed, >>>>>>> and check rigorously in varasm.c. >>>>>>> >>>>>>> Does that make sense? >>>>>> >>>>>> So if it is not the same then the excess character needs to be >>>>>> a (wide) NUL in your model? ISTR your varasm.c patch didn't verify >>>>>> that. >>>>>> >>>>> >>>>> I think it does. >>>>> >>>>> >>>>> Bernd. >>> >>