On Mon, 4 Sep 2017, Joseph Myers wrote:

> On Mon, 4 Sep 2017, Richard Biener wrote:
>
> > always have a consistent "character" size and how the individual
> > "characters" are encoded.  The patch assumes that the array element
> > type of the STRING_CST can be used to get access to individual
> > characters by means of the element type size and those elements
> > are stored in host byteorder.  Which means the patch simply handles
>
> It's actually target byte order, i.e. the STRING_CST stores the same
> sequence of target bytes as would appear on the target system (modulo
> certain strings such as in asm statements and attributes, for which
> translation to the execution character set is disabled because those
> strings are only processed in the compiler on the host, not on the target
> - but you should never encounter such strings in the optimizers etc.).
> This is documented in generic.texi (complete with a warning about how it's
> not well-defined what the encoding is if target bytes are not the same as
> host bytes).

Ah thanks.

> I suspect that, generically in the compiler, the use of C++ might make it
> easier than it would have been some time ago to build some abstractions
> around target strings that work for all of narrow strings, wide strings,
> char16_t strings etc. (for extracting individual elements - or individual
> characters which might be multibyte characters in the narrow string case,
> etc.) - as would be useful for e.g. wide string format checking and more
> generally for making e.g. optimizations for narrow strings also work for
> wide strings.  (Such abstractions wouldn't solve the question of what the
> format is if host and target bytes differ, but their use would reduce the
> number of places needing changing to establish a definition of the format
> in that case if someone were to do a port to a system with bytes bigger
> than 8 bits.)
>
> However, as I understand the place you're patching, it doesn't have any
> use for such an abstraction; it just needs to copy a sequence of bytes
> from one place to another.  (And even with host bytes different from
> target bytes, clearly it would make sense to define the internal
> interfaces to make the encodings consistent so this function still only
> needs to copy bytes from one place to another and still doesn't need such
> abstractions.)

Right.  Given they are in target representation the patch becomes
much simpler and we can handle all STRING_CSTs modulo the case
where BITS_PER_UNIT != CHAR_BIT (as you say).  I suppose we can
easily declare we'll never support a CHAR_BIT != 8 host and we
currently don't have any BITS_PER_UNIT != 8 port (we had c4x).

I'm not sure what constraints we have on CHAR_TYPE_SIZE vs. BITS_PER_UNIT,
or for what port it would make sense to have differing values.  Or what
it means for native encoding (should the BITS_PER_UNIT != CHAR_BIT test
be CHAR_TYPE_SIZE != CHAR_BIT instead?).  BITS_PER_UNIT is also only
documented in rtl.texi rather than in tm.texi.

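To make the "just copy bytes" point concrete, here is a minimal standalone
sketch in C.  It is not GCC code: toy_string_cst and
toy_native_encode_string are hypothetical names invented for illustration,
and the zero-fill of the part of the array type not covered by the stored
bytes is an assumption of the sketch.  The only thing it is meant to show
is that, once the byte image is already in target order, the element width
(char, char16_t, wchar_t, ...) never enters the picture.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a string constant.  BYTES is the image
   already laid out in target byte order, whatever the element width.  */
struct toy_string_cst
{
  const unsigned char *bytes;  /* target-order byte image */
  size_t length;               /* number of bytes actually stored */
  size_t type_size;            /* size in bytes of the array type */
};

/* Copy at most LEN bytes of the object's image into BUF and zero-fill
   the part of the array type that the stored bytes do not cover.
   Returns the number of bytes written.  No per-element handling is
   needed because the bytes are already in target order.  */
static size_t
toy_native_encode_string (const struct toy_string_cst *s,
                          unsigned char *buf, size_t len)
{
  assert (s != NULL && buf != NULL);
  size_t total = s->type_size < len ? s->type_size : len;
  size_t avail = s->length < total ? s->length : total;
  memcpy (buf, s->bytes, avail);
  memset (buf + avail, 0, total - avail);
  return total;
}
```

With a little-endian 16-bit image of "AB" (bytes 0x41 0x00 0x42 0x00) and
a 6-byte array type, the sketch writes those four bytes followed by two
zero bytes and returns 6, without ever looking at the element size.
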
Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.

Richard.

2017-09-05  Richard Biener  <rguent...@suse.de>

	PR tree-optimization/82084
	* fold-const.c (can_native_encode_string_p): Handle wide characters.

Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c	(revision 251661)
+++ gcc/fold-const.c	(working copy)
@@ -7489,10 +7489,11 @@ can_native_encode_string_p (const_tree e
 {
   tree type = TREE_TYPE (expr);
 
-  if (TREE_CODE (type) != ARRAY_TYPE
+  /* Wide-char strings are encoded in target byte-order so native
+     encoding them is trivial.  */
+  if (BITS_PER_UNIT != CHAR_BIT
+      || TREE_CODE (type) != ARRAY_TYPE
       || TREE_CODE (TREE_TYPE (type)) != INTEGER_TYPE
-      || (GET_MODE_BITSIZE (SCALAR_INT_TYPE_MODE (TREE_TYPE (type)))
-          != BITS_PER_UNIT)
       || !tree_fits_shwi_p (TYPE_SIZE_UNIT (type)))
     return false;
   return true;