On Wednesday, October 5, 2016 at 12:01:32 AM UTC-4, josh...@fastmail.com wrote:
>
> OK, I understand now: they're continuation bytes for UTF-8 and can't
> appear in that context so they get stripped from the string representation.
>
They don't get stripped; the invalid data is still stored in the String. However, anything that iterates over Unicode characters skips them, and length is a count of Unicode code points, so it comes out as 0 here even though sizeof still reports all 5 bytes:

julia> s = String([0x82,0x82,0x82,0x82,0x82])
5-byte String of invalid UTF-8 data:
 0x82 0x82 0x82 0x82 0x82

julia> length(s)
0

julia> sizeof(s)
5
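
If it helps to see why 0x82 counts as a continuation byte in the first place, here is a rough sketch that checks the bit pattern directly on the raw bytes. The helper name `iscontinuation` is made up for illustration and is not something in Base; the check only relies on the UTF-8 rule that continuation bytes look like 10xxxxxx.

```julia
# Hypothetical helper (not part of Base): a UTF-8 continuation byte
# has the bit pattern 10xxxxxx, i.e. its top two bits are `10`.
iscontinuation(b::UInt8) = (b & 0xc0) == 0x80

# The same bytes used to build the string in the example above.
bytes = [0x82, 0x82, 0x82, 0x82, 0x82]

# Count how many of the bytes are continuation bytes. A continuation
# byte can never start a character on its own.
n = count(iscontinuation, bytes)
```

With these five bytes, `n` matches the byte count, so every byte is a continuation byte with no lead byte in front of it, which is why none of them is counted as a character.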