On Sunday, 26 December 2021 at 21:22:42 UTC, Adam Ruppe wrote:
write just transfers a sequence of bytes. It doesn't know nor care what they represent - that's for the receiving end to figure out.

Oh, so it was as I expected :P

You are mistaken. There's several exceptions, utf-16 can come in pairs, and even utf-32 has multiple "characters" that combine onto one thing on screen.

Oh yeah. About that, I was never shown a demonstration of how it works, so I forgot about it. I saw that in Unicode you can combine some code points to get different results, but I never saw how that happens in practice. If you combine two code points, you get a different glyph. So yeah, that's one thing I don't understand...
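If I understand the explanation correctly, a small D sketch should show it; the byte counts in the comments are my assumption based on the standard UTF-8 encoding of these code points:

```d
import std.stdio : writeln;
import std.range.primitives : walkLength;
import std.uni : byGrapheme;

void main()
{
    string precomposed = "\u00E9";  // 'é' as a single code point (U+00E9)
    string combined    = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT

    writeln(precomposed); // é
    writeln(combined);    // also rendered as é by most terminals

    writeln(precomposed.length); // 2 (UTF-8 code units, i.e. bytes)
    writeln(combined.length);    // 3 (UTF-8 code units, i.e. bytes)

    // Grapheme count: what a reader would call "one character" in both cases
    writeln(precomposed.byGrapheme.walkLength); // 1
    writeln(combined.byGrapheme.walkLength);    // 1
}
```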

I prefer to think of a string as a little virtual machine that can be run to produce output rather than actually being "characters". Even with plain ascii, consider the backspace "character" - it is more an instruction to go back than it is a thing that is displayed on its own.

Yes, that's a great way of seeing it. I suppose this all happens under the hood and is OS-specific, so we would have to know how the OS we are working with behaves under the hood to fully understand it. The idea of some "characters" being "instructions" is also very interesting. From what I've seen, non-printable characters are always instructions (the "space" character aside), so another way to think about this is that every character carries one instruction: either to be written (displayed) in the output, or to perform some other modification of the text without being displayed itself as a character. Of course, I don't suppose that's exactly what happens under the hood, but it's an interesting way of describing it. If I got that right, even plain old backspace shows it, as in the sketch below.
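A tiny D sketch of what I mean; the exact output depends on the terminal, so treat the comment as an assumption rather than a guarantee:

```d
import std.stdio : write;

void main()
{
    // '\b' is an instruction ("move the cursor back one column"),
    // not a glyph. On a typical terminal this prints "AXY":
    // 'X' and 'Y' overwrite the 'B' and 'C' that were already there.
    write("ABC\b\bXY\n");
}
```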

This is because the *receiving program* treats them as utf-8 and runs it accordingly. Not all terminals will necessarily do this, and programs you pipe to can do it very differently.

That's pretty interesting, actually. Terminals (and don't forget shells) are programs themselves, so they choose the encoding themselves. However, do you know what we do for cross-compatibility then? Because this sounds like a HUGE mess for real-world applications.

The [w|d|]string.length function returns the number of elements in there, which is bytes for string, 16 bit elements for wstring (so bytes / 2), or 32 bit elements for dstring (so bytes / 4).

This is not necessarily related to the number of characters displayed.

I don't understand that. Based on your calculations, the results should have been different. Also, how are these sizes fixed? As you said, the number of bytes per character is not constant in every encoding. Even if it were fixed, that would mean 2 bytes for each UTF-16 character and 4 bytes for each UTF-32 character, so the numbers still don't make sense to me. Shouldn't the "length" property then be the same for every encoding, or at least the same for UTF-16 and UTF-32? So are the sizes of every character fixed or not?
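To make my confusion concrete, this is the kind of comparison I have in mind (a small D sketch; the numbers in the comments are what I'd expect from the standard UTF encodings):

```d
import std.stdio : writeln;

void main()
{
    // The same text in the three encodings D supports.
    string  s = "é";   // UTF-8:  array of char  (1 byte each)
    wstring w = "é"w;  // UTF-16: array of wchar (2 bytes each)
    dstring d = "é"d;  // UTF-32: array of dchar (4 bytes each)

    writeln(s.length); // 2 — 'é' needs two UTF-8 code units
    writeln(w.length); // 1 — one UTF-16 code unit is enough here
    writeln(d.length); // 1 — always one UTF-32 code unit per code point

    // A code point outside the Basic Multilingual Plane needs a
    // surrogate pair in UTF-16, so even wstring.length can exceed
    // the number of code points.
    wstring emoji = "\U0001F600"w; // 😀
    writeln(emoji.length); // 2 — a surrogate pair
}
```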

Damn, you guys should get paid for the help you are giving in this forum.
