Re: [Freedos-devel] ASCII to unicode table

Christian Masloch Tue, 30 Nov 2010 10:03:51 -0800

> UniCode is not the panacea it's purported to be.

No, but you have to give them that it's certainly an improvement.


>> UTF-8 is independent of byte-order. The exact encoding (and byte-order)
>> should always either be implicit (in the interface's or format's
>> definition) or be marked in some way.
>
> I don't think there is a way to automatically determine the encoding from
> the data itself,

Yes, you cannot reliably determine the encoding automatically. That's why
I said you should *know* what data you are dealing with. (Automatic
determination of the encoding is a serious problem when dealing with plain
text files, but that need not concern a kernel code translation interface
such as the one I have in mind.)

> and the only way to determine the byte-order (assuming it's
> not UTF-8, not a single character, and is unknown from the context) is to
> include the special BOM (Byte Order Mark) character as the first  
> character
> of the string.

Yes.

> In fact, according to the UniCode spec, if the BOM is not
> included and the byte-order is not clear from the context, you're  
> supposed
> to assume big-endian.

I don't know about that. But I guess that is the case if you say so.
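
A minimal C sketch of the BOM check discussed above, assuming a UTF-16
buffer of unknown byte order. The names utf16_order and
utf16_detect_order are made up for illustration, and the big-endian
fallback simply follows the quoted suggestion:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, not part of any existing FreeDOS interface. */
enum utf16_order { UTF16_BE, UTF16_LE };

/*
 * Look for a BOM (U+FEFF) in the first two bytes of a UTF-16 buffer.
 * *bom_len receives the number of bytes to skip (0 or 2).
 * Without a BOM the function falls back to big-endian.
 */
static enum utf16_order utf16_detect_order(const uint8_t *buf, size_t len,
                                           size_t *bom_len)
{
    *bom_len = 0;
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {
            *bom_len = 2;
            return UTF16_BE;
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            *bom_len = 2;
            return UTF16_LE;
        }
    }
    return UTF16_BE; /* assumption: no BOM and no other context */
}
```

A caller that already knows the byte order from the interface definition
would skip this check entirely.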

> For file system and similar applications, the interface could just always
> assume a specific format (probably either UTF-8 or UTF-16LE).

Yes. For example, the (in)famous FAT "long" file names are stored in
UTF-16LE. Their length is determined by their ASCIZ-like ("UTF-16LZ")
nature, i.e. they are terminated by a 16-bit word with the value zero.
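
As a rough illustration, the length of such a zero-terminated UTF-16LE
string could be found as follows. This is only a sketch: utf16lz_len and
its max_units bound are hypothetical, and it does not deal with how long
file names are split across directory entries.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count the 16-bit units that precede the terminating zero word.
 * The scan stops after max_units units; if no terminator is found
 * within that bound, max_units is returned.
 * Bytes are read one at a time, so the buffer need not be aligned.
 */
static size_t utf16lz_len(const uint8_t *s, size_t max_units)
{
    size_t n = 0;

    /* A unit is the terminator only if both of its bytes are zero. */
    while (n < max_units && (s[2 * n] | s[2 * n + 1]) != 0)
        n++;
    return n;
}
```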

If a file system interface (such as Int21/Int21.71) were to be made
Unicode-capable, I would probably use UTF-8. (Particularly because of the
ASCII compatibility: only bytes >= 80h ("codepage-dependent", so to speak)
are used to represent code-points >= U+0080.)
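
The ASCII compatibility can be seen in a small encoder. The following C
sketch (utf8_encode is a hypothetical helper, not an existing interface)
emits code-points below U+0080 as the identical single byte, and uses
only bytes >= 80h for everything else:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one code-point as UTF-8 into out[]. Returns the number of bytes
 * written (1 to 4), or 0 for surrogates and values above U+10FFFF.
 */
static size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {                    /* plain ASCII, unchanged */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not encodable */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* beyond the Unicode range */
}
```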

> For a
> general-purpose interface, though, you should be able to handle all
> different kinds of possibilities (including things like "UTF-24" and
> "UTF-64").

UTF-24 would be pretty funny. (FAT24 is an actual idea I had. Would work  
well enough.) Even theoretically, UTF-64 doesn't make a lot of sense: a  
24-bit (let alone 32-bit) encoding can already represent more values than  
are currently reserved for all Unicode code-points. Aligning each single
code-point is not a particularly good reason to unnecessarily double (you
might speak of "bloat" (-; ) the space required to store any given
string. 64-bit alignment of the whole string can still be achieved by
storing an unused dword after the actual string if it contains an odd
number of dwords; accesses can be aligned by always reading a whole
qword, then selecting the appropriate dword and discarding the other.
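
A short C sketch of that padding and access scheme, assuming 32-bit units
stored two per 64-bit word with the even-indexed unit in the low half (as
on a little-endian machine). Both function names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for n_units dwords, padded up to a multiple of 8 bytes. */
static size_t utf32_padded_size(size_t n_units)
{
    /* An odd count gets one unused dword appended. */
    return (n_units + (n_units & 1)) * sizeof(uint32_t);
}

/*
 * Fetch unit i by reading the whole qword that contains it, then keeping
 * one half and discarding the other.
 */
static uint32_t utf32_fetch(const uint64_t *qwords, size_t i)
{
    uint64_t q = qwords[i / 2];

    return (uint32_t)((i & 1) ? (q >> 32) : q);
}
```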

> Also, even though you're dealing with DOS doesn't necessarily
> mean everything will be little-endian -- it depends on the source of the
> data.  Certain hardware interfaces (like SCSI) are inherently big-endian,
> and data downloaded from external sources can be almost anything.

Yeah.

> Another possibility is what my UNI2ASCI program does, which is accept
> strings terminated with a specific character (in my case, the UniCode NUL
> character, conceptually similar to ASCIIZ).  A general-purpose program
> should provide more than one way to define a string's length.

I guess specifying the length in bytes is good enough. If you want to
pass NUL-terminated (or CP/M-style dollar-terminated (-; ) strings to such
an interface, write a wrapper function which counts the number of non-NUL
bytes/words/tri-bytes/dwords/qwords before passing the string to that
interface. For non-UTF-8 Unicode encodings, a number of bytes not
divisible by the size of the expected unit (2, 3, 4, 8) could just
cause an error.
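
Such a wrapper might look roughly like this in C. It is a sketch under
stated assumptions: translate_counted is only a stand-in for the
length-based interface (it does nothing beyond the divisibility check),
and nul_terminated_len is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in for a length-based interface. Returns 0 on success, or -1 if
 * len_bytes is not a multiple of unit_size (1, 2, 3, 4 or 8).
 */
static int translate_counted(const uint8_t *data, size_t len_bytes,
                             size_t unit_size)
{
    (void)data; /* the actual translation is omitted in this sketch */
    if (unit_size == 0 || len_bytes % unit_size != 0)
        return -1;
    return 0;
}

/*
 * Find the length in bytes of the data that precedes the first all-zero
 * unit. Returns 0 and stores the length in *len_bytes, or returns -1 if
 * no terminator lies within max_bytes.
 */
static int nul_terminated_len(const uint8_t *data, size_t max_bytes,
                              size_t unit_size, size_t *len_bytes)
{
    size_t off, i;

    if (unit_size == 0)
        return -1;
    for (off = 0; off + unit_size <= max_bytes; off += unit_size) {
        int nonzero = 0;

        for (i = 0; i < unit_size; i++)
            nonzero |= data[off + i];
        if (!nonzero) {
            *len_bytes = off;
            return 0;
        }
    }
    return -1;
}
```

A caller would first obtain the length with nul_terminated_len and then
hand the buffer and that length to the counted interface.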

Generally speaking, error handling is important. Correct UTF-8 validation  
isn't pretty though.
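
To give an idea of what validation involves, here is one possible C
sketch (utf8_is_valid is a hypothetical name). It rejects stray
continuation bytes, truncated sequences, overlong forms, surrogates and
values above U+10FFFF:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if s[0..len) is well-formed UTF-8, otherwise 0. */
static int utf8_is_valid(const uint8_t *s, size_t len)
{
    size_t i = 0, k, need;
    uint32_t cp, min;

    while (i < len) {
        uint8_t b = s[i];

        if (b < 0x80) {                 /* single ASCII byte */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            need = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            need = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return 0;   /* stray continuation byte or invalid lead byte */
        }

        if (len - i - 1 < need)
            return 0;                   /* sequence is truncated */
        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;               /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min)
            return 0;                   /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogate code-point */
        if (cp > 0x10FFFF)
            return 0;                   /* outside the Unicode range */
        i += need + 1;
    }
    return 1;
}
```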

> If you limit
> input to only certain encodings or byte-orders or string/character types,
> then it ceases to be "general-purpose".  Maybe a general-purpose program  
> is
> not what we're really talking about here, but I think one needs to be
> developed.

Yes, yes. I don't think a general-purpose translation program is what was
initially suggested (correct me if I'm wrong, though).

Regards,
Christian

Just noticing that this grows quite large. If someone finds this  
unbearable for this list, please speak up to let me know I should cut down  
the off-topic stuff on my public mails!
