Re: UTF-8 ill-formed question

Otto Stolz Wed, 12 Dec 2012 12:25:25 -0800

Hello,

On 2012-12-11 20:16, James Lin wrote:

If I have a code point: U+4E8C or "二"
In UTF-8, it's "E4 BA 8C" while in UTF-16, it's "4E8C".
Where does this "BA" come from?


Cf. <http://skew.org/cumped/>.
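
In short, the "BA" is the middle octet. UTF-8 spreads the 16 bits of
U+4E8C over three octets as 1110wwww 10xxxxyy 10yyzzzz, so the middle
octet takes all four bits of the digit E (1110) and the top two bits
of the digit 8 (10): 10 1110 10 = BA. Here is a minimal Python sketch
of that bit shuffling, for illustration only. The function name
utf8_three_byte is made up for this example, and it deliberately
covers only the three-octet range U+0800..U+FFFF.

```python
def utf8_three_byte(cp: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF as three UTF-8 octets."""
    if not 0x0800 <= cp <= 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("expected a non-surrogate code point in U+0800..U+FFFF")
    lead = 0xE0 | (cp >> 12)           # 1110wwww: marker plus hex digit W
    mid = 0x80 | ((cp >> 6) & 0x3F)    # 10xxxxyy: digit X plus top half of Y
    last = 0x80 | (cp & 0x3F)          # 10yyzzzz: bottom half of Y plus digit Z
    return bytes([lead, mid, last])


if __name__ == "__main__":
    # U+4E8C is the code point from the question.
    print(utf8_three_byte(0x4E8C).hex(" ").upper())
```

In UTF-16 nothing is shuffled for this code point, which is why
"4E8C" appears unchanged there.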

Enclosed are the (almost original) version of “Cima’s Magic
UTF-8 Pocket encoder” (2004), and its two followers for
more UTFs. Display or print with a fixed-pitch font,
such as Lucida Console or Courier New. Enjoy!

Cheers,
   Otto Stolz

Side 1 (print and cut out):

+------------+-------+-----------------------+------+
|     U+0000 | yy zz |    Cima's UTF-8 Magic | Hex= |
|     U+007F | !  !  |        Pocket Encoder | B-4  |
|         YZ | .  .  |                       |      |
+------------+-------+-------+     Vers. 1.1 | 0=00 |
|     U+0080 | 3x xy | 2y zz |  30 June 2004 | 1=01 |
|     U+07FF | 3. .. | 2. !  |               | 2=02 |
|        XYZ | .  .  | .  .  |          M.C. | 3=03 |
+------------+-------+-------+-------+       | 4=10 |
|     U+0800 | 32 ww | 2x xy | 2y zz |       | 5=11 |
|     U+FFFF | !  !  | 2. .. | 2. !  |       | 6=12 |
|       WXYZ | E  .  | .  .  | .  .  |       | 7=13 |
+------------+-------+-------+-------+-------+ 8=20 |
| U-00010000 | 33 0v | 2v ww | 2x xy | 2y zz | 9=21 |
| U-000FFFFF | !  0. | 2. !  | 2. .. | 2. !  | A=22 |
|      VWXYZ | F  .  | .  .  | .  .  | .  .  | B=23 |
+------------+-------+-------+-------+-------+ C=30 |
| U-00100000 | 33 10 | 20 ww | 2x xy | 2y zz | D=31 |
| U-0010FFFF | !  1. | 2. !  | 2. .. | 2. !  | E=32 |
|       WXYZ | F  4  | 8  .  | .  .  | .  .  | F=33 |
+------------+-------+-------+-------+-------+------+

Side 2 (print, cut out, and glue on back of side 1):

+---------------------------------------------------+
| Cima's UTF-8 Magic Pocket Encoder - User's Manual |
| (vers. 1.1, 30 June 2004, by Marco Cimarosti)     |
|                                                   |
| - Left column: min and max Unicode scalar values: |
|   pick the row that applies to the code point you |
|   want to convert to UTF-8. Letters V..Z mark the |
|   hexadecimal digits that have to be processed.   |
| - Right column: hexadecimal to base-4 table.      |
| - Central columns: work area to compute each octet|
|   (1 to 4) that constitute UTF-8 octet sequences. |
| Convert each digit marked by V..Z from hex. to    |
| b.-4. Write b.-4 digits on the dots placed under  |
| letters v..z (two b.-4 digits per hex. digit).    |
| Convert 2-digit base-4 number to hex. digits and  |
| write them on the dots on the line. That is your  |
| UTF-8 sequence in hex.  ! Exclamation marks show  |
| passages that may be skipped, either because the  |
| digit is hard-coded, or because it may be copied  |
| directly from the scalar value.                   |
+---------------------------------------------------+

Enjoy!

Marco
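
The card's procedure (hex digit to two base-4 digits, fill in the row
template, read every four base-4 digits back as one octet) can be
sketched in Python as follows. This is only my reading of the card,
not code from its author. The names quats, octet and
encode_utf8_by_card are invented for the example, and t2 stands for
the hard-coded "0" or "1" in the first octet of the two four-octet
rows.

```python
def quats(cp: int) -> list[int]:
    """Twelve base-4 digits of cp, two per hex digit of a 6-digit value."""
    return [(cp >> shift) & 3 for shift in range(22, -1, -2)]


def octet(digits: list[int]) -> int:
    """Read four base-4 digits back as one octet (two hex digits)."""
    value = 0
    for d in digits:
        value = value * 4 + d
    return value


def encode_utf8_by_card(cp: int) -> bytes:
    """Encode a Unicode scalar value by filling in the card's row templates."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    # Name the digits as on the card: v1 v2 are the base-4 digits of V, etc.
    t2, v1, v2, w1, w2, x1, x2, y1, y2, z1, z2 = quats(cp)[1:]
    if cp <= 0x7F:                       # row 1: "yy zz"
        rows = [[y1, y2, z1, z2]]
    elif cp <= 0x7FF:                    # row 2: "3x xy | 2y zz"
        rows = [[3, x1, x2, y1], [2, y2, z1, z2]]
    elif cp <= 0xFFFF:                   # row 3: "32 ww | 2x xy | 2y zz"
        rows = [[3, 2, w1, w2], [2, x1, x2, y1], [2, y2, z1, z2]]
    else:                                # rows 4 and 5: "33 0v" or "33 10", ...
        rows = [[3, 3, t2, v1], [2, v2, w1, w2],
                [2, x1, x2, y1], [2, y2, z1, z2]]
    return bytes(octet(r) for r in rows)


if __name__ == "__main__":
    print(encode_utf8_by_card(0x4E8C).hex(" ").upper())
```

For U+4E8C the third row applies: X = E = 32 and Y = 8 = 20 in
base 4, so the middle octet is "23 22", which reads back as B A.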

Obverse: Print with a fixed-width font, such as Lucida Console,
and cut out.

╔════════════╦═════════════╦═════════════════════════════════╗
║     U+0000 ║ W  X  Y  Z  ║ Otto’s Magic Pocket Encoder     ║
║     U+D7FF ║ !  !  !  !  ║ for UTF-16  ╔═══════════════════╣
║       WXYZ ║ _  _  _  _  ║             ║    V>vv │    V>vv ║
╟────────────╫─────────────╢ Version 1.1 ║    U>uu │    U>uu ║
║     U+E000 ║ W  X  Y  Z  ║ ©2004-07-05 ║ tt<T    │ tt<T    ║
║     U+FFFF ║ !  !  !  !  ║             ║    _<__ │    _<__ ║
║       WXYZ ║ _  _  _  _  ║             ║ ────────┼──────── ║
╟────────────╫─────────────╚═════════════╣    0=00 │ 13<8=20 ║
║ U-00010000 ║ 31 2t tu uv │ 31 3v Y  Z  ║ 00<1=01 │ 20<9=21 ║
║ U-000FFFFF ║ !  2_ __ __ │ !  3_ !  !  ║ 01<2=02 │ 21<A=22 ║
║      TUVYZ ║ D  _  _  _  │ D  _  _  _  ║ 02<3=03 │ 22<B=23 ║
╟────────────╫─────────────┼─────────────╢ 03<4=10 │ 23<C=30 ║
║ U-00100000 ║ 31 23 3u uv │ 31 3v Y  Z  ║ 10<5=11 │ 30<D=31 ║
║ U-0010FFFF ║ !  !  3_ __ │ !  3_ !  !  ║ 11<6=12 │ 31<E=32 ║
║       UVYZ ║ D  B  _  _  │ D  _  _  _  ║ 12<7=13 │ 32<F=33 ║
╚════════════╩═════════════╧═════════════╩═══════════════════╝


....:....1....:....2....:....3....:....4....:....5....:....6..


Reverse: Cut out and paste on back of obverse.

╔════════════════════════════════════════════════════════════╗
║     Otto’s Magic Pocket Encoder for UTF-16 Version 1.1     ║
║     User’s Manual      (inspired by Cima’s UTF-8 MPE)      ║
╠════════════════════════════════════════════════════════════╣
║• Left column: min and max Unicode scalar values: pick the  ║
║  row that applies to the code point to be converted.       ║
║  T…Z mark the hexadecadic digits that have to be processed.║
║• Central column: work area to compute UTF-16BE code units. ║
║• Right column: hexadecadic to quaternary conversion tables:║
║  < for T to tt; = for U/V to uu/vv (step 1) and for step 2.║
║1. Convert each digit marked by T…V from hex to quat. Write ║
║   quat digits on the underscores placed under letters t…v. ║
║2. Convert 2-digit quat numbers to hex digits or copy digits║
║   W…Z, as indicated, and write them on the underscores on  ║
║   the next line. That’s your UTF-16BE sequence in hex.     ║
║! Exclamation marks indicate shortcuts.                     ║
╚════════════════════════════════════════════════════════════╝

Enjoy.
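
For comparison, here is a small Python sketch of the arithmetic that
the UTF-16 card encodes. It is not a transcription of the card: it
subtracts 0x10000 directly, which is what the "<" table does when it
turns T into tt (the base-4 form of T minus 1). The function name
encode_utf16be is an assumption made for this example.

```python
def encode_utf16be(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-16BE (no byte order mark)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0xFFFF:
        units = [cp]                        # rows 1 and 2: copy W X Y Z
    else:
        offset = cp - 0x10000               # the "<" table: T becomes T minus 1
        high = 0xD800 | (offset >> 10)      # "31 2t tu uv"
        low = 0xDC00 | (offset & 0x3FF)     # "31 3v Y  Z"
        units = [high, low]
    return b"".join(unit.to_bytes(2, "big") for unit in units)


if __name__ == "__main__":
    print(encode_utf16be(0x4E8C).hex(" ").upper())
    print(encode_utf16be(0x1F600).hex(" ").upper())
```
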


Side 1 (print and cut out):

╔════════════╦═══════════════════════╤═══════════════╗
║ This space ║   Mike’s UTF-32 Magic │     Vers. 1.0 ║
║  for rent  ║        Pocket Encoder │  06 July 2004 ║
║            ║                       │               ║
╠════════════╬═══════╤═══════╤═══════╪═══════╗       ║
║ U-00000000 ║ 0  0  │ U  V  │ W  X  │ Y  Z  ║       ║
║ U-0010FFFF ║ !  !  │ !  !  │ !  !  │ !  !  ║       ║
║     UVWXYZ ║ 0  0  │ _  _  │ _  _  │ _  _  ║       ║
╚════════════╩═══════╧═══════╧═══════╧═══════╩═══════╝

Side 2 (print, cut out, and glue on back of side 1):

╔════════════════════════════════════════════════════╗
║ Mike’s UTF-32 Magic Pocket Encoder - User’s Manual ║
║ (vers. 1.0, 6 July 2004, by Mike Ayers)            ║
║                                                    ║
║ - Left column: min and max Unicode scalar values.  ║
║   Letters U..Z mark the hexadecimal digits to be   ║
║   processed.  Read the bytes in the bottom row     ║
║   left to right, or right to left for UTF-32LE.    ║
╚════════════════════════════════════════════════════╝
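
The UTF-32 card amounts to padding the scalar value to four bytes. A
minimal Python sketch of that, with the function name encode_utf32
and the little_endian flag as assumptions of this example:

```python
def encode_utf32(cp: int, little_endian: bool = False) -> bytes:
    """Encode one Unicode scalar value as UTF-32BE or UTF-32LE (no BOM)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    # Big-endian reads the card's bottom row left to right,
    # little-endian reads it right to left.
    return cp.to_bytes(4, "little" if little_endian else "big")


if __name__ == "__main__":
    print(encode_utf32(0x4E8C).hex(" ").upper())
    print(encode_utf32(0x4E8C, little_endian=True).hex(" ").upper())
```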
