Re: Hebrew filenames from a Windows(XP) zip file.

Yedidyah Bar-David Wed, 25 Aug 2004 06:31:37 -0700

Hi,

On Wed, Aug 25, 2004 at 09:57:43AM +0300, Amir Hardon wrote:
> I'm trying to extract a zip file with Hebrew file names that was created with 
> winzip on a Windows XP machine.
> It looks like there is an encoding problem, but a weird one.


This also troubled me for some time. Incidentally, just yesterday I
downloaded unzip's sources, and your email was the last push to read
them.

> 
> Just for testing the encoding I listed the file names into a text file ('unzip 
> -l > file.txt'), and tried to convert it to different encodings using iconv.
> But iconv always failed (no matter which encoding I tried),
> with the following message:
> iconv: illegal input sequence at position 112
> The first byte that is supposed to be Hebrew is at position 112;
> its value is 0xEA, which is "Kaf sofit" in iso-8859-8.
> 
> Anyway I just opened the text file with Mozilla and tried to view it using 
> every Hebrew or Unicode encoding it supports, but none of them worked.
> 
> My last resort was to calculate the difference between the values of the 
> letter I get and the letter it should be. The first two letters have the same 
> difference (subtract two to get the original letter) but the third letter has 
> a different one (add five to get the original letter).
> That is strange!

unzip has the encoding hard-coded in the source:
/*---------------------------------------------------------------------------
    
  The following conversion tables translate between IBM PC CP 850
  (OEM codepage) and the "Western Europe & America" Windows codepage 1252.
  The Windows codepage 1252 contains the ISO 8859-1 "Latin 1" codepage,
  with some additional printable characters in the range (0x80 - 0x9F),
  that is reserved to control codes in the ISO 8859-1 character table.
    
  The ISO <--> OEM conversion tables were constructed with the help
  of the WIN32 (Win16?) API's OemToAnsi() and AnsiToOem() conversion
  functions and have been checked against the CP850 and LATIN1 tables
  provided in the MS-Kermit 3.14 distribution.

  ---------------------------------------------------------------------------*/
[snip]
ZCONST uch Far oem2iso[] = {
    0xC7, 0xFC, 0xE9, 0xE2, 0xE4, 0xE0, 0xE5, 0xE7,  /* 80 - 87 */
    0xEA, 0xEB, 0xE8, 0xEF, 0xEE, 0xEC, 0xC4, 0xC5,  /* 88 - 8F */
    0xC9, 0xE6, 0xC6, 0xF4, 0xF6, 0xF2, 0xFB, 0xF9,  /* 90 - 97 */
    0xFF, 0xD6, 0xDC, 0xF8, 0xA3, 0xD8, 0xD7, 0x83,  /* 98 - 9F */
    0xE1, 0xED, 0xF3, 0xFA, 0xF1, 0xD1, 0xAA, 0xBA,  /* A0 - A7 */
    0xBF, 0xAE, 0xAC, 0xBD, 0xBC, 0xA1, 0xAB, 0xBB,  /* A8 - AF */
    0xA6, 0xA6, 0xA6, 0xA6, 0xA6, 0xC1, 0xC2, 0xC0,  /* B0 - B7 */
    0xA9, 0xA6, 0xA6, 0x2B, 0x2B, 0xA2, 0xA5, 0x2B,  /* B8 - BF */
    0x2B, 0x2D, 0x2D, 0x2B, 0x2D, 0x2B, 0xE3, 0xC3,  /* C0 - C7 */
    0x2B, 0x2B, 0x2D, 0x2D, 0xA6, 0x2D, 0x2B, 0xA4,  /* C8 - CF */
    0xF0, 0xD0, 0xCA, 0xCB, 0xC8, 0x69, 0xCD, 0xCE,  /* D0 - D7 */
    0xCF, 0x2B, 0x2B, 0xA6, 0x5F, 0xA6, 0xCC, 0xAF,  /* D8 - DF */
    0xD3, 0xDF, 0xD4, 0xD2, 0xF5, 0xD5, 0xB5, 0xFE,  /* E0 - E7 */
    0xDE, 0xDA, 0xDB, 0xD9, 0xFD, 0xDD, 0xAF, 0xB4,  /* E8 - EF */
    0xAD, 0xB1, 0x3D, 0xBE, 0xB6, 0xA7, 0xF7, 0xB8,  /* F0 - F7 */
    0xB0, 0xA8, 0xB7, 0xB9, 0xB3, 0xB2, 0xA6, 0xA0   /* F8 - FF */
};
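
As a sanity check on that comment, the first 27 entries of oem2iso
(OEM bytes 0x80 - 0x9A, which is where cp862 keeps the Hebrew letters)
can be compared with a generic cp850 -> iso8859-1 conversion. This is
only a sketch in Python, assuming its built-in cp850 and latin-1 codecs
agree with the tables unzip was built from; OEM2ISO_HEAD and
cp850_to_latin1 are names made up for the example:

```python
# First 27 entries of unzip's oem2iso table, copied from the excerpt above
# (OEM bytes 0x80 - 0x9A). OEM2ISO_HEAD is a name made up for this sketch.
OEM2ISO_HEAD = bytes([
    0xC7, 0xFC, 0xE9, 0xE2, 0xE4, 0xE0, 0xE5, 0xE7,  # 80 - 87
    0xEA, 0xEB, 0xE8, 0xEF, 0xEE, 0xEC, 0xC4, 0xC5,  # 88 - 8F
    0xC9, 0xE6, 0xC6, 0xF4, 0xF6, 0xF2, 0xFB, 0xF9,  # 90 - 97
    0xFF, 0xD6, 0xDC,                                # 98 - 9A
])


def cp850_to_latin1(oem_bytes: bytes) -> bytes:
    """Convert cp850 bytes to iso8859-1 by going through Unicode."""
    return oem_bytes.decode("cp850").encode("latin-1")


oem_range = bytes(range(0x80, 0x9B))
# Does the hard-coded table agree with the codec-based conversion?
table_matches_codec = cp850_to_latin1(oem_range) == OEM2ISO_HEAD
# Are all 27 output values distinct, so that the mapping can be undone?
is_reversible = len(set(OEM2ISO_HEAD)) == len(OEM2ISO_HEAD)
print(table_matches_codec, is_reversible)
```

Further down the table the mapping stops being one-to-one (several
entries are 0x2B or 0xA6), but those positions do not hold Hebrew
letters in cp862.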

Reading the comment, and poking at the zip and the output with od, I
understand that the zip itself stores the filenames in DOS Hebrew
(cp862), which unzip treats as cp850 and converts to iso8859-1.
Undoing that conversion indeed worked: I created a file whose name
contains all the Hebrew letters, zipped it with winzip, unzipped it in
Linux, then ran
ls -l | iconv -f iso8859-1 -t cp850 | iconv -f cp862 -t iso8859-8
and the names came out right.
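
The same round trip can be tried without a zip file at all. A minimal
sketch in Python, assuming unzip's damage is exactly "cp862 bytes read
as cp850 and written as iso8859-1" as described above; the function
names mangle_like_unzip and repair are invented for the example:

```python
def mangle_like_unzip(name: str) -> bytes:
    """Simulate the damage: cp862 bytes read as cp850, written as iso8859-1."""
    return name.encode("cp862").decode("cp850").encode("latin-1")


def repair(mangled: bytes) -> str:
    """Python analogue of the two chained iconv calls:
    iso8859-1 -> cp850 gives back the stored bytes, which are then
    read as cp862."""
    oem_bytes = mangled.decode("latin-1").encode("cp850")
    return oem_bytes.decode("cp862")


original = "\u05e9\u05dc\u05d5\u05dd.txt"   # four Hebrew letters plus ".txt"
mangled = mangle_like_unzip(original)
restored = repair(mangled)

print(mangled)                        # the bytes unzip would write to disk
print(restored == original)
print(restored.encode("iso8859_8"))   # final step of the pipeline
```

The first conversion never touches Unicode Hebrew at all; it only
recovers the bytes as they were stored in the archive, and the second
one interprets them with the right codepage.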

> 
> (List's Hebrew haters, please forgive the next paragraph)
> Just for the record here is the string I get:
> "???????? ?´?????¤ ???????¤"
> Which should be:
> "???????? ???????£ ???????£"
> (Both strings are in logical order)
> 
> So I have two questions:
> 1. (The simple one) What's the problem with iconv?

iconv does not just shift byte ranges around; it also checks that each
input byte is valid in the source charset and has an equivalent in the
target one. Running it twice allowed me to trick it. Doing e.g.
'iconv -f iso8859-1 -t iso8859-8' might theoretically work (I am not
sure, I have to think about it), but iconv knows which chars exist in
each charset and refuses to convert the ones that do not.
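
This kind of validity check is easy to reproduce. The sketch below uses
Python's codecs rather than iconv itself, on the assumption that their
iso8859-8 table agrees with iconv's; try_direct is a made-up helper
name. It attempts the one-step iso8859-1 -> iso8859-8 conversion on the
byte from the original report:

```python
def try_direct(mangled: bytes) -> str:
    """Attempt the one-step iso8859-1 -> iso8859-8 conversion."""
    text = mangled.decode("latin-1")   # always succeeds: every byte is valid
    try:
        text.encode("iso8859_8")
        return "converted"
    except UnicodeEncodeError as err:
        # The codec refuses characters the target charset does not contain,
        # comparable to iconv stopping with an error.
        return f"refused at position {err.start}: {text[err.start]!r}"


# 0xEA is the byte from the original report. Read as iso8859-1 it is
# e-circumflex, a character that iso8859-8 has no slot for.
result = try_direct(b"\xea")
print(result)
print(try_direct(b"plain ascii"))
```

At least with these codecs, the one-step conversion is refused rather
than silently shifting the byte into the Hebrew range.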

> 2. What can I do with the Hebrew filenames?

Use the above with some script or, if you have a lot of time, make
unzip use iconv/gconv and allow the user to set the charset :-)
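
Such a script could look roughly like the following. It is only a
sketch in Python, not anything shipped with unzip: repaired_name,
plan_renames and main are hypothetical names, the target charset
defaults to iso8859-8 to match the pipeline above, and it assumes a
system where filenames are plain byte strings (as on Linux).

```python
import os


def repaired_name(raw: bytes, target: str = "iso8859_8") -> bytes:
    """Undo unzip's cp850 -> iso8859-1 step, then read the bytes as cp862."""
    text = raw.decode("latin-1").encode("cp850").decode("cp862")
    return text.encode(target)


def plan_renames(names, target="iso8859_8"):
    """Return (old, new) byte-string pairs for the names worth renaming."""
    plan = []
    for old in names:
        try:
            new = repaired_name(old, target)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # the trick does not apply to this name; leave it alone
        if new != old:  # pure-ASCII names come back unchanged
            plan.append((old, new))
    return plan


def main(directory: bytes) -> None:
    """Rename every affected entry directly inside `directory`."""
    # Passing bytes to os.listdir makes it return raw byte names.
    for old, new in plan_renames(os.listdir(directory)):
        os.rename(os.path.join(directory, old), os.path.join(directory, new))
```

Calling plan_renames() on a list of names first shows what would change
before main() touches anything on disk.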
-- 
Didi

