Re: [widgets] Unicode Zip Paths (fully decomposed canonical form?)
Hi Martin,

On Sun, Dec 7, 2008 at 7:56 AM, Martin Duerst <[EMAIL PROTECTED]> wrote:
> At 09:31 08/12/06, Marcos Caceres wrote:
>>Hi, I'm trying to put the final touches on the zip section of the widget
>>packaging spec [1] before we go to LC by the 10th and I've run into an i18n
>>problem related to character encodings. I'm wondering if anyone would be
>>kind enough to give me some guidance as to what is going on, encoding-wise,
>>within MacOS with regards to the encoding of file names in Zip
>>files?
>>
>>When I create a zip file with one file entry called "ñ", inside the
>>zip file, the file name gets decomposed to the following (hex) byte
>>sequence:
>>
>>ñ -> 0x6E 0xCC
>
> My mailer has problems with UTF-8, but my guess is that you are
> using n-with-tilde. In UTF-8 and NFD, that would be 0x6E 0xCC 0x83,
> so one explanation is that some data was dropped (and one way to
> explain that would be that the implementation was confused about
> characters vs. bytes).

Apologies, I made a mistake. I had another look and no data was dropped
by Apple's zip implementation. The byte sequence for n-with-tilde is as
you said:

0x6E 0xCC 0x83

However, I was reading [1] and it turns out that MacOS might actually be
using their own decomposition that resembles FCD.

>>6E is the letter "n" in Unicode, so there is obviously some
>>decomposition going on there. But 0xCC in Unicode maps to Ì (LATIN
>>CAPITAL LETTER I WITH GRAVE)? So I'm not sure what encoding the
>>zip file is using.
>
> A single 0xCC byte doesn't map to anything in any Unicode encoding form.
>
>>The reason I ask is because I'm not sure what to put into the widget
>>spec in regards to recommending the use of canonical decomposition for
>>unicode file names. Or even if that is a good idea!?
>>
>>Should I put the following into the spec?: "It is recommended that
>>the file name field be encoded using [UTF-8] in fully decomposed canonical
>>form."
>
> No. Although the Mac file system(s?) use (a variant of) NFD
> for file names, other operating systems (Windows, Linux,...) don't.
> If you want to specify a normalization form, NFC is closer to what
> the majority does.
>
>>OR just:
>>"It is recommended that the file name field be encoded using [UTF-8]."
>
> Realistically, that's about what you can ask for. And that should
> be enough if the main concern is to match file names from the same
> source. If you need to assure that file names from different
> sources can be matched, then prescribing NFC is the best thing
> to do, but you may have difficulty getting your developers
> to follow your spec.
>

Unfortunately, the concern is matching file names from different
sources :( If this is a lost cause, then I will stick with "It is
recommended that the file name field be encoded using [UTF-8]."

>>This seems important for when I go from MacOS to any other platform as
>>file names get all mangled when files are extracted on any other
>>platform. We obviously don't want that to happen so widget engines
>>need to be prepared to deal with these encoding issues.
>>
>>I looked at the Zip spec [2], but I don't see any real guidance with
>>regards to this. However, for those who know more about encoding, it
>>would be helpful if you could also take a look at the Zip spec.
>
> It looks to me that you should say that bit 11 should be set and
> UTF-8 should be used for file name and comment, unless there are
> a significant number of zip toolkits that don't allow that.
>

I have this in the spec already, but I've been unable to determine if
any implementation actually sets general purpose bit 11.

> The spec contains the following:
>
> >
> The 0x0008 Extra Field storage may be used with either setting for general
> purpose bit 11. Examples of the intended usage for this field is to store
> whether "modified-UTF-8" (JAVA) is used, or UTF-8-MAC.
> >
>
> modified-UTF-8 means that surrogates are directly converted into
> 3-byte UTF-8(-like) sequences instead of converting surrogate pairs
> into 4-byte UTF-8. UTF-8-MAC is UTF-8, mainly NFD, but NFC for Korean.
>
> The specification of the 0x0008 extra field... is extremely vague,
> not useful at all.

Yeah :(

Thank you for your help!

Kind regards,
Marcos

--
Marcos Caceres
http://datadriven.com.au
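
On the open question of whether a given toolkit sets general purpose bit 11, one way to check is to read the flags back from an archive. The following is only a minimal sketch using Python's standard zipfile module as one example implementation. It says nothing about Apple's zip tool, and the helper names (utf8_flag_by_name, build_sample_zip) and the sample file names are made up for illustration.

```python
import io
import zipfile

# General purpose bit 11 (mask 0x0800): file name and comment are UTF-8.
UTF8_FLAG = 0x0800


def utf8_flag_by_name(zip_bytes):
    """Map each entry name to whether general purpose bit 11 is set."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_FLAG)
                for info in zf.infolist()}


def build_sample_zip():
    """Build a small in-memory archive (hypothetical sample names)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("plain.txt", b"ascii name")
        # "n" followed by U+0303 COMBINING TILDE, i.e. a decomposed name.
        zf.writestr("n\u0303.txt", b"decomposed name")
    return buf.getvalue()


flags = utf8_flag_by_name(build_sample_zip())
for name, is_set in flags.items():
    print(repr(name), "bit 11 set:", is_set)
```

The same utf8_flag_by_name helper can be pointed at the bytes of an archive produced by any other tool to see what that tool does. Note that this Python writer stores the name exactly as given and does not normalize it.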
Re: [widgets] Unicode Zip Paths (fully decomposed canonical form?)
At 09:31 08/12/06, Marcos Caceres wrote:
>Hi, I'm trying to put the final touches on the zip section of the widget
>packaging spec [1] before we go to LC by the 10th and I've run into an i18n
>problem related to character encodings. I'm wondering if anyone would be
>kind enough to give me some guidance as to what is going on, encoding-wise,
>within MacOS with regards to the encoding of file names in Zip
>files?
>
>When I create a zip file with one file entry called "ñ", inside the
>zip file, the file name gets decomposed to the following (hex) byte
>sequence:
>
>ñ -> 0x6E 0xCC

My mailer has problems with UTF-8, but my guess is that you are
using n-with-tilde. In UTF-8 and NFD, that would be 0x6E 0xCC 0x83,
so one explanation is that some data was dropped (and one way to
explain that would be that the implementation was confused about
characters vs. bytes).

>6E is the letter "n" in Unicode, so there is obviously some
>decomposition going on there. But 0xCC in Unicode maps to Ì (LATIN
>CAPITAL LETTER I WITH GRAVE)? So I'm not sure what encoding the
>zip file is using.

A single 0xCC byte doesn't map to anything in any Unicode encoding form.

>The reason I ask is because I'm not sure what to put into the widget
>spec in regards to recommending the use of canonical decomposition for
>unicode file names. Or even if that is a good idea!?
>
>Should I put the following into the spec?: "It is recommended that
>the file name field be encoded using [UTF-8] in fully decomposed canonical
>form."

No. Although the Mac file system(s?) use (a variant of) NFD
for file names, other operating systems (Windows, Linux,...) don't.
If you want to specify a normalization form, NFC is closer to what
the majority does.

>OR just:
>"It is recommended that the file name field be encoded using [UTF-8]."

Realistically, that's about what you can ask for. And that should
be enough if the main concern is to match file names from the same
source. If you need to assure that file names from different
sources can be matched, then prescribing NFC is the best thing
to do, but you may have difficulty getting your developers
to follow your spec.

>This seems important for when I go from MacOS to any other platform as
>file names get all mangled when files are extracted on any other
>platform. We obviously don't want that to happen so widget engines
>need to be prepared to deal with these encoding issues.
>
>I looked at the Zip spec [2], but I don't see any real guidance with
>regards to this. However, for those who know more about encoding, it
>would be helpful if you could also take a look at the Zip spec.

It looks to me that you should say that bit 11 should be set and
UTF-8 should be used for file name and comment, unless there are
a significant number of zip toolkits that don't allow that.

The spec contains the following:

  The 0x0008 Extra Field storage may be used with either setting for general
  purpose bit 11. Examples of the intended usage for this field is to store
  whether "modified-UTF-8" (JAVA) is used, or UTF-8-MAC.

modified-UTF-8 means that surrogates are directly converted into
3-byte UTF-8(-like) sequences instead of converting surrogate pairs
into 4-byte UTF-8. UTF-8-MAC is UTF-8, mainly NFD, but NFC for Korean.

The specification of the 0x0008 extra field... is extremely vague,
not useful at all.

Regards,
Martin.

>Any help would be greatly appreciated,
>Marcos
>
>[1] http://dev.w3.org/2006/waf/widgets/#zip-relative
>[2] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
>--
>Marcos Caceres
>http://datadriven.com.au

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:[EMAIL PROTECTED]
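
The byte sequences discussed in this message can be reproduced with a few lines of Python. This is just an illustrative sketch using the standard unicodedata module, with variable and helper names chosen for the example. It shows the UTF-8 bytes of n-with-tilde in NFC and NFD, that the truncated sequence 0x6E 0xCC is not valid UTF-8, and why a lone 0xCC looks like Ì when read as a code point (or as Latin-1) instead of as part of a UTF-8 sequence.

```python
import unicodedata

composed = "\u00f1"  # LATIN SMALL LETTER N WITH TILDE

nfc = unicodedata.normalize("NFC", composed)
nfd = unicodedata.normalize("NFD", composed)

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

print("NFC:", nfc_bytes.hex(" "))  # c3 b1
print("NFD:", nfd_bytes.hex(" "))  # 6e cc 83


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xCC is a UTF-8 lead byte that needs a continuation byte, so the
# two-byte sequence from the original report is incomplete.
truncated_ok = is_valid_utf8(b"\x6e\xcc")
print("0x6E 0xCC valid UTF-8:", truncated_ok)  # False

# Read as a single-byte code (Latin-1), 0xCC is U+00CC.
lone_cc = b"\xcc".decode("latin-1")
print(lone_cc, unicodedata.name(lone_cc))
```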
Re: [widgets] Unicode Zip Paths (fully decomposed canonical form?)
Whoops, by fully decomposed canonical form I think I meant
"Normalization Form D (NFD)" as defined in:
http://www.unicode.org/reports/tr15/#Decomposition

On Sat, Dec 6, 2008 at 12:31 AM, Marcos Caceres <[EMAIL PROTECTED]> wrote:
> Hi, I'm trying to put the final touches on the zip section of the
> widget packaging spec [1] before we go to LC by the 10th and I've run
> into an i18n problem related to character encodings. I'm wondering if
> anyone would be kind enough to give me some guidance as to what is
> going on, encoding-wise, within MacOS with regards to the encoding of
> file names in Zip files?
>
> When I create a zip file with one file entry called "ñ", inside the
> zip file, the file name gets decomposed to the following (hex) byte
> sequence:
>
> ñ -> 0x6E 0xCC
>
> 6E is the letter "n" in Unicode, so there is obviously some
> decomposition going on there. But 0xCC in Unicode maps to Ì (LATIN
> CAPITAL LETTER I WITH GRAVE)? So I'm not sure what encoding the zip
> file is using.
>
> The reason I ask is because I'm not sure what to put into the widget
> spec in regards to recommending the use of canonical decomposition for
> unicode file names. Or even if that is a good idea!?
>
> Should I put the following into the spec?:
> "It is recommended that the file name field be encoded using [UTF-8]
> in fully decomposed canonical form."
>
> OR just:
> "It is recommended that the file name field be encoded using [UTF-8]."
>
> This seems important for when I go from MacOS to any other platform as
> file names get all mangled when files are extracted on any other
> platform. We obviously don't want that to happen so widget engines
> need to be prepared to deal with these encoding issues.
>
> I looked at the Zip spec [2], but I don't see any real guidance with
> regards to this. However, for those who know more about encoding, it
> would be helpful if you could also take a look at the Zip spec.
>
> Any help would be greatly appreciated,
> Marcos
>
> [1] http://dev.w3.org/2006/waf/widgets/#zip-relative
> [2] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
> --
> Marcos Caceres
> http://datadriven.com.au
>

--
Marcos Caceres
http://datadriven.com.au
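
For the concern about names getting mangled between platforms, one possible approach on the widget engine side is to compare names only after bringing both to the same normalization form. The sketch below assumes NFC as the common form and uses hypothetical helper names (nfc_key, find_entry) and sample entry names. It is not taken from the widget spec, and it ignores the Mac-specific deviations from standard NFD.

```python
import unicodedata


def nfc_key(name):
    """Comparison key: the name normalized to NFC."""
    return unicodedata.normalize("NFC", name)


def find_entry(requested, entry_names):
    """Return the stored entry name matching `requested`, or None.

    Names are compared after NFC normalization, so a decomposed name
    written on one platform matches a precomposed request from another.
    """
    wanted = nfc_key(requested)
    for name in entry_names:
        if nfc_key(name) == wanted:
            return name
    return None


# Entry names as they might appear in an archive created on a Mac
# (decomposed: "n" + U+0303 COMBINING TILDE).
entries = ["n\u0303.html", "index.html"]

match = find_entry("\u00f1.html", entries)  # precomposed request
print(match == "n\u0303.html")
print(find_entry("missing.html", entries))
```

A plain string comparison of "\u00f1.html" against the stored names would find nothing here, which is the mismatch described in the message above.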