
Re: [widgets] Unicode Zip Paths (fully decomposed canonical form?)

2008-12-07 Thread Marcos Caceres

Hi Martin,

On Sun, Dec 7, 2008 at 7:56 AM, Martin Duerst <[EMAIL PROTECTED]> wrote:
> At 09:31 08/12/06, Marcos Caceres wrote:
>>Hi, I'm trying to put the final touches on the zip section of the widget
>>packaging spec [1] before we go to LC by the 10th and I've run into an i18n
>>problem related to character encodings. I'm wondering if anyone would be
>>kind enough to give me some guidance as to what is going on, encoding wise,
>>within MacOS with regards to the encoding of file names in Zip
>>Files?
>>
>>When I create a zip file with one file entry called "ñ", inside the
>>zip file, the file name gets decomposed to the following (hex) byte sequence:
>>
>>ñ -> 0x6E 0xCC
>
> My mailer has problems with UTF-8, but my guess is that you are
> using n-with-tilde. In UTF-8 and NFD, that would be 0x6E 0xCC 0x83,
> so one explanation is that some data was dropped (and one way to
> explain that would be that the implementation was confused about
> characters vs. bytes).

Apologies, I made a mistake. I had another look and no data was
dropped by Apple's zip implementation. The byte sequence for
n-with-tilde is as you said:

0x6E 0xCC 0x83
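
For illustration, a minimal Python sketch that reproduces this byte sequence. It assumes standard NFD as implemented by Python's unicodedata module, which is not necessarily identical to the decomposition MacOS applies; hex_bytes is a hypothetical helper used only for display.

```python
import unicodedata

# n-with-tilde as a single precomposed code point (U+00F1)
name = "\u00f1"

# Canonical decomposition (NFD) and recomposition (NFC)
nfd = unicodedata.normalize("NFD", name)
nfc = unicodedata.normalize("NFC", nfd)


def hex_bytes(text):
    """Return the UTF-8 bytes of text as space-separated hex (illustrative helper)."""
    return " ".join(f"0x{b:02X}" for b in text.encode("utf-8"))


print(hex_bytes(nfd))  # 0x6E 0xCC 0x83  ("n" followed by U+0303 COMBINING TILDE)
print(hex_bytes(nfc))  # 0xC3 0xB1       (precomposed U+00F1)
```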

However, I was reading [1] and it turns out that MacOS might actually
be using their own decomposition that resembles FCD.

>>6E is the letter "n" in Unicode, so there is obviously some
>>decomposition going on there. But 0xCC in Unicode maps to
>>Ì (LATIN CAPITAL LETTER I WITH GRAVE)? So I'm not sure what encoding the
>>zip file is using.
>
> A single 0xCC byte doesn't map to anything in any Unicode encoding form.
>
>>The reason I ask is because I'm not sure what to put into the widget
>>spec in regards to recommending the use of canonical decomposition for
>>unicode file names. Or even if that is a good idea!?
>>
>>Should I put the following into the spec?: "It is recommended that
>>the file name field be encoded using [UTF-8] in fully decomposed canonical
>>form."
>
> No. Although the Mac file system(s?) use (a variant of) NFD,
> for file names, other operating systems (Windows, Linux,...) don't.
> If you want to specify a normalization form, NFC is closer to what
> the majority does.
>

>>OR just:
>>"It is recommended that the file name field be encoded using [UTF-8]."
>
> Realistically, that's about what you can ask for. And that should
> be enough if the main concern is to match file names from the same
> source. If you need to assure that file names from different
> sources can be matched, then prescribing NFC is the best thing
> to do, but you may have difficulty getting your developers
> to follow your spec.
>

Unfortunately, the concern is matching file names from different
sources :( If this is a lost cause, then I will stick with "It is
recommended that the file name field be encoded using [UTF-8]."
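
A rough sketch of what matching names from different sources could look like if a widget engine normalized both sides to NFC before comparing. This is only an illustration under that assumption, not something the spec requires; same_name is a hypothetical helper, and the Mac variant of NFD may differ from standard NFD for some characters.

```python
import unicodedata


def same_name(a: str, b: str) -> bool:
    """Compare two file names after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Name as a decomposing file system might store it: "n" + U+0303 COMBINING TILDE
decomposed_name = "n\u0303.txt"
# Name as typically produced elsewhere: precomposed U+00F1
precomposed_name = "\u00f1.txt"

print(decomposed_name == precomposed_name)           # False: different code points
print(same_name(decomposed_name, precomposed_name))  # True: equal after NFC
```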

>>This seems important for when I go from MacOS to any other platform as
>>file names get all mangled when files are extracted on any other
>>platform. We obviously don't want that to happen so widget engines
>>need to be prepared to deal with these encoding issues.
>>
>>I looked at the Zip spec [2], but I don't see any real guidance with regards
>>to this. However, for those who know more about encoding, it
>>would be helpful if you could also take a look at the Zip spec.
>
> It looks to me that you should say that bit 11 should be set and
> UTF-8 should be used for file name and comment, unless there are
> a significant number of zip toolkits that don't allow that.
>

I have this in the spec already, but I've been unable to determine if
any implementation actually sets general purpose bit 11.
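
One way to check a given toolkit is to inspect the general purpose flags of the entries it writes. The sketch below uses Python's zipfile module purely as an example; the UTF8_FLAG name is mine, and the behaviour shown is what I would expect from a current CPython, not a statement about other zip implementations.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11 (1 << 11)

# Write one ASCII-named and one non-ASCII-named entry into an in-memory archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"")
    zf.writestr("\u00f1.txt", b"")

# Read the archive back and record whether bit 11 is set for each entry.
with zipfile.ZipFile(buf) as zf:
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG) for info in zf.infolist()}

for filename, is_utf8 in flags.items():
    print(repr(filename), "bit 11 set:", is_utf8)
```

The same loop over infolist() can be pointed at an archive produced by any other tool to see whether that tool sets the bit.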

> The spec contains the following:
>
>
> The 0x0008 Extra Field storage may be used with either setting for general
> purpose bit 11.  Examples of the intended usage for this field is to store
> whether "modified-UTF-8" (JAVA) is used, or UTF-8-MAC.
>
>
> modified-UTF-8 means that surrogates are directly converted into
> 3-byte UTF-8(-like) sequences instead of converting surrogate pairs
> into 4-byte UTF-8. UTF-8-MAC is UTF-8, mainly NFD, but NFC for Korean.
>
> The specification of the 0x0008 extra field... is extremely vague,
> not useful at all.

Yeah :(

Thank you for your help!

Kind regards,
Marcos


-- 
Marcos Caceres
http://datadriven.com.au

Re: [widgets] Unicode Zip Paths (fully decomposed canonical form?)

2008-12-06 Thread Martin Duerst

At 09:31 08/12/06, Marcos Caceres wrote:
>Hi, I'm trying to put the final touches on the zip section of the widget 
>packaging spec [1] before we go to LC by the 10th and I've run into an i18n 
>problem related to character encodings. I'm wondering if anyone would be
>kind enough to give me some guidance as to what is going on, encoding wise,
>within MacOS with regards to the encoding of file names in Zip
>Files?
>
>When I create a zip file with one file entry called "ñ", inside the
>zip file, the file name gets decomposed to the following (hex) byte sequence:
>
>ñ -> 0x6E 0xCC

My mailer has problems with UTF-8, but my guess is that you are
using n-with-tilde. In UTF-8 and NFD, that would be 0x6E 0xCC 0x83,
so one explanation is that some data was dropped (and one way to
explain that would be that the implementation was confused about
characters vs. bytes).

>6E is the letter "n" in Unicode, so there is obviously some
>decomposition going on there. But 0xCC in Unicode maps to 
>Ì (LATIN CAPITAL LETTER I WITH GRAVE)? So I'm not sure what encoding the
>zip file is using.

A single 0xCC byte doesn't map to anything in any Unicode encoding form.
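
A small Python sketch of this point: 0xCC is a UTF-8 lead byte and cannot be decoded on its own, whereas reading it as Latin-1 (or looking up code point U+00CC) gives the capital I with grave mentioned above. The variable names are illustrative only.

```python
lone = bytes([0xCC])

# A lone 0xCC is an incomplete UTF-8 sequence, so decoding fails.
try:
    lone.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Interpreted as Latin-1, the same byte is U+00CC (capital I with grave).
as_latin1 = lone.decode("latin-1")

# With the continuation byte 0x83 present, 0xCC 0x83 decodes to U+0303.
full = bytes([0x6E, 0xCC, 0x83]).decode("utf-8")

print(valid_utf8)                        # False
print(hex(ord(as_latin1)))               # 0xcc
print([hex(ord(ch)) for ch in full])     # ['0x6e', '0x303']
```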

>The reason I ask is because I'm not sure what to put into the widget
>spec in regards to recommending the use of canonical decomposition for 
>unicode file names. Or even if that is a good idea!?
>
>Should I put the following into the spec?: "It is recommended that 
>the file name field be encoded using [UTF-8] in fully decomposed canonical 
>form."

No. Although the Mac file system(s?) use (a variant of) NFD,
for file names, other operating systems (Windows, Linux,...) don't.
If you want to specify a normalization form, NFC is closer to what
the majority does.

>OR just:
>"It is recommended that the file name field be encoded using [UTF-8]."

Realistically, that's about what you can ask for. And that should
be enough if the main concern is to match file names from the same
source. If you need to assure that file names from different
sources can be matched, then prescribing NFC is the best thing
to do, but you may have difficulty getting your developers
to follow your spec.

>This seems important for when I go from MacOS to any other platform as
>file names get all mangled when files are extracted on any other
>platform. We obviously don't want that to happen so widget engines
>need to be prepared to deal with these encoding issues.
>
>I looked at the Zip spec [2], but I don't see any real guidance with regards
>to this. However, for those who know more about encoding, it
>would be helpful if you could also take a look at the Zip spec.

It looks to me that you should say that bit 11 should be set and
UTF-8 should be used for file name and comment, unless there are
a significant number of zip toolkits that don't allow that.

The spec contains the following:

The 0x0008 Extra Field storage may be used with either setting for general 
purpose bit 11.  Examples of the intended usage for this field is to store 
whether "modified-UTF-8" (JAVA) is used, or UTF-8-MAC.

modified-UTF-8 means that surrogates are directly converted into
3-byte UTF-8(-like) sequences instead of converting surrogate pairs
into 4-byte UTF-8. UTF-8-MAC is UTF-8, mainly NFD, but NFC for Korean.
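
To make the surrogate difference concrete, here is a rough Python sketch. surrogates_as_utf8 is a hypothetical helper that encodes each UTF-16 surrogate separately as a 3-byte sequence; it only illustrates the surrogate handling described above and does not cover other details of Java's modified UTF-8 (such as its special encoding of U+0000).

```python
import struct


def surrogates_as_utf8(text: str) -> bytes:
    """Encode each UTF-16 code unit of text as its own UTF-8-like sequence."""
    utf16 = text.encode("utf-16-be")
    # Split the UTF-16 data into 16-bit code units (surrogates stay separate).
    units = struct.unpack(f">{len(utf16) // 2}H", utf16)
    # "surrogatepass" lets the codec emit 3-byte sequences for lone surrogates.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


ch = "\U0001F600"  # a character outside the BMP (surrogate pair D83D DE00)

print(ch.encode("utf-8").hex(" "))        # f0 9f 98 80        (4-byte UTF-8)
print(surrogates_as_utf8(ch).hex(" "))    # ed a0 bd ed b8 80  (two 3-byte sequences)
```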

The specification of the 0x0008 extra field... is extremely vague,
not useful at all.

Regards,    Martin.

>Any help would be greatly appreciated,
>Marcos
>
>[1] http://dev.w3.org/2006/waf/widgets/#zip-relative
>[2] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
>--
>Marcos Caceres http://datadriven.com.au

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp   mailto:[EMAIL PROTECTED]

Re: [widgets] Unicode Zip Paths (fully decomposed canonical form?)

2008-12-05 Thread Marcos Caceres

Whoops, by fully decomposed canonical form I think I meant
"Normalization Form D (NFD)" as defined in:
http://www.unicode.org/reports/tr15/#Decomposition

On Sat, Dec 6, 2008 at 12:31 AM, Marcos Caceres
<[EMAIL PROTECTED]> wrote:
> Hi, I'm trying to put the final touches on the zip section of the
> widget packaging spec [1] before we go to LC by the 10th and I've run
> into an i18n problem related to character encodings. I'm wondering if
> anyone would be kind enough to give me some guidance as to what is
> going on, encoding wise, within MacOS with regards to the encoding of
> file names in Zip Files?
>
> When I create a zip file with one file entry called "ñ", inside the
> zip file, the file name gets decomposed to the following (hex) byte
> sequence:
>
> ñ -> 0x6E 0xCC
>
> 6E is the letter "n" in Unicode, so there is obviously some
> decomposition going on there. But 0xCC in Unicode maps to Ì (LATIN
> CAPITAL LETTER I WITH GRAVE)? So I'm not sure what encoding the zip
> file is using.
>
> The reason I ask is because I'm not sure what to put into the widget
> spec in regards to recommending the use of canonical decomposition for
> unicode file names. Or even if that is a good idea!?
>
> Should I put the following into the spec?:
> "It is recommended that the file name field be encoded using [UTF-8]
> in fully decomposed canonical form."
>
> OR just:
> "It is recommended that the file name field be encoded using [UTF-8]."
>
> This seems important for when I go from MacOS to any other platform as
> file names get all mangled when files are extracted on any other
> platform. We obviously don't want that to happen so widget engines
> need to be prepared to deal with these encoding issues.
>
> I looked at the Zip spec [2], but I don't see any real guidance with
> regards to this. However, for those who know more about encoding, it
> would be helpful if you could also take a look at the Zip spec.
>
> Any help would be greatly appreciated,
> Marcos
>
> [1] http://dev.w3.org/2006/waf/widgets/#zip-relative
> [2] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
> --
> Marcos Caceres
> http://datadriven.com.au
>



-- 
Marcos Caceres
http://datadriven.com.au
