Re: [PHP-DEV] [RFC] ICU UConverter implementation for ext/intl

Sara Golemon Tue, 30 Oct 2012 10:27:22 -0700

> 1. transcode() accepts options, but there's no comparable way to set
> options to the object. I think these APIs should be synchronized.
> Imagine code keeping options in array/config object - it'd be really
> annoying to have two separate procedures to feed these to object and to
> transcode().
>
transcode() having an $options parameter is to make up for the
instance version (convert()) being able to set those via instance
functions (setSubstChars()).


I don't picture a given app using both convert() and transcode(), the
latter only exists to placate those who are objectophobic.
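
To make the parallel concrete, here is a minimal sketch of the two call
styles. It assumes the names used in this thread (transcode(), convert(),
setSubstChars(), to_subst) and the (destination, source) argument order of
ext/intl as it later shipped; the RFC draft may differ in detail. The
sample string is an arbitrary choice.

```php
<?php
// Sketch only; needs ext/intl.
$input = "Espa\xC3\xB1ol"; // "Español" encoded as UTF-8

// Static style: the substitution character travels in $options.
$viaStatic = UConverter::transcode($input, 'ascii', 'utf-8', array('to_subst' => '?'));

// Instance style: the same setting goes through a setter.
$conv = new UConverter('ascii', 'utf-8'); // (destination, source)
$conv->setSubstChars('?');
$viaInstance = $conv->convert($input);

var_dump($viaStatic, $viaInstance);
```

Either style should produce the same bytes; which one to use is a matter
of taste rather than capability.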

> Also, description of options would be helpful.
>
They're covered in the RFC: to_subst and from_subst under "Simple Use"

> 2. Shouldn't "Enumeration and lookup" methods be static? They look like
> independent from encodings and don't use the object.
>
They are in the patch, I just forgot to note that in the RFC.  Updated.
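
As a rough illustration of what static lookup looks like from userland,
assuming the method names getAvailable(), getAliases() and getStandards()
as they exist in ext/intl today (the RFC text may list them differently):

```php
<?php
// Sketch only; needs ext/intl. No UConverter instance is created anywhere.
$converters = UConverter::getAvailable();      // names of all known converters
$aliases    = UConverter::getAliases('utf-8'); // alternate names for one converter
$standards  = UConverter::getStandards();      // standards that name encodings

printf("%d converters, %d aliases for utf-8, %d standards\n",
    count($converters), count($aliases), count($standards));
```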

> 3. For "Advanced Use", I think "no error" condition should be the
> default and not requiring explicit action.
>
If you take no action at all, then an error still exists.  This is
consistent with the underlying API.
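
A small sketch of what "taking action" means in practice, assuming the
fromUCallback() signature of ext/intl as it later shipped ($reason,
$source, $codePoint, &$error). The class name AsciiFolder is hypothetical.

```php
<?php
// Sketch only; needs ext/intl. AsciiFolder is a hypothetical name.
class AsciiFolder extends UConverter {
    #[\ReturnTypeWillChange]
    public function fromUCallback($reason, $source, $codePoint, &$error) {
        if ($reason == UConverter::REASON_UNASSIGNED && $codePoint == 0x00F1) {
            // Explicit action: clear the error, then supply the replacement
            // bytes in the target encoding.
            $error = U_ZERO_ERROR;
            return 'n';
        }
        // Falling through without touching $error leaves the failure in place.
        return null;
    }
}

$conv = new AsciiFolder('ascii', 'utf-8'); // (destination, source)
$out = $conv->convert("Espa\xC3\xB1ol");
var_dump($out);
```

The point of the design is that the callback has to opt in to "this is
handled"; silence is treated as "still broken".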

> 4. I think error reporting should match other intl functions. It'd not
> really be good if intl submodules would be all different in error
> reporting.
>
Mentioned in previous feedback, I plan to look at this again.

> 5. What is $source parameter for callbacks?
>
It's context for where in the conversion we are.  $codeUnits/$codePoint
is the specific element causing the problem, $source is the string
from that point forward.

> 6. Why toUCallback returns string but fromUCallback gets codepoint as
> long? Shouldn't those be the same - i.e., if toU returns unicode
> codepoint, it should be long? Or it can return multiple codepoints? In
> which case it becomes confusing as we represent codepoints as both
> string and long in the same API.
>
Actually (I left this out of the RFC), they both can return a large
number of types.

In the case of toUCallback, you can return a UTF-8 string (most
reasonable Unicode representation to be returned as a char* string)
and the callback mechanism will make that into UChars to put into the
target string.  You can return a long and it'll be treated as a single
Unicode codepoint (one UChar for BMP, two for higher planes).  You can
also return an array of either of these types to specify a string in a
readable, but Unicode-friendly format, e.g. array("Espa", 0x00F1 /*
LATIN SMALL LETTER N WITH TILDE */, "ol")  would be equivalent to
"Espa\xC3\xB1ol".
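
For illustration, a minimal sketch of a toUCallback() that returns such a
mixed array. It assumes the shipped ext/intl signature ($reason, $source,
$codeUnits, &$error); the class name TildeRestorer and the choice to map
the stray byte 0xF1 to "n" plus a combining tilde are hypothetical.

```php
<?php
// Sketch only; needs ext/intl. TildeRestorer is a hypothetical name.
class TildeRestorer extends UConverter {
    #[\ReturnTypeWillChange]
    public function toUCallback($reason, $source, $codeUnits, &$error) {
        if ($codeUnits === "\xF1") {
            $error = U_ZERO_ERROR;
            // Mixed array: a UTF-8 string, then a codepoint given as a long
            // (U+0303 COMBINING TILDE).
            return array('n', 0x0303);
        }
        return null;
    }
}

$conv = new TildeRestorer('utf-8', 'ascii'); // (destination, source)
$out = $conv->convert("Espa\xF1ol");         // 0xF1 is not valid ASCII
var_dump(bin2hex($out));
```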

The same is true for fromUCallback() with the exception that the
values being returned are assumed to be in the target encoding.  For
longs this means a single byte unsigned char which is appended to the
target as-is.  Similarly strings are appended as-is.

As for input parameters: for toUCallback, $source and $codeUnits are
still in their original encoding and presented as-is for that
encoding.  For fromUCallback(), the $source/$codePoint are in Unicode
(UChar/UTF-16 internally) and can't be directly offered to PHP without
running into endianness issues.  So the codepoint is provided as a
single UChar32 (avoiding the surrogate problem in the process), and
source is given as a series of UChar32 codepoints in a numerically
indexed array.

I'll add a section about callback input/return types to clarify this.

> 7. Link to ICU API from the RFC would be helpful for reviewers and later
> docs, I think.
>
Added!

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
