php-i18n Digest 11 Mar 2006 06:04:09 -0000 Issue 317
Topics (messages 958 through 971):

Re: Ideas for a portable string api
    958 by: Andrei Zmievski
    959 by: Marcus Boerger
    970 by: Dmitry Stogov

surrogates optimization
    960 by: Tex Texin
    961 by: Tex Texin
    963 by: Andrei Zmievski
    964 by: Tex Texin
    965 by: l0t3k
    966 by: Derick Rethans
    967 by: Andrei Zmievski
    968 by: Tex Texin
    971 by: Andi Gutmans

Results of Meeting
    962 by: l0t3k
    969 by: Tex Texin

Administrivia:
To subscribe to the digest, e-mail:
[EMAIL PROTECTED]
To unsubscribe from the digest, e-mail:
[EMAIL PROTECTED]
To post to the list, e-mail:
[email protected]
----------------------------------------------------------------------
--- Begin Message ---
I actually agree. The more macros we add, the more confusing it
becomes. I had to look back and forth a few times between S_PASS() and
Z_PASS() to figure it out. I guess Marcus was the one who wanted a
simpler API, so what does he think?
-Andrei
On Mar 7, 2006, at 4:06 AM, Dmitry Stogov wrote:
Hi,
The attached patch implements S_ARG(...) and related macros.
However, I would rather not commit it, because the code is no clearer
with these macros and it is harder to debug.
I prefer to keep the API as is.
Thanks. Dmitry.
-----Original Message-----
From: Dmitry Stogov [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 16, 2006 12:12 PM
To: [email protected]
Subject: RE: [PHP-I18N] Ideas for a portable string api
Hi,
After reviewing Marcus's ideas, running some experiments, and speaking
with Andrei, I propose the following solutions:
1) We will not use any kind of Unicode literals in C code (no
L"foo", no "f\0o\0o\0\0"), because L"" is not portable and
"f\0.." looks too ugly.
2) We will change the "zval" structure to make
"zval.value.str.len" and "zval.value.ustr.len" the same
type. This will allow us to optimize the Z_UNISTR() and Z_UNILEN()
macros. They will become
#define Z_UNISTR(z) ((void*)(Z_STRVAL(z)))
#define Z_UNILEN(z) (Z_STRLEN(z))
instead of
#define Z_UNISTR(z) Z_TYPE(z)==IS_UNICODE?(char*)Z_USTRVAL(z):Z_STRVAL(z)
#define Z_UNILEN(z) Z_TYPE(z)==IS_UNICODE?(int)Z_USTRLEN(z):Z_STRLEN(z)
3) I don't want to break source compatibility by
modifying the "zval" layout as Marcus suggested. We will
pass string/unicode values in nearly the same way as we do today,
as three values: zend_uchar type, void* str, int len. But we
will create the following set of macros to do it with less overhead.
#define S_TYPE(x) _type_##x
#define S_UNIVAL(x) _val_##x
#define S_UNILEN(x) _len_##x
#define S_STRVAL(x) ((char*)S_UNIVAL(x))
#define S_USTRVAL(x) ((UChar*)S_UNIVAL(x))
#define S_STRLEN(x) S_UNILEN(x)
#define S_USTRLEN(x) S_UNILEN(x)
#define S_ARG(x) zend_uchar S_TYPE(x), void *S_UNIVAL(x), int S_UNILEN(x)
#define S_PASS(x) S_TYPE(x), S_UNIVAL(x), S_UNILEN(x)
#define Z_STR_PASS(x) Z_TYPE(x), Z_UNIVAL(x), Z_UNILEN(x)
#define Z_STR_PASS_P(x) Z_TYPE_P(x), Z_UNIVAL_P(x), Z_UNILEN_P(x)
#define Z_STR_PASS_PP(x) Z_TYPE_PP(x), Z_UNIVAL_PP(x), Z_UNILEN_PP(x)
Then most zend_u_... functions must be rewritten with these macros.
For example:
ZEND_API int zend_u_lookup_class(S_ARG(name), zend_class_entry ***ce TSRMLS_DC)
{
    return zend_u_lookup_class_ex(S_PASS(name), 1, ce TSRMLS_CC);
}
instead of
ZEND_API int zend_u_lookup_class(zend_uchar type, void *name, int name_length, zend_class_entry ***ce TSRMLS_DC)
{
    return zend_u_lookup_class_ex(type, name, name_length, 1, ce TSRMLS_CC);
}
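To see how the proposed macros expand, here is a minimal, self-contained sketch. The S_* macros are copied from the message above; everything else (the typedef stand-ins, the placeholder IS_STRING/IS_UNICODE values, and the demo_* functions) is hypothetical and exists only so the example compiles outside the engine.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for engine/ICU types; assumptions made for this sketch only. */
typedef unsigned char  zend_uchar;
typedef unsigned short UChar;               /* a UTF-16 code unit */
enum { IS_STRING = 1, IS_UNICODE = 2 };     /* placeholder values, not the engine's */

/* Macros as proposed in the message. */
#define S_TYPE(x)    _type_##x
#define S_UNIVAL(x)  _val_##x
#define S_UNILEN(x)  _len_##x
#define S_STRVAL(x)  ((char*)S_UNIVAL(x))
#define S_USTRVAL(x) ((UChar*)S_UNIVAL(x))
#define S_STRLEN(x)  S_UNILEN(x)
#define S_USTRLEN(x) S_UNILEN(x)
#define S_ARG(x)     zend_uchar S_TYPE(x), void *S_UNIVAL(x), int S_UNILEN(x)
#define S_PASS(x)    S_TYPE(x), S_UNIVAL(x), S_UNILEN(x)

/* Hypothetical worker: size in bytes of the string payload.
 * S_ARG(name) expands to three parameters:
 *   zend_uchar _type_name, void *_val_name, int _len_name */
static size_t demo_byte_size_ex(S_ARG(name), int with_terminator)
{
    size_t unit = (S_TYPE(name) == IS_UNICODE) ? sizeof(UChar) : sizeof(char);

    assert(S_UNIVAL(name) != NULL && S_UNILEN(name) >= 0);
    return unit * ((size_t)S_UNILEN(name) + (with_terminator ? 1u : 0u));
}

/* Hypothetical wrapper: S_PASS(name) forwards the same three values. */
static size_t demo_byte_size(S_ARG(name))
{
    return demo_byte_size_ex(S_PASS(name), 1);
}
```

A call such as demo_byte_size(IS_STRING, "foo", 3) passes the triple explicitly, while inside the callee the values are only reachable under their pasted names (_type_name, _val_name, _len_name). That indirection may be one reason the macros were felt to make debugging harder.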
Any objections, additions?
Thanks. Dmitry.
--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
<s_arg.diff.gz>
--- End Message ---
--- Begin Message ---
Hello Dmitry,
I looked at the patch and must say that you did it better than I had in mind
when I first came up with the idea, but then you had already told us that you
could do it much better. However, I must also admit that it doesn't buy us much.
So I guess we can stay with what we have and move along, as Andrei
prefers. In the end the question is whether we want to make our developers
learn more macros or more parameter orders.
best regards
marcus
Tuesday, March 7, 2006, 1:06:53 PM, you wrote:
> Hi,
> The attached patch implements S_ARG(...) and related macros.
> However, I would rather not commit it, because the code is no clearer with these
> macros and it is harder to debug.
> I prefer to keep the API as is.
> Thanks. Dmitry.
--
Best regards,
marcus
--- End Message ---
--- Begin Message ---
OK. So the common decision is not to apply this.
Maybe somebody will come up with some other ideas?
Thanks. Dmitry.
> -----Original Message-----
> From: Marcus Boerger [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 07, 2006 10:39 PM
> To: Dmitry Stogov
> Cc: [email protected]; Andrei Zmievski; Andi Gutmans
> Subject: Re: [PHP-I18N] Ideas for a portable string api
>
>
> Hello Dmitry,
>
> i looked at the patch and must say that you did it better
> than i thought of when i first came to the idea but well you
> already told us that you can do it much better. However i
> must also admit that it doesn't buy us much. So i guess we
> can also stay with what we have and move along as Andrei
> prefers. In the end the question is whether we want to make
> our developers learn more macros or more parameter orders.
>
> best regards
> marcus
--- End Message ---
--- Begin Message ---
Suggestion for improving the performance of indexing strings:
Associate with each string the index of the first code unit that is a
surrogate.
Since most strings will have no surrogates, for these strings the stored value
will be greater than the length of the string, which tells you that you can index
directly into the string. When there is a surrogate, you can still index directly
at any position before the surrogate's index.
If there is a surrogate, you can then consider the metadata for remembering
which chars used surrogates, to optimize indexing as was proposed.
This is low cost and very efficient. Most strings won't have surrogates.
tex
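A rough sketch of this suggestion in C, under stated assumptions: the demo_ustr struct, its field names, and the demo_* functions are hypothetical and not part of PHP or ICU, and SIZE_MAX is used as the "greater than the length" value for strings without surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UChar;                        /* a UTF-16 code unit */

#define DEMO_IS_SURROGATE(c) (((c) & 0xF800) == 0xD800)
#define DEMO_IS_LEAD(c)      (((c) & 0xFC00) == 0xD800)
#define DEMO_IS_TRAIL(c)     (((c) & 0xFC00) == 0xDC00)

/* Hypothetical string header carrying the proposed extra field. */
typedef struct {
    const UChar *units;
    size_t len;              /* length in code units */
    size_t first_surrogate;  /* index of first surrogate unit; SIZE_MAX if none */
} demo_ustr;

/* Build the header, scanning once for the first surrogate. */
static demo_ustr demo_ustr_make(const UChar *units, size_t len)
{
    demo_ustr s = { units, len, SIZE_MAX };

    assert(units != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (DEMO_IS_SURROGATE(units[i])) {
            s.first_surrogate = i;
            break;
        }
    }
    return s;
}

/* Map a code point index to a code unit offset.
 * Returns s->len when the index is past the end of the string. */
static size_t demo_cp_offset(const demo_ustr *s, size_t cp_index)
{
    /* Fast path: everything before the first surrogate is one unit wide. */
    if (s->first_surrogate >= s->len || cp_index <= s->first_surrogate) {
        return cp_index < s->len ? cp_index : s->len;
    }

    /* Slow path: start at the first surrogate and walk pair by pair. */
    size_t off = s->first_surrogate;
    size_t cp = s->first_surrogate;
    while (off < s->len && cp < cp_index) {
        if (DEMO_IS_LEAD(s->units[off]) && off + 1 < s->len &&
            DEMO_IS_TRAIL(s->units[off + 1])) {
            off += 2;        /* a surrogate pair is one code point */
        } else {
            off += 1;        /* BMP unit or unpaired surrogate */
        }
        cp++;
    }
    return off;
}
```

For a string with no surrogates every lookup takes the fast path, so the only extra cost is the one scan when the header is built.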
--- End Message ---
--- Begin Message ---
Hi, this is my paper for the Unicode conference.
Please check it over for accuracy.
The Nov. changes muddied it a bit, but I commented on the changes during the
presentation.
http://www.i18nguy.com/unicode/Unicode-Enabling%20PHP-Mar%202006.pdf (5MB)
It displays slowly for some reason.
Tex
--- End Message ---
--- Begin Message ---
Tex,
This approach would work only if we always allowed access to the string
contents via a regimented API. Unfortunately, many third-party
extensions (and many bundled ones) simply change the contents of the
string directly via a pointer. I am not sure we could standardize
this.
-Andrei
On Mar 8, 2006, at 1:35 AM, Tex Texin wrote:
Suggestion for improving the performance of indexing strings:
Associate with the string the index of the first code unit that is a
surrogate.
Since most strings will have no surrogates, these strings will have a value
greater than the length of the string, and this tells you that you can index
directly into the string. When there is a surrogate, you can index directly,
prior to the surrogate's index.
If there is a surrogate then you can consider the meta data for remembering
which chars used surrogates, to optimize indexing as was proposed.
This is low cost, very efficient... Most strings won't have surrogates.
tex
--- End Message ---
--- Begin Message ---
I thought the proposal in the Nov. minutes was to create a data structure
indicating which chars used surrogates. This approach is cheaper than that
one.
Also, this model can be used in local loops and algorithms to gain
performance, so it still has benefits even where there isn't a longer term
structure available.
Tex Texin
Internationalization Architect, Yahoo! Inc.
> -----Original Message-----
> From: Andrei Zmievski [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 08, 2006 8:49 AM
> To: Tex Texin
> Cc: [email protected]
> Subject: [PHP-I18N] Re: surrogates optimization
>
>
> Tex,
>
> This approach would work only if we allowed access to the string
> contents always via regimented API. Unfortunately, many third party
> extensions (and many bundled ones) simply change the contents of the
> string directly via a pointer.. I am not sure we could standardize
> this.
>
> -Andrei
--- End Message ---
--- Begin Message ---
""Tex Texin"" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> Also, this model can be used in local loops and algorithms to gain
> performance, so it still has benefits even where there isn't a longer term
> structure available.
I agree here, as it's not an all-or-nothing proposition. Extensions have to
be reviewed for Unicode support in any case, so they can be upgraded later
to use macros that take advantage of this.
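One way to read the "local loops" point is a routine that computes the first-surrogate index for itself and keeps it in a local variable, with no long-lived metadata on the string. The sketch below reverses a UTF-16 buffer by code point; the function name and the DEMO_* macros are hypothetical, and the input is assumed to be mostly well-formed UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UChar;                        /* a UTF-16 code unit */

#define DEMO_IS_SURROGATE(c) (((c) & 0xF800) == 0xD800)
#define DEMO_IS_LEAD(c)      (((c) & 0xFC00) == 0xD800)
#define DEMO_IS_TRAIL(c)     (((c) & 0xFC00) == 0xDC00)

/* Reverse a UTF-16 buffer in place, keeping surrogate pairs intact. */
static void demo_reverse_code_points(UChar *s, size_t len)
{
    size_t first = SIZE_MAX;   /* local first-surrogate index, SIZE_MAX if none */

    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (DEMO_IS_SURROGATE(s[i])) {
            first = i;
            break;
        }
    }
    if (len < 2) {
        return;
    }

    /* Step 1: plain code unit reversal, which is all a BMP-only string needs. */
    for (size_t i = 0, j = len - 1; i < j; i++, j--) {
        UChar tmp = s[i];
        s[i] = s[j];
        s[j] = tmp;
    }
    if (first >= len) {
        return;                /* no surrogates: done on the fast path */
    }

    /* Step 2: after reversal every surrogate sits at an index below
     * len - first, so only that prefix needs its pairs swapped back. */
    size_t limit = len - first;
    for (size_t i = 0; i + 1 < limit; i++) {
        if (DEMO_IS_TRAIL(s[i]) && DEMO_IS_LEAD(s[i + 1])) {
            UChar tmp = s[i];
            s[i] = s[i + 1];
            s[i + 1] = tmp;
            i++;               /* skip the unit just repaired */
        }
    }
}
```

The local index serves twice here: it selects the fast path for BMP-only input, and it bounds the repair pass when surrogates are present.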
--- End Message ---
--- Begin Message ---
On Wed, 8 Mar 2006, Andrei Zmievski wrote:
> Tex,
>
> This approach would work only if we allowed access to the string contents
> always via regimented API. Unfortunately, many third party extensions (and
> many bundled ones) simply change the contents of the string directly via a
> pointer.. I am not sure we could standardize this.
I think we should seriously consider this though, as it makes many of
the string functions quite a bit faster.
Derick
--- End Message ---
--- Begin Message ---
How do we make sure that nothing changes the string contents without
also updating the index of the first surrogate?
-Andrei
On Mar 8, 2006, at 11:48 AM, Derick Rethans wrote:
On Wed, 8 Mar 2006, Andrei Zmievski wrote:
Tex,
This approach would work only if we allowed access to the string contents
always via regimented API. Unfortunately, many third party extensions (and
many bundled ones) simply change the contents of the string directly via a
pointer.. I am not sure we could standardize this.
I think we should seriously consider this though, as it makes many of
the string functions quite a bit faster.
Derick
--- End Message ---
--- Begin Message ---
Set it and respect it in sections of code that are trusted and under your
control, and take whatever performance gain you can get.
Elsewhere, do it the old-fashioned way.
Tex Texin
Internationalization Architect, Yahoo! Inc.
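A minimal sketch of how "set it and respect it in trusted code" could look, assuming a validity flag next to the cached index. All names here (demo_ustr, index_valid, the demo_* functions) are hypothetical; nothing like this exists in the engine as far as this thread shows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UChar;                        /* a UTF-16 code unit */

#define DEMO_IS_SURROGATE(c) (((c) & 0xF800) == 0xD800)

/* Hypothetical string header with a cached index and a validity flag. */
typedef struct {
    UChar *units;
    size_t len;
    size_t first_surrogate;  /* meaningful only while index_valid is nonzero */
    int index_valid;
} demo_ustr;

/* Linear scan for the first surrogate at or after `from`; SIZE_MAX if none. */
static size_t demo_scan(const UChar *u, size_t from, size_t len)
{
    for (size_t i = from; i < len; i++) {
        if (DEMO_IS_SURROGATE(u[i])) {
            return i;
        }
    }
    return SIZE_MAX;
}

static void demo_init(demo_ustr *s, UChar *units, size_t len)
{
    s->units = units;
    s->len = len;
    s->first_surrogate = demo_scan(units, 0, len);
    s->index_valid = 1;
}

/* Trusted write path: keeps the cached index in step with the contents. */
static void demo_set_unit(demo_ustr *s, size_t i, UChar c)
{
    assert(i < s->len);
    s->units[i] = c;
    if (!s->index_valid) {
        return;
    }
    if (DEMO_IS_SURROGATE(c)) {
        if (i < s->first_surrogate) {
            s->first_surrogate = i;
        }
    } else if (i == s->first_surrogate) {
        /* The first surrogate was overwritten; look for the next one. */
        s->first_surrogate = demo_scan(s->units, i + 1, s->len);
    }
}

/* Untrusted path: handing out a raw pointer means the cache can no
 * longer be relied on, so mark it invalid. */
static UChar *demo_raw(demo_ustr *s)
{
    s->index_valid = 0;
    return s->units;
}

/* Readers use the cache when it is valid and rescan otherwise. */
static size_t demo_first_surrogate(const demo_ustr *s)
{
    return s->index_valid ? s->first_surrogate
                          : demo_scan(s->units, 0, s->len);
}
```

This only helps for code that goes through demo_raw(); an extension that writes through a pointer obtained some other way would still leave the cached index stale, which is the concern raised earlier in the thread.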
> -----Original Message-----
> From: Andrei Zmievski [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 08, 2006 12:44 PM
> To: Derick Rethans
> Cc: [email protected]; Tex Texin
> Subject: Re: [PHP-I18N] Re: surrogates optimization
>
>
> How do we make sure that nothing changes the string contents without
> also updating the index of the first surrogate?
>
> -Andrei
--- End Message ---
--- Begin Message ---
If it's just for local loops then they can hold this separately. I
think it's a nice idea but unfortunately, as Andrei pointed out, it will
probably have too many issues to be useful.
At 11:06 AM 3/8/2006, Tex Texin wrote:
I thought the proposal in the nov minutes was to create a data structure
indicating which chars used surrogates. This approach is cheaper than that
approach.
Also, this model can be used in local loops and algorithms to gain
performance, so it still has benefits even where there isn't a longer term
structure available.
Tex Texin
Internationalization Architect, Yahoo! Inc.
--- End Message ---
--- Begin Message ---
Hey gang,
Did the teleconference ever materialize? If so, are there notes someone
would like to share?
clayton
--- End Message ---
--- Begin Message ---
Sorry, I was preparing for and then attending some meetings/conferences and
didn't have the bandwidth to reschedule.
Shall we try for some time next week?
Tex Texin
Internationalization Architect, Yahoo! Inc.
> -----Original Message-----
> From: l0t3k [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 08, 2006 7:27 AM
> To: [email protected]
> Subject: [PHP-I18N] Results of Meeting
>
>
> Hey gang,
> did the teleconference ever materialize ? If so, are there
> notes someone
> would like to share ?
>
> clayton
--- End Message ---