php-i18n Digest 18 May 2008 12:26:08 -0000 Issue 393

Topics (messages 1189 through 1218):

Re: ubuntu 7.10 pecl install intl
        1189 by: Darren Cook
        1196 by: Stanislav Malyshev
        1199 by: Darren Cook
        1200 by: Stanislav Malyshev
        1202 by: Darren Cook

Re: proposal: unification of the grapheme_extract functions
        1190 by: Ed Batutis
        1191 by: Stanislav Malyshev
        1192 by: Ed Batutis
        1193 by: Stanislav Malyshev
        1194 by: Ed Batutis
        1195 by: Stanislav Malyshev
        1197 by: Ed Batutis
        1203 by: Stanislav Malyshev
        1209 by: Texin, Tex
        1210 by: Ed Batutis
        1211 by: Texin, Tex
        1212 by: Ed Batutis
        1213 by: Stanislav Malyshev
        1214 by: Texin, Tex
        1215 by: Stanislav Malyshev
        1216 by: Texin, Tex
        1217 by: Texin, Tex

Re: intl extension
        1198 by: Stanislav Malyshev
        1201 by: Gergely Hodicska
        1204 by: Ed Batutis
        1205 by: Stanislav Malyshev
        1207 by: Tomas Kuliavas
        1208 by: Texin, Tex

Re: Problems with mime encoding of Japanese Characters in Subject
        1206 by: Tomas Kuliavas

Re: Is htmlentities multi-byte safe?
        1218 by: Isaak Malik

Administrivia:

To subscribe to the digest, e-mail:
        [EMAIL PROTECTED]

To unsubscribe from the digest, e-mail:
        [EMAIL PROTECTED]

To post to the list, e-mail:
        [EMAIL PROTECTED]


----------------------------------------------------------------------
--- Begin Message ---
> I think ICU stuff might be added to 5.2 branch after 5.2.3. You may want
> to d/l 5.2.6 from php.net and compare the acinclude.m4, or just see one
> on cvs.php.net in PHP_5_2 branch.

Thanks. I got it from here:
 http://cvs.php.net/viewvc.cgi/php-src/acinclude.m4?revision=1.387

Perhaps the intl pecl package instructions can be updated to say 5.2.3
is not supported?

With the above acinclude.m4 the pecl install intl now gets a lot further
along. But it still fails, here:

/tmp/pear/temp/intl/collator/collator_sort.c:30: error: conflicting
types for 'ptrdiff_t'
/usr/lib/gcc/i486-linux-gnu/4.1.3/include/stddef.h:152: error: previous
declaration of 'ptrdiff_t' was here


Seems like a strange one. I'm using a 32 bit machine. Could there be
something in collator that is hard-coded for 64-bit?

Darren

--- End Message ---
--- Begin Message ---
Hi!

> /tmp/pear/temp/intl/collator/collator_sort.c:30: error: conflicting
> types for 'ptrdiff_t'
> /usr/lib/gcc/i486-linux-gnu/4.1.3/include/stddef.h:152: error: previous
> declaration of 'ptrdiff_t' was here

This should be fixed already... Which version do you use - CVS or some other?
--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]

--- End Message ---
--- Begin Message ---
>> /tmp/pear/temp/intl/collator/collator_sort.c:30: error: conflicting
>> types for 'ptrdiff_t'
>> /usr/lib/gcc/i486-linux-gnu/4.1.3/include/stddef.h:152: error: previous
>> declaration of 'ptrdiff_t' was here
> 
> This should be fixed already... Which version do you use - CVS or some
> other?

I'm typing "pecl install intl". It downloads this file:
  intl-1.0.0beta.tgz (96,707 bytes)

BTW, this page needs to be updated:
 http://pecl.php.net/package/intl
to say PHP version 5.2.6 (?) or newer, not 5.2.0 or newer. Or (even
better) patch the intl package so it defines PHP_SETUP_ICU itself (and
any other macros it needs from PHP5's acinclude.m4).

Darren

P.S. Yes, by this point it would have been quicker to compile latest PHP
from source then compile intl from cvs. But if intl won't install from
pecl with a popular distro then it isn't ready for me to use it in
client projects. So I'll carry on being your pecl beta tester :-).



-- 
Darren Cook
http://dcook.org/mlsn/ (English-Japanese-German-Chinese free dictionary)
http://dcook.org/work/ (About me and my work)
http://dcook.org/work/charts/  (My flash charting demos)

--- End Message ---
--- Begin Message ---
Hi!

> I'm typing "pecl install intl". It downloads this file:
>   intl-1.0.0beta.tgz (96,707 bytes)

I see. Unfortunately, this build is a bit old... We should have a better release pretty soon (I think once we figure out the last details of the grapheme APIs, it'll be ready), but right now the best bet would be to check out pecl/intl from CVS, using the PHP_5_2 branch.

> BTW, this page needs to be updated:
>  http://pecl.php.net/package/intl
> to say PHP version 5.2.6 (?) or newer, not 5.2.0 or newer. Or (even
> better) patch the intl package so it defines PHP_SETUP_ICU itself (and
> any other macros it needs from PHP5's acinclude.m4).

It's 5.2.4, but yes, package.xml will be updated.

> P.S. Yes, by this point it would have been quicker to compile latest PHP
> from source then compile intl from cvs. But if intl won't install from
> pecl with a popular distro then it isn't ready for me to use it in
> client projects. So I'll carry on being your pecl beta tester :-).

It will be ready for 1.0 release. Thanks for testing it :)
--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]

--- End Message ---
--- Begin Message ---
>> I'm typing "pecl install intl". It downloads this file:
>>   intl-1.0.0beta.tgz (96,707 bytes)
> 
> I see. Unfortunately, this build is a bit old... We should have better
> release pretty soon ... but right now best bet would be to check
> out pecl/intl from CVS, using PHP_5_2 branch.

Release early, release often, release today (as it seems current version
won't compile, there seems nothing to lose). I know, easy for me to say
as I'm not volunteering to do the release :-).

I'll try the CVS version in a day or two.

Darren


--- End Message ---
--- Begin Message ---
> It is an option that should be offered, because it is good for
> performance, but it is more tedious programming and harder to migrate
> programs to use this functionality.

I understand what you are saying. However, everyone should move towards
using break iterator, I believe, for performance reasons!
 
I could add more options to the extract call to allow the user to specify
what $start means:

GRAPHEME_EXTR_START_BYTE_COUNT
GRAPHEME_EXTR_START_CHAR_COUNT
GRAPHEME_EXTR_START_GRAPHEME_COUNT
 
The next question would be - what is the default? Should it be 'byte count'
in all cases, or should it match the $extract_type - graphemes if 'count',
bytes if 'max bytes' etc. I suspect the latter, if I understand your
use-case.

Thoughts?
 
=Ed



--- End Message ---
--- Begin Message ---
Hi!

> I could add more options to the extract call to allow the user to specify
> what $start means:
>
> GRAPHEME_EXTR_START_BYTE_COUNT
> GRAPHEME_EXTR_START_CHAR_COUNT
> GRAPHEME_EXTR_START_GRAPHEME_COUNT

I'm afraid that might be an overkill - if you need to start at N graphemes, why not do grapheme_substr then?
--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]

--- End Message ---
--- Begin Message ---
> I'm afraid that might be an overkill - if you need to start at N
> graphemes, why not do grapheme_substr then?

I agree - grapheme_extract starting at N graphemes and returning a count of
graphemes is exactly the same functionality as grapheme_substr, but the
other combinations are not covered elsewhere, I think. I cannot think of a
solid basis for excluding any of the others. And I don't know how to change
the API to make that particular combination 'special' so it can be excluded.

=Ed



--- End Message ---
--- Begin Message ---
Hi!

> I agree - grapheme_extract starting at N graphemes and returning a count of
> graphemes is exactly the same functionality as grapheme_substr, but the
> other combinations are not covered elsewhere, I think. I cannot think of a
> solid basis for excluding any of the others. And I don't know how to change
> the API to make that particular combination 'special' so it can be excluded.

Doesn't mb_substr implement the character stuff?
--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]

--- End Message ---
--- Begin Message ---
> 
> Doesn't mb_substr implement the character stuff?

Yes, that covers a character offset and a character count to return. mb
calls don't know anything about graphemes, of course. (At one point I
considered adding a 'grapheme mode' to the mb API though.) grapheme_extract
is different, I think, because it is bridging graphemes and
bytes/characters. I don't know of an mb function like that - there's no
mb_extract.

=Ed


--- End Message ---
--- Begin Message ---
Hi!

> considered adding a 'grapheme mode' to the mb API though.) grapheme_extract
> is different, I think, because it is bridging graphemes and
> bytes/characters. I don't know of an mb function like that - there's no
> mb_extract.

Maybe I just misunderstand the use case for the extract function - what it's supposed to do that substr, mb_substr and grapheme_substr can't or do worse?
--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]

--- End Message ---
--- Begin Message ---
> Maybe I just misunderstand the use case for the extract function - what
> it's supposed to do that substr, mb_substr and grapheme_substr can't or
> do worse?

Tex could probably answer this better than I could, but I'll have a go.

Use case 1: You have a buffer that is a fixed number of bytes long. You need
to fill it up as far as you can with whole graphemes. You are probably
sending that buffer to another API that might not be grapheme - or even
Unicode - aware. You are in a loop so you are tracking your position in the
original string. This is how the discussion got started about how the
'start' parameter is defined - it isn't clear how the position would be
tracked. I assumed a byte count because the user can simply do a strlen on
the return string to update his position, but Tex thinks this isn't as handy
as it should be. It depends on the details of the algorithm I guess.

Use case 2: Same as above except in this case it is an Oracle database
buffer where your columns are defined as being N Unicode characters (not
bytes or graphemes) long.

Use case 3 (a generalization of use case 1 really): You have some code that
knows about bytes or Unicode characters but nothing about graphemes. You
want to update the code so it is grapheme aware. You can't completely
abandon a byte count or character count in the code for some reason, but you
want to easily update the code to process whole graphemes.


=Ed



--- End Message ---
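Use case 1 above can be sketched as a loop that tracks a byte position with strlen(). This is only an illustration: it assumes the argument order of grapheme_extract() as documented for the released intl extension (string, size, type, start), which was still being settled in this thread, and chunk_by_bytes() is a hypothetical helper name.

```php
<?php
// Sketch only: split a UTF-8 string into chunks of at most $maxBytes bytes
// without ever cutting a grapheme cluster in half. Needs the intl extension.
function chunk_by_bytes($text, $maxBytes)
{
    $chunks = [];
    $pos = 0;               // current position, counted in bytes
    $len = strlen($text);   // total length in bytes

    while ($pos < $len) {
        $chunk = grapheme_extract($text, $maxBytes, GRAPHEME_EXTR_MAXBYTES, $pos);
        if ($chunk === false || $chunk === '') {
            // No progress: the next grapheme alone does not fit the buffer.
            break;
        }
        $chunks[] = $chunk;
        $pos += strlen($chunk);   // advance by the bytes actually returned
    }
    return $chunks;
}

// "e" followed by U+0301 (combining acute accent) is one grapheme, 3 bytes.
$text = "ae\xCC\x81bc";
$chunks = chunk_by_bytes($text, 3);
```

Because every chunk ends on a grapheme boundary, adding strlen($chunk) to the position always lands on a valid start for the next call.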
--- Begin Message ---
Hi!

> Tex could probably answer this better than I could, but I'll have a go.

OK, thanks! The picture is much clearer to me now. If you do it in a loop, then I think bytes should be enough: you'd have to call strlen/grapheme_strlen in any case to know how much you received, so calling strlen is no worse than anything else, and you could always work with bytes there. And since grapheme_extract would always stop on a grapheme boundary, I think you don't need to worry about bytes not being good enough for graphemes inside the string.

As an alternative, we could update $start with the new "position" - i.e. old $start + how many bytes we returned - but I'm not sure it's the best way.

Tex, what do you think?
--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]

--- End Message ---
--- Begin Message ---
Hi,

I disagree with case 2 as it is described. You don't want to truncate in the 
middle of a grapheme, if you in fact have graphemes.

Basically and ideally, there should be only 3 use cases:

A) You are working with graphemes, and ideally you would program with grapheme 
indexes and counts (start and length in grapheme units).

B) Because of fixed width buffers, you need to specify a max length in bytes, 
but the function should only extract whole graphemes (start in graphemes, 
length in bytes).

C) You are not working with text but bytes (which really shouldn't be in this 
discussion, but for completeness...) and so start and length in bytes.

But we don't live in an ideal world.
Grapheme based processing is more expensive than character processing, and this 
is more expensive than byte processing.

If this weren't so, we would only have to deal with A, B, and C and programming 
would be simpler.

It is a bad assumption for i18n, but if you are not dealing with Indic or 
Middle Eastern languages, then you know you have characters not graphemes, and 
so why pay the cost?
Also, graphemes are newly supported, so existing code is character based.

Therefore we need to offer the character support.  Analogous to A and B, we 
need CA (start and length in character units) and CB (start in character units 
and length in bytes)

But again performance rears its ugly head. Having start be in character or 
grapheme units, means the function always scans thru start number of units to 
find the beginning offset. Hence the desire to offer start position in bytes, 
giving us a version of A and CA that starts with bytes, and a version of B and 
CB that specifies start and length in bytes but returns a whole number of 
graphemes or characters, as appropriate.

The final ugliness is we have some of these functions in the plain (or non-mb) 
flavor and the mb_string flavor.
So, we could say for graphemes use grapheme_substr, and for characters use mb 
functions, and for bytes use the plain functions (or the other way around; I 
think mb overloads the plain with the character based and provides the byte 
versions in the mb form... I always have to look to check).

But, some of the mb functions are not implemented well so I don't trust them, 
which you can chalk up to my personal idiosyncrasy. The more salient point is 
it is confusing for people to have to sort thru all the function flavors with 
different names. I would prefer to have the choices in one function with 
options and an explanation of when to use what, perhaps derived from the above 
logic. And I would deprecate the related functions in mb and plain.

That said, if this is all that's holding up the release, I would release with 
the byte start and add the other flavors in the next version.
People can always use grapheme_strlen/mb_strlen (or whatever it is) to get the 
starting byte position and perhaps write their own function to calculate the 
byte start and call the grapheme_substr function.
It is a nuisance but if they understand that they can migrate easily.

Let's wrap this up.


tex


> -----Original Message-----
> From: Ed Batutis [mailto:[EMAIL PROTECTED] 
> Sent: Monday, May 12, 2008 1:01 PM
> To: 'Stanislav Malyshev'
> Cc: Texin, Tex; [EMAIL PROTECTED]
> Subject: RE: [PHP-I18N] proposal: unification of the 
> grapheme_extract functions
> 
> > Maybe I just misunderstand the use case for the extract function - 
> > what it's supposed to do that substr, mb_substr and grapheme_substr 
> > can't or do worse?
> 
> Tex could probably answer this better than I could, but I'll 
> have a go.
> 
> Use case 1: You have a buffer that is a fixed number of bytes 
> long. You need to fill it up as far as you can with whole 
> graphemes. You are probably sending that buffer to another 
> API that might not be grapheme - or even Unicode - aware. You 
> are in a loop so you are tracking your position in the 
> original string. This is how the discussion got started about 
> how the 'start' parameter is defined - it isn't clear how the 
> position would be tracked. I assumed a byte count because the 
> user can simply do a strlen on the return string to update 
> his position, but Tex thinks this isn't as handy as it should 
> be. It depends on the details of the algorithm I guess.
> 
> Use case 2: Same as above except in this case it is an Oracle 
> database buffer where your columns are defined as being N 
> Unicode characters (not bytes or graphemes) long.
> 
> Use case 3 (a generalization of use case 1 really): You have 
> some code that knows about bytes or Unicode characters but 
> nothing about graphemes. You want to update the code so it is 
> grapheme aware. You can't completely abandon a byte count or 
> character count in the code for some reason, but you want to 
> easily update the code to process whole graphemes.
> 
> 
> =Ed
> 
> 
> 

--- End Message ---
--- Begin Message ---
> I disagree with case 2 as it is described. You don't want to truncate in
> the middle of a grapheme, if you in fact have graphemes.

I didn't intend to say that - the only difference between 1 and 2 is that in
2 the buffer is a character-length buffer and presumably you'd have a
character index that you'd like to use in $start. But grapheme_extract
always returns whole graphemes regardless of any option or there's no point
to it.

Stas brought up the idea of having $start be a reference so the routine
could update it to the next position. I think that might solve some problems
in the caller's code. $start could still be defined as any of bytes,
characters, or graphemes and it would be updated respecting that. What do
you think? If we do that, the user might be perfectly happy with only a
"byte flavor" of $start in many simple cases since they don't need to do
anything extra to iterate through the original string - they can always get
a grapheme count or character count if they need it by making a function
call.

=Ed



--- End Message ---
--- Begin Message ---
Hi,
On $start being a reference, I like the idea, especially if we do that 
consistently for all functions (eventually I guess). (Otherwise, it may cause 
bugs to have $start change unexpectedly for a single function.) 

It does make migration also a little harder as people will need to adjust their 
code which does not expect $start to change.

A variation of the proposal might be to have the end value be an optional 
argument at the end of the arg list for returning the end position.
That is easy to migrate and requires a conscious change to update the variable 
and only require updating it if in fact it will be used.

All in all either approach is fine.

(I am out of the office and don't have the specs in front of me - sorry for not 
being more precise.)

But it doesn't really fix the fundamental issue with needing to involve php 
programmers with a byte vs char vs grapheme choice.

The right solution (to my mind) is to have some meta data maintained with 
strings and store position and other info about the strings to improve 
performance without involving programmers and letting people program strings 
without caring about encoding or architecture. That is an opportunity for php 6.




> -----Original Message-----
> From: Ed Batutis [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, May 13, 2008 8:25 AM
> To: Texin, Tex; 'Stanislav Malyshev'
> Cc: [EMAIL PROTECTED]
> Subject: RE: [PHP-I18N] proposal: unification of the 
> grapheme_extract functions
> 
> 
> > I disagree with case 2 as it is described. You don't want 
> to truncate 
> > in the middle of a grapheme, if you in fact have graphemes.
> 
> I didn't intend to say that - the only difference between 1 
> and 2 is that in
> 2 the buffer is a character-length buffer and presumably 
> you'd have a character index that you'd like to use in 
> $start. But grapheme_extract always returns whole graphemes 
> regardless of any option or there's no point to it.
> 
> Stas brought up the idea of having $start be a reference so 
> the routine could update it to the next position. I think 
> that might solve some problems in the caller's code. $start 
> could still be defined as any of bytes, characters, or 
> graphemes and it would be updated respecting that. What do 
> you think? If we do that, the user might be perfectly happy 
> with only a "byte flavor" of $start in many simple cases 
> since they don't need to do anything extra to iterate through 
> the original string - they can always get a grapheme count or 
> character count if they need it by making a function call.
> 
> =Ed
> 
> 
> 

--- End Message ---
--- Begin Message ---

> A variation of the proposal might be ...

I like it. I vote for:

  Add a new optional parameter on the end - called $next - that
  is a reference that will receive the offset just past the 
  end of the returned string.

The caller can use the same variable for $start and $next if they don't want
to have to manually assign the value in their own code.

I will add the three options for the meaning of 'start' as well with the
default being 'byte'. The returned $next value will respect the value of
that option.
 
If no one objects I'll put that in.

> The right solution (to my mind) is to have some meta data maintained with
> strings and store position and other info about the strings to improve
> performance without involving programmers and letting people program
> strings without caring about encoding or architecture. That is an
> opportunity for php 6.

Sounds like a break iterator with a bit of extra info to support multiple
encodings. Or perhaps you mean to wrap all string operations?

=Ed



--- End Message ---
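A minimal sketch of the $next proposal above, applied to use case 2 (a column limited to N Unicode characters). It assumes the five-argument form of grapheme_extract() found in the released intl extension, where the last argument is a by-reference byte offset; chunk_by_chars() is a hypothetical helper name.

```php
<?php
// Sketch only: cut a UTF-8 string into pieces of at most $maxChars code
// points each, always on grapheme boundaries. Needs the intl extension.
function chunk_by_chars($text, $maxChars)
{
    $chunks = [];
    $pos = 0;               // byte offset of the next piece
    $len = strlen($text);

    while ($pos < $len) {
        $next = $pos;       // will receive the offset just past the piece
        $chunk = grapheme_extract($text, $maxChars, GRAPHEME_EXTR_MAXCHARS, $pos, $next);
        if ($chunk === false || $next <= $pos) {
            // No progress: the next grapheme has more code points than allowed.
            break;
        }
        $chunks[] = $chunk;
        $pos = $next;       // no strlen() bookkeeping needed in the caller
    }
    return $chunks;
}
```

The sketch keeps $pos and $next as separate variables so that a call making no progress can be detected; as the message above notes, a caller could also pass the same variable for both.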
--- Begin Message ---
Hi!

>   Add a new optional parameter on the end - called $next - that
>   is a reference that will receive the offset just past the end of the returned string.

OK, it may be even better.

> I will add the three options for the meaning of 'start' as well with the
> default being 'byte'. The returned $next value will respect the value of
> that option.
> If no one objects I'll put that in.

That would make it a six-parameter function? I think that's too much; just bytes should be enough... Note that if you receive $start in something other than bytes, you'd need to re-scan the string next time to know where to start, even though you already had this information at the last call. So you lose part of the performance gain of having this offset. If you are not looping but just need to start with the Nth grapheme, you can always go for grapheme_substr.

Also, I agree with Tex on wanting to release it ASAP, and it seems like this issue is the last one. So I think let's make it just bytes with an optional $next, document it and start rolling out 1.0.0.
--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]

--- End Message ---
--- Begin Message ---
You could make the type work with flags rather than type so as not to need an 
additional argument.

So you would declare the mix and match of start/next units and length units as 
flag1+flag2
Where both flag1 and flag2 would have appropriate defaults for a value of 0.
Flag1 would use bits 4-6 and flag2 bits in the range 1-3 or some such.

Don't hate me it is just a suggestion.

> -----Original Message-----
> From: Stanislav Malyshev [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, May 13, 2008 2:43 PM
> To: Ed Batutis
> Cc: Texin, Tex; [EMAIL PROTECTED]
> Subject: Re: [PHP-I18N] proposal: unification of the 
> grapheme_extract functions
> 
> Hi!
> 
> >   Add a new optional parameter on the end - called $next - that
> >   is a reference that will receive the offset just past the 
> >   end of the returned string.
> 
> OK, it may be even better.
> 
> > I will add the three options for the meaning of 'start' as 
> well with 
> > the default being 'byte'. The returned $next value will respect the 
> > value of that option.
> >  
> > If no one objects I'll put that in.
> 
> That would make it six-parameter function? I think it's too 
> much, just bytes should be enough... Note that if you receive 
> $start in something other than bytes you'd need to re-scan 
> the string next time to know where to start, even though you 
> already had this information at the last call. So  you lose 
> part of the performance gain of having this offset. 
> If you not looping but just need to start with Nth grapheme, 
> you can always go for grapheme_substr.
> Also, I agree with Tex on wanting to release it ASAP and it 
> seems like this issue is the last one. So I think let's make 
> it just bytes with optional $next, document it and start 
> rolling out 1.0.0.
> --
> Stanislav Malyshev, Zend Software Architect
> [EMAIL PROTECTED]   http://www.zend.com/
> (408)253-8829   MSN: [EMAIL PROTECTED]
> 

--- End Message ---
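The flag idea above could look roughly like the following. Every constant and function name here is hypothetical and exists only to illustrate packing two unit choices into one integer argument; none of them are part of the intl extension.

```php
<?php
// Hypothetical constants for illustration only.
// Low three bits: unit of the length argument. Next three bits: unit of
// start/next. A value of 0 in either field selects that field's default.
const EXTR_LEN_GRAPHEMES   = 0;        // default length unit
const EXTR_LEN_BYTES       = 1;
const EXTR_LEN_CHARS       = 2;

const EXTR_START_BYTES     = 0 << 3;   // default start unit
const EXTR_START_CHARS     = 1 << 3;
const EXTR_START_GRAPHEMES = 2 << 3;

const EXTR_LEN_MASK   = 0x07;          // 0b000111
const EXTR_START_MASK = 0x38;          // 0b111000

// Split a combined flag value back into its two fields.
function decode_extract_flags($flags)
{
    return [
        'length_unit' => $flags & EXTR_LEN_MASK,
        'start_unit'  => $flags & EXTR_START_MASK,
    ];
}

// Start counted in characters, length limited in bytes.
$flags = EXTR_START_CHARS | EXTR_LEN_BYTES;
$units = decode_extract_flags($flags);
```

Since the two fields occupy disjoint bits, combining them with + or | gives the same result, and passing 0 keeps both defaults.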
--- Begin Message ---
Hi!

> You could make the type work with flags rather than type so as not to need an 
> additional argument.
>
> So you would declare the mix and match of start/next units and length units as 
> flag1+flag2
> Where both flag1 and flag2 would have appropriate defaults for a value of 0.
> Flag1 would use bits 4-6 and flag2 bits in the range 1-3 or some such.
>
> Don't hate me it is just a suggestion.

Not hating, but I still think just bytes are enough :) If we discover real code where I'm wrong and it can't work, we can add it later, right?
--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]

--- End Message ---
--- Begin Message ---
Yes, bytes are ok for now. I still think it makes php unnecessarily difficult 
to program. If php were a 3gl it would be fine. 

;-)

It is not that it can't work. It is that php shouldn't require users to do 2 
types of accounting and have two variables for tracking positions. (one to 
remember byte offsets and one to remember character/grapheme equivalent index.) 
I guarantee bugs due to the variables getting out of sync.

Hey, do we have a function that returns the character index, given a byte index?

I guess you do a grapheme_strlen(grapheme_substr(bytes)...
A function that gives the grapheme count for a string starting at a byte offset 
for a length of n bytes would be handy.

(still not able to access specs from here)
tex

> -----Original Message-----
> From: Stanislav Malyshev [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, May 13, 2008 3:47 PM
> To: Texin, Tex
> Cc: Ed Batutis; [EMAIL PROTECTED]
> Subject: Re: [PHP-I18N] proposal: unification of the 
> grapheme_extract functions
> 
> Hi!
> 
> > You could make the type work with flags rather than type so 
> as not to need an additional argument.
> > 
> > So you would declare the mix and match of start/next units 
> and length 
> > units as flag1+flag2 Where both flag1 and flag2 would have 
> appropriate defaults for a value of 0.
> > Flag1 would use bits 4-6 and flag2 bits in the range 1-3 or 
> some such.
> > 
> > Don't hate me it is just a suggestion.
> 
> Not hating, but I still think just bytes are enough :) If we 
> discover real code where I'm wrong and it can't work, we can 
> add it later, right?
> --
> Stanislav Malyshev, Zend Software Architect
> [EMAIL PROTECTED]   http://www.zend.com/
> (408)253-8829   MSN: [EMAIL PROTECTED]
> 
> --
> PHP Unicode & I18N Mailing List (http://www.php.net/) To 
> unsubscribe, visit: http://www.php.net/unsub.php
> 
> 

--- End Message ---
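The two conversions asked about above can be approximated with existing calls. This is a sketch under assumptions: both helper names are hypothetical, the input is valid UTF-8, and the byte offsets are expected to fall on character (or grapheme) boundaries. It needs the intl and mbstring extensions.

```php
<?php
// Hypothetical helper: number of graphemes in the $byteLength bytes that
// start at byte offset $byteOffset.
function grapheme_count_in_bytes($text, $byteOffset, $byteLength)
{
    return grapheme_strlen(substr($text, $byteOffset, $byteLength));
}

// Hypothetical helper: character (code point) index that corresponds to a
// byte offset, found by counting the characters before that offset.
function char_index_from_byte_offset($text, $byteOffset)
{
    return mb_strlen(substr($text, 0, $byteOffset), 'UTF-8');
}

// "e" followed by U+0301 is two code points, three bytes, one grapheme.
$text = "ae\xCC\x81bc";
$graphemes = grapheme_count_in_bytes($text, 1, 3);
$charIndex = char_index_from_byte_offset($text, 4);
```

Both helpers rescan part of the string on every call, which is the cost the thread is trying to avoid by keeping positions in bytes.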
--- Begin Message ---
 

> -----Original Message-----
> From: Ed Batutis [mailto:[EMAIL PROTECTED] 
> 
> Sounds like a break iterator with a bit of extra info to 
> support multiple encodings. Or perhaps you mean to wrap all 
> string operations?
> 
> =Ed


I mean just for unicode, not other encodings, and yes to wrap all string ops, 
so that it can be maintained for any operations performed on the string.

I would maintain some info like is the string all ascii, are there any 
graphemes, etc. to use lower-cost functions if possible, and some information 
about where each character in the string begins for fast indexing on short 
strings.
For longer strings, I might remember beginning of lines and their character and 
byte offsets.
Also previous and next character info.

It would be something you might use on certain frequently used strings that are 
actually processed.
PHP does a lot of just moving strings around and not parsing or modifying them 
so it isn't cost effective for all.

Just a thought for the future.

--- End Message ---
--- Begin Message ---
Hi!

> I have some questions:
> - If I have to launch a site with multi-language support in
> 1-2 months, is it worth relying on this extension? (Will it be part of 5.3?)

Yes, I think so.

> - Is there now standard like way to handle plurals?

In MessageFormatter, you have a way to use conditionals (see ICU docs here: http://icu-project.org/apiref/icu4c/classMessageFormat.html#_details) which may allow you to do correct plurals. Of course, you would still have to supply the correct words (room/rooms, etc.) in the format.
--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]

--- End Message ---
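One way to read the "conditionals" mentioned above is ICU's choice format. The sketch below assumes that reading and uses the static MessageFormatter::formatMessage() call from the intl extension; the pattern text and the function name format_comment_count() are illustrative only.

```php
<?php
// Sketch only: pick a singular or plural wording with an ICU choice pattern.
// "0#", "1#" and "1<" mean: from 0, from 1, and greater than 1.
function format_comment_count($locale, $count)
{
    $pattern = '{0,choice,0#No comments|1#1 comment|1<{0,number,integer} comments}';
    return MessageFormatter::formatMessage($locale, $pattern, [$count]);
}

$none = format_comment_count('en_US', 0);
$one  = format_comment_count('en_US', 1);
$many = format_comment_count('en_US', 5);
```

The nested {0,number,integer} keeps locale-aware number formatting, while the words themselves still have to be supplied per language in the pattern, as the message above points out.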
--- Begin Message ---
Hi!


> In MessageFormatter, you have a way to use conditionals (see ICU docs here: http://icu-project.org/apiref/icu4c/classMessageFormat.html#_details) which may allow you to do correct plurals. Of course, you would still have to supply the correct words (room/rooms, etc.) in the format.

Thank you for the answer.

I have one more question, which is a little more theoretical. Currently I see two directions for providing proper plural handling, and I am not sure which one is better to use.

So I am curious what others think about this, and whether anyone can suggest other things I should consider to help me make a better decision.


1.) Gettext way: I call a format_plural($count, '1 comment', '$count comments'), and this function can handle special cases (for example Russian), and the given language file can store multiple plural format.

2.) "All-in-one" way: only one string belongs to one string id, and this string holds the different versions of the given word/sentence, and the rules too on which we can decide which version should be used. For example:
"[0]No comment.|[1]1 comment.|[2,]$count comments."


Actually I like the second way because it is more flexible, but I am not sure whether this flexibility is worth it. And I like the possibility that you can handle the "0 case" in one place.

But I have some problems with this approach too (I feel that it is a little programmer-centric).
  o Maybe the most obvious one is that the rules are repeated in every entry, even though the rules belong to the language, not to an entry (so the data model is a little strange, it holds some redundancy).
  o My other problem is that for this structure it would not be easy to create a logical/usable user interface for translating.
  o Or what if I want to export my master language to XLIFF format: will a translator be able to handle this approach and create a proper translation?


I would really appreciate your comments on this topic; I have no work experience with it yet.

TIA!


Best Regards,
Felhő

--- End Message ---
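To make the "all-in-one" format above concrete, here is a small sketch of how such an entry might be evaluated. The function name select_variant() is hypothetical, and the closed range form "[min,max]" is an assumption added for illustration; the message only shows "[n]" and "[n,]".

```php
<?php
// Sketch only: choose the variant of an "all-in-one" entry that matches $count.
// Variants are separated by "|" and start with "[n]", "[n,]" or "[n,m]".
function select_variant($entry, $count)
{
    foreach (explode('|', $entry) as $variant) {
        if (!preg_match('/^\[(\d+)(,(\d*))?\](.*)$/s', $variant, $m)) {
            continue;                    // skip malformed variants
        }
        $min = (int) $m[1];
        if ($m[2] === '') {
            $max = $min;                 // "[1]"   : exactly 1
        } elseif ($m[3] === '') {
            $max = PHP_INT_MAX;          // "[2,]"  : 2 or more
        } else {
            $max = (int) $m[3];          // "[2,4]" : 2 through 4 (assumed form)
        }
        if ($count >= $min && $count <= $max) {
            // Substitute the literal placeholder text "$count".
            return str_replace('$count', (string) $count, $m[4]);
        }
    }
    return null;                         // no rule matched
}

// Single quotes keep "$count" as literal text inside the entry.
$entry = '[0]No comment.|[1]1 comment.|[2,]$count comments.';
$label = select_variant($entry, 7);
```

A sketch this simple cannot handle a "|" inside the translated text, and it inserts the raw number without locale formatting.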
--- Begin Message ---
> entry (so the data model is a little strange, it holds some redundancy).

How many of these do you have? A little redundancy is OK, I think.

>   o My other problem is that for this structure it would be not easy to
> create a logical/usable user interface for translating.

Don't do that. Translators have specialized tools they own and want to use.

>   o Or what is if I want to export my master language to XLIFF format,

Good idea. Or a plain text file or many other formats will be OK with most
translators.

=Ed



--- End Message ---
--- Begin Message ---
Hi!

> 1.) Gettext way: I call a format_plural($count, '1 comment', '$count comments'), and this function can handle special cases (for example Russian), and the given language file can store multiple plural format.
>
> 2.) "All-in-one" way: only one string belongs to one string id, and this string holds the different versions of the given word/sentence, and the rules too on which we can decide which version should be used. For example:
> "[0]No comment.|[1]1 comment.|[2,]$count comments."

Here you'd probably want to use {1,number,integer} instead of $count and provide $count as parameter, otherwise you'd lose local number formatting.

> o Maybe the most obvious one is that the rules are repeated in every entry and the fact that the rules belongs to the language not to an entry (so the data model is a little strange, it holds some redundancy).

Yes, that's correct - but you could always isolate it into a specific printing function - i.e. create a function print_comments_count($count) and use the pattern there, and then insert its output into a bigger pattern. Depends on how often that pattern repeats, I guess.

>   o My other problem is that for this structure it would be not easy to
> create a logical/usable user interface for translating.

Well, here I don't have any meaningful experience, but I think since ICU uses it there must be some tools that support it. Right now the extension does not deal with the question where you get the patterns from or how you switch between pattern sets (we planned to add support for resource bundles, etc. later). Maybe other people on the list could help more with that :)
--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]

--- End Message ---
--- Begin Message ---
> - Is there now standard like way to handle plurals?

Those who don't use gettext are doomed to reimplement it.

-- 
Tomas

--- End Message ---
--- Begin Message ---
Of the two, I prefer the all-in-one way, because the localizer needs to see the
entire structure and modify it appropriately for the language, grammar and
rules.

If the programmer creates a format_plural function, then the function is not 
likely to cover all the needs of all languages, nor the reordering of the text 
that a localizer might choose for correct grammar.
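
To make the point concrete: a format_plural() that takes only a singular and a plural string cannot express Russian, which needs three forms. The sketch below hard-codes the plural rule commonly used in gettext catalogs for Russian; the function names russian_plural_index() and format_plural_ru() and the sample word forms are assumptions made for the example.

```php
<?php
// Sketch only: all names here are hypothetical.
// Returns the index (0, 1 or 2) of the Russian plural form for $n.
function russian_plural_index(int $n): int
{
    $n = abs($n);
    if ($n % 10 === 1 && $n % 100 !== 11) {
        return 0; // 1, 21, 31, ...
    }
    if ($n % 10 >= 2 && $n % 10 <= 4 && ($n % 100 < 10 || $n % 100 >= 20)) {
        return 1; // 2-4, 22-24, ...
    }
    return 2;     // 0, 5-20, 25-30, ...
}

// Picks one of three translated forms and prefixes the count.
function format_plural_ru(int $count, array $forms): string
{
    return $count . ' ' . $forms[russian_plural_index($count)];
}

$forms = ['комментарий', 'комментария', 'комментариев'];
foreach ([1, 3, 5, 11, 21] as $n) {
    echo format_plural_ru($n, $forms), "\n";
}
```

Even this fixes the word order as "count, then noun", which is the kind of decision a localizer may want to change.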


That said, I try to avoid the problem altogether by putting the text in the
form of label:value, since this can almost always be accommodated by an
unchanging label in all languages.

Number of comments: (0, 1, 2 whatever...)

It is not as engaging as writing "You have n comments", but it can be done so
it is not off-putting, and you can use the rest of the text to better engage
the audience.
tex

> -----Original Message-----
> From: Stanislav Malyshev [mailto:[EMAIL PROTECTED] 
> Sent: Monday, May 12, 2008 5:23 PM
> To: Gergely Hodicska
> Cc: [EMAIL PROTECTED]
> Subject: Re: [PHP-I18N] intl extension
> 
> Hi!
> 
> > 1.) Gettext way: I call a format_plural($count, '1 
> comment', '$count 
> > comments'), and this function can handle special cases (for example 
> > Russian), and the given language file can store multiple 
> plural format.
> > 
> > 2.) "All-in-one" way: only one string belongs to one string id, and 
> > this string holds the different versions of the given 
> word/sentence, 
> > and the rules too on which we can decide which version 
> should be used. For example:
> > "[0]No comment.|[1]1 comment.|[2,]$count comments."
> 
> Here you'd probably want to use {1,number,integer} instead of 
> $count and provide $count as parameter, otherwise you'd lose 
> local number formatting.
> 
> >  o Maybe the most obvious one is that the rules are 
> repeated in every 
> > entry and the fact that the rules belongs to the language not to an 
> > entry (so the data model is a little strange, it holds some 
> redundancy).
> 
> Yes, that's correct - but you could always isolate it into 
> specific printing function - i.e. create function 
> print_comments_count($count) and use the pattern there, and 
> then insert it into bigger pattern. 
> Depends on how often that pattern repeats, I guess.
> 
> >  o My other problem is that for this structure it would be 
> not easy to 
> > create a logical/usable user interface for translating.
> 
> Well, here I don't have any meaningful experience, but I 
> think since ICU uses it there must be some tools that support 
> it. Right now the extension does not deal with the question 
> where you get the patterns from or how you switch between 
> pattern sets (we planned to add support for resource bundles, 
> etc. later). Maybe other people on the list could help more 
> with that :)
> --
> Stanislav Malyshev, Zend Software Architect
> [EMAIL PROTECTED]   http://www.zend.com/
> (408)253-8829   MSN: [EMAIL PROTECTED]
> 
> --
> PHP Unicode & I18N Mailing List (http://www.php.net/) To 
> unsubscribe, visit: http://www.php.net/unsub.php
> 
> 

--- End Message ---
--- Begin Message ---
> (This is a reply to a problem in the archives, from March:
>    http://marc.info/?l=php-i18n&m=120595161128203&w=2 )
> 
> As you obviously have the mb_string extension installed, have you tried
> using mb_send_mail() instead of mail()? Then you shouldn't need to mess
> around encoding your own mimeheaders.
> 
> <minor rant>
> Later in the thread Tomas suggested using UTF-8 instead of ISO-2022-JP,
> and getting Docomo to change. The problem is all those handsets in
> existence. Not to mention all the other legacy email clients that don't
> work well with UTF-8, but real people still use. Docomo could convert
> from UTF-8 to ISO-2022-JP at the gateway of course, which apparently is
> what softbank and kddi actually do, but Docomo deal with a lot of email
> so care about the cost of the extra CPU cycles, and you're going to need
> better motivation for them than "PHP cannot write proper MIME headers" I
> suspect.
> </minor rant>

I missed that mb_encode_mimeheader() depends on the internal mbstring
encoding. If the internal mbstring charset is set, mb_encode_mimeheader()
should work correctly.
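
A minimal sketch of that dependency, assuming the mbstring extension is loaded and the script file itself is saved as UTF-8 (the sample subject text is made up for the example):

```php
<?php
// Sketch only: assumes mbstring is loaded and this file is UTF-8.
// Tell mbstring what encoding $subject is in before encoding the header.
mb_internal_encoding('UTF-8');

$subject = 'ご注文の確認';

// Convert from the internal encoding to ISO-2022-JP and wrap the result
// as a Base64 ("B") MIME encoded-word.
$encoded = mb_encode_mimeheader($subject, 'ISO-2022-JP', 'B');

echo 'Subject: ', $encoded, "\n";

// Decoding converts back to the internal encoding.
echo mb_decode_mimeheader($encoded), "\n";
```

If the internal encoding does not match the actual bytes in $subject, the conversion to ISO-2022-JP starts from the wrong assumption and the header comes out garbled.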

Still, using UTF-8 instead of ISO-2022-JP is the better solution. "Other
legacy email clients" will some day face the same situation Japan faced one
and a half centuries ago, when the country could not defend itself against
four steamships. In the modern world, seclusion hurts only those who practice
it. ISO-2022-JP is an outdated charset: it does not support all the characters
that can be entered in a UTF-8 HTML form, and it is harder to parse than UTF-8.

-- 
Tomas

--- End Message ---
--- Begin Message ---
htmlentities() is multi-byte compatible, but you have to pass 'UTF-8' as the
third argument to the function to enable multi-byte support.

More info:
www.php.net/htmlentities
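
A short sketch of the difference the third argument makes, assuming the script file is saved as UTF-8 (the sample string and variable names are made up for the example):

```php
<?php
// Sketch only: assumes this file is saved as UTF-8.
$text = 'café <b>';

// With the charset given, the two-byte "é" is treated as one character.
$utf8 = htmlentities($text, ENT_QUOTES, 'UTF-8');

// With a single-byte charset, each byte of "é" is converted on its own.
$latin1 = htmlentities($text, ENT_QUOTES, 'ISO-8859-1');

echo $utf8, "\n";
echo $latin1, "\n";
```

Passing the charset explicitly avoids depending on the default, which was ISO-8859-1 in the PHP 5.2 era and only became UTF-8 in PHP 5.4.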

On Wed, May 7, 2008 at 8:19 PM, Jerry Schwartz <[EMAIL PROTECTED]>
wrote:

> The changelog for html_entity_decode says that it is multi-byte safe
> starting with 5.0.0. What about htmlentities? There's no note there.
>
> Regards,
>
> Jerry Schwartz
> The Infoshop by Global Information Incorporated
> 195 Farmington Ave.
> Farmington, CT 06032
>
> 860.674.8796 / FAX: 860.674.8341
>
> www.the-infoshop.com
> www.giiexpress.com
> www.etudes-marche.com
>
-- 
Isaak Malik
Web Developer

--- End Message ---
