php-i18n Digest 3 Mar 2008 20:05:36 -0000 Issue 383

php-i18n-digest-help Mon, 03 Mar 2008 12:07:25 -0800

Topics (messages 1153 through 1155):


Re: Code point iteration
        1153 by: Tomas Kuliavas

Re: Foreign language sorting
        1154 by: Tomas Kuliavas

[RFC] Replace the flex-based scanner with an re2c [1] based lexer
        1155 by: Marcus Boerger

----------------------------------------------------------------------

--- Begin Message ---

> Given a Unicode UTF8 character, how would you get the character at the
> very next code point?

Check last byte. If it is between 0x80 and 0xBE, add one point to last
byte and return all bytes.

If last byte is 0xBF, set it to 0x80 and repeat 0x80-0xBE and 0xBF
checks on next byte.

If next byte is 0xDF, convert it to 0xE0 0x80 and return all bytes.

If next byte is 0xEF, convert it to 0xF0 0x80 and return all bytes.

If next byte is 0xF7, convert it to 0xF8 0x80 and return all bytes.

If next byte is 0xFB, convert it to 0xFC 0x80 and return all bytes.

If next byte is set to some other value between 0xC0 and 0xFA, add one
point to it and return all bytes.

If next byte is 0xFD, the next Unicode code point is 0x00.

Or calculate the code point of the UTF-8 character, add one to it and
convert the result back to UTF-8.
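
The decode, increment, re-encode route can be sketched in a few lines of
present-day PHP. This is only a sketch under assumptions: it relies on
mb_ord() and mb_chr() from ext/mbstring (PHP 7.2 or later), it expects a
non-empty string, and next_code_point() is a hypothetical helper name. It
also stops at U+10FFFF and skips the surrogate range, which the byte-wise
rules do not address. One thing it handles automatically is the change of
sequence length: U+07FF is DF BF, while its successor U+0800 is E0 A0 80.

```php
<?php
// Hypothetical helper: returns the UTF-8 character one code point after the
// first character of $char, or null when there is no valid successor.
// Assumes ext/mbstring (PHP >= 7.2) and a non-empty input string.
function next_code_point(string $char): ?string
{
    $cp = mb_ord($char, 'UTF-8');      // decode to a code point; false on invalid UTF-8
    if ($cp === false) {
        return null;
    }

    $next = $cp + 1;
    if ($next >= 0xD800 && $next <= 0xDFFF) {
        $next = 0xE000;                // surrogates cannot be encoded in UTF-8; skip them
    }
    if ($next > 0x10FFFF) {
        return null;                   // past the last Unicode code point
    }

    $out = mb_chr($next, 'UTF-8');     // encode back to UTF-8
    return $out === false ? null : $out;
}

echo next_code_point("a"), "\n";                    // b
echo bin2hex(next_code_point("\u{07FF}")), "\n";    // e0a080
```

Returning null at the end of the range, instead of wrapping around to 0x00,
is a design choice of this sketch; callers that want wrap-around can map
null to "\0" themselves.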

-- 
Tomas

--- End Message ---

--- Begin Message ---

> I am looking for a method to sort an array of strings, however the
> strings are UTF8 and contain latin and non-latin characters. Is there a
> method to sort a string such that letters such as e with an acute are
> grouped with e without an acute, or even for strings that are made up
> entirely of Thai or Cyrillic characters?
> I really hope I don't have to roll my own sorting algorithm.

See http://pecl.php.net/package/intl and
http://docs.php.net/manual/en/book.intl.php
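
As a rough illustration of what the intl extension offers here, the sketch
below sorts a small array with its Collator class and compares the result
with a plain byte-wise sort(). The word list and the 'en_US' locale are
assumptions made for the example; for Thai or Cyrillic text one would
presumably pass a matching locale such as 'th_TH' or 'ru_RU'.

```php
<?php
// Assumes the intl extension (ICU) is loaded. The word list and the 'en_US'
// locale are illustrative choices, not taken from the original question.
$words = ['zebra', 'éclair', 'eagle', 'Ärger', 'apple'];

$bytewise = $words;
sort($bytewise, SORT_STRING);       // plain byte order: accented letters land after 'z'

$collated = $words;
$collator = new Collator('en_US');  // locale-aware comparison rules from ICU
$collator->sort($collated);         // sorts in place, returns true on success

print_r($bytewise);                 // apple, eagle, zebra, Ärger, éclair
print_r($collated);                 // apple, Ärger, eagle, éclair, zebra
```

The collator treats the accent as a secondary difference, so "éclair" is
grouped with the other words starting with "e" and no hand-written sorting
algorithm is needed.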

--- End Message ---

--- Begin Message ---

Hello everyone,

  sorry for the crosspost, but recent discussions about
'[RFC] Replace the flex-based scanner with an re2c [1] based lexer'
revealed one big issue. During the development of said RFC we dropped
--enable-multibyte-support and the interaction between the engine and
ext/mbstring using declare(encoding=..). Neither of the two is documented
anywhere, nor do any of the core developers happen to know how it works,
what it is supposed to do or how to test it.

Since we do not want to drop this feature we need some test code, best in
the form of .PHPTs. You can find information on how to write tests here:
http://qa.php.net/write-test.php and
http://talks.somabo.de/200703_montreal_need_for_testing.pdf

If you are interested in this further, you are of course also more than
welcome to help in any other form. Apart from the proposal below, there
is also my blog entry to help you get started:
http://blog.somabo.de/2008/02/php-on-re2c.html

thanks
marcus


Sunday, March 2, 2008, 11:21:34 PM, you wrote:

> RFC: REPLACE THE FLEX-BASED SCANNER WITH AN RE2C [1] BASED LEXER

> Situation:
> The current flex-based lexer depends on an outdated and unsupported flex
> version. Alternatives include either updating to a newer version of flex or
> using re2c, which we already use for a variety of things (serializing, pdo sql
> scanning, date/time parsing). While moving towards a newer flex version would
> be much easier, switching to re2c promises a much faster lexer. Actually,
> without any specific re2c optimizations we already get around a 20% scanner
> performance increase. Running the tests gets an overall speedup of 2%. It is
> arguable whether this is enough, but re2c has more advantages. First of all,
> re2c allows one to scan any type of input (ASCII, UTF-8, UTF-16, UTF-32).
> Secondly, it allows for better integration with Lemon [2], which would be the
> next step. And thirdly we can switch to a reentrant scanner.

> Current state:
> Flex has been fully replaced by re2c in Zend. We have also switched to an
> mmap-based lexer approach for now. However, we had to drop multibyte support
> as well as the encoding declare. The current state can be checked out from
> Scott's subversion repository [3] and you can follow the development on his
> Trac setup [4]. When you want to build php with re2c, then you need to grab
> re2c from its sourceforge subversion repository [5]. You can also check out
> the changes in a patch created Sunday 2nd March against a PHP checkout from 
> 14th February [6].

> Further steps:
> Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate
> multibyte support with libintl.

> Future steps:
> Replace bison with lemon in PHP 5.4 or HEAD.

> Time Frame:
> Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
> of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs' decision).
> After that is done, decide about multibyte support. Along with the commit to
> the 5.3 branch there will be a new re2c version available.


> Marcus Boerger
> Nuno Lopes
> Scott MacVicar


> [1] http://re2c.org/
> [2] http://www.hwaci.com/sw/lemon/
> [3] svn://whisky.macvicar.net/php-re2c
> [4] http://trac.macvicar.net/php-re2c/
> [5] https://re2c.svn.sourceforge.net/svnroot/re2c/trunk/re2c
> [6] http://php.net/~helly/php-re2c-20080302.diff.txt

--- End Message ---
