Re: [PHP-DEV] BreakIterator

Gustavo Lopes Mon, 04 Jun 2012 14:09:33 -0700

On Mon, 04 Jun 2012 21:09:28 +0200, Stas Malyshev <[email protected]> wrote:

I understand that, but I have no idea how to write proper rules for word
boundaries, I just want to tell it "give me word boundaries" but not by
saying createWordBoundaries() but by doing createIterator($type) where
$type == WORD_BOUNDARIES.

Why? This makes no sense to me. Why would createIterator(WORD_BOUNDARIES) be better than BreakIterator::createWordInstance()? Especially in a dynamic language like PHP where you can do:


$type = 'word';
$bi = BreakIterator::{"create" . $type . 'instance'}(NULL);
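
For readers who still want a single entry point keyed by type, a minimal sketch follows. It assumes the intl extension and the class name IntlBreakIterator, under which this class later shipped in PHP 5.5. The createIterator() function and its $factories map are hypothetical and not part of the proposed API.

```php
<?php
// Hypothetical wrapper: createIterator() and $factories are illustrative only.
// Assumes the intl extension, where the class is named IntlBreakIterator.
function createIterator($type, $locale = null)
{
    static $factories = [
        'word'      => 'createWordInstance',
        'sentence'  => 'createSentenceInstance',
        'line'      => 'createLineInstance',
        'character' => 'createCharacterInstance',
    ];

    if (!isset($factories[$type])) {
        throw new InvalidArgumentException("Unknown break type: $type");
    }

    // Dispatch to the matching static factory method.
    return call_user_func(['IntlBreakIterator', $factories[$type]], $locale);
}
```

An explicit map rejects unknown types up front, which the string-concatenation form above does not.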

To iterate over code points, you can build a very simple
RuleBasedBreakIterator -- new RuleBasedBreakIterator('.;'). See this
example here: https://gist.github.com/2843005


Is there any reason not to provide this as a service for PHP users? I
understand somebody who is a specialist in ICU knows that already, but
most PHP users don't know this magic.

Well, the reason I didn't add it is because ICU didn't add such an iterator. I imagine the reason for that is that there are much more efficient ways to iterate over UTF-8 that don't involve a full-blown regex based text segmentation engine. In fact, ICU provides very efficient ways (with macros and simple specialized functions) to iterate over UTF-8 text in utf8.h:


http://lxr.php.net/xref/THIRD_PARTY/ICU4C/source/common/unicode/utf8.h
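
The idea of walking UTF-8 directly, without a rule-based engine, can be sketched in plain PHP. This is not ICU's utf8.h. The utf8CodePoints() helper is hypothetical, assumes well-formed input, and does no validation of malformed sequences.

```php
<?php
// Hypothetical helper: yields byteOffset => codePoint for each code point.
// Assumption: $s is well-formed UTF-8; malformed input is not detected.
function utf8CodePoints(string $s): Generator
{
    $len = strlen($s);
    for ($i = 0; $i < $len; ) {
        $start = $i;
        $lead = ord($s[$i++]);

        if ($lead < 0x80) {          // 1 byte (ASCII)
            $cp = $lead;
            $trail = 0;
        } elseif ($lead < 0xE0) {    // 2-byte sequence
            $cp = $lead & 0x1F;
            $trail = 1;
        } elseif ($lead < 0xF0) {    // 3-byte sequence
            $cp = $lead & 0x0F;
            $trail = 2;
        } else {                     // 4-byte sequence
            $cp = $lead & 0x07;
            $trail = 3;
        }

        // Each trail byte contributes its low 6 bits.
        while ($trail-- > 0) {
            $cp = ($cp << 6) | (ord($s[$i++]) & 0x3F);
        }

        yield $start => $cp;
    }
}
```

The loop only inspects the lead byte to decide how many trail bytes to consume, which is why this kind of iteration needs no segmentation rules at all.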

Right now, the ICU implementation just calls
Locale::getAvailableLocales(), but its description is "Gets all the
available locales that has localized text boundary data." so I suppose
it could return a different set in the future.


My only concern is that no other classes have getAvailableLocales() and
it doesn't seem to do anything useful now, so maybe we should omit it
for now?

I have no special love for it, but your statement is inaccurate in one aspect -- I've added a similar function in IntlCalendar... whose implementation is basically the same:


http://lxr.php.net/xref/THIRD_PARTY/ICU4C/source/i18n/calendar.cpp#getAvailableLocales

I don't mind removing both though.

Another thing I notice here: why not make:
$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);

into:
$bi = BreakIterator::createWordInstance(NULL, $foo);


Two reasons:

* it encourages bad behavior, namely not reusing the BreakIterator objects.

* that's not the ICU signature. If ICU in the future adds overloads with a string in the second argument, we'll find ourselves with odd signatures.
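
The first point, reusing one iterator across several texts, might look like the sketch below. It assumes the intl extension and the class name IntlBreakIterator from PHP 5.5. countParts() is a hypothetical helper, not part of the API.

```php
<?php
// Hypothetical helper: rebinds an existing iterator instead of creating a new one.
// Assumes the intl extension (IntlBreakIterator).
function countParts(IntlBreakIterator $bi, $text)
{
    $bi->setText($text);  // reuse the same iterator for new text
    return iterator_count($bi->getPartsIterator());
}

// The iterator is built once, outside the loop.
$bi = IntlBreakIterator::createWordInstance('en_US');

$counts = [];
foreach (['foo bar', 'one two three'] as $text) {
    $counts[] = countParts($bi, $text);
}
```

Keeping construction and setText() separate is what makes this pattern natural; a factory that also took the text would invite building a fresh iterator per string.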

OK, if you have to do getPartsIterator() it's fine as long as you can
easily do foreach on it, since that's what one expects from iterator.
I'd also add some flag that would skip or not skip whitespace, if this
is possible - so for "foo bar" sometimes you want ['foo', ' ', 'bar']
and sometimes you want just ['foo', 'bar'] - does ICU support it somehow?

The BreakIterator cannot throw away text. You have to look at the rule statuses. Example:


$text = 'This is a phrase... with some punctuation.';
$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($text);
foreach ($bi->getPartsIterator() as $v) {
        if ($bi->getRuleStatus() > BreakIterator::WORD_NONE_LIMIT)
                var_dump($v);
}

string(4) "This"
string(2) "is"
string(1) "a"
string(6) "phrase"
string(4) "with"
string(4) "some"
string(11) "punctuation"
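
The skip-or-keep flag asked about above can be layered on top of the rule status in userland. The sketch below assumes the intl extension and the class name IntlBreakIterator from PHP 5.5. splitWords() and its $skipNonWords parameter are hypothetical. It compares with >= on the assumption that ICU's numeric word statuses start exactly at WORD_NONE_LIMIT.

```php
<?php
// Hypothetical helper: splitWords() is illustrative, not part of the API.
// Assumes the intl extension (IntlBreakIterator).
function splitWords(IntlBreakIterator $bi, $text, $skipNonWords = true)
{
    $bi->setText($text);
    $parts = [];

    foreach ($bi->getPartsIterator() as $part) {
        // Statuses below WORD_NONE_LIMIT mark spaces and punctuation.
        $isWord = $bi->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT;
        if ($isWord || !$skipNonWords) {
            $parts[] = $part;
        }
    }

    return $parts;
}

$bi = IntlBreakIterator::createWordInstance('en_US');
$wordsOnly = splitWords($bi, 'foo bar');         // whitespace dropped
$everything = splitWords($bi, 'foo bar', false); // whitespace kept
```

Passing the iterator in, rather than creating it inside the helper, keeps the object reusable across calls.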

Again, having a full description of the proposed API would be nice.
For example, what does hashCode() do?

The ICU docs only say "Compute a hash code for this BreakIterator." If I'm not mistaken from my quick glance at the source, it just returns the length of the forward rules.


--
Gustavo Lopes

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
