On Mon, 04 Jun 2012 21:09:28 +0200, Stas Malyshev <smalys...@sugarcrm.com> wrote:

I understand that, but I have no idea how to write proper rules for word
boundaries, I just want to tell it "give me word boundaries" but not by
saying createWordBoundaries() but by doing createIterator($type) where
$type == WORD_BOUNDARIES.

Why? This makes no sense to me. Why would createIterator(WORD_BOUNDARIES) be better than BreakIterator::createWordInstance()? Especially in a dynamic language like PHP where you can do:

$type = 'word';
$bi = BreakIterator::{"create" . $type . 'instance'}(NULL);

To iterate over code points, you can build a very simple
RuleBasedBreakIterator -- new RuleBasedBreakIterator('.;'). See this
example here: https://gist.github.com/2843005

Is there any reason not to provide this as a service for PHP users? I
understand somebody who is a specialist in ICU knows that already, but
most PHP users don't know this magic.

Well, the reason I didn't add it is that ICU itself doesn't provide such an iterator. I imagine that's because there are much more efficient ways to iterate over UTF-8 that don't involve a full-blown regex-based text segmentation engine. In fact, ICU provides very efficient ways (macros and simple specialized functions) to iterate over UTF-8 text in utf8.h:

http://lxr.php.net/xref/THIRD_PARTY/ICU4C/source/common/unicode/utf8.h
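
For comparison, here is a minimal PHP-side sketch of iterating over code points without any break iterator. It does not use ICU's utf8.h macros (those are C); it swaps in PCRE's /u mode instead, and codePoints() is a hypothetical helper name, not part of ext/intl. It assumes PHP 7 or later for the scalar type hint and the "\u{...}" escapes.

```php
<?php
// Hypothetical helper: split a UTF-8 string into its code points.
// Uses PCRE in /u (UTF-8) mode, not ICU and not a break iterator.
function codePoints(string $utf8): array
{
    // An empty pattern in /u mode matches between every pair of code points;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces at both ends.
    $parts = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($parts === false) {
        // preg_split() fails when the subject is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $parts;
}

// "a", U+00E9 and U+20AC take 1, 2 and 3 bytes in UTF-8 respectively.
foreach (codePoints("a\u{00E9}\u{20AC}") as $cp) {
    echo bin2hex($cp), "\n";
}
// prints:
// 61
// c3a9
// e282ac
```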


Right now, the ICU implementation just calls
Locale::getAvailableLocales(), but its description is "Gets all the
available locales that has localized text boundary data." so I suppose
it could return a different set in the future.

My only concern is that no other classes have getAvailableLocales() and
it doesn't seem to do anything useful now, so maybe we should omit it
for now?

I have no special love for it, but your statement is inaccurate in one respect -- I've added a similar function in IntlCalendar... whose implementation is basically the same:

http://lxr.php.net/xref/THIRD_PARTY/ICU4C/source/i18n/calendar.cpp#getAvailableLocales

I don't mind removing both though.

Another thing I notice here: why not make:
$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);

into:
$bi = BreakIterator::createWordInstance(NULL, $foo);


Two reasons:

* it encourages bad behavior, namely not reusing the BreakIterator objects.
* that's not the ICU signature. If ICU in the future adds overloads with a string in the second argument, we'll find ourselves with odd signatures.
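
A rough sketch of the reuse pattern the first point has in mind: create the iterator once, then rebind it to each new string with setText(). It assumes ext/intl is loaded and uses IntlBreakIterator, the name under which the class later shipped in PHP 5.5; the sample strings and the $partCounts variable are made up for illustration.

```php
<?php
// Assumes ext/intl. One iterator is created up front and reused for every text.
$bi = IntlBreakIterator::createWordInstance('en');

$texts = ['foo bar', 'one two three'];   // illustrative input
$partCounts = [];

foreach ($texts as $text) {
    // Rebinding the text is cheap compared to building a new iterator,
    // which has to load the rule data for the locale again.
    $bi->setText($text);

    // getPartsIterator() yields the substrings between boundaries,
    // including the whitespace-only parts.
    $parts = iterator_to_array($bi->getPartsIterator(), false);
    $partCounts[$text] = count($parts);
}

print_r($partCounts);
```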

OK, if you have to do getPartsIterator() it's fine as long as you can
easily do foreach on it, since that's what one expects from iterator.
I'd also add some flag that would skip or not skip whitespace, if this
is possible - so for "foo bar" sometimes you want ['foo', ' ', 'bar']
and sometimes you want just ['foo', 'bar'] - does ICU support it somehow?

The BreakIterator cannot throw away text. You have to look at the rule statuses. Example:

$text = 'This is a phrase... with some punctuation.';
$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($text);
foreach ($bi->getPartsIterator() as $v) {
        if ($bi->getRuleStatus() >= BreakIterator::WORD_NONE_LIMIT)
                var_dump($v);
}

string(4) "This"
string(2) "is"
string(1) "a"
string(6) "phrase"
string(4) "with"
string(4) "some"
string(11) "punctuation"
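
If a skip/no-skip switch is wanted, it can be layered on top in userland. The following is only a sketch: wordParts() and its $skipNonWords parameter are hypothetical names, not part of the proposed API, and it assumes ext/intl with the class available as IntlBreakIterator (PHP 5.5 or later; the scalar type hints need PHP 7). It compares with >= so that numeric tokens, whose rule status in ICU equals WORD_NONE_LIMIT, are kept as words.

```php
<?php
// Hypothetical helper: return the parts of $text, optionally dropping
// everything that is not a word (whitespace, punctuation).
function wordParts(IntlBreakIterator $bi, string $text, bool $skipNonWords = true): array
{
    $bi->setText($text);
    $parts = [];
    foreach ($bi->getPartsIterator() as $part) {
        // Statuses below WORD_NONE_LIMIT mark "non-word" segments.
        $isWord = $bi->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT;
        if ($isWord || !$skipNonWords) {
            $parts[] = $part;
        }
    }
    return $parts;
}

$bi = IntlBreakIterator::createWordInstance('en');
var_dump(wordParts($bi, 'foo bar'));         // words only
var_dump(wordParts($bi, 'foo bar', false));  // words and the space between them
```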

Again, having some full description of proposed API would be nice.
For example, what does hashCode() do?

The ICU docs only say "Compute a hash code for this BreakIterator." If I'm not mistaken from my quick glance at the source, it just returns the length of the forward rules.

--
Gustavo Lopes

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
