The subject of character set detection (yes, I know, a hard problem to
solve) came up on SO chat, and Niki noticed that we don't yet wrap the
ICU UCharsetDetector API, so I volunteered to put something together.

https://github.com/php/php-src/compare/master...sgolemon:intl.charsetdetector

The trouble is, for the VAST majority of my test cases so far, ICU is
really bad at detecting character sets correctly (as I said, it's a
tough problem).  In fact, the ICU manual admits that it doesn't even
look at all of the corpus text, and the "language detection" is a
byproduct not meant for actual language detection.

Given all that, I'm inclined to reject the idea of rolling this into
PHP for fear of just confusing users without actually adding any
value.

Thoughts?

-Sara

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
