On 06/12/2011 05:16, Marvin Humphrey wrote:
> I didn't grok everything that was being done in the compressed table lookup
> scheme, but your code is as well-documented and easy to follow as anything
> that does that much bit-twiddling possibly could be, and I feel like I could
> dive in and work on it if the need arose.

This and similar schemes are widely used in Unicode processing. It isn't too complicated once you wrap your head around it. There's also a brief description in section 5.1 of the Unicode Standard.

I also made the assumption that the Tokenizer input is valid UTF-8. Is that true?

What I still want to do is to incorporate the word break test cases from the Unicode website:

http://www.unicode.org/Public/6.0.0/ucd/auxiliary/WordBreakTest.txt

I like the way the snowball stemmer tests read test data from JSON files using our own parser, so I'd convert the Unicode tests to JSON with a Perl script. I saw that there's an issue with JSON files and RAT because we can't include a license header. Maybe we should put all Unicode-database-related material (including the word break tables) in a single directory like modules/unicode/ucd, as we do for the snowball stemmer.

Nick
