On 06/12/2011 05:16, Marvin Humphrey wrote:
> I didn't grok everything that was being done in the compressed table lookup
> scheme, but your code is as well-documented and easy to follow as anything
> that does that much bit-twiddling possibly could be, and I feel like I could
> dive in and work on it if the need arose.

This and similar schemes are widely used in Unicode processing. It isn't too complicated once you wrap your head around it. There's also a brief description in section 5.1 of the Unicode Standard.

I also made the assumption that the Tokenizer input is valid UTF-8. Is that true?

What I still want to do is to incorporate the word break test cases from the Unicode website:

http://www.unicode.org/Public/6.0.0/ucd/auxiliary/WordBreakTest.txt

I like the way the snowball stemmer tests read test data from JSON files using our own parser, so I'd convert the Unicode tests to JSON with a Perl script. I saw that there's an issue with JSON files and RAT because we can't include a license header. Maybe we should put all Unicode-database-related material (including the word break tables) in a single directory like modules/unicode/ucd, as we do for the snowball stemmer.

Nick
