On 06/12/11 20:08, Marvin Humphrey wrote:
> On Tue, Dec 06, 2011 at 02:45:46PM +0100, Nick Wellnhofer wrote:
>> This and similar schemes are widely used in Unicode processing. It isn't
>> too complicated once you wrap your head around it.
>
> Yes, that's the impression I got. The data structures didn't seem hard --
> just a couple of extra derefs, essentially. It's all the bit-twiddling that
> makes it hard to grok -- but that's the price we pay for the efficiency of
> bitwise operations.
>
> Can I ask you to please add some comments to S_wb_lookup() in
> StandardTokenizer.c? Also, longer and more descriptive variable names would
> be helpful -- it's not immediately clear what "i1", "i2", and "i3"
> represent. I'm pretty sure I can tease that lookup code apart if I try, but
> it would be easier with narration -- and it will certainly be easier for
> people who haven't done much hard-core UTF-8 and Unicode coding.

I will try to make that code more descriptive.

>> There's also a brief description in section 5.1 of the Unicode Standard.
>
> Looks like this:
>
> http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
>
>     Optimized Two-Stage Table. Wherever any blocks are identical, the
>     pointers just point to the same block. For transcoding tables, this
>     case occurs generally for a block containing only mappings to the
>     default or “unmappable” character. Instead of using NULL pointers
>     and a default value, one “shared” block of default entries is
>     created. This block is pointed to by all first-stage table entries,
>     for which no character value can be mapped. By avoiding tests and
>     branches, this strategy provides access time that approaches the
>     simple array access, but at a great savings in storage.

It's actually a three-stage table. See the next paragraph:

    Multistage Table Tuning. Given a table of arbitrary size and content,
    it is a relatively simple matter to write a small utility that can
    calculate the optimal number of stages and their width for a
    multistage table. Tuning the number of stages and the width of their
    arrays of index pointers can result in various trade-offs of table
    size versus average access time.

It isn't really made clear, but the whole process can be applied again to
the first-stage table, so you end up with three tables. If you use a 16-bit
and an 8-bit split, you can think of the indices as Unicode planes and rows.
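
To make that concrete, here is a rough sketch of what such a three-stage
lookup can look like in C. The table names and sizes are invented for
illustration; the generated tables in StandardTokenizer.c differ in detail:

    #include <stdint.h>

    /* Hypothetical deduplicated tables. Planes or rows that contain only
     * the default property all point at one shared block, so the lookup
     * needs no NULL checks or branches. */
    extern const uint8_t plane_index[17];   /* stage 1: 17 Unicode planes */
    extern const uint8_t row_index[][256];  /* stage 2: rows of a plane   */
    extern const uint8_t wb_values[][256];  /* stage 3: break properties  */

    static int
    wb_lookup(uint32_t code_point) {
        uint32_t plane  = code_point >> 16;         /* bits 16-20 */
        uint32_t row    = (code_point >> 8) & 0xFF; /* bits 8-15  */
        uint32_t column = code_point & 0xFF;        /* bits 0-7   */
        return wb_values[row_index[plane_index[plane]][row]][column];
    }

Identical rows are stored only once, which is where most of the space
savings comes from.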

>> What I still want to do is to incorporate the word break test cases from
>> the Unicode website:
>>
>> http://www.unicode.org/Public/6.0.0/ucd/auxiliary/WordBreakTest.txt
>
> That would really help. Bitwise code is hard to debug visually. That's one
> reason why Lucy::Object::BitVector and Lucy::Util::NumberUtils have unit
> tests that are somewhat more thorough than those of other Lucy components.

The word break tables are tested for all Unicode characters in
gen_word_break_tables.pl, but that doesn't cover the C code.

I just implemented and committed the tests against the UCD test suite. They
exposed two bugs in the implementation of the actual word break rules, which
I fixed. Now the complete test suite passes.
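
For anyone who hasn't looked at the file: each line of WordBreakTest.txt
lists code points in hex, with a '÷' before positions where a break is
expected and a '×' where no break is allowed; everything after '#' is a
comment. Here is a minimal sketch of a parser for that format, with
invented names and no error handling, not the committed test code:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* UTF-8 encodings of the markers used in WordBreakTest.txt:
     * '÷' (division sign) marks an expected break, '×' (multiplication
     * sign) marks a position where no break may occur. */
    #define BREAK_SIGN    "\xC3\xB7"
    #define NO_BREAK_SIGN "\xC3\x97"

    /* Parse one test line. On return, points[i] holds the i-th code
     * point and breaks[i] is 1 if a break is expected before it;
     * breaks[num_points] records the break after the last code point.
     * The caller supplies arrays large enough for the longest line.
     * Returns the number of code points, or -1 for comment/blank lines. */
    static int
    parse_test_line(const char *line, uint32_t *points, int *breaks) {
        int   num_points   = 0;
        int   expect_break = 0;
        char *copy         = malloc(strlen(line) + 1);
        char *hash;
        char *tok;

        strcpy(copy, line);

        /* Everything after '#' is a human-readable comment. */
        hash = strchr(copy, '#');
        if (hash != NULL) { *hash = '\0'; }

        for (tok = strtok(copy, " \t\r\n"); tok != NULL;
             tok = strtok(NULL, " \t\r\n")
        ) {
            if (strcmp(tok, BREAK_SIGN) == 0)         { expect_break = 1; }
            else if (strcmp(tok, NO_BREAK_SIGN) == 0) { expect_break = 0; }
            else {
                /* A hex code point; record the marker that preceded it. */
                breaks[num_points] = expect_break;
                points[num_points] = (uint32_t)strtoul(tok, NULL, 16);
                num_points++;
            }
        }
        /* The marker after the final code point. */
        breaks[num_points] = expect_break;

        free(copy);
        return num_points > 0 ? num_points : -1;
    }

Each parsed sequence can then be run through the tokenizer and the
boundaries it reports compared against the expected ones.
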
Nick