On 06/12/11 20:08, Marvin Humphrey wrote:
> On Tue, Dec 06, 2011 at 02:45:46PM +0100, Nick Wellnhofer wrote:
>> This and similar schemes are widely used in Unicode processing. It isn't
>> too complicated once you wrap your head around it.
>
> Yes, that's the impression I got. The data structures didn't seem hard --
> just a couple of extra derefs, essentially. It's all the bit-twiddling that
> makes it hard to grok -- but that's the price we pay for the efficiency of
> bitwise operations.
>
> Can I ask you to please add some comments to S_wb_lookup() in
> StandardTokenizer.c? Also, longer and more descriptive variable names would
> be helpful -- it's not immediately clear what "i1", "i2", and "i3"
> represent. I'm pretty sure I can tease that lookup code apart if I try, but
> it would be easier with narration -- and it will certainly be easier for
> people who haven't done much hard-core UTF-8 and Unicode coding.

I will try to make that code more descriptive.

>> There's also a brief description in section 5.1 of the Unicode Standard.
>
> Looks like this:
>
> http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
>
>     Optimized Two-Stage Table. Wherever any blocks are identical, the
>     pointers just point to the same block. For transcoding tables, this
>     case occurs generally for a block containing only mappings to the
>     default or “unmappable” character. Instead of using NULL pointers
>     and a default value, one “shared” block of default entries is
>     created. This block is pointed to by all first-stage table entries,
>     for which no character value can be mapped. By avoiding tests and
>     branches, this strategy provides access time that approaches the
>     simple array access, but at a great savings in storage.

It's actually a three-stage table. See the next paragraph:

    Multistage Table Tuning. Given a table of arbitrary size and content,
    it is a relatively simple matter to write a small utility that can
    calculate the optimal number of stages and their width for a
    multistage table. Tuning the number of stages and the width of their
    arrays of index pointers can result in various trade-offs of table
    size versus average access time.

It isn't really made clear, but the whole process can be applied again to
the first-stage table, so you end up with three tables. If you use a 16-bit
and an 8-bit split, you can think of the indices as Unicode planes and rows.
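
To make that concrete, here is a rough sketch of what such a three-stage
lookup can look like in C. The table names and sizes are invented for
illustration; the generated tables in StandardTokenizer.c differ in detail:

    #include <stdint.h>

    /* Hypothetical deduplicated tables. Planes or rows that contain only
     * the default property all point at one shared block, so the lookup
     * needs no NULL checks or branches. */
    extern const uint8_t plane_index[17];   /* stage 1: 17 Unicode planes */
    extern const uint8_t row_index[][256];  /* stage 2: rows of a plane   */
    extern const uint8_t wb_values[][256];  /* stage 3: break properties  */

    static int
    wb_lookup(uint32_t code_point) {
        uint32_t plane  = code_point >> 16;         /* bits 16-20 */
        uint32_t row    = (code_point >> 8) & 0xFF; /* bits 8-15  */
        uint32_t column = code_point & 0xFF;        /* bits 0-7   */
        return wb_values[row_index[plane_index[plane]][row]][column];
    }

Identical rows are stored only once, which is where most of the space
savings comes from.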

>> What I still want to do is to incorporate the word break test cases from
>> the Unicode website:
>>
>> http://www.unicode.org/Public/6.0.0/ucd/auxiliary/WordBreakTest.txt
>
> That would really help. Bitwise code is hard to debug visually. That's one
> reason why Lucy::Object::BitVector and Lucy::Util::NumberUtils have unit
> tests that are somewhat more thorough than those of other Lucy components.

The word break tables are tested for all Unicode characters in
gen_word_break_tables.pl, but that doesn't cover the C code.

I just implemented and committed the tests against the UCD test suite. They
exposed two bugs in the implementation of the actual word break rules, which
I fixed. Now the complete test suite passes.
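
For anyone who hasn't looked at the file: each line of WordBreakTest.txt
lists code points in hex, with a '÷' before positions where a break is
expected and a '×' where no break is allowed; everything after '#' is a
comment. Here is a minimal sketch of a parser for that format, with
invented names and no error handling, not the committed test code:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* UTF-8 encodings of the markers used in WordBreakTest.txt:
     * '÷' (division sign) marks an expected break, '×' (multiplication
     * sign) marks a position where no break may occur. */
    #define BREAK_SIGN    "\xC3\xB7"
    #define NO_BREAK_SIGN "\xC3\x97"

    /* Parse one test line. On return, points[i] holds the i-th code
     * point and breaks[i] is 1 if a break is expected before it;
     * breaks[num_points] records the break after the last code point.
     * The caller supplies arrays large enough for the longest line.
     * Returns the number of code points, or -1 for comment/blank lines. */
    static int
    parse_test_line(const char *line, uint32_t *points, int *breaks) {
        int   num_points   = 0;
        int   expect_break = 0;
        char *copy         = malloc(strlen(line) + 1);
        char *hash;
        char *tok;

        strcpy(copy, line);

        /* Everything after '#' is a human-readable comment. */
        hash = strchr(copy, '#');
        if (hash != NULL) { *hash = '\0'; }

        for (tok = strtok(copy, " \t\r\n"); tok != NULL;
             tok = strtok(NULL, " \t\r\n")
        ) {
            if (strcmp(tok, BREAK_SIGN) == 0)         { expect_break = 1; }
            else if (strcmp(tok, NO_BREAK_SIGN) == 0) { expect_break = 0; }
            else {
                /* A hex code point; record the marker that preceded it. */
                breaks[num_points] = expect_break;
                points[num_points] = (uint32_t)strtoul(tok, NULL, 16);
                num_points++;
            }
        }
        /* The marker after the final code point. */
        breaks[num_points] = expect_break;

        free(copy);
        return num_points > 0 ? num_points : -1;
    }

Each parsed sequence can then be run through the tokenizer and the
boundaries it reports compared against the expected ones.
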
Nick