On 07/12/2011 03:56, Marvin Humphrey wrote:
> I also wanted to double check what happens when invalid UTF-8 shows up. It
> looks like the masking that's in place would force any bogus header bytes
> positioned as continuation bytes to be evaluated safely, so no problem there.
> The one thing that isn't clear to me is whether it's impossible to overshoot
> the end of the compressed lookup table arrays. I see that we're covered as
> far as the plane_index table goes:
>
>     if (plane_index >= WB_PLANE_MAP_SIZE) { return 0; }
>     plane_id = wb_plane_map[plane_index];
>
> There aren't boundary checks for the other tables,

The other tables don't need boundary checks because they're indexed
using ids taken from another table, all of which are safe to use.
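
To make that concrete, here's roughly the shape of the lookup chain. Only
wb_plane_map and the WB_* constants come from the generated WordBreak.tab;
the function name, the wb_planes table, and the way plane_index is derived
are assumptions for illustration:

    #include <stdint.h>

    /* Assumed shapes; the real arrays live in the generated WordBreak.tab. */
    extern const uint8_t wb_plane_map[WB_PLANE_MAP_SIZE];
    extern const uint8_t wb_planes[WB_PLANES_SIZE];

    static int
    wb_lookup(uint32_t code_point) {
        /* plane_index is computed straight from the input, so it's the
         * only index that needs a range check. */
        uint32_t plane_index = code_point >> WB_PLANES_SHIFT;
        if (plane_index >= WB_PLANE_MAP_SIZE) { return 0; }

        /* plane_id is read out of wb_plane_map, and the generator only
         * emits ids for which (plane_id << WB_PLANES_SHIFT) + WB_PLANES_MASK
         * stays below WB_PLANES_SIZE, so no further check is needed. */
        uint8_t plane_id = wb_plane_map[plane_index];
        return wb_planes[(plane_id << WB_PLANES_SHIFT)
                         + (code_point & WB_PLANES_MASK)];
    }
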
I can see only two bad things that can happen with invalid UTF-8:
1. The tokenizer doesn't detect invalid UTF-8, so it will pass it to
other analyzers, possibly creating even more invalid UTF-8.
2. If there's invalid UTF-8 near the end of the input buffer, we might
read up to three bytes past the end of the buffer (sketched below).
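
For the second case, here's a minimal sketch (not the tokenizer's actual
decode loop; the function name and loop shape are assumptions, and
continuation bytes aren't validated) of how a multi-byte header near the
end of the buffer causes the overread, and the kind of length check that
would stop it:

    #include <stddef.h>
    #include <stdint.h>

    static uint32_t
    decode_cp(const uint8_t *buf, const uint8_t *end, size_t *consumed) {
        static const uint8_t head_mask[5] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
        uint8_t head = buf[0];
        size_t  len  = 1;
        if      (head >= 0xF0) { len = 4; }
        else if (head >= 0xE0) { len = 3; }
        else if (head >= 0xC0) { len = 2; }

        /* Without this check, a header byte like 0xF0 in the last position
         * of the buffer makes the loop below read 3 bytes past end. */
        if ((size_t)(end - buf) < len) {
            *consumed = (size_t)(end - buf);
            return 0xFFFD;
        }

        uint32_t cp = head & head_mask[len];
        for (size_t i = 1; i < len; i++) {
            cp = (cp << 6) | (buf[i] & 0x3F);  /* masking keeps bogus bytes safe */
        }
        *consumed = len;
        return cp;
    }
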
> but I see that you defined
> a bunch of size-related constants in the autogenerated WordBreak.tab file
> which haven't yet been used:
>
>     #define WB_PLANES_SHIFT 6
>     #define WB_PLANES_MASK 63
>     #define WB_PLANES_SIZE 1472
>
> Perhaps you were already planning to add stuff like this eventually?
>
>     #if (WB_ASCII_SIZE < 128)
>     #error "ASCII word break table too small"
>     #endif

The tables used to have different shift and mask values, hence the SHIFT
and MASK defines. Now the shift is fixed at 6 bits and the defines are
unused.
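
If the defines stay in the generated file, they could at least back
compile-time sanity checks along the lines you suggest, e.g. (just a sketch):

    #if (WB_PLANES_MASK != (1 << WB_PLANES_SHIFT) - 1)
      #error "WB_PLANES_MASK doesn't match WB_PLANES_SHIFT"
    #endif
    #if (WB_PLANES_SIZE % (WB_PLANES_MASK + 1) != 0)
      #error "WB_PLANES_SIZE isn't a whole number of 64-entry blocks"
    #endif
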
Nick