On 07/12/2011 03:56, Marvin Humphrey wrote:
> I also wanted to double check what happens when invalid UTF-8 shows up. It
> looks like the masking that's in place would force any bogus header bytes
> positioned as continuation bytes to be evaluated safely, so no problem there.
> The one thing that isn't clear to me is whether it's impossible to overshoot
> the end of the compressed lookup table arrays. I see that we're covered as
> far as the plane_index table goes:
>
>     if (plane_index >= WB_PLANE_MAP_SIZE) { return 0; }
>     plane_id = wb_plane_map[plane_index];
>
> There aren't boundary checks for the other tables,

The other tables don't need boundary checks because they're indexed
using ids taken from another table, all of which are safe to use.
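
To make that concrete, here's roughly the shape of the lookup chain. Only
wb_plane_map and the WB_* constants come from the generated WordBreak.tab;
the function name, the wb_planes table, and the way plane_index is derived
are assumptions for illustration:

    #include <stdint.h>

    /* Assumed shapes; the real arrays live in the generated WordBreak.tab. */
    extern const uint8_t wb_plane_map[WB_PLANE_MAP_SIZE];
    extern const uint8_t wb_planes[WB_PLANES_SIZE];

    static int
    wb_lookup(uint32_t code_point) {
        /* plane_index is computed straight from the input, so it's the
         * only index that needs a range check. */
        uint32_t plane_index = code_point >> WB_PLANES_SHIFT;
        if (plane_index >= WB_PLANE_MAP_SIZE) { return 0; }

        /* plane_id is read out of wb_plane_map, and the generator only
         * emits ids for which (plane_id << WB_PLANES_SHIFT) + WB_PLANES_MASK
         * stays below WB_PLANES_SIZE, so no further check is needed. */
        uint8_t plane_id = wb_plane_map[plane_index];
        return wb_planes[(plane_id << WB_PLANES_SHIFT)
                         + (code_point & WB_PLANES_MASK)];
    }
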
I can see only two bad things that can happen with invalid UTF-8:
1. The tokenizer doesn't detect invalid UTF-8, so it will pass it to
other analyzers, possibly creating even more invalid UTF-8.
2. If there's invalid UTF-8 near the end of the input buffer, we might
read up to three bytes past the end of the buffer (sketched below).
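
For the second case, here's a minimal sketch (not the tokenizer's actual
decode loop; the function name and loop shape are assumptions, and
continuation bytes aren't validated) of how a multi-byte header near the
end of the buffer causes the overread, and the kind of length check that
would stop it:

    #include <stddef.h>
    #include <stdint.h>

    static uint32_t
    decode_cp(const uint8_t *buf, const uint8_t *end, size_t *consumed) {
        static const uint8_t head_mask[5] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
        uint8_t head = buf[0];
        size_t  len  = 1;
        if      (head >= 0xF0) { len = 4; }
        else if (head >= 0xE0) { len = 3; }
        else if (head >= 0xC0) { len = 2; }

        /* Without this check, a header byte like 0xF0 in the last position
         * of the buffer makes the loop below read 3 bytes past end. */
        if ((size_t)(end - buf) < len) {
            *consumed = (size_t)(end - buf);
            return 0xFFFD;
        }

        uint32_t cp = head & head_mask[len];
        for (size_t i = 1; i < len; i++) {
            cp = (cp << 6) | (buf[i] & 0x3F);  /* masking keeps bogus bytes safe */
        }
        *consumed = len;
        return cp;
    }
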
> but I see that you defined
> a bunch of size-related constants in the autogenerated WordBreak.tab file
> which haven't yet been used:
>
>     #define WB_PLANES_SHIFT 6
>     #define WB_PLANES_MASK 63
>     #define WB_PLANES_SIZE 1472
>
> Perhaps you were already planning to add stuff like this eventually?
>
>     #if (WB_ASCII_SIZE < 128)
>     #error "ASCII word break table too small"
>     #endif

The tables used to have different shift and mask values, hence the SHIFT
and MASK defines. Now the shift is fixed at 6 bits and the defines are
unused.
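
If the defines stay in the generated file, they could at least back
compile-time sanity checks along the lines you suggest, e.g. (just a sketch):

    #if (WB_PLANES_MASK != (1 << WB_PLANES_SHIFT) - 1)
      #error "WB_PLANES_MASK doesn't match WB_PLANES_SHIFT"
    #endif
    #if (WB_PLANES_SIZE % (WB_PLANES_MASK + 1) != 0)
      #error "WB_PLANES_SIZE isn't a whole number of 64-entry blocks"
    #endif
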
Nick