On Wed, Dec 07, 2011 at 03:44:13PM +0100, Nick Wellnhofer wrote:
> I can see only two bad things that can happen with invalid UTF-8:
>
> 1. The tokenizer doesn't detect invalid UTF-8, so it will pass it to
> other analyzers, possibly creating even more invalid UTF-8.
I think this is OK. For efficiency reasons, we do not want to require UTF-8
validity at the granularity of the Token during analysis. How about we
establish these rules for the analysis phase?
* Input to the analysis chain must be valid UTF-8.
* Analyzers must be prepared to encounter broken UTF-8 but may either throw
an exception or produce junk.
* Broken UTF-8 emitted by an analysis chain should be detected prior to
Indexer commit (see the sketch after this list).
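To make that last bullet concrete, here's the kind of check I have in mind.
This is just a standalone illustrative sketch, not code from the Lucy tree;
the helper name is invented.

    /* Return true if the first `size` bytes of `ptr` are well-formed
     * UTF-8: correct lead/continuation bytes, no truncated sequences,
     * no overlong forms, no surrogates, nothing above U+10FFFF. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    static bool
    utf8_valid(const uint8_t *ptr, size_t size) {
        size_t i = 0;
        while (i < size) {
            uint8_t  lead = ptr[i];
            size_t   len;
            uint32_t cp;
            if (lead < 0x80)                { i++; continue; }
            else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
            else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
            else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
            else                            { return false; }
            if (i + len > size)             { return false; } /* truncated */
            for (size_t j = 1; j < len; j++) {
                uint8_t cont = ptr[i + j];
                if ((cont & 0xC0) != 0x80)  { return false; }
                cp = (cp << 6) | (cont & 0x3F);
            }
            if (   (len == 2 && cp < 0x80)
                || (len == 3 && cp < 0x800)
                || (len == 4 && cp < 0x10000)
                || (cp >= 0xD800 && cp <= 0xDFFF)
                || cp > 0x10FFFF) {
                return false;   /* overlong, surrogate, or out of range */
            }
            i += len;
        }
        return true;
    }

An Indexer-side check would run something like this over the data headed
into the segment and throw an exception, rather than abort, when it fails.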
For the record, we currently perform a UTF-8 validity check on individual
terms within PostingPool.c (during the CB_Mimic_Str() invocations, which
perform internal UTF-8 sanity checking). This is the right phase for the
check, IMO -- it happens after the terms have been sorted and uniqued, so we
perform the validity check once per unique term rather than, say, once per
Token, as we would if we enforced UTF-8 validity at the end of the analysis
chain.
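Here's a rough sketch of that "once per unique term" structure. It is
hypothetical -- the real code in PostingPool.c operates on serialized
postings rather than a plain string array -- and it reuses the invented
utf8_valid() helper from the sketch above.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* `sorted_terms` must already be sorted, so duplicates are adjacent
     * and each unique term gets validated exactly once. */
    static void
    validate_unique_terms(char **sorted_terms, size_t num_terms) {
        const char *prev = NULL;
        for (size_t i = 0; i < num_terms; i++) {
            if (prev != NULL && strcmp(prev, sorted_terms[i]) == 0) {
                continue;   /* duplicate of the previous term -- skip */
            }
            prev = sorted_terms[i];
            if (!utf8_valid((const uint8_t*)sorted_terms[i],
                            strlen(sorted_terms[i]))) {
                /* The real code would throw a Lucy exception here. */
                fprintf(stderr, "Invalid UTF-8 in term %zu\n", i);
                exit(1);
            }
        }
    }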
> 2. If there's invalid UTF-8 near the end of the input buffer, we might
> read up to three bytes past the end of the buffer.
I think this is OK, too. First, this is only a problem for broken analysis
chains. Second, the typical outcome will be a token with a small amount of
random bogus content, and the Indexer will probably throw an exception prior
to commit anyway rather than leak the content into the index.
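To illustrate the mechanics of that over-read: a decoder that trusts the
lead byte's advertised sequence length, sketched below with invented names,
will reach up to three bytes past the end when the last byte of the buffer
starts a multi-byte sequence.

    #include <stdint.h>

    static uint32_t
    naive_decode(const uint8_t *ptr) {
        uint8_t lead = *ptr;
        if (lead < 0x80) {
            return lead;
        }
        else if ((lead & 0xE0) == 0xC0) {    /* 2-byte sequence */
            return ((uint32_t)(lead & 0x1F) << 6) | (ptr[1] & 0x3F);
        }
        else if ((lead & 0xF0) == 0xE0) {    /* 3-byte sequence */
            return ((uint32_t)(lead & 0x0F) << 12)
                   | ((uint32_t)(ptr[1] & 0x3F) << 6)
                   |  (ptr[2] & 0x3F);
        }
        else {                               /* assume 4-byte sequence */
            /* If `lead` is the buffer's final byte, ptr[1], ptr[2], and
             * ptr[3] read up to three bytes past the end. */
            return ((uint32_t)(lead & 0x07) << 18)
                   | ((uint32_t)(ptr[1] & 0x3F) << 12)
                   | ((uint32_t)(ptr[2] & 0x3F) << 6)
                   |  (ptr[3] & 0x3F);
        }
    }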
Marvin Humphrey