On Wed, Dec 07, 2011 at 03:44:13PM +0100, Nick Wellnhofer wrote:
> I can see only two bad things that can happen with invalid UTF-8:
>
> 1. The tokenizer doesn't detect invalid UTF-8, so it will pass it to
> other analyzers, possibly creating even more invalid UTF-8.
I think this is OK. For efficiency reasons, we do not want to require UTF-8
validity at the granularity of the Token during analysis. How about we
establish these rules for the analysis phase?
* Input to the analysis chain must be valid UTF-8.
* Analyzers must be prepared to encounter broken UTF-8 but may either throw
an exception or produce junk.
* Broken UTF-8 emitted by an analysis chain should be detected prior to
Indexer commit (see the sketch after this list).
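To make that last bullet concrete, here's the kind of check I have in mind.
This is just a standalone illustrative sketch, not code from the Lucy tree;
the helper name is invented.

    /* Return true if the first `size` bytes of `ptr` are well-formed
     * UTF-8: correct lead/continuation bytes, no truncated sequences,
     * no overlong forms, no surrogates, nothing above U+10FFFF. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    static bool
    utf8_valid(const uint8_t *ptr, size_t size) {
        size_t i = 0;
        while (i < size) {
            uint8_t  lead = ptr[i];
            size_t   len;
            uint32_t cp;
            if (lead < 0x80)                { i++; continue; }
            else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
            else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
            else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
            else                            { return false; }
            if (i + len > size)             { return false; } /* truncated */
            for (size_t j = 1; j < len; j++) {
                uint8_t cont = ptr[i + j];
                if ((cont & 0xC0) != 0x80)  { return false; }
                cp = (cp << 6) | (cont & 0x3F);
            }
            if (   (len == 2 && cp < 0x80)
                || (len == 3 && cp < 0x800)
                || (len == 4 && cp < 0x10000)
                || (cp >= 0xD800 && cp <= 0xDFFF)
                || cp > 0x10FFFF) {
                return false;   /* overlong, surrogate, or out of range */
            }
            i += len;
        }
        return true;
    }

An Indexer-side check would run something like this over the data headed
into the segment and throw an exception, rather than abort, when it fails.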
For the record, we currently perform a UTF-8 validity check on individual
terms within PostingPool.c (during the CB_Mimic_Str() invocations, which
perform internal UTF-8 sanity checking). This is the right phase for the
check, IMO -- it happens after the terms have been sorted and uniqued, so we
perform the validity check once per unique term rather than, say, once per
Token, as we would if we enforced UTF-8 validity at the end of the analysis
chain.
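Here's a rough sketch of that "once per unique term" structure. It is
hypothetical -- the real code in PostingPool.c operates on serialized
postings rather than a plain string array -- and it reuses the invented
utf8_valid() helper from the sketch above.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* `sorted_terms` must already be sorted, so duplicates are adjacent
     * and each unique term gets validated exactly once. */
    static void
    validate_unique_terms(char **sorted_terms, size_t num_terms) {
        const char *prev = NULL;
        for (size_t i = 0; i < num_terms; i++) {
            if (prev != NULL && strcmp(prev, sorted_terms[i]) == 0) {
                continue;   /* duplicate of the previous term -- skip */
            }
            prev = sorted_terms[i];
            if (!utf8_valid((const uint8_t*)sorted_terms[i],
                            strlen(sorted_terms[i]))) {
                /* The real code would throw a Lucy exception here. */
                fprintf(stderr, "Invalid UTF-8 in term %zu\n", i);
                exit(1);
            }
        }
    }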
> 2. If there's invalid UTF-8 near the end of the input buffer, we might
> read up to three bytes past the end of the buffer.
I think this is OK, too. First, this is only a problem for broken analysis
chains. Second, the typical outcome will be a token with a small amount of
random bogus content, and the Indexer will probably throw an exception prior
to commit anyway rather than leak the content into the index.
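To illustrate the mechanics of that over-read: a decoder that trusts the
lead byte's advertised sequence length, sketched below with invented names,
will reach up to three bytes past the end when the last byte of the buffer
starts a multi-byte sequence.

    #include <stdint.h>

    static uint32_t
    naive_decode(const uint8_t *ptr) {
        uint8_t lead = *ptr;
        if (lead < 0x80) {
            return lead;
        }
        else if ((lead & 0xE0) == 0xC0) {    /* 2-byte sequence */
            return ((uint32_t)(lead & 0x1F) << 6) | (ptr[1] & 0x3F);
        }
        else if ((lead & 0xF0) == 0xE0) {    /* 3-byte sequence */
            return ((uint32_t)(lead & 0x0F) << 12)
                   | ((uint32_t)(ptr[1] & 0x3F) << 6)
                   |  (ptr[2] & 0x3F);
        }
        else {                               /* assume 4-byte sequence */
            /* If `lead` is the buffer's final byte, ptr[1], ptr[2], and
             * ptr[3] read up to three bytes past the end. */
            return ((uint32_t)(lead & 0x07) << 18)
                   | ((uint32_t)(ptr[1] & 0x3F) << 12)
                   | ((uint32_t)(ptr[2] & 0x3F) << 6)
                   |  (ptr[3] & 0x3F);
        }
    }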
Marvin Humphrey