On 08/12/11 02:59, Marvin Humphrey wrote:
> I think this is OK.  For efficiency reasons, we do not want to require UTF-8
> validity at the granularity of the Token during analysis.  How about we
> establish these rules for the analysis phase?
>
>    * Input to the analysis chain must be valid UTF-8.
>    * Analyzers must be prepared to encounter broken UTF-8 but may either throw
>      an exception or produce junk.
>    * Broken UTF-8 emitted by an analysis chain should be detected prior to
>      Indexer commit.

Sounds reasonable.
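The third rule implies a validation pass over token text before commit. A minimal from-scratch sketch of such a check, following the well-formedness table in the Unicode Standard (the function name is hypothetical, not Lucy's actual API):

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if buf[0..len) is well-formed UTF-8, 0 otherwise.
 * Rejects truncated sequences, stray continuation bytes, overlong
 * forms, surrogate code points, and values above U+10FFFF. */
static int
utf8_valid(const uint8_t *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = buf[i];
        if (b < 0x80) {                          /* ASCII */
            i += 1;
        }
        else if ((b & 0xE0) == 0xC0) {           /* 2-byte sequence */
            if (i + 1 >= len
                || (buf[i+1] & 0xC0) != 0x80
                || b < 0xC2) {                   /* C0/C1 are overlong */
                return 0;
            }
            i += 2;
        }
        else if ((b & 0xF0) == 0xE0) {           /* 3-byte sequence */
            if (i + 2 >= len
                || (buf[i+1] & 0xC0) != 0x80
                || (buf[i+2] & 0xC0) != 0x80
                || (b == 0xE0 && buf[i+1] < 0xA0)   /* overlong   */
                || (b == 0xED && buf[i+1] > 0x9F)) { /* surrogates */
                return 0;
            }
            i += 3;
        }
        else if ((b & 0xF8) == 0xF0) {           /* 4-byte sequence */
            if (i + 3 >= len
                || (buf[i+1] & 0xC0) != 0x80
                || (buf[i+2] & 0xC0) != 0x80
                || (buf[i+3] & 0xC0) != 0x80
                || (b == 0xF0 && buf[i+1] < 0x90)   /* overlong     */
                || b > 0xF4                          /* > U+10FFFF   */
                || (b == 0xF4 && buf[i+1] > 0x8F)) {
                return 0;
            }
            i += 4;
        }
        else {
            return 0;  /* stray continuation byte or invalid lead byte */
        }
    }
    return 1;
}
```

Running this once per token buffer just before commit would catch junk emitted by a broken analysis chain without adding per-Token overhead during analysis.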

> 2. If there's invalid UTF-8 near the end of the input buffer, we might
> read up to three bytes past the end of the buffer.
>
> I think this is OK, too.  First, this is only a problem for broken analysis
> chains.  Second, the typical outcome will be a token with a small amount of
> random bogus content, and the Indexer will probably throw an exception prior
> to commit anyway rather than leak the content into the index.

But reading past the end of the buffer might cause a segfault, not just bogus token content.  So if we want to follow the rules above, we should guard against the overread itself.
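The guard amounts to bounds-checking the decoder: before touching any continuation bytes, confirm they actually lie within the buffer, and degrade to a replacement value otherwise.  A hypothetical sketch (the function name and error convention are mine, not Lucy's; it also skips overlong/surrogate checks, since the point here is only the bounds guard):

```c
#include <stddef.h>
#include <stdint.h>

#define UTF8_DECODE_ERROR 0xFFFFFFFFu

/* Decode one code point starting at buf[*pos], never reading at or
 * beyond buf + len.  On malformed or truncated input, consume what we
 * safely can and return UTF8_DECODE_ERROR, so a broken analysis chain
 * produces junk instead of a segfault. */
static uint32_t
utf8_next(const uint8_t *buf, size_t len, size_t *pos) {
    uint8_t  b = buf[*pos];
    size_t   needed;   /* number of continuation bytes */
    uint32_t cp;

    if (b < 0x80)                { *pos += 1; return b; }
    else if ((b & 0xE0) == 0xC0) { needed = 1; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { needed = 2; cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { needed = 3; cp = b & 0x07; }
    else                         { *pos += 1; return UTF8_DECODE_ERROR; }

    /* The crucial guard: refuse to read continuation bytes that lie
     * past the end of the buffer. */
    if (*pos + needed >= len) { *pos = len; return UTF8_DECODE_ERROR; }

    for (size_t i = 1; i <= needed; i++) {
        uint8_t cont = buf[*pos + i];
        if ((cont & 0xC0) != 0x80) { *pos += 1; return UTF8_DECODE_ERROR; }
        cp = (cp << 6) | (cont & 0x3F);
    }
    *pos += needed + 1;
    return cp;
}
```

With that check in place, invalid UTF-8 near the end of the buffer yields at worst a token with an error marker in it, which the pre-commit validation can then reject.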

Nick
