Re: [sqlite] [FTS3] Understanding the Flow of data through the tokenizer

Dan Kennedy Sun, 24 Jul 2011 21:24:18 -0700

On 07/24/2011 08:16 PM, Abhinav Upadhyay wrote:
> Hi,
>
> I am trying to write my own custom tokenizer to filter stopwords apart
> from doing normalization and stemming. I have gone through the
> comments in fts3_tokenizer.h and also read the implementation of the
> simple tokenizer. Overall I understand what I need to do to
> implement this tokenizer, but I still cannot visualize how the
> FTS engine calls the tokenizer and what data, in what form, it
> passes to it.
>
> Does the FTS engine pass the complete document data to the tokenizer,
> or does it pass chunks of data or individual words? I need to
> understand this part because the next function needs to set the
> offsets accordingly. By just going through the code of the simple
> tokenizer I could not completely comprehend it (it would have been
> better if I could debug it).
>
> By the next function I mean this:
>
>    int (*xNext)(
>      sqlite3_tokenizer_cursor *pCursor,   /* Tokenizer cursor */
>      const char **ppToken, int *pnBytes,  /* OUT: Normalized text for token */
>      int *piStartOffset,  /* OUT: Byte offset of token in input buffer */
>      int *piEndOffset,    /* OUT: Byte offset of end of token in input buffer */
>      int *piPosition      /* OUT: Number of tokens returned before this one */
>    );
>
> It would help if you could explain the role of these parameters:
> piStartOffset and piEndOffset.


Each time xNext() returns SQLITE_OK to return a new token, xNext()
should set:

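   *piStartOffset to the number of bytes in the input buffer before
   the start of the token being returned,

   *piEndOffset to *piStartOffset plus the number of bytes the token
   occupies in the input buffer (not the length of the normalized
   text in *ppToken, which can differ after stemming), and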

   *piPosition to the number of tokens that occur in the input buffer
   before the token being returned.
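
To make the three values concrete, here is a rough standalone sketch of
the same contract. It does not use the SQLite headers: DemoCursor,
demoOpen(), demoNext(), DEMO_OK and DEMO_DONE are made-up names that
only stand in for the cursor, xOpen(), xNext(), SQLITE_OK and
SQLITE_DONE. As far as I read fts3_tokenizer.h, xOpen() is handed the
whole buffer to tokenize (pInput/nBytes) and xNext() is then called
repeatedly on that cursor, so all offsets are relative to that buffer.
The sketch assumes ASCII input and treats any non-alphanumeric byte as
a delimiter; it only folds case and does no stemming or stopword
filtering.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define DEMO_OK   0    /* stands in for SQLITE_OK   */
#define DEMO_DONE 101  /* stands in for SQLITE_DONE */

/* Hypothetical cursor: one per input buffer handed to the tokenizer. */
typedef struct DemoCursor {
  const char *zInput;  /* Input buffer being tokenized */
  int nInput;          /* Size of zInput in bytes */
  int iOffset;         /* Current byte offset into zInput */
  int iToken;          /* Number of tokens returned so far */
  char zToken[64];     /* Scratch space for the normalized token */
} DemoCursor;

/* Begin tokenizing zInput (the role xOpen() plays in a real tokenizer). */
static void demoOpen(DemoCursor *pCsr, const char *zInput){
  assert( pCsr!=NULL && zInput!=NULL );
  pCsr->zInput = zInput;
  pCsr->nInput = (int)strlen(zInput);
  pCsr->iOffset = 0;
  pCsr->iToken = 0;
}

/* Return the next token, mirroring the output parameters of xNext(). */
static int demoNext(
  DemoCursor *pCsr,
  const char **ppToken, int *pnBytes,  /* OUT: normalized token text */
  int *piStartOffset,                  /* OUT: first byte of token in input */
  int *piEndOffset,                    /* OUT: first byte past the token */
  int *piPosition                      /* OUT: tokens returned before this one */
){
  const char *z = pCsr->zInput;
  int i = pCsr->iOffset;
  int iStart, n, j;

  /* Skip delimiter bytes that precede the next token. */
  while( i<pCsr->nInput && !isalnum((unsigned char)z[i]) ) i++;
  if( i>=pCsr->nInput ){
    pCsr->iOffset = i;
    return DEMO_DONE;
  }

  /* Consume the token bytes. */
  iStart = i;
  while( i<pCsr->nInput && isalnum((unsigned char)z[i]) ) i++;

  /* Normalize: fold to lower case, truncating to the scratch buffer. */
  n = i - iStart;
  if( n>(int)sizeof(pCsr->zToken) ) n = (int)sizeof(pCsr->zToken);
  for(j=0; j<n; j++){
    pCsr->zToken[j] = (char)tolower((unsigned char)z[iStart+j]);
  }

  *ppToken = pCsr->zToken;
  *pnBytes = n;
  *piStartOffset = iStart;         /* offsets refer to the ORIGINAL input */
  *piEndOffset = i;
  *piPosition = pCsr->iToken++;    /* 0 for the first token, then 1, 2, ... */
  pCsr->iOffset = i;
  return DEMO_OK;
}
```

For the input "Hello, FTS3 world" this returns "hello" with offsets
0..5 at position 0, "fts3" with offsets 7..11 at position 1, and
"world" with offsets 12..17 at position 2, then DEMO_DONE.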