Hi,

I am trying to write my own custom tokenizer to filter stopwords in
addition to doing normalization and stemming. I have gone through the
comments in fts3_tokenizer.h and also read the implementation of the
simple tokenizer. While overall I understand what I need to do to
implement this tokenizer, I still cannot visualize how the FTS engine
calls the tokenizer and what data, in what form, it passes to it.

Does the FTS engine pass the complete document data to the tokenizer,
or does it pass chunks of data, or individual words? I need to
understand this part because the xNext function needs to set the
offsets accordingly. Just by reading the code of the simple tokenizer
I could not completely comprehend it (it would have been easier if I
could debug it).

By the next function I mean this:

  int (*xNext)(
    sqlite3_tokenizer_cursor *pCursor,   /* Tokenizer cursor */
    const char **ppToken, int *pnBytes,  /* OUT: Normalized text for token */
    int *piStartOffset,  /* OUT: Byte offset of token in input buffer */
    int *piEndOffset,    /* OUT: Byte offset of end of token in input buffer */
    int *piPosition      /* OUT: Number of tokens returned before this one */
  );

It would also help if you could explain the role of these two
parameters: piStartOffset and piEndOffset.

Thanks
Abhinav
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
