Hi, I am trying to write my own custom tokenizer that filters stopwords in addition to doing normalization and stemming. I have gone through the comments in fts3_tokenizer.h and also read the implementation of the simple tokenizer. While I broadly understand what I need to do to implement the tokenizer, I still cannot visualize how the FTS engine calls it, or what data, in what form, it passes to it.
Does the FTS engine pass the complete document text to the tokenizer, or does it pass chunks of data, or individual words? I need to understand this because the xNext function has to set the offsets accordingly. Just reading the code of the simple tokenizer, I could not completely comprehend it (it would have been better if I could debug it). By the xNext function I mean this:

  int (*xNext)(
    sqlite3_tokenizer_cursor *pCursor,  /* Tokenizer cursor */
    const char **ppToken, int *pnBytes, /* OUT: Normalized text for token */
    int *piStartOffset, /* OUT: Byte offset of token in input buffer */
    int *piEndOffset,   /* OUT: Byte offset of end of token in input buffer */
    int *piPosition     /* OUT: Number of tokens returned before this one */
  );

It would be great if you could explain the role of these parameters: piStartOffset and piEndOffset.

Thanks,
Abhinav
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users