On 07/24/2011 08:16 PM, Abhinav Upadhyay wrote:
> Hi,
>
> I am trying to write my own custom tokenizer to filter stopwords apart
> from doing normalization and stemming. I have gone through the
> comments in fts3_tokenizer.h and also read the implementation of the
> simple tokenizer. While overall I am able to understand what I need to
> do to implement this tokenizer, I still cannot visualize how the
> FTS engine calls the tokenizer and what data in what form it passes to
> it.
>
> Does the FTS engine pass the complete document data to the tokenizer,
> or does it pass some chunks of data, or individual words? I need to
> understand this part because the next function needs to set the
> offsets accordingly. By just going through the code of the simple
> tokenizer I could not completely comprehend it (it would have been
> better if I could debug it).
>
> By the next function I mean this:
>
>   int (*xNext)(
>     sqlite3_tokenizer_cursor *pCursor,   /* Tokenizer cursor */
>     const char **ppToken, int *pnBytes,  /* OUT: Normalized text for token */
>     int *piStartOffset,  /* OUT: Byte offset of token in input buffer */
>     int *piEndOffset,    /* OUT: Byte offset of end of token in input buffer */
>     int *piPosition      /* OUT: Number of tokens returned before this one */
>   );
> };
>
> It would be better if you could explain what is the role of these
> parameters: piEndOffset, piStartOffset?

Each time xNext() returns SQLITE_OK to return a new token, it should set:

  *piStartOffset to the number of bytes in the input buffer before the
  start of the token being returned,

  *piEndOffset to *piStartOffset plus the number of bytes in the token
  text, and

  *piPosition to the number of tokens that occur in the input buffer
  before the token being returned.
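
To make the three output values concrete, here is a minimal sketch of an xNext()-style function. It is not the FTS3 code: it deliberately avoids sqlite3.h, so ToyCursor, toy_open(), toy_next(), toy_is_stopword() and the TOY_OK/TOY_DONE constants are all hypothetical stand-ins for the real cursor type, xOpen(), xNext() and SQLITE_OK/SQLITE_DONE. It assumes ASCII input, splits on non-alphanumeric bytes, lower-cases each token and skips a tiny hard-coded stopword list. As one possible design choice, skipped stopwords are not counted in *piPosition.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Stand-ins for SQLITE_OK / SQLITE_DONE so the sketch builds without sqlite3.h. */
#define TOY_OK   0
#define TOY_DONE 101

/* Hypothetical cursor: the input buffer plus the scan state. */
typedef struct ToyCursor {
  const char *zInput;   /* Input buffer being tokenized */
  int nInput;           /* Size of zInput in bytes */
  int iOffset;          /* Current byte offset into zInput */
  int iToken;           /* Number of tokens returned so far */
  char zToken[64];      /* Normalized copy of the current token */
} ToyCursor;

/* Hypothetical counterpart of xOpen(): remember the buffer, reset the state. */
static void toy_open(ToyCursor *c, const char *zInput){
  c->zInput = zInput;
  c->nInput = (int)strlen(zInput);
  c->iOffset = 0;
  c->iToken = 0;
}

/* Hypothetical stopword test on an already lower-cased token. */
static int toy_is_stopword(const char *z){
  return strcmp(z, "the")==0 || strcmp(z, "a")==0 || strcmp(z, "of")==0;
}

/* Hypothetical counterpart of xNext(): return the next non-stopword token. */
static int toy_next(
  ToyCursor *c,
  const char **ppToken, int *pnBytes,  /* OUT: normalized token text */
  int *piStartOffset,                  /* OUT: bytes in input before the token */
  int *piEndOffset,                    /* OUT: start offset + token length in input */
  int *piPosition                      /* OUT: tokens returned before this one */
){
  assert( c && ppToken && pnBytes && piStartOffset && piEndOffset && piPosition );
  while( c->iOffset < c->nInput ){
    int iStart, n, i;

    /* Skip delimiter bytes in front of the token. */
    while( c->iOffset<c->nInput && !isalnum((unsigned char)c->zInput[c->iOffset]) ){
      c->iOffset++;
    }
    iStart = c->iOffset;

    /* Consume the token bytes. */
    while( c->iOffset<c->nInput && isalnum((unsigned char)c->zInput[c->iOffset]) ){
      c->iOffset++;
    }
    n = c->iOffset - iStart;
    if( n==0 ) break;                  /* Only trailing delimiters were left */

    /* Normalize into the cursor's own buffer (truncated if too long). */
    if( n >= (int)sizeof(c->zToken) ) n = (int)sizeof(c->zToken) - 1;
    for(i=0; i<n; i++){
      c->zToken[i] = (char)tolower((unsigned char)c->zInput[iStart+i]);
    }
    c->zToken[n] = '\0';

    if( toy_is_stopword(c->zToken) ) continue;   /* Filtered: look for the next one */

    *ppToken = c->zToken;
    *pnBytes = n;
    *piStartOffset = iStart;           /* Offsets refer to the original input, */
    *piEndOffset = c->iOffset;         /* not to the normalized copy           */
    *piPosition = c->iToken++;
    return TOY_OK;
  }
  return TOY_DONE;
}
```

For the input "The quick  Fox" (two spaces before "Fox"), the first call skips "The" and returns "quick" with start offset 4, end offset 9 and position 0; the second returns "fox" with start offset 11, end offset 14 and position 1; the third returns TOY_DONE.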