Re: [sqlite] [FTS3] Understanding the Flow of data through the tokenizer
On Mon, Jul 25, 2011 at 9:54 AM, Dan Kennedy wrote:
> On 07/24/2011 08:16 PM, Abhinav Upadhyay wrote:
>> Hi,
>>
>> I am trying to write my own custom tokenizer to filter stopwords, in
>> addition to doing normalization and stemming. I have gone through the
>> comments in fts3_tokenizer.h and also read the implementation of the
>> simple tokenizer. While overall I understand what I need to do to
>> implement this tokenizer, I still cannot visualize how the FTS engine
>> calls the tokenizer, and what data, in what form, it passes to it.
>>
>> Does the FTS engine pass the complete document data to the tokenizer,
>> or does it pass chunks of data, or individual words? I need to
>> understand this part because the next function needs to set the
>> offsets accordingly. By just reading the code of the simple tokenizer
>> I could not completely comprehend it (it would have been better if I
>> could debug it).
>>
>> By the next function I mean this:
>>
>>   int (*xNext)(
>>     sqlite3_tokenizer_cursor *pCursor,  /* Tokenizer cursor */
>>     const char **ppToken, int *pnBytes, /* OUT: Normalized text for token */
>>     int *piStartOffset, /* OUT: Byte offset of token in input buffer */
>>     int *piEndOffset,   /* OUT: Byte offset of end of token in input buffer */
>>     int *piPosition     /* OUT: Number of tokens returned before this one */
>>   );
>>
>> It would be better if you could explain the role of these parameters:
>> piStartOffset and piEndOffset?
>
> Each time xNext() returns SQLITE_OK to return a new token, xNext()
> should set:
>
>   *piStartOffset to the number of bytes in the input buffer before the
>   start of the token being returned,
>
>   *piEndOffset to *piStartOffset plus the number of bytes in the
>   token text, and
>
>   *piPosition to the number of tokens that occur in the input buffer
>   before the token being returned.

Thanks for the explanation. I was able to correct my implementation :-) .
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] [FTS3] Understanding the Flow of data through the tokenizer
On 07/24/2011 08:16 PM, Abhinav Upadhyay wrote:
> Hi,
>
> I am trying to write my own custom tokenizer to filter stopwords, in
> addition to doing normalization and stemming. I have gone through the
> comments in fts3_tokenizer.h and also read the implementation of the
> simple tokenizer. While overall I understand what I need to do to
> implement this tokenizer, I still cannot visualize how the FTS engine
> calls the tokenizer, and what data, in what form, it passes to it.
>
> Does the FTS engine pass the complete document data to the tokenizer,
> or does it pass chunks of data, or individual words? I need to
> understand this part because the next function needs to set the
> offsets accordingly. By just reading the code of the simple tokenizer
> I could not completely comprehend it (it would have been better if I
> could debug it).
>
> By the next function I mean this:
>
>   int (*xNext)(
>     sqlite3_tokenizer_cursor *pCursor,  /* Tokenizer cursor */
>     const char **ppToken, int *pnBytes, /* OUT: Normalized text for token */
>     int *piStartOffset, /* OUT: Byte offset of token in input buffer */
>     int *piEndOffset,   /* OUT: Byte offset of end of token in input buffer */
>     int *piPosition     /* OUT: Number of tokens returned before this one */
>   );
>
> It would be better if you could explain the role of these parameters:
> piStartOffset and piEndOffset?

Each time xNext() returns SQLITE_OK to return a new token, xNext()
should set:

  *piStartOffset to the number of bytes in the input buffer before the
  start of the token being returned,

  *piEndOffset to *piStartOffset plus the number of bytes in the
  token text, and

  *piPosition to the number of tokens that occur in the input buffer
  before the token being returned.
[sqlite] [FTS3] Understanding the Flow of data through the tokenizer
Hi,

I am trying to write my own custom tokenizer to filter stopwords, in
addition to doing normalization and stemming. I have gone through the
comments in fts3_tokenizer.h and also read the implementation of the
simple tokenizer. While overall I understand what I need to do to
implement this tokenizer, I still cannot visualize how the FTS engine
calls the tokenizer, and what data, in what form, it passes to it.

Does the FTS engine pass the complete document data to the tokenizer,
or does it pass chunks of data, or individual words? I need to
understand this part because the next function needs to set the
offsets accordingly. By just reading the code of the simple tokenizer
I could not completely comprehend it (it would have been better if I
could debug it).

By the next function I mean this:

  int (*xNext)(
    sqlite3_tokenizer_cursor *pCursor,  /* Tokenizer cursor */
    const char **ppToken, int *pnBytes, /* OUT: Normalized text for token */
    int *piStartOffset, /* OUT: Byte offset of token in input buffer */
    int *piEndOffset,   /* OUT: Byte offset of end of token in input buffer */
    int *piPosition     /* OUT: Number of tokens returned before this one */
  );

It would be better if you could explain the role of these parameters:
piStartOffset and piEndOffset?

Thanks
Abhinav