Re: [sqlite] [FTS3] Understanding the Flow of data through the tokenizer

2011-07-25 Thread Abhinav Upadhyay
On Mon, Jul 25, 2011 at 9:54 AM, Dan Kennedy  wrote:
> On 07/24/2011 08:16 PM, Abhinav Upadhyay wrote:
>> Hi,
>>
>> I am trying to write my own custom tokenizer that filters stopwords in
>> addition to doing normalization and stemming. I have gone through the
>> comments in fts3_tokenizer.h and also read the implementation of the
>> simple tokenizer. While overall I understand what I need to do to
>> implement this tokenizer, I still cannot visualize how the FTS engine
>> calls the tokenizer and what data, in what form, it passes to it.
>>
>> Does the FTS engine pass the complete document data to the tokenizer,
>> or does it pass chunks of data, or individual words? I need to
>> understand this part because the next function needs to set the
>> offsets accordingly. Just by going through the code of the simple
>> tokenizer I could not completely comprehend it (it would have been
>> better if I could have debugged it).
>>
>> By the next function I mean this:
>>
>>   int (*xNext)(
>>     sqlite3_tokenizer_cursor *pCursor,   /* Tokenizer cursor */
>>     const char **ppToken, int *pnBytes,  /* OUT: Normalized text for token */
>>     int *piStartOffset,  /* OUT: Byte offset of token in input buffer */
>>     int *piEndOffset,    /* OUT: Byte offset of end of token in input buffer */
>>     int *piPosition      /* OUT: Number of tokens returned before this one */
>>   );
>> };
>>
>> It would be great if you could explain the role of these parameters:
>> piStartOffset and piEndOffset.
>
> Each time xNext() returns SQLITE_OK to return a new token, xNext()
> should set:
>
>   *piStartOffset to the number of bytes in the input buffer before
>   the start of the token being returned,
>
>   *piEndOffset to *piStartOffset plus the number of bytes in the
>   token text, and
>
>   *piPosition to the number of tokens that occur in the input buffer
>   before the token being returned.

Thanks for the explanation. I was able to correct my implementation :-)


Re: [sqlite] [FTS3] Understanding the Flow of data through the tokenizer

2011-07-24 Thread Dan Kennedy
On 07/24/2011 08:16 PM, Abhinav Upadhyay wrote:
> Hi,
>
> I am trying to write my own custom tokenizer that filters stopwords in
> addition to doing normalization and stemming. I have gone through the
> comments in fts3_tokenizer.h and also read the implementation of the
> simple tokenizer. While overall I understand what I need to do to
> implement this tokenizer, I still cannot visualize how the FTS engine
> calls the tokenizer and what data, in what form, it passes to it.
>
> Does the FTS engine pass the complete document data to the tokenizer,
> or does it pass chunks of data, or individual words? I need to
> understand this part because the next function needs to set the
> offsets accordingly. Just by going through the code of the simple
> tokenizer I could not completely comprehend it (it would have been
> better if I could have debugged it).
>
> By the next function I mean this:
>
>   int (*xNext)(
>     sqlite3_tokenizer_cursor *pCursor,   /* Tokenizer cursor */
>     const char **ppToken, int *pnBytes,  /* OUT: Normalized text for token */
>     int *piStartOffset,  /* OUT: Byte offset of token in input buffer */
>     int *piEndOffset,    /* OUT: Byte offset of end of token in input buffer */
>     int *piPosition      /* OUT: Number of tokens returned before this one */
>   );
> };
>
> It would be great if you could explain the role of these parameters:
> piStartOffset and piEndOffset.

Each time xNext() returns SQLITE_OK to return a new token, xNext()
should set:

   *piStartOffset to the number of bytes in the input buffer before
   the start of the token being returned,

   *piEndOffset to *piStartOffset plus the number of bytes in the
   token text, and

   *piPosition to the number of tokens that occur in the input buffer
   before the token being returned.


[sqlite] [FTS3] Understanding the Flow of data through the tokenizer

2011-07-24 Thread Abhinav Upadhyay
Hi,

I am trying to write my own custom tokenizer that filters stopwords in
addition to doing normalization and stemming. I have gone through the
comments in fts3_tokenizer.h and also read the implementation of the
simple tokenizer. While overall I understand what I need to do to
implement this tokenizer, I still cannot visualize how the FTS engine
calls the tokenizer and what data, in what form, it passes to it.

Does the FTS engine pass the complete document data to the tokenizer,
or does it pass chunks of data, or individual words? I need to
understand this part because the next function needs to set the
offsets accordingly. Just by going through the code of the simple
tokenizer I could not completely comprehend it (it would have been
better if I could have debugged it).

By the next function I mean this:

  int (*xNext)(
    sqlite3_tokenizer_cursor *pCursor,   /* Tokenizer cursor */
    const char **ppToken, int *pnBytes,  /* OUT: Normalized text for token */
    int *piStartOffset,  /* OUT: Byte offset of token in input buffer */
    int *piEndOffset,    /* OUT: Byte offset of end of token in input buffer */
    int *piPosition      /* OUT: Number of tokens returned before this one */
  );
};

It would be great if you could explain the role of these parameters:
piStartOffset and piEndOffset.

Thanks
Abhinav
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users