On Fri, Jul 8, 2016 at 3:01 AM, Matthias-Christian Ott <o...@mirix.org> wrote:
> On 2016-07-05 18:11, Abhinav Upadhyay wrote:
>> I'm wondering if it is possible to extend the functionality of the
>> porter tokenizer. I would like to use the functionality of the Porter
>> tokenizer but before stemming the token, I want to decide whether the
>> token should be stemmed or not.
>>
>> Do I need to copy the Porter tokenizer and modify it to suit my needs,
>> or is there a better way that minimizes code duplication?
>
> The first argument of the Porter tokenizer is its parent tokenizer. The
> Porter tokenizer calls the parent tokenizer's xTokenize function with an
> xToken callback that stems each token and then forwards it to the xToken
> function originally passed to the Porter tokenizer's xTokenize. So create
> a custom tokenizer, used as that parent, which extracts the original
> xToken function from the xToken member of its pCtx parameter:
>
> typedef struct PorterContext PorterContext;
> struct PorterContext {
>   void *pCtx;
>   int (*xToken)(void *pCtx, int tflags, const char *pToken, int nToken,
>       int iStart, int iEnd);
>   char *aBuf;
> };
>
> typedef struct CustomTokenizer CustomTokenizer;
> struct CustomTokenizer {
>   fts5_tokenizer tokenizer;
>   Fts5Tokenizer *pTokenizer;
> };
>
> typedef struct CustomContext CustomContext;
> struct CustomContext {
>   void *pCtx;
>   int (*xToken)(void *pCtx, int tflags, const char *pToken, int nToken,
>       int iStart, int iEnd);
> };
>
> int customToken(
>   void *pCtx,
>   int tflags,
>   const char *pToken,
>   int nToken,
>   int iStart,
>   int iEnd
> ){
>   CustomContext *c = (CustomContext*)pCtx;
>   PorterContext *p;
>
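>   if( stem ){  /* placeholder: your own decision whether to stem */
>     /* Hand the token to the Porter wrapper, which stems it. */
>     return c->xToken(c->pCtx, tflags, pToken, nToken, iStart, iEnd);
>   }else{
>     /* Bypass the Porter wrapper and call the original xToken. */
>     p = (PorterContext*)c->pCtx;
>     return p->xToken(p->pCtx, tflags, pToken, nToken, iStart, iEnd);
>   }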
> }
>
> int customTokenize(
>   Fts5Tokenizer *pTokenizer,
>   void *pCtx,
>   int flags,
>   const char *pText,
>   int nText,
>   int (*xToken)(void *, int, const char *, int nToken, int iStart,
>       int iEnd)
> ){
>   CustomTokenizer *t = (CustomTokenizer*)pTokenizer;
>   CustomContext sCtx;
>   sCtx.pCtx = pCtx;
>   sCtx.xToken = xToken;
>   return t->tokenizer.xTokenize(t->pTokenizer, (void*)&sCtx, flags,
>       pText, nText, customToken);
> }
>
> Note that you are accessing an internal struct and relying on
> implementation details, and therefore have to check with every release
> whether the struct or any other relevant implementation detail changed.
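
To see the callback chain described above in isolation, here is a minimal
self-contained sketch that does not use SQLite at all. Everything in it is
an assumption made for illustration: the names (MockStemCtx, mockStemToken,
shouldStem, Sink, and so on) are hypothetical, the callback signature is
trimmed to (context, token, length), and the "stemmer" is a toy that only
drops a trailing 's', not the Porter algorithm.

```c
#include <stddef.h>
#include <string.h>

/* Trimmed, hypothetical stand-in for the FTS5 xToken callback type. */
typedef int (*TokenFn)(void *pCtx, const char *pToken, int nToken);

/* Stand-in for the stemming tokenizer's private context: it remembers
** the caller's context and callback so the wrapper can forward to them. */
typedef struct MockStemCtx { void *pCtx; TokenFn xToken; } MockStemCtx;

/* Context of the custom tokenizer: what the stemming layer handed it. */
typedef struct CustomCtx { void *pCtx; TokenFn xToken; } CustomCtx;

/* Final consumer: collects tokens into a space-separated string. */
typedef struct Sink { char buf[128]; size_t n; } Sink;

static int sinkToken(void *pCtx, const char *pToken, int nToken){
  Sink *s = pCtx;
  if( s->n + (size_t)nToken + 2 > sizeof s->buf ) return 1;
  if( s->n ) s->buf[s->n++] = ' ';
  memcpy(s->buf + s->n, pToken, (size_t)nToken);
  s->n += (size_t)nToken;
  s->buf[s->n] = '\0';
  return 0;
}

/* Toy stemming wrapper: drops one trailing 's', then forwards. */
static int mockStemToken(void *pCtx, const char *pToken, int nToken){
  MockStemCtx *p = pCtx;
  if( nToken > 1 && pToken[nToken-1]=='s' ) nToken--;
  return p->xToken(p->pCtx, pToken, nToken);
}

/* Hypothetical policy: leave capitalized tokens unstemmed. */
static int shouldStem(const char *pToken, int nToken){
  return !(nToken > 0 && pToken[0] >= 'A' && pToken[0] <= 'Z');
}

/* Either go through the stemming wrapper or reach past it. */
static int customToken(void *pCtx, const char *pToken, int nToken){
  CustomCtx *c = pCtx;
  MockStemCtx *p;
  if( shouldStem(pToken, nToken) ){
    return c->xToken(c->pCtx, pToken, nToken);
  }
  p = c->pCtx;  /* relies on knowing the wrapper's context layout */
  return p->xToken(p->pCtx, pToken, nToken);
}

/* Innermost tokenizer: splits on single spaces. */
static int splitTokenize(void *pCtx, const char *pText, int nText,
                         TokenFn xToken){
  int i = 0;
  while( i < nText ){
    int iStart;
    while( i < nText && pText[i]==' ' ) i++;
    iStart = i;
    while( i < nText && pText[i]!=' ' ) i++;
    if( i > iStart ){
      int rc = xToken(pCtx, pText + iStart, i - iStart);
      if( rc ) return rc;
    }
  }
  return 0;
}

/* Custom tokenizer: parent of the stemmer, child of the splitter. */
static int customTokenize(void *pCtx, const char *pText, int nText,
                          TokenFn xToken){
  CustomCtx c;
  c.pCtx = pCtx;
  c.xToken = xToken;
  return splitTokenize(&c, pText, nText, customToken);
}

/* Stemming tokenizer: wraps the caller's callback, calls its parent. */
static int stemTokenize(void *pCtx, const char *pText, int nText,
                        TokenFn xToken){
  MockStemCtx p;
  p.pCtx = pCtx;
  p.xToken = xToken;
  return customTokenize(&p, pText, nText, mockStemToken);
}
```

Calling stemTokenize(&sink, "cats Paris dogs", 15, sinkToken) on a zeroed
Sink leaves "cat Paris dog" in sink.buf: "Paris" skips the toy stemmer
because customToken reaches past the wrapper's context, which is exactly
the part that depends on the wrapper's internal layout.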

Thanks for the detailed response. I think this would work, but we are
currently using FTS4. The ability to call a parent tokenizer is
really what I needed, but I don't think this is possible with FTS4?

-
Abhinav