On Fri, Jul 8, 2016 at 3:01 AM, Matthias-Christian Ott <o...@mirix.org> wrote:
> On 2016-07-05 18:11, Abhinav Upadhyay wrote:
>> I'm wondering if it is possible to extend the functionality of the
>> porter tokenizer. I would like to use the functionality of the Porter
>> tokenizer but before stemming the token, I want to decide whether the
>> token should be stemmed or not.
>>
>> Do I need to copy the Porter tokenizer and modify it to suit my needs,
>> or is there a better way to minimize code duplication?
>
> The first argument of the Porter tokenizer is its parent tokenizer. The
> Porter tokenizer calls the parent tokenizer's xTokenize function with an
> xToken function that wraps the xToken function that was passed to the
> xTokenize function of the Porter tokenizer and stems the tokens passed
> to it. So create a custom tokenizer that extracts the original xToken
> function from the xToken member of its pCtx parameter:
>
> typedef struct PorterContext PorterContext;
> struct PorterContext {
>   void *pCtx;
>   int (*xToken)(void *pCtx, int tflags, const char *pToken, int nToken,
>                 int iStart, int iEnd);
>   char *aBuf;
> };
>
> typedef struct CustomTokenizer CustomTokenizer;
> struct CustomTokenizer {
>   fts5_tokenizer tokenizer;
>   Fts5Tokenizer *pTokenizer;
> };
>
> typedef struct CustomContext CustomContext;
> struct CustomContext {
>   void *pCtx;
>   int (*xToken)(void *pCtx, int tflags, const char *pToken, int nToken,
>                 int iStart, int iEnd);
> };
>
> int customToken(
>   void *pCtx,
>   int tflags,
>   const char *pToken,
>   int nToken,
>   int iStart,
>   int iEnd
> ){
>   CustomContext *c = (CustomContext*)pCtx;
>   PorterContext *p;
>
>   /* "stem" stands for your own decision about this token */
>   if( stem ){
>     /* hand the token to the Porter wrapper, which stems it */
>     return c->xToken(c->pCtx, tflags, pToken, nToken, iStart, iEnd);
>   }else{
>     /* bypass the stemmer and call the original xToken directly */
>     p = (PorterContext*)c->pCtx;
>     return p->xToken(p->pCtx, tflags, pToken, nToken, iStart, iEnd);
>   }
> }
>
> int customTokenize(
>   Fts5Tokenizer *pTokenizer,
>   void *pCtx,
>   int flags,
>   const char *pText,
>   int nText,
>   int (*xToken)(void *, int, const char *, int nToken, int iStart,
>                 int iEnd)
> ){
>   CustomTokenizer *t = (CustomTokenizer*)pTokenizer;
>   CustomContext sCtx;
>   sCtx.pCtx = pCtx;
>   sCtx.xToken = xToken;
>   return t->tokenizer.xTokenize(t->pTokenizer, (void*)&sCtx, flags,
>                                 pText, nText, customToken);
> }
>
> Note that you are accessing an internal struct and relying on
> implementation details, and therefore have to check whether the struct
> or any other relevant implementation details changed with every release.
Thanks for the detailed response. I think this would work, but we are
currently using FTS4. The ability to call a parent tokenizer is really
what I needed, but I don't think this is possible with FTS4?

- Abhinav
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users