On 07/10/2016 01:33 PM, Abhinav Upadhyay wrote:
On Fri, Jul 8, 2016 at 3:01 AM, Matthias-Christian Ott <o...@mirix.org> wrote:
On 2016-07-05 18:11, Abhinav Upadhyay wrote:
I'm wondering if it is possible to extend the functionality of the
porter tokenizer. I would like to use the functionality of the Porter
tokenizer but before stemming the token, I want to decide whether the
token should be stemmed or not.

Do I need to copy the Porter tokenizer and modify it to suit my needs,
or is there a better way that minimizes code duplication?
The first argument of the Porter tokenizer is its parent tokenizer. The
Porter tokenizer calls the parent tokenizer's xTokenize function with an
xToken function that wraps the xToken function that was passed to the
xTokenize function of the Porter tokenizer and stems the tokens passed
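to it. So create a custom tokenizer, use it as the parent of the Porter
tokenizer, and have it extract the original xToken function from the
xToken member of its pCtx parameter (which points to the Porter
tokenizer's internal context):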

typedef struct PorterContext PorterContext;
struct PorterContext {
   void *pCtx;
   int (*xToken)(void *pCtx, int tflags, const char *pToken, int nToken,
       int iStart, int iEnd);
   char *aBuf;
};

typedef struct CustomTokenizer CustomTokenizer;
struct CustomTokenizer {
   fts5_tokenizer tokenizer;
   Fts5Tokenizer *pTokenizer;
};

typedef struct CustomContext CustomContext;
struct CustomContext {
   void *pCtx;
   int (*xToken)(void *pCtx, int tflags, const char *pToken, int nToken,
       int iStart, int iEnd);
};

int customToken(
   void *pCtx,
   int tflags,
   const char *pToken,
   int nToken,
   int iStart,
   int iEnd
){
   CustomContext *c = (CustomContext*)pCtx;
   PorterContext *p;

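   /* "stem" is a placeholder for your own decision about this token. */
   if( stem ){
     /* Pass the token to the Porter wrapper, which stems it. */
     return c->xToken(c->pCtx, tflags, pToken, nToken, iStart, iEnd);
   }else{
     /* Bypass Porter: call the original xToken saved in its context. */
     p = (PorterContext*)c->pCtx;
     return p->xToken(p->pCtx, tflags, pToken, nToken, iStart, iEnd);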
   }
}

int customTokenize(
   Fts5Tokenizer *pTokenizer,
   void *pCtx,
   int flags,
   const char *pText,
   int nText,
   int (*xToken)(void *, int, const char *, int nToken, int iStart,
       int iEnd)
){
   CustomTokenizer *t = (CustomTokenizer*)pTokenizer;
   CustomContext sCtx;
   sCtx.pCtx = pCtx;
   sCtx.xToken = xToken;
   return t->tokenizer.xTokenize(t->pTokenizer, (void*)&sCtx, flags,
       pText, nText, customToken);
}

Note that you are accessing an internal struct and relying on
implementation details, and therefore have to check with every release
whether the struct or any other relevant implementation details have
changed.
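
To make the callback chain easier to follow, here is a small stand-alone
sketch that does not link against SQLite at all. Everything in it is an
assumption made for illustration: splitOnSpaces stands in for the real
parent tokenizer, fakePorterToken stands in for the Porter wrapper (it
only strips a trailing "s", it is not the Porter algorithm), shouldStem
is a made-up rule, and Sink/demoChain are hypothetical helpers. Only the
PorterContext layout is taken from the code above.

```c
#include <string.h>

typedef int (*TokenCb)(void *pCtx, int tflags, const char *pToken,
    int nToken, int iStart, int iEnd);

/* Same layout as the internal struct quoted above. */
typedef struct { void *pCtx; TokenCb xToken; char *aBuf; } PorterContext;
typedef struct { void *pCtx; TokenCb xToken; } CustomContext;

/* Hypothetical final consumer: collects tokens separated by spaces. */
typedef struct { char out[128]; } Sink;

static int sinkToken(void *pCtx, int tflags, const char *pToken,
    int nToken, int iStart, int iEnd){
  Sink *s = (Sink*)pCtx;
  (void)tflags; (void)iStart; (void)iEnd;
  if( strlen(s->out)+(size_t)nToken+2 > sizeof(s->out) ) return 1;
  if( s->out[0] ) strcat(s->out, " ");
  strncat(s->out, pToken, (size_t)nToken);
  return 0;
}

/* Stand-in for the Porter wrapper: drops a trailing 's', nothing more. */
static int fakePorterToken(void *pCtx, int tflags, const char *pToken,
    int nToken, int iStart, int iEnd){
  PorterContext *p = (PorterContext*)pCtx;
  if( nToken>1 && pToken[nToken-1]=='s' ) nToken--;
  return p->xToken(p->pCtx, tflags, pToken, nToken, iStart, iEnd);
}

/* Made-up rule: never stem the word "news". */
static int shouldStem(const char *pToken, int nToken){
  return !(nToken==4 && memcmp(pToken, "news", 4)==0);
}

/* Same idea as customToken above: stem, or bypass the wrapper. */
static int customToken(void *pCtx, int tflags, const char *pToken,
    int nToken, int iStart, int iEnd){
  CustomContext *c = (CustomContext*)pCtx;
  if( shouldStem(pToken, nToken) ){
    return c->xToken(c->pCtx, tflags, pToken, nToken, iStart, iEnd);
  }else{
    PorterContext *p = (PorterContext*)c->pCtx;
    return p->xToken(p->pCtx, tflags, pToken, nToken, iStart, iEnd);
  }
}

/* Stand-in for the custom tokenizer's own parent: splits on spaces. */
static int splitOnSpaces(void *pCtx, const char *pText, int nText,
    TokenCb xToken){
  int i = 0;
  while( i<nText ){
    int iStart;
    while( i<nText && pText[i]==' ' ) i++;
    iStart = i;
    while( i<nText && pText[i]!=' ' ) i++;
    if( i>iStart ){
      int rc = xToken(pCtx, 0, &pText[iStart], i-iStart, iStart, i);
      if( rc!=0 ) return rc;
    }
  }
  return 0;
}

/* Wires the chain: "porter" -> custom -> splitOnSpaces -> callbacks. */
static int demoChain(const char *zText, Sink *pSink){
  PorterContext p;
  CustomContext c;
  pSink->out[0] = '\0';
  p.pCtx = pSink; p.xToken = sinkToken; p.aBuf = 0;
  c.pCtx = &p;    c.xToken = fakePorterToken;
  return splitOnSpaces(&c, zText, (int)strlen(zText), customToken);
}
```

Calling demoChain("cats news dogs", &sink) should leave "cat news dog"
in sink.out: "cats" and "dogs" go through the stand-in stemmer, while
"news" skips it and reaches the original callback unchanged.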
Thanks for the detailed response. I think this would work but we are
currently using FTS4. The ability to call a parent tokenizer is
really what I needed, but I don't think this is possible with FTS4?

No way to do that with FTS4 unfortunately. I think you'll either need to switch to FTS5 or make a copy of the porter stemmer code and modify it to suit your purpose.

Dan.




-
Abhinav
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
