Re: [sqlite] fts3 issue with tokenizing of content during a query

Dan Kennedy Tue, 29 Dec 2009 20:44:23 -0800

On Dec 30, 2009, at 6:25 AM, Nick Hodapp wrote:

> Hi -
>
> I'm using sqlite 3.6.21 with this patch
> <http://www.sqlite.org/src/ci/6cbbae849990d99b7ffe252b642d6be49d0c7235>,
> which I found in this forum a few weeks ago.  I'm also using a custom
> tokenizer which I wrote.
>
> My scenario is this:  I am storing XHTML in the database, and I want to
> FTS-enable this content.  I only want to index the text contained
> within the XHTML elements, not the element names or attributes.  (e.g.
> "<dont-index this="or this">index this</...>")  My tokenizer skips over
> element names and attributes, then delegates the element textual content
> to the Porter tokenizer.  On return from the Porter tokenizer, I correct
> the token offset and length values to be the actual offsets within the
> document (Porter tokenizer doesn't ever see the whole document, just a
> string within a tag).
>
> I didn't want to ship my tokenizer with my app for two reasons.  1 - I
> wrote it using an API not available to my client app, 2 - it doesn't
> make sense because on the client the user will be entering search terms
> that aren't surrounded by xml tags, which is what my tokenizer expects.
> Instead, my client registers a tokenizer with the same name as my custom
> tokenizer, but in fact it is registering a copy of the porter tokenizer.
>
> I expected this to work fine - and it appeared to, until I discovered
> that it was pulling out text in some of the xml attributes - which
> shouldn't be indexed.
>
> It turns out that FTS3 is re-tokenizing the content (not just the search
> term) on the client (using my copy of the Porter tokenizer) and
> returning those results.  I don't understand why - is this a bug or is
> this normal behavior?


FTS3 runs the tokenizer on each returned document as part of the
snippet() or offsets() function. The full-text index doesn't store the
byte offsets returned by the tokenizer's xNext() call, only the token
number (each token's position within the document). So FTS3 has to
re-tokenize the document to work out the byte offsets that snippet()
and offsets() need.
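
As a rough illustration of why the two tokenizers have to agree, here is
a small Python model. It is only a sketch and not FTS3's actual code: the
function names (xhtml_tokenize, plain_tokenize, build_index, offsets) are
made up for this example, and a simple word regex stands in for the
Porter tokenizer. The index keeps token numbers only, so byte offsets
are recovered by tokenizing the document again.

```python
import re

WORD = r"[A-Za-z0-9]+"  # stand-in for a real tokenizer such as Porter

def plain_tokenize(text):
    """Yield (token, start, end) for every word, tag contents included."""
    for m in re.finditer(WORD, text):
        yield m.group().lower(), m.start(), m.end()

def xhtml_tokenize(text):
    """Yield (token, start, end) only for words outside <...> tags."""
    pos = 0
    for tag in re.finditer(r"<[^>]*>", text):
        for m in re.finditer(WORD, text[pos:tag.start()]):
            # Shift offsets back to positions in the whole document.
            yield m.group().lower(), pos + m.start(), pos + m.end()
        pos = tag.end()
    for m in re.finditer(WORD, text[pos:]):
        yield m.group().lower(), pos + m.start(), pos + m.end()

def build_index(doc, tokenize):
    """Map each term to its token numbers; byte offsets are discarded."""
    index = {}
    for number, (token, _start, _end) in enumerate(tokenize(doc)):
        index.setdefault(token, []).append(number)
    return index

def offsets(doc, index, term, tokenize):
    """Recover (start, end) for each hit by re-tokenizing the document."""
    spans = [(s, e) for _tok, s, e in tokenize(doc)]
    return [spans[n] for n in index.get(term, []) if n < len(spans)]

doc = '<p class="note">index this</p>'
index = build_index(doc, xhtml_tokenize)

# Same tokenizer at index time and query time: offsets land on the term.
same = offsets(doc, index, "index", xhtml_tokenize)
matched_same = [doc[s:e] for s, e in same]

# Different tokenizer at query time: token number 0 is now inside the tag.
other = offsets(doc, index, "index", plain_tokenize)
matched_other = [doc[s:e] for s, e in other]
```

In this model, looking up "index" with the same tokenizer gives back the
text "index", while the plain tokenizer maps the same token number to
"p", the element name. That is the kind of mismatch that shows up when
the tokenizer registered on the client differs from the one that built
the index.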

Dan.



>  I expected the fts index to retain all of the token offsets/sizes
> such that they wouldn't have to be recomputed on the client.
>
> My workaround is to port my tokenizer so that it runs on the client, and
> to wrap search terms in dummy xml tags <dummy>like this</dummy>.   But I
> feel I shouldn't have to do this...
>
> Any feedback appreciated...
>
> Nick Hodapp
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
