Hi -

I'm using sqlite 3.6.21 with this
which I found in this forum a few weeks ago.  I'm also using a custom
tokenizer which I wrote.

My scenario is this:  I am storing XHTML in the database, and I want to
FTS-enable this content.  I only want to index the text contained within the
XHTML elements, not the element names or attributes.  (e.g. "<dont-index
this="or this">index this</...>")  My tokenizer skips over element names and
attributes, then delegates the element textual content to the Porter
tokenizer.  On return from the Porter tokenizer, I correct the token offset
and length values to be the actual offsets within the document (Porter
tokenizer doesn't ever see the whole document, just a string within a tag).
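
The offset correction described above can be sketched roughly as follows.
This is illustrative Python, not the actual tokenizer: the names
xhtml_tokenize and inner_tokenize are hypothetical, a plain word splitter
stands in for the Porter tokenizer (no stemming), and the tag skipping is
naive (it assumes no ">" inside attribute values).  Offsets here are Python
string indices.

```python
import re

# Stand-in for the Porter tokenizer (assumption): lowercases runs of
# letters/digits, does no stemming, and reports offsets relative to the
# string it is handed, not to the whole document.
def inner_tokenize(text):
    for m in re.finditer(r"[A-Za-z0-9]+", text):
        yield m.group().lower(), m.start(), m.end()

# Hypothetical wrapper: tokenizes only element text and rebases the
# inner tokenizer's offsets onto the whole document.
def xhtml_tokenize(doc):
    tokens = []
    pos = 0
    while pos < len(doc):
        if doc[pos] == "<":
            # Skip the tag, including element name and attributes.
            end = doc.find(">", pos)
            if end == -1:
                break  # unterminated tag: nothing left to index
            pos = end + 1
            continue
        # The text run extends to the next tag or the end of the document.
        nxt = doc.find("<", pos)
        if nxt == -1:
            nxt = len(doc)
        for term, start, stop in inner_tokenize(doc[pos:nxt]):
            # Correct the offsets: add the position of the text run.
            tokens.append((term, pos + start, pos + stop))
        pos = nxt
    return tokens

if __name__ == "__main__":
    sample = '<dont-index this="or this">index this</dont-index>'
    for term, start, stop in xhtml_tokenize(sample):
        print(term, start, stop, repr(sample[start:stop]))
```

Each reported (start, stop) pair slices the original document back to the
token's text, which is the property the corrected offsets need to have.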

I didn't want to ship my tokenizer with my app for two reasons: (1) I wrote
it using an API not available to my client app, and (2) it doesn't make sense
on the client, where the user will be entering search terms that aren't
surrounded by XML tags, which is what my tokenizer expects.  Instead, my
client registers a tokenizer with the same name as my custom tokenizer, but
it is really just a copy of the Porter tokenizer.

I expected this to work fine - and it appeared to, until I discovered that
it was pulling out text in some of the XML attributes - which shouldn't be
indexed.

It turns out that FTS3 is re-tokenizing the content (not just the search
term) on the client (using my copy of the Porter tokenizer) and returning
those results.  I don't understand why - is this a bug or is this normal
behavior?  I expected the fts index to retain all of the token offsets/sizes
such that they wouldn't have to be recomputed on the client.

My workaround is to port my tokenizer so that it runs on the client, and to
wrap search terms in dummy xml tags <dummy>like this</dummy>.   But I feel I
shouldn't have to do this...

Any feedback appreciated...

Nick Hodapp
sqlite-users mailing list
