On Dec 30, 2009, at 6:25 AM, Nick Hodapp wrote:

> Hi -
>
> I'm using sqlite 3.6.21 with this patch
> <http://www.sqlite.org/src/ci/6cbbae849990d99b7ffe252b642d6be49d0c7235>,
> which I found in this forum a few weeks ago. I'm also using a custom
> tokenizer which I wrote.
>
> My scenario is this: I am storing XHTML in the database, and I want to
> FTS-enable this content. I only want to index the text contained within
> the XHTML elements, not the element names or attributes. (e.g.
> "<dont-index this="or this">index this</...>") My tokenizer skips over
> element names and attributes, then delegates the element textual content
> to the Porter tokenizer. On return from the Porter tokenizer, I correct
> the token offset and length values to be the actual offsets within the
> document (Porter tokenizer doesn't ever see the whole document, just a
> string within a tag).
>
> I didn't want to ship my tokenizer with my app for two reasons. 1 - I
> wrote it using an API not available to my client app, 2 - it doesn't
> make sense because on the client the user will be entering search terms
> that aren't surrounded by xml tags, which is what my tokenizer expects.
> Instead, my client registers a tokenizer with the same name as my custom
> tokenizer, but in fact it is registering a copy of the porter tokenizer.
>
> I expected this to work fine - and it appeared to, until I discovered
> that it was pulling out text in some of the xml attributes - which
> shouldn't be indexed.
>
> It turns out that FTS3 is re-tokenizing the content (not just the search
> term) on the client (using my copy of the Porter tokenizer) and returning
> those results. I don't understand why - is this a bug or is this normal
> behavior?

FTS3 runs the tokenizer on returned documents as part of the snippet()
or offsets() function. The full-text index doesn't actually store the
byte offsets returned by the tokenizer xNext() call, just the token
number. So you have to re-tokenize to figure out the byte offsets
required by snippet() or offsets().

Dan.

> I expected the fts index to retain all of the token offsets/sizes
> such that they wouldn't have to be recomputed on the client.
>
> My workaround is to port my tokenizer so that it runs on the client,
> and to wrap search terms in dummy xml tags <dummy>like this</dummy>.
> But I feel I shouldn't have to do this...
>
> Any feedback appreciated...
>
> Nick Hodapp
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
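
The point in Dan's reply (the index keeps token numbers, not byte
offsets, so offsets have to be recovered by tokenizing the document
again) can be illustrated with a toy model. This is only a sketch in
Python and not FTS3's actual data structures or code: the names
plain_tokenizer, xhtml_tokenizer, build_index and offsets are
hypothetical, the regex word splitter merely stands in for the Porter
tokenizer (no stemming), and character offsets stand in for byte offsets
(identical for ASCII input).

```python
import re

WORD = re.compile(r"[A-Za-z0-9]+")
TAG = re.compile(r"<[^>]*>")

def plain_tokenizer(text):
    """Yield (token, start, end) for every word, markup included.
    Stand-in for the client's copy of the Porter tokenizer (assumption)."""
    for m in WORD.finditer(text):
        yield m.group().lower(), m.start(), m.end()

def xhtml_tokenizer(text):
    """Yield tokens only from text outside <...> tags.
    Offsets are relative to the whole document, not to the tag content."""
    pos = 0
    for tag in TAG.finditer(text):
        # Tokenize the text run between the previous tag and this one.
        for m in WORD.finditer(text, pos, tag.start()):
            yield m.group().lower(), m.start(), m.end()
        pos = tag.end()
    # Tokenize whatever follows the last tag.
    for m in WORD.finditer(text, pos):
        yield m.group().lower(), m.start(), m.end()

def build_index(doc, tokenizer):
    """Map each term to its token numbers; the offsets are thrown away."""
    index = {}
    for n, (tok, _start, _end) in enumerate(tokenizer(doc)):
        index.setdefault(tok, []).append(n)
    return index

def offsets(doc, index, term, tokenizer):
    """Recover offsets by re-tokenizing doc and keeping the stored token numbers."""
    wanted = set(index.get(term, []))
    return [(s, e) for n, (_tok, s, e) in enumerate(tokenizer(doc)) if n in wanted]

DOC = '<p class="index">index this</p>'
INDEX = build_index(DOC, xhtml_tokenizer)

# Same tokenizer at index time and query time: the offsets hit the text.
same = offsets(DOC, INDEX, "this", xhtml_tokenizer)
print([DOC[s:e] for s, e in same])    # ['this']

# Different tokenizer at query time: token number 1 now lands in the markup.
mixed = offsets(DOC, INDEX, "this", plain_tokenizer)
print([DOC[s:e] for s, e in mixed])   # ['class']
```

In this model the stored token numbers only make sense when the document
is re-tokenized by the tokenizer that built the index, which is
consistent with the workaround of running the same custom tokenizer on
the client.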