On Mon, May 4, 2009 at 2:22 PM, Garry Watkins <ga...@dynafocus.com> wrote:
> What encoding is used on inbound insert statements into a FTS3 virtual
> table? For example I have Japanese text encoded as UTF-8 and passed
> in as UTF-8 insert statement is encoded as UTF-8. I am not using the
> ICU library (SQLITE_ENABLE_ICU is not defined).
Should be UTF-8 across the board.

> In the above tokenizer I want to eliminate words (stemming), do I just
> not return those words, and move to the next? Does this actually
> influence the text that is stored, or is this just used for indexing?

The original is stored as the document; the tokenizer results are used
to build the index and to construct queries. So if you skip stopwords,
they won't be searchable using the fts MATCH operator. It should make
the index smaller. There may be query-time edge cases: for instance, a
phrase containing a stopword will probably match occurrences of the
phrase words in the input regardless of whether there were stopwords
between them, or what those stopwords were. Probably not a huge problem
in practice. [I mean that if "a" and "the" are stopwords, then "catch a
ball" would match documents containing "catch the ball".]

You would probably implement stemming by returning a token which is
different from the literal word in question. That should work, too: the
index is built from what the tokenizer returns, not from the input. You
indicate the part of the input that the token represents by the offsets
the tokenizer returns by reference. So the tokenizer can return offsets
corresponding to "corresponding", but return the token "correspond",
which isn't actually the same length, and "correspond" will be what goes
into the index.

I should note that I haven't worked in the code for a long time, so
things may have changed.

> I am building a static library, and I want the tokenizer to be
> available for anything that I link it with. Is there an easy way to
> make that happen? I currently configured it through adding code into
> the core FTS3 code. I don't want to do this each time, and would like
> to keep the source separated, so I can easily change versions of SQLite.

Here, I'm a little wishy-washy.
For Google Gears, we just wired stuff in via compile-time arguments; I
think it was something like -DSQLITE_ENABLE_FTS3. (Actually, Gears uses
fts2, because VACUUM was explicitly disabled so that design flaw
couldn't happen. You should use fts3, but things should work the same.)
Gears has third_party/sqlite_vendor and third_party/sqlite_google, so
you should be able to trivially verify what changes Gears had from
SQLite CVS as of the time things were last imported. Google Chrome also
uses fts, but I'm not entirely sure on the mechanics there, so you'll
have to poke around on your own.

-scott
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
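To make the tokenizer points above concrete, here is a rough Python model of the behaviour described: stopwords are skipped, stemmed tokens go into the index, and the offsets still point at the original word. This is only a sketch and not the FTS3 C interface. The stopword list, the toy `stem` rule, and the names `tokenize`, `build_index` and `match_phrase` are all made up for illustration. It also assumes ASCII input (so character offsets equal byte offsets) and assumes that skipped stopwords do not consume a token position, which is what makes the "catch a ball" example come out the way described.

```python
import re

# Hypothetical stopword list and toy suffix-stripping "stemmer";
# neither comes from SQLite, they exist only for this illustration.
STOPWORDS = {"a", "the"}
SUFFIXES = ("ing", "s")


def stem(word):
    """Strip one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Yield (token, start, end, position) for each indexed word.

    start/end are offsets of the ORIGINAL word in the input, even when
    the returned token is a shorter stem. Stopwords are simply not
    returned, and (by assumption) do not advance the position counter.
    """
    position = 0
    for m in re.finditer(r"[A-Za-z0-9]+", text):
        word = m.group().lower()
        if word in STOPWORDS:
            continue
        yield stem(word), m.start(), m.end(), position
        position += 1


def build_index(docs):
    """Map token -> {doc_id -> set of positions}, from tokenizer output only."""
    index = {}
    for doc_id, text in docs.items():
        for token, _start, _end, pos in tokenize(text):
            index.setdefault(token, {}).setdefault(doc_id, set()).add(pos)
    return index


def match_phrase(index, phrase):
    """Return doc ids where the phrase's tokens occur at consecutive positions."""
    terms = [token for token, *_ in tokenize(phrase)]
    if not terms:
        return set()  # the phrase was nothing but stopwords
    hits = set()
    for doc_id, positions in index.get(terms[0], {}).items():
        if any(
            all(p + i in index.get(t, {}).get(doc_id, ()) for i, t in enumerate(terms))
            for p in positions
        ):
            hits.add(doc_id)
    return hits


docs = {
    1: "catch the ball",
    2: "catch a falling star",
    3: "corresponding offsets",
}
index = build_index(docs)
```

With this model, `match_phrase(index, "catch a ball")` finds document 1 even though the stored text says "the", `match_phrase(index, "the")` finds nothing, and `tokenize("corresponding")` yields the token "correspond" with offsets covering all 13 characters of the original word.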