The problem is that I don't want the ICU tokenizer to split single-word
entries and shorter figures of speech; for those the search is currently
very good (and it has to be exact in 99% of cases, because I limit
the results to 1).

I modified the simple tokenizer to filter out more characters, such as [ ] 123456789 ;
Here is an example entry:
感じる; [1] かんじる
to feel
MeCab and ICU would split this into
感 じる
i.e. "feeling" plus a grammatical ending,
but there is already another entry for that.
(Splitting text in Japanese is not trivial, because the reading of many
characters changes when they stand alone...)
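
To make the difference concrete, here is a rough sketch of the two setups
(the table and column names are made up for illustration, not my real schema):

    CREATE VIRTUAL TABLE dict_simple USING fts3(headword, reading, gloss, tokenize=simple);
    CREATE VIRTUAL TABLE dict_icu    USING fts3(headword, reading, gloss, tokenize=icu ja_JP);

    -- with the simple tokenizer the whole headword stays a single token,
    -- so this only hits the exact entry:
    SELECT * FROM dict_simple WHERE headword MATCH '感じる';

    -- with icu the headword is broken into 感 and じる, so the same query
    -- behaves like a multi-term search and can also pull in other entries
    -- (e.g. the separate entry for 感), which breaks the exact, LIMIT-1 lookup:
    SELECT * FROM dict_icu WHERE headword MATCH '感じる';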

It would be best if I could use multiple tokenizers and prioritize the
simple one first, then MeCab or ICU, but I guess that would also
increase the database size massively.
Since these searches occur very rarely, using no index at all for them
would probably be best, if that is possible at all (I already have a
manual conversion of conjugated forms, e.g. the polite 感じます to the
plain 感じる, or similar).
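
To show what I mean by the fallback, here is a minimal sketch of the two-step
lookup (again with made-up table and column names): first the exact hit on the
simple-tokenized FTS table, and only if that returns nothing, an unindexed
LIKE scan over a plain (non-FTS) table of example sentences:

    -- step 1: fast exact lookup against the simple-tokenized FTS table
    SELECT headword, gloss
      FROM dict_simple
     WHERE headword MATCH '感じる'
     LIMIT 1;

    -- step 2 (only if step 1 found nothing): slow substring scan over a
    -- plain table of example sentences, no index involved at all
    SELECT sentence, translation
      FROM example_sentences
     WHERE sentence LIKE '%感じる%'
     LIMIT 1;

Since step 2 only runs for the rare sentence searches, the full table scan
should be acceptable.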

Here you can see the app in action in a GIF animation:
http://www.boscowitch.de/projects/wadoku-notify

regards
boscowitch

On Tuesday, 30.11.2010, at 00:22 +0700, Dan Kennedy wrote:
> On 11/30/2010 12:09 AM, boscowitch wrote:
> > Hi, recently I noticed that I can't search with the LIKE '%searchword%'
> > syntax on an FTS3 virtual table.
> >
> > And with "match" I can't search on example sentences (the indexed data
> > is a Japanese dictionary and therefore has no spaces in the example
> > sentences, and there is no perfect tokenizer at the moment; I tried
> > MeCab, but it makes mistakes).
> 
> Can you use the ICU tokenizer?
> 
>    http://www.sqlite.org/fts3.html#tokenizer

