Re: [sqlite] Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Korean)?

2018-09-21 Thread 邱朗
Hi,


It was exactly as you said, my bad. I have now built an ICU version, but 
unfortunately it still does not support CJK. Why is that?


qiulangs-MacBook-Pro:sqlite-autoconf-3250100 qiulang$ ./sqlite3
SQLite version 3.25.1 2018-09-18 20:20:44
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> CREATE VIRTUAL TABLE zh_text USING fts4(text, tokenize=icu zh_CN);
sqlite> INSERT INTO zh_text values('为什么不支持中文 icu does not seem to work for chinese');
sqlite> select * from zh_text where text match 'work';
为什么不支持中文 icu does not seem to work for chinese
sqlite> select * from zh_text where text match '中';
sqlite>


BTW, anyone who hits the icu4c error may be making the same mistake I did. I 
first ran brew link icu4c, but brew refused with "Warning: Refusing to link 
macOS-provided software: icu4c", and then I forgot to add it to my path :$


If you run brew info icu4c, it tells you the following, although I actually 
didn't set these variables and the compiler could still find the files:


For compilers to find icu4c you may need to set:
  export LDFLAGS="-L/usr/local/opt/icu4c/lib"
  export CPPFLAGS="-I/usr/local/opt/icu4c/include"


Thanks,
Qiulang
At 2018-09-21 23:43:01, "Dan Kennedy"  wrote:
>On 09/21/2018 09:44 PM, 邱朗 wrote:
>> I actually first used  ./configure CFLAGS="-DSQLITE_ENABLE_ICU `icu-config 
>> --cppflags`" LDFLAGS="`icu-config --ldflags`"  But I got the error
>
>When you ran this configure command, is the first line of output 
>something like the following?
>
>   bash: icu-config: command not found
>

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Korean)?

2018-09-21 Thread Jens Alfke


> On Sep 20, 2018, at 11:01 PM, 邱朗  wrote:
> 
> https://www.sqlite.org/fts5.html  said " 
> The unicode tokenizer classifies all unicode characters as either "separator" 
> or "token" characters. By default all space and punctuation characters, as 
> defined by Unicode 6.1, are considered separators, and all other characters 
> as token characters... "  I really doubt unicode tokenizer requires white 
> space, that is ascii tokenizer.

Detecting word breaks in many East Asian languages (not just CJK; Thai is 
another) is a rather difficult task and requires having a non-small database of 
character sequences to match. I’m sure the SQLite maintainers considered it too 
large to build into their Unicode tokenizer.

IIRC, ICU can do this, as can special libraries like Mecab.

—Jens
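
The rule quoted above from the FTS5 documentation can be tried out with a toy model. The following Python sketch is not SQLite's code; it assumes that "token character" means Unicode general categories L*, N* and Co (the default category string that appears in fts5_tokenize.c) and treats everything else as a separator. The function names are made up for illustration. It only models unicode61 and says nothing about how the ICU tokenizer segments text.

```python
import unicodedata


def is_token_char(ch):
    """Toy rule: letters (L*), numbers (N*) and private-use (Co) are token characters."""
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "N") or cat == "Co"


def toy_unicode61(text):
    """Split text into maximal runs of token characters, lower-cased."""
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch):
            current.append(ch)
        elif current:
            # A separator ends the current token.
            tokens.append("".join(current).lower())
            current = []
    if current:
        tokens.append("".join(current).lower())
    return tokens


tokens = toy_unicode61("为什么不支持中文 icu does not seem to work for chinese")
print(tokens)
print("work" in tokens)  # whole-token lookup
print("中" in tokens)     # single ideograph lookup
```

Because no separator occurs inside 为什么不支持中文, the whole run becomes one token, so a query for the single character 中 has no token to match. Splitting such a run into words needs a dictionary-based segmenter of the kind mentioned above.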


Re: [sqlite] FTS5 minimum number of characters to index ?

2018-09-21 Thread Jens Alfke


> On Sep 21, 2018, at 3:26 AM, Domingo Alvarez Duarte  
> wrote:
> 
> looking at some fts5 tables it seems that an option to limit the minimum 
> number of characters to at least 2 or 3 would be a good shot as stopwords,

A real stop-word list is valuable, but I don’t think a simple minimum-length 
rule would be as useful. Maybe in a few contexts, but not in general. (It’s not 
useful even for English text; for example, I’m very glad that Google indexes 
the word “C” so I can look up questions about C programming!)

> another interest option would be a regex like black/white list of sequence of 
> characters to be indexed.

You can do all this and more with a custom tokenizer :)

(Most real-world uses of FTS for natural language text will end up needing a 
custom tokenizer anyway, because IIRC the default tokenizer is very stupid and 
only breaks at whitespace. At a minimum you need one that can ignore inter-word 
punctuation like periods and commas, and recognize some non-ASCII characters 
like curly quotes and en-dashes.)

—Jens
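
The difference between the two filtering strategies can be sketched in a few lines of Python. This is only an illustration, not FTS5 tokenizer code: the stop-word list and the function names are invented for the example.

```python
# Illustrative stop-word list only; a real list would be language-specific.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "about"}


def by_min_length(tokens, min_len=2):
    """Keep tokens that have at least min_len characters."""
    return [t for t in tokens if len(t) >= min_len]


def by_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


query = "questions about the c language".split()
print(by_min_length(query))   # drops "c" but keeps "the" and "about"
print(by_stop_words(query))   # keeps "c", drops "the" and "about"
```

The minimum-length rule throws away the one token that matters here ("c") and still indexes filler words, which is the objection raised above.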


Re: [sqlite] Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Korean)?

2018-09-21 Thread Dan Kennedy

On 09/21/2018 09:44 PM, 邱朗 wrote:
> I actually first used  ./configure CFLAGS="-DSQLITE_ENABLE_ICU `icu-config --cppflags`" 
> LDFLAGS="`icu-config --ldflags`"  But I got the error

When you ran this configure command, is the first line of output 
something like the following?

  bash: icu-config: command not found

Is [icu-config] actually in your path? And if so, what does the 
[icu-config --ldflags] command return?

Dan.








Re: [sqlite] Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Korean)?

2018-09-21 Thread 邱朗
I actually first used  ./configure CFLAGS="-DSQLITE_ENABLE_ICU `icu-config 
--cppflags`" LDFLAGS="`icu-config --ldflags`", but I got this error:


sqlite3.c:184184:10: fatal error: 'unicode/utypes.h' file not found
#include <unicode/utypes.h>


Then I added the -I and -L switches and, if I remember correctly, I used brew 
to install icu4c. The compiler commands are these:


qiulangs-MacBook-Pro:sqlite-autoconf-3250100 qiulang$ make
/bin/sh ./libtool  --tag=CC   --mode=compile gcc -DPACKAGE_NAME=\"sqlite\" 
-DPACKAGE_TARNAME=\"sqlite\" -DPACKAGE_VERSION=\"3.25.1\" 
-DPACKAGE_STRING=\"sqlite\ 3.25.1\" 
-DPACKAGE_BUGREPORT=\"http://www.sqlite.org\"; -DPACKAGE_URL=\"\" 
-DPACKAGE=\"sqlite\" -DVERSION=\"3.25.1\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 
-DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 
-DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 
-DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 
-DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 
-DHAVE_STRERROR_R=1 -DHAVE_EDITLINE_READLINE_H=1 -DHAVE_READLINE_READLINE_H=1 
-DHAVE_READLINE=1 -DHAVE_ZLIB_H=1 -I.-D_REENTRANT=1 -DSQLITE_THREADSAFE=1 
-DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 
-DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB  -I/usr/local/opt/icu4c/include 
-DSQLITE_ENABLE_ICU  -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c -o 
sqlite3.lo sqlite3.c
libtool: compile:  gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\" 
-DPACKAGE_VERSION=\"3.25.1\" "-DPACKAGE_STRING=\"sqlite 3.25.1\"" 
-DPACKAGE_BUGREPORT=\"http://www.sqlite.org\"; -DPACKAGE_URL=\"\" 
-DPACKAGE=\"sqlite\" -DVERSION=\"3.25.1\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 
-DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 
-DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 
-DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 
-DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 
-DHAVE_STRERROR_R=1 -DHAVE_EDITLINE_READLINE_H=1 -DHAVE_READLINE_READLINE_H=1 
-DHAVE_READLINE=1 -DHAVE_ZLIB_H=1 -I. -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 
-DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 
-DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB -I/usr/local/opt/icu4c/include 
-DSQLITE_ENABLE_ICU -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c sqlite3.c  
-fno-common -DPIC -o .libs/sqlite3.o
libtool: compile:  gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\" 
-DPACKAGE_VERSION=\"3.25.1\" "-DPACKAGE_STRING=\"sqlite 3.25.1\"" 
-DPACKAGE_BUGREPORT=\"http://www.sqlite.org\"; -DPACKAGE_URL=\"\" 
-DPACKAGE=\"sqlite\" -DVERSION=\"3.25.1\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 
-DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 
-DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 
-DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 
-DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 
-DHAVE_STRERROR_R=1 -DHAVE_EDITLINE_READLINE_H=1 -DHAVE_READLINE_READLINE_H=1 
-DHAVE_READLINE=1 -DHAVE_ZLIB_H=1 -I. -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 
-DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 
-DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB -I/usr/local/opt/icu4c/include 
-DSQLITE_ENABLE_ICU -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c sqlite3.c 
-o sqlite3.o >/dev/null 2>&1
mv -f .deps/sqlite3.Tpo .deps/sqlite3.Plo
/bin/sh ./libtool  --tag=CC   --mode=link gcc -D_REENTRANT=1 
-DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 
-DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB  
-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU  -no-undefined -version-info 
8:6:8 -L/usr/local/opt/icu4c/lib  -o libsqlite3.la -rpath /usr/local/lib 
sqlite3.lo  -lz
libtool: link: gcc -dynamiclib  -o .libs/libsqlite3.0.dylib  .libs/sqlite3.o   
-L/usr/local/opt/icu4c/lib -lz-install_name  
/usr/local/lib/libsqlite3.0.dylib -compatibility_version 9 -current_version 9.6 
-Wl,-single_module
Undefined symbols for architecture x86_64:
  "_u_errorName_62", referenced from:
  _icuFunctionError in sqlite3.o
  "_u_foldCase_62", referenced from:
...





At 2018-09-21 21:52:30, "Dan Kennedy"  wrote:
>On 09/21/2018 05:21 PM, 邱朗 wrote:
>> Hi,
>>
>> Thanks for replying my question. Following are the error I got when 
>> compiling sqlite-autoconf-3250100.tar.gz . The error looks similar to this 
>> old discussion
>> http://sqlite.1065341.n5.nabble.com/compiling-Sqlite-with-ICU-td40641.html
>>
>>
>> I am using macOS 10.13 & xcode 10
>
>The text below is just the error. If you post the compiler commands that 
>appear before it in the build log somebody might be able to spot the 
>problem.
>
> From the error message, it may be that you have mismatched ICU header 
>and library files, or it may be that not all required ICU libraries are 
>being linked. If you remove the -I... and -L... switches from your 
>command line does it make any difference?
>
>Dan.

Re: [sqlite] Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Korean)?

2018-09-21 Thread Dan Kennedy

On 09/21/2018 05:21 PM, 邱朗 wrote:
> Hi,
>
> Thanks for replying to my question. The following is the error I got when 
> compiling sqlite-autoconf-3250100.tar.gz. The error looks similar to this 
> old discussion:
> http://sqlite.1065341.n5.nabble.com/compiling-Sqlite-with-ICU-td40641.html
>
> I am using macOS 10.13 & xcode 10

The text below is just the error. If you post the compiler commands that 
appear before it in the build log somebody might be able to spot the 
problem.

From the error message, it may be that you have mismatched ICU header 
and library files, or it may be that not all required ICU libraries are 
being linked. If you remove the -I... and -L... switches from your 
command line does it make any difference?

Dan.








[sqlite] FTS5 min_word_size patch small error

2018-09-21 Thread Domingo Alvarez Duarte

Hello !

On my last post about a patch to fts5 to add an option "min_word_size" 
there is a small mistake on the comparison:


Original with mistake:

if(p->nMinWordSize && p->nMinWordSize >= wsz) continue;

New with mistake fixed (it should be ">" instead of ">="):

if(p->nMinWordSize && p->nMinWordSize > wsz) continue;


Cheers !
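
The effect of the one-character change can be checked with a small Python model of the comparison. This is a sketch, not the patched C code: kept_tokens and its strict_fix flag are hypothetical names, and the token size is taken as the UTF-8 byte length because the patch computes wsz as zOut-aFold, which is a byte count.

```python
def kept_tokens(words, min_word_size, strict_fix):
    """Return the words that survive the min_word_size check.

    strict_fix=False models the original test (nMinWordSize >= wsz),
    strict_fix=True models the corrected test (nMinWordSize > wsz).
    """
    kept = []
    for w in words:
        wsz = len(w.encode("utf-8"))  # size in bytes, as in the patch
        if strict_fix:
            skip = bool(min_word_size) and min_word_size > wsz
        else:
            skip = bool(min_word_size) and min_word_size >= wsz
        if not skip:
            kept.append(w)
    return kept


words = "a new way to tokenize using fts5".split()
print(kept_tokens(words, 3, strict_fix=False))  # original comparison
print(kept_tokens(words, 3, strict_fix=True))   # corrected comparison
```

With min_word_size 3, the original >= comparison also drops three-letter words such as "new" and "way", whereas the corrected > comparison keeps them. Because the size is in bytes, a two-character CJK word (6 bytes in UTF-8) would pass a min_word_size of 3.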



[sqlite] FTS5 min_word_size patch

2018-09-21 Thread Domingo Alvarez Duarte

Hello !

After reporting this issue here previously, I now have a working 
implementation of a "min_word_size" option for Unicode61Tokenizer; see the 
patch below.


With it here is the result of a simple test:



./sqlite3
SQLite version 3.26.0 2018-09-20 20:43:28
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> create virtual table tfts using fts5(data, tokenize = 'unicode61 min_word_size 3');
sqlite> create virtual table if not exists tfts_vocab_row USING fts5vocab('tfts', 'row');
sqlite> insert into tfts(data) values('A new way to tokenize using fts5 from sqlite, we can discard n letters word');

sqlite> select * from tfts_vocab_row;
discard|1|1
from|1|1
fts5|1|1
letters|1|1
sqlite|1|1
tokenize|1|1
using|1|1
word|1|1





fossil diff fts5_tokenize.c
Index: ext/fts5/fts5_tokenize.c
===================================================================
--- ext/fts5/fts5_tokenize.c
+++ ext/fts5/fts5_tokenize.c
@@ -233,10 +233,11 @@
 struct Unicode61Tokenizer {
   unsigned char aTokenChar[128];  /* ASCII range token characters */
   char *aFold;    /* Buffer to fold text into */
   int nFold;  /* Size of aFold[] in bytes */
   int bRemoveDiacritic;   /* True if remove_diacritics=1 is set */
+  int nMinWordSize;   /* Min size of a word to be indexed */
   int nException;
   int *aiException;

   unsigned char aCategory[32];    /* True for token char categories */
 };
@@ -360,10 +361,11 @@
   const char *zCat = "L* N* Co";
   int i;
   memset(p, 0, sizeof(Unicode61Tokenizer));

   p->bRemoveDiacritic = 1;
+  p->nMinWordSize = 0;
   p->nFold = 64;
   p->aFold = sqlite3_malloc(p->nFold * sizeof(char));
   if( p->aFold==0 ){
 rc = SQLITE_NOMEM;
   }
@@ -393,10 +395,14 @@
 if( 0==sqlite3_stricmp(azArg[i], "separators") ){
   rc = fts5UnicodeAddExceptions(p, zArg, 0);
 }else
 if( 0==sqlite3_stricmp(azArg[i], "categories") ){
   /* no-op */
+    }else
+    if( 0==sqlite3_stricmp(azArg[i], "min_word_size") ){
+  int mwsz;
+  if( sqlite3GetInt32(zArg, &mwsz) ) p->nMinWordSize = mwsz;
 }else{
   rc = SQLITE_ERROR;
 }
   }

@@ -450,10 +456,11 @@
   while( rc==SQLITE_OK ){
 int iCode;    /* non-ASCII codepoint read from input */

 char *zOut = aFold;
 int is;
 int ie;
+    int wsz;

 /* Skip any separator characters. */
 while( 1 ){
   if( zCsr>=zTerm ) goto tokenize_done;
   if( *zCsr & 0x80 ) {
@@ -517,12 +524,15 @@
 zCsr++;
   }
   ie = zCsr - (unsigned char*)pText;
 }

+    wsz = zOut-aFold;
+    /* Check min word size */
+    if(p->nMinWordSize && p->nMinWordSize >= wsz) continue;
 /* Invoke the token callback */
-    rc = xToken(pCtx, 0, aFold, zOut-aFold, is, ie);
+    rc = xToken(pCtx, 0, aFold, wsz, is, ie);
   }

  tokenize_done:
   if( rc==SQLITE_DONE ) rc = SQLITE_OK;
   return rc;





Re: [sqlite] SQlite 3 - bottleneck with rbuFindMaindb

2018-09-21 Thread Simon Slavin
On 20 Sep 2018, at 10:31pm, Roger Cuypers  wrote:

> rbuFindMaindb
> rbuVfsAccess
> sqlite3OsAccess
> hasHotJournal
> sqlite3PagerSharedLock
> zipvfsLockFile

Thanks.  That's very useful.  Your stack includes both zipvfsLockFile and 
rbuVfsAccess, and I'm not familiar with either of these.   So I leave your 
problem to the others who will see this.

Simon.


[sqlite] FTS5 minimum number of characters to index ?

2018-09-21 Thread Domingo Alvarez Duarte

Hello !

I'm looking in the documentation and it doesn't seem to mention any 
option to specify a minimum number of characters to index. Looking at 
some fts5 tables, it seems that an option to limit the minimum number of 
characters to at least 2 or 3 would be a good stand-in for stopwords. Another 
interesting option would be a regex-like black/white list of character 
sequences to be indexed.


Something like:

create virtual table if not exists pdfs_fts using fts5(pdf_name UNINDEXED, data,
    tokenize = 'unicode61 remove_diacritics 1 min_word_size 3 word_black_list [\d\.\d\d\w \a\d\d\d] word_white_list [\(\d+\) \d\d\.\d\d\d\.\d\d\a]');


The idea is to allow/disallow some specific domain sequences to be 
included/excluded from indexing.


Any idea how to achieve that?

Cheers !



Re: [sqlite] Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Korean)?

2018-09-21 Thread 邱朗
Hi,

Thanks for replying to my question. The following is the error I got when compiling 
sqlite-autoconf-3250100.tar.gz. The error looks similar to this old discussion:
http://sqlite.1065341.n5.nabble.com/compiling-Sqlite-with-ICU-td40641.html


I am using macOS 10.13 & xcode 10


Undefined symbols for architecture x86_64:
  "_u_errorName_62", referenced from:
  _icuFunctionError in sqlite3.o
  "_u_foldCase_62", referenced from:
  _icuOpen in sqlite3.o
  _icuLikeCompare in sqlite3.o
  "_u_isspace_62", referenced from:
  _icuNext in sqlite3.o
  "_u_strToLower_62", referenced from:
  _icuCaseFunc16 in sqlite3.o
  "_u_strToUTF8_62", referenced from:
  _icuNext in sqlite3.o
  "_u_strToUpper_62", referenced from:
  _icuCaseFunc16 in sqlite3.o
  "_ubrk_close_62", referenced from:
  _icuClose in sqlite3.o
  "_ubrk_current_62", referenced from:
  _icuNext in sqlite3.o
  "_ubrk_first_62", referenced from:
  _icuOpen in sqlite3.o
  "_ubrk_next_62", referenced from:
  _icuNext in sqlite3.o
  "_ubrk_open_62", referenced from:
  _icuOpen in sqlite3.o
  "_ucol_close_62", referenced from:
  _icuLoadCollation in sqlite3.o
  _icuCollationDel in sqlite3.o
  "_ucol_open_62", referenced from:
  _icuLoadCollation in sqlite3.o
  "_ucol_strcoll_62", referenced from:
  _icuCollationColl in sqlite3.o
  "_uregex_close_62", referenced from:
  _icuRegexpDelete in sqlite3.o
  "_uregex_matches_62", referenced from:
  _icuRegexpFunc in sqlite3.o
  "_uregex_open_62", referenced from:
  _icuRegexpFunc in sqlite3.o
  "_uregex_setText_62", referenced from:
  _icuRegexpFunc in sqlite3.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [libsqlite3.la] Error 1




At 2018-09-21 17:32:38, "Dan Kennedy"  wrote:
>On 09/21/2018 01:38 PM, 邱朗 wrote:
>>>
>>> I think it could be made to work, or at least, I have experience
>>> making it work with CJK based on functionality exposed via ICU. I
>>> don't know if the unicode tokenizer uses ICU or if the functionality
>>> in ICU that I used is available in the unicode tables. Not
>>> understanding any of the languages represented by CJK, I can't say
>>> with any confidence how good my solution was, but it seemed to be good
>>> enough for the use case of my management and customers in the impacted
>>> regions.
>>
>> I am Chinese and I know a little bit of Korean, I can help to test your 
>> product :D  All Jokes aside I also tried to build an ICU SQlite macOS 
>> version but I failed. All the document I googled seem outdated. e.g. I used 
>> this (and other solutions) but I just can not build a macOS version. Do you 
>> have any experience for that ?
>>
>>
>> ./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU 
>> `icu-config --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config 
>> --ldflags`"
>
>Can you post the complete output of the failed build attempt? Thanks.
>
>Dan.
>


Re: [sqlite] SQlite 3 - bottleneck with rbuFindMaindb

2018-09-21 Thread Roger Cuypers
Ok, I have more info now. The database consists of multiple individual database 
files which are opened and closed individually, each with its own connection, 
several at a time. There is a root file, but it's just another database file 
whose only purpose is to tell the application where to find the other files.

Here is an example call stack of a high-load call:

rbuFindMaindb
rbuVfsAccess
sqlite3OsAccess
hasHotJournal
sqlite3PagerSharedLock
zipvfsLockFile
sqlite3OsLock
pagerLockDb
pagerLockDb
pager_wait_on_lock
sqlite3PagerSharedLock
lockBtree
sqlite3BtreeBeginTrans
sqlite3VdbeExec
sqlite3Step
sqlite3_step


> On 19.09.2018 at 22:27, Simon Slavin wrote:
> 
> On 19 Sep 2018, at 8:47pm, Roger Cuypers  wrote:
> 
>> the database has a root file. The subfiles are all loaded via separate 
>> connections as far as I know.
> 
> Sorry, but this makes no sense.  Each database file can have only one WAL 
> file.
> 
> You say that the program is looking through lots of WAL files.  The only way 
> it should be doing that is if the program has lots of database files open at 
> the same time.  If a database is not open, then SQLite does not even know its 
> WAL file exists.
> 
> Does your program really have numerous database files open at one time ?
> 
> If so, does it do that using the ATTACH command, and attaching them all to 
> one connection, or by opening a separate connection to each database ?
> 
> Simon.


Re: [sqlite] Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Korean)?

2018-09-21 Thread Dan Kennedy

On 09/21/2018 01:38 PM, 邱朗 wrote:
>> I think it could be made to work, or at least, I have experience
>> making it work with CJK based on functionality exposed via ICU. I
>> don't know if the unicode tokenizer uses ICU or if the functionality
>> in ICU that I used is available in the unicode tables. Not
>> understanding any of the languages represented by CJK, I can't say
>> with any confidence how good my solution was, but it seemed to be good
>> enough for the use case of my management and customers in the impacted
>> regions.
>
> I am Chinese and I know a little bit of Korean, I can help to test your product 
> :D  All jokes aside, I also tried to build an ICU SQLite macOS version but I 
> failed. All the documents I googled seem outdated, e.g. I used this (and other 
> solutions) but I just cannot build a macOS version. Do you have any experience 
> with that?
>
> ./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU `icu-config 
> --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config --ldflags`"

Can you post the complete output of the failed build attempt? Thanks.

Dan.


Re: [sqlite] Docs typo JSON1 @ 4.13

2018-09-21 Thread John G
In that same JSON page, in 1. Overview the text mentions '12 of 14 SQL
functions'  but the listing shows different numbers - 13 numbered items in
the first section,  2 in the second, numbered 1 - 15.

Should that be "twelve of the *fifteen* SQL functions" or "*thirteen* of
the *fifteen* SQL functions"?

Cheers
JG

On 19 September 2018 at 11:16, Peter Johnson 
wrote:

> Hi,
>
> The JSON1 docs at https://www.sqlite.org/json1.html have a minor typo:
>
> Section 4.13. The json_each() and json_tree() table-valued functions
>
> atom ANY, -- value for primitive types, null for array & object
> > id INTEGER -- integer ID for this element
> > parent INTEGER, -- integer ID for the parent of this element
>
>
> The "id INTEGER" column definition is missing a trailing comma.
>
> Cheers,
> -P