Re: [sqlite] Odd insertion error FTS4 + ICU (E. Timothy Uy)
>> >> inserting the following into my virtual table:
>> >>
>> >> 一日耶羅波安出
>>
>> Can you post the list of codepoints in this text? Or the hex
>> of the utf-16 or utf-8 encoding of the same?

00 4E E5 65 36 80 85 7F E2 6C 89 5B FA 51

Here no problem inserting this string (Mac OSX 10.6.8)

sqlite> create table u8_t (u8c1 varchar(32));
sqlite> insert into u8_t values ('一日耶羅波安出');
sqlite> .mode list
sqlite> select * from u8_t;
u8c1
一日耶羅波安出
sqlite> .quit

[[bash SQLite]] sqlite Club.sl3
SQLite version 3.7.13 2012-06-11 02:05:22
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>

Cordiali saluti/Vriendelijke groeten/Kind regards,
Klaas V
http://innocentisart.net
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
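The hex bytes posted above match the UTF-16 little-endian encoding of the seven characters. The following is a minimal Python sketch, not part of the original thread, for producing either of the dumps Dan asked for. The names `TEXT` and `hex_dump` are made up for illustration.

```python
# Illustrative helper (not from the thread): dump a string's bytes as hex.
TEXT = "一日耶羅波安出"


def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated, upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


# UTF-16 little-endian: two bytes per character for this text.
utf16le_hex = hex_dump(TEXT.encode("utf-16-le"))
# UTF-8: three bytes per character for this text.
utf8_hex = hex_dump(TEXT.encode("utf-8"))

print("utf-16-le:", utf16le_hex)
print("utf-8:    ", utf8_hex)
```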
Re: [sqlite] Odd insertion error FTS4 + ICU
On 06/19/2012 04:28 AM, E. Timothy Uy wrote:
> Dear Dan,
>
> With the change from U8_NEXT to U16_NEXT, I am able to insert 一日耶羅波安出. I
> was also able to insert the rest of the data set (about 31000 more rows
> containing both traditional and simplified Chinese). Is this an ICU error?
> Seems like everything should be using U8_ in the tokenizer.

U16_NEXT is correct, as that buffer contains utf-16 characters. Data is
converted to utf-16 before it is tokenized as ICU does not provide a
break-iterator that operates directly on utf-8.
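To illustrate why the iterator has to match the buffer's encoding, here is a small Python sketch. It is not the SQLite or ICU source and does not reproduce the tokenizer bug itself; it only shows that the same text has a different number of units, and different per-character offsets, in UTF-8 and in UTF-16. The helper names are hypothetical.

```python
# Illustration only: compare how one string is laid out in UTF-8 and UTF-16.
TEXT = "一日耶羅波安出"


def utf16_units(text: str) -> list[int]:
    """Return the UTF-16 code units of text as integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def utf8_offsets(text: str) -> list[int]:
    """Return the byte offset at which each character starts in UTF-8."""
    offsets, pos = [], 0
    for ch in text:
        offsets.append(pos)
        pos += len(ch.encode("utf-8"))
    return offsets


units = utf16_units(TEXT)
offsets = utf8_offsets(TEXT)
print("UTF-16 code units:", units)        # one unit per character here
print("UTF-8 start offsets:", offsets)    # three bytes per character here
```

An offset computed while walking one representation is therefore not valid in the other, which fits Dan's explanation that a buffer of UTF-16 data needs the U16_ macro.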
Re: [sqlite] Odd insertion error FTS4 + ICU
Dear Dan,

With the change from U8_NEXT to U16_NEXT, I am able to insert 一日耶羅波安出. I
was also able to insert the rest of the data set (about 31000 more rows
containing both traditional and simplified Chinese). Is this an ICU error?
Seems like everything should be using U8_ in the tokenizer.

Thank you much.

Respectfully,
Tim

On Mon, Jun 18, 2012 at 2:20 PM, E. Timothy Uy wrote:
> I'll take a look right now. Though my first thought was if you change
> U8_NEXT to U16_NEXT, wouldn't you have to change it everywhere else? I
> recompiled ICU with U_CHARSET_IS_UTF8 earlier and this did not help.
>
> On Mon, Jun 18, 2012 at 2:06 PM, Dan Kennedy wrote:
>> On 06/19/2012 03:39 AM, E. Timothy Uy wrote:
>> > If anyone can unravel this mystery, it would be much appreciated. For now,
>> > I inserted a comma - 一日、耶羅波安出 and it works. I suspect it must be somehow
>> > that the sequence of bytes encodes another character, which throws the
>> > tokenizer out of whack or maybe the fts4aux table.
>>
>> Can you try with this:
>>
>> http://www.sqlite.org/src/info/892b74116a
>>
>> Thanks.
Re: [sqlite] Odd insertion error FTS4 + ICU
I'll take a look right now. Though my first thought was if you change
U8_NEXT to U16_NEXT, wouldn't you have to change it everywhere else? I
recompiled ICU with U_CHARSET_IS_UTF8 earlier and this did not help.

On Mon, Jun 18, 2012 at 2:06 PM, Dan Kennedy wrote:
> On 06/19/2012 03:39 AM, E. Timothy Uy wrote:
> > If anyone can unravel this mystery, it would be much appreciated. For now,
> > I inserted a comma - 一日、耶羅波安出 and it works. I suspect it must be somehow
> > that the sequence of bytes encodes another character, which throws the
> > tokenizer out of whack or maybe the fts4aux table.
>
> Can you try with this:
>
> http://www.sqlite.org/src/info/892b74116a
>
> Thanks.
Re: [sqlite] Odd insertion error FTS4 + ICU
On 06/19/2012 03:39 AM, E. Timothy Uy wrote:
> If anyone can unravel this mystery, it would be much appreciated. For now,
> I inserted a comma - 一日、耶羅波安出 and it works. I suspect it must be somehow
> that the sequence of bytes encodes another character, which throws the
> tokenizer out of whack or maybe the fts4aux table.

Can you try with this:

http://www.sqlite.org/src/info/892b74116a

Thanks.
Re: [sqlite] Odd insertion error FTS4 + ICU
If anyone can unravel this mystery, it would be much appreciated. For now,
I inserted a comma - 一日、耶羅波安出 and it works. I suspect it must be somehow
that the sequence of bytes encodes another character, which throws the
tokenizer out of whack or maybe the fts4aux table.

一 19968 %E4%B8%80
日 26085 %E6%97%A5
耶 32822 %E8%80%B6
羅 32645 %E7%BE%85
波 27874 %E6%B3%A2
安 23433 %E5%AE%89
出 20986 %E5%87%BA

On Mon, Jun 18, 2012 at 12:59 PM, E. Timothy Uy wrote:
> Thanks for writing back, Dan. Using charCodeAt() in JavaScript, I have the
> following for 一日耶羅波安出:
>
> 19968
> 26085
> 32822
> 32645
> 27874
> 23433
> 20986
>
> I tried entering subsets of the data:
>
> 一日耶羅波安出 - Error: SQL logic error or missing database <-- target
> 一日耶羅波安 - Ok
> 日耶羅波安出 - Ok
> 耶羅波安出 - Ok
> 一日耶羅波安出x - Error: SQL logic error or missing database
> 一日耶羅波安x出 - Error: SQL logic error or missing database
> 一日耶羅波x安出 - Error: SQL logic error or missing database
> 一日耶羅x波安出 - Ok
> 一日耶x羅波安出 - Ok
> 一日x耶羅波安出 - Ok
> 一x日耶羅波安出 - Ok
> x一日耶羅波安出 - Ok
>
> I'm a bit concerned that this might be an indicator for a deeper issue.
> Running Ubuntu Linux x64.
>
> Respectfully,
> Tim
>
> On Mon, Jun 18, 2012 at 12:29 PM, Dan Kennedy wrote:
>> On 06/19/2012 02:11 AM, E. Timothy Uy wrote:
>> > I recompiled ICU using U_CHARSET_IS_UTF8 and the error persists.
>> >
>> > On Mon, Jun 18, 2012 at 11:45 AM, E. Timothy Uy wrote:
>> >> Hopefully someone has some insight on this. I am using FTS4 with
>> >> tokenize=icu (and PRAGMA encoding="UTF-8"). I'm getting an error
>> >> inserting the following into my virtual table:
>> >>
>> >> 一日耶羅波安出
>>
>> Can you post the list of codepoints in this text? Or the hex
>> of the utf-16 or utf-8 encoding of the same?
Re: [sqlite] Odd insertion error FTS4 + ICU
Thanks for writing back, Dan. Using charCodeAt() in JavaScript, I have the
following for 一日耶羅波安出:

19968
26085
32822
32645
27874
23433
20986

I tried entering subsets of the data:

一日耶羅波安出 - Error: SQL logic error or missing database <-- target
一日耶羅波安 - Ok
日耶羅波安出 - Ok
耶羅波安出 - Ok
一日耶羅波安出x - Error: SQL logic error or missing database
一日耶羅波安x出 - Error: SQL logic error or missing database
一日耶羅波x安出 - Error: SQL logic error or missing database
一日耶羅x波安出 - Ok
一日耶x羅波安出 - Ok
一日x耶羅波安出 - Ok
一x日耶羅波安出 - Ok
x一日耶羅波安出 - Ok

I'm a bit concerned that this might be an indicator for a deeper issue.
Running Ubuntu Linux x64.

Respectfully,
Tim

On Mon, Jun 18, 2012 at 12:29 PM, Dan Kennedy wrote:
> On 06/19/2012 02:11 AM, E. Timothy Uy wrote:
> > I recompiled ICU using U_CHARSET_IS_UTF8 and the error persists.
> >
> > On Mon, Jun 18, 2012 at 11:45 AM, E. Timothy Uy wrote:
> >> Hopefully someone has some insight on this. I am using FTS4 with
> >> tokenize=icu (and PRAGMA encoding="UTF-8"). I'm getting an error
> >> inserting the following into my virtual table:
> >>
> >> 一日耶羅波安出
>
> Can you post the list of codepoints in this text? Or the hex
> of the utf-16 or utf-8 encoding of the same?
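The marker-insertion test cases listed above can be generated rather than typed by hand. This is a hedged Python sketch, not something used in the thread; `insertion_variants` is a made-up helper that only builds the strings and leaves the actual insert to whatever routine is being tested.

```python
# Illustration only: build the "x inserted at each position" test strings.
TEXT = "一日耶羅波安出"


def insertion_variants(text: str, marker: str = "x") -> list[str]:
    """Insert marker at every position, starting from the end of text."""
    return [text[:i] + marker + text[i:] for i in range(len(text), -1, -1)]


variants = insertion_variants(TEXT)
for candidate in variants:
    print(candidate)
```

Working from the end toward the front matches the order of the list above, which makes it easy to see at which position the behaviour flips from error to Ok.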
Re: [sqlite] Odd insertion error FTS4 + ICU
On 06/19/2012 02:11 AM, E. Timothy Uy wrote:
> I recompiled ICU using U_CHARSET_IS_UTF8 and the error persists.
>
> On Mon, Jun 18, 2012 at 11:45 AM, E. Timothy Uy wrote:
>> Hopefully someone has some insight on this. I am using FTS4 with
>> tokenize=icu (and PRAGMA encoding="UTF-8"). I'm getting an error
>> inserting the following into my virtual table:
>>
>> 一日耶羅波安出

Can you post the list of codepoints in this text? Or the hex
of the utf-16 or utf-8 encoding of the same?
Re: [sqlite] Odd insertion error FTS4 + ICU
I recompiled ICU using U_CHARSET_IS_UTF8 and the error persists.

On Mon, Jun 18, 2012 at 11:45 AM, E. Timothy Uy wrote:
> Hopefully someone has some insight on this. I am using FTS4 with
> tokenize=icu (and PRAGMA encoding="UTF-8"). I'm getting an error
> inserting the following into my virtual table:
>
> 一日耶羅波安出
>
> If I add a byte to the front, let's say "|", it works. If I delete the
> first character, or delete the last, it works too. If I add more
> characters, it doesn't work. Seems like it is an encoding issue, and I
> wonder if it isn't because ICU is using UTF-16 internally. This is a
> 1/31000 problem but aggravating nonetheless. If I don't use tokenize=icu it
> works.
>
> The error is SQLITE_ERROR: SQL logic error or missing database.
>
> Respectfully,
> Tim
[sqlite] Odd insertion error FTS4 + ICU
Hopefully someone has some insight on this. I am using FTS4 with
tokenize=icu (and PRAGMA encoding="UTF-8"). I'm getting an error
inserting the following into my virtual table:

一日耶羅波安出

If I add a byte to the front, let's say "|", it works. If I delete the
first character, or delete the last, it works too. If I add more
characters, it doesn't work. Seems like it is an encoding issue, and I
wonder if it isn't because ICU is using UTF-16 internally. This is a
1/31000 problem but aggravating nonetheless. If I don't use tokenize=icu it
works.

The error is SQLITE_ERROR: SQL logic error or missing database.

Respectfully,
Tim
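For anyone who wants to check their own build, here is a minimal reproduction sketch using Python's sqlite3 module. It is an assumption-laden illustration, not the setup from the report: the table name `docs` and column `body` are made up, and it only works if the underlying SQLite library was compiled with both FTS4 and the ICU tokenizer (SQLITE_ENABLE_ICU). On builds without them, table creation fails and the function returns that error instead.

```python
import sqlite3

TEXT = "一日耶羅波安出"


def try_icu_insert(text: str) -> str:
    """Insert text into an FTS4 table using the ICU tokenizer.

    Returns "ok" on success, otherwise "error: <message from SQLite>".
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute('PRAGMA encoding = "UTF-8"')
        # Hypothetical table and column names, chosen for this sketch.
        conn.execute("CREATE VIRTUAL TABLE docs USING fts4(body, tokenize=icu)")
        conn.execute("INSERT INTO docs(body) VALUES (?)", (text,))
        return "ok"
    except sqlite3.Error as exc:
        # Covers a missing fts4 module, an unknown tokenizer, or an insert failure.
        return f"error: {exc}"
    finally:
        conn.close()


result = try_icu_insert(TEXT)
print(result)
```

The outcome depends entirely on the SQLite build in use, so no expected output is given here.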