Re: [sqlite] Odd insertion error FTS4 + ICU (E. Timothy Uy)

2012-06-20 Thread Klaas Van Be
>> >> inserting the following into my virtual table:
>> >>
>> >> 一日耶羅波安出
>>
>> Can you post the list of codepoints in this text? Or the hex
>> of the utf-16 or utf-8 encoding of the same?

00 4E E5 65 36 80 85 7F E2 6C 89 5B FA 51
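
The bytes above match the UTF-16 little-endian encoding of the string. A minimal sketch for reproducing the dump, assuming Python 3 is at hand (the helper name `hex_dump` is made up for illustration):

```python
TEXT = "一日耶羅波安出"

def hex_dump(data):
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

# UTF-16LE gives two bytes per character here; UTF-8 gives three.
utf16le_hex = hex_dump(TEXT.encode("utf-16-le"))
utf8_hex = hex_dump(TEXT.encode("utf-8"))

print(utf16le_hex)
print(utf8_hex)
```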


No problem inserting this string here (Mac OS X 10.6.8):

sqlite> create table u8_t (u8c1 varchar(32));
sqlite> insert into u8_t values ('一日耶羅波安出');
sqlite> .mode list
sqlite> select * from u8_t;
u8c1
一日耶羅波安出
sqlite> .quit
[[bash SQLite]]
sqlite Club.sl3
SQLite version 3.7.13 2012-06-11 02:05:22
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> 

 

Kind regards,
Klaas V
http://innocentisart.net
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Odd insertion error FTS4 + ICU

2012-06-18 Thread Dan Kennedy
On 06/19/2012 04:28 AM, E. Timothy Uy wrote:
> Dear Dan,
> 
> With the change from U8_NEXT to U16_NEXT, I am able to insert 一日耶羅波安出. I
> was also able to insert the rest of the data set (about 31000 more rows
> containing both traditional and simplified Chinese). Is this an ICU error?
> Seems like everything should be using U8_ in the tokenizer.

U16_NEXT is correct, as that buffer contains UTF-16 code units. Data
is converted to UTF-16 before it is tokenized, as ICU does not provide
a break-iterator that operates directly on UTF-8.
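
As a loose analogy only (this is Python, not the SQLite or ICU code), the sketch below shows why the routine that walks a buffer has to match the buffer's encoding: the same text has a different number of units in UTF-8 and UTF-16, and reading UTF-16 data as if it were UTF-8 does not give back the original text. The helper name `decode_as_utf8` is hypothetical.

```python
TEXT = "一日耶羅波安出"

utf8_bytes = TEXT.encode("utf-8")
utf16_bytes = TEXT.encode("utf-16-le")

# Number of units a decoder steps over in each representation.
utf8_units = len(utf8_bytes)          # one unit per byte
utf16_units = len(utf16_bytes) // 2   # one unit per 16-bit code unit

def decode_as_utf8(buf):
    """Decode buf as UTF-8, reporting failure instead of raising."""
    try:
        return buf.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"decode error at byte {exc.start}"

# Feeding the UTF-16 buffer to a UTF-8 decoder does not round-trip.
mismatch = decode_as_utf8(utf16_bytes)
print(utf8_units, utf16_units, mismatch)
```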


Re: [sqlite] Odd insertion error FTS4 + ICU

2012-06-18 Thread E. Timothy Uy
Dear Dan,

With the change from U8_NEXT to U16_NEXT, I am able to insert 一日耶羅波安出. I
was also able to insert the rest of the data set (about 31000 more rows
containing both traditional and simplified Chinese). Is this an ICU error?
Seems like everything should be using U8_ in the tokenizer.

Thank you very much.

Respectfully,
Tim




Re: [sqlite] Odd insertion error FTS4 + ICU

2012-06-18 Thread E. Timothy Uy
I'll take a look right now. My first thought, though, was that if you change
U8_NEXT to U16_NEXT, wouldn't you have to change it everywhere else? I
recompiled ICU with U_CHARSET_IS_UTF8 earlier and this did not help.



Re: [sqlite] Odd insertion error FTS4 + ICU

2012-06-18 Thread Dan Kennedy
On 06/19/2012 03:39 AM, E. Timothy Uy wrote:
> If anyone can unravel this mystery, it would be much appreciated. For now,
> I inserted a comma - 一日、耶羅波安出 and it works. I suspect it must be somehow
> that the sequence of bytes encodes another character, which throws the
> tokenizer out of whack or maybe the fts4aux table.

Can you try with this:

  http://www.sqlite.org/src/info/892b74116a

Thanks.


Re: [sqlite] Odd insertion error FTS4 + ICU

2012-06-18 Thread E. Timothy Uy
If anyone can unravel this mystery, it would be much appreciated. For now,
I inserted a comma (一日、耶羅波安出) and it works. I suspect that the byte
sequence is somehow being read as a different character, which throws off
the tokenizer or perhaps the fts4aux table.

Char  Code point (decimal)  Percent-encoded UTF-8
一    19968                 %E4%B8%80
日    26085                 %E6%97%A5
耶    32822                 %E8%80%B6
羅    32645                 %E7%BE%85
波    27874                 %E6%B3%A2
安    23433                 %E5%AE%89
出    20986                 %E5%87%BA
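
The values listed above can be regenerated with a short script. This is a sketch assuming Python 3 and its standard `urllib.parse.quote`, which percent-encodes the UTF-8 bytes of each character by default; the variable name `rows` is just for illustration.

```python
from urllib.parse import quote

TEXT = "一日耶羅波安出"

# (character, decimal code point, percent-encoded UTF-8) for each character
rows = [(ch, ord(ch), quote(ch)) for ch in TEXT]

for ch, codepoint, encoded in rows:
    print(ch, codepoint, encoded)
```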




Re: [sqlite] Odd insertion error FTS4 + ICU

2012-06-18 Thread E. Timothy Uy
Thanks for writing back, Dan. Using charCodeAt() in JavaScript, I get the
following code units for 一日耶羅波安出:

19968
26085
32822
32645
27874
23433
20986

I tried entering subsets of the data:

一日耶羅波安出 - Error: SQL logic error or missing database <-- target
一日耶羅波安 - Ok
日耶羅波安出 - Ok
耶羅波安出 - Ok
一日耶羅波安出x - Error: SQL logic error or missing database
一日耶羅波安x出 - Error: SQL logic error or missing database
一日耶羅波x安出 - Error: SQL logic error or missing database
一日耶羅x波安出 - Ok
一日耶x羅波安出 - Ok
一日x耶羅波安出 - Ok
一x日耶羅波安出 - Ok
x一日耶羅波安出 - Ok
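
The "x" variants in the list above can be generated mechanically, which may help when repeating the experiment on other strings. A small sketch, assuming Python 3; the function name `with_marker` is hypothetical, and it only builds the test strings without touching SQLite.

```python
TEXT = "一日耶羅波安出"

def with_marker(text, marker="x"):
    """Return text with marker inserted at every position, end first."""
    return [text[:i] + marker + text[i:] for i in range(len(text), -1, -1)]

variants = with_marker(TEXT)

for variant in variants:
    print(variant)
```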

I'm a bit concerned that this might be an indicator of a deeper issue.
Running Ubuntu Linux x64.

Respectfully,
Tim




Re: [sqlite] Odd insertion error FTS4 + ICU

2012-06-18 Thread Dan Kennedy
On 06/19/2012 02:11 AM, E. Timothy Uy wrote:
> I recompiled ICU using U_CHARSET_IS_UTF8 and the error persists.
> 
> On Mon, Jun 18, 2012 at 11:45 AM, E. Timothy Uy  wrote:
> 
>> Hopefully someone has some insight on this. I am using FTS4 with
>> tokenize=icu (and PRAGMA encoding="UTF-8"). I'm getting an error
>> inserting the following into my virtual table:
>>
>> 一日耶羅波安出

Can you post the list of codepoints in this text? Or the hex
of the utf-16 or utf-8 encoding of the same?


Re: [sqlite] Odd insertion error FTS4 + ICU

2012-06-18 Thread E. Timothy Uy
I recompiled ICU using U_CHARSET_IS_UTF8 and the error persists.



[sqlite] Odd insertion error FTS4 + ICU

2012-06-18 Thread E. Timothy Uy
Hopefully someone has some insight on this. I am using FTS4 with
tokenize=icu (and PRAGMA encoding="UTF-8"). I'm getting an error
inserting the following into my virtual table:

一日耶羅波安出

If I add a byte to the front, let's say "|", it works. If I delete the first
character, or delete the last, it works too. If I add more characters, it
doesn't work. It seems like an encoding issue, and I wonder whether it is
because ICU uses UTF-16 internally. This affects only 1 row in about 31000,
but it is aggravating nonetheless. If I don't use tokenize=icu it works.

The error is SQLITE_ERROR: SQL logic error or missing database.
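
For anyone wanting to try the same insert, here is a rough sketch using Python's bundled `sqlite3` module. It rests on several assumptions: the table name `docs` and column `body` are hypothetical, many SQLite builds do not include the ICU tokenizer (and some lack FTS4), and a build that already contains a fix would not show the error. In those cases the function simply reports what SQLite said.

```python
import sqlite3

TEXT = "一日耶羅波安出"

def try_insert(tokenizer):
    """Create an in-memory FTS4 table with the given tokenizer and insert TEXT.

    Returns "ok" on success, or "error: ..." with SQLite's message if the
    build lacks FTS4 or the tokenizer, or if the insert itself fails.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table and column names, for illustration only.
        conn.execute(
            f"CREATE VIRTUAL TABLE docs USING fts4(body, tokenize={tokenizer})"
        )
        conn.execute("INSERT INTO docs(body) VALUES (?)", (TEXT,))
        return "ok"
    except sqlite3.Error as exc:
        return f"error: {exc}"
    finally:
        conn.close()

for name in ("simple", "icu"):
    print(name, "->", try_insert(name))
```

Comparing the `simple` and `icu` results mirrors the observation above that the insert works when tokenize=icu is not used.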

Respectfully,
Tim