[sqlite] FTS simple tokenizer with custom delimiters

2012-05-06 Thread Jos Groot Lipman
While looking around in the source of the simple tokenizer I found code that
suggests custom delimiters can be specified (I want to exclude the
underscore).
 
 
http://www.sqlite.org/src/artifact/5c98225a53705e5ee34824087478cf477bdb7004?ln=76-87
 
And indeed:
  CREATE VIRTUAL TABLE ft USING fts3(title, body, tokenize=simple XX
[&'\" *()./\\=,:;%<>-?!])
seems to work fine.
 
As far as I can tell this feature is undocumented, which means I am not
supposed to use it.
Is this:
- An oversight
- For good reason as it is unstable
- or: because the syntax might change in the near future?
 
Also: I need to include the dummy XX because the delimiters are read from
argv[1] instead of argv[0]. I cannot find what argv[0] is supposed to
do here. Any reason?
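
A minimal sketch of the behaviour described above, written as a small Python
model rather than SQLite code. It assumes, as the message reports, that the
delimiter characters are read from argv[1] and that argv[0] is ignored; it
also assumes that without a second argument every non-alphanumeric ASCII
character acts as a delimiter. The names build_delimiters and tokenize and
the sample delimiter string are hypothetical.

```python
import string

ASCII_ALNUM = set(string.ascii_letters + string.digits)


def build_delimiters(argv):
    """Return the delimiter set for a list of tokenizer arguments.

    Assumption (taken from the message above): argv[0] is ignored and
    argv[1], when present, lists the delimiter characters.
    """
    if len(argv) > 1:
        return set(argv[1])
    # No explicit list: every non-alphanumeric ASCII character delimits.
    return {chr(c) for c in range(1, 0x80)} - ASCII_ALNUM


def tokenize(text, delimiters):
    """Split text on delimiter characters and lower-case ASCII letters."""
    tokens, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch.lower() if ch in string.ascii_uppercase else ch)
    if current:
        tokens.append("".join(current))
    return tokens


default_delims = build_delimiters(["XX"])             # dummy argument only
custom_delims = build_delimiters(["XX", " .,;:!?-"])  # underscore not listed

print(tokenize("Call method_do_this now.", default_delims))
print(tokenize("Call method_do_this now.", custom_delims))
```

In this model the default set splits method_do_this into three terms, while
the custom list keeps it as one term because the underscore is not listed.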
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] FTS simple tokenizer

2012-02-28 Thread Matt Young
Using the _ character to separate words is an informal language standard,
as in: method_do_this...

On Tue, Feb 28, 2012 at 12:40 AM, Dan Kennedy wrote:

> On 02/28/2012 12:09 AM, Jos Groot Lipman wrote:
>
>> It was reported before (and not solved)
>> http://www.mail-archive.com/sqlite-users@sqlite.org/msg55959.html
>>
>
> The document sources are updated now. So the fix will appear on
> the website next time it is regenerated.


Re: [sqlite] FTS simple tokenizer

2012-02-28 Thread Dan Kennedy

On 02/28/2012 12:09 AM, Jos Groot Lipman wrote:

> It was reported before (and not solved)
> http://www.mail-archive.com/sqlite-users@sqlite.org/msg55959.html

The document sources are updated now. So the fix will appear on
the website next time it is regenerated.


Re: [sqlite] FTS simple tokenizer

2012-02-27 Thread Jos Groot Lipman
It was reported before (and not solved)
http://www.mail-archive.com/sqlite-users@sqlite.org/msg55959.html 

> -Original Message-
> From: sqlite-users-boun...@sqlite.org 
> [mailto:sqlite-users-boun...@sqlite.org] On Behalf Of Hamish Allan
> Sent: maandag 27 februari 2012 11:27
> To: General Discussion of SQLite Database
> Cc: sqlite-users@sqlite.org
> Subject: Re: [sqlite] FTS simple tokenizer
> 
> Thanks Dan. Have just checked how to report bug, and 
> apparently we already have :)
> 
> Please excuse the brevity -- sent from my phone
> 
> On 27 Feb 2012, at 07:06, Dan Kennedy <danielk1...@gmail.com> wrote:
> 
> > On 02/27/2012 05:59 AM, Hamish Allan wrote:
> >> The docs for the simple tokenizer
> >> (http://www.sqlite.org/fts3.html#tokenizer) say:
> >> 
> >> "A term is a contiguous sequence of eligible characters, where 
> >> eligible characters are all alphanumeric characters, the "_"
> >> character, and all characters with UTF codepoints greater than or 
> >> equal to 128."
> >> 
> >> If I do:
> >> 
> >> CREATE VIRTUAL TABLE test USING fts3();
> >> INSERT INTO test (content) VALUES ('hello_world');
> >>
> >> SELECT * FROM test WHERE content MATCH 'orld';
> >> SELECT * FROM test WHERE content MATCH 'world';
> >>
> >> I get no match for the first query, because it doesn't match a term,
> >> but I get a match for the second, whereas according to my reading of
> >> the docs "world" shouldn't be a term because the underscore character
> >> shouldn't be considered a term break.
> >> 
> >> Can anyone please help me understand this behaviour?
> > 
> > Documentation bug. Eligible characters are just alphanumerics and
> > UTF codepoints greater than 128.
> > 
> > Dan.


Re: [sqlite] FTS simple tokenizer

2012-02-27 Thread Hamish Allan
Thanks Dan. Have just checked how to report bug, and apparently we already have :)

Please excuse the brevity -- sent from my phone

On 27 Feb 2012, at 07:06, Dan Kennedy wrote:

> On 02/27/2012 05:59 AM, Hamish Allan wrote:
>> The docs for the simple tokenizer
>> (http://www.sqlite.org/fts3.html#tokenizer) say:
>> 
>> "A term is a contiguous sequence of eligible characters, where
>> eligible characters are all alphanumeric characters, the "_"
>> character, and all characters with UTF codepoints greater than or
>> equal to 128."
>> 
>> If I do:
>> 
>> CREATE VIRTUAL TABLE test USING fts3();
>> INSERT INTO test (content) VALUES ('hello_world');
>> 
>> SELECT * FROM test WHERE content MATCH 'orld';
>> SELECT * FROM test WHERE content MATCH 'world';
>> 
>> I get no match for the first query, because it doesn't match a term,
>> but I get a match for the second, whereas according to my reading of
>> the docs "world" shouldn't be a term because the underscore character
>> shouldn't be considered a term break.
>> 
>> Can anyone please help me understand this behaviour?
> 
> Documentation bug. Eligible characters are just alphanumerics and
> UTF codepoints greater than 128.
> 
> Dan.


Re: [sqlite] FTS simple tokenizer

2012-02-26 Thread Dan Kennedy

On 02/27/2012 05:59 AM, Hamish Allan wrote:

> The docs for the simple tokenizer
> (http://www.sqlite.org/fts3.html#tokenizer) say:
>
> "A term is a contiguous sequence of eligible characters, where
> eligible characters are all alphanumeric characters, the "_"
> character, and all characters with UTF codepoints greater than or
> equal to 128."
>
> If I do:
>
> CREATE VIRTUAL TABLE test USING fts3();
> INSERT INTO test (content) VALUES ('hello_world');
>
> SELECT * FROM test WHERE content MATCH 'orld';
> SELECT * FROM test WHERE content MATCH 'world';
>
> I get no match for the first query, because it doesn't match a term,
> but I get a match for the second, whereas according to my reading of
> the docs "world" shouldn't be a term because the underscore character
> shouldn't be considered a term break.
>
> Can anyone please help me understand this behaviour?

Documentation bug. Eligible characters are just alphanumerics and
UTF codepoints greater than 128.

Dan.
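
To make the corrected rule concrete, here is a rough Python model of it (a
sketch, not the SQLite implementation). It assumes that a term is a run of
ASCII alphanumerics and characters with code points of 128 or above (the
threshold quoted from the docs), that ASCII upper case is folded to lower
case, and that a plain MATCH on a single word succeeds only when the word
equals a whole term. The function names is_token_char, terms and matches are
hypothetical.

```python
import string

TOKEN_ASCII = set(string.ascii_letters + string.digits)


def is_token_char(ch):
    """Alphanumeric ASCII, or any code point >= 128, is part of a term."""
    return ch in TOKEN_ASCII or ord(ch) >= 128


def terms(text):
    """Split text into terms; everything else (including "_") is a break."""
    out, current = [], []
    for ch in text:
        if is_token_char(ch):
            current.append(ch.lower() if ch in string.ascii_uppercase else ch)
        elif current:
            out.append("".join(current))
            current = []
    if current:
        out.append("".join(current))
    return out


def matches(document, query_term):
    """True when query_term is one of the document's whole terms."""
    return query_term in terms(document)


print(terms("hello_world"))
print(matches("hello_world", "orld"))
print(matches("hello_world", "world"))
```

Under these assumptions 'hello_world' yields the two terms hello and world,
so 'world' matches and 'orld' does not, which is the behaviour reported in
the question.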


[sqlite] FTS simple tokenizer

2012-02-26 Thread Hamish Allan
The docs for the simple tokenizer
(http://www.sqlite.org/fts3.html#tokenizer) say:

"A term is a contiguous sequence of eligible characters, where
eligible characters are all alphanumeric characters, the "_"
character, and all characters with UTF codepoints greater than or
equal to 128."

If I do:

CREATE VIRTUAL TABLE test USING fts3();
INSERT INTO test (content) VALUES ('hello_world');

SELECT * FROM test WHERE content MATCH 'orld';
SELECT * FROM test WHERE content MATCH 'world';

I get no match for the first query, because it doesn't match a term,
but I get a match for the second, whereas according to my reading of
the docs "world" shouldn't be a term because the underscore character
shouldn't be considered a term break.

Can anyone please help me understand this behaviour?

Thanks,
Hamish