SQLite Gurus,

In SQLIte FTS3/4, does the '*' (wildcard expansion character) discriminate
between alphanumeric characters vs non-alpha numeric characters when
matching? I have two test cases below which causes me to believe that it
does. Also, the OR operator appears to fail when matching against a
combination of alphanumerics and non-alphanumerics.

I'm using a custom tokenizer in each scenario (yes it works and the proof
is enclosed). In the first dataset, the data was tokenized such that any
alphanumeric character qualifies as part of a token.

In the second dataset, the data was tokenized such that anything other than
a semicolon qualifies as part of a token.\

The issues I'm raising is that 1) the '*' expansion doesn't seem to work
for a alphanumeric/non-alphanumeric token matches nor does the OR operator.
I haven't figured out what I'm missing here. I've been banging my head all
morning against tihs. I appreciate any help.

I've listed a working and non working example below since a comparison
should clarify best:

[THIS WORKS FINE]

MY DATASET:

STATE|NAMES

---------------------------

maryland|fred,louis,jenny

virginia|ruth,greg,denise

maine|richard,norman,willis

TOKENIZER CRITERIA USED: = any word characters (\w+)

(comment: all these names should be split into regular tokens and they were
as shown below)

FULL TEXT INDEX:

sqlite> select * from ft_terms;

term col documents occurrences

---------- ---------- ---------- -----------

fred * 1 1

fred 1 1 1

jenny * 1 1

jenny 1 1 1

louis * 1 1

louis 1 1 1

maryland * 1 1

maryland 0 1 1

ruth * 1 1

ruth 1 1 1

virginia * 1 1

virginia 0 1 1

QUERY:

select * from word where word match 'mary* jen*'

RETURNS:

maryland|fred,louis,jenny

QUERY:

select * from word where word match 'mary* OR v*'

RETURNS:

maryland|fred,louis,jenny

virginia|ruth,greg,denise

All the above behaves as expected. Now lets introduce some non-alphanumerics

--------------------------------------------------------------------------

[THIS DOESN'T WORK]

DATASET:

ROWSET|PAIR

----------------------

1 A=15;B=16;C=38

2 D=15;E=25;F=16

TOKENIZER CRITERIA USED: = any character that is NOT a semicolon ([^;]+)

(comment: all these PAIR values should be tokenized by semicolon and they
were as shown below)

FULL TEXT INDEX:

term col documents occurrences

---------- ---------- ---------- -----------

1 * 1 1

1 0 1 1

2 * 1 1

2 0 1 1

A=15 * 1 1

A=15 1 1 1

B=16 * 1 1

B=16 1 1 1

C=38 * 1 1

C=38 1 1 1

D=15 * 1 1

D=15 1 1 1

E=25 * 1 1

E=25 1 1 1

F=16 * 1 1

F=16 1 1 1

QUERY1:

select * from NUMMY where NUMMY MATCH 'A=* OR D=*'

RETURNS:

(nothing)

comment: Should have returned ROWSET 1 and 2 (refer to above dataset)

QUERY2:

select * from NUMMY where NUMMY MATCH 'A* C*'

RETURNS:

(nothing)

comment: Should have returned ROWSET 1 (refer to above dataset)

--HOWEVER SPECIFYING THE FULL TOKEN WORKS--

QUERY3:

select * from NUMMY where NUMMY MATCH 'A=15'

RETURNS:

1|A=15;B=16;C=38

QUERY4:

select * from NUMMY where NUMMY MATCH 'A=15'

RETURNS:

1|A=15;B=16;C=38

QUERY5:

select * from NUMMY where NUMMY MATCH 'E=25'

RETURNS:

2|D=15;E=25;F=16

--THIS SUCCESS IS SHORTLIVED--

QUERY6:

select * from NUMMY where NUMMY MATCH 'E=25 OR B=16'

RETURNS:

(nothing)

This query should have returned rowset 1 and two as this was an OR query
and both creiteria are met.

In summary, the wildcard expansion as well as the OR operator seems not to
work in the second example. There only main difference between the two data
sets in that the first set is composed of alpha characters only and the
second is a combination of alphanumeric and non-aplhanumerica characters.
The (*) expansion character is not matching these. A match only occurs when
you specify the full token.

Thanks for your time in looking at this issue.
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to