Re: [sqlite] Custom Collation comparing only firt character?

2011-08-29 Thread Igor Tandetnik
Jean-Denis MUYS  wrote:
> int collationAnyCIAI(void *arg1, int str1Length, const void *str1, int 
> str2Length, const void *str2) {
>
> But maybe you have good reasons to ignore the Apple-provided collations. In 
> that case, I still don't get why you don't work
> directly on the Unicode strings. You call hexStringWithData:ofLength: (which 
> is not an Apple method and for which you don't
> provide the code). Why? What would be wrong with:  
> 
> NSString *strA = [[NSString alloc] initWithUTF8String:str1]; // release me 
> before returning
> NSString *strB = [[NSString alloc] initWithUTF8String:str2]; // release me 
> before returning

str1 and str2 are UTF-16 strings, not UTF-8 strings (since the OP passed 
SQLITE_UTF16 flag to sqlite3_create_collation; if UTF-8 strings are more 
convenient to work with, he could have passed SQLITE_UTF8 instead). Further, 
the strings may not be NUL-terminated, that's why you are given their lengths.
-- 
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Custom Collation comparing only firt character?

2011-08-29 Thread Jean-Denis MUYS

On 25 août 2011, at 07:58, Roberto Colnaghi wrote:

int collationAnyCIAI(void *arg1, int str1Length, const void *str1, int 
str2Length, const void *str2) {
NSString *strA = [NSString hexStringWithData:str1 ofLength:1];
NSString *strB = [NSString hexStringWithData:str2 ofLength:1];
int striA;
sscanf([strA cString], "%x", &striA);
int striB;
sscanf([strB cString], "%x", &striB);

I am a lot more experienced with iOS than SQLite, so my 2 cents are biased. In 
particular, I have never had to use a custom collation for SQLite. That being 
said, the snippet of code I quoted makes me cringe a little, and unless you 
have reasons to avoid Cocoa [Touch] entirely, why don't you use it?

In the present case, the NSString class has extensive comparison options, 
including the following:

NSDiacriticInsensitiveSearch
Search ignores diacritic marks.
For example, ‘ö’ is equal to ‘o’.
Available in iOS 2.0 and later.
Declared in NSString.h.

This option is passed to:
- (NSComparisonResult)compare:(NSString *)aString 
options:(NSStringCompareOptions)mask
Using this option is as simple as calling:

return [strA compare:strB options: NSDiacriticInsensitiveSearch];

You also have variations that take into account the user's current locale:

return [strA localizedCompare:strB];
return [strA localizedCaseInsensitiveCompare:strB];

As those two routines end up calling

- (NSComparisonResult)compare:(NSString *)aString 
options:(NSStringCompareOptions)mask range:(NSRange)range locale:(id)locale

They call it passing the current user locale as the locale ([NSLocale 
currentLocale]), and the string full range as the range.

You could even use a "smart" comparison using

return [strA localizedStandardCompare: strB]

The methods orders strings in a "natural" order "as in the Finder". For 
example, it would order the strings "file 9.txt", "file 20.txt", "file 100.txt" 
in that order.

(By the way, the result returned by Cocoa string comparison methods seems 
compatible with what SQLite expects:


enum {
   
NSOrderedAscending
 = -1,
   
NSOrderedSame,
   
NSOrderedDescending
};
typedef NSInteger NSComparisonResult;


if I am mistaken (I haven't looked it put), it looks like converting it should 
be straightforward).

But maybe you have good reasons to ignore the Apple-provided collations. In 
that case, I still don't get why you don't work directly on the Unicode 
strings. You call hexStringWithData:ofLength: (which is not an Apple method and 
for which you don't provide the code). Why? What would be wrong with:

NSString *strA = [[NSString alloc] initWithUTF8String:str1]; // release me 
before returning
NSString *strB = [[NSString alloc] initWithUTF8String:str2]; // release me 
before returning

Or, if the strings from SQLite are not UTF-8 encoded, use

- (id)initWithCString:(const char *)nullTerminatedCString 
encoding:(NSStringEncoding)encoding

Then you don't have to do this wasteful charade with hex conversion.

Finally, if hexStringWithData:ofLength: is properly implemented, it returns an 
autoreleased string. If the collation routine is called a lot by SQLite, you 
will allocate NSStrings like crazy, which is not good citizenship on an iPhone 
that could lead your app to be killed by the OS, even if the autoreleased 
strings eventually get deallocated the next time in the run loop. Unless you 
wrap the whole enchilada in your own autorelease pool (you didn't show that). 
In your specific case, it's better to allocate and release your NSStrings right 
in your collation routine.

I hope this helps a little.

JD














































___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Custom Collation comparing only firt character?

2011-08-27 Thread Jean-Christophe Deschamps

>Cool - can you please post a link to the sources?

Sure, here you are:
http://dl.dropbox.com/u/26433628/unifuzz.zip

The .c source contains all details (and all the bugs).


--
j...@antichoc.net  

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Custom Collation comparing only firt character?

2011-08-26 Thread Stephan Beal
On Sat, Aug 27, 2011 at 12:04 AM, Jean-Christophe Deschamps <
j...@antichoc.net> wrote:

> If you also need to search names with uncertain spelling, you can also
> use my typos() function to perform a fuzzy search. Here's a sample of
> its use on a decently populated ZipCodes table (848207 rows):

select pays, zip, ville, region from allcountries where typos(ville,
> 'saopaul%') < 3 group by pays, ville, region
>

Cool - can you please post a link to the sources?

-- 
- stephan beal
http://wanderinghorse.net/home/stephan/
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Custom Collation comparing only firt character?

2011-08-26 Thread Jean-Christophe Deschamps
Roberto,

>Though I cannot use DLLs since it is an iPhone iOS (MacOSX) 
>operational system.

I made it DLL by default at build time since it fits my needs.  You can 
still compile the extension (or part of it) as a standard .o obj and 
statically link it into your application.

>I was hoping for a collation callback that is called for all 
>characters, not only the first.

A collation always works on the full arguments it is supplied with, 
i.e. on the whole strings that are to ba collated.  This behavior is 
expected and fully docuented, like Igor also points out in his recent 
reply.

>Shouldn't sqlite3_create_collation be called for every single character?

No it has to be called once for each connection and every external (not 
already defined by default in the SQLite core) collation function you 
require.

>  Let's say the comparing names are "São Paulo" and "Santos". 
> ->  SELECT * FROM Game WHERE TeamHome = 'SANTOS' COLLATE anyCIAI;

I have no idea what anyCIAI means in this context.  The extension I 
proposed offers 4 new collation functions (NOCASE which overrides the 
builtin NOCASE, UNACCENTED, NAMES and NUMERICS).  Since these 
internally use a Windows call, you can't use their code as is.  What I 
would do in your situation is write a new collation relying on the 
unaccenting internal functions provided in the extension.

>The LOG function shows a comparison between S and other first char 
>only only:
>
>41 65 A - 53 83 S = -18
>43 67 C - 53 83 S = -16
>46 70 F - 53 83 S = -13
>53 83 S - 53 83 S = 0
>49 73 I - 53 83 S = -10
>46 70 F - 53 83 S = -13
>50 80 P - 53 83 S = -3
>43 67 C - 53 83 S = -16
>47 71 G - 53 83 S = -12
>I was expecting it to go further in the comparison:"São Paulo" and 
>"Santos" should LOGS - Sã - ao - n -> stops here, not what your looking for
>When using it on ORDER BY, it is clear that only the first char is 
>compared.

I don't know what LOG / LOGS are in this context.

If you also need to search names with uncertain spelling, you can also 
use my typos() function to perform a fuzzy search. Here's a sample of 
its use on a decently populated ZipCodes table (848207 rows):

select pays, zip, ville, region from allcountries where typos(ville, 
'saopaul%') < 3 group by pays, ville, region

RecNo Pays Zip   VilleRegion
-  -  --
 1 AR   6221  LA PAULINA   LA PAMPA
 2 AU   2031  St Pauls New South Wales
 3 BR   64670-000 São Julião   Piaui
 4 BR   01000-000 São PauloSao Paulo
 5 BR   97980-000 São Paulo das MissõesRio Grande do Sul
 6 BR   69600-000 São Paulo de OlivençaAmazonas
 7 BR   59460-000 São Paulo do Potengi Rio Grande do Norte
 8 ES   22281 La Paul  Aragon
 9 ES   22471 LaspaulesAragon
10 ES   07691 Sa Taulera   Baleares
11 FR   29400 Lampaul GuimiliauBretagne
12 FR   29810 Lampaul PlouarzelBretagne
13 FR   29830 Lampaul PloudalmezeauBretagne
14 FR   33390 St Paul  Aquitaine
15 FR   61100 St Paul  Basse-Normandie
16 FR   87260 St Paul  Limousin
17 FR   88170 St Paul  Lorraine
18 FR   65150 St Paul  Midi-Pyrenees
19 FR   60650 St Paul  Picardie
20 FR   06570 St Paul  Provence-Alpes-Cote D'Azur
21 FR   73170 St Paul  Rhone-Alpes
22 FR   02300 St Paul Aux Bois Picardie
23 FR   81220 St Paul Cap De Joux  Midi-Pyrenees
24 FR   82400 St Paul D Espis  Midi-Pyrenees
 >>> snip >>>
68 FR   11320 St PauletLanguedoc-Roussillon
69 FR   30130 St Paulet De Caisson Languedoc-Roussillon
70 FR   43350 St Paulien   Auvergne
71 GB   EC4   St Paul's(null)
72 GB   BR5   St Paul's Cray   (null)
73 GB   SG4   St Paul's Walden (null)
74 HU   3714  Sajópálfala  Borsod-Abaúj-Zemplén
75 IN   281307Sahpau   Uttar Pradesh
76 IN   328027Saipau   Rajasthan
77 IN   171006Sanjauli Himachal Pradesh
78 IT   39050 St.Paul  Trentino-Alto Adige
79 PK   47701 Sanpal   Norhern Punajb Rawalpindi
80 PT   8900-121  SapalFaro
81 PT   4560-042  SopalPorto
82 PT   2705-738  São Julião   Lisboa
83 PT   7300-469  São Julião   Portalegre
84 PT   4560-197  São Julião   Porto
85 PT   4950-854  São Julião   Viana do Castelo
86 PT   5400-754  São Julião de Montenegro Vila Real
87 PT   5300-871  São Julião de Palácios   Brag

Re: [sqlite] Custom Collation comparing only firt character?

2011-08-26 Thread Igor Tandetnik
Roberto Colnaghi  wrote:
> Thank you for your detailed reply.
> Though I cannot use DLLs since it is an iPhone iOS (MacOSX) operational 
> system.I was hoping for a collation callback that is
> called for all characters, not only the first. 

It is. What do you think str1Length and str2Length parameters are for?

> For my subset of data, it fits just perfect. All comparing fields are UTF8 
> VARCHAR.
> Shouldn't sqlite3_create_collation be called for every single character?

You seem to believe that your callback will be called to compare individual 
characters. I can't imagine whatever gave you this idea. Your callback is 
called to compare two *strings* (of arbitrary length) and report whether, 
according to your collation, one string should be sorted before, together with, 
or after the other.

> Let's say the comparing names are "São Paulo" and
> "Santos". ->  SELECT * FROM Game WHERE TeamHome = 'SANTOS' COLLATE anyCIAI; 
> 
> The LOG function shows a comparison between S and other first char only only:

That's because *you* only log the first character. You are given the whole 
string - you just *choose* not to look at it beyond the first character.

> I was expecting it to go further

What is "it" in this sentence? Whom, other than yourself, were you expecting to 
go further?
-- 
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Custom Collation comparing only firt character?

2011-08-26 Thread Roberto Colnaghi

Thank you for your detailed reply.
Though I cannot use DLLs since it is an iPhone iOS (MacOSX) operational 
system.I was hoping for a collation callback that is called for all characters, 
not only the first.
For my subset of data, it fits just perfect. All comparing fields are UTF8 
VARCHAR.
Shouldn't sqlite3_create_collation be called for every single character? Let's 
say the comparing names are "São Paulo" and "Santos". ->  SELECT * FROM Game 
WHERE TeamHome = 'SANTOS' COLLATE anyCIAI;







The LOG function shows a comparison between S and other first char only only:








41 65 A - 53 83 S = -18
43 67 C - 53 83 S = -16
46 70 F - 53 83 S = -13
53 83 S - 53 83 S = 0
49 73 I - 53 83 S = -10
46 70 F - 53 83 S = -13
50 80 P - 53 83 S = -3
43 67 C - 53 83 S = -16
47 71 G - 53 83 S = -12
I was expecting it to go further in the comparison:"São Paulo" and "Santos" 
should LOGS - Sã - ao - n -> stops here, not what your looking for
When using it on ORDER BY, it is clear that only the first char is compared.
Since ICU is not an option for iPhone, I've run out of options and ideas here.
Thanks again,











Date: Thu, 25 Aug 2011 13:30:57 +0200
To: colna...@msn.com
From: j...@antichoc.net
Subject: Re: [sqlite] Custom Collation comparing only firt character?



Hi Roberto,


It all depends on your data source(s).  If you're sure you have all
accented codepoints dealt whith in your custom collation, then it may be
enough.  But if your application has any possibility to have to deal
some day with codepoints that you didn't consider in your
collation, then you're going to have it changed, possibily several times,
while the app is in the wild.  That may be a serious issue with
embedded systems ...


FYI I forward you a download link to a small SQLite extension I wrote for
dealing with Unicode text from several locales.  Brief background:
my wife and I run a e-shop.  We have customers in 27 countries and
suppliers in India and China, among other regions.  For instance I
had to syndicate catalogs and price lists from several indian sources,
some of them writen in indian scripts and using indian digits.


In short, the extension offers locale-independant unaccentuation, casing
and collation(s).  There are a number of other text-related
functions inside as well, like a locale-independant fuzzy
compare.


You can download the extension
here.


It isn't a replacement for ICU: ICU is a _huge_ beast, is slow and
requires you to select a _specific_ locale to work with for every
operation.  My extension is oriented towards locale-independancy,
allowing you to perform operations on columns holding text from anywhere
(using any Unicode codepoint).  Of course locale-independant
collation is less than perfect when you focus on a given specific locale:
for that ICU is way better.  The full Windows x86 footprint is
<180kb and even includes functions for utf-8 and utf-16 text to avoid
back and forth UTF conversions.  Compare to 16+Mb of ICU...


While it's likely that the whole baby may not be 100% fitted to your use
case, you can still grab ideas, code, tables from the included
source.  Be sure to take the time to read the explanations on top of
the C source.


The archive comes with a ready-to-use x86 Windows DLL which allows you to
play with the various functions without requiring you to write a single
line of code, for instance using a third-party SQLite manager like SQLite
Expert (by far my favorite at any rate).


Feel free to do whatever you want with the source but please report bugs
and/or issues.

Don't hesitate to chime here should you have any question.


Best regards.


--

j...@antichoc.net 
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Custom Collation comparing only firt character?

2011-08-25 Thread Simon Slavin

On 25 Aug 2011, at 6:58am, Roberto Colnaghi wrote:

> int collationAnyCIAI(void *arg1, int str1Length, const void *str1, int 
> str2Length, const void *str2) {
>   NSString *strA = [NSString hexStringWithData:str1 ofLength:1];
>   NSString *strB = [NSString hexStringWithData:str2 ofLength:1];

>From the above 'ofLength', you are looking at only the first ASCII character 
>of the two strings, as Igor said.

You need to load two bytes at a time (because you specified UTF-16) and you 
need to iterate along the length of the two strings rather than just look along 
the first byte.

I suggest you do it by writing a 'normalize' function which turns each string 
into a string without any accents on the characters.  Then once you've 
converted both string 1 and string 2 you can compare the results using any 
normal string comparison routines.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Custom Collation comparing only firt character?

2011-08-25 Thread Michael Stephenson
Most likely, since you say only the first character is being compared, is
that you have an ANSI/Unicode issue.  Specifically, it sounds like a Unicode
string is being passed to your collation function, which is expecting ANSI
and then finding a 0 at the second byte is determining that that is the end
of the string, thus exiting after the first character.

I have a collation function (Windows, code page 1252) that sounds like what
you are doing.  It was actually a bit complex to get it right (I think it's
right).  If you're interested, I could post it.

-Original Message-
From: sqlite-users-boun...@sqlite.org
[mailto:sqlite-users-boun...@sqlite.org] On Behalf Of Roberto Colnaghi
Sent: Thursday, August 25, 2011 1:59 AM
To: sqlite-users@sqlite.org
Subject: [sqlite] Custom Collation comparing only firt character?











Hi,


I'm using iOS SQLite with a custom collation.


I've registered it:
sqlite3_create_collation(sqlDatabase, 
"anyCIAI", 
SQLITE_UTF16, 
nil,
collationAnyCIAI);


And it is used like this:
"select * from Team where Name = 'SOMETHING' COLLATE anyCIAI;"


It works though only the FIRST character seems to be compared instead of the
whole "Name".
Is there anything missing here?


The collation method should compare a á à ã... and so on as equal.

Thank you!










int collationAnyCIAI(void *arg1, int str1Length, const void *str1, int
str2Length, const void *str2) {
NSString *strA = [NSString hexStringWithData:str1 ofLength:1];
NSString *strB = [NSString hexStringWithData:str2 ofLength:1];
int striA;
sscanf([strA cString], "%x", &striA);
int striB;
sscanf([strB cString], "%x", &striB);


//convert to accentless
//aA with accent to capital A
if( (striA >= 192 && striA <= 197) || (striA >= 224 && striA <= 229)
){
striA = 65;
}
//çÇ to C
if( striA == 199 || striA == 231 ){
striA = 67;
}
//eE with accent to capital E
if( (striA >= 200 && striA <= 203) || (striA >= 232 && striA <= 235)
){
striA = 69;
}
//iI with accent to capital I
if( (striA >= 204 && striA <= 207) || (striA >= 236 && striA <= 239)
){
striA = 73;
}
//oO with accent to capital O
if( (striA >= 210 && striA <= 214) || (striA >= 242 && striA <= 246)
){
striA = 79;
}
//uU with accent to capital U
if( (striA >= 217 && striA <= 220) || (striA >= 249 && striA <= 252)
){
striA = 85;
}
//a-z to A-Z
if( striA >= 97 && striA <= 122 ){
striA -= 32;
}


//convert to accentless
//aA with accent to capital A
if( (striB >= 192 && striB <= 197) || (striB >= 224 && striB <= 229)
){
striB = 65;
}
//çÇ to C
if( striB == 199 || striB == 231 ){
striB = 67;
}
//eE with accent to capital E
if( (striB >= 200 && striB <= 203) || (striB >= 232 && striB <= 235)
){
striB = 69;
}
//iI with accent to capital I
if( (striB >= 204 && striB <= 207) || (striB >= 236 && striB <= 239)
){
striB = 73;
}
//oO with accent to capital O
if( (striB >= 210 && striB <= 214) || (striB >= 242 && striB <= 246)
){
striB = 79;
}
//uU with accent to capital U
if( (striB >= 217 && striB <= 220) || (striB >= 249 && striB <= 252)
){
striB = 85;
}
//a-z to A-Z
if( striB >= 97 && striB <= 122 ){
striB -= 32;
}

int result = striA - striB;


return result;
} 
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Custom Collation comparing only firt character?

2011-08-25 Thread Igor Tandetnik
Roberto Colnaghi  wrote:
> It works though only the FIRST character seems to be compared instead of the 
> whole "Name".
> Is there anything missing here?
> 
> int collationAnyCIAI(void *arg1, int str1Length, const void *str1, int 
> str2Length, const void *str2) {
> NSString *strA = [NSString hexStringWithData:str1 ofLength:1];

Well, I know nothing about Objective C, but the "ofLength:1" part looks highly 
suspicious. Shouldn't str1Length appear in there somewhere?
-- 
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users