Re: [sqlite] Encoding and Collation

2011-09-14 Thread Antonio Maniero
>
> Just wanted to add a small warning: the problem with using custom
> collations is that if you use them in indexes, then all applications that
> use the same database must also have the same collation for everything to
> work OK.
>

I'm aware of this. Thanks.

I will build the LIKE function too, but it's not a priority right now.
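For readers wondering what building a LIKE function could involve: SQLite evaluates the LIKE operator through a SQL function named like(), so registering a replacement changes how LIKE matches. The following is only a minimal sketch, using Python's built-in sqlite3 module as a stand-in for the C API. The helper name `unicode_like` is made up for illustration, and it assumes Python's own case-insensitive matching is good enough for the data at hand.

```python
import re
import sqlite3

def unicode_like(pattern, value):
    """Hypothetical replacement for SQLite's two-argument like(pattern, value)."""
    if pattern is None or value is None:
        return None  # LIKE with a NULL operand yields NULL
    # Translate the LIKE wildcards into a regular expression.
    parts = []
    for ch in str(pattern):
        if ch == "%":
            parts.append(".*")   # any run of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    regex = "".join(parts)
    flags = re.IGNORECASE | re.DOTALL
    return int(re.fullmatch(regex, str(value), flags) is not None)

conn = sqlite3.connect(":memory:")
# "X LIKE Y" is evaluated as like(Y, X), so the pattern arrives first.
conn.create_function("like", 2, unicode_like)

matched = conn.execute("SELECT 'École' LIKE 'éc%'").fetchone()[0]
missed = conn.execute("SELECT 'École' LIKE 'ec%'").fetchone()[0]
```

This version folds case but not accents, so 'ec%' does not match 'École'. Overriding like() also means SQLite can no longer use an index to speed up LIKE queries, which may or may not matter for a small database.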
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Encoding and Collation

2011-09-14 Thread Nuno Lucas

On 09/11/2011 08:27 PM, Antonio Maniero wrote:

>> It's very easy to replace the SQLite functions with user-defined ones,
>> so if someone wants to go the easy way (partial support for just the
>> common western scripts) it's easy. And already done by many, if you
>> search the mailing list.
>
> It's exactly what I'm looking for. It could be my mistake but I searched the
> list and I couldn't find it. If not asking too much, can you suggest better
> terms to use in my search?


I haven't replied earlier because I also didn't find a search term that worked
well (although I'm sure I read those mails). So I wanted to find the code I
wrote myself for this (less than 100 lines of commented code), but couldn't
find it (at least not in the time a new implementation from scratch would take).


Anyway, I see now that you already found what you needed, and know how easy it 
is to create custom collations.


Just wanted to add a small warning: the problem with using custom collations is 
that if you use them in indexes, then all applications that use the same 
database must also have the same collation for everything to work OK.


Most of the time you don't have to worry about it, but do keep that in mind.
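The warning about indexes can be reproduced with a small experiment. This is a sketch only, using Python's built-in sqlite3 module in place of the C API; the collation name MY_NOCASE and the function `my_nocase` are invented for the example. One connection registers the collation and builds an index with it, and a second connection, playing the role of another application, opens the same file without registering it.

```python
import os
import sqlite3
import tempfile

def my_nocase(a, b):
    """Hypothetical application-defined collation (case-insensitive)."""
    ka, kb = a.lower(), b.lower()
    return (ka > kb) - (ka < kb)

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# Application 1 registers the collation and uses it in an index.
app1 = sqlite3.connect(path)
app1.create_collation("MY_NOCASE", my_nocase)
app1.execute("CREATE TABLE people(name TEXT)")
app1.execute("CREATE INDEX people_name ON people(name COLLATE MY_NOCASE)")
app1.execute("INSERT INTO people VALUES ('Nuno')")
app1.commit()
app1.close()

# Application 2 opens the same file but never registers MY_NOCASE.
app2 = sqlite3.connect(path)
try:
    # Updating the index needs the collation, which this connection lacks.
    app2.execute("INSERT INTO people VALUES ('Antonio')")
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)
app2.close()
```

In this setup the second connection cannot insert into the indexed table until it registers a collation with the same name. Registering one with the same name but different ordering rules would be worse, since the index could then silently become inconsistent.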


Regards,
~Nuno Lucas


Re: [sqlite] Encoding and Collation

2011-09-14 Thread Antonio Maniero
>
> You can try my quick and dirty "pseudo-universal" extension for
> less-than-perfect Unicode support.
>
> Go download the extension at http://dl.dropbox.com/u/26433628/unifuzz.zip .
> Take the time to fully read the explanations at the top of the source file.
> Please report bugs and other issues to my personal e-mail.
>

This code is very interesting. It doesn't fit my needs because it is
Windows-dependent and still a big implementation. What I need is very simple,
and I have now got it on my own (see my previous message in this thread).

Your extension could be useful for another project.

Thanks.


Re: [sqlite] Encoding and Collation

2011-09-14 Thread Antonio Maniero
>
> In either case the wrapping and registration with
> sqlite3_create_collation_v2 is quite easy.
>

Yes, it is. I thought that the actual algorithm would be harder but I was
wrong.

It's surprisingly easy to build a simple algorithm that handles UTF-8 for
ASCII and the basic accented Latin characters with standard rules. The
implementation runs extremely fast compared to the ICU implementation and
covers 100% of my use cases. The algorithm is based on the original NOCASE
from SQLite and runs nearly as fast. It takes fewer than 20 lines of actual
C code. This is lite.
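The actual C code was not posted, so the following is only a guess at the general shape of such an algorithm, written with Python's built-in sqlite3 module rather than the C API. It assumes the goal is NOCASE-style case folding extended from ASCII to the Latin-1 letters U+00C0 to U+00DE; the names `fold`, `latin_nocase` and LATIN_NOCASE are invented for illustration.

```python
import sqlite3

def fold(cp):
    """Lower-case one code point: ASCII A-Z plus Latin-1 U+00C0..U+00DE."""
    if 0x41 <= cp <= 0x5A:
        return cp + 0x20
    if 0xC0 <= cp <= 0xDE and cp != 0xD7:  # U+00D7 is the multiplication sign
        return cp + 0x20
    return cp

def latin_nocase(a, b):
    """Compare two strings after folding each code point."""
    ka = [fold(ord(ch)) for ch in a]
    kb = [fold(ord(ch)) for ch in b]
    return (ka > kb) - (ka < kb)

conn = sqlite3.connect(":memory:")
conn.create_collation("LATIN_NOCASE", latin_nocase)
conn.execute("CREATE TABLE t(name TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [("Émile",), ("ana",), ("émile",), ("Bruno",)],
)

equal = conn.execute(
    "SELECT 'ÁGUA' = 'água' COLLATE LATIN_NOCASE"
).fetchone()[0]
ordered = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM t ORDER BY name COLLATE LATIN_NOCASE, rowid"
    )
]
```

Because this sketch folds case only, accented letters still sort after "z". Mapping each accented letter to its base letter in `fold` would be one way to get dictionary-like ordering, at the cost of treating "e" and "é" as equal.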


> Most users actually don't care about the collation sequence, because they
> neither order by strings nor use like on strings (or only use them with
> identifiers, that fit in ascii anyway). Most of the rest either wants the
> collation to be correct in all cases (so they use ICU) or go with the
> simplest solution (so they use ICU). The few that don't want ICU just
> define the collation sequences themselves.
>

For me the simplest solution is my own implementation. I thought that
someone else had already built something simple and I wouldn't need to
reinvent the wheel.

Thanks.


Re: [sqlite] Encoding and Collation

2011-09-13 Thread Jean-Christophe Deschamps



>> It's very easy to replace the SQLite functions with user-defined ones,
>> so if someone wants to go the easy way (partial support for just the
>> common western scripts) it's easy. And already done by many, if you
>> search the mailing list.
>
> It's exactly what I'm looking for. It could be my mistake but I searched the
> list and I couldn't find it. If not asking too much, can you suggest better
> terms to use in my search?


You can try my quick and dirty "pseudo-universal" extension for 
less-than-perfect Unicode support.


Go download the extension at http://dl.dropbox.com/u/26433628/unifuzz.zip .
Take the time to fully read the explanations at the top of the source file.
Please report bugs and other issues to my personal e-mail.



--
j...@antichoc.net  




Re: [sqlite] Encoding and Collation

2011-09-12 Thread Jan Hudec
On Sun, Sep 11, 2011 at 11:55:23 -0300, Antonio Maniero wrote:
> I doubt that it would be easy to store 8859 without problems in the string
> functions. The proper collation would be simple, of course, but I would
> probably need to re-implement all the string functions too. Am I wrong?
>
> I can use utf8 but for me SQLite won't be lite anymore without a simple
> utf8 implementation. Hopefully someone else has a ready solution for
> collation; otherwise I will write my own implementation. It will never be
> correct, but it will be enough.

Actually your operating system probably provides the collation sequence and
binding it for sqlite is about 5 lines of C. The only problem is that it's
system specific, so it'd be quite hard to maintain in sqlite, which is
probably why it's not provided. But if your application is domain-specific
anyway (otherwise iso-8859-1 could never be good enough for you), it's quite
easy.

In Windows you'd use the 'CompareString' function (or something like that;
I am not looking at MSDN). You probably need to ask sqlite to give you UTF-16
strings (it will happily convert for you even when storing as UTF-8) and
select correct locale.

In Unix you should be able to use either 'strcoll' or 'wcscoll' (beware,
wchar_t is 32-bit in many unix compilers these days!), you just have to
figure out how to set the locale before calling it.

In either case the wrapping and registration with sqlite3_create_collation_v2
is quite easy.
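To illustrate how little glue the operating-system route needs, here is a sketch in Python, whose `locale.strcoll` wraps the C `strcoll` mentioned above and whose built-in sqlite3 module wraps the collation registration. The collation name LOCALE is an arbitrary choice, and the resulting order depends entirely on which locale the process ends up with.

```python
import locale
import sqlite3

# Adopt the user's default collation rules; keep the current locale if the
# environment names one that is not installed.
try:
    locale.setlocale(locale.LC_COLLATE, "")
except locale.Error:
    pass

conn = sqlite3.connect(":memory:")
# locale.strcoll already returns negative, zero or positive, which is the
# contract a collation callback has to fulfil.
conn.create_collation("LOCALE", locale.strcoll)

conn.execute("CREATE TABLE words(w TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)", [("pear",), ("apple",), ("fig",)]
)
ordered = [
    row[0]
    for row in conn.execute("SELECT w FROM words ORDER BY w COLLATE LOCALE")
]
```

A C version would need slightly more care, because SQLite hands the callback length-counted buffers that are not necessarily NUL-terminated, while `strcoll` expects C strings.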

> Am I the only user that needs a lite implementation of SQLite with
> case-insensitive comparison?

Well, maybe you are one of the only two or something like that.

Most users actually don't care about the collation sequence, because they
neither order by strings nor use like on strings (or only use them with
identifiers, that fit in ascii anyway). Most of the rest either wants the
collation to be correct in all cases (so they use ICU) or go with the
simplest solution (so they use ICU). The few that don't want ICU just define
the collation sequences themselves.

-- 
 Jan 'Bulb' Hudec 


Re: [sqlite] Encoding and Collation

2011-09-11 Thread Antonio Maniero
>
> Nothing changed, except now the SQLite 3 API *EXPLICITLY* says the
> text functions require either UTF-8 or UTF-16, but nothing stops
> someone from doing the same as with SQLite2 and storing their text as BLOBs.
>

I forgot to reply to this. I thought about BLOBs since they give single-byte
storage, but besides being a dirty solution, they have no support for
collation. Can case-insensitive comparison be emulated in a simple and clean
way using BLOBs instead of TEXT?

Thanks.


Re: [sqlite] Encoding and Collation

2011-09-11 Thread Antonio Maniero
>
> Do you realize that there is no single "8859" encoding?
> The standards go from ISO-8859-1 to ISO-8859-16 and you would need to
> have collations for all of them and a way to let the user choose which
> one is the right one for them (including regional variations).
>

I just need 8859-1.

What I really need is single-byte storage because it is easy and simple to
handle. 8859-1 is just the perfect choice among all the single-byte
encodings available.

To clarify my requirements: I need SQLite to stay lite while working with
accented characters, with both the app and the database (storage file) lite.
UTF-16 would be simpler than UTF-8, but it would bloat the database
unnecessarily.
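The storage trade-off can be made concrete by encoding one accented word three ways. The word "coração" is just an arbitrary sample chosen for this illustration.

```python
text = "coração"  # 7 characters, two of them accented

sizes = {
    "latin-1": len(text.encode("latin-1")),      # one byte per character
    "utf-8": len(text.encode("utf-8")),          # accented letters take two
    "utf-16-le": len(text.encode("utf-16-le")),  # two bytes per character
}
```

For mostly-ASCII Latin text, UTF-8 costs one extra byte per accented letter over a single-byte encoding, while UTF-16 doubles the size of every character.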


> There are so many things wrong with this kind of reasoning I prefer to
> not say more.
>

It's a pity. You're helping a lot.


> My guess is that you are assuming your text is ISO-8859-1 (commonly
> called Latin-1). There are many problems with this; for example,
> ISO-8859-15 (Latin-9) is its replacement, with the € (euro) sign added,
> but Windows decided to invent its own encoding, and its "Latin" is not
> exactly the same (the euro sign is the symbol that usually gets
> corrupted).
>

This is another problem that doesn't affect me. Anyway, for me 8859-1 is not
the requirement; a single-byte encoding with Latin accented characters is.
8859-1 just fits this requirement.


> Don't assume a case-insensitive match is easy if done right. Read
> about the Unicode collations and you will understand why -- there
> doesn't exist a single generic upper/lower case function that works
> for all.
>

Exactly! With utf8 it's not easy. It could be simplified at the cost of
correctness. I don't need correctness. In fact I don't need multi-byte at
all, but I can do nothing to change this. SQLite is a multi-byte database and
I need to work within this constraint.

What you stated is a very important reason for a database not to choose just
one encoding (or three, counting utf16be and utf16le). One encoding doesn't
work for all.


> It's very easy to replace the SQLite functions with user-defined ones,
> so if someone wants to go the easy way (partial support for just the
> common western scripts) it's easy. And already done by many, if you
> search the mailing list.
>

It's exactly what I'm looking for. It could be my mistake but I searched the
list and I couldn't find it. If not asking too much, can you suggest better
terms to use in my search?

Many thanks.


Re: [sqlite] Encoding and Collation

2011-09-11 Thread Antonio Maniero
>
> Possibly because SQLite3 supports UTF-8 and UTF-16 rather than ASCII.
>  Assuming you're using the ICU stuff, of course.
>

Yes, I know that, but I wish to know the reason. Why did the authors choose
to turn off 8859 rather than keep both, as it was before? Was there a
technical reason?


> Exactly.  Though I'm having trouble pointing to a page for the SQLite3 ICU
> stuff at the moment.  Does anyone know if it moved with the docs ?
>
> Simon.
>

My previous post here was about the wiki being down. I know the ICU extension
and the bloated ICU implementation (bloated for me, at least). I know that it
is the correct implementation, but I don't need to handle the whole Unicode
table, composite characters, or other utf8 eccentricities. Nor can I live
without basic accentuation.

Thanks.


Re: [sqlite] Encoding and Collation

2011-09-11 Thread Nuno Lucas
On Sun, Sep 11, 2011 at 15:55, Antonio Maniero wrote:
>>
>> I see. Well, SQLite2 is ancient: that ship has sailed and it's not coming
>> back.
>>
>> Did SQLite2 actually implement case-insensitive comparison on accented
>> Latin characters? I honestly don't know - by the time I got involved with
>> SQLite (in late 2005), SQLite2 was already history, and its original
>> documentation doesn't seem to exist anymore.

SQLite2 didn't support ISO-8859-1 (or 15 or whatever). It simply treated
8-bit characters as opaque bytes, so you could put in whatever you wanted
and get back whatever you put there.
That means you could also store UTF-8 text there, and the "-16" API
functions allowed you to directly store/retrieve Windows UCS-2 Unicode
strings (UTF-16 only came after XP, iirc), which SQLite automatically
encoded/decoded to UTF-8 text.

The problem was that there was no way for 3rd party applications to
know what was the encoding being used.

If I remember correctly, UTF-8 was still preferred, and some SQL
functions acting on text only worked well if it was UTF-8 (like
LENGTH).

> Maybe someone else could explain why SQLite dropped the 8859 encoding.
>
> SQLite 2 probably had no case-insensitive comparison for 8859 because it
> has many encodings and locales, but implementing it would be easy and simple.

It didn't drop support. It simply never had it, and never cared about it
either.

Nothing changed, except that the SQLite 3 API now *EXPLICITLY* says the
text functions require either UTF-8 or UTF-16, but nothing stops
someone from doing the same as with SQLite2 and storing their text as BLOBs.
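As a rough illustration of the BLOB route, the sketch below stores ISO-8859-1 bytes in a BLOB column and gets case-insensitive matching from a user-defined function, since collations are not applied to BLOB values. It uses Python's built-in sqlite3 module in place of the C API, and the function name `latin1_lower` is invented for the example.

```python
import sqlite3

def latin1_lower(blob):
    """Hypothetical helper: lower-case ISO-8859-1 bytes, one byte at a time."""
    if blob is None:
        return None
    return bytes(
        b + 0x20
        if (0x41 <= b <= 0x5A) or (0xC0 <= b <= 0xDE and b != 0xD7)
        else b
        for b in blob
    )

conn = sqlite3.connect(":memory:")
conn.create_function("latin1_lower", 1, latin1_lower)
conn.execute("CREATE TABLE city(name BLOB)")
conn.executemany(
    "INSERT INTO city VALUES (?)",
    [("ÉVORA".encode("latin-1"),), ("São Paulo".encode("latin-1"),)],
)

# BLOBs compare byte by byte, so fold both sides before comparing.
row = conn.execute(
    "SELECT name, length(name) FROM city "
    "WHERE latin1_lower(name) = latin1_lower(?)",
    ("évora".encode("latin-1"),),
).fetchone()
```

With single-byte text in a BLOB, length() counts bytes, which here equals the number of characters. The price is that every comparison has to go through the helper, and the built-in text functions do not know what the bytes mean.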

>> Does version 3 keep support for 8859?
>>
>> No, not really. But, again, it won't prevent you from storing 8859-encoded
>> strings in the database, and installing a custom collation that understands
>> them, if you are so inclined. Personally, I'd seriously consider switching
>> to UTF-8.
>>
>>
> I doubt that it would be easy to store 8859 without problems in the string
> functions. The proper collation would be simple, of course, but I would
> probably need to re-implement all the string functions too. Am I wrong?

Do you realize that there is no single "8859" encoding?
The standards go from ISO-8859-1 to ISO-8859-16 and you would need to
have collations for all of them and a way to let the user choose which
one is the right one for them (including regional variations).

There are so many things wrong with this kind of reasoning I prefer to
not say more.

> I can use utf8 but for me SQLite won't be lite anymore without a simple
> utf8 implementation. Hopefully someone else has a ready solution for
> collation; otherwise I will write my own implementation. It will never be
> correct, but it will be enough.

My guess is that you are assuming your text is ISO-8859-1 (commonly
called Latin-1). There are many problems with this; for example,
ISO-8859-15 (Latin-9) is its replacement, with the € (euro) sign added,
but Windows decided to invent its own encoding, and its "Latin" is not
exactly the same (the euro sign is the symbol that usually gets
corrupted).
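The euro sign problem is easy to see by decoding the same byte under the different "Latin" encodings. This small sketch uses Python's standard codecs, where cp1252 is the Windows code page in question.

```python
byte_a4 = b"\xa4"

as_latin1 = byte_a4.decode("iso8859-1")    # generic currency sign
as_latin9 = byte_a4.decode("iso8859-15")   # euro sign
euro_in_cp1252 = "€".encode("cp1252")      # Windows puts it at 0x80

# ISO-8859-1 has no euro sign at all, so encoding it fails.
try:
    "€".encode("iso8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False
```

Text written as Windows-1252 and read back as ISO-8859-1 therefore turns the euro sign into an unrelated control character, which is the kind of corruption described above.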

> Am I the only user that needs a lite implementation of SQLite with
> case-insensitive comparison?

Don't assume a case-insensitive match is easy if done right. Read
about the Unicode collations and you will understand why -- there
doesn't exist a single generic upper/lower case function that works
for all.

SQLite could have an erroneous and basic case-insensitive match
(it already has a basic one, for 7-bit ASCII), but that would not solve
anything that isn't already solved by the correct solution, which is
the use of the ICU extension.
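The ASCII-only behaviour of the built-in NOCASE collation can be checked directly. A short sketch using Python's built-in sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NOCASE folds only the 26 ASCII letters.
ascii_match = conn.execute(
    "SELECT 'ABC' = 'abc' COLLATE NOCASE"
).fetchone()[0]

# Accented letters are compared as-is, so these differ.
accent_match = conn.execute(
    "SELECT 'É' = 'é' COLLATE NOCASE"
).fetchone()[0]
```

Here `ascii_match` is 1 and `accent_match` is 0, which is the gap the rest of this thread is about.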

It's very easy to replace the SQLite functions with user-defined ones,
so if someone wants to go the easy way (partial support for just the
common western scripts) it's easy. And already done by many, if you
search the mailing list.
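As an example of replacing built-in functions with user-defined ones, the sketch below overrides upper() and lower() on a single connection. It uses Python's built-in sqlite3 module in place of the C API and leans on Python's own Unicode case mapping; the helper names are invented for illustration.

```python
import sqlite3

def unicode_upper(value):
    """Hypothetical replacement for upper(); non-text values pass through."""
    return value.upper() if isinstance(value, str) else value

def unicode_lower(value):
    """Hypothetical replacement for lower(); non-text values pass through."""
    return value.lower() if isinstance(value, str) else value

conn = sqlite3.connect(":memory:")
# Registering a function with the same name and argument count as a
# built-in one replaces the built-in for this connection only.
conn.create_function("upper", 1, unicode_upper)
conn.create_function("lower", 1, unicode_lower)

row = conn.execute("SELECT upper('ação'), lower('AÇÃO')").fetchone()
```

Like custom collations, these overrides live in the connection and not in the database file, so every application that relies on them has to register them itself.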

As a final note, SQLite 2 never had any support for ISO-8859-X
collations, so you have no reason to believe SQLite 3 would have it.

Regards,
~Nuno Lucas

>
> Thanks.


Re: [sqlite] Encoding and Collation

2011-09-11 Thread Igor Tandetnik
Simon Slavin wrote:
> Though I'm having trouble pointing to a page for the SQLite3 ICU stuff at the 
> moment.

It would be here:

http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/icu/README.txt

but the server seems to be down at the moment.
-- 
Igor Tandetnik



Re: [sqlite] Encoding and Collation

2011-09-11 Thread Simon Slavin

On 11 Sep 2011, at 3:55pm, Antonio Maniero wrote:

> Maybe someone else could explain why SQLite dropped the 8859 encoding.

Possibly because SQLite3 supports UTF-8 and UTF-16 rather than ASCII.  Assuming 
you're using the ICU stuff, of course.

> I can use utf8 but for me SQLite won't be lite anymore without a simple
> utf8 implementation.

Exactly.  Though I'm having trouble pointing to a page for the SQLite3 ICU 
stuff at the moment.  Does anyone know if it moved with the docs ?

Simon.



Re: [sqlite] Encoding and Collation

2011-09-11 Thread Antonio Maniero
>
> I see. Well, SQLite2 is ancient: that ship has sailed and it's not coming
> back.
>
> Did SQLite2 actually implement case-insensitive comparison on accented
> Latin characters? I honestly don't know - by the time I got involved with
> SQLite (in late 2005), SQLite2 was already history, and its original
> documentation doesn't seem to exist anymore.
>
>
Maybe someone else could explain why SQLite dropped the 8859 encoding.

SQLite 2 probably had no case-insensitive comparison for 8859 because it
has many encodings and locales, but implementing it would be easy and simple.


> Does version 3 keep support for 8859?
>
> No, not really. But, again, it won't prevent you from storing 8859-encoded
> strings in the database, and installing a custom collation that understands
> them, if you are so inclined. Personally, I'd seriously consider switching
> to UTF-8.
>
>
I doubt that it would be easy to store 8859 without problems in the string
functions. The proper collation would be simple, of course, but I would
probably need to re-implement all the string functions too. Am I wrong?

I can use utf8 but for me SQLite won't be lite anymore without a simple
utf8 implementation. Hopefully someone else has a ready solution for
collation; otherwise I will write my own implementation. It will never be
correct, but it will be enough.

Am I the only user that needs a lite implementation of SQLite with
case-insensitive comparison?

Thanks.


Re: [sqlite] Encoding and Collation

2011-09-11 Thread Igor Tandetnik
Antonio Maniero wrote:
>>> Why did SQLite drop the 8859 or single-byte support for text? Is there
>>> any technical reason?
>>
>> What do you mean, dropped? What exactly used to work before and has
>> stopped working now? What event has occurred between then and now that you
>> attribute the problem to?
>
> Maybe I misunderstood some old documentation and release notes talking
> about 8859, especially http://www.sqlite.org/c_interface.html :

I see. Well, SQLite2 is ancient: that ship has sailed and it's not coming back.

Did SQLite2 actually implement case-insensitive comparison on accented Latin 
characters? I honestly don't know - by the time I got involved with SQLite (in 
late 2005), SQLite2 was already history, and its original documentation doesn't 
seem to exist anymore.

> Does version 3 keep support for 8859?

No, not really. But, again, it won't prevent you from storing 8859-encoded 
strings in the database, and installing a custom collation that understands 
them, if you are so inclined. Personally, I'd seriously consider switching to 
UTF-8.
-- 
Igor Tandetnik



Re: [sqlite] Encoding and Collation

2011-09-11 Thread Antonio Maniero
> > Why did SQLite drop the 8859 or single-byte support for text? Is there
> > any technical reason?
>
> What do you mean, dropped? What exactly used to work before and has
> stopped working now? What event has occurred between then and now that you
> attribute the problem to?

Maybe I misunderstood some old documentation and release notes talking
about 8859, especially http://www.sqlite.org/c_interface.html :


3.7 Library character encoding
By default, SQLite assumes that all data uses a fixed-size 8-bit character
(iso8859). But if you give the --enable-utf8 option to the configure script,
then the library assumes UTF-8 variable sized characters. This makes a
difference for the LIKE and GLOB operators and the LENGTH() and SUBSTR()
functions. The static string sqlite_encoding will be set to either "UTF-8"
or "iso8859" to indicate how the library was compiled. In addition,
the sqlite.h header file will define one of the
macros SQLITE_UTF8 or SQLITE_ISO8859, as appropriate.
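The passage above describes SQLite 2. The same distinction between counting bytes and counting characters can still be observed in SQLite 3 by comparing a TEXT value with the same bytes cast to a BLOB. A small check using Python's built-in sqlite3 module, assuming the default UTF-8 database encoding:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # new databases default to UTF-8

chars, nbytes, middle = conn.execute(
    "SELECT length('ação'),"                 # TEXT: counts characters
    "       length(CAST('ação' AS BLOB)),"   # BLOB: counts bytes
    "       substr('ação', 2, 2)"            # TEXT: character positions
).fetchone()
```

The word has 4 characters but occupies 6 bytes in UTF-8, and substr() on TEXT slices by character, so `middle` is "çã".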


Am I wrong? Did version 2 never have support for 8859?

Does version 3 keep support for 8859? I can't find current documentation
about it.

Thanks.


Re: [sqlite] Encoding and Collation

2011-09-10 Thread Igor Tandetnik
Antonio Maniero wrote:
> Why did SQLite drop the 8859 or single-byte support for text? Is there
> any technical reason?

What do you mean, dropped? What exactly used to work before and has stopped
working now? What event has occurred between then and now that you attribute
the problem to?

> Is there any ready-made, simple solution for case-insensitive collation in
> SQLite that works with non-English (Latin) characters? I don't need and I
> don't want a full ICU implementation

Then write your own custom collation suited to your particular needs, whatever 
they might be.
-- 
Igor Tandetnik



[sqlite] Encoding and Collation

2011-09-10 Thread Antonio Maniero
Hi

Sorry for my bad English.

Why did SQLite drop the 8859 or single-byte support for text? Is there
any technical reason?

Is there any ready-made, simple solution for case-insensitive collation in
SQLite that works with non-English (Latin) characters? I don't need and I
don't want a full ICU implementation; after all, I'm using utf8 only because
I can't count on a better option. I don't need the full and correct
implementation; a "just works" implementation for trivial use is enough.

The assumption that the application has the proper comparison function is
valid only when the application works with utf8. That's not exactly my
case, because the full, correct implementation is heavy and adds no value for
my requirements.

It's ironic that utf8 (the universal encoding) support turned SQLite into an
English-centric database by default.

Best regards

Antonio Maniero