[sqlite] Unicode support in SQLite

2015-07-06 Thread Aleksey Tulinov
Hello,

I'm glad to announce that the nunicode SQLite extension has been updated
to support Unicode 8.0. The extension supports the following encodings:
UTF-8, UTF-16, UTF-16LE and UTF-16BE, and is only about 230 KB in size.

This extension provides the following Unicode-aware components:

- upper(X)
- lower(X)
- X LIKE Y ESCAPE Z
- COLLATE NU800 : case-sensitive Unicode 8.0.0 collation
- COLLATE NU800_NOCASE : case-insensitive Unicode 8.0.0 collation

You can read about and download this extension on the BitBucket page of
the nunicode library:
https://bitbucket.org/alekseyt/nunicode#markdown-header-sqlite3-extension

If you were using a previous version of the nunicode extension for SQLite,
please note that the extension was renamed from libnunicode to libnusqlite3.
The entry points remain the same: sqlite3_nunicode_init() and
nunicode_sqlite3_static_init(). Sorry for the inconvenience.

Also note that this version of the extension no longer provides the NU700
and NU700_NOCASE collations; they are replaced by NU800 and NU800_NOCASE.

Complete changelog is available here: 
https://bitbucket.org/alekseyt/nunicode/raw/master/CHANGELOG


Re: [sqlite] Unicode support in SQLite

2014-10-14 Thread Aleksey Tulinov

On 14/10/14 17:02, Kevin Benson wrote:


> https://bitbucket.org/alekseyt/nunicode/downloads/libnusqlite3-1.4-4a0e4773-win32.zip
> <--- 404 response code



Thank you, fixed now.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unicode support in SQLite

2014-10-14 Thread Kevin Benson
On Tue, Oct 14, 2014 at 4:37 AM, Aleksey Tulinov 
wrote:

> Hello,
>
> I'm glad to announce that nunicode SQLite extension was updated to support
> Unicode-conformant case folding and was improved on performance of every
> component provided to SQLite.
>
> You can read about and download this extension at BitBucket page of
> nunicode library:
> https://bitbucket.org/alekseyt/nunicode#markdown-header-sqlite3-extension
>
> This extension provides the following Unicode-aware components:
>
> - upper(X)
> - lower(X)
> - X LIKE Y ESCAPE Z
> - COLLATE NU700 : case-sensitive Unicode 7.0.0 collation
> - COLLATE NU700_NOCASE : case-insensitive Unicode 7.0.0 collation
>

https://bitbucket.org/alekseyt/nunicode/downloads/libnusqlite3-1.4-4a0e4773-win32.zip
<--- 404 response code

K e V i N


[sqlite] Unicode support in SQLite

2014-10-14 Thread Aleksey Tulinov

Hello,

I'm glad to announce that the nunicode SQLite extension has been updated
to support Unicode-conformant case folding, and that the performance of
every component it provides to SQLite has been improved.


You can read about and download this extension on the BitBucket page of
the nunicode library:
https://bitbucket.org/alekseyt/nunicode#markdown-header-sqlite3-extension


This extension provides the following Unicode-aware components:

- upper(X)
- lower(X)
- X LIKE Y ESCAPE Z
- COLLATE NU700 : case-sensitive Unicode 7.0.0 collation
- COLLATE NU700_NOCASE : case-insensitive Unicode 7.0.0 collation


Re: [sqlite] Unicode support in SQLite

2014-07-04 Thread Aleksey Tulinov

Hello,

I'm glad to announce that the nunicode SQLite extension has been updated
to support the Unicode 7.0.0 character set. It also implements a LIKE
operation that is faster than in previous releases.


This extension provides the following Unicode-aware components:

- upper(X)
- lower(X)
- X LIKE Y ESCAPE Z
- COLLATE NU700 : case-sensitive Unicode 7.0.0 collation
- COLLATE NU700_NOCASE : case-insensitive Unicode 7.0.0 collation

The collation functions implement the default Unicode collation (based on
DUCET). The previously implemented Unicode 6.3.0 collations NU630 and
NU630_NOCASE were removed from this version of the extension.


You can find implementation details, the changelog and downloads on the
BitBucket page of the nunicode library: https://bitbucket.org/alekseyt/nunicode



Re: [sqlite] Unicode support in SQLite

2014-04-02 Thread Aleksey Tulinov

Hey,

Following the previous discussion on this mailing list, I've updated the
nunicode SQLite extension so that it no longer overrides the default NOCASE
collation, because of possible issues with database indexing.
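
For context on why replacing the built-in NOCASE is risky: the stock NOCASE collation folds only ASCII letters, and an index declared with COLLATE NOCASE is ordered by whatever comparison was registered under that name when its rows were written. The sketch below uses Python's standard sqlite3 module (chosen for illustration only, it is not part of nunicode) to show the built-in behaviour; the helper name nocase_equal is made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def nocase_equal(a, b):
    """Return True if SQLite's built-in NOCASE collation treats a and b as equal."""
    row = conn.execute("SELECT ? = ? COLLATE NOCASE", (a, b)).fetchone()
    return bool(row[0])

ascii_folds = nocase_equal("STRASSE", "strasse")   # ASCII letters are folded
non_ascii_folds = nocase_equal("\u00c9", "\u00e9") # E-acute vs e-acute: not folded
print(ascii_folds, non_ascii_folds)
```

If an extension registered a Unicode-aware comparison under the same name, an index built without the extension would be ordered differently from what the new comparison expects, which is presumably the indexing issue referred to above.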


Version 1.2.1 removes the nunicode-specific NOCASE and NUNICODE collations
and introduces the NU630 and NU630_NOCASE collations instead. The first is a
case-sensitive Unicode 6.3.0 collation, the second is case-insensitive; both
implement the default Unicode collation ordering (DUCET).


In all other regards it does not differ from version 1.2 of the extension
and is based on the same nunicode 1.2.


Full changelog is available here: 
https://bitbucket.org/alekseyt/nunicode/src/master/CHANGELOG


Pre-compiled extensions are available under "Downloads" for Win32 and 
i386/amd64 Linux.



[sqlite] Unicode support in SQLite

2014-01-24 Thread Aleksey Tulinov

Hey,

I've just updated nunicode to version 1.2: 
https://bitbucket.org/alekseyt/nunicode


Now all collations are backed by a reduced DUCET. The library grew in size a
little bit: you'll get Unicode collations for around 200 KB, but at the
same time you will also get several languages working completely out of
the box, as they don't need any collation tailoring.


You can also write your own tailoring. This is described to some extent here:
https://bitbucket.org/alekseyt/nunicode#markdown-header-custom-collations and
is also covered in the embedded Doxygen documentation.



Re: [sqlite] Unicode support in SQLite

2013-11-10 Thread Gert Van Assche
Very nice! Thanks for sharing, Aleksey.


2013/11/9 Aleksey Tulinov 

> On 11/04/2013 11:50 AM, Aleksey Tulinov wrote:
>
> Hey,
>
>
>  As you can see, this is truly full Unicode collation and case mapping
>> with untailored special casing. Extension provides the following functions,
>> statements and collations:
>>
>
> I've updated extension, examples and documentation, now it's easier to
> link extension statically. Everything, including new prebuilt binaries, is
> available on BitBucket, changelog is available here:
> https://bitbucket.org/alekseyt/nunicode/src/master/CHANGELOG
>


Re: [sqlite] Unicode support in SQLite

2013-11-09 Thread Aleksey Tulinov

On 11/04/2013 11:50 AM, Aleksey Tulinov wrote:

Hey,

As you can see, this is truly full Unicode collation and case mapping 
with untailored special casing. Extension provides the following 
functions, statements and collations:


I've updated the extension, examples and documentation; it is now easier to
link the extension statically. Everything, including new prebuilt binaries,
is available on BitBucket. The changelog is available here:
https://bitbucket.org/alekseyt/nunicode/src/master/CHANGELOG



[sqlite] Unicode support in SQLite

2013-11-04 Thread Aleksey Tulinov

Dear SQLite users,

I'd like to present a Unicode support extension I've implemented for
SQLite. It does full Unicode (6.3.0) collations, case mapping and
untailored ordering, and takes only ~100 KB to do that if you link it
statically. It's also open source and free (MIT license):
https://bitbucket.org/alekseyt/nunicode


What it does exactly:

sqlite> .load ./sqlite3/libnusqlite3.so
sqlite> SELECT 'MASSE' LIKE 'Maße';
1
sqlite> SELECT 'æ' LIKE 'AE';
1
sqlite> SELECT 'Masse' == 'Maße' COLLATE NUNICODE;
1
sqlite> SELECT upper('Maße');
MASSE

As you can see, this is truly full Unicode collation and case mapping
with untailored special casing. The extension provides the following
functions, statements and collations:


- upper()/lower()
- X LIKE Y ESCAPE Z
- COLLATE NOCASE
- COLLATE NUNICODE

Supported encodings: UTF-8, UTF-16 (host-endian), UTF-16BE, UTF-16LE.
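
For readers without the extension at hand, the 'Masse' versus 'Maße' comparison from the session above can be approximated with a collation registered from Python's standard sqlite3 module. This is only a sketch of full case folding via str.casefold(); it is not how nunicode is implemented, it does not provide DUCET ordering, and the collation name FOLDCASE is invented for the example.

```python
import sqlite3

def fold_compare(a, b):
    """Compare two strings after full Unicode case folding (str.casefold)."""
    fa, fb = a.casefold(), b.casefold()
    return (fa > fb) - (fa < fb)

conn = sqlite3.connect(":memory:")
# "FOLDCASE" is a made-up collation name for this sketch, not one nunicode provides.
conn.create_collation("FOLDCASE", fold_compare)

# Sharp s (U+00DF) folds to "ss", so both sides fold to "masse".
same = conn.execute(
    "SELECT ? = ? COLLATE FOLDCASE", ("Masse", "Ma\u00dfe")
).fetchone()[0]
print(same)
```

A comparison such as 'æ' LIKE 'AE' goes beyond case folding, so this sketch would not reproduce it.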

If you wish to try it, you can find some pre-built binaries for Windows
and Linux in the downloads section on BitBucket; documentation is embedded
in the sources.


Any ideas?


Re: [sqlite] Unicode support

2009-11-20 Thread Nicolas Williams
On Tue, Nov 17, 2009 at 09:31:46PM -0500, Tim Romano wrote:
>  but if ORDER BY is
> relying on an index for ordering, then flip() can have negative 
> effects.
> 
> 
> Substr() could have negative effects on ordering too.  That is a red 
> herring.  Flip() is merely a function that  reverses the order of 
> codepoints "as found" without knowing anything about what those 
> codepoints, individually or in combination, might signify in a writing 
> system.  If I want to write those codepoints to a column that's my concern.

In Unicode there are codepoints, characters, and glyphs.  Codepoints are
single 21-bit values.  Characters are either single codepoints or
combinations of codepoints.  Glyphs are either single characters or
combinations of characters that are displayed as single
programmatically constructed glyphs.  SQLite3 knows about none of that,
nor about normalization forms.

Therefore any functions like substr() and flip() that work at the
codepoint level (or worse, at the byte level, but fortunately substr()
is UTF-8/16 aware) can break semantics for your strings.
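
A small sketch of that breakage, using the "A, combining acute accent, E" string discussed elsewhere in this thread. It is written in Python purely for illustration; flip_codepoints and flip_clusters are hypothetical helpers, and the cluster rule used here (base codepoint plus following combining marks) is a simplification of real grapheme segmentation.

```python
import unicodedata

def flip_codepoints(s):
    """Reverse the string one codepoint at a time, ignoring combining marks."""
    return s[::-1]

def flip_clusters(s):
    """Reverse the string while keeping each combining mark after its base.

    Simplification: a cluster is a base codepoint plus any following
    codepoints with a non-zero canonical combining class. Full grapheme
    segmentation (UAX #29) covers more cases than this.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return "".join(reversed(clusters))

text = "A\u0301E"                   # A + COMBINING ACUTE ACCENT + E
naive = flip_codepoints(text)       # E, accent, A: the accent now follows E
aware = flip_clusters(text)         # E, A, accent: the accent stays with A
```

Both results are well-formed strings; they simply answer different questions, which is the point of the disagreement in this thread.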

> What if I wanted to have a column that consisted of codepoints from all 
> over the Unicode range: a codepoint from Greek next to a codepoint from 
> Swahili next to a codepoint from Hungarian?  Shouldn't I be able to say 
> to a database:  this column contains codepoints (characters) and 
> collation is not relevant, sort the column using the numeric value of 
> the codepoints? 

Yes, I think so.  I'm not sure why you'd want that, but yes, it ought to
be possible, and right now SQLite3 lets you do that because it is not
aware of characters and glyphs -- SQLite3 is aware of only codepoints.
But if you load the ICU extensions that might change!
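
As a rough check of the "sort by numeric codepoint value" case: with no COLLATE clause SQLite compares TEXT with its default BINARY collation, which is a byte-wise comparison, and for UTF-8 text byte order coincides with codepoint order. The sketch below assumes a database using the default UTF-8 encoding and uses Python's standard sqlite3 module; the table and sample words are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # a new database stores TEXT as UTF-8 by default
conn.execute("CREATE TABLE t (s TEXT)")   # no COLLATE clause, so BINARY applies
words = ["zebra", "\u03a9mega", "\u00e9clair", "apple", "Zulu"]
conn.executemany("INSERT INTO t VALUES (?)", [(w,) for w in words])

by_sqlite = [row[0] for row in conn.execute("SELECT s FROM t ORDER BY s")]
by_codepoint = sorted(words)              # Python compares str by codepoint value
print(by_sqlite)
```

In a UTF-16 database the byte-wise order can differ from codepoint order for characters outside the Basic Multilingual Plane, so the equivalence shown here should not be assumed there.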

Ideally there should be a way to indicate a variety of Unicode-related
behaviors:

 - normalization form for use in index keys
 - normalization-insensitive string comparison operators
 - whether to normalize values in tables and, if so, with what form (by
   column, obviously)
    - if you normalize strings in index keys but not in tables then you
      get normalization-insensitive-but-normalization-preserving
      behavior, which is really, really convenient

 - collation options, such as language
    - whether to honor language tags embedded in the UTF-8/16 strings

 - multiple text types? (string of codepoints, of characters, or glyphs)

 - a whole range of Unicode-aware functions like substr() (and flip(),
   and like(), and regex(), and glob(), ...), with options for character
   and glyph counting instead of codepoint counting

 - codesets (for non-Unicode data), with automatic codeset conversions
   similar to type conversions
    - to have automatic conversions I think would require an extensible
      text type system

That's... a lot of functionality.  I'm not sure how much of it needs to
be implemented with help from the SQLite3 core, versus extensions.  It'd
be nice if all of it could be implemented via extensions, but I don't
think that's possible right now.
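
One item from the list above, normalization-insensitive but normalization-preserving comparison, can at least be approximated with an application-defined collation. The following is a minimal sketch using Python's standard sqlite3 and unicodedata modules, not a proposal for the SQLite core; the collation name NFC_CMP and the table are hypothetical.

```python
import sqlite3
import unicodedata

def nfc_compare(a, b):
    """Compare two strings by their NFC forms; the stored text is left alone."""
    na = unicodedata.normalize("NFC", a)
    nb = unicodedata.normalize("NFC", b)
    return (na > nb) - (na < nb)

conn = sqlite3.connect(":memory:")
conn.create_collation("NFC_CMP", nfc_compare)   # hypothetical collation name
conn.execute("CREATE TABLE names (name TEXT COLLATE NFC_CMP)")
conn.execute("CREATE INDEX names_idx ON names (name)")

decomposed = "Jose\u0301"     # 'e' followed by a combining acute accent
precomposed = "Jos\u00e9"     # single codepoint U+00E9
conn.execute("INSERT INTO names VALUES (?)", (decomposed,))

# The lookup uses the precomposed spelling; the row comes back as stored.
found = conn.execute(
    "SELECT name FROM names WHERE name = ?", (precomposed,)
).fetchone()
```

The catch with this approach is that the collation has to be registered on every connection that touches the table or its index, which is part of why core support would be more convenient.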

Nico
-- 


Re: [sqlite] Unicode support

2009-11-17 Thread Tim Romano
 but if ORDER BY is
relying on an index for ordering, then flip() can have negative effects.


Substr() could have negative effects on ordering too.  That is a red 
herring.  Flip() is merely a function that  reverses the order of 
codepoints "as found" without knowing anything about what those 
codepoints, individually or in combination, might signify in a writing 
system.  If I want to write those codepoints to a column that's my concern.

What if I wanted to have a column that consisted of codepoints from all 
over the Unicode range: a codepoint from Greek next to a codepoint from 
Swahili next to a codepoint from Hungarian?  Shouldn't I be able to say 
to a database:  this column contains codepoints (characters) and 
collation is not relevant, sort the column using the numeric value of 
the codepoints? 


Tim Romano


Nicolas Williams wrote:
> On Tue, Nov 17, 2009 at 05:15:16PM -0500, Igor Tandetnik wrote:
>   
>> Nicolas Williams  wrote:
>> 
>>> This is no longer true, either of 'ch' nor 'll'.
>>>   
>> There is a number of contractions in Hungarian that are still very
>> much in use, but I can't recall them off the top of my head the way I
>> can 'ch' (it's something like 'dzs'). There are also contractions in
>> German Phonebook sort (e.g. 'oe' should sort between 'o with umlaut'
>> and 'p', if I recall correctly). There are likely other cases.
>> 
>
> I'm not surprised :(
>
>   
>>> The principle you
>>> state is correct, of course, but really, this is a collation problem,
>>> and affects SQLite3 apps regardless of "flip()".
>>>   
>> My point is, it's difficult to even define what the correct behavior
>> of flip() should be, let alone implement one. And so the safest course
>> of action is to leave it out of core SQLite: a developer in need of
>> such a function would presumably know the nature of their data and
>> precisely what they want the function to achieve, and can always
>> implement it as a custom function.
>> 
>
> Maybe.  For indexing, I don't see the harm as long as an index built
> with this function isn't used for ORDER BY when you care about
> collations (ah! SQLite3 couldn't tell this is happening without knowing
> the semantics of the function).
>
>   
>>> The collation is
>>> per-column, and the run-time should make functions aware of the
>>> collation (if any) of a column when an argument.
>>>   
>> What about
>>
>> select flip(EnglishText || GermanText || SpanishText)
>> from MyMultilingualTable;
>> 
>
> No different than:
>
> select EnglishText || GermanText || SpanishText from MyMultilingualTable;
>
> the concatenation can create 'oe' and all those other whatever they are
> called's.
>
> This is OK until you ORDER BY, and _then_ the collation requested or
> inferred needs to apply.  Ah, there should be no inference of collation
> from function names, and functions shouldn't have to care about
> collations "in effect" -- only ORDER BY should care, but if ORDER BY is
> relying on an index for ordering, then flip() can have negative effects.
>
> Nico
>   
> 
>
>



Re: [sqlite] Unicode support

2009-11-17 Thread Jean-Christophe Deschamps
Tim,


>For those who are insisting on Unicode graphemic codepoint-combination
>intelligence:  why can't we have a function that simply reverses the
>order of the codepoints, and is blissfully ignorant about what those
>individual codepoints or codepoint-combinations might signify as
>graphemes in a writing system?  The flip() function could be totally
>naive about all that and be 100% deterministic. All I want is a way to
>get the monadic codepoints of a text-affinity column in reverse order.

I just wrote one for you, can you check your inbox?





Re: [sqlite] Unicode support

2009-11-17 Thread Tim Romano
For those who are insisting on Unicode graphemic codepoint-combination 
intelligence:  why can't we have a function that simply reverses the 
order of the codepoints, and is blissfully ignorant about what those 
individual codepoints or codepoint-combinations might signify as 
graphemes in a writing system?  The flip() function could be totally 
naive about all that and be 100% deterministic. All I want is a way to 
get the monadic codepoints of a text-affinity column in reverse order. 






Re: [sqlite] Unicode support

2009-11-17 Thread Nicolas Williams
On Tue, Nov 17, 2009 at 05:15:16PM -0500, Igor Tandetnik wrote:
> Nicolas Williams  wrote:
> > This is no longer true, either of 'ch' nor 'll'.
> 
> There is a number of contractions in Hungarian that are still very
> much in use, but I can't recall them off the top of my head the way I
> can 'ch' (it's something like 'dzs'). There are also contractions in
> German Phonebook sort (e.g. 'oe' should sort between 'o with umlaut'
> and 'p', if I recall correctly). There are likely other cases.

I'm not surprised :(

> > The principle you
> > state is correct, of course, but really, this is a collation problem,
> > and affects SQLite3 apps regardless of "flip()".
> 
> My point is, it's difficult to even define what the correct behavior
> of flip() should be, let alone implement one. And so the safest course
> of action is to leave it out of core SQLite: a developer in need of
> such a function would presumably know the nature of their data and
> precisely what they want the function to achieve, and can always
> implement it as a custom function.

Maybe.  For indexing, I don't see the harm as long as an index built
with this function isn't used for ORDER BY when you care about
collations (ah! SQLite3 couldn't tell this is happening without knowing
the semantics of the function).

> > The collation is
> > per-column, and the run-time should make functions aware of the
> > collation (if any) of a column when an argument.
> 
> What about
> 
> select flip(EnglishText || GermanText || SpanishText)
> from MyMultilingualTable;

No different than:

select EnglishText || GermanText || SpanishText from MyMultilingualTable;

the concatenation can create 'oe' and all those other whatever they are
called's.

This is OK until you ORDER BY, and _then_ the collation requested or
inferred needs to apply.  Ah, there should be no inference of collation
from function names, and functions shouldn't have to care about
collations "in effect" -- only ORDER BY should care, but if ORDER BY is
relying on an index for ordering, then flip() can have negative effects.

Nico
-- 


Re: [sqlite] Unicode support

2009-11-17 Thread Beau Wilkinson
A few minutes ago I wrote that:

>I think that as a general rule, the "combining" accents should be disregarded 
>during collation.
>
> etc.

I just read that "collation" page from Unicode.org and it seems to be 
completely at odds with what I suggested, e.g. in its insistence that some 
sequences of code points are "canonically equivalent."

In light of this fact, I do not see how Unicode can ever really be considered 
"collated." And it follows that it cannot be reversed. At least, this is the 
case if one follows the advice at Unicode.org.

The "collation" that Unicode.org seems to suggest is basically the invention of 
some academics. It does not seem to correspond to any human alphabet. Please, 
please correct me if I am wrong on this.

I have never been one of those to just ignore Unicode. But I am starting to see 
that it does not really work so well in the real world once one leaves the 
realm of  "ASCII-with-zeroes-on-top."
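
For reference, "canonically equivalent" has a concrete, checkable meaning: two codepoint sequences are canonically equivalent when they normalize to the same sequence. A short Python sketch with the standard unicodedata module, using the u-with-diaeresis case as the example (variable names are illustrative only):

```python
import unicodedata

precomposed = "\u00fc"        # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"        # 'u' followed by COMBINING DIAERESIS

# As raw codepoint sequences the two spellings differ...
differ_as_codepoints = precomposed != decomposed

# ...but they normalize to the same sequence under NFC and under NFD.
equal_after_nfc = (unicodedata.normalize("NFC", precomposed)
                   == unicodedata.normalize("NFC", decomposed))
equal_after_nfd = (unicodedata.normalize("NFD", precomposed)
                   == unicodedata.normalize("NFD", decomposed))
```

A comparison that treats these two spellings as different is therefore comparing encodings rather than text, which is what the collation document is warning about.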


From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] On 
Behalf Of Igor Tandetnik [itandet...@mvps.org]
Sent: Tuesday, November 17, 2009 1:01 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Unicode support

Simon Slavin <slav...@bigfraud.org> wrote:
> On 17 Nov 2009, at 6:37pm, Igor Tandetnik wrote:
>
>> Simon Slavin <slav...@bigfraud.org> wrote:
>>> First split the string into characters, then reassemble them in
>>> reverse order.
>>
>> The problem is, in Unicode it's not quite clear what constitutes a
>> "character". Are we talking about codepoints, sort elements,
>> graphemes? Depending on the application, either definition might
>> make sense.
>
> I agree about the problem, but sort elements is the obvious answer in
> this case.

This would mean that the result of the hypothetical flip() function would be 
locale-dependent. E.g. in Spanish Traditional sort, a combination 'ch' sorts as 
if it were a single letter between 'c' and 'd', forming a single sort element 
(a so-called contraction). So should 'a ch b' reverse to 'b ch a' under Spanish 
Traditional sort, and to 'b hc a' otherwise? Would you pass a desired locale as 
a parameter to flip(), in order to achieve that?

Igor Tandetnik


The information contained in this e-mail is privileged and confidential 
information intended only for the use of the individual or entity named.  If 
you are not the intended recipient, or the employee or agent responsible for 
delivering this message to the intended recipient, you are hereby notified that 
any disclosure, dissemination, distribution, or copying of this communication 
is strictly prohibited.  If you have received this e-mail in error, please 
immediately notify the sender and delete any copies from your system.


Re: [sqlite] Unicode support

2009-11-17 Thread Igor Tandetnik
Nicolas Williams  wrote:
> On Tue, Nov 17, 2009 at 02:01:55PM -0500, Igor Tandetnik wrote:
>> This would mean that the result of the hypothetical flip() function
>> would be locale-dependent. E.g. in Spanish Traditional sort, a
>> combination 'ch' sorts as if it were a single letter between 'c' and
>> 'd', forming a single sort element (a so-called contraction). So
>> should 'a ch b' reverse to 'b ch a' under Spanish Traditional sort,
>> and to 'b hc a' otherwise? Would you pass a desired locale as a
>> parameter to flip(), in order to achieve that?
> 
> This is no longer true, either of 'ch' nor 'll'.

There is a number of contractions in Hungarian that are still very much in use, 
but I can't recall them off the top of my head the way I can 'ch' (it's 
something like 'dzs'). There are also contractions in German Phonebook sort 
(e.g. 'oe' should sort between 'o with umlaut' and 'p', if I recall correctly). 
There are likely other cases.

> The principle you
> state is correct, of course, but really, this is a collation problem,
> and affects SQLite3 apps regardless of "flip()".

My point is, it's difficult to even define what the correct behavior of flip() 
should be, let alone implement one. And so the safest course of action is to 
leave it out of core SQLite: a developer in need of such a function would 
presumably know the nature of their data and precisely what they want the 
function to achieve, and can always implement it as a custom function.
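
As an illustration of the custom-function route, here is a minimal sketch that registers a naive, codepoint-level flip() from Python's standard sqlite3 module. It assumes text input and deliberately knows nothing about combining marks or contractions; that is exactly the policy decision being left to the application.

```python
import sqlite3

def flip(value):
    """Reverse a string codepoint by codepoint; NULL stays NULL."""
    if value is None:
        return None
    return value[::-1]

conn = sqlite3.connect(":memory:")
conn.create_function("flip", 1, flip)   # one-argument SQL function named flip

flipped = conn.execute("SELECT flip('a ch b')").fetchone()[0]
flipped_null = conn.execute("SELECT flip(NULL)").fetchone()[0]
print(flipped)
```

The same registration call exists in the C API as sqlite3_create_function(), so an application written in C could take the same approach.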

> The collation is
> per-column, and the run-time should make functions aware of the
> collation (if any) of a column when an argument.

What about

select flip(EnglishText || GermanText || SpanishText)
from MyMultilingualTable;

Igor Tandetnik




Re: [sqlite] Unicode support

2009-11-17 Thread Beau Wilkinson
>> On 17 Nov 2009, at 5:52pm, Igor Tandetnik wrote:
>>
>>> But for your goals, it has to be sortable, right? In a proper
>>> Unicode collation, U+0041 U+0301 would behave quite differently from
>>> U+0301 U+0041. Consider "A ' E" (where ' stands for a combining
>>> acute accent). In most locales, this would sort between AE and BE.
>>> Now, if we reverse it naively, we'll end up with "E ' A", with the
>>> accent now attached to E and not A. The result would sort between EA
>>> and FA, rather than between EA and EB as you would probably want.
>>

I think that as a general rule, the "combining" accents should be disregarded 
during collation.

For example: if a string contains the letter "a" plus a "combining acute 
accent," to me that seems like a hint that what we have is basically a letter 
"a," not a distinct letter with its own place in the collation sequence. This 
should be collated as an "a" that just happens to be accented, for whatever 
reason.

In Spanish, for example, a diaeresis is sometimes placed over the letter "U." 
This indicates that the preceding consonant is hard. It does not make the "U" 
into a different letter, or significantly affect the collation sequence. (At 
most, it is a tie-breaker between two otherwise identical words.)

So, I think the Spanish diaeresis thus represents a legitimate use of the 
Unicode "combining diaeresis." In fact, I would submit that encoding Spanish's 
"U with diaeresis" using code point U+00FC is just wrong, in the same way as 
coding letter "O" as ASCII 0x30 (zero) is wrong. We do not need to worry about 
cleaning up such a mistake in our collation code.

In German, and the Scandinavian languages, the opposite is true. Putting a 
diaeresis over a letter makes a new letter, which collates differently. 
"Combining accents" code points are not appropriate in these languages and 
their use should not be supported by a collation algorithm. Rather, these 
letters should be encoded using single code points.

I think a better approach (to the design of Unicode) would have been for 
Spanish and German (for instance) to share absolutely nothing in the encoding 
standards. Each language ought to have its own little span of letters, 
immortalized into the standard in correct order-of-collation, with no sharing 
of "code points," "characters," or anything else.

Unicode screws this up, as it does with so many things, and this is a big 
reason why it's widely reviled (or, ignored) by many programmers. This is 
editorial commentary, but I do not necessarily think it is irrelevant. I get 
the feeling that something better than Unicode must be brewing somewhere.

Of course, sometimes bad standards have a life of their own, because they give 
us license to refuse to implement things and still look smart in so refusing. I 
suggest that this is a very detrimental pattern, though.

____________
From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] On 
Behalf Of Igor Tandetnik [itandet...@mvps.org]
Sent: Tuesday, November 17, 2009 1:01 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Unicode support

Simon Slavin <slav...@bigfraud.org> wrote:
> On 17 Nov 2009, at 6:37pm, Igor Tandetnik wrote:
>
>> Simon Slavin <slav...@bigfraud.org> wrote:
>>> First split the string into characters, then reassemble them in
>>> reverse order.
>>
>> The problem is, in Unicode it's not quite clear what constitutes a
>> "character". Are we talking about codepoints, sort elements,
>> graphemes? Depending on the application, either definition might
>> make sense.
>
> I agree about the problem, but sort elements is the obvious answer in
> this case.

This would mean that the result of the hypothetical flip() function would be 
locale-dependent. E.g. in Spanish Traditional sort, a combination 'ch' sorts as 
if it were a single letter between 'c' and 'd', forming a single sort element 
(a so-called contraction). So should 'a ch b' reverse to 'b ch a' under Spanish 
Traditional sort, and to 'b hc a' otherwise? Would you pass a desired locale as 
a parameter to flip(), in order to achieve that?

Igor Tandetnik



Re: [sqlite] Unicode support

2009-11-17 Thread Nicolas Williams
On Tue, Nov 17, 2009 at 02:01:55PM -0500, Igor Tandetnik wrote:
> This would mean that the result of the hypothetical flip() function
> would be locale-dependent. E.g. in Spanish Traditional sort, a
> combination 'ch' sorts as if it were a single letter between 'c' and
> 'd', forming a single sort element (a so-called contraction). So
> should 'a ch b' reverse to 'b ch a' under Spanish Traditional sort,
> and to 'b hc a' otherwise? Would you pass a desired locale as a
> parameter to flip(), in order to achieve that?

This is no longer true of either 'ch' or 'll'.  The principle you
state is correct, of course, but really, this is a collation problem,
and affects SQLite3 apps regardless of "flip()".  The collation is
per-column, and the run-time should make functions aware of the
collation (if any) of a column when it is passed as an argument.

Nico
-- 


Re: [sqlite] Unicode support

2009-11-17 Thread Igor Tandetnik
Simon Slavin  wrote:
> On 17 Nov 2009, at 6:37pm, Igor Tandetnik wrote:
> 
>> Simon Slavin  wrote:
>>> First split the string into characters, then reassemble them in
>>> reverse order.
>> 
>> The problem is, in Unicode it's not quite clear what constitutes a
>> "character". Are we talking about codepoints, sort elements,
>> graphemes? Depending on the application, either definition might
>> make sense.   
> 
> I agree about the problem, but sort elements is the obvious answer in
> this case.

This would mean that the result of the hypothetical flip() function would be 
locale-dependent. E.g. in Spanish Traditional sort, a combination 'ch' sorts as 
if it were a single letter between 'c' and 'd', forming a single sort element 
(a so-called contraction). So should 'a ch b' reverse to 'b ch a' under Spanish 
Traditional sort, and to 'b hc a' otherwise? Would you pass a desired locale as 
a parameter to flip(), in order to achieve that?
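
To make the contraction example concrete, here is a toy sketch in Python (standard re and sqlite3 modules). The alphabet, the collation name ES_TRAD_TOY and the sample words are all invented for illustration and are not real locale data; the only point is that 'ch' is treated as one sort element between 'c' and 'd', both when sorting and when flipping.

```python
import re
import sqlite3

# Toy alphabet for the sketch: plain a-z with the contraction "ch" placed
# between "c" and "d". This is an illustration, not real locale data.
ALPHABET = ["a", "b", "c", "ch", "d"] + [chr(c) for c in range(ord("e"), ord("z") + 1)]
RANK = {element: i for i, element in enumerate(ALPHABET)}
ELEMENT = re.compile(r"ch|.", re.DOTALL)   # "ch" is matched before single characters

def sort_key(s):
    """Split s into sort elements ("ch" counts as one) and map them to ranks."""
    return [RANK.get(e, len(ALPHABET) + ord(e[0])) for e in ELEMENT.findall(s.lower())]

def traditional_compare(a, b):
    ka, kb = sort_key(a), sort_key(b)
    return (ka > kb) - (ka < kb)

conn = sqlite3.connect(":memory:")
conn.create_collation("ES_TRAD_TOY", traditional_compare)   # made-up name
conn.execute("CREATE TABLE w (s TEXT)")
conn.executemany("INSERT INTO w VALUES (?)", [("dama",), ("chico",), ("cuna",), ("casa",)])

plain = [r[0] for r in conn.execute("SELECT s FROM w ORDER BY s")]
trad = [r[0] for r in conn.execute("SELECT s FROM w ORDER BY s COLLATE ES_TRAD_TOY")]

# Flipping by sort element keeps the contraction intact.
flipped = "".join(reversed(ELEMENT.findall("a ch b")))
```

Under the default ordering 'chico' sorts before 'cuna'; under the toy collation it sorts after it, and the element-wise flip yields 'b ch a' rather than 'b hc a'. A locale-independent flip() cannot produce both answers, which is the difficulty being described.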

Igor Tandetnik



[sqlite] Unicode support

2009-11-17 Thread Simon Slavin

On 17 Nov 2009, at 6:37pm, Igor Tandetnik wrote:

> Simon Slavin  wrote:
>> On 17 Nov 2009, at 5:52pm, Igor Tandetnik wrote:
>> 
>>> But for your goals, it has to be sortable, right? In a proper
>>> Unicode collation, U+0041 U+0301 would behave quite differently from
>>> U+0301 U+0041. Consider "A ' E" (where ' stands for a combining
>>> acute accent). In most locales, this would sort between AE and BE.
>>> Now, if we reverse it naively, we'll end up with "E ' A", with the
>>> accent now attached to E and not A. The result would sort between EA
>>> and FA, rather than between EA and EB as you would probably want.   
>> 
>> Obviously, your routine to reverse a string must be unicode-aware. 
> 
> Tim Romano seems to insist on precisely the opposite.

That would be sufficient for Tim, but it's too weak to be useful for many people, 
so it's probably never going to be written.

>> First split the string into characters, then reassemble them in
>> reverse order.
> 
> The problem is, in Unicode it's not quite clear what constitutes a 
> "character". Are we talking about codepoints, sort elements, graphemes? 
> Depending on the application, either definition might make sense.

I agree about the problem, but sort elements is the obvious answer in this 
case.  By the way, for those of you wondering about what it would take to 
support Unicode in an index (i.e. to sort Unicode strings) here's an outline of 
the problems involved and what's necessary:



Simon.


RE: [sqlite] Unicode support for Sqlite?

2007-12-12 Thread Sreedhar.a
 
Thank you all for the quick replies.

Best Regards,
A.Sreedhar.
 

-Original Message-
From: Trevor Talbot [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, December 12, 2007 5:08 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Unicode support for Sqlite?

On 12/12/07, Sreedhar.a <[EMAIL PROTECTED]> wrote:

> I am using the sqlite to store the metadata of audio files.
> Is it possible to store the metadata in unicode character format in sqlite.

Yes; SQLite assumes all TEXT type data in the database is Unicode. You can
work with it in UTF-8 with the *_text() APIs, or UTF-16 using the
*_text16() calls. SQLite will convert between the two encodings as
necessary.

The sqlite3 shell assumes UTF-8, but it depends on the platform's console to
actually use UTF-8 when talking to it, so it may be difficult to properly
test with it.


-
To unsubscribe, send email to [EMAIL PROTECTED]

-







Re: [sqlite] Unicode support for Sqlite?

2007-12-12 Thread Trevor Talbot
On 12/12/07, Sreedhar.a <[EMAIL PROTECTED]> wrote:

> I am using the sqlite to store the metadata of audio files.
> Is it possible to store the metadata in unicode character format in sqlite.

Yes; SQLite assumes all TEXT type data in the database is Unicode. You
can work with it in UTF-8 with the *_text() APIs, or UTF-16 using the
*_text16() calls. SQLite will convert between the two encodings as
necessary.

The sqlite3 shell assumes UTF-8, but it depends on the platform's
console to actually use UTF-8 when talking to it, so it may be
difficult to properly test with it.
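
Because the shell's output depends on the console, one way to check the round trip without involving a terminal is to compare values in code. The following is a minimal sketch using Python's standard sqlite3 module (which passes strings to SQLite as UTF-8); the table and column names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (title TEXT)")  # hypothetical schema

# Arabic, Chinese and accented Latin text, bound as parameters
titles = ["مرحبا", "你好", "Ångström"]
conn.executemany("INSERT INTO tracks (title) VALUES (?)",
                 [(t,) for t in titles])

# Read the values back in insertion order and compare them in code,
# so the console encoding never enters the picture
stored = [row[0] for row in
          conn.execute("SELECT title FROM tracks ORDER BY rowid")]
round_trip_ok = stored == titles
```

If round_trip_ok is true, the data is stored correctly and any garbling seen in the shell comes from the console rather than from SQLite.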




Re: [sqlite] Unicode support for Sqlite?

2007-12-12 Thread Daniel Önnerby
utf-8 and utf-16 ARE unicode formats. But there are some things that 
sqlite does not handle without the ICU extension.

The ICU extension extends SQLite with the following functionality:
   1.1  SQL Scalars upper() and lower()
   1.2  Unicode Aware LIKE Operator
   1.3  ICU Collation Sequences
   1.4  SQL REGEXP Operator

Download the SQLite source and have a look in the ext/icu directory
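
Where building the ICU extension is not an option, an application can register its own Unicode-aware scalar function instead. The sketch below uses Python's standard sqlite3 module; the SQL function name uupper is hypothetical, and Python's str.upper() stands in for ICU's case mapping. The result of the built-in upper() is deliberately not assumed here, since it depends on how the SQLite library was built.

```python
import sqlite3

def unicode_upper(value):
    """Upper-case using Python's Unicode case mapping; NULL stays NULL."""
    return None if value is None else str(value).upper()

conn = sqlite3.connect(":memory:")
# Register the Python function as a one-argument SQL function
conn.create_function("uupper", 1, unicode_upper)

builtin, custom = conn.execute(
    "SELECT upper(?), uupper(?)", ("éa", "éa")
).fetchone()
# custom is 'ÉA'; builtin may leave 'é' unchanged on builds without ICU
```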

Sreedhar.a wrote:

Hi,
 
Does Sqlite support unicode?

I have seen that it supports utf-8 and utf-16.
I want to know whether it supports unicode character formats.
 
Thanks and Best Regards,

A.Sreedhar.
 
 

  





RE: [sqlite] Unicode support for Sqlite?

2007-12-12 Thread Sreedhar.a
 
Hi,

I am using the sqlite to store the metadata of audio files.
Is it possible to store the metadata in unicode character format in sqlite.

Best Regards,
A.Sreedhar.
 

-Original Message-
From: Trevor Talbot [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, December 12, 2007 4:40 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Unicode support for Sqlite?

On 12/12/07, Sreedhar.a <[EMAIL PROTECTED]> wrote:

> Does Sqlite support unicode?
> I have seen that it supports utf-8 and utf-16.
> I want to know whether it supports unicode character formats.

Unicode is a very large and complex topic, so that question is way too vague
to answer. Can you provide an example of what you're looking for?





Re: [sqlite] Unicode support for Sqlite?

2007-12-12 Thread Trevor Talbot
On 12/12/07, Sreedhar.a <[EMAIL PROTECTED]> wrote:

> Does Sqlite support unicode?
> I have seen that it supports utf-8 and utf-16.
> I want to know whether it supports unicode character formats.

Unicode is a very large and complex topic, so that question is way too
vague to answer. Can you provide an example of what you're looking
for?




[sqlite] Unicode support for Sqlite?

2007-12-12 Thread Sreedhar.a
Hi,
 
Does Sqlite support unicode?
I have seen that it supports utf-8 and utf-16.
I want to know whether it supports unicode character formats.
 
Thanks and Best Regards,
A.Sreedhar.
 
 


Re: [sqlite] UNICODE Support

2006-08-05 Thread Nathaniel Smith
On Fri, Aug 04, 2006 at 10:02:58PM -0700, Cory Nelson wrote:
> On 8/4/06, Trevor Talbot <[EMAIL PROTECTED]> wrote:
> >On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:
> >
> >> But, since you brought it up - I have no expectations of SQLite
> >> integrating a full Unicode locale library, however it would be a great
> >> improvement if it would respect the current locale and use wcs*
> >> functions when available, or at least order by standard Unicode order
> >> instead of completely mangling things on UTF-8 codes.
> >
> >What do you mean by "standard Unicode order" in this context?
> >
> 
> Convert UTF-8 to UTF-16 (or both to UCS-4 if you want to be entirely
> correct) while sorting, to at least make them follow the same pattern.

Huh?

UTF-8 handled in the naive way (using "memcmp", like sqlite does) will
automagically give you sorting by unicode codepoint (probably the only
useful meaning of "standard Unicode order" here).

UTF-16 handled in the naive way (either using "memcmp" or
lexicographically on 2-byte integers) will sort things by codepoint,
mostly, sort of, and otherwise by a weird order that falls out of
details of the UTF-16 standard accidentally.[1]

Perhaps you're using a legacy system that standardized on UTF-16
before the BMP ran out, and want to be compatible with its
idiosyncratic sorting -- then converting things to UTF-16 before
comparing makes sense.  But that's not really appropriate to make as a
general recommendation... better to convert UTF-16 to UTF-8, if you
want to be entirely correct :-).
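
The claim is easy to check with a small experiment. The following Python sketch sorts the same strings by code point, by their UTF-8 bytes, and by their UTF-16 bytes; big-endian UTF-16 is used here as an assumption so that a byte-wise comparison behaves like a comparison of 2-byte units. The sample strings are arbitrary, chosen to include one character above the BMP.

```python
# 'a', 'é', U+FF5E (near the top of the BMP), U+10000 (first code point
# above the BMP, stored as a surrogate pair in UTF-16)
strings = ["\uFF5E", "\U00010000", "a", "\u00E9"]

by_code_point = sorted(strings)  # Python compares str by code point
by_utf8 = sorted(strings, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(strings, key=lambda s: s.encode("utf-16-be"))

utf8_matches = by_utf8 == by_code_point    # byte order == code point order
utf16_matches = by_utf16 == by_code_point  # surrogates (D800..) sort
                                           # below U+FF5E, so this differs
```

In UTF-16 the surrogate pair for U+10000 begins with 0xD800, which compares lower than 0xFF5E, so the supplementary character sorts ahead of a BMP character with a smaller code point.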

[1] see e.g. http://icu.sourceforge.net/docs/papers/utf16_code_point_order.html

-- Nathaniel

-- 
Details are all that matters; God dwells there, and you never get to
see Him if you don't struggle to get them right. -- Stephen Jay Gould


Re: [sqlite] UNICODE Support

2006-08-05 Thread Trevor Talbot

On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:

On 8/4/06, Trevor Talbot <[EMAIL PROTECTED]> wrote:
> On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:
>
> > But, since you brought it up - I have no expectations of SQLite
> > integrating a full Unicode locale library, however it would be a great
> > improvement if it would respect the current locale and use wcs*
> > functions when available, or at least order by standard Unicode order
> > instead of completely mangling things on UTF-8 codes.



> What do you mean by "standard Unicode order" in this context?



Convert UTF-8 to UTF-16 (or both to UCS-4 if you want to be entirely
correct) while sorting, to at least make them follow the same pattern.


Ah, so Unicode codepoint order.  Unfortunately this isn't accurate:
UTF-8 and UTF-32/UCS-4 are both naturally in codepoint order (UTF-8
because of the MSB-first style format), but UTF-16 isn't due to the
way surrogate pairs are constructed.  UTF-16 is actually the oddball
here :P


Re: [sqlite] UNICODE Support

2006-08-04 Thread Cory Nelson

On 8/4/06, Trevor Talbot <[EMAIL PROTECTED]> wrote:

On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:

> But, since you brought it up - I have no expectations of SQLite
> integrating a full Unicode locale library, however it would be a great
> improvement if it would respect the current locale and use wcs*
> functions when available, or at least order by standard Unicode order
> instead of completely mangling things on UTF-8 codes.

What do you mean by "standard Unicode order" in this context?



Convert UTF-8 to UTF-16 (or both to UCS-4 if you want to be entirely
correct) while sorting, to at least make them follow the same pattern.

--
Cory Nelson
http://www.int64.org


Re: [sqlite] UNICODE Support

2006-08-04 Thread Trevor Talbot

On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:


But, since you brought it up - I have no expectations of SQLite
integrating a full Unicode locale library, however it would be a great
improvement if it would respect the current locale and use wcs*
functions when available, or at least order by standard Unicode order
instead of completely mangling things on UTF-8 codes.


What do you mean by "standard Unicode order" in this context?


Re: [sqlite] UNICODE Support

2006-08-04 Thread Nuno Lucas

On 8/5/06, Cory Nelson <[EMAIL PROTECTED]> wrote:

On 8/4/06, Nuno Lucas <[EMAIL PROTECTED]> wrote:
> On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:
> > IE, using memcmp() to compare strings.  I've been bitten by this
> > before, with SQLite producing unexpected results when using UTF-8.
> > Using UTF-16 has worked more reliably in my experience.
>
> SQLite only knows how to sort ASCII, so memcmp does that right (be
> it UTF-8 or UTF-16).
>
> If you think about it, the only way sorting will work 100% is by
> having some form of localization (because for each language different
> sorting rules apply, _even_ for words composed only of ASCII
> characters).
>
> Adding localization to SQLite is out of the question (it would
> probably need a library as big as SQLite itself), so it's up to the
> user to define its own localization functions and integrate them with
> sqlite (there are all the necessary hooks ready for that).

I was not talking about sorting in my post - I've had simple = index
comparisons fail in UTF-8.


You should have reported it. If it's true, it's a bug that needs to be
corrected.
But again I would say I never found a bug like that in sqlite.


But, since you brought it up - I have no expectations of SQLite
integrating a full Unicode locale library, however it would be a great
improvement if it would respect the current locale and use wcs*
functions when available, or at least order by standard Unicode order
instead of completely mangling things on UTF-8 codes.


For it to respect the current locale then the database would be
invalid after moving/using it in another locale (the affected indexes
would need to be rebuilt). Using the COLLATE thing (which I never used
exactly because of the problem above) you can define your own sort
function that does what you want.
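
As a rough illustration of such a user-defined sort function, here is a sketch using Python's standard sqlite3 module. The collation name SIMPLE_UNICODE and the comparison rule (strip combining marks, then case-fold) are assumptions made for the example; this is not a real locale-aware collation.

```python
import sqlite3
import unicodedata

def sort_key(s):
    """Drop combining marks and fold case (a crude, locale-free rule)."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed
                       if not unicodedata.combining(c))
    return stripped.casefold()

def collate_simple(a, b):
    """Return negative, zero or positive, as SQLite collations require."""
    ka, kb = sort_key(a), sort_key(b)
    return (ka > kb) - (ka < kb)

conn = sqlite3.connect(":memory:")
conn.create_collation("SIMPLE_UNICODE", collate_simple)
conn.execute("CREATE TABLE words (w TEXT)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [("zebra",), ("Émile",), ("apple",), ("eagle",)])

binary_order = [r[0] for r in
                conn.execute("SELECT w FROM words ORDER BY w")]
custom_order = [r[0] for r in conn.execute(
    "SELECT w FROM words ORDER BY w COLLATE SIMPLE_UNICODE")]
```

The default ordering puts 'Émile' after 'zebra' because it compares bytes; the custom collation places it between 'eagle' and 'zebra'. The caveat above still applies: an index built with such a collation is only valid as long as the function keeps returning the same results.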

On the second point, you may be right, and it can be considered a bug. A
sorted table should have exactly the same order whether the database
is using UTF-8 or UTF-16 internally (even if it doesn't follow the
UNICODE order). At least it seems consistency of a query result should
be assured here.

Maybe others have another point of view...


Regards,
~Nuno Lucas


Re: [sqlite] UNICODE Support

2006-08-04 Thread Cory Nelson

On 8/4/06, Nuno Lucas <[EMAIL PROTECTED]> wrote:

On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:
> IE, using memcmp() to compare strings.  I've been bitten by this
> before, with SQLite producing unexpected results when using UTF-8.
> Using UTF-16 has worked more reliably in my experience.

SQLite only knows how to sort ASCII, so memcmp does that right (be
it UTF-8 or UTF-16).

If you think about it, the only way sorting will work 100% is by
having some form of localization (because for each language different
sorting rules apply, _even_ for words composed only of ASCII
characters).

Adding localization to SQLite is out of the question (it would
probably need a library as big as SQLite itself), so it's up to the
user to define its own localization functions and integrate them with
sqlite (there are all the necessary hooks ready for that).


I was not talking about sorting in my post - I've had simple = index
comparisons fail in UTF-8.

But, since you brought it up - I have no expectations of SQLite
integrating a full Unicode locale library, however it would be a great
improvement if it would respect the current locale and use wcs*
functions when available, or at least order by standard Unicode order
instead of completely mangling things on UTF-8 codes.



Regards,
~Nuno Lucas




--
Cory Nelson
http://www.int64.org


Re: [sqlite] UNICODE Support

2006-08-04 Thread Nuno Lucas

On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:

IE, using memcmp() to compare strings.  I've been bitten by this
before, with SQLite producing unexpected results when using UTF-8.
Using UTF-16 has worked more reliably in my experience.


SQLite only knows how to sort ASCII, so memcmp does that right (be
it UTF-8 or UTF-16).

If you think about it, the only way sorting will work 100% is by
having some form of localization (because for each language different
sorting rules apply, _even_ for words composed only of ASCII
characters).

Adding localization to SQLite is out of the question (it would
probably need a library as big as SQLite itself), so it's up to the
user to define its own localization functions and integrate them with
sqlite (there are all the necessary hooks ready for that).


Regards,
~Nuno Lucas


Re: [sqlite] UNICODE Support

2006-08-04 Thread Cory Nelson

On 8/4/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

"Cory Nelson" <[EMAIL PROTECTED]> wrote:
> On 8/3/06, RohitPatel <[EMAIL PROTECTED]> wrote:
>
> I recommend using utf-16 in the database - sqlite doesn't fully
> support utf-8, and some things may give unexpected results if you use
> it.
>

Oh really?  What exactly is missing from SQLite's UTF-8 support?


Correct me if I'm wrong but from what I understand SQLite supports
storing and converting between UTF-8 and UTF-16, but that is where the
support stops.  It is wrong (in my opinion) to claim UTF-8 support, at
least without a clear upfront warning, when that's all it offers.

IE, using memcmp() to compare strings.  I've been bitten by this
before, with SQLite producing unexpected results when using UTF-8.
Using UTF-16 has worked more reliably in my experience.


--
D. Richard Hipp   <[EMAIL PROTECTED]>





--
Cory Nelson
http://www.int64.org


Re: [sqlite] UNICODE Support

2006-08-04 Thread drh
"Cory Nelson" <[EMAIL PROTECTED]> wrote:
> On 8/3/06, RohitPatel <[EMAIL PROTECTED]> wrote:
> 
> I recommend using utf-16 in the database - sqlite doesn't fully
> support utf-8, and some things may give unexpected results if you use
> it.
> 

Oh really?  What exactly is missing from SQLite's UTF-8 support?
--
D. Richard Hipp   <[EMAIL PROTECTED]>



RE: [sqlite] UNICODE Support

2005-06-08 Thread Dennis Volodomanov
You can convert your text using A2W() and W2A() functions (or others)
before passing it to SQLite and after retrieving it back from SQLite.
That's what we do (it's a Japanese application).

   Dennis 

-Original Message-
From: Ajay [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 09, 2005 12:12 AM
To: sqlite-users@sqlite.org
Subject: RE: [sqlite] UNICODE Support


But what about the SQLite Function's parameters whose data type is LPSTR ? 
Let me know the details to support wide char ?

Regards,
Ajay Sonawane


-Original Message-
From: Martin Engelschalk [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 08, 2005 6:48 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] UNICODE Support

Hi,

See http://www.sqlite.org/pragma.html, search for 'PRAGMA encoding'

/Martin

Ajay schrieb:

>Hello there,
>
>Does SQLite support UNICODE? Can I store some Arabic or Chinese text in
>database?
>
>If it does not support UNICODE, Is there any workaround for that?
>
> 
>
>Regards,
>
>Ajay Sonawane
>
> 
>
>
>  
>







RE: [sqlite] UNICODE Support

2005-06-08 Thread Ajay

But what about the SQLite Function's parameters whose data type is LPSTR ? 
Let me know the details to support wide char ?

Regards,
Ajay Sonawane


-Original Message-
From: Martin Engelschalk [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 08, 2005 6:48 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] UNICODE Support

Hi,

See http://www.sqlite.org/pragma.html, search for 'PRAGMA encoding'

/Martin

Ajay schrieb:

>Hello there,
>
>Does SQLite support UNICODE? Can I store some Arabic or Chinese text in
>database?
>
>If it does not support UNICODE, Is there any workaround for that?
>
> 
>
>Regards,
>
>Ajay Sonawane
>
> 
>
>
>  
>



Re: [sqlite] UNICODE Support

2005-06-08 Thread Martin Engelschalk

Hi,

See http://www.sqlite.org/pragma.html, search for 'PRAGMA encoding'
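
For what it's worth, a minimal sketch of that pragma in use, written with Python's standard sqlite3 module. It assumes a brand-new database, since the encoding can only be chosen before the database is created; the table name and sample text are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Must run before any table exists; afterwards the encoding is fixed
conn.execute("PRAGMA encoding = 'UTF-16le'")
conn.execute("CREATE TABLE notes (body TEXT)")  # hypothetical schema

sample = "نص عربي / 中文"  # Arabic and Chinese text
conn.execute("INSERT INTO notes VALUES (?)", (sample,))

encoding = conn.execute("PRAGMA encoding").fetchone()[0]
body = conn.execute("SELECT body FROM notes").fetchone()[0]
# body equals sample: SQLite converts between the stored encoding and
# the encoding used by the calling API
```

The application sees the same text whichever encoding the file uses; the pragma only selects how TEXT values are stored on disk.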

/Martin

Ajay schrieb:


Hello there,

Does SQLite support UNICODE? Can I store some Arabic or Chinese text in
database?

If it does not support UNICODE, Is there any workaround for that?



Regards,

Ajay Sonawane




 



[sqlite] UNICODE Support

2005-06-08 Thread Ajay
Hello there,

Does SQLite support UNICODE? Can I store some Arabic or Chinese text in
database?

If it does not support UNICODE, Is there any workaround for that?

 

Regards,

Ajay Sonawane