Re: [Wikitech-l] how to convert the latin1 SQL dump back into UTF-8?

2009-03-05 Thread Daniel Kinzler
jida...@jidanni.org wrote:
> Say, e.g., api.php?action=query&list=logevents looks fine, but when I
> look at the same table in an SQL dump, the Chinese utf8 is just a
> latin1 jumble. How can I convert such strings back to utf8? I can't
> find the place where MediaWiki converts them back and forth.

It doesn't. It's already UTF-8; only MySQL thinks it's not. This is because MySQL
doesn't support UTF-8 before 4.1, and even in 4.1 and later, the support is
flaky.

So MediaWiki (by default) tells MySQL that the data is latin1 and treats it
as binary.

Whether you see it as a "jumble" depends entirely on the program you view it with.

This is a nasty hack, and it may cause corruption when importing/exporting
dumps. Be careful about it.

-- daniel
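
To see why the "jumble" depends on the viewer, here is a minimal Python
sketch. The sample string is an assumption, not something taken from a real
dump. The same bytes are decoded once as latin1 and once as UTF-8:

```python
# Sample title; any non-ASCII text would do (illustrative, not from a real dump).
title = "\u4e2d\u6587"

stored = title.encode("utf-8")        # the bytes MediaWiki hands to MySQL
as_latin1 = stored.decode("latin-1")  # what a latin1-minded viewer displays
as_utf8 = stored.decode("utf-8")      # what a UTF-8-minded viewer displays

# Six bytes become six "jumble" characters under latin1, two under UTF-8.
print(len(stored), len(as_latin1), len(as_utf8))
# Nothing is lost: the jumble maps back to exactly the same bytes.
print(as_latin1.encode("latin-1") == stored)
```

The stored bytes never change; only the interpretation does.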

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] how to convert the latin1 SQL dump back into UTF-8?

2009-03-05 Thread jidanni
The BLOBs are fine; it's just the VARCHARs, e.g.
 `ar_title` varchar(255) character set latin1 collate latin1_bin NOT NULL default '',
How can one convert these back to UTF-8 with a script, outside of MySQL, just
for occasional viewing of the SQL dumps outside of the wiki?
Yes, my wiki works fine.
OK, I'll study
http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t.html


2009-03-05 Thread Daniel Kinzler
jida...@jidanni.org wrote:
> The BLOBs are fine, it's just the VARCHARs,
>  `ar_title` varchar(255) character set latin1 collate latin1_bin NOT NULL default '',
> How can one convert these back to UTF-8 with a script, outside of mysql, just
> for occasional viewing of the SQL dumps outside of the wiki.
> Yes, my wiki works fine.
> OK, I'll study
> http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t.html

Again: never mind what it is declared as, it *is* UTF-8. MySQL may, however,
automatically convert it on the way to the client or dump program. To prevent
that, tell MySQL that the encoding of your client is latin1. Confusing? Hell
yeah :)

-- daniel
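
If a dump was already written with the client encoding set to utf8, the
latin1-declared columns come out encoded twice. Below is a minimal sketch of
undoing that outside MySQL, assuming the dump text has been read as UTF-8.
The function names are hypothetical. It relies on MySQL's latin1 being
cp1252, with the five byte values that cp1252 leaves undefined passed through
as the code points of the same number.

```python
def latin1_byte(ch: str) -> bytes:
    """Return the single byte that MySQL's latin1 (cp1252) uses for this character."""
    try:
        return ch.encode("cp1252")
    except UnicodeEncodeError:
        # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in cp1252; MySQL turns
        # them into U+0081 etc., so plain latin-1 maps them back.
        return ch.encode("latin-1")


def undo_double_encoding(text: str) -> str:
    """Recover the original UTF-8 text from a latin1-to-UTF-8 converted string.

    Returns the input unchanged when it does not look double-encoded.
    """
    try:
        raw = b"".join(latin1_byte(ch) for ch in text)
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

Applied line by line to such a dump, this should leave ASCII and already
correct text alone and only rewrite the strings that were converted twice.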


2009-03-07 Thread howard chen
On Fri, Mar 6, 2009 at 3:54 PM, Daniel Kinzler wrote:
> Again: never mind what it is declared as, it *is* UTF-8. MySQL may however
> automatically convert it on the way to the client or dump program. To prevent
> that, tell mysql that the encoding of your client is latin1. Confusing? Hell
> yea :)

The best way is to use VARBINARY or BINARY columns.


2009-03-10 Thread jidanni
OK, I found if I use "mysqldump --default-character-set=latin1"
I can read all that can be read in the dump.
The only difference from plain mysqldump is
-/*!40101 SET NAMES utf8 */;
+/*!40101 SET NAMES latin1 */;
But that doesn't seem to affect restores from the SQL file. I'm sold.
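
The SET NAMES line is a quick way to tell which kind of dump a file is. Here
is a minimal sketch, with a hypothetical helper name, that pulls the charset
out of that header. For a MediaWiki database with latin1-declared columns,
`latin1` would suggest the text was passed through untouched, and `utf8`
that it was converted on the way out.

```python
import re

# Matches the version-guarded header mysqldump writes,
# e.g. /*!40101 SET NAMES latin1 */;
SET_NAMES = re.compile(r"/\*!\d+\s+SET\s+NAMES\s+(\w+)\s*\*/;")


def dump_client_charset(dump_text: str):
    """Return the charset named in the dump's SET NAMES header, or None."""
    match = SET_NAMES.search(dump_text)
    return match.group(1) if match else None
```

Reading only the first few kilobytes of the file should be enough, since
mysqldump writes this header near the top.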


2009-03-11 Thread Tei
Note to self: look into the code that orders text (collation) in
MediaWiki; it has to be a fun one :-)


-- 
--
ℱin del ℳensaje.


2009-03-11 Thread Daniel Kinzler
Tei wrote:
> note to self: look into the code that orders text (collation) in
> mediawiki, has to be fun one :-)

There is none. Sorting is done by the database. That is to say, in the default
"compatibility" mode, binary "collation" is used - that is, byte-by-byte
comparison of UTF-8 encoded data. Which sucks. But we are stuck with it until
MySQL gets proper Unicode support.
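
A small Python sketch of what byte-by-byte comparison does to a word list
(the sample words are made up for illustration):

```python
words = ["zebra", "\u00c4pfel", "apple", "Zebra"]

# Binary "collation": order the strings by their UTF-8 bytes.
binary_order = sorted(words, key=lambda w: w.encode("utf-8"))
print(binary_order)
```

Upper-case ASCII sorts before lower-case ASCII, and anything non-ASCII
(here the word starting with A-umlaut) lands after all of ASCII, so the
result is Zebra, apple, zebra and then the umlaut word.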

If you set up the database to use proper UTF-8, collation is a bit better
(though still not configurable, I think). But it crashes hard if you try to
store characters that are outside the Basic Multilingual Plane (Gothic letters,
some obscure Chinese characters, ...) - that's why this is not used on
Wikipedia.

-- daniel
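
Characters outside the Basic Multilingual Plane are the ones above U+FFFF,
which take four bytes in UTF-8, while MySQL's utf8 type stores at most three
bytes per character. A minimal sketch for spotting them before they reach
the database (the helper name is hypothetical):

```python
def outside_bmp(text: str) -> list:
    """Return the characters of text that lie beyond U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


gothic_ahsa = "\U00010330"  # GOTHIC LETTER AHSA, first letter of the Gothic block
print(len(gothic_ahsa.encode("utf-8")))   # bytes needed in UTF-8
print(outside_bmp("a" + gothic_ahsa + "b") == [gothic_ahsa])
```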


2009-03-11 Thread Aryeh Gregor
On Wed, Mar 11, 2009 at 6:14 AM, Daniel Kinzler wrote:
> There is none. Sorting is done by the database. That is to say, in the default
> "compatibility" mode, binary "collation" is used - that is, byte-by-byte
> comparison of UTF-8 encoded data. Which sucks. But we are stuck with it until
> MySQL gets proper Unicode support.

And until we upgrade to that version.  MySQL 4.0 doesn't have *any*
Unicode support -- or any character encoding support, in fact.  Everything
is binary.

But we don't have to wait on MySQL.  We would just have to store a
Unicode sortkey in cl_sortkey instead of the actual Unicode
characters.  This would require an implementation of a Unicode sorting
algorithm in MediaWiki.  It could be language-specific or whatever you
want.
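
A very rough sketch of the idea in Python, with a hypothetical function
name. This is not the Unicode Collation Algorithm: it only folds case and
drops accents, as a stand-in to show how a stored key can make plain binary
comparison behave more like alphabetical order.

```python
import unicodedata


def crude_sortkey(title: str) -> bytes:
    """Build a byte string whose binary order roughly follows alphabetical order.

    Stand-in only: folds case and strips combining marks, nothing more.
    """
    decomposed = unicodedata.normalize("NFKD", title.casefold())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.encode("utf-8")


titles = ["zebra", "\u00c4pfel", "apple", "Zebra"]
print(sorted(titles, key=crude_sortkey))
```

The database would then compare the stored keys byte by byte, so the
language-specific rules live entirely in whatever function produces the key.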


2009-03-11 Thread Petr Kadlec
2009/3/11 Aryeh Gregor:
> But we don't have to wait on MySQL.  We would just have to store a
> Unicode sortkey in cl_sortkey instead of the actual Unicode
> characters.  This would require an implementation of a Unicode sorting
> algorithm in MediaWiki.  It could be language-specific or whatever you
> want.

I still hold the belief that implementing a Unicode sorting algorithm
is none of the business of a PHP wiki engine (it is like implementing its
own file system). But still, if that is the only way my favorite
https://bugzilla.wikimedia.org/show_bug.cgi?id=164 would get resolved…

-- [[cs:User:Mormegil | Petr Kadlec]]


2009-03-11 Thread Daniel Kinzler
Aryeh Gregor wrote:
> On Wed, Mar 11, 2009 at 6:14 AM, Daniel Kinzler wrote:
>> There is none. Sorting is done by the database. That is to say, in the default
>> "compatibility" mode, binary "collation" is used - that is, byte-by-byte
>> comparison of UTF-8 encoded data. Which sucks. But we are stuck with it until
>> MySQL gets proper Unicode support.
>
> And until we upgrade to that version.  MySQL 4.0 doesn't have *any*
> Unicode support -- or any character encoding support, in fact.  Everything
> is binary.

right :)

> But we don't have to wait on MySQL.  We would just have to store a
> Unicode sortkey in cl_sortkey instead of the actual Unicode
> characters.  This would require an implementation of a Unicode sorting
> algorithm in MediaWiki.  It could be language-specific or whatever you
> want.

Yes, I thought about that a bit too. One problem would be that you can't use
that to make pretty sections on the category page. But that would be solvable
using an extra column, I suppose. Or by some kind of extra special magic
mapping.

-- daniel
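
A rough sketch of the extra-column idea, with hypothetical field names
(`sortkey`, `heading`): the opaque key only drives the ordering, while the
heading shown on the category page is stored next to it.

```python
from itertools import groupby
from typing import NamedTuple


class CategoryRow(NamedTuple):
    sortkey: bytes  # opaque key, compared byte by byte by the database
    heading: str    # character shown as the section title
    title: str      # page title as displayed


def sections(rows):
    """Order rows by sortkey and group them under their stored headings."""
    ordered = sorted(rows, key=lambda r: r.sortkey)
    return [(heading, [r.title for r in group])
            for heading, group in groupby(ordered, key=lambda r: r.heading)]


rows = [
    CategoryRow(b"apple", "A", "Apple"),
    CategoryRow(b"zebra", "Z", "Zebra"),
    CategoryRow(b"apfel", "A", "\u00c4pfel"),
]
print(sections(rows))
```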


2009-03-11 Thread Aryeh Gregor
On Wed, Mar 11, 2009 at 10:04 AM, Daniel Kinzler wrote:
> Yes, I thought about that a bit too. One problem would be that you can't use
> that to make pretty sections on the category page. But that would be solvable
> using an extra column, I suppose. Or by some kind of extra special magic
> mapping.

While we're implementing obscene hacks, we can reserve the first byte
for the namespace (to sort subcats/pages/files separately), use the
middle for a Unicode sort key, and reserve the last four bytes for a
UTF-8 header character (which could also be language-specific).  :D
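
For illustration only, the layout described above could be packed and
unpacked like this in Python. The function names and the zero-byte padding
of the header field are assumptions, not anything MediaWiki does.

```python
def pack_sortkey(namespace: int, key: bytes, header: str) -> bytes:
    """First byte: namespace; middle: sort key; last four bytes: UTF-8 header character."""
    if len(header) != 1 or not 0 <= namespace <= 255:
        raise ValueError("need one header character and a one-byte namespace")
    # A single character needs at most four bytes in UTF-8; pad shorter ones.
    head = header.encode("utf-8").ljust(4, b"\x00")
    return bytes([namespace]) + key + head


def unpack_sortkey(packed: bytes):
    """Split a packed key back into (namespace, sort key, header character)."""
    namespace, key, head = packed[0], packed[1:-4], packed[-4:]
    return namespace, key, head.rstrip(b"\x00").decode("utf-8")


packed = pack_sortkey(14, b"apfel", "\u00c4")
print(unpack_sortkey(packed))
```

Because the namespace byte comes first, binary comparison groups entries by
namespace before it ever looks at the sort key in the middle.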


2009-03-11 Thread Gerard Meijssen
Hi,
If you are interested in collation, you may want to look into the CLDR; it
is where the collations are registered per language. There is no such thing
as a universally correct sorting algorithm. NB the CLDR is a Unicode
project.
Thanks,
 GerardM