On Mon, May 12, 2003 at 10:26:30PM +0900, Tomohiro KUBOTA wrote:
> Hi,
>
> From: [EMAIL PROTECTED] (Denis Barbier)
> Subject: Re: enable searching East Asian words at search.debian.org
> Date: Mon, 12 May 2003 13:45:08 +0200
>
> > > For example, I can search an Russian word "Novosti" (of course in
> > > Cyrillic)
> >
> > The point is: how are Cyrillic words passed by the web browser to the
> > search engine?
> > Are they encoded in ISO-8859-5, KOI8-R or UTF-8 charsets?
>
> UTF-8, i.e., the same encoding as the search page.  For example,
> the previous example:
>
> http://search.debian.org/?q=%D0%9D%D0%BE%D0%B2%D0%BE%D1%81%D1%82%D0%B8&ps=10&o=0&m=all&g=
>
> The first 6 bytes read:
>
> %D0%9D -> U+041D (CYRILLIC CAPITAL LETTER EN)
> %D0%BE -> U+043E (CYRILLIC SMALL LETTER O)
> %D0%B2 -> U+0432 (CYRILLIC SMALL LETTER VE)
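
The byte-to-code-point mapping quoted above can be checked mechanically.
Here is a minimal Python sketch, assuming the query value really is
percent-encoded UTF-8; the helper name `describe_query` is made up for
illustration and has nothing to do with the search engine itself.

```python
import unicodedata
from urllib.parse import unquote


def describe_query(q):
    """Percent-decode q as UTF-8 and list (code point, Unicode name) pairs."""
    text = unquote(q, encoding="utf-8", errors="strict")
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The q= value from the URL quoted above.
query = "%D0%9D%D0%BE%D0%B2%D0%BE%D1%81%D1%82%D0%B8"

for code_point, name in describe_query(query):
    print(code_point, name)
```

Each two-byte sequence beginning with %D0 or %D1 decodes to a single
Cyrillic letter, which matches the three lines listed in the quote.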

Hmmm, I tend to disagree.  I tried with the French word for 'election',
which is also 'election' but with the first 'e' being e-acute.
In an ISO-8859-15 environment, I enter this word on search.debian.org,
select the French language, and am redirected to
  http://search.debian.org/?q=%C3%A9lection&ps=10&o=0&m=all&g=fr
which gives 46 pages.  If I now run
  $ export LANG=fr_FR.UTF-8
  $ xterm
go to search.debian.org in this window and cut'n'paste this word from
another window, I am redirected to
  http://search.debian.org/?q=%C3%83%C2%A9lection&ps=10&o=0&m=all&g=fr
which means that e-acute has been converted twice, and no pages are
found.  Am I doing something wrong?

Denis
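
To make "converted twice" concrete: the second URL is what you get when
the UTF-8 bytes C3 A9 for e-acute are misread as two ISO-8859-15
characters and each of those is encoded to UTF-8 again.  Here is a
minimal Python sketch of that assumption; the function names are
invented for illustration, and which component actually performs the
second conversion is not established here.

```python
from urllib.parse import quote, unquote_to_bytes

LEGACY = "iso-8859-15"  # assumed charset of the misreading step


def encode_once(word):
    """Percent-encode the UTF-8 bytes of word (the expected behaviour)."""
    return quote(word.encode("utf-8"))


def encode_twice(word):
    """Misread the UTF-8 bytes as LEGACY text, then encode as UTF-8 again."""
    misread = word.encode("utf-8").decode(LEGACY)
    return quote(misread.encode("utf-8"))


def undo_double_encoding(q):
    """Reverse encode_twice: decode UTF-8, re-encode LEGACY, decode UTF-8."""
    return unquote_to_bytes(q).decode("utf-8").encode(LEGACY).decode("utf-8")


word = "\u00e9lection"  # 'election' with an acute accent on the first 'e'
print(encode_once(word))   # %C3%A9lection
print(encode_twice(word))  # %C3%83%C2%A9lection
print(undo_double_encoding("%C3%83%C2%A9lection") == word)  # True
```

The two printed strings match the q= values in the two URLs above, which
is consistent with a second charset conversion happening somewhere
between the paste and the HTTP request.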