Re: enable searching East Asian words at search.debian.org
On Wed, May 14, 2003 at 11:19:56AM +0900, Tomohiro KUBOTA wrote:
> > Great, ipadic is 3meg and consumes 12meg. I cannot expect people to
> > download that with the default packages. So I'll release mnogo 3.2.10
> > with no chasen. It's broken anyway because it needs some rc files and
> > other things. Now for the website, I'll get the other charsets going
> > and we'll work on the JP problem separately.
>
> Sorry, I don't understand the meaning or feeling of "Great" here.
> Can you explain?

It's not great at all, it's bad. I forget that people might not
understand me, and I should have been clearer.

> However, search.debian.org is a different topic. Since Japanese is
> one of several languages for which the number of translated pages in
> http://www.debian.org/ is more than 50%, it is nonsense to exclude
> these pages from the target of search. I don't understand at all why
> some Debian (and other free-software-related) people tend to exclude
> Japanese and other Asian languages from the range of support. Even
> people who are interested in i18n and translation sometimes tend to
> do so!

Yes, you are right, I am confusing the two issues. I can easily (well,
almost easily) make a special set of binaries for search.d.o, and they
can have little or no bearing on the packages...

That gets us back to the original problem: what is the license for
ipadic? And libchasen is broken.

Why is it broken? It won't work without some files that are not part
of the package. These files are nowhere to be seen, and there is no
documentation on how they are supposed to come about or what format
they are in.

That can all be solved, I'm sure, but it's no use asking the admins to
put libchasen on until this is fixed or a work-around is found.

 - Craig

--
Craig Small VK2XLZ GnuPG:1C1B D893 1418 2AF4 45EE 95CB C76C E5AC 12CA DFA5
Eye-Net Consulting http://www.enc.com.au/ [EMAIL PROTECTED]
MIEEE [EMAIL PROTECTED] Debian developer [EMAIL PROTECTED]
Re: enable searching East Asian words at search.debian.org
Hi,

From: [EMAIL PROTECTED] (Craig Small)
Subject: Re: enable searching East Asian words at search.debian.org
Date: Wed, 14 May 2003 16:19:29 +1000

> Yes, you are right I am confusing the two issues. I can easily, well
> almost easily, make a special search.d.o set of binaries and they can
> have little or no bearing on the packages...

This must be easy, since you are willing to make every Japanese
mnogosearch user do the same thing, and I will agree to that.

> It gets us back to the original problem that what is the license for
> ipadic? And libchasen is broken.

How should this be fixed, Nokubi-san? Does KAKASI or MeCab have an
emulation layer (API) for ChaSen? Or is any alternative to ipadic
already available? mnogosearch seems to use chasen_sparse_tostr() and
chasen_getopt_argv().

> Why is it broken? It won't work without some files that are not part
> of the package. These files are nowhere to be seen and there is no
> documentation on how these files are supposed to come about nor what
> format they are in.

I don't understand why you put it so strongly. Yes, it is a bug.
However, did you document that the Debian mnogosearch package is
compiled with East Asian support removed? That is just as severe.
Anyway, Nokubi-san is the maintainer of the chasen packages and I hope
he will fix this soon.

> That can all be solved I'm sure, but its no use asking admins to put
> libchasen on until this is fixed or a work-around is found.

A work-around: run

    apt-get install libchasen-dev ipadic

instead of

    apt-get install libchasen-dev

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: enable searching East Asian words at search.debian.org
On Tue, May 13, 2003 at 08:33:54AM +0900, Tomohiro KUBOTA wrote: I think there are no problem on this, because Craig has already compiled version 3.2.7 at his home directory to test search.debian.org/new , though his compilation is by eliminating east Asian character mapping tables and without chasen support. The version 3.2.7 was the latest version at that time (December 2002). I'm working on 3.2.10 that will have the charset support. Do I also need to include chasen? Without chasen, mnogosearch will not understand a Japanese word? I'll get something uploaded soon, can you test it for me on a simple set of pages (use the builtin if you like for no db) to see it does work for your pages. If so I'll get it compiled on klecker. - Craig -- Craig Small VK2XLZ GnuPG:1C1B D893 1418 2AF4 45EE 95CB C76C E5AC 12CA DFA5 Eye-Net Consulting http://www.enc.com.au/[EMAIL PROTECTED] MIEEE [EMAIL PROTECTED] Debian developer [EMAIL PROTECTED]
Re: enable searching East Asian words at search.debian.org
On Tue, May 13, 2003 at 03:24:15PM +1000, Craig Small wrote:
> I'm working on 3.2.10 that will have the charset support. Do I also
> need to include chasen? Without chasen, mnogosearch will not
> understand a Japanese word?

*sigh* Now I've got rpath problems. Does anyone have a libtool
lobotomiser to fix this stupid bug?

W: mnogosearch-pgsql: binary-or-shlib-defines-rpath ./usr/lib/libmnogosearch-3.2.so /usr/lib
W: mnogosearch-pgsql: binary-or-shlib-defines-rpath ./usr/lib/libmnogocharset-3.2.so /usr/lib

How I hate libtool.

 - Craig

--
Craig Small VK2XLZ GnuPG:1C1B D893 1418 2AF4 45EE 95CB C76C E5AC 12CA DFA5
Eye-Net Consulting http://www.enc.com.au/ [EMAIL PROTECTED]
MIEEE [EMAIL PROTECTED] Debian developer [EMAIL PROTECTED]
Re: enable searching East Asian words at search.debian.org
At Tue, 13 May 2003 15:24:15 +1000, Craig Small wrote:
> I'm working on 3.2.10 that will have the charset support. Do I also
> need to include chasen? Without chasen, mnogosearch will not
> understand a Japanese word?

In general, Japanese text does not put space characters at word
boundaries, so you need morphological analysis to split it into words.

By the way, there is another issue with ChaSen. Its dictionary (ipadic)
has its license written in two languages. The license written in
Japanese is DFSG-free, but the other one, written in English, is
questionable:

http://lists.debian.org/debian-legal/2001/debian-legal-200104/msg00062.html

(I think it is a gray area...)

I tried to ask upstream to change the English license, but the
licensor, a governmental organization, had already been dissolved, so
I couldn't.

Now I'm trying to make another DFSG-free dictionary for ChaSen. If I
can do it, I'll move the ipadic package to non-free and ITP the new
one.

Another solution is to use libkakasi instead of libchasen. It is
completely free.

--
NOKUBI Takatsugu
E-mail: [EMAIL PROTECTED]
        [EMAIL PROTECTED] / [EMAIL PROTECTED]
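To make the point about word boundaries concrete, here is a minimal Python sketch. The sample sentence and both helper functions are illustrative assumptions, not anything taken from mnogosearch, and the character-bigram fallback is a generic stand-in technique; it is not how ChaSen works, since ChaSen relies on a dictionary.

```python
def whitespace_tokens(text):
    """Split on whitespace, which is enough to find English words."""
    return text.split()


def char_bigrams(text):
    """Overlapping two-character windows: a crude, dictionary-free fallback."""
    compact = "".join(text.split())
    return [compact[i:i + 2] for i in range(len(compact) - 1)]


english = "search the documentation"
# Hypothetical sample meaning roughly "search the documents".
japanese = "\u6587\u66f8\u3092\u691c\u7d22\u3059\u308b"

print(whitespace_tokens(english))   # three separate words
print(whitespace_tokens(japanese))  # the whole sentence comes back as one token
print(char_bigrams(japanese))       # overlapping pairs an indexer could store
```

An indexer that only splits on whitespace would store the whole Japanese sentence as a single "word", so a query for one word inside it would not match. Bigrams avoid that at the cost of noisier matches, which is why a real morphological analyzer is preferred in this thread.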
Re: enable searching East Asian words at search.debian.org
Hi,

From: [EMAIL PROTECTED] (Craig Small)
Subject: Re: enable searching East Asian words at search.debian.org
Date: Tue, 13 May 2003 15:24:15 +1000

> I'm working on 3.2.10 that will have the charset support. Do I also
> need to include chasen? Without chasen, mnogosearch will not
> understand a Japanese word?

You are right. Also, --with-extra-charsets=all is needed (if the
default configuration of 3.2.10 omits the mapping tables for CJK, just
as 3.2.8 does).

> I'll get something uploaded soon, can you test it for me on a simple
> set of pages (use the builtin if you like for no db) to see it does
> work for your pages. If so I'll get it compiled on klecker.

Yes, I will, of course. You mean that you will compile 3.2.10 and set
up a test search page, and then I will test searching for CJK words?

However, I tested the builtin database and it didn't work well. I
didn't investigate builtin any further because it won't be used on the
real search page. So, if you'd like to test builtin, please check that
your new build works well for English words. Then I will test various
languages including Chinese, Japanese, and Korean.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: enable searching East Asian words at search.debian.org
Hi,

From: [EMAIL PROTECTED]
Subject: Re: enable searching East Asian words at search.debian.org
Date: Tue, 13 May 2003 15:32:49 +0900

> Now I'm trying to make another DFSG-free dictionary for ChaSen. If I
> can do it, I'll move ipadic package to non-free and ITP the new one.

Any prospects?

> The another solution is to use libkakasi instead libchasen. It is
> completely free.

First, I imagine chasen is much better than kakasi, because chasen
analyzes the grammar of Japanese sentences while kakasi doesn't. Which
is better, chasen without a dictionary (chasen itself is free) or
kakasi? Or does chasen *need* a dictionary (though libchasen0 doesn't
Depends: on ipadic)?

Second, can we use kakasi with mnogosearch?

If we don't have a solution, how about writing "please use google for
searching CJK words in the Debian site" at http://search.debian.org/
and admitting that free software cannot yet substitute for proprietary
software?

Finally, which solution do you suggest? Should we wait for a free
alternative to ipadic? Or should ipadic be regarded as free? Or can we
use kakasi? Or should we recognize that there is no free
implementation of web search which supports languages including
Japanese?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: enable searching East Asian words at search.debian.org
Denis Barbier:
> which means that e-acute has been converted twice, and no pages are
> found. Am I doing something wrong?

Your browser is most probably buggy; try upgrading it to the latest
version and try again. A form on a page which does not declare an
accept-charset attribute on the form tag is supposed to use the
encoding of the page the form was on when submitting it.

Or did you send it from the command line with unencoded non-ASCII
characters? If so, there is no context in which to evaluate the form
data, so anything might happen.

--
\\// Peter - http://www.softwolves.pp.se/
I do not read or respond to mail with HTML attachments.
Re: enable searching East Asian words at search.debian.org
On Tue, May 13, 2003 at 10:02:15AM +0200, Peter Karlsson wrote:
> Denis Barbier:
> > which means that e-acute has been converted twice, and no pages are
> > found. Am I doing something wrong?
>
> Your browser is most probably buggy, try upgrading it to the latest
> version and try again. A form on a page which does not declare an
> accept-charset attribute on the form tag is supposed to use the
> encoding from the page the form was on for submitting it.

I am on sid and first tested with Lynx. I have now performed other
tests; here are my results. The table shows how e-acute (0xe9 in
latin1 encoding) is escaped in the q= part of the results URL.

 ----------------+-------------+--------------
  browser \ env. | iso-8859-15 | utf-8
 ----------------+-------------+--------------
  lynx           | %C3%A9      | %C3%83%C2%A9
  w3m-en         | %E9         | %C3%A9
  mozilla        | %C3%A9      | %C3%A9
  konqueror      | %C3%A9      | %C3%A9
 ----------------+-------------+--------------

So my problems seem to be related to text browsers. In the UTF-8
environment, the xterm window nicely displays UTF-8 encoded files, I
cut'n'paste the French word, and it appears fine in the browser.

> Or did you send it from the command line with unencoded non-ASCII
> characters?

No.

Denis
Re: enable searching East Asian words at search.debian.org
At Tue, 13 May 2003 15:54:01 +0900 (JST), Tomohiro KUBOTA wrote:
> > Now I'm trying to make another DFSG-free dictionary for ChaSen. If I
> > can do it, I'll move ipadic package to non-free and ITP the new one.
>
> Any prospects?

In fact, it has already been done by Tsuchiya-san. Now I'm trying to
set up a development environment (CVS, web, et al.). It is based on
pubdic, which is usually used with Canna and FreeWnn.

> First, I imagine chasen is much better than kakasi, because chasen
> analyzes the grammar of Japanese sentences while kakasi doesn't.
> Which is better, chasen without a dictionary (chasen itself is free)
> or kakasi? Or does chasen *need* a dictionary (though libchasen0
> doesn't Depends: on ipadic)?

Oops, that's a bug in libchasen. ChaSen (or libchasen) does need its
dictionary.

> Second, can we use kakasi with mnogosearch?

It could be done. It doesn't seem so hard. If I have time, I'll try to
make a patch.

> Finally, which solution do you suggest? Should we wait for a free
> alternative to ipadic? Or should ipadic be regarded as free? Or can
> we use kakasi? Or should we recognize that there is no free
> implementation of web search which supports languages including
> Japanese?

My other reason for making a free dictionary is to ITP MeCab, which is
yet another morphological analyzer. Its author says that it is faster
than ChaSen or KAKASI, so I think the best way is to use it.

--
NOKUBI Takatsugu
E-mail: [EMAIL PROTECTED]
        [EMAIL PROTECTED] / [EMAIL PROTECTED]
Re: enable searching East Asian words at search.debian.org
Denis Barbier:
> I am on sid and first tested with Lynx.

Lynx is buggy; I reported it some time ago. It has been fixed in the
upstream release, though. See bug 156680 (http://bugs.debian.org/156680).

> So my problems seems related to text browsers.

Yeah, most of the text browsers unfortunately seem to have severe
problems with encoding support.

--
\\// Peter - http://www.softwolves.pp.se/
I do not read or respond to mail with HTML attachments.
Re: enable searching East Asian words at search.debian.org
On Tue, May 13, 2003 at 01:03:41PM +0200, Peter Karlsson wrote:
> Denis Barbier:
> > I am on sid and first tested with Lynx.
>
> Lynx is buggy, I reported it some time ago. It has been fixed in the
> upstream release, though. See bug 156680 (http://bugs.debian.org/156680).

In this bug report you say that lynx-cur is right, but I get similar
results with lynx-cur 2.8.5-10.

I investigated further and found http://koi8.pp.ru/htmlforms.html
Should search.d.o follow these instructions, or are they wrong?

Denis
Re: enable searching East Asian words at search.debian.org
On Mon, May 12, 2003 at 11:55:03PM +0200, Denis Barbier wrote: However, there will be no results for Japanese words because of problems I wrote. Yes, I am pretty sure that Josip was investigating these problems when he sent his mail and will implement your solution. Actually, I was not. There are only 24 hours in a day and I can't set aside enough time to handle this, _too_. *shrug* Tomohiro KUBOTA worked hard to find a solution, and he needs libchasen-dev, libchasen0, and ipadic packages from stable to be installed on klecker. It should not take too much time ;) And you're telling me this because...? I can't install those packages on klecker! -- 2. That which causes joy or happiness.
Re: enable searching East Asian words at search.debian.org
On Tue, May 13, 2003 at 04:05:27PM +1000, Craig Small wrote:
> On Tue, May 13, 2003 at 03:24:15PM +1000, Craig Small wrote:
> > I'm working on 3.2.10 that will have the charset support. Do I also
> > need to include chasen? Without chasen, mnogosearch will not
> > understand a Japanese word?
>
> *sigh* Now I've got rpath problems, anyone have a libtool lobotomiser
> to fix this stupid bug?
>
> W: mnogosearch-pgsql: binary-or-shlib-defines-rpath ./usr/lib/libmnogosearch-3.2.so /usr/lib
> W: mnogosearch-pgsql: binary-or-shlib-defines-rpath ./usr/lib/libmnogocharset-3.2.so /usr/lib
>
> How i hate libtool

Mnogosearch was libtoolized with an old version of libtool. You can
rerun the bootstrap script with newer auto* tools from unstable.

Denis
Re: enable searching East Asian words at search.debian.org
Denis Barbier:
> In this bug report you say that lynx-cur is right, but I get similar
> results with lynx-cur 2.8.5-10.

It worked for me in my tests. When exactly did it go wrong for you?

> I investigated further and found http://koi8.pp.ru/htmlforms.html
> Should search.d.o follow these instructions, or are they wrong?

As I wrote earlier, specifying accept-charset is a very good idea, and
you have to specify it if the source document is in another encoding.
I don't think all browsers support it, though, so it is also best to
serve the document containing the form in the same encoding.

--
\\// Peter - http://www.softwolves.pp.se/
I do not read or respond to mail with HTML attachments.
Re: enable searching East Asian words at search.debian.org
Hi,

From: [EMAIL PROTECTED] (Denis Barbier)
Subject: Re: enable searching East Asian words at search.debian.org
Date: Tue, 13 May 2003 14:12:00 +0200

> In this bug report you say that lynx-cur is right, but I get similar
> results with lynx-cur 2.8.5-10.

I tested lynx and lynx-cur and found that both of them are
problematic. I tested them on mlterm and on xterm in UTF-8 mode, in
the ja_JP.UTF-8 locale. I searched for the Russian word for "News".
Although the search itself seems to work, all Cyrillic characters are
displayed in Latin-alphabet transliteration. I imagine these browsers
are not locale-aware.

Please test w3mmee. It should work well.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: enable searching East Asian words at search.debian.org
On Wed, May 14, 2003 at 01:15:18AM +0900, Tomohiro KUBOTA wrote:
> > In this bug report you say that lynx-cur is right, but I get similar
> > results with lynx-cur 2.8.5-10.
>
> I tested lynx and lynx-cur and found that both of them are
> problematic. I tested them on mlterm and on xterm in UTF-8 mode, in
> the ja_JP.UTF-8 locale. I searched for the Russian word for "News".
> Although the search itself seems to work, all Cyrillic characters are
> displayed in Latin-alphabet transliteration. I imagine these browsers
> are not locale-aware.

Thanks to Peter's explanations, I now understand that this problem is
related to browsers, and not to search.d.o. I investigated some more
and found that Lynx does not get the charset from my locale, so I have
to set it manually by pressing the 'o' key. It then seems to work fine.

> Please test w3mmee. It should work well.

Indeed, it works fine. Looks like a smart browser. It automatically
sets the charset in a UTF-8 locale, but not in a legacy encoding. Do
you know why? This is quite annoying when performing such tests ;)

In conclusion, I could not make links and w3m work, but lynx and
w3mmee are fine if well configured. This was interesting, thanks to
all of you. Now we can go back to the mnogosearch issue ;)

Denis
Re: enable searching East Asian words at search.debian.org
On Tue, May 13, 2003 at 03:32:49PM +0900, [EMAIL PROTECTED] wrote:
> By the way, there is another issue with ChaSen. Its dictionary
> (ipadic) has its license written in two languages. The license
> written in Japanese is DFSG-free, but the other one, written in
> English, is questionable.

Great, ipadic is 3meg and consumes 12meg. I cannot expect people to
download that with the default packages. So I'll release mnogo 3.2.10
with no chasen. It's broken anyway because it needs some rc files and
other things.

Now for the website, I'll get the other charsets going and we'll work
on the JP problem separately.

 - Craig

--
Craig Small VK2XLZ GnuPG:1C1B D893 1418 2AF4 45EE 95CB C76C E5AC 12CA DFA5
Eye-Net Consulting http://www.enc.com.au/ [EMAIL PROTECTED]
MIEEE [EMAIL PROTECTED] Debian developer [EMAIL PROTECTED]
Re: enable searching East Asian words at search.debian.org
Hi,

From: [EMAIL PROTECTED] (Craig Small)
Subject: Re: enable searching East Asian words at search.debian.org
Date: Wed, 14 May 2003 11:36:28 +1000

> Great, ipadic is 3meg and consumes 12meg. I cannot expect people to
> download that with the default packages. So I'll release mnogo 3.2.10
> with no chasen. It's broken anyway because it needs some rc files and
> other things. Now for the website, I'll get the other charsets going
> and we'll work on the JP problem separately.

Sorry, I don't understand the meaning or feeling of "Great" here.
Can you explain?

You are confusing two different aspects: one is providing the Debian
mnogosearch packages and the other is how search.debian.org is built.

I accept that Japanese users cannot use the Debian mnogosearch package
as shipped and are forced to recompile it, in order to save megabytes
of disk space for people who don't need Japanese. (Please put
instructions for recompiling in README.Debian.)

However, search.debian.org is a different topic. Since Japanese is one
of several languages for which the number of translated pages in
http://www.debian.org/ is more than 50%, it is nonsense to exclude
these pages from the target of search. I don't understand at all why
some Debian (and other free-software-related) people tend to exclude
Japanese and other Asian languages from the range of support. Even
people who are interested in i18n and translation sometimes tend to
do so!

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: enable searching East Asian words at search.debian.org
On Mon, May 12, 2003 at 08:06:32AM +0900, Tomohiro KUBOTA wrote:
[...]
> However, if you would like to supply translated search pages (though
> I think it is not an urgent problem), I just read the
> webwml/english/searchtmpl/Makefile and found that
> `grep CHARSET ../.wmlrc` might have a problem.
> webwml/japanese/.wmlrc have two lines which matches 'grep CHARSET',
> which are '-D CHARSET=iso-2022-jp' and '-D CHARSET_WML=euc-jp'.

Yes, I fixed it yesterday.

My understanding of Josip's mail is that, while looking into your
instructions about mnogosearch, he wondered how input text has to be
encoded when filling in the search form. This is a good question: the
search page should say which encoding to use when searching for
non-English words.

Denis
Re: enable searching East Asian words at search.debian.org
Hi,

From: [EMAIL PROTECTED] (Denis Barbier)
Subject: Re: enable searching East Asian words at search.debian.org
Date: Mon, 12 May 2003 09:54:58 +0200

> My understanding of Josip mail is that when investigating your
> instructions about mnogosearch, he wondered how input text has to be
> encoded when filling search form. This is a good question, search
> page should tell which encoding to use when searching for non-English
> words.

Yes, I know. The solution is to write the search page in UTF-8, which
has been in place since last December, when Craig and I discussed this
problem.

For example, I can search for the Russian word "Novosti" (of course in
Cyrillic), which means "News", from the English page at
http://search.debian.org/ like this:

http://search.debian.org/?q=%D0%9D%D0%BE%D0%B2%D0%BE%D1%81%D1%82%D0%B8&ps=10&o=0&m=all&g=

and the page shows 112 results.

I can also input Japanese words. However, there will be no results for
Japanese words because of the problems I described.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: enable searching East Asian words at search.debian.org
Hi,

From: [EMAIL PROTECTED] (Denis Barbier)
Subject: Re: enable searching East Asian words at search.debian.org
Date: Mon, 12 May 2003 13:45:08 +0200

> > For example, I can search an Russian word Novosti (of course in
> > Cyrillic)
>
> The point is: how are Cyrillic words passed by the web browser to the
> search engine? Are they encoded in ISO-8859-5, KOI8-R or UTF-8
> charsets?

UTF-8, i.e., the same encoding as the search page. Take the previous
example:

http://search.debian.org/?q=%D0%9D%D0%BE%D0%B2%D0%BE%D1%81%D1%82%D0%B8&ps=10&o=0&m=all&g=

The first 6 bytes read:

  %D0%9D - U+041D (CYRILLIC CAPITAL LETTER EN)
  %D0%BE - U+043E (CYRILLIC SMALL LETTER O)
  %D0%B2 - U+0432 (CYRILLIC SMALL LETTER VE)

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
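The byte-by-byte reading above can be checked with a short Python sketch. It only uses the standard library; the variable names are illustrative, and nothing here is specific to mnogosearch or to how search.debian.org handles the query.

```python
import unicodedata
from urllib.parse import quote, unquote

# "Novosti" ("News") in Cyrillic, written with escapes to avoid font issues.
word = "\u041d\u043e\u0432\u043e\u0441\u0442\u0438"

# Percent-encode the UTF-8 bytes, as a browser does for a UTF-8 form.
encoded = quote(word, encoding="utf-8")
print(encoded)

# Decode the first six bytes (three characters) and name each code point.
for ch in unquote("%D0%9D%D0%BE%D0%B2", encoding="utf-8"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Each Cyrillic letter takes two bytes in UTF-8, which is why the seven-letter word becomes fourteen percent-escaped bytes in the q= parameter.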
Re: enable searching East Asian words at search.debian.org
On Mon, May 12, 2003 at 10:26:30PM +0900, Tomohiro KUBOTA wrote:
> > > For example, I can search an Russian word Novosti (of course in
> > > Cyrillic)
> >
> > The point is: how are Cyrillic words passed by the web browser to
> > the search engine? Are they encoded in ISO-8859-5, KOI8-R or UTF-8
> > charsets?
>
> UTF-8, i.e., the same encoding as the search page. Take the previous
> example:
>
> http://search.debian.org/?q=%D0%9D%D0%BE%D0%B2%D0%BE%D1%81%D1%82%D0%B8&ps=10&o=0&m=all&g=
>
> The first 6 bytes read:
>
>   %D0%9D - U+041D (CYRILLIC CAPITAL LETTER EN)
>   %D0%BE - U+043E (CYRILLIC SMALL LETTER O)
>   %D0%B2 - U+0432 (CYRILLIC SMALL LETTER VE)

Hmmm, I tend to disagree. I tried with the French word for 'election',
which is also 'election' but with the first 'e' being e-acute. In an
ISO-8859-15 environment, I enter this word on search.debian.org and
select the French language, and am redirected to

http://search.debian.org/?q=%C3%A9lection&ps=10&o=0&m=all&g=fr

which gives 46 pages. If I now run

  $ export LANG=fr_FR.UTF-8
  $ xterm

go to search.debian.org in this window and cut'n'paste this word from
another window, I am redirected to

http://search.debian.org/?q=%C3%83%C2%A9lection&ps=10&o=0&m=all&g=fr

which means that e-acute has been converted twice, and no pages are
found. Am I doing something wrong?

Denis
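The "converted twice" escape can be reproduced with a small Python sketch. It assumes the faulty step is UTF-8 bytes being misread as Latin-1 and then encoded to UTF-8 again; the thread does not establish where in the browser or terminal that happens, so this only shows that the assumption yields the same bytes.

```python
from urllib.parse import quote

word = "\u00e9lection"  # "election" with a leading e-acute

# What a correctly behaving browser sends for a UTF-8 search page.
correct = quote(word.encode("utf-8"))

# What is sent if the legacy single-byte encoding is used unchanged.
legacy = quote(word.encode("iso-8859-15"))

# Assumed bug: the UTF-8 bytes are treated as Latin-1 text and re-encoded.
twice = quote(word.encode("utf-8").decode("iso-8859-1").encode("utf-8"))

print(correct)
print(legacy)
print(twice)
```

The three results correspond to the three kinds of escapes reported for e-acute in this thread: the expected two-byte form, the single-byte legacy form, and the four-byte doubly encoded form that finds no pages.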
Re: enable searching East Asian words at search.debian.org
On Mon, May 12, 2003 at 01:45:08PM +0200, Denis Barbier wrote: However, there will be no results for Japanese words because of problems I wrote. Yes, I am pretty sure that Josip was investigating these problems when he sent his mail and will implement your solution. Actually, I was not. There are only 24 hours in a day and I can't set aside enough time to handle this, _too_. *shrug* -- 2. That which causes joy or happiness.
Re: enable searching East Asian words at search.debian.org
On Mon, May 12, 2003 at 11:07:32PM +0200, Josip Rodin wrote: On Mon, May 12, 2003 at 01:45:08PM +0200, Denis Barbier wrote: However, there will be no results for Japanese words because of problems I wrote. Yes, I am pretty sure that Josip was investigating these problems when he sent his mail and will implement your solution. Actually, I was not. There are only 24 hours in a day and I can't set aside enough time to handle this, _too_. *shrug* Tomohiro KUBOTA worked hard to find a solution, and he needs libchasen-dev, libchasen0, and ipadic packages from stable to be installed on klecker. It should not take too much time ;) When done, mnogosearch from unstable has to be recompiled, I can volunteer to provide a backport if that helps. Denis
Re: enable searching East Asian words at search.debian.org
Hi, From: [EMAIL PROTECTED] (Denis Barbier) Subject: Re: enable searching East Asian words at search.debian.org Date: Mon, 12 May 2003 23:55:03 +0200 When done, mnogosearch from unstable has to be recompiled, I can volunteer to provide a backport if that helps. I think there are no problem on this, because Craig has already compiled version 3.2.7 at his home directory to test search.debian.org/new , though his compilation is by eliminating east Asian character mapping tables and without chasen support. The version 3.2.7 was the latest version at that time (December 2002). --- Tomohiro KUBOTA [EMAIL PROTECTED] http://www.debian.or.jp/~kubota/
Re: enable searching East Asian words at search.debian.org
Hi,

From: [EMAIL PROTECTED] (Denis Barbier)
Subject: Re: enable searching East Asian words at search.debian.org
Date: Mon, 12 May 2003 19:09:54 +0200

> If now I run
>   $ export LANG=fr_FR.UTF-8
>   $ xterm
> go to search.debian.org in this window and cut'n'paste this word from
> another window, I am redirected to
> http://search.debian.org/?q=%C3%83%C2%A9lection&ps=10&o=0&m=all&g=fr
> which means that e-acute has been converted twice, and no pages are
> found. Am I doing something wrong?

There might be two problems. One is whether cut'n'paste works
correctly, and the other is whether the browser handles encoding
conversion correctly.

I tested with galeon and w3mmee. (w3m doesn't support UTF-8.) Internet
Explorer on Windows also works well.

I'd like to reproduce what you did. Which browser did you use?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: enable searching East Asian words at search.debian.org
On Mon, May 05, 2003 at 07:47:11PM +0900, Tomohiro KUBOTA wrote:
> No reply for more than one week. Someone please reply.

BTW on related note, I noticed:

make: Entering directory `/org/www.debian.org/webwml/japanese/searchtmpl'
wml -q -D CUR_YEAR=2003 -o UNDEFuJA:[EMAIL PROTECTED] --prolog=/usr/bin/kcc -e - --epilog=../convert search.ja.html search.wml
c=`grep CHARSET ../.wmlrc | cut -d= -f2`; \
iconv -f $c -t UTF-8 search.ja.html | perl -pe 's,^(\s*meta http-equiv=Content-Type content=text/html; charset=)\S+()$,$1UTF-8$2,' search.ja.html
iconv: cannot open input file `euc-jp': No such file or directory
copying search.ja.html to ../../../www/searchtmpl
make: Leaving directory `/org/www.debian.org/webwml/japanese/searchtmpl'

I suppose even if the page isn't coded properly, the Japanese
characters would still be input in UTF-8 (presuming the perl line did
its job) so you're back to square one...?

In any event, searchtmpl/Makefile needs some adjustments to
accommodate for the pre-conversion stuff.

--
2. That which causes joy or happiness.
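For readers who want to see what the iconv and perl steps in the log are trying to achieve, here is a rough Python sketch of the same idea: re-encode a page to UTF-8 and rewrite its declared charset. The function name, the regular expression, and the assumed shape of the meta tag (a conventional `<meta http-equiv="Content-Type" ...>` line) are all assumptions for illustration; this is not the Makefile's actual rule.

```python
import re

# Assumed meta tag shape; the real template may differ.
META_RE = re.compile(
    r'(<meta\s+http-equiv="Content-Type"\s+content="text/html;\s*charset=)'
    r'[^"\s]+(")',
    re.IGNORECASE,
)


def to_utf8_page(raw: bytes, source_charset: str) -> bytes:
    """Decode raw from source_charset, declare UTF-8, and re-encode as UTF-8."""
    text = raw.decode(source_charset)
    text = META_RE.sub(r"\g<1>UTF-8\g<2>", text)
    return text.encode("utf-8")


# Hypothetical input page in EUC-JP containing the word for "search".
page = (
    '<meta http-equiv="Content-Type" content="text/html; charset=euc-jp">\n'
    "<p>\u691c\u7d22</p>\n"
).encode("euc_jp")

converted = to_utf8_page(page, "euc_jp")
print(converted.decode("utf-8"))
```

Doing both steps on the decoded text keeps the declared charset and the actual bytes in agreement, which is the property the search form depends on.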
Re: enable searching East Asian words at search.debian.org
Hi,

From: Josip Rodin [EMAIL PROTECTED]
Subject: Re: enable searching East Asian words at search.debian.org
Date: Sun, 11 May 2003 14:33:38 +0200

> make: Entering directory `/org/www.debian.org/webwml/japanese/searchtmpl'
> wml -q -D CUR_YEAR=2003 -o UNDEFuJA:[EMAIL PROTECTED] --prolog=/usr/bin/kcc -e - --epilog=../convert search.ja.html search.wml
> c=`grep CHARSET ../.wmlrc | cut -d= -f2`; \
> iconv -f $c -t UTF-8 search.ja.html | perl -pe 's,^(\s*meta http-equiv=Content-Type content=text/html; charset=)\S+()$,$1UTF-8$2,' search.ja.html
> iconv: cannot open input file `euc-jp': No such file or directory
> copying search.ja.html to ../../../www/searchtmpl
> make: Leaving directory `/org/www.debian.org/webwml/japanese/searchtmpl'

Sorry, I don't understand what you are doing. However, my improvement
is not related to search.ja.html (or to translation of the search
page) at all.

My intention is to enable searching for, say, "Bunsho" (in Kanji),
which means "documentation" in Japanese, on the search page. That
should be possible, because there are many Japanese-translated pages
on the Debian site and these pages should be targets of searching. It
is not about translating the search page.

(I guess you are trying to prepare a Japanese translation of the
search page? I will look into this point later. However, please note
that, for Japanese people, a search page in English which can search
for Japanese words is absolutely better than a search page in Japanese
which cannot.)

The problems are:

(1) Though mnogosearch is based on UTF-8 (and should be able to
    process all the languages the Debian web pages are translated
    into), support for CJK languages is disabled. (Please read the
    ./configure --help output or the installation instructions of
    mnogosearch.) The default simply drops the character code mapping
    tables between the CJK encodings and UTF-8. This is why
    recompilation of mnogosearch is needed.

(2) Japanese and Chinese don't use whitespace between words, so
    indexing (i.e., reading all web pages and storing all words in the
    database for searching) doesn't work well. The chasen-related
    packages are needed to fix this. (I hope you read my mails in
    which I wrote that chasen is needed; please just go back through
    this thread.)

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: enable searching East Asian words at search.debian.org
On Sun, May 11, 2003 at 11:09:44PM +0900, Tomohiro KUBOTA wrote: c=`grep CHARSET ../.wmlrc | cut -d= -f2`; \ iconv -f $c -t UTF-8 search.ja.html | perl -pe 's,^(\s*meta http-equiv=Content-Type content=text/html; charset=)\S+()$,$1UTF-8$2,' search.ja.html iconv: cannot open input file `euc-jp': No such file or directory Sorry I don't understand what you are doing. However, my improvement is not related to search.ja.html (or translation of search page) at all. Well, it's related if you want people to be able to actually input stuff properly into the search engine. :) -- 2. That which causes joy or happiness.
Re: enable searching East Asian words at search.debian.org
Hi,

From: Josip Rodin [EMAIL PROTECTED]
Subject: Re: enable searching East Asian words at search.debian.org
Date: Sun, 11 May 2003 19:44:17 +0200

> On Sun, May 11, 2003 at 11:09:44PM +0900, Tomohiro KUBOTA wrote:
> > > c=`grep CHARSET ../.wmlrc | cut -d= -f2`; \
> > > iconv -f $c -t UTF-8 search.ja.html | perl -pe 's,^(\s*meta http-equiv=Content-Type content=text/html; charset=)\S+()$,$1UTF-8$2,' search.ja.html
> > > iconv: cannot open input file `euc-jp': No such file or directory
> >
> > Sorry I don't understand what you are doing. However, my
> > improvement is not related to search.ja.html (or translation of
> > search page) at all.
>
> Well, it's related if you want people to be able to actually input
> stuff properly into the search engine. :)

OK, I remember now. The search web page must be UTF-8. The current
(English) version of the search page is already UTF-8 and has no
problem with international searches, I think.

However, if you would like to supply translated search pages (though I
think it is not an urgent problem): I just read
webwml/english/searchtmpl/Makefile and found that
`grep CHARSET ../.wmlrc` might have a problem. webwml/japanese/.wmlrc
has two lines which match 'grep CHARSET', namely
'-D CHARSET=iso-2022-jp' and '-D CHARSET_WML=euc-jp'.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
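A likely reading of the iconv error is that `$c` expands to both values, so iconv takes the second one (`euc-jp`) as an input file name. The Python sketch below shows one way to pick a single variable by exact name instead of by substring. The helper name `wmlrc_define` is hypothetical, the two sample lines are the ones quoted in this thread, and which of the two variables the Makefile really needs is not stated here.

```python
import re


def wmlrc_define(wmlrc_text: str, name: str) -> str:
    """Return the value of an exact '-D NAME=value' entry in .wmlrc text."""
    # Requiring '=' right after the name stops CHARSET matching CHARSET_WML.
    pattern = re.compile(r"-D\s+" + re.escape(name) + r"=(\S+)")
    for line in wmlrc_text.splitlines():
        match = pattern.search(line)
        if match:
            return match.group(1)
    raise ValueError(f"no '-D {name}=' entry found")


# The two matching lines reported for webwml/japanese/.wmlrc.
sample = "-D CHARSET=iso-2022-jp\n-D CHARSET_WML=euc-jp\n"

print(wmlrc_define(sample, "CHARSET"))
print(wmlrc_define(sample, "CHARSET_WML"))
```

Whatever tool is used, the key design point is to anchor the match on the full variable name so that exactly one charset reaches the conversion step.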
Re: enable searching East Asian words at search.debian.org
Hi,

No reply for more than one week. Someone please reply.

There are Chinese, Japanese, and Korean translations of www.debian.org,
but search.debian.org cannot search for words in these languages.

Please do the following:

1. Install the libchasen-dev, libchasen0, and ipadic packages on
   klecker.
2. Add me ([EMAIL PROTECTED]) as a user of the postgresql database on
   klecker.
3. Create a postgresql database on klecker for which I have write
   permission.

Then I can demonstrate the improvement (or bugfix, as I regard it)
described in my last mail, which I cite in full below.

From: Tomohiro KUBOTA [EMAIL PROTECTED]
Subject: enable searching East Asian words at search.debian.org
Date: Sat, 26 Apr 2003 09:45:48 +0900 (JST)

> Hi,
>
> So far search.debian.org doesn't support East Asian languages
> (Chinese, Japanese, and Korean). I.e., it cannot search for Chinese,
> Japanese, or Korean words.
>
> I have recently researched this problem and I think I have found how
> to fix it. I tested on my personal machine, which has no 24-hour
> internet connection, and it works almost fine.
>
> 1. Install the libchasen-dev, libchasen0, and ipadic packages.
> 2. Recompile mnogosearch (version 3.2.8 or later) with the
>    --enable-chasen --with-extra-charsets=all options for ./configure.
> 3. Invoke indexer -C and then indexer to rebuild the search database.
>
> Could someone do this? Or can I have (write) access to a postgresql
> database on klecker to prove this?
>
> Explanation:
>
> The chasen packages are needed to extract words from Japanese texts.
> Japanese texts don't use whitespace between words. The
> --enable-chasen option (available since version 3.2.8) lets
> mnogosearch use chasen.
>
> Though mnogosearch is Unicode-based software and potentially supports
> East Asian languages, support for these languages is disabled by
> default. To enable it, --with-extra-charsets=all is needed.
>
> Since the current search database at search.debian.org doesn't
> contain any East Asian words, the whole database needs to be rebuilt.
> (Of course it would be enough to rebuild the database only for the
> *.{ja,ko,zh-cn,zh-hk,zh-tw}.html pages, but I don't know whether that
> is possible.)

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/