Re: enable searching East Asian words at search.debian.org

2003-05-14 Thread Craig Small
On Wed, May 14, 2003 at 11:19:56AM +0900, Tomohiro KUBOTA wrote:
  Great, ipadic is 3meg and consumes 12meg.  I cannot expect people to
  download that with the default packages.
  
  So I'll release mnogo 3.2.10 with no chasen. It's broken anyway because
  it needs some rc files and other things.
 
  Now for the website, I'll get the other charsets going and we'll work
  on the JP problem separately.
 
 Sorry, I don't understand the meaning or feeling of "Great" here.
 Can you explain?
It's not great at all, it's bad.  I forget people might not understand me
and should have been clearer.

 However, search.debian.org is a different topic.  Since Japanese
 is one of several languages for which the number of translated pages in
 http://www.debian.org/ is more than 50%, it is nonsense to exclude
 these pages from the target of search.
 
 I don't understand at all why some of Debian (and other free-software-
 related) people tend to exclude Japanese and other Asian languages
 from the range of support.  Even people who are interested in
 i18n and translation sometimes tend to do so!

Yes, you are right I am confusing the two issues.  I can easily, well
almost easily, make a special search.d.o set of binaries and they can
have little or no bearing on the packages...

It gets us back to the original problem: what is the license for
ipadic?  And libchasen is broken.

Why is it broken?
It won't work without some files that are not part of the package.
These files are nowhere to be seen and there is no documentation on
how these files are supposed to come about nor what format they are in.

That can all be solved I'm sure, but it's no use asking admins to put
libchasen on until this is fixed or a work-around is found.

  - Craig

-- 
Craig Small VK2XLZ  GnuPG:1C1B D893 1418 2AF4 45EE  95CB C76C E5AC 12CA DFA5
Eye-Net Consulting http://www.enc.com.au/[EMAIL PROTECTED]
MIEEE [EMAIL PROTECTED] Debian developer [EMAIL PROTECTED]



Re: enable searching East Asian words at search.debian.org

2003-05-14 Thread Tomohiro KUBOTA
Hi,

From: [EMAIL PROTECTED] (Craig Small)
Subject: Re: enable searching East Asian words at search.debian.org
Date: Wed, 14 May 2003 16:19:29 +1000

 Yes, you are right I am confusing the two issues.  I can easily, well
 almost easily, make a special search.d.o set of binaries and they can
 have little or no bearing on the packages...

This must be easy, because you are willing to force all Japanese
mnogosearch users to do this and I will agree on it.


 It gets us back to the original problem that what is the license for
 ipadic?  And libchasen is broken.

How should this be fixed, Nokubi-san?  Does KAKASI or MeCab have an
emulation layer (API) for ChaSen?  Or are any alternatives to ipadic
already available?  mnogosearch seems to use chasen_sparse_tostr()
and chasen_getopt_argv().


 Why is it broken?
 It won't work without some files that are not part of the package.
 These files are nowhere to be seen and there is no documentation on
 how these files are supposed to come about nor what format they are in.

I don't understand why you put it so strongly.  Yes, it is a bug.  However,
did you document that the Debian mnogosearch package is compiled
without East Asian support?  This is just as severe as that.

Anyway, Nokubi-san is a maintainer of chasen packages and I hope he
will fix this soon.


 That can all be solved I'm sure, but its no use asking admins to put
 libchasen on until this is fixed or a work-around is found.

A work-around.  apt-get install libchasen-dev ipadic instead
of apt-get install libchasen-dev.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread Craig Small
On Tue, May 13, 2003 at 08:33:54AM +0900, Tomohiro KUBOTA wrote:
 I think there are no problem on this, because Craig has already compiled
 version 3.2.7 at his home directory to test search.debian.org/new , though
 his compilation is by eliminating east Asian character mapping tables
 and without chasen support.
 
 The version 3.2.7 was the latest version at that time (December 2002).

I'm working on 3.2.10 that will have the charset support.  Do I also
need to include chasen? Without chasen, mnogosearch will not understand
a Japanese word?

I'll get something uploaded soon, can you test it for me on a simple
set of pages (use the builtin if you like for no db) to see it does
work for your pages.  If so I'll get it compiled on klecker.

  - Craig




Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread Craig Small
On Tue, May 13, 2003 at 03:24:15PM +1000, Craig Small wrote:
 I'm working on 3.2.10 that will have the charset support.  Do I also
 need to include chasen? Without chasen, mnogosearch will not understand
 a Japanese word?

*sigh*
Now I've got rpath problems, anyone have a libtool lobotomiser to fix
this stupid bug?

W: mnogosearch-pgsql: binary-or-shlib-defines-rpath ./usr/lib/libmnogosearch-3.2.so /usr/lib
W: mnogosearch-pgsql: binary-or-shlib-defines-rpath ./usr/lib/libmnogocharset-3.2.so /usr/lib

How I hate libtool

  - Craig



Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread knok
At Tue, 13 May 2003 15:24:15 +1000,
Craig Small wrote:
 I'm working on 3.2.10 that will have the charset support.  Do I also
 need to include chasen? Without chasen, mnogosearch will not understand
 a Japanese word?

In general, Japanese text does not put space characters at word
boundaries, so you need morphological analysis to split it into
words.
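
To illustrate the point about word boundaries, here is a minimal Python sketch. It is not mnogosearch or ChaSen code: the sample sentence and the character-bigram fallback are assumptions chosen for illustration, and a real morphological analyzer such as ChaSen relies on a dictionary rather than fixed-width windows.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace, as an indexer tuned for European languages might."""
    return text.split()


def char_bigrams(text: str) -> list[str]:
    """Crude dictionary-free fallback: overlapping two-character windows."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


english = "search the Debian documentation"
japanese = "文書を検索する"  # sample text only, an assumption for this sketch

print(whitespace_tokens(english))   # separate words
print(whitespace_tokens(japanese))  # one unsplit token
print(char_bigrams(japanese))       # overlapping pairs, not real words
```

With whitespace splitting, the whole Japanese sentence becomes one index term, so a query for a word inside it would not match; that is the gap a morphological analyzer is meant to close.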

By the way, there is another issue with ChaSen. Its dictionary
(ipadic) is licensed in two languages. The license written in
Japanese is DFSG-free, but the other license, written in English, is
questionable.
http://lists.debian.org/debian-legal/2001/debian-legal-200104/msg00062.html
(I think it is the gray area...)

I tried to ask upstream to change the English license, but the licensor,
a governmental organization, had already been dissolved, so I couldn't do
that.

Now I'm trying to make another DFSG-free dictionary for ChaSen. If I
can do it, I'll move the ipadic package to non-free and ITP the new one.

Another solution is to use libkakasi instead of libchasen. It is
completely free.
-- 
NOKUBI Takatsugu
E-mail: [EMAIL PROTECTED]
[EMAIL PROTECTED] / [EMAIL PROTECTED]



Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread Tomohiro KUBOTA
Hi,

From: [EMAIL PROTECTED] (Craig Small)
Subject: Re: enable searching East Asian words at search.debian.org
Date: Tue, 13 May 2003 15:24:15 +1000

 I'm working on 3.2.10 that will have the charset support.  Do I also
 need to include chasen? Without chasen, mnogosearch will not understand
 a Japanese word?

You are right.  Also, --with-extra-charsets=all is needed (if 3.2.10's
default setting eliminates mapping tables for CJK just like 3.2.8).


 I'll get something uploaded soon, can you test it for me on a simple
 set of pages (use the builtin if you like for no db) to see it does
 work for your pages.  If so I'll get it compiled on klecker.

Yes, I will, of course.  You mean, you will compile 3.2.10 and set-up
a test search page, then I will test searching CJK words?

However, I tested the builtin database and it didn't work well.  I didn't
research builtin further because it won't be used in the real
search page.  Thus, if you'd like to test builtin, please test that
your new compilation works well for English words.  Then I will test
for various languages including Chinese, Japanese, and Korean.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread Tomohiro KUBOTA
Hi,

From: [EMAIL PROTECTED]
Subject: Re: enable searching East Asian words at search.debian.org
Date: Tue, 13 May 2003 15:32:49 +0900

 Now I'm trying to make another DFSG-free dictionary for ChaSen. If I
 can do it, I'll move ipadic package to non-free and ITP the new one.

Any perspectives?


 Another solution is to use libkakasi instead of libchasen. It is
 completely free.

At first, I imagine chasen is much better than kakasi because chasen
analyzes the grammar of Japanese sentences while kakasi doesn't.
Which is better, chasen without a dictionary (chasen itself is free)
or kakasi?  Or does chasen *need* a dictionary (though libchasen0 doesn't
Depends: on ipadic)?

Second, can we use kakasi for mnogosearch?

If we don't have a solution, how about writing "please use Google for
searching CJK words in the Debian site" at http://search.debian.org/
and admitting that free software is not yet something which can
substitute for proprietary software?


Finally, which solution do you suggest?  Should we wait for a free
alternative to ipadic?  Or should ipadic be regarded as free?  Or
can we use kakasi?  Or should we recognize that there is no free
implementation of web search which supports languages including
Japanese?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread Peter Karlsson
Denis Barbier:

 which means that e-acute has been converted twice, and no pages are
 found.  Am I doing something wrong?

Your browser is most probably buggy, try upgrading it to the latest
version and try again.

A form on a page which does not declare an accept-charset attribute on
the form tag is supposed to use the encoding from the page the form was
on for submitting it.
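
The rule described above can be sketched in a few lines of Python. This is an illustration only: `submitted_query` is a hypothetical helper, not browser or mnogosearch code, and it assumes the browser honours accept-charset when it is present and otherwise falls back to the encoding of the page that carries the form.

```python
from typing import Optional
from urllib.parse import quote


def submitted_query(term: str, page_charset: str,
                    accept_charset: Optional[str] = None) -> str:
    """Percent-escape `term` the way a conforming browser is expected to.

    Hypothetical helper: the form's accept-charset wins when present,
    otherwise the encoding of the page carrying the form is used.
    """
    charset = accept_charset or page_charset
    return quote(term.encode(charset))


print(submitted_query("é", "iso-8859-1"))           # form on a latin1 page
print(submitted_query("é", "utf-8"))                # form on a UTF-8 page
print(submitted_query("é", "iso-8859-1", "utf-8"))  # accept-charset overrides
```

The same character therefore reaches the server as different byte sequences depending on the page, which is why the search page and the index need to agree on one encoding.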

Or did you send it from the command line with unencoded non-ASCII
characters? If so, there is no context in which to evaluate the form
data, so anything might happen.

-- 
\\//
Peter - http://www.softwolves.pp.se/
  I do not read or respond to mail with HTML attachments.



Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread Denis Barbier
On Tue, May 13, 2003 at 10:02:15AM +0200, Peter Karlsson wrote:
 Denis Barbier:
 
  which means that e-acute has been converted twice, and no pages are
  found.  Am I doing something wrong?
 
 Your browser is most probably buggy, try upgrading it to the latest
 version and try again.
 
 A form on a page which does not declare an accept-charset attribute on
 the form tag is supposed to use the encoding from the page the form was
 on for submitting it.

I am on sid and first tested with Lynx.  Now I performed other tests,
here are my results: in this table I represent how e-acute (0xe9 in
latin1 encoding) is escaped in the q= part of the results URL.

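 -------------------------------------------------
  browser \ env. | iso-8859-15 | utf-8
 -------------------------------------------------
  lynx           | %C3%A9      | %C3%83%C2%A9
  w3m-en         | %E9         | %C3%A9
  mozilla        | %C3%A9      | %C3%A9
  konqueror      | %C3%A9      | %C3%A9
 -------------------------------------------------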

So my problem seems related to text browsers.  In a UTF-8 environment,
the xterm window nicely displays UTF-8 encoded files, I cut'n'paste
the French word, and it appears fine in the browser.

 Or did you send it from the command line with unencoded non-ASCII
 characters?

No.

Denis



Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread knok
At Tue, 13 May 2003 15:54:01 +0900 (JST),
Tomohiro KUBOTA wrote:
  Now I'm trying to make another DFSG-free dictionary for ChaSen. If I
  can do it, I'll move ipadic package to non-free and ITP the new one.
 
 Any perspectives?

In fact, it's already been done by Tsuchiya-san. Now I'm trying to set up
a development environment (CVS, web, et al.). It is based on pubdic, which
is usually used with Canna and FreeWnn.

 At first, I imagine chasen is much better than kakasi because chasen
 analyzes the grammar of Japanese sentences while kakasi doesn't.
 Which is better, chasen without dictionary (chasen itself is free)
 or kakasi?  Or, chasen *needs* dictionary (though libchasen0 doesn't
 Depends: on ipadic)?

Oops, it's a bug in libchasen. ChaSen (or libchasen) does need its
dictionary.

 Second, can we use kakasi for mnogosearch?

It could be. It doesn't seem so hard. If I have time, I'll try to make
a patch.

 At last, which solution do you suggest?  Should we wait for free
 alternative for ipadic?  Or, ipadic should be regarded free?  Or,
 can we use kakasi?  Or, should we recognize there are no free
 implementation for web search which supports languages including
 Japanese?

My other reason for making a free dictionary is to ITP MeCab, which is
yet another morphological analyzer. The author of the software said
that it is faster than ChaSen or KAKASI, so the best way is to use it,
I think.
-- 
NOKUBI Takatsugu
E-mail: [EMAIL PROTECTED]
[EMAIL PROTECTED] / [EMAIL PROTECTED]



Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread Peter Karlsson
Denis Barbier:

 I am on sid and first tested with Lynx.

Lynx is buggy, I reported it some time ago. It has been fixed in the
upstream release, though. See bug 156680 URL:bugs.debian.org/156680

 So my problem seems related to text browsers.

Yeah, most of the text browsers seem, unfortunately, to have severe
problems with encoding support.

-- 
\\//
Peter - http://www.softwolves.pp.se/
  I do not read or respond to mail with HTML attachments.



Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread Denis Barbier
On Tue, May 13, 2003 at 01:03:41PM +0200, Peter Karlsson wrote:
 Denis Barbier:
 
  I am on sid and first tested with Lynx.
 
 Lynx is buggy, I reported it some time ago. It has been fixed in the
 upstream release, though. See bug 156680 URL:bugs.debian.org/156680

In this bugreport you tell that lynx-cur is right, but I have similar
results with lynx-cur 2.8.5-10.
I made further investigations and found http://koi8.pp.ru/htmlforms.html
Should search.d.o. follow these instructions, or are they wrong?

Denis



Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread Josip Rodin
On Mon, May 12, 2003 at 11:55:03PM +0200, Denis Barbier wrote:
However, there will be no results
for Japanese words because of problems I wrote.
   
   Yes, I am pretty sure that Josip was investigating these problems when
   he sent his mail and will implement your solution.
  
  Actually, I was not. There are only 24 hours in a day and I can't set aside
  enough time to handle this, _too_. *shrug*
 
 Tomohiro KUBOTA worked hard to find a solution, and he needs libchasen-dev,
 libchasen0, and ipadic packages from stable to be installed on klecker.
 It should not take too much time ;)

And you're telling me this because...? I can't install those packages on
klecker!

-- 
 2. That which causes joy or happiness.



Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread Denis Barbier
On Tue, May 13, 2003 at 04:05:27PM +1000, Craig Small wrote:
 On Tue, May 13, 2003 at 03:24:15PM +1000, Craig Small wrote:
  I'm working on 3.2.10 that will have the charset support.  Do I also
  need to include chasen? Without chasen, mnogosearch will not understand
  a Japanese word?
 
 *sigh*
 Now I've got rpath problems, anyone have a libtool lobotomiser to fix
 this stupid bug?
 
 W: mnogosearch-pgsql: binary-or-shlib-defines-rpath 
 ./usr/lib/libmnogosearch-3.2.so /usr/lib
 W: mnogosearch-pgsql: binary-or-shlib-defines-rpath 
 ./usr/lib/libmnogocharset-3.2.so /usr/lib
 
 How i hate libtool

Mnogosearch was libtoolized with an old version of libtool. You can
rerun the bootstrap script with newer auto* tools from unstable.

Denis



Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread Peter Karlsson
Denis Barbier:

 In this bugreport you tell that lynx-cur is right, but I have similar
 results with lynx-cur 2.8.5-10.

It worked for me in my tests. When exactly did it go wrong for you?

 I made further investigations and found http://koi8.pp.ru/htmlforms.html
 Should search.d.o. follow these instructions, or are they wrong?

As I wrote earlier, accept-charset is a very good idea to specify. And
you have to specify it if the source document is in another encoding.
I don't think all browsers support it, though, so it's also best to have
the document with the form in the same encoding.

-- 
\\//
Peter - http://www.softwolves.pp.se/
  I do not read or respond to mail with HTML attachments.



Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread Tomohiro KUBOTA
Hi,

From: [EMAIL PROTECTED] (Denis Barbier)
Subject: Re: enable searching East Asian words at search.debian.org
Date: Tue, 13 May 2003 14:12:00 +0200

 In this bugreport you tell that lynx-cur is right, but I have similar
 results with lynx-cur 2.8.5-10.

I tested lynx and lynx-cur and found that both of them are problematic.

I tested lynx and lynx-cur on mlterm and xterm in UTF-8 mode and
ja_JP.UTF-8 locale.  I searched for the Russian word for "News".  Then,
though the search seems to work well, all Cyrillic characters are
displayed in Latin alphabet transliteration.  I imagine they are
not locale-aware.

Please test w3mmee.  It should work well.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread Denis Barbier
On Wed, May 14, 2003 at 01:15:18AM +0900, Tomohiro KUBOTA wrote:
 Hi,
 
 From: [EMAIL PROTECTED] (Denis Barbier)
 Subject: Re: enable searching East Asian words at search.debian.org
 Date: Tue, 13 May 2003 14:12:00 +0200
 
  In this bugreport you tell that lynx-cur is right, but I have similar
  results with lynx-cur 2.8.5-10.
 
 I tested lynx and lynx-cur and found that both of them are problematic.
 
 I tested lynx and lynx-cur on mlterm and xterm in UTF-8 mode and
 ja_JP.UTF-8 locale.  I searched a Russian word for News.  Then,
 though the search seems to work well, all Cyrillic characters are
 displayed in Latin alphabet transliteration.  I imagine they are
 not sensible of locale.

Thanks to Peter's explanations, I now understand that this problem
is related to browsers, and not search.d.o.  I made some other
investigations and found that Lynx does not get charset from my
locale, and I have to set it manually by pressing the 'o' key.
It then seems to work fine.

 Please test w3mmee.  It should work well.

Indeed, it works fine.  Looks like a smart browser.
It automatically sets charset in a UTF-8 locale, but not when
in a legacy encoding.  Do you know why?  This is quite annoying
when performing such tests ;)

In conclusion, I could not make links and w3m work, but lynx and
w3mmee are fine if well configured.  This was interesting, thanks
to all of you.

Now we can go back to the mnogosearch issue ;)

Denis



Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread Craig Small
On Tue, May 13, 2003 at 03:32:49PM +0900, [EMAIL PROTECTED] wrote:
 By the way, there is another issue with ChaSen. Its dictionary
 (ipadic) is licensed in two languages. The license written in
 Japanese is DFSG-free, but the other license, written in English, is
 questionable.

Great, ipadic is 3meg and consumes 12meg.  I cannot expect people to
download that with the default packages.

So I'll release mnogo 3.2.10 with no chasen. It's broken anyway because
it needs some rc files and other things.

Now for the website, I'll get the other charsets going and we'll work
on the JP problem separately.

  - Craig




Re: enable searching East Asian words at search.debian.org

2003-05-13 Thread Tomohiro KUBOTA
Hi,

From: [EMAIL PROTECTED] (Craig Small)
Subject: Re: enable searching East Asian words at search.debian.org
Date: Wed, 14 May 2003 11:36:28 +1000

 Great, ipadic is 3meg and consumes 12meg.  I cannot expect people to
 download that with the default packages.
 
 So I'll release mnogo 3.2.10 with no chasen. It's broken anyway because
 it needs some rc files and other things.

 Now for the website, I'll get the other charsets going and we'll work
 on the JP problem separately.

Sorry, I don't understand the meaning or feeling of "Great" here.
Can you explain?

You are confusing two different aspects: one is providing Debian
mnogosearch packages and the other is how search.debian.org is
constructed.

I agree that Japanese people cannot use the Debian mnogosearch package
as is and are forced to recompile it, in order to save megs of disk
space for people who don't need Japanese.  (Please write
instructions on recompilation in README.Debian.)

However, search.debian.org is a different topic.  Since Japanese
is one of several languages for which the number of translated pages in
http://www.debian.org/ is more than 50%, it is nonsense to exclude
these pages from the target of search.

I don't understand at all why some of Debian (and other free-software-
related) people tend to exclude Japanese and other Asian languages
from the range of support.  Even people who are interested in
i18n and translation sometimes tend to do so!

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: enable searching East Asian words at search.debian.org

2003-05-12 Thread Denis Barbier
On Mon, May 12, 2003 at 08:06:32AM +0900, Tomohiro KUBOTA wrote:
[...]
 However, if you would like to supply translated search pages (though
 I think it is not an urgent problem), I just read the
 webwml/english/searchtmpl/Makefile and found that
 `grep CHARSET ../.wmlrc` might have a problem.  webwml/japanese/.wmlrc
 have two lines which matches 'grep CHARSET', which are
 '-D CHARSET=iso-2022-jp' and '-D CHARSET_WML=euc-jp'.

Yes, I fixed it yesterday.
My understanding of Josip's mail is that, when investigating your
instructions about mnogosearch, he wondered how input text has
to be encoded when filling in the search form.  This is a good question:
the search page should tell which encoding to use when searching for
non-English words.
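
The `grep CHARSET` double match quoted above can be reproduced with a small Python sketch. The two sample lines are the ones named in the quoted text; the helper names are hypothetical, and the thread does not say how the Makefile was actually fixed or which of the two values it should use, so the stricter pattern below is only one possible approach.

```python
import re

# The two .wmlrc lines named in the quoted message.
WMLRC_LINES = [
    "-D CHARSET=iso-2022-jp",
    "-D CHARSET_WML=euc-jp",
]


def loose_match(lines):
    """Mimic `grep CHARSET | cut -d= -f2`: any line containing CHARSET."""
    return [line.split("=", 1)[1] for line in lines if "CHARSET" in line]


def exact_match(lines):
    """Accept only the CHARSET define itself, not CHARSET_WML."""
    pattern = re.compile(r"^-D\s+CHARSET=(\S+)$")
    return [m.group(1) for line in lines if (m := pattern.match(line))]


print(loose_match(WMLRC_LINES))  # two values come back
print(exact_match(WMLRC_LINES))  # only the CHARSET value
```

When the loose match returns two values, a shell variable holding both would hand `iconv -f` one charset plus a stray extra argument, which would be consistent with the `cannot open input file` error reported elsewhere in this thread.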

Denis



Re: enable searching East Asian words at search.debian.org

2003-05-12 Thread Tomohiro KUBOTA
Hi,

From: [EMAIL PROTECTED] (Denis Barbier)
Subject: Re: enable searching East Asian words at search.debian.org
Date: Mon, 12 May 2003 09:54:58 +0200

 My understanding of Josip mail is that when investigating your
 instructions about mnogosearch, he wondered how input text has
 to be encoded when filling search form.  This is a good question,
 search page should tell which encoding to use when searching for
 non-English words.

Yes, I know.  The solution is to write the search page in UTF-8,
which has been available since last December when Craig and I
discussed this problem.

For example, I can search for a Russian word "Novosti" (of course in
Cyrillic) (which means "News") at the http://search.debian.org/ English
page like:

http://search.debian.org/?q=%D0%9D%D0%BE%D0%B2%D0%BE%D1%81%D1%82%D0%B8&ps=10&o=0&m=all&g=

and the page shows 112 results.

Also, I can input Japanese words.  However, there will be no results
for Japanese words because of the problems I wrote about.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: enable searching East Asian words at search.debian.org

2003-05-12 Thread Tomohiro KUBOTA
Hi,

From: [EMAIL PROTECTED] (Denis Barbier)
Subject: Re: enable searching East Asian words at search.debian.org
Date: Mon, 12 May 2003 13:45:08 +0200

  For example, I can search an Russian word Novosti (of course in
  Cyrillic)
 
 The point is: how are Cyrillic words passed by the web browser to the
 search engine?
 Are they encoded in ISO-8859-5, KOI8-R or UTF-8 charsets?

UTF-8, i.e., the same encoding as the search page.  For example,
the previous example:

http://search.debian.org/?q=%D0%9D%D0%BE%D0%B2%D0%BE%D1%81%D1%82%D0%B8&ps=10&o=0&m=all&g=

The first 6 bytes read:

%D0%9D - U+041D (CYRILLIC CAPITAL LETTER EN)
%D0%BE - U+043E (CYRILLIC SMALL LETTER O)
%D0%B2 - U+0432 (CYRILLIC SMALL LETTER VE)
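
The byte-to-character mapping listed above can be checked with Python's standard library. This is a small sketch for verification only; `describe` is a hypothetical helper name, and the escaped string is the `q=` value from the URL above.

```python
import unicodedata
from urllib.parse import unquote


def describe(escaped: str, count: int = 3) -> list[str]:
    """Decode a percent-escaped UTF-8 query and name its first characters."""
    word = unquote(escaped, encoding="utf-8")
    lines = []
    for ch in word[:count]:
        # Re-encode each character to show which escaped bytes it came from.
        escapes = "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        lines.append(f"{escapes} - U+{ord(ch):04X} ({unicodedata.name(ch)})")
    return lines


query = "%D0%9D%D0%BE%D0%B2%D0%BE%D1%81%D1%82%D0%B8"
for line in describe(query):
    print(line)
```

Each Cyrillic letter takes two bytes in UTF-8, so the first six escaped bytes correspond to the first three letters of the word.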

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: enable searching East Asian words at search.debian.org

2003-05-12 Thread Denis Barbier
On Mon, May 12, 2003 at 10:26:30PM +0900, Tomohiro KUBOTA wrote:
 Hi,
 
 From: [EMAIL PROTECTED] (Denis Barbier)
 Subject: Re: enable searching East Asian words at search.debian.org
 Date: Mon, 12 May 2003 13:45:08 +0200
 
   For example, I can search an Russian word Novosti (of course in
   Cyrillic)
  
  The point is: how are Cyrillic words passed by the web browser to the
  search engine?
  Are they encoded in ISO-8859-5, KOI8-R or UTF-8 charsets?
 
 UTF-8, i.e., the same encoding as the search page.  For example,
 the previous example:
 
 http://search.debian.org/?q=%D0%9D%D0%BE%D0%B2%D0%BE%D1%81%D1%82%D0%B8&ps=10&o=0&m=all&g=
 
 The first 6 bytes read:
 
 %D0%9D - U+041D (CYRILLIC CAPITAL LETTER EN)
 %D0%BE - U+043E (CYRILLIC SMALL LETTER O)
 %D0%B2 - U+0432 (CYRILLIC SMALL LETTER VE)

Hmmm I tend to disagree.  I tried with the French word for 'election',
which is also 'election' but with the first 'e' being e-acute.
In an ISO-8859-15 environment, I enter this word on search.debian.org and
select the French language, and am redirected to
  http://search.debian.org/?q=%C3%A9lection&ps=10&o=0&m=all&g=fr
which gives 46 pages.

If I now run
  $ export LANG=fr_FR.UTF-8
  $ xterm
go to search.debian.org in this window and cut'n'paste this word from
another window, I am redirected to
  http://search.debian.org/?q=%C3%83%C2%A9lection&ps=10&o=0&m=all&g=fr
which means that e-acute has been converted twice, and no pages are
found.  Am I doing something wrong?
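
The doubled escape can be reproduced outside a browser. The following Python sketch assumes the UTF-8 bytes of the query were misread as ISO-8859-1 and then encoded to UTF-8 a second time, which is one common way to end up with this pattern; the function names are hypothetical and nothing here is taken from Lynx or mnogosearch.

```python
from urllib.parse import quote, unquote


def double_encode(term: str) -> str:
    """Simulate the fault: UTF-8 bytes misread as latin1, then re-encoded."""
    misread = term.encode("utf-8").decode("iso-8859-1")
    return quote(misread.encode("utf-8"))


def undo_double_encoding(escaped: str) -> str:
    """Reverse the fault, assuming exactly one extra latin1 round trip."""
    misread = unquote(escaped, encoding="utf-8")
    return misread.encode("iso-8859-1").decode("utf-8")


word = "élection"
print(quote(word))                                   # single UTF-8 encoding
print(double_encode(word))                           # converted twice
print(undo_double_encoding(double_encode(word)))     # back to the original
```

The repair only works when the assumption holds; a query that was encoded correctly would raise an error or come back garbled if passed through `undo_double_encoding`.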

Denis



Re: enable searching East Asian words at search.debian.org

2003-05-12 Thread Josip Rodin
On Mon, May 12, 2003 at 01:45:08PM +0200, Denis Barbier wrote:
  However, there will be no results
  for Japanese words because of problems I wrote.
 
 Yes, I am pretty sure that Josip was investigating these problems when
 he sent his mail and will implement your solution.

Actually, I was not. There are only 24 hours in a day and I can't set aside
enough time to handle this, _too_. *shrug*

-- 
 2. That which causes joy or happiness.



Re: enable searching East Asian words at search.debian.org

2003-05-12 Thread Denis Barbier
On Mon, May 12, 2003 at 11:07:32PM +0200, Josip Rodin wrote:
 On Mon, May 12, 2003 at 01:45:08PM +0200, Denis Barbier wrote:
   However, there will be no results
   for Japanese words because of problems I wrote.
  
  Yes, I am pretty sure that Josip was investigating these problems when
  he sent his mail and will implement your solution.
 
 Actually, I was not. There are only 24 hours in a day and I can't set aside
 enough time to handle this, _too_. *shrug*

Tomohiro KUBOTA worked hard to find a solution, and he needs libchasen-dev,
libchasen0, and ipadic packages from stable to be installed on klecker.
It should not take too much time ;)
When done, mnogosearch from unstable has to be recompiled, I can volunteer
to provide a backport if that helps.

Denis



Re: enable searching East Asian words at search.debian.org

2003-05-12 Thread Tomohiro KUBOTA
Hi,

From: [EMAIL PROTECTED] (Denis Barbier)
Subject: Re: enable searching East Asian words at search.debian.org
Date: Mon, 12 May 2003 23:55:03 +0200

 When done, mnogosearch from unstable has to be recompiled, I can volunteer
 to provide a backport if that helps.

I think there is no problem with this, because Craig has already compiled
version 3.2.7 in his home directory to test search.debian.org/new , though
his build omits the East Asian character mapping tables
and has no chasen support.

The version 3.2.7 was the latest version at that time (December 2002).

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: enable searching East Asian words at search.debian.org

2003-05-12 Thread Tomohiro KUBOTA
Hi,

From: [EMAIL PROTECTED] (Denis Barbier)
Subject: Re: enable searching East Asian words at search.debian.org
Date: Mon, 12 May 2003 19:09:54 +0200

 If now I run
   $ export LANG=fr_FR.UTF-8
   $ xterm
 go to search.debian.org in this window and cut'n'paste this word from
 another window, I am redirected to
   http://search.debian.org/?q=%C3%83%C2%A9lection&ps=10&o=0&m=all&g=fr
 which means that e-acute has been converted twice, and no pages are
 found.  Am I doing something wrong?

There might be two problems.  One is whether cut'n'paste works well
or not, and another is whether the browser can handle encoding conversion
correctly.

I tested with galeon and w3mmee.  (w3m doesn't support UTF-8.)
Also, Internet Explorer on Windows works well.

I'd like to test your operation.  Which browser did you use?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: enable searching East Asian words at search.debian.org

2003-05-11 Thread Josip Rodin
On Mon, May 05, 2003 at 07:47:11PM +0900, Tomohiro KUBOTA wrote:
 No reply for more than one week.  Someone please reply.

BTW on related note, I noticed:

make: Entering directory `/org/www.debian.org/webwml/japanese/searchtmpl'
wml -q -D CUR_YEAR=2003 -o UNDEFuJA:[EMAIL PROTECTED] --prolog=/usr/bin/kcc -e - --epilog=../convert search.ja.html search.wml
c=`grep CHARSET ../.wmlrc | cut -d= -f2`; \
  iconv -f $c -t UTF-8 search.ja.html | perl -pe 's,^(\s*meta http-equiv=Content-Type content=text/html; charset=)\S+()$,$1UTF-8$2,'  search.ja.html
iconv: cannot open input file `euc-jp': No such file or directory
copying search.ja.html to ../../../www/searchtmpl
make: Leaving directory `/org/www.debian.org/webwml/japanese/searchtmpl'

I suppose even if the page isn't coded properly, the Japanese characters
would still be input in UTF-8 (presuming the perl line did its job) so
you're back to square one...?

In any event, searchtmpl/Makefile needs some adjustments to accommodate for
the pre-conversion stuff.

-- 
 2. That which causes joy or happiness.



Re: enable searching East Asian words at search.debian.org

2003-05-11 Thread Tomohiro KUBOTA
Hi,

From: Josip Rodin [EMAIL PROTECTED]
Subject: Re: enable searching East Asian words at search.debian.org
Date: Sun, 11 May 2003 14:33:38 +0200

 make: Entering directory `/org/www.debian.org/webwml/japanese/searchtmpl'
 wml -q -D CUR_YEAR=2003 -o UNDEFuJA:[EMAIL PROTECTED] --prolog=/usr/bin/kcc 
 -e - --epilog=../convert search.ja.html search.wml
 c=`grep CHARSET ../.wmlrc | cut -d= -f2`; \
   iconv -f $c -t UTF-8 search.ja.html | perl -pe 's,^(\s*meta 
 http-equiv=Content-Type content=text/html; charset=)\S+()$,$1UTF-8$2,'  
 search.ja.html
 iconv: cannot open input file `euc-jp': No such file or directory
 copying search.ja.html to ../../../www/searchtmpl
 make: Leaving directory `/org/www.debian.org/webwml/japanese/searchtmpl'

Sorry I don't understand what you are doing.  However, my improvement
is not related to search.ja.html (or translation of search page) at all.

My intention is to enable searching for, for example, "Bunsho" (in Kanji),
which means "documentation" in Japanese, at the search page.  It should
be enabled, because there are many Japanese-translated pages at Debian
site and these pages should be targets of searching.  Not translation
of the search page.  (I guess you are trying to prepare Japanese
translation of search page?  I will research this point later.  However,
please note, for Japanese people, that a search page in English which
can search Japanese words is absolutely better than a search page in
Japanese which cannot search Japanese words.)

The problems are:

(1) Though mnogosearch is based on UTF-8 (and should be able to process
all the languages into which the Debian web pages are translated), its
support for CJK languages is disabled by default.  (Please read the
./configure --help output or the installation instructions of
mnogosearch.)  Disabling it merely drops the character code mapping
tables between the CJK encodings and UTF-8.  This is why mnogosearch
needs to be recompiled.

(2) Japanese and Chinese don't put whitespace between words, so
indexing (i.e., reading all the web pages and storing all their words
in a database for searching) doesn't work well.  The chasen-related
packages are needed to fix this.  (I hope you have read my mails where
I wrote that chasen is needed -- please just go back in this thread.)
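
Both problems can be seen in a minimal Python sketch.  It is only an
illustration: the sample strings are made up, and Python's built-in
codecs and str.split() stand in for the mapping tables and the indexer
of mnogosearch, whose internals are not shown here.

```python
# Hypothetical sample text; not taken from the Debian web pages.
english = "search the Debian documentation"
japanese = "文書を検索する"  # roughly "search the documents", written without spaces

# Problem (1): a page in a legacy encoding must be mapped to UTF-8 first.
# Python ships its own EUC-JP table; mnogosearch needs its CJK tables compiled in.
euc_jp_bytes = japanese.encode("euc_jp")
utf8_bytes = euc_jp_bytes.decode("euc_jp").encode("utf-8")


# Problem (2): splitting on whitespace finds word boundaries only in English.
def whitespace_tokens(text):
    """Return the whitespace-separated pieces of text."""
    return text.split()


print(len(euc_jp_bytes), len(utf8_bytes))
print(whitespace_tokens(english))
print(whitespace_tokens(japanese))
```

The Japanese sentence comes back as one single token, so an index built
this way would hold whole sentences instead of words.  A morphological
analyzer such as chasen is what supplies the missing word boundaries.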

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: enable searching East Asian words at search.debian.org

2003-05-11 Thread Josip Rodin
On Sun, May 11, 2003 at 11:09:44PM +0900, Tomohiro KUBOTA wrote:
  c=`grep CHARSET ../.wmlrc | cut -d= -f2`; \
iconv -f $c -t UTF-8 search.ja.html | perl -pe 's,^(\s*meta 
  http-equiv=Content-Type content=text/html; charset=)\S+()$,$1UTF-8$2,' 
   search.ja.html
  iconv: cannot open input file `euc-jp': No such file or directory
 
 Sorry I don't understand what you are doing.  However, my improvement
 is not related to search.ja.html (or translation of search page) at all.

Well, it's related if you want people to be able to actually input stuff
properly into the search engine. :)

-- 
 2. That which causes joy or happiness.



Re: enable searching East Asian words at search.debian.org

2003-05-11 Thread Tomohiro KUBOTA
Hi,

From: Josip Rodin [EMAIL PROTECTED]
Subject: Re: enable searching East Asian words at search.debian.org
Date: Sun, 11 May 2003 19:44:17 +0200

 On Sun, May 11, 2003 at 11:09:44PM +0900, Tomohiro KUBOTA wrote:
   c=`grep CHARSET ../.wmlrc | cut -d= -f2`; \
 iconv -f $c -t UTF-8 search.ja.html | perl -pe 's,^(\s*meta 
   http-equiv=Content-Type content=text/html; 
   charset=)\S+()$,$1UTF-8$2,'  search.ja.html
   iconv: cannot open input file `euc-jp': No such file or directory
  
  Sorry I don't understand what you are doing.  However, my improvement
  is not related to search.ja.html (or translation of search page) at all.
 
 Well, it's related if you want people to be able to actually input stuff
 properly into the search engine. :)

OK, I remembered.  The search web page must be UTF-8.

The current (English) version of the search page is already UTF-8 and
has no problem with international search, I think.

However, if you would like to supply translated search pages (though
I think it is not an urgent problem), I just read
webwml/english/searchtmpl/Makefile and found that
`grep CHARSET ../.wmlrc` might have a problem.  webwml/japanese/.wmlrc
has two lines which match 'grep CHARSET', namely
'-D CHARSET=iso-2022-jp' and '-D CHARSET_WML=euc-jp'.
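
A small Python sketch of what probably happens.  It only mimics
`grep ... | cut -d= -f2` on the two .wmlrc lines quoted above; the
helper name grep_cut and the stricter pattern 'CHARSET=' are
assumptions made for illustration, not a tested Makefile fix.

```python
# The two matching lines from webwml/japanese/.wmlrc, as quoted above.
wmlrc_lines = [
    "-D CHARSET=iso-2022-jp",
    "-D CHARSET_WML=euc-jp",
]


def grep_cut(lines, needle):
    """Mimic `grep needle | cut -d= -f2` on in-memory lines."""
    return [line.split("=")[1] for line in lines if needle in line]


loose = grep_cut(wmlrc_lines, "CHARSET")    # what the Makefile greps for
strict = grep_cut(wmlrc_lines, "CHARSET=")  # hypothetical stricter pattern

# An unquoted $c in the shell splits into one word per value, so the
# second value lands where iconv expects an input file name.
argv = ["iconv", "-f", *loose, "-t", "UTF-8", "search.ja.html"]

print(loose)
print(strict)
print(" ".join(argv))
```

With the loose pattern, euc-jp ends up as a stray argument after
`-f iso-2022-jp`, which would explain the message
"iconv: cannot open input file `euc-jp'".  Which of the two values the
Makefile should actually use is a separate question.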

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: enable searching East Asian words at search.debian.org

2003-05-05 Thread Tomohiro KUBOTA
Hi,

No reply for more than one week.  Someone please reply.
There are Chinese, Japanese, and Korean translations of www.debian.org
but search.debian.org cannot search words in these languages.

Please do the following:

1. Install the libchasen-dev, libchasen0, and ipadic packages on klecker.
2. Add me ([EMAIL PROTECTED]) as a user of the postgresql database on klecker.
3. Create a postgresql database on klecker for which I have write permission.

Then I can demonstrate the improvement (or bug fix, as I regard it)
described in my last mail, which I quote in full below.


From: Tomohiro KUBOTA [EMAIL PROTECTED]
Subject: enable searching East Asian words at search.debian.org
Date: Sat, 26 Apr 2003 09:45:48 +0900 (JST)

 Hi,
 
 So far search.debian.org doesn't support East Asian languages
 (Chinese, Japanese, and Korean).  I.e., it cannot search Chinese,
 Japanese, or Korean words.
 
 I have recently researched this problem and I think I found
 how to fix it.  I tested on my personal machine, which has no 24-hour
 Internet connection, and it works almost fine.
 
  1. install libchasen-dev, libchasen0, and ipadic packages.
  2. recompile mnogosearch (version 3.2.8 or later) with
 --enable-chasen --with-extra-charsets=all option for ./configure .
  3. invoke indexer -C and then indexer to rebuild the search database.
 
 Could someone do this?  Or, can I have a database (postgresql) access
 (write access) permission at klecker to prove this?
 
 
 Explanation:
 
 Chasen packages are needed to extract words from Japanese texts,
 because Japanese texts don't put whitespace between words.  The
 --enable-chasen option (available since version 3.2.8) lets
 mnogosearch use chasen.
 
 Though mnogosearch is Unicode-based software and potentially supports
 East Asian languages, support of these languages is disabled by default.
 To enable this, --with-extra-charsets=all is needed.
 
 Since the current search database in search.debian.org doesn't have
 any East Asian words, the whole database needs to be rebuilt.
 (Of course it would be enough to rebuild the database only for the
 *.{ja,ko,zh-cn,zh-hk,zh-tw}.html pages, but I don't know whether it is
 possible to do this.)
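
For illustration, the set of pages meant by that pattern could be
picked out as in the Python sketch below.  The file names are made up,
and nothing here says whether indexer itself can be limited to such a
subset; that remains the open question.

```python
import re

# Language suffixes taken from the pattern *.{ja,ko,zh-cn,zh-hk,zh-tw}.html
EAST_ASIAN_SUFFIXES = ("ja", "ko", "zh-cn", "zh-hk", "zh-tw")
_PATTERN = re.compile(
    r"\.(%s)\.html$" % "|".join(re.escape(s) for s in EAST_ASIAN_SUFFIXES)
)


def is_east_asian_page(filename):
    """Return True if filename ends in one of the East Asian language suffixes."""
    return _PATTERN.search(filename) is not None


# Hypothetical file names, only to exercise the filter.
pages = [
    "index.en.html",
    "index.ja.html",
    "intro/about.zh-tw.html",
    "index.de.html",
    "index.ko.html",
]
selected = [p for p in pages if is_east_asian_page(p)]
print(selected)
```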

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/