Re: multilanguage text

2001-05-10 Thread Danil Lavrentyuk

[ On Thu, 10 May 2001, Alexander Barkov wrote: ]

AB  Could mnoGoSearch correctly index, for example, English words in Russian
AB  text?
AB
AB It can.
AB
AB  Will it simply treat these words as misspelled and (in the case of
AB  IspellIncorrectFactor 1) use them 'as is' in indexing?
AB
AB Yes. But you may add English ispell files too.

Hmmm..

If I add English ispell files, will it treat words made of 'Latin' letters as
English words (when such words occur in a Russian text)?


Danil Lavrentyuk
Communiware.net
Programmer

___
If you want to unsubscribe send unsubscribe general
to [EMAIL PROTECTED]




Re: multilanguage text

2001-05-10 Thread Alexander Barkov


The 3.2.x branch will have a language guesser. It's already implemented
and works very well for single-language pages, and even for mostly
single-language pages. I hope the first release of 3.2.x will be
available in May.



Danil Lavrentyuk wrote:
 
 [ On Wed, 9 May 2001, Maxime Zakharov wrote: ]
 
 MZ  And what if a site has many texts uploaded by users?
 MZ  Do I have to manually edit them all, setting lang attributes? :)
 MZ  Do I have to demand it from the uploaders? They will not do it.
 MZ
 MZ Users may upload big mega gifs as .html files :)
 
 That would be obvious fraud...
 
 MZ Let's talk about W3C recommendations.
 
 ... but ignoring a far-away committee's recommendations could be simple
 laziness.
 Not all software follows all of the recommendations.
 Not all users know all of the recommendations. Not all users even think about
 using such recommendations.
 
 Text could be converted to HTML from some other text format.
 Who, for example, will check texts such as big multi-volume books (like
 Amber by Zelazny or Wheel of Time by Jordan, or even bigger) for foreign
 phrases? :)
 
 Let's talk about the real world, where we have to index multilanguage texts
 without lang attributes.
 
 MZ  What if I have to index texts placed somewhere on the internet, not locally?
 MZ  What if a site contains the texts of many books (something like www.lib.ry, for
 MZ  example)?
 MZ
 MZ Sometimes, without an explicit language definition, it's impossible to
 MZ uniquely select the language for a word.
 MZ For example, the word 'test' may be English or German.
 
 I know.
 I think it is feasible (but hard, I see) to make a system which could guess what
 the text's language is. It could use two steps, plus an optional third:
 1) Create a list of encodings this text could be written in (simply by
 testing whether all of the word's characters are alphabetic in this encoding).
 Here we could assume that two or more successive foreign words are from the
 same language.
 2) Check (using ispell tables) all the languages which use encodings from the
 list (created above), looking for one where these words are correct.
 3) (optional) If more than one language is suitable, select the one that was
 selected for the previous phrase.
 
 OK. This method does not guarantee that the selection will always be correct.
 But in most cases it will.
 
 Yes, I know, this method is not too quick... But it is better than no
 method at all. Anyway, it would be good to be able to turn it off in the
 indexer.conf file or by a command line option.
 
 
 Danil Lavrentyuk
 Communiware.net
 Programmer
 




Re: multilanguage text

2001-05-09 Thread Sergey Kartashoff

Hi!

Tuesday, May 08, 2001, 6:30:44 PM, you wrote:

DL Could mnoGoSearch correctly index, for example, English words in Russian
DL text?

Yes, it should.

DL Will it simply treat these words as misspelled and (in the case of
DL IspellIncorrectFactor 1) use them 'as is' in indexing?

It can correctly determine multilanguage documents.

-- 
Regards, Sergey aka gluke.






Re: multilanguage text

2001-05-09 Thread Danil Lavrentyuk

[ On Wed, 9 May 2001, Sergey Kartashoff wrote: ]

SK DL Could mnoGoSearch correctly index, for example, English words in Russian
SK DL text?
SK
SK Yes, it should.
SK
SK DL Will it simply treat these words as misspelled and (in the case of
SK DL IspellIncorrectFactor 1) use them 'as is' in indexing?
SK
SK It can correctly determine multilanguage documents.

How will it determine which language to use as the secondary one?
I.e. how can it guess, for example, whether I have an English, French or Latin
phrase in a Russian text?


Danil Lavrentyuk
Communiware.net
Programmer





Re: multilanguage text

2001-05-09 Thread Maxime Zakharov

Danil Lavrentyuk wrote:
 
 SK DL Will it simply treat these words as misspelled and (in the case of
 SK DL IspellIncorrectFactor 1) use them 'as is' in indexing?
 SK
 SK It can correctly determine multilanguage documents.
 
 How will it determine which language to use as the secondary one?
 I.e. how can it guess, for example, whether I have an English, French or Latin
 phrase in a Russian text?

By the HTML lang attribute.
For example,

<p lang=en>Testing. This is simple.</p>

will be considered English text, regardless of the document language.
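
To make the idea concrete, here is a minimal Python sketch of splitting a page into text chunks tagged with the language of the nearest enclosing lang attribute. It is only an illustration, not how mnoGoSearch itself parses documents; the names LangTextCollector and split_by_lang and the default language "ru" are assumptions made for the example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting level.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LangTextCollector(HTMLParser):
    """Collect (lang, text) pairs, inheriting lang from enclosing elements."""

    def __init__(self, default_lang):
        super().__init__()
        self.stack = [default_lang]  # language in effect at each nesting level
        self.chunks = []             # collected (lang, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        lang = dict(attrs).get("lang")
        # An element without its own lang inherits the enclosing one.
        self.stack.append(lang or self.stack[-1])

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append((self.stack[-1], text))


def split_by_lang(html, default_lang="ru"):
    """Return the text of an HTML fragment as a list of (lang, text) pairs."""
    parser = LangTextCollector(default_lang)
    parser.feed(html)
    parser.close()
    return parser.chunks
```

An indexer built this way could pick the spelling dictionary per chunk instead of per document. The sketch assumes reasonably well-nested markup; unclosed tags would need extra handling.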

-- 
Maxime Zakharov   http://sochi.net.ru/~maxime/
 Sochi, Russia   http://www.sochi.com/




Re: multilanguage text

2001-05-09 Thread Maxime Zakharov

Danil Lavrentyuk wrote:
 MZ <p lang=en>Testing. This is simple.</p>
 MZ
 MZ will be considered English text, regardless of the document language.
 
 And what if a site has many texts uploaded by users?
 Do I have to manually edit them all, setting lang attributes? :)
 Do I have to demand it from the uploaders? They will not do it.

Users may upload big mega gifs as .html files :) Let's talk about W3C
recommendations.

 What if I have to index texts placed somewhere on the internet, not locally?
 What if a site contains the texts of many books (something like www.lib.ry, for
 example)?

Sometimes, without an explicit language definition, it's impossible to
uniquely select the language for a word.
For example, the word 'test' may be English or German.

-- 
Maxime Zakharov   http://sochi.net.ru/~maxime/
 Sochi, Russia   http://www.sochi.com/




Re: multilanguage text

2001-05-09 Thread Danil Lavrentyuk

[ On Wed, 9 May 2001, Maxime Zakharov wrote: ]

MZ  And what if a site has many texts uploaded by users?
MZ  Do I have to manually edit them all, setting lang attributes? :)
MZ  Do I have to demand it from the uploaders? They will not do it.
MZ
MZ Users may upload big mega gifs as .html files :)

That would be obvious fraud...

MZ Let's talk about W3C recommendations.

... but ignoring a far-away committee's recommendations could be simple
laziness.
Not all software follows all of the recommendations.
Not all users know all of the recommendations. Not all users even think about
using such recommendations.

Text could be converted to HTML from some other text format.
Who, for example, will check texts such as big multi-volume books (like
Amber by Zelazny or Wheel of Time by Jordan, or even bigger) for foreign
phrases? :)

Let's talk about the real world, where we have to index multilanguage texts
without lang attributes.

MZ  What if I have to index texts placed somewhere on the internet, not locally?
MZ  What if a site contains the texts of many books (something like www.lib.ry, for
MZ  example)?
MZ
MZ Sometimes, without an explicit language definition, it's impossible to
MZ uniquely select the language for a word.
MZ For example, the word 'test' may be English or German.

I know.
I think it is feasible (but hard, I see) to make a system which could guess what
the text's language is. It could use two steps, plus an optional third:
1) Create a list of encodings this text could be written in (simply by
testing whether all of the word's characters are alphabetic in this encoding).
Here we could assume that two or more successive foreign words are from the
same language.
2) Check (using ispell tables) all the languages which use encodings from the
list (created above), looking for one where these words are correct.
3) (optional) If more than one language is suitable, select the one that was
selected for the previous phrase.

OK. This method does not guarantee that the selection will always be correct.
But in most cases it will.

Yes, I know, this method is not too quick... But it is better than no
method at all. Anyway, it would be good to be able to turn it off in the
indexer.conf file or by a command line option.
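
The three steps could be sketched in Python roughly as follows. This is a toy illustration under loud assumptions, not the guesser planned for 3.2.x: per-language alphabets stand in for the encoding check of step 1, tiny hand-made word sets stand in for real ispell tables, and the tie-break of step 3 is applied per word rather than per phrase. All names here (ALPHABETS, DICTIONARIES, guess_word_lang, guess_text_langs) are hypothetical.

```python
import string

LATIN = set(string.ascii_lowercase)
CYRILLIC = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")

# Assumption: an alphabet per language replaces the "possible encodings" test.
ALPHABETS = {
    "en": LATIN,
    "de": LATIN | set("äöüß"),
    "ru": CYRILLIC,
}

# Assumption: tiny word sets replace real ispell dictionaries.
DICTIONARIES = {
    "en": {"test", "this", "is", "a", "simple"},
    "de": {"test", "das", "ist", "ein", "einfach"},
    "ru": {"это", "простой", "тест"},
}


def guess_word_lang(word, previous=None):
    """Guess the language of one word; return None if unknown or ambiguous."""
    w = word.lower()
    # Step 1: keep languages whose alphabet covers every character of the word.
    by_alphabet = [lang for lang, abc in ALPHABETS.items() if set(w) <= abc]
    # Step 2: keep languages whose dictionary contains the word.
    by_dict = [lang for lang in by_alphabet if w in DICTIONARIES[lang]]
    if len(by_dict) == 1:
        return by_dict[0]
    # Step 3 (optional): on a tie, prefer the previously selected language.
    if previous in by_dict:
        return previous
    return None


def guess_text_langs(words):
    """Tag each word with a guessed language, carrying the last guess forward."""
    result = []
    previous = None
    for word in words:
        lang = guess_word_lang(word, previous)
        result.append((word, lang))
        if lang is not None:
            previous = lang
    return result
```

With these toy tables, 'test' on its own stays ambiguous between English and German, but after the English words "this is a" it is resolved as English by step 3. A real implementation would also need word stemming through affix files, which this sketch leaves out.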


Danil Lavrentyuk
Communiware.net
Programmer





multilanguage text

2001-05-08 Thread Danil Lavrentyuk

Hello!

Could mnoGoSearch correctly index, for example, English words in Russian
text?
Will it simply treat these words as misspelled and (in the case of
IspellIncorrectFactor 1) use them 'as is' in indexing?


Danil Lavrentyuk
Communiware.net
Programmer
