Re: multilanguage text
[ On Thu, 10 May 2001, Alexander Barkov wrote: ]

AB> > Could mnoGoSearch correctly index, for example, English words in Russian
AB> > text?
AB>
AB> It can.
AB>
AB> > Will it simply treat these words as misspelled and (in the case of
AB> > IspellIncorrectFactor 1) use them 'as is' in indexing?
AB>
AB> Yes. But you may add English ispell files too.

Hmm. If I add English ispell files, will it treat words written in Latin letters as English words when such words occur in a Russian text?

Danil Lavrentyuk
Communiware.net Programmer
___
If you want to unsubscribe send unsubscribe general to [EMAIL PROTECTED]
Re: multilanguage text
The 3.2.x branch will have a language guesser. It is already implemented and works very well for single-language pages, and even for mostly single-language pages. I hope the first release of 3.2.x will be available in May.

Danil Lavrentyuk wrote:

> [ On Wed, 9 May 2001, Maxime Zakharov wrote: ]
>
> MZ> > And what if a site has many texts uploaded by users?
> MZ> > Do I have to manually edit all of them, setting lang attributes? :)
> MZ> > Do I have to demand it from the uploaders? They will not do it.
> MZ>
> MZ> Users may upload huge GIFs as .html files :)
>
> That would be obvious fraud...
>
> MZ> Let's talk about W3C recommendations.
>
> ... whereas ignoring the recommendations of a far-away committee could be
> simple laziness. Not all software follows all of the recommendations. Not all
> users know all of the recommendations. Not all users even think about using
> them. Text could be converted to HTML from some other text format. Who, for
> example, will check for foreign phrases in texts like big multi-volume books
> (such as Amber by Zelazny or The Wheel of Time by Jordan, or even bigger
> ones)? :)
>
> Let's talk about the real world, where we have to index multilanguage texts
> without lang attributes.
>
> MZ> > What if I have to index texts placed somewhere on the Internet, not locally?
> MZ> > What if a site contains the texts of many books (something like www.lib.ry, for
> MZ> > example)?
> MZ>
> MZ> Sometimes, without an explicit language definition, it is impossible to uniquely
> MZ> select the language for a word.
> MZ> For example, the word 'test' may be English or German.
>
> I know. I think it is feasible (though hard, I see) to make a system that
> could guess what the text's language is. It could use two steps, plus an
> optional third:
>
> 1) Create a list of encodings this text could be written in (simply by
>    testing whether all of the word's characters are alphabetic in that
>    encoding). Here we can assume that two or more successive foreign words
>    are from the same language.
> 2) Check (using ispell tables) all the languages that use encodings from the
>    list created above, looking for one in which these words are correct.
> 3) (optional) If more than one language is suitable, select the one that was
>    selected for the previous phrase.
>
> OK. This method does not guarantee that the selection will always be correct,
> but in most cases it will be. Yes, I know, this method is not very quick...
> but it is better than no method at all. In any case, it would be good to be
> able to turn it off in the indexer.conf file or with a command-line option.
>
> Danil Lavrentyuk
> Communiware.net Programmer
Re: multilanguage text
Hi!

Tuesday, May 08, 2001, 6:30:44 PM, you wrote:

DL> Could mnoGoSearch correctly index, for example, English words in Russian
DL> text?

Yes, it should.

DL> Will it simply treat these words as misspelled and (in the case of
DL> IspellIncorrectFactor 1) use them 'as is' in indexing?

It can correctly identify multilanguage documents.

--
Regards, Sergey aka gluke.
Re: multilanguage text
[ On Wed, 9 May 2001, Sergey Kartashoff wrote: ]

SK> DL> Could mnoGoSearch correctly index, for example, English words in Russian
SK> DL> text?
SK>
SK> Yes, it should.
SK>
SK> DL> Will it simply treat these words as misspelled and (in the case of
SK> DL> IspellIncorrectFactor 1) use them 'as is' in indexing?
SK>
SK> It can correctly identify multilanguage documents.

How will it determine which language to use as the secondary one? I.e., how can it guess whether I have, for example, an English, French, or Latin phrase in a Russian text?

Danil Lavrentyuk
Communiware.net Programmer
Re: multilanguage text
Danil Lavrentyuk wrote:

> SK> DL> Will it simply treat these words as misspelled and (in the case of
> SK> DL> IspellIncorrectFactor 1) use them 'as is' in indexing?
> SK>
> SK> It can correctly identify multilanguage documents.
>
> How will it determine which language to use as the secondary one? I.e., how
> can it guess whether I have, for example, an English, French, or Latin phrase
> in a Russian text?

By the HTML lang attribute. For example,

<p lang=en>Testing. This is simple.</p>

will be considered English text, regardless of the document language.

--
Maxime Zakharov http://sochi.net.ru/~maxime/
Sochi, Russia http://www.sochi.com/
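To make the lang-attribute idea concrete, here is a minimal Python sketch of how a parser might attach a language to each piece of text. It is not mnoGoSearch code: the class name LangTextCollector, the default_lang parameter, and the VOID_TAGS set are assumptions made for illustration, and the sketch assumes reasonably well-formed markup in which non-void elements are closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LangTextCollector(HTMLParser):
    """Collect (lang, text) pairs; an element without lang inherits its parent's."""

    def __init__(self, default_lang):
        super().__init__()
        self.stack = [default_lang]  # effective language of each open element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Use the element's own lang if present, otherwise the inherited one.
        self.stack.append(dict(attrs).get("lang") or self.stack[-1])

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append((self.stack[-1], text))


collector = LangTextCollector(default_lang="ru")
collector.feed("<body><p>Это тест.</p><p lang=en>Testing. This is simple.</p></body>")
for lang, text in collector.chunks:
    print(lang, text)
```

An indexer could then pick the ispell dictionary for each chunk from its language tag; text without any lang attribute simply falls back to the document default.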
Re: multilanguage text
Danil Lavrentyuk wrote:

> MZ> <p lang=en>Testing. This is simple.</p>
> MZ>
> MZ> will be considered English text, regardless of the document language.
>
> And what if a site has many texts uploaded by users?
> Do I have to manually edit all of them, setting lang attributes? :)
> Do I have to demand it from the uploaders? They will not do it.

Users may upload huge GIFs as .html files :) Let's talk about W3C recommendations.

> What if I have to index texts placed somewhere on the Internet, not locally?
> What if a site contains the texts of many books (something like www.lib.ry, for
> example)?

Sometimes, without an explicit language definition, it is impossible to uniquely select the language for a word. For example, the word 'test' may be English or German.

--
Maxime Zakharov http://sochi.net.ru/~maxime/
Sochi, Russia http://www.sochi.com/
Re: multilanguage text
[ On Wed, 9 May 2001, Maxime Zakharov wrote: ]

MZ> > And what if a site has many texts uploaded by users?
MZ> > Do I have to manually edit all of them, setting lang attributes? :)
MZ> > Do I have to demand it from the uploaders? They will not do it.
MZ>
MZ> Users may upload huge GIFs as .html files :)

That would be obvious fraud...

MZ> Let's talk about W3C recommendations.

... whereas ignoring the recommendations of a far-away committee could be simple laziness. Not all software follows all of the recommendations. Not all users know all of the recommendations. Not all users even think about using them. Text could be converted to HTML from some other text format. Who, for example, will check for foreign phrases in texts like big multi-volume books (such as Amber by Zelazny or The Wheel of Time by Jordan, or even bigger ones)? :)

Let's talk about the real world, where we have to index multilanguage texts without lang attributes.

MZ> > What if I have to index texts placed somewhere on the Internet, not locally?
MZ> > What if a site contains the texts of many books (something like www.lib.ry, for
MZ> > example)?
MZ>
MZ> Sometimes, without an explicit language definition, it is impossible to uniquely
MZ> select the language for a word.
MZ> For example, the word 'test' may be English or German.

I know. I think it is feasible (though hard, I see) to make a system that could guess what the text's language is. It could use two steps, plus an optional third:

1) Create a list of encodings this text could be written in (simply by testing whether all of the word's characters are alphabetic in that encoding). Here we can assume that two or more successive foreign words are from the same language.
2) Check (using ispell tables) all the languages that use encodings from the list created above, looking for one in which these words are correct.
3) (optional) If more than one language is suitable, select the one that was selected for the previous phrase.

OK. This method does not guarantee that the selection will always be correct, but in most cases it will be. Yes, I know, this method is not very quick... but it is better than no method at all. In any case, it would be good to be able to turn it off in the indexer.conf file or with a command-line option.

Danil Lavrentyuk
Communiware.net Programmer
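
The steps proposed in this message can be sketched in a few lines of Python. This is only an illustration of the heuristic as described, not the 3.2.x language guesser and not mnoGoSearch code. The tiny DICTIONARIES table stands in for real ispell tables, the language-to-encoding pairing is an assumption, and returning None for an ambiguous phrase with no previous language is a choice the message leaves open.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical stand-in for ispell tables: language -> (encoding, known words).
DICTIONARIES = {
    "en": ("latin-1", {"test", "this", "is", "simple"}),
    "de": ("latin-1", {"test", "das", "ist", "einfach"}),
    "ru": ("koi8-r", {"это", "тест", "книга"}),
}


def candidate_encodings(words: list[bytes]) -> set[str]:
    """Step 1: encodings in which every character of every word is a letter."""
    result = set()
    for enc in {enc for enc, _ in DICTIONARIES.values()}:
        try:
            if all(w.decode(enc).isalpha() for w in words):
                result.add(enc)
        except UnicodeDecodeError:
            pass  # the bytes are not valid in this encoding at all
    return result


def matching_languages(words: list[bytes], encodings: set[str]) -> list[str]:
    """Step 2: languages using a candidate encoding in which all words are known."""
    return [
        lang
        for lang, (enc, vocab) in DICTIONARIES.items()
        if enc in encodings and all(w.decode(enc).lower() in vocab for w in words)
    ]


def guess_language(words: list[bytes], previous: Optional[str] = None) -> Optional[str]:
    """Step 3: resolve ties in favour of the language of the previous phrase."""
    langs = matching_languages(words, candidate_encodings(words))
    if previous in langs:
        return previous
    return langs[0] if len(langs) == 1 else None


print(guess_language([b"this", b"is", b"simple"]))
print(guess_language([b"test"], previous="de"))
print(guess_language(["тест".encode("koi8-r")]))
```

The word 'test' alone matches both the English and the German table, so the sketch can only decide when a previous phrase supplies context, which is exactly the ambiguity raised earlier in the thread.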
multilanguage text
Hello!

Could mnoGoSearch correctly index, for example, English words in Russian text? Will it simply treat these words as misspelled and (in the case of IspellIncorrectFactor 1) use them 'as is' in indexing?

Danil Lavrentyuk
Communiware.net Programmer