The 3.2.x branch will have a language guesser. It is already implemented
and works very well for single-language pages, and even for "mostly
single-language" pages. I hope the first release of 3.2.x will be
available in May.



Danil Lavrentyuk wrote:
> 
> [ On Wed, 9 May 2001, Maxime Zakharov wrote: ]
> 
> MZ> > And what if a site has many texts uploaded by users?
> MZ> > Do I have to manually edit all of them, setting "lang" attributes? :)
> MZ> > Do I have to demand it from the uploaders? They will not do it.
> MZ>
> MZ> Users may upload big mega gifs as .html files :)
> 
> It would be an obvious fraud...
> 
> MZ> Let's talk about W3C recommendations.
> 
> ... but ignoring a far-away committee's recommendations could be
> simple laziness.
> Not all software uses all of the recommendations.
> Not all users know all of the recommendations. Not all users even think
> about using such recommendations.
> 
> Text could be converted to HTML from some other text format.
> Who, for example, will check for foreign phrases in texts such as big books
> that consist of many volumes (like "Amber" by Zelazny or "Wheel of Time" by
> Jordan, or even bigger)? :)
> 
> Let's talk about the real world, where we have to index multilanguage texts
> without "lang" attributes.
> 
> MZ> > What if I have to index texts placed somewhere on the Internet, not locally?
> MZ> > What if a site contains texts of many books (something like www.lib.ry, for
> MZ> > example)?
> MZ>
> MZ> Sometimes, without an explicit language definition, it's impossible to
> MZ> uniquely select the language for a word.
> MZ> For example, the word 'test' may be English or German.
> 
> I know.
> I think it is feasible (but hard, I see) to make a system which could guess
> what the text's language is. It could use two steps, plus an optional third:
> 1) Create a list of encodings this text could be written in (simply by
> testing whether all of a word's characters are alphabetic in that encoding).
> Here we could assume that two or more successive "foreign" words are from
> the same language.
> 2) Check (using ispell tables) all the languages which use encodings from
> the list created above, looking for one where these words are correct.
> 3) (optional) If more than one language is suitable, select the one that was
> selected for the previous phrase.
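
The steps quoted above can be sketched in Python roughly as follows. This is
only a minimal illustration under stated assumptions, not the guesser planned
for 3.2.x: WORDLISTS is a tiny hypothetical stand-in for real ispell tables,
ENCODINGS is an assumed language-to-encoding mapping, and the function names
are made up for the example.

```python
# Hypothetical stand-ins for ispell word lists (tiny, for illustration only).
WORDLISTS = {
    "en": {"test", "wheel", "of", "time"},
    "de": {"test", "rad", "der", "zeit"},
    "ru": {"тест", "колесо", "времени"},
}

# Assumed mapping from each language to the encodings its texts may use.
ENCODINGS = {
    "en": ["ascii"],
    "de": ["latin-1"],
    "ru": ["koi8-r", "cp1251"],
}


def candidate_decodings(raw):
    """Step 1: decode with each encoding; keep those where every word is alphabetic."""
    candidates = {}
    for encoding in {e for encs in ENCODINGS.values() for e in encs}:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        words = text.split()
        if words and all(word.isalpha() for word in words):
            candidates[encoding] = [word.lower() for word in words]
    return candidates


def guess_language(raw, previous=None):
    """Return a language code, or None when no unambiguous guess is possible."""
    candidates = candidate_decodings(raw)
    # Step 2: keep languages whose word list contains every word of the phrase.
    matches = []
    for language, wordlist in WORDLISTS.items():
        for encoding in ENCODINGS[language]:
            words = candidates.get(encoding)
            if words and all(word in wordlist for word in words):
                matches.append(language)
                break
    if len(matches) == 1:
        return matches[0]
    # Step 3 (optional): on a tie, reuse the language of the previous phrase.
    if previous in matches:
        return previous
    return None
```

With these toy tables, guess_language("колесо времени".encode("koi8-r"))
returns "ru". The phrase b"test" alone is ambiguous between English and
German and returns None, while guess_language(b"test", previous="de")
returns "de".
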
> 
> OK. This method does not guarantee that the selection will always be
> correct. But in most cases it will.
> 
> Yes, I know, this method is not too quick... But it is better than no
> method at all. Anyway, it would be good to be able to turn it off in the
> indexer.conf file or by a command-line option.
> 
> ----------------
> Danil Lavrentyuk
> Communiware.net
> Programmer
> 
> ___________________________________________
> If you want to unsubscribe send "unsubscribe general"
> to [EMAIL PROTECTED]
