[ On Wed, 9 May 2001, Maxime Zakharov wrote: ]

MZ> > And what if a site has many texts uploaded by users?
MZ> > Do I have to manually edit all of them, setting "lang" attributes? :)
MZ> > Do I have to demand it from the uploaders? They will not do it.
MZ> Users may upload big mega gifs as .html files :)

That would be obvious fraud...

MZ> Let's talk about W3C recommendations.

... but ignoring the recommendations of a far-away committee could be
simple laziness.
Not all software uses all of the recommendations.
Not all users know all of the recommendations. Not all users even think about
using such recommendations.

Text could be converted to HTML from some other text format.
Who, for example, will check such texts for foreign phrases when they are big
multi-volume books (like "Amber" by Zelazny or "Wheel Of Time" by Jordan or
even bigger)? :)

Let's talk about the real world, where we would have to index multilanguage
texts without "lang" attributes.

MZ> > What if I have to index texts placed somewhere in the internet, not locally?
MZ> > What if a site contains texts of many books (something like www.lib.ry, for
MZ> > example)?
MZ> Sometimes, without an explicit language definition, it's impossible to
MZ> uniquely select the language for a word.
MZ> For example, the word 'test' may be English or German.

I know.
I think it is realistic (though hard, I see) to make a system which could guess
what the text's language is. It could use two steps, plus an optional third:
1) Create a list of encodings this text could be written in (simply by
testing whether all of the word's characters are alphabetic in this encoding).
Here we could assume that two or more successive "foreign" words are from the same
2) Check (using ispell tables) all the languages which use encodings from the
list (created above), looking for one where these words are correct.
3) (optional) If more than one language is suitable, select the one that was
selected for the previous phrase.
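
To make the steps above concrete, here is a rough Python sketch. It is only an
illustration under stated assumptions, not code from the indexer: the tiny
WORDLISTS stand in for real ispell tables, the encoding list and the
language-to-encoding map are made up for the example, and the function names
are hypothetical. It also assumes that all words of one phrase share a single
encoding.

```python
# Hypothetical data: tiny word lists instead of real ispell tables.
WORDLISTS = {
    "en": {"test", "the", "book"},
    "de": {"test", "das", "buch"},
    "ru": {"тест", "это", "книга"},
}
# Assumed candidate encodings and which languages may be written in them.
CANDIDATE_ENCODINGS = ["ascii", "latin-1", "koi8-r", "cp1251"]
LANG_ENCODINGS = {
    "en": {"ascii", "latin-1", "koi8-r", "cp1251"},
    "de": {"ascii", "latin-1"},
    "ru": {"koi8-r", "cp1251"},
}


def possible_encodings(raw_words):
    """Step 1: keep encodings in which every word is purely alphabetic."""
    result = []
    for enc in CANDIDATE_ENCODINGS:
        try:
            decoded = [w.decode(enc) for w in raw_words]
        except UnicodeDecodeError:
            continue  # some byte is not valid in this encoding
        if all(word.isalpha() for word in decoded):
            result.append(enc)
    return result


def guess_language(raw_words, previous=None):
    """Steps 2 and 3: dictionary check, then fall back to the previous phrase."""
    matches = []
    for enc in possible_encodings(raw_words):
        words = [w.decode(enc).lower() for w in raw_words]
        for lang, encodings in LANG_ENCODINGS.items():
            if enc not in encodings or lang in matches:
                continue
            # Step 2: every word must be known in this language.
            if all(word in WORDLISTS[lang] for word in words):
                matches.append(lang)
    if len(matches) == 1:
        return matches[0]
    # Step 3 (optional): prefer the language chosen for the previous phrase.
    if previous in matches:
        return previous
    return None  # unknown, or ambiguous with nothing to break the tie


print(guess_language([b"test"]))                    # ambiguous: en or de
print(guess_language([b"test"], previous="de"))     # tie broken by step 3
print(guess_language(["книга".encode("koi8-r")]))   # only Russian matches
```

With this toy data the word 'test' alone stays ambiguous between English and
German, which is exactly the case step 3 is meant to handle.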

OK. This method does not guarantee that the selection will always be correct.
But in most cases it will.

Yes, I know, this method is not too quick... But it is better than no method
at all. In any case, it would be good to be able to turn it off in the
indexer.conf file or by a command line option.

Danil Lavrentyuk
