|well, that's good ;-) but as I said, there is no 100% strict method to guess
|the encoding of what you pass in ... there is something doubtful here about
|the detection, i.e. we have no 100% guarantee that we will hit the input
|encoding:
|http://search.cpan.org/author/JNEYSTADT/cyrillic-1.05/Lingua/DetectCharset.pm
|This routine is implemented using algorithm of statistical analysis of text,
|which was proved to be very efficient and showed around 99.98% accuracy in
|tests.
|
|If we know the input encoding, the conversion afterwards is more or less easy,
|keeping in mind the exceptions for "symbols-out-of-range" ;-)

]- I looked at the module, the guy detects the encoding in a very clever way...
most likely he ran a statistical analysis on some texts (probably Russian, but
who knows, he may have tried all the Cyrillic languages :") ) and every
possible two-letter sequence gets a weight... the more often two letters (one
right next to the other) occur together, the "heavier" they are.. Then, when it
checks a text afterwards, it picks whichever charset gives the higher
probability/weight...

I suppose that if the same thing is done for BG text, it will guess the BG
encoding better... but as far as I know there are no such word "corpora" for
the Bulgarian language... (we have read a bit about linguistics too :") )
If the "corpus" is large enough and covers more subject areas, it really could
reach 99.98% accuracy..
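
To make the two-letter weighting idea concrete, here is a rough sketch in Python. This is only my guess at the general technique, not the actual code of Lingua::DetectCharset: the names train_bigrams, score and detect are made up for illustration, the candidate charset list is an assumption, and the one-sentence "corpus" is a toy stand-in for a real Bulgarian corpus.

```python
from collections import Counter

# Assumed list of candidate Cyrillic charsets (Python codec names).
CANDIDATES = ["cp1251", "koi8-r", "iso8859-5", "cp866"]

# Toy "corpus"; a real one would need to be far larger and more varied.
CORPUS = "това е добре но няма сигурен метод да се отгатне кодирането на текста"

def train_bigrams(sample):
    """Count every pair of adjacent letters in the sample text."""
    letters = sample.lower()
    counts = Counter()
    for a, b in zip(letters, letters[1:]):
        if a.isalpha() and b.isalpha():
            counts[a + b] += 1
    return counts

def score(text, weights):
    """Sum the weights of all adjacent pairs; unknown pairs weigh 0."""
    text = text.lower()
    return sum(weights.get(a + b, 0) for a, b in zip(text, text[1:]))

def detect(raw, weights, candidates=CANDIDATES):
    """Decode raw bytes with each charset and keep the heaviest result."""
    best, best_score = None, -1
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this charset
        s = score(text, weights)
        if s > best_score:
            best, best_score = enc, s
    return best

weights = train_bigrams(CORPUS)
guess = detect("няма сигурен метод".encode("koi8-r"), weights)
```

With such a tiny corpus this only works for text that looks a lot like the corpus itself, which is exactly why the size and coverage of the corpus matter so much for the accuracy.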

raptor
============================================================================
A mail-list of Linux Users Group - Bulgaria (bulgarian linuxers).
http://www.linux-bulgaria.org - Hosted by Internet Group Ltd. - Stara Zagora
To unsubscribe: http://www.linux-bulgaria.org/public/mail_list.html
============================================================================