|and that is good ;-) but as I said, there is no 100% strict method to guess
|the encoding of what you pass in ... the detection step is the doubtful part
|here, i.e. we have no 100% guarantee that we will hit the right input
|encoding:
|http://search.cpan.org/author/JNEYSTADT/cyrillic-1.05/Lingua/DetectCharset.pm
|This routine is implemented using algorithm of statistical analysis of text, 
|which was proved to be very efficient and showed around 99.98% accuracy in 
|tests.
|
|If we know the input encoding, the conversion afterwards is more or less
|easy, keeping in mind the exceptions for "symbols-out-of-range" ;-)
]- I looked at the module, the guy detects the encoding in a very clever way... most
likely he ran a statistical analysis on some texts (probably Russian, though who knows,
he may have tried all the Cyrillic languages :") ) and every possible two-letter
sequence gets a weight... the more often two letters occur next to each other, the
"heavier" the pair is..
Then, when it checks a text, it decodes it under each candidate charset and picks the
one that yields the highest probability/weight...
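
Roughly, that idea could be sketched in Python like this. It is only a guess at the
approach, not the module's actual code: the toy corpus, the candidate list and the
names train, score and detect are all my own assumptions.

```python
from collections import Counter

# Candidate charsets to try; this list is an assumption for the sketch.
CANDIDATES = ["koi8-r", "cp1251", "iso8859-5", "cp866"]

# Toy stand-in for a real corpus; far too small for real use.
CORPUS = (
    "Това е малък примерен текст на български език, който служи за корпус. "
    "Колкото по-често две букви се срещат една до друга, толкова по-тежки са."
)


def letter_pairs(text):
    """Yield every pair of adjacent letters, lowercased."""
    text = text.lower()
    for a, b in zip(text, text[1:]):
        if a.isalpha() and b.isalpha():
            yield a + b


def train(corpus):
    """Give each two-letter sequence a weight equal to its relative frequency."""
    counts = Counter(letter_pairs(corpus))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def score(text, weights):
    """Sum the weights of all adjacent letter pairs; unseen pairs add nothing."""
    return sum(weights.get(pair, 0.0) for pair in letter_pairs(text))


def detect(data, weights, candidates=CANDIDATES):
    """Decode the bytes under each candidate and return the heaviest one."""
    scores = {
        name: score(data.decode(name, errors="replace"), weights)
        for name in candidates
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    weights = train(CORPUS)
    sample = "две букви една до друга".encode("cp1251")
    print(detect(sample, weights))
```

A wrong charset turns the bytes into letter pairs that almost never occur in the
corpus, so its total weight stays low. With a corpus this small the guess is fragile;
the pair table has to come from a lot of real text before it can be trusted.
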
I assume that if the same thing were done for BG text, it would guess the
bg-encoding better... but as far as I know there are no such word "corpora" for the
Bulgarian language... (we have read a little about linguistics too :") )
If the "corpus" is large enough and covers more domains, it really could reach
99.98% accuracy..

raptor

============================================================================
A mail-list of Linux Users Group - Bulgaria (bulgarian linuxers).
http://www.linux-bulgaria.org - Hosted by Internet Group Ltd. - Stara Zagora
To unsubscribe: http://www.linux-bulgaria.org/public/mail_list.html
============================================================================
