Re: [CODE4LIB] unwanted (bogus) characters in marc

stuart yeates Sun, 10 Oct 2010 12:47:47 -0700

Thomas Krichel wrote:

  Ere Maijala writes

On 7.10.2010 15:17, Thomas Krichel wrote:

...

use Encode;                      # provides decode()
use Encode::Guess qw/latin-1/;   # add ISO-8859-1 as a suspect encoding
my $decoded = decode("Guess", $dodgy_input);

  $decoded should then be a UTF-8 string with the utf8 flag on.

Would that work for a predominantly proper utf-8 input with some
"mistakes" thrown in?


  It will try to guess between UTF-8 and ISO-8859-1. This can be done
  because UTF-8 has many invalid byte sequences.  But say if you
  wanted to guess between ISO-8859-1 and ISO-8859-2, you'd be out of
  luck.


Not necessarily.

There are tools such as http://www.let.rug.nl/~vannoord/TextCat/ which
provide very reliable guessing of languages. A minor adaptation might be
needed to make it guess twice, once for each of ISO-8859-1 and
ISO-8859-2, and then you take the highest ranked.
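
A rough sketch of the guess-twice idea, under loud assumptions: this is
not TextCat's interface. toy_score() is a hypothetical stand-in for a
real language guesser (it only measures what share of the non-ASCII
characters are letters), and guess_charset() is a name made up for
illustration. Only decode() from Encode is a real library call here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical scorer standing in for a language guesser such as TextCat:
# the share of non-ASCII characters in the decoded text that are letters.
sub toy_score {
    my ($text) = @_;
    my @non_ascii = $text =~ /([^\x00-\x7F])/g;
    return 0 unless @non_ascii;
    my $letters = grep { /\p{L}/ } @non_ascii;
    return $letters / @non_ascii;
}

# Decode the same bytes once per candidate charset, score each result,
# and return the name of the highest-ranked candidate.
sub guess_charset {
    my ($bytes, $scorer, @candidates) = @_;
    my ($best, $best_score);
    for my $charset (@candidates) {
        my $score = $scorer->(decode($charset, $bytes));
        ($best, $best_score) = ($charset, $score)
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

# Byte 0xB3 is a letter in ISO-8859-2 but a superscript digit in ISO-8859-1.
my $bytes = "z\xB3oty";
print guess_charset($bytes, \&toy_score, 'iso-8859-1', 'iso-8859-2'), "\n";
```

With a scorer this crude, many inputs will tie, since most high bytes
are letters in both charsets. A real language model would presumably
separate them far better, which is the point of using something like
TextCat.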


cheers
stuart
--
Stuart Yeates
http://www.nzetc.org/       New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/     Institutional Repository
