Apologies, I missed the subject line...
Then you might use Unicode property classes in a regex. For instance, $text =~
m/\p{Hiragana}/ matches any Japanese Hiragana character. I have not tested
it, but I suppose /[^\p{Latin}]/ (or the shorter /\P{Latin}/) would match any
non-Latin character. Note that this includes digits, spaces and punctuation,
so you may want to look at letters only. So you find the script that most
characters in a word match and you look for the exceptions. Would that help?
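To make the idea concrete, here is a minimal sketch in Perl, not a tested production script. It assumes the text has already been decoded from UTF-8 into Perl characters, it only knows about a short list of scripts (@SCRIPTS), and the names script_of, odd_characters and the record id rec-001 are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical list of scripts to test for, in order; extend it to fit your data.
my @SCRIPTS = (
    [ Latin    => qr/\p{Latin}/    ],
    [ Greek    => qr/\p{Greek}/    ],
    [ Cyrillic => qr/\p{Cyrillic}/ ],
);

# Return the name of the first listed script a character matches, or 'Other'.
sub script_of {
    my ($char) = @_;
    for my $script (@SCRIPTS) {
        return $script->[0] if $char =~ $script->[1];
    }
    return 'Other';
}

# Return [character, script] pairs for the letters of a word that are not
# in the word's majority script; an empty list if the word is consistent.
sub odd_characters {
    my ($word) = @_;
    my (%count, @letters);
    for my $char (split //, $word) {
        next unless $char =~ /\p{L}/;    # letters only: skip digits, punctuation
        my $script = script_of($char);
        $count{$script}++;
        push @letters, [ $char, $script ];
    }
    return unless @letters;
    # Majority script: highest count, ties broken alphabetically.
    my ($majority) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    return grep { $_->[1] ne $majority } @letters;
}

# Hypothetical record: "Plato" typed with a Greek omicron (U+03BF) at the end.
my %records = ( 'rec-001' => "Plat\x{3BF} wrote dialogues" );

binmode STDOUT, ':encoding(UTF-8)';
for my $id (sort keys %records) {
    for my $word (split ' ', $records{$id}) {
        for my $odd (odd_characters($word)) {
            my ($char, $script) = @$odd;
            printf "%s: U+%04X (%s) in word '%s'\n", $id, ord $char, $script, $word;
        }
    }
}
```

For the sample record this should report the final character of the first word as Greek, together with the record id. Words where two scripts tie, or very short words, will need a rule of your own.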
From: George Milten [mailto:[email protected]]
Sent: Tuesday, 10 February 2015 15:56
To: Kool,Wouter
Cc: [email protected]
Subject: Re: UNICODE character identification
utf-8,
thank you
2015-02-10 16:54 GMT+02:00 Kool,Wouter <[email protected]>:
What encoding is your data in? UTF-8? A single-byte encoding? MARC-8? That
information matters a lot in determining whether your idea would work. If it is
in a single-byte encoding, there is often no way to determine the script a
character belongs to.
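A small sketch of why the encoding matters, assuming the data arrives as raw UTF-8 bytes and using the core Encode module: the script properties only work once the bytes have been decoded into characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "Plat" followed by GREEK SMALL LETTER OMICRON (U+03BF),
# as they might arrive from a file read without a decoding layer.
my $bytes = "Plat\xCE\xBF";

# Undecoded, Perl sees six separate code points below 256,
# so nothing can match \p{Greek}.
my $greek_in_bytes = ($bytes =~ /\p{Greek}/) ? 1 : 0;

# Decoded, the last two bytes become one character that does match.
my $text = decode('UTF-8', $bytes);
my $greek_in_text = ($text =~ /\p{Greek}/) ? 1 : 0;

printf "bytes: length %d, Greek match %d\n", length $bytes, $greek_in_bytes;
printf "text:  length %d, Greek match %d\n", length $text,  $greek_in_text;
```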
Wouter Kool
Metadata Specialist · OCLC B.V.
Schipholweg 99 · P.O. Box 876 · 2300 AW Leiden · The Netherlands
t +31-(0)71-524 6500
[email protected] · www.oclc.org
From: George Milten [mailto:[email protected]]
Sent: Tuesday, 10 February 2015 13:27
To: [email protected]
Subject: UNICODE character identification
Hello friendly folks,
here is what I am trying to do, and I am looking for your help in finding
the cleverest way to achieve it:
We have records that include typos like this: take a word, say Plato, where
the last o was typed with the keyboard set to Greek. So we need
something that would parse all metadata character by character, determine
the script that the majority of the characters in each word belong to,
and return the odd characters, the script they belong to, and the identifier
of the record they were found in, so that we can correct them.
thank you in advance