Character set tests [was MARC::Charset]

2007-03-14 Thread Doran, Michael D
Hi Ashley, Thanks for the info! Trying to keep up with i18n and/or character set stuff is almost a full-time job. > > How are you testing for UTF-8? > > There's a handy Perl regexp on the W3C web site at: > > http://www.w3.org/International/questions/qa-forms-utf-8 > > You'll need to cha

Re: MARC::Charset

2007-03-14 Thread Ashley Sanders
Michael, So, basically, you either need prior knowledge about the actual character encoding used, or you have to test. Testing for UTF-8 is fairly straightforward... How are you testing for UTF-8? There's a handy Perl regexp on the W3C web site at: http://www.w3.org/International/questi
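
As a rough illustration of the "you have to test" approach, here is a minimal Python sketch. It does not use the W3C regexp mentioned above; it swaps in Python's strict UTF-8 decoder, which rejects malformed sequences. The function name looks_like_utf8 is made up for this example. Note that pure ASCII also passes, so a positive result only says the bytes are valid UTF-8, not that they were intended as UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8.

    Hypothetical helper for illustration; it relies on Python's built-in
    decoder rather than the W3C regular expression.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# A MARC-8 style byte string: 0xE1 followed by an ASCII letter is not a
# valid UTF-8 sequence, so the test fails as expected.
print(looks_like_utf8(b"erg\xe1o"))
# The same word properly encoded as UTF-8 passes.
print(looks_like_utf8("erg\u00f2".encode("utf-8")))
```

A failed test tells you the data is not UTF-8, but it does not tell you whether it is MARC-8, Latin-1, or something else; that still needs prior knowledge or further checks.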

RE: MARC::Charset

2007-03-14 Thread Doran, Michael D
Hi Ashley, > I think &#12345; is now legal in MARC-8 to indicate a > Unicode character that isn't in the MARC-8 repertoire. Yes, that's also my understanding [1,2], though I've not personally come across any records yet that use that method. (Although not being a cataloger, I don't routinely exa

RE: MARC::Charset

2007-03-14 Thread Doran, Michael D
Hi Henri-Damien, > And any LOWERCASE DIGRAPH AE or UPPERCASE DIGRAPH AE or > LOWERCASE DIGRAPH OE is not well encoded. Encoding is > **assumed** to be latin1 translated into utf-8 in the > catalogue I am working on but appears respectively µ, ¥,¶ > in biblios. hex MARC-8
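
The symptom described here (µ, ¥, ¶ showing up where digraphs were expected) is what you would see if single MARC-8 bytes were read as Latin-1. The sketch below compares the two readings for the three bytes involved. The table is hand-written for this example and covers only these three characters; the names MARC8_DIGRAPHS and latin1_vs_marc8 are hypothetical, not part of MARC::Charset.

```python
# Illustrative subset of the MARC-8 (ANSEL) extended Latin set:
# byte value -> intended Unicode character.
MARC8_DIGRAPHS = {
    0xA5: "\u00c6",  # UPPERCASE DIGRAPH AE
    0xB5: "\u00e6",  # LOWERCASE DIGRAPH AE
    0xB6: "\u0153",  # LOWERCASE DIGRAPH OE
}


def latin1_vs_marc8(byte: int) -> tuple:
    """Return (character if read as Latin-1, character if read as MARC-8)."""
    as_latin1 = bytes([byte]).decode("latin-1")
    as_marc8 = MARC8_DIGRAPHS[byte]
    return as_latin1, as_marc8


for b in sorted(MARC8_DIGRAPHS):
    wrong, intended = latin1_vs_marc8(b)
    print(f"0x{b:02X}: Latin-1 reading {wrong!r}, MARC-8 reading {intended!r}")
```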

Re: MARC::Charset

2007-03-14 Thread Ashley Sanders
Your MARC records appear to be encoded in MARC-8 as evidenced by "ergáo" in which the combining accent character comes before the character to be modified. I.e. the byte string that displays as "ergáo" in your email would display as "ergò" (with a Latin small letter o with grave) in a MARC-8 a
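
To make the ordering point concrete, here is a small Python sketch of the "ergáo" versus "ergò" case. It is not a MARC-8 decoder: the table holds only two combining bytes (0xE1 grave, 0xE2 acute), everything else above ASCII is rejected, and the names MARC8_COMBINING and decode_marc8_sketch are invented for this example. The idea is simply that MARC-8 puts the combining mark before its base letter, while Unicode puts it after.

```python
import unicodedata

# Illustrative subset only: MARC-8 combining byte -> Unicode combining mark.
MARC8_COMBINING = {
    0xE1: "\u0300",  # combining grave accent
    0xE2: "\u0301",  # combining acute accent
}


def decode_marc8_sketch(data: bytes) -> str:
    """Decode ASCII plus the two combining bytes above, reordering marks."""
    out = []
    pending = []  # combining marks waiting for their base character
    for b in data:
        if b in MARC8_COMBINING:
            pending.append(MARC8_COMBINING[b])
        elif b < 0x80:
            out.append(chr(b))
            out.extend(pending)  # Unicode order: base first, then marks
            pending.clear()
        else:
            raise ValueError(f"byte 0x{b:02X} is outside this sketch's table")
    if pending:
        raise ValueError("combining mark with no base character")
    # Compose base + mark into a single precomposed character where possible.
    return unicodedata.normalize("NFC", "".join(out))


raw = b"erg\xe1o"
print(raw.decode("latin-1"))     # how a Latin-1 mail client shows the bytes
print(decode_marc8_sketch(raw))  # how a MARC-8 aware application reads them
```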

Re: MARC::Charset

2007-03-14 Thread Henri-Damien LAURENT
Doran, Michael D wrote: > Hi Henri, > > Although in my email client, the character in question appears as a MICRO > SIGN ("µ"), I am assuming that it is actually meant to be a LOWERCASE DIGRAPH > AE ("æ") since that is consistent with the Latin vernacular text in your > record. In MARC-8,