Re: [CODE4LIB] more on MARC char encoding

2012-04-19 Thread Deng, Sai
Hi list,
I am a metadata librarian but not a programmer, so I'm sorry if my question seems 
naïve. We use an XSLT stylesheet to transform some harvested DC records from 
DSpace to MARC in MarcEdit, and then export them to OCLC.
Some characters do not display correctly and need manual editing, for example:
In MarcEditor | Transferred to OCLC | Edit in OCLC
Bayes’ theorem | Bayes⁰́₉ theorem | Bayes' theorem
―it won‘t happen here‖ attitude | ⁰́₅it won⁰́₈t happen here⁰́₆ attitude | it won't happen here attitude
“Generation Y” | ⁰́₋Generation Y⁰́₊ | Generation Y
listeners‟ evaluations | listeners⁰́Ÿ evaluations | listeners' evaluations
high school – from | high school ⁰́₃ from | high school – from
Co₀․₅Zn₀․₅Fe₂O₄ | Co²́⁰⁰́Þ²́⁵Zn²́⁰⁰́Þ²́⁵Fe²́²O²́⁴ | Co0.5Zn0.5Fe2O4?
μ | Îơ | μ
Nafion® | Nafion℗ʼ | Nafion®
Lévy | L©♭vy | Lévy
43±13.20 years | 43℗ł13.20 years | 43±13.20 years
12.6 ± 7.05 ft∙lbs | 12.6 ℗ł 7.05 ft⁸́₉lbs | 12.6 ± 7.05 ft•lbs
‘Pouring on the Pounds' | ⁰́₈Pouring on the Pounds' | 'Pouring on the Pounds'
k-ε turbulence | k-Îæ turbulence | k-ε turbulence
student—neither parents | student⁰́₄neither parents | student-neither parents
Λ = M – {p1, p2,…,pκ} | Î₎ = M ⁰́₃ {p1, p2,⁰́Œ,pÎð} | ? (won’t save)
M = (0, δ)x × Y | M = (0, Îþ)x ©₇ Y | ?
100° | 100℗ð | 100⁰
(α ≥16º) | (Îł ⁹́Æ16℗ð) | (α=16⁰)
naïve | na©¯ve | naïve

To deal with this, we normally replace a limited number of characters in 
MarcEditor first and then do the compiling and transfer. For example, we replace 
’ with ', replace ‟ with ', and strip out “ and ”. I am not sure about the right 
and efficient way to solve this problem. I see that the XSLT stylesheet specifies 
encoding="UTF-8". Is there a systematic way to make the characters transform and 
display correctly? Thank you for your suggestions and feedback!

Sophie

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Tod 
Olson
Sent: Tuesday, April 17, 2012 10:13 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 
and MARC21

In practice it seems to mean UTF-8. At least I've only seen UTF-8, and I can't 
imagine the code that processes this stuff being safe for UTF-16 or UTF-32. All 
of the offsets are byte-oriented, and there's too much legacy code that makes 
assumptions about null-terminated strings.
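
For concreteness, here is a minimal sketch (mine, not from this thread) of what "byte-oriented" means: the ISO 2709 leader and directory fields are byte counts, so they only line up if the record is read as raw bytes. The file name is a placeholder.

# Illustrative Ruby sketch; read the record as raw bytes, not characters.
raw = File.open("record.mrc", "rb") { |f| f.read }
record_length = raw.byteslice(0, 5).to_i   # leader 00-04: total record length, in bytes
encoding_flag = raw.byteslice(9, 1)        # leader 09: ' ' = MARC-8, 'a' = UCS/Unicode (MARC21)
base_address  = raw.byteslice(12, 5).to_i  # leader 12-16: base address of data, in bytes
# Directory entry lengths and starting positions are byte counts too, which is
# why multi-byte UTF-8 characters don't break them, but counting "characters"
# instead of bytes would.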

-Tod

On Apr 17, 2012, at 6:55 PM, Jonathan Rochkind wrote:

 Okay, forget XML for a moment, let's just look at marc 'binary'.
 
 First, for Anglophone-centric MARC21.
 
 The LC docs don't actually say quite what I thought about leader byte 09, 
 used to advertise encoding:
 
 
 a - UCS/Unicode
 Character coding in the record makes use of characters from the Universal 
 Coded Character Set (UCS) (ISO 10646), or Unicode™, an industry subset.
 
 
 
 That doesn't say UTF-8. It says UCS or Unicode. What does that actually 
 mean?  Does it mean UTF-8, or does it mean UTF-16 (closer to what used to be 
 called UCS I think?).  Whatever it actually means, do people violate it in 
 the wild?
 
 
 
 Now we get to non-Anglophone-centric MARC, all of which I think is ISO 2709?  
 A standard which of course is not open access, so I can't get a copy to see what 
 it says.
 
 But leader 09 being used for 

Re: [CODE4LIB] Archivists' Toolkit: Adding Digital Objects via MySQL

2012-04-19 Thread Al Matthews
Hi. Is there a reason not to attempt this instead through the CLI?

Al Matthews, Software Dev,
Atlanta University Center

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Rosalyn Metz 
[rosalynm...@gmail.com]
Sent: Wednesday, April 18, 2012 9:23 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Archivists' Toolkit: Adding Digital Objects via MySQL

Hi Everyone,

I posted this over on the Archivists' Toolkit listserv and got no response
(yet), so I thought I might try here as well.

I have a large quantity (around 300+) of digital objects that I need to add
to Archivists' Toolkit.  I think I've figured out what queries I need to
run in order to do this in MySQL (rather than the interface) but I wanted
to get opinions from the peanut gallery before trying it out on my test
instance.

It seems that there are actually two insert queries that need to be run
when creating a Digital Object.  They are:

insert into ArchDescriptionInstances
(instanceType, resourceComponentId, resourceId, parentResourceId,
instanceDescriminator, archDescriptionInstancesId)
values
('Digital object', 336673, null, 543, 'digital', 22567003)


and...

insert into DigitalObjects
(version, lastUpdated, created, lastUpdatedBy, createdBy, title,
dateExpression, dateBegin, dateEnd, languageCode, restrictionsApply,
eadDaoActuate, eadDaoShow, metsIdentifier, objectType, label, objectOrder,
componentId, parentDigitalObjectId, archDescriptionInstancesId,
repositoryId)
values
(0, '2012-04-17 12:05:15', '2012-04-17 12:05:15', 'username', 'username',
'title', '1938-1959', null, null, '', 0, 'onRequest', 'new', '678.1829',
'text', '', 0, '', null, 22567003, 1)


There also appear to be some update queries as well, but I'm guessing those
are less important (please correct me if I'm wrong).  Has anyone tried
to do this in the past? If so, do you have scripts that will create Digital
Objects that you wouldn't mind sharing?  Is there anything you
think I should know before testing this out in my test instance of AT?  Any
caveats for me?

Any help anyone can provide would be greatly appreciated.

Thanks,
Rosalyn


Re: [CODE4LIB] Archivists' Toolkit: Adding Digital Objects via MySQL

2012-04-19 Thread Rosalyn Metz
Al,

Looking at the CLI quickly, it looks like it's related to batch exporting.
 I'm trying to quickly create 1,500+ new digital object records in AT (i.e.,
go into a resource, click new instance, choose digital object, add the
title, add the METS identifier, click save).

If I misread what the CLI does, then I'm all ears on how I could use it to
help.

Thanks!
Rosalyn



On Wed, Apr 18, 2012 at 11:13 AM, Al Matthews amatth...@auctr.edu wrote:

 Hi. Is there a reason not to attempt this instead through the CLI?



[CODE4LIB] NYC code4lib meetup: next Weds April 25

2012-04-19 Thread Yitzchak Schaffer

Hello all,

The code4lib-nyc chapter in conjunction with METRO is holding our 
somewhat-quarterly jam session:


next Wednesday, April 25 10am-noon at the METRO Training Center,
57 E 11th Street, NYC.

Come talk about your projects, and find out what everybody's working on! 
Folks without technical background are welcome to join us.


No charge; please register at
http://metro.org/events/144/

--
Yitzchak Schaffer
Systems Manager
Touro College Libraries
212.742.8770 ext. 2432
http://www.tourolib.org/

Access Problems? Contact systems.libr...@touro.edu


Re: [CODE4LIB] more on MARC char encoding

2012-04-19 Thread Reese, Terry
Actually -- the issue isn't one of MARC8 versus UTF8 (since this data is being 
harvested from DSpace and is UTF8 encoded).  It's actually an issue with user-entered 
data -- specifically, smart quotes and the like (which tend to be inserted by 
default on Windows).  These values obviously are not in the MARC8 character set and 
cause problems for many who transform user-entered data from XML to MARC.  
If you are sticking with a strictly UTF8-based system, there generally are no 
issues because these are valid characters.  If you move them into a system 
where the data needs to be represented in MARC -- then you have more problems.  

We do a lot of harvesting, and because of that we run into these types of 
issues: data that is in UTF8, but has characters not represented in 
MARC8, gets moved into Connexion and some of that data is flattened.  Given the 
wide range of data not in the MARC8 set that can show up in UTF8, it's not a 
surprise that this happens.  My guess is that you could add a template to 
your XSLT translation that filters the most common forms of these 
smart quotes/values and replaces them with the more standard values.  
Likewise, if there was a great enough need, I could provide a canned cleaner in 
MarcEdit that could fix many of the most common varieties of these smart 
quotes/values.  
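
For illustration only (this is not MarcEdit's cleaner, and it's sketched in Ruby rather than as an XSLT template), the substitution being described amounts to a mapping along these lines; the character list is just a guess at the usual Windows "smart" punctuation:

# Illustrative sketch: map common "smart" punctuation to plain ASCII
# equivalents that survive a trip through MARC-8.
SMART_TO_PLAIN = {
  "\u2018" => "'",  "\u2019" => "'",   # curly single quotes
  "\u201C" => '"',  "\u201D" => '"',   # curly double quotes
  "\u201F" => "'",                     # double high-reversed-9 quote (used as an apostrophe above)
  "\u2013" => "-",  "\u2014" => "--",  # en / em dash
  "\u2026" => "...",                   # ellipsis
  "\u00A0" => " "                      # no-break space
}

def flatten_smart_punctuation(value)
  SMART_TO_PLAIN.inject(value) { |s, (from, to)| s.gsub(from, to) }
end

flatten_smart_punctuation("Bayes\u2019 theorem")   # => "Bayes' theorem"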

--TR

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
Jonathan Rochkind
Sent: Thursday, April 19, 2012 11:13 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] more on MARC char encoding

If your records are really in MARC8 not UTF8, your best bet is to use a tool to 
convert them to UTF8 before hitting your XSLT.

The open source 'yaz' command line tools can do it for Marc21.

The Marc4J package can do it in Java, and probably works for any MARC variant, 
not just Marc21.

Char encoding issues are tricky. You might want to first figure out if your 
records are really in Marc8 (hence the problems), or if instead they illegally 
contain bad data or data in some other encoding (Latin-1).

Char encoding is a tricky topic, you might want to do some reading on it in 
general. The Unicode docs are pretty decent.

Re: [CODE4LIB] more on MARC char encoding

2012-04-19 Thread Jonathan Rochkind

Ah, thanks Terry.

That canned cleaner in MarcEdit sounds potentially useful -- I'm in a 
continuing battle to keep the character encoding in our local marc 
corpus clean.


(The real blame here is on cataloger interfaces that let catalogers save 
data containing bytes that are illegal for the character set it's being saved as, 
and/or that display the data back to the cataloger using a translation that 
lets it show up as expected even though it is _wrong_ for the 
character set being saved as.  Connexion is theoretically the Rolls-Royce 
of cataloger interfaces; does it do this? Gosh, I hope not.)



Re: [CODE4LIB] more on MARC char encoding

2012-04-19 Thread LeVan,Ralph
We see Unicode data pasted into MARC8 records all the time.  It happens enough 
that my MARC8-Unicode converter takes a second look at illegal MARC8 bytes and 
tries a UTF-8 encoding as well.
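
A crude Ruby rendering of that "second look" idea (mine, not Ralph's converter): if bytes that are illegal as MARC-8 happen to form valid UTF-8, treat them as pasted-in Unicode.

# Hypothetical sketch: decide whether suspect bytes in a nominally MARC-8
# field look like pasted-in UTF-8.
def pasted_utf8?(bytes)
  candidate = bytes.dup.force_encoding("UTF-8")
  candidate.valid_encoding? && !candidate.ascii_only?
end

field = "Bayes\xE2\x80\x99 theorem".force_encoding("ASCII-8BIT")
pasted_utf8?(field)   # => true: the high-bit bytes form valid UTF-8, so decode
                      #    the field as UTF-8 rather than MARC-8
# Caveat: some legitimate MARC-8 multi-byte sequences could also pass this
# test, so a production converter needs to be smarter than this heuristic.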

Ralph


Re: [CODE4LIB] more on MARC char encoding

2012-04-19 Thread Jonathan Rochkind

On 4/19/2012 3:23 PM, LeVan,Ralph wrote:

We see Unicode data pasted into MARC8 records all the time.  It happens enough 
that my MARC8-Unicode converter takes a second look at illegal MARC8 bytes and 
tries a UTF-8 encoding as well.


Right, I see it too. I'm arguing that this means cataloger entry tools, the 
tools catalogers are using when they paste that stuff in, are not 
giving the cataloger sufficient feedback on their entry. Flag 
completely illegal byte sequences in the output encoding and don't let 
them be saved; display cataloger input back _as 
appropriate for the current encoding_, so they get immediate visual 
feedback if they're entering bytes that don't mean what they think for 
the operative output encoding.


I think it's possible _no_ cataloger interfaces actually do this. 
(although if any do, I bet it's MarcEdit).


If Connexion doesn't, for interactive cataloger entry, it'd be awfully 
nice if it did.


Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-19 Thread Robert Haschart

On 4/18/2012 12:08 PM, Jonathan Rochkind wrote:

On 4/18/2012 11:09 AM, Doran, Michael D wrote:
I don't believe that is the case.  Take UTF-8 out of the picture, and 
consider the MARC-8 character set with its escape sequences and 
combining characters.  A character such as an n with a tilde would 
consist of two bytes.  The Greek small letter alpha, if invoked in 
accordance with ANSI X3.41, would consist of five bytes (two bytes 
for the initial escape sequence, a byte for the character, and then 
two bytes for the escape sequence returning to the default character 
set).


ISO 2709 doesn't care how many bytes your characters are. The 
directory and offsets and other things count bytes, not characters. 
(which was, in my opinion, the _right_ decision, for once with marc!)


How bytes translate into characters is not a concern of ISO 2709.

The majority of non-7-bit-ASCII encodings will have chars that are 
more than one byte, either sometimes or always. This is true of MARC8 
(some chars), UTF8 (some chars), and UTF16 (all chars), all of them. 
(It is not true of Latin-1 though, for instance, I don't think).


ISO 2709 doesn't care what char encodings you use, and there's no 
standard ISO 2709 way to determine what char encodings are used for 
_data_ in the MARC record. ISO 2709 does say that _structural_ 
elements like field names, subfield names, the directory itself, 
seperator chars, etc, all need to be (essentially, over-simplifying) 
7-bit-ASCII. The actual data itself is application dependent, 2709 
doesn't care, and 2709 doesn't give any standard cross-2709 way to 
determine it.


That is my conclusion at the moment, helped by all of you all in this 
thread, thanks!


The conclusion that I came to in the work I have done on marc4j (which 
is used heavily by SolrMarc) is that for any significant processing of 
Marc records, the only solution that makes sense is to translate the 
record data into Unicode characters as it is being read in.  Of course, 
as you and others have stated, determining what the data actually is, in 
order to correctly translate it to Unicode, is no easy task.  The leader 
byte that merely indicates "is UTF-8" or "is not UTF-8" is wrong often 
enough in the real world that it is of little value when it indicates 
"is UTF-8", and of even less value when it indicates "is not UTF-8".


Significant portions of the code I've added to marc4j deal with trying 
to determine what the encoding of that data actually is and trying to 
translate the data correctly into Unicode even when the data is incorrect.


You also argued in another message that cataloger entry tools should 
give feedback to help the cataloger not create errors.  I agree.  I 
think one possible step towards this would be that the editor must work 
in Unicode, irrespective of the format the underlying system 
expects the data to be in.  If the underlying system expects MARC8, then the 
save-as process should be able to translate the data into MARC8 on 
output.


-Robert Haschart


[CODE4LIB] ruby-marc, better ruby 1.9 char encoding support, testers wanted

2012-04-19 Thread Jonathan Rochkind
I have implemented fairly complete and robust support for 
character encodings in ruby-marc when reading 'binary' marc under ruby 1.9.


It's currently in a git branch, not yet released, and not yet in git 
master. https://github.com/ruby-marc/ruby-marc/tree/char_encodings


If anyone who uses this (or doesn't) has a chance to beta test it, it 
would be appreciated. One way to test: check out with git, switch to the 
'char_encodings' branch, and `rake install` to install it as a gem on your 
system.  These changes should _only_ affect use under ruby 1.9, and only 
affect reading 'binary' (ISO 2709) marc.


The new functionality is pretty extensively covered by automated tests, 
but there are some weird and complex interactions that can occur 
depending on exactly what you're doing, bugs are possible. It was 
somewhat more complicated than one might expect to implement a complete 
solution here, in part because we _do_ have international users who use 
ruby-marc, with encodings that are neither MARC8 nor UTF8, and in fact 
non-MARC21.


If any of the other committers (or anyone else) wants to code review, 
you are welcome to.


POSSIBLE BACKWARDS INCOMPAT

Some previous 0.4.x versions, when running under ruby 1.9 only, would 
automatically _transcode_ non-unicode encodings to UTF-8 for you under 
the hood. The new version no longer does so automatically (although you 
can ask it to). It was not tenable to support that backwards compatibly.


Everything else _ought_ to be backwards compatible with previous 0.4.x 
ruby-marc under ruby 1.9, fixing many problems.


NEW FEATURES

All applying to ruby 1.9 only, and to reading binary MARC only.

* Do a pretty good job of setting encodings properly for your ruby 
environment, especially under standard UTF-8 usage.


* You _can_ and _do have to_ provide an argument for reading non-UTF8 
encodings. (but sadly no support for marc8).


* You can ask MARC::Reader to transcode to a different encoding when 
loading marc.


* You can ask MARC::Reader to replace bytes that are illegal in the 
believed source encoding with a replacement character (or the empty 
string) to avoid ruby invalid UTF-8 byte exceptions later, and 
sanitize your input.


New features documented in inline comments, see at:
http://rubydoc.info/github/ruby-marc/ruby-marc/MARC/Reader
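
For a quick feel for it, usage looks roughly like this (option names as I read them from the branch's inline docs; they may shift before release, and the file names and cp866 example are placeholders):

require 'marc'

# Reading ordinary UTF-8 binary MARC under ruby 1.9: should just work.
reader = MARC::Reader.new("records.mrc")

# Reading some other encoding: say so explicitly, optionally transcode to
# UTF-8, and optionally replace illegal bytes instead of blowing up later.
reader = MARC::Reader.new("records-cp866.mrc",
                          :external_encoding => "cp866",
                          :internal_encoding => "UTF-8",
                          :invalid           => :replace,
                          :replace           => "")

reader.each { |record| puts record['245'] }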

I had trouble making the docs concise, sorry; I've been pounding 
my head against this stuff so much, and realizing how complicated it ends up 
being, that I wasn't sure what to leave out.


[CODE4LIB] Job: Head of Metadata Services at Georgetown University

2012-04-19 Thread jobs
Head of Metadata Services

  
Georgetown University Library is seeking a dynamic, forward-thinking,
innovative, energetic and team-oriented person to serve as Head of the Metadata
Services Unit within the Technical Services Department.

  
The successful candidate will have overall responsibility for providing
innovative leadership, vision, planning, and supervision for
cataloging and metadata services. The incumbent will set
priorities; allocate resources; develop plans, policies and
practices within the unit/department; supervise operations
for original and copy cataloging of print, multi-media resources, special
collections/rare books, electronic monographs, serials, and databases using
MARC or other metadata formats; oversee physical processing functions; provide
leadership for knowledgeable staff in an environment of anticipated change;
create a positive work environment; deliver digital initiatives support; monitor
national and international trends in metadata creation and direct on-going
review and revision of library-wide metadata/cataloging policies and procedures;
serve as the resource person for all Library staff, answering inquiries
and providing interpretations on existing and emerging metadata standards and
rules; collaborate and work with other library units to create metadata for
digital and special collections; oversee the Library's participation in
cooperative metadata endeavors such as NACO; serve as a member of the Technical
Services Department's Management Team; and serve on library- and university-wide
committees, task forces and initiatives as required. This position reports
directly to the Head of Technical Services. Directly reporting to this
position are 2 catalogers and 1 Receiving/Copy Cataloging Supervisor, 4 indirect
reports and 1-3 student(s). Additional indirect reporting may also include
staff that perform metadata creation work within other departments, copy
cataloging, and physical processing. Work is performed according to priorities
set by the Department Head and within guidelines and procedures established for
the Department.

  
Qualifications:

  
The candidate must have an ALA-accredited MLS degree and at least 2 years of
progressively increasing supervisory/management/leadership
experience, along with demonstrated knowledge of and experience
with the provision of metadata/cataloging services,
including those related to digital initiatives, within
an academic or research library setting. The candidate must
demonstrate excellent verbal and written skills. Experience
working with metadata creation for institutional repositories is highly
preferred. Working knowledge of MARC21 and non-MARC metadata
schema, including but not limited to metadata formats such
as Dublin Core, EAD, METS, MODS, OAI, and XML, is required. Familiarity with data
interchange standards (e.g., OAI-PMH); knowledge of the
semantic web and linked data; experience with digital
content management systems such as DSpace and
ContentDM; knowledge of current standards such as AACR2,
LCSH, LC Classification, NACO and forthcoming changes with FRBR, RDA and MARC;
and emerging technologies in cataloging services, including
those related to digital libraries and special collections,
is highly preferred.

  
Salary/Benefits/Rank:

  
Salary commensurate with experience. Comprehensive benefits
package including 21 days paid leave per year; medical;
TIAA/CREF; tuition assistance. This is a 12-month,
Academic/Administrative Professional (AAP) appointment.
Apply online at www.library.georgetown.edu/employment.
Review of applications begins immediately and continues until filled.
Georgetown University is an Equal Opportunity, Affirmative
Action Employer.



Brought to you by code4lib jobs: http://jobs.code4lib.org/job/898/