[CODE4LIB] MassLNC RFP Notice of Deadline Extension

2011-03-31 Thread Kathy Lussier
Hi all,

Please excuse any cross-postings. The Massachusetts Library Network
Cooperative is extending the deadline for its Request for Proposals (RFP)
for Evergreen enhancements. The new deadline is 5 p.m. (EDT) today (March
31, 2011). The RFP is available at http://masslnc.cwmars.org/node/2301 and
the responses to initial vendor questions are available at
http://masslnc.cwmars.org/node/2324. Any responses should be sent to
kluss...@masslnc.org by 5 p.m. (EDT) today.

Thank you.

-
Kathy Lussier
Project Coordinator
Massachusetts Library Network Cooperative
(508) 756-0172
(508) 755-3721 (fax)
kluss...@masslnc.org
IM: kmlussier (AOL & Yahoo)
Twitter: http://www.twitter.com/kmlussier
 
 


[CODE4LIB] regexp for LCC?

2011-03-31 Thread Jonathan Rochkind
Does anyone have a good regular expression that will match all legal LC 
Call Numbers from the LC Classified Schedule, but will generally not 
match things that could not possibly be an LC Call Number from the LC 
Classified Schedule?


In particular, I need it to NOT match an MLC call number, which is an 
LC-assigned call number that shows up in an 050 with no way to 
distinguish it based on indicators, but isn't actually from the LC 
Schedules.  Here's an example of an MLC call number:


MLCS 83/5180 (P)

Hmm, maybe all MLC call numbers begin with "MLC"; okay, I guess I can 
exclude them just like that. But it looks like there are also OTHER 
things that can show up in the 050 but aren't actually from the 
classified schedule; the OCLC documentation even contains an example: 
"Microfilm 19072 E".


What a mess, huh?  So, yeah, regex anyone?

[You can probably guess why I care if it's from the LC Classified 
Schedule or not].


Re: [CODE4LIB] regexp for LCC?

2011-03-31 Thread Tod Olson
Check the regexp that Google uses in their call number normalization:

http://code.google.com/p/library-callnumber-lc/wiki/Home

You may want to remove the prefix part, and allow for a fourth cutter.

The folks at UNC pointed me to this a few months ago.

-Tod

Tod Olson t...@uchicago.edu
Systems Librarian
University of Chicago Library


Re: [CODE4LIB] regexp for LCC?

2011-03-31 Thread Jonathan Rochkind

Thanks, that looks good!

It's hosted on Google Code, but I don't think that code is anything 
Google uses; it looks like it's from our very own Bill Dueber.


Re: [CODE4LIB] regexp for LCC?

2011-03-31 Thread Jonathan Rochkind
Except now I wonder if those annoying MLCS call numbers might actually 
be properly MATCHED by this regex, when I need 'em excluded. They are 
annoyingly _similar_ to a classified call number. Well, one way to find out.


And the reason this matters is to try and use an LCC to map to a 
'discipline' or other broad category, either directly from the LCC 
schedule labels, or using a mapping like umich's: 
http://www.lib.umich.edu/browse/categories/


But if it's not really an LCC at all, and you try to map it, you'll get 
bad postings.
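
As a rough illustration of the mapping idea above, here is a minimal sketch that takes the leading class letters of a call number and looks up the top-level LCC class. The function name, the regex, and the abbreviated labels are assumptions made for this example; it is not the umich mapping and it does not validate the rest of the call number.

```python
import re

# Top-level LCC classes, labels abbreviated for the example.
LCC_MAIN_CLASSES = {
    "A": "General Works",
    "B": "Philosophy. Psychology. Religion",
    "C": "Auxiliary Sciences of History",
    "D": "World History",
    "E": "History of the Americas",
    "F": "History of the Americas",
    "G": "Geography. Anthropology. Recreation",
    "H": "Social Sciences",
    "J": "Political Science",
    "K": "Law",
    "L": "Education",
    "M": "Music",
    "N": "Fine Arts",
    "P": "Language and Literature",
    "Q": "Science",
    "R": "Medicine",
    "S": "Agriculture",
    "T": "Technology",
    "U": "Military Science",
    "V": "Naval Science",
    "Z": "Bibliography. Library Science",
}

# Assumption: a classified call number starts with 1-3 letters followed
# (after optional whitespace) by a digit. Everything after that is ignored.
LEADING_CLASS = re.compile(r"\s*([A-Za-z]{1,3})\s*\d")


def broad_category(call_number):
    """Return a top-level LCC label, or None if the value does not look classified."""
    match = LEADING_CLASS.match(call_number)
    if match is None:
        return None
    first_letter = match.group(1).upper()[0]
    # I, O, W, X and Y are not in the table, so they fall through to None.
    return LCC_MAIN_CLASSES.get(first_letter)
```

With this sketch, "MLCS 83/5180 (P)" and "Microfilm 19072 E" both come back as None because four or more letters precede the first digit, so they would not be posted to any category.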


Re: [CODE4LIB] regexp for LCC?

2011-03-31 Thread Keith Jenkins
The Google Code regex looks like it will accept any 1-3 letters at the
start of the call number.  But LCC has no I, O, W, X, or Y
classifications.

So you might want to use something more like ^[A-HJ-NP-VZ] at the
start of the regex.

Also, there are only a few major classifications that use three
letters.  Like DJK, and several in the Ks.  I'm not sure, but there
might be others.
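
A minimal sketch of that suggestion, assuming a deliberately simplified call number shape (1-3 class letters, a class number with optional decimal, and an unchecked remainder). The pattern and the function name are illustrative only; this is not the regex from the Google Code project, and it does not check which two- and three-letter combinations actually exist in the schedules.

```python
import re

# Assumption: simplified shape of a classified LC call number.
LCC_SHAPE = re.compile(
    r"""^
    (?P<letters>[A-HJ-NP-VZ][A-Z]{0,2})   # first letter excludes I, O, W, X, Y
    \s*
    (?P<number>\d{1,4}(?:\.\d+)?)         # class number, optional decimal part
    (?P<rest>.*)$                         # cutters, dates, etc. (not validated)
    """,
    re.VERBOSE,
)


def looks_like_lcc(call_number):
    """Return True if the value has the rough shape of a classified LCC number."""
    candidate = call_number.strip().upper()
    # Explicitly exclude the MLC prefix seen in the thread's example.
    if candidate.startswith("MLC"):
        return False
    return LCC_SHAPE.match(candidate) is not None


if __name__ == "__main__":
    for sample in ["QA76.73.P98 L88 2011", "DJK 45", "MLCS 83/5180 (P)",
                   "Microfilm 19072 E", "W 100"]:
        print(sample, "->", looks_like_lcc(sample))
```

The explicit MLC check is redundant for "MLCS 83/5180 (P)" under this particular pattern (a fourth letter before the digits already fails), but keeping it makes the intent obvious if the pattern is later loosened.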

Keith


Re: [CODE4LIB] regexp for LCC?

2011-03-31 Thread Doran, Michael D
Hi Jonathan,

Although designed for a different purpose, you might want to take a look at the 
regex in the LC call number sorting utilities on this page: 
http://rocky.uta.edu/doran/sortlc/

Note that unparsable call numbers are printed to STDERR with an error message.  So you 
could run it against a list containing valid and MLC call numbers and see 
which ones end up where, then refine the regexp, retry, rinse, and repeat.  If you 
make significant (or any) improvements to the regexp being used, I'd be 
delighted to incorporate them back into those LC sort utilities.
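
The test-and-refine loop described above could be sketched like this. It is not the sortlc code; the `partition` helper and the placeholder pattern are assumptions made for the example, and the pattern is meant to be swapped for whatever regexp is under test.

```python
import re
import sys


def partition(call_numbers, pattern):
    """Split call numbers into (matched, unmatched) for a compiled regexp.

    Unmatched values are also reported on STDERR, mirroring the behaviour
    described above, so they can be reviewed between refinements.
    """
    matched, unmatched = [], []
    for call_number in call_numbers:
        if pattern.match(call_number):
            matched.append(call_number)
        else:
            print(f"unparsable: {call_number}", file=sys.stderr)
            unmatched.append(call_number)
    return matched, unmatched


if __name__ == "__main__":
    # Placeholder pattern (an assumption): 1-3 capitals, then digits.
    candidate = re.compile(r"^[A-Z]{1,3}\s*\d+")
    samples = ["PS3545.I5365 1990", "MLCS 83/5180 (P)", "Microfilm 19072 E"]
    good, bad = partition(samples, candidate)
    print("matched:", good)
    print("unmatched:", bad)
```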

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
 



Re: [CODE4LIB] regexp for LCC?

2011-03-31 Thread Naomi Dushay
You could also try to use the code I put in SolrMarc utilities classes  
ha ha ha.


- Naomi


[CODE4LIB] digital librarian job description

2011-03-31 Thread Eric Lease Morgan
Below is an abbreviated digital librarian job description -- a grant-funded 
temporary position here at Notre Dame:

  The overall goal of the Vector Control Development Network (VCDN)
  is to develop an analytical framework for the evaluation of the
  transmission of vector borne diseases to support efforts by the
  research and modeling communities, national vector borne disease
  control programs, developers of products for disease control and
  the donors/funders of research and control activities to
  control/eliminate these diseases. The first phase of the project
  will focus on malaria and the vector behaviors that determine the
  intensity of transmission and the susceptibility of vectors to
  interventions (both present and anticipated). 
  
  Candidates should possess at least a Master's degree in
  bioinformatics or its equivalent (those finishing the last
  semester toward completion of an MS in bioinformatics are also
  encouraged to apply). They must possess advanced skills in
  searching bibliographic literature databases and be conversant
  with data curation and copyright issues and implications. They
  must know at least one programming language, be familiar with
  RESTful Web computing techniques and know how to implement them.
  They must be able to read and write XML files, design and
  implement a relational database, know how to index data and
  provide Web access to the index, and be familiar with metadata
  standards such as Dublin Core or Darwin Core. They must also
  understand how to provide and exploit access to data through an
  HTTP server.

  http://bit.ly/gyrEIs

For more information, email Parker Ladwig ladwi...@nd.edu.

-- 
Eric Lease Morgan
University of Notre Dame


[CODE4LIB] digital preservation management workshop

2011-03-31 Thread Eric Lease Morgan
[Forwarded on behalf of Nancy McGovern nancy...@umich.edu --ELM]
 

Call for Applications

We are very pleased that our colleagues at the University at Albany, SUNY will 
host the five-day Digital Preservation Management workshop this June in Albany, 
New York. The application form will be available on April 13, 2011 at 1:00pm ET at:

  http://www.regonline.com/DPMworkshop-Albany2011.
 
Digital Preservation Management: Short-Term Solutions for Long-Term Problems
Location:  Albany, New York, USA
Dates:  June 5 – 10, 2011
Tuition:  USD $ 950.00


Who Should Attend?

The intended audience for the workshop series is managers at organizations of 
all kinds who are or will be responsible for managing digital content over 
time. The workshop begins on Sunday evening with an opening session, continues 
Monday-Thursday, 9am-5pm, and concludes on Friday at noon. 

Additional information about the workshop content and instructors is available 
at:

  http://www.icpsr.umich.edu/dpm/workshops/fiveday.html.


Instructors and Keynote Speaker

Nancy McGovern is the lead instructor for the workshop and will be joined by 
three topical instructors.  The Keynote speaker for the Albany June 2011 
workshop is Theresa Pardo, the Center Director of the Center for Technology in 
Government.


Application for Registration

Workshop applications are reviewed before a formal acceptance and registration 
for the workshop may occur – a two-step process.  The application system will 
be available at 1pm ET on Wednesday, April 13, 2011 and will remain open until 
the workshop is full (24 participants).  We have already had a very high level 
of interest in the workshop and encourage early application. Apply online at:

  http://www.regonline.com/DPMworkshop-Albany2011
 
Please Note: Applicants will be notified within five (5) business days if they 
are accepted to register for the June Albany, NY workshop. Until then, every 
applicant's status will be ‘pending’.  Persons accepted to register will be able 
to do so at the beginning of May when the registration and tuition payment 
system will be made available.


About the Workshop

The Digital Preservation Management Workshops, a series presented since 2003, 
incorporate community standards and exemplars of good practice to provide 
practical guidance for developing effective digital preservation programs. The 
workshops were initially developed at Cornell University beginning in 2003 
under the direction of Anne Kenney and Nancy McGovern.  Since 2006, McGovern 
has continued curricular development and directed the workshop from ICPSR at 
the University of Michigan. This has included development of Special Topic 
advanced workshops and a Train-the-Trainer program.  Through 2010, the workshop 
series was developed with funding from the National Endowment for the 
Humanities. 

If you have questions, please contact us at: 
digital-preservat...@icpsr.umich.edu  

-- 
Nancy Y McGovern
Director, Digital Preservation Management workshops
http://www.icpsr.umich.edu/dpm/workshops
 


[CODE4LIB] XC NCIP Toolkit Connectors available for Symphony and Voyager ILS

2011-03-31 Thread Cook, Randall
Here is exciting news regarding resource sharing and discovery using the
NCIP protocol.

 

The open source community of software developers working on and
supporting the eXtensible Catalog's (XC) NCIP Toolkit,
http://code.google.com/p/xcncip2toolkit/ is pleased to announce the
release of new software.

 

We are releasing:

* NCIP Core - required for all installations.  Installation
instructions at
http://code.google.com/p/xcncip2toolkit/wiki/CoreInstallation 

* SirsiDynix Symphony connector - currently written for resource
sharing purposes.   Installation instructions at
http://code.google.com/p/xcncip2toolkit/wiki/SymphonyInstallation 

* Ex Libris Voyager connector - currently written for resource
discovery purposes.  Installation instructions at
http://code.google.com/p/xcncip2toolkit/wiki/VoyagerInstallation 

 

For several months a group of developers from OCLC, eXtensible Catalog
Organization (XCO), Lehigh University, Consortium of Academic and
Research Libraries in Illinois (CARLI), University of North
Carolina-Charlotte, and Notre Dame have been working on a revision of the
original XC NCIP Toolkit to support version 2 of the NCIP protocol.
This software release represents the result of that work so far.

 

We invite all interested parties to check out the software.  The best
way to ask questions is by signing up for and using the NCIP Toolkit
mailing list.  The developers involved in the released code have all
agreed to field questions as they come in.   The following url will
allow you to sign up for the NCIP Toolkit mailing list as well as other
XC lists http://www.extensiblecatalog.org/support 

 

SirsiDynix Symphony connector - Written by Lehigh University staff, who
run it in production as part of a consortial resource sharing system
that uses the following NCIP services:

* Lookup User

* CheckIn Item

* CheckOut Item

* Accept Item

 

Ex Libris Voyager connector - Written by XCO and CARLI staff, this
connector uses the following NCIP services:

* Lookup Item

* Lookup User

* Lookup Item Set - a new (currently non-standard) service that
allows lookups based on 1 or more bibliographic ids with a response that
contains information on all related items.  This service was built with
input from an NCIP Standing Committee member and has been submitted to
the NCIP Standing Committee for adoption review.

* Renew Item

 

We are interested in expanding both the number of ILS connectors
available and the number of NCIP services supported. If you are
interested in participating in any way, please contact Randall Cook at
XCO (rc...@library.rochester.edu).

 

 

Randall Cook, PMP

eXtensible Catalog Organization

University of Rochester River Campus Libraries

585-273-2042

rc...@library.rochester.edu

 


[CODE4LIB] techniques for parsing legacy library data

2011-03-31 Thread Thomale, Jason
Hey all,

I <3 today's LCC thread, and ones like it.

It seems like there's a ton of knowledge out there (buried) about parsing 
various pieces of library data like this, but I haven't really seen a concerted 
effort to log/organize this info in one place. It seems like such a thing could 
be a useful resource for the Code4lib community? (Because I would find it 
terribly useful.)

I started a page on the wiki: 
http://wiki.code4lib.org/index.php/Parsing_Library_Data

It's skeletal, but the idea is to collect/share any & all info about parsing 
library data--code, techniques, methods, problems, general discussion--to make 
it a little easier to build off of each other.

If something like this already exists, I'd love to know about it (and in that 
case this wiki page would be redundant). Otherwise--yeah, it would be great if 
you guys could go in and add links to relevant articles/work/blog 
postings--especially if this is a resource you'd find useful, too. Or make some 
suggestions, and I'll take care of it...

Thanks!

Jason Thomale
Resource Discovery Systems Librarian
University of North Texas Libraries


Re: [CODE4LIB] techniques for parsing legacy library data

2011-03-31 Thread Simon Spero
I strongly suggest taking a look at GATE  (http://gate.ac.uk) and UIMA (
http://uima-framework.sourceforge.net/ ).

GATE can use UIMA workflows as processing resources.  UIMA can use GATE
workflows as processing resources.

Don't cross the streams...

Simon


Re: [CODE4LIB] techniques for parsing legacy library data

2011-03-31 Thread Doran, Michael D
Hi Jason,

> I started a page on the wiki:
> http://wiki.code4lib.org/index.php/Parsing_Library_Data

Cool idea.  I added a link under the Title section to a small code snippet for 
parsing titles to determine the number of nonfiling characters (for when 
converting non-MARC data to MARC).
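
For readers who have not met the problem: when a title is converted to a MARC 245 field, the second indicator records how many leading characters (an initial article plus its trailing space) to skip when filing. The sketch below is not the snippet linked from the wiki; it is a minimal illustration that assumes English articles only and ignores leading punctuation. The names `ARTICLES` and `nonfiling_count` are made up for the example.

```python
# Assumption: English initial articles only; a real converter would need
# per-language article lists and handling for leading punctuation.
ARTICLES = ("the", "an", "a")


def nonfiling_count(title):
    """Return the number of leading characters to skip when filing a title."""
    lowered = title.lower()
    for article in ARTICLES:
        prefix = article + " "
        # The trailing space keeps "Anthem" from being read as "An" + "them".
        if lowered.startswith(prefix):
            return len(prefix)
    return 0


if __name__ == "__main__":
    for title in ["The Hobbit", "An American Tragedy",
                  "A Tale of Two Cities", "Anthem"]:
        print(nonfiling_count(title), title)
```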

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
 



Re: [CODE4LIB] MARC magic for file

2011-03-31 Thread William Denton

On 28 March 2011, Ford, Kevin wrote:

> I couldn't get Simon's MARC 21 Magic file to work.  Among other issues, 
> I received "line too long" errors.  But, since I've been curious about 
> this for some time, I figured I'd take a whack at it myself.  Try this:


This is very nice!  Thanks.  I tried it on a bunch of MARC files I have, 
and it recognized almost all of them.  A few it didn't, so I had a closer 
look, and they're invalid.


For example, the Internet Archive's Binghamton catalogue dump:

http://ia600307.us.archive.org/6/items/marc_binghamton_univ/

$ file -m marc.magic bgm*mrc
bgm_openlib_final_0-5.mrc:   data
bgm_openlib_final_10-15.mrc: MARC Bibliographic
bgm_openlib_final_15-18.mrc: data
bgm_openlib_final_5-10.mrc:  MARC Bibliographic

But why?  Aha:

$ head -c 25 bgm_openlib_final_*mrc
==> bgm_openlib_final_0-5.mrc <==
01812cas  2200457   45x00
==> bgm_openlib_final_10-15.mrc <==
01008nam  2200289ua 45000
==> bgm_openlib_final_15-18.mrc <==
01614cam00385   45  0
==> bgm_openlib_final_5-10.mrc <==
00887nam  2200265v  45000

As you say, the leader should end with 4500 (as defined at 
http://www.loc.gov/marc/authority/adleader.html) but two of those files 
don't.  So they're not valid MARC.  I'm sure any decent MARC tool can deal 
with them, since decent MARC tools are certainly going to be forgiving 
enough to deal with four characters that apparently don't even really 
matter.
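
The same check can be done by hand outside of file. This is a minimal sketch, assuming the usual 24-byte MARC 21 leader with the record length in positions 00-04 and "4500" in positions 20-23; the function names are made up for the example and this is not how any particular MARC library validates records.

```python
def check_leader(leader):
    """Return a list of problems found in a MARC 21 leader given as bytes."""
    if len(leader) < 24:
        return ["leader is shorter than 24 bytes"]
    problems = []
    if not leader[0:5].isdigit():
        problems.append("positions 00-04 (record length) are not five digits")
    if leader[20:24] != b"4500":
        problems.append("positions 20-23 are %r, expected b'4500'" % leader[20:24])
    return problems


def check_file(path):
    """Check only the first record's leader in a file of MARC records."""
    with open(path, "rb") as handle:
        return check_leader(handle.read(24))


if __name__ == "__main__":
    # Leaders copied from the head -c 25 output above (first 24 bytes used).
    print(check_leader(b"01008nam  2200289ua 45000"[:24]))
    print(check_leader(b"01812cas  2200457   45x00"[:24]))
```

Run against the samples above, the first leader passes and the second is flagged for having 45x0 where 4500 is expected, which matches what file reported.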


So on the one hand they're usable MARC but file wouldn't say so, and on 
the other that's a good indication that the files have failed a basic 
validity test.  I wonder if there are similar situations for JPEGs or 
MP3s.


I think you should definitely submit this for inclusion in the magic file. 
It would be very useful for us all!


Bill

P.S. I'd never used head -c (to show a fixed number of bytes) before. 
Always nice to find a new useful option to an old command.



#
# MARC 21 Magic  (Second cut)

# Set at position 0
0   short   0x

# leader ends with 4500

20  string  4500


# leader starts with 5 digits, followed by codes specific to MARC format

0   regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z]  MARC Bibliographic
0   regex/1 (^[0-9]{5})[acdnosx][z]         MARC Authority
0   regex/1 (^[0-9]{5})[cdn][uvxy]          MARC Holdings
0   regex/1 (^[0-9]{5})[acdn][w]            MARC Classification
0   regex/1 (^[0-9]{5})[cdn][q]             MARC Community



--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org