Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Robert Fox
Assuming that memory won't be an issue, you could use MARC::Batch to
read in the record set and write out separate files, splitting every X
records. You would have an iterative loop loading each record from the
large batch, and a counter variable that gets reset after X records. You
might want to name the sets using another counter that keeps track of how
many sets you have, so you could name each file something like
batch_$count.mrc and write them out to a specific directory. Within each
smaller batch, just append each record to the current output file.
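
Something along these lines (untested, and the input file name, chunk size,
and output directory below are just placeholders) would do it:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use MARC::Batch;

  my $batch      = MARC::Batch->new( 'USMARC', 'big_file.mrc' );  # placeholder input
  my $chunk_size = 10_000;     # X records per output file
  my $rec_count  = 0;
  my $set_count  = 0;
  my $out;

  while ( my $record = $batch->next() ) {
      # Start a new output file every $chunk_size records.
      if ( $rec_count % $chunk_size == 0 ) {
          close $out if $out;
          $set_count++;
          open $out, '>', "batches/batch_$set_count.mrc"
              or die "Can't open batches/batch_$set_count.mrc: $!";
          binmode $out;
      }
      # Append the raw record to the current smaller batch.
      print {$out} $record->as_usmarc();
      $rec_count++;
  }
  close $out if $out;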

Rob Fox
Hesburgh Libraries
University of Notre Dame

On Jan 25, 2010, at 9:48 AM, "Nolte, Jennifer" wrote:

> Hello-
>
> I am working with files of MARC records that are over a million  
> records each. I'd like to split them down into smaller chunks,  
> preferably using a command line. MARCedit works, but is slow and  
> made for the desktop. I've looked around and haven't found anything  
> truly useful- Endeavor's MARCsplit comes close but doesn't separate  
> files into even numbers, only by matching criteria, so there could  
> be lots of record duplication between files.
>
> Any idea where to begin? I am a (super) novice Perl person.
>
> Thank you!
>
> ~Jenn Nolte
>
>
> Jenn Nolte
> Applications Manager / Database Analyst
> Production Systems Team
> Information Technology Office
> Yale University Library
> 130 Wall St.
> New Haven CT 06520
> 203 432 4878
>
>


Job Posting: Senior Technical Consultant Analyst - University Libraries of Notre Dame

2005-08-11 Thread Robert Fox

Please excuse the cross posting.

JOB POSTING

University Libraries of Notre Dame invite applications for the position
Senior Technical Consultant Analyst. Reporting to the Electronic
Resources Librarian, Head of the Electronic Resources & Serials Access
Department, the Senior Analyst will participate in team-based
development and support of projects to enhance access to library
electronic databases and services for the Notre Dame community.

TITLE: Senior Technical Consultant Analyst
DEPARTMENT:  Electronic Resources & Serials Access
RATE: $41,637 - $70,045 per year
LEVEL: 10
FLSA Status: Exempt

POSTING DATE: August 12, 2005. Applications will be accepted until
September 2, 2005.


DUTIES:
Database and programming enhancements for access to electronic resources
* Assists in development of MySQL database, Perl, and PHP programming
solutions to facilitate access to electronic resources as needed.
Examples of library services with custom development include OSCR
electronic reserves, the electronic products management database, the
eJournal Locator service, and the PURL (persistent URL) database solution
for better distributed URL management.

Technical Support for electronic databases, including access tools
* Shared management of the library proxy server in order to ensure
uninterrupted remote access for users to Library maintained electronic
databases and journals.
* Configuration and authentication of Web-based research database
products owned and leased by the Libraries. Liaison with
providers/publishers to resolve database-related access problems.
* Provides excellent support and assistance for patron and library staff
access to electronic databases from all platforms (shared responsibility).
* Acquires in-depth knowledge of electronic research and reference
resources provided by the Libraries.
* Provides user support for Endnote configuration and use.  May help
with workshops and documentation.

Usage Statistics Management for Databases
* Acquires and manages database use statistics in electronic form. Plans
enhancements of statistics management to enable better analysis of usage
patterns for databases and journals in electronic formats. Supervises
student workers for statistics entry responsibilities.
* Will contribute to development of a database for managing database
usage statistics, within the framework of the Libraries' larger statistics
management needs.

This position participates in work of electronic resources working
groups and committees as appropriate. Attends workshops and conferences
when possible to expand knowledge of electronic resources enhancements.

QUALIFICATIONS:
Bachelor's degree, preferably in an IT area.
The successful candidate will have:
- Demonstrated working experience with Web programming/scripting tools
such as PHP and Perl; markup technologies including XML, HTML, and CSS;
and MySQL database design and management.
- Solid knowledge of microcomputer operating systems, especially Windows
XP. Evidence of wide application experience, particularly Excel.
- Network file management skills and familiarity with TCP/IP protocols.

Excellent communication and interpersonal skills are essential. Must be
able to work collaboratively and creatively with diverse groups. Able to
manage shifting priorities in a fast-paced environment; excellent
organizational skills. Ability to effectively communicate technical
information to individuals who lack a technical background.

SCHEDULE: Monday - Friday, 8:00 am - 5:00 pm, 12 months/40 hours.


The University of Notre Dame is an Equal Opportunity/Affirmative Action
Employer.

ENVIRONMENT: The University of Notre Dame is a national Catholic
teaching and research university enriched with a diversified faculty,
located in Northern Indiana ninety miles from Chicago. On a highly
residential campus, approximately 8,200 undergraduates and 3,100
graduate students pursue a broad range of studies. The University
Libraries house approximately 3 million volumes within the main Hesburgh
Library and seven branch libraries and currently subscribe to nearly
17,000 serials. The Libraries have a dynamic staff of 198 including 48
librarians.

APPLICATIONS: To apply, send a letter of application, curriculum
vitae, and names, addresses, phone numbers and email addresses of three
references to:
Michelle Stenberg
Library Administrative Offices
221 Hesburgh Library
University of Notre Dame
Notre Dame, IN 46556.
[EMAIL PROTECTED] 

APPLICATION DEADLINE: Electronic submission of application documents
is strongly encouraged. Initial review of applications will begin on
September 2, 2005 and continue until a successful candidate is chosen.
The University of Notre Dame is an Equal Opportunity/Affirmative Action
Employer strongly committed to diversity. We value qualified candidates
who can bring to our community a variety of backgrounds.


Enhancement request for MARC::Record

2004-03-04 Thread Robert Fox
I would like to make a suggestion for a functionality enhancement to 
MARC::Record. This is really more of a tweak than anything else.

I just debugged a program where I was expecting a list to be returned from
the $record->subfield($field, $subfield) method and instead got back a
scalar containing the first instance of that combination found. Would it be
possible to change the functionality of this method so that it returns a
list of all instances in list context and the first instance in scalar
context, similar to the $record->field($field) method?

For some reason I expected the same behavior from both methods. Perhaps 
others would like this feature implemented as well. I've found many 
instances where this would be handy (for example, pulling out all 7XX added 
entry subfield $a's, etc.). Others please chime in if they think that this 
would be helpful. I realize that this same thing could be implemented with 
existing module functionality by simply adding a couple of lines of code, 
but if it would be easy to add to the method, I think it would be worthwhile.
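
For what it's worth, here's the sort of couple-of-lines workaround I had in
mind, using the existing field() and MARC::Field subfields() methods (the
input file name is just a placeholder):

  use strict;
  use warnings;
  use MARC::Batch;

  my $batch = MARC::Batch->new( 'USMARC', 'records.mrc' );   # placeholder input
  while ( my $record = $batch->next() ) {
      # Collect every subfield $a from all of the 7XX added entries.
      my @added_entry_names;
      for my $field ( $record->field('7..') ) {
          for my $pair ( $field->subfields() ) {    # each $pair is [ code, value ]
              push @added_entry_names, $pair->[1] if $pair->[0] eq 'a';
          }
      }
      print join( ' | ', @added_entry_names ), "\n" if @added_entry_names;
  }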

Thanks,

Rob

Robert Fox
Sr. Programmer/Analyst
University Libraries of Notre Dame
(574)631-3353
[EMAIL PROTECTED]


XML processing of large XML docs pt. 2

2004-02-27 Thread Robert Fox
First of all, thanks to all of you who supplied comments and suggestions 
for my issue relating to parsing very large XML documents with complex 
structures.

Given those suggestions, I was able to find a solution. Believe it or not,
the solution was to use a Perl API that relies upon a C library for parsing
the XML, as opposed to a pure Perl solution. In this case, I used
XML::LibXML (which is an API to the Gnome libxml2 C library). After running
several tests, the improvement in processing speed was dramatic: I'm now
able to process a 54MB file of XML/RDF records in 1/24th the time it was
taking me previously using the Perl-based XML::XPath/XML::XPath::XMLParser
modules. It now takes minutes instead of hours. And, as a bonus, re-coding
the program didn't take that long since the API is very similar, using the
DOM technique. I think my performance is also predicated on the fact that I
have enough RAM to manipulate the document in memory, as opposed to heavy
swapping to disk.
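
For anyone curious, the core of the new version looks roughly like this;
the file name, namespaces, and element names below are illustrative rather
than our actual RDF schema:

  use strict;
  use warnings;
  use XML::LibXML;
  use XML::LibXML::XPathContext;

  # Parse the whole document into an in-memory DOM tree via libxml2.
  my $doc = XML::LibXML->new()->parse_file('records.rdf');   # placeholder file

  # Register the namespace prefixes used in the XPath expressions below.
  my $xpc = XML::LibXML::XPathContext->new($doc);
  $xpc->registerNs( rdf => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#' );
  $xpc->registerNs( dc  => 'http://purl.org/dc/elements/1.1/' );

  # The XPath expressions are essentially the same ones we used with
  # XML::XPath, which is why the re-coding effort was small.
  for my $desc ( $xpc->findnodes('//rdf:Description') ) {
      my $title = $xpc->findvalue( './dc:title', $desc );
      print "$title\n";
  }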

My script is running on the same host, against the same data set, and the
improvement was phenomenal. This is the only performance tweak that I made
to my program, and the payoff was well worth the relatively minimal effort.
I really couldn't believe it when I saw the performance increase, but I
must say that I'm relieved, because I was worried that the issue might have
been my algorithm and the underlying Perl code library we had written as a
basis for this application.

I would be interested to know if others have had a similar experience 
switching to an API which relies on a compiled set of C library routines 
(such as XML::Sablotron). Hats off to Matt Sergeant and Christian Glahn for 
their work on the XML::LibXML modules.

I hope my experience helps some of you out there working on XML projects 
involving large data sets.

Rob

Robert Fox
Sr. Programmer/Analyst
University Libraries of Notre Dame
(574)631-3353
[EMAIL PROTECTED]


RE: XML Parsing for large XML documents

2004-02-26 Thread Robert Fox
Peter and Ed-

Thanks for the replies.

Your suggestions are very good. Here is my problem, though: I don't think
that I can process this document in a serial fashion, which is what SAX
calls for. I need to do a lot of node hopping in order to create somewhat
complex data structures for import into the database, and that requires a
lot of jumping around from one part of the node tree to another. Thus, it
seems as though I need to use a DOM parser to accomplish this. Scanning an
entire document of this size in order to perform very specific event
handling for each operation (using SAX) seems like it would be just as time
consuming as having the entire node tree represented in memory. Please
correct me if I'm wrong here.
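
To illustrate what I mean by node hopping, here is a stripped-down,
hypothetical example of the kind of thing the script does with XML::XPath
(the file, element, and attribute names are made up):

  use strict;
  use warnings;
  use XML::XPath;

  my $xp = XML::XPath->new( filename => 'records.rdf' );   # placeholder file

  for my $resource ( $xp->findnodes('//resource') ) {
      # From this node, jump to a completely different part of the tree
      # by following an identifier, rather than reading the doc serially.
      my $related_id = $xp->findvalue( './relationship/@ref', $resource );
      next unless "$related_id";
      my ($related) = $xp->findnodes( qq{//resource[\@id="$related_id"]} );
      # ...compare and merge data from $resource and $related, build the
      # data structures for the database import, and so on.
  }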

On the plus side, I am running this process on a machine that seems to have 
enough RAM to represent the entire document and my code structures (arrays, 
etc.) without the need for virtual memory and heavy disk I/O. However, the 
process is VERY CPU intensive because of all of the sorting and lookups 
that occur for many of the operations. I'm going to see today if I can make 
those more efficient as well.

Someone else has suggested to me that perhaps it would be a good idea to 
break up the larger document into smaller parts during processing and only 
work on those parts in a serial mode. It was also suggested that 
XML::LibXML was an efficient tool because of the C library core (libxml2). 
And, I've also now heard of "hybrid" parsers that allow the ease of use and 
flexibility of DOM with the efficiency of SAX (RelaxNGCC).

For those of you that haven't heard of these tools before, you might want 
to check out:

XML::Sablotron (similar to XML::LibXML)
XMLPull (http://www.xmlpull.org)
Piccolo Parser (http://piccolo.sourceforge.net)
RelaxNGCC (http://relaxngcc.sourceforge.net/en/index.htm)

I get the impression that if I tried to use SAX parsing for a relatively
complex RDF document, the programming load would be rather significant. 
But, if it speeds up processing by several orders of magnitude, then it 
would be worth it. I'm concerned, though, that I won't have the ability to 
crawl the document nodes using conditionals and revert to previous portions 
of the document that need further processing. What is your experience in 
this regard?

Thanks again for the responses. This is great.

Rob



At 11:07 AM 2/26/2004, Peter Corrigan wrote:

On 25 February 2004 20:31 wrote...

> 1. Am I using the best XML processing module that I can for this sort of
> task?

If it must be faster, then it might be worth porting what you have to
work with LibXML, which has all-round impressive benchmarks, especially
for DOM work. Useful comparisons may be found at:
http://xmlbench.sourceforge.net/results/benchmark/index.html

Remember that the size of the final internal representation used to
manipulate the XML data for DOM could be up to 5 times the original size,
i.e. 270MB in your case. Simply adding RAM/porting your existing code to
another machine might be enough to give you the speed-up you require.

> 3. What is the most efficient way to process through such a large
> document no matter what XML processor one uses?

SAX type processing will be faster and use less memory. If you need
random access to any point of the tree after the document has been read,
you will need DOM, hence you will need lots of memory.

If this is a one-off load, I guess you have to balance the cost of your
time recoding with the cost of waiting for the data to load using what
you have already. Machines usually work cheaper :-)

Best of luck

Peter Corrigan
Head of Library Systems
James Hardiman Library
NUI Galway
IRELAND
Tel: +353-91-524411 Ext 2497
Mobile: +353-87-2798505
-----Original Message-----
From: Robert Fox [mailto:[EMAIL PROTECTED]]
Sent: 25 February 2004 20:31
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: XML Parsing for large XML documents

I'm cross posting this question to perl4lib and xml4lib, hoping that
someone will have a suggestion.

I've created a very large (~54MB) XML document in RDF format for the
purpose of importing related records into a database. Not only does the
RDF document contain many thousands of individual records for electronic
resources (web resources), but it also contains all of the "relationships"
between those resources encoded in such a way that the document itself
represents a rather large database of these resources. The relationships
are multi-tiered. I've also written a Perl script which can parse this
large document and process through all of the XML data in order to import
the data, along with all of the various relationships, into the database.
The Perl script uses XML::XPath and XML::XPath::XMLParser. I use these
modules to find the appropriate document nodes as needed while the
processing is going on and the database is being populated. The database
is not a flat file: several data tables and linking tables are involved.

XML Parsing for large XML documents

2004-02-25 Thread Robert Fox
I'm cross posting this question to perl4lib and xml4lib, hoping that 
someone will have a suggestion.

I've created a very large (~54MB) XML document in RDF format for the 
purpose of importing related records into a database. Not only does the RDF 
document contain many thousands of individual records for electronic 
resources (web resources), but it also contains all of the "relationships" 
between those resources encoded in such a way that the document itself 
represents a rather large database of these resources. The relationships 
are multi-tiered. I've also written a Perl script which can parse this 
large document and process through all of the XML data in order to import 
the data, along with all of the various relationships, into the database. 
The Perl script uses XML::XPath and XML::XPath::XMLParser. I use these
modules to find the appropriate document nodes as needed while the 
processing is going on and the database is being populated. The database is 
not a flat file: several data tables and linking tables are involved.
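
The relevant part of the script follows this general pattern; the element
names, table layout, and connection details here are made up for
illustration:

  use strict;
  use warnings;
  use XML::XPath;
  use DBI;

  my $xp  = XML::XPath->new( filename => 'records.rdf' );   # placeholder file
  my $dbh = DBI->connect( 'dbi:mysql:resources', 'user', 'password',
                          { RaiseError => 1 } );

  my $insert = $dbh->prepare('INSERT INTO resource (uri, title) VALUES (?, ?)');

  for my $node ( $xp->findnodes('//resource') ) {
      my $uri   = $xp->findvalue( './@uri',  $node );
      my $title = $xp->findvalue( './title', $node );
      $insert->execute( "$uri", "$title" );
      # ...further XPath lookups here populate the linking tables that
      # record the multi-tiered relationships.
  }

  $dbh->disconnect;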

I've run into a problem, though: my Perl script runs very slowly. I've done
just about everything I can to optimize the script so that it isn't memory
intensive and runs efficiently, and nothing seems to have significantly
helped. Therefore, I have a couple of questions for the list(s):

1. Am I using the best XML processing module that I can for this sort of task?
2. Has anyone else processed documents of this size, and what have they used?
3. What is the most efficient way to process through such a large document 
no matter what XML processor one uses?

The processing on this is so amazingly slow that it is likely to take many 
hours if not days(!) to process through the bulk of records in this XML 
document. There must be a better way.

Any suggestions or help would be much appreciated,

Rob Fox

Robert Fox
Sr. Programmer/Analyst
University Libraries of Notre Dame
(574)631-3353
[EMAIL PROTECTED]


Re: MARC::Record insert function

2003-11-06 Thread Robert Fox
Ron et al.-

> On the issue of alpha and numeric tags, I know that alpha values in tags
> have been permitted in the MARC standard for a long time, and applaud the
> fact that MARC::Record allows for it, but has anyone actually seen one
> used? In some later revision of UNIMARC or one of the national standards
> based on UNIMARC, perhaps? I'd be curious to know about specific cases
> people have seen.

The Aleph ILS (vendor: Ex Libris) regularly uses alpha MARC tags. We had
requested that alpha tags be implemented in MARC::Record because of that
circumstance. However, we've only used this ability in MARC::Record to read
alpha tags, not to create new records with them.
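
For what it's worth, reading them looks no different from reading numeric
tags. A quick, hypothetical example, where FMT and LKR stand in for the
alpha tags Aleph emits and the file name is a placeholder:

  use strict;
  use warnings;
  use MARC::Batch;

  my $batch = MARC::Batch->new( 'USMARC', 'aleph_export.mrc' );   # placeholder file
  while ( my $record = $batch->next() ) {
      for my $tag (qw( FMT LKR )) {
          # field() takes the alpha tag just like a numeric one.
          for my $field ( $record->field($tag) ) {
              print "$tag: ", $field->as_string(), "\n";
          }
      }
  }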

Rob

Robert Fox
Sr. Programmer/Analyst
University Libraries of Notre Dame
(574)631-3353
[EMAIL PROTECTED]