Re: Splitting a large file of MARC records into smaller files
I also recommend using MARC::Batch. Attached is a simple script I wrote for
myself.

Saiful Amin
+91-9343826438

On Mon, Jan 25, 2010 at 8:33 PM, Robert Fox wrote:
> Assuming that memory won't be an issue, you could use MARC::Batch to
> read in the record set and print out separate files where you split on
> X amount of records. [...]

#!c:/perl/bin/perl.exe
#
# Name:    mbreaker.pl
# Version: 0.1
# Date:    Jan 2009
# Author:  Saiful Amin
#
# Description: Extract MARC records based on command-line parameters

use strict;
use warnings;
use Getopt::Long;
use MARC::Batch;

my $start = 0;
my $end   = 1;
GetOptions(
    "start=i" => \$start,
    "end=i"   => \$end,
);

my $batch = MARC::Batch->new('USMARC', $ARGV[0]);
$batch->strict_off();
$batch->warnings_off();

my $num = 0;
while (my $record = $batch->next()) {
    $num++;
    next if $num < $start;
    last if $num > $end;
    print $record->as_usmarc();
    warn "$num records\n" if ($num % 1000 == 0);
}

__END__

=head1 NAME

mbreaker.pl - Breaks the MARC record file as per the start and end
positions specified

=head1 SYNOPSIS

mbreaker.pl [options] file

 Options:
   -start   start position for reading records
   -end     end position for reading records
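The script writes the selected range of records to standard output, so a
typical run redirects each range into its own file (the file names here are
just examples):

  perl mbreaker.pl -start 1 -end 10000 big.mrc > chunk01.mrc
  perl mbreaker.pl -start 10001 -end 20000 big.mrc > chunk02.mrc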
Re: Splitting a large file of MARC records into smaller files
Assuming that memory won't be an issue, you could use MARC::Batch to read in
the record set and print out separate files where you split on every X
records. You would have an iterative loop loading each record from the large
batch, and a counter variable that would get reset after X records. You might
want to name the sets using another counter that keeps track of how many sets
you have, name each file something like batch_$count.mrc, and write them out
to a specific directory. Just concatenate each record onto the previous one
when you're making your smaller batches.

Rob Fox
Hesburgh Libraries
University of Notre Dame

On Jan 25, 2010, at 9:48 AM, "Nolte, Jennifer" wrote:
> Hello-
>
> I am working with files of MARC records that are over a million
> records each. I'd like to split them down into smaller chunks,
> preferably using a command line. MARCedit works, but is slow and
> made for the desktop. I've looked around and haven't found anything
> truly useful- Endeavor's MARCsplit comes close but doesn't separate
> files into even numbers, only by matching criteria, so there could
> be lots of record duplication between files.
>
> Any idea where to begin? I am a (super) novice Perl person.
>
> Thank you!
>
> ~Jenn Nolte
>
>
> Jenn Nolte
> Applications Manager / Database Analyst
> Production Systems Team
> Information Technology Office
> Yale University Library
> 130 Wall St.
> New Haven CT 06520
> 203 432 4878
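A minimal sketch of what Rob describes, assuming MARC::Batch is installed
(untested; the chunk size of 10,000, the batch_$count.mrc naming, and writing
to the current directory are illustrative choices, not anything from his
message):

--- snippet ---
#!/usr/bin/perl
use strict;
use warnings;
use MARC::Batch;

my $chunk_size = 10_000;    # X: records per output file (an assumed value)

my $batch = MARC::Batch->new('USMARC', $ARGV[0]);
$batch->strict_off();

my ($num, $count, $fh) = (0, 0, undef);
while (my $record = $batch->next()) {
    # Start a new batch_$count.mrc file every $chunk_size records.
    if ($num % $chunk_size == 0) {
        close $fh if defined $fh;
        $count++;
        open $fh, '>', "batch_$count.mrc"
            or die "Cannot open batch_$count.mrc: $!";
        binmode $fh;    # MARC exchange records are binary data
    }
    # Appending as_usmarc output is all the concatenation needed:
    # each record carries its own terminator.
    print {$fh} $record->as_usmarc();
    $num++;
}
close $fh if defined $fh;
--- snippet ---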
Re: Splitting a large file of MARC records into smaller files
Hi,

The yaz-marcdump utility may be what you are looking for. See for instance
options -s and -C.

hth,
Shérab.
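For instance, something along these lines should write chunks of 1,000
records with the prefix "chunk" (the prefix and input file name are
examples; check the man page for the exact names of the generated files):

  yaz-marcdump -s chunk -C 1000 big.mrc > /dev/null

yaz-marcdump also prints the records to standard output, hence the redirect
to /dev/null.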
RE: Splitting a large file of MARC records into smaller files
> yaz-marcdump allows you to break a
> marcfile into chunks of x records

+1

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu

From: Colin Campbell [colin.campb...@ptfs-europe.com]
Sent: Monday, January 25, 2010 7:08 AM
To: perl4lib@perl.org
Subject: Re: Splitting a large file of MARC records into smaller files

[...]
Re: Splitting a large file of MARC records into smaller files
Jennifer,

> I am working with files of MARC records that are over a million records
> each. I'd like to split them down into smaller chunks, preferably using a
> command line. MARCedit works, but is slow and made for the desktop. I've
> looked around and haven't found anything truly useful- Endeavor's MARCsplit
> comes close but doesn't separate files into even numbers, only by matching
> criteria, so there could be lots of record duplication between files.
>
> Any idea where to begin? I am a (super) novice Perl person.

Well... if you have a *nix style command line and the usual utilities, and
your file of MARC records is in exchange format with the records delimited
by the end-of-record character 0x1d, then you could do something like this:

  tr '\035' '\n' < my-marc-file.mrc > recs.txt
  split -l 1000 recs.txt

The tr command will turn the MARC end-of-record characters into newlines.
Then use the split command to carve up the output of tr into files of 1000
records each. You then may have to use tr to convert the newlines back to
MARC end-of-record characters.

Ashley.

--
Ashley Sanders                                  a.sand...@manchester.ac.uk
Copac http://copac.ac.uk A Mimas service funded by JISC
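That back-conversion can be done the same way, for example (assuming split's
default output names xaa, xab, ...; the .mrc extension is just an example):

  for f in x??; do tr '\n' '\035' < "$f" > "$f.mrc"; done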
RE: Splitting a large file of MARC records into smaller files
This isn't a perl solution, but it may work for you.

You can use the unix split command to split a file into several other files
with the same number of lines each. For that to work, you'll first have to
use tr to convert the 0x1d record separators (tr's octal escape \035) into
newlines. Then use tr to convert them all back in each split file.

  #> tr '\035' '\n' < filename > filename.nl
  #> split -l $lines_per_file filename.nl SPLIT
  #> for file in SPLIT*; do tr '\n' '\035' < "$file" > "$file.rs"; done

Or something like that.

/dev

--
Devon Smith
Consulting Software Engineer
OCLC Office of Research

-----Original Message-----
From: Nolte, Jennifer [mailto:jennifer.no...@yale.edu]
Sent: Monday, January 25, 2010 9:48 AM
To: perl4lib@perl.org
Subject: Splitting a large file of MARC records into smaller files

Hello-

I am working with files of MARC records that are over a million records
each. I'd like to split them down into smaller chunks, preferably using a
command line. [...]
Re: Splitting a large file of MARC records into smaller files
On 25/01/10 14:48, Nolte, Jennifer wrote:
> Hello-
>
> I am working with files of MARC records that are over a million records
> each. I'd like to split them down into smaller chunks, preferably using a
> command line. MARCedit works, but is slow and made for the desktop. I've
> looked around and haven't found anything truly useful- Endeavor's MARCsplit
> comes close but doesn't separate files into even numbers, only by matching
> criteria, so there could be lots of record duplication between files.

If you have Indexdata's yaz installed, the program yaz-marcdump allows you
to break a marcfile into chunks of x records. man yaz-marcdump for details.

Cheers
Colin
--
Colin Campbell
Chief Software Engineer, PTFS Europe Limited
Content Management and Library Solutions
+44 (0) 208 366 1295 (phone) +44 (0) 7759 633626 (mobile)
colin.campb...@ptfs-europe.com
skype: colin_campbell2
http://www.ptfs-europe.com
RE: Splitting a large file of MARC records into smaller files
> From: Nolte, Jennifer [mailto:jennifer.no...@yale.edu]
> Sent: Monday, January 25, 2010 09:48 AM
> To: perl4lib@perl.org
> Subject: Splitting a large file of MARC records into smaller files
>
> Hello-
>
> I am working with files of MARC records that are over a million records
> each. I'd like to split them down into smaller chunks, preferably using
> a command line. [...]
>
> Any idea where to begin? I am a (super) novice Perl person.

I use the following handy script I created many, many years ago. Consider
it to be in the public domain.

#!perl
#
# Usage:
#   perl MARC21-split.pl [-d#] [-n#] [-pPrefix] [-sSuffix] *.marc
#
#   perl MARC21-split.pl -d3 -n10000 -pbib -s.marc *.marc
#
#   Creates files with three-digit sequence numbers that have 10,000
#   records per file: bib001.marc, bib002.marc, etc.
#
# Options:
#   -d  number of digits for sequence number
#   -n  number of records per file
#   -p  prefix text before sequence number
#   -s  suffix text after sequence number

package main;           # The current package name

require 5.003;          # This script requires Perl v5.003 or later.

BEGIN { unshift(@INC, '.') }

use Carp;               # Perl core package, see documentation

my $PACKAGE = 'main';

## VARIABLES
#
my $crlf = "\n";        # ASCII newline.
my $recd = "\x1D";      # MARC21 record delimiter.
my $fldd = "\x1E";      # MARC21 field delimiter.
my $subd = "\x1F";      # MARC21 subfield delimiter.

## INLINE CODE
#

# Change Perl's default record delimiter so that each read
# returns one MARC record.
$/ = $recd;

# Set defaults for command line options.
my $recs   = 1;
my $digits = 2;
my $prefix = '';
my $suffix = '.mrc';

# Initialize total record count to zero.
my $total = 0;

# Echo the command line arguments.
print STDERR join("\r\nARG=", '', @ARGV), "\r\n";

# Process command line.
foreach my $FileMARC (@ARGV) {
    my $FileOUT = undef;

    # Process command line options.
    if ($FileMARC =~ m/^[\-][Dd]/) {
        $FileMARC =~ s/^[\-][Dd]//;
        if (($digits = $FileMARC) !~ m/\d+/ || $digits == 0) {
            $digits = 1;
        }
        next;
    }
    elsif ($FileMARC =~ m/^[\-][Nn]/) {
        $FileMARC =~ s/^[\-][Nn]//;
        if (($recs = $FileMARC) !~ m/\d+/ || $recs == 0) {
            $recs = 1;
        }
        next;
    }
    elsif ($FileMARC =~ m/^[\-][Pp]/) {
        $FileMARC =~ s/^[\-][Pp]//;
        $prefix = $FileMARC;
        next;
    }
    elsif ($FileMARC =~ m/^[\-][Ss]/) {
        $FileMARC =~ s/^[\-][Ss]//;
        $suffix = $FileMARC;
        next;
    }

    # Open file from command line.
    open(MARC, '<' . $FileMARC)
        || croak("$PACKAGE:: Cannot open input file '$FileMARC': $!");

    # Count each record in the file.
    my $count = 0;
    while (<MARC>) {
        # Open new output file when necessary.
        if (($total % $recs) == 0) {
            my $pattern = sprintf('%%s%%0%uu%%s', int($digits));
            $FileOUT = sprintf($pattern, $prefix, ($total / $recs) + 1, $suffix);

            # Open output file.
            open(OUT, '>' . $FileOUT)
                || croak("$PACKAGE:: Cannot open output file '$FileOUT': $!");
        }
        print OUT $_;
        ++$total;

        # Close output file when full.
        if (($total % $recs) == 0) {
            close(OUT);
        }
        ++$count;
    }

    # Close file from command line.
    close(MARC);

    # Output total records in file and file name.
    print STDERR join("\t", $count, $FileMARC), $crlf;
}

# Output total record count and file count (rounding up for a partial file).
print STDERR join("\t", $total, "Total Records"), $crlf;
print STDERR join("\t", int(($total + $recs - 1) / $recs), "Total Files"), $crlf;
Re: Splitting a large file of MARC records into smaller files
Hi,

A long time ago, I wrote the following:

--- snippet ---
#!/usr/bin/env perl

use strict;
use warnings;

use MARC::File::USMARC;
use MARC::Record;
use Getopt::Long;

my $config = { output => 'input' };
GetOptions($config, 'input=s', 'chunk=i', 'output=s', 'max=i');

if (not exists $config->{input} or not exists $config->{chunk}) {
    die "Usage: $0 --input file --chunk size [--output file] [--max records]\n";
}
else {
    run($config->{input}, $config->{output}, $config->{chunk}, $config->{max});
}

sub run {
    my ($input, $output, $chunk, $max) = @_;

    my $marcfile = MARC::File::USMARC->in($input);
    my $fh = $output eq 'input' ? create_file($input) : create_file($output);
    my $cpt   = 0;
    my $total = 0;
    while (my $record = $marcfile->next) {
        $total++;
        if (defined $max) {
            last if $total > $max;
        }
        if ($cpt >= $chunk) {
            # The current chunk is full: close it and start a new file.
            close $fh;
            $fh = $output eq 'input' ? create_file($input) : create_file($output);
            $cpt = 0;
        }
        print $fh $record->as_usmarc;
        $cpt++;
    }
    close $fh;
}

sub create_file {
    my ($output) = @_;

    # Find the first unused name of the form output.000, output.001, ...
    my $cpt = 0;
    my $filename = sprintf('%s.%03d', $output, $cpt++);
    while (-e $filename) {
        $filename = sprintf('%s.%03d', $output, $cpt++);
    }
    open my $fh, '>', $filename
        or die "Cannot open output file '$filename': $!";
    return $fh;
}
--- snippet ---

Hope this helps

Emmanuel Di Pretoro

2010/1/25 Nolte, Jennifer:
> I am working with files of MARC records that are over a million records
> each. I'd like to split them down into smaller chunks, preferably using
> a command line. [...]
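A run of Emmanuel's snippet might look like this (splitmarc.pl is a made-up
name for wherever the snippet is saved; the output names follow the sprintf
pattern in create_file):

  perl splitmarc.pl --input big.mrc --chunk 1000
  # writes big.mrc.000, big.mrc.001, ... with 1000 records apiece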