Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Saiful Amin
I also recommend using MARC::Batch. Attached is a simple script I wrote for
myself.

Saiful Amin
+91-9343826438


On Mon, Jan 25, 2010 at 8:33 PM, Robert Fox  wrote:

> Assuming that memory won't be an issue, you could use MARC::Batch to
> read in the record set and print out separate files, splitting on
> every X records. You would have an iterative loop loading each
> record from the large batch, and a counter variable that would get
> reset after X records. You might want to name the sets using
> another counter that keeps track of how many sets you have, name
> each file something like batch_$count.mrc, and write them out to a
> specific directory. Just concatenate each record onto the previous one
> when you're making your smaller batches.
>
> Rob Fox
> Hesburgh Libraries
> University of Notre Dame
>
> On Jan 25, 2010, at 9:48 AM, "Nolte, Jennifer"
>  wrote:
>
> > Hello-
> >
> > I am working with files of MARC records that are over a million
> > records each. I'd like to split them down into smaller chunks,
> > preferably using a command line. MARCedit works, but is slow and
> > made for the desktop. I've looked around and haven't found anything
> > truly useful- Endeavor's MARCsplit comes close but doesn't separate
> > files into even numbers, only by matching criteria, so there could
> > be lots of record duplication between files.
> >
> > Any idea where to begin? I am a (super) novice Perl person.
> >
> > Thank you!
> >
> > ~Jenn Nolte
> >
> >
> > Jenn Nolte
> > Applications Manager / Database Analyst
> > Production Systems Team
> > Information Technology Office
> > Yale University Library
> > 130 Wall St.
> > New Haven CT 06520
> > 203 432 4878
> >
> >
>
#!c:/perl/bin/perl.exe
#
# Name: mbreaker.pl
# Version: 0.1
# Date: Jan 2009
# Author: Saiful Amin 
#
# Description: Extract MARC records based on command-line parameters

use strict;
use warnings;
use Getopt::Long;
use MARC::Batch;

my $start = 0;
my $end   = 1;

GetOptions(
    "start=i" => \$start,
    "end=i"   => \$end,
);

my $batch = MARC::Batch->new('USMARC', $ARGV[0]);
$batch->strict_off();
$batch->warnings_off();

my $num = 0;
while ( my $record = $batch->next() ) {
    $num++;
    next if $num < $start;
    last if $num > $end;
    print $record->as_usmarc();
    warn "$num records\n" if ( $num % 1000 == 0 );
}


__END__

=head1 NAME

mbreaker.pl

Extracts the records between the specified start and end positions from a MARC file and writes them to STDOUT

=head1 SYNOPSIS

mbreaker.pl [options] file

Options:
 -start  start position for reading records
 -end    end position for reading records
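
=head1 EXAMPLE

To pull the first 10,000 records out of a large file (the script prints
to STDOUT, so redirect it; the file names here are just placeholders):

 perl mbreaker.pl -start 1 -end 10000 big_batch.mrc > chunk01.mrc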

Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Robert Fox
Assuming that memory won't be an issue, you could use MARC::Batch to
read in the record set and print out separate files, splitting on
every X records. You would have an iterative loop loading each
record from the large batch, and a counter variable that would get
reset after X records. You might want to name the sets using
another counter that keeps track of how many sets you have, name
each file something like batch_$count.mrc, and write them out to a
specific directory. Just concatenate each record onto the previous one
when you're making your smaller batches.
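
A minimal sketch of that approach (untested; the 1,000-record chunk
size and the batch_$count.mrc naming are only illustrative):

use strict;
use warnings;
use MARC::Batch;

my $chunk = 1000;                      # records per output file
my $batch = MARC::Batch->new('USMARC', $ARGV[0]);

my ($num, $count) = (0, 0);
my $out;
while (my $record = $batch->next()) {
    if ($num % $chunk == 0) {          # time to start a new set
        close $out if $out;
        $count++;
        open $out, '>', "batch_$count.mrc"
            or die "Cannot open batch_$count.mrc: $!";
    }
    print {$out} $record->as_usmarc(); # append record to the current set
    $num++;
}
close $out if $out;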

Rob Fox
Hesburgh Libraries
University of Notre Dame

On Jan 25, 2010, at 9:48 AM, "Nolte, Jennifer"  
 wrote:

> Hello-
>
> I am working with files of MARC records that are over a million  
> records each. I'd like to split them down into smaller chunks,  
> preferably using a command line. MARCedit works, but is slow and  
> made for the desktop. I've looked around and haven't found anything  
> truly useful- Endeavor's MARCsplit comes close but doesn't separate  
> files into even numbers, only by matching criteria, so there could  
> be lots of record duplication between files.
>
> Any idea where to begin? I am a (super) novice Perl person.
>
> Thank you!
>
> ~Jenn Nolte
>
>
> Jenn Nolte
> Applications Manager / Database Analyst
> Production Systems Team
> Information Technology Office
> Yale University Library
> 130 Wall St.
> New Haven CT 06520
> 203 432 4878
>
>


Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Sébastien Hinderer
Hi,

The yaz-marcdump utility may be what you are looking for.
See for instance options -s and -C.
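
For example, something like this should write chunks of 10,000 records
to numbered files starting with the prefix "chunk" (untested; check the
man page, since flags and output naming vary between versions):

  yaz-marcdump -i marc -o marc -s chunk -C 10000 big.mrc > /dev/null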

hth,
Shérab.


RE: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Walker, David
> yaz-marcdump allows you to break a 
> MARC file into chunks of X records

+1 

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu

From: Colin Campbell [colin.campb...@ptfs-europe.com]
Sent: Monday, January 25, 2010 7:08 AM
To: perl4lib@perl.org
Subject: Re: Splitting a large file of MARC records into smaller files

On 25/01/10 14:48, Nolte, Jennifer wrote:
> Hello-
>
> I am working with files of MARC records that are over a million records each. 
> I'd like to split them down into smaller chunks, preferably using a command 
> line. MARCedit works, but is slow and made for the desktop. I've looked 
> around and haven't found anything truly useful- Endeavor's MARCsplit comes 
> close but doesn't separate files into even numbers, only by matching 
> criteria, so there could be lots of record duplication between files.

If you have Index Data's yaz installed, the yaz-marcdump program allows
you to break a MARC file into chunks of X records. See man yaz-marcdump
for details.

Cheers
Colin


--
Colin Campbell
Chief Software Engineer,
PTFS Europe Limited
Content Management and Library Solutions
+44 (0) 208 366 1295 (phone)
+44 (0) 7759 633626  (mobile)
colin.campb...@ptfs-europe.com
skype: colin_campbell2

http://www.ptfs-europe.com

Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Ashley Sanders

Jennifer,


> I am working with files of MARC records that are over a million records each.
> I'd like to split them down into smaller chunks, preferably using a command
> line. MARCedit works, but is slow and made for the desktop. I've looked around
> and haven't found anything truly useful- Endeavor's MARCsplit comes close but
> doesn't separate files into even numbers, only by matching criteria, so there
> could be lots of record duplication between files.
>
> Any idea where to begin? I am a (super) novice Perl person.


Well... if you have a *nix style command line and the usual
utilities and your file of MARC records is in exchange format
with the records just delimited by the end-of-record character
0x1d, then you could do something like this:

tr '\035' '\n' < my-marc-file.mrc > recs.txt
split -l 1000 recs.txt

The tr command will turn the MARC end-of-record characters
into newlines. Then use the split command to carve up
the output of tr into files of 1000 records.

You then may have to use tr to convert the newlines back
to MARC end-of-record characters.
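
For example (a rough sketch; split names its output files xaa, xab, ...):

for f in x??; do tr '\n' '\035' < "$f" > "$f.mrc"; done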

Ashley.

--
Ashley Sanders   a.sand...@manchester.ac.uk
Copac http://copac.ac.uk A Mimas service funded by JISC


RE: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Smith,Devon
This isn't a perl solution, but it may work for you.

You can use the unix split command to split a file into several other
files with the same number of lines each. For that to work, you'll first
have to use tr to convert the 0x1D (^]) record separators into newlines. Then
use tr to convert them all back in each split file.

#> tr '\035' '\n' < filename > filename.nl
#> split -l $lines_per_file filename.nl SPLIT
#> for file in SPLIT*; do tr '\n' '\035' < $file > $file.rs; done

Or something like that.

/dev
-- 

Devon Smith
Consulting Software Engineer
OCLC Office of Research



-Original Message-
From: Nolte, Jennifer [mailto:jennifer.no...@yale.edu] 
Sent: Monday, January 25, 2010 9:48 AM
To: perl4lib@perl.org
Subject: Splitting a large file of MARC records into smaller files

Hello-

I am working with files of MARC records that are over a million records
each. I'd like to split them down into smaller chunks, preferably using
a command line. MARCedit works, but is slow and made for the desktop.
I've looked around and haven't found anything truly useful- Endeavor's
MARCsplit comes close but doesn't separate files into even numbers, only
by matching criteria, so there could be lots of record duplication
between files.

Any idea where to begin? I am a (super) novice Perl person.

Thank you!

~Jenn Nolte


Jenn Nolte
Applications Manager / Database Analyst
Production Systems Team
Information Technology Office
Yale University Library
130 Wall St.
New Haven CT 06520
203 432 4878






Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Colin Campbell

On 25/01/10 14:48, Nolte, Jennifer wrote:

> Hello-
>
> I am working with files of MARC records that are over a million records each.
> I'd like to split them down into smaller chunks, preferably using a command
> line. MARCedit works, but is slow and made for the desktop. I've looked around
> and haven't found anything truly useful- Endeavor's MARCsplit comes close but
> doesn't separate files into even numbers, only by matching criteria, so there
> could be lots of record duplication between files.


If you have Index Data's yaz installed, the yaz-marcdump program allows
you to break a MARC file into chunks of X records. See man yaz-marcdump
for details.


Cheers
Colin


--
Colin Campbell
Chief Software Engineer,
PTFS Europe Limited
Content Management and Library Solutions
+44 (0) 208 366 1295 (phone)
+44 (0) 7759 633626  (mobile)
colin.campb...@ptfs-europe.com
skype: colin_campbell2

http://www.ptfs-europe.com


RE: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Houghton,Andrew
> From: Nolte, Jennifer [mailto:jennifer.no...@yale.edu]
> Sent: Monday, January 25, 2010 09:48 AM
> To: perl4lib@perl.org
> Subject: Splitting a large file of MARC records into smaller files
> 
> Hello-
> 
> I am working with files of MARC records that are over a million records
> each. I'd like to split them down into smaller chunks, preferably using
> a command line. MARCedit works, but is slow and made for the desktop.
> I've looked around and haven't found anything truly useful- Endeavor's
> MARCsplit comes close but doesn't separate files into even numbers,
> only by matching criteria, so there could be lots of record duplication
> between files.
> 
> Any idea where to begin? I am a (super) novice Perl person.

I use the following handy script I created many, many years ago.  Consider
it to be in the public domain.

#!perl
#
# Usage:
#perl MARC21-split.pl [-d#] [-n#] [-pPrefix] [-sSuffix] *.marc
#
#perl MARC21-split.pl -d3 -n10000 -pbib -s.marc *.marc
#
#  Creates files with a three-digit sequence number that have 10,000
#  records per file: bib001.marc, bib002.marc, etc.
#
# Options:
#-d   number of digits for sequence number
#-n   number of records per file
#-p   prefix text before sequence number
#-s   suffix text after sequence number
#


package main;   # The current package name
require 5.003;  # The current package requires Perl v5.003 or later.

BEGIN { unshift(@INC,'.') }

use Carp;   # Perl package, see documentation

my $PACKAGE = 'main';


## VARIABLES #


my $crlf = "\n";  # ASCII newline.

my $recd = "\x1D";# MARC21 record   delimiter.
my $fldd = "\x1E";# MARC21 field    delimiter.
my $subd = "\x1F";# MARC21 subfield delimiter.


## INLINE CODE #


# Change Perl's default record delimiter.
$/ = $recd;

# Set defaults for command line options.
my $recs   = 1;
my $digits = 2;
my $prefix = '';
my $suffix = '.mrc';

# Initialize total record count to zero.
my $total = 0;

print STDERR join("\r\nARG=",'',@ARGV),"\r\n";

# Process command line.
foreach $FileMARC (@ARGV) {
  my $FileOUT = undef;

  # Process command line options.
  if ($FileMARC =~ m/^[\-][Dd]/) {
$FileMARC =~ s/^[\-][Dd]//;

    if (($digits = $FileMARC) !~ m/\d+/ || $digits == 0) {
  $digits = 1;
}

next;

  } elsif ($FileMARC =~ m/^[\-][Nn]/) {
$FileMARC =~ s/^[\-][Nn]//;

if (($recs = $FileMARC) !~ m/\d+/ || $recs == 0) {
  $recs = 1;
}

next;

  } elsif ($FileMARC =~ m/^[\-][Pp]/) {
$FileMARC =~ s/^[\-][Pp]//;
$prefix   = $FileMARC;
next;

  } elsif ($FileMARC =~ m/^[\-][Ss]/) {
$FileMARC =~ s/^[\-][Ss]//;
$suffix   = $FileMARC;
next;
  }

  # Open file from command line.
  open(MARC,'<'.$FileMARC) ||
croak("$PACKAGE:: Cannot open input file '$FileMARC': $!");

  # Count each record in the file.
  my $count = 0;

  while (<MARC>) {

# Open new output file when necessary.
if (($total % $recs) == 0) {
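      # Build a sprintf pattern such as '%s%03u%s' from the digit count.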
  my $pattern = sprintf('%%s%%0%uu%%s',int($digits));
  
  $FileOUT = sprintf($pattern,$prefix,($total/$recs)+1,$suffix);

  # Open output file.
  open(OUT,'>'.$FileOUT) ||
croak("$PACKAGE:: Cannot open output file '$FileOUT': $!");
}

print OUT $_; ++$total;

# Close output file when full.
if (($total % $recs) == 0) {
  # Close file from command line.
  close(OUT);
}

++$count;
  }

  # Close file from command line.
  close(MARC);

  # Output total records in file and file name.
  print STDERR join("\t",$count,$FileMARC),$crlf;
}

# Output total record count and file count.
print STDERR join("\t",$total,"Total Records"),$crlf;
print STDERR join("\t",int(($total + $recs - 1)/$recs),"Total Files"),$crlf;





Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Emmanuel Di Pretoro
Hi,

A long time ago, I've written the following :

--- snippet ---
#!/usr/bin/env perl

use strict;
use warnings;

use MARC::File::USMARC;
use MARC::Record;

use Getopt::Long;

my $config = { output => 'input' };

GetOptions($config, 'input=s', 'chunk=i', 'output=s', 'max=i');

if (not exists $config->{input} or not exists $config->{chunk}) {
    die "Usage: $0 --input file --chunk size [--output file] [--max n]\n";
}

run($config->{input}, $config->{output}, $config->{chunk}, $config->{max});

sub run {
    my ($input, $output, $chunk, $max) = @_;

    my $marcfile = MARC::File::USMARC->in($input);

    # 'input' means: derive output names from the input file name.
    my $fh = $output eq 'input' ? create_file($input) : create_file($output);
    my $cpt   = 0;    # records in the current chunk
    my $total = 0;    # records processed overall
    while (my $record = $marcfile->next) {
        $total++;
        last if defined $max and $total > $max;

        # Current chunk is full: close it and start the next one.
        if ($cpt >= $chunk) {
            close $fh;
            $fh = $output eq 'input' ? create_file($input)
                                     : create_file($output);
            $cpt = 0;
        }

        print $fh $record->as_usmarc;
        $cpt++;
    }
    close $fh;
}

sub create_file {
    my ($output) = @_;
    my $cpt = 0;

    # Find the first unused file name of the form output.NNN.
    my $filename = sprintf('%s.%03d', $output, $cpt++);
    while (-e $filename) {
        $filename = sprintf('%s.%03d', $output, $cpt++);
    }

    open my $fh, '>', $filename
        or die "Cannot open output file '$filename': $!";
    return $fh;
}
--- snippet ---
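
If saved as, say, marc_chunk.pl (a name of my choosing), it would be run
like this, writing big.mrc.000, big.mrc.001, and so on to the current
directory:

  perl marc_chunk.pl --input big.mrc --chunk 1000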

Hope this helps.

Emmanuel Di Pretoro

2010/1/25 Nolte, Jennifer 

> Hello-
>
> I am working with files of MARC records that are over a million records
> each. I'd like to split them down into smaller chunks, preferably using a
> command line. MARCedit works, but is slow and made for the desktop. I've
> looked around and haven't found anything truly useful- Endeavor's MARCsplit
> comes close but doesn't separate files into even numbers, only by matching
> criteria, so there could be lots of record duplication between files.
>
> Any idea where to begin? I am a (super) novice Perl person.
>
> Thank you!
>
> ~Jenn Nolte
>
>
> Jenn Nolte
> Applications Manager / Database Analyst
> Production Systems Team
> Information Technology Office
> Yale University Library
> 130 Wall St.
> New Haven CT 06520
> 203 432 4878
>
>
>