Eric Lease Morgan wrote:
Since my original implementation is still the fastest and the newer
implementations do not improve the application's speed, I must assume
that the process is slow because of the XSLT transformations
themselves. These transformations are straightforward:
# transform the document and save it
my $doc = $parser->parse_file($file);
my $results = $stylesheet->transform($doc);
my $html_file = "$HTML_DIR/$id.html";
open OUT, "> $html_file" or die "Can't open $html_file: $!";
print OUT $stylesheet->output_string($results);
close OUT;
# convert the HTML to plain text and save it
my $html = parse_htmlfile($html_file);
my $text_file = "$TEXT_DIR/$id.txt";
open OUT, "> $text_file" or die "Can't open $text_file: $!";
print OUT $formatter->format($html);
close OUT;
Can you save some time by not re-parsing the HTML file? I haven't used
the HTML-parsing feature of LibXML, but doesn't it produce exactly the
same kind of XML document object? If so, you already have a copy in
$results from the first part of the code, so you shouldn't need to go
back and re-parse the file you just created: $html should be identical,
or at least functionally identical, to $results.
I don't know whether you are already doing this, but you might be able
to save a lot of time if you don't re-parse documents and stylesheets,
re-instantiate XML parsers, and so forth. Ideally, you would call
XML::LibXML->new and XML::LibXSLT->new once at the beginning of the
script, immediately followed by parsing the stylesheet once into a
$stylesheet object that you can then apply to each document in the
batch. You can then parse each source XML document once and perform all
of your operations on it in one pass. Your script could then look
something vaguely like:
my $xml_parser = XML::LibXML->new;
my $xslt_parser = XML::LibXSLT->new;
my $xslt_doc = $xml_parser->parse_file ('stylesheet.xsl');
my $stylesheet = $xslt_parser->parse_stylesheet ($xslt_doc);
foreach my $file (@files_to_process) {
# Parse the document
my $original_doc = $xml_parser->parse_file ($file);
# Transform to HTML
my $html_doc = $stylesheet->transform ($original_doc);
my $html_file = my_filenaming_algorithm ($file);
$html_doc->toFile ($html_file);
# Transform the newly-transformed HTML (or XHTML) to plain text
(my $text_file = $html_file) =~ s/\.html$/.txt/;  # or however you name it
open TEXT_OUT, "> $text_file" or die "Can't open $text_file: $!";
print TEXT_OUT $formatter->format ($html_doc);
close TEXT_OUT;
# Grab selected information from the TEI header
# (the header lives in the source TEI document, not in the HTML output)
my ($header) = $original_doc->findnodes('//teiHeader');
my $author = $header->findvalue('fileDesc/titleStmt/author');
my $title = $header->findvalue('fileDesc/titleStmt/title');
my $id = $header->findvalue('fileDesc/publicationStmt/idno');
do_something_with_my_data ($author, $title, $id);
}
That way, you only instantiate a parser once, you only parse the
XML->HTML stylesheet once, and you only parse each XML document once. I
don't know how much of this you are doing already, but eliminating
unnecessary parsing could speed things up a fair bit.
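One way to confirm where the time actually goes, rather than assuming it
is the XSLT step, is to wrap each stage of the loop in a small timer
using the core Time::HiRes module. A minimal sketch (the stage names and
the timed() helper are just illustrative, not part of any module):

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Accumulate per-stage wall-clock totals so that, after the batch,
# you can see whether parsing, transforming, or formatting dominates.
my %elapsed;

# Run a code ref, add its elapsed time to the named stage's total,
# and pass its return value through.
sub timed {
    my ($stage, $code) = @_;
    my $start  = [gettimeofday];
    my @result = $code->();
    $elapsed{$stage} += tv_interval($start);
    return wantarray ? @result : $result[0];
}

# Usage inside the loop, e.g.:
#   my $doc  = timed(parse     => sub { $xml_parser->parse_file($file) });
#   my $html = timed(transform => sub { $stylesheet->transform($doc) });
#
# ...then after the batch:
#   printf "%-10s %.2f s\n", $_, $elapsed{$_} for sort keys %elapsed;
```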
I think the speed of the XSLT process depends a lot on how complex the
stylesheet is. I have a script that parses XML documents and creates
secondary XML documents which contain a small subset of the original
data (with some fields amalgamated and otherwise massaged) and it takes
maybe 20-40 minutes to batch about 4 or 5 gigabytes of data in about
8500 files. My original documents are quite large and numerous, but the
derived documents are only about 1 KB or so, and the structure of the
original is reasonably simple. The stylesheet itself is only about 100
lines, though the stylesheet rules do seem to include a lot of 'or'
clauses. I don't know how complex your input files and
transformations are compared to mine, or how fast your computer is, but
96 seconds to process 1.5 MB does seem a little slow compared to what I
am getting.
I hope some of that helps.
--
William Wueppelmann
Electronic Systems Specialist
Canadian Institute for Historical Microreproductions (CIHM)
http://www.canadiana.org