Eric Lease Morgan wrote:

Since my original implementation is still the fastest, and the newer
implementations do not improve the speed of the application, then I must
assume that the process is slow because of the XSLT transformations
themselves. These transformations are straight-forward:

# transform the document and save it
my $doc = $parser->parse_file($file);
my $results = $stylesheet->transform($doc);
my $html_file = "$HTML_DIR/$id.html";
open OUT, "> $html_file";
print OUT $stylesheet->output_string($results);
close OUT;
# convert the HTML to plain text and save it
my $html = parse_htmlfile($html_file);
my $text_file = "$TEXT_DIR/$id.txt";
open OUT, "> $text_file";
print OUT $formatter->format($html);
close OUT;

Can you save some time by not re-parsing the HTML file? I haven't used the parse HTML feature of LibXML, but doesn't it produce the exact same kind of XML Document object? If so, you already have a copy in $results in the first part of the code, so you shouldn't need to go back and re-parse the file you just created, since $html should be identical, or at least functionally identical, to $results.


I don't know whether or not you are already doing this, but you might be able to save a lot of time if you don't re-parse documents and stylesheets, re-instantiate XML parsers, and so forth. Ideally, you would call XML::LibXML->new and XML::LibXSLT->new once at the beginning of the script, immediately followed by creating a $stylesheet that contains the parsed stylesheet that you can then apply to each document in the batch. You can then parse each source XML document once and perform all of your operations on it in one go. Your script could then look something vaguely like:

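use XML::LibXML;
use XML::LibXSLT;

# @files_to_process and $formatter are assumed to be set up elsewhere.

# Create the parsers and compile the stylesheet once, outside the loop
my $xml_parser  = XML::LibXML->new;
my $xslt_parser = XML::LibXSLT->new;

my $xslt_doc   = $xml_parser->parse_file ('stylesheet.xsl');
my $stylesheet = $xslt_parser->parse_stylesheet ($xslt_doc);

foreach my $file (@files_to_process) {
        # Parse the document
        my $original_doc = $xml_parser->parse_file ($file);

        # Transform to HTML; output_file honours the stylesheet's
        # xsl:output settings, which a plain toFile would ignore
        my $html_doc     = $stylesheet->transform ($original_doc);
        my $html_file    = my_filenaming_algorithm ($file);
        $stylesheet->output_file ($html_doc, $html_file);

        # Transform the newly-transformed HTML (or XHTML) to plain text.
        # This assumes $formatter accepts a LibXML document; if it needs
        # some other kind of tree, this step will not work as written.
        my $text_file = my_text_filenaming_algorithm ($file);  # hypothetical helper
        open (my $text_out, '>', $text_file)
                or die "Cannot write $text_file: $!";
        print $text_out $formatter->format ($html_doc);
        close $text_out;

        # Grab selected information from the TEI header. The header lives
        # in the source document, not in the transformed HTML.
        my ($header) = $original_doc->findnodes ('//teiHeader');
        my $author = $header->findvalue ('fileDesc/titleStmt/author');
        my $title  = $header->findvalue ('fileDesc/titleStmt/title');
        my $id     = $header->findvalue ('fileDesc/publicationStmt/idno');
        do_something_with_my_data ($author, $title, $id);
}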

That way, you only instantiate a parser once, you only parse the XML->HTML stylesheet once, and you only parse each XML document once. I don't know how much of this you are doing already, but eliminating unnecessary parsing could speed things up a fair bit.
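
Before restructuring anything, it may be worth measuring where the time actually goes, since I am only guessing that the parsing is the expensive part. Below is a minimal sketch of a per-stage timer using only the core Time::HiRes module. The names timed, report and %elapsed are my own inventions for illustration, and the two stages at the bottom are placeholder work so that the sketch runs on its own.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my %elapsed;    # stage name => total wall-clock seconds

# Run a code reference, add its wall-clock time to the named stage,
# and pass its return value through to the caller.
sub timed {
    my ($stage, $code) = @_;
    my $start  = [gettimeofday];
    my @result = $code->();
    $elapsed{$stage} += tv_interval ($start);
    return wantarray ? @result : $result[0];
}

# Print the accumulated time per stage, slowest first.
sub report {
    for my $stage (sort { $elapsed{$b} <=> $elapsed{$a} } keys %elapsed) {
        printf "%-10s %8.3f s\n", $stage, $elapsed{$stage};
    }
}

# Placeholder work standing in for the real parse and transform calls.
my $sum = timed (parse => sub { my $n = 0; $n += $_ for 1 .. 100_000; $n });
timed (transform => sub { my $s = join ',', 1 .. 50_000; length $s });

report ();
```

In the loop above, the transform call would become something like my $html_doc = timed (transform => sub { $stylesheet->transform ($original_doc) }); with the parse and plain-text steps wrapped the same way and report () called once after the loop.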

I think the speed of the XSLT process depends a lot on how complex the stylesheet is. I have a script that parses XML documents and creates secondary XML documents which contain a small subset of the original data (with some fields amalgamated and otherwise massaged) and it takes maybe 20-40 minutes to batch about 4 or 5 gigabytes of data in about 8500 files. My original documents are quite large and numerous, but the derived documents are only about 1 KB or so, and the structure of the original is reasonably simple. The stylesheet itself is only about 100 lines, though the stylesheet rules do seem to include a lot of 'or' clauses in them. I don't know how complex your input files and transformations are compared to mine, or how fast your computer is, but 96 seconds to process 1.5 MB does seem a little slow compared to what I am getting.
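
To put rough numbers on that comparison, here is a small sketch that turns both sets of figures into megabytes per second. It assumes 1 GB is 1000 MB and takes the two ends of my own ranges as the slowest and fastest cases. The mb_per_sec helper is just a name I made up for the example.

```perl
use strict;
use warnings;

# Throughput in megabytes per second.
sub mb_per_sec {
    my ($megabytes, $seconds) = @_;
    return $megabytes / $seconds;
}

# My batch: 4-5 GB in 20-40 minutes (assuming 1 GB = 1000 MB).
my $mine_slow = mb_per_sec (4000, 40 * 60);    # least data, most time
my $mine_fast = mb_per_sec (5000, 20 * 60);    # most data, least time

# Your batch: 1.5 MB in 96 seconds.
my $yours = mb_per_sec (1.5, 96);

printf "mine:  %.2f to %.2f MB/s\n", $mine_slow, $mine_fast;
printf "yours: %.4f MB/s\n", $yours;
printf "ratio: %.0fx to %.0fx\n", $mine_slow / $yours, $mine_fast / $yours;
```

On those assumptions the gap is somewhere around two orders of magnitude, which is more than I would expect from stylesheet complexity or hardware alone.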

I hope some of that helps.

--
William Wueppelmann
Electronic Systems Specialist
Canadian Institute for Historical Microreproductions (CIHM)
http://www.canadiana.org


