Eric Lease Morgan wrote:
Since my original implementation is still the fastest and the newer
implementations do not improve the application's speed, I must assume
that the process is slow because of the XSLT transformations
themselves. These transformations are straightforward:
# transform the document and save it
my $doc = $parser->parse_file($file);
my $results = $stylesheet->transform($doc);
my $html_file = "$HTML_DIR/$id.html";
open OUT, "> $html_file" or die "Can't open $html_file: $!";
print OUT $stylesheet->output_string($results);
close OUT;
# convert the HTML to plain text and save it
my $html = parse_htmlfile($html_file);
my $text_file = "$TEXT_DIR/$id.txt";
open OUT, "> $text_file" or die "Can't open $text_file: $!";
print OUT $formatter->format($html);
close OUT;
Can you save some time by not re-parsing the HTML file? I haven't used
the HTML-parsing feature of LibXML, but doesn't it produce exactly the
same kind of XML document object? If so, you already have a copy in
$results from the first part of the code, so you shouldn't need to go
back and re-parse the file you just created: $html should be identical,
or at least functionally identical, to $results.
I don't know whether you are already doing this, but you might be able
to save a lot of time if you don't re-parse documents and stylesheets,
re-instantiate XML parsers, and so forth. Ideally, you would call
XML::LibXML->new and XML::LibXSLT->new once at the beginning of the
script, immediately followed by parsing the stylesheet once into a
$stylesheet object that you can then apply to each document in the
batch. You can then parse each source XML document once and perform all
of your operations on it in one pass. Your script could then look
something vaguely like:
my $xml_parser = XML::LibXML->new;
my $xslt_parser = XML::LibXSLT->new;
my $xslt_doc = $xml_parser->parse_file ('stylesheet.xsl');
my $stylesheet = $xslt_parser->parse_stylesheet ($xslt_doc);
foreach my $file (@files_to_process) {
# Parse the document
my $original_doc = $xml_parser->parse_file ($file);
# Transform to HTML
my $html_doc = $stylesheet->transform ($original_doc);
my $html_file = my_filenaming_algorithm ($file);
$html_doc->toFile ($html_file);
# Transform the newly-transformed HTML (or XHTML) to plain text
(my $text_file = $html_file) =~ s/\.html$/.txt/;  # or however you name it
open TEXT_OUT, "> $text_file" or die "Can't open $text_file: $!";
print TEXT_OUT $formatter->format ($html_doc);
close TEXT_OUT;
# Grab selected information from the TEI header
# (the header lives in the source TEI document, not in the HTML output)
my ($header) = $original_doc->findnodes('//teiHeader');
my $author = $header->findvalue('fileDesc/titleStmt/author');
my $title = $header->findvalue('fileDesc/titleStmt/title');
my $id = $header->findvalue('fileDesc/publicationStmt/idno');
do_something_with_my_data ($author, $title, $id);
}
That way, you only instantiate a parser once, you only parse the
XML->HTML stylesheet once, and you only parse each XML document once. I
don't know how much of this you are doing already, but eliminating
unnecessary parsing could speed things up a fair bit.
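One way to confirm where the time actually goes, rather than assuming it
is the XSLT step, is to wrap each stage of the loop in a small timer
using the core Time::HiRes module. A minimal sketch (the stage names and
the timed() helper are just illustrative, not part of any module):

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Accumulate per-stage wall-clock totals so that, after the batch,
# you can see whether parsing, transforming, or formatting dominates.
my %elapsed;

# Run a code ref, add its elapsed time to the named stage's total,
# and pass its return value through.
sub timed {
    my ($stage, $code) = @_;
    my $start  = [gettimeofday];
    my @result = $code->();
    $elapsed{$stage} += tv_interval($start);
    return wantarray ? @result : $result[0];
}

# Usage inside the loop, e.g.:
#   my $doc  = timed(parse     => sub { $xml_parser->parse_file($file) });
#   my $html = timed(transform => sub { $stylesheet->transform($doc) });
#
# ...then after the batch:
#   printf "%-10s %.2f s\n", $_, $elapsed{$_} for sort keys %elapsed;
```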
I think the speed of the XSLT process depends a lot on how complex the
stylesheet is. I have a script that parses XML documents and creates
secondary XML documents which contain a small subset of the original
data (with some fields amalgamated and otherwise massaged) and it takes
maybe 20-40 minutes to batch about 4 or 5 gigabytes of data in about
8500 files. My original documents are quite large and numerous, but the
derived documents are only about 1 KB or so, and the structure of the
original is reasonably simple. The stylesheet itself is only about 100
lines, though the stylesheet rules do seem to include a lot of 'or'
clauses. I don't know how complex your input files and
transformations are compared to mine, or how fast your computer is, but
96 seconds to process 1.5 MB does seem a little slow compared to what I
am getting.
I hope some of that helps.
--
William Wueppelmann
Electronic Systems Specialist
Canadian Institute for Historical Microreproductions (CIHM)
http://www.canadiana.org