Re: More Fun With MARC::File::XML: Solutions

2006-05-28 Thread Joshua Ferraro
Hi everyone,

Just providing an update on this issue. As you may recall, I've
been putting the MARC::Record suite, specifically MARC::File::XML
and MARC::Charset, through some fairly rigorous tests, including
a 'roundtrip' test, which converts the binary MARC-8 records to
MARCXML / UTF-8 and then back to binary MARC but encoded as UTF-8.
This test is available here:

http://liblime.com/public/roundtrip.pl

I discovered a number of bugs or issues, not in the MARC::* stuff, but in the
back-end SAX parsers. I'll just summarize my discoveries here for 
posterity:

1. MARC::File::XML, if it encounters an unmapped encoding in a
MARC-8 encoded binary MARC file (in as_xml()), will drop the entire
subfield where the improper encoding exists. The simple solution is
to always call MARC::Charset->ignore_errors(1); if you expect your
data to contain improper encoding.

2. The XML::SAX::PurePerl parser cannot properly handle combining
characters. I've reported this bug here:

http://rt.cpan.org/Public/Bug/Display.html?id=19543

At the suggestion of several people, I tried replacing my default system
parser with expat, which caused another problem:

3. Handing valid UTF-8 encoded XML to new_from_xml() sometimes
causes the entire record to be destroyed when using XML::SAX::Expat
as the parser (with PurePerl these seem to work). It fails with
the error:

not well-formed (invalid token) at line 23, column 43, byte 937 at 
/usr/lib/perl5/XML/Parser.pm line 187

I haven't been able to track down the cause of this bug. I eventually
found a workaround that didn't result in the above error but instead
silently mangled the resulting binary MARC record on the way out:

4. Using incompatible versions of XML::SAX::LibXML and libxml2 will
cause binary MARC records to be mangled when passed through new_from_xml()
in some cases. The solution here is to make sure you're running
compatible versions of XML::SAX::LibXML and libxml2. I run Debian
Sarge, and when I just used the package maintainer's versions it
fixed the bug. It's unclear to me why the binary MARC would be
mangled. This may indicate a problem with MARC::*, but I haven't
had time to track it down, and since installing compatible versions
of the parser back-end solves the problem, I can only assume it's
the fault of the incompatible parsers.
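
For issue #1, a minimal sketch of the ignore_errors workaround might
look like the following. It assumes MARC::Batch, MARC::Charset and
MARC::File::XML are installed; the sub name marc8_file_to_xml is made
up for illustration and is not part of roundtrip.pl.

```perl
use strict;
use warnings;
use MARC::Batch;
use MARC::Charset;
use MARC::File::XML;

# Ask MARC::Charset to ignore encoding errors, so an unmapped MARC-8
# character does not cost us the whole subfield during conversion.
MARC::Charset->ignore_errors(1);

# Hypothetical helper: returns one MARCXML document per record in $path.
sub marc8_file_to_xml {
    my ($path) = @_;
    my $batch = MARC::Batch->new('USMARC', $path);
    my @docs;
    while (my $record = $batch->next) {
        push @docs, $record->as_xml();
    }
    return @docs;
}

# Convert every file named on the command line; does nothing without arguments.
print "$_\n" for map { marc8_file_to_xml($_) } @ARGV;
```

Note that ignore_errors is a class-level setting, so it applies to every
conversion in the process, not just to one record.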

Issues #3 and #4 above can be replicated by running the following batch
of records through the roundtrip.pl script above:

http://liblime.com/public/several.mrc

If you want to test #2, try running this record through roundtrip.pl:

http://liblime.com/public/combiningchar.mrc

BTW: you can change your default SAX parser by editing the .ini file ... 
mine is located in /usr/local/share/perl/5.8.4/XML/SAX/ParserDetails.ini
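
If you would rather not edit the .ini file, XML::SAX also lets a script
pick its parser for the current process only. The sketch below lists
the registered parsers and then optionally overrides the choice. As far
as I know the factory falls back to the last entry in ParserDetails.ini
when nothing is overridden, and this only helps if MARC::File::XML gets
its parser through XML::SAX::ParserFactory; treat both as assumptions.
The script name pick_parser.pl is hypothetical.

```perl
use strict;
use warnings;
use XML::SAX;
use XML::SAX::ParserFactory;

# Parsers registered in ParserDetails.ini, in file order.
print "Registered: $_->{Name}\n" for @{ XML::SAX->parsers };

# Optional per-process override, e.g.:
#   perl pick_parser.pl XML::SAX::LibXML
$XML::SAX::ParserPackage = $ARGV[0] if @ARGV;

# Ask the factory which parser it would actually hand out.
my $parser = XML::SAX::ParserFactory->parser;
print "Selected: ", ref($parser), "\n";
```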

So the bottom line is, if you want to use MARC::File::XML in any
serious application, you've got to use compatible versions of the
libxml2 parser and XML::SAX::LibXML. Check the README in the perl
package for documentation on which are compatible...

Maybe a note somewhere in the MARC::File::XML documentation to point 
these issues out would be useful. Also, it wouldn't be too bad to have
a few tests to make sure that the system's default SAX parser is capable
of handling these cases. Just my two cents. 
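
A rough sketch of what such a check could look like, using only the
XML::SAX API: it feeds the default parser a tiny UTF-8 document that
contains "e" followed by U+0301 (combining acute accent) and reports
whether the same two characters come back. The TextCollector class and
the sub name are invented for this example; a real test in the
distribution would more likely use Test::More and an actual MARCXML
record.

```perl
use strict;
use warnings;
use XML::SAX::ParserFactory;

# Hypothetical helper class: collects all character data the parser reports.
{
    package TextCollector;
    use base 'XML::SAX::Base';
    sub characters {
        my ($self, $data) = @_;
        $self->{text} .= $data->{Data};
    }
}

# Returns 1 if the default SAX parser hands back "e" + U+0301 unchanged,
# 0 otherwise (including when the parser dies on the input).
sub default_parser_keeps_combining_chars {
    # \xCC\x81 is the UTF-8 byte sequence for U+0301.
    my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n<a>e\xCC\x81</a>};
    my $handler = TextCollector->new;
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
    my $parsed  = eval { $parser->parse_string($xml); 1 };
    return 0 unless $parsed && defined $handler->{text};
    return $handler->{text} eq "e\x{0301}" ? 1 : 0;
}

my $result = default_parser_keeps_combining_chars();
print $result ? "ok" : "not ok",
    " - default SAX parser preserves combining characters\n";
```

Wrapping the parse in eval means a "not well-formed" failure like the
one in #3 is reported as "not ok" instead of killing the script.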

Cheers,

--
Joshua Ferraro   VENDOR SERVICES FOR OPEN-SOURCE SOFTWARE
President, Technology   migration, training, maintenance, support
LibLime   Featuring Koha Open-Source ILS
[EMAIL PROTECTED] |Full Demos at http://liblime.com/koha |1(888)KohaILS


Re: More Fun With MARC::File::XML: Solutions

2006-05-28 Thread Edward Summers

On May 28, 2006, at 2:25 PM, Joshua Ferraro wrote:

> Maybe a note somewhere in the MARC::File::XML documentation to point
> these issues out would be useful. Also, it wouldn't be too bad to have
> a few tests to make sure that the system's default SAX parser is capable
> of handling these cases. Just my two cents.


Great idea. Since you have rights to CVS and CPAN, perhaps you could
do this :-)


//Ed