Please find attached the file I'm trying to parse. It is extracted from an OAI Data Provider in oai_dc format. The challenge is to preserve the Thai characters encoded in UTF-8.

I see these are the results of OAI-PMH GetRequests. If you like, you can use the SAX handler in Net::OAI::Harvester directly to extract record objects, like so:

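  #!/usr/bin/perl

  use strict;
  use warnings;
  use XML::SAX::ParserFactory;
  use Net::OAI::Record::OAI_DC;

  # the file to parse is the first command-line argument
  my $file = shift or die "usage: $0 <file>\n";

  # the record object doubles as the SAX handler
  my $factory = XML::SAX::ParserFactory->new();
  my $record = Net::OAI::Record::OAI_DC->new();
  my $parser = $factory->parser(Handler => $record);

  # parse the file
  $parser->parse_uri($file);

  # print out the title; this assumes the parser hands back decoded
  # character strings, so STDOUT re-encodes them as UTF-8
  binmode(STDOUT, ':encoding(UTF-8)');
  print $record->title(), "\n";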

The script takes a filename as its argument and prints the record's title. For info about UTF-8 in Perl, your best bet is to read about it in the Camel book (imho). As for a UTF-8-safe MARC::Record, I believe it's not on CPAN yet, although you can get it from SourceForge. Andy Lester manages the CPANification of MARC::Record.
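
On the UTF-8 point, here is a minimal sketch of the bytes-versus-characters distinction that usually causes the trouble. It uses only the core Encode module. The helper names utf8_lengths and round_trip are made up for illustration, and the sample string is just the Thai word for "Thai" written out as raw bytes, not anything taken from your file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return the byte length and the character length
# of a string holding raw UTF-8 bytes.
sub utf8_lengths {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);    # bytes -> Perl characters
    return (length($bytes), length($chars));
}

# Hypothetical helper: decode to characters and encode back to bytes.
# For valid UTF-8 input the result equals the input.
sub round_trip {
    my ($bytes) = @_;
    return encode('UTF-8', decode('UTF-8', $bytes));
}

# The Thai word for "Thai" (U+0E44 U+0E17 U+0E22) as raw UTF-8 bytes.
my $thai = "\xE0\xB9\x84\xE0\xB8\x97\xE0\xB8\xA2";

my ($nbytes, $nchars) = utf8_lengths($thai);
print "$nbytes bytes, $nchars characters\n";
print "round trip ok\n" if round_trip($thai) eq $thai;
```

This prints "9 bytes, 3 characters" followed by "round trip ok". The rule of thumb is to decode on the way in, work with characters, and encode on the way out (or let an :encoding(UTF-8) layer on the filehandle do it for you).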

XSL is the logical choice for transforming one version of XML to another. However, if you need to parse XML to stuff rows into a database, it isn't that logical... at least for me.

//Ed
