Re: Problem displaying characters with diacritics in the vi editor (the octal encoding is displayed)

2005-06-15 Thread William Wueppelmann
I haven't got much experience with Solaris, SunOS, or AIX specifically, 
but I would start by checking the locales in use on the different 
machines by running "locale" or "echo $LC_ALL". If they differ, I would 
suggest changing the locale on the machine that isn't displaying 
properly.
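
If it's easier to compare from inside Perl, a quick throwaway script 
(just a sketch; it reports only what is explicitly set in the 
environment, roughly what "locale" shows) would be:

#!/usr/bin/perl
# Print the locale-related environment variables on this machine.
for my $var (qw(LANG LC_ALL LC_CTYPE LC_COLLATE LC_MESSAGES)) {
    printf "%s=%s\n", $var, defined $ENV{$var} ? $ENV{$var} : "(unset)";
}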


Also, you might want to check that the version of vi is the same on all 
of the machines. I wonder if "vi" is actually a link to vim or another 
vi-like editor on the machines that display the diacritics, while the 
original, classic vi is being run on the ones that display octal codes. 
I haven't used classic vi in a long time, but I do recall that it has a 
more limited ability to display certain characters. (Also, even if you 
are running vim, vim defaults to vi-compatible mode when it doesn't 
detect a .vimrc file at startup, so another thing to check is whether 
the systems that display the characters have a .vimrc file while the 
others lack one.)
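
A quick way to check which editor "vi" really is (a sketch; running 
ls -l on the output of "which vi" from the shell does the same thing):

#!/usr/bin/perl
# Report whether "vi" on this machine is a symlink to something else,
# e.g. vim. Assumes "which" is on the PATH.
chomp(my $vi = `which vi`);
if (-l $vi) {
    printf "%s -> %s\n", $vi, readlink($vi);
} else {
    print "$vi is not a symlink\n";
}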


I don't know if this will solve the problem for you, but that's where I 
would start looking.


Cheers.


William Wueppelmann
Canadiana.org
(formerly Canadian Institute for Historical Microreproductions)
http://www.canadiana.org


[EMAIL PROTECTED] wrote:

Can anyone help me on this one?

On some of our Solaris & AIX servers I can edit accented characters
(Latin-1) in the vi editor without problems; on others, the octal
encoding of the accented characters is displayed instead (which is a
pain when you're editing Aleph config files for a French interface).


In the latter case, the accented character displays OK on the Unix
command line.

I've checked various settings in the .profile files, but can't work out
what's making the difference.

Any tips would be much appreciated.

Ian H.

E.g. on one machine:  é entered  =>  "\351" displayed
(OS=SunOS 5.8, .profile includes 'set -o vi', TERM=xterm)
On another machine:   é entered  =>  "é" displayed
(OS=SunOS 5.8, same settings as above)
Ian Hamilton 
Library Systems Administrator
European Commission - Directorate General for Education and Culture 
EAC C4 - Central Library Unit 
Tel: +32-2-295.24.60 (direct phone)
Fax: +32-2-299.91.89

http://europa.eu.int/comm/dgs/education_culture/index_en.htm
http://europa.eu.int/comm/libraries/index.htm
http://europa.eu.int/eclas/





Re: Separating index terms

2005-05-31 Thread William Wueppelmann




Saiful Amin wrote:

Hi all,

I'm doing some data cleaning for MARC data. The MARC export I got from a
legacy system has field 653 in the following format:

$a <keyword1><keyword2><keyword3>...

I want to create a repeatable field, separating out the individual index
terms.

my $tag_653 = $record->field('653');
$record->delete_field($tag_653);

my $keywords = $tag_653->subfield('a');

/*** I'm lost here (How do I separate keywords trapped in '<>' ?)**/ 


I'm not sure I completely understand what you're after, but if all you 
want to do is create an array of elements, where each element is the 
text inside the <> delimiters, you could do this:


# Get rid of everything up to and including the first "<"
$keywords =~ s/^.*?<//;

# Get rid of everything from the last ">" to the end of the line
$keywords =~ s/>[^>]*$//;

# We now have fields delimited by "><", so we can just split at that
# delimiter:
my @keywords = split('><', $keywords);
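
And to finish the job, putting each term back into the record as its own
repeatable 653, something like this should work (untested sketch; blank
indicators are an assumption, so adjust if your data uses something
else):

use MARC::Field;

foreach my $term (@keywords) {
    $record->append_fields(
        MARC::Field->new('653', ' ', ' ', a => $term)
    );
}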



--
William Wueppelmann
Canadiana.org
(formerly Canadian Institute for Historical Microreproductions)
http://www.canadiana.org


Re: baffling perl/linux problem

2004-06-23 Thread William Wueppelmann
Jon Legree wrote:
THE QUESTION:
Is there anything in the system environment (Red Hat linux 7.1, perl 5.6.0 -
upgraded to 5.6.1 in an attempt to fix the problem, apache 1.3.x) that would
suddenly cause grep or regular expression functions, or simple file
access/reading functions to stop working or not work right? This system has
worked perfectly for 3 years, and nothing has been changed recently.
It is possible that certain environment variables, such as the locale 
variables (LANG and LC_*: you can get a list of them along with their 
current values by running `locale'), might change how certain regular 
expressions behave by changing what counts as a letter, though I am not 
absolutely certain of this. Some Linux distributions ship with the 
locale set to C or POSIX; I think Red Hat defaults to a human language 
like en_US. But I can't really see how this would cause the behaviour 
you are describing.
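
If you want to see the effect concretely, here is a rough illustration 
(the fr_FR locale name is an assumption; use whatever Latin-1 locale 
the machine actually has installed):

#!/usr/bin/perl
# With "use locale", what \w matches depends on LC_CTYPE.
use strict;
use locale;
use POSIX qw(locale_h);

my $word = "caf\xE9";    # "café" in Latin-1

setlocale(LC_CTYPE, "C");
print "C locale:  ", ($word =~ /^\w+$/ ? "matches" : "no match"), "\n";

setlocale(LC_CTYPE, "fr_FR");
print "fr locale: ", ($word =~ /^\w+$/ ? "matches" : "no match"), "\n";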

I would be inclined to look through the code to find where the 
authentication is done. It is possible that a bug in the code is 
interacting with some recent additions or changes to your data files. 
If you have an older backup copy of the card numbers and guest ids, you 
could try to reproduce the bug using the older data set to determine 
whether the problem is, perhaps, the interaction of a code bug with 
some new data.

--
William Wueppelmann
Electronic Systems Specialist
Canadian Institute for Historical Microreproductions (CIHM)
http://www.canadiana.org/


Re: STDIN as well as command line input

2004-04-26 Thread William Wueppelmann
Michael McDonnell wrote:

A popular GNUism might be helpful here as well.  Many GNU programs use 
an option command line argument of "--" to indicate that input should be 
taken from STDIN instead of from other command line arguments.
'--' is usually used to mean that anything that follows is a non-option 
argument. For example, to remove a file called '-rf', you could use:

rm -- -rf

to force rm to interpret '-rf' as a filename instead of a pair of options.

'-' is usually used as a placeholder for STDIN. Unix programs generally 
read from STDIN if no files are specified on the command line. If one or 
more files are specified on the command line, '-' is usually taken to 
mean STDIN, so

cat foo - bar < baz

would list the contents of foo, then baz, then bar.

But what Eric seems to want to do is supply *data* either from a stream 
on STDIN or directly on the command line. This is a little 
unconventional, and you have to remember that the data will have to be 
parsed differently depending on where it comes from. For example, 
"foo.pl a b c d e" produces 5 elements in @ARGV, while "echo a b c d e | 
foo.pl" produces one element (with a newline appended) that has to be 
parsed and split into elements at the correct places. This can get 
messy if you have escape characters, whitespace as data, and so on.

I would suggest that the standard way to do this sort of thing would be 
to read *all* data from STDIN (and, optionally, from files specified on 
the command line) and treat each line as one record, if your data never 
contains newlines. If you had some data you wanted to pass manually to 
the program, you could always invoke it as:

foo.pl
plato-cratylus-1072532262
plato-charmides-1072462708
bacon-new-1072751992
^D
or

echo "plato-cratylus-1072532262
plato-charmides-1072462708
bacon-new-1072751992" | foo.pl
Either that, or supply all data on the command line, and use the 
backticks method as Michael suggests. Mixing the two conventions might 
lead to confusion later on, especially if someone else will end up 
having to use the program.
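
For what it's worth, the read-from-STDIN-or-files version is only a few 
lines; here's a sketch (foo.pl itself is hypothetical):

#!/usr/bin/perl
# One record per line, from STDIN or from any files named on the
# command line; the diamond operator handles both cases.
use strict;
use warnings;

while (my $record = <>) {
    chomp $record;
    next unless length $record;
    print "got record: $record\n";    # stand-in for real processing
}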

--
William Wueppelmann
Electronic Systems Specialist
Canadian Institute for Historical Microreproductions (CIHM)
http://www.canadiana.org/


Re: Extracting data from an XML file

2004-01-06 Thread William Wueppelmann
Eric Lease Morgan wrote:

Since my original implementation is still the fastest, and the newer
implementations do not improve the speed of the application, then I must
assume that the process is slow because of the XSLT transformations
themselves. These transformations are straight-forward:
  # transform the document and save it
  my $doc   = $parser->parse_file($file);
  my $results   = $stylesheet->transform($doc);
  my $html_file = "$HTML_DIR/$id.html";
  open OUT, "> $html_file";
  print OUT $stylesheet->output_string($results);
  close OUT;
  
  # convert the HTML to plain text and save it
  my $html  = parse_htmlfile($html_file);
  my $text_file = "$TEXT_DIR/$id.txt";
  open OUT, "> $text_file";
  print OUT $formatter->format($html);
  close OUT;
Can you save some time by not re-parsing the HTML file? I haven't used 
the HTML-parsing feature of LibXML, but doesn't it produce the exact 
same kind of XML document object? If so, you already have a copy in 
$results from the first part of the code, so you shouldn't need to go 
back and re-parse the file you just created, since $html should be 
identical, or at least functionally identical, to $results.
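
If that's right, and assuming $formatter is happy to take the document 
object directly (I haven't verified that), the middle step might 
collapse to:

# Speculative: reuse $results instead of re-parsing the HTML file
print OUT $formatter->format($results);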

I don't know whether or not you are already doing this, but you might be 
able to save a lot of time if you don't re-parse documents and 
stylesheets, re-instantiate XML parsers, and so forth. Ideally, you 
would call XML::LibXML->new and XML::LibXSLT->new once at the beginning 
of the script, immediately followed by creating a $stylesheet that 
contains the parsed stylesheet that you can then apply to each document 
in the batch. You can then parse each source XML document once and 
perform all of your operations on it in one go. Your script could then 
look something vaguely like:

my $xml_parser  = XML::LibXML->new;
my $xslt_parser = XML::LibXSLT->new;
my $xslt_doc    = $xml_parser->parse_file('stylesheet.xsl');
my $stylesheet  = $xslt_parser->parse_stylesheet($xslt_doc);

foreach my $file (@files_to_process) {
    # Parse the document
    my $original_doc = $xml_parser->parse_file($file);

    # Transform to HTML
    my $html_doc  = $stylesheet->transform($original_doc);
    my $html_file = my_filenaming_algorithm($file);
    $html_doc->toFile($html_file);

    # Transform the newly-transformed HTML (or XHTML) to plain text,
    # assuming $formatter can take the document object as discussed above
    my $text_file = my_filenaming_algorithm($file) . '.txt';
    open TEXT_OUT, ">$text_file" or die "can't write $text_file: $!";
    print TEXT_OUT $formatter->format($html_doc);
    close TEXT_OUT;

    # Grab selected information from the TEI header of the *original*
    # document (the header won't normally survive the HTML transform)
    my ($header) = $original_doc->findnodes('//teiHeader');
    my $author   = $header->findvalue('fileDesc/titleStmt/author');
    my $title    = $header->findvalue('fileDesc/titleStmt/title');
    my $id       = $header->findvalue('fileDesc/publicationStmt/idno');
    do_something_with_my_data($author, $title, $id);
}
That way, you only instantiate a parser once, you only parse the 
XML->HTML stylesheet once, and you only parse each XML document once. I 
don't know how much of this you are doing already, but eliminating 
unnecessary parsing could speed things up a fair bit.

I think the speed of the XSLT process depends a lot on how complex the 
stylesheet is. I have a script that parses XML documents and creates 
secondary XML documents which contain a small subset of the original 
data (with some fields amalgamated and otherwise massaged) and it takes 
maybe 20-40 minutes to batch about 4 or 5 gigabytes of data in about 
8500 files. My original documents are quite large and numerous, but the 
derived documents are only about 1 KB or so, and the structure of the 
original is reasonably simple. The stylesheet itself is only about 100 
lines, though the stylesheet rules do seem to include a lot of 'or' 
clauses in them. I don't know how complex your input files and 
transformations are compared to mine, or how fast your computer is, but 
96 seconds to process 1.5 MB does seem a little slow compared to what I 
am getting.

I hope some of that helps.

--
William Wueppelmann
Electronic Systems Specialist
Canadian Institute for Historical Microreproductions (CIHM)
http://www.canadiana.org


Re: perl OOP question

2003-06-26 Thread William Wueppelmann
Andy Lester wrote:
Is there any way I can just say:

use ISO::types::ILLString;
my $s = new ILLString("This is a string");


First, what you're talking about isn't object related.  It's just 
package functions.

You need to look at the Exporter module.  Basically, you want to do this 
(off the top of my head):

package ISO::types::ILLString;

use Exporter;
our @ISA    = ('Exporter');
our @EXPORT = qw( ILLString );

sub ILLString {
# blah blah blah
}
1;

Now, whenever you say "use ISO::types::ILLString", you get ILLString 
imported into your namespace.
There is a second way to do this. You can also say:

use Exporter;
our @ISA       = ('Exporter');
our @EXPORT_OK = qw( ILLString );

This means that ILLString isn't automatically imported into the 
namespace, but it will be if, in your Perl program, you use the 
following syntax:

use ISO::types::ILLString ('ILLString');

The advantage of doing it this way is that it is the program, not the 
module, that decides whether or not to import the name. This is useful 
because you may already have ILLString defined elsewhere and don't want 
to overwrite it. It can be a little nasty if a module imports symbols 
into your main namespace that you don't know about and then you spend 
hours chasing down bugs when 'ILLString' doesn't refer to what you think 
it does anymore. I think that this is now the standard practice for 
non-core Perl modules. From the Exporter Perldoc page:

As a general rule, if the module is trying to be object
oriented then export nothing. If it's just a collection of
functions then @EXPORT_OK anything but use @EXPORT with
caution.
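
To put it all together, here is a compact sketch of the @EXPORT_OK 
style end to end (the function body is a stand-in):

package ISO::types::ILLString;
use strict;
use Exporter;
our @ISA       = ('Exporter');
our @EXPORT_OK = qw( ILLString );

sub ILLString {
    my ($string) = @_;
    # blah blah blah: construct whatever an ILLString is
    return $string;
}
1;

# ...and in the calling program:
use ISO::types::ILLString ('ILLString');   # the import is explicit
my $s = ILLString("This is a string");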
Cheers.

--
William Wueppelmann
Electronic Systems Specialist
Canadian Institute for Historical Microreproductions (CIHM)