Re: Problem displaying characters with diacritics in the vi editor (the octal encoding is displayed)
I haven't got much experience using Solaris, SunOS, or AIX specifically, but I would start by checking the locale in use on the different machines by running "locale" or "echo $LC_ALL". If they are different, I would suggest changing the locale on the machine that isn't displaying properly.

Also, you might want to check that the version of vi is the same on the machines. I wonder if the problem might be that "vi" is actually a link to vim or another vi-like editor on the machines that display the diacritics, while the original, classic vi is being run on the ones that display the octal codes. I haven't used classic vi in a long time, but I do recall that it has a more limited ability to display certain characters. (Also, even if you are running vim, if vim doesn't detect a .vimrc file at startup, it defaults to running in vi-compatible mode, so another thing to check might be whether you have a .vimrc file on the system that displays the characters but lack one on the other.)

I don't know if this will solve the problem for you, but that's where I would start looking.

Cheers.

William Wueppelmann
Canadiana.org (formerly Canadian Institute for Historical Microreproductions)
http://www.canadiana.org

[EMAIL PROTECTED] wrote:

Can anyone help me on this one? On some of our Solaris & AIX servers I can edit accented characters (Latin-1) in the vi editor without problems; on others the octal encoding of the accented characters is displayed instead (which is a pain when you're editing Aleph config files for a French interface). In the latter case the accented character displays OK on the Unix command line. I've checked various settings in the .profile files, but can't work out what's making the difference. Any tips would be much appreciated.

Ian H.

E.g. On one machine: é entered => "\351" displayed (OS=SunOS 5.8, .profile includes 'set -o vi', TERM=xterm)
On another: é entered => "é" displayed (OS=SunOS 5.8, idem above)

Ian Hamilton
Library Systems Administrator
European Commission - Directorate General for Education and Culture
EAC C4 - Central Library Unit
+32-2-295.24.60 (direct phone)
+32-2-299.91.89 (fax)
http://europa.eu.int/comm/dgs/education_culture/index_en.htm
http://europa.eu.int/comm/libraries/index.htm
http://europa.eu.int/eclas/
Re: Separating index terms
Saiful Amin wrote:

Hi all,

I'm doing some data cleaning for MARC data. The MARC export I got from a legacy system has field 653 in the following format:

    $a <keyword 1><keyword 2><keyword 3>

I want to create a repeatable field separating the individual index terms.

    my $tag_653 = $record->field('653');
    $record->delete_field('653');
    my $keywords = $tag_653->subfield('a');
    # I'm lost here (How do I separate keywords trapped in '<>'?)

I'm not sure I completely understand what you're after, but if all you want to do is create an array of elements, where each element is the text inside the <> delimiters, you could do this:

    # Get rid of everything up to and including the first "<"
    $keyword =~ s/^.*?<//;

    # Get rid of everything from the last ">" to the end of the line
    $keyword =~ s/>[^>]*$//;

    # We now have fields delimited by "><", so we can just split at that
    # delimiter:
    my @keywords = split ('><', $keyword);

--
William Wueppelmann
Canadiana.org (formerly Canadian Institute for Historical Microreproductions)
http://www.canadiana.org
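P.S. For what it's worth, here is a rough sketch of how the whole round trip might look with MARC::Record, assuming $record already holds a parsed MARC::Record object; the blank indicators are just a guess about how your records are structured, so adjust to taste:

    use MARC::Record;
    use MARC::Field;

    # $record is assumed to be an existing MARC::Record object
    my $tag_653 = $record->field('653');
    if ($tag_653) {
        my $keyword = $tag_653->subfield('a');

        # Strip the leading "<" and trailing ">" and split on "><"
        $keyword =~ s/^.*?<//;
        $keyword =~ s/>[^>]*$//;
        my @keywords = split ('><', $keyword);

        # Remove the original field and add one 653 per term
        $record->delete_field($tag_653);
        foreach my $term (@keywords) {
            $record->append_fields(
                MARC::Field->new('653', ' ', ' ', a => $term)
            );
        }
    }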
Re: baffling perl/linux problem
Jon Legree wrote:

THE QUESTION: Is there anything in the system environment (Red Hat Linux 7.1, perl 5.6.0 - upgraded to 5.6.1 in an attempt to fix the problem, apache 1.3.x) that would suddenly cause grep or regular expression functions, or simple file access/reading functions, to stop working or not work right? This system has worked perfectly for 3 years, and nothing has been changed recently.

It is possible that certain environment variables, such as the locale variables (LANG and LC_*: you can get a list of them along with their current values by running `locale'), might change how certain regular expressions behave by changing what counts as a letter, though I am not absolutely certain of this. Some Linux distributions ship with the locale set to C or POSIX; I think Red Hat defaults to a human language like en_US. But I can't really see how this would cause the behaviour you are describing. (There is a small illustration of the locale effect at the end of this message.)

I would be inclined to look through the code to try to find where the authentication is done. It is possible that there is a bug in the code that is interacting with some recent additions or changes to your data files. If you have an older backup copy of card numbers and guest IDs, you might be able to see if you can reproduce the bug using the older data set, to see if it is, perhaps, the interaction of a code bug with some new data.

--
William Wueppelmann
Electronic Systems Specialist
Canadian Institute for Historical Microreproductions (CIHM)
http://www.canadiana.org/
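P.S. Here is the kind of thing I mean about locales changing what counts as a letter. This is purely illustrative, and the locale names vary from system to system (check `locale -a' for what your machine actually has):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use locale;                       # make \w, \b, etc. honour the locale
    use POSIX qw(setlocale LC_ALL);

    my $word = "caf\xe9";             # "café" in Latin-1

    # In the C locale, \w+ stops at the accented character ("caf");
    # in a Latin-1 locale it matches the whole word ("café").
    for my $loc ('C', 'en_US.ISO8859-1') {
        setlocale(LC_ALL, $loc);
        my ($match) = $word =~ /(\w+)/;
        printf "%-18s \\w+ matches: %s\n", $loc, $match;
    }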
Re: STDIN as well as command line input
Michael McDonnell wrote:

A popular GNUism might be helpful here as well. Many GNU programs use an option command line argument of "--" to indicate that input should be taken from STDIN instead of from other command line arguments.

'--' is usually used to mean that anything that follows is a non-option argument. For example, to remove a file called '-rf', you could use:

    rm -- -rf

to force rm to interpret '-rf' as a filename instead of a pair of options.

'-' is usually used as a placeholder for STDIN. Unix programs generally read from STDIN if no files are specified on the command line. If one or more files are specified on the command line, '-' is usually taken to mean STDIN, so:

    cat foo - bar < baz

would list the contents of foo, then baz, then bar.

But what Eric seems to want to do is supply *data* either from a stream on STDIN or directly on the command line. This is a little unconventional, and you have to remember that the data will have to be parsed differently depending on where it comes from. For example, "foo.pl a b c d e" produces 5 elements in @ARGV, while "echo a b c d e | foo.pl" produces one element (with a newline appended) that has to be parsed and split into elements at the correct places. This can get messy if you have escape characters, whitespace as data, and so on.

I would suggest that the standard way to do this sort of thing would be to read *all* data from STDIN (and, optionally, from files specified on the command line) and treat each line as one record, if your data never contains newlines. If you had some data you wanted to pass manually to the program, you could always invoke it as:

    foo.pl
    plato-cratylus-1072532262
    plato-charmides-1072462708
    bacon-new-1072751992
    ^D

or:

    echo "plato-cratylus-1072532262
    plato-charmides-1072462708
    bacon-new-1072751992" | foo.pl

Either that, or supply all data on the command line, and use the backticks method as Michael suggests. Mixing the two conventions might lead to confusion later on, especially if someone else will end up having to use the program.

--
William Wueppelmann
Electronic Systems Specialist
Canadian Institute for Historical Microreproductions (CIHM)
http://www.canadiana.org/
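P.S. A minimal sketch of the read-everything-from-STDIN-or-files approach, in case it helps; the record handling here is only a placeholder for whatever foo.pl actually needs to do:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # The diamond operator reads line by line from any files named on the
    # command line, or from STDIN if no files (or '-') are given.
    while (my $line = <>) {
        chomp $line;
        next unless $line =~ /\S/;    # skip blank lines
        process_record($line);        # placeholder: one record per line
    }

    sub process_record {
        my ($record) = @_;
        print "Got record: $record\n";
    }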
Re: Extracting data from an XML file
Eric Lease Morgan wrote:

Since my original implementation is still the fastest, and the newer implementations do not improve the speed of the application, I must assume that the process is slow because of the XSLT transformations themselves. These transformations are straightforward:

    # transform the document and save it
    my $doc = $parser->parse_file($file);
    my $results = $stylesheet->transform($doc);
    my $html_file = "$HTML_DIR/$id.html";
    open OUT, "> $html_file";
    print OUT $stylesheet->output_string($results);
    close OUT;

    # convert the HTML to plain text and save it
    my $html = parse_htmlfile($html_file);
    my $text_file = "$TEXT_DIR/$id.txt";
    open OUT, "> $text_file";
    print OUT $formatter->format($html);
    close OUT;

Can you save some time by not re-parsing the HTML file? I haven't used the parse-HTML feature of LibXML, but doesn't it produce the exact same kind of XML document object? If so, you already have a copy in $results from the first part of the code, so you shouldn't need to go back and re-parse the file you just created, since $html should be identical, or at least functionally identical, to $results.

I don't know whether or not you are already doing this, but you might be able to save a lot of time if you don't re-parse documents and stylesheets, re-instantiate XML parsers, and so forth. Ideally, you would call XML::LibXML->new and XML::LibXSLT->new once at the beginning of the script, immediately followed by creating a $stylesheet that contains the parsed stylesheet, which you can then apply to each document in the batch. You can then parse each source XML document once and perform all of your operations on it in one go. Your script could then look something vaguely like:

    my $xml_parser  = XML::LibXML->new;
    my $xslt_parser = XML::LibXSLT->new;
    my $xslt_doc    = $xml_parser->parse_file('stylesheet.xsl');
    my $stylesheet  = $xslt_parser->parse_stylesheet($xslt_doc);

    foreach my $file (@files_to_process) {

        # Parse the document
        my $original_doc = $xml_parser->parse_file($file);

        # Transform to HTML
        my $html_doc  = $stylesheet->transform($original_doc);
        my $html_file = my_filenaming_algorithm($file);
        $html_doc->toFile($html_file);

        # Transform the newly-transformed HTML (or XHTML) to plain text
        my $text_file = my_filenaming_algorithm($file) . '.txt';  # however you derive the text filename
        open TEXT_OUT, "> $text_file";
        print TEXT_OUT $formatter->format($html_doc);
        close TEXT_OUT;

        # Grab selected information from the TEI header
        my ($header) = $html_doc->findnodes('teiHeader');
        my $author   = $header->findvalue('fileDesc/titleStmt/author');
        my $title    = $header->findvalue('fileDesc/titleStmt/title');
        my $id       = $header->findvalue('fileDesc/publicationStmt/idno');
        do_something_with_my_data($author, $title, $id);
    }

That way, you only instantiate a parser once, you only parse the XML-to-HTML stylesheet once, and you only parse each XML document once. I don't know how much of this you are doing already, but eliminating unnecessary parsing could speed things up a fair bit.

I think the speed of the XSLT process depends a lot on how complex the stylesheet is. I have a script that parses XML documents and creates secondary XML documents containing a small subset of the original data (with some fields amalgamated and otherwise massaged), and it takes maybe 20-40 minutes to batch about 4 or 5 gigabytes of data in about 8500 files. My original documents are quite large and numerous, but the derived documents are only about 1 KB or so, and the structure of the original is reasonably simple.
The stylesheet itself is only about 100 lines, though the stylesheet rules do seem to include a lot of 'or' clauses in them. I don't know how complex your input files and transformations are compared to mine, or how fast your computer is, but 96 seconds to process 1.5 MB does seem a little slow compared to what I am getting.

I hope some of that helps.

--
William Wueppelmann
Electronic Systems Specialist
Canadian Institute for Historical Microreproductions (CIHM)
http://www.canadiana.org
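P.S. If it would help to confirm that the XSLT step, rather than the parsing or the text conversion, is where the time actually goes, something along these lines could bracket each stage with Time::HiRes. This is only a sketch; $xml_parser, $stylesheet, $formatter and @files_to_process are borrowed from the example above:

    use Time::HiRes qw(gettimeofday tv_interval);

    my %elapsed;

    # Run a piece of code and accumulate its wall-clock time under a label
    sub timed {
        my ($label, $code) = @_;
        my $start  = [gettimeofday];
        my $result = $code->();
        $elapsed{$label} += tv_interval($start);
        return $result;
    }

    foreach my $file (@files_to_process) {
        my $original_doc = timed('parse',     sub { $xml_parser->parse_file($file) });
        my $html_doc     = timed('transform', sub { $stylesheet->transform($original_doc) });
        my $text         = timed('format',    sub { $formatter->format($html_doc) });
        # ...write $html_doc and $text to disk as before...
    }

    printf "%-10s %8.2f seconds\n", $_, $elapsed{$_} for sort keys %elapsed;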
Re: perl OOP question
Andy Lester wrote:

Is there any way I can just say:

    use ISO::types::ILLString;
    my $s = new ILLString("This is a string");

First, what you're talking about isn't object related. It's just package functions. You need to look at the Exporter module. Basically, you want to do this (off the top of my head):

    package ISO::types::ILLString;

    use Exporter;
    our @ISA    = qw( Exporter );
    our @EXPORT = qw( ILLString );

    sub ILLString {
        # blah blah blah
    }

    1;

Now, whenever you say "use ISO::types::ILLString", you get ILLString imported into your namespace.

There is a second way to do this. You can also say:

    use Exporter;
    our @EXPORT_OK = qw( ILLString );

This means that ILLString isn't automatically imported into the namespace, but it will be if, in your Perl program, you use the following syntax:

    use ISO::types::ILLString ('ILLString');

The advantage of doing it this way is that it is the program, not the module, that decides whether or not to import the name. This is useful because you may already have ILLString defined elsewhere and don't want to overwrite it. It can be a little nasty if a module imports symbols into your main namespace that you don't know about, and then you spend hours chasing down bugs when 'ILLString' doesn't refer to what you think it does anymore. I think that this is now the standard practice for non-core Perl modules. From the Exporter perldoc page:

    As a general rule, if the module is trying to be object oriented
    then export nothing. If it's just a collection of functions then
    @EXPORT_OK anything but use @EXPORT with caution.

Cheers.

--
William Wueppelmann
Electronic Systems Specialist
Canadian Institute for Historical Microreproductions (CIHM)
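P.S. Just to tie the @EXPORT_OK variant together, here is a minimal, self-contained sketch; the module path matches the example above, but the function body is only a placeholder for whatever ILLString really needs to do:

    # In ISO/types/ILLString.pm:
    package ISO::types::ILLString;

    use strict;
    use warnings;
    use Exporter;

    our @ISA       = qw( Exporter );
    our @EXPORT_OK = qw( ILLString );     # exported only on request

    # Placeholder: build whatever an ILL string actually is
    sub ILLString {
        my ($text) = @_;
        return $text;
    }

    1;

    # In the calling script:
    use ISO::types::ILLString ('ILLString');   # import it explicitly

    my $s = ILLString("This is a string");
    print "$s\n";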