Converting using a hash

2003-12-16 Thread Jan Eden
Hi,

sorry for the lengthy post.

I recently wrote a Perl script to convert 8-bit characters to LaTeX commands. The 
first version (which works just fine) looks like this (the ... indicates more lines to 
follow):

>#!/usr/bin/perl -pw
>
>s/„/{\\glqq}/g;
>s/“/{\\grqq}/g;
>s/á/\\'{a}/g;
>s/à/\\`{a}/g;
>s/â/\\^{a}/g;
>s/ä/\\"{a}/g;
>...

Now I tried to use a hash instead of consecutive replacement commands. The second 
version looked like this:

>#!/usr/bin/perl -w
>
>%enctabelle = ("„"=>"{\\glqq}",
>"“"=>"{\\grqq}",
>"á"=>"\\'{a}",
>"à"=>"\\`{a}",
>"â"=>"\\^{a}",
>...);
>
>while (<>) {
>$zeile = $_;
>foreach $char (keys %enctabelle) {
>$zeile =~ s/$char/$enctabelle{$char}/g;
>}
>print $zeile;
>}

This worked, too, but it was extremely slow, presumably because the patterns built from 
the variables were compiled over and over again.

I gave it a third try like this (code taken from someone else's script):

>%enctabelle = ("„"=>"{\\glqq}",
>"“"=>"{\\grqq}",
>"á"=>"\\'{a}",
>"à"=>"\\`{a}",
>"â"=>"\\^{a}",
>...);
>
>while (<>) {
>   s/(.)/exists $enctabelle{$1} ? $enctabelle{$1} : $1/geo;
>   print;
>}

This did not change the text at all. When I removed the ternary operator

>s/(.)/exists $enctabelle{$1}/g;

I got an error message like this:

>Line 208:  Use of uninitialized value in substitution iterator <> line 1.

Obviously, Perl cannot interpolate variable names like $enctabelle{ä}. Both the 
script and the file to convert are UTF-8 encoded. What's the problem here?

On another list, I got a rather complicated snippet I did not fully understand:

>#!perl
>
>%enctabelle = (...);
>
>my $re = '(' . join('|', map quotemeta($_), keys %enctabelle) . ')';
>$re = qr/$re/;
>
>while (<>) {
>  s/$re/$enctabelle{$1}/g;
>  print;
>}

Maybe the quotemeta part is what helps identify the corresponding value?

Any hints are greatly appreciated,

Jan
-- 
Hanlon's Razor: Never attribute to malice that which can be adequately explained by 
stupidity.

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 




Re: Converting using a hash

2003-12-16 Thread Jenda Krynicky
From: Jan Eden <[EMAIL PROTECTED]>
> I recently wrote a Perl script to convert 8-bit characters to LaTeX
> commands. The first version (which works just fine) looks like this
> (the ... indicates more lines to follow):
>
> >#!/usr/bin/perl -pw
> >
> >s/„/{\\glqq}/g;
> >s/“/{\\grqq}/g;
> >s/á/\\'{a}/g;
> >s/à/\\`{a}/g;
> >s/â/\\^{a}/g;
> >s/ä/\\"{a}/g;
> >...
>
> Now I tried to use a hash instead of consecutive replacement commands.
> The second version looked like this:
>
> >#!/usr/bin/perl -w
> >
> >%enctabelle = ("„"=>"{\\glqq}",
> >"“"=>"{\\grqq}",
> >"á"=>"\\'{a}",
> >"à"=>"\\`{a}",
> >"â"=>"\\^{a}",
> >...);
> >
> >while (<>) {
> >$zeile = $_;
> >foreach $char (keys %enctabelle) {
> >$zeile =~ s/$char/$enctabelle{$char}/g;
> >}
> >print $zeile;
> >}

You want something like this:

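my $re = join '|', keys %enctabelle;
while (<>) {
    # capture the match so that $1 holds the key that was found
    s/($re)/$enctabelle{$1}/go;
    print $_;
}
or

my $re = join '|', keys %enctabelle;
$re = qr/($re)/;    # compile once, with the capture group built in
while (<>) {
    s/$re/$enctabelle{$1}/g;
    print $_;
}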

Jenda
= [EMAIL PROTECTED] === http://Jenda.Krynicky.cz =
When it comes to wine, women and song, wizards are allowed
to get drunk and croon as much as they like.
-- Terry Pratchett in Sourcery


 




Re: Converting using a hash

2003-12-16 Thread John W. Krahn
Jan Eden wrote:
> 
> Hi,

Hello,

> sorry for the lengthy post.
> 
> I recently wrote a Perl script to convert 8-bit characters to LaTeX
> commands. The first version (which works just fine) looks like this
> (the ... indicates more lines to follow):

Your regular expressions look like they are longer than 8 bits.


> >#!/usr/bin/perl -pw
> >
> >s/â??/{\\glqq}/g;
> >s/â??/{\\grqq}/g;
> >s/á/\\'{a}/g;
> >s/Ã /\\`{a}/g;
> >s/â/\\^{a}/g;
> >s/ä/\\"{a}/g;
> >
> 
> Now I tried to use a hash instead of consecutive replacement commands.
> The second version looked like this:
> 
> >#!/usr/bin/perl -w
> >
> >%enctabelle = ("â??"=>"{\\glqq}",
> >"â??"=>"{\\grqq}",
> >"á"=>"\\'{a}",
> >"Ã "=>"\\`{a}",
> >"â"=>"\\^{a}",
> >
> >
> >while (<>) {
> >$zeile = $_;
> >foreach $char (keys %enctabelle) {
> >$zeile =~ s/$char/$enctabelle{$char}/g;
> >}
> >print $zeile;
> >}
> 
> This worked, too, but it was extremely slow, presumably because the patterns
> built from the variables were compiled over and over again.
> 
> I gave it a third try like this (code taken from someone else's script):
> 
> >%enctabelle = ("â??"=>"{\\glqq}",
> >"â??"=>"{\\grqq}",
> >"á"=>"\\'{a}",
> >"Ã "=>"\\`{a}",
> >"â"=>"\\^{a}",
> >
> >
> >while (<>) {
> >   s/(.)/exists $enctabelle{$1} ? $enctabelle{$1} : $1/geo;
> >   print;
> >}
> 
> This did not change the text at all. When I removed the ternary operator
> 
> >s/(.)/exists $enctabelle{$1}/g;
> 
> I got an error message like this:
> 
> >Line 208:  Use of uninitialized value in substitution iterator <> line 1.
> 
> Obviously, Perl cannot interpolate variable names like $enctabelle{ä}.
> Both the script and the file to convert are UTF-8 encoded. What's the problem here?

The problem is probably that (.) is matching a single byte, not a whole
UTF-8 character.

perldoc perlunicode
perldoc utf8
perldoc bytes
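
A minimal sketch of the difference, assuming UTF-8 input and the core
Encode module. The sample string and the one-entry table are made up for
illustration; they are not taken from the original script.

```perl
use strict;
use warnings;
use utf8;                        # this source file itself is UTF-8
use Encode qw(decode encode);

# Hypothetical one-entry table: a-umlaut maps to the LaTeX command \"{a}
my %enctabelle = ("ä" => "\\\"{a}");

# Raw bytes, as they arrive from a handle without a decoding layer.
my $bytes = encode('UTF-8', "Bär");

# Undecoded: (.) sees one byte at a time, so no hash key ever matches.
(my $raw = $bytes) =~ s/(.)/exists $enctabelle{$1} ? $enctabelle{$1} : $1/ge;

# Decoded: (.) sees whole characters and the lookup succeeds.
(my $text = decode('UTF-8', $bytes))
    =~ s/(.)/exists $enctabelle{$1} ? $enctabelle{$1} : $1/ge;

print "undecoded input ", ($raw eq $bytes ? "was left unchanged" : "was changed"), "\n";
print "decoded input became: $text\n";
```

When reading from standard input, calling binmode STDIN, ':encoding(UTF-8)'
before the loop should have the same effect as the explicit decode() here.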


> On another list, I got a rather complicated snippet I did not fully understand:
> 
> >#!perl
> >
> >%enctabelle = (...);
> >
> >my $re = '(' . join('|', map quotemeta($_), keys %enctabelle) . ')';
> >$re = qr/$re/;
> >
> >while (<>) {
> >  s/$re/$enctabelle{$1}/g;
> >  print;
> >}
> 
> Maybe the quotemeta part is what helps identify the corresponding value?
> 
> Any hints are greatly appreciated,

Do you want the fastest code?  The shortest code?  The most maintainable
code?  What are you trying to accomplish?


John
-- 
use Perl;
program
fulfillment

 




Re: Converting using a hash

2003-12-16 Thread Jeff 'japhy' Pinyan
On Dec 16, Jan Eden said:

>>#!perl
>>
>>%enctabelle = (...);
>>
>>my $re = '(' . join('|', map quotemeta($_), keys %enctabelle) . ')';
>>$re = qr/$re/;
>>
>>while (<>) {
>>  s/$re/$enctabelle{$1}/g;
>>  print;
>>}

Let me explain this for you, and fix it, too.

  # this produces 'key1|key2|key3|...'
  my $re = join '|',
    map quotemeta($_),                   # this escapes non-alphanumeric
                                         # characters;
    sort { length($b) <=> length($a) }   # sorts by length, biggest first
    keys %enctabelle;                    # the strings to encode

What this does is put the keys in a string, separated by a | (which means
"or" in a regex), quotemeta()d (which ensures any regex characters in them
are properly escaped), and ordered by length (longest to shortest).  That
last part is important:  if you have keys 'a' and 'ab', you want to try
matching 'ab' BEFORE you try matching 'a', or else 'ab' will NEVER be
matched.
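
As a rough illustration of both points, here is a sketch with a made-up
three-entry table (%table and its values are hypothetical, not part of the
original script). One key is a prefix of another, and one is a regex
metacharacter:

```perl
use strict;
use warnings;

# Hypothetical table: 'a' is a prefix of 'ab', and '.' is a regex metacharacter.
my %table = ('a' => '1', 'ab' => '2', '.' => '3');

# Shortest keys first: 'a' always wins before 'ab' gets a chance.
my $short_first = join '|',
    map quotemeta($_),
    sort { length($a) <=> length($b) or $a cmp $b }
    keys %table;

# Longest keys first: 'ab' is tried before 'a'.
my $long_first = join '|',
    map quotemeta($_),
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %table;

(my $wrong = 'ab.c') =~ s/($short_first)/$table{$1}/g;
(my $right = 'ab.c') =~ s/($long_first)/$table{$1}/g;

print "$wrong\n";   # 1b3c
print "$right\n";   # 23c
```

Without quotemeta() the '.' key would match any character, and the lookup
$table{$1} would then come back undefined for everything that is not a key.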

Then, we take $re, and turn it into a compiled regex:

  $re = qr/($re)/;

The qr// operator (in perlop) gives you a compiled regex; it's good for
efficiency in certain situations (although this isn't really one of them).

  $text =~ s/$re/$enctabelle{$1}/g;

That replaces all keys with their values.
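
Put together, the pieces might look like the sketch below. It assumes the
script is saved as UTF-8 and that the text has already been decoded. The
table is only an excerpt, and to_latex() is a hypothetical helper name used
for illustration.

```perl
use strict;
use warnings;
use utf8;                          # the keys below are UTF-8 literals

# Excerpt of the table; extend it by adding entries here.
my %enctabelle = (
    "„" => "{\\glqq}",
    "“" => "{\\grqq}",
    "á" => "\\'{a}",
    "à" => "\\`{a}",
    "â" => "\\^{a}",
    "ä" => "\\\"{a}",
);

# Build the alternation once: escaped keys, longest first.
my $alternatives = join '|',
    map quotemeta($_),
    sort { length($b) <=> length($a) }
    keys %enctabelle;
my $re = qr/($alternatives)/;

# Hypothetical helper: replace every key found in $text by its value.
sub to_latex {
    my ($text) = @_;
    $text =~ s/$re/$enctabelle{$1}/g;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print to_latex("„Hallo“, sagte der Bär."), "\n";
```

In the real filter, the last line would become a while (<>) loop that prints
to_latex($_) for each decoded input line.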

-- 
Jeff "japhy" Pinyan  [EMAIL PROTECTED]  http://www.pobox.com/~japhy/
RPI Acacia brother #734   http://www.perlmonks.org/   http://www.cpan.org/
 what does y/// stand for?   why, yansliterate of course.
[  I'm looking for programming work.  If you like my work, let me know.  ]


 




Re: Converting using a hash

2003-12-17 Thread Jan Eden
Hi all,

thanks a lot for all the responses. Jeff's explanation of the snippet I mentioned in 
my original message did the trick. The hash-based solution is much faster now, 
although the first attempt (using multiple replacements on standard input) is still 
the fastest.

To answer John's questions:

>Your regular expressions look like they are longer than 8 bits.
>
>> >#!/usr/bin/perl -pw
>> >
>> >s/Ã??/{\\glqq}/g;
>> >s/Ã??/{\\grqq}/g;

This is due to mail encodings, they are really 8-bit characters.

>Do you want the fastest code?  The shortest code?  The most maintainable
>code?  What are you trying to accomplish?

Since I already had a reasonably fast and short solution, I wanted a more maintainable 
one, where I could easily extend the range of characters by editing the hash.

The hash solution is still a little sluggish, but it's more elegant, I think.

Jeff 'japhy' Pinyan wrote:

>Let me explain this for you, and fix it, too.
>
>  # this produces 'key1|key2|key3|...'
>  my $re = join '|',
>    map quotemeta($_),                   # this escapes non-alphanumeric
>                                         # characters;
>    sort { length($b) <=> length($a) }   # sorts by length, biggest first
>    keys %enctabelle;                    # the strings to encode
>
>What this does is put the keys in a string, separated by a | (which means
>"or" in a regex), quotemeta()d (which ensures any regex characters in them
>are properly escaped), and ordered by length (longest to shortest).  That
>last part is important:  if you have keys 'a' and 'ab', you want to try
>matching 'ab' BEFORE you try matching 'a', or else 'ab' will NEVER be
>matched.
>
I omitted the sort command since all patterns consist of a single (8-bit) character, 
so I guess your caveat is not applicable. My original message was garbled (see above).

Thanks again,

Jan

 




Re: Converting using a hash

2003-12-17 Thread Jeff 'japhy' Pinyan
On Dec 17, Jan Eden said:

>thanks a lot for all the responses. Jeff's explanation of the snippet I
>mentioned in my original message did the trick. The hash-based solution
>is much faster now, although the first attempt (using multiple
>replacements on standard input) is still the fastest.

Well, the first solution might not always work.  Consider the following:

  my %replace = (
    brian => 'jones',
    on    => 'off',
  );

If we use the hash approach, then "this is on brian" will be "this is off
jones".  But if we were to do:

  s/brian/jones/g;
  s/on/off/g;

then we'd get "this is off joffes".

It's just something to keep in mind.
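
A small runnable sketch of the two behaviours side by side, using the hash
above (the variable names $single_pass and $sequential are only for
illustration):

```perl
use strict;
use warnings;

my %replace = (brian => 'jones', on => 'off');
my $text    = "this is on brian";

# Single pass over the input: replacement text is never scanned again.
my $alt = join '|',
    map quotemeta($_),
    sort { length($b) <=> length($a) }
    keys %replace;
(my $single_pass = $text) =~ s/($alt)/$replace{$1}/g;

# Sequential passes: the second substitution also hits the "on" in "jones".
(my $sequential = $text) =~ s/brian/jones/g;
$sequential =~ s/on/off/g;

print "$single_pass\n";   # this is off jones
print "$sequential\n";    # this is off joffes
```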

>I omitted the sort command since all patterns consist of a single (8-bit)
>character, so I guess your caveat is not applicable. My original message
>was garbled (see above).

Oh, ok.


