On Fri, Nov 16, 2001 at 06:39:46PM -0800, David Wheeler wrote:
> I just noticed that DBD::Pg uses POSIX, entirely for the purpose of
> using isprint() to replace non-printable characters with their numeric
> values. Here's what the code looks like:
> 
>   $str=join("", map { isprint($_)?$_:'\\'.sprintf("%03o",ord($_)) }
>             split //, $str);
> 
> Now, this seemed rather silly to me. I couldn't imagine that it was
> efficient, and I generally like to look for reasons to lose POSIX and
> its bloat. This is what I came up with:
> 
>   $str =~ s/([^ -~])/'\\' . sprintf("%03o", ord($1))/ge;
> 
> I ran a little benchmark comparing them, and this is what the results
> looked like:
> 
> Benchmark: timing 100000 iterations of Posix, Regex...
>      isprint: 17 wallclock secs (15.64 usr +  0.01 sys = 15.65 CPU) @ 6389.78/s 
>(n=100000)
>      regex:   2 wallclock secs ( 2.35 usr +  0.00 sys =  2.35 CPU) @ 42553.19/s 
>(n=100000)

The bloat lies not in isprint(), but in the whole join, map, split
thing.

I tried various combinations of your s///ge technique and the
join/map/split technique with isprint(), character classes and cached
hash lookups.  In the end, it's s///ge vs join/map/split that really
matters.  Benchmark and program below.

The winner is the combination of s///ge and caching.

foreach my $num (0..255) {
    my $chr = chr $num;
    $U2P{$chr} = $chr =~ tr/\x20-\x7F// ? $chr : '\\'.sprintf("%03o",$num);
}

sub u2p_dw_cached {
    my($str) = shift;
    $str =~ s/([^ -~])/$U2P{$1}/ge;
    return $str;
}

Things like URI encode/decode functions do the exact same thing.

The higher the ratio of unprintable characters to printable, the
better that does against the others.

u2p_dw* uses variations on s///ge.  u2p_pg* uses variations on
join/map/split.

This is using the test string "The quick brown fox\n Jumped over the
lazy grey dog\n".  The cached version is roughly 20% faster.

Benchmark: timing 50000 iterations of u2p_dw, u2p_dw_cached, u2p_dw_posix_re, u2p_pg, 
u2p_pgtr...
    u2p_dw:  4 wallclock secs ( 3.68 usr +  0.00 sys =  3.68 CPU) @ 13586.96/s 
(n=50000)
u2p_dw_cached:  4 wallclock secs ( 3.10 usr +  0.00 sys =  3.10 CPU) @ 16129.03/s 
(n=50000)
u2p_dw_posix_re:  4 wallclock secs ( 3.72 usr +  0.00 sys =  3.72 CPU) @ 13440.86/s 
(n=50000)
    u2p_pg: 34 wallclock secs (34.60 usr +  0.00 sys = 34.60 CPU) @ 1445.09/s (n=50000)
  u2p_pgtr: 34 wallclock secs (33.76 usr +  0.02 sys = 33.78 CPU) @ 1480.17/s (n=50000)

And this uses: "The\n qu\nick\r br\rown\t fox\n J\rumped\r over \nthe
lazy\r grey\n dog\n"  Here the cached version is roughly 30% faster.

Benchmark: timing 50000 iterations of control, u2p_dw, u2p_dw_cached, u2p_dw_posix_re, 
u2p_pg, u2p_pgtr...
    u2p_dw:  8 wallclock secs ( 9.56 usr +  0.00 sys =  9.56 CPU) @ 5230.13/s (n=50000)
u2p_dw_cached:  6 wallclock secs ( 7.15 usr +  0.01 sys =  7.16 CPU) @ 6983.24/s 
(n=50000)
u2p_dw_posix_re:  8 wallclock secs ( 9.34 usr +  0.00 sys =  9.34 CPU) @ 5353.32/s 
(n=50000)
    u2p_pg: 43 wallclock secs (43.65 usr +  0.00 sys = 43.65 CPU) @ 1145.48/s (n=50000)
  u2p_pgtr: 43 wallclock secs (42.57 usr +  0.03 sys = 42.60 CPU) @ 1173.71/s (n=50000)



>   Consider using regular expressions and the "/[[:isprint:]]/" construct
>   instead.
> 
> However, I couldn't figure out how to use this constuct -- there's no
> documentation on it that I can find.

It's in perlre as of 5.6.0.  [[:isprint:]] is the uft8 character
class.  You actually want [[:print:]].  It's negation is [[:cntrl:]].
u2p_dw_posix_re() uses it below.

DBD::Pg can't use [[:foo:]] if it wishes to work with anything before
5.6. :( Since isprint(), as for as POSIX is concered, is just anything
between x20 and x7F, the s/[^ -~//ge regex should work fine.


-- 

Michael G. Schwern   <[EMAIL PROTECTED]>    http://www.pobox.com/~schwern/
Perl Quality Assurance      <[EMAIL PROTECTED]>         Kwalitee Is Job One
shitting is a chore
wet glue enters my sphincter
I shall poop no more
        -- Schwern
foreach my $num (0..255) {
    my $chr = chr $num;
    $U2P{$chr} = $chr =~ tr/\x20-\x7F// ? $chr : '\\'.sprintf("%03o",$num);
}

sub u2p_dw_cached {
    my($str) = shift;
    $str =~ s/([^ -~])/$U2P{$1}/ge;
    return $str;
}

use POSIX qw(isprint);
sub u2p_pg {
    join("", map { isprint($_)? $_ : '\\'.sprintf("%03o",ord($_)) }
         split //, $_[0]);
}

sub u2p_pgtr {
    join("", map { tr/\x20-\x7F// ? $_ : '\\'.sprintf("%03o",ord($_)) }
         split //, $_[0]);
}

sub u2p_dw {
    my($str) = shift;
    $str =~ s/([^ -~])/'\\' . sprintf("%03o", ord($1))/ge;
    return $str;
}

sub u2p_dw_posix_re {
    my($str) = shift;
    $str =~ s/([[:cntrl:]])/'\\' . sprintf("%03o", ord($1))/ge;
    return $str;
}


my @subs = qw(u2p_dw_cached u2p_pg u2p_pgtr u2p_dw u2p_dw_posix_re);
 foreach my $sub (@subs) {
     print "$sub -> ".&{$sub}("\rfoo\n")."\n";
 }


use Benchmark;

$test_string = 'The\n qu\nick\r br\rown\t fox\n J\rumped\r over \nthe lazy\r grey\n 
dog\n';
timethese(shift || -3,
          { (map { ($_ => eval qq{sub { $_("$test_string") }}) } @subs),
            control => sub {},
          }
         );
           

Reply via email to