On Fri, Nov 16, 2001 at 06:39:46PM -0800, David Wheeler wrote:
> I just noticed that DBD::Pg uses POSIX, entirely for the purpose of
> using isprint() to replace non-printable characters with their numeric
> values. Here's what the code looks like:
>
> $str=join("", map { isprint($_)?$_:'\\'.sprintf("%03o",ord($_)) }
> split //, $str);
>
> Now, this seemed rather silly to me. I couldn't imagine that it was
> efficient, and I generally like to look for reasons to lose POSIX and
> its bloat. This is what I came up with:
>
> $str =~ s/([^ -~])/'\\' . sprintf("%03o", ord($1))/ge;
>
> I ran a little benchmark comparing them, and this is what the results
> looked like:
>
> Benchmark: timing 100000 iterations of Posix, Regex...
> isprint: 17 wallclock secs (15.64 usr + 0.01 sys = 15.65 CPU) @ 6389.78/s
>(n=100000)
> regex: 2 wallclock secs ( 2.35 usr + 0.00 sys = 2.35 CPU) @ 42553.19/s
>(n=100000)
The bloat lies not in isprint(), but in the whole join, map, split
thing.
I tried various combinations of your s///ge technique and the
join/map/split technique with isprint(), character classes and cached
hash lookups. In the end, it's s///ge vs join/map/split that really
matters. Benchmark and program below.
The winner is the combination of s///ge and caching.
foreach my $num (0..255) {
my $chr = chr $num;
$U2P{$chr} = $chr =~ tr/\x20-\x7F// ? $chr : '\\'.sprintf("%03o",$num);
}
sub u2p_dw_cached {
my($str) = shift;
$str =~ s/([^ -~])/$U2P{$1}/ge;
return $str;
}
Things like URI encode/decode functions do the exact same thing.
The higher the ratio of unprintable characters to printable, the
better that does against the others.
u2p_dw* uses variations on s///ge. u2p_pg* uses variations on
join/map/split.
This is using the test string "The quick brown fox\n Jumped over the
lazy grey dog\n". The cached version is roughly 20% faster.
Benchmark: timing 50000 iterations of u2p_dw, u2p_dw_cached, u2p_dw_posix_re, u2p_pg,
u2p_pgtr...
u2p_dw: 4 wallclock secs ( 3.68 usr + 0.00 sys = 3.68 CPU) @ 13586.96/s
(n=50000)
u2p_dw_cached: 4 wallclock secs ( 3.10 usr + 0.00 sys = 3.10 CPU) @ 16129.03/s
(n=50000)
u2p_dw_posix_re: 4 wallclock secs ( 3.72 usr + 0.00 sys = 3.72 CPU) @ 13440.86/s
(n=50000)
u2p_pg: 34 wallclock secs (34.60 usr + 0.00 sys = 34.60 CPU) @ 1445.09/s (n=50000)
u2p_pgtr: 34 wallclock secs (33.76 usr + 0.02 sys = 33.78 CPU) @ 1480.17/s (n=50000)
And this uses: "The\n qu\nick\r br\rown\t fox\n J\rumped\r over \nthe
lazy\r grey\n dog\n" Here the cached version is roughly 30% faster.
Benchmark: timing 50000 iterations of control, u2p_dw, u2p_dw_cached, u2p_dw_posix_re,
u2p_pg, u2p_pgtr...
u2p_dw: 8 wallclock secs ( 9.56 usr + 0.00 sys = 9.56 CPU) @ 5230.13/s (n=50000)
u2p_dw_cached: 6 wallclock secs ( 7.15 usr + 0.01 sys = 7.16 CPU) @ 6983.24/s
(n=50000)
u2p_dw_posix_re: 8 wallclock secs ( 9.34 usr + 0.00 sys = 9.34 CPU) @ 5353.32/s
(n=50000)
u2p_pg: 43 wallclock secs (43.65 usr + 0.00 sys = 43.65 CPU) @ 1145.48/s (n=50000)
u2p_pgtr: 43 wallclock secs (42.57 usr + 0.03 sys = 42.60 CPU) @ 1173.71/s (n=50000)
> Consider using regular expressions and the "/[[:isprint:]]/" construct
> instead.
>
> However, I couldn't figure out how to use this constuct -- there's no
> documentation on it that I can find.
It's in perlre as of 5.6.0. [[:isprint:]] is the uft8 character
class. You actually want [[:print:]]. It's negation is [[:cntrl:]].
u2p_dw_posix_re() uses it below.
DBD::Pg can't use [[:foo:]] if it wishes to work with anything before
5.6. :( Since isprint(), as for as POSIX is concered, is just anything
between x20 and x7F, the s/[^ -~//ge regex should work fine.
--
Michael G. Schwern <[EMAIL PROTECTED]> http://www.pobox.com/~schwern/
Perl Quality Assurance <[EMAIL PROTECTED]> Kwalitee Is Job One
shitting is a chore
wet glue enters my sphincter
I shall poop no more
-- Schwern
foreach my $num (0..255) {
my $chr = chr $num;
$U2P{$chr} = $chr =~ tr/\x20-\x7F// ? $chr : '\\'.sprintf("%03o",$num);
}
sub u2p_dw_cached {
my($str) = shift;
$str =~ s/([^ -~])/$U2P{$1}/ge;
return $str;
}
use POSIX qw(isprint);
sub u2p_pg {
join("", map { isprint($_)? $_ : '\\'.sprintf("%03o",ord($_)) }
split //, $_[0]);
}
sub u2p_pgtr {
join("", map { tr/\x20-\x7F// ? $_ : '\\'.sprintf("%03o",ord($_)) }
split //, $_[0]);
}
sub u2p_dw {
my($str) = shift;
$str =~ s/([^ -~])/'\\' . sprintf("%03o", ord($1))/ge;
return $str;
}
sub u2p_dw_posix_re {
my($str) = shift;
$str =~ s/([[:cntrl:]])/'\\' . sprintf("%03o", ord($1))/ge;
return $str;
}
my @subs = qw(u2p_dw_cached u2p_pg u2p_pgtr u2p_dw u2p_dw_posix_re);
foreach my $sub (@subs) {
print "$sub -> ".&{$sub}("\rfoo\n")."\n";
}
use Benchmark;
$test_string = 'The\n qu\nick\r br\rown\t fox\n J\rumped\r over \nthe lazy\r grey\n
dog\n';
timethese(shift || -3,
{ (map { ($_ => eval qq{sub { $_("$test_string") }}) } @subs),
control => sub {},
}
);