On Fri, Nov 16, 2001 at 06:39:46PM -0800, David Wheeler wrote: > I just noticed that DBD::Pg uses POSIX, entirely for the purpose of > using isprint() to replace non-printable characters with their numeric > values. Here's what the code looks like: > > $str=join("", map { isprint($_)?$_:'\\'.sprintf("%03o",ord($_)) } > split //, $str); > > Now, this seemed rather silly to me. I couldn't imagine that it was > efficient, and I generally like to look for reasons to lose POSIX and > its bloat. This is what I came up with: > > $str =~ s/([^ -~])/'\\' . sprintf("%03o", ord($1))/ge; > > I ran a little benchmark comparing them, and this is what the results > looked like: > > Benchmark: timing 100000 iterations of Posix, Regex... > isprint: 17 wallclock secs (15.64 usr + 0.01 sys = 15.65 CPU) @ 6389.78/s >(n=100000) > regex: 2 wallclock secs ( 2.35 usr + 0.00 sys = 2.35 CPU) @ 42553.19/s >(n=100000)
The bloat lies not in isprint(), but in the whole join, map, split thing. I tried various combinations of your s///ge technique and the join/map/split technique with isprint(), character classes and cached hash lookups. In the end, it's s///ge vs join/map/split that really matters. Benchmark and program below. The winner is the combination of s///ge and caching. foreach my $num (0..255) { my $chr = chr $num; $U2P{$chr} = $chr =~ tr/\x20-\x7F// ? $chr : '\\'.sprintf("%03o",$num); } sub u2p_dw_cached { my($str) = shift; $str =~ s/([^ -~])/$U2P{$1}/ge; return $str; } Things like URI encode/decode functions do the exact same thing. The higher the ratio of unprintable characters to printable, the better that does against the others. u2p_dw* uses variations on s///ge. u2p_pg* uses variations on join/map/split. This is using the test string "The quick brown fox\n Jumped over the lazy grey dog\n". The cached version is roughly 20% faster. Benchmark: timing 50000 iterations of u2p_dw, u2p_dw_cached, u2p_dw_posix_re, u2p_pg, u2p_pgtr... u2p_dw: 4 wallclock secs ( 3.68 usr + 0.00 sys = 3.68 CPU) @ 13586.96/s (n=50000) u2p_dw_cached: 4 wallclock secs ( 3.10 usr + 0.00 sys = 3.10 CPU) @ 16129.03/s (n=50000) u2p_dw_posix_re: 4 wallclock secs ( 3.72 usr + 0.00 sys = 3.72 CPU) @ 13440.86/s (n=50000) u2p_pg: 34 wallclock secs (34.60 usr + 0.00 sys = 34.60 CPU) @ 1445.09/s (n=50000) u2p_pgtr: 34 wallclock secs (33.76 usr + 0.02 sys = 33.78 CPU) @ 1480.17/s (n=50000) And this uses: "The\n qu\nick\r br\rown\t fox\n J\rumped\r over \nthe lazy\r grey\n dog\n" Here the cached version is roughly 30% faster. Benchmark: timing 50000 iterations of control, u2p_dw, u2p_dw_cached, u2p_dw_posix_re, u2p_pg, u2p_pgtr... u2p_dw: 8 wallclock secs ( 9.56 usr + 0.00 sys = 9.56 CPU) @ 5230.13/s (n=50000) u2p_dw_cached: 6 wallclock secs ( 7.15 usr + 0.01 sys = 7.16 CPU) @ 6983.24/s (n=50000) u2p_dw_posix_re: 8 wallclock secs ( 9.34 usr + 0.00 sys = 9.34 CPU) @ 5353.32/s (n=50000) u2p_pg: 43 wallclock secs (43.65 usr + 0.00 sys = 43.65 CPU) @ 1145.48/s (n=50000) u2p_pgtr: 43 wallclock secs (42.57 usr + 0.03 sys = 42.60 CPU) @ 1173.71/s (n=50000) > Consider using regular expressions and the "/[[:isprint:]]/" construct > instead. > > However, I couldn't figure out how to use this constuct -- there's no > documentation on it that I can find. It's in perlre as of 5.6.0. [[:isprint:]] is the uft8 character class. You actually want [[:print:]]. It's negation is [[:cntrl:]]. u2p_dw_posix_re() uses it below. DBD::Pg can't use [[:foo:]] if it wishes to work with anything before 5.6. :( Since isprint(), as for as POSIX is concered, is just anything between x20 and x7F, the s/[^ -~//ge regex should work fine. -- Michael G. Schwern <[EMAIL PROTECTED]> http://www.pobox.com/~schwern/ Perl Quality Assurance <[EMAIL PROTECTED]> Kwalitee Is Job One shitting is a chore wet glue enters my sphincter I shall poop no more -- Schwern
foreach my $num (0..255) { my $chr = chr $num; $U2P{$chr} = $chr =~ tr/\x20-\x7F// ? $chr : '\\'.sprintf("%03o",$num); } sub u2p_dw_cached { my($str) = shift; $str =~ s/([^ -~])/$U2P{$1}/ge; return $str; } use POSIX qw(isprint); sub u2p_pg { join("", map { isprint($_)? $_ : '\\'.sprintf("%03o",ord($_)) } split //, $_[0]); } sub u2p_pgtr { join("", map { tr/\x20-\x7F// ? $_ : '\\'.sprintf("%03o",ord($_)) } split //, $_[0]); } sub u2p_dw { my($str) = shift; $str =~ s/([^ -~])/'\\' . sprintf("%03o", ord($1))/ge; return $str; } sub u2p_dw_posix_re { my($str) = shift; $str =~ s/([[:cntrl:]])/'\\' . sprintf("%03o", ord($1))/ge; return $str; } my @subs = qw(u2p_dw_cached u2p_pg u2p_pgtr u2p_dw u2p_dw_posix_re); foreach my $sub (@subs) { print "$sub -> ".&{$sub}("\rfoo\n")."\n"; } use Benchmark; $test_string = 'The\n qu\nick\r br\rown\t fox\n J\rumped\r over \nthe lazy\r grey\n dog\n'; timethese(shift || -3, { (map { ($_ => eval qq{sub { $_("$test_string") }}) } @subs), control => sub {}, } );