Thanks everybody for all your replies on this issue.
On Sat, 2001-11-17 at 13:54, Michael G Schwern wrote:
> It was a bug. [:print:], [:cntrl:], et al are busted in 5.6.1 but
> fixed in bleadperl.
Damn. And I was going to suggest this kind of trick:
    if ($^V lt v5.6) {
        # use isprint;
    } else {
        # use [:print:]
    }
I say that because, while I understood that the original code was slow
because of the join/split/map construct, my primary motivation was to
find a way to get rid of any need for loading POSIX.
But the whole discussion of the issues of locale, utf8, high ASCII,
etc., caused me to [run screaming from the room!] go back and look more
closely at why this code was needed at all. It turns out that the
non-printable character escaping is needed only to escape binary data
types:
    if ($data_type == DBI::SQL_BINARY ||
        $data_type == DBI::SQL_VARBINARY ||
        $data_type == DBI::SQL_LONGVARBINARY) {
        $str=join("", map { isprint($_)?$_:'\\'.sprintf("%03o",ord($_)) }
                      split //, $str);
    }
Thus, it seemed to me, the distinction between character sets becomes
irrelevant. If it's all just binary, what difference does it make?
Well, doing a little research, it looks like this code was added to
DBD::Pg in order to support a new (and poorly documented) PostgreSQL
data type called BYTEA (why do I want to call this "buy tea"?), and that
this data type requires certain characters (bytes) to be escaped. Which
ones? The relevant discussion from the PostgreSQL Hackers list is here:
http://www.geocrawler.com/mail/thread.php3?subject=%5BHACKERS%5D+Re%3A+Toast%2Cbytea%2C+Text+-blob+all+confusing&list=10
But it is pretty well covered by this single message:
http://www.geocrawler.com/mail/msg.php3?msg_id=6547225&list=10
Now, according to what Alex Pilosov (who added the BYTEA support to
DBD::Pg) wrote, there are a few specific characters that need to be
escaped, but also all non-printable characters. Joe Conway's
[relationship to "the mad scientist of Perl" unknown] reply says that
only three characters (bytes) ever need to be escaped (he later
eliminated \012). Bruce Momjian, one of the PostgreSQL developers, then
documented this need for the forthcoming PostgreSQL 7.2 release:
<para>
The <type>bytea</type> data type allows storage of binary data,
specifically allowing storage of NULLs which are entered as
<literal>'\\000'</>. The first backslash is interpreted by the
single quotes, and the second is recognized by <type>bytea</> and
precedes a three digit octal value. For a similar reason, a
backslash must be entered into a field as <literal>'\\\\'</> or
<literal>'\\134'</>. You may also have to escape line feeds and
carriage return if your interface automatically translates these. It
can store values of any length. <type>Bytea</> is a non-standard
data type.
</para>
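
To make the two levels of escaping in that paragraph concrete, here is a
minimal sketch. The helper names bytea_escape and sql_quote are
hypothetical (they are not part of DBD::Pg or PostgreSQL), and the sketch
assumes a server whose string literals treat backslash as an escape
character, as the quoted documentation describes. The backslash-quote
form used for embedded single quotes is likewise an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a NUL byte or a backslash into the
# backslash-plus-octal form that the bytea input routine recognizes.
sub bytea_escape {
    my $bytes = shift;
    $bytes =~ s/([\\\0])/sprintf('\\%03o', ord $1)/ge;
    return $bytes;
}

# Hypothetical helper: wrap a value in single quotes for a SQL literal,
# doubling each backslash so that one survives the string-literal parser.
# Backslashes are doubled first so the one added before a quote stays single.
sub sql_quote {
    my $value = shift;
    $value =~ s/\\/\\\\/g;
    $value =~ s/'/\\'/g;
    return "'$value'";
}

print sql_quote(bytea_escape("a\0b")), "\n";   # prints 'a\\000b'
print sql_quote(bytea_escape("\\")),   "\n";   # prints '\\134'
```

The first pass produces what bytea wants to see; the second pass protects
that backslash from the single-quote parser. In real code the quoting
step would normally be left to the driver rather than done by hand.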
So, setting aside Pilosov's suggestion that non-printable characters
need to be replaced, if we go merely for the (now) documented escape
characters, I suggest a solution as simple as this:
    my %esc = ( "'"  => '\\047',  # '\\' . sprintf("%03o", ord("'")),
                '\\' => '\\134',  # '\\' . sprintf("%03o", ord("\\")),
                "\0" => '\\000',  # '\\' . sprintf("%03o", ord("\0")),
              );
    sub simple {
        my $str = shift;                 # lexical copy; leave the global $_ alone
        $str =~ s/(['\\\0])/$esc{$1}/g;  # $1 avoids the program-wide cost of $&
        return $str;
    }
Benchmarks naturally show this to be the fastest version yet -- no big
surprise there. The curious thing is that the isprint() snippet quoted
above doesn't replace the characters '\' or "'"; DBD::Pg does separate
substitutions for those characters! So it's possible, methinks, that the
whole non-printable replacement code is unnecessary -- only "\0" really
needs to be fixed. I'll have to post to the PostgreSQL Hackers list to
find out. Meanwhile, any other comments from my fellow FWP maniacs are
welcome.
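
To illustrate the point about the quote and the backslash, here is a
small sketch comparing the two approaches. The sub names are made up for
this example, and the range \x20-\x7E is used as a stand-in for
isprint(), which is an assumption that only matches isprint() in the C
locale.

```perl
use strict;
use warnings;

# Stand-in for the isprint() snippet: escape every byte outside
# printable ASCII (0x20-0x7E).
sub escape_nonprintable {
    my $str = shift;
    $str =~ s/([^\x20-\x7E])/sprintf('\\%03o', ord $1)/ge;
    return $str;
}

# The three-character approach: escape only quote, backslash and NUL.
sub escape_documented {
    my $str = shift;
    $str =~ s/(['\\\0])/sprintf('\\%03o', ord $1)/ge;
    return $str;
}

my $input = "it's a\\b\0";                 # quote, backslash, trailing NUL
print escape_nonprintable($input), "\n";   # prints it's a\b\000
print escape_documented($input), "\n";     # prints it\047s a\134b\000
```

The quote and the backslash are both printable, so the first sub leaves
them alone and touches only the NUL byte.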
Regards,
David
--
David Wheeler AIM: dwTheory
[EMAIL PROTECTED] ICQ: 15726394
Yahoo!: dew7e
Jabber: [EMAIL PROTECTED]
#!/usr/bin/perl -w
use strict;
use POSIX qw(isprint);
use Benchmark;
my %esc = ( "'"  => '\\047',  # '\\' . sprintf("%03o", ord("'")),
            '\\' => '\\134',  # '\\' . sprintf("%03o", ord("\\")),
            "\0" => '\\000',  # '\\' . sprintf("%03o", ord("\0")),
          );
sub simple {
    my $str = shift;                 # lexical copy; leave the global $_ alone
    $str =~ s/(['\\\0])/$esc{$1}/g;  # $1 avoids the program-wide cost of $&
    return $str;
}
my %U2P;
foreach my $num (0..255) {
    my $chr = chr $num;
    $U2P{$chr} = isprint($chr) ? $chr : '\\'.sprintf("%03o",$num);
}
sub u2p_dw_cached {
    my($str) = shift;
    $str =~ s/([^ -~])/$U2P{$1}/ge;
    return $str;
}
sub schwern {
    my $str = shift;    # lexical copy; leave the global $_ alone
    $str =~ s/([^\x20-\x7E])/isprint($1) ? $1 : $U2P{$1}/ge;
    return $str;
}
sub original {
    join("", map { isprint($_)?$_:'\\'.sprintf("%03o",ord($_)) }
             split //, shift);
}
my @subs = qw(simple u2p_dw_cached schwern original);
my $test_string = "The quick\n brown\r f\0ox 'jumped' ov\\er the \tlazy \0gr\\ey dog\n";
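# Uncomment to print each sub's output for the test string:
#for my $name (@subs) { print "$name: ", (\&$name)->($test_string), "\n" }
# Call each sub through a code reference so the test string is passed
# unchanged. Interpolating it into eval'd source inside double quotes
# re-parses it, turning the backslash in "ov\er" and "gr\ey" into "\e" (ESC).
timethese(shift || -3,
          { (map { my $code = \&$_; ($_ => sub { $code->($test_string) }) } @subs),
            control => sub {},
          }
);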
__END__