Bug#500210: perldoc perlrun spits out junk in synopsis

Russ Allbery Wed, 01 Oct 2008 02:14:21 -0700

Niko Tyni <[EMAIL PROTECTED]> writes:

> Any estimate on how widespread this POD problem is? Is the hardcoded
> 'pod2man --utf8' in the Lenny perldoc going to cause more grief than
> it's worth?
>
> I'm leaning toward reverting that and reopening #492037 until the issue is
> sorted out in Pod-Perldoc upstream. Adding a way to enable or disable
> the '--utf8' option on the perldoc command line is one possibility,
> but it might as well cause even further trouble if upstream chooses a
> different implementation.


I looked at this some more, and there's a deeper problem.  If you run the
current pod2man with --utf8 on an input POD file that doesn't declare an
=encoding of UTF-8, any use of S<> in that POD file will result in invalid
UTF-8, even if there's no use of high-bit characters in the input POD at
all.
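
To make the S<> case concrete, here is a minimal sketch in plain Perl that does not call pod2man at all.  It assumes that the text inside S<> reaches the formatter with its spaces turned into non-breaking spaces (U+00A0), which is what the tr/\240\255/ line in the Pod::Text patch below suggests.  The sample string and the helper is_valid_utf8 are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Assumption: S<pod2text -h> arrives as text with U+00A0 in place of the space.
my $text = "pod2text\x{A0}-h\n";

# Print to an in-memory handle that has no encoding layer.
my $raw = '';
open(my $plain, '>', \$raw) or die "Cannot open in-memory handle: $!\n";
print {$plain} $text;
close($plain);

# Print the same text through a UTF-8 encoding layer.
my $layered = '';
open(my $utf8, '>:encoding(UTF-8)', \$layered)
    or die "Cannot open in-memory handle: $!\n";
print {$utf8} $text;
close($utf8);

# Return 1 if the given bytes decode cleanly as UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

printf "no layer:    valid UTF-8? %s\n", is_valid_utf8($raw)     ? 'yes' : 'no';
printf "UTF-8 layer: valid UTF-8? %s\n", is_valid_utf8($layered) ? 'yes' : 'no';
```

Without a layer, the non-breaking space goes out as the single byte 0xA0, which is not valid UTF-8 on its own even though the input was pure ASCII.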

I think the core problem is that Pod::Man is responsible for writing its
output through the file handle, and that path was missing an encoding
layer.  We can't just call encode() on the output, since that breaks if
PERL_UNICODE is set or if an encoding was manually set on the file handle:
you get double-encoding.  I think the least bad option is for Pod::Man and
Pod::Text to force the encoding on their output file handles to UTF-8 when
--utf8 is given.
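
The double-encoding from combining encode() with an encoding layer can be reproduced without Pod::Man.  The following is only a sketch under stated assumptions: the in-memory handles stand in for a real output file, the sample string is invented, and hex_bytes is a hypothetical helper for display.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{E9}\n";    # "cafe" with e-acute (U+00E9)

# Render a byte string as space-separated hex for display.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Case 1: the handle already has a UTF-8 layer (as PERL_UNICODE or the caller
# might arrange), and the formatter also calls encode() itself.
my $double = '';
open(my $fh1, '>:encoding(UTF-8)', \$double) or die "open failed: $!\n";
print {$fh1} encode('UTF-8', $text);
close($fh1);

# Case 2: the formatter sets the layer with binmode and prints characters.
my $single = '';
open(my $fh2, '>', \$single) or die "open failed: $!\n";
binmode($fh2, ':encoding(UTF-8)');
print {$fh2} $text;
close($fh2);

print 'encode() plus layer: ', hex_bytes($double), "\n";
print 'binmode layer only:  ', hex_bytes($single), "\n";
```

In the first case the two bytes produced by encode() are each encoded again by the layer, so the e-acute comes out as four bytes instead of two.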

The drawback of this fix is that it really will break pod2man --utf8 for
POD documents that don't declare their encoding properly.  Without
=encoding, Pod::Simple treats the input as ISO 8859-1, so input that is
actually UTF-8 gets encoded a second time on output.  I think that
behavior is correct according to the specifications, but it means existing
POD text that is really UTF-8 and doesn't declare an encoding will get
double-encoded output.  I can work around this by not setting a UTF-8
output encoding unless the input encoding is detected as UTF-8, but that's
not really correct.  You *should* be able to have an input POD document
with =encoding ISO-8859-1 and run it through pod2man --utf8 and get UTF-8
output.  But a POD document with no =encoding has, according to
perlpodspec, an implicit =encoding ISO-8859-1.
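
A small sketch of what happens to undeclared UTF-8 input, using Encode directly rather than Pod::Simple.  The decode() calls are stand-ins for the parser's input handling (an assumption for illustration, not how Pod::Simple is implemented), and the sample word is invented.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for "naive" with i-diaeresis (U+00EF), as they would sit in a
# POD file that has no =encoding line.
my $input = "na\xC3\xAFve";

# Without =encoding the bytes are read as Latin-1, so the two bytes of the
# i-diaeresis become two separate characters.
my $as_latin1 = decode('ISO-8859-1', $input);

# With "=encoding UTF-8" the same two bytes are read as one character.
my $as_utf8 = decode('UTF-8', $input);

# Forcing UTF-8 on output then encodes whatever characters were read.
my $out_undeclared = encode('UTF-8', $as_latin1);
my $out_declared   = encode('UTF-8', $as_utf8);

printf "undeclared: %d characters in, %d bytes out\n",
    length($as_latin1), length($out_undeclared);
printf "declared:   %d characters in, %d bytes out\n",
    length($as_utf8), length($out_declared);
print "declared output round-trips: ",
    ($out_declared eq $input ? 'yes' : 'no'), "\n";
```

Only the declared case gives back the original bytes; the undeclared case grows by two bytes because each byte of the accented character is encoded separately.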

Pod::Text has an additional challenge.  pod2man won't produce any
non-ASCII characters without --utf8 and has been that way since the
beginning of the Pod::Simple implementation.  pod2text, on the other hand,
always passed through whatever it got.  I could just leave it alone, but
if you feed the current pod2text a document that *does* have =encoding
UTF-8 in it, you get Perl warnings about wide characters on output.  I
think the best solution here is to force the output file handle to have an
encoding matching what Pod::Simple believes the input encoding is.  This
comes the closest to preserving the traditional pass-through behavior.
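
The pod2text side can be sketched the same way.  This is an illustration under assumptions, not the module's code: $encoding is a hypothetical variable standing in for what Pod::Simple reports as the input encoding (compare $$self{encoding} in the patch), and the in-memory handles replace a real output file.

```perl
use strict;
use warnings;

# Characters as the parser would hand them over after "=encoding UTF-8".
my $text = "smile \x{263A}\n";

# Collect warnings instead of printing them.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Pass-through with no layer: Perl warns about the wide character.
my $plain = '';
open(my $fh1, '>', \$plain) or die "open failed: $!\n";
print {$fh1} $text;
close($fh1);
my $without_layer = scalar @warnings;

# Assumption: the detected input encoding is available in a variable like
# this one.  Set a matching output layer before printing.
my $encoding = 'UTF-8';
@warnings = ();
my $layered = '';
open(my $fh2, '>', \$layered) or die "open failed: $!\n";
eval { binmode($fh2, ":encoding($encoding)") };
print {$fh2} $text;
close($fh2);
my $with_layer = scalar @warnings;

print "warnings without layer: $without_layer\n";
print "warnings with layer:    $with_layer\n";
```

Matching the output layer to the input encoding keeps the bytes the same as the input while avoiding the wide-character warning.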

I think that for lenny you may want to back out of the --utf8 change and
give it some time to settle.

Here's the patch that I'm planning on including in the next podlators
release, for reference.

diff --git a/lib/Pod/Man.pm b/lib/Pod/Man.pm
index 48fe20e..b5aceef 100644
--- a/lib/Pod/Man.pm
+++ b/lib/Pod/Man.pm
@@ -36,7 +36,7 @@ use POSIX qw(strftime);
 
 @ISA = qw(Pod::Simple);
 
-$VERSION = '2.20';
+$VERSION = '2.21';
 
 # Set the debugging level.  If someone has inserted a debug function into this
 # class already, use that.  Otherwise, use any Pod::Simple debug function
@@ -736,6 +736,19 @@ sub start_document {
         return;
     }
 
+    # If we were given the utf8 option, set an output encoding on our file
+    # handle.  Wrap in an eval in case we're using a version of Perl too old
+    # to understand this.
+    #
+    # This is evil because it changes the global state of a file handle that
+    # we may not own.  However, we can't just blindly encode all output, since
+    # there may be a pre-applied output encoding (such as from PERL_UNICODE)
+    # and then we would double-encode.  This seems to be the least bad
+    # approach.
+    if ($$self{utf8}) {
+        eval { binmode ($$self{output_fh}, ':encoding(UTF-8)') };
+    }
+
     # Determine information for the preamble and then output it.
     my ($name, $section);
     if (defined $$self{name}) {
@@ -1450,8 +1463,8 @@ Pod::Man - Convert POD data to formatted *roff input
 
 =for stopwords
 en em ALLCAPS teeny fixedbold fixeditalic fixedbolditalic stderr utf8
-UTF-8 Allbery Sean Burke Ossanna Solaris formatters troff uppercased
-Christiansen
+UTF-8 UTF-8-encoded Allbery Sean Burke Ossanna Solaris formatters troff
+uppercased Christiansen
 
 =head1 SYNOPSIS
 
@@ -1608,6 +1621,12 @@ be warned that *roff source with literal UTF-8 characters is not supported
 by many implementations and may even result in segfaults and other bad
 behavior.
 
+Be aware that, when using this option, the input encoding of your POD
+source must be properly declared unless it is US-ASCII or Latin-1.  POD
+input without an C<=encoding> command will be assumed to be in Latin-1,
+and if it's actually in UTF-8, the output will be double-encoded.  See
+L<perlpod(1)> for more information on the C<=encoding> command.
+
 =back
 
 The standard Pod::Simple method parse_file() takes one argument naming the
@@ -1643,6 +1662,12 @@ invalid.  A quote specification must be one, two, or four characters long.
 
 =head1 BUGS
 
+Encoding handling assumes that PerlIO is available and does not work
+properly if it isn't since encode and decode do not work well in
+combination with PerlIO encoding layers.  It's very unclear how to
+correctly handle this without PerlIO encoding layers.  The C<utf8> option
+is therefore not supported unless Perl is built with PerlIO support.
+
 There is currently no way to turn off the guesswork that tries to format
 unmarked text appropriately, and sometimes it isn't wanted (particularly
 when using POD to document something other than Perl).  Most of the work
@@ -1668,6 +1693,13 @@ Pod::Man is excessively slow.
 
 =head1 CAVEATS
 
+If Pod::Man is given the C<utf8> option, the encoding of its output file
+handle will be forced to UTF-8 if possible, overriding any existing
+encoding.  This will be done even if the file handle is not created by
+Pod::Man and was passed in from outside.  This seems to be the only way to
+consistently enforce UTF-8-encoded output regardless of PERL_UNICODE and
+other settings.
+
 The handling of hyphens and em dashes is somewhat fragile, and one may get
 the wrong one under some circumstances.  This should only matter for
 B<troff> output.
diff --git a/lib/Pod/Text.pm b/lib/Pod/Text.pm
index 98dd434..11d6d8d 100644
--- a/lib/Pod/Text.pm
+++ b/lib/Pod/Text.pm
@@ -37,7 +37,7 @@ use Pod::Simple ();
 # We have to export pod2text for backward compatibility.
 @EXPORT = qw(pod2text);
 
-$VERSION = 3.11;
+$VERSION = '3.12';
 
 ##############################################################################
 # Initialization
@@ -246,10 +246,19 @@ sub reformat {
 }
 
 # Output text to the output device.  Replace non-breaking spaces with spaces
-# and soft hyphens with nothing.
+# and soft hyphens with nothing, and then try to fix the output encoding if
+# necessary to match the input encoding unless UTF-8 output is forced.  This
+# preserves the traditional pass-through behavior of Pod::Text.
 sub output {
     my ($self, $text) = @_;
     $text =~ tr/\240\255/ /d;
+    unless ($$self{opt_utf8} || $$self{CHECKED_ENCODING}) {
+        my $encoding = $$self{encoding} || '';
+        if ($encoding) {
+            eval { binmode ($$self{output_fh}, ":encoding($encoding)") };
+        }
+        $$self{CHECKED_ENCODING} = 1;
+    }
     print { $$self{output_fh} } $text;
 }
 
@@ -272,6 +281,22 @@ sub start_document {
     $$self{MARGIN}  = $margin;  # Default left margin.
     $$self{PENDING} = [[]];     # Pending output.
 
+    # We have to redo encoding handling for each document.
+    delete $$self{CHECKED_ENCODING};
+
+    # If we were given the utf8 option, set an output encoding on our file
+    # handle.  Wrap in an eval in case we're using a version of Perl too old
+    # to understand this.
+    #
+    # This is evil because it changes the global state of a file handle that
+    # we may not own.  However, we can't just blindly encode all output, since
+    # there may be a pre-applied output encoding (such as from PERL_UNICODE)
+    # and then we would double-encode.  This seems to be the least bad
+    # approach.
+    if ($$self{opt_utf8}) {
+        eval { binmode ($$self{output_fh}, ':encoding(UTF-8)') };
+    }
+
     return '';
 }
 
@@ -640,7 +665,8 @@ __END__
 Pod::Text - Convert POD data to formatted ASCII text
 
 =for stopwords
-alt stderr Allbery Sean Burke's Christiansen
+alt stderr Allbery Sean Burke's Christiansen UTF-8 UTF-8-encoded
+pre-Unicode utf8
 
 =head1 SYNOPSIS
 
@@ -725,6 +751,19 @@ single space.  Defaults to true.
 Send error messages about invalid POD to standard error instead of
 appending a POD ERRORS section to the generated output.
 
+=item utf8
+
+By default, Pod::Text uses the same output encoding as the input encoding
+of the POD source (provided that Perl was built with PerlIO; otherwise, it
+doesn't encode its output).  If this option is given, the output encoding
+is forced to UTF-8.
+
+Be aware that, when using this option, the input encoding of your POD
+source must be properly declared unless it is US-ASCII or Latin-1.  POD
+input without an C<=encoding> command will be assumed to be in Latin-1,
+and if it's actually in UTF-8, the output will be double-encoded.  See
+L<perlpod(1)> for more information on the C<=encoding> command.
+
 =item width
 
 The column at which to wrap text on the right-hand side.  Defaults to 76.
@@ -759,6 +798,33 @@ invalid.  A quote specification must be one, two, or four characters long.
 
 =back
 
+=head1 BUGS
+
+Encoding handling assumes that PerlIO is available and does not work
+properly if it isn't since encode and decode do not work well in
+combination with PerlIO encoding layers.  It's very unclear how to
+correctly handle this without PerlIO encoding layers.  The C<utf8> option
+is therefore not supported unless Perl is built with PerlIO support and
+you may see spurious Perl warnings.
+
+=head1 CAVEATS
+
+If Pod::Text is given the C<utf8> option, the encoding of its output file
+handle will be forced to UTF-8 if possible, overriding any existing
+encoding.  This will be done even if the file handle is not created by
+Pod::Text and was passed in from outside.  This seems to be the only way
+to consistently enforce UTF-8-encoded output regardless of PERL_UNICODE
+and other settings.
+
+If the C<utf8> option is not given, the encoding of its output file handle
+will be forced to the detected encoding of the input POD, which preserves
+whatever the input text is.  This ensures backward compatibility with
+earlier, pre-Unicode versions of this module, without large numbers of
+Perl warnings.
+
+This is not ideal, but it seems to be the best compromise.  If it doesn't
+work for you, please let me know the details of how it broke.
+
 =head1 NOTES
 
 This is a replacement for an earlier Pod::Text module written by Tom
@@ -774,7 +840,7 @@ subclass of it does.  Look for L<Pod::Text::Termcap>.
 
 =head1 SEE ALSO
 
-L<Pod::Simple>, L<Pod::Text::Termcap>, L<pod2text(1)>
+L<Pod::Simple>, L<Pod::Text::Termcap>, L<perlpod(1)>, L<pod2text(1)>
 
 The current version of this module is always available from its web site at
 L<http://www.eyrie.org/~eagle/software/podlators/>.  It is also part of the
diff --git a/scripts/pod2man.PL b/scripts/pod2man.PL
index c353455..9f34116 100755
--- a/scripts/pod2man.PL
+++ b/scripts/pod2man.PL
@@ -271,6 +271,12 @@ However, be warned that *roff source with literal UTF-8 characters is not
 supported by many implementations and may even result in segfaults and
 other bad behavior.
 
+Be aware that, when using this option, the input encoding of your POD
+source must be properly declared unless it is US-ASCII or Latin-1.  POD
+input without an C<=encoding> command will be assumed to be in Latin-1,
+and if it's actually in UTF-8, the output will be double-encoded.  See
+L<perlpod(1)> for more information on the C<=encoding> command.
+
 =item B<-v>, B<--verbose>
 
 Print out the name of each output file as it is being generated.
@@ -547,8 +553,8 @@ section numbering conventions.
 
 =head1 SEE ALSO
 
-L<Pod::Man>, L<Pod::Simple>, L<man(1)>, L<nroff(1)>, L<podchecker(1)>,
-L<troff(1)>, L<man(7)>
+L<Pod::Man>, L<Pod::Simple>, L<man(1)>, L<nroff(1)>, L<perlpod(1)>,
+L<podchecker(1)>, L<troff(1)>, L<man(7)>
 
 The man page documenting the an macro set may be L<man(5)> instead of
 L<man(7)> on your system.
diff --git a/scripts/pod2text.PL b/scripts/pod2text.PL
index 45a0649..ede0fe7 100755
--- a/scripts/pod2text.PL
+++ b/scripts/pod2text.PL
@@ -79,7 +79,8 @@ $options{sentence} = 0;
 Getopt::Long::config ('bundling');
 GetOptions (\%options, 'alt|a', 'code', 'color|c', 'help|h', 'indent|i=i',
             'loose|l', 'margin|left-margin|m=i', 'overstrike|o',
-            'quotes|q=s', 'sentence|s', 'stderr', 'termcap|t', 'width|w=i')
+            'quotes|q=s', 'sentence|s', 'stderr', 'termcap|t', 'utf8|u',
+            'width|w=i')
     or exit 1;
 pod2usage (1) if $options{help};
 
@@ -113,11 +114,12 @@ __END__
 pod2text - Convert POD data to formatted ASCII text
 
 =for stopwords
--aclost --alt --stderr Allbery --overstrike overstrike --termcap
+-aclostu --alt --stderr Allbery --overstrike overstrike --termcap --utf8
+UTF-8
 
 =head1 SYNOPSIS
 
-pod2text [B<-aclost>] [B<--code>] [B<-i> I<indent>] S<[B<-q> I<quotes>]>
+pod2text [B<-aclostu>] [B<--code>] [B<-i> I<indent>] S<[B<-q> I<quotes>]>
     [B<--stderr>] S<[B<-w> I<width>]> [I<input> [I<output> ...]]
 
 pod2text B<-h>
@@ -220,6 +222,18 @@ have a termcap file somewhere where Term::Cap can find it and requires that
 your system support termios.  With this option, the output of B<pod2text>
 will contain terminal control sequences for your current terminal type.
 
+=item B<-u>, B<--utf8>
+
+By default, B<pod2text> tries to use the same output encoding as its input
+encoding (to be backward-compatible with older versions).  This option
+says to instead force the output encoding to UTF-8.
+
+Be aware that, when using this option, the input encoding of your POD
+source must be properly declared unless it is US-ASCII or Latin-1.  POD
+input without an C<=encoding> command will be assumed to be in Latin-1,
+and if it's actually in UTF-8, the output will be double-encoded.  See
+L<perlpod(1)> for more information on the C<=encoding> command.
+
 =item B<-w>, B<--width=>I<width>, B<->I<width>
 
 The column at which to wrap text on the right-hand side.  Defaults to 76,
@@ -271,7 +285,7 @@ current terminal device.
 =head1 SEE ALSO
 
 L<Pod::Text>, L<Pod::Text::Color>, L<Pod::Text::Overstrike>,
-L<Pod::Text::Termcap>, L<Pod::Simple>
+L<Pod::Text::Termcap>, L<Pod::Simple>, L<perlpod(1)>
 
 The current version of this script is always available from its web site at
 L<http://www.eyrie.org/~eagle/software/podlators/>.  It is also part of the
diff --git a/t/man-utf8.t b/t/man-utf8.t
index a53208b..8b44d6b 100755
--- a/t/man-utf8.t
+++ b/t/man-utf8.t
@@ -39,6 +39,7 @@ print "ok 1\n";
 
 my $n = 2;
 eval { binmode (\*DATA, ':encoding(utf-8)') };
+eval { binmode (\*STDOUT, ':encoding(utf-8)') };
 while (<DATA>) {
     my %options;
     next until $_ eq "###\n";
@@ -57,7 +58,6 @@ while (<DATA>) {
     close TMP;
     my $parser = Pod::Man->new (%options) or die "Cannot create parser\n";
     open (OUT, '> out.tmp') or die "Cannot create out.tmp: $!\n";
-    eval { binmode (\*OUT, ':encoding(utf-8)') };
     $parser->parse_from_file ('tmp.pod', \*OUT);
     close OUT;
     my $accents = 0;
diff --git a/t/text-utf8.t b/t/text-utf8.t
index 3d2904a..8069478 100755
--- a/t/text-utf8.t
+++ b/t/text-utf8.t
@@ -33,7 +33,6 @@ END {
 }
 
 use Pod::Text;
-use Pod::Simple;
 
 $loaded = 1;
 print "ok 1\n";
@@ -53,7 +52,6 @@ while (<DATA>) {
     }
     close TMP;
     open (OUT, '> out.tmp') or die "Cannot create out.tmp: $!\n";
-    eval { binmode (\*OUT, ':encoding(utf-8)') };
     $parser->parse_from_file ('tmp.pod', \*OUT);
     close OUT;
     open (TMP, 'out.tmp') or die "Cannot open out.tmp: $!\n";

-- 
Russ Allbery ([EMAIL PROTECTED])             <http://www.eyrie.org/~eagle/>


