> Treating a 'string' as anything but a sequence of 'bytes/octets' _without 
> my explicit request or a runtime warning that I haven't specified fh 
> semantics_.

I'm still not quite following what you are upset about.

(I'm starting to suspect that it must be because I've so completely
bought in to the Unicode model Perl has now and am unable to see what
could be the problem...)  Perl does *not* haphazardly handle a string
as anything other than bytes/octets.  It does so only if you either

(1) explicitly inject Unicode into it by chr(), \x{...}, etc., or
(2) either explicitly (binmode) or implicitly (locale) twiddle
    a filehandle so that it converts 
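
To illustrate case (1), here is a minimal sketch (the variable names are
mine, purely for illustration).  A string built only from octets stays a
byte string; a string that receives a code point above 255 via chr() is
marked as Unicode, and its length is then counted in characters rather
than octets.

```perl
use strict;
use warnings;

# Only octets here: nothing asks Perl to treat this as Unicode.
my $bytes = "caf\xE9";

# chr() with a code point above 255 explicitly injects Unicode.
my $wide = "caf" . chr(0x263A);

# utf8::is_utf8() reports whether the string is internally marked as UTF-8.
my $bytes_is_utf8 = utf8::is_utf8($bytes) ? 1 : 0;
my $wide_is_utf8  = utf8::is_utf8($wide)  ? 1 : 0;

# length() counts characters for the Unicode string.
my $wide_chars = length $wide;

# Encode a copy to UTF-8 octets to see the size on the wire.
my $octets = $wide;
utf8::encode($octets);
my $wide_octets = length $octets;

print "bytes marked: $bytes_is_utf8, wide marked: $wide_is_utf8\n";
print "wide: $wide_chars characters, $wide_octets octets as UTF-8\n";
```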

As far as I can understand, you were bitten by the locale.  As I told
you, that is as wanted by Larry, and also (independently of Perl)
by the Linux Unicode people.

> > The only obvious 'magic' I can think of is the behaviour where Perl
> > checks your locale settings, and if they indicate use of UTF-8, Perl
> > switches the default encoding of the STD* streams, and any further
> > file opens to UTF-8.  This bit of magic was specifically requested by
> > Larry Wall, and also by the Linux "Unicodification" project.
> 
> This is Bad Juju (tm). It _guarantees_ script breakage (potentially
> silently!) for Unix people doing _anything_ but ASCII text manipulation.  

I repeat: I don't think you can do "more than ASCII" by clinging tooth
and nail to the "everything is bytes" credo.

> > The locale-induced UTF-8 magic can lead to a situation where you have
> > to explicitly mark your filehandles "binary" (with binmode, please
> > don't use bytes), because otherwise any data going out would be
> > expected to be Unicode, that is, *text*.  If you are pushing out
> > binary bits and bytes, you should tell Perl about it.   You are
> > also simultaneously complaining about "wanting to specify things
> > yourself" and "having to use binmode"?
> 
> Yes. Because _needing_ to 'tell Perl' that I am pushing binary rather than
> text _is a change_ for *nix platforms. I should have to 'tell Perl' I am
> pushing _anything else_ than binary. Or _at a minimum_ a mandatory warning
> should be issued that I didn't declare the filehandle's encoding layer and
> it is now using encoding 'X' if I haven't explicitly indicated that I
> *WANT* the system environment changing my filehandle's encodings.

I repeat: all your filehandles are still 'binary' unless you either
explicitly (binmode) or implicitly (locale) command them not to be.
If you try to push Unicode (data marked as UTF-8, such as characters
beyond 255) on such a filehandle, you'll get a 'Wide character' warning.
If you do not like the implicit locale switching, reset your locale
to something that does not match /utf-?8/i before running the script.
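
A small sketch of the 'Wide character' behaviour, under some assumptions
of mine: the layers are given explicitly on open (so the result does not
depend on the locale), in-memory filehandles stand in for real files, and
the variable names are only illustrative.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

my $smiley = chr(0x263A);    # a character beyond 255

# A 'binary' handle: no encoding layer, just octets.
my $raw_buf = '';
open(my $raw, '>:raw', \$raw_buf) or die "open failed: $!";
print {$raw} $smiley;        # triggers the 'Wide character' warning
close $raw;

# A handle explicitly marked as UTF-8 text: no warning expected.
my $txt_buf = '';
open(my $txt, '>:encoding(UTF-8)', \$txt_buf) or die "open failed: $!";
print {$txt} $smiley;
close $txt;

# Count how many of the collected warnings were 'Wide character'.
my $wide_warnings = grep { /Wide character/ } @warnings;
print "Wide character warnings: $wide_warnings\n";
```

Only the first print should warn; the second handle was told that text
is coming, so Perl encodes it quietly.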

> > Back to the 'UNIX' way of I/O: I'm sorry but I think the UNIX way and
> > the Unicode can't transparently cohabit.  I'm very much a UNIX geek
> > and systems programmer, and I like the simple symmetrical world of
> > UNIX I/O, but I cannot see how the byte streams of UNIX and the
> > multiple variable and fixed length encodings of Unicode can work
> > simultaneously without some sort of explicit switching.
> 
> _Explicit_ switching is what I am asking for. _Implicit_ switching is what
> I am complaining about. If you want to switch based on the system env -
> fine: _But at least warn me with a good immediate warnings_ before
> changing my fh semantics if I haven't said something like

The assumption is that if you have a locale setup that indicates
UTF-8, Perl is going to assume you knew what you were doing when
you set up the locale.  *All* locale effects are 'implicit'.
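
If you want to see in advance whether your environment would count as a
UTF-8 locale, something like the following can help.  This is only a
sketch: locale_mentions_utf8() is a hypothetical helper of mine, not
Perl's internal logic, and the choice and order of the variables it
checks (LC_ALL, then LC_CTYPE, then LANG) is my assumption based on the
usual POSIX precedence.

```perl
use strict;
use warnings;

# Hypothetical helper: takes environment-style key/value pairs and
# returns 1 if the first set, non-empty locale variable matches
# /utf-?8/i, and 0 otherwise.
sub locale_mentions_utf8 {
    my (%env) = @_;
    for my $name (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env{$name};
        next unless defined $value && length $value;
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;
}

my $utf8_locale = locale_mentions_utf8(LANG => 'en_US.UTF-8');
my $c_locale    = locale_mentions_utf8(LC_ALL => 'C', LANG => 'en_US.UTF-8');
my $unset       = locale_mentions_utf8();

# Check the environment the script is actually running in.
my $current = locale_mentions_utf8(%ENV);
print "current environment mentions UTF-8: $current\n";
```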

>    binmode FH, ':crlf|:raw|:env';
> 
> before I go my $data = <FH>;
> 
> "Malformed UTF-8 character (unexpected end of string) at
> ./error-example.pl line 40." isn't useful: It is obscure and is produced
> distantly from the actual breakage.

perldiag has this:

=item Malformed UTF-8 character (%s)

Perl detected something that didn't comply with UTF-8 encoding rules.

One possible cause is that you read in data that you thought to be in
UTF-8 but it wasn't (it was for example legacy 8-bit data).  Another
possibility is careless use of utf8::upgrade().
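
To catch the first cause at the point where the data comes in, rather
than somewhere far downstream, one option is to decode explicitly and
strictly with the Encode module.  A minimal sketch, assuming the input
is really legacy Latin-1 (the sample data and variable names are mine):

```perl
use strict;
use warnings;
use Encode ();

# Legacy 8-bit data: "cafe" with an e-acute as a single Latin-1 octet.
# This is not a valid UTF-8 sequence.
my $legacy = "caf\xE9";

# Strict decoding dies right here if the octets are not valid UTF-8.
# LEAVE_SRC keeps $legacy unmodified.
my $strict_error = '';
my $ok = eval {
    Encode::decode('UTF-8', $legacy,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
$strict_error = $@ unless $ok;

# Decoding with the encoding the data is really in succeeds.
my $text = Encode::decode('ISO-8859-1', $legacy);

print "strict UTF-8 decode failed: $strict_error" if $strict_error;
print "decoded as Latin-1: ", length($text), " characters\n";
```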

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen
