Re: Installing Encode.pm for Perl 5.6.1

2013-07-05 Thread Nicholas Clark
On Wed, Jul 03, 2013 at 09:50:36AM +0530, Arun wrote:
> Hi Team,
> 
> We have a legacy Perl application which was developed using Perl 5.6.1.
> 
> I would like to Install Encode.pm which is a pre requisite for
> Cache::Memcached.
> 
> When I tried to install Encode.pm for Perl 5.6.1 it fails, as it is only
> compatible with Perl 5.7. Is there any other way to install it for Perl 5.6.1?

No, sorry. Encode needs features in perl 5.8.0 or later.

I'd be surprised if anyone subscribed to this list is using 5.6.1, both
because it's very very old, and because the Unicode support in 5.8.0 and
later is significantly better. So you're probably on your own here.

Your choices aren't ideal. If you're constrained to using 5.6.1 you should
probably switch to using the 5.6.1-known-good CPAN mirror at
http://cp5.6.1an.barnyard.co.uk/
to get older versions of modules known to pass tests on 5.6.1.
There may be an older version of Cache::Memcached that works on 5.6.1.
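
For example, in the CPAN shell (a sketch using CPAN.pm's standard
configuration commands):

    o conf urllist unshift http://cp5.6.1an.barnyard.co.uk/
    o conf commit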

Otherwise I'd guess that your least bad way to make progress is to locally
fork Cache::Memcached and remove the code in it which requires Encode.

Nicholas Clark


Re: Determining IO layer set on filehandle

2010-01-29 Thread Nicholas Clark
On Fri, Jan 29, 2010 at 02:22:06PM +0100, Michael Ludwig wrote:
> Filehandles may have IO layers applied to them, like :utf8 or :raw.
> One of the ways to achieve that is to use the binmode() function.
> 
>   binmode $fh, ':utf8';
> 
> What I want to achieve is to set the STDOUT filehandle to ':raw' and
> then to restore the previous IO layers.

> Is there a way to determine the IO layers applying to a filehandle
> just from the filehandle itself?

I think you want PerlIO::get_layers($fh)

I'm not sure where it's documented.
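
A sketch of the save/set/restore dance, assuming that re-pushing the reported
layer names is close enough (get_layers returns bare names, so arguments to
layers such as :encoding(...) are lost, and exactly reconstructing the stack
is not guaranteed):

    my @saved = PerlIO::get_layers(*STDOUT);  # e.g. ('unix', 'perlio', 'utf8')
    binmode STDOUT, ':raw';                   # pops everything above the base
    print STDOUT $bytes;        # $bytes is a placeholder for your raw data
    # crude restore: re-push whatever :raw stripped (here, just utf8)
    binmode STDOUT, ":$_" for grep { !/^(?:unix|perlio|stdio|crlf)$/ } @saved;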

Nicholas Clark


Re: TAP YAML Diagnostics

2008-04-06 Thread Nicholas Clark
On Sun, Apr 06, 2008 at 08:41:11AM -0700, Ovid wrote:

> Currently you can shove anything in there you want, but you must use
> upper-case keys for your personal use and all lower-case keys are
> reserved (and it's a parse error to use an unknown lower-case key). 
> Are there any strange Unicode issues where we might get confused about
> what is upper and lower case?

I believe that there are code points which would be considered word
characters but do not have distinct upper and lower case forms (or by
implication title case either), but I hope that the good folks of
perl-unicode will correct me if I'm wrong.

Hence I'm not sure what the most efficient way of determining if
something is all lower case is. If I'm right, one can't just test

   if ($string eq lc $string)

because these code points would mess you up, and I *assume* that they
are not those which you want to consider reserved. I guess that one
needs to loop over all characters in the string, and verify that if
$char eq lc $char then also $char ne uc $char. (But one could first
short circuit the common pass case with the test above)
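
That per-character loop might look like this (a sketch of my reading of it,
not tested against the odder scripts):

    sub all_lower {
        my $string = shift;
        foreach my $char (split //, $string) {
            # "lower case" here means unchanged by lc() but changed by uc(),
            # so caseless characters (digits, ideographs, ...) don't qualify
            return 0 unless $char eq lc $char && $char ne uc $char;
        }
        return 1;
    }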

Nicholas Clark


Re: bytes pragma problems

2007-04-16 Thread Nicholas Clark
I don't think that anyone ever replied to this.

On Tue, Apr 19, 2005 at 11:51:36PM -0500, Ed Summers wrote:
> Has anyone noticed that a call to bytes::substr() sends perl into a 
> seemingly infinite loop
> under v5.8.1-RC3 built for darwin-thread-multi-2level?
> 
>   use bytes;
>   print bytes::substr( "abc", 0, 1 );

This appears to have been solved by the 5.8.1 release.

> I also noticed that a call to a non-existent bytes subroutine will 
> cause an infinite loop under v5.8.2 built for i686-linux.
> 
>   use bytes;
>   print bytes::twtowtdi()
> 
> Were these problems resolved in later releases of Perl?

This appears to have been solved by 5.8.7.

Nicholas Clark


Re: Unicode::Collate, useful but useless

2007-04-15 Thread Nicholas Clark
On Sun, Apr 15, 2007 at 06:52:15PM +0900, SADAHIRO Tomoyuki wrote:

> However I won't upload U::C with DUCET for Unicode 5.0.0
> until perl 5.8.9, with Unicode Character Database 5.0.0,
> has been released. The reason is:
> 
> - Due to the stability of Unicode normalization,
>   perl 5.8.9/5.9.5 with U.C.D. 5.0.0 and DUCET for 4.1.0
>   can still perform the collation for 4.1.0 conformantly.
> - As DUCET doesn't keep backward compatibility, the latest
>   maint (5.8.8) with U.C.D. 4.1.0 and DUCET for 5.0.0 is
>   conformant neither with 4.1.0 nor with 5.0.0.
> 
> Hence my intention is to upload a newer U::C with DUCET
> for 5.0.0 onto CPAN after release of perl 5.8.9 with
> U.C.D. 5.0.0 as a newer maint-perl.
> 
> Speaking technically, perl-current (and perl-5.8.x) can have
> either DUCET for 4.1.0 or that for 5.0.0.
> If perl 5.8.9/5.9.5 has DUCET for 4.1.0, it can do the
> collation for 4.1.0, but not for 5.0.0. If perl 5.8.9/5.9.5
> has DUCET for 5.0.0, it can do the collation for 5.0.0,
> taking advantage of its U.C.D. 5.0.0.

If I understand this all correctly, it means that if I bundled DUCET 5.0.0
with the release of 5.8.9, it could break things for people who have installed
Unicode::Collate with 5.8.8 (or earlier) and are currently using DUCET 4.1.0.

So it wouldn't be a great idea.

Nicholas Clark


Re: List of unsupported unicode characters?

2007-01-10 Thread Nicholas Clark
On Wed, Jan 10, 2007 at 12:02:32AM -0800, Darren Duncan wrote:

> Now that the consortium has Unicode 5.0.0 out, I hope that Perl 5.8.9 
> includes an understanding of it.  Or if it doesn't, then Perl 5.10.0 
> should at least, and I think already does in its 5.9.x dev branch.

I think it likely that 5.8.9 will ship with Unicode 5.0.0 data.

Nicholas Clark


Re: Layers Issue in SVN::Notify

2006-07-11 Thread Nicholas Clark
On Mon, Jul 10, 2006 at 11:57:52AM -0700, David Wheeler wrote:
> Greetings fellow Perlers,
> 
> I've had some complaints for a while now about non-ASCII characters  
> not properly showing up in emails sent by my SVN::Notify module. Last  
> week Éric Cholet figured out how to get it to work: He simply set the  
> LANG environment variable to fr_FR.ISO8859-1.
> 
> So the problem is the LANG environment variable. Setting it to 'C' in  
> a BEGIN block doesn't work, either. But my sense is that it shouldn't  
> make any difference what the environment is set to if I'm setting the  
> IO layer on the file handle. Here is the code for creating the handle  
> in SVN::Notify:

> # Child process. Execute the commands.
> exec @_ or die "Cannot exec $_[0]: $!\n";
> # Not reached.
> }
> }
> 
> I'm using binmode to set the IO layer on the pipe both for reading  
> and writing pipes, so I'd expect it to do the right thing vis-a-vis  
> the localization without regard to the LANG environment variable  
> (Éric had the io_layer attribute set to 'raw'). But obviously I'm  
> wrong. Is there something I'm missing about when and/or where the IO  
> layer should be set? Anyone run into something like this before?

I doubt that perl's at fault. What are you piping the data into, and
what does it think of $LANG in the environment?

Nicholas Clark


Re: Converting between UTF8 and local codepage without specifying local codepage

2005-11-09 Thread Nicholas Clark
On Wed, Nov 09, 2005 at 10:02:31AM -0500, David Schlegel wrote:
> That is helpful information. I have been spending time to determine the 
> local page by other means but have consistently been challenged that this 
> is the wrong approach and that Perl must know somehow. Getting a 
> definitive answer is almost as helpful as getting a better answer. 
> 
> Based on what you are saying, there is no way to ask Perl what the "local 
> codepage" is and hence there can be no variant of "Encode" which can be 
> told to convert from "local codepage" to UTF8 without having to provide 
> the "local codepage" value explicitly. 

Yes. A good summary of the situation.

> Is I18N::Langinfo(CODESET())  the best way to determine the local codepage 
> for Unix ? Windows seems to reliably include the codepage number in the 
> locale but Unix is all over the map.

I don't know. I have little to no experience of doing conversion of real
data, certainly for data outside of ISO-8859-1 and UTF-8, and I've never used
I18N::Langinfo. I hope that someone else on this list can give a decent
answer.
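
For reference, the usage being asked about is roughly this (a sketch; whether
its answer can be trusted across Unixes is exactly the open question):

    use I18N::Langinfo qw(langinfo CODESET);
    my $codeset = langinfo(CODESET);   # e.g. "UTF-8" or "ISO-8859-1"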

Nicholas Clark



Re: Converting between UTF8 and local codepage without specifying local codepage

2005-11-09 Thread Nicholas Clark
On Tue, Nov 08, 2005 at 05:08:08PM -0500, David Schlegel wrote:
> And yes, figuring out the local code page on unix is particularly 
> squirrelly.  The codepage for "fr_CA.ISOxxx" is pretty easy but what about 
> "fr_CA" and "fr" ? There are a lot of aliases and rules involved so that 
> the locale is just about useless (in one case you can tell it is shift-JIS
> because the "j" in the locale is capitalized; I wish I was kidding!).
> 
> As a number of others have suggested to me it seems like something basic 
> that Perl should absolutely know someplace internally. But I have yet to 
> find an API to get it. 
> If there was some way to do decode/encode without having to know the local
> codepage, that would make me happy too. I just want to get encode/decode to
> work.

No, it's not something that Perl knows internally. By default all case
conversion and similar operations are done 8-bit cleanly, but assuming
US-ASCII for 8-bit data. If you "use locale;" then system locales are used
for case-related operations and collation. This is done by calling the C
function setlocale() with the strings from the environment variables LC_CTYPE
and LC_COLLATE, which sets the behaviour of C functions such as toupper() and
tolower(). Hence Perl *still* has no idea what the local code page is
called, even when it's told to use it. The situation is the same for any C
program.
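
A small illustration of the difference (assuming an ISO-8859-1 locale such as
fr_FR.ISO8859-1 is in effect; \xe9 is e-acute there):

    {
        use locale;          # case mapping now follows LC_CTYPE
        print uc "\xe9";     # prints "\xc9" (E-acute) under that locale
    }
    print uc "\xe9";         # prints "\xe9": not a letter in US-ASCII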

Unicode tables are used for Unicode data, and there is a (buggy) assumption
that 8 bit data can be converted to Unicode by assuming that it's ISO-8859-1.
Definitely buggy. Not possible to change without breaking backward
compatibility.

Nicholas Clark


Re: should a non-breaking space character be treated as whitespace in perl source?

2005-10-25 Thread Nicholas Clark
On Wed, Oct 05, 2005 at 05:20:34PM -0400, [EMAIL PROTECTED] wrote:
> Should a non-breaking space character be treated as whitespace in
> perl source code?  It doesn't appear to be:

As far as I know, code points outside the range 0-127 are invalid by default,
except in quoted material (q, qq, etc). Under use utf8; Unicode word
characters can also be used in identifiers.

I doubt that this will change in perl 5, because the parser is written in C,
and so it would be very hard work to replace it with something that was fully
Unicode aware.

Nicholas Clark


Re: Encoding iso-8859-16

2005-08-19 Thread Nicholas Clark
On Fri, Aug 19, 2005 at 05:51:10PM +0530, Sastry wrote:
> Hi 
> 
> The test case uses an invariant character that is below 127 on the
> ISO-8859-16 codepage. Since character 'a' has a codepoint of 129 in
> EBCDIC, is there a place in the code where it should apply the
> NATIVE_TO_ASCII macro on the input character?

I don't know.

And if the test is only checking for invariant characters below 127, it
doesn't strike me as a very thorough test.

Nicholas Clark


Re: Encoding iso-8859-16

2005-08-19 Thread Nicholas Clark
On Fri, Aug 19, 2005 at 05:01:04PM +0530, Sastry wrote:
> Hi Nicholas
> 
> With reference to my previous mail on encoding module
> 
> use Encode;
> $string = "a";
> $enc_string = encode("iso-8859-16", $string);
> print "\n String: $string\n";
> print "\n enc_string: $enc_string\n";
> 
> a)How different are those ext/Encode/def_t.c and
> ext/Encode/Byte/byte_t.c  files in EBCDIC and ASCII platforms?

I don't know. I have no experience of EBCDIC. The files describe converting
from perl's internal representation to a fixed external representation.
So I assume that they have to differ because the internal representation
differs.

> b) Why is it that when I copied the above .c files from an ASCII platform
> to EBCDIC, it worked for every codepage except the IBM-1047 codepage on the
> EBCDIC platform?

I don't know. How thorough are the tests? Do the tests check for the
conversion of characters with Unicode code points >127?

You're asking questions beyond my knowledge.

Nicholas Clark


Re: Encoding iso-8859-16

2005-08-10 Thread Nicholas Clark
On Wed, Aug 10, 2005 at 02:11:45PM +0530, Sastry wrote:
> On 8/9/05, Nicholas Clark <[EMAIL PROTECTED]> wrote:
> > On Tue, Aug 09, 2005 at 10:58:48AM +0530, Sastry wrote:

> > > > $enc_string = encode("iso-8859-16", $string);

> > So $enc_string should be a single byte, 97, everywhere.
> Can you suggest some pointers in the code to fix this?

No, not really. I think that the perl level sub encode() is implemented by
Encode::XS::encode.

I'd start by putting a gdb breakpoint on that (XS_Encode__XS_encode) and
simply single stepping at a C level in from there to see where the code
goes, and check that each step does what I'd expect.
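
Something like this, assuming a perl built with debugging symbols and a small
reproduction script (test.pl here is a placeholder):

    (gdb) break XS_Encode__XS_encode
    (gdb) run test.pl
    (gdb) step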

Nicholas Clark



Re: Encoding iso-8859-16

2005-08-09 Thread Nicholas Clark
On Tue, Aug 09, 2005 at 10:58:48AM +0530, Sastry wrote:
> Hi
> 
> I get 73 printed on the EBCDIC platform.  I think it is supposed to print
> 129, as that is the numeric equivalent of 'a'.
> 
> -Sastry
> 
> 
> 
> On 8/8/05, Nicholas Clark <[EMAIL PROTECTED]> wrote:

> > On your EBCDIC platform, what does this give?
> > 
> >>>>>>> It prints 73 
> > use Encode;
> > $string = "a";
> > $enc_string = encode("iso-8859-16", $string);
> > 
> > print ord ($enc_string), "\n";

73. Odd.

It should print 97 on all platforms. Because:

$string contains 1 byte, the byte that represents 'a' in the platform's
default character encoding.

The encode call should convert from the default encoding to iso-8859-16.
And 'a' in iso-8859-16 is 97.
Everywhere.

So $enc_string should be a single byte, 97, everywhere.

Nicholas Clark


Re: Transliteration operator(tr//)on EBCDIC platform

2005-08-08 Thread Nicholas Clark
On Thu, Aug 04, 2005 at 11:42:54AM +0530, Sastry wrote:
> Hi
> 
> I am trying to run this script on an EBCDIC platform using perl-5.8.6
>  
> ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
> is($a, "");
> 
> 
> The result I get is 
> 
>  'X«»ðý±°X'
> 
> a) Is this happening  since \x8a\x8b\x8c\x8d\x8f\x90 are the gapped
> characters in EBCDIC ?

I think so. In that \x89 is 'i' and \x91 is 'j'.


> b) Should all the bytes in $a change to X?

I don't know. It seems to be some special case code in regexec.c:

#ifdef EBCDIC
/* In EBCDIC [\x89-\x91] should include
 * the \x8e but [i-j] should not. */
if (literal_endpoint == 2 &&
((isLOWER(prevvalue) && isLOWER(ceilvalue)) ||
 (isUPPER(prevvalue) && isUPPER(ceilvalue))))
{
if (isLOWER(prevvalue)) {
for (i = prevvalue; i <= ceilvalue; i++)
if (isLOWER(i))
ANYOF_BITMAP_SET(ret, i);
} else {
for (i = prevvalue; i <= ceilvalue; i++)
if (isUPPER(i))
ANYOF_BITMAP_SET(ret, i);
}
}
else
#endif


which I assume is making [i-j] in a regexp leave a gap, but [\x89-\x91] not.
I don't know where ranges in tr/// are parsed, but given that I grepped
for EBCDIC and didn't find any analogous code, it looks like tr/\x89-\x91//
is treated as tr/i-j//, and in turn i-j is treated as letters and always
"special cased".

I don't know if tr/i-j// and tr/\x89-\x91// should behave differently
(i.e. whether we currently have a bug).

Nicholas Clark


Re: Encoding iso-8859-16

2005-08-08 Thread Nicholas Clark
On Thu, Aug 04, 2005 at 11:51:44AM +0530, Sastry wrote:
> Hi
> 
> I am running the following script on EBCDIC 
> 
> use Encode;
> $string = "a";
> $enc_string = encode("iso-8859-16", $string);
> print "\n String: $string\n";
> print "\n enc_string: $enc_string\n";
> 
> 
> The output:
> 
> String: a
> enc_string: ñ (This is the character for codepoint \xF1 on iso-8859-16)
> 
> What is the expected output in enc_string?

You're doing way too many steps here at once to work out what's going on.

On your EBCDIC platform, what does this give?

use Encode;
$string = "a";
$enc_string = encode("iso-8859-16", $string);

print ord ($enc_string), "\n";

__END__

Nicholas Clark


Re: gmake (perl-5.8.6) fails on z/OS

2005-07-28 Thread Nicholas Clark
On Thu, Jul 28, 2005 at 02:37:36AM -0700, rajarshi das wrote:

> However, if I change the first instance to :
> --- utf8.c  2004-11-17 18:22:09.0 +0530
> +++ utf8.c.22005-07-28 13:48:24.0 +0530
> @@ -363,6 +363,11 @@ Perl_utf8n_to_uvuni(pTHX_ U8 *s, STRLEN
> warning = UTF8_WARN_EMPTY;
> goto malformed;
>  }
> +#ifdef EBCDIC
> +if (uv == 0xBA) {
> +   uv = NATIVE_TO_UTF(uv);
> +}
> +#endif
> 
>  if (UTF8_IS_INVARIANT(uv)) {
> if (retlen)
> 
> This allows gmake to complete. 
> Thanks for all your help on this.

Do you have any idea *why* this change makes things work?

Nicholas Clark


Re: bareword test on ebcdic.

2005-07-27 Thread Nicholas Clark
On Tue, Jul 26, 2005 at 08:48:10AM -0700, rajarshi das wrote:

> > For the code points being tested
> > ("\x{0442}\x{0435}\x{0441}\x{0442}")
> > does the perl source file contain the correct byte
> > sequence in UTF-EBCDIC?
> Yes it does, since I ran the test, 
> if (($hash{"\x{0442}\x{0435}\x{0441}\x{0442}"}) eq
> ($hash{eval '"\x{0442}\x{0435}\x{0441}\x{0442}"'}))
> print "ok\n";
> and the test ran fine, if that is what you mean by the
> source file containing the correct byte sequence. Or
> am I mistaken ?

You are mistaken, I'm afraid. bareword means no quotes.

In ASCII & UTF-8 land, the 1 liner

$ perl -le 'use utf8; $a{ඬ}++; print map {ord} keys %a'

gives

3500


The 3 bytes in the source code between '{' and '}' are 224, 182 and 172
which are the UTF-8 encoding for the code point 3500.

My question is, what are the bytes in UTF-EBCDIC that encode code point 3500?
If you put those 3 bytes directly between the '{' and '}' characters in
the EBCDIC version of that 1 liner, does it also print 3500?

> > If so, *that* would explain the failures, and be the
> > thing that needs
> > correcting. The test file would need if/else with a
> > different test on EBCDIC.
> what would you suggest be put in the if/else?

I think that the regression tests tended to do something like

if (ord 'A' == 65) {
  # Do the ASCII/UTF-8 version
} else {
  # Assume EBCDIC
}

Nicholas Clark


Re: gmake (perl-5.8.6) fails on z/OS

2005-07-26 Thread Nicholas Clark
On Tue, Jul 26, 2005 at 08:34:02AM -0700, rajarshi das wrote:

> Yes, the second call to NATIVE_TO_UTF is still present
> in the modified code. Typically, one wouldn't want to
> do a NATIVE_TO_UTF(NATIVE_TO_UTF(uv)) which is what I
> am doing due to the second call. But does that make a
> difference to miniperl ? 

Well, the code is linked into miniperl, so I can only assume that it's
getting called.

If so, does removing the second instance of NATIVE_TO_UTF() improve things?

Nicholas Clark


Re: gmake (perl-5.8.6) fails on z/OS

2005-07-26 Thread Nicholas Clark
On Tue, Jul 26, 2005 at 07:55:21AM -0700, rajarshi das wrote:

> The change is in the fn Perl_utf8n_to_uvuni :
> ---
> .
> 
> 
> #define UTF8_WARN_LONG   8
> #define UTF8_WARN_FFFF   9 /* Also FFFE. */
> 
> if (curlen == 0 &&
> !(flags & UTF8_ALLOW_EMPTY)) {
> warning = UTF8_WARN_EMPTY;
> goto malformed;
> }
> #ifdef EBCDIC   /* the change */
>   uv = NATIVE_TO_UTF(uv);
> #endif  /* the change ends here */
> 
> if (UTF8_IS_INVARIANT(uv)) {
> if (retlen)
> *retlen = 1;
> return (UV) (NATIVE_TO_UTF(*s));
> }
> 
> ...


A context or unified diff with the original source would have been clearer,
at least in part because most people on the list are used to it.

I don't understand this code, so I don't know why. I think that part of the
reason that you're getting few to no responses to your questions is because
few people understand this code in the first place, and none of them
understand how it interacts with EBCDIC.


The only thing I can think of is that I notice that further down that function
there is:

#ifdef EBCDIC
uv = NATIVE_TO_UTF(uv);
#else
if ((uv == 0xfe || uv == 0xff) &&
!(flags & UTF8_ALLOW_FE_FF)) {
warning = UTF8_WARN_FE_FF;
    goto malformed;
}
#endif

Is that second call to NATIVE_TO_UTF still present in your modified code?

Nicholas Clark


Re: gmake (perl-5.8.6) fails on z/OS

2005-07-26 Thread Nicholas Clark
On Tue, Jul 26, 2005 at 07:22:55AM -0700, rajarshi das wrote:
> Hi,
> I made the following modifications to utf8.c :
> #ifdef EBCDIC
>   uv = NATIVE_TO_UTF(uv);
> #endif

Where in utf8.c? Your description of what you changed is inadequate for
anyone else to understand the context of your change.

> I tried redoing it with a clean build, but it still
> fails. 
> 
> Why does configpm generate errors ? 

I don't know. I don't fully understand how perl's UTF-8 implementation is
supposed to work on EBCDIC platforms.

1: Is that the only change you've made to the source code?
2: Without that change, how does your build fail? How do the errors differ?

Nicholas Clark


Re: value stored in $1

2005-06-10 Thread Nicholas Clark
On Fri, Jun 10, 2005 at 12:02:27PM +0100, Nicholas Clark wrote:
> It would be better if you sent 1 e-mail to both perl-unicode.perl.org
> and perl5-porters@perl.org to ask a given question, rather than two.

It would help if I got the address correct.
(or avoided using the format used in DNS zone files)

Nicholas Clark


Re: request for categorising unicode test failures on z/OS.

2005-04-22 Thread Nicholas Clark
On Wed, Apr 13, 2005 at 02:39:47PM +0530, Rajarshi Das wrote:
> Hi,
> I am using perl-5.8.6 and running unicode tests for the same on z/OS unix. 
> I have categorised unicode test failures as below and wish to know which 
> of the below (any or all of them) is/ are the most important to fix from a 
> customer perspective. 

The lack of responses suggests that no-one on the list is confident enough
to give an answer. I suspect that it depends a lot on your user data.

> 1) Unicode properties. E.g. regular expressions of the kind 
> \p{EastAsianWidth : A} are tested and fail. 
> 2) Testing DBM (database) filters by storing and retrieving specific data 
> (unicode alpha, beta and gamma characters) causes failures.
> 3) Unicode case folding is being tested (CaseFolding.txt) and fails. 
> 4) The gsm0338 unicode specification is tested and fails. 
> 5) Problems with utf hashes : 
> We have a test which contains the two following lines :
> ---
> use utf8;
> my %hash = (^69^22^AC^A0^69^21^69^22 => 123);
> ---
> On running the script, perl complains thus : Unrecognized character \x69.
> 
> 6) Pattern matches of the form  (":$lower:" =~ /:(([$UPPER])+):/i), where 
> $lower is a "latin small letter A with grave" and $upper is a "latin 
> capital letter A with grave", fail, whereas matches of the form 
> (":$lower:" =~ /:(($UPPER)+):/i) pass. The character class causes a 
> failure. 
> 
> 7) Writing to a file which is opened using ":raw:encoding(utf16le)" fails. 
> 
> 
> What would be the order of importance of the above categories of unicode 
> failures from a customer perspective ?

I also can't answer this, but my hunch is that from a debugging perspective,
tackling 7) and 5) first is the way to go. Until these bugs are solved, it's
quite probable that attempts to solve the other problems will be hindered
by errors introduced by these bugs.

Nicholas Clark


Re: is it utf8 or unicode?

2005-03-16 Thread Nicholas Clark
On Tue, Mar 15, 2005 at 01:06:57PM -0500, Ed Batutis wrote:

> I've seen bugs caused by utf-8 magically turning into iso-latin-1 (and
> sometimes leaving the utf-8 flag set), but that might not be what is going
> on here. (And I haven't seen that lately.)

We fixed a lot of core bugs. CPAN module authors have also fixed bugs.

Nicholas Clark


Re: is it utf8 or unicode?

2005-03-16 Thread Nicholas Clark
On Mon, Mar 14, 2005 at 12:14:12PM +, [EMAIL PROTECTED] wrote:

> Here's the problem:
> I have the data in a db, it is utf-8 encoded so I get it into perl
> as \xC3\x84. I turn on the utf-8 flag and then output it as xml
> using the module XML::LibXML. The module XML::LibXML has two output
> methods, toFH and toString.
> If I generate xml using the above data and with an encoding of utf-8,
> I get two different files. One is correct (using toFH) the other
> isn't (it contains \xC4, invalid utf-8).
> toFH does not use perl's IO, toString does.
> I thought, at first, that the module may be incorrect, however,
> when the xml created by toString is parsed in memory, it passes ok.
> ie the error occurs during the output. Which means the module is ok.

We (at work) think that the module is buggy, but we have yet to formally
report it. Specifically, its XS code is not checking the internal UTF8
flag before doing things with the PV.

Nicholas Clark


Re: is it utf8 or unicode?

2005-03-16 Thread Nicholas Clark
On Wed, Mar 16, 2005 at 10:23:01AM +, [EMAIL PROTECTED] wrote:

> LANG is set to en_GB.
> With some messing about I have managed to create an en_GB.utf8.
> Setting LANG to that makes no difference to the perl output, as does setting 
> LC_ALL.
> Mind you, I should hope it wouldn't as :raw ignores locale, apparently.
> 
> In a nutshell, the code below should put \xc3\x84 into the output file and
> not \xc4 as it is doing. Well, I presume it should and no one is saying 
> otherwise.

No, it shouldn't put the bytes \xc3\x84 into the file
(except on perl 5.8.0 with a UTF8 locale, or on 5.8.1 or later run with the
correct -C flag to say "pay attention to a UTF8 locale"; 5.8.0's behaviour
was documented, but found to be undesirable).

> #!/usr/bin/perl -w
> use Encode(_utf8_on);
> my $data = "\xC3\x84";
> _utf8_on($data);
> open FH, ">aa";
> print FH $data ;
> print length($data);

As is, except for the cases noted above, the file handle is assumed to be
8 bit, not UTF8. Perl 5 makes the assumption (arguably wrong, but we're stuck
with it now) that 8 bit file handles would like ISO-8859-1, and writes out
your characters as ISO-8859-1.

If you do this

#!/usr/bin/perl -w 
use Encode(_utf8_on); 
my $data = "\xC3\x84"; 
_utf8_on($data); 
open FH, ">aa"; 
binmode FH, ":utf8";
print FH $data ; 
print length($data); 

or this

#!/usr/bin/perl -w 
use Encode(_utf8_on); 
my $data = "\xC3\x84"; 
_utf8_on($data); 
open FH, ">:utf8", "aa"; 
print FH $data ; 
print length($data); 

to tell perl that the file handle is expecting UTF8 rather than the default,
then you get a 2 byte file output.

Nicholas Clark


Re: real UTF-8 vs. utf8n_to_uvuni()

2004-12-06 Thread Nicholas Clark
On Sun, Dec 05, 2004 at 11:58:54AM +0900, Dan Kogai wrote:

> Since Gisle's patch makes use of utf8n_to_uvuni(), it seems to be a
> problem in the perl core.  So I have checked utf8.c, which defines that.
> Seems like it does not make use of PERL_UNICODE_MAX.
> 
> The patch against utf8.c fixes that.

But it breaks 2 core tests, t/op/tr.t and ext/Unicode/Normalize/t/illegal.t.

> --- perl-5.8.x/utf8.c   Wed Nov 17 23:11:04 2004
> +++ perl-5.8.x.dan/utf8.c   Sun Dec  5 11:38:52 2004
> @@ -429,6 +429,13 @@
> }
> else
> uv = UTF8_ACCUMULATE(uv, *s);
> +   /* Checks if ord() > 0x10FFFF -- dankogai */
> +   if (uv > PERL_UNICODE_MAX){
> +   if (!(flags & UTF8_ALLOW_LONG)) {
> +   warning = UTF8_WARN_LONG;
> +   goto malformed;
> +   }
> +   }
> if (!(uv > ouv)) {
> /* These cannot be allowed. */
> if (uv == ouv) {

(this is utf8 mangled by an 8 bit terminal)

not ok 54 - translit w/complement
# Failed at t/op/tr.t line 229
Wide character in print at ./test.pl line 48.
#  got 'ĬÃ
   ĭĬÃ
Ä­'
Wide character in print at ./test.pl line 48.
# expected 'Ä­Ã
   Ä­Ä­Ã
Ä­'
ok 55
ok 56 - translit w/deletion
ok 57
ok 58 - translit w/squeeze
ok 59
ok 60
ok 61
ok 62
ok 63 - UTF range
ok 64
ok 65
ok 66
ok 67
ok 68
not ok 69
# Failed at t/op/tr.t line 288
Wide character in print at ./test.pl line 48.
#  got 'È'
# expected 'X'
not ok 70
# Failed at t/op/tr.t line 291
Wide character in print at ./test.pl line 48.
#  got 'È'
# expected 'X'


and

not ok 91
# Failed test 91 in ext/Unicode/Normalize/t/illegal.t at line 53 fail #10
not ok 92
# Failed test 92 in ext/Unicode/Normalize/t/illegal.t at line 54 fail #10
not ok 93
# Failed test 93 in ext/Unicode/Normalize/t/illegal.t at line 55 fail #10
not ok 94
# Failed test 94 in ext/Unicode/Normalize/t/illegal.t at line 56 fail #10
ok 95
not ok 96
# Failed test 96 in ext/Unicode/Normalize/t/illegal.t at line 58 fail #10
not ok 97
# Failed test 97 in ext/Unicode/Normalize/t/illegal.t at line 59 fail #10
not ok 98
# Failed test 98 in ext/Unicode/Normalize/t/illegal.t at line 60 fail #10
not ok 99
# Failed test 99 in ext/Unicode/Normalize/t/illegal.t at line 61 fail #10
not ok 100
# Failed test 100 in ext/Unicode/Normalize/t/illegal.t at line 62 fail #10
not ok 101
# Failed test 101 in ext/Unicode/Normalize/t/illegal.t at line 53 fail #11
not ok 102
# Failed test 102 in ext/Unicode/Normalize/t/illegal.t at line 54 fail #11
not ok 103
# Failed test 103 in ext/Unicode/Normalize/t/illegal.t at line 55 fail #11
not ok 104
# Failed test 104 in ext/Unicode/Normalize/t/illegal.t at line 56 fail #11
ok 105
not ok 106
# Failed test 106 in ext/Unicode/Normalize/t/illegal.t at line 58 fail #11
not ok 107
# Failed test 107 in ext/Unicode/Normalize/t/illegal.t at line 59 fail #11
not ok 108
# Failed test 108 in ext/Unicode/Normalize/t/illegal.t at line 60 fail #11
not ok 109
# Failed test 109 in ext/Unicode/Normalize/t/illegal.t at line 61 fail #11
not ok 110
# Failed test 110 in ext/Unicode/Normalize/t/illegal.t at line 62 fail #11
ok 111
ok 112

I don't know what is at fault here, the tests, or the patch.

Nicholas Clark


Re: Segfault using HTML::Entities

2004-07-07 Thread Nicholas Clark
On Wed, Jun 30, 2004 at 11:19:46PM +0100, Richard Jolly wrote:

> This is now officially way above my head!
> 
> If the bug has disappeared in a recent maintenance version, do I need 
> to file a bug report? I'm sure the test case could be cut down, but I'm 
> not sure I know how.

I don't think that you need to file a report, as we're now aware of it.
Jarkko managed to cut the test case down to something very small, but we
can't manage to make a fix that doesn't break regexps in something else,
seemingly completely unrelated.

Nicholas Clark


Re: Segfault using HTML::Entities

2004-06-30 Thread Nicholas Clark
On Wed, Jun 30, 2004 at 10:15:13PM +0100, Richard Jolly wrote:
> 
> On 30 Jun 2004, at 17:52, Nicholas Clark wrote:
> 
> >On Tue, Jun 29, 2004 at 06:49:16PM +0100, Richard Jolly wrote:
> >> Script
> >
> >Could you resend the script/data test case as an attachment please?
> 
> Attached.

Thanks.

Looks like a core bug, as it's all going pear shaped somewhere in the regexp
engine. You need a UTF8 locale to provoke it:

$ LC_ALL=en_GB.utf8 PERL_UNICODE= valgrind /home/nick/Sandpit/-i-g/bin/perl5.9.2 
old.pl 
==11515== Memcheck, a memory error detector for x86-linux.
==11515== Copyright (C) 2002-2003, and GNU GPL'd, by Julian Seward.
==11515== Using valgrind-2.1.0, a program supervision framework for x86-linux.
==11515== Copyright (C) 2000-2003, and GNU GPL'd, by Julian Seward.
==11515== Estimated CPU clock rate is 2808 MHz
==11515== For more details, rerun with: -v
==11515== 
==11515== warning: Valgrind's siglongjmp is incomplete
==11515==  (it ignores cleanup handlers)
==11515==  your program may misbehave as a result
The Modern Résumé
Malformed UTF-8 character (unexpected end of string) at 
/home/nick/Sandpit/-i-g/lib/perl5/site_perl/5.9.2/i686-linux-thread-multi/HTML/Entities.pm
 line 435,  line 1.
==11515== Invalid read of size 1
==11515==at 0x817005A: Perl_utf8n_to_uvuni (utf8.c:418)
==11515==by 0x816E6DD: S_reginclass (regexec.c:4364)
==11515==by 0x81610A5: S_find_byclass (regexec.c:968)
==11515==by 0x8165259: Perl_regexec_flags (regexec.c:1945)
==11515==  Address 0x42475ED6 is 0 bytes after a block of size 18 alloc'd
==11515==at 0x40027C66: malloc (vg_replace_malloc.c:160)
==11515==by 0x80C9F31: Perl_safesysmalloc (util.c:67)
==11515==by 0x80CB5DD: Perl_savepvn (util.c:780)
==11515==by 0x8165718: Perl_regexec_flags (regexec.c:2053)
==11515== 
==11515== Invalid read of size 1
==11515==at 0x817008F: Perl_utf8n_to_uvuni (utf8.c:425)
==11515==by 0x816E6DD: S_reginclass (regexec.c:4364)
==11515==by 0x81610A5: S_find_byclass (regexec.c:968)
==11515==by 0x8165259: Perl_regexec_flags (regexec.c:1945)
==11515==  Address 0x42475ED6 is 0 bytes after a block of size 18 alloc'd
==11515==at 0x40027C66: malloc (vg_replace_malloc.c:160)
==11515==by 0x80C9F31: Perl_safesysmalloc (util.c:67)
==11515==by 0x80CB5DD: Perl_savepvn (util.c:780)
==11515==by 0x8165718: Perl_regexec_flags (regexec.c:2053)
==11515== 
==11515== Invalid read of size 1
==11515==at 0x817005A: Perl_utf8n_to_uvuni (utf8.c:418)
==11515==by 0x816E6DD: S_reginclass (regexec.c:4364)
==11515==by 0x8166CB6: S_regmatch (regexec.c:2542)
==11515==by 0x8165E1A: S_regtry (regexec.c:2198)
==11515==  Address 0x42475ED6 is 0 bytes after a block of size 18 alloc'd
==11515==at 0x40027C66: malloc (vg_replace_malloc.c:160)
==11515==by 0x80C9F31: Perl_safesysmalloc (util.c:67)
==11515==by 0x80CB5DD: Perl_savepvn (util.c:780)
==11515==by 0x8165718: Perl_regexec_flags (regexec.c:2053)
==11515== 
==11515== Invalid read of size 1
==11515==at 0x817008F: Perl_utf8n_to_uvuni (utf8.c:425)
==11515==by 0x816E6DD: S_reginclass (regexec.c:4364)
==11515==by 0x8166CB6: S_regmatch (regexec.c:2542)
==11515==by 0x8165E1A: S_regtry (regexec.c:2198)
==11515==  Address 0x42475ED6 is 0 bytes after a block of size 18 alloc'd
==11515==at 0x40027C66: malloc (vg_replace_malloc.c:160)
==11515==by 0x80C9F31: Perl_safesysmalloc (util.c:67)
==11515==by 0x80CB5DD: Perl_savepvn (util.c:780)
==11515==by 0x8165718: Perl_regexec_flags (regexec.c:2053)
==11515== 
==11515== Invalid read of size 1
==11515==at 0x8166D13: S_regmatch (regexec.c:2547)
==11515==by 0x8165E1A: S_regtry (regexec.c:2198)
==11515==by 0x816113D: S_find_byclass (regexec.c:972)
==11515==by 0x8165259: Perl_regexec_flags (regexec.c:1945)
==11515==  Address 0x42475ED7 is 1 bytes after a block of size 18 alloc'd
==11515==at 0x40027C66: malloc (vg_replace_malloc.c:160)
==11515==by 0x80C9F31: Perl_safesysmalloc (util.c:67)
==11515==by 0x80CB5DD: Perl_savepvn (util.c:780)
==11515==by 0x8165718: Perl_regexec_flags (regexec.c:2053)
Malformed UTF-8 character (unexpected non-continuation byte 0x73, immediately after 
start byte 0xe9) in substitution iterator at 
/home/nick/Sandpit/-i-g/lib/perl5/site_perl/5.9.2/i686-linux-thread-multi/HTML/Entities.pm
 line 435,  line 1.
==11515== 
==11515== Invalid read of size 1
==11515==at 0x42082515: memmove (in /lib/i686/libc-2.2.5.so)
==11515==by 0x80FC5F4: Perl_sv_setpvn (sv.c:4790)
==11515==by 0x80D3CA9: Perl_magic_get (mg.c:753)
==11515==by 0x80D23FA: Perl_mg_get (mg.c:156)
==11515==  Address 0x42475ED6 is 0 bytes after a block of size 18 alloc'd
==11515==at 0x40027C66: malloc (vg_replace_malloc.c:160)
==11515==by 0x80C9F31: Perl_safesysmalloc (util.c:67)
==11515==by 0x80CB5DD: Perl_savepvn (util.c:780)
==11

Re: Segfault using HTML::Entities

2004-06-30 Thread Nicholas Clark
On Tue, Jun 29, 2004 at 06:49:16PM +0100, Richard Jolly wrote:
>  Script

Could you resend the script/data test case as an attachment please?

It's been mangled by the format=flowed setting on your mailer, and currently
I'm getting errors which suggest that I can't undo that damage.

Thanks

Nicholas Clark


Re: Unicode filenames on Windows with Perl >= 5.8.2

2004-06-22 Thread Nicholas Clark
On Mon, Jun 21, 2004 at 08:46:07AM -0700, Jan Dubois wrote:

> I think it is possible, but it requires someone to both do the work and
> to argue for it on P5P. Without this "champion", I don't see it
> happening at all.

Nor do I. But P5P isn't big on arguing for arguing's sake these days.
Suggest a workable solution, volunteer to actually do it and I think
that everyone will be happy.

My only thought is: should the API be full SVs, or a char pointer plus a
utf8/not-utf8 flag (possibly as 1 bit in a flags word)?

Nicholas Clark


Re: BOM and principle of least surprise

2004-05-18 Thread Nicholas Clark
On Mon, May 17, 2004 at 09:02:39PM -0000, Erland Sommarskog wrote:

> Thanks for the update on the possibly upcoming patches!

The patch definitely exists. It's in the development track of perl (5.9.x):

http://public.activestate.com/cgi-bin/perlbrowse?patch=22818

I've not yet integrated it into the maintenance branch, but I intend to,
so unless it causes really really strange errors (very unlikely) it will
be in 5.8.5. (Which in turn is due out in mid-July.)

Nicholas Clark


Re: How to handle unicode strings in utf8 and pre-utf8 pragma perls

2003-05-31 Thread Nicholas Clark
I can't help you on the important questions, but

On Sat, May 31, 2003 at 01:33:28AM +, Richard Evans wrote:

> Conceptually something like:
> 
>   use utf8 if $] >= 5.006;# Yes, I know this won't even compile in
>   # reality :)

use if $] >= 5.006, utf8;

On CPAN as http://search.cpan.org/author/ILYAZ/if-0.0101/
In the core since 5.8.0

Nicholas Clark


Re: Reading/writing non-Unicode files with perl5.8?

2003-01-14 Thread Nicholas Clark
On Mon, Jan 13, 2003 at 03:35:37PM -0800, Deneb Meketa wrote:
> I'm a longtime 5.005/5.6.1 user.  I recently upgraded my
> Linux system to RH8.0 and got perl5.8 in the bargain.  I
> have many perl scripts that read or write non-Unicode files,
> mostly ANSI files.  Many of those scripts have broken,
> seemingly because of Unicode-forcing behavior in perl5.8.
> 
> (It is possible that some other part of my system upgrade is
> responsible, like maybe my shell; if anyone knows of some
> kind of system-wide Unicode infestation that could be the
> cause of these problems, please let me know!)

RedHat 8 defaults to setting UTF8 locales.
UTF8 locales cause perl5.8 to switch to Unicode mode, because perl assumes
that you meant to set a UTF8 locale.

> three bytes.  I understand the Unicode translation that is
> happening here, I just don't want it!

> What I'm reading is not a UTF-8 file - it's an ANSI file!
> Is there some way to tell perl to just read the bytes without
> translation?

Changing your locale to not be UTF8 should stop all the translations.
(Make sure that the environment variables LANG, LANGUAGE and LC_ALL
and LC_CTYPE don't contain a string matching /utf-?8/i)
I don't know what sets these variables on RedHat systemwide, so I don't
know how to change them.
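
A quick way to see which of them is responsible (a sketch):

    perl -le 'for (qw(LANG LANGUAGE LC_ALL LC_CTYPE)) {
        print "$_=$ENV{$_}" if defined $ENV{$_} && $ENV{$_} =~ /utf-?8/i;
    }'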

My personal opinion is that it was premature of RedHat to make RedHat 8.0
*default* to using UTF8 locales, given the general state of UTF8 support
in most programs running on Linux. Others may disagree.

Nicholas Clark



Re: [not-yet-a-PATCH] compress Encode better

2002-12-21 Thread Nicholas Clark
On Mon, Nov 04, 2002 at 03:26:16AM +, [EMAIL PROTECTED] wrote:
> Nicholas Clark <[EMAIL PROTECTED]> wrote:
> :I've been experimenting with how enc2xs builds the C tables that turn into the
> :shared objects. enc2xs is building tables (arrays of struct encpage_t) which
> :in turn have pointers to blocks of bytes.
> 
> Great, you seem to be getting some excellent results.
> 
> I have also wondered whether the .ucm files are needed after these
> have been built; if not, we should consider supplying with perl only
> the optimised table data if that could give us a space saving in the
> distribution - it would cut build time significantly as well as
> allowing us to consider algorithms that take much longer over the
> table optimisation, since they need be run only once when we
> integrate updated .ucm files.
> 
> Hmm, I wonder how distributable an optimal algorithm could be, and
> how many SETI-hours it would take to run? :)

Well, the brute force search could take a little while:

perl5.8.0 ../bin/enc2xs -B -Q -O -o experiment.c -f symbol_t.fnm 
Reading AdobeSymbol (AdobeSymbol)
Reading AdobeZdingbat (AdobeZdingbat)
Reading dingbats (dingbats)
Reading MacDingbats (MacDingbats)
Reading MacSymbol (MacSymbol)
Reading symbol (symbol)
Writing compiled form
Preparing for brute force search at Sat Dec 21 23:42:34 2002
There are 167 strings, 1.5e+300 permutations to try, target to beat is 1762
Total length is 1762
Starting brute force search at Sat Dec 21 23:42:34 2002
Depth 152 try 152 
'111'
 length already 1762, best is 1762, so pruning, at Sat Dec 21 23:42:34 2002

That looks to be one of the faster ones. Most of the rest give things like this:
There are 1263 strings, Inf permutations to try, target to beat is 12764

That string of 0s and 1s is part of the state record, mostly for debugging.

I think I need some of Damian's parallel Universes. Else I'm going to wear
this one out. The brute force search can quickly get to the current
(non -O) algorithm for small cases, but not for the current -O  algorithm.
So I'm nowhere near beating it. I need better cheats. Er shortcuts.

Nicholas Clark
-- 
Brainfuck better than perl? http://www.perl.org/advocacy/spoofathon/



Re: [not-yet-a-PATCH] compress Encode better

2002-12-20 Thread Nicholas Clark
On Mon, Nov 04, 2002 at 03:26:16AM +, [EMAIL PROTECTED] wrote:
> Nicholas Clark <[EMAIL PROTECTED]> wrote:

> 
> :The default method is to see if my substring is already present somewhere,
> :if so note where, if not append at the end. The (currently buggy) -O optimiser
> :method also tries to see whether it can avoid appending the entire string to
> :the end by looking for overlap at the start or the end. Clearly, I've not got
> :that bit right yet, but I've run out of time tonight. Is there a better
> :approximate algorithm that could find more space savings for more [or less :-)]
> :CPU? I guess is analogous to trying to build a word-search table, but in 1D
> :rather than 2D. (I'm hoping Hugo has experience of this sort of thing)
> 
> Not directly, no; I believe it is a question that has been studied,
> however, and it feels like the sort of problem that would yield some
> tricks at least for finding better local minima. Have you got any
> details on how many substrings there are, and what sort of profile
> of lengths they comprise?

I've hacked enc2xs (patch at end, if anyone wants to look themselves) to
calculate stats on the distribution of lengths.

> :Meanwhile, here are hard numbers. enc2xs from Encode 1.80:
> : 9386526 total
> :
> :Improved enc2xs:
> : 5084037 total
> :
> :Improved enc2xs with AGGREGATE_TABLES
> : 4706245 total
> 
> It might be worth seeing if we can easily get any useful estimates
> for a theoretical lower bound, to give us an idea how much blood
> can still be squeezed out.

Well, the absolute lower bound has to be for the unrealistic case of all
strings being substrings of the longest. However, for a more practical idea
of what's attainable, I suspect that we ought to be able to find an order
that gets a string block about the same length as the current block
compressed with Zlib.

$ perl5.8.0 ../JP/enc2xs-experiment -Q -O -o experiment.c -f sh_06_t.fnm 
Reading shiftjis (shiftjis)
Writing compiled form
32900 bytes in string tables
1724 bytes (95%) saved spotting duplicates
26 bytes (99.9%) saved using substrings
There were 4259 strings, total length 34650
Length  Count  Count %   Cum %  Bytes  Bytes %   Cum %
2        2726   64.01%  64.01%   5452   15.73%  15.73%
3           2    0.05%  64.05%      6    0.02%  15.75%
4         886   20.80%  84.86%   3544   10.23%  25.98%
6         324    7.61%  92.46%   1944    5.61%  31.59%
8         109    2.56%  95.02%    872    2.52%  34.11%
10         59    1.39%  96.41%    590    1.70%  35.81%
12         24    0.56%  96.97%    288    0.83%  36.64%
14         10    0.23%  97.21%    140    0.40%  37.04%
15          2    0.05%  97.25%     30    0.09%  37.13%
16         10    0.23%  97.49%    160    0.46%  37.59%
20          3    0.07%  97.56%     60    0.17%  37.77%
21          1    0.02%  97.58%     21    0.06%  37.83%
24          2    0.05%  97.63%     48    0.14%  37.97%
30          4    0.09%  97.72%    120    0.35%  38.31%
31          1    0.02%  97.75%     31    0.09%  38.40%
32          2    0.05%  97.79%     64    0.18%  38.59%
34          2    0.05%  97.84%     68    0.20%  38.78%
36          2    0.05%  97.89%     72    0.21%  38.99%
40          1    0.02%  97.91%     40    0.12%  39.11%
45          1    0.02%  97.93%     45    0.13%  39.24%
48          2    0.05%  97.98%     96    0.28%  39.51%
60          2    0.05%  98.03%    120    0.35%  39.86%
62          1    0.02%  98.05%     62    0.18%  40.04%
66          1    0.02%  98.07%     66    0.19%  40.23%
69          1    0.02%  98.10%     69    0.20%  40.43%
78          2    0.05%  98.15%    156    0.45%  40.88%
96          2    0.05%  98.19%    192    0.55%  41.43%
100         1    0.02%  98.22%    100    0.29%  41.72%
110         1    0.02%  98.24%    110    0.32%  42.04%
111         1    0.02%  98.26%    111    0.32%  42.36%
126         1    0.02%  98.29%    126    0.36%  42.72%
128         1    0.02%  98.31%    128    0.37%  43.09%
138         1    0.02%  98.33%    138    0.40%  43.49%
153         1    0.02%  98.36%    153    0.44%  43.93%
189        35    0.82%  99.18%   6615   19.09%  63.02%
249         1    0.02%  99.20%    249    0.72%  63.74%
282         2    0.05%  99.25%    564    1.63%  65.37%
375        32    0.75% 100.00%  12000   34.63% 100.00%
Raw buffer is 34650 bytes, compressed is 28962 (16%)
Real buffer is 32900 bytes, compressed is 27637 (16%)

If it's not obvious, the second percentage column for each is cumulative.
Interestingly, for my straw poll of Japanese encodings, there is a big
cumulative hike at the longest length:

perl5.8.0 ../JP/enc2xs-experiment -Q -O -o experiment.c -f eu_01_t.fnm 
Reading euc-jp (euc-jp)
Writing compiled form
78434 bytes in string tables
5446 bytes (93.5%) saved spotting duplicates
34 bytes (100%) saved using substrings
There were 8933 strings, total length 83914
...
276 1   0.01%   98.54%  276

Re: CGI and UTF

2002-11-20 Thread Nicholas Clark

On Wed, Nov 20, 2002 at 05:38:20PM -0000, Mark Proctor wrote:

[upgrading from 5.6.1 to 5.8]

> I have checked with the sysadmins at cisco and they said "no way" :(

I'm not asking this as an attempt to provide arguments to give them back - if
they are sure of their position, then it is necessary to work within it.

But did they say *why* they are so insistent that 5.8.0 is not feasible?

[such as house policy on not using .0 versions? time taken to assess and
approve releases meaning that approving 5.8.0 is a lot of effort?
Something specific they don't like about 5.8.0?]

Basically is there something that the perl development community needs to do
(or change) that would avoid this in future?

Nicholas Clark
-- 
Befunge better than perl?   http://www.perl.org/advocacy/spoofathon/



Re: [not-yet-a-PATCH] compress Encode better

2002-11-04 Thread Nicholas Clark

On Mon, Nov 04, 2002 at 08:11:04PM +0900, Dan Kogai wrote:
> NC and porters,
> 
>First of all, this is a great patch.  Not only does it optimize the 
> resulting shlibs, it seems to consume less memory during compilation.

Thanks. I wasn't actually trying to reduce memory usage during compilation
(either running the perl script, or running the C compiler)

The only change that was explictly thinking about memory and CPU usage for
the perl script was this one:
-   # We have a single long line. Split it at convenient commas.
-   $definition =~ s/(.{74,77},)/$1\n/g;
-   print $fh "$definition };\n\n";
+  # We have a single long line. Split it at convenient commas.
+  print $fh $1, "\n" while $definition =~ /\G(.{74,77},)/gcs;
+  print $fh substr ($definition, pos $definition), " };\n";

and I was bad and didn't actually benchmark its effects.
[instead of doing things in memory, and constantly re-copying the remainder of
the string every time the s///g adds a newline, the revised version prints
the sections of string out (and lets the IO system worry about aggregating
sections into one string]

> Thank you, NC.

It's not a problem. It allowed me to put off doing other stuff :-)
[such as actually writing the book review I am supposed to be doing for
http://london.pm.org/reviews/ ]

On Mon, Nov 04, 2002 at 01:42:58PM +, Nick Ing-Simmons wrote:
> Dan Kogai <[EMAIL PROTECTED]> writes:
> >On Monday, Nov 4, 2002, at 19:17 Asia/Tokyo, Nick Ing-Simmons wrote:
> >> Someone could/should write a generic test that pushes all codepoints
> >> supported by a .ucm file both ways through the generated encoder
> >> and checks for correctness. This would be a pointless thing to do
> >> as part of perl's "make test" as once the "compiler" works it works,
> >> but would be useful for folk working on the compile process.
> >
> >That is already in as t/rt.pl.  Since the test takes a long time (30 
> >seconds on my PowerBook G4 800MHz) it is not a part of standard 'make 
> >test' suite.  The NC Patch passes all that.
> 
> Excellent. 

Mr Burns couldn't have put it better. :-)

On Mon, Nov 04, 2002 at 08:19:57PM +0900, Dan Kogai wrote:
> On Monday, Nov 4, 2002, at 20:11 Asia/Tokyo, Dan Kogai wrote:
> > oh wait!  Encode.xs remains unchanged so Encode::* may still work
> 
> Confirmed.  The NC patch works w/ preexisting shlibs.

Good. It would have been worrying if it had not. The idea was not to
change any of the internal data structures visible to any code anywhere,
just to change how the U8 strings they point were arranged.

Nicholas Clark
-- 
z-code better than perl?http://www.perl.org/advocacy/spoofathon/



Re: [not-yet-a-PATCH] compress Encode better

2002-11-03 Thread Nicholas Clark
On Sun, Nov 03, 2002 at 11:13:25PM +, Nicholas Clark wrote:
> Currently the appended patch passes all regression tests on FreeBSD on
> bleadperl. However, having experimented I know that the new -O function it
> provides is buggy in some way, as running -O on the Chinese encodings gives
> regression test errors. (so don't apply it yet). I've not looked at what the
> Encode regression tests actually do, so I don't know how thoroughly they
> check whether the transformations are actually correct. In other words,
> done correctly this approach *will* generate the same transformation tables
> as before, and although I *think* I'm doing it correctly (without the -O;
> patches welcome) I'm not certain of this.

Too slow. :-)

Appended patch fixes the optimiser, which is now permanently on.
[not sure if it's worth it. I suspect it gives < .5% size saving, but it's
not convenient to check currently]

Nicholas Clark
-- 
sendmail.conf better than perl? http://www.perl.org/advocacy/spoofathon/

--- ext/Encode/bin/enc2xs.orig  Sat Jun  1 19:33:03 2002
+++ ext/Encode/bin/enc2xs   Sun Nov  3 23:34:25 2002
@@ -6,6 +6,7 @@ BEGIN {
 require Config; import Config;
 }
 use strict;
+use warnings;
 use Getopt::Std;
 my @orig_ARGV = @ARGV;
 our $VERSION  = do { my @r = (q$Revision: 1.30 $ =~ /\d+/g); sprintf "%d."."%02d" x 
$#r, @r };
@@ -186,7 +187,7 @@ END
 print C "#include \n";
 print C "#define U8 U8\n";
}
-  print C "#include \"encode.h\"\n";
+  print C "#include \"encode.h\"\n\n";
 
  }
 elsif ($cname =~ /\.enc$/)
@@ -204,6 +205,9 @@ elsif ($cname =~ /\.pet$/)
 
 my %encoding;
 my %strings;
+my $string_acc;
+my %strings_in_acc;
+
 my $saved = 0;
 my $subsave = 0;
 my $strings = 0;
@@ -250,8 +254,19 @@ if ($doC)
   foreach my $name (sort cmp_name keys %encoding)
{
 my ($e2u,$u2e,$erep,$min_el,$max_el) = @{$encoding{$name}};
-output(\*C,$name.'_utf8',$e2u);
-output(\*C,'utf8_'.$name,$u2e);
+process($name.'_utf8',$e2u);
+addstrings(\*C,$e2u);
+
+process('utf8_'.$name,$u2e);
+addstrings(\*C,$u2e);
+   }
+  outbigstring(\*C,"enctable");
+  foreach my $name (sort cmp_name keys %encoding)
+   {
+my ($e2u,$u2e,$erep,$min_el,$max_el) = @{$encoding{$name}};
+outtable(\*C,$e2u, "enctable");
+outtable(\*C,$u2e, "enctable");
+
 # push(@{$encoding{$name}},outstring(\*C,$e2u->{Cname}.'_def',$erep));
}
   foreach my $enc (sort cmp_name keys %encoding)
@@ -319,9 +334,9 @@ END
   my $perc_saved= $strings/($strings + $saved) * 100;
   my $perc_subsaved = $strings/($strings + $subsave) * 100;
   printf STDERR "%d bytes in string tables\n",$strings;
-  printf STDERR "%d bytes (%.3g%%) saved spotting duplicates\n",
+  printf STDERR "%d bytes (%.3g%%) saved spotting substrings\n",
 $saved, $perc_saved  if $saved;
-  printf STDERR "%d bytes (%.3g%%) saved using substrings\n",
+  printf STDERR "%d bytes (%.3g%%) saved using overlapping appends\n",
 $subsave, $perc_subsaved if $subsave;
  }
 elsif ($doEnc)
@@ -596,43 +611,6 @@ sub enter_fb0 {
   }
 }
 
-
-sub outstring
-{
- my ($fh,$name,$s) = @_;
- my $sym = $strings{$s};
- if ($sym)
-  {
-   $saved += length($s);
-  }
- else
-  {
-   if ($opt{'O'}) {
-   foreach my $o (keys %strings)
-{
- next unless (my $i = index($o,$s)) >= 0;
- $sym = $strings{$o};
- # gcc things that 0x0e+0x10 (anything with e+) starts to look like
- # a hexadecimal floating point constant. Silly gcc. Only p
- # introduces a floating point constant. Put the space in to stop it
- # getting confused.
- $sym .= sprintf(" +0x%02x",$i) if ($i);
- $subsave += length($s);
- return $strings{$s} = $sym;
-   }
-   }
-   $strings{$s} = $sym = $name;
-   $strings += length($s);
-   my $definition = sprintf "static const U8 %s[%d] = { ",$name,length($s);
-   # Maybe we should assert that these are all <256.
-   $definition .= join(',',unpack "C*",$s);
-   # We have a single long line. Split it at convenient commas.
-   $definition =~ s/(.{74,77},)/$1\n/g;
-   print $fh "$definition };\n\n";
-  }
- return $sym;
-}
-
 sub process
 {
   my ($name,$a) = @_;
@@ -693,7 +671,8 @@ sub process
   $a->{'Entries'} = \@ent;
 }
 
-sub outtable
+
+sub addstrings
 {
  my ($fh,$a) = @_;
  my $name = $a->{'Cname'};
@@ -701,20 +680,103 @@ sub outtable
  foreach my $b (@{$a->{'Entries'}})
   {
next unless $b->[AGG_OUT_LEN];
-   my $s = $b->[AGG_MIN_IN];
-   my $e = $b->[AGG_MAX_IN];
-   outstring($fh,sprintf("%s__%02x_%02x",$name,$s,$e),$b->[

[not-yet-a-PATCH] compress Encode better

2002-11-03 Thread Nicholas Clark
I've been experimenting with how enc2xs builds the C tables that turn into the
shared objects. enc2xs is building tables (arrays of struct encpage_t) which
in turn have pointers to blocks of bytes.

The way Nick I-S originally set it up, these blocks of bytes are named after
the part encoding transformation they represent. His output routine had two
levels of space optimisation. It always looked to see if the block of bytes it
was about to output was an exact copy of a block it had already output, and
if so it simply aliased the name of the second block to that of the first.
Additionally one can specify an -O flag to enc2xs, which turns on a brute
force substring search. This looks to see whether anything about to be output
is a substring of an existing block of bytes, and if so outputs an alias to
that offset.

However, the upshot of all this is that enc2xs generates C files with a large
number of moderate to small unsigned char arrays holding runs of bytes.
Instead, I wondered what would be the effect of concatenating all the needed
byte sequences together into one long string, output this as a single C array,
and then make all encpage_t references be offsets into this array.

The results seem to be very promising. On x86 FreeBSD, I find the sum of the
sizes of the shared object files drops by 46%. If I export AGGREGATE_TABLES=1
to make Makefile.PL make enc2xs compile files in aggregate mode I get the size
saving up to 50%. I've not looked to see where this saving is coming from,
but I presume that gcc opts to align the start of character arrays in some
way, so having the same number of bytes split into lots of strings means more
wasted space. The change will be getting some actual byte savings over the
existing system, as my continuous string lets me do Nick's -O substring search
at no extra cost for all encodings, whereas the current Makefile.PL doesn't
enable this for the non-European encodings.

Currently the appended patch passes all regression tests on FreeBSD on
bleadperl. However, having experimented I know that the new -O function it
provides is buggy in some way, as running -O on the Chinese encodings gives
regression test errors. (so don't apply it yet). I've not looked at what the
Encode regression tests actually do, so I don't know how thoroughly they
check whether the transformations are actually correct. In other words,
done correctly this approach *will* generate the same transformation tables
as before, and although I *think* I'm doing it correctly (without the -O;
patches welcome) I'm not certain of this.

I presume that finding the shortest string that has a list of other strings
as substrings is a hard problem (for some formal definition of "hard").
Currently I'm simply sorting all the strings I have into size order, and
building up my long string starting with the longest substring I need.
The default method is to see if my substring is already present somewhere,
if so note where, if not append at the end. The (currently buggy) -O optimiser
method also tries to see whether it can avoid appending the entire string to
the end by looking for overlap at the start or the end. Clearly, I've not got
that bit right yet, but I've run out of time tonight. Is there a better
approximate algorithm that could find more space savings for more [or less :-)]
CPU? I guess is analogous to trying to build a word-search table, but in 1D
rather than 2D. (I'm hoping Hugo has experience of this sort of thing)
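
To make the current approximate method concrete, here is a minimal standalone
sketch of it (not the enc2xs code itself; the real thing works on the
encpage_t byte blocks): longest strings first, reuse an existing occurrence
where there is one, otherwise append, overlapping the new string's head with
the buffer's tail where possible. The underlying problem is the shortest
common superstring problem, which is indeed formally hard.

    sub aggregate {
        my @strings = sort { length $b <=> length $a } @_;
        my ($buffer, %offset) = ('');
        foreach my $s (@strings) {
            my $pos = index $buffer, $s;
            if ($pos < 0) {
                # longest overlap between the buffer's tail and the head of $s
                my $overlap = 0;
                foreach my $n (reverse 1 .. length ($s) - 1) {
                    next if $n > length $buffer;
                    if (substr ($buffer, -$n) eq substr ($s, 0, $n)) {
                        $overlap = $n;
                        last;
                    }
                }
                $pos = length ($buffer) - $overlap;
                $buffer .= substr ($s, $overlap);
            }
            $offset{$s} = $pos;   # offset of $s within the single big table
        }
        return ($buffer, \%offset);
    }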

Meanwhile, here are hard numbers. enc2xs from Encode 1.80:

  222352 18075-32/lib/auto/Encode/Byte/Byte.so
 2059045 18075-32/lib/auto/Encode/CN/CN.so
   28532 18075-32/lib/auto/Encode/EBCDIC/EBCDIC.so
 2687896 18075-32/lib/auto/Encode/JP/JP.so
 2314555 18075-32/lib/auto/Encode/KR/KR.so
   37425 18075-32/lib/auto/Encode/Symbol/Symbol.so
 2024682 18075-32/lib/auto/Encode/TW/TW.so
   12039 18075-32/lib/auto/Encode/Unicode/Unicode.so
 9386526 total

Improved enc2xs:

  190853 18075-Encode-O0/lib/auto/Encode/Byte/Byte.so
 1119692 18075-Encode-O0/lib/auto/Encode/CN/CN.so
   23003 18075-Encode-O0/lib/auto/Encode/EBCDIC/EBCDIC.so
 1351823 18075-Encode-O0/lib/auto/Encode/JP/JP.so
 1252329 18075-Encode-O0/lib/auto/Encode/KR/KR.so
   31947 18075-Encode-O0/lib/auto/Encode/Symbol/Symbol.so
 1102351 18075-Encode-O0/lib/auto/Encode/TW/TW.so
   12039 18075-Encode-O0/lib/auto/Encode/Unicode/Unicode.so
 5084037 total

Improved enc2xs with AGGREGATE_TABLES

  190853 18075-Encode-O0-Agg/lib/auto/Encode/Byte/Byte.so
 1050477 18075-Encode-O0-Agg/lib/auto/Encode/CN/CN.so
   23003 18075-Encode-O0-Agg/lib/auto/Encode/EBCDIC/EBCDIC.so
 1281004 18075-Encode-O0-Agg/lib/auto/Encode/JP/JP.so
 1179594 18075-Encode-O0-Agg/lib/auto/Encode/KR/KR.so
   31947 18075-Encode-O0-Agg/lib/auto/Encode/Symbol/Symbol.so
  937328 18075-Encode-O0-Agg/lib/auto/Encode/TW/TW.so
   12039 18075-Encode-O0-Agg/lib/auto/Encode/Unicode/Unicode.so
 4706245 total

Nicholas Clark
--

Re: RFC 2231 (was Re: Encode::MIME::Header...)

2002-10-09 Thread Nicholas Clark

On Wed, Oct 09, 2002 at 09:13:39AM +0100, Nick Ing-Simmons wrote:
> >Counter-intuitive it may be, you can pass extra 'tips' to that thingy 
> >like
> >
> >   my $e = find_encoding('MIME-Header');
> >   $e->{charset} = ISO-8859-1; # or $e->charset('ISO-8859-1') if we 
> >define a method
> >   my $encoded = $e->encode($str); # now uses =?ISO-8859-1?B?...
> 
> Despite Nick C's speed comments it may make sense to allow optional args
> to encode as well.

which ones were they? I seem to be obsessed with speed currently, so I
doubt I can find them by searching for "speed". And I can't remember why
I might have suggested that allowing optional arguments would induce
serious slowdown. (by implication even when no optional arguments are used)

Nicholas Clark

PS shameless plug for optimising your perl code talk:
   http://www.ccl4.org/~nick/P/Fast_Enough/



Re: Unicode to UTF-8

2002-09-08 Thread Nicholas Clark

On Sat, Sep 07, 2002 at 09:05:13PM -0400, Rick Dillon wrote:
> Hello.
> 
> I am currently populating html pages with content from MS Excel. I am 
> using a Java program that literally places the Excel content directly 
> into the output code (which is saved as html). It appears that Excel 
> is using Unicode characters, which is causing strange glyphs when the 
> html is viewed in a browser. Is there a Perl Way to parse the output 
> and replace the Unicode characters with asciii, or UTF-8 equivalents? 

I don't know the answer to this for sure (but my guess from your
description is that Excel is using 16 bit representation of Unicode, and
your browser expects an 8 bit encoding of some form).

If so, and Excel is only placing Unicode code points in the range 0-255
in your HTML page, then I think something as simple as s/\0(.)/$1/mg in
any perl (probably even perl4) would work. But this is a cheap hack, and
likely to break.
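
To illustrate the hack (a toy case, assuming big-endian 16 bit data whose
code points all fit in one byte):

    my $raw = "\0H\0e\0l\0l\0o";      # "Hello" as 16 bit big-endian units
    (my $fixed = $raw) =~ s/\0(.)/$1/mg;
    print $fixed;                     # prints "Hello"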

If your data from Excel really has Unicode code points above 255, or may do in
the future, then really there's no reliable way to fix your HTML file once
it has a mix of 1 byte and 2 byte characters in it. Either your Java
program should do the conversion to 8 bit (the encoding to UTF8 is not hard,
perl's utf8.h says:

/*

 The following table is from Unicode 3.2.

 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF       C2..DF    80..BF
   U+0800..U+0FFF       E0        A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       *** ill-formed ***
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

Note the A0..BF in U+0800..U+0FFF, the 80..9F in U+D000..U+D7FF,
the 90..BF in U+10000..U+3FFFF, and the 80..8F in U+100000..U+10FFFF.
The "gaps" are caused by legal UTF-8 avoiding non-shortest encodings:
it is technically possible to UTF-8-encode a single code point in different
ways, but that is explicitly forbidden, and the shortest possible encoding
should always be used (and that is what Perl does).

 */

and the relevant part of utf8.c for code points between 0x80 and 0x10000:

    if (uv < 0x800) {
        *d++ = (U8)(( uv >>  6)         | 0xc0);
        *d++ = (U8)(( uv        & 0x3f) | 0x80);
        return d;
    }
    if (uv < 0x10000) {
        *d++ = (U8)(( uv >> 12)         | 0xe0);
        *d++ = (U8)(((uv >>  6) & 0x3f) | 0x80);
        *d++ = (U8)(( uv        & 0x3f) | 0x80);
        return d;
    }

) or alternatively your Java program should output the HTML file entirely in
16 bit, and then use something else (eg perl) to convert that to UTF8 or
whatever your browser likes. Converting the representation of Unicode from
16 bit UCS-2 to UTF8 is just byte shuffling, so any perl can do it.
Offhand, I don't know if there are modules on CPAN already to do it, but
I'd be surprised if there were none - try http://search.cpan.org/
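
By way of illustration, a minimal filter might look like this (my sketch,
assuming big-endian UCS-2 input and no surrogate pairs):

    #!/usr/bin/perl -w
    use strict;
    local $/;                        # slurp the whole input
    my $ucs2 = <STDIN>;
    $ucs2 =~ s/^\xFE\xFF//;          # drop a big-endian BOM, if present
    # unpack "n*" gives the 16 bit code points; pack "U*" re-encodes
    # them as UTF8 (on 5.8 you may also want binmode STDOUT, ":utf8")
    print pack "U*", unpack "n*", $ucs2;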

> And do I need to upgrade to perl 5.6 to do this?

If you are considering upgrading from something like 5.005, is there any
reason not to consider going straight to 5.8.0? The Unicode support
in 5.8.0 is much better than in 5.6.1, and it also fixes many of the bugs
still present in 5.6.1. (Nothing is perfect - a few new bugs have been
reported in 5.8.0, but generally it does seem stable and of good quality)

Nicholas Clark
-- 
Even better than the real thing:http://nms-cgi.sourceforge.net/



Re: Unicode::Collate 0.23 Released

2002-09-05 Thread Nicholas Clark

On Thu, Sep 05, 2002 at 08:36:50AM -0600, Mark Leisher wrote:
> 
> Tomoyuki> Unicode::Collate 0.23 is released.
> 
> Could you remind us where to find it again?  Thanks!

I can find it on CPAN:

http://search.cpan.org/author/SADAHIRO/Unicode-Collate-0.23/

(start at search.cpan.org, enter Unicode::Collate in the box, hit go, top
of the returned list)

CPAN's usually the best place to start when looking for anything perl.

Nicholas Clark



Re: Pattern matching with Unicode (5.6.1)

2002-08-15 Thread Nicholas Clark

On Thu, Aug 15, 2002 at 05:28:43PM -0400, David Gray wrote:
> > I'm having a bit of a problem getting Unicode pattern 
> > matching to do what I would like it to.
> 
> I guess my question wasn't entirely clear. I'm reading in the attatched
> file and trying to split it on "\n\n".
> 
> When I'm looping over the file,
> 
> > I've (sort of) made it work by doing:
> > 
> >  # strip BOM and trailing nulls and carriage returns
> >  s/^..// if $. == 1 and s/\0//g;
> >  s/[\0\r]//g;
> 
> The two-byte BOM has me thinking it's probably UTF-16. Is there an easy
> way to tell what encoding a file uses?

Not that I know of, but all the 0 bytes make me think it is.
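
About the best heuristic I know is to look for a BOM (a sketch, assuming
you only care about the two UTF-16 byte orders; $data holds the raw file
contents):

    my $bom = substr $data, 0, 2;
    my $enc = $bom eq "\xFE\xFF" ? 'UTF-16BE'
            : $bom eq "\xFF\xFE" ? 'UTF-16LE'
            : undef;                 # no BOM - back to guessing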

> > But I'm sure there must be a more elegant way to do this. 
> > Honestly, I'm not even sure where to start. Any ideas?

I find that this:

perl5.6.1 -we 'undef $/; $_=<>; $_ = pack "U*", unpack "v*", $_; substr ($_, 0,
1) = ""; print $_'

does what you want.

Nicholas Clark
-- 
Even better than the real thing:http://nms-cgi.sourceforge.net/



Re: how to utf8::encode and ::decode in 5.6.1

2002-08-10 Thread Nicholas Clark

On Tue, Aug 06, 2002 at 10:36:09PM +0900, SADAHIRO Tomoyuki wrote:
> 
> On Mon, 5 Aug 2002 22:17:10 +0100
> Nicholas Clark <[EMAIL PROTECTED]> wrote:
> 
> > I'm trying to backport ExtUtils::Constant from 5.8.0 to work on perl pre
> > 5.8.0. Currently ExtUtils::Constant is using utf8::encode and utf8::decode
> > to convert Unicode strings to and from their internal byte representation
> > for testing purposes.
> > 
> > For 5.005_03 I don't have a problem - I just skip all the Unicode tests! :-)
> > However, for 5.6.1 (and 5.6.0) I do. I can't work out how to (legally!) get
> > perl to give me the utf8 bytes that represent the Unicode strings, or how
> > to translate a sequence of utf8 bytes back into a perl Unicode string.
> > 
> > So how should I write utf8::encode and utf8::decode for 5.6.1 and 5.6.0?
> > I can cope if a different solution is needed on both.
> 
> How about these codelets?
> (sorry, I haven't try them on 5.6.0).

Thanks. They seem to work very well on 5.6.1
After spending a couple of nights fighting all the Unicode bugs and
unhelpfulness in 5.6.1 with various workarounds, I gave up on the idea of
5.6.0 - it's just too much trouble.
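
For anyone else stuck on 5.6.1, the pack tricks involved go roughly like
this - my reconstruction from memory, not necessarily Tomoyuki's exact
code, so treat it as an untested sketch:

    # "decode": mark a string of UTF8 octets as a character string
    sub utf8ish_decode { $_[0] = pack "U0a*", $_[0] }
    # "encode": recover the raw UTF8 octets from a character string
    sub utf8ish_encode { $_[0] = pack "C0a*", $_[0] }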

> The test.t of my Unicode::Normalize uses many pack() and unpack()
> as tests should be passed both on Perl 5.6.1 and on 5.8.0,
> and via XS and via Non-XS;
> but this technique seems not to be portable to EBCDIC. :-/

I've not got access to EBCDIC, so I've no idea what will go wrong.

However, ExtUtils-Constant-0.13.tar.gz is currently working its way round
CPAN.

I couldn't find any sort of tie hash implementation on CPAN that would
let me reliably mix UTF8 and 8 bit scalars as hash keys for 5.6.1, so I
knocked up a quick one based on your unpack/pack code. (Although I'm
storing the hash keys as a string of BER compressed integers rather than
UTF8 bytes)
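
The guts are not much more than this (a stripped-down, hypothetical
sketch - the real module also has to supply FIRSTKEY, NEXTKEY and
friends):

    package BERKeyHash;                    # made-up name
    # Normalise every key to BER-compressed integers built from its
    # code points, so equivalent byte and UTF8 keys collide as intended.
    sub _norm   { pack "w*", unpack "U*", $_[0] }
    sub TIEHASH { bless {}, shift }
    sub STORE   { $_[0]->{ _norm($_[1]) } = $_[2] }
    sub FETCH   { $_[0]->{ _norm($_[1]) } }
    sub EXISTS  { exists $_[0]->{ _norm($_[1]) } }
    sub DELETE  { delete $_[0]->{ _norm($_[1]) } }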

Did I miss one, or would this be a useful small module to separate out and
upload to CPAN in its own right? Clearly 5.8.0 doesn't need it:


[  7980] By: jhi   on 2000/12/04  19:36:51
Log: UTF-8 hash keys, patch from Inaba Hiroto.
 Branch: perl
   ! embed.h embed.pl hv.c hv.h pod/perlapi.pod proto.h


but I guess there are people needing to stick on 5.6.1 who might find it
useful.

My experience of trying to manipulate data that is sometimes 8 bit, sometimes
UTF-8 on 5.6.1? "Argh".
I'd really strongly recommend upgrading to 5.8.0, where hashes, s/// and tr///
"just work".

If anyone here tries ExtUtils::Constant and finds bugs, particularly in
the Unicode/UTF8 bits, please don't hesitate to report them.

Nicholas Clark
-- 
Even better than the real thing:http://nms-cgi.sourceforge.net/



Re: Tk804 + Encode-1.50 :-) again

2002-04-19 Thread Nicholas Clark

On Sat, Apr 20, 2002 at 04:27:15AM +0900, Dan Kogai wrote:
> Yes, please.  Emacs doesn't do spellcheck-as-you-type like recent 
> mailers in MacOS and Windows :)  (I know you can spellcheck in Emacs but 
> I am not sure if it is a good idea to to do so in .pm).

You underestimate the power of the dark side.

M-x flyspell-mode

Definitely part of the dark side because here it defaults to American.
And then refuses to start because I don't have American dictionaries
installed. ispell has no problem "just running" and finding the correct
dictionaries.

Nicholas Clark
-- 
Even better than the real thing:http://nms-cgi.sourceforge.net/



Re: [Encode] Encode::Guide ? (Was: Re: Encode::CJKguide ...)

2002-03-27 Thread Nicholas Clark

On Wed, Mar 27, 2002 at 11:12:41PM +0900, Dan Kogai wrote:
> Anton,
> 
>   I am glad you liked it but as I announced Encode::CJKguide has been 
> dropped.  I am instead planning to make even more comprehensive guide 
> that is not limited to CJK and upload it as Encode::Guide to CPAN.  I 
> will definely call for your help.

I feel it would be a useful thing to have in standard Encode the
description of what shift encodings, escape encodings and the other
jargon means, and how they work (which seemed to be the start of
Encode::Guide). It can be quite hard to follow what the various hoops
the different encoding systems are forced to jump through, when one
only knows languages which use the Roman alphabet and therefore has
had no direct experience of anything other than ASCII and ISO 8859-1.

But it's more important to get Encode working well than spend time on this
right now.

Nicholas Clark
-- 
Even better than the real thing:http://nms-cgi.sourceforge.net/



Re: Encode::XS for CJK

2002-02-21 Thread Nicholas Clark

On Thu, Jan 31, 2002 at 04:19:23AM +0900, Dan Kogai wrote:
>   And the speed of the compile script may be a problem if we want all 
> CJK to be XS-based.  It roughly takes about 25 seconds to compile single 
> CJK encoding on my FreeBSD box.  Well, I can live with that too but 
> other porters may find it frustrating

Now I've re-read this message I've just noticed that paragraph.
I did get frustrated with it.
1: It's too slow
2: It uses too much RAM. (Well, that's subjective, but my FreeBSD box only
   has 16M total, and it was not a happy bunny, swapping like crazy and taking
   over an hour to run 5 minutes worth of CPU time)

So I've been re-jigging it (and Jarkko has been committing the improvements)
to bleadperl - not sure if you're subscribed to p5p.
By yesterday I think it was 37% faster at compiling EUC_JP, and I've found
some more things to tweak today.

[eg just found that using (unpack "n*", pack "H*", $line) makes it 2.5% faster
than (map {hex $_} $line =~ /(....)/g)
I think that that is portable to big endian, and to 64 bit]
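
i.e. given a $line of 4-digit hex groups, the two produce identical lists:

    my $line = "00410042";                          # two hex groups
    my @slow = map { hex $_ } $line =~ /(....)/g;   # (0x41, 0x42)
    my @fast = unpack "n*", pack "H*", $line;       # same list, faster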

I hope that I've not been tramping on things you've been doing. It's still
making output files that are byte-for-byte identical with what the original of
last week did.

I've got a question about FFFD. The original compile script does this:

   for (my $j = 0; $j < 16; $j++)
    {
     no strict 'refs';
     my $ech = &{"encode_$type"}($ch,$page);
     my $val = hex(substr($line,0,4,''));
     next if $val == 0xFFFD;
     if ($val || (!$ch && !$page))
      {
       my $el  = length($ech);
       $max_el = $el if (!defined($max_el) || $el > $max_el);
       $min_el = $el if (!defined($min_el) || $el < $min_el);
       my $uch = encode_U($val);
       if (exists $seen{$uch})
        {
         warn sprintf("U%04X is %02X%02X and %02X%02X\n",
                      $val,$page,$ch,@{$seen{$uch}});
        }
       else
        {
         $seen{$uch} = [$page,$ch];
        }
       enter($e2u,$ech,$uch,$e2u,0);
       enter($u2e,$uch,$ech,$u2e,0);
      }
     else
      {
       # No character at this position
       # enter($e2u,$ech,undef,$e2u);
      }
     $ch++;
    }


Is there a bug?
Should the $ch++ happen even for the cases where $val == 0xFFFD?
Currently it looks like $ch is not incremented when the input value is 0xFFFD
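
If the answer is that it should, then I'd guess the minimal fix (untested)
is to bump $ch before skipping:

    $ch++, next if $val == 0xFFFD;    # instead of the bare "next"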

Nicholas Clark
-- 
EMCFT http://www.ccl4.org/~nick/CV.html



Re: Encode; Should we aggregate all EUCs?

2002-02-06 Thread Nicholas Clark

On Wed, Feb 06, 2002 at 09:59:44AM +, Nick Ing-Simmons wrote:
> Nicholas Clark <[EMAIL PROTECTED]> writes:
> >On Tue, Feb 05, 2002 at 04:29:34PM +, Nick Ing-Simmons wrote:
> >> If I throw jis208.enc into the pot, then without -O it is 12s
> >> and with -O approx 4 minutes for a trivial saving.
> >
> >Whatever is default on 14550 didn't make a very nice noise while compiling

Aargh. I meant 14566. I read the wrong directory name.
I built 14566 last night on FreeBSD 4.5
I built 14550 a couple of days ago on FreeBSD 4.5 RC

> >on my FreeBSD box. Load average was 0.11, and one of the disks was thrashing
> >a lot. [Unfortunately I have upgraded to hard disks without LED jumpers, so
> >I can no longer use the front panel blinkenlights to say whether it is harassing
> >the source directory, /tmp, swap or /usr]
> >
> >Maybe I should figure out how vmstat works, but I have my suspicion that it's
> >hammering the machine in the wrong way, and maybe it could trade using more
> >memory for less disk access in order to compile faster.
> >[Then again, more memory => swap => disks, and as best I can tell on FreeBSD
> >less memory => free memory used as extra disk buffers, so the OS is doing its
> >best however you config things]. Or am I barking up the wrong tree?
> 
> That version would have been doing substring search on on EUC_JP.
> There are some large hashes - but I have not had a problem on my machines
> (but even the laptop has 192M).

This machine has 16M RAM :-)
It's been cobbled together from freebies.

> 14564 will stop doing the search - but perhaps will use slightly more memory
> as a result.

So this what I was building last night. Sorry about the confusion.

[not sure if editing messages on the outgoing mailspool works :-)
Rebuilding reveals that miniperl is swapping like crazy. Not surprising with
only 16M RAM.]
 
Nicholas Clark
-- 
EMCFT http://www.ccl4.org/~nick/CV.html



Re: Encode; Should we aggregate all EUCs?

2002-02-05 Thread Nicholas Clark

On Tue, Feb 05, 2002 at 04:29:34PM +, Nick Ing-Simmons wrote:
> If I throw jis208.enc into the pot, then without -O it is 12s
> and with -O approx 4 minutes for a trivial saving.

Whatever is default on 14550 didn't make a very nice noise while compiling
on my FreeBSD box. Load average was 0.11, and one of the disks was thrashing
a lot. [Unfortunately I have upgraded to hard disks without LED jumpers, so
I can no longer use the front panel blinkenlights to say whether it is harassing
the source directory, /tmp, swap or /usr]

Maybe I should figure out how vmstat works, but I have my suspicion that it's
hammering the machine in the wrong way, and maybe it could trade using more
memory for less disk access in order to compile faster.
[Then again, more memory => swap => disks, and as best I can tell on FreeBSD
less memory => free memory used as extra disk buffers, so the OS is doing its
best however you config things]. Or am I barking up the wrong tree?

Nicholas Clark
-- 
EMCFT http://www.ccl4.org/~nick/CV.html



Re: Encode; Should we aggregate all EUCs?

2002-02-05 Thread Nicholas Clark

On Tue, Feb 05, 2002 at 08:38:28AM +, Nick Ing-Simmons wrote:
> Dan Kogai <[EMAIL PROTECTED]> writes:

> Perhaps we make "Build CJK encodings?" a Configure question?
> We could determine default based on locale, or (as I once
> did for a UK/USA paper size choice) by TZ.

> >107853 bytes (112%) saved spotting duplicates
> 
> Probably worth keeping.
> 
> >22801 bytes (23.6%) saved using substrings
> 
> That is where the time goes - there is a loop which uses index()
> on all existing strings to see if it can re-use one.
> It saves 22K but is that worth while?

Then surely this extra searching becomes the configure question?

  Try harder to compress CJK encodings (this will slow your build considerably)?
  [no]


Unless we find a more efficient algorithm to search for common substrings.

Nicholas Clark
-- 
EMCFT http://www.ccl4.org/~nick/CV.html