Re: Performance and interface of Encode(3pm) in perl 5.8.0-RC1

Nick Ing-Simmons Thu, 11 Jul 2002 03:59:25 -0700

Guido Flohr <[EMAIL PROTECTED]> writes:
>Until this morning I didn't know about the new Encode interface.  Last
>weekend I had started something quite similar which is entirely written
>in Perl (no C code).  I have chosen a slightly different interface,
>however, and maybe you are interested to learn why.
>
>My interface looks roughly like this:
>
>       my $cd = Locale::Iconv->new (from => 'Windows-1250',
>                                    to   => 'iso-8859-2');
>
>       my $success = $cd->recode ($input);
>
>I always convert "in-place", that's the first difference.  The main
>drawback of this are possible run-time errors "attempt to modify
>read-only values" when called with constant arguments.  But the
>memory footprint is a lot better and besides even in C you cannot
>copy large memory areas for free.  Why create unnecessary copies?


For my Tk application of Encode the in-place form causes unnecessary
copies, e.g. I need both the original and the form encoded into the
encoding required by the font, so I have to copy the input arg to the
return location.

Converting in-place is very hard to do when converting between two
variable-length encodings. I suspect your "all perl" version is not _really_
doing it "in place" but just in the same scalar, in different PV "buffers".
The Encode API is written to allow the core of encodings to be written in C.
Keeping the return value and the source separate is very useful for C.
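
To make the contrast concrete, here is a minimal sketch (not taken from the Tk code mentioned above) of the two styles using Encode's procedural functions. The sample byte and the variable names are illustrative assumptions.

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

my $octets = "\x8A";    # S with caron in cp1250 (sample input, chosen for illustration)

# Separate source and result: the decoded text stays available
# after it has been encoded for the font.
my $text     = decode('cp1250', $octets);
my $for_font = encode('iso-8859-2', $text);

# In-place style: from_to() overwrites the buffer it is given,
# so keeping the original means copying it first.
my $buffer = $octets;
from_to($buffer, 'cp1250', 'iso-8859-2');
```

With the separate form both $text and $for_font remain usable; with from_to() only the converted $buffer is left unless a copy was made beforehand.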

>
>The second (and IMHO more important) difference is the object-oriented
>interface.  The object returned by the constructor can be re-used
>(and the conversion has to be initialized only once), I can pass
>it around to other objects (important in large modularized projects),
>and I can still offer a procedural interface at almost no cost

Sounds exactly like the way Encode is implemented!
I suspect you are only using Encode via its procedural interface.

>:
>
>       sub Locale::Iconv::iconv_open
>       {
>               Locale::Iconv::new (from => $_[0], to => $_[1]);
>       }
>
>And now I can do
>
>       my $cd = iconv_open ('Windows-1250' => 'iso-8859-2');
>
>and say in an iconv(3) fashion
>
>       my $success = recode ($cd, $input);
>
>Internally my objects of type Locale::Iconv contain an encoding
>chain that leads from the source (from) encoding to the destination
>(to) encoding.  Theoretically this chain can have an arbitrary
>length (like in the GNU libc iconv implementation) but I either
>know a direct path (all conversion modules are capable of converting
>into UTF-8 or my internal format) or I take an intermediate step
>via the internal representation.
>
>My internal representation is simply a reference to an array of
>the corresponding ISO 10646 codes, which allows me to use map()
>instead of operating on strings.
>
>After I have learned about Encode(3pm) I have written a test
>script that compares the three different conversion techniques
>I know: my own one, Text::Iconv which uses the iconv(3) implementation
>of the libc resp. libiconv, and finally Encode::from_to from
>perl 5.8.0.
>
>For each implementation I convert a tiny (10 bytes), a small
>(100 bytes), and a large (100 k) buffer from Windows-1250 to
>ISO-8859-2.  The buffers do not contain any characters in the
>range from \x80 to \x9f so that the conversions can never fail
>and actually do not change anything.
>
>For those implementations (mine and Text::Iconv) that allow to
>reuse a conversion handle, two flavors of the test exist: one
>that creates that handle once, and then converts in a loop,
>another that creates that handle anew in every round.
>
>On my system (GNU-Linux, glibc 2.2.2) I approximately get the
>following results (number of iterations in parentheses, results
>in seconds):
>
>              | tiny (2000000) | small (200000) | large (200)
>--------------+----------------+----------------+-------------
>Locale::Iconv |         510    |          120   |       120
>(cached)      |         160    |           90   |       120
>Text::Iconv   |          56    |            7   |         1.3 
>(cached)      |          18    |            3   |         1.3
>Encode        |         120    |            1.5 |         0.4
>
>Nice to see that Encode is a lot faster than iconv() when operating
>on large buffers.  But the result for very small buffers is
>disappointing.  My pure Perl version takes only 33 % longer for
>the same job (160 s compared to 120 s) because it doesn't have
>the overhead to resolve the aliases, find the correct encodings
>and initialize its state information for each call.  

I would use Encode that way as well.

  my $enc = find_encoding('cp1250');
  my $string  = decode($enc,$octets); 
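
Extending that to the full conversion from the benchmark, a sketch along these lines looks up both encoding objects once and reuses them for every buffer. The helper name recode() and the sample inputs are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Resolve aliases and load the encodings once, outside the loop.
my $from = find_encoding('cp1250')     or die "cp1250 not available";
my $to   = find_encoding('iso-8859-2') or die "iso-8859-2 not available";

# Hypothetical helper: octets in the source encoding in,
# octets in the target encoding out.
sub recode {
    my ($octets) = @_;
    return $to->encode($from->decode($octets));
}

my @converted = map { recode($_) } ("abc", "\x8A", "xyz");
```

Each call then costs only the two method calls, not the name lookup that from_to() repeats every time.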

>For that trivial
>encoding (Windows 1250 and iso-8859-2 are more or less the same)
>I could actually write a specialized module that omits the 
>intermediate ISO 10646 representation and I wouldn't be surprised
>if that conversion module outperformed the C version included
>in Encode.

For trivial translations between 8-bit encodings a canned tr///
will do the job just fine.
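
A canned tr/// can be typed out by hand. The sketch below instead generates the table once from Encode and compiles it with eval, which is only one possible way to obtain it. The name $cp1250_to_latin2 is hypothetical, and bytes that do not map are assumed to come back as Encode's default substitution character.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Only the high-bit bytes need a table; ASCII is the same in both encodings.
my @bytes  = (0x80 .. 0xFF);
my $source = join '', map { chr } @bytes;

# Ask Encode once where each cp1250 byte lands in iso-8859-2.
my $target = encode('iso-8859-2', decode('cp1250', $source));
die "expected a one-to-one table" unless length($target) == length($source);

# Spell both lists as \xNN escapes so nothing is special inside tr///.
my $from_list = join '', map { sprintf '\\x%02X', $_ } @bytes;
my $to_list   = join '', map { sprintf '\\x%02X', ord } split //, $target;

# tr/// tables are fixed at compile time, hence the string eval.
my $cp1250_to_latin2 =
    eval "sub { \$_[0] =~ tr/$from_list/$to_list/; return \$_[0] }"
    or die $@;

my $buf = "S\x8A";
$cp1250_to_latin2->($buf);    # modifies $buf in place through the @_ alias
```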

Encode is mainly about getting external data to/from perl's internal
form so you can manipulate it. If you just want to transform between
encodings then dedicated tools like iconv will out-perform the perl
version. However, in my limited experience the problems arise when
things do not map. As soon as that happens you want the
perl script to "look at it and decide what to do", and then
the conversion to internal form is a win.
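
As an illustration of "look at it and decide what to do", this sketch tries a strict encode first and falls back to a script-level decision when a character does not map. The euro sign case and the "EUR" replacement are assumptions chosen for the example, as is the helper name.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my $src = find_encoding('cp1250')     or die "cp1250 not available";
my $dst = find_encoding('iso-8859-2') or die "iso-8859-2 not available";

# Hypothetical helper: convert, and let the script decide what to do
# when a character has no counterpart in the target encoding.
sub recode_or_substitute {
    my ($octets) = @_;
    my $string = $src->decode($octets);

    # FB_CROAK makes encode() die on unmappable characters;
    # LEAVE_SRC keeps $string intact for the second attempt.
    my $out = eval {
        $dst->encode($string, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $out if defined $out;

    # Working on the internal form: the euro sign exists in cp1250
    # but not in iso-8859-2, so spell it out instead.
    $string =~ s/\x{20AC}/EUR/g;
    return $dst->encode($string);
}

my $result = recode_or_substitute("\x80 5");
```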


> 
>One could argue that the above described test case is pathological.
>I don't think so.  The current interface of Encode is ok when you
>operate on strings.  But there are situations where you operate
>on data _streams_ and then the difference may become very significant.

Which is why we have the :encoding layer in the PerlIO system.
That:
  A. Caches the encoding object.
  B. Buffers the IO and works on whole buffers to avoid
     small string effects.
  C. Handles partial characters when the stream gets broken across pipe
     boundaries etc.
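
A minimal sketch of stream conversion with the :encoding layer follows. The helper name recode_file() and the temporary files are assumptions made to keep the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: copy one file to another, letting the PerlIO
# layers decode on input and encode on output.
sub recode_file {
    my ($in_name, $out_name, $from, $to) = @_;
    open my $in,  "<:encoding($from)", $in_name  or die "open $in_name: $!";
    open my $out, ">:encoding($to)",   $out_name or die "open $out_name: $!";
    while (my $line = <$in>) {
        print {$out} $line;    # $line holds characters, not octets
    }
    close $in;
    close $out or die "close $out_name: $!";
    return;
}

# Sample input: one cp1250 line containing S with caron (0x8A).
my ($in_fh, $in_name) = tempfile(UNLINK => 1);
binmode $in_fh;
print {$in_fh} "\x8Aa\n";
close $in_fh or die "close: $!";

my ($out_fh, $out_name) = tempfile(UNLINK => 1);
close $out_fh;

recode_file($in_name, $out_name, 'cp1250', 'iso-8859-2');
```

The loop body never calls Encode directly; the layers do the lookup once at open time and convert as the buffers fill and drain.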

>Besides, IMHO both the object-oriented and the "handle" approach
>are cleaner in design.

I quite agree - which is why Encode works the same way :-)


-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/
