Re: [ngram] Re: plans for version 1.05

Richard Jelinek Thu, 14 Feb 2008 14:53:57 -0800

On Thu, Feb 14, 2008 at 08:59:29PM -0000, Ted Pedersen wrote:
> You seem to be saying there is a better option than "use locale",


Yes - make use of the unicode capabilities of perl.

> which I'm more than willing to believe. However, what I can't estimate
> at present is how difficult or time consuming it would be to modify
> NSP in the way you describe. We'll certainly follow up on your hints

It is more time consuming than the "use locale" way, of course. But
given NSP's codebase, it's a task that is doable in reasonable time.

> The advantage of "use locale" is that it seems to solve at least some
> problems, and it's a fairly simple modification to make. So as
> imperfect as it might be, it seems better than what we have now.

This advantage is illusory, unfortunately. Illusory in the sense
that the "some problems" it seems to solve rely on a well set up
environment on the OS side, which isn't always the case. Moreover,
"use locale" will - in most cases - give you good results only for
languages that match the locale environment on a given machine.

That is: if a user on a "Czech host" with a correctly set up Czech
locale tries to process Czech text, it will be ok. However, if the
same user on the same host tries to process Turkish text: *boom*.

> Further comments discussions on use locale versus other alternatives
> is more than welcome, and would in fact be appreciated.

I wonder why the original author had problems with a Catalan text
anyway. The only two viable encodings for Catalan I know of are
iso-8859-1 and windows-1252. iso-8859-1 should give him no problem,
because that's what NSP has been created and (mostly) tested with.

He probably caught a win-1252 encoded text, which could cause the
problems he described.
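
To illustrate how the two encodings can diverge, here is a minimal
sketch using the Encode module. The sample bytes are made up for
illustration (they are not taken from the text in question); they
stand for typographic quotes, which windows-1252 places in the
0x80-0x9F range where iso-8859-1 has only control characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: "hola" wrapped in typographic double quotes,
# as a Windows editor might save it.
my $bytes = "\x93hola\x94";

my $as_latin1 = decode('iso-8859-1', $bytes);
my $as_cp1252 = decode('cp1252',     $bytes);

# Same byte, two different characters depending on the assumed encoding.
printf "iso-8859-1: U+%04X\n", ord $as_latin1;   # a C1 control character
printf "cp1252:     U+%04X\n", ord $as_cp1252;   # LEFT DOUBLE QUOTATION MARK
```

Text that stays within the letters both encodings share decodes
identically either way; only bytes in that upper range differ.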

The effort to get a perl application unicode-clean isn't that high. At
least it isn't higher than twiddling with locales. You just have to
catch all input streams (where data comes in) and all output streams
(where the application emits data), then decode the data on input and
encode it on output.
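
A minimal sketch of that decode/encode boundary, assuming the input
encoding is known. The helper name upcase_text and the sample string
are hypothetical and not part of NSP; uc merely stands in for
whatever processing happens between input and output.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: bytes in, bytes out, characters in between.
sub upcase_text {
    my ($bytes, $in_enc, $out_enc) = @_;
    my $chars = decode($in_enc, $bytes);   # input boundary: bytes -> characters
    my $upper = uc $chars;                 # works on characters, no locale needed
    return encode($out_enc, $upper);       # output boundary: characters -> bytes
}

# "\xE9" is e-acute in iso-8859-1.
my $out = upcase_text("caf\xE9", 'iso-8859-1', 'UTF-8');
```

For file handles the same boundary can be declared once at open time
with an I/O layer, for example '<:encoding(iso-8859-1)', so that every
read already yields decoded characters.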

See http://search.cpan.org/~dankogai/Encode-2.23/Encode.pm

You must - and this is a mandatory requirement - always know what
encoding your input data are in. Without this, no reliable processing
can be guaranteed.
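
As a rough illustration of why this matters, the sketch below decodes
strictly, so that bytes which are invalid in the claimed encoding raise
an error instead of being silently mangled. The helper try_decode and
the sample bytes are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded string, or undef if the
# bytes are not valid in the claimed encoding.
sub try_decode {
    my ($enc, $bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps decode() from modifying $bytes.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return eval { decode($enc, $bytes, $check) };
}

my $latin1_bytes = "caf\xE9 au lait";                  # iso-8859-1 data

my $wrong = try_decode('UTF-8',      $latin1_bytes);   # wrong claim: fails
my $right = try_decode('iso-8859-1', $latin1_bytes);   # right claim: decodes
```

Note that this only catches claims that are structurally impossible.
A single-byte encoding such as iso-8859-1 accepts any byte sequence,
so a wrong guess between two single-byte encodings goes unnoticed.
That is why the encoding has to be known rather than guessed.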


-- 
Kind regards,

     Dipl.-Inf. Richard Jelinek

     - The PetaMem Group - Prague/Nuremberg - www.petamem.com -
                 -= 2007-09-25: 49235653 Mind Units =-
