[ngram] Re: Problem with a token

2008-02-14 Thread mercevg
Patrick, Ted,

I added "use locale;" at line 83, but it did not improve my results:
words containing "l·l" (like "intel·ligència") are still not included
in the results list.

But I should say that I added as tokens all the accented characters,
diaereses and apostrophes used in the Catalan corpus, and I have had
good results. I think that is the solution for this kind of character,
except for "l·l" ("l geminada").

Best regards,
Mercè
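
One possible reason "use locale" does not help with "l·l": the middle
dot (U+00B7) is punctuation rather than a letter, so a pattern such as
/\w+/ is likely to split "intel·ligència" in two whatever the locale.
Below is a minimal sketch of a token pattern that allows the dot between
letters. It assumes UTF-8 input decoded with Encode; the sub name
tokenize is hypothetical and not part of NSP, and whether count.pl
accepts such a pattern in its token definitions is not verified here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of NSP): split decoded text into tokens.
# The pattern accepts runs of word characters optionally joined by a
# middle dot (U+00B7), so "l·l" stays inside one token.
sub tokenize {
    my ($text) = @_;
    return $text =~ /(\w+(?:\x{B7}\w+)*)/g;
}

# UTF-8 bytes for "la intel·ligència artificial", written with escapes
# so this file itself stays pure ASCII.
my $bytes = "la intel\xC2\xB7lig\xC3\xA8ncia artificial";
my $text  = decode("UTF-8", $bytes);

my @plain  = $text =~ /(\w+)/g;   # \w alone: the middle dot splits the word
my @tokens = tokenize($text);     # the dot is kept inside the token

print scalar(@plain), " tokens with \\w+ alone\n";
print encode("UTF-8", join("|", @tokens)), "\n";
```

With \w+ alone the word comes out as "intel" and "ligència"; with the
extended pattern it is kept as one token.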
 
> Greetings all,
>
> Thanks for the very interesting discussion. This is quite helpful.
>
> Just a short note to confirm that we have not yet added the
>
> use locale;
>
> directive to NSP - we haven't had a release in some time, but this
> will surely be included when we do. I am thinking it might not be a
> bad idea to have a release simply to take care of this. Thanks to
> Patrick for pointing this out in the first place, and then reminding
> us of that earlier discussion.
>
> I would be very interested to know if this resolves the problems with
> Catalan, French, Spanish, btw. Please do update us and the rest of the
> list, as I suspect this is a fairly common problem.
>
> Cordially,
> Ted
>
> On Feb 13, 2008 11:07 AM, mercevg <[EMAIL PROTECTED]> wrote:
> >
> > Patrick,
> >
> > I have checked the latest version of NSP (v.1.03) and count.pl
> > doesn't contain "use locale;". I'll try adding "use locale;" at line
> > 83; maybe your suggestion is my solution.
> >
> > We have more or less the same problems with accents and other kinds
> > of characters when working with French, Catalan or Spanish.
> >
> > Thank you very much!
> >
> > Mercè
> >
> > > Mercè,
> > >
> > > I have not checked the latest version of NSP to see if count.pl and
> > > the other files contain "use locale;" as I suggested some time ago.
> > > The simple inclusion of such a statement at the beginning of the
> > > Perl scripts fixed the problems I had for French. You can have a
> > > look at this for more information:
> > >
> > > http://tech.groups.yahoo.com/group/ngram/message/159
> > >
> > > Hope this helps...
> > >
> > > Regards,
> > > Patrick
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse




Re: [ngram] Re: Problem with a token

2008-02-14 Thread Ted Pedersen
Thanks very much, I'm glad to hear that use locale has helped with
most of your problems. This entire episode has convinced me that
rather than waiting for the next release of NSP, I am going to go
ahead and do a release that simply includes use locale. I'll do that
this week, which I know doesn't help you but will hopefully help
others avoid a similar problem.

Regarding your issue with "la geminada", that might be a more general
Perl locale issue. You might want to check and make sure you are
running a fairly current version of Perl, and scout around some of the
Perl documentation to see if this has come up in a more general
context.

Cordially,
Ted




-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


[ngram] Re: Problem with a token

2008-02-14 Thread mercevg
Ted,

It's a good idea to prepare a new release including use locale.

I'll check my Perl version and also the documentation. Maybe I'll
find a solution for working with the Catalan corpus.

Thanks for all your suggestions, 
Mercè






Re: [ngram] Re: Problem with a token

2008-02-14 Thread Patrick Drouin
Mercè,

If you work in a Unix/Linux/MacOSX environment, make sure your
environment variables LC_ALL and LANG are set properly to something like
this:

LC_ALL=ca_ES
LANG=ca_ES.UTF-8

In Windows, it has to be similar but I don't know how to do it. I
believe it has to be set somewhere at the level of the My Computer icon.

Regards,
Patrick
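
To see what a Perl script actually gets from these variables, a small
check along the following lines may help. It is only a sketch: it uses
setlocale from the core POSIX module, the sub name
active_ctype_locale is hypothetical, and the locale names reported
depend on what is installed on the machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Ask the C library to adopt whatever LC_ALL / LC_CTYPE / LANG request.
# setlocale returns the locale name on success and undef when the
# requested locale is not installed on this machine.
sub active_ctype_locale {
    my $name = setlocale(LC_CTYPE, "");
    return defined $name ? $name : "(requested locale not available)";
}

for my $var (qw(LC_ALL LC_CTYPE LANG)) {
    my $value = defined $ENV{$var} ? $ENV{$var} : "(unset)";
    print "$var=$value\n";
}
print "LC_CTYPE in effect: ", active_ctype_locale(), "\n";
```

If the last line reports that the locale is not available, "use locale"
has nothing Catalan-specific to work with, however the variables are set.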


[ngram] plans for version 1.05

2008-02-14 Thread Ted Pedersen
Greetings all, 

I'm in the process of collecting up the various bug reports that we've
gotten since version 1.03 was released in September 2006, and I'll
resolve those in 1.05. Here's what I have so far...

1) Incorporate "use locale" throughout package (suggested by Patrick
Drouin long ago). This will make for more convenient handling of
non-English text.

2) fix "Testing/statistic/t2" missing message during install (reported
most recently by Mary Taffet, others previously)

3) fix Makefile.PL to allow for cleaner Windows install (reported by
Richard Churchill)

I will keep looking through the mailing list archives and my own
email, but those seem like the main issues that have arisen. However,
if you recall something else, or there is some feature or change you
are interested in seeing, please let me know. As you can tell, NSP
releases have slowed considerably in recent years, and this is likely
to be the only release for some time to come, so please do let me know
asap if there are other issues. Comments and suggestions are of course
welcome.

Cordially,
Ted




[ngram] Re: Problem with a token

2008-02-14 Thread solorioprofile
Hello,
I recently came across a related problem with perl and Spanish
characters. I tried "use locale" and it didn't help. After a lot of
researching on character encoding, and posting questions on different
perl forums, I found a solution that might help with the "l geminada".
As you all know perl uses its own internal representation of
characters, so what I needed to do was to decode the string, using
either Latin8 or utf8 (depending on the encoding of the text), process
the string with perl, and encode the string again before printing. 

Below is a small fragment of the perl program I used. I hope this helps.

Thamar S.

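#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Read every input line (files named on the command line, or STDIN).
my @lines = <>;

foreach my $line (@lines) {
    chomp $line;                           # the newline is re-added below
    my $cline = decode("utf8", $line);     # bytes -> Perl's internal characters
    # Do something with $cline
    my $oline = encode("utf8", $cline);    # characters -> bytes for output
    print $oline, "\n";
}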
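
The decode/encode pair above can also be moved onto the filehandles
themselves with PerlIO encoding layers, so the processing code only ever
sees decoded characters. The following is a minimal sketch of that idea,
assuming UTF-8 input; the sub name process_stream is hypothetical and
the upper-casing merely stands in for real processing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy lines from $in to $out, upper-casing them.
# Both handles are expected to carry an :encoding layer, so the loop
# body only ever sees decoded characters.
sub process_stream {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        print {$out} uc $line;    # stands in for the real processing
    }
}

# In-memory demonstration; for real use, put the same layers on STDIN
# and STDOUT with binmode(STDIN, ":encoding(UTF-8)") and so on.
my $input  = "intel\xC2\xB7lig\xC3\xA8ncia\n";    # UTF-8 bytes
my $output = "";
open(my $in,  "<:encoding(UTF-8)", \$input)  or die "open input: $!";
open(my $out, ">:encoding(UTF-8)", \$output) or die "open output: $!";
process_stream($in, $out);
close($in);
close($out);

print $output;    # UTF-8 bytes again, ready for a terminal or file
```

Because the line is decoded before uc runs, "è" is upper-cased as a
letter instead of being treated as two unrelated bytes.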







[ngram] Re: plans for version 1.05

2008-02-14 Thread mercevg
Ted,

I have two suggestions to improve the new version.

1. I have problems extracting bigrams using "Fisher's exact test - left
sided" and "Fisher's exact test - right sided". Could you fix these two
measures?

The error message:

Can't locate Text/NSP/Measures/2D/left.pm in @INC (@INC contains:
/usr/lib/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8
/usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi
/usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl
/usr/lib/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl .) at
/usr/bin/statistic.pl line 452.

We don't know how to resolve this problem, because Ngram was
installed correctly. Does anyone else have this problem?

2. It would be very interesting to extract trigrams using all twelve
statistical measures. Would that be possible?

Best wishes,
Mercè





Re: [ngram] Re: plans for version 1.05

2008-02-14 Thread Ted Pedersen
Hello again...

See comments inline...

On Thu, Feb 14, 2008 at 10:38 AM, mercevg <[EMAIL PROTECTED]> wrote:
>
>
> Ted,
>
>  I have two suggestions to improve the new version.
>
>  1. I have problems to extract bigrams using "Fishers exact test - left
>  sided" and "Fishers exact test - right sided". Could you fix this two
>  measures?
>
>  The error message:
>
>  Can't locate Text/NSP/Measures/2D/left.pm in @INC (@INC contains:
>  /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8
>  /usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi
>  /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl
>  /usr/lib/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi
>  /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl .) at
>  /usr/bin/statistic.pl line 452.
>
>  We don't know how to resolve this problem, because the Ngram has
>  installed correctly. Anyone else has this problem?

How are you invoking the measure? I think the problem is that you must
be specifying the location of the left Fisher measure incorrectly. It
normally resides at:

Text/NSP/Measures/2D/Fisher

So it appears that you are looking in 2D for left, rather than in
Fisher...I think that is the problem, as I have been using left and
right Fisher very heavily lately without any troubles of this sort...
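
One way to check which of the two locations Perl can actually see is to
search @INC directly. This is only a diagnostic sketch: the sub name
find_module is hypothetical, and the module name
Text::NSP::Measures::2D::Fisher::left is an assumption inferred from the
directory given above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the first file in @INC that would satisfy
# "require $module", or undef when no directory contains it.
sub find_module {
    my ($module) = @_;
    my @parts = split /::/, $module;
    $parts[-1] .= ".pm";
    for my $dir (grep { !ref } @INC) {          # skip code-ref hooks
        my $path = File::Spec->catfile($dir, @parts);
        return $path if -f $path;
    }
    return undef;
}

for my $module ("Text::NSP::Measures::2D::Fisher::left",
                "Text::NSP::Measures::2D::left") {
    my $path = find_module($module);
    print "$module: ", (defined $path ? $path : "not found in \@INC"), "\n";
}
```

If only the first name is found, the install is fine and the measure was
simply requested under the wrong name.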

>  2. It could be very interesting to extract trigrams using all twelve
>  statistical measures. It could be possible?

Agreed - we have added 3-d support for a number of the measures - including

ll - log likelihood
pmi - pointwise mutual information
tmi - true mutual information
ps - poisson stirling

It would be possible to have a 3-d Fisher's test, for example,
although the complexity of implementing that (and computing it) has
thus far been a barrier to adding it. For some of the other measures
it isn't clear how to extend them to 3-d (trigrams) - for example odds
ratio seems like it might be inherently 2-d (although I haven't looked
at this in some time so I am not at all certain).
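
As an illustration of what extending a measure to trigrams involves,
here is a minimal sketch of pointwise mutual information for three
words. It assumes the expected count comes from treating the three
positions as fully independent; the sub name trigram_pmi and its
argument order are hypothetical, and this is not NSP's code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch (not NSP's implementation).
# $n111: how often the trigram w1 w2 w3 was observed
# $n1, $n2, $n3: how often w1, w2, w3 occur in positions 1, 2, 3
# $total: total number of trigrams in the corpus
sub trigram_pmi {
    my ($n111, $n1, $n2, $n3, $total) = @_;
    # Expected count if the three words were independent of each other.
    my $expected = ($n1 * $n2 * $n3) / ($total ** 2);
    return log($n111 / $expected) / log(2);    # log base 2
}

# A trigram seen once among 8, each word occurring twice in its slot.
printf "pmi = %.4f\n", trigram_pmi(1, 2, 2, 2, 8);
```

A measure like this needs only one expected value; an exact test would
have to enumerate 2x2x2 tables, which is where the extra complexity
mentioned above comes from.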

If there are particular 3-d methods that would be of interest we are
always curious to hear about those. Also, we are more than open to
user contributed modules for measures - that would be great in fact,
so if someone has developed something they might like to contribute
please do let me know and we can see if that is possible (would
depend on having test cases available, documentation more or less
consistent with how we do things, etc.)

Thanks!
Ted




-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse


Re: [ngram] plans for version 1.05

2008-02-14 Thread Richard Jelinek
On Thu, Feb 14, 2008 at 03:51:40PM -, Ted Pedersen wrote:
> 1) Incorporate "use locale" throughout package (suggested by Patrick
> Drouin long ago)This will make for more convenient handling of
> non-English text.

Wrong idea, wrong solution.

To make handling of non-Latin1 text more convenient, make NSP UTF-8
safe. Enforce correct decodes on input streams and encodes on output
streams.

"use locale" is a hack. At best. Dead-end evolution. Browse the
Monastery (http://www.perlmonks.org/) for more info and have a look at

perldoc perluniintro

especially the section about Locales:

  ·   How Does Unicode Work With Traditional Locales?

   In Perl, not very well.  Avoid using locales through the
   "locale" pragma.  Use only one or the other.  But see
   perlrun for the description of the "-C" switch and its
   environment counterpart, $ENV{PERL_UNICODE} to see how to
   enable various Unicode features, for example by using
   locale settings.


-- 
Kind regards,

 Dipl.-Inf. Richard Jelinek

 - The PetaMem Group - Prague/Nuremberg - www.petamem.com -
 -= 2007-09-25: 49235653 Mind Units =-


[ngram] Re: plans for version 1.05

2008-02-14 Thread Ted Pedersen
Thanks for the thoughts on locale, UTF-8, etc. 

You seem to be saying there is a better option than "use locale",
which I'm more than willing to believe. However, what I can't estimate
at present is how difficult or time consuming it would be to modify
NSP in the way you describe. We'll certainly follow up on your hints
and try to arrive at such an estimate, or if anyone knows and would
care to share such details that would surely be appreciated.

The advantage of "use locale" is that it seems to solve at least some
problems, and it's a fairly simple modification to make. So as
imperfect as it might be, it seems better than what we have now.

Further comments and discussion on use locale versus other alternatives
are more than welcome, and would in fact be appreciated.

Cordially,
Ted





Re: [ngram] Re: plans for version 1.05

2008-02-14 Thread Richard Jelinek
On Thu, Feb 14, 2008 at 08:59:29PM -, Ted Pedersen wrote:
> You seem to be saying there is a better option than "use locale",

Yes - make use of the unicode capabilities of perl.

> which I'm more than willing to believe. However, what I can't estimate
> at present is how difficult or time consuming it would be to modify
> NSP in the way you describe. We'll certainly follow up on your hints

It is more time consuming than the "use locale" way. Of course. But
given NSP's codebase, it's a task doable in reasonable time.

> The advantage of "use locale" is that it seems to solve at least some
> problems, and it's a fairly simple modification to make. So as
> imperfect as it might be, it seems better than what we have now.

This advantage is illusory, unfortunately. Illusory in the sense that
the "some problems" it seems to solve rely on a well set up
environment on the OS side. Which isn't always the case. Moreover,
"use locale" will - in most cases - give you good results only for
languages that correlate with the locale environment on a given
machine.

That is: If a user on a "Czech host" with correctly set up Czech
locale tries to process Czech text, it will be ok. However, if the
same user on the same host tries to process Turkish text: *boom*.

> Further comments discussions on use locale versus other alternatives
> is more than welcome, and would in fact be appreciated.

I wonder why the original author had problems with a Catalan text
anyway. The only two viable encodings for Catalan I know of are
iso-8859-1 and windows-1252. iso-8859-1 should give him no problem,
because that's what NSP has been created and (mostly) tested with.

Probably he got hold of a win-1252 encoded text, which could cause the
problems he described.

The effort to get a perl application unicode-clean isn't that high; at
least it isn't higher than twiddling with locales. You just have to
catch all input streams (where data comes in) and all output streams
(obviously, where the application spills data) and decode (input) and
encode (output) the data respectively.

See http://search.cpan.org/~dankogai/Encode-2.23/Encode.pm

You must - and this is a mandatory requirement - always know what
encoding your input data are in. Without this, no reliable processing
can be guaranteed.
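
A strict decode is one way to enforce that requirement at the input
boundary: input that does not match the declared encoding is rejected
instead of being passed on as garbage. The following is a minimal
sketch using Encode's FB_CROAK check; the sub name decode_strict is
hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: decode $bytes from the declared $encoding.
# Returns the character string, or undef when the bytes are not valid
# in that encoding. LEAVE_SRC keeps decode from modifying $bytes.
sub decode_strict {
    my ($bytes, $encoding) = @_;
    my $chars = eval { decode($encoding, $bytes, FB_CROAK | LEAVE_SRC) };
    return $chars;
}

my $utf8_bytes   = "intel\xC2\xB7lig\xC3\xA8ncia";    # UTF-8
my $cp1252_bytes = "intel\xB7lig\xE8ncia";            # windows-1252

for my $case ([$utf8_bytes,   "UTF-8"],
              [$cp1252_bytes, "UTF-8"],     # wrong declaration
              [$cp1252_bytes, "cp1252"]) {
    my ($bytes, $declared) = @$case;
    my $chars = decode_strict($bytes, $declared);
    print "declared $declared: ",
          (defined $chars ? length($chars) . " characters" : "rejected"),
          "\n";
}
```

The middle case shows why the declaration matters: the same bytes decode
cleanly as windows-1252 but are rejected as UTF-8.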


-- 
Kind regards,

 Dipl.-Inf. Richard Jelinek

 - The PetaMem Group - Prague/Nuremberg - www.petamem.com -
 -= 2007-09-25: 49235653 Mind Units =-


Re: [ngram] Re: plans for version 1.05

2008-02-14 Thread Björn Wilmsmann


Richard Jelinek wrote:

> This advantage is illusory, unfortunately. Illusory in the sense that
> the "some problems" it seems to solve rely on a well set up
> environment on the OS side. Which isn't always the case. Moreover,

Well, an improperly set up system locale is bound to give you all  
kinds of problems once you deal with language-specific issues anyway.  
Even Java at some points makes use of the system locale.

> That is: If a user on a "Czech host" with correctly set up Czech
> locale tries to process Czech text, it will be ok. However, if the
> same user on the same host tries to process Turkish text: *boom*

Do you mean that if, for example, I use \w in a regular expression it  
will work properly for Czech texts but fail on Turkish ones on that  
particular machine? AFAIK this is the intended behaviour of use  
locale, isn't it? This certainly doesn't solve the problem at hand but  
it doesn't make use locale a flawed solution either (just maybe not  
the right solution in this case).


--
Best regards,
Bjoern Wilmsmann




