[ngram] Re: Problem with a token

2008-02-14 Thread mercevg
Patrick, Ted, I added "use locale;" at line 83, but this did not improve my results: words containing the character "l·l" (like "intel·ligència") are not included in the results list. But it is important to say that I added as tokens all the accents, diaereses and apostrophes that are used in Catalan cor
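
The symptom described above, words with "l·l" dropping out of the results, is what one would expect if the token pattern treats the middle dot as a separator. A minimal sketch of that effect follows. It is written in Python rather than Perl and does not use NSP itself; the sample text and the `token_pattern` name are made up for illustration, and it assumes the "l·l" is written with the middle dot U+00B7.

```python
import re

# Sample text with two Catalan words containing the "l geminada".
# \u00b7 is the middle dot, \u00e8 is e with grave accent.
text = "La intel\u00b7lig\u00e8ncia col\u00b7lectiva"

# \w matches letters (including accented ones) but not the middle dot,
# so a plain word pattern splits each "l·l" word into two pieces.
naive = re.findall(r"\w+", text)

# Hypothetical token pattern: runs of word characters that may be
# joined by a single middle dot.
token_pattern = re.compile(r"\w+(?:\u00b7\w+)*")
tokens = token_pattern.findall(text)

print(naive)
print(tokens)
```

With the second pattern the two words survive as single tokens. Whether the same approach carries over to an NSP token file depends on how the input is decoded, which is the subject of the rest of the thread.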

Re: [ngram] Re: Problem with a token

2008-02-14 Thread Ted Pedersen
Thanks very much, I'm glad to hear that use locale has helped with most of your problems. This entire episode has convinced me that rather than waiting for the next release of NSP, I am going to go ahead and do a release that simply includes use locale. I'll do that this week, which I know doesn't

[ngram] Re: Problem with a token

2008-02-14 Thread mercevg
Ted, It's a good idea to prepare a new release including use locale. I'll check my Perl version and also the documentation. Maybe I'll find a solution for working with a Catalan corpus. Thanks for all your suggestions, Mercè > Thanks very much, I'm glad to hear that use locale has helped with > mo

Re: [ngram] Re: Problem with a token

2008-02-14 Thread Patrick Drouin
Mercè, If you work in a Unix/Linux/MacOSX environment, make sure your environment variables LC_ALL and LANG are set properly to something like this: LC_ALL=ca_ES LANG=ca_ES.UTF-8 In Windows, it has to be similar but I don't know how to do it. I believe it has to be set somewhere at the

[ngram] plans for version 1.05

2008-02-14 Thread Ted Pedersen
Greetings all, I'm in the process of collecting up the various bug reports that we've gotten since version 1.03 was released in September 2006, and I'll resolve those in 1.05. Here's what I have so far... 1) Incorporate "use locale" throughout package (suggested by Patrick Drouin long ago) This w

[ngram] Re: Problem with a token

2008-02-14 Thread solorioprofile
Hello, I recently came across a related problem with Perl and Spanish characters. I tried "use locale" and it didn't help. After a lot of research on character encodings, and posting questions on different Perl forums, I found a solution that might help with the "l geminada". As you all know Perl

[ngram] Re: plans for version 1.05

2008-02-14 Thread mercevg
Ted, I have two suggestions to improve the new version. 1. I have problems extracting bigrams using "Fishers exact test - left sided" and "Fishers exact test - right sided". Could you fix these two measures? The error message: Can't locate Text/NSP/Measures/2D/left.pm in @INC (@INC contains: /us

Re: [ngram] Re: plans for version 1.05

2008-02-14 Thread Ted Pedersen
Hello again... See comments inline... On Thu, Feb 14, 2008 at 10:38 AM, mercevg <[EMAIL PROTECTED]> wrote: > > > Ted, > > I have two suggestions to improve the new version. > > 1. I have problems to extract bigrams using "Fishers exact test - left > sided" and "Fishers exact test - right sided

Re: [ngram] plans for version 1.05

2008-02-14 Thread Richard Jelinek
On Thu, Feb 14, 2008 at 03:51:40PM -, Ted Pedersen wrote: > 1) Incorporate "use locale" throughout package (suggested by Patrick > Drouin long ago)This will make for more convenient handling of > non-English text. Wrong idea, wrong solution. To make handling of non-Latin1 text more convenient

[ngram] Re: plans for version 1.05

2008-02-14 Thread Ted Pedersen
Thanks for the thoughts on locale, UTF-8, etc. You seem to be saying there is a better option than "use locale", which I'm more than willing to believe. However, what I can't estimate at present is how difficult or time consuming it would be to modify NSP in the way you describe. We'll certainly

Re: [ngram] Re: plans for version 1.05

2008-02-14 Thread Richard Jelinek
On Thu, Feb 14, 2008 at 08:59:29PM -, Ted Pedersen wrote: > You seem to be saying there is a better option than "use locale", Yes - make use of the unicode capabilities of perl. > which I'm more than willing to believe. However, what I can't estimate > at present is how difficult or time cons
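
The point being argued here, decoding text explicitly instead of relying on the environment's locale, can be illustrated with a short sketch. It is in Python, not Perl, and is not NSP code; it only shows why the decoding step matters, using the Catalan word from earlier in the thread. The variable names are made up for the example.

```python
# "intel·ligència", with the middle dot (U+00B7) and e-grave (U+00E8)
# written as escapes.
word = "intel\u00b7lig\u00e8ncia"

# On disk, UTF-8 stores the middle dot and the accented letter as
# two bytes each.
raw = word.encode("utf-8")

# Decoding explicitly as UTF-8 recovers one character per letter,
# regardless of how LC_ALL or LANG happen to be set.
decoded = raw.decode("utf-8")

# Reading the same bytes as a single-byte encoding keeps them as
# separate characters, which is what breaks character-level matching.
misread = raw.decode("latin-1")

print(len(raw), len(decoded), len(misread))
print(decoded == word)
```

The byte string is two units longer than the decoded word, and the single-byte reading has the same length as the byte string, so any pattern that expects one character for "·" or "è" fails on it.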

Re: [ngram] Re: plans for version 1.05

2008-02-14 Thread Björn Wilmsmann
Richard Jelinek wrote: This advantage is illusional - unfortunately. Illusional in the sense that the "some problems" it seems to solve rely on a well set up environment on the OS side. Which isn't always the case. Moreover, Well, an improperly set up system locale is bound to give you all kinds