[ngram] Re: Problem with a token
Patrick, Ted,

I added "use locale;" at line 83, but it did not improve my results: words containing "l·l" (such as "intel·ligència") are still not included in the results list.

It is worth saying, though, that I added as tokens all the accents, diaereses and apostrophes used in the Catalan corpus, and I got good results. I think that is the solution for this kind of character, except for "l·l" ("l geminada").

Best regards,
Mercè

> Greetings all,
>
> Thanks for the very interesting discussion. This is quite helpful.
>
> Just a short note to confirm that we have not yet added the
>
>     use locale;
>
> directive to NSP - we haven't had a release in some time, but this will surely be included when we do. I am thinking it might not be a bad idea to have a release simply to take care of this. Thanks to Patrick for pointing this out in the first place, and then reminding us of that earlier discussion.
>
> I would be very interested to know if this resolves the problems with Catalan, French, Spanish, btw. Please do update us and the rest of the list, as I suspect this is a fairly common problem.
>
> Cordially,
> Ted
>
> On Feb 13, 2008 11:07 AM, mercevg wrote:
>
> > Patrick,
> >
> > I have checked the latest version of NSP (v.1.03) and count.pl doesn't contain "use locale;". I'll try adding "use locale;" at line 83; maybe your suggestion is my solution.
> >
> > We have more or less the same problems with accents and other kinds of characters when working with French, Catalan or Spanish.
> >
> > Thank you very much!
> >
> > Mercè
> >
> > > Mercè,
> > >
> > > I have not checked the latest version of NSP to see if count.pl and the other files contain "use locale;" as I suggested some time ago. The simple inclusion of such a statement at the beginning of the Perl scripts fixed the problems I had for French. You can have a look at this for more information:
> > >
> > > http://tech.groups.yahoo.com/group/ngram/message/159
> > >
> > > Hope this helps...
> > >
> > > Regards,
> > > Patrick
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
Re: [ngram] Re: Problem with a token
Thanks very much, I'm glad to hear that use locale has helped with most of your problems. This entire episode has convinced me that rather than waiting for the next release of NSP, I am going to go ahead and do a release that simply includes use locale. I'll do that this week, which I know doesn't help you but will hopefully help others avoid a similar problem.

Regarding your issue with "la geminada", that might be a more general Perl locale issue. You might want to check and make sure you are running a fairly current version of Perl, and scout around some of the Perl documentation to see if this has come up in a more general context.

Cordially,
Ted

On Thu, Feb 14, 2008 at 3:26 AM, mercevg wrote:

> Patrick, Ted,
>
> I added "use locale;" in line 83 but this can't improve my results: words containing the character "l·l" (like "intel·ligència") are not included in the results list.
>
> But it is important to say that I add as a tokens all accents, diaeresis and apostrophes that are used in Catalan corpus and I have had a good results. I think it's the solution for this kind of characters, except for the "l·l" ("l geminada").
>
> Best regards,
> Mercè

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
[ngram] Re: Problem with a token
Ted,

It's a good idea to prepare a new release that includes use locale.

I'll check my Perl version and also the documentation. Maybe I'll find a way to work with the Catalan corpus.

Thanks for all your suggestions,
Mercè
Re: [ngram] Re: Problem with a token
Mercè,

If you work in a Unix/Linux/MacOSX environment, make sure your environment variables LC_ALL and LANG are set properly, to something like this:

    LC_ALL=ca_ES
    LANG=ca_ES.UTF-8

On Windows it has to be something similar, but I don't know how to do it. I believe it has to be set somewhere at the level of the My Computer icon.

Regards,
Patrick
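To see which locale Perl actually ends up with after setting those variables, something like the following could be run. It is only a minimal sketch: it relies on setlocale() from the core POSIX module, and the helper name effective_ctype is made up for illustration and is not part of NSP.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper (not part of NSP): adopt the locale named by the
# environment (LC_ALL, LC_CTYPE, LANG) and return the name the C library
# settled on for character classification.
sub effective_ctype {
    # An empty string asks setlocale() to consult the environment.
    my $name = setlocale(LC_CTYPE, "");
    # setlocale() returns undef if the requested locale is not installed.
    return defined $name ? $name : "(unavailable)";
}

print "LC_ALL = ", (defined $ENV{LC_ALL} ? $ENV{LC_ALL} : "(unset)"), "\n";
print "LANG   = ", (defined $ENV{LANG}   ? $ENV{LANG}   : "(unset)"), "\n";
print "LC_CTYPE in effect: ", effective_ctype(), "\n";
```

If the last line reports "C", "POSIX" or "(unavailable)" instead of the Catalan locale, the locale is probably not installed on the machine, and "use locale" would then have little to work with.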
[ngram] plans for version 1.05
Greetings all,

I'm in the process of collecting up the various bug reports that we've gotten since version 1.03 was released in September 2006, and I'll resolve those in 1.05. Here's what I have so far...

1) Incorporate "use locale" throughout package (suggested by Patrick Drouin long ago). This will make for more convenient handling of non-English text.

2) Fix "Testing/statistic/t2" missing message during install (reported most recently by Mary Taffet, others previously).

3) Fix Makefile.PL to allow for cleaner Windows install (reported by Richard Churchill).

I will keep looking through the mailing list archives and my own email, but those seem like the main issues that have arisen. However, if you recall something else, or there is some feature or change you are interested in seeing, please let me know. As you can tell, NSP releases have slowed considerably in recent years, so this is likely to be the only release for some time to come, so please do let me know asap if there are other issues. Comments and suggestions are of course welcome.

Cordially,
Ted
[ngram] Re: Problem with a token
Hello,

I recently came across a related problem with Perl and Spanish characters. I tried "use locale" and it didn't help. After a lot of research on character encoding, and posting questions on different Perl forums, I found a solution that might help with the "l geminada".

As you all know, Perl uses its own internal representation of characters, so what I needed to do was to decode the string, using either Latin-1 or utf8 (depending on the encoding of the text), process the string with Perl, and encode the string again before printing.

Below is a small fragment of the Perl program I used. I hope this helps.

Thamar S.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;

    my @lines = (<>);
    foreach my $line (@lines)    # loop through the list
    {
        # bytes -> Perl characters
        my $cline = decode("utf8", $line);
        # Do something with $cline
        # Perl characters -> bytes
        my $oline = encode("utf8", $cline);
        # $line still ends with its newline, so none is added here
        print $oline;
    }
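For the "l·l" case specifically, the middle dot (U+00B7) is punctuation, not a word character, so a plain \w+ stops at it whatever the locale or encoding. A minimal sketch of one way round that, on decoded text, is shown below. The helper name catalan_tokens and the regular expression are illustrative assumptions, not NSP's own tokenization, and whether the same pattern carries over to an NSP token definition has not been tested here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of NSP): split a decoded character string
# into word tokens, letting a middle dot (U+00B7) join two word parts,
# as in Catalan "l·l".
sub catalan_tokens {
    my ($text) = @_;
    return $text =~ /(\w+(?:\x{B7}\w+)*)/g;
}

# UTF-8 bytes for "la intel·ligència artificial", written with escapes so
# the example does not depend on the encoding of this source file.
my $bytes = "la intel\xC2\xB7lig\xC3\xA8ncia artificial";

my $chars  = decode("UTF-8", $bytes);    # bytes -> characters
my @tokens = catalan_tokens($chars);

# Encode each token back to UTF-8 bytes before printing, one per line.
print encode("UTF-8", $_), "\n" for @tokens;
```

The decoding step matters: on the raw bytes the accented letter and the dot are each two bytes, and the pattern would not see them as single characters.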
[ngram] Re: plans for version 1.05
Ted,

I have two suggestions to improve the new version.

1. I have problems extracting bigrams using "Fishers exact test - left sided" and "Fishers exact test - right sided". Could you fix these two measures?

The error message:

    Can't locate Text/NSP/Measures/2D/left.pm in @INC (@INC contains:
    /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi
    /usr/lib/perl5/5.8.8
    /usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.8
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.8
    /usr/lib/perl5/vendor_perl .) at /usr/bin/statistic.pl line 452.

We don't know how to resolve this problem, because Ngram installed correctly. Does anyone else have this problem?

2. It would be very interesting to extract trigrams using all twelve statistical measures. Would that be possible?

Best wishes,
Mercè
Re: [ngram] Re: plans for version 1.05
Hello again...

See comments inline...

On Thu, Feb 14, 2008 at 10:38 AM, mercevg wrote:

> Ted,
>
> I have two suggestions to improve the new version.
>
> 1. I have problems to extract bigrams using "Fishers exact test - left sided" and "Fishers exact test - right sided". Could you fix this two measures?
>
> The error message:
>
> Can't locate Text/NSP/Measures/2D/left.pm in @INC (@INC contains: /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8 /usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl .) at /usr/bin/statistic.pl line 452.
>
> We don't know how to resolve this problem, because the Ngram has installed correctly. Anyone else has this problem?

How are you invoking the measure? I think the problem is that you are specifying the location of the left Fisher measure incorrectly. It normally resides at:

    Text/NSP/Measures/2D/Fisher

So it appears that you are looking in 2D for left, rather than in Fisher. I think that is the problem, as I have been using left and right Fisher very heavily lately without any troubles of this sort.

> 2. It could be very interesting to extract trigrams using all twelve statistical measures. It could be possible?

Agreed - we have added 3-d support for a number of the measures, including:

    ll  - log likelihood
    pmi - pointwise mutual information
    tmi - true mutual information
    ps  - poisson stirling

It would be possible to have a 3-d Fisher's test, for example, although the complexity of implementing that (and computing it) has thus far been a barrier to adding it. For some of the other measures it isn't clear how to extend them to 3-d (trigrams). For example, odds ratio seems like it might be inherently 2-d (although I haven't looked at this in some time, so I am not at all certain).

If there are particular 3-d methods that would be of interest, we are always curious to hear about those. Also, we are more than open to user-contributed modules for measures. That would be great in fact, so if someone has developed something they might like to contribute, please do let me know and we can see if that is possible (it would depend on having test cases available, documentation more or less consistent with how we do things, etc.)

Thanks!
Ted

> Best wishes,
> Mercè

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
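The path in the error message lacks the Fisher directory, which fits the explanation above. A quick way to check where (or whether) a module file sits under @INC is sketched below. The helpers module_path and find_module are hypothetical names for illustration, and the module name Text::NSP::Measures::2D::Fisher::left is an assumption derived from the directory mentioned above, so adjust it if your installation differs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a module name into the relative path that
# require looks for under each directory in @INC.
sub module_path {
    my ($module) = @_;
    (my $path = $module) =~ s{::}{/}g;
    return "$path.pm";
}

# Hypothetical helper: return the first @INC directory that holds the
# module file, or nothing if no directory does.
sub find_module {
    my ($module) = @_;
    my $rel = module_path($module);
    for my $dir (@INC) {
        next if ref $dir;    # skip code-reference hooks in @INC
        return $dir if -e "$dir/$rel";
    }
    return;
}

# Assumed module name, based on the Text/NSP/Measures/2D/Fisher directory.
my $fisher_left = "Text::NSP::Measures::2D::Fisher::left";
my $where       = find_module($fisher_left);

print module_path($fisher_left), ": ",
    (defined $where ? "found in $where" : "not found in \@INC"), "\n";
```

If the file is found under the Fisher directory but statistic.pl still asks for 2D/left.pm, the measure name given on the command line is the likely culprit rather than the installation.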
Re: [ngram] plans for version 1.05
On Thu, Feb 14, 2008 at 03:51:40PM -, Ted Pedersen wrote:

> 1) Incorporate "use locale" throughout package (suggested by Patrick Drouin long ago)This will make for more convenient handling of non-English text.

Wrong idea, wrong solution.

To make handling of non-Latin1 text more convenient, make NSP UTF-8 safe. Enforce correct decodes on input streams and encodes on output streams.

"use locale" is a hack. At best. Dead-end evolution. Browse the Monastery (http://www.perlmonks.org/) for more info and have a look at

    perldoc perluniintro

especially the section about Locales:

    · How Does Unicode Work With Traditional Locales?

      In Perl, not very well. Avoid using locales through the "locale" pragma. Use only one or the other. But see perlrun for the description of the "-C" switch and its environment counterpart, $ENV{PERL_UNICODE} to see how to enable various Unicode features, for example by using locale settings.

--
Kind regards,

Dipl.-Inf. Richard Jelinek

- The PetaMem Group - Prague/Nuremberg - www.petamem.com -
-= 2007-09-25: 49235653 Mind Units =-
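As a rough illustration of decoding on input streams and encoding on output streams, the sketch below attaches an :encoding(UTF-8) layer to both handles so that the code in between works on characters. The helper name copy_uppercased is made up for this example, the upper-casing merely stands in for real processing, and in-memory handles replace STDIN and STDOUT so the example is self-contained. Nothing here is taken from NSP's code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of NSP): read UTF-8 text from $in,
# upper-case it as characters, and write UTF-8 to $out. The I/O layers
# do the decoding and encoding, so the loop never sees raw bytes.
sub copy_uppercased {
    my ($in, $out) = @_;
    binmode($in,  ":encoding(UTF-8)");
    binmode($out, ":encoding(UTF-8)");
    while (my $line = <$in>) {
        print {$out} uc $line;
    }
}

# Demonstration on in-memory handles; with real data one would pass
# \*STDIN and \*STDOUT (or opened files) instead.
my $input  = "intel\xC2\xB7lig\xC3\xA8ncia\n";    # UTF-8 bytes
my $output = "";
open(my $in,  "<", \$input)  or die "cannot open input: $!";
open(my $out, ">", \$output) or die "cannot open output: $!";
copy_uppercased($in, $out);
close($out);

print $output;    # already UTF-8 bytes at this point
```

For the standard streams alone, the "-C" switch mentioned in the perluniintro excerpt (for example "perl -CS script.pl") is a lighter-weight alternative; it assumes the data really is UTF-8.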
[ngram] Re: plans for version 1.05
Thanks for the thoughts on locale, UTF-8, etc.

You seem to be saying there is a better option than "use locale", which I'm more than willing to believe. However, what I can't estimate at present is how difficult or time consuming it would be to modify NSP in the way you describe. We'll certainly follow up on your hints and try to arrive at such an estimate, or if anyone knows and would care to share such details, that would surely be appreciated.

The advantage of "use locale" is that it seems to solve at least some problems, and it's a fairly simple modification to make. So as imperfect as it might be, it seems better than what we have now.

Further comments and discussion on use locale versus other alternatives are more than welcome, and would in fact be appreciated.

Cordially,
Ted
Re: [ngram] Re: plans for version 1.05
On Thu, Feb 14, 2008 at 08:59:29PM -, Ted Pedersen wrote:

> You seem to be saying there is a better option than "use locale",

Yes - make use of the Unicode capabilities of Perl.

> which I'm more than willing to believe. However, what I can't estimate at present is how difficult or time consuming it would be to modify NSP in the way you describe. We'll certainly follow up on your hints

It is more time-consuming than the "use locale" way. Of course. But given NSP's codebase, it's a task that can be done in reasonable time.

> The advantage of "use locale" is that it seems to solve at least some problems, and it's a fairly simple modification to make. So as imperfect as it might be, it seems better than what we have now.

This advantage is illusory, unfortunately. Illusory in the sense that the "some problems" it seems to solve rely on a well set up environment on the OS side. Which isn't always the case. Moreover, "use locale" will, in most cases, give you good results for languages that correlate with the locale environment on a given machine.

That is: if a user on a "Czech host" with a correctly set up Czech locale tries to process Czech text, it will be ok. However, if the same user on the same host tries to process Turkish text: *boom*.

> Further comments discussions on use locale versus other alternatives is more than welcome, and would in fact be appreciated.

I wonder why the original author had problems with a Catalan text anyway. The only two viable encodings for Catalan I know of are iso-8859-1 and windows-1252. iso-8859-1 should give him no problem, because that's what NSP has been created and (mostly) tested with. Probably he caught a win-1252 encoded text, which could cause the problems he described.

The effort to get a Perl application Unicode-clean isn't that high; at least it isn't higher than twiddling with locales. You just have to catch all input streams (where data comes in) and all output streams (obviously, where the application spills data) and decode (input) and encode (output) the data respectively. See

http://search.cpan.org/~dankogai/Encode-2.23/Encode.pm

You must - and this is a mandatory requirement - always know what encoding your input data are in. Without this, no reliable processing can be guaranteed.

--
Kind regards,

Dipl.-Inf. Richard Jelinek

- The PetaMem Group - Prague/Nuremberg - www.petamem.com -
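The point about knowing the input encoding can be made concrete with a small sketch. It uses decode() from the Encode module with the FB_CROAK check, so that a wrong guess fails loudly instead of producing silently damaged text. The helper name decode_strict and the sample word are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: decode $bytes from the named encoding and die on
# malformed input rather than substituting replacement characters.
# $bytes is a private copy, so Encode may safely consume it.
sub decode_strict {
    my ($encoding, $bytes) = @_;
    return decode($encoding, $bytes, FB_CROAK);
}

my $latin1_bytes = "lig\xE8ncia";        # "ligència" in ISO-8859-1
my $utf8_bytes   = "lig\xC3\xA8ncia";    # the same word in UTF-8

# Decoded with the right encoding, both inputs yield the same characters.
my $from_latin1 = decode_strict("iso-8859-1", $latin1_bytes);
my $from_utf8   = decode_strict("UTF-8",      $utf8_bytes);
print "same characters after decoding\n" if $from_latin1 eq $from_utf8;

# A wrong guess: these Latin-1 bytes are not valid UTF-8, so the strict
# decode dies and eval returns false.
my $ok = eval { decode_strict("UTF-8", $latin1_bytes); 1 };
print "decoding the Latin-1 bytes as UTF-8 failed\n" unless $ok;
```

The opposite mistake is harder to catch: any byte sequence is valid ISO-8859-1, so decoding UTF-8 input as Latin-1 raises no error and simply yields the wrong characters. That is why the encoding has to be known up front and cannot be discovered reliably by trial.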
Re: [ngram] Re: plans for version 1.05
Richard Jelinek wrote:

> This advantage is illusory, unfortunately. Illusory in the sense that the "some problems" it seems to solve rely on a well set up environment on the OS side. Which isn't always the case. Moreover,

Well, an improperly set up system locale is bound to give you all kinds of problems once you deal with language-specific issues anyway. Even Java at some points makes use of the system locale.

> That is: if a user on a "Czech host" with a correctly set up Czech locale tries to process Czech text, it will be ok. However, if the same user on the same host tries to process Turkish text: *boom*

Do you mean that if, for example, I use \w in a regular expression, it will work properly for Czech texts but fail on Turkish ones on that particular machine? AFAIK this is the intended behaviour of use locale, isn't it? This certainly doesn't solve the problem at hand, but it doesn't make use locale a flawed solution either (just maybe not the right solution in this case).

--
Best regards,
Bjoern Wilmsmann
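For comparison with the locale-dependent behaviour discussed here, the sketch below counts \w+ runs in a Czech and a Turkish word, first on the raw UTF-8 bytes and then on the decoded characters. No locale is involved at any point. The helper name count_words and the sample words are chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: count the runs of word characters in a string.
sub count_words {
    my ($s) = @_;
    my $n = () = $s =~ /\w+/g;
    return $n;
}

my $czech_bytes   = "\xC5\x99e\xC5\x99icha";    # UTF-8 bytes for "řeřicha"
my $turkish_bytes = "\xC5\x9Fi\xC5\x9Fe";       # UTF-8 bytes for "şişe"

for my $bytes ($czech_bytes, $turkish_bytes) {
    my $chars = decode("UTF-8", $bytes);        # bytes -> characters
    printf "as bytes: %d run(s) of \\w; decoded: %d run(s)\n",
        count_words($bytes), count_words($chars);
}
```

Once decoded, each word is a single run of \w for both languages on the same machine, because the match follows Unicode character properties and not the host's locale. On the undecoded bytes the non-ASCII letters split each word into pieces.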