[ngram] Re: Problem with a token

2008-02-13 Thread mercevg
Bjoern,

Yes, I think so! 

I work with UTF-8 (corpus, stop list, etc.). I thought that the
problem with the character l·l was similar to the one with accents: I
added all the kinds of accents used in Catalan and Spanish as tokens
and that problem was solved, but not in this case. For this reason, I
tried to add this character to my tokens file and to my stopwords
list, but it doesn't work.

Mercè



 Hi there,
 
 mercevg wrote:
  I have some problems filtering n-grams in a corpus that contains words
  with this character: l·l. This character is frequently used in
  Catalan documents. In my results list I can't retrieve n-grams with
  words that contain this character.
 
  In my tokens file I have inserted the line /[a-zA-Z·]+/ (with ·),
  but the results are not satisfactory.
 
  I have also tried to insert the line /l·l/ in my stop list, but it
  doesn't work at all, because in my results list I have bi-grams like
  intelligència. In this case, one word is divided into two words.
 
  Do you know what the problem is?
 
 
 This sounds like a character set / file encoding issue. All files  
 involved (corpus, filters etc.) should have the same encoding. I am  
 not sure about the specific ISO encoding for Catalan. However, I  
 suppose Catalan is covered by iso-8859-1. utf-8 should work anyway,  
 though.
 --
 Best regards,
 Bjoern Wilmsmann





[ngram] Re: Problem with a token

2008-02-13 Thread mercevg
Patrick,

I have checked the latest version of NSP (v.1.03) and count.pl doesn't
contain use locale;. I'll try adding use locale; at line 83; maybe
your suggestion is my solution.

We have more or less the same problems with accents and other kinds of
characters when working with French and Catalan or Spanish.

Thank you very much!

Mercè


 Mercè,
 
 I have not checked the latest version of NSP to see if count.pl and the 
 other files contain use locale; as I suggested some time ago. The 
 simple inclusion of such a statement at the beginning of the Perl 
 scripts fixed the problems I had for French. You can have a look at
 this for more information:
 
 http://tech.groups.yahoo.com/group/ngram/message/159
 
 Hope this helps...
 
 Regards,
 Patrick