Bjoern,

Yes, I think so!
I work with UTF-8 throughout (corpus, stop list, etc.). I thought the problem with "l·l" was similar to the one I had with accents: once I added every kind of accented character used in Catalan and Spanish as a token, that problem was solved. The same approach does not work in this case. I have tried adding this character to my tokens file and to my stop list, but neither works.

Mercè

> Hi there,
>
> mercevg wrote:
> > I have some problems filtering n-grams in a corpus that contains words
> > with this character: "l·l". This character is frequently used in
> > Catalan documents. In my results list I can't retrieve n-grams with
> > words that contain this character.
> >
> > In my tokens file I have inserted the line "/[a-zA-Z·]+/" (with "·"),
> > but the results are not satisfactory.
> >
> > I have also tried to insert the line "/l·l/" in my stop list, but it
> > doesn't work at all, because in my results list I have bi-grams like
> > "intel<>ligència". In this case, one word is divided into two words.
> >
> > Do you know what the problem is?
>
> This sounds like a character set / file encoding issue. All files
> involved (corpus, filters etc.) should have the same encoding. I am
> not sure about the specific ISO encoding for Catalan. However, I
> suppose Catalan is covered by iso-8859-1. utf-8 should work anyway,
> though.
>
> --
> Best regards,
> Bjoern Wilmsmann
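
To make the encoding point in the quoted reply concrete, here is a minimal Python sketch. It is not the n-gram tool's own code and not a diagnosis of the problem above. The variable names, the letter pattern, and the mismatch scenario (a token pattern saved as ISO-8859-1 and applied to raw UTF-8 corpus bytes) are assumptions made only for illustration. It shows that the middle dot (U+00B7) is one byte in ISO-8859-1 but two bytes in UTF-8, and that a word stays in one piece only when pattern and text are handled in the same encoding.

```python
import re

MIDDLE_DOT = "\u00b7"                            # the dot in "l·l"
word = "intel" + MIDDLE_DOT + "lig\u00e8ncia"    # "intel·ligència"

# The same word needs a different number of bytes in each encoding.
utf8_bytes = word.encode("utf-8")
latin1_bytes = word.encode("iso-8859-1")

# Consistent handling: pattern and text are both decoded strings.
# [^\W\d_] matches any Unicode letter; the middle dot is added explicitly.
char_pattern = re.compile("(?:[^\\W\\d_]|" + MIDDLE_DOT + ")+")
char_tokens = char_pattern.findall(word)

# Hypothetical mismatch: the pattern "/[a-zA-Z·]+/" stored as ISO-8859-1
# bytes, applied to the raw UTF-8 bytes of the corpus.
byte_pattern = re.compile(("[a-zA-Z" + MIDDLE_DOT + "]+").encode("iso-8859-1"))
byte_tokens = byte_pattern.findall(utf8_bytes)

print(len(utf8_bytes), len(latin1_bytes))
print(char_tokens)
print(byte_tokens)
```

In the consistent case the word comes back as a single token. In the mismatched case it is cut at the middle dot and again at the accented letter, because the multi-byte UTF-8 sequences do not line up with the single-byte pattern.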