Bjoern,
Yes, I think so!
Everything I work with is UTF-8 (corpus, stop list, etc.). I thought
the problem with l·l was similar to the one I had with accents: I
added all the accented characters used in Catalan and Spanish to my
token definitions and that problem was solved, but the same approach
does not work here. That is why I tried adding this character to my
tokens file and to my stop list, but neither works.
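
One way to check which character the corpus really contains is to list
its non-ASCII code points. The following is only a rough Python sketch,
not part of the n-gram tool; the function name non_ascii_inventory and
the sample string are made up for illustration. It matters because a
Catalan l·l can be stored either as U+00B7 (MIDDLE DOT) between two
plain l's or as the single character U+0140, and a pattern written
with one will not match the other.

```python
import unicodedata
from collections import Counter


def non_ascii_inventory(text):
    """Count each non-ASCII character, keyed by code point and Unicode name.

    Hypothetical helper for illustration only.
    """
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}": n
        for ch, n in counts.items()
    }


# Made-up sample: the first word uses U+00B7, the second uses U+0140.
sample = "intel\u00b7lig\u00e8ncia, co\u0140legi"
for label, n in sorted(non_ascii_inventory(sample).items()):
    print(n, label)
```

For a real file, reading it with open(path, encoding="utf-8") first
would also raise UnicodeDecodeError if the file is not valid UTF-8.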
Mercè
Hi there,
mercevg wrote:
I am having trouble filtering n-grams in a corpus that contains words
with the sequence l·l, which is frequently used in Catalan documents.
My results list never contains n-grams with words that contain this
character.
In my tokens file I have inserted the line /[a-zA-Z·]+/ (with ·),
but the results are not satisfactory.
I have also tried inserting the line /l·l/ in my stop list, but that
doesn't work at all: my results list still has bigrams like
intelligència, where one word has been split into two.
Do you know what the problem is?
This sounds like a character set / file encoding issue. All files
involved (corpus, filters, etc.) should have the same encoding. I am
not sure which ISO encoding is specific to Catalan, but I suppose
Catalan is covered by ISO-8859-1. UTF-8 should work in any case.
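
To illustrate why mixed encodings would split a word, here is a small
Python sketch. It is not the n-gram tool's own code; it simply assumes
that the token pattern is matched byte by byte, and the variable names
are made up. In ISO-8859-1 the middle dot is the single byte 0xB7,
while in UTF-8 it is the two bytes 0xC2 0xB7, so a pattern saved in one
encoding does not cover the bytes of a corpus saved in the other.

```python
import re

word = "intel\u00b7lig\u00e8ncia"  # intel·ligència

# Same encoding everywhere: pattern and text are both decoded strings.
# The class lists the middle dot (U+00B7) and e-grave (U+00E8) explicitly.
consistent = re.compile("[a-zA-Z\u00b7\u00e8]+")
tokens_ok = consistent.findall(word)
print(tokens_ok)  # ['intel·ligència']

# Mixed encodings: corpus stored as UTF-8, pattern stored as ISO-8859-1.
corpus_bytes = word.encode("utf-8")
pattern_bytes = "[a-zA-Z\u00b7]+".encode("iso-8859-1")
mismatched = re.compile(pattern_bytes)
tokens_broken = mismatched.findall(corpus_bytes)
print(tokens_broken)  # [b'intel', b'\xb7lig', b'ncia']
```

In the second case the byte 0xC2 is not in the character class, so the
match stops right before the dot and the word falls apart.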
--
Best regards,
Bjoern Wilmsmann