[ngram] Re: Problem with a token

mercevg Wed, 13 Feb 2008 06:20:35 -0800

Bjoern,

Yes, I think so!


I work with UTF-8 throughout (corpus, stop list, etc.). I thought the
problem with the character "l·l" was similar to the one with accents: I
added all the accented characters used in Catalan and Spanish as tokens,
and that problem was solved, but this one was not. For this reason, I
tried adding this character to my tokens file and to my stop list, but
it doesn't work.

Mercè



> Hi there,
> 
> mercevg wrote:
> > I have some problems filtering n-grams in a corpus that contains words
> > with this character: "l·l". This character is frequently used in
> > Catalan documents. In my results list I can't retrieve n-grams with
> > words that contain this character.
> >
> > In my tokens file I have inserted the line "/[a-zA-Z·]+/" (with "·"),
> > but the results are not satisfactory.
> >
> > I have also tried inserting the line "/l·l/" in my stop list, but it
> > doesn't work at all, because my results list has bi-grams like
> > "intel<>ligència". In this case, one word is divided into two words.
> >
> > Do you know what the problem is?
> >
> 
> This sounds like a character set / file encoding issue. All files
> involved (corpus, filters etc.) should have the same encoding. I am
> not sure about the specific ISO encoding for Catalan. However, I
> suppose Catalan is covered by ISO-8859-1. UTF-8 should work anyway,
> though.
> --
> Best regards,
> Bjoern Wilmsmann
>
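
To make the two suspected causes concrete, here is a minimal sketch in
Python (not the Perl toolkit discussed in this thread, so its regex engine
may behave differently). The patterns are illustrative assumptions modelled
on the token line quoted above. It shows a word being cut at "·" when the
token class lacks that character, and the same kind of cut when UTF-8 text
is read as ISO-8859-1.

```python
import re

word = "intel·ligència"

# Token pattern without the middle dot: the word is cut at "·",
# which mirrors the "intel<>ligència" bigram reported above.
plain = re.findall(r"\w+", word)

# Same pattern with "·" (U+00B7) added to the character class.
with_dot = re.findall(r"[\w·]+", word)

# Encoding mismatch: the text is UTF-8 on disk but is read as ISO-8859-1.
# "·" turns into the two characters "Â·" and "è" into "Ã¨", so a pattern
# that lists "·" and "è" no longer covers the whole word.
misread = word.encode("utf-8").decode("iso-8859-1")
mismatch = re.findall(r"[a-zA-Zè·]+", misread)

print(plain)     # ['intel', 'ligència']
print(with_dot)  # ['intel·ligència']
print(mismatch)  # ['intel', '·lig', 'ncia']
```

If the third result resembles what the real output looks like, it may be
worth checking that the corpus, the tokens file and the stop list are all
read with the same encoding.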
