Hi Hannah, Otis,

I cannot help, but I have exactly the same problem with special German characters. I used the Snowball analyzer, but it does not help, because the problem (tokenizing) occurs before the analyzer comes into action. I just posted the question "Problem tokenizing UTF-8 with German umlauts" a few minutes ago, which describes my problem; Hannah's seems to be similar. Are your pages also UTF-8 encoded?

Peter MH
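[Editor's note: Peter's hunch about encodings is worth checking in code. If documents are read with the platform default charset instead of UTF-8, the umlauts are already corrupted before any tokenizer or analyzer sees them. A minimal sketch of forcing UTF-8 when handing text to Lucene, using the Lucene 1.x-era Field.Text(String, Reader) API; the file name and field name here are just placeholders:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class Utf8IndexingSketch {
    public static Document makeDocument() throws Exception {
        // Read the page with an explicit UTF-8 charset; a plain FileReader
        // would silently use the platform default (often ISO-8859-1/Cp1252),
        // mangling umlauts and accented characters before tokenization.
        Reader reader = new InputStreamReader(
                new FileInputStream("page.html"), "UTF-8");

        Document doc = new Document();
        // Field.Text(String, Reader) tokenizes the reader's content at index time.
        doc.add(Field.Text("contents", reader));
        return doc;
    }
}

If the bytes on disk are UTF-8 but the reader decodes them as Latin-1, each umlaut becomes two garbage characters, and the tokenizer will split there no matter which analyzer runs afterwards.]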
-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 19 May 2004 17:42
To: Lucene Users List
Subject: Re: Problem indexing Spanish Characters

It looks like the Snowball project supports Spanish:
http://www.google.com/search?q=snowball spanish

If it does, take a look at the Lucene Sandbox. There is a project there that allows you to use Snowball analyzers with Lucene.

Otis

--- Hannah c <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am indexing a number of English articles on Spanish resorts. As such,
> there are a number of Spanish characters throughout the text; most of
> these are in the place names, which are the type of words I would like
> to use as queries. My problem is with the StandardTokenizer class, which
> cuts a word in two when it comes across any of the Spanish characters.
> I had a look at the source, but the code was generated by JavaCC and so
> is not very readable. I was wondering if there was a way around this
> problem, or which area of the code I would need to change to avoid it.
>
> Thanks
> Hannah Cumming
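[Editor's note: for Hannah's question about which code to change, it may be easier to swap tokenizers than to edit the JavaCC-generated StandardTokenizer. Lucene's LetterTokenizer delegates to Character.isLetter(), which returns true for accented Latin letters such as á, í, ñ and the German umlauts. A small diagnostic sketch against the Lucene 1.x API (the class name and sample text are made up) that prints the tokens each tokenizer produces, so you can see directly where any split happens:

import java.io.StringReader;

import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class TokenizerDiagnostic {
    public static void main(String[] args) throws Exception {
        String text = "Hoteles en Andalucía y Alcalá";

        System.out.println("StandardTokenizer:");
        print(new StandardTokenizer(new StringReader(text)));

        System.out.println("LetterTokenizer:");
        print(new LetterTokenizer(new StringReader(text)));
    }

    // Drain a token stream and print one term per line.
    private static void print(TokenStream stream) throws Exception {
        for (Token t = stream.next(); t != null; t = stream.next()) {
            System.out.println("  " + t.termText());
        }
        stream.close();
    }
}

If LetterTokenizer keeps the place names whole while StandardTokenizer splits them, a custom analyzer built on LetterTokenizer (plus a LowerCaseFilter and, if stemming is wanted, the Sandbox SnowballFilter) avoids touching the generated code at all. If both tokenizers split the words, the problem is more likely the character encoding of the input, as discussed above.]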