Hi Hannah, Otis
I can't help, but I have exactly the same problem with special German
characters. I used the Snowball analyzer, but that does not help because the
problem (tokenizing) appears before the analyzer comes into action.
I posted the question "Problem tokenizing UTF-8 with German umlauts" a few
minutes ago; it describes my problem, and Hannah's seems to be similar.
Do you also have UTF-8 encoded pages?
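
In case it helps: one thing worth ruling out is that the pages are being read
with the platform default encoding instead of UTF-8. In that case the
tokenizer already sees mangled characters, so the word gets split before any
analyzer or stemmer can run. A minimal sketch of what I mean (the class name,
file name and field name are just placeholders; it assumes the Lucene 1.x
Field.Text(String, Reader) API):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class Utf8DocFactory {
    // Read the page with an explicit UTF-8 Reader so umlauts reach the
    // tokenizer as single characters, not as two bytes mangled by the
    // platform default encoding.
    public static Document fromUtf8File(String path) throws Exception {
        Reader content =
            new InputStreamReader(new FileInputStream(path), "UTF-8");
        Document doc = new Document();
        doc.add(Field.Text("contents", content)); // tokenized, indexed, not stored
        return doc;
    }
}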

Peter MH

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 19 May 2004 17:42
To: Lucene Users List
Subject: Re: Problem indexing Spanish Characters


It looks like the Snowball project supports Spanish:
http://www.google.com/search?q=snowball spanish

If it does, take a look at Lucene Sandbox.  There is a project that
allows you to use Snowball analyzers with Lucene.
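
For example, once the sandbox jar is on the classpath, wiring it in could
look roughly like this (the index path is a placeholder, and "Spanish" is
the stemmer name as listed by the Snowball project):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class SpanishIndexer {
    public static void main(String[] args) throws Exception {
        // SnowballAnalyzer lives in the Lucene Sandbox; "Spanish"
        // selects the Spanish stemmer.
        Analyzer analyzer = new SnowballAnalyzer("Spanish");

        // Create a new index at the given path using that analyzer.
        IndexWriter writer = new IndexWriter("/tmp/spanish-index", analyzer, true);
        // ... add documents here ...
        writer.close();
    }
}

Note that the stemmer only runs after tokenization, so if the tokenizer is
already splitting the words, stemming alone won't fix that.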

Otis


--- Hannah c <[EMAIL PROTECTED]> wrote:
> 
> Hi,
> 
> I am indexing a number of English articles on Spanish resorts. As such,
> there are a number of Spanish characters throughout the text; most of
> these are in the place names, which are the type of words I would like
> to use as queries. My problem is with the StandardTokenizer class, which
> cuts the word in two when it comes across any of the Spanish characters.
> I had a look at the source, but the code was generated by JavaCC and so
> is not very readable. I was wondering if there is a way around this
> problem, or which area of the code I would need to change to avoid it.
> 
> Thanks
> Hannah Cumming
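
One possible way around the StandardTokenizer split, without touching the
JavaCC-generated source, is to swap in a different tokenizer. Java's
Character.isLetter() treats accented characters as letters, so a
LetterTokenizer-based analyzer keeps accented place names in one token. A
rough sketch (the class name is only illustrative, and you give up
StandardTokenizer's special handling of numbers, acronyms and the like):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

public class AccentFriendlyAnalyzer extends Analyzer {
    // LetterTokenizer splits only on non-letter characters, and accented
    // Spanish characters count as letters, so place names stay whole.
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new LetterTokenizer(reader);
        return new LowerCaseFilter(stream); // keep the usual lowercasing
    }
}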
