Hi,
I had a quick look at the sandbox, but my problem is that I don't need a
Spanish stemmer. What I need is a replacement tokenizer that supports
accented characters, to go along with the foreign-language Snowball stemmers.
Does anyone know where I could find one?
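For what it's worth, the core of a letter-run tokenizer is small. Below is a minimal sketch in plain JDK code (not Lucene's Tokenizer API; the class and method names are mine), relying on the fact that Character.isLetter() is Unicode-aware and so keeps accented letters inside a token rather than splitting on them:

```java
import java.util.ArrayList;
import java.util.List;

public class LetterTokenizerDemo {

    // Emit maximal runs of letters. Character.isLetter() is Unicode-aware,
    // so accented characters such as 'ó' or 'ñ' stay inside a token
    // instead of ending it.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("la Fundación Hospital de Na. Señora del Pilar"));
        // [la, Fundación, Hospital, de, Na, Señora, del, Pilar]
    }
}
```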
In answer to Peter's question: yes, I'm also using UTF-8 encoded XML
documents as the source.
Below is an example of what happens when I tokenize the text using the
StandardTokenizer.
Thanks, Hannah
--text I'm trying to index
century palace known as la Fundación Hospital de Na. Señora del Pilar
--tokens output by StandardTokenizer
century
palace
known
as
la
â
FundaciÃ*
n *
Hospital
de
Na
Seà *
ora *
del
Pilar
â
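[Editor's aside: the Ã / Ã³ pattern in the token output above is the classic signature of UTF-8 bytes being decoded as Latin-1 (ISO-8859-1) somewhere before tokenization. A JDK-only sketch reproducing the symptom, assuming that is indeed what is happening here:]

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "Fundación";
        // Encode correctly as UTF-8 ('ó' becomes the two bytes 0xC3 0xB3)...
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // ...then decode with the wrong charset, as a Latin-1 Reader would:
        // each byte becomes its own character, yielding the garbage seen above.
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled); // prints "FundaciÃ³n"
    }
}
```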
---
From: "Peter M Cipollone" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Subject: Re: Problem indexing Spanish Characters
Date: Wed, 19 May 2004 11:41:28 -0400
could you send some sample text that causes this to happen?
- Original Message -
From: "Hannah c" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, May 19, 2004 11:30 AM
Subject: Problem indexing Spanish Characters
>
> Hi,
>
> I am indexing a number of English articles on Spanish resorts. As such,
> there are a number of Spanish characters throughout the text; most of these
> are in the place names, which are the type of words I would like to use as
> queries. My problem is with the StandardTokenizer class, which cuts the
> word in two when it comes across any of the Spanish characters. I had a
> look at the source, but the code was generated by JavaCC and so is not
> very readable. I was wondering if there was a way around this problem, or
> which area of the code I would need to change to avoid it.
>
> Thanks
> Hannah Cumming
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>
From: PEP AD Server Administrator
<[EMAIL PROTECTED]>
Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Subject: AW: Problem indexing Spanish Characters
Date: Wed, 19 May 2004 18:08:56 +0200
Hi Hannah, Otis
I cannot help, but I have exactly the same problem with special German
characters. I used the Snowball analyser, but it does not help, because the
problem (tokenizing) occurs before the analyser comes into action.
I just posted the question "Problem tokenizing UTF-8 with German umlauts"
a few minutes ago; it describes my problem, and Hannah's seems to be similar.
Do you also have UTF-8 encoded pages?
Peter MH
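[Editor's aside: one thing worth checking in both cases is whether the Reader handed to the analyzer decodes the bytes with an explicit charset; pre-Java-11 FileReader silently uses the platform default encoding. A sketch, assuming the input really is UTF-8 on disk (StandardCharsets is a modern-JDK convenience; on old JDKs pass the charset name "UTF-8" instead):]

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class UtfReaderDemo {
    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 file to stand in for the source document.
        File f = File.createTempFile("demo", ".txt");
        try (OutputStream out = new FileOutputStream(f)) {
            out.write("Señora del Pilar".getBytes(StandardCharsets.UTF_8));
        }
        // Decode with an explicit charset rather than the platform default,
        // so accented characters survive intact.
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine()); // prints "Señora del Pilar"
        }
        f.delete();
    }
}
```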
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 19 May 2004 17:42
To: Lucene Users List
Subject: Re: Problem indexing Spanish Characters
It looks like the Snowball project supports Spanish:
http://www.google.com/search?q=snowball+spanish
If it does, take a look at Lucene Sandbox. There is a project that
allows you to use Snowball analyzers with Lucene.
Otis
--- Hannah c <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I am indexing a number of English articles on Spanish resorts. As such,
> there are a number of Spanish characters throughout the text; most of these
> are in the place names, which are the type of words I would like to use as
> queries. My problem is with the StandardTokenizer class, which cuts the
> word in two when it comes across any of the Spanish characters. I had a
> look at the source, but the code was generated by JavaCC and so is not
> very readable. I was wondering if there was a way around this problem, or
> which area of the code I would need to change to avoid it.
>
> Thanks
> Hannah Cumming
Hannah Cumming
[EMAIL PROTECTED]