AW: Problem indexing Spanish Characters

2004-05-21 Thread PEP AD Server Administrator
Hi all,
Martin was right. I just adapted the HTML demo as Wallen recommended and it
worked. Now I only have to deal with some crazy documents that are UTF-8
encoded mixed with entities.
Does anyone know a class that can translate entities into UTF-8 or any
other encoding?

Peter MH
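[Editor's note: the JDK of that era had no standard entity-decoding class. A minimal hand-rolled sketch is below; the class name and the entity table are illustrative, not a real library API, and it assumes well-formed references.]

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical helper: decodes numeric character references
    // (&#241; or &#xF1;) and a few named entities into their Unicode
    // characters. Extend the NAMED map as needed for your documents.
    public class EntityDecoder {
        private static final Map<String, Character> NAMED = new HashMap<String, Character>();
        static {
            NAMED.put("amp", '&');
            NAMED.put("lt", '<');
            NAMED.put("gt", '>');
            NAMED.put("quot", '"');
            NAMED.put("oacute", '\u00F3'); // ó
            NAMED.put("ntilde", '\u00F1'); // ñ
        }

        public static String decode(String s) {
            StringBuilder out = new StringBuilder();
            int i = 0;
            while (i < s.length()) {
                char c = s.charAt(i);
                int semi;
                if (c == '&' && (semi = s.indexOf(';', i)) > i) {
                    String name = s.substring(i + 1, semi);
                    if (name.startsWith("#") && name.length() > 1) {
                        // numeric reference, decimal or hex
                        int code = (name.startsWith("#x") || name.startsWith("#X"))
                                ? Integer.parseInt(name.substring(2), 16)
                                : Integer.parseInt(name.substring(1));
                        out.append((char) code);
                        i = semi + 1;
                        continue;
                    } else if (NAMED.containsKey(name)) {
                        out.append(NAMED.get(name));
                        i = semi + 1;
                        continue;
                    }
                }
                out.append(c); // not a recognized reference; copy through
                i++;
            }
            return out.toString();
        }
    }

Since the result is a Java String (already Unicode), writing it out in UTF-8 or any other encoding is then just a matter of the OutputStreamWriter charset.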

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]

Here is an example method in org.apache.lucene.demo.html HTMLParser that
uses a different buffered reader for a different encoding. 

public Reader getReader() throws IOException
{
    if (pipeIn == null)
    {
        pipeInStream = new MyPipedInputStream();
        pipeOutStream = new PipedOutputStream(pipeInStream);
        pipeIn = new InputStreamReader(pipeInStream);
        pipeOut = new OutputStreamWriter(pipeOutStream);
        // check the first bytes for the FFFE (byte-order) marker;
        // if it's there, we know it's UTF-16 encoding
        if (useUTF16)
        {
            try
            {
                pipeIn = new BufferedReader(
                    new InputStreamReader(pipeInStream, "UTF-16"));
            }
            catch (Exception e)
            {
            }
        }
        Thread thread = new ParserThread(this);
        thread.start(); // start parsing
    }
    return pipeIn;
}

-Original Message-
From: Martin Remy [mailto:[EMAIL PROTECTED]

The tokenizers deal with unicode characters (CharStream, char), so the
problem is not there.  This problem must be solved at the point where the
bytes from your source files are turned into CharSequences/Strings, i.e. by
connecting an InputStreamReader to your FileReader (or whatever you're
using) and specifying "UTF-8" (or whatever encoding is appropriate) in the
InputStreamReader constructor.  

You must either detect the encoding from HTTP headers or XML declarations
or, if you know that it's the same for all of your source files, then just
hardcode UTF-8, for example.

Martin
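[Editor's note: Martin's point — decode bytes to chars with an explicit charset rather than the platform default — looks like this in practice; the class name is illustrative.]

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    // Wrap the byte stream in an InputStreamReader with an explicit
    // charset. FileReader uses the platform default encoding, which is
    // exactly what corrupts UTF-8 input on most systems.
    public class Utf8Read {
        public static String readAll(InputStream in) throws IOException {
            Reader r = new BufferedReader(new InputStreamReader(in, "UTF-8"));
            StringBuilder sb = new StringBuilder();
            for (int c; (c = r.read()) != -1; ) {
                sb.append((char) c);
            }
            return sb.toString();
        }
    }

For a file, pass `new FileInputStream(path)` as the InputStream; the resulting String is proper Unicode and can be handed to the analyzer.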

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: AW: Problem indexing Spanish Characters

2004-05-19 Thread wallen
Here is an example method in org.apache.lucene.demo.html HTMLParser that
uses a different buffered reader for a different encoding. 

public Reader getReader() throws IOException
{
    if (pipeIn == null)
    {
        pipeInStream = new MyPipedInputStream();
        pipeOutStream = new PipedOutputStream(pipeInStream);
        pipeIn = new InputStreamReader(pipeInStream);
        pipeOut = new OutputStreamWriter(pipeOutStream);
        // check the first bytes for the FFFE (byte-order) marker;
        // if it's there, we know it's UTF-16 encoding
        if (useUTF16)
        {
            try
            {
                pipeIn = new BufferedReader(
                    new InputStreamReader(pipeInStream, "UTF-16"));
            }
            catch (Exception e)
            {
            }
        }
        Thread thread = new ParserThread(this);
        thread.start(); // start parsing
    }
    return pipeIn;
}
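[Editor's note: the comment in the method above alludes to sniffing a byte-order mark before choosing a reader. A self-contained sketch of that check (names illustrative), using PushbackInputStream so non-BOM bytes are not lost:]

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.PushbackInputStream;
    import java.io.Reader;

    // Peek at the first bytes of the stream: FE FF means UTF-16BE,
    // FF FE means UTF-16LE, EF BB BF means UTF-8 with a BOM. Any bytes
    // that are not part of a BOM are pushed back onto the stream.
    public class BomSniffer {
        public static Reader openReader(InputStream in) throws IOException {
            PushbackInputStream pb = new PushbackInputStream(in, 3);
            byte[] head = new byte[3];
            int n = pb.read(head, 0, 3);
            if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
                if (n == 3) pb.unread(head, 2, 1);      // third byte is data
                return new InputStreamReader(pb, "UTF-16BE");
            }
            if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
                if (n == 3) pb.unread(head, 2, 1);
                return new InputStreamReader(pb, "UTF-16LE");
            }
            if (n == 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF) {
                return new InputStreamReader(pb, "UTF-8"); // UTF-8 BOM consumed
            }
            if (n > 0) pb.unread(head, 0, n); // no BOM: put everything back
            return new InputStreamReader(pb, "UTF-8");    // assumed default
        }
    }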

-Original Message-
From: Martin Remy [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 19, 2004 2:09 PM
To: 'Lucene Users List'
Subject: RE: AW: Problem indexing Spanish Characters


The tokenizers deal with unicode characters (CharStream, char), so the
problem is not there.  This problem must be solved at the point where the
bytes from your source files are turned into CharSequences/Strings, i.e. by
connecting an InputStreamReader to your FileReader (or whatever you're
using) and specifying "UTF-8" (or whatever encoding is appropriate) in the
InputStreamReader constructor.  

You must either detect the encoding from HTTP headers or XML declarations
or, if you know that it's the same for all of your source files, then just
hardcode UTF-8, for example.

Martin

-Original Message-
From: Hannah c [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, May 19, 2004 10:35 AM
To: [EMAIL PROTECTED]
Subject: RE: AW: Problem indexing Spanish Characters

Hi,

I had a quick look at the sandbox, but my problem is that I don't need a
Spanish stemmer. However, there must be a replacement tokenizer that supports
foreign characters to go along with the foreign-language Snowball stemmers.
Does anyone know where I could find one?

In answer to Peter's question - yes, I'm also using "UTF-8" encoded XML
documents as the source.
Below is an example of what happens when I tokenize the text using the
StandardTokenizer.

Thanks Hannah



--text I'm trying to index

century palace known as la "Fundación Hospital de Na. Señora del Pilar"

-tokens output from StandardTokenizer

century
palace
known
as
la
â
FundaciÃ*
n   *
Hospital
de
Na
Seà *
ora   *
del
Pilar
â
---
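[Editor's note: the "FundaciÃ" and "â" tokens above are the classic signature of UTF-8 bytes being decoded as ISO-8859-1/Latin-1 somewhere upstream of the tokenizer — ó (UTF-8 bytes C3 B3) becomes "Ã³", and the curly quotes become "â"-prefixed junk. A small demo that reproduces the corruption, on the assumption that this is what happened:]

    import java.io.UnsupportedEncodingException;

    // Encode a string correctly as UTF-8 bytes, then decode those bytes
    // with the wrong charset (Latin-1) - reproducing the mojibake seen
    // in the token list above.
    public class MojibakeDemo {
        public static String misdecode(String s) throws UnsupportedEncodingException {
            return new String(s.getBytes("UTF-8"), "ISO-8859-1");
        }
    }

"Fundación" comes out as "FundaciÃ³n", and StandardTokenizer then splits on the non-letter "³", producing exactly the "FundaciÃ" / "n" pair shown.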



>From: "Peter M Cipollone" <[EMAIL PROTECTED]>
>To: <[EMAIL PROTECTED]>
>Subject: Re: Problem indexing Spanish Characters
>Date: Wed, 19 May 2004 11:41:28 -0400
>
>could you send some sample text that causes this to happen?
>
>- Original Message -
>From: "Hannah c" <[EMAIL PROTECTED]>
>To: <[EMAIL PROTECTED]>
>Sent: Wednesday, May 19, 2004 11:30 AM
>Subject: Problem indexing Spanish Characters
>
>
> >
> > Hi,
> >
> > I am indexing a number of English articles on Spanish resorts. As such
> > there are a number of Spanish characters throughout the text; most of
> > these are in the place names, which are the type of words I would like
> > to use as queries. My problem is with the StandardTokenizer class,
> > which cuts the word in two when it comes across any of the Spanish
> > characters. I had a look at the source, but the code was generated by
> > JavaCC and so is not very readable. I was wondering if there was a way
> > around this problem, or which area of the code I would need to change
> > to avoid this.
> >
> > Thanks
> > Hannah Cumming
>




>From: PEP AD Server Administrator
><[EMAIL PROTECTED]>
>Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
>To: "'Lucene Users List'" <[EMAIL PROTECTED]>
>Subject: AW: Problem indexing Spanish Characters
>Date: Wed, 19 May 2004 18:08:56 +0200
>
>

RE: AW: Problem indexing Spanish Characters

2004-05-19 Thread Martin Remy
The tokenizers deal with unicode characters (CharStream, char), so the
problem is not there.  This problem must be solved at the point where the
bytes from your source files are turned into CharSequences/Strings, i.e. by
connecting an InputStreamReader to your FileReader (or whatever you're
using) and specifying "UTF-8" (or whatever encoding is appropriate) in the
InputStreamReader constructor.  

You must either detect the encoding from HTTP headers or XML declarations
or, if you know that it's the same for all of your source files, then just
hardcode UTF-8, for example.

Martin

-Original Message-
From: Hannah c [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, May 19, 2004 10:35 AM
To: [EMAIL PROTECTED]
Subject: RE: AW: Problem indexing Spanish Characters

Hi,

I had a quick look at the sandbox, but my problem is that I don't need a
Spanish stemmer. However, there must be a replacement tokenizer that supports
foreign characters to go along with the foreign-language Snowball stemmers.
Does anyone know where I could find one?

In answer to Peter's question - yes, I'm also using "UTF-8" encoded XML
documents as the source.
Below is an example of what happens when I tokenize the text using the
StandardTokenizer.

Thanks Hannah



--text I'm trying to index

century palace known as la “Fundación Hospital de Na. Señora del Pilar”

-tokens output from StandardTokenizer

century
palace
known
as
la
â
FundaciÃ*
n   *
Hospital
de
Na
Seà *
ora   *
del
Pilar
â
---



>From: "Peter M Cipollone" <[EMAIL PROTECTED]>
>To: <[EMAIL PROTECTED]>
>Subject: Re: Problem indexing Spanish Characters
>Date: Wed, 19 May 2004 11:41:28 -0400
>
>could you send some sample text that causes this to happen?
>
>- Original Message -
>From: "Hannah c" <[EMAIL PROTECTED]>
>To: <[EMAIL PROTECTED]>
>Sent: Wednesday, May 19, 2004 11:30 AM
>Subject: Problem indexing Spanish Characters
>
>
> >
> > Hi,
> >
> > I am indexing a number of English articles on Spanish resorts. As such
> > there are a number of Spanish characters throughout the text; most of
> > these are in the place names, which are the type of words I would like
> > to use as queries. My problem is with the StandardTokenizer class,
> > which cuts the word in two when it comes across any of the Spanish
> > characters. I had a look at the source, but the code was generated by
> > JavaCC and so is not very readable. I was wondering if there was a way
> > around this problem, or which area of the code I would need to change
> > to avoid this.
> >
> > Thanks
> > Hannah Cumming
>




>From: PEP AD Server Administrator
><[EMAIL PROTECTED]>
>Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
>To: "'Lucene Users List'" <[EMAIL PROTECTED]>
>Subject: AW: Problem indexing Spanish Characters
>Date: Wed, 19 May 2004 18:08:56 +0200
>
>Hi Hannah, Otis
>I cannot help, but I have exactly the same problems with special German
>characters. I used the Snowball analyser, but this does not help because the
>problem (tokenizing) appears before the analyser comes into action.
>I just posted the question "Problem tokenizing UTF-8 with German umlauts"
>a few minutes ago, which describes my problem; Hannah's seems to be similar.
>Do you also have UTF-8 encoded pages?
>
>Peter MH
>
>-Original Message-
>From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
>Sent: Wednesday, May 19, 2004 17:42
>To: Lucene Users List
>Subject: Re: Problem indexing Spanish Characters
>
>
>It looks like Snowball project supports Spanish:
>http://www.google.com/search?q=snowball spanish
>
>If it does, take a look at Lucene Sandbox.  There is a project that 
>allows you to use Snowball analyzers with Lucene.
>
>Otis
>
>
>--- Hannah c <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > I am indexing a number of English articles on Spanish resorts. As
> > such there are a number of Spanish characters throughout the text;
> > most of these are in the place names, which are the type of words I
> > would like to use as queries. My problem is with the
> > Stand

RE: AW: Problem indexing Spanish Characters

2004-05-19 Thread Hannah c
Hi,
I had a quick look at the sandbox, but my problem is that I don't need a
Spanish stemmer. However, there must be a replacement tokenizer that supports
foreign characters to go along with the foreign-language Snowball stemmers.
Does anyone know where I could find one?

In answer to Peter's question - yes, I'm also using "UTF-8" encoded XML
documents as the source.
Below is an example of what happens when I tokenize the text using the
StandardTokenizer.

Thanks Hannah

--text I'm trying to index
century palace known as la “Fundación Hospital de Na. Señora del Pilar”
-tokens output from StandardTokenizer
century
palace
known
as
la
â
FundaciÃ*
n   *
Hospital
de
Na
Seà *
ora   *
del
Pilar
â
---
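[Editor's note: on Hannah's question about a tokenizer that keeps accented characters together - Character.isLetter is Unicode-aware, so a tokenizer built on it treats 'ó' and 'ñ' as letters, unlike a grammar that only knows ASCII a-z. A minimal standalone sketch (not Lucene's actual tokenizer classes):]

    import java.util.ArrayList;
    import java.util.List;

    // Split text into maximal runs of Unicode letters. Accented letters
    // such as \u00F3 and \u00F1 stay inside their tokens because
    // Character.isLetter accepts them.
    public class LetterTokens {
        public static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<String>();
            StringBuilder cur = new StringBuilder();
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                if (Character.isLetter(c)) {
                    cur.append(c);
                } else if (cur.length() > 0) {
                    tokens.add(cur.toString());
                    cur.setLength(0);
                }
            }
            if (cur.length() > 0) tokens.add(cur.toString());
            return tokens;
        }
    }

Note this only helps once the input has been decoded correctly; if the bytes were already misdecoded upstream, no tokenizer can repair them.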

From: "Peter M Cipollone" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Subject: Re: Problem indexing Spanish Characters
Date: Wed, 19 May 2004 11:41:28 -0400
could you send some sample text that causes this to happen?
- Original Message -
From: "Hannah c" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, May 19, 2004 11:30 AM
Subject: Problem indexing Spanish Characters
>
> Hi,
>
> I am indexing a number of English articles on Spanish resorts. As such
> there are a number of Spanish characters throughout the text; most of
> these are in the place names, which are the type of words I would like
> to use as queries. My problem is with the StandardTokenizer class,
> which cuts the word in two when it comes across any of the Spanish
> characters. I had a look at the source, but the code was generated by
> JavaCC and so is not very readable. I was wondering if there was a way
> around this problem, or which area of the code I would need to change
> to avoid this.
>
> Thanks
> Hannah Cumming
>



From: PEP AD Server Administrator 
<[EMAIL PROTECTED]>
Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Subject: AW: Problem indexing Spanish Characters
Date: Wed, 19 May 2004 18:08:56 +0200

Hi Hannah, Otis
I cannot help, but I have exactly the same problems with special German
characters. I used the Snowball analyser, but this does not help because the
problem (tokenizing) appears before the analyser comes into action.
I just posted the question "Problem tokenizing UTF-8 with German umlauts"
a few minutes ago, which describes my problem; Hannah's seems to be similar.
Do you also have UTF-8 encoded pages?
Peter MH
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 19, 2004 17:42
To: Lucene Users List
Subject: Re: Problem indexing Spanish Characters
It looks like Snowball project supports Spanish:
http://www.google.com/search?q=snowball spanish
If it does, take a look at Lucene Sandbox.  There is a project that
allows you to use Snowball analyzers with Lucene.
Otis
--- Hannah c <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I am indexing a number of English articles on Spanish resorts. As such
> there are a number of Spanish characters throughout the text; most of
> these are in the place names, which are the type of words I would like
> to use as queries. My problem is with the StandardTokenizer class,
> which cuts the word in two when it comes across any of the Spanish
> characters. I had a look at the source, but the code was generated by
> JavaCC and so is not very readable. I was wondering if there was a way
> around this problem, or which area of the code I would need to change
> to avoid this.
>
> Thanks
> Hannah Cumming

Hannah Cumming
[EMAIL PROTECTED]




AW: Problem indexing Spanish Characters

2004-05-19 Thread PEP AD Server Administrator
Hi Hannah, Otis
I cannot help, but I have exactly the same problems with special German
characters. I used the Snowball analyser, but this does not help because the
problem (tokenizing) appears before the analyser comes into action.
I just posted the question "Problem tokenizing UTF-8 with German umlauts"
a few minutes ago, which describes my problem; Hannah's seems to be similar.
Do you also have UTF-8 encoded pages?

Peter MH

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 19, 2004 17:42
To: Lucene Users List
Subject: Re: Problem indexing Spanish Characters


It looks like the Snowball project supports Spanish:
http://www.google.com/search?q=snowball spanish

If it does, take a look at Lucene Sandbox.  There is a project that
allows you to use Snowball analyzers with Lucene.

Otis


--- Hannah c <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I am indexing a number of English articles on Spanish resorts. As such
> there are a number of Spanish characters throughout the text; most of
> these are in the place names, which are the type of words I would like
> to use as queries. My problem is with the StandardTokenizer class,
> which cuts the word in two when it comes across any of the Spanish
> characters. I had a look at the source, but the code was generated by
> JavaCC and so is not very readable. I was wondering if there was a way
> around this problem, or which area of the code I would need to change
> to avoid this.
>
> Thanks
> Hannah Cumming
