AW: Problem indexing Spanish Characters
Hi all, Martin was right. I just adapted the HTML demo as Wallen recommended and it worked. Now I only have to deal with some crazy documents that are UTF-8 encoded and mixed with entities. Does anyone know a class which can translate entities into UTF-8 or any other encoding?

Peter MH

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] [quoted text snipped; the getReader() example and Martin Remy's reply appear in full in the messages below]
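(For the entity question: a minimal decoder is easy to sketch. The class below is illustrative, not from any particular library, and its entity table is deliberately tiny; for a maintained implementation, the StringEscapeUtils class in Jakarta Commons Lang is worth a look. Run it on the text after the bytes have already been decoded to chars, i.e. downstream of the InputStreamReader.)

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative entity decoder (not from any particular library): turns
// numeric character references (&#233;, &#xE9;) and a few named entities
// into their Unicode characters. Extend NAMED from the HTML 4.0 entity list.
public class EntityDecoder {
    private static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("ntilde", "\u00F1"); // ñ
        NAMED.put("aacute", "\u00E1"); // á
        NAMED.put("eacute", "\u00E9"); // é
        NAMED.put("iacute", "\u00ED"); // í
        NAMED.put("oacute", "\u00F3"); // ó
        NAMED.put("uacute", "\u00FA"); // ú
    }

    public static String decode(String s) {
        StringBuffer out = new StringBuffer(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int semi = (c == '&') ? s.indexOf(';', i + 1) : -1;
            if (semi > i) {
                String body = s.substring(i + 1, semi);
                String repl = null;
                try {
                    if (body.startsWith("#x") || body.startsWith("#X")) {
                        repl = String.valueOf((char) Integer.parseInt(body.substring(2), 16));
                    } else if (body.startsWith("#")) {
                        repl = String.valueOf((char) Integer.parseInt(body.substring(1)));
                    } else {
                        repl = NAMED.get(body);
                    }
                } catch (NumberFormatException e) {
                    // malformed numeric reference: leave repl null, copy it verbatim
                }
                if (repl != null) {
                    out.append(repl);
                    i = semi + 1;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

Unknown entities and bare ampersands pass through untouched, so running it over already-clean text is harmless.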
RE: AW: Problem indexing Spanish Characters
Here is an example method in org.apache.lucene.demo.html HTMLParser that uses a different buffered reader for a different encoding.

public Reader getReader() throws IOException {
    if (pipeIn == null) {
        pipeInStream = new MyPipedInputStream();
        pipeOutStream = new PipedOutputStream(pipeInStream);
        pipeIn = new InputStreamReader(pipeInStream);
        pipeOut = new OutputStreamWriter(pipeOutStream);
        // check the first bytes for the FFFE marker; if it's there, we know it's UTF-16
        if (useUTF16) {
            try {
                pipeIn = new BufferedReader(new InputStreamReader(pipeInStream, "UTF-16"));
            } catch (Exception e) {
                // "UTF-16" is a charset every JVM must support, so this should not happen
            }
        }
        Thread thread = new ParserThread(this);
        thread.start(); // start parsing
    }
    return pipeIn;
}

-Original Message- From: Martin Remy [mailto:[EMAIL PROTECTED] Sent: Wednesday, May 19, 2004 2:09 PM To: 'Lucene Users List' Subject: RE: AW: Problem indexing Spanish Characters [quoted text snipped; Martin's and Hannah's messages appear in full below]
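(As an aside, the FFFE check that the comment in getReader() alludes to can be done without consuming the stream by peeking through a PushbackInputStream. The class and method names below are mine, not from the demo; this is a sketch, not the demo's actual implementation.)

```java
import java.io.IOException;
import java.io.PushbackInputStream;

// Sketch of the BOM check the comment in getReader() alludes to.
// FE FF is the UTF-16 big-endian byte-order mark, FF FE the little-endian
// one; the two peeked bytes are pushed back so the parser still sees them.
public class BomSniffer {
    public static boolean isUTF16(PushbackInputStream in) throws IOException {
        byte[] head = new byte[2];
        int n = in.read(head, 0, 2);
        if (n > 0) {
            in.unread(head, 0, n); // restore the stream for the caller
        }
        if (n < 2) {
            return false;
        }
        int b0 = head[0] & 0xFF;
        int b1 = head[1] & 0xFF;
        return (b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE);
    }
}
```

Usage would be something like: wrap the raw stream as new PushbackInputStream(rawStream, 2), call isUTF16 on it, then hand the same stream to the InputStreamReader with the charset the check decided on.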
RE: AW: Problem indexing Spanish Characters
The tokenizers deal with Unicode characters (CharStream, char), so the problem is not there. This problem must be solved at the point where the bytes from your source files are turned into CharSequences/Strings, i.e. by connecting an InputStreamReader to your FileInputStream (or whatever you're using) and specifying "UTF-8" (or whatever encoding is appropriate) in the InputStreamReader constructor. You must either detect the encoding from HTTP headers or XML declarations or, if you know that it's the same for all of your source files, just hardcode UTF-8, for example.

Martin

-Original Message- From: Hannah c [mailto:[EMAIL PROTECTED] Sent: Wednesday, May 19, 2004 10:35 AM To: [EMAIL PROTECTED] Subject: RE: AW: Problem indexing Spanish Characters [quoted text snipped; Hannah's message and the rest of the thread appear in full below]
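Martin's advice above in code form, as a minimal sketch (the class name is mine, not from Lucene): the key point is that FileReader always decodes with the platform default encoding, so the bytes must be opened as an InputStream and the charset named explicitly where bytes become chars.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

// The fix Martin describes: name the charset at the byte-to-char boundary.
// new FileReader(path) would silently use the platform default encoding;
// this always decodes the file as UTF-8.
public class Utf8FileReader {
    public static Reader open(String path) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(path), "UTF-8"));
    }
}
```

The resulting Reader can then be handed to the analyzer or Field exactly where the FileReader used to go.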
RE: AW: Problem indexing Spanish Characters
Hi, I had a quick look at the sandbox, but my problem is that I don't need a Spanish stemmer. However, there must be a replacement tokenizer that supports foreign characters to go along with the foreign-language Snowball stemmers. Does anyone know where I could find one? In answer to Peter's question: yes, I'm also using UTF-8 encoded XML documents as the source. I also put below an example of what happens when I tokenize the text using the StandardTokenizer.

Thanks
Hannah

--text I'm trying to index: century palace known as la "Fundación Hospital de Na. Señora del Pilar"
--tokens output from StandardTokenizer: century palace known as la â FundaciÃ* n * Hospital de Na Seà * ora * del Pilar â

From: "Peter M Cipollone" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Subject: Re: Problem indexing Spanish Characters
Date: Wed, 19 May 2004 11:41:28 -0400

Could you send some sample text that causes this to happen?

- Original Message - From: "Hannah c" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Wednesday, May 19, 2004 11:30 AM Subject: Problem indexing Spanish Characters

> Hi,
>
> I am indexing a number of English articles on Spanish resorts. As such, there are a number of Spanish characters throughout the text; most of these are in the place names, which are the type of words I would like to use as queries. My problem is with the StandardTokenizer class, which cuts the word in two when it comes across any of the Spanish characters. I had a look at the source, but the code was generated by JavaCC and so is not very readable. I was wondering if there was a way around this problem, or which area of the code I would need to change to avoid it.
>
> Thanks
> Hannah Cumming

From: PEP AD Server Administrator <[EMAIL PROTECTED]> Reply-To: "Lucene Users List" <[EMAIL PROTECTED]> To: "'Lucene Users List'" <[EMAIL PROTECTED]> Subject: AW: Problem indexing Spanish Characters Date: Wed, 19 May 2004 18:08:56 +0200 [quoted text snipped; the message appears in full below]

Hannah Cumming
[EMAIL PROTECTED]
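(Incidentally, the garbage in the token dump above, with the stray Ã characters, is the classic signature of UTF-8 bytes being decoded with a single-byte charset before they ever reach the tokenizer. A short sketch reproduces the symptom:)

```java
import java.io.UnsupportedEncodingException;

// Reproduce the symptom: take the UTF-8 bytes of a string and decode them
// as ISO-8859-1, the way a platform-default FileReader might. Each accented
// character becomes two Latin-1 characters (e.g. \u00F3 -> \u00C3\u00B3).
public class MojibakeDemo {
    public static String misdecode(String s) throws UnsupportedEncodingException {
        byte[] utf8 = s.getBytes("UTF-8");
        return new String(utf8, "ISO-8859-1");
    }
}
```

If this is what is happening, StandardTokenizer sees two characters (e.g. Ã followed by ³, and ³ is not a letter) instead of one ó, which would explain the split tokens; fixing the encoding on the Reader, as Martin describes, makes the splits disappear.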
AW: Problem indexing Spanish Characters
Hi Hannah, Otis,
I cannot help, but I have exactly the same problems with special German characters. I used the snowball analyser, but this does not help because the problem (tokenizing) appears before the analyser comes into action. I just posted the question "Problem tokenizing UTF-8 with german umlauts" some minutes ago, which describes my problem; Hannah's seems to be similar. Do you also have UTF-8 encoded pages?

Peter MH

-Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Wednesday, May 19, 2004 17:42 To: Lucene Users List Subject: Re: Problem indexing Spanish Characters

It looks like the Snowball project supports Spanish: http://www.google.com/search?q=snowball spanish

If it does, take a look at the Lucene Sandbox. There is a project that allows you to use Snowball analyzers with Lucene.

Otis

--- Hannah c <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am indexing a number of English articles on Spanish resorts. As such, there are a number of Spanish characters throughout the text; most of these are in the place names, which are the type of words I would like to use as queries. My problem is with the StandardTokenizer class, which cuts the word in two when it comes across any of the Spanish characters. I had a look at the source, but the code was generated by JavaCC and so is not very readable. I was wondering if there was a way around this problem, or which area of the code I would need to change to avoid it.
>
> Thanks
> Hannah Cumming

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]