Re: Problem indexing large document collection on Windows XP
Thilo, thanks for your effort. Could you please open a new entry in Bugzilla, mark it as [PATCH], and attach the diff file with your changes? This ensures that the sources and the information will not get lost in the huge universe of mailing lists. As soon as there is time, one of the committers will review it and decide whether it should be committed.

Bernhard
Problem indexing large document collection on Windows XP
Hello,

I encountered a problem when I tried to index large document collections (about 20 million documents). The indexing failed with the IOException "Cannot delete deletables". I tried several times (with the same document collection) and always received the error, but after a different number of documents. The exception is thrown after failing to delete the specified file at line 212 in FSDirectory.java. I found the following cure: after the lines

> if (nu.exists())
>   if (!nu.delete()) {

I replaced

>     throw new IOException("Cannot delete " + to);

with

>     while (nu.exists()) {
>       nu.delete();
>       System.out.println("delete loop");
>       try {
>         Thread.sleep(5000);
>       } catch (InterruptedException e) {
>         throw new RuntimeException(e);
>       }
>     }

That is, I now retry deleting the file until it succeeds. After the change, I was able to index all documents. From the fact that I observed "delete loop" on the output console several times, it can be deduced that the body of the while loop was reached (and left) several times. I am running Lucene on Windows XP.

Regards,
Thilo
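The patch above retries forever with a hard-coded five-second sleep. A minimal standalone sketch of the same idea with a bounded number of attempts might look like the following (RetryDelete, deleteWithRetry, and the parameter values are hypothetical illustrations, not part of the actual patch or of FSDirectory):

    import java.io.File;
    import java.io.IOException;

    public class RetryDelete {
        /**
         * Tries to delete the file, sleeping between attempts. On Windows a
         * delete can fail transiently while another thread (or the OS) still
         * holds the file open, so a few retries often succeed.
         */
        public static void deleteWithRetry(File file, int maxAttempts, long sleepMillis)
                throws IOException {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                if (!file.exists() || file.delete()) {
                    return; // already gone, or deleted successfully
                }
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IOException("Interrupted while deleting " + file);
                }
            }
            throw new IOException("Cannot delete " + file + " after " + maxAttempts + " attempts");
        }
    }

Bounding the loop keeps the retry behaviour for transient file-locking failures but still surfaces an IOException if the file can never be deleted, instead of hanging the indexing thread forever.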
Problem indexing
Hi, I have a problem indexing under the path C:\TXT\DOC\, but indexing under the path C:\TXT works fine. What could the problem be?

P.S. If anybody on the list speaks Spanish, please reply to me. Thank you.

--
Miguel Angel Angeles R.
Asesoria en Conectividad y Servidores (Connectivity and Server Consulting)
Telf. 97451277
Re: Problem Indexing Large Document Field
Yep, that was the problem... I just needed to increase the maxFieldLength value. Thanks.

On May 26, 2004, at 5:56 PM, [EMAIL PROTECTED] wrote:
> [...] By default, no more than 10,000 terms will be indexed for a field.

Gilberto Rodriguez
Software Engineer
RE: Problem Indexing Large Document Field
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH

maxFieldLength

public int maxFieldLength

The maximum number of terms that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index terms that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError. By default, no more than 10,000 terms will be indexed for a field.
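For context, a minimal sketch of raising this limit when creating an index, assuming the Lucene 1.4-era API quoted above, where maxFieldLength is a public field on IndexWriter (the "index" directory, the "contents" field name, and the generated body text are made up for illustration):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexLargeField {
        public static void main(String[] args) throws Exception {
            // Build a body with more terms than the default limit allows.
            StringBuffer body = new StringBuffer();
            for (int i = 0; i < 20000; i++) {
                body.append("term").append(i).append(' ');
            }

            IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
            // Default is 10,000 terms per field; raise it so the whole body is indexed.
            writer.maxFieldLength = 100000;

            Document doc = new Document();
            doc.add(Field.Text("contents", body.toString()));
            writer.addDocument(doc);
            writer.close();
        }
    }

With the default limit, a search for "term15000" in this sketch would find nothing; after raising maxFieldLength it matches, which is exactly the end-of-document behaviour reported in this thread.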
Re: Problem Indexing Large Document Field
Thanks, James... That solved the problem.

On May 26, 2004, at 4:15 PM, James Dunn wrote:
> Look at the IndexWriter class. It has a property, maxFieldLength [...]

Gilberto Rodriguez
Software Engineer
Re: Problem Indexing Large Document Field
Gilberto,

Look at the IndexWriter class. It has a property, maxFieldLength, which you can set to determine the maximum number of terms that will be indexed for a single field.

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html

Jim

--- Gilberto Rodriguez <[EMAIL PROTECTED]> wrote:
> I am trying to index a field in a Lucene document with about 90,000
> characters. The problem is that it only indexes part of the document. [...]
Problem Indexing Large Document Field
I am trying to index a field in a Lucene document with about 90,000 characters. The problem is that it only indexes part of the document. It seems to index only about 65,000 characters. So if I search on terms that are at the beginning of the text, the search works, but it fails for terms that are at the end of the document.

Is there a limitation on how many characters can be stored in a document field? Any help would be appreciated, thanks.

Gilberto Rodriguez
Software Engineer
370 CenterPointe Circle, Suite 1178
Altamonte Springs, FL 32701-3451
407.339.1177 (Ext.112) • phone
407.339.6704 • fax
[EMAIL PROTECTED] • email
www.conviveon.com • web
AW: Problem indexing Spanish Characters
Hi all,

Martin was right. I just adapted the HTML demo as Wallen recommended and it worked. Now I only have to deal with some crazy documents which are UTF-8 encoded but mixed with entities. Does anyone know a class which can translate entities into UTF-8 or any other encoding?

Peter MH
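The JDK of that era has no standard entity-decoding class, so a small hand-rolled decoder is one option. The sketch below handles numeric references and a tiny table of named entities (EntityDecoder is a hypothetical name and the table is deliberately incomplete; a real one would need the full HTML 4 entity list):

    import java.util.HashMap;
    import java.util.Map;

    public class EntityDecoder {
        private static final Map ENTITIES = new HashMap();
        static {
            ENTITIES.put("amp", "&");
            ENTITIES.put("lt", "<");
            ENTITIES.put("gt", ">");
            ENTITIES.put("quot", "\"");
            ENTITIES.put("ntilde", "\u00F1"); // n with tilde
            ENTITIES.put("eacute", "\u00E9"); // e with acute accent
        }

        /** Replaces &name; and &#nnnn; references with the characters they denote. */
        public static String decode(String s) {
            StringBuffer out = new StringBuffer(s.length());
            int i = 0;
            while (i < s.length()) {
                char c = s.charAt(i);
                int semi;
                if (c == '&' && (semi = s.indexOf(';', i)) > i) {
                    String name = s.substring(i + 1, semi);
                    if (name.startsWith("#")) {
                        try {
                            out.append((char) Integer.parseInt(name.substring(1)));
                            i = semi + 1;
                            continue;
                        } catch (NumberFormatException e) {
                            // not a valid decimal reference; emit '&' literally below
                        }
                    } else if (ENTITIES.containsKey(name)) {
                        out.append((String) ENTITIES.get(name));
                        i = semi + 1;
                        continue;
                    }
                }
                out.append(c);
                i++;
            }
            return out.toString();
        }
    }

Since the output is a Java String, it is independent of any byte encoding; decode the entities after reading the file as UTF-8 and the result can be handed straight to the analyzer.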
RE: AW: Problem indexing Spanish Characters
Here is an example method in org.apache.lucene.demo.html HTMLParser that uses a different buffered reader for a different encoding:

    public Reader getReader() throws IOException {
        if (pipeIn == null) {
            pipeInStream = new MyPipedInputStream();
            pipeOutStream = new PipedOutputStream(pipeInStream);
            pipeIn = new InputStreamReader(pipeInStream);
            pipeOut = new OutputStreamWriter(pipeOutStream);
            // check the first 4 bytes for the FFFE marker; if it is there, we know it is UTF-16 encoding
            if (useUTF16) {
                try {
                    pipeIn = new BufferedReader(new InputStreamReader(pipeInStream, "UTF-16"));
                } catch (Exception e) {
                }
            }
            Thread thread = new ParserThread(this);
            thread.start(); // start parsing
        }
        return pipeIn;
    }
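The useUTF16 check in the demo can also be expressed as a small byte-order-mark sniffer. This sketch (BomSniffer is a hypothetical name, not part of the demo) inspects the first two bytes and pushes them back so the decoder still sees them:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.PushbackInputStream;
    import java.io.Reader;

    public class BomSniffer {
        /** Returns a Reader decoding as UTF-16 if a BOM (FEFF or FFFE) is present, else UTF-8. */
        public static Reader open(InputStream in) throws IOException {
            PushbackInputStream pin = new PushbackInputStream(in, 2);
            int b0 = pin.read();
            int b1 = pin.read();
            boolean utf16 = (b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE);
            // Push the inspected bytes back in order; Java's UTF-16 decoder
            // consumes the BOM itself and picks the right byte order from it.
            if (b1 != -1) pin.unread(b1);
            if (b0 != -1) pin.unread(b0);
            return new InputStreamReader(pin, utf16 ? "UTF-16" : "UTF-8");
        }
    }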
RE: AW: Problem indexing Spanish Characters
The tokenizers deal with Unicode characters (CharStream, char), so the problem is not there. This problem must be solved at the point where the bytes from your source files are turned into CharSequences/Strings, i.e. by connecting an InputStreamReader to your FileInputStream (or whatever you're using) and specifying "UTF-8" (or whatever encoding is appropriate) in the InputStreamReader constructor. You must either detect the encoding from HTTP headers or XML declarations or, if you know that it is the same for all of your source files, just hardcode UTF-8, for example.

Martin
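A minimal sketch of the wiring Martin describes, assuming the Lucene 1.4-era Field.Text(String, Reader) call (the class name, file path parameter, and "contents" field name are illustrative):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class Utf8FileDocument {
        /** Builds a document whose "contents" field is decoded as UTF-8, not the platform default. */
        public static Document fromFile(String path) throws IOException {
            Reader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(path), "UTF-8"));
            Document doc = new Document();
            // Tokenized from the correctly decoded character stream, so accented
            // characters reach the analyzer intact.
            doc.add(Field.Text("contents", reader));
            return doc;
        }
    }

The key point is using FileInputStream plus an explicit charset rather than FileReader, which silently applies the platform default encoding.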
RE: AW: Problem indexing Spanish Characters
Hi,

I had a quick look at the sandbox, but my problem is that I don't need a Spanish stemmer. However, there must be a replacement tokenizer that supports foreign characters to go along with the foreign-language Snowball stemmers. Does anyone know where I could find one?

In answer to Peter's question: yes, I'm also using UTF-8 encoded XML documents as the source. Below is an example of what happens when I tokenize the text using the StandardTokenizer.

--text
I'm trying to index: century palace known as la "Fundación Hospital de Na. Señora del Pilar"

--tokens output by StandardTokenizer
century palace known as la â FundaciÃ* n * Hospital de Na Seà * ora * del Pilar â

>From: "Peter M Cipollone" <[EMAIL PROTECTED]>
>could you send some sample text that causes this to happen?

Thanks,
Hannah Cumming
AW: Problem indexing Spanish Characters
Hi Hannah, Otis,

I cannot help, but I have exactly the same problem with special German characters. I used the Snowball analyser, but that does not help, because the problem (tokenizing) appears before the analyser comes into action. I just posted the question "Problem tokenizing UTF-8 with German umlauts" a few minutes ago; it describes my problem, and Hannah's seems to be similar. Are your pages also UTF-8 encoded?

Peter MH
Re: Problem indexing Spanish Characters
It looks like the Snowball project supports Spanish: http://www.google.com/search?q=snowball+spanish

If it does, take a look at the Lucene Sandbox. There is a project that allows you to use Snowball analyzers with Lucene.

Otis

--- Hannah c <[EMAIL PROTECTED]> wrote:
> I am indexing a number of English articles on Spanish resorts. As such
> there are a number of Spanish characters throughout the text [...]
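For reference, a sketch of what using that sandbox project might look like, assuming its org.apache.lucene.analysis.snowball.SnowballAnalyzer class (the "spanish-index" path and "contents" field are illustrative). Note that, as discussed elsewhere in this thread, a stemmer only normalizes word endings; it does not by itself fix the StandardTokenizer splitting problem:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

    public class SpanishIndexing {
        public static void main(String[] args) throws Exception {
            // "Spanish" selects the Spanish Snowball stemmer.
            SnowballAnalyzer analyzer = new SnowballAnalyzer("Spanish");

            IndexWriter writer = new IndexWriter("spanish-index", analyzer, true);
            Document doc = new Document();
            doc.add(Field.Text("contents", "la Fundación Hospital de Na. Señora del Pilar"));
            writer.addDocument(doc);
            writer.close();
        }
    }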
Problem indexing Spanish Characters
Hi,

I am indexing a number of English articles on Spanish resorts. As such, there are a number of Spanish characters throughout the text; most of these are in the place names, which are the type of words I would like to use as queries. My problem is with the StandardTokenizer class, which cuts the word in two when it comes across any of the Spanish characters. I had a look at the source, but the code was generated by JavaCC and so is not very readable. I was wondering if there is a way around this problem, or which area of the code I would need to change to avoid this.

Thanks,
Hannah Cumming
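One possible workaround, sketched under the Lucene 1.4-era analysis API: bypass StandardTokenizer and build an analyzer on Lucene's LetterTokenizer, whose Character.isLetter() test accepts accented characters such as ñ and é. AccentFriendlyAnalyzer is a hypothetical name, and the trade-off is losing StandardTokenizer's special handling of acronyms, numbers, and hostnames:

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LetterTokenizer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;

    /** Keeps accented letters inside tokens instead of splitting on them. */
    public class AccentFriendlyAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // LetterTokenizer emits maximal runs of Character.isLetter() chars,
            // so "Fundación" and "Señora" each stay a single token.
            return new LowerCaseFilter(new LetterTokenizer(reader));
        }
    }

The same analyzer must be used at both index and query time, otherwise the place-name queries will tokenize differently from the indexed text and fail to match.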