problem indexing large document collection on windows xp

2004-12-30 Thread Thilo Will
Hello

I encountered a problem when I tried to index large document collections
(about 20 million documents).
The indexing failed with the IOException:

Cannot delete deletables

I tried several times (with the same document collection) and always
received the error, but each time after a different number
of documents.

The exception is thrown after failing to delete the specified file at
line 212 of FSDirectory.java.
I found the following cure:

after the lines

    if (nu.exists())
      if (!nu.delete()) {

I replaced

    throw new IOException("Cannot delete " + to);

with

    while (nu.exists()) {
        nu.delete();
        System.out.println("delete loop");
        try {
            Thread.sleep(5000);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

That is, I now retry deleting the file until it succeeds.

After this change, I was able to index all documents.

Since I observed

  delete loop

on the output console several times, it can be deduced that the
body of the while loop was entered (and left) several times.
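
A safer variant would bound the number of retries instead of looping
forever, for example (sketch only, not the patch I tested; the helper
name and the attempt limit are made up):

    private void deleteWithRetry(File nu, int maxAttempts) throws IOException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (!nu.exists() || nu.delete())
                return;                  // deleted, or already gone
            try {
                Thread.sleep(5000);      // give whoever holds the file time to release it
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        throw new IOException("Cannot delete " + nu + " after " + maxAttempts + " attempts");
    }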


I am running Lucene on Windows XP.

Regards
Thilo







Re: problem indexing large document collection on windows xp

2004-12-30 Thread Bernhard Messer
Thilo,
thanks for your effort. Could you please open a new entry in Bugzilla,
mark it as [PATCH], and attach the diff file with your changes? This ensures
that the sources and the information will not get lost in the huge
universe of mailing lists. As soon as there is time, one of the committers
will review it and decide whether it should be committed.

Bernhard




Problem indexing

2004-10-12 Thread Miguel Angel
Hi, I have a problem indexing in the path C:\TXT\DOC\,
but indexing in the path C:\TXT works fine.

What is the problem?

P.S. If anybody on the list speaks Spanish, please reply. Thank you.

-- 
Miguel Angel Angeles R.
Connectivity and Server Consulting
Tel. 97451277




Problem Indexing Large Document Field

2004-05-26 Thread Gilberto Rodriguez
I am trying to index a field in a Lucene document with about 90,000
characters. The problem is that it only indexes part of the document.
It seems to index only about 65,000 characters. So, if I search on terms
that are at the beginning of the text, the search works, but it fails
for terms that are at the end of the document.

Is there a limitation on how many characters can be stored in a
document field? Any help would be appreciated. Thanks.

Gilberto Rodriguez
Software Engineer

370 CenterPointe Circle, Suite 1178
Altamonte Springs, FL 32701-3451

407.339.1177 (Ext.112) • phone
407.339.6704 • fax
[EMAIL PROTECTED] • email
www.conviveon.com • web

This e-mail contains legally privileged and confidential information 
intended only for the individual or entity named within the message. If 
the reader of this message is not the intended recipient, or the agent 
responsible to deliver it to the intended recipient, the recipient is 
hereby notified that any review, dissemination, distribution or copying 
of this communication is prohibited. If this communication was received 
in error, please notify me by reply e-mail and delete the original 
message.



Re: Problem Indexing Large Document Field

2004-05-26 Thread James Dunn
Gilberto,

Look at the IndexWriter class.  It has a property,
maxFieldLength, which you can set to control the maximum
number of terms that will be indexed for a field.

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html

Jim
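
For example, something like this (untested sketch; the index path and the
placeholder variable are made up, and maxFieldLength must be set before
the document is added):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
    writer.maxFieldLength = 1000000;   // raise the 10,000-term default
    Document doc = new Document();
    doc.add(Field.Text("contents", yourLargeText));  // hypothetical 90,000-char string
    writer.addDocument(doc);
    writer.close();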










Re: Problem Indexing Large Document Field

2004-05-26 Thread Gilberto Rodriguez
Thanks, James... That solved the problem.




RE: Problem Indexing Large Document Field

2004-05-26 Thread wallen
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH

maxFieldLength

public int maxFieldLength

The maximum number of terms that will be indexed for a single field in a
document. This limits the amount of memory required for indexing, so that
collections with very large files will not crash the indexing process by
running out of memory. Note that this effectively truncates large documents,
excluding from the index terms that occur further in the document. If you
know your source documents are large, be sure to set this value high enough
to accommodate the expected size. If you set it to Integer.MAX_VALUE, then
the only limit is your memory, but you should anticipate an OutOfMemoryError.

By default, no more than 10,000 terms will be indexed for a field.






Re: Problem Indexing Large Document Field

2004-05-26 Thread Gilberto Rodriguez
Yep, that was the problem... I just needed to increase the
maxFieldLength value.

Thanks...




AW: Problem indexing Spanish Characters

2004-05-21 Thread PEP AD Server Administrator
Hi all,
Martin was right. I just adapted the HTML demo as Wallen recommended, and it
worked. Now I only have to deal with some crazy documents which are UTF-8
encoded mixed with entities.
Does anyone know a class which can translate entities into UTF-8 or any
other encoding?

Peter MH
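
For the numeric character references at least (things like &#241; for ñ),
a small hand-rolled method can do the job; named entities like &ntilde;
would still need a lookup table. A minimal sketch (method name made up):

    static String decodeNumericEntities(String s) {
        StringBuffer out = new StringBuffer();
        int i = 0;
        while (i < s.length()) {
            int semi;
            if (s.charAt(i) == '&' && i + 2 < s.length() && s.charAt(i + 1) == '#'
                    && (semi = s.indexOf(';', i)) > i + 2) {
                String num = s.substring(i + 2, semi);
                try {
                    // &#xF1; is hex, &#241; is decimal
                    char c = (char) (num.startsWith("x") || num.startsWith("X")
                            ? Integer.parseInt(num.substring(1), 16)
                            : Integer.parseInt(num));
                    out.append(c);
                    i = semi + 1;
                    continue;
                } catch (NumberFormatException e) {
                    // not a valid numeric reference; copy the '&' literally below
                }
            }
            out.append(s.charAt(i));
            i++;
        }
        return out.toString();
    }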





Problem indexing Spanish Characters

2004-05-19 Thread Hannah c
Hi,
I am indexing a number of English articles on Spanish resorts. As such,
there are a number of Spanish characters throughout the text, most of them
in place names, which are exactly the kind of words I would like to use as
queries. My problem is with the StandardTokenizer class, which cuts a word
in two when it comes across any of the Spanish characters. I had a look at
the source, but the code was generated by JavaCC and so is not very readable.
I was wondering if there is a way around this problem, or which area of the
code I would need to change to avoid it.

Thanks
Hannah Cumming



Re: Problem indexing Spanish Characters

2004-05-19 Thread Otis Gospodnetic
It looks like the Snowball project supports Spanish:
http://www.google.com/search?q=snowball spanish

If it does, take a look at Lucene Sandbox.  There is a project that
allows you to use Snowball analyzers with Lucene.

Otis
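
Usage would look something like this (untested sketch; assumes the sandbox
SnowballAnalyzer class and "Spanish" as the stemmer name):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // SnowballAnalyzer wraps the standard tokenizer with a Snowball stemmer
    Analyzer analyzer = new SnowballAnalyzer("Spanish");
    IndexWriter writer = new IndexWriter("index", analyzer, true);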






AW: Problem indexing Spanish Characters

2004-05-19 Thread PEP AD Server Administrator
Hi Hannah, Otis,
I cannot help, but I have exactly the same problem with special German
characters. I used the Snowball analyzer, but it does not help because the
problem (tokenizing) occurs before the analyzer comes into action.
I just posted the question "Problem tokenizing UTF-8 with german umlauts"
a few minutes ago, which describes my problem, and Hannah's seems to be
similar. Do you also have UTF-8 encoded pages?

Peter MH





RE: AW: Problem indexing Spanish Characters

2004-05-19 Thread Hannah c
Hi,
I had a quick look at the sandbox, but my problem is that I don't need a
Spanish stemmer. However, there must be a replacement tokenizer that supports
foreign characters to go along with the foreign-language Snowball stemmers.
Does anyone know where I could find one?

In answer to Peter's question: yes, I'm also using UTF-8 encoded XML
documents as the source.
Below is an example of what happens when I tokenize the text using the
StandardTokenizer.

Thanks Hannah

--text I'm trying to index
century palace known as la “Fundación Hospital de Na. Señora del Pilar”
-tokens output from StandardTokenizer
century
palace
known
as
la
â
FundaciÃ*
n   *
Hospital
de
Na
Seà *
ora   *
del
Pilar
â
---

From: Peter M Cipollone [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Re: Problem indexing Spanish Characters
Date: Wed, 19 May 2004 11:41:28 -0400
Could you send some sample text that causes this to happen?






Hannah Cumming
[EMAIL PROTECTED]




RE: AW: Problem indexing Spanish Characters

2004-05-19 Thread Martin Remy
The tokenizers deal with Unicode characters (CharStream, char), so the
problem is not there.  This problem must be solved at the point where the
bytes from your source files are turned into CharSequences/Strings, i.e. by
using an InputStreamReader in place of your FileReader (or whatever you're
using) and specifying "UTF-8" (or whatever encoding is appropriate) in the
InputStreamReader constructor.

You must either detect the encoding from HTTP headers or XML declarations
or, if you know that it's the same for all of your source files, then just
hardcode UTF-8, for example.

Martin
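
For example (minimal sketch; hard-codes UTF-8, and the file name is made up):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // decode the raw bytes as UTF-8 instead of the platform default encoding
    Reader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream("article.xml"), "UTF-8"));
    Document doc = new Document();
    doc.add(Field.Text("contents", reader));  // tokenized from correctly decoded chars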


RE: AW: Problem indexing Spanish Characters

2004-05-19 Thread wallen
Here is an example method from the HTMLParser class in
org.apache.lucene.demo.html that uses a different buffered reader for a
different encoding.

public Reader getReader() throws IOException
{
    if (pipeIn == null)
    {
        pipeInStream = new MyPipedInputStream();
        pipeOutStream = new PipedOutputStream(pipeInStream);
        pipeIn = new InputStreamReader(pipeInStream);
        pipeOut = new OutputStreamWriter(pipeOutStream);
        // check the first 4 bytes for the FFFE marker; if it's there,
        // we know it's UTF-16 encoding
        if (useUTF16)
        {
            try
            {
                pipeIn = new BufferedReader(
                        new InputStreamReader(pipeInStream, "UTF-16"));
            }
            catch (Exception e)
            {
                // keep the default-encoding reader if UTF-16 is unavailable
            }
        }
        Thread thread = new ParserThread(this);
        thread.start(); // start parsing
    }
    return pipeIn;
}
