RE: Lucene1.4.1 + OutOf Memory
Hi Guys, apologies. I am NOT using the sorting code

    hits = multiSearcher.search(query, new Sort(new SortField("filename", SortField.STRING)));

but plain multiSearcher.search(query) in the Core Files setup, and I am still getting the error. More advice required.

Karthik

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 10, 2004 12:46 PM
To: Lucene Users List
Subject: Re: Lucene1.4.1 + OutOf Memory

There is a memory leak in the sorting code of Lucene 1.4.1. 1.4.2 has the fix!

--- Karthik N S [EMAIL PROTECTED] wrote:

Hi Guys, apologies.

History:
1st setup: 4 subindexes + MultiSearcher + search on the content field only, for 2000 hits = Exception [Too many files open]
2nd setup: 40 merged indexes [1000 subindexes each] + MultiSearcher/ParallelSearcher + search on the content field only, for 2 hits = Exception [OutOfMemory]

System config [same for both setups]:
AMD processor [high-end, single]
RAM: 1 GB
OS: Linux (Gentoo)
Appserver: Tomcat 5.05
JDK: IBM Blackdown-1.4.1-01 (== JDK 1.4.1)

The index contains 15 fields. The search is done on only 1 field, 11 corresponding fields are retrieved, and 3 fields hold debug details. I switched from the 1st setup to the 2nd.

Can somebody suggest why this is happening? Thanks in advance.

WITH WARM REGARDS
HAVE A NICE DAY
[N.S.KARTHIK]
RE: Lucene1.4.1 + OutOf Memory
An exception "too many files open" means:
- the searcher object is not closed after query execution, or
- the process has too few file handles available.

Regards,
J.

Karthik N S wrote on 10.11.2004 09:41:

[quoted thread snipped; see the previous message]
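A minimal sketch of the first point - closing the searcher in a finally block so its file handles are released even when the search throws. The index path and field names here are examples, not from the thread:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class SearchAndClose {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = null;
            try {
                searcher = new IndexSearcher("/path/to/index"); // example path
                Hits hits = searcher.search(new TermQuery(new Term("contents", args[0])));
                // Consume the hits before closing; Hits fetches documents lazily.
                for (int i = 0; i < hits.length(); i++) {
                    System.out.println(hits.doc(i).get("filename"));
                }
            } finally {
                // Release the index's file handles even if the search throws.
                if (searcher != null) searcher.close();
            }
        }
    }

For the second point, on Linux the per-process limit on open file handles can be inspected and raised with ulimit -n.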
Re: Locking issue
On Nov 10, 2004, at 2:17 AM, [EMAIL PROTECTED] wrote:

> Otis or Erik, do you know if a Reader continuously opening should cause the Writer to fail with a "Lock obtain timed out" error?

No need to address individuals here. With the information provided, I have no idea what the issue may be. There certainly is no issue with reading and writing to an index at the same time, but only one process can be writing to/deleting from the index at a time.

Erik

> --- Lucene Users List [EMAIL PROTECTED] wrote:
> The attached Java file shows a locking issue that occurs with Lucene. One thread opens and closes an IndexReader. The other thread opens an IndexWriter, adds a document, and then closes the IndexWriter. I would expect that this app should be able to run happily without any issues. It fails with:
>
>     java.io.IOException: Lock obtain timed out
>
> Is this expected? I thought a Reader could be opened while a Writer is adding a document. Any help is appreciated.
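The attachment did not survive the archive; a minimal reconstruction of the scenario described (index path, analyzer, and field name are assumptions) looks like this:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    public class LockTest {
        static final String INDEX = "/tmp/lock-test-index"; // example path

        public static void main(String[] args) throws Exception {
            // Create the index once so the reader thread has something to open.
            new IndexWriter(INDEX, new StandardAnalyzer(), true).close();

            // One thread repeatedly opens and closes an IndexReader.
            Thread readerThread = new Thread() {
                public void run() {
                    try {
                        while (true) {
                            IndexReader reader = IndexReader.open(INDEX);
                            reader.close();
                        }
                    } catch (Exception e) { e.printStackTrace(); }
                }
            };

            // The other thread repeatedly opens an IndexWriter, adds a document,
            // and closes the writer.
            Thread writerThread = new Thread() {
                public void run() {
                    try {
                        while (true) {
                            IndexWriter writer = new IndexWriter(INDEX, new StandardAnalyzer(), false);
                            Document doc = new Document();
                            doc.add(Field.Text("contents", "hello world"));
                            writer.addDocument(doc);
                            writer.close();
                        }
                    } catch (Exception e) { e.printStackTrace(); }
                }
            };

            readerThread.start();
            writerThread.start();
        }
    }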
RE: Lucene1.4.1 + OutOf Memory
Hi Guys, apologies. That's why somebody on the forum asked me to switch to: 40 merged indexes [1000 subindexes each] + MultiSearcher/ParallelSearcher + search on the content field only, for 2 hits. The "too many files open" problem was solved, since there are now only 40 merged indexes [1 merged index holds 1000 subindexes] instead of 4 subindexes. Now I am getting an OutOfMemory exception. Any idea how to solve this problem?

Thanks in advance.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 10, 2004 2:16 PM
To: Lucene Users List
Subject: RE: Lucene1.4.1 + OutOf Memory

An exception "too many files open" means:
- the searcher object is not closed after query execution, or
- the process has too few file handles available.

[rest of quoted thread snipped]
Re: Lucene1.4.1 + OutOf Memory
On Nov 10, 2004, at 1:55 AM, Karthik N S wrote:

> Hi Guys, apologies.

No need to apologize for asking questions.

> History: 1st setup: 4 subindexes + MultiSearcher + search on the content field

You've got 40,000 indexes aggregated under a MultiSearcher and you're wondering why you're running out of memory?! :O

> Exception [Too many files open]

Are you using the compound file format?

Erik
RE: Lucene1.4.1 + OutOf Memory
Hi all,

I had a similar problem with JDK 1.4.1. Doug had sent me a patch, which I am attaching. The following is the mail from Doug:

It sounds like the ThreadLocal in TermInfosReader is not getting correctly garbage collected when the TermInfosReader is collected. Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is that you're running in an older JVM. Is that right? I've attached a patch which should fix this. Please tell me if it works for you.

Doug

Daniel Taurat wrote:

Okay, that (1.4rc3) worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. Now I will go for the final test on the production server with the 1.4rc3 version and about 40,000 objects.

Daniel

Daniel Taurat schrieb:

Hi all, here is some update for you: I switched back to Lucene 1.3-final, and now the number of SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then drops back to 254 after indexing my 1900 test objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced...

Daniel

Rupinder Singh Mazara schrieb:

Hi all, I had a similar problem. I have a database of documents with 24 fields and an average content of 7K; with 16M+ records I had to split the job into slabs of 1M each and merge the resulting indexes. Submissions to our job queue looked like:

    java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22

and I still had an OutOfMemory exception. The solution I created was, after every 200K documents, to create a temp directory and merge them together. This was done for the first production run; updates are now being handled incrementally.

    Exception in thread "main" java.lang.OutOfMemoryError
        at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code))
        at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code))
        at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code))
        at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code))
        at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code))
        at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code))
        at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code))
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code))
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code))
        at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
        at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
        at lucene.Indexer.main(CDBIndexer.java:168)

-----Original Message-----
From: Daniel Taurat [mailto:[EMAIL PROTECTED]]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

Hi Pete, good hint, but we actually do have 4 GB of physical memory on the system. But then: we have also experienced that the gc of the IBM JDK 1.3.1 that we use sometimes behaves strangely with too large a heap space anyway (the limit seems to be 1.2 GB). I can say that gc is not collecting these objects, since I forced gc runs every now and then while indexing (when parsing pdf-type objects, that is): no effect.

regards,
Daniel

Pete Lewis wrote:

Hi all, reading the thread with interest, there is another way I've come across out-of-memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance), plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously, therefore, if this is the case, by reducing the heap space you can improve performance and get rid of the out-of-memory errors.

Cheers,
Pete Lewis

----- Original Message -----
From: Daniel Taurat [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

Daniel Aber schrieb:

On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

> I am facing an out of memory problem using Lucene 1.4.1.

Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing.

Regards,
Daniel

Well, it seems not to be files; it looks more like those SegmentTermEnum objects accumulating in memory. I've seen some discussion on these objects in the developer newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to
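Rupinder's slab-and-merge workaround can be sketched roughly as follows; the class and path names are assumptions, and dummy documents stand in for the real 16M-record source:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class SlabIndexer {
        static final int TOTAL = 1000000;    // stand-in for the real corpus size
        static final int SLAB_SIZE = 200000; // per the post: a new slab every 200K docs

        public static void main(String[] args) throws Exception {
            int slabs = 0;
            IndexWriter writer = null;
            for (int i = 0; i < TOTAL; i++) {
                if (i % SLAB_SIZE == 0) { // start a new slab in its own directory
                    if (writer != null) writer.close();
                    writer = new IndexWriter("/tmp/slab" + slabs++, new StandardAnalyzer(), true);
                }
                Document doc = new Document(); // dummy document for the sketch
                doc.add(Field.Text("contents", "document number " + i));
                writer.addDocument(doc);
            }
            writer.close();

            // Merge the slabs into one final index; addIndexes also optimizes it.
            Directory[] dirs = new Directory[slabs];
            for (int i = 0; i < slabs; i++)
                dirs[i] = FSDirectory.getDirectory("/tmp/slab" + i, false);
            IndexWriter merged = new IndexWriter("/tmp/final-index", new StandardAnalyzer(), true);
            merged.addIndexes(dirs);
            merged.close();
        }
    }

This keeps any single writer from growing without bound, at the cost of a merge pass at the end.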
stopword AND validword throws exception
Hi! I've left custom stopwords out of my index using StopAnalyzer(customstopwords). Now, when I try to search the index the same way (StopAnalyzer(customstopwords)), it seems to act strangely.

This query works as expected: validword AND stopword (throws out the stopword part and searches for validword)
This query seems to crash: stopword AND validword (java.lang.ArrayIndexOutOfBoundsException: -1)

Maybe it can't handle the case where it has to remove the very first part of the query?! Can anyone else test this for me? How can I overcome this problem? (lucene-1.4-final.jar)

Thanks for your time!
Sanyi
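A minimal sketch that should reproduce the report on Lucene 1.4 final; the field name and stopword list are assumptions:

    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class StopwordCrash {
        public static void main(String[] args) throws Exception {
            String[] stopwords = { "stopword" }; // hypothetical custom list
            StopAnalyzer analyzer = new StopAnalyzer(stopwords);

            // Works: the stopword is not the first clause.
            Query ok = QueryParser.parse("validword AND stopword", "contents", analyzer);
            System.out.println(ok);

            // Reportedly throws ArrayIndexOutOfBoundsException: -1 on 1.4 final,
            // because the very first clause of the query gets removed.
            Query crash = QueryParser.parse("stopword AND validword", "contents", analyzer);
            System.out.println(crash);
        }
    }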
Re: Searching in keyword field ?
Thanks Justin, it works fine.

----- Original Message -----
From: Justin Swanhart [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, November 09, 2004 7:41 PM
Subject: Re: Searching in keyword field ?

You can add the category keyword to a document multiple times. Instead of separating your categories with a delimiter, just add the keyword multiple times:

    doc.add(Field.Keyword("category", "ABC"));
    doc.add(Field.Keyword("category", "DEF GHI"));

On Tue, 9 Nov 2004 17:18:19 +0100, Thierry Ferrero (Itldev.info) wrote:

Hi all, can I search for only one word in a keyword field which contains a few words? I know a keyword field isn't tokenized. After many tests, I think it is impossible. Can someone confirm this? Why don't I use a text field? Because the users pick the category from a list (e.g. category ABC, category DEF GHI, category JKL ...) and the keyword field 'category' can contain several terms (ABC, DEF GHI, OPQ RST). I use a SnowballAnalyzer for text fields in indexing. Perhaps the better way for me is to use a text field with the value "ABC DEF_GHI JKL_NOPQ", where categories are concatenated with a '_'.

Thanks for your reply!
Thierry.
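Searching such a multi-valued keyword field then matches each stored value exactly. A minimal sketch (the index path is an example):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class CategorySearch {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/index"); // example path
            // Matches documents given the exact keyword value "DEF GHI",
            // regardless of what other "category" values they also carry.
            Hits hits = searcher.search(new TermQuery(new Term("category", "DEF GHI")));
            System.out.println(hits.length() + " matching documents");
            searcher.close();
        }
    }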
Re: stopword AND validword throws exception
On Wednesday 10 November 2004 10:46, Sanyi wrote:

> This query seems to crash: stopword AND validword (java.lang.ArrayIndexOutOfBoundsException: -1)

I think this has been fixed in the development version (which will become Lucene 1.9).

Regards,
Daniel
--
http://www.danielnaber.de
Re: stopword AND validword throws exception
Sanyi writes:

> This query works as expected: validword AND stopword (throws out the stopword part and searches for validword)
> This query seems to crash: stopword AND validword (java.lang.ArrayIndexOutOfBoundsException: -1)
> Maybe it can't handle the case where it has to remove the very first part of the query?! Can anyone else test this for me? How can I overcome this problem?

see bug: http://issues.apache.org/bugzilla/show_bug.cgi?id=9110

Morus
Re: stopword AND validword throws exception
Thanx for your replies, guys. Now, I was trying to locate the latest patch for this problem group, and the last thread I've read about this is: http://issues.apache.org/bugzilla/show_bug.cgi?id=25820

It ends with an open question from Morus: "If you want me to change the patch, let me know. That's no big deal."

Did you change the patch since then? In other words: what is the latest development on this topic? Can I simply download the latest compiled development version of lucene.jar, and will it fix my problem? The latest builds I could find are these: http://cvs.apache.org/builds/jakarta-lucene/nightly/2003-09-09/

They seem to be quite old, so please help me out!

Thanx,
Sanyi

--- Morus Walter [EMAIL PROTECTED] wrote:

[quoted message snipped; see the previous message]
RE: Lucene1.4.1 + OutOf Memory
Hi Guys, apologies.

Yes Erik, since the day I switched from Lucene 1.3.1 to Lucene 1.4.1 we have been using the compound file format:

    writer.setUseCompoundFile(true);

Some more advice please. Thanks in advance.

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 10, 2004 3:05 PM
To: Lucene Users List
Subject: Re: Lucene1.4.1 + OutOf Memory

[quoted message snipped; see the previous message]
RE: Lucene1.4.1 + OutOf Memory
Hi Rupinder Singh Mazara,

Apologies. Can you paste the code into the mail instead of an attachment? [I am not able to get the attachment on the company's mail.]

Thanks in advance,
Karthik

-----Original Message-----
From: Rupinder Singh Mazara [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 10, 2004 3:10 PM
To: Lucene Users List
Subject: RE: Lucene1.4.1 + OutOf Memory

[quoted thread snipped; see the earlier message]
RE: Lucene1.4.1 + OutOf Memory
Karthik,

I think the core problem in your case is the use of compound files. It would be best to switch that off, or alternatively to issue an optimize as soon as the indexing is over.

I am copying the file contents between <file> tags; the patch is to be applied to TermInfosReader.java. This was done to help with out-of-memory exceptions while indexing.

<file>
Index: src/java/org/apache/lucene/index/TermInfosReader.java
===================================================================
RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/index/TermInfosReader.java,v
retrieving revision 1.9
diff -u -r1.9 TermInfosReader.java
--- src/java/org/apache/lucene/index/TermInfosReader.java	6 Aug 2004 20:50:29 -0000	1.9
+++ src/java/org/apache/lucene/index/TermInfosReader.java	10 Sep 2004 17:46:47 -0000
@@ -45,6 +45,11 @@
     readIndex();
   }

+  protected final void finalize() {
+    // patch for pre-1.4.2 JVMs, whose ThreadLocals leak
+    enumerators.set(null);
+  }
+
   public int getSkipInterval() { return origEnum.skipInterval; }
</file>

However, Tomcat does react in strange ways to too many open files. Try to restrict the number of IndexReader or Searchable objects that you create while doing searches. I usually keep one object to handle all my user requests:

    public static Searcher fetchCitationSearcher(HttpServletRequest request) throws Exception {
        Searcher rval = (Searcher) request.getSession().getServletContext().getAttribute("luceneSearchable");
        if (rval == null) {
            rval = new IndexSearcher(fetchCitationReader(request));
            request.getSession().getServletContext().setAttribute("luceneSearchable", rval);
        }
        return rval;
    }

-----Original Message-----
From: Karthik N S [mailto:[EMAIL PROTECTED]]
Sent: 10 November 2004 11:41
To: Lucene Users List
Subject: RE: Lucene1.4.1 + OutOf Memory

[quoted thread snipped; see the earlier messages]
Re: stopword AND validword throws exception
Sanyi writes:

> Thanx for your replies, guys. Now, I was trying to locate the latest patch for this problem group, and the last thread I've read about this is: http://issues.apache.org/bugzilla/show_bug.cgi?id=25820
> It ends with an open question from Morus: "If you want me to change the patch, let me know. That's no big deal."
> Did you change the patch since then?

No. But this is an independent issue from the `stopword AND word' problem. The `stopword AND word' problem just has to be taken care of in that context also. Bug 25820 is basically about better handling of AND and OR in a query: currently `a AND b OR c AND d' equals `a AND b AND c AND d' in the query parser.

> Can I simply download the latest compiled development version of lucene.jar and will it fix my problem?

If there are no current nightly builds, I guess you will have to get the sources from CVS directly. But the fix seems to be included in 1.4.2; see item 5 of http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.96.2.4

Morus
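The mixed AND/OR flattening Morus mentions can be sidestepped by building the clauses programmatically instead of parsing; a minimal sketch (the field name is an assumption):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class GroupedBoolean {
        public static void main(String[] args) {
            // Build (a AND b) OR (c AND d) explicitly.
            // add(query, required, prohibited): required clauses are ANDed.
            BooleanQuery left = new BooleanQuery();
            left.add(new TermQuery(new Term("contents", "a")), true, false);
            left.add(new TermQuery(new Term("contents", "b")), true, false);

            BooleanQuery right = new BooleanQuery();
            right.add(new TermQuery(new Term("contents", "c")), true, false);
            right.add(new TermQuery(new Term("contents", "d")), true, false);

            BooleanQuery query = new BooleanQuery();
            query.add(left, false, false);  // optional clause: OR semantics
            query.add(right, false, false);

            System.out.println(query.toString("contents"));
        }
    }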
Re: stopword AND validword throws exception
> But the fix seems to be included in 1.4.2; see item 5 of
> http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.96.2.4

Thank you! I'm just downloading 1.4.2. I hope it'll work ;)

Sanyi
Re: Filters for Openoffice File Indexing available (Java)
On Monday 08 November 2004 11:30, Joachim Arrasz wrote:

> So now we are looking for search and index filters for Lucene that are able to integrate our OpenOffice files into the search results as well.

I don't know of any existing solutions, but it's not so difficult to write one: extract the ZIP file using Java's built-in ZIP classes and parse content.xml and meta.xml. I'm not sure if whitespace issues might become tricky, e.g. two paragraphs could be in the file as <p>one</p><p>two</p>, but for indexing a whitespace needs to be inserted between them (<p> was just an example, I don't know what OpenOffice.org actually uses).

Regards,
Daniel
--
http://www.danielnaber.de
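A minimal sketch of this approach, using java.util.zip and a SAX handler that appends a space at element boundaries; error handling is omitted, and meta.xml could be parsed the same way:

    import java.io.InputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;

    public class OOoTextExtractor {
        public static String extract(String file) throws Exception {
            final StringBuffer text = new StringBuffer();
            ZipFile zip = new ZipFile(file); // e.g. a .sxw document
            ZipEntry entry = zip.getEntry("content.xml");
            InputStream in = zip.getInputStream(entry);
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(in, new DefaultHandler() {
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
                public void endElement(String uri, String local, String qName) {
                    text.append(' '); // avoid gluing adjacent paragraphs together
                }
            });
            zip.close();
            return text.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(extract(args[0]));
        }
    }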
Re: Filters for Openoffice File Indexing available (Java)
Hi Daniel,

> I don't know of any existing solutions, but it's not so difficult to write one: extract the ZIP file using Java's built-in ZIP classes and parse content.xml and meta.xml. [...]

That seems to be not so hard, but I have never developed something like that, so I think I need a tutorial for doing this. Why should I parse meta.xml? I thought content.xml would be enough.

Thanks a lot.
Bye,
Achim
Re: Indexing within an XML document
Redirecting to lucene-user, which is more appropriate.

I'm not sure what exactly the question is here, but: parse your XML document, and for each p element you encounter create a new Document instance, then populate its fields with some data, like the URI data you mentioned. If you parse with DOM, just walk the node tree and make a new Document whenever you encounter an element you want as a separate Document. If you are using the SAX API, you'll probably want some logic in the start/endElement and characters methods. When you reach the end of the element, you are done with your Document instance, so add it to the IndexWriter instance that you opened once, before the parser. When you are done with the whole XML document, close the IndexWriter.

Otis

--- Murray Altheim wrote:

Hi, I'm trying to develop a class to handle an XML document where the contents aren't so much indexed on a per-document basis, but rather on a per-element basis. Each element has a unique ID, so I'm looking to create a class/method similar to Lucene's Document.Document(). By way of example, I'll use some XHTML markup to illustrate what I'm trying to do:

    <html>
      <base href="http://purl.org/ceryle/blat.xml"/>
      [...]
      <body>
        <p id="p1"> some text to index... </p>
        <p id="p2"> some more text to index... </p>
        <p id="p3"> even more text to index... </p>
      </body>
    </html>

I'd very much appreciate any help in explaining how I'd go about creating a method to return a Lucene Document to index this via ID. Would I want a separate Document per <p>? (There are many thousands of such elements.) Everything in my system, both at the document and the individual element level, is done via URL, so the method should create URLs for each <p> element like:

    http://purl.org/ceryle/blat.xml#p1
    http://purl.org/ceryle/blat.xml#p2
    http://purl.org/ceryle/blat.xml#p3

etc. I don't need anyone to go to the trouble of coding this, just point me to how it might be done, or to any existing examples that do this kind of thing. Thanks very much!

Murray
..........................................................................
Murray Altheim                      http://kmi.open.ac.uk/people/murray/
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK

[signature quotation snipped]
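A SAX-based sketch of Otis's suggestion - one Lucene Document per p element, with a fragment URL built from the id attribute. Field names, paths, and the analyzer are assumptions:

    import java.io.File;
    import javax.xml.parsers.SAXParserFactory;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class ElementIndexer extends DefaultHandler {
        private final IndexWriter writer;
        private final String baseUri;
        private StringBuffer buf; // non-null while inside a <p> element
        private String id;

        public ElementIndexer(IndexWriter writer, String baseUri) {
            this.writer = writer;
            this.baseUri = baseUri;
        }

        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("p".equals(qName)) {
                id = atts.getValue("id");
                buf = new StringBuffer();
            }
        }

        public void characters(char[] ch, int start, int length) {
            if (buf != null) buf.append(ch, start, length);
        }

        public void endElement(String uri, String local, String qName) throws SAXException {
            if ("p".equals(qName) && buf != null) {
                try {
                    Document doc = new Document();
                    doc.add(Field.Keyword("url", baseUri + "#" + id)); // e.g. ...blat.xml#p1
                    doc.add(Field.Text("contents", buf.toString()));
                    writer.addDocument(doc);
                } catch (java.io.IOException e) {
                    throw new SAXException(e);
                }
                buf = null;
            }
        }

        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/tmp/xml-index", new StandardAnalyzer(), true);
            ElementIndexer handler = new ElementIndexer(writer, "http://purl.org/ceryle/blat.xml");
            SAXParserFactory.newInstance().newSAXParser().parse(new File(args[0]), handler);
            writer.optimize();
            writer.close();
        }
    }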
Re: Filters for Openoffice File Indexing available (Java)
On Wednesday 10 November 2004 15:18, Joachim Arrasz wrote:

> Why should I parse meta.xml? I thought content.xml would be enough.

It contains the file's title, keywords, author, etc. (those are not in content.xml).

Regards,
Daniel
--
http://www.danielnaber.de
Indexing MS Files
I need to index Word, Excel and PowerPoint files. Is this the place to start? http://jakarta.apache.org/poi/ Is there something better?

Thanks,
Luke
Re: Indexing MS Files
That's one place to start. The other one would be textmining.org, at least for Word files. I used both the POI and Textmining APIs in "Lucene in Action", and the latter was much simpler to use. You can also find some comments about both libs in the lucene-user archives. People tend to like the Textmining API better.

Otis

--- Luke Shannon wrote:

> I need to index Word, Excel and PowerPoint files. Is this the place to start? http://jakarta.apache.org/poi/ Is there something better?
Re: Indexing MS Files
Thanks Otis. I am looking forward to this book. Any idea when it may be released?

----- Original Message -----
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 11:54 AM
Subject: Re: Indexing MS Files

[quoted message snipped; see the previous message]
Merging multiple indexes
What's the simplest way to merge 2 or more indexes into one large index?

Thanks in advance,
Ravi.
Re: Locking issue
> No need to address individuals here.

Sorry about that. I just respect the knowledge that you and Otis have about Lucene, so that's why I was asking you specifically.

> With the information provided, I have no idea what the issue may be.

Running the small sample file that is attached to the original message shows how the issue is generated. It takes less than 5 minutes to occur on both Windows XP and Mac OS X.

> There certainly is no issue reading and writing to an index at the same time, but only one process can be writing/deleting from the index at a time.

That's what I thought. I'm seeing otherwise, though.

--- Lucene Users List wrote:

[quoted message snipped; see the earlier message]
Re: Indexing MS Files
As Manning Publications said, you should be able to get it for your grandma this Christmas.

Otis

--- Luke Shannon wrote:

> Thanks Otis. I am looking forward to this book. Any idea when it may be released?

[rest of quoted thread snipped]
Re: Merging multiple indexes
Use IndexWriter's addIndexes(Directory[]) call.

Otis

--- Ravi wrote:

> What's the simplest way to merge 2 or more indexes into one large index?
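A minimal sketch of that call; the directory paths are examples:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeIndexes {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/tmp/merged-index", new StandardAnalyzer(), true);
            Directory[] dirs = {
                FSDirectory.getDirectory("/tmp/index1", false),
                FSDirectory.getDirectory("/tmp/index2", false)
            };
            writer.addIndexes(dirs); // merges the source indexes and optimizes the target
            writer.close();
        }
    }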
Re: Indexing MS Files
I used the OpenOffice API to convert all Word and Excel versions. For me it's the solution for complex Word and Excel documents. http://api.openoffice.org/

Good luck!

    // UNO API
    import com.sun.star.bridge.XUnoUrlResolver;
    import com.sun.star.uno.XComponentContext;
    import com.sun.star.uno.UnoRuntime;
    import com.sun.star.frame.XComponentLoader;
    import com.sun.star.frame.XStorable;
    import com.sun.star.beans.PropertyValue;
    import com.sun.star.beans.XPropertySet;
    import com.sun.star.lang.XComponent;
    import com.sun.star.lang.XMultiComponentFactory;
    import com.sun.star.connection.NoConnectException;
    import com.sun.star.io.IOException;
    import javax.servlet.ServletContext;

    /** This class implements an http servlet in order to convert an incoming document
     *  with the help of a running OpenOffice.org and to push the converted file back
     *  to the client. */
    public class DocConverter {

        private String stringHost;
        private String stringPort;
        private Xcontext xcontext;
        private Xbase xbase;

        public DocConverter(Xbase xbase, Xcontext xcontext, ServletContext sc) {
            this.xbase = xbase;
            this.xcontext = xcontext;
            stringHost = ApplicationUtil.getParameter(sc, "openoffice.oohost");
            stringPort = ApplicationUtil.getParameter(sc, "openoffice.ooport");
        }

        public synchronized String convertToTxt(String namedoc, String pathdoc,
                String stringConvertType, String stringExtension) {
            String stringConvertedFile = this.convertDocument(namedoc, pathdoc,
                stringConvertType, stringExtension);
            return stringConvertedFile;
        }

        /** This method converts a document to a given type by using a running
         *  OpenOffice.org and saves the converted document to the specified
         *  working directory.
         *  @param stringConvertType Type to convert to.
         *  @param stringExtension This string will be appended to the file name of the converted file.
         *  @return The full path name of the converted file will be returned.
         *  @see stringWorkingDirectory
         */
        private String convertDocument(String namedoc, String pathdoc,
                String stringConvertType, String stringExtension) {
            String tagerr = "";
            String stringUrl = "";
            String stringConvertedFile = "";
            // Converting the document to the favoured type
            try {
                tagerr = "0";
                // Composing the URL - removing the extension
                stringUrl = pathdoc + "/" + namedoc;
                stringUrl = stringUrl.replace('\\', '/');

                /* Bootstraps a component context with the jurt base components
                   registered. Component context to be granted to a component for
                   running. Arbitrary values can be retrieved from the context. */
                XComponentContext xcomponentcontext =
                    com.sun.star.comp.helper.Bootstrap.createInitialComponentContext(null);

                /* Gets the service manager instance to be used (or null). This method
                   has been added for convenience, because the service manager is an
                   often used object. */
                XMultiComponentFactory xmulticomponentfactory =
                    xcomponentcontext.getServiceManager();
                tagerr = "2";

                /* Creates an instance of the component UnoUrlResolver which supports
                   the services specified by the factory. */
                Object objectUrlResolver = xmulticomponentfactory.createInstanceWithContext(
                    "com.sun.star.bridge.UnoUrlResolver", xcomponentcontext);

                // Create a new url resolver
                XUnoUrlResolver xurlresolver = (XUnoUrlResolver) UnoRuntime.queryInterface(
                    XUnoUrlResolver.class, objectUrlResolver);

                // Resolves an object that is specified as follows:
                // uno:<connection description>;<protocol description>;<initial object name>
                Object objectInitial = xurlresolver.resolve(
                    "uno:socket,host=" + stringHost + ",port=" + stringPort +
                    ";urp;StarOffice.ServiceManager");

                // Create a service manager from the initial object
                xmulticomponentfactory = (XMultiComponentFactory) UnoRuntime.queryInterface(
                    XMultiComponentFactory.class, objectInitial);

                // Query for the XPropertySet interface.
                XPropertySet xpropertysetMultiComponentFactory = (XPropertySet)
                    UnoRuntime.queryInterface(XPropertySet.class, xmulticomponentfactory);

                // Get the default context from the office server.
                Object objectDefaultContext =
                    xpropertysetMultiComponentFactory.getPropertyValue("DefaultContext");

                // Query for the interface XComponentContext.
                xcomponentcontext = (XComponentContext) UnoRuntime.queryInterface(
                    XComponentContext.class, objectDefaultContext);

                /* A desktop environment contains tasks with one or more frames in
                   which components can be loaded. Desktop is the environment for
                   components which can instantiate within frames. */
                XComponentLoader xcomponentloader = (XComponentLoader)
                    UnoRuntime.queryInterface(XComponentLoader.class,
                        xmulticomponentfactory.createInstanceWithContext(
                            "com.sun.star.frame.Desktop", xcomponentcontext));

                // Preparing properties for
Re: Indexing MS Files
Thanks. Grandmas around the world will certainly be surprised this Christmas.

----- Original Message -----
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 12:18 PM
Subject: Re: Indexing MS Files

[quoted thread snipped; see the earlier messages]
Academic Question About Indexing
I am working on debugging an existing Lucene implementation. Before I started, I built a demo to understand Lucene. In my demo I indexed the entire content hierarchy all at once, then optimized this index and used it for queries. It was time consuming but very simple.

The code I am currently trying to fix indexes the content hierarchy by folder, creating a separate index for each one. Thus it ends up with a bunch of indexes. I still don't understand how this works (I am assuming they get merged somewhere that I haven't tracked down yet), but I have noticed it doesn't always index the right folder. This results in the users reporting inconsistent searching behavior after they make a change to a document.

To keep things simple I would like to remove all the logic that figures out which folder to index and just do them all (usually fewer than 1000 files), so I end up with one index. Would indexing time be the only area where I would lose out, or is there something more to the approach of creating multiple indexes and merging them?

What is a good approach I can take to indexing a content hierarchy composed primarily of pdf, xsl, doc and xml, where any of these documents can be changed several times a day?

Thanks,
Luke
Re: Indexing MS Files
This looks great. Thank you Thierry!

----- Original Message -----
From: Thierry Ferrero [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 12:23 PM
Subject: Re: Indexing MS Files

[quoted code snipped; see the previous message]
Re: Academic Question About Indexing
Uh, I hate to market it, but it's in the book. But you don't have to wait for it, as there already is a Lucene demo that does what you described. I am not sure if the demo always recreates the index or whether it deletes and re-adds only the new and modified files, but if it's the former, you would only need to modify the demo a little bit to check the timestamps of the File objects and compare them to those stored in the index (if they are being stored - if not, you should add a field to hold that data).

Otis

--- Luke Shannon wrote:

[quoted message snipped; see the previous message]
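The delete-then-re-add step Otis describes might look roughly like this - a sketch only, where the "path" field and the dummy contents are assumptions, not the demo's actual code:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class Reindexer {
        static final String INDEX = "/tmp/content-index"; // example path

        // Replace a changed file in the index: delete the old copy by its
        // path, then add a freshly built Document for the new version.
        public static void reindex(File file) throws Exception {
            IndexReader reader = IndexReader.open(INDEX);
            reader.delete(new Term("path", file.getPath())); // deletions go through IndexReader in 1.4
            reader.close();

            IndexWriter writer = new IndexWriter(INDEX, new StandardAnalyzer(), false);
            Document doc = new Document();
            doc.add(Field.Keyword("path", file.getPath()));
            doc.add(Field.Keyword("modified", String.valueOf(file.lastModified())));
            doc.add(Field.Text("contents", "extracted text goes here")); // stand-in for a real extractor
            writer.addDocument(doc);
            writer.close();
        }
    }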
Re: Academic Question About Indexing
Don't worry, regardless of what I learn in this forum I am telling my company to get me a copy of that bad boy when it comes out (which as far as I am concerned can't be soon enough). I will pay for grandma's myself.

I think I have reviewed the code you are referring to and have something similar working in my own indexer (using the uid). All is well.

My stupid question for the day is: why would you ever want multiple indexes running if you can build one smart indexer that does everything as efficiently as possible? Does the answer to this question move me into multi-threaded indexing territory?

Thanks,
Luke

----- Original Message -----
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 2:08 PM
Subject: Re: Academic Question About Indexing

[quoted message snipped; see the previous message]
Re: Locking issue
Hi, With the information provided, I have no idea what the issue may be. Is there some information that I should post that will help determine why Lucene is giving me this error? Thanks. --- Lucene Users List [EMAIL PROTECTED] wrote: On Nov 10, 2004, at 2:17 AM, [EMAIL PROTECTED] wrote: Otis or Erik, do you know if a Reader continuously opening should cause the Writer to fail with a Lock obtain timed out error? No need to address individuals here. With the information provided, I have no idea what the issue may be. There certainly is no issue reading and writing to an index at the same time, but only one process can be writing/deleting from the index at a time. Erik --- Lucene Users List [EMAIL PROTECTED] wrote: The attached Java file shows a locking issue that occurs with Lucene. One thread opens and closes an IndexReader. The other thread opens an IndexWriter, adds a document and then closes the IndexWriter. I would expect that this app should be able to run happily without any issues. It fails with: java.io.IOException: Lock obtain timed out Is this expected? I thought a Reader could be opened while a Writer is adding a document. Any help is appreciated.
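[The attachment itself never made it to the archive (see the follow-ups below); what follows is only a reconstruction of the pattern described, against 1.4-era APIs, with an illustrative index path. Under Lucene 1.4 both sides briefly take the commit lock, so tight loops like these can starve the writer into "Lock obtain timed out":]

    // Reconstruction (not the poster's attachment): one thread cycles an
    // IndexReader while another cycles an IndexWriter on the same index.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    public class LockContention {
        static final String INDEX = "/tmp/lock-test"; // illustrative path

        public static void main(String[] args) throws Exception {
            // Create the index once so both threads can open it.
            new IndexWriter(INDEX, new StandardAnalyzer(), true).close();

            new Thread() { // reader thread: open/close in a tight loop
                public void run() {
                    try {
                        while (true) {
                            IndexReader r = IndexReader.open(INDEX);
                            r.close();
                        }
                    } catch (Exception e) { e.printStackTrace(); }
                }
            }.start();

            while (true) { // writer thread: one doc per open/close cycle
                IndexWriter w = new IndexWriter(INDEX, new StandardAnalyzer(), false);
                Document d = new Document();
                d.add(Field.Text("contents", "hello world"));
                w.addDocument(d);
                w.close(); // eventually: java.io.IOException: Lock obtain timed out
            }
        }
    }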
Re: Locking issue
On Nov 10, 2004, at 5:48 PM, [EMAIL PROTECTED] wrote: Hi, With the information provided, I have no idea what the issue may be. Is there some information that I should post that will help determine why Lucene is giving me this error? You mentioned posting code - though I don't recall getting an attachment. If you could post it as a Bugzilla issue with your code attached, it would be preserved outside of our mailboxes. If the code is self-contained enough for me to try it, I will at some point in the near future. Erik
RE: Academic Question About Indexing
I have an application that I run monthly that indexes 40 million documents into 6 indexes, then uses a MultiSearcher. The advantage for me is that I can have multiple writers each indexing 1/6 of the total data, reducing the time it takes to index by about 5X. -Original Message- From: Luke Shannon [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 10, 2004 2:39 PM To: Lucene Users List Subject: Re: Academic Question About Indexing
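[A hedged sketch of that setup - several independently written indexes searched through one MultiSearcher. The shard paths, field name, and query string are illustrative assumptions:]

    // Search N separately built indexes as one logical index.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;

    public class ShardedSearch {
        public static void main(String[] args) throws Exception {
            Searchable[] shards = new Searchable[6];
            for (int i = 0; i < shards.length; i++) {
                shards[i] = new IndexSearcher("/indexes/shard" + i); // assumed paths
            }
            MultiSearcher searcher = new MultiSearcher(shards);
            Hits hits = searcher.search(
                QueryParser.parse("lucene", "contents", new StandardAnalyzer()));
            System.out.println(hits.length() + " hits");
            searcher.close();
        }
    }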
Re: Locking issue
Whoops! Looks like my attachment didn't make it through. I'm re-attaching my simple test app. Thanks. --- Erik Hatcher [EMAIL PROTECTED] wrote: You mentioned posting code - though I don't recall getting an attachment. If you could post it as a Bugzilla issue with your code attached, it would be preserved outside of our mailboxes.
Re: Locking issue
I added it to Bugzilla like you suggested: http://issues.apache.org/bugzilla/show_bug.cgi?id=32171 Let me know if you see any way to get around this issue. --- Lucene Users List [EMAIL PROTECTED] wrote: Whoops! Looks like my attachment didn't make it through. I'm re-attaching my simple test app. Thanks.
Re: Locking issue
I just ran the code you provided. On my puny PowerBook (Mac OS X 10.3.5) it dies in much less than 5 minutes. I do not know what the issue is, but certainly the actions the program is taking are atypical. Opening and closing an IndexWriter repeatedly is certainly expensive on large indexes. Indexing documents in batches is more typical, I presume. Also, maybe you need to put some sleep into the code to give the JVM a chance to catch its breath? Does that alleviate the issue? Erik On Nov 10, 2004, at 8:02 PM, [EMAIL PROTECTED] wrote: I added it to Bugzilla like you suggested: http://issues.apache.org/bugzilla/show_bug.cgi?id=32171
Re: Locking issue
I just added a Thread.sleep(1000) in the writer thread and it has run for quite some time, and is still running as I send this. Erik On Nov 10, 2004, at 8:02 PM, [EMAIL PROTECTED] wrote: I added it to Bugzilla like you suggested: http://issues.apache.org/bugzilla/show_bug.cgi?id=32171
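[Relative to the reconstruction sketched earlier in this thread, Erik's change amounts to throttling the writer loop - roughly as below, with the index path and document contents again placeholders:]

    // Throttled writer: the sleep gives the competing reader thread a
    // window to obtain and release the lock between write cycles.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ThrottledWriter {
        public static void main(String[] args) throws Exception {
            while (true) {
                IndexWriter w = new IndexWriter("/tmp/lock-test", new StandardAnalyzer(), false);
                Document d = new Document();
                d.add(Field.Text("contents", "hello world"));
                w.addDocument(d);
                w.close();
                Thread.sleep(1000); // the pause Erik added
            }
        }
    }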
Query#rewrite Question
Hello, Our program accepts input in the form of Lucene query syntax from the user, but we wish to perform additional tasks such as thesaurus expansion. So I want to manipulate the Query object that results from parsing. My question is: is the result of the Query#rewrite method guaranteed to be either a TermQuery, a PhraseQuery, or a BooleanQuery, and if it is a BooleanQuery, do all the constituent clauses also reduce to one of the above three classes? If not, what if the original Query object was the one obtained from the QueryParser#parse method - can I assume the above in this restricted case? I experimented with the current version, and the answer seems to be positive in this version; I'm asking whether this could change in the future. Thank you.
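[For concreteness, a sketch of the kind of traversal in question - not an answer about future guarantees; the instanceof checks encode exactly the assumption being asked about:]

    // Rewrite a parsed query, then walk the result assuming it reduces to
    // TermQuery / PhraseQuery / BooleanQuery (the assumption under discussion).
    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class RewriteWalker {
        public static void walk(Query q, IndexReader reader) throws IOException {
            Query r = q.rewrite(reader); // expands wildcard/prefix/range queries
            if (r instanceof BooleanQuery) {
                BooleanClause[] clauses = ((BooleanQuery) r).getClauses();
                for (int i = 0; i < clauses.length; i++) {
                    walk(clauses[i].query, reader); // recurse into sub-clauses
                }
            } else if (r instanceof TermQuery) {
                System.out.println("term: " + ((TermQuery) r).getTerm());
            } else if (r instanceof PhraseQuery) {
                System.out.println("phrase: " + r.toString());
            } else {
                System.out.println("other: " + r.getClass().getName()); // assumption fails here
            }
        }
    }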
Re: Using Lucene to store document
Hi Otis, Please let me know what the HEAD version of Lucene is. Actually, I'm considering the advantages of storing documents using Lucene stored fields - for my search engine. I've tested with thousands of documents and seen that retrieving a document (in this case an XML file) with Lucene is a little bit faster than using the filesystem. But I cannot test with a large amount of data to have an accurate comparison. So, can Lucene support millions of documents and still retrieve them at an appropriate speed? Nhan
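[The approach under discussion, sketched with illustrative names - the raw XML goes into a stored-but-unindexed field so the body can come back from the index rather than the filesystem:]

    // Store the raw XML alongside the indexed content, then retrieve it.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class StoredXml {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/tmp/xml-index", new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Keyword("id", "doc-1"));                              // indexed + stored
            doc.add(Field.UnIndexed("xml", "<doc><title>example</title></doc>")); // stored only
            doc.add(Field.UnStored("contents", "example"));                     // indexed only
            writer.addDocument(doc);
            writer.close();

            IndexSearcher searcher = new IndexSearcher("/tmp/xml-index");
            Hits hits = searcher.search(new TermQuery(new Term("id", "doc-1")));
            System.out.println(hits.doc(0).get("xml")); // body comes from the index
            searcher.close();
        }
    }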
Search scalability
We have one large index for a document repository of 800,000 documents. The size of the index is 800MB. When we do searches against the index, it takes 300-500ms for a single search. We wanted to test scalability and tried 100 parallel searches against the index with the same query; the average response time was 13 seconds. We used a simple IndexSearcher, and the same searcher object was shared by all the searches. I'm sure people have had success configuring Lucene for better scalability. Can somebody share their approach? Thanks Ravi.
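[A hedged reconstruction of that benchmark, with the index path, field, and query as placeholders. Sharing one IndexSearcher across threads, as Ravi did, is the right pattern - searching is thread-safe:]

    // 100 threads firing the same query at one shared IndexSearcher.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class ParallelSearchTest {
        public static void main(String[] args) throws Exception {
            final IndexSearcher searcher = new IndexSearcher("/indexes/repository");
            final Query query = QueryParser.parse("test", "contents", new StandardAnalyzer());
            for (int i = 0; i < 100; i++) {
                new Thread() {
                    public void run() {
                        try {
                            long start = System.currentTimeMillis();
                            searcher.search(query);
                            System.out.println((System.currentTimeMillis() - start) + " ms");
                        } catch (Exception e) { e.printStackTrace(); }
                    }
                }.start();
            }
        }
    }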
Re: Search scalability
Hello, 100 parallel searches going against a single index on a single disk means a lot of disk seeks all happening at once. One simple way of working around this is to load your FSDirectory into a RAMDirectory. This should be faster (could you report your observations/comparisons?). You can also try using ramfs if you are using Linux. Otis --- Ravi [EMAIL PROTECTED] wrote: We have one large index for a document repository of 800,000 documents. The size of the index is 800MB.
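[Otis's suggestion as a sketch, with an illustrative index path. Note the whole index is copied into the JVM heap, so -Xmx must comfortably exceed the index size - which bears on the follow-up question below:]

    // Copy the on-disk index into RAM, then search the in-memory copy.
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamSearch {
        public static void main(String[] args) throws Exception {
            RAMDirectory ram =
                new RAMDirectory(FSDirectory.getDirectory("/indexes/repository", false));
            IndexSearcher searcher = new IndexSearcher(ram);
            // ... run queries against 'searcher' as usual
        }
    }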
Re: Using Lucene to store document
Hello, HEAD version means that you should check out Lucene straight out of CVS. How to work with CVS is another story, probably described somewhere on the jakarta.apache.org site. Otis --- Nhan Nguyen Dang [EMAIL PROTECTED] wrote: Hi Otis, Please let me know what the HEAD version of Lucene is.
Re: Locking issue
Yes, I tried that too and it worked. The issue is that our Operations folks plan to install this on a pretty busy box, and I was hoping that Lucene wouldn't cause issues if it only had a small slice of the CPU. Guess I'll tell them to buy a bigger box! Unless you have any other ideas. I'm running some tests with a larger timeout to see if that helps. --- Erik Hatcher [EMAIL PROTECTED] wrote: I just added a Thread.sleep(1000) in the writer thread and it has run for quite some time, and is still running as I send this. Erik
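[The "larger timeout" knob, if memory of the 1.4 codebase serves, is a pair of mutable statics on IndexWriter, also settable via system properties read when the class loads - verify against your exact Lucene jar before relying on this:]

    // Assumed 1.4-era knobs -- confirm these statics exist in your version.
    import org.apache.lucene.index.IndexWriter;

    public class LockTimeouts {
        public static void raise() {
            IndexWriter.WRITE_LOCK_TIMEOUT = 10000;   // default 1000 ms
            IndexWriter.COMMIT_LOCK_TIMEOUT = 30000;  // default 10000 ms
            // Equivalently, before IndexWriter is first loaded:
            //   -Dorg.apache.lucene.writeLockTimeout=10000
            //   -Dorg.apache.lucene.commitLockTimeout=30000
        }
    }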
Re: Search scalability
Does it take 800MB of RAM to load that index into a RAMDirectory? Or are only some of the files loaded into RAM? --- Otis Gospodnetic [EMAIL PROTECTED] wrote: Hello, 100 parallel searches going against a single index on a single disk means a lot of disk seeks all happening at once. One simple way of working around this is to load your FSDirectory into a RAMDirectory.
Bug in the BooleanQuery optimizer? ..TooManyClauses
Hi! First of all, I've read about BooleanQuery$TooManyClauses, so I know that it has a 1024-clause limit by default, which is good enough for me, but I still think it works strangely. Example: I have an index with about 20 million documents. Let's say that there are about 3000 variants in the entire document set of this word mask: cab* Let's say that about 500 documents contain the word: spectrum Now, when I search for cab* AND spectrum, I don't expect it to throw an exception. It should first restrict the search to the 500 documents containing the word spectrum, then it should collect the variants of cab* within those documents, which turns out to be two or three variants of cab* (cable, cables, maybe some more), and the search should return, let's say, 10 documents. Similar example: When I search for cab* AND nonexistingword it still throws a TooManyClauses exception instead of saying No results, since there is no nonexistingword in my document set, so it doesn't even have to start collecting the variations of cab*. Is there any patch for this issue? Thank you for your time! Sanyi (I'm using: lucene 1.4.2)
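[For what it's worth, 1.4.x expands the wildcard against the whole index before the AND is applied, so no such ordering optimization kicks in. The common workaround, at a memory cost, is to raise the global clause limit - a sketch, with an arbitrary illustrative value:]

    // Raise the clause limit before running wildcard-heavy queries.
    import org.apache.lucene.search.BooleanQuery;

    public class ClauseLimit {
        public static void main(String[] args) {
            BooleanQuery.setMaxClauseCount(10000); // default is 1024
            // ... then parse and run the "cab* AND spectrum" query as usual;
            // catch BooleanQuery.TooManyClauses if the limit can still be hit.
        }
    }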