RE: Lucene1.4.1 + OutOf Memory

2004-11-10 Thread Karthik N S
Hi Guys,

Apologies.


  I am NOT using the sorting code

  hits = multiSearcher.search(query, new Sort(new SortField("filename",
SortField.STRING)));

 but am using multiSearcher.search(query)

 in the Core Files setup, and I am still getting the error.



 More advice required..


Karthik



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 12:46 PM
To: Lucene Users List
Subject: Re: Lucene1.4.1 + OutOf Memory


There is a memory leak in the sorting code of Lucene 1.4.1.
1.4.2 has the fix!

--- Karthik N S [EMAIL PROTECTED] wrote:


 Hi
 Guys

 Apologies..



 History

 1st type :  4 subindexes  +  MultiSearcher  +  search on
 Content Field only, for 2000 hits

=
 Exception  [ Too many Files Open ]


 2nd type :  40 merged indexes [1000 subindexes each]  +
 MultiSearcher/ParallelSearcher  +  search on Content Field only, for 2
 hits

=
 Exception  [ OutOf Memory ]


 System Config  [same for both types]

 AMD Processor [High End, Single]
 RAM  1GB
 O/S  Linux ( jantoo type )
 Appserver  Tomcat 5.05
 JDK [ IBM  Blackdown-1.4.1-01  ( == JDK 1.4.1) ]

 Index contains 15 fields
 Search done only on 1 field
 Retrieve 11 corresponding fields
 3 fields are for debug details


 Switched from 1st type to 2nd type

 Can somebody suggest why this is happening?

 Thx in advance




   WITH WARM REGARDS
   HAVE A NICE DAY
   [ N.S.KARTHIK]








RE: Lucene1.4.1 + OutOf Memory

2004-11-10 Thread iouli . golovatyi

The "too many files open" exception means:
- the searcher object is not closed after query execution
- too few file handles are available
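
A minimal sketch of the first point, assuming a fresh IndexSearcher is opened
per query (the index path is only a placeholder):

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

void searchOnce(Query query) throws java.io.IOException {
    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    try {
        Hits hits = searcher.search(query);
        // read what you need from hits while the searcher is still open
    } finally {
        searcher.close();   // releases the index files this searcher held open
    }
}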

Regards
J.



 


Re: Locking issue

2004-11-10 Thread Erik Hatcher
On Nov 10, 2004, at 2:17 AM, [EMAIL PROTECTED] wrote:
 Otis or Erik, do you know if a Reader continuously opening should cause the
 Writer to fail with a Lock obtain timed out error?

No need to address individuals here.

With the information provided, I have no idea what the issue may be.
There certainly is no issue reading and writing to an index at the same
time, but only one process can be writing/deleting from the index at a
time.

Erik

 --- Lucene Users List [EMAIL PROTECTED] wrote:
 The attached Java file shows a locking issue that occurs with Lucene.
 One thread opens and closes an IndexReader.  The other thread opens an
 IndexWriter, adds a document and then closes the IndexWriter.  I would
 expect that this app should be able to happily run without any issues.
 It fails with:
   java.io.IOException: Lock obtain timed out
 Is this expected?  I thought a Reader could be opened while a
 Writer is adding a document.
 Any help is appreciated.



RE: Lucene1.4.1 + OutOf Memory

2004-11-10 Thread Karthik N S
Hi Guys,


Apologies.


  That's why somebody on the forum asked me to switch to

 : 40 merged indexes [1000 subindexes each]  +  MultiSearcher /
ParallelSearcher  +  search on Content Field only for 2 hits

  The problem of too many files open was solved, since now there were only 40
merged indexes [1 merged index has 1000 subindexes]

  instead of  4 subindexes.

 Now I am getting an Out of Memory exception.


  Any idea on how to solve this problem?



Thx in Advance






-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 2:16 PM
To: Lucene Users List
Subject: RE: Lucene1.4.1 + OutOf Memory



The "too many files open" exception means:
- the searcher object is not closed after query execution
- too few file handles are available

Regards
J.






Re: Lucene1.4.1 + OutOf Memory

2004-11-10 Thread Erik Hatcher
On Nov 10, 2004, at 1:55 AM, Karthik N S wrote:
 Hi Guys

 Apologies..

No need to apologize for asking questions.

 History

 1st type :  4 subindexes  +  MultiSearcher  +  search on Content Field

You've got 40,000 indexes aggregated under a MultiSearcher and you're
wondering why you're running out of memory?!  :O

 Exception  [ Too many Files Open ]

Are you using the compound file format?

Erik


RE: Lucene1.4.1 + OutOf Memory

2004-11-10 Thread Rupinder Singh Mazara
Hi all

 I had a similar problem with jdk1.4.1; Doug had sent me a patch which I am
attaching. Following is the mail from Doug:

 It sounds like the ThreadLocal in TermInfosReader is not getting
correctly garbage collected when the TermInfosReader is collected.
Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is
that you're running in an older JVM.  Is that right?

I've attached a patch which should fix this.  Please tell me if it works
for you.

Doug

Daniel Taurat wrote:
 Okay, that (1.4rc3)worked fine, too!
 Got only 257 SegmentTermEnums for 1900 objects.

 Now I will go for the final test on the production server with the
 1.4rc3 version  and about 40.000 objects.

 Daniel

 Daniel Taurat schrieb:

 Hi all,
 here is some update for you:
 I switched back to Lucene 1.3-final and now the  number of the
 SegmentTermEnum objects is controlled by gc again:
 it goes up to about 1000 and then it is down again to 254 after
 indexing my 1900 test-objects.
 Stay tuned, I will try 1.4RC3 now, the last version before FieldCache
 was introduced...

 Daniel


 Rupinder Singh Mazara schrieb:

 hi all
  I had a similar problem: I have a database of documents with 24
 fields and an average content of 7K, with 16M+ records.

  I had to split the jobs into slabs of 1M each and merge the
 resulting indexes; submissions to our job queue looked like

  java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22

  and I still had an outofmemory exception. The solution that I created
 was, after every 200K documents, to create a temp directory and merge
 them together; this was done to do the first production run. Updates
 are now being handled incrementally.



 Exception in thread "main" java.lang.OutOfMemoryError
  at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code))
  at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code))
  at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code))
  at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code))
  at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code))
  at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code))
  at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code))
  at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code))
  at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code))
  at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
  at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
  at lucene.Indexer.main(CDBIndexer.java:168)



 -Original Message-
 From: Daniel Taurat [mailto:[EMAIL PROTECTED]
 Sent: 10 September 2004 14:42
 To: Lucene Users List
 Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large
 number
 of documents


 Hi Pete,
 good hint, but we actually do have physical memory of  4Gb on the
 system. But then: we also have experienced that the gc of ibm
 jdk1.3.1 that we use is sometimes
 behaving strangely with too large heap space anyway. (Limit seems to
 be 1.2 Gb)
 I can say that gc is not collecting these objects since I  forced gc
 runs when indexing every now and then (when parsing pdf-type
 objects, that is): No effect.

 regards,

 Daniel


 Pete Lewis wrote:



 Hi all

 Reading the thread with interest, there is another way I've come across out
 of memory errors when indexing large batches of documents.

 If you have your heap space settings too high, then you get swapping (which
 impacts performance) plus you never reach the trigger for garbage
 collection, hence you don't garbage collect and hence you run out of memory.

 Can you check whether or not your garbage collection is being triggered?

 Anomalously therefore, if this is the case, by reducing the heap space you
 can improve performance and get rid of the out of memory errors.

 Cheers
 Pete Lewis

 - Original Message -
 From: Daniel Taurat [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Friday, September 10, 2004 1:10 PM
 Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents






 Daniel Aber schrieb:

  On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

   I am facing an out of memory problem using Lucene 1.4.1.

  Could you try with a recent CVS version? There has been a fix about files
  not being deleted after 1.4.1. Not sure if that could cause the problems
  you're experiencing.

  Regards
   Daniel

 Well, it seems not to be files, it looks more like those SegmentTermEnum
 objects accumulating in memory.
 I've seen some discussion on these objects in the developer-newsgroup
 that had taken place some time ago.
 I am afraid this is some kind of runaway caching I have to 

stopword AND validword throws exception

2004-11-10 Thread Sanyi
Hi!

I've left out custom stopwords from my index using the 
StopAnalyzer(customstopwords).
Now, when I try to search the index the same way 
(StopAnalyzer(customstopwords)), it seems to act
strange:

This query works as expected:
validword AND stopword
(throws out the stopword part and searches for validword)

This query seems to crash:
stopword AND validword
(java.lang.ArrayIndexOutOfBoundsException: -1)

Maybe it can't handle the case if it had to remove the very first part of the 
query?!
Can anyone else test this for me? How can I overcome this problem?

(lucene-1.4-final.jar)

Thanks for your time!

Sanyi






Re: Searching in keyword field ?

2004-11-10 Thread Thierry Ferrero
Thanks Justin, it works fine 

- Original Message - 
From: Justin Swanhart [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, November 09, 2004 7:41 PM
Subject: Re: Searching in keyword field ?


 You can add the category keyword multiple times to a document.

 Instead of separating your categories with a delimiter, just add the
 keyword multiple times.

  doc.add(Field.Keyword("category", "ABC"));
  doc.add(Field.Keyword("category", "DEF GHI"));

 On Tue, 9 Nov 2004 17:18:19 +0100, Thierry Ferrero (Itldev.info)
 [EMAIL PROTECTED] wrote:
  Hi All,
 
   Can I search for only one word in a keyword field which contains a few words?
   I know the keyword field isn't tokenized. After many tests, I think it is
   impossible.
   Can someone confirm this for me?
 
   Why don't I use a text field? Because the users know the category from a
   list (ex: category ABC, category DEF GHI, category JKL ...) and the keyword
   field 'category' can contain several terms (ABC, DEF GHI, OPQ RST).
   I use a SnowBallAnalyzer for text fields in indexing.
   Perhaps the better way for me is to use a text field with the value ABC
   DEF_GHI JKL_NOPQ where categories are concatenated with a _.
  Thanks for your reply !
 
  Thierry.
 



Re: stopword AND validword throws exception

2004-11-10 Thread Daniel Naber
On Wednesday 10 November 2004 10:46, Sanyi wrote:

 This query seems to crash:
 stopword AND validword
 (java.lang.ArrayIndexOutOfBoundsException: -1)

I think this has been fixed in the development version (which will become 
Lucene 1.9).

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: stopword AND validword throws exception

2004-11-10 Thread Morus Walter
Sanyi writes:
 
 This query works as expected:
 validword AND stopword
 (throws out the stopword part and searches for validword)
 
 This query seems to crash:
 stopword AND validword
 (java.lang.ArrayIndexOutOfBoundsException: -1)
 
 Maybe it can't handle the case if it had to remove the very first part of the 
 query?!
 Can anyone else test this for me? How can I overcome this problem?
 
see bug:
http://issues.apache.org/bugzilla/show_bug.cgi?id=9110

Morus




Re: stopword AND validword throws exception

2004-11-10 Thread Sanyi
Thanx for your replies guys.

Now, I was trying to locate the latest patch for this problem group, and the 
last thread I've
read about this is:
http://issues.apache.org/bugzilla/show_bug.cgi?id=25820
It ends with an open question from Morus:
If you want me to change the patch, let me know. That no big deal.

Did you change the patch since then?

In other words: What is the latest development in this topic?
Can I simply download the latest compiled development version of lucene.jar and 
will it fix my
problem?

The latest builds I could find are these:
http://cvs.apache.org/builds/jakarta-lucene/nightly/2003-09-09/

It seems to be quite old, so please help me out!

Thanx,
Sanyi

--- Morus Walter [EMAIL PROTECTED] wrote:

 Sanyi writes:
  
  This query works as expected:
  validword AND stopword
  (throws out the stopword part and searches for validword)
  
  This query seems to crash:
  stopword AND validword
  (java.lang.ArrayIndexOutOfBoundsException: -1)
  
  Maybe it can't handle the case if it had to remove the very first part of 
  the query?!
  Can anyone else test this for me? How can I overcome this problem?
  
 see bug:
 http://issues.apache.org/bugzilla/show_bug.cgi?id=9110
 
 Morus
 



RE: Lucene1.4.1 + OutOf Memory

2004-11-10 Thread Karthik N S
Hi Guys,


Apologies..


 Yes, Erik

  Since the day I switched from Lucene 1.3.1 to Lucene 1.4.1 we have been using the
CompoundFile format:


writer.setUseCompoundFile(true);


Some more advice, please.


Thx in advance

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 3:05 PM
To: Lucene Users List
Subject: Re: Lucene1.4.1 + OutOf Memory


On Nov 10, 2004, at 1:55 AM, Karthik N S wrote:

 Hi
 Guys

 Apologies..

No need to apologize for asking questions.

 History

 Ist type :  4  subindexes   +  MultiSearcher  + Search on Content
 Field

You've got 40,000 indexes aggregated under a MultiSearcher and you're
wondering why you're running out of memory?!  :O

 Exception  [ Too many Files Open ]

Are you using the compound file format?

Erik





RE: Lucene1.4.1 + OutOf Memory

2004-11-10 Thread Karthik N S
Hi

  Rupinder Singh Mazara

Apologies



  Can you paste the code into the mail instead of an attachment...

  [ Because I am not able to get the attachment on the company's mail ]


 Thx in advance
Karthik



RE: Lucene1.4.1 + OutOf Memory

2004-11-10 Thread Rupinder Singh Mazara
Karthik

 I think the core problem in your case is the use of compound files; it would
be best to switch it off,
 or alternatively issue an optimize() as soon as the indexing is over.
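
A minimal sketch of both suggestions, assuming the 1.4-style IndexWriter API
(the path and analyzer are only placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
writer.setUseCompoundFile(false);   // first suggestion: no compound files
// ... writer.addDocument(...) for every document ...
writer.optimize();                  // second suggestion: optimize once, after all indexing
writer.close();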

  I am copying the file contents between the file tags; the patch is to be
applied on TermInfosReader.java. This
 was done to help with out of memory exceptions while doing indexing.
  <file>
Index: src/java/org/apache/lucene/index/TermInfosReader.java
===
RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/index/TermInfosReader.java,v
retrieving revision 1.9
diff -u -r1.9 TermInfosReader.java
--- src/java/org/apache/lucene/index/TermInfosReader.java   6 Aug 2004 20:50:29 -  1.9
+++ src/java/org/apache/lucene/index/TermInfosReader.java   10 Sep 2004 17:46:47 -
@@ -45,6 +45,11 @@
     readIndex();
   }
 
+  protected final void finalize() {
+    // patch for pre-1.4.2 JVMs, whose ThreadLocals leak
+    enumerators.set(null);
+  }
+
   public int getSkipInterval() {
     return origEnum.skipInterval;
   }
</file>



 However, Tomcat does react in strange ways to too many open files;
 try to restrict the number of IndexReader or Searchable objects
 that you create while doing searches.
 I usually keep one object to handle all my user requests:

 public static Searcher fetchCitationSearcher(HttpServletRequest request)
     throws Exception {
   Searcher rval = (Searcher)
       request.getSession().getServletContext().getAttribute("luceneSearchable");
   if (rval == null) {
     rval = new IndexSearcher( fetchCitationReader(request) );
     request.getSession().getServletContext().setAttribute("luceneSearchable",
         rval);
   }
   return rval;
 }




-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: 10 November 2004 11:41
To: Lucene Users List
Subject: RE: Lucene1.4.1 + OutOf Memory


Hi

  Rupinder Singh Mazara

Apologies



  Can you paste the code into the mail instead of an attachment...

  [ Because I am not able to get the attachment on the company's mail ]


 Thx in advance
Karthik



Re: stopword AND validword throws exception

2004-11-10 Thread Morus Walter
Sanyi writes:
 Thanx for your replies guys.
 
 Now, I was trying to locate the latest patch for this problem group, and 
 the last thread I've
 read about this is:
 http://issues.apache.org/bugzilla/show_bug.cgi?id=25820
 It ends with an open question from Morus:
 If you want me to change the patch, let me know. That no big deal.
 
 Did you change the patch since then?
 
No. But this is an independent issue from the `stopword AND word' problem.
The `stopword AND word' problem just has to be taken care of in that context
also.
Bug 25820 basically is about better handling of AND and OR in a query.
Currently `a AND b OR c AND d'  equals  `a AND b AND c AND d' in query 
parser.
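
A quick way to see what the parser actually builds for such input, sketched
against the 1.4-era static QueryParser.parse() call (field name and analyzer
are arbitrary):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

Query q = QueryParser.parse("a AND b OR c AND d", "contents", new StandardAnalyzer());
System.out.println(q.toString("contents"));   // prints the boolean structure the parser produced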

 Can I simply download the latest compiled development version of lucene.jar 
 and will it fix my
 problem?
 
If there are no current nightly builds, I guess you will have to get the
sources from cvs directly.

But the fix seems to be included in 1.4.2.
see 
http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.96.2.4
item 5

Morus





Re: stopword AND validword throws exception

2004-11-10 Thread Sanyi
 But the fix seems to be included in 1.4.2.
 see 
 http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.96.2.4
 item 5

Thank you! I'm just downloading 1.4.2.
I hope it'll work ;)

Sanyi







Re: Filters for Openoffice File Indexing available (Java)

2004-11-10 Thread Daniel Naber
On Monday 08 November 2004 11:30, Joachim Arrasz wrote:

 So now we are looking for search and index Filters for Lucene, that
 were able to integrate out OpenOffice Files also into search result.

I don't know of any existing solutions, but it's not so difficult to write 
one: Extract the ZIP file using Java's built-in ZIP classes and parse 
content.xml and meta.xml. I'm not sure if whitespace issues might become 
tricky, e.g. two paragraphs could be in the file as 
<p>one</p><p>two</p>, but for indexing a whitespace needs to be inserted 
between them (<p> was just an example, I don't know what OpenOffice.org 
actually uses).
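
A rough sketch of that approach, using only java.util.zip plus a crude
tag-stripping step (a real filter would run an XML parser over content.xml);
replacing every tag with a space also sidesteps the whitespace concern above:

import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

String extractContent(String path) throws Exception {
    ZipFile zip = new ZipFile(path);                      // OpenOffice files are plain ZIP archives
    ZipEntry entry = zip.getEntry("content.xml");
    InputStream in = zip.getInputStream(entry);
    StringBuffer xml = new StringBuffer();
    int c;
    while ((c = in.read()) != -1) xml.append((char) c);   // note: ignores the file's declared encoding
    zip.close();
    return xml.toString().replaceAll("<[^>]+>", " ");     // keep text, turn every tag into a space
}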

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Filters for Openoffice File Indexing available (Java)

2004-11-10 Thread Joachim Arrasz
Hi Daniel,
I don't know of any existing solutions, but it's not so difficult to write 
one: Extract the ZIP file using Java's built-in ZIP classes and parse 
content.xml and meta.xml. I'm not sure if whitespace issues might become 
tricky, e.g. two paragraphs could be in the file as 
<p>one</p><p>two</p>, but for indexing a whitespace needs to be inserted 
between them (<p> was just an example, I don't know what OpenOffice.org 
actually uses).
 

That seems to be not so hard, but I have never developed something like 
that, so I think I need a tutorial for doing this. Why should I parse 
meta.xml? I thought content.xml should be enough.

Thanks a lot
Bye Achim


Re: Indexing within an XML document

2004-11-10 Thread Otis Gospodnetic
Redirecting to lucene-user, which is more appropriate.

I'm not sure what exactly the question is here, but:

Parse your XML document and for each <p> element you encounter create a
new Document instance, and then populate its fields with some data,
like the URI data you mentioned.
If you parse with DOM - just walk the node tree and make new Document
whenever you encounter an element you want as a separate Document.  If
you are using the SAX API you'll probably want some logic in
start/endElement and characters methods. When you reach the end of the
element you are done with your Document instance, so add it to the
IndexWriter instance that you opened once, before the parser.
When you are done with the whole XML document close the IndexWriter.
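
A bare-bones sketch of the SAX route just described; the field names (url,
contents), the base URI, the file name, and the index path are illustrative only:

import java.io.File;
import java.io.IOException;
import javax.xml.parsers.SAXParserFactory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ParaIndexer {
    public static void main(String[] args) throws Exception {
        final IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        DefaultHandler handler = new DefaultHandler() {
            private String id;
            private StringBuffer text;
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("p".equals(qName)) {          // an element that becomes its own Document
                    id = atts.getValue("id");
                    text = new StringBuffer();
                }
            }
            public void characters(char[] ch, int start, int len) {
                if (text != null) text.append(ch, start, len);
            }
            public void endElement(String uri, String local, String qName) throws SAXException {
                if (!"p".equals(qName)) return;
                try {
                    Document doc = new Document();
                    doc.add(Field.Keyword("url", "http://purl.org/ceryle/blat.xml#" + id));
                    doc.add(Field.Text("contents", text.toString()));
                    writer.addDocument(doc);      // one Lucene Document per p element
                } catch (IOException e) {
                    throw new SAXException(e.getMessage());
                }
                text = null;
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(new File("blat.xml"), handler);
        writer.optimize();
        writer.close();
    }
}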

Otis


--- Murray Altheim [EMAIL PROTECTED] wrote:

 Hi,
 
 I'm trying to develop a class to handle an XML document, where
 the contents aren't so much indexed on a per-document basis,
 rather on an element basis. Each element has a unique ID, so
 I'm looking to create a class/method similar to Lucene's
 Document.Document(). By way of example, I'll use some XHTML
 markup to illustrate what I'm trying to do:
 
 <html>
  <base href="http://purl.org/ceryle/blat.xml"/>
  [...]
  <body>
    <p id="p1">
      some text to index...
    </p>
    <p id="p2">
      some more text to index...
    </p>
    <p id="p3">
      even more text to index...
    </p>
  </body>
 </html>
 
 I'd very much appreciate any help in explaining how I'd go about
 creating a method to return a Lucene Document to index this via
 ID. Would I want a separate Document per p? (There are many
 thousands of such elements.) Everything in my system, both at the
 document and the individual element level is done via URL, so
 the method should create URLs for each p element like
 
 http://purl.org/ceryle/blat.xml#p1
 http://purl.org/ceryle/blat.xml#p2
 http://purl.org/ceryle/blat.xml#p3
 etc.
 
 I don't need anyone to go to the trouble of coding this, just point
 me to how it might be done, or to any existing examples that do this
 kind of thing.
 
 Thanks very much!
 
 Murray
 

..
 Murray Altheim   
 http://kmi.open.ac.uk/people/murray/
 Knowledge Media Institute
 The Open University, Milton Keynes, Bucks, MK7 6AA, UK  
 .
 
 



Re: Filters for Openoffice File Indexing available (Java)

2004-11-10 Thread Daniel Naber
On Wednesday 10 November 2004 15:18, Joachim Arrasz wrote:

  Why should i parse
 meta.xml? I thaught content.xml should be enough.

It contains the file's title, keywords, and author etc (those are not in 
content.xml).

Regards
 Daniel

-- 
http://www.danielnaber.de




Indexing MS Files

2004-11-10 Thread Luke Shannon
I need to index Word, Excel and Power Point files.

Is this the place to start?

http://jakarta.apache.org/poi/

Is there something better?

Thanks,

Luke

Re: Indexing MS Files

2004-11-10 Thread Otis Gospodnetic
That's one place to start.  The other one would be textmining.org, at
least for Word files.
I used both POI and Textmining API in Lucene in Action, and the latter
was much simpler to use.  You can also find some comments about both
libs in lucene-user archives.  People tend to like Textmining API
better.
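
For what it's worth, a rough sketch of the textmining.org route; the
WordExtractor class name and extractText() signature here are from memory,
so treat them as assumptions rather than a confirmed API:

import java.io.FileInputStream;
import org.textmining.text.extraction.WordExtractor;

// Pull the body text out of a .doc file so it can go into a Lucene field.
String bodyText = new WordExtractor().extractText(new FileInputStream("report.doc"));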

Otis

--- Luke Shannon [EMAIL PROTECTED] wrote:

 I need to index Word, Excel and Power Point files.
 
 Is this the place to start?
 
 http://jakarta.apache.org/poi/
 
 Is there something better?
 
 Thanks,
 
 Luke





Re: Indexing MS Files

2004-11-10 Thread Luke Shannon
Thanks Otis. I am looking forward to this book. Any idea when it may be
released?

- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 11:54 AM
Subject: Re: Indexing MS Files


 That's one place to start.  The other one would be textmining.org, at
 least for Word files.
 I used both POI and Textmining API in Lucene in Action, and the latter
 was much simpler to use.  You can also find some comments about both
 libs in lucene-user archives.  People tend to like Textmining API
 better.

 Otis

 --- Luke Shannon [EMAIL PROTECTED] wrote:

  I need to index Word, Excel and Power Point files.
 
  Is this the place to start?
 
  http://jakarta.apache.org/poi/
 
  Is there something better?
 
  Thanks,
 
  Luke





Merging multiple indexes

2004-11-10 Thread Ravi
Whats's the simplest way to merge 2 or more indexes into one large
index. 

Thanks in advance,
Ravi.  






Re: Locking issue

2004-11-10 Thread yahootintin . 1247688
 No need to address individuals here.

Sorry about that.  I just respect the knowledge that you and Otis have about
Lucene, so that's why I was asking you specifically.

 With the information provided, I have no idea what the issue may be.

Running the small sample file that is attached to the original message shows
how the issue is generated.  It takes less than 5 minutes to occur on both
Windows XP and Mac OS X.

 There certainly is no issue reading and writing to an index at the same
 time, but only one process can be writing/deleting from the index at a
 time.

That's what I thought.  I'm seeing otherwise though.






Re: Indexing MS Files

2004-11-10 Thread Otis Gospodnetic
As Manning publications said, you should be able to get it for your
grandma this Christmas.

Otis

--- Luke Shannon [EMAIL PROTECTED] wrote:

 Thanks Otis. I am looking forward to this book. Any idea when it may
 be
 released?
 
 - Original Message - 
 From: Otis Gospodnetic [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Wednesday, November 10, 2004 11:54 AM
 Subject: Re: Indexing MS Files
 
 
  That's one place to start.  The other one would be textmining.org,
 at
  least for Word files.
  I used both POI and Textmining API in Lucene in Action, and the
 latter
  was much simpler to use.  You can also find some comments about
 both
  libs in lucene-user archives.  People tend to like Textmining API
  better.
 
  Otis
 
  --- Luke Shannon [EMAIL PROTECTED] wrote:
 
   I need to index Word, Excel and Power Point files.
  
   Is this the place to start?
  
   http://jakarta.apache.org/poi/
  
   Is there something better?
  
   Thanks,
  
   Luke
 
 
 



Re: Merging multiple indexes

2004-11-10 Thread Otis Gospodnetic
Use IndexWriter's addIndexes(Directory[]) call.
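
A minimal sketch of that call, assuming the source indexes and the target all
live on the filesystem (paths are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

IndexWriter writer = new IndexWriter("/path/to/merged-index", new StandardAnalyzer(), true);
writer.addIndexes(new Directory[] {
    FSDirectory.getDirectory("/path/to/index1", false),
    FSDirectory.getDirectory("/path/to/index2", false)
});                  // copies and merges both source indexes into the new one
writer.optimize();
writer.close();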

Otis

--- Ravi [EMAIL PROTECTED] wrote:

 Whats's the simplest way to merge 2 or more indexes into one large
 index. 
 
 Thanks in advance,
 Ravi.  
 
 
 



Re: Indexing MS Files

2004-11-10 Thread Thierry Ferrero
I used the OpenOffice API to convert all Word and Excel versions.
For me it's the solution for complex Word and Excel documents.
http://api.openoffice.org/
Good luck !

// UNO API
import com.sun.star.bridge.XUnoUrlResolver;
import com.sun.star.uno.XComponentContext;
import com.sun.star.uno.UnoRuntime;
import com.sun.star.frame.XComponentLoader;
import com.sun.star.frame.XStorable;
import com.sun.star.beans.PropertyValue;
import com.sun.star.beans.XPropertySet;
import com.sun.star.lang.XComponent;
import com.sun.star.lang.XMultiComponentFactory;
import com.sun.star.connection.NoConnectException;
import com.sun.star.io.IOException;


/** This class implements a http servlet in order to convert an incoming
document
 * with help of a running OpenOffice.org and to push the converted file back
 * to the client.
 */
public class DocConverter {

 private String stringHost;
 private String stringPort;
 private Xcontext xcontext;
 private Xbase xbase;

 public DocConverter(Xbase xbase,Xcontext xcontext,ServletContext sc) {

  this.xbase=xbase;
  this.xcontext=xcontext;
 stringHost=ApplicationUtil.getParameter(sc,"openoffice.oohost");
 stringPort=ApplicationUtil.getParameter(sc,"openoffice.ooport");
   }

 public synchronized String convertToTxt(String namedoc, String pathdoc,
String stringConvertType, String stringExtension) {

String stringConvertedFile = this.convertDocument(namedoc, pathdoc,
stringConvertType, stringExtension);
  return stringConvertedFile;
 }


 /** This method converts a document to a given type by using a running
 * OpenOffice.org and saves the converted document to the specified
 * working directory.
 * @param stringDocumentName The full path name of the file on the server to
be converted.
 * @param stringConvertType Type to convert to.
 * @param stringExtension This string will be appended to the file name of
the converted file.
 * @return The full path name of the converted file will be returned.
 * @see stringWorkingDirectory
 */
 private String convertDocument(String namedoc, String pathdoc, String
stringConvertType, String stringExtension ) {

  String tagerr="";
 String stringUrl="";
 String stringConvertedFile = "";
 // Converting the document to the favoured type
 try {
   tagerr="0";
   // Composing the URL - suppression de l'extension
   stringUrl = pathdoc+"/"+namedoc;
  stringUrl=stringUrl.replace( '\\', '/' );
  /* Bootstraps a component context with the jurt base components
 registered. Component context to be granted to a component for
running.
 Arbitrary values can be retrieved from the context. */
  XComponentContext xcomponentcontext =
  com.sun.star.comp.helper.Bootstrap.createInitialComponentContext(
null );

  /* Gets the service manager instance to be used (or null). This method
has
 been added for convenience, because the service manager is a often
used
 object. */
  XMultiComponentFactory xmulticomponentfactory =
  xcomponentcontext.getServiceManager();
   tagerr="2";
  /* Creates an instance of the component UnoUrlResolver which
 supports the services specified by the factory. */
  Object objectUrlResolver =
  xmulticomponentfactory.createInstanceWithContext(
  "com.sun.star.bridge.UnoUrlResolver", xcomponentcontext );
   // Create a new url resolver
  XUnoUrlResolver xurlresolver = ( XUnoUrlResolver )
  UnoRuntime.queryInterface( XUnoUrlResolver.class,
  objectUrlResolver );
// Resolves an object that is specified as follow:
  // uno:connection description;protocol description;initial object
name
  Object objectInitial = xurlresolver.resolve(
  "uno:socket,host=" + stringHost + ",port=" + stringPort +
 ";urp;StarOffice.ServiceManager" );

  // Create a service manager from the initial object
  xmulticomponentfactory = ( XMultiComponentFactory )
  UnoRuntime.queryInterface( XMultiComponentFactory.class,
objectInitial );
  // Query for the XPropertySet interface.
  XPropertySet xpropertysetMultiComponentFactory = ( XPropertySet )
  UnoRuntime.queryInterface( XPropertySet.class,
xmulticomponentfactory );
   // Get the default context from the office server.
  Object objectDefaultContext =
  xpropertysetMultiComponentFactory.getPropertyValue(
 "DefaultContext" );

  // Query for the interface XComponentContext.
  xcomponentcontext = ( XComponentContext ) UnoRuntime.queryInterface(
  XComponentContext.class, objectDefaultContext );

  /* A desktop environment contains tasks with one or more
 frames in which components can be loaded. Desktop is the
 environment for components which can instanciate within
 frames. */
  XComponentLoader xcomponentloader = ( XComponentLoader )
  UnoRuntime.queryInterface( XComponentLoader.class,
  xmulticomponentfactory.createInstanceWithContext(
  "com.sun.star.frame.Desktop", xcomponentcontext ) );

  // Preparing properties for 

Re: Indexing MS Files

2004-11-10 Thread Luke Shannon
Thanks. Grandmas around the world will certainly be surprised this
Christmas.

- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 12:18 PM
Subject: Re: Indexing MS Files


 As Manning publications said, you should be able to get it for your
 grandma this Christmas.

 Otis

 --- Luke Shannon [EMAIL PROTECTED] wrote:

  Thanks Otis. I am looking forward to this book. Any idea when it may
  be
  released?
 
  - Original Message - 
  From: Otis Gospodnetic [EMAIL PROTECTED]
  To: Lucene Users List [EMAIL PROTECTED]
  Sent: Wednesday, November 10, 2004 11:54 AM
  Subject: Re: Indexing MS Files
 
 
   That's one place to start.  The other one would be textmining.org,
  at
   least for Word files.
   I used both POI and Textmining API in Lucene in Action, and the
  latter
   was much simpler to use.  You can also find some comments about
  both
   libs in lucene-user archives.  People tend to like Textmining API
   better.
  
   Otis
  
   --- Luke Shannon [EMAIL PROTECTED] wrote:
  
I need to index Word, Excel and Power Point files.
   
Is this the place to start?
   
http://jakarta.apache.org/poi/
   
Is there something better?
   
Thanks,
   
Luke
  
  
  



Academic Question About Indexing

2004-11-10 Thread Luke Shannon
I am working on debugging an existing Lucene implementation.

Before I started, I built a demo to understand Lucene. In my demo I indexed
the entire content hierarchy all at once, then optimized this index and
used it for queries. It was time consuming but very simple.

The code I am currently trying to fix indexes the content hierarchy by
folder, creating a separate index for each one. Thus it ends up with a bunch
of indexes. I still don't understand how this works (I am assuming they get
merged somewhere that I haven't tracked down yet), but I have noticed it doesn't
always index the right folder. This results in the users reporting
inconsistent behavior in searching after they make a change to a document.
To keep things simple I would like to remove all the logic that figures out
which folder to index and just do them all (usually less than 1000 files), so
I end up with one index.

Would indexing time be the only area I would be losing out on, or is there
something more to the approach of creating multiple indexes and merging
them?

What is a good approach I can take to indexing a content hierarchy composed
primarily of pdf, xsl, doc and xml where any of these documents can be
changed several times a day?

Thanks,

Luke






Re: Indexing MS Files

2004-11-10 Thread Luke Shannon
This looks great. Thank you Thierry!


Re: Academic Question About Indexing

2004-11-10 Thread Otis Gospodnetic
Uh, I hate to market it, but it's in the book.  But you don't have
to wait for it, as there already is a Lucene demo that does what you
described.  I am not sure if the demo always recreates the index or
whether it deletes and re-adds only the new and modified files, but if
it's the former, you would only need to modify the demo a little bit to
check the timestamps of File objects and compare them to those stored
in the index (if they are being stored - if not, you should add a field
to hold that data)
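
Roughly, the check amounts to something like the following sketch (against the
1.4-style API; the "path" and "modified" field names are only examples, not
necessarily what the demo uses, and the actual content extraction is left out):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class IncrementalIndexer {

  /** Re-indexes the file only if its timestamp differs from the stored one. */
  public static void indexIfChanged(String indexDir, File f) throws Exception {
    String path = f.getPath();
    String stamp = Long.toString(f.lastModified());

    // Look up the previously indexed copy, if any, and compare timestamps.
    // (The index is assumed to exist already.)
    IndexReader reader = IndexReader.open(indexDir);
    boolean upToDate = false;
    TermDocs td = reader.termDocs(new Term("path", path));
    if (td.next()) {
      Document old = reader.document(td.doc());
      upToDate = stamp.equals(old.get("modified"));
      if (!upToDate) {
        reader.delete(new Term("path", path));   // drop the stale copy
      }
    }
    td.close();
    reader.close();
    if (upToDate) {
      return;                                    // nothing to do
    }

    // Add the fresh copy (parsing of the file contents omitted here).
    IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
    Document doc = new Document();
    doc.add(Field.Keyword("path", path));
    doc.add(Field.Keyword("modified", stamp));
    // doc.add(Field.Text("contents", ...));     // parsed file contents
    writer.addDocument(doc);
    writer.close();
  }
}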

Otis

--- Luke Shannon [EMAIL PROTECTED] wrote:

 I am working on debugging an existing Lucene implementation.
 
 Before I started, I built a demo to understand Lucene. In my demo I
 indexed the entire content hierarchy all at once, then optimized this
 index and used it for queries. It was time consuming but very simple.
 
 The code I am currently trying to fix indexes the content hierarchy by
 folder, creating a separate index for each one. Thus it ends up with a
 bunch of indexes. I still don't understand how this works (I am assuming
 they get merged somewhere that I haven't tracked down yet) but I have
 noticed it doesn't always index the right folder. This results in the
 users reporting inconsistent behavior in searching after they make a
 change to a document. To keep things simple I would like to remove all
 the logic that figures out which folder to index and just do them all
 (usually less than 1000 files) so I end up with one index.
 
 Would indexing time be the only area where I would be losing out, or is
 there something more to the approach of creating multiple indexes and
 merging them?
 
 What is a good approach I can take to indexing a content hierarchy
 composed
 primarily of pdf, xsl, doc and xml where any of these documents can
 be
 changed several times a day?
 
 Thanks,
 
 Luke
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Acedemic Question About Indexing

2004-11-10 Thread Luke Shannon
Don't worry, regardless of what I learn in this forum I am telling my
company to get me a copy of that bad boy when it comes out (which as far as
I am concerned can't be soon enough). I will pay for grama's myself.

I think I have reviewed the code you are referring to and have something
similar working in my own indexer (using the uid). All is well.

My stupid question for the day is why would you ever want multiple indexes
running if you can build one smart indexer that does everything as
efficiently as possible? Does the answer to this question move me to multi
threaded indexing territory?

Thanks,

Luke


- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 2:08 PM
Subject: Re: Acedemic Question About Indexing


 Uh, I hate to market it, but it's in the book.  But you don't have
 to wait for it, as there already is a Lucene demo that does what you
 described.  I am not sure if the demo always recreates the index or
 whether it deletes and re-adds only the new and modified files, but if
 it's the former, you would only need to modify the demo a little bit to
 check the timestamps of File objects and compare them to those stored
 in the index (if they are being stored - if not, you should add a field
 to hold that data)

 Otis

 --- Luke Shannon [EMAIL PROTECTED] wrote:

  I am working on debugging an existing Lucene implementation.
 
  Before I started, I built a demo to understand Lucene. In my demo I
  indexed the entire content hierarchy all at once, then optimized this
  index and used it for queries. It was time consuming but very simple.
 
  The code I am currently trying to fix indexes the content hierarchy by
  folder, creating a separate index for each one. Thus it ends up with a
  bunch of indexes. I still don't understand how this works (I am assuming
  they get merged somewhere that I haven't tracked down yet) but I have
  noticed it doesn't always index the right folder. This results in the
  users reporting inconsistent behavior in searching after they make a
  change to a document. To keep things simple I would like to remove all
  the logic that figures out which folder to index and just do them all
  (usually less than 1000 files) so I end up with one index.
 
  Would indexing time be the only area where I would be losing out, or is
  there something more to the approach of creating multiple indexes and
  merging them?
 
  What is a good approach I can take to indexing a content hierarchy
  composed
  primarily of pdf, xsl, doc and xml where any of these documents can
  be
  changed several times a day?
 
  Thanks,
 
  Luke
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Locking issue

2004-11-10 Thread yahootintin . 1247688
Hi,



 With the information provided, I have no idea what the issue may be.

Is there some information that I should post that will help determine
why Lucene is giving me this error?

Thanks.

--- Lucene Users List [EMAIL PROTECTED] wrote:

 On Nov 10, 2004, at 2:17 AM, [EMAIL PROTECTED] wrote:

   Otis or Erik, do you know if a Reader continuously opening should
   cause the Writer to fail with a Lock obtain timed out error?

 No need to address individuals here.

 With the information provided, I have no idea what the issue may be.
 There certainly is no issue reading and writing to an index at the same
 time, but only one process can be writing/deleting from the index at a
 time.

   Erik

  --- Lucene Users List [EMAIL PROTECTED] wrote:

   The attached Java file shows a locking issue that occurs with Lucene.

   One thread opens and closes an IndexReader.  The other thread opens
   an IndexWriter, adds a document and then closes the IndexWriter.  I
   would expect that this app should be able to happily run without any
   issues.

   It fails with: java.io.IOException: Lock obtain timed out

   Is this expected?  I thought a Reader could be opened while a Writer
   is adding a document.

   Any help is appreciated.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Locking issue

2004-11-10 Thread Erik Hatcher
On Nov 10, 2004, at 5:48 PM, [EMAIL PROTECTED] wrote:
Hi,
With the information provided, I have no
idea what the issue
may be.
Is there some information that I should post that will help determine
why Lucene is giving me this error?
You mentioned posting code - though I don't recall getting an 
attachment.  If you could post it as a Bugzilla issue with your code 
attached, it would be preserved outside of our mailboxes.  If the code 
is self-contained enough for me to try it, I will at some point in the 
near future.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Acedemic Question About Indexing

2004-11-10 Thread Will Allen
I have an application that I run monthly that indexes 40 million documents into
6 indexes, then searches them with a MultiSearcher.  The advantage for me is
that I can have multiple writers, each indexing 1/6 of the total data, reducing
the time it takes to index by about 5X.
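
For reference, the search side of that setup looks more or less like this (a
sketch; the index paths and the "contents" field name are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class PartitionedSearch {
  public static void main(String[] args) throws Exception {
    // One searcher per partition index; each partition was built by its own writer.
    Searchable[] parts = {
        new IndexSearcher("/indexes/part1"),
        new IndexSearcher("/indexes/part2"),
        new IndexSearcher("/indexes/part3")
    };
    MultiSearcher searcher = new MultiSearcher(parts);

    // A single query runs transparently across all partitions.
    Query q = QueryParser.parse("lucene", "contents", new StandardAnalyzer());
    Hits hits = searcher.search(q);
    System.out.println(hits.length() + " hits across all partitions");
    searcher.close();
  }
}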

-Original Message-
From: Luke Shannon [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 2:39 PM
To: Lucene Users List
Subject: Re: Acedemic Question About Indexing


Don't worry, regardless of what I learn in this forum I am telling my
company to get me a copy of that bad boy when it comes out (which as far as
I am concerned can't be soon enough). I will pay for grama's myself.

I think I have reviewed the code you are referring to and have something
similar working in my own indexer (using the uid). All is well.

My stupid question for the day is why would you ever want multiple indexes
running if you can build one smart indexer that does everything as
efficiently as possible? Does the answer to this question move me to multi
threaded indexing territory?

Thanks,

Luke


- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 2:08 PM
Subject: Re: Acedemic Question About Indexing


 Uh, I hate to market it, but it's in the book.  But you don't have
 to wait for it, as there already is a Lucene demo that does what you
 described.  I am not sure if the demo always recreates the index or
 whether it deletes and re-adds only the new and modified files, but if
 it's the former, you would only need to modify the demo a little bit to
 check the timestamps of File objects and compare them to those stored
 in the index (if they are being stored - if not, you should add a field
 to hold that data)

 Otis

 --- Luke Shannon [EMAIL PROTECTED] wrote:

  I am working on debugging an existing Lucene implementation.
 
  Before I started, I built a demo to understand Lucene. In my demo I
  indexed the entire content hierarchy all at once, then optimized this
  index and used it for queries. It was time consuming but very simple.
 
  The code I am currently trying to fix indexes the content hierarchy by
  folder, creating a separate index for each one. Thus it ends up with a
  bunch of indexes. I still don't understand how this works (I am assuming
  they get merged somewhere that I haven't tracked down yet) but I have
  noticed it doesn't always index the right folder. This results in the
  users reporting inconsistent behavior in searching after they make a
  change to a document. To keep things simple I would like to remove all
  the logic that figures out which folder to index and just do them all
  (usually less than 1000 files) so I end up with one index.
 
  Would indexing time be the only area where I would be losing out, or is
  there something more to the approach of creating multiple indexes and
  merging them?
 
  What is a good approach I can take to indexing a content hierarchy
  composed
  primarily of pdf, xsl, doc and xml where any of these documents can
  be
  changed several times a day?
 
  Thanks,
 
  Luke
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Locking issue

2004-11-10 Thread yahootintin-lucene
Whoops!  Looks like my attachment didn't make it through.  I'm
re-attaching my simple test app.

Thanks.

--- Erik Hatcher [EMAIL PROTECTED] wrote:

 On Nov 10, 2004, at 5:48 PM, [EMAIL PROTECTED]
 wrote:
  Hi,
 
  With the information provided, I have no
  idea what the issue
  may be.
 
  Is there some information that I should post that will help
 determine
  why Lucene is giving me this error?
 
 You mentioned posting code - though I don't recall getting an 
 attachment.  If you could post it as a Bugzilla issue with
 your code 
 attached, it would be preserved outside of our mailboxes.  If
 the code 
 is self-contained enough for me to try it, I will at some
 point in the 
 near future.
 
   Erik
 
 

-
 To unsubscribe, e-mail:
 [EMAIL PROTECTED]
 For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Locking issue

2004-11-10 Thread yahootintin . 1247688
I added it to Bugzilla like you suggested:

http://issues.apache.org/bugzilla/show_bug.cgi?id=32171



Let me know if you see any way to get around this issue.



--- Lucene Users List [EMAIL PROTECTED] wrote:

 Whoops!  Looks like my attachment didn't make it through.  I'm
 re-attaching my simple test app.

 Thanks.

 --- Erik Hatcher [EMAIL PROTECTED] wrote:

  On Nov 10, 2004, at 5:48 PM, [EMAIL PROTECTED] wrote:

   Hi,

   With the information provided, I have no idea what the issue may be.

   Is there some information that I should post that will help
   determine why Lucene is giving me this error?

  You mentioned posting code - though I don't recall getting an
  attachment.  If you could post it as a Bugzilla issue with your code
  attached, it would be preserved outside of our mailboxes.  If the
  code is self-contained enough for me to try it, I will at some point
  in the near future.

    Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Locking issue

2004-11-10 Thread Erik Hatcher
I just ran the code you provided.  On my puny PowerBook (Mac OS X 
10.3.5) it dies in much less than 5 minutes.

I do not know what the issue is, but certainly the actions the program 
is taking are atypical.  Opening and closing an IndexWriter repeatedly 
is certainly expensive on large indexes.  Indexing documents in batches 
is more typical, I presume.
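
By batches I mean keeping one writer open across many adds, along these lines
(a rough sketch; the "contents" field name is just an example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BatchIndex {
  public static void addBatch(String indexDir, String[] texts) throws Exception {
    // One open/close per batch, not per document.
    IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
    for (int i = 0; i < texts.length; i++) {
      Document doc = new Document();
      doc.add(Field.Text("contents", texts[i]));
      writer.addDocument(doc);
    }
    writer.optimize();   // optional; merges segments once per batch
    writer.close();
  }
}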

Also, maybe you need to put some sleep into the code to give the JVM a 
chance to catch its breath?  Does that alleviate the issue?

Erik
On Nov 10, 2004, at 8:02 PM, [EMAIL PROTECTED] wrote:
 I added it to Bugzilla like you suggested:
 http://issues.apache.org/bugzilla/show_bug.cgi?id=32171
 Let me know if you see any way to get around this issue.

 --- Lucene Users List [EMAIL PROTECTED] wrote:
  Whoops!  Looks like my attachment didn't make it through.  I'm
  re-attaching my simple test app.

  Thanks.

  --- Erik Hatcher [EMAIL PROTECTED] wrote:
   On Nov 10, 2004, at 5:48 PM, [EMAIL PROTECTED] wrote:
    Hi,
    With the information provided, I have no idea what the issue may be.
    Is there some information that I should post that will help
    determine why Lucene is giving me this error?
   You mentioned posting code - though I don't recall getting an
   attachment.  If you could post it as a Bugzilla issue with your code
   attached, it would be preserved outside of our mailboxes.  If the
   code is self-contained enough for me to try it, I will at some point
   in the near future.
     Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Locking issue

2004-11-10 Thread Erik Hatcher
I just added a Thread.sleep(1000) in the writer thread and it has run 
for quite some time, and is still running as I send this.

Erik
On Nov 10, 2004, at 8:02 PM, [EMAIL PROTECTED] wrote:
 I added it to Bugzilla like you suggested:
 http://issues.apache.org/bugzilla/show_bug.cgi?id=32171
 Let me know if you see any way to get around this issue.

 --- Lucene Users List [EMAIL PROTECTED] wrote:
  Whoops!  Looks like my attachment didn't make it through.  I'm
  re-attaching my simple test app.

  Thanks.

  --- Erik Hatcher [EMAIL PROTECTED] wrote:
   On Nov 10, 2004, at 5:48 PM, [EMAIL PROTECTED] wrote:
    Hi,
    With the information provided, I have no idea what the issue may be.
    Is there some information that I should post that will help
    determine why Lucene is giving me this error?
   You mentioned posting code - though I don't recall getting an
   attachment.  If you could post it as a Bugzilla issue with your code
   attached, it would be preserved outside of our mailboxes.  If the
   code is self-contained enough for me to try it, I will at some point
   in the near future.
     Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Query#rewrite Question

2004-11-10 Thread Satoshi Hasegawa
Hello,

Our program accepts input in the form of Lucene query syntax from the user, 
but we wish to perform additional tasks such as thesaurus expansion. So I 
want to manipulate the Query object that results from parsing.

My question is, is the result of the Query#rewrite method guaranteed to be 
either a TermQuery, a PhraseQuery, or a BooleanQuery, and if it is a 
BooleanQuery, do all the constituent clauses also reduce to one of the above 
three classes? If not, what if the original Query object was the one that 
was obtained from QueryParser#parse method? Can I assume the above in this 
restricted case?

I experimented with the current version, and the above seems to hold in this 
version; I'm asking whether this could change in the future. Thank you. 
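
To make the question concrete, the kind of traversal I would like to rely on
looks like this (only a sketch against the 1.4 API; the final else branch is
there precisely because I do not know whether it can be reached):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryWalker {

  /** Recursively visits a rewritten query, assuming only three concrete types. */
  public static void walk(Query q, IndexReader reader) throws Exception {
    Query rewritten = q.rewrite(reader);  // e.g. expands wildcard/prefix queries
    if (rewritten instanceof TermQuery) {
      System.out.println("term: " + ((TermQuery) rewritten).getTerm());
    } else if (rewritten instanceof PhraseQuery) {
      System.out.println("phrase: " + rewritten.toString());
    } else if (rewritten instanceof BooleanQuery) {
      BooleanClause[] clauses = ((BooleanQuery) rewritten).getClauses();
      for (int i = 0; i < clauses.length; i++) {
        // 1.4 exposes the clause query as a public field.
        walk(clauses[i].query, reader);
      }
    } else {
      // Can this branch ever be reached for a parsed query?
      System.out.println("other type: " + rewritten.getClass().getName());
    }
  }
}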

Re: Using Lucene to store document

2004-11-10 Thread Nhan Nguyen Dang
Hi Otis,
Please let me know what the HEAD version of Lucene is.
Actually, I'm considering the advantages of storing documents using Lucene
stored fields for my search engine.
I've tested with thousands of documents and found that retrieving a document
(in this case an XML file) with Lucene is a little bit faster than using the
file system. But I cannot test with a large amount of data to have an accurate
comparison.
So I wonder whether Lucene can support millions of documents and still
retrieve them at an appropriate speed.
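
For what it's worth, the storage pattern I am testing is essentially the
following (a sketch with made-up field names; the index is assumed to exist
already):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class StoredXml {

  public static void store(String indexDir, String id, String xml) throws Exception {
    IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
    Document doc = new Document();
    doc.add(Field.Keyword("id", id));        // indexed, not tokenized
    doc.add(Field.UnIndexed("xml", xml));    // stored only, retrieved at search time
    writer.addDocument(doc);
    writer.close();
  }

  public static String fetch(String indexDir, String id) throws Exception {
    IndexSearcher searcher = new IndexSearcher(indexDir);
    Hits hits = searcher.search(new TermQuery(new Term("id", id)));
    String xml = hits.length() > 0 ? hits.doc(0).get("xml") : null;
    searcher.close();
    return xml;
  }
}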
Nhan



Search scalability

2004-11-10 Thread Ravi
 We have one large index for a document repository of 800,000 documents.
The size of the index is 800MB. When we do searches against the index,
it takes 300-500ms for a single search. We wanted to test the
scalability and tried 100 parallel searches against the index with the
same query and the average response time was 13 seconds. We used a
simple IndexSearcher. Same searcher object was shared by all the
searches. I'm sure people have success in configuring lucene for better
scalability. Can somebody share their approach?

Thanks 
Ravi. 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search scalability

2004-11-10 Thread Otis Gospodnetic
Hello,

100 parallel searches going against a single index on a single disk
means a lot of disk seeks all happening at once.  One simple way of
working around this is to load your FSDirectory into RAMDirectory. 
This should be faster (could you report your
observations/comparisons?).  You can also try using ramfs if you are
using Linux.
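
Something along these lines (a minimal sketch; the index path is just an
example):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamSearch {
  // Copy the on-disk index into RAM once at startup and share one searcher
  // for all queries.
  public static IndexSearcher openInRam(String indexPath) throws Exception {
    RAMDirectory ram = new RAMDirectory(FSDirectory.getDirectory(indexPath, false));
    return new IndexSearcher(ram);   // searches now hit memory, not disk
  }
}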

Otis

--- Ravi [EMAIL PROTECTED] wrote:

  We have one large index for a document repository of 800,000
 documents.
 The size of the index is 800MB. When we do searches against the
 index,
 it takes 300-500ms for a single search. We wanted to test the
 scalability and tried 100 parallel searches against the index with
 the
 same query and the average response time was 13 seconds. We used a
 simple IndexSearcher. Same searcher object was shared by all the
 searches. I'm sure people have success in configuring lucene for
 better
 scalability. Can somebody share their approach?
 
 Thanks 
 Ravi. 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Using Lucene to store document

2004-11-10 Thread Otis Gospodnetic
Hello,

HEAD version means that you should check out Lucene straight out of
CVS.  How to work with CVS is another story, probably described
somewhere on jakarta.apache.org site.

Otis

--- Nhan Nguyen Dang [EMAIL PROTECTED] wrote:

 Hi Otis,
 Please let me know what the HEAD version of Lucene is.
 Actually, I'm considering the advantages of storing documents using
 Lucene stored fields for my search engine.
 I've tested with thousands of documents and found that retrieving a
 document (in this case an XML file) with Lucene is a little bit faster
 than using the file system. But I cannot test with a large amount of
 data to have an accurate comparison.
 So I wonder whether Lucene can support millions of documents and still
 retrieve them at an appropriate speed.
 Nhan
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Locking issue

2004-11-10 Thread yahootintin-lucene
Yes, I tried that too and it worked.  The issue is that our
Operations folks plan to install this on a pretty busy box and I
was hoping that Lucene wouldn't cause issues if it only had a
small slice of the CPU.

Guess I'll tell them to buy a bigger box!  Unless you have any
other ideas.  I'm running some tests with a larger timeout to
see if that helps.

--- Erik Hatcher [EMAIL PROTECTED] wrote:

 I just added a Thread.sleep(1000) in the writer thread and it
 has run 
 for quite some time, and is still running as I send this.
 
   Erik
 
  On Nov 10, 2004, at 8:02 PM, [EMAIL PROTECTED] wrote:

   I added it to Bugzilla like you suggested:
   http://issues.apache.org/bugzilla/show_bug.cgi?id=32171

   Let me know if you see any way to get around this issue.

   --- Lucene Users List [EMAIL PROTECTED] wrote:
    Whoops!  Looks like my attachment didn't make it through.  I'm
    re-attaching my simple test app.

    Thanks.

    --- Erik Hatcher [EMAIL PROTECTED] wrote:
     On Nov 10, 2004, at 5:48 PM, [EMAIL PROTECTED] wrote:
      Hi,
      With the information provided, I have no idea what the issue may be.
      Is there some information that I should post that will help
      determine why Lucene is giving me this error?
     You mentioned posting code - though I don't recall getting an
     attachment.  If you could post it as a Bugzilla issue with your code
     attached, it would be preserved outside of our mailboxes.  If the
     code is self-contained enough for me to try it, I will at some point
     in the near future.
       Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search scalability

2004-11-10 Thread yahootintin-lucene
Does it take 800MB of RAM to load that index into a
RAMDirectory?  Or are only some of the files loaded into RAM?

--- Otis Gospodnetic [EMAIL PROTECTED] wrote:

 Hello,
 
 100 parallel searches going against a single index on a single disk
 means a lot of disk seeks all happening at once.  One simple way of
 working around this is to load your FSDirectory into RAMDirectory.
 This should be faster (could you report your
 observations/comparisons?).  You can also try using ramfs if you are
 using Linux.
 
 Otis
 
 --- Ravi [EMAIL PROTECTED] wrote:
 
   We have one large index for a document repository of 800,000
   documents. The size of the index is 800MB. When we do searches
   against the index, it takes 300-500ms for a single search. We wanted
   to test the scalability and tried 100 parallel searches against the
   index with the same query and the average response time was 13
   seconds. We used a simple IndexSearcher. Same searcher object was
   shared by all the searches. I'm sure people have success in
   configuring lucene for better scalability. Can somebody share their
   approach?
 
   Thanks
   Ravi.
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-10 Thread Sanyi
Hi!

First of all, I've read about BooleanQuery$TooManyClauses, so I know that it
has a 1024-clause limit by default, which is good enough for me, but I still
think it behaves strangely.

Example:
I have an index with about 20 million documents.
Let's say there are about 3000 variants in the entire document set matching
this word mask: cab*
Let's say about 500 documents contain the word: spectrum
Now, when I search for cab* AND spectrum, I don't expect it to throw an
exception.
It should first restrict the search to the 500 documents containing the word
spectrum, then collect the variants of cab* within those documents, which
turns out to be two or three variants of cab* (cable, cables, maybe some
more), and the search should return, let's say, 10 documents.

Similar example: When I search for cab* AND nonexistingword it still throws a
TooManyClauses exception instead of saying "No results", since there is no
nonexistingword in my document set, so it doesn't even have to start
collecting the variations of cab*.

Is there any patch for this issue?
Thank you for your time!

Sanyi
(I'm using: lucene 1.4.2)
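
For now the only workaround I see is raising the limit before searching, which
of course does not change the expansion behaviour (a sketch; the index path and
field name are examples):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class WildcardWorkaround {
  public static void main(String[] args) throws Exception {
    // Raise the clause limit before running the expanded query (default is 1024).
    // This still expands cab* against the whole index, not just the spectrum docs.
    BooleanQuery.setMaxClauseCount(10000);

    IndexSearcher searcher = new IndexSearcher("/path/to/index");  // example path
    Query q = QueryParser.parse("cab* AND spectrum", "contents",
        new StandardAnalyzer());
    Hits hits = searcher.search(q);
    System.out.println(hits.length() + " hits");
    searcher.close();
  }
}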



 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]