Re: Indexing/Querying Annotations and Fields for a document

2008-03-17 Thread Grant Ingersoll
You would parse the XML (or whatever) into separate strings, and put each piece into its own Field in a Lucene Document. For instance: Document doc = new Document(); String body = getBody(input); String people = getPeople(input); Field bodyField = new Field("body", body); Field peopleField = new Field("p

Re: Lucene 2.3.1 Index Corruption?

2008-03-17 Thread Jamie
As a further follow-up: the following files are located in the index: ls /usr/local/index _0.fnm _0.frq _0.nrm _0.prx _0.tii _0.tis _1.cfs indexinfo _j.cfs segments.gen segments_s This problem appears to be intermittent and has occurred on several machines. Is there any incorrect way

Re: Huge number of Term objects in memory gives OutOfMemory error

2008-03-17 Thread Michael McCandless
You can call IndexReader.setTermInfosIndexDivisor(int) to reduce how many index terms are loaded in memory. EG setting it to 10 will load 1/10th what's loaded now, but will slow down searches. Also, you should understand why your index has so many terms. EG, use Luke to peek at the terms
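To make the trade-off concrete, here is a rough sketch of the arithmetic in plain Java, with no Lucene dependency. It assumes the default term index interval of 128 (every 128th term is kept in the in-memory term index); the class and method names are hypothetical and only illustrate the estimate.

```java
// Hypothetical helper, not part of Lucene: estimates how many index terms
// a reader holds in memory for a given total term count and divisor.
final class TermIndexEstimate {
    // Assumption: the index was written with the default interval of 128.
    static final int DEFAULT_TERM_INDEX_INTERVAL = 128;

    static long loadedIndexTerms(long totalTerms, int interval, int divisor) {
        if (totalTerms < 0 || interval <= 0 || divisor <= 0) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        // With a divisor, only every (interval * divisor)-th term is loaded.
        long effectiveInterval = (long) interval * divisor;
        // Round up: the first term of the dictionary is always an index term.
        return (totalTerms + effectiveInterval - 1) / effectiveInterval;
    }
}
```

Under these assumptions, an index with 128,000,000 unique terms would load about 1,000,000 index terms by default and about 100,000 with a divisor of 10. The price is a longer scan through the on-disk term dictionary for each lookup.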

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-17 Thread Michael McCandless
Daniel Noll wrote: On Monday 17 March 2008 19:38:46 Michael McCandless wrote: Well ... expungeDeletes() first forces a flush, at which point the deletions are flushed as a .del file against the just flushed segment. Still, if you call expungeDeletes after every flush (commit) then it's only 1

Re: Lucene 2.3.1 Index Corruption?

2008-03-17 Thread Michael McCandless
Hi, Can you describe what led up to this? Were there any exceptions when adding documents to the index? Was the index newly created with 2.3.1 or created on 2.3.0 or 2.2? What options are you using in your IndexWriter? Is it easy to reproduce? If so, can you call setInfoStream on your

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-17 Thread Daniel Noll
On Monday 17 March 2008 19:38:46 Michael McCandless wrote: > Well ... expungeDeletes() first forces a flush, at which point the > deletions are flushed as a .del file against the just flushed > segment. Still, if you call expungeDeletes after every flush > (commit) then it's only 1 segment whose d

Re: Huge number of Term objects in memory gives OutOfMemory error

2008-03-17 Thread Paul Smith
I'll bet the byte[] are the Norm data per field. If you have a lot of fields and do not need the normalization data for every field, I'd suggest turning that option off for fields you don't need the normalization for scoring. The calculation I understand is: 1 byte x (# fields with normal
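As a back-of-the-envelope sketch of that calculation: the snippet above is cut off, so this assumes it continues as one byte per document for each field that has norms enabled. The class name and the field count used below are hypothetical.

```java
// Hypothetical estimate, not a Lucene API: norms memory held by one open reader,
// assuming 1 byte per document per field with norms enabled.
final class NormsMemoryEstimate {
    static long normsBytes(int fieldsWithNorms, long documentCount) {
        if (fieldsWithNorms < 0 || documentCount < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        return (long) fieldsWithNorms * documentCount;
    }
}
```

For a collection of about 1,000,000 documents, 50 fields with norms would come to roughly 50,000,000 bytes per open reader under this assumption, which is why omitting norms on fields that do not need them for scoring can matter.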

Lucene 2.3.1 Index Corruption?

2008-03-17 Thread Jamie
Hi There I am getting the following error while searching a given index: java.io.FileNotFoundException: /usr/local/index/_0.fdt (No such file or directory) at java.io.RandomAccessFile.open(Native Method) at java.io.RandomAccessFile.<init>(Unknown Source) at org.apache.lucene.s

Huge number of Term objects in memory gives OutOfMemory error

2008-03-17 Thread Richard.Bolen
I'm running Lucene 2.3.1 with Java 1.5.0_14 on 64-bit Linux. We have fairly large collections (~1 GB collection files, ~1,000,000 documents). When I try to load test our application with 50 users, all doing simple searches via a web interface, we quickly get an OutOfMemory exception. When I d

Re: Indexing/Querying Annotations and Fields for a document

2008-03-17 Thread lucene-seme1 s
I already have the document preprocessed and the annotations (i.e. John) are already stored in an array with features attached to some annotations (such as the root and lemma of the word). Can you please elaborate some more on how to "index them as normally would" ? Regards, JK On Mon, Mar 17, 2

Re: IndexReader deleteDocument

2008-03-17 Thread varun sood
Hi Erick, My idea/need is to create a simple web interface where I can manage my index. Managing includes adding new documents to the index, editing the old ones, deleting, etc., all using a simple web GUI so that a person does not need to be a web developer to manage the index. Besides there are othe

Re: Indexing/Querying Annotations and Fields for a document

2008-03-17 Thread Grant Ingersoll
I think there are a couple of ways you can approach this, although I have never used GATE. If these annotations are marked in line in your content, then you can either preprocess the files to have them separately and index as you normally would, or you can use the relatively new TeeTokenFil

RE: word position operator?

2008-03-17 Thread Steven A Rowe
Hi Darren, Check out SpanFirstQuery and SpanRegexQuery: Steve On 03/16/2008 at 8:55 PM, Darren Govoni wrote:

Indexing/Querying Annotations and Fields for a document

2008-03-17 Thread lucene-seme1 s
Hello, I am a newbie here and still experimenting with Lucene. I have annotations and features generated by GATE for many documents and would like to index the original content of the documents in addition to the generated annotations. The annotations are in the form of [ John loves fishing]. I w

Re: IndexReader deleteDocument

2008-03-17 Thread Michael McCandless
I think that is quite a ways away. This possibility was briefly mentioned on the java-dev list recently, to create an IndexReader that can access the in-memory buffered adds/ deletes in IndexWriter, but it would be a very large change for Lucene. Various caches assume an index will not cha

Re: sorting a doc field takes more time

2008-03-17 Thread Grant Ingersoll
Sorting is dependent on the values in the fields. What is actually in the fields? But, yes, in general, sorting is going to be slower than just raw search. It's extra operations. It also looks like you are using the AUTO SortField, which means you are relying on Lucene to figure out how

Re: IndexReader deleteDocument

2008-03-17 Thread Cam Bazz
Hello Mike, Is there any hope for making a lucene index that is fully transparent, i.e. the indexreader seeing all the changes without reopening? Best. On Mon, Mar 17, 2008 at 12:35 PM, Michael McCandless < [EMAIL PROTECTED]> wrote: > > Oh, sorry, no you still must reopen the IndexReader. Inde

Re: IndexReader deleteDocument

2008-03-17 Thread Michael McCandless
Oh, sorry, no you still must reopen the IndexReader. IndexReader still searches only a point in time. Mike Cam Bazz wrote: yes, I meant the same index. I thought with the new changes - the index reader would see the changes without re-opening. It would be real real cool to have that.

Re: IndexReader deleteDocument

2008-03-17 Thread Cam Bazz
yes, I meant the same index. I thought with the new changes - the index reader would see the changes without re-opening. It would be real real cool to have that. Best. -C.B. On Mon, Mar 17, 2008 at 12:28 PM, Michael McCandless < [EMAIL PROTECTED]> wrote: > > I'm not sure what you mean by "sam

Re: IndexReader deleteDocument

2008-03-17 Thread Michael McCandless
I'm not sure what you mean by "same thread". Maybe you meant "same index"? Yes, if the IndexReader reopens. IndexWriter.commit() makes the changes visible to readers, and makes the changes durable to os/computer crash or power outage. Mike Cam Bazz wrote: Another and last question;
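The point-in-time behaviour can be pictured with a toy model in plain Java. This is not Lucene's API or implementation: `ToyIndex` and `ToyReader` are hypothetical names, and the sketch only shows why a reader opened before a commit keeps seeing the old state until a new reader is opened.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model (hypothetical, not Lucene): commit() publishes an immutable snapshot.
final class ToyIndex {
    private final List<String> pending = new ArrayList<>();
    private List<String> committed = Collections.emptyList();

    // Buffered add: invisible to every reader until commit() is called.
    void add(String doc) {
        pending.add(doc);
    }

    // Publishes buffered documents as a new snapshot; old snapshots are untouched.
    void commit() {
        List<String> next = new ArrayList<>(committed);
        next.addAll(pending);
        pending.clear();
        committed = Collections.unmodifiableList(next);
    }

    // A reader is bound to the snapshot that was current when it was opened.
    ToyReader openReader() {
        return new ToyReader(committed);
    }
}

final class ToyReader {
    private final List<String> snapshot;

    ToyReader(List<String> snapshot) {
        this.snapshot = snapshot;
    }

    int numDocs() {
        return snapshot.size();
    }
}
```

In this model, a reader opened after the first commit still reports the old document count after a second commit; only a reader opened afterwards sees the new document, which mirrors the "you still must reopen" answer in this thread.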

Re: IndexReader deleteDocument

2008-03-17 Thread Cam Bazz
Another and last question; when the user commits, will an indexreader that is reading the same thread see the changes made or not? I thought something was said about this, if my memory serves me correct. Best. On Mon, Mar 17, 2008 at 11:53 AM, Michael McCandless < [EMAIL PROTECTED]> wrote: > >

Re: IndexReader deleteDocument

2008-03-17 Thread Michael McCandless
It's a hard drive issue. When you call fsync, the OS asks the hard drive to sync. Mike Cam Bazz wrote: Hello, I understand the issue. But I have not understood - is this hardware related issue - i.e a harddisk? or operating system? If I am using linux would the OS lie about fsyncing?

Re: IndexReader deleteDocument

2008-03-17 Thread Cam Bazz
Hello, I understand the issue. But I have not understood - is this hardware related issue - i.e a harddisk? or operating system? If I am using linux would the OS lie about fsyncing? could I do anything in the kernel to stop it from lying? or is this just a harddrive related issue... Best. On Mo

Re: IndexReader deleteDocument

2008-03-17 Thread Michael McCandless
When you write to a file, modern OSs by default just buffer those writes in memory rather than actually writing them immediately to disk. Modern hard drives do the same (so, after the OS flushes to the hard drive, the hard drive actually just buffers the writes, too). Then, when it's a
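For illustration, this is roughly what asking for a sync looks like from plain Java, using the standard `FileChannel.force` call. The class and method names are hypothetical, and this is a sketch of the general technique rather than what Lucene does internally.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: writes text to a file and asks the OS to flush it to the device.
final class DurableWrite {
    static void writeAndSync(Path file, String text) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buffer = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
            // write() may only hand the bytes to the OS buffer cache.
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            // force(true) requests that content and metadata reach the storage device.
            // A drive with a write cache may still report success before the data is on the platter.
            channel.force(true);
        }
    }
}
```

Without the `force` call, a successful write only means the data reached an in-memory buffer, which is the gap a power loss can expose.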

Re: IndexReader deleteDocument

2008-03-17 Thread Cam Bazz
Hello, What do you mean by IO system lying on fsync? Best. On Mon, Mar 17, 2008 at 10:40 AM, Michael McCandless < [EMAIL PROTECTED]> wrote: > > Yes that's already been committed to trunk as well. > > IndexWriter now has a commit() method which syncs all referenced > files in the index to stable

Re: IndexReader deleteDocument

2008-03-17 Thread Michael McCandless
Yes that's already been committed to trunk as well. IndexWriter now has a commit() method which syncs all referenced files in the index to stable storage (assuming your IO system doesn't "lie" on fsync). Mike On Mar 17, 2008, at 4:33 AM, Cam Bazz wrote: Nice. Thanks. will the 2.4 have

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-17 Thread Michael McCandless
Daniel Noll wrote: On Thursday 13 March 2008 19:46:20 Michael McCandless wrote: But, when a normal merge of segments with deletions completes, your docIDs will shift. In trunk we now explicitly compute the docID shifting that happens after a merge, because we don't always flush pending delete
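The shifting itself can be sketched in a few lines of plain Java. This is a toy model with hypothetical names, not Lucene's merge code: it assumes that a merge drops deleted documents and packs the survivors in their original order.

```java
// Toy model (hypothetical, not Lucene's merge code): computes where each
// document lands after a merge that squeezes out deleted documents.
final class DocIdRemap {
    // Returns newId[oldId], or -1 if the old document was deleted.
    static int[] remap(boolean[] deleted) {
        int[] newIds = new int[deleted.length];
        int next = 0;
        for (int oldId = 0; oldId < deleted.length; oldId++) {
            // Each surviving document moves down by the number of deletions before it.
            newIds[oldId] = deleted[oldId] ? -1 : next++;
        }
        return newIds;
    }
}
```

With six documents where docs 1 and 4 are deleted, the survivors 0, 2, 3 and 5 end up as 0, 1, 2 and 3, so any docID cached before the merge may point at a different document afterwards.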

Re: IndexReader deleteDocument

2008-03-17 Thread Cam Bazz
Nice. Thanks. will the 2.4 have commit improvements that we previously talked about? best regards. -C.B. On Mon, Mar 17, 2008 at 10:31 AM, Michael McCandless < [EMAIL PROTECTED]> wrote: > > The trunk version of Lucene (eventually 2.4) now has deletion by > query, in IndexWriter. > > Mike > > C

Re: IndexReader deleteDocument

2008-03-17 Thread Michael McCandless
The trunk version of Lucene (eventually 2.4) now has deletion by query, in IndexWriter. Mike Cam Bazz wrote: Hello Erick, Has anyone found a way for deleting a document with a query? I understand it can be deleted via terms, but I need to delete a document with two terms, that is the

Re: IndexReader deleteDocument

2008-03-17 Thread Cam Bazz
Hello Erick, Has anyone found a way for deleting a document with a query? I understand it can be deleted via terms, but I need to delete a document with two terms, that is the only way I can identify my document is by looking at two terms not one. best. On Fri, Mar 14, 2008 at 4:58 PM, Erick Eri