Re: Indexing/Querying Annotations and Fields for a document

2008-03-18 Thread mark harwood
I've used a custom analyzer before now to "blend in" GATE annotations as tokens at the same position as the words they relate to. E.g. "Fred Smith works for Microsoft" would ordinarily be tokenized as the following tokens (position / offset / text): 1
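
A rough sketch of that idea (not Mark's actual code), assuming the Lucene 2.3-era TokenFilter API and a hypothetical lookupAnnotation() helper that maps a word to its GATE annotation type:

    // Emits an annotation token at the same position as the word it describes.
    public class AnnotationFilter extends TokenFilter {
        private Token pendingAnnotation;

        public AnnotationFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            if (pendingAnnotation != null) {
                Token a = pendingAnnotation;
                pendingAnnotation = null;
                return a;
            }
            Token t = input.next();
            if (t == null) return null;
            String type = lookupAnnotation(t.termText());   // hypothetical GATE lookup
            if (type != null) {
                Token a = new Token(type, t.startOffset(), t.endOffset());
                a.setPositionIncrement(0);   // same position as the original word
                pendingAnnotation = a;
            }
            return t;
        }
    }

Because the annotation token sits at the same position as the word, phrase and span queries can match either the surface form or the annotation interchangeably.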

Re: Lucene 2.3.1 Index Corruption?

2008-03-18 Thread Jamie
Hi Michael Sorry for the late reply. As you guessed, it missed my attention. Michael McCandless wrote: Hi, Can you describe what led up to this? My application indexes emails. In this particular instance, I had reindexed all emails from their original sources. The error occurred while I w

RE: Huge number of Term objects in memory gives OutOfMemory error

2008-03-18 Thread Richard.Bolen
Does each searchable have its own copy of Term and TermInfo arrays? So the amount in memory would grow with each new Searchable instance? If so, it might be worthwhile to implement a singleton MultiSearcher that is closed and re-opened periodically. What do you think? Thanks again, Rich

CorruptIndexException with some versions of java

2008-03-18 Thread Ian Lea
Hi When bulk loading into a new index I'm seeing this exception Exception in thread "Thread-1" org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _4l: fieldsReader shows 67861 but segmentInfo shows 67862 at or

Re: Huge number of Term objects in memory gives OutOfMemory error

2008-03-18 Thread Michael McCandless
<[EMAIL PROTECTED]> wrote: Does each searchable have its own copy of Term and TermInfo arrays? So the amount in memory would grow with each new Searchable instance? If so, it might be worthwhile to implement a singleton MultiSearcher that is closed and re-opened periodically. What d
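
A sketch of what that singleton idea might look like (not a recommendation from the thread; the index paths are hypothetical and the synchronization is deliberately naive):

    // One shared MultiSearcher so Term/TermInfo data is loaded once,
    // periodically closed and rebuilt to pick up index changes.
    public class SharedSearcher {
        private static MultiSearcher searcher;

        public static synchronized Searcher get() throws IOException {
            if (searcher == null) {
                searcher = new MultiSearcher(new Searchable[] {
                    new IndexSearcher("/indexes/one"),
                    new IndexSearcher("/indexes/two")
                });
            }
            return searcher;
        }

        // Call from a timer; in-flight searches should be drained first.
        public static synchronized void reopen() throws IOException {
            if (searcher != null) {
                searcher.close();
                searcher = null;
            }
        }
    }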

Re: Lucene 2.3.1 Index Corruption?

2008-03-18 Thread Michael McCandless
It looks like you ignore any IOException coming out of IndexWriter.close? Can you put some code in the catch clause around writer.close to see if you are hitting some exception there? Also, you forcefully remove the write lock if it's present. But are you absolutely certain there isn't
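
The check being asked for amounts to something like this (sketch only; "writer" is the application's IndexWriter):

    try {
        writer.close();
    } catch (IOException e) {
        // Don't swallow this: a failure here can leave the write.lock and
        // partially flushed files behind, which is exactly what we want to see.
        e.printStackTrace();
    }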

Re: CorruptIndexException with some versions of java

2008-03-18 Thread Michael McCandless
Can you call IndexWriter.setInfoStream(...) and get the error to happen and post back the resulting output? And, turn on assertions (java -ea) since that may catch the issue sooner. Can you describe how you are setting up IndexWriter (autoCommit, compound, etc.), and what your documents are
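
For reference, turning both of those on looks roughly like this (sketch; "dir" and "analyzer" stand in for whatever the application already uses):

    IndexWriter writer = new IndexWriter(dir, analyzer, true);
    writer.setInfoStream(System.out);   // verbose flush/merge diagnostics

and run the JVM with assertions enabled for Lucene's packages, e.g. java -ea:org.apache.lucene... YourIndexer (YourIndexer being a placeholder for the indexing class).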

Re: Lucene 2.3.1 Index Corruption?

2008-03-18 Thread Michael McCandless
Yes fdt/fdx hold stored fields. When the first buffered document is added these files are created. The only way they disappear (through Lucene's APIs) is if a writer is opened on that directory, and, those files are not referenced by the current segments file. This is why I'm concerned

Re: Lucene 2.3.1 Index Corruption?

2008-03-18 Thread Jamie
Michael McCandless wrote: Yes fdt/fdx hold stored fields. When the first buffered document is added these files are created. The only way they disappear (through Lucene's APIs) is if a writer is opened on that directory, and, those files are not referenced by the current segments file. Th

Re: Lucene 2.3.1 Index Corruption?

2008-03-18 Thread Michael McCandless
OK, opening two writers at once is definitely a recipe for disaster. Please post back on whether this does or doesn't resolve it. Previous versions of Lucene didn't write the fdt/fdx files until a segment is flushed, so it's possible you escaped index corruption (but, lost documents) before.
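
A minimal sketch of the "one writer per index" rule being described (a hypothetical helper, not code from the thread):

    // Share a single IndexWriter per index directory instead of opening a
    // second one; Lucene's write.lock exists to enforce exactly this.
    public class WriterHolder {
        private static IndexWriter writer;

        public static synchronized IndexWriter get(Directory dir, Analyzer a)
                throws IOException {
            if (writer == null) {
                writer = new IndexWriter(dir, a, false);   // false: append, don't create
            }
            return writer;
        }
    }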

java.lang.OutOfMemoryError: Java heap space when sorting the fields

2008-03-18 Thread sandyg
This is my search code: QueryParser parser = new QueryParser("keyword", new StandardAnalyzer()); Query query = parser.parse("1"); Sort sort = new Sort(new SortField(sortField)); Hits hits = searcher.search(query, sort); And I have huge data, about 13 million records. I am not
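
An aside on the code above: new SortField(sortField) leaves Lucene to auto-detect the type, and a String sort keeps far more in memory than a numeric one. If the field really holds integers (an assumption, not stated in the post), declaring the type keeps the FieldCache near 4 bytes per document:

    // Sketch: explicit sort type, assuming the field parses as an int.
    Sort sort = new Sort(new SortField(sortField, SortField.INT));
    Hits hits = searcher.search(query, sort);

Raising the heap (e.g. java -Xmx512m) is the other obvious lever; see the replies below.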

Re: CorruptIndexException with some versions of java

2008-03-18 Thread Michael McCandless
One question: do you know whether 67,861 docs "feels like" a newly flushed segment, or, the result of a merge? Ie, roughly how many docs are you buffering in IndexWriter before it flushes? Are they very small documents and your RAM buffer is large? Mike Ian Lea wrote: Hi When bulk l

Re: CorruptIndexException with some versions of java

2008-03-18 Thread Ian Lea
The data is loaded in chunks of up to 100K docs in separate runs of the program if that helps answer the first question. All buffers have default values, docs are small but not tiny, JVM is running with default settings. Answers to previous questions, and infostream, will follow once the job has

Re: CorruptIndexException with some versions of java

2008-03-18 Thread Ian Lea
Documents are biblio records. All have title, author etc. stored, some have a few extra fields as well. Typically around 25 fields per doc. The index is created with compound format, everything else as default. I've rerun the job until failure. Different numbers this time, but basically the sa

PriorityQueue - something to watch for when using paging

2008-03-18 Thread mark harwood
I came across an interesting quirk when using Lucene's PriorityQueue. It's not a bug per se but I thought it might be worth logging here if anyone else experiences it. I was using a PriorityQueue to support a GUI that pages through the top terms in an index. It was observed that terms were ofte

Re: CorruptIndexException with some versions of java

2008-03-18 Thread Michael McCandless
I don't see an attachment here -- maybe the mailing list software stripped it off. If so can you send directly to me? Thanks. Mike Ian Lea wrote: Documents are biblio records. All have title, author etc. stored, some have a few extra fields as well. Typically around 25 fields per doc.

Re: CorruptIndexException with some versions of java

2008-03-18 Thread Yonik Seeley
On Tue, Mar 18, 2008 at 7:38 AM, Ian Lea <[EMAIL PROTECTED]> wrote: > Hi > > > When bulk loading into a new index I'm seeing this exception > > Exception in thread "Thread-1" > org.apache.lucene.index.MergePolicy$MergeException: > org.apache.lucene.index.CorruptIndexException: doc counts differ

Re: CorruptIndexException with some versions of java

2008-03-18 Thread Ian Lea
It's failed on servers running SuSE 10.0 and 8.2 (ancient!) $ uname -a shows Linux phoebe 2.6.13-15-smp #1 SMP Tue Sep 13 14:56:15 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux and Linux phobos 2.4.20-64GB-SMP #1 SMP Mon Mar 17 17:56:03 UTC 2003 i686 unknown unknown GNU/Linux The first one has a 2.8G

Question with Hits Interface

2008-03-18 Thread Ramdas M Ramakrishnan
Hi I am using a MultiFieldQueryParser to parse and search the index. Once I have the Hits and iterate through them, I need to know the following: for every hit document, I need to know which indexed field the Hit originated from. Say I have indexed 2 fields, how will I know from the Hit whi

Re: java.lang.OutOfMemoryError: Java heap space when sorting the fields

2008-03-18 Thread Chris Lu
This is because sorting will load all values in that sortField into memory. If it's an integer, you will need 4*N bytes, which is an additional 52 MB for you. There is no programmatic way to increase the memory size. -- Chris Lu - Instant Scalable Full-Text Search On Any Databa

Re: CorruptIndexException with some versions of java

2008-03-18 Thread Michael McCandless
Ian, Could you apply the attached patch applied to the head of the 2.3 branch? It only adds more asserts, to try to pinpoint where exactly this corruption starts. Then, re-run the test with asserts enabled and infoStream turned on and post back. Thanks. Mike Ian Lea wrote: It'

Re: java.lang.OutOfMemoryError: Java heap space when sorting the fields

2008-03-18 Thread Mark Miller
To sort on 13mil docs will take like at least 400 MB for the field cache. That's if you only sort on one field...it can grow fast if you allow multi-field sorting. How much RAM are you giving your app? sandyg wrote: this is my search content QueryParser parser = new QueryParser("keyword",new

Re: java.lang.OutOfMemoryError: Java heap space when sorting the fields

2008-03-18 Thread Mark Miller
Whoops... 10 times too much there. More like 40 MB I think. A string sort could be a bit higher though; you also need to store all of the terms to index into. sandyg wrote: this is my search content QueryParser parser = new QueryParser("keyword",new StandardAnalyzer()); Query query = parser.parse
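
Rough arithmetic behind those figures (an estimate only, not measured): an int sort caches one int per document, about 13,000,000 x 4 bytes, roughly 52 MB, in line with Chris Lu's number. A String sort on a unique keyword field also keeps an array of the distinct term values, so with around 13 million unique terms the String data alone can run to several hundred MB on top of that, which is why the earlier 400 MB guess is not far off for string sorting.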

LUCENE-933 / SOLR-261

2008-03-18 Thread Jake Mannix
Hey folks, I was wondering what the status of LUCENE-933 is (stop words can cause the QueryParser to end up with no results, due to an e.g. +(the) clause in the resultant BooleanQuery). According to the tracking bug, it's resolved, and there's a patch, but where has that patch been applied? I trie

Re: CorruptIndexException with some versions of java

2008-03-18 Thread Michael McCandless
Hi Ian, Sheesh that's odd. The SegmentMerger produced an .fdx file that is one document too short. Can you run with this patch now, again applied to head of 2.3 branch? I just added another assert inside the loop that does the field merging. I will scrutinize this code... Mike I

Re: Indexing/Querying Annotations and Fields for a document

2008-03-18 Thread lucene-seme1 s
Can you please share the custom Analyzer you have ? In particular, I am interested in knowing how to get access to the position, offset values for each token. Regards, JK On Tue, Mar 18, 2008 at 10:48 AM, mark harwood <[EMAIL PROTECTED]> wrote: > I've used a custom analyzer before now to "blend
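
While waiting on that, the position/offset values can be inspected directly from any TokenStream; a sketch assuming the 2.3-era API and an analyzer already in hand:

    TokenStream ts = analyzer.tokenStream("body",
            new StringReader("Fred Smith works for Microsoft"));
    Token t;
    while ((t = ts.next()) != null) {
        System.out.println(t.termText()
            + " posIncr=" + t.getPositionIncrement()
            + " start=" + t.startOffset()
            + " end=" + t.endOffset());
    }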

Contrib Highlighter and Phrase search

2008-03-18 Thread Spencer Tickner
Hi List, Thanks in advance for any help. I'm working with the contrib highlighting class and am having issues when doing searches with a phrase. I've been able to duplicate this behaviour in the HighlighterTest class. When calling the testGetBestFragmentsPhrase() method I get the correct: John K

Re: LUCENE-933 / SOLR-261

2008-03-18 Thread Doron Cohen
Hi Jake, yes it was committed in Lucene - this is visible in the JIRA issue if you switch to the "Subversion Commits" tab, where you can also see the actual diffs that took place. Best, Doron On Tue, Mar 18, 2008 at 7:14 PM, Jake Mannix <[EMAIL PROTECTED]> wrote: > Hey folks, > I was wonder

Re: Contrib Highlighter and Phrase search

2008-03-18 Thread Mark Miller
The contrib Highlighter is not position sensitive. You can try out the patch I have been working on here if you are interested: https://issues.apache.org/jira/browse/LUCENE-794 Spencer Tickner wrote: Hi List, Thanks in advance for any help. I'm working with the contrib highlighting class and am

Re: CorruptIndexException with some versions of java

2008-03-18 Thread Michael McCandless
Ian can you attach your version of SegmentMerger.java? Somehow my lines are off from yours. Mike Ian Lea wrote: Mike Latest patch produces similar exception: Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError: after

Re: LUCENE-933 / SOLR-261

2008-03-18 Thread Jake Mannix
Ah, thanks. So since solr-1.2.0 is using lucene-*-2007-05-20_00_04-53.jar in its distribution, is this why SOLR-261 is still open? I thought that maybe it would be a simple drop-in replacement, but when I tossed in lucene-*-2.3.1.jar to solr, it didn't fix the problem, so maybe something in solr n

Re: Contrib Highlighter and Phrase search

2008-03-18 Thread Spencer Tickner
Thanks, I'll give that a try. Cheers, Spencer On Tue, Mar 18, 2008 at 1:50 PM, Mark Miller <[EMAIL PROTECTED]> wrote: > The contrib Highlighter is not position sensitive. You can try out the > patch I have been working here if you are interested: > https://issues.apache.org/jira/browse/LUCENE-

Re: Question with Hits Interface

2008-03-18 Thread Daniel Noll
On Wednesday 19 March 2008 01:44:33 Ramdas M Ramakrishnan wrote: > I am using a MultiFieldQueryParser to parse and search the index. Once I > have the Hits and iterate thru it, I need to know the following? > > For every hit document I need to know under which indexed field was this > Hit originati
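
One common workaround (a sketch, not necessarily Daniel's answer; the field names are hypothetical and exception handling is omitted) is to re-test each hit against a per-field query:

    String[] fields = {"title", "body"};
    int docId = hits.id(i);
    for (int f = 0; f < fields.length; f++) {
        Query q = new QueryParser(fields[f], analyzer).parse(queryString);
        if (searcher.explain(q, docId).getValue() > 0) {
            System.out.println("doc " + docId + " matched on field " + fields[f]);
        }
    }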

Re: LUCENE-933 / SOLR-261

2008-03-18 Thread Chris Hostetter
: Ah, thanks. So since solr-1.2.0 is using : lucene-*-2007-05-20_00_04-53.jarin its distribution, : is this why SOLR-261 is still open? SOLR-261 was left open because it hadn't been verified yet -- I just did that and resolved the issue against the trunk. : I thought that maybe it would be a s

Re: phrase search with custom TokenFilter

2008-03-18 Thread Chris Hostetter
You're going to want to change your TokenFilter so that it emits the split-piece tokens immediately after the original token and with a positionIncrement of "0" .. don't buffer them up and wait for the entire stream to finish first. It's the true order of the tokens in the tokenstream and the posit
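
In next()-style terms (2.3-era API), the pattern Chris describes looks roughly like this, with a hypothetical split() supplying the pieces:

    private final List pending = new LinkedList();   // pieces waiting to be emitted

    public Token next() throws IOException {
        if (!pending.isEmpty()) {
            return (Token) pending.remove(0);
        }
        Token t = input.next();
        if (t == null) return null;
        String[] pieces = split(t.termText());        // hypothetical splitting logic
        for (int i = 0; i < pieces.length; i++) {
            Token p = new Token(pieces[i], t.startOffset(), t.endOffset());
            p.setPositionIncrement(0);                // same position as the original token
            pending.add(p);
        }
        return t;
    }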

Re: Contrib Highlighter and Phrase search

2008-03-18 Thread markharw00d
See https://issues.apache.org/jira/browse/LUCENE-794 Spencer Tickner wrote: Hi List, Thanks in advance for any help. I'm working with the contrib highlighting class and am having issues when doing searches with a phrase. I've been able to duplicate this behaviour in the HighlighterTest class.

Re: Indexing/Querying Annotations and Fields for a document

2008-03-18 Thread markharw00d
lucene-seme1 s wrote: Can you please share the custom Analyzer you have ? Unfortunately it's not mine to share but see the Lucene Token and Analyzer classes - it's not particularly hard to code.

Re: java.lang.OutOfMemoryError: Java heap space when sorting the fields

2008-03-18 Thread sandyg
Thanks for the reply. Actually I am sorting on a specific field, a keyword field which is unique, and I have 1 GB RAM. markrmiller wrote: > > To sort on 13mil docs will take like at least 400 mb for the field > cache. Thats if you only sort on one field...it can grow fast if you > allow mu

Re: java.lang.OutOfMemoryError: Java heap space when sorting the fields

2008-03-18 Thread sandyg
How can I do sorting on the results I get (if the Hits are already there, how do I sort on the Hits) instead of sorting on all the values before getting results? Chris Lu wrote: > > This is because sorting will load all values in that sortFirled into > memory. > > If it's an integer, you will need 4*
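
If only the displayed page needs ordering, one sketch (assuming the "keyword" field is stored, and accepting that only the relevance-ranked top hits are re-sorted, not the full result set):

    Hits hits = searcher.search(query);            // unsorted / relevance order
    int n = Math.min(hits.length(), 100);
    Document[] page = new Document[n];
    for (int i = 0; i < n; i++) {
        page[i] = hits.doc(i);
    }
    Arrays.sort(page, new Comparator() {
        public int compare(Object a, Object b) {
            return ((Document) a).get("keyword")
                    .compareTo(((Document) b).get("keyword"));
        }
    });

Sorting the whole result set still requires a Sort passed to search(), which is what the FieldCache memory discussion above is about.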