Re: All results

2008-05-15 Thread Otis Gospodnetic
What does your code look like? If you are using Hits, what does hits.length() give you? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Hasan Diwan <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Friday, May 16, 2008 1:48:56

Re: All results

2008-05-15 Thread Hasan Diwan
On 15/05/2008, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > You can get all matches via Hits if you want, it's just that Lucene will need to do some re-querying under the hood. Why don't you use the search() method that takes HitCollector to get all docs - I thought that's what you

Re: All results

2008-05-15 Thread Otis Gospodnetic
Hi, You can get all matches via Hits if you want, it's just that Lucene will need to do some re-querying under the hood. Why don't you use the search() method that takes HitCollector to get all docs - I thought that's what you were trying to use in the first place. Otis -- Sematext -- htt
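
The HitCollector approach suggested above can be pictured with a small stand-in. The sketch below does not use Lucene itself: `HitCollector` here is a hypothetical local class that only mirrors the shape of the `collect(int doc, float score)` callback, and `fakeSearch` is an invented driver standing in for the search() call. It is meant only to illustrate that a collector is handed every non-zero scoring document rather than a top-N window.

```java
import java.util.ArrayList;
import java.util.List;

public class CollectAllSketch {
    // Hypothetical stand-in that mirrors the shape of a HitCollector callback.
    // This is NOT the Lucene class; it exists only so the sketch is self-contained.
    abstract static class HitCollector {
        abstract void collect(int doc, float score);
    }

    // Records every document id it is handed, in the order received.
    static class AllDocsCollector extends HitCollector {
        final List<Integer> docs = new ArrayList<Integer>();

        @Override
        void collect(int doc, float score) {
            docs.add(doc);
        }
    }

    // Invented driver (assumption): calls the collector once per document
    // whose score is greater than zero, as the thread describes.
    static void fakeSearch(float[] scores, HitCollector collector) {
        for (int doc = 0; doc < scores.length; doc++) {
            if (scores[doc] > 0.0f) {
                collector.collect(doc, scores[doc]);
            }
        }
    }

    // Convenience wrapper: run the fake search and return all collected ids.
    static List<Integer> collectAll(float[] scores) {
        AllDocsCollector collector = new AllDocsCollector();
        fakeSearch(scores, collector);
        return collector.docs;
    }

    public static void main(String[] args) {
        float[] scores = {0.9f, 0.0f, 0.01f, 0.4f};
        System.out.println(collectAll(scores)); // prints [0, 2, 3]
    }
}
```

In a real application the collector would be passed to the searcher instead of `fakeSearch`; low-scoring documents such as the one with score 0.01 are still collected, and only zero-scoring ones are skipped.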

Re: All results

2008-05-15 Thread Hasan Diwan
Otis, On 15/05/2008, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > That method should let you have *all* non-zero scoring docs if filter == null. If that's not the case then I think that's a bug. If you can come up with a unit test that shows the bug, please post it in JIRA. From the

Re: All results

2008-05-15 Thread Otis Gospodnetic
Hi Hasan, That method should let you have *all* non-zero scoring docs if filter == null. If that's not the case then I think that's a bug. If you can come up with a unit test that shows the bug, please post it in JIRA. Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: confused about an entry in the FAQ

2008-05-15 Thread Otis Gospodnetic
pong. Is that the best use of FieldSelector? What happens if you remove it from that HitCollector.collect method? It looks like you are creating a new FieldSelector object for each hit found in each search thread. If it's not that, is the index optimized? If not, does optimizing it make

All results

2008-05-15 Thread Hasan Diwan
It would appear that to see all results (including low scoring) I need to pass a different Filter to Searcher.search[1]. If filter is null, only the highest-scoring results are returned. How do I change the threshold for hits returned? -- Cheers, Hasan Diwan <[EMAIL PROTECTED]> 1. http://lucene.

OT: Parsing Russian text from RTF

2008-05-15 Thread Bowesman Antony
Not directly Lucene related, but I'm out of ideas and I'm not a Russian speaker... I'm extracting text from RTF to pump into Lucene. I'm using the original RTFEditorKit() code shown in LIA, p252 (actually, it's Nutch's RTFParser). I have an RTF document, which starts with --- {\rtf1\ansi\ansic

Re: Update document with fields which are not stored

2008-05-15 Thread Jean-Claude Antonio
Thanks Karl. My apologies for the duplicate mail sent. >>Is Lucene your primary data store? Almost, as most properties of my items can be queried. I would like to be able to "not" store these fields though, but the fact that I need to update my documents (delete + create), forces me to store th

Re: Possible Bug when Querying?

2008-05-15 Thread Matthew Hall
No I did not, because I'm not performing a search with a leading wildcard, nor am I intending to allow that behavior. But what I do want to be able to search on is a word that starts with a * by escaping it, because sadly our data contains such things. Matt Karl Wettin wrote: 15 maj 2008 k

Re: Possible Bug when Querying?

2008-05-15 Thread Karl Wettin
15 maj 2008 kl. 18.33 skrev Matthew Hall: 12:23:05,602 INFO [STDOUT] org.apache.lucene.queryParser.ParseException: Cannot parse '\*ache*': '*' not allowed as first character in PrefixQuery 12:23:05,602 INFO [STDOUT] Failure in QS_MarkerSearch.searchMarkerNomen 12:23:05,602 ERROR [STDER

Re: Document clustering with Lucene

2008-05-15 Thread Otis Gospodnetic
Have you tried using Carrot2 with Lucene? They work quite well in tandem! Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Supheakmungkol SARIN <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Wednesday, May 14, 2008 11:23:45 PM

Update document with fields which are not stored

2008-05-15 Thread Jean-Claude Antonio
Hello, We have been using Lucene for a while, and we are happy with it. Now we want to optimize some space. We are parsing versions of files and we want to keep track of history; to know which one is the newest, we set a flag on it (field newest=true). So when a new version comes along: - we w

Re: Update document with fields which are not stored

2008-05-15 Thread Karl Wettin
15 maj 2008 kl. 19.15 skrev Jean-Claude Antonio: This works perfectly, but for this we need to have a content field as new Field("content", content, Field.Store.YES, Field.Index.TOKENIZED) to be able to update the current document which stores the content. We wish not to store the content as the

Re: Exact match query on a field in index which has been indexed using StandardAnalyzer

2008-05-15 Thread Karl Wettin
14 maj 2008 kl. 17.30 skrev Erick Erickson: Another possibility would be to introduce marker tokens in your field, index something like "$ member of technical staff $" and then, when querying for exact matches, *add* the $ tokens to the beginning and end of the query. Just a note, I've hit pro
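
The marker-token idea quoted above can be sketched without Lucene. Everything in this block is illustrative: `MarkerTokens`, `wrap`, `tokens`, and `exactMatch` are hypothetical names, whitespace splitting stands in for a real analyzer, and the `$` marker is taken from the quoted example. A real analyzer may treat `$` differently, so the marker choice here is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class MarkerTokens {
    // Marker taken from the quoted example; a real analyzer may need another token.
    static final String MARKER = "$";

    // Surround a value with marker tokens before indexing or querying.
    static String wrap(String value) {
        return MARKER + " " + value.trim() + " " + MARKER;
    }

    // Lowercased whitespace tokenization; a stand-in for a real analyzer.
    static List<String> tokens(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Exact match: the wrapped query must appear as a contiguous token run
    // in the wrapped field. Because both carry markers at each end, this
    // only succeeds when the query covers the whole field value.
    static boolean exactMatch(String fieldValue, String query) {
        List<String> field = tokens(wrap(fieldValue));
        List<String> phrase = tokens(wrap(query));
        return Collections.indexOfSubList(field, phrase) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(exactMatch("member of technical staff", "member of technical staff"));
        System.out.println(exactMatch("senior member of technical staff", "member of technical staff"));
    }
}
```

Without the markers, the second query would match as an ordinary phrase inside the longer field; with them, the leading `$` no longer lines up next to "member", so only whole-field matches succeed.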

Update document with fields which are not stored

2008-05-15 Thread Jean-Claude Antonio
Hello, We have been using Lucene for a while, and we are happy with it. Now we want to optimize some space. We are parsing versions of files and we want to keep track of history; to know which one is the newest, we set a flag on it (field newest=true). So when a new version comes along: - we w

Possible Bug when Querying?

2008-05-15 Thread Matthew Hall
Greetings, I'm searching against a data set using Lucene that contains searches such as the following: *ache* *aChe* and so forth. Sadly, this part of the dataset is imported via an external client, so we have no real way of controlling how they format it. Now, to make matters a bit mor

Re: IndexWriter cache sweetspots

2008-05-15 Thread Karl Wettin
15 maj 2008 kl. 09.46 skrev Michael McCandless: Mark Miller wrote: It's been months since I've tested this sort of thing, but from what I remember there is a point where, as you go higher, performance starts to drop very slowly. The point was lower than I'd expect, and definitely created what look

Re: text extraction from pdf

2008-05-15 Thread Bill Janssen
> The problem I am having is that some of them have multiple columns and multiple word boxes. Does the xpdf patch extract different columns and wordboxes? It tells you where each word is. Columns you have to do for yourself. Bill > > In UpLib, I use xpdf-3.02pl2 with a patch which gives me positi

Re: Lucene's Mean Average Precision

2008-05-15 Thread Dave Kor
I haven't participated in TREC for the past 2 years, so I am wondering which TREC track you were comparing your results against? The last time I checked, Lucene's score for the Terabyte track wasn't wonderful, but it was still pretty decent. Bear in mind that Lucene uses the plain old vanilla TF-IDF

Re: text extraction from pdf

2008-05-15 Thread Cam Bazz
Hello Bill, The problem I am having is that some of them have multiple columns and multiple word boxes. Does the xpdf patch extract different columns and wordboxes? Best, -C.B. On Wed, May 14, 2008 at 6:35 PM, Bill Janssen <[EMAIL PROTECTED]> wrote: > > > the unix program pdf2text can convert keep

Re: IndexWriter cache sweetspots

2008-05-15 Thread Michael McCandless
Mark Miller wrote: It's been months since I've tested this sort of thing, but from what I remember there is a point where, as you go higher, performance starts to drop very slowly. The point was lower than I'd expect, and definitely created what looked like sweet spot settings. This was my recollect