Re: Getting only the Ids, not the whole documents.

2007-08-03 Thread Mike Klaas
You still have a disk seek per doc if the index can't fit in memory (usually more costly than reading the fields). Why not use FieldCache? -Mike On 2-Aug-07, at 5:41 PM, Mark Miller wrote: If you are just retrieving your custom id and you have more stored fields (and they are not tiny) yo

Re: How can I get the Document Frequency for a specific term??? And more questions...

2007-08-03 Thread Grant Ingersoll
On Aug 3, 2007, at 9:47 AM, tierecke wrote: Hi, Can I know in how many documents a term appears (DF - Document Frequency)? Does Lucene keep it? Can I retrieve it? See the TermEnum class (IndexReader.terms()). Now - an even more advanced question: Since I have a 77GB index, I cut it into
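For readers unsure what the document frequency in this thread measures: it is the number of documents that contain a term at least once, regardless of how often the term repeats inside any one document. In Lucene the figure is read from the index (TermEnum exposes it through docFreq()). The sketch below is only a toy stand-in in plain Java that computes the same quantity over in-memory strings. It assumes whitespace tokenization and lower-casing, and the names DocFreqSketch and documentFrequencies are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy illustration only: this does not use Lucene or read a Lucene index.
class DocFreqSketch {
    // Returns, for each term, the number of documents that contain it at least once.
    static Map<String, Integer> documentFrequencies(List<String> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            // A set, so a term repeated inside one document is counted only once.
            Set<String> unique = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
            for (String term : unique) {
                if (!term.isEmpty()) {
                    df.merge(term, 1, Integer::sum);
                }
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("red apple red", "green apple", "blue sky");
        Map<String, Integer> df = documentFrequencies(docs);
        System.out.println("df(apple) = " + df.get("apple"));
        System.out.println("df(red) = " + df.get("red"));
    }
}
```

Here "red" occurs twice in the first document but still contributes a document frequency of one, which is the distinction between document frequency and term frequency.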

Nested Fields

2007-08-03 Thread Spencer Tickner
Hi, and thanks in advance for any help. I'm fairly new to Lucene so excuse the ignorance. I'm attempting to field an XML document with nested fields. So: <foo><bar>This</bar><bat>That</bat></foo> would give me hits for: bar:This bat:That foo:ThisThat The only way I can see a way of doing this now is to field each eleme

Re: multiple field searcher

2007-08-03 Thread Steven Rowe
qaz zaq wrote: > I have Search Terms: T1, T2... Tn. Also I have document fields of F1 F2... Fm. > > I want to search the match documents across F1 to Fm fields,i.e., all of the > T1, T2, ...Tn need to be matched, but can be in the combination of T1, T2, > ... Tn field. > > I check the MultiFie

Re: Performance improvements using writer.delete vs reader.delete

2007-08-03 Thread Mark Miller
Heh. I suppose I'll defer to your judgment. In my mind, the simple system to make is to just buffer the adds, buffer the deletes - later apply the adds, apply the deletes (or the reverse). I am sure something in Solr would have a more sophisticated process, but my guess was about what the new L

multiple field searcher

2007-08-03 Thread qaz zaq
I have Search Terms: T1, T2... Tn. Also I have document fields of F1 F2... Fm. I want to search for matching documents across the F1 to Fm fields, i.e., all of T1, T2, ... Tn need to be matched, but can be in the combination of T1, T2, ... Tn field. I checked the MultiFieldQueryParser, it doesn't app

Re: Can I do boosting based on term postions?

2007-08-03 Thread Shailendra Sharma
Ah, Good way ! On 8/4/07, Paul Elschot <[EMAIL PROTECTED]> wrote: > > On Friday 03 August 2007 20:35, Shailendra Sharma wrote: > > Paul, > > > > If I understand Cedric right, he wants to have different boosting > depending > > on search term positions in the document. By using SpanFirstQuery he >

Re: Can I do boosting based on term postions?

2007-08-03 Thread Paul Elschot
On Friday 03 August 2007 20:35, Shailendra Sharma wrote: > Paul, > > If I understand Cedric right, he wants to have different boosting depending > on search term positions in the document. By using SpanFirstQuery he will > only be able to consider in terms till particular position; > but he won

Re: Can I do boosting based on term postions?

2007-08-03 Thread Shailendra Sharma
Paul, If I understand Cedric right, he wants to have different boosting depending on search term positions in the document. By using SpanFirstQuery he will only be able to consider terms up to a particular position; but he won't be able to do something like the following: a) Give 100% boosting to ma

Re: Performance improvements using writer.delete vs reader.delete

2007-08-03 Thread Mike Klaas
On 3-Aug-07, at 3:27 AM, Mark Miller wrote: Also, IndexWriter probably buffers better than you would. If you buffer a delete with IndexWriter and then add a document that would be removed by that delete right after, when the buffered deletes are flushed, your latest doc will not be removed

Re: strange MultiFieldQueryParser error: java.lang.Integer

2007-08-03 Thread Luca Rondanini
Sometimes I feel stupid! ;) Thank you very much! Luca testn wrote: Boost must be Map<String, Float> Luca123 wrote: Hi all, I've always used the MultiFieldQueryParser class without problems but now I'm experiencing a strange problem. This is my code: Map boost = new HashMap(); boost.put("field1",5); boos

Re: extracting non-english text from word, pdf, etc....??

2007-08-03 Thread Ryan Ackley
The textmining library (textmining.org) for Word docs should work fine with non-english text as well. Let me know if it doesn't On 8/2/07, Ben Litchfield <[EMAIL PROTECTED]> wrote: > In terms of PDF documents... > > PDFBox should work just fine with any latin based languages; at this > time certai

Re: strange MultiFieldQueryParser error: java.lang.Integer

2007-08-03 Thread testn
Boost must be Map<String, Float> Luca123 wrote: > > Hi all, > I've always used the MultiFieldQueryParser class without problems but > now i'm experiencing a strange problem. > This is my code: > > Map boost = new HashMap(); > boost.put("field1",5); > boost.put("field2",1); > > Analyzer analyzer = new Standa
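The java.lang.Integer error in this thread is what one would expect when a boost map holds Integer values but the code reading it casts each value to Float: boost.put("field1",5) autoboxes the literal 5 to an Integer. The sketch below reproduces that failure mode in plain Java without Lucene. readBoost is a hypothetical stand-in for whatever library code performs the cast, and BoostMapSketch and typedBoosts are likewise illustrative names, not part of any Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration; MultiFieldQueryParser itself is not used here.
class BoostMapSketch {
    // Hypothetical stand-in for library code that reads a boost from a Map whose
    // value type is not checked at compile time and casts the value to Float.
    static float readBoost(Map<String, ?> boosts, String field) {
        Float boost = (Float) boosts.get(field); // ClassCastException if the value is an Integer
        return boost.floatValue();
    }

    // Building the map with Float values avoids the failed cast.
    static Map<String, Float> typedBoosts() {
        Map<String, Float> boosts = new HashMap<>();
        boosts.put("field1", 5f); // float literal, boxed to Float
        boosts.put("field2", 1f);
        return boosts;
    }

    public static void main(String[] args) {
        Map<String, Object> untyped = new HashMap<>();
        untyped.put("field1", 5); // int literal, boxed to Integer
        try {
            readBoost(untyped, "field1");
        } catch (ClassCastException e) {
            System.out.println("Integer boost rejected: " + e.getMessage());
        }
        System.out.println("Float boost: " + readBoost(typedBoosts(), "field1"));
    }
}
```

Declaring the map with a Float value type moves the mistake to compile time: boosts.put("field1", 5) would no longer compile against Map<String, Float>.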

Re: Can I do boosting based on term postions?

2007-08-03 Thread Paul Elschot
Cedric, You can choose the end limit for SpanFirstQuery yourself. Regards, Paul Elschot On Friday 03 August 2007 05:38, Cedric Ho wrote: > Hi Paul, > > Isn't SpanFirstQuery only match those with position less than a > certain end position? > > I am rather looking for a query that would score

Re: Get the TokenStream of an indexed but unstored field

2007-08-03 Thread tierecke
I fixed my question later. I meant I did not STORE the documents themselves. Anyway - the issue is already solved, thanks to testn. But there are new hard (for me) questions. Thanks a lot! Erick Erickson wrote: > > I indexed a large number of large documents, but I did not index the > document the

Re: Get the TokenStream of an indexed but unstored field

2007-08-03 Thread Erick Erickson
<<>> This is really confusing since it's self-contradictory. Could you post the lines where you do the document.add() for the fields in question? Best Erick On 8/3/07, tierecke <[EMAIL PROTECTED]> wrote: > > > Hi, > > I indexed a large number of large documents, but I did not index the > documen

strange MultiFieldQueryParser error: java.lang.Integer

2007-08-03 Thread Luca Rondanini
Hi all, I've always used the MultiFieldQueryParser class without problems but now I'm experiencing a strange problem. This is my code: Map boost = new HashMap(); boost.put("field1",5); boost.put("field2",1); Analyzer analyzer = new StandardAnalyzer(STOP_WORDS); String[] s_fields = new String[2

Re: Get the terms and frequency vector of an indexed but unstored field

2007-08-03 Thread tierecke
Thanks a lot, that works 100%!... Fortunately, I did use the flag to state that Lucene should store the term frequency vector. Otherwise, I'd have to index 77GB right now... :-) -- View this message in context: http://www.nabble.com/Get-the-terms-and-frequency-vector-of-an-indexed-but-unstored-f

How can I get the Document Frequency for a specific term???

2007-08-03 Thread tierecke
Hi, Can I know in how many documents a term appears (DF - Document Frequency)? Does Lucene keep it? Can I retrieve it? thanks a lot from Amsterdam, Nir. -- View this message in context: http://www.nabble.com/How-can-I-get-the-Document-Frequency-for-a-specific-termtf4212615.html#a11983532

Re: How do YOU detect corrupt indexes?

2007-08-03 Thread Joe R
We're planning on using encryption at the filesystem level (whole-disk encryption) and, to be honest, I don't have a mechanism that can produce the changes I'm talking about. Neither does my boss, unfortunately ;) He came along one day and asked, "how do we know when data changed on disk without

Re: Get the terms and frequency vector of an indexed but unstored field

2007-08-03 Thread testn
You can use IndexReader.getTermFreqVectors(int n) to get all terms and their frequencies. Make sure that when you create the index, you choose the option to store them by specifying a Field.TermVector option. Check out http://www.cnlp.org/presentations/slides/AdvancedLuceneEU.pdf tierecke wrote: > > Hi, >
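As a rough picture of what a term frequency vector holds for one field of one document, the sketch below builds the same kind of data (sorted terms with their occurrence counts) from a plain string. It is a toy in plain Java rather than a call into Lucene, it assumes whitespace tokenization and lower-casing instead of a real analyzer, and TermVectorSketch and termFrequencies are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy illustration only: Lucene stores this data at index time; nothing here reads an index.
class TermVectorSketch {
    // Counts how often each whitespace-separated, lower-cased token occurs in one field value.
    // A TreeMap keeps the terms in sorted order.
    static TreeMap<String, Integer> termFrequencies(String fieldText) {
        TreeMap<String, Integer> freqs = new TreeMap<>();
        for (String token : fieldText.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : termFrequencies("to be or not to be").entrySet()) {
            System.out.println(e.getKey() + " x" + e.getValue());
        }
    }
}
```

The point of storing term vectors at index time, as the reply advises, is that this per-document data can then be read back even when the field text itself was never stored.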

Re: Performance improvements using writer.delete vs reader.delete

2007-08-03 Thread Mark Miller
Also, IndexWriter probably buffers better than you would. If you buffer a delete with IndexWriter and then add a document that would be removed by that delete right after, when the buffered deletes are flushed, your latest doc will not be removed. It's unlikely your own buffer system would work
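The ordering guarantee described here (a buffered delete does not remove a document added after it) is the part a hand-rolled buffer tends to get wrong. One way to get it right is to record, for each buffered delete, how many adds were pending when it was issued. The sketch below is a toy model in plain Java under that assumption. It is not Lucene's IndexWriter and makes no claim about its internals, and BufferedWriterSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only; it is not Lucene's IndexWriter.
class BufferedWriterSketch {
    private final List<String> committed = new ArrayList<>();   // ids visible after a flush
    private final List<String> pendingAdds = new ArrayList<>(); // ids added since the last flush
    // For each buffered delete: how many pending adds existed when the delete was issued.
    private final Map<String, Integer> pendingDeletes = new LinkedHashMap<>();

    void add(String id) {
        pendingAdds.add(id);
    }

    void delete(String id) {
        pendingDeletes.put(id, pendingAdds.size());
    }

    void flush() {
        // Buffered deletes always apply to documents committed by earlier flushes.
        committed.removeIf(pendingDeletes::containsKey);
        for (int i = 0; i < pendingAdds.size(); i++) {
            String id = pendingAdds.get(i);
            Integer limit = pendingDeletes.get(id);
            // A buffered delete only covers adds that were buffered before it.
            boolean deleted = limit != null && i < limit;
            if (!deleted) {
                committed.add(id);
            }
        }
        pendingAdds.clear();
        pendingDeletes.clear();
    }

    List<String> liveIds() {
        return new ArrayList<>(committed);
    }

    public static void main(String[] args) {
        BufferedWriterSketch writer = new BufferedWriterSketch();
        writer.add("a");
        writer.flush();
        writer.delete("a"); // removes the old copy of "a" at flush time
        writer.add("a");    // added after the delete, so it survives the flush
        writer.flush();
        System.out.println(writer.liveIds());
    }
}
```

A naive buffer that applies all adds and then all deletes (or the reverse) at flush time would either drop the replacement document or keep both copies, which is the failure the message warns about.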

Get the terms and frequency vector of an indexed but unstored field

2007-08-03 Thread tierecke
Hi, I indexed a large number of large documents, but I did not store the documents themselves, just indexed them. Now I am interested in getting the vector (i.e.: the terms indexed and the frequency) of that indexed but unstored field. doc.getField(fieldname) returns null. How can I get the data?