Re: How to check, whether Index is optimized or not?

2006-01-11 Thread Otis Gospodnetic
I don't think we have a public API for that, but the index is considered optimized when it contains only a single segment. Then, we could add the following to IndexReader: public boolean isOptimized() { return segmentInfos.size() == 1; } I think that should do it. Otis - Original Messag
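The single-segment rule above can be sketched without touching Lucene internals. This is only an illustration: the class name `SegmentCountCheck` and the list of segment names are hypothetical stand-ins for the `segmentInfos` collection mentioned in the message, not part of any Lucene API.

```java
import java.util.List;

// Hypothetical stand-in for the check proposed above; not a Lucene class.
final class SegmentCountCheck {

    // An index is treated as optimized when it consists of exactly one segment.
    static boolean isOptimized(List<String> segmentNames) {
        return segmentNames.size() == 1;
    }
}
```

An index with segments such as `_a` and `_b` would count as not optimized under this rule until the segments are merged into one.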

BarCampNYC: Lucene presentations

2006-01-11 Thread Otis Gospodnetic
Hi, For those in or near New York, this coming weekend there will be a geeky event called BarCampNYC: http://barcamp.org/index.cgi?BarCampNYC A few people will be presenting Lucene, Ferret, and related stuff. I'll be giving away a few copies of Lucene in Action and also presenting Simpy at ht

Re: BTree

2006-01-11 Thread shailesh kumar
I had looked at the document you listed and also used a hex editor to look at the segment files. That is how I came to know about the lexicographic sorting, but I was not sure whether a BTree is used. If I understand correctly a binary tree (i.e., each node has only 2 children) or a high order Ba

Re: Cache index in RAMDirectory and evict

2006-01-11 Thread Otis Gospodnetic
Kan, Some (all?) of what you described will typically be handled for you by the file system. Yes, the JVM would blow up with an OOM error if the index is too big to fit in RAM. Otis - Original Message From: Kan Deng <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Cc: Kan Deng <[EM

Cache index in RAMDirectory and evict

2006-01-11 Thread Kan Deng
Hi there, "Lucene in Action" mentions in Section 3.2.3 "reading indexes into memory" that, "...RAMDirectory's constructor can be used to read a file system-based index into memory, allowing the application that accesses it to benefit from the superior speed of the RAM: RAMDirectory

Re: Generating phrase queries from term queries

2006-01-11 Thread Yonik Seeley
A phrase query with slop scores matching documents higher when the terms are closer together. "a b c"~1 -Yonik On 1/10/06, Eric Jain <[EMAIL PROTECTED]> wrote: > Is there an efficient way to determine if two or more terms frequently > appear next to each other in sequence? For a query like: > >

Re: Boolean Query

2006-01-11 Thread Chris Hostetter
: BooleanQuery query = new BooleanQuery(); : for(Term t: terms) : { : query = new TermQuery(t); : query.add(t, false, false); // is this wrong? : } : : If I construct the query as a string like "A a OR B b OR C" I get many more : results. I assume that the Boolean query uses an AND oper

Re: Generating phrase queries from term queries

2006-01-11 Thread Chris Hostetter
: If you can express each phrase as a SpanNearQuery, the occurrences : of the phrases can be easily obtained by iterating over the result of : getSpans() on SpanNearQuery. : It's not as efficient as a specialized PhraseQuery, though. I think you are misunderstanding his goal. (Assuming *I* unde

Boolean Query

2006-01-11 Thread Klaus
Hi, I have another question: how do I construct a BooleanQuery where the terms within the query are connected with OR? I have a list of terms representing the high-scoring terms in a document. Here is my code: BooleanQuery query = new BooleanQuery(); for(Term t: terms) { query = new Ter

Re: Generating phrase queries from term queries

2006-01-11 Thread Paul Elschot
On Wednesday 11 January 2006 11:33, Eric Jain wrote: > Paul Elschot wrote: > > One way that might be better is to provide your own Scorer > > that works on the term positions of the three or more terms. > > This would be better for performance because it only uses one > > term positions object per

Re: top n words within a results set?

2006-01-11 Thread Chris Brown
Excellent!! Thank you so much! - Original Message - From: "Grant Ingersoll" <[EMAIL PROTECTED]> To: Sent: Wednesday, January 11, 2006 12:07 PM Subject: Re: top n words within a results set? Hey Chris, There is just such an analyzer, called the PerFieldAnalyzerWrapper. The trick i

Re: top n words within a results set?

2006-01-11 Thread Grant Ingersoll
Hey Chris, There is just such an analyzer, called the PerFieldAnalyzerWrapper. The trick is the Analyzer always passes in the Field name when it gets the TokenStream, -Grant Chris Brown wrote: Bear with me, I might be missing something. My documents get indexed ( writer.addDocument(doc

Re: RF and IDF

2006-01-11 Thread Yonik Seeley
Click on "Source Repository" off of the main Lucene page. Here is a pointer to the search package containing TermQuery/Weight/Scorer http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/search/?sortby=file#dirlist Look in TermQuery for TermWeight (it's an inner class).

Re: top n words within a results set?

2006-01-11 Thread Chris Brown
Bear with me, I might be missing something. My documents get indexed (writer.addDocument(doc)) with one IndexWriter created using one Analyzer (the SnowballAnalyzer). So unless you can somehow use a different Analyzer per field, I don't see how the second field will help. If I get the TermF

AW: RF and IDF

2006-01-11 Thread Klaus
Thx, but where can I find these classes? >If you really want to understand how scoring works, I'd suggest also >looking at TermWeight/TermScorer. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL

Re: RF and IDF

2006-01-11 Thread Yonik Seeley
On 1/11/06, Klaus <[EMAIL PROTECTED]> wrote: > Hi all, > > do you know how the tf and idf values are computed by the default > similarity? I mean the exact mathematical equation. Well, here is the default Similarity: /** Expert: Default scoring implementation. */ public class DefaultSimilarity ex
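For readers who want the equations spelled out, here is a minimal standalone sketch. It assumes the formulas used by the DefaultSimilarity of that era, namely tf = sqrt(freq) and idf = ln(numDocs / (docFreq + 1)) + 1. The class name `DefaultTfIdfSketch` is hypothetical, and the Lucene source remains the authority on the exact computation.

```java
// Hypothetical standalone sketch; it does not extend or call any Lucene class.
final class DefaultTfIdfSketch {

    // Term frequency factor: square root of the raw within-document frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Inverse document frequency: ln(numDocs / (docFreq + 1)) + 1.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
}
```

Under these assumptions, a term that occurs 4 times in a document gets a tf of 2, and rarer terms (smaller docFreq) get a larger idf.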

RF and IDF

2006-01-11 Thread Klaus
Hi all, do you know how the tf and idf values are computed by the default similarity? I mean the exact mathematical equation. Thx, Klaus

Re: highlighting phrases

2006-01-11 Thread Erik Hatcher
Harini - you won't find a custom analyzer that does exactly what you've described, but building custom analyzers is pretty straightforward. You can learn a lot about it by looking at the pieces within Lucene's source code or the examples (and text) from Lucene in Action. Reading from an

Re: BTree

2006-01-11 Thread Erik Hatcher
On Jan 11, 2006, at 7:23 AM, shailesh kumar wrote: Does Lucene use a BTree kind of structure for storing the index (at least in memory)? Or is it just a list? Based on the file format in the index directory (wherein the terms are lexicographically sorted in one of the files) I

Re: highlighting phrases

2006-01-11 Thread Harini Raghavan
Hi Erik, I had a look at the SpansExtractor class by Mark, which can convert any Query to spans. But I think ultimately the analyzer that is used to convert the text into a TokenStream is what matters more. I am using the StandardAnalyzer and it seems to return a stream of Tokens where each to

How to check, whether Index is optimized or not?

2006-01-11 Thread Maxim Patramanskij
Hello dear Lucene users! Is there an easy way to check whether an index is optimized or not? Best regards, Max

Re: top n words within a results set?

2006-01-11 Thread Grant Ingersoll
I believe the usual solution is to have a separate field on the same document for display purposes (I am assuming you are trying to display the values of the indexed field) that is not stemmed. The tradeoff is in disk space, of course. Chris Brown wrote: Okay, I've taken Grant's advice an

BTree

2006-01-11 Thread shailesh kumar
Does Lucene use a BTree kind of structure for storing the index (at least in memory)? Or is it just a list? Based on the file format in the index directory (wherein the terms are lexicographically sorted in one of the files) I am not sure if BTree is used. (Because constructing a

Re: top n words within a results set?

2006-01-11 Thread Chris Brown
Okay, I've taken Grant's advice and aggregated the TermFreqVectors for each term in the applicable field. It works quite well; there's just one glitch. Some words like "party" and "picture" appear as "parti" and "pictur". I am using the SnowballAnalyzer; I suspect that's what's changing the word

Re: Generating phrase queries from term queries

2006-01-11 Thread Eric Jain
Paul Elschot wrote: One way that might be better is to provide your own Scorer that works on the term positions of the three or more terms. This would be better for performance because it only uses one term positions object per query term (a, b, and c here). I'm trying to extract the actual phr

Re: Generating phrase queries from term queries

2006-01-11 Thread Andrzej Bialecki
Paul Elschot wrote: On Wednesday 11 January 2006 00:09, Eric Jain wrote: Is there an efficient way to determine if two or more terms frequently appear next to each other in sequence? For a query like: a b c one or more of the following suggestions could be generated: "a b c" "a b" c a "b c"
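As an illustration of the candidate suggestions listed in the question, the following sketch enumerates every way to quote one contiguous run of two or more terms. It only generates the candidate strings; deciding which phrases actually occur frequently in the index is the harder part discussed in this thread and is not attempted here. The class name `PhraseCandidates` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds candidate query strings, does not consult an index.
final class PhraseCandidates {

    // For each contiguous run [start, end] of at least two terms, emit the
    // full term list with that one run wrapped in double quotes.
    static List<String> candidates(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < terms.size(); start++) {
            for (int end = start + 1; end < terms.size(); end++) {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < terms.size(); i++) {
                    if (i > 0) sb.append(' ');
                    if (i == start) sb.append('"');   // open the phrase
                    sb.append(terms.get(i));
                    if (i == end) sb.append('"');     // close the phrase
                }
                out.add(sb.toString());
            }
        }
        return out;
    }
}
```

For the terms a, b, and c this produces the three suggestions from the question, in the order "a b" c, "a b c", a "b c". Each candidate would still need to be checked against the index, for example with a phrase query or a span-based approach as suggested elsewhere in the thread.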