Re: How does Lucene decides which fields have termvectors stored and which not?

2014-08-19 Thread Sachin Kulkarni
Hi Kumaran, See below some part of the code and the .alg file. Here is the function from DocMaker.java from the package "package org.apache.lucene.benchmark.byTask.feeds" /** Set the configuration parameters of this doc maker. */ public void setConfig(Config config, ContentSource source) {

Re: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-19 Thread Trejkaz
Lucene 4.9 gives much the same result. import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.ja.JapaneseAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.document.TextField; import

Re: Lucene Query

2014-08-19 Thread Jin Guang Zheng
Thanks so much, that works. Jin On Tue, Aug 19, 2014 at 4:13 PM, Uwe Schindler wrote: > Hi, > Look at his docs. He has only 2 docs, the second one 3 keywords. > > I would use a simple phrase query with a slop value < Analyzers > positionIncrementGap. This is the gap between fields with same na

Re: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-19 Thread Trejkaz
On Tue, Aug 19, 2014 at 5:27 PM, Uwe Schindler wrote: > Hi, > > You forgot to close (or commit) IndexWriter before opening the reader. Huh? The code I posted is closing it: try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_36, analyser))) {

Re: Lucene Query

2014-08-19 Thread Tri Cao
Oh sorry guys, ignore what I said. I am going to get myself a coffee. Uwe is absolutely correct here. On Aug 19, 2014, at 01:13 PM, Uwe Schindler wrote: Hi, Look at his docs. He has only 2 docs, the second one 3 keywords. I would use a simple phrase query with a slop value < Analyzers positi

Re: Lucene Query

2014-08-19 Thread Uwe Schindler
Hi, Look at his docs. He has only 2 docs, the second one 3 keywords. I would use a simple phrase query with a slop value < Analyzer's positionIncrementGap. This is the gap between fields with the same name. Span or phrase cannot cross the gap, if slop is small enough, but large enough to find the te
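
Editor's sketch of the idea in the message above: a plain-Java model, not Lucene code. It assumes that each additional value of a multi-valued field starts `gap` positions after the previous value ended, and it replaces Lucene's sloppy-phrase matching with a simpler window check (all terms inside a window of `terms - 1 + slop` positions). The class `GapModel` and its methods are hypothetical names used only for illustration.

```java
import java.util.*;

// Simplified model of token positions in a multi-valued field.
// This is NOT the Lucene API; it only illustrates why slop < gap
// keeps a phrase-style match from spanning separate field values.
final class GapModel {

    // Assign a position to every token; each value after the first
    // is shifted by `gap` extra positions (assumed model).
    static Map<String, List<Integer>> positions(List<String> values, int gap) {
        Map<String, List<Integer>> out = new HashMap<>();
        int next = 0;
        for (int v = 0; v < values.size(); v++) {
            if (v > 0) {
                next += gap;
            }
            for (String tok : values.get(v).toLowerCase(Locale.ROOT).split("\\s+")) {
                out.computeIfAbsent(tok, k -> new ArrayList<>()).add(next++);
            }
        }
        return out;
    }

    // True if every term has a position inside some window of
    // (terms.size() - 1 + slop) positions. Term order is ignored.
    static boolean matchesWithin(Map<String, List<Integer>> pos, List<String> terms, int slop) {
        int width = terms.size() - 1 + slop;
        for (List<Integer> starts : pos.values()) {
            for (int start : starts) {
                boolean all = true;
                for (String t : terms) {
                    boolean found = false;
                    for (int p : pos.getOrDefault(t, Collections.<Integer>emptyList())) {
                        if (p >= start && p <= start + width) {
                            found = true;
                            break;
                        }
                    }
                    if (!found) {
                        all = false;
                        break;
                    }
                }
                if (all) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("states", "united", "america");
        int gap = 100;  // assumed gap, chosen for the example
        int slop = 10;  // smaller than the gap, large enough to skip "of"
        System.out.println(matchesWithin(
                positions(Arrays.asList("United States of America"), gap), query, slop));
        System.out.println(matchesWithin(
                positions(Arrays.asList("United", "America", "States"), gap), query, slop));
    }
}
```

In this model the single-valued document matches while the three-valued one does not. With a gap of 0 the three values sit at adjacent positions and both documents would match, which is why the slop has to stay below the gap.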

Re: Lucene Query

2014-08-19 Thread Tri Cao
Whoops, the constraint should be MUST to force all terms present: http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanClause.Occur.html#MUST On Aug 19, 2014, at 01:05 PM, "Tri Cao" wrote: OR operator does that, AND only returns docs with ALL terms present. Note that you h

Re: Lucene Query

2014-08-19 Thread Tri Cao
OR operator does that, AND only returns docs with ALL terms present. Note that you have two options here 1. Create a BooleanQuery object (see the Java doc I linked below) and programmatically add the term queries with the following constraint: http://lucene.apache.org/core/4_6_0/core/org/apache/l

Re: Lucene Query

2014-08-19 Thread Jin Guang Zheng
Thanks for the reply, but won't BooleanQuery return both doc1 and doc2 with the query: label:States AND label:America AND label:United? Best, Jin On Tue, Aug 19, 2014 at 2:07 PM, Tri Cao wrote: > given that example, the easy way is a boolean AND query of all the terms: > > > http://lucene.apache.org/c
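
Editor's sketch of the objection raised above, as a plain-Java model rather than Lucene code. It assumes that an all-terms-required (AND) query only asks whether each term occurs somewhere in the field, regardless of which value it came from. `AllTermsModel` and its methods are hypothetical names.

```java
import java.util.*;

// Models an AND query as "the document's term set contains every query term".
// Not the Lucene API; positions and field-value boundaries are ignored on purpose.
final class AllTermsModel {

    // Collect the lowercased terms of all values of one field.
    static Set<String> termsOf(List<String> values) {
        Set<String> terms = new HashSet<>();
        for (String v : values) {
            terms.addAll(Arrays.asList(v.toLowerCase(Locale.ROOT).split("\\s+")));
        }
        return terms;
    }

    // True if every required term occurs somewhere in the document.
    static boolean matchesAll(Set<String> docTerms, List<String> required) {
        return docTerms.containsAll(required);
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("states", "america", "united");
        System.out.println(matchesAll(termsOf(Arrays.asList("United States of America")), query));
        System.out.println(matchesAll(termsOf(Arrays.asList("United", "America", "States")), query));
    }
}
```

Both documents satisfy the all-terms check in this model, so a pure AND query cannot tell them apart; some position-aware constraint is needed to return only doc 1.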

Re: How does Lucene decides which fields have termvectors stored and which not?

2014-08-19 Thread Sachin Kulkarni
Hi Kumaran, I am using the benchmark utility from Lucene and doing the indexing via an .alg file. Would you like to see the alg file instead? Thank you. Regards, Sachin On Tue, Aug 19, 2014 at 9:42 AM, Kumaran Ramasubramanian wrote: > Hi Sachin > > i want to look into ur indexing cod

Re: Lucene Query

2014-08-19 Thread Tri Cao
given that example, the easy way is a boolean AND query of all the terms: http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanQuery.html However, if your corpus is more sophisticated you'll find that relevance ranking is not always that trivial :) On Aug 19, 2014, at 11:00

Lucene Query

2014-08-19 Thread Jin Guang Zheng
Hi, I am wondering if someone can help me on this: I have index: doc 1 -- label: United States of America doc 2 -- label: United doc 2 -- label: America doc 2 -- label: States I am wondering how to generate a query with terms: states united america so only doc 1 returns. I was thinking Spa

Re: Calculate Term Frequency

2014-08-19 Thread Tri Cao
Erick, Solr termfreq implementation also uses DocsEnum with the assumption that freq is called on ascending doc IDs, which is valid when scoring from the hit list. If freq is requested for an out of order doc, a new DocsEnum has to be created. Bianca, can you explain your use case in more
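
Editor's sketch of the access pattern described above, as a plain-Java model and not the Lucene or Solr implementation. It assumes a postings list is a pair of parallel arrays (ascending doc IDs and their frequencies) read by a forward-only cursor. `PostingsCursor` and `TermFreqLookup` are hypothetical names.

```java
// Forward-only cursor over one term's postings (assumed model, not DocsEnum itself).
final class PostingsCursor {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docIds; // ascending doc IDs containing the term
    private final int[] freqs;  // freqs[i] is the term frequency in docIds[i]
    private int idx = -1;       // -1 means "not positioned yet"

    PostingsCursor(int[] docIds, int[] freqs) {
        this.docIds = docIds;
        this.freqs = freqs;
    }

    int docID() {
        if (idx < 0) {
            return -1;
        }
        return idx >= docIds.length ? NO_MORE_DOCS : docIds[idx];
    }

    // Move forward to the first doc ID >= target; never moves backwards.
    int advance(int target) {
        if (idx < 0) {
            idx = 0;
        }
        while (idx < docIds.length && docIds[idx] < target) {
            idx++;
        }
        return docID();
    }

    int freq() {
        return freqs[idx];
    }
}

// Reuses one cursor while requests arrive in ascending doc order and
// creates a fresh cursor whenever an earlier doc is requested.
final class TermFreqLookup {
    private final int[] docIds;
    private final int[] freqs;
    private PostingsCursor cursor;
    int cursorsCreated = 0;

    TermFreqLookup(int[] docIds, int[] freqs) {
        this.docIds = docIds;
        this.freqs = freqs;
    }

    int termFreq(int doc) {
        if (cursor == null || doc < cursor.docID()) {
            cursor = new PostingsCursor(docIds, freqs); // out of order: start over
            cursorsCreated++;
        }
        return cursor.advance(doc) == doc ? cursor.freq() : 0;
    }
}
```

Looking up docs 1, 7, 4 in that order needs two cursors in this model, while 1, 4, 7 needs only one, which is the cost difference the message points at.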

Re: Calculate Term Frequency

2014-08-19 Thread Michael Sokolov
Have you looked into term vectors? I think they should fit your bill pretty neatly. Here's a nice blog post with helpful background info: http://blog.jpountz.net/post/41301889664/putting-term-vectors-on-a-diet -Mike On 8/19/2014 10:04 AM, Bianca Pereira wrote: Hi everybody, I would like

Re: Calculate Term Frequency

2014-08-19 Thread Erick Erickson
Hmmm, I'm not at all an expert here, but Solr has a function query "termfreq" that does what you're doing I think? I wonder if the code for that function query would be a good place to copy (or even make use of)? See TermFreqValueSource... Maybe not helpful at all, but... Erick On Tue, Aug 19, 20

Re: How does Lucene decides which fields have termvectors stored and which not?

2014-08-19 Thread Kumaran Ramasubramanian
Hi Sachin, I want to look into your indexing code. Please share it. - Kumaran R On Tue, Aug 19, 2014 at 7:18 PM, Sachin Kulkarni wrote: > Hi, > > Sorry for all the code, It got sent out accidentally. > > The following code is part of the Benchmark utility in Lucene, specifically > Subm

Calculate Term Frequency

2014-08-19 Thread Bianca Pereira
Hi everybody, I would like to know your suggestions to calculate Term Frequency in a Lucene document. Currently I am using MultiFields.getTermDocsEnum, iterating through the DocsEnum 'de' returned and getting the frequency with de.freq() for the desired document. My solution gives me the resu

Re: How does Lucene decides which fields have termvectors stored and which not?

2014-08-19 Thread Sachin Kulkarni
Hi, Sorry for all the code, It got sent out accidentally. The following code is part of the Benchmark utility in Lucene, specifically SubmissionReport.java // Here reader is the IndexReader. Iterator itr = docMap.entrySet().iterator(); int totalNumDocuments = reader.numDocs();

Re: How does Lucene decides which fields have termvectors stored and which not?

2014-08-19 Thread Sachin Kulkarni
Hi Kumaran, The following code is part of the Benchmark utility in Lucene, specifically SubmissionReport.java Iterator itr = docMap.entrySet().iterator(); int totalNumDocuments = reader.numDocs(); ScoreDoc sd[] = td.scoreDocs; String sep = " \t "; DocNameExtractor docext = new DocNameExtracto

RE: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-19 Thread Uwe Schindler
Hi, You forgot to close (or commit) IndexWriter before opening the reader. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Trejkaz [mailto:trej...@trypticon.org] > Sent: Tuesday, August 19, 2014 6:

Re: How does Lucene decides which fields have termvectors stored and which not?

2014-08-19 Thread Kumaran Ramasubramanian
Hi Sachin Kulkarni, If possible, please share your code. - Kumaran R On Tue, Aug 19, 2014 at 9:07 AM, Sachin Kulkarni wrote: > Hi, > > I am using Lucene 4.6.0. > > I have been storing 5 fields for my documents in the index, namely body, > title, docname, docdate and docid. > > But whe