Term/Phrase frequencies
Hi, I am new to Lucene. If I want to know the term or phrase frequency of an input document, is that possible with Lucene? Thanks, Manjula
Re: Term/Phrase frequencies
Hi Erick, Thanks for the reply. What I want to do is to identify key terms and key phrases of a document according to their number of occurrences in the document. The output should be the highest-frequency words and (two- or three-word) phrases. Can I use Lucene for this purpose? Thanks Manjula On Thu, May 6, 2010 at 6:09 PM, Erick Erickson wrote: > Terms are relatively easy, see TermFreqVector in the JavaDocs. > > Phrases aren't as easy, before you go there, though, what is the > high-level problem you're trying to solve? Possibly this is an XY problem > (see http://people.apache.org/~hossman/#xyproblem). > > Best > Erick > > On Thu, May 6, 2010 at 6:39 AM, manjula wijewickrema >wrote: > > > Hi, > > > > I am new to Lucene. If I want to know the term or phrase frequency of an > > input document, is that possible with Lucene? > > > > Thanks, > > Manjula > > >
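Since Erick notes that phrase frequencies are not directly available from Lucene, here is a minimal plain-Java sketch (no Lucene involved; the class and method names are my own) of counting word and two-word-phrase occurrences with a HashMap, which may be all this use case needs:

```java
import java.util.HashMap;
import java.util.Map;

public class FreqSketch {
    // Count single-word and two-word-phrase (bigram) occurrences in a text.
    static Map<String, Integer> countTermsAndBigrams(String text) {
        String[] words = text.toLowerCase().split("\\W+");
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i < words.length; i++) {
            if (words[i].isEmpty()) continue; // skip artifacts of leading punctuation
            counts.merge(words[i], 1, Integer::sum);                              // word
            if (i + 1 < words.length && !words[i + 1].isEmpty()) {
                counts.merge(words[i] + " " + words[i + 1], 1, Integer::sum);     // bigram
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = countTermsAndBigrams("to be or not to be");
        System.out.println(c.get("to"));    // 2
        System.out.println(c.get("to be")); // 2
    }
}
```

Sorting the resulting map by value then gives the highest-frequency words and phrases; note that unlike Lucene's analyzers this does no stop-word removal or stemming.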
Trace only exactly matching terms!
Hi, I am using Lucene 2.9.1. I have downloaded and run the 'HelloLucene.java' class, modifying the input document and user query in various ways. Once I put the document sentence as 'Lucene in actions' instead of 'Lucene in action', gave the query as 'action', and ran the program. But it did not show me 'Lucene in actions' as a hit! What is the reason for this? Why doesn't it treat the word 'actions' as a hit? Does Lucene identify only exactly matching words? Thanks Manjula
Re: Trace only exactly matching terms!
Hi Anshum & Erick, As you have mentioned, I used SnowballAnalyzer for stemming purposes. It worked nicely. Thanks a lot for your guidance. Manjula. On Fri, May 7, 2010 at 8:27 PM, Erick Erickson wrote: > The other approach is to use a stemmer both at index and query time. > > BTW, it's very easy to make a "custom" analyzer by chaining together > the Tokenizer and as many filters (e.g. PorterStemFilter), essentially > composing your analyzer from various pre-built Lucene parts. > > HTH > Erick > > On Fri, May 7, 2010 at 9:07 AM, Anshum wrote: > > > Hi Manjula, > > Yes lucene by default would only tackle exact term matches unless you use a > > custom analyzer to expand the index/query. > > > > -- > > Anshum Gupta > > http://ai-cafe.blogspot.com > > > > The facts expressed here belong to everybody, the opinions to me. The > > distinction is yours to draw > > > > On Fri, May 7, 2010 at 2:22 PM, manjula wijewickrema < manjul...@gmail.com > >wrote: > > > [original question quoted, snipped]
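As a side note on why 'actions' failed to match 'action': a toy illustration of the stemming idea. This naive suffix-stripper is NOT the Porter algorithm and the class name is made up; in real code you would chain a PorterStemFilter or use SnowballAnalyzer as discussed above, applied at both index and query time so that both sides reduce to the same term:

```java
public class StemSketch {
    // Toy suffix stripper, only to illustrate why stemming makes
    // "actions" and "action" land on the same index term.
    static String naiveStem(String word) {
        if (word.endsWith("s") && word.length() > 3) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        // Without stemming the index holds "actions" and the query holds
        // "action" -- different terms, so no hit. With the same stemmer on
        // both sides they agree:
        System.out.println(naiveStem("actions").equals(naiveStem("action"))); // true
    }
}
```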
Class_for_HighFrequencyTerms
Hi, If I index a document (a single document) in Lucene, how can I get the term frequencies (even just the first- and second-highest-occurring terms) of that document? Is there any class/method to do that? If anybody knows, please help me. Thanks Manjula
Re: Class_for_HighFrequencyTerms
Dear Erick, I looked for it and even added IndexReader.java and TermFreqVector.java from http://www.jarvana.com/jarvana/search?search_type=class&java_class=org.apache.lucene.index.IndexReader . But after adding them, the system indicated a lot of errors in the source code of IndexReader.java (e.g.: DirectoryOwningReader cannot be resolved to a type, indexCommit cannot be resolved to a type, SegmentInfos cannot be resolved, TermEnum cannot be resolved to a type, etc.). I am using Lucene 2.9.1 and this particular website has listed this source code under the 2.9.1 version of Lucene. What is the reason for this kind of scenario? Do I have to add another JAR file? (In order to solve this I even added lucene-core-2.9.1-sources.jar, but nothing happened.) Please be kind enough to reply. Thanks Manjula On Tue, May 11, 2010 at 1:26 AM, Erick Erickson wrote: > Have you looked at TermFreqVector? > > Best > Erick > > On Mon, May 10, 2010 at 8:10 AM, manjula wijewickrema > wrote: > > > Hi, > > > > If I index a document (single document) in Lucene, then how can I get the > > term frequencies (even the first and second highest occurring terms) of > that > > document? Is there any class/method to do that? If anybody knows, please > help > > me. > > > > Thanks > > Manjula > > >
Re: Class_for_HighFrequencyTerms
thanks On Tue, May 11, 2010 at 3:31 PM, wrote: > Sounds like your path is messed up and you're not using maven correctly. > Start with the jar version that contains the class you require and use maven > pom to correctly resolve dependencies > Adam > Sent using BlackBerry® from Orange > > -Original Message- > From: manjula wijewickrema > Date: Tue, 11 May 2010 15:13:12 > To: > Subject: Re: Class_for_HighFrequencyTerms > > [original message quoted in full, snipped] >
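For reference, Adam's suggestion boils down to declaring the published 2.9.1 artifact as a dependency instead of copying source files into the project. A minimal sketch of the relevant pom.xml fragment, assuming Maven and the coordinates under which Lucene 2.x core was published to Maven Central (worth double-checking, especially for contrib jars):

```xml
<dependencies>
  <!-- the compiled core jar; lucene-core-2.9.1-sources.jar is for
       reading only and should not be added to the build path -->
  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>2.9.1</version>
  </dependency>
</dependencies>
```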
Error of the code
Dear All, I am trying to get the term frequencies (through TermFreqVector) of a document (using Lucene 2.9.1). In order to do that I have used the following code, but there is a compile-time error in it and I can't figure it out. Could somebody guide me on what's wrong with it?

Compile-time error I got: Cannot make a static reference to the non-static method getTermFreqVector(int, String) from the type IndexReader.

Code:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;
import java.io.IOException;

public class DemoTest {
  public static void main(String[] args) {
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
    try {
      Directory directory = new RAMDirectory();
      IndexWriter iwriter = new IndexWriter(directory, analyzer, true,
          new IndexWriter.MaxFieldLength(25000));
      Document doc = new Document();
      String text = "This is the text to be indexed.";
      doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.ANALYZED,
          Field.TermVector.WITH_POSITIONS_OFFSETS));
      iwriter.addDocument(doc);
      TermFreqVector vector = IndexReader.getTermFreqVector(0, "fieldname");
      int size = vector.size();
      for (String term : vector.getTerms())
        System.out.println("size = " + size);
      iwriter.close();
      IndexSearcher isearcher = new IndexSearcher(directory, true);
      QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
      Query query = parser.parse("text");
      ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
      System.out.println("hits.length(1) = " + hits.length);
      // Iterate through the results:
      for (int i = 0; i < hits.length; i++) {
        Document hitDoc = isearcher.doc(hits[i].doc);
        System.out.println("hitDoc.get(\"fieldname\") (This is the text to be indexed) = "
            + hitDoc.get("fieldname"));
      }
      isearcher.close();
      directory.close();
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}

Thanks in advance Manjula
Re: Error of the code
Dear Ian, Thanks a lot for your immediate reply. As you suggested, I replaced the lines as follows.

IndexReader ir = IndexReader.open(directory);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");

Now the error has vanished, thanks. But I still can't see the results, although I have moved those lines after iwriter.close(). What's the reason for this?

Sample code after modifications:

String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
iwriter.addDocument(doc);
iwriter.close();
IndexReader ir = IndexReader.open(directory);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");
int size = vector.size();
for (String term : vector.getTerms())
    System.out.println("size = " + size);
IndexSearcher isearcher = new IndexSearcher(directory, true);
..

I appreciate your kind cooperation Manjula On Thu, May 13, 2010 at 3:45 PM, Ian Lea wrote: > You need to replace this: > > TermFreqVector vector = IndexReader.getTermFreqVector(0, "fieldname" ); > > with > > IndexReader ir = whatever(...); > TermFreqVector vector = ir.getTermFreqVector(0, "fieldname" ); > > And you'll need to move it to after the writer.close() call if you > want it to see the doc you've just added. > > -- > Ian. > > On Thu, May 13, 2010 at 11:07 AM, manjula wijewickrema > wrote: > > Dear All, > > > > I am trying to get the term frequencies (through TermFreqVector) of a > > document (using Lucene 2.9.1). In order to do that I have used the following > > code, but there is a compile-time error in it and I can't figure it out. > > Compile-time error I got: > > Cannot make a static reference to the non-static method > > getTermFreqVector(int, String) from the type IndexReader.
> > > > Code: [quoted in full in the previous message, snipped]
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Error of the code
Hi Ian, Thanks for your reply. vector.size() returns the total number of indexed terms in the index. However, I was finally able to run the program and get the results with your help. Thanks a lot. Manjula On Thu, May 13, 2010 at 6:52 PM, Ian Lea wrote: > What does vector.size() return? You don't appear to be doing anything > with the String term in "for ( String term : vector.getTerms() )" - > presumably you intend to. > > > -- > Ian. > > On Thu, May 13, 2010 at 1:16 PM, manjula wijewickrema > wrote: > > Dear Ian, > > > > Thanks a lot for your immediate reply. As you suggested I replaced the > > lines as follows. > > > > IndexReader ir=IndexReader.open(directory); > > > > TermFreqVector vector=ir.getTermFreqVector(0,"fieldname"); > > > > Now the error has vanished, thanks. But I still can't see the > > results, although I have moved those lines after iwriter.close(). What's > the > > reason for this? > > > > sample code after modifications: > > . > > > > String text = "This is the text to be indexed."; > > > > doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS)); > > > > iwriter.addDocument(doc); > > > > iwriter.close(); > > > > IndexReader ir=IndexReader.open(directory); > > > > TermFreqVector vector=ir.getTermFreqVector(0,"fieldname"); > > > > int size = vector.size(); > > > > for ( String term : vector.getTerms() ) > > > > System.out.println( "size = " + size ); > > > > IndexSearcher isearcher = new IndexSearcher(directory, true); > > .. > > ..
> > I appreciate your kind cooperation > > Manjula > > > > [rest of the quoted thread snipped]
Access indexed terms
Hi, Is it possible to put the indexed terms into an array in Lucene? For example, imagine I have indexed a single document in Lucene and now I want to access the terms in the index. Is it possible to retrieve (call) those terms as array elements? If it is possible, then how? Thanks, Manjula
Re: Access indexed terms
Hi Andrzej Thanks for the reply. But as you have mentioned, creating arrays for indexed terms seems to be a little difficult. My intention here is to find the term frequencies of an indexed document. I can find the term frequency of a particular term (given as a query) if I specify the term in the code. But what I really want is to get the term frequency (the number of times it appears in the document) of all indexed terms (or the high-frequency terms) without naming them in the code. Is there an alternative way to do that? Thanks Manjula On Fri, May 14, 2010 at 4:00 PM, Andrzej Bialecki wrote: > On 2010-05-14 11:35, manjula wijewickrema wrote: > > Hi, > > > > Is it possible to put the indexed terms into an array in lucene. For > > example, imagine I have indexed a single document in Lucene and now I want > > to access those terms in the index. Is it possible to retrieve (call) those > > terms as array elements? If it is possible, then how? > > In short: unless you created TermFrequencyVector when adding the > document, the answer is "with great difficulty". > > For a working code that does this see here: > > > http://code.google.com/p/luke/source/browse/trunk/src/org/getopt/luke/DocReconstructor.java > > If you really need such kind of access in your application then add your > documents with term vectors with offsets and positions. Even then, > depending on the Analyzer you used, the process is lossy - some input > data that was discarded by Analyzer is simply no longer available. > > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ _ __ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Re: Access indexed terms
Dear Andrzej, Thanks for your valuable help. I also noticed this HighFreqTerms approach in the Lucene email archive and tried to use it. In order to do that I downloaded lucene-misc-2.9.1.jar and added the org.apache.lucene.misc package to my project. Now I think I have to call this HighFreqTerms class in my code, but I was unable to find any guidance on how to do it. Please be kind enough to tell me how I can use this class in my code. Thanks Manjula On Fri, May 14, 2010 at 6:16 PM, Andrzej Bialecki wrote: > On 2010-05-14 14:24, manjula wijewickrema wrote: > > Hi Andrzej > > > > Thanks for the reply. But as you have mentioned, creating arrays for indexed > > terms seems to be a little difficult. My intention here is to find the term > > frequencies of an indexed document. I can find the term frequency > > of a particular term (given as a query) if I specify the term in the code. > > But what I really want is to get the term frequency (the number of times > > it appears in the document) of all indexed terms (or the high frequency > > terms) without naming them in the code. Is there an alternative way to do > > that? > > Yes, see the discussion here: > > https://issues.apache.org/jira/browse/LUCENE-2393 > > > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ _ __ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
How to call high freq. terms using HighFreqTerms class
Hi, I am struggling with using the HighFreqTerms class for the purpose of finding high-frequency terms in my index. My target is to get the high-frequency terms of an indexed document (single document). To do that I have added the org.apache.lucene.misc package to my project. I think up to that point I am correct, but after that I have no idea of how to call this in my code. Although I have looked in the Lucene email archive, I was unable to find a hint regarding the calling of this class. If anybody can, please give me a sample code for using this class (and the relevant methods) which suits my purpose. I appreciate your kind help. Thanks Manjula
Re: How to call high freq. terms using HighFreqTerms class
hi Erick, Thanks On Sat, May 15, 2010 at 5:37 PM, Erick Erickson wrote: > It looks like a stand-alone program, so you don't call it. > You probably want to get the source code and take a look at > how that program works to get an idea of how to do what you want. > > See the instructions here for getting the source: > http://wiki.apache.org/lucene-java/HowToContribute > > HTH > Erick > > On Sat, May 15, 2010 at 1:49 AM, manjula wijewickrema > wrote: > > > [original question quoted, snipped]
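For anyone landing here later: conceptually, a "high frequency terms" scan just keeps the N largest counts while iterating over terms. A self-contained sketch of that selection step in plain Java (class and method names are my own; this is not the HighFreqTerms source, which additionally walks the index's TermEnum):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopTerms {
    // Keep the N highest-frequency terms using a small min-heap:
    // offer every entry, evict the current minimum once the heap exceeds N.
    static List<String> topN(Map<String, Integer> freqs, int n) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) heap.poll(); // drop the smallest count
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(0, heap.poll().getKey()); // highest first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<>();
        m.put("lucen", 4); m.put("code", 2); m.put("over", 1);
        System.out.println(topN(m, 2)); // [lucen, code]
    }
}
```

The heap keeps memory bounded at N entries even when the term dictionary is large, which is why this shape suits an index-wide scan better than sorting everything.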
Problem of getTermFrequencies()
Hi, I wrote a program to display the indexed terms of a single document and get their term frequencies. Although it displays the terms in the index, it does not give the term frequencies. Instead it displays 'frequencies are:[...@80fa6f'. What's the reason for this? The code I have written and the display are given below.

Code:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.apache.lucene.index.TermFreqVector;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

public class Testing {
  public static void main(String[] args) throws IOException, ParseException {
    //StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
        StopAnalyzer.ENGLISH_STOP_WORDS);
    try {
      Directory directory = new RAMDirectory();
      IndexWriter w = new IndexWriter(directory, analyzer, true,
          IndexWriter.MaxFieldLength.UNLIMITED);
      Document doc = new Document();
      String text = "This is a sample codes code for testing lucene's capabilities over lucene term frequencies";
      doc.add(new Field("title", text, Field.Store.YES, Field.Index.ANALYZED,
          Field.TermVector.YES));
      w.addDocument(doc);
      w.close();
      IndexReader ir = IndexReader.open(directory);
      TermFreqVector[] tfv = ir.getTermFreqVectors(0);
      // for (int xy = 0; xy < tfv.length; xy++) {
      String[] terms = tfv[0].getTerms();
      int[] freqs = tfv[0].getTermFrequencies();
      //System.out.println("terms are:"+tfv[xy]);
      //System.out.println("length is:"+terms.length);
      System.out.println("array terms are:" + tfv[0]);
      System.out.println("terms are:" + terms);
      System.out.println("frequencies are:" + freqs);
      // }
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}

Display:

array terms are:{title: capabl/1, code/2, frequenc/1, lucen/2, over/1, sampl/1, term/1, test/1}
terms are:[Ljava.lang.String;@1e13d52
frequencies are:[...@80fa6f

Can somebody please help me to get the desired output? Thanks, Manjula.
Re: Problem of getTermFrequencies()
Dear Ian, I changed it as you said and now it is working nicely. Thanks a lot for your kind help. Manjula On Mon, May 17, 2010 at 6:46 PM, Ian Lea wrote: > terms and freqs are arrays. Try terms[i] and freqs[i]. > > > -- > Ian. > > > On Mon, May 17, 2010 at 12:23 PM, manjula wijewickrema > wrote: > > [original message and code quoted, snipped] > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Re: Problem of getTermFrequencies()
Thanks.

On Mon, May 17, 2010 at 10:19 PM, Grant Ingersoll wrote:
> Note, depending on your downstream use, you may consider using a
> TermVectorMapper that allows you to construct your own data structures as
> needed.
>
> -Grant
>
> On May 17, 2010, at 3:16 PM, Ian Lea wrote:
>
> > terms and freqs are arrays. Try terms[i] and freqs[i].
> >
> > --
> > Ian.
> >
> > On Mon, May 17, 2010 at 12:23 PM, manjula wijewickrema wrote:
> >> Hi,
> >>
> >> I wrote some code intended to display the indexed terms of a single
> >> document together with their term frequencies. Although it displays the
> >> terms in the index, it does not give the term frequencies; instead it
> >> displays 'frequencies are:[...@80fa6f'. What is the reason for this?
> >> The code I have written and its output are given below.
> >>
> >> Code:
> >>
> >> import java.io.BufferedReader;
> >> import java.io.FileReader;
> >> import java.io.IOException;
> >> import org.apache.lucene.analysis.StopAnalyzer;
> >> import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
> >> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> >> import org.apache.lucene.document.Document;
> >> import org.apache.lucene.document.Field;
> >> import org.apache.lucene.index.IndexReader;
> >> import org.apache.lucene.index.IndexWriter;
> >> import org.apache.lucene.index.TermFreqVector;
> >> import org.apache.lucene.queryParser.ParseException;
> >> import org.apache.lucene.queryParser.QueryParser;
> >> import org.apache.lucene.search.*;
> >> import org.apache.lucene.store.Directory;
> >> import org.apache.lucene.store.RAMDirectory;
> >> import org.apache.lucene.util.Version;
> >>
> >> public class Testing {
> >>
> >>   public static void main(String[] args) throws IOException, ParseException {
> >>     // StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
> >>     SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
> >>         StopAnalyzer.ENGLISH_STOP_WORDS);
> >>     try {
> >>       Directory directory = new RAMDirectory();
> >>       IndexWriter w = new IndexWriter(directory, analyzer, true,
> >>           IndexWriter.MaxFieldLength.UNLIMITED);
> >>       Document doc = new Document();
> >>       String text = "This is a sample codes code for testing lucene's capabilities over lucene term frequencies";
> >>       doc.add(new Field("title", text, Field.Store.YES, Field.Index.ANALYZED,
> >>           Field.TermVector.YES));
> >>       w.addDocument(doc);
> >>       w.close();
> >>       IndexReader ir = IndexReader.open(directory);
> >>       TermFreqVector[] tfv = ir.getTermFreqVectors(0);
> >>       String[] terms = tfv[0].getTerms();
> >>       int[] freqs = tfv[0].getTermFrequencies();
> >>       System.out.println("array terms are:" + tfv[0]);
> >>       System.out.println("terms are:" + terms);
> >>       System.out.println("frequencies are:" + freqs);
> >>     } catch (Exception ex) {
> >>       ex.printStackTrace();
> >>     }
> >>   }
> >> }
> >>
> >> Display:
> >>
> >> array terms are:{title: capabl/1, code/2, frequenc/1, lucen/2, over/1,
> >> sampl/1, term/1, test/1}
> >> terms are:[Ljava.lang.String;@1e13d52
> >> frequencies are:[...@80fa6f
> >>
> >> If somebody can please help me to get the desired output.
> >>
> >> Thanks,
> >> Manjula.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
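As Ian says, terms and freqs are parallel arrays, so passing an array reference straight to println only prints its identity hash (the 'frequencies are:[...@80fa6f' above is the default toString of an int[]). A stdlib-only sketch of the fix, with made-up data standing in for getTerms()/getTermFrequencies():

```java
import java.util.Arrays;

public class ParallelArrayPrint {
    public static void main(String[] args) {
        // Stand-ins for TermFreqVector.getTerms() / getTermFrequencies()
        String[] terms = {"capabl", "code", "frequenc", "lucen"};
        int[] freqs = {1, 2, 1, 2};

        // Printing the reference gives something like "[I@80fa6f":
        System.out.println("frequencies are:" + freqs);

        // Index into both arrays in step, as suggested in the reply:
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + "/" + freqs[i]);
        }

        // Or dump the whole array at once:
        System.out.println(Arrays.toString(freqs));
    }
}
```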
Arrange terms[i]
Hi,

I wrote a program to get the terms and frequencies of an indexed document. The output comes as follows.

If I print: +tfv[0]

Output:

array terms are:{title: capabl/1, code/2, frequenc/1, lucen/4, over/1, sampl/1, term/4, test/1}

In the same way I can print terms[i] and freqs[i], but the problem is that the elements of terms[i] come out in English alphabetical order (as above), and freqs[i] is arranged in that same order. Is there a way to arrange terms[i] in ascending/descending order of frequency?

Thanks in advance.
Manjula
Re: Arrange terms[i]
Dear Grant, Thanks for your reply. Manjula On Mon, May 24, 2010 at 4:37 PM, Grant Ingersoll wrote: > > On May 20, 2010, at 5:15 AM, manjula wijewickrema wrote: > > > Hi, > > > > I wrote aprogram to get the ferquencies and terms of an indexed document. > > The output comes as follows; > > > > > > If I print : +tfv[0] > > > > Output: > > > > array terms are:{title: capabl/1, code/2, frequenc/1, lucen/4, over/1, > > sampl/1, term/4, test/1} > > > > In the same way I can print terms[i] and freqs[i], but the problem is > while > > I am printing terms[i], output (array elements) comes according to the > > English alphabetic order (as above) and freqs[i] also arrange according > that > > particular order. Is there a way to arrange terms[i] according to the > > ascending/descending order of their frequencies? > > Yes, have a look at the TermVectorMapper. You will need to implement a > variation of this to build up the data structures you need. > > -Grant > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
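Grant's TermVectorMapper is the general answer; for the simple case, since terms[] and freqs[] are parallel arrays, another route is to sort an index permutation by frequency and read both arrays through it. A stdlib-only sketch with made-up data (no Lucene types involved):

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortByFreq {
    // Returns the indices of freqs sorted by descending frequency,
    // so terms[order[0]] is the most frequent term. Stable for ties.
    static Integer[] descendingOrder(final int[] freqs) {
        Integer[] order = new Integer[freqs.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return freqs[b] - freqs[a]; // higher frequency first
            }
        });
        return order;
    }

    public static void main(String[] args) {
        String[] terms = {"capabl", "code", "frequenc", "lucen", "term"};
        int[] freqs =   {1,        2,      1,           4,       4};
        for (Integer i : descendingOrder(freqs)) {
            System.out.println(terms[i] + "/" + freqs[i]);
        }
    }
}
```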
How to get file names instead of paths?
Hi,

Using the following program I was able to get the full file paths of indexed files that matched the given queries. But my intention is to get only the file names, without even the .txt extension, as I need to send these file names as labels to another application. So, please let me know how I can get only the file names in the following code.

Thanks in advance!
Manjula.

My code:

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class LuceneDemo {

  public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
  public static final String INDEX_DIRECTORY = "indexDirectory";
  public static final String FIELD_PATH = "path";
  public static final String FIELD_CONTENTS = "contents";

  public static void main(String[] args) throws Exception {
    createIndex();
    searchIndex("rice");
    searchIndex("milk");
    searchIndex("banana");
    searchIndex("foo");
  }

  public static void createIndex() throws CorruptIndexException,
      LockObtainFailedException, IOException {
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
        StopAnalyzer.ENGLISH_STOP_WORDS);
    boolean recreateIndexIfExists = true;
    IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer,
        recreateIndexIfExists);
    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();
    for (File file : files) {
      Document document = new Document();
      String path = file.getCanonicalPath();
      document.add(new Field(FIELD_PATH, path, Field.Store.YES,
          Field.Index.UN_TOKENIZED, Field.TermVector.YES));
      Reader reader = new FileReader(file);
      document.add(new Field(FIELD_CONTENTS, reader));
      indexWriter.addDocument(document);
    }
    indexWriter.optimize();
    indexWriter.close();
  }

  public static void searchIndex(String searchString) throws IOException, ParseException {
    System.out.println("Searching for '" + searchString + "'");
    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    IndexReader indexReader = IndexReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
        StopAnalyzer.ENGLISH_STOP_WORDS);
    QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
    Query query = queryParser.parse(searchString);
    Hits hits = indexSearcher.search(query);
    System.out.println("Number of hits: " + hits.length());
    TopDocs results = indexSearcher.search(query, 10);
    ScoreDoc[] hits1 = results.scoreDocs;
    for (ScoreDoc hit : hits1) {
      Document doc = indexSearcher.doc(hit.doc);
      System.out.printf("%5.3f %s\n", hit.score, doc.get(FIELD_CONTENTS));
    }
    Iterator it = hits.iterator();
    while (it.hasNext()) {
      Hit hit = (Hit) it.next();
      Document document = hit.getDocument();
      String path = document.get(FIELD_PATH);
      System.out.println("Hit: " + path);
    }
  }
}
Re: How to get file names instead of paths?
Dear Ian,

The segment you suggested works nicely. Thanks a lot for your kind help.
Manjula.

On Fri, Jun 11, 2010 at 4:00 PM, Ian Lea wrote:
> Something like this
>
> File f = new File(path);
> String fn = f.getName();
> return fn.substring(0, fn.lastIndexOf("."));
>
> --
> Ian.
>
> On Fri, Jun 11, 2010 at 11:20 AM, manjula wijewickrema wrote:
> > Hi,
> >
> > Using the following program I was able to get the entire file path of
> > indexed files which matched the given queries. But my intention is to
> > get only the file names, without even the .txt extension, as I need to
> > send these file names as labels to another application. So, please let
> > me know how I can get only the file names in the following code.
> >
> > Thanks in advance!
> > Manjula.
> >
> > [...]
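Ian's three lines can be wrapped into a small helper; the no-dot fallback here is an addition of mine, not part of the original snippet:

```java
import java.io.File;

public class FileNameOnly {
    // Strip directory and extension: "filesToIndex/deron-foods.txt" -> "deron-foods"
    static String baseName(String path) {
        String fn = new File(path).getName();
        int dot = fn.lastIndexOf('.');
        return dot == -1 ? fn : fn.substring(0, dot); // no dot: return name as-is
    }

    public static void main(String[] args) {
        System.out.println(baseName("filesToIndex/deron-foods.txt")); // deron-foods
    }
}
```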
Lucene Scoring
Hi,

In my application I input only a single-term query (one at a time) and get back the corresponding scores for those queries. But I am struggling a little to understand Lucene scoring. I have referred to http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html and some other pages, but some questions remain.

1) Why does the score function take the square root of the frequency as the tf value, and the square of the idf value?

2) If I enter a single-term query, what will coord(q,d) return? Since there is always exactly one term in the query, I think it should always be 1. Am I correct?

3) I am also struggling to understand sumOfSquaredWeights (in queryNorm(q)). As I understand it, this value depends on the nature of the input query, and accordingly different query classes are used, such as TermQuery, MultiTermQuery, BooleanQuery, WildcardQuery, PhraseQuery, PrefixQuery, etc. But if I always use a single-term query, which of the above will the system select?

I would appreciate any help in resolving these problems.

Regards,
Manjula
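To make questions 2) and 3) concrete: for a one-term TermQuery under the default Similarity (all boosts left at 1), coord(q,d) is 1 and sumOfSquaredWeights is idf squared, so queryNorm = 1/idf and the whole score collapses to tf times idf times fieldNorm. A stdlib-only arithmetic sketch with made-up numbers (not actual Lucene output; fieldNorm is simply assumed):

```java
public class SingleTermScore {
    // DefaultSimilarity pieces for a one-term query, all boosts = 1:
    //   score = coord * queryNorm * tf * idf^2 * fieldNorm
    // which collapses to tf * idf * fieldNorm, since queryNorm = 1/idf.
    static double score(int freq, int docFreq, int numDocs, double fieldNorm) {
        double tf = Math.sqrt(freq);                          // sqrt of in-document frequency
        double idf = 1 + Math.log((double) numDocs / (docFreq + 1));
        double queryNorm = 1 / Math.sqrt(idf * idf);          // sumOfSquaredWeights = idf^2
        double coord = 1.0;                                   // 1 of 1 query terms matched
        return coord * queryNorm * tf * idf * idf * fieldNorm;
    }

    public static void main(String[] args) {
        // Assumed numbers: 1-document index, term occurs 3 times, fieldNorm 0.25
        System.out.println(score(3, 1, 1, 0.25));
    }
}
```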
Re: Lucene Scoring
Dear Grant,

Thanks a lot for your guidance. As you mentioned, I tried to use the explain() method to get explanations for the scoring, but once I call explain(), the system reports the following error.

Error:
'The method explain(Query,int) in the type Searcher is not applicable for the arguments (String, int)'.

In my code I call explain() as follows:
Searcher.explain("rice",0);

Possibly something is wrong with the way I pass the parameters. In my case I have chosen "rice" as my query and have indexed only one document.

Could you please let me know what is wrong with this? The code is included below.

Thanks,
Manjula

code:

import org.apache.lucene.search.Searcher;

public class LuceneDemo {

  public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
  public static final String INDEX_DIRECTORY = "indexDirectory";
  public static final String FIELD_PATH = "path";
  public static final String FIELD_CONTENTS = "contents";

  public static void main(String[] args) throws Exception {
    createIndex();
    searchIndex("rice");
  }

  public static void createIndex() throws CorruptIndexException,
      LockObtainFailedException, IOException {
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
        StopAnalyzer.ENGLISH_STOP_WORDS);
    boolean recreateIndexIfExists = true;
    IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer,
        recreateIndexIfExists);
    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();
    for (File file : files) {
      Document document = new Document();
      String path = file.getCanonicalPath();
      document.add(new Field(FIELD_PATH, path, Field.Store.YES,
          Field.Index.UN_TOKENIZED, Field.TermVector.YES));
      Reader reader = new FileReader(file);
      document.add(new Field(FIELD_CONTENTS, reader));
      indexWriter.addDocument(document);
    }
    indexWriter.optimize();
    indexWriter.close();
  }

  public static void searchIndex(String searchString) throws IOException, ParseException {
    System.out.println("Searching for '" + searchString + "'");
    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    IndexReader indexReader = IndexReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
        StopAnalyzer.ENGLISH_STOP_WORDS);
    QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
    Query query = queryParser.parse(searchString);
    Hits hits = indexSearcher.search(query);
    System.out.println("Number of hits: " + hits.length());
    TopDocs results = indexSearcher.search(query, 10);
    ScoreDoc[] hits1 = results.scoreDocs;
    for (ScoreDoc hit : hits1) {
      Document doc = indexSearcher.doc(hit.doc);
      //System.out.printf("%5.3f %s\n", hit.score, doc.get(FIELD_CONTENTS));
      System.out.println(hit.score);
      Searcher.explain("rice", 0); // the call that does not compile
    }
    Iterator it = hits.iterator();
    while (it.hasNext()) {
      Hit hit = (Hit) it.next();
      Document document = hit.getDocument();
      String path = document.get(FIELD_PATH);
      System.out.println("Hit: " + path);
    }
  }
}

On Mon, Jul 5, 2010 at 7:46 PM, Grant Ingersoll wrote:
>
> On Jul 5, 2010, at 5:02 AM, manjula wijewickrema wrote:
>
> > Hi,
> >
> > In my application, I input only single term query (at one time) and get back
> > the corresponding scorings for those queries. But I am little struggling of
> > understanding Lucene scoring. I have reffered
> > http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
> > and some other pages to resolve my matters. But some are still remain.
> > > > 1) Why it has taken the squareroot of frequency as the tf value and > square > > of the idf vale in score function? > > Somewhat arbitrary, I suppose, but I think someone way back did some tests > and decided it performed "best" in general. More importantly, the point of > the Similarity class is you can override these if you desire. > > > > > 2) If I enter single term query, then what will return bythe coord(q,d)? > > Since there are always one term in the query, I think always it should be > 1! > > Am I correct? > > Should be. You can run the explain() method to confirm. > > > > > 3) I am also struggling understanding sumOfSquaredWeights (in > queryNorm(q)). > > As I can understand, this value depends on the nature of the query we > input > > and depends on that, it uses different methods such as TermQuery, > > MultiTermQuery, BooleanQuery, WildcardQuery, PhraseQuery, PrefixQuery, > etc. > > But
Re: Lucene Scoring
Dear Ian,

Thanks a lot for your reply. The approach you proposed works correctly and solved half of my problem. Once I run the program, the system gives me the following output.

output:

Searching for 'milk'
Number of hits: 1
0.13287117
0.13287117 = (MATCH) fieldWeight(contents:milk in 0), product of:
  1.7320508 = tf(termFreq(contents:milk)=3)
  0.30685282 = idf(docFreq=1, maxDocs=1)
  0.25 = fieldNorm(field=contents, doc=0)
Hit: D:\JADE\work\MobilNet\Lucene291\filesToIndex\deron-foods.txt

Here I have no problem calculating the values for tf and idf, but I have no idea how to calculate fieldNorm. According to
http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,%20int)
I think norm(t,d) gives the value for fieldNorm, and in my case the system returns the value lengthNorm(field) for norm(t,d).

1) Am I correct?

2) If so, could you please let me know the formula for calculating lengthNorm(field)? (I checked several documents and code samples but was unable to find the mathematical formula behind this method.)

3) If lengthNorm(field) is not what lies behind fieldNorm, then how is fieldNorm calculated?

Please help me to resolve this matter.

Manjula.

On Tue, Jul 6, 2010 at 12:47 PM, Ian Lea wrote:
> You are calling the explain method incorrectly. You need something like
>
> System.out.println(indexSearcher.explain(query, 0));
>
> See the javadocs for details.
>
> --
> Ian.
>
> On Tue, Jul 6, 2010 at 7:39 AM, manjula wijewickrema wrote:
> > Dear Grant,
> >
> > Thanks a lot for your guidence. As you have mentioned, I tried to use
> > explain() method to get the explanations for relevant scoring. But, once I
> > call the explain() method, system indicated the following error.
> >
> > Error-
> > 'The method explain(Query,int) in the type Searcher is not applicable for
> > the arguments (String, int)'.
> >
> > [...]
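For what it's worth, DefaultSimilarity computes lengthNorm(field) as 1/sqrt(number of terms in the field), and with all boosts left at 1 this is what explain() prints as fieldNorm, up to the one-byte precision of the stored norm. The 0.25 in the output above would correspond to a field of about 16 indexed terms; that count is an inference read back from the norm, since the actual token count of deron-foods.txt is not shown. A quick stdlib-only check:

```java
public class FieldNormCheck {
    // DefaultSimilarity's lengthNorm with all boosts left at 1.0
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(16));  // 0.25, matching the explain() output
        System.out.println(lengthNorm(100)); // 0.1
        // Note: Lucene stores the norm encoded into a single byte, so the
        // decoded value is only an approximation of 1/sqrt(numTerms).
    }
}
```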
Why not normalization?
Hi,

In my application I index only one file and enter only a single-term query to check the Lucene score. I used the explain() method to see how the result is obtained, and the system gave me the result as the product of tf, idf, and fieldNorm.

1) Although Lucene uses tf in scoring, it seems to me that the term frequency has not been normalized. Even if I index several documents, it does not normalize the tf value. Therefore, since the total numbers of words in the indexed documents vary, can't there be a flaw in Lucene's scoring?

2) What is the formula for calculating this fieldNorm value?

Can somebody please help me?

Thanks in advance,
Manjula.
Re: Why not normalization?
Hi Rebecca,

Thanks for your valuable comments. Yes, I observed that once the number of terms in the document goes up, the fieldNorm value goes down correspondingly. I think, therefore, that there won't be any fault due to the variation in the total number of terms in the document. Am I right?

Manjula.

On Thu, Jul 8, 2010 at 9:34 AM, Rebecca Watson wrote:
> hi,
>
> > 1) Although Lucene uses tf to calculate scoring it seems to me that term
> > frequency has not been normalized. Even if I index several documents, it
> > does not normalize tf value. Therefore, since the total number of words
> > in index documents are varied, can't there be a fault in Lucene's scoring?
>
> tf = term frequency i.e. the number of times the term appears in the document,
> while idf is inverse document frequency - is a measure of how rare a term is,
> i.e. related to how many documents the term appears in.
>
> if term1 occurs more frequently in a document i.e. tf is higher, you
> want to weight the document higher when you search for term1
>
> but if term1 is a very frequent term, ie. in lots of documents, then
> its probably not as important to an overall search (where we have term1,
> term2 etc) so you want to downweight it (idf comes in)
>
> then the normalisations like length normalisation (allow for 'fair' scoring
> across varied field length) come in too.
>
> the tf-idf scoring formula used by lucene is a scoring method that's
> been around a long long time... there are competing scoring metrics but
> that's an IR thing and not an argument you want to start on the lucene
> lists! :)
>
> these are IR ('information retrieval') concepts and you might want to start
> by going through the tf-idf scoring / some explanations for this kind
> of scoring.
>
> http://en.wikipedia.org/wiki/Tf%E2%80%93idf
> http://wiki.apache.org/lucene-java/InformationRetrieval
>
> > 2) What is the formula to calculate this fieldNorm value?
> > in terms of how lucene implements its tf-idf scoring - you can see here: > http://lucene.apache.org/java/3_0_2/scoring.html > > also, the lucene in action book is a really good book if you are starting > out > with lucene (and will save you a lot of grief with understanding > lucene / setting > up your application!), it covers all the basics and then moves on to more > advanced stuff and has lots of code examples too: > http://www.manning.com/hatcher2/ > > hope that helps, > > bec :) > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
scoring and index size
Hi,

I ran a simple program to see how Lucene scores a single indexed document. The explain() method gave me the following results.

Searching for 'metaphysics'
Number of hits: 1
0.030706111
0.030706111 = (MATCH) fieldWeight(contents:metaphys in 0), product of:
  10.246951 = tf(termFreq(contents:metaphys)=105)
  0.30685282 = idf(docFreq=1, maxDocs=1)
  0.009765625 = fieldNorm(field=contents, doc=0)

But I encountered the following problems:

1) In this case I did not change any boost values, so fieldNorm should be 1/sqrt(terms in field), shouldn't it? (I noticed in the Lucene email archive that the default boost value is 1.)

2) Even if I manually calculate the value for fieldNorm (as 1/sqrt(terms in field)), it only approximately matches the value given by the system. Can this be due to precision loss when the norm is encoded/decoded?

3) My indexed document consisted of 19,078 words in total, including 125 occurrences of the word 'metaphysics' (i.e. my query; I input a single-term query). But as you can see in the output above, the system reports only 105 counts for 'metaphysics'. Once I removed part of the indexed document and counted the occurrences of 'metaphysics' again, the system counted them correctly. Why this behaviour? Is there any size limitation on indexed documents?

Can somebody please help me solve these problems? Thanks!

Manjula.
Re: scoring and index size
Uwe, thanks for your comments. The code I used in this case follows. Could you please let me know where and how I have to set the UNLIMITED field length?

Thanks again!
Manjula

code:

public class LuceneDemo {

  public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
  public static final String INDEX_DIRECTORY = "indexDirectory";
  public static final String FIELD_PATH = "path";
  public static final String FIELD_CONTENTS = "contents";

  public static void main(String[] args) throws Exception {
    createIndex();
    //searchIndex("rice AND milk");
    searchIndex("metaphysics");
    //searchIndex("banana");
    //searchIndex("foo");
  }

  public static void createIndex() throws CorruptIndexException,
      LockObtainFailedException, IOException {
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
        StopAnalyzer.ENGLISH_STOP_WORDS);
    boolean recreateIndexIfExists = true;
    IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer,
        recreateIndexIfExists);
    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();
    for (File file : files) {
      Document document = new Document();
      //contents#setOmitNorms(true);
      String path = file.getCanonicalPath();
      document.add(new Field(FIELD_PATH, path, Field.Store.YES,
          Field.Index.UN_TOKENIZED, Field.TermVector.YES));
      Reader reader = new FileReader(file);
      document.add(new Field(FIELD_CONTENTS, reader));
      indexWriter.addDocument(document);
    }
    indexWriter.optimize();
    indexWriter.close();
  }

  public static void searchIndex(String searchString) throws IOException, ParseException {
    System.out.println("Searching for '" + searchString + "'");
    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    IndexReader indexReader = IndexReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
        StopAnalyzer.ENGLISH_STOP_WORDS);
    QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
    Query query = queryParser.parse(searchString);
    Hits hits = indexSearcher.search(query);
    System.out.println("Number of hits: " + hits.length());
    TopDocs results = indexSearcher.search(query, 10);
    ScoreDoc[] hits1 = results.scoreDocs;
    for (ScoreDoc hit : hits1) {
      Document doc = indexSearcher.doc(hit.doc);
      //System.out.printf("%5.3f %s\n", hit.score, doc.get(FIELD_CONTENTS));
      System.out.println(hit.score);
      //Searcher.explain("rice", 0);
      //System.out.println(indexSearcher.explain(query, 0));
    }
    System.out.println(indexSearcher.explain(query, 0));
    //System.out.println(indexSearcher.explain(query, 1));
    //System.out.println(indexSearcher.explain(query, 2));
    //System.out.println(indexSearcher.explain(query, 3));
    Iterator it = hits.iterator();
    while (it.hasNext()) {
      Hit hit = (Hit) it.next();
      Document document = hit.getDocument();
      String path = document.get(FIELD_PATH);
      System.out.println("Hit: " + path);
    }
  }
}

On Fri, Jul 9, 2010 at 1:06 PM, Uwe Schindler wrote:
> Maybe you have MaxFieldLength.LIMITED instead of UNLIMITED? Then the number
> of terms per document is limited.
>
> The calculation precision is limited by the float norm encoding, but also if
> your analyzer removed stop words, so the norm is not what you exspect?
> > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > > -Original Message- > > From: manjula wijewickrema [mailto:manjul...@gmail.com] > > Sent: Friday, July 09, 2010 9:21 AM > > To: java-user@lucene.apache.org > > Subject: scoring and index size > > > > Hi, > > > > I run a single programme to see the way of scoring by Lucene for single > > indexed document. The explain() method gave me the following results. > > *** > > > > Searching for 'metaphysics' > > > > Number of hits: 1 > > > > 0.030706111 > > > > 0.030706111 = (MATCH) fieldWeight(contents:metaphys in 0), product of: > > > > 10.246951 = tf(termFreq(contents:metaphys)=105) > > > > 0.30685282 = idf(docFreq=1, maxDocs=1) > > > > 0.009765625 = fieldNorm(field=contents, doc=0) > > > > * > > > > But I encountered the following problems; > > > > 1) In this case, I did not change or done anything to Boost values. So > that > > should fieldNorm = 1/sqrt(terms in field)? (because I noticed that in > Lucene > > email archive, default boost values=1) > > > > 2) But, even if I manually calculate the value for fieldNorm
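Uwe's suggestion can be cross-checked against the explain() numbers themselves: the reported tf 10.246951 is exactly sqrt(105), and fieldNorm 0.009765625 sits right next to 1/sqrt(10000) = 0.01 once the one-byte norm encoding rounds it. Both are consistent with the document being cut at the default 10,000-term MaxFieldLength, so that only 105 of the 125 occurrences of 'metaphysics' fell inside the indexed prefix. (The 10,000-term cut-off is inferred here, not confirmed.) A stdlib-only arithmetic check:

```java
public class TruncationCheck {
    public static void main(String[] args) {
        // tf reported by explain() was 10.246951, i.e. sqrt of 105 occurrences
        double tf = Math.sqrt(105);
        System.out.println(tf);

        // expected fieldNorm if the field was capped at 10,000 terms
        double expectedNorm = 1.0 / Math.sqrt(10000); // 0.01
        System.out.println(expectedNorm);
        // explain() printed 0.009765625: the nearby value the one-byte
        // norm encoding can represent, close to the expected 0.01
    }
}
```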
Re: Why not normalization?
Thanx On Fri, Jul 9, 2010 at 1:10 PM, Uwe Schindler wrote: > > Thanks for your valuble comments. Yes I observed tha, once the number of > > terms of the goes up, fieldNorm value goes down correspondingly. I think, > > therefore there won't be any default due to the variation of total number > of > > terms in the document. Am I right? > > With the current scoring model advanced statistics are not available. There > are currently some approaches to add BM25 support to Lucene, for what the > index format needs to be enhanced to contain more statistics (number of > terms per document, avg number of terms per document,...). > > > On Thu, Jul 8, 2010 at 9:34 AM, Rebecca Watson > > wrote: > > > > > hi, > > > > > > > 1) Although Lucene uses tf to calculate scoring it seems to me that > > > > term frequency has not been normalized. Even if I index several > > > > documents, it does not normalize tf value. Therefore, since the > > > > total number of words in index documents are varied, can't there be > > > > a fault in Lucene's > > > scoring? > > > > > > tf = term frequency i.e. the number of times the term appears in the > > > document, while idf is inverse document frequency - is a measure of > > > how rare a term is, i.e. related to how many documents the term > > > appears in. > > > > > > if term1 occurs more frequently in a document i.e. tf is higher, you > > > want to weight the document higher when you search for term1 > > > > > > but if term1 is a very frequent term, ie. in lots of documents, then > > > its probably not as important to an overall search (where we have > > > term1, term2 etc) so you want to downweight it (idf comes in) > > > > > > then the normalisations like length normalisation (allow for 'fair' > > > scoring across varied field length) come in too. > > > > > > the tf-idf scoring formula used by lucene is a scoring method that's > > > been around a long long time... 
there are competing scoring metrics > > > but that's an IR thing and not an argument you want to start on the > > > lucene lists! :) > > > > > > these are IR ('information retrieval') concepts and you might want to > > > start by going to through the tf-idf scoring / some explanations for > > > this kind of scoring. > > > > > > http://en.wikipedia.org/wiki/Tf%E2%80%93idf > > > http://wiki.apache.org/lucene-java/InformationRetrieval > > > > > > > > > > 2) What is the formula to calculate this fieldNorm value? > > > > > > in terms of how lucene implements its tf-idf scoring - you can see > here: > > > http://lucene.apache.org/java/3_0_2/scoring.html > > > > > > also, the lucene in action book is a really good book if you are > > > starting out with lucene (and will save you a lot of grief with > > > understanding lucene / setting up your application!), it covers all > > > the basics and then moves on to more advanced stuff and has lots of > > > code examples too: > > > http://www.manning.com/hatcher2/ > > > > > > hope that helps, > > > > > > bec :) > > > > > > - > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Re: scoring and index size
Hi Koji, Thanks for your information Manjula On Fri, Jul 9, 2010 at 5:04 PM, Koji Sekiguchi wrote: > (10/07/09 19:30), manjula wijewickrema wrote: > >> Uwe, thanx for your comments. Following is the code I used in this case. >> Could you pls. let me know where I have to insert UNLIMITED field length? >> and how? >> Tanx again! >> Manjula >> >> >> > Manjula, > > You can set UNLIMITED field length to IW constructor: > > > http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/index/IndexWriter.html#IndexWriter%28org.apache.lucene.store.Directory,%20org.apache.lucene.analysis.Analyzer,%20boolean,%20org.apache.lucene.index.IndexWriter.MaxFieldLength%29 > > Koji > > -- > http://www.rondhuit.com/en/ > > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
MaxFieldLength
Hi, I have seen that once the field length of a document goes over a certain limit ( http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH gives it as 10,000 terms by default), Lucene truncates the document. Is there any possibility that documents get truncated as we increase the number of indexed documents (assuming no individual document exceeds Lucene's default MaxFieldLength)? Thanks Manjula.
Re: MaxFieldLength
Ok Erick, the answer is there. If no document exceeds the default maxFieldLength, then no document will be truncated, however much we increase the number of documents in the index. Am I correct? Thanks for your help. Manjula. On Tue, Jul 13, 2010 at 3:57 AM, Erick Erickson wrote: > I'm not sure I understand your question. The number of documents > has no bearing on the field length of each, which is what the > max field length is all about. You can change the value here > by calling IndexWriter.setMaxFieldLength to something shorter > than the default. > > So no, if no document exceeds the default (terms, not characters), > no document will be truncated. > > The 10,000 limit also has no bearing on how much space indexing > a document takes as long as there are fewer than 10,000 terms. That > is, a document with 5,000 terms will take up just as much space > with any MaxFieldLength > 5,000. > > HTH > Erick > > On Mon, Jul 12, 2010 at 4:00 AM, manjula wijewickrema > wrote: > > > Hi, > > > > I have seen that once the field length of a document goes over a > certain > > limit ( > > > > > http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH > > gives > > it as 10,000 terms by default), Lucene truncates the document. Is there > any > > possibility that documents get truncated as we increase the number of > > indexed documents (assuming no individual document exceeds the > > default MaxFieldLength of Lucene)? > > > > Thanks > > Manjula.
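Erick's point is that the limit is a count of terms per field, not characters, and that the number of documents in the index plays no role. The semantics can be sketched in plain Java (this is only an illustration of the truncation behavior, not Lucene's actual code):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of maxFieldLength semantics: keep at most N terms of a field,
// drop everything after. The limit applies per field, per document --
// adding more documents to the index never triggers truncation.
public class FieldTruncationSketch {

    // Keep at most maxFieldLength tokens; the rest are silently dropped.
    static List<String> truncate(List<String> tokens, int maxFieldLength) {
        return tokens.subList(0, Math.min(tokens.size(), maxFieldLength));
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("lucene", "in", "action", "is", "a", "book");
        System.out.println(truncate(tokens, 4)); // [lucene, in, action, is]
        System.out.println(truncate(tokens, 10)); // unchanged: fewer tokens than the limit
    }
}
```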
Databases
Hi, Normally, when I build my index directory for indexed documents, I simply keep the indexed files in a directory called 'filesToIndex'. So in this case, I do not use any standard database management system such as MySQL. 1) Will it be possible to use MySQL or another DBMS for managing indexed documents in Lucene? 2) Is it necessary to follow that kind of methodology with Lucene? 3) If we do not use such a database management system, will there be any disadvantages with a large number of indexed files? Appreciate any reply from you. Thanks, Manjula.
Re: Databases
Hi, Thanks a lot for your information. Regards, Manjula. On Fri, Jul 23, 2010 at 12:48 PM, tarun sapra wrote: > You can use HibernateSearch to maintain the synchronization between Lucene > index and Mysql RDBMS. > > On Fri, Jul 23, 2010 at 11:16 AM, manjula wijewickrema > wrote: > > > Hi, > > > > Normally, when I am building my index directory for indexed documents, I > > used to keep my indexed files simply in a directory called > 'filesToIndex'. > > So in this case, I do not use any standar database management system such > > as mySql or any other. > > > > 1) Will it be possible to use mySql or any other for the purpose of > manage > > indexed documents in Lucene? > > > > 2) Is it necessary to follow such kind of methodology with Lucene? > > > > 3) If we do not use such type of database management system, will there > be > > any disadvantages with large number of indexed files? > > > > Appreciate any reply from you. > > Thanks, > > Manjula. > > > > > > -- > Thanks & Regards > Tarun Sapra >
Phrase indexing and searching
Dear list, My Lucene program is able to index single words and find the most closely matching documents in a corpus (based on term frequencies) for an input document. Now I want to index two-word phrases and find the matching corpus documents (based on phrase frequencies) for the input document. ex:- input document: blue house is very beautiful split it into phrases (say two-term phrases) like: blue house house very very beautiful etc. Is it possible to do this with Lucene? If so, how can I do it? Thanks, Manjula.
Re: Phrase indexing and searching
Hi Steve, Thanks for the reply. Could you please let me know how to embed ShingleFilter in the code for both indexing and searching? Different people suggest different snippets, and they did not do the job. Thanks, Manjula. On Mon, Dec 23, 2013 at 8:42 PM, Steve Rowe wrote: > Hi Manjula, > > Sounds like ShingleFilter will do what you want: < > > http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html > > > > Steve > www.lucidworks.com > On Dec 22, 2013 11:25 PM, "Manjula Wijewickrema" > wrote: > > > Dear All, > > > > My Lucene programme is able to index single words and search the most > > matching documents (based on term frequencies) documents from a corpus to > > the input document. > > Now I want to index two word phrases and search the matching corpus > > documents (based on phrase frequencies) to the input documents. > > > > ex:- > > input document: > > blue house is very beautiful > > > > split it into phrases (say two term phrases) like: > > blue house > > house very > > very beautiful > > etc. > > > > Is it possible to do this with Lucene? If so how can I do it? > > > > Thanks, > > > > Manjula. > > >
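What ShingleFilter produces can be shown in miniature: it slides a window of n tokens over the token stream and emits each window as a single term. A plain-Java sketch of that idea (Lucene's actual filter operates on a TokenStream, not a String list):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Word-level shingling: emit every run of `size` consecutive tokens,
// joined by a space, as one term. This is what a shingle of size 2
// ("bigram") looks like conceptually.
public class ShingleSketch {

    static List<String> shingles(List<String> tokens, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("blue", "house", "is", "very", "beautiful");
        System.out.println(shingles(tokens, 2));
        // [blue house, house is, is very, very beautiful]
    }
}
```

Note that shingling the raw token stream keeps "is"; the example in the question ("house very") would only come out after a stop-word filter removed "is" first.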
Re: Is it wrong to create index writer on each query request.
Hi, What are the other disadvantages (other than the time factor) of creating index for every request? Manjula. On Thu, Jun 5, 2014 at 2:34 PM, Aditya wrote: > Hi Rajendra > > You should NOT create index writer for every request. > > >>Whether it is time consuming to update index writer when new document > will come. > No. > > Regards > Aditya > www.findbestopensource.com > > > > On Thu, Jun 5, 2014 at 12:24 PM, Rajendra Rao > > wrote: > > > I have system in which documents and Query comes frequently .I am > > creating index writer in memory every time for each query I request . I > > want to know Is it good to separate Index Writing and loading and Query > > request ? Whether It is good to save index writer on hard disk .Whether > it > > is time consuming to update index writer when new document will come. > > >
ShingleAnalyzerWrapper question
Hi, In my program, I can index and search a document based on unigrams. I modified the code as follows to obtain results based on bigrams. However, it did not give me the desired output.

public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
    final String[] NEW_STOP_WORDS = {"a", "able", "about", "actually", "after", "allow", "almost", "already", "also", "although", "always", "am", "an", "and", "any", "anybody"}; // only a portion
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", NEW_STOP_WORDS);
    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    ShingleAnalyzerWrapper sw = new ShingleAnalyzerWrapper(analyzer, 2);
    sw.setOutputUnigrams(false);
    IndexWriter w = new IndexWriter(INDEX_DIRECTORY, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();
    for (File file : files) {
        Document doc = new Document();
        String text = "";
        doc.add(new Field("contents", text, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
        Reader reader = new FileReader(file);
        doc.add(new Field(FIELD_CONTENTS, reader));
        w.addDocument(doc);
    }
    w.optimize();
    w.close();
}

Still the output is: {contents: /1, assist/1, fine/1, librari/1, librarian/1, main/1, manjula/3, name/1, sabaragamuwa/1, univers/1} If anybody can, please help me to obtain the correct output. Thanks, Manjula.
Re: ShingleAnalyzerWrapper question
Dear Steve, It works. Thanks. On Wed, Jun 11, 2014 at 6:18 PM, Steve Rowe wrote: > You should give sw rather than analyzer in the IndexWriter constructor. > > Steve > www.lucidworks.com > On Jun 11, 2014 2:24 AM, "Manjula Wijewickrema" > wrote: > > > Hi, > > > > In my programme, I can index and search a document based on unigrams. I > > modified the code as follows to obtain the results based on bigrams. > > However, it did not give me the desired output.
Why bigram tf-idf is 0?
Hi, In my program, I tried to select the most relevant document based on bigrams. The system gives me the following output. {contents: /1, assist librarian/1, assist manjula/2, assist sabaragamuwa/1, fine manjula/1, librari manjula/1, librarian sabaragamuwa/1, main librari/2, manjula assist/4, manjula fine/1, manjula name/1, name manjula/1, sabaragamuwa univers/3, univers main/2, univers sabaragamuwa/1} The frequencies of the bigrams are also correctly identified by the system. But the tf-idf scores of these bigrams are given as 0. However, the same program gives the correct tf-idf values for unigrams. Following is the code snippet that I wrote to determine the tf-idf of bigrams.

for (int q1 = 1; q1 <= freqs.length; q1++) {
    TopDocs results = indexSearcher.search(query, 10);
    ScoreDoc[] hits1 = results.scoreDocs;
    for (ScoreDoc hit : hits1) {
        Document doc = indexSearcher.doc(hit.doc);
        tfidf[q1 - 1] = hit.score;
    }
}

Here, "hit.score" should give the tf-idf value of each bigram. Why is it given as 0? If someone can, please explain to me how to resolve this problem. Thanks, Manjula.
bigram problem
Hi, Could you please explain to me how to determine the tf-idf score for bigrams? My program is able to index and search bigrams correctly, but it does not calculate the tf-idf for bigrams. If someone can, please help me resolve this. Regards, Manjula.
Re: bigram problem
Dear Parnab, Thanks a lot for your guidance. I prefer to follow the second method, as I have already indexed the bigrams using ShingleAnalyzerWrapper. But I have no idea how to use NGramTokenizer here. So, could you please write one or two lines of code showing how to use NGramTokenizer for bigrams? Thanks, Manjula. On Wed, Jul 2, 2014 at 7:05 PM, parnab kumar wrote: > TF is straightforward, you can simply count the number of occurrences in the > doc by simple string matching. For IDF you need to know the total number of docs in > the collection and the number of docs having the bigram. reader.maxDoc() will > give you the total number of docs in the collection. To calculate the number of > docs containing the bigram use a phrase query with slop factor set to 0. > The number of docs returned by the indexsearcher with the phrase query will > be the number of docs having the bigram. I hope this is fine. > > Alternatively, use NGramTokenizer ( n=2 in your case) while > indexing. In such a case, each bigram can be interpreted as a normal lucene > term. > > Thanks, > Parnab > > On Wed, Jul 2, 2014 at 8:45 AM, Manjula Wijewickrema > wrote: > > > Hi, > > > > Could please explain me how to determine the tf-idf score for bigrams. My > > program is able to index and search bigrams correctly, but it does not > > calculate the tf-idf for bigrams. If someone can, please help me to > resolve > > this. > > > > Regards, > > Manjula.
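Parnab's first recipe (count tf by string matching, derive idf from the document counts) can be written out in a few lines. The formula below, tf * log(N / df), is the textbook form and an assumption here; Lucene's own Similarity uses different factors (e.g. sqrt(tf) and a smoothed idf), so the numbers will not match hit.score:

```java
// Manual tf-idf for a bigram: termFreqInDoc from string matching,
// totalDocs from reader.maxDoc(), docsWithTerm from the hit count of a
// phrase query with slop 0. Textbook weighting, not Lucene's exact formula.
public class TfIdfSketch {

    static double tfIdf(int termFreqInDoc, int totalDocs, int docsWithTerm) {
        if (docsWithTerm == 0) return 0.0; // term absent from the collection
        return termFreqInDoc * Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        // e.g. "manjula assist" occurs 4 times in the doc, in 1 of 10 docs:
        System.out.println(tfIdf(4, 10, 1));
    }
}
```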
Why hit is 0 for bigrams?
Hi, I tried to index bigrams from a document, and the system gave me the following output with the frequencies of the bigrams (output 1): array size:15 array terms are: {contents: /1, assist librarian/1, assist manjula/2, assist sabaragamuwa/1, fine manjula/1, librari manjula/1, librarian sabaragamuwa/1, main librari/2, manjula assist/4, manjula fine/1, manjula name/1, name manjula/1, sabaragamuwa univers/3, univers main/2, univers sabaragamuwa/1} For this I used the following code in the createIndex() method: ShingleAnalyzerWrapper sw = new ShingleAnalyzerWrapper(analyzer, 2); sw.setOutputUnigrams(false); Then I tried to search the indexed bigrams of the same document using the following code in the searchIndex() method: IndexReader indexReader = IndexReader.open(directory); IndexSearcher indexSearcher = new IndexSearcher(indexReader); Analyzer analyzer = new WhitespaceAnalyzer(); QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer); Query query = queryParser.parse(terms[pos[freqs.length-q1]]); System.out.println("Query: " + query); Hits hits = indexSearcher.search(query); System.out.println("Number of hits: " + hits.length()); For this, the system gave me the following output (output 2): Query: contents:manjula contents:assist Number of hits: 0 Query: contents:sabaragamuwa contents:univers Number of hits: 0 Query: contents:univers contents:main Number of hits: 0 Query: contents:main contents:librari Number of hits: 0 If someone can, please explain to me: (1) why is 'contents: /1' included in the array as an array element? (output 1) (2) why does the system return the query as 'contents:manjula contents:assist' instead of 'manjula assist'? (output 2) (3) why is the number of hits 0 instead of their frequencies? (output 2) I highly appreciate your kind reply. Manjula.
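On question (2): QueryParser splits the unquoted string "manjula assist" on whitespace and builds two separate term queries, which is why the printed query shows `contents:manjula contents:assist`. Wrapping the bigram in double quotes before parsing makes the parser build a phrase query instead. A tiny helper for that (the backslash escaping of embedded quotes is an assumption; check QueryParser's escaping rules for the full set of special characters):

```java
// Quote a multi-word term so QueryParser treats it as a phrase rather
// than as separate OR'd terms. Embedded double quotes are escaped.
public class PhraseQuoting {

    static String asPhrase(String words) {
        return "\"" + words.replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(asPhrase("manjula assist"));
        // queryParser.parse(asPhrase("manjula assist")) would then print
        // something like: contents:"manjula assist"
    }
}
```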
Analyzer
Hi, In my work, I am using Lucene and two Java classes. In the first one, I index a document, and in the second one, I try to find the most relevant document for the document indexed in the first. In the first class, I use SnowballAnalyzer in the createIndex method and StandardAnalyzer in the searchIndex method, and pass the highest-frequency terms into the second class. In the second class, I use SnowballAnalyzer in the createIndex method (this index is for the collection of documents to be searched; it is my database) and StandardAnalyzer in the searchIndex method (I pass the most frequently occurring term from the first class as the search term parameter to the searchIndex method of the second class). Using Analyzers in this manner, what I want to do is stemming and stop-word removal in both indexes (in both classes), and to search for those few high-frequency words (of the first index) in the second index. So, if my intention is clear to you, could you please let me know whether the way I have used Analyzers is correct? I highly appreciate any comment. Thanks. Manjula.
Re: Analyzer
Hi Steve, Thanks a lot for your reply. Yes, there are only two classes, and the way you have understood the problem is correct. As you instructed, I tried WhitespaceAnalyzer for querying (instead of StandardAnalyzer), and it seems to give better results than StandardAnalyzer. So could you please let me know the differences between StandardAnalyzer and WhitespaceAnalyzer? I highly appreciate your response. Thanks. Manjula. On Mon, Nov 29, 2010 at 7:32 PM, Steven A Rowe wrote: > Hi Manjula, > > It's not terribly clear what you're doing here - I got lost in your > description of your (two? or maybe four?) classes. Sometimes things are > easier to understand if you provide more concrete detail. > > I suspect that you could benefit from reading the book Lucene in Action, > 2nd edition: > > http://www.manning.com/hatcher3/ > > You would also likely benefit from using Luke, the Lucene index browser, to > better understand your indexes' contents and debug how queries match > documents: > > http://code.google.com/p/luke/ > > I think your question is whether you're using Analyzers correctly. It > sounds like you are creating two separate indexes (one for each of your > classes), and you're using SnowballAnalyzer on the indexing side for both > indexes, and StandardAnalyzer on the query side. > > The usual advice is to use the same Analyzer on both the query and the > index side. But it appears to be the case that you are taking stemmed index > terms from your index #1 and then querying index #2 using these stemmed > terms. If this is true, then you want the query-time analyzer in your > second index not to change the query terms. You'll likely get better > results using WhitespaceAnalyzer, which tokenizes on whitespace and does no > further analysis, rather than StandardAnalyzer.
> > Steve > > > -Original Message- > > From: manjula wijewickrema [mailto:manjul...@gmail.com] > > Sent: Monday, November 29, 2010 4:32 AM > > To: java-user@lucene.apache.org > > Subject: Analyzer > > > > Hi, > > > > In my work, I am using Lucene and two java classes. In the first one, I > > index a document and in the second one, I try to search the most relevant > > document for the indexed document in the first one. In the first java > > class, > > I use the SnowballAnalyzer in the createIndex method and StandardAnalyzer > > in > > the searchIndex method and pass the highest frequency terms into the > > second > > Java class. In the second class, I use SnowballAnalyzer in the > createIndex > > method (this index is for the collection of documents to be searched, or > > it > > is my database) and StandardAnalyser in the searchIndex method (I pass > the > > highest frequently occuring term of the first class as the search term > > parameter to the searchIndex method of the second class). Using Analyzers > > in > > this manner, what I am willing is to do the stemming, stop-words in both > > indexes (in both classes) and to search those a few high frequency words > > (of > > the first index) in the second index. So, if my intention is clear to > you, > > could you please let me know whether it is correct or not the way I have > > used Analyzers? I highly appreciate any comment. > > > > Thanx. > > Manjula. >
Re: Analyzer
Dear Erick, Thanks for your information. Manjula. On Tue, Nov 30, 2010 at 6:37 PM, Erick Erickson wrote: > WhitespaceAnalyzer does just that, splits the incoming stream on > white space. > > From the javadocs for StandardAnalyzer: > > A grammar-based tokenizer constructed with JFlex > > This should be a good tokenizer for most European-language documents: > > - Splits words at punctuation characters, removing punctuation. However, > a dot that's not followed by whitespace is considered part of a token. > - Splits words at hyphens, unless there's a number in the token, in which > case the whole token is interpreted as a product number and is not split. > - Recognizes email addresses and internet hostnames as one token. > > Many applications have specific tokenizer needs. If this tokenizer does not > suit your application, please consider copying this source code directory > to your project and maintaining your own grammar-based tokenizer. > > Best > Erick > > On Tue, Nov 30, 2010 at 12:06 AM, manjula wijewickrema > wrote: > > > Hi Steve, > > > > Thanks a lot for your reply. Yes, there are only two classes, and the way > > you have understood the problem is correct. As you instructed, I tried > > WhitespaceAnalyzer for querying (instead of StandardAnalyzer), and it > > seems to give better results than StandardAnalyzer. So could you please > > let me know the differences between StandardAnalyzer and > > WhitespaceAnalyzer? I highly appreciate your response. Thanks. > > > > Manjula.
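The practical difference Erick describes can be shown in miniature. The two methods below are rough plain-Java approximations (the real StandardAnalyzer uses a JFlex grammar that also handles emails, hostnames, and product numbers, and its handling of apostrophes differs from a bare punctuation split):

```java
import java.util.Arrays;
import java.util.List;

// Contrast of the two analyzers' tokenization styles:
// WhitespaceAnalyzer only splits on whitespace and changes nothing else;
// StandardAnalyzer (approximated here) also strips punctuation and lowercases.
public class TokenizerContrast {

    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    static List<String> standardishTokens(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("[\\s\\p{Punct}]+"));
    }

    public static void main(String[] args) {
        String text = "Lucene's index, explained";
        System.out.println(whitespaceTokens(text));  // [Lucene's, index,, explained]
        System.out.println(standardishTokens(text)); // [lucene, s, index, explained]
    }
}
```

This is why WhitespaceAnalyzer suits Manjula's case: the query terms are already-stemmed index terms, and a whitespace split passes them through unchanged.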
Editing StopWordList
Hi, 1) In my application, I need to add more words to the stop word list. Is it possible to add more words to the default Lucene stop word list? 2) If it is possible, how can I do this? Appreciate any comment from you. Thanks, Manjula.
Re: Editing StopWordList
Hi Gupta, Thanks a lot for your reply. But I could not understand whether I could modify (add more words to) the default stop word list, or whether I should make a new list as an array as follows. public String[] NEW_STOP_WORDS = { "a", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "no", "not", "of", "on", "or", "s", "such", "t", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with", "inc", "incorporated", "co.", "ltd", "ltd.", "we", "you", "your", "us", etc...}; then call it as follows, SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.NEW_STOP_WORDS); Am I correct? If not, could you explain how I can do this? Thanks in advance. Manjula. On Tue, Dec 21, 2010 at 10:36 AM, Anshum wrote: > Hi Manjula, > You could initialize the Analyzer using a modified stop word set. Use > the StopAnalyzer.ENGLISH_STOP_WORDS_SET > to get the default stopset and then add your own words to it. You could > then initialize the analyzer using this new stop set instead of the default > stop set. > Hope that helps. > > -- > Anshum Gupta > http://ai-cafe.blogspot.com > > > On Tue, Dec 21, 2010 at 9:20 AM, manjula wijewickrema > wrote: > > > Hi, > > > > 1) In my application, I need to add more words to the stop word list. > > Therefore, is it possible to add more words into the default lucene stop > > word list? > > > > 2) If is it possible, then how can I do this? > > > > Appreciate any comment from you. > > > > Thanks, > > Manjula. > > >
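Anshum's suggestion is the second option: don't edit the built-in list, copy the default set and add your own entries, then hand the merged set to the analyzer's constructor. The set arithmetic looks like this (the small `defaults` set below stands in for Lucene's real StopAnalyzer.ENGLISH_STOP_WORDS_SET constant):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Merge the default stop words with application-specific ones, then pass
// the merged set to the analyzer, e.g.:
//   new SnowballAnalyzer("English", merged(StopAnalyzer.ENGLISH_STOP_WORDS_SET, "inc", "ltd"))
public class StopWordMerge {

    static Set<String> merged(Set<String> defaults, String... extra) {
        Set<String> out = new HashSet<>(defaults); // copy, don't mutate the shared default
        out.addAll(Arrays.asList(extra));
        return out;
    }

    public static void main(String[] args) {
        Set<String> defaults = new HashSet<>(Arrays.asList("a", "the", "of"));
        Set<String> stops = merged(defaults, "inc", "ltd", "we");
        System.out.println(stops.size()); // 6
    }
}
```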
hit.score
Hi, Can someone help me understand the value given by 'hit.score' in Lucene? I indexed a single document with five different words with different frequencies and tried to understand this value. However, it doesn't seem to be normalized term frequency or tf-idf. I am using Lucene 2.9.1. Any help would be highly appreciated.
Re: hit.score
Thanks Adrien. On Mon, Mar 27, 2017 at 6:56 PM, Adrien Grand wrote: > You can use IndexSearcher.explain to see how the score was computed. > > Le lun. 27 mars 2017 à 14:46, Manjula Wijewickrema a > écrit : > > > Hi, > > > > Can someone help me to understand the value given by 'hit.score' in > Lucene. > > I indexed a single document with five different words with different > > frequencies and try to understand this value. However, it doesn't seem to > > be normalized term frequency or tf-idf. I am using Lucene 2.91. > > > > Any help would be highly appreciated. > > >
Only term frequencies
Hi, I have a document collection with hundreds of documents. I need to know the term frequency of a given query term in each document. I know that 'hit.score' will give me the Lucene score for each document (and it includes term frequency as well). But I need to retrieve only the term frequencies in each document. How can I do this? I highly appreciate your kind response.
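The raw term frequency is just a count of the term's occurrences in one document's token stream, with none of the other score factors. In Lucene you would read it from the document's term frequency vector rather than rescanning text; the plain-Java equivalent of the quantity being asked for is:

```java
import java.util.Arrays;
import java.util.List;

// Raw term frequency for one document: how many of the document's
// (already-analyzed) tokens equal the query term. This is the number a
// term frequency vector stores per term, before any tf normalization.
public class RawTermFrequency {

    static int termFreq(List<String> docTokens, String term) {
        int n = 0;
        for (String t : docTokens) {
            if (t.equals(term)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("blue", "house", "blue", "sky");
        System.out.println(termFreq(doc, "blue")); // 2
        System.out.println(termFreq(doc, "red"));  // 0
    }
}
```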
Total of term frequencies
Hi, Is there any way to get the total count of terms in the term frequency vector (tfv)? I need to calculate the normalized term frequency of each term in my tfv. I know how to obtain the length of the tfv, but that doesn't work since I need to count duplicate occurrences as well. I highly appreciate your kind response.
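The length of the vector is the number of distinct terms, but the frequencies array already counts duplicates, so summing it gives the field's total term count, and normalized tf is then freq / total. A sketch with a stand-in array (in Lucene 2.9 the frequencies would come from TermFreqVector.getTermFrequencies()):

```java
// Total token count and normalized term frequency from a term frequency
// vector's per-term counts. freqs.length would be the distinct-term count;
// the sum over freqs is the total including duplicate occurrences.
public class NormalizedTf {

    static int totalTerms(int[] freqs) {
        int sum = 0;
        for (int f : freqs) sum += f;
        return sum;
    }

    static double normalizedTf(int freq, int[] freqs) {
        return (double) freq / totalTerms(freqs);
    }

    public static void main(String[] args) {
        int[] freqs = {1, 3, 2};                    // e.g. from getTermFrequencies()
        System.out.println(totalTerms(freqs));      // 6, not 3 (the array length)
        System.out.println(normalizedTf(3, freqs)); // 0.5
    }
}
```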
TermFrequency for a String
IndexReader.getTermFreqVectors(2)[0].getTermFrequencies()[5]; In the above example, Lucene gives me the term frequency of the 5th term (e.g. say "planet") in the tfv of the corpus document "2". But I need to get the term frequency for a specified term using its string value. E.g.: term frequency of the term specified as "planet" (i.e. not specified in terms of its position "5", but specified using its string value "planet"). Is there any way to do this? I highly appreciate your kind reply!
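Since getTerms() and getTermFrequencies() are parallel arrays, looking up a frequency by string means finding the term's position in the terms array and indexing the frequencies array with it. The terms array is returned in sorted order (an assumption worth verifying against the TermFreqVector javadoc for your version), so a binary search works; a sketch with stand-in arrays:

```java
import java.util.Arrays;

// Frequency of a term given its string: locate the term in the (sorted)
// terms array, then read the same position in the parallel freqs array.
public class TermFreqByString {

    static int freqOf(String[] terms, int[] freqs, String term) {
        int i = Arrays.binarySearch(terms, term);
        return i >= 0 ? freqs[i] : 0; // 0 when the term is absent
    }

    public static void main(String[] args) {
        String[] terms = {"earth", "moon", "planet"}; // as from getTerms()
        int[] freqs = {2, 1, 5};                      // as from getTermFrequencies()
        System.out.println(freqOf(terms, freqs, "planet")); // 5
        System.out.println(freqOf(terms, freqs, "star"));   // 0
    }
}
```

If sorted order cannot be relied on, a linear scan over the terms array for an equal string does the same job.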