Term/Phrase frequencies

2010-05-06 Thread manjula wijewickrema
Hi,

I am new to Lucene. If I want to know the term or phrase frequencies of an
input document, is that possible with Lucene?

Thanks,
Manjula


Re: Term/Phrase frequencies

2010-05-06 Thread manjula wijewickrema
Hi Erik,

Thanks for the reply. What I want to do is identify the key terms and key
phrases of a document according to their number of occurrences in the
document. The output should be the highest-frequency words and (two- or
three-word) phrases. Can I use Lucene for this purpose?

Thanks
Manjula

On Thu, May 6, 2010 at 6:09 PM, Erick Erickson wrote:

> Terms are relatively easy, see TermFreqVector in the JavaDocs.
>
> Phrases aren't as easy, before you go there, though, what is the
> high-level problem you're trying to solve? Possibly this is an XY problem
> (see http://people.apache.org/~hossman/#xyproblem).
>
> Best
> Erick
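A minimal sketch of Erick's TermFreqVector pointer, assuming the Lucene 2.9
API (the field name, analyzer, and text are illustrative): index the
document with a term vector, then read the terms and their counts back.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TermFreqDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_CURRENT), true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        // The term vector must be stored at index time for
        // getTermFreqVector to return anything.
        doc.add(new Field("contents",
                "the quick brown fox jumps over the lazy dog",
                Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = IndexReader.open(dir);
        TermFreqVector tfv = reader.getTermFreqVector(0, "contents"); // doc id 0
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " occurs " + freqs[i] + " time(s)");
        }
        reader.close();
    }
}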


Trace only exactly matching terms!

2010-05-07 Thread manjula wijewickrema
Hi,

I am using Lucene 2.9.1. I have downloaded and run the 'HelloLucene.java'
class, modifying the input document and user query in various ways. Once I
set the document sentence to 'Lucene in actions' instead of 'Lucene in
action' and gave the query 'action', the program did not show 'Lucene in
actions' as a hit! What is the reason for this? Why doesn't it treat the
word 'actions' as a hit? Does Lucene identify only exactly matching words?

Thanks
Manjula


Re: Trace only exactly matching terms!

2010-05-10 Thread manjula wijewickrema
Hi Anshum & Erick,

As you mentioned, I used SnowballAnalyzer for stemming purposes. It
worked nicely. Thanks a lot for your guidance.

Manjula.

On Fri, May 7, 2010 at 8:27 PM, Erick Erickson wrote:

> The other approach is to use a stemmer both at index and query time.
>
> BTW, it's very easy to make a "custom" analyzer by chaining together
> the Tokenizer and as many filters (e.g. PorterStemFilter) as you need,
> essentially composing your analyzer from various pre-built Lucene parts.
>
> HTH
> Erick
>
> On Fri, May 7, 2010 at 9:07 AM, Anshum  wrote:
>
> > Hi Manjula,
> > Yes, Lucene by default would only tackle exact term matches unless you
> > use a custom analyzer to expand the index/query.
> >
> > --
> > Anshum Gupta
> > http://ai-cafe.blogspot.com
> >
> > The facts expressed here belong to everybody, the opinions to me. The
> > distinction is yours to draw
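A minimal sketch of the chained analyzer Erick describes, assuming the
Lucene 2.9 API (the class name is mine; SnowballAnalyzer, as used above, is
a ready-made alternative):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class StemmingAnalyzer extends Analyzer {
    // Compose pre-built parts: tokenize, lowercase, then stem.
    // Use the same analyzer at index time and query time so that
    // 'actions' and 'action' both reduce to the stem 'action'.
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_CURRENT, reader);
        stream = new LowerCaseFilter(stream);
        stream = new PorterStemFilter(stream);
        return stream;
    }
}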


Class_for_HighFrequencyTerms

2010-05-10 Thread manjula wijewickrema
Hi,

If I index a document (a single document) in Lucene, how can I get the
term frequencies (even just the first and second most frequent terms) of
that document? Is there any class/method to do that? If anybody knows,
please help me.

Thanks
Manjula


Re: Class_for_HighFrequencyTerms

2010-05-11 Thread manjula wijewickrema
Dear Erick,

I looked for it and even added IndexReader.java and TermFreqVector.java
from
http://www.jarvana.com/jarvana/search?search_type=class&java_class=org.apache.lucene.index.IndexReader
.
But after adding them, the system indicated a lot of errors in the source
code IndexReader.java (e.g. DirectoryOwningReader cannot be resolved to a
type, IndexCommit cannot be resolved to a type, SegmentInfos cannot be
resolved, TermEnum cannot be resolved to a type, etc.). I am using Lucene
2.9.1, and this particular website lists this source code under version
2.9.1 of Lucene. What is the reason for this kind of scenario? Do I have to
add another JAR file? (To solve this I even added
lucene-core-2.9.1-sources.jar, but nothing happened.) Please be kind enough
to reply.

Thanks
Manjula

On Tue, May 11, 2010 at 1:26 AM, Erick Erickson wrote:

> Have you looked at TermFreqVector?
>
> Best
> Erick
>


Re: Class_for_HighFrequencyTerms

2010-05-13 Thread manjula wijewickrema
Thanks

On Tue, May 11, 2010 at 3:31 PM,  wrote:

> Sounds like your path is messed up and you're not using Maven correctly.
> Start with the jar version that contains the class you require and use a
> Maven POM to correctly resolve the dependencies.
> Adam
> Sent using BlackBerry® from Orange


Error of the code

2010-05-13 Thread manjula wijewickrema
Dear All,

I am trying to get the term frequencies (through TermFreqVector) of a
document (using Lucene 2.9.1). To do that I have used the following code,
but there is a compile-time error in it and I can't figure it out. Could
somebody guide me as to what's wrong with it?
The compile-time error I get:
Cannot make a static reference to the non-static method
getTermFreqVector(int, String) from the type IndexReader.

Code:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;
import java.io.IOException;

public class DemoTest {

    public static void main(String[] args) {

        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

        try {
            Directory directory = new RAMDirectory();
            IndexWriter iwriter = new IndexWriter(directory, analyzer, true,
                    new IndexWriter.MaxFieldLength(25000));

            Document doc = new Document();
            String text = "This is the text to be indexed.";
            doc.add(new Field("fieldname", text, Field.Store.YES,
                    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
            iwriter.addDocument(doc);

            // This is the line the compiler rejects: getTermFreqVector is an
            // instance method, but it is called on the IndexReader class.
            TermFreqVector vector = IndexReader.getTermFreqVector(0, "fieldname");

            int size = vector.size();
            for (String term : vector.getTerms())
                System.out.println("size = " + size);

            iwriter.close();

            IndexSearcher isearcher = new IndexSearcher(directory, true);
            QueryParser parser = new QueryParser(Version.LUCENE_CURRENT,
                    "fieldname", analyzer);
            Query query = parser.parse("text");
            ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
            System.out.println("hits.length(1) = " + hits.length);

            // Iterate through the results:
            for (int i = 0; i < hits.length; i++) {
                Document hitDoc = isearcher.doc(hits[i].doc);
                System.out.println("hitDoc.get(\"fieldname\") (This is the text to be indexed) = "
                        + hitDoc.get("fieldname"));
            }

            isearcher.close();
            directory.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}



Thanks in advance

Manjula


Re: Error of the code

2010-05-13 Thread manjula wijewickrema
Dear Ian,

Thanks a lot for your immediate reply. As you mentioned, I replaced the
lines as follows.

IndexReader ir = IndexReader.open(directory);

TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");

Now the error has vanished, and thanks for that. But I still can't see the
results, although I have moved those lines after iwriter.close(). What's
the reason for this?

sample code after modifications:


String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.ANALYZED,
    Field.TermVector.WITH_POSITIONS_OFFSETS));
iwriter.addDocument(doc);
iwriter.close();

IndexReader ir = IndexReader.open(directory);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");
int size = vector.size();
for (String term : vector.getTerms())
    System.out.println("size = " + size);

IndexSearcher isearcher = new IndexSearcher(directory, true);
..
..
I appreciate your kind cooperation.
Manjula
On Thu, May 13, 2010 at 3:45 PM, Ian Lea  wrote:

> You need to replace this:
>
> TermFreqVector vector = IndexReader.getTermFreqVector(0, "fieldname" );
>
> with
>
> IndexReader ir = whatever(...);
> TermFreqVector vector = ir.getTermFreqVector(0, "fieldname" );
>
> And you'll need to move it to after the writer.close() call if you
> want it to see the doc you've just added.
>
>
>
> --
> Ian.


Re: Error of the code

2010-05-14 Thread manjula wijewickrema
Hi Ian,

Thanks for your reply. vector.size() returns the total number of indexed
terms in the vector. However, I was finally able to run the program and get
the results with your help. Thanks a lot.

Manjula

On Thu, May 13, 2010 at 6:52 PM, Ian Lea  wrote:

> What does vector.size() return?  You don't appear to be doing anything
> with the String term in "for ( String term : vector.getTerms() )" -
> presumably you intend to.
>
>
> --
> Ian.
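For reference, the working sequence that came out of this thread, as a
short sketch (same names as in the code above; the loop now prints each
term with its frequency instead of repeating the size):

iwriter.addDocument(doc);
iwriter.close();  // commit first: the reader only sees the document after close()

IndexReader ir = IndexReader.open(directory);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");
String[] terms = vector.getTerms();
int[] freqs = vector.getTermFrequencies();
for (int i = 0; i < terms.length; i++) {
    System.out.println(terms[i] + " : " + freqs[i]);
}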

Access indexed terms

2010-05-14 Thread manjula wijewickrema
Hi,

Is it possible to put the indexed terms into an array in Lucene? For
example, imagine I have indexed a single document in Lucene and now I want
to access those terms in the index. Is it possible to retrieve (call) those
terms as array elements? If it is possible, then how?

Thanks,
Manjula


Re: Access indexed terms

2010-05-14 Thread manjula wijewickrema
Hi Andrzej

Thanks for the reply. But as you mentioned, creating arrays of indexed
terms seems to be a little difficult. My intention here is to find the term
frequencies of an indexed document. I can find the term frequency of a
particular term (given as a query) if I specify the term in the code. But
what I really want is the term frequency (the number of times it appears in
the document) of all the indexed terms (or the high-frequency terms)
without naming them in the code. Is there an alternative way to do that?

Thanks
Manjula


On Fri, May 14, 2010 at 4:00 PM, Andrzej Bialecki  wrote:

>
> In short: unless you created TermFrequencyVector when adding the
> document, the answer is "with great difficulty".
>
> For a working code that does this see here:
>
>
> http://code.google.com/p/luke/source/browse/trunk/src/org/getopt/luke/DocReconstructor.java
>
> If you really need such kind of access in your application then add your
> documents with term vectors with offsets and positions. Even then,
> depending on the Analyzer you used, the process is lossy - some input
> data that was discarded by Analyzer is simply no longer available.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
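Concretely, Andrzej's suggestion amounts to one extra argument when the
field is created (a sketch; the field name and text variable are from the
earlier snippets):

doc.add(new Field("fieldname", text, Field.Store.YES,
    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));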


Re: Access indexed terms

2010-05-14 Thread manjula wijewickrema
Dear Andrzej,

Thanks for your valuable help. I also noticed this HighFreqTerms approach
in the Lucene email archive and tried to use it. To do that I downloaded
lucene-misc-2.9.1.jar and added the org.apache.lucene.misc package to my
project. Now I think I have to call this HighFreqTerms class in my code,
but I was unable to find any guidance on how to do it. Please be kind
enough to tell me how I can use this class in my code.

Thanks
Manjula


On Fri, May 14, 2010 at 6:16 PM, Andrzej Bialecki  wrote:

> Yes, see the discussion here:
>
> https://issues.apache.org/jira/browse/LUCENE-2393
>
>
> --
>  Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com


How to call high fre. terms using HighFreTerms class

2010-05-14 Thread manjula wijewickrema
Hi,

I am struggling with using the HighFreqTerms class for the purpose of
finding high-frequency terms in my index. My target is to get the
high-frequency terms in an indexed document (a single document). To do that
I have added the org.apache.lucene.misc package to my project; I think up
to that point I am correct. But after that I have no idea of how to call
this in my code. Although I have looked in the Lucene email archive, I was
unable to find a hint about calling this class. If anybody can, please give
me a sample of using this class (and the relevant methods) in code which
suits my purpose. I appreciate your kind help.

Thanks
Manjula


Re: How to call high fre. terms using HighFreTerms class

2010-05-17 Thread manjula wijewickrema
Hi Erick,
Thanks

On Sat, May 15, 2010 at 5:37 PM, Erick Erickson wrote:

> It looks like a stand-alone program, so you don't call it.
> You probably want to get the source code and take a look at
> how that program works to get an idea of how to do what you want.
>
> See the instructions here for getting the source:
> http://wiki.apache.org/lucene-java/HowToContribute
>
> HTH
> Erick
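HighFreqTerms is a stand-alone contrib program, as Erick says above; the
gist of what it does is walk the index's term dictionary with a TermEnum
and keep the terms with the highest document frequency. A minimal sketch of
that walk (Lucene 2.9 API; directory is assumed from the earlier snippets,
and note that in a single-document index every docFreq is 1, so the
per-document TermFreqVector counts discussed above are the more useful
signal):

IndexReader ir = IndexReader.open(directory);
TermEnum te = ir.terms();  // iterates every term in the index
while (te.next()) {
    Term t = te.term();
    System.out.println(t.field() + ":" + t.text() + "  docFreq=" + te.docFreq());
}
te.close();
ir.close();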


Problem of getTermFrequencies()

2010-05-17 Thread manjula wijewickrema
Hi,

I wrote some code with a view to displaying the indexed terms and getting
their term frequencies for a single document. Although it displays the
terms in the index, it does not give the term frequencies; instead it
displays 'frequencies are:[...@80fa6f'. What's the reason for this? The
code I have written and the display are given below.

Code:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.apache.lucene.index.TermFreqVector;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

public class Testing {

    public static void main(String[] args) throws IOException, ParseException {

        //StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
                StopAnalyzer.ENGLISH_STOP_WORDS);

        try {
            Directory directory = new RAMDirectory();
            IndexWriter w = new IndexWriter(directory, analyzer, true,
                    IndexWriter.MaxFieldLength.UNLIMITED);

            Document doc = new Document();
            String text = "This is a sample codes code for testing lucene's capabilities over lucene term frequencies";
            doc.add(new Field("title", text, Field.Store.YES,
                    Field.Index.ANALYZED, Field.TermVector.YES));
            w.addDocument(doc);
            w.close();

            IndexReader ir = IndexReader.open(directory);
            TermFreqVector[] tfv = ir.getTermFreqVectors(0);
            // for (int xy = 0; xy < tfv.length; xy++) {
            String[] terms = tfv[0].getTerms();
            int[] freqs = tfv[0].getTermFrequencies();
            //System.out.println("terms are:"+tfv[xy]);
            //System.out.println("length is:"+terms.length);
            System.out.println("array terms are:" + tfv[0]);
            System.out.println("terms are:" + terms);       // prints the array reference
            System.out.println("frequencies are:" + freqs); // prints the array reference
            // }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}



Display:

array terms are:{title: capabl/1, code/2, frequenc/1, lucen/2, over/1,
sampl/1, term/1, test/1}

terms are:[Ljava.lang.String;@1e13d52

frequencies are:[...@80fa6f



If somebody can please help me to get the desired output.

Thanx,

Manjula.


Re: Problem of getTermFrequencies()

2010-05-17 Thread manjula wijewickrema
Dear Ian,

I changed it as you said and now it is working nicely. Thanks a lot for your
kind help.

Manjula

On Mon, May 17, 2010 at 6:46 PM, Ian Lea  wrote:

> terms and freqs are arrays.  Try terms[i] and freqs[i].
>
>
> --
> Ian.
>
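Spelled out, Ian's fix indexes into the two parallel arrays instead of
printing the array references (names as in the Testing code above):

String[] terms = tfv[0].getTerms();
int[] freqs = tfv[0].getTermFrequencies();
for (int i = 0; i < terms.length; i++) {
    System.out.println(terms[i] + " : " + freqs[i]);  // element-wise output
}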


Re: Problem of getTermFrequencies()

2010-05-20 Thread manjula wijewickrema
Thanks

On Mon, May 17, 2010 at 10:19 PM, Grant Ingersoll wrote:

> Note, depending on your downstream use, you may consider using a
> TermVectorMapper that allows you to construct your own data structures as
> needed.
>
> -Grant


Arrange terms[i]

2010-05-20 Thread manjula wijewickrema
Hi,

I wrote a program to get the frequencies and terms of an indexed document.
The output comes as follows.

If I print: +tfv[0]

Output:

array terms are:{title: capabl/1, code/2, frequenc/1, lucen/4, over/1,
sampl/1, term/4, test/1}

In the same way I can print terms[i] and freqs[i], but the problem is that
while printing terms[i], the output (array elements) comes in English
alphabetical order (as above), and freqs[i] is also arranged according to
that particular order. Is there a way to arrange terms[i] in
ascending/descending order of frequency?

Thanks in advance.

Manjula


Re: Arrange terms[i]

2010-05-25 Thread manjula wijewickrema
Dear Grant,

Thanks for your reply.

Manjula

On Mon, May 24, 2010 at 4:37 PM, Grant Ingersoll wrote:

> Yes, have a look at the TermVectorMapper.  You will need to implement a
> variation of this to build up the data structures you need.
>
> -Grant
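Grant's TermVectorMapper route builds a custom structure while the vector
is being read. A simpler (if less efficient) sketch that gets the same
ordering is to sort an index array over the parallel terms/freqs arrays
from the earlier snippets by descending frequency:

import java.util.Arrays;
import java.util.Comparator;

// terms and freqs as returned by tfv[0].getTerms() / getTermFrequencies()
final String[] terms = tfv[0].getTerms();
final int[] freqs = tfv[0].getTermFrequencies();
Integer[] order = new Integer[terms.length];
for (int i = 0; i < order.length; i++) {
    order[i] = i;
}
Arrays.sort(order, new Comparator<Integer>() {
    public int compare(Integer a, Integer b) {
        return freqs[b] - freqs[a];  // higher frequency first
    }
});
for (int i : order) {
    System.out.println(terms[i] + " : " + freqs[i]);
}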


How to get file names instead of paths?

2010-06-11 Thread manjula wijewickrema
Hi,

Using the following program I was able to get the entire file path of the
indexed files that matched the given queries. But my intention is to get
only the file names, even without the .txt extension, as I need to send
these file names as labels to another application. So please let me know
how I can get only the file names in the following code.

Thanks in advance!
Manjula.


My code:

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class LuceneDemo {

    public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
    public static final String INDEX_DIRECTORY = "indexDirectory";
    public static final String FIELD_PATH = "path";
    public static final String FIELD_CONTENTS = "contents";

    public static void main(String[] args) throws Exception {
        createIndex();
        searchIndex("rice");
        searchIndex("milk");
        searchIndex("banana");
        searchIndex("foo");
    }

    public static void createIndex() throws CorruptIndexException,
            LockObtainFailedException, IOException {
        SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
                StopAnalyzer.ENGLISH_STOP_WORDS);
        boolean recreateIndexIfExists = true;
        IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer,
                recreateIndexIfExists);
        File dir = new File(FILES_TO_INDEX_DIRECTORY);
        File[] files = dir.listFiles();
        for (File file : files) {
            Document document = new Document();
            String path = file.getCanonicalPath();
            document.add(new Field(FIELD_PATH, path, Field.Store.YES,
                    Field.Index.UN_TOKENIZED, Field.TermVector.YES));
            Reader reader = new FileReader(file);
            document.add(new Field(FIELD_CONTENTS, reader));
            indexWriter.addDocument(document);
        }
        indexWriter.optimize();
        indexWriter.close();
    }

    public static void searchIndex(String searchString) throws IOException,
            ParseException {
        System.out.println("Searching for '" + searchString + "'");
        Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
                StopAnalyzer.ENGLISH_STOP_WORDS);
        QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
        Query query = queryParser.parse(searchString);
        Hits hits = indexSearcher.search(query);
        System.out.println("Number of hits: " + hits.length());
        TopDocs results = indexSearcher.search(query, 10);
        ScoreDoc[] hits1 = results.scoreDocs;
        for (ScoreDoc hit : hits1) {
            Document doc = indexSearcher.doc(hit.doc);
            System.out.printf("%5.3f %s\n", hit.score, doc.get(FIELD_CONTENTS));
        }
        Iterator it = hits.iterator();
        while (it.hasNext()) {
            Hit hit = (Hit) it.next();
            Document document = hit.getDocument();
            String path = document.get(FIELD_PATH);
            System.out.println("Hit: " + path);
        }
    }
}


Re: How to get file names instead of paths?

2010-06-15 Thread manjula wijewickrema
Dear Ian,

The segment you suggested works nicely. Thanks a lot for your kind help.

Manjula.

On Fri, Jun 11, 2010 at 4:00 PM, Ian Lea  wrote:

> Something like this
>
> File f = new File(path);
> String fn = f.getName();
> return fn.substring(0, fn.lastIndexOf("."));
>
>
> --
> Ian.


Lucene Scoring

2010-07-05 Thread manjula wijewickrema
Hi,

In my application, I input only a single-term query (at one time) and get
back the corresponding scores for those queries. But I am struggling a
little with understanding Lucene scoring. I have referred to
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
and some other pages to resolve my questions, but some still remain.

1) Why does it take the square root of the frequency as the tf value, and
the square of the idf value, in the score function?

2) If I enter a single-term query, what will coord(q,d) return? Since there
is always one term in the query, I think it should always be 1! Am I
correct?

3) I am also struggling to understand sumOfSquaredWeights (in queryNorm(q)).
As I understand it, this value depends on the nature of the query we input,
and depending on that it uses different query types such as TermQuery,
MultiTermQuery, BooleanQuery, WildcardQuery, PhraseQuery, PrefixQuery, etc.
But if I always use a single-term query, which of the above will the system
select?

If somebody can please help me to resolve these problems, I'd appreciate
any reply.

Regards,
Manjula


Re: Lucene Scoring

2010-07-05 Thread manjula wijewickrema
Dear Grant,

Thanks a lot for your guidance. As you mentioned, I tried to use the
explain() method to get explanations for the relevant scoring. But once I
call the explain() method, the system indicates the following error.

Error:
'The method explain(Query,int) in the type Searcher is not applicable for
the arguments (String, int)'.

In my code I call the explain() method as follows:
Searcher.explain("rice",0);

Possibly the problem is in my way of passing parameters. In my case, I have
chosen "rice" as my query and indexed only one document.

Could you please let me know what's wrong with this? I have also included
the code.

Thanks
Manjula

code:

import org.apache.lucene.search.Searcher;

public class LuceneDemo {

    public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
    public static final String INDEX_DIRECTORY = "indexDirectory";
    public static final String FIELD_PATH = "path";
    public static final String FIELD_CONTENTS = "contents";

    public static void main(String[] args) throws Exception {
        createIndex();
        searchIndex("rice");
    }

    public static void createIndex() throws CorruptIndexException,
            LockObtainFailedException, IOException {
        SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
                StopAnalyzer.ENGLISH_STOP_WORDS);
        boolean recreateIndexIfExists = true;
        IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer,
                recreateIndexIfExists);
        File dir = new File(FILES_TO_INDEX_DIRECTORY);
        File[] files = dir.listFiles();
        for (File file : files) {
            Document document = new Document();
            String path = file.getCanonicalPath();
            document.add(new Field(FIELD_PATH, path, Field.Store.YES,
                    Field.Index.UN_TOKENIZED, Field.TermVector.YES));
            Reader reader = new FileReader(file);
            document.add(new Field(FIELD_CONTENTS, reader));
            indexWriter.addDocument(document);
        }
        indexWriter.optimize();
        indexWriter.close();
    }

    public static void searchIndex(String searchString) throws IOException,
            ParseException {
        System.out.println("Searching for '" + searchString + "'");
        Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
                StopAnalyzer.ENGLISH_STOP_WORDS);
        QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
        Query query = queryParser.parse(searchString);
        Hits hits = indexSearcher.search(query);
        System.out.println("Number of hits: " + hits.length());
        TopDocs results = indexSearcher.search(query, 10);
        ScoreDoc[] hits1 = results.scoreDocs;
        for (ScoreDoc hit : hits1) {
            Document doc = indexSearcher.doc(hit.doc);
            //System.out.printf("%5.3f %s\n", hit.score, doc.get(FIELD_CONTENTS));
            System.out.println(hit.score);
            Searcher.explain("rice", 0);  // the line the compiler rejects
        }
        Iterator it = hits.iterator();
        while (it.hasNext()) {
            Hit hit = (Hit) it.next();
            Document document = hit.getDocument();
            String path = document.get(FIELD_PATH);
            System.out.println("Hit: " + path);
        }
    }
}


On Mon, Jul 5, 2010 at 7:46 PM, Grant Ingersoll  wrote:

>
> On Jul 5, 2010, at 5:02 AM, manjula wijewickrema wrote:
>
> > 1) Why does it take the square root of the frequency as the tf value and
> > the square of the idf value in the score function?
>
> Somewhat arbitrary, I suppose, but I think someone way back did some tests
> and decided it performed "best" in general.  More importantly, the point of
> the Similarity class is you can override these if you desire.
>
> > 2) If I enter a single-term query, what will coord(q,d) return? Since
> > there is always one term in the query, I think it should always be 1!
> > Am I correct?
>
> Should be.  You can run the explain() method to confirm.
>
> > 3) I am also struggling to understand sumOfSquaredWeights (in
> > queryNorm(q)). As I understand it, this value depends on the nature of
> > the query we input, and depending on that it uses different query types
> > such as TermQuery, MultiTermQuery, BooleanQuery, WildcardQuery,
> > PhraseQuery, PrefixQuery, etc. But 

Re: Lucene Scoring

2010-07-07 Thread manjula wijewickrema
Dear Ian,

Thanks a lot for your reply. The way you proposed works correctly and
solved half of my problem. Once I run the program, the system gives me the
following output.

output:
**
Searching for 'milk'

Number of hits: 1

0.13287117

0.13287117 = (MATCH) fieldWeight(contents:milk in 0), product of:

  1.7320508 = tf(termFreq(contents:milk)=3)

  0.30685282 = idf(docFreq=1, maxDocs=1)

  0.25 = fieldNorm(field=contents, doc=0)

Hit: D:\JADE\work\MobilNet\Lucene291\filesToIndex\deron-foods.txt
***
Here I have no problem calculating the values for tf and idf, but I have no
idea how to calculate fieldNorm. According to
http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,%20int)
I think norm(t,d) gives the value for fieldNorm, and in my case the system
returns the value lengthNorm(field) for norm(t,d).

1) Am I correct?
2) If so, could you please let me know the way (formula) of calculating
lengthNorm(field)? (I checked several documents and the code to understand
this, but was unable to find the mathematical formula behind this method.)
3) If lengthNorm(field) is not what is behind fieldNorm, then how is
fieldNorm calculated?

Please help me to resolve this matter.

Manjula.


On Tue, Jul 6, 2010 at 12:47 PM, Ian Lea  wrote:

> You are calling the explain method incorrectly.  You need something like
>
>  System.out.println(indexSearcher.explain(query, 0));
>
>
> See the javadocs for details.
>
>
> --
> Ian.
>
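As a cross-check on the numbers in the explain() output above:
DefaultSimilarity computes tf as sqrt(termFreq), idf as
1 + ln(maxDocs/(docFreq+1)), and lengthNorm(field) as 1/sqrt(number of
terms in the field), with the norm stored in a lossy single-byte encoding.
A sketch that reproduces the 'milk' score (the 16-term field length is
inferred from fieldNorm = 0.25; it is not shown in the output):

double tf = Math.sqrt(3);                  // 1.7320508: termFreq = 3
double idf = 1 + Math.log(1.0 / (1 + 1));  // 0.30685282: maxDocs = 1, docFreq = 1
double fieldNorm = 1.0 / Math.sqrt(16);    // 0.25: ~16 indexed terms, byte-encoded
System.out.println(tf * idf * fieldNorm);  // 0.13287117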

Why not normalization?

2010-07-07 Thread manjula wijewickrema
Hi,

In my application, I index only one file and enter only a single-term
query to check the Lucene score. I used the explain() method to see how the
result is obtained, and the system gave me the result as the product of tf,
idf, and fieldNorm.

1) Although Lucene uses tf to calculate the score, it seems to me that the
term frequency has not been normalized. Even if I index several documents,
it does not normalize the tf value. Therefore, since the total number of
words in the indexed documents varies, can't there be a flaw in Lucene's
scoring?

2) What is the formula for calculating this fieldNorm value?

If somebody can, please help me.

Thanks in advance
Manjula.


Re: Why not normalization?

2010-07-08 Thread manjula wijewickrema
Hi Rebecca,

Thanks for your valuable comments. Yes, I observed that once the number of
terms in the document goes up, the fieldNorm value goes down
correspondingly. I think, therefore, there won't be any fault due to the
variation in the total number of terms in the document. Am I right?

Manjula.

On Thu, Jul 8, 2010 at 9:34 AM, Rebecca Watson  wrote:

> hi,
>
> > 1) Although Lucene uses tf to calculate scoring it seems to me that term
> > frequency has not been normalized. Even if I index several documents, it
> > does not normalize the tf value. Therefore, since the total number of
> > words in the indexed documents varies, can't there be a fault in
> > Lucene's scoring?
>
> tf = term frequency, i.e. the number of times the term appears in the
> document, while idf is inverse document frequency - a measure of how rare
> a term is, i.e. related to how many documents the term appears in.
>
> If term1 occurs more frequently in a document, i.e. tf is higher, you want
> to weight the document higher when you search for term1.
>
> But if term1 is a very frequent term, i.e. in lots of documents, then it's
> probably not as important to an overall search (where we have term1, term2
> etc.), so you want to downweight it (idf comes in).
>
> Then the normalisations like length normalisation (allowing for 'fair'
> scoring across varied field lengths) come in too.
>
> The tf-idf scoring formula used by Lucene is a scoring method that's been
> around a long, long time... there are competing scoring metrics, but
> that's an IR thing and not an argument you want to start on the Lucene
> lists! :)
>
> These are IR ('information retrieval') concepts and you might want to
> start by going through the tf-idf scoring / some explanations of this kind
> of scoring:
>
> http://en.wikipedia.org/wiki/Tf%E2%80%93idf
> http://wiki.apache.org/lucene-java/InformationRetrieval
>
> > 2) What is the formula to calculate this fieldNorm value?
>
> In terms of how Lucene implements its tf-idf scoring, you can see here:
> http://lucene.apache.org/java/3_0_2/scoring.html
>
> Also, the Lucene in Action book is a really good book if you are starting
> out with Lucene (and will save you a lot of grief with understanding
> Lucene / setting up your application!); it covers all the basics and then
> moves on to more advanced stuff, and has lots of code examples too:
> http://www.manning.com/hatcher2/
>
> hope that helps,
>
> bec :)
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
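
For reference, the product of tf, idf and fieldNorm described above can be
reproduced directly from DefaultSimilarity. A minimal sketch, assuming Lucene
2.9.x and its default settings; the field name and the counts are placeholders:

import org.apache.lucene.search.DefaultSimilarity;

public class ScoreFactors {
  public static void main(String[] args) {
    DefaultSimilarity sim = new DefaultSimilarity();

    // tf: square root of the raw term frequency
    float tf = sim.tf(105);                        // sqrt(105) ~ 10.2470

    // idf: 1 + ln(numDocs / (docFreq + 1))
    float idf = sim.idf(1, 1);                     // 1 + ln(1/2) ~ 0.3069

    // lengthNorm: 1 / sqrt(number of terms in the field),
    // assuming the default boost of 1 (no field or document boosts)
    float fieldNorm = sim.lengthNorm("contents", 10000);  // 0.01

    // for a single-term query, explain() reports the product of the three
    System.out.println(tf * idf * fieldNorm);
  }
}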


scoring and index size

2010-07-09 Thread manjula wijewickrema
Hi,

I ran a simple programme to see how Lucene scores a single indexed document.
The explain() method gave me the following results.
***

Searching for 'metaphysics'

Number of hits: 1

0.030706111

0.030706111 = (MATCH) fieldWeight(contents:metaphys in 0), product of:

10.246951 = tf(termFreq(contents:metaphys)=105)

0.30685282 = idf(docFreq=1, maxDocs=1)

0.009765625 = fieldNorm(field=contents, doc=0)

*

But I encountered the following problems;

1) In this case, I did not change anything about boost values. So fieldNorm
should be 1/sqrt(terms in field)? (I noticed in the Lucene email archive that
the default boost value is 1.)

2) But even if I manually calculate the fieldNorm value (as 1/sqrt(terms in
field)), it only approximately matches the value given by the system. Can
this be due to encode/decode precision loss of the norm?

3) My indexed document consisted of a total of 19,078 words, including 125
occurrences of the word 'metaphysics' (i.e. my query; I input a single-term
query). But as you can see in the above output, the system counts only 105
occurrences of 'metaphysics'. However, once I removed part of the indexed
document and checked the count of 'metaphysics' against the system results, I
noticed that the system counts it correctly for the shortened document. Why
this kind of behaviour? Is there any size limitation on indexed documents?

Could somebody please help me solve these problems?

Thanks!

Manjula.
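
On question 2 above, the mismatch comes from the single-byte norm encoding. A
minimal sketch of the round trip, assuming Lucene 2.9.x (the term count is a
placeholder):

import org.apache.lucene.search.Similarity;

public class NormPrecision {
  public static void main(String[] args) {
    // a field truncated at the default MaxFieldLength of 10,000 terms
    // has lengthNorm 1/sqrt(10000) = 0.01
    float raw = (float) (1.0 / Math.sqrt(10000));

    byte encoded = Similarity.encodeNorm(raw);   // norms are stored as one byte
    float decoded = Similarity.decodeNorm(encoded);

    System.out.println(raw + " -> " + decoded);  // 0.01 -> 0.009765625
  }
}

The decoded value, 0.009765625, is exactly the fieldNorm in the explain()
output above, which also hints that the field was truncated at the default
10,000-term limit — consistent with the 105 rather than 125 counted
occurrences in question 3.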


Re: scoring and index size

2010-07-09 Thread manjula wijewickrema
Uwe, thanx for your comments. Following is the code I used in this case.
Could you please let me know where and how I have to set the UNLIMITED field
length?
Thanx again!
Manjula

code--

*

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class LuceneDemo {

  public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";

  public static final String INDEX_DIRECTORY = "indexDirectory";

  public static final String FIELD_PATH = "path";

  public static final String FIELD_CONTENTS = "contents";

  public static void main(String[] args) throws Exception {
    createIndex();
    //searchIndex("rice AND milk");
    searchIndex("metaphysics");
    //searchIndex("banana");
    //searchIndex("foo");
  }

  public static void createIndex() throws CorruptIndexException,
      LockObtainFailedException, IOException {

    SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
        StopAnalyzer.ENGLISH_STOP_WORDS);

    boolean recreateIndexIfExists = true;

    // note: this constructor variant defaults to MaxFieldLength.LIMITED
    // (10,000 terms per field)
    IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer,
        recreateIndexIfExists);

    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();

    for (File file : files) {
      Document document = new Document();
      //contents#setOmitNorms(true);
      String path = file.getCanonicalPath();
      document.add(new Field(FIELD_PATH, path, Field.Store.YES,
          Field.Index.UN_TOKENIZED, Field.TermVector.YES));
      Reader reader = new FileReader(file);
      document.add(new Field(FIELD_CONTENTS, reader));
      indexWriter.addDocument(document);
    }

    indexWriter.optimize();
    indexWriter.close();
  }

  public static void searchIndex(String searchString)
      throws IOException, ParseException {

    System.out.println("Searching for '" + searchString + "'");

    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    IndexReader indexReader = IndexReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);

    SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
        StopAnalyzer.ENGLISH_STOP_WORDS);
    QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
    Query query = queryParser.parse(searchString);

    Hits hits = indexSearcher.search(query);
    System.out.println("Number of hits: " + hits.length());

    TopDocs results = indexSearcher.search(query, 10);
    ScoreDoc[] hits1 = results.scoreDocs;
    for (ScoreDoc hit : hits1) {
      Document doc = indexSearcher.doc(hit.doc);
      //System.out.printf("%5.3f %s\n", hit.score, doc.get(FIELD_CONTENTS));
      System.out.println(hit.score);
      //Searcher.explain("rice", 0);
    }

    System.out.println(indexSearcher.explain(query, 0));
    //System.out.println(indexSearcher.explain(query, 1));
    //System.out.println(indexSearcher.explain(query, 2));
    //System.out.println(indexSearcher.explain(query, 3));

    Iterator<Hit> it = hits.iterator();
    while (it.hasNext()) {
      Hit hit = it.next();
      Document document = hit.getDocument();
      String path = document.get(FIELD_PATH);
      System.out.println("Hit: " + path);
    }
  }
}






On Fri, Jul 9, 2010 at 1:06 PM, Uwe Schindler  wrote:

> Maybe you have MaxFieldLength.LIMITED instead of UNLIMITED? Then the number
> of terms per document is limited.
>
> The calculation precision is limited by the float norm encoding, but also
> if your analyzer removed stop words, the norm may not be what you expect.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -Original Message-
> > From: manjula wijewickrema [mailto:manjul...@gmail.com]
> > Sent: Friday, July 09, 2010 9:21 AM
> > To: java-user@lucene.apache.org
> > Subject: scoring and index size
> >
> > Hi,
> >
> > I ran a simple programme to see how Lucene scores a single indexed
> > document. The explain() method gave me the following results.
> > ***
> >
> > Searching for 'metaphysics'
> >
> > Number of hits: 1
> >
> > 0.030706111
> >
> > 0.030706111 = (MATCH) fieldWeight(contents:metaphys in 0), product of:
> >
> > 10.246951 = tf(termFreq(contents:metaphys)=105)
> >
> > 0.30685282 = idf(docFreq=1, maxDocs=1)
> >
> > 0.009765625 = fieldNorm(field=contents, doc=0)
> >
> > *
> >
> > But I encountered the following problems;
> >
> > 1) In this case, I did not change anything about boost values. So
> > fieldNorm should be 1/sqrt(terms in field)? (I noticed in the Lucene email
> > archive that the default boost value is 1.)
> >
> > 2) But even if I manually calculate the fieldNorm value 

Re: Why not normalization?

2010-07-09 Thread manjula wijewickrema
Thanx

On Fri, Jul 9, 2010 at 1:10 PM, Uwe Schindler  wrote:

> > Thanks for your valuable comments. Yes, I observed that once the number
> > of terms in the document goes up, the fieldNorm value goes down
> > correspondingly. I think, therefore, there won't be any fault due to the
> > variation in the total number of terms across documents. Am I right?
>
> With the current scoring model, advanced statistics are not available.
> There are currently some approaches to add BM25 support to Lucene, for
> which the index format needs to be enhanced to contain more statistics
> (number of terms per document, avg number of terms per document, ...).
>
> > On Thu, Jul 8, 2010 at 9:34 AM, Rebecca Watson 
> > wrote:
> >
> > > hi,
> > >
> > > > 1) Although Lucene uses tf to calculate scoring it seems to me that
> > > > term frequency has not been normalized. Even if I index several
> > > > documents, it does not normalize tf value. Therefore, since the
> > > > total number of words in index documents are varied, can't there be
> > > > a fault in Lucene's
> > > scoring?
> > >
> > > tf = term frequency i.e. the number of times the term appears in the
> > > document, while idf is inverse document frequency - is a measure of
> > > how rare a term is, i.e. related to how many documents the term
> > > appears in.
> > >
> > > if term1 occurs more frequently in a document i.e. tf is higher, you
> > > want to weight the document higher when you search for term1
> > >
> > > but if term1 is a very frequent term, i.e. in lots of documents, then
> > > it's probably not as important to an overall search (where we have
> > > term1, term2 etc) so you want to downweight it (idf comes in)
> > >
> > > then the normalisations like length normalisation (allow for 'fair'
> > > scoring across varied field length) come in too.
> > >
> > > the tf-idf scoring formula used by lucene is a  scoring method that's
> > > been around a long long time... there are competing scoring metrics
> > > but that's an IR thing and not an argument you want to start on the
> > > lucene lists! :)
> > >
> > > these are IR ('information retrieval') concepts and you might want to
> > > start by going through the tf-idf scoring / some explanations for
> > > this kind of scoring.
> > >
> > > http://en.wikipedia.org/wiki/Tf%E2%80%93idf
> > > http://wiki.apache.org/lucene-java/InformationRetrieval
> > >
> > >
> > > > 2) What is the formula to calculate this fieldNorm value?
> > >
> > > in terms of how lucene implements its tf-idf scoring - you can see
> here:
> > > http://lucene.apache.org/java/3_0_2/scoring.html
> > >
> > > also, the lucene in action book is a really good book if you are
> > > starting out with lucene (and will save you a lot of grief with
> > > understanding lucene / setting up your application!), it covers all
> > > the basics and then moves on to more advanced stuff and has lots of
> > > code examples too:
> > > http://www.manning.com/hatcher2/
> > >
> > > hope that helps,
> > >
> > > bec :)
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: scoring and index size

2010-07-12 Thread manjula wijewickrema
Hi Koji,

Thanks for your information

Manjula



On Fri, Jul 9, 2010 at 5:04 PM, Koji Sekiguchi  wrote:

> (10/07/09 19:30), manjula wijewickrema wrote:
>
>> Uwe, thanx for your comments. Following is the code I used in this case.
>> Could you please let me know where and how I have to set the UNLIMITED
>> field length?
>> Thanx again!
>> Manjula
>>
>>
>>
> Manjula,
>
> You can set UNLIMITED field length in the IW constructor:
>
>
> http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/index/IndexWriter.html#IndexWriter%28org.apache.lucene.store.Directory,%20org.apache.lucene.analysis.Analyzer,%20boolean,%20org.apache.lucene.index.IndexWriter.MaxFieldLength%29
>
> Koji
>
> --
> http://www.rondhuit.com/en/
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
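
Putting Koji's pointer into the createIndex() code from earlier in this
thread, the change is one constructor argument (a sketch, assuming Lucene
2.9.x):

Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
IndexWriter indexWriter = new IndexWriter(directory, analyzer,
    recreateIndexIfExists, IndexWriter.MaxFieldLength.UNLIMITED);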


MaxFieldLength

2010-07-12 Thread manjula wijewickrema
Hi,

I have seen that, once the field length of a document goes over a certain
limit (
http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH
gives it as 10,000 terms by default), Lucene truncates those documents. Is
there any possibility of documents being truncated if we increase the number
of indexed documents (assume there are no individual documents which exceed
Lucene's default MaxFieldLength)?

Thanx
Manjula.


Re: MaxFieldLength

2010-07-12 Thread manjula wijewickrema
OK Erick, that answers it. If no document exceeds the default MaxFieldLength,
then no document will be truncated even if we increase the number of documents
in the index. Am I correct? Thanx for your commitment.

Manjula.

On Tue, Jul 13, 2010 at 3:57 AM, Erick Erickson wrote:

> I'm not sure I understand your question. The number of documents
> has no bearing on the field length of each, which is what the
> max field length is all about. You can change the value here
> by calling IndexWriter.setMaxFieldLength to something shorter
> than the default.
>
> So no, if no document exceeds the default (Terms, not characters),
> no document will be truncated.
>
> The 10,000 limit also has no bearing on how much space indexing
> a document takes as long as there are fewer than 10,000 terms. That
> is, a document with 5,000 terms will take up just as much space
> with any MaxFieldLength > 5,000.
>
> HTH
> Erick
>
> On Mon, Jul 12, 2010 at 4:00 AM, manjula wijewickrema
> wrote:
>
> > Hi,
> >
> > I have seen that, once the field length of a document goes over a certain
> > limit (
> > http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH
> > gives it as 10,000 terms by default), Lucene truncates those documents.
> > Is there any possibility of documents being truncated if we increase the
> > number of indexed documents (assume there are no individual documents
> > which exceed Lucene's default MaxFieldLength)?
> >
> > Thanx
> > Manjula.
> >
>
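
For reference, the setter Erick mentions can also shrink the limit at runtime;
a sketch assuming Lucene 2.9.x, with a placeholder limit of 5,000 terms:

IndexWriter w = new IndexWriter(directory, analyzer, true,
    IndexWriter.MaxFieldLength.LIMITED);  // LIMITED = 10,000 terms per field
w.setMaxFieldLength(5000);                // truncate each field at 5,000 terms instead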


Databases

2010-07-22 Thread manjula wijewickrema
Hi,

Normally, when I am building my index directory for indexed documents, I
keep my indexed files simply in a directory called 'filesToIndex'. So in this
case, I do not use any standard database management system such as MySQL.

1) Will it be possible to use MySQL or any other system for the purpose of
managing indexed documents in Lucene?

2) Is it necessary to follow that kind of methodology with Lucene?

3) If we do not use such a database management system, will there be any
disadvantages with a large number of indexed files?

Appreciate any reply from you.
Thanks,
Manjula.


Re: Databases

2010-07-27 Thread manjula wijewickrema
Hi,

Thanks a lot for your information.

Regards,
Manjula.

On Fri, Jul 23, 2010 at 12:48 PM, tarun sapra  wrote:

> You can use Hibernate Search to maintain the synchronization between the
> Lucene index and a MySQL RDBMS.
>
> On Fri, Jul 23, 2010 at 11:16 AM, manjula wijewickrema
> wrote:
>
> > Hi,
> >
> > Normally, when I am building my index directory for indexed documents, I
> > keep my indexed files simply in a directory called 'filesToIndex'. So in
> > this case, I do not use any standard database management system such as
> > MySQL.
> >
> > 1) Will it be possible to use MySQL or any other system for the purpose
> > of managing indexed documents in Lucene?
> >
> > 2) Is it necessary to follow that kind of methodology with Lucene?
> >
> > 3) If we do not use such a database management system, will there be any
> > disadvantages with a large number of indexed files?
> >
> > Appreciate any reply from you.
> > Thanks,
> > Manjula.
> >
>
>
>
> --
> Thanks & Regards
> Tarun Sapra
>


Phrase indexing and searching

2013-12-22 Thread Manjula Wijewickrema
Dear All,

My Lucene programme is able to index single words and find the corpus
documents that best match an input document (based on term frequencies).
Now I want to index two-word phrases and find the matching corpus documents
(based on phrase frequencies) for the input document.

ex:-
input document:
blue house is very beautiful

split it into phrases (say two-term phrases) like:
blue house
house very
very beautiful
etc.

Is it possible to do this with Lucene? If so, how can I do it?

Thanks,

Manjula.


Re: Phrase indexing and searching

2013-12-23 Thread Manjula Wijewickrema
Hi Steve,

Thanks for the reply. Could you please let me know how to embed
ShingleFilter in the code for both indexing and searching? Different people
suggest different snippets, and they did not do the job.

Thanks,

Manjula.


On Mon, Dec 23, 2013 at 8:42 PM, Steve Rowe  wrote:

> Hi Manjula,
>
> Sounds like ShingleFilter will do what you want: <
>
> http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html
> >
>
> Steve
> www.lucidworks.com
> On Dec 22, 2013 11:25 PM, "Manjula Wijewickrema" 
> wrote:
>
> > Dear All,
> >
> > My Lucene programme is able to index single words and find the corpus
> > documents that best match an input document (based on term frequencies).
> > Now I want to index two-word phrases and find the matching corpus
> > documents (based on phrase frequencies) for the input document.
> >
> > ex:-
> > input document:
> > blue house is very beautiful
> >
> > split it into phrases (say two term phrases) like:
> > blue house
> > house very
> > very beautiful
> > etc.
> >
> >  Is it possible to do this with Lucene? If so how can I do it?
> >
> > Thanks,
> >
> > Manjula.
> >
>


Re: Is it wrong to create index writer on each query request.

2014-06-05 Thread Manjula Wijewickrema
Hi,

What are the other disadvantages (other than the time factor) of creating an
index writer for every request?

Manjula.


On Thu, Jun 5, 2014 at 2:34 PM, Aditya  wrote:

> Hi Rajendra
>
> You should NOT create index writer for every request.
>
> >>Whether it is time consuming to update index writer when new document
> will come.
> No.
>
> Regards
> Aditya
> www.findbestopensource.com
>
>
>
> On Thu, Jun 5, 2014 at 12:24 PM, Rajendra Rao  >
> wrote:
>
> > I have a system in which documents and queries come in frequently. I am
> > creating an index writer in memory for each query request. I want to
> > know: is it good to separate index writing/loading from query requests?
> > Is it good to save the index writer on hard disk? Is it time consuming to
> > update the index writer when a new document comes in?
> >
>
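
A minimal sketch of the usual alternative, assuming Lucene 2.9.x: open the
writer once at startup and share it, reopening readers only when the index
changes (the variable names are placeholders):

// opened once at application startup and shared by all requests
IndexWriter writer = new IndexWriter(directory, analyzer,
    IndexWriter.MaxFieldLength.UNLIMITED);

// per request: reuse the existing reader, reopening only if the index changed
IndexReader newReader = reader.reopen();
if (newReader != reader) {
  reader.close();
  reader = newReader;
}
IndexSearcher searcher = new IndexSearcher(newReader);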


ShingleAnalyzerWrapper question

2014-06-10 Thread Manjula Wijewickrema
Hi,

In my programme, I can index and search a document based on unigrams. I
modified the code as follows to obtain the results based on bigrams.
However, it did not give me the desired output.

*

public static void createIndex() throws CorruptIndexException,
    LockObtainFailedException, IOException {

  final String[] NEW_STOP_WORDS = { "a", "able", "about", "actually",
      "after", "allow", "almost", "already", "also", "although",
      "always", "am", "an", "and", "any", "anybody" };  // only a portion

  SnowballAnalyzer analyzer = new SnowballAnalyzer("English", NEW_STOP_WORDS);

  Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);

  ShingleAnalyzerWrapper sw = new ShingleAnalyzerWrapper(analyzer, 2);
  sw.setOutputUnigrams(false);

  IndexWriter w = new IndexWriter(INDEX_DIRECTORY, analyzer, true,
      IndexWriter.MaxFieldLength.UNLIMITED);

  File dir = new File(FILES_TO_INDEX_DIRECTORY);
  File[] files = dir.listFiles();

  for (File file : files) {
    Document doc = new Document();
    String text = "";
    doc.add(new Field("contents", text, Field.Store.YES,
        Field.Index.UN_TOKENIZED, Field.TermVector.YES));

    Reader reader = new FileReader(file);
    doc.add(new Field(FIELD_CONTENTS, reader));
    w.addDocument(doc);
  }

  w.optimize();
  w.close();
}




Still the output is:


{contents: /1, assist/1, fine/1, librari/1, librarian/1, main/1, manjula/3,
name/1, sabaragamuwa/1, univers/1}

***


If anybody can, please help me to obtain the correct output.


Thanks,


Manjula.


Re: ShingleAnalyzerWrapper question

2014-06-16 Thread Manjula Wijewickrema
Dear Steve,

It works. Thanks.




On Wed, Jun 11, 2014 at 6:18 PM, Steve Rowe  wrote:

> You should give sw rather than analyzer in the IndexWriter ctor.
>
> Steve
> www.lucidworks.com
>  On Jun 11, 2014 2:24 AM, "Manjula Wijewickrema" 
> wrote:
>
> > Hi,
> >
> > In my programme, I can index and search a document based on unigrams. I
> > modified the code as follows to obtain the results based on bigrams.
> > However, it did not give me the desired output.
> >
> > *
> >
> > public static void createIndex() throws CorruptIndexException,
> >     LockObtainFailedException, IOException {
> >
> >   final String[] NEW_STOP_WORDS = { "a", "able", "about", "actually",
> >       "after", "allow", "almost", "already", "also", "although",
> >       "always", "am", "an", "and", "any", "anybody" };  // only a portion
> >
> >   SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
> >       NEW_STOP_WORDS);
> >
> >   Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
> >
> >   ShingleAnalyzerWrapper sw = new ShingleAnalyzerWrapper(analyzer, 2);
> >   sw.setOutputUnigrams(false);
> >
> >   IndexWriter w = new IndexWriter(INDEX_DIRECTORY, analyzer, true,
> >       IndexWriter.MaxFieldLength.UNLIMITED);
> >
> >   File dir = new File(FILES_TO_INDEX_DIRECTORY);
> >   File[] files = dir.listFiles();
> >
> >   for (File file : files) {
> >     Document doc = new Document();
> >     String text = "";
> >     doc.add(new Field("contents", text, Field.Store.YES,
> >         Field.Index.UN_TOKENIZED, Field.TermVector.YES));
> >
> >     Reader reader = new FileReader(file);
> >     doc.add(new Field(FIELD_CONTENTS, reader));
> >     w.addDocument(doc);
> >   }
> >
> >   w.optimize();
> >   w.close();
> > }
> >
> >
> > 
> >
> > Still the output is:
> >
> >
> > {contents: /1, assist/1, fine/1, librari/1, librarian/1, main/1,
> manjula/3,
> > name/1, sabaragamuwa/1, univers/1}
> >
> > ***
> >
> >
> > If anybody can, please help me to obtain the correct output.
> >
> >
> > Thanks,
> >
> >
> > Manjula.
> >
>
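
In other words, the corrected line in the createIndex() snippet above is
(a sketch, assuming Lucene 2.9.x):

// pass the wrapper, not the bare SnowballAnalyzer,
// so that the bigram shingles actually reach the index
IndexWriter w = new IndexWriter(INDEX_DIRECTORY, sw, true,
    IndexWriter.MaxFieldLength.UNLIMITED);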


Why bigram tf-idf is 0?

2014-06-24 Thread Manjula Wijewickrema
Hi,

In my programme, I tried to select the most relevant document based on
bigrams.

System gives me the following output.

{contents: /1, assist librarian/1, assist manjula/2, assist sabaragamuwa/1,
fine manjula/1, librari manjula/1, librarian sabaragamuwa/1, main
librari/2, manjula assist/4, manjula fine/1, manjula name/1, name
manjula/1, sabaragamuwa univers/3, univers main/2, univers sabaragamuwa/1}

The frequencies of the bigrams are also correctly identified by the system.
But the tf-idf scores of these bigrams are given as 0. However, the same
programme gives the correct tf-idf values for unigrams.

Following is the code snippet that I wrote to determine the tf-idf of
bigrams.




for (int q1 = 1; q1 <= freqs.length; q1++) {
    Iterator<Hit> it = hits.iterator();
    TopDocs results = indexSearcher.search(query, 10);
    ScoreDoc[] hits1 = results.scoreDocs;
    for (ScoreDoc hit : hits1) {
        Document doc = indexSearcher.doc(hit.doc);
        tfidf[q1 - 1] = hit.score;
    }
}

***
Here, "hit.score" should give the tf-idf value of each bigram. Why it is
given as 0? If someone can please explain me how to resolve this problem.

Thanks,
Manjula.


bigram problem

2014-07-02 Thread Manjula Wijewickrema
Hi,

Could someone please explain how to determine the tf-idf score for bigrams?
My program is able to index and search bigrams correctly, but it does not
calculate the tf-idf for bigrams. If someone can, please help me to resolve
this.

Regards,
Manjula.


Re: bigram problem

2014-07-02 Thread Manjula Wijewickrema
Dear Parnab,

Thanks a lot for your guidance. I prefer to follow the second method, as I
have already indexed the bigrams using ShingleAnalyzerWrapper. But I have no
idea how to use NGramTokenizer here. So, could you please write one or two
lines of code showing how to use NGramTokenizer for bigrams.

Thanks,
Manjula.


On Wed, Jul 2, 2014 at 7:05 PM, parnab kumar  wrote:

> TF is straightforward: you can simply count the number of occurrences in
> the doc by simple string matching. For IDF you need to know the total
> number of docs in the collection and the number of docs containing the
> bigram. reader.maxDoc() will give you the total number of docs in the
> collection. To calculate the number of docs containing the bigram, use a
> phrase query with the slop factor set to 0. The number of docs returned by
> the IndexSearcher with the phrase query will be the number of docs
> containing the bigram. I hope this is fine.
>
> Alternatively, use NGramTokenizer (where n=2 in your case) while indexing.
> In that case, each bigram can be interpreted as a normal Lucene term.
>
> Thanks,
> Parnab
>
>
> On Wed, Jul 2, 2014 at 8:45 AM, Manjula Wijewickrema 
> wrote:
>
> > Hi,
> >
> > Could someone please explain how to determine the tf-idf score for
> > bigrams? My program is able to index and search bigrams correctly, but it
> > does not calculate the tf-idf for bigrams. If someone can, please help me
> > to resolve this.
> >
> > Regards,
> > Manjula.
> >
>
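
A minimal sketch of the first approach, assuming Lucene 2.9.x; the field name
and the bigram are placeholders:

// document frequency of the bigram: a phrase query with slop 0
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("contents", "blue"));
pq.add(new Term("contents", "house"));
pq.setSlop(0);                          // exact adjacency only

TopDocs td = indexSearcher.search(pq, 10);
int docsWithBigram = td.totalHits;      // df for the bigram
int totalDocs = indexReader.maxDoc();   // collection size, for the idf part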


Why hit is 0 for bigrams?

2014-07-07 Thread Manjula Wijewickrema
Hi,

I tried to index bigrams from a document, and the system gave me the
following output with the frequencies of the bigrams (output 1):

array size:15
array terms are:{contents: /1, assist librarian/1, assist manjula/2, assist
sabaragamuwa/1, fine manjula/1, librari manjula/1, librarian
sabaragamuwa/1, main librari/2, manjula assist/4, manjula fine/1, manjula
name/1, name manjula/1, sabaragamuwa univers/3, univers main/2, univers
sabaragamuwa/1}

For this I used the follwing code in the createIndex() class:


ShingleAnalyzerWrapper sw = new ShingleAnalyzerWrapper(analyzer, 2);

sw.setOutputUnigrams(false);



Then I tried search the indexed bigrams of the same document using the
following code in searchIndex()class:


IndexReader indexReader = IndexReader.open(directory);

IndexSearcher indexSearcher = new IndexSearcher(indexReader);

Analyzer analyzer = new WhitespaceAnalyzer();

QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);

Query query = queryParser.parse(terms[pos[freqs.length - q1]]);

System.out.println("Query: " + query);

Hits hits = indexSearcher.search(query);

System.out.println("Number of hits: " + hits.length());




For this, the system gave me the following output (output2):


Query: contents:manjula contents:assist

Number of hits: 0

Query: contents:sabaragamuwa contents:univers

Number of hits: 0

Query: contents:univers contents:main

Number of hits: 0

Query: contents:main contents:librari

Number of hits: 0


If someone can please explain me;


(1) Why is 'contents: /1' included in the array as an element? (output 1)


(2) Why does the system return the query as 'contents:manjula
contents:assist' instead of 'manjula assist'? (output 2)


(3) Why is the number of hits given as 0 instead of their frequencies?
(output 2)


I highly appreciate your kind reply.


Manjula.
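
One likely explanation for (2) and (3): each shingle is stored as a single
token that contains a space (e.g. "manjula assist"), while the
whitespace-tokenizing QueryParser splits the query string into two separate
terms, neither of which exists in the shingled index. A sketch that looks the
shingle up as one term instead, assuming Lucene 2.9.x:

// query the shingle as a single term, bypassing the QueryParser
TermQuery query = new TermQuery(new Term(FIELD_CONTENTS, "manjula assist"));

Hits hits = indexSearcher.search(query);
System.out.println("Number of hits: " + hits.length());

As for (1), the leading 'contents: /1' entry looks like a shingle built over
a stop-word or empty position; inspecting the analyzer's token output (for
example with Luke) would confirm that.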


Analyzer

2010-11-29 Thread manjula wijewickrema
Hi,

In my work, I am using Lucene and two Java classes. In the first one, I
index a document, and in the second one, I try to search for the document
most relevant to the document indexed in the first. In the first Java class,
I use the SnowballAnalyzer in the createIndex method and StandardAnalyzer in
the searchIndex method, and pass the highest-frequency terms into the second
Java class. In the second class, I use SnowballAnalyzer in the createIndex
method (this index is for the collection of documents to be searched; it is
my database) and StandardAnalyzer in the searchIndex method (I pass the most
frequently occurring terms of the first class as the search term parameter
to the searchIndex method of the second class). Using Analyzers in this
manner, what I want is to do stemming and stop-word removal in both indexes
(in both classes) and to search for those few high-frequency words (of the
first index) in the second index. So, if my intention is clear to you, could
you please let me know whether the way I have used Analyzers is correct? I
highly appreciate any comment.

Thanx.
Manjula.


Re: Analyzer

2010-11-29 Thread manjula wijewickrema
Hi Steve,

Thanx a lot for your reply. Yes, there are only two classes, and you have
understood the problem correctly. As you instructed, I tried
WhitespaceAnalyzer for querying (instead of StandardAnalyzer), and it seems
to me that it gives better results than StandardAnalyzer. So could you
please let me know the differences between StandardAnalyzer and
WhitespaceAnalyzer? I highly appreciate your response.
Thanx.

Manjula.


On Mon, Nov 29, 2010 at 7:32 PM, Steven A Rowe  wrote:

> Hi Manjula,
>
> It's not terribly clear what you're doing here - I got lost in your
> description of your (two? or maybe four?) classes.  Sometimes things are
> easier to understand if you provide more concrete detail.
>
> I suspect that you could benefit from reading the book Lucene in Action,
> 2nd edition:
>
>   http://www.manning.com/hatcher3/
>
> You would also likely benefit from using Luke, the Lucene index browser, to
> better understand your indexes' contents and debug how queries match
> documents:
>
>   http://code.google.com/p/luke/
>
> I think your question is whether you're using Analyzers correctly.  It
> sounds like you are creating two separate indexes (one for each of your
> classes), and you're using SnowballAnalyzer on the indexing side for both
> indexes, and StandardAnalyzer on the query side.
>
> The usual advice is to use the same Analyzer on both the query and the
> index side.  But it appears to be the case that you are taking stemmed index
> terms from your index #1 and then querying index #2 using these stemmed
> terms.  If this is true, then you want the query-time analyzer in your
> second index not to change the query terms.  You'll likely get better
> results using WhitespaceAnalyzer, which tokenizes on whitespace and does no
> further analysis, rather than StandardAnalyzer.
>
> Steve
>
> > -Original Message-
> > From: manjula wijewickrema [mailto:manjul...@gmail.com]
> > Sent: Monday, November 29, 2010 4:32 AM
> > To: java-user@lucene.apache.org
> > Subject: Analyzer
> >
> > Hi,
> >
> > In my work, I am using Lucene and two java classes. In the first one, I
> > index a document and in the second one, I try to search the most relevant
> > document for the indexed document in the first one. In the first java
> > class,
> > I use the SnowballAnalyzer in the createIndex method and StandardAnalyzer
> > in
> > the searchIndex method and pass the highest frequency terms into the
> > second
> > Java class. In the second class, I use SnowballAnalyzer in the
> createIndex
> > method (this index is for the collection of documents to be searched, or
> > it
> > is my database) and StandardAnalyzer in the searchIndex method (I pass
> the
> > highest frequently occuring term of the first class as the search term
> > parameter to the searchIndex method of the second class). Using Analyzers
> > in
> > this manner, what I am willing is to do the stemming, stop-words in both
> > indexes (in both classes) and to search those a few high frequency words
> > (of
> > the first index) in the second index. So, if my intention is clear to
> you,
> > could you please let me know whether it is correct or not the way I have
> > used Analyzers? I highly appreciate any comment.
> >
> > Thanx.
> > Manjula.
>


Re: Analyzer

2010-12-02 Thread manjula wijewickrema
Dear Erick,

Thanx for your information.

Manjula.

On Tue, Nov 30, 2010 at 6:37 PM, Erick Erickson wrote:

> WhitespaceAnalyzer does just that, splits the incoming stream on
> white space.
>
> From the javadocs for StandardAnalyzer:
>
> A grammar-based tokenizer constructed with JFlex
>
> This should be a good tokenizer for most European-language documents:
>
>   - Splits words at punctuation characters, removing punctuation. However,
>   a dot that's not followed by whitespace is considered part of a token.
>   - Splits words at hyphens, unless there's a number in the token, in which
>   case the whole token is interpreted as a product number and is not split.
>   - Recognizes email addresses and internet hostnames as one token.
>
> Many applications have specific tokenizer needs. If this tokenizer does not
> suit your application, please consider copying this source code directory
> to
> your project and maintaining your own grammar-based tokenizer.
>
>
> Best
>
> Erick
>
> On Tue, Nov 30, 2010 at 12:06 AM, manjula wijewickrema
> wrote:
>
> > Hi Steve,
> >
> > Thanx a lot for your reply. Yes, there are only two classes, and you
> > have understood the problem correctly. As you instructed, I tried
> > WhitespaceAnalyzer for querying (instead of StandardAnalyzer), and it
> > seems to me that it gives better results than StandardAnalyzer. So could
> > you please let me know the differences between StandardAnalyzer and
> > WhitespaceAnalyzer? I highly appreciate your response.
> > Thanx.
> >
> > Manjula.
> >
> >
> > On Mon, Nov 29, 2010 at 7:32 PM, Steven A Rowe  wrote:
> >
> > > Hi Manjula,
> > >
> > > It's not terribly clear what you're doing here - I got lost in your
> > > description of your (two? or maybe four?) classes.  Sometimes things
> are
> > > easier to understand if you provide more concrete detail.
> > >
> > > I suspect that you could benefit from reading the book Lucene in
> Action,
> > > 2nd edition:
> > >
> > >   http://www.manning.com/hatcher3/
> > >
> > > You would also likely benefit from using Luke, the Lucene index
> browser,
> > to
> > > better understand your indexes' contents and debug how queries match
> > > documents:
> > >
> > >   http://code.google.com/p/luke/
> > >
> > > I think your question is whether you're using Analyzers correctly.  It
> > > sounds like you are creating two separate indexes (one for each of your
> > > classes), and you're using SnowballAnalyzer on the indexing side for
> both
> > > indexes, and StandardAnalyzer on the query side.
> > >
> > > The usual advice is to use the same Analyzer on both the query and the
> > > index side.  But it appears to be the case that you are taking stemmed
> > index
> > > terms from your index #1 and then querying index #2 using these stemmed
> > > terms.  If this is true, then you want the query-time analyzer in your
> > > second index not to change the query terms.  You'll likely get better
> > > results using WhitespaceAnalyzer, which tokenizes on whitespace and
> does
> > no
> > > further analysis, rather than StandardAnalyzer.
> > >
> > > Steve
> > >
> > > > -Original Message-
> > > > From: manjula wijewickrema [mailto:manjul...@gmail.com]
> > > > Sent: Monday, November 29, 2010 4:32 AM
> > > > To: java-user@lucene.apache.org
> > > > Subject: Analyzer
> > > >
> > > > Hi,
> > > >
> > > > In my work, I am using Lucene and two java classes. In the first one,
> I
> > > > index a document and in the second one, I try to search the most
> > relevant
> > > > document for the indexed document in the first one. In the first java
> > > > class,
> > > > I use the SnowballAnalyzer in the createIndex method and
> > StandardAnalyzer
> > > > in
> > > > the searchIndex method and pass the highest frequency terms into the
> > > > second
> > > > Java class. In the second class, I use SnowballAnalyzer in the
> > > createIndex
> > > > method (this index is for the collection of documents to be searched,
> > or
> > > > it
> > > > is my database) and StandardAnalyser in the searchIndex method (I
> pass
> > > the
> > > > highest frequently occuring term of the first class as the search
> term
> > > > parameter to the searchIndex method of the second class). Using
> > Analyzers
> > > > in
> > > > this manner, what I am willing is to do the stemming, stop-words in
> > both
> > > > indexes (in both classes) and to search those a few high frequency
> > words
> > > > (of
> > > > the first index) in the second index. So, if my intention is clear to
> > > you,
> > > > could you please let me know whether it is correct or not the way I
> > have
> > > > used Analyzers? I highly appreciate any comment.
> > > >
> > > > Thanx.
> > > > Manjula.
> > >
> >
>
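
A quick way to see the difference in practice is to print the tokens each
analyzer produces. A minimal sketch, assuming Lucene 2.9.x's attribute-based
TokenStream API; the sample text is a placeholder:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class AnalyzerCompare {
  static void dump(Analyzer analyzer, String text) throws Exception {
    TokenStream ts = analyzer.tokenStream("contents", new StringReader(text));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      System.out.print("[" + term.term() + "] ");
    }
    System.out.println();
  }

  public static void main(String[] args) throws Exception {
    String text = "Wi-Fi routers, e-mail and IP4 addresses.";
    dump(new WhitespaceAnalyzer(), text);  // splits on whitespace only
    dump(new StandardAnalyzer(), text);    // grammar-based, strips punctuation
  }
}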


Editing StopWordList

2010-12-20 Thread manjula wijewickrema
Hi,

1) In my application, I need to add more words to the stop word list.
Therefore, is it possible to add more words to the default Lucene stop word
list?

2) If it is possible, then how can I do this?

Appreciate any comment from you.

Thanks,
Manjula.


Re: Editing StopWordList

2010-12-21 Thread manjula wijewickrema
Hi Gupta,

Thanx a lot for your reply. But I could not understand whether I can modify
(add more words to) the default stop word list, or whether I have to make a
new list as an array as follows:

public String[] NEW_STOP_WORDS = { "a", "and", "are", "as", "at", "be",
"but", "by", "for", "if", "in", "into", "is", "no", "not", "of", "on", "or",
"s", "such", "t", "that", "the", "their", "then", "there", "these", "they",
"this", "to", "was", "will", "with",
"inc", "incorporated", "co.", "ltd", "ltd.", "we", "you", "your", "us",
etc... };

and then call it as follows:

SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
StopAnalyzer.NEW_STOP_WORDS);
Am I correct?
If not, could you explain how I can do this?

Thanx in advance.
Manjula.

On Tue, Dec 21, 2010 at 10:36 AM, Anshum  wrote:

> Hi Manjula,
> You could initialize the Analyzer using a modified stop word set. Use
> StopAnalyzer.ENGLISH_STOP_WORDS_SET to get the default stop set and then
> add your own words to it. You could then initialize the analyzer using
> this new stop set instead of the default stop set.
> Hope that helps.
>
> --
> Anshum Gupta
> http://ai-cafe.blogspot.com
>
>
> On Tue, Dec 21, 2010 at 9:20 AM, manjula wijewickrema
> wrote:
>
> > Hi,
> >
> > 1) In my application, I need to add more words to the stop word list.
> > Therefore, is it possible to add more words to the default Lucene stop
> > word list?
> >
> > 2) If it is possible, then how can I do this?
> >
> > Appreciate any comment from you.
> >
> > Thanks,
> > Manjula.
> >
>
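
Tying the two messages together, a minimal sketch of the array-based variant,
assuming Lucene 2.9.x and the SnowballAnalyzer constructor used elsewhere in
this archive; the extra words are placeholders:

String[] defaults = StopAnalyzer.ENGLISH_STOP_WORDS;   // the default list
String[] extras = { "inc", "incorporated", "co.", "ltd", "we", "you" };

String[] combined = new String[defaults.length + extras.length];
System.arraycopy(defaults, 0, combined, 0, defaults.length);
System.arraycopy(extras, 0, combined, defaults.length, extras.length);

SnowballAnalyzer analyzer = new SnowballAnalyzer("English", combined);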


hit.score

2017-03-27 Thread Manjula Wijewickrema
Hi,

Can someone help me to understand the value given by 'hit.score' in Lucene?
I indexed a single document with five different words with different
frequencies and tried to understand this value. However, it doesn't seem to
be a normalized term frequency or tf-idf. I am using Lucene 2.9.1.

Any help would be highly appreciated.


Re: hit.score

2017-03-27 Thread Manjula Wijewickrema
Thanks Adrien.

On Mon, Mar 27, 2017 at 6:56 PM, Adrien Grand  wrote:

> You can use IndexSearcher.explain to see how the score was computed.
>
> Le lun. 27 mars 2017 à 14:46, Manjula Wijewickrema  a
> écrit :
>
> > Hi,
> >
> > Can someone help me to understand the value given by 'hit.score' in
> > Lucene? I indexed a single document with five different words with
> > different frequencies and tried to understand this value. However, it
> > doesn't seem to be a normalized term frequency or tf-idf. I am using
> > Lucene 2.9.1.
> >
> > Any help would be highly appreciated.
> >
>
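
For reference, the pattern that pairs each hit's score with its explanation,
in the style of the code earlier in this archive:

ScoreDoc[] hits = indexSearcher.search(query, 10).scoreDocs;
for (ScoreDoc hit : hits) {
  System.out.println(hit.score);
  // factor-by-factor breakdown: tf, idf, fieldNorm, queryNorm, ...
  System.out.println(indexSearcher.explain(query, hit.doc));
}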


Only term frequencies

2017-04-06 Thread Manjula Wijewickrema
Hi,

I have a document collection with hundreds of documents. I need to know the
term frequency of a given query term in each document. I know that
'hit.score' will give me the Lucene score for each document (which includes
term frequency as well), but I need to retrieve only the term frequencies
for each document. How can I do this?

I highly appreciate your kind response.
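
One way, assuming Lucene 2.9.x, is to walk the postings for the term with
TermDocs; the field and term are placeholders:

// raw term frequency of "planet" in every document that contains it
TermDocs td = indexReader.termDocs(new Term("contents", "planet"));
while (td.next()) {
  System.out.println("doc=" + td.doc() + " freq=" + td.freq());
}
td.close();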


Total of term frequencies

2017-04-16 Thread Manjula Wijewickrema
Hi,

Is there any way to get the total count of terms in the term frequency
vector (tfv)? I need to calculate the normalized term frequency of each term
in my tfv. I know how to obtain the length of the tfv, but that doesn't work
since I need to count duplicate occurrences as well.

Highly appreciate your kind response.
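
Summing getTermFrequencies() counts every occurrence, unlike the vector's
length. A sketch assuming Lucene 2.9.x; docId and the field name are
placeholders:

TermFreqVector tfv = indexReader.getTermFreqVector(docId, "contents");
int[] freqs = tfv.getTermFrequencies();

int totalTerms = 0;
for (int f : freqs) {
  totalTerms += f;   // counts duplicate occurrences, unlike freqs.length
}
// normalized tf of the i-th term: freqs[i] / (float) totalTerms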


TermFrequency for a String

2017-04-28 Thread Manjula Wijewickrema
IndexReader.getTermFreqVectors(2)[0].getTermFrequencies()[5];

In the above example, Lucene gives me the term frequency of the 5th term
(say, "planet") in the tfv of corpus document "2".

But I need to get the term frequency for a specified term using its string
value.

E.g.:
term frequency of the term specified as "planet" (i.e. not specified in
terms of its position "5", but specified using its string value "planet").

Is there any way to do this?

I highly appreciate your kind reply!
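
TermFreqVector offers an indexOf(String) lookup that maps a term's string
value to its position, so the frequency can be fetched without knowing the
position in advance. A sketch, assuming Lucene 2.9.x:

TermFreqVector tfv = indexReader.getTermFreqVector(2, "contents");

int pos = tfv.indexOf("planet");   // -1 if the term is absent
int freq = (pos >= 0) ? tfv.getTermFrequencies()[pos] : 0;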