Re: Lucene Scoring

2010-07-07 Thread manjula wijewickrema
Dear Ian,

Thanks a lot for your reply. The approach you proposed works correctly and
has solved half of my problem.
When I ran the program, the system gave me the following output.
output-
**
Searching for 'milk'
Number of hits: 1
0.13287117
0.13287117 = (MATCH) fieldWeight(contents:milk in 0), product of:
  1.7320508 = tf(termFreq(contents:milk)=3)
  0.30685282 = idf(docFreq=1, maxDocs=1)
  0.25 = fieldNorm(field=contents, doc=0)
Hit: D:\JADE\work\MobilNet\Lucene291\filesToIndex\deron-foods.txt
***
Here, I have no problems calculating the values for tf and idf. But I
have no idea how to calculate fieldNorm. According to
http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,%20int)
I think norm(t,d) gives the value for fieldNorm, and in my case the system
returns the value lengthNorm(field) for norm(t,d).

1) Am I correct?
2) If so, could you pls. let me know the way (formula) of calculating
lengthNorm(field)? (I checked several documents and code to understand
this, but was unable to find the mathematical formula behind this method.)
3) If lengthNorm(field) is not what lies behind fieldNorm, then how is
fieldNorm calculated?
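
(As a sanity check, and assuming the DefaultSimilarity that ships with this
Lucene version, which is an assumption and not a confirmation from the thread,
the output above is consistent with:

    tf  = sqrt(termFreq)                = sqrt(3)     = 1.7320508
    idf = 1 + ln(maxDocs / (docFreq+1)) = 1 + ln(1/2) = 0.30685282
    lengthNorm(field) = 1 / sqrt(terms in field)

On that assumption, fieldNorm = 0.25 corresponds to 1/sqrt(16), i.e. a field of
about 16 terms. The norm is encoded into a single byte in the index, so the
decoded fieldNorm is only an approximation of lengthNorm.)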

Pls. help me to resolve this matter.

Manjula.


On Tue, Jul 6, 2010 at 12:47 PM, Ian Lea  wrote:

> You are calling the explain method incorrectly.  You need something like
>
>  System.out.println(indexSearcher.explain(query, 0));
>
>
> See the javadocs for details.
>
>
> --
> Ian.
>
>
> On Tue, Jul 6, 2010 at 7:39 AM, manjula wijewickrema
>  wrote:
> > Dear Grant,
> >
> > Thanks a lot for your guidance. As you mentioned, I tried to use the
> > explain() method to get the explanations for the relevant scoring. But once
> > I called the explain() method, the system reported the following error.
> >
> > Error-
> > 'The method explain(Query,int) in the type Searcher is not applicable for
> > the arguments (String, int)'.
> >
> > In my code I call the explain() method as follows-
> > Searcher.explain("rice",0);
> >
> > Possibly something is wrong with the way I am passing the parameters. In my
> > case, I have chosen "rice" as my query and indexed only one document.
> >
> > Could you pls. let me know what's wrong with this. I have also included the
> > code below.
> >
> > Thanx
> > Manjula
> >
> > code-
> > **
> >
> > import java.io.File;
> > import java.io.FileReader;
> > import java.io.IOException;
> > import java.io.Reader;
> >
> > import org.apache.lucene.analysis.StopAnalyzer;
> > import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
> > import org.apache.lucene.document.Document;
> > import org.apache.lucene.document.Field;
> > import org.apache.lucene.index.CorruptIndexException;
> > import org.apache.lucene.index.IndexReader;
> > import org.apache.lucene.index.IndexWriter;
> > import org.apache.lucene.queryParser.ParseException;
> > import org.apache.lucene.queryParser.QueryParser;
> > import org.apache.lucene.search.Hits;
> > import org.apache.lucene.search.IndexSearcher;
> > import org.apache.lucene.search.Query;
> > import org.apache.lucene.search.ScoreDoc;
> > import org.apache.lucene.search.TopDocs;
> > import org.apache.lucene.store.Directory;
> > import org.apache.lucene.store.FSDirectory;
> > import org.apache.lucene.store.LockObtainFailedException;
> >
> > public class LuceneDemo {
> >
> >   public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
> >   public static final String INDEX_DIRECTORY = "indexDirectory";
> >   public static final String FIELD_PATH = "path";
> >   public static final String FIELD_CONTENTS = "contents";
> >
> >   public static void main(String[] args) throws Exception {
> >     createIndex();
> >     searchIndex("rice");
> >   }
> >
> >   public static void createIndex() throws CorruptIndexException,
> >       LockObtainFailedException, IOException {
> >     SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
> >         StopAnalyzer.ENGLISH_STOP_WORDS);
> >     boolean recreateIndexIfExists = true;
> >     IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer,
> >         recreateIndexIfExists);
> >     File dir = new File(FILES_TO_INDEX_DIRECTORY);
> >     File[] files = dir.listFiles();
> >     for (File file : files) {
> >       Document document = new Document();
> >       // store the file path so each hit can be traced back to its source file
> >       String path = file.getCanonicalPath();
> >       document.add(new Field(FIELD_PATH, path, Field.Store.YES,
> >           Field.Index.UN_TOKENIZED, Field.TermVector.YES));
> >       // index (but do not store) the file contents
> >       Reader reader = new FileReader(file);
> >       document.add(new Field(FIELD_CONTENTS, reader));
> >       indexWriter.addDocument(document);
> >     }
> >     indexWriter.optimize();
> >     indexWriter.close();
> >   }
> >
> >   public static void searchIndex(String searchString)
> >       throws IOException, ParseException {
> >     System.out.println("Searching for '" + searchString + "'");
> >     Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
> >     IndexReader indexReader = IndexReader.open(directory);
> >     IndexSearcher indexSearcher = new IndexSearcher(indexReader);
> >     SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
> >         StopAnalyzer.ENGLISH_STOP_WORDS);
> >     QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
> >     Query query = queryParser.parse(searchString);
> >     Hits hits = indexSearcher.search(query);
> >     System.out.println("Number of hits: " + hits.length());
> >     TopDocs results = indexSearcher.search(query, 10);
> >     ScoreDoc[] hits1 = results.scoreDocs;
> >     for (ScoreDoc hit : hits1) {
> >       Document doc = indexSearcher.doc(hit.doc);
> >       //System.out.printf("%5.3f %s\n", hit.score, doc.get(FIELD_CONTENTS));
> >       System.out.println(hit.score);
> >       // per Ian's reply, the explanation can be printed with e.g.:
> >       //   System.out.println(indexSearcher.explain(query, hit.doc));
> >     }
> >   }
> > }

Re: Adding a new field to existing Index

2010-07-07 Thread Naveen Kumar
Hi Andrzej Bialecki

When you suggested -
"There are some other low-level ways to do this, but the easiest is to
  use a FilterIndexReader, especially since you just want to add a stored
  field - implement a subclass of FilterIndexReader that adds a new field
  in getFieldNames() and document(int). Then use
  IndexWriter.addIndexes(IndexReader[]) to create the output index."
I believe you assumed that all the existing fields are stored. I have a few
fields which are only indexed, not stored. Is there a way to add a new
field (stored, not indexed) to the documents in such an index, without
reindexing the whole index?
Any suggestions will be very helpful!

Thank you
Naveen Kumar

On Wed, Jun 30, 2010 at 12:34 PM, Andrzej Bialecki  wrote:

> On 2010-06-29 13:40, Naveen Kumar wrote:
> > Hey,
> >
> > I need to add a new field (a stored, not indexed field) for all
> > documents present in an existing large index. Reindexing the whole
> > index will be very costly. Is there a way to do this or any work
> > around?
>
> There are some other low-level ways to do this, but the easiest is to
> use a FilterIndexReader, especially since you just want to add a stored
> field - implement a subclass of FilterIndexReader that adds a new field
> in getFieldNames() and document(int). Then use
> IndexWriter.addIndexes(IndexReader[]) to create the output index.
>
> >
> > I would also like to know, if data or term vector, of a field
> > indexed without storing, can somehow be retrieved. This would enable
> > a work around solution to my problem.
>
> Not really, and the re-construction is very costly. Indexing is a lossy
> process, so not all content can be recovered. See the "Reconstruct &
> Edit" functionality in Luke (http://www.getopt.org/luke).
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>
>
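
A minimal sketch of the FilterIndexReader approach quoted above, assuming the
Lucene 2.9/3.0 API (the class name, field name, and constant value below are
illustrative, not from the thread; a real implementation would presumably
derive the stored value per document id). As the rest of the thread notes,
this only adds a new stored field; it cannot recover content from fields that
were indexed but not stored:

import java.io.IOException;
import java.util.Collection;
import java.util.HashSet;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;

// Wraps an existing reader and presents one extra stored (not indexed)
// field on every document.
public class AddStoredFieldReader extends FilterIndexReader {

  private final String name;
  private final String value;

  public AddStoredFieldReader(IndexReader in, String name, String value) {
    super(in);
    this.name = name;
    this.value = value;
  }

  @Override
  public Document document(int n, FieldSelector fieldSelector) throws IOException {
    Document doc = in.document(n, fieldSelector);
    doc.add(new Field(name, value, Field.Store.YES, Field.Index.NO));
    return doc;
  }

  @Override
  public Collection<String> getFieldNames(IndexReader.FieldOption fldOption) {
    // copy, since the delegate's collection may be unmodifiable
    Collection<String> names = new HashSet<String>(in.getFieldNames(fldOption));
    names.add(name);
    return names;
  }
}

// usage: merge through the filtered reader into a fresh output index, e.g.
//   writer.addIndexes(new IndexReader[] {
//       new AddStoredFieldReader(reader, "newField", "someValue") });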


Re: Adding a new field to existing Index

2010-07-07 Thread Andrzej Bialecki

On 2010-07-07 14:49, Naveen Kumar wrote:

Hi Andrzej Bialecki

When you suggested -
 "There are some other low-level ways to do this, but the easiest is to
   use a FilterIndexReader, especially since you just want to add a stored
   field - implement a subclass of FilterIndexReader that adds a new field
   in getFieldNames() and document(int). Then use
   IndexWriter.addIndexes(IndexReader[]) to create the output index."
I believe you assumed that all the existing fields are stored. I have a few
fields which are only indexed, not stored. Is there a way to add a new
field (stored, not indexed) to the documents in such an index, without
reindexing the whole index?
Any suggestions will be very helpful!


Unfortunately no - my previous advice still applies:


I would also like to know, if data or term vector, of a field
indexed without storing, can somehow be retrieved. This would enable
a work around solution to my problem.


Not really, and the re-construction is very costly. Indexing is a lossy
process, so not all content can be recovered. See the "Reconstruct &
Edit" functionality in Luke (http://www.getopt.org/luke).


At this point it will be less costly to reindex.

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: Adding a new field to existing Index

2010-07-07 Thread Naveen Kumar
Thanks for the quick reply!
I will go ahead and reindex all the data.

On Wed, Jul 7, 2010 at 6:27 PM, Andrzej Bialecki  wrote:

> On 2010-07-07 14:49, Naveen Kumar wrote:
>
>> Hi Andrzej Bialecki
>>
>> When you suggested -
>> "There are some other low-level ways to do this, but the easiest is to
>>   use a FilterIndexReader, especially since you just want to add a stored
>>   field - implement a subclass of FilterIndexReader that adds a new field
>>   in getFieldNames() and document(int). Then use
>>   IndexWriter.addIndexes(IndexReader[]) to create the output index."
>> I believe you assumed that all the existing fields are stored. I have a few
>> fields which are only indexed, not stored. Is there a way to add a new
>> field (stored, not indexed) to the documents in such an index, without
>> reindexing the whole index?
>> Any suggestions will be very helpful!
>>
>
> Unfortunately no - my previous advice still applies:
>
>
>>>> I would also like to know, if data or term vector, of a field
>>>> indexed without storing, can somehow be retrieved. This would enable
>>>> a work around solution to my problem.
>>>
>>> Not really, and the re-construction is very costly. Indexing is a lossy
>>> process, so not all content can be recovered. See the "Reconstruct &
>>> Edit" functionality in Luke (http://www.getopt.org/luke).
>>
> At this point it will be less costly to reindex.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>
>


Re: Issue Lucene-2421 and NativeFSLockFactory.clearLock behaviour?

2010-07-07 Thread Shai Erera
Yes, looks like clearLock should be changed to not throw the exception, but
rather make a best effort: call delete() but ignore its return value. I'll
change that on 3x; I'm not sure if a backport to 3.0.x is needed (doesn't seem
to justify a 3.0.3 ...)

Shai

On Wed, Jul 7, 2010 at 8:59 AM, Ted McFadden  wrote:

> Hi,
>
> For Lucene 3.0.2, issue LUCENE-2421 (
> https://issues.apache.org/jira/browse/LUCENE-2421) changed
> NativeFSLock.release to not raise an exception if a write.lock file could
> not be deleted, since the presence of the file itself does not mean a lock
> is held.
>
> Should NativeFSLockFactory.clearLock also be changed to not raise an
> exception if it can't delete the write.lock file? The comments in the
> clearLock method seem to suggest the method is really no longer necessary,
> but IndexWriter.init can still call it.
>
> If the write.lock file is prevented from deletion by antivirus software or
> something similar, as described in LUCENE-2421, IndexWriter construction
> looks like it can fail unnecessarily for the same reason:
>
> IndexWriter
>   IndexWriter.init
>  Directory.clearLock
> NativeFSLockFactory.clearLock:
>...
>if (lockFile.exists() && !lockFile.delete()){
>   throw new IOException("Cannot delete " + lockFile);
>}
>
>
> We have seen this exception path once in the wild (on a Windows box).
>
> I can work around this with a custom LockFactory but thought I should check
> if I'm reading the code right.
>
> Cheers,
>
> Ted
>
>
> --
> Ted McFadden
> Chief Engineer
>
> Leximancer Pty Ltd
> Queensland, Australia
> http://www.leximancer.com
>


Re: Issue Lucene-2421 and NativeFSLockFactory.clearLock behaviour?

2010-07-07 Thread Shai Erera
Double-checking the code, this isn't that simple :). Someone can call
clearLock while the lock is held (for some unknown reason), in which case we
do want to signal failure. The clearLock jdoc specifies that it forcefully
unlocks and removes the lock ...

Currently, the method does not unlock anything - just attempts to remove the
lock. If the lock is still held, it will fail w/ the exception ... so there
are two cases:
1) You call clearLock w/o calling IndexWriter.unlock() first, and the lock
is held by another process --> here you wouldn't want the method to silently
fail, because a subsequent attempt to lock the Directory would fail, which
would be confusing.
2) The lock is not held, e.g. because you called IW.unlock(), but an external
process holds the lock file, preventing its delete() --> here you wouldn't
care if the method silently fails ...

I guess what we should do is try to forcefully unlock it first, and if that
succeeds then delete the lock file, ignoring the return value. Or change
the javadocs.
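
As a rough sketch only, that idea might look something like the following
inside NativeFSLockFactory (this assumes the 3.0.x internals such as lockDir
and makeLock(), omits the lockPrefix handling, and is not an actual patch):

public void clearLock(String lockName) throws IOException {
  Lock lock = makeLock(lockName);
  // if we cannot obtain the lock ourselves, another process really holds
  // it, so clearing should still fail loudly (case 1 above)
  if (!lock.obtain()) {
    throw new IOException("Cannot clear lock: it is held by another process");
  }
  try {
    lock.release(); // give up the native lock we just acquired
  } finally {
    // best effort: an unheld but undeletable file is harmless (case 2 above)
    new File(lockDir, lockName).delete();
  }
}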

I'll check it

Shai

On Wed, Jul 7, 2010 at 7:28 PM, Shai Erera  wrote:

> Yes, looks like clearLock should be changed to not throw the exception, but
> rather make a best effort: call delete() but ignore its return value. I'll
> change that on 3x; I'm not sure if a backport to 3.0.x is needed (doesn't
> seem to justify a 3.0.3 ...)
>
> Shai
>
>
> On Wed, Jul 7, 2010 at 8:59 AM, Ted McFadden  wrote:
>
>> Hi,
>>
>> For Lucene 3.0.2, issue LUCENE-2421 (
>> https://issues.apache.org/jira/browse/LUCENE-2421) changed
>> NativeFSLock.release to not raise an exception if a write.lock file could
>> not be deleted, since the presence of the file itself does not mean a lock
>> is held.
>>
>> Should NativeFSLockFactory.clearLock also be changed to not raise an
>> exception if it can't delete the write.lock file? The comments in the
>> clearLock method seem to suggest the method is really no longer necessary,
>> but IndexWriter.init can still call it.
>>
>> If the write.lock file is prevented from deletion by antivirus software or
>> something similar, as described in LUCENE-2421, IndexWriter construction
>> looks like it can fail unnecessarily for the same reason:
>>
>> IndexWriter
>>   IndexWriter.init
>>  Directory.clearLock
>> NativeFSLockFactory.clearLock:
>>...
>>if (lockFile.exists() && !lockFile.delete()){
>>   throw new IOException("Cannot delete " + lockFile);
>>}
>>
>>
>> We have seen this exception path once in the wild (on a Windows box).
>>
>> I can work around this with a custom LockFactory but thought I should check
>> if I'm reading the code right.
>>
>> Cheers,
>>
>> Ted
>>
>>
>> --
>> Ted McFadden
>> Chief Engineer
>>
>> Leximancer Pty Ltd
>> Queensland, Australia
>> http://www.leximancer.com
>>
>
>


Why not normalization?

2010-07-07 Thread manjula wijewickrema
Hi,

In my application, I index only one file and enter only a single-term query
to check the Lucene score. I used the explain() method to see how the results
are obtained, and the system gave me the result as the product of tf, idf,
and fieldNorm.

1) Although Lucene uses tf to calculate the score, it seems to me that the
term frequency has not been normalized. Even if I index several documents, it
does not normalize the tf value. Since the total number of words varies across
the indexed documents, couldn't this introduce a fault in Lucene's scoring?

2) What is the formula to calculate this fieldNorm value?

Could somebody pls. help me?

Thanks in advance,
Manjula.


Re: Why not normalization?

2010-07-07 Thread Rebecca Watson
hi,

> 1) Although Lucene uses tf to calculate the score, it seems to me that the
> term frequency has not been normalized. Even if I index several documents, it
> does not normalize the tf value. Since the total number of words varies across
> the indexed documents, couldn't this introduce a fault in Lucene's scoring?

tf = term frequency, i.e. the number of times the term appears in the document,
while idf is inverse document frequency, a measure of how rare a term is, i.e.
related to how many documents the term appears in.

if term1 occurs more frequently in a document, i.e. tf is higher, you want
to weight the document higher when you search for term1.

but if term1 is a very frequent term, i.e. it appears in lots of documents,
then it's probably not as important to an overall search (where we have
term1, term2 etc), so you want to downweight it (this is where idf comes in).

then the normalisations like length normalisation (which allow for 'fair'
scoring across varied field lengths) come in too.

the tf-idf scoring formula used by lucene is a scoring method that's been
around a long, long time... there are competing scoring metrics, but that's
an IR thing and not an argument you want to start on the lucene lists! :)

these are IR ('information retrieval') concepts, and you might want to start
by going through the tf-idf scoring / some explanations for this kind of
scoring:

http://en.wikipedia.org/wiki/Tf%E2%80%93idf
http://wiki.apache.org/lucene-java/InformationRetrieval


> 2) What is the formula to calculate this fieldNorm value?

in terms of how lucene implements its tf-idf scoring - you can see here:
http://lucene.apache.org/java/3_0_2/scoring.html
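
as a minimal sketch (assuming DefaultSimilarity, and a made-up field length
of 16 just for illustration; the class name here is mine), the pieces look
like this:

import org.apache.lucene.search.DefaultSimilarity;

public class NormDemo {
  public static void main(String[] args) {
    DefaultSimilarity sim = new DefaultSimilarity();
    // tf(freq) = sqrt(freq)
    System.out.println(sim.tf(3));                       // 1.7320508
    // idf(docFreq, numDocs) = 1 + ln(numDocs / (docFreq + 1))
    System.out.println(sim.idf(1, 1));                   // 0.30685282
    // lengthNorm(field, numTerms) = 1 / sqrt(numTerms); this gets baked
    // into a single byte at index time, so the fieldNorm you see in
    // explain() is a lossy approximation
    System.out.println(sim.lengthNorm("contents", 16));  // 0.25
  }
}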

also, the lucene in action book is a really good book if you are starting out
with lucene (and will save you a lot of grief with understanding lucene /
setting up your application!). it covers all the basics and then moves on to
more advanced stuff, and has lots of code examples too:
http://www.manning.com/hatcher2/

hope that helps,

bec :)
