using previous results on a new search

2004-06-04 Thread Antoine Brun
Hello,
 
I am new to Lucene and, more generally, to search engines. As my company has decided to 
base its new software on Lucene, I have a first question about Lucene's querying 
functionality.
 
We are investigating the possibility of feeding previous search results into a new query. 
 
Does anyone know whether this is possible, or whether such a feature is under development?
 
Thanks
 
Antoine Brun



score and frequency

2004-06-04 Thread Niraj Alok
Hi,

I am having a problem with Lucene's scoring. 
I display the results ordered by hits.score, and the ordering comes out as 
expected. 
However, I do not want the frequency factor to be used in computing the 
score. 

Is it possible to get a score that does not include the frequency factor? 


Regards,
Niraj 


Re: using previous results on a new search

2004-06-04 Thread Otis Gospodnetic
> p.s. This ought to go on the wiki :)

It's now included in a Lucene FAQ.

Otis


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: using previous results on a new search

2004-06-04 Thread Erik Hatcher
On Jun 4, 2004, at 3:07 AM, Antoine Brun wrote:
> We are investigating the possibility to insert previous search results
> to a new query.
>
> Does anyone know if it is possible or if such an evolution is under
> development

I suppose you mean search within search, so that the second search is 
constrained by the results of the first query.  If so, there are two 
primary options:
  - Use QueryFilter with the previous query as the filter (search the 
archives for QueryFilter and Doug's recommendations against using it 
for this purpose)

  - Combine the previous query with the current query using 
BooleanQuery, using the previous query as required.

BooleanQuery is the recommended approach.

Erik
p.s. This ought to go on the wiki :)
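
The second option boils down to keeping only the documents that match both the earlier query and the new one. The sketch below illustrates just that logic with plain Java sets of document ids. It does not use any Lucene classes, and the names `SearchWithinSearch` and `refine` are made up for illustration; in Lucene the same effect would come from a BooleanQuery in which the previous query is a required clause, as described above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model of "search within search"; this is not Lucene code.
class SearchWithinSearch {

    /** Keeps only the documents that match both the previous and the current query. */
    static Set<Integer> refine(Set<Integer> previousMatches, Set<Integer> currentMatches) {
        Set<Integer> refined = new LinkedHashSet<>(currentMatches);
        refined.retainAll(previousMatches); // drop anything the first search did not return
        return refined;
    }

    public static void main(String[] args) {
        // Hypothetical document ids returned by the two searches.
        Set<Integer> previous = new LinkedHashSet<>(Arrays.asList(1, 2, 3, 5, 8));
        Set<Integer> current = new LinkedHashSet<>(Arrays.asList(2, 4, 5, 9));
        System.out.println(refine(previous, current));
    }
}
```

Documents that match only the new query are dropped, which is the "constrained by the first query" behaviour asked about.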


Re: score and frequency

2004-06-04 Thread Niraj Alok
Hi Erik,

Thanks for the suggestion.

I tried this:
public class RelevanceSimilarity extends DefaultSimilarity
{
    public float tf(float freq) {
        System.out.println("discounting frequency");
        return (float)1;
    }
}

and in my query class, I used :

Similarity.setDefault(similarity);
Hits hits = is.search(query);
for(i = 0; i < hits.length(); i ++)
  result = result + hits.score(i);



However, this is still not giving me the expected result. Do I need to do
something else?


Regards,
Niraj

- Original Message -
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, June 04, 2004 1:55 PM
Subject: Re: score and frequency


> On Jun 4, 2004, at 2:52 AM, Niraj Alok wrote:
>> Hi,
>>
>> I am having some problems with the score of lucene.
>> I am trying to get the results displayed according to hits.score and
>> it is giving the results correctly.
>> However I do not want the frequency factor to be used for the
>> computation of the score.
>>
>> Is it possible to get the score which does not have the frequency
>> factor in it ?
>
> Have a look at the javadocs for Similarity.  DefaultSimilarity is used
> unless otherwise specified.  You could subclass that and override this:
>
>     public float tf(float freq) {
>       return (float)Math.sqrt(freq);
>     }
>
> and return 1.0.  This might give you the effect you want.
>
> Erik
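
To make the difference concrete, here is a small standalone comparison of the square-root tf quoted above with a constant tf. It is only a sketch: the class `TfComparison` and its method names are hypothetical and nothing here extends Lucene's Similarity. It simply shows how each function treats a term that occurs 1, 4, or 9 times in a matching document.

```java
// Standalone illustration; these are not Lucene classes.
class TfComparison {

    /** Square-root damping, as in the tf method quoted above. */
    static float sqrtTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Constant tf: any number of occurrences counts the same as one. */
    static float flatTf(float freq) {
        return 1.0f;
    }

    public static void main(String[] args) {
        float[] frequencies = {1f, 4f, 9f};
        for (float freq : frequencies) {
            System.out.println("freq=" + freq
                    + " sqrtTf=" + sqrtTf(freq)
                    + " flatTf=" + flatTf(freq));
        }
    }
}
```

With the constant version, repeated occurrences no longer raise a document's score; the other scoring factors are left untouched.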





Author of SearchBean

2004-06-04 Thread lucene
Hi!

Where can I find the e-mail address of the author of SearchBean (sandbox)?

Timo




Re: Author of SearchBean

2004-06-04 Thread Erik Hatcher
SearchBean should be discussed on this list - no need to contact the 
original developer directly (in fact, it's a better practice to discuss 
open source code in the appropriate public forums).

Erik

On Jun 4, 2004, at 5:56 AM, [EMAIL PROTECTED] wrote:
> Hi!
>
> Where can I get the mail address of the author of SearchBean (sandbox)
> from?
>
> Timo


why the score is not always 1.0 when comparing two identical strings?

2004-06-04 Thread uddam chukmol
hi,
 
I'm not convinced by the way Lucene computes the score. 
 
I tried to compare two strings using a program. The program indexes the first 
string as if it were a document, then uses QueryParser, with the same analyzer that 
was used for indexing, to analyze the second string and form a query from 
it. 

I ran the program for the first time with the following as the first string: 
This is the text to index with Lucene CREATE TABLE Elements ( TYPELEMENT varchar 
(255) NULL , CLEELEMENT varchar (255) NULL , LIBELEM varchar (255) NULL , CODENTITE 
varchar (255) NULL , CLEENTITE varchar (255) NULL , DONNNEEA1 varchar (255) NULL , 
DONNEEB1 varchar (255) NULL , DONNEEA2 varchar (255) NULL , DONNEEB2 varchar (255) 
NULL , DONNEEA3 varchar (255) NULL , DONNEEB3 varchar (255) NULL , DONNEEA4 varchar 
(255) NULL , DONNEEB4 varchar (255) NULL , DONNEEA5 varchar (255) NULL , DONNEEB5 
varchar (255) NULL , TOP1 varchar (255) NULL , TOP2 varchar (255) NULL , TOP3 varchar 
(255) NULL , TOP4 varchar (255) NULL , TOP5 varchar (255) NULL , QTE1 varchar (255) 
NULL , QTE2 varchar (255) NULL , QTE3 varchar (255) NULL , MONTANT1 varchar (255) NULL 
, MONTANT2 varchar (255) NULL , MONTANT3 varchar (255) NULL , DATE1 varchar (255) NULL 
, DATE2 varchar (255) NULL , DATE3 varchar (255) NULL , STATUT varchar (255) NULL , 
DATPRISENCPTSTAT varchar (255) NULL ).
 
I used the same string to form my query, and the final score for these two 
strings was 1.0.
 
Then something surprised me when I changed the two strings to "All work and no play 
makes Jack a dull boy" and compared them, using one as the document and the other to form 
the query. The result was not 1.0; it was 0.3033.. instead. 
 
I use Eclipse as my Java editor. Could it conflict with Lucene?
 
Any ideas or suggestions about what went wrong here?
 
Uddam



Re: why the score is not always 1.0 when comparing two identical strings?

2004-06-04 Thread Brisbart Franck
You're not the first one to ask this question.
I suggest you look through the mailing list archive and search 
for the messages titled 'Lucene Scoring Behavior'.
Here is the link:
http://issues.apache.org/eyebrowse/SearchList?listId=&listName=lucene-user%40jakarta.apache.org&searchText=%22lucene+scoring+behavior%22&defaultField=subject&Search=Search

Cheers,
Franck
--
Franck Brisbart
R&D
http://www.kelkoo.com
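
A plausible reading of the two results, offered as an assumption because nothing in this thread confirms it: the Hits class in Lucene releases of that period rescaled scores only when the best raw score was greater than 1.0. A long document full of repeated terms can push the raw score above 1.0, so it is shown as exactly 1.0, while a short sentence with a raw score below 1.0 is shown unchanged. The sketch below models only that rescaling rule; the class name `ScoreNormalization` and the raw scores are invented for illustration.

```java
// Models the "normalize only if the top score exceeds 1.0" rule; not Lucene code.
class ScoreNormalization {

    /** Expects raw scores sorted from best to worst. */
    static float[] normalize(float[] rawScoresDescending) {
        float norm = 1.0f;
        if (rawScoresDescending.length > 0 && rawScoresDescending[0] > 1.0f) {
            norm = 1.0f / rawScoresDescending[0]; // scale so the best hit becomes 1.0
        }
        float[] normalized = new float[rawScoresDescending.length];
        for (int i = 0; i < rawScoresDescending.length; i++) {
            normalized[i] = rawScoresDescending[i] * norm;
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Hypothetical raw score above 1.0: reported as 1.0 after rescaling.
        System.out.println(normalize(new float[] {4.0f})[0]);
        // Raw score below 1.0: reported as-is.
        System.out.println(normalize(new float[] {0.3033f})[0]);
    }
}
```

Under this reading, a displayed score is a relative ranking value rather than a measure of how identical the query and the document are.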


Re: score and frequency

2004-06-04 Thread Phil brunet
Hi to all.
Maybe the term frequency is not the only parameter you need to override to 
customize the score attributed by Lucene.

Maybe you should consider the normalisation factor, the idf and the coord 
factor ?

Philippe

> From: Niraj Alok [EMAIL PROTECTED]
> Reply-To: Lucene Users List [EMAIL PROTECTED]
> To: Lucene Users List [EMAIL PROTECTED]
> Subject: Re: score and frequency
> Date: Fri, 4 Jun 2004 15:13:32 +0530
>
> Hi Erik,
>
> Thanks for the suggestion.
>
> I tried this:
> public class RelevanceSimilarity extends DefaultSimilarity
> {
>     public float tf(float freq) {
>         System.out.println("discounting frequency");
>         return (float)1;
>     }
> }
>
> and in my query class, I used :
> Similarity.setDefault(similarity);
> Hits hits = is.search(query);
> for(i = 0; i < hits.length(); i ++)
>   result = result + hits.score(i);
>
> However, this is still not giving me the expected result. Do I need to do
> something else?
>
> Regards,
> Niraj


Re: score and frequency

2004-06-04 Thread Brisbart Franck
Hi,
Be careful to call 'Similarity.setDefault(similarity)' before creating 
your searcher instance (IndexSearcher).
If you change the default similarity afterwards, the existing searcher will 
still use the old one.
A better option is to call the 'searcher.setSimilarity' method on your searcher.

Franck
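
The ordering issue can be pictured with a toy model in which an object copies a process-wide default at construction time. Everything in the sketch is hypothetical (`DefaultCaptureDemo`, `ToySearcher`, and the string values stand in for the real classes); it is not Lucene's implementation, only an illustration of why changing the default later does not reach a searcher that already exists.

```java
// Toy model of "default captured at construction"; not Lucene code.
class DefaultCaptureDemo {

    // Stand-in for a process-wide default setting.
    static String defaultSimilarity = "default";

    static class ToySearcher {
        // Copied once, when the searcher is constructed.
        private String similarity = defaultSimilarity;

        void setSimilarity(String similarity) {
            this.similarity = similarity;
        }

        String getSimilarity() {
            return similarity;
        }
    }

    public static void main(String[] args) {
        ToySearcher early = new ToySearcher();   // created before the default changes
        defaultSimilarity = "custom";
        ToySearcher late = new ToySearcher();    // created after the default changes

        System.out.println(early.getSimilarity()); // still the old default
        System.out.println(late.getSimilarity());  // picks up the new default

        early.setSimilarity("custom");             // per-instance setter always works
        System.out.println(early.getSimilarity());
    }
}
```

Setting the value on the instance avoids any dependence on the order in which objects are created.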

Phil brunet wrote:
> Hi to all.
>
> Maybe the term frequency is not the only parameter you need to override
> to customize the score attributed by Lucene.
>
> Maybe you should consider the normalisation factor, the idf and the
> coord factor ?
>
> Philippe

--
Franck Brisbart
R&D
http://www.kelkoo.com


Re: Distributed searches and RAM Dir

2004-06-04 Thread markharw00d
> Look up Mark Harwood and Lucene. ..provided some nice sequential
> UML diagrams with notes

Those notes went missing recently when the ISP canned my free account.

I've resurrected them at my new site here:
http://www.inperspective.com/lucene/distrib/index.htm

Cheers
Mark




Re: problems with lucene in multithreaded environment

2004-06-04 Thread Doug Cutting
Jayant Kumar wrote:
> Please find enclosed jvmdump.txt which contains a dump
> of our search program after about 20 seconds of
> starting the program.
>
> Also enclosed is the file queries.txt which contains
> few sample search queries.

Thanks for the data.  This is exactly what I was looking for.

> Thread-14 prio=1 tid=0x080a7420 nid=0x468e waiting for monitor entry [4d61a000..4d61ac18]
>   at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:112)
>   - waiting to lock 0x44c95228 (a org.apache.lucene.index.TermInfosReader)
>
> Thread-12 prio=1 tid=0x080a58e0 nid=0x468e waiting for monitor entry [4d51a000..4d51ad18]
>   at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:112)
>   - waiting to lock 0x44c95228 (a org.apache.lucene.index.TermInfosReader)

These are all stuck looking terms up in the dictionary (TermInfos). 
Things would be much faster if your queries didn't have so many terms.

> Query : ( ( ( ( ( FIELD1: proof OR FIELD2: proof OR FIELD3: proof OR FIELD4: proof OR FIELD5: proof OR FIELD6: proof OR FIELD7: proof )
> AND ( FIELD1: george bush OR FIELD2: george bush OR FIELD3: george bush OR FIELD4: george bush OR FIELD5: george bush OR FIELD6: george bush OR FIELD7: george bush ) )
> AND ( FIELD1: script OR FIELD2: script OR FIELD3: script OR FIELD4: script OR FIELD5: script OR FIELD6: script OR FIELD7: script ) )
> AND ( ( FIELD1: san OR FIELD2: san OR FIELD3: san OR FIELD4: san OR FIELD5: san OR FIELD6: san OR FIELD7: san )
> OR ( ( FIELD1: war OR FIELD2: war OR FIELD3: war OR FIELD4: war OR FIELD5: war OR FIELD6: war OR FIELD7: war )
> OR ( ( FIELD1: gulf OR FIELD2: gulf OR FIELD3: gulf OR FIELD4: gulf OR FIELD5: gulf OR FIELD6: gulf OR FIELD7: gulf )
> OR ( ( FIELD1: laden OR FIELD2: laden OR FIELD3: laden OR FIELD4: laden OR FIELD5: laden OR FIELD6: laden OR FIELD7: laden )
> OR ( ( FIELD1: ttouristeat OR FIELD2: ttouristeat OR FIELD3: ttouristeat OR FIELD4: ttouristeat OR FIELD5: ttouristeat OR FIELD6: ttouristeat OR FIELD7: ttouristeat )
> OR ( ( FIELD1: pow OR FIELD2: pow OR FIELD3: pow OR FIELD4: pow OR FIELD5: pow OR FIELD6: pow OR FIELD7: pow )
> OR ( FIELD1: bin OR FIELD2: bin OR FIELD3: bin OR FIELD4: bin OR FIELD5: bin OR FIELD6: bin OR FIELD7: bin ) ) ) ) ) ) ) ) )
> AND RANGE: ([ 0800 TO 1100 ]) AND ( S_IDa: (7 OR 8 OR 9 OR 10 OR 11 OR 12 OR 13 OR 14 OR 15 OR 16 OR 17 ) or S_IDb: (2 ) )

All your queries look for terms in fields 1-7.  If you instead combined 
the contents of fields 1-7 in a single field, and searched that field, 
then your searches would contain far fewer terms and be much faster.
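
A rough way to see the saving: the sample query repeats each of its ten search terms (counting "george bush" as one) across seven fields. The sketch below does that arithmetic and shows one simple way the field contents could be concatenated into a single catch-all value at indexing time. The class `CatchAllField`, its methods, and the sample field values are hypothetical, and no Lucene API is used.

```java
import java.util.Arrays;
import java.util.List;

// Back-of-the-envelope model of the "one combined field" suggestion; not Lucene code.
class CatchAllField {

    /** Number of term clauses when every term is searched in every field. */
    static int clauseCount(int terms, int fields) {
        return terms * fields;
    }

    /** Joins the per-field values into one string to index as a single field. */
    static String combine(List<String> fieldValues) {
        return String.join(" ", fieldValues);
    }

    public static void main(String[] args) {
        System.out.println("seven fields: " + clauseCount(10, 7) + " term clauses");
        System.out.println("one field:    " + clauseCount(10, 1) + " term clauses");
        // Made-up field contents for a single document.
        System.out.println(combine(Arrays.asList("gulf war", "script proof", "san")));
    }
}
```

Each term clause requires its own dictionary lookup, so cutting the clause count by a factor of seven also cuts the number of lookups competing for the same lock.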

Also, I don't know how many terms your RANGE queries match, but that 
could also be introducing large numbers of terms which would slow things 
down too.

But, still, you have identified a bottleneck: TermInfosReader caches a 
TermEnum and hence access to it must be synchronized.  Caching the enum 
greatly speeds sequential access to terms, e.g., when merging, 
performing range or prefix queries, etc.  Perhaps however the cache 
should be done through a ThreadLocal, giving each thread its own cache 
and obviating the need for synchronization...
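
The idea in the paragraph above can be sketched in isolation: rather than one shared, stateful cursor guarded by a lock, each thread lazily creates and reuses its own. The class `PerThreadCache`, its nested `Cursor`, and the helper `getFromOtherThread` are hypothetical stand-ins written for this illustration; this is not the Lucene code itself.

```java
// Minimal per-thread cache sketch; not Lucene code.
class PerThreadCache {

    /** Stand-in for an expensive, stateful enumerator. */
    static class Cursor {
        int position;
    }

    private final ThreadLocal<Cursor> cursors = new ThreadLocal<>();

    /** Returns the calling thread's own cursor, creating it on first use. */
    Cursor get() {
        Cursor cursor = cursors.get();
        if (cursor == null) {
            cursor = new Cursor();
            cursors.set(cursor);
        }
        return cursor;
    }

    /** Demo helper: fetches the cursor seen by a freshly started thread. */
    Cursor getFromOtherThread() {
        Cursor[] holder = new Cursor[1];
        Thread worker = new Thread(() -> holder[0] = get());
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return holder[0];
    }

    public static void main(String[] args) {
        PerThreadCache cache = new PerThreadCache();
        Cursor mine = cache.get();
        System.out.println("same thread reuses cursor: " + (mine == cache.get()));
        System.out.println("other thread gets its own: " + (mine != cache.getFromOtherThread()));
    }
}
```

Because no cursor is ever shared between threads, `get()` needs no `synchronized` keyword; the trade-off is one cursor per thread instead of one in total.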

Please tell me if you are able to simplify your queries and if that 
speeds things.  I'll look into a ThreadLocal-based solution too.

Doug


Re: Writing a stemmer

2004-06-04 Thread Andrzej Bialecki
Leo Galambos wrote:
> Erik Hatcher [EMAIL PROTECTED] wrote:
>>> How proficient must I be in a language for which I wish to write the
>>> stemmer?
>>
>> I would venture to say you would need to be an expert in a language to
>> write a decent stemmer.
>
> I'm sorry for a self-promo ;), but
> the stemmer of egothor project can be
> adapted to any language, and you needn't be
> a language expert. Moreover, the stemmer
> achieves better F-measure than Porter's stemmers.

No reason to be too modest, Leo.. I tested your stemmer on English, 
Swedish and Polish texts (including F-measure vs. training set size 
plots), and it works exceptionally well indeed. Highly recommended!

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


Re: problems with lucene in multithreaded environment

2004-06-04 Thread Doug Cutting
Doug Cutting wrote:
> Please tell me if you are able to simplify your queries and if that
> speeds things.  I'll look into a ThreadLocal-based solution too.

I've attached a patch that should help with the thread contention, 
although I've not tested it extensively.

I still don't fully understand why your searches are so slow, though. 
Are the indexes stored on the local disk of the machine?  Indexes 
accessed over the network can be very slow.

Anyway, give this patch a try.  Also, if anyone else can try this and 
report back whether it makes multi-threaded searching faster, or 
anything else slower, or is buggy, that would be great.

Thanks,
Doug
Index: src/java/org/apache/lucene/index/TermInfosReader.java
===
RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/index/TermInfosReader.java,v
retrieving revision 1.6
diff -u -u -r1.6 TermInfosReader.java
--- src/java/org/apache/lucene/index/TermInfosReader.java	20 May 2004 11:23:53 -	1.6
+++ src/java/org/apache/lucene/index/TermInfosReader.java	4 Jun 2004 21:45:15 -
@@ -29,7 +29,8 @@
   private String segment;
   private FieldInfos fieldInfos;
 
-  private SegmentTermEnum enumerator;
+  private ThreadLocal enumerators = new ThreadLocal();
+  private SegmentTermEnum origEnum;
   private long size;
 
   TermInfosReader(Directory dir, String seg, FieldInfos fis)
@@ -38,19 +39,19 @@
 segment = seg;
 fieldInfos = fis;
 
-enumerator = new SegmentTermEnum(directory.openFile(segment + ".tis"),
-			   fieldInfos, false);
-size = enumerator.size;
+origEnum = new SegmentTermEnum(directory.openFile(segment + ".tis"),
+   fieldInfos, false);
+size = origEnum.size;
 readIndex();
   }
 
   public int getSkipInterval() {
-return enumerator.skipInterval;
+return origEnum.skipInterval;
   }
 
   final void close() throws IOException {
-if (enumerator != null)
-  enumerator.close();
+if (origEnum != null)
+  origEnum.close();
   }
 
   /** Returns the number of term/value pairs in the set. */
@@ -58,6 +59,15 @@
 return size;
   }
 
+  private SegmentTermEnum getEnum() {
+SegmentTermEnum enum = (SegmentTermEnum)enumerators.get();
+if (enum == null) {
+  enum = terms();
+  enumerators.set(enum);
+}
+return enum;
+  }
+
   Term[] indexTerms = null;
   TermInfo[] indexInfos;
   long[] indexPointers;
@@ -102,16 +112,17 @@
   }
 
   private final void seekEnum(int indexOffset) throws IOException {
-enumerator.seek(indexPointers[indexOffset],
-	  (indexOffset * enumerator.indexInterval) - 1,
+getEnum().seek(indexPointers[indexOffset],
+	  (indexOffset * getEnum().indexInterval) - 1,
 	  indexTerms[indexOffset], indexInfos[indexOffset]);
   }
 
   /** Returns the TermInfo for a Term in the set, or null. */
-  final synchronized TermInfo get(Term term) throws IOException {
+  TermInfo get(Term term) throws IOException {
 if (size == 0) return null;
 
-// optimize sequential access: first try scanning cached enumerator w/o seeking
+// optimize sequential access: first try scanning cached enum w/o seeking
+SegmentTermEnum enumerator = getEnum();
 if (enumerator.term() != null // term is at or past current
 	&& ((enumerator.prev != null && term.compareTo(enumerator.prev) > 0)
 	|| term.compareTo(enumerator.term()) >= 0)) {
@@ -128,6 +139,7 @@
 
   /** Scans within block for matching term. */
   private final TermInfo scanEnum(Term term) throws IOException {
+SegmentTermEnum enumerator = getEnum();
 while (term.compareTo(enumerator.term()) > 0 && enumerator.next()) {}
 if (enumerator.term() != null && term.compareTo(enumerator.term()) == 0)
   return enumerator.termInfo();
@@ -136,10 +148,12 @@
   }
 
   /** Returns the nth term in the set. */
-  final synchronized Term get(int position) throws IOException {
+  final Term get(int position) throws IOException {
 if (size == 0) return null;
 
-if (enumerator != null && enumerator.term() != null && position >= enumerator.position &&
+SegmentTermEnum enumerator = getEnum();
+if (enumerator != null && enumerator.term() != null &&
+position >= enumerator.position &&
 	position < (enumerator.position + enumerator.indexInterval))
   return scanEnum(position);		  // can avoid seek
 
@@ -148,6 +162,7 @@
   }
 
   private final Term scanEnum(int position) throws IOException {
+SegmentTermEnum enumerator = getEnum();
 while(enumerator.position < position)
   if (!enumerator.next())
 	return null;
@@ -156,12 +171,13 @@
   }
 
   /** Returns the position of a Term in the set or -1. */
-  final synchronized long getPosition(Term term) throws IOException {
+  final long getPosition(Term term) throws IOException {
 if (size == 0) return -1;
 
 int indexOffset = getIndexOffset(term);
 seekEnum(indexOffset);
 
+SegmentTermEnum enumerator = getEnum();
 

RE: Writing a stemmer

2004-06-04 Thread Musku, Anil (LA)
Leo

Thanks for your reply. I have taken a look at egothor.org. It does appear to
be pretty simple. However, I need to use Lucene as my search engine.

From what I understand, it appears that I need to be pretty conversant (if
not an expert) with a language for which I wish to write a stemmer. Moreover,
can this stemmer be used only with the egothor search engine, or can I use it
with Lucene? If so, how?

Regards,
Anil

-Original Message-
From: Leo Galambos [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 03, 2004 8:54 PM
To: Lucene Users List
Subject: Re: Writing a stemmer

Erik Hatcher [EMAIL PROTECTED] wrote:
>> How proficient must I be in a language for which I wish to write the
>> stemmer?
>
> I would venture to say you would need to be an expert in a language to
> write a decent stemmer.

I'm sorry for a self-promo ;), but
the stemmer of egothor project can be
adapted to any language, and you needn't be
a language expert. Moreover, the stemmer
achieves better F-measure than Porter's stemmers.

Cheers,
Leo






Re: score and frequency

2004-06-04 Thread Niraj Alok
I have used searcher.setSimilarity and have also tried setting the
coord factor to 1.

To illustrate the problem with an example: say I display titles
according to the search. If I search for "ice hockey" with the default
similarity, my results are:

0.9994        ice hockey
0.75          ice hockey
0.75          ice hockey
0.17402513    winter Olympics: hockey, ice, medallists
0.073680125   ice age
0.020266924   National Hockey League
0.018420031   Cracking the Ice Age
0.011512519   ground-ice
0.0069075115  ice hockey: British Sekonda Superleague Play-Off Championship: finals
 (the numbers indicating the score).


But if I set the similarity to my overridden one, the results become:
0.9994        ice hockey
0.75          ice hockey
0.75          ice hockey
0.22104037    ice age
0.17402513    winter Olympics: hockey, ice, medallists
0.060800765   National Hockey League
0.055260092   Cracking the Ice Age
0.034537554   ground-ice
0.020722535   ice hockey: British Sekonda Superleague Play-Off Championship: finals


I want all the titles that contain both "ice" and "hockey" to come above
the rest (to have higher scores).
That is, I would like the results to appear like this:

ice hockey
ice hockey
ice hockey
winter Olympics: hockey, ice, medallists
ice hockey: British Sekonda Superleague Play-Off Championship: finals
ice age
National Hockey League
Cracking the Ice Age
ground-ice

My overridden similarity class contains just this method:
public float coord(int overlap, int maxOverlap) {
    return 1.0f;
}





I feel it is the weight factor which is producing undesirable results. Any
help in this regard would be highly appreciated.

Regards,
Niraj
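
One point worth checking, stated here as an assumption about DefaultSimilarity rather than something confirmed in this thread: the stock coord factor is the fraction of query terms a document matches (overlap / maxOverlap), so overriding it to return 1.0 removes the bonus for matching both terms instead of strengthening it. The sketch below shows the desired ordering in plain Java by sorting on the number of matched terms first and the score second. The class `OverlapFirstRanking`, its `Hit` type, and the matched-term counts are hypothetical; the titles and scores are taken from the second result list above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative ranking by matched-term count, then score; not Lucene code.
class OverlapFirstRanking {

    static class Hit {
        final String title;
        final int matchedTerms; // how many of the query terms the title contains
        final float score;

        Hit(String title, int matchedTerms, float score) {
            this.title = title;
            this.matchedTerms = matchedTerms;
            this.score = score;
        }
    }

    /** Fraction of query terms matched, the shape a default-style coord factor takes. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Orders hits by matched-term count (descending), then by score (descending). */
    static List<Hit> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> {
            if (a.matchedTerms != b.matchedTerms) {
                return Integer.compare(b.matchedTerms, a.matchedTerms);
            }
            return Float.compare(b.score, a.score);
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("ice age", 1, 0.22104037f),
                new Hit("winter Olympics: hockey, ice, medallists", 2, 0.17402513f),
                new Hit("ice hockey: British Sekonda Superleague Play-Off Championship: finals",
                        2, 0.020722535f));
        for (Hit hit : rank(hits)) {
            System.out.println(hit.matchedTerms + "  " + hit.score + "  " + hit.title);
        }
    }
}
```

Within Lucene, making both terms required in the query, or keeping a coord factor that grows with the overlap, are the corresponding levers; which one fits depends on whether single-term matches should still be listed.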

- Original Message -
From: Brisbart Franck [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, June 04, 2004 8:46 PM
Subject: Re: score and frequency


> Hi,
>
> Be careful to set the default similarity
> 'Similarity.setDefault(similarity)' before creating your search instance
> (IndexSearcher).
> If you change the default similarity after, you'll still use the old one.
> You'd better use the 'searcher.setSimilarity' method on your searcher.
>
> Franck

