Re: Writing a stemmer

2004-06-05 Thread Vladimir Yuryev
Hi, Andrzej!
How did you test the Polish texts, and with what stemmer?
Thanks,
Vladimir.
No reason to be too modest, Leo.. I tested your stemmer on English, 
Swedish and Polish texts (including F-measure vs. training set size 
plots), and it works exceptionally well indeed. Highly recommended!

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: score and frequency

2004-06-05 Thread Erik Hatcher
On Jun 5, 2004, at 1:13 AM, Niraj Alok wrote:
I want all the titles which have both ice and hockey to come above the 
rest (to have higher scores). Meaning I would wish the results to appear like:

ice hockey
ice hockey
ice hockey
winter Olympics: hockey, ice, medallists
ice hockey: British Sekonda Superleague Play-Off Championship: finals
ice age
National Hockey League
Cracking the Ice Age
ground-ice
My overridden similarity class contains just this method:

public float coord(int overlap, int maxOverlap) {
  return 1.0f;
}

Use IndexSearcher.explain(Query, docId) to see how the various factors 
in the equation are being set.

You are better off using DefaultSimilarity's implementation of coord() 
than just returning 1.0: you want to return a greater number when 
overlap is greater. Look at the numbers being passed to coord(); in the 
cases where both ice and hockey are present you are probably getting 2. 
Maybe just return (float) overlap as a first try and see the results 
then. The explain feature should give you the details you need to 
adjust the equation, though note that the default implementation 
already boosts the score of documents that have multiple terms 
matching.
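Erik's suggestion can be illustrated outside Lucene. The sketch below is plain Java with no Lucene dependency and a hypothetical class name; in a real index these methods would live in a subclass of DefaultSimilarity. It contrasts the flat coord() from the question with the overlap-proportional first try.

```java
// Standalone sketch (not Lucene code) of the two coord() variants discussed.
public class CoordSketch {

    // The flat version from the question: a document matching one query
    // term gets the same coordination factor as one matching both.
    static float coordFlat(int overlap, int maxOverlap) {
        return 1.0f;
    }

    // The suggested first try: the factor grows with the number of matching
    // terms, so "ice hockey" (overlap = 2) outranks "ice age" (overlap = 1)
    // on this component of the score.
    static float coordByOverlap(int overlap, int maxOverlap) {
        return (float) overlap;
    }

    public static void main(String[] args) {
        // Flat version: both documents tie; overlap version: two-term match wins.
        System.out.println(coordFlat(2, 2) == coordFlat(1, 2));          // true
        System.out.println(coordByOverlap(2, 2) > coordByOverlap(1, 2)); // true
    }
}
```

Whatever variant you choose, IndexSearcher.explain() will show the coord factor alongside tf and idf, so you can confirm it is the coordination term, not some other factor, that is flattening the ranking.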

Erik


Re: problems with lucene in multithreaded environment

2004-06-05 Thread Jayant Kumar
Thanks for the patch. It helped in increasing the
search speed to a good extent. But when we tried to
give about 100 queries in 10 seconds, then again we
found that after about 15 seconds, the response time
per query increased. Enclosed is the dump which we
took after about 30 seconds of starting the search.
The maximum query time has reduced from 200-300
seconds to about 50 seconds.

We were able to simplify the searches further by consolidating the
fields in the index, but that increased the index size to 2.5 GB, as we
required fields 2-5 and fields 1-7 in different searches. Our indexes
are on the local disk, therefore there is no network i/o involved.

Thanks
Jayant

--- Doug Cutting [EMAIL PROTECTED] wrote:
Doug Cutting wrote:
 Please tell me if you are able to simplify your queries and if that
 speeds things.  I'll look into a ThreadLocal-based solution too.

I've attached a patch that should help with the thread contention,
although I've not tested it extensively.

I still don't fully understand why your searches are so slow, though.
Are the indexes stored on the local disk of the machine?  Indexes
accessed over the network can be very slow.

Anyway, give this patch a try.  Also, if anyone else can try this and
report back whether it makes multi-threaded searching faster, or
anything else slower, or is buggy, that would be great.

Thanks,

Doug
Index: src/java/org/apache/lucene/index/TermInfosReader.java
===================================================================
RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/index/TermInfosReader.java,v
retrieving revision 1.6
diff -u -u -r1.6 TermInfosReader.java
--- src/java/org/apache/lucene/index/TermInfosReader.java	20 May 2004 11:23:53 -0000	1.6
+++ src/java/org/apache/lucene/index/TermInfosReader.java	4 Jun 2004 21:45:15 -0000
@@ -29,7 +29,8 @@
   private String segment;
   private FieldInfos fieldInfos;
 
-  private SegmentTermEnum enumerator;
+  private ThreadLocal enumerators = new ThreadLocal();
+  private SegmentTermEnum origEnum;
   private long size;
 
   TermInfosReader(Directory dir, String seg, FieldInfos fis)
@@ -38,19 +39,19 @@
     segment = seg;
     fieldInfos = fis;
 
-    enumerator = new SegmentTermEnum(directory.openFile(segment + ".tis"),
-                                     fieldInfos, false);
-    size = enumerator.size;
+    origEnum = new SegmentTermEnum(directory.openFile(segment + ".tis"),
+                                   fieldInfos, false);
+    size = origEnum.size;
     readIndex();
   }
 
   public int getSkipInterval() {
-    return enumerator.skipInterval;
+    return origEnum.skipInterval;
   }
 
   final void close() throws IOException {
-    if (enumerator != null)
-      enumerator.close();
+    if (origEnum != null)
+      origEnum.close();
   }
 
   /** Returns the number of term/value pairs in the set. */
@@ -58,6 +59,15 @@
     return size;
   }
 
+  private SegmentTermEnum getEnum() {
+    SegmentTermEnum enum = (SegmentTermEnum)enumerators.get();
+    if (enum == null) {
+      enum = terms();
+      enumerators.set(enum);
+    }
+    return enum;
+  }
+
   Term[] indexTerms = null;
   TermInfo[] indexInfos;
   long[] indexPointers;
@@ -102,16 +112,17 @@
   }
 
   private final void seekEnum(int indexOffset) throws IOException {
-    enumerator.seek(indexPointers[indexOffset],
-        (indexOffset * enumerator.indexInterval) - 1,
+    getEnum().seek(indexPointers[indexOffset],
+        (indexOffset * getEnum().indexInterval) - 1,
         indexTerms[indexOffset], indexInfos[indexOffset]);
   }
 
   /** Returns the TermInfo for a Term in the set, or null. */
-  final synchronized TermInfo get(Term term) throws IOException {
+  TermInfo get(Term term) throws IOException {
     if (size == 0) return null;
 
-    // optimize sequential access: first try scanning cached enumerator w/o seeking
+    // optimize sequential access: first try scanning cached enum w/o seeking
+    SegmentTermEnum enumerator = getEnum();
     if (enumerator.term() != null                 // term is at or past current
       && ((enumerator.prev != null && term.compareTo(enumerator.prev) > 0)
           || term.compareTo(enumerator.term()) >= 0)) {
@@ -128,6 +139,7 @@
 
   /** Scans within block for matching term. */
   private final TermInfo scanEnum(Term term) throws IOException {
+    SegmentTermEnum enumerator = getEnum();
     while (term.compareTo(enumerator.term()) > 0 && enumerator.next()) {}
     if (enumerator.term() != null && term.compareTo(enumerator.term()) == 0)
       return enumerator.termInfo();
@@ -136,10 +148,12 @@
   }
 
   /** Returns the nth term in the set. */
-  final synchronized Term get(int position) throws IOException {
+  final Term get(int position) throws IOException {
     if (size == 0) return null;
 
-    if (enumerator != null &&
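The idea behind the patch can be shown in a minimal standalone sketch (plain Java, not the Lucene code itself; the class is hypothetical): replace one shared object guarded by synchronized methods with a lazily created per-thread copy held in a ThreadLocal, so concurrent searcher threads no longer contend on a single lock.

```java
// Sketch of the per-thread caching pattern used by Doug's patch.
public class PerThreadCache {
    // One private instance per thread, created on first use, mirroring
    // the getEnum() null-check in the patch.
    private final ThreadLocal<StringBuilder> buffers = new ThreadLocal<>();

    StringBuilder getBuffer() {
        StringBuilder b = buffers.get();
        if (b == null) {              // first call on this thread
            b = new StringBuilder();
            buffers.set(b);
        }
        return b;                     // always this thread's own copy
    }

    public static void main(String[] args) throws InterruptedException {
        final PerThreadCache cache = new PerThreadCache();
        final StringBuilder[] other = new StringBuilder[1];
        Thread t = new Thread(() -> other[0] = cache.getBuffer());
        t.start();
        t.join();
        StringBuilder mine = cache.getBuffer();
        System.out.println(mine == cache.getBuffer()); // same thread: reused
        System.out.println(mine != other[0]);          // other thread: distinct
    }
}
```

The trade-off is one cached object per searching thread instead of one per reader, which costs memory with many threads but lets get() drop its synchronized keyword, as the patch does.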

Re: Writing a stemmer

2004-06-05 Thread Andrzej Bialecki
Vladimir Yuryev wrote:
Hi, Andrzej!
How did you test the Polish texts, and with what stemmer?
Thanks,
Vladimir.
No reason to be too modest, Leo.. I tested your stemmer on English, 
Swedish and Polish texts (including F-measure vs. training set size 
plots), and it works exceptionally well indeed. Highly recommended!
Well, I have several corpora of Polish language, which together amount 
to roughly 90,000 words (nouns and verbs) having at least 4 inflected 
forms. This set is randomized (i.e. lines of words + forms are in random 
order). I've split this into two parts - one of a fixed size, as a test 
set, and one of variable size as a training set. Then I compile stemmer 
tables using a variable number of training examples and different 
settings (trie, multi-trie, different optimizations, etc.). Then for 
each output table I test the precision/recall of correct base forms 
(lemmatization), and of ability to create unique stems (stemming). 
Finally, I select the best table, which gives reasonably good results 
vs. table size. To put it in plain terms, e.g. for tables roughly 300kB 
in size (created from training set of 3000 unique words + their forms) 
in best cases I get ~90% of correct stems, and ~70% of correct lemmas. 
Which is a _very_ good result!
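The evaluation loop described above can be sketched in plain Java (this is a hypothetical toy, not Andrzej's code): hold out test pairs of (inflected form, lemma), "train" on the rest, and count how many test forms are reduced to the correct lemma. The toy trainer just collects suffixes seen in training, standing in for the real trie/multi-trie table compilation.

```java
import java.util.*;

// Toy train/test evaluation of a suffix-stripping stemmer.
public class StemmerEval {

    // Strip the longest suffix learned during training.
    static String stem(String form, Set<String> suffixes) {
        String best = "";
        for (String s : suffixes)
            if (form.endsWith(s) && s.length() > best.length()) best = s;
        return form.substring(0, form.length() - best.length());
    }

    public static void main(String[] args) {
        // Training pairs: learn suffix = form minus its lemma prefix.
        String[][] train = {{"kota", "kot"}, {"kotem", "kot"}, {"domu", "dom"}};
        Set<String> suffixes = new HashSet<>();
        for (String[] p : train)
            if (p[0].startsWith(p[1]))
                suffixes.add(p[0].substring(p[1].length()));

        // Held-out test pairs; the irregular "psa" -> "pies" fails, which is
        // why accuracy stays below 100% on real inflected corpora.
        String[][] test = {{"dachu", "dach"}, {"lasem", "las"}, {"psa", "pies"}};
        int correct = 0;
        for (String[] p : test)
            if (stem(p[0], suffixes).equals(p[1])) correct++;
        System.out.println(correct + "/" + test.length);  // prints 2/3
    }
}
```

Plotting this accuracy against the size of the training slice gives exactly the F-measure-vs-training-set-size curves mentioned at the top of the thread.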

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)
-


No tvx reader

2004-06-05 Thread David Spencer
Using 1.4rc3.
Running an app that indexes 50k documents (thus it just uses an 
IndexWriter).
One field has the boolean set so that a term vector is stored for it, 
while the other 11 fields don't.

On stdout I see "No tvx file" 13 times.
Glancing through the src it seems this comes from TermVectorReader.
The generated index seems fine.
What could be causing this, and is this normal?
thx,
Dave