Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Andrzej Bialecki

Doug Cutting wrote:


Andrzej Bialecki wrote:


I tested it on a 5 mln index.



Thanks, this is great data!

Can you please tell a bit more about the experiments?  In particular:
 . How were scores assigned to pages?  Link analysis?  log(number of 
incoming links) or OPIC?



log()


 . How were the queries generated?  From a log or randomly?



Queries have been picked up manually, to test the worst performing cases 
from a real query log.



 . How many queries did you test with?



A couple, not many. This of course needs to be improved, but I didn't 
want to spend too much time until we polish the code.



 . When results differed greatly, did they look a lot worse?



Yes. E.g. see the differences for MAX_HITS=1



My attempt to sort a 38M page index failed with OutOfMemory.  Sigh.

For MAX_HITS=1000 the performance increase was ca. 40-fold, i.e. 
queries that executed in e.g. 500 ms now executed in 10-20 ms 
(perfRate=40). Following intuition, performance drops as we 
increase MAX_HITS, until it reaches more or less the original value 
(perfRate=1) for MAX_HITS=30 (for a 5 mln doc index). After that, 
increasing MAX_HITS actually worsens the performance (perfRate < 1) 
- which can be explained by the fact that the standard HitCollector 
doesn't collect as many documents if they score too low.



This doesn't make sense to me.  It should never be slower.  We're not 
actually keeping track of any more hits, only stopping earlier.



Yes, that's puzzling... Please read below.



* Two-term Nutch queries result in complex Lucene BooleanQueries over 
many index fields, including also PhraseQueries. These fared much 
worse than single-term queries: actually, the topN values were very 
low until MAX_HITS was increased to large values, and then all of a 
sudden all topN-s flipped into the 80-90% ranges.



It would be interesting to try altering the generated query, to see if 
it is the phrases or simply multiple terms which cause problems.  To 
do this, one could hack the query-basic plugin, or simply alter query 
boost parameters.  This would help us figure out where the 
optimization is failing.  Suel used multi-term queries, but not 
phrases, so we expect that the phrases are causing the problem, but it 
would be good to see for certain.  We've also never tuned Nutch's 
phrase matching, so it's also possible that we may sometimes 
over-emphasize the phrase component in scores.  For example, a slop of 
10 might give better results and/or be more amenable to this 
optimization.



I actually forgot to write that I don't use any of Nutch code. Early on 
I decided to eliminate this part in order to get first the raw 
performance from Lucene - but still using the Lucene queries 
corresponding to translated Nutch queries.


Please see the attached scripts. You will need to put bsh and lucene on 
your classpath, as well as compile and add the LimitedCollector classes 
(it was not possible to implement this inside bsh scripts, due to 
vagaries of class overloading in BeanShell). You need to customize the 
paths to both indexes inside the script. Then run:


   java -cp bsh.jar:lucene.jar:LimitedCollector.jar bsh.Interpreter 
test-opt.sh +url:information 1


The script will iterate 10 times for a warm-up, and then report the 
timing, matching, and topN results. You will get an HTML table with 
the first 100 results.



In several installations I use smaller values of slop (around 20-40). 
But this is motivated by better quality matches, not by performance, so 
I didn't test for this...




I also noticed that the values of topN depended strongly on the 
document frequency of terms in the query. For a two-term query, where 
both terms have average document frequency, the topN values start 
from ~50% for low MAX_HITS. For a two-term query where one of the 
terms has a very high document frequency, the topN values start from 
0% for low MAX_HITS. See the spreadsheet for details.



Were these actually useful queries?  For example, I would not be 
concerned if results differed greatly for a query like 'to be', since 
that's not a very useful query.  Try searching for 'the the' on Google.




No, these were not stop-words. High DF is around 0.3, i.e. the term 
occurs in 30% of documents; but it's still a valid and useful term.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


import org.apache.lucene.search.*;

public class LimitedCollector extends TopDocCollector {
  private int maxHits;

  public LimitedCollector(int numHits, int maxHits) {
    super(numHits);              // keep the top numHits documents
    this.maxHits = maxHits;      // but stop after maxHits total hits
  }

  public void collect(int doc, float score) {
    if (getTotalHits() >= maxHits)
      throw new LimitExceeded(doc);
    super.collect(doc, score);
  }
}

/** Unchecked exception used to abort collection once maxHits is reached. */
class LimitExceeded extends RuntimeException {
  final int doc;
  LimitExceeded(int doc) { this.doc = doc; }
}
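To illustrate the early-termination pattern behind LimitedCollector without 
depending on Lucene, here is a minimal self-contained sketch (the search 
loop and class names are stand-ins for the actual Lucene code, not part of 
the original attachment):

```java
// Sketch of the stop-early-by-exception pattern used by LimitedCollector.
// The search() loop below stands in for Lucene's scorer; only the control
// flow (throw on limit, catch at the call site) mirrors the real code.
public class LimitDemo {

  static class LimitExceeded extends RuntimeException {
    final int doc;
    LimitExceeded(int doc) { this.doc = doc; }
  }

  static int collected;

  // Mirrors LimitedCollector.collect(): refuse hits past the limit.
  static void collect(int doc, int maxHits) {
    if (collected >= maxHits) throw new LimitExceeded(doc);
    collected++;
  }

  /** Scan docs 0..numDocs-1; returns the doc id where scanning stopped. */
  static int search(int numDocs, int maxHits) {
    collected = 0;
    int lastDoc = numDocs;                 // full scan, unless we stop early
    try {
      for (int doc = 0; doc < numDocs; doc++) collect(doc, maxHits);
    } catch (LimitExceeded e) {
      lastDoc = e.doc;                     // stopped early at this doc
    }
    return lastDoc;
  }

  public static void main(String[] args) {
    System.out.println("stopped at doc " + search(1000, 10));
  }
}
```

In the real benchmark the catch would wrap the IndexSearcher.search() call; 
the doc id where scanning stopped is also what the thread's later discussion 
of extrapolating total hit counts would rely on.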
 

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Doug Cutting

Andrzej Bialecki wrote:

 . How were the queries generated?  From a log or randomly?


Queries have been picked up manually, to test the worst performing cases 
from a real query log.


So, for example, the 50% error rate might not be typical, but could be 
worst-case.



 . When results differed greatly, did they look a lot worse?


Yes. E.g. see the differences for MAX_HITS=1


The graph just shows that they differ, not how much better or worse they 
are, since the baseline is not perfect.  When the top-10 is 50% 
different, are those 5 different hits markedly worse matches to your eye 
than the five they've displaced, or are they comparable?  That's what 
really matters.


I actually forgot to write that I don't use any of Nutch code. Early on 
I decided to eliminate this part in order to get first the raw 
performance from Lucene - but still using the Lucene queries 
corresponding to translated Nutch queries.


What part of Nutch are you trying to avoid?  Perhaps you could try 
measuring your Lucene-only benchmark against a Nutch-based one.  If they 
don't differ markedly then you can simply use Nutch, which makes it a 
stronger benchmark.  If they differ, then we should figure out why.


In several installations I use smaller values of slop (around 20-40). 
But this is motivated by better quality matches, not by performance, so 
I didn't test for this...


But that's a great reason to test for it!  If lower slop can improve 
result quality, then we should certainly see if it also makes 
optimizations easier.


Doug


Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Andrzej Bialecki

Doug Cutting wrote:


Andrzej Bialecki wrote:


 . How were the queries generated?  From a log or randomly?



Queries have been picked up manually, to test the worst performing 
cases from a real query log.



So, for example, the 50% error rate might not be typical, but could be 
worst-case.



Yes, that's true.




 . When results differed greatly, did they look a lot worse?



Yes. E.g. see the differences for MAX_HITS=1



The graph just shows that they differ, not how much better or worse 
they are, since the baseline is not perfect.  When the top-10 is 50% 
different, are those 5 different hits markedly worse matches to your 
eye than the five they've displaced, or are they comparable?  That's 
what really matters.




Hmm. I'm not sure I agree with this. Your reasoning would be true if we 
were changing the ranking formula. But the goal IMHO with these patches 
is to return equally complete results, using the same ranking formula.


I specifically avoided using normalized scores, instead using the 
absolute scores in TopDocs. And the absolute scores in both cases are 
exactly the same, for those results that are present.


What is wrong is that some results that should be there (judging by the 
ranking) are simply missing. So, it's about the recall, and the baseline 
index gives the best estimate.


I actually forgot to write that I don't use any of Nutch code. Early 
on I decided to eliminate this part in order to get first the raw 
performance from Lucene - but still using the Lucene queries 
corresponding to translated Nutch queries.



What part of Nutch are you trying to avoid?  Perhaps you could try 
measuring your Lucene-only benchmark against a Nutch-based one.  If 
they don't differ markedly then you can simply use Nutch, which makes 
it a stronger benchmark.  If they differ, then we should figure out why.



Again, I don't see it this way. Nutch results will always be worse than 
pure Lucene, because of the added layers. If I can't improve the 
performance in Lucene code (which takes > 85% of the time for every query) 
then no matter how well optimized Nutch code is it won't get far.


So, I'm reproducing the same queries that Nutch passes to Lucene, in 
order to simulate the typical load generated by Nutch, but avoiding any 
non-essential code that could skew the results. What's wrong with that? 
I think this approach makes the benchmark easier to understand and 
isolated from non-essential factors. When we reach a significant 
improvement on this level, of course we should move to use Nutch code to 
test how well it works on the top.




In several installations I use smaller values of slop (around 20-40). 
But this is motivated by better quality matches, not by performance, 
so I didn't test for this...



But that's a great reason to test for it!  If lower slop can improve 
result quality, then we should certainly see if it also makes 
optimizations easier.



I forgot to mention this - the tests I ran already used the smaller 
values: the slop was set to 20.


That's another advantage of using Lucene directly in this script - you 
can provide any query structure on the command-line without changing the 
code in Nutch.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Andrzej Bialecki

Doug Cutting wrote:


Andrzej Bialecki wrote:

Ok, I just tested IndexSorter for now. It appears to work correctly, 
at least I get exactly the same results, with the same scores and the 
same explanations, if I run the same queries on the original and on 
the sorted index.



Here's a more complete version, still mostly untested.  This should 
make searches faster.  We'll see how much good the results are...


This includes a patch to Lucene to make it easier to write hit 
collectors that collect TopDocs.


I'll test this on a 38M document index tomorrow.



I'll test it soon - one comment, though. Currently you use a subclass of 
RuntimeException to stop the collecting. I think we should come up with 
a better mechanism - throwing exceptions is too costly. Perhaps the 
HitCollector.collect() method should return a boolean to signal whether 
the searcher should continue working.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Doug Cutting

Andrzej Bialecki wrote:
I'll test it soon - one comment, though. Currently you use a subclass of 
RuntimeException to stop the collecting. I think we should come up with 
a better mechanism - throwing exceptions is too costly.


I thought about this, but I could not see a simple way to achieve it. 
And one exception thrown per query is not very expensive.  But it is bad 
style.  Sigh.


Perhaps the 
HitCollector.collect() method should return a boolean to signal whether 
the searcher should continue working.


We don't really want a HitCollector in this case: we want a TopDocs.  So 
the patch I made is required: we need to extend the HitCollector that 
implements TopDocs-based searching.


Long-term, to avoid the 'throw', we'd need to also:

1. Change:
 TopDocs Searchable.search(Query, Filter, int numHits)
   to:
 TopDocs Searchable.search(Query, Filter, int numHits, int maxTotalHits)

2. Add, for back-compatibility:
 TopDocs Searcher.search(Query, Filter, int numHits) {
   return search(query, filter, numHits, Integer.MAX_VALUE);
 }

3. Add a new method:
 /** Return false to stop hit processing. */
 boolean HitCollector.processHit(int doc, float score) {
   collect(doc, score);   // for back-compatibility
   return true;
 }
   Then change all calls to HitCollector.collect to instead call this,
   and deprecate HitCollector.collect.

I think that would do it.  But is it worth it?
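The three steps above could be sketched outside of Lucene proper roughly as 
follows (the class is a stand-in following the proposal, not Lucene's actual 
HitCollector; the scorer loop is likewise mocked):

```java
// Sketch of the proposed boolean-return API (step 3 above).
public abstract class HitCollectorSketch {

  /** Old-style API, kept for back-compatibility and later deprecation. */
  public abstract void collect(int doc, float score);

  /** New-style API: return false to stop hit processing. */
  public boolean processHit(int doc, float score) {
    collect(doc, score);           // default preserves old behavior
    return true;
  }

  /** How a scorer loop would honor the new contract: no exception needed. */
  public static int scoreAll(HitCollectorSketch hc, float[] scores) {
    int processed = 0;
    for (int doc = 0; doc < scores.length; doc++) {
      processed++;
      if (!hc.processHit(doc, scores[doc])) break;   // clean early exit
    }
    return processed;
  }

  /** Example collector that stops after maxHits hits. */
  public static class Limited extends HitCollectorSketch {
    private final int maxHits;
    private int hits;
    public Limited(int maxHits) { this.maxHits = maxHits; }
    public void collect(int doc, float score) { hits++; }
    public boolean processHit(int doc, float score) {
      collect(doc, score);
      return hits < maxHits;       // false once the limit is reached
    }
  }

  public static void main(String[] args) {
    int n = scoreAll(new Limited(3), new float[] {1f, 1f, 1f, 1f, 1f});
    System.out.println("processed " + n + " hits");
  }
}
```

The design trade-off is exactly the one weighed above: the boolean return 
avoids the throw, but requires touching every caller of collect().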

In the past I've frequently wanted to be able to extend TopDocs-based 
searching, so I think the Lucene patch I've constructed so far is 
generally useful.


Doug


Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Andrzej Bialecki

Doug Cutting wrote:


Andrzej Bialecki wrote:

Ok, I just tested IndexSorter for now. It appears to work correctly, 
at least I get exactly the same results, with the same scores and the 
same explanations, if I run the same queries on the original and on 
the sorted index.



Here's a more complete version, still mostly untested.  This should 
make searches faster.  We'll see how much good the results are...


This includes a patch to Lucene to make it easier to write hit 
collectors that collect TopDocs.


I'll test this on a 38M document index tomorrow.



I tested it on a 5 mln index.

The original index is considered the baseline, i.e. it represents 
normative values for scoring and ranking. These results are compared to 
results from the optimized index, and scores and positions of hits are 
also recorded. Finally, these two lists are matched, and relative 
differences in scoring and ranking are calculated.


At the end, I calculate the top10 %, top50% and top100%, defined as a 
percent of the top-N hits from the optimized index, which match the 
top-N hits from the baseline index. Ideally, all these measures should 
be 100%, i.e. all top-N hits from the optimized index should match 
corresponding top-N hits from the baseline index.
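For concreteness, the top-N measure described above could be computed like 
this (my reconstruction of the metric, not the actual benchmark script):

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of the top-N measure: the percentage of the optimized index's
 *  top-N doc ids that also appear in the baseline index's top-N. */
public class TopNMatch {

  public static double topN(int[] baseline, int[] optimized, int n) {
    Set<Integer> base = new HashSet<Integer>();
    for (int i = 0; i < Math.min(n, baseline.length); i++)
      base.add(baseline[i]);
    int matches = 0;
    for (int i = 0; i < Math.min(n, optimized.length); i++)
      if (base.contains(optimized[i])) matches++;
    return 100.0 * matches / n;       // 100% means identical top-N sets
  }

  public static void main(String[] args) {
    int[] baseline  = {1, 2, 3, 4};
    int[] optimized = {3, 9, 1, 8};
    System.out.println("top-4 match: " + topN(baseline, optimized, 4) + "%");
  }
}
```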


One variable which affects greatly both the recall and the performance 
is the maximum number of hits considered by the TopDocCollector. In my 
tests I used values between 1,000 up to 500,000 (which represents 1/10th 
of the full index in my case).


Now, the results. I collected all test results in a spreadsheet 
(OpenDocument or PDF format), you can download it from:


   http://www.getopt.org/nutch/20051214/nutchPerf.ods
   http://www.getopt.org/nutch/20051214/nutchPerf.pdf

For MAX_HITS=1000 the performance increase was ca. 40-fold, i.e. 
queries that executed in e.g. 500 ms now executed in 10-20 ms 
(perfRate=40). Following intuition, performance drops as we increase 
MAX_HITS, until it reaches more or less the original value (perfRate=1) 
for MAX_HITS=30 (for a 5 mln doc index). After that, increasing 
MAX_HITS actually worsens the performance (perfRate < 1) - which can be 
explained by the fact that the standard HitCollector doesn't collect as 
many documents if they score too low.


* Single-term Nutch queries (i.e. which do not produce Lucene 
PhraseQueries) yield relatively good values of topN, even for relatively 
small values of MAX_HITS - however, MAX_HITS=1000 yields all topN=0%. 
The minimum useful value for my index was MAX_HITS=1 (perfRate=30), 
and this yields quite acceptable top10=90%, but less acceptable top50 
and top100. Please see the spreadsheet for details.


* Two-term Nutch queries result in complex Lucene BooleanQueries over 
many index fields, including also PhraseQueries. These fared much worse 
than single-term queries: actually, the topN values were very low until 
MAX_HITS was increased to large values, and then all of a sudden all 
topN-s flipped into the 80-90% ranges.


I also noticed that the values of topN depended strongly on the document 
frequency of terms in the query. For a two-term query, where both terms 
have average document frequency, the topN values start from ~50% for low 
MAX_HITS. For a two-term query where one of the terms has a very high 
document frequency, the topN values start from 0% for low MAX_HITS. See 
the spreadsheet for details.


Conclusions: more work is needed... ;-)

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-13 Thread Doug Cutting

Andrzej Bialecki wrote:
Shouldn't this be combined with a HitCollector that collects only the 
first-n matches? Otherwise we still need to scan the whole posting list...


Yes.  I was just posting the work-in-progress.

We will also need to estimate the total number of matches by 
extrapolating linearly from the maximum doc id processed.  Finally, it 
is probably rather slow for large indexes, whose .fdt won't fit in 
memory.  A simple way to improve that might be to use 
Similarity.floatToByte-encoded floats when sorting (e.g., the norm from 
an untokenized field) so that documents whose boosts are close are not 
re-ordered.  I'll start work on these in the morning.  (It is currently 
my middle-of-night.)
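The coarse-sort idea mentioned here (encode boosts to a byte-sized key so 
that documents with near-equal boosts are not re-ordered) could look roughly 
like this; the linear quantizer below is a placeholder assumption, not 
Lucene's actual Similarity.floatToByte norm encoding:

```java
import java.util.Arrays;
import java.util.Comparator;

/** Sketch: sort doc ids by a coarse encoding of their boost, so documents
 *  whose boosts are nearly equal keep their original order. */
public class CoarseSort {

  // Placeholder quantizer: boosts within 0.1 of each other collapse into
  // the same bucket.  Lucene's norm encoding would be used instead.
  static int quantize(float boost) {
    return (int) (boost * 10);
  }

  /** Doc ids sorted by descending quantized boost; ties keep doc order,
   *  because Arrays.sort on objects is a stable sort. */
  public static Integer[] sortByBoost(final float[] boosts) {
    Integer[] docs = new Integer[boosts.length];
    for (int i = 0; i < docs.length; i++) docs[i] = i;
    Arrays.sort(docs, new Comparator<Integer>() {
      public int compare(Integer a, Integer b) {
        return quantize(boosts[b]) - quantize(boosts[a]);
      }
    });
    return docs;
  }

  public static void main(String[] args) {
    float[] boosts = {1.01f, 2.5f, 1.05f};   // docs 0 and 2 nearly equal
    System.out.println(Arrays.toString(sortByBoost(boosts)));
  }
}
```

Since the quantized keys are equal for docs 0 and 2, the stable sort leaves 
them in their original relative order, which is what avoids the needless 
re-ordering.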


Doug


Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-13 Thread Andrzej Bialecki

Doug Cutting wrote:


Andrzej Bialecki wrote:

Shouldn't this be combined with a HitCollector that collects only the 
first-n matches? Otherwise we still need to scan the whole posting 
list...



Yes.  I was just posting the work-in-progress.



Ok, I just tested IndexSorter for now. It appears to work correctly, at 
least I get exactly the same results, with the same scores and the same 
explanations, if I run the same queries on the original and on the 
sorted index. For now, the query response time is identical as far as I 
can tell.


We will also need to estimate the total number of matches by 
extrapolating linearly from the maximum doc id processed.



...which should be reported by the custom HitCollector, right?
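Assuming the collector reports both the hit count and the last doc id 
scanned, the extrapolation itself is simple; a sketch (with made-up method 
names, since this code was never posted):

```java
/** Sketch of the linear extrapolation described above: if scanning stopped
 *  at lastDoc out of maxDoc documents, scale up the hits seen so far. */
public class HitEstimate {

  public static int estimateTotalHits(int hitsSoFar, int lastDoc, int maxDoc) {
    if (lastDoc <= 0 || lastDoc >= maxDoc)
      return hitsSoFar;                     // full scan: the count is exact
    return (int) ((long) hitsSoFar * maxDoc / lastDoc);
  }

  public static void main(String[] args) {
    // Stopped after 100 hits at doc 1,000 in a 10,000-doc index.
    System.out.println(estimateTotalHits(100, 1000, 10000));
  }
}
```

This only holds because the sorted index makes doc ids roughly 
boost-ordered, so hits are assumed to be spread evenly over the doc-id range.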

Finally, it is probably rather slow for large indexes, whose .fdt 
won't fit in memory.  A simple way to improve that might be to use 
Similarity.floatToByte-encoded floats when sorting (e.g., the norm 
from an untokenized field) so that 



Yes - for an index of 5 mln docs the IndexOptimizer takes ~10 min to 
complete, while this IndexSorter took over 1 hour...


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Doug Cutting

Andrzej Bialecki wrote:
By all means please start, this is still near the limits of my knowledge 
of Lucene... ;-)


Okay, I'll try to get something working fairly soon.

Doug


Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Doug Cutting

Andrzej Bialecki wrote:
By all means please start, this is still near the limits of my knowledge 
of Lucene... ;-)


Attached is a class which sorts a Nutch index by boost.  I have only 
tested it on a ~100 page index, where it appears to work correctly. 
Please tell me how it works for you.


Doug
/**
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.nutch.indexer;

import java.io.File;
import java.io.IOException;
import java.util.Date;
import java.util.Arrays;

import org.apache.lucene.index.*;
import org.apache.lucene.document.*;
import org.apache.lucene.store.*;

import org.apache.nutch.util.NutchConf;

/** Sort a Nutch index by page score.  Higher scoring documents are assigned
 * smaller document numbers. */
public class IndexSorter {

  private static class PostingMap implements Comparable {
private int newDoc;
private long offset;

public int compareTo(Object o) {  // order by newDoc id
  return this.newDoc - ((PostingMap)o).newDoc;
}
  }

  private static class SortedTermPositions implements TermPositions {
private TermPositions original;
private int[] oldToNew;

private int docFreq;

private PostingMap[] postingMaps = new PostingMap[0];
private int pointer;

private int freq;
private int position;

private static final String TEMP_FILE = "temp";
private final RAMDirectory tempDir = new RAMDirectory();
private final RAMOutputStream out =
  (RAMOutputStream)tempDir.createOutput(TEMP_FILE);
private IndexInput in;

public SortedTermPositions(TermPositions original, int[] oldToNew) {
  this.original = original;
  this.oldToNew = oldToNew;
}

public void seek(Term term) throws IOException {
  throw new UnsupportedOperationException();
}

public void seek(TermEnum terms) throws IOException {
  original.seek(terms);

  docFreq = terms.docFreq();
  pointer = -1;

  if (docFreq > postingMaps.length) { // grow postingMaps
PostingMap[] newMap = new PostingMap[docFreq];
System.arraycopy(postingMaps, 0, newMap, 0, postingMaps.length);
for (int i = postingMaps.length; i < docFreq; i++) {
  newMap[i] = new PostingMap();
}
postingMaps = newMap;
  }

  out.reset();

  int i = 0;
  while (original.next()) {
PostingMap map = postingMaps[i++];
map.newDoc = oldToNew[original.doc()];// remap the newDoc id
map.offset = out.getFilePointer();// save pointer to buffer

final int tf = original.freq();   // buffer tf & positions
out.writeVInt(tf);
int prevPosition = 0;
for (int j = tf; j > 0; j--) {// delta encode positions
  int p = original.nextPosition();
  out.writeVInt(p - prevPosition);
  prevPosition = p;
}
  }
  out.flush();
  docFreq = i;// allow for deletions
  
  Arrays.sort(postingMaps, 0, docFreq);   // resort by mapped doc ids

  // NOTE: this might be substantially faster if RAMInputStream were public
  // and supported a reset() operation.
  in = tempDir.openInput(TEMP_FILE);
}

public boolean next() throws IOException {
  pointer++;
  if (pointer < docFreq) {
in.seek(postingMaps[pointer].offset);
freq = in.readVInt();
position = 0;
return true;
  }
  return false;
}
  
public int doc() { return postingMaps[pointer].newDoc; }
public int freq() { return freq; }

public int nextPosition() throws IOException {
  int positionIncrement = in.readVInt();
  position += positionIncrement;
  return position;
}

public int read(int[] docs, int[] freqs) {
  throw new UnsupportedOperationException();
}
public boolean skipTo(int target) {
  throw new UnsupportedOperationException();
}

public void close() throws IOException {
  original.close();
}

  }

  private static class SortingReader extends FilterIndexReader {

private int[] oldToNew;
private int[] newToOld;

public SortingReader(IndexReader oldReader, int[] oldToNew) {
  super(oldReader);
  this.oldToNew = oldToNew;
  
  this.newToOld = new int[oldReader.numDocs()];
  int oldDoc = 0;
  while (oldDoc < oldToNew.length) {
int newDoc = 

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Andrzej Bialecki

Doug Cutting wrote:


Andrzej Bialecki wrote:

By all means please start, this is still near the limits of my 
knowledge of Lucene... ;-)



Attached is a class which sorts a Nutch index by boost.  I have only 
tested it on a ~100 page index, where it appears to work correctly. 
Please tell me how it works for you.



Shouldn't this be combined with a HitCollector that collects only the 
first-n matches? Otherwise we still need to scan the whole posting list...


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com