Re: IndexSorter optimizer

2006-01-04 Thread Doug Cutting

Byron Miller wrote:

On optimizing performance, does anyone know if google
is exporting its entire dataset as an index or only
somehow indexing the topN % (since they only show the
first 1000 or so results anyway)


Both.  The highest-scoring pages are kept in separate indexes that are
searched first.  When a query fails to match 1000 or so documents in the
high-scoring indexes, the entire dataset is searched.  In general there
can be multiple levels, e.g. high-scoring, mid-scoring and low-scoring
indexes, with the vast majority of pages in the last category, and the
vast majority of queries resolved by consulting only the first category.


What I have implemented so far for Nutch is a single-index version of 
this.  The current index-sorting implementation does not yet scale well 
to indexes larger than ~50M urls.  It is a proof-of-concept.


A better long-term approach is to introduce another MapReduce pass that 
collects Lucene documents (or equivalent) as values, and page scores as 
keys.  Then the indexing MapReduce pass can partition and sort by score 
before creating indexes.  The distributed search code will also need to 
be modified to search high-score indexes first.
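
As a rough illustration of the score-as-key idea (written against
today's Hadoop MapReduce API purely for illustration; the MapReduce
code currently in Nutch has a different interface, and the
serialized-document value type and the parseScore() helper here are
assumptions):

  import java.io.IOException;
  import org.apache.hadoop.io.FloatWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Emits (pageScore, document) pairs so that the shuffle sorts
  // documents by score before the indexing step.  A range partitioner
  // would additionally be needed for a globally sorted order.
  public class ScoreKeyMapper extends Mapper<Text, Text, FloatWritable, Text> {
    protected void map(Text url, Text serializedDoc, Context context)
        throws IOException, InterruptedException {
      float score = parseScore(serializedDoc);   // hypothetical helper
      // negate the score: the default ascending key sort then yields
      // the highest-scoring documents first
      context.write(new FloatWritable(-score), serializedDoc);
    }
    private float parseScore(Text doc) { return 0.0f; }  // stub
  }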


Doug


Re: IndexSorter optimizer

2006-01-04 Thread Andrzej Bialecki

Doug Cutting wrote:


Byron Miller wrote:


On optimizing performance, does anyone know if google
is exporting its entire dataset as an index or only
somehow indexing the topN % (since they only show the
first 1000 or so results anyway)



Both.  The highest-scoring pages are kept in separate indexes that are
searched first.  When a query fails to match 1000 or so documents in the
high-scoring indexes, the entire dataset is searched.  In general there
can be multiple levels, e.g. high-scoring, mid-scoring and low-scoring
indexes, with the vast majority of pages in the last category, and the
vast majority of queries resolved by consulting only the first category.


What I have implemented so far for Nutch is a single-index version of 
this.  The current index-sorting implementation does not yet scale 
well to indexes larger than ~50M urls.  It is a proof-of-concept.


A better long-term approach is to introduce another MapReduce pass 
that collects Lucene documents (or equivalent) as values, and page 
scores as keys.  Then the indexing MapReduce pass can partition and 
sort by score before creating indexes.  The distributed search code 
will also need to be modified to search high-score indexes first.



The WWW2005 conference (http://www2005.org) featured a couple of
interesting papers on this subject, among them these:


1. http://www2005.org/cdrom/docs/p235.pdf
2. http://www2005.org/cdrom/docs/p245.pdf
3. http://www2005.org/cdrom/docs/p257.pdf

The techniques described in the first paper are not too difficult to
implement, especially Carmel's method of index pruning, which gives
satisfactory results at moderate cost.
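
To give a flavour of the idea, a toy sketch of per-term top-k pruning
over plain data structures (rewriting an actual Lucene index this way
would need low-level index tooling, much like IndexSorter itself):

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class TopKPruner {
    static class Posting {
      int doc; float score;
      Posting(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // For every term keep only the k highest-scoring postings and drop
    // the rest: the essence of Carmel-style static pruning.
    public static Map<String, List<Posting>> prune(
        Map<String, List<Posting>> index, int k) {
      Map<String, List<Posting>> pruned = new HashMap<String, List<Posting>>();
      for (Map.Entry<String, List<Posting>> e : index.entrySet()) {
        List<Posting> list = new ArrayList<Posting>(e.getValue());
        Collections.sort(list, new Comparator<Posting>() {
          public int compare(Posting a, Posting b) {
            return Float.compare(b.score, a.score);   // descending
          }
        });
        pruned.put(e.getKey(), new ArrayList<Posting>(
            list.subList(0, Math.min(k, list.size()))));
      }
      return pruned;
    }
  }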


The third paper, by Long & Suel, presents a concept of using a cache of
intersections for multi-term queries, which we already sort of use with
CachingFilters, only they propose to store them on disk instead of
limiting the cache to a relatively small number of filters kept in RAM...
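
A bare-bones sketch of such an on-disk cache, with docid sets held as
serialized BitSets; the file layout and naming are made up here, and
real terms would need escaping:

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.FileOutputStream;
  import java.io.IOException;
  import java.io.ObjectInputStream;
  import java.io.ObjectOutputStream;
  import java.util.BitSet;

  public class IntersectionCache {
    private final File dir;
    public IntersectionCache(File dir) { this.dir = dir; }

    // Returns the cached docid set for a term pair, or null on a miss.
    public BitSet get(String a, String b)
        throws IOException, ClassNotFoundException {
      File f = new File(dir, a + "_" + b + ".bits");
      if (!f.exists()) return null;
      ObjectInputStream in = new ObjectInputStream(new FileInputStream(f));
      try { return (BitSet) in.readObject(); } finally { in.close(); }
    }

    // Persists the intersection so that later multi-term queries can
    // skip the posting-list merge entirely.
    public void put(String a, String b, BitSet docs) throws IOException {
      ObjectOutputStream out = new ObjectOutputStream(
          new FileOutputStream(new File(dir, a + "_" + b + ".bits")));
      try { out.writeObject(docs); } finally { out.close(); }
    }
  }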


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: IndexSorter optimizer

2006-01-03 Thread Byron Miller
On optimizing performance, does anyone know if google
is exporting its entire dataset as an index or only
somehow indexing the topN % (since they only show the
first 1000 or so results anyway)

With this patch and a top-result limit set in the xml file, does that
mean it will stop scanning the index at that point?  Is there a
methodology to actually prune the index by some scaling factor, so
that a 4-billion-page index can be searchable only 1k results deep on
average?

Seems like some sort of method to do the above would cut your search
processing/index size down fairly well.  But it may be more expensive
to post-process to this scale than it is to simply push everything and
let the query optimizer ignore it as needed... after all, disk space
is getting rather cheap compared to CPU processing & memory.



--- Doug Cutting [EMAIL PROTECTED] wrote:

Andrzej Bialecki wrote:
I'm happy to report that further tests performed on a larger index seem
to show that the overall impact of the IndexSorter is definitely
positive: performance improvements are significant, and the overall
quality of results seems at least comparable, if not actually better.

Great news!

I will submit the Lucene patches ASAP, now that we know they're useful.

Doug



Re: IndexSorter optimizer

2006-01-02 Thread Doug Cutting

Andrzej Bialecki wrote:
I'm happy to report that further tests performed on a larger index seem 
to show that the overall impact of the IndexSorter is definitely 
positive: performance improvements are significant, and the overall 
quality of results seems at least comparable, if not actually better.


Great news!

I will submit the Lucene patches ASAP, now that we know they're useful.

Doug


Re: IndexSorter optimizer

2006-01-02 Thread Andrzej Bialecki

Doug Cutting wrote:


Andrzej Bialecki wrote:

Using the original index, it was possible for pages with high tf/idf
of a term, but with a low boost value (the OPIC score), to outrank
pages with high boost but lower tf/idf of a term. This phenomenon
quite often leads to results that are perceived as junk, e.g. pages
with a lot of repeated terms but little other real content, such as
navigation bars.



Sounds like tf/idf might need to be de-emphasized in scoring.  Perhaps
NutchSimilarity.tf() should use log() instead of sqrt() when
field==content?



I don't think it's that simple; the OPIC score is what determined this
behaviour, and it doesn't correspond to tf/idf at all, but rather to a
human judgement.




To conclude, I will add the IndexSorter.java to the core classes, and
I suggest we continue the experiments ...



I've updated the version of Lucene included with Nutch to have the 
required patch.  Would you like me to commit IndexSorter.java or would 
you?



Please do it. There are two typos in your version of IndexSorter: you
used numDocs() in two places instead of maxDoc(), which for indexes with
deleted docs (after dedup) leads to exceptions.
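
For the record, the pattern that matters here (Lucene 1.x API):
numDocs() counts only live documents, while document numbers run up to
maxDoc(), so any loop over document ids has to use maxDoc() and skip
deletions explicitly. A minimal sketch:

  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexReader;

  public class DumpUrls {
    public static void main(String[] args) throws Exception {
      IndexReader reader = IndexReader.open(args[0]);
      for (int i = 0; i < reader.maxDoc(); i++) {  // not numDocs()!
        if (reader.isDeleted(i))
          continue;              // holes left behind by dedup
        Document doc = reader.document(i);
        System.out.println(doc.get("url"));
      }
      reader.close();
    }
  }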


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: IndexSorter optimizer

2006-01-02 Thread Andrzej Bialecki

Doug Cutting wrote:


I have committed this, along with the LuceneQueryOptimizer changes.

I could only find one place where I was using numDocs() instead of 
maxDoc().



Right, I confused two bugs from different files. The other bug still
exists in the committed version of the
LuceneQueryOptimizer.LimitedCollector constructor: instead of
super(maxHits) it should be super(numHits). This was actually the bug
that was causing that mysterious slowdown for higher values of MAX_HITS.
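
(super(maxHits) sizes the collector's internal priority queue to
MAX_HITS entries instead of the requested number of hits, which is
presumably where the time went.) Reconstructed from memory rather than
copied from the committed source, the corrected class looks roughly
like this:

  import org.apache.lucene.search.TopDocCollector;

  class LimitExceeded extends RuntimeException {
    int maxDoc;
    LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
  }

  class LimitedCollector extends TopDocCollector {
    private int maxHits;

    public LimitedCollector(int numHits, int maxHits) {
      super(numHits);            // the fix: numHits, not maxHits
      this.maxHits = maxHits;
    }

    public void collect(int doc, float score) {
      if (getTotalHits() >= maxHits)
        throw new LimitExceeded(doc);   // abort the scan early
      super.collect(doc, score);
    }
  }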


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




IndexSorter optimizer

2005-12-21 Thread Andrzej Bialecki

Hi,

I'm happy to report that further tests performed on a larger index seem 
to show that the overall impact of the IndexSorter is definitely 
positive: performance improvements are significant, and the overall 
quality of results seems at least comparable, if not actually better.


The reason why result quality seems better is quite interesting, and it 
shows that the simple top-N measures that I was using in my benchmarks 
may have been too simplistic.


Using the original index, it was possible for pages with high tf/idf
of a term, but with a low boost value (the OPIC score), to outrank
pages with high boost but lower tf/idf of a term. This phenomenon
quite often leads to results that are perceived as junk, e.g. pages
with a lot of repeated terms but little other real content, such as
navigation bars.


Using the optimized index, I reported previously that some of the
top-scoring results were missing. As it happens, the missing results
were typically the junk pages with high tf/idf but low boost. Since we
collect up to N hits, going from higher to lower boost values, the junk
pages with low boost values were automatically eliminated. So, overall
the subjective quality of results was improved. On the other hand, some
of the legitimate results with decent boost values were also skipped
because they didn't fit within the fixed number of hits... ah, well.
Perhaps we should limit the number of hits in LimitedCollector using a
cutoff boost value, and not the maximum number of hits (or maybe
both?).
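
Combining both criteria in a collector could look roughly like this (a
purely hypothetical sketch: the per-document boost array and the
abort-by-exception mechanism are assumptions, not existing Nutch code):

  import org.apache.lucene.search.HitCollector;

  class BoostCutoffCollector extends HitCollector {
    private final float[] boosts;  // per-document boost, e.g. decoded norms
    private final float minBoost;
    private final int maxHits;
    private int hits = 0;

    BoostCutoffCollector(float[] boosts, float minBoost, int maxHits) {
      this.boosts = boosts;
      this.minBoost = minBoost;
      this.maxHits = maxHits;
    }

    // In a boost-sorted index documents arrive in descending boost
    // order, so the first document below the cutoff ends the scan.
    public void collect(int doc, float score) {
      if (boosts[doc] < minBoost || ++hits > maxHits)
        throw new RuntimeException("collection limit reached");
      // ... record (doc, score) here ...
    }
  }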


This again brings to attention the importance of the OPIC score: it
represents a query-independent opinion about the quality of the page -
whichever way you calculate it. If you use PageRank, it (allegedly)
corresponds to other people's opinions about the page, thus providing an
objective quality opinion. If you use a simple list of
white/black-listed sites that you like/dislike, then it represents your
own subjective opinion on the quality of the site; etc, etc... In this
way, running a search engine that provides good results is not just a
matter of plain precision, recall, tf/idf and other tangible measures;
it's also a sort of political statement by the engine's operator. ;-)


To conclude, I will add the IndexSorter.java to the core classes, and I 
suggest to continue the experiments ...


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: IndexSorter optimizer

2005-12-21 Thread Stefan Groschupf

Hi Andrzej,

Wow, that's really great news!

Using the optimized index, I reported previously that some of the
top-scoring results were missing. As it happens, the missing results
were typically the junk pages with high tf/idf but low boost. Since we
collect up to N hits, going from higher to lower boost values, the junk
pages with low boost values were automatically eliminated. So, overall
the subjective quality of results was improved. On the other hand, some
of the legitimate results with decent boost values were also skipped
because they didn't fit within the fixed number of hits... ah, well.
Perhaps we should limit the number of hits in LimitedCollector using a
cutoff boost value, and not the maximum number of hits (or maybe
both?).


As long as we are experimenting, it would be good to have both.

To conclude, I will add the IndexSorter.java to the core classes, and
I suggest we continue the experiments ...


Maybe someone out there in the community has a commercial search engine
running (e.g. a Google appliance or similar), so we can set up Nutch
with the same pages and compare the results. I guess it will be
difficult to compare Nutch with Yahoo or Google, since none of us has a
4-billion-page index up and running. I would run one on my laptop, but
I won't have the bandwidth to fetch for the next two days. :-D

Great work!

Cheers,
Stefan 


Re: IndexSorter optimizer

2005-12-21 Thread Byron Miller
I've got a 400-million-page db I can run this against over the next
few days.

-byron

--- Stefan Groschupf [EMAIL PROTECTED] wrote:

Hi Andrzej,

Wow, that's really great news!

Using the optimized index, I reported previously that some of the
top-scoring results were missing. As it happens, the missing results
were typically the junk pages with high tf/idf but low boost. Since we
collect up to N hits, going from higher to lower boost values, the junk
pages with low boost values were automatically eliminated. So, overall
the subjective quality of results was improved. On the other hand, some
of the legitimate results with decent boost values were also skipped
because they didn't fit within the fixed number of hits... ah, well.
Perhaps we should limit the number of hits in LimitedCollector using a
cutoff boost value, and not the maximum number of hits (or maybe
both?).

As long as we are experimenting, it would be good to have both.

To conclude, I will add the IndexSorter.java to the core classes, and
I suggest we continue the experiments ...

Maybe someone out there in the community has a commercial search engine
running (e.g. a Google appliance or similar), so we can set up Nutch
with the same pages and compare the results. I guess it will be
difficult to compare Nutch with Yahoo or Google, since none of us has a
4-billion-page index up and running. I would run one on my laptop, but
I won't have the bandwidth to fetch for the next two days. :-D

Great work!

Cheers,
Stefan



Re: IndexSorter optimizer

2005-12-21 Thread American Jeff Bowden

Andrzej Bialecki wrote:


Hi,

I'm happy to report that further tests performed on a larger index 
seem to show that the overall impact of the IndexSorter is definitely 
positive: performance improvements are significant, and the overall 
quality of results seems at least comparable, if not actually better.



This is very interesting.  What's the computational complexity and disk
I/O for index sorting, as compared to other operations on an index (e.g.
adding/deleting N documents and running optimize)?




Re: IndexSorter optimizer

2005-12-21 Thread Andrzej Bialecki

American Jeff Bowden wrote:


Andrzej Bialecki wrote:


Hi,

I'm happy to report that further tests performed on a larger index 
seem to show that the overall impact of the IndexSorter is definitely 
positive: performance improvements are significant, and the overall 
quality of results seems at least comparable, if not actually better.




This is very interesting.  What's the computational complexity and disk
I/O for index sorting, as compared to other operations on an index (e.g.
adding/deleting N documents and running optimize)?



Comparable to optimize(). All index data needs to be read and copied, so 
the whole process is I/O bound.
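
For a feel of the shape of the operation, a much-simplified sketch that
copies only stored fields in descending-boost order, using the content
norms as a stand-in sort key (an assumption); the real IndexSorter
rewrites postings and norms as well, which is where the optimize()-like
I/O cost comes from:

  import java.util.Arrays;
  import java.util.Comparator;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;

  public class SortByBoost {
    public static void main(String[] args) throws Exception {
      IndexReader reader = IndexReader.open(args[0]);
      final byte[] norms = reader.norms("content"); // boost folded into norms
      Integer[] order = new Integer[reader.maxDoc()];
      for (int i = 0; i < order.length; i++)
        order[i] = new Integer(i);
      Arrays.sort(order, new Comparator() {         // descending by norm byte
        public int compare(Object a, Object b) {
          return (norms[((Integer) b).intValue()] & 0xff)
               - (norms[((Integer) a).intValue()] & 0xff);
        }
      });
      IndexWriter writer = new IndexWriter(args[1], new StandardAnalyzer(), true);
      for (int i = 0; i < order.length; i++) {
        int doc = order[i].intValue();
        if (!reader.isDeleted(doc))                 // skip dedup holes
          writer.addDocument(reader.document(doc));
      }
      writer.optimize();
      writer.close();
      reader.close();
    }
  }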


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com