RE: Performance optimization for Nutch index / query

2007-02-22 Thread Steve Severance
Hi,

I would like to comment if I might. I am not a Nutch/Lucene hacker yet; I have 
been working with it for only a few weeks. However, I am looking at extending it 
significantly to add some new features. Some of these will require 
extending Lucene as well. First, I have a test implementation of PageRank that 
is really an approximation that runs on top of MapReduce. Are people interested 
in having this in the index? I am interested in how this and other metadata 
might interact with your super field. For instance, I am also looking at using 
relevance feedback and having that as one of the criteria for ranking 
documents. I was also considering using an outside data source, possibly even 
another Lucene index, to store these values on a per-document basis. 
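
As a rough illustration of the idea only (not the implementation mentioned 
above, and not an actual Hadoop job), one PageRank iteration can be expressed 
as a map step that distributes rank over outlinks and a reduce step that sums 
the contributions; all names below are illustrative:

import java.util.*;

// Sketch of a single PageRank iteration, split map/reduce style.
public class PageRankIterationSketch {

  static final double DAMPING = 0.85;

  // "map": for each page, emit (outlink, rank / outDegree) for every outlink.
  static Map<String, List<Double>> map(Map<String, Double> ranks,
                                       Map<String, List<String>> outlinks) {
    Map<String, List<Double>> contributions = new HashMap<String, List<Double>>();
    for (Map.Entry<String, Double> e : ranks.entrySet()) {
      List<String> links = outlinks.get(e.getKey());
      if (links == null || links.isEmpty()) continue;
      double share = e.getValue() / links.size();
      for (String target : links) {
        if (!contributions.containsKey(target))
          contributions.put(target, new ArrayList<Double>());
        contributions.get(target).add(share);
      }
    }
    return contributions;
  }

  // "reduce": new rank = (1 - d) + d * sum(contributions) for each page.
  static Map<String, Double> reduce(Set<String> pages,
                                    Map<String, List<Double>> contributions) {
    Map<String, Double> newRanks = new HashMap<String, Double>();
    for (String page : pages) {
      double sum = 0.0;
      List<Double> shares = contributions.get(page);
      if (shares != null) for (double s : shares) sum += s;
      newRanks.put(page, (1.0 - DAMPING) + DAMPING * sum);
    }
    return newRanks;
  }
}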

The other major feature I am thinking about is using distance between words and 
text type. Do you know of anyone who has done this?

Regards,

Steve


iVirtuoso, Inc
Steve Severance
Partner, Chief Technology Officer
[EMAIL PROTECTED]
mobile: (240) 472 - 9645


-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 22, 2007 7:44 PM
To: nutch-dev@lucene.apache.org
Subject: Performance optimization for Nutch index / query

Hi all,

This very long post is meant to initiate a discussion. There is no code 
yet. Be warned that it discusses low-level Nutch/Lucene stuff.

Nutch queries are currently translated into complex Lucene queries. This 
is necessary in order to take into account score factors coming from 
various document parts, such as URL, host, title, content, and anchors.

Typically, the translation provided by query-basic looks like this for 
single term queries:

(1)
Query: term1
Parsed: term1
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0)
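
(For illustration, the translated query above corresponds roughly to a 
BooleanQuery built along these lines, using the classic Lucene API - a minimal 
sketch, not the actual query-basic plugin code; field names and boosts are 
taken from the example:)

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Sketch: build the per-field clauses for a single-term Nutch query.
public class TranslatedQuerySketch {

  static BooleanQuery translate(String term) {
    BooleanQuery fields = new BooleanQuery();
    fields.add(boosted("url", term, 4.0f), BooleanClause.Occur.SHOULD);
    fields.add(boosted("anchor", term, 2.0f), BooleanClause.Occur.SHOULD);
    fields.add(boosted("content", term, 1.0f), BooleanClause.Occur.SHOULD);
    fields.add(boosted("title", term, 1.5f), BooleanClause.Occur.SHOULD);
    fields.add(boosted("host", term, 2.0f), BooleanClause.Occur.SHOULD);

    BooleanQuery query = new BooleanQuery();
    query.add(fields, BooleanClause.Occur.MUST);  // the leading '+' above
    return query;
  }

  static TermQuery boosted(String field, String text, float boost) {
    TermQuery tq = new TermQuery(new Term(field, text));
    tq.setBoost(boost);
    return tq;
  }
}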

For queries consisting of two or more terms it looks like this (Nutch 
uses implicit AND):

(2)
Query: term1 term2
Parsed: term1 term2
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0) url:"term1 
term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 content:"term1 
term2"~2147483647 title:"term1 term2"~2147483647^1.5 host:"term1 
term2"~2147483647^2.0

By the way, please note the absurd default slop value - in the case of 
anchors it defeats the purpose of having the ANCHOR_GAP ...

Let's list other common query types:

(3)
Query: term1 term2 term3
Parsed: term1 term2 term3
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0) +(url:term3^4.0 
anchor:term3^2.0 content:term3 title:term3^1.5 host:term3^2.0) 
url:"term1 term2 term3"~2147483647^4.0 anchor:"term1 term2 term3"~4^2.0 
content:"term1 term2 term3"~2147483647 title:"term1 term2 
term3"~2147483647^1.5 host:"term1 term2 term3"~2147483647^2.0

For phrase queries it looks like this:

(4)
Query: "term1 term2"
Parsed: "term1 term2"
Translated: +(url:"term1 term2"^4.0 anchor:"term1 term2"^2.0 
content:"term1 term2" title:"term1 term2"^1.5 host:"term1 term2"^2.0)

For mixed term and phrase queries it looks like this:

(5)
Query: term1 "term2 term3"
Parsed: term1 "term2 term3"
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:"term2 term3"^4.0 anchor:"term2 
term3"^2.0 content:"term2 term3" title:"term2 term3"^1.5 host:"term2 
term3"^2.0)

For queries with NOT operator it looks like this:

(6)
Query: term1 -term2
Parsed: term1 -term2
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) -(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0)

(7)
Query: term1 term2 -term3
Parsed: term1 term2 -term3
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0) -(url:term3^4.0 
anchor:term3^2.0 content:term3 title:term3^1.5 host:term3^2.0) 
url:"term1 term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 
content:"term1 term2"~2147483647 title:"term1 term2"~2147483647^1.5 
host:"term1 term2"~2147483647^2.0

(8)
Query: "term1 term2" -term3
Parsed: "term1 term2" -term3
Translated: +(url:"term1 term2"^4.0 anchor:"term1 term2"^2.0 
content:"term1 term2" title:"term1 term2"^1.5 host:"term1 term2"^2.0) 
-(url:term3^4.0 anchor:term3^2.0 content:term3 title:term3^1.5 
host:term3^2.0)

(9)
Query: term1 -"term2 term3"
Parsed: term1 -"term2 term3"
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) -(url:"term2 term3"^4.0 anchor:"term2 
term3"^2.0 content:"term2 term3" title:"term2 term3"^1.5 host:"term2 
term3"^2.0)


WHEW ... !!! Are you tired? Well, Lucene is tired of these queries too. 
They are too long! They are absurdly long!

Re: Performance optimization for Nutch index / query

2007-02-23 Thread Andrzej Bialecki

Steve Severance wrote:

> Hi,
>
> I would like to comment if I might. I am not a Nutch/Lucene hacker yet. I
> have been working with it for only a few weeks. However, I am looking at
> extending it significantly to add some new features. Some of these will
> require extending Lucene as well. First, I have a test implementation of
> PageRank that is really an approximation that runs on top of MapReduce.
> Are people interested in having this in the index? I am interested in how

I'm definitely interested - I have a semi-working map-reduce PR 
implementation too, we should probably join forces ;)

> this and other metadata might interact with your super field. For
> instance, I am also looking at using relevance feedback and having that as
> one of the criteria for ranking documents. I was also considering using an
> outside data source, possibly even another Lucene index to store these
> values on a per-document basis.
>
> The other major feature I am thinking about is using distance between
> words and text type. Do you know of anyone who has done this?


I'm not sure I understand what you mean by this ... Could you give an example?

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Performance optimization for Nutch index / query

2007-02-23 Thread Enis Soztutar

Hi,

The proposal about the query filters is quite interesting. I have run a 
simple test comparing the performance of one-field and multi-field queries 
on a 200k+ document index. I put timing logs in DistributedSearch$Client 
as follows:


long start = System.currentTimeMillis();
Hits hits = client.search(query, 10, null, null, false);
System.out.println("time:" + (System.currentTimeMillis() - start));

I have seen a 2-3x improvement for one-field queries:

site:term1  vs  term1

(I have not implemented query filters searching content, title, or host.) 
However, I suppose that the improvement is mostly due to the fairly low 
number of documents returned by the former query, and that most of the 
latency is due to RPC calls. We could measure the real index-searching time 
from NutchBean's main method, but then the index readers are not warmed up, 
so we should run dummy queries first before measuring.
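
For example, a rough sketch of such a measurement, reusing the same 
client.search() call as above (warmupQueries is a hypothetical array of dummy 
Query objects, and 20 warm-up runs is an arbitrary choice):

// Warm up the index readers with dummy queries before timing, so the
// measured time reflects searching rather than cold readers/caches.
for (int i = 0; i < 20; i++) {
  client.search(warmupQueries[i % warmupQueries.length], 10, null, null, false);
}

long start = System.currentTimeMillis();
Hits hits = client.search(query, 10, null, null, false);
System.out.println("time: " + (System.currentTimeMillis() - start) + " ms");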


Andrzej, have you tested the runtimes of these complex queries? If so, 
would you share the results?


> Looking at the code of Lucene's TermQuery and TermScorer it should be
> possible to score the document differently depending on the absolute term
> position (we would just need to retrieve TermPositions instead of TermDocs,
> as TermQuery does now). The only problem now is to determine efficiently
> which sub-field are we in at the moment, in order to apply different boosts.
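
(As a minimal illustration of the quoted idea: iterating TermPositions with 
the classic, pre-4.x Lucene API. The single "super" field name and the reader 
variable are assumptions, not existing Nutch code:)

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

// Sketch: walk the postings of a term *with positions*, which is what a
// position-aware TermQuery/TermScorer would need instead of plain TermDocs.
public class TermPositionsSketch {

  static void dumpPositions(IndexReader reader, String text) throws IOException {
    // the single "super" field name is hypothetical, per the proposal
    TermPositions tp = reader.termPositions(new Term("super", text));
    try {
      while (tp.next()) {
        int doc = tp.doc();
        int freq = tp.freq();
        for (int i = 0; i < freq; i++) {
          int pos = tp.nextPosition();
          // here a scorer could map 'pos' to a sub-field range and apply its boost
          System.out.println("doc=" + doc + " pos=" + pos);
        }
      }
    } finally {
      tp.close();
    }
  }
}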

I see another possible issue with using a single field: tf and idf values 
will change when data from multiple fields is mashed up into one field, and 
this will affect scoring. The Similarity interface of Lucene does not pass 
the field name as an argument to tf() and idf(), so that interface would 
have to change as well. However, single field or not, I think 
Similarity#tf() and Similarity#idf() should always be passed the field 
name, so that implementations of Similarity, such as NutchSimilarity, can 
handle per-field tf/idf computation. Such a thing can be useful, for 
example for the type field, where the idf of a type does not make much 
sense. What do you think, should we propose this to the Lucene community?
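
For instance, a per-field-aware Similarity could look roughly like the sketch 
below. Note that this assumes the proposed interface change (field-name 
arguments to tf()/idf()), which does not exist in Lucene - the signatures are 
purely illustrative:

// Hypothetical sketch only: Lucene's Similarity does NOT take a field name in
// tf()/idf(); this is what the proposed change might let a NutchSimilarity-like
// implementation do.
public class PerFieldSimilaritySketch {

  // Proposed: per-field tf, e.g. count only presence/absence in a "type" field.
  public float tf(String fieldName, float freq) {
    if ("type".equals(fieldName)) {
      return freq > 0 ? 1.0f : 0.0f;      // repetition of a type carries no signal
    }
    return (float) Math.sqrt(freq);        // default-style tf elsewhere
  }

  // Proposed: per-field idf, e.g. make idf neutral for the "type" field.
  public float idf(String fieldName, int docFreq, int numDocs) {
    if ("type".equals(fieldName)) {
      return 1.0f;                         // idf of a type does not make much sense
    }
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
  }
}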




Re: Performance optimization for Nutch index / query

2007-02-23 Thread Doug Cutting

Andrzej Bialecki wrote:
> The degree of simplification is very substantial. Our NutchSuperQuery
> doesn't have to do much more work than a simple TermQuery, so we can
> assume that the cost to run it is the same as TermQuery times some
> constant. What we gain then is the cost of not running all those boolean
> clauses ...


The NutchSuperQuery would have to do more work to apply the boosts, its 
postings would be longer, and the postings would also compress more poorly, 
so while there'd probably be some improvement, it wouldn't be quite as fast 
as a single-term query.


> If you're still with me at this point I must congratulate you. :)
> However, that's as far as I thought it through for now - let the
> discussion start! If you are a Lucene hacker I would gladly welcome your
> review or even code contributions .. ;)


An implementation to consider is payloads.  If each posting has a weight 
attached, then the fieldBoost*fieldNorm could be stored there, and a 
simple gap-based method could be used to inhibit cross-field matches. 
Queries would look similar to your proposed approach.


http://www.gossamer-threads.com/lists/lucene/java-dev/37409
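
To make the indexing side concrete, here is a rough sketch of a TokenFilter 
that attaches a per-field weight as a payload on every token. It uses the 
PayloadAttribute API from later Lucene versions purely as an illustration - 
not the payload API under discussion in the linked thread:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Sketch: attach a constant per-field weight (e.g. fieldBoost * fieldNorm)
// as a one-byte payload on every token of the field being analyzed.
// A payload-aware query could then read this weight back at scoring time.
public final class FieldWeightPayloadFilter extends TokenFilter {

  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final BytesRef payload;

  public FieldWeightPayloadFilter(TokenStream input, float fieldWeight) {
    super(input);
    // encode the weight into a single byte (coarse quantization, 0..255)
    byte encoded = (byte) Math.min(255, Math.round(fieldWeight * 16));
    this.payload = new BytesRef(new byte[] { encoded });
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    payloadAtt.setPayload(payload);
    return true;
  }
}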

One might optimize the payload implementation with run-length 
compression: if a run of postings has the same payload, it could be 
represented once at the start of the run along with the run's length. 
That would keep postings small, reducing I/O.
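
As a toy illustration of that compression idea only (not tied to any actual 
Lucene postings format):

import java.util.ArrayList;
import java.util.List;

// Toy sketch: run-length encode a sequence of per-posting payload values.
// Each run of identical values is stored once as {value, runLength}.
public class PayloadRunLengthSketch {

  static List<int[]> encode(byte[] payloads) {
    List<int[]> runs = new ArrayList<int[]>();
    int i = 0;
    while (i < payloads.length) {
      int j = i;
      while (j < payloads.length && payloads[j] == payloads[i]) {
        j++;
      }
      runs.add(new int[] { payloads[i], j - i });  // {value, run length}
      i = j;
    }
    return runs;
  }
}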


Doug