Paul Elschot wrote:

There is one indexing parameter that might help performance
for BooleanScorer2: the skip interval in Lucene's TermInfosWriter.
The current value is 16, and there was a question about it
on 16 Oct 2005 on java-dev with title "skipInterval".
I don't know how the value of skipInterval was initially determined.
It's possible that a larger value gives somewhat better query
performance in this case.
Changing the skip interval might require reindexing, though.

In Nutch the default is 128. And yes, changing this requires re-creating the index (actually, optimizing it is enough, since that rewrites the .tii file).
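
To make the trade-off behind the interval concrete, here is a minimal sketch of a single-level skip list over a sorted posting list. It is written in Python for brevity and is not Lucene's TermInfosWriter code or file format; the names build_skips and advance, and the "touched" counter, are assumptions made up for this illustration.

```python
def build_skips(doc_ids, skip_interval):
    """Record one skip entry (doc id, position) per skip_interval postings."""
    return [(doc_ids[i], i) for i in range(0, len(doc_ids), skip_interval)]


def advance(doc_ids, skips, target):
    """Find the first posting with doc id >= target.

    Returns (position, touched), where touched counts the skip entries
    and postings that were examined along the way.
    """
    touched = 0
    start = 0
    # Walk the skip entries while they stay below the target.
    for doc_id, pos in skips:
        touched += 1
        if doc_id >= target:
            break
        start = pos
    # Linear scan inside the block that may contain the target.
    pos = start
    while pos < len(doc_ids) and doc_ids[pos] < target:
        pos += 1
        touched += 1
    return pos, touched


if __name__ == "__main__":
    postings = list(range(0, 2000, 2))  # hypothetical posting list, 1000 doc ids
    for interval in (16, 128):
        skips = build_skips(postings, interval)
        print(interval, len(skips), advance(postings, skips, 1501))
```

In this toy model a smaller interval stores and walks more skip entries but leaves a shorter scan inside the final block, while a larger interval does the opposite. Which one wins depends on the posting lists and on how far each advance has to jump, so the effect on a real index would have to be measured.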

I considered a specialised scorer for the earlier query:

+(url:term1^4.0 anchor:term1^2.0 content:term1
  title:term1^1.5  host:term1^2.0)
+(url:term2^4.0 anchor:term2^2.0 content:term2
  title:term2^1.5 host:term2^2.0)
url:"term1 term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0
content:"term1 term2"~2147483647
title:"term1 term2"~2147483647^1.5
host:"term1 term2"~2147483647^2.0
[...]

Thank you for the detailed analysis. Currently we are pursuing a totally different approach: limiting the size of the index by carefully selecting the most promising postings, and re-sorting the posting lists by a "pagerank"-like value so that we can skip postings that come from less significant docs. Please see the nutch-dev discussion for more details.
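
As a rough illustration of why ordering postings by significance allows skipping the tail, here is a small Python sketch. It is only one possible reading of the idea, not the design from the nutch-dev discussion: it assumes documents are renumbered so that a lower doc id means a higher static rank, and the names renumber_by_rank and first_k_matches are hypothetical.

```python
def renumber_by_rank(rank):
    """Map old doc ids to new ids so that new id 0 is the highest-ranked doc.

    rank is a dict: old doc id -> static "pagerank"-like value.
    """
    order = sorted(rank, key=lambda doc: (-rank[doc], doc))
    return {old: new for new, old in enumerate(order)}


def first_k_matches(list_a, list_b, k):
    """Intersect two posting lists sorted by (renumbered) doc id.

    Because lower ids are more significant, the scan can stop as soon as
    k common docs are found; the remaining postings are never read.
    """
    hits = []
    i = j = 0
    while i < len(list_a) and j < len(list_b) and len(hits) < k:
        if list_a[i] == list_b[j]:
            hits.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return hits


if __name__ == "__main__":
    rank = {10: 0.1, 11: 0.9, 12: 0.5, 13: 0.7}  # hypothetical static ranks
    new_id = renumber_by_rank(rank)
    term1 = sorted(new_id[d] for d in (10, 11, 12, 13))
    term2 = sorted(new_id[d] for d in (10, 12, 13))
    print(first_k_matches(term1, term2, 2))
```

The sketch ignores query-dependent scoring entirely, so stopping early trades some ranking accuracy for speed.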

Oh, BTW: I just found the DisjunctionMaxQuery class, which seems to have been added recently. Do you think this query structure could benefit from using it instead of BooleanQuery?
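
For reference, the scoring difference between the two can be sketched as follows. This is a simplified Python model, not Lucene code: it ignores coord and query normalization, the per-field scores are made-up numbers, and the function names are assumptions. The dis-max formula (best clause plus a tie-breaker fraction of the other clauses) follows the documented behaviour of DisjunctionMaxQuery.

```python
def boolean_or_score(field_scores):
    """Plain disjunction: the scores of all matching clauses are summed."""
    return sum(field_scores)


def dis_max_score(field_scores, tie_breaker=0.0):
    """Dis-max: the best clause score plus tie_breaker times the others."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)


if __name__ == "__main__":
    # Hypothetical per-field scores for term1 in url, anchor and content.
    scores = [4.0, 2.0, 1.0]
    print(boolean_or_score(scores))
    print(dis_max_score(scores))
    print(dis_max_score(scores, tie_breaker=0.1))
```

With a tie-breaker of 0 a document is scored by its single best field for each term, so a term repeated across many fields no longer adds up the way it does in the summed disjunction.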

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


